CS5608 Big Data Analytics
Coursework for 2019/20
04/04/2020
The dataset contains about 18,000 job descriptions, of which roughly 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset will be used to create a machine learning classification model that can learn which job descriptions are fraudulent.
The following table describes all the variables in the dataset.
Variable Name       | Variable Description
job_id              | Unique job ID.
title               | The title of the job ad entry.
location            | Geographical location of the job ad.
department          | Corporate department (e.g. sales).
salary_range        | Indicative salary range (e.g. $50,000-$60,000).
company_profile     | A brief company description.
description         | The detailed description of the job ad.
requirements        | Listed requirements for the job opening.
benefits            | Benefits offered by the employer.
telecommuting       | True for telecommuting positions.
has_company_logo    | True if a company logo is present.
has_questions       | True if screening questions are present.
employment_type     | Full-time, Part-time, Contract, etc.
required_experience | Executive, Entry level, Intern, etc.
required_education  | Doctorate, Master's Degree, Bachelor, etc.
industry            | Automotive, IT, Health care, Real estate, etc.
function            | Consulting, Engineering, Research, Sales, etc.
fraudulent          | Target: classification attribute.
The data was loaded into the R workspace in such a way that empty strings were treated as NA.
A second dataset, containing about 19,000 job postings published through the Armenian human resource portal CareerCenter, was also loaded.
The common columns in the two datasets include title, location, description, requirements and required experience.
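A minimal sketch of this loading step, mirroring the code in the appendix (the file names are those used there):

library(tm)  # not needed yet; base read.csv is sufficient here

# Load the fake job postings; empty strings become NA
job <- read.csv("fake_job_postings.csv",
                na.strings = c("", "NA"))

# Load the CareerCenter (Armenian HR portal) postings
data <- read.csv("data job posts.csv")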
The following table shows the total number of instances of each fraudulent class.
fraudulent | count
0          | 17014
1          | 866
Therefore, we have 17014 instances of non-fake jobs and 866 instances of fake jobs.
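These counts come from a simple dplyr summary, essentially the one used in the appendix (shown here without the kable() table formatting):

library(dplyr)

# Number of postings per target level (0 = genuine, 1 = fraudulent)
job %>%
  group_by(fraudulent) %>%
  summarise(count = n())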
The following table shows the number of missing values in each of the variables described in the table above for both of the target levels.
variable            | missing (fraudulent = 0) | missing (fraudulent = 1)
job_id              | 0     | 0
title               | 0     | 0
location            | 327   | 19
department          | 11016 | 531
salary_range        | 14369 | 643
company_profile     | 2721  | 587
description         | 0     | 0
requirements        | 2541  | 153
benefits            | 6845  | 363
telecommuting       | 0     | 0
has_company_logo    | 0     | 0
has_questions       | 0     | 0
employment_type     | 3230  | 241
required_experience | 6615  | 435
required_education  | 7654  | 451
industry            | 4628  | 275
function            | 6118  | 337
We can see that we are missing a lot of values in both fraudulent classes, and therefore it would not be a good idea to drop all the incomplete cases from the dataset. However, since so much data is missing across many characteristics of the dataset, we will use only the description, telecommuting, has_company_logo and has_questions variables to build the machine learning prediction model for predicting whether a job posting is fake, as sketched below.
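A sketch of the corresponding variable selection, as in the appendix, keeping only the four retained columns plus the target and encoding the binary variables as factors:

library(dplyr)

job <- job %>%
  select(description, telecommuting, has_company_logo,
         has_questions, fraudulent) %>%
  mutate(telecommuting    = as.factor(telecommuting),
         has_company_logo = as.factor(has_company_logo),
         has_questions    = as.factor(has_questions),
         fraudulent       = as.factor(fraudulent))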
The following plot shows the distribution of the fraudulent class by whether the job is a telecommuting position.
The following plot shows the distribution of the fraudulent class by whether a company logo is present.
The following plot shows the distribution of the fraudulent class by whether screening questions are present.
From the three plots above, we can see that it is almost impossible to separate the two classes, i.e. whether the job is fraudulent or not, based on the three features plotted above, namely telecommuting, has_company_logo and has_questions. Therefore, we will build the machine learning classification model using only the description of the job.
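Each of the plots above is a grouped bar chart of the form sketched below, a slightly simplified version of the appendix code (shown here for telecommuting; the other two features are plotted analogously):

library(ggplot2)

# Counts of genuine (0) vs fraudulent (1) postings for each telecommuting level
ggplot(job, aes(telecommuting)) +
  geom_bar(aes(fill = fraudulent), position = "dodge")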
The following steps were performed in order to build a machine learning classification model using the text in the description field of the job postings.
- Create a text corpus from the descriptions and pre-process it (remove punctuation, numbers and English stop words, convert to lower case, stem the words and strip extra whitespace):
corpus <- VCorpus(VectorSource(job$description))                   # one document per job description
corpus <- tm_map(corpus, content_transformer(removePunctuation))   # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                            # remove numbers
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lower case
corpus <- tm_map(corpus, content_transformer(removeWords),
                 stopwords("english"))                             # remove English stop words
corpus <- tm_map(corpus, stemDocument)                             # stem each word
corpus <- tm_map(corpus, stripWhitespace)                          # collapse extra whitespace
- Create a Document Term Matrix (DTM) from the pre-processed corpus and remove sparse terms, as sketched below.
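A minimal sketch of the DTM step, condensed from the appendix, where terms absent from more than 90% of the documents are dropped:

library(tm)

dtm <- DocumentTermMatrix(corpus)     # one row per job description, one column per term
dtm <- removeSparseTerms(dtm, 0.9)    # keep terms that appear in at least 10% of documents
description <- as.data.frame(as.matrix(dtm))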
Principal Component Analysis (PCA) was then performed on the DTM in order to reduce the dimensionality of the dataset. The following scree plot shows the percentage of variance explained by each principal component.
From the scree plot above, the first 10 principal components were selected to build the machine learning classification model, as sketched below.
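A sketch of the PCA step, as in the appendix; fviz_eig() from the factoextra package draws the scree plot:

library(factoextra)

res.pca <- prcomp(description, scale. = TRUE)   # PCA on the scaled DTM columns
fviz_eig(res.pca, ncp = 15)                     # scree plot of the first 15 components

# Keep the first 10 principal component scores and re-attach the target
description <- as.data.frame(res.pca$x[, 1:10])
description$fraudulent <- job$fraudulent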
The data was then split into training (80%) and testing (20%) datasets, as shown below.
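A sketch of the 80/20 split, mirroring the appendix, with a fixed seed for reproducibility:

set.seed(123)
n_train    <- floor(0.8 * nrow(description))               # number of training rows
trainIndex <- sample(seq_len(nrow(description)), size = n_train)

train <- description[ trainIndex, ]
test  <- description[-trainIndex, ]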
The Bayesian Generalized Linear Model from the caret package in R was used to build a predictive model that, once trained on the training dataset, can predict whether a job posting from the test dataset is fraudulent.
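A sketch of the model fitting and evaluation, as in the appendix; method = "bayesglm" in caret::train() fits a Bayesian GLM and relies on the arm package:

library(caret)

model <- train(fraudulent ~ ., data = train, method = "bayesglm")

# In-sample and out-of-sample performance
confusionMatrix(predict(model, newdata = train), train$fraudulent)
confusionMatrix(predict(model, newdata = test),  test$fraudulent)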
The following shows the performance of the selected Bayesian Generalized Linear classification model on the training (in-sample) and testing (out-of-sample) datasets, using only the description of the job.
Training dataset:
Confusion Matrix and Statistics
           Reference
Prediction     0     1
         0 13618   674
         1     6     6
Accuracy : 0.9525
95% CI : (0.9488, 0.9559)
No Information Rate : 0.9525
P-Value [Acc > NIR] : 0.5102
Kappa : 0.0157
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.999560
Specificity : 0.008824
Pos Pred Value : 0.952841
Neg Pred Value : 0.500000
Prevalence : 0.952461
Detection Rate : 0.952041
Detection Prevalence : 0.999161
Balanced Accuracy : 0.504192
'Positive' Class : 0
Testing dataset:
Confusion Matrix and Statistics
           Reference
Prediction    0    1
         0 3386  184
         1    4    2
Accuracy : 0.9474
95% CI : (0.9396, 0.9545)
No Information Rate : 0.948
P-Value [Acc > NIR] : 0.5789
Kappa : 0.0176
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.99882
Specificity : 0.01075
Pos Pred Value : 0.94846
Neg Pred Value : 0.33333
Prevalence : 0.94799
Detection Rate : 0.94687
Detection Prevalence : 0.99832
Balanced Accuracy : 0.50479
'Positive' Class : 0
In this report, we built a Bayesian Generalized Linear classification model for predicting whether a job posting is fraudulent. In the initial analysis, we found that many characteristics of the job listings were missing in both classes, i.e. regardless of whether the listing is fake or not. We then plotted the distributions of the remaining complete characteristics and found that, by visual inspection and ignoring any possible interactions between their levels, the telecommuting, has_company_logo and has_questions variables were unable to separate the two classes. We therefore used only the description of the job listings to develop the machine learning classification model. This was done by first building a sparse Document Term Matrix (DTM) from the descriptions after removing all punctuation, numbers, stop words and extra whitespace and converting the text to lower case. We then performed PCA on the resulting DTM and kept only the first 10 principal components for building the model. The data was divided into a training (80%) and a testing (20%) dataset, and the Bayesian Generalized Linear classification model was trained on the training dataset and then evaluated on both the training and the testing datasets.

From the evaluation of the model we obtained training and testing accuracies of 95.25% and 94.74% respectively, which appear very good. However, when we look at the training and testing confusion matrices, we realise that the model predicts most instances as not fake and is therefore unable to detect the fake cases efficiently. A major cause of this problem is the imbalance between the class instances in the job listing dataset. A further issue is the loss of explained variance caused by keeping only a small number of principal components, a trade-off made to reduce the processing power and processing time required to build the classification model.
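As a possible follow-up (not part of the analysis above), the class imbalance could be reduced before fitting, for example by up-sampling the minority class with upSample() from the caret package; the sketch below assumes the train data frame built earlier:

library(caret)

set.seed(123)
# Up-sample the fraudulent class so that both classes are equally represented
train_bal <- upSample(x = train[, setdiff(names(train), "fraudulent")],
                      y = train$fraudulent,
                      yname = "fraudulent")

# Re-fit the same Bayesian GLM on the balanced training set
model_bal <- train(fraudulent ~ ., data = train_bal, method = "bayesglm")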
Appendix
if(!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
if(!require(kableExtra)) {
install.packages("kableExtra")
library(kableExtra)
}
if(!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
if(!require(tm)) {
install.packages("tm")
library(tm)
}
if(!require(SnowballC)) {
install.packages("SnowballC")
library(SnowballC)
}
if(!require(caret)) {
install.packages("caret")
library(caret)
}
if(!require(factoextra)) {
install.packages("factoextra")
library(factoextra)
}
# Data preparation and cleaning
job <-read.csv(
"fake_job_postings.csv",
na.strings=c("","NA")
)
data <-read.csv("data job posts.csv")
# Combine the columns the two datasets have in common
data <-data.frame(
"id" =c(job$job_id, data$jobpost),
"title" =c(job$title, data$Title),
"location" =c(job$location, data$Location),
"salary" =c(job$salary_range, data$Salary),
"company" =c(job$company_profile, data$Company),
"description" =c(job$description, data$JobDescription),
"requirements" =c(job$requirements, data$RequiredQual)
)
kable(job %>%
group_by(fraudulent) %>%
summarise(
count =n()
))
kable(t(job %>%
group_by(fraudulent) %>%
summarise_all(~sum(is.na(.)))))
job <-job %>%
select(
description,
telecommuting,
has_company_logo,
has_questions,
fraudulent
) %>%
mutate(
telecommuting =as.factor(telecommuting),
has_company_logo =as.factor(has_company_logo),
has_questions =as.factor(has_questions),
fraudulent =as.factor(fraudulent)
)
# Exploratory data analysis
job %>%
ggplot(aes(telecommuting, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job %>%
ggplot(aes(has_company_logo, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job %>%
ggplot(aes(has_questions, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job <-job %>%
select(
description,
fraudulent
)
# Machine learning prediction
corpus <-VCorpus(VectorSource(job$description))
corpus <-tm_map(corpus,
content_transformer(removePunctuation))
corpus <-tm_map(corpus, removeNumbers)
corpus <-tm_map(corpus, content_transformer(tolower))
corpus <-tm_map(corpus,
content_transformer(removeWords),
stopwords("english"))
corpus <-tm_map(corpus, stemDocument)
corpus <-tm_map(corpus, stripWhitespace)
dtm <-DocumentTermMatrix(corpus)
dtm <-removeSparseTerms(dtm, 0.9)
dtm.matrix <-as.matrix(dtm)
description <-as.data.frame(dtm.matrix)
rm(dtm, dtm.matrix, corpus)
res.pca <-prcomp(description,
scale = T)
fviz_eig(res.pca,
ncp =15)
description <-as.data.frame(res.pca$x[,1:10])
description$fraudulent <-job$fraudulent
set.seed(123)
smp <-floor(0.8*nrow(description))
trainIndex <-sample(seq_len(nrow(description)),
size = smp)
train <-description[ trainIndex,]
test <-description[-trainIndex,]
rm(description, smp, trainIndex, res.pca, job)
# Performance evaluation
# method = "bayesglm" requires the arm package (caret will offer to install it)
model <-train(fraudulent ~.,
data = train,
method ="bayesglm")
## Training Accuracy
prediction <-predict(
model,
newdata = train,
type ="raw"
)
reference <-train$fraudulent
confusionMatrix(prediction, reference)
## Testing Accuracy
prediction <-predict(
model,
newdata = test,
type ="raw"
)
reference <-test$fraudulent
confusionMatrix(prediction, reference)
R code file attached.
Attachment: R_code.zip