CS5608 Big Data Analytics
Coursework for 2019/20
04/04/2020
The dataset contains about 18,000 job descriptions, of which roughly 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset will be used to create a machine learning classification model that can learn which job descriptions are fraudulent.
The following table describes all the variables in the dataset.
Variable Name       | Variable Description
job_id              | Unique job ID.
title               | The title of the job ad entry.
location            | Geographical location of the job ad.
department          | Corporate department (e.g. sales).
salary_range        | Indicative salary range (e.g. $50,000-$60,000).
company_profile     | A brief company description.
description         | The detailed description of the job ad.
requirements        | Listed requirements for the job opening.
benefits            | Benefits offered by the employer.
telecommuting       | True for telecommuting positions.
has_company_logo    | True if a company logo is present.
has_questions       | True if screening questions are present.
employment_type     | Full-time, Part-time, Contract, etc.
required_experience | Executive, Entry level, Intern, etc.
required_education  | Doctorate, Master's Degree, Bachelor, etc.
industry            | Automotive, IT, Health care, Real estate, etc.
function            | Consulting, Engineering, Research, Sales, etc.
fraudulent          | Target: classification attribute.
The data was loaded into the R workspace in such a way that empty strings were treated as NA.
A second dataset, containing about 19,000 job postings published through the Armenian human resource portal CareerCenter, was also loaded.
The common columns in the two datasets include title, location, description, requirements and required experience.
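A minimal sketch of this loading step, mirroring the code in the appendix (the file names are those used there):

library(tm)  # not needed yet; base read.csv is sufficient here

# Load the fake job postings; empty strings become NA
job <- read.csv("fake_job_postings.csv",
                na.strings = c("", "NA"))

# Load the CareerCenter (Armenian HR portal) postings
data <- read.csv("data job posts.csv")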
The following table shows the total number of instances of each fraudulent class.
fraudulent | count
0          | 17014
1          | 866
Therefore, we have 17014 instances of non-fake jobs and 866 instances of fake jobs.
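These counts come from a simple dplyr summary, essentially the one used in the appendix (shown here without the kable() table formatting):

library(dplyr)

# Number of postings per target level (0 = genuine, 1 = fraudulent)
job %>%
  group_by(fraudulent) %>%
  summarise(count = n())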
The following table shows the number of missing values in each of the variables described in the table above for both of the target levels.
variable            | missing (fraudulent = 0) | missing (fraudulent = 1)
job_id              | 0     | 0
title               | 0     | 0
location            | 327   | 19
department          | 11016 | 531
salary_range        | 14369 | 643
company_profile     | 2721  | 587
description         | 0     | 0
requirements        | 2541  | 153
benefits            | 6845  | 363
telecommuting       | 0     | 0
has_company_logo    | 0     | 0
has_questions       | 0     | 0
employment_type     | 3230  | 241
required_experience | 6615  | 435
required_education  | 7654  | 451
industry            | 4628  | 275
function            | 6118  | 337
We can see that we are missing a lot of values in both fraudulent classes, and therefore it would not be a good idea to drop all the incomplete cases from the dataset. However, since so much data is missing across many characteristics of the dataset, we will use only the description, telecommuting, has_company_logo and has_questions variables to build the machine learning prediction model for predicting whether a job posting is fake, as sketched below.
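A sketch of the corresponding variable selection, as in the appendix, keeping only the four retained columns plus the target and encoding the binary variables as factors:

library(dplyr)

job <- job %>%
  select(description, telecommuting, has_company_logo,
         has_questions, fraudulent) %>%
  mutate(telecommuting    = as.factor(telecommuting),
         has_company_logo = as.factor(has_company_logo),
         has_questions    = as.factor(has_questions),
         fraudulent       = as.factor(fraudulent))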
The following plot shows the distribution of the fraudulent class by whether the job is a telecommuting position.
The following plot shows the distribution of the fraudulent class by whether a company logo is present.
The following plot shows the distribution of the fraudulent class by whether screening questions are present.
From the three plots above, we can see that it is almost impossible to separate the two classes, i.e. whether the job is fraudulent or not, based on the three features plotted above, namely telecommuting, has_company_logo and has_questions. Therefore, we will build the machine learning classification model using only the description of the job.
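Each of the plots above is a grouped bar chart of the form sketched below, a slightly simplified version of the appendix code (shown here for telecommuting; the other two features are plotted analogously):

library(ggplot2)

# Counts of genuine (0) vs fraudulent (1) postings for each telecommuting level
ggplot(job, aes(telecommuting)) +
  geom_bar(aes(fill = fraudulent), position = "dodge")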
The following steps were performed in order to build a machine learning classification model using the text in the description field of the job postings.
- Create a text corpus from the descriptions and pre-process it (remove punctuation, numbers and English stop words, convert to lower case, stem the words and strip extra whitespace):
corpus <- VCorpus(VectorSource(job$description))                   # one document per job description
corpus <- tm_map(corpus, content_transformer(removePunctuation))   # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                            # remove numbers
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lower case
corpus <- tm_map(corpus, content_transformer(removeWords),
                 stopwords("english"))                             # remove English stop words
corpus <- tm_map(corpus, stemDocument)                             # stem each word
corpus <- tm_map(corpus, stripWhitespace)                          # collapse extra whitespace
- Create a Document Term Matrix (DTM) from the pre-processed corpus and remove sparse terms, as sketched below.
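A minimal sketch of the DTM step, condensed from the appendix, where terms absent from more than 90% of the documents are dropped:

library(tm)

dtm <- DocumentTermMatrix(corpus)     # one row per job description, one column per term
dtm <- removeSparseTerms(dtm, 0.9)    # keep terms that appear in at least 10% of documents
description <- as.data.frame(as.matrix(dtm))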
Principal Component Analysis (PCA) was then performed on the DTM in order to reduce the dimensionality of the dataset. The following scree plot shows the percentage of variance explained by each principal component.
From the scree plot above, the first 10 principal components were selected to build the machine learning classification model, as sketched below.
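A sketch of the PCA step, as in the appendix; fviz_eig() from the factoextra package draws the scree plot:

library(factoextra)

res.pca <- prcomp(description, scale. = TRUE)   # PCA on the scaled DTM columns
fviz_eig(res.pca, ncp = 15)                     # scree plot of the first 15 components

# Keep the first 10 principal component scores and re-attach the target
description <- as.data.frame(res.pca$x[, 1:10])
description$fraudulent <- job$fraudulent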
The data was then split into training (80%) and testing (20%) datasets, as shown below.
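A sketch of the 80/20 split, mirroring the appendix, with a fixed seed for reproducibility:

set.seed(123)
n_train    <- floor(0.8 * nrow(description))               # number of training rows
trainIndex <- sample(seq_len(nrow(description)), size = n_train)

train <- description[ trainIndex, ]
test  <- description[-trainIndex, ]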
The Bayesian Generalized Linear Model from the caret package in R was used to build a predictive model that, once trained on the training dataset, can predict whether a job posting from the test dataset is fraudulent.
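A sketch of the model fitting and evaluation, as in the appendix; method = "bayesglm" in caret::train() fits a Bayesian GLM and relies on the arm package:

library(caret)

model <- train(fraudulent ~ ., data = train, method = "bayesglm")

# In-sample and out-of-sample performance
confusionMatrix(predict(model, newdata = train), train$fraudulent)
confusionMatrix(predict(model, newdata = test),  test$fraudulent)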
The following shows the performance of the selected Bayesian Generalized Linear classification model on the training (in-sample) and testing (out-of-sample) datasets, using only the description of the job.
Training dataset:
Confusion Matrix and Statistics
           Reference
Prediction     0     1
         0 13618   674
         1     6     6
Accuracy : 0.9525
95% CI : (0.9488, 0.9559)
No Information Rate : 0.9525
P-Value [Acc > NIR] : 0.5102
Kappa : 0.0157
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.999560
Specificity : 0.008824
Pos Pred Value : 0.952841
Neg Pred Value : 0.500000
Prevalence : 0.952461
Detection Rate : 0.952041
Detection Prevalence : 0.999161
Balanced Accuracy : 0.504192
'Positive' Class : 0
Testing dataset:
Confusion Matrix and Statistics
           Reference
Prediction    0    1
         0 3386  184
         1    4    2
Accuracy : 0.9474
95% CI : (0.9396, 0.9545)
No Information Rate : 0.948
P-Value [Acc > NIR] : 0.5789
Kappa : 0.0176
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.99882
Specificity : 0.01075
Pos Pred Value : 0.94846
Neg Pred Value : 0.33333
Prevalence : 0.94799
Detection Rate : 0.94687
Detection Prevalence : 0.99832
Balanced Accuracy : 0.50479
'Positive' Class : 0
In this report, we built a Bayesian Generalized Linear classification model for predicting whether a job posting is fraudulent. In the initial analysis, we found that many characteristics of the job listings were missing in both classes, i.e. regardless of whether the listing is fake or not. We then plotted the distributions of the remaining complete characteristics and found that, by visual inspection and ignoring any possible interactions between their levels, the telecommuting, has_company_logo and has_questions variables were unable to separate the two classes. We therefore used only the description of the job listings to develop the machine learning classification model. This was done by first building a sparse Document Term Matrix (DTM) from the descriptions after removing all punctuation, numbers, stop words and extra whitespace and converting the text to lower case. We then performed PCA on the resulting DTM and kept only the first 10 principal components for building the model. The data was divided into a training (80%) and a testing (20%) dataset, and the Bayesian Generalized Linear classification model was trained on the training dataset and then evaluated on both the training and the testing datasets.

From the evaluation of the model we obtained training and testing accuracies of 95.25% and 94.74% respectively, which appear very good. However, when we look at the training and testing confusion matrices, we realise that the model predicts most instances as not fake and is therefore unable to detect the fake cases efficiently. A major cause of this problem is the imbalance between the class instances in the job listing dataset. A further issue is the loss of explained variance caused by keeping only a small number of principal components, a trade-off made to reduce the processing power and processing time required to build the classification model.
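As a possible follow-up (not part of the analysis above), the class imbalance could be reduced before fitting, for example by up-sampling the minority class with upSample() from the caret package; the sketch below assumes the train data frame built earlier:

library(caret)

set.seed(123)
# Up-sample the fraudulent class so that both classes are equally represented
train_bal <- upSample(x = train[, setdiff(names(train), "fraudulent")],
                      y = train$fraudulent,
                      yname = "fraudulent")

# Re-fit the same Bayesian GLM on the balanced training set
model_bal <- train(fraudulent ~ ., data = train_bal, method = "bayesglm")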
Appendix
if(!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
if(!require(kableExtra)) {
install.packages("kableExtra")
library(kableExtra)
}
if(!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
if(!require(tm)) {
install.packages("tm")
library(tm)
}
if(!require(SnowballC)) {
install.packages("SnowballC")
library(SnowballC)
}
if(!require(caret)) {
install.packages("caret")
library(caret)
}
if(!require(factoextra)) {
install.packages("factoextra")
library(factoextra)
}
# Data preparation and cleaning
job <-read.csv(
"fake_job_postings.csv",
na.strings=c("","NA")
)
data <-read.csv("data job posts.csv")
# Combine the columns the two datasets have in common
data <-data.frame(
"id" =c(job$job_id, data$jobpost),
"title" =c(job$title, data$Title),
"location" =c(job$location, data$Location),
"salary" =c(job$salary_range, data$Salary),
"company" =c(job$company_profile, data$Company),
"description" =c(job$description, data$JobDescription),
"requirements" =c(job$requirements, data$RequiredQual)
)
kable(job %>%
group_by(fraudulent) %>%
summarise(
count =n()
))
kable(t(job %>%
group_by(fraudulent) %>%
summarise_all(~sum(is.na(.)))))
job <-job %>%
select(
description,
telecommuting,
has_company_logo,
has_questions,
fraudulent
) %>%
mutate(
telecommuting =as.factor(telecommuting),
has_company_logo =as.factor(has_company_logo),
has_questions =as.factor(has_questions),
fraudulent =as.factor(fraudulent)
)
# Exploratory data analysis
job %>%
ggplot(aes(telecommuting, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job %>%
ggplot(aes(has_company_logo, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job %>%
ggplot(aes(has_questions, ..count..)) +
geom_bar(aes(fill=fraudulent),
position ="dodge")
job <-job %>%
select(
description,
fraudulent
)
# Machine learning prediction
corpus <-VCorpus(VectorSource(job$description))
corpus <-tm_map(corpus,
content_transformer(removePunctuation))
corpus <-tm_map(corpus, removeNumbers)
corpus <-tm_map(corpus, content_transformer(tolower))
corpus <-tm_map(corpus,
content_transformer(removeWords),
stopwords("english"))
corpus <-tm_map(corpus, stemDocument)
corpus <-tm_map(corpus, stripWhitespace)
dtm <-DocumentTermMatrix(corpus)
dtm <-removeSparseTerms(dtm, 0.9)
dtm.matrix <-as.matrix(dtm)
description <-as.data.frame(dtm.matrix)
rm(dtm, dtm.matrix, corpus)
res.pca <-prcomp(description,
scale = T)
fviz_eig(res.pca,
ncp =15)
description <-as.data.frame(res.pca$x[,1:10])
description$fraudulent <-job$fraudulent
set.seed(123)
smp <-floor(0.8*nrow(description))
trainIndex <-sample(seq_len(nrow(description)),
size = smp)
train <-description[ trainIndex,]
test <-description[-trainIndex,]
rm(description, smp, trainIndex, res.pca, job)
# Performance evaluation
# method = "bayesglm" requires the arm package (caret will offer to install it)
model <-train(fraudulent ~.,
data = train,
method ="bayesglm")
## Training Accuracy
prediction <-predict(
model,
newdata = train,
type ="raw"
)
reference <-train$fraudulent
confusionMatrix(prediction, reference)
## Testing Accuracy
prediction <-predict(
model,
newdata = test,
type ="raw"
)
reference <-test$fraudulent
confusionMatrix(prediction, reference)
R code file attached.
Attachment: R_code.zip