Assignment: Decision Tree and Naïve Bayes
Build Decision Tree Model
Packages required: Install and load the C50, caret, and rminer packages.
Data: The data are taken from Shmueli et al. (2010). The data set consists of 2201 airplane flights in January 2004 from the Washington DC area into the NYC area. The characteristic of interest (the response) is whether or not a flight has been delayed by more than 15 min.
The explanatory variables include three different arrival airports (Kennedy, Newark, and LaGuardia); three different departure airports (Reagan, Dulles, and Baltimore); eight carriers; a categorical variable for 16 different hours of departure (6 am to 10 pm); weather condition (0 = good and 1 = bad); and day of week (1 = Monday, 2 = Tuesday, 3 = Wednesday, ..., 6 = Saturday, and 7 = Sunday). The objective is to identify flights that are likely to be delayed.
Tasks:
1) Import and explore data
a. Open FlightDelay.csv and store the result in a data frame (e.g., datFlight). Import all character values as factors. Convert numeric-coded categorical variables, such as weather condition, day of week, and day of month, to factors.
b. Use the str() and summary() functions to list the imported columns and their basic statistics. Verify that the data types were imported as expected.
2) Prepare data for classification
a. Using a seed of 100, randomly select 60% of the rows into training (e.g. called traindata). Divide the other 40% of the rows evenly into two holdout test/validation sets (e.g., called testdata1 and testdata2).
b. Inspect (show) the distributions of the target variable in the subsets. They should preserve the distribution of the target variable in the whole data set.
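The split described in task 2 can be illustrated with a small stratified-sampling sketch. The assignment itself is meant to be done in R (for example with caret's createDataPartition), so the Python below only shows the logic under stated assumptions: the function name stratified_split, the class labels, and the toy 80/20 class mix are all hypothetical and are not taken from FlightDelay.csv.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions, seed):
    """Split row indices into len(fractions) parts, sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    parts = [[] for _ in fractions]
    for label in sorted(by_class):
        rows = by_class[label]
        rng.shuffle(rows)
        start = 0
        for k, frac in enumerate(fractions):
            # The last part takes the remainder so every row is used exactly once.
            is_last = k == len(fractions) - 1
            end = len(rows) if is_last else start + round(frac * len(rows))
            parts[k].extend(rows[start:end])
            start = end
    return parts

def class_share(labels, rows):
    """Proportion of each class among the given row indices."""
    counts = Counter(labels[i] for i in rows)
    return {cls: counts[cls] / len(rows) for cls in sorted(counts)}

# Toy labels (hypothetical, not the real data): 800 ontime and 200 delayed rows.
labels = ["ontime"] * 800 + ["delayed"] * 200
train, test1, test2 = stratified_split(labels, [0.6, 0.2, 0.2], seed=100)

subsets = [("all", list(range(len(labels)))), ("train", train),
           ("test1", test1), ("test2", test2)]
for name, rows in subsets:
    print(name, len(rows), class_share(labels, rows))
```

Because rows are sampled within each class, every subset keeps the class proportions of the full data, which is the property task 2b asks you to inspect.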
3) C5.0 decision tree classifiers
a. Build/train a tree model
i. Build the tree using the C5.0 function in the C50 package with default settings.
ii. Show the (textual) model/tree.
iii. How many leaves are in the tree? (Note: In C50, the size of tree is the number of leaves. In J48, the size of the tree is the number of nodes, and J48 also provides the number of leaves.)
iv. What is the predictor that first splits the tree?
b. Find rules (paths) in the tree
i. Find one path in the tree to a leaf node that is classified as ontime. Starting with the condition on the first (or top) branch of the path, write down the conditions on the tree branches belonging to this path. Enclose each condition in a pair of parentheses and precede the rule with "If", e.g.
If (house <= 600469), ..., and (income < 57578), then STAY
ii. How many conditions and how many unique predictors are in your selected rule?
iii. What is this rule's misclassification error rate (e.g., 20/50 misclassified)?
iv. Similarly, describe a rule that classifies an instance as delay.
v. What is this rule's misclassification error?
vi. Find a rule for ontime that is shorter or longer (has fewer or more conditions) than the previous rules. Repeat this for delay. Show these two rules and their misclassification errors.
vii. What are the reasons that long rules are included in a decision tree model?
viii. What is the disadvantage of a long rule?
c. Apply and evaluate the trained model to two hold-out testing sets, one set at a time. The process for each data set includes:
i. Generate predictions (i.e. estimations) of the values of the target variable for the testing instances.
ii. Generate a confusion matrix that shows the counts of true-positive, true-negative, false-positive and false-negative predictions for both testdata1 and testdata2. Consider Ontime as the positive class.
iii. Generate seven performance metrics: accuracy (percent of all testing instances that are classified correctly), plus precision (percent of instances predicted to belong to a class that actually belong to it), recall (also called the true positive rate) and F-measure (also called F-score) for Ontime and for Delay respectively. (Note: References to performance metrics in the rest of the assignment refer to these seven metrics or to a set of metrics that includes them.)
iv. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.
4) C50 pruning
a. Build another C50 tree using the training set, changing the confidence factor to 0.05 (i.e., CF = 0.05 in C5.0Control, passed through the control argument of the C5.0 function).
b. Describe the size of the tree built.
c. Generate predictions, confusion matrices and performance metrics using two test sets.
d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.
e. Would you adopt this pruning setting? Why or why not?
5) Returning to the default pruning setting, build another C50 tree with only two predictors of your choice.
a. Build a tree using the predictors of your choice in the train set.
b. Describe the size of the tree built.
c. Generate predictions, confusion matrices and performance metrics using two test sets.
d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets?
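The seven metrics and the "more than 10%" comparison used throughout the decision tree tasks follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic in Python for illustration only (the assignment expects R). The function names, the confusion counts for the two test sets, and the reading of "10%" as 10 percentage points are assumptions made for this example.

```python
def class_metrics(hits, false_alarms, misses):
    """Precision, recall and F1 for one class, from that class's own counts."""
    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def seven_metrics(tp, fn, fp, tn):
    """Ontime is the positive class.

    tp: actual Ontime predicted Ontime    fn: actual Ontime predicted Delay
    fp: actual Delay  predicted Ontime    tn: actual Delay  predicted Delay
    """
    total = tp + fn + fp + tn
    p_on, r_on, f_on = class_metrics(tp, fp, fn)
    # For the Delay class the roles swap: tn are its hits,
    # fn are its false alarms and fp are its misses.
    p_de, r_de, f_de = class_metrics(tn, fn, fp)
    return {
        "Accuracy": (tp + tn) / total,
        "Precision_Ontime": p_on, "Recall_Ontime": r_on, "F1_Ontime": f_on,
        "Precision_Delay": p_de, "Recall_Delay": r_de, "F1_Delay": f_de,
    }

def large_gaps(metrics_a, metrics_b, threshold=0.10):
    """Metrics whose absolute difference between two test sets exceeds threshold."""
    gaps = {}
    for name in metrics_a:
        diff = abs(metrics_a[name] - metrics_b[name])
        if diff > threshold:
            gaps[name] = diff
    return gaps

# Hypothetical confusion counts for two test sets (not real results).
m1 = seven_metrics(tp=150, fn=10, fp=30, tn=10)
m2 = seven_metrics(tp=140, fn=20, fp=20, tn=20)
gaps = large_gaps(m1, m2)
for name in sorted(gaps):
    print(f"{name}: differs by {gaps[name]:.3f}")
```

With these made-up counts, accuracy is identical on both sets while the Delay recall and F1 differ by more than the threshold, which shows why the tasks ask for per-class metrics rather than accuracy alone.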
Build Naïve Bayes Model
1) e1071 naiveBayes classifiers
a. Prepare the FlightDelay data for building and evaluating Naïve Bayes classifiers. Load the caret package. Using seeds of 100, 500 and 900, randomly select 67% of the rows three times to form three training sets, and save the other 33% as the three corresponding testing sets. Calculate the average number of examples in the testing sets.
b. Use a for loop to build and understand e1071 naiveBayes models with all predictors for delay.
i. Load the e1071 and rminer packages.
ii. Build a Naïve Bayes model using the naiveBayes function in e1071 with each training set.
iii. Show each model. What are the values of A-priori probabilities - P(Delay) for the delay class and P(Ontime) for the ontime class for each model?
iv. Generate predictions (i.e. estimations) of the values of the target variable for instances in each testdata.
v. Save the values of TP, TN, FP, FN and calculate the average of these four values after the loop.
vi. Save the values of the performance metrics of the three models on their corresponding testing samples and fill out the following table.
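|        | Accuracy | Precision_Delay | Precision_Ontime | Recall_Delay | Recall_Ontime | F1(Delay) | F1(Ontime) |
|--------|----------|-----------------|------------------|--------------|---------------|-----------|------------|
| Model1 |          |                 |                  |              |               |           |            |
| Model2 |          |                 |                  |              |               |           |            |
| Model3 |          |                 |                  |              |               |           |            |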
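The a-priori probabilities asked for in the Naïve Bayes tasks are simply the class proportions in each training set, and the average testing-set size comes from the three seeded splits. The Python sketch below illustrates both ideas. It assumes a simple random (non-stratified) 67/33 split, and the names split_rows and prior_probabilities as well as the toy labels are hypothetical; the assignment itself expects R with caret and e1071.

```python
import random
from collections import Counter

def split_rows(n_rows, train_fraction, seed):
    """Shuffle row indices with a fixed seed and cut them into train and test."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    cut = round(train_fraction * n_rows)
    return rows[:cut], rows[cut:]

def prior_probabilities(train_labels):
    """A-priori class probabilities: relative class frequencies in the training set."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Toy labels (hypothetical, not the real data).
labels = ["ontime"] * 800 + ["delay"] * 200

test_sizes = []
priors_per_model = []
for seed in (100, 500, 900):
    train_rows, test_rows = split_rows(len(labels), 0.67, seed)
    priors = prior_probabilities([labels[i] for i in train_rows])
    priors_per_model.append(priors)
    test_sizes.append(len(test_rows))
    print(f"seed={seed}  P(delay)={priors['delay']:.3f}  "
          f"P(ontime)={priors['ontime']:.3f}")

average_test_size = sum(test_sizes) / len(test_sizes)
print("average number of test examples:", average_test_size)
```

The priors vary slightly from seed to seed because each training set is a different random sample; a stratified split would keep them essentially constant.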
Cost Sensitive Learning
1) Imbalanced target variable class distribution
a. What is the class distribution (in proportions) of the target variable in the original FlightDelay dataset? Which one is the majority class (more instances) and which one is the minority class (fewer instances)?
b. A simple majority_rule model always classifies all instances as the majority class, which is the class that has more instances in a data set. This rule is a heuristic (man-made) rule. (No code is needed for this question.)
i. Use the majority_rule model to classify all of the instances in FlightDelay.csv. How many TP, TN, FP and FN will this model generate? What is the accuracy rate of applying this model to FlightDelay.csv?
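Although no code is required for the majority_rule question, its arithmetic can be checked with a few lines. The Python sketch below assumes Ontime is the positive class (as in the decision tree tasks) and uses hypothetical class counts, not the real counts from FlightDelay.csv; the function name majority_rule_counts is also made up for this example.

```python
def majority_rule_counts(n_positive, n_negative):
    """Confusion counts when every instance is assigned to the majority class."""
    if n_positive >= n_negative:
        # Everything is predicted positive: all positives are hits,
        # all negatives become false positives.
        counts = {"TP": n_positive, "FP": n_negative, "TN": 0, "FN": 0}
    else:
        # Everything is predicted negative: all negatives are correct,
        # all positives become false negatives.
        counts = {"TP": 0, "FP": 0, "TN": n_negative, "FN": n_positive}
    total = n_positive + n_negative
    accuracy = (counts["TP"] + counts["TN"]) / total
    return counts, accuracy

# Hypothetical counts: 800 Ontime (positive) and 200 Delay (negative) flights.
counts, accuracy = majority_rule_counts(n_positive=800, n_negative=200)
print(counts, f"accuracy={accuracy:.2f}")
```

The accuracy of this rule always equals the majority-class proportion, which is why it serves as a baseline that any trained classifier should beat.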
2) Cost-benefit calculations and cost-sensitive models using all of the predictors
a. Using the mean values of TP, FP, TN and FN from three C50 classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per flight over all three testing results. Assume the following cost and benefit factors.
o Cost of sending a notification message for a flight classified as delayed (predicted as Delay): $50 per flight.
o Loss from delay waiting: $1000 per flight for providing food and hotel service to customers.
o Benefit of correctly predicting a delayed flight: $500 per flight.
o No additional benefit from correctly classifying an actual on-time flight.
| Predicted as \ Actual | On Time | Delay |
|-----------------------|---------|-------|
| On Time               | 0       | -1000 |
| Delay                 | -50     | 500   |
b. Using the mean values of TP, FN, TN and FP from the three naiveBayes classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per flight over all three testing results. Assume the same cost and benefit factors.
c. Create a cost matrix that sets the cost of misclassifying a delayed flight as an on-time flight to 10 times the cost of misclassifying an on-time flight as delayed.
d. In a for loop, build, predict and evaluate C50 classifiers using this cost matrix with the three pairs of train and test sets. These are C50 cost-sensitive classifiers. Print the performance metrics for each testing set as well as the average value of each performance metric over the three testing sets. Generate a confusion matrix for each test set. Calculate and save the mean values of TP, FP, TN and FN over the three confusion matrices of testing results.
e. Using the mean values of TP, FN, TN and FP from the three C50 cost-sensitive classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per flight over all three testing results. Assume the same cost and benefit factors.
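The net-benefit calculations in this section all follow the same pattern: weight each mean confusion count by its payoff and divide by the average test-set size. The Python sketch below illustrates that pattern. It assumes Ontime is the positive class and reads the payoff table with predicted class as rows and actual class as columns; the mean counts are hypothetical placeholders, not real results, and the names PAYOFF and net_benefit_per_flight are made up for this example.

```python
# Payoffs keyed by (predicted, actual), following the cost/benefit table.
PAYOFF = {
    ("ontime", "ontime"): 0,      # TP: correct on-time, no extra benefit
    ("ontime", "delay"): -1000,   # FP: delay missed, food and hotel loss
    ("delay", "ontime"): -50,     # FN: notification sent unnecessarily
    ("delay", "delay"): 500,      # TN: delay correctly predicted
}

def net_benefit_per_flight(mean_tp, mean_fp, mean_fn, mean_tn, mean_test_size):
    """Average net benefit per flight from mean confusion counts."""
    total = (mean_tp * PAYOFF[("ontime", "ontime")]
             + mean_fp * PAYOFF[("ontime", "delay")]
             + mean_fn * PAYOFF[("delay", "ontime")]
             + mean_tn * PAYOFF[("delay", "delay")])
    return total / mean_test_size

# Hypothetical mean counts over three test sets (placeholders only).
benefit = net_benefit_per_flight(
    mean_tp=250, mean_fp=40, mean_fn=20, mean_tn=20, mean_test_size=330
)
print(f"average net benefit per flight: {benefit:.2f}")
```

Running the same function on the mean counts of the plain C50, naiveBayes and cost-sensitive C50 models gives three directly comparable figures.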