1) Consider again the churn dataset. Using WEKA, create two learning curves of the out-of-sample AUC on the test set (churn_test.arff), one for logistic regression and one for the J48 decision tree (use the default settings for both). Start with the full training set and, after each iteration, cut the training set in half until fewer than 100 examples remain. Provide one plot with both curves (copy the data into Excel and create the chart).
• You can cut the dataset in half easily in Weka. In the Preprocess tab, in the box marked Filter, click on Choose. Under weka->filters->unsupervised->instance you will see RemovePercentage. (Normally, it is a good idea to run the Randomize filter first, to make sure that you are removing data at random; real data is often sorted by some attribute, which can result in throwing away many data items with similar values. Don't Randomize for this assignment; the data for this assignment has already been randomized.)
• The Undo button on the Preprocess tab will undo the preprocessing (Randomize, RemovePercentage, etc.). Keep an eye on the data statistics (such as the number of instances) in the Preprocess tab to verify that each step did what you expected.
2) Create a fitting curve of the generalization AUC for decision trees as a function of the minNumObj parameter. First change the option 'unpruned' to 'True'. Plot the out-of-sample performance against the parameter value, using either cross-validation or a training/test split. What does the parameter do? What is the optimal value of the parameter?
3) Repeat the experiment from step 1, but with minNumObj=100 and unpruned=True. How does the learning curve of the decision tree change? What do you infer from this result?
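The halving procedure in steps 1 and 3, and the AUC that is plotted in all three steps, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the WEKA workflow: the function names are made up for this example, sizes are rounded down when halved (RemovePercentage may differ by one instance), and the schedule is assumed to include the first size that falls below 100. The AUC function uses the standard pairwise definition, the probability that a randomly chosen positive example is scored above a randomly chosen negative one; WEKA reports the same quantity as "ROC Area".

```python
def halving_schedule(n_train, min_size=100):
    """Return training-set sizes produced by repeated halving.

    Assumption: the last entry is the first size below min_size,
    and halving rounds down (WEKA's filter may differ by one).
    """
    sizes = [n_train]
    while sizes[-1] >= min_size:
        sizes.append(sizes[-1] // 2)
    return sizes


def auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and predicted scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For a hypothetical training set of 4000 examples (not the actual size of the churn data), halving_schedule(4000) gives 4000, 2000, 1000, 500, 250, 125, 62, so each learning curve would have seven points.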
Attachment: Assignment.rar