1) Label each case as describing either data mining (DM), or the use of the results of data mining (Use).
a) _____ Choose customers who are most likely to respond to an on-line ad.
b) _____ Discover rules that indicate when an account has been defrauded.
c) _____ Find patterns indicating what customer behavior is more likely to lead to response to an on-line ad.
d) _____ Estimate probability of default for a credit application.
e) _____ Predict whether a customer is pregnant.
2) Plumbing Inc. has been selling plumbing supplies for the last 20 years. The owner, Joe, decides that next year it is finally time to diversify by adding gardening tools to his products. Having had success using customer data to build predictive models to guide direct mail campaigns for special plumbing offers, he considers that data mining could help him to identify a subset of customers who should be good prospects for his new set of products. Is Joe ready to solve this as a supervised learning problem?
If yes - what would you suggest as the target variable?
If no - why not? What would you recommend that Joe do to achieve his business goal?
3) Choose a problem from a past job, hobby, or interest that would make for a good predictive modeling classification application. Describe it in one page or less, using the relevant concepts introduced in classes 1 & 2 and Ch. 1 - 3 in the book. Your description should be as complete and precise as possible, referring to the concepts introduced in class/in the book. Please do not choose one of the applications we have discussed already (churn, targeted marketing, default prediction, pregnancy prediction).
Include answers to the following:
a) What exactly is the business decision you want to support with this solution? (Specifically, what is the business action you are considering? Discuss briefly the timing of the decision and the eventual outcome.)
b) Describe the use phase.
c) Why did you select this as a good predictive modeling problem?
d) How and where would you get the data?
e) Explain precisely why and how you expect doing the predictive modeling will add value.
f) What exactly is the quantity that you inherently do not know and need to predict?
g) Is this a classification, ranking, or probability estimation problem?
h) What are the features? Provide a list of at least 5 features that you think (a) you can get and (b) you think might be useful.
i) What exactly would be your training data?
4) Hands-on (Weka version). This is a first simple hands-on modeling task using Weka. Your task is to experiment with the classification tree induction algorithm in Weka. The data is available on NYU Classes in the data section under Resources->Datasets->Mailing (mailing_train.arff and mailing_test.arff). Build a classification tree using the J48 algorithm. If our classroom Weka demonstration was not enough, please consult the Weka tutorial (available under Resources->Weka->Weka_tutorial). It is useful to try to figure things out on your own, but if you get frustrated trying to figure out how to do something, please post a question to the discussion forum.
HINTS: A quick guide to the required commands: start Weka; select Explorer; use 'Open file' to load a dataset; go to the Classify tab and use the Choose button to pick J48 from the trees. Scroll around in the 'Classifier output' and try to understand what you see there.
I) Explore the evaluation options (test options in the Classify tab, on the left under the Choose button). Understand what they do in light of Chapter 5 (it is fairly straightforward, but you can also consult the Weka documentation or Google). Build/evaluate a tree under each of the 4 options (use the default whenever there is a parameter). Report the "accuracy" for each option and write a sentence or two about your observations (look at the summary in the Classifier output and identify the accuracy as the percent 'Correctly Classified Instances' - you can ignore all the other stuff for now).
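To make the differences between the test options more concrete, here is a minimal sketch in plain Python (not Weka) of three of them: evaluating on the training set, a percentage split, and k-fold cross-validation. The toy labels, the majority-class "learner", and all function names are assumptions made up for illustration; the 66% split and 10 folds mirror the defaults shown in the Weka Explorer, but the shuffling here will not match Weka's, so the numbers are not comparable to what you see in the Classifier output.

```python
import random
from collections import Counter

def accuracy(predicted, actual):
    """Percent of instances whose predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

def train_majority(labels):
    """Toy stand-in for a learner: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def percentage_split(n, train_fraction=0.66, seed=1):
    """Shuffle the indices 0..n-1 and cut them into a training and a test part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def cross_validated_accuracy(labels, k=10):
    """Pool the held-out predictions from all k folds and score them together."""
    predicted, actual = [], []
    for train, test in k_fold_indices(len(labels), k):
        model = train_majority([labels[j] for j in train])
        predicted += [model] * len(test)
        actual += [labels[j] for j in test]
    return accuracy(predicted, actual)

# Hypothetical toy labels, NOT the mailing data.
labels = ["no"] * 70 + ["yes"] * 30

model = train_majority(labels)
print("training set:     ", accuracy([model] * len(labels), labels))

train, test = percentage_split(len(labels))
model = train_majority([labels[j] for j in train])
print("percentage split: ", accuracy([model] * len(test), [labels[j] for j in test]))

print("cross-validation: ", cross_validated_accuracy(labels))
```

The fourth option, a supplied test set, works like the percentage split except that the held-out instances come from a separate file (here, mailing_test.arff) rather than from a random cut of the training data.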
II) Figure out how to get predictions out of Weka (try the "More Options" button in the Test options) and copy a dozen of them from the 'Classifier output' window here.
III) Identify the most INFORMATIVE attribute (according to the tree induction) and explain how you found it.
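As background for what "informative" means in tree induction, the following plain-Python sketch computes entropy and information gain for categorical attributes on a tiny made-up table. The attribute names, the rows, and the function names are hypothetical. This is a simplified illustration only: J48 is Weka's implementation of C4.5, whose actual splitting criterion (gain ratio) and handling of numeric attributes are not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical toy rows, NOT the mailing data.
rows = [
    {"past_buyer": "yes", "region": "east", "responded": "1"},
    {"past_buyer": "yes", "region": "west", "responded": "1"},
    {"past_buyer": "no",  "region": "east", "responded": "0"},
    {"past_buyer": "no",  "region": "west", "responded": "0"},
]

for attribute in ("past_buyer", "region"):
    print(attribute, information_gain(rows, attribute, "responded"))
```

In this toy table, past_buyer separates the two classes perfectly while region leaves each group evenly mixed, so the first attribute has the higher gain and is the one a greedy tree learner would split on first.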
IV) Examine the parameters of the tree induction by clicking on J48 in the box just to the right of the 'Choose' button. Set "unpruned" to True. Now, try changing the values for 'minNumObj' and see (i) how it affects in-sample accuracy by evaluating on the training set, and (ii) how it affects the generalization accuracy using the test set. Explain the results. Use the concepts from the readings where appropria