Supervised Data Mining Capstone Assignment
Introduction
The purpose of this assignment is to demonstrate your knowledge and understanding of the analytical techniques and tools learned in the course and to show how they relate to a business scenario. This assignment is somewhat different from previous ones: I do not give you detailed instructions on how to build your analytical process in RapidMiner. Instead, you are expected to use what you have learned in the course to do the modeling, validation, and performance analysis on the given dataset, answer the questions about the analysis, and make recommendations for the business based on the results.
This exercise will be completed in the same teams of 2-3 students as for previous exercises.
Submission Instructions
Perform the necessary tasks using RapidMiner, answer the questions below and prepare the required screenshots.
Here is the explanation of the variables in the dataset:
a. Gender_Female: whether the customer is female
b. PhoneService_Yes: whether the customer has phone service with the company
c. MultipleLines_No: whether the customer has no multiple line service
d. MultipleLines_Yes: whether the customer has multiple line service
e. InternetService_DSL: whether the customer has DSL internet
f. InternetService_Fiber optic: whether the customer has Fiber optic internet
g. TechSupport_Yes: whether the customer signed up for tech support service
h. TechSupport_No internet service: whether the customer had no internet service
i. StreamingTV_Yes: customer streams TV
j. StreamingTV_ No internet service: customer had no internet service for TV streaming
k. StreamingMovies_Yes: customer streams movies
l. StreamingMovies_ No internet service: customer had no internet service for movie streaming
m. Contract_One year: whether the customer has a one-year contract
n. Contract_Two year: whether the customer has a two-year contract
o. PaperlessBilling_Yes: whether the customer signed up for paperless billing
p. PaymentMethod_ Electronic check: payment made by Electronic check
q. PaymentMethod_ Bank transfer: payment made by Bank transfer
r. PaymentMethod_ Credit card: payment made by credit card
s. Retired: 0 for no, 1 for yes
t. Tenure (months): how long the customer has been with the company
u. MonthlyCharges: $ amount of monthly payments for the subscribed services
v. Churn: whether the customer churned (assume that 'positive' means that the customer churns)
1. As a first step, build at least 3 models using different classification techniques that are capable of classifying customers into 2 categories (churn/no churn).
At least one of the 3 models should be either a decision tree or a neural network. Make sure that you build the process to include the cross-validation operator (it is enough to have 3 folds in each validation to save some process runtime). Where possible, experiment with the parameter settings of the model operators to try to improve the model's performance.
Make readable screenshots of the following for all 3 models:
- Processes
- Parameter settings for the type of model (Decision Trees, Neural Networks, etc.)
- Appropriate model results (i.e. Coefficients, Tree, Network, etc.)
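As a reminder of what 3-fold cross-validation does, here is a minimal Python sketch (an illustration only, not part of the RapidMiner process you are asked to build). The function name `k_fold_indices` and the row count are hypothetical, and the split is a plain shuffled one; the sampling your tool applies (for example, stratified sampling) may differ.

```python
import random


def k_fold_indices(n_rows, k=3, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Every row outside the held-out fold is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 10 rows, split into 3 folds.
splits = list(k_fold_indices(n_rows=10, k=3))
for train_idx, test_idx in splits:
    print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Each row is held out exactly once, so every performance figure is computed on data the model did not see during training.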
2. To measure the performance of the 3 models, look at the following performance measures:
- Accuracy
- Kappa
- Precision
- Recall
- Lift
- AUC (NOT the optimistic or pessimistic variant)
(Hint: use the binomial classification performance operator to obtain all of these measures.)
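To see how most of these measures follow from the four cells of a confusion matrix, here is a minimal Python sketch (an illustration only, not a substitute for the RapidMiner output). The counts are made up, the function name `binary_metrics` is hypothetical, and lift is taken as precision divided by the a priori positive rate, which is one common definition and may differ from the one your tool reports. AUC is left out because it needs the ranked confidences, not just the four counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute binary classification measures from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive
    and at least one actual positive).
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)   # share of predicted churners who churn
    recall = tp / (tp + fn)      # share of actual churners who are caught
    base_rate = (tp + fn) / n    # a priori probability of the positive class
    lift = precision / base_rate
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
        "lift": lift,
    }


# Hypothetical counts, with "positive" meaning that the customer churns.
metrics = binary_metrics(tp=120, fp=80, fn=150, tn=650)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the a priori churn rate is 27%, so a model that always predicts "no churn" would already reach 73% accuracy; kappa and lift help put the accuracy figure in that context.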
a. Make a screenshot of the confusion matrix output of each of the 3 methods.
b. Prepare a table to report the above values for the 3 models.
c. Discuss the performance for each of the 3 models based on the above values. Relate the performances to the a priori probabilities of the outcome as well.
d. Suppose that if you can correctly predict that a customer will churn (a true positive), you can make a special promotional offer that will result in the customer staying with you. You estimate that the net profit from such an outcome is $200 per customer.
However, if you predict a customer will churn when they actually wouldn't have (a false positive), you incur a net cost of $50 per customer by unnecessarily making the offer.
Using this information, compute and show a profit/cost matrix for each model and report the per-customer expected value of each model.
e. Prepare a visual evaluation of the 3 models by including a screenshot of the ROC comparison chart.
(Hint: This may require the building of a separate process from the previous ones.)
f. Using the above information, compare the performance of the 3 models. What are the similarities/differences in their performance?
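The per-customer expected value asked for in part d can be sketched in Python as follows (an illustration only). The $200 and $50 figures come from the scenario above; the confusion-matrix counts are made up, the names `PROFIT` and `expected_value_per_customer` are hypothetical, and the sketch assumes that false negatives and true negatives carry no profit or cost because the scenario does not assign them one.

```python
# Profit/cost per customer for each confusion-matrix cell.
# tp and fp are taken from the scenario; fn and tn are ASSUMED to be 0.
PROFIT = {"tp": 200.0, "fp": -50.0, "fn": 0.0, "tn": 0.0}


def expected_value_per_customer(counts, profit=PROFIT):
    """Average profit per customer: sum(count * profit) / total customers."""
    n = sum(counts.values())
    total = sum(counts[cell] * profit[cell] for cell in counts)
    return total / n


# Hypothetical confusion-matrix counts for one model.
hypothetical_counts = {"tp": 120, "fp": 80, "fn": 150, "tn": 650}
ev = expected_value_per_customer(hypothetical_counts)
print(f"Expected value per customer: ${ev:.2f}")
```

Repeating the calculation with each model's own counts gives one expected value per model, which can then be compared alongside the other performance measures.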
3. Answer the following questions:
a. Which attributes seem to matter the most? How do you know? Discuss their importance and/or effect sizes. How can you interpret the results of the models?
b. Are the 3 models giving you more or less the same suggestions? Do the models agree in most aspects? If there are differences, what are they?
4. What business recommendations can be given based on the results? How could the results of the model(s) be useful for the company?