Question: Campaign organizers for both the Republican and Democratic parties are interested in identifying individual undecided voters who would consider voting for their party in an upcoming election. The file Blue Or Red contains data on a sample of voters with tracked variables including: whether or not they are undecided regarding their candidate preference, age, whether they own a home, gender, marital status, household size, income, years of education, and whether they attend church. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Classify the data using k-nearest neighbors with up to k = 20. Use Age, Home Owner, Female, Married, Household Size, Income, and Education as input variables and Undecided as the output variable. In Step 2 of XLMiner's k-nearest neighbors Classification procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate lift charts for both the validation data and test data.
a. For k = 1, why is the overall error rate equal to 0 percent on the training set? Why isn't the overall error rate equal to 0 percent on the validation set?
b. For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data? Explain the difference in the overall error rate on the training, validation, and test data.
c. Examine the decile-wise lift chart. What is the first decile lift on the test data? Interpret this value.
d. In the effort to identify undecided voters, a campaign is willing to accept an increase in the misclassification of decided voters as undecided if it can correctly classify more undecided voters. For cutoff probability values of 0.5, 0.4, 0.3, and 0.2, what are the corresponding Class 1 error rates and Class 0 error rates on the validation data?
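
For readers without XLMiner, the workflow in the question can be sketched in plain Python. This is a rough illustration under stated assumptions, not a reproduction of XLMiner's procedure: the data are synthetic stand-ins (not the Blue Or Red file), normalization is assumed to be z-scoring with training-set statistics, distance is assumed to be Euclidean, and a row is assumed to be classified as Class 1 when its estimated probability is at or above the cutoff. All function and variable names are hypothetical.

```python
import math
import random

def zscore_fit(rows):
    """Column means and standard deviations, computed from the given rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def zscore_apply(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_prob(train_x, train_y, x, k):
    """Fraction of the k nearest training rows (Euclidean) that are Class 1."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return sum(train_y[i] for i in order[:k]) / k

def error_rates(probs, labels, cutoff):
    """Return (overall, Class 1, Class 0) error rates at the given cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]  # assumed convention
    n1 = sum(labels)
    n0 = len(labels) - n1
    miss1 = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 0)
    miss0 = sum(1 for p, y in zip(preds, labels) if y == 0 and p == 1)
    return ((miss1 + miss0) / len(labels),
            miss1 / n1 if n1 else 0.0,
            miss0 / n0 if n0 else 0.0)

def first_decile_lift(probs, labels):
    """Class 1 rate among the top 10% of rows by score, over the overall rate."""
    ranked = sorted(zip(probs, labels), key=lambda t: -t[0])
    top = ranked[:max(1, len(ranked) // 10)]
    base = sum(labels) / len(labels)
    return (sum(y for _, y in top) / len(top)) / base if base else float("nan")

random.seed(0)
# Synthetic stand-in data (NOT the Blue Or Red file): two inputs, binary output.
rows = [[random.gauss(0, 1), random.gauss(50, 15)] for _ in range(200)]
labels = [1 if r[0] + random.gauss(0, 0.8) > 0.3 else 0 for r in rows]

# Rows are generated in random order, so slicing gives a 50/30/20 partition.
n = len(rows)
a, b = n // 2, n * 8 // 10
means, stds = zscore_fit(rows[:a])          # statistics from training rows only
x = zscore_apply(rows, means, stds)
train_x, train_y = x[:a], labels[:a]
valid_x, valid_y = x[a:b], labels[a:b]
test_x, test_y = x[b:], labels[b:]

def overall_error(xs, ys, k, cutoff=0.5):
    probs = [knn_prob(train_x, train_y, r, k) for r in xs]
    return error_rates(probs, ys, cutoff)[0]

# Part (a): with k = 1, each training row is its own nearest neighbor.
train_err_k1 = overall_error(train_x, train_y, 1)

# Part (b): choose the k in 1..20 with the lowest validation error at cutoff 0.5.
best_k = min(range(1, 21), key=lambda k: overall_error(valid_x, valid_y, k))

# Part (c): first-decile lift on the test partition.
test_probs = [knn_prob(train_x, train_y, r, best_k) for r in test_x]
test_lift = first_decile_lift(test_probs, test_y)

# Part (d): Class 1 and Class 0 error rates on validation data per cutoff.
valid_probs = [knn_prob(train_x, train_y, r, best_k) for r in valid_x]
by_cutoff = {c: error_rates(valid_probs, valid_y, c) for c in (0.5, 0.4, 0.3, 0.2)}
for c, (overall, e1, e0) in by_cutoff.items():
    print(f"cutoff {c}: Class 1 error {e1:.3f}, Class 0 error {e0:.3f}")
```

Two properties of the sketch bear on parts (a) and (d). When no two training rows are identical, the k = 1 training error is zero because every training row is at distance zero from itself, whereas validation rows must borrow a label from a different row. Lowering the cutoff can only move predictions from Class 0 to Class 1, so the Class 1 error rate cannot rise and the Class 0 error rate cannot fall. Ties in the estimated probabilities can affect which rows land in the first decile, so the lift value here is approximate.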