Problem 1: Pete, owner of Pistol Pete's Diamond Emporium, is investing in a diamond classification system because of his deteriorating eyesight. Pete buys and sells diamonds of varying quality: Low ($1,000-$3,000), Medium ($4,000-$7,000), and High ($8,000-$10,000). It is very important to Pete that the classifier labels his diamonds correctly, both so that his business stays profitable and so that his customers continue to trust him as a business owner.
Using the possible cost matrix values given below, fill out the cost matrix that most accurately reflects Pete's needs for his diamond classifier model. After completing the cost matrix, justify your proposed cost matrix.
Possible cost matrix values: -1, -1, 0, 20, 20, 20, 20, 100, 100
| Actual class (rows) / Predicted class (columns) | High | Medium | Low |
|---|---|---|---|
| High | | | |
| Medium | | | |
| Low | | | |
Problem 2: Weka recently added the fictitious Super Happy Terrific Classifier (SHTC) algorithm to its suite of available classifiers and you would like to use it in your analysis. Upon reading the SHTC documentation, you realize that it only accepts discrete attributes as input. However, many of the attributes in your data set are continuous. Can you still use the SHTC algorithm in your analysis? If yes, explain how. If no, explain why not.
Problem 3: You have decided to use J48 as a classifier in Weka for your data set. After your analysis, you have found that the accuracy of J48 for your data set is greater than that of ZeroR, but less than the accuracy of OneR. Should you continue to use J48 as a classifier for your data set? Why or why not?
Problem 4: You have performed an unsupervised k-means clustering on a data set with two attributes, and the results indicate a k of 2. Later, you determine the class values for each data instance (there are four class values), and a supervised clustering results in a k of 4. Provide a possible explanation for why the two clustering methods disagree on the value of k, and draw a sketch of the two clusterings to accompany your explanation.
Problem 5: You are using a 3-nearest-neighbor classifier with Euclidean distance as the metric. Determine the class value of the data point Q (7, 2, 6) using the labeled data points below. Recall that the general form of the Euclidean distance is
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
P1 (-4, 9, 3), class value 1
P2 (8, -2, 1), class value 1
P3 (6, 1, 5), class value 0
P4 (10, 8, 4), class value 0
P5 (-1, 0, -1), class value 1
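The procedure can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the helper names `euclidean` and `knn_predict` and the 2-D training points are hypothetical and deliberately differ from the points given in the problem. Ties in the vote are not handled specially.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(query, labeled_points, k=3):
    """Return the majority class among the k labeled points nearest to query."""
    nearest = sorted(labeled_points, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training points (NOT the points from the problem).
train = [
    ((0, 0), "a"),
    ((1, 0), "a"),
    ((0, 1), "b"),
    ((5, 5), "b"),
    ((6, 5), "b"),
]

print(knn_predict((0.2, 0.1), train, k=3))
```

With these made-up points, the three nearest neighbors of (0.2, 0.1) are (0, 0), (1, 0), and (0, 1), so the vote is two "a" against one "b".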
Problem 6: Run the Nearest Neighbor classifier with a k-value of 7 and a Support Vector Machine with default values, using 10-fold cross-validation, on the diabetes data set (diabetes.arff in Assignment 3 on myCourses) in Weka. Fill in the confusion matrices for the two models in the tables below, and use the cost matrix to compute the cost of each model. Based on the cost, which model should be selected, and why?
Nearest Neighbor (k=7) Confusion Matrix
| | Tested Negative | Tested Positive |
|---|---|---|
| Tested Negative | | |
| Tested Positive | | |
Support Vector Machine Confusion Matrix
| | Tested Negative | Tested Positive |
|---|---|---|
| Tested Negative | | |
| Tested Positive | | |
Cost Matrix
| | Tested Negative | Tested Positive |
|---|---|---|
| Tested Negative | 0 | 50 |
| Tested Positive | 100 | -1 |
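One common way to compute a model's total cost is to multiply each confusion matrix cell by the matching cost matrix cell and sum the products. The sketch below assumes the confusion matrix and the cost matrix use the same row and column order. The function name `total_cost` and the confusion counts are hypothetical placeholders, not Weka output; only the cost values come from the table above.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over matching cells of two same-shaped matrices."""
    return sum(
        count * unit_cost
        for conf_row, cost_row in zip(confusion, cost)
        for count, unit_cost in zip(conf_row, cost_row)
    )


# Cost matrix from the table above; rows and columns are ordered
# (Tested Negative, Tested Positive).
cost_matrix = [
    [0, 50],
    [100, -1],
]

# Hypothetical confusion counts for illustration only (NOT real results).
example_confusion = [
    [40, 10],
    [5, 45],
]

# 40*0 + 10*50 + 5*100 + 45*(-1) = 955
print(total_cost(example_confusion, cost_matrix))
```

Under this calculation, the model with the lower total cost is the cheaper one for the given cost matrix.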