Data Mining - Using WEKA for Classification
Step 1: Understanding File Format
Before we start using Weka, let's spend a few minutes understanding the file format of the input data. In Notepad or another text editor, open the file ‘sunburn.arff’. The file is in the Attribute-Relation File Format (ARFF), one of the input formats that Weka accepts; Weka can also read other formats, such as CSV. In an ARFF file, lines beginning with a % sign are comments. Following the comments at the beginning of the file are the name of the relation (‘sunburn’) and a block defining the attributes (‘hair’, ‘height’, ‘weight’, ‘lotion’, ‘burned’). Nominal attributes are followed by the set of values they can take on, enclosed in curly braces. Numeric attributes are followed by the keyword numeric. There are two further attribute types, string and date.
Although the problem is to predict the class value ‘burned' from the values of the other attributes, the class attribute is not distinguished in any way in the data file. The ARFF format merely gives a dataset; it does not specify which of the attributes is the one that is supposed to be predicted. This means that the same file can be used for investigating how well each attribute can be predicted from the others, or to find association rules, or for clustering.
Following the attribute definitions is an @data line that signals the start of the instances in the dataset. Instances are written one per line, with values for each attribute in turn, separated by commas. If a value is missing, it is represented by a single question mark.
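To make the layout described above concrete, here is a minimal Python sketch of a reader for simple ARFF text. The sample text is illustrative only: the nominal value sets and the second data row are assumptions, not the real contents of ‘sunburn.arff’, and the function name parse_arff is hypothetical. Weka's own loader handles many more cases (quoted names, sparse data, date formats).

```python
# Minimal ARFF reader sketch (standard library only).
# SAMPLE_ARFF is illustrative; it is NOT the real contents of sunburn.arff.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign
@relation sunburn

@attribute hair {blonde, brown, red}
@attribute height {short, average, tall}
@attribute weight {light, average, heavy}
@attribute lotion {yes, no}
@attribute burned {burned, none}

@data
blonde, average, light, no, burned
brown, tall, ?, yes, none
"""


def parse_arff(text):
    """Return (relation, attributes, instances) for a simple ARFF string.

    Handles comments, nominal/numeric declarations and '?' for missing
    values. Quoted names, sparse data and date formats are not handled.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A lone question mark marks a missing value.
            instances.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: turn "{a, b}" into ["a", "b"].
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, instances


relation, attributes, instances = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(instances))
```

Note that nothing in the parsed result marks ‘burned’ as the class attribute; that choice is made later, in Weka.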
Step 2: Exploring Training Data
Launch Weka by clicking on: RunWeka.bat
Select ‘Explorer' from the list of Applications.
Select the ‘Preprocess' tab and click on ‘Open File'. Choose the file ‘sunburn.arff' which contains the training data set.
Once the file is open, spend some time exploring the training data set. Weka gives a summary of the relation in the dataset and shows a list of attributes in the relation. An attribute can be selected from the attribute list. Once the attribute is selected, a summary of the attribute is displayed, which includes the list of attribute values (labels) and their counts in the dataset. Finally, the class attribute can be selected and the class distributions for the different attribute values are visualized.
Q1. What is the relation for the training data set? How many instances are in the data set? How many attributes are in the relation?
Q2. How many distinct values does attribute "weight" have? What are the counts for these attribute values? If you select attribute "burned" as the class attribute, what are the class distributions for the distinct values of attribute "weight"? If you select attribute "height" as the class attribute, what are the class distributions for the distinct values of attribute "hair"?
Step 3: Exploring Classifiers and Decision Trees
Select the ‘Classify’ tab and make sure that "J48" is chosen from the classifier list and "Use training set" is selected as the test option. Note that attribute "burned" is chosen by default as the class attribute, but the class attribute can be changed if needed.
Clicking ‘Start’ creates a classification model (classifier) from the training dataset. The classifier is listed in the ‘Result list’, while the details about the classifier are displayed in the ‘Classifier output’ window.
Right click on ‘trees.J48’ in the ‘Result list’ and select ‘Visualize tree’. This will open the "Tree View" window.
A decision tree representation of the classifier will be displayed. Now spend some time examining the decision tree. On each of the leaf nodes there is a class label and two numbers. For instance, the rightmost leaf node of the tree is "burned (9.0/2.0)". This means that 9 instances in the training dataset reach that node, of which 2 are classified incorrectly. As you can see, there are 16 instances in total across all the leaf nodes.
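As a small aid for reading leaf labels, the sketch below splits a label such as "burned (9.0/2.0)" into its class, the number of instances reaching the leaf, and the number misclassified. The function name parse_leaf is hypothetical, and the pattern is an assumption based only on the label format quoted above; a leaf written without a slash is treated as having no errors.

```python
import re

# Pattern for labels like "burned (9.0/2.0)"; the "/2.0" part is optional.
LEAF_PATTERN = re.compile(
    r"^(?P<label>.+?)\s*\((?P<total>[\d.]+)(?:/(?P<wrong>[\d.]+))?\)$"
)


def parse_leaf(text):
    """Return (class_label, instances_reaching_leaf, misclassified)."""
    match = LEAF_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a leaf label: {text!r}")
    total = float(match.group("total"))
    wrong = float(match.group("wrong") or 0.0)
    return match.group("label"), total, wrong


label, total, wrong = parse_leaf("burned (9.0/2.0)")
print(f"{label}: {total - wrong:g} of {total:g} instances classified correctly")
```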
The displayed decision tree is learned using J48, Weka's implementation of the C4.5 classification algorithm. This algorithm uses entropy as the impurity function for selecting the splitting attribute. We have yet to cover the algorithm. However, we have learned another impurity function, the Gini index (Gini impurity). Can you generate a decision tree using Hunt's algorithm with the Gini index as the impurity function?
Q3. Generate the optimal decision tree by hand using Hunt's algorithm with the Gini index.
You can then compare the decision tree generated by the C4.5 algorithm with the one generated by Hunt's algorithm.
Q4. Are these two decision trees the same?
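For checking hand calculations, the two impurity functions named in this step can be sketched in Python as follows. The function names are hypothetical. The class counts [7, 2] correspond to the leaf "burned (9.0/2.0)" discussed above (9 instances, 2 of them in the other class); the second partition in the split example is made up for illustration. This is only the impurity arithmetic, not an implementation of Hunt's algorithm or C4.5.

```python
from math import log2


def gini(counts):
    """Gini index of a node given its per-class instance counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class instance counts."""
    total = sum(counts)
    # Classes with zero instances contribute nothing to the sum.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def weighted_impurity(partitions, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * impurity(p) for p in partitions)


leaf = [7, 2]  # 9 instances reach the leaf, 2 belong to the other class
print("Gini:", round(gini(leaf), 4))
print("Entropy:", round(entropy(leaf), 4))

# Hypothetical two-way split of 16 instances; the second child is assumed pure.
print("Split Gini:", round(weighted_impurity([[7, 2], [0, 7]], gini), 4))
```

When comparing candidate splits by hand, the attribute whose split gives the lowest weighted impurity is the one to choose at that node.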
Step 4: Examining Classifier Output
The classifier output window shows the full output. At the beginning, the Run information provides a summary of the classifier, the training data set, and the test option. Then comes the classifier model, in which a pruned decision tree is shown in textual form. In the tree, the first split is on attribute "lotion", and then, at the second level, the split is on attribute "hair". In the tree structure, a colon introduces the class label that has been assigned to a particular leaf node, followed by the number of instances that reach that node. Where some of those instances are classified incorrectly, their number appears too, after a slash.
The next part of the output gives a summary of the evaluation on the dataset chosen as the test option. In this case, the evaluation results are obtained using the training set.
Now you can have a look at the evaluation results.
Q5. What are the accuracy and error rates of the evaluation? How do you calculate each of these rates?
Next comes the Detailed Accuracy by Class. Here we have a table that contains TP, FP, Precision, Recall, F-Measure etc.
Q6. What are the TP, FP, Precision, Recall and F-Measure for the "burned" class? What does each of them measure? How are these metrics calculated?
Finally comes the Confusion Matrix.
Q7. How do you interpret the Confusion Matrix? What does each of the four cells in the table represent?
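Once you have attempted Q5 to Q7, the sketch below may help you check your arithmetic. It computes the usual two-class measures from the four cells of a confusion matrix, using the standard textbook definitions. The function name class_metrics and the counts passed to it are hypothetical; they are not the numbers Weka reports for the sunburn data, so substitute the values from your own output.

```python
def class_metrics(tp, fn, fp, tn):
    """Standard two-class measures for the class treated as 'positive'.

    tp/fn: positive instances predicted as positive/negative.
    fp/tn: negative instances predicted as positive/negative.
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "tp_rate": recall,  # the TP rate is the same quantity as recall
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for 16 instances, 2 of them misclassified.
metrics = class_metrics(tp=6, fn=1, fp=1, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```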
Step 5: Using Cross-validation and Examining the Classification Results
You can easily run J48 again with a different evaluation method. Select the "Cross-validation" test option, leaving the number of folds at the default of 10, and click ‘Start’ again. The classifier output is quickly replaced to show how well the learned model performs under cross-validation. As you can see, 25% of the instances (4 out of 16) are misclassified in the cross-validation. This indicates that the result obtained earlier on the training set (12.5% of the instances, 2 out of 16, misclassified) is optimistic compared with what might be obtained from an independent test set from the same source.
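To see what 10-fold cross-validation means for a dataset of only 16 instances, the following sketch partitions instance indices into folds. It is a simplified illustration, not Weka's procedure: Weka shuffles and stratifies the data before splitting, which this sketch does not do, and the function name k_fold_indices is hypothetical. Each instance is used for testing exactly once and for training in all the other folds.

```python
def k_fold_indices(n_instances, n_folds):
    """Split indices 0..n_instances-1 into (train, test) pairs, one per fold."""
    folds = []
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional test instance.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_instances) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(16, 10)
test_sizes = [len(test) for _, test in folds]
print("test instances per fold:", test_sizes)

# Error rates quoted in the text: training set versus cross-validation.
print("training-set error:", 2 / 16)
print("cross-validation error:", 4 / 16)
```

With 16 instances and 10 folds, each model is tested on only one or two instances, so the cross-validation estimate on a dataset this small should be read with caution.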
Q8. How do the figures under the Detailed Accuracy By Class (e.g., TP, FP, Precision, Recall and F-Measure) compare with the ones obtained on the training set?
Q9. Have you observed any changes to the Confusion Matrix? If so what are the changes?
Step 6:
In notepad or another text editor, open file ‘sunburn2.arff'.
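Add an additional attribute ‘shade’ at the start of the list of attributes, so that the order of the declarations matches the order of the values in each instance: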
@ATTRIBUTE 'shade' {yes, no}
The values for ‘shade' should be listed at the start of each instance. For instance, the first instance:
blonde, average,light, no, burned
becomes:
no,blonde, average,light, no, burned
Values (in order, top to bottom) for each instance are as follows:
no, no, no, no, no, no, no, no, no, no, no, yes, yes, no, no, no
Accordingly, update each instance in the file ‘sunburn2.arff' and then save the file.
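If you would rather script the edit than type it, the sketch below shows one way to do it in Python. It works on ARFF text held in a string: it inserts the ‘shade’ declaration before the first attribute and prepends one value to each data line. The function name add_shade is hypothetical, and the sample header uses assumed nominal value sets with only the first instance from above, so it is not the real ‘sunburn2.arff’. To apply it to the real file you would read the file's text, pass all 16 values, and write the result back.

```python
# Shade values in instance order, as listed above (16 values).
SHADE_VALUES = ["no"] * 11 + ["yes", "yes"] + ["no"] * 3

# Illustrative input; the value sets in the header are assumptions.
SAMPLE = """\
@relation sunburn
@attribute hair {blonde, brown, red}
@attribute height {short, average, tall}
@attribute weight {light, average, heavy}
@attribute lotion {yes, no}
@attribute burned {burned, none}
@data
blonde, average,light, no, burned
"""


def add_shade(arff_text, shade_values):
    """Return ARFF text with 'shade' added as the first attribute."""
    out = []
    values = iter(shade_values)
    in_data = False
    declared = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if not in_data:
            if lower.startswith("@attribute") and not declared:
                # Declare shade before the first existing attribute.
                out.append("@ATTRIBUTE 'shade' {yes, no}")
                declared = True
            elif lower.startswith("@data"):
                in_data = True
            out.append(line)
        elif stripped and not stripped.startswith("%"):
            # Prepend the next shade value to this instance.
            out.append(next(values) + "," + line)
        else:
            out.append(line)  # keep blank lines and comments unchanged
    return "\n".join(out) + "\n"


updated = add_shade(SAMPLE, SHADE_VALUES[:1])
print(updated)
```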
In WEKA Explorer click the ‘Preprocess' tab and then click ‘Open File'. Select the new file ‘sunburn2.arff'.
Step 7:
Repeat Step 3 and use J48 to create a new decision tree with this file.
Q10. Does the classification accuracy increase or decrease for this new file?
Q11. Does the J48 decision tree change, if so in what way?
Step 8:
In WEKA Explorer stay in the ‘Classify' tab. Select the ‘Supplied Test set' radio button and click the ‘Set' button, followed by the ‘Open file' button. Choose and open the file ‘sunburn2TEST.arff' and click ‘Close'.
Click the ‘More options’ button and ensure there is a tick beside ‘Output predictions’, then press ‘OK’.
Right click on ‘trees.J48’ in the ‘Result list’ and select ‘Re-evaluate model on current test set’.
The prediction results will appear in the ‘Classifier output' under the heading ‘Predictions on test set'.
Compare the predictions to the instances in the file ‘sunburn2TEST.arff'.
Q12. Are the predictions reasonable? Are the predictions as you would expect?
Attachment: Practical1.rar