KNN Assignment:
Build a very straightforward and fully functional machine learning classifier with the K-Nearest Neighbor (KNN) algorithm. The KNN model will read a set of data specified by the user, decide the appropriate class of the new instance, and finally output the performance metrics of the model based on test data set.
Requirements:
- The implementation should be written in Java.
- The distance metric used should be Euclidean distance.
- The model should use the training data set as the existing data set to classify the data in the test set.
- The performance metrics output needs to include the confusion matrix and accuracy (1-misclassification rate).
- The code must be commented so that each implemented component of the KNN algorithm is properly labeled.
- Use object-orien ted techniques (do not make the whole program one monolithic procedu ral implementation).
- Separate the distance calculations, sorting, data structure, and other components so that each of them can be developed and reviewed individually.
- Implement the classifier as a standalone program.
Assignment Deliverable:
This is what I need for my output
1) All source code
2) A copy of the output when K=3
3) A copy of the output when K=5
4) A copy of the output when K=7
5) Confusion matrix and accuracy
A few notes:
- Input/output streams
- Distance calculation
- KNN algorithm
- Confusion matrix and accuracy
- Object-oriented and commented codes
2) You must take the input file path from the command line. We want students to learn how to input dataset from file into memory. In the real world, we will run multiple datasets via multiple files to analyze the results. Coding the file name in your program is not practical.
3) Make sure you test all the rows and use all the other data points as your training data. For example, when you are testing row 1, you need to use rows 2 to 150 as the training data for the calculation. When you are testing row 2, you need to use rows 1 and rows 3 to 150 for the calculation.
4) You should not use a K with an even number. After locating the K nearest neighbors, we need to take the majority for the predicted value. An even number K will potentially have many ties when you are trying to select the majority.
5) I may choose to run a subset of the input data. Please make sure your program works for any input data with the same format.