Assignment: Clustering
Your task for this assignment is to implement and evaluate the k-means clustering algorithm.
1. Implement the k-means clustering algorithm.
a. You can use any programming language that you are familiar with.
b. The program should accept at least 3 command-line parameters: the name of the dataset file, k, and the name of the output file.
c. The output file should contain numerical class labels (one number per row) for all the records in the test dataset, followed by the sum of squared errors (SSE) in the last row.
d. You only need to handle numerical attributes (categorical attributes are not required).
2. Select two datasets from the UCI repository and evaluate the algorithm with varying k, using SSE and another metric of your choice (e.g., BCubed precision and recall, or the Jaccard score if you have the class labels). (I intend to run your implementation on the Fisher Iris dataset without the labels.)
3. Write a brief report to:
a. Describe the datasets.
b. Describe your implementation and experiment setup, e.g. any preprocessing you performed on the dataset such as normalizing the attributes, distance metrics you used, etc.
c. Present the experiment results with varying k.
d. Discuss the insights and conclusions from your experiments.
4. This is an individual assignment.
5. Submission. You will upload two items to Canvas: your PDF report and a zip or tar file.
This zip/tar file must contain:
a. Your source files (include your name(s) in a comment at the top of every source file).
b. The executable.
c. A README file explaining how to compile and run your program.
d. The output files for your test datasets.
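
To make the interface in items 1b and 1c concrete, here is a minimal sketch in Python of one possible program shape. It is an illustration under stated assumptions, not a required design: it assumes a comma-separated input file with numeric attributes only and no header row, uses Lloyd's algorithm with squared Euclidean distance, and seeds the centroids with k randomly chosen records. The file name kmeans.py and the function names load_rows, sq_dist, kmeans, write_output, and main are hypothetical.

```python
import csv
import random
import sys


def load_rows(path):
    """Read numeric records (assumed: comma-separated, no header row)."""
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, sse)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k records drawn at random.
    centroids = [list(p) for p in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(rows, labels))
    return labels, centroids, sse


def write_output(path, labels, sse):
    """One label per row, then the SSE in the last row (item 1c)."""
    with open(path, "w") as f:
        for lab in labels:
            f.write(f"{lab}\n")
        f.write(f"{sse}\n")


def main(args):
    dataset_file, k, output_file = args[0], int(args[1]), args[2]
    labels, _, sse = kmeans(load_rows(dataset_file), k)
    write_output(output_file, labels, sse)


if __name__ == "__main__":
    if len(sys.argv) >= 4:
        main(sys.argv[1:])
    else:
        print("usage: python kmeans.py <dataset_file> <k> <output_file>",
              file=sys.stderr)
```

Under these assumptions, the program would be run as `python kmeans.py <dataset_file> <k> <output_file>`. Because the result depends on the initial centroids, a real experiment may want to try several seeds and keep the run with the lowest SSE.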
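
For the second metric in item 2, one option when class labels are available is the Jaccard score. The sketch below uses the pair-counting form: over all pairs of records, it divides the number of pairs placed together by both the clustering and the class labels by the number of pairs placed together by at least one of them. This is only one possible choice of metric, and the function name pair_jaccard is hypothetical.

```python
from itertools import combinations


def pair_jaccard(pred, truth):
    """Pair-counting Jaccard coefficient between cluster and class labels."""
    both = either = 0
    for i, j in combinations(range(len(pred)), 2):
        same_cluster = pred[i] == pred[j]
        same_class = truth[i] == truth[j]
        both += same_cluster and same_class    # pair grouped together by both
        either += same_cluster or same_class   # pair grouped together by either
    # With no co-grouped pairs at all, the two partitions agree trivially.
    return both / either if either else 1.0


# A clustering that matches the classes exactly scores 1.0,
# regardless of how the cluster ids are numbered.
print(pair_jaccard([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
# One record of class "b" absorbed into the first cluster lowers the score.
print(pair_jaccard([0, 0, 0, 1], ["a", "a", "b", "b"]))  # 0.25
```

The loop visits every pair of records, so the cost grows quadratically with the dataset size. That is fine for small datasets such as Iris but may need a contingency-table formulation for larger ones.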