Assignment:
- Part I:
- Use the data provided in the attached Excel file.
- Select a reasonable number of objects to represent the population, e.g., 2000, 1000, 500, ... (your call as the domain expert).
- Select a set of "representative attributes", e.g., 8, 15, 20, ... (your call as the domain expert).
- Decide on a similarity measure between any two objects.
- Number the objects.
- Note: it is a good idea to remove outliers and objects with missing values.
-
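As a minimal sketch of the Part I steps (not a prescribed solution), the Python below drops objects with missing values, rescales each attribute to [0, 1], numbers the objects, and uses Euclidean distance as the dissimilarity measure. The toy rows, the function names, and the choice of min-max scaling and Euclidean distance are all assumptions for illustration; outlier filtering is not shown.

```python
import math

def drop_incomplete(rows):
    """Keep only rows in which every attribute is present (not None or NaN)."""
    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if all(present(v) for v in r)]

def min_max_scale(rows):
    """Rescale each attribute to [0, 1] so no attribute dominates the distance."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]

def euclidean(a, b):
    """Dissimilarity between two objects: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: 4 objects, 2 attributes, one missing value.
raw = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [5.0, 20.0]]

clean = drop_incomplete(raw)                # the row with None is removed
scaled = min_max_scale(clean)               # every attribute now lies in [0, 1]
objects = dict(enumerate(scaled, start=1))  # number the objects 1..n

print(len(objects), euclidean(objects[1], objects[2]))
```

Scaling before measuring distance is a design choice: without it, an attribute with a large numeric range would dominate the similarity measure.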
- Part II:
- Select a "k" as the number of clusters. (Justify k as the domain expert.)
- Manually select k centroids.
- Cluster the data into k clusters.
- Compute the SE (squared error) for each cluster. Show the sum of the SEs (the SSE).
- Explain the clustering.
- Randomly select k centroids.
- Cluster the data into k clusters.
- Compute the SE (squared error) for each cluster. Show the sum of the SEs (the SSE).
- Explain the clustering difference between this clustering and the previous one.
-
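The Part II steps can be sketched as a plain k-means run started once from manually chosen centroids and once from randomly chosen ones. This is only an illustrative sketch: the toy points, the helper names (`sq_dist`, `assign`, `kmeans`, `cluster_se`), the seed, and the reading of SE as the sum of squared Euclidean distances from each object to its cluster centroid are assumptions, not part of the assignment.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two objects."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty cluster keeps its previous centroid.
            updated.append(
                [sum(col) / len(members) for col in zip(*members)] if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

def cluster_se(points, labels, centroids):
    """Squared error of each cluster; the sum of this list is the SSE."""
    se = [0.0] * len(centroids)
    for p, l in zip(points, labels):
        se[l] += sq_dist(p, centroids[l])
    return se

# Hypothetical toy data with two obvious groups.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
k = 2

# Run 1: manually selected centroids.
labels_m, cents_m = kmeans(points, [[1.0, 1.0], [8.0, 8.0]])
se_m = cluster_se(points, labels_m, cents_m)

# Run 2: k centroids drawn at random from the data (seed fixed for repeatability).
rng = random.Random(0)
labels_r, cents_r = kmeans(points, rng.sample(points, k))
se_r = cluster_se(points, labels_r, cents_r)

print("manual:", se_m, "SSE =", sum(se_m))
print("random:", se_r, "SSE =", sum(se_r))
```

Comparing the two SSE values, and the cluster memberships behind them, is one way to ground the explanation of how the two clusterings differ.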
- Part III:
- Add a new feature to this assignment.
- Explain your algorithm, program, and result.
- For example:
- A new way to select data.
- A new way to calculate similarity, specifically for your own unique data.
- A new clustering algorithm.
- A new way to calculate the effectiveness of clustering.
- A new way to visualize the clustering.
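To make one of the example options concrete, the sketch below computes the mean silhouette coefficient, a common alternative to SSE for judging how well a clustering separates the data. It is offered only as an illustration of what "a new way to calculate the effectiveness of clustering" could look like; the function names, the toy points, and the convention of scoring singleton clusters as 0 are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; needs at least two clusters."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(euclidean(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, points[j]) for j in idx) / len(idx)
            for l, idx in clusters.items()
            if l != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: the first labeling follows the two visible groups,
# the second deliberately mixes them.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Unlike SSE, the silhouette does not automatically improve as k grows, which makes it usable for comparing clusterings with different k.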
• What to turn in:
- Part I:
- the number of objects selected, the number of attributes selected, and why and how you selected them.
- explain the similarity measure function, i.e., what is the similarity/dissimilarity between any two objects?
-
- Part II:
- what is k, and why this k? What are the k centroids?
- what is the clustering result?
- what is the SSE for this clustering?
-
- Repeat the above for randomly selected centroids.
-
- Part III:
- Clearly describe your idea, algorithm, program, and result.
-
- Part IV:
- Source code, with an explanation of how to run your program.
Attachment: project datamining.rar