Question 1
Get the data set "food.txt" from GauchoSpace and read it into R. Alternatively, you can load this data set from the R package cluster.datasets with the following code:
library(cluster.datasets)
data(nutrients.meat.fish.fowl.1959)
The data set contains the quantity of Energy, Protein, Fat, Calcium and Iron of 27 different aliments.
The task here is to find meaningful clusters in the data. To this end, perform the following:
1. Find clusters using the K-means algorithm. Try out different values of K and determine your best solution. The number of clusters you choose should be based on appropriate measures of fit (for example, SSE as defined in the book IDM) and on the interpretability of the results. For each value of K that you try out, provide:
a. the centroids
b. the size of each cluster and a list of the aliments and their cluster membership
c. the ratio between-SS/total-SS
d. an interpretation (use your imagination) of each cluster formed, e.g. what are the summarizing characteristics of the aliments in group 1?
e. to answer part d above, you might find it useful to draw a parallel coordinate plot of the centroids
2. Apply hierarchical clustering using min, max and average distances (respectively single, complete and average methods in R).
a. For each method produce a dendrogram with the labels of the aliments
b. What are the differences, if any, between the results of the three distance measures?
c. Can you identify clusters similar to those obtained by K-means clustering?
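The workflow in parts 1 and 2 can be sketched in R as follows. This is only an illustrative outline, not a model solution. It runs on USArrests, a data set that ships with base R, as a stand-in, because the column layout of food.txt is not shown here. Standardizing the variables with scale(), the range K = 2 to 6, the seed, nstart = 25 and the choice of K = 3 for the detailed output are all assumptions made for the example.

```r
# Stand-in data (assumption): USArrests ships with base R.
# Replace with the food data once it is loaded.
x <- scale(USArrests)   # standardize so that no single variable dominates

set.seed(1)             # k-means uses random starting centroids
ks <- 2:6
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))

# SSE (total within-cluster SS) and between-SS / total-SS for each K
fit_table <- data.frame(
  K     = ks,
  SSE   = sapply(fits, function(f) f$tot.withinss),
  ratio = sapply(fits, function(f) f$betweenss / f$totss)
)
print(fit_table)

# Details for one candidate solution (K = 3, picked only for illustration)
fit3 <- fits[[which(ks == 3)]]
print(fit3$centers)                       # centroids
print(fit3$size)                          # cluster sizes
print(split(rownames(x), fit3$cluster))   # members of each cluster

# Parallel coordinate plot of the centroids: one line per cluster
matplot(t(fit3$centers), type = "b", lty = 1, pch = 19, xaxt = "n",
        xlab = "", ylab = "standardized centroid value")
axis(1, at = seq_len(ncol(x)), labels = colnames(x))

# Hierarchical clustering with single, complete and average linkage
d <- dist(x)
hc <- lapply(c(single = "single", complete = "complete", average = "average"),
             function(m) hclust(d, method = m))
for (m in names(hc)) {
  plot(hc[[m]], main = paste(m, "linkage"), xlab = "", sub = "")
}

# Cross-tabulate a 3-cluster cut of the tree against the k-means labels
print(table(kmeans = fit3$cluster, complete = cutree(hc$complete, k = 3)))
```

The cross-tabulation in the last line is one simple way to approach part 2c: large counts concentrated in a few cells suggest that the two methods recover similar groups.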
Additional exercises for PStat 231
Question 2
Perform PCA of the food.txt data and use a biplot to visualize the first two PCs and the variables. Based on the biplot, one could still identify groups (clusters) of aliments with similar characteristics.
a. Is the grouping obtained by PCA similar or different from that obtained by the clustering algorithms above? Explain with some detail.
b. Which technique do you find most useful in describing the data set? Why?
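A minimal sketch of the PCA step, again using USArrests from base R as a stand-in for the food data (an assumption, since the layout of food.txt is not shown here). Scaling the variables to unit variance, the seed and the use of K = 3 for the comparison are choices made for the example only.

```r
# Stand-in data (assumption): USArrests ships with base R.
x <- USArrests

# PCA on centered, unit-variance variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)
print(summary(pca))        # proportion of variance explained by each PC

# Biplot of the first two PCs: points are observations, arrows are variables
biplot(pca, scale = 0, cex = 0.6)

# To compare with clustering, color the PC scores by k-means membership
set.seed(1)
km <- kmeans(scale(x), centers = 3, nstart = 25)
plot(pca$x[, 1:2], col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(x), pos = 3, cex = 0.6)
```

If the k-means clusters occupy clearly separated regions of the PC1-PC2 plane, the two techniques are telling a similar story about the data.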
Question 3
Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
$$
\begin{bmatrix}
 & 0.3 & 0.4 & 0.7 \\
0.3 & & 0.5 & 0.8 \\
0.4 & 0.5 & & 0.45 \\
0.7 & 0.8 & 0.45 &
\end{bmatrix}
$$
For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
a. On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.
b. Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?
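A hand-drawn sketch can be checked in R with a few lines. The code below enters the dissimilarity matrix from the question, setting the diagonal to 0 (the dissimilarity of an observation with itself, which is an assumption since the question leaves the diagonal blank), and runs complete-linkage clustering with hclust().

```r
# Dissimilarity matrix from the question; the diagonal of 0 is an assumption
d <- matrix(c(0.0, 0.3, 0.4,  0.7,
              0.3, 0.0, 0.5,  0.8,
              0.4, 0.5, 0.0,  0.45,
              0.7, 0.8, 0.45, 0.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(as.character(1:4), as.character(1:4)))

# as.dist() turns the square matrix into the "dist" object hclust() expects
hc <- hclust(as.dist(d), method = "complete")

# Dendrogram with the observation numbers as leaf labels
plot(hc, xlab = "", sub = "", ylab = "fusion height")

print(hc$height)          # heights at which the successive fusions occur
print(cutree(hc, k = 2))  # cluster label of each observation for a 2-cluster cut
```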