In a data set made up of an aggregate of several clusters, the clusters are individually visible only if the separation between them is larger than their internal scatter. If the data set contains only a few clusters, then the leading principal axes found by PCA will tend to pick projections of the clusters with good separation, thereby providing an effective basis for feature extraction.
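The point made above can be illustrated numerically. The following is a minimal sketch, not a definitive implementation: it assumes two synthetic Gaussian clusters in five dimensions whose centres differ only along the first coordinate, and it computes the leading principal axis from the sample covariance matrix. The function name `leading_axis_separation`, the cluster centres, the sample sizes, and the pooled-scatter measure are all illustrative assumptions that do not come from the text.

```python
import numpy as np


def leading_axis_separation(seed=0, n=200, d=5, offset=6.0):
    """Project two synthetic clusters onto the leading principal axis.

    Returns the leading axis, the gap between the projected cluster
    means, and the pooled within-cluster scatter along that axis.
    All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Two clusters with unit internal scatter; centres differ only in coordinate 0.
    centre_a = np.zeros(d)
    centre_b = np.zeros(d)
    centre_b[0] = offset
    X = np.vstack([
        rng.normal(centre_a, 1.0, size=(n, d)),
        rng.normal(centre_b, 1.0, size=(n, d)),
    ])
    labels = np.repeat([0, 1], n)

    # PCA: eigendecomposition of the sample covariance of the centred data.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    w = eigvecs[:, -1]                      # leading principal axis

    # One-dimensional projection onto the leading axis.
    z = Xc @ w
    gap = abs(z[labels == 0].mean() - z[labels == 1].mean())
    scatter = np.sqrt(0.5 * (z[labels == 0].var() + z[labels == 1].var()))
    return w, gap, scatter


if __name__ == "__main__":
    w, gap, scatter = leading_axis_separation()
    print("leading axis:", np.round(w, 3))
    print("separation / internal scatter:", gap / scatter)
```

In this setup the between-cluster offset dominates the total variance, so the leading axis lines up closely with the direction joining the cluster centres and the ratio of separation to internal scatter along it is large. With many clusters, or with separation smaller than the scatter, the leading axes need not isolate the clusters in this way.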
In Section 4.19 of Chapter 4, we described structural risk minimization as a method of systematically realizing the best generalization performance by matching the capacity of a learning machine to the available size of the training sample. Given the principal-components analyzer as a preprocessor aimed at reducing the dimension of the input data space, discuss how such a processor can embed structure into the learning process by ranking a family of pattern classifiers.