The paper by Golub et al., which was the focus of the second part of the Bioconductor practical, was the first analysis of its kind. It demonstrated that gene expression analysis could potentially be used to classify leukaemia sub-types. Since its publication in 1999 there has been considerable interest in developing this approach for diagnostic use. A paper on this topic, published in Blood in 2002 and provided with this assignment, is the focus of the assignment.
In this paper, Mary Ross and her colleagues at St. Jude Children’s Research Hospital in Memphis performed gene expression analysis on a larger cohort of 155 leukaemia cases from five different sub-types. While the paper by Golub et al. showed that it was possible to broadly classify leukaemia with this kind of analysis, this later paper demonstrated that even closely related sub-types have distinct gene expression signatures.
The aim of this assignment is to reproduce the analysis of the data set described in the paper by Ross et al. You are provided with all of the raw Affymetrix data files used in the original analysis, together with the experiment description files that define which samples belong to the training and test data sets. You will be expected to classify the data using a machine learning approach, perform hierarchical clustering on the data, and identify a subset of genes whose expression defines the data classes.
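To make the three tasks above concrete, the following is a minimal sketch of their logic on a small made-up expression matrix. It is not the method used by Ross et al., and it does not read Affymetrix files. The expression values, the labels subtype_1 and subtype_2, the training/test split and all function names are hypothetical and chosen only for illustration. Gene selection uses a simple signal-to-noise score, classification uses a nearest-centroid rule, and clustering uses average-linkage agglomeration; other choices would be equally reasonable for the real analysis.

```python
import math
import statistics

# Hypothetical toy data: rows are samples, columns are genes (made-up values).
expression = [
    [2.0, 9.0, 5.0, 4.0],  # sample 0, subtype_1, training
    [2.5, 8.5, 4.0, 5.0],  # sample 1, subtype_1, training
    [1.5, 9.5, 5.5, 4.5],  # sample 2, subtype_1, test
    [2.5, 9.5, 4.5, 5.5],  # sample 3, subtype_1, test
    [8.0, 3.0, 5.0, 5.0],  # sample 4, subtype_2, training
    [8.5, 2.5, 4.5, 4.0],  # sample 5, subtype_2, training
    [7.5, 3.5, 4.0, 5.5],  # sample 6, subtype_2, test
    [8.5, 3.5, 5.5, 4.5],  # sample 7, subtype_2, test
]
labels = ["subtype_1"] * 4 + ["subtype_2"] * 4
train_idx = [0, 1, 4, 5]  # assumed split; the real one comes from the description files
test_idx = [2, 3, 6, 7]


def rank_genes(samples, sample_labels, top_n):
    """Return the indices of the top_n genes by a two-class signal-to-noise score."""
    first, second = sorted(set(sample_labels))
    scored = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, sample_labels) if lab == first]
        b = [s[g] for s, lab in zip(samples, sample_labels) if lab == second]
        # A real analysis would guard against zero variance here.
        noise = statistics.stdev(a) + statistics.stdev(b)
        scored.append((abs(statistics.mean(a) - statistics.mean(b)) / noise, g))
    best = sorted(scored, reverse=True)[:top_n]
    return sorted(g for _, g in best)


def fit_centroids(samples, sample_labels, genes):
    """Compute the mean expression profile of each class over the chosen genes."""
    groups = {}
    for s, lab in zip(samples, sample_labels):
        groups.setdefault(lab, []).append([s[g] for g in genes])
    return {lab: [statistics.mean(col) for col in zip(*rows)]
            for lab, rows in groups.items()}


def predict(centroids, sample, genes):
    """Assign a sample to the class whose centroid is nearest (Euclidean distance)."""
    x = [sample[g] for g in genes]
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))


def hierarchical_clusters(samples, genes, k):
    """Average-linkage agglomerative clustering, stopped when k clusters remain."""
    points = [[s[g] for g in genes] for s in samples]
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        return statistics.mean(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


train = [expression[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]

selected = rank_genes(train, train_labels, top_n=2)    # genes chosen on training data only
centroids = fit_centroids(train, train_labels, selected)
predictions = [predict(centroids, expression[i], selected) for i in test_idx]
clusters = hierarchical_clusters(expression, selected, k=2)

print("selected genes:", selected)
print("test predictions:", predictions)
print("clusters:", clusters)
```

Note that the genes are ranked using the training samples only, so the test samples play no part in choosing the features on which they are later classified. On this toy matrix the first two genes are selected, every test sample is assigned to its own sub-type, and the two clusters correspond to the two sub-types; real microarray data will not separate so cleanly.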