Journal of Chemical Engineering of Japan, Vol.41, No.9, 898-914, 2008
Classification and Diagnostic Output Prediction of Cancer Using Gene Expression Profiling and Supervised Machine Learning Algorithms
In this paper, a new supervised clustering and classification method is proposed. First, the application of discriminant partial least squares (DPLS) for the selection of a minimum number of key genes is applied on a gene expression microarray data set. Second, supervised hierarchical clustering based on the information of the cancer type is subsequently proposed to find key gene groups and to group the cancer samples into different subclasses. Here, the weights of the genes in the DPLS are proportional to their importance in the determination of the class labels, that is, the variable importance in the projection (VIP) information of the DPLS method. The power of the gene selection method and the proposed supervised hierarchical clustering method is illustrated on a three microarray data sets of leukemia, breast, and colon cancer. Supervised machine learning algorithms thus enable the subtype classification 3 data sets solely on the basis of molecular-level monitoring. Compared to unsupervised clustering, the supervised method performed better for discriminating between cancer types and cancer subtypes for the leukemia data set. The performance of the proposed method, using only a limited set of informative genes, is demonstrated to be comparable or better than results reported in the literature for the three data sets. Furthermore the method was successful in predicting the outcome of medical treatment (success or failure) based on the microarray data, which could make the method an important tool for clinical doctors.
Keywords:Bioinformatics;Data Mining;Gene Expression Profiling;Cancer Classification;Supervised Clustering