Page 40 - Biosystems Engineering
Microarray Data Analysis Using Machine Learning Methods
Most classification algorithms perform suboptimally when given
thousands of genes, so the genes most predictive of a phenotype must
be selected first. Appropriate gene selection helps achieve accurate
classification. There are
two objectives in gene selection: improving the prediction perfor-
mance of the models and providing a better understanding of the
underlying concepts that generated the data. Gene selection may
start by filtering out genes with little or no fold change. A
small subset of the remaining genes can then be selected using the
various techniques described in Sec. 1.4. Clustering methods can
also be used to identify groups of coregulated genes; cluster centers
of these groups can then be used as inputs to a classifier. Supervised
methods identify the most informative genes using approaches such
as (1) analysis of differential expression via tests such as a
two-sample t-test or analysis of variance, (2) selecting genes whose
signal-to-noise ratio exceeds a prespecified cutoff, and (3) choosing
genes that are correlated with an expected outcome (e.g., class
labels). Optimization methods can also be used, in which a subset of
genes is selected recursively (sequentially or via “evolutionary”
trial and error) and the best possible combination of genes is
selected based on its classification performance.
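The filtering and supervised scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes log2-scale expression values, two classes labeled 0 and 1, a Welch (unequal-variance) form of the two-sample t statistic, and one common definition of the signal-to-noise ratio, (mean_a - mean_b) / (sd_a + sd_b). The function names, the toy data, and the cutoff values are hypothetical.

```python
import math
from statistics import mean, stdev


def _ratio(num, den):
    """Divide, returning a signed infinity (or 0.0) when the denominator is 0."""
    if den == 0:
        return math.copysign(math.inf, num) if num != 0 else 0.0
    return num / den


def gene_scores(values, labels):
    """Score one gene: log2 fold change, Welch t statistic, signal-to-noise."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    diff = mean(a) - mean(b)  # log2 fold change, assuming log2-scale input
    std_err = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return {
        "lfc": diff,
        "t": _ratio(diff, std_err),
        "s2n": _ratio(diff, stdev(a) + stdev(b)),
    }


def select_genes(expr, labels, min_abs_lfc=1.0, s2n_cutoff=1.0):
    """Drop genes with a small fold change, then keep those whose absolute
    signal-to-noise ratio reaches the cutoff, ordered best first."""
    kept = []
    for gene, values in expr.items():
        s = gene_scores(values, labels)
        if abs(s["lfc"]) >= min_abs_lfc and abs(s["s2n"]) >= s2n_cutoff:
            kept.append((abs(s["s2n"]), gene))
    return [gene for _, gene in sorted(kept, reverse=True)]


# Hypothetical toy data: three samples per class, log2 expression values.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [8.0, 8.2, 7.9, 5.0, 5.1, 4.8],  # clearly differential
    "geneB": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9],  # no fold change
    "geneC": [7.0, 9.0, 5.0, 6.5, 8.5, 4.5],  # small change, high noise
}
selected = select_genes(expr, labels)
```

With these toy values only `geneA` survives both steps: `geneB` is removed by the fold-change filter, and `geneC` has a signal-to-noise ratio well below the cutoff. In practice the cutoffs would be chosen with the study design in mind, and t statistics computed over thousands of genes would need a multiple-testing correction.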
Molecular classification based on machine learning algorithms
has been shown to have statistical and clinical relevance for a variety
of tumor types: leukemia (Golub et al. 1999), lymphoma (Shipp et al.
2002), brain cancer (Pomeroy et al. 2002), lung cancer (Bhattacharjee
et al. 2001), and the classification of multiple primary tumors (Ramas-
wamy et al. 2001). The performance of machine learning methods in
classifying microarray data can be enhanced if the most informative
genes are used. For example, Guyon et al. (2002) applied an SVM-based
gene selection method that uses recursive feature elimination.
They demonstrated experimentally that the selected genes yielded
improved classification performance.
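To make the idea of recursive feature elimination concrete, the sketch below repeatedly trains a linear classifier and discards the gene with the smallest squared weight. It is an illustration under stated assumptions rather than the procedure of Guyon et al.: the linear SVM is approximated here by full-batch subgradient descent on an L2-regularized hinge loss (a swapped-in trainer chosen to keep the example dependency-free), one gene is removed per iteration, and the function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Approximate a linear SVM by full-batch subgradient descent on the
    L2-regularized hinge loss. X: list of samples; y: labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin: hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep):
    """Return the indices of the n_keep features that survive elimination."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        # Ranking criterion: the feature with the smallest squared weight
        # contributes least to the decision function and is removed.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        active.pop(worst)
    return active


# Hypothetical toy data: features 0 and 2 separate the classes,
# features 1 and 3 are noise.
X = [
    [2.0, 0.1, 1.5, -0.2],
    [1.8, -0.3, 1.7, 0.1],
    [2.2, 0.2, 1.4, 0.3],
    [1.9, -0.1, 1.6, -0.1],
    [-2.0, 0.2, -1.5, 0.2],
    [-1.8, -0.2, -1.7, -0.3],
    [-2.2, 0.1, -1.4, 0.1],
    [-1.9, -0.1, -1.6, -0.2],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]
kept = svm_rfe(X, y, n_keep=2)
```

On this toy data the two noise features are eliminated and the informative ones remain. With thousands of genes, removing one feature per iteration becomes expensive, so removing a block of low-ranked genes per iteration is a common shortcut.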
1.5.3 Genetic Network Modeling
With the help of global expression data, especially time series
microarray data, one can attempt to reverse engineer a network of
gene interactions. Characterizing gene interactions has many benefits;
for example, the effects of drugs on a regulatory pathway can be
characterized and tumor development in cells can be tracked. Several
methods have been proposed to map gene interactions, including
linear equations (D’haeseleer et al. 1999; Weaver et al.
1999), differential equations (Chen et al. 1999), Boolean networks
(Liang et al. 1998; Shmulevich et al. 2002), fuzzy logic–based methods
(Woolf and Wang 2000; Ressom et al. 2003a), correlation-based
approaches (Herrero et al. 2003; Schmitt et al. 2004), and Bayesian
networks (Friedman et al. 2000).
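Of the approaches listed above, the correlation-based one is the simplest to illustrate. The sketch below links two genes when the Pearson correlation of their time series is strong in absolute value. It is only a minimal example and does not reproduce any of the cited methods: the gene names, expression values, and the 0.9 threshold are hypothetical, and an edge found this way indicates co-variation, not the direction or mechanism of regulation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_network(series, threshold=0.9):
    """Return undirected edges (gene1, gene2, sign) for every gene pair
    whose absolute correlation meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(series), 2):
        r = pearson(series[g1], series[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, "+" if r > 0 else "-"))
    return edges


# Hypothetical expression levels at five time points.
series = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises together with g1
    "g3": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as g1 rises
    "g4": [3.0, 1.0, 4.0, 1.0, 3.0],  # unrelated to the others
}
edges = correlation_network(series)
```

Here `g1`, `g2`, and `g3` end up mutually connected (with negative signs on the links to `g3`), while `g4` stays isolated. Time-lagged correlation, or one of the model-based approaches cited above, would be needed to suggest which gene acts on which.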