the samples resulted in similar patterns or responses across all genes.
The latter identifies genes with similar expression profiles across vari-
ous experimental conditions. Genes with similar expression patterns
might be transcriptionally regulated through the same signal transduction
pathway or share a common function or regulatory elements.
Depending on how they group the data, clustering algorithms can be
divided into hierarchical and nonhierarchical methods.
Hierarchical clustering organizes the input patterns in a hierarchical
tree structure, which allows the detection of higher-order relationships
between clusters of patterns. Nonhierarchical clustering begins from
a predefined number of clusters and iteratively reallocates cluster
members to minimize the overall within-cluster dispersion.
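To make the distinction concrete, the following Python sketch applies
both approaches to a small synthetic gene-by-condition matrix (the toy
data, linkage settings, and the choice of three clusters are illustrative
assumptions, not a prescription):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy expression matrix: 60 genes x 8 conditions, three latent groups.
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(20, 8))
               for m in (-1.0, 0.0, 1.0)])

# Hierarchical: build a tree over the genes, then cut it into 3 clusters.
Z = linkage(X, method="average")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Nonhierarchical: k-means starts from a predefined k = 3 and iteratively
# reallocates genes to minimize the within-cluster dispersion.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

The tree built by linkage can also be cut at coarser or finer levels,
which is what lets hierarchical methods expose higher-order relationships
among clusters.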
Clustering algorithms can also be divided into two categories, hard
and fuzzy. Whereas a hard-clustering algorithm assigns each data point
to exactly one cluster, a fuzzy algorithm assigns each object a graded
degree of membership in every cluster.
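As a concrete sketch of the fuzzy idea, the following minimal fuzzy
c-means implementation returns such graded memberships (the fuzzifier
m = 2 and the iteration count are assumed defaults; in practice a
dedicated package would be used):

import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns (centers, membership matrix U).
    U[i, j] is the degree to which gene i belongs to cluster j."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(c), size=n)         # rows sum to 1
    for _ in range(n_iter):
        W = U ** m                                # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distance from every gene to every cluster center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)                # avoid division by zero
        # Standard fuzzy c-means membership update.
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

Each row of U sums to one; a hard assignment can always be recovered
as U.argmax(axis=1), which is precisely what distinguishes the fuzzy
result from that of a hard algorithm.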
A wide variety of clustering algorithms (hierarchical/nonhierarchical
as well as hard/fuzzy) has been applied to group similar gene expres-
sion patterns together. In particular, hierarchical clustering (Eisen et al.
1998), self-organizing maps (SOM) (Tamayo et al. 1999), and k-means
(Somogyi 1999) are widely used by the bioinformatics research com-
munity for clustering microarray data. Other clustering methods
such as fuzzy c-means (Gasch and Eisen 2002) and adaptive resonance
theory (Tomida et al. 2002) have also provided useful results.
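As one example from this family, a small self-organizing map can be
trained with the third-party MiniSom package (the map dimensions,
training length, and toy data below are illustrative assumptions):

import numpy as np
from minisom import MiniSom  # third-party package: pip install minisom

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # toy genes x conditions matrix

# A 3x3 map yields at most 9 prototype expression profiles.
som = MiniSom(3, 3, input_len=X.shape[1], sigma=1.0,
              learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=1000)

# Each gene is assigned to its best-matching map node.
labels = [som.winner(x) for x in X]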
However, detecting the number of clusters or selecting a “good” clustering
algorithm remains a challenge. To address it, several statistical
methods have been proposed in the literature. For example,
Yeung et al. (2001b) suggested a metric known as figure of merit
(FOM) that they calculated by clustering the dataset while leaving
out one experiment at a time. They used the FOM to evaluate the
performance of different clustering results as the number of clusters varied.
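A minimal sketch of this leave-one-out idea, written here around
k-means for concreteness (Yeung et al. applied the FOM to several
algorithms; the helper name and data layout are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def figure_of_merit(X, k):
    """Leave-one-condition-out FOM (in the spirit of Yeung et al. 2001b):
    cluster on all conditions but one, then score how tightly the
    left-out condition agrees with the resulting clusters."""
    n, p = X.shape
    total = 0.0
    for e in range(p):
        train = np.delete(X, e, axis=1)           # leave condition e out
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(train)
        sse = 0.0
        for c in range(k):
            vals = X[labels == c, e]
            if len(vals):
                sse += ((vals - vals.mean()) ** 2).sum()
        total += np.sqrt(sse / n)
    return total  # lower is better; compare across candidate k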
Tibshirani et al. (2000) proposed estimating the number of clusters in
a dataset via the gap statistic, which compares the observed within-cluster
dispersion to that expected under a reference null distribution.
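In a minimal form (omitting the standard-error rule of the full
procedure, and assuming a uniform reference distribution over the
data's range), the gap for a candidate k can be sketched as:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_ref=10, seed=0):
    """Gap statistic (after Tibshirani et al. 2000), minimal form:
    mean log within-cluster dispersion of uniform reference data
    minus that of the observed data. A larger gap favors k."""
    rng = np.random.default_rng(seed)

    def log_wk(data):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_)   # total within-cluster dispersion

    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_wk(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_ref)]
    return np.mean(ref) - log_wk(X)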
Kerr and Churchill (2001) introduced a randomization (bootstrapping)
technique for making statistical inferences from clustering tools, and
applied it to assess the stability of results from a cluster analysis
of gene expression microarray data.
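Kerr and Churchill bootstrap residuals from an ANOVA model of the
microarray experiment; as a simplified, generic stand-in, the sketch
below resamples conditions and records how often each pair of genes
co-clusters (the resampling scheme and use of k-means are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def cluster_stability(X, k, n_boot=50, seed=0):
    """Fraction of bootstrap replicates in which each pair of genes
    lands in the same cluster (a generic stand-in for the
    Kerr-Churchill residual-bootstrap procedure)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    together = np.zeros((n, n))
    for _ in range(n_boot):
        cols = rng.integers(0, p, size=p)         # resample conditions
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=0).fit_predict(X[:, cols])
        together += labels[:, None] == labels[None, :]
    return together / n_boot   # values near 1 indicate stable pairs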
Dudoit and Fridlyand (2002) developed a prediction-based resampling
method to estimate the number of clusters in a dataset. Fraley and
Raftery (2002) provided
functionality for visualizing cluster results in their model-based clus-
tering technique. They characterized and compared various probability
models in their clustering algorithm through the Bayesian information
criterion (BIC), which Yeung et al. (2001a) applied to the clustering
of gene expression data. They calculated BIC scores for each model
over a given range of cluster counts; the model with the largest BIC
score was selected, and its number of clusters was taken as the estimate.
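In the same spirit, a Gaussian mixture scored by BIC gives a rough
analogue of this model-based selection (note that scikit-learn's bic()
follows the opposite sign convention from mclust's, so it is minimized
rather than maximized; the range of candidate cluster counts is an
assumption):

import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_range=range(2, 10), seed=0):
    """Fit Gaussian mixtures over a range of cluster counts and keep
    the model with the best BIC (smaller is better in scikit-learn)."""
    best_k, best_bic = None, np.inf
    for k in k_range:
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             random_state=seed).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k, best_bic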