Page 38 - Biosystems Engineering

Microarray Data Analysis Using Machine Learning Methods

               the samples resulted in similar patterns or responses across all genes.
               The latter identifies genes with similar expression profiles across vari-
               ous experimental conditions. Genes with similar expression patterns
               might be transcriptionally regulated through the same signal trans-
               duction pathway or share a common function or regulatory elements.
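One common way to quantify "similar expression profiles across conditions" is the Pearson correlation between gene expression vectors. The following minimal sketch (with invented values, not data from any cited study) computes a gene-by-gene correlation matrix with NumPy:

```python
import numpy as np

# Hypothetical expression matrix: rows are genes, columns are
# experimental conditions (values invented for illustration).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gene A: rising profile
    [2.1, 4.0, 6.2, 7.9],   # gene B: rising profile (possibly co-regulated)
    [4.0, 3.0, 2.0, 1.0],   # gene C: falling profile
])

# Pearson correlation between every pair of gene profiles;
# values near +1 suggest similar expression patterns.
corr = np.corrcoef(expr)

print(np.round(corr, 2))
```

Here genes A and B correlate strongly (similar pattern despite different magnitudes), while gene C is anticorrelated with both.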
                   Depending on how they group the data, clustering algorithms can
               be classified as hierarchical or nonhierarchical.
               Hierarchical clustering organizes the input patterns in a hierarchical
               tree structure, which allows detecting higher-order relationships
               between clusters of patterns. Nonhierarchical clustering begins from
               a predefined number of clusters and iteratively reallocates cluster
               members to minimize the overall within-cluster dispersion. We can
               also divide clustering algorithms into two categories, hard and fuzzy.
               Whereas a hard-clustering algorithm assigns each data point to
               exactly one cluster, fuzzy clustering assigns each object a degree of
               membership in every cluster.
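The nonhierarchical, hard-assignment scheme described above can be sketched as a toy k-means in NumPy (an illustrative reimplementation on synthetic data, not code from any of the cited tools): each point receives a hard assignment to its nearest centroid, and centroids are recomputed so that the overall within-cluster dispersion shrinks.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Toy k-means: iteratively reallocate points to the nearest
    centroid, reducing the total within-cluster dispersion."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Hard assignment: each point belongs to exactly one cluster.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# Two well-separated blobs of synthetic "expression profiles".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 3)),
               rng.normal(5, 0.1, (20, 3))])
labels, _ = kmeans(X, k=2)
```

A fuzzy variant would instead return, for each point, a membership weight for every centroid rather than a single label.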
                   A wide variety of clustering algorithms (hierarchical/nonhierarchical
               as well as hard/fuzzy) has been applied to group similar gene expres-
               sion patterns together. In particular, hierarchical clustering (Eisen et al.
               1998), self-organizing maps (SOM)  (Tamayo et al. 1999), and k-means
               (Somogyi 1999) are widely used by the bioinformatics research com-
               munity for clustering microarray data. Other clustering methods
               such as fuzzy c-means (Gasch and Eisen 2002) and adaptive reso-
               nance theory (Tomida et al. 2002) have provided useful results. How-
               ever, detecting the number of clusters or selecting a “good” clustering
               algorithm remains a challenge. To address this challenge, several sta-
               tistical methods have been proposed in the literature. For example,
               Yeung et al. (2001b) suggested a metric known as the figure of merit
               (FOM), calculated by clustering the dataset while leaving out one
               experiment at a time. They used the FOM to evaluate the performance
               of different clustering results as the number of clusters varied.
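A leave-one-out FOM in this spirit can be sketched as follows (a simplified reimplementation on synthetic data, not the authors' code; the clustering step here uses SciPy's `kmeans2`, which is an assumption of convenience): the genes are clustered with condition e held out, and the FOM accumulates the root-mean-square within-cluster deviation of the held-out condition. Lower values indicate clusterings that generalize better.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def figure_of_merit(X, k, seed=0):
    """Simplified leave-one-condition-out figure of merit.

    For each condition e: cluster the genes on the remaining
    conditions, then measure how tightly each cluster agrees on
    the left-out condition e. Lower is better.
    """
    n_genes, n_cond = X.shape
    total = 0.0
    for e in range(n_cond):
        rest = np.delete(X, e, axis=1)
        np.random.seed(seed)          # kmeans2 draws from the global RNG
        _, labels = kmeans2(rest, k, minit="++")
        sq = 0.0
        for j in np.unique(labels):
            col = X[labels == j, e]
            sq += ((col - col.mean()) ** 2).sum()
        total += np.sqrt(sq / n_genes)
    return total

# Two well-separated groups of synthetic "gene" profiles.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (15, 5)),
               rng.normal(4, 0.1, (15, 5))])

# The correct k=2 should score a much lower (better) FOM than k=1.
print(figure_of_merit(X, k=2), figure_of_merit(X, k=1))
```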
               Tibshirani et al. (2000) proposed estimating the number of clusters in
               a dataset via the gap statistic. The gap statistic estimates the number
               of clusters by comparing within-cluster dispersion to that of a reference
               null distribution. Kerr and Churchill (2001) introduced a random-
               ization approach (bootstrapping) for making statistical inferences
               from clustering results. They
               applied this technique to assess the stability of results from a cluster
               analysis of gene expression microarray data. Dudoit and Fridlyand
               (2002) developed a prediction-based resampling method to estimate
               the number of clusters in a dataset. Fraley and Raftery (2002) provided
               functionality for visualizing cluster results in their model-based clus-
               tering technique. They characterized and compared various probabil-
               ity models in their clustering algorithm through a Bayesian informa-
               tion criterion (BIC) introduced by Yeung et al. (2001a). They calculated
               BIC scores for each model over a range of numbers of clusters. The
               model with the largest BIC score was selected, and the number of clusters