Page 39 - Biosystems Engineering
with the highest BIC value for the chosen model was considered to be
the best number of clusters. This model-based approach assumes that
a finite mixture of underlying probability distributions such as multi-
variate normal distributions generates the data. With the underlying
probability model, the problems of determining the number of clusters
and of choosing an appropriate clustering method become statistical
model choice problems. These and many other heuristic methods
have provided useful results in gene expression clustering.
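The model-choice idea can be illustrated with a minimal sketch. It assumes one-dimensional data, a Gaussian mixture fitted by a plain EM loop, and the sign convention BIC = 2 log L - p log n, under which the highest value is preferred (some software reports the negated quantity, where the lowest value is preferred). The function names, the quantile-based initialization, and the variance floor are illustrative assumptions and are not taken from the methods cited above.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(xs, k, iters=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    # Assumed deterministic start: means at evenly spaced sample quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    var_floor = 0.01 * grand_var  # assumed guard against collapsing components

    def densities(x):
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            d = densities(x)
            total = sum(d)
            resp.append([dj / total for dj in d])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)
    return sum(math.log(sum(densities(x))) for x in xs)


def bic(xs, k):
    """BIC = 2 log L - p log n, with p = 3k - 1 free parameters in 1-D."""
    n_params = 3 * k - 1  # (k - 1) weights + k means + k variances
    return 2.0 * fit_mixture(xs, k) - n_params * math.log(len(xs))


def best_number_of_clusters(xs, max_k=4):
    """Score k = 1..max_k and return the k with the highest BIC."""
    scores = {k: bic(xs, k) for k in range(1, max_k + 1)}
    return max(scores, key=scores.get), scores


# Toy data: two well-separated groups built from normal quantiles.
z = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)]
data = z + [10.0 + v for v in z]
best_k, scores = best_number_of_clusters(data)
print(best_k, scores)
```

Each additional component costs three parameters here, so a larger model is selected only when it raises the log-likelihood by more than the corresponding penalty.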
Su and Chang (2001) developed a technique known as double
self-organizing maps (DSOM). Unlike in SOM, nodes in DSOM are represented
not only by their weight vectors but also by two-dimensional
position vectors. Weight vectors serve the same purpose as in SOM.
Position vectors are projections of the weight vectors into a two-dimensional
space and serve as a visualization tool for deciding how
many clusters are needed, thus combining clustering and cluster
visualization in one computational procedure. In other words, with
the help of position vectors, DSOM adjusts its network structure dur-
ing the learning phase so that neurons that respond to similar stimuli
will not only have similar weight vectors but also move spatially
nearer to each other. Ressom et al. (2003b) developed an adaptive
double self-organizing map (ADSOM), which updates not only the
weight vectors and position vectors but also all the free parameters
involved in DSOM during the training process. After training, the
final locations of the position vectors are used to determine the number of
clusters by visually counting the groups they form.
In fuzzy c-means, a cluster validity index is commonly used to estimate
the best number of clusters. Several indices have been proposed for
this purpose. One of these is the partition
coefficient introduced by Bezdek (1981), which takes values between 1/c and 1 for c clusters. A
partition coefficient of 1 indicates no membership sharing between
clusters, whereas a low value indicates overlap between clusters. Thus,
a high partition coefficient is desired. Although this is a useful indicator,
it is computed solely from the membership values and ignores
the structure of the data. Other indices that are based on both the membership
values and the data structure include the Xie-Beni
index (Xie and Beni 1991), the Fukuyama-Sugeno index (Pal and
Bezdek 1995), the Gath-Geva index (Gath and Geva 1989), and
the Rezaee-Lelieveldt-Reiber index (Rezaee et al. 1998).
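As a minimal sketch of how two of these indices can be computed, the following assumes a membership matrix with one row per sample and one column per cluster, with each row summing to 1. The partition coefficient is the mean of the squared memberships. The Xie-Beni index is written here in its commonly stated form with fuzzifier m = 2, as the membership-weighted within-cluster scatter divided by n times the smallest squared distance between cluster centers, so lower values are preferred. The function names and toy data are illustrative assumptions.

```python
import math


def partition_coefficient(memberships):
    """Mean of squared memberships; 1 for a crisp partition, 1/c if fully shared."""
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def xie_beni(points, centers, memberships, m=2.0):
    """Compactness over separation; uses the data as well as the memberships."""
    n = len(points)
    c = len(centers)
    compactness = sum(
        (memberships[i][k] ** m) * math.dist(points[i], centers[k]) ** 2
        for i in range(n)
        for k in range(c)
    )
    separation = min(
        math.dist(centers[j], centers[k]) ** 2
        for j in range(c)
        for k in range(j + 1, c)
    )
    return compactness / (n * separation)


# Toy example: four 1-D points forming two obvious groups.
points = [[0.0], [1.0], [10.0], [11.0]]
centers = [[0.5], [10.5]]
crisp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
shared = [[0.5, 0.5] for _ in points]

print(partition_coefficient(crisp))   # 1.0
print(partition_coefficient(shared))  # 0.5
print(xie_beni(points, centers, crisp), xie_beni(points, centers, shared))
```

The partition coefficient never looks at `points` or `centers`, which is the limitation noted above. The Xie-Beni value changes if the same memberships are paired with more scattered data or with closer centers.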
1.5.2 Classification
Classification involves the automated grouping of objects (e.g., sam-
ples or genes) into prespecified categories. Various statistical methods
have been used to classify microarray data. These include discriminant
analysis (linear, quadratic, and logistic), classification and regression
trees (CART), generalized additive models, the compound covariate
predictor, weighted voting, the k-nearest neighbor rule, and the nearest
centroid classifier.
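To make one of the listed methods concrete, the following is a minimal sketch of a plain nearest centroid classifier: each prespecified category is summarized by the mean expression profile of its training samples, and a new sample is assigned to the category with the closest centroid in Euclidean distance. The function names, labels, and toy profiles are hypothetical, and no centroid shrinkage or gene selection is applied.

```python
import math


def fit_centroids(samples, labels):
    """Return the mean profile (centroid) of each category."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the category whose centroid is nearest."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))


# Hypothetical training set: two genes measured on four samples.
train = [[1.0, 5.0], [1.2, 4.8], [4.0, 1.0], [4.2, 0.8]]
train_labels = ["normal", "normal", "tumor", "tumor"]

centroids = fit_centroids(train, train_labels)
print(predict(centroids, [1.5, 4.0]))
print(predict(centroids, [3.9, 1.5]))
```

Unlike the clustering methods discussed earlier, the categories are fixed in advance by the training labels, so no estimate of the number of groups is needed.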