Page 39 - Biosystems Engineering

20    Chapter One

               with the highest BIC value for the chosen model was considered to be
               the best number of clusters. This model-based approach assumes that
               a finite mixture of underlying probability distributions such as multi-
               variate normal distributions generates the data. With the underlying
               probability model, the problems of determining the number of clusters
               and of choosing an appropriate clustering method become statistical
               model choice problems. These and many other heuristic methods
               have provided useful results in gene expression clustering.
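
The model-choice idea above can be sketched in a few lines. The example below is a minimal illustration, not the procedure of any cited study: it fits one-dimensional Gaussian mixtures by EM for several candidate numbers of clusters and scores each with BIC = 2 log L - p ln n, a convention under which the highest value is preferred, as in the text. The function names, the quantile-based initialization, the variance floor, and the toy data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Density of the mixture at x (a tiny floor avoids log(0))."""
    total = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return max(total, 1e-300)


def fit_mixture_loglik(data, k, n_iter=200, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM; return the final log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start (an assumption): means at evenly spaced quantiles, shared variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [start_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-300)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)
    return sum(math.log(mixture_density(x, weights, means, variances)) for x in xs)


def bic_by_cluster_count(data, max_k=4):
    """BIC = 2*logL - p*ln(n); a HIGHER value is better under this convention."""
    n = len(data)
    scores = {}
    for k in range(1, max_k + 1):
        n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
        scores[k] = 2.0 * fit_mixture_loglik(data, k) - n_params * math.log(n)
    return scores


# Hypothetical toy data: two well-separated groups of expression-like values.
rng = random.Random(0)
toy = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]
bic = bic_by_cluster_count(toy)
best_k = max(bic, key=bic.get)
print(bic, best_k)
```

Some software reports BIC with the opposite sign, so that the lowest value is preferred; the convention should be checked before comparing scores across tools.
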
                   Su and Chang (2001) developed a technique known as double
               self-organizing maps (DSOM). Unlike SOM, DSOM nodes are repre-
               sented not only by their weight vectors but also by two-dimensional
               position vectors. Weight vectors serve the same purpose as in SOM.
               Position vectors are projections of the weight vectors into a two-
               dimensional space and serve as a visualization tool for deciding how
               many clusters are needed, thus combining clustering and cluster
               visualization in one computational procedure. In other words, with
               the help of position vectors, DSOM adjusts its network structure dur-
               ing the learning phase so that neurons that respond to similar stimuli
               will not only have similar weight vectors but also move spatially
               nearer to each other. Ressom et al. (2003b) developed an adaptive
               double self-organizing map (ADSOM), which updates not only the
               weight vectors and position vectors but also all the free parameters
               involved in DSOM during the training process. After training, the
               final locations of the position vectors are used to detect the number of
               clusters by visually counting the clusters they form.
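
To make the two kinds of vectors concrete, the following sketch shows a single training step in which each node carries both a weight vector and a two-dimensional position vector. It is a deliberately simplified illustration and does not reproduce the update rules of Su and Chang (2001) or of ADSOM: the Gaussian neighborhood functions, the learning rates `lr_w` and `lr_p`, the widths `sigma_p` and `sigma_w`, and the function name `dsom_step` are all assumptions.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dsom_step(weights, positions, x, lr_w=0.5, lr_p=0.1, sigma_p=1.0, sigma_w=1.0):
    """One simplified update; returns (new_weights, new_positions, winner_index)."""
    # Winner: the node whose weight vector is closest to the input x.
    winner = min(range(len(weights)), key=lambda j: sq_dist(weights[j], x))
    new_weights, new_positions = [], []
    for w, p in zip(weights, positions):
        # Weight update: strength decays with distance to the winner in position space.
        h = math.exp(-sq_dist(p, positions[winner]) / (2.0 * sigma_p ** 2))
        new_weights.append([wi + lr_w * h * (xi - wi) for wi, xi in zip(w, x)])
        # Position update (assumed form): nodes whose weights respond strongly
        # to x drift toward the winner's position.
        g = math.exp(-sq_dist(w, x) / (2.0 * sigma_w ** 2))
        new_positions.append(
            [pi + lr_p * g * (qi - pi) for pi, qi in zip(p, positions[winner])]
        )
    return new_weights, new_positions, winner


# Hypothetical three-node network and a single input.
weights = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
positions = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_w, new_p, win = dsom_step(weights, positions, [0.0, 0.1])
print(win, new_p)
```

Repeating such steps over many inputs pulls the position vectors of similarly responding nodes together, which is what allows the clusters to be counted visually after training.
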
                   In fuzzy c-means, a cluster validity index is commonly used to esti-
               mate the best number of clusters. Several indices have been proposed to
               help detect the number of clusters. One of these is the partition
               coefficient introduced by Bezdek (1981), which ranges from 1/c (for c
               clusters) to 1. A partition coefficient of 1 indicates no membership
               sharing between clusters, whereas a value near 1/c indicates heavy
               overlap between clusters. Thus,
               a high partition coefficient is desired. Although this is a good indicator,
               it is computed solely from membership values and does not take into
               account the structure of the data. Other indices that are based on both
               membership values and the data structure include the Xie-Beni
               index (Xie and Beni 1991), the Fukuyama-Sugeno index (Pal and
               Bezdek 1995), the Gath-Geva index (Gath and Geva 1989), and
               the Rezaee-Lelieveldt-Reiber index (Rezaee et al. 1998).
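
The difference between an index that uses memberships only and one that also uses the data can be illustrated with a short sketch. The partition coefficient below is the mean of the squared memberships. The Xie-Beni index is written in its common form: total membership-weighted squared distance to the cluster centers, divided by the number of samples times the smallest squared distance between two centers, so that smaller values suggest compact, well-separated clusters. The membership layout `u[i][k]` (cluster i, sample k), the fuzzifier default `m=2.0`, and the toy data are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partition_coefficient(u):
    """Mean squared membership; u[i][k] is the membership of sample k in cluster i."""
    n_samples = len(u[0])
    return sum(m * m for row in u for m in row) / n_samples


def xie_beni_index(u, data, centers, m=2.0):
    """Compactness divided by (n * minimum squared separation of the centers)."""
    n_samples = len(data)
    n_clusters = len(centers)
    compactness = sum(
        (u[i][k] ** m) * sq_dist(data[k], centers[i])
        for i in range(n_clusters)
        for k in range(n_samples)
    )
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(n_clusters)
        for j in range(i + 1, n_clusters)
    )
    return compactness / (n_samples * separation)


# Hypothetical toy example: four 1-D samples, two clusters.
data = [[0.0], [1.0], [10.0], [11.0]]
centers = [[0.5], [10.5]]
crisp = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]   # no membership sharing
shared = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]  # maximal sharing

print(partition_coefficient(crisp), partition_coefficient(shared))
print(xie_beni_index(crisp, data, centers))
```

In practice, the memberships and centers would come from running fuzzy c-means for each candidate number of clusters, and the index would then be compared across those runs.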

               1.5.2 Classification
               Classification involves the automated grouping of objects (e.g., sam-
               ples or genes) into prespecified categories. Various statistical methods
               have been used to classify microarray data. These include discrimi-
               nant analysis (linear, quadratic, and logistic), classification and regres-
               sion trees (CART), generalized additive models, compound covariate
               predictor, weighted voting, k-nearest neighbor rule, and nearest cen-
               troid classifier.
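
As one concrete instance of assigning objects to prespecified categories, the sketch below implements a plain nearest centroid classifier: each class is summarized by the mean of its training profiles, and a new sample receives the label of the closest centroid. This is a minimal illustration and omits refinements such as centroid shrinkage or gene selection; the function names, class labels, and expression values are hypothetical.

```python
def fit_centroids(samples, labels):
    """Return {label: mean vector} computed from the labeled training samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Label of the centroid nearest to x in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


# Hypothetical two-gene expression profiles for two prespecified classes.
train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
train_labels = ["normal", "normal", "tumor", "tumor"]
centroids = fit_centroids(train, train_labels)
print(predict(centroids, [1.1, 1.0]), predict(centroids, [5.1, 4.9]))
```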