
70     3 Data Clustering





                               3.5  K-Means Clustering

The k-means (or ISODATA) clustering algorithm is the most popular example of an algorithm that performs iterative adjustment of c (k in the original algorithm version) cluster centroids. Its application is distinct from that of the tree clustering methods, since it is intended to yield a clustering solution for an arbitrarily large number of patterns and for a previously defined number of clusters. The number of clusters can be chosen by performing tree clustering either on a smaller set of data or in a reduced-dimension space obtained, for instance, by multidimensional scaling.
The central idea of the algorithm is to minimize an overall within-cluster distance from the patterns to the centroids. Usually, except in the case of a quite small number of patterns, an exhaustive search over all partitions of n patterns into c clusters is prohibitive. Instead, a local minimum of the overall within-cluster distance is searched for by iteratively adjusting c cluster centroids, m_j, and by
assigning each pattern to the closest centroid:

    E = \sum_{j=1}^{c} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2

where C_j denotes the set of patterns currently assigned to cluster j.
Notice that E can also be viewed as the error (sum of squared deviations) of the cluster partition.
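As a minimal sketch (not from the text), the error E of a given partition can be computed directly from its definition; the patterns, labels and centroids below are illustrative assumptions:

```python
import numpy as np

# Hypothetical 2-D patterns and a fixed partition into c = 2 clusters.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])                    # cluster index of each pattern
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])   # the two cluster means

# E = sum over clusters of squared distances from patterns to their centroid.
E = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(2))
print(E)  # sum of squared deviations of the partition
```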
                                 The algorithm, in a simple formulation, is as follows:

1. Denote: c, the number of clusters; maxiter, the maximum number of iterations allowed; Δ, a minimum threshold for the allowed error deviation in consecutive iterations. Assume c initial centroids m_j^(1) for the iteration k = 1.
2. Assign each x_i in the dataset to the cluster represented by the nearest m_j^(k).
3. Compute for the previous partition the new centroids m_j^(k+1) and the error E^(k+1).
4. Repeat steps 2 and 3 until k = maxiter or |E^(k+1) - E^(k)| < Δ.
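The four steps above can be sketched in Python with NumPy; the function and parameter names, and the random choice of initial centroids, are illustrative assumptions rather than the text's own formulation:

```python
import numpy as np

def k_means(X, c, maxiter=100, delta=1e-6, rng=None):
    """Iteratively adjust c centroids, assigning each pattern to the nearest one."""
    rng = np.random.default_rng(rng)
    # Step 1: take c distinct patterns at random as the initial centroids m_j^(1).
    m = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    E_prev = np.inf
    for _ in range(maxiter):
        # Step 2: assign each x_i to the cluster of the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as its cluster mean, then the error E.
        for j in range(c):
            if np.any(labels == j):
                m[j] = X[labels == j].mean(axis=0)
        E = sum(np.sum((X[labels == j] - m[j]) ** 2) for j in range(c))
        # Step 4: stop when the error change in consecutive iterations is below delta.
        if abs(E_prev - E) < delta:
            break
        E_prev = E
    return labels, m, E
```

With two well-separated groups of patterns, a loop of this kind typically converges within a handful of iterations, consistent with the small iteration counts mentioned below.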
There are several variants of this algorithm depending on the choice of the initial cluster centroids, the pattern assignment rule, the centroid computation (often computed as the cluster mean) and the stopping criterion. Usually the algorithm stops after a small number of iterations (<10). As for the initial choice of centroids, which may have a strong influence on the final solution, it can be done in several ways. STATISTICA and SPSS provide the following alternatives:

1. Specify which patterns are used as initial centroids. Tree clustering on a reduced number of patterns may be performed for this purpose.
2. Choose the first c patterns as initial centroids.