
3.5 K-Means Clustering


                       3. Sort the distances between all patterns and choose patterns at constant intervals
                         of  these distances as initial centroids.
                       4. Choose patterns that maximize between-cluster distance, as follows: start by
                         selecting the first c patterns as centroids; a subsequent pattern replaces the
                         centroid closest to it if its distance from that centroid is greater than the
                         distance between the two closest centroids, or greater than the smallest
                         distance between that centroid and any of the others.
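
Rule 4 can be sketched in a few lines of Python. This is only one literal reading of the rule as worded above, using Euclidean distance; the function name `init_centroids_max_separation` and its arguments are illustrative and do not come from the text or from any particular software package.

```python
from math import dist, inf

def init_centroids_max_separation(patterns, c):
    """Pick c initial centroids following a literal reading of rule 4.

    Assumption: distances are Euclidean and patterns are visited in
    the order given. Names here are hypothetical, for illustration.
    """
    # Start with the first c patterns as centroids.
    centroids = [tuple(p) for p in patterns[:c]]
    for x in patterns[c:]:
        x = tuple(x)
        # Distance from the candidate pattern to every current centroid.
        d_x = [dist(x, m) for m in centroids]
        k = min(range(c), key=d_x.__getitem__)  # index of the closest centroid
        # Distance between the two closest centroids.
        closest_pair = min(
            (dist(centroids[a], centroids[b])
             for a in range(c) for b in range(a + 1, c)),
            default=inf,
        )
        # Smallest distance between centroid k and any other centroid.
        nearest_other = min(
            (dist(centroids[k], centroids[b]) for b in range(c) if b != k),
            default=inf,
        )
        # Under this reading the second test can only hold when the first
        # does; both are kept to mirror the wording of the rule.
        if d_x[k] > closest_pair or d_x[k] > nearest_other:
            centroids[k] = x
    return centroids
```

For instance, with the patterns (0, 0), (1, 0), (10, 0), (0.5, 0) and c = 2, the pattern (10, 0) replaces the centroid (1, 0), while (0.5, 0) is too close to (0, 0) to replace it, so the initial centroids end up well separated.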

                         Clustering  solutions  are  usually  evaluated  using  the  overall  within-cluster
                       distance  of  formula  (3-5) as  well  as using  the  within-cluster  distance  for  each
                       feature  j:






                         Let $E_j^{(c)}$ and $E^{(c)}$ denote the errors for c clusters for feature j and for all
                       features, respectively. If the pattern features follow a normal distribution, it can be
                       shown that:












                       where $F_{a,b}$ is the F (Fisher) distribution with (a, b) degrees of freedom.
                         When the normality assumption is not satisfied, the cluster merit indexes $R$ and
                       $R_j$ are still useful since they measure the decrease in overall within-cluster distance
                       when passing from a solution with c clusters to one with c+1 clusters. A high value
                       of  the  merit  indexes  indicates  a  substantial  decrease  in  overall  within-cluster
                       distance.
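
The following Python sketch shows how such errors and merit indexes might be computed from two clustering solutions. It rests on two assumptions that are not confirmed by the text as it stands: that the within-cluster distance is the sum of squared deviations from each cluster mean, and that the merit index takes the commonly cited form (E(c)/E(c+1) - 1)(n - c - 1), with n the number of patterns. The function names are illustrative.

```python
def within_cluster_errors(patterns, labels):
    """Return (E, [E_1, ..., E_d]) for one clustering solution.

    Assumption: the error is the sum of squared deviations of each
    pattern from the mean of its cluster, overall and per feature.
    """
    d = len(patterns[0])
    clusters = {}
    for x, k in zip(patterns, labels):
        clusters.setdefault(k, []).append(x)
    per_feature = [0.0] * d
    for members in clusters.values():
        mean = [sum(x[j] for x in members) / len(members) for j in range(d)]
        for x in members:
            for j in range(d):
                per_feature[j] += (x[j] - mean[j]) ** 2
    # With squared Euclidean distance the overall error is the sum
    # of the per-feature errors.
    return sum(per_feature), per_feature

def merit_index(err_c, err_c_plus_1, n, c):
    """Assumed form of the merit index: (E(c)/E(c+1) - 1)(n - c - 1)."""
    return (err_c / err_c_plus_1 - 1.0) * (n - c - 1)

# Toy data: four 2-feature patterns, clustered with c = 1 and c = 2.
patterns = [(0, 0), (2, 0), (10, 1), (12, 3)]
E1, Ej1 = within_cluster_errors(patterns, [0, 0, 0, 0])
E2, Ej2 = within_cluster_errors(patterns, [0, 0, 1, 1])
R = merit_index(E1, E2, n=len(patterns), c=1)
Rj = [merit_index(a, b, n=len(patterns), c=1) for a, b in zip(Ej1, Ej2)]
```

In this toy case the first feature separates the two groups much more strongly than the second, so its index is far larger, which is the kind of contrast the per-feature indexes are meant to expose.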
                         Let us consider the Rocks dataset with  134 cases, characterized by  physical and
                       chemical features. Performing a factor analysis produces the factor loadings graph
                       shown in Figure 3.18. Factor  1 is highly correlated with chemical features such as
                       the CaO and SiO2 contents, which determine important categorizations of the rocks
                       (e.g. silicate vs.  non-silicate  rocks).  Factor  2  is  highly  correlated  with  physical
                       features such as PAOA (apparent porosity), AAPN (water absorption) and MVAP
                       (volumetric weight).
                         We  now perform several experiments of k-means clustering using the factors as
                       features, for c between 2 and 8. Using 10 iterations and rule 4 for the initial choice
                       of  centroids,  we  compute  for  each  solution  the  cluster  merit  indexes.  Using
                       Statistica, one can use for this purpose the computed values of the within sum of