
3.5 K-Means Clustering

                          that the values c = 3, 5, 8 are sensible choices for the number of clusters. In
                          particular, the solution with 3 clusters looks quite attractive since it corresponds to
                          high values of the merit indexes. This solution is shown in Figure 3.19. Cluster #1
                          has 74 cases corresponding to calcium carbonate rocks, such as limestones and
                          marbles. Cluster #2 has 11 cases that correspond to the same type of rocks but with
                          higher porosity. Finally, cluster #3 has 49 cases, which correspond to silicate
                          stones such as granites and diorites.







                           Figure 3.20. Variation of the cluster index R for the two features Factor 1 and
                           Factor 2, with the number of clusters c (Rocks dataset).



                             The solution with 5 clusters is also interesting. It divides the calcium
                           carbonate rocks into three clusters corresponding to "high", "medium" and "low"
                           porosity. The silicate rocks are divided into two clusters, also according to
                           porosity. The solution with 8 clusters is not interesting, since it contains a singleton
                           cluster.
                             Notice that in all these experiments we are using a Euclidean distance, which
                            imposes a circular shape onto the cluster boundaries. Another metric for
                            measuring distances, such as the Mahalanobis metric, could be more appropriate in
                            some cases.
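                            The difference between the two metrics can be illustrated with a minimal sketch
                            for two features. The function names and the sample points below are hypothetical
                            and are not taken from the Rocks dataset; the covariance is estimated from the
                            points of a single cluster, which is only one of several possible choices.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def mean_and_covariance(points):
    """Sample mean and 2x2 sample covariance (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(p, mean, cov):
    """Mahalanobis distance of a 2-D point p from mean, given a 2x2 covariance."""
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # d^2 = (p - mean)^T cov^{-1} (p - mean), with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical cluster, elongated along the first feature.
cluster = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
centre, cov = mean_and_covariance(cluster)

# Two points at the same Euclidean distance from the centre.
a, b = (2.0, 0.0), (0.0, 2.0)
print(euclidean(a, centre), euclidean(b, centre))
print(mahalanobis(a, centre, cov), mahalanobis(b, centre, cov))
```

                            Both points are equally far from the centre in the Euclidean sense, but the
                            Mahalanobis distance treats the point lying along the direction of larger spread
                            as closer, so the implied cluster boundary is elliptical rather than circular.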



                            3.6  Cluster Validation

                            Clustering results are usually assessed by some measure of
                            within-cluster dissimilarity. In the previous section we used cluster merit indexes
                            that reflect such dissimilarity. Other statistical indexes have been proposed (see
                            e.g. Milligan, 1996). As a simple validation test, one could also apply the
                            Kruskal-Wallis test to the cluster solution and consider it acceptable if the
                            corresponding test probability is below a chosen significance level.
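                            The Kruskal-Wallis check can be sketched as follows, assuming the test is
                            applied to one feature at a time with the cluster labels as the grouping
                            variable. The function names, the feature values and the 0.05 level are
                            illustrative assumptions; the p-value uses the usual chi-square approximation,
                            which is rough for very small clusters. Libraries such as SciPy provide an
                            equivalent routine (scipy.stats.kruskal).

```python
import math
from collections import Counter, defaultdict


def average_ranks(values):
    """1-based ranks of values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def kruskal_wallis(feature, labels):
    """Return (H, p) for one feature grouped by cluster label."""
    rank_groups = defaultdict(list)
    for rank, label in zip(average_ranks(feature), labels):
        rank_groups[label].append(rank)
    if len(rank_groups) < 2:
        raise ValueError("at least two clusters are required")
    n = len(feature)
    h = 12.0 / (n * (n + 1)) * sum(
        sum(g) ** 2 / len(g) for g in rank_groups.values()
    ) - 3 * (n + 1)
    # Standard correction for tied observations.
    ties = sum(t ** 3 - t for t in Counter(feature).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction > 0:
        h /= correction
    return h, chi2_sf(h, len(rank_groups) - 1)


# Hypothetical feature values and cluster labels (not the Rocks dataset).
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
h, p = kruskal_wallis(feature, labels)
alpha = 0.05  # assumed significance level
print(h, p, "acceptable" if p < alpha else "not acceptable")
```

                            A small test probability indicates that the feature is distributed differently
                            across the clusters, which is the sense in which the solution is accepted here.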