
74     3 Data Clustering

                                        A different approach to cluster validation is to perform a replication  analysis,
                                      developed by McIntyre and Blashfield (1980). This is essentially a cross-validation
                                      process,  where  the  results  on  a  subset  of  the  data  are  cross-validated  with  the
                                      results obtained on another subset. In the following we describe the main steps of
                                      the replication analysis, illustrating them on the k-means cluster solution of the
                                      Rocks dataset derived in the previous section.

                                      1. Divide the original dataset into two datasets.
                                        The original dataset is randomly split into two sets. Statistical software such as
                                      SPSS or Statistica makes this possible by using filter variables filled in with zeros
                                      and ones. With the Rocks dataset, two samples, S1 and S2, with 66 and 68 cases
                                      respectively, were obtained in this way.
                                      2. Cluster the first dataset and determine the centroids.
                                        Performing k-means clustering on S1, the centroids shown in Table 3.1 were
                                      found.



                                      Table 3.1. Centroids of the first Rocks dataset, S1.

                                                           Cluster #1    Cluster #2    Cluster #3
                                               Factor 1         -0.64         -1.51          1.24
                                               Factor 2         -0.53          3.37          0.31





                                      3. Assign the data of the second dataset to the nearest centroids.
                                        The distances between the patterns of the second dataset, S2, and the centroids
                                      previously  determined  on  S1  are computed.  Each  S2  pattern  is  assigned  to  the
                                      nearest  centroid.  SPSS  makes  it  possible  to  save  the  previously  determined
                                      centroids,  making  them  available  for  "classification" (assignment)  alone  in  this
                                      step.
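Steps 1 and 3 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SPSS or Statistica procedure: the function names (`split_dataset`, `nearest_centroid`, `assign`) and the three example patterns are hypothetical, and Euclidean distance is assumed as the distance measure. The centroids are the S1 values from Table 3.1; the k-means run of step 2 is not reproduced here.

```python
import math
import random

# Centroids found on S1 (Table 3.1), stored as (Factor 1, Factor 2).
S1_CENTROIDS = {
    1: (-0.64, -0.53),
    2: (-1.51, 3.37),
    3: (1.24, 0.31),
}


def split_dataset(cases, seed=0):
    """Step 1: split cases in two using a random 0/1 filter variable."""
    rng = random.Random(seed)
    first, second = [], []
    for case in cases:
        # A filter value of 0 sends the case to the first set, 1 to the second.
        (first if rng.randint(0, 1) == 0 else second).append(case)
    return first, second


def nearest_centroid(pattern, centroids):
    """Step 3: return the label of the centroid closest to the pattern.

    Euclidean distance is an assumption made for this sketch.
    """
    return min(centroids, key=lambda label: math.dist(pattern, centroids[label]))


def assign(patterns, centroids):
    """Assign every pattern to its nearest centroid."""
    return [nearest_centroid(p, centroids) for p in patterns]


# Hypothetical S2 patterns, invented for illustration (not Rocks data).
example_patterns = [(-0.7, -0.4), (-1.3, 2.9), (1.1, 0.2)]
labels = assign(example_patterns, S1_CENTROIDS)
```

Because the filter variable is drawn independently for each case, the two sets need not have the same size, which is consistent with the 66/68 split reported for the Rocks dataset.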



                                      Table 3.2. Centroids of the second Rocks dataset, S2.

                                                           Cluster #1    Cluster #2    Cluster #3
                                               Factor 1         -0.64         -1.24          1.27
                                               Factor 2         -0.65          2.29          0.29





                                      4.  Cluster the second dataset.