74 3 Data Clustering
A different approach to cluster validation is to perform a replication analysis,
developed by McIntyre and Blashfield (1980). This is essentially a cross-validation
process, where the results on a subset of the data are cross-validated with the
results obtained on another subset. In the following we describe the main steps of
the replication analysis, illustrating it with the application to the k-means cluster
solution of the Rocks dataset derived in the previous section.
1. Divide the original dataset into two datasets.
The original dataset is randomly split into two sets. Statistical software such as
SPSS or Statistica makes this possible through filter variables filled in with zeros
and ones. With the Rocks dataset, two samples, S1 and S2, with 66 and 68 cases
respectively, were obtained in this way.
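The random split with a zero/one filter variable can be sketched in a few lines of Python. This is only an illustration of the idea, not the SPSS or Statistica procedure: the function name `split_with_filter` and the seed are hypothetical, and the 134 cases are taken as the sum of the two sample sizes quoted above. A different random draw will generally give sample sizes other than 66 and 68.

```python
import random

def split_with_filter(n_cases, seed=0):
    """Return one random 0/1 filter flag per case (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_cases)]

n_cases = 134  # 66 + 68 cases, as in the Rocks example
flags = split_with_filter(n_cases)

# Cases flagged 0 form the first sample, cases flagged 1 the second.
s1_idx = [i for i, flag in enumerate(flags) if flag == 0]
s2_idx = [i for i, flag in enumerate(flags) if flag == 1]

print(len(s1_idx), len(s2_idx))
```

Each case lands in exactly one of the two samples, so the two index lists are disjoint and together cover the whole dataset.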
2. Cluster the first dataset and determine the centroids.
Performing k-means clustering on S1 yielded the centroids shown in Table 3.1.
Table 3.1. Centroids of the first Rocks dataset, S1.
            Cluster #1   Cluster #2   Cluster #3
Factor 1       -0.64        -1.51         1.24
Factor 2       -0.53         3.37         0.31
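For readers who want to see how centroids of this kind arise, the following is a minimal sketch of a plain (Lloyd-style) k-means loop in pure Python. It assumes Euclidean distance and caller-supplied initial centroids; the function `kmeans` and the four toy points are hypothetical and are not the Rocks data, and statistical packages may use different initialization and stopping rules.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(v) / len(members) for v in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # centroids no longer move
        centroids = new_centroids
    return centroids, labels

# Toy two-dimensional patterns, invented for illustration only.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6)]
toy_centroids, toy_labels = kmeans(toy_points, init_centroids=[(0, 0), (5, 5)])
print(toy_centroids, toy_labels)
```

The returned centroids play the role of the entries in Table 3.1: one coordinate per factor for each cluster.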
3. Assign the data of the second dataset to the nearest centroids.
The distances between the patterns of the second dataset, S2, and the centroids
previously determined on S1 are computed. Each S2 pattern is assigned to the
nearest centroid. SPSS makes it possible to save the previously determined
centroids, making them available for "classification" (assignment) alone in this
step.
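The assignment step can be sketched as follows, using the S1 centroids of Table 3.1. The sketch assumes Euclidean distance, which the text does not state explicitly; the function `nearest_centroid` and the three S2 patterns are hypothetical and serve only to illustrate the rule.

```python
import math

# Centroids found on S1 (Table 3.1), as (Factor 1, Factor 2) pairs.
centroids_s1 = {1: (-0.64, -0.53), 2: (-1.51, 3.37), 3: (1.24, 0.31)}

def nearest_centroid(pattern, centroids):
    """Return the label of the centroid closest to pattern (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(pattern, centroids[label]))

# Hypothetical S2 patterns, invented for illustration only.
s2_patterns = [(-0.7, -0.4), (-1.2, 2.5), (1.0, 0.2)]
s2_labels = [nearest_centroid(p, centroids_s1) for p in s2_patterns]
print(s2_labels)
```

No centroid is recomputed here: the S1 centroids stay fixed and each S2 pattern simply receives the label of the closest one.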
Table 3.2. Centroids of the second Rocks dataset, S2.
            Cluster #1   Cluster #2   Cluster #3
Factor 1       -0.64        -1.24         1.27
Factor 2       -0.65         2.29         0.29
4. Cluster the second dataset.