3.6 Cluster Validation 75
The second dataset is clustered using the k-means algorithm under the same
conditions as in step 2. The centroids derived in this step for the Rocks dataset S2
are shown in Table 3.2. Notice their proximity to the centroids of Table 3.1.
5. Compute a measure of agreement between the clustering of S2 based on the
nearest centroid of S1 and the direct clustering of S2.
For the Rocks dataset S2, only two patterns changed their assignments: from
cluster #1, according to the centroids of S1, to the neighbouring cluster #2 (see
Figure 3.19).
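Step 5 can be sketched in code. The sketch below uses hypothetical toy data (not the Rocks dataset): it assigns each pattern of S2 to the nearest centroid obtained from S1 and counts how many assignments differ from the labels of a direct k-means run on S2.

```python
# Sketch of step 5: compare S2's direct k-means labels with the labels
# obtained by assigning each S2 pattern to the nearest S1 centroid.
# All data below are illustrative toy values, not the Rocks dataset.

def nearest_centroid(x, centroids):
    """Index of the centroid closest to pattern x (squared Euclidean distance)."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def cross_labels(patterns, centroids):
    """Label every pattern by its nearest centroid."""
    return [nearest_centroid(x, centroids) for x in patterns]

# Toy example: 2-D patterns, two clusters.
centroids_S1 = [(0.0, 0.0), (5.0, 5.0)]     # centroids derived from S1
S2 = [(0.1, 0.2), (4.9, 5.1), (0.3, -0.1), (2.6, 2.6)]
direct = [0, 1, 0, 0]                       # labels from k-means run on S2 itself
via_S1 = cross_labels(S2, centroids_S1)
changed = sum(d != v for d, v in zip(direct, via_S1))
```

With these toy values the last pattern lies slightly closer to the second centroid, so exactly one assignment changes between the two methods.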
Table 3.3. Agreement table for the two clustering methods of the Rocks dataset.

    Cluster #1   Cluster #2   Cluster #3   Nr of occurrences
        2            0            0              33
        1            1            0               1
        1            1            0               1
        0            2            0               6
        0            0            2              27
The agreement between the two clustering methods (using the centroids of S1 or
clustering S2 directly) is shown in Table 3.3. The entries in this table under a
cluster column are the number of times a pattern was assigned to that cluster. The
"Nr of occurrences" column indicates how many times this event occurred. For
instance, both methods unanimously assigned a pattern to cluster #1 thirty-three
times.
A measure of agreement can be computed using Cohen's κ (kappa) statistic. As this
method is of interest in a broad class of pattern recognition situations, namely for
comparing classifiers, we will describe here the major aspects of this statistical
method, whose details can be found in e.g. Siegel and Castellan (1988).
Consider an "agreement table" such as Table 3.3 with n objects assigned by k
judges (classifiers, methods) to one of c categories (clusters, classes). In the Rocks
example we have n=68, k=2 and c=3. Instead of filling in a table with 68 rows we
condensed it by adding the extra "Nr of occurrences" column. Let us denote by
n_ij the number of times an object i is assigned to category j.
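With n, k and the n_ij in hand, the computation can be sketched numerically for Table 3.3. The sketch assumes the standard multi-rater κ formulation of Siegel and Castellan (mean per-object agreement corrected for chance agreement); the resulting value is illustrative.

```python
# Multi-rater kappa (Siegel-Castellan formulation) computed from the
# condensed agreement table of Table 3.3.  Each entry pairs a row of
# category counts (n_i1, n_i2, n_i3) with its "Nr of occurrences".
rows = [((2, 0, 0), 33),
        ((1, 1, 0), 1),
        ((1, 1, 0), 1),
        ((0, 2, 0), 6),
        ((0, 0, 2), 27)]

k = 2                                    # number of judges (methods)
n = sum(count for _, count in rows)      # number of objects (patterns), 68
c = len(rows[0][0])                      # number of categories (clusters)

# Mean per-object agreement: P_i = (sum_j n_ij^2 - k) / (k*(k-1)).
P_bar = sum(count * (sum(nij * nij for nij in row) - k) / (k * (k - 1))
            for row, count in rows) / n

# Chance agreement: P_e = sum_j p_j^2, with p_j the share of column j.
p = [sum(count * row[j] for row, count in rows) / (n * k) for j in range(c)]
P_e = sum(pj * pj for pj in p)

kappa = (P_bar - P_e) / (1 - P_e)
```

For these counts the agreement is nearly perfect (only two of the 68 patterns disagree), so κ comes out close to 1.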
The κ statistic is given by the formula: