3.5 K-Means Clustering
that the values c = 3, 5, 8 are sensible choices for the number of clusters. In
particular, the solution with 3 clusters looks quite attractive since it corresponds to
high values of the merit indexes. This solution is shown in Figure 3.19. Cluster #1
has 74 cases corresponding to calcium carbonate rocks, such as limestones and
marbles. Cluster #2 has 11 cases that correspond to the same type of rocks but with
higher porosity. Finally, cluster #3 has 49 cases, which correspond to silicate
stones such as granites and diorites.
Figure 3.20. Variation of the cluster index R for the two features Factor 1 and
Factor 2, with the number of clusters c (Rocks dataset).
The solution with 5 clusters is also interesting. It divides the calcium
carbonate rocks into three clusters corresponding to "high", "medium" and "low"
porosity. The silicate rocks are divided into two clusters, also according to the
porosity. The solution with 8 clusters is not interesting, since it contains a singleton
cluster.
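The singleton check used above to discard the 8-cluster solution can be sketched in a few lines. The example below is only an illustration under stated assumptions: the points are made up (they are not the Rocks dataset), the initial centroids are chosen by hand, and the function names kmeans and singleton_clusters are hypothetical. It runs plain k-means iterations with the Euclidean distance and reports which clusters end up with a single case.

```python
import math
from collections import Counter


def kmeans(points, centroids, max_iter=100):
    """Plain k-means iterations starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def singleton_clusters(labels):
    """Return the indexes of clusters that contain exactly one case."""
    sizes = Counter(labels)
    return sorted(j for j, n in sizes.items() if n == 1)


# Made-up 2-D data: two compact groups plus one isolated case.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (20, 0)]

labels2, _ = kmeans(points, [(0, 0), (9, 9)])
labels3, _ = kmeans(points, [(0, 0), (9, 9), (20, 0)])

print("c = 2, singleton clusters:", singleton_clusters(labels2))
print("c = 3, singleton clusters:", singleton_clusters(labels3))
```

With these hand-picked starting centroids, the third cluster in the c = 3 run captures only the isolated case. A different initialization could give a different partition, so in practice one would repeat the run from several starting points.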
Notice that in all these experiments we are using a Euclidean distance, therefore
imposing a circular shape onto the cluster boundaries. Another metric for
measuring distances, such as the Mahalanobis metric, could be more appropriate in
some cases, for instance when the clusters are elongated or the features are
correlated.
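The difference between the two metrics can be illustrated with a small sketch for two features. The Mahalanobis distance of a point x to a cluster with mean m and covariance matrix C is the square root of (x - m)' C^-1 (x - m). The covariance matrix and the points below are assumed values chosen for illustration, and the function names are hypothetical.

```python
import math


def euclidean(x, m):
    """Euclidean distance between point x and cluster mean m."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))


def mahalanobis_2d(x, m, cov):
    """Mahalanobis distance for two features, with cov a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - m[0], x[1] - m[1]
    # Quadratic form with the explicit 2x2 inverse (1/det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Assumed cluster: centred at the origin, spread twice as wide along feature 1.
mean = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))

p_along = (2.0, 0.0)   # displaced along the direction of large spread
p_across = (0.0, 2.0)  # displaced along the direction of small spread

for p in (p_along, p_across):
    print(p, "Euclidean:", euclidean(p, mean),
          "Mahalanobis:", mahalanobis_2d(p, mean, cov))
```

Both points are at the same Euclidean distance from the mean, but the Mahalanobis distance of the second is twice that of the first, because it lies along the direction in which the cluster has less spread. The resulting equal-distance contours are ellipses rather than circles.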
3.6 Cluster Validation
Clustering results are usually assessed by some measure of within-cluster
dissimilarity. In the previous section we used cluster merit indexes
that reflect such dissimilarity. Other statistical indexes have been proposed (see
e.g. Milligan, 1996). As a simple validation test, one could also apply the
Kruskal-Wallis test to the cluster solution and consider it acceptable if the
corresponding test probability is below a chosen significance level.
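A minimal sketch of such a test follows, assuming that it is applied to one feature at a time with the clusters as the groups being compared. The feature values are made up for illustration, and the function names are hypothetical. The statistic H is computed from the ranks of the pooled values, with the usual correction for ties, and the test probability is taken from the chi-square approximation with c - 1 degrees of freedom, which is only rough for very small clusters.

```python
import math
from collections import Counter


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    x = stat / 2.0
    # Start from the closed forms for df = 1 or df = 2, then step df by 2
    # using Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups of values of a single feature."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(v for g in groups for v in g)
    if len(set(pooled)) < 2:
        raise ValueError("all values are identical")
    n = len(pooled)
    # Assign to each distinct value the average of the ranks it occupies.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1, ..., j
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3.0 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = Counter(pooled).values()
    h /= 1.0 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    return h, chi2_sf(h, len(groups) - 1)


# Made-up values of one feature for the cases of a 3-cluster solution.
cluster_1 = [-1.2, -0.9, -1.1, -0.7, -1.0]
cluster_2 = [0.1, 0.3, -0.1, 0.2]
cluster_3 = [1.4, 1.1, 1.6, 1.3, 1.2]

alpha = 0.05  # assumed significance level
h, p = kruskal_wallis([cluster_1, cluster_2, cluster_3])
print("H =", round(h, 3), "p =", round(p, 5), "acceptable:", p < alpha)
```

A small test probability only indicates that the clusters differ in the distribution of that feature. It says nothing about whether the number of clusters is the best one, so the test complements the merit indexes rather than replacing them.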