3.5 K-Means Clustering 71

3. Sort the distances between all patterns and choose patterns at constant intervals
of these distances as initial centroids.
4. Choose patterns that maximize between-cluster distance, as follows: start by
selecting the first c patterns as centroids; a subsequent pattern replaces the
closest centroid if its distance from the centroid is greater than the distance
between the two closest centroids or the smallest distance between that centroid
and any of the others.
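
Rule 4 can be sketched in a few lines of NumPy. This is only one literal reading of the rule, assuming Euclidean distance and patterns stored as rows of an array; the function name `init_centroids_rule4` is hypothetical, and statistical packages may implement the replacement test differently.

```python
import numpy as np

def init_centroids_rule4(X, c):
    """Pick c initial centroids from the rows of X (one reading of rule 4)."""
    X = np.asarray(X, dtype=float)
    centroids = X[:c].copy()  # start with the first c patterns
    for x in X[c:]:
        # distance from the candidate pattern to every current centroid
        d_x = np.linalg.norm(centroids - x, axis=1)
        nearest = int(np.argmin(d_x))
        # pairwise distances between centroids, diagonal masked out
        d_cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
        np.fill_diagonal(d_cc, np.inf)
        # test (a): farther from its closest centroid than the two
        # closest centroids are from each other
        cond_a = d_x[nearest] > d_cc.min()
        # test (b): farther from its closest centroid than that centroid
        # is from any of the other centroids
        cond_b = d_x[nearest] > d_cc[nearest].min()
        if cond_a or cond_b:
            centroids[nearest] = x  # the candidate replaces the closest centroid
    return centroids
```

Under this reading, test (a) implies test (b), so the two conditions collapse into one; a reading in which the two tests apply to different centroids would need a different replacement step.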

Clustering solutions are usually evaluated using the overall within-cluster
distance of formula (3-5) as well as using the within-cluster distance for each
feature j:
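
As a sketch, assuming the overall distance of formula (3-5) is the sum of squared Euclidean distances from each pattern to its cluster mean, the per-feature distance keeps only the j-th coordinate of each term. The symbols $x_{ij}$ (feature j of pattern i), $m_{kj}$ (feature j of the mean of cluster $\omega_k$) and $\omega_k$ are notation assumed here:

```latex
E_j = \sum_{k=1}^{c} \sum_{x_i \in \omega_k} \left( x_{ij} - m_{kj} \right)^2
```

With this form, summing $E_j$ over all features gives the overall within-cluster distance.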

Let $E_j^{(c)}$ and $E^{(c)}$ denote the errors for c clusters for feature j and for all
features, respectively. If the pattern features follow a normal distribution, it can be
shown that:
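
A form commonly attributed to Hartigan, assumed here as a sketch, uses n for the number of patterns and d for the number of features (both symbols are assumptions, as are the names $R_j$ and $R$ for the two ratios):

```latex
R_j = \left( \frac{E_j^{(c)}}{E_j^{(c+1)}} - 1 \right) (n - c - 1) \sim F_{1,\, n-c-1},
\qquad
R = \left( \frac{E^{(c)}}{E^{(c+1)}} - 1 \right) (n - c - 1) \sim F_{d,\, d(n-c-1)}
```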

where $F_{a,b}$ is the F (Fisher) distribution with (a, b) degrees of freedom.
When the normality assumption is not satisfied, the cluster merit indexes $R$ and
$R_j$ are still useful since they measure the decrease in overall within-cluster distance
when passing from a solution with c clusters to one with c+1 clusters. A high value
of the merit indexes indicates a substantial decrease in overall within-cluster
distance.
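
The following sketch shows how such a merit index could be computed from two clustering solutions. It assumes squared Euclidean within-cluster error and a Hartigan-style index of the form (E(c)/E(c+1) - 1)(n - c - 1); the function names, the toy data and the two label vectors are all made up for illustration.

```python
import numpy as np

def within_cluster_errors(X, labels):
    """Return (E, E_j): overall and per-feature within-cluster squared error."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    E_j = np.zeros(X.shape[1])
    for k in np.unique(labels):
        members = X[labels == k]
        # squared deviations from the cluster mean, summed per feature
        E_j += ((members - members.mean(axis=0)) ** 2).sum(axis=0)
    return float(E_j.sum()), E_j

def merit_index(E_c, E_c1, n, c):
    """Assumed Hartigan-style index: (E(c) / E(c+1) - 1) * (n - c - 1)."""
    return (E_c / E_c1 - 1.0) * (n - c - 1)

# Toy data (not from any real dataset): three well-separated groups
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0],
              [11.0, 0.0], [20.0, 1.0], [21.0, 1.0]])
labels_2 = np.array([0, 0, 0, 0, 1, 1])  # hypothetical 2-cluster solution
labels_3 = np.array([0, 0, 1, 1, 2, 2])  # hypothetical 3-cluster solution

E2, E2_j = within_cluster_errors(X, labels_2)
E3, E3_j = within_cluster_errors(X, labels_3)
R = merit_index(E2, E3, n=len(X), c=2)   # large R: going from 2 to 3 clusters helps
```

The same `merit_index` call applied to one entry of `E2_j` and `E3_j` gives the per-feature index, provided the feature has non-zero within-cluster error in the solution with c+1 clusters.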
Let us consider the Rocks dataset with 134 cases, characterized by physical and
chemical features. Performing a factor analysis produces the factor loadings graph
shown in Figure 3.18. Factor 1 is highly correlated with chemical features such as
the CaO and SiO2 contents, which determine important categorizations of the rocks
(e.g. silicate vs. non-silicate rocks). Factor 2 is highly correlated with physical
features such as PAOA (apparent porosity), AAPN (water absorption) and MVAP
(volumetric weight).
We now perform several experiments of k-means clustering using the factors as
features, for c between 2 and 8. Using 10 iterations and rule 4 for the initial choice
of centroids, we compute for each solution the cluster merit indexes. Using
Statistica, one can use for this purpose the computed values of the within sum of
