Page 83 -
P. 83
70 3 Data Clustering
3.5 K-Means Clustering
The k-Means or ISODATA clustering algorithm is the most popular example of an
algorithm that performs iterative adjustment of c (k in the original algorithm
version) cluster centroids. It has a distinct application from the tree clustering
methods since it is intended to yield a clustering solution for an arbitrarily large
number of patterns and for a previously defined number of clusters. The choice of
the number of clusters can be made by performing tree clustering either in a
smaller set of data or in a reduced dimension space obtained for instance by
multidimensional scaling.
The central idea of the algorithm is to minimize an overall within-cluster
distance from the patterns to the centroids. Usually, except in the case of a quite
small number of patterns, an exhaustive search in all partitions of n patterns in c
clusters is prohibitive. Instead, a local minimum of the overall within-cluster
distance is searched by iteratively adjusting c cluster centroids, mi, and by
assigning each pattern to the closest centroid:
Notice that Ecan also be viewed as the error (sum of squared deviations) of the
cluster partition.
The algorithm, in a simple formulation, is as follows:
I. Denote: c, the number of clusters; maxiter, the maximum number of iterations
allowed; A, a minimum threshold for the allowed error deviation in consecutive
iterations. Assume c initial centroids my)for the iteration kl.
2. Assign each xi in the dataset to the cluster represented by the nearest m (k .
3. Compute for the previous partition the new centroids mF*') and the error
E(k+l).
I
I
4. Repeat steps 2 and 3 until k=rnaxiter or E(~+') - E(~) < A .
There are several variants of this algorithm depending on the choice of the
initial cluster centroids, the pattern assignment rule, the centroid computation
(often computed as the cluster mean) and the stop criteria. Usually the algorithm
stops after a small number of iterations (<lo). As for the initial choice of centroids,
which may have a strong influence on the final solution, it can be done in several
ways. Statistics and SPSS provide the following alternatives:
1. Specify which patterns are used as initial centroids. Tree clustering in a reduced
number of patterns may be performed for this purpose.
2. Choose the tjrst c patterns as initial centroids.