3 Data Clustering
In order to show the basic difficulties of data clustering, let us look at the scatter
plot of the first two classes of the cork stoppers data, features N and PRT,
without any pattern labelling, as represented in Figure 3.1. Do we really see
two clusters? If we were asked to delineate two clusters, where should we set the
limits? Would such limits approximate the class limits?
From Figure 3.1 one readily gets the impression that data clustering is an ill-
formulated problem with an arbitrary number of solutions, especially in the case
where no cluster separation is visually identifiable. However, there are a number of
reasons for applying clustering techniques, such as the following important ones:
- In order to obtain a useful summary and interpretation of the problem
data. This is the classical purpose of clustering, applied to a wide variety of
problems. A good review can be found in Hartigan (1975). Data is summarized
by referring to the attributes of the clusters instead of the attributes of the
individual patterns. Cluster interpretation can provide guidance on the
development of scientific theories. Good examples of this last application are
the clusters of stars, which inspired theories about the life of stars, and the
clusters of animal species, which inspired evolutionary theories.
- In order to initialise a supervised statistical classification approach. This
type of application is attractive when data collection is a very slow and costly
process. Clustering is then used as an unsupervised classification method at a
starting phase, in order to obtain parametric estimates of the class distributions
for a small initial set of patterns. These estimates are updated as more
patterns become available. The reader can find a good description of this topic
in Duda and Hart (1973).
- In order to provide centroid estimates for neural network classifiers. The
centroid of a cluster is the average point in the multidimensional feature space;
in a sense, it is the centre of gravity of the respective cluster. We will refer to
this application in detail in chapter five, when discussing radial basis functions.
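As a small illustration of this definition, the centroid is simply the coordinate-wise mean of the patterns in a cluster. The following minimal Python sketch computes it; the feature values shown are made up for illustration, not taken from the cork stoppers data:

```python
def centroid(patterns):
    """Coordinate-wise mean of a list of feature vectors:
    the centre of gravity of the cluster in feature space."""
    n = len(patterns)
    dim = len(patterns[0])
    return [sum(p[i] for p in patterns) / n for i in range(dim)]

# Hypothetical 2-D patterns (e.g., two features after scaling).
cluster = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
print(centroid(cluster))  # -> [2.0, 3.0]
```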
There are a large number of clustering algorithms, depending on the type of
data, on the rules for pattern amalgamation in a cluster and on the approach
followed for applying such rules. Regarding the type of amalgamation approach,
two categories of algorithms can be considered:
- Hierarchical algorithms
The hierarchical or tree clustering algorithms apply linkage rules to the patterns in
order to produce a hierarchical sequence of clustering solutions. Algorithms of this
type are adequate for a reduced and meaningful set of patterns, i.e., the set must
contain representative pattern exemplars.
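To make the idea of a linkage rule concrete, the sketch below implements one simple choice, the single-linkage (nearest-neighbour) rule, as an agglomerative procedure: each pattern starts as its own cluster, and the two closest clusters are repeatedly merged. This is only one of several possible linkage rules, and the data points are made up for illustration:

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(patterns, n_clusters):
    """Agglomerative clustering with the single-linkage rule:
    repeatedly merge the two clusters whose closest member pair
    has the smallest distance, until n_clusters remain."""
    clusters = [[p] for p in patterns]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two visually separated groups of made-up 2-D patterns.
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(single_linkage(data, 2))
```

Running the full agglomeration from one cluster per pattern down to a single cluster yields the hierarchical sequence of solutions mentioned above; stopping at a given number of clusters selects one level of that hierarchy.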
- Centroid adjustment algorithms
Algorithms of this type use an iterative method to adjust the prototypes, also
called centroids, of the clusters, according to the patterns assigned to
them. A popular algorithm of this type will be explained in section 3.3.