3 Data Clustering

3.1 Unsupervised Classification

In the previous chapters, when introducing the idea of similarity as a distance
between feature vectors, we often computed this distance relative to a prototype
pattern. We have also implicitly assumed that the shape of the class distributions in
the feature space around a prototype was known, and based on this, we could
choose a suitable distance metric. The knowledge of such class shapes and
prototypes is obtained from a previously classified training set of patterns. The
design of a PR system using this "teacher" information is called a supervised
design. Since our present interest is in classification systems, we will refer to
this as supervised classification.

We are often confronted with a more primitive situation where no previous
knowledge about the patterns is available or obtainable (after all, we learn to
classify a lot of things without being taught). Therefore our classifying system
must "discover" the internal similarity structure of the patterns in a useful way. We
then need to design our system using a so-called unsupervised approach. The
present chapter is dedicated to the unsupervised classification of feature vectors,
also called data clustering. This is essentially a data-driven approach that attempts
to discover structure within the data itself, grouping the feature vectors into
clusters.
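
The contrast between the two situations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature vectors are made-up values, the function names are hypothetical, and k-means is used merely as one common clustering procedure, not necessarily the method developed in this chapter. In the supervised case the prototypes are given; in the unsupervised case the cluster centers must be estimated from the data.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def supervised_classify(points, prototypes):
    """Supervised case: prototypes are known from a classified training set."""
    return [nearest(p, prototypes) for p in points]

def kmeans(points, k, iterations=20):
    """Unsupervised case: centers are unknown and estimated from the data."""
    # Naive initialisation (an assumption): use the first k points as centers.
    centers = [points[i] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign every feature vector to its nearest current center.
        labels = [nearest(p, centers) for p in points]
        # Move each center to the mean of the vectors assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Made-up two-dimensional feature vectors, for illustration only.
points = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9), (1.1, 0.9), (4.9, 5.3)]

labels, centers = kmeans(points, k=2)
print(labels)
print(supervised_classify(points, [(1.0, 1.0), (5.0, 5.0)]))
```

With these well-separated values both calls group the vectors in the same way; the difference is that the second call had to be told where the prototypes are, whereas the first found the two groups from the data alone.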

Figure 3.1. Scatter plot of the first 100 cork stoppers, using features N and PRT.
