An important aspect of data clustering is which patterns or cases to choose for
deriving the cluster solution. Usually, in data clustering, one wants typical cases to
be represented, i.e., cases that the designer suspects of typifying the structure of the
data. If the clustering solution supports the typical cases, it is generally considered a
good solution. A totally different perspective is followed in supervised
classification, where no distinction is made between typical and atypical cases,
since a classifier is expected to perform uniformly for all cases. Therefore, random
data sampling, a requirement when designing supervised classifiers, is not usually
needed in data clustering. It is only when one is interested in generalizing the clustering results that
the issue of random sampling should be considered.
3.2 The Standardization Issue
Data clustering explores the metric properties of the feature vectors in the feature
space (described in 2.2) in order to join them into meaningful groups. As one has
no previous knowledge concerning prototypes or cluster shape, there is an
arbitrarily large number of data clustering solutions that can be radically different
from each other. As an example, consider the +Cross data available in the
Cluster.xls file, which is represented in Figure 3.2a.
Figure 3.2. Cross data with: (a) Euclidean clustering; (b) City-block clustering.
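Imagine that we are searching for a 2-cluster solution that minimizes the
within-cluster average error:

$$E = \sum_{i=1}^{2} \frac{1}{n_i} \sum_{x, y \in \omega_i} d(x, y),$$

with $n_i$ different pairs of patterns x, y in cluster $\omega_i$, and $d(x, y)$ the distance between the two patterns under the chosen metric.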
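
To make the criterion concrete, the following minimal Python sketch evaluates a within-cluster average error for a given 2-cluster labeling under both the Euclidean and the city-block metric. It assumes the error is the sum, over clusters, of the mean distance across the n_i different pairs of patterns in each cluster. The function names and the four toy points are hypothetical illustrations; they are not the Cross data from Cluster.xls.

```python
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def city_block(x, y):
    """City-block (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def within_cluster_average_error(points, labels, dist):
    """Sum over clusters of the mean pairwise distance inside each cluster.

    Assumed form of the criterion: for cluster i with n_i different pairs
    of patterns, add (1 / n_i) * sum of dist(x, y) over those pairs.
    A cluster with a single pattern has no pairs and contributes 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        pairs = list(combinations(members, 2))  # the n_i different pairs
        if pairs:
            total += sum(dist(x, y) for x, y in pairs) / len(pairs)
    return total


# Hypothetical toy data (not the book's Cross data set).
points = [(0, 0), (3, 4), (10, 10), (13, 14)]
labels = [0, 0, 1, 1]

e_euclidean = within_cluster_average_error(points, labels, euclidean)
e_city_block = within_cluster_average_error(points, labels, city_block)
print(e_euclidean, e_city_block)
```

The same labeling receives a different error value under each metric, so a search that minimizes this criterion can settle on different partitions depending on the metric chosen.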