64 3 Data Clustering
Euclidean metrics seem appropriate, given the somewhat globular aspect of the
data. Using Ward's method with the Euclidean metric, the solution shown in
Figure 3.10 is obtained, which clearly identifies two clusters that are easy to
interpret: high and low rates of crime against property. The city-block metric could
also be used with similar results. A single linkage rule, on the contrary, would
produce drastically different solutions, as it would tend to leave aside singleton
clusters ({Coimbra} and {Aveiro}), rendering the interpretation more problematic.
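The contrast between the two linkage rules can be illustrated with a minimal sketch. It is not the procedure used to produce Figure 3.10: the points are invented (two compact groups plus one isolated point), the function names are hypothetical, and the naive merge loop is chosen for readability rather than efficiency.

```python
from itertools import combinations

def sq_euclid(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Mean vector of a list of points."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def single_link(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(sq_euclid(a, b) for a in c1 for b in c2) ** 0.5

def ward_cost(c1, c2):
    """Ward's rule: increase in within-cluster sum of squares after merging."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * sq_euclid(centroid(c1), centroid(c2))

def agglomerate(points, dissimilarity, n_clusters):
    """Merge the two least dissimilar clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dissimilarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Invented data: two globular groups and one isolated point (the last one).
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (2.5, 0.0), (3.5, 0.0), (2.5, 1.0), (3.5, 1.0),
    (-2.5, 0.5),
]

single = agglomerate(points, single_link, 2)
ward = agglomerate(points, ward_cost, 2)

print("single linkage sizes:", sorted(len(c) for c in single))
print("Ward sizes:          ", sorted(len(c) for c in ward))
```

On this toy data the single linkage rule chains the two groups together and leaves the isolated point as a singleton cluster, whereas Ward's rule absorbs it into the nearer group and keeps the two groups apart.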
Clustering can also be used to assess the "data-support" of a supervised
classification. As a matter of fact, if a supervised classification uses distance
measures in a "natural" way, we would expect a data-driven approach to
reproduce much the same classification. Let us refer to
the cork stoppers data of Figure 3.1. In order to perform clustering it is advisable
for the features to have similar value ranges and thereby contribute equally to the
distance measures. We can achieve this by using the new feature PRT10 = PRT/10
(see also the beginning of section 2.3). Figure 3.11a shows the scatter plot for the
supervised classification.
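The effect of the rescaling on distances can be sketched as follows. The (N, PRT) values are hypothetical and are not taken from the cork stoppers data set; only the transformation PRT10 = PRT/10 comes from the text.

```python
import math

# Hypothetical (N, PRT) pairs for three cork stoppers; illustrative values only.
stoppers = [(40.0, 300.0), (80.0, 320.0), (45.0, 600.0)]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescale(sample):
    """Replace PRT by PRT10 = PRT/10 so that both features span similar ranges."""
    n, prt = sample
    return (n, prt / 10.0)

scaled = [rescale(s) for s in stoppers]

# Distances from the first stopper to the other two, before and after rescaling.
raw_d01 = euclidean(stoppers[0], stoppers[1])
raw_d02 = euclidean(stoppers[0], stoppers[2])
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1) = {raw_d01:.1f}, d(0,2) = {raw_d02:.1f}")
print(f"scaled: d(0,1) = {scaled_d01:.1f}, d(0,2) = {scaled_d02:.1f}")
```

With these invented values the raw distances are dominated by the PRT differences, so the second stopper appears closest to the first. After rescaling, the difference in N carries comparable weight and the third stopper becomes the nearest neighbour.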
[Dendrogram. Leaf labels, in the order listed: AVEIRO, SETUBAL, V. CASTELO, BEJA, PORTO, VISEU, BRAGA, SANTAREM, BRAGANÇA, COIMBRA, C. BRANCO, PORTALEGRE, EVORA, V. REAL, GUARDA, LEIRIA, FARO, LISBOA. Axis: Linkage Distance (0 to 7).]
Figure 3.10. Dendrogram for the Crimes data using Ward's method. Two clusters
are clearly identifiable.
Experimenting with the complete linkage, UWGMA and Ward's rules, we obtain
the best results with Ward's rule and the squared Euclidean distance metric. The
respective scatter plot is shown in Figure 3.11b. The resemblance to the supervised
classification is quite good (only 19 differences in 100 patterns).
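A count of differences such as the one quoted above requires matching cluster labels to class labels, since cluster labels are arbitrary. The text does not say how the count was obtained; the sketch below is one plausible way, assuming there are no more clusters than classes and trying every assignment of clusters to classes. The function name and the label vectors are hypothetical.

```python
from itertools import permutations

def count_differences(class_labels, cluster_labels):
    """Smallest number of disagreements over all cluster-to-class assignments."""
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_labels))
    best = len(class_labels)
    # Try each way of naming the clusters after distinct classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        diffs = sum(
            mapping[k] != c for c, k in zip(class_labels, cluster_labels)
        )
        best = min(best, diffs)
    return best

# Invented labels for eight patterns: supervised classes vs. cluster names.
class_labels = [1, 1, 1, 1, 2, 2, 2, 2]
cluster_labels = ["b", "b", "b", "a", "a", "a", "a", "a"]

differences = count_differences(class_labels, cluster_labels)
print(differences, "differences in", len(class_labels), "patterns")
```

Exhaustive matching is adequate for two or three classes; for many classes an assignment algorithm would be a more sensible choice.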