


Euclidean metrics seem appropriate, given the somewhat globular aspect of the data.
Using Ward's method with the Euclidean metric, the solution shown in Figure 3.10 is
obtained, which clearly identifies two clusters that are easy to interpret: high and
low rates of crime against property. The city-block metric could also be used, with
similar results. A single linkage rule, on the contrary, would produce drastically
different solutions, as it would tend to leave aside singleton clusters ({Coimbra}
and {Aveiro}), rendering the interpretation more problematic.
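
The contrast between Ward's method and the single linkage rule can be illustrated with a minimal sketch in plain Python. The nine two-dimensional points below are invented toy data (not the Crimes data), and the function names are hypothetical; the naive merge loop is meant for illustration only, not as an efficient implementation.

```python
from itertools import combinations

def sq_euclid(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points, members):
    """Mean vector of the points whose indices are in `members`."""
    n = len(members)
    return [sum(points[i][d] for i in members) / n
            for d in range(len(points[0]))]

def single_link(points, a, b):
    """Single linkage: distance between the two closest members."""
    return min(sq_euclid(points[i], points[j]) for i in a for j in b)

def ward_cost(points, a, b):
    """Ward: increase in within-cluster sum of squares if a and b merge."""
    ca, cb = centroid(points, a), centroid(points, b)
    return len(a) * len(b) / (len(a) + len(b)) * sq_euclid(ca, cb)

def agglomerate(points, dissimilarity, k):
    """Merge the least dissimilar pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dissimilarity(points, pair[0], pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Toy data (assumed for illustration): two globular groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),   # group A, indices 0-3
          (3, 0), (3, 1), (4, 0), (4, 1),   # group B, indices 4-7
          (5, 4)]                           # outlier, index 8

ward_clusters = agglomerate(points, ward_cost, k=2)
single_clusters = agglomerate(points, single_link, k=2)
print(ward_clusters)    # [[0, 1, 2, 3], [4, 5, 6, 7, 8]]
print(single_clusters)  # [[0, 1, 2, 3, 4, 5, 6, 7], [8]]
```

With these toy points, Ward's rule absorbs the outlier into the nearer group and returns two compact clusters, whereas single linkage chains the two groups together and leaves the outlier as a singleton.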
  Clustering can also be used to assess the "data-support" of a supervised
classification. If a supervised classification uses distance measures in a "natural"
way, we would expect a data-driven approach to tend to reproduce the same
classification as the supervised one. Let us refer to the cork stoppers data of
Figure 3.1. In order to perform clustering it is advisable for the features to have
similar value ranges and thereby contribute equally to the distance measures. We can
achieve this by using the new feature PRT10 = PRT/10 (see also the beginning of
section 2.3). Figure 3.11a shows the scatter plot for the supervised classification.
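
The effect of the rescaling PRT10 = PRT/10 on distance measures can be sketched as follows. The three patterns and their values are invented for illustration, and the second feature is simply called N here as an assumption; only the division of PRT by 10 comes from the text.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescale(pattern):
    """Replace PRT by PRT10 = PRT/10, leaving the other feature unchanged."""
    n, prt = pattern
    return (n, prt / 10.0)

# Hypothetical (N, PRT) values for three stoppers; not taken from the data set.
stoppers = {"a": (40.0, 300.0), "b": (42.0, 600.0), "c": (90.0, 310.0)}

def nearest_to_a(patterns):
    """Name of the pattern (b or c) closest to pattern a."""
    return min(("b", "c"),
               key=lambda name: euclidean(patterns["a"], patterns[name]))

raw_nearest = nearest_to_a(stoppers)
scaled = {name: rescale(p) for name, p in stoppers.items()}
scaled_nearest = nearest_to_a(scaled)
print(raw_nearest, scaled_nearest)  # c b
```

With the raw values, the large range of PRT dominates the distance, so the difference in N hardly matters; after dividing PRT by 10, both features have comparable ranges and the nearest neighbour of pattern a changes.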






[Dendrogram. Leaves, top to bottom: AVEIRO, SETUBAL, V. CASTELO, BEJA, PORTO, VISEU,
BRAGA, SANTAREM, BRAGANÇA, COIMBRA, C. BRANCO, PORTALEGRE, EVORA, V. REAL, GUARDA,
LEIRIA, FARO, LISBOA. Horizontal axis: Linkage Distance, 0 to 7.]
Figure 3.10. Dendrogram for the Crimes data using Ward's method. Two clusters are
clearly identifiable.


  Experimenting with the complete linkage, UWGMA and Ward's rules, we obtain the best
results with Ward's rule and the squared Euclidean distance metric. The respective
scatter plot is shown in Figure 3.11b. The resemblance to the supervised
classification is quite good (only 19 differences in 100 patterns).
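
Counting the differences between a clustering solution and a supervised classification requires matching cluster labels to class labels first, since cluster labels are arbitrary. The following is a minimal sketch under the assumption that there are no more clusters than classes; the function name and the ten toy labels are hypothetical and are not the cork stoppers results.

```python
from itertools import permutations

def count_differences(class_labels, cluster_labels):
    """Smallest number of disagreements over all cluster-to-class matchings.

    Assumes the number of distinct clusters does not exceed the number
    of distinct classes.
    """
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_labels))
    best = len(class_labels)
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))  # cluster label -> class label
        diff = sum(1 for c, k in zip(class_labels, cluster_labels)
                   if mapping[k] != c)
        best = min(best, diff)
    return best

# Toy labels (assumed for illustration): 10 patterns, 2 classes, 2 clusters.
supervised = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
clustered = ["b", "b", "b", "b", "a", "a", "a", "a", "a", "b"]

differences = count_differences(supervised, clustered)
print(differences)  # 2
```

Trying every matching is only practical for a small number of classes, which is the situation in a two-class comparison such as the one described above.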