Page 75 -
P. 75

62     3 Data Clusterinn


                               Complete linkage
                               Also  called  FN  (furthest  neighbour)  rule,  this  method  (previously  introduced)
                               evaluates  the dissimilarity between two clusters as the greatest distance between
                               any two patterns, one from each cluster.

                                  ~(Lo~,w~)= max  /IX-~JJ.
                                            x€wi,y€wj
                                 This rule performs well when the clusters are compact, roughly globular and of
                               equal  size,  as  shown  in  the  example  of  the  globular  data  (Figure  3.5~~). It  is,
                               however, inadequate for filamentary clusters.  For instance, for the filamentary data
                               of  Figure  3.4b, the 2  and  3 cluster solutions obtained are of  globular  shape and
                               therefore inadequate. The 2 cluster solution is: {I, 2, 3, 4, 9, A, B, C] and  (5, 6, 7,
                               8, D, E, F, G, H, 11.
                                 The single and complete linkage rules represent two extremes in  dissimilarity
                               assessment and tend to be sensitive to atypical patterns, since they depend only on
                               nearest  or  furthest  neighbours,  thereby  discarding  a  lot  of  information.  The
                               following  rules  use  all  available  information  in  an  averaging  approach  and  are
                               therefore less sensitive to atypical patterns.

                               Average linkage between groups
                               Also  known  as  UPGMA  (un-weighted  pair-group  method  using  arithmetic
                               averages), this rule assesses the distance between two clusters as the average of the
                               distances between all pairs of patterns, each pattern from a distinct cluster:






                                  This method is quite effective for several types of cluster shapes.

                                Average linkage within groups
                               This rule, known as UWGM (un-weighted within-group method using arithmetic
                               averages), is a variant of the previous one, where the distance between two clusters
                               is taken to be the average of the distances of all possible pairs of patterns, as if they
                               formed a single cluster:






                                  Notice  that  this  formula  is  equivalent  to  formula  (3-1) when  applied  to  all
                               clusters.
                                  This  method  is  quite  efficient  when  we  intend  to  keep  the  "volume"  of  the
                               resulting cluster as small as possible, minimizing the within-cluster average error
   70   71   72   73   74   75   76   77   78   79   80