
54     3 Data Clustering

  In order to show the basic difficulties of data clustering, let us look at the scatter
plot of the first two classes of the cork stoppers data, features N and PRT,
without any pattern labelling, as represented in Figure 3.1. Do we really see
two clusters? If we were asked to delineate two clusters, where should we set the
limits? Would such limits approximate the class limits?
  From Figure 3.1 one readily gets the impression that data clustering is an
ill-formulated problem with an arbitrary number of solutions, especially when
no cluster separation is visually identifiable. However, there are a number of
reasons for applying clustering techniques, such as the following important ones:

- In order to obtain a useful summary and interpretation of the problem
  data. This is the classical purpose of clustering, applied to a wide variety of
  problems. A good review can be found in Hartigan (1975). Data is summarized
  by referring to the attributes of the clusters instead of the attributes of the
  individual patterns. Cluster interpretation can provide guidance on the
  development of scientific theories. Good examples of this last application are
  the clusters of stars, which inspired theories about the life of stars, and the
  clusters of animal species, which inspired evolutionary theories.
- In order to initialise a supervised statistical classification approach. This
  type of application is attractive when data collection is a very slow and costly
  process. Clustering is then used as an unsupervised classification method at a
  starting phase, in order to obtain parametric estimates of the class distributions
  for a small initial set of patterns. These estimates will be updated as more
  patterns become available. The reader can find a good description of this topic
  in Duda and Hart (1973).
- In order to provide centroid estimates for neural network classifiers. The
  centroid of a cluster is the average point in the multidimensional feature space.
  In a sense, it is the centre of gravity of the respective cluster. We will refer to
  this application in detail in chapter five, when discussing radial basis functions.
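The centroid defined above is simply the per-feature mean of the patterns in a cluster. A minimal sketch in Python, using hypothetical two-feature data (the function name and sample values are illustrative, not from the source):

```python
def centroid(patterns):
    """Return the average point (per-feature mean) of a list of feature vectors."""
    n = len(patterns)
    dim = len(patterns[0])
    return [sum(p[d] for p in patterns) / n for d in range(dim)]

# Hypothetical cluster of three two-dimensional patterns.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(centroid(cluster))  # -> [3.0, 4.0]
```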

  There is a large number of clustering algorithms, depending on the type of
data, on the rules for pattern amalgamation in a cluster and on the approach
followed for applying such rules. Regarding the type of amalgamation approach,
two categories of algorithms can be considered:
            -  Hierarchical algorithms
  The hierarchical or tree clustering algorithms use linkage rules between patterns
in order to produce a hierarchical sequence of clustering solutions. Algorithms of this
type are adequate for a reduced and meaningful set of patterns, i.e., the set must
contain representative pattern exemplars.
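As a generic illustration of the hierarchical idea (a sketch, not a specific algorithm from this book), the following Python fragment repeatedly merges the two closest clusters under a single-linkage rule, recording the clustering solution at each level; the data points are hypothetical:

```python
import math

def single_linkage(c1, c2):
    # Single-linkage rule: distance between the closest pair of patterns.
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(patterns):
    """Merge the two closest clusters step by step, recording each
    clustering solution, from all-singletons down to one cluster."""
    clusters = [[p] for p in patterns]
    solutions = [list(clusters)]
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        solutions.append(list(clusters))
    return solutions

# Hypothetical patterns: two well-separated pairs.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
for level in agglomerate(data):
    print(level)
```

The sequence of solutions plays the role of the tree: cutting it at a given level yields a particular number of clusters.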

            -  Centroid adjustment algorithms
  Algorithms of this type use an iterative method in order to adjust the prototypes,
also called centroids, of the clusters, as a consequence of the patterns assigned to
them. In section 3.3 a popular algorithm of this type will be explained.
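As a sketch of the centroid-adjustment idea (a generic k-means-style iteration, not necessarily the particular algorithm presented in section 3.3), each pass assigns every pattern to its nearest centroid and then recomputes each centroid as the mean of its assigned patterns; the data and initial centroids below are hypothetical:

```python
import math

def centroid_adjust(patterns, centroids, iterations=10):
    """Iteratively assign patterns to the nearest centroid and
    recompute each centroid as the mean of its assigned patterns."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in patterns:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # A centroid with no assigned patterns is left unchanged.
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else m
            for c, m in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical patterns forming two groups, with rough initial centroids.
data = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0)]
cents, clus = centroid_adjust(data, [(0.0, 0.0), (1.0, 1.0)])
print(cents)
```

In this toy run the centroids settle near the centres of gravity of the two groups, which is exactly the behaviour described above.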