


     An important aspect of data clustering is which patterns or cases to choose for
   deriving the cluster solution. Usually, in data clustering, one wants typical cases to
   be represented, i.e., cases that the designer suspects of typifying the structure of the
   data. If the clustering solution supports the typical cases, it is generally considered a
   good solution. A totally different perspective is followed in supervised
   classification, where no distinction is made between typical and atypical cases,
   since a classifier is expected to perform uniformly for all cases. Therefore, random
   data sampling, a requirement when designing supervised classifiers, is not usually
   needed in data clustering. It is only when one is interested in generalizing the
   clustering results that the issue of random sampling should be considered.



   3.2  The Standardization Issue

    Data clustering explores the metric properties of the feature vectors in the feature
    space (described in Section 2.2) in order to join them into meaningful groups. As one
    has no previous knowledge concerning prototypes or cluster shape, there is an
    arbitrarily large number of data clustering solutions that can be radically different
    from each other. As an example, consider the +Cross data available in the
    Cluster.xls file, which is represented in Figure 3.2a.


















    Figure 3.2. Cross data with: (a) Euclidean clustering; (b) City-block clustering.




      Imagine that we are searching for a 2-cluster solution that minimizes the
     within-cluster average error:






     with n_i different pairs of patterns x, y in cluster ω_i.
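
     As a minimal sketch of how such an error could be evaluated, the Python code below
     assumes that the within-cluster average error is the sum, over the clusters, of the
     mean distance between the different pairs of patterns in each cluster. This form, the
     function names and the eight cross-shaped points are illustrative assumptions; the
     points are not the +Cross data from the spreadsheet file.

```python
from itertools import combinations
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """City-block (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def within_cluster_average_error(clusters, dist):
    """Sum over clusters of the mean distance over the different pairs in each cluster.

    Assumed form of the error; clusters with fewer than two patterns contribute zero.
    """
    total = 0.0
    for cluster in clusters:
        pairs = list(combinations(cluster, 2))  # the n_i different pairs
        if pairs:
            total += sum(dist(x, y) for x, y in pairs) / len(pairs)
    return total


# Hypothetical cross-shaped patterns (not the book's data set).
arms = [
    [(-2, 0), (-1, 0), (1, 0), (2, 0)],   # horizontal arm
    [(0, -2), (0, -1), (0, 1), (0, 2)],   # vertical arm
]
halves = [
    [(-2, 0), (-1, 0), (0, -2), (0, -1)],  # left and bottom patterns
    [(1, 0), (2, 0), (0, 1), (0, 2)],      # right and top patterns
]

for name, partition in (("arms", arms), ("halves", halves)):
    for metric in (euclidean, city_block):
        error = within_cluster_average_error(partition, metric)
        print(name, metric.__name__, round(error, 3))
```

     Comparing the printed values for the two candidate partitions under each metric
     shows how the choice of distance can change which 2-cluster solution has the
     smaller error.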