
3.2 The Standardization Issue

      First  of  all,  the  need  for  any  standardization  must  be  questioned.  If  the
    interesting  clusters  are  based  on  the  original  features,  then  any  standardization
    method  may  distort  or mask  those clusters. It is only when  there are grounds to
    search for clusters in a transformed space that some standardization rule should be
    used.
      Several simple standardization methods have been proposed for achieving scale
    invariance or at least attempting a balanced contribution of  all features to distance
    measurements:

      y_i = (x_i - m)/s,  with m, s resp. the mean and standard deviation of x_i;   (3-2a)
      y_i = (x_i - min(x_i))/(max(x_i) - min(x_i));                                 (3-2b)
      y_i = x_i/(max(x_i) - min(x_i));                                              (3-2c)
      y_i = x_i/a.                                                                  (3-2d)
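
The four rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, each function works on the values of a single feature, and the population standard deviation is assumed for (3-2a), since the text does not say which estimate is meant.

```python
from statistics import mean, pstdev


def zscore(x):
    """(3-2a): subtract the mean m and divide by the standard deviation s."""
    m = mean(x)
    s = pstdev(x)  # assumption: population standard deviation
    return [(v - m) / s for v in x]


def min_max(x):
    """(3-2b): map the feature onto the interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def range_scale(x):
    """(3-2c): divide by the range, without shifting the origin."""
    r = max(x) - min(x)
    return [v / r for v in x]


def constant_scale(x, a):
    """(3-2d): divide by a constant scale factor a chosen by the analyst."""
    return [v / a for v in x]


# Hypothetical values of one feature, used only to exercise the functions.
x = [2.0, 4.0, 6.0, 8.0]
y_a = zscore(x)
y_b = min_max(x)
y_c = range_scale(x)
y_d = constant_scale(x, 10.0)
```

Note that (3-2a), (3-2b) and (3-2c) fail with a division by zero for a constant feature, so such features would need separate handling.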

       There  is  also,  of  course,  the  more  sophisticated  orthonormal  transformation,
     described  in  section  2.3,  preserving  the  Mahalanobis  distance.  All  these
     standardization methods  have  some  pitfalls.  Consider,  for  instance,  the  popular
     standardization method of obtaining scale invariance by using transformed features
     with  zero  mean  and  unit  variance  (3-2a).  An  evident  pitfall  is  that  semantic
     information  from  the  features  can  be  lost  with  this  standardization.  Another
     problem is that this unit variance standardization is only adequate if the differing
     feature variances are due only to random variation. If the differing variances are
     instead due to the data being partitioned into distinct clusters, this standardization
     may produce totally wrong results.
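
The second pitfall can be illustrated with a small, made-up data set. In the sketch below the two clusters, the helper names and the separation measure (smallest between-cluster distance divided by largest within-cluster distance) are all assumptions chosen for illustration. The first feature has a large variance only because it separates the clusters, so dividing it by its standard deviation shrinks exactly the gap we would like to detect.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev

# Hypothetical data: the clusters differ only in the first feature.
cluster_a = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
cluster_b = [(10.0, 0.0), (11.0, 2.0), (12.0, 4.0)]
points = cluster_a + cluster_b


def standardize(pts):
    """Apply (3-2a) to every feature: zero mean and unit variance."""
    columns = list(zip(*pts))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(p, stats)) for p in pts]


def separation(a, b):
    """Smallest between-cluster distance over largest within-cluster distance."""
    between = min(dist(p, q) for p in a for q in b)
    within = max(dist(p, q) for group in (a, b) for p, q in combinations(group, 2))
    return between / within


z = standardize(points)
raw_ratio = separation(cluster_a, cluster_b)  # computed on the original features
std_ratio = separation(z[:3], z[3:])          # computed after standardization
```

On this data the ratio is above 1 for the original features and drops below 1 after standardization. Some points of the same cluster then lie farther apart than points of different clusters, which is the kind of masking described above.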
       If we know beforehand the type of clusters we are dealing with, we can devise a
     suitable standardization method. This poses the following vicious circle:
     1. In order to perform clustering we need an appropriate distance measure.
     2. The  appropriate  distance  measure  depends  on  the  feature  standardization
       method.
     3. In  order  to  select  the  standardization  method  we  need  to  know  the  type  of
       clusters needed.

       There is no methodological way out of this vicious circle except by a "trial and
     error"  approach,  experimenting  with  various  alternatives  and  evaluating  the
     corresponding solutions aided by  visual  inspection, data interpretation and utility
     considerations. An easy standardization method that we will often follow, and that
     frequently achieves good results, is division or multiplication by a simple scale
     factor (e.g. a power of 10), properly chosen so that all feature values occupy a
     suitable interval. This corresponds to method (3-2d). In this way we can balance
     the contribution of the features and still retain semantic information.
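
As a rough sketch of this scale-factor approach, the code below picks a power of 10 for one feature and applies (3-2d). The selection rule, which brings the largest absolute value into the interval [1, 10), is only one possible assumption. The text leaves the choice of factor to the analyst, and the function names and sample values are hypothetical.

```python
from math import floor, log10


def power_of_ten_factor(values):
    """Return a power of 10 that brings the largest |value| into [1, 10)."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return 1.0  # an all-zero feature needs no scaling
    return 10.0 ** floor(log10(largest))


def scale_by_constant(values, a):
    """(3-2d): divide every value of the feature by the scale factor a."""
    return [v / a for v in values]


# Hypothetical feature measured in units that yield values in the thousands.
areas = [1500.0, 2500.0, 800.0]
a = power_of_ten_factor(areas)
scaled = scale_by_constant(areas, a)
```

Because the factor is a plain power of 10, the scaled values amount to a change of unit and can still be read directly, which is how semantic information is retained.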