2.3 The Covariance Matrix

Suppose that we apply a linear transformation with the transpose of Z, u = Z'y, to the vectors y with covariance C. Then, as shown in Appendix C, the new covariance matrix will be diagonal (uncorrelated features), having the squares of the eigenvalues as new variances and preserving the Mahalanobis distances. This orthonormal transformation is also called the Karhunen-Loève transformation.

Notice that we have been using the eigenvectors computed from the original linear transformation matrix A. Usually we do not know this matrix, and instead we compute the eigenvectors of C itself, which are the same. The corresponding eigenvalues of C are, however, the squares of the ones computed from A, and represent the variances along the eigenvectors; equivalently, the eigenvalues of A represent the standard deviations, as indicated in Figure 2.15. It is customary to sort the eigenvectors by decreasing eigenvalue magnitude, the first one corresponding to the direction of maximum variance, as shown in Figure 2.15, the second one to the direction of the maximum remaining variance, and so on, until the last one (λ2 in Figure 2.15), representing only a residual variance.

Appendix C also includes the demonstration of the positive definiteness of the covariance matrices and other equally interesting results.
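As a concrete illustration, the following short sketch (using NumPy on hypothetical two-dimensional data, not the book's example) applies the orthonormal transformation u = Z'y, where the columns of Z are the eigenvectors of C, and checks that the transformed features are uncorrelated, with variances equal to the eigenvalues of C:

```python
import numpy as np

# Minimal sketch of the orthonormal (Karhunen-Loeve) transformation u = Z'y.
# The data below are hypothetical, not the book's example.
rng = np.random.default_rng(0)
y = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)

C = np.cov(y, rowvar=False)              # sample covariance matrix
eigvals, Z = np.linalg.eigh(C)           # eigenvalues and orthonormal eigenvectors of C
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, Z = eigvals[order], Z[:, order]

u = y @ Z                                # u = Z'y, applied to each row vector of y
print(np.cov(u, rowvar=False).round(3))  # diagonal: the new features are uncorrelated
print(eigvals.round(3))                  # the diagonal entries equal these eigenvalues
```

The off-diagonal entries of the transformed covariance vanish (up to rounding), and the new variances are the eigenvalues of C, i.e., the squares of the standard deviations along the eigenvector directions.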


   2.4  Principal Components


As mentioned in the previous chapter, the initial choice of features in PR problems is mostly guided by common-sense ideas of pattern discrimination. Therefore, it is not unusual to have a large initial set constituted by features that may exhibit high correlations among them, and whose contribution to pattern discrimination may vary substantially. Large feature sets are inconvenient for a number of reasons, an obvious one being the computational burden. Less obvious and more compelling reasons will be explained later. A common task in PR problems is therefore to perform some type of feature selection. In the initial phase of a PR project, such selection aims either to discard features whose contribution is insignificant, as will be described in section 2.5, or to perform some kind of dimensionality reduction by using an alternative and smaller set of features derived from the initial ones.
Principal components analysis is a method commonly used for data reduction purposes. It is based on the idea of performing an orthonormal transformation, as described in the previous section, retaining only the significant eigenvectors. As explained in the section on the orthonormal transformation, each eigenvector is associated with a variance represented by the corresponding eigenvalue. Each eigenvector corresponding to an eigenvalue that represents a significant variance of the whole dataset is called a principal component of the data. For instance, in the example portrayed in Figure 2.15 the first eigenvector represents λ1²/(λ1² + λ2²) = 98% of the total variance; in short, z1 alone contains practically all the information about the data.
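This variance-fraction computation can be sketched as follows (again with NumPy on hypothetical data; the 95% retention threshold is an arbitrary illustrative choice, not the book's):

```python
import numpy as np

# Minimal sketch: fraction of the total variance explained by each eigenvector,
# i.e. eigenvalue_i / sum of eigenvalues, using hypothetical two-feature data.
y = np.random.default_rng(1).multivariate_normal(
    mean=[0, 0], cov=[[9.0, 2.0], [2.0, 0.5]], size=1000)

C = np.cov(y, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues of C in decreasing order
explained = eigvals / eigvals.sum()              # per-component variance fraction
print(explained.round(4))                        # the first component dominates

# Keep only the leading components needed to explain, say, 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
print(k)
```

Here the eigenvalues are taken directly from C, so they are the variances themselves; the printed fractions correspond to λ1²/(λ1² + λ2²) and λ2²/(λ1² + λ2²) in the notation of Figure 2.15.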
Let us see how this works in a real situation. Figure 2.16 shows the sorted list of the eigenvalues (classes ω1, ω2) of the cork stoppers data, computed with Statistica. The ninth eigenvalue, for instance, is responsible for about 0.01% of the total