$$ h(X) = V^T X + v_0 \;\overset{\omega_1}{\underset{\omega_2}{\lessgtr}}\; 0 \qquad (4.18) $$
The term h(X) is a linear function of X and is called a linear discriminant function. Our design task is to find the optimum coefficients V = [v_1 ... v_n]^T and the threshold value v_0 for the given distributions under various criteria. The linear discriminant function becomes the minus-log likelihood ratio when the given distributions are normal with equal covariance matrices.
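For this normal, equal-covariance case, the coefficients are available in closed form: V = Σ^{-1}(M_2 - M_1) and, for equal priors, v_0 = -(1/2)(M_1 + M_2)^T V. A minimal numpy sketch of this standard result (the function name and example numbers are ours, not the text's):

```python
import numpy as np

def linear_discriminant_equal_cov(M1, M2, Sigma):
    """Coefficients of h(X) = V^T X + v0, the minus-log likelihood ratio
    for N(M1, Sigma) vs. N(M2, Sigma), assuming equal prior probabilities."""
    V = np.linalg.solve(Sigma, M2 - M1)   # V = Sigma^{-1} (M2 - M1)
    v0 = -0.5 * (M1 + M2) @ V             # boundary passes midway between the means
    return V, v0

# X is classified to omega_1 when h(X) < 0 and to omega_2 when h(X) > 0.
M1, M2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
V, v0 = linear_discriminant_equal_cov(M1, M2, Sigma)
print(V @ M1 + v0 < 0, V @ M2 + v0 > 0)   # True True: each mean on its own side
```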
However, the reader should be cautioned that no linear classifier works well for distributions which are separated not by the mean-difference but by the covariance-difference. In this case, we have no choice but to adopt a more complex classifier, such as a quadratic one. The first and second terms of the Bhattacharyya distance, (3.152), indicate where the class separability comes from, namely the mean-difference or the covariance-difference.
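For two normal distributions, this check amounts to evaluating the two terms of (3.152) separately. A small numpy sketch under the standard definition of the Bhattacharyya distance (the function name is ours):

```python
import numpy as np

def bhattacharyya_terms(M1, Sigma1, M2, Sigma2):
    """The two terms of the Bhattacharyya distance: the first grows with
    the mean-difference, the second with the covariance-difference."""
    S = 0.5 * (Sigma1 + Sigma2)
    dM = M2 - M1
    term_mean = 0.125 * dM @ np.linalg.solve(S, dM)
    term_cov = 0.5 * np.log(np.linalg.det(S)
                            / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return term_mean, term_cov

# Same means, different covariances: the first term vanishes, so all the
# separability comes from the covariance-difference -- a linear classifier fails.
M = np.zeros(2)
print(bhattacharyya_terms(M, np.eye(2), M, 4.0 * np.eye(2)))   # (0.0, ~0.223)
```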

                      Optimum Design Procedure

Equation (4.18) indicates that an n-dimensional vector X is projected onto a vector V, and that the variable y = V^T X in the projected one-dimensional h-space is classified to either ω_1 or ω_2, depending on whether y < -v_0 or y > -v_0. Figure 4-7 shows an example in which distributions are projected onto two vectors, V and V'. On each mapped space, the threshold v_0 is chosen to separate the ω_1- and ω_2-regions, resulting in the hatched error probability. As seen in Fig. 4-7, the error on V is smaller than that on V'. Therefore, the optimum design procedure for a linear classifier is to select the V and v_0 which give the smallest error in the projected h-space.
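With design samples, this procedure can be carried out empirically: project the samples onto a candidate V, scan the threshold along the projected axis, and keep the (V, v_0) pair with the smallest error, exactly as Fig. 4-7 suggests. A rough sketch with our own naming conventions (samples stored one per row):

```python
import numpy as np

def best_threshold_error(V, X1, X2):
    """Smallest empirical error over thresholds on y = V^T X, assigning
    omega_1 to the low side of the threshold (y < -v0 -> omega_1)."""
    y1, y2 = X1 @ V, X2 @ V
    y = np.sort(np.concatenate([y1, y2]))
    candidates = 0.5 * (y[:-1] + y[1:])            # midpoints of sorted projections
    n = len(y1) + len(y2)
    errors = [(np.sum(y1 >= t) + np.sum(y2 < t)) / n for t in candidates]
    i = int(np.argmin(errors))
    return errors[i], -candidates[i]               # (error, v0): threshold is -v0

# Comparing two candidate directions as in Fig. 4-7: keep whichever
# direction yields the smaller projected error.
```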
When X is normally distributed, h(X) of (4.18) is also normal. Therefore, the error in the h-space is determined by η_i = E{h(X)|ω_i} and σ_i^2 = Var{h(X)|ω_i}, which are functions of V and v_0. Thus, as will be discussed later, the error may be minimized with respect to V and v_0. Even if X is not normally distributed, h(X) could be close to normal for large n, because h(X) is the summation of n terms and the central limit theorem may come into effect. In this case, a function of η_i and σ_i^2 could be a reasonable criterion to measure the class separability in the h-space.
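Under this normality assumption the quantities are explicit: η_i = V^T M_i + v_0, σ_i^2 = V^T Σ_i V, and the error becomes P_1 Φ(η_1/σ_1) + P_2 Φ(-η_2/σ_2), where Φ is the standard normal distribution function. A sketch of this computation (the function name and the scipy dependency are our choices; the minimization over V and v_0 is left to the later discussion the text refers to):

```python
import numpy as np
from scipy.stats import norm

def h_space_error(V, v0, M1, Sigma1, M2, Sigma2, P1=0.5):
    """Error probability when h(X) = V^T X + v0 is normal under each class:
    eta_i = V^T M_i + v0 and sigma_i^2 = V^T Sigma_i V determine it completely."""
    eta1, eta2 = V @ M1 + v0, V @ M2 + v0
    s1, s2 = np.sqrt(V @ Sigma1 @ V), np.sqrt(V @ Sigma2 @ V)
    # omega_1 errs when h > 0; omega_2 errs when h < 0
    return P1 * norm.cdf(eta1 / s1) + (1.0 - P1) * norm.cdf(-eta2 / s2)
```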