

B.7  k-NN Bounds


The k-NN Bounds.xls file allows the computation of error bounds for a k-NN classifier. Bounds for a number of neighbours, k = 1, 3, 9 and 21, are already computed and presented (Pe denotes the Bayes error). Bounds for other values of k are easily computed by filling in the C(k, 1) column.
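The exact C(k, 1)-based bounds of the workbook are not reproduced here. As an illustration only (not the workbook's formula), the following Python sketch computes the classical Cover-Hart asymptotic bounds for the two-class 1-NN error, Pe <= Pe(1-NN) <= 2 Pe (1 - Pe), for an assumed Bayes error Pe; the function name is hypothetical.

  def nn_asymptotic_bounds(bayes_error):
      # Cover-Hart asymptotic bounds for the two-class 1-NN error:
      # Pe <= Pe(1-NN) <= 2*Pe*(1 - Pe).
      pe = float(bayes_error)
      return pe, 2.0 * pe * (1.0 - pe)

  if __name__ == "__main__":
      for pe in (0.05, 0.10, 0.20):
          lo, hi = nn_asymptotic_bounds(pe)
          print("Pe = %.2f:  %.3f <= Pe(1-NN) <= %.3f" % (pe, lo, hi))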

Author: J.P. Marques de Sá, Engineering Faculty, Oporto University.


                          B.8 k-NN Classification


                          The  KNN  program  allows  k-NN  classification to  be  performed  on  a  two-class
dataset, using either a partition or an edit approach.
                            Data is read from a text file with the following format:
  n    number of patterns.
  n1   number of patterns of the first class.
  d    dimension.
  ...  n lines with d values, the first n1 lines for the first class, followed by
       n - n1 lines for the second class.

The first line of the text file is the total number of patterns, n (n ≤ 500); n1 is the number of patterns belonging to class ω1; d is the number of features (d ≤ 6). Succeeding lines represent the feature vectors, with the d feature values separated by commas. The first n1 lines must contain the feature values of the class ω1 patterns.
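For readers who wish to process such files outside the KNN program, the following Python sketch reads a file in this format, assuming that n, n1 and d each occupy one header line (the reader function name is hypothetical):

  import numpy as np

  def read_two_class_file(path):
      # Header: n (total patterns), n1 (patterns of class 1), d (dimension),
      # followed by n lines of d comma-separated feature values.
      with open(path) as f:
          n  = int(f.readline())
          n1 = int(f.readline())
          d  = int(f.readline())
          X  = np.loadtxt(f, delimiter=",", max_rows=n).reshape(n, d)
      # The first n1 patterns belong to class 1 (omega 1), the rest to class 2.
      y = np.r_[np.ones(n1, dtype=int), 2 * np.ones(n - n1, dtype=int)]
      return X, y

  # Example (assuming the distributed file layout):
  # X, y = read_two_class_file("N-PRT10.txt")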
                            In the  "Specifications" frame the user must fill in the file name, the value of k
                          (number of  neighbours) and choose either the partition  or the edit method. If  the
                          partition method is chosen, the number of partitions must also be specified.
                            Classification of  the  data is  obtained by  clicking the  "Compute" button. The
                          program  then  shows  the classification matrix  with  the  class and  overall  test  set
                          errors, in percentage values. For the partition method, the standard deviation of the
                          errors across the partitions is also presented.
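The partition method can be approximated outside the program, for instance with scikit-learn. The sketch below is an assumption about the partitioning scheme, not necessarily the program's exact internal procedure: each partition is used once as the test set, the remaining patterns form the training set, and the mean and sample standard deviation of the per-partition errors are reported in percent.

  import numpy as np
  from sklearn.model_selection import StratifiedKFold
  from sklearn.neighbors import KNeighborsClassifier

  def partition_knn_error(X, y, k=1, n_partitions=2):
      # Rotate the test-set role over the partitions; errors are in percent.
      skf = StratifiedKFold(n_splits=n_partitions, shuffle=False)
      errors = []
      for train_idx, test_idx in skf.split(X, y):
          clf = KNeighborsClassifier(n_neighbors=k)
          clf.fit(X[train_idx], y[train_idx])
          err = 100.0 * np.mean(clf.predict(X[test_idx]) != y[test_idx])
          errors.append(err)
      # Sample standard deviation (n - 1 divisor) across partitions.
      return np.mean(errors), np.std(errors, ddof=1)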
Suppose that the distributed N-PRT10.txt file is used. This file contains the N, PRT10 feature values corresponding to the first two classes of the cork stoppers data, with a total of 100 patterns, the first 50 from class ω1.
                            Performing  a  k-NN  classification  with  one  neighbour  and  2  partitions,  one
                          obtains an overall error of  21%  with  9.9%  standard deviation. If the  "Stepwise"
                          box  is  checked,  the  classification  matrices for  the  individual  partitions  can  be
observed. For the first step, patterns 1 through 25 and 51 through 75 are used for
                          test, the others for training. An  overall error of  28%  is obtained. For the second
                          step,  the  set  roles  are  reversed  and  an  overall  error  of  14%  is  obtained. The
                          individual test set errors are distributed around the 21%  average value with 9.9%
                          standard deviation.
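These figures are simply the mean and sample standard deviation of the two per-partition errors; assuming the program uses the n - 1 divisor, a quick check reproduces them:

  import numpy as np

  errs = np.array([28.0, 14.0])      # per-partition test errors (%)
  print(np.mean(errs))               # 21.0
  print(np.std(errs, ddof=1))        # 9.899... (approximately 9.9)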