Page 319 -
P. 319
308 Aooendix B . CD Tools
8.7 k-NN Bounds
The k-NN Bounds.xls file allows the computation of error bounds for a k-NN
classifier. Bounds for a number of neighbours k=l, 3, 9 and 21 are already
computed and presented @ denotes the Bayes error). Bounds for other values of k
are easily computed by filling in the C(k, 1) column.
Author: JP Marques de SB, Engineering Faculty, Oporto University.
B.8 k-NN Classification
The KNN program allows k-NN classification to be performed on a two-class
dataset using either a partition or an edition approach.
Data is read from a text file with the following format:
n number of patterns.
n, number of patterns of the first class.
d dimension.
. . . n lines with d values, first nl lines for the first class, followed by n-nl lines
for the second class.
The first line of the text file is the total number of patterns, n (n i 500); n, is the
number of patterns belonging to class w,; d is the number of features (d 6).
Succeeding lines represent the feature vectors, with d feature values separated by
commas. The first nl lines must contain the feature values relative to class wl
patterns.
In the "Specifications" frame the user must fill in the file name, the value of k
(number of neighbours) and choose either the partition or the edit method. If the
partition method is chosen, the number of partitions must also be specified.
Classification of the data is obtained by clicking the "Compute" button. The
program then shows the classification matrix with the class and overall test set
errors, in percentage values. For the partition method, the standard deviation of the
errors across the partitions is also presented.
Suppose that the distributed N-PRTlO.txt file is used. This file contains the N,
PRTlO feature values corresponding to the first two classes of the cork stoppers
data, with a total of 100 patterns, the first 50 from class wl.
Performing a k-NN classification with one neighbour and 2 partitions, one
obtains an overall error of 21% with 9.9% standard deviation. If the "Stepwise"
box is checked, the classification matrices for the individual partitions can be
observed. For the first step, patterns 1 through 25 and 51 though 75 are used for
test, the others for training. An overall error of 28% is obtained. For the second
step, the set roles are reversed and an overall error of 14% is obtained. The
individual test set errors are distributed around the 21% average value with 9.9%
standard deviation.