ignore this problem, and we will assume that all features are real valued.
(Obviously, we will lose some performance using this assumption.)
9.1.2 Simple classification methods
Given the varying nature of the different features, and the fact that further expert knowledge is not given, it will be difficult to construct a good model for this data. The scatter diagram of Figure 9.1 shows that an assumption of Gaussian distributed data is clearly wrong (if only because of the presence of the discrete features), but when just classification performance is considered, the decision boundary might still be good enough. Perhaps more flexible methods such as the Parzen density or the k-nearest neighbour method will perform better, after a suitable feature selection and feature scaling.
Let us start with some baseline methods and train a linear and a quadratic Bayes classifier, ldc and qdc:
Listing 9.1
% Load the housing dataset, and set the baseline performance
load housing.mat;
z                                      % Show what dataset we have
w = ldc;                               % Define an untrained linear classifier
err_ldc_baseline = crossval(z,w,5)     % Perform 5-fold cross-validation
err_qdc_baseline = crossval(z,qdc,5)   % idem for the quadratic classifier
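
As a possible next step (a sketch under assumptions, not one of the book's listings), the more flexible Parzen and k-nearest neighbour classifiers mentioned above could be evaluated in the same way. The sketch assumes the PRTools mappings scalem, parzenc and knnc are available, and builds an untrained pipeline so that the variance scaling is re-estimated on each training fold of the cross-validation:

% Sketch (not from the book): flexible classifiers with feature scaling.
% The scaling mapping is left untrained so that crossval estimates it on
% each training fold separately, avoiding information leakage.
u_parzen = scalem([],'variance')*parzenc;     % scaling followed by Parzen
u_knn = scalem([],'variance')*knnc;           % scaling followed by k-NN
err_parzen_baseline = crossval(z,u_parzen,5)  % 5-fold cross-validation
err_knn_baseline = crossval(z,u_knn,5)        % idem for k-NN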
Figure 9.1 Scatter plots of the Boston Housing data set. The left subplot shows features STATUS and INDUSTRY, where the discrete nature of INDUSTRY can be spotted. In the right subplot, the data set is first scaled to unit variance, after which it is projected onto its first two principal components.
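
For reference, plots like those in Figure 9.1 could be produced along the following lines. This is a hypothetical sketch: the feature indices assumed for STATUS and INDUSTRY (here 13 and 3, as in the standard Boston Housing feature ordering) may differ in housing.mat, and older PRTools versions name the PCA mapping pca instead of pcam.

figure; scatterd(z(:,[13 3]));  % left subplot: STATUS versus INDUSTRY (assumed indices)
w_s = scalem(z,'variance');     % scale the data set to unit variance
w_pca = pcam(z*w_s,2);          % mapping onto the first two principal components
figure; scatterd(z*w_s*w_pca);  % right subplot: projection onto the two components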

