
            The expression y*w then computes the distance between any vector in z
            and any vector in y. For the squared Euclidean distance a short-cut
            function distm is defined.


            Listing 7.3
            PRTools code for defining and applying a proximity mapping.


            z = gendatb(3);       % Create some train data
            y = gendats(5);       % and some test data
            w = proxm(z,'d',2);   % Squared Euclidean distance to z
            D = y*w;              % 5 x 3 distance matrix
            D = distm(y,z);       % The same 5 x 3 distance matrix
            w = proxm(z,'o');     % Cosine distance to z
            D = y*w;              % New 5 x 3 distance matrix
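
            As a quick check (not part of Listing 7.3), one can verify that both
            routes yield the same distance matrix up to numerical precision; the
            variable names below are illustrative only:

            wd  = proxm(z,'d',2);           % squared Euclidean proximity mapping
            D1  = +(y*wd);                  % unary + converts the PRTools output
            D2  = +distm(y,z);              % to a plain matrix of doubles
            err = max(max(abs(D1 - D2)));   % should be (close to) zero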

            The distance between objects should reflect the important structures in
            the data set. It is assumed in all clustering algorithms that distances
            between objects are informative. This means that when objects are close
            in the feature space, they should also resemble each other in the real
            world. When the distances are not defined sensibly, and remote objects
            in the feature space correspond to similar real-world objects, no clus-
            tering algorithm will be able to give acceptable results without extra
            information from the user. In these sections it will therefore be assumed
            that the features are scaled such that the distances between objects are
            informative. Note that a cluster does not necessarily correspond directly
            to a class. A class can consist of multiple clusters, or multiple classes may
            form a single cluster (and will therefore probably be hard to discriminate
            between).
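              A minimal sketch of such a rescaling, assuming the PRTools mapping
            scalem with the 'variance' option is available, could look as follows:

            w_sc = scalem(z,'variance');  % mapping that scales each feature to unit variance
            z_sc = z*w_sc;                % rescaled data set; all features now contribute
                                          % comparably to the Euclidean distances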
              Because clustering is unsupervised, it is very hard to evaluate a
            clustering result. Different clustering methods will yield a different set of
            clusters, and the user has to decide which clustering is to be preferred.
            A quantitative measure of the quality of the clustering is the average
            distance of the objects to their respective cluster centre. Assume that the
            objects z_i (i = 1, ..., N_S) are clustered in K clusters C_k
            (k = 1, ..., K) with cluster centres μ_k, and that N_k objects are
            assigned to cluster C_k:

                J = \frac{1}{N_S} \sum_{k=1}^{K} \frac{1}{N_k}
                    \sum_{\mathbf{z} \in C_k} \left\| \mathbf{z} - \boldsymbol{\mu}_k \right\|^2        (7.10)
            Other criteria, such as the ones defined in Chapter 5 for supervised
            learning, can also be applied.
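              As an illustration, the criterion of Equation (7.10) can be computed
            directly from a data matrix and a label vector. The sketch below
            assumes z is a plain N_S-by-D matrix of doubles and lab an N_S-by-1
            vector with cluster labels 1, ..., K; both names are chosen for
            illustration only:

            K   = max(lab);                  % number of clusters
            N_S = size(z,1);                 % total number of objects
            J   = 0;
            for k = 1:K
              zk   = z(lab==k,:);            % objects assigned to cluster C_k
              N_k  = size(zk,1);
              mu_k = mean(zk,1);             % cluster centre of C_k
              d2   = sum((zk - repmat(mu_k,N_k,1)).^2, 2);  % squared distances to centre
              J    = J + sum(d2)/N_k;        % average within cluster k
            end
            J = J/N_S;                       % average over all clusters, Equation (7.10)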