Page 237 - Classification Parameter Estimation & State Estimation An Engg Approach Using MATLAB
226 UNSUPERVISED LEARNING
Listing 7.2
PRTools code for performing an MDS mapping.
load worldcities; % Load dataset D
options.q = 2;
w = mds(D,2,options); % Map to 2D with q = 2
figure; clf; scatterd(D*w,'both'); % Plot projections
7.2 CLUSTERING
Instead of reducing the number of features, we now focus on reducing
the number of objects in the data set. The aim is to detect ‘natural’
clusters in the data, i.e. clusters which agree with our human interpretation
of the data. Unfortunately, it is very hard to define what a natural
cluster is. In most cases, a cluster is defined as a subset of objects that
resemble one another more than they resemble the objects in other
subsets (clusters).
This immediately introduces the next problem: how is the resemblance
between objects defined? The most important cue for the resemblance of
two objects is the distance between the objects, i.e. their dissimilarity. In
most cases the Euclidean distance between objects is used as a dissimilarity
measure, but there are many other possibilities. The $L_p$ norm is
well known (see Appendix A.1.1 and A.2):
$$
d_p(\mathbf{z}_i, \mathbf{z}_j) = \left( \sum_{n=1}^{N} \left| z_{i,n} - z_{j,n} \right|^{p} \right)^{1/p} \tag{7.8}
$$
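As an illustration of Equation (7.8), the following is a minimal sketch in base MATLAB, without PRTools. The helper name lp_dist and the two example vectors are assumptions made for this illustration only. The absolute value of each difference is taken before raising it to the power p, so that odd values of p also yield a valid distance.

```matlab
% Minimal sketch of the L_p distance of Equation (7.8), base MATLAB only.
% lp_dist is a hypothetical helper name, not a PRTools function.
lp_dist = @(zi, zj, p) sum(abs(zi - zj).^p)^(1/p);

% Two example objects with N = 3 features (illustrative values)
zi = [1 2 3];
zj = [4 6 3];

d1 = lp_dist(zi, zj, 1);   % p = 1, city-block distance: 3 + 4 + 0 = 7
d2 = lp_dist(zi, zj, 2);   % p = 2, Euclidean distance: sqrt(9 + 16) = 5
```

With p = 2 the familiar Euclidean distance is obtained; with p = 1 the differences per feature are simply added.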
The cosine distance uses the angle between two vectors as a dissimilarity
measure. It is often used in the automatic clustering of text documents:
$$
d(\mathbf{z}_i, \mathbf{z}_j) = 1 - \frac{\mathbf{z}_i^{T} \mathbf{z}_j}{\|\mathbf{z}_i\|_2 \, \|\mathbf{z}_j\|_2} \tag{7.9}
$$
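The cosine distance of Equation (7.9) can be sketched in the same way, again in base MATLAB without PRTools. The helper name cos_dist and the example vectors are assumptions made for this illustration. The example shows that only the direction of the vectors matters, not their length.

```matlab
% Minimal sketch of the cosine distance of Equation (7.9), base MATLAB only.
% cos_dist is a hypothetical helper name, not a PRTools function.
% zi(:)' * zj(:) is the inner product; norm() returns the Euclidean (L_2) norm.
cos_dist = @(zi, zj) 1 - (zi(:)' * zj(:)) / (norm(zi) * norm(zj));

a = [1 0];
b = [0 2];   % orthogonal to a
c = [3 0];   % same direction as a, but three times as long

d_ab = cos_dist(a, b);   % orthogonal vectors: distance 1
d_ac = cos_dist(a, c);   % parallel vectors: distance 0, despite different lengths
```

This insensitivity to vector length is one reason why the measure suits text documents: two documents with the same relative word frequencies but different lengths end up at distance zero.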
In PRTools the basic distance computation is implemented in the
function proxm, which offers several methods for computing distances
(and similarities). Besides the two basic distances mentioned above, some
similarity measures are also defined, such as the inner product between
vectors and the Gaussian kernel. The function is implemented as a
mapping. When a mapping w is trained on some data z, the application