Page 237 - Classification Parameter Estimation & State Estimation An Engg Approach Using MATLAB

226                                     UNSUPERVISED LEARNING

            Listing 7.2
            PRTools code for performing an MDS mapping.

            load worldcities;                  % Load dataset D
            options.q = 2;
            w = mds(D,2,options);              % Map to 2D with q = 2
            figure; clf; scatterd(D*w,'both'); % Plot projections



            7.2   CLUSTERING


            Instead of reducing the number of features, we now focus on reducing
            the number of objects in the data set. The aim is to detect ‘natural’
            clusters in the data, i.e. clusters that agree with our human
            interpretation of the data. Unfortunately, it is very hard to define
            what a natural cluster is. In most cases, a cluster is defined as a
            subset of objects for which the resemblance between objects within the
            subset is larger than their resemblance to objects in other subsets
            (clusters).
              This immediately introduces the next problem: how is the resemblance
            between objects defined? The most important cue for the resemblance of
            two objects is the distance between them, i.e. their dissimilarity. In
            most cases the Euclidean distance between objects is used as a
            dissimilarity measure, but there are many other possibilities. The
            distance based on the $L_p$ norm is well known (see Appendix A.1.1 and
            A.2):


            \[ d_p(\mathbf{z}_i, \mathbf{z}_j) = \left( \sum_{n=1}^{N} \left| z_{i,n} - z_{j,n} \right|^{p} \right)^{1/p} \qquad (7.8) \]

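
The L_p distance of equation (7.8) can be sketched in a few lines of plain MATLAB. This is a minimal illustration and not the PRTools implementation; the helper name lpdist and the vectors zi and zj are assumptions chosen for the example.

```matlab
% Minimal sketch of the L_p distance (7.8) between two feature vectors.
% lpdist, zi and zj are illustrative names and values, not from the text.
lpdist = @(a, b, p) sum(abs(a - b).^p)^(1/p);

zi = [1 2 3];
zj = [4 6 3];

d2 = lpdist(zi, zj, 2);   % p = 2: Euclidean distance
d1 = lpdist(zi, zj, 1);   % p = 1: city-block distance
```

With p = 2 the sketch reduces to the Euclidean distance mentioned in the text. The absolute value matters for odd p, where signed differences could otherwise cancel.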
            The cosine distance uses the angle between two vectors as a dissimilarity
            measure. It is often used in the automatic clustering of text documents:


            \[ d(\mathbf{z}_i, \mathbf{z}_j) = 1 - \frac{\mathbf{z}_i^{T} \mathbf{z}_j}{\|\mathbf{z}_i\|_2 \, \|\mathbf{z}_j\|_2} \qquad (7.9) \]

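
The cosine distance of equation (7.9) can be sketched in the same way. Again this is plain MATLAB under stated assumptions, not the PRTools routine; cosdist and the example vectors are hypothetical names introduced for illustration, and the sketch assumes neither vector is all zeros.

```matlab
% Minimal sketch of the cosine distance (7.9).
% cosdist and the vectors below are illustrative; zero vectors are not handled.
cosdist = @(a, b) 1 - (a(:)' * b(:)) / (norm(a) * norm(b));

zi = [1 0];
zj = [0 1];   % orthogonal to zi
zk = [3 0];   % same direction as zi, different length

d_orth = cosdist(zi, zj);   % orthogonal vectors
d_par  = cosdist(zi, zk);   % parallel vectors
```

Because only the angle enters the measure, vectors that point in the same direction have distance zero regardless of their lengths. That is one reason the measure suits text documents, where the lengths of word-count vectors vary with document size.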
            In PRTools the basic distance computation is implemented in the
            function proxm. Several methods for computing distances (and
            similarities) are defined. Next to the two basic distances mentioned
            above, some similarity measures are also defined, such as the inner
            product between vectors and the Gaussian kernel. The function is
            implemented as a mapping. When a mapping w is trained on some data z,
            the application