
            The expression y*w then computes the distance between any vector in z
            and any vector in y. For the squared Euclidean distance a short-cut
            function distm is defined.


            Listing 7.3
            PRTools code for defining and applying a proximity mapping.


            z = gendatb(3);       % Create some train data
            y = gendats(5);       % and some test data
            w = proxm(z,'d',2);   % Squared Euclidean distance to z
            D = y*w;              % 5 x 3 distance matrix
            D = distm(y,z);       % The same 5 x 3 distance matrix
            w = proxm(z,'o');     % Cosine distance to z
            D = y*w;              % New 5 x 3 distance matrix
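
            As a quick check (not part of Listing 7.3), one can verify that both
            routes yield the same distance matrix up to numerical precision; the
            variable names below are illustrative only:

            wd  = proxm(z,'d',2);           % squared Euclidean proximity mapping
            D1  = +(y*wd);                  % unary + converts the PRTools output
            D2  = +distm(y,z);              % to a plain matrix of doubles
            err = max(max(abs(D1 - D2)));   % should be (close to) zero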

            The distance between objects should reflect the important structures in
            the data set. It is assumed in all clustering algorithms that distances
            between objects are informative. This means that when objects are close
            in the feature space, they should also resemble each other in the real
            world. When the distances are not defined sensibly, and remote objects
            in the feature space correspond to similar real-world objects, no clus-
            tering algorithm will be able to give acceptable results without extra
            information from the user. In these sections it will therefore be assumed
            that the features are scaled such that the distances between objects are
            informative. Note that a cluster does not necessarily correspond directly
            to a class. A class can consist of multiple clusters, or multiple classes may
            form a single cluster (and will therefore probably be hard to discriminate
            between).
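              A minimal sketch of such a rescaling, assuming the PRTools mapping
            scalem with the 'variance' option is available, could look as follows:

            w_sc = scalem(z,'variance');  % mapping that scales each feature to unit variance
            z_sc = z*w_sc;                % rescaled data set; all features now contribute
                                          % comparably to the Euclidean distances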
              Because clustering is unsupervised, it is very hard to evaluate a
            clustering result. Different clustering methods will yield a different set of
            clusters, and the user has to decide which clustering is to be preferred.
            A quantitative measure of the quality of the clustering is the average
            distance of the objects to their respective cluster centre. Assume that the
            objects z_i (i = 1, ..., N_S) are clustered in K clusters C_k
            (k = 1, ..., K) with cluster centres μ_k, and that N_k objects are
            assigned to cluster C_k:

                J = \frac{1}{N_S} \sum_{k=1}^{K} \frac{1}{N_k}
                    \sum_{\mathbf{z} \in C_k} \left\| \mathbf{z} - \boldsymbol{\mu}_k \right\|^2        (7.10)
            Other criteria, such as the ones defined in Chapter 5 for supervised
            learning, can also be applied.
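              As an illustration, the criterion of Equation (7.10) can be computed
            directly from a data matrix and a label vector. The sketch below
            assumes z is a plain N_S-by-D matrix of doubles and lab an N_S-by-1
            vector with cluster labels 1, ..., K; both names are chosen for
            illustration only:

            K   = max(lab);                  % number of clusters
            N_S = size(z,1);                 % total number of objects
            J   = 0;
            for k = 1:K
              zk   = z(lab==k,:);            % objects assigned to cluster C_k
              N_k  = size(zk,1);
              mu_k = mean(zk,1);             % cluster centre of C_k
              d2   = sum((zk - repmat(mu_k,N_k,1)).^2, 2);  % squared distances to centre
              J    = J + sum(d2)/N_k;        % average within cluster k
            end
            J = J/N_S;                       % average over all clusters, Equation (7.10)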