
68      3 Data Clustering
























                               Figure 3.15. Shepard diagram for the Food dataset, showing the target distance d*
                               (dots) and the transformed distance f(d) (step function) against the original data
                               similarity (horizontal axis).



                                  We illustrate the application of this technique using the Food dataset. In order to
                               have comparable features in terms of their contribution to the dissimilarity
                               measure, the following standardized features were computed: Cal10 = Cal/10; P10
                               = P/10; Ca10 = Ca/10; B10 = B1B2×10.
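The rescaling described above (Cal, P and Ca divided by 10, B1B2 multiplied by 10) can be sketched as follows. This is only a minimal illustration: the dictionary keys, the function name `rescale` and the numeric values of the example record are assumptions made here, not values taken from the Food dataset.

```python
def rescale(food):
    """Return the four rescaled features for one food case.

    Assumes `food` is a dict holding the raw features Cal, P, Ca and B1B2.
    """
    return {
        "Cal10": food["Cal"] / 10,   # calories, divided by 10
        "P10": food["P"] / 10,       # phosphorus, divided by 10
        "Ca10": food["Ca"] / 10,     # calcium, divided by 10
        "B10": food["B1B2"] * 10,    # B1B2, multiplied by 10
    }


# Hypothetical record: the numbers are made up for illustration only.
example = {"Cal": 350.0, "P": 120.0, "Ca": 40.0, "B1B2": 0.25}
scaled = rescale(example)
```

After rescaling, the four features of the example record are all of the same order of magnitude, so none of them dominates a Euclidean dissimilarity purely because of its units.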
                                  When using Statistica for multidimensional scaling, the first step is the creation
                               of a dissimilarity matrix for all patterns, which can be done during the cluster
                               analysis process. The elements of the dissimilarity matrix are the distances d(x_i, x_j).
                               Figure 3.14 shows this matrix using Euclidean distances. For notational simplicity,
                               food cases have been named from A to Z and then from AA to AJ.
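As a rough sketch of this first step, the code below builds a symmetric matrix of Euclidean distances between patterns and generates case labels in the A to Z, then AA to AJ style (36 labels in all). The function names and the three toy patterns are assumptions for illustration; this is not the software's own procedure.

```python
import math
import string


def case_labels(n):
    """Labels A..Z, then AA, AB, ... for the remaining cases (n <= 52)."""
    letters = string.ascii_uppercase
    labels = list(letters[:n])
    labels += ["A" + c for c in letters[: max(0, n - 26)]]
    return labels


def dissimilarity_matrix(rows):
    """Symmetric matrix of Euclidean distances between the given rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d


# Toy patterns (made up), not rows of the Food dataset.
rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(rows)
labels = case_labels(36)
```

Only the upper triangle is computed explicitly; the matrix is filled symmetrically and its diagonal stays at zero, as expected for a distance-based dissimilarity.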
                                  Next, the iterative multidimensional scaling is performed on this matrix, after
                                setting the desired number of dimensions (d=2 for this example). After
                                convergence of the iterative process, the user has the option to evaluate the
                                goodness of fit of the solution by looking at the Shepard diagram shown in Figure
                                3.15. This diagram shows the target distances d*(x_i, x_j) and the transformed
                                distances f(d(x_i, x_j)) against the original ones. The transformed distances are
                                represented by a step-line. As Figure 3.15 shows, the target distances are very close
                                to the step-function, indicating a good fit to the monotone transformation.
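The quantities plotted in a Shepard diagram can be sketched as follows. The code sorts the pairs by original dissimilarity, fits a non-decreasing step function to the target distances (written d* here) with the pool-adjacent-violators algorithm, and measures how far the dots lie from the step line. The pool-adjacent-violators fit and the Kruskal stress-1 normalisation are assumptions made for this sketch; the text does not state which fitting procedure or stress formula the software uses, and ties in the original dissimilarities are not treated specially.

```python
import math


def monotone_fit(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # constant value on each step
    return fitted


def shepard_points(original, target):
    """Return (d, d*, f(d)) with pairs sorted by original dissimilarity d."""
    order = sorted(range(len(original)), key=lambda k: original[k])
    d_sorted = [original[k] for k in order]
    t_sorted = [target[k] for k in order]
    return d_sorted, t_sorted, monotone_fit(t_sorted)


def stress(target, fitted):
    """Kruskal stress-1 (assumed normalisation): 0 means a perfect fit."""
    num = sum((t - f) ** 2 for t, f in zip(target, fitted))
    den = sum(t * t for t in target)
    return math.sqrt(num / den)


# Made-up dissimilarities and target distances for four pairs of patterns.
d, t, f = shepard_points([0.5, 1.0, 1.5, 2.0], [1.0, 3.0, 2.0, 4.0])
s = stress(t, f)
```

In this toy example the second and third target distances violate monotonicity, so they are pooled into one step; a configuration whose target distances already increase with the original dissimilarities would give a stress of zero.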
                                  Figure 3.16 shows the Food data represented in the two derived dimensions with
                                a four-cluster solution. When tree clustering is performed with the Ward method
                                and a Euclidean metric, the solution shown by the vertical icicle of Figure 3.17 is
                                obtained, which shows a good agreement with the scatter plot in the reduced
                                two-dimensional space.
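For readers who want to reproduce this kind of comparison, the following is a naive sketch of agglomerative clustering with Ward's criterion, which at each step merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. The function name `ward_clusters`, the stopping rule (a fixed number of clusters k) and the toy points are assumptions made here; the software's own implementation may differ in details such as tie handling.

```python
def ward_clusters(points, k):
    """Merge clusters with Ward's criterion until only k clusters remain."""
    # Each cluster is (list of member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma, ca = clusters[a]
                mb, cb = clusters[b]
                na, nb = len(ma), len(mb)
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sum(
                    (x - y) ** 2 for x, y in zip(ca, cb))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ma, ca = clusters[a]
        mb, cb = clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[a] = (ma + mb, centroid)
        del clusters[b]
    return [sorted(members) for members, _ in clusters]


# Toy two-dimensional points (made up), not the Food data.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
groups = ward_clusters(points, 3)
```

Running the same routine on the original features and comparing the resulting groups with those visible in the two-dimensional scatter plot is one simple way to check the kind of agreement described above.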
                                  The interpretation of the clusters can be a problem when using multidimensional
                                scaling  alone.  With  principal  components  analysis  we  can  derive  two  factors