
                                                             2011/6/1
                          HAN 09-ch02-039-082-9780123814791
                                                       2.4 Measuring Data Similarity and Dissimilarity  75


                 Example 2.21 Dissimilarity between ordinal attributes. Suppose that we have the sample data shown
                               earlier in Table 2.2, except that this time only the object-identifier and the continuous
                               ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and
                               excellent, that is, M_f = 3. For step 1, if we replace each value for test-2 by its rank, the
                               four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the
                               ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can
                               use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity
                               matrix:

                               $$
                               \begin{bmatrix}
                               0   &     &     &   \\
                               1.0 & 0   &     &   \\
                               0.5 & 0.5 & 0   &   \\
                               0   & 1.0 & 0.5 & 0
                               \end{bmatrix}.
                               $$

                               Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1) =
                               1.0 and d(4,2) = 1.0). This makes intuitive sense since objects 1 and 4 are both excellent.
                               Object 2 is fair, which is at the opposite end of the range of values for test-2.

                                 Similarity values for ordinal attributes can be interpreted from dissimilarity as
                               sim(i,j) = 1 − d(i,j).
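
                               The three steps of Example 2.21, together with the conversion to similarity, can be
                               sketched in Python as follows. This is a minimal illustration rather than a reference
                               implementation: the function names are hypothetical, the test-2 values are inferred
                               from the ranks 3, 1, 2, 3 given in the example, and the normalization (r - 1) / (M - 1)
                               is assumed because it reproduces the mapping of ranks 1, 2, 3 to 0.0, 0.5, 1.0.

```python
import math

# Ordered states of test-2, lowest to highest (from Example 2.21).
STATES = ["fair", "good", "excellent"]


def normalized_ranks(values, states=STATES):
    """Steps 1 and 2: replace each value by its rank r in 1..M,
    then map the rank onto [0.0, 1.0] as (r - 1) / (M - 1).
    Assumes M > 1 (at least two states)."""
    m = len(states)
    rank = {state: r for r, state in enumerate(states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


def dissimilarity_matrix(z):
    """Step 3: Euclidean distance between normalized ranks.
    With a single attribute this reduces to |z_i - z_j|."""
    return [[math.sqrt((zi - zj) ** 2) for zj in z] for zi in z]


# Assumed test-2 values of objects 1..4, inferred from ranks 3, 1, 2, 3.
test2 = ["excellent", "fair", "good", "excellent"]

z = normalized_ranks(test2)          # normalized ranks per object
d = dissimilarity_matrix(z)          # full symmetric matrix d(i, j)
sim = [[1 - dij for dij in row] for row in d]  # sim(i, j) = 1 - d(i, j)

for row in d:
    print(row)
```

                               The lower triangle of d matches the matrix in the example: d(2,1) = 1.0,
                               d(3,1) = d(3,2) = 0.5, d(4,1) = 0, d(4,2) = 1.0, and d(4,3) = 0.5.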

                         2.4.6 Dissimilarity for Attributes of Mixed Types

                               Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects
                               described by attributes of the same type, where these types may be either nominal, sym-
                               metric binary, asymmetric binary, numeric, or ordinal. However, in many real databases,
                               objects are described by a mixture of attribute types. In general, a database can contain
                               all of these attribute types.
                                 “So, how can we compute the dissimilarity between objects of mixed attribute types?”
                               One approach is to group each type of attribute together, performing separate data
                               mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive
                               compatible results. However, in real applications, it is unlikely that a separate analysis
                               per attribute type will generate compatible results.
                                 A preferable approach is to process all attribute types together, performing a
                               single analysis. One such technique combines the different attributes into a single dis-
                               similarity matrix, bringing all of the meaningful attributes onto a common scale of the
                               interval [0.0, 1.0].
                                 Suppose that the data set contains p attributes of mixed type. The dissimilarity d(i, j)
                               between objects i and j is defined as
                               $$
                               d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}},  \tag{2.22}
                               $$