                               2.4 Measuring Data Similarity and Dissimilarity


                               dissimilarity between i and j is
                               $$d(i, j) = \frac{r + s}{q + r + s + t}. \tag{2.13}$$

                                 For asymmetric binary attributes, the two states are not equally important, such as
                               the positive (1) and negative (0) outcomes of a disease test. Given two asymmetric binary
                               attributes, the agreement of two 1s (a positive match) is then considered more
                               significant than that of two 0s (a negative match). Therefore, such binary attributes are often
                               considered “monary” (having one state). The dissimilarity based on these attributes is
                               called asymmetric binary dissimilarity, where the number of negative matches, t, is
                               considered unimportant and is thus ignored in the following computation:
                               $$d(i, j) = \frac{r + s}{q + r + s}. \tag{2.14}$$

                                 Complementarily, we can measure the difference between two binary attributes based
                               on the notion of similarity instead of dissimilarity. For example, the asymmetric binary
                               similarity between the objects i and j can be computed as

                               $$sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j). \tag{2.15}$$
                               The coefficient sim(i, j) of Eq. (2.15) is called the Jaccard coefficient and is widely
                               used in the literature.
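
                               The measures in Eqs. (2.13) to (2.15) can be sketched in Python as follows. This is a
                               minimal illustration, not code from the text. The function names are hypothetical, and
                               the objects are assumed to be equal-length sequences of 0/1 values. The counts q, r, s,
                               and t are taken to be the numbers of attributes where (i, j) equal (1, 1), (1, 0),
                               (0, 1), and (0, 0), respectively. Returning a dissimilarity of 0 when q + r + s = 0 is
                               an assumption made here, since Eq. (2.14) is undefined in that case.

```python
from typing import Sequence, Tuple


def contingency_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (q, r, s, t) for two equal-length, non-empty 0/1 vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must be non-empty and have the same number of attributes")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # positive matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 for i, 0 for j
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 0 for i, 1 for j
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # negative matches
    return q, r, s, t


def symmetric_binary_dissimilarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Eq. (2.13): both kinds of match count toward the denominator."""
    q, r, s, t = contingency_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Eq. (2.14): negative matches (t) are ignored."""
    q, r, s, _ = contingency_counts(x, y)
    denom = q + r + s
    if denom == 0:
        # Assumption for this sketch: two all-zero objects are treated as identical.
        return 0.0
    return (r + s) / denom


def jaccard_similarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Eq. (2.15): sim(i, j) = q / (q + r + s) = 1 - d(i, j)."""
    return 1.0 - asymmetric_binary_dissimilarity(x, y)


# Hypothetical objects with five binary attributes: q = 1, r = 1, s = 1, t = 2.
obj_i = [1, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 0]
sym_d = symmetric_binary_dissimilarity(obj_i, obj_j)    # 2/5
asym_d = asymmetric_binary_dissimilarity(obj_i, obj_j)  # 2/3
jac = jaccard_similarity(obj_i, obj_j)                  # 1/3
```

                               For the same pair of objects, the asymmetric dissimilarity (2/3) exceeds the symmetric
                               one (2/5) because the two negative matches no longer enlarge the denominator.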
                                 When both symmetric and asymmetric binary attributes occur in the same data set,
                               the mixed attributes approach described in Section 2.4.6 can be applied.

                 Example 2.18 Dissimilarity between binary attributes. Suppose that a patient record table (Table 2.4)
                               contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where
                               name is an object identifier, gender is a symmetric attribute, and the remaining attributes
                               are asymmetric binary.

                                 For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1,
                               and the value N (no or negative) be set to 0. Suppose that the distance between objects


                     Table 2.4 Relational Table Where Patients Are Described by Binary Attributes
                               name   gender  fever  cough  test-1  test-2  test-3  test-4
                               Jack   M       Y      N      P       N       N       N
                               Jim    M       Y      Y      N       N       N       N
                               Mary   F       Y      N      P       N       P       N
                               ...    ...     ...    ...    ...     ...     ...     ...
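
                               As a hedged illustration, the three rows shown in Table 2.4 can be encoded and compared
                               with the asymmetric binary dissimilarity of Eq. (2.14). The sketch assumes that only
                               the six asymmetric attributes (fever, cough, test-1 to test-4) enter the computation,
                               so that gender is left out. The names ENCODING, TABLE_2_4, encode, and asym_dissim are
                               hypothetical helpers introduced here.

```python
from itertools import combinations

# Encoding stated in the example: Y (yes) and P (positive) -> 1, N (no/negative) -> 0.
ENCODING = {"Y": 1, "P": 1, "N": 0}

# Asymmetric attributes only, in the order fever, cough, test-1, test-2, test-3, test-4.
# Assumption: gender (a symmetric attribute) is excluded from the comparison.
TABLE_2_4 = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Jim":  ["Y", "Y", "N", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
}


def encode(values):
    """Map Y/P/N symbols to 1/0."""
    return [ENCODING[v] for v in values]


def asym_dissim(x, y):
    """Eq. (2.14): (r + s) / (q + r + s), ignoring negative matches.

    Assumes at least one attribute is 1 in x or y (true for every row used here),
    so the denominator is nonzero.
    """
    q = sum(a & b for a, b in zip(x, y))         # positive matches
    r_plus_s = sum(a ^ b for a, b in zip(x, y))  # mismatches in either direction
    return r_plus_s / (q + r_plus_s)


distances = {
    (a, b): asym_dissim(encode(TABLE_2_4[a]), encode(TABLE_2_4[b]))
    for a, b in combinations(TABLE_2_4, 2)
}

for (a, b), d in distances.items():
    print(f"d({a}, {b}) = {d:.2f}")
# d(Jack, Jim) = 0.67
# d(Jack, Mary) = 0.33
# d(Jim, Mary) = 0.75
```

                               Under these assumptions, Jack and Mary come out as the most similar pair, and Jim and
                               Mary as the least similar.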