

          Chapter 2 Getting to Know Your Data



                           Each row corresponds to an object. As part of our notation, we may use f to index
                           through the p attributes.
                           Dissimilarity matrix (or object-by-object structure): This structure stores a collection
                           of proximities that are available for all pairs of n objects. It is often represented by an
                           n-by-n table:
                                                                       
                                                  0
                                                d(2, 1)  0             
                                               
                                                                        
                                               
                                               d(3, 1)  d(3, 2)  0      ,               (2.9)
                                                                        
                                                   .      .     .
                                                                       
                                                  .      .     .       
                                                  .      .     .       
                                                d(n, 1)  d(n, 2)  ··· ···  0
                           where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In
                           general, d(i, j) is a non-negative number that is close to 0 when objects i and j are
                           highly similar or “near” each other, and becomes larger the more they differ. Note
                           that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore,
                           d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is
                           symmetric.) Measures of dissimilarity are discussed throughout the remainder of this
                           chapter.
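The structure in (2.9) can be sketched in Python by storing only the lower triangle and answering every other lookup through the two properties stated above, d(i, i) = 0 and d(i, j) = d(j, i). This is a minimal sketch, not a prescribed implementation: the class name `DissimilarityMatrix`, the helper `abs_diff`, and the sample values are hypothetical, and the absolute difference is only a stand-in measure. Indices are 0-based here, whereas the text counts objects from 1.

```python
from typing import Callable, List, Sequence


def abs_diff(a: float, b: float) -> float:
    """Stand-in dissimilarity (an assumption): absolute difference."""
    return abs(a - b)


class DissimilarityMatrix:
    """Lower-triangular storage of d(i, j) for n objects (0-based indices)."""

    def __init__(self, objects: Sequence[float],
                 measure: Callable[[float, float], float]) -> None:
        self.n = len(objects)
        # Row i holds d(i, 0), ..., d(i, i - 1); the zero diagonal is implicit.
        self.rows: List[List[float]] = [
            [measure(objects[i], objects[j]) for j in range(i)]
            for i in range(self.n)
        ]

    def d(self, i: int, j: int) -> float:
        """Look up d(i, j), using the zero diagonal and symmetry."""
        if i == j:
            return 0.0            # d(i, i) = 0
        if i < j:
            i, j = j, i           # d(i, j) = d(j, i)
        return self.rows[i][j]


# Hypothetical sample: three objects described by a single numeric value.
dm = DissimilarityMatrix([1.0, 4.0, 6.0], abs_diff)
```

Storing n(n - 1)/2 entries instead of n * n is the same economy the printed matrix makes by omitting the d(j, i) entries.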

                         Measures of similarity can often be expressed as a function of measures of dissimilarity.
                         For example, for nominal data,
                                                   sim(i, j) = 1 − d(i, j),              (2.10)

                         where sim(i, j) is the similarity between objects i and j. Throughout the rest of this
                         chapter, we will also comment on measures of similarity.
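As a small illustration of Eq. (2.10), the function below converts a dissimilarity into a similarity. It assumes the dissimilarity has been scaled to lie between 0 and 1, which the text does not state explicitly at this point; the function name `similarity` is hypothetical.

```python
def similarity(d: float) -> float:
    """Return sim(i, j) = 1 - d(i, j), assuming 0 <= d(i, j) <= 1."""
    if not 0.0 <= d <= 1.0:
        # Outside [0, 1] the result would not be a similarity in [0, 1].
        raise ValueError("this conversion assumes 0 <= d <= 1")
    return 1.0 - d
```

Under that assumption, identical objects (d = 0) get similarity 1 and maximally different objects (d = 1) get similarity 0.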
                           A data matrix is made up of two entities or “things,” namely rows (for objects)
                         and columns (for attributes). Therefore, the data matrix is often called a two-mode
                         matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
                         called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate
                         on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a
                         dissimilarity matrix before applying such algorithms.
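The transformation from a two-mode data matrix to a one-mode dissimilarity matrix can be sketched as follows. No measure has been defined at this point in the text, so the sketch assumes one simple choice for nominal data: the fraction of attributes on which two objects disagree. The function names and the sample data are hypothetical.

```python
from typing import List, Sequence


def mismatch_ratio(x: Sequence[str], y: Sequence[str]) -> float:
    """Assumed measure: fraction of the p attributes on which x and y differ."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects need the same, nonzero number of attributes")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


def to_dissimilarity_matrix(data: Sequence[Sequence[str]]) -> List[List[float]]:
    """Turn an n-by-p data matrix into a full n-by-n dissimilarity matrix."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]   # diagonal stays 0
    for i in range(n):
        for j in range(i):
            # Compute each pair once and mirror it, since d(i, j) = d(j, i).
            matrix[i][j] = matrix[j][i] = mismatch_ratio(data[i], data[j])
    return matrix


# Hypothetical data matrix: 3 objects (rows) by 2 nominal attributes (columns).
data = [
    ["red", "small"],
    ["red", "large"],
    ["blue", "large"],
]
D = to_dissimilarity_matrix(data)
```

The rows and columns of `D` both index objects, which is what makes the result one-mode; the attributes no longer appear.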


                   2.4.2 Proximity Measures for Nominal Attributes
                         A nominal attribute can take on two or more states (Section 2.1.2). For example,
                         map color is a nominal attribute that may have, say, five states: red, yellow, green, pink,
                         and blue.
                           Let the number of states of a nominal attribute be M. The states can be denoted by
                         letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are
                         used just for data handling and do not represent any specific ordering.
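To make the point about integer codes concrete, the sketch below assigns the integers 1 to M to the five map colors and then relabels them in a different order. The variable names and the second coding are hypothetical; the only operation treated as meaningful on the codes is an equality test.

```python
states = ["red", "yellow", "green", "pink", "blue"]
M = len(states)

# One possible coding: 1, 2, ..., M in list order.
code_a = {state: k for k, state in enumerate(states, start=1)}

# A second, equally valid coding: the same integers in reversed order.
code_b = {state: M + 1 - k for state, k in code_a.items()}


def same_state(a: int, b: int) -> bool:
    """Codes are labels only, so equality is the one meaningful comparison."""
    return a == b


# Equality between objects is preserved under either coding.
agree_a = same_state(code_a["red"], code_a["red"])
differ_a = same_state(code_a["red"], code_a["blue"])
agree_b = same_state(code_b["red"], code_b["red"])
differ_b = same_state(code_b["red"], code_b["blue"])
```

Comparisons such as `code_a["red"] < code_a["blue"]` would flip under `code_b`, which is why ordering or arithmetic on these codes carries no information.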