Page 105 -


68 Chapter 2 Getting to Know Your Data 2011/6/1 3:15 Page 68 #30

Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
Dissimilarity matrix (or object-by-object structure): This structure stores a collection
of proximities that are available for all pairs of n objects. It is often represented by an
n-by-n table:
 
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix},
\tag{2.9}
\]
where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In
general, d(i, j) is a non-negative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they differ. Note
that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore,
d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is
symmetric.) Measures of dissimilarity are discussed throughout the remainder of this
chapter.
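
Because d(i, i) = 0 and d(i, j) = d(j, i), only the entries below the diagonal need to be stored. The following is a minimal sketch of that idea, not a prescribed implementation; the class name LowerTriangularDissimilarity and its set/get methods are hypothetical names introduced for illustration. Objects are numbered 1..n, as in the text.

```python
class LowerTriangularDissimilarity:
    """Stores d(i, j) only for i > j; objects are numbered 1..n."""

    def __init__(self, n):
        self.n = n
        # Row i holds d(i, 1), ..., d(i, i-1); the row for object 1 is empty.
        self._rows = [[0.0] * (i - 1) for i in range(1, n + 1)]

    def set(self, i, j, value):
        if i == j:
            raise ValueError("d(i, i) is fixed at 0")
        if value < 0:
            raise ValueError("a dissimilarity must be non-negative")
        # Symmetry: always store under (larger index, smaller index).
        i, j = max(i, j), min(i, j)
        self._rows[i - 1][j - 1] = value

    def get(self, i, j):
        if i == j:
            return 0.0  # an object does not differ from itself
        i, j = max(i, j), min(i, j)
        return self._rows[i - 1][j - 1]


d = LowerTriangularDissimilarity(3)
d.set(2, 1, 0.4)
d.set(1, 3, 0.9)          # stored as d(3, 1)
print(d.get(1, 2))        # same entry as d(2, 1)
print(d.get(2, 2))        # diagonal entry
```

This layout keeps n(n − 1)/2 values instead of n², at the cost of an index swap on every lookup.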

Measures of similarity can often be expressed as a function of measures of dissimilarity.
For example, for nominal data,
sim(i, j) = 1 − d(i, j), (2.10)

where sim(i, j) is the similarity between objects i and j. Throughout the rest of this
chapter, we will also comment on measures of similarity.
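
As a small illustration of Eq. (2.10), the sketch below converts a full dissimilarity matrix into a similarity matrix entry by entry. It assumes that every d(i, j) lies in the range [0, 1], which this passage does not state; the function name similarity_from_dissimilarity is hypothetical.

```python
def similarity_from_dissimilarity(d_matrix):
    """Apply sim(i, j) = 1 - d(i, j) to every entry of a square matrix."""
    for row in d_matrix:
        for value in row:
            # Assumption: dissimilarities are normalized to [0, 1].
            if not 0.0 <= value <= 1.0:
                raise ValueError("expected dissimilarities in [0, 1]")
    return [[1.0 - value for value in row] for row in d_matrix]


d_matrix = [[0.0, 0.5],
            [0.5, 0.0]]
sim = similarity_from_dissimilarity(d_matrix)
print(sim)  # diagonal entries become 1.0: each object is fully similar to itself
```
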
A data matrix is made up of two entities or “things,” namely rows (for objects)
and columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate
on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a
dissimilarity matrix before applying such algorithms.
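
The transformation from a data matrix to a dissimilarity matrix can be sketched as follows. This is an illustrative outline under stated assumptions: the data matrix is a list of rows (one per object), and the pairwise measure is supplied by the caller. The measure used here, mismatch_fraction, is a placeholder chosen for the example rather than a definition taken from this passage, and both function names are hypothetical.

```python
def dissimilarity_matrix(data, dissimilarity):
    """Build the full n-by-n matrix from n rows of a data matrix."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays 0
    for i in range(n):
        for j in range(i):
            value = dissimilarity(data[i], data[j])
            matrix[i][j] = value
            matrix[j][i] = value  # symmetry: d(i, j) = d(j, i)
    return matrix


def mismatch_fraction(x, y):
    """Placeholder measure: fraction of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# A two-mode data matrix: 3 objects (rows) by 2 attributes (columns).
data = [
    ["red", "small"],
    ["red", "large"],
    ["blue", "large"],
]
for row in dissimilarity_matrix(data, mismatch_fraction):
    print(row)
```

Only the n(n − 1)/2 pairs with i > j are computed; the upper triangle is filled in by symmetry.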

2.4.2 Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states (Section 2.1.2). For example,
map color is a nominal attribute that may have, say, five states: red, yellow, green, pink,
and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by
letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are
used just for data handling and do not represent any specific ordering.
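
A minimal sketch of such an integer coding is shown below. The function name encode_states is hypothetical, and assigning codes in order of first appearance is an arbitrary choice made for this example; any other one-to-one assignment would serve equally well.

```python
def encode_states(values):
    """Map each distinct state to an integer code 1..M (first-seen order)."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return [codes[value] for value in values], codes


map_colors = ["red", "yellow", "green", "pink", "blue", "red"]
encoded, codes = encode_states(map_colors)
print(encoded)     # one integer code per object
print(len(codes))  # M, the number of distinct states

# The codes are labels only: test them for equality, never for order.
same_state = encoded[0] == encoded[5]
```

Comparisons such as encoded[1] < encoded[2] would run without error, but the result carries no meaning for a nominal attribute.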
