Chapter 2 Getting to Know Your Data
Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
Dissimilarity matrix (or object-by-object structure): This structure stores a collection
of proximities that are available for all pairs of n objects. It is often represented by an
n-by-n table:
\[
\begin{bmatrix}
0 \\
d(2, 1) & 0 \\
d(3, 1) & d(3, 2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n, 1) & d(n, 2) & \cdots & \cdots & 0
\end{bmatrix},
\tag{2.9}
\]
where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In
general, d(i, j) is a non-negative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they differ. Note
that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore,
d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is
symmetric.) Measures of dissimilarity are discussed throughout the remainder of this
chapter.
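The properties stated above (zero diagonal, non-negativity, symmetry) can be checked mechanically. The following is a minimal sketch, not a prescribed implementation: the function names `expand_lower_triangle` and `is_valid_dissimilarity_matrix`, the tolerance `tol`, and the sample values are all illustrative assumptions. It fills in the omitted d(j, i) entries from a lower-triangular listing such as (2.9) and then verifies the three properties.

```python
def expand_lower_triangle(rows):
    """Build the full symmetric n-by-n matrix from a lower-triangular listing.

    rows[i] holds d(i, 0), ..., d(i, i), so row i has i + 1 entries
    (0-based indices here, whereas the text counts objects from 1).
    """
    n = len(rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value  # restore the d(j, i) entry that was not shown
    return full


def is_valid_dissimilarity_matrix(d, tol=1e-12):
    """Check d(i, i) = 0, d(i, j) >= 0, and d(i, j) = d(j, i)."""
    n = len(d)
    for i in range(n):
        if len(d[i]) != n:
            return False  # not square
        if abs(d[i][i]) > tol:
            return False  # an object must have zero dissimilarity to itself
        for j in range(i):
            if d[i][j] < 0:
                return False  # dissimilarities are non-negative
            if abs(d[i][j] - d[j][i]) > tol:
                return False  # matrix must be symmetric
    return True


# Made-up values for three objects, listed in the lower-triangular form of (2.9).
lower = [[0.0], [0.5, 0.0], [1.0, 0.25, 0.0]]
full = expand_lower_triangle(lower)
```

Storing only the lower triangle roughly halves the memory needed, at the cost of an expansion step when an algorithm expects the full table.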
Measures of similarity can often be expressed as a function of measures of dissimilarity.
For example, for nominal data,
\[
\mathit{sim}(i, j) = 1 - d(i, j), \tag{2.10}
\]
where sim(i, j) is the similarity between objects i and j. Throughout the rest of this
chapter, we will also comment on measures of similarity.
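As a small illustration of Eq. (2.10), the sketch below converts a dissimilarity matrix into a similarity matrix entry by entry. It assumes every d(i, j) lies in [0, 1], so that the resulting similarities also fall in [0, 1]; the function name `similarity_from_dissimilarity` and the sample values are hypothetical.

```python
def similarity_from_dissimilarity(d):
    """Apply sim(i, j) = 1 - d(i, j) to every entry of a dissimilarity matrix.

    Assumes each dissimilarity is scaled to [0, 1]; other scales would need
    a different conversion.
    """
    for row in d:
        for value in row:
            if not 0.0 <= value <= 1.0:
                raise ValueError("expected dissimilarities in [0, 1]")
    return [[1.0 - value for value in row] for row in d]


# Two objects with made-up dissimilarity 0.5.
d = [[0.0, 0.5], [0.5, 0.0]]
sim = similarity_from_dissimilarity(d)
```

Under this conversion, an object is maximally similar to itself, since d(i, i) = 0 gives sim(i, i) = 1.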
A data matrix is made up of two entities or “things,” namely rows (for objects)
and columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate
on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a
dissimilarity matrix before applying such algorithms.
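The transformation from a two-mode data matrix to a one-mode dissimilarity matrix might be sketched as follows. This is only an illustration under stated assumptions: `to_dissimilarity_matrix` is a hypothetical helper, and `mismatch_fraction` is a placeholder dissimilarity (the fraction of attributes on which two objects differ) chosen for the example; any function d(i, j) with the properties described above could be passed in instead.

```python
def mismatch_fraction(x, y):
    """Placeholder dissimilarity: fraction of the p attributes where x and y differ."""
    p = len(x)
    return sum(1 for f in range(p) if x[f] != y[f]) / p


def to_dissimilarity_matrix(data, dissimilarity):
    """Turn an n-by-p data matrix into an n-by-n dissimilarity matrix.

    data: list of n rows (objects), each with p attribute values.
    dissimilarity: function taking two rows and returning d(i, j).
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):
            # Compute each pair once and mirror it, since d(i, j) = d(j, i).
            d[i][j] = d[j][i] = dissimilarity(data[i], data[j])
    return d


# Made-up data matrix: 3 objects (rows) by 2 attributes (columns).
data = [
    ["red", "small"],
    ["red", "large"],
    ["blue", "large"],
]
d = to_dissimilarity_matrix(data, mismatch_fraction)
```

Note that the result has n(n − 1)/2 distinct off-diagonal entries, so for large n the dissimilarity matrix can be much larger than the data matrix it was derived from.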
2.4.2 Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states (Section 2.1.2). For example,
map color is a nominal attribute that may have, say, five states: red, yellow, green, pink,
and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by
letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are
used just for data handling and do not represent any specific ordering.
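A brief sketch of such integer coding is given below, using the map color example. The helper `encode_states` and the sample observations are assumptions made for illustration. The codes are assigned in order of first appearance, which is arbitrary: only equality between codes carries meaning, not their magnitude or order.

```python
def encode_states(values):
    """Map each distinct nominal state to an integer code 1, 2, ..., M.

    Codes follow order of first appearance; this order is arbitrary and
    carries no meaning.
    """
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
    return codes


colors = ["red", "yellow", "green", "pink", "blue"]  # M = 5 states
codes = encode_states(colors)

# Made-up observations of map color for three objects.
observed = ["green", "red", "green"]
encoded = [codes[c] for c in observed]

# Meaningful: are two objects in the same state?
same_state = encoded[0] == encoded[2]
# Not meaningful: comparisons such as encoded[0] > encoded[1],
# because the integers are labels rather than ranks.
```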