2.4 Measuring Data Similarity and Dissimilarity 75

Example 2.21 Dissimilarity between ordinal attributes. Suppose that we have the sample data shown
earlier in Table 2.2, except that this time only the object-identifier and the
ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and
excellent, that is, $M_f = 3$. For step 1, if we replace each value for test-2 by its rank, the
four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the
ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can
use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity
matrix:

 
0
1.0 0 

.


0.5 0.5 0 
0 1.0 0.5 0
Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1) =
1.0 and d(4,2) = 1.0). This makes intuitive sense since objects 1 and 4 are both excellent.
Object 2 is fair, which is at the opposite end of the range of values for test-2.
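
The three steps of this example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ordinal_dissimilarity` is hypothetical, and the normalization `(rank - 1) / (M - 1)` is assumed because it reproduces the mapping stated above (rank 1 to 0.0, rank 2 to 0.5, rank 3 to 1.0). Object 3 is taken to be good, as implied by its rank of 2.

```python
def ordinal_dissimilarity(values, ordered_states):
    """Lower-triangular dissimilarity matrix for a single ordinal attribute.

    `ordered_states` lists the states from lowest to highest rank and must
    contain at least two states (hypothetical helper, for illustration only).
    """
    m = len(ordered_states)
    # Step 1: replace each value by its rank (1 = lowest state).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    # Step 2: map ranks onto [0.0, 1.0]; assumed form (rank - 1) / (M - 1).
    z = [(rank[v] - 1) / (m - 1) for v in values]
    # Step 3: with one attribute, Euclidean distance reduces to |z_i - z_j|.
    return [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]


# test-2 values for objects 1 to 4, matching the ranks 3, 1, 2, 3.
test_2 = ["excellent", "fair", "good", "excellent"]
matrix = ordinal_dissimilarity(test_2, ["fair", "good", "excellent"])

for row in matrix:
    print(" ".join(f"{d:.1f}" for d in row))
```

Each printed row corresponds to one row of the lower triangle of the dissimilarity matrix, so `matrix[1][0]` is d(2,1) and `matrix[3][1]` is d(4,2).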

Similarity values for ordinal attributes can be interpreted from dissimilarity as
sim(i,j) = 1 − d(i,j).

2.4.6 Dissimilarity for Attributes of Mixed Types

Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects
described by attributes of the same type, where these types may be either nominal,
symmetric binary, asymmetric binary, numeric, or ordinal. However, in many real databases,
objects are described by a mixture of attribute types. In general, a database can contain
all of these attribute types.
“So, how can we compute the dissimilarity between objects of mixed attribute types?”
One approach is to group each type of attribute together, performing separate data
mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive
compatible results. However, in real applications, it is unlikely that a separate analysis
per attribute type will generate compatible results.
A preferable approach is to process all attribute types together, performing a
single analysis. One such technique combines the different attributes into a single
dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the
interval [0.0, 1.0].
Suppose that the data set contains p attributes of mixed type. The dissimilarity d(i, j)
between objects i and j is defined as

\[
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \tag{2.22}
\]
