Chapter 3 Data Preprocessing
attributes with initially large ranges (e.g., income) from outweighing attributes with
initially smaller ranges (e.g., binary attributes). It is also useful when given no prior
knowledge of the data.
There are many methods for data normalization. We study min-max normalization,
z-score normalization, and normalization by decimal scaling. For our discussion, let A be
a numeric attribute with n observed values, \(v_1, v_2, \ldots, v_n\).
Min-max normalization performs a linear transformation on the original data. Suppose
that \(\mathit{min}_A\) and \(\mathit{max}_A\) are the minimum and maximum values of an attribute, A.
Min-max normalization maps a value, \(v_i\), of A to \(v_i'\) in the range \([\mathit{new\_min}_A, \mathit{new\_max}_A]\)
by computing
\[
v_i' = \frac{v_i - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A. \tag{3.8}
\]
Min-max normalization preserves the relationships among the original data values. It
will encounter an “out-of-bounds” error if a future input case for normalization falls
outside of the original data range for A.
Example 3.4 Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income
to the range [0.0,1.0]. By min-max normalization, a value of $73,600 for income is
transformed to
\[
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716.
\]
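The computation in Eq. (3.8) and the "out-of-bounds" behavior can be illustrated with a minimal Python sketch. The function name `min_max_normalize` and the `strict` flag are hypothetical choices made for this illustration; they are not part of the text.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0, strict=True):
    """Map v from [min_a, max_a] to [new_min, new_max] following Eq. (3.8).

    The strict flag is an assumption of this sketch: when True, a value
    outside the original range raises an error (the "out-of-bounds" case).
    """
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    if strict and not (min_a <= v <= max_a):
        raise ValueError(
            f"value {v} lies outside the original range [{min_a}, {max_a}]"
        )
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Example 3.4: income of $73,600 with min $12,000 and max $98,000.
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

With `strict=False` the same call simply extrapolates, so a future income above $98,000 would map to a value greater than 1.0.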
In z-score normalization (or zero-mean normalization), the values for an attribute,
A, are normalized based on the mean (i.e., average) and standard deviation of A. A value,
\(v_i\), of A is normalized to \(v_i'\) by computing
\[
v_i' = \frac{v_i - \bar{A}}{\sigma_A}, \tag{3.9}
\]
where \(\bar{A}\) and \(\sigma_A\) are the mean and standard deviation, respectively, of attribute A. The
mean and standard deviation were discussed in Section 2.2, where \(\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)\)
and \(\sigma_A\) is computed as the square root of the variance of A (see Eq. (2.6)). This
method of normalization is useful when the actual minimum and maximum of attribute
A are unknown, or when there are outliers that dominate the min-max normalization.
Example 3.5 z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to
\[
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225.
\]
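As a rough sketch of Eq. (3.9) in Python, the snippet below normalizes a single value and a whole list. The function names are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption of this sketch; Eq. (2.6) lies outside this section, so the variance should be checked against it.

```python
import statistics


def z_score(v, mean_a, sigma_a):
    """Normalize a single value v following Eq. (3.9)."""
    if sigma_a == 0:
        raise ValueError("standard deviation must be nonzero")
    return (v - mean_a) / sigma_a


def z_score_normalize(values):
    """Normalize every value using the mean and standard deviation of the list."""
    mean_a = statistics.fmean(values)
    # Assumption: population standard deviation (divide by n).
    sigma_a = statistics.pstdev(values, mu=mean_a)
    return [z_score(v, mean_a, sigma_a) for v in values]


# Example 3.5: mean $54,000, standard deviation $16,000, value $73,600.
print(z_score(73_600, 54_000, 16_000))  # 1.225
```

After `z_score_normalize`, the values have mean 0 and standard deviation 1 (up to floating-point error), which is why the method does not need the minimum and maximum of A.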
A variation of this z-score normalization replaces the standard deviation of Eq. (3.9)
by the mean absolute deviation of A. The mean absolute deviation of A, denoted \(s_A\), is
\[
s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right). \tag{3.10}
\]
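The mean absolute deviation of Eq. (3.10), and the z-score variation that divides by it, might be sketched as follows. The function names and the sample values are hypothetical and serve only as an illustration.

```python
def mean_absolute_deviation(values):
    """Compute s_A, the mean absolute deviation of Eq. (3.10)."""
    n = len(values)
    mean_a = sum(values) / n
    return sum(abs(v - mean_a) for v in values) / n


def z_score_mad_normalize(values):
    """Variation of Eq. (3.9) that divides by s_A instead of sigma_A."""
    mean_a = sum(values) / len(values)
    s_a = mean_absolute_deviation(values)
    if s_a == 0:
        raise ValueError("mean absolute deviation must be nonzero")
    return [(v - mean_a) / s_a for v in values]


# Hypothetical sample: mean is 5, absolute deviations sum to 12, so s_A = 1.5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_absolute_deviation(sample))  # 1.5
print(z_score_mad_normalize(sample)[0])  # -2.0
```

For this sample the population standard deviation is 2.0, so dividing by s_A = 1.5 yields normalized values of slightly larger magnitude than the standard z-score would.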