
114   Chapter 3 Data Preprocessing



attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let \(A\) be a numeric attribute with \(n\) observed values, \(v_1, v_2, \ldots, v_n\).

Min-max normalization performs a linear transformation on the original data. Suppose that \(\mathit{min}_A\) and \(\mathit{max}_A\) are the minimum and maximum values of an attribute, \(A\). Min-max normalization maps a value, \(v_i\), of \(A\) to \(v'_i\) in the range \([\mathit{new\_min}_A, \mathit{new\_max}_A]\) by computing

\[
v'_i = \frac{v_i - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A. \tag{3.8}
\]

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for \(A\).

Example 3.4 Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to

\[
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716.
\]
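
The computation in Eq. (3.8) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `min_max_normalize` and its parameter names are not from the text, and the $120,000 income used to show the out-of-bounds case is a hypothetical value.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max], following Eq. (3.8)."""
    if max_a == min_a:
        # The formula divides by (max_a - min_a), so a constant attribute
        # cannot be rescaled this way.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income figures from Example 3.4.
inside = min_max_normalize(73_600, 12_000, 98_000)
print(round(inside, 3))  # 0.716

# A hypothetical later income above the observed maximum falls outside [0.0, 1.0].
outside = min_max_normalize(120_000, 12_000, 98_000)
print(outside > 1.0)  # True
```

The second call shows the out-of-bounds situation: the formula still returns a number, but it lies outside the target range, so the caller has to decide whether to reject, clip, or renormalize.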

In z-score normalization (or zero-mean normalization), the values for an attribute, \(A\), are normalized based on the mean (i.e., average) and standard deviation of \(A\). A value, \(v_i\), of \(A\) is normalized to \(v'_i\) by computing

\[
v'_i = \frac{v_i - \bar{A}}{\sigma_A}, \tag{3.9}
\]

where \(\bar{A}\) and \(\sigma_A\) are the mean and standard deviation, respectively, of attribute \(A\). The mean and standard deviation were discussed in Section 2.2, where \(\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)\) and \(\sigma_A\) is computed as the square root of the variance of \(A\) (see Eq. (2.6)). This method of normalization is useful when the actual minimum and maximum of attribute \(A\) are unknown, or when there are outliers that dominate the min-max normalization.

Example 3.5 z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to

\[
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225.
\]
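
As a rough sketch, Eq. (3.9) can be written in Python as follows. The function names `z_score` and `z_score_all` and the small sample list are illustrative assumptions, not part of the text. The sketch also assumes the population standard deviation (dividing by n), since Eq. (2.6) is not reproduced on this page.

```python
import statistics


def z_score(v, mean_a, std_a):
    """Normalize one value as (v - mean) / standard deviation, as in Eq. (3.9)."""
    if std_a == 0:
        raise ValueError("standard deviation must be nonzero")
    return (v - mean_a) / std_a


def z_score_all(values):
    """Normalize every value of an attribute using its own mean and std."""
    mean_a = statistics.fmean(values)
    # Assumption: population standard deviation (divides by n, not n - 1).
    std_a = statistics.pstdev(values)
    return [z_score(v, mean_a, std_a) for v in values]


# Figures from Example 3.5.
print(z_score(73_600, 54_000, 16_000))  # 1.225

# Hypothetical sample with mean 5 and population standard deviation 2.
print(z_score_all([2, 4, 4, 4, 5, 5, 7, 9]))
```

Unlike min-max normalization, nothing here depends on the observed minimum or maximum, which is why the method remains usable when those bounds are unknown.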

A variation of this z-score normalization replaces the standard deviation of Eq. (3.9) by the mean absolute deviation of \(A\). The mean absolute deviation of \(A\), denoted \(s_A\), is

\[
s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right). \tag{3.10}
\]
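
A minimal sketch of this variation, under the assumption that \(s_A\) from Eq. (3.10) simply takes the place of \(\sigma_A\) in Eq. (3.9). The function names and the sample values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mean_absolute_deviation(values):
    """Compute s_A, the mean of |v_i - mean|, as in Eq. (3.10)."""
    n = len(values)
    mean_a = sum(values) / n
    return sum(abs(v - mean_a) for v in values) / n


def z_score_mad(values):
    """z-score variant that divides by s_A instead of the standard deviation."""
    mean_a = sum(values) / len(values)
    s_a = mean_absolute_deviation(values)
    if s_a == 0:
        raise ValueError("mean absolute deviation must be nonzero")
    return [(v - mean_a) / s_a for v in values]


# Hypothetical sample: mean is 5, absolute deviations sum to 12, so s_A = 12 / 8.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_absolute_deviation(sample))  # 1.5
print(z_score_mad(sample)[0])  # -2.0
```

Because the deviations are not squared, a single extreme value inflates \(s_A\) less than it inflates the standard deviation.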