                                                      3.5 Data Transformation and Data Discretization  115


                               Thus, z-score normalization using the mean absolute deviation is

                               \[
                                 v'_i = \frac{v_i - \bar{A}}{s_A}. \tag{3.11}
                               \]

                               The mean absolute deviation, $s_A$, is more robust to outliers than the standard deviation,
                               $\sigma_A$. When computing the mean absolute deviation, the deviations from the mean (i.e.,
                               $|x_i - \bar{x}|$) are not squared; hence, the effect of outliers is somewhat reduced.
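
The following is a minimal Python sketch of Eq. (3.11). It assumes that s_A is the average of the absolute deviations |v_i − Ā| over all values of A, which is consistent with the description above. The function names are illustrative and do not come from the text.

```python
from statistics import fmean


def mean_absolute_deviation(values):
    """Average of |v - mean| over all values (deviations are not squared)."""
    center = fmean(values)
    return fmean([abs(v - center) for v in values])


def zscore_mad(values):
    """Normalize each value as (v - mean) / s_A, following Eq. (3.11)."""
    center = fmean(values)
    s_a = mean_absolute_deviation(values)
    if s_a == 0:
        # All values are identical, so the ratio is undefined.
        raise ValueError("mean absolute deviation is zero")
    return [(v - center) / s_a for v in values]
```

Because the deviations enter linearly rather than squared, a single extreme value inflates s_A less than it inflates the standard deviation.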
                                 Normalization by decimal scaling normalizes by moving the decimal point of values
                               of attribute A. The number of decimal points moved depends on the maximum absolute
                               value of A. A value, $v_i$, of A is normalized to $v'_i$ by computing

                               \[
                                 v'_i = \frac{v_i}{10^j}, \tag{3.12}
                               \]

                               where j is the smallest integer such that $\max(|v'_i|) < 1$.
                  Example 3.6 Decimal scaling. Suppose that the recorded values of A range from −986 to 917. The
                               maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
                               divide each value by 1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917
                               normalizes to 0.917.
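
Decimal scaling as in Eq. (3.12) can be sketched in Python as follows. The sketch assumes j is restricted to non-negative integers, so values that are already below 1 in absolute value are left unchanged. The function names are illustrative.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Return the scaled values together with the exponent j that was used."""
    j = decimal_scaling_exponent(values)
    return [v / 10**j for v in values], j


# The range from Example 3.6: the maximum absolute value is 986, so j = 3.
scaled, j = decimal_scale([-986, 917])
```

Note that a maximum absolute value of exactly 1000 would give j = 4, because the scaled maximum must be strictly less than 1.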

                                 Note that normalization can change the original data quite a bit, especially when
                               using z-score normalization or decimal scaling. It is also necessary to save the normalization
                               parameters (e.g., the mean and standard deviation if using z-score normalization)
                               so that future data can be normalized in a uniform manner.
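
One way to keep the normalization parameters for later use is sketched below: the parameters are estimated once and then reused unchanged on new data. The sketch assumes the population standard deviation is intended. `ZScoreParams`, `fit_zscore`, and `apply_zscore` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class ZScoreParams:
    """Parameters saved from the original data."""
    mean: float
    std: float


def fit_zscore(values):
    """Estimate the mean and (population) standard deviation once."""
    return ZScoreParams(mean=fmean(values), std=pstdev(values))


def apply_zscore(params, values):
    """Normalize values with previously saved parameters."""
    if params.std == 0:
        raise ValueError("standard deviation is zero")
    return [(v - params.mean) / params.std for v in values]


# Fit on the original data, then reuse the same parameters for future data.
params = fit_zscore([2, 4, 4, 4, 5, 5, 7, 9])
future = apply_zscore(params, [9, 11])
```

Recomputing the mean and standard deviation on each new batch would put old and new values on different scales, which is what saving the parameters avoids.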

                         3.5.3 Discretization by Binning

                               Binning is a top-down splitting technique based on a specified number of bins.
                               Section 3.2.2 discussed binning methods for data smoothing. These methods are also
                               used as discretization methods for data reduction and concept hierarchy generation. For
                               example, attribute values can be discretized by applying equal-width or equal-frequency
                               binning, and then replacing each bin value by the bin mean or median, as in smoothing
                               by bin means or smoothing by bin medians, respectively. These techniques can be applied
                               recursively to the resulting partitions to generate concept hierarchies.
                                 Binning does not use class information and is therefore an unsupervised discretization
                               technique. It is sensitive to the user-specified number of bins, as well as the presence
                               of outliers.
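
The two binning schemes mentioned above, followed by replacement with bin means, can be sketched in Python as follows. The sample values and the choice of three bins are illustrative assumptions, as are the function names. Equal-width bins are taken as half-open intervals, with the maximum value assigned to the last bin.

```python
from statistics import fmean


def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value falls on the upper edge, so clamp it to bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def replace_by_bin_mean(values, bins):
    """Replace every value by the mean of the bin it was assigned to."""
    members = {}
    for v, b in zip(values, bins):
        members.setdefault(b, []).append(v)
    means = {b: fmean(vs) for b, vs in members.items()}
    return [means[b] for b in bins]


# Illustrative sample values (an assumption, not data from this section).
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
by_frequency = replace_by_bin_mean(data, equal_frequency_bins(data, 3))
by_width = replace_by_bin_mean(data, equal_width_bins(data, 3))
```

Neither function consults class labels, which is what makes the approach unsupervised. A single extreme value stretches the equal-width ranges, which illustrates the sensitivity to outliers noted above.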


                         3.5.4 Discretization by Histogram Analysis
                               Like binning, histogram analysis is an unsupervised discretization technique because it
                               does not use class information. Histograms were introduced in Section 2.2.3. A histogram
                               partitions the values of an attribute, A, into disjoint ranges called buckets
                               or bins.