
          90    Chapter 3 Data Preprocessing



                               Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

                                           Partition into (equal-frequency) bins:
                                           Bin 1: 4, 8, 15
                                           Bin 2: 21, 21, 24
                                           Bin 3: 25, 28, 34

                                           Smoothing by bin means:
                                           Bin 1: 9, 9, 9
                                           Bin 2: 22, 22, 22
                                           Bin 3: 29, 29, 29

                                           Smoothing by bin boundaries:
                                           Bin 1: 4, 4, 15
                                           Bin 2: 21, 21, 24
                                           Bin 3: 25, 25, 34


               Figure 3.2 Binning methods for data smoothing.


                           greater the effect of the smoothing. Alternatively, bins may be equal width, where the
                           interval range of values in each bin is constant. Binning is also used as a discretization
                           technique and is further discussed in Section 3.5.
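
The binning steps of Figure 3.2 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the input is assumed to be already sorted for the equal-frequency case, and the tie rule in smoothing by bin boundaries (a value equally close to both boundaries goes to the lower one) is an assumption, since the figure contains no ties. The equal-width helper assumes the data contain at least two distinct values.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Split already-sorted values into consecutive bins of bin_size items."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]


def equal_width_bins(values, k):
    """Split values into k bins that each cover an interval of the same width.

    Assumption: min(values) != max(values), so the width is nonzero.
    """
    low, high = min(values), max(values)
    width = (high - low) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - low) / width), k - 1)
        bins[idx].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Assumption: a value equally close to both boundaries goes to the lower one.
    """
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sorted data from Figure 3.2
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)               # [[9.0]*3, [22.0]*3, [29.0]*3]
by_boundaries = smooth_by_boundaries(bins)     # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
by_width = equal_width_bins(prices, 3)         # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
```

With equal-width bins the interval covered by each bin is constant (here 10 dollars), so the bins no longer hold the same number of values.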
                           Regression: Data smoothing can also be done by regression, a technique that conforms
                           data values to a function. Linear regression involves finding the “best” line to
                           fit two attributes (or variables) so that one attribute can be used to predict the other.
                           Multiple linear regression is an extension of linear regression, where more than two
                           attributes are involved and the data are fit to a multidimensional surface. Regression
                           is further described in Section 3.4.5.
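
As a rough sketch of smoothing by linear regression, the code below fits a line to two attributes and replaces each observed value with the value the line predicts. It assumes that the “best” line means the ordinary least-squares line; the function names and the sample data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept.

    Assumption: xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted from x by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy data that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 2.7, 5.1, 7.2, 8.8]
smoothed = smooth_by_regression(xs, ys)
```

The smoothed values all lie on one line, so the noise in the individual observations is absorbed into the fit. Multiple linear regression follows the same idea with more than one predictor attribute.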
                           Outlier analysis: Outliers may be detected by clustering, for example, where similar
                           values are organized into groups, or “clusters.” Intuitively, values that fall outside of
                           the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to
                           the topic of outlier analysis.
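
To make the idea concrete, here is a minimal sketch for one-dimensional data. It is not a specific algorithm from the text: it swaps in a simple gap-based grouping as a stand-in for clustering, and treats any group smaller than a chosen size as falling outside the set of clusters. The names `gap_clusters` and `find_outliers`, the thresholds, and the extra value 90 are all hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new group."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_cluster_size):
    """Return (clusters, outliers); groups below min_cluster_size are outliers."""
    groups = gap_clusters(values, max_gap)
    clusters = [g for g in groups if len(g) >= min_cluster_size]
    outliers = [v for g in groups if len(g) < min_cluster_size for v in g]
    return clusters, outliers


# Price data from Figure 3.2 plus one hypothetical extreme value (90).
values = [4, 8, 15, 21, 21, 24, 25, 28, 34, 90]
clusters, outliers = find_outliers(values, max_gap=10, min_cluster_size=3)
# outliers == [90]
```

The result depends heavily on `max_gap` and `min_cluster_size`; a real clustering method would replace the gap rule but keep the same final step of flagging values that belong to no cluster.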
                           Many data smoothing methods are also used for data discretization (a form of data
                         transformation) and data reduction. For example, the binning techniques described
                         before reduce the number of distinct values per attribute. This acts as a form of data
                         reduction for logic-based data mining methods, such as decision tree induction, which
                         repeatedly makes value comparisons on sorted data. Concept hierarchies are a form of
                         data discretization that can also be used for data smoothing. A concept hierarchy for
                         price, for example, may map real price values into inexpensive, moderately priced, and
                         expensive, thereby reducing the number of data values to be handled by the mining