                                                                               3.2 Data Cleaning  89


                                 induction. For example, using the other customer attributes in your data set, you
                                 may construct a decision tree to predict the missing values for income. Decision trees
                                 and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while
                                 regression is introduced in Section 3.4.5.

                                 Methods 3 through 6 bias the data: the filled-in value may not be correct. Method 6,
                               however, is a popular strategy. In comparison to the other methods, it uses the most
                               information from the present data to predict missing values. Because it considers the
                               other attributes’ values when estimating the missing value for income, there is a greater
                               chance that the relationships between income and the other attributes are preserved.
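
A minimal sketch of the idea behind Method 6: predict a missing income from another attribute of the same customer. For brevity it swaps in a one-variable least-squares regression rather than a decision tree. The attribute name age, the helper names, and the toy records are all assumptions made up for illustration; they do not come from the text's data set.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_income(customers):
    """Return copies of the records with missing income predicted from age."""
    known = [c for c in customers if c["income"] is not None]
    slope, intercept = fit_line(
        [c["age"] for c in known], [c["income"] for c in known]
    )
    filled = []
    for c in customers:
        record = dict(c)  # copy so the input records are left untouched
        if record["income"] is None:
            record["income"] = slope * record["age"] + intercept
        filled.append(record)
    return filled


# Made-up toy records; None marks a missing income.
customers = [
    {"age": 30, "income": 30000},
    {"age": 40, "income": 40000},
    {"age": 50, "income": 50000},
    {"age": 35, "income": None},
]
filled = impute_income(customers)
```

The filled-in value is still only an estimate, so the bias noted above applies here as well.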
                                 It is important to note that, in some cases, a missing value may not imply an error
                               in the data! For example, when applying for a credit card, candidates may be asked to
                               supply their driver’s license number. Candidates who do not have a driver’s license may
                               naturally leave this field blank. Forms should allow respondents to specify values such
                               as “not applicable.” Software routines may also be used to uncover other null values
                               (e.g., “don’t know,” “?” or “none”). Ideally, each attribute should have one or more rules
                               regarding the null condition. The rules may specify whether or not nulls are allowed
                               and/or how such values should be handled or transformed. Fields may also be
                               intentionally left blank if they are to be provided in a later step of the business process.
                               Hence, although we can try our best to clean the data after it is collected, good database
                               and data entry procedure design should help minimize the number of missing values or
                               errors in the first place.
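
The null-handling rules described above can be sketched as a small routine that maps disguised null entries to a single null marker and then checks each attribute's rule. The token list, the field names, and the nullable table are hypothetical choices for illustration, not a prescribed scheme. Note that "not applicable" is deliberately kept as a legitimate value rather than treated as missing.

```python
# Assumed spellings that stand for "no value" in free-text input.
NULL_TOKENS = {"", "?", "none", "don't know"}


def normalize(value):
    """Map disguised nulls to None; leave every other value unchanged."""
    if isinstance(value, str):
        # Treat a typographic apostrophe like a plain one before matching.
        token = value.strip().lower().replace("\u2019", "'")
        if token in NULL_TOKENS:
            return None
    return value


def apply_null_rules(record, nullable):
    """Return (cleaned_record, fields_that_violate_their_null_rule).

    nullable maps a field name to True if nulls are allowed for it;
    fields not listed are assumed to allow nulls.
    """
    cleaned = {field: normalize(value) for field, value in record.items()}
    violations = [
        field
        for field, value in cleaned.items()
        if value is None and not nullable.get(field, True)
    ]
    return cleaned, violations


# Hypothetical credit card application record and rules.
record = {
    "name": "?",
    "drivers_license": "not applicable",
    "income": " None ",
    "age": 42,
}
cleaned, violations = apply_null_rules(record, {"name": False, "income": True})
```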


                         3.2.2 Noisy Data
                               “What is noise?” Noise is a random error or variance in a measured variable. In
                               Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots
                               and scatter plots) and methods of data visualization can be used to identify outliers,
                               which may represent noise. Given a numeric attribute such as, say, price, how can we
                               “smooth” out the data to remove the noise? Let’s look at the following data smoothing
                               techniques.
                                 Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,”
                                 that is, the values around it. The sorted values are distributed into a number
                                 of “buckets,” or bins. Because binning methods consult the neighborhood of values,
                                 they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this
                                 example, the data for price are first sorted and then partitioned into equal-frequency
                                 bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each
                                 value in a bin is replaced by the mean value of the bin. For example, the mean of the
                                 values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced
                                 by the value 9.
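
As a rough sketch, equal-frequency partitioning and smoothing by bin means can be written as follows. The median and boundary variants, which the text describes next, are included for comparison. Only the values 4, 8, and 15 are given in the text; the remaining prices are assumed to match Figure 3.2, which is not reproduced here. The function names are illustrative, and the rule that a tie goes to the lower boundary is an assumption the text does not specify.

```python
from statistics import mean, median


def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the median of that bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Bin 1 (4, 8, 15) is from the text; the other values are assumed from Figure 3.2.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

by_means = smooth_by_means(bins)            # Bin 1 becomes 9, 9, 9
by_medians = smooth_by_medians(bins)        # Bin 1 becomes 8, 8, 8
by_boundaries = smooth_by_boundaries(bins)  # Bin 1 becomes 4, 4, 15
```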
                                    Similarly, smoothing by bin medians can be employed, in which each bin value
                                 is replaced by the bin median. In smoothing by bin boundaries, the minimum and
                                 maximum values in a given bin are identified as the bin boundaries. Each bin value
                                 is then replaced by the closest boundary value. In general, the larger the width, the