Page 126 -
P. 126
2011/6/1
10-ch03-083-124-9780123814791
#7
3:16 Page 89
HAN
3.2 Data Cleaning 89
induction. For example, using the other customer attributes in your data set, you
may construct a decision tree to predict the missing values for income. Decision trees
and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while
regression is introduced in Section 3.4.5.
Methods 3 through 6 bias the data—the filled-in value may not be correct. Method 6,
however, is a popular strategy. In comparison to the other methods, it uses the most
information from the present data to predict missing values. By considering the other
attributes’ values in its estimation of the missing value for income, there is a greater
chance that the relationships between income and the other attributes are preserved.
It is important to note that, in some cases, a missing value may not imply an error
in the data! For example, when applying for a credit card, candidates may be asked to
supply their driver’s license number. Candidates who do not have a driver’s license may
naturally leave this field blank. Forms should allow respondents to specify values such
as “not applicable.” Software routines may also be used to uncover other null values
(e.g., “don’t know,” “?” or “none”). Ideally, each attribute should have one or more rules
regarding the null condition. The rules may specify whether or not nulls are allowed
and/or how such values should be handled or transformed. Fields may also be inten-
tionally left blank if they are to be provided in a later step of the business process. Hence,
although we can try our best to clean the data after it is seized, good database and data
entry procedure design should help minimize the number of missing values or errors in
the first place.
3.2.2 Noisy Data
“What is noise?” Noise is a random error or variance in a measured variable. In
Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots
and scatter plots), and methods of data visualization can be used to identify outliers,
which may represent noise. Given a numeric attribute such as, say, price, how can we
“smooth” out the data to remove the noise? Let’s look at the following data smoothing
techniques.
Binning: Binning methods smooth a sorted data value by consulting its “neighbor-
hood,” that is, the values around it. The sorted values are distributed into a number
of “buckets,” or bins. Because binning methods consult the neighborhood of values,
they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this
example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. For example, the mean of the
values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced
by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value
is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value. In general, the larger the width, the