Chapter 3 Data Preprocessing

3.2 Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data. In this section, you will study
basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values.
Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to
data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that
many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute? Let’s look at the following
methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably. By ignoring the tuple, we do
not make use of the remaining attributes’ values in the tuple. Such data could have
been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all share the same value, “Unknown.”
Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: Chapter 2 discussed measures of central tendency, which
indicate the “middle” value of a data distribution. For normal (symmetric) data
distributions, the mean can be used, while skewed data distributions should employ
the median (Section 2.2). For example, suppose that the data distribution regarding
the income of AllElectronics customers is symmetric and that the mean income is
$56,000. Use this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to credit risk, we
may replace the missing value with the mean income value for customers in the same
credit risk category as that of the given tuple. If the data distribution for a given class
is skewed, the median value is a better choice.
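Methods 3 through 5 above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the toy records, the use of None to mark a missing income, and the function names are all assumptions made for the example, not part of the AllElectronics data.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing income value (an assumption).
customers = [
    {"id": 1, "credit_risk": "low", "income": 60000},
    {"id": 2, "credit_risk": "low", "income": None},
    {"id": 3, "credit_risk": "low", "income": 52000},
    {"id": 4, "credit_risk": "high", "income": 30000},
    {"id": 5, "credit_risk": "high", "income": None},
    {"id": 6, "credit_risk": "high", "income": 34000},
]


def fill_with_constant(rows, attr, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_central_tendency(rows, attr, measure=mean):
    """Method 4: replace missing values with the attribute's mean or median."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = measure(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_by_class(rows, attr, class_attr, measure=mean):
    """Method 5: use the mean or median of the tuples in the same class."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    overall = measure(observed)  # fallback for a class with no observed values
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    fills = {c: measure(values) for c, values in by_class.items()}
    return [
        {**r, attr: fills.get(r[class_attr], overall) if r[attr] is None else r[attr]}
        for r in rows
    ]


if __name__ == "__main__":
    for row in fill_by_class(customers, "income", "credit_risk"):
        print(row)
```

Passing measure=median instead of the default mean gives the variant recommended for skewed distributions. Each function returns new records and leaves the input list unchanged, so the different fills can be compared side by side on the same data.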
6. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
