Page 125 -
P. 125
HAN
10-ch03-083-124-9780123814791
88 Chapter 3 Data Preprocessing 2011/6/1 3:16 Page 88 #6
3.2 Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identi-
fying outliers, and correct inconsistencies in the data. In this section, you will study
basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values.
Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to
data cleaning as a process.
3.2.1 Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that
many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute? Let’s look at the following
methods.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percent-
age of missing values per attribute varies considerably. By ignoring the tuple, we do
not make use of the remaining attributes’ values in the tuple. Such data could have
been useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: Chapter 2 discussed measures of central tendency, which
indicate the “middle” value of a data distribution. For normal (symmetric) data dis-
tributions, the mean can be used, while skewed data distribution should employ
the median (Section 2.2). For example, suppose that the data distribution regard-
ing the income of AllElectronics customers is symmetric and that the mean income is
$56,000. Use this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to credit risk, we
may replace the missing value with the mean income value for customers in the same
credit risk category as that of the given tuple. If the data distribution for a given class
is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree