Data Preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data
due to their typically huge size (often several gigabytes or more) and their likely origin
from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
results. “How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results? How can the data be preprocessed so as to improve
the efficiency and ease of the mining process?”

There are several data preprocessing techniques. Data cleaning can be applied to
remove noise and correct inconsistencies in data. Data integration merges data from
multiple sources into a coherent data store such as a data warehouse. Data reduction
can reduce data size by, for instance, aggregating, eliminating redundant features, or
clustering. Data transformations (e.g., normalization) may be applied, where data are
scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and
efficiency of mining algorithms involving distance measurements. These techniques are
not mutually exclusive; they may work together. For example, data cleaning can involve
transformations to correct wrong data, such as by transforming all entries for a date field
to a common format.

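
To make the normalization remark concrete, the following is a minimal sketch of min-max scaling into the range 0.0 to 1.0, together with a small check of how it changes a Euclidean distance comparison. The function names (min_max_normalize, euclidean) and the three income/age records are hypothetical and chosen only for illustration; they do not come from the text.

```python
from math import sqrt


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical: map them to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical records: annual income (dollars) and age (years).
incomes = [30000.0, 50000.0, 90000.0]
ages = [25.0, 60.0, 27.0]

raw = list(zip(incomes, ages))
scaled = list(zip(min_max_normalize(incomes), min_max_normalize(ages)))

# On the raw data, income dominates the distance because of its larger range.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
# After scaling, both attributes contribute on a comparable footing.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

In this toy data set, the second record is nearer to the first on the raw values, while the third record becomes the nearer one after scaling, because the 35-year age gap is no longer swamped by the income differences. This is one way the choice of scale can affect distance-based mining algorithms.
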
In Chapter 2, we learned about the different attribute types and how to use basic
statistical descriptions to study data characteristics. These can help identify erroneous
values and outliers, which will be useful in the data cleaning and integration steps.
Data preprocessing techniques, when applied before mining, can substantially improve the
overall quality of the patterns mined and/or the time required for the actual mining.

In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1.
The methods for data preprocessing are organized into the following categories: data
cleaning (Section 3.2), data integration (Section 3.3), data reduction (Section 3.4), and
data transformation (Section 3.5).
