3.2 Data Preprocessing: An Overview 87
discretization, and concept hierarchy generation are forms of data transformation.
You soon realize such data transformation operations are additional data preprocessing
procedures that would contribute toward the success of the mining process. Data
transformation and data discretization are discussed in Section 3.5.
Figure 3.1 summarizes the data preprocessing steps described here. Note that the
preceding categories are not mutually exclusive. For example, the removal of redundant
data may be seen as a form of data cleaning, as well as data reduction.
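The overlap between cleaning and reduction can be illustrated with a minimal sketch. The function name `drop_duplicate_rows` and the sample transactions are hypothetical, chosen only for illustration; the sketch assumes that "redundant" means an exact repeat of an earlier record, which is the simplest case.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be tracked in a set
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: the third record repeats the first verbatim.
transactions = [
    ("T1", "bread", 2),
    ("T2", "milk", 1),
    ("T1", "bread", 2),
    ("T3", "eggs", 12),
]

deduplicated = drop_duplicate_rows(transactions)
print(len(transactions), "->", len(deduplicated))
```

The same operation corrects an error in the data (cleaning) and shrinks the number of records to be mined (reduction), which is why the two categories cannot be cleanly separated.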
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data
preprocessing techniques can improve data quality, thereby helping to improve the accuracy
and efficiency of the subsequent mining process. Data preprocessing is an important step
in the knowledge discovery process, because quality decisions must be based on
quality data. Detecting data anomalies, rectifying them early, and reducing the data to be
analyzed can lead to huge payoffs for decision making.
[Figure 3.1 panels: data cleaning; data integration; data reduction (attributes A1, A2, A3, ..., A126 over transactions T1, T2, T3, T4, ..., T2000 reduced to attributes A1, A3, ..., A115 over transactions T1, T4, ..., T1456); data transformation (2, 32, 100, 59, 48 becomes 0.02, 0.32, 1.00, 0.59, 0.48).]
Figure 3.1 Forms of data preprocessing.
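The data transformation values shown in Figure 3.1 can be reproduced with a short sketch. The text does not state which normalization produced them; the sketch assumes each value is divided by the largest absolute value in the list (100 here), which is consistent with the numbers in the figure. The function name `scale_by_max_abs` is hypothetical.

```python
def scale_by_max_abs(values):
    """Scale values into [-1, 1] by dividing by the largest absolute value.

    Assumption: this is one normalization consistent with the figure's
    numbers; the text does not name the method used.
    """
    peak = max(abs(v) for v in values)
    if peak == 0:
        # All values are zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [v / peak for v in values]


raw = [2, 32, 100, 59, 48]
scaled = [round(v, 2) for v in scale_by_max_abs(raw)]
print(scaled)
```

Scaling of this kind keeps the relative ordering and spacing of the values while bringing them into a small common range, which matters for distance-based mining methods.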