3.6 Summary
Data quality is defined in terms of accuracy, completeness, consistency, timeliness,
believability, and interpretability. These qualities are assessed based on the intended
use of the data.
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data. Data cleaning is usually
performed as an iterative two-step process consisting of discrepancy detection and
data transformation.
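For example, a minimal sketch of two common cleaning steps in Python (using pandas on made-up price values) might fill a missing value with the attribute mean and then smooth noise by replacing each value with the mean of its equal-frequency bin:

    # Illustrative cleaning sketch; the price values are made up.
    import pandas as pd

    prices = pd.Series([4, 8, None, 15, 21, 21, 24, 25, 28, 34])

    # Fill the missing value with the attribute mean.
    prices = prices.fillna(prices.mean())

    # Smooth noise: partition into equal-frequency bins and replace
    # each value with its bin mean.
    bins = pd.qcut(prices, q=3)
    smoothed = prices.groupby(bins).transform("mean")
    print(smoothed)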
Data integration combines data from multiple sources to form a coherent data
store. The resolution of semantic heterogeneity, metadata, correlation analysis,
tuple duplication detection, and data conflict detection contribute to smooth data
integration.
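As a rough illustration of correlation analysis for redundancy detection, the following Python sketch computes a Pearson correlation coefficient for two numeric attributes and a chi-square test on a contingency table for two nominal attributes (all values are illustrative):

    # Illustrative correlation analysis; the data are made up.
    import numpy as np
    from scipy.stats import pearsonr, chi2_contingency

    # Numeric attributes: a high correlation coefficient suggests redundancy.
    age = np.array([23, 30, 35, 47, 52, 61])
    years_employed = np.array([1, 6, 10, 22, 28, 35])
    r, _ = pearsonr(age, years_employed)
    print("Pearson r =", round(r, 3))

    # Nominal attributes: chi-square test on an observed contingency table.
    observed = np.array([[250, 200],
                         [50, 1000]])
    chi2, p, dof, expected = chi2_contingency(observed)
    print("chi-square =", round(chi2, 1), "p-value =", p)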
Data reduction techniques obtain a reduced representation of the data while mini-
mizing the loss of information content. These include methods of dimensionality
reduction, numerosity reduction, and data compression. Dimensionality reduction
reduces the number of random variables or attributes under consideration. Methods
include wavelet transforms, principal components analysis, attribute subset selection,
and attribute creation. Numerosity reduction methods use parametric or nonpara-
metric models to obtain smaller representations of the original data. Parametric
models store only the model parameters instead of the actual data. Examples
include regression and log-linear models. Nonparametric methods include his-
tograms, clustering, sampling, and data cube aggregation. Data compression meth-
ods apply transformations to obtain a reduced or “compressed” representation of
the original data. The data reduction is lossless if the original data can be recon-
structed from the compressed data without any loss of information; otherwise, it is
lossy.
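To make the dimensionality reduction case concrete, a minimal sketch using scikit-learn's principal components analysis on synthetic data (the shapes and parameters here are arbitrary) could look as follows:

    # Illustrative dimensionality reduction with PCA on synthetic data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))       # 100 tuples, 5 attributes

    pca = PCA(n_components=2)           # keep the two strongest components
    X_reduced = pca.fit_transform(X)    # reduced representation, shape (100, 2)
    print(X_reduced.shape, pca.explained_variance_ratio_)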
Data transformation routines convert the data into appropriate forms for min-
ing. For example, in normalization, attribute data are scaled so as to fall within a
small range such as 0.0 to 1.0. Other examples are data discretization and concept
hierarchy generation.
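For instance, min-max normalization to the range [0.0, 1.0] computes v' = (v - min)/(max - min) for each value v of an attribute. A small Python sketch on illustrative income values:

    # Illustrative min-max normalization to [0.0, 1.0].
    import numpy as np

    income = np.array([12000.0, 54000.0, 73600.0, 98000.0])
    v_min, v_max = income.min(), income.max()
    normalized = (income - v_min) / (v_max - v_min)
    print(normalized)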
Data discretization transforms numeric data by mapping values to interval or con-
cept labels. Such methods can be used to automatically generate concept hierarchies
for the data, which allows for mining at multiple levels of granularity. Discretiza-
tion techniques include binning, histogram analysis, cluster analysis, decision tree
analysis, and correlation analysis. For nominal data, concept hierarchies may be
generated based on schema definitions as well as the number of distinct values per
attribute.
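As a simple illustration of binning, the following Python sketch maps made-up age values to three equal-width interval labels using pandas:

    # Illustrative discretization by equal-width binning.
    import pandas as pd

    ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70])
    intervals = pd.cut(ages, bins=3)    # three equal-width intervals
    print(intervals.value_counts().sort_index())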
Although numerous methods of data preprocessing have been developed, data pre-
processing remains an active area of research, due to the huge amount of inconsistent
or dirty data and the complexity of the problem.