
          Chapter 3 Data Preprocessing


                 3.6     Summary


                           Data quality is defined in terms of accuracy, completeness, consistency, timeliness,
                           believability, and interpretability. These qualities are assessed based on the intended
                           use of the data.
                           Data cleaning routines attempt to fill in missing values, smooth out noise while
                           identifying outliers, and correct inconsistencies in the data. Data cleaning is usually
                           performed as an iterative two-step process consisting of discrepancy detection and
                           data transformation.
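
As a minimal sketch of two of the cleaning steps named above, the following fills missing values with the attribute mean and smooths noise by bin means. The function names and the sample values are illustrative assumptions, and mean imputation is only one of several possible fill strategies.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into bins of bin_size items,
    and replace every value by the mean of its bin."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative data (assumed, not taken from a real data set).
filled = fill_missing_with_mean([4, None, 8])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
```

Here `filled` holds the list with the gap replaced by the mean of 4 and 8, and `smoothed` holds nine values in which each group of three sorted prices is replaced by its group mean.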
                           Data integration combines data from multiple sources to form a coherent data
                           store. The resolution of semantic heterogeneity, metadata, correlation analysis,
                           tuple duplication detection, and data conflict detection contribute to smooth data
                           integration.
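
Correlation analysis during integration can flag redundant numeric attributes. The sketch below computes the Pearson correlation coefficient between two attributes; the function name, the attribute names, and the data are hypothetical, and any threshold for calling an attribute redundant would be an application-specific choice.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two numeric attributes."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two attributes of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical attributes from two sources: one is a multiple of the other,
# so the coefficient is close to 1 and one attribute may be redundant.
monthly_salary = [1000, 2000, 3000, 4000]
annual_salary = [12000, 24000, 36000, 48000]
r = pearson_correlation(monthly_salary, annual_salary)
```

A coefficient near +1 or -1 suggests that one attribute can be derived from the other; a value near 0 indicates no linear relationship, although the attributes may still be related in other ways.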
                           Data reduction techniques obtain a reduced representation of the data while mini-
                           mizing the loss of information content. These include methods of dimensionality
                           reduction, numerosity reduction, and data compression. Dimensionality reduction
                           reduces the number of random variables or attributes under consideration. Methods
                           include wavelet transforms, principal components analysis, attribute subset selection,
                           and attribute creation. Numerosity reduction methods use parametric or nonparametric
                           models to obtain smaller representations of the original data. Parametric
                           models store only the model parameters instead of the actual data. Examples
                           include regression and log-linear models. Nonparametric methods include histograms,
                           clustering, sampling, and data cube aggregation. Data compression methods
                           apply transformations to obtain a reduced or “compressed” representation of
                           the original data. The data reduction is lossless if the original data can be recon-
                           structed from the compressed data without any loss of information; otherwise, it is
                           lossy.
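
To illustrate parametric numerosity reduction, the sketch below fits a straight line by least squares and keeps only its two parameters in place of the data points. The function names and the data are assumptions made for illustration; simple linear regression stands in for the wider family of regression and log-linear models.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns the two parameters."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def approximate(xs, slope, intercept):
    """Estimate the y values from the stored parameters alone."""
    return [slope * x + intercept for x in xs]


# Illustrative data that happens to lie exactly on a line.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
estimates = approximate([0, 1, 2, 3], slope, intercept)
```

Storing `slope` and `intercept` replaces the full list of y values. When the data do not lie exactly on the fitted line, the estimates differ from the originals, so the reduction is lossy in general.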
                           Data transformation routines convert the data into appropriate forms for min-
                           ing. For example, in normalization, attribute data are scaled so as to fall within a
                           small range such as 0.0 to 1.0. Other examples are data discretization and concept
                           hierarchy generation.
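
A minimal sketch of normalization, assuming min-max scaling as the method (other schemes exist). The function name, its parameters, and the income values are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot rescale a constant attribute")
    return [
        new_min + (v - low) / (high - low) * (new_max - new_min)
        for v in values
    ]


# Illustrative incomes: the smallest maps to 0.0 and the largest to 1.0.
scaled = min_max_normalize([12000, 73600, 98000])
```

Min-max scaling preserves the relative spacing of the original values, but it is sensitive to outliers because the observed minimum and maximum define the range.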
                           Data discretization transforms numeric data by mapping values to interval or con-
                           cept labels. Such methods can be used to automatically generate concept hierarchies
                           for the data, which allows for mining at multiple levels of granularity. Discretiza-
                           tion techniques include binning, histogram analysis, cluster analysis, decision tree
                           analysis, and correlation analysis. For nominal data, concept hierarchies may be
                           generated based on schema definitions as well as the number of distinct values per
                           attribute.
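
Discretization by binning can be sketched as follows, assuming equal-width bins. The function name, the ages, and the concept labels are hypothetical and serve only to show how numeric values map to intervals and then to concept labels.

```python
def equal_width_discretize(values, num_bins):
    """Split the value range into num_bins equal-width intervals.

    Returns the interval edges and, for each value, the index of its bin.
    The maximum value is placed in the last bin.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot bin a constant attribute")
    width = (high - low) / num_bins
    edges = [low + i * width for i in range(num_bins + 1)]
    indices = [min(int((v - low) / width), num_bins - 1) for v in values]
    return edges, indices


# Illustrative ages mapped to three intervals and then to assumed concept labels.
ages = [13, 15, 16, 19, 20, 35, 45, 52, 70]
edges, indices = equal_width_discretize(ages, 3)
concept_labels = ["youth", "adult", "senior"]  # hypothetical labels
labeled = [concept_labels[i] for i in indices]
```

Applying the same step recursively within each interval would yield finer intervals, which is one way such methods can build a concept hierarchy for mining at several levels of granularity.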
                           Although numerous methods of data preprocessing have been developed, data pre-
                           processing remains an active area of research, due to the huge amount of inconsistent
                           or dirty data and the complexity of the problem.