





3 Data Preprocessing










                     Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data
                               due to their typically huge size (often several gigabytes or more) and their likely origin
                               from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
                               results. “How can the data be preprocessed in order to help improve the quality of the data
                               and, consequently, of the mining results? How can the data be preprocessed so as to improve
                               the efficiency and ease of the mining process?”
                                 There are several data preprocessing techniques. Data cleaning can be applied to
                               remove noise and correct inconsistencies in data. Data integration merges data from
                               multiple sources into a coherent data store such as a data warehouse. Data reduction
                               can reduce data size by, for instance, aggregating, eliminating redundant features, or
                               clustering. Data transformations (e.g., normalization) may be applied, where data are
                               scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and
                               efficiency of mining algorithms involving distance measurements. These techniques are
                               not mutually exclusive; they may work together. For example, data cleaning can involve
                               transformations to correct wrong data, such as by transforming all entries for a date field
                               to a common format.
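
As a rough illustration of two transformations mentioned above, the sketch below rescales numeric values into the range 0.0 to 1.0 (min-max normalization) and rewrites date strings into one common format. The function names, the sample values, and the list of accepted date formats are assumptions made for this example; they do not come from the text, and real data would likely need a longer format list and a policy for unparseable entries.

```python
from datetime import datetime


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map every one to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical set of input formats; extend as the data requires.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical attribute values on very different scales.
incomes = [12000, 73600, 98000]
ages = [25, 40, 60]
norm_incomes = min_max_normalize(incomes)
norm_ages = min_max_normalize(ages)

# Hypothetical date entries recorded in inconsistent formats.
raw_dates = ["06/01/2011", "2011/6/1", "2011-06-01"]
clean_dates = [to_iso_date(d) for d in raw_dates]
```

After normalization both attributes span the same range, so neither dominates a distance computation simply because its raw values are larger.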
                                 In Chapter 2, we learned about the different attribute types and how to use basic
                               statistical descriptions to study data characteristics. These can help identify erroneous
                               values and outliers, which will be useful in the data cleaning and integration steps.
                               Data preprocessing techniques, when applied before mining, can substantially improve the
                               overall quality of the patterns mined, reduce the time required for the actual mining, or both.
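
One way basic statistical descriptions can point to suspect values is sketched below: it flags anything lying more than 1.5 times the interquartile range beyond the first or third quartile. This is a minimal sketch, not a prescribed method. The function name, the multiplier default, and the sample values are assumptions for illustration, and quartile conventions differ between tools, so the cut-offs may shift slightly elsewhere.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical attribute values (for example, salaries in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
suspects = iqr_outliers(salaries)
```

A flagged value is only a candidate for inspection. Whether it is an error or a genuine extreme is a judgment for the data cleaning step.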
                                 In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1.
                               The methods for data preprocessing are organized into the following categories: data
                               cleaning (Section 3.2), data integration (Section 3.3), data reduction (Section 3.4), and
                               data transformation (Section 3.5).









Data Mining: Concepts and Techniques
© 2012 Elsevier Inc. All rights reserved.