3.2 Data Preprocessing: An Overview


discretization, and concept hierarchy generation are forms of data transformation.
You soon realize that such data transformation operations are additional data
preprocessing procedures that would contribute toward the success of the mining
process. Data transformation and data discretization are discussed in Section 3.5.
  Figure 3.1 summarizes the data preprocessing steps described here. Note that the
preceding categories are not mutually exclusive. For example, the removal of
redundant data may be seen as a form of data cleaning as well as data reduction.
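To make the overlap between cleaning and reduction concrete, here is a minimal sketch that drops exact duplicate records from a small data set. The record layout, the field names, and the helper `drop_duplicate_records` are hypothetical and chosen only for illustration; real redundancy detection usually also has to handle near-duplicates and derived attributes, which this sketch does not attempt.

```python
def drop_duplicate_records(records):
    """Return the records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        # Build a hashable, order-independent key from the record's fields.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical transaction records; the third repeats the first exactly.
transactions = [
    {"id": "T1", "item": "printer", "price": 120},
    {"id": "T2", "item": "camera", "price": 300},
    {"id": "T1", "item": "printer", "price": 120},
]

deduplicated = drop_duplicate_records(transactions)
print(len(transactions), "->", len(deduplicated))
```

The same step can be described either way: the result has fewer records (reduction), and it no longer contains a repeated entry that could skew counts (cleaning).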
  In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data
preprocessing techniques can improve data quality, thereby helping to improve the
accuracy and efficiency of the subsequent mining process. Data preprocessing is an
important step in the knowledge discovery process, because quality decisions must be
based on quality data. Detecting data anomalies, rectifying them early, and reducing
the data to be analyzed can lead to huge payoffs for decision making.








[Figure 3.1 has four panels: Data cleaning; Data integration; Data reduction
(attributes A1, A2, A3, ..., A126 and transactions T1, T2, T3, T4, ..., T2000
reduced to attributes A1, A3, ..., A115 and transactions T1, T4, ..., T1456);
Data transformation (2, 32, 100, 59, 48 becomes 0.02, 0.32, 1.00, 0.59, 0.48).]

Figure 3.1 Forms of data preprocessing.
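
The data transformation panel of Figure 3.1 maps 2, 32, 100, 59, 48 to 0.02, 0.32, 1.00, 0.59, 0.48. The figure does not name the method used; the sketch below assumes scaling by the maximum absolute value, which is one simple normalization that reproduces those numbers. The function name `scale_by_max_abs` is hypothetical.

```python
def scale_by_max_abs(values):
    """Scale values into [-1.0, 1.0] by dividing by the largest absolute value.

    This is an assumed normalization chosen because it reproduces the
    numbers shown in Figure 3.1; the figure itself does not name a method.
    """
    if not values:
        return []
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # All values are zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [v / max_abs for v in values]


raw = [2, 32, 100, 59, 48]
scaled = scale_by_max_abs(raw)
print([round(v, 2) for v in scaled])  # [0.02, 0.32, 1.0, 0.59, 0.48]
```

Dividing by the maximum absolute value keeps the relative spacing of the values intact while bringing them onto a common scale, which is the point the figure's before-and-after lists illustrate.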