

          112   Chapter 3 Data Preprocessing                2011/6/1  3:16 Page 112  #30



                   3.5.1 Data Transformation Strategies Overview

                         In data transformation, the data are transformed or consolidated into forms appropriate
                         for mining. Strategies for data transformation include the following:

                         1. Smoothing, which works to remove noise from the data. Techniques include binning,
                           regression, and clustering.
                         2. Attribute construction (or feature construction), where new attributes are
                           constructed and added from the given set of attributes to help the mining process.
                         3. Aggregation, where summary or aggregation operations are applied to the data. For
                           example, the daily sales data may be aggregated so as to compute monthly and annual
                           total amounts. This step is typically used in constructing a data cube for data analysis
                           at multiple abstraction levels.
                         4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
                           such as −1.0 to 1.0, or 0.0 to 1.0.

                         5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
                           interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
                           The labels, in turn, can be recursively organized into higher-level concepts, resulting
                           in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy
                           for the attribute price. More than one concept hierarchy can be defined for the same
                           attribute to accommodate the needs of various users.
                         6. Concept hierarchy generation for nominal data, where attributes such as street can
                           be generalized to higher-level concepts, like city or country. Many hierarchies for
                           nominal attributes are implicit within the database schema and can be automatically
                           defined at the schema definition level.
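
A minimal Python sketch of four of the strategies listed above (smoothing, aggregation, normalization, and discretization) may help make them concrete. The function names, the equal-frequency bin-means choice for smoothing, the min-max formula for normalization, and the continuation of the age labels past 11-20 are all illustrative assumptions, not methods prescribed by this overview.

```python
from collections import defaultdict
from datetime import date


def smooth_by_bin_means(values, bin_size):
    """Smoothing by binning: sort the values, cut them into bins of
    bin_size items, and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


def aggregate_sales(daily_sales, level="month"):
    """Aggregation: roll (date, amount) records up to monthly or annual totals."""
    if level not in ("month", "year"):
        raise ValueError("level must be 'month' or 'year'")
    totals = defaultdict(float)
    for day, amount in daily_sales:
        key = (day.year, day.month) if level == "month" else day.year
        totals[key] += amount
    return dict(totals)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; mapping them to new_min is an arbitrary choice.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def age_interval_label(age):
    """Discretization: replace a raw integer age with an interval label
    following the pattern 0-10, 11-20, 21-30, and so on (assumed pattern)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 10:
        return "0-10"
    low = ((age - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"


# Illustrative (made-up) inputs.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
sales = [
    (date(2011, 1, 30), 100.0),
    (date(2011, 1, 31), 50.0),
    (date(2011, 2, 1), 25.0),
]
print(smooth_by_bin_means(prices, 3))
print(aggregate_sales(sales, level="month"))
print(min_max_normalize([0, 50, 100], -1.0, 1.0))
print(age_interval_label(17))
```

Each function stands alone on purpose: in practice these steps are chosen and ordered according to the mining task, and the bin size, target range, and interval width are parameters to tune rather than fixed values.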

                         Recall that there is much overlap between the major data preprocessing tasks. The first
                         three of these strategies were discussed earlier in this chapter. Smoothing is a form of



                                                        ($0...$1000]

                           ($0...$200]   ($200...$400]  ($400...$600]  ($600...$800]  ($800...$1000]

                           ($0...$100]    ($100...$200]  ($200...$300]  ($300...$400]  ($400...$500]
                           ($500...$600]  ($600...$700]  ($700...$800]  ($800...$900]  ($900...$1000]


              Figure 3.12 A concept hierarchy for the attribute price, where an interval ($X ...$Y] denotes the range
                         from $X (exclusive) to $Y (inclusive).
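
The hierarchy in Figure 3.12 can be sketched in Python as a lookup that maps a price to its interval at each level, from the most specific to the most general. The function names are hypothetical; the interval widths (100, 200, 1000) and the exclusive-lower, inclusive-upper convention are taken from the figure.

```python
import math


def price_interval(price, width):
    """Return the label of the ($X...$Y] interval of the given width that
    contains price, with $X exclusive and $Y inclusive."""
    if not 0 < price <= 1000:
        raise ValueError("price must lie in ($0...$1000]")
    # Rounding up to the next multiple of width makes the upper bound inclusive.
    high = math.ceil(price / width) * width
    return f"(${high - width}...${high}]"


def price_hierarchy_path(price):
    """Return the intervals containing price, lowest level first."""
    return [price_interval(price, width) for width in (100, 200, 1000)]


print(price_hierarchy_path(200))
print(price_hierarchy_path(950))
```

A price of exactly $200 falls in ($100...$200] rather than ($200...$300], which is the boundary behavior the caption describes.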