                                                      3.5 Data Transformation and Data Discretization


                               data cleaning and was addressed in Section 3.2.2. Section 3.2.3 on the data cleaning
                               process also discussed ETL tools, where users specify transformations to correct data
                               inconsistencies. Attribute construction and aggregation were discussed in Section 3.4
                               on data reduction. In this section, we therefore concentrate on the latter three strategies.
                                 Discretization techniques can be categorized based on how the discretization is per-
                               formed, such as whether it uses class information or which direction it proceeds (i.e.,
                               top-down vs. bottom-up). If the discretization process uses class information, then we
                               say it is supervised discretization. Otherwise, it is unsupervised. If the process starts by first
                               finding one or a few points (called split points or cut points) to split the entire attribute
                               range, and then repeats this recursively on the resulting intervals, it is called top-down
                               discretization or splitting. This contrasts with bottom-up discretization or merging, which
                               starts by considering all of the continuous values as potential split points, removes some
                               by merging neighboring values to form intervals, and then recursively applies this
                               process to the resulting intervals.
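The top-down direction can be sketched in a few lines of code. The function below is a minimal, hypothetical illustration of unsupervised splitting: it cuts the sorted value range at the median and recurses on each half. Real top-down methods choose split points by criteria such as entropy or chi-square rather than the median; the function name and sample data here are invented for illustration.

```python
def top_down_discretize(values, max_depth=2):
    """Recursively split a sorted value range at the median,
    returning the resulting cut points in ascending order.

    A simple unsupervised sketch of top-down discretization;
    practical methods pick split points by entropy, chi-square,
    or similar criteria instead of the median.
    """
    values = sorted(values)
    if max_depth == 0 or len(values) < 2:
        return []
    mid = len(values) // 2
    cut = (values[mid - 1] + values[mid]) / 2  # split point between halves
    left = top_down_discretize(values[:mid], max_depth - 1)
    right = top_down_discretize(values[mid:], max_depth - 1)
    return sorted(left + [cut] + right)

cuts = top_down_discretize([1, 2, 3, 10, 11, 12, 20, 21], max_depth=2)
print(cuts)  # → [2.5, 10.5, 16.0]
```

With `max_depth=2` the eight values are reduced to four intervals bounded by three cut points; a bottom-up method would instead start from all candidate cut points and merge them away.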
                                 Data discretization and concept hierarchy generation are also forms of data reduc-
                               tion. The raw data are replaced by a smaller number of interval or concept labels. This
                               simplifies the original data and makes the mining more efficient. The resulting patterns
                               mined are typically easier to understand. Concept hierarchies are also useful for mining
                               at multiple abstraction levels.
                                 The rest of this section is organized as follows. First, normalization techniques are
                               presented in Section 3.5.2. We then describe several techniques for data discretization,
                               each of which can be used to generate concept hierarchies for numeric attributes. The
                               techniques include binning (Section 3.5.3) and histogram analysis (Section 3.5.4), as
                               well as cluster analysis, decision tree analysis, and correlation analysis (Section 3.5.5).
                               Finally, Section 3.5.6 describes the automatic generation of concept hierarchies for
                               nominal data.

                         3.5.2 Data Transformation by Normalization

                               The measurement unit used can affect the data analysis. For example, changing mea-
                               surement units from meters to inches for height, or from kilograms to pounds for weight,
                               may lead to very different results. In general, expressing an attribute in smaller units will
                               lead to a larger range for that attribute, and thus tend to give such an attribute greater
                               effect or “weight.” To help avoid dependence on the choice of measurement units, the
                               data should be normalized or standardized. This involves transforming the data to fall
                               within a smaller or common range such as [−1,1] or [0.0, 1.0]. (The terms standardize
                               and normalize are used interchangeably in data preprocessing, although in statistics, the
                               latter term also has other connotations.)
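As an illustration of mapping an attribute onto a common range such as [0.0, 1.0], the sketch below applies a simple linear (min-max style) scaling. It is one generic example rather than the only normalization method, and the function name and sample weight values are invented for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values into [new_min, new_max]:
    v' = (v - min) / (max - min) * (new_max - new_min) + new_min.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map every value to new_min
        return [new_min] * len(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

weights_kg = [40, 56, 72]  # hypothetical weight attribute
print(min_max_normalize(weights_kg))  # → [0.0, 0.5, 1.0]
```

Note that the result is the same whether the weights are recorded in kilograms or pounds, which is precisely the unit-independence that normalization is meant to provide.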
                                 Normalizing the data attempts to give all attributes an equal weight. Normaliza-
                               tion is particularly useful for classification algorithms involving neural networks or
                               distance measurements such as nearest-neighbor classification and clustering. If using
                               the neural network backpropagation algorithm for classification mining (Chapter 9),
                               normalizing the input values for each attribute measured in the training tuples will help
                               speed up the learning phase. For distance-based methods, normalization helps prevent