                           Various partitioning rules can be used to define histograms (Section 3.4.6). In an
                         equal-width histogram, for example, the values are partitioned into equal-size partitions
                         or ranges (e.g., earlier in Figure 3.8 for price, where each bucket has a width of $10).
                         With an equal-frequency histogram, the values are partitioned so that, ideally, each par-
                         tition contains the same number of data tuples. The histogram analysis algorithm can be
                         applied recursively to each partition in order to automatically generate a multilevel con-
                         cept hierarchy, with the procedure terminating once a prespecified number of concept
                         levels has been reached. A minimum interval size can also be used per level to control the
                         recursive procedure. This specifies the minimum width of a partition, or the minimum
                         number of values for each partition at each level. Histograms can also be partitioned
                         based on cluster analysis of the data distribution, as described next.
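                          To make the two partitioning rules concrete, the following is a minimal Python
                          sketch (the function names and the small price list are illustrative, not from
                          the text). Equal-width binning divides the value range into k buckets of the
                          same width, while equal-frequency binning places bucket boundaries at evenly
                          spaced ranks in the sorted data.

def equal_width_bins(values, k):
    """Boundaries of k buckets of equal width over the range of the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_frequency_bins(values, k):
    """Boundaries chosen so each bucket holds roughly the same number of tuples."""
    ordered = sorted(values)
    n = len(ordered)
    # Take boundary values at evenly spaced ranks in the sorted data.
    return [ordered[i * n // k] for i in range(k)] + [ordered[-1]]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))      # [4.0, 14.0, 24.0, 34.0]: each bucket $10 wide
print(equal_frequency_bins(prices, 3))  # [4, 21, 25, 34]: about three tuples per bucket

                          Applying either rule recursively within each resulting bucket, until a
                          prespecified number of levels is reached, yields the multilevel concept
                          hierarchy described above.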


                   3.5.5 Discretization by Cluster, Decision Tree,
                         and Correlation Analyses
                         Clustering, decision tree analysis, and correlation analysis can be used for data dis-
                         cretization. We briefly study each of these approaches.
                           Cluster analysis is a popular data discretization method. A clustering algorithm can
                         be applied to discretize a numeric attribute, A, by partitioning the values of A into clus-
                         ters or groups. Clustering takes the distribution of A into consideration, as well as the
                         closeness of data points, and therefore is able to produce high-quality discretization
                         results.
                           Clustering can be used to generate a concept hierarchy for A by following either a
                         top-down splitting strategy or a bottom-up merging strategy, where each cluster forms
                         a node of the concept hierarchy. In the former, each initial cluster or partition may
                          be further decomposed into several subclusters, forming a lower level of the
                          hierarchy. In the latter, clusters are formed by repeatedly grouping neighboring clusters in
                         order to form higher-level concepts. Clustering methods for data mining are studied in
                         Chapters 10 and 11.
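
                          As one possible realization (a sketch assuming scikit-learn is available; the
                          helper name cluster_discretize is hypothetical), a numeric attribute can be
                          discretized by clustering its values with k-means and placing interval
                          boundaries midway between adjacent cluster centers:

import numpy as np
from sklearn.cluster import KMeans

def cluster_discretize(values, k):
    """Discretize a numeric attribute by clustering its values into k groups.

    Each resulting interval corresponds to one cluster; the boundary between
    two neighboring clusters is the midpoint of their centers.
    """
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    centers = np.sort(KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_.ravel())
    return (centers[:-1] + centers[1:]) / 2

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(cluster_discretize(prices, 3))  # two split-points defining three intervals

                          A top-down variant would split each resulting interval again by clustering it;
                          a bottom-up variant would merge the closest neighboring clusters to form the
                          higher-level concepts.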
                           Techniques to generate decision trees for classification (Chapter 8) can be applied to
                         discretization. Such techniques employ a top-down splitting approach. Unlike the other
                         methods mentioned so far, decision tree approaches to discretization are supervised,
                         that is, they make use of class label information. For example, we may have a data set of
                         patient symptoms (the attributes) where each patient has an associated diagnosis class
                         label. Class distribution information is used in the calculation and determination of
                         split-points (data values for partitioning an attribute range). Intuitively, the main idea
                         is to select split-points so that a given resulting partition contains as many tuples of the
                         same class as possible. Entropy is the most commonly used measure for this purpose. To
                         discretize a numeric attribute, A, the method selects the value of A that has the minimum
                         entropy as a split-point, and recursively partitions the resulting intervals to arrive at a
                         hierarchical discretization. Such discretization forms a concept hierarchy for A.
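                          A minimal sketch of this entropy-based selection follows (the symptom data
                          below is invented for illustration): among candidate split-points, taken
                          midway between consecutive distinct values of A, choose the one that minimizes
                          the weighted entropy of the two partitions it induces.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Split-point of attribute A minimizing the weighted entropy of the
    two resulting partitions."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    # Candidate split-points lie midway between consecutive distinct values.
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [lbl for _, lbl in pairs[:i]]
        right = [lbl for _, lbl in pairs[i:]]
        w = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (w, split))
    return best[1]

temps = [37.0, 37.2, 38.5, 39.1, 39.4, 40.0]
diagnosis = ["healthy", "healthy", "sick", "sick", "sick", "sick"]
print(best_split(temps, diagnosis))  # 37.85: separates the two classes cleanly

                          Recursively applying best_split to the interval on each side of the chosen
                          split-point, until a stopping criterion such as a minimum partition size is
                          met, produces the hierarchical discretization.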
                          Because decision tree–based discretization uses class information, the interval
                          boundaries (split-points) are more likely to occur in places that help improve
                          classification accuracy. Decision trees and the entropy measure are described in
                          greater detail in Section 8.2.2.