                         3.5 Data Transformation and Data Discretization


                                 Measures of correlation can be used for discretization. ChiMerge is a χ²-based
                               discretization method. The discretization methods that we have studied up to this
                               point have all employed a top-down, splitting strategy. This contrasts with ChiMerge,
                               which employs a bottom-up approach by finding the best neighboring intervals and
                               then merging them to form larger intervals, recursively. As with decision tree analysis,
                               ChiMerge is supervised in that it uses class information. The basic notion is that for
                               accurate discretization, the relative class frequencies should be fairly consistent within
                               an interval. Therefore, if two adjacent intervals have a very similar distribution of classes,
                               then the intervals can be merged. Otherwise, they should remain separate.
                                 ChiMerge proceeds as follows. Initially, each distinct value of a numeric attribute A is
                               considered to be one interval. χ² tests are performed for every pair of adjacent intervals.
                               Adjacent intervals with the least χ² values are merged together, because low χ² values
                               for a pair indicate similar class distributions. This merging process proceeds recursively
                               until a predefined stopping criterion is met.
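
The merging loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names chi2_pair and chimerge are hypothetical, the stopping criterion is assumed to be a maximum number of intervals (the text only says "a predefined stopping criterion"; a χ² significance threshold is another common choice), and cells with zero expected count are simply skipped.

```python
from collections import Counter


def chi2_pair(counts_a, counts_b, classes):
    """Pearson chi-square statistic for the 2 x k table formed by
    the class counts of two adjacent intervals."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for c in classes:
            col_total = counts_a[c] + counts_b[c]
            expected = row_total * col_total / total
            if expected > 0:  # assumption: skip cells with zero expected count
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_intervals):
    """Bottom-up discretization of a numeric attribute.

    Returns the lower bounds of the final intervals.
    Assumed stopping criterion: at most max_intervals intervals remain.
    """
    classes = sorted(set(labels))

    # Initially, each distinct value is its own interval.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    intervals = [[v, by_value[v]] for v in sorted(by_value)]

    while len(intervals) > max_intervals:
        # Chi-square test for every pair of adjacent intervals.
        scores = [
            chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        # Merge the adjacent pair with the lowest chi-square value,
        # i.e., the pair with the most similar class distributions.
        i = scores.index(min(scores))
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]

    return [lower for lower, _ in intervals]
```

For instance, calling chimerge([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], max_intervals=2) on this made-up data merges the values that share a class first and leaves the boundary between 3 and 10 intact, because that pair has the highest χ² value.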

                         3.5.6 Concept Hierarchy Generation for Nominal Data

                               We now look at data transformation for nominal data. In particular, we study concept
                               hierarchy generation for nominal attributes. Nominal attributes have a finite (but pos-
                               sibly large) number of distinct values, with no ordering among the values. Examples
                               include geographic location, job category, and item type.
                                 Manual definition of concept hierarchies can be a tedious and time-consuming task
                               for a user or a domain expert. Fortunately, many hierarchies are implicit within the
                               database schema and can be automatically defined at the schema definition level. The
                               concept hierarchies can be used to transform the data into multiple levels of granular-
                               ity. For example, data mining patterns regarding sales may be found relating to specific
                               regions or countries, in addition to individual branch locations.
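
As a small illustration of viewing the same data at several levels of granularity, the sketch below rolls sales figures up from branch to city to country. All names and numbers here are hypothetical (the branch identifiers, the two mapping tables, and the roll_up helper are assumptions made for the example), and a real system would typically read such mappings from the database schema.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: branch < city < country.
branch_to_city = {"B1": "Vancouver", "B2": "Vancouver", "B3": "Chicago"}
city_to_country = {"Vancouver": "Canada", "Chicago": "USA"}

# Hypothetical sales records at the finest level: (branch, amount).
sales = [("B1", 100), ("B2", 250), ("B3", 400)]


def roll_up(records, mapping):
    """Aggregate (key, amount) records one level up the hierarchy."""
    totals = defaultdict(int)
    for key, amount in records:
        totals[mapping[key]] += amount
    return dict(totals)


by_city = roll_up(sales, branch_to_city)
by_country = roll_up(by_city.items(), city_to_country)
```

Patterns can then be sought in by_city or by_country as well as in the original branch-level records.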
                                 We study four methods for the generation of concept hierarchies for nominal data,
                               as follows.

                               1. Specification of a partial ordering of attributes explicitly at the schema level by
                                 users or experts: Concept hierarchies for nominal attributes or dimensions typically
                                 involve a group of attributes. A user or expert can easily define a concept hierarchy by
                                 specifying a partial or total ordering of the attributes at the schema level. For exam-
                                 ple, suppose that a relational database contains the following group of attributes:
                                 street, city, province or state, and country. Similarly, a data warehouse location dimen-
                                 sion may contain the same attributes. A hierarchy can be defined by specifying the
                                 total ordering among these attributes at the schema level such as street < city <
                                 province or state < country.
                               2. Specification of a portion of a hierarchy by explicit data grouping: This is essen-
                                 tially the manual definition of a portion of a concept hierarchy. In a large database,
                                 it is unrealistic to define an entire concept hierarchy by explicit value enumera-
                                 tion. In contrast, it is easy to specify explicit groupings for a small portion
                                 of intermediate-level data. For example, after specifying that province and country