
                                                4.5 Data Generalization by Attribute-Oriented Induction  175


                         4.5.3 Attribute-Oriented Induction for Class Comparisons
                               In many applications, users may not be interested in having a single class (or
                               concept) described or characterized, but prefer to mine a description that compares or
                               distinguishes one class (or concept) from other comparable classes (or concepts).
                               Class discrimination or comparison (hereafter referred to as class comparison) mines
                               descriptions that distinguish a target class from its contrasting classes. Notice that the
                               target and contrasting classes must be comparable in the sense that they share similar
                               dimensions and attributes. For example, the three classes person, address, and item are
                               not comparable. However, the sales in each of the last three years are comparable classes,
                               as are, for example, computer science students versus physics students.
                                 Our discussions on class characterization in the previous sections handle multilevel
                               data summarization and characterization in a single class. The techniques developed can
                               be extended to handle class comparison across several comparable classes. For example,
                               the attribute generalization process described for class characterization can be modified
                               so that the generalization is performed synchronously among all the classes compared.
                               This allows the attributes in all of the classes to be generalized to the same abstraction
                               levels.
                                 Suppose, for instance, that we are given the AllElectronics data for sales in 2009 and
                               in 2010 and want to compare these two classes. Consider the dimension location with
                               abstractions at the city, province or state, and country levels. Data in each class should be
                               generalized to the same location level. That is, they are all synchronously generalized to
                               either the city level, the province or state level, or the country level. Ideally, this is more
                               useful than comparing, say, the sales in Vancouver in 2009 with the sales in the United
                               States in 2010 (i.e., where each set of sales data is generalized to a different level). The
                               users, however, should have the option to override such an automated, synchronous
                               comparison with their own choices, when preferred.
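
The synchronous generalization described above can be sketched in Python. This is a minimal illustration, not the book's algorithm: the concept hierarchy, the city names, the sales counts, and the function names (`generalize_location`, `generalize_class`, `synchronous_generalize`) are all hypothetical and chosen only for the example. Each class is climbed to the same level of the location hierarchy, and tuples that become identical are merged by summing their counts.

```python
from collections import Counter

# Hypothetical concept hierarchy for the location dimension:
# city -> (province_or_state, country). Values are illustrative only.
LOCATION_HIERARCHY = {
    "Vancouver": ("British Columbia", "Canada"),
    "Victoria": ("British Columbia", "Canada"),
    "Toronto": ("Ontario", "Canada"),
    "Chicago": ("Illinois", "USA"),
    "New York": ("New York", "USA"),
}

LEVELS = ("city", "province_or_state", "country")


def generalize_location(city, level):
    """Map a city to its ancestor at the requested abstraction level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if level == "city":
        return city
    province_or_state, country = LOCATION_HIERARCHY[city]
    return province_or_state if level == "province_or_state" else country


def generalize_class(tuples, level):
    """Generalize one class and merge identical tuples, accumulating counts."""
    merged = Counter()
    for city, count in tuples:
        merged[generalize_location(city, level)] += count
    return dict(merged)


def synchronous_generalize(classes, level):
    """Generalize every class to the same level so they stay comparable."""
    return {name: generalize_class(tuples, level)
            for name, tuples in classes.items()}


# Made-up (city, count) tuples for the two classes being compared.
sales = {
    "sales_2009": [("Vancouver", 300), ("Toronto", 200), ("Chicago", 500)],
    "sales_2010": [("Victoria", 150), ("Vancouver", 250), ("New York", 600)],
}

by_country = synchronous_generalize(sales, "country")
for name, relation in by_country.items():
    print(name, relation)
```

Passing a single `level` argument for all classes is what keeps the comparison synchronous. A user override, as mentioned above, would amount to calling `generalize_class` with a different level for each class.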
                                 “How is class comparison performed?” In general, the procedure is as follows:

                               1. Data collection: The set of relevant data in the database is collected by query
                                 processing and is partitioned into a target class and one or more contrasting
                                 classes.
                               2. Dimension relevance analysis: If there are many dimensions, then dimension
                                 relevance analysis should be performed on these classes to select only the highly
                                 relevant dimensions for further analysis. Correlation or entropy-based measures can
                                 be used for this step (Chapter 3).
                               3. Synchronous generalization: Generalization is performed on the target class to the
                                 level controlled by a user- or expert-specified dimension threshold, which results in
                                 a prime target class relation. The concepts in the contrasting class(es) are
                                 generalized to the same level as those in the prime target class relation, forming
                                 the prime contrasting class(es) relation.
                               4. Presentation of the derived comparison: The resulting class comparison description
                                 can be visualized in the form of tables, graphs, and rules. This presentation usually
                                 includes a “contrasting” measure such as count% (percentage count) that reflects the