4.5 Data Generalization by Attribute-Oriented Induction
4.5.3 Attribute-Oriented Induction for Class Comparisons
In many applications, users may be less interested in having a single class (or concept)
described or characterized than in mining a description that compares or
distinguishes one class (or concept) from other comparable classes (or concepts).
Class discrimination or comparison (hereafter referred to as class comparison) mines
descriptions that distinguish a target class from its contrasting classes. Notice that the
target and contrasting classes must be comparable in the sense that they share similar
dimensions and attributes. For example, the three classes person, address, and item are
not comparable. However, the sales in each of the last three years are comparable classes,
as are, for example, computer science students and physics students.
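The comparability requirement can be illustrated with a minimal sketch. It assumes, for illustration only, that each class is a non-empty list of records and that "comparable" can be approximated as "having the same attribute names"; the text itself only requires similar dimensions and attributes. The function name are_comparable and the sample records are hypothetical.

```python
def are_comparable(*relations):
    """Return True if every relation exposes the same attribute names.

    Assumption: each relation is a non-empty list of dicts, and identical
    attribute names stand in for "similar dimensions and attributes".
    """
    schemas = [set(relation[0]) for relation in relations]
    return all(schema == schemas[0] for schema in schemas)


# Hypothetical sample records, made up for this sketch.
cs_students = [{"name": "A", "major": "CS", "gpa": 3.7}]
physics_students = [{"name": "B", "major": "Physics", "gpa": 3.5}]
items = [{"item_id": 1, "brand": "X", "price": 9.99}]

same_schema = are_comparable(cs_students, physics_students)
different_schema = are_comparable(cs_students, items)
```

Here same_schema is True and different_schema is False, mirroring the contrast between student classes and unrelated classes such as person and item.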
Our discussions on class characterization in the previous sections handle multilevel
data summarization and characterization in a single class. The techniques developed can
be extended to handle class comparison across several comparable classes. For example,
the attribute generalization process described for class characterization can be modified
so that the generalization is performed synchronously among all the classes compared.
This allows the attributes in all of the classes to be generalized to the same abstraction
levels.
Suppose, for instance, that we are given the AllElectronics data for sales in 2009 and
in 2010 and want to compare these two classes. Consider the dimension location with
abstractions at the city, province or state, and country levels. Data in each class should be
generalized to the same location level. That is, they are all synchronously generalized to
either the city level, the province or state level, or the country level. This is generally
more useful than comparing, say, the sales in Vancouver in 2009 with the sales in the United
States in 2010 (i.e., where each set of sales data is generalized to a different level). The
users, however, should have the option to override such an automated, synchronous
comparison with their own choices, when preferred.
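The location example can be sketched in code as follows. This is a minimal illustration, not the book's algorithm: the concept hierarchy entries, the sample sales tuples, and the names generalize_location and compare_classes are all assumptions made up for the sketch. Every class is generalized to one shared level unless the user supplies a per-class override.

```python
from collections import Counter

# Hypothetical concept hierarchy for the location dimension:
# city -> province_or_state -> country. Entries are illustrative only.
CITY_TO_REGION = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
    "New York": "New York",
}
REGION_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
    "New York": "USA",
}
LEVELS = ("city", "province_or_state", "country")


def generalize_location(city, level):
    """Climb the hierarchy from a city up to the requested level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if level == "city":
        return city
    region = CITY_TO_REGION[city]
    if level == "province_or_state":
        return region
    return REGION_TO_COUNTRY[region]


def compare_classes(classes, level, overrides=None):
    """Count tuples per generalized location value for each class.

    classes:   mapping of class name -> list of city values
    level:     the shared level applied synchronously to every class
    overrides: optional mapping of class name -> level chosen by the user
    """
    overrides = overrides or {}
    result = {}
    for name, cities in classes.items():
        class_level = overrides.get(name, level)
        result[name] = Counter(
            generalize_location(city, class_level) for city in cities
        )
    return result


# Made-up sales tuples, reduced to their location attribute.
sales = {
    "sales_2009": ["Vancouver", "Toronto", "Chicago"],
    "sales_2010": ["Vancouver", "Chicago", "New York", "New York"],
}

# Synchronous generalization: both classes are lifted to the country level.
by_country = compare_classes(sales, level="country")

# User override: keep 2009 at the city level, 2010 at the country level.
mixed = compare_classes(sales, level="country", overrides={"sales_2009": "city"})
```

With the shared level, both classes are keyed by the same country values and their counts can be compared directly. With the override, the 2009 counts are keyed by city and no longer line up with the 2010 counts, which is the mismatch the synchronous default avoids.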
“How is class comparison performed?” In general, the procedure is as follows:
1. Data collection: The set of relevant data in the database is collected by query
processing and is partitioned into a target class and one or more contrasting
classes.
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance
analysis should be performed on these classes to select only the highly relevant
dimensions for further analysis. Correlation or entropy-based measures can be used
for this step (Chapter 3).
3. Synchronous generalization: Generalization is performed on the target class to the
level controlled by a user- or expert-specified dimension threshold, which results in
a prime target class relation. The concepts in the contrasting class(es) are generalized
to the same level as those in the prime target class relation, forming the prime
contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison description
can be visualized in the form of tables, graphs, and rules. This presentation usually
includes a “contrasting” measure such as count% (percentage count) that reflects the