Chapter 4 Data Warehousing and Online Analytical Processing
to the attribute. This rule is based on the following reasoning. Use of a generalization
operator to generalize an attribute value within a tuple, or rule, in the working relation
will make the rule cover more of the original data tuples, thus generalizing the concept it
represents. This corresponds to the generalization rule known as climbing generalization
trees in learning from examples, or concept tree ascension.
Both rules, attribute removal and attribute generalization, claim that if there is a large
set of distinct values for an attribute, further generalization should be applied. This
raises the question: How large is “a large set of distinct values for an attribute” considered
to be?
Depending on the attributes or application involved, a user may prefer some
attributes to remain at a rather low abstraction level while others are generalized to
higher levels. How high an attribute should be generalized is typically
quite subjective. The control of this process is called attribute generalization control.
If the attribute is generalized “too high,” it may lead to overgeneralization, and the
resulting rules may not be very informative.
On the other hand, if the attribute is not generalized to a “sufficiently high level,”
then undergeneralization may result, where the rules obtained may not be informative
either. Thus, a balance should be attained in attribute-oriented generalization. There are
many possible ways to control a generalization process. We will describe two common
approaches and illustrate how they work.
The first technique, called attribute generalization threshold control, either sets one
generalization threshold for all of the attributes, or sets one threshold for each attribute.
If the number of distinct values in an attribute is greater than the attribute threshold,
further attribute removal or attribute generalization should be performed. Data mining
systems typically have a default attribute threshold value, generally ranging from 2 to 8,
and should allow experts and users to modify the threshold values as well. If a user feels
that the generalization reaches too high a level for a particular attribute, the threshold
can be increased. This corresponds to drilling down along the attribute. Also, to further
generalize a relation, the user can reduce an attribute’s threshold, which corresponds to
rolling up along the attribute.
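Attribute generalization threshold control can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the concept hierarchy CITY_HIERARCHY, the sample values, and the function name apply_attribute_threshold are all hypothetical, and the sketch assumes that every value of the attribute climbs exactly one level per step. It also assumes that the attribute is removed (signaled by None) when the threshold is still exceeded and no higher level exists.

```python
from typing import Dict, List, Optional

# Hypothetical concept hierarchy (child -> parent) for a "city" attribute.
CITY_HIERARCHY: Dict[str, str] = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "Seattle": "Washington",
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Washington": "USA",
}


def apply_attribute_threshold(
    values: List[str], hierarchy: Dict[str, str], threshold: int
) -> Optional[List[str]]:
    """Climb the hierarchy until the attribute has at most `threshold`
    distinct values. Return None if the threshold is still exceeded at the
    top of the hierarchy (assumed here to mean: remove the attribute)."""
    current = list(values)
    while len(set(current)) > threshold:
        # Replace each value by its parent; values without a parent stay put.
        climbed = [hierarchy.get(v, v) for v in current]
        if climbed == current:
            return None  # no higher level left: attribute removal
        current = climbed
    return current


cities = ["Vancouver", "Victoria", "Toronto", "Ottawa", "Seattle"]
provinces = apply_attribute_threshold(cities, CITY_HIERARCHY, 3)
countries = apply_attribute_threshold(cities, CITY_HIERARCHY, 2)
```

With a threshold of 3 the values stop at the province or state level; reducing the threshold to 2 rolls the attribute up to the country level, and raising it to 5 leaves the original cities in place, which corresponds to drilling down.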
The second technique, called generalized relation threshold control, sets a threshold
for the generalized relation. If the number of (distinct) tuples in the generalized relation
is greater than the threshold, further generalization should be performed. Otherwise,
no further generalization should be performed. Such a threshold may also be preset in
the data mining system (usually within a range of 10 to 30), or set by an expert or user,
and should be adjustable. For example, if a user feels that the generalized relation is too
small, he or she can increase the threshold, which implies drilling down. Otherwise, to
further generalize a relation, the threshold can be reduced, which implies rolling up.
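Generalized relation threshold control can be sketched in a similar way. The text does not say which attribute should be generalized next when the relation is still too large, so this sketch assumes a simple heuristic: climb the attribute that currently has the most distinct values and can still be generalized. The hierarchies, the sample tuples, and the function name apply_relation_threshold are hypothetical, and identical generalized tuples are merged with a count for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical concept hierarchies (child -> parent), one per attribute position.
CITY = {"Vancouver": "BC", "Victoria": "BC", "Toronto": "Ontario",
        "Ottawa": "Ontario", "BC": "Canada", "Ontario": "Canada"}
AGE = {"16-20": "youth", "21-25": "youth", "26-30": "adult", "31-35": "adult"}


def apply_relation_threshold(
    tuples: List[Tuple[str, ...]],
    hierarchies: List[Dict[str, str]],
    threshold: int,
) -> Counter:
    """Generalize until the relation has at most `threshold` distinct tuples,
    or until no attribute can climb any further."""
    rows = [tuple(r) for r in tuples]
    while len(set(rows)) > threshold:
        best, best_distinct = None, 0
        for i, h in enumerate(hierarchies):
            column = {r[i] for r in rows}
            can_climb = any(h.get(v, v) != v for v in column)
            # Assumed heuristic: prefer the attribute with the most distinct values.
            if can_climb and len(column) > best_distinct:
                best, best_distinct = i, len(column)
        if best is None:
            break  # every attribute is already at its top level
        h = hierarchies[best]
        rows = [r[:best] + (h.get(r[best], r[best]),) + r[best + 1:] for r in rows]
    # Merge identical generalized tuples and keep their counts.
    return Counter(rows)


data = [
    ("Vancouver", "16-20"), ("Victoria", "21-25"), ("Toronto", "26-30"),
    ("Ottawa", "31-35"), ("Vancouver", "21-25"), ("Toronto", "16-20"),
]
relation = apply_relation_threshold(data, [CITY, AGE], 3)
```

Raising the threshold leaves more distinct tuples in the result (drilling down), while lowering it forces further climbing (rolling up). If no attribute can be generalized further, the sketch stops even though the threshold is still exceeded.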
These two techniques can be applied in sequence: First apply the attribute threshold
control technique to generalize each attribute, and then apply relation threshold control
to further reduce the size of the generalized relation. No matter which generalization
control technique is applied, the user should be allowed to adjust the generalization
thresholds in order to obtain interesting concept descriptions.
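The sequential use of the two techniques might look like the following sketch. It is only an illustration under assumptions: the hierarchies, the sample rows, and the names climb and two_stage_generalization are hypothetical, one attribute threshold is shared by all attributes, and attribute removal is omitted for brevity (an attribute that cannot climb further is simply left as it is). In the second stage, the attribute with the most distinct values is tried first, which is an assumed heuristic rather than something the text prescribes.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical concept hierarchies (child -> parent), keyed by attribute name.
HIERARCHIES: Dict[str, Dict[str, str]] = {
    "city": {"Vancouver": "BC", "Victoria": "BC", "Toronto": "Ontario",
             "Ottawa": "Ontario", "BC": "Canada", "Ontario": "Canada"},
    "major": {"physics": "science", "chemistry": "science",
              "biology": "science", "history": "arts", "music": "arts"},
}

Row = Dict[str, str]


def climb(rows: List[Row], attr: str) -> Tuple[List[Row], bool]:
    """Replace each value of `attr` by its parent; report whether anything changed."""
    h = HIERARCHIES[attr]
    new_rows = [{**r, attr: h.get(r[attr], r[attr])} for r in rows]
    return new_rows, new_rows != rows


def distinct_values(rows: List[Row], attr: str) -> int:
    return len({r[attr] for r in rows})


def distinct_tuples(rows: List[Row]) -> int:
    return len({tuple(sorted(r.items())) for r in rows})


def two_stage_generalization(
    rows: List[Row], attr_threshold: int, relation_threshold: int
) -> Counter:
    # Stage 1: attribute threshold control, applied to each attribute in turn.
    for attr in HIERARCHIES:
        changed = True
        while changed and distinct_values(rows, attr) > attr_threshold:
            rows, changed = climb(rows, attr)
    # Stage 2: relation threshold control on the result of stage 1.
    changed = True
    while changed and distinct_tuples(rows) > relation_threshold:
        changed = False
        # Try attributes in order of decreasing distinct-value count.
        for attr in sorted(HIERARCHIES, key=lambda a: -distinct_values(rows, a)):
            rows, changed = climb(rows, attr)
            if changed:
                break
    return Counter(tuple(r[a] for a in HIERARCHIES) for r in rows)


students = [
    {"city": "Vancouver", "major": "physics"},
    {"city": "Victoria", "major": "chemistry"},
    {"city": "Toronto", "major": "history"},
    {"city": "Ottawa", "major": "music"},
    {"city": "Vancouver", "major": "biology"},
    {"city": "Toronto", "major": "physics"},
]
result = two_stage_generalization(students, attr_threshold=2, relation_threshold=2)
```

Both thresholds are ordinary parameters, so a user can rerun the sketch with different values until the resulting description is at a useful level of detail.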
In many database-oriented induction processes, users are interested in obtaining
quantitative or statistical information about the data at different abstraction levels.