
          170   Chapter 4 Data Warehousing and Online Analytical Processing



                         to the attribute. This rule is based on the following reasoning. Use of a generalization
                         operator to generalize an attribute value within a tuple, or rule, in the working relation
                         will make the rule cover more of the original data tuples, thus generalizing the concept it
                         represents. This corresponds to the generalization rule known as climbing generalization
                         trees in learning from examples, or concept tree ascension.
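
As a rough illustration of concept tree ascension, the following Python sketch replaces one attribute value in each tuple by its parent in a concept hierarchy and merges the tuples that become identical. The hierarchy `CITY_TO_COUNTRY`, the sample tuples, the per-tuple counts, and the function name `climb` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Seattle": "USA",
    "Chicago": "USA",
}


def climb(relation, attr_index, hierarchy):
    """Generalize one attribute by one level and merge identical tuples.

    `relation` maps each tuple to the number of original tuples it covers
    (keeping such a count is an assumption of this sketch).
    """
    merged = Counter()
    for row, count in relation.items():
        # Values with no parent in the hierarchy are left unchanged.
        parent = hierarchy.get(row[attr_index], row[attr_index])
        new_row = row[:attr_index] + (parent,) + row[attr_index + 1:]
        merged[new_row] += count
    return merged


# Hypothetical working relation: (city, field of study) -> count.
working_relation = Counter({
    ("Vancouver", "science"): 2,
    ("Toronto", "science"): 1,
    ("Seattle", "science"): 3,
    ("Chicago", "arts"): 1,
})

generalized = climb(working_relation, 0, CITY_TO_COUNTRY)
```

In this sample, the tuples for Vancouver and Toronto collapse into a single `("Canada", "science")` tuple that covers three original tuples, which is the sense in which the generalized rule covers more of the original data.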
                           Both rules (attribute removal and attribute generalization) claim that if there is a large
                         set of distinct values for an attribute, further generalization should be applied. This
                         raises the question: How large is “a large set of distinct values for an attribute” considered
                         to be?
                           Depending on the attributes or application involved, a user may prefer some
                         attributes to remain at a rather low abstraction level while others are generalized to
                         higher levels. How high an attribute should be generalized is typically quite
                         subjective. The control of this process is called attribute generalization control.
                         If the attribute is generalized “too high,” it may lead to overgeneralization, and the
                         resulting rules may not be very informative.
                           On the other hand, if the attribute is not generalized to a “sufficiently high level,”
                         then undergeneralization may result, where the rules obtained may not be informative
                         either. Thus, a balance should be attained in attribute-oriented generalization. There are
                         many possible ways to control a generalization process. We will describe two common
                         approaches and illustrate how they work.
                           The first technique, called attribute generalization threshold control, either sets one
                         generalization threshold for all of the attributes, or sets one threshold for each attribute.
                         If the number of distinct values in an attribute is greater than the attribute threshold,
                         further attribute removal or attribute generalization should be performed. Data mining
                         systems typically have a default attribute threshold value in the range of 2 to 8
                         and should allow experts and users to modify the threshold values as well. If a user feels
                         that the generalization reaches too high a level for a particular attribute, the threshold
                         can be increased. This corresponds to drilling down along the attribute. Also, to further
                         generalize a relation, the user can reduce an attribute’s threshold, which corresponds to
                         rolling up along the attribute.
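
Attribute generalization threshold control can be sketched as a loop over the attributes, under some simplifying assumptions. In the sketch below, each concept hierarchy is a child-to-parent dictionary, an attribute is removed when it exceeds its threshold and has no higher level to climb to, and the names `apply_attribute_threshold`, `hierarchies`, `thresholds`, and `default_threshold` are hypothetical. The sample data is invented.

```python
def distinct_values(rows, attr):
    """Return the set of distinct values that `attr` takes in `rows`."""
    return {row[attr] for row in rows}


def apply_attribute_threshold(rows, hierarchies, thresholds, default_threshold=4):
    """Generalize or remove each attribute until it satisfies its threshold."""
    rows = [dict(r) for r in rows]  # work on a copy of the relation
    attrs = list(rows[0].keys()) if rows else []
    for attr in attrs:
        limit = thresholds.get(attr, default_threshold)
        parent = hierarchies.get(attr, {})
        while len(distinct_values(rows, attr)) > limit:
            current = [r[attr] for r in rows]
            climbed = [parent.get(v, v) for v in current]
            if climbed == current:
                # No higher level is available: attribute removal.
                for r in rows:
                    del r[attr]
                break
            # Otherwise climb one level: attribute generalization.
            for r, v in zip(rows, climbed):
                r[attr] = v
    return rows


# Hypothetical data: `name` has no hierarchy, `city` generalizes to country.
hierarchies = {"city": {"Vancouver": "Canada", "Toronto": "Canada",
                        "Seattle": "USA", "Chicago": "USA", "Boston": "USA"}}
rows = [
    {"name": "n1", "city": "Vancouver"},
    {"name": "n2", "city": "Toronto"},
    {"name": "n3", "city": "Seattle"},
    {"name": "n4", "city": "Chicago"},
    {"name": "n5", "city": "Boston"},
]

rolled_up = apply_attribute_threshold(rows, hierarchies, {"city": 2}, default_threshold=3)
drilled_down = apply_attribute_threshold(rows, hierarchies, {"city": 5}, default_threshold=3)
```

With a threshold of 2, `city` is rolled up to the country level. Raising its threshold to 5 leaves the city values untouched, which corresponds to drilling down. In both calls `name` is removed, because it has five distinct values, a threshold of 3, and no hierarchy.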
                           The second technique, called generalized relation threshold control, sets a threshold
                         for the generalized relation. If the number of (distinct) tuples in the generalized relation
                         is greater than the threshold, further generalization should be performed. Otherwise,
                         no further generalization should be performed. Such a threshold may also be preset in
                         the data mining system (usually within a range of 10 to 30), or set by an expert or user,
                         and should be adjustable. For example, if a user feels that the generalized relation is too
                         small, he or she can increase the threshold, which implies drilling down. Otherwise, to
                         further generalize a relation, the threshold can be reduced, which implies rolling up.
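
A minimal sketch of generalized relation threshold control follows. The text does not say which attribute to generalize next when the relation is still too large. Purely as an assumption, the sketch picks the attribute with the most distinct values among those that can still climb. The function name `relation_threshold_control`, the index-keyed `hierarchies`, and the sample data are hypothetical.

```python
def relation_threshold_control(rows, hierarchies, relation_threshold=10):
    """Generalize until the number of distinct tuples is within the threshold.

    `hierarchies` maps an attribute index to a child-to-parent dictionary,
    which is assumed to be acyclic so that the loop terminates.
    """
    rows = [tuple(r) for r in rows]
    while len(set(rows)) > relation_threshold:
        best = None  # (number of distinct values, attribute index)
        for i, parent in hierarchies.items():
            if any(r[i] in parent for r in rows):  # attribute can still climb
                n = len({r[i] for r in rows})
                if best is None or n > best[0]:
                    best = (n, i)
        if best is None:
            break  # nothing left to generalize
        i = best[1]
        parent = hierarchies[i]
        rows = [r[:i] + (parent.get(r[i], r[i]),) + r[i + 1:] for r in rows]
    return sorted(set(rows))


# Hypothetical hierarchies for tuples of the form (city, major).
hierarchies = {
    0: {"Vancouver": "Canada", "Toronto": "Canada",
        "Seattle": "USA", "Chicago": "USA", "Boston": "USA"},
    1: {"physics": "science", "chemistry": "science",
        "music": "arts", "painting": "arts"},
}
rows = [
    ("Vancouver", "physics"), ("Toronto", "chemistry"), ("Seattle", "physics"),
    ("Chicago", "music"), ("Boston", "painting"), ("Seattle", "chemistry"),
]

small = relation_threshold_control(rows, hierarchies, relation_threshold=3)
large = relation_threshold_control(rows, hierarchies, relation_threshold=6)
```

With a threshold of 3, both attributes are rolled up and three distinct tuples remain. With a threshold of 6, the six original tuples already satisfy the threshold, so nothing is generalized.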
                           These two techniques can be applied in sequence: First apply the attribute threshold
                         control technique to generalize each attribute, and then apply relation threshold control
                         to further reduce the size of the generalized relation. No matter which generalization
                         control technique is applied, the user should be allowed to adjust the generalization
                         thresholds in order to obtain interesting concept descriptions.
                           In many database-oriented induction processes, users are interested in obtaining
                         quantitative or statistical information about the data at different abstraction levels.