4.5 Data Generalization by Attribute-Oriented Induction


                               Thus, it is important to accumulate count and other aggregate values in the induction
                               process. Conceptually, this is performed as follows. The aggregate function, count(), is
                               associated with each database tuple. Its value for each tuple in the initial working relation
                               is initialized to 1. Through attribute removal and attribute generalization, tuples within
                               the initial working relation may be generalized, resulting in groups of identical tuples. In
                               this case, all of the identical tuples forming a group should be merged into one tuple.
                                 The count of this new, generalized tuple is set to the total number of tuples from the
                               initial working relation that are represented by (i.e., merged into) the new generalized
                               tuple. For example, suppose that by attribute-oriented induction, 52 data tuples from
                               the initial working relation are all generalized to the same tuple, T. That is, the
                               generalization of these 52 tuples resulted in 52 identical instances of tuple T. These 52 identical
                               tuples are merged to form one instance of T, with a count that is set to 52. Other popular
                               aggregate functions that could also be associated with each tuple include sum() and avg().
                               For a given generalized tuple, sum() contains the sum of the values of a given numeric
                               attribute for the initial working relation tuples making up the generalized tuple. Suppose
                               that tuple T contained sum(units_sold) as an aggregate function. The sum value for tuple
                               T would then be set to the total number of units sold across all 52 tuples. The
                               aggregate avg() (average) is computed according to the formula avg() = sum()/count().
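
The accumulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name merge_identical, the sample rows, and the use of units_sold alongside gender and major are assumptions made for the example. Each input row implicitly carries a count of 1, identical generalized rows are merged, and avg is derived as sum/count.

```python
from collections import defaultdict


def merge_identical(rows, key_attrs, measure):
    """Merge rows that agree on key_attrs, accumulating count, sum, and avg.

    Hypothetical helper: `rows` are already-generalized tuples (dicts),
    `key_attrs` are the retained attributes, `measure` is a numeric attribute.
    """
    groups = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        key = tuple(row[a] for a in key_attrs)
        groups[key]["count"] += 1            # every original tuple contributes a count of 1
        groups[key]["sum"] += row[measure]   # running sum of the numeric attribute

    merged = []
    for key, agg in groups.items():
        out = dict(zip(key_attrs, key))
        out["count"] = agg["count"]
        out["sum"] = agg["sum"]
        out["avg"] = agg["sum"] / agg["count"]  # avg() = sum() / count()
        merged.append(out)
    return merged


# Made-up data: 52 tuples that generalize to the same tuple T, plus 3 others.
rows = [{"gender": "F", "major": "engineering", "units_sold": 2} for _ in range(52)]
rows += [{"gender": "M", "major": "business", "units_sold": 5} for _ in range(3)]

result = merge_identical(rows, ["gender", "major"], "units_sold")
for r in result:
    print(r)
```

Storing sum and count, rather than avg alone, lets groups be merged again after further generalization, since avg can always be recomputed from the two.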

                 Example 4.12 Attribute-oriented induction. Here we show how attribute-oriented induction is per-
                               formed on the initial working relation of Table 4.5. For each attribute of the relation,
                               the generalization proceeds as follows:

                               1. name: Since there are a large number of distinct values for name and there is no
                                 generalization operation defined on it, this attribute is removed.
                               2. gender: Since there are only two distinct values for gender, this attribute is retained
                                 and no generalization is performed on it.
                               3. major: Suppose that a concept hierarchy has been defined that allows the attribute
                                 major to be generalized to the values {arts&sciences, engineering, business}. Suppose
                                 also that the attribute generalization threshold is set to 5, and that there are more than
                                 20 distinct values for major in the initial working relation. By attribute generalization
                                 and attribute generalization control, major is therefore generalized by climbing the
                                 given concept hierarchy.
                               4. birth_place: This attribute has a large number of distinct values; therefore, we would
                                 like to generalize it. Suppose that a concept hierarchy exists for birth_place, defined as
                                 “city < province or state < country.” If the number of distinct values for country in
                                 the initial working relation is greater than the attribute generalization threshold, then
                                 birth_place should be removed, because even though a generalization operator exists
                                 for it, the generalization threshold would not be satisfied. If, instead, the number
                                 of distinct values for country is less than the attribute generalization threshold, then
                                 birth_place should be generalized to birth_country.
                               5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age and
                                 age to age_range, and that the number of age ranges (or intervals) is small with