4.5 Data Generalization by Attribute-Oriented Induction 171

Thus, it is important to accumulate count and other aggregate values in the induction
process. Conceptually, this is performed as follows. The aggregate function, count(), is
associated with each database tuple. Its value for each tuple in the initial working relation
is initialized to 1. Through attribute removal and attribute generalization, tuples within
the initial working relation may be generalized, resulting in groups of identical tuples. In
this case, all of the identical tuples forming a group should be merged into one tuple.
The count of this new, generalized tuple is set to the total number of tuples from the
initial working relation that are represented by (i.e., merged into) the new generalized
tuple. For example, suppose that by attribute-oriented induction, 52 data tuples from
the initial working relation are all generalized to the same tuple, T. That is, the
generalization of these 52 tuples resulted in 52 identical instances of tuple T. These 52 identical
tuples are merged to form one instance of T, with a count that is set to 52. Other popular
aggregate functions that could also be associated with each tuple include sum() and avg().
For a given generalized tuple, sum() contains the sum of the values of a given numeric
attribute for the initial working relation tuples making up the generalized tuple. Suppose
that tuple T contained sum(units_sold) as an aggregate function. The sum value for tuple
T would then be set to the total number of units sold across all 52 tuples. The
aggregate avg() (average) is computed according to the formula avg() = sum()/count().
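
The accumulation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the function name merge_generalized, the (generalized tuple, units_sold) input layout, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

def merge_generalized(rows):
    """Merge identical generalized tuples, accumulating count, sum, and avg.

    rows: iterable of (generalized_tuple, units_sold) pairs, one per tuple
    of the initial working relation (a hypothetical input layout).
    """
    acc = defaultdict(lambda: {"count": 0, "sum": 0})
    for key, units_sold in rows:
        acc[key]["count"] += 1          # each initial tuple carries count = 1
        acc[key]["sum"] += units_sold   # running sum(units_sold) for the group
    # avg() is derived from the two accumulated values: sum() / count()
    return {
        key: {**agg, "avg": agg["sum"] / agg["count"]}
        for key, agg in acc.items()
    }

# Hypothetical data: 52 tuples generalize to T, and 3 generalize to U.
T = ("M", "engineering")
U = ("F", "business")
rows = [(T, 10)] * 52 + [(U, 4), (U, 6), (U, 8)]

result = merge_generalized(rows)
print(result[T])  # {'count': 52, 'sum': 520, 'avg': 10.0}
print(result[U])  # {'count': 3, 'sum': 18, 'avg': 6.0}
```

Storing only count and sum per generalized tuple is enough to recover avg on demand, which is why avg need not be accumulated separately.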

Example 4.12 Attribute-oriented induction. Here we show how attribute-oriented induction is
performed on the initial working relation of Table 4.5. For each attribute of the relation,
the generalization proceeds as follows:

1. name: Since there are a large number of distinct values for name and there is no
generalization operation defined on it, this attribute is removed.
2. gender: Since there are only two distinct values for gender, this attribute is retained
and no generalization is performed on it.
3. major: Suppose that a concept hierarchy has been defined that allows the attribute
major to be generalized to the values {arts&sciences, engineering, business}. Suppose
also that the attribute generalization threshold is set to 5, and that there are more than
20 distinct values for major in the initial working relation. By attribute generalization
and attribute generalization control, major is therefore generalized by climbing the
given concept hierarchy.
4. birth_place: This attribute has a large number of distinct values; therefore, we would
like to generalize it. Suppose that a concept hierarchy exists for birth_place, defined as
“city < province_or_state < country.” If the number of distinct values for country in
the initial working relation is greater than the attribute generalization threshold, then
birth_place should be removed, because even though a generalization operator exists
for it, the generalization threshold would not be satisfied. If, instead, the number
of distinct values for country is less than the attribute generalization threshold, then
birth_place should be generalized to birth_country.
5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age and
age to age_range, and that the number of age ranges (or intervals) is small with
