Page 210 -
P. 210
3:17 Page 173
2011/6/1
#49
11-ch04-125-186-9780123814791
HAN
4.5 Data Generalization by Attribute-Oriented Induction 173
Algorithm: Attribute-oriented induction. Mining generalized characteristics in a relational
database given a user’s data mining request.
Input:
DB, a relational database;
DMQuery, a data mining query;
a list, a list of attributes (containing attributes, a i );
Gen(a i ), a set of concept hierarchies or generalization operators on attributes, a i ;
a gen thresh(a i ), attribute generalization thresholds for each a i .
Output: P, a Prime generalized relation.
Method:
1. W ← get task relevant data (DMQuery, DB); // Let W, the working relation, hold the
task-relevant data.
2. prepare for generalization (W); // This is implemented as follows.
(a) Scan W and collect the distinct values for each attribute, a i . (Note: If W is very large,
this may be done by examining a sample of W.)
(b) For each attribute a i , determine whether a i should be removed. If not, compute its
minimum desired level L i based on its given or default attribute threshold, and
0
0
determine the mapping pairs (v, v ), where v is a distinct value of a i in W, and v is its
corresponding generalized value at level L i .
3. P ← generalization (W),
The Prime generalized relation, P, is derived by replacing each value v in W by its
0
corresponding v in the mapping while accumulating count and computing any other
aggregate values.
This step can be implemented efficiently using either of the two following variations:
(a) For each generalized tuple, insert the tuple into a sorted prime relation P by a binary
search: if the tuple is already in P, simply increase its count and other aggregate
values accordingly; otherwise, insert it into P.
(b) Since in most cases the number of distinct values at the prime relation level is small,
the prime relation can be coded as an m-dimensional array, where m is the number of
attributes in P, and each dimension contains the corresponding generalized attribute
values. Each array element holds the corresponding count and other aggregation
values, if any. The insertion of a generalized tuple is performed by measure
aggregation in the corresponding array element.
Figure 4.18 Basic algorithm for attribute-oriented induction.