Page 210 -
P. 210

3:17 Page 173
                                                            2011/6/1
                                                                                    #49
                               11-ch04-125-186-9780123814791
                         HAN
                                                4.5 Data Generalization by Attribute-Oriented Induction  173


                               Algorithm: Attribute-oriented induction. Mining generalized characteristics in a relational
                               database given a user’s data mining request.
                               Input:

                                    DB, a relational database;
                                    DMQuery, a data mining query;
                                    a list, a list of attributes (containing attributes, a i );
                                    Gen(a i ), a set of concept hierarchies or generalization operators on attributes, a i ;
                                    a gen thresh(a i ), attribute generalization thresholds for each a i .

                               Output: P, a Prime generalized relation.
                               Method:
                                 1. W ← get task relevant data (DMQuery, DB); // Let W, the working relation, hold the
                                    task-relevant data.
                                 2. prepare for generalization (W); // This is implemented as follows.
                                    (a) Scan W and collect the distinct values for each attribute, a i . (Note: If W is very large,
                                       this may be done by examining a sample of W.)
                                    (b) For each attribute a i , determine whether a i should be removed. If not, compute its
                                       minimum desired level L i based on its given or default attribute threshold, and
                                                                                                0
                                                               0
                                       determine the mapping pairs (v, v ), where v is a distinct value of a i in W, and v is its
                                       corresponding generalized value at level L i .
                                 3. P ← generalization (W),
                                    The Prime generalized relation, P, is derived by replacing each value v in W by its
                                               0
                                    corresponding v in the mapping while accumulating count and computing any other
                                    aggregate values.
                                    This step can be implemented efficiently using either of the two following variations:
                                    (a) For each generalized tuple, insert the tuple into a sorted prime relation P by a binary
                                       search: if the tuple is already in P, simply increase its count and other aggregate
                                       values accordingly; otherwise, insert it into P.
                                    (b) Since in most cases the number of distinct values at the prime relation level is small,
                                       the prime relation can be coded as an m-dimensional array, where m is the number of
                                       attributes in P, and each dimension contains the corresponding generalized attribute
                                       values. Each array element holds the corresponding count and other aggregation
                                       values, if any. The insertion of a generalized tuple is performed by measure
                                       aggregation in the corresponding array element.


                    Figure 4.18 Basic algorithm for attribute-oriented induction.
   205   206   207   208   209   210   211   212   213   214   215