
          Chapter 4 Data Warehousing and Online Analytical Processing



               Table 4.6 Generalized Relation Obtained by Attribute-Oriented Induction on Table 4.5’s Data
                         gender  major    birth_country  age_range  residence_city  gpa        count
                         M       Science  Canada         20 – 25    Richmond        very good  16
                         F       Science  Foreign        25 – 30    Burnaby         excellent  22
                         ···     ···      ···            ···        ···             ···        ···


                           respect to the attribute generalization threshold. Generalization of birth_date should
                           therefore take place.
                         6. residence: Suppose that residence is defined by the attributes number, street,
                           residence_city, residence_province_or_state, and residence_country. The number of
                           distinct values for number and street will likely be very high, since these concepts are
                           quite low level. The attributes number and street should therefore be removed so that
                           residence is then generalized to residence_city, which contains fewer distinct values.
                         7. phone#: As with the name attribute, phone# contains too many distinct values and
                           should therefore be removed in generalization.
                         8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade
                           point average into numeric intervals like {3.75–4.0, 3.5–3.75, ...}, which in turn are
                           grouped into descriptive values such as {“excellent”, “very good”, ...}. The attribute
                           can therefore be generalized.
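
The two-level gpa hierarchy in item 8 can be sketched as a pair of lookups. This is only an illustration: the text names the intervals 3.75–4.0 and 3.5–3.75 and the labels “excellent” and “very good”, but the pairing of interval to label, the treatment of the shared boundary 3.75, and the names GPA_INTERVALS, INTERVAL_LABELS, and generalize_gpa are all assumptions made here.

```python
from typing import Optional

# Level 1 (assumed layout): numeric gpa -> interval name.
# Only the two intervals named in the text are modeled; the rest of the
# hierarchy is elided in the source and is deliberately left out.
GPA_INTERVALS = [
    (3.75, 4.0, "3.75-4.0"),
    (3.5, 3.75, "3.5-3.75"),
]

# Level 2 (assumed pairing): interval name -> descriptive value.
INTERVAL_LABELS = {
    "3.75-4.0": "excellent",
    "3.5-3.75": "very good",
}


def gpa_to_interval(gpa: float) -> Optional[str]:
    """Return the first interval containing gpa, or None if none is modeled."""
    for low, high, name in GPA_INTERVALS:
        # Intervals are checked top-down, so the boundary value 3.75
        # falls into the higher interval (an assumption).
        if low <= gpa <= high:
            return name
    return None


def generalize_gpa(gpa: float) -> Optional[str]:
    """Climb both levels of the hierarchy: value -> interval -> label."""
    interval = gpa_to_interval(gpa)
    if interval is None:
        return None
    return INTERVAL_LABELS[interval]


label_high = generalize_gpa(3.9)   # "excellent" under the assumed pairing
label_mid = generalize_gpa(3.6)    # "very good" under the assumed pairing
```

Values outside the two modeled intervals return None rather than a guessed label, since the source does not spell out the remainder of the hierarchy.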

                           The generalization process will result in groups of identical tuples. For example, the
                         first two tuples of Table 4.5 both generalize to the same tuple (namely, the first
                         tuple shown in Table 4.6). Such identical tuples are then merged into one, with their
                         counts accumulated. This process leads to the generalized relation shown in Table 4.6.
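
The merge step described above can be sketched as follows, assuming each generalized tuple is a hashable Python tuple and each base tuple contributes a count of 1. The function name merge_identical and the sample rows are hypothetical; the rows are not the contents of Table 4.5.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def merge_identical(rows: Iterable[Tuple[str, ...]]) -> List[Tuple]:
    """Merge identical generalized tuples and append the accumulated count."""
    counts: Counter = Counter()
    for row in rows:
        counts[row] += 1  # each original tuple contributes one to its group
    # Counter keeps first-seen order, so output order follows the input.
    return [row + (n,) for row, n in counts.items()]


# Illustrative generalized tuples (gender, major, birth_country).
generalized = [
    ("M", "Science", "Canada"),
    ("M", "Science", "Canada"),
    ("F", "Science", "Foreign"),
]
relation = merge_identical(generalized)
# relation == [("M", "Science", "Canada", 2), ("F", "Science", "Foreign", 1)]
```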
                           Based on the vocabulary used in OLAP, we may view count( ) as a measure, and the
                         remaining attributes as dimensions. Note that aggregate functions, such as sum( ), may be
                         applied to numeric attributes (e.g., salary and sales). These attributes are referred to as
                         measure attributes.
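
To make the dimension/measure distinction concrete, here is a small sketch that groups rows by a set of dimension attributes and computes count( ) together with sum( ) over one numeric attribute. The helper aggregate, the attribute names region, item, and sales, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(rows: Iterable[dict], dimensions: List[str],
              measure: str) -> Dict[Tuple, dict]:
    """Group rows by the dimension attributes; keep count and sum per group."""
    cells = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cells[key]["count"] += 1             # count() measure
        cells[key]["sum"] += row[measure]    # sum() over the numeric attribute
    return dict(cells)


# Hypothetical rows: region and item act as dimensions, sales as a measure attribute.
rows = [
    {"region": "West", "item": "TV", "sales": 100},
    {"region": "West", "item": "TV", "sales": 250},
    {"region": "East", "item": "TV", "sales": 75},
]
cube_cells = aggregate(rows, ["region", "item"], "sales")
# cube_cells[("West", "TV")] == {"count": 2, "sum": 350}
```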


                   4.5.2 Efficient Implementation of Attribute-Oriented Induction
                         “How is attribute-oriented induction actually implemented?” Section 4.5.1 provided an
                         introduction to attribute-oriented induction. The general procedure is summarized in
                         Figure 4.18. The efficiency of this algorithm is analyzed as follows:

                           Step 1 of the algorithm is essentially a relational query to collect the task-relevant data
                           into the working relation, W. Its processing efficiency depends on the query processing
                           methods used. Given the successful implementation and commercialization
                           of database systems, this step is expected to have good performance.
                           Step 2 collects statistics on the working relation. This requires scanning the relation
                           at most once. The cost for computing the minimum desired level and determining
                           the mapping pairs, (v, v′), for each attribute is dependent on the number of distinct