                                                4.5 Data Generalization by Attribute-Oriented Induction


                                 User control versus automation: Online analytical processing in data warehouses
                                 is a user-controlled process. The selection of dimensions and the application of
                                 OLAP operations (e.g., drill-down, roll-up, slicing, and dicing) are primarily directed
                                 and controlled by users. Although the control in most OLAP systems is quite
                                 user-friendly, users do require a good understanding of the role of each dimension.
                                 Furthermore, in order to find a satisfactory description of the data, users may need to
                                 specify a long sequence of OLAP operations. It is often desirable to have a more
                                 automated process that helps users determine which dimensions (or attributes) should
                                 be included in the analysis, and the degree to which the given data set should be
                                 generalized in order to produce an interesting summarization of the data.
                                 This section presents an alternative method for concept description, called attribute-
                               oriented induction, which works for complex data types and relies on a data-driven
                               generalization process.


                         4.5.1 Attribute-Oriented Induction for Data Characterization
                               The attribute-oriented induction (AOI) approach to concept description was first
                               proposed in 1989, a few years before the introduction of the data cube approach. The data
                               cube approach is essentially based on materialized views of the data, which typically
                               have been precomputed in a data warehouse. In general, it performs offline
                               aggregation before an OLAP or data mining query is submitted for processing. On the
                               other hand, the attribute-oriented induction approach is basically a query-oriented,
                               generalization-based, online data analysis technique. Note that there is no inherent
                               barrier distinguishing the two approaches based on online aggregation versus offline
                               precomputation. Some aggregations in the data cube can be computed online, while
                               offline precomputation of multidimensional space can speed up attribute-oriented
                               induction as well.
                                 The general idea of attribute-oriented induction is to first collect the task-relevant
                               data using a database query and then perform generalization based on the examination
                               of the number of each attribute’s distinct values in the relevant data set. The
                               generalization is performed by either attribute removal or attribute generalization. Aggregation
                               is performed by merging identical generalized tuples and accumulating their
                               respective counts. This reduces the size of the generalized data set. The resulting generalized
                               relation can be mapped into different forms (e.g., charts or rules) for presentation to
                               the user.
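
The general idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the book's algorithm: the function name `aoi_generalize`, the single `threshold` on distinct values, the child-to-parent dictionaries standing in for concept hierarchies, and the sample tuples are all hypothetical. An attribute with too many distinct values is generalized one hierarchy level at a time; if no value can be generalized further, the attribute is removed. Identical generalized tuples are then merged and their counts accumulated.

```python
from collections import Counter


def aoi_generalize(rows, hierarchies, threshold):
    """Illustrative attribute-oriented induction over a list of dict tuples.

    rows        -- task-relevant tuples, e.g. the result of a database query
    hierarchies -- attr -> {value: parent concept}; assumed acyclic (hypothetical)
    threshold   -- assumed maximum number of distinct values kept per attribute
    Returns (kept_attributes, Counter of generalized tuples).
    """
    if not rows:
        return [], Counter()
    work = [dict(r) for r in rows]          # do not mutate the caller's data
    kept = []
    for attr in list(work[0]):
        parent = hierarchies.get(attr, {})
        while True:
            distinct = {r[attr] for r in work}
            if len(distinct) <= threshold:
                kept.append(attr)           # attribute is general enough
                break
            if not any(v in parent for v in distinct):
                for r in work:              # attribute removal: no way to climb
                    del r[attr]
                break
            for r in work:                  # attribute generalization: one level up
                r[attr] = parent.get(r[attr], r[attr])
    # Merge identical generalized tuples and accumulate their counts.
    counts = Counter(tuple(r[a] for a in kept) for r in work)
    return kept, counts


# Hypothetical sample data, invented for illustration only.
rows = [
    {"name": "Ann", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Bob", "major": "Physics", "birth_place": "Seattle"},
    {"name": "Cal", "major": "Math", "birth_place": "Toronto"},
    {"name": "Dee", "major": "CS", "birth_place": "Montreal"},
]
hierarchies = {
    "major": {"CS": "Science", "Physics": "Science", "Math": "Science"},
    "birth_place": {"Vancouver": "Canada", "Toronto": "Canada",
                    "Montreal": "Canada", "Seattle": "USA"},
}
kept, counts = aoi_generalize(rows, hierarchies, threshold=2)
for generalized_tuple, count in counts.items():
    print(dict(zip(kept, generalized_tuple)), "count =", count)
```

In this sketch `name` is dropped because it has many distinct values and no hierarchy, while `major` and `birth_place` are each generalized one level. The merged tuples with their counts form a small generalized relation that could then be rendered as a chart or as rules.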
                                 The following illustrates the process of attribute-oriented induction. We first discuss
                               its use for characterization. The method is extended for the mining of class comparisons
                               in Section 4.5.3.
                 Example 4.11 A data mining query for characterization. Suppose that a user wants to describe
                               the general characteristics of graduate students in the Big University database, given
                               the attributes name, gender, major, birth place, birth date, residence, phone# (telephone