          Chapter 4 Data Warehousing and Online Analytical Processing



                         number), and gpa (grade point average). A data mining query for this characterization
                         can be expressed in the data mining query language, DMQL, as follows:
                             use Big_University_DB
                             mine characteristics as “Science_Students”
                             in relevance to name, gender, major, birth_place, birth_date, residence,
                                 phone#, gpa
                             from student
                             where status in “graduate”
                         We will see how this example of a typical data mining query can apply attribute-oriented
                         induction to the mining of characteristic descriptions.
                           First, data focusing should be performed before attribute-oriented induction. This
                         step corresponds to the specification of the task-relevant data (i.e., data for analysis). The
                         data are collected based on the information provided in the data mining query. Because
                         a data mining query is usually relevant to only a portion of the database, selecting the
                         relevant data set not only makes mining more efficient, but also derives more meaningful
                         results than mining the entire database.
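                           The data focusing step can be pictured as a selection followed by a projection.
                         The following is a minimal Python sketch, not part of DMQL: the in-memory student
                         list, the status values, the third row, and the helper name focus are assumptions
                         made for illustration only.

```python
# Attributes named in the "in relevance to" clause of the query.
RELEVANT = ["name", "gender", "major", "birth_place", "birth_date",
            "residence", "phone#", "gpa"]

# Toy stand-in for the `student` relation. The first two rows reuse values
# from Table 4.5; the `status` column and the third row are made up.
student = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "birth_date": "12-8-76",
     "residence": "3511 Main St., Richmond", "phone#": "687-4598",
     "gpa": 3.67, "status": "graduate"},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "birth_date": "8-25-70",
     "residence": "125 Austin Ave., Burnaby", "phone#": "420-5232",
     "gpa": 3.83, "status": "graduate"},
    {"name": "Pat Example", "gender": "F", "major": "History",
     "birth_place": "Calgary, AB, Canada", "birth_date": "1-1-80",
     "residence": "1 Example Rd., Burnaby", "phone#": "000-0000",
     "gpa": 3.10, "status": "undergraduate"},
]


def focus(relation, relevant, predicate):
    """Keep the tuples that satisfy `predicate` (the where clause), then
    keep only the `relevant` attributes (the in relevance to clause)."""
    return [{attr: row[attr] for attr in relevant}
            for row in relation if predicate(row)]


# where status in "graduate"
working_relation = focus(student, RELEVANT,
                         lambda row: row["status"] in {"graduate"})
```

                         Here working_relation plays the role of the initial working relation: it holds only
                         the graduate tuples, and the status attribute used for selection is no longer present.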
                           Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in
                         DMQL with the “in relevance to” clause) may be difficult for the user. A user may select
                         only a few attributes that he or she feels are important, while missing others that could
                         also play a role in the description. For example, suppose that the dimension birth_place
                         is defined by the attributes city, province_or_state, and country. Of these attributes, let’s
                         say that the user has only thought to specify city. In order to allow generalization on
                         the birth_place dimension, the other attributes defining this dimension should also be
                         included. In other words, having the system automatically include province_or_state and
                         country as relevant attributes allows city to be generalized to these higher conceptual
                         levels during the induction process.
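                           One way to picture this automatic inclusion is as a lookup against the dimension
                         definitions. The sketch below is illustrative only: the DIMENSIONS dictionary and
                         the function expand_relevant are hypothetical names, and a real system would read
                         the dimension definitions from its own schema or concept hierarchies.

```python
# Hypothetical dimension definitions: each dimension lists its attributes
# from the lowest conceptual level to the highest.
DIMENSIONS = {
    "birth_place": ["city", "province_or_state", "country"],
}


def expand_relevant(selected, dimensions):
    """Return `selected` with every attribute replaced by the full list of
    attributes of the dimension it belongs to (order kept, no duplicates)."""
    expanded = []
    for attr in selected:
        # Attributes that belong to no known dimension stand alone.
        group = next((levels for levels in dimensions.values()
                      if attr in levels), [attr])
        for member in group:
            if member not in expanded:
                expanded.append(member)
    return expanded


# The user thought to specify only `city`.
relevant = expand_relevant(["name", "city", "gpa"], DIMENSIONS)
# relevant == ["name", "city", "province_or_state", "country", "gpa"]
```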
                           At the other extreme, suppose that the user has introduced too many attributes
                         by specifying all of the possible attributes with the clause “in relevance to ∗”. In this case,
                         all of the attributes in the relation specified by the “from” clause would be included in the
                         analysis. Many of these attributes are unlikely to contribute to an interesting description.
                         A correlation-based analysis method (Section 3.3.2) can be used to perform attribute
                         relevance analysis and filter out statistically irrelevant or weakly relevant attributes from
                         the descriptive mining process. Other approaches, such as attribute subset selection, are
                         also described in Chapter 3.
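                           As a rough illustration of correlation-based filtering, the sketch below scores each
                         nominal attribute against a class label with the chi-square statistic and keeps only
                         those above a cutoff. Several things here are assumptions rather than anything the
                         text prescribes: the toy rows, the use of status as the class label, and the cutoff
                         3.841 (the 0.05 critical value for one degree of freedom, so it applies only to
                         two-by-two tables such as these).

```python
from collections import Counter


def chi_square(rows, attr, target):
    """Pearson chi-square statistic between two nominal attributes."""
    n = len(rows)
    attr_counts = Counter(row[attr] for row in rows)
    target_counts = Counter(row[target] for row in rows)
    joint_counts = Counter((row[attr], row[target]) for row in rows)
    stat = 0.0
    for a, count_a in attr_counts.items():
        for t, count_t in target_counts.items():
            expected = count_a * count_t / n
            observed = joint_counts[(a, t)]  # 0 if the pair never occurs
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up data: `major` tracks `status` exactly, `gender` is independent of it.
rows = [
    {"major": "CS", "gender": "M", "status": "graduate"},
    {"major": "CS", "gender": "F", "status": "graduate"},
    {"major": "CS", "gender": "M", "status": "graduate"},
    {"major": "CS", "gender": "F", "status": "graduate"},
    {"major": "Arts", "gender": "M", "status": "undergraduate"},
    {"major": "Arts", "gender": "F", "status": "undergraduate"},
    {"major": "Arts", "gender": "M", "status": "undergraduate"},
    {"major": "Arts", "gender": "F", "status": "undergraduate"},
]

CUTOFF = 3.841  # assumed threshold; valid only for 2 x 2 contingency tables

scores = {attr: chi_square(rows, attr, "status") for attr in ("major", "gender")}
relevant = [attr for attr, score in scores.items() if score > CUTOFF]
```

                         With these rows, major scores 8.0 and is kept, while gender scores 0.0 and is
                         filtered out. On a sample this small the statistic is only indicative.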

            Table 4.5 Initial Working Relation: A Collection of Task-Relevant Data

            name            gender  major    birth_place            birth_date  residence                 phone#    gpa
            Jim Woodman     M       CS       Vancouver, BC, Canada  12-8-76     3511 Main St., Richmond   687-4598  3.67
            Scott Lachance  M       CS       Montreal, Que, Canada  7-28-75     345 1st Ave., Richmond    253-9106  3.70
            Laura Lee       F       Physics  Seattle, WA, USA       8-25-70     125 Austin Ave., Burnaby  420-5232  3.83
            ···             ···     ···      ···                    ···         ···                       ···       ···