                                                4.5 Data Generalization by Attribute-Oriented Induction  169


                                 “What does the ‘where status in “graduate”’ clause mean?” The where clause implies
                               that a concept hierarchy exists for the attribute status. Such a concept hierarchy organizes
                               primitive-level data values for status (e.g., “M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.,” “B.Sc.,”
                               and “B.A.”) into higher conceptual levels (e.g., “graduate” and “undergraduate”). This
                               use of concept hierarchies does not appear in traditional relational query languages, yet
                               is likely to become a common feature in data mining query languages.
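
As a rough illustration, a one-level concept hierarchy for status can be sketched as a mapping from each primitive value to its higher-level concept. A condition on "graduate" can then be expanded into the primitive values beneath it. The dictionary layout and the helper name expand_concept are assumptions made for this sketch; they are not part of any query language described here.

```python
# Hypothetical one-level concept hierarchy for the attribute `status`:
# primitive value -> higher-level concept (values taken from the text).
STATUS_HIERARCHY = {
    "M.Sc.": "graduate",
    "M.A.": "graduate",
    "M.B.A.": "graduate",
    "Ph.D.": "graduate",
    "B.Sc.": "undergraduate",
    "B.A.": "undergraduate",
}


def expand_concept(hierarchy, concept):
    """Return the primitive values that generalize to `concept`, in hierarchy order."""
    return [value for value, parent in hierarchy.items() if parent == concept]


graduate_values = expand_concept(STATUS_HIERARCHY, "graduate")
print(graduate_values)
```

A deeper hierarchy would need a tree or repeated lookups; the flat mapping is enough to show how a high-level condition resolves to primitive-level values.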
                                 The data mining query presented in Example 4.11 is transformed into the following
                               relational query for the collection of the task-relevant data set:

                                   use Big_University_DB
                                   select name, gender, major, birth_place, birth_date, residence, phone#, gpa
                                   from student
                                   where status in {“M.Sc.”, “M.A.”, “M.B.A.”, “Ph.D.”}

                               The transformed query is executed against the relational database, Big_University_DB,
                               and returns the data shown earlier in Table 4.5. This table is called the (task-relevant)
                               initial working relation. It is the data on which induction will be performed. Note that
                               each tuple is, in fact, a conjunction of attribute–value pairs. Hence, we can think of a
                               tuple within a relation as a rule of conjuncts, and of induction on the relation as the
                               generalization of these rules.
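
The view of a tuple as a rule can be made concrete with a small sketch. Here a tuple is a dictionary of attribute-value pairs, read as a conjunction, and a rule covers a tuple when every one of its conjuncts holds. The function name covers and the three sample tuples are made up for illustration; they are not rows of Table 4.5.

```python
def covers(rule, row):
    """True if every attribute-value conjunct of `rule` holds in `row`."""
    return all(row.get(attr) == value for attr, value in rule.items())


# Made-up tuples for illustration only.
students = [
    {"name": "A", "gender": "M", "major": "CS"},
    {"name": "B", "gender": "M", "major": "CS"},
    {"name": "C", "gender": "F", "major": "Physics"},
]

# The first tuple read as a rule: name = "A" AND gender = "M" AND major = "CS".
rule = dict(students[0])
covered_before = [s["name"] for s in students if covers(rule, s)]

# Dropping the conjunct on `name` removes a constraint, so the rule
# can only cover the same tuples or more.
generalized = {attr: value for attr, value in rule.items() if attr != "name"}
covered_after = [s["name"] for s in students if covers(generalized, s)]

print(covered_before)
print(covered_after)
```

With these sample tuples the full rule covers only the tuple it came from, while the rule without the name conjunct also covers the second tuple, which is the sense in which removing a conjunct generalizes.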

                                 “Now that the data are ready for attribute-oriented induction, how is attribute-oriented
                               induction performed?” The essential operation of attribute-oriented induction is data
                               generalization, which can be performed in either of two ways on the initial working
                               relation: attribute removal and attribute generalization.
                                 Attribute removal is based on the following rule: If there is a large set of distinct values
                               for an attribute of the initial working relation, but either (case 1) there is no generalization
                               operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or
                               (case 2) its higher-level concepts are expressed in terms of other attributes, then the attribute
                               should be removed from the working relation.
                                 Let’s examine the reasoning behind this rule. An attribute–value pair represents a
                               conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a con-
                               straint and thus generalizes the rule. If, as in case 1, there is a large set of distinct values
                               for an attribute but there is no generalization operator for it, the attribute should be
                               removed because it cannot be generalized. Preserving it would imply keeping a large
                               number of disjuncts, which contradicts the goal of generating concise rules. On the
                               other hand, consider case 2, where the attribute’s higher-level concepts are expressed
                               in terms of other attributes. For example, suppose that the attribute in question is street,
                               with higher-level concepts that are represented by the attributes ⟨city, province_or_state,
                               country⟩. The removal of street is equivalent to the application of a generalization operator.
                               This rule corresponds to the generalization rule known as dropping condition in the
                               machine learning literature on learning from examples.
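
The attribute removal rule can be sketched as a small predicate over one column of the working relation. The text does not say how many distinct values count as "a large set," so the distinct_threshold parameter is an assumption of this sketch, as are the function name and the two boolean flags standing for case 1 and case 2.

```python
def should_remove(column, has_generalization_operator,
                  covered_by_other_attributes, distinct_threshold):
    """Apply the attribute-removal rule to one column of the working relation.

    `distinct_threshold` is an assumed cutoff for "a large set of distinct values".
    """
    if len(set(column)) <= distinct_threshold:
        return False  # not a large set of distinct values; the rule does not apply
    # Case 1: no generalization operator (e.g., no concept hierarchy).
    # Case 2: higher-level concepts are expressed by other attributes.
    return (not has_generalization_operator) or covered_by_other_attributes


# Made-up columns for illustration only.
names = ["Ann", "Bob", "Cy", "Di"]
streets = ["1 Elm St", "2 Oak Ave", "3 Pine Rd", "4 Ash Ln"]
genders = ["M", "F", "M", "F"]

print(should_remove(names, False, False, distinct_threshold=2))   # case 1
print(should_remove(streets, False, True, distinct_threshold=2))  # case 2
print(should_remove(genders, False, False, distinct_threshold=2)) # few distinct values
```

An attribute with many distinct values that does have a generalization operator, and whose higher-level concepts are not carried by other attributes, is not removed by this predicate.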
                                 Attribute generalization is based on the following rule: If there is a large set of distinct
                               values for an attribute in the initial working relation, and there exists a set of generalization
                               operators on the attribute, then a generalization operator should be selected and applied