4.5 Data Generalization by Attribute-Oriented Induction

“What does the ‘where status in “graduate”’ clause mean?” The where clause implies
that a concept hierarchy exists for the attribute status. Such a concept hierarchy organizes
primitive-level data values for status (e.g., “M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.,” “B.Sc.,”
and “B.A.”) into higher conceptual levels (e.g., “graduate” and “undergraduate”). This
use of concept hierarchies does not appear in traditional relational query languages, yet
is likely to become a common feature in data mining query languages.
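
As a rough illustration of how such a clause could be resolved, the following Python sketch expands a higher-level concept into the primitive-level values it covers and rewrites the where clause accordingly. The dictionary layout and the function names `expand_concept` and `rewrite_where` are assumptions made for this example; only the status values come from the text.

```python
# Hypothetical encoding of the concept hierarchy for the attribute `status`:
# each higher-level concept maps to the primitive-level values it covers.
STATUS_HIERARCHY = {
    "graduate": ["M.Sc.", "M.A.", "M.B.A.", "Ph.D."],
    "undergraduate": ["B.Sc.", "B.A."],
}


def expand_concept(hierarchy, concept):
    """Return the primitive-level values covered by `concept`.

    A value that is not a key of the hierarchy is treated as already
    primitive and is returned on its own.
    """
    return list(hierarchy.get(concept, [concept]))


def rewrite_where(attribute, concept, hierarchy):
    """Rewrite `attribute in "concept"` as a test over primitive values."""
    values = ", ".join(f'"{v}"' for v in expand_concept(hierarchy, concept))
    return f"where {attribute} in {{{values}}}"


print(rewrite_where("status", "graduate", STATUS_HIERARCHY))
# prints: where status in {"M.Sc.", "M.A.", "M.B.A.", "Ph.D."}
```

A flat dictionary is enough here because the hierarchy in the example has only two levels; a deeper hierarchy would need a recursive expansion.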
The data mining query presented in Example 4.11 is transformed into the following
relational query for the collection of the task-relevant data set:

use Big_University_DB
select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“M.Sc.”, “M.A.”, “M.B.A.”, “Ph.D.”}

The transformed query is executed against the relational database, Big_University_DB,
and returns the data shown earlier in Table 4.5. This table is called the (task-relevant)
initial working relation. It is the data on which induction will be performed. Note that
each tuple is, in fact, a conjunction of attribute–value pairs. Hence, we can think of a
tuple within a relation as a rule of conjuncts, and of induction on the relation as the
generalization of these rules.
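
To make the reading of a tuple as a rule concrete, here is a minimal Python sketch in which a rule is a dictionary of attribute–value conjuncts and a rule covers a tuple when every conjunct holds. The function name `covers` and the three tuples are hypothetical and are not the rows of Table 4.5.

```python
def covers(rule, tuple_):
    """True if every attribute-value conjunct of `rule` holds in `tuple_`."""
    return all(tuple_.get(attr) == value for attr, value in rule.items())


# Hypothetical tuples, for illustration only.
relation = [
    {"name": "A", "gender": "F", "major": "CS"},
    {"name": "B", "gender": "F", "major": "CS"},
    {"name": "C", "gender": "M", "major": "Physics"},
]

# The first tuple read as a rule:
# name = "A" AND gender = "F" AND major = "CS".
rule = dict(relation[0])
print([t["name"] for t in relation if covers(rule, t)])
# prints: ['A']

# A more general rule has fewer conjuncts and therefore covers more tuples.
general_rule = {k: v for k, v in rule.items() if k != "name"}
print([t["name"] for t in relation if covers(general_rule, t)])
# prints: ['A', 'B']
```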

“Now that the data are ready for attribute-oriented induction, how is attribute-oriented
induction performed?” The essential operation of attribute-oriented induction is data
generalization, which can be performed in either of two ways on the initial working
relation: attribute removal and attribute generalization.
Attribute removal is based on the following rule: If there is a large set of distinct values
for an attribute of the initial working relation, but either (case 1) there is no generalization
operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or
(case 2) its higher-level concepts are expressed in terms of other attributes, then the attribute
should be removed from the working relation.
Let’s examine the reasoning behind this rule. An attribute–value pair represents a
conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a
constraint and thus generalizes the rule. If, as in case 1, there is a large set of distinct values
for an attribute but there is no generalization operator for it, the attribute should be
removed because it cannot be generalized. Preserving it would imply keeping a large
number of disjuncts, which contradicts the goal of generating concise rules. On the
other hand, consider case 2, where the attribute’s higher-level concepts are expressed
in terms of other attributes. For example, suppose that the attribute in question is street,
with higher-level concepts that are represented by the attributes ⟨city, province_or_state,
country⟩. The removal of street is equivalent to the application of a generalization operator.
This rule corresponds to the generalization rule known as dropping condition in the
machine learning literature on learning from examples.
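
The attribute-removal rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the numeric `threshold` that decides what counts as "a large set of distinct values", the function names, and the sample tuples are all hypothetical.

```python
def attributes_to_remove(relation, has_operator, expressed_elsewhere, threshold):
    """Return the attributes that the attribute-removal rule would drop.

    relation: list of dicts, one per tuple of the working relation.
    has_operator: attributes with a generalization operator
        (e.g., a concept hierarchy).
    expressed_elsewhere: attributes whose higher-level concepts are
        carried by other attributes (case 2).
    threshold: hypothetical cutoff for "a large set of distinct values".
    """
    removable = []
    for attr in relation[0]:
        distinct = {t[attr] for t in relation}
        if len(distinct) <= threshold:
            continue  # few distinct values: the rule does not apply
        no_operator = attr not in has_operator    # case 1
        redundant = attr in expressed_elsewhere   # case 2
        if no_operator or redundant:
            removable.append(attr)
    return removable


def remove_attributes(relation, attrs):
    """Project the relation onto the attributes that are kept."""
    return [{k: v for k, v in t.items() if k not in attrs} for t in relation]


# Hypothetical working relation, for illustration only.
relation = [
    {"name": "A", "street": "1 First St", "city": "Vancouver", "major": "CS"},
    {"name": "B", "street": "2 Second St", "city": "Vancouver", "major": "Physics"},
    {"name": "C", "street": "3 Third St", "city": "Richmond", "major": "Math"},
    {"name": "D", "street": "4 Fourth St", "city": "Richmond", "major": "CS"},
]

dropped = attributes_to_remove(
    relation,
    has_operator={"street", "major"},
    expressed_elsewhere={"street"},
    threshold=2,
)
print(dropped)
# prints: ['name', 'street']
print(remove_attributes(relation, dropped)[0])
# prints: {'city': 'Vancouver', 'major': 'CS'}
```

Here name is dropped under case 1 (many distinct values, no operator) and street under case 2 (its higher-level concept is already carried by city), while major is kept because it can still be generalized.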
Attribute generalization is based on the following rule: If there is a large set of distinct
values for an attribute in the initial working relation, and there exists a set of generalization
operators on the attribute, then a generalization operator should be selected and applied
