Chapter 4 Data Warehousing and Online Analytical Processing
number), and gpa (grade point average). A data mining query for this characterization
can be expressed in the data mining query language, DMQL, as follows:
    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence,
        phone#, gpa
    from student
    where status in "graduate"
We will see how attribute-oriented induction can be applied to this typical data mining
query to mine characteristic descriptions.
First, data focusing should be performed before attribute-oriented induction. This
step corresponds to the specification of the task-relevant data (i.e., data for analysis). The
data are collected based on the information provided in the data mining query. Because
a data mining query is usually relevant to only a portion of the database, selecting the
relevant data set not only makes mining more efficient, but also derives more meaningful
results than mining the entire database.
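The data-focusing step can be pictured as a selection on the where condition followed by a projection onto the attributes named in the in relevance to clause. The Python sketch below is only an illustration under stated assumptions, not the system's actual implementation: the function name focus, the status column values, and the fourth tuple are made up, while the first three tuples are the ones shown in Table 4.5.

```python
# Attributes named in the query's "in relevance to" clause.
RELEVANT = ["name", "gender", "major", "birth_place", "birth_date",
            "residence", "phone#", "gpa"]


def focus(relation, relevant, status_values):
    """Keep tuples whose status is in status_values, projected onto `relevant`."""
    return [
        {attr: row[attr] for attr in relevant}
        for row in relation
        if row["status"] in status_values
    ]


# Toy 'student' relation. The 'status' values are assumed for illustration.
student = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "birth_date": "12-8-76",
     "residence": "3511 Main St., Richmond", "phone#": "687-4598",
     "gpa": 3.67, "status": "graduate"},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "birth_date": "7-28-75",
     "residence": "345 1st Ave., Richmond", "phone#": "253-9106",
     "gpa": 3.70, "status": "graduate"},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "birth_date": "8-25-70",
     "residence": "125 Austin Ave., Burnaby", "phone#": "420-5232",
     "gpa": 3.83, "status": "graduate"},
    # Made-up tuple (not from the book) so that the selection removes something.
    {"name": "Example Undergrad", "gender": "F", "major": "Biology",
     "birth_place": "Example City", "birth_date": "1-1-80",
     "residence": "Example residence", "phone#": "000-0000",
     "gpa": 3.00, "status": "undergraduate"},
]

working_relation = focus(student, RELEVANT, {"graduate"})
for row in working_relation:
    print(row["name"], row["major"], row["gpa"])
```

Only the three graduate tuples survive, and the status column itself is dropped because it is not among the relevant attributes.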
Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in
DMQL with the in relevance to clause) may be difficult for the user. A user may select
only a few attributes that he or she feels are important, while missing others that could
also play a role in the description. For example, suppose that the dimension birth_place
is defined by the attributes city, province_or_state, and country. Of these attributes, let’s
say that the user has only thought to specify city. In order to allow generalization on
the birth_place dimension, the other attributes defining this dimension should also be
included. In other words, having the system automatically include province_or_state and
country as relevant attributes allows city to be generalized to these higher conceptual
levels during the induction process.
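One way to picture this automatic inclusion is a lookup from each dimension to its defining attributes. The Python sketch below is a minimal illustration under assumptions: the DIMENSIONS mapping and the function name expand_relevant are hypothetical, and a real system would read the dimension definitions from its schema or concept hierarchies.

```python
# Hypothetical dimension definitions: dimension name -> its attributes,
# listed from the lowest conceptual level to the highest.
DIMENSIONS = {
    "birth_place": ["city", "province_or_state", "country"],
}


def expand_relevant(user_attrs, dimensions):
    """Add every attribute of a dimension once the user has named any one of them."""
    expanded = list(user_attrs)
    for attr in user_attrs:
        for dim_attrs in dimensions.values():
            if attr in dim_attrs:
                for other in dim_attrs:
                    if other not in expanded:
                        expanded.append(other)
    return expanded


print(expand_relevant(["name", "city", "gpa"], DIMENSIONS))
# ['name', 'city', 'gpa', 'province_or_state', 'country']
```

With province_or_state and country added, city values can later be generalized to either higher level.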
At the other extreme, suppose that the user has introduced too many attributes
by specifying all of the possible attributes with the clause in relevance to *. In this case,
all of the attributes in the relation specified by the from clause would be included in the
analysis. Many of these attributes are unlikely to contribute to an interesting description.
A correlation-based analysis method (Section 3.3.2) can be used to perform attribute
relevance analysis and filter out statistically irrelevant or weakly relevant attributes from
the descriptive mining process. Other approaches, such as attribute subset selection, are
also described in Chapter 3.
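As a rough illustration of correlation-based filtering, the Python sketch below scores each numeric attribute by the absolute Pearson correlation coefficient with a 0/1 class indicator and keeps those above a threshold. This is one possible measure chosen for the example, not necessarily the method of Section 3.3.2; the function names, the threshold, and all data values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def relevant_attributes(columns, target, threshold):
    """Keep attributes whose |correlation| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Made-up data: 1 marks a tuple in the target class, 0 a tuple outside it.
in_target_class = [1, 1, 1, 0, 0, 0]
columns = {
    "gpa": [3.8, 3.7, 3.9, 2.9, 3.0, 3.1],
    "phone_last_digit": [2, 7, 4, 2, 7, 4],  # unrelated to the class
}

print(relevant_attributes(columns, in_target_class, threshold=0.3))
# ['gpa']
```

Nominal attributes such as gender or major would need a different measure (for example, a chi-square test), so this sketch covers numeric attributes only.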
Table 4.5 Initial Working Relation: A Collection of Task-Relevant Data

name            gender  major    birth_place            birth_date  residence                 phone#    gpa
Jim Woodman     M       CS       Vancouver, BC, Canada  12-8-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  7-28-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       8-25-70     125 Austin Ave., Burnaby  420-5232  3.83
···             ···     ···      ···                    ···         ···                       ···       ···