Chapter 4 Data Warehousing and Online Analytical Processing
4.5 Data Generalization by Attribute-Oriented Induction
Conceptually, the data cube can be viewed as a kind of multidimensional data
generalization. In general, data generalization summarizes data by replacing relatively
low-level values (e.g., numeric values for an attribute age) with higher-level concepts
(e.g., young, middle-aged, and senior), or by reducing the number of dimensions to
summarize data in a concept space involving fewer dimensions (e.g., removing birth date
and telephone number when summarizing the behavior of a group of students). Given the
large amount of data stored in databases, it is useful to be able to describe concepts
in concise and succinct terms at generalized (rather than low) levels of abstraction.
Allowing data sets to be generalized at multiple levels of abstraction helps users
examine the general behavior of the data. Given the AllElectronics database, for
example, instead of examining individual customer transactions, sales managers may
prefer to view the data generalized to higher levels, such as summarized by customer
groups according to geographic regions, frequency of purchases per group, and customer
income.
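The two generalization moves described above (replacing low-level values with higher-level concepts, and dropping dimensions) can be sketched in a few lines of Python. This is only an illustrative sketch: the student records, the attribute names, and the age thresholds used for young, middle-aged, and senior are all assumptions made for the example, not values given in the text.

```python
from collections import Counter

# Hypothetical student records; names, attributes, and values are illustrative only.
students = [
    {"name": "Ann", "birth_date": "1999-04-12", "phone": "555-0101", "age": 24, "major": "CS"},
    {"name": "Bob", "birth_date": "1998-11-02", "phone": "555-0102", "age": 25, "major": "CS"},
    {"name": "Cho", "birth_date": "1978-01-30", "phone": "555-0103", "age": 45, "major": "Physics"},
    {"name": "Dee", "birth_date": "1957-07-19", "phone": "555-0104", "age": 66, "major": "CS"},
]


def age_to_concept(age):
    """Replace a numeric age with a higher-level concept (thresholds are assumed)."""
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(records):
    """Drop name, birth_date, and phone; generalize age; count identical tuples."""
    counts = Counter()
    for record in records:
        # Only the generalized age and the major survive; other dimensions are removed.
        generalized = (age_to_concept(record["age"]), record["major"])
        counts[generalized] += 1
    return counts


summary = generalize(students)
for (age_group, major), count in sorted(summary.items()):
    print(age_group, major, count)
```

With these four assumed records, the two young CS students collapse into a single generalized tuple with a count of 2, which is the kind of concise summary the paragraph has in mind.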
This leads us to the notion of concept description, which is a form of data
generalization. A concept typically refers to a data collection such as frequent buyers,
graduate students, and so on. As a data mining task, concept description is not a simple
enumeration of the data. Instead, concept description generates descriptions for data
characterization and comparison. It is sometimes called class description when the
concept to be described refers to a class of objects. Characterization provides a concise
and succinct summarization of the given data collection, while concept or class
comparison (also known as discrimination) provides descriptions comparing two or more
data collections.
Up to this point, we have studied data cube (or OLAP) approaches to concept
description using multidimensional, multilevel data generalization in data warehouses.
“Is data cube technology sufficient to accomplish all kinds of concept description tasks for
large data sets?” Consider the following cases.
Complex data types and aggregation: Data warehouses and OLAP tools are based
on a multidimensional data model that views data in the form of a data cube,
consisting of dimensions (or attributes) and measures (aggregate functions). However,
many current OLAP systems confine dimensions to non-numeric data and measures
to numeric data. In reality, the database can include attributes of various data types,
including numeric, non-numeric, spatial, text, or image, which ideally should be
included in the concept description.
Furthermore, the aggregation of attributes in a database may include sophisticated
data types such as the collection of non-numeric data, the merging of spatial regions,
the composition of images, the integration of texts, and the grouping of object
pointers. Therefore, OLAP, with its restrictions on the possible dimension and measure
types, represents a simplified model for data analysis. Concept description should
handle complex data types of the attributes and their aggregations, as necessary.
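To make the idea of non-numeric aggregation concrete, the following Python sketch aggregates two such attributes per group: a collection of non-numeric values (gathered into a set) and a spatial region (merged into one enclosing region). It rests on several assumptions that are not in the text: the sales records, the attribute names, and the use of axis-aligned bounding boxes as a deliberately crude stand-in for real spatial regions.

```python
# Hypothetical sales records; "box" is an assumed (xmin, ymin, xmax, ymax) rectangle
# standing in for a spatial region.
records = [
    {"region": "west", "product": "TV", "box": (0, 0, 2, 2)},
    {"region": "west", "product": "camera", "box": (1, 1, 4, 3)},
    {"region": "east", "product": "TV", "box": (10, 0, 12, 1)},
]


def merge_boxes(a, b):
    """Merge two rectangles into the smallest rectangle that encloses both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def aggregate(rows):
    """Group rows by region, collecting product names and merging spatial boxes."""
    groups = {}
    for row in rows:
        group = groups.setdefault(row["region"], {"products": set(), "area": row["box"]})
        group["products"].add(row["product"])                 # collection of non-numeric data
        group["area"] = merge_boxes(group["area"], row["box"])  # merging of spatial regions
    return groups


for region, group in sorted(aggregate(records).items()):
    print(region, sorted(group["products"]), group["area"])
```

Neither aggregate here is a numeric measure such as a sum or a count, which is the point of the passage: a description method limited to numeric measures would have no place to store the set of products or the merged area.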