
          Chapter 4 Data Warehousing and Online Analytical Processing


                 4.5     Data Generalization by Attribute-Oriented Induction

                         Conceptually, the data cube can be viewed as a kind of multidimensional data
                         generalization. In general, data generalization summarizes data by replacing relatively
                         low-level values (e.g., numeric values for an attribute age) with higher-level concepts
                         (e.g., young, middle-aged, and senior), or by reducing the number of dimensions to
                         summarize data in a concept space involving fewer dimensions (e.g., removing birth date
                         and telephone number when summarizing the behavior of a group of students). Given the
                         large amount of data stored in databases, it is useful to be able to describe concepts in
                         concise and succinct terms at generalized (rather than low) levels of abstraction.
                         Allowing data sets to be generalized at multiple levels of abstraction helps users
                         examine the general behavior of the data. Given the AllElectronics database, for
                         example, instead of examining individual customer transactions, sales managers may
                         prefer to view the data generalized to higher levels, such as summarized by customer
                         groups according to geographic regions, frequency of purchases per group, and customer
                         income.
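                         The two generalization moves described above (lifting low-level values to higher-level
                         concepts, and dropping attributes) can be sketched in a few lines of Python. This is
                         only an illustrative sketch: the student records, the age cutoffs, and the function
                         names are assumptions made up for the example, not part of the AllElectronics database
                         or of any particular system.

```python
from collections import Counter

# Hypothetical low-level records; every value is made up for illustration.
students = [
    {"name": "A", "age": 19, "birth_date": "2005-03-01", "phone": "555-0101", "major": "CS"},
    {"name": "B", "age": 22, "birth_date": "2002-07-15", "phone": "555-0102", "major": "CS"},
    {"name": "C", "age": 35, "birth_date": "1989-01-20", "phone": "555-0103", "major": "Physics"},
    {"name": "D", "age": 41, "birth_date": "1983-11-02", "phone": "555-0104", "major": "Physics"},
    {"name": "E", "age": 63, "birth_date": "1961-05-30", "phone": "555-0105", "major": "CS"},
]


def generalize_age(age):
    """Map a numeric age to a higher-level concept (the cutoffs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(records, drop=("name", "birth_date", "phone")):
    """Drop low-level attributes, lift age to a concept, and merge
    identical generalized tuples while keeping a count for each."""
    counts = Counter()
    for record in records:
        kept = {k: v for k, v in record.items() if k not in drop}
        kept["age"] = generalize_age(kept["age"])
        # Sort the items so that equal tuples always produce the same key.
        counts[tuple(sorted(kept.items()))] += 1
    return counts


summary = generalize(students)
for key, count in summary.items():
    print(dict(key), "count =", count)
```

                         Five individual records collapse into three generalized tuples, each carrying a count
                         of the records it summarizes.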
                           This leads us to the notion of concept description, which is a form of data
                         generalization. A concept typically refers to a data collection such as frequent buyers,
                         graduate students, and so on. As a data mining task, concept description is not a simple
                         enumeration of the data. Instead, concept description generates descriptions for data
                         characterization and comparison. It is sometimes called class description when the
                         concept to be described refers to a class of objects. Characterization provides a concise
                         and succinct summarization of the given data collection, while concept or class
                         comparison (also known as discrimination) provides descriptions comparing two or more
                         data collections.
                           Up to this point, we have studied data cube (or OLAP) approaches to concept
                         description using multidimensional, multilevel data generalization in data warehouses.
                         “Is data cube technology sufficient to accomplish all kinds of concept description tasks for
                         large data sets?” Consider the following cases.

                           Complex data types and aggregation: Data warehouses and OLAP tools are based
                           on a multidimensional data model that views data in the form of a data cube,
                           consisting of dimensions (or attributes) and measures (aggregate functions). However,
                           many current OLAP systems confine dimensions to non-numeric data and measures
                           to numeric data. In reality, the database can include attributes of various data types,
                           including numeric, non-numeric, spatial, text, or image, which ideally should be
                           included in the concept description.
                              Furthermore, the aggregation of attributes in a database may include sophisticated
                           data types such as the collection of non-numeric data, the merging of spatial regions,
                           the composition of images, the integration of texts, and the grouping of object
                           pointers. Therefore, OLAP, with its restrictions on the possible dimension and measure
                           types, represents a simplified model for data analysis. Concept description should
                           handle complex data types of the attributes and their aggregations, as necessary.
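                           To make the contrast with purely numeric measures concrete, the following Python
                           sketch aggregates three kinds of attributes per group: a numeric sum, a collection
                           of non-numeric values, and a merged spatial extent. It is an illustration under
                           stated assumptions only: the sales records are made up, and an axis-aligned
                           bounding box is used as a crude, hypothetical stand-in for a real spatial region.

```python
from collections import defaultdict

# Hypothetical sales records; every value is made up for illustration.
# "bbox" is (xmin, ymin, xmax, ymax), a crude stand-in for a spatial region.
sales = [
    {"region": "West", "item_type": "TV", "amount": 400.0, "bbox": (0, 0, 2, 2)},
    {"region": "West", "item_type": "phone", "amount": 250.0, "bbox": (1, 1, 4, 3)},
    {"region": "East", "item_type": "TV", "amount": 300.0, "bbox": (10, 0, 12, 5)},
]


def merge_bbox(a, b):
    """Return the smallest axis-aligned box that covers both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def aggregate(records):
    """Aggregate numeric, non-numeric, and spatial attributes per region."""
    groups = defaultdict(lambda: {"total": 0.0, "item_types": set(), "bbox": None})
    for record in records:
        group = groups[record["region"]]
        group["total"] += record["amount"]            # numeric measure: sum
        group["item_types"].add(record["item_type"])  # non-numeric: collection
        if group["bbox"] is None:                      # spatial: merged extent
            group["bbox"] = record["bbox"]
        else:
            group["bbox"] = merge_bbox(group["bbox"], record["bbox"])
    return dict(groups)


result = aggregate(sales)
for region, values in result.items():
    print(region, values["total"], sorted(values["item_types"]), values["bbox"])
```

                           Only the first of the three aggregates is a numeric measure of the kind the text
                           says many OLAP systems are confined to; the other two are examples of the more
                           general aggregations that concept description should be able to handle.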