Page 47 -
P. 47

HAN 08-ch01-001-038-9780123814791


          10    Chapter 1 Introduction                       2011/6/1  3:12  Page 10  #10



                           Relational data can be accessed by database queries written in a relational query
                         language (e.g., SQL) or with the assistance of graphical user interfaces. A given query is
                         transformed into a set of relational operations, such as join, selection, and projection,
                         and is then optimized for efficient processing. A query allows retrieval of specified sub-
                         sets of the data. Suppose that your job is to analyze the AllElectronics data. Through the
                         use of relational queries, you can ask things like, “Show me a list of all items that were
                         sold in the last quarter.” Relational languages also use aggregate functions such as sum,
                         avg (average), count, max (maximum), and min (minimum). Using aggregates allows you
                         to ask: “Show me the total sales of the last month, grouped by branch,” or “How many sales
                         transactions occurred in the month of December?” or “Which salesperson had the highest
                         sales?”
                           When mining relational databases, we can go further by searching for trends or
                         data patterns. For example, data mining systems can analyze customer data to predict
                         the credit risk of new customers based on their income, age, and previous credit
                         information. Data mining systems may also detect deviations—that is, items with sales
                         that are far from those expected in comparison with the previous year. Such deviations
                         can then be further investigated. For example, data mining may discover that there has
                         been a change in packaging of an item or a significant increase in price.
                           Relational databases are one of the most commonly available and richest information
                         repositories, and thus they are a major data form in the study of data mining.




                   1.3.2 Data Warehouses
                         Suppose that AllElectronics is a successful international company with branches around
                         the world. Each branch has its own set of databases. The president of AllElectronics has
                         asked you to provide an analysis of the company’s sales per item type per branch for the
                         third quarter. This is a difficult task, particularly since the relevant data are spread out
                         over several databases physically located at numerous sites.
                           If AllElectronics had a data warehouse, this task would be easy. A data warehouse
                         is a repository of information collected from multiple sources, stored under a unified
                         schema, and usually residing at a single site. Data warehouses are constructed via a
                         process of data cleaning, data integration, data transformation, data loading, and peri-
                         odic data refreshing. This process is discussed in Chapters 3 and 4. Figure 1.6 shows the
                         typical framework for construction and use of a data warehouse for AllElectronics.
                           To facilitate decision making, the data in a data warehouse are organized around
                         major subjects (e.g., customer, item, supplier, and activity). The data are stored to pro-
                         vide information from a historical perspective, such as in the past 6 to 12 months, and are
                         typically summarized. For example, rather than storing the details of each sales transac-
                         tion, the data warehouse may store a summary of the transactions per item type for each
                         store or, summarized to a higher level, for each sales region.
                           A data warehouse is usually modeled by a multidimensional data structure, called a
                         data cube, in which each dimension corresponds to an attribute or a set of attributes
                         in the schema, and each cell stores the value of some aggregate measure such as count
   42   43   44   45   46   47   48   49   50   51   52