          Chapter 3 Data Preprocessing



                           representative sample, especially when the data are skewed. For example, a stratified
                           sample may be obtained from customer data, where a stratum is created for each
                           customer age group. In this way, the age group having the smallest number of customers
                           will be sure to be represented.
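As a minimal sketch of the stratified sample described above, the following Python code draws a simple random sample without replacement from each stratum. The function name `stratified_sample`, the `fraction` parameter, the rule of taking at least one row per stratum, and the customer records are all assumptions made for illustration; they are not taken from the text.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample each stratum separately, without replacement.

    Every stratum contributes at least one row, so that even the
    smallest group appears in the result (an illustrative choice).
    """
    rng = random.Random(seed)

    # Partition the rows into strata according to the key function.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical, heavily skewed customer data: (customer_id, age_group).
customers = (
    [(i, "youth") for i in range(900)]
    + [(i, "middle_aged") for i in range(900, 990)]
    + [(i, "senior") for i in range(990, 1000)]
)

sample = stratified_sample(customers, key=lambda c: c[1], fraction=0.05)
groups_in_sample = {age_group for _, age_group in sample}
print(sorted(groups_in_sample))
```

A plain random sample of the same size could miss the ten "senior" customers entirely; sampling within each stratum rules that out by construction.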

                           An advantage of sampling for data reduction is that the cost of obtaining a sample
                         is proportional to the size of the sample, s, as opposed to N, the data set size. Hence,
                         sampling complexity is potentially sublinear in the size of the data. Other data
                         reduction techniques can require at least one complete pass through D. For a fixed sample
                         size, sampling complexity increases only linearly as the number of data dimensions,
                         n, increases, whereas the cost of techniques using histograms, for example, increases
                         exponentially in n.
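The difference in growth can be illustrated with a little arithmetic. The sketch below assumes an equal-width multidimensional histogram with a fixed number of buckets per dimension, which is one common construction and not necessarily the one the text has in mind. The helper names and the figures (1000 sampled tuples, 10 buckets per dimension) are hypothetical.

```python
def sample_values(sample_size, n_dims):
    """Values stored for a sample of s tuples with n attributes: s * n."""
    return sample_size * n_dims


def histogram_cells(buckets_per_dim, n_dims):
    """Cells in a grid histogram with b buckets per dimension: b ** n."""
    return buckets_per_dim ** n_dims


# Compare linear growth (sample) with exponential growth (histogram grid).
for n in (1, 2, 5, 10):
    print(n, sample_values(1000, n), histogram_cells(10, n))
```

With these assumed figures, the sample grows from 1000 to 10,000 stored values as n goes from 1 to 10, while the histogram grid grows from 10 to 10^10 cells.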
                           When applied to data reduction, sampling is most commonly used to estimate the
                         answer to an aggregate query. It is possible (using the central limit theorem) to
                         determine a sufficient sample size for estimating a given function within a specified
                         degree of error. This sample size, s, may be extremely small in comparison to N. Sampling
                         is a natural choice for the progressive refinement of a reduced data set. Such a set can
                         be further refined by simply increasing the sample size.
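One standard way to apply the central limit theorem here is the normal-approximation formula for estimating a mean: s = ceil((z * sigma / e)^2), where z is the normal quantile for the desired confidence level, sigma is the standard deviation of the attribute, and e is the tolerated error. The sketch below assumes sigma is known or has been estimated from a pilot sample. The function name and the example numbers are hypothetical, and the text does not say which estimator it has in mind.

```python
import math
from statistics import NormalDist


def sufficient_sample_size(sigma, max_error, confidence=0.95):
    """Sample size so that the sample mean lies within max_error of the
    true mean with the given confidence (normal approximation)."""
    # Two-sided quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / max_error) ** 2)


# Hypothetical query: average purchase amount, sigma = 100, error <= 5.
s = sufficient_sample_size(sigma=100, max_error=5, confidence=0.95)
print(s)  # 1537
```

Under this approximation the result does not depend on N, which is why s may be extremely small in comparison to the data set size. Halving the tolerated error roughly quadruples s, which corresponds to refining the reduced set by increasing the sample size.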

                   3.4.9 Data Cube Aggregation
                         Imagine that you have collected the data for your analysis. These data consist of the
                         AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested
                         in the annual sales (total per year), rather than the total per quarter. Thus, the data can
                         be aggregated so that the resulting data summarize the total sales per year instead of per
                         quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in
                         volume, without loss of information necessary for the analysis task.
                           Data cubes are discussed in detail in Chapter 4 on data warehousing and Chapter 5
                         on data cube technology. We briefly introduce some concepts here. Data cubes store

                         Quarterly sales, Year 2008       Annual sales
                         (the figure stacks like tables for 2009 and 2010)

                         Quarter  Sales                   Year   Sales
                         Q1       $224,000                2008   $1,568,000
                         Q2       $408,000                2009   $2,356,000
                         Q3       $350,000                2010   $3,594,000
                         Q4       $586,000
              Figure 3.10 Sales data for a given branch of AllElectronics for the years 2008 through 2010. On the left,
                         the sales are shown per quarter. On the right, the data are aggregated to provide the annual
                         sales.
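The roll-up from quarters to years in Figure 3.10 can be sketched in a few lines of Python. Only the 2008 quarterly values are used, since those are the ones that can be read from the figure. The tuple layout and the function name `aggregate_by_year` are assumptions made for this illustration.

```python
from collections import defaultdict

# (year, quarter, sales in dollars): the 2008 values from Figure 3.10.
quarterly_sales = [
    (2008, "Q1", 224_000),
    (2008, "Q2", 408_000),
    (2008, "Q3", 350_000),
    (2008, "Q4", 586_000),
]


def aggregate_by_year(rows):
    """Sum sales over quarters, leaving one total per year."""
    totals = defaultdict(int)
    for year, _quarter, sales in rows:
        totals[year] += sales
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # {2008: 1568000}
```

The four quarterly rows collapse into a single row whose total matches the $1,568,000 shown for 2008 on the right of the figure. The quarterly detail is lost, but nothing needed for an analysis of annual sales is.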