                                   5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology


                               The resulting data are called sample data. Data are often sampled to save on costs,
                               manpower, time, and materials. In many applications, the collection of the entire data
                               population of interest is unrealistic. In the study of TV ratings or pre-election polls, for
                               example, it is impossible to gather the opinion of everyone in the population. Most pub-
                               lished ratings or polls rely on a data sample for analysis. The results are extrapolated for
                               the entire population, and associated with certain statistical measures such as a confi-
                               dence interval. The confidence interval tells us how reliable a result is. Statistical surveys
                               based on sampling are a common tool in many fields like politics, healthcare, market
                               research, and social and natural sciences.
                                 “How effective is OLAP on sample data?” OLAP traditionally has the full data pop-
                               ulation on hand, yet with sample data, we have only a small subset. If we try to apply
                               traditional OLAP tools to sample data, we encounter two challenges. First, sample data
                               are often sparse in the multidimensional sense. When a user drills down on the data, it
                               is easy to reach a point with very few or no samples even when the overall sample size
                               is large. Traditional OLAP simply uses whatever data are available to compute a query
                               answer. To extrapolate such an answer for a population based on a small sample could
                               be misleading: A single outlier or a slight bias in the sampling can distort the answer sig-
                               nificantly. Second, with sample data, statistical methods are used to provide a measure
                               of reliability (e.g., a confidence interval) to indicate the quality of the query answer as it
                               pertains to the population. Traditional OLAP is not equipped with such tools.
                                 A sampling cube framework was introduced to tackle each of the preceding
                               challenges.

                               Sampling Cube Framework

                               The sampling cube is a data cube structure that stores the sample data and their multi-
                               dimensional aggregates. It supports OLAP on sample data. It calculates confidence inter-
                               vals as a quality measure for any multidimensional query. Given a sample data relation
                               (i.e., base cuboid) $R$, the sampling cube $C_R$ typically computes the sample mean, sample
                               standard deviation, and other task-specific measures.
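
As a rough illustration of the structure just described, the following minimal sketch aggregates a tiny sample relation into every cuboid and stores the count, sample mean, and sample standard deviation per cell. The relation, the dimension names (`gender`, `region`), the `age` measure, and the function `build_sampling_cube` are all hypothetical choices made for this example, and full materialization is used only for simplicity; it is not presented as the framework's actual construction method.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, stdev

# Hypothetical sample relation R: (gender, region, age); age is the measure.
SAMPLES = [
    ("F", "north", 31),
    ("F", "north", 35),
    ("F", "south", 42),
    ("M", "north", 28),
    ("M", "south", 39),
    ("M", "south", 45),
]
DIMS = ("gender", "region")  # assumed dimension names


def build_sampling_cube(rows, dims):
    """Return {cell: (count, sample_mean, sample_std)} for every cuboid.

    A cell is a tuple with one entry per dimension; "*" marks a dimension
    that has been aggregated away. sample_std is None for cells holding a
    single sample, where the sample standard deviation is undefined.
    """
    n = len(dims)
    groups = defaultdict(list)
    # Enumerate every subset of dimensions to keep (one subset per cuboid).
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            for row in rows:
                cell = tuple(row[i] if i in kept else "*" for i in range(n))
                groups[cell].append(row[n])  # the measure follows the dims
    cube = {}
    for cell, values in groups.items():
        std = stdev(values) if len(values) > 1 else None
        cube[cell] = (len(values), mean(values), std)
    return cube


cube = build_sampling_cube(SAMPLES, DIMS)
for cell in sorted(cube):
    count, avg, std = cube[cell]
    print(cell, count, round(avg, 2), std)
```

Even in this toy relation, drilling down from the apex cell `("*", "*")` to a cell such as `("F", "south")` leaves a single sample, which is the sparsity problem noted earlier.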
                                 In statistics, a confidence interval is used to indicate the reliability of an estimate.
                               Suppose we want to estimate the mean age of all viewers of a given TV show. We have
                               sample data (a subset) of this data population. Let’s say our sample mean is 35 years. This
                               becomes our estimate for the entire population of viewers as well, but how confident can
                               we be that 35 is also the mean of the true population? It is unlikely that the sample mean
                               will be exactly equal to the true population mean because of sampling error. Therefore,
                               we need to qualify our estimate in some way to indicate the general magnitude of this
                               error. This is typically done by computing a confidence interval, which is an estimated
                               value range with a given high probability of covering the true population value. A
                               confidence interval for our example could be “the actual mean will lie within +/− two
                               standard deviations of the sample mean 95% of the time.” (Recall that the standard
                               deviation is just a number, which can be computed as shown in Section 2.2.2.) A confidence interval is always
                               qualified by a particular confidence level. In our example, it is 95%.
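
To make the TV-viewer example concrete, here is a small sketch that computes a 95% confidence interval for a mean age. The sample values are invented so that the sample mean is 35, and the interval uses the common form "sample mean ± critical value × (sample standard deviation / √sample size)". The critical value is taken from the standard normal distribution as a simplifying assumption; for small samples a Student's t critical value is the usual choice and gives a somewhat wider interval. The function name `confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, level=0.95):
    """Return (low, high) for the population mean at the given level.

    Assumption: uses a normal critical value rather than Student's t,
    which is only a reasonable approximation for larger samples.
    """
    num_samples = len(samples)  # the sample size
    if num_samples < 2:
        raise ValueError("need at least two samples to estimate spread")
    x_bar = mean(samples)  # sample mean
    s = stdev(samples)  # sample standard deviation
    std_err = s / sqrt(num_samples)  # estimated standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return x_bar - z * std_err, x_bar + z * std_err


# Hypothetical ages of sampled viewers; their mean is 35.
ages = [29, 41, 35, 33, 37, 30, 40, 35]
low, high = confidence_interval(ages, level=0.95)
print(f"mean = {mean(ages)}, 95% CI = ({low:.2f}, {high:.2f})")
```

Raising the confidence level (for example to 0.99) widens the interval, while adding samples narrows it, since the standard error shrinks with the square root of the sample size.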
                                 The confidence interval is calculated as follows. Let $x$ be a set of samples. The mean of
                               the samples is denoted by $\bar{x}$, and the number of samples in $x$ is denoted by $l$. Assuming