
          Chapter 5 Data Cube Technology



                         that the standard deviation of the population is unknown, the sample standard deviation
                         of x is denoted by s. Given a desired confidence level, the confidence interval for $\bar{x}$ is

                                        $\bar{x} \pm t_c \, \hat{\sigma}_{\bar{x}}$,                 (5.1)

                         where $t_c$ is the critical t-value associated with the confidence level and
                         $\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$ is the estimated standard error of the mean. To find the
                         appropriate $t_c$, specify the desired confidence level (e.g., 95%) and also the
                         degrees of freedom, which is just $l - 1$.
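
                         As a quick illustration of Eq. (5.1), consider a worked example. The numbers here
                         ($l = 25$, $\bar{x} = 50$, $s = 10$) are hypothetical and chosen only to keep the
                         arithmetic simple; the critical value is the standard two-sided 95% t-table entry
                         for 24 degrees of freedom.

```latex
\begin{align*}
\hat{\sigma}_{\bar{x}} &= \frac{s}{\sqrt{l}} = \frac{10}{\sqrt{25}} = 2, \\
t_c &\approx 2.064 \quad (95\%,\ l - 1 = 24 \text{ degrees of freedom}), \\
\bar{x} \pm t_c \, \hat{\sigma}_{\bar{x}} &= 50 \pm 2.064 \times 2 = 50 \pm 4.128 .
\end{align*}
```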
                           The important thing to note is that computing a confidence interval is algebraic.
                         Let’s look at the three terms involved in Eq. (5.1). The first is the mean of the
                         sample set, $\bar{x}$, which is algebraic; the second is the critical t-value, which
                         is calculated by a lookup and, with respect to x, depends only on l, a distributive
                         measure; and the third is $\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$, which also turns out to be
                         algebraic if one records the linear sum ($\sum_{i=1}^{l} x_i$) and squared sum
                         ($\sum_{i=1}^{l} x_i^2$). Because the terms involved are either algebraic or
                         distributive, the confidence interval computation is algebraic. In fact, since
                         both the mean and confidence interval are algebraic, exactly three values per cell
                         are sufficient to calculate them, all of which are either distributive or algebraic:
                         1. $l$
                         2. sum = $\sum_{i=1}^{l} x_i$
                         3. squared sum = $\sum_{i=1}^{l} x_i^2$
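
                         The role of these three values can be sketched in code. The following is a minimal
                         illustration, not the book's implementation: the class name `CellStats` and its
                         methods are hypothetical, the sample standard deviation is assumed to use the usual
                         $l - 1$ denominator, and the critical value `t_c` is assumed to be looked up by the
                         caller (e.g., from a t-table) rather than computed here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CellStats:
    """Hypothetical per-cell summary: count l, linear sum, squared sum."""
    l: int
    total: float
    sq_total: float

    @classmethod
    def from_samples(cls, xs):
        return cls(len(xs), sum(xs), sum(x * x for x in xs))

    def merge(self, other):
        # Each stored value is a plain sum, so a parent cell is the
        # component-wise sum of its child cells.
        return CellStats(self.l + other.l,
                         self.total + other.total,
                         self.sq_total + other.sq_total)

    def mean(self):
        return self.total / self.l

    def sample_std(self):
        # Assumed form: s^2 = (sum x^2 - (sum x)^2 / l) / (l - 1); needs l >= 2.
        var = (self.sq_total - self.total ** 2 / self.l) / (self.l - 1)
        return math.sqrt(max(var, 0.0))  # clamp tiny negative round-off

    def confidence_interval(self, t_c):
        # Eq. (5.1): mean +/- t_c * s / sqrt(l). The caller supplies t_c for
        # the chosen confidence level and l - 1 degrees of freedom.
        half_width = t_c * self.sample_std() / math.sqrt(self.l)
        m = self.mean()
        return m - half_width, m + half_width


# Usage: two child cells roll up into one parent cell.
a = CellStats.from_samples([2.0, 4.0])
b = CellStats.from_samples([4.0, 4.0, 6.0])
cell = a.merge(b)  # same summary as from_samples([2, 4, 4, 4, 6])
# 2.776 is the two-sided 95% t-table value for 4 degrees of freedom.
low, high = cell.confidence_interval(t_c=2.776)
```

                         Because merging needs only the three stored values, the raw samples never have to be
                         revisited when a higher-level cell is aggregated from lower-level ones.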
                           There are many efficient techniques for computing algebraic and distributive
                         measures (Section 4.2.4). Therefore, any of the previously developed cubing
                         algorithms can be used to efficiently construct a sampling cube.
                           Now that we have established that sampling cubes can be computed efficiently, our
                         next step is to find a way of boosting the confidence of results obtained for queries on
                         sample data.

                         Query Processing: Boosting Confidences for Small Samples
                         A query posed against a data cube can be either a point query or a range query.
                         Without loss of generality, consider the case of a point query. Here, it corresponds
                         to a cell in sampling cube $C_R$. The goal is to provide an accurate point estimate
                         for the samples in that cell. Because the cube also reports the confidence interval
                         associated with the sample mean, there is some measure of “reliability” to the
                         returned answer. If the confidence interval is small, the reliability is deemed good;
                         however, if the interval is large, the reliability is questionable.
                           “What can we do to boost the reliability of query answers?” Consider what affects the
                         confidence interval size. There are two main factors: the variance of the sample data and
                         the sample size. First, a rather large variance in the cell may indicate that the chosen cube