12-ch05-187-242-9780123814791
HAN
220 Chapter 5 Data Cube Technology 2011/6/1 3:19 Page 220 #34
that the standard deviation of the population is unknown, the sample standard deviation
of $x$ is denoted by $s$. Given a desired confidence level, the confidence interval for $\bar{x}$ is
$$\bar{x} \pm t_c \hat{\sigma}_{\bar{x}}, \tag{5.1}$$
where $t_c$ is the critical t-value associated with the confidence level and
$\hat{\sigma}_{\bar{x}} = s/\sqrt{l}$ is the estimated standard error of the mean. To find the
appropriate $t_c$, specify the desired confidence level (e.g., 95%) and also the degrees of
freedom, which is just $l - 1$.
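Eq. (5.1) can be illustrated with a minimal Python sketch. It assumes the caller supplies the critical value (for example, read from a t-table); the function name `confidence_interval` and the sample values are hypothetical and chosen only for illustration.

```python
import math
import statistics


def confidence_interval(samples, t_c):
    """Return (low, high) for the sample mean, given the critical t-value t_c."""
    l = len(samples)
    if l < 2:
        raise ValueError("need at least two samples to estimate s")
    mean = statistics.mean(samples)
    s = statistics.stdev(samples)      # sample standard deviation (divides by l - 1)
    std_err = s / math.sqrt(l)         # estimated standard error of the mean
    return mean - t_c * std_err, mean + t_c * std_err


# Hypothetical sample with l = 5. The value 2.776 is the two-sided 95%
# critical t-value for l - 1 = 4 degrees of freedom, taken from a t-table.
low, high = confidence_interval([10.0, 12.0, 14.0, 16.0, 18.0], t_c=2.776)
print(low, high)
```

Passing the critical value in as a parameter mirrors the text's description of it as a lookup that depends only on the confidence level and on $l - 1$.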
The important thing to note is that computing a confidence interval is algebraic. Let's look
at the three terms involved in Eq. (5.1). The first is the mean of the sample set, $\bar{x}$,
which is algebraic; the second is the critical t-value, which is calculated by a lookup, and
with respect to $x$, it depends on $l$, a distributive measure; and the third is
$\hat{\sigma}_{\bar{x}} = s/\sqrt{l}$, which also turns out to be algebraic if one records the
linear sum ($\sum_{i=1}^{l} x_i$) and squared sum ($\sum_{i=1}^{l} x_i^2$). Because the terms
involved are either algebraic or distributive, the confidence interval computation is algebraic.
Actually, since both the mean and confidence interval are algebraic, at every cell, exactly three
values are sufficient to calculate them, all of which are either distributive or algebraic:
1. $l$
2. sum $= \sum_{i=1}^{l} x_i$
3. squared sum $= \sum_{i=1}^{l} x_i^2$
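To see why these three values suffice, note that the sample variance can be written as $s^2 = \left(\sum_{i=1}^{l} x_i^2 - \left(\sum_{i=1}^{l} x_i\right)^2 / l\right) / (l - 1)$. The sketch below, in Python, recovers the mean and the interval from the three stored values alone. The function name `mean_and_interval`, its parameter names, and the sample data are assumptions made for illustration, and the critical value is again supplied by the caller.

```python
import math


def mean_and_interval(l, total, sq_total, t_c):
    """Mean and confidence interval from l, the sum, and the squared sum only."""
    if l < 2:
        raise ValueError("need at least two samples to estimate s")
    mean = total / l
    # Sample variance via sum((x - mean)^2) = sq_total - total^2 / l.
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max((sq_total - total * total / l) / (l - 1), 0.0)
    std_err = math.sqrt(variance / l)   # s / sqrt(l)
    return mean, (mean - t_c * std_err, mean + t_c * std_err)


# Hypothetical cell contents; only the three aggregates are kept afterwards.
samples = [10.0, 12.0, 14.0, 16.0, 18.0]
l = len(samples)
total = sum(samples)
sq_total = sum(x * x for x in samples)

# 2.776: two-sided 95% critical t-value for 4 degrees of freedom (t-table).
mean, (low, high) = mean_and_interval(l, total, sq_total, t_c=2.776)
print(mean, low, high)
```

One caveat on this design: the one-pass variance formula can lose precision through cancellation when the mean is large relative to the spread, so a production implementation might prefer a numerically safer update scheme.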
There are many efficient techniques for computing algebraic and distributive measures
(Section 4.2.4). Therefore, any of the previously developed cubing algorithms can
be used to efficiently construct a sampling cube.
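The reason existing cubing algorithms apply is that the count, sum, and squared sum of a parent cell are simply the sums of the corresponding values in its child cells. The following Python sketch illustrates this roll-up; the `CellStats` class and the two child cells are hypothetical and do not correspond to any particular cubing algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellStats:
    """The three per-cell values: count l, sum, and squared sum."""
    l: int = 0
    total: float = 0.0
    sq_total: float = 0.0

    @classmethod
    def from_samples(cls, samples):
        samples = list(samples)
        return cls(len(samples), sum(samples), sum(x * x for x in samples))

    def __add__(self, other):
        # Roll-up: each of the three values is distributive, so a parent
        # cell is obtained by adding its children component by component.
        return CellStats(self.l + other.l,
                         self.total + other.total,
                         self.sq_total + other.sq_total)


# Two hypothetical child cells and their parent, computed without
# revisiting the raw samples.
child_a = CellStats.from_samples([10.0, 12.0])
child_b = CellStats.from_samples([14.0, 16.0, 18.0])
parent = child_a + child_b

# Same result as aggregating the raw samples directly.
direct = CellStats.from_samples([10.0, 12.0, 14.0, 16.0, 18.0])
print(parent == direct)
```

Because the merge step needs only the stored triples, the mean and confidence interval of any aggregated cell can be derived after roll-up without access to the underlying samples.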
Now that we have established that sampling cubes can be computed efficiently, our
next step is to find a way of boosting the confidence of results obtained for queries on
sample data.
Query Processing: Boosting Confidences for Small Samples
A query posed against a data cube can be either a point query or a range query. Without
loss of generality, consider the case of a point query. Here, it corresponds to a cell
in sampling cube $C_R$. The goal is to provide an accurate point estimate for the samples
in that cell. Because the cube also reports the confidence interval associated with the
sample mean, there is some measure of “reliability” to the returned answer. If the
confidence interval is small, the reliability is deemed good; however, if the interval is large,
the reliability is questionable.
“What can we do to boost the reliability of query answers?” Consider what affects the
confidence interval size. There are two main factors: the variance of the sample data and
the sample size. First, a rather large variance in the cell may indicate that the chosen cube