220 Chapter 5 Data Cube Technology 2011/6/1 3:19 Page 220 #34

that the standard deviation of the population is unknown, the sample standard deviation
of x is denoted by s. Given a desired confidence level, the confidence interval for $\bar{x}$ is

$$\bar{x} \pm t_c \, \hat{\sigma}_{\bar{x}}, \tag{5.1}$$

where $t_c$ is the critical t-value associated with the confidence level and
$\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$ is the estimated standard error of the mean. To find the
appropriate $t_c$, specify the desired confidence level (e.g., 95%) and also the degrees of
freedom, which is just $l - 1$.
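As a quick illustration of Eq. (5.1), consider a worked instance with hypothetical numbers (they are not taken from the text): a cell holding $l = 25$ samples with $\bar{x} = 50$ and $s = 10$, at a 95% confidence level. The critical value $t_c \approx 2.064$ is the standard two-sided t-table entry for 24 degrees of freedom.

```latex
\begin{aligned}
\hat{\sigma}_{\bar{x}} &= \frac{s}{\sqrt{l}} = \frac{10}{\sqrt{25}} = 2, \\
t_c &\approx 2.064 \quad (\text{95\% confidence},\ l - 1 = 24 \text{ degrees of freedom}), \\
\bar{x} \pm t_c \, \hat{\sigma}_{\bar{x}} &= 50 \pm 2.064 \times 2 = 50 \pm 4.128,
\end{aligned}
```

so the interval under these assumed values runs from about 45.87 to 54.13.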
The important thing to note is that the computation of a confidence interval is
algebraic. Let's look at the three terms involved in Eq. (5.1). The first is
the mean of the sample set, $\bar{x}$, which is algebraic; the second is the critical t-value, which
is calculated by a lookup, and with respect to x, it depends on l, a distributive measure;
and the third is $\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$, which also turns out to be algebraic if one records the linear
sum ($\sum_{i=1}^{l} x_i$) and squared sum ($\sum_{i=1}^{l} x_i^2$). Because the terms involved are either
algebraic or distributive, the confidence interval computation is algebraic. Actually, since
both the mean and confidence interval are algebraic, at every cell, exactly three values
are sufficient to calculate them, all of which are either distributive or algebraic:
1. $l$
2. sum $= \sum_{i=1}^{l} x_i$
3. squared sum $= \sum_{i=1}^{l} x_i^2$
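To make the "three values are sufficient" claim concrete, here is a minimal Python sketch that recovers the mean and the interval of Eq. (5.1) from l, the linear sum, and the squared sum alone. The function name `confidence_interval`, the sample data, and the choice of passing `t_c` in as a table value are illustrative assumptions; the sketch also assumes s is the sample standard deviation with l − 1 in the denominator.

```python
import math

def confidence_interval(l, linear_sum, squared_sum, t_c):
    """Return (mean, lower, upper) for one cell from its three stored values.

    t_c is the critical t-value for l - 1 degrees of freedom; it is supplied
    by the caller (e.g., from a t-table), not computed here.
    """
    if l < 2:
        raise ValueError("need at least two samples to estimate s")
    mean = linear_sum / l
    # Sample variance from the two sums (assumed l - 1 denominator):
    # s^2 = (sum of x_i^2 - (sum of x_i)^2 / l) / (l - 1)
    variance = (squared_sum - linear_sum ** 2 / l) / (l - 1)
    s = math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off
    std_err = s / math.sqrt(l)         # estimated standard error of the mean
    margin = t_c * std_err
    return mean, mean - margin, mean + margin

# Hypothetical cell contents; only the three summaries are kept afterwards.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
l = len(samples)
linear_sum = sum(samples)
squared_sum = sum(x * x for x in samples)

T_C_95_DF7 = 2.365  # two-sided 95% t-table value for 7 degrees of freedom
mean, lower, upper = confidence_interval(l, linear_sum, squared_sum, T_C_95_DF7)
print(f"mean = {mean:.3f}, interval = [{lower:.3f}, {upper:.3f}]")
```

Note that the raw samples are never consulted inside `confidence_interval`, which is the point: a cell only has to store the three summaries.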
There are many efficient techniques for computing algebraic and distributive measures
(Section 4.2.4). Therefore, any of the previously developed cubing algorithms can
be used to efficiently construct a sampling cube.
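One reason existing cubing algorithms carry over is that the three per-cell values combine by plain addition when cells are aggregated. The following Python sketch illustrates this under assumed names (`CellStats`, `cell_from_samples`, and `merge` are hypothetical helpers, not part of any particular cubing algorithm): rolling up two cells gives the same summaries as recomputing from the pooled samples.

```python
from collections import namedtuple

# The three values stored at each cell of the sampling cube.
CellStats = namedtuple("CellStats", ["l", "linear_sum", "squared_sum"])

def cell_from_samples(samples):
    """Summarize raw samples into the three stored values."""
    return CellStats(len(samples), sum(samples), sum(x * x for x in samples))

def merge(a, b):
    """Roll up two disjoint cells: each stored value is simply added."""
    return CellStats(a.l + b.l,
                     a.linear_sum + b.linear_sum,
                     a.squared_sum + b.squared_sum)

# Hypothetical samples falling into two child cells of one parent cell.
child_a = [3.0, 5.0, 4.0]
child_b = [8.0, 6.0]

rolled_up = merge(cell_from_samples(child_a), cell_from_samples(child_b))
recomputed = cell_from_samples(child_a + child_b)
print(rolled_up == recomputed)
```

Because the parent cell never needs the raw samples of its children, the mean and confidence interval can be derived at any level of aggregation from these merged values.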
Now that we have established that sampling cubes can be computed efficiently, our
next step is to find a way of boosting the confidence of results obtained for queries on
sample data.

Query Processing: Boosting Confidences for Small Samples
A query posed against a data cube can be either a point query or a range query. Without
loss of generality, consider the case of a point query. Here, it corresponds to a cell
in sampling cube $C^R$. The goal is to provide an accurate point estimate for the samples
in that cell. Because the cube also reports the confidence interval associated with the
sample mean, there is some measure of “reliability” to the returned answer. If the
confidence interval is small, the reliability is deemed good; however, if the interval is large,
the reliability is questionable.
“What can we do to boost the reliability of query answers?” Consider what affects the
confidence interval size. There are two main factors: the variance of the sample data and
the sample size. First, a rather large variance in the cell may indicate that the chosen cube
