5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology

The resulting data are called sample data. Data are often sampled to save on costs,
manpower, time, and materials. In many applications, the collection of the entire data
population of interest is unrealistic. In the study of TV ratings or pre-election polls, for
example, it is impossible to gather the opinion of everyone in the population. Most
published ratings or polls rely on a data sample for analysis. The results are extrapolated
for the entire population, and associated with certain statistical measures such as a
confidence interval. The confidence interval tells us how reliable a result is. Statistical
surveys based on sampling are a common tool in many fields like politics, healthcare,
market research, and social and natural sciences.
“How effective is OLAP on sample data?” OLAP traditionally has the full data population
on hand, yet with sample data, we have only a small subset. If we try to apply
traditional OLAP tools to sample data, we encounter two challenges. First, sample data
are often sparse in the multidimensional sense. When a user drills down on the data, it
is easy to reach a point with very few or no samples even when the overall sample size
is large. Traditional OLAP simply uses whatever data are available to compute a query
answer. Extrapolating such an answer to a population from a small sample could
be misleading: a single outlier or a slight bias in the sampling can distort the answer
significantly. Second, with sample data, statistical methods are used to provide a measure
of reliability (e.g., a confidence interval) to indicate the quality of the query answer as it
pertains to the population. Traditional OLAP is not equipped with such tools.
A sampling cube framework was introduced to tackle each of the preceding
challenges.
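
The sparsity problem can be illustrated with a small sketch. The survey data, dimension names, and values below are hypothetical and generated at random; the point is only that the number of samples in a cell shrinks with each drill-down step, even though the overall sample is not tiny.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

AGE_GROUPS = ["18-29", "30-44", "45-64", "65+"]
REGIONS = ["north", "south", "east", "west"]
EDUCATION = ["high school", "college", "graduate"]

# Hypothetical sample of 200 survey respondents (not real data).
sample = [
    {
        "age_group": random.choice(AGE_GROUPS),
        "region": random.choice(REGIONS),
        "education": random.choice(EDUCATION),
        "income": random.gauss(50_000, 15_000),
    }
    for _ in range(200)
]


def select(rows, **conditions):
    """Return the rows that match every dimension=value condition."""
    return [r for r in rows if all(r[d] == v for d, v in conditions.items())]


# Drill down one dimension at a time and count the samples left in the cell.
steps = [
    {},
    {"age_group": "30-44"},
    {"age_group": "30-44", "region": "west"},
    {"age_group": "30-44", "region": "west", "education": "graduate"},
]
counts = [len(select(sample, **cond)) for cond in steps]

for cond, n in zip(steps, counts):
    print(f"{n:4d} samples for {cond or 'the apex (no conditions)'}")
```

With 4 x 4 x 3 = 48 possible cells, 200 samples leave only a handful per cell on average, so a mean computed at the deepest level rests on very little evidence.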

Sampling Cube Framework

The sampling cube is a data cube structure that stores the sample data and their
multidimensional aggregates. It supports OLAP on sample data. It calculates confidence
intervals as a quality measure for any multidimensional query. Given a sample data relation
(i.e., base cuboid) R, the sampling cube C_R typically computes the sample mean, sample
standard deviation, and other task-specific measures.
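
As a rough illustration of what such a structure might hold, the sketch below builds every cuboid of a tiny sample relation and stores the count, sample mean, and sample standard deviation per cell. The relation `R`, the dimension names, and the function `build_sampling_cube` are assumptions made for this example, not part of the framework's definition, and the full materialization shown here ignores any efficiency concerns.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, stdev

# Hypothetical sample relation R: (age_group, region, income in thousands).
DIMENSIONS = ("age_group", "region")
R = [
    ("young", "east", 30.0),
    ("young", "east", 34.0),
    ("young", "west", 40.0),
    ("old", "east", 50.0),
    ("old", "west", 60.0),
    ("old", "west", 62.0),
]


def build_sampling_cube(rows, n_dims):
    """Map each cuboid (a tuple of dimension indexes) to
    {cell: (count, sample mean, sample standard deviation)}."""
    cube = {}
    for k in range(n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cells = defaultdict(list)
            for row in rows:
                # The measure is assumed to be the last field of each row.
                cells[tuple(row[d] for d in dims)].append(row[-1])
            cube[dims] = {
                # The standard deviation is undefined for a single sample.
                cell: (len(v), mean(v), stdev(v) if len(v) > 1 else None)
                for cell, v in cells.items()
            }
    return cube


cube = build_sampling_cube(R, len(DIMENSIONS))

print(cube[()][()])                      # apex cuboid: all six samples
print(cube[(0,)][("young",)])            # one-dimensional cell
print(cube[(0, 1)][("young", "west")])   # base cell with a single sample
```

Keeping the count next to the mean and standard deviation is a deliberate choice in this sketch: all three are needed later to judge how much a cell's estimate can be trusted.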
In statistics, a confidence interval is used to indicate the reliability of an estimate.
Suppose we want to estimate the mean age of all viewers of a given TV show. We have
sample data (a subset) of this data population. Let’s say our sample mean is 35 years. This
becomes our estimate for the entire population of viewers as well, but how confident can
we be that 35 is also the mean of the true population? It is unlikely that the sample mean
will be exactly equal to the true population mean because of sampling error. Therefore,
we need to qualify our estimate in some way to indicate the general magnitude of this
error. This is typically done by computing a confidence interval, which is an estimated
value range with a given high probability of covering the true population value. A
confidence interval for our example could be “the actual mean will not vary by more than
+/− two standard deviations 95% of the time.” (Recall that the standard deviation is just a
number, which can be computed as shown in Section 2.2.2.) A confidence interval is always
qualified by a particular confidence level. In our example, it is 95%.
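
The TV-viewer example can be made concrete with a minimal sketch. It assumes a normal approximation: the interval is the sample mean plus or minus a critical value times the estimated standard error of the mean. That choice, the function name `confidence_interval`, and the list of ages are assumptions made for illustration; for small samples a Student's t critical value would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, level=0.95):
    """Return (low, high) for the population mean, using a normal approximation."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples to estimate the spread")
    sample_mean = mean(samples)
    # Estimated standard error of the mean: sample std / sqrt(sample size).
    std_error = stdev(samples) / sqrt(count)
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return sample_mean - z * std_error, sample_mean + z * std_error


# Hypothetical ages of ten sampled viewers; their mean is 35.
ages = [29, 41, 33, 38, 35, 31, 36, 37, 34, 36]

low, high = confidence_interval(ages, level=0.95)
print(f"sample mean = {mean(ages)}, 95% interval = ({low:.2f}, {high:.2f})")
```

Raising the confidence level widens the interval, and adding samples narrows it, which is why a cell with very few samples yields an estimate of little practical use.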
The confidence interval is calculated as follows. Let x be a set of samples. The mean of
the samples is denoted by x̄, and the number of samples in x is denoted by l. Assuming
