222 Chapter 5 Data Cube Technology 2011/6/1 3:19 Page 222 #36

value (i.e., the value to be predicted). Expanding within these dimensions will likely
increase the sample size and not shift the query’s answer. Consider an example of a 2-D
query specifying education = “college” and birth month = “July.” Let the cube measure
be average income. Intuitively, education is highly correlated with income, whereas
birth month is not. It would be harmful to expand the education dimension to include
values such as “graduate” or “high school,” because they are likely to alter the final
result. However, expanding the birth month dimension to include other month values
could be helpful, because it is unlikely to change the result but will increase the
sample size.
To mathematically measure the correlation of a dimension to the cube value, the
correlation between the dimension’s values and their aggregated cube measures is
computed. Pearson’s correlation coefficient for numeric data and the χ² correlation test for
nominal data are popularly used correlation measures, although many other measures,
such as covariance, can be used. (These measures were presented in Section 3.3.2.) A
dimension that is strongly correlated with the value to be predicted should not be a
candidate for expansion. Notice that since the correlation of a dimension with the cube
measure is independent of a particular query, it should be precomputed and stored with
the cube measure to facilitate efficient online analysis.
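As a rough illustration of this precomputation, the following Python sketch computes Pearson's correlation coefficient between each dimension's values and their aggregated measures, then lists the dimensions that remain candidates for expansion. The function names, the sample aggregates, and the 0.5 cutoff are all assumptions made for the example and are not taken from the text. Coding birth month as the numbers 1 to 12 is also a simplification; for truly nominal data the χ² test would be the more appropriate measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series shows no linear relationship
    return cov / sqrt(var_x * var_y)

def expandable_dimensions(dim_aggregates, threshold=0.5):
    """Return (candidates, correlations) for a set of dimensions.

    dim_aggregates maps a dimension name to (values, aggregated_measures),
    with one aggregated measure per dimension value. The threshold is an
    illustrative cutoff (an assumption), not a value given in the text.
    """
    correlations = {
        name: pearson(values, measures)
        for name, (values, measures) in dim_aggregates.items()
    }
    candidates = [name for name, r in correlations.items() if abs(r) < threshold]
    return candidates, correlations

# Made-up aggregates: education coded as an ordinal level, birth month as 1..12.
dim_aggregates = {
    "education": ([1, 2, 3, 4], [30_000, 42_000, 58_000, 75_000]),
    "birth_month": (list(range(1, 13)),
                    [50, 52, 49, 51, 50, 48, 52, 50, 51, 49, 50, 51]),
}
candidates, correlations = expandable_dimensions(dim_aggregates)
print(candidates)  # ['birth_month']
```

Because the correlations depend only on the cube and not on any query, a table like `correlations` could be computed once and stored alongside the cube measure.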
After selecting dimensions for expansion, the next question is “Which values within
these dimensions should the expansion use?” This relies on the semantic knowledge of
the dimensions in question. The goal should be to select semantically similar values to
minimize the risk of altering the final result. For dimensions with numeric data (such
as age) or ordinal (ranked) data (such as education), the values have a definite order,
so similarity is clear: we can select values that are close to the instantiated query
value. For nominal data of a dimension that is organized in a multilevel
hierarchy in a data cube (e.g., location), we should select those values located in the
same branch of the tree (e.g., the same district or city).
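The two selection rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names, the `radius` parameter, and the toy district-to-city mapping are hypothetical, and the hierarchy is reduced to a single parent level for brevity.

```python
def nearby_values(ordered_values, query_value, radius=1):
    """Values within `radius` positions of query_value in an ordered dimension."""
    i = ordered_values.index(query_value)
    lo = max(0, i - radius)  # clamp so the window does not wrap around
    return ordered_values[lo:i + radius + 1]

def sibling_values(parent_of, query_value):
    """Values that share the query value's parent in a one-level hierarchy."""
    parent = parent_of[query_value]
    return sorted(v for v, p in parent_of.items() if p == parent)

# Ordered (numeric) dimension: expand age = 30 to its close neighbors.
ages = list(range(18, 66))
print(nearby_values(ages, 30, radius=2))    # [28, 29, 30, 31, 32]

# Nominal dimension with a hierarchy: districts mapped to their city (made up).
city_of = {
    "Downtown": "Springfield",
    "Eastside": "Springfield",
    "Harbor": "Shelbyville",
}
print(sibling_values(city_of, "Downtown"))  # ['Downtown', 'Eastside']
```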
By considering additional data during query expansion, we are aiming for a more
accurate and reliable answer. As mentioned before, strongly correlated dimensions are
precluded from expansion for this purpose. An additional strategy is to ensure that
new samples share the “same” cube measure value (e.g., mean income) as the existing
samples in the query cell. The two-sample t-test is a relatively simple statistical
method that can be used to determine whether two samples have the same mean (or
any other point estimate), where “same” means that they do not differ significantly. (It
is described in greater detail in Section 8.5.5 on model selection using statistical tests of
significance.)
The null hypothesis of the test is that the two samples have the same mean. The main
assumption is that both samples are normally distributed. The null hypothesis is
rejected if there is significant evidence that the two samples do not share the same
mean. Furthermore, the test takes a confidence level as an input, which allows the user
to control how strict or loose the query expansion will be.
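To make the check concrete, here is a minimal Python sketch of a two-sample t-test used as an expansion filter. It assumes the pooled-variance (Student's) form of the statistic, since the text does not say which variant is intended, and it takes the critical value as an input rather than computing it, because the standard library has no t distribution. The function names and sample data are invented for illustration; 2.306 is the two-sided critical value for 8 degrees of freedom at the 95% confidence level.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

def accept_expansion(cell_samples, new_samples, t_critical):
    """True if the two means do not differ significantly.

    t_critical is the two-sided critical value for the chosen confidence
    level and degrees of freedom, supplied by the caller (an assumption
    of this sketch). A larger value makes the expansion more permissive.
    """
    t, _ = pooled_t_statistic(cell_samples, new_samples)
    return abs(t) <= t_critical

# Made-up income samples (in thousands) for the query cell and two neighbors.
cell = [48, 52, 50, 51, 49]
similar = [50, 49, 53, 51, 47]
different = [70, 72, 68, 71, 69]

print(accept_expansion(cell, similar, t_critical=2.306))    # True
print(accept_expansion(cell, different, t_critical=2.306))  # False
```

Passing a critical value for a higher confidence level makes the null hypothesis harder to reject, so more neighboring cells are accepted; a lower confidence level makes the expansion stricter.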
Example 5.14 shows how the intracuboid expansion strategies just described can be
used to answer a query on sample data.
