
Chapter 5 Data Cube Technology

age-occupation can use its parent in either of the 1-D cuboids, age or occupation. Think
of intercuboid expansion as just an extreme case of intracuboid expansion, where all the
cells within a dimension are used in the expansion. This essentially sets the dimension
to ∗ and thus generalizes to a higher-level cuboid.
A k-dimensional cell has k direct parents in the cuboid lattice, where each parent is
(k − 1)-dimensional. There are many more ancestor cells in the data cube (e.g., if
multiple dimensions are rolled up simultaneously). However, we choose only one parent
here to make the search space tractable and to limit the change in the query’s semantics.
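
To make the parent relationship concrete, the following Python sketch enumerates the direct parents of a cell. The dictionary representation of a cell and the function name direct_parents are assumptions made for illustration; they are not taken from the sampling cube method itself.

```python
def direct_parents(cell):
    """Yield the direct parents of a cell.

    `cell` maps each dimension name to a value, with "*" marking a
    dimension that is already aggregated. A cell with k non-* dimensions
    has exactly k direct parents, each obtained by setting one of those
    dimensions to "*".
    """
    for dim, value in cell.items():
        if value != "*":
            parent = dict(cell)   # copy so the query cell is left unchanged
            parent[dim] = "*"     # roll up this one dimension
            yield parent


# Hypothetical 2-D query cell: two non-* dimensions, hence two parents.
query_cell = {"occupation": "teacher", "gender": "male"}
parents = list(direct_parents(query_cell))
```

Rolling up two or more dimensions at once would reach the remaining ancestors, which is exactly the larger search space that the one-parent restriction avoids.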
As with intracuboid query expansion, correlated dimensions are not allowed in
intercuboid expansions. Within the uncorrelated dimensions, the two-sample t-test can be
performed to check that the sample means of the parent and the query cell do not differ
significantly. If multiple parent cells pass the test, the test’s confidence level can be
adjusted progressively higher until only one passes. Alternatively, multiple parent cells
can be used to boost the confidence simultaneously. The choice is application dependent.
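
The parent test can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Welch's unequal-variance form of the two-sample t-test (the text does not say which variant is intended), it requires at least two samples in every cell, and it compares the statistic against a critical value t_crit that the caller looks up for the desired confidence level. The function names and the income figures are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on samples a and b.

    Both samples need at least two values, and at least one of them
    must have nonzero variance (otherwise the statistic is undefined).
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def passing_parents(query_sample, parent_samples, t_crit):
    """Names of parents whose mean does not differ significantly."""
    passed = []
    for name, sample in parent_samples.items():
        t, _df = welch_t(query_sample, sample)
        if abs(t) <= t_crit:    # fail to reject "the means are equal"
            passed.append(name)
    return passed


# Hypothetical incomes in thousands of dollars (not taken from the text).
query = [78, 82, 86]
parents = {
    "occupation = teacher": [78, 82, 86, 84, 90],
    "gender = male": [78, 82, 86, 120, 130, 125, 115],
}
accepted = passing_parents(query, parents, t_crit=2.0)
```

Returning every passing parent leaves the application free to either narrow the list to a single parent or combine several of them.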
Example 5.15 Intercuboid expansion to answer a query on sample data. Given the input relation in
Table 5.10, let the query on income be “occupation = teacher ∧ gender = male.” There is
only one sample in Table 5.10 that matches the query, and it has an income of $80,000.
Suppose the corresponding confidence interval is larger than a preset threshold. We use
intercuboid expansion to find a more reliable answer. There are two parent cells in the
data cube: “gender = male” and “occupation = teacher.” Moving up to “gender =
male” (and thus setting occupation to ∗) gives a mean income of $101,000. A two-sample
t-test reveals that this parent’s sample mean differs significantly from that of the original
query cell, so this parent is ignored. Next, “occupation = teacher” is considered. It has a mean
income of $85,000 and passes the two-sample t-test. As a result, the query is expanded
to “occupation = teacher” and an income value of $85,000 is returned with acceptable
reliability.
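
The trigger for expansion in this example, a confidence interval wider than a preset threshold, can be sketched in a few lines. The sketch assumes the usual interval for a sample mean, x̄ ± t_c · s/√n, and treats a cell with fewer than two samples as always needing expansion because its standard deviation is undefined. The name needs_expansion, the parameter max_half_width, and the critical value t_crit supplied by the caller are all hypothetical.

```python
import math
from statistics import stdev


def needs_expansion(incomes, t_crit, max_half_width):
    """Return True if the cell's confidence interval is too wide to trust.

    `t_crit` is the t critical value for the chosen confidence level and
    len(incomes) - 1 degrees of freedom; `max_half_width` is the preset
    threshold on the interval's half-width.
    """
    if len(incomes) < 2:
        # A single sample gives no estimate of spread, so no finite
        # interval can be formed and the cell is treated as unreliable.
        return True
    std_err = stdev(incomes) / math.sqrt(len(incomes))
    return t_crit * std_err > max_half_width
```

Under these assumptions, a cell holding only the single $80,000 sample always triggers expansion, whatever threshold is chosen.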

“How can we determine which method to choose, intracuboid expansion or intercuboid
expansion?” This is difficult to answer without knowing the data and the application. One
strategy for choosing between the two is to consider how much change in the query’s
semantics can be tolerated. This depends on the specific dimensions chosen in the query.
For instance, the user might tolerate a bigger change in semantics for the age dimension
than for education. The difference in tolerance could be so large that the user is willing to set
age to ∗ (i.e., intercuboid expansion) rather than letting education change at all. Domain
knowledge is helpful here.
So far, our discussion has focused only on full materialization of the sampling cube.
In many real-world problems, full materialization is impossible, especially for
high-dimensional cases. Real-world survey data, for example, can easily contain over 50
variables (i.e., dimensions). The sampling cube size grows exponentially with the number of
dimensions. To handle high-dimensional data, a sampling cube method called Sampling
Cube Shell was developed. It integrates the Frag-Shell method of Section 5.2.4 with the
query expansion approach. The shell computes only a subset of the full sampling cube.
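
A short calculation shows the scale of the problem. The sketch assumes dimensions without concept hierarchies, so each dimension is either kept or set to ∗ and a cube over n dimensions has 2^n cuboids. The function name num_cuboids is hypothetical.

```python
def num_cuboids(n_dims):
    """Number of cuboids in a full cube over n_dims dimensions,
    assuming no concept hierarchies (each dimension is kept or set to *)."""
    return 2 ** n_dims


# Cuboid counts for a few dimensionalities, up to the 50-variable survey case.
for n in (5, 10, 20, 50):
    print(n, num_cuboids(n))
```

At 50 dimensions the count exceeds 10^15 cuboids before any cells are stored, which is why only a subset of the cube is computed.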
