Page 261 -
P. 261

12-ch05-187-242-9780123814791
                         HAN

          224   Chapter 5 Data Cube Technology              2011/6/1  3:19 Page 224  #38



                         age-occupation can use its parent in either of the 1-D cuboids, age or occupation. Think
                         of intercuboid expansion as just an extreme case of intracuboid expansion, where all the
                         cells within a dimension are used in the expansion. This essentially sets the dimension
                         to ∗ and thus generalizes to a higher-level cuboid.
                           A k-dimensional cell has k direct parents in the cuboid lattice, where each parent is
                         (k − 1)-dimensional. There are many more ancestor cells in the data cube (e.g., if mul-
                         tiple dimensions are rolled up simultaneously). However, we choose only one parent
                         here to make the search space tractable and to limit the change in the query’s semantics.
                         As with intracuboid query expansion, correlated dimensions are not allowed in inter-
                         cuboid expansions. Within the uncorrelated dimensions, the two-sample t-test can be
                         performed to confirm that the parent and the query cell share the same sample mean. If
                         multiple parent cells pass the test, the test’s confidence level can be adjusted progressively
                         higher until only one passes. Alternatively, multiple parent cells can be used to boost the
                         confidence simultaneously. The choice is application dependent.
           Example 5.15 Intercuboid expansion to answer a query on sample data. Given the input relation in
                         Table 5.10, let the query on income be “occupation = teacher ∧ gender = male.” There is
                         only one sample in Table 5.10 that matches the query, and it has an income of $80,000.
                         Suppose the corresponding confidence interval is larger than a preset threshold. We use
                         intercuboid expansion to find a more reliable answer. There are two parent cells in the
                         data cube: “gender = male” and “occupation = teacher.” By moving up to “gender =
                         male” (and thus setting occupation to ∗), the mean income is $101,000. A two sample
                         t-test reveals that this parent’s sample mean differs significantly from that of the original
                         query cell, so it is ignored. Next, “occupation = teacher” is considered. It has a mean
                         income of $85,000 and passes the two-sample t-test. As a result, the query is expanded
                         to “occupation = teacher” and an income value of $85,000 is returned with acceptable
                         reliability.

                           “How can we determine which method to choose—intracuboid expansion or intercuboid
                         expansion?” This is difficult to answer without knowing the data and the application. A
                         strategy for choosing between the two is to consider what the tolerance is for change
                         in the query’s semantics. This depends on the specific dimensions chosen in the query.
                         For instance, the user might tolerate a bigger change in semantics for the age dimension
                         than education. The difference in tolerance could be so large that the user is willing to set
                         age to ∗ (i.e., intercuboid expansion) rather than letting education change at all. Domain
                         knowledge is helpful here.
                           So far, our discussion has only focused on full materialization of the sampling cube.
                         In many real-world problems, this is often impossible, especially for high-dimensional
                         cases. Real-world survey data, for example, can easily contain over 50 variables (i.e.,
                         dimensions). The sampling cube size would grow exponentially with the number of
                         dimensions. To handle high-dimensional data, a sampling cube method called Sampling
                         Cube Shell was developed. It integrates the Frag-Shell method of Section 5.2.4 with the
                         query expansion approach. The shell computes only a subset of the full sampling cube.
   256   257   258   259   260   261   262   263   264   265   266