
          222   Chapter 5 Data Cube Technology



                         value (i.e., the value to be predicted). Expanding within these dimensions will likely
                         increase the sample size and not shift the query’s answer. Consider an example of a 2-D
                         query specifying education = “college” and birth month = “July.” Let the cube measure
                         be average income. Intuitively, education has a high correlation to income while birth
                         month does not. It would be harmful to expand the education dimension to include val-
                         ues such as “graduate” or “high school.” They are likely to alter the final result. However,
                         expansion in the birth month dimension to include other month values could be helpful,
                         because it is unlikely to change the result but will increase the sample size.
                           To mathematically measure the correlation of a dimension to the cube value, the
                         correlation between the dimension’s values and their aggregated cube measures is
                         computed. Pearson’s correlation coefficient for numeric data and the χ² correlation test for
                         nominal data are popularly used correlation measures, although many other measures,
                         such as covariance, can be used. (These measures were presented in Section 3.3.2.) A
                         dimension that is strongly correlated with the value to be predicted should not be a
                         candidate for expansion. Notice that since the correlation of a dimension with the cube
                         measure is independent of a particular query, it should be precomputed and stored with
                         the cube measure to facilitate efficient online analysis.
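
The precomputation described above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the per-value average incomes are made-up numbers, education is encoded with assumed ordinal codes so that Pearson's coefficient applies, and the 0.5 cutoff in `expandable` is a hypothetical threshold that the text does not specify.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here: a constant series carries no signal
    return cov / sqrt(var_x * var_y)


def expandable(r, threshold=0.5):
    """A dimension is a candidate for expansion only if it is weakly correlated.

    The threshold is an assumed, illustrative cutoff.
    """
    return abs(r) < threshold


# Hypothetical aggregated cube measures (average income per dimension value).
education_code = [1, 2, 3, 4]  # assumed ordinal codes, lowest to highest level
avg_income_by_education = [28000, 35000, 52000, 68000]

birth_month = list(range(1, 13))
avg_income_by_month = [45100, 44900, 45300, 45000, 44800, 45200,
                       45000, 45100, 44900, 45200, 45000, 44950]

r_education = pearson(education_code, avg_income_by_education)
r_birth_month = pearson(birth_month, avg_income_by_month)

print("education expandable:", expandable(r_education))
print("birth month expandable:", expandable(r_birth_month))
```

Because the coefficients do not depend on any particular query, they could be computed once per dimension and stored alongside the cube, as the paragraph suggests.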
                           After selecting dimensions for expansion, the next question is “Which values within
                         these dimensions should the expansion use?” This relies on the semantic knowledge of
                         the dimensions in question. The goal should be to select semantically similar values to
                         minimize the risk of altering the final result. Consider the age dimension: its values
                         have a definite numeric order, so similarity between values is clear. More generally,
                         dimensions with numeric or ordinal (ranked) data (like education) have a definite
                         ordering among data values, so we can select values that are close to the instantiated
                         query value. For nominal data of a dimension that is organized in a multilevel
                         hierarchy in a data cube (e.g., location), we should select those values located in the
                         same branch of the tree (e.g., the same district or city).
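
The two value-selection rules above can be sketched as follows. The function names, the `radius` parameter, and the small city-to-province hierarchy are all hypothetical, introduced only for illustration; a real cube would draw the ordering and the hierarchy from its own dimension metadata.

```python
def nearby_values(ordered_values, query_value, radius=1):
    """Values within `radius` positions of the query value in an ordered dimension."""
    i = ordered_values.index(query_value)
    lo = max(0, i - radius)
    return ordered_values[lo:i + radius + 1]


def same_branch_values(parent_of, query_value):
    """Values that share the query value's parent in a one-level hierarchy map."""
    parent = parent_of[query_value]
    return sorted(v for v, p in parent_of.items() if p == parent)


# Numeric dimension: expand age = 30 to its two nearest neighbors on each side.
ages = list(range(18, 66))
print(nearby_values(ages, 30, radius=2))  # [28, 29, 30, 31, 32]

# Nominal dimension with a hierarchy (hypothetical city -> province map).
city_to_province = {
    "Burnaby": "British Columbia",
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
}
print(same_branch_values(city_to_province, "Vancouver"))
# ['Burnaby', 'Vancouver', 'Victoria']
```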
                           By considering additional data during query expansion, we are aiming for a more
                         accurate and reliable answer. As mentioned before, strongly correlated dimensions are
                         precluded from expansion for this purpose. An additional strategy is to ensure that
                         new samples share the “same” cube measure value (e.g., mean income) as the exist-
                         ing samples in the query cell. The two-sample t-test is a relatively simple statistical
                         method that can be used to determine whether two samples have the same mean (or
                         any other point estimate), where “same” means that they do not differ significantly. (It
                         is described in greater detail in Section 8.5.5 on model selection using statistical tests of
                         significance.)
                           The test's null hypothesis is that the two samples have the same mean. Its main
                         assumptions are that the samples are independent and that both are approximately
                         normally distributed. The null hypothesis is rejected if there is significant evidence
                         that the two samples do not share the same mean, in which case the candidate samples
                         should not be added to the query cell. Furthermore, the test takes a confidence level
                         as an input. This allows the user to control how strict or loose the query expansion
                         will be.
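
As a rough sketch of how such a test could gate the expansion, the code below computes the two-sample t statistic in Welch's form, which does not assume equal variances; this is a swapped-in variant chosen for simplicity, not necessarily the form the text intends. The income samples (in thousands) are made up, and `t_critical` is an assumed input that the user would look up in a t-table for the chosen confidence level and the returned degrees of freedom; the value 2.5 used here is only a placeholder.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    na, nb = len(a), len(b)
    va = variance(a) / na  # squared standard error of mean(a)
    vb = variance(b) / nb  # squared standard error of mean(b)
    if va + vb == 0:
        raise ValueError("both samples are constant; t is undefined")
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def accept_expansion(cell_samples, candidate_samples, t_critical):
    """Accept the candidate cell if its mean does not differ significantly.

    `t_critical` is supplied by the caller (assumed to come from a t-table);
    a larger value makes the expansion looser, a smaller one stricter.
    """
    t, _ = welch_t(cell_samples, candidate_samples)
    return abs(t) <= t_critical


# Hypothetical income samples, in thousands.
query_cell = [52, 48, 50, 55, 45]
similar_cell = [51, 49, 53, 47, 50, 52]
different_cell = [80, 85, 78, 90, 82]

print(accept_expansion(query_cell, similar_cell, t_critical=2.5))
print(accept_expansion(query_cell, different_cell, t_critical=2.5))
```

Only candidate cells that pass the check would contribute their samples to the query cell; the others are left out so that they do not shift the answer.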
                           Example 5.14 shows how the intracuboid expansion strategies just described can be
                         used to answer a query on sample data.