                                   5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology  221


                               cell is poor for prediction. A better solution is probably to drill down on the query cell
                               to a more specific one (i.e., asking more specific queries). Second, a small sample size
                               can cause a large confidence interval. When there are very few samples, the corresponding
                               t_c is large because there are few degrees of freedom. This in turn could cause a large
                               confidence interval. Intuitively, this makes sense. Suppose one is trying to figure out the
                               average income of people in the United States. Just asking two or three people does not
                               give much confidence to the returned response.
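
The effect of sample size on the interval can be checked numerically. The sketch below is only an illustration: it assumes the usual t-based interval, mean ± t_c · s / sqrt(n), with a 95% confidence level. The income values, the helper name `confidence_interval`, and the small lookup table of critical values (standard two-sided 95% table entries) are assumptions introduced here, not taken from the text.

```python
import math
import statistics

# Two-sided 95% critical t-values keyed by degrees of freedom
# (standard table entries; only these sample sizes are supported here).
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262, 29: 2.045}


def confidence_interval(samples):
    """Return (mean, half_width) for the interval mean +/- t_c * s / sqrt(n)."""
    n = len(samples)
    t_c = T_CRIT_95[n - 1]  # degrees of freedom = n - 1
    std_err = statistics.stdev(samples) / math.sqrt(n)
    return statistics.mean(samples), t_c * std_err


# Hypothetical incomes (in thousands); the larger sample repeats the same values.
incomes_3 = [40, 55, 70]
incomes_30 = incomes_3 * 10

for data in (incomes_3, incomes_30):
    mean, half = confidence_interval(data)
    print(f"n={len(data):2d}  mean={mean:.1f}  half-width={half:.1f}")
```

With three samples, both the critical value and the standard error are large, so the half-width is several times wider than with thirty samples, even though the mean is the same.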
                                 The best way to solve this small sample size problem is to get more data. Fortunately,
                               there is usually an abundance of additional data available in the cube. The data do not
                               match the query cell exactly; however, we can consider data from cells that are “close
                               by.” There are two ways to incorporate such data to enhance the reliability of the query
                               answer: (1) intracuboid query expansion, where we consider nearby cells within the same
                               cuboid, and (2) intercuboid query expansion, where we consider more general versions
                               (from parent cuboids) of the query cell. Let’s see how this works, starting with intra-
                               cuboid query expansion.
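
As a rough sketch of the two options, consider a toy sample table with the dimensions age and occupation and the measure income. Everything in the code is hypothetical: the rows, the function names, and the choice to expand along age with a fixed radius are assumptions made for illustration, not the procedure the text goes on to describe.

```python
from collections import defaultdict

# Hypothetical sample rows: (age, occupation, income in thousands).
ROWS = [
    (29, "teacher", 42), (30, "teacher", 45), (30, "teacher", 47),
    (31, "teacher", 44), (45, "teacher", 60),
    (30, "engineer", 80), (31, "engineer", 85),
]


def cuboid(rows, dims):
    """Group income samples by the chosen dimensions (0 = age, 1 = occupation)."""
    cells = defaultdict(list)
    for row in rows:
        cells[tuple(row[d] for d in dims)].append(row[2])
    return cells


def intracuboid(rows, age, occupation, radius=1):
    """Pool cells of the age-occupation cuboid whose age is within `radius`."""
    cells = cuboid(rows, (0, 1))
    return [v for (a, o), vals in cells.items()
            if o == occupation and abs(a - age) <= radius
            for v in vals]


def intercuboid(rows, occupation):
    """Use the parent cell in the occupation cuboid (age is generalized away)."""
    return cuboid(rows, (1,))[(occupation,)]


query = cuboid(ROWS, (0, 1))[(30, "teacher")]
print("query cell:   ", query)
print("intracuboid:  ", intracuboid(ROWS, 30, "teacher"))
print("intercuboid:  ", intercuboid(ROWS, "teacher"))
```

In this toy data the query cell (30, teacher) holds only two samples. Intracuboid expansion adds the neighboring ages 29 and 31, while intercuboid expansion takes every teacher regardless of age, including the 45-year-old whose income differs noticeably. That difference is why expansion has to be done carefully so the query's semantics are not changed.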

                               Method 1. Intracuboid query expansion. Here, we expand the sample size by including
                               nearby cells in the same cuboid as the queried cell, as shown in Figure 5.15(a). We just
                               have to be careful that the new samples serve to increase the confidence in the answer
                               without changing the query’s semantics.
                                 So, the first question is “Which dimensions should be expanded?” The best candidates
                               should be the dimensions that are uncorrelated or weakly correlated with the measure



                               [Figure 5.15 panels: (a) age-occupation cuboid; (b) age cuboid, occupation cuboid, age-occupation cuboid]


                    Figure 5.15 Query expansion within sampling cube: Given small data samples, both methods use strate-
                               gies to boost the reliability of query answers by considering additional data cell values.
                               (a) Intracuboid expansion considers nearby cells in the same cuboid as the queried cell.
                               (b) Intercuboid expansion considers more general cells from parent cuboids.