
                                   5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology  223


                    Table 5.10 Sample Customer Survey Data
                                gender  age  education    occupation  income
                                female  23   college      teacher     $85,000
                                female  40   college      programmer  $50,000
                                female  31   college      programmer  $52,000
                                female  50   graduate     teacher     $90,000
                                female  62   graduate     CEO         $500,000
                                male    25   high school  programmer  $50,000
                                male    28   high school  CEO         $250,000
                                male    40   college      teacher     $80,000
                                male    50   college      programmer  $45,000
                                male    57   graduate     programmer  $80,000


                 Example 5.14 Intracuboid query expansion to answer a query on sample data. Consider a book
                               retailer trying to learn more about its customers’ annual income levels. In Table 5.10,
                               a sample of the survey data collected is shown.⁶ In the survey, customers are segmented
                               by four attributes, namely gender, age, education, and occupation.
                                 Let a query on customer income be “age = 25,” where the user specifies a 95%
                               confidence level. Suppose this returns an income value of $50,000 with a rather large
                               confidence interval.⁷ Suppose also that this confidence interval is larger than a preset
                               threshold and that the age dimension was found to have little correlation with income
                               in this data set, so that widening the age range is unlikely to shift the income estimate.
                               Therefore, intracuboid expansion starts within the age dimension. The
                               nearest cell is “age = 23,” which returns an income of $85,000. The two-sample t-test at
                               the 95% confidence level passes, so the query expands; it is now “age = {23,25}” with a
                               smaller confidence interval than initially. However, the interval is still larger than the
                               threshold, so expansion continues to the next nearest cell: “age = 28,” which returns an
                               income of $250,000. The two-sample t-test between this cell and the original query cell
                               fails; as a result, the cell is ignored. Next, “age = 31” is checked and it passes the test.
                                 The confidence interval of the three cells combined is now below the threshold and
                               the expansion finishes at “age = {23,25,31}.” The mean of the income values at these
                               three cells is (85,000 + 50,000 + 52,000)/3 = $62,333, which is returned as the query answer. It has
                               a smaller confidence interval, and thus is more reliable than the response of $50,000,
                               which would have been returned if intracuboid expansion had not been considered.
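
The expansion walk in Example 5.14 can be traced with a minimal Python sketch. It is an illustration under stated assumptions, not the book's implementation. The names `expand_intracuboid`, `half_width`, `PASSES_T_TEST`, and `T_CRIT_95` are hypothetical, and so is the threshold of $100,000, since the example gives no numeric threshold. Each cell here holds a single sample, so a two-sample t-test cannot actually be computed; the pass/fail outcomes are copied from the example text instead. Only the four age cells the example visits are included.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182}

# Income per age cell, taken from Table 5.10 (only the cells the example visits).
INCOME_BY_AGE = {23: 85_000, 25: 50_000, 28: 250_000, 31: 52_000}

# Assumption: t-test outcomes against the query cell, as stated in the example.
# They are hard-coded because one sample per cell is too few to run the test.
PASSES_T_TEST = {23: True, 28: False, 31: True}


def half_width(values):
    """Half-width of the 95% confidence interval for the mean of `values`."""
    n = len(values)
    if n < 2:
        return math.inf  # a single sample gives no usable interval
    return T_CRIT_95[n - 1] * statistics.stdev(values) / math.sqrt(n)


def expand_intracuboid(query_age, threshold):
    """Grow the query along the age dimension until the interval is small enough."""
    accepted = [query_age]
    # Visit the other cells from nearest to farthest in age.
    candidates = sorted(
        (age for age in INCOME_BY_AGE if age != query_age),
        key=lambda age: abs(age - query_age),
    )
    for age in candidates:
        incomes = [INCOME_BY_AGE[a] for a in accepted]
        if half_width(incomes) <= threshold:
            break  # interval already below the threshold: stop expanding
        if PASSES_T_TEST[age]:
            accepted.append(age)  # cell is compatible with the query cell
        # otherwise the cell is ignored and the next nearest one is tried
    incomes = [INCOME_BY_AGE[a] for a in accepted]
    return sorted(accepted), statistics.mean(incomes), half_width(incomes)


if __name__ == "__main__":
    ages, mean_income, hw = expand_intracuboid(query_age=25, threshold=100_000)
    print(ages, round(mean_income), round(hw))
```

With the hypothetical threshold above, the walk visits the cells in the order 23, 28, 31, skips 28, and stops at “age = {23,25,31},” matching the example. A different threshold would change where the expansion stops.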

                               Method 2. Intercuboid query expansion. In this case, the expansion occurs by looking
                               to a more general cell, as shown in Figure 5.15(b). For example, the cell in the 2-D cuboid


                               6 For the sake of illustration, ignore the fact that the sample size is too small to be statistically significant.
                               7 For the sake of the example, suppose this is true even though there is only one sample. In practice,
                               more points are needed to calculate a legitimate value.