5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology 223
Table 5.10 Sample Customer Survey Data
gender  age  education    occupation  income
female  23   college      teacher     $85,000
female  40   college      programmer  $50,000
female  31   college      programmer  $52,000
female  50   graduate     teacher     $90,000
female  62   graduate     CEO         $500,000
male    25   high school  programmer  $50,000
male    28   high school  CEO         $250,000
male    40   college      teacher     $80,000
male    50   college      programmer  $45,000
male    57   graduate     programmer  $80,000
Example 5.14 Intracuboid query expansion to answer a query on sample data. Consider a book
retailer trying to learn more about its customers’ annual income levels. In Table 5.10 (see footnote 6),
a sample of the survey data collected is shown. In the survey, customers are segmented
by four attributes, namely gender, age, education, and occupation.
Let a query on customer income be “age = 25,” where the user specifies a 95%
confidence level. Suppose this returns an income value of $50,000 with a rather large
confidence interval (see footnote 7). Suppose also that this confidence interval is larger than a preset
threshold and that the age dimension was found to have little correlation with income
in this data set. Therefore, intracuboid expansion starts within the age dimension. The
nearest cell is “age = 23,” which returns an income of $85,000. The two-sample t-test at
the 95% confidence level passes so the query expands; it is now “age = {23,25}” with a
smaller confidence interval than initially. However, it is still larger than the threshold,
so expansion continues to the next nearest cell: “age = 28,” which returns an income of
$250,000. The two-sample t-test between this cell and the original query cell fails; as a
result, this cell is ignored. Next, “age = 31” is checked and it passes the test.
The confidence interval of the three cells combined is now below the threshold and
the expansion finishes at “age = {23,25,31}.” The mean of the income values at these
three cells is (85,000 + 50,000 + 52,000)/3 = $62,333, which is returned as the query answer. It has
a smaller confidence interval, and thus is more reliable than the response of $50,000,
which would have been returned if intracuboid expansion had not been considered.
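
The expansion steps of Example 5.14 can be traced with a minimal Python sketch. It is an illustration under stated assumptions, not the book's algorithm. Because each age cell in Table 5.10 holds a single sample, a real two-sample t-test cannot be computed, so the hypothetical helper `similar` substitutes a fixed income tolerance of $50,000 for it. The stopping threshold of $50,000 on the confidence-interval half-width and the names `ci_half_width` and `expand_within_age` are likewise assumptions made for the example. The confidence interval is the usual Student's t interval for a mean, using tabulated 95% critical values.

```python
import math
import statistics

# (age, income) pairs taken from Table 5.10
SURVEY = [(23, 85_000), (40, 50_000), (31, 52_000), (50, 90_000), (62, 500_000),
          (25, 50_000), (28, 250_000), (40, 80_000), (50, 45_000), (57, 80_000)]

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
         6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def ci_half_width(values):
    """Half-width of the 95% confidence interval for the mean of `values`."""
    n = len(values)
    if n < 2:
        return math.inf  # a single sample gives no usable interval
    return T_975[n - 1] * statistics.stdev(values) / math.sqrt(n)


def similar(candidate, query, tolerance=50_000):
    """ASSUMPTION: stand-in for the two-sample t-test, which needs more than
    one sample per cell. Accepts a cell whose mean income lies within
    `tolerance` of the original query cell's mean."""
    return abs(statistics.mean(candidate) - statistics.mean(query)) <= tolerance


def expand_within_age(query_age, ci_threshold):
    """Expand the query cell along the age dimension, nearest cell first."""
    cells = {}
    for age, income in SURVEY:
        cells.setdefault(age, []).append(income)

    query = cells[query_age]
    ages, pooled = [query_age], list(query)
    neighbours = sorted((a for a in cells if a != query_age),
                        key=lambda a: abs(a - query_age))
    for age in neighbours:
        if ci_half_width(pooled) <= ci_threshold:
            break                      # interval is tight enough: stop
        if similar(cells[age], query):
            ages.append(age)           # cell accepted: pool its samples
            pooled.extend(cells[age])
        # otherwise the cell is ignored and the next nearest one is tried
    return sorted(ages), statistics.mean(pooled)


ages, answer = expand_within_age(25, ci_threshold=50_000)
print(ages, round(answer))
```

Under these assumptions the sketch visits ages 23, 28, and 31 in that order, skips age 28, and stops at “age = {23,25,31}” with a mean income of about $62,333, matching the example. A different tolerance or threshold would change where the expansion stops.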
Method 2. Intercuboid query expansion. In this case, the expansion occurs by looking
to a more general cell, as shown in Figure 5.15(b). For example, the cell in the 2-D cuboid
6 For the sake of illustration, ignore the fact that the sample size is too small to be statistically significant.
7 For the sake of the example, suppose this is true even though there is only one sample. In practice,
more points are needed to calculate a legitimate value.