5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology 221

cell is poor for prediction. A better solution is probably to drill down on the query cell
to a more specific one (i.e., asking more specific queries). Second, a small sample size
can cause a large confidence interval. When there are very few samples, the corresponding
critical value t_c is large because there are few degrees of freedom. This in turn can
produce a large confidence interval. Intuitively, this makes sense. Suppose one is trying
to figure out the average income of people in the United States. Just asking two or three
people does not give much confidence in the returned response.
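To make the effect of sample size concrete, here is a minimal sketch that computes the half-width of a 95% confidence interval of the form mean ± t_c · s/√l. The income values, the function name `confidence_interval`, and the small lookup table of critical values are assumptions made for illustration; the table holds standard two-sided 95% Student's t values for a few degrees of freedom only.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only the entries needed for this illustration are listed.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262, 29: 2.045}

def confidence_interval(samples):
    """Return (mean, half_width) for the interval mean +/- t_c * s / sqrt(l)."""
    l = len(samples)
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(l)  # estimated standard error
    t_c = T_CRIT_95[l - 1]                              # degrees of freedom = l - 1
    return mean, t_c * std_err

# Hypothetical income samples: three respondents versus thirty.
small = [30_000, 52_000, 71_000]
large = small * 10  # same values repeated, only to enlarge the sample

small_mean, small_half = confidence_interval(small)
large_mean, large_half = confidence_interval(large)
print(f"l = {len(small):2d}: half-width = {small_half:,.0f}")
print(f"l = {len(large):2d}: half-width = {large_half:,.0f}")
```

Both effects described above show up here: with three samples the critical value is more than twice as large as with thirty, and the standard error shrinks as l grows, so the interval for the small sample is far wider.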
The best way to solve this small sample size problem is to get more data. Fortunately,
there is usually an abundance of additional data available in the cube. The data do not
match the query cell exactly; however, we can consider data from cells that are “close
by.” There are two ways to incorporate such data to enhance the reliability of the query
answer: (1) intracuboid query expansion, where we consider nearby cells within the same
cuboid, and (2) intercuboid query expansion, where we consider more general versions
(from parent cuboids) of the query cell. Let’s see how this works, starting with
intracuboid query expansion.
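The two expansion strategies can be sketched on a toy sampling cube. Everything in this example is hypothetical: the sample values, the function names `intracuboid` and `intercuboid`, and in particular the choice of age (with a radius of one year) as the dimension to expand. Which dimension is appropriate to expand depends on the data, so treat this as an illustration of the mechanics only.

```python
# Hypothetical sampling cube: (age, occupation) -> sampled income values.
cube = {
    (25, "teacher"): [42_000, 45_000],
    (26, "teacher"): [43_000, 47_000, 44_000],
    (27, "teacher"): [46_000],
    (25, "engineer"): [80_000, 85_000, 90_000],
}

def intracuboid(cube, age, occupation, age_radius=1):
    """Pool samples from cells of the same cuboid whose age lies within age_radius."""
    return [v
            for (a, o), values in cube.items()
            if o == occupation and abs(a - age) <= age_radius
            for v in values]

def intercuboid(cube, age=None, occupation=None):
    """Pool samples from a parent-cuboid cell; None generalizes that dimension (*)."""
    return [v
            for (a, o), values in cube.items()
            if (age is None or a == age) and (occupation is None or o == occupation)
            for v in values]

query = (25, "teacher")
print("query cell samples:      ", len(cube[query]))
print("intracuboid (age +/- 1): ", len(intracuboid(cube, *query)))
print("intercuboid (*, teacher):", len(intercuboid(cube, occupation="teacher")))
print("intercuboid (25, *):     ", len(intercuboid(cube, age=25)))
```

Note that generalizing occupation, as in the cell (25, *), pulls in engineer incomes that differ sharply from teacher incomes. More samples do not help if they change what the query means.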

Method 1. Intracuboid query expansion. Here, we expand the sample size by including
nearby cells in the same cuboid as the queried cell, as shown in Figure 5.15(a). We just
have to be careful that the new samples serve to increase the confidence in the answer
without changing the query’s semantics.
So, the first question is “Which dimensions should be expanded?” The best candidates
should be the dimensions that are uncorrelated or weakly correlated with the measure

[Figure 5.15 panels: (a) age-occupation cuboid; (b) age cuboid and occupation cuboid, shown with the age-occupation cuboid]

Figure 5.15 Query expansion within sampling cube: Given small data samples, both methods use
strategies to boost the reliability of query answers by considering additional data cell values.
(a) Intracuboid expansion considers nearby cells in the same cuboid as the queried cell.
(b) Intercuboid expansion considers more general cells from parent cuboids.
