Page 263 -
P. 263
12-ch05-187-242-9780123814791
HAN
226 Chapter 5 Data Cube Technology 2011/6/1 3:19 Page 226 #40
district, the star rating, and whether the hotel offers complimentary treats or Internet
access. The ranking functions may be linear, quadratic, or any other form.
As shown in the preceding examples, individual users may not only propose ad hoc
ranking functions, but also have different data subsets of interest. Users often want to
thoroughly study the data via multidimensional analysis of the top-k query results. For
example, if unsatisfied by the top-5 results returned by Q 1 , the user may roll up on
the producer dimension to check the top-5 results on all sedans. The dynamic nature
of the problem imposes a great challenge to researchers. OLAP requires offline pre-
computation so that multidimensional analysis can be performed on-the-fly, yet the ad
hoc ranking functions prohibit full materialization. A natural compromise is to adopt a
semi-offline materialization and semi-online computation model.
Suppose a relation R has selection dimensions (A 1 ,A 2 , ...,A S ) and ranking dimen-
sions (N 1 ,N 2 ,...,N R ). Values in each ranking dimension can be partitioned into multi-
ple intervals according to the data and expected query distributions. Regarding the price
of used cars, for example, we may have, say, these four partitions (or value ranges): ≤ 5K,
[5 − 10K), [10 − 15K), and ≥ 15K. A ranking cube can be constructed by performing
multidimensional aggregations on selection dimensions. We can store the count for each
partition of each ranking dimension, thereby making the cube “rank-aware.” The top-k
queries can be answered by first accessing the cells in the more preferred value ranges
before consulting the cells in the less preferred value ranges.
Example 5.17 Using a ranking cube to answer a top-k query. Suppose Table 5.11 shows C MT , a mate-
rialized (i.e., precomputed) cuboid of a ranking cube for used-car sales. The cuboid,
C MT , is for the selection dimensions producer and type. It shows the count and corre-
sponding tuple IDs (TIDs) for various partitions of the ranking dimensions, price and
mileage.
Query Q 1 can be answered by using a selection condition to select the appropriate
selection dimension values (i.e., producer = “Ford” and type = “sedan”) in cuboid C MT .
2
2
In addition, the ranking function “(price − 10K) + (mileage − 30K) ” is used to find
the tuples that most closely match the user’s criteria. If there are not enough matching
tuples found in the closest matching cells, the next closest matching cells will need to be
accessed. We may even drill down to the corresponding lower-level cells to see the count
distributions of cells that match the ranking function and additional criteria regarding,
say, model, maintenance situation, or other loaded features. Only users who really want
to see more detailed information, such as interior photos, will need to access the physical
records stored in the database.
Table 5.11 Cuboid of a Ranking Cube for Used-Car Sales
producer type price mileage count TIDs
Ford sedan <5K 30–40K 7 t 6 , ..., t 68
Ford sedan 5–10K 30–40K 50 t 15 , ..., t 152
Honda sedan 10–15K 30–40K 20 t 8 , ..., t 32
... ... ... ... ... ...