Page 230 -
P. 230
Page 193
#7
2011/6/1
3:19
HAN 12-ch05-187-242-9780123814791
5.1 Data Cube Computation: Preliminary Concepts 193
data representations. The following are general optimization techniques for efficient
computation of data cubes.
Optimization Technique 1: Sorting, hashing, and grouping. Sorting, hashing, and
grouping operations should be applied to the dimension attributes to reorder and
cluster related tuples.
In cube computation, aggregation is performed on the tuples (or cells) that share
the same set of dimension values. Thus, it is important to explore sorting, hashing,
and grouping operations to access and group such data together to facilitate compu-
tation of such aggregates.
To compute total sales by branch, day, and item, for example, it can be more
efficient to sort tuples or cells by branch, and then by day, and then group them
according to the item name. Efficient implementations of such operations in large
data sets have been extensively studied in the database research community. Such
implementations can be extended to data cube computation.
This technique can also be further extended to perform shared-sorts (i.e., sharing
sorting costs across multiple cuboids when sort-based methods are used), or to per-
form shared-partitions (i.e., sharing the partitioning cost across multiple cuboids
when hash-based algorithms are used).
Optimization Technique 2: Simultaneous aggregation and caching of intermediate
results. In cube computation, it is efficient to compute higher-level aggregates from
previously computed lower-level aggregates, rather than from the base fact table.
Moreover, simultaneous aggregation from cached intermediate computation results
may lead to the reduction of expensive disk input/output (I/O) operations.
To compute sales by branch, for example, we can use the intermediate results
derived from the computation of a lower-level cuboid such as sales by branch and day.
This technique can be further extended to perform amortized scans (i.e., computing
as many cuboids as possible at the same time to amortize disk reads).
Optimization Technique 3: Aggregation from the smallest child when there exist mul-
tiple child cuboids. When there exist multiple child cuboids, it is usually more
efficient to compute the desired parent (i.e., more generalized) cuboid from the
smallest, previously computed child cuboid.
To compute a sales cuboid, C branch , when there exist two previously computed
cuboids, C {branch,year} and C {branch,item} , for example, it is obviously more efficient to
compute C branch from the former than from the latter if there are many more distinct
items than distinct years.
Many other optimization techniques may further improve computational efficiency. For
example, string dimension attributes can be mapped to integers with values ranging
from zero to the cardinality of the attribute.
In iceberg cube computation the following optimization technique plays a particu-
larly important role.