Chapter 5 Data Cube Technology
more cuboids if we consider concept hierarchies for each dimension. In addition, the
size of each cuboid depends on the cardinality of its dimensions. Thus, precomputation
of the full cube can require huge and often excessive amounts of memory.
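The cuboid count the footnote refers to can be sketched directly. Assuming Eq. (4.1) takes the form T = ∏(Lᵢ + 1), where Lᵢ is the number of concept-hierarchy levels of dimension i (excluding the virtual top level "all"), a minimal computation looks like this; the dimension names and level counts below are illustrative, not from the text:

```python
from functools import reduce
from operator import mul

def total_cuboids(levels):
    """Total cuboids in a cube whose i-th dimension has levels[i]
    hierarchy levels (excluding 'all'), per T = prod(L_i + 1)."""
    return reduce(mul, (l + 1 for l in levels), 1)

# Hypothetical 4-D sales cube: time (4 levels), item (3),
# location (4), supplier (1).
print(total_cuboids([4, 3, 4, 1]))  # 5 * 4 * 5 * 2 = 200
```

Even modest hierarchies multiply quickly, which is why full precomputation strains both memory and storage.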
Nonetheless, full cube computation algorithms are important. Individual cuboids
may be stored on secondary storage and accessed when necessary. Alternatively, we can
use such algorithms to compute smaller cubes, consisting of a subset of the given set
of dimensions, or a smaller range of possible values for some of the dimensions. In
these cases, the smaller cube is a full cube for the given subset of dimensions and/or
dimension values. A thorough understanding of full cube computation methods will
help us develop efficient methods for computing partial cubes. Hence, it is important to
explore scalable methods for computing all the cuboids making up a data cube, that is,
for full materialization. These methods must take into consideration the limited amount
of main memory available for cuboid computation, the total size of the computed data
cube, as well as the time required for such computation.
Partial materialization of data cubes offers an interesting trade-off between storage
space and response time for OLAP. Instead of computing the full cube, we can compute
only a subset of the data cube’s cuboids, or subcubes consisting of subsets of cells from
the various cuboids.
Many cells in a cuboid may actually be of little or no interest to the data analyst. Recall
that each cell in a full cube records an aggregate value such as count or sum. For many
cells in a cuboid, the measure value will be zero. When the product of the cardinalities
for the dimensions in a cuboid is large relative to the number of nonzero-valued tuples
that are stored in the cuboid, then we say that the cuboid is sparse. If a cube contains
many sparse cuboids, we say that the cube is sparse.
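The sparsity test above compares the product of the dimension cardinalities against the number of stored nonzero tuples. A minimal sketch, assuming an illustrative density cutoff (the text prescribes no fixed threshold):

```python
from math import prod

def is_sparse(cardinalities, nonzero_tuples, density_threshold=0.01):
    """A cuboid is sparse when stored nonzero tuples are few relative
    to the product of its dimensions' cardinalities."""
    possible_cells = prod(cardinalities)
    return nonzero_tuples / possible_cells < density_threshold

# Hypothetical cuboid: 1000 customers x 500 items x 365 days,
# but only 80,000 nonzero tuples actually stored.
print(is_sparse([1000, 500, 365], 80_000))  # True
```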
In many cases, a substantial amount of the cube’s space could be taken up by a large
number of cells with very low measure values. This is because the cube cells are often
quite sparsely distributed within a multidimensional space. For example, a customer
may only buy a few items in a store at a time. Such an event will generate only a few
nonempty cells, leaving most other cube cells empty. In such situations, it is useful to
materialize only those cells in a cuboid (group-by) with a measure value above some
minimum threshold. In a data cube for sales, say, we may wish to materialize only
those cells for which count ≥ 10 (i.e., where at least 10 tuples exist for the cell’s given
combination of dimensions), or only those cells representing sales ≥ $100. This not
only saves processing time and disk space, but also leads to a more focused analysis.
The cells that cannot pass the threshold are likely to be too trivial to warrant further
analysis.
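The thresholded materialization described above can be sketched as a group-by followed by a count filter. The sales tuples and the support value below are illustrative:

```python
from collections import Counter

def iceberg_cells(tuples, dims, min_sup):
    """Group-by on the given dimension indices; materialize only
    cells whose count meets the minimum support threshold."""
    counts = Counter(tuple(t[i] for i in dims) for t in tuples)
    return {cell: c for cell, c in counts.items() if c >= min_sup}

# Hypothetical (item, city) sales tuples.
sales = [("TV", "Chicago"), ("TV", "Chicago"), ("TV", "Chicago"),
         ("radio", "Chicago"), ("TV", "Toronto")]
print(iceberg_cells(sales, dims=(0, 1), min_sup=2))
# {('TV', 'Chicago'): 3}
```

Cells failing the threshold (the single radio sale, the single Toronto sale) are simply never stored, which is the space saving the text describes.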
Such partially materialized cubes are known as iceberg cubes. The minimum thresh-
old is called the minimum support threshold, or minimum support (min sup), for short.
Because only a fraction of the data cube's cells are materialized, the result is seen as the “tip of
the iceberg,” where the “iceberg” is the potential full cube including all cells. An iceberg
cube can be specified with an SQL query, as shown in Example 5.3.
1. Eq. (4.1) of Section 4.4.1 gives the total number of cuboids in a data cube where each dimension has an associated concept hierarchy.