Page 227 -
P. 227

HAN 12-ch05-187-242-9780123814791


          190   Chapter 5 Data Cube Technology               2011/6/1  3:19  Page 190  #4



                                                                                1
                         more cuboids if we consider concept hierarchies for each dimension. In addition, the
                         size of each cuboid depends on the cardinality of its dimensions. Thus, precomputation
                         of the full cube can require huge and often excessive amounts of memory.
                           Nonetheless, full cube computation algorithms are important. Individual cuboids
                         may be stored on secondary storage and accessed when necessary. Alternatively, we can
                         use such algorithms to compute smaller cubes, consisting of a subset of the given set
                         of dimensions, or a smaller range of possible values for some of the dimensions. In
                         these cases, the smaller cube is a full cube for the given subset of dimensions and/or
                         dimension values. A thorough understanding of full cube computation methods will
                         help us develop efficient methods for computing partial cubes. Hence, it is important to
                         explore scalable methods for computing all the cuboids making up a data cube, that is,
                         for full materialization. These methods must take into consideration the limited amount
                         of main memory available for cuboid computation, the total size of the computed data
                         cube, as well as the time required for such computation.
                           Partial materialization of data cubes offers an interesting trade-off between storage
                         space and response time for OLAP. Instead of computing the full cube, we can compute
                         only a subset of the data cube’s cuboids, or subcubes consisting of subsets of cells from
                         the various cuboids.
                           Many cells in a cuboid may actually be of little or no interest to the data analyst. Recall
                         that each cell in a full cube records an aggregate value such as count or sum. For many
                         cells in a cuboid, the measure value will be zero. When the product of the cardinalities
                         for the dimensions in a cuboid is large relative to the number of nonzero-valued tuples
                         that are stored in the cuboid, then we say that the cuboid is sparse. If a cube contains
                         many sparse cuboids, we say that the cube is sparse.
                           In many cases, a substantial amount of the cube’s space could be taken up by a large
                         number of cells with very low measure values. This is because the cube cells are often
                         quite sparsely distributed within a multidimensional space. For example, a customer
                         may only buy a few items in a store at a time. Such an event will generate only a few
                         nonempty cells, leaving most other cube cells empty. In such situations, it is useful to
                         materialize only those cells in a cuboid (group-by) with a measure value above some
                         minimum threshold. In a data cube for sales, say, we may wish to materialize only
                         those cells for which count ≥ 10 (i.e., where at least 10 tuples exist for the cell’s given
                         combination of dimensions), or only those cells representing sales ≥ $100. This not
                         only saves processing time and disk space, but also leads to a more focused analysis.
                         The cells that cannot pass the threshold are likely to be too trivial to warrant further
                         analysis.
                           Such partially materialized cubes are known as iceberg cubes. The minimum thresh-
                         old is called the minimum support threshold, or minimum support (min sup), for short.
                         By materializing only a fraction of the cells in a data cube, the result is seen as the “tip of
                         the iceberg,” where the “iceberg” is the potential full cube including all cells. An iceberg
                         cube can be specified with an SQL query, as shown in Example 5.3.


                         1 Eq. (4.1) of Section 4.4.1 gives the total number of cuboids in a data cube where each dimension has
                         an associated concept hierarchy.
   222   223   224   225   226   227   228   229   230   231   232