Page 228 -
P. 228

2011/6/1
                                                                                    #5
                                                                           Page 191
                          HAN 12-ch05-187-242-9780123814791
                                                                     3:19
                                                    5.1 Data Cube Computation: Preliminary Concepts  191


                  Example 5.3 Iceberg cube.

                                        compute cube sales iceberg as
                                        select month, city, customer group, count(*)
                                        from salesInfo
                                        cube by month, city, customer group
                                        having count(*) >= min sup

                               The compute cube statement specifies the precomputation of the iceberg cube,
                               sales iceberg, with the dimensions month, city, and customer group, and the aggregate
                               measure count(). The input tuples are in the salesInfo relation. The cube by clause
                               specifies that aggregates (group-by’s) are to be formed for each of the possible subsets of
                               the given dimensions. If we were computing the full cube, each group-by would corre-
                               spond to a cuboid in the data cube lattice. The constraint specified in the having clause
                               is known as the iceberg condition. Here, the iceberg measure is count(). Note that the
                               iceberg cube computed here could be used to answer group-by queries on any combina-
                               tion of the specified dimensions of the form having count(*) >= v, where v ≥ min sup.
                               Instead of count(), the iceberg condition could specify more complex measures such as
                               average().
                                 If we were to omit the having clause, we would end up with the full cube. Let’s call this
                               cube sales cube. The iceberg cube, sales iceberg, excludes all the cells of sales cube with a
                               count that is less than min sup. Obviously, if we were to set the minimum support to 1
                               in sales iceberg, the resulting cube would be the full cube, sales cube.


                                 A naïve approach to computing an iceberg cube would be to first compute the full
                               cube and then prune the cells that do not satisfy the iceberg condition. However, this is
                               still prohibitively expensive. An efficient approach is to compute only the iceberg cube
                               directly without computing the full cube. Sections 5.2.2 and 5.2.3 discuss methods for
                               efficient iceberg cube computation.
                                 Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells
                               in a data cube. However, we could still end up with a large number of uninteresting cells
                               to compute. For example, suppose that there are 2 base cells for a database of 100 dimen-
                               sions, denoted as {(a 1 , a 2 , a 3 ,..., a 100 ) : 10, (a 1 , a 2 , b 3 ,..., b 100 ) : 10}, where each has a
                               cell count of 10. If the minimum support is set to 10, there will still be an impermis-
                               sible number of cells to compute and store, although most of them are not interesting.
                                                                          2
                               For example, there are 2 101  − 6 distinct aggregate cells, like {(a 1 , a 2 , a 3 , a 4 ,..., a 99 , ∗) :
                               10,..., (a 1 , a 2 , ∗ , a 4 ,..., a 99 , a 100 ) : 10,..., (a 1 , a 2 , a 3 , ∗ ,..., ∗ , ∗) : 10}, but most of
                               them do not contain much new information. If we ignore all the aggregate cells that can
                               be obtained by replacing some constants by ∗’s while keeping the same measure value,
                               there are only three distinct cells left: {(a 1 , a 2 , a 3 ,..., a 100 ) : 10, (a 1 , a 2 , b 3 ,..., b 100 ) :
                               10, (a 1 , a 2 , ∗ ,..., ∗) : 20}. That is, out of 2 101  − 4 distinct base and aggregate cells, only
                               three really offer valuable information.


                               2 The proof is left as an exercise for the reader.
   223   224   225   226   227   228   229   230   231   232   233