5.1 Data Cube Computation: Preliminary Concepts
Example 5.3 Iceberg cube.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
The compute cube statement specifies the precomputation of the iceberg cube,
sales_iceberg, with the dimensions month, city, and customer_group, and the aggregate
measure count(). The input tuples are in the salesInfo relation. The cube by clause
specifies that aggregates (group-by's) are to be formed for each of the possible subsets of
the given dimensions. If we were computing the full cube, each group-by would correspond
to a cuboid in the data cube lattice. The constraint specified in the having clause
is known as the iceberg condition. Here, the iceberg measure is count(). Note that the
iceberg cube computed here could be used to answer group-by queries on any combination
of the specified dimensions of the form having count(*) >= v, where v ≥ min_sup.
Instead of count(), the iceberg condition could specify more complex measures, such as
average().
If we were to omit the having clause, we would end up with the full cube. Let's call this
cube sales_cube. The iceberg cube, sales_iceberg, excludes all the cells of sales_cube with a
count that is less than min_sup. Obviously, if we were to set the minimum support to 1
in sales_iceberg, the resulting cube would be the full cube, sales_cube.
A naïve approach to computing an iceberg cube would be to first compute the full
cube and then prune the cells that do not satisfy the iceberg condition. However, this is
still prohibitively expensive. An efficient approach is to compute only the iceberg cube
directly without computing the full cube. Sections 5.2.2 and 5.2.3 discuss methods for
efficient iceberg cube computation.
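The naïve compute-then-prune approach can be sketched in Python. This is a toy illustration only, not one of the efficient methods of the later sections: the salesInfo rows and the min_sup value below are hypothetical, and every cuboid (every subset of the dimensions) is materialized before the iceberg condition prunes cells.

```python
# Naive iceberg cube sketch: enumerate every cuboid, count all cells,
# then keep only cells satisfying the iceberg condition count(*) >= min_sup.
from itertools import combinations
from collections import Counter

# Hypothetical toy relation; columns mirror Example 5.3.
salesInfo = [  # (month, city, customer_group)
    ("Jan", "Chicago", "retail"),
    ("Jan", "Chicago", "retail"),
    ("Jan", "Toronto", "wholesale"),
    ("Feb", "Chicago", "retail"),
]
dims = ("month", "city", "customer_group")
min_sup = 2

iceberg = {}
for k in range(len(dims) + 1):              # every subset of dimensions = one cuboid
    for subset in combinations(range(len(dims)), k):
        counts = Counter(
            tuple(t[i] if i in subset else "*" for i in range(len(dims)))
            for t in salesInfo
        )
        for cell, c in counts.items():      # prune cells failing the iceberg condition
            if c >= min_sup:
                iceberg[cell] = c

for cell, c in sorted(iceberg.items()):
    print(cell, c)
```

Because the full cube is still enumerated before pruning, the work grows with the number of cells in the full cube, which is exactly why this approach does not scale to many dimensions.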
Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells
in a data cube. However, we could still end up with a large number of uninteresting cells
to compute. For example, suppose that there are 2 base cells for a database of 100 dimensions,
denoted as {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) : 10}, where each has a
cell count of 10. If the minimum support is set to 10, there will still be an impermissible
number of cells to compute and store, although most of them are not interesting.
For example, there are 2^101 − 6 distinct aggregate cells,[2] like {(a1, a2, a3, a4, ..., a99, ∗) :
10, ..., (a1, a2, ∗, a4, ..., a99, a100) : 10, ..., (a1, a2, a3, ∗, ..., ∗, ∗) : 10}, but most of
them do not contain much new information. If we ignore all the aggregate cells that can
be obtained by replacing some constants by ∗'s while keeping the same measure value,
there are only three distinct cells left: {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) :
10, (a1, a2, ∗, ..., ∗) : 20}. That is, out of 2^101 − 4 distinct base and aggregate cells, only
three really offer valuable information.
[2] The proof is left as an exercise for the reader.
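The cell counts in this example can be checked by brute force for a small number of dimensions. The sketch below is an illustration under stated assumptions: it fixes n = 6 rather than 100 (so the cube fits in memory), and the helper cube_cells is our own, but the quantities it verifies, 2^(n+1) − 6 aggregate cells and only 3 informative cells, match the formulas above with n = 100.

```python
# Brute-force check of the cell counts for two base cells that agree on the
# first two of n dimensions, each with count 10 (as in the 100-d example).
from itertools import product
from collections import Counter

def cube_cells(n):
    """All distinct base and aggregate cells, with counts, for the two base
    cells (a1,...,an) and (a1,a2,b3,...,bn), each counting 10."""
    base = [tuple(f"a{i}" for i in range(1, n + 1)),
            tuple(["a1", "a2"] + [f"b{i}" for i in range(3, n + 1)])]
    cells = Counter()
    for t in base:
        for mask in product([False, True], repeat=n):  # dims kept vs. starred
            cell = tuple(v if keep else "*" for v, keep in zip(t, mask))
            cells[cell] += 10
    return cells

n = 6
cells = cube_cells(n)

# Aggregate cells are those containing at least one "*".
aggregate = {c for c in cells if "*" in c}
print(len(aggregate), 2 ** (n + 1) - 6)   # prints: 122 122

# Cells that cannot be obtained from a more specific cell with the same
# measure value -- only three remain, independent of n.
closed = [c for c in cells
          if not any(cells[d] == cells[c]
                     for d in cells
                     if d != c and all(x == y or x == "*"
                                       for x, y in zip(c, d)))]
print(len(closed))                        # prints: 3
```

The three surviving cells are the two base cells (count 10) and (a1, a2, ∗, ..., ∗) with count 20, matching the text. This redundancy among same-count cells is the intuition behind the closed cube discussed later in the chapter.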