Page 234 -
P. 234
2011/6/1
12-ch05-187-242-9780123814791
#11
3:19 Page 197
HAN
5.2 Data Cube Computation Methods 197
The base cuboid, denoted by ABC (from which all the other cuboids are directly or
indirectly computed). This cube is already computed and corresponds to the given
3-D array.
The 2-D cuboids, AB, AC, and BC, which respectively correspond to the group-by’s
AB, AC, and BC. These cuboids must be computed.
The 1-D cuboids, A, B, and C, which respectively correspond to the group-by’s A, B,
and C. These cuboids must be computed.
The 0-D (apex) cuboid, denoted by all, which corresponds to the group-by (); that
is, there is no group-by here. This cuboid must be computed. It consists of only one
value. If, say, the data cube measure is count, then the value to be computed is simply
the total count of all the tuples in ABC.
Let’s look at how the multiway array aggregation technique is used in this computa-
tion. There are many possible orderings with which chunks can be read into memory
for use in cube computation. Consider the ordering labeled from 1 to 64, shown in
Figure 5.3. Suppose we want to compute the b 0 c 0 chunk of the BC cuboid. We allocate
space for this chunk in chunk memory. By scanning ABC chunks 1 through 4, the b 0 c 0
chunk is computed. That is, the cells for b 0 c 0 are aggregated over a 0 to a 3 . The chunk
memory can then be assigned to the next chunk, b 1 c 0 , which completes its aggregation
after the scanning of the next four ABC chunks: 5 through 8. Continuing in this way,
the entire BC cuboid can be computed. Therefore, only one BC chunk needs to be in
memory at a time, for the computation of all the BC chunks.
In computing the BC cuboid, we will have scanned each of the 64 chunks. “Is there a
way to avoid having to rescan all of these chunks for the computation of other cuboids such
as AC and AB?” The answer is, most definitely, yes. This is where the “multiway com-
putation” or “simultaneous aggregation” idea comes in. For example, when chunk 1
(i.e., a 0 b 0 c 0 ) is being scanned (say, for the computation of the 2-D chunk b 0 c 0 of BC, as
described previously), all of the other 2-D chunks relating to a 0 b 0 c 0 can be simultane-
ously computed. That is, when a 0 b 0 c 0 is being scanned, each of the three chunks (b 0 c 0 ,
a 0 c 0 , and a 0 b 0 ) on the three 2-D aggregation planes (BC, AC, and AB) should be com-
puted then as well. In other words, multiway computation simultaneously aggregates to
each of the 2-D planes while a 3-D chunk is in memory.
Now let’s look at how different orderings of chunk scanning and of cuboid compu-
tation can affect the overall data cube computation efficiency. Recall that the size of the
dimensions A, B, and C is 40, 400, and 4000, respectively. Therefore, the largest 2-D
plane is BC (of size 400 × 4000 = 1,600,000). The second largest 2-D plane is AC (of
size 40 × 4000 = 160,000). AB is the smallest 2-D plane (of size 40 × 400 = 16,000).
Suppose that the chunks are scanned in the order shown, from chunks 1 to 64. As
previously mentioned, b 0 c 0 is fully aggregated after scanning the row containing chunks
1 through 4; b 1 c 0 is fully aggregated after scanning chunks 5 through 8, and so on. Thus,
we need to scan four chunks of the 3-D array to fully compute one chunk of the BC
cuboid (where BC is the largest of the 2-D planes). In other words, by scanning in this