Page 197 -
P. 197
11-ch04-125-186-9780123814791
HAN
2011/6/1
160 Chapter 4 Data Warehousing and Online Analytical Processing 3:17 Page 160 #36
The partial materialization of cuboids or subcubes should consider three factors: (1)
identify the subset of cuboids or subcubes to materialize; (2) exploit the materialized
cuboids or subcubes during query processing; and (3) efficiently update the materialized
cuboids or subcubes during load and refresh.
The selection of the subset of cuboids or subcubes to materialize should take into
account the queries in the workload, their frequencies, and their accessing costs. In addi-
tion, it should consider workload characteristics, the cost for incremental updates, and
the total storage requirements. The selection must also consider the broad context of
physical database design such as the generation and selection of indices. Several OLAP
products have adopted heuristic approaches for cuboid and subcube selection. A pop-
ular approach is to materialize the cuboids set on which other frequently referenced
cuboids are based. Alternatively, we can compute an iceberg cube, which is a data cube
that stores only those cube cells with an aggregate value (e.g., count) that is above some
minimum support threshold.
Another common strategy is to materialize a shell cube. This involves precomput-
ing the cuboids for only a small number of dimensions (e.g., three to five) of a data
cube. Queries on additional combinations of the dimensions can be computed on-the-
fly. Because our aim in this chapter is to provide a solid introduction and overview of
data warehousing for data mining, we defer our detailed discussion of cuboid selection
and computation to Chapter 5, which studies various data cube computation methods
in greater depth.
Once the selected cuboids have been materialized, it is important to take advantage of
them during query processing. This involves several issues, such as how to determine the
relevant cuboid(s) from among the candidate materialized cuboids, how to use available
index structures on the materialized cuboids, and how to transform the OLAP opera-
tions onto the selected cuboid(s). These issues are discussed in Section 4.4.3 as well as in
Chapter 5.
Finally, during load and refresh, the materialized cuboids should be updated effi-
ciently. Parallelism and incremental update techniques for this operation should be
explored.
4.4.2 Indexing OLAP Data: Bitmap Index and Join Index
To facilitate efficient data accessing, most data warehouse systems support index struc-
tures and materialized views (using cuboids). General methods to select cuboids for
materialization were discussed in Section 4.4.1. In this subsection, we examine how to
index OLAP data by bitmap indexing and join indexing.
The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes. The bitmap index is an alternative representation of the
record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit
vector, Bv, for each value v in the attribute’s domain. If a given attribute’s domain con-
sists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are
n bit vectors). If the attribute has the value v for a given row in the data table, then the
bit representing that value is set to 1 in the corresponding row of the bitmap index. All
other bits for that row are set to 0.