Chapter 5 Data Cube Technology
respectively. These detailed exceptions were far from obvious when we were viewing the
data as an item-time group-by, aggregated over region in Figure 5.17. Thus, the InExp
value is useful for searching for exceptions at lower-level cells of the cube.
“How are the exception values computed?” The SelfExp, InExp, and PathExp measures
are based on a statistical method for table analysis. They take into account all of the
group-by’s (aggregations) in which a given cell value participates. A cell value is
considered an exception based on how much it differs from its expected value, where its
expected value is determined with a statistical model. The difference between a given
cell value and its expected value is called a residual. Intuitively, the larger the residual,
the more the given cell value is an exception. The comparison of residual values requires
us to scale the values based on the expected standard deviation associated with the
residuals. A cell value is therefore considered an exception if its scaled residual value exceeds
a prespecified threshold. The SelfExp, InExp, and PathExp measures are based on this
scaled residual.
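As a minimal sketch of this idea (not the book’s exact algorithm), the following flags exception cells of a 2-D group-by. It assumes a plain additive model for the expected value and pools one standard deviation over all residuals; the function name and threshold default are illustrative.

```python
import statistics

def find_exceptions(table, threshold=2.5):
    """Flag cells of a 2-D group-by whose scaled residual exceeds `threshold`.

    Expected values come from a simple additive model (an assumption):
    expected[i][j] = overall mean + row effect + column effect.
    """
    n_rows, n_cols = len(table), len(table[0])
    overall = statistics.mean(v for row in table for v in row)
    row_eff = [statistics.mean(row) - overall for row in table]
    col_eff = [statistics.mean(table[i][j] for i in range(n_rows)) - overall
               for j in range(n_cols)]
    # Residual = observed cell value minus the value the model expects.
    residuals = [[table[i][j] - (overall + row_eff[i] + col_eff[j])
                  for j in range(n_cols)] for i in range(n_rows)]
    # Scale residuals so they are comparable; here one pooled standard
    # deviation is used (the method described in the text scales each
    # residual by its own expected standard deviation).
    sd = statistics.pstdev(r for row in residuals for r in row) or 1.0
    return [(i, j) for i in range(n_rows) for j in range(n_cols)
            if abs(residuals[i][j]) / sd > threshold]
```

A cell that is large relative to both its row and its column effects yields a large scaled residual and is reported; a uniform table yields no exceptions.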
The expected value of a given cell is a function of the higher-level group-by’s of the
given cell. For example, given a cube with the three dimensions A, B, and C, the expected
value for a cell at the ith position in A, the jth position in B, and the kth position in C is a
function of γ, γ_i^A, γ_j^B, γ_k^C, γ_{ij}^{AB}, γ_{ik}^{AC}, and γ_{jk}^{BC}, which are coefficients of the statistical
model used. The coefficients reflect how different the values at more detailed levels are,
based on generalized impressions formed by looking at higher-level aggregations. In this
way, the exception quality of a cell value is based on the exceptions of the values below it.
Thus, when seeing an exception, it is natural for the user to further explore the exception
by drilling down.
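The additive form of such a coefficient model can be sketched for a small in-memory 3-D cube. Here the γ coefficients are estimated by successive mean removal (an ANOVA-style fit; the published method may use a robust or log-linear variant, so treat this as an assumption), and a cell’s expected value is the sum of the coefficients it participates in.

```python
import statistics

def fit_coefficients(cube):
    """Fit gamma coefficients for cube[i][j][k] by successive mean removal."""
    I, J, K = len(cube), len(cube[0]), len(cube[0][0])
    # Overall mean: the gamma term shared by every cell.
    g = statistics.mean(cube[i][j][k]
                        for i in range(I) for j in range(J) for k in range(K))
    # One-dimensional effects: deviation of each slice mean from the overall mean.
    gA = [statistics.mean(cube[i][j][k] for j in range(J) for k in range(K)) - g
          for i in range(I)]
    gB = [statistics.mean(cube[i][j][k] for i in range(I) for k in range(K)) - g
          for j in range(J)]
    gC = [statistics.mean(cube[i][j][k] for i in range(I) for j in range(J)) - g
          for k in range(K)]
    # Two-dimensional interaction effects: what the pairwise means leave
    # unexplained after the overall mean and one-dimensional effects.
    gAB = [[statistics.mean(cube[i][j][k] for k in range(K)) - g - gA[i] - gB[j]
            for j in range(J)] for i in range(I)]
    gAC = [[statistics.mean(cube[i][j][k] for j in range(J)) - g - gA[i] - gC[k]
            for k in range(K)] for i in range(I)]
    gBC = [[statistics.mean(cube[i][j][k] for i in range(I)) - g - gB[j] - gC[k]
            for k in range(K)] for j in range(J)]
    return {'g': g, 'gA': gA, 'gB': gB, 'gC': gC,
            'gAB': gAB, 'gAC': gAC, 'gBC': gBC}

def expected_value(c, i, j, k):
    """Expected value of cell (i, j, k): the sum of its gamma coefficients."""
    return (c['g'] + c['gA'][i] + c['gB'][j] + c['gC'][k]
            + c['gAB'][i][j] + c['gAC'][i][k] + c['gBC'][j][k])
```

For data that is exactly additive in the three dimensions, the fitted expected values reproduce the cell values and all residuals are zero; real data leaves nonzero residuals, which feed the exception measures.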
“How can the data cube be efficiently constructed for discovery-driven exploration?”
This computation consists of three phases. The first phase involves the computation of the
aggregate values defining the cube, such as sum or count, over which exceptions will be
found. The second phase consists of model fitting, in which the coefficients mentioned
before are determined and used to compute the standardized residuals. This phase can
be overlapped with the first phase because the computations involved are similar. The
third phase computes the SelfExp, InExp, and PathExp values, based on the standardized
residuals. This phase is computationally similar to phase 1. Therefore, the computation
of data cubes for discovery-driven exploration can be done efficiently.
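The three phases can be sketched end to end on a single 2-D group-by. The record schema `(item, month, sales)` and the function name are hypothetical, the model is the simple additive one, SelfExp is taken as the absolute standardized residual, and every (item, month) combination is assumed to occur in the data.

```python
import statistics
from collections import defaultdict

def discovery_driven_cube(records):
    """Sketch of the three phases for one group-by of (item, month, sales) records."""
    # Phase 1: compute the aggregate values defining the cube (here, sum).
    cube = defaultdict(float)
    for item, month, sales in records:
        cube[(item, month)] += sales
    items = sorted({i for i, _ in cube})
    months = sorted({m for _, m in cube})
    # Phase 2: model fitting -- estimate coefficients and standardize residuals.
    g = statistics.mean(cube[(i, m)] for i in items for m in months)
    row = {i: statistics.mean(cube[(i, m)] for m in months) - g for i in items}
    col = {m: statistics.mean(cube[(i, m)] for i in items) - g for m in months}
    res = {(i, m): cube[(i, m)] - (g + row[i] + col[m])
           for i in items for m in months}
    sd = statistics.pstdev(res.values()) or 1.0
    # Phase 3: exception measures from the standardized residuals
    # (SelfExp only in this sketch; InExp and PathExp would aggregate
    # SelfExp over the drill-down cells of each cell).
    selfexp = {cell: abs(r) / sd for cell, r in res.items()}
    return cube, selfexp
```

Phase 2 shares its scan pattern with phase 1, which is what makes overlapping the two phases attractive in practice.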
5.5 Summary
Data cube computation and exploration play an essential role in data warehousing
and are important for flexible data mining in multidimensional space.
A data cube consists of a lattice of cuboids. Each cuboid corresponds to a different
degree of summarization of the given multidimensional data. Full materialization
refers to the computation of all the cuboids in a data cube lattice. Partial
materialization refers to the selective computation of a subset of the cuboid cells in the