Page 182 -
P. 182
3:17 Page 145
#21
11-ch04-125-186-9780123814791
2011/6/1
HAN
4.2 Data Warehouse Modeling: Data Cube and OLAP 145
by a set of dimension–value pairs; for example, htime = “Q1”, location = “Vancouver”,
item = “computer”i. A data cube measure is a numeric function that can be evaluated
at each point in the data cube space. A measure value is computed for a given point by
aggregating the data corresponding to the respective dimension–value pairs defining the
given point. We will look at concrete examples of this shortly.
Measures can be organized into three categories—distributive, algebraic, and holi-
stic—based on the kind of aggregate functions used.
Distributive: An aggregate function is distributive if it can be computed in a distributed
manner as follows. Suppose the data are partitioned into n sets. We apply the func-
tion to each partition, resulting in n aggregate values. If the result derived by applying
the function to the n aggregate values is the same as that derived by applying the func-
tion to the entire data set (without partitioning), the function can be computed in a
distributed manner. For example, sum() can be computed for a data cube by first par-
titioning the cube into a set of subcubes, computing sum() for each subcube, and then
summing up the counts obtained for each subcube. Hence, sum() is a distributive
aggregate function.
For the same reason, count(), min(), and max() are distributive aggregate functions.
By treating the count value of each nonempty base cell as 1 by default, count() of any
cell in a cube can be viewed as the sum of the count values of all of its corresponding
child cells in its subcube. Thus, count() is distributive. A measure is distributive if it is
obtained by applying a distributive aggregate function. Distributive measures can be
computed efficiently because of the way the computation can be partitioned.
Algebraic: An aggregate function is algebraic if it can be computed by an algebraic func-
tion with M arguments (where M is a bounded positive integer), each of which
is obtained by applying a distributive aggregate function. For example, avg() (aver-
age) can be computed by sum()/count(), where both sum() and count() are distributive
aggregate functions. Similarly, it can be shown that min N() and max N() (which
find the N minimum and N maximum values, respectively, in a given set) and
standard deviation() are algebraic aggregate functions. A measure is algebraic if it is
obtained by applying an algebraic aggregate function.
Holistic: An aggregate function is holistic if there is no constant bound on the stor-
age size needed to describe a subaggregate. That is, there does not exist an algebraic
function with M arguments (where M is a constant) that characterizes the compu-
tation. Common examples of holistic functions include median(), mode(), and rank().
A measure is holistic if it is obtained by applying a holistic aggregate function.
Most large data cube applications require efficient computation of distributive and
algebraic measures. Many efficient techniques for this exist. In contrast, it is difficult to
compute holistic measures efficiently. Efficient techniques to approximate the computa-
tion of some holistic measures, however, do exist. For example, rather than computing
the exact median(), Equation (2.3) of Chapter 2 can be used to estimate the approxi-
mate median value for a large data set. In many cases, such techniques are sufficient to
overcome the difficulties of efficient computation of holistic measures.