

          Chapter 5 Data Cube Technology



                         respectively. These detailed exceptions were far from obvious when we were viewing the
                         data as an item-time group-by, aggregated over region in Figure 5.17. Thus, the InExp
                         value is useful for searching for exceptions at lower-level cells of the cube.

                           “How are the exception values computed?” The SelfExp, InExp, and PathExp measures
                         are based on a statistical method for table analysis. They take into account all of the
                         group-by’s (aggregations) in which a given cell value participates. A cell value is con-
                         sidered an exception based on how much it differs from its expected value, where its
                         expected value is determined with a statistical model. The difference between a given
                         cell value and its expected value is called a residual. Intuitively, the larger the residual,
                         the more the given cell value is an exception. The comparison of residual values requires
                         us to scale the values based on the expected standard deviation associated with the resid-
                         uals. A cell value is therefore considered an exception if its scaled residual value exceeds
                         a prespecified threshold. The SelfExp, InExp, and PathExp measures are based on this
                         scaled residual.
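As a minimal sketch of the thresholding rule just described, the Python snippet below flags cells whose scaled residual exceeds a cutoff. The function name `find_exceptions`, the sample numbers, and the cutoff of 2.5 are illustrative assumptions, not values given in the text. The expected values and standard deviations are treated as given inputs here.

```python
def find_exceptions(actual, expected, sigma, threshold=2.5):
    """Return (index, scaled_residual) pairs for cells flagged as exceptions.

    actual, expected, sigma are equal-length sequences of cell values,
    model-expected values, and expected standard deviations of the residuals.
    The threshold default is an assumption made for this sketch.
    """
    exceptions = []
    for idx, (y, y_hat, s) in enumerate(zip(actual, expected, sigma)):
        residual = y - y_hat          # difference from the expected value
        scaled = abs(residual) / s    # scale by the expected standard deviation
        if scaled > threshold:
            exceptions.append((idx, scaled))
    return exceptions


# Hypothetical cell values; only the third cell deviates strongly.
actual = [10.0, 12.0, 30.0, 11.0]
expected = [11.0, 11.5, 12.0, 10.0]
sigma = [2.0, 2.0, 3.0, 2.0]
flagged = find_exceptions(actual, expected, sigma)
```

Scaling matters because a residual of 18 is large for a cell whose residuals typically vary by 3, but would be unremarkable for a cell whose residuals typically vary by 20.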
                           The expected value of a given cell is a function of the higher-level group-by’s of the
                         given cell. For example, given a cube with the three dimensions A, B, and C, the expected
                         value for a cell at the ith position in A, the jth position in B, and the kth position in C is a
                         function of $\gamma$, $\gamma_i^A$, $\gamma_j^B$, $\gamma_k^C$, $\gamma_{ij}^{AB}$, $\gamma_{ik}^{AC}$, and $\gamma_{jk}^{BC}$, which are coefficients of the statistical
                         model used. The coefficients reflect how different the values at more detailed levels are,
                         based on generalized impressions formed by looking at higher-level aggregations. In this
                         way, the exception quality of a cell value is based on the exceptions of the values below it.
                         Thus, when seeing an exception, it is natural for the user to further explore the exception
                         by drilling down.
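To make the role of the coefficients concrete, here is a hedged sketch that fits them for a small three-dimensional cube. The text does not state the functional form of the model, so this example assumes a simple additive form (expected value = sum of the seven coefficients) and estimates each coefficient from the mean of the corresponding higher-level group-by. The names `fit_expected`, `make_cube`, and `residuals`, and all sample values, are hypothetical.

```python
from itertools import product
from statistics import mean


def fit_expected(cube):
    """Return expected values for cube[i][j][k] under an assumed additive model."""
    A, B, C = range(len(cube)), range(len(cube[0])), range(len(cube[0][0]))
    # Overall coefficient: mean of all cells.
    g = mean(cube[i][j][k] for i, j, k in product(A, B, C))
    # One-dimensional coefficients: group-by mean minus the overall mean.
    g_a = [mean(cube[i][j][k] for j, k in product(B, C)) - g for i in A]
    g_b = [mean(cube[i][j][k] for i, k in product(A, C)) - g for j in B]
    g_c = [mean(cube[i][j][k] for i, j in product(A, B)) - g for k in C]
    # Two-dimensional coefficients: group-by mean minus lower-order terms.
    g_ab = [[mean(cube[i][j][k] for k in C) - g_a[i] - g_b[j] - g for j in B] for i in A]
    g_ac = [[mean(cube[i][j][k] for j in B) - g_a[i] - g_c[k] - g for k in C] for i in A]
    g_bc = [[mean(cube[i][j][k] for i in A) - g_b[j] - g_c[k] - g for k in C] for j in B]
    return [[[g + g_a[i] + g_b[j] + g_c[k] + g_ab[i][j] + g_ac[i][k] + g_bc[j][k]
              for k in C] for j in B] for i in A]


def residuals(cube):
    """Map each cell position (i, j, k) to actual minus expected value."""
    expected = fit_expected(cube)
    A, B, C = range(len(cube)), range(len(cube[0])), range(len(cube[0][0]))
    return {(i, j, k): cube[i][j][k] - expected[i][j][k] for i, j, k in product(A, B, C)}


def make_cube(spike=0.0):
    """Build a hypothetical 3x3x3 cube that is purely additive except for one cell."""
    a, b, c = [0.0, 10.0, 20.0], [0.0, 2.0, 4.0], [0.0, 1.0, 3.0]
    cube = [[[a[i] + b[j] + c[k] for k in range(3)] for j in range(3)] for i in range(3)]
    cube[2][2][2] += spike
    return cube


additive_res = residuals(make_cube())            # no anomaly injected
spiked_res = residuals(make_cube(spike=27.0))    # anomaly at position (2, 2, 2)
worst = max(spiked_res, key=lambda cell: abs(spiked_res[cell]))
```

With the purely additive cube every residual is zero up to rounding, because the higher-level group-bys fully explain each cell. Once one cell is perturbed, that cell receives the largest residual, which is the behavior the exception measures rely on.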
                           “How can the data cube be efficiently constructed for discovery-driven exploration?”
                         This computation consists of three phases. The first phase involves the computation of the
                         aggregate values defining the cube, such as sum or count, over which exceptions will be
                         found. The second phase consists of model fitting, in which the coefficients mentioned
                         before are determined and used to compute the standardized residuals. This phase can
                         be overlapped with the first phase because the computations involved are similar. The
                         third phase computes the SelfExp, InExp, and PathExp values, based on the standardized
                         residuals. This phase is computationally similar to phase 1. Therefore, the computation
                         of data cubes for discovery-driven exploration can be done efficiently.
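The first phase can be illustrated with a deliberately naive sketch that sums a measure for every group-by of a small fact table. The function name `compute_cuboids`, the dimension names, and the sample rows are assumptions made for illustration. An efficient implementation would share work across group-bys rather than rescanning the rows for each one, and phases 2 and 3 are not shown.

```python
from collections import defaultdict
from itertools import combinations


def compute_cuboids(rows, dims, measure):
    """Sum `measure` for every group-by over `dims`.

    Returns a dict mapping each group-by (a tuple of dimension names) to a
    dict from cell key (a tuple of dimension values) to the aggregated sum.
    """
    cuboids = {}
    for size in range(len(dims) + 1):
        for group_by in combinations(dims, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cuboids[group_by] = dict(totals)
    return cuboids


# Hypothetical fact table with three dimensions and one measure.
rows = [
    {"item": "tv", "month": "Jan", "region": "north", "sales": 10.0},
    {"item": "tv", "month": "Jan", "region": "south", "sales": 5.0},
    {"item": "tv", "month": "Feb", "region": "north", "sales": 7.0},
    {"item": "radio", "month": "Jan", "region": "north", "sales": 3.0},
]
cube = compute_cuboids(rows, ("item", "month", "region"), "sales")
```

Three dimensions yield eight group-bys, from the empty group-by (the grand total) up to the full item-month-region group-by. The same aggregates are what the model-fitting phase would consume, which is why the text notes that the phases can be overlapped.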


                 5.5     Summary


                           Data cube computation and exploration play an essential role in data warehousing
                           and are important for flexible data mining in multidimensional space.
                           A data cube consists of a lattice of cuboids. Each cuboid corresponds to a different
                           degree of summarization of the given multidimensional data. Full materialization
                           refers to the computation of all the cuboids in a data cube lattice. Partial materi-
                           alization refers to the selective computation of a subset of the cuboid cells in the