Page 275 -
P. 275

HAN
                               12-ch05-187-242-9780123814791

          238   Chapter 5 Data Cube Technology              2011/6/1  3:19 Page 238  #52



                        (b) Suppose we want to compute a data cube where the condition is that the minimum
                            number of records is 10 and the average fare is over $500. Outline an efficient cube
                            computation method (based on common sense about flight data distribution).
                    5.11 (Implementation project) There are four typical data cube computation methods: Mul-
                         tiWay [ZDN97], BUC [BR99], H-Cubing [HPDW01], and Star-Cubing [XHLW03].
                         (a) Implement any one of these cube computation algorithms and describe your
                            implementation, experimentation, and performance. Find another student who has
                            implemented a different algorithm on the same platform (e.g., C++ on Linux) and
                            compare your algorithm performance with his or hers.
                            Input:
                            i. An n-dimensional base cuboid table (for n < 20), which is essentially a relational
                               table with n attributes.
                            ii. An iceberg condition: count (C) ≥ k, where k is a positive integer as a parameter.
                            Output:
                            i. The set of computed cuboids that satisfy the iceberg condition, in the order of
                               your output generation.
                            ii. Summary of the set of cuboids in the form of “cuboid ID: the number of
                               nonempty cells,” sorted in alphabetical order of cuboids (e.g., A: 155, AB: 120,
                               ABC: 22, ABCD: 4, ABCE: 6, ABD: 36), where the number after : represents the
                               number of nonempty cells. (This is used to quickly check the correctness of your
                               results.)
                        (b) Based on your implementation, discuss the following:
                             i. What challenging computation problems are encountered as the number of
                               dimensions grows large?
                            ii. How can iceberg cubing solve the problems of part (a) for some data sets (and
                               characterize such data sets)?
                            iii. Give one simple example to show that sometimes iceberg cubes cannot provide
                               a good solution.
                         (c) Instead of computing a high-dimensionality data cube, we may choose to materi-
                            alize the cuboids that have only a small number of dimension combinations. For
                            example, for a 30-D data cube, we may only compute the 5-D cuboids for every
                            possible 5-D combination. The resulting cuboids form a shell cube. Discuss how
                            easy or hard it is to modify your cube computation algorithm to facilitate such
                            computation.

                    5.12 The sampling cube was proposed for multidimensional analysis of sampling data (e.g.,
                         survey data). In many real applications, sampling data can be of high dimensionality
                         (e.g., it is not unusual to have more than 50 dimensions in a survey data set).

                         (a) How can we construct an efficient and scalable high-dimensional sampling cube in
                            large sampling data sets?
                        (b) Design an efficient incremental update algorithm for such a high-dimensional
                            sampling cube.
   270   271   272   273   274   275   276   277   278   279   280