
HAN 10-ch03-083-124-9780123814791
                                                                     3:16
                                                             2011/6/1


                               the purchaser’s name and address instead of a key to this information in a purchaser
                               database, discrepancies can occur, such as the same purchaser’s name appearing with
                               different addresses within the purchase order database.


                         3.3.4 Data Value Conflict Detection and Resolution

                               Data integration also involves the detection and resolution of data value conflicts. For
                               example, for the same real-world entity, attribute values from different sources may dif-
                               fer. This may be due to differences in representation, scaling, or encoding. For instance,
                               a weight attribute may be stored in metric units in one system and British imperial
                               units in another. For a hotel chain, the price of rooms in different cities may involve
                               not only different currencies but also different services (e.g., free breakfast) and taxes.
                               When exchanging information between schools, for example, each school may have its
                               own curriculum and grading scheme. One university may adopt a quarter system, offer
                               three courses on database systems, and assign grades from A+ to F, whereas another
                               may adopt a semester system, offer two courses on databases, and assign grades from 1
                               to 10. It is difficult to work out precise course-to-grade transformation rules between
                               the two universities, making information exchange difficult.
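The weight example above can be made concrete with a minimal sketch. It assumes each source records a weight as a (value, unit) pair; the conversion table, function names, and tolerance are hypothetical and chosen only for illustration. Normalizing both readings to one unit before comparing them separates a difference in representation from a genuine value conflict.

```python
# Hypothetical conversion table: factor that turns each unit into kilograms.
TO_KG = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.45359237,       # international avoirdupois pound
    "oz": 0.028349523125,   # 1/16 of a pound
}


def to_kilograms(value, unit):
    """Convert a weight expressed in `unit` to kilograms."""
    try:
        return value * TO_KG[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown weight unit: {unit!r}")


def weights_conflict(a, b, tolerance_kg=0.01):
    """Return True if two (value, unit) readings disagree after normalization.

    The tolerance is an assumption; it absorbs rounding done by each source.
    """
    return abs(to_kilograms(*a) - to_kilograms(*b)) > tolerance_kg


# The same product as recorded by two hypothetical source systems.
source_a = (2.0, "kg")
source_b = (4.41, "lb")
print(weights_conflict(source_a, source_b))  # False: same weight, different units
```

A check of this kind handles scaling differences only. Conflicts such as the grading schemes described above have no fixed conversion factor and generally need hand-built mapping rules.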
                                 Attributes may also differ in abstraction level, where an attribute in one system
                               is recorded at, say, a lower abstraction level than the “same” attribute in another.
                               For example, the total sales in one database may refer to one branch of AllElectronics,
                               while an attribute of the same name in another database may refer to the total sales
                               for AllElectronics stores in a given region. The topic of discrepancy detection is further
                               described in Section 3.2.3 on data cleaning as a process.
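One way to reconcile two such attributes is to roll the lower-level values up to the higher level before comparing them. The following is a rough sketch, assuming branch-level records carry a region label; the record layout, the sample figures, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical branch-level records: (branch, region, total_sales).
branch_sales = [
    ("B1", "West", 120.0),
    ("B2", "West", 80.0),
    ("B3", "East", 50.0),
]

# Hypothetical region-level totals taken from a second database.
region_sales = {"West": 200.0, "East": 55.0}


def roll_up(records):
    """Aggregate branch-level sales to one total per region."""
    totals = defaultdict(float)
    for _branch, region, sales in records:
        totals[region] += sales
    return dict(totals)


def find_discrepancies(rolled, reported, tolerance=1e-6):
    """Return regions whose rolled-up and reported totals disagree.

    A region missing from either side is also reported, with None
    standing in for the missing total.
    """
    result = {}
    for region in sorted(set(rolled) | set(reported)):
        ours, theirs = rolled.get(region), reported.get(region)
        if ours is None or theirs is None or abs(ours - theirs) > tolerance:
            result[region] = (ours, theirs)
    return result


print(find_discrepancies(roll_up(branch_sales), region_sales))
# {'East': (50.0, 55.0)}
```

Comparing the raw attributes without the roll-up step would flag every row as a conflict, even though the two databases may agree once they are brought to the same abstraction level.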

                       3.4     Data Reduction



                               Imagine that you have selected data from the AllElectronics data warehouse for analysis.
                               The data set will likely be huge! Complex data analysis and mining on huge amounts of
                               data can take a long time, making such analysis impractical or infeasible.
                                 Data reduction techniques can be applied to obtain a reduced representation of the
                               data set that is much smaller in volume, yet closely maintains the integrity of the original
                               data. That is, mining on the reduced data set should be more efficient yet produce the
                               same (or almost the same) analytical results. In this section, we first present an overview
                               of data reduction strategies, followed by a closer look at individual techniques.
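As a rough illustration of the idea that a reduced data set can yield almost the same analytical results, the sketch below draws a simple random sample from a synthetic numeric attribute and compares the mean of the sample with the mean of the full data. Sampling is used here only as a convenient stand-in for data reduction in general; the data, the sample size, and the function name are assumptions made for this example.

```python
import random
import statistics


def reduce_by_sampling(data, n, seed=0):
    """Return a simple random sample of n items, drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(data, n)


# Synthetic stand-in for a large numeric attribute (e.g., a sales amount).
rng = random.Random(42)
full = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

# Keep 1% of the tuples.
reduced = reduce_by_sampling(full, 1_000)

print(len(full), len(reduced))
print(statistics.mean(full), statistics.mean(reduced))
```

The two means should be close even though the reduced set is a hundred times smaller. How close is close enough depends on the analysis being run, and other statistics (for example, extreme values) may be preserved less well by a random sample.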


                         3.4.1 Overview of Data Reduction Strategies
                               Data reduction strategies include dimensionality reduction, numerosity reduction, and
                               data compression.
                                 Dimensionality reduction is the process of reducing the number of random variables
                               or attributes under consideration. Dimensionality reduction methods include wavelet