Page 137 -
P. 137

10-ch03-083-124-9780123814791
                         HAN

          100   Chapter 3 Data Preprocessing                2011/6/1  3:16 Page 100  #18



                         transforms (Section 3.4.2) and principal components analysis (Section 3.4.3), which
                         transform or project the original data onto a smaller space. Attribute subset selection is a
                         method of dimensionality reduction in which irrelevant, weakly relevant, or redundant
                         attributes or dimensions are detected and removed (Section 3.4.4).
                           Numerosity reduction techniques replace the original data volume by alternative,
                         smaller forms of data representation. These techniques may be parametric or non-
                         parametric. For parametric methods, a model is used to estimate the data, so that
                         typically only the data parameters need to be stored, instead of the actual data. (Out-
                         liers may also be stored.) Regression and log-linear models (Section 3.4.5) are examples.
                         Nonparametric methods for storing reduced representations of the data include his-
                         tograms (Section 3.4.6), clustering (Section 3.4.7), sampling (Section 3.4.8), and data
                         cube aggregation (Section 3.4.9).
                           In data compression, transformations are applied so as to obtain a reduced or “com-
                         pressed” representation of the original data. If the original data can be reconstructed
                         from the compressed data without any information loss, the data reduction is called
                         lossless. If, instead, we can reconstruct only an approximation of the original data, then
                         the data reduction is called lossy. There are several lossless algorithms for string com-
                         pression; however, they typically allow only limited data manipulation. Dimensionality
                         reduction and numerosity reduction techniques can also be considered forms of data
                         compression.
                           There are many other ways of organizing methods of data reduction. The computa-
                         tional time spent on data reduction should not outweigh or “erase” the time saved by
                         mining on a reduced data set size.


                   3.4.2 Wavelet Transforms

                         The discrete wavelet transform (DWT) is a linear signal processing technique that,
                                                                                           0
                         when applied to a data vector X, transforms it to a numerically different vector, X , of
                         wavelet coefficients. The two vectors are of the same length. When applying this tech-
                         nique to data reduction, we consider each tuple as an n-dimensional data vector, that
                         is, X = (x 1 ,x 2 ,...,x n ), depicting n measurements made on the tuple from n database
                         attributes. 3
                           “How can this technique be useful for data reduction if the wavelet transformed data are
                         of the same length as the original data?” The usefulness lies in the fact that the wavelet
                         transformed data can be truncated. A compressed approximation of the data can be
                         retained by storing only a small fraction of the strongest of the wavelet coefficients.
                         For example, all wavelet coefficients larger than some user-specified threshold can be
                         retained. All other coefficients are set to 0. The resulting data representation is therefore
                         very sparse, so that operations that can take advantage of data sparsity are computa-
                         tionally very fast if performed in wavelet space. The technique also works to remove
                         noise without smoothing out the main features of the data, making it effective for data

                         3 In our notation, any variable representing a vector is shown in bold italic font; measurements depicting
                         the vector are shown in italic font.
   132   133   134   135   136   137   138   139   140   141   142