


specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimated line. Multiple linear regression is an extension of (simple) linear regression that allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
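
For illustration, the following minimal sketch (in Python with NumPy; the data values are invented for this example) fits the least-squares line y = wx + b to a set of points. Adding further predictor columns to the matrix A would extend it to multiple linear regression.

```python
import numpy as np

# Invented sample data: x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Stack a column of ones so the y-intercept b is fit along with the slope w.
A = np.column_stack([x, np.ones_like(x)])

# Method of least squares: minimizes the squared error between y and A @ [w, b].
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"y = {w:.2f}x + {b:.2f}")
```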
Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction (since the lower-dimensional points together typically occupy less space than the original data points) and data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variations than the estimates in the higher-dimensional space).
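
As a minimal sketch of this idea, consider the simplest log-linear model, the independence model, which estimates each cell probability of a two-attribute table from the product of the one-dimensional marginals. The tuples below are invented for illustration.

```python
from collections import Counter
from itertools import product

# Invented discretized tuples over two attributes.
tuples = [("low", "x"), ("low", "x"), ("low", "y"),
          ("high", "x"), ("high", "y"), ("high", "y")]
n = len(tuples)

# Lower-dimensional spaces: the one-dimensional marginal counts.
count_a = Counter(a for a, _ in tuples)
count_b = Counter(b for _, b in tuples)

# Independence model: estimate p(a, b) as the product of the marginals,
# reconstructing the 2-D distribution from the two 1-D distributions.
for a, b in product(count_a, count_b):
    est = (count_a[a] / n) * (count_b[b] / n)
    print(f"p({a}, {b}) ~ {est:.3f}")
```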
Regression and log-linear models can both be used on sparse data, although their application may be limited. While both methods can handle skewed data, regression does exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
Several software packages exist to solve regression problems. Examples include SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com). Another useful resource is the book Numerical Recipes in C, by Press, Teukolsky, Vetterling, and Flannery [PTVF07], and its associated source code.


3.4.6 Histograms

Histograms use binning to approximate data distributions and are a popular form of data reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.

Example 3.3 Histograms. The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute. In Figure 3.8, each bucket represents a different $10 range for price.
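
A short Python sketch of both bucket types for these data follows; the $10 bucket boundaries are assumed to be $1–$10, $11–$20, and $21–$30, consistent with the ranges shown in Figure 3.8.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Singleton buckets: one value/frequency pair per distinct price (Figure 3.7).
print(sorted(Counter(prices).items()))

# Equal-width buckets: each bucket spans a $10 range (Figure 3.8).
buckets = Counter((p - 1) // 10 for p in prices)
for k in sorted(buckets):
    print(f"${10 * k + 1}-${10 * (k + 1)}: {buckets[k]} items")
```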