Page 154 - Statistics and Data Analysis in Geology

Analysis of Multivariate Data

             definitional equation for the sums of products (Eq. 2.23; p. 40) rather than with the
             computational form for correlation given in Equation (2.28). This is because
             Equation (2.28) involves squaring the quantities $\sum x_j$ and $\sum x_k$. If these sums are large,
             the squares may be inaccurate because of truncation. This problem is avoided if
             the means are subtracted from each observation prior to calculation of the sums
             of squares. The sums of squares are then found by Equations (2.19) and (2.23).
             This process requires that the data be handled twice: first to calculate the means,
             and then to subtract the means during the calculations. Although this involves a
             significant increase in labor if computations are performed by hand, the additional
             effort is trivial on a digital computer. Also, the resulting coefficients must be
             “unstandardized” if they are to be used in a predictive equation with raw data. However,
             these disadvantages are more than offset by the increased stability and accuracy of
             the matrix solution, and the standardized coefficients provide a way of assessing
             the importance of individual variables in the regression. Partial regression
             coefficients can be derived from the standardized partial regression coefficients by the
             transformation
             \[ b_k = B_k \frac{s_y}{s_k} \tag{6.10} \]
             The constant term, $b_0$, can be found by
             \[ b_0 = \bar{Y} - \sum_{k=1}^{m} b_k \bar{X}_k \tag{6.11} \]
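
The truncation problem raised at the start of this passage can be illustrated with a minimal sketch. It compares a one-pass "computational" sum of squares with the two-pass form in which the mean is subtracted first. The function names and the data (ten values riding on a large offset) are hypothetical and chosen only to make the effect visible in double precision; the one-pass function is the sum-of-squares analogue of the computational form, not a transcription of Equation (2.28).

```python
def corrected_ss_one_pass(x):
    """Computational form: sum(x^2) - (sum x)^2 / n, one pass over the data."""
    n = len(x)
    sum_x = sum(x)
    sum_x2 = sum(v * v for v in x)
    return sum_x2 - sum_x * sum_x / n


def corrected_ss_two_pass(x):
    """Definitional form: find the mean first, then sum the squared deviations."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x)


# Hypothetical data: the values 0..9 added to a large constant offset.
offset = 1.0e9
x = [offset + i for i in range(10)]

one_pass = corrected_ss_one_pass(x)   # both large terms are rounded before subtraction
two_pass = corrected_ss_two_pass(x)   # deviations are small, so no precision is lost

print("one pass:", one_pass)
print("two pass:", two_pass)
```

The corrected sum of squares of 0, 1, ..., 9 is 82.5 regardless of the offset. The two-pass version recovers it, whereas the one-pass version subtracts two numbers near 10^19 whose low-order digits have already been lost.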

                 Although the various sums of squares change if the data are standardized (i.e.,
             the correlation form of the matrix equation is used), the ratios of the sums of
             squares remain the same. Therefore, tests of significance based on standardized
             regression are identical to those based on an unstandardized regression. Quantities
             such as the coefficient of multiple correlation ($R$) and percentage of goodness of fit
             (100% $R^2$) also remain unchanged.
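
As a rough check on these statements, the sketch below fits a small regression twice: once on raw data through the ordinary normal equations, and once in correlation form. It then converts the standardized coefficients with Equations (6.10) and (6.11) and compares the two values of $R^2$. The data, the helper names (`solve`, `mean`, `std`, `corr`), and the use of two predictors are all assumptions made for illustration; the standardized $R^2$ is computed from the standard identity $R^2 = \sum_k B_k r_{ky}$.

```python
import math


def solve(a, rhs):
    """Solve a small, well-conditioned linear system by Gaussian elimination."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mean(v):
    return sum(v) / len(v)


def std(v):
    mu = mean(v)
    return math.sqrt(sum((t - mu) ** 2 for t in v) / (len(v) - 1))


def corr(u, v):
    mu, mv = mean(u), mean(v)
    cross = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cross / ((len(u) - 1) * std(u) * std(v))


# Hypothetical data: two predictors and one response, six observations.
xs = [[2.0, 4.0, 5.0, 7.0, 8.0, 10.0], [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]]
y = [3.1, 6.8, 7.2, 11.9, 11.5, 15.7]

# Raw fit: normal equations with a column of ones for the constant term.
cols = [[1.0] * len(y)] + xs
xtx = [[sum(p * q for p, q in zip(u, v)) for v in cols] for u in cols]
xty = [sum(p * q for p, q in zip(u, y)) for u in cols]
b_raw = solve(xtx, xty)

# Standardized fit: correlation form of the matrix equation.
r_xx = [[corr(u, v) for v in xs] for u in xs]
r_xy = [corr(u, y) for u in xs]
B = solve(r_xx, r_xy)

# Eq. (6.10): b_k = B_k * s_y / s_k;  Eq. (6.11): b_0 from the means.
b = [Bk * std(y) / std(xk) for Bk, xk in zip(B, xs)]
b0 = mean(y) - sum(bk * mean(xk) for bk, xk in zip(b, xs))
b_converted = [b0] + b

# R^2 from the raw fit (1 - SSE/SST) and from the standardized fit.
fitted = [sum(c * col[i] for c, col in zip(b_raw, cols)) for i in range(len(y))]
ss_total = sum((v - mean(y)) ** 2 for v in y)
ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
r2_raw = 1.0 - ss_resid / ss_total
r2_std = sum(Bk * rk for Bk, rk in zip(B, r_xy))

print("raw coefficients:      ", b_raw)
print("converted coefficients:", b_converted)
print("R^2 (raw, standardized):", r2_raw, r2_std)
```

Under these assumptions the converted coefficients agree with the raw fit to rounding error, and the two values of $R^2$ coincide, which is why an F-test built from them gives the same result in either form.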
                 We can compare the partial regression coefficients between basin magnitude
             and the other six basin properties in both raw and standardized form:

                     b' = [ -2.244  0.005  0.226  -0.233  0.063  -0.002  -0.117 ]
                     B' = [  0.000  0.049  0.284  -0.458  0.975  -0.120  -0.163 ]
                 Although the standardized partial regression coefficients suggest that the
             basin properties having the most pronounced relationship with basin magnitude
             are $x_2$ (relief), $x_3$ (area), and $x_4$ (stream length), these values do not take into
             account the uncertainty associated with each estimated parameter. The easiest way to
             consider this aspect is by expanding the analysis of variance to test the significance
             of each independent variable.
                 The sum of squares attributable to a single variable, $X_j$, can be determined
             by calculating $SS_{R(m)}$ for the regression with all $m$ variables, calculating $SS_{R(m-1)}$,
             which is the sum of squares for regression using all variables except the $j$th variable,
             then finding the difference. This process can be repeated for each independent
             variable in turn, in order to assess the contribution that each makes to the total
             regression. Fortunately, there is an easier way to calculate the individual
             regression sums of squares, which simply requires dividing the square of each partial
             regression coefficient by the diagonal element of $S_{XX}^{-1}$ that corresponds to that
             variable. If we designate $C_{XX} = S_{XX}^{-1}$, with diagonal elements $c_{jj}$, then

             \[ SS_{R(m)} - SS_{R(m-1)} = \frac{b_j^2}{c_{jj}} \tag{6.12} \]
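
The equivalence of the two routes can be checked numerically. The following sketch computes the sum of squares attributable to each variable both ways: by refitting the regression without that variable and taking the difference, and by dividing the squared coefficient by the matching diagonal element of the inverse matrix. It assumes that the matrix being inverted holds the corrected sums of squares and cross products of the predictors alone (no row for the constant term). The data and the helper names (`solve`, `centered_products`, `regression_ss`) are hypothetical.

```python
def solve(a, rhs):
    """Solve a small, well-conditioned linear system by Gaussian elimination."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def centered_products(u, v):
    """Corrected sum of products: means are subtracted before multiplying."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def regression_ss(xs, y):
    """Return (coefficients, regression sum of squares, predictor matrix)."""
    s_xx = [[centered_products(u, v) for v in xs] for u in xs]
    s_xy = [centered_products(u, y) for u in xs]
    b = solve(s_xx, s_xy)
    ss_r = sum(bk * sk for bk, sk in zip(b, s_xy))
    return b, ss_r, s_xx


# Hypothetical data: three predictors and one response, eight observations.
xs = [
    [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0],
    [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0],
    [9.0, 7.0, 8.0, 4.0, 6.0, 3.0, 5.0, 1.0],
]
y = [3.1, 6.8, 7.2, 11.9, 11.5, 15.7, 18.4, 19.9]

b, ss_full, s_xx = regression_ss(xs, y)
m = len(xs)

results = []
for j in range(m):
    # c_jj is the jth diagonal element of the inverse, obtained by solving
    # against the jth unit vector instead of forming the whole inverse.
    unit = [1.0 if i == j else 0.0 for i in range(m)]
    c_jj = solve(s_xx, unit)[j]
    shortcut = b[j] ** 2 / c_jj

    # Long route: refit without variable j and take the difference.
    _, ss_reduced, _ = regression_ss(xs[:j] + xs[j + 1:], y)
    by_difference = ss_full - ss_reduced

    results.append((shortcut, by_difference))

for j, (shortcut, by_difference) in enumerate(results, start=1):
    print(f"variable {j}: shortcut = {shortcut:.6f}, difference = {by_difference:.6f}")
```

For these data the two columns agree to rounding error, while the shortcut needs only one matrix inversion rather than a separate regression for each variable.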
