Page 154 - Statistics and Data Analysis in Geology
Analysis of Multivariate Data
definitional equation for the sums of products (Eq. 2.23; p. 40) rather than with the
computational form for correlation given in Equation (2.28). This is because
Equation (2.28) involves squaring the quantities $\sum x_j$ and $\sum x_k$. If these sums are large,
the squares may be inaccurate because of truncation. This problem is avoided if
the means are subtracted from each observation prior to calculation of the sums
of squares. The sums of squares and products are then found by Equations (2.19) and (2.23).
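The effect of truncation can be seen in a small numerical sketch. The function names below are illustrative and do not come from the text, and the data are invented so that the exact corrected sum of squares is 90. The first function works from raw sums in the manner of a computational form; the second subtracts the means first.

```python
def sp_computational(x, y):
    """Corrected sum of products from raw sums: sum(xy) - sum(x)*sum(y)/n."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n


def sp_two_pass(x, y):
    """Corrected sum of products after subtracting the means (two passes)."""
    n = len(x)
    mean_x = sum(x) / n  # first pass: the means
    mean_y = sum(y) / n
    # second pass: products of deviations from the means
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))


# Invented data: small deviations (4, 7, 13, 16) riding on a large offset.
x = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

ss_two_pass = sp_two_pass(x, x)       # deviations -6, -3, 3, 6 give 36 + 9 + 9 + 36
ss_one_pass = sp_computational(x, x)  # squares near 1e18 are rounded before subtraction

print(ss_two_pass, ss_one_pass)
```

With these values the two-pass result is exactly 90, while the one-pass result is not, because each square is rounded in double precision before the large terms are subtracted.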
This process requires that the data be handled twice: first to calculate the means,
and then to subtract them during the calculations. Although this involves a
significant increase in labor if computations are performed by hand, the additional
effort is trivial on a digital computer. Also, the resulting coefficients must be
“unstandardized” if they are to be used in a predictive equation with raw data. However,
these disadvantages are more than offset by the increased stability and accuracy of
the matrix solution, and the standardized coefficients provide a way of assessing
the importance of individual variables in the regression. Partial regression
coefficients can be derived from the standardized partial regression coefficients by the
transformation
$$ b_k = B_k \frac{s_y}{s_k} \qquad (6.10) $$
The constant term, $b_0$, can be found by
$$ b_0 = \bar{Y} - \sum_{k=1}^{m} b_k \bar{X}_k \qquad (6.11) $$
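The conversion from standardized to raw coefficients can be sketched as follows. This is a minimal illustration, assuming only two independent variables so that the correlation-form equations can be solved in closed form. The function names and the data are invented for the example; y is constructed exactly as 3 + 2 x1 - 0.5 x2 so that the recovered coefficients can be checked.

```python
from math import sqrt


def mean(v):
    return sum(v) / len(v)


def sp(u, v):
    """Corrected sum of products of u and v (means subtracted first)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def raw_from_standardized(x1, x2, y):
    """Fit in correlation form, then convert to raw partial regression coefficients."""
    n = len(y)
    # standard deviations of the two predictors and of y
    s1, s2, sy = (sqrt(sp(v, v) / (n - 1)) for v in (x1, x2, y))
    # correlations among the variables
    r12 = sp(x1, x2) / sqrt(sp(x1, x1) * sp(x2, x2))
    r1y = sp(x1, y) / sqrt(sp(x1, x1) * sp(y, y))
    r2y = sp(x2, y) / sqrt(sp(x2, x2) * sp(y, y))
    # standardized coefficients: solve the 2 x 2 correlation-form equations
    det = 1.0 - r12 * r12
    B1 = (r1y - r12 * r2y) / det
    B2 = (r2y - r12 * r1y) / det
    # raw coefficients: b_k = B_k * s_y / s_k
    b1 = B1 * sy / s1
    b2 = B2 * sy / s2
    # constant term: mean of y minus each coefficient times its predictor mean
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


# Invented data; y is built exactly as 3 + 2*x1 - 0.5*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
y = [3.0 + 2.0 * a - 0.5 * b for a, b in zip(x1, x2)]

b0, b1, b2 = raw_from_standardized(x1, x2, y)
print(b0, b1, b2)
```

Because y is an exact linear function of the two predictors, the recovered values should agree with 3, 2, and -0.5 to within rounding error.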
Although the various sums of squares change if the data are standardized (i.e.,
the correlation form of the matrix equation is used), the ratios of the sums of
squares remain the same. Therefore, tests of significance based on standardized
regression are identical to those based on an unstandardized regression. Quantities
such as the coefficient of multiple correlation (R) and percentage of goodness of fit
(100% $R^2$) also remain unchanged.
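The invariance of the ratios can be checked numerically. The sketch below, again restricted to two independent variables for brevity, computes $R^2$ once from the raw sums of squares and products and once from the correlation form. The function names and data are invented for illustration.

```python
from math import sqrt


def sp(u, v):
    """Corrected sum of products of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def r2_raw(x1, x2, y):
    """R^2 as SS_R / SS_T from the raw sums of squares and products."""
    s11, s12, s22 = sp(x1, x1), sp(x1, x2), sp(x2, x2)
    s1y, s2y = sp(x1, y), sp(x2, y)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    ss_r = b1 * s1y + b2 * s2y  # sum of squares due to regression
    return ss_r / sp(y, y)      # divided by the total sum of squares


def r2_standardized(x1, x2, y):
    """R^2 from the correlation form, where SS_T is 1 in correlation units."""
    r12 = sp(x1, x2) / sqrt(sp(x1, x1) * sp(x2, x2))
    r1y = sp(x1, y) / sqrt(sp(x1, x1) * sp(y, y))
    r2y = sp(x2, y) / sqrt(sp(x2, x2) * sp(y, y))
    det = 1.0 - r12 * r12
    B1 = (r1y - r12 * r2y) / det
    B2 = (r2y - r12 * r1y) / det
    return B1 * r1y + B2 * r2y  # SS_R in correlation units


# Invented data with some scatter about a linear trend.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
y = [4.1, 5.9, 7.2, 9.8, 10.1, 11.0]

r2_a = r2_raw(x1, x2, y)
r2_b = r2_standardized(x1, x2, y)
print(r2_a, r2_b)
```

The two values should agree to within rounding error. Since the F-ratio for the regression can be written in terms of $R^2$, the number of variables, and the number of observations alone, it is unchanged as well.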
We can compare the partial regression coefficients between basin magnitude
and the other six basin properties in both raw and standardized form:
b' = [ -2.244   0.005   0.226  -0.233   0.063  -0.002  -0.117 ]
B' = [  0.000   0.049   0.284  -0.458   0.975  -0.120  -0.163 ]
Although the standardized partial regression coefficients suggest that the
basin properties having the most pronounced relationship with basin magnitude
are $x_2$ (relief), $x_3$ (area), and $x_4$ (stream length), these values do not take into
account the uncertainty associated with each estimated parameter. The easiest way to
consider this aspect is by expanding the analysis of variance to test the significance
of each independent variable.
The sum of squares attributable to a single variable, $X_j$, can be determined
by calculating $SS_{R(m)}$ for the regression with all m variables, calculating $SS_{R(m-1)}$,
which is the sum of squares for regression using all variables except the jth variable,
then finding the difference. This process can be repeated for each independent
variable in turn, in order to assess the contribution that each makes to the total
regression. Fortunately, there is an easier way to calculate the individual regression
sums of squares, which simply requires dividing the square of each partial
regression coefficient by the diagonal element of $S_{XX}^{-1}$ that corresponds to that
variable. If we designate $C_{XX} = S_{XX}^{-1}$, then
$$ SS_{b_j} = \frac{b_j^2}{c_{jj}} \qquad (6.12) $$
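The agreement between the two routes, taking the difference of regression sums of squares versus dividing each squared coefficient by the matching diagonal element of the inverse matrix, can be verified with a short sketch. It assumes two independent variables so that the inverse of the 2 x 2 matrix of corrected sums of squares and products can be written out directly. The function names and data are invented for the example.

```python
def sp(u, v):
    """Corrected sum of products of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def extra_sums_of_squares(x1, x2, y):
    """Sum of squares attributable to each variable, computed two ways."""
    s11, s12, s22 = sp(x1, x1), sp(x1, x2), sp(x2, x2)
    s1y, s2y = sp(x1, y), sp(x2, y)
    det = s11 * s22 - s12 * s12
    # partial regression coefficients from the 2 x 2 normal equations
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    # diagonal elements of the inverse of the sums of squares and products matrix
    c11 = s22 / det
    c22 = s11 / det
    # route 1: full regression minus the regression that omits variable j
    ss_full = b1 * s1y + b2 * s2y
    ss_without_1 = s2y ** 2 / s22  # regression on x2 alone
    ss_without_2 = s1y ** 2 / s11  # regression on x1 alone
    by_difference = (ss_full - ss_without_1, ss_full - ss_without_2)
    # route 2: squared coefficient divided by the diagonal element
    by_shortcut = (b1 ** 2 / c11, b2 ** 2 / c22)
    return by_difference, by_shortcut


# Invented data with some scatter about a linear trend.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
y = [4.1, 5.9, 7.2, 9.8, 10.1, 11.0]

by_difference, by_shortcut = extra_sums_of_squares(x1, x2, y)
print(by_difference)
print(by_shortcut)
```

The two tuples should match to within rounding error. For more than two variables the same check would need a general matrix inverse in place of the closed-form 2 x 2 expressions used here.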