Page 336 - Statistics for Environmental Engineers
39
The Coefficient of Determination, $R^2$
KEY WORDS coefficient of determination, coefficient of multiple correlation, confidence interval, F
ratio, happenstance data, lack of fit, linear regression, nested model, null model, prediction interval, pure
error, $R^2$, repeats, replication, regression, regression sum of squares, residual sum of squares, spurious
correlation.
Regression analysis is so easy to do that one of the best-known statistics is the coefficient of
determination, $R^2$. Anderson-Sprecher (1994) calls it “…a measure many statisticians love to hate.”
Every scientist knows that $R^2$ is the coefficient of determination and that it is the proportion of the
total variability in the dependent variable that is explained by the regression equation. This is so
seductively simple that we often assume that a high $R^2$ signifies a useful regression equation and that
a low $R^2$ signifies the opposite. We may even assume further that a high $R^2$ indicates that the observed
relation between independent and dependent variables is true and can be used to predict new conditions.
Life is not this simple. Some examples will help us understand what $R^2$ really reveals about how well
the model fits the data and what important information can be overlooked if too much reliance is placed
on the interpretation of $R^2$.
What Does “Explained” Mean?
Caution is recommended in interpreting the phrase “$R^2$ explains the variation in the dependent variable.”
$R^2$ is the proportion of variation in a variable Y that can be accounted for by fitting Y to a particular
model instead of viewing the variable in isolation. $R^2$ does not explain anything in the sense that “Aha!
Now we know why the response indicated by y behaves the way we have observed in this set of data.”
If the data are from a well-designed controlled experiment, with proper replication and randomization,
it is reasonable to infer that a significant association of the variation in y with variation in the level of
x is a causal effect of x. If the data are observational, what Box (1966) calls happenstance data,
there is a high risk of a causal interpretation being wrong. With observational data there can be many
reasons for associations among variables, only one of which is causality.
A value of $R^2$ is not just a rescaled measure of variation. It is a comparison between two models. One
of the models is usually referred to as “the model.” The other model, the null model, is rarely
mentioned. The null model ($y = \beta_0$) provides the reference for comparison. This model describes a
horizontal line at the level of the mean of the y values, which is the simplest possible model that could
be fitted to any set of data.
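As a brief check of why the null model sits at the mean, consider choosing a constant $b_0$ (a symbol introduced here for the least-squares estimate of $\beta_0$, not notation from the text) to minimize the sum of squared deviations of the $n$ observations:

\[
S(b_0) = \sum_{i=1}^{n} (y_i - b_0)^2, \qquad
\frac{dS}{db_0} = -2 \sum_{i=1}^{n} (y_i - b_0) = 0
\;\Rightarrow\; b_0 = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}.
\]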
• The model ($y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + e_i$) has residual sum of squares $\sum (y_i - \hat{y}_i)^2 = \mathrm{RSS}_{\text{model}}$.
• The null model ($y_i = \beta_0 + e_i$) has residual sum of squares $\sum (y_i - \bar{y})^2 = \mathrm{RSS}_{\text{null model}}$.
The comparison of the residual sums of squares (RSS) defines:
\[
R^2 = 1 - \frac{\mathrm{RSS}_{\text{model}}}{\mathrm{RSS}_{\text{null model}}}
\]
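The comparison of the two residual sums of squares can be sketched numerically. The following is a minimal illustration, not the authors' procedure: the data are made up, the function name `r_squared` is hypothetical, and a straight-line fit with `numpy.polyfit` stands in for "the model."

```python
import numpy as np


def r_squared(y, y_hat):
    """Return 1 - RSS_model / RSS_null, where the null model is the mean of y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss_model = np.sum((y - y_hat) ** 2)    # residuals about the fitted model
    rss_null = np.sum((y - y.mean()) ** 2)  # residuals about the horizontal line at the mean
    return 1.0 - rss_model / rss_null


# Made-up illustrative data (an assumption, not taken from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares straight line; polyfit returns the slope first, then the intercept
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

r2 = r_squared(y, y_hat)
r2_null = r_squared(y, np.full_like(y, y.mean()))  # null model compared with itself

print(f"R^2 for the straight line: {r2:.3f}")
print(f"R^2 for the null model:    {r2_null:.3f}")
```

For these numbers RSS for the fitted line is 2.4 and RSS for the null model is 6.0, so $R^2$ = 1 - 2.4/6.0 = 0.6. Fitting the null model itself gives $R^2$ = 0, which shows that the statistic measures improvement over the horizontal line at the mean and nothing more.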
© 2002 By CRC Press LLC

