Page 336 - Statistics for Environmental Engineers





39

The Coefficient of Determination, R²






KEY WORDS coefficient of determination, coefficient of multiple correlation, confidence interval, F ratio, happenstance data, lack of fit, linear regression, nested model, null model, prediction interval, pure error, R², repeats, replication, regression, regression sum of squares, residual sum of squares, spurious correlation.

Regression analysis is so easy to do that one of the best-known statistics is the coefficient of determination, R². Anderson-Sprecher (1994) calls it “…a measure many statisticians love to hate.”
Every scientist knows that R² is the coefficient of determination and R² is that proportion of the total variability in the dependent variable that is explained by the regression equation. This is so seductively simple that we often assume that a high R² signifies a useful regression equation and that a low R² signifies the opposite. We may even assume further that high R² indicates that the observed relation between independent and dependent variables is true and can be used to predict new conditions.
Life is not this simple. Some examples will help us understand what R² really reveals about how well the model fits the data and what important information can be overlooked if too much reliance is placed on the interpretation of R².

                       What Does “Explained” Mean?
Caution is recommended in interpreting the phrase “R² explains the variation in the dependent variable.” R² is the proportion of variation in a variable Y that can be accounted for by fitting Y to a particular model instead of viewing the variable in isolation. R² does not explain anything in the sense that “Aha! Now we know why the response indicated by y behaves the way we have observed in this set of data.” If the data are from a well-designed controlled experiment, with proper replication and randomization, it is reasonable to infer that a significant association of the variation in y with variation in the level of x is a causal effect of x. If the data are observational, what Box (1966) calls happenstance data, there is a high risk of a causal interpretation being wrong. With observational data there can be many reasons for associations among variables, only one of which is causality.
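
One way to see the risk with happenstance data is a small simulation. The sketch below is illustrative only: the lurking variable `z`, the sample size, the coefficients, and the noise levels are all assumptions chosen for the demonstration, not values from the text. Here `y` is generated without any reference to `x`, yet the straight-line fit of `y` on `x` gives a high R² because both depend on `z`.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

n = 200
# z is a hypothetical lurking variable that drives both x and y.
z = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [zi + random.gauss(0.0, 0.3) for zi in z]
# y depends on z only; x plays no part in generating y.
y = [2.0 * zi + random.gauss(0.0, 0.3) for zi in z]


def r_squared_line(x, y):
    """R^2 for a least-squares straight line of y on x.

    For a one-predictor linear model this equals the squared
    sample correlation between x and y.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


r2 = r_squared_line(x, y)
print(f"R^2 of y regressed on x: {r2:.3f}")
```

A high value here says only that x and y vary together in this data set. Changing x by intervention would leave y unchanged, because y was built from z alone.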
A value of R² is not just a rescaled measure of variation. It is a comparison between two models. One of the models is usually referred to as the model. The other model, the null model, is seldom mentioned. The null model (y = β₀) provides the reference for comparison. This model describes a horizontal line at the level of the mean of the y values, which is the simplest possible model that could be fitted to any set of data.

    • The model ($y = \beta_0 + \beta_1 x + \beta_2 x + \dots + e_i$) has residual sum of squares $\sum (y_i - \hat{y})^2 = RSS_{\text{model}}$.
    • The null model ($y = \beta_0 + e_i$) has residual sum of squares $\sum (y_i - \bar{y})^2 = RSS_{\text{null model}}$.
                       The comparison of the residual sums of squares (RSS) defines:

$$R^2 = 1 - \frac{RSS_{\text{model}}}{RSS_{\text{null model}}}$$
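
The comparison of the two residual sums of squares can be sketched in a few lines of Python. This is a minimal illustration assuming a straight-line model with one predictor. The function names `fit_line` and `r_squared` and the five data points are hypothetical and do not come from the text.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y):
    """R^2 = 1 - RSS_model / RSS_null for a straight-line model."""
    b0, b1 = fit_line(x, y)
    # Residual sum of squares about the fitted line (the model).
    rss_model = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual sum of squares about the mean of y (the null model).
    y_bar = mean(y)
    rss_null = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss_model / rss_null


# Hypothetical illustrative data, not taken from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(x, y)
print(round(r2, 4))
```

Written this way, R² is near 1 when the fitted line leaves much less residual variation than the horizontal line at the mean, and near 0 when the line does no better than the null model.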

                       © 2002 By CRC Press LLC