Page 337 - Statistics for Environmental Engineers









This shows that R² is a model comparison and that a large R² measures only how much the model improves on the null model. It does not indicate how good the model is in any absolute sense. Consequently, the common belief that a large R² demonstrates model adequacy is sometimes wrong.
The definition of R² also shows that comparisons are made only between nested models. The concept of proportionate reduction in variation is untrustworthy unless one model is a special case of the other. This means that R² cannot be used to compare models with an intercept to models that have no intercept: y = β₀ is not a reduction of the model y = β₁x. It is a reduction of y = β₀ + β₁x and of y = β₀ + β₁x + β₂x².
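The proportionate-reduction idea can be made concrete with a short numerical sketch. The code below (a minimal example using numpy; the data are made up for illustration) fits the null model y = β₀ and the nested model y = β₀ + β₁x, and computes R² as the fraction by which the fuller model reduces the null model's sum of squared errors:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Null model y = b0: the least-squares fit is simply the mean of y.
sse_null = float(np.sum((y - y.mean()) ** 2))

# Nested model y = b0 + b1*x: ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)
sse_full = float(np.sum((y - (b0 + b1 * x)) ** 2))

# R^2 is the proportionate reduction in SSE relative to the null model.
# The comparison is meaningful only because y = b0 is a special case
# (b1 = 0) of y = b0 + b1*x, i.e., the models are nested.
r2 = 1.0 - sse_full / sse_null
print(f"R^2 = {r2:.3f}")
```

For a no-intercept model y = β₁x the same formula still produces a number, but it is no longer a reduction against the nested null model y = β₀, which is why such values should not be compared with R² from models containing an intercept.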

A High R² Does Not Assure a Valid Relation
Figure 39.1 shows a regression with R² = 0.746, which is statistically significant at almost the 1% level (a 1% chance of concluding significance when there is no true relation). This might be impressive until one knows the source of the data: X is the first six digits of pi, and Y is the first six Fibonacci numbers. There is no true relation between x and y. The linear regression equation has no predictive value (the seventh digit of pi does not predict the seventh Fibonacci number).
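The numbers in Figure 39.1 are easy to reproduce. The sketch below (assuming numpy is available) regresses the first six Fibonacci numbers on the first six digits of pi and recovers the fitted line and R² quoted above:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])  # first six digits of pi
y = np.array([1.0, 1.0, 2.0, 3.0, 5.0, 8.0])  # first six Fibonacci numbers

b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
r2 = 1.0 - np.sum((y - (b0 + b1 * x)) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"Y = {b0:.2f} + {b1:.2f}X, R^2 = {r2:.3f}")
# prints Y = 0.31 + 0.79X, R^2 = 0.746
```

A "significant" fit, yet the equation has no predictive value whatsoever.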
Anscombe (1973) published a famous and fascinating example of how R² and other statistics that are routinely computed in regression analysis can fail to reveal the important features of the data. Table 39.1


FIGURE 39.1 An example of nonsense in regression. X is the first six digits of pi and Y is the first six Fibonacci numbers (fitted line Y = 0.31 + 0.79X, R² = 0.746). R² is high although there is no actual relation between x and y.


                                       TABLE 39.1
                                       Anscombe’s Four Data Sets
                                           A           B            C           D
                                        x      y    x      y     x     y     x      y
                                       10.0   8.04  10.0  9.14  10.0  7.46   8.0   6.58
                                        8.0   6.95  8.0   8.14  8.0   6.77   8.0   5.76
                                       13.0   7.58  13.0  8.74  13.0  12.74  8.0   7.71
                                        9.0   8.81  9.0   8.77  9.0   7.11   8.0   8.84
                                       11.0   8.33  11.0  9.26  11.0  7.81   8.0   8.47
                                       14.0   9.96  14.0  8.10  14.0  8.84   8.0   7.04
                                        6.0   7.24  6.0   6.13  6.0   6.08   8.0   5.25
                                        4.0   4.26  4.0   3.10  4.0   5.39  19.0  12.50
                                       12.0  10.84  12.0  9.13  12.0  8.15   8.0   5.56
                                        7.0   4.82  7.0   7.26  7.0   6.42   8.0   7.91
                                        5.0   5.68  5.0   4.74  5.0   5.73   8.0   6.89
Note: Each data set has n = 11, mean of x = 9.0, mean of y = 7.5, equation of the regression line y = 3.0 + 0.5x, standard error of estimate of the slope = 0.118 (t statistic = 4.24), regression sum of squares (corrected for mean) = 110.0, residual sum of squares = 13.75, correlation coefficient r = 0.82, and R² = 0.67.
                                       Source: Anscombe, F. J. (1973). Am. Stat., 27, 17–21.
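Anscombe's point is easy to verify. The sketch below (assuming numpy) fits each of the four data sets transcribed from Table 39.1 and prints their summary statistics:

```python
import numpy as np

x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for sets A, B, C
sets = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (xs, ys) in sets.items():
    xv, yv = np.asarray(xs, float), np.asarray(ys, float)
    b1, b0 = np.polyfit(xv, yv, 1)  # least-squares slope and intercept
    r2 = 1.0 - np.sum((yv - (b0 + b1 * xv)) ** 2) / np.sum((yv - yv.mean()) ** 2)
    results[name] = (b0, b1, r2)
    print(f"{name}: y = {b0:.1f} + {b1:.2f}x, R^2 = {r2:.2f}")
```

All four sets print y = 3.0 + 0.50x with R² = 0.67: identical summary statistics, even though plots of the four data sets look entirely different.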
                       © 2002 By CRC Press LLC