Page 342 - Statistics for Environmental Engineers

                       A Note on Description vs. Prediction
                       Is the regression useful? We have seen that a high R² does not guarantee that a regression has meaning.
                       Likewise, a low R² may indicate a statistically significant relationship between two variables although
                       the regression is not explaining much of the variation. Even less does statistically significant mean that
                       the regression will predict future observations with much accuracy. “In order for the fitted equation to
                       be regarded as a satisfactory predictor, the observed  F ratio (regression mean square/residual mean
                       square) should exceed not merely the selected percentage point of the F distribution, but several times
                       the selected percentage point. How many times depends essentially on how great a ratio (prediction
                       range/error of prediction) is specified” (Box and Wetz, 1973). Draper and Smith (1998) offer this rule-
                       of-thumb: unless the observed F for overall regression exceeds the chosen test percentage point by at
                       least a factor of four, and preferably more, the regression is unlikely to be of practical value for prediction
                       purposes. The regression in Figure 39.4 has an F ratio of 581.12/8.952 = 64.91 and would have some
                       practical predictive value.
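The rule of thumb above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the percentage point 4.96 is the tabulated upper 5% point of F for 1 and 10 degrees of freedom, chosen only as an assumption because this passage does not state the degrees of freedom for Figure 39.4.

```python
def f_ratio(regression_mean_square: float, residual_mean_square: float) -> float:
    """Observed F ratio: regression mean square / residual mean square."""
    return regression_mean_square / residual_mean_square


def useful_for_prediction(f_observed: float, f_critical: float,
                          factor: float = 4.0) -> bool:
    """Rule of thumb: observed F should be at least `factor` times
    the selected percentage point of the F distribution."""
    return f_observed >= factor * f_critical


# Mean squares quoted in the text for the regression in Figure 39.4.
f_obs = f_ratio(581.12, 8.952)

# ASSUMPTION: 1 and 10 degrees of freedom, 5% level (table value 4.96).
# Replace with the percentage point for the actual degrees of freedom.
f_crit = 4.96

print(f"observed F = {f_obs:.3f}")
print(f"multiple of percentage point = {f_obs / f_crit:.1f}")
print(f"meets factor-of-four rule: {useful_for_prediction(f_obs, f_crit)}")
```

Under the assumed percentage point, the observed F is well over four times the tabulated value, which agrees with the text's conclusion that the regression has some practical predictive value.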



                       Other Ways to Examine a Model
                       If R² does not tell all that is needed about how well a model fits the data and how good the model may
                       be for prediction, what else could be examined?
                        Graphics reveal information in data (Tufte, 1983): always examine the data and the proposed model
                       graphically. How sad if this advice were forgotten in a rush to compute some statistic like R².
                        A more useful single measure of the prediction capability of a model (including a k-variate regression
                       model) is the standard error of the estimate. The standard error of the estimate is computed from the
                       variance of the predicted value (ŷ) and it indicates the precision with which the model estimates the
                       value of the dependent variable.  This statistic is used to compute intervals that have the following
                       meanings (Hahn, 1973).
                           • The confidence interval for the dependent variable is an interval that one expects, with a
                             specified level of confidence, to contain the average value of the dependent variable at a set
                             of specified values for the independent variables.
                           • A prediction interval for the dependent variable is an interval that one expects, with a specified
                             probability, to contain a single future value of the dependent variable from the sampled
                             population at a set of specified values of the independent variables.
                           • A confidence interval around a parameter in a model (i.e., a regression coefficient) is an
                             interval that one expects, with a specified degree of confidence, to contain the true regression
                             coefficient.
                        Confidence intervals for parameter estimates and prediction intervals for the dependent variable are
                       discussed in Chapters 34 and 35. The exact method of obtaining these intervals is explained in Draper
                       and Smith (1998). They are computed by most statistics software packages.
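The three intervals defined above can be illustrated for the simplest case, a straight line fitted by least squares. This is a sketch under stated assumptions rather than the method of Draper and Smith: the data are made up for illustration, the function names are hypothetical, and the t value 2.776 is the tabulated two-sided 95% point for 4 degrees of freedom (n - 2 with n = 6).

```python
import math


def fit_line(x, y):
    """Least-squares line y = b0 + b1*x.
    Returns b0, b1, residual standard deviation s, Sxx, and mean of x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # n - 2 degrees of freedom
    return b0, b1, s, sxx, xbar


def intervals(x, y, x0, t_crit):
    """Intervals at x0: mean response, single future value, and slope."""
    n = len(x)
    b0, b1, s, sxx, xbar = fit_line(x, y)
    y_hat = b0 + b1 * x0
    leverage = 1.0 / n + (x0 - xbar) ** 2 / sxx
    half_mean = t_crit * s * math.sqrt(leverage)        # average value of y
    half_pred = t_crit * s * math.sqrt(1.0 + leverage)  # one future y
    half_slope = t_crit * s / math.sqrt(sxx)            # regression coefficient
    return {
        "mean_response": (y_hat - half_mean, y_hat + half_mean),
        "future_value": (y_hat - half_pred, y_hat + half_pred),
        "slope": (b1 - half_slope, b1 + half_slope),
    }


# Made-up illustrative data (not from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# ASSUMPTION: 95% intervals; t(0.975, 4 df) = 2.776 from a t table.
result = intervals(x, y, x0=3.5, t_crit=2.776)
for name, (low, high) in result.items():
    print(f"{name}: ({low:.3f}, {high:.3f})")
```

The prediction interval for a single future value is always wider than the confidence interval for the average value at the same x, because it adds the variance of one new observation to the variance of the fitted value.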



                       Comments
                       Widely used methods have the potential to be frequently misused. Linear regression, the most widely
                       used statistical method, can be misused or misinterpreted if one relies too much on R² as a characterization
                       of how well a model fits.
                        R² is a measure of the proportion of variation in y that is accounted for by fitting y to a particular linear
                       model instead of describing the data by calculating the mean (a horizontal straight line). A high R² does not
                       prove that a model is correct or useful. A low R² may indicate a statistically significant relation between two
                       variables although the regression has no practical predictive value. Replication dramatically improves the
                       predictive accuracy of a model, and it makes possible a formal lack-of-fit test, but it reduces the R² of the model.
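The comparison with the mean that defines this statistic can be written out directly. The sketch below assumes the usual definition, one minus the residual sum of squares divided by the total sum of squares about the mean; the function name and the three data values are hypothetical.

```python
def r_squared(y, y_fitted):
    """Proportion of variation in y accounted for by the fitted values,
    relative to describing the data by the mean alone."""
    ybar = sum(y) / len(y)
    ss_total = sum((yi - ybar) ** 2 for yi in y)                 # about the mean
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fitted))  # about the model
    return 1.0 - ss_resid / ss_total


# Made-up illustrative values (not from the book).
y = [2.0, 4.0, 6.0]

# Fitting only the mean (a horizontal line) accounts for none of the variation.
r2_mean_only = r_squared(y, [4.0, 4.0, 4.0])

# A model that reproduces every observation accounts for all of it.
r2_perfect = r_squared(y, [2.0, 4.0, 6.0])

print(r2_mean_only, r2_perfect)
```

Neither extreme says anything about whether the model is correct or how precisely it predicts, which is the point of the comments above.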
                       © 2002 By CRC Press LLC