Page 342 - Statistics for Environmental Engineers


A Note on Description vs. Prediction
Is the regression useful? We have seen that a high R² does not guarantee that a regression has meaning.
Likewise, a low R² may accompany a statistically significant relationship between two variables even though
the regression explains little of the variation. Still less does statistical significance mean that
the regression will predict future observations with much accuracy. “In order for the fitted equation to
be regarded as a satisfactory predictor, the observed F ratio (regression mean square/residual mean
square) should exceed not merely the selected percentage point of the F distribution, but several times
the selected percentage point. How many times depends essentially on how great a ratio (prediction
range/error of prediction) is specified” (Box and Wetz, 1973). Draper and Smith (1998) offer this rule of
thumb: unless the observed F for overall regression exceeds the chosen test percentage point by at
least a factor of four, and preferably more, the regression is unlikely to be of practical value for prediction
purposes. The regression in Figure 39.4 has an F ratio of 581.12/8.952 = 64.91 and would have some
practical predictive value.
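The rule of thumb can be sketched as a small check. This is only an illustration: the function names are hypothetical, and the critical value (the chosen percentage point of the F distribution) must be looked up for the actual degrees of freedom, which are not given here. The sketch therefore reports the largest percentage point for which the observed F ratio would still pass.

```python
def passes_rule_of_thumb(f_observed, f_critical, factor=4.0):
    """True if the observed F exceeds `factor` times the chosen percentage point.

    f_critical is an input the user supplies from an F table or software;
    factor=4.0 follows the Draper and Smith (1998) rule of thumb.
    """
    return f_observed > factor * f_critical


def largest_passing_critical_value(f_observed, factor=4.0):
    """Largest percentage point that f_observed exceeds by the given factor."""
    return f_observed / factor


# Mean squares quoted in the text for the regression in Figure 39.4.
regression_mean_square = 581.12
residual_mean_square = 8.952

f_observed = regression_mean_square / residual_mean_square
f_limit = largest_passing_critical_value(f_observed)

print(f"Observed F ratio: {f_observed:.2f}")
print(f"Passes the factor-of-four rule for any percentage point below {f_limit:.2f}")
```

Framing the check this way avoids assuming degrees of freedom that the text does not state.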

Other Ways to Examine a Model
If R² does not tell all that is needed about how well a model fits the data and how good the model may
be for prediction, what else could be examined?
Graphics reveal information in data (Tufte, 1983): always examine the data and the proposed model
graphically. How sad if this advice were forgotten in a rush to compute some statistic like R².
A more useful single measure of the prediction capability of a model (including a k-variate regression
model) is the standard error of the estimate. The standard error of the estimate is computed from the
variance of the predicted value (ŷ), and it indicates the precision with which the model estimates the
value of the dependent variable. This statistic is used to compute intervals that have the following
meanings (Hahn, 1973).
• The confidence interval for the dependent variable is an interval that one expects, with a
specified level of confidence, to contain the average value of the dependent variable at a set
of specified values for the independent variables.
• A prediction interval for the dependent variable is an interval that one expects, with a specified
probability, to contain a single future value of the dependent variable from the sampled
population at a set of specified values of the independent variables.
• A confidence interval around a parameter in a model (i.e., a regression coefficient) is an
interval that one expects, with a specified degree of confidence, to contain the true regression
coefficient.
Confidence intervals for parameter estimates and prediction intervals for the dependent variable are
discussed in Chapters 34 and 35. The exact method of obtaining these intervals is explained in Draper
and Smith (1998). They are computed by most statistics software packages.
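As a rough illustration of the three intervals described above, the following sketch computes them for a straight-line model fitted by ordinary least squares. It is not the book's procedure: the function name, the demonstration data, and the choice of x0 are hypothetical, and the Student's t critical value is passed in by the caller (3.182 is the tabulated two-sided 95% value for 3 degrees of freedom).

```python
import math


def regression_intervals(x, y, x0, t_crit):
    """Intervals for a least-squares line y = intercept + slope * x.

    t_crit is the two-sided Student's t value for n - 2 degrees of freedom,
    supplied by the caller from a table or statistics software.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation with n - 2 degrees of freedom.
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(residual_ss / (n - 2))

    y_hat = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx

    half_mean = t_crit * s * math.sqrt(leverage)      # average response at x0
    half_pred = t_crit * s * math.sqrt(1 + leverage)  # single future value at x0
    half_slope = t_crit * s / math.sqrt(sxx)          # regression coefficient

    return {
        "mean_response": (y_hat - half_mean, y_hat + half_mean),
        "single_future_value": (y_hat - half_pred, y_hat + half_pred),
        "slope": (slope - half_slope, slope + half_slope),
    }


# Hypothetical data for illustration only (not taken from the book).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_95_DF3 = 3.182  # two-sided 95% t value, 5 - 2 = 3 degrees of freedom

result = regression_intervals(x_demo, y_demo, x0=3.5, t_crit=T_CRIT_95_DF3)
for name, (low, high) in result.items():
    print(f"{name}: ({low:.3f}, {high:.3f})")
```

The prediction interval is always wider than the confidence interval for the mean response at the same x0, because it adds the variance of a single new observation to the uncertainty in the fitted line.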

Comments
Widely used methods have the potential to be frequently misused. Linear regression, the most widely
used statistical method, can be misused or misinterpreted if one relies too much on R² as a characterization
of how well a model fits.
R² is a measure of the proportion of variation in y that is accounted for by fitting y to a particular linear
model instead of describing the data by calculating the mean (a horizontal straight line). A high R² does not
prove that a model is correct or useful. A low R² may accompany a statistically significant relation between two
variables even though the regression has no practical predictive value. Replication dramatically reduces the
prediction error of a model, and it makes possible a formal lack-of-fit test, but it reduces the R² of the model.
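The comparison with the mean can be made concrete with a minimal sketch, assuming the usual definition R² = 1 − (residual sum of squares)/(total sum of squares about the mean). The function name and the three-point data set are hypothetical.

```python
def r_squared(y, y_fitted):
    """Proportion of the variation about the mean of y accounted for by y_fitted."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fitted))
    return 1 - ss_residual / ss_total


# Hypothetical observations for illustration only.
y_obs = [2.0, 4.0, 6.0]
y_mean = sum(y_obs) / len(y_obs)

# Describing the data by the mean alone (a horizontal line) explains nothing extra.
r2_mean_only = r_squared(y_obs, [y_mean] * len(y_obs))

# A model that reproduces every observation accounts for all of the variation.
r2_perfect = r_squared(y_obs, y_obs)

print(r2_mean_only, r2_perfect)
```

Neither extreme says anything about whether the model is correct or will predict well, which is the point of the comments above.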
© 2002 By CRC Press LLC
