Page 333 - Statistics for Environmental Engineers
                        After modifying a model by adding, or in this case dropping, a term, an additional test should be
                       made to compare the regression sum of squares of the two models. Details of this test are given in
                       texts on regression analysis (Draper and Smith, 1998) and in Chapter 40. Here, the test is illustrated
                       by example.
                        The regression sum of squares for the complete model (Model A) is 20,256. Dropping the z² term to
                       get Model B reduced the regression sum of squares by only 54. We need to consider that a reduction
                       of 54 in the regression sum of squares may not be a statistically significant difference.
                        The reduction in the regression sum of squares due to dropping z² can be thought of as a variance
                       associated with the z² term. If this variance is small compared to the variance of the pure experimental
                       error, then the z² term contributes no real information and it should be dropped from the model. In
                       contrast, if the variance associated with the z² term is large relative to the pure error variance, the term
                       should remain in the model.
                        There were no repeated measurements in this experiment, so an independent estimate of the pure
                       error variance cannot be computed. The best that can be done under the circumstances is to use the
                       residual mean square of the complete model as an estimate of the pure error variance. The residual
                       mean square for the complete model (Model A) is 51.5. This is compared with the difference in the
                       regression sums of squares of Models A and B, which is 54. Because only one term was dropped, this
                       difference has 1 degree of freedom and serves directly as the variance associated with z². The ratio
                       of the variance due to z² to the pure error variance is F = 54/51.5 = 1.05. This value is compared
                       against the upper 5% point of the F distribution with 1 and 6 degrees of freedom: 1 for the numerator
                       (the one parameter that was dropped from the model) and 6 for the denominator (the residual mean
                       square of Model A). From Table C in the appendix, F₁,₆ = 5.99. Because 1.05 < 5.99, we conclude that
                       removing the z² term does not result in a significant reduction in the regression sum of squares.
                       Therefore, the z² term is not needed in the model.
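The comparison of Models A and B can be written as a single worked equation. The symbols SS_reg,A, SS_reg,B, and s_A² (the residual mean square of Model A) are shorthand introduced here for illustration; they are not notation taken from the text.

```latex
F = \frac{\left(SS_{\mathrm{reg},A} - SS_{\mathrm{reg},B}\right)/1}{s_A^{2}}
  = \frac{54/1}{51.5}
  = 1.05 \;<\; F_{0.05;\,1,\,6} = 5.99
```

The divisor of 1 in the numerator is the number of parameters dropped; it is written out explicitly because it changes when more than one term is removed.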
                        The test used above is valid for comparing Model A with any model that has one fewer parameter.
                       To compare Models A and E, notice that omitting t² decreases the regression sum of squares by
                       20,256 − 17,705 = 2,551. The F statistic is 2,551/51.5 = 49.5. Because 49.5 >> 5.99 (the upper 5%
                       point of the F distribution with 1 and 6 degrees of freedom), this change is significant and t² needs
                       to be included in the model.
                        The test is modified slightly to compare Models A and D because Model D has two fewer terms than
                       Model A. The decrease of 343 in the regression sum of squares results from dropping two terms (z² and
                       zt). The F statistic is now computed using 343/2 in the numerator and 51.5 in the denominator:
                       F = (343/2)/51.5 = 3.33. The upper 5% point of the appropriate reference distribution, which has
                       2 degrees of freedom for the numerator and 6 degrees of freedom for the denominator, is F = 5.14.
                       Because F for the model is less than the reference F (3.33 < 5.14), the terms z² and zt are not needed.
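The three comparisons above follow one pattern, which can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `extra_ss_f` and the dictionary layout are hypothetical, and the critical values are simply the tabulated upper 5% points quoted in the text rather than values computed from the F distribution.

```python
def extra_ss_f(ss_reduction, n_dropped, residual_ms_full):
    """F ratio for the drop in regression sum of squares.

    ss_reduction      decrease in regression SS after dropping terms
    n_dropped         number of terms dropped (numerator degrees of freedom)
    residual_ms_full  residual mean square of the complete model, used
                      as the estimate of the pure error variance
    """
    if n_dropped < 1:
        raise ValueError("at least one term must be dropped")
    if residual_ms_full <= 0:
        raise ValueError("residual mean square must be positive")
    return (ss_reduction / n_dropped) / residual_ms_full


# Values quoted in the text for the complete model (Model A).
RESIDUAL_MS_A = 51.5

# Upper 5% points of F with 6 denominator degrees of freedom,
# keyed by numerator degrees of freedom (taken from the text).
F_CRITICAL = {1: 5.99, 2: 5.14}

# Reduced model -> (reduction in regression SS, number of terms dropped).
comparisons = {
    "B": (54.0, 1),                  # drops z^2
    "E": (20256.0 - 17705.0, 1),     # drops t^2
    "D": (343.0, 2),                 # drops z^2 and zt
}

results = {}
for model, (reduction, n_dropped) in comparisons.items():
    f_ratio = extra_ss_f(reduction, n_dropped, RESIDUAL_MS_A)
    significant = f_ratio > F_CRITICAL[n_dropped]
    results[model] = (round(f_ratio, 2), significant)
    verdict = "keep the term(s)" if significant else "term(s) not needed"
    print(f"Model A vs {model}: F = {f_ratio:.2f}, "
          f"critical = {F_CRITICAL[n_dropped]}, {verdict}")
```

Passing the reduction in the sum of squares, rather than the two sums of squares separately, keeps the sketch tied to the numbers the text actually reports.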
                        Model D is as good as Model A. Model D is the simplest adequate model:

                                             Model D:  ŷ = 186 + 7.12t − 3.06z + 0.143t²


                       This is the same model that was obtained by starting with the simplest possible model and adding terms
                       to make up for inadequacies.
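As a small illustration of how the fitted Model D would be used for prediction, the sketch below evaluates the equation directly. The function name `model_d` is hypothetical, and the input values in the usage line are arbitrary numbers chosen only to show the arithmetic; they are not data from the experiment and may lie outside the range over which the model was fitted.

```python
def model_d(t, z):
    """Predicted response from Model D, using the coefficients in the text."""
    return 186.0 + 7.12 * t - 3.06 * z + 0.143 * t ** 2


# Arbitrary illustrative inputs (assumed, not taken from the experiment).
example_prediction = model_d(10.0, 5.0)
print(f"Predicted y at t = 10, z = 5: {example_prediction:.1f}")
```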




                       Comments

                       The model building process uses regression to estimate the parameters, followed by diagnosis to decide
                       whether the model should be modified by adding or dropping terms. The goal is not to maximize R²,
                       because this puts unneeded high-order terms into the polynomial model. The best model should have
                       the fewest possible parameters because this will minimize the prediction error of the model.
                        One approach to finding the simplest adequate model is to start with a simple tentative model and use
                       diagnostic checks, such as residual plots, for guidance. The alternative approach is to start by overfitting
                       the data with a highly parameterized model and then to find appropriate simplifications. Each time a

                       © 2002 By CRC Press LLC