
Chapter 7: Data Partitioning                                    233


                             with data it has already seen. Therefore, that procedure will yield an overly
                             optimistic (i.e., low) prediction error (see Equation 7.5). Cross-validation is a
                             technique that can be used to address this problem by iteratively partitioning
                             the sample into two sets of data. One is used for building the model, and the
                             other is used to test it.
                              We introduce cross-validation in a linear regression application, where we
                             are interested in estimating the expected prediction error. We use linear
                             regression to illustrate the cross-validation concept, because it is a topic that
                             most engineers and data analysts should be familiar with. However, before
                             we describe the details of cross-validation, we briefly review the concepts in
                             linear regression. We will return to this topic in Chapter 10, where we discuss
                             methods of nonlinear regression.
                              Say we have a set of data, (Xᵢ, Yᵢ), where Xᵢ denotes a predictor variable
                             and Yᵢ represents the corresponding response variable. We are interested in
                             modeling the dependency of Y on X. The easiest example of linear regression
                             is in situations where we can fit a straight line between X and Y. In Figure 7.1,
                             we show a scatterplot of 25 observed (Xᵢ, Yᵢ) pairs [Draper and Smith, 1981].
                             The X variable represents the average atmospheric temperature measured in
                             degrees Fahrenheit, and the Y variable corresponds to the pounds of steam
                             used per month. The scatterplot indicates that a straight line is a reasonable
                             model for the relationship between these variables. We will use these data to
                             illustrate linear regression.
                              The linear, first-order model is given by

                                                      Y = β₀ + β₁X + ε ,                    (7.1)

                             where β₀ and β₁ are parameters that must be estimated from the data, and
                             ε represents the error in the measurements. It should be noted that the word
                             linear refers to the linearity of the parameters βᵢ. The order (or degree) of the
                             model refers to the highest power of the predictor variable X. We know from
                             elementary algebra that β₁ is the slope and β₀ is the y-intercept. As another
                             example, we represent the linear, second-order model by

                                                   Y = β₀ + β₁X + β₂X² + ε .                (7.2)

                              To get the model, we need to estimate the parameters β₀ and β₁. Thus, the
                             estimate of our model given by Equation 7.1 is
                                                        Ŷ = β̂₀ + β̂₁X ,                     (7.3)

                             where Ŷ denotes the predicted value of Y for some value of X, and β̂₀ and
                             β̂₁ are the estimated parameters. We do not go into the derivation of the
                             estimators, since it can be found in most introductory statistics textbooks.
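Although the derivation is omitted in the text, the standard least-squares estimators for the first-order model can be sketched directly. The following Python sketch (an illustration, not the handbook's MATLAB code; the small (x, y) sample is made up, as the steam data of Figure 7.1 are not reproduced here) computes β̂₁ as the ratio of the centered cross-product sum to the centered sum of squares, then β̂₀ from the sample means, and predicts Ŷ as in Equation 7.3:

```python
# Least-squares estimates for the first-order model Y = b0 + b1*X + e.
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope, beta-1 hat
    b0 = ybar - b1 * xbar     # estimated intercept, beta-0 hat
    return b0, b1

def predict(b0, b1, x):
    # Y-hat = b0-hat + b1-hat * x   (Equation 7.3)
    return b0 + b1 * x

# Sanity check on noise-free data: points lying exactly on y = 2 + 3x
# are recovered exactly by the estimators.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = fit_line(x, y)
```

With measurement error ε present, as in Equation 7.1, the fitted line no longer passes through every point, and the prediction error of Ŷ is exactly the quantity that cross-validation is designed to estimate.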




                            © 2002 by Chapman & Hall/CRC