Page 243 - Computational Statistics Handbook with MATLAB

Chapter 7

                             Data Partitioning

                             7.1 Introduction
                             In this book, data partitioning refers to procedures where some observations
                             from the sample are removed as part of the analysis. These techniques are
                             used for the following purposes:

                                •  To evaluate the accuracy of the model or classification scheme;
                                •  To decide what is a reasonable model for the data;
                                •  To find a smoothing parameter in density estimation;
                                •  To estimate the bias and error in parameter estimation;
                                •  And many others.
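
As a minimal sketch of what "removing observations from the sample" means in practice, the fragment below sets one observation aside and keeps the rest for the analysis. The sample values and the variable names (x, keep, xkept, xleft) are hypothetical and chosen only for illustration; they do not come from any data set in the book.

```matlab
% Hypothetical sample of n = 10 observations (made-up values).
x = (1:10)';
n = length(x);

% Choose the observation to remove from the sample.
i = 4;

% Logical mask that is true for every observation we keep.
keep = true(n,1);
keep(i) = false;

xkept = x(keep);    % the n-1 observations used in the analysis
xleft = x(i);       % the observation that was set aside
```

The same logical-mask idea extends to removing several observations at once by setting more than one element of keep to false.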

                              We start off with an example to motivate the reader. We have a sample
                             where we measured the average atmospheric temperature and the
                             corresponding amount of steam used per month [Draper and Smith, 1981]. Our
                             goal in the analysis is to model the relationship between these variables. Once
                             we have a model, we can use it to predict how much steam is needed for a
                             given average monthly temperature. The model can also be used to gain
                             understanding about the structure of the relationship between the two
                             variables.
                              The problem then is deciding what model to use. To start off, one should
                             always look at a scatterplot (or scatterplot matrix) of the data as discussed in
                             Chapter 5. The scatterplot for these data is shown in Figure 7.1 and is
                             examined in Example 7.3. We see from the plot that as the temperature increases,
                             the amount of steam used per month decreases. It appears that using a line
                             (i.e., a first degree polynomial) to model the relationship between the
                             variables is not unreasonable. However, other models might provide a better fit.
                             For example, a cubic or some higher degree polynomial might be a better
                             model for the relationship between average temperature and steam usage.
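
To make the comparison between a line and a cubic concrete, the following sketch fits both polynomials with polyfit and evaluates them with polyval. The data here are synthetic stand-ins generated with a decreasing trend plus noise; they are not the Draper and Smith steam data, and the variable names (temp, steam, rss1, rss3) and the numbers used to generate the data are assumptions made for illustration only.

```matlab
% Synthetic stand-in data (NOT the Draper and Smith steam data).
rng(1);                                     % fix the seed so the run is repeatable
temp  = linspace(30, 75, 25)';              % hypothetical average temperatures
steam = 13 - 0.08*temp + 0.5*randn(25,1);   % hypothetical steam usage

% Requesting three outputs makes polyfit center and scale temp,
% which keeps the cubic fit well conditioned.
[p1, ~, mu1] = polyfit(temp, steam, 1);     % first degree polynomial (line)
[p3, ~, mu3] = polyfit(temp, steam, 3);     % cubic polynomial

% Evaluate each fitted polynomial at the observed temperatures.
fit1 = polyval(p1, temp, [], mu1);
fit3 = polyval(p3, temp, [], mu3);

% Residual sum of squares on the same data used for fitting.
rss1 = sum((steam - fit1).^2);
rss3 = sum((steam - fit3).^2);
fprintf('RSS, line:  %.3f\n', rss1);
fprintf('RSS, cubic: %.3f\n', rss3);
```

Because the line is a special case of the cubic, the least squares cubic can never have a larger residual sum of squares than the line on the data used to fit both. A smaller value of rss3 therefore does not by itself show that the cubic is the better model.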
                              So, how can we decide which model is better? To make that decision, we
                             need to assess the accuracy of the various models. We could then choose the





                            © 2002 by Chapman & Hall/CRC