Page 243 - Computational Statistics Handbook with MATLAB

Chapter 7

Data Partitioning

7.1 Introduction
In this book, data partitioning refers to procedures where some observations
from the sample are removed as part of the analysis. These techniques are
used for the following purposes:

• To evaluate the accuracy of the model or classification scheme;
• To decide what is a reasonable model for the data;
• To find a smoothing parameter in density estimation;
• To estimate the bias and error in parameter estimation;
• And many others.
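
As a minimal sketch of what "removing some observations" can look like in MATLAB, the fragment below sets aside one observation and computes a statistic on the rest. The sample values and the names x, xleft, xkept, and thetahat are hypothetical placeholders chosen for illustration; they do not come from any data set in this book.

```matlab
% Hypothetical sample; the values are placeholders for illustration only.
x = [2.1 3.4 1.9 5.0 4.2 3.8];
n = length(x);

% Choose one observation to remove from the sample.
i = 3;
xleft = x(i);                   % the observation that is set aside
xkept = x([1:i-1, i+1:n]);      % the remaining n-1 observations

% Any statistic of interest can now be computed on the reduced sample.
% The mean is used here only as a simple stand-in.
thetahat = mean(xkept);
```

Repeating this for each i in turn, or removing several observations at a time, gives different partitioning schemes; which one is appropriate depends on the purpose listed above.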

We start off with an example to motivate the reader. We have a sample
where we measured the average atmospheric temperature and the
corresponding amount of steam used per month [Draper and Smith, 1981]. Our
goal in the analysis is to model the relationship between these variables. Once
we have a model, we can use it to predict how much steam is needed for a
given average monthly temperature. The model can also be used to gain
understanding about the structure of the relationship between the two
variables.
The problem then is deciding what model to use. To start off, one should
always look at a scatterplot (or scatterplot matrix) of the data as discussed in
Chapter 5. The scatterplot for these data is shown in Figure 7.1 and is
examined in Example 7.3. We see from the plot that as the temperature increases,
the amount of steam used per month decreases. It appears that using a line
(i.e., a first degree polynomial) to model the relationship between the
variables is not unreasonable. However, other models might provide a better fit.
For example, a cubic or some higher degree polynomial might be a better
model for the relationship between average temperature and steam usage.
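
To make the two candidate models concrete, the following sketch fits a line and a cubic by least squares using polyfit and polyval. The temp and steam values are synthetic stand-ins with a decreasing trend, not the Draper and Smith measurements, and the variable names are assumptions made for this illustration.

```matlab
% Synthetic stand-in data (NOT the Draper and Smith measurements):
% steam use falls roughly linearly as temperature rises.
temp  = [30 35 40 45 50 55 60 65 70 75];
steam = [11.2 10.9 10.1 9.8 9.6 8.9 8.6 8.0 7.9 7.2];

% Center the predictor so the cubic fit is better conditioned.
tc = temp - mean(temp);

% Fit a first degree and a third degree polynomial by least squares.
p1 = polyfit(tc, steam, 1);
p3 = polyfit(tc, steam, 3);

% Residual sum of squares of each fit on the same data.
rss1 = sum((steam - polyval(p1, tc)).^2);
rss3 = sum((steam - polyval(p3, tc)).^2);

fprintf('RSS, line:  %g\n', rss1);
fprintf('RSS, cubic: %g\n', rss3);
```

Because the line is a special case of the cubic, the cubic's residual sum of squares on the data used for fitting cannot be larger than the line's (up to rounding). A smaller in-sample error therefore does not by itself show that the cubic is the better model.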
So, how can we decide which model is better? To make that decision, we
need to assess the accuracy of the various models. We could then choose the
