Page 143 - Intermediate Statistics for Dummies
Part II: Making Predictions by Using Regression
Removing one variable: The Step 2 column
Notice in the Step 2 column of Figure 6-4 that the left leg strength variable
no longer appears (and it stays out for good), because it has the highest
p-value at Step 1 and that p-value is larger than the removal level of 0.10.
This is the work of the backward selection procedure. It operates in
only-the-strong-survive mode when it comes to variable elimination.
In looking at the p-values for this new model in the Step 2 column, you see
the variable with the highest p-value is hang time (0.874). This result doesn’t
make sense at first because in Table 6-2 you saw hang time had the strongest
relationship with punt distance.
However, remember what the p-value represents here — the significance of
the variable in its contribution to y, given all the other variables already in
the model. Because so many of the other variables in the model were shown
to be correlated with hang time (see Figure 6-1), it makes sense that hang
time could possibly be eliminated somewhere near the beginning of this
procedure.
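In symbols, the p-value for a variable at any step comes from testing whether its coefficient is zero while every other variable currently in the model stays put. The notation below is standard regression notation assumed for illustration; it does not come from Figure 6-4. Here b_j is the estimated coefficient of the j-th variable and SE(b_j) is its standard error.

```latex
H_0\colon \beta_j = 0 \quad \text{versus} \quad H_a\colon \beta_j \neq 0,
\qquad
t_j = \frac{b_j}{SE(b_j)}
```

When a variable is strongly correlated with the other variables in the model, its standard error tends to be large, so the test statistic shrinks and the p-value grows. That is how a variable such as hang time can look strong on its own yet weak once the others are included.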
Working down to the final model: The Step 3 column and beyond
The Step 3 column of Figure 6-4 shows the model without left leg strength or
hang time. The next variable to be removed is left leg flexibility, which has a
p-value of 0.574. Looking at the Step 4 column of Figure 6-4, the next variable
to be removed is right leg flexibility, which has a p-value of 0.346.
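The removals in Steps 2 through 4 repeat one rule: refit the model, find the variable with the highest p-value, and drop it if that p-value is above the removal level. Here is a minimal Python sketch of that loop. It is not Minitab's implementation. The function name `backward_select` and the callback `fit_p_values` are hypothetical; the callback stands in for whatever software refits the regression and reports a p-value for each variable still in the model. The sketch also assumes a p-value exactly equal to the removal level is kept.

```python
from typing import Callable, Dict, List, Tuple


def backward_select(
    variables: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    removal_level: float = 0.10,
) -> Tuple[List[str], List[str]]:
    """Return (kept, removed) after running backward selection.

    fit_p_values is a hypothetical callback: given the variables currently
    in the model, it refits the regression and returns each one's p-value.
    """
    kept = list(variables)
    removed: List[str] = []
    while kept:
        p_values = fit_p_values(kept)
        # The weakest variable is the one with the highest p-value.
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= removal_level:
            break  # every remaining variable meets the removal level, so stop
        kept.remove(worst)
        removed.append(worst)  # once removed, a variable never comes back
    return kept, removed
```

Run on the punt distance data, a loop like this would be expected to report the removals in the order described here: left leg strength, hang time, left leg flexibility, and then right leg flexibility.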
After right leg flexibility is removed from the model, you can see the result in
Step 5 of Figure 6-4. All the remaining variables in the model have p-values
smaller than the level for removal, which is 0.10. This means you stop the
backward selection procedure and keep the model you’ve got. The final
model for the punt distance data using the backward selection procedure
with removal level 0.10 is $y = 12.77 + 0.56x_1 + 0.27x_2$, where $x_1$ = right leg
strength and $x_2$ = overall leg strength. The final value of $R^2$ adjusted is 74.14
percent, which isn’t all that bad. (I’ve seen higher values of $R^2$, but I’ve also
seen a lot worse.) Mallows cheers this model on with a C-p value of 0, a value that
has been rounded off a bit.
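To see the final model at work, you can plug values into the equation. The short Python sketch below does only that arithmetic, using the coefficients stated above. The function name `predict_punt_distance` and the input values are made up for illustration; they are not taken from the punt distance data set.

```python
def predict_punt_distance(right_leg_strength: float,
                          overall_leg_strength: float) -> float:
    """Evaluate the final model y = 12.77 + 0.56*x1 + 0.27*x2."""
    return 12.77 + 0.56 * right_leg_strength + 0.27 * overall_leg_strength


# Hypothetical punter: both strength values are invented for this example.
predicted = predict_punt_distance(right_leg_strength=150.0,
                                  overall_leg_strength=200.0)
print(round(predicted, 2))  # prints 150.77
```

The hand calculation is 12.77 + 0.56(150) + 0.27(200) = 12.77 + 84 + 54 = 150.77. As with any regression equation, predictions are only trustworthy for strength values within the range of the data used to fit the model.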
Always remember to use $R^2$ adjusted rather than $R^2$ to assess the fit of
your model at each step of any selection procedure, and here’s why: In the
punt distance example, the values of $R^2$ and $R^2$ adjusted appear on the second
and third lines from the bottom of the Minitab output in Figure 6-4. You can
see that with each step, the values of $R^2$ decrease because fewer variables are
in the model to contribute something to predicting y. However, the values of
$R^2$ adjusted increase because the adjustment needed for the number of
variables in the model goes down. Each variable left in the model is providing
more bang for the buck in terms of helping predict y.
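A quick calculation shows how the two measures can move in opposite directions. The Python sketch below uses the standard textbook formula for adjusted R-squared, which is assumed here to match what Minitab reports. The function name `adjusted_r2` and every number fed into it are made up for illustration; none of them come from Figure 6-4.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fit to n observations.

    Standard formula: 1 - (1 - R^2) * (n - 1) / (n - k - 1).
    """
    if n - k - 1 <= 0:
        raise ValueError("n must be larger than k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Invented values: dropping three weak predictors costs a little R^2 ...
full_model = adjusted_r2(r2=0.80, n=13, k=5)
reduced_model = adjusted_r2(r2=0.79, n=13, k=2)

# ... yet the adjusted value rises because the penalty for k shrinks.
print(round(full_model, 3), round(reduced_model, 3))
```

In this invented case, $R^2$ falls from 0.80 to 0.79 while the adjusted value climbs from about 0.657 to 0.748, the same pattern described for the punt distance output.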