added value of the additional variable is outweighed by the number of
variables in the model. This gives you an idea of how much or how little
added value you get from a bigger model (bigger isn’t always better).
✓ Mallows’ C-p: Mallows’ C-p takes the amount of error left unexplained
by a model with p of the x variables, divides that number by the average
amount of error left over from the full model (with all the x variables),
and adjusts that result for the number of observations (n) and the
number of x variables used (p). In general, the smaller Mallows’ C-p is,
the better, because when it comes to the amount of error in your model,
less is more. A C-p value close to p (the number of x variables in the
model) reflects a model that fits well.
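In symbols, one common way to write this statistic (the formula isn’t spelled
out here, and conventions differ a little on whether p counts the intercept) is

C-p = SSE_p / MSE_full - (n - 2p)

where SSE_p is the sum of squared errors from the model with p x variables
and MSE_full is the mean squared error from the full model.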
Model selection procedures
The process of finding the “best” model is not cut and dried. (Heck, even the
definition of “best” here isn’t cut and dried.) Many different procedures exist
for going through different models in a systematic way, evaluating each one,
and stopping at the right model. Three of the more common model selection
procedures are forward selection, backward selection, and the best subsets
model. In this section you get a very brief overview of the forward and
backward selection procedures, and then you get into the details of the best
subsets model, which is the one statisticians use most.
Going with the forward selection procedure
The forward selection procedure starts with a model with no variables in it
and adds variables one at a time according to the amount of contribution
they can make to the model.
Start by choosing an entry-level value of α. Then run hypothesis tests (see
Chapter 3 for instructions) for each x variable to see how it’s related to y.
The x variable with the smallest p-value wins and is added to the model, as
long as its p-value is smaller than the entry level. You keep doing this with
the remaining variables until even the one with the smallest p-value doesn’t
make the entry level. Then you stop.
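If you like to see the recipe as code, here’s a minimal sketch of forward
selection in Python. It isn’t from the book: it assumes your data live in a
pandas DataFrame called data, with the response in a column named "y", and it
leans on the statsmodels package for the regression fits.

import statsmodels.api as sm

def forward_selection(data, response="y", alpha_enter=0.05):
    """Add x variables one at a time by smallest p-value."""
    remaining = [c for c in data.columns if c != response]
    selected = []
    while remaining:
        # Fit one candidate model per leftover variable and record
        # the p-value on that variable's coefficient.
        pvals = {}
        for var in remaining:
            X = sm.add_constant(data[selected + [var]])
            fit = sm.OLS(data[response], X).fit()
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        # Stop the first time even the best candidate misses the entry level.
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

Each pass admits the winning variable only if its p-value beats alpha_enter,
and once a variable is in, this sketch never takes it back out — which is
exactly the drawback described next.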
The drawback of the forward selection procedure is that it starts with nothing
and adds variables one at a time as you go along; after a variable is added, it’s
never removed. The best model might not even get tested.
Opting for the backward selection procedure
The backward selection procedure does the opposite of the forward selection
method. It starts with a model with all the x variables in it and removes
variables one at a time, beginning with the ones that contribute the least
to the model. You choose a removal level to begin; then you test each x
variable still in the model and remove the one with the largest p-value, as
long as that p-value is bigger than the removal level. You keep going until
every variable left in the model makes the cut.
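Here’s the mirror-image sketch for backward selection, under the same
assumptions as before (hypothetical DataFrame data, response column "y",
statsmodels for the fits):

import statsmodels.api as sm

def backward_selection(data, response="y", alpha_remove=0.05):
    """Drop x variables one at a time by largest p-value."""
    selected = [c for c in data.columns if c != response]
    while selected:
        X = sm.add_constant(data[selected])
        fit = sm.OLS(data[response], X).fit()
        pvals = fit.pvalues.drop("const")  # ignore the intercept's p-value
        worst = pvals.idxmax()
        # Stop once every remaining variable clears the removal level.
        if pvals[worst] <= alpha_remove:
            break
        selected.remove(worst)
    return selected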