Page 35 - Intermediate Statistics for Dummies
Part I: Data Analysis and Model-Building Basics
that he needs more information, so he tries to uncover what other factors
help determine exam score on a statistics test besides study time. Bill measures
everything from soup to nuts. His set of possible variables includes
study time, GPA, previous experience in statistics, math grades in high
school, attitudes toward statistics, whether you listen to classical music
while studying, shoe size, whether you chew gum during the exam, and even
what your favorite color is (after all, you never know, he figures). For good
measure, he includes 11 other variables, for a total of 20 possible factors that
he thinks may relate to exam score.
Bill starts out by looking for relationships between each of these variables
and exam score, so he does 20 correlations. (Correlation is a measure of the
linear relationship between two variables; see the section on correlation later
in this chapter.) He finds that four variables have a statistically significant
relationship with exam score (that means a relationship this strong would turn
up less than 5 percent of the time by chance alone if the variable really had
nothing to do with exam score, and even that holds only if he collected the
data properly and did the analysis correctly).
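To make the idea of correlation concrete, here is a minimal sketch of how a single correlation between one variable and exam score could be computed. The function name `pearson_r` and the study-time and exam-score numbers are made up for illustration; they are not Bill's data, and this is the standard Pearson formula rather than anything specific to this book.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength and direction of the LINEAR relationship."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (assumption): hours studied and exam scores for 5 students
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 60, 71, 80, 88]

r = pearson_r(study_hours, exam_scores)
print(f"correlation between study time and exam score: {r:.3f}")
```

A value near +1 or -1 signals a strong linear relationship, and a value near 0 signals little or none. Bill would repeat a calculation like this 20 times, once for each variable.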
The variables that Bill found to be related to exam score were study time,
math grades in high school, GPA, and whether the person chews gum during
the exam. It turns out that his new model fits pretty well (by criteria I discuss
in Chapter 5 on multiple linear regression models). Bill now thinks he’s
scored a home run and has answered that all-elusive question: How can I do
better on my statistics test?
But as they said in Apollo 13, “Houston, we have a problem.” By looking at all
possible correlations between his 20 variables and exam score, Bill is actually
doing 20 separate statistical analyses. Under typical conditions (I describe
these conditions in Chapter 3), each statistical analysis has a 5 percent
chance of being wrong just by chance, that is, of flagging a relationship that
isn’t really there (this value of 5 percent is called the significance level of the test).
Because 5 percent of 20 analyses is equal to one, you can expect that when
you do 20 statistical analyses, about one of them, on average, will give the
wrong result just by chance. I bet you can guess which one of Bill’s
correlations likely came out wrong in this case. Of course, study time has nothing to
do with exam score, and gum-chewing is the answer to all of our problems,
right? (If that were the case, all statisticians would be out of business and
working for chewing-gum companies instead.)
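The arithmetic behind Bill's problem can be checked with a short sketch. It rests on two simplifying assumptions that are not in the text: that none of the 20 variables is truly related to exam score, and that the 20 tests are independent of one another. The names `ALPHA`, `N_TESTS`, and `simulate_snooping` are made up for illustration.

```python
import random

ALPHA = 0.05    # significance level of each test
N_TESTS = 20    # Bill runs 20 correlations

# Long-run average number of false alarms per batch of 20 tests
expected_false_alarms = ALPHA * N_TESTS

# Chance that at least one of the 20 tests gives a false alarm
# (assumes independent tests and no real relationships at all)
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

def simulate_snooping(n_studies=10_000, seed=1):
    """Repeat Bill's 20-test fishing trip many times with pure-noise variables."""
    rng = random.Random(seed)
    total_false_alarms = 0
    studies_with_any = 0
    for _ in range(n_studies):
        # Each test "finds" something with probability ALPHA, by chance alone
        false_alarms = sum(rng.random() < ALPHA for _ in range(N_TESTS))
        total_false_alarms += false_alarms
        studies_with_any += false_alarms > 0
    return total_false_alarms / n_studies, studies_with_any / n_studies

avg_false_alarms, share_with_any = simulate_snooping()
print(f"expected false alarms per 20 tests: {expected_false_alarms:.2f}")
print(f"chance of at least one false alarm: {p_at_least_one:.3f}")
print(f"simulated average: {avg_false_alarms:.2f}, simulated share: {share_with_any:.3f}")
```

Under these assumptions, the average works out to one false alarm per 20 tests, and roughly 64 percent of such fishing trips turn up at least one "significant" result that means nothing.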
What Bill is doing is called data snooping in the data-analysis business. Bill
looks around until he finds something, and then he believes the result. This
strategy is dangerous, but it’s used all too often in the real world. One
of the reasons data snooping is running rampant today is that everyone
and his brother is out there collecting data and analyzing it, and everyone
wants to find something. They’re using statistical software that allows them