Page 35 - Intermediate Statistics for Dummies

                                         Part I: Data Analysis and Model-Building Basics
                                                    that he needs more information, so he tries to uncover what other factors
                                                    help determine exam score on a statistics test besides study time. Bill
                                                    measures everything from soup to nuts. His set of possible variables includes
                                                    study time, GPA, previous experience in statistics, math grades in high
                                                    school, attitudes toward statistics, whether you listen to classical music
                                                    while studying, shoe size, whether you chew gum during the exam, and even
                                                    what your favorite color is (after all, you never know, he figures). For good
                                                    measure, he includes 11 other variables, for a total of 20 possible factors that
                                                    he thinks may relate to exam score.
                                                    Bill starts out by looking for relationships between each of these variables
                                                    and exam score, so he does 20 correlations. (Correlation is a measure of the
                                                    linear relationship between two variables; see the section on correlation later
                                                    in this chapter.) He finds that four variables have a statistically significant
                                                    relationship with exam score (that means a relationship this strong would
                                                    turn up by chance alone less than 5 percent of the time if no real relationship
                                                    existed, and even that holds only if he collected the data properly and did
                                                    the analysis correctly).
                                                    The variables that Bill found to be related to exam score were study time,
                                                    math grades in high school, GPA, and whether the person chews gum during
                                                    the exam. It turns out that his new model fits pretty well (by criteria I discuss
                                                    in Chapter 5 on multiple linear regression models). Bill now thinks he’s
                                                    scored a home run and has answered that all-elusive question: How can I do
                                                    better on my statistics test?
                                                    But as they said in Apollo 13, “Houston, we have a problem.” By looking at all
                                                    possible correlations between his 20 variables and exam score, Bill is actually
                                                    doing 20 separate statistical analyses. Under typical conditions (I describe
                                                    these conditions in Chapter 3), each statistical analysis has a 5 percent
                                                    chance of finding a relationship that isn’t really there, just by chance (this
                                                    value of 5 percent is called the significance level of the test).
                                                    Because 5 percent of 20 is equal to one, you can expect that when you do
                                                    20 statistical analyses, about one of them will give the wrong result on
                                                    average, just by chance. I bet you can guess which one of Bill’s correlations
                                                    likely came out wrong in this case. Of course, study time has nothing to
                                                    do with exam score, and gum-chewing is the answer to all of our problems,
                                                    right? (If that were the case, all statisticians would be out of business and
                                                    working for chewing-gum companies instead.)
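To make the arithmetic behind Bill's problem concrete, here is a minimal Python sketch. It uses the numbers from the text (20 analyses, a 5 percent significance level) and adds two assumptions that the text does not state: that none of the 20 variables is truly related to exam score, and that the 20 analyses are independent of one another. The function names are made up for illustration.

```python
import random

NUM_TESTS = 20   # Bill's 20 correlations (from the text)
ALPHA = 0.05     # significance level of each test (from the text)


def expected_false_positives(num_tests, alpha):
    """Average number of tests that come out 'significant' by chance alone."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Chance that one or more tests are falsely significant.

    Assumption (not from the text): the tests are independent.
    """
    return 1 - (1 - alpha) ** num_tests


def simulate_false_positives(num_tests, alpha, num_trials, seed=0):
    """Average false-positive count over many imaginary repeats of Bill's study.

    Each test is modeled as a coin flip that lands 'significant' with
    probability alpha, which is what happens when no real relationship exists.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_trials):
        total += sum(rng.random() < alpha for _ in range(num_tests))
    return total / num_trials


expected = expected_false_positives(NUM_TESTS, ALPHA)
at_least_one = prob_at_least_one_false_positive(NUM_TESTS, ALPHA)
simulated = simulate_false_positives(NUM_TESTS, ALPHA, num_trials=10_000)

print(f"Expected false positives: {expected:.2f}")
print(f"Chance of at least one:   {at_least_one:.3f}")
print(f"Simulated average count:  {simulated:.2f}")
```

Under these assumptions, the chance of at least one false alarm in 20 tests is roughly 64 percent, so a surprise result like gum-chewing is more likely than not to appear somewhere in Bill's list.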
                                                    What Bill is doing is called data snooping in the data-analysis business. Bill
                                                    looks around until he finds something, and then he believes the result. This
                                                    strategy is dangerous, but it’s used all too often in the real world. One
                                                    reason data snooping runs rampant today is that everyone and his brother is
                                                    out there collecting data and analyzing it, and everyone wants to find
                                                    something. They’re using statistical software that allows them