Page 93 - Statistics II for Dummies

Chapter 4: Getting in Line with Simple Linear Regression  77


                                    well enough on its own. In this case, statisticians would try to add one
                                    or more variables to the model to help explain y more fully as a group
                                    (read more about this in Chapter 5).

                                For the textbook-weight example, the value of r (the correlation coefficient)
                                is 0.93. Squaring this result, you get r² = 0.8649. That number means approxi-
                                mately 86 percent of the variability you find in average textbook weights for
                                all students (y-values) is explained by the average student weight (x-values).
                                This percentage tells you that the model of using average student weight to
                                estimate textbook weight is a good bet.
                                In the case of simple linear regression, you have only one x variable, but in
                                Chapter 5, you can see models that contain more than one x variable. In that
                                situation, you use r² to help sort out the contribution that those x variables
                                as a group bring to the model.
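If you want to check the arithmetic yourself, squaring r takes only a couple of lines. Here's a minimal sketch in Python, using the r = 0.93 from the textbook-weight example (the interpretation printed at the end is just the rule of thumb described above):

```python
# Squaring the correlation coefficient r gives the proportion of the
# variability in y that the regression model explains.
r = 0.93                    # correlation from the textbook-weight example
r_squared = r ** 2

print(round(r_squared, 4))  # 0.8649
print(f"About {r_squared:.0%} of the variability in y is explained by x")
```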

                                Scoping for outliers


                                Sometimes life isn’t perfect (oh really?), and you may find a residual in your
                                otherwise tidy data set that totally sticks out. It’s called an outlier, and it has
                                a standardized value at or beyond +3 or –3. It threatens to blow the conditions
                                of your regression model and send you crying to your professor.
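The "at or beyond +3 or –3" rule is easy to automate: standardize each residual by dividing it by the residuals' standard deviation and flag anything beyond three. The sketch below uses a made-up data set (y = 1 + 0.5x with one planted outlier at x = 8), not the book's numbers, and fits the least-squares line from scratch so the example stays self-contained:

```python
# Hedged sketch: flag outliers whose standardized residual is at or
# beyond +/-3. The data are invented for illustration: every point sits
# exactly on y = 1 + 0.5x except one planted outlier at x = 8.
import statistics

xs = list(range(1, 16))
ys = [1 + 0.5 * x for x in xs]
ys[7] = 10.0                      # planted outlier (true value would be 5.0)

# Least-squares slope and intercept for simple linear regression.
mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Standardize each residual by the residuals' sample standard deviation,
# then keep the x-values whose standardized residual is at or beyond 3.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
s = statistics.stdev(residuals)
outliers = [x for x, e in zip(xs, residuals) if abs(e / s) >= 3]
print(outliers)                   # [8]
```

Note that with very small data sets a single extreme point can inflate the residuals' standard deviation enough to hide itself, which is one more reason to eyeball the scatterplot rather than trust the cutoff alone.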

                                Before you panic, the best thing to do is to examine that outlier more closely.
                                First, can you find an error in that data value? Did someone report her age as
                                642, for instance? (After all, mistakes do happen.) If you do find a certifiable
                                error in your data set, you remove that data point (or fix it if possible) and
                                analyze the data without it. However, if you can’t explain away the problem
                                by finding a mistake, you must think of another approach.

                                If you can’t find a mistake that caused the outlier, you don’t necessarily have
                                to trash your model; after all, it’s only one data point. Analyze the data with
                                that data point, and analyze the data again without it. Then report and
                                compare both analyses. This comparison gives you a sense of how influential
                                that one data point is, and it may lead other researchers to conduct more
                                research to zoom in on the issue you brought to the surface.
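The "analyze it both ways" advice can be sketched in code, too. The following example (again with made-up data following y = 1 + 0.5x plus one planted outlier, not the book's numbers) fits the regression line with the outlier included and again with it removed, then reports both sets of results for comparison:

```python
# Hedged sketch: fit the regression line with and without a suspected
# outlier and compare the two analyses. Data are invented: y = 1 + 0.5x
# exactly, except for one planted outlier at x = 8.
import statistics

def fit(xs, ys):
    """Return (intercept, slope, r-squared) for a simple linear regression."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)

xs = list(range(1, 16))
ys = [1 + 0.5 * x for x in xs]
ys[7] = 10.0                                  # planted outlier at x = 8

with_out = fit(xs, ys)
cleaned = [(x, y) for x, y in zip(xs, ys) if x != 8]
without = fit([x for x, _ in cleaned], [y for _, y in cleaned])

print("with outlier:    b0=%.2f b1=%.2f r^2=%.2f" % with_out)
print("without outlier: b0=%.2f b1=%.2f r^2=%.2f" % without)
```

In this toy data set the slope barely moves but r² jumps from 0.75 to 1.00 once the outlier is removed, which mirrors the pattern the textbook-weight example shows next.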
                                In Figure 4-1, you can see the scatterplot of the full data set for the textbook-
                                weight example. Figure 4-7 shows the scatterplot for the data set minus the
                                outlier. The regression line fits the data better without the outlier. The
                                correlation increases to 0.993, and the value of r² increases to 0.986. The
                                equation for the regression line for this data set is y = 1.78 + 0.139x.

                                The slope of the regression line doesn’t change much by removing the outlier
                                (compare it to Figure 4-2, where the slope is 0.113). However, the y-intercept
                                changes: It’s now 1.78 without the outlier compared to 3.69 with the outlier.
                                The slopes of the lines are about the same, but the lines cross the y-axis in






