Page 93 - Statistics II for Dummies
Chapter 4: Getting in Line with Simple Linear Regression 77
well enough on its own. In this case, statisticians would try to add one
or more variables to the model to help explain y more fully as a group
(read more about this in Chapter 5).
For the textbook-weight example, the value of r (the correlation coefficient)
is 0.93. Squaring this result, you get r² = 0.8649. That number means approxi-
mately 86 percent of the variability you find in average textbook weights for
all students (y-values) is explained by the average student weight (x-values).
This percentage tells you that the model of using year in school to estimate
backpack weight is a good bet.
In the case of simple linear regression, you have only one x variable, but in
Chapter 5, you can see models that contain more than one x variable. In that
situation, you use r² to help sort out the contribution that those x variables
as a group bring to the model.
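To see how r and r² fall out of the arithmetic, here's a quick sketch in Python using made-up numbers (these are not the book's actual textbook-weight measurements; they're just a similar-looking linear pattern):

```python
import numpy as np

# Hypothetical data (NOT the book's measurements):
# x = year in school, y = average textbook weight in pounds
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([4.1, 5.0, 5.9, 7.2, 8.1, 9.3, 10.0, 11.4])

# Correlation coefficient r, then square it to get r^2
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# r^2 is the fraction of the variability in y explained by x
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Because these invented points lie nearly on a straight line, r comes out very close to 1 and r² tells you almost all the variability in y is explained by x, just like the 0.93 and 0.8649 in the book's example.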
Scoping for outliers
Sometimes life isn’t perfect (oh really?), and you may find a residual in your
otherwise tidy data set that totally sticks out. It’s called an outlier, and it has
a standardized value at or beyond +3 or –3. It threatens to blow the conditions
of your regression model and send you crying to your professor.
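You can spot a point at or beyond ±3 by standardizing the residuals yourself. Here's a minimal sketch with fabricated data (one deliberately corrupted observation; nothing here comes from the book's data set):

```python
import numpy as np

# Hypothetical data: a clean linear pattern with one wild point at index 10
x = np.arange(1, 21, dtype=float)
y = 2 + 0.5 * x
y[10] += 15          # corrupt one observation (at x = 11)

# Fit the least-squares line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Standardize: divide by the residual standard deviation,
# using n - 2 degrees of freedom for simple linear regression
s = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
standardized = residuals / s

# Flag any point at or beyond +/-3 standard deviations
outliers = np.where(np.abs(standardized) >= 3)[0]
print("Flagged indices:", outliers)
```

Only the corrupted point gets flagged; the well-behaved points all sit well inside the ±3 band.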
Before you panic, the best thing to do is to examine that outlier more closely.
First, can you find an error in that data value? Did someone report her age as
642, for instance? (After all, mistakes do happen.) If you do find a certifiable
error in your data set, you remove that data point (or fix it if possible) and
analyze the data without it. However, if you can’t explain away the problem
by finding a mistake, you must think of another approach.
If you can’t find a mistake that caused the outlier, you don’t necessarily have
to trash your model; after all, it’s only one data point. Analyze the data with
that data point, and analyze the data again without it. Then report and
compare both analyses. This comparison gives you a sense of how influential
that one data point is, and it may lead other researchers to conduct more
research to zoom in on the issue you brought to the surface.
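The with-and-without comparison is easy to automate. This sketch reuses the same kind of fabricated data as above (again, not the book's numbers) and fits the line both ways:

```python
import numpy as np

# Hypothetical data with one outlier at index 10 (not the book's data)
x = np.arange(1, 21, dtype=float)
y = 2 + 0.5 * x
y[10] += 15

def fit_report(x, y):
    """Return (slope, intercept, r^2) for a simple linear regression."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r ** 2

# Analyze with the outlier, then again without it, and compare
with_out = fit_report(x, y)
mask = np.ones(len(x), dtype=bool)
mask[10] = False                       # drop the suspect point
without = fit_report(x[mask], y[mask])

for label, (b1, b0, r2) in [("with outlier", with_out),
                            ("without outlier", without)]:
    print(f"{label}: slope={b1:.3f}, intercept={b0:.3f}, r^2={r2:.3f}")
```

As in the book's example, the slope barely moves, but the intercept shifts and r² jumps once the outlier is set aside; reporting both fits side by side shows readers exactly how influential that one point is.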
In Figure 4-1, you can see the scatterplot of the full data set for the textbook-
weight example. Figure 4-7 shows the scatterplot for the data set minus the
outlier. The scatterplot fits the data better without the outlier. The correlation
increases to 0.993, and the value of r² increases to 0.986. The equation for the
regression line for this data set is y = 1.78 + 0.139x.
The slope of the regression line doesn’t change much by removing the outlier
(compare it to Figure 4-2, where the slope is 0.113). However, the y-intercept
changes: It’s now 1.78 without the outlier compared to 3.69 with the outlier.
The slopes of the lines are about the same, but the lines cross the y-axis in
different places.