Page 98 - Intermediate Statistics for Dummies
                                                                Chapter 4: Getting in Line with Simple Linear Regression
                                                    and that the model fits well in more specific ways than the scatterplot and
                                                    correlation measure. This section presents methods for defining and assess-
                                                    ing the fit of a simple linear regression model.
                                                    Defining the conditions
                                                    Two major conditions must be met before you apply a simple linear regres-
                                                    sion model to a data set:
                                                       The y’s have to have a normal distribution for each value of x.
                                                       The y’s have to have a constant amount of spread (standard deviation)
                                                        for each value of x.
                                                    In the following sections, you look at these important conditions in depth.
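One common way to write both conditions in symbols is sketched below. The symbols are standard regression notation assumed here for illustration, not taken from the text above: \beta_0 and \beta_1 are the intercept and slope of the best-fitting line, \varepsilon is the vertical distance of a y-value from the line, and \sigma is the standard deviation of those distances.

```latex
y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)
```

Read this way, the first condition says that for each x the y-values are normal with mean \beta_0 + \beta_1 x (the value on the line), and the second says that \sigma is the same number no matter which x you look at.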
                                                    Normal y’s for every x
                                                    For any value of x, the population of possible y-values must have a normal
                                                    distribution. The mean of this distribution is the value for y that is on the
                                                    best-fitting line for that x-value. That is, some of your data falls above the
                                                    best-fitting line, some data falls below the best-fitting line, and a few may
                                                    actually land right on the line.
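To make this concrete, here is a minimal simulation sketch in Python. The intercept, slope, standard deviation, and x-value are hypothetical numbers chosen only for illustration. It draws many y-values at a single x from a normal distribution centered on the line, then checks that their average sits close to the line and that roughly half land above it and half below.

```python
import random
import statistics

# Hypothetical line and spread, chosen only for illustration.
INTERCEPT = 2.0
SLOPE = 0.5
SIGMA = 1.5


def simulate_y_values(x, n, rng):
    """Draw n y-values for one x: normal, centered on the line."""
    center = INTERCEPT + SLOPE * x
    return [rng.gauss(center, SIGMA) for _ in range(n)]


rng = random.Random(42)          # fixed seed so the run is repeatable
x = 10.0
line_value = INTERCEPT + SLOPE * x
ys = simulate_y_values(x, 10_000, rng)

mean_y = statistics.mean(ys)
above = sum(1 for y in ys if y > line_value)
below = sum(1 for y in ys if y < line_value)

print(f"value on the line at x={x}: {line_value}")
print(f"average of simulated y's:   {mean_y:.3f}")
print(f"above the line: {above}, below the line: {below}")
```

With real data you rarely have thousands of y-values at one x, so this is a picture of the population the condition describes rather than a test you would run on a sample.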
                                                    If the regression model fits well, the data values should be scattered
                                                    around the best-fitting line in such a way that about 68 percent of the values
                                                    lie within one standard deviation of the line, about 95 percent lie within
                                                    two standard deviations of the line, and about 99.7 percent lie within three
                                                    standard deviations of the line. This specification, as you may recall from
                                                    your intro stats course, is called the 68-95-99.7 rule, and it applies to all
                                                    bell-shaped data (that is, data that follow a normal distribution).
                                                    You can see in Figure 4-3 how for each x-value, the y-values you may observe
                                                    tend to be located near the best-fitting line in greater numbers, and as you
                                                    move away from the line, you see fewer and fewer y-values, both above and
                                                    below the line. More than that, they’re scattered around the line in a way that
                                                    reflects a bell-shaped curve, the normal distribution.
                                                    Why does this condition make sense? The data you collect on y for any particular
                                                    x-value varies from individual to individual (for example, not all students’
                                                    textbooks weigh the same, even for students who weigh the exact same
                                                    amount). But those values aren’t allowed to vary any way they want to. To fit
                                                    the conditions of a linear regression model, for each given value of x, the data
                                                    should be scattered around the line according to a normal distribution. Most of
                                                    the points should be close to the line, and as you get farther and farther from
                                                    the line, you can expect fewer and fewer data points to occur. So condition
                                                    number one is that the y-values have a normal distribution for each value of x.
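The "fewer and fewer points as you move away from the line" pattern can be illustrated with a short Python sketch. The line value and standard deviation below are hypothetical, and the y-values are simulated at one fixed x. The code counts how many points fall in successive bands of distance from the line, measured in standard deviations.

```python
import random

rng = random.Random(123)   # fixed seed so the run is repeatable
LINE_VALUE = 12.0          # hypothetical value on the line at the chosen x
SIGMA = 2.0                # hypothetical standard deviation around the line

ys = [rng.gauss(LINE_VALUE, SIGMA) for _ in range(20_000)]


def count_in_band(values, low_sd, high_sd):
    """Count values whose distance from the line is in [low_sd, high_sd) SDs."""
    return sum(
        1 for y in values
        if low_sd <= abs(y - LINE_VALUE) / SIGMA < high_sd
    )


bands = [(0, 1), (1, 2), (2, 3), (3, float("inf"))]
counts = [count_in_band(ys, low, high) for low, high in bands]

for (low, high), count in zip(bands, counts):
    print(f"{low} to {high} SDs from the line: {count} points")
```

The counts shrink from one band to the next, which is the bell-shaped falloff the condition asks for.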