Page 89 - Statistics II for Dummies

Chapter 4: Getting in Line with Simple Linear Regression  73


                                Same spread for every x
                                In order to use the simple linear regression model, as you move from left to
                                right on the x-axis, the spread in the y-values around the line should be the
                                same, no matter which value of x you’re looking at. This requirement is called
                                the homoscedasticity condition. (How they came up with that mouthful of a word
                                just for describing the fact that the standard deviations stay the same across
                                the x-values, I’ll never know.) This condition ensures that the best-fitting line
                                works well for all relevant values of x, not just in certain areas.

                                You can see in Figure 4-5 that no matter what the value of x is, the spread in
                                the y-values stays the same throughout. If the spread got bigger and bigger as
                                x got larger and larger, for example, the line would lose its ability to fit well
                                for those large values of x.
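
                                As a rough illustration of this condition, the sketch below compares the
                                spread of the y-values around the line for the smaller x-values against
                                the spread for the larger x-values. The function name spread_by_half and
                                both data sets are made up for this example, and comparing two halves is
                                only an informal check under those assumptions, not a formal test.

```python
from statistics import pstdev

def spread_by_half(xs, deviations):
    """Return (low_spread, high_spread): the standard deviation of the
    deviations from the line for the smaller-x half and the larger-x half."""
    pairs = sorted(zip(xs, deviations))   # order the points by x
    mid = len(pairs) // 2
    low = [d for _, d in pairs[:mid]]     # deviations at the smaller x-values
    high = [d for _, d in pairs[mid:]]    # deviations at the larger x-values
    return pstdev(low), pstdev(high)

# Hypothetical data: distance of each y-value above (+) or below (-) the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
even_spread = [1, -1, 1, -1, 1, -1, 1, -1]           # same spread throughout
fanning_out = [0.5, -0.5, 0.5, -0.5, 3, -3, 3, -3]   # spread grows with x

print(spread_by_half(xs, even_spread))   # the two spreads match
print(spread_by_half(xs, fanning_out))   # the second spread is much larger
```

                                When the two numbers are close, the spread looks about the same across
                                the x-values. When the second number is much larger than the first, the
                                spread is fanning out as x increases, which is the situation the
                                condition rules out.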


                                Finding and exploring the residuals


                                To check whether the y-values come from a normal distribution, you
                                need to measure how far off your predictions were from the actual data that
                                came in. These differences are called errors, or residuals. To evaluate whether
                                a model fits well, you need to check those errors and see how they stack up.

                                In a model-fitting context, the word error doesn’t mean “mistake.” It just means
                                a difference between the data and the prediction based on the model. The word I
                                like best to describe this difference is residual, however. It sounds more upbeat.

                                The following sections show you how to measure the residuals that the model
                                produces. You also explore the residuals to identify particular problems
                                that occurred in the process of fitting a straight line to the data. In
                                other words, looking at residuals helps you assess the fit of the model
                                and diagnose what caused a bad fit, if that was the case.

                                Finding the residuals
                                A residual is the difference between the observed value of y (from the
                                data set) and the predicted value of y, also known as ŷ (from the
                                best-fitting line). Its notation is (y – ŷ). Specifically, for any data
                                point, you take its observed
                                y-value (from the data) and subtract its expected y-value (from the line). If
                                the residual is large, the line doesn’t fit well in that spot. If the residual is
                                small, the line fits well in that spot.

                                For example, suppose you have a point in your data set (2, 4) and the
                                equation of the best-fitting line is y = 2x + 1. The predicted value of y
                                at x = 2 is (2 * 2) + 1 = 5. The observed value of y from the data set is 4.
                                Taking the observed value minus the predicted value, you get 4 – 5 = –1.
                                The residual for that particular data point (2, 4) is –1. If you instead
                                observe a y-value of 6 at x = 2 and use the same straight line to predict
                                y, then the residual is 6 – 5 = +1.
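
                                The arithmetic in this example can be sketched in a few lines of Python.
                                The function names predict and residual are made up for illustration, and
                                the slope of 2 and intercept of 1 come from the example line y = 2x + 1.

```python
def predict(x, slope=2.0, intercept=1.0):
    """Predicted y-value from the line y = slope * x + intercept."""
    return slope * x + intercept

def residual(x, y_observed, slope=2.0, intercept=1.0):
    """Observed y-value minus the predicted y-value at the same x."""
    return y_observed - predict(x, slope, intercept)

print(predict(2))       # 5.0, the value the line predicts at x = 2
print(residual(2, 4))   # -1.0, the point (2, 4) sits below the line
print(residual(2, 6))   # 1.0, the point (2, 6) sits above the line
```

                                A negative residual means the observed point falls below the line, and a
                                positive residual means it falls above the line.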







