Page 304 - Statistics for Dummies
                                         Part V: Statistical Studies and the Hunt for a Meaningful Relationship
                                                    But suppose the correlation is high; do you still need to look at the scatter-
                                                    plot? Yes. In some situations the data have a somewhat curved shape, yet the
                                                    correlation is still strong; in these cases making predictions using a straight
                                                    line is still invalid. Predictions need to be made based on a curve. (This topic
                                                    is outside the scope of this book; if you are interested, see Statistics II For
                                                    Dummies, where I tackle nonlinear relationships.)
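To see why a strong correlation doesn't excuse you from looking at the scatterplot, here is a minimal Python sketch. It is not from the book: the data are made up for illustration (points that lie exactly on the curve y = x², for x from 1 to 10), and pearson_r is a hypothetical helper name that computes the usual correlation coefficient r.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data: every point sits exactly on a curve, not a line.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

With these points r comes out above 0.95, even though the relationship is plainly curved. A straight line would miss the bend that a scatterplot makes obvious.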
                                                    Calculating the regression line
                                                    For the crickets and temperature data, you can see that the scatterplot in
                                                    Figure 18-1 shows a linear pattern. The correlation between cricket chirps
                                                    and temperature was found earlier in this chapter to be very strong (r = 0.98).
                                                    You now can find one line that best fits the data (in terms of having the small-
                                                    est overall distance to the points). Statisticians call this technique for finding
                                                    the best-fitting line a simple linear regression analysis using the least squares
                                                    method.
                                                   The formula for the best-fitting line (or regression line) is y = mx + b, where
                                                    m is the slope of the line and b is the y-intercept. This equation itself is the
                                                    same one used to find a line in algebra; but remember, in statistics the points
                                                    don’t lie perfectly on a line — the line is a model around which the data lie if a
                                                    strong linear pattern exists.
                                                     ✓ The slope of a line is the change in Y over the change in X. For example,
                                                        a slope of 10/3 means as the x-value increases (moves right) by 3 units, the
                                                        y-value moves up by 10 units on average.
                                                     ✓ The y-intercept is the place where the line crosses the y-axis; that is, the
                                                        value of y when x is zero. For example, in the equation y = 2x – 6, the line
                                                        crosses the y-axis at –6. The coordinates of this point are (0, –6); when a
                                                        line crosses the y-axis, the x-value is always 0.
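If you want to check the two examples above for yourself, here is a small Python sketch (my own illustration, not part of the book's method). The function name line is hypothetical, and Fraction is used only so that the slope 10/3 stays exact instead of being rounded.

```python
from fractions import Fraction

def line(x, m, b):
    """Return the y-value on the line y = m*x + b at the given x."""
    return m * x + b

# y-intercept: for y = 2x - 6, plug in x = 0.
y_at_zero = line(0, 2, -6)

# Slope: with a slope of 10/3, moving right 3 units changes y by 10 units.
m = Fraction(10, 3)
rise = line(3, m, 0) - line(0, m, 0)

print("y when x = 0:", y_at_zero)
print("change in y when x increases by 3:", rise)
```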
                                                    To come up with the best-fitting line, you need to find values for m and b that
                                                    fit the pattern of data the best, for your given criterion. Different criteria
                                                    exist and can lead to other lines, but the criterion I use in this book (and in
                                                    introductory-level statistics courses in general) is to find the line that
                                                    minimizes what statisticians call the sum of squares for error (SSE). The SSE is
                                                    the sum of all the squared vertical differences between the actual points in the
                                                    data set and the corresponding points on the proposed line. The line with the
                                                    lowest possible SSE wins, and its equation is used as the best-fitting line. This
                                                    process is where the name the least-squares method comes from.
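The SSE idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's procedure: the five data points and the three candidate lines are made up, and sse is a hypothetical helper name. The sketch only compares the candidates you hand it; it does not search for the overall best-fitting line.

```python
def sse(xs, ys, m, b):
    """Sum of squared vertical differences between the data and y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Illustrative (made-up) data with a roughly linear pattern.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Three proposed lines, each written as (slope m, y-intercept b).
candidates = [(2.0, 0.0), (1.5, 1.0), (2.5, -1.0)]

scores = {(m, b): sse(xs, ys, m, b) for m, b in candidates}
best = min(scores, key=scores.get)

for (m, b), score in scores.items():
    print(f"m = {m}, b = {b}: SSE = {score:.2f}")
print("lowest SSE among these candidates:", best)
```

Of these three lines, the one with slope 2 and y-intercept 0 has the smallest SSE, so it fits these points best among the candidates tried. Some other line that wasn't on the list could still do better, which is why trying lines one at a time is not a practical way to find the winner.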
                                                    You may be thinking that you have to try lots and lots of different lines to
                                                    see which one fits best. Fortunately, you have a more straightforward option
                                                    (although eyeballing a line on the scatterplot does help you think about what
                                                    you’d expect the answer to be). The best-fitting line has a distinct slope and






