Page 304 - Statistics for Dummies
Part V: Statistical Studies and the Hunt for a Meaningful Relationship
But suppose the correlation is high; do you still need to look at the scatter-
plot? Yes. In some situations the data have a somewhat curved shape, yet the
correlation is still strong; in these cases making predictions using a straight
line is still invalid. Predictions need to be made based on a curve. (This topic
is outside the scope of this book; if you are interested, see Statistics II For
Dummies, where I tackle nonlinear relationships.)
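To see how a strong correlation can hide a curve, here is a small Python sketch. The data are made up for illustration (y is simply x squared for x from 1 to 10) and do not come from this book; the helper name pearson_r is also just an assumption of the sketch. The correlation comes out above 0.97, yet the step from one y-value to the next keeps growing, which a straight line cannot do.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical curved data: y = x squared (not the cricket data).
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print("correlation:", round(r, 3))

# On a true straight line, every one-unit step in x changes y by the
# same amount. Here the steps keep getting bigger, which signals a curve.
steps = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
print("change in y per step:", steps)
```

The high correlation alone would suggest a line is fine; the growing steps (and a glance at the scatterplot) say otherwise.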
Calculating the regression line
For the crickets and temperature data, you can see that the scatterplot in
Figure 18-1 shows a linear pattern. The correlation between cricket chirps
and temperature was found earlier in this chapter to be very strong (r = 0.98).
You now can find one line that best fits the data (in terms of having the small-
est overall distance to the points). Statisticians call this technique for finding
the best-fitting line a simple linear regression analysis using the least squares
method.
The formula for the best-fitting line (or regression line) is y = mx + b, where
m is the slope of the line and b is the y-intercept. This equation itself is the
same one used to find a line in algebra; but remember, in statistics the points
don’t lie perfectly on a line — the line is a model around which the data lie if a
strong linear pattern exists.
✓ The slope of a line is the change in Y over the change in X. For example,
a slope of 10/3 means as the x-value increases (moves right) by 3 units, the
y-value moves up by 10 units on average.
✓ The y-intercept is that place on the y-axis where the value of x is zero. For
example, in the equation y = 2x – 6, the line crosses the y-axis at the point
–6. The coordinates of this point are (0, –6); when a line crosses the
y-axis, the x-value is always 0.
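The two bullets above can be checked with a few lines of Python. This is only a sketch: the function name line_y is made up for illustration, and the intercept of 0 used with the 10/3 slope is an assumption (the slope example does not specify one, and the rise does not depend on it).

```python
from fractions import Fraction

def line_y(x, m, b):
    """Return the y-value on the line y = mx + b at the given x."""
    return m * x + b

# Slope example: with a slope of 10/3, moving right 3 units raises y by 10.
m = Fraction(10, 3)   # exact fraction, avoids floating-point rounding
b = 0                 # assumed intercept; any value gives the same rise
rise = line_y(3, m, b) - line_y(0, m, b)
print("rise over 3 units of x:", rise)

# Intercept example: the line y = 2x - 6 crosses the y-axis where x = 0.
y_at_zero = line_y(0, 2, -6)
print("y-intercept of y = 2x - 6:", y_at_zero)
```

Fraction is used here so that 10/3 times 3 comes out as exactly 10 rather than a rounded decimal.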
To come up with the best-fitting line, you need to find the values of m and b that
fit the pattern of the data best according to a chosen criterion. Different criteria
exist and can lead to different lines, but the criterion I use in this book (and the
one used in introductory-level statistics courses in general) is to find the line that
minimizes what statisticians call the sum of squares for error (SSE). The SSE is the
sum of the squared vertical differences between the actual points in the data set and
the corresponding points on the proposed line. The line with the lowest possible SSE
wins, and its equation is used as the best-fitting line. This process is where the
name least-squares method comes from.
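Here is a minimal Python sketch of how the SSE compares proposed lines. The five data points and the three candidate lines are made up for illustration and are not the cricket data; the function name sse is likewise an assumption of this sketch.

```python
def sse(xs, ys, m, b):
    """Sum of squared vertical differences between each actual y-value
    and the y-value the line y = mx + b proposes for the same x."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data set (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Three proposed lines, written as (slope, y-intercept).
candidates = {
    "y = 1.0x + 1.0": (1.0, 1.0),
    "y = 0.5x + 2.5": (0.5, 2.5),
    "y = 0.6x + 2.2": (0.6, 2.2),
}

# Compute the SSE for each proposed line.
results = {name: sse(xs, ys, m, b) for name, (m, b) in candidates.items()}
for name, value in results.items():
    print(name, "has SSE", round(value, 2))

# The line with the lowest SSE fits these points best among the candidates.
best = min(results, key=results.get)
print("lowest SSE:", best)
```

Among these three candidates, y = 0.6x + 2.2 has the lowest SSE; for this particular made-up data set it also happens to be the least-squares line, so no other straight line would score lower.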
You may be thinking that you have to try lots and lots of different lines to
see which one fits best. Fortunately, you have a more straightforward option
(although eyeballing a line on the scatterplot does help you think about what
you’d expect the answer to be). The best-fitting line has a distinct slope and