Part II: Making Predictions by Using Regression
To estimate p, the chance of an event occurring, you need data that comes in
the form of yes or no, indicating whether or not the event occurred for each
individual in the data set. Because yes-or-no data don't follow a normal
distribution, a condition required for other types of regression, you need a
new type of regression model to do this job: logistic regression. Keep reading
this section to find out more about this model.
Defining a logistic regression model
A logistic regression model ultimately gives you an estimate for p, the
probability that a particular outcome will occur in a yes or no situation (for
example, the chance that it will rain versus not). The estimate is based on
information from one or more explanatory variables; you can call them x₁, x₂,
x₃, . . ., xₖ. (For example, x₁ = humidity, x₂ = barometric pressure, x₃ = cloud
cover, . . ., and xₖ = wind speed.) Note: In this chapter, I present only the case
where you use one explanatory variable. You can extend the ideas in exactly
the same way as you can extend the simple linear regression model (Chapter 4)
to a multiple regression model (Chapter 5).
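The one-variable setup described above can be sketched in code. The humidity and rain data, the starting coefficient values, and the learning settings below are all made up for illustration; this is a minimal pure-Python sketch of fitting β₀ and β₁ by gradient ascent on the log-likelihood, not the estimation method this book develops.

```python
import math

# Hypothetical yes/no data: did it rain (1 = yes, 0 = no) at a given
# humidity percentage x. These numbers are invented for illustration.
humidity = [20, 30, 40, 50, 60, 70, 80, 90]
rained = [0, 0, 0, 0, 1, 1, 1, 1]

def logistic(b0, b1, x):
    """The S-curve: the estimated probability p for a given x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Fit b0 and b1 by repeatedly stepping in the direction that
# increases the log-likelihood of the observed yes/no outcomes.
b0, b1, rate = 0.0, 0.0, 0.001
for _ in range(20000):
    g0 = sum(y - logistic(b0, b1, x) for x, y in zip(humidity, rained))
    g1 = sum((y - logistic(b0, b1, x)) * x for x, y in zip(humidity, rained))
    b0 += rate * g0
    b1 += rate * g1

# The fitted S-curve gives a low p at low humidity, a high p at high humidity.
print(logistic(b0, b1, 25), logistic(b0, b1, 85))
```

In practice you would hand this job to a statistics package rather than hand-rolling the fit, but the sketch shows what "estimating p from x" means here: the data pin down β₀ and β₁, and the S-curve then converts any x into a probability.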
Using an S-curve to estimate probabilities
In a simple linear regression model, the general form of a straight line is
y = β₀ + β₁x. In the case of estimating p, the linear regression model is the
straight line p = β₀ + β₁x. However, it doesn't make sense to use a straight line
to estimate the probability of an event occurring based on another variable,
for the following reasons:
- The estimated values of p can never be outside of [0, 1], which goes
against the idea of a straight line (a straight line continues on in both
directions).
- It doesn't make sense to force the values of p to increase in a linear
way based on x. For example, an event may occur very frequently over a
range of large values of x and very frequently over a range of small
values of x, with very little chance of the event happening in the area
in between. This type of model would have a U-shape, rather than a
straight-line shape.
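The first problem is easy to see with numbers. Using made-up coefficients for a straight-line model of p, a couple of quick calculations show the line producing "probabilities" that fall outside [0, 1]:

```python
# Hypothetical straight-line model for p: p = b0 + b1 * x,
# with coefficients invented purely for illustration.
b0, b1 = -0.5, 0.02

def straight_line_p(x):
    return b0 + b1 * x

# A straight line keeps going in both directions, so it can
# produce values that make no sense as probabilities.
print(straight_line_p(10))   # comes out negative
print(straight_line_p(100))  # comes out greater than 1
```

No choice of β₀ and β₁ fixes this: any non-flat line eventually leaves [0, 1] for large or small enough x.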
To come up with a more appropriate model for p, statisticians created a new
function of p whose graph is called an S-curve. The S-curve is a function that
involves p, but it also involves e (the base of the natural logarithm) as well
as a ratio of two functions. The values of the S-curve always fall between
0 and 1 and allow the probability, p, to change from low to high or high to
low, according to
a curve that is shaped like an S. The general form of the logistic regression
model based on an S-curve is p = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x)).
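You can check the S-curve's two key properties numerically. With made-up values for β₀ and β₁ (a positive β₁ makes the curve rise), a short sketch confirms that every estimate lands strictly between 0 and 1 and that the curve climbs smoothly from low to high:

```python
import math

# Coefficients invented purely for illustration.
b0, b1 = -6.0, 0.12

def s_curve(x):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))"""
    z = math.exp(b0 + b1 * x)
    return z / (1 + z)

# Evaluate the curve across a range of x values.
ps = [s_curve(x) for x in range(0, 101, 10)]

# Every estimate is a legal probability ...
assert all(0 < p < 1 for p in ps)
# ... and with b1 > 0 the curve rises from near 0 to near 1.
assert ps == sorted(ps)
print([round(p, 3) for p in ps])
```

With a negative β₁ the same formula produces the mirror-image S, falling from high to low, which is what lets the model describe events that become less likely as x grows.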