To estimate p, the chance of an event occurring, you need data that comes in the form of yes or no, indicating whether or not the event occurred for each individual in the data set. Because yes or no data don't follow a normal distribution (a condition needed for other types of regression), you need a new type of regression model to do this job: logistic regression. Keep reading this section to find out more about this model.
Defining a logistic regression model
A logistic regression model ultimately gives you an estimate for p, the probability that a particular outcome will occur in a yes or no situation (for example, the chance that it will rain versus not). The estimate is based on information from one or more explanatory variables; you can call them x₁, x₂, x₃, . . . , xₖ. (For example, x₁ = humidity, x₂ = barometric pressure, x₃ = cloud cover, . . . and xₖ = wind speed.) Note: In this chapter, I present only the case where you use one explanatory variable. You can extend the ideas in exactly the same way as you can extend the simple linear regression model (Chapter 4) to a multiple regression model (Chapter 5).
Using an S-curve to estimate probabilities
In a simple linear regression model, the general form of a straight line is y = β₀ + β₁x. In the case of estimating p, the linear regression model is the straight line p = β₀ + β₁x. However, it doesn't make sense to use a straight line to estimate the probability of an event occurring based on another variable, for the following reasons:
- The estimated values of p can never fall outside of [0, 1], but a straight line keeps going in both directions, so for extreme values of x it produces estimates below 0 or above 1 (see the sketch after this list).
- It doesn't make sense to force the values of p to increase in a linear way based on x. For example, an event may occur very frequently over a range of large values of x and very frequently over a range of small values of x, with very little chance of the event happening in between. This type of model would have a U-shape, rather than a straight-line shape.
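To see the first problem concretely, here is a minimal sketch. The intercept and slope (0.2 and 0.1) are made-up numbers chosen only for illustration, not values from the book; the point is that a straight-line model for p quickly produces numbers that can't be probabilities.

```python
# Hypothetical intercept and slope for a straight-line model p = b0 + b1*x
b0, b1 = 0.2, 0.1

for x in [-5, 0, 5, 10, 15]:
    p_line = b0 + b1 * x   # "probability" according to the straight line
    print(f"x = {x:>3}: p = {p_line:.2f}")

# The line gives p = -0.30 at x = -5 and p = 1.70 at x = 15; a negative
# probability or one above 1 is impossible, so the straight line fails.
```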
To come up with a more appropriate model for p, statisticians created a new function of p whose graph is called an S-curve. The S-curve is a function that involves p, but it also involves e (the base of the natural logarithm) as well as a ratio of two functions. The values of the S-curve always fall between 0 and 1 and allow the probability, p, to change from low to high or high to low, according to a curve that is shaped like an S. The general form of the logistic regression model based on an S-curve is p = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x)).
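Here is a short sketch of that formula in Python, again using the same made-up coefficients purely to illustrate; no matter what value of x you plug in, the result stays strictly between 0 and 1 and traces out an S shape as x increases.

```python
import math

def s_curve(x, b0, b1):
    """Logistic (S-curve) model: p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Same hypothetical coefficients as before; now every estimate lands in (0, 1).
b0, b1 = 0.2, 0.1
for x in [-50, -5, 0, 5, 50]:
    print(f"x = {x:>4}: p = {s_curve(x, b0, b1):.3f}")
# p climbs smoothly from near 0 toward 1 as x grows, following the S-curve.
```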