Page 224 - Statistics II for Dummies
208 Part III: Analyzing Variance with ANOVA
Seeing Regression through the Eyes of Variation
Every basic statistical model tries to explain why the different outcomes (y)
are what they are. It tries to figure out what factors or explanatory variables
(x) can help explain that variability in those y’s. In this section, you start with
the y-values by themselves and see how their variability plays a central role
in the regression model. This is the first step toward applying ANOVA to the
regression model.
No matter what y variable you’re interested in predicting, you’ll always have
variability in those y-values. If you want to predict the length of a fish, for
example, you know that fish have many different lengths (indicating a great
deal of variability). Even if you put all the fish of the same age and species
together, you still have some variability in their lengths (less than before,
but it's still there). The first step in understanding the basic ideas of
regression and ANOVA is to understand that variability in the y’s is to be
expected, and your job is to try to figure out what can explain most of it.
Spotting variability and finding an “x-planation”
Both regression and ANOVA work to get a handle on explaining the variability
in the y variable using an x variable. After you collect your data, you can
find the standard deviation in the y variable to get a sense of how much the
data varies within the sample. From there, you collect data on an x variable
and see how much it contributes to explaining that variability.
Suppose you notice that people spend different amounts of time on the
Internet, and you want to explore why that may be. You start by taking a
small sample of 20 people and recording how many hours per month they spend
on the Internet. The results (in hours) are 20, 20, 22, 39, 40, 19, 20, 32, 33, 29,
24, 26, 30, 46, 37, 26, 45, 15, 24, and 31. The first thing you notice about this
data is the large amount of variability in it. The standard deviation (roughly
the typical distance from the data values to their mean) of this data set is
8.93 hours, which is quite large given that the mean is only 28.9 hours.
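To check the arithmetic, here is a minimal Python sketch that recomputes the mean and standard deviation from the 20 values listed above. The variable names are illustrative only. The sketch assumes the 8.93 figure is the sample standard deviation (dividing by n - 1 rather than n), which is the version that reproduces it.

```python
import statistics

# Monthly Internet use (hours) for the 20 people in the sample above.
hours = [20, 20, 22, 39, 40, 19, 20, 32, 33, 29,
         24, 26, 30, 46, 37, 26, 45, 15, 24, 31]

n = len(hours)
mean = statistics.mean(hours)

# Total variation in y: the sum of squared deviations from the mean.
sum_sq_dev = sum((y - mean) ** 2 for y in hours)

# Sample standard deviation: divide by n - 1, then take the square root.
# (Assumption: the 8.93 in the text uses this n - 1 version.)
sample_sd = (sum_sq_dev / (n - 1)) ** 0.5

print(f"n = {n}, mean = {mean:.1f} hours")
print(f"sample standard deviation = {sample_sd:.2f} hours")
```

The intermediate sum of squared deviations is the raw measure of total variability in the y-values that the standard deviation is built from.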
So you've figured out that the y-values (the amount of time someone uses the
Internet) have a great deal of variability in them. What can help explain
this? Part of the variability is due to chance. But you suspect some variable
is out there (call it x) that has some connection to the y variable, and that x
variable can help you make more sense out of this seemingly wide range of
y-values.

