Page 345 - Statistics for Environmental Engineers

40

Regression Analysis with Categorical Variables

KEY WORDS acid rain, pH, categorical variable, F test, indicator variable, least squares, linear model,
regression, dummy variable, qualitative variables, regression sum of squares, t-ratio, weak acidity.

Qualitative variables can be used as explanatory variables in regression models. A typical case would be
when several sets of data are similar except that each set was measured by a different chemist (or different
instrument or laboratory), or each set comes from a different location, or each set was measured on a
different day. The qualitative variables (chemist, location, or day) typically take on discrete values
(e.g., chemist Smith or chemist Jones). For convenience, they are usually represented numerically by a
combination of zeros and ones to signify an observation’s membership in a category; hence the name
categorical variables.
One task in the analysis of such data is to determine whether the same model structure and parameter
values hold for each data set. One way to do this would be to fit the proposed model to each individual
data set and then try to assess the similarities and differences in the goodness of fit. Another way would
be to fit the proposed model to all the data as though they were one data set instead of several, assuming
that each data set has the same pattern, and then to look for inadequacies in the fitted model.
Neither of these approaches is as attractive as using categorical variables to create a collective data
set that can be fitted with a single model while retaining the distinction between the individual data sets.
This technique allows the model structure and the model parameters to be evaluated using statistical
methods like those discussed in the previous chapter.

Case Study: Acidification of a Stream During Storms

Cosby Creek, in the southern Appalachian Mountains, was monitored during three storms to study how
pH and other measures of acidification were affected by the rainfall in that region. Samples were taken
every 30 min and 19 characteristics of the stream water chemistry were measured (Meinert et al., 1982).
Weak acidity (WA) and pH will be examined in this case study.
Figure 40.1 shows 17 observations for storm 1, 14 for storm 2, and 13 for storm 3, giving a total of
44 observations. If the data are analyzed without distinguishing between storms, one might consider
models of the form $\mathrm{pH} = \beta_0 + \beta_1 \mathrm{WA} + \beta_2 \mathrm{WA}^2$ or $\mathrm{pH} = \theta_3 + (\theta_1 - \theta_3)\exp(-\theta_2 \mathrm{WA})$. Each storm might be
described by $\mathrm{pH} = \beta_0 + \beta_1 \mathrm{WA}$, but storm 3 does not have the same slope and intercept as storms 1 and
2, and storms 1 and 2 might be different as well. This can be checked by using categorical variables to
estimate a different slope and intercept for each storm.

Method: Regression with Categorical Variables

Suppose that a model needs to include an effect due to the category (storm event, farm plot, treatment,
truckload, operator, laboratory, etc.) from which the data came. This effect is included in the model in
the form of categorical variables (also called dummy or indicator variables). In general m − 1 categorical
variables are needed to specify m categories.
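
The coding can be illustrated with a minimal sketch in Python. It is not taken from the source. The function names (indicator_columns, fit_separate_lines), the choice of storm 1 as the reference category, and all numerical values are assumptions made for illustration; the data are synthetic and are not the Cosby Creek measurements. With m = 3 storms, m − 1 = 2 indicator variables are built, and each one enters the model twice: once by itself (a shift in the intercept) and once multiplied by WA (a shift in the slope).

```python
import numpy as np


def indicator_columns(labels, reference):
    """Build 0/1 indicator columns for every category except the reference.

    With m distinct categories this returns m - 1 columns.
    """
    levels = [lev for lev in sorted(set(labels)) if lev != reference]
    Z = np.array([[1.0 if lab == lev else 0.0 for lev in levels]
                  for lab in labels])
    return levels, Z


def fit_separate_lines(x, y, labels, reference):
    """Least-squares fit of a straight line whose intercept and slope
    may differ by category.

    Coefficient order: reference intercept, reference slope,
    intercept shifts (one per non-reference category),
    slope shifts (one per non-reference category).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    levels, Z = indicator_columns(labels, reference)
    X = np.column_stack([np.ones_like(x), x, Z, Z * x[:, None]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return levels, coef


# Synthetic data (assumed values, NOT the measured stream data).
# Only the group sizes 17, 14, and 13 mirror the case study.
rng = np.random.default_rng(0)
storm = np.repeat([1, 2, 3], [17, 14, 13])
wa = rng.uniform(0.0, 100.0, size=storm.size)   # arbitrary WA units
b0 = {1: 7.0, 2: 7.0, 3: 6.5}                   # assumed intercepts
b1 = {1: -0.010, 2: -0.010, 3: -0.005}          # assumed slopes
ph = np.array([b0[s] + b1[s] * w for s, w in zip(storm, wa)])
ph = ph + rng.normal(0.0, 0.05, size=storm.size)

levels, coef = fit_separate_lines(wa, ph, storm.tolist(), reference=1)
print("non-reference storms:", levels)
print("coefficients:", coef)
```

Under this coding the fitted line for the reference storm uses the first two coefficients alone, and the line for any other storm is obtained by adding that storm's intercept shift and slope shift. Shifts that are close to zero suggest that the storm shares the reference storm's intercept or slope.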
