
                       20




                       Multiple Paired Comparisons of k Averages






KEY WORDS data snooping, data dredging, Dunnett's procedure, multiple comparisons, sliding reference distribution, studentized range, t-tests, Tukey's procedure.

                       The problem of comparing several averages arises in many contexts: compare five bioassay treatments
                       against a control, compare four new polymers for sludge conditioning, or compare eight new combina-
                       tions of media for treating odorous ventilation air. One multiple paired comparison problem is to compare
                       all possible pairs of k treatments. Another is to compare k – 1 treatments with a control.
Knowing how to do a t-test may tempt us to compare several combinations of treatments using a series of paired t-tests. If there are k treatments, the number of pair-wise comparisons that could be made is k(k – 1)/2. For k = 4 there are 6 possible comparisons, for k = 5 there are 10, for k = 10 there are 45, and for k = 15 there are 105. Checking 6, 10, 45, or even 105 comparisons is manageable but not recommended. Statisticians call this data snooping (Sokal and Rohlf, 1969) or data dredging (Tukey, 1991). We need to understand why data snooping is dangerous.
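The count k(k – 1)/2 is simply the number of ways to choose two treatments from k. As a quick check of the arithmetic above, a short Python sketch (illustrative only) enumerates these counts:

```python
# Number of pair-wise comparisons among k treatments: k(k - 1)/2,
# i.e., "k choose 2".
from math import comb

for k in (4, 5, 10, 15):
    print(f"k = {k:2d}: {comb(k, 2)} possible pair-wise comparisons")
# k =  4: 6,  k =  5: 10,  k = 10: 45,  k = 15: 105
```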
Suppose, to take a not-too-extreme example, that we have 15 different treatments. The number of possible pair-wise comparisons that could be made is 15(15 – 1)/2 = 105. If, before the results are known, we make one selected comparison using a t-test with a 100α% = 5% error rate, there is a 5% chance of reaching the wrong decision each time we repeat the data collection experiment for those two treatments. If, however, several pairs of treatments are tested for possible differences using this procedure, the error rate will be larger than the nominal 5% rate. Imagine that a two-sample t-test is used to compare the largest of the 15 average values against the smallest. The null hypothesis that this difference, the largest of all 105 possible pair-wise differences, is zero is likely to be rejected almost every time the experiment is repeated, instead of at just the 5% rate that would apply to making one pair-wise comparison selected at random from among the 105 possible comparisons.
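A small simulation makes the danger concrete. The sketch below, with illustrative choices of sample size, number of repetitions, and a common normal distribution for all 15 treatments (so every rejection is a false positive), estimates how often at least one of the 105 t-tests rejects at the individual 5% level:

```python
# Monte Carlo sketch of the inflated family error rate when all pair-wise
# t-tests are run at an individual 5% level. All 15 treatment means are
# equal, so any rejection is a wrong conclusion. Sample size n = 10 and
# 1000 repetitions are illustrative assumptions, not values from the text.
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
k, n, alpha, reps = 15, 10, 0.05, 1000

false_alarms = 0
for _ in range(reps):
    groups = [rng.normal(loc=0.0, scale=1.0, size=n) for _ in range(k)]
    # Does at least one of the 105 pair-wise tests reject?
    if any(ttest_ind(a, b).pvalue < alpha
           for a, b in combinations(groups, 2)):
        false_alarms += 1

print(f"Estimated family error rate: {false_alarms / reps:.2f}")
# Far above the nominal 0.05 individual rate.
```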
The number of comparisons does not have to be large for problems to arise. Suppose there are just three treatment methods and, of the three averages, A is larger than B and C is slightly larger than A (ȳ_C > ȳ_A > ȳ_B). It is possible for the three possible t-tests to indicate that A gives higher results than B (η_A > η_B), that A is not different from C (η_A = η_C), and that B is not different from C (η_B = η_C). This apparent contradiction can happen because different variances are used to make the different comparisons. Analysis of variance (Chapter 21) eliminates this problem by using a common variance to make a single test of significance (using the F statistic).
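For illustration, a hypothetical three-treatment data set (the values below are invented) can be submitted to the kind of single F test that analysis of variance uses in place of the three separate t-tests:

```python
# Sketch of a one-way analysis of variance on three treatments. The single
# F test pools a common variance estimate across all groups rather than
# computing a separate variance for each pair-wise comparison.
from scipy.stats import f_oneway

y_A = [10.2, 9.8, 10.5, 10.1]   # invented data for illustration
y_B = [9.5, 9.7, 9.4, 9.9]
y_C = [10.4, 10.6, 10.3, 10.2]

result = f_oneway(y_A, y_B, y_C)
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```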
The multiple comparison test is similar to a t-test, but an allowance is made in the error rate to keep the collective error rate at the stated level. This collective rate can be defined in two ways. Returning to the example of 15 treatments and 105 possible pair-wise comparisons, the probability of reaching the wrong conclusion on a single randomly selected comparison is the individual error rate. The family error rate (also called the Bonferroni error rate) is the chance of getting one or more of the 105 comparisons wrong in a repetition of data collection for all 15 treatments; an error on any one comparison counts as an error for the whole family. Thus, to make valid statistical comparisons, the individual per-comparison error rate must be shrunk to keep the simultaneous family error rate at the desired level.
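The arithmetic behind this shrinkage can be sketched as follows. Treating the 105 comparisons as independent is an approximation (the pair-wise tests share data, so they are correlated), but it shows the scale of the problem and the simple Bonferroni remedy of dividing the family rate by the number of comparisons:

```python
# Error-rate bookkeeping for m comparisons, assuming independence for
# simplicity. With each test run at individual rate alpha, the chance of
# at least one wrong conclusion is 1 - (1 - alpha)**m. The Bonferroni
# correction shrinks the individual rate to alpha_family / m.
m = 105        # pair-wise comparisons among k = 15 treatments
alpha = 0.05

family_rate = 1 - (1 - alpha) ** m
print(f"Family error rate if each test uses 5%: {family_rate:.3f}")  # ~0.995

alpha_individual = alpha / m   # Bonferroni-adjusted per-comparison rate
print(f"Per-comparison rate for a 5% family rate: {alpha_individual:.5f}")
```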





