contains the most extreme values of T, i.e. the tails of its distribution. Most users of statistics stop here, and perform a test verifying whether t belongs to R or to A and, as a consequence, respectively reject H0 or fail to reject it, as part of a ritual (Gigerenzer 2004). In this situation, an alternative way of reaching the same result is to compare the p-value, whenever defined, to a fixed threshold α: if the p-value is smaller than α, we reject H0, otherwise we fail to reject it.
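As a concrete illustration of the equivalence between the two decision rules, the following sketch (in Python, which is merely a convenient choice for this example and is not tied to the chapter) applies a two-sided one-sample t-test to hypothetical simulated data: comparing t with the boundary of the rejection region and comparing the p-value with α lead, by construction, to the same decision. The data, the null value and the threshold α = 0.05 are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=30)   # hypothetical simulation output
alpha = 0.05                                  # conventional threshold, assumed here

# Two-sided one-sample t-test of H0: mean = 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)

# Decision rule 1: does t fall in the rejection region R?
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)   # boundary between A and R
reject_by_region = abs(t_stat) > t_crit

# Decision rule 2: is the p-value smaller than alpha?
reject_by_pvalue = p_value < alpha

print(t_stat, p_value)
print(reject_by_region, reject_by_pvalue)   # the two rules always agree
assert reject_by_region == reject_by_pvalue
```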
It is interesting to review the relations among the quantities seen until now. We saw before that the effect size d has an impact on β. Since d is a measure of how easy it is to discriminate between H0 and H1, it is generally the case that power, 1 − β, increases with d when α is fixed.⁶ Another factor affecting α and β is the sample size N. In this case too, 1 − β generally increases with N, when α is fixed. Finally, the formulas $P_{H_0}\{T \in A\} = 1 - \alpha$ and $P_{H_1}\{T \in A\} = \beta$ show that there is a trade-off between α and β. Indeed, when A gets larger, α decreases while β increases, and vice versa. This explains why, when N and d are fixed, it is not possible to reduce α without consequences on the Type-II error rate β.⁷
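To make the trade-off tangible, consider (purely as an illustration, not a construction used in the chapter) a one-sided z-test of H0: μ = 0 against H1: μ = d with known unit variance, for which β = Φ(z_{1−α} − d√N); the short sketch below shows how shrinking α inflates β when N and d are held fixed.

```python
import numpy as np
from scipy.stats import norm

d, N = 0.3, 50          # hypothetical effect size and sample size, held fixed

for alpha in (0.10, 0.05, 0.01, 0.001):
    z_crit = norm.ppf(1 - alpha)               # boundary of the rejection region
    beta = norm.cdf(z_crit - d * np.sqrt(N))   # P_{H1}{T in A} for this test
    print(f"alpha = {alpha:6.3f}  ->  beta = {beta:.3f}  power = {1 - beta:.3f}")
```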
This trade-off is the reason why one cannot simply make α as small as possible: doing so inflates β. This fact suggests that good results could be achieved by balancing the two error rates. This was indeed proposed by Neyman and Pearson in 1933,⁸ and has been revived several times since then. A more recent attempt in this direction is the compromise power analysis of Erdfelder (1984). However, the most common approach is to consider the two sources of error differently.
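Before turning to these approaches, a rough numerical sketch of the balancing idea may help; it uses the same illustrative one-sided z-test as above (an assumption, since compromise power analysis is formulated for general tests) and finds the critical value for which the ratio β/α equals a prescribed value q, here q = 1 so that the two error rates coincide.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def compromise_critical_value(d, N, q=1.0):
    """Find the critical value c such that beta/alpha = q for a
    one-sided z-test of H0: mu = 0 vs H1: mu = d (unit variance)."""
    shift = d * np.sqrt(N)                                      # mean of T under H1
    f = lambda c: norm.cdf(c - shift) - q * (1 - norm.cdf(c))   # beta - q * alpha
    return brentq(f, -10, shift + 10)                           # root of the balance equation

c = compromise_critical_value(d=0.3, N=50, q=1.0)
alpha = 1 - norm.cdf(c)
beta = norm.cdf(c - 0.3 * np.sqrt(50))
print(c, alpha, beta)    # with q = 1 the two error rates coincide
```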
A first approach completely disregards β: a value for α is rigorously fixed (often as α = 0.05), and the test checks whether t belongs to A or not using a sample whose size N has been selected without reference to β. This approach is the one that most closely resembles the original Fisher paradigm, as the alternative hypothesis has practically no role in it. It is based on the fact that, as N increases, β goes to 0, so that a large sample size guarantees that β will be small enough. A second approach supplements this part of the analysis with the computation of power using a value of d estimated on the basis of the data, a procedure called post hoc power analysis. Because of the large variability of the estimated effect size, this approach is generally regarded with suspicion by statisticians (Korn 1990; Hoenig and Heisey 2001). In the third approach, the researcher fixes α and β, hypothesizes a value of d, and chooses A and N so that both $P_{H_0}\{T \in A\} = 1 - \alpha$ and $P_{H_1}\{T \in A\} = \beta(d)$ hold true. This procedure, called a priori power analysis, guarantees that, if d is correctly guessed, the desired values of α and β will be achieved.
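As a minimal sketch of what an a priori power analysis amounts to, assume again the illustrative one-sided z-test with unit variance: fixing α, β and a hypothesized d, both conditions are met by any N ≥ ((z_{1−α} + z_{1−β})/d)², so the smallest admissible sample size is obtained by rounding this quantity up. Dedicated software (e.g. G*Power) performs the analogous computation for more realistic designs.

```python
from math import ceil
from scipy.stats import norm

def a_priori_sample_size(d, alpha=0.05, beta=0.20):
    """Smallest N such that a one-sided z-test of H0: mu = 0 vs
    H1: mu = d (unit variance) has size alpha and power 1 - beta."""
    z_alpha = norm.ppf(1 - alpha)    # critical value of the rejection region
    z_beta = norm.ppf(1 - beta)      # quantile needed to reach the target power
    return ceil(((z_alpha + z_beta) / d) ** 2)

# Hypothetical figures: alpha = 0.05, power = 0.80, guessed effect size d = 0.3
print(a_priori_sample_size(d=0.3))   # -> 69
```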
6 This also explains why in some cases it is possible to increase the power of a test by designing
an experiment in which it is expected that the effect size d, if not null, is large. As an example,
in ABM this could be done by setting some of the quantities entering the model to their extreme
values.
7 See also van der Vaart (2000, p. 213) or Choirat and Seri (2012, Proposition 7, p. 285).
8 The authors say: “The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator” (Neyman and Pearson 1933, p. 296).