population of other years? Common sense (and other senses as well) rejects such a
claim. If a statistically significant difference was detected, one should look
carefully at the conditions surrounding the data collection: can the samples be
considered random? Was the 1996 sample perhaps collected from at-risk foetuses
with lower baseline measurements? And so on. As a matter of fact, when dealing
with large samples even a small compositional difference may produce statistically
significant results. For instance, for the sample sizes of the CTG dataset even a
difference as small as 1 bpm produces a result usually considered statistically
significant (p = 0.02). However, obstetricians only attach practical meaning to
rhythm differences above 5 bpm; i.e., the statistically significant difference of
1 bpm has no practical significance.
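The phenomenon is easy to reproduce by simulation. The following R sketch
illustrates it; the group sizes and the bpm standard deviation are illustrative
assumptions, not the actual CTG dataset values:

  # Hypothetical sizes and spread; not the actual CTG dataset values.
  set.seed(1)
  n <- 2000                              # assumed size of each sample
  s <- 10                                # assumed FHR spread, in bpm
  fhr.a <- rnorm(n, mean = 137, sd = s)  # hypothetical 1996 sample
  fhr.b <- rnorm(n, mean = 138, sd = s)  # hypothetical other-years sample
  t.test(fhr.a, fhr.b)$p.value           # a mere 1 bpm shift

Despite the trifling 1 bpm difference, the t test reports a p value well below
0.05, simply because the samples are large.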
Inferring causality from data is an even riskier endeavour than making simple
comparisons. An often encountered example is the inference of causality from a
statistically significant but spurious correlation. We give more details on this issue
in section 4.4.1.
One must also be very careful when performing goodness of fit tests. A
common example is the normality assessment of a data distribution. Many papers
can be found in which the authors conclude the normality of data distributions
based on very small samples. (We have found a paper presented at a congress
where the authors claimed the normality of a data distribution based on a sample
of four cases!) As explained in detail in section 5.1.6, even with samples of size 25
one would often be wrong in admitting that a data distribution is normal because a
statistical test didn't reject that possibility at a 95% confidence level. Worse: one
would often be accepting the normality of data generated with asymmetrical and
even bimodal distributions! Data distribution modelling is a difficult problem that
usually requires large samples, and even then one must bear in mind that most of
the time one only has evidence of a model, not proof beyond a reasonable doubt;
the true distribution remains unknown.
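The pitfall is easy to demonstrate by simulation. In the following R sketch (the
mixture parameters are illustrative assumptions), samples of size 25 are repeatedly
drawn from a clearly bimodal distribution and submitted to the Shapiro-Wilk
normality test:

  set.seed(1)
  accepted <- replicate(1000, {
    # bimodal mixture of two normal distributions, total n = 25
    x <- c(rnorm(13, mean = -1.5), rnorm(12, mean = 1.5))
    shapiro.test(x)$p.value > 0.05       # TRUE = normality not rejected
  })
  mean(accepted)                         # fraction "accepted" as normal

A sizeable fraction of these clearly non-normal samples passes the test at the 95%
confidence level.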
Another misuse of inferential statistics arises in the assessment of classification
or regression models. Many people, when designing a classification or regression
model that performs very well on a training set (the set used in the design), suffer
from a kind of love-at-first-sight syndrome that leads them to neglect or relax the
evaluation of their models on test sets (independent of the training sets). The
research literature is full of examples of improperly validated models that are later
dropped when more data become available and the initial optimism evaporates.
The love-at-first-sight syndrome is even stronger when using computer software
that automatically searches for the best set of variables describing the model. The
book by Chamont Wang (Wang C, 1993), where many illustrations and words of
caution on the topic of inferential statistics can be found, mentions an experiment
in which 51 data samples were generated with 100 random numbers each and a
regression model was sought “explaining” one of the data samples (playing the
role of dependent variable) as a function of the others (playing the role of
independent variables). The search ended by finding a regression model with a
significant R-square and six significant coefficients at the 95% confidence level. In
other words, a functional model was found explaining a relationship between
noise and noise!
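A minimal R sketch of such an experiment, using the sizes described above, might
read as follows; Wang's exact search procedure is not specified here, so forward
stepwise selection (AIC-based, via step) is assumed:

  set.seed(1)
  # 51 columns of 100 random numbers: one "dependent" variable, 50 candidates
  noise <- as.data.frame(matrix(rnorm(100 * 51), nrow = 100))
  names(noise) <- c("y", paste0("x", 1:50))
  upper <- reformulate(paste0("x", 1:50), response = "y")
  best  <- step(lm(y ~ 1, data = noise), scope = upper,
                direction = "forward", trace = 0)
  summary(best)    # several coefficients typically appear "significant"

With pure noise as raw material, the automatic search nevertheless returns a
model whose summary displays several apparently significant coefficients.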
Such a model would collapse had proper validation been applied. In the present