Page 35 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

14       1 Introduction


           example) and on appropriate  models and/or  conditions that the datasets  must
           satisfy.
              Let us now look in more detail at what a confidence level really means. Imagine
           that in Example 1.2 we  were dealing  with a random sample extracted from a
           population of a very large number of students, attending the course and subject to
           an examination under the same conditions. Thus, only one random variable plays a
           role  here: the student  variability in the apprehension  of  knowledge.  Consider,
           further, that  we  wanted to statistically assess the  statement “the student
           performance is 3 or above”. Denoting by p the probability of the event “the student
           performance is 3 or above” we derive from the dataset an estimate of p, known as
           point estimate and denoted $\hat{p}$, as follows:

               $$\hat{p} = \frac{12 + 15 + 10}{50} = 0.74.$$

              The  question  is how reliable this estimate is. Since the  random variable
           representing such an estimate (with random samples of 50 students) takes values in
           a continuum, we know that the probability that the true proportion is exactly
           that particular value (0.74) is zero. We then lose a bit of our innate and candid
           faith in exact numbers, relax our exigency, and move forward to thinking in terms
           of intervals around $\hat{p}$ (interval estimate). We now ask with which degree of
           certainty (confidence level) we can say that the true proportion p of students with
           “performance 3 or above” is, for instance, between 0.72 and 0.76, i.e., with a
           deviation – or tolerance – of ε = ±0.02 from that estimated proportion?
              In  order to answer this  question one needs to know the so-called  sampling
           distribution of the following random variable:

               $$P_n = \frac{1}{n} \sum_{i=1}^{n} X_i ,$$

           where the $X_i$ are n independent random variables whose values are 1 in case of
           “success” (student performance ≥ 3 in this example) and 0 in case of “failure”.
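
           The definition above can be tried out numerically. The following Python
           fragment is only a minimal sketch: the function names, the seed and the
           number of repetitions are illustrative choices, and the value 0.74 (the point
           estimate) is assumed as a stand-in for the unknown true p.

```python
import random


def sample_proportion(n, p, rng):
    """Draw n independent X_i (1 = success, 0 = failure) and return P_n."""
    xs = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(xs) / n


def simulate(n, p, repeats, seed=0):
    """Return `repeats` independent realisations of P_n (hypothetical helper)."""
    rng = random.Random(seed)
    return [sample_proportion(n, p, rng) for _ in range(repeats)]


# Assumption: p = 0.74 is borrowed from the point estimate for illustration only.
props = simulate(n=50, p=0.74, repeats=2000)
mean = sum(props) / len(props)
print(f"mean of simulated P_n: {mean:.3f}")
```

           Each simulated value is a proportion computed from 50 students, so the list
           props gives an empirical picture of how P_n scatters around p from one
           random sample to another.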
              When the np and n(1–p) quantities are “reasonably large”, $P_n$ has a distribution
           well approximated by the normal distribution with mean equal to p and standard
           deviation equal to $\sqrt{p(1-p)/n}$. This topic is discussed in detail in Appendices A
           and B,  where what is meant by “reasonably large” is also presented. For the
           moment, it will suffice to  say that using  the normal distribution approximation
           (model),  one  is able to compute confidence levels  for several  values of the
           tolerance, ε, and sample size, n, as shown in Table 1.6 and displayed in Figure 1.6.
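
           As a rough illustration of this computation, the Python sketch below evaluates
           the probability that P_n falls within ±ε of p under the normal approximation,
           i.e. 2Φ(ε/σ) – 1 with σ = sqrt(p(1–p)/n). The function name and the grid of
           n and ε values are assumptions made for the example, not the entries of
           Table 1.6, and the point estimate 0.74 is plugged in for the unknown p.

```python
import math


def confidence_level(p, n, eps):
    """Normal-approximation probability that P_n lies within +/- eps of p."""
    sd = math.sqrt(p * (1.0 - p) / n)           # standard deviation of P_n
    # 2*Phi(z) - 1 equals erf(z / sqrt(2)) for the standard normal CDF Phi.
    return math.erf(eps / (sd * math.sqrt(2.0)))


# Assumption: illustrative grid of sample sizes and tolerances.
p_hat = 0.74
for eps in (0.02, 0.01):
    for n in (50, 500, 5000):
        level = confidence_level(p_hat, n, eps)
        print(f"eps={eps:.2f}  n={n:5d}  confidence level={level:.3f}")
```

           Under these assumptions, n = 50 and ε = 0.02 give a confidence level of only
           about 0.25; the level rises as n grows and falls as ε shrinks.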
              Two important aspects are illustrated in Table 1.6 and  Figure  1.6:  first, the
           confidence level always converges to  1 (absolute certainty) with increasing n;
           second, when we want to be more precise in our interval estimates by decreasing
           the tolerance, then,  for  fixed  n,  we  have to lower the confidence levels, i.e.,
           simultaneous  and arbitrarily good  precision and certainty are impossible (some
           trade-off is always necessary). In the “jury verdict” analogy it is the same as if one
           said the degree of certainty increases with the number of evidential facts (tending