Page 70 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

2.2 Presenting the Data


Let X denote the random variable associated with PRT. The histogram of the frequency values then represents an estimate, $\hat{f}_X(x)$, of the unknown probability density function $f_X(x)$.
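To make the idea of a histogram as a density estimate concrete, here is a minimal Python sketch. It assumes equal-width bins and rescales the bar counts by n times the bin width, so that the bars enclose unit area. The function name `histogram_density` and the synthetic sample are illustrative assumptions; the sample is not the PRT data.

```python
import numpy as np

def histogram_density(samples, r):
    """Return bin edges and the density estimate (bar heights) for r equal-width bins."""
    samples = np.asarray(samples, dtype=float)
    counts, edges = np.histogram(samples, bins=r)
    widths = np.diff(edges)
    # Dividing the counts by n * bin width turns frequencies into a density estimate.
    density = counts / (samples.size * widths)
    return edges, density

# Synthetic stand-in sample (an assumption for illustration, NOT the PRT data).
rng = np.random.default_rng(0)
x = rng.normal(loc=700.0, scale=250.0, size=150)

edges, f_hat = histogram_density(x, r=6)
area = float(np.sum(f_hat * np.diff(edges)))  # total area under the bars
print(f_hat, area)
```

The total area under the rescaled bars equals 1, which is what allows the histogram to be read as an estimate of a probability density function.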
The number of bins to use in a histogram (or in a frequency table) should be chosen according to how well the resulting histogram fits the true density function $f_X(x)$, in terms of bias and variance. To clarify this issue, consider the histograms of PRT using r = 3 and r = 50 bins, as shown in Figure 2.18. In both cases, consider the $\hat{f}_X(x)$ estimate represented by a polygonal line passing through the mid-point values of the histogram bars. In the first case (r = 3), the estimate is quite smooth and lacks detail, corresponding to a large bias, i.e., an expected value of $\hat{f}_X(x) - f_X(x)$ that is large in magnitude: on average (over an ensemble of similar histograms associated with X), the histogram gives a point estimate of the density that can be quite far from the true density. In the second case (r = 50), the estimate is too rough: the polygonal line may pass quite near the true density values, but the $\hat{f}_X(x)$ values vary widely (large variance) around the $f_X(x)$ curve (which corresponds to the average of a large number of such histograms).
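The bias and variance behaviour described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it draws repeated samples from a standard normal density (a hypothetical stand-in, not the PRT data), builds histograms with r = 3 and r = 50 equal-width bins on a fixed interval, and compares the mid-point estimates with the known true density. The function names, the sample size n = 150 and the interval [-4, 4] are all choices made for the example.

```python
import numpy as np

def true_density(x):
    """Standard normal density, used here as the known 'true' f_X (an assumption)."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def bias_and_variance(r, n=150, trials=500, lo=-4.0, hi=4.0, seed=0):
    """Mean squared bias and mean variance of the mid-point density estimates."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(lo, hi, r + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    estimates = np.empty((trials, r))
    for t in range(trials):
        sample = rng.normal(size=n)
        # Values outside [lo, hi] are ignored; for this density they are very rare.
        counts, _ = np.histogram(sample, bins=edges)
        estimates[t] = counts / (n * np.diff(edges))
    mean_estimate = estimates.mean(axis=0)  # ensemble average of the histograms
    sq_bias = float(np.mean((mean_estimate - true_density(mids)) ** 2))
    variance = float(np.mean(estimates.var(axis=0)))
    return sq_bias, variance

bias3, var3 = bias_and_variance(r=3)
bias50, var50 = bias_and_variance(r=50)
print("r = 3 :", bias3, var3)
print("r = 50:", bias50, var50)
```

With these settings the r = 3 histograms show the larger squared bias, whereas the r = 50 histograms show the larger variance, in line with the qualitative argument in the text.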

[Figure: histogram of PRT; vertical axis "No of obs" (0 to 50), horizontal axis "PRT" (104 to 1612).]

Figure 2.17. Histogram of variable PRT (cork stopper dataset) obtained with STATISTICA using r = 6 bins.

Some formulas for selecting a “reasonable” number of bins, r, achieving a trade-off between large bias and large variance, have been proposed in the literature, namely:

   $r = 1 + 3.3 \log(n)$   (Sturges, 1926);   (2.1)

   $r = 1 + 2.2 \log(n)$   (Larson, 1975).   (2.2)
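As a rough illustration, formulas 2.1 and 2.2 can be evaluated as follows. This sketch assumes that n is the number of cases and that log denotes the base-10 logarithm, which matches the usual form of Sturges' rule, 1 + log2(n) ≈ 1 + 3.3 log10(n). Rounding the result to the nearest integer is an assumption made here, as are the function names and the sample size n = 100.

```python
import math

def bins_sturges(n):
    """Formula 2.1: r = 1 + 3.3 log(n), with log taken as base 10 (assumption)."""
    return 1 + 3.3 * math.log10(n)

def bins_larson(n):
    """Formula 2.2: r = 1 + 2.2 log(n), with log taken as base 10 (assumption)."""
    return 1 + 2.2 * math.log10(n)

n = 100  # hypothetical number of cases, chosen only for illustration
r_sturges = round(bins_sturges(n))  # rounding to the nearest integer is an assumption
r_larson = round(bins_larson(n))
print(r_sturges, r_larson)  # prints: 8 5
```

For n = 100 the formulas give 7.6 and 5.4 before rounding, so the two rules suggest a similar, moderate number of bins.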