


$$h f(\xi_k) = \int_{B_k} f(t)\,dt \quad \text{for some } \xi_k \text{ in } B_k. \tag{8.8}$$
This is based on the assumption that the probability density function $f(x)$ is Lipschitz continuous over the bin interval $B_k$. A function is Lipschitz continuous if there is a positive constant $\gamma_k$ such that

$$\left| f(x) - f(y) \right| < \gamma_k \left| x - y \right| \quad \text{for all } x, y \text{ in } B_k. \tag{8.9}$$
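For instance, suppose $f(x)$ is the standard normal density. Then $f'(x) = -x f(x)$, and the magnitude of the derivative is largest at $x = \pm 1$, so by the mean value theorem any constant at least as large as

$$\max_x \left| f'(x) \right| = \frac{1}{\sqrt{2\pi e}} \approx 0.242$$

serves as a Lipschitz constant $\gamma_k$ on every bin $B_k$.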
                             The first term in Equation 8.7 is an upper bound for the variance of the den-
                             sity estimate, and the second term is an upper bound for the squared bias of
                             the density estimate. This upper bound shows what happens to the density
                             estimate when the bin width h is varied.
                               We can try to minimize the MSE by varying the bin width h. We could set
                             h very small to reduce the bias, but this also increases the variance. The
                             increased variance in our density estimate is evident in Figure 8.1, where we
                             see more spikes as the bin width gets smaller. Equation 8.7 shows a common
                             problem in some density estimation methods: the trade-off between variance
                             and bias as h is changed. Most of the optimal bin widths presented here are
                             obtained by trying to minimize the squared error.
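To see this trade-off numerically, the following MATLAB sketch (ours, for illustration only; the standard normal sample and the three bin widths are arbitrary choices) builds the usual histogram estimate $\hat{f} = \nu_k / (nh)$ for a small, a moderate, and a large bin width. The smallest h yields a spiky, high-variance estimate, while the largest yields a smooth but biased one.

   % Illustrative sketch: histogram density estimates for three bin widths.
   n = 1000;
   x = randn(1,n);                   % sample from a standard normal
   for h = [0.1 0.4 1.0]             % small, moderate, and large bin widths
      edges = min(x):h:(max(x)+h);   % bin edges of width h
      nu = histc(x,edges);           % bin counts
      fhat = nu(1:end-1)/(n*h);      % density estimate in each bin
      figure
      bar(edges(1:end-1)+h/2,fhat,1) % bars over the bin centers
      title(['Bin width h = ',num2str(h)])
   end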
                              A rule for bin width selection that is often presented in introductory statis-
                             tics texts is called Sturges’ Rule. In reality, it is a rule that provides the number
                             of bins in the histogram, and is given by the following formula.

                             STURGES’ RULE (HISTOGRAM)

$$k = 1 + \log_2 n .$$


                             Here k is the number of bins. The bin width h is obtained by taking the range
                             of the sample data and dividing it into the requisite number of bins, k.
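In MATLAB, for example, the rule might be applied as follows (a minimal sketch; x is assumed to hold the sample data, and k is rounded to the nearest integer):

   n = length(x);             % sample size
   k = round(1 + log2(n));    % Sturges' Rule: number of bins
   h = (max(x) - min(x))/k;   % bin width: range divided by k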
Some improved values for the bin width h can be obtained by assuming the existence of two derivatives of the probability density function $f(x)$. We
                             include the following results (without proof), because they are the basis for
                             many of the univariate bin width rules presented in this chapter. The inter-
                             ested reader is referred to Scott [1992] for more details. Most of what we
                             present here follows his treatment of the subject.
                              Equation 8.7 provides a measure of the squared error at a point x. If we
                             want to measure the error in our estimate for the entire function, then we can
integrate over all values of x. Let's assume $f(x)$ has an absolutely continuous and square-integrable first derivative. If we let n get very large ($n \to \infty$), then the asymptotic MISE is





