Page 57 - Applied Probability

3. Newton’s Method and Scoring
There are two potential problems with Newton's method. First, it can be expensive computationally to evaluate the observed information. Second, far from $\hat\theta$, Newton's method is equally happy to head uphill or downhill. In other words, Newton's method is not an ascent algorithm in the sense that $L(\theta_{n+1}) > L(\theta_n)$. To generate an ascent algorithm, we can replace the observed information $-d^2 L(\theta_n)$ by a positive definite approximating matrix $A_n$. With this substitution, the proposed increment $\Delta\theta_n = A_n^{-1}\, dL(\theta_n)^t$, if sufficiently contracted, forces an increase in $L(\theta)$. For a nonstationary point, this assertion follows from the first-order Taylor expansion
$$
\begin{aligned}
L(\theta_n + \alpha\Delta\theta_n) - L(\theta_n) &= dL(\theta_n)\,\alpha\Delta\theta_n + o(\alpha) \\
&= \alpha\, dL(\theta_n)\, A_n^{-1}\, dL(\theta_n)^t + o(\alpha),
\end{aligned}
$$
where the error ratio $o(\alpha)/\alpha \to 0$ as the positive contraction constant $\alpha \to 0$.
Thus, a positive definite modification of the observed information combined with some form of backtracking leads to an ascent algorithm. The simplest form of backtracking is step-halving. If the initial increment $\Delta\theta_n$ does not produce an increase in $L(\theta)$, then try $\frac{1}{2}\Delta\theta_n$. If $\frac{1}{2}\Delta\theta_n$ fails, then try $\frac{1}{4}\Delta\theta_n$, and so forth.
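The ascent scheme with step-halving can be sketched as follows. This is a minimal illustration, not code from the text: the callables `L`, `dL`, and `A_inv` are hypothetical names for the log-likelihood, its gradient, and a positive definite inverse approximation $A_n^{-1}$, and the one-parameter Poisson model used to exercise it is my own choice.

```python
import numpy as np

def ascent_step(L, dL, A_inv, theta, max_halvings=30):
    """One ascent step with step-halving backtracking.

    Proposes Delta = A(theta)^{-1} dL(theta)^t and contracts it by
    halves until L increases.  L, dL, A_inv are hypothetical callables:
    the log-likelihood, its gradient, and a positive definite inverse
    approximation to the observed information.
    """
    delta = A_inv(theta) @ dL(theta)          # proposed increment
    base, alpha = L(theta), 1.0
    for _ in range(max_halvings):
        if L(theta + alpha * delta) > base:   # ascent achieved
            return theta + alpha * delta
        alpha /= 2.0                          # step-halving
    return theta                              # no increase found; stay put

# Illustration: Poisson data parameterized by theta = log(mean), where the
# observed information n * exp(theta) is already positive definite.
x = np.array([2.0, 4.0, 3.0, 5.0])
L = lambda th: np.sum(x) * th[0] - len(x) * np.exp(th[0])
dL = lambda th: np.array([np.sum(x) - len(x) * np.exp(th[0])])
A_inv = lambda th: np.array([[1.0 / (len(x) * np.exp(th[0]))]])

theta = np.array([0.0])
for _ in range(50):
    theta = ascent_step(L, dL, A_inv, theta)
# theta converges to the MLE log(mean(x)) = log(3.5)
```

From the poor starting value $\theta = 0$, the full Newton increment overshoots and decreases $L$; one halving rescues the step, after which the undamped iterates converge rapidly.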
                              3.3 Scoring
A variety of ways of approximating the observed information exists. The method of steepest ascent replaces the observed information by the identity matrix $I$. The usually more efficient scoring algorithm replaces the observed information by the expected information $J(\theta) = \mathrm{E}[-d^2 L(\theta)]$. The alternative representation $J(\theta) = \mathrm{Var}[dL(\theta)]$ of $J(\theta)$ as a covariance matrix shows that it is nonnegative definite [8, 18]. An extra dividend of scoring is that the inverse matrix $J(\hat\theta)^{-1}$ immediately supplies the asymptotic variances and covariances of the maximum likelihood estimate $\hat\theta$ [8, 18]. Scoring and Newton's method share this advantage since the observed information is asymptotically equivalent to the expected information under reasonably natural assumptions. The available evidence indicates that the observed information matrix is slightly superior to the expected information matrix in estimating parameter asymptotic standard errors [7].
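As a concrete sketch of scoring (my own illustration, not an example from the text), consider the Cauchy location model: the expected information $J(\theta) = n/2$ is a constant, while the observed information fluctuates with the data and can even fail to be positive, so the scoring update is especially convenient here.

```python
import numpy as np

# Fisher scoring for the Cauchy location model (an illustrative choice,
# not an example from the text).  Up to an additive constant,
# L(theta) = -sum_i log(1 + (x_i - theta)^2), and the expected
# information is the constant J(theta) = n/2, so the scoring update
# never needs a second derivative.
def score(theta, x):
    r = x - theta
    return np.sum(2.0 * r / (1.0 + r**2))     # dL(theta)

x = np.array([-1.2, 0.3, 0.8, 1.5, 0.1, -0.4, 2.0, 0.6])
n = len(x)
J = n / 2.0                                   # expected information
theta = np.median(x)                          # sensible starting value
for _ in range(200):
    theta += score(theta, x) / J              # scoring iteration
se = 1.0 / np.sqrt(J)  # asymptotic standard error from J(theta_hat)^{-1}
```

At convergence the score vanishes, and $J(\hat\theta)^{-1}$ delivers the asymptotic variance for free, as noted above. In practice one would still guard the iteration with step-halving as described earlier.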
It is possible to compute $J(\theta)$ explicitly for exponential families of densities [10]. Such densities take the form
$$
f(x \mid \theta) = g(x)\, e^{\beta(\theta) + h(x)^t \gamma(\theta)} \tag{3.2}
$$
relative to some measure $\nu$, which in practice is usually either Lebesgue measure or counting measure. Most of the distributional families commonly encountered in statistics are exponential families. The score and expected information can be expressed in terms of the mean vector $\mu(\theta) = \mathrm{E}[h(X)]$