Page 57 - Applied Probability
3. Newton’s Method and Scoring
There are two potential problems with Newton's method. First, it can be
expensive computationally to evaluate the observed information. Second,
far from $\hat\theta$, Newton's method is equally happy to head uphill or downhill.
In other words, Newton's method is not an ascent algorithm in the sense
that $L(\theta_{n+1}) > L(\theta_n)$. To generate an ascent algorithm, we can replace the
observed information $-d^2 L(\theta_n)$ by a positive definite approximating matrix
$A_n$. With this substitution, the proposed increment $\Delta\theta_n = A_n^{-1}\, dL(\theta_n)^t$, if
sufficiently contracted, forces an increase in $L(\theta)$. For a nonstationary point,
this assertion follows from the first-order Taylor expansion
$$L(\theta_n + \alpha\Delta\theta_n) - L(\theta_n) = dL(\theta_n)\,\alpha\Delta\theta_n + o(\alpha)
= \alpha\, dL(\theta_n)\, A_n^{-1}\, dL(\theta_n)^t + o(\alpha),$$
where the error ratio $o(\alpha)/\alpha \to 0$ as the positive contraction constant $\alpha \to 0$.
Thus, a positive definite modification of the observed information combined
with some form of backtracking leads to an ascent algorithm. The simplest
form of backtracking is step-halving. If the initial increment $\Delta\theta_n$ does not
produce an increase in $L(\theta)$, then try $\frac{1}{2}\Delta\theta_n$. If $\frac{1}{2}\Delta\theta_n$ fails, then try $\frac{1}{4}\Delta\theta_n$,
and so forth.
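The modified Newton step with step-halving can be sketched as follows. This is a minimal illustration, not the text's own code: the function names are invented, and the caller is assumed to supply the log-likelihood $L$, its gradient $dL$, and the inverse $A_n^{-1}$ of a positive definite approximation to the observed information.

```python
import numpy as np

def ascent_step(L, dL, A_inv, theta, max_halvings=20):
    """One modified-Newton ascent step with step-halving backtracking.

    A_inv is the inverse of a positive definite approximation A_n to the
    observed information -d^2 L(theta); the proposed increment is
    A_n^{-1} dL(theta)^t, contracted by halves until L increases.
    """
    increment = A_inv @ dL(theta)
    L0 = L(theta)
    alpha = 1.0
    for _ in range(max_halvings):
        candidate = theta + alpha * increment
        if L(candidate) > L0:  # ascent achieved
            return candidate
        alpha /= 2.0  # step-halving: try 1/2, then 1/4, of the increment
    return theta  # no improving contraction found (e.g., stationary point)

# Illustrative use: L(theta) = -||theta - m||^2 has observed information 2I,
# so A_inv = I/2 and a single full step lands exactly on the maximum m.
m = np.array([1.0, -2.0])
theta = ascent_step(lambda t: -np.sum((t - m) ** 2),
                    lambda t: -2.0 * (t - m),
                    np.eye(2) / 2.0,
                    np.array([5.0, 5.0]))
```

Because backtracking only ever shrinks the proposed increment, the positive definiteness of $A_n$ is what guarantees the search direction is uphill; the halving merely finds a contraction small enough for the Taylor expansion above to dominate.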
3.3 Scoring
A variety of ways of approximating the observed information exists. The
method of steepest ascent replaces the observed information by the identity
matrix $I$. The usually more efficient scoring algorithm replaces the observed
information by the expected information $J(\theta) = \mathrm{E}[-d^2 L(\theta)]$. The
alternative representation $J(\theta) = \mathrm{Var}[dL(\theta)]$ of $J(\theta)$ as a covariance matrix
shows that it is nonnegative definite [8, 18]. An extra dividend of scoring is
that the inverse matrix $J(\hat\theta)^{-1}$ immediately supplies the asymptotic variances
and covariances of the maximum likelihood estimate $\hat\theta$ [8, 18]. Scoring
and Newton's method share this advantage since the observed information
is asymptotically equivalent to the expected information under reasonably
natural assumptions. The available evidence indicates that the observed
information matrix is slightly superior to the expected information matrix
in estimating parameter asymptotic standard errors [7].
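As a concrete illustration of a scoring update (the model here is chosen for this sketch and is not taken from the text), consider estimating the mean $\lambda$ of $n$ i.i.d. Poisson observations. The score is $dL(\lambda) = \sum_i x_i/\lambda - n$ and the expected information is $J(\lambda) = n/\lambda$, so a single scoring step from any $\lambda > 0$ lands exactly on the sample mean, which is the maximum likelihood estimate.

```python
def scoring_step(lam, data):
    """One scoring update for the Poisson mean: lam + J(lam)^{-1} dL(lam)."""
    n = len(data)
    score = sum(data) / lam - n   # dL(lam) = sum(x_i)/lam - n
    expected_info = n / lam       # J(lam) = E[-d^2 L(lam)] = n / lam
    return lam + score / expected_info

data = [2, 4, 1, 3, 5]
lam = scoring_step(0.7, data)  # lands on the sample mean, here 3.0
```

Algebraically, $\lambda + (\lambda/n)\bigl(\sum_i x_i/\lambda - n\bigr) = \bar{x}$ regardless of the starting value, so this particular model makes the one-step behavior easy to verify; in general, scoring must be iterated to convergence.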
It is possible to compute $J(\theta)$ explicitly for exponential families of
densities [10]. Such densities take the form
$$f(x \mid \theta) = g(x)\, e^{\beta(\theta) + h(x)^t \gamma(\theta)} \tag{3.2}$$
relative to some measure $\nu$, which in practice is usually either Lebesgue
measure or counting measure. Most of the distributional families commonly
encountered in statistics are exponential families. The score and expected
information can be expressed in terms of the mean vector $\mu(\theta) = \mathrm{E}[h(X)]$