of p(y|θ,σ) and the prior p(θ,σ):

    p(θ,σ|y) ∝ p(y|θ,σ) p(θ,σ)          (8.63)
Note that p(y|θ,σ) is the probability of observing a particular response y, given specified values of θ and σ. But, at the time of analysis, it is y that is known and θ and σ that are unknown. It is common practice to switch the order of the arguments and so define the likelihood function for θ and σ, given specified y,

    l(θ,σ|y) ≡ p(y|θ,σ) = (2π)^(−N/2) σ^(−N) exp[−S(θ)/(2σ²)]          (8.64)

While we switch the order of the arguments, the meaning of this quantity remains unchanged – it is the probability that the measured response is y, given specified values of θ and σ.
The posterior density is then written as

    p(θ,σ|y) ∝ l(θ,σ|y) p(θ,σ)          (8.65)

In the Bayesian approach, we take as estimates θ_M, σ_M the values that maximize the posterior density p(θ,σ|y). The frequentist rule is to take the value θ_MLE that maximizes the likelihood function l(θ,σ|y). If the prior p(θ,σ) is not uniform in θ, the Bayesian most probable estimate θ_M and the maximum likelihood estimate θ_MLE disagree.
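To make the distinction concrete, here is a minimal numerical sketch (the one-parameter model, the synthetic data, and the N(0, 1) prior are illustrative assumptions, not taken from the text). With σ treated as known and a Gaussian, hence non-uniform, prior on θ, maximizing the log-posterior and the log-likelihood yields different estimates:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Hypothetical model: y_i = theta + noise, noise ~ N(0, sigma^2), sigma known
    rng = np.random.default_rng(1)
    sigma = 1.0
    y = 2.0 + sigma * rng.standard_normal(5)    # small synthetic data set
    N = y.size

    def S(theta):
        # sum of squared residuals S(theta), as in (8.64)
        return np.sum((y - theta)**2)

    def neg_log_likelihood(theta):
        # -log l(theta, sigma | y), dropping theta-independent constants
        return S(theta) / (2.0 * sigma**2)

    def neg_log_posterior(theta):
        # -log p(theta, sigma | y) from (8.65), with the assumed N(0, 1) prior on theta
        return neg_log_likelihood(theta) + 0.5 * theta**2

    theta_MLE = minimize_scalar(neg_log_likelihood).x    # frequentist estimate
    theta_M = minimize_scalar(neg_log_posterior).x       # Bayesian MAP estimate
    print(theta_MLE, theta_M)    # theta_M is shrunk toward the prior mean, 0

For this sketch the two estimates are also available analytically: θ_MLE is the sample mean, while θ_M is the sample mean multiplied by N/(N + σ²) under the unit-variance prior, so the disagreement shrinks as the data set grows.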


Some general considerations about the selection of a prior

Clearly, the choice of the prior is crucial, as it influences our analysis. It is this subjective nature of the Bayesian approach that is the cause of controversy, since in statistics we would like to think that different people who look at the data will come to the same conclusions. We hope that the data will be sufficiently informative that the likelihood function is sharply peaked around specific values of θ and σ; i.e., that the inference problem is data-dominated. In this case, the estimates are rather insensitive to the choice of prior, as long as it is nonzero near the peak of the likelihood function. When this is not the case, the prior influences the results of the analysis.
In some problems, the explicit dependence upon a prior is quite useful. If we know a priori that certain regions of the parameter space are inadmissible (e.g. certain parameters must be nonnegative), then the prior can be set to zero in those regions to exclude them from consideration.
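For instance, a nonnegativity constraint on a parameter can be imposed directly through the log-prior. The sketch below (function names and structure are hypothetical, not from the text) returns −∞ outside the admissible region, so any optimizer or sampler working with the log-posterior automatically excludes those points:

    import numpy as np

    def log_prior(theta):
        # flat prior restricted to theta >= 0; zero prior probability elsewhere
        return 0.0 if theta >= 0.0 else -np.inf

    def log_posterior(theta, log_likelihood):
        # log p(theta|y) = log l(theta|y) + log p(theta) + const., cf. (8.65)
        lp = log_prior(theta)
        if not np.isfinite(lp):
            return -np.inf    # inadmissible region: skip the likelihood entirely
        return lp + log_likelihood(theta)

Checking the prior first has the side benefit that the model is never evaluated at unphysical parameter values, where it may be undefined.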
Using an explicit prior also allows us to blend learning from different sets of data in a seamless manner. Let us say that we measure a response vector y₁ in some set of experiments, and compute the posterior density

    p(θ|y₁) ∝ l₁(θ|y₁) p(θ)          (8.66)
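For a scalar parameter, (8.66) can be evaluated directly on a grid. The sketch below uses a hypothetical model with made-up data and an assumed known σ; its last lines anticipate the sequential reuse of the first posterior described in the next paragraph:

    import numpy as np

    theta = np.linspace(-5.0, 5.0, 1001)    # grid over the scalar parameter
    sigma = 1.0                             # assumed known

    def likelihood(y):
        # Gaussian likelihood for y_i = theta + noise, cf. (8.64), on the grid
        S = ((y[:, None] - theta[None, :])**2).sum(axis=0)
        return np.exp(-S / (2.0 * sigma**2))

    prior = np.ones_like(theta)             # flat initial prior p(theta)

    y1 = np.array([1.8, 2.3, 2.1])          # first set of experiments (made up)
    post1 = likelihood(y1) * prior          # unnormalized posterior, eq. (8.66)
    post1 /= post1.sum()                    # normalize on the grid

    y2 = np.array([2.0, 1.9])               # later experiments on the same system
    post2 = likelihood(y2) * post1          # the first posterior acts as the prior
    post2 /= post2.sum()

Because independent likelihoods simply multiply, post2 agrees (up to normalization) with the posterior obtained by analyzing y₁ and y₂ jointly.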
We later perform a new set of experiments on the same system and measure a response vector y₂. When we perform the analysis on these new data, an apparent choice of prior for the second set of experiments is the posterior density from the first. The posterior density