the available actions, and a single hyper-parameter related to the state space configuration. Then everything is learned automatically. The framework does not depend on the biophysical model to be personalized. We evaluated it on two different models: the inverse problem of cardiac electrophysiology and the personalization of a lumped-parameter model of whole-body circulation.

5.3.1 Parameter estimation as a Markov decision process
To apply RL to a problem, we first need to map it into a Markov Decision Process (MDP) [266]. In brief, an MDP is defined as a tuple M = (S, A, T, R, γ), where S is the finite set of states describing the environment, A is the finite set of actions for interacting with the environment, T is the stochastic transition function, where T(s_t, a_t, s_{t+1}) describes the probability of arriving in state s_{t+1} after the agent performed action a_t in state s_t, R is the reward function, where r_{t+1} = R(s_t, a_t, s_{t+1}) is the immediate reward the agent receives after performing action a_t in state s_t resulting in state s_{t+1}, and γ ∈ [0; 1] is the discount factor. The goal of RL is to find the optimal policy π*: S → A, i.e., the mapping from states to actions that maximizes the expected value of the cumulative discounted reward. The optimal policy for a fully defined MDP can be found by applying the value iteration method [266], among other techniques.
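For illustration, the sketch below shows value iteration on a small, fully specified tabular MDP, returning the state-action values Q and the greedy policy derived from them. The array layout, toy dimensions, and convergence tolerance are assumptions made for this example only; they are not part of the personalization framework described here.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Value iteration for a finite MDP given as dense tables.

    T[s, a, s2] : probability of reaching state s2 from state s via action a
    R[s, a, s2] : immediate reward for that transition
    Returns the state-action value table Q (|S| x |A|) and the greedy
    deterministic policy argmax_a Q(s, a).
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup:
        # Q(s, a) = sum_{s2} T(s, a, s2) * (R(s, a, s2) + gamma * V(s2))
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q, Q.argmax(axis=1)
        V = V_new
```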
However, since not all MDP components are known precisely (T is only an approximation estimated from training data, as we will see later), value iteration does not guarantee optimality. To mitigate potential issues due to this, we use a stochastic policy π̃* [384] instead of the standard deterministic policy. For a given state, while a deterministic policy always returns the action with the highest state-action value (the function computed by value iteration), the stochastic policy keeps multiple candidate actions with similarly high state-action values (the threshold is defined by the user) and returns one of them through a random process each time it is queried.
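One possible realization of such a stochastic policy is sketched below: all actions whose value lies within a user-defined threshold of the best value for the current state are kept as candidates, and one of them is drawn at random at query time. The additive threshold and the uniform sampling are illustrative assumptions; the text only specifies that the threshold is user-defined and that the returned action is chosen randomly among the near-optimal candidates.

```python
import numpy as np

def stochastic_policy(Q, state, threshold=0.05, rng=None):
    """Sample an action among the near-optimal candidates for `state`.

    Every action whose state-action value is within `threshold` of the
    best value for this state is a candidate; one candidate is drawn
    uniformly at random each time the policy is queried.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = Q[state]
    candidates = np.flatnonzero(q >= q.max() - threshold)
    return int(rng.choice(candidates))
```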

5.3.1.1 Reformulation of model personalization into an MDP
The model personalization problem is mapped to an MDP as follows:
• States encode the misfit between the computed model output and the patient's measurements. While the misfit is generally continuous, the number of MDP states has to be finite; therefore the space of objective vectors, R^{n_c}, is reduced to a finite set of representative MDP states S, each s ∈ S covering a small region of that space. ŝ ∈ S denotes the success state, which covers the region