the available actions, and a single hyper-parameter related to the state space configuration. Then everything is learned automatically. The framework does not depend on the biophysical model to be personalized. We evaluated it on two different models: the inverse problem of cardiac electrophysiology and the personalization of a lumped parameter model of whole-body circulation.
5.3.1 Parameter estimation as a Markov decision process
To apply RL to a problem, we first need to map it into a Markov Decision Process (MDP) [266]. In brief, an MDP is defined as a tuple $M = (S, A, T, R, \gamma)$, where $S$ is the finite set of states describing the environment, $A$ is the finite set of actions for interacting with the environment, $T$ is the stochastic transition function, where $T(s_t, a_t, s_{t+1})$ describes the probability of arriving in state $s_{t+1}$ after the agent performed action $a_t$ in state $s_t$, $R$ is the reward function, where $r_{t+1} = R(s_t, a_t, s_{t+1})$ is the immediate reward the agent receives after performing action $a_t$ in state $s_t$ resulting in state $s_{t+1}$, and $\gamma \in [0; 1]$ is the discount factor. The goal of RL is to find the optimal policy $\pi^* : S \to A$, i.e., the mapping from states to actions that maximizes the expected value of the cumulative discounted reward. The optimal policy for a fully defined MDP can be found by applying the value iteration method [266], among other techniques.
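To make the value iteration step concrete, the following sketch (in Python, using NumPy) computes an optimal deterministic policy for a small, fully specified toy MDP; the transition and reward arrays are illustrative placeholders, not the personalization MDP discussed in this chapter.

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        # T[s, a, s2]: transition probabilities, R[s, a, s2]: immediate rewards.
        n_states, n_actions, _ = T.shape
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = sum over s2 of T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
            Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q, Q.argmax(axis=1)  # values, state-action values, greedy policy

    # Toy 2-state, 2-action MDP with made-up numbers; reward for landing in state 1.
    T = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.9, 0.1], [0.2, 0.8]]])
    R = np.zeros((2, 2, 2))
    R[:, :, 1] = 1.0
    V, Q, policy = value_iteration(T, R)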
However, since not all MDP components are known precisely ($T$ is only an approximation from training data, as we will see later), value iteration does not guarantee optimality. To mitigate potential issues due to this, we use a stochastic policy $\tilde{\pi}^*$ [384] instead of the standard deterministic policy. For a given state, while a deterministic policy always returns the action with the highest state-action value (the function computed by value iteration), the stochastic policy keeps all candidate actions whose state-action value lies within a user-defined threshold of the highest one, and returns one of them through a random process each time it is queried.
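The thresholded selection can be sketched as follows (Python; the array Q of state-action values is assumed to come from value iteration, and the argument eps stands in for the user-defined threshold):

    import numpy as np

    def stochastic_policy(Q, state, eps=0.05, rng=None):
        # Keep every action whose value is within eps of the best one for this
        # state, then return one of them uniformly at random.
        rng = np.random.default_rng() if rng is None else rng
        q = Q[state]
        candidates = np.flatnonzero(q >= q.max() - eps)
        return int(rng.choice(candidates))

Compared with the deterministic argmax, repeated queries of the same state can thus yield different, but comparably valued, actions.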
5.3.1.1 Reformulation of model personalization into an MDP
The model personalization problem is mapped to an MDP as
follows:
• States encode the misfit between the computed model output and the patient's measurements. While the misfit is generally continuous, the number of MDP states has to be finite; therefore the space of objective vectors, $\mathbb{R}^{n_c}$, is reduced to a finite set of representative MDP states $S$, each $s \in S$ covering a small region of that space. $\hat{s} \in S$ denotes the success state, which covers the region