the available actions, and a single hyper-parameter related to the state space configuration. Then everything is learned automatically. The framework does not depend on the biophysical model to be personalized. We evaluated it on two different models: the inverse problem of cardiac electrophysiology and the personalization of a lumped-parameter model of whole-body circulation.

5.3.1 Parameter estimation as a Markov decision process
To apply RL to a problem, we first need to map it into a Markov Decision Process (MDP) [266]. In brief, an MDP is defined as a tuple M = (S, A, T, R, γ), where S is the finite set of states describing the environment, A is the finite set of actions for interacting with the environment, T is the stochastic transition function, where T(s_t, a_t, s_{t+1}) describes the probability of arriving in state s_{t+1} after the agent performed action a_t in state s_t, R is the reward function, where r_{t+1} = R(s_t, a_t, s_{t+1}) is the immediate reward the agent receives after performing action a_t in state s_t resulting in state s_{t+1}, and γ ∈ [0; 1] is the discount factor. The goal of RL is to find the optimal policy π*: S → A, i.e., the mapping from states to actions that maximizes the expected value of the cumulative discounted reward. The optimal policy for a fully defined MDP can be found by applying the value iteration method [266], among other techniques.
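For illustration, the sketch below shows value iteration on a small, fully specified tabular MDP, returning the state-action values Q and the greedy policy derived from them. The array layout, toy dimensions, and convergence tolerance are assumptions made for this example only; they are not part of the personalization framework described here.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Value iteration for a finite MDP given as dense tables.

    T[s, a, s2] : probability of reaching state s2 from state s via action a
    R[s, a, s2] : immediate reward for that transition
    Returns the state-action value table Q (|S| x |A|) and the greedy
    deterministic policy argmax_a Q(s, a).
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup:
        # Q(s, a) = sum_{s2} T(s, a, s2) * (R(s, a, s2) + gamma * V(s2))
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q, Q.argmax(axis=1)
        V = V_new
```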
However, since not all MDP components are known precisely (T is only an approximation estimated from training data, as we will see later), value iteration does not guarantee optimality. To mitigate potential issues due to this, we use a stochastic policy π̃* [384] instead of the standard deterministic policy. For a given state, while a deterministic policy always returns the action with the highest state-action value (the function computed by value iteration), the stochastic policy keeps multiple candidate actions with similarly high state-action values (the threshold is defined by the user) and returns one of them through a random process each time it is queried.
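One possible realization of such a stochastic policy is sketched below: all actions whose value lies within a user-defined threshold of the best value for the current state are kept as candidates, and one of them is drawn at random at query time. The additive threshold and the uniform sampling are illustrative assumptions; the text only specifies that the threshold is user-defined and that the returned action is chosen randomly among the near-optimal candidates.

```python
import numpy as np

def stochastic_policy(Q, state, threshold=0.05, rng=None):
    """Sample an action among the near-optimal candidates for `state`.

    Every action whose state-action value is within `threshold` of the
    best value for this state is a candidate; one candidate is drawn
    uniformly at random each time the policy is queried.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = Q[state]
    candidates = np.flatnonzero(q >= q.max() - threshold)
    return int(rng.choice(candidates))
```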

5.3.1.1 Reformulation of model personalization into an MDP
The model personalization problem is mapped to an MDP as follows:
• States encode the misfit between the computed model output and the patient's measurements. While the misfit is generally continuous, the number of MDP states has to be finite; therefore the space of objective vectors, R^{n_c}, is reduced to a finite set of representative MDP states S, each s ∈ S covering a small region of that space. ŝ ∈ S denotes the success state, which covers the region