5.3.1.3 From computed objectives to representative MDP state
The continuous space of objective vectors now needs to be quantized into a finite set of representative MDP states $S$ using a data-driven approach. To this end, all objective vectors $c$ that were observed during exploration (as part of episodes in $E$) are grouped into $n_S - 1$ clusters based on their distance to each other. The distance metric is defined relative to the inverse of the thresholds in the convergence criteria to ensure similar influence of all objectives (e.g., to cancel out different units):
$$\| c_1 - c_2 \|_{\psi} = (c_1 - c_2)^{\top} \operatorname{diag}(\psi)^{-1} (c_1 - c_2). \tag{5.3}$$
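As a minimal sketch of this weighted distance (assuming NumPy; the function name is illustrative), Eq. (5.3) can be evaluated directly from the threshold vector $\psi$:

```python
import numpy as np

def psi_distance(c1, c2, psi):
    """psi-weighted distance of Eq. (5.3): each objective difference is
    scaled by the inverse of its convergence threshold so that all
    objectives contribute comparably, regardless of their units."""
    d = np.asarray(c1, dtype=float) - np.asarray(c2, dtype=float)
    return float(d @ (d / np.asarray(psi, dtype=float)))
```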
The centroid of each cluster becomes the centroid of a representative state, and the special “success state” mentioned earlier, denoted $\hat{s}$, is artificially created to cover the region in state space where all objectives are met: $\forall i: |c_i| < \psi_i$. This results in a total of $n_S$ states: $n_S - 1$ are data-driven, and one is the success state.
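One possible realization of this quantization step is sketched below. The text only prescribes a distance-based grouping; using k-means in a $\psi$-scaled space is an assumption of this sketch, as are the function name and signature:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_states(objectives, psi, n_states, seed=0):
    """Cluster the observed objective vectors into n_states - 1 groups.

    Scaling each coordinate by 1/sqrt(psi_i) makes the squared Euclidean
    distance in the scaled space equal to the psi-weighted distance of
    Eq. (5.3), so standard k-means respects that metric."""
    psi = np.asarray(psi, dtype=float)
    scaled = np.asarray(objectives, dtype=float) / np.sqrt(psi)
    km = KMeans(n_clusters=n_states - 1, n_init=10, random_state=seed).fit(scaled)
    return km.cluster_centers_ * np.sqrt(psi)  # centroids in original objective units
```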
To determine the MDP state of a given objective vector $c$, we introduce a mapping $\phi$. Let $\xi_s$ denote the centroid corresponding to state $s$; the mapping is then defined as:

$$\phi(c) = \operatorname*{argmin}_{s \in S} \, \| c - \xi_s \|_{\psi}. \tag{5.4}$$
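The mapping of Eq. (5.4) then amounts to a nearest-centroid lookup; in the sketch below, the explicit threshold check for the success state and the integer index convention are assumptions made for illustration:

```python
import numpy as np

def phi(c, centroids, psi):
    """Map an objective vector c to an MDP state index (Eq. (5.4)).

    Indices 0 .. n_S - 2 denote the data-driven states; index n_S - 1
    is reserved here for the success state (a convention assumed for
    this sketch)."""
    c = np.asarray(c, dtype=float)
    psi = np.asarray(psi, dtype=float)
    if np.all(np.abs(c) < psi):
        return len(centroids)                   # success state: all objectives met
    d = np.asarray(centroids, dtype=float) - c  # one row per data-driven centroid
    dists = np.sum(d * d / psi, axis=1)         # Eq. (5.3) to each centroid
    return int(np.argmin(dists))
```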
5.3.1.4 Transition function as probabilistic model representation
In this work, the stochastic MDP transition function $T$ is generated such that the transition probabilities encode the learnt knowledge about the behavior of the computational model $f$. To this end, we rely on model exploration and the resulting training episodes $E$. First, the individual samples $(x_t, y_t, c_t, a_t, x_{t+1}, y_{t+1}, c_{t+1})$ are converted to state-action-state transition tuples $\hat{E} = \{(s, a, s')\}$, where $s = \phi(c_t)$, $a = a_t$ and $s' = \phi(c_{t+1})$.
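This conversion could be implemented as in the short sketch below, which reuses the phi mapping above; the per-sample episode layout is an assumed convention:

```python
def to_transition_tuples(episodes, centroids, psi):
    """Convert raw exploration samples (x_t, y_t, c_t, a_t, x_{t+1},
    y_{t+1}, c_{t+1}) into state-action-state tuples (s, a, s')."""
    E_hat = []
    for episode in episodes:
        for (x_t, y_t, c_t, a_t, x_next, y_next, c_next) in episode:
            E_hat.append((phi(c_t, centroids, psi),     # s  = phi(c_t)
                          a_t,                           # a  = a_t
                          phi(c_next, centroids, psi)))  # s' = phi(c_{t+1})
    return E_hat
```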
The transition function is then approximated at each point based on statistical analysis of the observed transition samples:
$$T(s, a, s') = \frac{\bigl|\{(s, a, s') \in \hat{E}\}\bigr|}{\sum_{s'' \in S} \bigl|\{(s, a, s'') \in \hat{E}\}\bigr|}. \tag{5.5}$$
Some state-action combinations may not be observed, especially if $n_S$ and $n_A$ are large. In such cases, uniform probability is assigned.
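A count-based estimate of Eq. (5.5), with the uniform fallback for unobserved state-action pairs, might look as follows (integer state and action indices are an assumption carried over from the sketches above):

```python
import numpy as np

def estimate_transition_function(E_hat, n_states, n_actions):
    """Estimate T(s, a, s') by normalizing observed transition counts,
    as in Eq. (5.5); unobserved (s, a) pairs fall back to a uniform
    distribution over S."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in E_hat:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)      # visits of each (s, a) pair
    T = np.where(totals > 0,
                 counts / np.maximum(totals, 1.0),  # Eq. (5.5): normalized counts
                 1.0 / n_states)                    # uniform fallback
    return T
```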
Now that the MDP model $M$ is fully defined, we apply value iteration (section 5.3.1) and compute the stochastic policy $\tilde{\pi}^{*}$, which completes the off-line phase.
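For reference, a generic value-iteration sweep over such a finite MDP could be sketched as below; the reward array, discount factor, and the greedy read-out of a policy are assumptions of this sketch, not the exact formulation of section 5.3.1 (which yields the stochastic policy $\tilde{\pi}^{*}$):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6, max_iter=1000):
    """Generic value iteration on a finite MDP.

    T has shape (n_states, n_actions, n_states) and R has shape
    (n_states, n_actions).  Returns the converged value function and a
    greedy policy; a stochastic policy could, e.g., be obtained by
    softening these greedy choices."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        Q = R + gamma * np.einsum('sap,p->sa', T, V)  # expected return per (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = R + gamma * np.einsum('sap,p->sa', T, V)      # action values under converged V
    return V, Q.argmax(axis=1)
```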