Model of the Environment
Under the reinforcement learning framework (Figure 1a), an Agent performs an action a based on the
current sensory input and the policy formed so far. The Environment (or the Environment Model)
responds with a new sensory input s and an external reward r. The Agent adjusts its policy based on
the reward and completes the cycle by performing a new action.
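This cycle can be sketched as follows (a minimal illustration; the Agent and Environment interfaces are assumptions made for exposition, not the implementation used here):

def interaction_loop(agent, env, n_steps):
    s = env.reset()                      # initial sensory input
    for _ in range(n_steps):
        a = agent.act(s)                 # Agent acts from its current policy
        s_next, r = env.step(a)          # Environment returns a new input s and reward r
        agent.update(s, a, r, s_next)    # Agent adjusts its policy using the reward
        s = s_next                       # the cycle continues with a new action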
The two parts of the environment model are implemented as follows. The model of the next input is
implemented as an additional output layer, trained to predict the next input based on information from
the current network state. The model of the external reward at this stage is implemented outside of
the network as a simple lookup table keeping the last reward received for each input-output pair.
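As a sketch, this reward model amounts to a table kept outside the network (the names below are illustrative assumptions):

reward_table = {}  # (input, output) pair -> last external reward received

def store_reward(state, action, reward):
    reward_table[(state, action)] = reward            # overwrite with the latest reward

def lookup_reward(state, action, default=0.0):
    return reward_table.get((state, action), default)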
Figure 1. a) Reinforcement learning with additional model-generated experience, b) Random
walk task settings, c) Neural network structure.
SIMULATION
A five-state random walk task was used to test the approach. In this task, there are five squares in a
row, and an agent that moves one square left or right. The start position is the middle square and a
move outside the board from the leftmost or rightmost square sends the agent back to the start position. Two
goals were used: moving right from the rightmost square and moving left from the leftmost square.
Figure 1b shows the settings and a finite state automaton describing the states and the transitions
(inputs i and outputs o in the network). The reward value shown corresponds to the goal set on the right side.
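Under these rules, the task dynamics for the right-side goal can be sketched as follows (constants and names are illustrative assumptions):

N_SQUARES = 5
START = 2                         # the middle square

def step(square, action):
    """action is -1 (move left) or +1 (move right); returns (next_square, reward)."""
    nxt = square + action
    if nxt >= N_SQUARES:          # moving right from the rightmost square: goal, reward +1
        return START, 1.0
    if nxt < 0:                   # moving left from the leftmost square: back to start
        return START, 0.0         # (this move is the goal instead when the goal is set to the left)
    return nxt, 0.0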
For this simulation, we used the PDP++ neural network simulator (PDP++ software package, ver. 3.2a,
http://www.cnbc.cmu.edu/PDP++/PDP++.html). The network input (see Figure 1c) is the current
position of the agent. The network outputs are the current action in the Output layer and the
prediction of the next input in the NextInput layer. The Hidden layer has one neuron for each state-
action combination. The top row encodes move-right and the bottom row encodes move-left. A
restriction is imposed through the k-Winners-Take-All function to allow only one active neuron. The
weights between the Input, Hidden, Output, and NextInput layers are hand-coded (in a separate
experiment we have confirmed that these weights can be learned too) so that from each state the two
possible actions are equally probable. The PFC has 8 stripes, each one with the same size as the
Hidden layer. The Hidden layer has one-to-one connections with each stripe in the PFC layer.
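The behaviour implied by these hand-coded weights can be sketched roughly as follows (the helper names and the numeric action encoding are assumptions for illustration, not the PDP++ network itself):

import random

ACTIONS = (-1, +1)   # bottom row: move left; top row: move right

def pick_hidden_unit(square):
    # k-Winners-Take-All with k = 1 leaves a single (square, action) unit active;
    # with equal hand-coded weights the two actions are equally probable.
    return (square, random.choice(ACTIONS))

def readout(winner):
    square, action = winner
    next_input = square + action   # NextInput prediction (ignoring the off-board reset)
    return action, next_input      # Output layer activity and NextInput layer prediction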
The training process, inspired by the Dyna algorithm (Sutton & Barto, 1998), interleaves the execution
of two loops: one for real experience, which receives the next input and the external reward from the
environment, and one for model-generated experience, which obtains the next input from the
NextInput layer and the external reward from the lookup table.
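The structure of these interleaved loops can be sketched as follows, in the spirit of Dyna; the agent and model interfaces are assumptions used only to show the two kinds of experience:

def train(agent, env, model, n_real_steps, n_model_steps):
    s = env.reset()
    for _ in range(n_real_steps):
        # Real experience: next input and reward come from the environment.
        a = agent.act(s)
        s_next, r = env.step(a)
        model.record(s, a, s_next, r)         # keep the next-input and reward models current
        agent.update(s, a, r, s_next)
        s = s_next

        # Model-generated experience: next input comes from the NextInput
        # prediction, and the reward comes from the lookup table.
        for _ in range(n_model_steps):
            sm, am = model.sample_visited_pair()
            sm_next = model.predict_next_input(sm, am)
            rm = model.lookup_reward(sm, am)
            agent.update(sm, am, rm, sm_next)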