                  Model of the Environment
Under the reinforcement learning framework (Figure 1a), an Agent performs an action a based on the current sensory input and the policy formed so far. The Environment (or the Environment Model) responds with a new sensory input s and an external reward r. The Agent adjusts its policy based on the reward and completes the cycle by performing a new action.
The two parts of the environment model are implemented as follows. The model of the next input is implemented as an additional output layer, trained to predict the next input based on information from the current network state. The model of the external reward is, at this stage, implemented outside of the network as a simple lookup table keeping the last reward received for each input-output pair.
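For illustration, the minimal Python sketch below mirrors the structure of this two-part environment model: a dictionary stands in for the next-input predictor (which in the actual system is a network output layer), a second dictionary plays the role of the reward lookup table, and the class and method names are purely illustrative.

class EnvironmentModel:
    """Two-part model of the environment: next-input prediction and a
    lookup table of the last external reward per (input, action) pair."""

    def __init__(self):
        self.next_input = {}   # (input, action) -> predicted next input
        self.reward = {}       # (input, action) -> last external reward received

    def observe(self, s, a, s_next, r):
        # Update both parts of the model from one step of real experience.
        self.next_input[(s, a)] = s_next
        self.reward[(s, a)] = r

    def simulate(self, s, a):
        # Generate one step of model-based experience (Figure 1a).
        return self.next_input.get((s, a)), self.reward.get((s, a), 0.0)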

Figure 1. a) Reinforcement learning with additional model-generated experience, b) Random walk task settings, c) Neural network structure.
                  SIMULATION

A five-state random walk task was used to test the approach. In this task there are five squares in a row, and an agent that moves one square left or right. The start position is the middle square, and a move off the leftmost or the rightmost square sends the agent back to the start position. Two goals were used: moving right from the rightmost square and moving left from the leftmost square. Figure 1b shows the settings and a finite state automaton describing the states and the transitions (inputs i and outputs o in the network). The reward value shown corresponds to the goal set on the right side.
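The task dynamics can be summarized in the short sketch below, assuming the squares are numbered 0-4 with the start at square 2; the +1 reward for the right-side goal follows Figure 1b, while the zero reward for the left-side goal and the class name are assumptions made for illustration.

class RandomWalkTask:
    # Five squares in a row; moving off either end returns the agent to the start.
    LEFT, RIGHT = -1, +1
    N_STATES = 5
    START = 2

    def __init__(self):
        self.state = self.START

    def reset(self):
        self.state = self.START
        return self.state

    def step(self, action):
        nxt = self.state + action
        reward = 0.0
        if nxt < 0:                    # left-side goal: back to start (assumed unrewarded here)
            nxt = self.START
        elif nxt >= self.N_STATES:     # right-side goal: Rew = +1, back to start
            reward = 1.0
            nxt = self.START
        self.state = nxt
        return nxt, reward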
For this simulation, we used the PDP++ neural network simulator (PDP++ software package, ver. 3.2a, http://www.cnbc.cmu.edu/PDP++/PDP++.html). The network input (see Figure 1c) is the current position of the agent. The network outputs are the current action in the Output layer and the prediction of the next input in the NextInput layer. The Hidden layer has one neuron for each state-action combination: the top row encodes move-right and the bottom row encodes move-left. A restriction is imposed through the k-Winners-Take-All function to allow only one active neuron. The weights between the Input, Hidden, Output, and NextInput layers are hand-coded (in a separate experiment we have confirmed that these weights can be learned too) so that from each state the two possible actions are equally probable. The PFC layer has 8 stripes, each one with the same size as the Hidden layer. The Hidden layer has one-to-one connections with each stripe in the PFC layer.
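The numpy sketch below is only a rough analogue of this hand-wired structure, not the actual PDP++/Leabra code: each hidden unit stands for one state-action pair, the current position excites its two units with equal weight, and a k-Winners-Take-All step with k = 1 and random tie-breaking leaves a single active unit, so the two actions are equally probable from every state.

import numpy as np

N_STATES, N_ACTIONS = 5, 2          # positions 0-4; actions move-right / move-left
N_HIDDEN = N_STATES * N_ACTIONS     # one hidden unit per state-action combination

# Input -> Hidden: state s excites its two state-action units with equal weight.
W_in = np.zeros((N_HIDDEN, N_STATES))
for s in range(N_STATES):
    W_in[2 * s, s] = 1.0            # top-row unit: move right
    W_in[2 * s + 1, s] = 1.0        # bottom-row unit: move left

def k_winners_take_all(net_input, k=1):
    # Keep only the k most active units; tiny noise breaks ties at random.
    noise = 1e-6 * np.random.rand(net_input.size)
    winners = np.argsort(net_input + noise)[-k:]
    act = np.zeros_like(net_input)
    act[winners] = 1.0
    return act

state = 2                                  # current position of the agent
inp = np.eye(N_STATES)[state]              # one-hot sensory input
hidden = k_winners_take_all(W_in @ inp)    # single active state-action unit
unit = int(np.argmax(hidden))
action = "right" if unit % 2 == 0 else "left"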

The training process, inspired by the Dyna algorithm (Sutton & Barto, 1998), is an interleaved execution of two loops: one for the real experience, receiving the next input and the external reward from the environment, and the other for the model-generated experience, obtaining the next input from the NextInput layer and the external reward from the lookup table.
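A compact tabular sketch of this interleaving, in the spirit of Dyna-Q, is given below; the Q-learning update is only a stand-in for the network's policy adjustment, and the learning rate, exploration rate, and step counts are arbitrary illustrative values rather than settings from this work. The env argument is assumed to expose the reset()/step() interface of the RandomWalkTask sketch above.

import random
from collections import defaultdict

def dyna_train(env, episodes=50, steps_per_episode=100, planning_steps=5,
               alpha=0.1, gamma=0.9, eps=0.1):
    q = defaultdict(float)   # (state, action) -> value estimate
    model = {}               # (state, action) -> (next state, reward), as in the lookup table
    actions = (-1, +1)       # move left / move right

    def choose(s):
        # Epsilon-greedy action selection from the current value estimates.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            # Loop 1: real experience from the environment.
            a = choose(s)
            s2, r = env.step(a)
            target = r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            model[(s, a)] = (s2, r)
            s = s2
            # Loop 2: model-generated experience replayed from the model.
            for _ in range(planning_steps):
                (ps, pa), (ps2, pr) = random.choice(list(model.items()))
                target = pr + gamma * max(q[(ps2, b)] for b in actions)
                q[(ps, pa)] += alpha * (target - q[(ps, pa)])
    return q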