Page 402 - Mechatronics for Safety, Security and Dependability in a New Era
P. 402
Ch78-I044963.fm Page 386 Monday, August 7, 2006 11:30 AM
Ch78-I044963.fm
386
386 Page 386 Monday, August 7,2006 11:30 AM
applications (Morimoto and Doya.). This paper used fuzzy ART for sate-space segmentation. It is an
incremental category-space construction method (Carpenter et al.). Whenever the fuzzy ART
encounters a new situation, it generates a new category node in the category space. Furthermore, we
proposed generating methods of a new node that inherit the state-value and the policy from a similar
node to increase the learning speed more.
REINFORCEMENT LEARNING WITH FUZZY-ART STATE-SPACE CONSTRUCTION
Figure 1 shows the architecture of proposed system. This system based on actor-critic methods. Using
the sensed state from environment, the actor selects an action. The critic evaluates the new state to
determine whether the state has gone better or worse than expected. The TD-error represents this
evaluation. The TD-error is used for the improvement of the policy and the update of the state-value. If
TD-error is positive, it suggests the tendency to select the action should be strengthened for the future.
If TD-error is negative, it suggests the tendency should be weakened. We applied the fuzzy ART to
generalization of sensed states. In many tasks, most states will never have been experienced exactly
before. Fuzzy ART is a kind of self-organized clustering method and it classifies unknown state into
the group of approximate states. One state code used for each group.
FUZZY-ART
The fuzzy ART consists of two fully connected layers. The category layer contains category nodes to
categorize a given vector. A weight vector Wj shows the representative pattern of category node . The
j
sensed data are normalized with complement cording. The sensed data vector and its complement
vector are input to the input layer. The choice function Ti is defined as equation (1). Where c is a
choice parameter, the operator is defined by (p A q) =min(p, q) and the norm | x | is the sum of its
components. The category node i that has the maximal Ti is called a winner node. The winner node is
judged to cause resonance or mismatch reset according to the equation (2). If the match function Mi is
bigger than the vigilance parameter p, resonance is carried out and the weight vector Wi is updated
according to the equation (3). Where (3 is a learning rate parameter. When mismatch reset procedure
should be done, other category node that has the next maximal T is chosen again. When any category
node is not selected, a new node is generated. In normally, default values are given to the policy and
the state-value. In proposed methods, inherited value from a similar node was used. The efficiency was
estimated in the simulations of a two-link manipulator and a multi-link mobile robot.
Fuzzy ART
i l l A'1'2 ATi A'
Category Layer^ ^ ^ ^
"Critic H CTD-error j
Value function V | ' I
X(t) SA\ Aclion @ )
Input Lai
jsensor output (jW) Reward 1
Environment
Figure 1: Architecture of actor-critic learning method with fuzzy ART
„, \x A W,\
c+ | w, (1)
(2)
(3)