The delay MDP reasoning is based on a world state representation, the most
salient features of which are the user’s location and the time. Figure 12.2 shows
a portion of the state space, restricted to the location and time features, along
with some of the state transitions (a transition labeled “delay n” corresponds to
the action “delay by n minutes”). Each state also has a feature representing the
number of previous times the meeting has been delayed and a feature capturing
what the agent has told the other Fridays about the user’s attendance. There are
a total of 768 possible states for each individual meeting.
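For concreteness, the sketch below shows one way such a state might be encoded. The feature names and value ranges are illustrative assumptions; the text specifies only the four features and the total of 768 states per meeting.

```python
from dataclasses import dataclass

# A minimal sketch of a delay-MDP state; the value ranges below are
# hypothetical, chosen only to illustrate the four features named in the text.
@dataclass(frozen=True)
class DelayState:
    location: str        # e.g., "office", "en_route", "meeting_room"
    time: int            # discrete time step relative to the scheduled start
    delays_so_far: int   # number of times this meeting has already been delayed
    announced: str       # what Friday has told the other Fridays about attendance

# Example: the user is still in the office five minutes before a meeting that
# has already been delayed once and was announced as "will attend".
s1 = DelayState(location="office", time=-5, delays_so_far=1, announced="will_attend")
```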
The delay MDP’s reward function has a maximum in the state where the user
is at the meeting location when the meeting starts, giving the agent incentive to
delay meetings when its user’s late arrival is possible. However, the agent could
choose arbitrarily large delays, virtually ensuring the user is at the meeting when
it starts, but forcing other attendees to rearrange their schedules. This team cost
is considered by incorporating a negative reward, with magnitude proportional
to the number of delays so far and the number of attendees, into the delay reward
function. However, explicitly delaying a meeting may benefit the team, since
without a delay, the other attendees may waste time waiting for the agent’s user
to arrive. Therefore, the delay MDP’s reward function includes a component
that is negative in states after the start of the meeting if the user is absent, but
positive otherwise. The reward function includes other components as well and
is described in more detail elsewhere [10].
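The sketch below illustrates how these reward components might combine. The weights and functional forms are assumptions made for illustration only; the actual reward function is described in [10].

```python
# Illustrative weights (hypothetical) for the three components described above.
W_ARRIVAL, W_DELAY, W_WAIT = 10.0, 1.0, 2.0

def reward(location: str, time: int, delays_so_far: int,
           action: str, num_attendees: int) -> float:
    r = 0.0
    # Maximum reward: the user is at the meeting location when the meeting starts.
    if location == "meeting_room" and time == 0:
        r += W_ARRIVAL
    # Team cost of delaying, proportional to prior delays and attendee count.
    if action.startswith("delay"):
        r -= W_DELAY * (delays_so_far + 1) * num_attendees
    # After the scheduled start: negative while the user is absent (attendees
    # are kept waiting), positive once the user is present.
    if time > 0:
        r += W_WAIT * num_attendees if location == "meeting_room" else -W_WAIT * num_attendees
    return r
```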
The delay MDP’s state transitions are associated with the probability that
a given user movement (e.g., from office to meeting location) will occur in a
given time interval. Figure 12.2 shows multiple transitions due to a “wait” action,
with the relative thickness of the arrows reflecting their relative probability. The
“ask” action, through which the agent gives up autonomy and queries the user,
has two possible outcomes. First, the user may not respond at all, in which
case, the agent is performing the equivalent of a “wait” action. Second, the user
may respond, with one of the 10 responses from Figure 12.1. A communication
model [11] provides the probability of receiving a user’s response in a given
time step. The cost of the “ask” action is derived from the cost of interrupting
the user (e.g., a dialog box on the user’s workstation is cheaper than sending
a page to the user’s cellular phone). We compute the expected value of user
input by summing over the value of each possible response, weighted by its
likelihood.
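The following sketch shows this expected-value computation under an assumed response model; the device costs, probability tables, and function names are hypothetical.

```python
# Expected value of an "ask" action, following the description above: sum the
# value of each possible user response weighted by its likelihood (from the
# communication model [11]), treat a non-response as a "wait", and subtract
# the cost of interrupting the user on the chosen device. All numbers are
# illustrative assumptions.
ASK_COST = {"workstation_dialog": 0.1, "cell_phone_page": 1.0}

def expected_ask_value(response_probs: dict[str, float],
                       response_values: dict[str, float],
                       wait_value: float,
                       device: str) -> float:
    p_any_response = sum(response_probs.values())
    value = (1.0 - p_any_response) * wait_value   # no response: equivalent to "wait"
    for response, prob in response_probs.items():
        value += prob * response_values[response]
    return value - ASK_COST[device]
```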
Given the states, actions, probabilities, and rewards of the MDP, Friday uses
the standard value iteration algorithm to compute an optimal policy, specifying,
for each and every state, the action that maximizes the agent’s expected
utility [8]. One possible policy, generated for a subclass of possible meetings,
specifies “ask” and then “wait” in state S1 of Figure 12.2, i.e., the agent gives up
some autonomy. If the world reaches state S3, the policy again specifies “wait”,
so the agent continues acting without autonomy. However, if the agent then