

                                The delay MDP reasoning is based on a world state representation, the most
                              salient features of which are the user’s location and the time. Figure 12.2 shows
                              a portion of the state space, showing only the location and time features, as well
                               as some of the state transitions (a transition labeled “delay n” corresponds to
                               the action “delay by n minutes”). Each state also has a feature representing the
                              number of previous times the meeting has been delayed and a feature capturing
                              what the agent has told the other Fridays about the user’s attendance. There are
                              a total of 768 possible states for each individual meeting.
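
As a rough illustration of that representation, the sketch below enumerates such a state space in Python. The feature names follow the text; the value domains and the 4 x 12 x 4 x 4 factoring are assumptions chosen only so that the total matches the 768 states reported above.

from itertools import product
from typing import NamedTuple

class State(NamedTuple):
    """One delay-MDP state (feature names follow the text; domains are assumed)."""
    location: str        # user's current location
    time: int            # minutes relative to the originally scheduled start
    delays_so_far: int   # how many times this meeting has already been delayed
    announced: str       # what Friday has told the other Fridays

LOCATIONS = ["office", "en_route", "meeting_room", "elsewhere"]   # assumed domain
TIMES = range(-20, 40, 5)                                         # 12 time points (assumed)
DELAY_COUNTS = range(4)                                           # assumed domain
ANNOUNCEMENTS = ["nothing", "on_time", "delayed", "absent"]       # assumed domain

STATES = [State(*features)
          for features in product(LOCATIONS, TIMES, DELAY_COUNTS, ANNOUNCEMENTS)]
assert len(STATES) == 768   # matches the total reported in the text
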
                                The delay MDP’s reward function has a maximum in the state where the user
                              is at the meeting location when the meeting starts, giving the agent incentive to
                              delay meetings when its user’s late arrival is possible. However, the agent could
                              choose arbitrarily large delays, virtually ensuring the user is at the meeting when
                              it starts, but forcing other attendees to rearrange their schedules. This team cost
                              is considered by incorporating a negative reward, with magnitude proportional
                              to the number of delays so far and the number of attendees, into the delay reward
                              function. However, explicitly delaying a meeting may benefit the team, since
                              without a delay, the other attendees may waste time waiting for the agent’s user
                              to arrive. Therefore, the delay MDP’s reward function includes a component
                              that is negative in states after the start of the meeting if the user is absent, but
                              positive otherwise. The reward function includes other components as well and
                              is described in more detail elsewhere [10].
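
The sketch below (reusing the State tuple from the previous sketch) shows one way such a reward function could be structured. The weights are assumptions; only the shape of the components, a peak for on-time attendance, a team cost growing with the number of delays and attendees, and a post-start term that is negative when the user is absent, follows the text.

def delay_reward(state: State, attendees: int) -> float:
    """Illustrative delay-MDP reward; weights are assumed, structure follows the text."""
    r = 0.0
    # Maximum reward: the user is at the meeting location when the meeting starts.
    if state.location == "meeting_room" and state.time == 0:
        r += 10.0
    # Team cost of delaying: proportional to delays so far and number of attendees.
    r -= 0.5 * state.delays_so_far * attendees
    # After the scheduled start: negative if the user is absent (attendees are kept
    # waiting), positive otherwise.
    if state.time > 0:
        r += 1.0 if state.location == "meeting_room" else -1.0 * attendees
    return r
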
                                The delay MDP’s state transitions are associated with the probability that
                              a given user movement (e.g., from office to meeting location) will occur in a
                               given time interval. Figure 12.2 shows multiple transitions due to a “wait” action,
                              with the relative thickness of the arrows reflecting their relative probability. The
                              “ask” action, through which the agent gives up autonomy and queries the user,
                              has two possible outcomes. First, the user may not respond at all, in which
                              case, the agent is performing the equivalent of a “wait” action. Second, the user
                              may respond, with one of the 10 responses from Figure 12.1. A communication
                              model [11] provides the probability of receiving a user’s response in a given
                              time step. The cost of the “ask” action is derived from the cost of interrupting
                              the user (e.g., a dialog box on the user’s workstation is cheaper than sending
                              a page to the user’s cellular phone). We compute the expected value of user
                              input by summing over the value of each possible response, weighted by its
                              likelihood.
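
This valuation of “ask” can be sketched as follows; the function and parameter names are hypothetical, but the combination, a no-response case treated like “wait”, a likelihood-weighted sum over the possible responses, and a subtracted interruption cost, mirrors the description above.

def expected_value_of_ask(response_probs: dict, response_values: dict,
                          p_no_response: float, wait_value: float,
                          interruption_cost: float) -> float:
    """Hedged sketch of valuing the 'ask' action (names and form are assumptions).
    response_probs gives the likelihood of receiving each response in a given
    time step (from the communication model); with p_no_response they sum to one."""
    # Outcome 1: the user never responds, so asking behaves like a 'wait' action.
    ev = p_no_response * wait_value
    # Outcome 2: the user picks one of the possible responses; weight each
    # response's value by its likelihood.
    for response, prob in response_probs.items():
        ev += prob * response_values[response]
    # Interrupting the user has a cost (a dialog box is cheaper than a page).
    return ev - interruption_cost
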
                                Given the states, actions, probabilities, and rewards of the MDP, Friday uses
                               the standard value iteration algorithm to compute an optimal policy, specifying,
                               for each and every state, the action that maximizes the agent’s expected
                              utility [8]. One possible policy, generated for a subclass of possible meetings,
                              specifies “ask” and then “wait” in state S1 of Figure 12.2, i.e., the agent gives up
                              some autonomy. If the world reaches state S3, the policy again specifies “wait”,
                              so the agent continues acting without autonomy. However, if the agent then
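
The sketch below shows standard value iteration in this style; it is the generic textbook procedure [8], not the Electric Elves implementation, and the transitions(s, a) interface returning (next state, probability) pairs is an assumption.

def value_iteration(states, actions, transitions, reward, gamma=0.95, eps=1e-6):
    """Generic value iteration (a sketch of the standard algorithm, not Friday's code).
    actions(s) -> iterable of actions; transitions(s, a) -> [(next_state, prob), ...];
    reward(s, a) -> immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward(s, a)
                       + gamma * sum(p * V[s2] for s2, p in transitions(s, a))
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # The optimal policy specifies, for each state, the action that maximizes
    # the agent's expected utility.
    policy = {s: max(actions(s),
                     key=lambda a: reward(s, a)
                     + gamma * sum(p * V[s2] for s2, p in transitions(s, a)))
              for s in states}
    return V, policy
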