Human-Machine Dialogue as a Stochastic Game

In this paper, an original framework to model human-machine spoken dialogues is proposed to deal with co-adaptation between users and Spoken Dialogue Systems in non-cooperative tasks. The conversation is modeled as a Stochastic Game: both the user and the system have their own preferences but have to come up with an agreement to solve a non-cooperative task. They are jointly trained so the Dialogue Manager learns the optimal strategy against the best possible user. Results obtained by simulation show that non-trivial strategies are learned and that this framework is suitable for dialogue modeling.


Introduction
In a Spoken Dialogue System (SDS), the Dialogue Manager (DM) is designed to implement a decision-making process (called strategy or policy) that chooses the system's interaction moves. The decision is taken according to the current interaction context, which can rely on bad transcriptions and misunderstandings due to Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) errors. Machine learning methods, such as Reinforcement Learning (RL) (Sutton and Barto, 1998), are now very popular to learn optimal dialogue policies under noisy conditions and inter-user variability (Levin and Pieraccini, 1997; Lemon and Pietquin, 2007; Laroche et al., 2010; Young et al., 2013). In this framework, the dialogue task is modeled as a (Partially Observable) Markov Decision Process ((PO)MDP), and the DM is an RL-agent learning an optimal policy. Yet, despite some rare examples, RL-based DMs only consider task-oriented dialogues and stationary (non-adapting) users.
Unfortunately, (PO)MDPs are restricted to modeling game-against-nature problems (Milnor, 1951). These are problems in which the learning agent evolves in an environment that does not change with time and acts in a totally disinterested manner. (PO)MDP-based dialogue modeling thus applies only if 1) the user does not modify his/her behavior over time (the strategy is learned for a stationary environment) and 2) the dialogue is task-oriented and requires the user and the machine to positively collaborate to achieve the user's goal.
The first assumption does not hold if the user adapts his/her behavior to the continuously improving performance of a learning DM. Some recent studies have tried to model this co-adaptation effect between a learning machine and a human (Chandramohan et al., 2012b), but this approach still considers the user and the machine as independent learning agents. Although there have already been a few attempts to model the "coevolution" of human-machine interfaces (Bourguin et al., 2001), this work does not extend to RL-based (automatically learning) interfaces and is not related to SDS.
More challenging situations arise when the common-goal assumption does not hold either, which is the case in many interesting applications such as negotiation (El Asri et al., 2014), serious games, e-learning, robotic co-workers, etc. In particular, adapting the MDP paradigm to negotiation dialogues has been the topic of recent work. In (Georgila et al., 2014), the authors model the problem of negotiation as a Multi-Agent Reinforcement Learning (MARL) problem. Yet, this approach relies on algorithms that treat the multi-player issue as a non-stationarity problem (e.g. WoLF-PHC (Bowling and Veloso, 2002)). Each agent is assumed to keep a stable interaction policy for a time sufficiently long so that the other agent can learn its current policy. Otherwise, there are no convergence guarantees. Another major issue with these works is that noise in the ASR or NLU results is not taken into account, although this is a major reason for using stochastic dialogue models. In (Efstathiou and Lemon, 2014), the authors follow the same direction by considering both agents as acting in a stationary MDP.
In this paper, we propose a paradigm shift from the now state-of-the-art (PO)MDP model to Stochastic Games (Patek and Bertsekas, 1999) to model dialogue. This model extends the MDP paradigm to multi-player interactions and allows the strategies of both agents (the user and the DM) to be learned jointly, which leads to the best system strategy in the face of the optimal user/adversary (in terms of his/her goal). This paradigm models both co-adaptation and possible non-cooperativeness. Unlike models based on standard game theory (Caelen and Xuereb, 2011), Stochastic Games allow learning from data. In particular, building on recent results (Perolat et al., 2015), we show that the optimal strategy can be learned from batch data, as for MDPs. This means that optimal negotiation policies can be learnt from non-optimal logged interactions. This new paradigm is also very different from the MARL methods proposed in previous work (Chandramohan et al., 2012b; Georgila et al., 2014; Efstathiou and Lemon, 2014), since optimization is performed jointly instead of alternately optimizing each agent under the assumption that the other stays stationary for a while. Although experiments are only concerned with purely adversarial tasks (Zero-Sum games), we show that it could be naturally extended to collaborative tasks (general sum games) (Prasad et al., 2015). Experiments show that an efficient strategy can be learned even under noisy conditions, which makes the framework suitable for modeling realistic human-machine spoken dialogues.

Markov Decision Processes and Reinforcement Learning
As said before, human-machine dialogue has been modeled as a (PO)MDP to make it suitable for automatic strategy learning (Levin and Pieraccini, 1997; Young et al., 2013). In this framework, the dialogue is seen as a turn-taking process in which two agents (a user and a DM) interact through a noisy channel (ASR, NLU) to exchange information. Each agent has to take a decision about what to say next according to the dialogue context (also called dialogue state). In this section, MDPs (Puterman, 1994) and RL (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996) are briefly reviewed and formally defined, which will ease the switch to Stochastic Games in Section 3.

Markov Decision Processes
Definition 2.1. A Markov Decision Process (MDP) is a tuple $\langle S, A, T, R, \gamma \rangle$ where: $S$ is the discrete set of environment states, $A$ the discrete set of actions, $T : S \times A \times S \to [0, 1]$ the state transition probability function and $R : S \times A \to \mathbb{R}$ the reward function. Finally, $\gamma \in [0, 1)$ is a discount factor.
At each time step, the RL-agent acts according to a policy $\pi$, which is either deterministic or stochastic. In the first case, $\pi$ is a mapping from state space to action space, $\pi : S \to A$, while in the latter, $\pi$ is a probability distribution on the state-action space, $\pi : S \times A \to [0, 1]$. Policies are generally designed to maximize the value of each state, i.e. the expected discounted cumulative reward: $\forall s \in S, V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s]$. Let $\mathcal{V}$ be the space of all possible value functions. The optimal value function $V^*$ is the only value function such that: $\forall V \in \mathcal{V}, \forall s \in S, V^*(s) \geq V(s)$. The following result, proved in (Puterman, 1994), is fundamental in the study of MDPs:
Theorem 2.1. Let M be an MDP. Its optimal value function $V^*$ exists, is unique and verifies: $\forall s \in S, V^*(s) = \max_{a \in A} [ r(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V^*(s') ]$. Furthermore, one can always find a deterministic policy $\pi^*$ inducing $V^*$.
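As an illustration of Theorem 2.1, the following minimal sketch runs value iteration (repeated application of the Bellman optimality backup) on a toy MDP and then reads off a deterministic greedy policy. The two states, two actions, transition table and rewards are invented for this example and are not taken from the dialogue task.

```python
# Hypothetical toy MDP: 2 states, 2 actions, deterministic transitions.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# T[s][a] lists (next_state, probability) pairs.
T = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
# R[s][a] is the immediate reward.
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}


def backup(V, s, a):
    """One-step lookahead: immediate reward plus discounted expected value."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])


def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


def greedy_policy(V):
    """Deterministic policy that is greedy with respect to V."""
    return {s: max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES}


V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

In this toy problem the greedy policy always moves to (or stays in) state 1, which is a deterministic policy inducing the optimal values, as the theorem states.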

Reinforcement Learning
In many cases, transition and reward functions are unknown. It is thus not possible to compute values or Q-Functions exactly; instead, the RL-agent learns an approximation by sampling through actual interactions with the environment. The set of techniques solving this problem is called Reinforcement Learning.
For instance, the Q-Learning algorithm (Watkins and Dayan, 1992) approximates, at each time step, the optimal Q-Function and uses the following update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) ]$, where $\alpha_t$ is the learning rate. It can be shown that, under the assumption that $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ and that all states are visited infinitely often, Q-values converge towards the optimal ones. Thus, by taking at each state the action maximizing those values, one finds the optimal policy. There are batch algorithms solving the same problem, among which Fitted-Q (Gordon, 1999; Ernst et al., 2005).
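The tabular Q-Learning step can be sketched as follows: the current estimate is moved towards the observed reward plus the discounted best estimate in the next state. The function name, the dictionary-based table and the numeric values are assumptions made for this illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one tabular Q-Learning update to Q[(s, a)] and return the new value."""
    # Bootstrapped target: reward plus discounted best next-state estimate.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Unseen state-action pairs default to a value of 0.
Q = defaultdict(float)
first = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1,
                          actions=[0, 1], alpha=0.5, gamma=0.9)
```

A decreasing step size such as 1/n (with n the number of visits to the pair) is one common choice satisfying the two conditions on the learning rate.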

Definitions
Definition 3.1. A discounted Stochastic Game (SG) is a tuple $\langle D, S, A, T, R, \gamma \rangle$ where: $D = \{1, ..., n\}$ represents the set of agents, $S$ the discrete set of environment states, $A = \times_{i \in D} A_i$ the joint action set, where for all $i = 1, ..., n$, $A_i$ is the discrete set of actions available to the $i$-th agent, $T : S \times A \times S \to [0, 1]$ the state transition probability function, $R = \times_{i \in D} R_i$ the joint reward function, where for all $i = 1, ..., n$, $R_i : S \times A \to \mathbb{R}$ is the reward function of agent $i$, and $\gamma \in [0, 1)$ is a discount factor.
An agent $i$ chooses its actions according to some strategy $\sigma_i$, which is in the general case a probability distribution on $i$'s state-action space. If the whole set of agents is considered, we speak about the joint strategy $\sigma$. The notation $\sigma_{-i}$ represents the joint strategy of all agents except $i$.
This definition is general: every 'MDP' in which multiple agents interact may be interpreted as a Stochastic Game. It is therefore useful to introduce a taxonomy. A game where there are only two players and where the rewards are opposite (i.e. $R_1 = -R_2$) is called a Zero-Sum Game.
Conversely, a Purely Cooperative Game is a game where all the agents have the same reward (i.e. ∀i ∈ D, R i = R). A game which is neither Zero-Sum nor Purely Cooperative is said to be General-Sum.

Best Response
In all environments, agents learn by acting according to what has previously been learned. In other words, agents adapt to an environment. This is also valid in a multi-agent scenario: if agent i wants to learn about agent j, it will act according to what has previously been learned about j. Conversely, if j wants to learn about agent i, it will act according to what it knows about i. We say that agents co-adapt. Due to this feedback loop, co-adaptation is an intrinsically non-stationary process. An algorithm converges if it converges to stationary strategies.
Each agent acts in order to maximize its expected discounted cumulative reward, also called the discounted value of the joint strategy $\sigma$ in state $s$ to player $i$: $V_i(s, \sigma) = E[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, \sigma) \mid s_0 = s]$. The Q-function is then defined as (Filar and Vrieze, 1996): $Q_i(s, \sigma, a) = R_i(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V_i(s', \sigma)$, where $a$ is a joint action. This value function depends on the opponents' strategies. It is therefore not possible to define, in the general case, a strategy that is optimal against every other strategy. A Best Response is an optimal strategy given the opponents' strategies.
Definition 3.2. Agent i plays a Best Response σ i against the other players' joint strategy σ −i if σ i is optimal given σ −i . We write σ i ∈ BR(σ −i ).
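Best Response naturally induces the following definition:
Definition 3.3. The joint strategy $\sigma = (\sigma_i)_{i \in D}$ is a Nash Equilibrium if, for all $i \in D$, $\sigma_i \in BR(\sigma_{-i})$.
It is interesting to notice that in a single-player game, Nash Equilibrium strategies match the optimal policies defined in the previous section.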
The existence of Nash Equilibria in all discounted Stochastic Games is assured by the following theorem (Filar and Vrieze, 1996): Theorem 3.1. In a discounted Stochastic Game G, there exists a Nash Equilibrium in stationary strategies.
Two remarks are in order here. First, nothing was said about uniqueness since, in the general case, there are many Nash Equilibria. Equilibrium selection and tracking can be a significant difficulty when working with SGs. Second, contrary to the MDP case, there may be no deterministic Nash Equilibrium strategies (only stochastic ones).

The Zero-Sum Case
There are two ways to consider a Zero-Sum Stochastic Game: one can see two agents aiming at maximizing two opposite Q-functions, or one can see a single Q-function, with the first agent (called the maximizer) aiming at maximizing it and the second one (the minimizer) aiming at minimizing it. One can prove (Patek and Bertsekas, 1999) that if both players follow those maximizing and minimizing strategies, the game will converge towards a Nash Equilibrium, which is the only one of the game. In this case, thanks to the Minimax theorem (Osborne and Rubinstein, 1994), the value of the game is (with player 1 maximizing and player 2 minimizing): $V^*(s) = \max_{\sigma_1} \min_{\sigma_2} V_1(s, \sigma_1, \sigma_2) = \min_{\sigma_2} \max_{\sigma_1} V_1(s, \sigma_1, \sigma_2)$, where $V_1(s, \sigma_1, \sigma_2)$ is the discounted value of the joint strategy $(\sigma_1, \sigma_2)$ to player 1 in state $s$. As we will see later, the existence of this unique value function for both players is helpful for finding efficient algorithms solving zero-sum SGs.
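To make the max-min value concrete in the simplest possible case, the sketch below solves a single 2x2 zero-sum matrix game (one state, one decision per player) in closed form. The helper name and the payoff matrices are assumptions chosen for illustration; larger games generally require linear programming instead.

```python
def solve_2x2_zero_sum(M):
    """Solve a 2x2 zero-sum matrix game.

    M[i][j] is the payoff to the row player (the maximizer) when it plays
    row i and the column player (the minimizer) plays column j.
    Returns (value, p), where p is the probability that the maximizer
    plays row 0 at equilibrium.
    """
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        # Saddle point: a deterministic equilibrium exists.
        p = 1.0 if min(a, b) >= min(c, d) else 0.0
        return maximin, p
    # No saddle point: the equilibrium is mixed (standard 2x2 closed form).
    denom = a - b - c + d
    p = (d - c) / denom
    value = (a * d - b * c) / denom
    return value, p


# Matching pennies: the only equilibrium is stochastic.
pennies_value, pennies_p = solve_2x2_zero_sum([[1, -1], [-1, 1]])
# A game with a saddle point: a deterministic strategy suffices.
saddle_value, saddle_p = solve_2x2_zero_sum([[2, 1], [0, -1]])
```

Matching pennies illustrates that, unlike in an MDP, a game may admit only stochastic equilibrium strategies even though its value is unique.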

Algorithms
Although the field of Reinforcement Learning in Stochastic Games is still young and guaranteed Nash Equilibrium convergence with tractable algorithms is, to our knowledge, still an open problem, many algorithms have already been proposed (Buşoniu et al., 2008), all with strengths and weaknesses.
Reinforcement Learning techniques to solve Stochastic Games were first introduced in (Littman, 1994). In his paper, Littman presents minimax-Q, a variant of the Q-Learning algorithm for the zero-sum setting, which is guaranteed to converge to the Nash Equilibrium in self-play. He then extended his work in (Littman, 2001) with Friend-or-Foe Q-Learning (FFQ), an algorithm guaranteed to converge, and converging to Nash Equilibria in purely cooperative or purely competitive settings. The authors of (Hu and Wellman, 2003) were the first to propose an algorithm for general-sum Stochastic Games. Their algorithm, Nash-Q, is also a variant of Q-Learning that allows the agents to reach a Nash Equilibrium under some restrictive conditions on the rewards' distribution. In the general case, they showed empirically that convergence was no longer guaranteed. (Zinkevich et al., 2006) proved by giving a counter-example that the Q-function does not contain enough information to converge towards a Nash Equilibrium in the general setting.
For any known Stochastic Game, the Stochastic Tracing Procedure algorithm (Herings and Peeters, 2000) finds a Nash Equilibrium of it. The algorithm proposed in (Akchurina, 2009) was the first learning algorithm converging to an approximate Nash Equilibrium in all settings (even with an unknown game). Equilibrium tracking is done here by solving a system of ordinary differential equations at each iteration. The algorithm has no guarantee of converging towards a Nash Equilibrium; however, it seems to work empirically. Finally, (Prasad et al., 2015) presented two algorithms converging towards a Nash Equilibrium in the General-Sum setting: one batch algorithm assuming complete knowledge of the game and an on-line algorithm working with simulated transitions of the Stochastic Game.
In this paper we will use two algorithms which are reviewed hereafter: WoLF-PHC (Bowling and Veloso, 2002) and AGPI-Q (Perolat et al., 2015).

WoLF-PHC
WoLF-PHC is an extension of the Q-learning algorithm allowing probabilistic strategies. It considers independent agents evolving in an environment made non-stationary by the presence of the others. In such a setting, the aim of the agents is not to find a Nash Equilibrium (it is therefore not an SG algorithm) but to do as well as possible in this environment (and as a consequence, it may lead to a Nash Equilibrium). The algorithm is based on the following idea: convergence should be facilitated if agents learn quickly when they are sub-optimal and slowly when they are near-optimal (in order to let the other agents adapt to this strategy).
Q-values are updated as in Q-learning and the probability of selecting the best action is incrementally increased according to some (variable) learning rate δ, which is decomposed into two learning rates δ L and δ W , with δ L > δ W . The policy update is made according to δ L while losing and to δ W while winning.
To determine whether an agent is losing or winning, the expected value of its current strategy $\pi$ is compared to the expected value of the average policy $\bar{\pi}$. Formally, an agent is winning if $\sum_a \pi(s, a) Q(s, a) > \sum_a \bar{\pi}(s, a) Q(s, a)$ and losing otherwise.
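The win/lose test and the resulting policy step can be sketched as below. This is a simplified illustration: the Q-value update and the maintenance of the average policy are omitted, the dictionary-based tables and function names are assumptions, and the step sizes 0.05 and 0.2 are the values used later in the experiments.

```python
def is_winning(pi, pi_avg, Q, s, actions):
    """Compare the expected value of the current and the average policy."""
    current = sum(pi[(s, a)] * Q[(s, a)] for a in actions)
    average = sum(pi_avg[(s, a)] * Q[(s, a)] for a in actions)
    return current > average


def wolf_policy_step(pi, pi_avg, Q, s, actions, delta_w=0.05, delta_l=0.2):
    """Shift probability mass towards the greedy action in state s.

    Uses the small step delta_w when winning and the large step delta_l
    when losing. Requires at least two actions. Returns the step used.
    """
    delta = delta_w if is_winning(pi, pi_avg, Q, s, actions) else delta_l
    best = max(actions, key=lambda a: Q[(s, a)])
    moved = 0.0
    for a in actions:
        if a != best:
            # Never remove more probability than the action currently has.
            step = min(pi[(s, a)], delta / (len(actions) - 1))
            pi[(s, a)] -= step
            moved += step
    pi[(s, best)] += moved
    return delta


actions = [0, 1]
Q = {(0, 0): 0.0, (0, 1): 1.0}
pi = {(0, 0): 0.5, (0, 1): 0.5}
pi_avg = {(0, 0): 0.8, (0, 1): 0.2}
used_delta = wolf_policy_step(pi, pi_avg, Q, 0, actions)
```

Here the current policy is worth more than the average one, so the agent is winning and only the cautious step is applied.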
In the general case, convergence is not proven and it is even shown on some toy-examples that sometimes, the algorithm does not converge (Bowling and Veloso, 2002).

AGPI-Q
Approximate Generalized Policy Iteration-Q, or AGPI-Q (Perolat et al., 2015), is an extension of the Fitted-Q algorithm (Gordon, 1999; Ernst et al., 2005) solving Zero-Sum Stochastic Games in a batch setting. At the initialization step, N samples $(s, a_1, a_2, r, s')$ and a Q-function (for instance, the null function) are given. The algorithm then consists of K iterations, each of them composed of two parts: a greedy part and an evaluation part. At each iteration, the algorithm thus provides a better approximation of the Q-function.
Let $(s^j, a^j, b^j, r^j, s'^j)$, $j = 1, ..., N$, be the N collected samples. At time step $k + 1$, the greedy part consists of finding the maximizer's maxminimizing action $a$ of the matrix game defined by $Q_k(s'^j, \cdot, \cdot)$. In our case, a turn-based setting, this involves finding a maximum. Then, during the evaluation part, since the second agent plays a minimizing strategy, the following value is computed: $Q^j = r^j + \gamma \min_b Q_k(s'^j, a, b)$. At each iteration, the algorithm returns the Q-function $Q_{k+1}$ that best fits these values over some hypothesis space.
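The greedy and evaluation parts can be sketched in a heavily simplified form. The assumptions here are significant: only pure (deterministic) max-min actions are considered, and the regression step is replaced by a plain average of the targets per (state, action, action) key instead of fitting over a hypothesis space. The function name, the sample format and the toy samples are hypothetical.

```python
from collections import defaultdict


def agpi_q_iteration(samples, q_k, max_actions, min_actions, gamma=0.9):
    """One simplified AGPI-Q-style iteration over a batch of samples.

    samples: list of (s, a, b, r, s_next) tuples.
    q_k: dict mapping (s, a, b) to a value; missing entries count as 0.
    Returns the next tabular Q-function as a dict.
    """
    def q(s, a, b):
        return q_k.get((s, a, b), 0.0)

    targets = defaultdict(list)
    for s, a, b, r, s_next in samples:
        # Greedy part: the maximizer's pure max-min action in the next state.
        greedy_a = max(
            max_actions,
            key=lambda a2: min(q(s_next, a2, b2) for b2 in min_actions),
        )
        # Evaluation part: the minimizer replies with its minimizing action.
        target = r + gamma * min(q(s_next, greedy_a, b2) for b2 in min_actions)
        targets[(s, a, b)].append(target)
    # Stand-in for the regression step: average the targets per key.
    return {key: sum(vals) / len(vals) for key, vals in targets.items()}


samples = [
    ("s0", "ask", "listen", 0.0, "s1"),
    ("s1", "guess", "listen", 1.0, "end"),
]
q1 = agpi_q_iteration(samples, {}, ["ask", "guess"], ["listen"])
q2 = agpi_q_iteration(samples, q1, ["ask", "guess"], ["listen"])
```

After two iterations the reward obtained by the final guess has propagated one step back to the preceding ask, discounted once.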

Dialogue as a Stochastic Game
Dialogue is a multi-agent interaction and should therefore be considered as such during the optimization process. If each agent (i.e. the user and the DM) has its own goals and takes its decisions to achieve them, it seems natural to model each of them as an MDP. In traditional dialogue system studies, this is only done for one of the two conversants. Since (Levin and Pieraccini, 1997; Singh et al., 1999), only the DM is encoded as an RL agent, despite rare exceptions (Chandramohan et al., 2012b; Chandramohan et al., 2012a). The user is rather considered as a stationary agent modeled as a Bayesian network (Pietquin, 2006) or an agenda-based process (Schatzmann et al., 2007), leading to modeling errors (Schatzmann et al., 2005; Pietquin and Hastie, 2013).
At first sight, it seems reasonable to think that if two RL agents, previously trained to reach an optimal strategy, interact with each other, it would result in "optimal" dialogues. Yet, this assertion is wrong. Each agent would be optimal given the environment it has been trained on, but given another environment, nothing can be said about the learnt policy. Furthermore, if two DMs are trained together with traditional RL techniques, no convergence is guaranteed since, as seen above, non-stationarities emerge. Indeed, non-stationarity is not well managed by standard RL methods; some methods can deal with it (Geist et al., 2009; Daubigney et al., 2012), but adaptation might not be fast enough.
Jointly optimizing RL-agents in the framework of Stochastic Games finds a Nash Equilibrium. This guarantees both strategies to be optimal and this makes a fundamental difference with previous work (Chandramohan et al., 2012b;Georgila et al., 2014;Efstathiou and Lemon, 2014).
In the next section, we illustrate how dialogue may be modeled by a Stochastic Game and how transition and reward functions depend on the policies of both agents. We now propose a Zero-Sum dialogue game where agents have to drive the dialogue efficiently to gather information quicker than their opponent. In this example, the human user (Agent 1) and the DM (Agent 2) are modeled with MDPs: each of them has a goal encoded into reward functions $R_1$ and $R_2$ (they may depend on the joint action).

A Zero-Sum Dialogue Game
The task involves two agents. Each of them receives a random secret number and aims at guessing the other agent's number. They are adversaries: if one wins, the other one loses as much.
To find the secret number out, agents may perform one of the following actions: ask, answer, guess, ok, confirm and listen.
During a dialogue turn, the agent asking the question is called the guesser and the one answering is the opponent. To retrieve information about the opponent's hidden number, the guesser may ask if this number is smaller or greater than some other number. The opponent is forced to answer truthfully. To show that it has understood the answer, the guesser says ok and then releases the turn to its adversary, which takes on the guesser's role.
Agents are not perfect: they can misunderstand what has been said. This simulates ASR and NLU errors arising in real SDSs. They have an indicator giving a hint about the probability of having understood correctly (a confidence level). They are however never certain and they may answer the wrong question, e.g. in the following exchange: "Is your secret number greater than x?" "My number is greater than y."
When such an error arises, Agent 1 is allowed to ask another question instead of just saying ok. This punishment is harsh for the agent which misunderstood: it is almost as if it had to pass its turn. Another dialogue act is introduced to deal with such situations. If an agent is not sure, it may ask to confirm. In this case, Agent 1 may ask its question again. To avoid abuses, i.e. asking for a confirmation indefinitely, this action induces a cost (and therefore a gain for the opponent).
If an agent thinks that it has found the number out, it can make a guess. If it was right, it wins (and therefore its opponent loses), otherwise, it loses (and its opponent wins).
Since we model dialogue as a turn-based interaction and we will need to consider joint actions, we introduce the action listen corresponding to the empty action.

Experimental Setting
Effects of the multi-agent setting are studied here through one special feature of the human-machine dialogue: the uncertainty management due to the dysfunctions of the ASR and the NLU. To promote simple algorithms, we ran our experiments on the zero-sum dialogue game presented above.
On this task, we compare three algorithms: Q-Learning, WoLF-PHC and AGPI-Q. Among those algorithms, only AGPI-Q is proved to converge towards a Nash Equilibrium in a Multi-Agent setting. Q-Learning and WoLF-PHC have however been used as Multi-Agent learning algorithms in a dialogue setting (English and Heeman, 2005; Georgila et al., 2014). Similarly to these papers, experiments are done using simulation. We will show that, contrary to AGPI-Q, they do not converge towards the Nash Equilibrium and therefore do not fit the dialogue problem.

Modeling ASR and NLU Confidence Estimation
One difficulty when working with Spoken Dialogue Systems is how a DM can deal with the uncertainty resulting from ASR and NLU errors and reflected by their Confidence Scores. Those scores are not always a probability. The only assumption made here is that with a score lower (resp. greater) than 0.5, the probability of misunderstanding the last utterance is greater (resp. lower) than 0.5. Since dialogues are simulated, the ASR and NLU confidence levels are modeled in the following way. Each agent has some fixed Sentence Error Rate ($SER_i$). With probability $(1 - SER_i)$, agent $i$ receives each utterance undisrupted, while with probability $SER_i$, this utterance is misunderstood and replaced by another one.
Since Q-Learning and WoLF-PHC are used in their tabular form, it was necessary to discretize this score. To have states where the agent is almost sure of having understood (or almost sure of having misunderstood), we discretized by splitting the score around the cut points 0.1, 0.5 and 0.9. For fairness, the same discretization was applied for the AGPI-Q algorithm.
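The substitution noise described here can be sketched as a small channel function. How the replacement utterance is drawn is not specified in the text; choosing uniformly among the other utterances of a hypothetical vocabulary is an assumption of this example, as are the function and variable names.

```python
import random


def noisy_channel(utterance, vocabulary, ser, rng):
    """Pass an utterance through a channel with sentence error rate `ser`.

    With probability `ser` the utterance is replaced by a different one
    (assumption: drawn uniformly from the rest of `vocabulary`, which must
    contain at least two utterances). Returns (heard, was_misunderstood).
    """
    if rng.random() < ser:
        alternatives = [u for u in vocabulary if u != utterance]
        return rng.choice(alternatives), True
    return utterance, False


rng = random.Random(0)
vocabulary = ["greater than 3", "greater than 5", "smaller than 5"]
heard, misunderstood = noisy_channel("greater than 5", vocabulary, ser=0.3, rng=rng)
```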
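Splitting a confidence score around the cut points 0.1, 0.5 and 0.9 yields four bins, which can be sketched with a binary search over the cut points. The treatment of scores that fall exactly on a cut point (assigned to the upper bin here) is an assumption, as is the function name.

```python
from bisect import bisect_right

# Cut points stated in the text; they define four confidence bins.
CUT_POINTS = (0.1, 0.5, 0.9)


def discretize_confidence(score):
    """Map a confidence score in [0, 1] to a bin index in {0, 1, 2, 3}."""
    return bisect_right(CUT_POINTS, score)


bins = [discretize_confidence(x) for x in (0.05, 0.3, 0.7, 0.95)]
```

Bin 0 corresponds to an agent that is almost sure of having misunderstood and bin 3 to an agent that is almost sure of having understood.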

State Space
Consider two agents i and j. Their secret numbers are respectively m and n. To gather information about m, agent i asks if the secret number m is smaller or greater than some given number k. If agent j answers that m is greater (resp. smaller) than k, it provides i with a lower bound $\underline{b}_i$ (resp. an upper bound $\overline{b}_i$) on m. Agent i's knowledge on m may be represented by the interval $I_i = [\underline{b}_i, \overline{b}_i]$. The probability of winning by making a guess is then given by $p = 1 / |I_i|$, where $|I_i|$ is the number of values still possible in $I_i$. To take an action, an agent needs to remember who pronounced the last utterance, what was the last utterance it heard and to what extent it believes that what it heard was what had been said.
To summarize, an agent taking an action makes its decision according to the following features: the last utterance, its trust in this utterance, who uttered it, its progress in the game and its opponent's progress. Agents do not need to track the whole range of possible secret numbers but only the cardinal of these sets. Dialogue turn, last action, confidence score and the cardinals of possible numbers for both agents are thus the five state features. The state space thus contains 2 × 5 × 4 × 5 × 5 = 1000 states.
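The size of the state space can be checked by enumerating the five features. The feature names and the assignment of each factor of the product to a feature (in the order the features are listed) are assumptions made for this sketch.

```python
from itertools import product

# Number of values per state feature (assumed to follow the listed order).
FEATURE_SIZES = {
    "turn": 2,
    "last_action": 5,
    "confidence_bin": 4,
    "own_cardinal": 5,
    "opponent_cardinal": 5,
}

# Each state is one combination of feature values.
states = list(product(*(range(n) for n in FEATURE_SIZES.values())))
state_index = {state: i for i, state in enumerate(states)}
n_states = len(states)
```

Such an index is what a tabular learner like Q-Learning or WoLF-PHC needs to address its table.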

Action Space
Agents are able to perform one of the following actions: ask, answer, guess, confirm and listen. The actions ask, answer and guess need an argument: the number the agent wants to compare to. To learn quicker, we chose not to learn a decision about this value. When an agent asks, it asks if the secret number is greater or smaller than the number in the middle of its range (this range is computed by the environment; it is not taken into account in the states). An agent answering says that its secret number is greater or smaller than the number it heard (which may not be the uttered number). An agent guessing proposes a number drawn randomly from its range of possible values.
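The way the environment fills in the arguments of ask and guess can be sketched as follows. Representing the range as a sorted list of candidate numbers, and rounding the middle index down when the range has an even size, are assumptions of this example.

```python
import random


def ask_argument(candidates):
    """Number in the middle of the sorted range of remaining candidates."""
    return candidates[(len(candidates) - 1) // 2]


def guess_argument(candidates, rng):
    """Number drawn uniformly from the remaining candidates."""
    return rng.choice(candidates)


def win_probability(candidates):
    """Chance that a uniform guess hits the secret number."""
    return 1.0 / len(candidates)


candidates = [1, 2, 3, 4, 5]
middle = ask_argument(candidates)
p_win = win_probability(candidates)
guessed = guess_argument(candidates, random.Random(0))
```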

Reward function
To define the reward function, we consider the maximizing player. It is its turn to play. If it guesses the right number, it earns +1. If it asks for a confirmation, it earns −0.2. Therefore, it is never in its interest to block the dialogue by always asking for a confirmation (in the worst case, i.e. if the second agent immediately wins, it earns −1, while if it blocks the dialogue indefinitely, it earns $-0.2 \sum_{k=0}^{\infty} (\gamma^2)^k \approx -1.05$ for $\gamma = 0.9$).
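The figure of about −1.05 can be verified numerically. The agent pays the confirmation cost once every two turns, hence the factor $\gamma^2$ per repetition; the geometric series then has the closed form $-0.2 / (1 - \gamma^2)$. The variable names are chosen for this illustration.

```python
GAMMA = 0.9
CONFIRM_REWARD = -0.2

# Closed form of the geometric series with ratio GAMMA ** 2.
closed_form = CONFIRM_REWARD / (1 - GAMMA ** 2)

# Truncated sum, as a numerical check of the closed form.
truncated = sum(CONFIRM_REWARD * (GAMMA ** 2) ** k for k in range(1000))

# Blocking forever is worse than losing immediately (reward -1).
blocking_is_worse = closed_form < -1.0
```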

Training of the algorithms
To train Q-Learning and WoLF-PHC, we followed the setup proposed in (Georgila et al., 2014). Both algorithms are trained in self-play by following an $\epsilon$-greedy policy. Training is split into five epochs of 100000 dialogues. The exploration rate is set to 0.95 in the first epoch, 0.8 in the second, 0.5 in the third, 0.3 in the fourth and 0.1 in the fifth. The parameters $\delta_L$ and $\delta_W$ of WoLF-PHC are set to $\delta_W = 0.05$ and $\delta_L = 0.2$. The ratio $\delta_L / \delta_W = 4$ ensures aggressive learning when losing.
As a batch RL algorithm, AGPI-Q requires samples. To generate them, we followed the setup proposed in ). An optimal (or at least near-optimal) policy is first handcrafted. This policy is the following: an agent always asks for more information except when it or its opponent has enough information to make the right guess with probability 1. When the agent has to answer, it asks to confirm if its confidence score is below 0.5.
An $\epsilon$-random policy is then designed. Agents make their decisions according to the hand-crafted policy with probability $\epsilon$ and pick actions randomly with probability $(1 - \epsilon)$.
Tuples $(s, a_1, a_2, r, s')$ are then gathered. We are then assured that the problem space is well-sampled and that there also exist samples giving the successful task completion reward. To ensure convergence, 75000 such dialogues are generated.
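The exploration schedule and the ε-greedy action choice can be sketched as below, using the epoch length and exploration rates given in the text. The function names and the dictionary-based Q-table are assumptions of this example.

```python
import random

EXPLORATION_BY_EPOCH = (0.95, 0.8, 0.5, 0.3, 0.1)
DIALOGUES_PER_EPOCH = 100_000


def exploration_rate(dialogue_index):
    """Exploration rate for the (0-based) index of the current dialogue."""
    epoch = min(dialogue_index // DIALOGUES_PER_EPOCH,
                len(EXPLORATION_BY_EPOCH) - 1)
    return EXPLORATION_BY_EPOCH[epoch]


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))


rng = random.Random(0)
Q = {("s", "ask"): 0.2, ("s", "guess"): 0.7}
action = epsilon_greedy(Q, "s", ["ask", "guess"], exploration_rate(450_000), rng)
```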
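The hand-crafted policy and its randomized version can be sketched as follows. The state encoding (a flag telling whether the agent must answer, a raw confidence score, and the number of values each agent still considers possible) is a hypothetical simplification. Note that, unlike in ε-greedy exploration, the mixing probability here is the probability of following the hand-crafted policy.

```python
import random


def handcrafted_action(must_answer, confidence, own_cardinal, opponent_cardinal):
    """Hand-crafted policy as described in the text (state encoding assumed)."""
    if must_answer:
        # Ask for confirmation when the last utterance is probably misheard.
        return "confirm" if confidence < 0.5 else "answer"
    if own_cardinal == 1 or opponent_cardinal == 1:
        # One of the agents can already guess right with probability 1.
        return "guess"
    return "ask"


def mixed_action(state, actions, follow_prob, rng):
    """Follow the hand-crafted policy with probability follow_prob,
    otherwise pick an action uniformly at random."""
    if rng.random() < follow_prob:
        return handcrafted_action(**state)
    return rng.choice(actions)


state = {"must_answer": False, "confidence": 0.8,
         "own_cardinal": 4, "opponent_cardinal": 4}
chosen = mixed_action(state, ["ask", "answer", "guess", "confirm"],
                      follow_prob=1.0, rng=random.Random(0))
```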
To keep the model as parameter-free as possible, CART trees are used as hypothesis space for the regression.

Results
Decisions in the game come down to two points: when is the best moment to end the dialogue with the guess action, and what is the best way to deal with uncertainty through the confirm action. Average duration of dialogues and average number of confirm actions are therefore chosen as the features characterizing the Nash Equilibrium. Both are calculated over 5000 dialogues. Figures 1 and 2 illustrate those results.
The length of Q-Learning dialogues decreases gradually with an increasing SER (Figure 1). Figure 2 brings an explanation: Q-Learning agents do not learn to use the confirm action. Moreover, dialogue length is not even regular, indicating that the algorithm did not converge to a 'stable' policy. Q-Learning is a slow algorithm and therefore, agents do not have enough time to face the non-stationarities of the multi-agent environment. Convergence is thus not possible.
WoLF-PHC does not handle uncertainty either. Its number of confirm actions is by far the highest but stays constant. If the SDS asks for confirmation even when there is no noise, it may be because, being disadvantaged, it always loses, and while losing, its quick learning rate keeps its strategy changing. As previously said, convergence was not guaranteed.
AGPI-Q is thus the only algorithm providing robustness against noise. The length of dialogues and the number of confirm actions both increase gradually with the SER of the SDS. We are also assured by the theory that in this setting, no improvement is possible.
It is also interesting to note the emergence of non-trivial strategies coming from the interaction between the AGPI-Q agents. For instance, when both agents are almost at the end of the dialogue ($c_i = 2$ for each agent), agents make a guess. Even if they have very low chances of winning, agents also make a guess when it is certain that the adversary will win at the next turn.
We provided a rigorous framework for co-learning in Dialogue Systems allowing optimization for both conversants. Its efficiency was shown on a purely adversarial setting under noisy conditions, and an extension to situations more general than the purely adversarial setting is now proposed.

An appointment scheduling problem
The previous model considers only purely competitive scenarios. In this section, it is extended to the General-Sum case. We take as an example the task of scheduling the best appointment between two agents, where conversants have to interact to find an agreement. Each agent $i$ has its own preferences about the slots in its agenda, encoded into some reward function $R_i$. At each turn, an agent proposes some slot $k$. At the next turn, its interlocutor may propose another slot or accept this one. If it accepts, agent $i$ earns $R_i(k)$; it gets nothing otherwise. The conversation ends when an agent accepts an offered slot.
Agents, which are not always perfect, can misunderstand the last offer. An action confirm is therefore introduced. If an agent thinks that the last offer was on the slot $k'$ instead of the slot $k$, the outcome may be disastrous. An agent thus always has to find a trade-off between managing the uncertainty on the last offer and its impatience (due to the discount factor $\gamma$, which penalizes long dialogues).
Here, cooperation is implicit. Conversants are self-centered: they care only about their own value functions. However, since these depend on both agents' actions (more explicitly, the opponent may refuse an offer), they have to take into account the opponent's behavior.
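The tension between preferences, impatience and misunderstanding can be illustrated with a small payoff computation. The slot names, preference values and the convention that an acceptance at turn t is discounted by $\gamma^t$ are assumptions made for this sketch; the text only states that agent i earns $R_i(k)$ when slot k is accepted.

```python
def accept_payoffs(preferences, slot, turn, gamma=0.9):
    """Discounted payoff of each agent when `slot` is accepted at `turn`.

    preferences: dict mapping an agent name to a dict slot -> reward R_i(slot).
    """
    return {agent: (gamma ** turn) * prefs[slot]
            for agent, prefs in preferences.items()}


# Hypothetical preferences: neither opposite nor identical (General-Sum).
preferences = {
    "user": {"monday": 1.0, "tuesday": 0.2},
    "system": {"monday": 0.1, "tuesday": 0.9},
}

# The user accepts at turn 2 what it believes to be "monday"...
believed = accept_payoffs(preferences, "monday", turn=2)
# ...but the offer was actually "tuesday": a misunderstanding is costly.
actual = accept_payoffs(preferences, "tuesday", turn=2)
# Confirming first avoids the error but delays acceptance by two turns.
confirmed = accept_payoffs(preferences, "monday", turn=4)
```

Comparing the user's entries in `actual` and `confirmed` shows why confirming can be worthwhile despite the discount.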

Future work
In the future, using General-Sum algorithms (Prasad et al., 2015), our framework will be applied to those much more complicated dialogue situations where cooperative and competitive phenomena are mixed, in addition to the noisy conditions encountered in dialogue.
The long-term goal of this work is to use the model on a real data set in order to provide a model of real interactions and to design adaptive SDSs without having to rely on user modeling.