Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other's reward functions must reach an agreement (or a deal) via natural language dialogue. For the first time, we show it is possible to train end-to-end models for negotiation, which must learn both linguistic and reasoning skills with no annotated dialogue states. We also introduce dialogue rollouts, in which the model plans ahead by simulating possible complete continuations of the conversation, and find that this technique dramatically improves performance. Our code and dataset are publicly available (https://github.com/facebookresearch/end-to-end-negotiator).


Introduction
Intelligent agents often need to cooperate with others who have different goals, and typically use natural language to agree on decisions. Negotiation is simultaneously a linguistic and a reasoning problem, in which an intent must be formulated and then verbally realised. Such dialogues contain both cooperative and adversarial elements, and require agents to understand, plan, and generate utterances to achieve their goals (Traum et al., 2008; Asher et al., 2012).
We collect the first large dataset of natural language negotiations between two people, and show that end-to-end neural models can be trained to negotiate by maximizing the likelihood of human actions. This approach is scalable and domain-independent, but does not model the strategic skills required for negotiating well. We further show that models can be improved by training and decoding to maximize reward instead of likelihood: by training with self-play reinforcement learning, and by using rollouts to estimate the expected reward of utterances during decoding.
To study semi-cooperative dialogue, we gather a dataset of 5808 dialogues between humans on a negotiation task. Users were shown a set of items with a value for each, and asked to agree how to divide the items with another user who has a different, unseen, value function (Figure 1).
We first train recurrent neural networks to imitate human actions. We find that models trained to maximise the likelihood of human utterances can generate fluent language, but make comparatively poor negotiators, which are overly willing to compromise. We therefore explore two methods for improving the model's strategic reasoning skills, both of which attempt to optimise for the agent's goals, rather than simply imitating humans.
Firstly, instead of training to optimise likelihood, we show that our agents can be considerably improved using self play, in which pre-trained models practice negotiating with each other in order to optimise performance. To avoid the models diverging from human language, we interleave reinforcement learning updates with supervised updates. For the first time, we show that end-to-end dialogue agents trained using reinforcement learning outperform their supervised counterparts in negotiations with humans.
Secondly, we introduce a new form of planning for dialogue called dialogue rollouts, in which an agent simulates complete dialogues during decoding to estimate the reward of utterances. We show that decoding to maximise the reward function (rather than likelihood) significantly improves performance against both humans and machines.
Figure 1: A dialogue in our Mechanical Turk interface, which we used to collect a negotiation dataset.
Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later 'compromise' by conceding it. Deceit is a complex skill that requires hypothesising the other agent's beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.
The rest of the paper proceeds as follows: §2 describes the collection of a large dataset of human-human negotiation dialogues. §3 describes a baseline supervised model, which we then show can be improved by goal-based training (§4) and decoding (§5). §6 measures the performance of our models and humans on this task, and §7 gives a detailed analysis and suggests future directions.

Overview
To enable end-to-end training of negotiation agents, we first develop a novel negotiation task and curate a dataset of human-human dialogues for this task. This task and dataset follow our proposed general framework for studying semi-cooperative dialogue. Initially, each agent is shown an input specifying a space of possible actions and a reward function which will score the outcome of the negotiation. Agents then sequentially take turns of either sending natural language messages, or selecting that a final decision has been reached. When one agent selects that an agreement has been made, both agents independently output what they think the agreed decision was. If conflicting decisions are made, both agents are given zero reward.

Task
Our task is an instance of multi-issue bargaining (Fershtman, 1990), and is based on DeVault et al. (2015). Two agents are both shown the same collection of items, and instructed to divide them so that each item is assigned to one agent.
Each agent is given a different randomly generated value function, which gives a non-negative value for each item. The value functions are constrained so that: (1) the total value for a user of all items is 10; (2) each item has non-zero value to at least one user; and (3) some items have nonzero value to both users. These constraints enforce that it is not possible for both agents to receive a maximum score, and that no item is worthless to both agents, so the negotiation will be competitive. After 10 turns, we allow agents the option to complete the negotiation with no agreement, which is worth 0 points to both users. We use 3 item types (books, hats, balls), and between 5 and 7 total items in the pool. Figure 1 shows our interface.
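As an illustration of the three constraints above, the following is a minimal sketch of a scenario validator, together with an enumerator of the value functions that are worth 10 points in total. The function names and the representation (one integer count and one integer value per item type) are assumptions made for this example, not the code used to generate the dataset.

```python
from itertools import product


def is_valid_scenario(counts, values_a, values_b, total=10):
    """Check the three constraints on a pair of value functions.

    counts[i] is the number of items of type i in the pool;
    values_a[i] and values_b[i] are the per-item values for each user.
    """
    # (1) the total value of all items is `total` for each user
    sums_ok = all(
        sum(c * v for c, v in zip(counts, vals)) == total
        for vals in (values_a, values_b)
    )
    # (2) every item type has non-zero value to at least one user
    no_worthless = all(a > 0 or b > 0 for a, b in zip(values_a, values_b))
    # (3) at least one item type has non-zero value to both users
    contested = any(a > 0 and b > 0 for a, b in zip(values_a, values_b))
    return sums_ok and no_worthless and contested


def value_functions(counts, total=10):
    """Enumerate all non-negative integer value vectors worth `total` overall."""
    ranges = [range(total // c + 1) for c in counts]
    return [
        v for v in product(*ranges)
        if sum(c * x for c, x in zip(counts, v)) == total
    ]


# Example: 1 book, 2 hats, 3 balls
print(is_valid_scenario([1, 2, 3], [4, 3, 0], [1, 0, 3]))
```

A scenario generator could draw two vectors from `value_functions(counts)` and reject pairs that fail the check.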

Data Collection
We collected a set of human-human dialogues using Amazon Mechanical Turk. Workers were paid $0.15 per dialogue, with a $0.05 bonus for maximal scores. We only used workers based in the United States with a 95% approval rating and at least 5000 previous HITs. Our data collection interface was adapted from that of Das et al. (2016).
We collected a total of 5808 dialogues, based on 2236 unique scenarios (where a scenario is the available items and values for the two users). We held out a test set of 252 scenarios (526 dialogues). Holding out test scenarios means that models must generalise to new situations.
Figure 2 (caption fragment): The perspectives differ on their input goals, output choice, and in special tokens marking whether a statement was read or written. We train conditional language models to predict the dialogue given the input, and additional models to predict the output given the dialogue.

Likelihood Model
We propose a simple but effective baseline model for the conversational agent, in which a sequence-to-sequence model is trained to produce the complete dialogue, conditioned on an agent's input.

Data Representation
Each dialogue is converted into two training examples, showing the complete conversation from the perspective of each agent. The examples differ on their input goals, output choice, and whether utterances were read or written.
Training examples contain an input goal $g$, specifying the available items and their values, a dialogue $x$, and an output decision $o$ specifying which items each agent will receive. Specifically, we represent $g$ as a list of six integers corresponding to the count and value of each of the three item types. Dialogue $x$ is a list of tokens $x_{0..T}$ containing the turns of each agent interleaved with symbols marking whether a turn was written by the agent or their partner, terminating in a special token indicating one agent has marked that an agreement has been made. Output $o$ is six integers describing how many of each of the three item types are assigned to each agent. See Figure 2.
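The conversion of one dialogue into two perspective-specific training examples can be sketched as follows. The marker tokens (`YOU:`, `THEM:`, `<eos>`, `<selection>`), the interleaved (count, value) ordering of the goal, and the "own items first" ordering of the output are assumptions chosen for illustration; the text above only fixes that the goal and output each contain six integers and that turns carry read/written markers.

```python
def to_training_examples(counts, values, turns, allocation):
    """Build one training example per agent from a single dialogue.

    counts:     items per type, e.g. [1, 2, 3]
    values:     {"A": [...], "B": [...]} per-item values for each agent
    turns:      list of (speaker, text) pairs in dialogue order
    allocation: {"A": [...], "B": [...]} items of each type each agent receives
    """
    examples = {}
    for me, other in (("A", "B"), ("B", "A")):
        # Goal g: six integers, (count, value) for each of the three item types.
        goal = [n for pair in zip(counts, values[me]) for n in pair]
        tokens = []
        for speaker, text in turns:
            # Mark whether this agent wrote or read the turn.
            tokens.append("YOU:" if speaker == me else "THEM:")
            tokens.extend(text.lower().split())
            tokens.append("<eos>")  # hypothetical end-of-turn token
        tokens.append("<selection>")  # hypothetical agreement-reached token
        # Output o: six integers, this agent's items followed by the partner's.
        output = list(allocation[me]) + list(allocation[other])
        examples[me] = {"goal": goal, "dialogue": tokens, "output": output}
    return examples


example = to_training_examples(
    counts=[1, 2, 3],
    values={"A": [4, 3, 0], "B": [1, 0, 3]},
    turns=[("A", "i want the book and the hats"), ("B", "deal")],
    allocation={"A": [1, 2, 0], "B": [0, 0, 3]},
)
print(example["A"]["goal"], example["A"]["dialogue"][:3])
```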

Supervised Learning
We train a sequence-to-sequence network to generate an agent's perspective of the dialogue conditioned on the agent's input goals (Figure 3a).
The model uses 4 recurrent neural networks, implemented as GRUs: $\mathrm{GRU}_w$, $\mathrm{GRU}_g$, $\mathrm{GRU}_{\overrightarrow{o}}$ and $\mathrm{GRU}_{\overleftarrow{o}}$. The agent's input goals $g$ are encoded using $\mathrm{GRU}_g$. We refer to the final hidden state as $h_g$. The model then predicts each token $x_t$ from left to right, conditioned on the previous tokens and $h_g$. At each time step $t$, $\mathrm{GRU}_w$ takes as input the previous hidden state $h_{t-1}$, the previous token $x_{t-1}$ (embedded with a matrix $E$), and the input encoding $h_g$. Conditioning on the input at each time step helps the model learn dependencies between language and goals.
The token at each time step is predicted with a softmax, which uses weight tying with the embedding matrix $E$ (Mao et al., 2015):
$$p_\theta(x_t \mid x_{0..t-1}, g) \propto \exp(E^\top h_t)$$
Note that the model predicts both agents' words, enabling its use as a forward model in Section 5. At the end of the dialogue, the agent outputs a set of tokens $o$ representing the decision. We generate each output conditionally independently, using a separate classifier for each. The classifiers share a bidirectional $\mathrm{GRU}_o$ and an attention mechanism over the dialogue, and additionally condition on the input goals.
Figure 3: Our model: tokens are predicted conditioned on previous words and the input, then the output is predicted using attention over the complete dialogue. In supervised training (3a), we train the model to predict the tokens of both agents. During decoding and reinforcement learning (3b) some tokens are sampled from the model, but some are generated by the other agent and are only encoded by the model.
Each output token $o_j$ is predicted using a softmax. The model is trained to minimize the negative log likelihood of the token sequence $x_{0..T}$ conditioned on the input goals $g$, and of the outputs $o$ conditioned on $x$ and $g$. The two terms are weighted with a hyperparameter $\alpha$:
$$L(\theta) = -\sum_{x,g}\sum_{t} \log p_\theta(x_t \mid x_{0..t-1}, g) \;-\; \alpha \sum_{x,g,o}\sum_{j} \log p_\theta(o_j \mid x_{0..T}, g) \qquad (10)$$
The first term is the loss for predicting the dialogue tokens, and the second term is the output choice prediction loss.
Unlike the Neural Conversational Model (Vinyals and Le, 2015), our approach shares all parameters for reading and generating tokens.

Decoding
During decoding, the model must generate an output token $x_t$ conditioned on dialogue history $x_{0..t-1}$ and input goals $g$, by sampling from $p_\theta$:
$$x_t \sim p_\theta(x_t \mid x_{0..t-1}, g)$$
If the model generates a special end-of-turn token, it then encodes a series of tokens output by the other agent, until its next turn (Figure 3b).
The dialogue ends when either agent outputs a special end-of-dialogue token. The model then outputs a set of choices $o$. We choose each item independently, but enforce consistency by checking the solution is in a feasible set $O$:
$$o^* = \arg\max_{o \in O} \prod_{i} p_\theta(o_i \mid x_{0..T}, g) \qquad (12)$$
In our task, a solution is feasible if each item is assigned to exactly one agent. The space of solutions is small enough to be tractably enumerated.
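Because the feasible set is small, the constrained choice of output can be done by brute-force enumeration. The sketch below assumes a hypothetical layout in which the output has six slots (the agent's own count for each item type, followed by the partner's counts) and each slot classifier is given as a dictionary from count to probability; neither detail is specified by the text.

```python
from itertools import product


def feasible_outputs(counts):
    """All splits in which every item is assigned to exactly one agent."""
    outputs = []
    for mine in product(*(range(c + 1) for c in counts)):
        theirs = tuple(c - m for c, m in zip(counts, mine))
        outputs.append(mine + theirs)
    return outputs


def best_output(counts, slot_probs):
    """Pick the feasible output maximising the product of slot probabilities.

    slot_probs[i][n] is the probability that classifier i assigns to count n.
    """
    def score(output):
        p = 1.0
        for i, n in enumerate(output):
            p *= slot_probs[i].get(n, 0.0)
        return p

    return max(feasible_outputs(counts), key=score)


# One item of each type. Taken independently, slots 0 and 3 would both
# claim the single book, which is infeasible.
probs = [
    {0: 0.4, 1: 0.6}, {0: 0.1, 1: 0.9}, {0: 0.8, 1: 0.2},  # agent's own slots
    {0: 0.3, 1: 0.7}, {0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8},  # partner's slots
]
print(best_output([1, 1, 1], probs))
```

In this example the independent argmax of each slot is not a feasible split, so the enumeration returns the most probable consistent one instead.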

Goal-based Training
Supervised learning aims to imitate the actions of human users, but does not explicitly attempt to maximise an agent's goals. Instead, we explore pre-training with supervised learning, and then fine-tuning against the evaluation metric using reinforcement learning. Similar two-stage learning strategies have been used previously (e.g. Li et al. (2016); Das et al. (2017)).
During reinforcement learning, an agent A attempts to improve its parameters from conversations with another agent B. While the other agent B could be a human, in our experiments we used our fixed supervised model that was trained to imitate humans. The second model is fixed as we found that updating the parameters of both agents led to divergence from human language. In effect, agent A learns to improve by simulating conversations with the help of a surrogate forward model.
Agent A reads its goals $g$ and then generates tokens $x_{0..n}$ by sampling from $p_\theta$. When agent A generates an end-of-turn marker, it then reads in tokens $x_{n+1..m}$ generated by agent B. These turns alternate until one agent emits a token ending the dialogue. Both agents then output a decision $o$ and collect a reward from the environment (which will be 0 if they output different decisions). We denote the subset of tokens generated by A as $X^A$ (e.g. tokens with incoming arrows in Figure 3b).
After a complete dialogue has been generated, we update agent A's parameters based on the outcome of the negotiation. Let $r^A$ be the score agent A achieved in the completed dialogue, $T$ be the length of the dialogue, $\gamma$ be a discount factor that rewards actions at the end of the dialogue more strongly, and $\mu$ be a running average of completed dialogue rewards so far (see footnote 2). We define the future reward $R$ for an action $x_t \in X^A$ as follows:
$$R(x_t) = \gamma^{T-t}\,\left(r^A(o) - \mu\right)$$
We then optimise the expected reward of each action $x_t \in X^A$:
$$L^{RL}_\theta = \mathbb{E}_{x_t \sim p_\theta(x_t \mid x_{0..t-1}, g)}\left[R(x_t)\right]$$
The gradient of $L^{RL}_\theta$ is calculated as in REINFORCE (Williams, 1992):
$$\nabla_\theta L^{RL}_\theta = \sum_{x_t \in X^A} \mathbb{E}_{x_t}\left[R(x_t)\, \nabla_\theta \log p_\theta(x_t \mid x_{0..t-1}, g)\right]$$
Footnote 2: As all rewards are non-negative, we instead re-scale them by subtracting the mean reward found during self play. Shifting in this way can reduce the variance of our estimator.
[Algorithm listing for dialogue rollouts (see Goal-based Decoding); only fragments survive in this copy: sample $x_k \sim p_\theta(x_k \mid x_{0..k-1}, g)$; calculate rollout output and reward; if $R(u) > R(u^*)$ then ...; return $u^*$ (return best move).]
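The discounted, baseline-shifted reward and the resulting REINFORCE objective can be sketched in a few lines. This is a framework-free illustration under stated assumptions: the function names are hypothetical, the baseline $\mu$ is passed in rather than tracked, and `reinforce_surrogate` only returns the scalar whose gradient with respect to the log-probabilities would give the update; it does not perform any optimisation.

```python
def future_rewards(agent_steps, dialogue_length, score, baseline, gamma=0.95):
    """Future reward gamma^(T - t) * (r_A - mu) for each step t where A acted.

    agent_steps:     time steps t of the tokens generated by agent A
    dialogue_length: T, the length of the completed dialogue
    score:           r_A, the score agent A achieved
    baseline:        mu, a running average of past dialogue rewards
    """
    advantage = score - baseline
    return {t: gamma ** (dialogue_length - t) * advantage for t in agent_steps}


def reinforce_surrogate(log_probs, rewards):
    """Sum of R(x_t) * log p(x_t | ...) over A's tokens.

    Ascending the gradient of this quantity with respect to the model
    parameters corresponds to the REINFORCE update.
    """
    return sum(rewards[t] * log_probs[t] for t in rewards)


# Agent A acted at steps 0, 1 and 4 of a dialogue of length 5.
rewards = future_rewards([0, 1, 4], dialogue_length=5, score=8, baseline=6, gamma=0.5)
print(rewards)
```

Tokens near the end of the dialogue receive the largest weight, and a dialogue scoring below the baseline yields negative rewards, which lowers the probability of the sampled tokens.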

Goal-based Decoding
Likelihood-based decoding (§3.3) may not be optimal. For instance, an agent may be choosing between accepting an offer, or making a counter offer. The former will often have a higher likelihood under our model, as there are fewer ways to agree than to make another offer, but the latter may lead to a better outcome. Goal-based decoding also allows more complex dialogue strategies. For example, a deceptive utterance is likely to have a low model score (as users were generally honest in the supervised data), but may achieve high reward. We instead explore decoding by maximising expected reward. We achieve this by using $p_\theta$ as a forward model for the complete dialogue, and then deterministically computing the reward. Rewards for an utterance are averaged over samples to calculate expected future reward (Figure 4).
We use a two stage process: First, we generate $c$ candidate utterances $U = u_{0..c}$, representing possible complete turns that the agent could make, which are generated by sampling from $p_\theta$ until the end-of-turn token is reached. Let $x_{0..n-1}$ be the current dialogue history. We then calculate the expected reward $R(u)$ of candidate utterance $u = x_{n..n+k}$ by repeatedly sampling $x_{n+k+1..T}$ from $p_\theta$, then choosing the best output $o$ using Equation 12, and finally deterministically computing the reward $r(o)$. The reward is scaled by the probability of the output given the dialogue, because if the agents select different outputs then they both receive 0 reward:
$$R(x_{n..n+k}) = \mathbb{E}_{x_{n+k+1..T},\, o \,\sim\, p_\theta}\left[r(o)\, p_\theta(o \mid x_{0..T})\right]$$
We then return the utterance maximizing $R$:
$$u^* = \arg\max_{u \in U} R(u)$$
We use 5 rollouts for each of 10 candidate turns.
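The two-stage procedure can be sketched as below. The callables `sample_turn` and `rollout` are hypothetical stand-ins for the model: the first samples one candidate turn, the second samples one complete continuation of the dialogue and returns the reward of the chosen output together with the probability of that output. Only the control flow (candidates, repeated rollouts, probability-scaled averaging, argmax) is taken from the description above.

```python
def choose_utterance(sample_turn, rollout, history, num_candidates=10, num_rollouts=5):
    """Return the candidate turn with the highest estimated expected reward.

    sample_turn(history) -> list of tokens forming one candidate turn
    rollout(history)     -> (reward, output_prob) for one sampled completion
    """
    best_turn, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = sample_turn(history)
        total = 0.0
        for _ in range(num_rollouts):
            reward, output_prob = rollout(history + candidate)
            # Scale by the output probability: if the agents pick different
            # outputs, both receive zero reward.
            total += reward * output_prob
        score = total / num_rollouts
        if score > best_score:
            best_turn, best_score = candidate, score
    return best_turn, best_score


# Toy stand-ins for the model (assumptions for illustration only).
toy_candidates = iter([["accept"], ["counter"], ["accept"]])

def toy_sample_turn(history):
    return next(toy_candidates)

def toy_rollout(history):
    return (8.0, 0.9) if history[-1] == "counter" else (5.0, 1.0)

print(choose_utterance(toy_sample_turn, toy_rollout, ["hi"], num_candidates=3, num_rollouts=2))
```

In the toy example the counter offer is selected because its probability-scaled reward exceeds that of accepting, even though a likelihood-based decoder might prefer to accept.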

Training Details
We implement our models using PyTorch. All hyper-parameters were chosen on a development dataset. The input tokens are embedded into a 64-dimensional space, while the dialogue tokens are embedded with 256-dimensional embeddings (with no pre-training). The input $\mathrm{GRU}_g$ has a hidden layer of size 64 and the dialogue $\mathrm{GRU}_w$ is of size 128. The output $\mathrm{GRU}_{\overrightarrow{o}}$ and $\mathrm{GRU}_{\overleftarrow{o}}$ both have a hidden state of size 256, and the size of $h^s$ is 256 as well. During supervised training, we optimise using stochastic gradient descent with a minibatch size of 16, an initial learning rate of 1.0, Nesterov momentum with µ=0.1 (Nesterov, 1983), and clipping gradients whose $L_2$ norm exceeds 0.5. We train the model for 30 epochs and pick the snapshot of the model with the best validation perplexity. We then anneal the learning rate by a factor of 5 each epoch. We weight the terms in the loss function (Equation 10) using α=0.5. We do not train against output decisions where humans selected different agreements. Tokens occurring fewer than 20 times are replaced with an 'unknown' token.
During reinforcement learning, we use a learning rate of 0.1, clip gradients above 1.0, and use a discount factor of γ=0.95. After every 4 reinforcement learning updates, we make a supervised update with mini-batch size 16 and learning rate 0.5, and we clip gradients at 1.0. We used 4086 simulated conversations.
When sampling words from $p_\theta$, we reduce the variance by doubling the values of logits (i.e. using temperature of 0.5).
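The equivalence between doubling the logits and sampling at temperature 0.5 can be checked with a small standard-library sketch. The function names are hypothetical and the logits are made up for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.5, rng=random):
    """Sample a token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.0, 1.0, 2.0]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.5))  # same as softmax of doubled logits
```

Lowering the temperature concentrates probability mass on the highest-scoring tokens, which makes sampled utterances less variable.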

Comparison Systems
We compare the performance of the following: LIKELIHOOD uses supervised training and decoding (§3), RL is fine-tuned with goal-based self-play (§4), ROLLOUTS uses supervised training combined with goal-based decoding using rollouts (§5), and RL+ROLLOUTS uses rollouts with a base model trained with reinforcement learning.

Intrinsic Evaluation
For development, we measured the perplexity of user generated utterances, conditioned on the input and previous dialogue.
Results are shown in Table 3, and show that the simple LIKELIHOOD model produces the most human-like responses, and the alternative training and decoding strategies cause a divergence from human language. Note however, that this divergence may not necessarily correspond to lower quality language; it may also indicate different strategic decisions about what to say. Results in §6.4 show all models could converse with humans.
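For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses the standard definition with natural logarithms; how tokens are grouped and which tokens count as user generated are details not given here, so the input is simply assumed to be a flat list of log-probabilities for the tokens being scored.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```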

End-to-End Evaluation
We measure end-to-end performance in dialogues both with the likelihood-based agent and with humans on Mechanical Turk, on held out scenarios. Humans were told that they were interacting with other humans, as they had been during the collection of our dataset (and few appeared to realize they were in conversation with machines).
We measure the following statistics:
Score: The average score for each agent (which could be a human or model), out of 10.
Agreement: The percentage of dialogues where both agents agreed on the same decision.
Pareto Optimality: The percentage of Pareto optimal solutions for agreed deals (a solution is Pareto optimal if neither agent's score can be improved without lowering the other's score). Lower scores indicate inefficient negotiations.
Results are shown in Table 1. Firstly, we see that the RL and ROLLOUTS models achieve significantly better results when negotiating with the LIKELIHOOD model, particularly the RL+ROLLOUTS model. The percentage of Pareto optimal solutions also increases, showing a better exploration of the solution space. Compared to human-human negotiations (Table 2), the best models achieve a higher agreement rate, better scores, and similar Pareto efficiency. This result confirms that attempting to maximise reward can outperform simply imitating humans.
Similar trends hold in dialogues with humans, with goal-based reasoning outperforming imitation learning. The ROLLOUTS model achieves comparable scores to its human partners, and the RL+ROLLOUTS model actually achieves higher scores. However, we also find significantly more cases of the goal-based models failing to agree a deal with humans, largely a consequence of their more aggressive negotiation tactics (see §7).
Table 1 shows large gains from goal-based methods. In this section, we explore the strengths and weaknesses of our models.
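The Pareto optimality statistic defined in this section can be checked for an agreed deal by enumerating every alternative split. This is a minimal sketch with hypothetical function names, assuming a deal is described by how many items of each type agent A receives, with agent B receiving the remainder.

```python
from itertools import product


def deal_score(split, values):
    """Total value of the items in `split` under per-item `values`."""
    return sum(n * v for n, v in zip(split, values))


def is_pareto_optimal(counts, values_a, values_b, split_a):
    """True if no other split helps one agent without hurting the other."""
    split_b = [c - a for c, a in zip(counts, split_a)]
    score_a = deal_score(split_a, values_a)
    score_b = deal_score(split_b, values_b)
    for alt_a in product(*(range(c + 1) for c in counts)):
        alt_b = [c - a for c, a in zip(counts, alt_a)]
        alt_score_a = deal_score(alt_a, values_a)
        alt_score_b = deal_score(alt_b, values_b)
        no_worse = alt_score_a >= score_a and alt_score_b >= score_b
        strictly_better = alt_score_a > score_a or alt_score_b > score_b
        if no_worse and strictly_better:
            return False  # the alternative dominates the agreed deal
    return True


counts, values_a, values_b = [1, 2, 3], [4, 3, 0], [1, 0, 3]
print(is_pareto_optimal(counts, values_a, values_b, [1, 2, 0]))  # A: book + hats
print(is_pareto_optimal(counts, values_a, values_b, [1, 0, 0]))  # hats go to B, who values them at 0
```

In the second deal the hats are given to the agent who assigns them no value, so moving them to agent A improves A's score at no cost to B.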

Analysis
Goal-based models negotiate harder. The RL+ROLLOUTS model has much longer dialogues with humans than LIKELIHOOD (7.2 turns vs. 5.3 on average), indicating that the model is accepting deals less quickly, and negotiating harder.
A negative consequence of this more aggressive negotiation strategy is that humans were more likely to walk away with no deal, which is reflected in the lower agreement rates. Even though failing to agree was worth 0 points, people often preferred this course over capitulating to an uncompromising opponent. This factor is not well captured by the simulated partner in reinforcement learning training or rollouts (as reflected by the larger gains from goal-based models in dialogues with the LIKELIHOOD model). In particular, the goal-based models are prone to simply rephrasing the same demand each turn, which is a more effective strategy against the LIKELIHOOD model than humans. Future work should address this issue. Figure 5 shows an example of our goal-based model stubbornly negotiating until it achieves a good outcome.
Models learn to be deceptive. Deception can be an effective negotiation tactic. We found numerous cases of our models initially feigning interest in a valueless item, only to later 'compromise' by conceding it. Figure 7 shows an example.

Models produce meaningful novel sentences.
One interesting question is whether our models are capable of generating novel sentences in the new circumstances they find themselves in, or if they simply repeat messages from the training data verbatim. We find that 76% of messages produced by the LIKELIHOOD model in self-play were found in the training data. We manually examined the novel utterances produced by our model, and found that the overwhelming majority were fluent English sentences in isolation, showing that the model has learnt a good language model for the domain (in addition to results that show it uses language effectively to achieve its goals). These results suggest that although neural models are prone to the safer option of repeating sentences from training data, they are capable of generalising when necessary. Future work should choose domains that force a higher degree of diversity in utterances.
Maintaining multi-sentence coherence is challenging. One common linguistic error we see RL+ROLLOUTS make is to start a message by indicating agreement (e.g. I agree or Deal), but then go on to propose a counter offer, a behaviour that human partners found frustrating. One explanation is that the model has learnt that in the supervised data, messages beginning with I agree are often at the end of the dialogue, and partners rarely reply with further negotiation, so the models using rollouts and reinforcement learning believe this tactic will help their offer to be accepted.

Related Work
Most work on goal orientated dialogue systems has assumed that state representations are annotated [...] (2017) use task-specific rules to combine the task input and dialogue history into a more structured state representation than ours.
Reinforcement learning (RL) has been applied in many dialogue settings.
RL has been widely used to improve dialogue managers, which manage transitions between dialogue states (Singh et al., 2002; Pietquin et al., 2011; Rieser and Lemon, 2011; Gašic et al., 2013; Fatemi et al., 2016). In contrast, our end-to-end approach has no explicit dialogue manager. Li et al. (2016) improve metrics such as diversity for non-goal-orientated dialogue using RL, which would make an interesting extension to our work. Das et al. (2017) use reinforcement learning to improve cooperative bot-bot dialogues. RL has also been used to allow agents to invent new languages (Das et al., 2017; Mordatch and Abbeel, 2017). To our knowledge, our model is the first to use RL to improve the performance of an end-to-end goal orientated dialogue system in dialogues with humans.
Work on learning end-to-end dialogues has concentrated on 'chat' settings, without explicit goals (Ritter et al., 2011; Vinyals and Le, 2015; Li et al., 2015). These dialogues contain a much greater diversity of vocabulary than our domain, but do not have the challenging adversarial elements. Such models are notoriously hard to evaluate (Liu et al., 2016), because of the huge diversity of reasonable responses, whereas our task has a clear objective. Our end-to-end approach would also be much more straightforward to integrate into a general-purpose dialogue agent than one that relied on annotated dialogue states (Dodge et al., 2016).
There is a substantial literature on multi-agent bargaining in game theory, e.g. Nash Jr (1950). There has also been computational work on modelling negotiations (Baarslag et al., 2013). Our work differs in that agents communicate in unrestricted natural language, rather than pre-specified symbolic actions, and in our focus on improving performance relative to humans rather than other automated systems. Our task is based on that of DeVault et al. (2015), who study natural language negotiations for pedagogical purposes. Their version includes speech rather than textual dialogue, and embodied agents, which would make interesting extensions to our work. The only automated natural language negotiation systems we are aware of have first mapped language to domain-specific logical forms, and then focused on choosing the next dialogue act (Rosenfeld et al., 2014; Cuayáhuitl et al., 2015; Keizer et al., 2017). Our end-to-end approach is the first to learn comprehension, reasoning and generation skills in a domain-independent, data-driven way.
Our use of a combination of supervised and reinforcement learning for training, and stochastic rollouts for decoding, builds on strategies used in game playing agents such as AlphaGo (Silver et al., 2016). Our work is a step towards real-world applications for these techniques. Our use of rollouts could be extended by choosing the other agent's responses based on sampling, using Monte Carlo Tree Search (MCTS) (Kocsis and Szepesvári, 2006). However, our setting has a higher branching factor than in domains where MCTS has been successfully applied, such as Go (Silver et al., 2016). Future work should explore scaling tree search to dialogue modelling.

Conclusion
We have introduced end-to-end learning of natural language negotiations as a task for AI, arguing that it challenges both linguistic and reasoning skills while having robust evaluation metrics. We gathered a large dataset of human-human negotiations, which contain a variety of interesting tactics. We have shown that it is possible to train dialogue agents end-to-end, but that their ability can be much improved by training and decoding to maximise their goals, rather than likelihood. There remains much potential for future work, particularly in exploring other reasoning strategies, and in improving the diversity of utterances without diverging from human language. We will also explore other negotiation tasks, to investigate whether models can learn to share negotiation strategies across domains.