Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, learned without any human supervision! In this paper, using a Task & Talk reference game between two agents as a testbed, we present a sequence of ‘negative’ results culminating in a ‘positive’ one: while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional. In essence, we find that natural language does not emerge ‘naturally’, despite the semblance of ease of natural-language emergence that one may gather from recent literature. We discuss how it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.


Introduction
One fundamental goal of artificial intelligence (AI) is the development of goal-driven dialog agents: specifically, agents that can perceive their environment (through vision, audition, or other sensors), and communicate with humans or other agents in natural language towards some goal.
While historically such agents have been based on slot filling (Lemon et al., 2006), the dominant paradigm today is neural dialog models (Bordes and Weston, 2016; Weston, 2016; Serban et al., 2016a,b) trained on large quantities of data. Perhaps somewhat counterintuitively, this current paradigm treats dialog as a static supervised learning problem, rather than as the interactive agent learning problem that it naturally is. Specifically, a typical pipeline is to collect a large dataset of human-human dialog (Lowe et al., 2015; Das et al., 2017a; de Vries et al., 2017; Mostafazadeh et al., 2017), inject a machine in the middle of a dialog from the dataset, and supervise it to mimic the human response. While this teaches the agent correlations between symbols, it does not convey the functional meaning of language, grounding (mapping words to physical concepts), compositionality (combining knowledge of simpler concepts to describe richer concepts), or aspects of planning (understanding the goal of the conversation).
An alternative paradigm that has a long history (Winograd, 1971; Kirby et al., 2014) and is witnessing a recent resurgence (Wang et al., 2016; Foerster et al., 2016; Sukhbaatar et al., 2016; Jorge et al., 2016; Lazaridou et al., 2017; Havrylov and Titov, 2017; Mordatch and Abbeel, 2017; Das et al., 2017b) is situated language learning. A number of recent works have proposed reinforcement learning techniques to learn communication protocols of agents situated in virtual environments in a completely end-to-end manner, from perceptual input (e.g. pixels) to communication (discrete symbols without any prespecified meanings) to action (e.g. signaling in reference games or navigating in an environment), and have simultaneously found the emergence of grounded human-interpretable (often compositional) language in the communication protocols developed by the agents, without any human supervision or pretraining, simply to succeed at the task.
In this paper, we study the following question: what are the conditions that lead to the emergence of human-interpretable or compositional grounded language? Our key finding is that natural language does not emerge 'naturally' in multi-agent dialog, despite the semblance of ease of natural-language emergence in multi-agent games that one may gather from recent literature. Specifically, in a sequence of 'negative' results culminating in a 'positive' one, we find that while agents always successfully invent communication protocols and languages to achieve their goals with near-perfect accuracies, the invented languages are decidedly not compositional, interpretable, or 'natural'; and that it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.
Related work and novelty. The starting point for our investigation is the recent work of Das et al. (2017b), who proposed a cooperative reference game between two agents, where communication is necessary to accomplish the goal due to an information asymmetry. Our key contribution over Das et al. (2017b) is an exhaustive study of the conditions that must be present before compositional grounded language emerges, and subtle but important differences in execution: tabular Q-Learning (which does not scale) vs. REINFORCE (which does), and generalization to novel environments (not studied in prior work). We hope our findings shed more light on the interpretability of languages invented in cooperative multi-agent settings, place recent work in appropriate context, and inform fruitful directions for future work.

The Task & Talk Game
Our testbed is a cooperative reference game (Task & Talk) between two agents, Q-BOT and A-BOT. The game is grounded in a synthetic world of objects described by three attributes (color, style, and shape), each with four possible values, for a total of 4 × 4 × 4 = 64 objects. Fig. 1a shows all the possible attribute values.
Task & Talk plays out over multiple rounds of dialog. At the start, A-BOT is given an object unseen by Q-BOT, e.g. (green, dotted, square). On the other side, Q-BOT is assigned a task $G$ (unknown to A-BOT) consisting of two attributes, e.g. (color, style), and the goal is for Q-BOT to discover these two attributes of the hidden object through dialog with A-BOT. Specifically, Q-BOT and A-BOT exchange utterances from finite vocabularies $V_Q$ and $V_A$ over two rounds, with Q-BOT speaking first. The game culminates in Q-BOT guessing a pair of attribute values, e.g. (green, dotted), and both agents are rewarded identically based on the accuracy of Q-BOT's prediction.
Note that the Task & Talk game setting involves an informational asymmetry between the agents: A-BOT sees the object while Q-BOT does not; similarly, Q-BOT knows the task while A-BOT does not. Thus, two-way communication is necessary for success. Without this asymmetry, A-BOT could simply convey the target attributes from the task without Q-BOT having to speak. Such a setting has been widely studied in economics and game theory as the classic Lewis Signaling (LS) game (Lewis, 2008). By necessitating dialog between agents, we are able to ground both $V_A$ and $V_Q$ in our final setting (Sec. 4.3).
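The world and task space described above are small enough to enumerate directly. The following is a minimal sketch, not the authors' code: the attribute values are the ones listed in Table 1, while the dictionary layout and the helper names (`all_instances`, `all_tasks`, `ground_truth`) are assumptions made for illustration.

```python
from itertools import permutations, product

# Attribute values as listed in Table 1 of this paper; the data layout is an assumption.
ATTRIBUTES = {
    "color": ["blue", "purple", "green", "red"],
    "shape": ["triangle", "square", "circle", "star"],
    "style": ["dotted", "filled", "dashed", "solid"],
}


def all_instances():
    """Enumerate every object as a dict {attribute: value}: 4 * 4 * 4 = 64 objects."""
    names = list(ATTRIBUTES)
    return [dict(zip(names, values))
            for values in product(*(ATTRIBUTES[name] for name in names))]


def all_tasks():
    """Enumerate ordered pairs of distinct attributes: 3 * 2 = 6 tasks."""
    return list(permutations(ATTRIBUTES, 2))


def ground_truth(instance, task):
    """Attribute-value pair that Q-BOT has to predict for this instance and task."""
    return tuple(instance[attribute] for attribute in task)


instances = all_instances()
tasks = all_tasks()
example = {"color": "green", "shape": "square", "style": "dotted"}
print(len(instances), len(tasks))                 # 64 6
print(ground_truth(example, ("color", "style")))  # ('green', 'dotted')
```

The six ordered tasks match the 6-dimensional one-hot task encoding used by Q-BOT in the next section.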

Modeling Q-BOT and A-BOT
We formalize Q-BOT and A-BOT as agents operating in a partially observable world and optimize their policies using deep reinforcement learning.
States and Actions. Each agent observes its own input (task $G$ for Q-BOT and object instance $I$ for A-BOT) and the output of the other agent as a stochastic environment. At the beginning of round $t$, Q-BOT observes state $s_t^Q = [G, q_1, a_1, \ldots, q_{t-1}, a_{t-1}]$ and acts by uttering some token $q_t$ from its vocabulary $V_Q$. Similarly, A-BOT observes the history and this new utterance as state $s_t^A = [I, q_1, a_1, \ldots, q_{t-1}, a_{t-1}, q_t]$ and emits a response $a_t$ from $V_A$. At the last round, Q-BOT takes a final action by predicting a pair of attribute values $\hat{w}^G = (\hat{w}^G_1, \hat{w}^G_2)$ to solve the task.
Figure 2: Policy networks for Q-BOT and A-BOT. At each round $t$ of dialog, (1) Q-BOT generates a question $q_t$ from its speaker network conditioned on its state encoding $S_{t-1}^Q$, (2) A-BOT encodes $q_t$ conditioned on instance y encoded via instance encoder, updates its state encoding $S_t^A$, and generates an answer $a_t$, (3) Q-BOT encodes the $(q_t, a_t)$ pair, while A-BOT encodes the answer it sampled, (4) Q-BOT updates its state to $S_t^Q$, predicts an attribute pair via prediction LSTM at round $T$, and receives a reward.
Cooperative Reward. Both Q-BOT and A-BOT are rewarded identically based on the accuracy of Q-BOT's prediction $\hat{w}^G$, receiving a positive reward of $R = 1$ if the prediction matches ground truth $w^G$ and a negative reward of $R = -10$ otherwise. We arrive at these values empirically based on the speed of convergence in our experiments.
Policy Networks.
We model Q-BOT and A-BOT as operating under stochastic policies $\pi_Q(q_t \mid s_t^Q; \theta_Q)$ and $\pi_A(a_t \mid s_t^A; \theta_A)$ respectively, which we instantiate as LSTM-based models. We use lower case characters (e.g. $s_t^Q$) to denote the strings (e.g. Q-BOT's token at round $t$), and upper case $S_t^Q$ to denote the corresponding vector as encoded by the model.
As shown in Fig. 2, Q-BOT is modeled with three modules: speaking, listening, and prediction. The task $G$ is received as a 6-dimensional one-hot encoding over the space of possible tasks and embedded via the listener LSTM. At each round $t$, the speaker network models the probability of output utterances $q_t \in V_Q$ based on the state $S_{t-1}^Q$. This is modeled as a fully-connected layer followed by a softmax that transforms $S_{t-1}^Q$ to a distribution over $V_Q$. After receiving the reply $a_t$ from A-BOT, the listener LSTM updates the state by processing both tokens of the dialog exchange. In the final round, the prediction LSTM is unrolled twice to produce Q-BOT's prediction based on the final state $S_T^Q$ and the task $G$. As before, task $G$ is fed in as a one-hot encoding to the prediction LSTM for two time steps, resulting in a pair of outputs used as the prediction $\hat{w}^G$.
Analogously, A-BOT is modeled as a combination of a speaker network, a listener LSTM, and an instance encoder. As in Q-BOT, the speaker network models the probability of utterances $a_t \in V_A$ given the state $S_t^A$, and the listener LSTM updates the state $S_t^A$ based on dialog exchanges. The instance encoder embeds each one-hot attribute vector via a linear layer and concatenates all three encodings to obtain a unified instance representation.
Learning Policies with REINFORCE. To train these agents, we update policy parameters $\theta_Q$ and $\theta_A$ using the popular REINFORCE (Williams, 1992) policy gradient algorithm. Note that while the game is fully cooperative, we do not assume full observability of one agent by another, opting instead to treat each agent as part of the stochastic environment when updating the other. We will now derive the parameter gradients for our setup.
Recall that our agents take actions, namely utterances ($q_t$ and $a_t$) and the attribute prediction ($\hat{w}^G$), and our objective is to maximize the expected reward under the agents' policies:
$$J(\theta_A, \theta_Q) = \mathbb{E}_{\pi_Q, \pi_A}\left[ R(\hat{w}^G, w^G) \right]$$
Though the agents receive the reward at the end of gameplay, all intermediate actions are assigned the same reward $R$. Following the REINFORCE algorithm, we write the gradient of this expectation as an expectation of policy gradients. For $\theta_Q$, we derive this explicitly at a time step $t$:
$$\nabla_{\theta_Q} J = \nabla_{\theta_Q} \sum_{q_t, a_t} \pi_Q(q_t \mid s_t^Q)\, \pi_A(a_t \mid s_t^A)\, R(\cdot) = \sum_{q_t, a_t} \pi_Q(q_t \mid s_t^Q)\, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s_t^Q)\, \pi_A(a_t \mid s_t^A)\, R(\cdot) = \mathbb{E}_{\pi_Q, \pi_A}\left[ R(\cdot)\, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s_t^Q) \right]$$
Similarly, the gradient w.r.t. $\theta_A$, i.e., $\nabla_{\theta_A} J$, will be:
$$\nabla_{\theta_A} J = \mathbb{E}_{\pi_Q, \pi_A}\left[ R(\cdot)\, \nabla_{\theta_A} \log \pi_A(a_t \mid s_t^A) \right]$$
As is standard practice, we estimate these expectations with sample averages: sampling an environment (object instance and task), then sampling a dialog between Q-BOT and A-BOT, culminating in a prediction from Q-BOT and the received reward. The REINFORCE update rule above has an intuitive interpretation: an informative dialog $(q_t, a_t)$ that leads to positive reward will be made more probable (positive gradient), while a poor exchange leading to negative reward will be pushed down (negative gradient).
Implementation Details. All our models are implemented using the PyTorch deep learning framework (footnote 1). To represent instances, we learn a 20-dimensional embedding for each possible attribute value and concatenate the three instance attributes to obtain a final instance representation of size 60. Tokens from $V_Q$ and $V_A$ are encoded as one-hot vectors and then embedded into 20-dimensional vectors. Both A-BOT and Q-BOT learn their own token embeddings without sharing. The listener networks in both agents are implemented as LSTMs with a hidden layer size of 50 dimensions. All modules within an agent are initialized using the Xavier method (Glorot and Bengio, 2010).
We use 1000 episodes of two-round dialogs to compute policy gradients, and perform updates with the Adam optimizer (Kingma and Ba, 2015) using a learning rate of 0.01. Furthermore, gradients are clipped to [−5.0, 5.0]. For faster convergence, 80% of the training episodes for the next iteration are drawn from instances misclassified by the current network, while the remaining episodes are sampled randomly from all instances. Our code is publicly available (footnote 2).
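The intuition behind the REINFORCE update, and two of the implementation details above (gradient clipping and oversampling of misclassified instances), can be illustrated on a toy softmax policy over three tokens. This is a hedged sketch rather than the authors' implementation: it replaces the LSTM policies with a bare vector of logits, uses plain gradient ascent instead of Adam, and the function names `reinforce_step` and `sample_instances` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.01, clip=5.0):
    """One REINFORCE ascent step on the logits of a softmax policy.

    For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi_k,
    so the sampled gradient estimate is reward * (onehot(action) - probs).
    Plain gradient ascent is an assumption here; the paper uses Adam.
    """
    probs = softmax(logits)
    grads = [reward * ((1.0 if k == action else 0.0) - p)
             for k, p in enumerate(probs)]
    grads = [max(-clip, min(clip, g)) for g in grads]  # clip to [-5.0, 5.0]
    return [x + lr * g for x, g in zip(logits, grads)]


def sample_instances(all_ids, misclassified_ids, n_episodes=1000,
                     hard_fraction=0.8, rng=None):
    """Draw 80% of episodes from misclassified instances, the rest from all."""
    rng = rng or random.Random(0)
    n_hard = round(n_episodes * hard_fraction) if misclassified_ids else 0
    hard = [rng.choice(misclassified_ids) for _ in range(n_hard)]
    rest = [rng.choice(all_ids) for _ in range(n_episodes - n_hard)]
    return hard + rest


logits = [0.0, 0.0, 0.0]
rewarded = softmax(reinforce_step(logits, action=0, reward=1.0))
punished = softmax(reinforce_step(logits, action=0, reward=-10.0))
print(rewarded[0] > 1 / 3, punished[0] < 1 / 3)  # True True
```

A token that led to the reward R = 1 becomes more probable, while the same token followed by R = −10 becomes less probable, which is the behaviour described in the text.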

The Road to Compositionality
This section details our key observation: while the agents always successfully invent a language to solve the game with near-perfect accuracies, the invented languages are decidedly not compositional, interpretable, or 'natural' (e.g. A-BOT ignoring Q-BOT's utterances and simply encoding every object with a unique symbol if the vocabulary is sufficiently large). In our setting, the language being compositional simply amounts to the ability of the agents to communicate the compositional atoms of a task (e.g. shape or color) and an instance (e.g. square or blue) independently.
Through this section, we present a series of settings that get progressively more restrictive to coax the agents towards adopting a compositional language, providing analysis of the learned languages and 'cheating' strategies developed along the way. Tab. 2 summarizes results for all settings. In all experiments, optimal policies (achieving near-perfect training rewards) were found. For each setting, we provide qualitative analysis of the learned languages and report their ability to generalize to unseen instances. We use 80% of the object instances for training and the remaining 20% for evaluation.
Footnote 1: github.com/pytorch/pytorch
Footnote 2: github.com/batra-mlp-lab/lang-emerge

Overcomplete Vocabularies
We begin with the simplest setting where both A-BOT and Q-BOT are given arbitrarily large vocabularies. We find that when $|V_A|$ is greater than the number of instances (64), the learned policy mostly ignores what Q-BOT asks and instead has A-BOT convey the instance using pairs of symbols across rounds that are unique to an instance, e.g., both token pairs $(a_1, a_2) = (14, 31)$ and $(40, 1)$ convey (red, triangle, filled), as shown in Fig. 3. Notice that this means no 'dialog' is necessary, and amounts to each agent having a codebook that maps symbols to object instances. In essence, this setting has collapsed to an analog of the Lewis Signaling (LS) game, with A-BOT signaling its complete world state and Q-BOT simply reporting the target attributes. Fig. 3 shows more examples of this behavior.
Perhaps as expected, the generalization of this language to unseen instances is quite poor (success rate: 25.6%).The adopted strategy of mapping instances to token pairs fails for test instances containing novel combinations of attributes for which the agents lack an agreed-upon code from training.
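The codebook strategy, and the reason it cannot generalize, can be sketched as a plain lookup table. This is an illustrative assumption about the structure of the learned policy, not a reproduction of it: the token pairs assigned below are arbitrary, and `build_codebook` and `decode` are hypothetical helper names.

```python
def build_codebook(train_instances, vocab_size=64):
    """Give every training instance its own answer-token pair (a1, a2).

    The specific token values are arbitrary (an assumption); what matters is
    that each pair identifies a whole instance, not an individual attribute.
    """
    return {instance: divmod(index, vocab_size)
            for index, instance in enumerate(train_instances)}


def decode(codebook, token_pair):
    """Q-BOT's side of the codebook: map a token pair back to an instance."""
    inverse = {tokens: instance for instance, tokens in codebook.items()}
    return inverse.get(token_pair)


# Instances are (color, shape, style) tuples.
train = [("red", "triangle", "filled"), ("blue", "square", "dotted")]
codebook = build_codebook(train)

seen = ("red", "triangle", "filled")
print(decode(codebook, codebook[seen]) == seen)  # True

# A held-out instance recombines attribute values that were all seen in
# training, yet the agents have no agreed-upon code for it.
unseen = ("red", "square", "dotted")
print(unseen in codebook)  # False
```

Because each code names a whole object, knowing the codes for (red, triangle, filled) and (blue, square, dotted) says nothing about how to signal (red, square, dotted).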
It seems clear that, as in human communication (Nowak et al., 2000), a limited vocabulary that cannot possibly encode the richness of the world is necessary for non-trivial dialog to emerge. We explore such a setting next.

Attribute-Value Vocabulary
Since our world has 3 attributes (shape/color/style) and 4 + 4 + 4 = 12 possible settings of their states, one may believe that the intuitive choice of $|V_Q| = 3$ and $|V_A| = 12$ will be enough to circumvent the 'cheating' enumeration strategy from the previous experiment. Surprisingly, we find that the new language learned in this setting is not only decidedly non-compositional but also very difficult to interpret! Some observations follow from Fig. 4, which shows sample dialogs for this setting.
We observe that Q-BOT uses only the first round to convey the task to A-BOT by encoding tasks in an order-agnostic fashion, as: (style, shape), (shape, style) → X; (color, shape), (shape, color) → Y; and (color, style), (style, color) → Z. Thus, multiple rounds of utterances from Q-BOT are rendered unnecessary, and we find the second round is inconsistent across instances even for the same task. For example, symmetric tasks (color, shape) and (shape, color) from the first row of Fig. 4 induce $q_1$ = Y as the first token from Q-BOT.
Given the task from Q-BOT in the first round, A-BOT only needs to identify one of the 4 × 4 = 16 attribute pairs for a given task. Rather than ground its symbols into individual states, A-BOT follows a 'set partitioning' strategy, i.e. A-BOT identifies a pair of attributes with a unique combination of round 1 and round 2 utterances (i.e. the round 2 utterance has no meaning independent from round 1). Thus, symbols are reused across tasks to describe different attributes (i.e. symbols do not have individual consistent groundings). This 'set partitioning' strategy is consistent with known results from game theory on Nash equilibria in 'cheap talk' games (Crawford and Sobel, 1982).
This strategy has improved generalization to unseen instances because it is able to communicate the task; however, it fails on unseen attribute value combinations because it is not compositional.
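The two behaviours described in this setting can be sketched as follows. The task-to-token mapping is the one reported above; the 'set partitioning' code is only an illustration built from a seeded shuffle, since the actual assignment learned by A-BOT is not listed in the text. The names `first_question` and `set_partition_code` are hypothetical.

```python
import random
from itertools import product

# Order-agnostic task encoding used by Q-BOT in round 1 (as reported above).
TASK_TOKEN = {
    frozenset({"style", "shape"}): "X",
    frozenset({"color", "shape"}): "Y",
    frozenset({"color", "style"}): "Z",
}


def first_question(task):
    """Symmetric tasks such as (color, shape) and (shape, color) share a token."""
    return TASK_TOKEN[frozenset(task)]


def set_partition_code(values_a, values_b, vocab_size=12, seed=0):
    """Assign each of the 16 attribute-value pairs a distinct (a1, a2) combination.

    The shuffle stands in for whatever arbitrary assignment training converges
    to (an assumption): an individual token carries no meaning on its own.
    """
    pairs = list(product(values_a, values_b))
    codes = list(product(range(1, vocab_size + 1), repeat=2))  # 144 combinations
    random.Random(seed).shuffle(codes)
    return dict(zip(pairs, codes))


colors = ["blue", "purple", "green", "red"]
shapes = ["triangle", "square", "circle", "star"]
code = set_partition_code(colors, shapes)

print(first_question(("color", "shape")), first_question(("shape", "color")))  # Y Y
print(len(code), len(set(code.values())))  # 16 16
```

With 12 × 12 = 144 available two-token combinations and only 16 pairs to distinguish per task, there is ample room for such an arbitrary, non-compositional code.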

Memoryless A-BOT, Minimal Vocabulary
The key problem with the previous setting is that A-BOT's utterances mean different things based on the round of dialog ($a_1 = 1$ is different from $a_2 = 1$). Essentially, the communication protocol is over-parameterized and we must limit it further. First, we limit A-BOT's vocabulary to $|V_A| = 4$ to reduce the number of 'synonyms' the agents learn. Second, we eliminate A-BOT's capability to identify different rounds of interaction by removing A-BOT's memory. In other words, we reset the state vector $S^A$ at each time step so that A-BOT can no longer distinguish rounds from one another. By doing so, we hypothesize that Q-BOT must now ground its own and A-BOT's tokens consistently across rounds to be able to communicate with a memoryless A-BOT.
These restrictions result in a learned language that grounds individual symbols into attributes and their states. For example, Q-BOT learns that Y → shape, X → color, and Z → style. Q-BOT does not, however, learn to always utter these symbols in the same order as the task, e.g. asking for shape first for both (color, shape) and (shape, color).
Notice that this is perfectly valid as Q-BOT can later re-arrange the attributes in the task desired order.Similarly, A-BOT learns mappings to attribute values for each attribute query that remain consistent regardless of round (i.e. when asked for color, 1 always means blue).
This is similar to learned languages reported in recent works and is most closely related to Das et al. (2017b), who solve this problem by taking away Q-BOT's state rather than A-BOT's memory.Their approach of taking away task G from Q-BOT's state can be interpreted as Q-BOT 'forgetting' the task after interacting with A-BOT.However, this behavior of Q-BOT to remember the task only during dialog but not while predicting is somewhat unnatural compared to our setting.
(a) A-BOT
Token   color    shape      style
1       blue     triangle   dotted
2       purple   square     filled
3       green    circle     dashed
4       red      star       solid
(b) Q-BOT
Task                              q_1, q_2
(color, shape), (shape, color)    Y, X
(shape, style), (style, shape)    Y, Z
(color, style)                    Z, X
(style, color)                    X, Z
Table 1: Emergence of compositional grounding for language learnt by the agents. A-BOT (Tab. 1a) learns a consistent mapping across rounds, depending on the query attribute. Token grounding for Q-BOT (Tab. 1b) depends on the task at hand. Though compositional, Q-BOT does not necessarily query attributes in the order of the task, but instead re-arranges accordingly at prediction time as it contains memory.
Tab. 1 enumerates the learnt groundings for both agents. Given this mapping, we can predict a plausible dialog between the agents for any unseen instance and task combination. Notice that this is possible only due to the compositionality in the emergent language between the two agents. For example, consider solving the task (shape, color) for an instance (red, square, filled) from Fig. 7(b). Q-BOT queries Y (shape) and X (color) across two rounds, and receives 2 (square) and 4 (red) as answers. More examples, along with the grounded meaning of each token, are shown in Fig. 5.
Intuitively, this consistently grounded and compositional language has the greatest ability to generalize among the settings we have explored, correctly answering 74.4% of the held-out instances. We note that errors in this setting seem to be largely due to A-BOT giving incorrect answers despite Q-BOT asking the correct questions to accomplish the task. A plausible reason could be the model approximation error stemming from the instance encoder, as test instances are unseen and have novel attribute combinations.
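Because the groundings in Table 1 are compositional, a dialog for any instance and task can be predicted by table lookup alone. The sketch below encodes Table 1 as dictionaries and replays the (shape, color) example; the data layout and the function names `predict_dialog` and `predict_answer` are assumptions for illustration, not part of the released code.

```python
# Table 1a: A-BOT's answer token for each attribute value.
A_BOT = {
    "color": {"blue": 1, "purple": 2, "green": 3, "red": 4},
    "shape": {"triangle": 1, "square": 2, "circle": 3, "star": 4},
    "style": {"dotted": 1, "filled": 2, "dashed": 3, "solid": 4},
}
# Grounding of Q-BOT's tokens, as reported in the text.
QUERY_MEANING = {"X": "color", "Y": "shape", "Z": "style"}
# Table 1b: the two questions Q-BOT asks for each task.
Q_BOT = {
    ("color", "shape"): ("Y", "X"),
    ("shape", "color"): ("Y", "X"),
    ("shape", "style"): ("Y", "Z"),
    ("style", "shape"): ("Y", "Z"),
    ("color", "style"): ("Z", "X"),
    ("style", "color"): ("X", "Z"),
}


def predict_dialog(instance, task):
    """Return the predicted exchange as a list of (question, answer) pairs."""
    dialog = []
    for question in Q_BOT[task]:
        attribute = QUERY_MEANING[question]
        dialog.append((question, A_BOT[attribute][instance[attribute]]))
    return dialog


def predict_answer(instance, task):
    """Decode the answers and re-arrange them into the order the task asks for."""
    decoded = {}
    for question, answer in predict_dialog(instance, task):
        attribute = QUERY_MEANING[question]
        token_to_value = {tok: val for val, tok in A_BOT[attribute].items()}
        decoded[attribute] = token_to_value[answer]
    return tuple(decoded[attribute] for attribute in task)


instance = {"color": "red", "shape": "square", "style": "filled"}
print(predict_dialog(instance, ("shape", "color")))  # [('Y', 2), ('X', 4)]
print(predict_answer(instance, ("color", "shape")))  # ('red', 'square')
```

Note that (color, shape) and (shape, color) produce the same dialog; only the final re-arrangement differs, which is why Q-BOT needs to remember the task at prediction time.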

Evolution of Language
As demonstrated by the previous sections, even though compositional language is one of the optimal policies, the agents tend to learn other equally useful forms of communication. Thus, compositional language does not naturally emerge without an explicit need for it. Even in situations where compositionality does emerge, perhaps it is more interesting to analyze the process of emergence than the learnt language itself. Therefore, we present such a study that explicitly identifies when each symbol has been grounded by the agents in the training timeline, along with implications thereof on the performance on the Task & Talk game.

Dialog Trees
When two agents (Q-BOT and A-BOT) converse with each other, they can be seen as traversing through a dialog tree, a subtree of which is depicted in Fig. 6. Simply put, a dialog tree is an enumeration of all possible dialogs represented in the form of a tree, with levels of the tree corresponding to the round of interaction. To elaborate, consider a partial dialog tree for the (shape, color) task shown in Fig. 6 for the setting in Sec. 4.3. For Q-BOT's first token $q_1$ = Y, A-BOT has $|V_A| = 4$ plausible replies, shown as a 4-way branch. In general, the dialog tree for Task & Talk contains a total of $|V_Q|^2 |V_A|^2$ leaves and is 4 levels deep. We use the dialog between the agents to descend and land in one of these leaves.
Dialog trees offer an interesting alternate view of our learning problem. The goal of learning communication between the two agents can be equivalently seen as mapping (instance, task) pairs to one of the dialog tree leaves. Each leaf is labeled with an attribute pair used to accomplish the prediction task. For example, if solving (shape, color) for (blue, triangle, solid) results in the dialog Y→1→X→1, we descend the dialog tree along the corresponding path and assign the tuple (blue, triangle, solid, shape, color) to the resulting leaf. In the case of a compositional, grounded dialog, all tuples of the form (blue, triangle, *, shape, color) would get mapped to the same leaf, which can then be labeled as (triangle, blue) to successfully solve the task. Note the wildcard style attribute in the tuple above, as it is irrelevant for this particular task.
In the following section, we use dialog trees to explore the evolution of language as learnt by the two agents in the memoryless A-BOT, minimal vocabulary setting in Sec. 4.3.
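The leaf count and the leaf assignment described above can be checked with a short enumeration. This sketch assumes the minimal-vocabulary setting of Sec. 4.3 (three question tokens, four answer tokens) and the groundings of Table 1, restricted to the (shape, color) task; the variable and function names are illustrative only.

```python
from collections import defaultdict
from itertools import product

V_Q = ["X", "Y", "Z"]
V_A = [1, 2, 3, 4]

# Every leaf is one complete dialog (q1, a1, q2, a2).
leaves = list(product(V_Q, V_A, V_Q, V_A))

# Groundings from Table 1 (attribute value -> answer token).
A_BOT = {
    "color": {"blue": 1, "purple": 2, "green": 3, "red": 4},
    "shape": {"triangle": 1, "square": 2, "circle": 3, "star": 4},
    "style": {"dotted": 1, "filled": 2, "dashed": 3, "solid": 4},
}
QUERY_MEANING = {"X": "color", "Y": "shape", "Z": "style"}
TASK = ("shape", "color")
QUESTIONS = ("Y", "X")  # Q-BOT's questions for this task in Table 1b


def dialog_path(instance):
    """Path (q1, a1, q2, a2) that a grounded dialog follows through the tree."""
    path = []
    for question in QUESTIONS:
        attribute = QUERY_MEANING[question]
        path += [question, A_BOT[attribute][instance[attribute]]]
    return tuple(path)


# Assign every (instance, task) tuple to the leaf its dialog lands in.
tree = defaultdict(set)
for color, shape, style in product(A_BOT["color"], A_BOT["shape"], A_BOT["style"]):
    instance = {"color": color, "shape": shape, "style": style}
    tree[dialog_path(instance)].add((color, shape, style) + TASK)

leaf = tree[("Y", 1, "X", 1)]
print(len(leaves))  # 144
print(len(tree))    # 16
print(sorted(leaf))
```

All four tuples in the leaf Y→1→X→1 are blue triangles that differ only in style, so the leaf can be labeled (triangle, blue), and only 16 of the 144 leaves are ever used for this task.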

Evolution Timeline
To gain further insight into the languages learned, we create a language evolution plot, shown in Fig. 7. Specifically, at regular intervals during policy learning, we construct dialog trees. At some point in the learning, the nodes in the tree become and stay 'pure' (all (instance, task) tuples at the node are identical), at which point we can say that the agents have learned this dialog subsequence. Fig. 7 depicts a timeline of concepts learned at various nodes of the trees during training. We next describe the procedure to identify when a particular 'concept' has been grounded by the agents in their language.
Construction. After constructing dialog trees at regular intervals, we identify 'concepts' at each node/leaf using the dialog tree of the completely trained model, which achieves perfect accuracy on the train set. A concept is simply the common trend among all the (instance, task) tuples either assigned to a leaf or contained within the subtree with a node as root. To illustrate, the concept of the top-right leaf in Fig. 6 is (blue, triangle, *, shape, color), i.e., all instances assigned to that leaf for the (shape, color) task are blue triangles. Next, given a resultant concept for each node/leaf, we backtrack in time and check for the first occurrence when only tuples which satisfy the corresponding concept are assigned to that particular node/leaf. In other words, we compute the earliest time when a node/leaf is 'pure' with respect to its final learned concept. Finally, we plot these leaves/nodes and the associated concept with their backtracked time to get Fig. 7.
Observations. We highlight the key observations from Fig. 7 below:
(a) The agents ground most of the tasks initially at around epoch 20. Specifically, Q-BOT assigns Y to (shape, style), (style, shape), (shape, color), and (color, shape), while (color, style) is mapped to Z. Hence, Q-BOT learns its first token very early in the training procedure, at around 20 epochs.
(b) The only other task, (style, color), is grounded towards the end (around epoch 170) using X, leading to an immediate convergence.
(c) We see a strong correlation between improvement in performance and when agents learn a language grounding. In particular, there is an improvement from 40% to 80% within a span of 25 epochs where most of the grounding is achieved, as seen from Fig. 7.
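The backtracking step in the construction above can be sketched as a scan over stored snapshots of one node. This is a minimal illustration under stated assumptions: snapshots are given as (epoch, set of tuples) pairs in chronological order, an empty node is not counted as pure, and the names `matches` and `first_pure_epoch` are hypothetical.

```python
def matches(assigned_tuple, concept):
    """True if a tuple satisfies a concept, where '*' in the concept is a wildcard."""
    return all(c == "*" or c == value
               for value, c in zip(assigned_tuple, concept))


def first_pure_epoch(snapshots, concept):
    """Earliest epoch at which only concept-satisfying tuples sit at the node.

    `snapshots` is a chronological list of (epoch, set_of_tuples) for one
    node/leaf. Treating an empty node as not yet pure is an assumption.
    """
    for epoch, assigned in snapshots:
        if assigned and all(matches(t, concept) for t in assigned):
            return epoch
    return None


concept = ("blue", "triangle", "*", "shape", "color")
snapshots = [
    (0, {("blue", "triangle", "solid", "shape", "color"),
         ("red", "square", "solid", "shape", "color")}),
    (10, {("blue", "triangle", "solid", "shape", "color"),
          ("blue", "circle", "dotted", "shape", "color")}),
    (20, {("blue", "triangle", "solid", "shape", "color"),
          ("blue", "triangle", "dotted", "shape", "color")}),
    (30, {("blue", "triangle", "dashed", "shape", "color")}),
]
print(first_pure_epoch(snapshots, concept))  # 20
```

Repeating this scan for every node and leaf, with the concept read off the fully trained tree, yields the time stamps that are plotted in the timeline.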

Conclusion
In conclusion, we presented a sequence of 'negative' results culminating in a 'positive' one, showing that while most invented languages are effective (i.e. achieve near-perfect rewards), they are decidedly not interpretable or compositional. Our goal is simply to improve understanding and interpretability of invented languages in multi-agent dialog, place recent work in context, and inform fruitful directions for future work.

Figure 1: (a) Task & Talk: The testbed for our study is a cooperative 2-player game, Task & Talk, grounded in a synthetic world of objects with 4 shapes × 4 colors × 4 styles. (b) Q-BOT is assigned a task: to inquire about the state of an ordered pair of attributes. (c) An example gameplay between the two agents: Q-BOT asks questions depending on the task, which are answered by A-BOT conditioned on the hidden instance visible only to itself. At the end, Q-BOT makes a prediction of a pair of attributes (purple, square).

Figure 3: Overcomplete vocabularies setting ($|V_Q| = |V_A| = 64$, Sec. 4.1). Owing to a large vocabulary, we denote the tokens using numbers, as opposed to English alphabet characters shown in other figures. A-BOT mostly ignores what Q-BOT asks and instead conveys the instance using pairs of symbols across rounds unique to an instance, leading to a language that is highly unintuitive to humans and non-compositional.

Figure 4: Attribute and Value vocabulary setting ($|V_Q| = 3$, $|V_A| = 12$, Sec. 4.2). We show symmetric tasks for each instance on either side to illustrate the similarities in the language between the agents. As seen here, Q-BOT maps symmetric tasks in an order-agnostic fashion, and uses only the first token to convey task information to A-BOT.

Figure 5: Example dialogs for memoryless A-BOT, minimal vocabulary setting ($|V_Q| = 3$, $|V_A| = 4$, Sec. 4.3). Learnt language is consistent and grounded, denoted below each token. Incorrect predictions on unseen instances (right, bottom) are also shown. Notice that the incorrectly predicted attribute is still from the right category (a color attribute for color).
7(b).Q-BOT queries Y (shape) and X (color) across two rounds, and receives 2 (square) and 4 (red) as answers.More examples along with grounded meaning of each tokens are shown in Fig. 5.
BOT uses both rounds to convey task • Consistent A-BOT grounding across rounds • Good generalization to unseen instances

Figure 6: Dialog tree for memoryless A-BOT and minimal vocabulary setting (Sec. 4.3), shown only for one task (shape, color). Every dialog between the agents results in a tree traversal beginning from the root, e.g., Y→1→X→1 lands us in the top-right leaf. See text for more details.

Figure 7: Evolution of Language: timeline shows groundings learned by the agents during training, overlaid on the accuracy. Note that Q-BOT learns encodings for all tasks early (around epoch 20) except (style, color). Improvement in accuracy is strongly correlated with groundings learnt.

Table 2: Overview of settings we explore to analyze the language learnt by two agents in a cooperative game, Task & Talk. The last two columns measure generalization in terms of prediction accuracy of both or at least one of the attribute pair, on a held-out test set containing unseen instances.