Multi-agent Communication meets Natural Language: Synergies between Functional and Structural Language Learning

We present a method for combining multi-agent communication and traditional data-driven approaches to natural language learning, with an end goal of teaching agents to communicate with humans in natural language. Our starting point is a language model that has been trained on generic, not task-specific language data. We then place this model in a multi-agent self-play environment that generates task-specific rewards used to adapt or modulate the model, turning it into a task-conditional language model. We introduce a new way for combining the two types of learning based on the idea of reranking language model samples, and show that this method outperforms others in communicating with humans in a visual referential communication task. Finally, we present a taxonomy of different types of language drift that can occur alongside a set of measures to detect them.


Introduction
In this work, we aim at making agents communicate with humans in natural language. Our starting point is a language model that has been trained on generic, not task-specific language data. We then place this model in a multi-agent communication environment that generates task-specific rewards, which are used to adapt or modulate the model, making it task-conditional. We thus propose to decompose the problem of learning language use into two components: learning "what" to say based on a given situation, and learning "how" to say it. The "what" is the essence of communication that underlies our intentions and is chosen by maximizing any given utility, making it a functional, utility-driven process. On the other hand, the "how" is a surface realization of our intentions, i.e., the words we use to communicate this "what" successfully. This factorization into content planning (here, "what") and surface realization (here, "how") moves us away from end-to-end neural generation systems and is in line with traditional methods of natural language generation (Reiter and Dale, 1997). More importantly, it enables us to bring together two different strands of research: traditional data-driven natural language learning and multi-agent communication.

* All authors contributed equally.
Traditional approaches to natural language learning (Kneser and Ney, 1995; Mikolov et al., 2010; Sutskever et al., 2014; Vinyals and Le, 2015) are based on inferring structural properties of language from text corpora, often in a passive regime, dissociated from communication. While this type of learning is great for learning general statistical associations between symbols (e.g., adjectives come before nouns) and even inferring semantic relations, it ignores the functional aspects of communication, i.e., the fact that people use words to coordinate with others and make things happen in the world (Wittgenstein, 1953; Austin, 1975). On the other hand, multi-agent communication research (Foerster et al., 2016; Lazaridou et al., 2017; Havrylov and Titov, 2017; Evtimova et al., 2017; Lee et al., 2019) puts communication at the heart of agents' (language) learning. Implemented within a multi-agent reinforcement learning setup, agents start tabula rasa and form communication protocols that maximize task rewards. While this purely utilitarian framework results in agents that successfully learn to solve the task by creating a communication protocol, these emergent communication protocols do not bear core properties of natural language. Chaabouni et al. (2019) show that protocols found through emergent communication, unlike natural language, do not conform to Zipf's Law of Abbreviation; Kottur et al. (2017) find that communication protocols do not follow compositionality patterns of natural language; and Lazaridou et al. (2018) find emerged protocols to be sensitive to different experimental conditions. This growing set of alarming results on emergent communication raises doubts about the use of this type of functional learning as a viable alternative to language learning.
Concluding that neither approach on its own is adequate for learning language use, we propose a method for combining the best of both worlds. Generic language data can be used effectively as a good prior model of language, encapsulating its intrinsic structural properties, i.e., they are only used for the "how" in the form of generic language models. Conversely, multi-agent interactions, which provide rewards specific to the task of interest, now only need to be used for the functional learning of language use, i.e., learning the "what".¹

The contributions of this paper are as follows. First, we propose a general research program of language learning that combines two learning signals coming from multi-agent communication and traditional data-driven natural language learning techniques. We present a concrete study in the context of a referential communication game (see Section 2) between a speaker and a listener, where the traditional data-driven language learning takes the form of image captioning, and the functional learning takes the form of agent self-play (see Section 3). We then present a new approach for combining the two learning signals, i.e., reward-learned rerankers (see Section 4), and compare this to existing approaches using a human study (see Section 5). We discuss shortcomings of this program with respect to different types of language drift that can occur, and introduce a number of automatic measures to detect them (see Section 6). Finally, we show how such a program under oracle rewards can be a viable approach moving towards learning language use from human rewards (see Section 7).

¹ About the terminology: by 'traditional data-driven natural language learning', we mean language modelling of the next-word-prediction variety. This type of learning does not involve any use of the language or other context, and as such only focuses on word statistics. Since the structure of the language is a large part of those statistics, and the role of the generic language models in our proposed combined systems is to provide structural knowledge of language, we also use the term 'structural learning'. We contrast this with the purely usage-driven, reward-based learning of the type seen in emergent communication research. Since the function, rather than the structure or statistics, is the only thing that matters for such a learner, we also use the term 'functional learning'.

Research framing
Our research can be framed in the following scenario. An agent needs to perform a functional communication task in a natural language (in this work, English). However, examples of linguistic communication about this functional task are not available; the only natural language data that can be used consist of examples of generic natural language, which are not grounded in the functional task. Recasting the task as a multi-agent language game provides a way to obtain a reward that judges whether an utterance elicited the correct behaviour from a listener.

Experimental setup
In this work, we instantiate the research in the following way: the functional task is a visual referential communication game for a target image in the context of a distractor, the reward is based on success in referential communication where a listener is tasked to pick the correct image within distractors guided by the speaker's description, and the generic natural language data are captioning data.
Visual referential communication game. There are two players, the speaker and the listener. The speaker sees a target object and needs to communicate an utterance about it in the context of distractors; both target and distractors are represented as images. The listener is presented with the same set of images, but without the knowledge of which is the target, and needs to identify the target image relying on the utterance being communicated by the speaker. The utterance takes the form of sequences of word-like units. If the listener's choice is correct they both receive a positive reward, else they receive the same negative reward.²

Dataset and referential splits. For playing the visual referential communication game, we use a multi-modal dataset, Abstract Scenes (Zitnick and Parikh, 2013), which contains 10k synthetic images accompanied by descriptive captions (on average 6 per image) (see Figure 1).³ The captions typically refer to diverse aspects of the scene (characters and actions), providing a rich and challenging environment for an agent to evolve the captioning skills for successful communication. In our experiments, we split the dataset into 80/10/10 for train/validation/test sets. We use the test images to create two referential splits, i.e., easy and difficult, as a function of the similarity between the target and distractor images. Each split contains 1000 pairs of a target and a distractor.

Human performance and setup validation. In order to assess the difficulty of the task in the presence of the particular data (images and captions), we perform a human study in the reference game with a human speaker and a human listener, where the human speaker can only communicate one of the existing captions of the target image. We perform the human study under two conditions. In the first condition, the human speaker only has access to the ground-truth captions and does not have access to the distractor image, and thus has to pick a random caption. This corresponds to perfect structural knowledge of English but no knowledge of the functional task, and it is the human upper bound on a captioning system's performance on this task. In the second condition, the speaker has access to both the ground-truth captions and the distractor image, and is thus able to pick a discriminative caption to communicate. For each condition, we collect 50 rounds of games and present results in Table 1. We see that the task-specific condition outperforms the first condition, indicating that in our current setup there is enough room to improve upon models based on structural-only learning (i.e., captioning models). Moreover, the good performance of the discriminative caption speaker demonstrates that (in principle) the captioning data can be used in successful communication with a human for this task.

[Figure 1: example scene with captions: "Jenny is scared of the bear", "Mike is scared of the bear", "Jenny and Mike sit by a fire", "Jenny and Mike are sitting", "A bear is scaring mike and jenny"]

Table 1: Accuracy performance of a human listener with a human speaker producing either random or discriminative caption on the easy and difficult splits.

² The task we consider is essentially discriminative image captioning (Vedantam et al., 2017; Dai and Lin, 2017; Andreas and Klein, 2016). Here we are using it as a placeholder of a communication task to illustrate our general framework. Thus, we are not incorporating any explicit bias in the model about this particular task. The only task-specific information we use is communicated via the reward.

³ Other multi-modal datasets like MSCOCO (Lin et al., 2014) or Flickr (Thomee et al., 2016), while providing complex naturalistic images, often have a repetitive set of captions, highlighting one particular aspect of the scene, and suffer from a human reporting bias (Misra et al., 2016). By using Abstract Scenes, we have left certain visual challenges out of the scope of the work, obtained cleaner multi-modal associations between words and objects, and focused on the language use for referential communication.
Multi-agent communication setup

Speaker
The speaker is the primary learner in this research: we aim to create a model that is able to use natural language in a communicative scenario. It consists of standard visual and language modules. To convert images to embeddings $u$, we use a pretrained ResNet (He et al., 2016) (parametrized by $\theta_{\mathrm{resnet}}$) and feed its last-layer output into a one-layer MLP (parametrized by $\theta^{S}_{\mathrm{MLP}}$). To generate a message $m$, we use a one-layer LSTM (Hochreiter and Schmidhuber, 1997), adding embeddings $u$ at each time step as additional context. Section 4 presents different speaker models consisting of these modules.
We also design two oracle speakers (with no weights) that have direct access to ground-truth captions of images at test time. The random caption speaker outputs one of the ground-truth captions for the target image at random. Since this speaker is not aware of the functional goal, its performance will indicate whether having only good grounded language skills is enough for communication success in our setup. We also build an oracle speaker that is task-aware: the discriminative caption speaker uses a simple word-overlap heuristic to pick the target's caption that has the least word overlap with any of the distractor's captions (the score is normalized by the captions' length excluding stop-words).
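The word-overlap heuristic of the discriminative caption speaker can be illustrated with a minimal sketch. The stop-word list, the whitespace tokenization, and the choice to score each target caption by its highest overlap with any single distractor caption are assumptions made for illustration; the text above does not specify these details.

```python
# Hypothetical stop-word list; the actual list is not given in the text.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "by", "in", "on", "to"}


def content_words(caption):
    """Lower-case a caption, split on whitespace, and drop stop-words."""
    return [w for w in caption.lower().split() if w not in STOP_WORDS]


def overlap_score(target_caption, distractor_captions):
    """Highest length-normalized word overlap with any distractor caption."""
    target = content_words(target_caption)
    if not target:
        return 1.0  # a caption without content words cannot discriminate
    worst = 0.0
    for caption in distractor_captions:
        distractor = set(content_words(caption))
        overlap = sum(1 for w in target if w in distractor) / len(target)
        worst = max(worst, overlap)
    return worst


def discriminative_caption(target_captions, distractor_captions):
    """Pick the target caption with the smallest overlap score."""
    return min(target_captions, key=lambda c: overlap_score(c, distractor_captions))


target_captions = ["Jenny is scared of the bear", "Jenny and Mike sit by a fire"]
distractor_captions = ["Mike is scared of the bear", "A bear is scaring Mike and Jenny"]
chosen = discriminative_caption(target_captions, distractor_captions)
```

In this toy example the second target caption is chosen, since the fire is mentioned in none of the distractor captions.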

Listener
Throughout the experiments, we need a way to estimate performance on the functional communication task, either for evaluation or to provide rewards during training acting as a scaffolding to learn the speaker model. Ideally, this performance signal should be provided by a human who is interacting online with the speaker agent. While we do so for evaluation reasons, for training we approximate this quantity with a learned component, an agent listener.
To convert images to embeddings $u$, we use the same pre-trained ResNet as for the speaker and feed its last-layer output into a one-layer MLP (parametrized by $\theta^{L}_{\mathrm{MLP}}$). Following that, the listener uses an LSTM (parametrized by $\theta^{L}_{\mathrm{LSTM}}$) to embed the utterance $m$ received from the speaker, creating embedding $v$. The listener then picks the image with the highest dot-product similarity between the embedded message $v$ and the embeddings $u_t$ and $u_d$ of the target and distractor. Since we know which image candidate is the intended referent, we cast this problem as supervised learning and update the listener's weights $\theta^{L} = \{\theta^{L}_{\mathrm{MLP}}, \theta^{L}_{\mathrm{LSTM}}\}$ by optimizing cross-entropy. Finally, the listener assigns reward 1 to the speaker if the listener identified the correct image, else reward -1.
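The listener's decision rule, its supervised loss, and the reward it returns can be sketched as follows. This is only an illustration: the message and image embeddings are passed in as plain lists of floats, standing in for the outputs of the LSTM and the MLP, and all function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def listener_probs(v, image_embeddings):
    """Softmax over dot-product similarities between message and images."""
    scores = [dot(v, u) for u in image_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listener_step(v, image_embeddings, target_index):
    """Return the listener's choice, its cross-entropy loss, and the speaker's reward."""
    probs = listener_probs(v, image_embeddings)
    choice = max(range(len(probs)), key=probs.__getitem__)
    loss = -math.log(probs[target_index])          # supervised loss for the listener
    reward = 1 if choice == target_index else -1   # reward sent to the speaker
    return choice, loss, reward


# Toy embeddings: the message is closer to the first image.
choice, loss, reward = listener_step([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], target_index=0)
```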
We consider two different setups: a joint listener, which is trained together with the speaker, as commonly done in the emergent communication literature, and a fixed listener that is pre-trained to perform best-response to the oracle discriminative caption speaker and stays fixed throughout the learning of the speakers with the sole use of providing them rewards. We expect the latter setup to be less prone to language drift issues due to the grounding of the discriminative caption speaker to language data, thus potentially resulting in better communication with human listeners. We also use the fixed listener for evaluation of all speakers.

Methods for learning language use
We describe ways to estimate the speaker's generative model $p_{\theta^{S}}(m \mid u, t)$ for message $m$, conditioned on target and distractor embeddings $u = [u_t, u_d]$ and target image index $t \in \{0, 1\}$.

Functional-only learning
This type of learning language use is identical to experiments commonly conducted in the literature of emergent communication (Lazaridou et al., 2017; Havrylov and Titov, 2017; Bouchacourt and Baroni, 2018; Evtimova et al., 2017; Graesser et al., 2019), i.e., the speaker learns to emit communication utterances $m$ in order to maximize the communication task reward (see Section 3.2 for a discussion on how this reward is computed). Concretely, the weights $\theta^{S} = \{\theta^{S}_{\mathrm{MLP}}, \theta^{S}_{\mathrm{LSTM}}\}$ of the speaker policy $\pi_{\theta^{S}}(m \mid u, t)$ are updated via the REINFORCE update rule (Williams, 1992) using rewards $r^{L}$ provided by the listener, i.e., we optimize the expected listener reward, with vocabulary size $|V| = 100$ and message length $I = 10$. Note that while this type of learning results in a language that is maximally functionally correct for the given task reward, this language is not natural language, i.e., the symbols are not grounded to natural language.
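As an illustration of the REINFORCE update, the sketch below uses a surrogate loss of the form $-r^{L} \sum_i \log p(m_i)$, which is the standard form of the estimator and is assumed here rather than taken from the text. For brevity the policy is a table of per-position logits instead of an LSTM conditioned on image embeddings; all names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_message(logits_per_step, rng):
    """Sample one symbol per position; return the symbols and their total log-probability."""
    message, log_prob = [], 0.0
    for logits in logits_per_step:
        probs = softmax(logits)
        symbol = rng.choices(range(len(probs)), weights=probs)[0]
        message.append(symbol)
        log_prob += math.log(probs[symbol])
    return message, log_prob


def reinforce_loss(log_prob, reward):
    """Surrogate loss whose gradient is the REINFORCE estimator."""
    return -reward * log_prob


def reinforce_grad_logits(logits_per_step, message, reward):
    """Gradient of the surrogate loss with respect to the logits of every position."""
    grads = []
    for logits, symbol in zip(logits_per_step, message):
        probs = softmax(logits)
        grads.append([-reward * ((1.0 if k == symbol else 0.0) - p)
                      for k, p in enumerate(probs)])
    return grads


rng = random.Random(0)
logits = [[0.0, 0.0], [0.0, 0.0]]   # two positions, two symbols, uniform policy
message, log_prob = sample_message(logits, rng)
grads = reinforce_grad_logits(logits, message, reward=1)
```

With a positive reward, a gradient-descent step on these logits raises the probability of the symbols that were sampled; with a negative reward it lowers them.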

Structural-only learning
This type of learning ignores the functional aspect of communication and communicates utterances that reflect intrinsic structural properties of language, i.e., that are fluent, grammatical and related to the target. Here, we use paired data in the form $\langle u, c \rangle$, where $u$ is a visual embedding and $c$ is the associated caption, and learn an image captioning model. The speaker's parameters $\theta^{S} = \{\theta^{S}_{\mathrm{MLP}}, \theta^{S}_{\mathrm{LSTM}}\}$ are optimized using cross-entropy, i.e., the negative log-likelihood $-\sum_{i} \log p_{\theta^{S}_{\mathrm{LSTM}}}(c_i \mid c_{<i}, u_t)$ of the caption tokens, with $|V| = 2685$ (the vocabulary size) and $I = 25$, i.e., the length of the longest caption in the dataset. We approximate the speaker model $p_{\theta^{S}}(m \mid u, t)$ with the captioning one, which ignores the distractor and thus the communication task. We construct two speakers with different decoding schemes: greedy uses greedy decoding, while sample picks the highest probability message among $k = 20$ stochastic samples (temperature $\tau = 2.0$).
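The two decoding schemes can be contrasted with a small sketch. The captioning model is replaced by a hypothetical function `toy_next_probs` that returns a fixed next-token distribution, and scoring the samples under the untempered model is an assumption, since the text does not say under which distribution "highest probability" is measured.

```python
import math
import random


def apply_temperature(probs, tau):
    """Return the distribution proportional to p_i ** (1 / tau)."""
    scaled = [p ** (1.0 / tau) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def greedy_decode(next_probs, max_len, eos):
    """Pick the most probable token at every step."""
    message = []
    for _ in range(max_len):
        probs = next_probs(message)
        token = max(range(len(probs)), key=probs.__getitem__)
        message.append(token)
        if token == eos:
            break
    return message


def sample_decode(next_probs, max_len, eos, k, tau, rng):
    """Draw k tempered samples and keep the one with the highest log-probability."""
    best, best_lp = None, -math.inf
    for _ in range(k):
        message, lp = [], 0.0
        while len(message) < max_len:
            probs = next_probs(message)
            tempered = apply_temperature(probs, tau)
            token = rng.choices(range(len(probs)), weights=tempered)[0]
            lp += math.log(probs[token])  # assumption: score under the untempered model
            message.append(token)
            if token == eos:
                break
        if lp > best_lp:
            best, best_lp = message, lp
    return best, best_lp


def toy_next_probs(prefix):
    """Hypothetical stand-in for the captioning model: a fixed 3-token distribution."""
    return [0.6, 0.3, 0.1]


greedy_message = greedy_decode(toy_next_probs, max_len=3, eos=2)
sampled_message, sampled_lp = sample_decode(
    toy_next_probs, max_len=3, eos=2, k=20, tau=2.0, rng=random.Random(0))
```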

Structural and functional learning
We now describe several ways in which both types of learning are used to learn language use. In all cases, we equip the speaker with a base image captioning model similar to the one presented in Section 4.2, which is used to calculate $p_{\theta^{S}_{\mathrm{LSTM}}}(c_i \mid c_{<i}, u_t)$. The functional part is learned via the REINFORCE update rule optimizing the task reward (i.e., the listener's accuracy in the referential task). However, speakers differ in how they parametrize $p_{\theta^{S}}(m \mid u, t)$ and whether the task reward is used to update the weights $\{\theta^{S}_{\mathrm{MLP}}, \theta^{S}_{\mathrm{LSTM}}\}$ of the base captioning model.

Reward finetuning
The simplest approach is to first use existing pretrained components for which we have available corpora in order to learn the statistical properties of language, and then steer the language use to be functionally appropriate using reward finetuning for the given task. We use paired data in the form $\langle u, c \rangle$ to learn the weights $\theta^{S} = \{\theta^{S}_{\mathrm{MLP}}, \theta^{S}_{\mathrm{LSTM}}\}$ of a base image captioning model following Section 4.2, and then we perform functional learning by using the listener's reward to optimize the weights $\theta^{S}$ as in Section 4.1. While this method is conceptually simple, it becomes challenging when the task requires extending the conditioning part of the base model. Here, we need to change the conditioning of the base captioning model from $u = u_t$ to $u = [u_t; u_d]$, to allow conditioning on the distractor. Since this is not trivial (the base image captioning model has been learned by conditioning only on one image embedding), we keep the conditioning $u = u_t$ also during finetuning with REINFORCE. Thus, similar to the image captioning model, we approximate $p_{\theta^{S}}(m \mid u, t)$ with $p_{\theta^{S}_{\mathrm{LSTM}}}(m \mid u_t)$. However, unlike image captioning, the information about distractors flows into the policy, since the weights $\theta^{S}$ are optimized using the listener's reward, which considers distractors.
Since the gradients from optimizing the functional task are sent all the way into the base captioning model, this causes catastrophic forgetting of the core knowledge of language, leading to language drift. Thus, we use a language regularizer term in the form of Kullback-Leibler divergence between pre-trained and fine-tuned language modeling distributions (Havrylov and Titov, 2017).
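A language regularizer of this kind can be sketched as a KL penalty added to the REINFORCE surrogate loss. The direction of the divergence (fine-tuned relative to pre-trained), the per-step summation, and the weight `beta` are assumptions for illustration; the text only states that a KL term between the two language modeling distributions is used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(reward, log_prob, finetuned_steps, pretrained_steps, beta):
    """REINFORCE surrogate loss plus a KL penalty summed over time steps."""
    reinforce = -reward * log_prob
    penalty = sum(kl_divergence(p, q)
                  for p, q in zip(finetuned_steps, pretrained_steps))
    return reinforce + beta * penalty


# One time step, two-word vocabulary: the fine-tuned model has collapsed onto word 0.
loss = regularized_loss(
    reward=1,
    log_prob=math.log(0.5),
    finetuned_steps=[[1.0, 0.0]],
    pretrained_steps=[[0.5, 0.5]],
    beta=0.1,  # hypothetical regularization weight
)
```

The penalty is zero when the fine-tuned distribution matches the pre-trained one and grows as the policy moves away from it.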

Multi-task learning
An alternative is to conduct both types of learning (i.e., image captioning and functional learning) at the same time (Lazaridou et al., 2017; Lee et al., 2019). This takes the form of multi-task learning optimizing $\lambda_f L_{\mathrm{functional}} + \lambda_s L_{\mathrm{structural}}$, where $\lambda_f = 1$. As in reward finetuning, the gradients of the reward learning flow into the weights of a base captioning model, leaving us with questions about a trade-off between task success and quality of language. Therefore, we introduce two variants of this model depending on the importance of the language component, i.e., one variant with $\lambda_s = 0.1$ and a language-regularized one with $\lambda_s = 1$.

Reward-learned rerankers
Finally, we introduce a new way of learning language use in the multi-agent communication setup. As before, we train the core language capabilities of a speaker using the image captioning task objective, but after this pre-training phase, the weights of this model are frozen. The functional part is then viewed as learning to use this general knowledge of language grounded in images. This is operationalized as learning to rerank samples obtained from the captioning model, optimizing the listener's reward. The action space of this speaker consists of sentences, as opposed to the words commonly used in the literature of emergent communication. We emphasize that by leveraging the idea of reranking, we are able to take a task-unconditional model, i.e., a captioning model that only conditions on the target, and extend its conditioning, turning it into a task-conditional model, i.e., a discriminative captioning model that also conditions on the distractor.
Below we consider two concrete reranker models. In both cases, the message generation happens in two steps. First, we sample $|S| = 20$ candidates from the pre-trained and fixed image captioning model $p_{\theta^{S}_{\mathrm{LSTM}}}(m \mid u_t)$. Then, we pick the best sample $s$ using a task-conditional reranking score $p(s \mid u, t)$. The reranking score can be viewed as a new policy $\pi_{\theta^{S}}(s \mid u, t)$ that operates in the space of samples $S$ drawn from the task-unconditional model. This policy introduces an additional set of trainable parameters $\theta^{S}_{\mathrm{rerank}}$.

Product of experts reranker. In this model we parametrize the policy as a product of experts (PoE): $\pi_{\theta^{S}}(s \mid u, t) \propto p(s \mid u, t)^{\lambda_f} \, p(s \mid u_t)^{\lambda_s}$, where $u = [u_t; u_d]$ and $\lambda_f = 1$. The second term is the image captioning message probability, renormalized over the sample space, thus bringing general language knowledge grounded in images. The first term adjusts for the task specifics. To model that, we re-embed the samples using transformed bag-of-words, thus the trainable parameters of the reranker $\theta^{S}_{\mathrm{rerank}}$ are word embeddings and additional MLP weights. We combine target and distractor embeddings into a single vector and compute the dot-product similarity between this vector and each of the bag-of-words representations of samples. Finally, these scores are passed through a softmax layer to obtain $p(s \mid u, t)$. We introduce two variants of the model, one with $\lambda_s = 0$ and a language-regularized one with $\lambda_s = 1$.
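The PoE combination over a set of samples can be sketched as follows. The dot-product scores of the first expert and the captioning log-probabilities of the second are taken as given inputs (hypothetical values here), so the sketch only covers how the two terms are combined and renormalized over the samples, not how the bag-of-words embeddings are learned.

```python
import math


def log_softmax(scores):
    """Log of the softmax of a list of scores, computed stably."""
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - lse for s in scores]


def poe_policy(task_scores, caption_log_probs, lambda_f=1.0, lambda_s=1.0):
    """Policy over samples proportional to p(s|u,t)^lambda_f * p(s|u_t)^lambda_s."""
    log_task = log_softmax(task_scores)          # first expert, normalized over samples
    log_prior = log_softmax(caption_log_probs)   # captioning prior, renormalized over samples
    combined = [lambda_f * a + lambda_s * b for a, b in zip(log_task, log_prior)]
    return [math.exp(x) for x in log_softmax(combined)]


def rerank(task_scores, caption_log_probs, lambda_f=1.0, lambda_s=1.0):
    """Index of the sample with the highest policy probability."""
    policy = poe_policy(task_scores, caption_log_probs, lambda_f, lambda_s)
    return max(range(len(policy)), key=policy.__getitem__)


# Two hypothetical samples: the task expert prefers sample 1, the prior prefers sample 0.
task_scores = [0.0, math.log(3.0)]
caption_log_probs = [math.log(0.2), math.log(0.1)]
regularized = poe_policy(task_scores, caption_log_probs, lambda_s=1.0)
task_only = poe_policy(task_scores, caption_log_probs, lambda_s=0.0)
```

Setting `lambda_s` to 0 removes the captioning prior, so the policy reduces to the task expert alone.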
Noisy channel reranker. This model has the same two-term structure as the PoE reranker. We omit the distractor vector $u_d$ in the conditioning of the prior, arriving at $p(s \mid u_t)$ from the PoE reranker above. The crucial difference is that the first term now represents the speaker's approximation of the listener's behaviour. As before, we represent samples with the transformed bag-of-words, but then compute their dot-product similarities with each image separately and normalize with softmax across the images to obtain the probability of the target $p(t \mid s, u)$. This reranker model is closely related to pragmatic speakers in the Rational Speech Act (RSA) framework (Andreas and Klein, 2016; Monroe and Potts, 2015; Cohn-Gordon et al., 2018; Fried et al., 2018). However, while the RSA model assumes a given and fixed listener model, here we are learning the model of the listener that the speaker is using by optimizing the listener's reward end-to-end. Thus, when doing multi-agent communication using the noisy channel model, there exist two components that produce probability distributions of the same type $p(t \mid s, u)$; one belongs to the listener, thus the speaker has no access to it (e.g., this listener in the future could be a human sending rewards), while the other belongs to the speaker and corresponds to their model of the listener.

Speakers trained jointly with listeners
Table 2 presents referential success when speakers are trained with rewards from a joint listener, i.e., a listener being learned jointly with the speaker. We conduct three different evaluations: at test time we play against the fixed listener, human listeners and the joint listener the speaker was trained with. While the fixed listener is the same for all speakers, the joint listener is speaker-specific. We report results on two splits: for the easy and difficult split we report referential success of the joint listener, and for the latter split, we also report results of the fixed and human listener.
To compute referential success using human listeners, we collect 400 annotations for each speaker model. To avoid annotators adapting to model-specific strategies, we group predictions of similar models and collect annotations in three sessions (one for each group), during which we present annotators with predictions from a model sampled from that group.

Referential success of joint listeners
All models perform quite similarly in the easy split, whereas we observe larger gaps in the difficult split. In terms of joint accuracy in the difficult split, reward finetuning has the lowest performance among models that are optimizing rewards, perhaps due to its large action space (i.e., the vocabulary size $|V| = 2685$), making it a hard RL exploration problem. The multi-task model, despite having the same action space, performs better, probably because the captioning objective, optimized concurrently, facilitates the learning dynamics. Finally, the best results in both splits are obtained by the emergent communication model, which achieves near-perfect performance. We believe this is the case because this speaker is the least constrained of all: we can think of all other speakers (i.e., the ones that combine both types of learning) as being regularized towards producing natural language.

Referential success of human listeners
Somewhat alarmingly, we observe that the joint performance is not predictive of the human one across the board, hinting at issues regarding pragmatic drift (we will further discuss this in Section 6). In the most extreme case, while the emergent communication speaker achieved the highest results when playing against a listener jointly learned with the speaker, this comes at the expense of human performance: functional learning alone results in maximally uninterpretable protocols, and as such humans are at random when playing against such a model. Speakers that combine both types of learning achieve good human performance, with the reward-learned reranker models, i.e., noisy channel and PoE, being the best. They outperform the image captioning baselines, even approaching the discriminative oracle speaker based on ground-truth captions. This indicates their effectiveness in extending the conditioning of the underlying image captioning model to the distractor image with the reward coming from the listener, thereby turning the base image captioning model into a task-specific referential captioning model. Moreover, when giving the rerankers a perfect captioning model in the form of ground-truth captions of target images, the performance of noisy channel and PoE surpasses the oracles' (see last two columns of Table 2); as the community improves the base language models, we should expect this to also result in net improvement in the reranker models.
Finally, we also observe that the fixed grounded listener is significantly predictive of the human performance (p < 0.005, t-test).⁶ This is encouraging, since, as we will show in Section 7, we can use this listener as a fixed model that provides rewards to the speaker model.

Language drift and how to detect it
We show that the multi-agent communication framework is prone to language drift (Lee et al., 2019), i.e., when protocols diverge from human language. We present a taxonomy of the different types of drift that occur in this framework, alongside a set of automatic measures to detect them.

Structural drift: Definition and measures
The most basic type of drift that manifests in the emergent communication setup relates to the core structural properties of the generated language, i.e., its fluency and grammaticality with respect to natural language (this is also referred to by Lee et al. (2019) as "syntactic"). Looking at Table 3, a clear example of this type of drift happens when models update the base captioning model. The reward finetuning (no KL) model does not produce grammatical sentences at all, while multi-task ($\lambda_s = 0.1$) appears to suffer less, only occasionally producing slightly ungrammatical sentences by repeating consecutive words. We term this structural drift and we quantify it as the log probability of the generated message under a pre-trained unconditional language model (column log p(m) in Table 4).

⁶ All t-tests are conducted between two distributions of scores dichotomized on human performance.

Semantic drift: Definition and measures
The second type of drift is semantic drift. This relates to whether the generated message is grounded with regard to the target object, i.e., its adequacy with respect to the literal semantics of the target (this is also referred to by Lee et al. (2019) as "semantic"). We have qualitatively observed instances of this type of drift in the PoE, which occasionally shifts the semantics of words, e.g., using the word tree to refer to ground as seen in Table 3. To measure it, we use a pre-trained image-conditional language model and compute the target-image conditional log probability of the generated message (column log p(m|i) in Table 4). These two log probability-based measures do not assume access to language data for the target objects, and as such can be computed from general unconditional and domain-specific conditional language models. In this particular case though, since we also have access to language data for the target images (i.e., captions in English), and assuming that these data describe everything that is true about the target, we can use simple n-gram statistics as proxies of semantic drift (i.e., in this case 1-gram word overlap ignoring stop words and 3-gram word overlap between the ground-truth captions and the speaker-generated message). Moreover, none of these measures takes into account the specific communication task the speaker has to perform, i.e., our measures do not consider any information about the distractor object, making them easily adaptable to other tasks.
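The drift measures can be illustrated with a toy sketch. For the structural score, an add-one-smoothed bigram model trained on a handful of sentences stands in for the pre-trained unconditional language model; for the n-gram proxies, overlap is computed as the fraction of the message's n-grams found in any ground-truth caption. Both choices, the stop-word list, and all function names are assumptions made for illustration; the image-conditional score log p(m|i) would need a captioning model and is not covered.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams; return them with the smoothing vocabulary size."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams, len(vocab) + 1  # +1 leaves mass for unseen words


def log_prob(message, lm):
    """Structural score: log p(m) under the add-one-smoothed bigram model."""
    unigrams, bigrams, size = lm
    tokens = ["<s>"] + message.lower().split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + size))
               for a, b in zip(tokens[:-1], tokens[1:]))


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(message, captions, n, stop_words=frozenset()):
    """Fraction of the message's n-grams that occur in any reference caption."""
    def prepare(text):
        return [w for w in text.lower().split() if w not in stop_words]
    message_ngrams = ngrams(prepare(message), n)
    if not message_ngrams:
        return 0.0
    reference = set()
    for caption in captions:
        reference.update(ngrams(prepare(caption), n))
    return sum(1 for g in message_ngrams if g in reference) / len(message_ngrams)


captions = ["Mike is scared of the bear", "Jenny and Mike sit by a fire"]
stop_words = frozenset({"is", "of", "the", "and", "by", "a"})  # hypothetical list
lm = train_bigram_lm(["mike is scared of the bear", "jenny is scared of the bear"])
fluent = log_prob("mike is scared of the bear", lm)
scrambled = log_prob("bear the of scared is mike", lm)
one_gram = ngram_overlap("Mike is scared of the tree", captions, 1, stop_words)
three_gram = ngram_overlap("Mike is scared of the tree", captions, 3)
```

In this example the scrambled message receives a lower structural score than the fluent one, and replacing bear with tree lowers both overlap scores.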

Structural and semantic drift results
In Table 4 we report performance of different models under these automatic measures. The structural score log p(m) reflects the qualitative observations made from Table 3, i.e., multi-task and reward finetuning have the highest structural drift, with the latter performing significantly worse than all other models. In contrast, the reranker models that do not update the base captioning model, i.e., PoE and noisy channel, perform the best on the semantic score by construction; both models directly incorporate a component associated with the semantic score (i.e., the samples taken from the image-conditional model alongside the associated probabilities). Moreover, they also perform well on all other measures, indicating their robustness against language drift. Finally, all the model-specific language regularizers (KL for reward finetuning, $\lambda_s = 1$ for multi-task and $\lambda_s = 1$ for PoE) we introduced were effective in limiting both types of language drift (as also seen in Table 3).

Pragmatic drift
Finally, we identify a novel type of drift, i.e., pragmatic drift, which relates to the divergence between a human's interpretation of the message and the interpretation the speaker assumes. Unfortunately, this type of drift is perhaps the most difficult to capture in an automatic way, as it is task-specific and requires access to the exact interpretation that the human would ascribe to the message. As a proxy of pragmatic drift, we use the difference between the agent- and human-listener referential success; if the joint referential success is higher than the human one, then the speaker assumes an interpretation of the message that is different from the human's, resulting in lower human performance. An extreme example of this drift manifests when the joint listener achieves almost perfect referential success whereas a human listener is at random, as in the case of emergent communication. However, in this case the messages are maximally uninterpretable, with the lowest possible performance in both structural and semantic scores.
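The proxy amounts to a difference of two success rates, as in the short sketch below. The outcome lists are hypothetical and are not taken from the tables of this paper.

```python
def referential_success(outcomes):
    """Fraction of games in which the listener picked the target (1 = correct)."""
    return sum(outcomes) / len(outcomes)


def pragmatic_drift_proxy(joint_outcomes, human_outcomes):
    """Joint-listener success minus human-listener success for one speaker."""
    return referential_success(joint_outcomes) - referential_success(human_outcomes)


# Hypothetical outcomes for ten games with the same speaker.
joint = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
human = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
gap = pragmatic_drift_proxy(joint, human)
```

A large positive gap suggests that the speaker and the joint listener rely on an interpretation that humans do not share.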
Hence, a natural question to ask is to what degree (if at all) pragmatic drift can manifest in the absence of the other two types of language drift. Or, put differently, does using emergent communication to learn language use hide any other pathological behaviour in models that do not suffer much from structural and semantic drift, such as PoE and noisy channel? To study this, we create a setup where PoE is guaranteed to have perfect knowledge of (grounded) language. Namely, it uses the reward to rerank ground-truth captions associated with the target image (note that our dataset provides up to five captions per image). Moreover, we perform several ablations in which we allow different parameters of the speaker's and listener's models to be updated by unfreezing components. Table 5 presents the joint and human referential success. The main finding is that, as the number of components updated using the joint reward increases, the margin between the referential success of the two types of listeners increases. The speaker is using human language that is perfectly fluent and accurate with respect to the target image (since the reranker operates on captions associated with the target image), and the joint listener is able to communicate with the agent speaker; nevertheless, the human listener achieves significantly lower performance.
In one test example, the speaker said "Mike has a hat", which was equally true for both images, making the human pick at random. So, how could the listener pick correctly? The speaker had reached a pact with the listener that the interpretation of this message will be something beyond what the phrase means (e.g., "Mike has a yellow hat", or that the intensity of the pixels in the target image is lower). Since speaker and listener learn together, they co-adapt, forming conventions (or conceptual pacts (Brennan and Clark, 1996)) that differ from humans', even in the presence of fluent and grounded language.
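To make the reranking setup concrete, the sketch below picks, among ground-truth captions of the target image, the one to which a listener assigns the highest probability of selecting the target. It is only an illustration under strong assumptions: the word-overlap listener, the attribute sets describing the images, and all function names are hypothetical stand-ins for the learned listener and the real images.

```python
import math
from typing import Callable, Sequence, Set


def listener_prob_target(caption: str, target: Set[str], distractor: Set[str]) -> float:
    """Toy literal listener: probability of choosing the target image.

    Each image is scored by how many caption words appear in its
    (hypothetical) attribute set; scores are turned into a probability
    with a two-way softmax.
    """
    words = set(caption.lower().split())
    score_target = len(words & target)
    score_distractor = len(words & distractor)
    exp_t, exp_d = math.exp(score_target), math.exp(score_distractor)
    return exp_t / (exp_t + exp_d)


def rerank(captions: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the caption with the highest listener reward."""
    return max(captions, key=reward)


# Hypothetical attribute sets for a target and a distractor image.
target = {"mike", "hat", "yellow"}
distractor = {"mike", "hat", "red"}
captions = ["Mike has a hat", "Mike has a yellow hat"]

best = rerank(captions, lambda c: listener_prob_target(c, target, distractor))
```

Under this literal listener, "Mike has a hat" is true of both images and yields a probability of 0.5, so the reranker prefers the more discriminative caption. A co-adapted listener could instead assign a high probability to the ambiguous caption, which is the behaviour that a human listener cannot reproduce.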

Speakers trained using fixed listener
In the previous section we showed that learning a speaker using a learned reward module as scaffolding (i.e., the joint listener) can lead to pragmatic drift. In this section, we use a grounded reward as scaffolding. In the absence of a human listener to provide rewards for learning, we use the oracle fixed listener, which was found in Section 5 to be predictive of human referential success. It is pre-trained, stays fixed and only provides rewards for training the speaker. As speakers, we use the models that scored the highest in Table 2 and retrain them against the fixed listener. Table 6 presents the referential success against fixed and human listeners. Using a grounded reward results in better performance for the weaker models. The small gap between the rerankers in the two experimental setups suggests that using a learned reward module (joint) holds promise, despite the different types of language drift. Moreover, we show that our mod-

Discussion and Limitations
We presented a method for teaching agents to communicate with humans in natural language by combining two learning signals, one coming from multi-agent communication and one from traditional data-driven natural language learning techniques. This adds to recent efforts to blend emergent communication with natural language (Lowe et al., 2020; Lu et al., 2020). Self-play between speakers and listeners can result in language drift, the most severe form of which is pragmatic drift. Since speakers and listeners are learning concurrently, they can co-adapt to pair-specific policies that deviate from the policies that humans learn. This pathological behaviour of self-play is not specific to language and extends to other policies (Carroll et al., 2019).
Finally, we introduced the reward-learned reranker approach, which alleviates language drift and achieves the highest human performance by constraining the functional learning to happen at the level of utterances generated by a pre-trained language model. However, since the functional signal does not currently influence the sampling from the language model, this will lead to poor performance when using more general language models with weaker conditioning (e.g., GPT-2 (Radford et al., 2019)), whose samples potentially do not fit the functional context. To move towards integrating our findings into more realistic applications of self-play, e.g., user simulation in dialogue (Schatzmann et al., 2006; Shah et al., 2008), these shortcomings need to be addressed.

A Hyperparameters
The following tables present our choice of hyperparameters for the speaker and listener agents. Hyperparameters in Table 7 were chosen in the image captioning task using the validation set. Hyperparameters in Table 8 were chosen in the referential task using the validation set.

B ResNet module
We use ResNet-50 (He et al., 2016) pre-trained on ImageNet. For image captioning, and also for models that use the pre-trained captioning model (i.e., reward finetuning, PoE and noisy channel), we backpropagate gradients into the ResNet module. However, in all rerankers we freeze the ResNet during reward optimization. Moreover, we also keep the ResNet fixed in the jointly learned listener to prevent additional drift. We do, however, backpropagate into it when we pre-train the fixed listener, which is grounded through the discriminative caption speaker.