Miss Tools and Mr Fruit: Emergent Communication in Agents Learning about Object Affordances

Recent research studies communication emergence in communities of deep network agents assigned a joint task, hoping to gain insights on human language evolution. We propose here a new task capturing crucial aspects of the human environment, such as natural object affordances, and of human conversation, such as full symmetry among the participants. By conducting a thorough pragmatic and semantic analysis of the emergent protocol, we show that the agents solve the shared task through genuine bilateral, referential communication. However, the agents develop multiple idiolects, which makes us conclude that full symmetry is not a sufficient condition for a common language to emerge.


Introduction
The advent of powerful deep learning architectures has revived research in simulations of language emergence among computational agents that must communicate to accomplish a task (e.g., Jorge et al., 2016; Havrylov and Titov, 2017; Kottur et al., 2017; Lazaridou et al., 2017; Lee et al., 2017; Choi et al., 2018; Evtimova et al., 2018; Lazaridou et al., 2018). The nature of the emergent communication code should provide insights on questions such as to what extent comparable functional pressures could have shaped human language, and whether deep learning models can develop human-like linguistic skills. For such inquiries to be meaningful, the designed setup should reflect as many aspects of human communication as possible. Moreover, appropriate tools should be applied to the analysis of emergent communication, since, as several recent studies have shown, agents might succeed at a task without truly relying on their communicative channel, or by means of ad-hoc communication techniques overfitting their environment (Kottur et al., 2017; Bouchacourt and Baroni, 2018; Lowe et al., 2019).
We contribute on both fronts. We introduce a game meeting many desiderata for a natural communication environment. We further propose a two-pronged analysis of emerging communication, at the pragmatic and semantic levels. At the pragmatic level, we study communicative acts from a functional perspective, measuring whether the messages produced by an agent have an impact on the subsequent behaviour of the other. At the semantic level, we decode which aspects of the extra-linguistic context the agents refer to, and how such reference acts differ between agents. Some of our conclusions are positive. Not only do the agents solve the shared task, but genuine bilateral communication helps them to reach higher reward. Moreover, their referential acts are meaningful given the task, carrying the semantics of their input. However, we also find that even perfectly symmetric agents converge to distinct idiolects instead of developing a single, shared code.

The fruit and tools game
Our game, inspired by Tomasello's (2014) conjecture that the unique cognitive abilities of humans arose from the requirements of cooperative interaction, is schematically illustrated in Fig. 1. In each episode, a randomly selected agent is presented with instances of two tools (knife, fork, axe...), the other with a fruit instance (apple, pear, plum...). Tools and fruits are represented by property vectors (e.g., has a blade, is small), with each instance characterized by values randomly varying around the category mean (e.g., an apple instance might be smaller than another). An agent is randomly selected to be the first to perform an action. The game then proceeds for an arbitrary number of turns. At each turn, one of the agents must decide whether to pick one of the two tools and stop, or to continue, in which case the message it utters is passed to the other agent, and the game proceeds. Currently, for ease of analysis, messages are single discrete symbols selected from a vocabulary of size 10, but extension to symbol sequences is trivial (although it would of course complicate the analysis). As soon as an agent picks a tool, the game ends. The agents receive a binary reward of 1 if they picked the better tool for the fruit at hand, 0 otherwise. The best choice is computed by a utility function that takes into account the interaction between tool and fruit instance properties (e.g., as in Fig. 1, a tool with a round edge might be particularly valuable if the fruit has a pit). Utility is relative: given a peach, the axe is worse than the spoon, but it would be the better tool when the alternative is a hammer.
Here are some desirable properties of our setup, as a simplified simulation of human interactions. The agents are fully symmetric and cannot specialize to a fixed role or turn-taking scheme. The number of turns is open and determined by the agents. In pure signaling/referential games (Lewis, 1969), the aim is successful communication itself. In our game, reward depends instead on tool and fruit affordances. Optimal performance can only be achieved by jointly reasoning about the properties of the tools and how they relate to the fruit. Humans are rewarded when they use language to solve problems of this sort, and not for successful acts of reference per se. Finally, as we use commonsense descriptions of everyday objects to build our dataset (see below), the distribution of their properties possesses the highly skewed characteristics encountered everywhere in the human environment (Li, 2002). For example, if the majority of fruits require cutting, a knife is intrinsically more useful than a spoon. Note that the agents do not have any a priori knowledge of tool utility. Yet, baseline agents are able to discover context-independent tool affordances and already reach high performance. We believe that this scenario, in which communication-transmitted information complements knowledge that can be directly inferred by observing the world, is more interesting than typical games in which language is the only information carrier.
Game ingredients and utility We picked 16 tool and 31 fruit categories from McRae et al. (2005) and Silberer et al. (2013), who provide subject-elicited property-based commonsense descriptions of objects, with some extensions. We used 11 fruit and 15 tool features from these databases to represent the categories. We rescaled the elicitation-frequency-based property values provided in the norms to lie in the [0, 1] range, and manually changed some counter-intuitive values. An object instance is a property vector sampled from the corresponding category as follows.
For binary properties such as has a pit, we use Bernoulli sampling with p equaling the category value. For continuous properties such as is small, we sample uniformly from [µ−0.1, µ+0.1], where µ is the category value. We then devised a function returning a utility score for any fruit-tool property vector pair. The function maps properties to a reduced space of abstract functional features (such as break for tools, and hard for fruits). Details are in Supplementary Section 7. For example, an apple with is crunchy=0.7 gets a high hard functional feature score. A knife with has a blade=1 gets a high cut score, and therefore high utility for the hard apple. Some features, e.g., has a handle for tools, have no impact on utility. They only represent realistic aspects of objects and act as noise. Our dataset with full category property vectors will be publicly released along with code.
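The sampling scheme above can be sketched as follows (property names and category values here are illustrative, not taken from the released dataset):

```python
import random

def sample_instance(category_means, binary_props, seed=0):
    """Sample one object instance from a category's mean property vector.

    `category_means` maps property names to values in [0, 1];
    `binary_props` names the properties sampled as Bernoulli draws.
    """
    rng = random.Random(seed)
    instance = {}
    for prop, mu in category_means.items():
        if prop in binary_props:
            # Binary property: Bernoulli with p equal to the category value.
            instance[prop] = 1.0 if rng.random() < mu else 0.0
        else:
            # Continuous property: uniform in [mu - 0.1, mu + 0.1].
            instance[prop] = rng.uniform(mu - 0.1, mu + 0.1)
    return instance

# Hypothetical category values, for illustration only:
apple = {"has_a_pit": 0.1, "is_crunchy": 0.7, "is_small": 0.6}
inst = sample_instance(apple, binary_props={"has_a_pit"})
```

Each call produces a distinct instance, so two apples can differ in size and crunchiness while sharing the same category means.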
Datasets We separate the 31 fruit categories into three sets: in-domain (21 categories), validation and transfer (5 categories each). The in-domain set is further split into train and test partitions. We train agents on the in-domain train partition and monitor convergence on the validation set. We report test performance on the in-domain test partition and on the transfer set. For example, the peach category is in-domain, meaning that distinct peach instances will be seen at training and in-domain testing time. The nectarine category is in the transfer set, so nectarine instances will only be seen at test time. This scheme tests the generalization abilities of the agents (which can generalize to new fruits since they are all described in the same feature space). We generate 210,000 in-domain training samples and 25,000 samples for each of the other sets, balanced across fruits and tools (which are common across the sets).
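The category split can be reproduced schematically (category names below are placeholders, not the actual fruit list):

```python
import random

rng = random.Random(42)
fruit_categories = [f"fruit_{i}" for i in range(31)]  # placeholder names
rng.shuffle(fruit_categories)

# 21 in-domain, 5 validation, 5 transfer categories;
# the 16 tool categories are shared across all sets.
in_domain = fruit_categories[:21]
validation = fruit_categories[21:26]
transfer = fruit_categories[26:]
```

Because instances are sampled per category, transfer fruits are entirely unseen at training time, while in-domain test fruits differ from training ones only at the instance level.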
Game dynamics and agent architecture At the beginning of a game episode, two neural network agents A and B receive, randomly, either a pair of tools (tool_1, tool_2) (always sampled from different categories) or a fruit. The agent receiving the tools (respectively the fruit) will be Tool Player (respectively Fruit Player) for the episode. The agents are also randomly assigned positions, and the one in position 1 starts the game. Figure 2 shows two turns of the game in which A (blue/left) is Tool Player and in position 1. The first turn is indexed t = 0, therefore A will act on even turns, B on odd turns. At game opening, each agent passes its input (tool pair or fruit) through a linear layer followed by a tanh nonlinearity, resulting in embedding i^A (resp. i^B). Then, at each turn t, an agent, for example A, receives the message m^B_{t-1} from agent B, and accesses its own previous internal state s^A_{t-2} (we refer to this use of previous internal states as "memory"). When an agent stops by choosing a tool, for example tool_1, we compute the two utilities U(tool_1, fruit) and U(tool_2, fruit). If U(tool_1, fruit) ≥ U(tool_2, fruit), that is, the best tool was chosen, the shared reward is R = 1. If U(tool_1, fruit) < U(tool_2, fruit), or if the agents reach T_max turns without choosing, R = 0. During learning, the reward is back-propagated with Reinforce (Williams, 1992). When the game starts at t = 0, we feed the agent in position 1 a fixed dummy message m_0, and the previous states of the agents s^A_{t-2} and s^B_{t-1} are initialized with a fixed dummy s_0. In the no-memory ablation, previous internal states are always replaced by s_0. When we block communication, agent messages are replaced by m_0. Supplementary Section 8 provides hyperparameter and training details.
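A minimal re-implementation of the game dynamics described above; the policy and utility callables are stand-ins for the trained networks and the utility function of Section 2, not the authors' released code:

```python
import random

T_MAX = 10  # maximum number of turns before the episode is forfeited

def play_episode(agents, tools, fruit, utility, seed=0):
    """Play one episode of the fruit-and-tools game, schematically.

    `agents` are two callables mapping (own_input, incoming_message) to
    either ("continue", symbol) or ("stop", tool_index).
    """
    rng = random.Random(seed)
    inputs = [None, None]
    tool_player = rng.randrange(2)      # who sees the tool pair
    inputs[tool_player] = tools
    inputs[1 - tool_player] = fruit
    current = rng.randrange(2)          # the agent in position 1 starts
    message = 0                         # fixed dummy opening message m_0
    for _ in range(T_MAX):
        action, value = agents[current](inputs[current], message)
        if action == "stop":
            chosen, other = value, 1 - value
            # Shared binary reward: 1 iff the better tool was picked.
            return 1 if utility(tools[chosen], fruit) >= utility(tools[other], fruit) else 0
        message = value                 # utterance passed to the other agent
        current = 1 - current
    return 0                            # T_max reached without a choice
```

Note that either agent, including Fruit Player, may stop and pick a tool at any turn, which is what makes the roles fully symmetric.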

Measuring communication impact
Message Effect is computed on single turns. Causal theory (Pearl et al., 2016) ensures there are no confounders when we analyze the influence of m^A_t. Supplementary Figure A1 shows the causal graph supporting our assumptions. From here onwards, we do not write the conditioning on c^A_t, s^B_{t-1}, i^B explicitly. We define z^B_{t+1} = (c^B_{t+1}, m^B_{t+1}). We want to estimate how much the message from A, m^A_t, influences the next-turn behaviour (choice and message) of B, z^B_{t+1}. We thus measure the discrepancy between the conditional distribution p(z^B_{t+1} | m^A_t) and a marginal distribution not taking m^A_t into account. However, we want to assess agent B's behaviour under other possible received messages m^A_t. To do so, when we compute the marginal of agent B's z^B_{t+1}, we intervene on m^A_t and draw the messages from an intervention distribution. We define p̃(z^B_{t+1}), the marginal computed with counterfactual messages m̃^A_t, as:

p̃(z^B_{t+1}) = Σ_{m̃^A_t} p(z^B_{t+1} | m̃^A_t) p̃(m̃^A_t),    (1)

where p̃(m^A_t) is the intervention distribution, different from p(m^A_t | s^A_t). If at turn t, A continues the game, we define the Message Effect (ME) of agent A's message m^A_t on agent B's choice and message pair z^B_{t+1} as:

ME^{A→B}_t = KL( p(z^B_{t+1} | m^A_t) || p̃(z^B_{t+1}) ),    (2)

where KL is the Kullback-Leibler divergence, and p̃(z^B_{t+1}) is computed as in Eq. 1. This allows us to measure how much the conditional distribution differs from the marginal. Algorithm 1 shows how we estimate ME^{A→B}_t. In our experiments, we draw K = 10 samples z^B_{t+1,k}, and use a uniform intervention distribution p̃(m^A_t) over the J = 10 vocabulary symbols. This kind of counterfactual reasoning is explored in depth by Bottou et al. (2013). Jaques et al. (2018) and Lowe et al. (2019) present related measures of causal impact based on the Mutual Information (MI) between influencing and influenced agents. We discuss possible issues with the MI-based approach in Supplementary Section 9.
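On finite supports the measure can be computed exactly rather than estimated with Monte-Carlo samples as in Algorithm 1; the following sketch, with our own naming, illustrates the definition:

```python
import math

def message_effect(cond_prob, sent_msg, vocab, outcomes):
    """Exact Message Effect for finite supports (a sketch of the definition;
    the paper instead estimates it with K samples, Algorithm 1).

    cond_prob(z, m) = p(z | m): probability that the listener's next
    choice/message pair is z given received message m. The marginal uses
    a uniform intervention distribution over `vocab`.
    """
    # Marginal under counterfactual messages drawn uniformly from the vocabulary.
    marginal = {z: sum(cond_prob(z, m) for m in vocab) / len(vocab)
                for z in outcomes}
    # KL divergence between the conditional given the actually sent
    # message and the counterfactual marginal.
    me = 0.0
    for z in outcomes:
        p = cond_prob(z, sent_msg)
        if p > 0:
            me += p * math.log(p / marginal[z])
    return me
```

A listener whose behaviour is independent of the incoming message yields ME = 0; a listener that deterministically echoes the message yields ME = log 2 on a binary vocabulary.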
Bilateral communication Intuitively, there has been a proper dialogue if, in the course of a conversation, each agent has said at least one thing that influenced the other. We operationalize this through our bilateral communication measure. This is a binary, per-game score that is positive only if in the game there has been at least one turn with significant message effect in each direction, i.e., ∃ t, t′ s.t. ME^{A→B}_t > θ and ME^{B→A}_{t′} > θ. We set θ = 0.1.

We first confirm that the agents succeed at the task, and that communication improves their performance. Second, we study their pragmatics, looking at how ablating communication and memory affects their interaction. Finally, we try to interpret the semantics of the agents' messages.
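The per-game bilateral communication score defined above can be sketched directly from two lists of per-turn ME values:

```python
THETA = 0.1  # significance threshold for a single message effect

def bilateral_communication(me_a_to_b, me_b_to_a, theta=THETA):
    """Binary per-game score: 1 iff at least one turn in each direction
    had a message effect above theta (the two turns may differ)."""
    return int(any(v > theta for v in me_a_to_b) and
               any(v > theta for v in me_b_to_a))
```

A game where only one agent ever influences the other, however strongly, scores 0: influence must flow both ways for the exchange to count as a dialogue.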

Performance and pragmatics
We report mean and standard error of the mean (SEM) over successful training seeds. Each agent A or B can be either F (Fruit Player) or T (Tool Player) and in position 1 or 2, depending on the test game. We measure to what extent Tool Player influences Fruit Player (ME_{T→F}) and vice versa (ME_{F→T}). Similarly, we evaluate position impact by computing ME_{1→2} and ME_{2→1}. We average ME values over messages sent during each test game, and report averages over test games. Note that we also intervene on the dummy initialization message used at t = 0, which is received by the agent in position 1. This impacts the value of ME_{2→1}. If the agent in position 1 has learned to rely on the initialization message to understand that the game is beginning, an intervention on this message will have an influence we want to take into account. Similarly, in the no-communication ablation, when computing ME values, we replace the fixed dummy message the agents receive with a counterfactual. Finally, we emphasize that the computation of ME values does not interfere with game dynamics and does not affect performance.
Both communication and memory help Table 1 shows that enabling the agents to communicate greatly increases performance compared to the no-communication ablation, both with and without memory, despite the high baseline set by agents that learn about tool usefulness without communicating (see discussion below). Agents equipped with memory perform better than their no-memory counterparts, but the gain in performance is smaller compared to the gain attained from communication. The overall best performance is achieved with communication and memory. We also see that the agents generalize well, with transfer-fruit performance almost matching that on in-domain fruits. Next, we analyze in detail the impact of each factor (communication, memory) on agent performance and strategies.
No-communication, no-memory We start by looking at how the game unfolds when communication and agent memory are ablated (top left quadrant of Table 1). Performance is largely above chance (≈ 50%), because, as discussed in Section 2, some tools are intrinsically better on average across fruits than others. Without communication, the agents exploit this bias and learn a strategy where (i) Fruit Player never picks the tool but always continues the game and (ii) Tool Player picks the tool according to average tool usefulness. Indeed, Tool Player makes the choice in more than 99% of the games. Conversation length is 0 if Tool Player starts and 1 if it is the second agent, requiring the starting Fruit Player to pass its turn. Reassuringly, ME values are low, confirming the reliability of this communication score, and indicating that communication-deprived agents did not learn to rely on the fixed dummy message (e.g., by using it as a constant bias). Still, we observe that, across the consistently low values, Fruit Player appears to affect Tool Player significantly more than the reverse (ME_{F→T} > ME_{T→F}). This is generally observed in all configurations, and we believe it is due to the fact that Tool Player takes charge of most of the reasoning in the game. We come back to this later in our analysis. We also observe that the second player impacts the first more than the reverse (ME_{2→1} > ME_{1→2}). We found this to be an artifact of the strategy adopted by the agents. In the games in which Tool Player starts and immediately stops the game, we can only compute ME for the Tool/position-1 agent, by intervening on the initialization. The resulting value, while tiny, is unlikely to be exactly 0. In the games where Fruit Player starts and Tool Player stops at the second turn, we compute instead two tiny MEs, one per agent. Hence the observed asymmetry. We verified this hypothesis by removing single-turn games: the influence of the second player on the first indeed disappears.
Impact of communication The top quadrants of Table 1 show that communication helps performance, despite the high baseline set by the "average tool usefulness" strategy. Importantly, when communication is added, we see a dramatic increase in the proportion of games with bilateral communication, confirming that improved performance is not due to an accidental effect of adding a new channel (Lowe et al., 2019). ME and the average number of turns also increase. Fruit Player is the more influential agent. This effect is not due to the artifact we found in the no-communication ablation, because almost all conversations, including those started by Tool Player, are longer than one turn, so we can compute both ME_{F→T} and ME_{T→F}. We believe the asymmetry is due to the fact that Tool Player is the agent that demands more information from the other, as it is the one that sees the tools, and the one that in the large majority of cases makes the final choice. Supplementary Table A4 shows that the gap between the influence of the Fruit Player on the Tool Player and the reverse is greater when the Fruit Player is in position 2. This, then, explains ME_{2→1} > ME_{1→2} as an epiphenomenon of Fruit Player being more influential.
Is memory ablation necessary for communication to matter? An important observation from previous research is that depriving at least one agent of memory might be necessary to develop successful multi-turn communication (Kottur et al., 2017; Cao et al., 2018; Evtimova et al., 2018). This is undesirable, as obviously language should not emerge simply as a surrogate for the memories of amnesiac agents. The performance and communicative behaviour results in the bottom right quadrant of Table 1 show that, in our game, genuine linguistic interaction (as cued by ME and bilateral communication scores) is present even when both agents are equipped with memory. It is interesting however to study how adding memory affects the game dynamics independently of communication. In the bottom left quadrant, we see that memory leads to some task performance improvement for communication-less agents. Manual inspection of example games reveals that such agents are developing turn-based strategies. For example, Tool Player learns to continue the game at turn t if tool_1 has a round end. At t + 1, Fruit Player can use the fact that Tool Player continued at t as information about relative tool roundness, and either pick the appropriate one based on the fruit or continue to gather more information. In a sense, agents learn to use the possibility to stop or continue at each turn as a rudimentary communication channel. Indeed, exchanges are on average longer when memory is involved, and turn-based strategies appear even with communication. In the latter case, agents rely on communication but also on turn-based schemes, resulting in lower ME values and bilateral communication scores compared to the no-memory ablation. Finally, the respective positions of the agents in the conversation no longer impact ME (ME_{1→2} ≈ ME_{2→1}). This might be because, with memory, the starting agent can identify whether it is at turn t = 0, where it almost always chooses to continue the game to send and receive more information via communication. Intervening on the dummy initialization message then has a lower influence, resulting in lower ME_{2→1}.

Conversation semantics
Having ascertained that our agents are conducting bidirectional conversations, we next try to decode the contents of such conversations. To do this, we train separate classifiers to predict, from the message exchanges in successful in-domain test games, the Fruit, Tool 1 and Tool 2 categories in each game. Consider for example a game in which the fruit is apple and tools 1 and 2 are knife and spoon, respectively. If the message-based classifiers are, say, able to successfully decode apple but not knife/spoon, this suggests that the messages are about the fruit but not the tools. For each prediction task, we train classifiers (i) on the whole conversation, i.e., both agents' utterances (Both), and (ii) on either player's utterances: Fruit (F) or Tool (T) only. For comparison, we also report accuracy of a baseline that makes guesses based on the train category distribution (Stats), which is stronger than chance. We report mean accuracy and SEM across successful training seeds. Supplementary Section 11 provides further details on classifier implementation and training.
The first row of Table 2 shows that the conversation as a whole carries information about any object. The second and third rows show that the agents are mostly conveying information about their respective objects (which is very reasonable), but also, to a lesser extent, yet still well above baseline level, about the other agent's input. This latter observation is intriguing. Further work should ascertain whether it is an artifact of fruit-tool correlations, or points in the direction of more interesting linguistic phenomena (e.g., asking "questions"). The asymmetry between Tool 1 and 2 would also deserve further study, but importantly the agents are clearly referring to both tools, showing they are not adopting entirely degenerate strategies.

We tentatively conclude that the agents did develop the expected semantics, both being able to refer to all objects in the games. Did they, however, develop shared conventions to refer to them, as in human language? This would not be an unreasonable expectation, since the agents are symmetric and learn to play both roles and in both positions. Following up on the idea of "self-play" of Graesser et al. (2019), after a pair of agents A and B are trained, we replace at test time agent B's embedders and modules with those of A, that is, we let one agent play with a copy of itself. If A and B are speaking the same language, this should not affect test performance. Instead, we find that with self-play average game performance drops down to 67% and 65% on the in-domain and transfer test sets, respectively. This suggests that the agents developed their own idiolects. The fact that performance is still above chance could be due to the idiolects being at least partially interchangeable, or simply to the fact that agents can still do reasonably well by relying on knowledge of average tool usefulness (self-play performance is below that of the communication-less agents in Table 1). To decide between these interpretations, we trained the semantic classifier on conversations where A is the Fruit Player and B the Tool Player, testing on conversations about the same inputs, but where the roles are inverted. The performance drops down to the level of the Stats baseline (Supplementary Table A5), supporting the conclusion that non-random performance is due to knowledge acquired by the agents independently of communication, and not to partial similarity between their codes.
Related work

Games Among the long history of early works modeling language evolution between agents (e.g., Steels, 2003; Brighton et al., 2003), Reitter and Lebiere (2011) simulate human language evolution with a Pictionary-type task. Most recently, with the advent of neural network architectures, the literature has focused on simple referential games with a sender sending a single message to a receiver, and reward depending directly on communication success (e.g., Lazaridou et al., 2017; Havrylov and Titov, 2017; Lazaridou et al., 2018). Evtimova et al. (2018) extend the referential game by presenting the sender and receiver with referent views in different modalities, and allowing multiple message rounds. Still, reward is given directly for referential success, and the roles and turns of the agents are fixed. Das et al. (2017) generalize Lewis' signaling game (Lewis, 1969) and propose a cooperative image guessing game between two agents, a question bot and an answer bot. They find that grounded language emerges without supervision. Cao et al. (2018) (expanding on Lewis et al., 2017) propose a setup where two agents see the same set of items, and each is provided with arbitrary, episode-specific utility functions for the objects. The agents must converge in multi-turn conversation to a decision about how to split the items. The fundamental novelty of our game with respect to theirs is that our rewards depend on consistent, realistic commonsense knowledge that is stable across episodes (hammers are good to break hard-shell fruits, etc.). Mordatch and Abbeel (2018) (see also Lowe et al., 2017) study emergent communication among multiple (> 2) agents pursuing their respective goals in a maze. In their setup, fully symmetric agents are encouraged to use flexible, multi-turn communication as a problem-solving tool. However, the independent complexities of navigation make the environment somewhat cumbersome if the aim is to study emergent communication.
Communication analysis Relatively few papers have focused specifically on the analysis of the emergent communication protocol. Among the ones more closely related to our line of inquiry, Kottur et al. (2017) analyze a multi-turn signaling game. One important result is that, in their game, the agents only develop a sensible code if the sender is deprived of memory across turns. Evtimova et al. (2018) study the dynamics of agent confidence and informativeness as a conversation progresses. Cao et al. (2018) train probe classifiers to predict, from the messages, each agent's utility function and the decided split of items. Most directly related to our pragmatic analysis, Lowe et al. (2019), who focus on simple matrix communication games, introduce the notions of positive signaling (an agent sends messages that are related to its state) and positive listening (an agent's behaviour is influenced by the messages it receives). They show that positive signaling does not entail positive listening, and that commonly used metrics might not necessarily detect the presence of one or the other. We build on their work by focusing on the importance of mutual positive listening in communication (our "bilateral communication" measure). We further refine the causal approach to measuring influence that they introduce. Jaques et al. (2018) also use the notion of causal influence, both directly as a term in the agent cost function and to analyze agent behaviour.

Discussion
We introduced a more challenging and arguably more natural game to study emergent communication in deep network agents. Our experiments show that these agents do develop genuine communication even when (i) successful communication per se is not directly rewarded; (ii) the observable environment already contains stable, reliable information helping to solve the task (object affordances); and (iii) the agents are not artificially forced to rely on communication by erasing their memory. The linguistic exchanges of the agents not only lead to significantly better task performance, but can be properly pragmatically characterized as dialogues, in the sense that the behaviour of each agent is affected by what the other agent says. Moreover, the agents use language, at least in part, to denote the objects in their environment, showing primitive hallmarks of a referential semantics.
We also find, however, that agent pairs trained together in fully symmetric conditions develop their own idiolects, such that an agent won't (fully) understand itself in self-play. As convergence to a shared code is another basic property of human language, in future research we will explore ways to make it emerge. First, we note that Graesser et al. (2019), who study a simple signaling game, similarly conclude that training single pairs of agents does not lead to the emergence of a common language, which requires diffusion in larger communities. We intend to verify whether a similar trend emerges if we extend our game to larger agent groups. Conversely, equipping the agents with a feedback loop in which they also receive their own messages as input might encourage shared codes across speaker and listener roles.
In the current paper, we limited ourselves to one-symbol messages, facilitating analysis but greatly reducing the spectrum of potentially emergent linguistic phenomena to study. Another important direction for future work is thus to endow agents with the possibility of producing, at each turn, a sequence of symbols, and analyze how this affects conversation dynamics and the communication protocol. Finally, having shown that agents succeed in our setup, we intend to test them with larger, more challenging datasets, possibly involving more realistic perceptual input.
7 Dataset and utility details

This section provides additional details on the dataset we use and the utility function we employ to compute utilities between fruits and tools. Note that we refer to fruits for conciseness, but some vegetables, such as carrot and potato, are included.
There are 11 fruit features: is crunchy, has skin, has peel, is small, has rough skin, has a pit, has milk, has a shell, has hair, is prickly, has seeds, and 15 tool features: has a handle, is sharp, has a blade, has a head, is small, has a sheath, has prongs, is loud, is serrated, has handles, has blades, has a round end, is adorned with feathers, is heavy, has jaws. Note that, when we sample instances of each category as explained in Section 2 of the main paper, features are sampled independently. We filter out, however, nonsensical combinations. For example, the features has prongs, has a blade and has blades are treated as pairwise mutually exclusive.
In order to compute the utility for a pair (tool, fruit), we use three mapping matrices. The mapping matrix M_T ∈ R^{15×6} (Table A1) maps from the space of tool features to a space of more general functional features (cut, spear, lift, break, peel, pit remover), and similarly M_F ∈ R^{11×6} (Table A2) maps from the space of fruit features to a space of functional features (hard, pit, shell, pick, peel, empty inside). Finally, the matrix M ∈ R^{6×6} (Table A3) maps the two abstract functional feature spaces to each other. For example, if an axe sample is described by the vector t_a ∈ R^{1×15} and a nectarine sample by the vector f_n ∈ R^{1×11}, the utility is computed as

U(t_a, f_n) = (t_a M_T) M (f_n M_F)^⊤ + 0.01,

where ⊤ denotes transpose; we always add a value of 0.01 to avoid zero utilities. We can therefore compute the utility of any combination of (possibly new) fruits and tools, as long as they can be described in the corresponding feature spaces. Note that in our case we have the same number of abstract functional features for fruits and tools (6), but this need not be the case. In other words, M need not be a square matrix.
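A minimal pure-Python sketch of this computation; the toy 2×2 identity matrices below are illustrative only, standing in for the 15×6, 11×6 and 6×6 matrices of Tables A1-A3:

```python
def matvec(M, v):
    """Row-vector/matrix product: v (length n) times M (n x k) -> length k."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def utility(tool_vec, fruit_vec, M_T, M_F, M):
    """U(tool, fruit) = (t M_T) M (f M_F)^T + 0.01, a scalar."""
    t_func = matvec(M_T, tool_vec)    # tool functional features
    f_func = matvec(M_F, fruit_vec)   # fruit functional features
    paired = matvec(M, t_func)        # (t M_T) M
    return sum(p * f for p, f in zip(paired, f_func)) + 0.01

# Toy identity mappings, for illustration only:
I2 = [[1, 0], [0, 1]]
u = utility([1, 0], [1, 0], M_T=I2, M_F=I2, M=I2)  # 1.01
```

With identity mappings the utility reduces to the dot product of the two feature vectors plus the 0.01 offset, which makes the role of M as a pairing between the two functional spaces easy to see.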
Given the values in the mapping matrices, six of the tool features have no impact on the utility computation, since they do not affect the scores of the functional tool features (their rows in the mapping matrix M_T contain only zeros). These are: has a handle, is sharp, has a sheath, is loud, has handles, is adorned with feathers. Such features only represent realistic aspects of objects and act as noise.
8 Implementation details

Training and architecture hyperparameters
We update the parameters with RMSProp (Tieleman and Hinton, 2012) with a learning rate of 0.001 and the remaining optimizer parameters left at their PyTorch default values. We use a scalar reward baseline b to reduce variance, learned with mean squared error such that 1 + b matches the average reward. We clip all gradients at 0.1. For the Message encoder and decoder modules, we embed input and output symbols with dimensionality 50 and then use an RNN with 100 hidden dimensions. The Fruit embedder linear transformation has output size 100, the Tool embedder size 50. The Body module has size 100. We train the agents with batches of 128 games for a total of 1 million batches. We validate on 12 batches of 100 games, for a total of 1200 validation games, and similarly for testing.

Test procedure details
The computation of the ME values involves random sampling in steps 1 and 2 of Algorithm 1, so we test using 20 testing seeds. We balance the number of test games in each configuration: we use 3 batches of 100 test games per configuration, resulting in 12 batches for a total of 1200 test games. The ME value in each configuration c is the average ME over the number of batches in this configuration (3 in our case). We then average the ME over the four possible configurations to obtain ME_{1→2}, ME_{F→T} and their reverses.

9 Message effect metric

9.1 Causal graph and assumptions

Figure A1 shows the causal graph we consider when computing ME^{A→B}_t. We write all variables that should be considered at turn t. Conditioning on c^A_t, s^B_{t−1}, i^B blocks any backdoor paths when we compute the causal influence of m^A_t on c^B_{t+1}, m^B_{t+1} (Pearl et al., 2016). Moreover, the path between m^A_t and c^B_{t+1}, m^B_{t+1} through i^A is blocked at the collider node s^A_{t+2}. Therefore, we ensure there is no confounder when we compute the influence of m^A_t on c^B_{t+1}, m^B_{t+1}. As in Jaques et al. (2018), we have knowledge of the inputs to the model and of the distributions of the variables we consider; therefore, we do not need to perform abduction to update the probabilities of unobserved exogenous variables that may alter the causal relations in our model (Pearl et al., 2016).

We denote agent B's choice and message pair at turn t+1 as z^B_{t+1} = (c^B_{t+1}, m^B_{t+1}). We explain in Section 3 of the main paper that we compare (i) the conditional distribution p(z^B_{t+1} | m^A_t) and (ii) the marginal distribution p(z^B_{t+1}), which does not take m^A_t into account. We intervene on m^A_t, drawing counterfactual messages not from agent A but from another distribution over m^A_t, the intervention distribution. We define p(z^B_{t+1}), the marginal computed with counterfactual messages m̃^A_t, as

p(z^B_{t+1}) = Σ_{m̃^A_t} p(z^B_{t+1} | m̃^A_t) p̃(m̃^A_t),

where p̃(m^A_t) is the intervention distribution. In our experiments we take a uniform intervention distribution. Importantly, p̃(m^A_t) is different from the observational distribution p(m^A_t | s^A_t) that agent A actually defines over the messages. Contrary to Bottou et al. (2013), by feeding the counterfactual messages to agent B we directly access p(z^B_{t+1} | m^A_t) and need not estimate it from empirical data.

Jaques et al. (2018) train agents to have impact on other agents by maximizing the causal influence of their actions. They show that their definition of influence relates to the Mutual Information (MI) between influencing and influenced agents. Lowe et al. (2019) also define a causal influence metric based on MI. MI computation requires counterfactuals drawn from the influencing agent's distribution, and not from an intervention one. In our setting, this means drawing counterfactuals from agent A's distribution p(m^A_t | s^A_t), and not p̃(m^A_t), in step 2 of Algorithm 1 (main paper).
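The marginalization over counterfactual messages can be sketched as follows. Here p_z_given_m is a stand-in for agent B's response distribution p(z^B_{t+1} | m^A_t), and the KL divergence is only an illustrative influence score, not the exact ME definition of the main paper:

```python
import numpy as np

# Sketch of the counterfactual marginal under a uniform intervention.
# p_z_given_m[m, z] stands in for p(z^B_{t+1} | m^A_t = m); values are random.
rng = np.random.default_rng(0)
n_messages, n_z = 4, 5
p_z_given_m = rng.dirichlet(np.ones(n_z), size=n_messages)

p_tilde_m = np.full(n_messages, 1.0 / n_messages)  # uniform intervention
p_z = p_tilde_m @ p_z_given_m  # marginal: sum_m p(z|m) p~(m)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# per-message stand-in influence: how far conditioning on m moves B's behaviour
influence = [kl(p_z_given_m[m], p_z) for m in range(n_messages)]
```

The key point is that the marginal p_z is computed with counterfactual messages drawn from the intervention distribution, not from agent A's own policy.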

Difference with Mutual Information
There is an issue with employing MI and drawing counterfactuals from the influencing agent's distribution, e.g., p(m^A_t | s^A_t), that is particularly pressing when message distributions are very skewed (as is the case with our agents below). Consider a simple setting where agents A and B utter a message that has two possible values, u and v. A is the influencing agent and B the influencee. Suppose that the dynamics are such that A almost always says u, and B always replies with the message it received. Most of the exchanges we would sample would be: "A says u, B replies u". The MI estimate would then be very low, and one might erroneously conclude that B is not influenced by A. Indeed, when distributions are as peaky as in this example, very large samples would be required to witness rare events such as "A says v, B replies v". Lowe et al. (2019) ensure that all possible messages from the influencing agent are considered. This is computationally expensive when the message space is large. Moreover, the resulting MI can still be small, as each message's contribution is weighted by its probability under the influencing agent. By using a uniform intervention distribution, we ensure that B in the current example would receive v, and therefore reply v, in half the exchanges, easily detecting the influence of A on B.
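The toy scenario above can be computed exactly. With A saying u 99% of the time and B always echoing A, B is fully determined by A, yet the MI is tiny:

```python
import numpy as np

# Toy version of the example above: A says "u" with probability 0.99 and
# B always echoes A, so the joint is diagonal and MI(A;B) = H(A).
p_u = 0.99
joint = np.array([[p_u, 0.0], [0.0, 1.0 - p_u]])  # p(A=i, B=j)

pa, pb = joint.sum(axis=1), joint.sum(axis=0)
mi = sum(joint[i, j] * np.log(joint[i, j] / (pa[i] * pb[j]))
         for i in range(2) for j in range(2) if joint[i, j] > 0)
# mi is about 0.056 nats: small despite B being fully determined by A

# Under a uniform intervention on A's message, B replies "v" half the time,
# so the intervened marginal over B's replies is (0.5, 0.5), far from (0.99, 0.01).
p_b_interv = np.array([0.5, 0.5])
```

This makes the contrast concrete: the MI is bounded by the entropy of the skewed message distribution, while the intervened marginal differs drastically from the observational one.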
10 Additional results: performance and pragmatics

We consider the setting with communication and with memory. From successful test in-domain conversations, we create train/validation/test partitions for the classifier. We ensure that each fruit is in the train set, and each tool appears in either of the two positions. We use 20 different seeds for initializing the classifier dataset partitioning into train/validation/test. For each successful training seed, we compute the average accuracy over these 20 test initialization seeds, and report the classifier accuracy mean and standard error of the mean (SEM) over the successful training seeds.
The agents were trained with symmetrical roles and random starting agent, but we generate conversations with fixed roles and positions, so that all conversations follow the same pattern (for example: agent A always starts and agent A is always the Fruit Player).

Inverted-roles experiment
Table A5 shows the results of the inverted-roles experiment: e.g., we train the classifier on conversations where A is the Fruit Player and B the Tool Player, and test on conversations about the same inputs but with inverted roles, that is, B is the Fruit Player and A the Tool Player. The performance drops compared to testing on conversations where the roles are not inverted. For this experiment, we only consider conversations with at least one utterance from each agent (conversation length ≥ 2), in order to remove the potential confounding effect of conversation length.

Figure 1 :
Figure 1: Our game. One agent receives a fruit, the other two tools. Each agent sends a message in turn, until an agent ends the episode by choosing a tool. The agents are rewarded if the tool choice is optimal given the fruit.

Figure A1 :
Figure A1: Causal graph considered when we compute ME^{A→B}_t. The orange node m^A_t is the variable we intervene on. Shaded nodes represent the variables we condition on.
Figure 2: Two turns of dialogue. Dashed boxes are not used in this game episode due to the agent roles (the blue/left agent is the Tool Player, the green/right one the Fruit Player). The flow is explained in detail in the text.

Table 1 :
Test performance and pragmatic measures (mean and SEM) in different settings. "Av. perf." (average performance) denotes the % of samples where the best tool was chosen, "Bi. comm." denotes the % of games with bilateral communication taking place, "Av. conv. length" is the average conversation length in turns, and "T chooses" denotes the % of games ended by the Tool Player. Values of ME with an asterisk * are statistically significantly higher than their reverse (e.g., ME_{F→T} > ME_{T→F}). Best "Av. perf." and "Bi. comm." in bold.

Table A1 :
M_T. Rows are dataset tool features, columns are functional tool features.

Table A2 :
M_F. Rows are dataset fruit features, columns are functional fruit features.

                Hard  Pit  Shell  Pick  Peel  Empty inside
is crunchy         1    0      0     0     0     0
has skin           0    0      0     0     1     0
has peel           0    0      0     0     1     0
is small           0    0      0     1     0     0
has rough skin     0    0    0.5     0     0     0
has a pit          0    1      0     0     0     0
has milk           0    0      0     0     0     1
has a shell        0    0      1     0     0     0
has hair           0    0      0     0   0.5     0
is prickly         0    0      0     0   0.5     0
has seeds          0    0      0     0     0     1

Table A3 :
M. Rows are functional tool features, columns are functional fruit features.

             Hard  Pit  Shell  Pick  Peel  Empty inside
Cut             1    0    0.5     0   0.5     0
Spear           0    0      0     1     0     0
Lift            0    0      0   0.5     0     1
Break         0.5    0      1     0     0     0
Peel            0    0      0     0     1     0
Pit Remover     0    1      0     0     0     0

Table A4 :
Detailed ME values (compare to Table 1 in the main paper). 1T/2F denotes games where the Tool Player is in first position and the Fruit Player in second; 1F/2T denotes games where the Fruit Player is in first position and the Tool Player in second.

Table A4 reports a more detailed view of the ME values in Table 1 of the main paper. This table shows that in the no-memory, with-communication setting (top right quadrant), the difference between the influence of the Fruit Player on the Tool Player and its reverse is greater when the Fruit Player is in position 2. On in-domain data, the Fruit Player has a stronger influence on the Tool Player only when in position 2. This explains the effect ME_{2→1} > ME_{1→2} that we mention in Section 4.1 of the main paper. We also observe that in the no-memory, no-communication setting (top left quadrant), when the Tool Player is in position 1, ME^{1T/2F}_{1→2} ≈ 0. This relates to the artifact we describe in the main paper: in that case, the Tool Player stops the game at t = 0, leaving no room for the Fruit Player to be influenced.

Our classifier consists of an Embedding table of size 50, which maps the agents' discrete utterances to a continuous space, followed by an RNN with a hidden size of 100 that maps the entire embedded conversation into a hidden state. The hidden state is then fed to a linear classifier that predicts a score for each class; the number of classes depends on the prediction task (e.g., 31 classes when the task is to predict the fruit).
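A shape-level numpy sketch of this conversation classifier (embedding of size 50, RNN hidden size 100, linear output layer). The weights are random and untrained, and the vocabulary size is an assumption; the sketch only shows the forward pass:

```python
import numpy as np

# Shape-level sketch of the conversation classifier (untrained weights).
rng = np.random.default_rng(0)
VOCAB, EMB, HID, N_CLASSES = 10, 50, 100, 31  # VOCAB is an assumption

emb = rng.normal(0, 0.1, (VOCAB, EMB))      # embedding table of size 50
W_xh = rng.normal(0, 0.1, (EMB, HID))       # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (HID, HID))       # hidden-to-hidden weights
W_out = rng.normal(0, 0.1, (HID, N_CLASSES))  # linear classifier

def classify(conversation):
    """conversation: list of symbol ids; returns one score per class."""
    h = np.zeros(HID)
    for sym in conversation:
        h = np.tanh(emb[sym] @ W_xh + h @ W_hh)  # vanilla RNN step
    return h @ W_out

scores = classify([3, 1, 4, 1, 5])  # toy conversation of 5 symbols
```

The final hidden state summarizes the whole conversation, so the classifier's accuracy reflects how much information about the target (e.g., the fruit) the messages carry.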

Table A5 :
Semantic classifier % accuracy in the inverted-roles setup.

Both, Train A is F / Test B is F   6.8 ± 0.61   11 ± 1.16    8.8 ± 0.68
Both, Train B is F / Test A is F   5.9 ± 0.53   10 ± 1.05    8.8 ± 0.62
Stats A is F                       6.4 ± 0.27   8.9 ± 0.42   8.2 ± 0.38
Stats B is F                       6.4 ± 0.15   9.1 ± 0.61   9.0 ± 0.75