Emergent Linguistic Phenomena in Multi-Agent Communication Games

We describe a multi-agent communication framework for examining high-level linguistic phenomena at the community level. We demonstrate that complex linguistic behavior observed in natural language can be reproduced in this simple setting: i) the outcome of contact between communities is a function of inter- and intra-group connectivity; ii) linguistic contact either converges to the majority protocol, or in balanced cases leads to novel creole languages of lower complexity; and iii) a linguistic continuum emerges where neighboring languages are more mutually intelligible than farther removed languages. We conclude that at least some of the intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.


Introduction
Contact linguistics (Myers-Scotton, 2002) studies what happens when two or more languages or language varieties interact. It poses several pertinent open questions that are difficult to answer: how does symmetric ("mutually intelligible") communication emerge; how do languages behave under contact; how does one language come to dominate another; how and why does extensive language contact tend to lead to simplification (e.g. in creoles); and how does a linguistic continuum come about, where neighboring languages are more intelligible than farther removed ones? In this work, we show that such linguistic phenomena emerge naturally given a few general assumptions about the organizational structure of networks of artificial agents equipped with a minimalistic form of learned communication.
We introduce a multi-agent framework for studying the emergence and evolution of language, where agents are neural networks endowed with the ability to exchange messages about their perceptual input. The advantage of this approach is that one can precisely control linguistic, environmental and algorithmic variables.
First, we investigate linguistic behavior at the agent level, and examine when symmetric communication emerges within a linguistic community, as well as how the topological organization of communities (i.e., which other agents an agent comes into contact with, and how frequently that happens) impacts convergence and learning. We then examine the behavior of communities of such agents when they come into contact, as well as how community-level topology impacts convergence, success rate and mutual intelligibility.
We demonstrate that the following linguistic behaviors emerge, which correspond to known linguistic phenomena in natural languages: 1) the outcome of contact is a function of inter- and intra-group connectivity, i.e., languages become mutually intelligible through contact, even for agents that have not themselves been exposed to the other language, provided there is sufficient connectivity between communities; 2) linguistic contact over time either converges to the dominant majority protocol, leading to the extinction of the other language, or, if the communities are balanced, gives rise to an original "creole" protocol that has lower complexity than the original languages; 3) a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages, and the topology of the continuum governs its behavior.
To our knowledge, this work constitutes the first attempt at studying contact linguistic phenomena using communities of deep neural agents. Our findings indicate that intricate properties of language evolution need not depend on intrinsic properties of highly complex evolved linguistic capabilities, but instead can emerge purely from social exchanges between perceptually-enabled agents with simple communicative capabilities.

Related Work
Studying language change in vivo is challenging, since it requires simultaneous observation of speaker and community interactions (Brooks and Ragir, 2008; Trudgill, 1974; Joseph, 2017; Christiansen and Kirby, 2003), while carefully controlling for purposes and goals (Winograd, 1971; Flores and Winograd, 1987; Nowak and Krakauer, 1999). Studies of language emergence and evolution must furthermore be conducted over a long period of time, spanning decades or even centuries. Even Nicaraguan Sign Language, which emerged remarkably rapidly, took several decades to develop fully (Senghas et al., 2005). Language itself also never ceases to evolve (Fishman, 1964).
Advances in computer science have provided us with opportunities for instead investigating the emergence and evolution of languages in vitro using computational and mathematical models (Hurford, 1989; Briscoe, 2002; Kirby, 2002; Christiansen and Kirby, 2003; Kirby et al., 2008; Lewis, 2008; Skyrms, 2010). In computational approaches, communities of agents, equipped with the ability to communicate, are deployed in a simulated environment. Their communication protocol is either evolved or learned, in order to maximize some reward provided by the environment. The agents' behavior and communication are observed and used to compare against linguistic phenomena or hypothesized linguistic theories.
Computational multi-agent models are characterized by the complexity of the agents, the choice of learning algorithm, and the design of the environment and reward structure. The complexity of an artificial agent ranges from a set of simple difference equations (Grouchy et al., 2016), to a CPU-like architecture with an instruction set and registers (Knoester et al., 2007), to a co-occurrence matrix between objects and symbols (Nowak and Krakauer, 1999), to a simple single-layer neural network (Trianni and Dorigo, 2006), to a deep neural network (Lazaridou et al., 2016; Foerster et al., 2016; Jorge et al., 2016). The learning algorithm is either a variant of evolutionary algorithms (Nowak and Krakauer, 1999; Kirby, 2002; Grouchy et al., 2016), often used in the framework of Artificial Life (Bedau, 2003), or a gradient-based optimization algorithm, as is often used for training deep neural networks with a supervised or reinforcement learning objective. The former simulates generations developing complex behavior over time, while the latter enables more sophisticated agents thanks to recent advances in deep learning (LeCun et al., 2015). Recent years have seen intriguing new results in emergent communication, starting with Lazaridou et al. (2016) and Foerster et al. (2016), using deep neural agents (Lewis et al., 2017; Havrylov and Titov, 2017; Jorge et al., 2016; Evtimova et al., 2018; Das et al., 2018; Cao et al., 2018). Many of these approaches can be framed as special or generalized cases of Lewis's signalling game (Lewis, 2008), in which agents exchange signals to achieve a common goal. In this work, deep neural agents play games within communities of similar agents, where the aim is for agents to communicate about their perceptual input.

Multi-agent communication
In order to study emergent linguistic phenomena in a simplified but realistic setting, the communication game needs to have several properties. First, it should be symmetric, in that all agents should be able to act as "speaker" and "listener". Second, the agents should communicate about something external to themselves, i.e., about the sensory experience of something in their environment. Third, the world should be partially observable, implying that communication is required for solving the game successfully. Fig. 1 shows example training data, the game setup and agent design. See the supplementary material for details.
Reference Game Let G = (A, O, M, W, R) be a multi-agent communication game, consisting of communities of agents A communicating across the bidirectional message channel M given environmental observations O, where the community membership of agents is defined as a graph with the agents A as vertices and weighted edges W that determine whether two agents are connected; i.e., W specifies the topology of the network.
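The topology W can be sketched as a small weighted graph over agents; the following is a minimal illustration (the class name CommunityGraph and its methods are our own, not from the paper):

```python
import itertools
import random

class CommunityGraph:
    """Weighted graph over agents: an edge (i, j) with weight w > 0 means
    agents i and j may be paired for a game, with probability proportional
    to w. The edge set and weights specify the community topology W."""

    def __init__(self, num_agents):
        self.agents = list(range(num_agents))
        self.weights = {}  # (i, j) with i < j -> interaction weight

    def connect(self, i, j, w=1.0):
        self.weights[(min(i, j), max(i, j))] = w

    def fully_connect(self, agents, w=1.0):
        # Fully-connected community: every pair of agents interacts.
        for i, j in itertools.combinations(agents, 2):
            self.connect(i, j, w)

    def sample_pair(self, rng=random):
        # Pairs are drawn with probability proportional to their edge
        # weight, so the topology controls who talks to whom and how often.
        pairs = list(self.weights)
        return rng.choices(pairs, weights=[self.weights[p] for p in pairs])[0]

# A fully-connected community of five agents:
g = CommunityGraph(5)
g.fully_connect(g.agents)
```

Changing the community structure (e.g., merging two graphs and adding inter-group edges) then only changes which pairs get sampled, leaving the agents themselves untouched.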
Each agent is a deep neural network with perceptual inputs, and is given a language description, as illustrated in Fig. 1. All agents have the same network architecture because of the requirement that the game be symmetric, but each agent has its own parameters, which are learned during training. Pairs of connected agents learn to play a game with a reward structure R, designed specifically to require communication-based collaboration. The exchange of information through the communication channel M can take any form. By learning to play the game, agents develop a communication protocol about the observations. This framework allows us to control for proximity constraints, population size and degree of interaction, through the underlying graph structure. Throughout the game, each agent observes one part of an image that contains an object of a specific shape and color and is given a set of n_cap synthetic compositional captions (e.g., "there is a green triangle"), of which only one correctly describes the image.

Figure 1: Overview of the communication game and agent structure. Left: A graphical illustration of the proposed game. Each of two agents observes a partition of an input image and decides which of ten textual captions best describes the entire image, before and after exchanging messages with the other agent. Middle: Example training data. Only a random part of each image (dark background) is presented to one agent, necessitating communication in order to solve the game. Right: The modular structure of an agent.
The goal is for the agents to identify the correct caption y * for the image. Since each agent only has partial information, the pair of agents must cooperate via communication to be effective at solving the problem together.
At the beginning of the game, each agent makes an initial guess ŷ_0 of the correct answer, followed by k rounds of communication in which the agents take turns transmitting a binary message to the other agent. In the experiments discussed in this paper, agents communicate using 8-bit binary message vectors. Binary message vectors have been used before for studying the emergence and evolution of language (Kirby and Hurford, 2002). While we selected this type of communication for efficiency reasons, it is straightforward to replace it with sequences of discrete symbols (Jorge et al., 2016; Havrylov and Titov, 2017; Lee et al., 2017; Cao et al., 2018), continuous vectors, or larger binary vectors. After several rounds of communication, each agent makes its final guess ŷ_1. The game is considered successful if both agents correctly guessed the answer. During training, a pair of agents corresponding to adjacent vertices, a_i ∈ A and a_j ∈ A, is selected at random according to the interaction weights w_ij ∈ W. One of them is randomly picked as the starting agent. The agents then play one instance of the reference game and their parameters are updated accordingly. The community structure can change during training: for instance, we can merge separately trained linguistic communities into a single community and fine-tune the agents from both communities (i.e., bring the communities into contact) to investigate linguistic contact. Once training is done, we can test pairs of distant agents to understand what changed in the protocol.
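The turn-taking protocol described above can be sketched as follows. The agent interface (guess, speak, listen) and the RandomAgent placeholder are illustrative assumptions, not the paper's actual implementation:

```python
import random

MSG_BITS = 8  # agents exchange 8-bit binary message vectors

def play_episode(agent_a, agent_b, y_star, k_rounds=2, rng=random):
    """One instance of the reference game between two connected agents.
    y_star is the index of the correct caption."""
    # One agent is randomly picked as the starting speaker.
    speaker, listener = (agent_a, agent_b) if rng.random() < 0.5 else (agent_b, agent_a)
    before = (agent_a.guess() == y_star, agent_b.guess() == y_star)
    for _ in range(k_rounds):
        msg = speaker.speak()  # binary message vector
        assert len(msg) == MSG_BITS and all(b in (0, 1) for b in msg)
        listener.listen(msg)
        speaker, listener = listener, speaker  # agents take turns
    after = (agent_a.guess() == y_star, agent_b.guess() == y_star)
    success = all(after)  # successful only if BOTH agents are correct
    return before, after, success

class RandomAgent:
    """Placeholder agent guessing uniformly over n_cap captions."""
    def __init__(self, n_cap=10, rng=random):
        self.n_cap, self.rng = n_cap, rng
    def guess(self):
        return self.rng.randrange(self.n_cap)
    def speak(self):
        return [self.rng.randint(0, 1) for _ in range(MSG_BITS)]
    def listen(self, msg):
        pass  # a learned agent would fuse msg into its hidden state
```

A trained agent would replace RandomAgent, conditioning its guesses on its image partition and the messages received so far.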
Reward The reward structure for each agent is designed as follows. First, we reward the agent when it correctly guesses the answer after communication, r_after^self = 1_{y* = ŷ_1}, where 1 is an indicator function, in order to encourage it to incorporate information received from the other agent. Second, we reward cooperation by giving each agent a shared reward composed of both its own and the other agent's rewards, i.e., r_after = r_after^self + r_after^other. As Fig. 2 shows, we empirically validate the importance of rewarding after-communication behavior and observe that the cooperation reward significantly boosts the success rate. Lastly, we explicitly encourage the agents to rely on communication by rewarding them for the relative improvement from communication, rather than the success after communication: r = r_comm^self + r_comm^other, where r_comm^self = r_after^self − r_before^self and r_comm^other = r_after^other − r_before^other. This final reward, which encourages both cooperation and explicit reliance on communication, reaches the highest success rate.
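Under this final reward structure, the reward reduces to a few integer additions; a minimal sketch (the function name is ours):

```python
def communication_reward(correct_before, correct_after):
    """Final reward from the text: shared improvement from communication.
    correct_before / correct_after: (bool, bool) for (self, other)."""
    r_self_before, r_other_before = map(int, correct_before)
    r_self_after, r_other_after = map(int, correct_after)
    r_self_comm = r_self_after - r_self_before     # own improvement
    r_other_comm = r_other_after - r_other_before  # partner's improvement
    return r_self_comm + r_other_comm              # cooperative reward

# Both agents wrong before, both right after: maximal reward of 2.
# Both right before and after: reward 0, so agents cannot collect reward
# without actually relying on the exchanged messages.
```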
Training Each agent is trained using a hybrid of supervised and reinforcement learning. Each agent computes two predictive distributions, before and after the message exchange, p_before(y|h) and p_after(y|h), where y refers to one of the n_cap image captions and h is the hidden state of the agent network, which fuses perceptual input with the last received message. Since we know the correct caption y* during training, we use supervised learning to train these two predictive distributions, maximizing log p_before(y*|h) + log p_after(y*|h), using backpropagation and stochastic gradient descent (Rumelhart et al., 1986). Since messages are discrete, the message generating process cannot be backpropagated through. Instead, we use REINFORCE (Williams, 1992) to maximize the expected reward E_{m ~ p(m|h, p(y|h))}[r]. We add the entropy of the message distribution as a regularization term, encouraging the exploration of various communication strategies in the early stage of learning.
Loss Functions There are four loss functions involved in each game. The first is a prediction loss. Given the index y* of the correct caption, the prediction loss is the cross-entropy L_pred = −log p(y*|h). This loss is used twice, based on the predictions before and after the message exchange, denoted L_pred^before and L_pred^after respectively. The second is a value loss. After playing a game, the agent receives a reward r. The agent's value sub-network (see Fig. 1) learns to predict this reward: L_value = (V(h) − r)². The third is a message loss. During training, we sample one message m̂ from the message distribution generated by the agent, p(m|h, p(y|h)). If this message led to a success, we increase the probability of the sampled message; otherwise, we decrease it. The success is measured relative to the predicted value. The cost function is then L_msg = −(r − V̂(h)) log p(m = m̂|h, p(y|h)), where V̂(h) refers to using the predicted value without updating the value sub-network according to this loss function. The gradient of this message loss with respect to p(m = m̂|h, p(y|h)) corresponds to REINFORCE (Williams, 1992).
Lastly, we include an entropy penalty. Following Evtimova et al. (2018), we encourage the entropy of the message distribution to be higher to facilitate exploration: L_entropy = −H(p(m|h, p(y|h))). The overall loss is the weighted sum of the four loss functions, L = α_pred (L_pred^before + L_pred^after) + α_value L_value + α_msg L_msg + α_entropy L_entropy, where we set α_pred = 1.0, α_value = 1.0, α_msg = 1.0 and α_entropy = 0.01.
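Assuming standard cross-entropy, squared-error and REINFORCE-with-baseline forms for the four losses, the weighted sum can be sketched as a scalar, single-sample illustration (the actual implementation operates on network outputs and their gradients):

```python
import math

def game_losses(p_before_ystar, p_after_ystar, value_pred, reward,
                p_msg_sampled, msg_entropy,
                a_pred=1.0, a_value=1.0, a_msg=1.0, a_entropy=0.01):
    """Weighted sum of the four per-game losses. The functional forms are
    a sketch consistent with the text, not verbatim definitions."""
    # Prediction loss: cross-entropy before and after the message exchange.
    l_pred = -math.log(p_before_ystar) - math.log(p_after_ystar)
    # Value loss: squared error between predicted value and actual reward.
    l_value = (value_pred - reward) ** 2
    # Message loss: REINFORCE with the predicted value as baseline
    # (the baseline is treated as a constant; no gradient to the value net).
    l_msg = -(reward - value_pred) * math.log(p_msg_sampled)
    # Entropy penalty: maximizing entropy encourages exploration.
    l_entropy = -msg_entropy
    return (a_pred * l_pred + a_value * l_value
            + a_msg * l_msg + a_entropy * l_entropy)
```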

Experiments
When Are Protocols Symmetric? We first examine whether it is sufficient to have just two autonomous agents to develop a common communication protocol. That is, we ask whether a symmetric language emerges when there are only two agents in a linguistic community, or whether they learn to speak distinct idiolects. We formally define "mutual intelligibility" of a communication protocol as the ability of each agent to play the game against itself. This is an elegant criterion: if a shared communication protocol has emerged, an agent would not have any trouble playing a game with itself at test time (she "understands" what she "says" and "says" what she "understands"). We run five simulations with random initialization and examine the success rate between two agents averaged over all the test examples, where the agents succeed on an example if both of them correctly guess the answer after communication.

Figure 3: (a) Success rates under self-play and cross-play. At least three agents are necessary for the emergent protocol to be symmetric, without any specialized mechanism that enforces the symmetry of the emergent protocol. (b) The average number of plays per agent until the first observed success rate ≥ 70% between a pair of agents in a linguistic community stays approximately constant with respect to the community size. (c) The number of plays required for each agent in a linguistic community to reach a success rate of 75% on average across all agent pairs in the community stays approximately constant with respect to the community size. (d) Each agent learns at approximately the same rate regardless of community size, which suggests that the protocol emerges incrementally in a distributed manner rather than in a centralized way.
As shown in Fig. 3 (a), two agents can play the game with a high success rate when they play with each other (cross-play). However, the success rate drops to random chance (10%) when each agent plays against itself (self-play). This suggests that the emergent communication protocol is not symmetric: each agent has developed its own protocol, to which the other has adapted, leading to two incompatible idiolects, similar to previous observations. We then run additional experiments with more than two agents, where every pair of agents (v_i, v_j) interacts with an equal interaction intensity w_ij = c (the success rate is averaged over all possible pairs of agents, unless stated otherwise). We notice that the success rates between self-play and cross-play are indistinguishable, strongly implying that a common, shared language emerges as a social convention if and only if there are more than two language users. This finding demonstrates that it is not strictly necessary to equip an agent with an innate mechanism that ties listening and speaking, such as the obverter technique (Oliphant and Batali, 1997; Choi et al., 2018), nor any explicit community-wide coordination. All that is needed for a common language to emerge, at least within this framework, is a minimum number of agents.
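The self-play versus cross-play comparison amounts to filling a success-rate matrix whose diagonal is self-play; a minimal sketch, assuming a play(a, b) callable that runs one game and reports success:

```python
def intelligibility_matrix(agents, play, n_games=100):
    """Success rate for every ordered pair of agents.
    Entry [i][i] is self-play; off-diagonal entries are cross-play.
    `play(a, b)` is assumed to run one game and return True on success."""
    n = len(agents)
    return [[sum(play(agents[i], agents[j]) for _ in range(n_games)) / n_games
             for j in range(n)]
            for i in range(n)]

# With a perfectly idiolectal "protocol" (an agent only understands
# itself), the diagonal is 1.0 and every cross-play entry is 0.0:
m = intelligibility_matrix([0, 1, 2], lambda a, b: a == b, n_games=10)
```

For a genuinely symmetric protocol, the diagonal and off-diagonal entries would instead be indistinguishable.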
We observe no detrimental effect of increasing the number of agents per linguistic community, even though more agents have to come to agree on a single protocol. As shown in Fig. 3 (b-c), it takes approximately 60-65,000 plays per agent to observe the first instance of a pair of agents reaching a success rate of 70%, regardless of the community size. Similarly, it takes approximately 150-200,000 plays per agent for the success rate averaged over all pairs of agents in a community to reach 75%, again regardless of the size of the community. Surprisingly, the speed at which each agent learns stays constant with respect to the community size, as in Fig. 3 (d). That is, we did not observe any correlation between community size and linguistic convergence of the entire community, which would probably only come into play with orders of magnitude more agents or for more complex reference games.

Figure 4: (a) Illustration of two communities brought into contact (see text). (b) Two communities of population ten each came into contact with the ratio of learning frequencies (K w_inter)/(L w_intra) = 1 and the inter-group connectivity p_inter = 0.2, after being separately trained in isolation (averaged over five runs). The bridge agents, who interact with the agents from the other community, learn the new shared emergent protocol faster and better. All the other agents, however, also rapidly learn to communicate with the agents from the other community, although they never interact directly with them. (c) Contour plot of the success rate after 200,000 post-contact plays, varying the ratio of learning frequencies (K w_inter)/(L w_intra) and the inter-group connectivity p_inter (linearly interpolated from 15 experiments). The success rate, which measures the level of convergence of the two protocols, requires a certain level of inter-group connectivity (p_inter > 0.2). Even when the inter-group connectivity is high enough, the bridge agents must interact with the agents from the other community sufficiently often ((K w_inter)/(L w_intra) ≥ 1) for the converged protocol to be well understood by the agents from both communities.

Understanding Convergence
We next examine what happens when we expose different linguistic communities to each other. Specifically, we consider two linguistic communities of population sizes N_1 and N_2, which are trained independently from each other as fully-connected communities and have developed separate communication protocols. We bring these two communities into contact by introducing a new set of inter-community edges, each added with probability p_inter, to form a new linguistic community. We assign a weight w_inter to all the inter-group edges and another weight w_intra to all the intra-group edges. We then examine how this "interaction intensity" relates to language shift. See Fig. 4 (a). We first investigate two communities of identical population sizes (N_1 = N_2) with the ratio of the learning frequencies of the intra-group and inter-group pairs set to (K w_inter)/(L w_intra) = 1, where K is the number of inter-community connections and L is the number of intra-community connections, and the inter-group connectivity chance set to p_inter = 0.2. We notice in Fig. 4 (b) that the bridge agents learn to communicate better and more rapidly (evident from the higher success rate among themselves), but the other agents quickly catch up (according to the success rate among themselves, excluding the bridge agents), although these other agents never directly interact with agents from the other community. This finding demonstrates a rapid shift toward a common protocol in both groups, where all agents learn to speak a shared language, regardless of whether they actually interact with agents from the other group.
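The contact construction can be sketched as follows; bring_into_contact is an illustrative name, and the edge bookkeeping is a simplification of the actual training setup:

```python
import itertools
import random

def bring_into_contact(community_1, community_2, p_inter, w_inter, w_intra,
                       rng=random):
    """Merge two fully-connected communities into one linguistic community.
    Intra-group edges keep weight w_intra; each possible inter-group edge
    is added independently with probability p_inter at weight w_inter.
    Returns the edge dict and the ratio (K * w_inter) / (L * w_intra)."""
    intra_edges = [e for group in (community_1, community_2)
                   for e in itertools.combinations(group, 2)]
    inter_edges = [(i, j) for i in community_1 for j in community_2
                   if rng.random() < p_inter]
    K, L = len(inter_edges), len(intra_edges)  # counts as in the text
    ratio = (K * w_inter) / (L * w_intra)
    edges = {e: w_intra for e in intra_edges}
    edges.update({e: w_inter for e in inter_edges})
    return edges, ratio

# Two communities of ten agents each, with every inter-group edge present:
edges, ratio = bring_into_contact(list(range(10)), list(range(10, 20)),
                                  p_inter=1.0, w_inter=1.0, w_intra=1.0)
```

Sampling training pairs proportionally to these edge weights then realizes the desired interaction intensities.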
Having established that linguistic contact leads to convergence of the communication protocol, we delve deeper into the impact of two major parameters, the ratio (K w_inter)/(L w_intra) and the connectivity probability p_inter. We vary the ratio (K w_inter)/(L w_intra) of the learning frequencies of the intra-group and inter-group pairs between 2/1, 1/1 and 1/2, while fixing the inter-group connectivity to p_inter = 0.2. After 200,000 plays, the former ((K w_inter)/(L w_intra) = 2/1) converges to a more tightly coupled linguistic community, achieving a 65.6% success rate between agents that never interacted with each other. On the other hand, when the inter-group interaction occurred only half as frequently as the intra-group interaction, the agents from the two groups can play together with a much lower 52.4% success rate. We observed similar patterns over many different combinations of the ratio and inter-group connectivity. For example, we varied the inter-group connectivity p_inter between 0.1, 0.15, 0.2, 0.5 and 0.75 while the interaction ratio was fixed to (K w_inter)/(L w_intra) = 2/1. After 200,000 plays, we observed the success rates, averaged over all possible inter-group pairs, reach 42.1%, 51.1%, 65.55%, 66.65% and 66.3%, respectively. This implies that there is a critical level of inter-group connectivity (around 0.2 in this specific case) after which language propagation saturates.

Figure 5: Relationship between the common emergent protocol and the original protocols after linguistic contact between two communities (averaged over five runs). (a) Divergence of the common emergent protocol from the original protocols. Agents converge either to the majority protocol or to one in between the two originals. (b) By varying the population ratio, it becomes clear that a near-even balance between the two communities is necessary for a novel contact protocol to emerge, rather than domination by the majority protocol.
In Fig. 4 (c), we plot the interplay between the ratio (Kw inter )/(Lw intra ) and the inter-group connectivity p inter after interpolating from the fifteen experiments varying these parameters. This demonstrates that both parameters are important in determining the level of linguistic convergence.
Birth of a New Language: Emergence of a Contact Language We investigate the effect of the population size ratio N 1 /N 2 between two linguistic communities when they come in contact. We study how population size is a factor in one language coming to "dominate" another language upon contact. We vary the ratio by fixing N 1 = 10 and varying N 2 ∈ {3, . . . , 10}. Each community is pretrained in isolation to develop its own protocol before being exposed to the other. We set the interaction ratio (Kw inter )/(Lw intra ) to 1 and the inter-group connectivity chance p inter to 0.2.
We refer to the original protocols of the two communities right after pretraining as L_1^0 and L_2^0. Each of these then evolves further after the two communities come into contact, resulting in L_1^∞ and L_2^∞. The previous experiment on linguistic contact suggests that L_1^∞ ≈ L_2^∞, based on the fact that the agents from both communities can successfully play the game after coming into contact, so we refer to the final protocol as L^∞. We examine how similar L^∞ is to either of the original protocols, L_1^0 or L_2^0. This similarity is measured by letting an agent using L_1^0 or L_2^0 play against one using L^∞, which is naturally facilitated by the proposed framework. This "historical self-play" accuracy S(L^0 · L^∞) reflects the similarity of the original and final protocols.
When the population ratio deviates from N_1/N_2 = 1, we observe that the final protocol rapidly converges to the majority protocol (L_1^0), evident from the near-perfect S(L_1^0 · L^∞) and the near-chance S(L_2^0 · L^∞) in Fig. 5. This results from the fact that members of both communities are rewarded for cooperating and playing the game well (via the bridge agents). In other words, the agents prefer to integrate or assimilate rather than segregate, similar to how minority groups have been found to shift toward the use of a dominant language (Fase et al., 1992). On the other hand, when the population ratio is close to or exactly 1, we observe S(L_1^0 · L^∞) ≈ S(L_2^0 · L^∞), with both historical self-play accuracies significantly above chance. It is impossible to identify either L_1^0 or L_2^0 as the ancestor of L^∞; rather, L^∞ is a combination of these two original protocols, which is a key characteristic of contact languages (Matras, 2009). Both of these observations suggest the potential of the proposed framework for simulating and understanding the birth and death of languages via linguistic contact.

Figure 6: Evolution of success rate and protocol complexity as two communities of varying populations come into contact (averaged over three runs). Agents were fine-tuned until the average success rate reached at least 70%. The success rate evolves similarly in terms of the number of plays per agent. (a) We observe significantly different levels of complexity depending on the population ratio. (b) The complexity is generally lower when the sizes of the two communities are more balanced (10-10 and 10-9), while it does not decrease as much when there is a significant imbalance in sizes between the two communities (10-4 and 10-3).
We further investigate the complexity of the contact language arising from two linguistic communities coming into contact. We define complexity as the uncertainty of an agent when generating a message, and measure the entropy of the message distribution H(p(m|h, p(y|h))). Higher entropy indicates that agents can express states in many different ways: in other words, the more complex a language, the higher its degrees of freedom. For each linguistic community, we then compute the average of the agents' message distributions in order to characterize the complexity of the learned communication protocol.
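This complexity measure can be sketched in a few lines. Note that "the average of their message distributions" is ambiguous between averaging per-agent entropies and taking the entropy of the averaged distribution; the sketch below assumes the latter:

```python
import math

def message_entropy(p_msg):
    """Entropy (in bits) of a message distribution, e.g. over the 2**8
    possible 8-bit messages; used here as the complexity measure."""
    return -sum(p * math.log2(p) for p in p_msg if p > 0.0)

def community_complexity(agent_msg_dists):
    """Average the agents' message distributions, then take the entropy
    of the averaged distribution (one reading of the text)."""
    n = len(agent_msg_dists)
    avg = [sum(d[k] for d in agent_msg_dists) / n
           for k in range(len(agent_msg_dists[0]))]
    return message_entropy(avg)
```

For instance, a community whose agents each use a single fixed message deterministically, but disagree on which one, still shows community-level entropy under this reading.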
We observe in Fig. 6 (b) that the complexity decreases when two communities come into contact. This observation agrees with the similar phenomenon of structural simplification in creole languages, which are understood to arise from the contact of two or more languages (Parkvall, 2008; Bakker et al., 2011). We also observe that the complexity plateaus earlier when there is a larger imbalance between the two communities' populations (10-3 and 10-4), while it drops further with more balanced communities (10-9 and 10-10). This implies that newborn contact languages arising from the contact of two similarly-sized communities tend to be substantially simpler.
A Linguistic Continuum of Contact We generalize the previous setting to M > 2 linguistic communities in various topologies. We start by pretraining M linguistic communities of populations N_1, N_2, ..., N_M respectively, evolving M distinct communication protocols. We then chain them such that each consecutive pair, C_i and C_{i+1}, comes into contact with a pre-specified inter-group connectivity chance p_inter and interaction ratio (K w_inter)/(L w_intra), and begin training all of the communities jointly. We study the emergence of a linguistic continuum, similar to the dialect continua found in natural languages, such as the North Germanic dialects of Scandinavia (Crystal, 1987). Often, speakers on the border are mutually intelligible, while those from communities geographically separated by many intermediate ones cannot communicate.
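The chained topology can be sketched as follows (the helper name is ours; the actual inter-group edges would then be sampled per consecutive pair as in the two-community experiments):

```python
def chain_communities(sizes):
    """Lay out M communities along a chain: community i gets a contiguous
    block of agent ids, and each consecutive pair (C_i, C_{i+1}) is a
    candidate for contact."""
    communities, start = [], 0
    for n in sizes:
        communities.append(list(range(start, start + n)))
        start += n
    contact_pairs = [(i, i + 1) for i in range(len(sizes) - 1)]
    return communities, contact_pairs

# Five equal communities (M = 5); only neighbors ever come into contact:
comms, pairs = chain_communities([5] * 5)
```

A densely connected variant would instead list every pair of communities as a contact pair, which is the control condition used below.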
We start by considering a chain of five communities of equal population (M = 5). As plotted in Fig. 7 (a), we clearly observe the emergence of a linguistic continuum. The agents from a pair of adjacent communities can communicate with each other almost as well as those within a single community, while communicability rapidly degrades as the distance between a pair of communities grows (off-diagonal). The agents from C_1 and C_5 cannot understand each other at all, achieving a near-chance success rate. A similar continuum is observed when we increase the population of the center community two-fold (5→10). This continuum, however, exhibits properties different from the original chain of equal-sized communities: the center communities C_2, C_3 and C_4 become more tightly coupled, as evident from the higher success rates among them in Fig. 7 (b-c). This, however, comes at the cost of communicability between the agents from the furthest-removed communities.
To see whether the emergence of such a continuum is due to topological properties of the communities, we show the similarities S(L_i^∞ · L_j^∞) among the five communities when densely connected in Fig. 7 (d-e). Unlike for chaining, we ensure that every pair of communities comes into contact in the densely connected topology. All communities are uniformly similar, confirming that the linguistic continuum arises from the topology.

Discussion and Conclusion
We have described a framework for the large-scale investigation of complex linguistic phenomena via multi-agent communication games. We started by observing that a symmetric communication protocol emerges without any innate, explicit mechanism built into an agent, when there are three or more agents in a community. We then demonstrated the emergence of several complex linguistic phenomena in this simple framework.
First, the result of linguistic contact between communities is determined by inter- and intra-group connectivity patterns. Given sufficient inter-group connectivity, languages become mutually intelligible through contact, even for agents that have not been exposed to the other language. Second, linguistic contact over time either converges to the majority protocol, leading to the extinction of the other language, or, if the communities are balanced, gives rise to an original "creole" protocol that has lower complexity than the original languages. Third, a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages. The topology of the continuum governs its behavior, and a very dominant central language causes its neighbors to lose mutual intelligibility with communities that are not directly exposed to that central language.
We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games. Language evolution and its properties can be effectively and efficiently investigated in depth under the proposed framework.
In future work, it would be useful to investigate more complicated environments and more complex agent interactions. The setting in this paper included a very simple vision task, and we suspect emergent linguistic phenomena would be more pronounced and even more interesting to study in more sophisticated settings.