The Grammar of Emergent Languages

In this paper, we consider the syntactic properties of languages that emerge in referential games, using unsupervised grammar induction (UGI) techniques originally designed to analyse natural language. We show that the considered UGI techniques are appropriate to analyse emergent languages and we then study if the languages that emerge in a typical referential game setup exhibit syntactic structure, and to what extent this depends on the maximum message length and number of symbols that the agents are allowed to use. Our experiments demonstrate that a certain message length and vocabulary size are required for structure to emerge, but they also illustrate that more sophisticated game scenarios are required to obtain syntactic properties more akin to those observed in human language. We argue that UGI techniques should be part of the standard toolkit for analysing emergent languages and release a comprehensive library to facilitate such analysis for future researchers.


Introduction
Artificial agents parameterised by deep neural networks can learn to communicate using discrete symbols to solve collaborative tasks (Foerster et al., 2016; Lazaridou et al., 2017; Havrylov and Titov, 2017). A prime reason to conduct such studies, which constitute a new generation of experiments with referential games, is that they may provide insight into the factors that shaped the evolution of human languages (Kirby, 2002).
However, the emergent languages developed by neural agents are not human-interpretable, and little is known about their semantic and syntactic nature. More specifically, we do not know to what extent the structure of emergent languages resembles the structure of human languages, what the languages encode, and how these two things depend on choices that need to be made by the modeller.
A substantial obstacle to better understanding emergent languages is the lack of tools to analyse their properties. Previous work has concentrated primarily on understanding languages through their semantics, by studying the alignment of messages and symbolic representations of the meaning space (e.g. Lazaridou et al., 2018). A substantial downside of such approaches is that they are restricted to scenarios for which a symbolic representation of the meaning space is available. Furthermore, they ignore a second important aspect of language: syntax, which is relevant not just for syntactically oriented researchers, but also for those who are interested in semantics from a compositional perspective. In this work, we aim to address this gap in the literature by presenting an analysis of the syntax of emergent languages.
We take inspiration from unsupervised grammar induction (UGI) techniques originally proposed for natural language. In particular, we use them to investigate if the languages that emerge in the typical setup of referential games exhibit interesting syntactic structure, and to what extent this depends on the maximum message length and number of symbols that the agents are allowed to use.
We first establish that UGI techniques are also suitable for our artificial scenario, by testing them on several artificial structured languages that are distributionally similar to our emergent languages. We then use them to analyse a variety of languages emerging from a typical referential game, with various message lengths and vocabulary sizes. We show that short messages of up to length five do not give rise to any interesting structure, while longer messages are significantly more structured than random languages, yet still far from the type of syntactic structure observed in even simple human language sentences.
Our results thus suggest that more interesting game scenarios may be required to trigger properties more similar to human syntax and, importantly, confirm that UGI techniques are a useful tool to analyse such more complex scenarios. Their results are informative not only for those interested in the evolution of the structure of human languages, but can also fuel further semantic analysis of emergent languages.

Related work
Previous work on the analysis of emergent languages has primarily concentrated on semantics-based analysis. In particular, these studies considered whether agents transmit information about categories or objects, or instead communicate using low-level feature information (Steels, 2010; Lazaridou et al., 2017; Bouchacourt and Baroni, 2018; Lazaridou et al., 2018; Mihai and Hare, 2019, i.a.).

Qualitative inspection
Many previous studies have relied on qualitative, manual inspection. For instance, Lazaridou et al. (2018) and Havrylov and Titov (2017) showed that emergent languages can encode category-specific information through prefixing as well as word order and hierarchical coding, respectively. Others instead have used qualitative inspection to support the claim that messages focus on pixel information instead of concepts (Bouchacourt and Baroni, 2018), that agents consistently use certain words for specific situations (Mul et al., 2019) or re-use the same words for different property values (Lu et al., 2020), or that languages represent distinct properties of the objects (e.g. colour and shape) under specific circumstances (Kottur et al., 2017; Choi et al., 2018; Słowik et al., 2020).

RSA
Another popular approach to analyse the semantics of emergent languages relies on representational similarity analysis (RSA, Kriegeskorte et al., 2008). RSA is used to analyse the similarity between the language space and the meaning space, in which case it is also called topographic similarity (Brighton et al., 2005; Brighton and Kirby, 2006; Lazaridou et al., 2018; Andreas, 2019; Li and Bowling, 2019; Keresztury and Bruni, 2020; Słowik et al., 2020; Ren et al., 2020). It has also been used to directly compare the continuous hidden representations of a neural agent with the input space (Bouchacourt and Baroni, 2018).

Diagnostic Classification
A last technique used to analyse emergent languages is diagnostic classification (Hupkes et al., 2018), which is used to examine which concepts are captured by the visual representations of the playing agents (Lazaridou et al., 2018), whether the agents communicate their hidden states (Cao et al., 2018), which input properties are best retained by the agent's messages (Luna et al., 2020) and whether the agents communicate about their own objects and possibly ask questions (Bouchacourt and Baroni, 2019).

Method
We analyse the syntactic structure of languages emerging in referential games with UGI techniques. In this section, we describe the game setup that we consider (§3.1), the resulting languages that are the subject of our analysis (§3.2) and the UGI techniques that we use (§3.3). Lastly, we discuss our main methods of evaluating our UGI setups and the resulting grammars (§3.4) as well as several baselines that we use for comparison (§3.5).

Game
We consider a game setup similar to the one presented by Havrylov and Titov (2017), in which we vary the message length and vocabulary size. In this game, two agents develop a language in which they speak about 30 × 30 pixel images that represent objects of different shapes, colours and sizes (3 × 3 × 2), placed in different locations. In the first step of the game, the sender agent observes an image and produces a discrete message to describe it. The receiver agent then uses this message to select an image from a set containing the correct image and three distractor images. Following Luna et al. (2020), we generate the target and distractor images from a symbolic description with a degree of non-determinism, resulting in 75k, 8k, and 40k samples for the train, validation, and test set.
Both the sender and the receiver agent are modelled by an LSTM and a CNN as language and visual units, respectively. We pretrain the visual unit of the agents by playing the game once, after which it is kept fixed throughout all experiments. All trained agents thus have the same visual unit; during training, only the LSTM's parameters are updated. We use Gumbel-Softmax with a temperature of 1.2 for optimising the agents' parameters, with batch size 128 and initial learning rate 0.0001 for the Adam optimiser (Kingma and Ba, 2015). In addition, we use early stopping with a patience of 30 to avoid overfitting. We refer to Appendix A for more details about the architectures and a mathematical definition of the game that we used.
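As an illustration of the Gumbel-Softmax relaxation mentioned above, the following minimal sketch draws one relaxed symbol distribution at temperature 1.2. It is not the agents' actual implementation: the function name, the example logits and the final argmax discretisation are assumptions made for this example, and the text does not state whether a straight-through estimator is used.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature=1.2, rng=random):
    """Draw one relaxed one-hot sample over the vocabulary (illustrative helper)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1);
        # u is clipped so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sender logits for a vocabulary of six symbols.
probs = gumbel_softmax_sample([2.0, 0.5, -1.0, 0.0, 0.3, 1.1], rng=random.Random(0))
symbol = max(range(len(probs)), key=probs.__getitem__)  # assumed discretisation
```

Lower temperatures push the sample closer to a one-hot vector, while higher temperatures make it smoother.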

Languages
From the described game, we obtain several different languages by varying the maximum message length L and vocabulary size V throughout experiments. For each combination of L ∈ {3, 5, 10} and V ∈ {6, 13, 27}, we train the agents three times. In all these runs, the agents develop successful communication protocols, as indicated by their high test accuracies (between 0.95 and 1.0). Furthermore, all agents can generalise to unseen scenarios.
For our analysis, we then extract the sender messages for all 40k images from the game's test set. From this set of messages, we construct a disjoint induction set (90%) and evaluation set (10%). Because the sender may use the same messages for several different input images, messages can occur multiple times. In our experiments, we consider only the set of unique messages, which is thus smaller than the total number of images. Table 1 provides an overview of the number of messages in the induction and evaluation set for each language with maximum message length L and vocabulary size V.
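The construction of the two sets can be sketched as follows. This is a minimal illustration, not the released code: the function name, the seed, and the choice to deduplicate before splitting are assumptions, and the toy messages are invented.

```python
import random

def split_unique_messages(messages, eval_fraction=0.1, seed=0):
    """Deduplicate messages, then split them into disjoint induction and evaluation sets."""
    unique = sorted(set(tuple(m) for m in messages))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = round(len(unique) * eval_fraction)
    return unique[n_eval:], unique[:n_eval]

# Toy sender output: 12 messages, of which 10 are unique.
sender_messages = [(1, 2, 3), (1, 2, 3), (4, 5, 0), (2, 2, 1), (4, 5, 0), (3, 1, 5),
                   (0, 0, 2), (5, 4, 3), (1, 1, 1), (2, 3, 4), (3, 3, 0), (0, 5, 5)]
induction_set, evaluation_set = split_unique_messages(sender_messages)
```

Because the split operates on unique messages, no message can appear in both sets.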
In the rest of this paper we refer to these sets of languages by denoting the message length and vocabulary size of the game they come from. For instance, V6L10 refers to the set of languages trained with a vocabulary size of 6 and a maximum message length of 10. Note that while the sender agent of the game may choose to use shorter messages and fewer symbols than these limits, it typically does not.

Grammar induction
For natural language, there are several approaches to unsupervised parsing and grammar induction. Some of these approaches induce the syntactic structure (in the form of a bracketing) and the constituent labels simultaneously, but most do only one of those. We follow this common practice and use a two-stage induction process (see Figure 1), in which we first infer unlabelled constituency structures and then label them. From these labelled structures, we then read out a probabilistic context free grammar (PCFG).

              seed 0            seed 1            seed 2
 L    V    induct.  eval.    induct.  eval.    induct.  eval.
 3    6      162      19       141      16       147      17
 3   13      440      49       390      44       358      40
 3   27      596      67       554      62       512      57
 5    6      913     102       795      89       781      87
 5   13     1819     203      1337     149      1614     180
 5   27     2062     230      1962     219      1429     159
10    6     4526     503      4785     532      4266     475
10   13     8248     917      9089    1010      7546     839
10   27     9538    1060      8308     924      9112    1013

Table 1: The number of messages per language for the induction and evaluation set, for all three seeds for playing the referential game.

Figure 1: Our two-stage grammar induction setup. We try to reconstruct the grammar G that is hypothesised to have generated our set of messages M, using first CCL and DIORA to infer unlabelled constituency trees for all m ∈ M and then BMM to label these trees.

Constituency structure induction
To induce constituency structures, we compare two different techniques: the pre-neural statistical common cover link parser (CCL, Seginer, 2007) and the neural parser DIORA.

CCL
While proposed in 2007, CCL is still considered a state-of-the-art unsupervised parser. Contrary to other popular parsers from the 2000s (e.g. Klein and Manning, 2004, 2005; Ponvert et al., 2011; Reichart and Rappoport, 2010), it does not require POS-annotation of the words in the corpus, making it appropriate for our setup. CCL is an incremental and greedy parser that aims to add cover links to all words in a sentence. From these sets of cover links, constituency trees can be constructed. To limit the search space, CCL incorporates a few assumptions based on knowledge about natural language, such as the fact that constituency trees are generally skewed and that the word distribution is Zipfian. In our experiments, we use the default settings for CCL.

DIORA
In addition to CCL, we also experiment with the more recent neural unsupervised parser DIORA. As the name suggests, DIORA is built on the application of recursive auto-encoders.
In our experiments with DIORA, we use a tree-LSTM with a hidden dimension of 50, and train for a maximum of 5 epochs with a batch size of 128. We use the GloVe framework (Pennington et al., 2014) to pretrain word embeddings for our corpus, using an embedding size of 16.

Constituency labelling
To label the constituency structures returned by CCL and DIORA, we use Bayesian Model Merging (BMM, Stolcke and Omohundro, 1994). BMM was originally proposed to induce grammars for natural language corpora, but proved to be infeasible for that purpose. However, BMM has been successfully used to infer labels for unlabelled constituency trees (Borensztajn and Zuidema, 2007). It can therefore complement techniques such as CCL and DIORA.
The BMM algorithm starts from a set of constituency trees in which each constituent is given its own unique label. It defines an iterative search procedure that merges labels to reduce the joint description length of the data (DDL) and the grammar that can be inferred from the labelling (GDL). To find the next best merge step, the algorithm computes the effect of merging two labels on the sum of the GDL and DDL after doing the merge, where the GDL is defined as the number of bits needed to encode the grammar that can be inferred from the current labelled treebank with relative frequency estimation, and the DDL as the negative log-likelihood of the corpus given this grammar. To facilitate the search and avoid local minima, several heuristics and a look-ahead procedure are used to improve the performance of the algorithm. We use the BMM implementation provided by Borensztajn and Zuidema (2007).
We refer to our complete setups with the names CCL-BMM and DIORA-BMM, respectively, depending on which constituency inducer was used in the first step.
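The two quantities that BMM trades off can be illustrated with a small sketch. This is not the implementation of Borensztajn and Zuidema (2007): the nested-tuple tree format and all function names are assumptions, the DDL is approximated by the likelihood of the given derivations, and the GDL uses a simple uniform symbol code chosen only for illustration.

```python
import math
from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree (label, child, ...)."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from rules_of(child)

def read_out_pcfg(treebank):
    """Relative frequency estimation: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(rule for tree in treebank for rule in rules_of(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def data_description_length(treebank, pcfg):
    """DDL in bits: negative log2-likelihood of the derivations in the treebank."""
    return -sum(math.log2(pcfg[rule]) for tree in treebank for rule in rules_of(tree))

def grammar_description_length(pcfg):
    """Toy GDL (an assumption): log2(#symbols) bits for each symbol slot of each rule."""
    symbols = {s for lhs, rhs in pcfg for s in (lhs, *rhs)}
    bits_per_symbol = math.log2(len(symbols))
    return sum((1 + len(rhs)) * bits_per_symbol for _, rhs in pcfg)

# Two labelled toy trees for the messages "1 2" and "1 3".
treebank = [("S", ("A", "1"), ("B", "2")),
            ("S", ("A", "1"), ("B", "3"))]
pcfg = read_out_pcfg(treebank)
ddl = data_description_length(treebank, pcfg)
gdl = grammar_description_length(pcfg)
average_message_ddl = ddl / len(treebank)
```

In this toy treebank only the choice between B -> 2 and B -> 3 is uncertain, so each message costs one bit. A merge step would relabel constituents, recompute both lengths, and be kept if their sum decreases.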

Evaluation
As we do not know the true structure of the emergent languages, we have to resort to different measures than the traditional precision, recall and F1 scores that are typically used to evaluate parses and grammars. We consider three different aspects, which we explain below.

Grammar aptitude
To quantitatively measure how well the grammar describes the data, we compute its coverage on a disjoint evaluation set. Coverage is defined as the ratio of messages that the grammar can parse and thus indicates how well a grammar generalises to unseen messages of the same language. We also provide an estimate of how many messages outside of the language the grammar can parse, i.e. to what extent the grammar overgenerates, by computing its coverage on a subset of 500 randomly sampled messages.
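Coverage can be sketched with a small memoised recogniser, shown below. It is an illustration rather than the parser used in the experiments: the grammar format and function names are assumptions, the recogniser assumes a grammar without empty or cyclic unary rules, and the random sample is not filtered for messages that happen to belong to the language.

```python
import random
from functools import lru_cache

def make_recogniser(rules, start="S"):
    """rules: dict mapping each non-terminal to a list of right-hand sides (tuples)."""
    def accepts(message):
        message = tuple(message)

        @lru_cache(maxsize=None)
        def derives(symbol, i, j):
            # Anything that is not a non-terminal is a terminal covering one position.
            if symbol not in rules:
                return j == i + 1 and message[i] == symbol
            return any(covers(rhs, i, j) for rhs in rules[symbol])

        @lru_cache(maxsize=None)
        def covers(rhs, i, j):
            # Divide message[i:j] over the symbols of rhs; each takes at least one position.
            if len(rhs) == 1:
                return derives(rhs[0], i, j)
            return any(derives(rhs[0], i, k) and covers(rhs[1:], k, j)
                       for k in range(i + 1, j - len(rhs) + 2))

        return len(message) > 0 and derives(start, 0, len(message))
    return accepts

def coverage(accepts, messages):
    """Ratio of messages that the grammar can parse."""
    messages = list(messages)
    return sum(accepts(m) for m in messages) / len(messages)

# Hypothetical grammar: one pre-terminal X for six symbols and one rule S -> X X X.
rules = {"S": [("X", "X", "X")], "X": [(s,) for s in range(6)]}
accepts = make_recogniser(rules)
rng = random.Random(0)
random_messages = [tuple(rng.randrange(6) for _ in range(3)) for _ in range(500)]
overgeneration_coverage = coverage(accepts, random_messages)
```

A grammar of this shape accepts every message of length three over the vocabulary, so its coverage on random messages is as high as on the evaluation set.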

Language compressibility
To evaluate the extent to which the grammar can compress a language, we consider the grammar and data description lengths (GDL and DDL), as defined by Borensztajn and Zuidema (2007). To allow comparison between languages that have a different number of messages, we consider the average message DDL.

Grammar nature
Lastly, to get a more qualitative perspective on the nature of the induced grammar, we consider a few statistics expressing the number of non-terminals and pre-terminals in the grammar, as well as the number of recursive production rules, defined as production rules in which the symbol from the left-hand side also appears on the right-hand side. Additionally, we consider the distribution of depths of the most probable parses of all messages in the evaluation sets.
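Both statistics are straightforward to compute. The sketch below assumes rules given as (left-hand side, right-hand side) pairs and trees given as nested tuples; these formats and the function names are illustrative choices, not those of the released library.

```python
def count_recursive_rules(rules):
    """Count rules whose left-hand side symbol also occurs on the right-hand side."""
    return sum(1 for lhs, rhs in rules if lhs in rhs)

def tree_depth(tree):
    """Depth of a nested-tuple tree (label, child, ...); a terminal leaf has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

# Toy grammar with one directly recursive rule (A -> A B).
toy_rules = [("S", ("A", "B")), ("A", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
n_recursive = count_recursive_rules(toy_rules)

flat_tree = ("S", ("A", "a"), ("B", "b"))
skewed_tree = ("S", ("A", "a"), ("K", ("B", "b"), ("C", "c")))
depths = [tree_depth(flat_tree), tree_depth(skewed_tree)]
```

Following the definition above, only direct recursion is counted; recursion that passes through an intermediate non-terminal is not detected by this check.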

Baselines
To ground our interpretation, we compare our induced grammars with three different language baselines that express different levels of structure. We provide a basic description here; more details can be found in Appendix D.1.

Random baseline
We compare all induced grammars with a grammar induced on a random language that has the same vocabulary and length distribution as the original language, but whose messages are sampled completely randomly from the vocabulary.

Shuffled baseline
We also compare the induced grammars with a grammar induced on languages that are constructed by shuffling the symbols of the emergent languages. The symbol distribution in these languages is thus identical to the symbol distribution in the languages they are created from, but the symbol order is entirely random.
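The random and shuffled baselines can be sketched as follows. The function names and toy messages are assumptions, and so are two design choices: symbols of the random baseline are drawn uniformly, and shuffling is applied within each message rather than across the corpus.

```python
import random

def random_baseline(messages, vocabulary, rng):
    """Keep each message length, but draw every symbol uniformly from the vocabulary."""
    return [tuple(rng.choice(vocabulary) for _ in m) for m in messages]

def shuffled_baseline(messages, rng):
    """Keep the symbols of each message, but put them in a random order."""
    return [tuple(rng.sample(m, len(m))) for m in messages]

rng = random.Random(0)
emergent = [(0, 1, 2, 2), (3, 3, 1, 0), (2, 0, 0, 1)]  # toy messages
vocabulary = [0, 1, 2, 3]
random_language = random_baseline(emergent, vocabulary, rng)
shuffled_language = shuffled_baseline(emergent, rng)
```

The shuffled baseline preserves symbol frequencies and within-message co-occurrences, whereas the random baseline preserves only the vocabulary and the length distribution.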

Structured baseline
Aside from (semi)random baselines, we also consider a structured baseline, consisting of a grammar induced on languages that are similar in length and vocabulary size, but that are generated from a context-free grammar defining a basic hierarchy and terminal-class structure. These structured baseline grammars indicate what we should expect if a relatively simple yet hierarchical grammar would explain the emergent languages.

Suitability of induction techniques
As the grammar induction techniques we apply were designed for natural language, it is not self-evident that they are also suitable for emergent languages. In our first series of experiments, we therefore assess the suitability of the grammar induction techniques for our artificial scenario, evaluate to what extent the techniques depend on the exact sample taken from the training set, and determine a suitable data set size for the induction techniques. The findings of these experiments inform and validate the setup for analysing the emergent languages in §5.

Grammars for structured baselines
We first qualitatively assess the extent to which CCL-BMM and DIORA-BMM are able to infer the correct grammars for the structured baseline languages described in the previous section. In particular, we consider if the induced grammars reflect the correct word classes defined by the pre-terminals, and if they capture the simple hierarchy defined on top of these word classes.

Results
We conclude that CCL-BMM is able to correctly identify all the unique word classes for the examined languages, as well as the simple hierarchy (for some examples of induced grammars, we refer to Appendix B). DIORA-BMM performs well for the smallest languages, but for the most complex grammar (V = 27, L = 10) it is only able to find half of the word classes and some of the word class combinations. We also observe that DIORA-BMM appears to have a bias for binary trees, which results in larger and less interpretable grammars for the longer fully structured languages. Overall, we conclude that both CCL-BMM and DIORA-BMM should be able to infer interesting grammars for our artificial setup; CCL-BMM appears to be slightly more adequate.

Grammar consistency and data size
As a next step, we study the impact of the induction set sample on the resulting grammars. We do so by measuring the consistency of grammars induced on different sections of the training data as well as grammars induced on differently-sized sections of the training data. We consider incrementally larger message pools of size N ∈ {500, 1000, 2000, 4000, 8000} by sampling from the V27L10 language with replacement according to the original message frequencies. From each pool we take the unique messages to induce the grammar. More details on this procedure and the resulting data sets can be found in Appendix C.
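The pool construction described above can be sketched as follows. The function name and the toy corpus are assumptions for illustration; the actual procedure is detailed in Appendix C.

```python
import random
from collections import Counter

def sample_induction_pool(messages, pool_size, rng):
    """Sample with replacement according to message frequency, then keep unique messages."""
    counts = Counter(messages)
    population = list(counts)
    weights = [counts[m] for m in population]
    pool = rng.choices(population, weights=weights, k=pool_size)
    return sorted(set(pool))

rng = random.Random(0)
# Toy corpus in which frequent messages are more likely to end up in a pool.
corpus = [(0, 1)] * 5 + [(1, 1)] * 3 + [(2, 0)]
pools = {n: sample_induction_pool(corpus, n, rng) for n in (5, 10, 20)}
```

Because only unique messages are kept, the resulting induction sets grow more slowly than the pool size N.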
We express the consistency between two grammars as the F1-score between their parses on the same test data. We furthermore consider the GDL of the induced grammars, which we compare with a baseline grammar that contains exactly one production rule for each message. If the GDL of the induced grammar is not smaller than the GDL of this baseline grammar, then the grammar is not more efficient than simply enumerating all messages.
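A bracket-based version of this consistency score can be sketched as below. The text does not specify the exact variant, so the choices made here are assumptions: brackets are compared without labels, single-symbol spans are ignored, and counts are pooled over all messages. The tree format and function names are likewise illustrative.

```python
def bracket_spans(tree, start=0):
    """Return (end, spans), where spans holds the (start, end) of each multi-symbol constituent."""
    if not isinstance(tree, tuple):
        return start + 1, set()
    end, spans = start, set()
    for child in tree[1:]:
        end, child_spans = bracket_spans(child, end)
        spans |= child_spans
    if end - start > 1:  # skip trivial single-symbol brackets
        spans.add((start, end))
    return end, spans

def consistency_f1(parses_a, parses_b):
    """F1 between two lists of parses of the same messages, pooled over all brackets."""
    matched = total_a = total_b = 0
    for tree_a, tree_b in zip(parses_a, parses_b):
        spans_a = bracket_spans(tree_a)[1]
        spans_b = bracket_spans(tree_b)[1]
        matched += len(spans_a & spans_b)
        total_a += len(spans_a)
        total_b += len(spans_b)
    return 2 * matched / (total_a + total_b) if total_a + total_b else 0.0

# Right-branching and left-branching parses of the same three-symbol message.
right = ("S", ("A", "a"), ("K", ("B", "b"), ("C", "c")))
left = ("S", ("K", ("A", "a"), ("B", "b")), ("C", "c"))
f1_different = consistency_f1([right], [left])
f1_identical = consistency_f1([right], [right])
```

The two toy parses share only the bracket spanning the whole message, which gives an F1 of 0.5.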
The experiments described above provide information about the sensitivity of the grammar induction techniques on the exact section of the training data as well as the size of the training data that is required to obtain a consistent result. We use the results to find a suitable data set size for the rest of our experiments.

Results
Overall, the experiments show that CCL-BMM has higher consistency scores than DIORA-BMM, but also more variation between different induction set sizes (see Figure 2). From the changing consistencies of CCL-BMM with an increasing number of messages, we conclude that differences in data set size influence its grammar induction considerably. We believe that the low consistency scores of DIORA-BMM are due to the strongly stochastic nature of the neural parser.
For both CCL-BMM and DIORA-BMM, the evaluation set coverage increases with the induction set size, although CCL-BMM reaches a near perfect coverage much faster than DIORA-BMM. Furthermore, the GDL implies a lower bound for the required induction set size, since the GDL is only smaller than its baseline for N > 2000 with CCL-BMM, while the crossover point is even larger for DIORA-BMM. More details on the progressions of the coverage and GDL can be found in the appendix in Figures C.1 and C.2 respectively.
To conclude, while a small induction set would suffice for CCL, we decide to use all messages of the induction set, because DIORA requires more data for good results, and we see no evidence that this impairs the performance of CCL-BMM.

Analysing emergent languages
Having verified the applicability of both CCL-BMM and DIORA-BMM, we use them to induce grammars for all languages described in §3.2. We analyse the induced grammars and parses, comparing with the structured, shuffled, and random baselines introduced in §3.5.

Grammar aptitude and compressibility
We first quantitatively evaluate the grammars, considering the description lengths and their evaluation and overgeneration coverage, as described in §3.4.
As a general observation, we note that the GDL increases with the vocabulary size. This is not surprising, as larger vocabularies require a larger number of lexical rules and allow for more combinations of symbols, but it indicates that comparisons across different types of languages should be made with care.

L3 and L5
As a first finding, we see that little to no structure appears to be present in the shorter languages with messages of length 3 and 5: there are no significant differences between the emergent languages and the random and shuffled baselines (full plots can be found in the appendix, Figures D.1 and D.2). Some of the grammars for the emergent L3 languages and random baselines, however, have a surprisingly low GDL. Visual inspection of the trees suggests that this is due to the fact that the grammars approach a trivial form, in which there is only one pre-terminal X that expands to every lexical item in the corpus, and one production rule S → X X X. This result is further confirmed by the coverages presented in Table 2, which illustrate that the grammars for the L3 and L5 languages can parse not only all sentences in these languages, but also all other possible messages with the same length and vocabulary.
Interestingly, for DIORA-BMM, there are also no significant differences for the structured baselines. We hypothesise that this may stem from DIORA's inductive bias and conclude that for the analysis of shorter languages, CCL-BMM might be more suitable.

L10
In the L10 languages, we find more indication of structure. As can be seen in Figure 3, the emergent grammars all differ significantly from all baseline grammars (p < .05) and most strongly from the random baseline (p < .001). The GDL of the shuffled baseline grammar lies between that of the emergent language grammar and that of the random baseline grammar, suggesting that some regularity may be encoded simply in the frequency distribution of the symbols.
The average DDL of the L10 languages also differs considerably from the baselines, but in the other direction: the DDL of both the structured and the completely random baseline is much smaller than the emergent language DDL. An explanation for this discrepancy is suggested when looking at their coverages. A good grammar has a high coverage on an independent evaluation set with messages from the same language, but a low coverage on a random sample of messages outside of the language (which we measure with overgeneration coverage, see §3.4). A perfect example of such a grammar is the CCL-BMM grammar inferred for the structured baseline, which has a coverage of 100% for the evaluation set but approximately 0% outside of it (see Table 2). For the V13L10 and V27L10 languages, we observe a similar pattern.
Coming back to the random languages, we can see that their grammars do not generalise to any message outside of their induction set. This result suggests that for these languages, the induction method resulted in a large grammar that keeps the DDL low at the expense of a larger GDL, by simply overfitting to exactly the induction set.
Concerning the coverage, another interesting finding is that the shuffled baseline often has a higher coverage than the random baseline. Combined with the generally higher average DDL, this suggests that the induction methods are less inclined to overfit the shuffled baselines. This might be explained by the regularities present in the shuffled messages through the frequencies of the symbols, as well as their co-occurrences within messages.

Nature of syntactic structure
The description lengths and coverage give an indication of whether there is any structure present in the languages; we finish with an explorative analysis of the nature of this structure. We focus our analysis on the V13L10 and V27L10 languages, which we previously found most likely to contain interesting structure.

Word class structure
We first examine if there is any structure at the lexical level, in the form of word classes. We consider the number of terminals per pre-terminal and vice versa. We discuss the most important results here; the complete results can be found in the appendix, in Figure D.3.
A first observation is that in all grammars each symbol is unambiguously associated with only one pre-terminal symbol, indicating that there is no ambiguity with respect to the word class it belongs to. The number of terminals per pre-terminal suggests that our grammar induction algorithms also do not find many word classes: with some notable exceptions, every pre-terminal symbol expands only to a single terminal symbol. Interestingly, some of these exceptions overlap between CCL-BMM and DIORA-BMM (see Table 3), suggesting that they are in fact indicative of some form of lexical structure.
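The two lexical statistics can be computed from the lexical rules of an induced grammar, as in the sketch below. The rule format, function name and toy rules are assumptions made for illustration.

```python
from collections import defaultdict

def word_class_statistics(lexical_rules):
    """lexical_rules: iterable of (pre_terminal, terminal) pairs."""
    terminals_of = defaultdict(set)
    pre_terminals_of = defaultdict(set)
    for pre_terminal, terminal in lexical_rules:
        terminals_of[pre_terminal].add(terminal)
        pre_terminals_of[terminal].add(pre_terminal)
    return dict(terminals_of), dict(pre_terminals_of)

# Toy lexicon: only pre-terminal B behaves like a word class with several members.
lexical_rules = [("A", 1), ("B", 2), ("B", 3), ("C", 4)]
terminals_of, pre_terminals_of = word_class_statistics(lexical_rules)
unambiguous = all(len(classes) == 1 for classes in pre_terminals_of.values())
word_classes = {p: t for p, t in terminals_of.items() if len(t) > 1}
```

In this toy lexicon every symbol belongs to exactly one pre-terminal, and B is the only pre-terminal that groups more than one symbol.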

Higher level structure
We next check if the trees contain structure one level above the pre-terminals, by computing if pre-terminals can be grouped based on the non-terminal that generates them (e.g. if there is a rule K → A B we say that K generates the group A B). Specifically, we count the unique number of pre-terminal groups, defined by each right-hand side consisting solely of pre-terminals and symbols. If there is an underlying linguistic structure that prescribes which pre-terminals belong together (and in which order), it is expected that fewer groups are required to explain the messages than if no such hierarchy were present. Indeed, the number of pre-terminal groups (see Table 4) shows this pattern, as we discover a significantly smaller number of groups than for the random baseline. These results thus further confirm the presence of structure in the V13L10 and V27L10 languages. As a tentative explanation, we would like to suggest that perhaps the symbols in the emergent languages are more akin to characters than to words. In that case, the pre-terminal groups would represent the words, and the generating non-terminals the word classes. For both CCL-BMM and DIORA-BMM, the average number of pre-terminal groups generated by these non-terminals is 2.4 (± < 0.01) for the emergent languages, while it is 1.0 for the shuffled and random baselines. This suggests that the pre-terminal groups generated by the same non-terminal share a syntactic function. Such observations could form a fruitful basis for further semantic analysis of the languages.
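Counting pre-terminal groups can be sketched as follows. The rule format and names are assumptions, and a group is taken here to be a right-hand side made up of pre-terminals only, which is one reading of the definition above.

```python
from collections import defaultdict

def pre_terminal_groups(rules, pre_terminals):
    """Map each non-terminal to the set of pre-terminal-only right-hand sides it generates."""
    groups_of = defaultdict(set)
    for lhs, rhs in rules:
        if all(symbol in pre_terminals for symbol in rhs):
            groups_of[lhs].add(rhs)
    return dict(groups_of)

# Toy grammar: K generates two groups, M generates one; S -> K M is not a group.
pre_terminals = {"A", "B", "C"}
rules = [("S", ("K", "M")), ("K", ("A", "B")), ("K", ("A", "C")), ("M", ("B", "C")),
         ("A", (1,)), ("B", (2,)), ("C", (3,))]
groups_of = pre_terminal_groups(rules, pre_terminals)
unique_groups = {group for groups in groups_of.values() for group in groups}
average_groups = sum(len(g) for g in groups_of.values()) / len(groups_of)
```

An average above 1.0 means that some non-terminals generate several alternative groups, which is the pattern reported above for the emergent languages.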

Recursion
Lastly, we would like to note the lack of recursive production rules in nearly all induced grammars. While this is not surprising given both the previous results as well as the simplicity of the meaning space, it does suggest that perhaps more interesting input scenarios are required for referential games.

CCL vs DIORA
We ran all our experiments with both CCL-BMM and DIORA-BMM. There were similarities, but also some notable differences. Based on the GDL, CCL-BMM seems more suitable to analyse shorter languages, and earlier tests with reconstructing the structured baseline grammars (see §4.1) suggest that DIORA-BMM also performs worse on languages with a larger message length and vocabulary size, leading us to believe that CCL-BMM is more appropriate for our setup.
Another difference concerns the distribution of the tree depths, which reflects mostly skewed and binary trees for CCL-BMM for L = 10, but more evenly distributed depths for DIORA-BMM (for a plot of the depth distributions, we refer to D.4). An example of this difference is shown in Figure 4. A possible explanation is that CCL-BMM is more biased towards fully right-branching syntax trees, since these are a good baseline for natural language. Alternatively, these trees might actually reflect the emergent languages best, perhaps because of the left-to-right nature of the agents' LSTMs. Additional work is required to establish which type of trees better reflect the true structure of the emergent languages.

Conclusion
While studying language and communication through referential games with artificial agents has recently regained popularity, there is still a very limited number of tools available to facilitate the analysis of the resulting emergent languages. As a consequence, we still have very little understanding of what kind of information these languages encode. In this paper, for the first time, we focus on syntactic analysis of emergent languages.
We test two different unsupervised grammar induction (UGI) algorithms that have been successful for natural language: a pre-neural statistical one, CCL, and a neural one, DIORA. We use them to infer grammars for a variety of languages emerging from a simple referential game and then label those trees with BMM, considering in particular the effect of the message length and vocabulary size on the extent to which structure emerges.
We first confirm that the techniques are capable of inferring interesting grammars for our artificial setup and demonstrate that CCL appears to be a more suitable constituency parser than DIORA. We then find that the shorter languages, with messages up to 5 symbols, do not contain any interesting structure, while languages with longer messages appear to be substantially more structured than the two random baselines we compare them with. Interestingly, our analysis shows that even these languages do not appear to have a notion of word classes, suggesting that their symbols may in fact be more akin to letters than to words. In light of these results, it would be interesting to explore the use of unsupervised tokenisers that work well for languages without spaces (e.g. SentencePiece; Kudo and Richardson, 2018) prior to our approach and to try other word embedding models for DIORA, such as the character-based ELMo embeddings (Peters et al., 2018) or the more recent BERT (Devlin et al., 2019).
Our results also suggest that more sophisticated game scenarios may be required to obtain more interesting structure. UGI could provide an integral part in analysing the languages emerging in such games, especially since it -contrary to most techniques previously used for the analysis of emergent languages -does not require a description of the hypothesised semantic content of the messages. We argue that while the extent to which syntax develops in different types of referential games is an interesting question in its own right, a better understanding of the syntactic structure of emergent languages could also provide pivotal in better understanding their semantics, especially if this is considered from a compositional point of view. To facilitate such analysis, we bundled our tests in a comprehensive and easily usable evaluation framework. 9 We hope to have inspired other researchers to apply syntactic analysis techniques and encourage them to use our code to evaluate new emergent languages trained in other scenarios. A Definition of the referential game The languages emerge from two agents playing a referential game with a setup similar to Havrylov and Titov (2017). In each round of the game, the sender samples a message m describing the target image t to the receiver. m consists of up to L symbols sampled from a vocabulary with size V . 10 The receiver has to identify the described image from a set with t and three other distracting images in random order. The images are created by generating a shape with a certain colour and size, on a logical grid. In the game, two images are the same if they have the same colour, shape, and size, even when differently positioned.

B Fully structured languages
For all the configurations of L and V of our emergent languages (see §3.2), we create a simple grammar containing word classes, each with a disjoint set of symbols. Furthermore, two pre-terminals form a group that can be placed either at the beginning or the end of the message or both, while the other pre-terminals occupy the remaining spots in fixed order. The smaller grammars repeat word classes to ensure enough messages for the induction and evaluation. All the possible messages are randomly divided over an induction and evaluation set (80% and 20% respectively). Table B.1 provides more details on the data sets used for each language configuration.

10 Technically, the vocabulary also contains a stop character and the sender is allowed to generate messages shorter than L. However, typically the messages have a length of L. For the analyses in this paper we have removed all stop characters in a pre-processing step and we do not count it as part of L and V.

L    V    total   induction   evaluation
3    6    16      12          4
3    13   160     128         32
3    27   1458    1166        292
5    6    24      19          5
5    13   378     302         76
5    27   15480   2000        500
10   6    24      19          5
10   13   32      25          7
10   27   52488   2000        500

Table B.1: An overview of the total number of possible messages that can be generated for each L and V configuration, as well as the sizes of the induction and evaluation sets. The size of the induction set is capped at 2000 to keep the grammar induction computationally feasible. When evaluating the grammars a maximum number of 500 messages of either set is used.
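The set sizes in Table B.1 are consistent with a simple rule: take 80% of the messages for induction (rounded down), use the remainder for evaluation, and apply the caps of 2000 and 500. The sketch below spells this out. The function name split_sizes and the rounding direction are assumptions inferred from the table, not a description of the released code.

```python
def split_sizes(total, induction_cap=2000, evaluation_cap=500):
    """Return (induction, evaluation) set sizes for `total` messages."""
    induction = total * 4 // 5      # 80% of the messages, rounded down (assumed)
    evaluation = total - induction  # the remaining messages (about 20%)
    return min(induction, induction_cap), min(evaluation, evaluation_cap)


for total in (16, 160, 1458, 15480):
    print(total, split_sizes(total))
```

Under these assumptions, 160 messages are split into 128 and 32, and 15480 messages are reduced to 2000 and 500, matching the corresponding rows of Table B.1.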

B.1 Example grammars
In the following examples, TOP denotes the start symbol, NP the pre-terminal group, and the numbers the terminals that represent the symbols in the generated messages. The structured baseline grammar for V = 13 and L = 5 is represented as:
-> 3 | 4 | 5
C -> 6 | 7 | 8
D -> 9 | 10
E -> 11 | 12
The resulting CCL-BMM induced grammar is:
-> 6 | 7 | 8
F -> 9 | 10
H -> 11 | 12
and DIORA-BMM finds:

C Consistency and suitable data set size
The number of messages in the induction set might influence the properties of the grammars induced from it. To investigate these effects, we perform induction experiments on different sub-samples of the language V27L10. We compare the induced grammars on their consistency and study the progression of the evaluation coverage and GDL.
The consistency of a setup is computed on different samples of a data set to study the effect of the data set size as well as to show how dependent the algorithm is on the exact selection of induction messages. We create incrementally larger pools by sampling a fixed number of randomly selected messages from the data set, resulting in pool sizes N ∈ {500, 1000, 2000, 4000, 8000}. The messages are sampled with replacement according to their frequency in the original language. From these pools we then only consider the unique messages. The procedure is repeated three times for each N to obtain an average consistency.
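The pooling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function name sample_unique_pool and the toy language are hypothetical, and each pool is drawn independently here, whereas the original procedure may instead grow one pool incrementally.

```python
import random
from collections import Counter


def sample_unique_pool(messages, n, rng):
    """Draw n messages with replacement, weighted by their frequency,
    and return the unique messages in the resulting pool."""
    counts = Counter(messages)
    population = list(counts)
    weights = [counts[m] for m in population]
    pool = rng.choices(population, weights=weights, k=n)
    return set(pool)


# Toy language: tuples of symbol ids, with repeats encoding frequency.
language = [(1, 2, 3)] * 5 + [(4, 5, 6)] * 2 + [(7, 8, 9)]
rng = random.Random(0)
# Three repetitions for each pool size N.
pools = {n: [sample_unique_pool(language, n, rng) for _ in range(3)]
         for n in (500, 1000, 2000, 4000, 8000)}
```

Because only unique messages are kept, the number of induction messages obtained from a pool is at most N and is typically smaller when a few messages dominate the language.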
Subsequently, we study the average evaluation coverage and GDL for these grammars. The resulting progression of the evaluation coverage is shown in Figure C.1. The coverage is evaluated with respect to the disjoint set consisting of 10% of the language's messages. We study the GDL against the number of messages compared to the baseline grammar of one production rule for each message in the induction set in Figure

L    V
3    6    147    150
3    13   358    396
3    27   512    554
5    6    913    829
5    13   1819   1590
5    27   1962   1817
10   6    4266   4525
10   13   8248   8294
10   27   9112   8986

D Analysing emergent languages
Here we present a complete overview of the results from analysing the languages in §5. To aid in interpreting the different metrics, we compare these with several baselines. To test for significance, we report the p-values from a one-sample t-test, where the baseline value is assumed to be the population mean.
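For reference, the test statistic of a one-sample t-test can be computed as in the sketch below, which uses only the standard library. The function name one_sample_t and the example values are hypothetical. Here values stands for a metric measured on several emergent languages or runs, and baseline for the single baseline value that is treated as the population mean.

```python
import math
import statistics


def one_sample_t(values, baseline):
    """Return the t statistic and degrees of freedom for the null
    hypothesis that the mean of `values` equals `baseline`."""
    n = len(values)
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return (mean - baseline) / std_err, n - 1


t_stat, dof = one_sample_t([2.0, 4.0, 6.0], baseline=1.0)
```

The p-value follows from the Student's t distribution with n - 1 degrees of freedom. A statistics package such as SciPy (scipy.stats.ttest_1samp) performs the complete test. Note that the sketch fails with a division by zero when all values are identical.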

D.1 Baselines
The shuffled baselines are constructed by randomly shuffling the messages of the induction set for a randomly selected seed, such that they are unique in the shuffled set. We create the random baselines by randomly sampling the same number of unique messages as the induction set, also for one seed. See Table D.1 for the number of messages used for each baseline per language.
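The two baselines could be constructed along the following lines. This sketch rests on assumptions: shuffling is read as permuting the symbols within each message, messages are represented as tuples of integer symbol ids, random messages all have the full length L, and the function names are hypothetical.

```python
import random


def shuffled_baseline(messages, seed=0):
    """Shuffle the symbols within each message and keep unique results."""
    rng = random.Random(seed)
    shuffled = set()
    for message in messages:
        symbols = list(message)
        rng.shuffle(symbols)
        shuffled.add(tuple(symbols))
    return sorted(shuffled)


def random_baseline(n_messages, length, vocab_size, seed=0):
    """Sample n_messages unique messages uniformly at random."""
    if n_messages > vocab_size ** length:
        raise ValueError("not enough distinct messages for this L and V")
    rng = random.Random(seed)
    sampled = set()
    while len(sampled) < n_messages:
        sampled.add(tuple(rng.randrange(vocab_size) for _ in range(length)))
    return sorted(sampled)


induction_set = [(1, 2, 3), (3, 2, 1), (4, 4, 5)]
shuffled = shuffled_baseline(induction_set)
randomised = random_baseline(len(induction_set), length=3, vocab_size=6)
```

Under this reading, the shuffled set can end up smaller than the induction set, since two messages may coincide after shuffling, while the random set always has exactly the requested size.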

D.2 Description lengths
Tables D.2, D.3, and D.4 give an overview of the description lengths for the induction sets, the evaluation sets, and their ratios, respectively. The description lengths are also visualised in Figures D.1 and D.2.

D.3 Coverage
We show the evaluation and overgeneration coverage in Table D.5.

D.4 Nature of syntactic structure
Table D.6 gives an overview of the total number of unique pre-terminals and terminals in the induced grammars. We show the average number of terminals per pre-terminal in Table D.8 and Figure D.3. The average number of pre-terminals per terminal is one for every language and baseline, and is therefore omitted. The number of pre-terminal groups and the number of non-terminals generating these groups are presented in Table D.7.
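To make the two word-class statistics concrete, the sketch below computes them for the pre-terminal rules that appear in the structured baseline example of Appendix B.1 (C, D and E). The dictionary representation of the grammar and the function name word_class_stats are assumptions for illustration and do not reflect the format of the released library.

```python
from collections import defaultdict

# Pre-terminal rules from the structured baseline example:
# C -> 6 | 7 | 8, D -> 9 | 10, E -> 11 | 12.
pre_terminals = {"C": {6, 7, 8}, "D": {9, 10}, "E": {11, 12}}


def word_class_stats(rules):
    """Return (average terminals per pre-terminal,
    average pre-terminals per terminal)."""
    parents = defaultdict(set)
    for pre_terminal, terminals in rules.items():
        for terminal in terminals:
            parents[terminal].add(pre_terminal)
    per_pre_terminal = sum(len(t) for t in rules.values()) / len(rules)
    per_terminal = sum(len(p) for p in parents.values()) / len(parents)
    return per_pre_terminal, per_terminal


stats = word_class_stats(pre_terminals)
```

Since the three word classes are disjoint, every terminal has exactly one pre-terminal, so the second value is one, while the first is the mean class size of 7/3.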