Compositionality and Capacity in Emergent Languages

Recent work has discussed the extent to which emergent languages can exhibit properties of natural languages, particularly compositionality. In this paper, we investigate the learning biases, as well as the communicative bandwidth, that affect the efficacy and compositionality of multi-agent communication. Our foremost contribution is to explore how the capacity of a neural network impacts its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there is a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically see evidence for the bottom of this range, we curiously do not find evidence for the top of the range and believe that this is an open question for the community.


Introduction
Compositional language learning in the context of multi-agent emergent communication has been extensively studied (Foerster et al., 2016; Lazaridou et al., 2017; Baroni, 2020). These works have found that while most emergent languages do not tend to be compositional, they can be guided towards this attribute through artificial task-specific constraints (Harding Graesser et al., 2019; Lee et al., 2018; Gupta* et al., 2020).
In this paper, we focus on how a neural network, specifically a generative one, can learn a compositional language, and moreover on how this can occur without task-specific constraints. To accomplish this, we first define what a language is and what we mean by compositionality. In tandem, we introduce precision and recall, two metrics that help us measure how well a generative model at large has learned a grammar from a finite set of training instances. We then use a variational autoencoder with a discrete sequence bottleneck to investigate how well the model learns a compositional language, in addition to what affects that learning. This allows us to derive residual entropy, a third metric that reliably measures compositionality in our particular environment. We use this metric to cross-validate precision and recall.

* These two authors contributed equally.
Our paper is most similar to Kottur et al. (2017), which showed that compositional language arose only when certain constraints on the agents are satisfied. While the constraints they examined were either making their models memoryless or having a minimal vocabulary in the language, we hypothesize about the importance of agents having small capacity relative to the number of concepts to which they are exposed. Verhoef et al. (2016), Kirby et al. (2015), and Zaslavsky et al. (2018) each examine the trade-off between expression and compression in both emergent and natural languages, in addition to how that trade-off affects the learners. We differ in that we target a specific aspect of the agent (capacity) and ask how that aspect biases the learning.

Compositional Language and Learning
We consider the problem of learning an underlying language L* from a finite set of training strings randomly drawn from it: D = {s | s ∼ G*}, where G* is the minimal-length generator associated with L*. We assume |D| ≪ |L*|, and our goal is to use D to learn a language L that approximates L* as well as possible. We know that there exists an equivalent generator G for L, and so our problem becomes estimating a generator from this finite set rather than reconstructing the entire set of strings belonging to the original language L*. We cast the problem of estimating a generator G as density modeling, in which case the goal is to estimate a distribution p(s). Sampling from p(s) is equivalent to generating a string from the generator G.

Figure 1: The grid shows five shapes and five colors. Agents with a non-compositional language can use this shared map to communicate "Red Circle" with only ⌈log₂ 5²⌉ = 5 bits. If they instead used a compositional language, it would require ⌈log₂ 5⌉ = 3 bits for each concept, for a total of 6 bits to convey the string. On the other hand, the agent needs 25 memory slots to store the concepts in the former case but only 10 slots in the compositional case. This trade-off exemplifies the motivation for our investigation because it suggests that a key driver of compositionality in language is the capacity of an agent relative to the total number of objects in its environment.
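The bit-count and memory trade-off from Figure 1 can be checked directly. A quick sketch of the arithmetic, assuming the five-shape, five-color world from the caption:

```python
import math

# Non-compositional (holistic): one codeword per (shape, color) pair.
holistic_bits = math.ceil(math.log2(5 * 5))        # bits to index 25 pairs

# Compositional: one codeword per concept value, sent for each concept.
compositional_bits = 2 * math.ceil(math.log2(5))   # bits per concept, two concepts

# Memory: holistic stores every pair; compositional stores each value once.
holistic_slots = 5 * 5
compositional_slots = 5 + 5
```

The holistic code is one bit shorter on the channel (5 vs. 6 bits) but needs 2.5 times the memory (25 vs. 10 slots), which is exactly the capacity-vs-bandwidth tension the paper investigates.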
Evaluation metrics If the language is learned perfectly, any string sampled from the learned distribution p(s) must belong to L*. Conversely, any string in L* must be assigned a non-zero probability under p(s). Otherwise, the set of strings generated from this generator, implicitly defined via p(s), is not identical to the original language L*. This observation leads to two metrics for evaluating the quality of the estimated language given by the distribution p(s), precision and recall:

precision(L, L*) = E_{s∼p(s)}[I(s ∈ L*)],   recall(L, L*) = E_{s∼U(L*)}[I(s ∈ L)],

where I(x) is the indicator function and U(L*) is the uniform distribution over L*. These metrics are designed to fit any compositional structure rather than being one-off evaluation approaches.
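Both metrics can be estimated by sampling. A minimal sketch (the function names and membership oracles below are illustrative, not the paper's implementation):

```python
import random

def precision(sample_model, in_true_language, n=10_000):
    """Fraction of model samples that belong to the true language L*."""
    return sum(in_true_language(sample_model()) for _ in range(n)) / n

def recall(true_strings, in_model_language, n=10_000):
    """Fraction of strings drawn uniformly from L* that the model supports."""
    drawn = random.choices(true_strings, k=n)
    return sum(in_model_language(s) for s in drawn) / n
```

For example, with L* = {aa, ab, ba, bb} and a model that only ever produces aa or ab, precision is 1.0 (everything it says is valid) while recall is roughly 0.5 (it covers half the language).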
Our setup We simplify and assume that each character in a string s ∈ L* corresponds to an underlying concept. While the inputs are ordered according to the sequence of concepts, our model encodes them using a bag-of-words (BoW) representation.
The speaker f_θ is parameterized using a recurrent policy which receives the sequence of concatenated one-hot input tokens of s and converts each of them to an embedding. It then runs an LSTM non-autoregressively for l timesteps, taking the flattened representation of the input embeddings as its input and linearly projecting each result to a probability distribution over {0, 1}. This results in a factorized Bernoulli distribution over l latent variables: f_θ(z|s) = ∏_{t=1}^{l} p(z_t|s; θ). From this distribution, we can sample a latent string z = (z_1, . . . , z_l).
The listener g_φ receives z and uses a BoW representation to encode it into its own embedding space. Taking the flattened representation of these embeddings as input, we run an LSTM for |N| time steps, each time outputting a probability distribution over the full alphabet Σ: g_φ(s|z) = ∏_{i=1}^{|N|} p(s_i|z; φ). To train the whole system end-to-end (Sukhbaatar et al., 2016; Mordatch and Abbeel, 2018) via backpropagation, we apply a continuous approximation to each z_t that depends on a learned temperature parameter τ. We use the 'straight-through' version of Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to convert the continuous relaxation into a discrete sample for each z_t. The final sequence of one-hot vectors encoding z is our message, which is passed to the listener g_φ.
The prior p_λ encodes the message z using a BoW representation. It gives the probability of z according to the prior (binary) distribution for each z_t and is defined as: p_λ(z) = ∏_{t=1}^{l} p(z_t|λ). This can be used both to compute the prior probability of a latent string and also to efficiently sample from p_λ using ancestral sampling. Penalizing the KL divergence between the speaker's distribution and the prior distribution encourages the emergent protocol to use latent strings that are as diverse as possible.
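To make the discrete bottleneck concrete, here is a NumPy sketch of the forward pass only. It is not the paper's training code: a real implementation would use an autodiff framework so the straight-through estimator can pass gradients, and the random logits below merely stand in for the speaker LSTM's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_sigmoid(logits, tau):
    """Binary-Concrete relaxation of Bernoulli(sigmoid(logits)) at temperature tau."""
    u = rng.uniform(1e-9, 1 - 1e-9, size=logits.shape)
    noise = np.log(u) - np.log(1 - u)          # Logistic(0, 1) noise
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def straight_through(soft):
    """Hard 0/1 sample; with autodiff, gradients would flow through `soft`."""
    return (soft > 0.5).astype(soft.dtype)

l, tau = 24, 1.0
speaker_logits = rng.normal(size=l)            # stand-in for the speaker's projections
z_soft = gumbel_sigmoid(speaker_logits, tau)   # continuous approximation of z
z = straight_through(z_soft)                   # the discrete message sent to the listener

# Factorized Bernoulli prior p_lambda(z) = prod_t p(z_t | lambda), here uniform.
prior_p = np.full(l, 0.5)
log_prior = np.sum(z * np.log(prior_p) + (1 - z) * np.log(1 - prior_p))
```

With the uniform prior shown here, every 24-bit message has prior log-probability 24·log(0.5); a learned λ would instead let the KL term shape which latent strings are preferred.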
Hypotheses on compositionality Under this framework for language learning, we can make the following observations. If the length of the latent sequence satisfies l < log₂|L*|, it is impossible for the model to avoid a failure case, because there will be at least |L*| − 2^l strings in L* that cannot be generated by the trained model. Consequently, recall cannot be maximized. However, this may be difficult to check using the sample-based estimate, as the chance of sampling a string s ∈ L* that is assigned zero probability under the marginal ∫ g_φ(s|z) p_λ(z) dz decreases proportionally to the size of L*. This is especially true when the gap |L*| − 2^l is narrow.
When l ≥ log₂|L*|, there are three cases. The first is when there are not enough parameters θ to learn the underlying compositional grammar, in which case L* cannot be learned. The second case is when the number of parameters |θ| is greater than that required to store all the training strings, i.e., |θ| = O(l|D|). Here, it is highly likely for the model to overfit, as it can map each training string to a unique latent string without having to learn any of L*'s compositional structure. Lastly, when the number of parameters lies between these two poles, we hypothesize that the model will capture the underlying compositional structure and exhibit systematic generalization (Bahdanau et al., 2019).

Experiments

Models and Learning
The task is to communicate 6 concepts, each of which has 10 possible values, for a total dataset size of 10^6. We train the proposed VAE and gradually decrease the number of LSTM units from the base model by a factor α ∈ (0, 1]; this is how we control the numbers of parameters |θ| and |φ|. We obtain seven models from each of these by varying the length of the latent sequence l over {19, 20, 21, 22, 23, 24, 25}. These values were chosen both because we wanted to show a range of bits and because we need at least 20 bits to cover the 10^6 strings in L* (⌈log₂ 10^6⌉ = 20).
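The bandwidth requirement behind this choice of latent lengths can be verified in a couple of lines:

```python
import math

num_concepts, values_per_concept = 6, 10
language_size = values_per_concept ** num_concepts   # 10**6 distinct strings in L*

# A binary channel of length l can address at most 2**l strings,
# so the minimum feasible length is ceil(log2 |L*|).
min_bits = math.ceil(math.log2(language_size))

tested_lengths = range(19, 26)                       # the swept values {19, ..., 25}
feasible = [l for l in tested_lengths if l >= min_bits]
```

Only l = 19 falls below the information-theoretic floor of 20 bits; the remaining six settings are feasible in principle.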
Evaluation: Residual Entropy Our setup allows us to design a metric by which we can check the compositionality of the learned language L by examining how the underlying concepts are described by a string. Let p = (p_1, . . . , p_|N|) be a sequence of partitions of {1, 2, . . . , l}. We define the degree of compositionality as the ratio between the variability of each concept C_i and the variability explained by the latent subsequence z[p_i] indexed by the associated partition p_i. More formally, the degree of compositionality given the partition sequence p is defined as the residual entropy

re(p, L, L*) = (1/|N|) ∑_{i=1}^{|N|} H(C_i | z[p_i]) / H(C_i),

where there are |N| concepts by the definition of our language. When each term inside the summation is close to zero, it implies that the subsequence z[p_i] explains most of the variability of the specific concept C_i, and we consider this situation compositional. The residual entropy of a trained model is then the smallest re(p) over all possible sequences of partitions P, i.e., re(L, L*) = min_{p∈P} re(p, L, L*), and it spans from 0 (compositional) to 1 (non-compositional).

Fig. 3 shows the main findings of our research. In plot (a), we see the parameter counts at the threshold: below these values, the model cannot solve the task, but above them, it can. Further, observe the curve delineated by the lower-left corner of the shift from unsuccessful to successful models. This inverse relationship between bits and parameters shows that the more parameters in the model, the fewer bits it needs to solve the task. Note, however, that a model could only solve the task with fewer bits by forming a non-compositional code, suggesting that higher-parameter models are able to do so while lower-parameter ones cannot. Observe further that all of our models above the minimum threshold (72,400 parameters) have the capacity to learn a compositional code.
This is shown by the perfect training accuracy achieved by all of those models in plot (a) for 24 bits and by the perfect compositionality (zero entropy) in plot (b) for 24 bits. Together with the above, this validates that learning compositional codes requires less capacity than learning non-compositional codes. Plot (c) confirms our hypothesis that large models can memorize the entire dataset. The 24 bit model with 971,400 parameters achieves a train accuracy of 1.0 and a validation accuracy of 0.0. Cross-validating this with plots (d) and (g), we find that a member of the same parameter class is non-compositional and that there is one that achieves unusually low recall. We verified that these are all the same seed, which shows that the agents in this model are memorizing the dataset.
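The residual-entropy metric can be estimated empirically from paired concept/latent samples. A toy sketch, assuming the normalized form given above (the helper names are ours, and the minimization over all partition sequences P is omitted; the caller supplies one candidate partition):

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (bits) of a Counter of outcomes."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(C | Z) estimated from (z_value, concept_value) samples."""
    n = len(pairs)
    by_z = defaultdict(Counter)
    for z, c in pairs:
        by_z[z][c] += 1
    return sum(sum(cnt.values()) / n * entropy(cnt) for cnt in by_z.values())

def residual_entropy(concepts, latents, partition):
    """re(p) = (1/|N|) * sum_i H(C_i | z[p_i]) / H(C_i); assumes H(C_i) > 0."""
    total = 0.0
    for i, idx in enumerate(partition):
        c_vals = [c[i] for c in concepts]
        z_sub = [tuple(z[j] for j in idx) for z in latents]
        total += conditional_entropy(list(zip(z_sub, c_vals))) / entropy(Counter(c_vals))
    return total / len(partition)
```

On a toy language with two binary concepts encoded by an identity code, the partition that assigns bit 0 to concept 0 and bit 1 to concept 1 yields re = 0 (perfectly compositional), while the swapped partition yields re = 1.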

Results
Plots (b) and (e) show that our compositionality metrics pass two sanity checks: high recall and perfect (zero) residual entropy can only be achieved with a channel that is sufficiently large (i.e., 24 bits) to allow for a compositional latent representation. Plot (f) shows that while capacity does not affect the ability to learn a compositional language across the model range, it does change learnability. Here we find that smaller models can fail to solve the task for any bandwidth, which coincides with literature suggesting a link between overparameterization and learnability (Li and Liang, 2018; Du et al., 2019). Finally, as expected, we find that no model learns to solve the task with fewer than 20 bits, validating that the minimum required number of bits for learning a language of size |L*| is ⌈log₂|L*|⌉. We also see that no model learns to solve the task at exactly 20 bits, which is likely due to optimization difficulties.
We first confirm the effectiveness of training by observing that almost all the models achieve perfect precision (Fig. 2 (a)), implying that L ⊆ L*, where L is the language learned by the model. This occurs even though our learning objective encourages the model to capture all training strings rather than to focus on only a few of them. A natural follow-up question is how large L* \ L is. We measure this with recall in Fig. 2 (b), which shows a clear phase transition according to the model capacity when l ≥ 22. This agrees with what we saw in Fig. 3 and is equivalent to saying that |L* \ L| ≈ 0 at a value of l close to our predicted boundary of l = ⌈log₂ 10^6⌉ = 20. We attribute the gap to the difficulty of learning a perfectly-parameterized neural network.
These results clearly confirm the first part of our hypothesis: the latent sequence length must be at least as large as log₂|L*|. They also confirm that there is a lower bound on the number of parameters above which this model can successfully learn the underlying language. We have not been able to verify the upper bound in our experiments, which may require either a more (computationally) extensive set of experiments with even more parameters or a better theoretical understanding of the inherent biases behind learning with this architecture, such as from recent work on overparameterized models (Belkin et al., 2019; Nakkiran et al., 2020).

Conclusion
This paper opens the door to a substantial amount of follow-up research. All our models were sufficiently large to represent the compositional structure of the language when given sufficient bandwidth. Furthermore, while large models did overfit, this was the exception rather than the rule. We hypothesize that this is due to the large number of examples in our language, which forces the model to generalize, but note that there are likely additional biases at play that warrant further investigation.