On the Spontaneous Emergence of Discrete and Compositional Signals

We propose a general framework to study language emergence through signaling games with neural agents. Using a continuous latent space, we are able to (i) train using backpropagation and (ii) show that discrete messages nonetheless naturally emerge. We explore whether categorical perception effects follow, and show that the messages are not compositional.


Introduction
In a signaling game, artificial agents learn to communicate to achieve a common goal: a sender sees some piece of information and produces a message, which is then sent to a receiver that must take some action (Lewis, 1969; Skyrms, 2010). If the action is coherent with the sender's initial piece of information, the choice of the message and its interpretation is reinforced. For instance, in a referential game, sender and receiver see a set of objects, and the sender knows which of these the receiver must pick; the sender then sends a message to the receiver, who must interpret it to pick the right object (Lazaridou et al., 2017, 2018; Havrylov and Titov, 2017; Chaabouni et al., 2019).
This setting has been used to study the factors influencing the emergence of various fundamental properties of natural language, such as compositionality (Kirby et al., 2015; Franke, 2016; Steinert-Threlkeld, 2016; Mordatch and Abbeel, 2018; Lazaridou et al., 2018; Choi et al., 2018). In this paper, we focus on two other so-called 'design features' of natural language (Hockett, 1960): discreteness (i.e. words form clusters in acoustic space) and displacement (i.e. efficient communication can occur about objects and facts beyond the immediate context of the conversation).
From an implementation point of view, we follow the recent literature showing that a signaling game is essentially an autoencoder setting, with the encoder playing the role of the sender and the decoder the role of the receiver (see Fig. 1). In this literature, however, the discreteness of the communication protocol is assumed, since the networks traditionally use a (normally sequential and) discrete latent space (Havrylov and Titov, 2017; Chaabouni et al., 2019; Kharitonov et al., 2019).
Our main contribution is a generalization of the current implementation of signaling games as autoencoders. Our implementation covers a broader variety of signaling games; crucially, it incorporates the possibility of displacement and makes no a priori assumption of discreteness. Our main result is that under appropriate conditions, discreteness emerges spontaneously: if the latent space is thought of as a continuous acoustic space, then trained messages form coherent clusters, just like regular words do. We also show that the messages are not compositional.
In addition to contributing to our understanding of the emergence of communication protocols with features like natural language, our results have technical significance: by using a continuous communication protocol, with discreteness spontaneously emerging, we can train end-to-end using standard backpropagation, instead of reinforcement learning algorithms like REINFORCE and its refinements (Williams, 1992;Schulman et al., 2015;Mnih et al., 2016), which are difficult to use in practice.

Related Work
A related line of work attempts to avoid the difficulties of reinforcement learning (used when there are stochastic nodes in a computation graph) by reparameterization and/or non-stochastic estimators (Bengio et al., 2013; Schulman et al., 2015). In the emergent communication case, where the stochastic nodes are discrete (e.g. sampling a message from a sender distribution), the Gumbel-Softmax estimator has become increasingly popular (Jang et al., 2017; Maddison et al., 2017).
That work enables standard backpropagation to be used for training by optimizing approximations to the true reinforcement learning signal. By contrast, we do not approximate the discrete RL learning signal, but rather ask under what conditions discreteness will emerge.
Several earlier papers explore similar topics in the emergence of discrete symbols. Nowak et al. (1999) show that the division of the acoustic space is an emergent property of language use under noise. That model assumes that speakers have a fixed language and asks which such languages are stable. In our setting, the language itself changes as the result of reinforcement from communication, and transmission itself is not noisy.
De Boer (2000) simulates the emergence of vowel systems in artificial agents modeled after phonetic production and perception in humans, resulting in a self-discretizing acoustic space and a vowel system that resembles human ones. This makes the agents much closer to what we know about humans, but also limits the scope of the results. Results about emergent communication can tell us both about the emergence of human language and about communication protocols in general, which may be used by very different agents, e.g. autonomous ones, or animals (Steinert-Threlkeld et al., 2020).

Function Games
We here introduce a general communication game setting, which we call Function Games. Our games contain three basic components: (i) a set of contexts C, (ii) a set of actions A, and (iii) a family of functions F from contexts to actions. One play of a Function Game runs as follows (see steps 1-5 below). Abstractly, the function f represents some piece of knowledge available primarily to Sender, which determines what action is appropriate in any given context. Two concrete interpretations will help illustrate the variety of communication protocols and goals that this framework encompasses.

Generalized referential games.
A reference game is one in which Sender tries to get Receiver to pick the correct object out of a given set (Skyrms, 2010; Lazaridou et al., 2017, 2018; Havrylov and Titov, 2017; Chaabouni et al., 2019). Here, contexts are sets of objects (i.e. an m × n matrix, with m objects represented by n features). Normally (though we will drop this assumption later), c' = shuffled(c): Sender and Receiver see the same objects, but in a different arrangement. Actions are the objects, and the functions f ∈ F are choice functions: f(c) ∈ c for every context c.

Belief update games.

We will mostly focus on the previous interpretation, but illustrate the generality of the setting with another interpretation here. Contexts can represent the (possibly different) belief states of the agents. 'Actions' can represent updated belief states (A = C), the different functions in F then representing how to update an agent's beliefs upon learning a particular piece of information (passed directly to Sender, and only through the message to Receiver).
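As an illustration, one play of the generalized referential game can be sketched in a few lines. This is a hedged toy sketch under our own conventions (the choice function here arbitrarily picks the object maximal on feature 0), not the paper's code:

```python
import numpy as np

# Toy sketch of one play of a generalized referential game: a context is
# an m x n matrix (m objects, n features); f is a choice function.
rng = np.random.default_rng(0)

m_objects, n_features = 10, 5
c = rng.random((m_objects, n_features))   # Sender's context
c_prime = rng.permutation(c)              # shared setting: c' = shuffled(c)

def f(context):
    # A choice function: f(c) is a member of c -- here, the object with
    # the largest value on feature 0 (purely illustrative).
    return context[np.argmax(context[:, 0])]

target = f(c_prime)                       # the object Receiver must pick
# Communication succeeds iff Receiver's chosen action equals f(c').
assert any(np.array_equal(target, obj) for obj in c_prime)
```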

Experiment
Because we are interested in the simultaneous emergence of both discrete and compositional signals, we use a Function Game called the Extremity Game, designed to incentivize and test rich compositionality (Steinert-Threlkeld, 2018, 2020). In this game, one may think of the n dimensions of the objects as gradable properties, e.g. size and darkness, so that a 2D object is determined by a given size and shade of gray. For the functions, we set F = {arg min_i, arg max_i : 0 ≤ i < n}. An emerging language may contain compositional messages like 'MOST + BIG', 'LEAST + DARK'.
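For concreteness, the Extremity Game's function family can be written out as follows. This is an illustrative sketch; the helper name `extremity_functions` is ours, not from the paper:

```python
import numpy as np

def extremity_functions(n_dims):
    # F = {argmin_i, argmax_i : 0 <= i < n}: each function returns the
    # object that is minimal (or maximal) on dimension i.
    fns = {}
    for i in range(n_dims):
        fns[f"argmin_{i}"] = lambda c, i=i: c[c[:, i].argmin()]
        fns[f"argmax_{i}"] = lambda c, i=i: c[c[:, i].argmax()]
    return fns

rng = np.random.default_rng(1)
c = rng.random((10, 5))            # 10 objects with 5 gradable features
F = extremity_functions(5)
assert len(F) == 10                # two functions per dimension
smallest_dim0 = F["argmin_0"](c)   # e.g. the 'LEAST + BIG' object
assert smallest_dim0[0] == c[:, 0].min()
```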

Model
Our model (Figure 1) resembles an encoder-decoder architecture, with Sender encoding the context/target pair into a message, and Receiver decoding the message (together with its context c') into an action. Both the encoder and decoder are multi-layer perceptrons with two hidden layers of 64 ReLU units (Nair and Hinton, 2010; Glorot et al., 2011). A smaller, intermediate layer without an activation function bridges the encoder and decoder and represents the transformation of the input information to messages.
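A minimal NumPy sketch of this architecture, with untrained random weights and our own helper names, assuming contexts of 10 objects with 5 features and 10 one-hot function selectors:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def mlp_params(sizes):
    # Random (untrained) weights for an MLP with the given layer sizes.
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on hidden layers; the final layer is linear, like the
    # activation-free message layer described above.
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = relu(x)
    return x

n_obj, n_feat, n_fns, msg_dim = 10, 5, 10, 2
sender = mlp_params([n_obj * n_feat + n_fns, 64, 64, msg_dim])
receiver = mlp_params([msg_dim + n_obj * n_feat, 64, 64, n_feat])

x = np.concatenate([rng.random(n_obj * n_feat), np.eye(n_fns)[3]])
message = forward(sender, x)        # continuous 2-D message
action = forward(receiver, np.concatenate([message, rng.random(n_obj * n_feat)]))
assert message.shape == (2,) and action.shape == (5,)
```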

Game Parameters
We manipulate the following parameters:

• Context identity. In the shared setting, Receiver sees a shuffled version of Sender's context (c' = shuffled(c)). In the non-shared setting, Receiver's context c' is entirely distinct from Sender's. This forces displacement and may incentivize compositional messages, since Sender cannot rely on the raw properties of the target object in communication.

• Context strictness. In strict contexts, there is a one-to-one (and onto) correspondence between F and A (as in the original Extremity Game from Steinert-Threlkeld, 2018, 2020). In non-strict contexts, an object may be the arg max or arg min of several dimensions, or of none.

In all experiments, the latent-space (message) dimension is always 2, and objects have 5 dimensions. Strict contexts therefore contain 10 objects, while non-strict contexts contain 5, 10, or 15 objects.
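The strictness condition can be made operational with a small check: a context is strict when the 2n extremal objects (arg min and arg max of each of the n dimensions) are exactly the m objects, each appearing once. The code below is our illustrative rejection-sampling sketch, not the authors' context generator:

```python
import numpy as np

def is_strict(c):
    # Collect the argmin and argmax object index of every dimension;
    # strict means these indices form a permutation of all objects.
    extremal = [int(c[:, i].argmin()) for i in range(c.shape[1])]
    extremal += [int(c[:, i].argmax()) for i in range(c.shape[1])]
    return sorted(extremal) == list(range(c.shape[0]))

rng = np.random.default_rng(0)
c = rng.random((10, 5))
while not is_strict(c):          # rejection-sample a strict context
    c = rng.random((10, 5))
assert is_strict(c)
```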

Training Details
We use the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.001, β1 = 0.9, and β2 = 0.999. The model is trained for 5,000 steps by feeding the network mini-batches of 64 contexts concatenated with one-hot function selectors. The network's loss is the MSE between the target object f(c') and the object generated by Receiver. For each setting of the above parameters, we run 20 trials with different random seeds.
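One such mini-batch and its MSE loss might be assembled as sketched below, under illustrative conventions of our own (e.g. that selector ids 0-4 mean arg max over dimensions 0-4 and ids 5-9 mean arg min); Receiver's output is a random stand-in here:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_obj, n_feat, n_fns = 64, 10, 5, 10

# Mini-batch: 64 contexts, each flattened and concatenated with a
# one-hot selector for the chosen function f.
contexts = rng.random((batch, n_obj, n_feat))
fn_ids = rng.integers(0, n_fns, size=batch)
selectors = np.eye(n_fns)[fn_ids]
inputs = np.concatenate([contexts.reshape(batch, -1), selectors], axis=1)

# Targets f(c'): ids 0-4 -> arg max of that dimension, 5-9 -> arg min.
dims = fn_ids % n_feat
argmax_idx = contexts.argmax(axis=1)[np.arange(batch), dims]
argmin_idx = contexts.argmin(axis=1)[np.arange(batch), dims]
idx = np.where(fn_ids < n_feat, argmax_idx, argmin_idx)
targets = contexts[np.arange(batch), idx]

# MSE loss against a stand-in for Receiver's reconstructed object.
predictions = rng.random((batch, n_feat))
mse = np.mean((predictions - targets) ** 2)
assert inputs.shape == (64, 60) and targets.shape == (64, 5)
```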

Results

Communicative success
We measure the communicative success of the network by calculating the accuracy of recovering the correct object from c'. Receiver's prediction is considered correct if its output is closer to f(c') than to all other objects in c'. (The project's code for extension and reproduction is available at https://github.com/0xnurl/signaling-auto-encoder.) Accuracy for the different settings is reported in Table 1:

                          Shared           Non-shared
  Strict, 10 objects      63.78% ± 1.63    60.22% ± 1.56
  Non-strict, 5 objects   49.37% ± 1.67    43.55% ± 1.69
  Non-strict, 10 objects  33.06% ± 1.47    31.89% ± 1.63
  Non-strict, 15 objects  27.58% ± 1.30    27.95% ± 1.24

While the network handles displacement well (non-shared contexts), the model struggles with non-strict contexts. Note that although accuracy is not 100%, it is still well above chance: for a context of 10 objects, for example, random guessing yields an expected accuracy of 10% (which we observe in our model before training).
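The success criterion can be sketched as a nearest-object rule (illustrative code; `is_correct` is our name):

```python
import numpy as np

def is_correct(output, context, target_idx):
    # Receiver is correct iff its output is closer (Euclidean distance)
    # to the target object than to every other object in the context.
    dists = np.linalg.norm(context - output, axis=1)
    return int(np.argmin(dists)) == target_idx

c_prime = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])
output = np.array([0.75, 0.25])   # Receiver's reconstruction
assert is_correct(output, c_prime, target_idx=1)
```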

Discrete signals
Figure 2 depicts message vectors sampled from the latent space layer, before and after training. It is apparent that discrete messages emerge from the imposed learning regime. We measure cluster tendency more quantitatively through two measures, one considering Sender's production and the other Receiver's perception. First, we sample 100 contexts and collect the output of the trained encoder for each of these contexts combined with each possible function f. We apply an unsupervised clustering algorithm (DBSCAN, Ester et al., 1996, with ε = 0.5) to this set of produced messages. A label is assigned to each cluster using the ground truth: the label of a cluster is the function f that was most often at the source of a point in the cluster. This allows us to compute F1 scores, which are reported in Table 2:

                          Shared         Non-shared
  Strict, 10 objects      1.00 ± 0.00    0.90 ± 0.09
  Non-strict, 5 objects   0.99 ± 0.02    0.54 ± 0.15
  Non-strict, 10 objects  1.00 ± 0.00    0.99 ± 0.01
  Non-strict, 15 objects  1.00 ± 0.00    1.00 ± 0.00

The model reached near-optimal clusterization measures in 7 out of 8 parameter settings, with the non-strict, non-shared context with 5 objects being the exception.
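The cluster-labeling step can be sketched as follows. This is illustrative code with our own names: cluster assignments are given directly here rather than computed by DBSCAN, and in this single-label setting the micro-averaged F1 reduces to accuracy:

```python
from collections import Counter

def label_clusters(cluster_ids, true_fns):
    # Each cluster gets the ground-truth function that most often
    # produced one of its points (majority vote).
    labels = {}
    for cid in set(cluster_ids):
        members = [f for c, f in zip(cluster_ids, true_fns) if c == cid]
        labels[cid] = Counter(members).most_common(1)[0][0]
    return labels

def cluster_score(cluster_ids, true_fns):
    # Fraction of points whose cluster label matches their true function.
    labels = label_clusters(cluster_ids, true_fns)
    predicted = [labels[c] for c in cluster_ids]
    return sum(p == t for p, t in zip(predicted, true_fns)) / len(true_fns)

cluster_ids = [0, 0, 0, 1, 1, 1]
true_fns = ["argmax_0", "argmax_0", "argmin_0",
            "argmin_1", "argmin_1", "argmin_1"]
assert abs(cluster_score(cluster_ids, true_fns) - 5 / 6) < 1e-9
```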
The second approach is akin to studying perception. Given the clusterization of the message space, we sample new messages from each cluster and test Receiver's perception of these 'artificial' messages, which have never been produced by Sender. To sample artificial messages, we take the average of 10 messages from a (now labelled) cluster. These artificial messages are fed to Receiver in 100 different contexts. The object recovery accuracy for these artificial messages is shown in Table 3. The model achieves recovery accuracy similar to that obtained when interpreting actual messages.
In sum, we can identify discrete, abstract regions of the latent space corresponding to different functions in the input, just like words form clusters in acoustic space.

Compositionality
Our agents are capable of communicating in abstract situations, namely ones in which their contexts differ. This generalizability suggests that the messages may be 'compositional'. We here probe for candidate compositional structure in the latent space by asking how the messages relate to the structure of the family of functions F.
First, the pioneering work of Mikolov et al. (2013) looks for compositionality at the level of word embeddings (WE) through addition, most classically asking whether WE(queen) = WE(king) - WE(man) + WE(woman).
In the current game, we can ask whether the messages are related as follows, for any dimensions i and j:

M(c, arg max_i) = M(c, arg max_j) - M(c, arg min_j) + M(c, arg min_i).
For each such pair of object dimensions, we calculate the right-hand side of the equation above for 100 contexts, feed it to Receiver, and compare Receiver's output to the output that would have been obtained if M(c, arg max_i) (the left-hand side) had been sent in the first place. This leads to substantial degradation of average communicative success: a drop of at least 24 percentage points across parameter combinations, to around chance level. Full results are in the left column of Table 4.
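The analogy test can be sketched as follows. The toy 'direction + dimension' decomposition below is our own construction showing what a perfectly additive-compositional message space would look like, which is exactly the structure the results above fail to find in the emergent messages:

```python
import numpy as np

def analogy(m_max_j, m_min_j, m_min_i):
    # Candidate for M(c, argmax_i) under composition-as-addition.
    return m_max_j - m_min_j + m_min_i

# If messages decomposed additively into a 'direction' part (max/min)
# plus a 'dimension' part (i/j), the analogy would hold exactly.
direction = {"max": np.array([1.0, 0.0]), "min": np.array([-1.0, 0.0])}
dim = {"i": np.array([0.0, 1.0]), "j": np.array([0.0, -1.0])}
M = lambda d, k: direction[d] + dim[k]   # toy compositional message

predicted = analogy(M("max", "j"), M("min", "j"), M("min", "i"))
assert np.allclose(predicted, M("max", "i"))
```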
Second, we note, as others have, that the composition-as-addition assumption is disputable, both in general and in the original application case (Linzen, 2016; Chen et al., 2017). To abstract away from this issue, we train a 'composition network' (an MLP with 2 hidden layers of 64 ReLU units) on the task of predicting M(c, arg max_i) from M(c, arg max_j), M(c, arg min_j), and M(c, arg min_i), therefore letting it discover any function for mixing values, not assuming addition a priori. We leave one dimension i0 out of training, and feed Receiver the message predicted by the 'composition network' from M(c, arg max_j), M(c, arg min_j), and M(c, arg min_{i0}). If the language were compositional, this predicted message should behave like M(c, arg max_{i0}); but we find that, as in the case of addition, the average communication accuracy for all held-out dimensions drops dramatically (again, by at least 24 percentage points). Full results are in the right column of Table 4.

Categorical perception
Above, we essentially propose an analysis of discreteness both in production and in perception. This invites more psycholinguistic queries about these emergent languages. For instance, one may ask whether classical 'Categorical Perception' (CP) effects obtain, whereby two messages at a short distance in the latent space may be discriminated easily if (and only if) they are on two sides of a categorical boundary for interpretation purposes (see Liberman et al., 1957, and Damper and Harnad, 2000, for early discussions in the context of neural architectures).
As an initial foray, we can investigate the sharpness of the boundaries of our discrete messages (i.e. their distribution in latent space). For representation purposes, we sample pairs of messages, call them M_{-1} and M_{+1}, generated by Sender for two choice functions F_{-1} and F_{+1}. We explore a continuous spectrum of messages along the dimension connecting these two messages, M_t = ((1 - t) M_{-1} + (1 + t) M_{+1}) / 2, shifting continuously from M_{-1} to M_{+1} as the continuous variable t moves from -1 to +1. The messages M_t are fed to Receiver together with contexts c', and for each function F_{-1} and F_{+1} in turn, we calculate object recovery accuracy. This is plotted in Figure 3 for an Extremity Game model trained in a strict, non-shared context setting with objects of size 5. The model shows that clusters have relatively sharp boundaries, especially in the direction of a message belonging to another cluster (the area where t is between -1 and +1 in Fig. 3). We can thus identify a boundary around a cluster, and its width, providing the necessary setup to investigate CP effects: whether pairs of messages crossing such a boundary behave differently (e.g. are easier to discriminate) than pairs of equally distant messages on the same side of the boundary.
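The interpolation itself is a simple convex combination; a short sketch with toy message vectors:

```python
import numpy as np

def interpolate(m_neg, m_pos, t):
    # M_t = ((1 - t) * M_{-1} + (1 + t) * M_{+1}) / 2:
    # equals M_{-1} at t = -1, M_{+1} at t = +1, midpoint at t = 0.
    return ((1 - t) * m_neg + (1 + t) * m_pos) / 2

m_neg, m_pos = np.array([0.0, 1.0]), np.array([2.0, -1.0])
assert np.allclose(interpolate(m_neg, m_pos, -1.0), m_neg)
assert np.allclose(interpolate(m_neg, m_pos, +1.0), m_pos)
assert np.allclose(interpolate(m_neg, m_pos, 0.0), (m_neg + m_pos) / 2)
```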

Conclusion
We propose a general signaling game framework in which fewer a priori assumptions are imposed on the conversational situations. Using both production and perception analyses, we find that under appropriate conditions, which are met by most studies involving neural signaling games, messages become discrete without the analyst having to force this property onto the language (and without having to deal with non-differentiability issues). We find no evidence of compositional structure using vector analogies and a generalization thereof, but we do find sharp boundaries between the discrete message clusters. Future work will explore other measures and alternative game settings for the emergence of compositionality, as well as more subtle psychological effects (e.g. Categorical Perception) of continuous biological systems exhibiting discrete structure, like the auditory system.

1. Nature chooses f ∈ F and a context c ∈ C.
2. Sender sees the context c and f.
3. Sender sends a message m to Receiver.
4. Receiver sees a possibly different context c' and the message m and chooses an action a'.
5. Both are 'rewarded' iff a' = f(c').

Figure 1 :
Figure 1: Our model architecture, mixing terminology from the autoencoder and signaling game traditions.

Figure 2 :
Figure 2: Sampled messages for contexts of 10 objects of size 5 for (a) an untrained and (b) a trained network. Colors represent the f_i ∈ F part of Sender's input.

Figure 3 :
Figure 3: Categorical perception effect, demonstrated by accuracy of object recovery using messages shifted between two 'meanings'.

Table 1 :
Communicative success, as measured by object recovery accuracy.

Table 2 :
Discreteness in production, as measured by F1 scores for automatically clusterized messages.

Table 3 :
Discreteness in perception, as measured by object recovery accuracy from artificial messages.

Table 4 :
Communicative success using messages 'inferred' by assuming a systematic relation within arg min_i / arg max_i message pairs. The 'compositionality by addition' method assumes that M(c, arg max_i) = M(c, arg max_j) - M(c, arg min_j) + M(c, arg min_i). The 'composition network' is an MLP trained to predict M(c, arg max_i) from the other three messages. Table values are object recovery accuracies averaged over all i.