Continual adaptation for efficient machine communication

To communicate with new partners in new contexts, humans rapidly form new linguistic conventions. Recent language models trained with deep neural networks are able to comprehend and produce the existing conventions present in their training data, but are not able to flexibly and interactively adapt those conventions on the fly as humans do. We introduce a repeated reference task as a benchmark for models of adaptation in communication and propose a regularized continual learning framework that allows an artificial agent initialized with a generic language model to more accurately and efficiently communicate with a partner over time. We evaluate this framework through simulations on COCO and in real-time reference game experiments with human partners.


Introduction
Linguistic communication depends critically on shared expectations about the meanings of words (Lewis, 1969). However, the real-world demands of communication often require speakers and listeners to go beyond dictionary meanings to understand one another (Clark, 1996; Stolk et al., 2016). The social world continually presents new communicative challenges, and agents must continually coordinate on new meanings to meet them.
For example, consider a nurse visiting a bedridden patient in a cluttered home. The first time they ask the nurse to retrieve a particular medication, the patient must painstakingly refer to unfamiliar pills, e.g. "the Eprosartan mesylate for my blood pressure, the ones in a small bluish medicine bottle on the bookcase in my bathroom." But after a week of care, they may just ask for their "meds" and expect the nurse to know what they mean.
Figure 1: (A) We introduce a repeated reference task, where a speaker agent must communicate the identity of a target object in context to a listener agent, and (B) a regularized continual learning approach allowing agents to rapidly adapt to their partner.

Such flexibility of meaning in language use poses a challenge for models of language in machine learning. Approaches based on deep neural networks typically learn a monolithic meaning function during training, with weights fixed during use. For an in-home robot to communicate as flexibly and efficiently with patients as a human nurse does, however, it must be equipped with a mechanism for rapid adaptation to its partner. Such a continual learning mechanism would present two specific advantages for interaction and communication applications.
First, to the extent that pre-trained models perform poorly in a particular communication setting, an adaptive approach can quickly improve accuracy on the relevant subset of language. Second, an adaptive model enables speakers to communicate more efficiently as they build up common ground and ad hoc conventions, remaining understandable while expending significantly fewer words, as humans naturally do (Clark and Wilkes-Gibbs, 1986).
In this paper, we introduce a benchmark repeated reference task on natural images. We then present a regularized continual learning framework for transforming neural language models into adaptive models that can be deployed as both speakers and listeners in real-time interactions with human partners.
Our key insight is that the sparse observations accumulated in a shared history of interaction are sufficient to support rapid, partner-specific finetuning of meaning representations (Fig. 1).
We are motivated by a hierarchical Bayesian approach to task-specific adaptation and convention formation (Kleinschmidt and Jaeger, 2015). In Sec. 2, we introduce the three core components of our algorithm: (i) a loss objective combining speaker and listener likelihoods, (ii) a regularization objective for fine-tuning model weights without overfitting or catastrophic forgetting, and (iii) a data augmentation step for compositionally assigning credit to sub-utterances. In Sec. 3, we present several experiments demonstrating that these components enable more effective communication with human partners over repeated interactions. Finally, in Sec. 4 we report ablation analyses showing that each component plays a necessary role.

Approach
We begin by recasting communication as a multitask problem for meta-learning. Each context and communicative partner can be regarded as a related but distinct task making its own demands on the agent's language model. To be effective across many such tasks, a communicative agent must both (1) have a prior representation they can use to understand novel partners and contexts, and (2) have a mechanism to rapidly update this representation from a small number of interactions. The present work assumes a conventionally pre-trained initialization and focuses on developing the latter mechanism.

Repeated reference game task
As a benchmark for studying this problem, we introduce the repeated reference game task (Fig. 1), which has been widely used in cognitive science to study partner-specific adaptation in communication (Krauss and Weinheimer, 1964; Clark and Wilkes-Gibbs, 1986; Wilkes-Gibbs and Clark, 1992). In this task, a speaker agent and a listener agent are shown a context of images, C, and must collaborate on how to refer to them. On each trial, one of these images is privately designated as the target object, o*, for the speaker. The speaker agent thus takes the pair (o*, C) as input and returns an utterance u that will allow the listener to select the target. The listener agent takes (u, C) as input and returns a softmax probability for each image, which it uses to make a selection. Both agents then receive feedback about the listener's response and the identity of the target. Critically, the sequence of trials is constructed so that each image repeatedly appears as the target, allowing us to evaluate how communication about each image changes over time.
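The trial structure just described can be sketched in a few lines of Python. This is an illustrative helper, not code from the paper's implementation; the image identifiers are placeholders.

```python
import random

def make_trial_sequence(context, n_blocks=6, seed=0):
    """Schedule targets so each image appears once per repetition block,
    with the same target never appearing twice in a row."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = list(context)
        rng.shuffle(block)
        # Re-shuffle if the block boundary would repeat the previous target.
        while sequence and block[0] == sequence[-1]:
            rng.shuffle(block)
        sequence.extend(block)
    return sequence

# A context of 4 images over 6 repetition blocks yields 24 trials.
trials = make_trial_sequence(["img_a", "img_b", "img_c", "img_d"])
```

With 4 images and 6 blocks this reproduces the 24-trial sequences used in the human experiments below.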

Continual adaptation with Hierarchical Bayes
Before formalizing our algorithm as a generic update rule for neural networks, we describe the theoretical Bayesian foundations of our approach. At the core of any communication model is a notion of the semantics of language. The semantics supplies the relationship between utterances and states of the world. Under a Bayesian approach, this representation is probabilistic: we represent some uncertainty over meanings. In a hierarchical Bayesian model, this uncertainty is structured, sharing statistics over different partners and contexts.
At the highest level of the hierarchy is a task-general variable Θ which parameterizes the agent's prior beliefs about the underlying semantics θ_i used in context i: P(θ_i | Θ). Given observations D_i from communicative interactions in that context, an agent can infer the task-specific semantics using Bayes' rule:

P(θ_i | D_i, Θ) ∝ P(D_i | θ_i) P(θ_i | Θ)

The Bayesian formulation thus decomposes the problem of task-specific adaptation into two terms: a prior term P(θ_i | Θ) and a likelihood term P(D_i | θ_i). The prior captures the idea that different language tasks share some task-general structure in common: in the absence of strong information about usage departing from this common structure, the agent ought to be regularized toward their task-general knowledge. The likelihood term accounts for needed deviations from general knowledge due to evidence from the current context.

Continual adaptation for neural language models
There is a deep theoretical connection between the hierarchical Bayesian framework presented in the previous section and recent deep learning approaches to multi-task learning (Nagabandi et al., 2018; Jerfel et al., 2018). Given a task-general initialization, regularized gradient descent on a particular task is equivalent to conditioning on new data under a Bayesian prior: the weight regularization corresponds to an implicit prior that keeps the weights of the adapted model close to those of the initial model. We exploit this connection to derive an online continual learning scheme for a neural model that can adapt to a human partner in our challenging referential communication task. Because small differences in weights can lead to large differences in behavior for neural models, we also consider a regularization intended to keep the behavior of the adapted model close to that of the initial model.

Concretely, we consider an image-captioning network (see Fig. 2A) that combines a convolutional visual encoder (ResNet-152) with an LSTM decoder (Vinyals et al., 2015). The LSTM takes a 300-dimensional embedding as input for each word in an utterance, and its output is linearly projected to a softmax distribution over the vocabulary. To pass the visual feature vector computed by the encoder into the decoder, we replaced the final layer of ResNet with a fully-connected adapter layer. This layer was jointly pre-trained with the decoder on the COCO training set and then frozen, leaving only the decoder weights (i.e. word embeddings, LSTM, and linear output layer) to be adapted in an online fashion. Upon observing each utterance-object data point in the current task, we take a small number of gradient steps fine-tuning these weights to better account for the usage observed so far (see Algorithm 1). Our objective is built from terms for the speaker likelihood, the listener likelihood, and a KL-based regularization.
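The equivalence invoked here can be made explicit. Under the standard assumption of an isotropic Gaussian prior centered at the task-general weights, MAP inference for the task-specific weights reduces to fine-tuning with an L2 penalty toward the initialization:

```latex
% MAP inference for the task-specific weights:
\hat{\theta}_i = \arg\max_{\theta}\ \log P(D_i \mid \theta) + \log P(\theta \mid \Theta)

% Assuming \theta \mid \Theta \sim \mathcal{N}(\Theta, \sigma^2 I), this becomes
% gradient descent on the data likelihood with an L2 pull toward \Theta:
\hat{\theta}_i = \arg\max_{\theta}\ \log P(D_i \mid \theta)
                 - \tfrac{1}{2\sigma^2}\,\lVert \theta - \Theta \rVert_2^2
```

The prior variance σ² plays the role of an inverse regularization strength: a tight prior keeps the adapted model near its task-general initialization.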
After describing these terms, we introduce a final core element of our approach: structured data augmentation.

Speaker likelihood. The primary signal available for adaptation is the (log-)probability of the new data. The form of this likelihood depends on the task at hand and what kind of evidence is available. For our benchmark communication task, D = {(u, o)}_{1:t} contains paired observations of utterances u and their objects of reference o throughout the history of interaction up to the current time t. These data can be viewed from the point of view of a speaker (generating u given o) or a listener (choosing o from a context of options, given u) (Smith et al., 2013). A speaker model uses its expectations about the task-specific semantics θ_t at current time t to sample utterances u given target o. The speaker likelihood can be computed directly from the neural captioning model, as shown in Fig. 2A, where the probability of each word in u = {w_0, ..., w_ℓ} is given by the softmax decoder output conditioned on the sentence so far, P_θt(w_i | o, w_<i). Thus:

P(u | o, θ_t) = ∏_i P_θt(w_i | o, w_<i)

Algorithm 1: Update step for adaptive language model
  Input: θ_t: weights at time t
  Output: θ_{t+1}: updated weights
  Data: (u_t, o_t): observed utterance and object at time t
  for each gradient step do
    sample augmented batch of sub-utterances u ∼ P(u_t)
    update weights on the combined speaker, listener, and regularization objectives
  end for

Listener likelihood. A listener can be modeled as inverting this speaker model to evaluate how well an utterance u describes each object o relative to the others in a context C of objects (see Fig. 2B; Frank and Goodman, 2012; Vedantam et al., 2017; Monroe et al., 2017):

P(o | u, C, θ_t) = P(u | o, θ_t) / Σ_{o' ∈ C} P(u | o', θ_t)

While the speaker likelihood serves to make the observed utterance more likely for the target in isolation, the listener likelihood makes it more likely relative to other objects in context. Because these views of the data provide complementary statistical information about the task-specific semantics θ_t, we combine them in our approach.
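A minimal sketch of how the two likelihood terms relate, using a toy tabular model in place of the LSTM decoder. The dictionary-based "semantics" and the example vocabulary are illustrative only.

```python
import math

def speaker_logprob(model, utterance, obj):
    # log P(u | o): sum of per-word log-probabilities, standing in for
    # the chain of softmax decoder outputs in the captioning model.
    return sum(math.log(model[obj][w]) for w in utterance)

def listener_prob(model, utterance, obj, context):
    # P(o | u, C): the speaker model inverted over the context, i.e. the
    # speaker score for the target normalized by scores for all objects.
    scores = {o: math.exp(speaker_logprob(model, utterance, o)) for o in context}
    return scores[obj] / sum(scores.values())

# Toy semantics: P(word | object) for two candidate objects.
model = {
    "target":     {"blue": 0.6, "bottle": 0.3},
    "distractor": {"blue": 0.1, "bottle": 0.3},
}
context = ["target", "distractor"]
p = listener_prob(model, ["blue", "bottle"], "target", context)
```

Note the complementary pressures: the speaker term raises P(u | o) for the target in isolation, while the listener term rewards utterances that distinguish the target from the distractor.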
Regularization. Fine-tuning repeatedly on a small number of data points presents a clear risk of catastrophic forgetting (Robins, 1995): losing our ability to produce or understand utterances for other images. While limiting the number of gradient steps will keep the task-specific model somewhat close to the prior, we will show that this is not sufficient (see Sec. 4). We thus also consider a global KL regularization term that explicitly minimizes the divergence between the captioning model's output probabilities before and after fine-tuning (Li and Bilmes, 2007; Yu et al., 2013; Galashov et al., 2018), preventing catastrophic forgetting by tethering task-specific behavior to the task-general model (in the absence of strong task-specific evidence). Since the support for our distribution of captions is infinite, we approximate the divergence incrementally by expanding from the maximum a posteriori (MAP) word, denoted w*, at each step according to the initial model P_Θ (see Appendix A for a derivation of this objective). This loss is then averaged across random images o sampled from the full domain O, not just those in context:

L_KL = E_{o ∼ O} [ Σ_{i=1}^{ℓ} D_KL( P_Θ(w_i | o, w*_<i) || P_θt(w_i | o, w*_<i) ) ]

where ℓ is the length of the MAP caption.

Data augmentation. Ideally, an adaptive agent should learn that words and sub-phrases contained in the observed utterance are compositionally responsible for its meaning. Such credit assignment is critical for a speaker model to converge on more efficient conventions. To encourage such learning, we derive a small training dataset D(u) using a data augmentation step on each utterance u: we use the set of sub-phrases derived from a syntactic dependency parse, which preserves grammatical acceptability. A second form of augmentation we consider is local rehearsal: at each interaction we include the augmented data from the history of previous observations in the same context, to prevent overfitting to the most recent observation [2].
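The sub-phrase augmentation step can be sketched as below. Here the dependency parse is supplied by hand as a head-index array; in practice it would come from an off-the-shelf parser, and the example sentence and indices are illustrative.

```python
def subtree_spans(heads):
    """Given dependency heads (head index per token, -1 for the root),
    return the contiguous token span covered by each token's subtree."""
    n = len(heads)
    children = {i: [] for i in range(n)}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    def span(i):
        lo = hi = i
        for c in children[i]:
            clo, chi = span(c)
            lo, hi = min(lo, clo), max(hi, chi)
        return lo, hi

    return [span(i) for i in range(n)]

def augment(utterance, heads, min_len=1):
    """Sub-phrase augmentation: each dependency subtree yields a candidate
    sub-utterance for the augmented training set D(u)."""
    tokens = utterance.split()
    spans = {s for s in subtree_spans(heads) if s[1] - s[0] + 1 >= min_len}
    return [" ".join(tokens[lo:hi + 1]) for lo, hi in sorted(spans)]

# All three modifiers attach to the head noun "bottle" (token 3).
subs = augment("the small bluish bottle", [3, 3, 3, -1])
```

Because each sub-utterance is a complete subtree, the augmented fragments remain grammatically acceptable, as the approach requires.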

Evaluations
To evaluate our model, we implemented a repeated reference game using images from the validation set of COCO (Lin et al., 2014) as the targets of reference. We constructed two kinds of contexts to obtain varying degrees of communicative difficulty. To construct challenging contexts C, we used our pre-trained model's own visual encoder to find sets of highly similar images. We extracted feature vectors for each image, partitioned the images into 100 groups using a k-means algorithm, sampled one image from each cluster, and took its 3 nearest neighbors in feature space, yielding 100 unique contexts of 4 images each [3]. To construct simple contexts, we sampled images randomly from distinct COCO category labels.

We consider our model's performance in two different tasks, which present distinct challenges for adaptation. First, in a listening task (Fig. 3A), our model is paired with a human speaker and must learn to interpret their referring expressions in challenging contexts. Second, in a speaking task (Fig. 3B), our model is paired with a human listener and must learn to generate appropriate referring expressions in simple contexts. The pre-trained model is poorly calibrated for each of these tasks in different ways. In the listening task, we expect accuracy to be initially low because the images are nearly indistinguishable and the speaker may use idiosyncratic and out-of-sample language. In the speaking task, we expect efficiency to be initially low because the model will produce more complex referring expressions than required to distinguish the images (COCO captions are relatively exhaustive). In both cases, we find that adaptation is able to quickly resolve these issues.

[2] In practice, we subsample batches of history in a separate loss term with its own weighting coefficient, ensuring the new data point and a batch of its sub-phrase augmentations are used in every gradient step.
[3] Using pre-trained VGG features gave similar contexts.
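The challenging-context construction can be sketched as follows. Toy 2-D "features" stand in for the encoder's feature vectors, and the k-means clustering step is omitted; only the nearest-neighbor selection around a sampled anchor is shown.

```python
def nearest_neighbor_context(features, anchor, k=3):
    """Build a challenging context: the anchor image plus its k nearest
    neighbors in feature space (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    others = sorted((o for o in features if o != anchor),
                    key=lambda o: dist(features[o], features[anchor]))
    return [anchor] + others[:k]

# Toy feature vectors: "d" is visually dissimilar and should be excluded.
features = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (0.0, 0.2),
    "d": (5.0, 5.0),
    "e": (0.2, 0.1),
}
ctx = nearest_neighbor_context(features, "a", k=3)
```

Selecting neighbors in the model's own feature space guarantees that the resulting 4-image contexts are hard to discriminate for the pre-trained model itself.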

Human baselines
We first investigated the baseline performance of human speakers and listeners in both kinds of contexts. We recruited 224 participants from Amazon Mechanical Turk and automatically paired them into an interactive environment with a chatbox. For each pair, we sampled a context and constructed a sequence of 24 trials structured into 6 repetition blocks, where each of the 4 images appeared as the target once per block. We prevented the same target from appearing twice in a row and scrambled the order of the images on each player's screen on each trial. After excluding games that terminated before completion, or where participants self-reported confusion or a native language other than English, we obtained 54 complete games using challenging contexts and 50 games using simple contexts.
Pairs of humans were remarkably accurate at the task, with performance near ceiling in both types of context (Fig. 3A-B, black lines). At the same time, their utterances grew increasingly efficient: in challenging contexts, for example, the mean utterance length decreased from 7 words per image on the first repetition to only 3 words on the last. To statistically test this increase in efficiency, we conducted a mixed-effects regression predicting utterance length. We included fixed effects for repetition number and context type (simple vs. challenging) and random intercepts accounting for variability in initial utterance length at the pair and image level. We found a significant overall decrease in utterance length across repetitions, t=21, with a significant positive quadratic component, t=14, indicating that the reduction gradually asymptotes. Speakers also used significantly fewer words overall in simpler contexts, t=6, reflecting sensitivity to the level of informativity needed to initially distinguish the images. However, this effect is clarified by a significant interaction in the extent of reduction across contexts, t=6: speakers initially use nearly twice as many words in challenging contexts as in simple contexts, but converge to approximately the same utterance length by the end [4].

Listening task with human speaker
Next, we evaluated how our adaptive model performed as a listener in real-time interaction with human speakers. We recruited 45 additional participants from Amazon Mechanical Turk, who were told they would be paired with an artificial agent learning how they talk. This task was identical to the one performed by humans, except that participants were only allowed to enter a single message through the chatbox on each trial. This message was sent to a server, where the model weights from the previous trial were loaded to the GPU, used to generate a response, and updated in real time for the next round. The approximate latency for the model to respond was 5-10 s, depending on how many games were running simultaneously.
For our objective, we used a linear combination of the speaker and listener likelihood losses and the KL regularization. We also used rehearsal and sub-phrase data augmentation. We found that a listener based on a pre-trained neural captioning model—the initialization for our adapting model—performs much less accurately than humans due to the challenging nature of the reference task. Yet our model rapidly improves in accuracy as it coordinates on appropriate meanings with human speakers (Fig. 3A). In a mixed-effects logistic regression predicting trial-level accuracy, including pair- and image-level random effects, we found a significant increase in the probability of a correct response with successive repetitions, z=9.3, p<0.001, from 37% correct (slightly above the chance level of 25%) to 93% at the end. Similarly, while there was substantial variation in the degree to which human speakers simplified their utterances, with some speakers increasing utterance length based on early feedback that the model was making errors, we found that speakers nonetheless became significantly more efficient over time on average, t = −3.1, p = 0.003.

[4] All effects significant at p < 0.001.

Speaking task with human listener
Finally, we evaluated our model in the speaker role, which requires the model to form more efficient conventions given feedback from human responses. 52 participants from Amazon Mechanical Turk were paired with our model to play the listener role. Utterances were selected from the LSTM decoder using beam search with a beam width of 50 and length normalization (e.g. Wu et al., 2016). After producing an utterance, the model receives feedback about the listener's selection. If the listener correctly selects the intended target, the model performs an adaptation step using the new observation; if the listener responds incorrectly, however, the model refrains from updating. This strategy thus only strengthens utterance meanings (and sub-phrase meanings, through data augmentation) after positive evidence of understanding from a partner.
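Length normalization for beam search can be sketched as below, in the style of the Wu et al. (2016) length penalty. The penalty formula and the alpha value are illustrative; the paper does not report its exact normalization settings.

```python
def length_normalized_score(logprob, length, alpha=0.7):
    # Divide the total log-probability by a length penalty so that longer
    # candidates are not dominated by short, generic ones.
    penalty = ((5 + length) / 6) ** alpha
    return logprob / penalty

def rerank(candidates, alpha=0.7):
    """candidates: list of (tokens, total_logprob); return the best
    candidate under the length-normalized score."""
    return max(candidates,
               key=lambda c: length_normalized_score(c[1], len(c[0]), alpha))

cand_short = (["a", "dog"], -2.0)
cand_long = (["a", "big", "brown", "dog", "on", "grass"], -2.2)
best = rerank([cand_short, cand_long])
```

Without normalization the short candidate wins on raw log-probability; with it, the longer, more informative caption can be selected instead.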
As predicted, we found that the model starts with much longer captions than human speakers use in simple contexts (Fig. 3B). It uses nearly as many words for simple contexts as humans use for challenging contexts. However, like humans, it gets dramatically more efficient over interaction while maintaining high accuracy. We found a significant decrease in utterance length over successive repetitions, t = −31, p < 0.001, using the same mixed-effects regression structure reported above.

Analysis
We now turn to a series of lesion analyses examining the role played by each component of our approach.

Preventing catastrophic forgetting
To test the effectiveness of our KL regularization term for preventing catastrophic forgetting, we examined the likelihood of different captions before and after adaptation to the human baseline utterances in a listening task. First, we sampled a random set of images from COCO that were not used in our experiment as control images, and used the initialized state of the LSTM to greedily generate a caption for each. We also generated initial captions for the target objects in context. We recorded the likelihood of all of these sampled captions under the model at the beginning and at each step of adaptation until the final round. Finally, we greedily generated an utterance for each target at the end and retrospectively evaluated its likelihood at earlier states. These likelihood curves are shown with and without speaker KL regularization in Fig. 4. The final caption becomes more likely in both cases (brown line); without the KL term, the initial captions for both targets and unrelated controls are (catastrophically) lost (orange and yellow lines).
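The KL term being lesioned here can be sketched per step over toy next-word distributions; the dictionaries stand in for softmax outputs over the vocabulary along the initial model's MAP caption.

```python
import math

def stepwise_kl(path_init, path_adapted):
    # Approximate KL between caption distributions by summing the KL
    # between next-word distributions at each step of the initial
    # model's MAP caption (the incremental approximation above).
    total = 0.0
    for p, q in zip(path_init, path_adapted):
        total += sum(p[w] * math.log(p[w] / q[w]) for w in p)
    return total

# One decoding step with two candidate next words.
step = {"blue": 0.5, "bottle": 0.5}
drifted = {"blue": 0.9, "bottle": 0.1}
```

An unchanged model incurs zero penalty, while any drift in the next-word distributions is penalized, tethering adapted behavior to the task-general model.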

Lesioning data augmentation steps
We next simulated what our model's performance in the listening task would have been without the ability to keep training on batches from the history of the interaction (Fig. 5A). As a metric, we use the raw probability assigned to the target after hearing each utterance. We found that rehearsal on previous rounds allowed for faster adaptation in early rounds. Compared to an entirely non-adapting baseline, however, the lesioned model still performed significantly better over time, successfully adapting to human language use.
Next, we investigated the role played by the sub-phrase data augmentation mechanism (Fig. 5B). The key intuition is that it may be helpful to initially produce longer utterances providing partially redundant information in the absence of task-specific evidence supporting particular meanings. But given evidence of understanding from a partner, the speaker builds confidence that individual pieces of information (i.e. components of the referring expression) will also carry the intended meaning. We instantiated this idea in our sub-phrase data augmentation step, exposing the model to the compositional structure of the utterance. To directly test the role played by this mechanism, we simulated performance in the speaking task with and without sub-phrase data augmentation. We found, as expected, that without augmentation the model fails to become more efficient: positive feedback from interaction only reinforces the entire utterance.

Related and future work
Adapting or personalizing language models is a classic problem of practical interest for NLP, where shifts in the data distribution are often found across test contexts (Ben-David et al., 2010). While our KL regularization approach is drawn from this domain adaptation literature (Yu et al., 2013; Liu et al., 2016), the interactive communicative setting we consider poses challenges distinct from the speech recognition tasks (Bellegarda, 2004; Miao and Metze, 2015) and parsing or text classification tasks (Blitzer et al., 2007; Glorot et al., 2011) for which adaptation is typically considered. In referential communication, partner-specific observations are extremely sparse and must be incorporated in an online manner, ideally accounting for the fact that these observations were produced by intentional agents, as our speaker and listener loss terms aim to do.
While our evaluations were limited to a canonical CNN-RNN image captioning architecture, a key open question for future work is how more complex, state-of-the-art architectures ought to be adapted. One possibility, following an alternative approach recently proposed by Jaech and Ostendorf (2018), is to allow context (e.g. partner identity) to control a low-rank transformation of the weight matrix, such that fine-tuning can be limited to a more compact context embedding space. Furthermore, while we adapted the entire parameterized RNN, future work should investigate the effect of limiting adaptation to subcomponents (e.g. word embeddings) or expanding adaptation to supplemental model components such as attention weights or high-level visual features. Another critical area for improvement is generalizing the forms of social feedback that can be used as evidence beyond the sparse choices made in a reference game. In particular, forms of repair through bi-directional dialogue may allow misunderstandings to be resolved more quickly (Drew, 1997; Dingemanse et al., 2015).

The claim that language users rapidly adapt their linguistic expectations to new contexts and partners also has a long history in cognitive science. Indeed, a similar fine-tuning adaptation approach has recently been shown to accurately predict human surprisal on psycholinguistic stimuli (Van Schijndel and Linzen, 2018). While these connections suggest that our model is capturing a key aspect of human language use, they also raise a concern about the extent to which improvement in our evaluations is driven by humans adapting to our model rather than the other way around.
We certainly expect that both parties are adapting, just as pairs of humans do, but we found strong evidence that the model's adaptation was critical to success. For example, if improvements in the listening task were due to humans searching for utterances that a relatively fixed model could understand, we would expect our non-adapting baseline (Fig. 5) to improve over time instead of remaining flat. More broadly, we expect that implementing a meta-learning 'outer loop' around the adaptive 'inner loop' described for a single partner in this paper may lead to better initializations that implicitly account for the ways both the human and the machine adapt over short interactions.

Conclusions
Human language use is flexible, continuously adapting to the needs of the current situation. In this paper, we introduced a challenging repeated reference game benchmark for artificial agents, which requires such adaptability to succeed. We proposed a continual learning approach that forms context-specific conventions by fine-tuning general-purpose representations. Even when general-purpose models initially perform inaccurately or inefficiently, our approach allows adapted variants of such models to quickly become more accurate and more efficient through interaction with a partner.

Appendix B: Experiment parameters
For both the speaker task and the listener task, we used a learning rate of 0.0005, took 6 gradient steps after each trial, and used a batch size of 8 when sampling utterances from the augmented set of sub-phrases. At each gradient step, we sampled 50 objects from the full domain of COCO to compute the sum in our regularization term. We set the coefficients weighting each term in our loss function as follows.
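The overall objective takes the shape sketched below. The coefficient values here are placeholders for illustration only; the actual values used in the experiments are not reproduced in this excerpt.

```python
# Hypothetical coefficients, NOT the paper's settings.
COEFFS = {"speaker": 1.0, "listener": 1.0, "kl": 0.5, "rehearsal": 0.5}

def combined_loss(losses, coeffs=COEFFS):
    # Weighted sum of the per-term losses named in the objective:
    # speaker likelihood, listener likelihood, KL regularization,
    # and the separate rehearsal term (see footnote 2).
    return sum(coeffs[k] * losses[k] for k in coeffs)
```

In practice each coefficient trades off fit to the current partner (speaker/listener terms) against retention of task-general knowledge (KL and rehearsal terms).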