Reasoning About Pragmatics with Neural Listeners and Speakers

We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like previous learned approaches to language generation, our model uses a simple feature-driven architecture (here a pair of neural "listener" and "speaker" models) to ground language in the world. Like inference-driven approaches to pragmatics, our model actively reasons about listener behavior when selecting utterances. For training, our approach requires only ordinary captions, annotated _without_ demonstration of the pragmatic behavior the model ultimately exhibits. In human evaluations on a referring expression game, our approach succeeds 81% of the time, compared to a 64% success rate using existing techniques.


Introduction
We present a model for describing scenes and objects by reasoning about context and listener behavior. By incorporating standard neural modules for image retrieval and language modeling into a probabilistic framework for pragmatics, our model generates rich, contextually appropriate descriptions of structured world representations. Figure 1 shows a reference game RG played between a listener L and a speaker S:

1. Reference candidates r_1 and r_2 are revealed to both players.
2. A target t ∈ {1, 2} is selected and revealed to S only.
3. S produces a description d = S(t, r_1, r_2), which is shown to L.
4. L chooses a referent c = L(d, r_1, r_2).
5. Both players win if c = t.

In order for the players to win, S's description d must be pragmatic: it must (implicitly or explicitly) encode an understanding of L's behavior. In Figure 1, for example, the owl is wearing a hat and the owl is sitting in the tree are both accurate descriptions of the target image, but only the second allows a human listener to succeed with high probability. RG can be used to elicit a broad class of such pragmatic phenomena, which ultimately concern behavior rather than truth conditions.

Existing computational models of pragmatics can be divided into two essentially independent lines of work, which we term the direct and derived approaches.
Broadly, direct models (see Section 2 for examples) are based on a representation of S. They learn pragmatic behavior by example. Beginning with datasets annotated for the specific task they are trying to solve (e.g. examples of humans playing RG), direct models use feature-based architectures to predict appropriate behavior without a listener representation. While quite general in principle, such models require training data annotated specifically with pragmatics in mind; such data is scarce in practice.
Derived models, by contrast, are based on a representation of L. They first instantiate a base listener L0, intended to simulate the real listener L. They then form a reasoning speaker S1, which chooses a description that causes L0 to behave correctly. Existing derived models couple handwritten grammars and hand-engineered listener models with sophisticated inference procedures. They exhibit complex behavior, but are restricted to small domains where grammar engineering is practical.
The approach we present in this paper aims to capture the best aspects of both lines of work. Like direct approaches, we use machine learning to acquire a complete grounded generation model from data, without domain knowledge in the form of a hand-written grammar or hand-engineered listener model. But like derived approaches, we use this learning to construct a base model, and embed it within a higher-order model that reasons about listener responses. As will be seen, this reasoning step allows the model to make use of weaker supervision than previous data-driven approaches, while exhibiting robust behavior in a variety of contexts.
Our approach is, in a sense, a straightforward derived model, but with the underlying generation and interpretation behavior learned rather than engineered. Independent of the application to RG, our model also resembles the suite of neural image captioning models that have been a popular subject of recent study (Xu et al., 2015). Nevertheless, our approach appears to be:

• the first such captioning model to reason explicitly about listeners
• more generally, the first neural captioning model that can generate different captions for the same target in different contexts
• the first learned approach to pragmatics to require no pragmatic training data

Following previous work, we evaluate our model on RG, though the general architecture could be applied to other tasks where pragmatics plays a core role. Using a large database of abstract scenes like the one shown in Figure 1, we run a series of games with humans in the role of L and our system in the role of S. We find that the descriptions generated by our model result in correct interpretation 17% more often than a recent learned baseline system. We use these experiments to explore various other aspects of computational pragmatics, including tradeoffs between adequacy and fluency, and between computational efficiency and expressive power.


Related Work
Direct pragmatics As an example of the direct approach mentioned in the introduction, FitzGerald et al. (2013) collect a set of human-generated referring expressions about abstract representations of sets of colored blocks. Given a set of blocks to describe, their model directly learns a maximum-entropy distribution over the set of logical expressions whose denotation is the target set. Other research, focused on direct referring expression generation from a computer vision perspective, includes that of Mao et al. (2015) and Kazemzadeh et al. (2014).
Derived pragmatics The derived approach is exemplified by the work of Smith et al. (2013). This work describes a series of nested Bayesian models, in which intelligent listeners reason about the behavior of reflexive speakers, and even higher-order speakers reason about these listeners. Experiments (Frank et al., 2009) show that this model explains human behavior well, but both computational and representational issues restrict its application to very simple reference games.
Other work in this family includes that of Vogel et al. (2013), Golland et al. (2010), and Monroe and Potts (2015). These approaches couple template-driven language generation with game-theoretic reasoning frameworks to produce contextually appropriate language. While somewhat more expressive than the models of Smith et al. (2013), they still require domain-specific engineering, controlled world representations, and pragmatically annotated training data.
Representing language and the world In addition to the pragmatics literature, the approach proposed in this paper relies extensively on recently developed tools for multimodal processing of language and unstructured representations like images. These include both image retrieval models, which select an image from a collection given a textual description (Socher et al., 2014), and neural conditional language models, which take a content representation and emit a sequence of tokens (Donahue et al., 2015).
Reasoning with neural networks A general framework for constructing task-specific neural networks and composing them to produce novel behavior was explored in the context of visual question answering by Andreas et al. (2015). The current work can be thought of as a generalization of that approach, in which the high-level model actively reasons about the output of low-level modules, rather than directly composing them. While not trained end-to-end, our approach can also be viewed as a cooperative analog of the "generative adversarial networks" used for image generation (Goodfellow et al., 2014a).

Approach
Our goal is to produce a model that can play the role of the speaker S in RG. Specifically, given a target referent (e.g. a scene or object) r and a distractor r', the model must produce a description d that uniquely identifies r. For training, we have access to a set of non-contrastively captioned referents {(r_i, d_i)}: each training description d_i is generated for its associated referent r_i in isolation. There is no guarantee that d_i would actually serve as a good referring expression for r_i in any particular context. We must thus use the training data to ground language in referent representations, but rely on reasoning to produce pragmatics.
Our model architecture is compositional and hierarchical. We begin in Section 3.2 by describing a collection of "modules": basic computational primitives for mapping between referents, descriptions, and reference judgments, here implemented as linear operators or small neural networks. While these modules appear as substructures in neural architectures for a variety of tasks, we put them to novel use in constructing a reasoning pragmatic speaker model. Section 3.3 describes how to assemble two base models: a literal speaker, which maps from referents to strings, and a literal listener, which maps from strings to reference judgments. Section 3.4 describes how these base models are used to implement a top-level reasoning speaker: a learned, probabilistic, derived model of pragmatics.

Figure 2: Diagrams of modules used to construct speaker and listener models. "FC" is a fully-connected layer (a matrix multiply) and "ReLU" is a rectified linear unit. The encoder modules (a, b) map from feature representations (in gray) to embeddings (in black), while the ranker (c) and describer (d) modules respectively map from embeddings to categorical choices and strings.

Preliminaries
Formally, we'll take a description d to consist of a sequence of words d_1, d_2, ..., d_n, drawn from a vocabulary of known size. For encoding, we'll also assume access to a feature representation f(d) of the sentence (for the rest of this paper, a vector of indicator features on n-grams). These two views, as a sequence of words d_i and as a feature vector f(d), form the basis of module interactions with language. Referent representations are similarly simple. Because the model never generates referents, only conditions on them and scores them, a vector-valued feature representation of referents suffices. Our approach is completely indifferent to the nature of this representation. While the experiments in this paper use a vector of indicator features on the objects and actions present in abstract scenes (Figure 1), it would be easy to instead use pre-trained convolutional representations for referring to natural images. As with descriptions, we denote this feature representation f(r) for referents.
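Since the model only ever touches f(d) and f(r), these representations are easy to make concrete. The sketch below (in Python/numpy; the paper itself includes no code) uses a toy vocabulary and object inventory, indicator features on unigrams and bigrams for descriptions, and object indicators only (omitting actions) for referents. All of these specifics are illustrative simplifications, not the paper's exact feature set.

```python
import numpy as np

# Toy inventories (illustrative only; the real vocabulary and object set are much larger)
VOCAB = ["<s>", "</s>", "the", "owl", "hat", "tree", "is", "wearing", "sitting", "in", "a"]
OBJECTS = ["owl", "hat", "tree", "sun", "dog", "mike", "jenny"]
BIGRAMS = [(u, v) for u in VOCAB for v in VOCAB]

def description_features(d):
    """f(d): indicator features on unigrams and bigrams of a description string."""
    words = d.split()
    f = np.zeros(len(VOCAB) + len(BIGRAMS))
    for w in words:
        f[VOCAB.index(w)] = 1.0
    for bigram in zip(words, words[1:]):
        f[len(VOCAB) + BIGRAMS.index(bigram)] = 1.0
    return f

def referent_features(scene_objects):
    """f(r): indicator features on the objects present in an abstract scene."""
    f = np.zeros(len(OBJECTS))
    for obj in scene_objects:
        f[OBJECTS.index(obj)] = 1.0
    return f

f_d = description_features("the owl is sitting in the tree")
f_r = referent_features({"owl", "tree", "hat"})
```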

Modules
All listener and speaker models are built from a kit of simple building blocks for working with multimodal representations of images and text:

1. a referent encoder E_r
2. a description encoder E_d
3. a choice ranker R
4. a referent describer D

These are depicted in Figure 2, and specified more formally below. All modules are parameterized by weight matrices, written with capital letters W_1, W_2, ...; we refer to the collection of weights for all modules together as W.
Encoders The referent and description encoders produce a linear embedding of referents and descriptions in a common vector space.
Referent encoder: e_r = E_r(r) = W_1 f(r)

Description encoder: e_d = E_d(d) = W_2 f(d)

Choice ranker The choice ranker takes a string encoding and a collection of referent encodings, assigns a score to each (string, referent) pair, and then transforms these scores into a distribution over referents. We write R(e_i | e_{-i}, e_d) for the probability of choosing i in contrast to the alternative; for example, R(e_2 | e_1, e_d) is the probability of answering "2" when presented with encodings e_1 and e_2. Each candidate receives a score

s_i = w^T ρ(W_3 e_d + W_4 e_i)

and these scores are normalized to give

R(e_1 | e_2, e_d) = exp(s_1) / (exp(s_1) + exp(s_2)).

(Here ρ is a rectified linear activation function.)
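Continuing the sketch (and reusing VOCAB, OBJECTS, BIGRAMS, description_features, and referent_features from above), the encoders and ranker might be parameterized as follows. The embedding size, initialization, and weight names are illustrative; only the overall structure, linear embeddings followed by a ReLU scoring layer and a softmax over candidates, is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB = 16                                        # illustrative embedding size
D_R = len(OBJECTS)                                # referent feature dimension
D_D = len(VOCAB) + len(BIGRAMS)                   # description feature dimension

W1 = rng.normal(scale=0.1, size=(D_EMB, D_R))     # referent encoder weights
W2 = rng.normal(scale=0.1, size=(D_EMB, D_D))     # description encoder weights
W3 = rng.normal(scale=0.1, size=(D_EMB, D_EMB))   # ranker, description side
W4 = rng.normal(scale=0.1, size=(D_EMB, D_EMB))   # ranker, referent side
w = rng.normal(scale=0.1, size=D_EMB)             # ranker scoring vector

def encode_referent(f_r):
    return W1 @ f_r                               # e_r = W1 f(r)

def encode_description(f_d):
    return W2 @ f_d                               # e_d = W2 f(d)

def rank(e_d, e_refs):
    """R: a distribution over candidate referents, given a description encoding."""
    rho = lambda x: np.maximum(x, 0.0)            # rectified linear activation
    s = np.array([w @ rho(W3 @ e_d + W4 @ e_i) for e_i in e_refs])
    exp_s = np.exp(s - s.max())
    return exp_s / exp_s.sum()

probs = rank(encode_description(description_features("the owl is in the tree")),
             [encode_referent(referent_features({"owl", "tree"})),
              encode_referent(referent_features({"dog", "sun"}))])
print(probs)                                      # p(choose r_1), p(choose r_2)
```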

Referent describer
The referent describer takes an image encoding and outputs a description using a (feedforward) conditional neural language model. We express this model as a distribution p(d_{n+1} | d_n, d_{<n}, e_r), where d_n is an indicator feature on the last description word generated, d_{<n} is a vector of indicator features on all other words previously generated, and e_r is a referent embedding. This is a "2-plus-skip-gram" model, with local positional history features, global position-independent history features, and features on the referent being described. To implement this probability distribution, we first use a multilayer perceptron to compute a vector of scores s (one s_i for each vocabulary item),

s = W_6 ρ(W_5 [d_n, d_{<n}, e_r]),

and then normalize them to obtain probabilities:

p(d_{n+1} = i | d_n, d_{<n}, e_r) = exp(s_i) / Σ_j exp(s_j).

Figure 3: Schematic depictions of models. The literal listener L0 maps from descriptions and reference candidates to reference decisions. The literal speaker S0 maps directly from scenes to descriptions, ignoring context, while the reasoning speaker uses samples from S0 and scores from both L0 and S0 to produce contextually appropriate captions.
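A matching sketch of the describer module just described, reusing VOCAB and D_EMB from the sketches above (so it is not standalone). The hidden size and exact feature layout are illustrative; as in the text, each step conditions on the last word, a bag of all previous words, and the referent embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D_HID = len(VOCAB), 32                                  # vocabulary and hidden sizes

W5 = rng.normal(scale=0.1, size=(D_HID, 2 * V + D_EMB))    # [d_n, d_<n, e_r] -> hidden
W6 = rng.normal(scale=0.1, size=(V, D_HID))                # hidden -> vocabulary scores

def describe_step(last_word, prev_words, e_r):
    """p(d_{n+1} | d_n, d_{<n}, e_r): one step of the conditional language model."""
    d_n = np.zeros(V)
    d_n[VOCAB.index(last_word)] = 1.0               # local positional history
    d_prev = np.zeros(V)
    for word in prev_words:                         # global, position-independent history
        d_prev[VOCAB.index(word)] = 1.0
    hidden = np.maximum(W5 @ np.concatenate([d_n, d_prev, e_r]), 0.0)
    s = W6 @ hidden
    exp_s = np.exp(s - s.max())
    return exp_s / exp_s.sum()                      # distribution over the vocabulary

def sample_description(e_r, max_len=10):
    """Sample a caption from the describer, one word at a time."""
    words = ["<s>"]
    while len(words) < max_len and words[-1] != "</s>":
        probs = describe_step(words[-1], words[1:-1], e_r)
        words.append(VOCAB[rng.choice(V, p=probs)])
    return " ".join(word for word in words[1:] if word != "</s>")
```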

Base models
From these building blocks, we construct a pair of base models. The first of these is a literal listener L0, which takes a description and a set of referents, and chooses the referent most likely to be described. This serves the same purpose as the base listener in the general derived approach described in the introduction. We additionally construct a literal speaker S0, which takes a referent in isolation and outputs a description. The literal speaker is used for efficient inference over the space of possible descriptions, as described in Section 3.4. L0 is, in essence, a retrieval model, and S0 is a neural captioning model. Both of the base models are probabilistic: L0 produces a distribution over referent choices, and S0 produces a distribution over strings. They are depicted with gray backgrounds in Figure 3.
Literal listener Given a description d and a pair of candidate referents r_1 and r_2, the literal listener embeds both referents and passes them to the ranking module, which produces a distribution over choices i.
That is, p_L0(1 | d, r_1, r_2) = R(e_1 | e_2, e_d), and vice-versa. This model is trained contrastively, by solving the following optimization problem:

max_W Σ_i log p_L0(1 | d_i, r_i, r')

Here r' is a random distractor chosen uniformly from the training set. For each training example (r_i, d_i), this objective attempts to maximize the probability that the model chooses r_i as the referent of d_i over a random distractor. This uniform random choice is important: it ensures that our approach is applicable even when there is not a naturally-occurring source of target-distractor pairs, as previous work (Golland et al., 2010; Monroe and Potts, 2015) has required. Note that this objective can also be viewed as the contrastive loss described by Smith and Eisner (2005), where it serves as an approximation to the likelihood objective that encourages L0 to prefer r_i to every other possible referent simultaneously.
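A sketch of this objective, reusing the encoder and ranker sketches above. Only the loss value is computed; a real implementation would minimize it with automatic differentiation rather than plain numpy, and the dataset format (parallel lists of feature vectors) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def literal_listener_loss(referent_feats, description_feats):
    """Average negative contrastive log-likelihood of L0 over a captioned dataset."""
    total, n = 0.0, len(referent_feats)
    for i in range(n):
        j = rng.integers(n)                          # r': uniform random distractor
        e_d = encode_description(description_feats[i])
        probs = rank(e_d, [encode_referent(referent_feats[i]),   # true referent first
                           encode_referent(referent_feats[j])])
        total -= np.log(probs[0] + 1e-12)            # log p_L0(1 | d_i, r_i, r')
    return total / n
```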
Literal speaker As in the figure, the literal speaker is obtained by composing a referent encoder with a describer:

p_S0(d | r) = Π_n p(d_{n+1} | d_n, d_{<n}, E_r(r))

As with the listener, the literal speaker should be understood as producing a distribution over strings. It is trained by maximizing the conditional likelihood of the captions in the training data:

max_W Σ_i log p_S0(d_i | r_i)

These base models are intended to be the minimal learned equivalents of the hand-engineered speakers and hand-written grammars employed in previous derived approaches (Golland et al., 2010). The neural encoding/decoding framework implemented by the modules in the previous subsection provides a simple way to map from referents to descriptions and from descriptions to judgments without worrying too much about the details of syntax or semantics. Past work amply demonstrates that neural conditional language models are powerful enough to generate fluent and accurate (though not necessarily pragmatic) descriptions of images or structured representations (Donahue et al., 2015).
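The corresponding maximum-likelihood objective for the literal speaker can be sketched the same way, reusing encode_referent, describe_step, and VOCAB from above; again, only the loss is shown.

```python
import numpy as np

def literal_speaker_loss(referent_feats, captions):
    """Average negative conditional log-likelihood of S0 on (referent, caption) pairs."""
    total, count = 0.0, 0
    for f_r, caption in zip(referent_feats, captions):
        e_r = encode_referent(f_r)
        words = ["<s>"] + caption + ["</s>"]         # captions are lists of tokens
        for n in range(1, len(words)):
            probs = describe_step(words[n - 1], words[1:n - 1], e_r)
            total -= np.log(probs[VOCAB.index(words[n])] + 1e-12)
            count += 1
    return total / count
```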

Reasoning model
As described in the introduction, the general derived approach to pragmatics constructs a base listener and then selects a description that makes it behave correctly. Since the assumption that listeners will behave deterministically is often a poor one, it is common for such derived approaches to implement probabilistic base listeners, and maximize the probability of correct behavior.
The neural literal listener L0 described in the preceding section is such a probabilistic listener. Given a target t and a pair of candidate referents r_1 and r_2, it is natural to specify the behavior of a reasoning speaker as simply:

S_1(t, r_1, r_2) = argmax_d p_L0(t | d, r_1, r_2)        (8)

At first glance, the only thing necessary to implement this model is the representation of the literal listener itself. When the set of possible utterances comes from a fixed vocabulary (Vogel et al., 2013) or a grammar small enough to exhaustively enumerate (Smith et al., 2013), the operation max_d in Equation 8 is practical.
In this case, however, we would like our model to be capable of producing arbitrary strings. Because the score p L0 is produced by a discriminative listener model, and does not factor along the words of the description, there is no dynamic program that enables efficient inference over the space of all strings.
We instead use a sampling-based optimization procedure. The key ingredient here is a good proposal distribution from which to sample sentences likely to be assigned high weight by the model listener. For this we turn to the literal speaker S0 described in the previous section. Recall that this speaker produces a distribution over plausible descriptions of isolated images, while ignoring pragmatic context. We can use it as a source of candidate descriptions, to be reweighted according to the expected behavior of L0.
The full specification of a sampling neural reasoning speaker is as follows:

1. Draw samples d_1, ..., d_k ~ p_S0(· | r_t).
2. Score them under the literal listener: p_k = p_L0(t | d_k, r_1, r_2).
3. Output the highest-scoring sample d_k.

While the literal speaker is introduced primarily to enable efficient inference, we can also use it to serve a different purpose: "regularizing" model behavior towards choices that are adequate and fluent, rather than exploiting strange model behavior. Past work has restricted the set of utterances in a way that guarantees fluency. But with an imperfect learned listener model, and a procedure that optimizes this listener's judgments directly, the speaker model might accidentally discover the kinds of pathological optima that neural classification models are known to exhibit (Goodfellow et al., 2014b): in this case, sentences that cause exactly the right response from L0, but no longer bear any resemblance to human language use. To correct this, we allow the model to consider two questions: as before, "how likely is it that a listener would interpret this sentence correctly?", but additionally "how likely is it that a speaker would produce it?" Formally, we introduce a parameter λ that trades off between L0 and S0, and take the reasoning model score in step 2 above to be

p_L0(t | d_k, r_1, r_2)^(1-λ) · p_S0(d_k | r_t)^λ .

This can be viewed as a weighted joint probability that a sentence is both uttered by the literal speaker and correctly interpreted by the literal listener, or alternatively in terms of Grice's conversational maxims (Grice, 1970): L0 encodes the maxims of quality and relation, ensuring that the description contains enough information for L to make the right choice, while S0 encodes the maxims of quantity and manner, ensuring that the description conforms to human language use with respect to sentence length and word choice.
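Putting the pieces together, the sampling reasoning speaker can be sketched as below, reusing sample_description, describe_step, encode_referent, encode_description, description_features, rank, and VOCAB from the earlier sketches. The log-space weighting mirrors the λ convention above (λ = 0 scores with the listener alone); everything else, including the default hyperparameters, is illustrative rather than the paper's exact implementation.

```python
import numpy as np

def reasoning_speaker(f_target, f_distractor, lam=0.02, n_samples=100):
    """Sample candidate captions from S0 and return the best one under the weighted score."""
    e_t = encode_referent(f_target)
    e_refs = [e_t, encode_referent(f_distractor)]        # target is candidate 1
    best, best_score = None, -np.inf
    for _ in range(n_samples):
        d = sample_description(e_t)                      # step 1: proposal from S0
        seq = ["<s>"] + d.split() + ["</s>"]
        log_ps0 = 0.0                                    # log p_S0(d | r_t)
        for n in range(1, len(seq)):
            probs = describe_step(seq[n - 1], seq[1:n - 1], e_t)
            log_ps0 += np.log(probs[VOCAB.index(seq[n])] + 1e-12)
        e_d = encode_description(description_features(d))
        log_pl0 = np.log(rank(e_d, e_refs)[0] + 1e-12)   # log p_L0(t | d, r_1, r_2)
        score = (1 - lam) * log_pl0 + lam * log_ps0      # step 2: weighted rescoring
        if score > best_score:                           # step 3: keep the best sample
            best, best_score = d, score
    return best
```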

Evaluation
We evaluate our model on the reference game RG described in the introduction. In particular, we construct instances of RG using the Abstract Scenes Dataset introduced by Zitnick and Parikh (2013). Example abstract scenes are shown in Figure 1 and Figure 6. The dataset contains simple pictures constructed by humans and described in natural language. Scene representations are available both as rendered images and as feature representations containing the identity and location of each object; as noted in Section 3.1, we use this feature set to produce our referent representation f(r). This dataset was previously used for a variety of language and vision tasks (e.g. Ortiz et al. (2015), Zitnick et al. (2014)). It consists of 10,020 scenes, each annotated with up to 6 captions.
Abstract scenes are appealing for several reasons. Because high-quality features are readily available, we can avoid complications from visual model training in our evaluation. It is additionally straightforward to measure the similarity of pairs of scenes, facilitating model evaluation at varying levels of difficulty.
Some past work has avoided human evaluation for this task, instead reporting success from a synthetic listener model. In the case of derived methods, this amounts to reporting posterior likelihood. Here we instead rely on human evaluation (via Amazon Mechanical Turk). We begin by holding out a development set and a test set; each held-out set contains 1000 scenes and their accompanying descriptions. For each held-out set, we construct two sets of 200 paired (target, distractor) scenes: All, with up to four differences between paired scenes, and Hard, with exactly one difference between paired scenes. (We take the number of differences between scenes to be the number of objects that appear in one scene but not the other.) We report two evaluation metrics. Fluency is determined by showing human raters isolated sentences, and asking them to rate linguistic quality on a scale from 1-5. Accuracy is success rate at RG: as in Figure 1, humans are shown two images and a model-generated description, and asked to select the image matching the description.
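The pairing criterion is simple enough to state in code; in this sketch scenes are represented as sets of object identities, which is an assumption of the illustration rather than the dataset's actual format.

```python
def num_differences(scene_a, scene_b):
    """Number of objects appearing in one scene but not the other."""
    return len(set(scene_a) ^ set(scene_b))        # symmetric difference

def in_all_set(scene_a, scene_b):
    return num_differences(scene_a, scene_b) <= 4  # "All": up to four differences

def in_hard_set(scene_a, scene_b):
    return num_differences(scene_a, scene_b) == 1  # "Hard": exactly one difference
```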
In the remainder of this section, we measure the tradeoff between fluency and accuracy that results from different mixtures of the base models (Section 4.1), measure the number of samples needed to obtain good performance from the reasoning speaker (Section 4.2), and attempt to approximate the reasoning speaker with a monolithic "compiled" speaker (Section 4.3). In Section 4.4 we report final accuracies for our approach and baselines.

How good are the base models?
To measure the performance of the base models, we draw 10 samples d_{jk} for a subset of 100 pairs (r_{1,j}, r_{2,j}) in the Dev-All set. We collect human fluency and accuracy judgments for each of the 1000 total samples. This allows us to conduct a post-hoc search over possible values of the mixing parameter λ: for a range of λ, we compute the average accuracy and fluency of the highest-scoring sample. By varying λ, we can view the tradeoff between accuracy and fluency that results from interpolating between the listener and speaker models: setting λ = 0 gives samples from p_L0, and λ = 1 gives samples from p_S0. Figure 4 shows the resulting accuracy and fluency for various values of λ. It can be seen that relying entirely on the listener gives the highest accuracy but substantially degraded fluency. However, by adding only a very small weight to the speaker model, it is possible to achieve near-perfect fluency without a substantial decrease in accuracy. Example sentences for an individual reference game are shown in Figure 5; increasing λ causes captions to become more generic. For the remaining experiments in this paper, we take λ = 0.02, finding that this gives excellent performance on both metrics.
On the dev set, λ = 0.02 results in an average fluency of 4.8 (compared to 4.8 for the literal speaker, λ = 1). This high fluency can be confirmed by inspection of model samples (Figure 6). We thus focus on accuracy for the remainder of the evaluation.

How many samples are needed?
Next we turn to the computational efficiency of the reasoning model. As in all sampling-based inference, the number of samples that must be drawn from the proposal is of critical interest: if too many samples are needed, the model will be too slow to use in practice. Having fixed λ = 0.02 in the preceding section, we measure accuracy for versions of the reasoning model that draw 1, 10, 100, and 1000 samples. Results are shown in Table 1. We find that substantial gains continue up to 100 samples, then level off.

Table 1: Accuracy of the reasoning speaker as a function of the number of samples drawn from S0.

  # samples      1    10    100    1000
  Accuracy (%)   66   75    83     85

Is reasoning necessary?
Because they do not require complicated inference procedures, direct approaches to pragmatics typically enjoy better computational efficiency than derived approaches. Having built an accurate derived speaker, can we use it to bootstrap a more efficient direct speaker?
To explore this, we constructed a "compiled" speaker model as follows: Given reference candidates r_1 and r_2 and target t, this model produces embeddings e_1 and e_2, concatenates them into a "contrast embedding" [e_t, e_{-t}], and then feeds this whole embedding into a string decoder module. Like S0, this model generates captions without the need for discriminative rescoring; unlike S0, the contrast embedding means this model can in principle learn to produce pragmatic captions, if given access to pragmatic training data. Since no such training data exists, we train the compiled model on captions sampled from the reasoning speaker itself.
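A sketch of the contrast embedding, reusing encode_referent and D_EMB from the sketches above. The string decoder itself is omitted here; it has the same form as the describer module, except that its referent input is twice as wide (and, as described above, it would be trained on captions sampled from the reasoning speaker).

```python
import numpy as np

def contrast_embedding(f_target, f_distractor):
    """[e_t, e_-t]: the target embedding concatenated with the distractor embedding."""
    return np.concatenate([encode_referent(f_target),
                           encode_referent(f_distractor)])   # length 2 * D_EMB
```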
This model is evaluated in Table 2. While the distribution of scores is quite different from that of the base model (it improves noticeably over S0 on scenes with 2-3 differences), the overall gain is very small (the difference in mean scores is not significant). The compiled model significantly underperforms the reasoning model. While this merits further exploration, the results at least suggest either that the reasoning procedure is not easily approximated by a shallow neural network, or that example descriptions of randomly-sampled training pairs (which are usually easy to discriminate) do not provide a strong enough signal for a reflex learner to recover pragmatic behavior.
Figure 6: Four randomly-chosen samples from our model: (a) the sun is in the sky; (b) mike is wearing a chef's hat; (c) the dog is standing beside jenny; (d) the plane is flying in the sky. For each, the target image is shown on the left, the distractor image is shown on the right, and the description generated by the model is shown below. All descriptions are fluent, and generally succeed in uniquely identifying the target scene, even when they do not perfectly describe it (e.g. (c)).

Table 3: Success rates at RG on abstract scenes. "Literal" is a captioning baseline corresponding to the base speaker S0. "Contrastive" is a reimplementation of the approach of Mao et al. (2015). "Reasoning" is the reasoning model S1 from this paper. All differences between the reasoning model and baselines are significant (p < 0.001, Binomial).

Final evaluation
Based on the preceding sections, we keep λ = 0.02 and use 100 samples to generate predictions. We evaluate on the test set, comparing this Reasoning model S1 to two baselines: Literal, an image captioning model trained normally on the abstract scene captions (corresponding to our S0), and Contrastive, a model trained with a soft contrastive objective and previously used for visual referring expression generation (Mao et al., 2015). Results are shown in Table 3. Our reasoning model outperforms both the literal baseline and previous work by a substantial margin, achieving an improvement of 17% on the All pairs and 15% on the Hard pairs. Figures 6 and 7 show representative descriptions from the model.

Conclusion
We have presented an approach for learning to generate pragmatic descriptions about general referents, even without training data collected in a pragmatic context. Our approach is built from a pair of simple neural base models, a listener and a speaker, and a high-level model that reasons about their outputs in order to produce pragmatic descriptions. In an evaluation on a standard referring expression game, our model's descriptions produced correct behavior in human listeners significantly more often than existing baselines.
It is generally true of existing derived approaches to pragmatics that much of the system's behavior requires hand-engineering, and generally true of direct approaches (and neural networks in particular) that training is only possible when supervision is available for the precise target task. By synthesizing these two approaches, we address both problems, obtaining pragmatic behavior without domain knowledge and without targeted training data. We believe that this general strategy of using reasoning at inference time to obtain novel contextual behavior from neural decoding models might be much more broadly applied.

Figure 7: Descriptions of the same image in different contexts: (b-a) mike is holding a baseball bat; (b-c) the snake is slithering away from mike and jenny. When the target scene (b) is contrasted with the left (a), the system describes a bat; when the target scene is contrasted with the right (c), the system describes a snake.