Pragmatically Informative Image Captioning with Character-Level Inference

We combine a neural image captioner with a Rational Speech Acts (RSA) model to make a system that is pragmatically informative: its objective is to produce captions that are not merely true but also distinguish their inputs from similar images. Previous attempts to combine RSA with neural image captioning require an inference which normalizes over the entire set of possible utterances. This poses a serious problem of efficiency, previously solved by sampling a small subset of possible utterances. We instead solve this problem by implementing a version of RSA which operates at the level of characters (“a”, “b”, “c”, ...) during the unrolling of the caption. We find that the utterance-level effect of referential captions can be obtained with only character-level decisions. Finally, we introduce an automatic method for testing the performance of pragmatic speaker models, and show that our model outperforms a non-pragmatic baseline as well as a word-level RSA captioner.


Introduction
The success of automatic image captioning (Farhadi et al., 2010; Mitchell et al., 2012; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015) demonstrates compellingly that end-to-end statistical models can align visual information with language. However, high-quality captions are not merely true, but also pragmatically informative in the sense that they highlight salient properties and help distinguish their inputs from similar images. Captioning systems trained on single images struggle to be pragmatic in this sense, producing either very general or hyper-specific descriptions.
In this paper, we present a neural image captioning system that is a pragmatic speaker as defined by the Rational Speech Acts (RSA) model (Frank and Goodman, 2012; Goodman and Stuhlmüller, 2013). Given a set of images, of which one is the target, its objective is to generate a natural language expression which identifies the target in this context. For instance, the literal caption in Figure 1 could describe both the target and the top two distractors, whereas the pragmatic caption mentions something that is most salient in the target. Intuitively, the RSA speaker achieves this by reasoning not only about what is true but also about what it's like to be a listener in this context trying to identify the target. (The code is available at https://github.com/reubenharry/Recurrent-RSA.)
This core idea underlies much work in referring expression generation (Dale and Reiter, 1995; Monroe and Potts, 2015; Andreas and Klein, 2016; Monroe et al., 2017) and image captioning (Mao et al., 2016a; Vedantam et al., 2017), but these models do not fully confront the fact that the agents must reason about all possible utterances, which is intractable. We fully address this problem by implementing RSA at the level of characters rather than the level of utterances or words: the neural language model emits individual characters, choosing them to balance pragmatic informativeness with overall well-formedness. Thus, the agents reason not about full utterances, but rather only about all possible character choices, a very small space. The result is that the information encoded recurrently in the neural model allows us to obtain global pragmatic effects from local decisions. We show that such character-level RSA speakers are more effective than literal captioning systems at the task of helping a reader identify the target image among close competitors, and outperform word-level RSA captioners in both efficiency and accuracy.

Bayesian Pragmatics for Captioning
In applying RSA to image captioning, we think of captioning as a kind of reference game. The speaker and listener are in a shared context consisting of a set of images $W$, the speaker is privately assigned a target image $w^* \in W$, and the speaker's goal is to produce a caption that will enable the listener to identify $w^*$. $U$ is the set of possible utterances. In its simplest form, the literal speaker is a conditional distribution $S_0(u \mid w)$ assigning equal probability to all true utterances $u \in U$ and 0 to all others. The pragmatic listener $L_0$ is then defined in terms of this literal agent and a prior $P(w)$ over possible images:

$$L_0(w \mid u) \propto S_0(u \mid w) \, P(w)$$

The pragmatic speaker $S_1$ is then defined in terms of this pragmatic listener, with the addition of a rationality parameter $\alpha > 0$ governing how much it takes into account the $L_0$ distribution when choosing utterances. $P(u)$ is here taken to be a uniform distribution over $U$:

$$S_1(u \mid w) \propto L_0(w \mid u)^{\alpha} \, P(u)$$

As a result of this back-and-forth, the $S_1$ speaker is reasoning not merely about what is true, but rather about a listener reasoning about a literal speaker who reasons about truth. To illustrate, consider the pair of images 2a and 2b in Figure 2. Suppose that $U = \{\text{bus}, \text{red bus}\}$, that the prior over the two images is uniform, and that $\alpha = 1$. Then the literal speaker $S_0$ is equally likely to produce bus and red bus when the left image 2a is the target. However, $L_0$ breaks this symmetry; because red bus is false of the bus on the right, $L_0(\text{2a} \mid \text{bus}) = \frac{1}{3}$ and $L_0(\text{2b} \mid \text{bus}) = \frac{2}{3}$, whereas $L_0(\text{2a} \mid \text{red bus}) = 1$. The $S_1$ speaker therefore ends up favoring red bus when trying to convey 2a, so that $S_1(\text{red bus} \mid \text{2a}) = \frac{3}{4}$ and $S_1(\text{bus} \mid \text{2a}) = \frac{1}{4}$.

Figure 2: Captions for the target image (in green).
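The bus example can be checked by direct computation. The following is a minimal sketch, assuming a uniform prior over the two images and $\alpha = 1$ (the setting under which the fractions quoted above come out exactly); the names `TRUE_OF`, `literal_speaker`, `listener` and `pragmatic_speaker` are illustrative and not taken from the released code.

```python
from fractions import Fraction

# Toy truth conditions: which utterances are true of which image.
TRUE_OF = {"2a": {"bus", "red bus"}, "2b": {"bus"}}
UTTERANCES = ["bus", "red bus"]
IMAGES = ["2a", "2b"]


def normalize(scores):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_speaker(w):
    """S0(u|w): uniform over the utterances true of w, zero elsewhere."""
    return normalize({u: Fraction(int(u in TRUE_OF[w])) for u in UTTERANCES})


def listener(u):
    """L0(w|u) proportional to S0(u|w) * P(w), with uniform P(w)."""
    prior = Fraction(1, len(IMAGES))
    return normalize({w: literal_speaker(w)[u] * prior for w in IMAGES})


def pragmatic_speaker(w, alpha=1):
    """S1(u|w) proportional to L0(w|u)^alpha * P(u), with uniform P(u)."""
    prior = Fraction(1, len(UTTERANCES))
    return normalize({u: listener(u)[w] ** alpha * prior for u in UTTERANCES})


print(listener("bus"))          # L0(2a|bus) = 1/3, L0(2b|bus) = 2/3
print(pragmatic_speaker("2a"))  # S1(bus|2a) = 1/4, S1(red bus|2a) = 3/4
```

Raising `alpha` above 1 sharpens the preference for red bus further, which is the role the rationality parameter plays in the definition of $S_1$.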

Applying Bayesian Pragmatics to a Neural Semantics
To apply the RSA model to image captioning, we first train a neural model with a CNN-RNN architecture (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015). The trained model can be considered an $S_0$-style distribution $P(\text{caption} \mid \text{image})$ on top of which further listeners and speakers can be built.
(Unlike the idealized $S_0$ described above, a neural $S_0$ will assign some probability to untrue utterances.) The main challenge for this application is that the space of utterances (captions) $U$ will be very large for any suitable captioning system, making the calculation of $S_1$ intractable due to its normalization over all utterances. The question, therefore, is how best to approximate this inference. The solution employed by Monroe et al. (2017) and Andreas and Klein (2016) is to sample a small subset of probable utterances from the $S_0$, as an approximate prior upon which exact inference can be performed. While tractable, this approach has the shortcoming of only considering a small part of the true prior, which potentially decreases the extent to which pragmatic reasoning will be able to apply. In particular, if a useful caption never appears in the sampled prior, it cannot appear in the posterior.
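The limitation of the sampling approach can be illustrated with a small sketch. It is not the implementation of Monroe et al. (2017) or Andreas and Klein (2016); the function `s1_over_support`, the candidate captions and the listener probabilities are all hypothetical values chosen for illustration.

```python
def s1_over_support(support, listener_prob, alpha=1.0):
    """Exact S1 restricted to a sampled support, with a uniform prior over it."""
    weights = {u: listener_prob[u] ** alpha for u in support}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


# Hypothetical L0(target | caption) values for three candidate captions.
listener_prob = {"bus": 1 / 3, "a bus": 1 / 3, "red bus": 1.0}

# Suppose sampling from S0 returned only its two most probable captions.
sampled_support = ["bus", "a bus"]

posterior = s1_over_support(sampled_support, listener_prob)
print(posterior)
print("red bus" in posterior)  # the most informative caption is unreachable
```

However large $\alpha$ is made, the posterior can only redistribute mass among the sampled captions, so the caption that the listener would find most informative is never produced.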

Step-Wise Inference
Inspired by the success of the "emitter-suppressor" method of Vedantam et al. (2017), we propose an incremental version of RSA. Rather than performing a single inference over utterances, we perform an inference for each step of the unrolling of the utterance.
We use a character-level LSTM, which defines a distribution over characters $P(u \mid pc, \text{image})$, where $pc$ ("partial caption") is a string of characters constituting the caption so far and $u$ is the next character of the caption. This is now our $S_0$: given a partially generated caption and an image, it returns a distribution over which character should next be added to the caption. The advantage of using a character-level LSTM over a word-level one is that $U$ is much smaller for the former (≈30 vs. ≈20,000), making the ensuing RSA model much more efficient.
We use this $S_0$ to define an $L_0$ which takes a partial caption and a new character, and returns a distribution over images. The $S_1$, in turn, given a target image $w^*$, performs an inference over the set of possible characters to determine which is best with respect to the listener choosing $w^*$.
At timestep $t$ of the unrolling, the listener $L_0$ takes as its prior over images the $L_0$ posterior from timestep $t - 1$. The idea is that as we proceed with the unrolling, the $L_0$ priors on which image is being referred to may change, which in turn should affect the speaker's actions. For instance, once the speaker has made the listener strongly favor the target image, it is less compelled to continue being pragmatic.

Model Definition
In our incremental RSA, speaker models take both a target image and a partial caption $pc$. Thus, $S_0$ is a neurally trained conditional distribution $S_0^t(u \mid w, pc_t)$, where $t$ is the current timestep of the unrolling and $u$ is a character.
We define $L_0^t$ in terms of $S_0^t$ as follows, where $ip$ is a distribution over images representing the $L_0$ prior:

$$L_0^t(w \mid u, ip_t, pc_t) \propto S_0^t(u \mid w, pc_t) \, ip_t(w)$$

Given an $S_0^t$ and $L_0^t$, we define $S_1^t$ and $L_1^t$ as:

$$S_1^t(u \mid w, ip_t, pc_t) \propto S_0^t(u \mid w, pc_t) \, L_0^t(w \mid u, ip_t, pc_t)^{\alpha}$$

$$L_1^t(w \mid u, ip_t, pc_t) \propto L_0^t(w \mid u, ip_t, pc_t) \, S_1^t(u \mid w, ip_t, pc_t)$$

Unrolling. To perform greedy unrolling (though in practice we use a beam search) for either $S_0$ or $S_1$, we initialize the state as a partial caption $pc_0$ consisting of only the start token and a uniform prior over the images $ip_0$. Then, for $t > 0$, we use our incremental speaker model $S_0$ or $S_1$ to generate a distribution over the subsequent character $S^t(u \mid w, ip_t, pc_t)$, and add the character $u$ with highest probability to $pc_t$, giving us $pc_{t+1}$. We then run our listener model $L_1$ on $u$, to obtain a distribution $ip_{t+1} = L_1^t(w \mid u, ip_t, pc_t)$ over images that the $L_0$ can use at the next timestep.
This incremental approach keeps the inference itself very simple, while placing the complexity of the model in the recurrent nature of the unrolling. While our $S_0$ is character-level, the same incremental RSA model works for a word-level $S_0$, giving rise to a word-level $S_1$. We compare character and word $S_1$s in section 4.2.
As well as being incremental, these definitions of $S_1^t$ and $L_1^t$ differ from the typical RSA described in section 2 in that $S_1^t$ and $L_1^t$ draw their priors from $S_0^t$ and $L_0^t$ respectively. This generalizes the scheme put forward for $S_1$ by Andreas and Klein (2016). The motivation is to have Bayesian speakers who are somewhat constrained by the $S_0$ language model. Without this, other methods are needed to achieve English-like captions, as in Vedantam et al. (2017), where their equivalent of the $S_1$ is combined in a weighted sum with the $S_0$.
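The incremental definitions and the greedy unrolling described in this section can be sketched on a toy problem. This is an illustrative sketch rather than the released implementation: the trained character-level LSTM is replaced by a hypothetical lookup table `CAPTIONS` of per-image caption probabilities, the start token is represented by the empty string, `"$"` is an assumed end-of-caption marker, and greedy decoding is used instead of beam search. The toy table gives every caption nonzero probability under every image, which keeps all normalizations well defined.

```python
END = "$"    # assumed end-of-caption marker
ALPHA = 5.0  # rationality parameter

# Hypothetical stand-in for a trained S0: caption probabilities per image.
CAPTIONS = {
    "2a": {"bus": 0.6, "red bus": 0.4},
    "2b": {"bus": 0.9, "red bus": 0.1},
}


def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def s0(image, pc):
    """S0^t(u | w, pc_t): next-character distribution for an image and prefix."""
    scores = {}
    for caption, p in CAPTIONS[image].items():
        full = caption + END
        if full.startswith(pc) and len(full) > len(pc):
            ch = full[len(pc)]
            scores[ch] = scores.get(ch, 0.0) + p
    return normalize(scores)


def l0(u, ip, pc):
    """L0^t(w | u, ip_t, pc_t) proportional to S0^t(u | w, pc_t) * ip_t(w)."""
    return normalize({w: s0(w, pc).get(u, 0.0) * ip[w] for w in ip})


def s1(target, ip, pc):
    """S1^t(u | w, ...) proportional to S0^t(u | w, pc_t) * L0^t(w | u, ...)^alpha."""
    return normalize({u: p * l0(u, ip, pc)[target] ** ALPHA
                      for u, p in s0(target, pc).items()})


def l1(u, ip, pc):
    """L1^t(w | u, ...) proportional to L0^t(w | u, ...) * S1^t(u | w, ...)."""
    listener = l0(u, ip, pc)
    return normalize({w: listener[w] * s1(w, ip, pc).get(u, 0.0) for w in ip})


def greedy_unroll(target, speaker, max_len=20):
    pc = ""                                          # pc_0: start token only
    ip = {w: 1.0 / len(CAPTIONS) for w in CAPTIONS}  # ip_0: uniform prior
    for _ in range(max_len):
        dist = speaker(target, ip, pc)
        u = max(dist, key=dist.get)  # most probable next character
        ip = l1(u, ip, pc)           # ip_{t+1} = L1^t(w | u, ip_t, pc_t)
        pc += u
        if u == END:
            break
    return pc.rstrip(END)


literal = greedy_unroll("2a", lambda w, ip, pc: s0(w, pc))
pragmatic = greedy_unroll("2a", s1)
print(literal, "|", pragmatic)
```

In this toy setting the only pragmatic decision is the first character: once the speaker has chosen "r", the remaining characters are forced, which mirrors the point that local character choices can produce an utterance-level effect.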

Evaluation
Qualitatively, Figures 1 and 2 show how the $S_1$ captions are more informative than the $S_0$ captions, as a result of pragmatic considerations. To demonstrate the effectiveness of our method quantitatively, we implement an automatic evaluation.

Automatic Evaluation
To evaluate the success of $S_1$ as compared to $S_0$, we define a listener $L_{eval}(\text{image} \mid \text{caption}) \propto P_{S_0}(\text{caption} \mid \text{image})$, where $P_{S_0}(\text{caption} \mid \text{image})$ is the total probability of $S_0$ incrementally generating caption given image. In other words, $L_{eval}$ uses Bayes' rule to obtain from $S_0$ the posterior probability of each image $w$ given a full caption $u$.
The neural $S_0$ used in the definition of $L_{eval}$ must be trained on separate data from the neural $S_0$ used for the $S_1$ model which produces captions, since otherwise this $S_1$ production model effectively has access to the system evaluating it. As Mao et al. (2016b) note, "a model might 'communicate' better with itself using its own language than with others". In evaluation, we therefore split the training data in half, with one part for training the $S_0$ used in the caption generation model $S_1$ and one part for training the $S_0$ used in the caption evaluation model $L_{eval}$.
We say that the caption succeeds as a referring expression if the target has more probability mass under the distribution $L_{eval}(\text{image} \mid \text{caption})$ than any distractor.
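The success criterion can be sketched as follows. This is a hedged illustration, assuming a uniform prior over the images in a cluster and assuming that the evaluation $S_0$ exposes the probability it assigns to each character of the caption; the function names and the per-character probabilities below are hypothetical.

```python
import math


def caption_log_prob(step_probs):
    """log P_S0(caption | image): sum of per-character log-probabilities."""
    return sum(math.log(p) for p in step_probs)


def l_eval(log_probs):
    """Posterior over images given a caption, assuming a uniform image prior."""
    m = max(log_probs.values())  # subtract the maximum for numerical stability
    weights = {w: math.exp(lp - m) for w, lp in log_probs.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def succeeds(log_probs, target):
    """True if the target has more posterior mass than every distractor."""
    posterior = l_eval(log_probs)
    return all(posterior[target] > p
               for w, p in posterior.items() if w != target)


# Hypothetical per-character probabilities of one caption under three images.
log_probs = {
    "target": caption_log_prob([0.9, 0.8, 0.9]),
    "distractor_1": caption_log_prob([0.9, 0.2, 0.5]),
    "distractor_2": caption_log_prob([0.3, 0.4, 0.6]),
}
print(l_eval(log_probs))
print(succeeds(log_probs, "target"))
```

Working in log space matters in practice because the product of many per-character probabilities underflows quickly for long captions.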
Dataset. We train our production and evaluation models on separate sets consisting of regions in the Visual Genome dataset (Krishna et al., 2017) and full images in MSCOCO (Chen et al., 2015). Both datasets consist of over 100,000 images of common objects and scenes. MSCOCO provides captions for whole images, while Visual Genome provides captions for regions within images.
Our test sets consist of clusters of 10 images. For a given cluster, we set each image in it as the target, in turn. We use two test sets. Test set 1 (TS1) consists of 100 clusters of images, 10 for each of the 10 most common objects in Visual Genome. 3 Test set 2 (TS2) consists of regions in Visual Genome images whose ground truth captions have high word overlap, an indicator that they are similar. We again select 100 clusters of 10. Both test sets have 1,000 items in total (10 potential target images for each of 100 clusters).
Captioning System. Our neural image captioning system is a CNN-RNN architecture adapted to use a character-based LSTM for the language model.

Hyperparameters
We use a beam search with width 10 to produce captions, and a rationality parameter of $\alpha = 5.0$ for the $S_1$.

Results
As shown in Table 1, the character-level $S_1$ obtains higher accuracy (68% on TS1 and 65.9% on TS2) than the $S_0$ (48.9% on TS1 and 47.5% on TS2), demonstrating that $S_1$ is better than $S_0$ at referring.
Advantage of Incremental RSA. We also observe that in 66% of the cases in which the $S_1$ caption is referentially successful and the $S_0$ caption is not, for a given image, the $S_1$ caption is not one of the top 50 $S_0$ captions, as generated by the beam search unrolling at $S_0$. This means that in these cases the non-incremental RSA method of Andreas and Klein (2016) could not have generated the $S_1$ caption, if these top 50 $S_0$ captions were the support of the prior over utterances.
3 Namely, man, person, woman, building, sign,
Comparison to Word-Level RSA. We compare the performance of our character-level model to a word-level model. This model is incremental in precisely the way defined in section 3.2, but uses a word-level LSTM so that $u \in U$ are words and $U$ is a vocabulary of English. It is evaluated with an $L_{eval}$ model that also operates on the word level. Though the word $S_0$ performs better on both test sets than the character $S_0$, the character $S_1$ outperforms the word $S_1$, demonstrating the advantage of a character-level model for pragmatic behavior. We conjecture that the superiority of the character-level model is the result of the increased number of decisions where pragmatics can be taken into account, but leave further examination for future research.
Variants of the Model. We further explore the effect of two design decisions in the character-level model. First, we consider a variant of $S_1$ which has a prior over utterances determined by an LSTM language model trained on the full set of captions. This achieves an accuracy of 67.2% on TS1. Second, we consider our standard $S_1$ but with unrolling such that the $L_0$ prior is drawn uniformly at each timestep rather than determined by the $L_0$ posterior at the previous step. This achieves an accuracy of 67.4% on TS1. This suggests that neither the change of the $S_1$ prior nor the change of the $L_0$ prior has a large effect on the performance of the model.

Conclusion
We show that incremental RSA at the level of characters improves the ability of the neural image captioner to refer to a target image. The incremental approach is key to combining RSA with language models: as utterances become longer, it becomes exponentially slower, for a fixed n, to subsample n% of the utterance distribution and then perform inference (non-incremental approach). Furthermore, character-level RSA yields better results than word-level RSA and is far more efficient.