Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories

Zero-shot learning in Language & Vision is the task of correctly labelling (or naming) objects of novel categories. Another strand of work in L&V aims at pragmatically informative rather than “correct” object descriptions, e.g. in reference games. We combine these lines of research and model zero-shot reference games, where a speaker needs to successfully refer to a novel object in an image. Inspired by models of “rational speech acts”, we extend a neural generator to become a pragmatic speaker reasoning about uncertain object categories. As a result of this reasoning, the generator produces fewer nouns and names of distractor categories as compared to a literal speaker. We show that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.


Introduction
It is commonly agreed that even massive resources for language & vision (Deng et al., 2009; Chen et al., 2015; Krishna et al., 2017) will never fully cover the huge range of objects to be found "in the wild". This motivates research in zero-shot learning (Lampert et al., 2009; Socher et al., 2013; Hendricks et al., 2016), which aims at predicting correct labels or names for objects of novel categories, typically via external lexical knowledge such as word embeddings.
More generally, however, uncertain knowledge of the world that surrounds us, including novel objects, is not only a machine learning challenge: it is simply a very common aspect of human communication, as speakers rarely have perfect representations of their environment. Precisely the richness of verbal interaction allows us to communicate these uncertainties and to collaborate towards communicative success (Clark and Wilkes-Gibbs, 1986). Figure 1 shows examples from RefCOCO (Yu et al., 2016): descriptions of visual objects produced in an interactive reference game.
Here, the use of the unspecific thingy and the omission of a noun in left blue can be seen as pragmatically plausible strategies that avoid confusing the listener with potentially inaccurate names for difficult-to-name objects. While there has been much traditional and recent research on pragmatically informative object descriptions in reference games (Dale and Reiter, 1995; Frank and Goodman, 2012; Mao et al., 2016; Yu et al., 2017; Cohn-Gordon et al., 2018), conversational strategies for dealing with uncertainties like novel categories are largely understudied in computational pragmatics, though see, e.g., work by Fang et al. (2014).
In this paper, we frame zero-shot learning as a challenge for pragmatic modeling and explore zero-shot reference games, where a speaker needs to describe a novel-category object in an image to an addressee who may or may not know the category. In contrast to standard reference games, this game explicitly targets a situation where relatively common words like object names are likely to be less accurate than other words such as attributes. We hypothesize that Bayesian reasoning in the style of Rational Speech Acts, RSA (Frank and Goodman, 2012), can extend a neural generation model trained to refer to objects of known categories towards zero-shot learning. We implement a Bayesian decoder reasoning about categorical uncertainty and show that, solely as a result of pragmatic decoding, our model produces fewer misleading object names when it is uncertain about the category (just as the speakers did in Figure 1). Furthermore, we show that this strategy often improves reference resolution accuracies of an automatic listener.

Background
We investigate referring expression generation (REG henceforth), where the goal is to compute an utterance u that identifies a target referent r among other referents R in a visual scene. Research on REG has a long tradition in natural language generation (Krahmer and Van Deemter, 2012), and has recently been re-discovered in the area of Language & Vision (Mao et al., 2016; Yu et al., 2016; Zarrieß and Schlangen, 2018). These latter models for REG essentially implement variants of a standard neural image captioning architecture (Vinyals et al., 2015), combining a CNN and an LSTM to generate an utterance directly from objects marked via bounding boxes in real-world images.
Our approach combines such a neural REG model with a reasoning component that is inspired by theory-driven Bayesian pragmatics and RSA (Frank and Goodman, 2012). We will briefly sketch this approach here. The starting point in RSA is a model of a "literal speaker", S_0(u|r), which generates utterances u for the target r. The "pragmatic listener" L_0 then assigns probabilities to all referents R based on the model S_0:

L_0(r|u) = S_0(u|r) · P(r) / Σ_{r' ∈ R} S_0(u|r') · P(r')    (1)

In turn, the "pragmatic speaker" S_1 reasons about which utterance is more discriminative and will be resolved to the target by the pragmatic listener:

S_1(u|r) = L_0(r|u) / Σ_{u' ∈ U} L_0(r|u')    (2)

(S_0 and L_0 are components of the recursive reasoning of S_1 and not in fact separate agents.) There has been some previous work on leveraging RSA-like reasoning for neural language generation. For instance, Cohn-Gordon et al. (2018) implement the literal speaker as a neural captioning model trained on non-discriminative image descriptions. On top of this neural semantics, they build a pragmatic speaker that produces more discriminative captions, applying Equation 2 at each step of the inference process. They evaluate their model in a reference game where an automatic listener (trained on a different portion of the image data) is used to test whether the generated caption singles out the target image among a range of distractor images. A range of related articles have extended neural captioning models with decoding procedures geared towards vocabulary expansion (Anderson et al., 2017; Agrawal et al., 2018) or contextually discriminative scene descriptions (Andreas and Klein, 2016; Vedantam et al., 2017).
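As an illustration, the two RSA equations can be put to work on a toy reference game. All numbers, referent ids and utterances below are invented for illustration; they are not from the paper or its data.

```python
# Toy RSA reference game: 2 referents, 3 utterances.
# S0[u][r] = literal speaker probability of utterance u for referent r
# (illustrative numbers only).
S0 = {
    "the bus":      {"r1": 0.6, "r2": 0.5},
    "the red bus":  {"r1": 0.3, "r2": 0.1},
    "the left one": {"r1": 0.1, "r2": 0.4},
}
referents = ["r1", "r2"]
utterances = list(S0)

def L0(r, u):
    """Pragmatic listener: L0(r|u) proportional to S0(u|r) * P(r),
    with a uniform prior P(r) over referents."""
    scores = {ref: S0[u][ref] * (1.0 / len(referents)) for ref in referents}
    return scores[r] / sum(scores.values())

def S1(u, r):
    """Pragmatic speaker: S1(u|r) proportional to L0(r|u),
    normalised over utterances."""
    scores = {utt: L0(r, utt) for utt in utterances}
    return scores[u] / sum(scores.values())

# "the red bus" is more discriminative for r1 than the ambiguous "the bus":
assert L0("r1", "the red bus") > L0("r1", "the bus")
print(max(utterances, key=lambda u: S1(u, "r1")))  # -> the red bus
```

The ambiguous "the bus" fits both referents, so the listener splits its belief; the pragmatic speaker therefore prefers the utterance that the listener resolves to the target most reliably.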
Previous work on REG commonly looks at visual scenes with multiple referents of identical or similar categories. Here, speakers typically produce expressions composed of a head noun, which names the category of the target, and a set of attributes, which distinguish the target from distractor referents of the same category (Krahmer and Van Deemter, 2012). Our work adds an additional dimension of uncertainty to this picture, namely a setting where the category of the target itself might not be known to the model and, hence, cannot be named with reasonable accuracy. In this setting, we expect that a literal speaker (e.g. a neural REG model trained on a restricted set of object categories) generates misleading references, e.g. containing incorrect head nouns, as it has no means of "knowing" which words risk being inaccurate for referring to novel objects. The following Section 3 describes how we modify the RSA approach for reasoning in such a zero-shot reference game.

Model
Inspired by the approach in Section 2, we model our pragmatic zero-shot speaker as a neural generator (the literal speaker) that is decoded via a pragmatic listener. In contrast to the listener in Equation (1), however, our listener possesses an additional latent variable C, which reflects its beliefs about the target's category. This hidden belief distribution will, in turn, allow the pragmatic speaker to reason about how accurate the words produced by the literal speaker might be.
Our Bayesian listener will assign a probability P(r|u) to a referent r conditioned on the utterance u by the (literal) speaker. To do that, it needs to calculate P(u|r), as in Equation 1. While previous work on RSA typically equates P(u|r) with S_0(u|r), we are going to modify the way this probability is calculated. Thus, we assume that our listener has hidden beliefs about the category of the referent, which we can marginalize over as follows:

P(u|r) = Σ_{c_i ∈ C} P(u|c_i, r) · P(c_i|r)    (3)

As a simplification, we condition u only on c_i, i.e. we replace P(u|c_i, r) with P(u|c_i). This will allow us to estimate P(u|c_i) directly via maximum likelihood on the training data, i.e. in terms of word probabilities conditioned on categories (observed in training). The pragmatic listener is defined as follows:

L_0(r|u) ∝ [Σ_{c_i ∈ C} P(u|c_i) · P(c_i|r)] · P(r)    (4)

For instance, let us consider a game with 3 categories and two words: the less specific left with P(u|c_i) = 1/2 for all c_i ∈ C, and the more specific bus with P(u|c_1) = 9/10, P(u|c_2) = 1/10, P(u|c_3) = 1/10. When the listener is uncertain and predicts P(c_i|r) = 1/3 for all c_i ∈ C, this yields L_0(r|left) = 0.5 and L_0(r|bus) = 0.36, meaning that the less specific left will be more likely resolved to the target r. Vice versa, when the listener is more certain, e.g. P(c_1|r) = 9/10, P(c_2|r) = 1/10, P(c_3|r) = 1/10, more specific words will be preferred: L_0(r|bus) = 0.83 and L_0(r|left) = 0.55.
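The running example can be checked in a few lines of code. The sketch below only computes the marginal Σ_i P(u|c_i) · P(c_i|r), i.e. the listener scores before normalisation over referents, which is how the numbers are reported in the text (up to rounding).

```python
# Category-uncertain listener from the running example: 3 categories, two
# words. P_u_given_c[w][i] = P(w | c_i), as given in the text.
P_u_given_c = {
    "left": [1/2, 1/2, 1/2],
    "bus":  [9/10, 1/10, 1/10],
}

def marginal(word, P_c_given_r):
    """Marginal word probability: sum_i P(word | c_i) * P(c_i | r)."""
    return sum(p_uc * p_cr
               for p_uc, p_cr in zip(P_u_given_c[word], P_c_given_r))

uncertain = [1/3, 1/3, 1/3]        # maximally uncertain about the category
certain   = [9/10, 1/10, 1/10]     # fairly sure the referent is category 1

print(round(marginal("left", uncertain), 2))  # 0.5  -> unspecific word wins
print(round(marginal("bus", uncertain), 2))   # ~0.37
print(round(marginal("bus", certain), 2))     # 0.83 -> specific word wins
print(round(marginal("left", certain), 2))    # 0.55
```

Under category uncertainty the unspecific left outscores the risky bus; once the listener is confident about the category, the ranking flips.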
The definition of the pragmatic speaker is straightforward:

S_1(u|r) ∝ S_0(u|r)^β · L_0(r|u)^α    (5)

Intuitively, S_1 guides its potentially overoptimistic language model (S_0) to be more cautious in producing category-specific words, e.g. nouns. The idea is that the degree to which a word is category-specific and, hence, risky in a zero-shot reference game can be determined on descriptions for objects of known categories, and is expressed in P(u|c). For unknown categories, the pragmatic speaker can deliberately avoid these category-specific words and resort to describing other visual properties like colour or location. Similar to Cohn-Gordon et al. (2018), we use incremental, word-level inference to decode the pragmatic speaker model in a greedy fashion: at each time step, we generate the most likely word determined via S_0 and L_0. The parameters α and β determine the balance between the literal speaker and the listener. While α is simply a constant (set to 2, in our case), β is zero as long as w does not occur in u_{t−1} and increases when it does occur in u_{t−1} (it is then set to 2). This ensures that there is a dynamic tradeoff between the speaker and the listener, i.e. for words that occur in the previously generated utterance prefix, the language model probabilities (S_0) will have comparatively more weight than for new words.
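The greedy decoding loop can be sketched as follows. The probability tables are toy stand-ins for the trained neural S_0 and the category-based L_0 (all numbers and words are invented); only the scoring scheme, with α = 2 and the prefix-dependent β, follows the description above.

```python
ALPHA, BETA_REPEAT = 2.0, 2.0
VOCAB = ["the", "left", "one", "bus", "<end>"]

# Toy listener scores L0(r|w) for the current target under category
# uncertainty: the specific name "bus" scores low, unspecific words high.
L0_SCORE = {"the": 0.5, "left": 0.7, "one": 0.6, "bus": 0.3, "<end>": 0.4}

def s0(word, prefix):
    # Hypothetical language model: low probability for repeating a word.
    return 0.05 if word in prefix else 0.4

def decode(max_len=5):
    """Greedy pragmatic decoding: pick argmax_w s0(w|prefix)^beta *
    L0(r|w)^alpha, where beta = BETA_REPEAT if w already occurs in the
    prefix and 0 otherwise (so new words are scored by the listener)."""
    prefix = []
    for _ in range(max_len):
        def score(w):
            beta = BETA_REPEAT if w in prefix else 0.0
            return s0(w, prefix) ** beta * L0_SCORE[w] ** ALPHA
        word = max(VOCAB, key=score)
        if word == "<end>":
            break
        prefix.append(word)
    return prefix

assert "bus" not in decode()  # the risky category name is avoided
```

Because already-used words are penalised through the language model term, the decoder does not loop on the listener's favourite word, while the listener term steers it away from category-specific vocabulary.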

Exp. 1: Referring without naming?
Section 3 has introduced a model for referring expression generation (REG) in a zero-shot reference game. This model, and its pragmatic decoding component in particular, is designed to avoid category-specific words when there is uncertainty about the category of a target object, in favour of category-neutral words such as colour or location attributes. In the following evaluation, we test how this reasoning component actually affects the referring behavior of the pragmatic speaker as compared to the literal speaker, which we implement as a supervised neural REG model along the lines of previous work (Mao et al., 2016; Yu et al., 2016). As object names typically express category-specific information in referring expressions, we focus the comparison on the nouns generated in the systems' output.

Training
Data We conduct experiments on RefCOCO (Yu et al., 2016), referring expressions for objects in MSCOCO (Lin et al., 2014) images. As is commonly done in zero-shot learning, we manually select a range of different categories as targets for our zero-shot game, cf. Hendricks et al. (2016). Out of the 90 categories in MSCOCO, we select 6 medium-frequent categories (cat, horse, cup, bottle, bus, train) that are similar to those in Hendricks et al. (2016). For each category, we divide the training set of RefCOCO into a new train-test split such that all images with an instance of the target zero-shot category are moved to the test set.
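The per-category split can be sketched as follows; the data structures (images carrying a list of object category names) are hypothetical simplifications of the actual RefCOCO/MSCOCO annotation format.

```python
# For each held-out category, every image containing an instance of that
# category is moved from the training set to the test set.
def zero_shot_split(images, category):
    """images: list of dicts with an 'objects' field of category names."""
    train = [img for img in images if category not in img["objects"]]
    test = [img for img in images if category in img["objects"]]
    return train, test

# Tiny invented example:
imgs = [{"id": 1, "objects": ["dog", "person"]},
        {"id": 2, "objects": ["bus", "person"]},
        {"id": 3, "objects": ["bus", "dog"]}]
train, test = zero_shot_split(imgs, "bus")
assert [i["id"] for i in train] == [1]
assert [i["id"] for i in test] == [2, 3]
```

Splitting at the image level (rather than the object level) guarantees that the model never sees the zero-shot category during training, not even as an undescribed bystander object.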
Generation Model (S_0) We implement a standard CNN-LSTM model for REG, trained on pairs of image regions and referring expressions. The architecture follows the baseline version of Yu et al. (2016). We crop images to the target region and obtain the fc features from VGG (Simonyan and Zisserman, 2014). We set the word embedding layer size to 512 and the hidden state to 1024. We optimize with Adam, set the batch size to 32 and the learning rate to 0.0004. The number of training epochs is 5 (verified on the RefCOCO validation set).
Uncertainty Estimation Similar to previous work in zero-shot learning, we factor out the problem of automatically determining the model's certainty with respect to an object's category, cf. (Lampert et al., 2009; Socher et al., 2013): for computing L_0(r|u), we set P(c_i|r) to a uniform distribution over categories, meaning that the model is maximally uncertain about the referent's category. We leave the exploration of more realistic uncertainty or novelty prediction to future work.

Evaluation
Measures We test to what extent our models produce incorrect names for novel objects. First, for each zero-shot category, we define a set of distractor nouns (distr-noun), which correspond to the names of the remaining categories in MSCOCO. Any choice of noun from that set would be wrong, as the categories are pairwise disjoint; the exploration of other nouns (e.g. thingy, animal) is left for future work. In Table 1, "% distr-noun" refers to how many expressions generated for an instance of a zero-shot category contain such an incorrect distractor noun. Second, we count how many generated expressions do not contain any noun at all (no-noun), according to the NLTK POS tagger.
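The two measures can be sketched as follows. This is a simplification: a small invented noun list stands in for the full set of MSCOCO category names, and string matching stands in for the NLTK POS tagging used in the paper.

```python
# Toy stand-ins for the actual noun inventories (hypothetical words):
DISTRACTOR_NOUNS = {"dog", "sheep", "car", "truck", "chair"}
ALL_NOUNS = DISTRACTOR_NOUNS | {"bus", "cat"}

def distr_noun_rate(expressions):
    """% of expressions containing a wrong (distractor-category) noun."""
    hits = sum(any(w in DISTRACTOR_NOUNS for w in e.split())
               for e in expressions)
    return hits / len(expressions)

def no_noun_rate(expressions):
    """% of expressions containing no known noun at all (crude stand-in
    for POS tagging)."""
    misses = sum(not any(w in ALL_NOUNS for w in e.split())
                 for e in expressions)
    return misses / len(expressions)

exprs = ["the brown dog", "left one", "red thing on the right"]
assert distr_noun_rate(exprs) == 1/3   # only "the brown dog" misnames
assert no_noun_rate(exprs) == 2/3      # two expressions avoid nouns
```

A POS tagger is preferable in practice because referring expressions contain many nouns (thingy, animal, corner) that are not category names.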
Results Table 1 shows that the proportion of output expressions containing a distractor noun decreases markedly from S_0 to S_1, whereas the proportion of expressions without any noun increases markedly from S_0 to S_1. First of all, this suggests that our baseline model S_0 in many cases does not know what it does not know, i.e. it is not aware that it encounters a novel category and frequently generates names of known categories encountered during training. However, even for this simple model, we find a certain portion of output expressions that do not contain any name (e.g. 27% for bottle, but only 6% for bus). The results also confirm our hypothesis that the pragmatic speaker S_1 avoids producing "risky", category-specific words that are likely to be inaccurate for uncertain or unknown categories. It is worth stressing here that this behaviour results entirely from the Bayesian reasoning that S_1 uses in decoding; the model does not have explicit knowledge of linguistic categories like nouns, names or other taxonomic knowledge.

Figure 2: Qualitative Example

Exp. 2: Communicative success
The experiment in Section 4 found that the pragmatic speaker uses less category-specific vocabulary when referring to objects of novel categories as compared to a literal speaker. Now, we need to establish whether the resulting utterances still achieve communicative success in the zero-shot reference game, despite using less specific vocabulary (as shown above). We test this automatically using a model of a "competent" listener that knows the respective object categories. This is supposed to approximate a conversation between a system and a human who has more elaborate knowledge of the world than the system.
The evaluation listener One pitfall of using a trained listener model (instead of a human) for task-oriented evaluation is that this model might simply make the same mistakes as the speaker model, as it is trained on similar data. To avoid this circularity, Cohn-Gordon et al. (2018) train their listener on a different subset of the image data. Rather than training on different data, we opt for training the listener on better data, as we want it to be as strict and human-like as possible. For instance, we do not want our listener model to resolve an expression like the brown cat to a dog. We train S_eval as a neural speaker on the entire training set and give L_eval access to ground-truth object categories. The ground-truth category c_r of a referent r is used to calculate P(n_u|c_r), where n_u is the object name contained in the utterance u. P(n_u|c_r) is estimated on the entire training set.
L_eval(r|u, c_r) = S_eval(u|r) · P(n_u|c_r)    (6)

P(n_u|c_r) will be close to zero if the utterance contains a rare or wrong name for the category c_r, and L_eval will then assign a very low probability to this referent. We apply this listener to all referents in the scene and take the argmax.
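The evaluation listener can be sketched as follows; the probability table and scores are invented toy values, not estimates from the RefCOCO training set.

```python
# Hypothetical estimates of P(name | gold category):
P_NAME_GIVEN_CAT = {
    ("cat", "cat"): 0.8, ("cat", "dog"): 0.01,
    ("dog", "dog"): 0.8, ("dog", "cat"): 0.01,
}

def l_eval(utterance_name, referents, s_eval_scores, gold_cats):
    """Rescale S_eval(u|r) by P(n_u | c_r) and resolve to the argmax
    referent; unseen (name, category) pairs get a tiny floor value."""
    def score(r):
        return s_eval_scores[r] * P_NAME_GIVEN_CAT.get(
            (utterance_name, gold_cats[r]), 1e-6)
    return max(referents, key=score)

# "the brown cat" should not resolve to the dog, even if S_eval likes it:
winner = l_eval("cat", ["r1", "r2"],
                {"r1": 0.3, "r2": 0.6},   # speaker prefers r2 (the dog)
                {"r1": "cat", "r2": "dog"})
assert winner == "r1"
```

The category factor acts as a hard filter: a referent whose gold category rarely co-occurs with the uttered name is effectively ruled out, which makes the listener as strict about wrong names as a human addressee would be.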
Test set The set TS-image pairs each target with the other (annotated!) objects in the same image, a typical set-up for reference resolution. As many images in RefCOCO only have distractors of the same category as the target (which is not ideal for our purposes), we randomly sample an additional test set called TS-distractors, pairing zero-shot targets with 4 distractors of a similar category, which we defined manually, shown in Table 2. This is slightly artificial, as objects are taken out of their coherent spatial context, but it helps us determine whether our model can successfully refer in a context with similar, but not identical, categories.

Table 3: Reference resolution accuracies obtained from listener L_eval on expressions by S_0 and S_1
Results As shown in Table 3, the S_1 model improves the resolution accuracy for all categories on TS-distractors, except for bus. On TS-image, resolution accuracies are generally much higher and the comparison between S_0 and S_1 gives mixed results. We take this as positive evidence that S_1 improves communicative success in a relevant number of cases, but it also indicates that combining this model with the more standard RSA approach could be promising. Figure 2 shows a qualitative example where S_1 is more successful than S_0.

Conclusion
We have presented a pragmatic approach to modeling zero-shot reference games, showing that Bayesian reasoning inspired by RSA can help decode a neural generator that refers to novel objects. The decoder is based on a pragmatic listener that has hidden beliefs about a referent's category, which leads the pragmatic speaker to use fewer nouns when it is uncertain about this category. While some aspects of the experimental setting are, admittedly, simplified (e.g. the compilation of an artificial test set, uncertainty estimation), we believe that this is an encouraging result for scaling models in computational pragmatics to real-world conversation and its complexities.