Resolving References to Objects in Photographs using the Words-As-Classifiers Model

A common use of language is to refer to visually present objects. Modelling it in computers requires modelling the link between language and perception. The "words as classifiers" model of grounded semantics views words as classifiers of perceptual contexts, and composes the meaning of a phrase through composition of the denotations of its component words. It was recently shown to perform well in a game-playing scenario with a small number of object types. We apply it to two large sets of real-world photographs that contain a much larger variety of types and for which referring expressions are available. Using a pre-trained convolutional neural network to extract image features, and augmenting these with in-picture positional information, we show that the model achieves performance competitive with the state of the art in a reference resolution task (given expression, find bounding box of its referent), while, as we argue, being conceptually simpler and more flexible.


Introduction
A common use of language is to refer to objects in the shared environment of speaker and addressee. Being able to simulate this is of particular importance for verbal human/robot interfaces (HRI), and the task has consequently received some attention in this field (Matuszek et al., 2012; Tellex et al., 2011; Krishnamurthy and Kollar, 2013).
Here, we study a somewhat simpler precursor task, namely that of resolution of reference to objects in static images (photographs), but use a larger set of object types than is usually done in HRI work (> 300, see below). More formally, the task is to retrieve, given a referring expression e and an image I, the region bb* of the image that is most likely to contain the referent of the expression. As candidate regions, we use both manually annotated regions and automatically computed ones.
As our starting point, we use the "words-as-classifiers" model recently proposed by Kennington and Schlangen (2015). It has previously been tested only in a small domain and with specially designed features; here, we apply it to real-world photographs and use learned representations from a convolutional neural network (Szegedy et al., 2015). We learn models for between 400 and 1,200 words, depending on the training data set. As we show, the model performs competitively with the state of the art (Hu et al., 2016; Mao et al., 2016) on the same data sets.
Our background interest in situated interaction makes it important for us that the approach we use is 'dialogue ready'; and it is, in the sense that it supports incremental processing (giving results while the incoming utterance is going on) and incremental learning (being able to improve performance from interactive feedback). However, in this paper we focus purely on 'batch', non-interactive performance.

Related Work
The idea of connecting words to what they denote in the real world via perceptual features goes back at least to Harnad (1990), who coined the term "The Symbol Grounding Problem": "[H]ow can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads?" The proposed solution was to link 'categorial representations' with "learned and innate feature detectors that pick out the invariant features of object and event categories from their sensory projections". This suggestion has variously been taken up in computational work. An early example is Deb Roy's work from the early 2000s (Roy, 2002; Roy, 2005). In this work, computer vision techniques are used to detect object boundaries in a video feed, and to compute colour features (mean colour pixel value), positional features, and features encoding the relative spatial configuration of objects. These features are then associated in a learning process with certain words, resulting in an association of colour features with colour words, spatial features with prepositions, etc. Based on this, these words can be interpreted with reference to the scene currently presented to the video feed.
Of more recent work, that of Matuszek et al. (2012) is closely related to the approach we take. The task in this work is to compute (sets of) referents, given a (depth) image of a scene containing simple geometric shapes and a natural language expression. In keeping with the formal semantics tradition, a layer of logical form representation is assumed; it is not constructed via syntactic parsing rules, however, but by a learned mapping (semantic parsing). The non-logical constants of this representation are then interpreted by linking them to classifiers that work on perceptual features (representing shape and colour of objects). Interestingly, both mapping processes are trained jointly, and hence the links between classifiers and non-logical constants on the one hand, and non-logical constants and lexemes on the other, are induced from data. In the work presented here, we take a simpler approach that forgoes the level of semantic representation and directly links lexemes and perceptions, but does not yet learn the composition.
Most closely related on the formal side is recent work by Larsson (2015), which offers a very direct implementation of the 'words as classifiers' idea (couched in terms of type theory with records (TTR; Cooper and Ginzburg, 2015) and not model-theoretic semantics). In this approach, some lexical entries are enriched with classifiers that can judge, given a representation of an object, how applicable the term is to it. The paper also describes how these classifiers could be trained (or adapted) in interaction. The model is only specified theoretically, however, with hand-crafted classifiers for a small set of words, and not tested with real data.
The second area to mention here is the recently very active one of image-to-text generation, which has been spurred on by the availability of large datasets and competitions structured around them. The task here typically is to generate a description (a caption) for a given image. A frequently taken approach is to use a convolutional neural network (CNN) to map the image to a dense vector (which we do as well, as we will describe below), and then condition a neural language model (typically, an LSTM) on this to produce an output string (Vinyals et al., 2015; Devlin et al., 2015). A variant of this approach first uses what its authors call "word detectors" to specifically propose words for image regions, out of which the caption is then generated. This has some similarity to our word models as described below, but again is tailored more towards generation. Socher et al. (2014) present a more compositional variant of this type of approach where sentence representations are composed along the dependency parse of the sentence. The representation of the root node is then mapped into a multimodal space in which distance between sentence and image representation can be used to guide image retrieval, which is the task in that paper. Our approach, in contrast, composes on the level of denotations and not that of representation.
Two very recent papers carry this type of approach over to the problem of resolving references to objects in images. Both Hu et al. (2015) and Mao et al. (2015) use CNNs to encode image information (and interestingly, both combine, in different ways, information from the candidate region with more global information about the image as a whole), on which they condition an LSTM to get a prediction score for the fit of candidate region and referring expression. As we will discuss below, our approach has some similarities, but can be seen as being more compositional, as the expression score is more clearly composed out of individual word scores (with rule-driven composition, however). We will directly compare our results to those reported in these papers, as we were able to use the same datasets.

The "Words-As-Classifiers" Model
We now briefly review (and slightly reformulate) the model introduced by Kennington and Schlangen (2015). It has several components:
A Model of Word Meanings Let w be a word whose meaning is to be modelled, and let x be a representation of an object in terms of its visual features. The core ingredient then is a classifier that takes this representation and returns a score f_w(x), indicating the "appropriateness" of the word for denoting the object.
Noting a (loose) correspondence to Montague's (1974) intensional semantics, where the intension of a word is a function from possible worlds to extensions (Gamut, 1991), the intensional meaning of w is then defined as the classifier itself, a function from a representation of an object to an "appropriateness score":
$$[\![w]\!] = \lambda x.\, f_w(x)$$
(Here, $[\![\cdot]\!]$ is a function returning the meaning of its argument, and x is a feature vector as given by f_obj, the function that computes the representation for a given object.)
The extension of a word in a given (here, visual) discourse universe W can then be modelled as a probability distribution ranging over all candidate objects in the given domain, resulting from the application of the word intension to each object (x_i is the feature vector for object i, normalize() vectorized normalisation, and I a random variable ranging over the k candidates):
$$[\![w]\!]^W = \mathrm{normalize}\big((f_w(x_1), \ldots, f_w(x_k))\big) = P(I \mid w)$$
Composition Composition of word meanings into phrase meanings in this approach is governed by rules that are tied to syntactic constructions. In the following, we only use simple multiplicative composition for nominal constructions:
$$[\![\,[_{nom}\, w_1, \ldots, w_m]\,]\!]^W = P_\circ(I \mid w_1, \ldots, w_m), \qquad P_\circ(I = i \mid w_1, \ldots, w_m) = \frac{1}{Z} \prod_{j=1}^{m} P(I = i \mid w_j)$$
(Z takes care that the result is normalized over all candidate objects.)
Selection To arrive at the desired extension of a full referring expression (an individual object, in our case), one additional element is needed, and this is contributed by the determiner. For uniquely referring expressions ("the red cross"), what is required is to pick the most likely candidate from the distribution:
$$[\![\,[\text{the}]\,[_{nom}\, w_1, \ldots, w_m]\,]\!]^W = \arg\max_i \, P_\circ(I = i \mid w_1, \ldots, w_m)$$
In other words, the prediction of an expression such as "the brown shirt guy on right" is computed by first getting the responses of the classifiers corresponding to the words, individually for each object. That is, the classifier for "brown" is applied to objects o_1, ..., o_n. This yields a vector of responses (of dimensionality n, the number of candidate objects); similarly for all other words. These vectors are then multiplied, and the predicted object is the one corresponding to the maximal component of the resulting vector. Figure 1 gives a schematic overview of the model as implemented here, including the feature extraction process.

SAIAPR TC-12 / ReferItGame
The basis of this data set is the IAPR TC-12 image retrieval benchmark collection of "20,000 still natural images taken from locations around the world and comprising an assorted cross-section of still natural images" (Grubinger et al., 2006). A typical example of an image from the collection is shown in Figure 2 on the left.

This dataset was later augmented by Escalante et al. (2010) with segmentation masks identifying objects in the images (an average of 5 objects per image). Figure 2 (middle) gives an example of such a segmentation. These segmentations were done manually and provide close maskings of the objects. This extended dataset is also known as "SAIAPR TC-12" (for "segmented and annotated IAPR TC-12").
Figure 2: Image 27437 from IAPR TC-12 (left), with region masks from SAIAPR TC-12 (middle); "brown shirt guy on right" is a referring expression in REFERITGAME for the region singled out on the right.
The third component is provided by Kazemzadeh et al. (2014), who collected a large number of expressions referring to (pre-segmented) objects from these images, using a crowd-sourcing approach where two players were paired and a director needed to refer to a predetermined object to a matcher, who then selected it. (An example is given in Figure 2 (right).) This corpus contains 120k referring expressions, covering nearly all of the 99.5k regions. The average length of a referring expression from this corpus is 3.4 tokens. The 500k tokens realise 10,340 types, with 5785 hapax legomena. The most frequent tokens (other than articles and prepositions) are "left" and "right", with 22k occurrences. (In the following, we will refer to this corpus as REFERIT.)
This combination of segmented images and referring expressions has recently been used by Hu et al. (2015) for learning to resolve references, as we do here. The authors also tested their method on region proposals computed using the EdgeBox algorithm (Zitnick and Dollár, 2014). They kindly provided us with this region proposal data (100 best proposals per image), and we compare our results on these region proposals with theirs below. The authors split the dataset evenly into 10k images (and their corresponding referring expressions) for training and 10k for testing. As we needed more training data, we made a 90/10 split, ensuring that all our test images are from their test split.

MSCOCO / GoogleRefExp / ReferItGame
The second dataset is based on the "Microsoft Common Objects in Context" collection (Lin et al., 2014), which contains over 300k images with object segmentations (of objects from 80 prespecified categories), object labels, and image captions. Figure 3 shows some examples of images containing objects of type "person".
This dataset was augmented by Mao et al. (2015) with what they call 'unambiguous object descriptions', using a subset of 27k images that contained between 2 and 4 instances of the same object type within the same image. The authors collected and validated 100k descriptions in a crowd-sourced approach as well, but unlike in the ReferItGame setup, describers and validators were not connected live in a game setting. The average length of the descriptions is 8.3 tokens. The 790k tokens in the corpus realise 14k types, with 6950 hapax legomena. The most frequent tokens other than articles and prepositions are "man" (15k occurrences) and "white" (12k). (In the following, we will refer to this corpus as GREXP.) The authors also computed automatic region proposals for these images, using the multibox method of Erhan et al. (2014) and classifying those using a model trained on MSCOCO categories, retaining on average only 8 per image. These region proposals are on average of a much higher quality than those we have available for the other dataset.
As mentioned by Mao et al. (2015), Tamara Berg and colleagues have at the same time used their ReferItGame paradigm to collect referring expressions for MSCOCO images as well. Upon request, Berg and colleagues also kindly provided us with this data: 140k referring expressions for 20k images, with an average length of 3.5 tokens, 500k tokens altogether, 10.3k types, and 5785 hapax legomena; the most frequent tokens are again "left" (33k occurrences) and "right" (32k). (In the following, we will call this corpus REFCOCO.) In our experiments, we use the training/validation/test splits on the images suggested by Berg et al., as the splits provided by Mao et al. (2015) are on the level of objects and have some overlap in images.
It is interesting to note the differences in the expressions from REFCOCO and GREXP, the latter on average being almost 5 tokens longer. Figure 3 gives representative examples. We can speculate that the different task descriptions ("refer to this object" vs. "produce an unambiguous description") and the different settings (live to a partner vs. offline, only validated later) may have caused this. As we will see below, the GREXP descriptions did indeed cause more problems for our approach, which is meant for reference in interaction.

Training the Word/Object Classifiers
The approach we use is based on classifiers that link words and images. These need to be trained from data; more specifically, from pairings of image regions and referring expressions, as provided by the corpora described in the previous section.
Representing Image Regions The first step is to represent the information from the image regions. We use a deep convolutional neural network, "GoogLeNet" (Szegedy et al., 2015), that was trained on data from the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) from the ImageNet corpus (Deng et al., 2009), to extract features (see footnote 5). It was optimised to recognise categories from that challenge, which are different from those occurring in either SAIAPR or COCO, but in any case we only use the final fully-connected layer before the classification layer, to give us a 1024-dimensional representation of the region. We augment this with 7 features that encode information about the region relative to the image: the (relative) coordinates of two corners, its (relative) area, distance to the center, and orientation of the image. The full representation hence is a vector of 1031 features. (See also Figure 1 above.)
Selecting candidate words How do we select the words for which we train perceptual classifiers? There is a technical consideration to be made here and a semantic one. The technical consideration is that we need sufficient training data for the classifiers, and so can only practically train classifiers for words that occur often enough in the training corpus. We set a threshold here of a minimum of 40 occurrences in the training corpus, determined empirically on the validation set to provide a good tradeoff between vocabulary coverage and number of training instances.
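The region representation described above (1024 CNN features plus 7 positional features) can be illustrated with a minimal sketch. The text does not spell out how "distance to the center" or "orientation of the image" are computed, so the Euclidean distance in relative coordinates and the landscape/portrait flag used here are assumptions, as are the function names and the (x_min, y_min, x_max, y_max) box format.

```python
import math


def positional_features(box, image_width, image_height):
    """Return 7 positional features for a region (hypothetical helper).

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    # Relative coordinates of two corners: top-left and bottom-right (4 features).
    rx1, ry1 = x_min / image_width, y_min / image_height
    rx2, ry2 = x_max / image_width, y_max / image_height
    # Relative area of the region (1 feature).
    area = (rx2 - rx1) * (ry2 - ry1)
    # Distance of the region centre to the image centre (1 feature).
    # Assumption: Euclidean distance in relative coordinates.
    cx, cy = (rx1 + rx2) / 2, (ry1 + ry2) / 2
    distance = math.hypot(cx - 0.5, cy - 0.5)
    # Orientation of the image (1 feature).
    # Assumption: 1.0 for landscape (or square), 0.0 for portrait.
    orientation = 1.0 if image_width >= image_height else 0.0
    return [rx1, ry1, rx2, ry2, area, distance, orientation]


def region_representation(cnn_features, box, image_width, image_height):
    """Concatenate 1024 CNN features with the 7 positional features."""
    if len(cnn_features) != 1024:
        raise ValueError("expected a 1024-dimensional CNN feature vector")
    return list(cnn_features) + positional_features(box, image_width, image_height)


# Toy usage: a dummy CNN vector and a box covering the top-left quarter.
vector = region_representation([0.0] * 1024, (0, 0, 100, 50), 200, 100)
```

Keeping the positional features relative to the image size means that regions from images of different resolutions remain comparable, which matters when corpora are mixed.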
The semantic consideration is that intuitively, the approach does not seem appropriate for all types of words; while it might make sense for attributes and category names to be modelled as image classifiers, it makes less sense for prepositions and other function words. Nevertheless, for now, we make the assumption that all words in a referring expression contribute information to the visual identification of its referent. We discuss the consequences of this decision below.
This assumption is violated in a different way in phrases that refer via a landmark, such as in "the thing next to the woman with the blue shirt". Here we cannot assume, for example, that the referent region provides a good instance of "blue" (since it is not the target object in the region that is described as blue), and so we exclude such phrases from the training set (by looking for a small set of expressions such as "left of", "behind", etc.; see appendix for a full list). This reduces the training portions of REFERIT, REFCOCO and GREXP to 86%, 95%, and 82% of their original size, respectively (counting referring expressions, not tokens).
Footnote 5: http://www.image-net.org/challenges/LSVRC/2014/. We use the sklearn-theano (http://sklearn-theano.github.io/feature_extraction/index.html#feature-extraction) port of the Caffe replication and re-training (https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet) of this network structure.
Now that we have decided on the set of words for which to train classifiers, how do we assemble the training data?
Positive Instances Getting positive instances from the corpus is straightforward: We pair each word in a referring expression with the representation of the region it refers to. That is, if the word "left" occurs 20,000 times in expressions in the training corpus, we have 20,000 positive instances for training its classifier.
Negative Instances Acquiring negative instances is less straightforward. The corpus does not record inappropriate uses of a word, or 'negative referring expressions' (as in "this is not a red chair"). To create negative instances, we make a second assumption which again is not generally correct, namely that when a word was never used in the corpus to refer to an object, this object can serve as a negative example for that word/object classifier. In the experiments reported below, we randomly selected 5 image regions from the training corpus whose referring expressions (if there were any) did not contain the word in question (see footnote 6).
The classifiers Following this regime, we train binary logistic regression classifiers (with ℓ1 regularisation) on the visual object feature representations, for all words that occurred at least 40 times in the respective training corpus (see footnote 7).
To summarise, we train separate binary classifiers for each word (not making any a-priori distinction between function words and others, or attribute labels and category labels), giving them the task of predicting how likely it is that the word they represent would be used to refer to the image region they are given. All classifiers are presented during training with data sets with the same balance of positive and negative examples (here, a fixed ratio of 1 positive to 5 negative). Hence, the classifiers themselves do not reflect any word frequency effects; our claim (to be validated in future work) is that any potential effects of this type are better modelled separately.
Footnote 6: This approach is inspired by the negative sampling technique of Mikolov et al. (2013) for training textual word embeddings.
Footnote 7: We used the implementation in the scikit-learn package (Pedregosa et al., 2011).
[Table fragment: (Mao et al., 2015) - 0.61 - - - -]
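The assembly of training data described in this section (one positive instance per expression containing the word, paired with 5 randomly drawn negatives) can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the corpus format, the function name `build_training_set`, and the choice to draw the negatives separately for each positive instance are all hypothetical.

```python
import random


def build_training_set(word, corpus, n_negatives=5, seed=0):
    """Assemble (X, y) for one word classifier (hypothetical helper).

    corpus is a list of (region_features, expression_tokens) pairs.
    Positives: regions whose expression contains the word.
    Negatives: regions sampled from expressions that do not contain it.
    """
    rng = random.Random(seed)
    positives = [x for x, tokens in corpus if word in tokens]
    negative_pool = [x for x, tokens in corpus if word not in tokens]
    X, y = [], []
    for features in positives:
        X.append(features)
        y.append(1)
        # Assumption: negatives are drawn per positive instance, which
        # yields the fixed 1:5 ratio mentioned in the text.
        k = min(n_negatives, len(negative_pool))
        for negative in rng.sample(negative_pool, k):
            X.append(negative)
            y.append(0)
    return X, y


# Toy usage with one-dimensional "feature vectors".
toy_corpus = [([99.0], ["left", "man"])] + [([float(i)], ["right"]) for i in range(6)]
X, y = build_training_set("left", toy_corpus)
```

The resulting `X` and `y` could then be passed to a binary classifier; in scikit-learn, for example, `LogisticRegression(penalty="l1", solver="liblinear")` fits an ℓ1-regularised logistic regression, although the exact settings used in the experiments are not stated in the text.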

Experiments
The task in our experiments is the following: Given an image I together with bounding boxes of regions (bb_1, ..., bb_n) within it, and a referring expression e, predict which of these regions contains the referent of the expression.
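Resolving an expression against a set of candidate regions, using word classifiers with multiplicative composition and selection of the most likely candidate, can be sketched as below. The classifiers are stand-ins (any callable returning a score), and the names `normalize` and `resolve` as well as the toy feature vectors are assumptions made for illustration. Words without a classifier are skipped; if no word is known, the sketch returns no prediction, which an evaluation can then count as false.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def resolve(expression, candidates, classifiers):
    """Return (index of predicted candidate or None, distribution).

    expression: list of word tokens.
    candidates: list of feature vectors, one per candidate region.
    classifiers: dict mapping a word to a callable feature_vector -> score.
    """
    combined = [1.0] * len(candidates)
    known = False
    for word in expression:
        classifier = classifiers.get(word)
        if classifier is None:
            continue  # no model for this word
        known = True
        # Extension of the word: normalised responses over all candidates.
        extension = normalize([classifier(x) for x in candidates])
        # Multiplicative composition with the words seen so far.
        combined = [c * p for c, p in zip(combined, extension)]
    combined = normalize(combined)
    if not known:
        return None, combined
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return best, combined


# Toy usage: feature vectors are [redness, horizontal position].
toy_classifiers = {
    "red": lambda x: x[0],
    "left": lambda x: 1.0 - x[1],
}
toy_candidates = [[0.9, 0.8], [0.8, 0.1], [0.1, 0.1]]
best, distribution = resolve(["the", "red", "left"], toy_candidates, toy_classifiers)
```

Because the combined distribution is updated one word at a time, the same loop can in principle be run incrementally while an utterance is still unfolding.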

By Corpus
We start with training and testing models for all three corpora (REFERIT, REFCOCO, GREXP) separately. But first, we establish some baselines. The first is just randomly picking one of the candidate regions. The second is a 1-rule classifier that picks the largest region. The respective accuracies on the corpora are as follows: REFERIT 0.20/0.19; REFCOCO 0.16/0.23; GREXP 0.19/0.20.
Training on the training sets of REFERIT, REFCOCO and GREXP with the regime described above (min. 40 occurrences) gives us classifiers for 429, 503, and 682 words, respectively. Table 1 shows the evaluation on the respective test parts: accuracy (acc), mean reciprocal rank (mrr) and for how much of the expression, on average, a word classifier is present (arc). '>0' shows how much of the test corpus is left if expressions are filtered out for which not even a single word is in the model (which we evaluate by default as false), and accuracy for that reduced set. The 'NR' rows give the same numbers for reduced test sets in which all relational expressions have been removed; '%tst' shows how much of a reduction that is relative to the full test set. The rows with the citations give the best reported results from the literature.
As this shows, in most cases we come close, but do not quite reach these results. The distance is the biggest for GREXP with its much longer expressions. As discussed above, not only are the descriptions longer on average in this corpus, the vocabulary size is also much higher. Many of the descriptions contain action descriptions ("the man smiling at the woman"), which do not seem to be as helpful to our model. Overall, the expressions in this corpus do appear to be more like 'mini-captions' describing the region rather than referring expressions that efficiently single it out among the set of distractors; our model tries to capture the latter.
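The evaluation measures named in this section (accuracy, mean reciprocal rank, and the share of an expression covered by word classifiers) can be made concrete with a short sketch. The function names and the input format are assumptions; in particular, this sketch ranks candidates by their composed score and treats the first-ranked candidate as the prediction.

```python
def rank_of_gold(distribution, gold_index):
    """1-based rank of the gold candidate when sorted by descending score."""
    order = sorted(range(len(distribution)),
                   key=lambda i: distribution[i], reverse=True)
    return order.index(gold_index) + 1


def evaluate(predictions):
    """Return (accuracy, mean reciprocal rank).

    predictions is a list of (distribution, gold_index) pairs, one per
    referring expression in the test set.
    """
    ranks = [rank_of_gold(d, g) for d, g in predictions]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return accuracy, mrr


def coverage(expression, vocabulary):
    """Fraction of the tokens of one expression that have a classifier."""
    return sum(word in vocabulary for word in expression) / len(expression)


# Toy usage: the gold region is ranked first once and second once.
toy_predictions = [([0.7, 0.2, 0.1], 0), ([0.1, 0.6, 0.3], 2)]
accuracy, mrr = evaluate(toy_predictions)
share = coverage(["the", "red", "car"], {"red", "car"})
```

Averaging `coverage` over all test expressions would give a figure comparable to the 'arc' column described above.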
Combining Corpora A nice effect of our setup is that we can freely mix the corpora for training, as image regions are represented in the same way regardless of source corpus, and we can combine occurrences of a word across corpora. We tested combining the training sets of REFERIT and REFCOCO (RI+RC in the Table below), REFCOCO and GREXP (RC+GR), and all three (REFERIT, REFCOCO, and GREXP; RI+RC+GR), yielding models for 793, 933, and 1215 words, respectively (with the same "min. 40 occurrences" criterion). For all test sets, the results were at least stable compared to Table 1; for some they improved. For reasons of space, we only show the improvements here.
Computed Region Proposals Here, we cannot expect the system to retrieve exactly the ground truth bounding box, since we cannot expect the set of automatically computed regions to contain it. We follow Mao et al. (2015) in using intersection over union (IoU) as metric (the area of the intersection of candidate and ground truth bounding box, normalised by the area of their union) and taking an IoU ≥ 0.5 of the top candidate as a threshold for success (P@1). As a more relaxed metric, for the SAIAPR proposals (of which there are 100 per image) we also count it as a success when at least one among the top 10 candidates meets this IoU threshold (R@10). (For MSCOCO, there are only slightly above 5 proposals per image on average, so computing this more relaxed measure does not make sense.) The random baseline (RND) is computed by applying the P@1 criterion to a randomly picked region proposal. (That it is higher than 1/#regions for SAIAPR shows that the regions cluster around objects.)
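The IoU-based success criteria described above can be sketched in a few lines. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions made for this illustration; `hit_at_k` covers both the P@1 criterion (k=1) and the more relaxed R@10 criterion (k=10).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def hit_at_k(ranked_boxes, gold_box, k=1, threshold=0.5):
    """True if any of the top-k ranked proposals reaches the IoU threshold."""
    return any(iou(box, gold_box) >= threshold for box in ranked_boxes[:k])


# Toy usage: the top proposal overlaps too little, the second one fits well.
gold = (0, 0, 10, 10)
ranked = [(5, 0, 15, 10), (0, 0, 10, 9)]
success_at_1 = hit_at_k(ranked, gold, k=1)
success_at_10 = hit_at_k(ranked, gold, k=10)
```

In this toy case the first proposal has an IoU of 1/3 with the ground truth and the second an IoU of 0.9, so only the relaxed criterion counts the example as a success.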
Table 3: Results on region proposals
                     RP@1   RP@10   rnd
REFERIT              0.09   0.24    0.03
REFERIT; NR          0.10   0.26    0.03
(Hu et al., 2015)    0.18   0.45
REFCOCO              0.52   -       0.17
REFCOCO; NR          0.54   -       0.17
(Mao et al., 2015)   0.52
GREXP                0.36   -       0.16
GREXP; NR            0.37   -       0.17
(Mao et al., 2015)   0.45
With the higher quality proposals provided for the MSCOCO data, and the shorter, more prototypical referring expressions from REFCOCO, we narrowly beat the reported results. (Again, note that we use a different split that ensures separation on the level of images between training and test.) The model of Hu et al. (2015) performs relatively better on the region proposals (the gap is wider); on GREXP, we come relatively closer using these proposals. We can speculate that using automatically computed boxes of a lower selectivity (REFERIT) shifts the balance between needing to actually recognise the image and getting information from the shape and position of the box (our positional features; see Section 5).

Ablation Experiments
To get an idea about what the classifiers actually pick up on, we trained variants given only the positional features (POS columns below in Table 4) and only object features (NOPOS columns). We also applied a variant of the model with only the top 20 classifiers (in terms of number of positive training examples; TOP20). We only show accuracy here, and repeat the relevant numbers from Table 1.
Table 4 shows an interesting pattern. To a large extent, the object image features and the positional features seem to carry redundant information, with the latter on their own performing better than the former on their own. The full model, however, still gains something from the combination of the feature types. The top-20 classifiers (and consequently, top 20 most frequent words) alone reach decent performance (the numbers are shown for the full test set here; if reduced to only utterances where at least one word is known, the numbers rise, but the reduction of the test set is much more severe than for the full models with much larger vocabulary).
Error Analysis Figure 4 shows the accuracy of the model split by length of the referring expression (top lines; lower lines show the proportion of expressions of this length in the whole corpus). The pattern is similar for all corpora (but less pronounced for GREXP): shorter utterances fare better.
Manual inspection of the errors made by the system further corroborates the suspicion that composition as done here neglects too much of the internal structure of the expression. An example from REFERIT where we get a wrong prediction is "second person from left". The model clearly does not have a notion of counting, and here it wrongly selects the leftmost person. In a similar vein, we gave results above for a test set where spatial relations were removed, but other forms of relation (e.g., "child sitting on womans lap") that were not modelled still remain in the corpus.
We see it as an advantage of the model that we can inspect words individually. Given the performance on short utterances, we can conclude that the word/object classifiers themselves perform reasonably well. This seems to be somewhat independent of the number of training examples they received. Figure 5 shows, for REFERIT, the number of training instances (x-axis) vs. average accuracy on the validation set, for the whole vocabulary. As this shows, the classifiers tend to get better with more training instances, but there are good ones even with very little training material. Mean average precision (i.e., area under the precision / recall curve) over all classifiers (computed, as an example, for the RI+RC set, 793 words) is 0.73 (std 0.15). Interestingly, the 155 classifiers in the top range (average precision over 0.85) are almost all for concrete nouns; the 128 worst performing ones (below 0.60) are mostly other parts of speech. (See appendix.) This is, to a degree, as expected: our assumption behind training classifiers for all occurring words and not pre-filtering based on their part-of-speech or prior hypotheses about visual relevance was that words that can occur in all kinds of visual contexts will lead to classifiers whose contributions cancel out across all candidate objects in a scene.
However, the mean average precision of the classifiers for colour words is also relatively low at 0.6 (std 0.08); for positional words ("left", "right", "center", etc.) it is 0.54 (std 0.1). This might suggest that the features we take from the CNN are indeed more appropriate for tasks close to what they were originally trained on, namely category and not attribute prediction. We will explore this in future work.

Conclusions
We have shown that the "words-as-classifiers" model scales up to a larger set of object types with a much larger variety in appearance (SAIAPR and MSCOCO); to a larger vocabulary and much less restricted expressions (REFERIT, REFCOCO, GREXP); and to use of automatically learned feature types (from a CNN). It achieves results that are comparable to those of more complex models.
We see it as an advantage that the model we use is "transparent" and modular. Its basis, the word/object classifiers, ties in more directly with more standard approaches to semantic analysis and composition. Here, we have disregarded much of the internal structure of the expressions. But there is a clear path for bringing it back in, by defining other composition types for other construction types and different word models for other word types. Kennington and Schlangen (2015) do this for spatial relations in their simpler domain; for our domain, new and more richly annotated data such as VISUALgenome looks promising for learning a wide variety of relations. The use of denotations / extensions might make possible the transfer of methods from extensional semantics, e.g. for the addition of operators such as negation or generalised quantifiers. The design of the model, as mentioned in the introduction, makes it amenable for use in interactive systems that learn; we are currently exploring this avenue. Lastly, the word/object classifiers also show promise in the reverse task, generation of referring expressions (Zarrieß and Schlangen, 2016).
All this is future work. In its current state, besides (we believe) strongly motivating this future work, we hope that the model can also serve as a strong baseline for other future approaches to reference resolution, as it is conceptually simple and easy to implement.