Visual Referring Expression Recognition: What Do Systems Actually Learn?

We present an empirical analysis of state-of-the-art systems for referring expression recognition – the task of identifying the object in an image referred to by a natural language expression – with the goal of gaining insight into how these systems reason about language and vision. Surprisingly, we find strong evidence that even sophisticated and linguistically-motivated models for this task may ignore linguistic structure, instead relying on shallow correlations introduced by unintended biases in the data selection and annotation process. For example, we show that a system trained and tested on the input image without the input referring expression can achieve a precision of 71.2% in top-2 predictions. Furthermore, a system that predicts only the object category given the input can achieve a precision of 84.2% in top-2 predictions. These surprisingly positive results for what should be deficient prediction scenarios suggest that careful analysis of what our models are learning – and further, how our data is constructed – is critical as we seek to make substantive progress on grounded language tasks.


Introduction
There has been increasing interest in modeling natural language in the context of visual grounding. Several benchmark datasets have recently been introduced for describing a visual scene with natural language (Chen et al., 2015), describing or localizing specific objects in a scene (Kazemzadeh et al., 2014; Mao et al., 2016), answering natural language questions about scenes (Antol et al., 2015), and performing visually grounded dialogue (Das et al., 2016). Here, we focus on referring expression recognition (RER), the task of identifying the object in an image that is referred to by a natural language expression produced by a human (Kazemzadeh et al., 2014; Mao et al., 2016; Yu et al., 2016; Nagaraja et al., 2016; Hu et al., 2017).
Recent work on RER has sought to make progress by introducing models that are better capable of reasoning about linguistic structure (Hu et al., 2017; Nagaraja et al., 2016). However, since most state-of-the-art systems involve complex neural parameterizations, what these models actually learn has been difficult to interpret. This is concerning because several post-hoc analyses of related tasks (Zhou et al., 2015; Devlin et al., 2015; Agrawal et al., 2016; Jabri et al., 2016; Goyal et al., 2016) have revealed that some positive results are actually driven by superficial biases in datasets or shallow correlations without deeper visual or linguistic understanding. Evidently, it is hard to be completely sure whether a model is performing well for the right reasons.
To increase our understanding of how RER systems function, we present several analyses inspired by approaches that probe systems with perturbed inputs (Jia and Liang, 2017) and employ simple models to exploit and reveal biases in datasets (Chen et al., 2016a). First, we investigate whether systems that were designed to incorporate linguistic structure actually require it and make use of it. To test this, we perform perturbation experiments on the input referring expressions. Surprisingly, we find that models are robust to shuffling the word order and to limiting the word categories to nouns and adjectives. Second, we attempt to reveal shallower correlations that systems might instead be leveraging to do well on this task. We build two simple systems called Neural Sieves: one that completely ignores the input referring expression and another that only predicts the category of the referred object from the input expression. Again, both sieves identify the correct object with surprisingly high precision in their top-2 and top-3 predictions. When these two simple systems are combined, the resulting system achieves precisions of 84.2% and 95.3% for top-2 and top-3 predictions, respectively. These results suggest that to make meaningful progress on grounded language tasks, we need to pay careful attention to what and how our models are learning, and whether our datasets contain exploitable bias.

Related Work
Referring expression recognition and generation is a well-studied problem in intelligent user interfaces (Chai et al., 2004), human-robot interaction (Fang et al., 2012; Chai et al., 2014; Williams et al., 2016), and situated dialogue (Kennington and Schlangen, 2017). Kazemzadeh et al. (2014) and Mao et al. (2016) introduce two benchmark datasets for referring expression recognition. Several models that leverage linguistic structure have been proposed. Nagaraja et al. (2016) propose a model where target and supporting objects (i.e. objects that are mentioned in order to disambiguate the target object) are identified and scored jointly. The resulting model is able to localize supporting objects without direct supervision. Hu et al. (2017) introduce a compositional approach for the RER task. They assume that the referring expression can be decomposed into a triplet consisting of the target object, the supporting object, and their spatial relationship. This structured model achieves state-of-the-art accuracy on the Google-Ref dataset. Cirik et al. (2018) propose a type of neural modular network (Andreas et al., 2016) where the computation graph is defined in terms of a constituency parse of the input referring expression.
Previous studies on other tasks have found that state-of-the-art systems may be successful for reasons different from those originally assumed. For example, Chen et al. (2016b) show that a simple logistic regression baseline with carefully defined features can achieve competitive results for reading comprehension on the CNN/Daily Mail datasets (Hermann et al., 2015), indicating that more sophisticated models may be learning relatively simple correlations. Similarly, Gururangan et al. (2018) reveal bias in a dataset for semantic inference by demonstrating a simple model that achieves competitive results without looking at the premise.

Analysis by Perturbation
In this section, we analyze how state-of-the-art referring expression recognition systems utilize linguistic structure. We conduct experiments with perturbed referring expressions in which various aspects of linguistic structure are obscured. We perform three types of analyses: the first studies syntactic structure (Section 3.2), the second focuses on the importance of word categories (Section 3.3), and the third analyzes potential biases in the dataset (Section 3.4).

Analysis Methodology
To perform our analysis, we take two state-of-the-art systems, CNN+LSTM-MIL (Nagaraja et al., 2016) and CMN (Hu et al., 2017), and train them from scratch with perturbed referring expressions. The perturbations described in the following subsections are applied to all train and test instances. All experiments use the standard train/test splits of the Google-Ref dataset (Mao et al., 2016). Systems are evaluated using the precision@k metric, the fraction of test instances for which the target object is contained in the model's top-k predictions. We provide further details of our experimental methodology in Section 4.1.
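The precision@k metric defined above can be illustrated with a minimal Python sketch. The function name `precision_at_k` and the toy box ids are assumptions made for this example; they are not taken from the evaluated systems.

```python
from typing import Sequence


def precision_at_k(ranked_predictions: Sequence[Sequence[int]],
                   targets: Sequence[int], k: int) -> float:
    """Fraction of instances whose target is among the top-k predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one ranked list per target")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_predictions, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Three toy test instances (hypothetical data); each list ranks candidate
# box ids from most to least likely.
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
gold = [2, 0, 1]
p1 = precision_at_k(ranked, gold, k=1)  # only the first instance is a hit
p2 = precision_at_k(ranked, gold, k=2)  # first and third instances are hits
```

With k=1 this reduces to ordinary accuracy, which is why top-2 and top-3 figures should be read as a measure of how far the candidate set can be narrowed rather than as final predictions.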

Syntactic Analysis by Permuting Word Order
In English, word order is important for correctly understanding the syntactic structure of a sentence. Both models we analyze use Recurrent Neural Networks (RNN) (Elman, 1990) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). Previous studies have shown that recurrent architectures can perform well on tasks where word order and syntax are important: for example, tagging (Lample et al., 2016), parsing (Sutskever et al., 2014), and machine translation (Bahdanau et al., 2014). We seek to determine whether recurrent models for RER depend on syntactic structure. Premise 1: Randomly permuting the word order of an English referring expression will obscure its syntactic structure.
We train CMN and CNN+LSTM-MIL with shuffled referring expressions as input and evaluate their performance. Table 1 shows accuracies for models with and without shuffled referring expressions. The column with ∆ shows the difference in accuracy compared to the best performing model without shuffling. The drop in accuracy is surprisingly low. Thus, we conclude that these models do not strongly depend on the syntactic structure of the input expression and may instead leverage other, shallower, correlations.
Table 1 (caption, partially recovered): ∆ is the difference between the unperturbed and shuffled versions of the same system.
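The word-order perturbation can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the experiments: whitespace tokenization and the helper name `shuffle_expression` are assumptions.

```python
import random
from typing import Optional


def shuffle_expression(expression: str,
                       rng: Optional[random.Random] = None) -> str:
    """Return the expression with its tokens in a random order."""
    rng = rng or random.Random()
    tokens = expression.split()  # assumption: whitespace tokenization
    rng.shuffle(tokens)          # in-place random permutation
    return " ".join(tokens)


original = "the green suitcase next to the bench"
shuffled = shuffle_expression(original, random.Random(0))
```

The perturbation preserves the bag of words and destroys only the ordering, so any accuracy that survives it cannot come from syntax.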

Lexical Analysis by Discarding Words
Following the analysis presented in Section 3.2, we are curious to study what other aspects of the input referring expression may be essential for state-of-the-art performance. If syntactic structure is largely unimportant, it may be that spatial relationships can be ignored. Spatial relationships between objects are usually represented by prepositional phrases and verb phrases. In contrast, simple descriptors (e.g. green) and object types (e.g. table) are most often represented by adjectives and nouns, respectively. By discarding all words in the input that are not nouns or adjectives, we hope to test whether spatial relationships are actually important to state-of-the-art models. Notably, both systems we test were specifically designed to model object relationships. Premise 2: Keeping only nouns and adjectives from the input expression will obscure the relationships between objects that the referring expression describes.
Table 2 shows accuracies resulting from training and testing these models on only the nouns and adjectives in the input expression. Our first observation is that the accuracies of models drop the most when we discard the nouns (the rightmost column in Table 2). This is reasonable since nouns define the types of the objects referred to in the expression. Without nouns, it is extremely difficult to identify which objects are being described. Second, although both systems we analyze model the relationship between objects, discarding verbs and prepositions, which are essential in determining the relationship among objects, does not drastically affect their performance (the second column in Table 2). This may indicate that the superior performance of these systems does not specifically come from their modeling approach for object relationships.
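The word-category perturbation can be sketched as a filter over part-of-speech tags. The sketch below assumes the expression has already been tagged with Penn Treebank tags by some external tagger; the tagger, the helper names, and the example sentence are assumptions, not details from the experiments.

```python
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank POS tag)

NOUN_PREFIXES = ("NN",)        # NN, NNS, NNP, NNPS
ADJECTIVE_PREFIXES = ("JJ",)   # JJ, JJR, JJS


def keep_categories(tagged: Sequence[TaggedToken],
                    prefixes: Tuple[str, ...]) -> List[str]:
    """Keep only the words whose tag starts with one of the prefixes."""
    return [word for word, tag in tagged if tag.startswith(prefixes)]


def discard_categories(tagged: Sequence[TaggedToken],
                       prefixes: Tuple[str, ...]) -> List[str]:
    """Drop the words whose tag starts with one of the prefixes."""
    return [word for word, tag in tagged if not tag.startswith(prefixes)]


# Hypothetical pre-tagged expression: "the green table behind the woman".
tagged = [("the", "DT"), ("green", "JJ"), ("table", "NN"),
          ("behind", "IN"), ("the", "DT"), ("woman", "NN")]

nouns_and_adjectives = keep_categories(tagged, NOUN_PREFIXES + ADJECTIVE_PREFIXES)
without_nouns = discard_categories(tagged, NOUN_PREFIXES)
```

In this example the preposition "behind", the only word that encodes the spatial relationship, is removed by the first filter, which is the effect Premise 2 relies on.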

Bias Analysis by Discarding Referring Expressions
Goyal et al. (2016) show that some language and vision datasets have exploitable biases. Could there be a dataset bias that is exploited by the models for RER? Premise 3: Discarding the referring expression entirely and keeping only the input image creates a deficient prediction problem: achieving high performance on this task indicates dataset bias. We train CMN after removing all referring expressions from the train and test sets. We call this model "image-only" since it ignores the referring expression and uses only the input image. We compare the CMN "image-only" model with the state-of-the-art configuration of CMN and a random baseline. The "image-only" model is able to surpass the random baseline by a large margin. This result indicates that the dataset is biased, likely as a result of the data selection and annotation process. During the construction of the dataset, Mao et al. (2016) annotate an object box only if there are at least 2 to 4 objects of the same type in the image. Thus, only a subset of object categories ever appear as targets because some object types rarely occur multiple times in an image. In fact, out of 90 object categories in MSCOCO, 43 of the object categories are selected as target objects less than 1% of the time they occur in images. This potentially explains the relatively high performance of the "image-only" system.
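A statistic of the kind quoted above (how often a category is chosen as a target relative to how often it occurs) could be computed as in the following sketch. The counts are toy values invented for illustration, and counting occurrences per object instance is an assumption about how the figure was obtained.

```python
from collections import Counter
from typing import Dict, Iterable, List


def target_rates(occurrences: Iterable[str],
                 targets: Iterable[str]) -> Dict[str, float]:
    """Per category: (# times selected as target) / (# times it occurs)."""
    occurrence_counts = Counter(occurrences)
    target_counts = Counter(targets)
    return {category: target_counts[category] / count
            for category, count in occurrence_counts.items()}


def rarely_targeted(rates: Dict[str, float],
                    threshold: float = 0.01) -> List[str]:
    """Categories selected as targets in less than `threshold` of occurrences."""
    return sorted(c for c, rate in rates.items() if rate < threshold)


# Toy data (hypothetical): category label of every annotated object, and
# category label of every object that was chosen as a referring target.
occurrences = ["suitcase"] * 10 + ["bench"] * 200 + ["backpack"] * 150
targets = ["suitcase"] * 6 + ["bench"] * 1

rates = target_rates(occurrences, targets)
rare = rarely_targeted(rates)
```

A model that sees only the image can learn these rates implicitly and discard boxes from rarely targeted categories, which is one way the "image-only" system could beat the random baseline.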

Discussion
The previous analyses indicate that exploiting bias in the data selection process and leveraging shallow linguistic correlations with the input expression may go a long way towards achieving high performance on this dataset. First, it may be possible to simplify the decision of picking an object to a much smaller set of candidates without even considering the referring expression. Second, because removing all words except for nouns and adjectives only marginally hurt performance for the systems tested, it may be possible to further reduce the set of candidates by focusing only on simple properties like the category of the target object rather than its relations with the environment or with adjacent objects.

Neural Sieves
We introduce a simple pipeline of neural networks, Neural Sieves, that attempt to reduce the set of candidate objects down to a much smaller set that still contains the target object given an image, a set of objects, and the referring expression describing one of the objects.
Sieve I: Filtering Unlikely Objects. Inspired by the results from Section 3.4, we design an "image-only" model as the first sieve for filtering unlikely objects. For example, in Figure 1, Sieve I filters out the backpack and the bench from the list of bounding boxes since there is only one instance of each of these object types. We use a parameterization similar to one of the baselines (CMN_LOC) proposed by Hu et al. (2017) for Sieve I and train it by providing only spatial and visual features for the boxes, ignoring the referring expression. More specifically, for the visual features $r_{vis}$ of the bounding box of an object, we use Faster-RCNN (Ren et al., 2015). For the spatial features, we use 5-dimensional vectors
$$r_{spat} = \left[\frac{x_{min}}{W_V}, \frac{y_{min}}{H_V}, \frac{x_{max}}{W_V}, \frac{y_{max}}{H_V}, \frac{A_r}{A_V}\right]$$
where $A_r$ is the size and $[x_{min}, y_{min}, x_{max}, y_{max}]$ are the coordinates of bounding box $r$, and $A_V$, $W_V$, $H_V$ are the area, the width, and the height of the input image $V$. These two representations are concatenated as $r_{vis,spat} = [r_{vis}; r_{spat}]$ for a bounding box $r$.
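As an illustration of the 5-dimensional spatial feature, the sketch below normalizes the box coordinates by the image width and height and the box area by the image area. This normalization is one plausible reading of the quantities named in the text and should be treated as an assumption; the function names are also hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def spatial_features(box: Box, image_width: float,
                     image_height: float) -> List[float]:
    """5-d spatial vector: normalized corner coordinates and relative area."""
    x_min, y_min, x_max, y_max = box
    box_area = (x_max - x_min) * (y_max - y_min)   # A_r
    image_area = image_width * image_height        # A_V
    return [x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height,
            box_area / image_area]


def concat_features(r_vis: Sequence[float],
                    r_spat: Sequence[float]) -> List[float]:
    """Concatenate visual and spatial features into r_vis,spat."""
    return list(r_vis) + list(r_spat)


# A 50x100 box in the top-left corner of a 100x200 image (toy values).
r_spat = spatial_features((0.0, 0.0, 50.0, 100.0), 100.0, 200.0)
# The 3-d visual vector stands in for real CNN features.
r_vis_spat = concat_features([0.1, 0.2, 0.3], r_spat)
```

Normalizing by the image size keeps the features comparable across images of different resolutions.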
We parameterize Sieve I with a list of bounding boxes $R$ as the input and parameter set $\Theta_I$ as follows:
$$s_I = W^{score}_I \, r_{vis,spat} \tag{1}$$
$$f_I(R; \Theta_I) = \mathrm{softmax}(s_I) \tag{2}$$
Each bounding box is scored using a matrix $W^{score}_I$ (Eq 1). Scores for all bounding boxes are then fed to a softmax to get a probability distribution over boxes (Eq 2). The learned parameter $\Theta_I$ is the scoring matrix $W^{score}_I$.
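The scoring step of Sieve I, a linear score per box followed by a softmax over all boxes in the image, can be sketched in plain Python. Treating the scoring matrix as a single weight row, and the toy feature values, are assumptions made for the example.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sieve_one(box_features: Sequence[Sequence[float]],
              w_score: Sequence[float]) -> List[float]:
    """Score every box with one weight row, then normalize across boxes."""
    scores = [sum(w * x for w, x in zip(w_score, features))
              for features in box_features]
    return softmax(scores)


# Three candidate boxes with 2-d toy features and hypothetical weights.
boxes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [2.0, -1.0]
probabilities = sieve_one(boxes, weights)  # scores are 2.0, -1.0, 1.0
top2 = sorted(range(len(boxes)), key=lambda i: -probabilities[i])[:2]
```

Note that the referring expression never enters the computation: the ranking depends on the image alone, which is what makes a high top-k precision for this sieve a sign of dataset bias.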

Sieve II: Filtering Based on Object Categories
After filtering unlikely objects based only on the image, the second step is to determine which object category to keep as a candidate for prediction, filtering out the other categories. For instance, in Figure 1, only instances of suitcases are left as candidates after determining which type of object the input expression is talking about. To perform this step, Sieve II takes the list of object candidates from Sieve I and keeps objects having the same object category as the referred object. Unlike Sieve I, Sieve II uses the referring expression to filter bounding boxes of objects. We again use the baseline model CMN_LOC from previous work (Hu et al., 2017) for the parameterization of Sieve II with a minor modification: instead of predicting the referred object, we make a binary decision for each box of whether the object in the box is the same category as the target object.
More specifically, we parameterize Sieve II as follows:
$$\hat{r}_{vis,spat} = W^{vis,spat}_{II} \, r_{vis,spat} \tag{3}$$
$$z_{II} = \hat{r}_{vis,spat} \odot f_{att}(T) \tag{4}$$
We encode the referring expression $T$ into an embedding with $f_{att}(T)$, which uses an attention mechanism (Bahdanau et al., 2014) on top of a 2-layer bidirectional LSTM (Schuster and Paliwal, 1997). We project the bounding box features $r_{vis,spat}$ to the same dimension as the embedding of the referring expression (Eq 3). Text and box representations are element-wise multiplied to get $z_{II}$ as a joint representation of the text and bounding box (Eq 4). We L2-normalize to produce $\hat{z}_{II}$ (Eq 5, 6). Box scores [...], and parameters of the encoding module $f_{att}$.
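The joint representation used by Sieve II (project the box features, multiply element-wise with the text embedding, then L2-normalize) can be sketched as follows. The text embedding is taken as a given vector here; the attention-based encoder that produces it is not implemented, and all names and values are assumptions for illustration.

```python
import math
from typing import List, Sequence


def project(matrix: Sequence[Sequence[float]],
            vector: Sequence[float]) -> List[float]:
    """Matrix-vector product: maps box features to the text dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def joint_representation(w_proj: Sequence[Sequence[float]],
                         r_vis_spat: Sequence[float],
                         text_embedding: Sequence[float]) -> List[float]:
    """Projected box features times text embedding, L2-normalized."""
    r_hat = project(w_proj, r_vis_spat)
    if len(r_hat) != len(text_embedding):
        raise ValueError("projection must match the text embedding size")
    z = [a * b for a, b in zip(r_hat, text_embedding)]  # element-wise product
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z] if norm > 0 else z


# Toy example: 3-d box features projected to a 2-d text space.
w_proj = [[1.0, 0.0, 0.0],
          [0.0, 2.0, 0.0]]
box = [3.0, 2.0, 5.0]
text = [1.0, 1.0]  # stands in for the encoder output f_att(T)
z_hat = joint_representation(w_proj, box, text)
```

Because the product is element-wise, each dimension of the text embedding acts as a gate on the matching dimension of the projected box features.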

Filtering Experiments
We are interested in determining how accurate these simple neural sieves can be. High accuracy here would give a possible explanation for the high performance of more complex models.
Dataset. For our experiments, we use Google-Ref (Mao et al., 2016), which is one of the standard benchmarks for referring expression recognition. It consists of around 26K images with 104K annotations. We use their Ground-Truth evaluation setup, where the ground truth bounding box annotations from MSCOCO (Lin et al., 2014) are provided to the system as a part of the input. We use the split provided by Nagaraja et al. (2016), where splits have disjoint sets of images. We use precision@k for evaluating the performance of models.
Implementation Details. To train our models, we used stochastic gradient descent for 6 epochs with an initial learning rate of 0.01, which was multiplied by 0.4 after each epoch. Word embeddings were initialized using GloVe (Pennington et al., 2014) and finetuned during training. We extracted features for bounding boxes using the fc7 layer output of the Faster-RCNN VGG-16 network (Ren et al., 2015) pre-trained on the MSCOCO dataset (Lin et al., 2014). Hyperparameters such as the hidden layer size of LSTM networks were picked based on the best validation score. For perturbation experiments, we did not perform any grid search for hyperparameters; we used the hyperparameters of the previously reported best performing model in the literature. We have released our code for public use.
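The learning rate schedule described above can be written out explicitly. The sketch assumes the first epoch uses the initial rate and the decay is applied once after every epoch; the function name is hypothetical.

```python
from typing import List


def learning_rates(initial: float = 0.01, decay: float = 0.4,
                   epochs: int = 6) -> List[float]:
    """Learning rate used in each epoch under per-epoch exponential decay."""
    return [initial * decay ** epoch for epoch in range(epochs)]


schedule = learning_rates()
# Epoch 0 trains at 0.01, epoch 1 at 0.004, and so on down to
# 0.01 * 0.4**5 in the sixth and final epoch.
```

Under this reading the rate has dropped by roughly two orders of magnitude by the last epoch, so most of the learning happens in the first few passes over the data.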
Baseline Models. We compare Neural Sieves to the state-of-the-art models from the literature. CNN+LSTM-MIL (Nagaraja et al., 2016) scores target object-context object pairs using LSTMs for processing the referring expression and CNN features for bounding boxes. The pair with the highest score is predicted as the referred object. The model is trained with Multi-Instance Learning. CMN (Hu et al., 2017) is a neural module network with a tuple of object-relationship-subject nodes. The text encoding of tuples is calculated with a two-layer bi-directional LSTM and an attention mechanism (Bahdanau et al., 2014) over the referring expression.
Table 4 shows the precision scores. The referred object is in the top-2 candidates selected by Sieve I 71.2% of the time and in the top-3 predictions 86.6% of the time. Combining both sieves into a pipeline, these numbers further increase to 84.2% for top-2 predictions and to 95.3% for top-3 predictions. Considering the simplicity of the Neural Sieve approach, these are surprising results: two simple neural network systems, the first ignoring the referring expression and the second predicting only the object type, are able to reduce the number of candidate boxes down to 2 on 84.2% of instances.

Conclusion
We have analyzed two RER systems by variously perturbing aspects of the input referring expressions: shuffling, removing word categories, and finally, by removing the referring expression entirely. Based on this analysis, we proposed a pipeline of simple neural sieves that captures many of the easy correlations in the standard dataset. Our results suggest that careful analysis is important both while constructing new datasets and while constructing new models for grounded language tasks. The techniques used here may be applied more generally to other tasks to give better insight into what our models are learning and whether our datasets contain exploitable bias.