Words Aren’t Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words alone are enough to identify the target object and word order does not matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv.

* Work done in part while AA was an intern at Amazon AI.


Introduction
Visual referring expression recognition is the task of identifying the object in an image referred to by a natural language expression (Kazemzadeh et al., 2014; Nagaraja et al., 2016; Mao et al., 2016; Hu et al., 2016). Figure 1 shows an example. This task has drawn much attention due to its ability to test a model's understanding of natural language in the context of visual grounding, and due to its application in downstream tasks such as image retrieval (Young et al., 2014) and question answering (Antol et al., 2015; Zhu et al., 2016). To track progress on this task, various datasets have been proposed in which real-world images are annotated by crowdsourced workers (Kazemzadeh et al., 2014; Mao et al., 2016). Recently, neural models have achieved tremendous progress on these datasets (Yu et al., 2018; Lu et al., 2019). However, multiple studies have suggested that these models could be exploiting strong biases in these datasets (Cirik et al., 2018b; Liu et al., 2019). For example, models could simply be selecting a salient object in the image or in the referring expression without recourse to linguistic structure (see Figure 1). This defeats the true purpose of the task and casts doubt on the actual progress.
In this work, we examine the RefCOCOg dataset (Mao et al., 2016), a popular testbed for evaluating referring expression models, using crowdsourced workers. We show that a large percentage of samples in the RefCOCOg test set indeed do not rely on the linguistic structure (word order) of the expressions. Accordingly, we split the RefCOCOg test set into two splits, Ref-Easy and Ref-Hard, where linguistic structure is key for recognition in the latter but not the former ( §2). In addition, we create a new out-of-distribution dataset called Ref-Adv from Ref-Hard by rewriting each referring expression such that the target object is different from the original annotation ( §3). We evaluate existing models on these splits and show that the true progress is at least 12-23% behind the established progress, indicating there is ample room for improvement ( §4). We propose two new models, one that makes use of contrastive learning with negative examples and the other based on multi-task learning, and show that they are slightly more robust than the current state-of-the-art models ( §5).

Importance of linguistic structure
RefCOCOg is the largest visual referring expression benchmark available for real-world images (Mao et al., 2016). Unlike other referring expression datasets such as RefCOCO and RefCOCO+ (Kazemzadeh et al., 2014), special care was taken so that its expressions are longer and more diverse. We therefore choose RefCOCOg to examine the importance of linguistic structure. Cirik et al. (2018b) observed that when the words in a referring expression are shuffled into random order, the performance of existing models on RefCOCOg drops only a little. This suggests that models rely heavily on biases in the data rather than on linguistic structure, i.e., the actual sequence of words. Ideally, we want to test models on samples where there is a correlation between linguistic structure and the spatial relations of objects, and where any obscuring of the structure leads to ambiguity. To identify such a set, we rely on human annotators.
We randomly shuffle the words in a referring expression to distort its linguistic structure and ask humans to identify the target object via predefined bounding boxes. Each image in the RefCOCOg test set is annotated by five Amazon Mechanical Turk (AMT) workers, and when at least three annotators select a bounding box that has high overlap with the ground truth, we treat it as a correct prediction. Following Mao et al. (2016), we set 0.5 IoU (intersection over union) as the threshold for high overlap. Given that there are at least two objects in each image, the expected accuracy of a random choice is below 50%. However, we observe that human accuracy on distorted examples is 83.7%, indicating that a large portion of the RefCOCOg test set is insensitive to linguistic structure. Based on this observation, we divide the test set into two splits for fine-grained evaluation of models: Ref-Easy, the samples for which at least three of the five annotators localize the target object correctly despite the shuffling, and Ref-Hard, the samples for which at least three of the five annotators fail to do so.

The Ref-Adv dataset

We take each sample in Ref-Hard and collect additional referring expressions such that the target object is different from the original object. We choose the target objects that humans most often confuse with the original object when the referring expression is shuffled (as described in the previous section). For each target object, we ask three AMT workers to write a referring expression while retaining most content words of the original referring expression. In contrast to the original expression, the modified expression differs mainly in its structure while sharing several words. For example, in Figure 1, the adversarial sample is created by swapping pastry and blue fork and making plate the head of pastry. We perform an extra validation step to filter out bad referring expressions: three additional AMT workers select a bounding box to identify the target object, and we only keep the samples where at least two workers achieve IoU > 0.5 with the target object.
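Both the Ref-Easy/Ref-Hard split and this validation step reduce to the same check: compute the IoU between each annotator's selected box and the target box, and require a minimum number of annotators to exceed the 0.5 threshold. The sketch below illustrates this criterion; the box format and helper names are hypothetical and not taken from the released annotation code.

```python
# Minimal sketch of the agreement criterion used for the Ref-Easy/Ref-Hard split
# (3 of 5 annotators) and for the Ref-Adv validation step (2 of 3 annotators).
# Boxes are given as (x1, y1, x2, y2); names are illustrative assumptions.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def enough_annotators_agree(annotator_boxes, target_box, min_votes, threshold=0.5):
    """True if at least `min_votes` annotators selected a box overlapping the target."""
    votes = sum(iou(b, target_box) > threshold for b in annotator_boxes)
    return votes >= min_votes

# Ref-Easy vs. Ref-Hard: five annotators see the shuffled expression.
# is_easy = enough_annotators_agree(boxes_from_5_workers, gt_box, min_votes=3)
# Ref-Adv validation: three annotators see the adversarial expression.
# keep_sample = enough_annotators_agree(boxes_from_3_workers, new_target_box, min_votes=2)
```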
(Figure: example expressions from each split. Easy: The larger of two giraffes. Hard: A giraffe eating leaves off the tree. Adv: The giraffe that is not eating leaves off the tree. Easy: A blue snowboard. Hard: A woman wearing a blue jacket and orange glasses next to a woman with a white hood. Adv: A woman with a white hood, next to a woman wearing orange glasses and a blue jacket. Easy: Water in a tall, clear glass. Hard: The glass of water next to the saucer with the cup on it. Adv: The cup on the saucer, next to the glass of water. Easy: The short blue bike on the right. Hard: The blue bike behind the red car. Adv: The red car behind the blue bike. Easy: The man with the glasses on. Hard: A man holding a cake that is not wearing a tie. Adv: The man holding a cake that is wearing a tie. Easy: A green cushion couch with a pillow. Hard: A green couch across from a white couch. Adv: A white couch across from a green couch.)

Our data collection is related to concurrent work by Gardner et al. (2020) and Kaushik et al. (2020), who also use crowdworkers. While we use perturbed examples to evaluate robustness, they also use them to improve robustness (we propose complementary methods to improve robustness, §5). Moreover, we are primarily concerned with the robustness of models for the visual referring expression recognition task, while Gardner et al. and Kaushik et al. focus on different tasks (e.g., sentiment analysis, natural language inference).

Human Performance on Ref-Easy, Ref-Hard and Ref-Adv
We conducted an additional human study (on AMT) to compare human performance on Ref-Easy, Ref-Hard and Ref-Adv.

Among the models we evaluate (CMN, GroundNet, MattNet and ViLBERT; §4), ViLBERT (Lu et al., 2019) uses a pretrain-then-transfer learning approach to jointly learn visiolinguistic representations from large-scale data and utilizes them to ground expressions. It is the only model that does not explicitly model the compositional structure of language, although BERT-like models have been shown to capture syntactic structure latently (Hewitt and Manning, 2019).

Results and discussion
We trained on the full training set of RefCOCOg and performed hyperparameter tuning on a development set, using the development and test splits of Mao et al. (2016). Table 2 shows the model accuracies on these splits and on our proposed datasets. The models are trained to select the ground-truth bounding box from a set of predefined bounding boxes. We treat a prediction as correct if the predicted bounding box has IoU > 0.5 with the ground truth.
Although the overall performance on the test set seems high, in reality the models excel only on Ref-Easy, and their accuracy drops substantially on Ref-Hard and Ref-Adv. This suggests that models rely on reasoning shortcuts found in the training data rather than on actual understanding. Among the models, GroundNet performs worst, perhaps due to its reliance on the rigid structure predicted by an external parser and the mismatches between the predicted structure and the spatial relations between objects. ViLBERT achieves the highest performance and is relatively more robust than the other models. In the next section, we propose methods to further increase the robustness of ViLBERT.

(Table 2: model accuracies on the RefCOCOg Dev and Test splits and on Ref-Easy, Ref-Hard and Ref-Adv.)

Increasing the robustness of ViLBERT
We extend ViLBERT in two ways, one based on contrastive learning using negative samples, and the other based on multi-task learning on GQA (Hudson and Manning, 2019), a task that requires linguistic and spatial reasoning on images.
Contrastive learning using negative samples
Instead of learning from a single example, contrastive learning learns from multiple examples by comparing one to the others. In order to increase sensitivity to linguistic structure, we mine negative examples that are close to the current example and learn to jointly minimize the loss on the current (positive) example and maximize the loss on the negative examples. We treat the triplets ⟨i, e, b⟩ in the training set as positive examples, where i, e and b stand for the image, the expression and the ground-truth bounding box. For each triplet ⟨i, e, b⟩, we sample another training example ⟨i′, e′, b′⟩ and use it to create two negative samples, ⟨i′, e, b′⟩ and ⟨i, e′, b⟩, i.e., we pair expressions with wrong bounding boxes and bounding boxes with wrong expressions. For efficiency, we only consider negative pairs from the mini-batch. We modify the batch loss function as follows:

\mathcal{L} = \sum_{\langle i,e,b\rangle \in \mathcal{B}} \Big[ \ell(i,e,b) + F_{\langle i',e',b'\rangle \in \mathcal{B},\, \langle i',e',b'\rangle \neq \langle i,e,b\rangle} \Big( \big[\tau - \ell(i',e,b')\big]_+ + \big[\tau - \ell(i,e',b)\big]_+ \Big) \Big]

Here ℓ(i, e, b) is the cross-entropy loss of ViLBERT, [x]_+ is the hinge function max(0, x), τ is the margin parameter, and B is the mini-batch. F is a function over all negative samples in the batch: we define F to be either the sum of hinges (Sum-H) or the max of hinges (Max-H). While Sum-H sums over all negative samples, Max-H considers only the hardest negative. If the batch size is n, then for each ⟨i, e, b⟩ there are n−1 negatives of the form ⟨i′, e, b′⟩ and n−1 of the form ⟨i, e′, b⟩ (one of each per sampled ⟨i′, e′, b′⟩). Similar objectives are known to increase the robustness of vision-and-language models for problems such as visual-semantic embeddings and image description ranking (Kiros et al., 2014; Gella et al., 2017; Faghri et al., 2018).
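As an illustration, the sketch below computes the Sum-H and Max-H objectives from a precomputed matrix of ViLBERT cross-entropy losses, where entry (i, j) is the loss of pairing the image and ground-truth box of batch sample i with the expression of sample j. This interface, the function name, and the margin value are assumptions made for exposition rather than the released implementation, which additionally subsamples negatives from the batch (see the Appendix).

```python
import torch

def contrastive_batch_loss(loss_matrix, tau=0.2, mode="sum"):
    """Sketch of the hinge-based batch objective (Sum-H / Max-H).

    loss_matrix[i, j]: cross-entropy grounding loss when the image and ground-truth
    box of batch sample i are paired with the referring expression of sample j,
    so the diagonal holds the losses of the positive triplets <i, e, b>.
    """
    n = loss_matrix.size(0)
    pos = torch.diagonal(loss_matrix)                      # l(i, e, b) per sample
    # <i', e, b'>: expression of sample k paired with another sample's image/box.
    neg_wrong_box = torch.clamp(tau - loss_matrix.t(), min=0.0)
    # <i, e', b>: image/box of sample k paired with another sample's expression.
    neg_wrong_expr = torch.clamp(tau - loss_matrix, min=0.0)
    hinges = neg_wrong_box + neg_wrong_expr
    diag = torch.eye(n, dtype=torch.bool, device=loss_matrix.device)
    if mode == "sum":                                      # Sum-H: sum over negatives
        neg_term = hinges.masked_fill(diag, 0.0).sum(dim=1)
    else:                                                  # Max-H: hardest negative only
        neg_term = hinges.masked_fill(diag, float("-inf")).max(dim=1).values
    return (pos + neg_term).sum()

# Example with dummy per-pair losses for a batch of 4:
# losses = torch.rand(4, 4)
# batch_loss = contrastive_batch_loss(losses, tau=0.2, mode="max")
```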
Multi-task Learning (MTL) with GQA
In order to increase sensitivity to linguistic structure, we rely on tasks that require reasoning over linguistic structure and learn to perform them alongside our task. We employ MTL with GQA (Hudson and Manning, 2019), a compositional visual question answering dataset. Specifically, we use the GQA-Rel split, which contains questions that require reasoning on both linguistic structure and spatial relations (e.g., Is there a boy wearing a red hat standing next to a yellow bus? as opposed to Is there a boy wearing a hat?). Figure 3 depicts the neural architecture. We share several layers between the tasks so that the model learns representations useful for both. Each shared layer consists of a co-attention transformer block (Co-TRM; Lu et al. 2019) and a transformer block (TRM; Vaswani et al. 2017). While in a transformer attention is computed using queries and keys from the same modality, in a co-attention transformer they come from different modalities (see the cross arrows in Figure 3). The shared representations are eventually passed as input to task-specific MLPs; a sketch of the task-specific heads is given below. We optimize the two tasks with alternating training (Luong et al., 2015).

Table 3 shows the experimental results on the referring expression recognition task. Although contrastive learning improves over the baseline ViLBERT, the gains are modest. MTL training with GQA-Rel significantly improves performance on the Ref-Hard and Ref-Adv splits ( §A.5).
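The following sketch shows only the two task-specific heads that sit on top of the shared Co-TRM/TRM layers: a two-layer MLP over a pooled multimodal representation for GQA-Rel (1842 answer categories, §A.5) and a linear layer producing a per-region matching score for referring expression recognition. The hidden size, module names and pooling are illustrative assumptions; the shared ViLBERT trunk is omitted.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the two task-specific heads on top of the shared ViLBERT layers."""

    def __init__(self, hidden_dim=1024, num_gqa_answers=1842):
        super().__init__()
        # GQA-Rel head: two-layer MLP over a pooled multimodal representation.
        self.gqa_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_gqa_answers),
        )
        # RER head: linear layer scoring every image region against the expression.
        self.rer_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_repr, region_reprs, task):
        if task == "gqa":
            return self.gqa_head(pooled_repr)          # (batch, num_gqa_answers)
        # (batch, num_regions): matching score per candidate bounding box.
        return self.rer_head(region_reprs).squeeze(-1)
```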

Conclusion
Our work shows that current datasets and models for visual referring expressions fail to make effective use of linguistic structure. Although our proposed models are slightly more robust than existing models, there is still significant scope for improvement. We hope that our proposed splits and the Ref-Adv dataset help measure the true progress on this task and encourage the development of models that are more sensitive to linguistic structure.

A Appendix
In this supplementary material, we begin by providing more details on the RefCOCOg dataset to supplement Section 2 of the main paper. We then provide Ref-Adv annotation details, statistics, analysis, and random examples to supplement Section 3 of the main paper. Finally, we provide details of our models (initialization & training, hyperparameters) and show additional results to supplement Section 5 of the main paper.

A.1 RefCOCOg vs. Other Referring Expression Datasets
RefCOCO, RefCOCO+ (Kazemzadeh et al., 2014) and RefCOCOg (Google-RefCOCO; Mao et al. 2016) are three commonly studied visual referring expression recognition datasets for real images. All three datasets are built on top of the MSCOCO dataset (Lin et al., 2014), which contains more than 300,000 images with 80 object categories. RefCOCO and RefCOCO+ were collected using an online interactive game. The RefCOCO dataset is biased towards the person category. RefCOCO+ does not allow the use of location words in the expressions and therefore contains very few spatial relationships. RefCOCOg was not collected in an interactive setting and therefore contains longer expressions. For our adversarial analysis, we chose RefCOCOg for three important reasons. Firstly, expressions in RefCOCOg are longer (by 2.5 times on average) and therefore contain more spatial relationships than in the other two datasets. Secondly, RefCOCOg contains at least 2 to 4 instances of the same object type within the same image referred to by an expression. This makes the dataset more robust and indirectly places higher importance on grounding spatial relationships to find the target object. Finally, as shown in Table 4, RefCOCO and RefCOCO+ are highly skewed towards the person category (≈ 50%), whereas RefCOCOg is relatively less skewed (≈ 36%), more diverse, and less biased.

A.2 Importance of Linguistic Structure
Cirik et al. (2018b) observed that existing models for RefCOCOg rely heavily on biases in the data rather than on linguistic structure. We perform extensive experiments to get more detailed insights into this observation. Specifically, we distort the linguistic structure of referring expressions in the RefCOCOg test split and evaluate SOTA models that are trained on the original, undistorted RefCOCOg training split. Similar to Cirik et al. (2018b), we distort the test split using two methods: (a) randomly shuffle the words in a referring expression, and (b) delete all the words in the expression except for nouns and adjectives. Table 5 shows accuracies for the models with (columns 3 and 4) and without (column 2) distorted referring expressions. Except for the ViLBERT model (Lu et al., 2019), the drop in accuracy is not significant, indicating that spatial relations are ignored in grounding the referring expression.
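A minimal sketch of the two distortions is given below; the NLTK tokenizer and POS tagger are used purely for illustration, since the paper does not specify which tagger was used for the noun/adjective filter.

```python
# Sketch of the two expression distortions: (a) random word-order shuffling and
# (b) keeping only nouns and adjectives. Requires nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import random
import nltk

def shuffle_words(expression, seed=None):
    """Distortion (a): randomly shuffle the word order."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def nouns_and_adjectives_only(expression):
    """Distortion (b): keep only nouns and adjectives, in their original order."""
    tagged = nltk.pos_tag(nltk.word_tokenize(expression))
    kept = [w for w, tag in tagged if tag.startswith("NN") or tag.startswith("JJ")]
    return " ".join(kept)

# Example:
# shuffle_words("the blue bike behind the red car")            -> e.g. "car the red bike behind blue the"
# nouns_and_adjectives_only("the blue bike behind the red car") -> "blue bike red car"
```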
Using the relatively robust ViLBERT model, we repeat this analysis on our splits Ref-Easy, Ref-Hard and Ref-Adv. We randomly sampled 1500 expressions from each of these splits and compared ViLBERT's performance on the three sets. As shown in Table 6, we find a large drop in the model's accuracy on Ref-Hard and Ref-Adv. This clearly indicates that grounding the expressions in both of these splits requires linguistic and spatial reasoning.

A.3 Ref-Adv Annotation
We construct Ref-Adv using all 9602 referring expressions from the RefCOCOg test split. As shown in Figure 5, we follow a three-stage approach to collect the new samples.

Stage 1: For every referring expression in the RefCOCOg test split, we perturb its linguistic structure by shuffling the word order randomly. We show each perturbed expression, along with the image and all object bounding boxes, to five qualified Amazon Mechanical Turk (AMT) workers and ask them to identify the ground-truth bounding box for the shuffled referring expression. We hired workers from the US and Canada with approval rates higher than 98% and more than 1000 accepted HITs. At the beginning of the annotation, the workers go through a familiarization phase to become familiar with the task. We consider all image-expression pairs for which at least 3 out of 5 annotators failed to locate the object correctly (IoU < 0.5) as hard samples (Ref-Hard), and those for which at least 3 out of 5 annotators localized the object correctly as easy samples (Ref-Easy). On average, we found that humans failed to localize the objects correctly for 17% of the expressions.

Model                            Original  Shuf  N+J
CMN (Hu et al., 2017)            69.4      66.4  67.4
GroundNet (Cirik et al., 2018a)  65.8      57.6  62.8
MattNet (Yu et al., 2018)        78.5      75.3  76.1
ViLBERT (Lu et al., 2019)        83.6      71.4  73.6

Table 5: Accuracy on the RefCOCOg test split with the original expressions (Original), with randomly shuffled word order (Shuf), and with only nouns and adjectives retained (N+J).
Stage 2: We take the Ref-Hard images and ask workers to generate adversarial expressions such that the target object is different from the original object. More concretely, for each hard sample we identify the image regions that human annotators most often confused with the original object in Stage 1 and treat them as the target objects. For each of these target objects, we then ask three AMT workers to write a new referring expression while retaining most of the content words of the original expression.

Stage 3: We filter out the noisy adversarial expressions generated in Stage 2 by following a validation routine used in the creation of the RefCOCOg dataset. We ask three additional AMT workers to select a bounding box identifying the target object of the adversarial expression, and we remove the noisy samples for which the inter-annotator agreement among the workers is low. Only the samples for which at least 2 out of 3 annotators achieve IoU > 0.5 are added to the Ref-Adv dataset.

A.4 Dataset Analysis, Comparison, and Visualization
In Table 7, we compare the three splits. Figure 7 shows the relative frequency of the most frequent spatial relationships in all three splits. As we can see, Ref-Adv comprises rich and diverse spatial relationships.

A.5 Model and Training Details

From the GQA dataset, we keep only relational questions by applying the following constraint on question types: type.Semantic='rel'. We also apply this constraint when filtering the development set. We denote this subset as GQA-Rel. We consider GQA-Rel instead of GQA for two reasons: 1) GQA-Rel is a task more closely related to RefCOCOg; and 2) MTL training with the full GQA set is computationally expensive. For each question in the dataset, there exists a long answer (free-form text) and a short answer (containing one or two words). We only consider the short answers and treat the unique set of answers as output categories. While the full GQA dataset has 3129 output categories, GQA-Rel contains only 1842 categories.
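A sketch of this filtering step is shown below, assuming the standard GQA question JSON layout in which each entry carries a types dictionary with a semantic field and a short answer field; the file name and field names are assumptions about the GQA release rather than details from the paper.

```python
import json

def build_gqa_rel(question_file):
    """Filter a GQA question file down to relational questions (GQA-Rel)."""
    with open(question_file) as f:
        questions = json.load(f)          # assumed layout: {question_id: question_dict}

    # Keep only relational questions (type.Semantic = 'rel').
    gqa_rel = {
        qid: q for qid, q in questions.items()
        if q.get("types", {}).get("semantic") == "rel"
    }

    # Treat the unique set of short answers as the output categories.
    answer_vocab = sorted({q["answer"] for q in gqa_rel.values()})
    return gqa_rel, answer_vocab

# train_rel, train_answers = build_gqa_rel("train_balanced_questions.json")
# The same filter is applied to the development split.
```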
Single-Task Fine-tuning on RefCOCOg
We follow the setup of Yu et al. (2018). In order to fine-tune the baseline ViLBERT (Lu et al., 2019) model on the RefCOCOg dataset, we pass the ViLBERT visual representation of each bounding box into a linear layer to predict a matching score (similar to the RefCOCO+ training in Lu et al. 2019). We calculate accuracy using the IoU metric (a prediction is correct if IoU(predicted region, ground-truth region) > 0.5). We use a binary cross-entropy loss and train the model for a maximum of 25 epochs, with early stopping based on validation performance. We use an initial learning rate of 4e-5 and a linear-decay learning rate schedule with warm-up. We train on 8 Tesla V100 GPUs with a total batch size of 512.
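A minimal sketch of this grounding head is shown below; the hidden size, tensor shapes and the way ViLBERT features are obtained are illustrative assumptions, and only the scoring layer and loss are shown.

```python
import torch.nn as nn

class GroundingHead(nn.Module):
    """Linear layer mapping each region's visual representation to a matching score."""

    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_reprs):
        # region_reprs: (batch, num_regions, hidden_dim) from the visual stream
        return self.score(region_reprs).squeeze(-1)   # (batch, num_regions)

head = GroundingHead()
loss_fn = nn.BCEWithLogitsLoss()
# targets: 1 for regions with IoU > 0.5 with the ground-truth box, else 0
# logits = head(vilbert_visual_features)
# loss = loss_fn(logits, targets)
```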
Negative Mining
We use a batch size of 512 and randomly sample negatives from the mini-batch for computational efficiency: we sample 64 negatives from each batch for both the Sum of Hinges and Max of Hinges losses. We tune the margin parameter τ on the development split. We train the model for a maximum of 25 epochs, with early stopping based on validation performance. We use an initial learning rate of 4e-5 and a linear-decay learning rate schedule with warm-up. We train on 8 Tesla V100 GPUs with a total batch size of 512.

Multi-Task Learning (MTL) with GQA-Rel
The multi-task learning architecture is shown in Figure 3 of the main paper. The shared layers consist of the transformer blocks (TRM) and co-attentional transformer blocks (Co-TRM) of ViLBERT (Lu et al., 2019). The task-specific layer for GQA is a two-layer MLP, and we treat GQA as a multi-class classification task; the task-specific layer for referring expression recognition (RER) is a linear layer that predicts a matching score for each image region given an input referring expression. The weights of the task-specific layers are randomly initialized, whereas the shared layers are initialized with weights pretrained on 3.3 million image-caption pairs from the Conceptual Captions dataset (Sharma et al., 2018). We use a binary cross-entropy loss for both tasks. Similar to Luong et al. (2015), during training we optimize the two tasks alternately in mini-batches according to a mixing ratio (a sketch of this schedule is given below). We use early stopping based on validation performance. We use an initial learning rate of 4e-5 for RefCOCOg and 2e-5 for GQA, and a linear-decay learning rate schedule with warm-up. We train on 4 RTX 2080 GPUs with a total batch size of 256.

Table 3 in the main paper showed that MTL training with GQA-Rel significantly improved the performance of the model on the Ref-Hard and Ref-Adv splits. In addition, we also observed a significant improvement on the GQA-Rel development split and on the GQA development and test splits, as shown in Table 9.
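The alternating schedule can be sketched as follows; the mixing ratio value, loader interface and function names are illustrative assumptions.

```python
import itertools
import random

def mtl_training(rer_batches, gqa_batches, train_step, mixing_ratio=0.5,
                 num_steps=1000, seed=0):
    """Alternating mini-batch schedule (in the style of Luong et al., 2015).

    train_step(task, batch) runs one forward/backward/update step for that task;
    at each step a task is picked according to the mixing ratio.
    """
    rng = random.Random(seed)
    rer_iter = itertools.cycle(rer_batches)   # cycle through each task's mini-batches
    gqa_iter = itertools.cycle(gqa_batches)
    for _ in range(num_steps):
        if rng.random() < mixing_ratio:
            train_step("rer", next(rer_iter))
        else:
            train_step("gqa", next(gqa_iter))
```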

A.5.3 Additional Experiments
In this subsection, we present results of additional experiments using transfer learning (TL) and multi-task learning (MTL) with ViLBERT on the VQA, GQA, and GQA-Rel tasks. As shown in Table 10, TL with VQA showed a slight improvement. However, TL with GQA, TL with GQA-Rel, and MTL with VQA did not show any improvements.