Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Recent works have shown that supervised models often exploit data artifacts to achieve good test scores while their performance severely degrades on samples outside their training distribution. Contrast sets (Gardner et al., 2020) quantify this phenomenon by perturbing test samples in a minimal way such that the output label is modified. While most contrast sets were created manually, requiring intensive annotation effort, we present a novel method which leverages rich semantic input representations to automatically generate contrast sets for the visual question answering task. Our method computes the answer of perturbed questions, thus vastly reducing annotation cost and enabling thorough evaluation of models' performance on various semantic aspects (e.g., spatial or relational reasoning). We demonstrate the effectiveness of our approach on the GQA dataset and its semantic scene graph image representation. We find that, despite GQA's compositionality and carefully balanced label distribution, two high-performing models drop 13-17% in accuracy compared to the original test set. Finally, we show that our automatic perturbation can be applied to the training set to mitigate the degradation in performance, opening the door to more robust models.


Introduction
NLP benchmarks typically evaluate in-distribution generalization, where test sets are drawn i.i.d. from a distribution similar to that of the training set. Recent works showed that high performance on test sets sampled in this manner is often achieved by exploiting systematic gaps, annotation artifacts, lexical cues, and other heuristics, rather than by learning meaningful task-related signal. As a result, the out-of-domain performance of these models often deteriorates severely (Jia and Liang, 2017; Ribeiro et al., 2018; Gururangan et al., 2018; Geva et al., 2019; McCoy et al., 2019; Feng et al., 2019; Stanovsky et al., 2019).

[Figure 1: Illustration of our approach based on an example from the GQA dataset. Top: QA pairs and an image annotated with bounding boxes from the scene graph. Bottom: relations among the objects in the scene graph. The first line at the top is the original QA pair; the following three lines show our perturbed questions, where replacing a single element in the question (a fence) with other options (a wall, men, an elephant) leads to a change in the output label. For each QA pair, the LXMERT prediction is shown.]

Recently, Kaushik et al. (2019) and Gardner et al. (2020) introduced the contrast set approach to probe out-of-domain generalization. Contrast sets are constructed via minimal modifications to test inputs, such that their label is modified. For example, in Fig. 1, replacing "a fence" with "a wall" changes the answer from "Yes" to "No". Since such perturbations introduce minimal additional semantic complexity, robust models are expected to perform similarly on the test and contrast sets. However, a range of NLP models degrade severely in performance on contrast sets, hinting that they do not generalize well (Gardner et al., 2020). With two recent exceptions for textual datasets (Li et al., 2020; Rosenman et al., 2020), contrast sets have so far been built manually, requiring extensive human effort and expertise.
In this work, we propose a method for automatic generation of large contrast sets for visual question answering (VQA). We experiment with the GQA dataset (Hudson and Manning, 2019). GQA includes semantic scene graphs (Krishna et al., 2017) representing the spatial relations between objects in the image, as exemplified in Fig. 1. The scene graphs, along with functional programs that represent the questions, are used to balance the dataset, thus aiming to mitigate spurious dataset correlations. We leverage the GQA scene graphs to create contrast sets, by automatically computing the answers to question perturbations, e.g., verifying that there is no wall near the puddle in Fig. 1.
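To make the scene-graph representation concrete, the sketch below encodes the Fig. 1 graph in Python. The field names and helper are our own simplification for illustration, not the exact GQA schema:

```python
# Minimal, illustrative encoding of the Fig. 1 scene graph. Field names are
# our own simplification; the actual GQA schema differs in detail.
scene_graph = {
    "objects": {
        "o1": {"name": "zebra", "attributes": []},
        "o2": {"name": "fence", "attributes": ["wood"]},
        "o3": {"name": "puddle", "attributes": []},
    },
    # (subject_id, relation, object_id) triples
    "relations": [
        ("o3", "to the right of", "o1"),
        ("o3", "near", "o2"),
    ],
}

def related_names(graph, name, relation):
    """Names of all objects linked to `name` by `relation` (either direction)."""
    objs = graph["objects"]
    ids = {i for i, o in objs.items() if o["name"] == name}
    out = set()
    for s, r, t in graph["relations"]:
        if r != relation:
            continue
        if s in ids:
            out.add(objs[t]["name"])
        if t in ids:
            out.add(objs[s]["name"])
    return out

assert related_names(scene_graph, "puddle", "near") == {"fence"}
```

Answering a perturbed question then reduces to queries over this structure, e.g., checking whether any object named "wall" is near the puddle.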
We create automatic contrast sets for 29K samples, or ≈22% of the validation set, and manually verify the correctness of 1,106 of these samples on Mechanical Turk. We then evaluate two leading models, LXMERT (Tan and Bansal, 2019) and MAC (Hudson and Manning, 2018), on our contrast sets, and find a 13-17% reduction in performance compared to the original validation set. Finally, we show that our automatic method for contrast set construction can be used to improve performance by employing it during training. We augment the GQA training set with automatically constructed training contrast sets (adding 80K samples to the existing 943K in GQA), and observe that when trained on the augmented data, both LXMERT and MAC improve by about 14% on the contrast sets, while maintaining their original validation performance.
Our key contributions are: (1) We present an automatic method for creating contrast sets for VQA datasets with structured input representations; (2) We automatically create contrast sets for GQA, and find that for two strong models, performance on the contrast sets is lower than on the original validation set; and (3) We apply our method to augment the training data, improving both models' performance on the contrast sets.

Automatic Contrast Set Construction
To construct automatic contrast sets for GQA, we first identify a large subset of questions requiring specific reasoning skills (§2.1). Using the scene graph representation, we perturb each question in a manner which changes its gold answer (§2.2). Finally, we validate the automatic process via crowdsourcing (§2.3).

Identifying Recurring Patterns in GQA
The questions in the GQA dataset present a diverse set of modelling challenges, as exemplified in Table 1, including object identification and grounding, spatial reasoning, and color identification. Following the contrast set approach, we create perturbations testing whether models are capable of solving questions that require this skill set but diverge from their training distribution.
To achieve this, we identify commonly recurring question templates which specifically require such skills. For example, to answer the question "Are there any cats near the boat?" a model needs to identify objects in the image (cats, boat), link them to the question, and identify their relative position.
We identify six question templates, testing various skills (Table 1). We abstract each question template with a regular expression which identifies the question types as well as the physical objects, their attributes (e.g., colors), and spatial relations. Overall, these regular expressions match 29K questions in the validation set (≈22%), and 80K questions in the training set (≈8%).
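For illustration, the sketch below shows the style of regular expression abstraction for two of the six templates; the patterns here are simplified stand-ins, not the exact expressions in the released code:

```python
import re

# Illustrative patterns for two of the six question templates; the released
# perturbation code uses more permissive expressions than these.
TEMPLATES = {
    "side": re.compile(r"^On which side is the (?P<x>[\w ]+)\?$", re.I),
    "near": re.compile(
        r"^(Is|Are) there (?:any )?(?P<x>[\w ]+) near the (?P<y>[\w ]+)\?$",
        re.I,
    ),
}

def match_template(question):
    """Return (template_name, captured objects) for the first matching pattern."""
    for name, pattern in TEMPLATES.items():
        m = pattern.match(question)
        if m:
            return name, m.groupdict()
    return None, {}

print(match_template("Are there any cats near the boat?"))
# -> ('near', {'x': 'cats', 'y': 'boat'})
```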

Perturbing Questions with Scene Graphs
We design a perturbation method which guarantees a change in the gold answer for each question template. For example, looking at Fig. 2, for the question template "Are there X near the Y?" (e.g., "Is there any fence near the players?"), we replace either X or Y with a probable distractor (e.g., replacing "fence" with "trees").
We use the scene graph to ensure that the answer to the question is indeed changed. In our example, this entails grounding "players" in the question to the scene graph (either via exact match or via several other heuristics, such as hard-coded lists of synonyms or co-hyponyms), locating its neighbors, and verifying that none of them are "trees". We then apply heuristics to fix syntax (e.g., changing from a singular to a plural determiner, see Appendix A.3), and verify that the perturbed sample does not already exist in GQA.

Table 1: Question templates, the attributes they test, and example perturbations.

Question template | Tested attributes | Example
On which side is the X? | Relational (left vs. right) | On which side is the dishwasher? → On which side are the dishes?
What color is the X? | Color identification | What color is the cat? → What color is the jacket?
Do you see X or Y? | Compositions | Do you see laptops or cameras? → Do you see headphones or cameras?
Are there X near the Y? | Spatial, relational | Are there any cats near the boat? → Is there any bush near the boat?
Is the X Rel the Y? | Spatial, relational | Is the boy to the right of the man? → Is the boy to the left of the man?
Is the X Rel the Y? | Spatial, relational | Is the boy to the right of the man? → Is the zebra to the right of the man?

The specific perturbation is performed per question template. In question templates with two objects (X and Y), we replace X with X', such that X' is correlated with Y in other GQA scene graphs. In question templates with a single object X, we replace X with a textually similar X'. For example, in the first row of Table 1 we replace dishwasher with dishes. Our perturbation code is publicly available. This process may yield an arbitrarily large number of contrasting samples per question, as there are many candidates for replacing the objects participating in a question. We report experiments with up to 1, 3 and 5 contrasting samples per question.
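The core answer-flipping check can be sketched as follows for the "Are there X near the Y?" template with gold answer "yes". This is an illustration of the procedure rather than the released implementation; the object sets are assumed to be extracted from the scene graph, and determiner handling is simplified (see Appendix A.3):

```python
import random

def perturb_near_question(y, objects_in_scene, objects_near_y, candidates,
                          rng=random.Random(0)):
    """Sketch of the answer-flipping check for 'Are there X near the Y?'
    questions whose gold answer is 'yes'. Not the released implementation."""
    pool = list(candidates)
    rng.shuffle(pool)
    for x_new in pool:
        # The distractor must not appear near Y -- here we use the stricter
        # condition that it is absent from the scene entirely -- so that the
        # gold answer provably flips from "yes" to "no".
        if x_new not in objects_near_y and x_new not in objects_in_scene:
            return f"Is there a {x_new} near the {y}?", "no"
    # No valid distractor found: skip rather than risk a wrong label.
    return None

example = perturb_near_question(
    y="puddle",
    objects_in_scene={"zebra", "fence", "puddle"},
    objects_near_y={"fence"},
    candidates=["wall", "tree", "fence"],
)
print(example)  # e.g. ('Is there a wall near the puddle?', 'no')
```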
Illustrating the perturbation process. Looking at Fig. 1, we see the scene-graph information: objects have bounding boxes around them in the image (e.g., zebra); objects have attributes (wood is an attribute of the fence object); and there are relationships between the objects (the puddle is to the right of the zebra, and it is near the fence). The original (question, answer) pair is ("is there a fence near the puddle?", "Yes"). We first identify the question template via regular expressions: "Is there X near the Y", and isolate X=fence, Y=puddle. The answer is "Yes", so we know that X is indeed near Y. We then use the information given in the scene graph and search for an X' that is not near Y. To achieve this, we sample a random object (wall) and verify that it does not exist in the set of scene-graph objects. This results in the perturbed example "Is there a wall near the puddle?", for which the ground truth is computed to be "No". Consider a different example: ("Is the puddle to the left of the zebra?", "Yes"). We identify the question template "Is the X Rel the Y", where X=puddle, Rel=to the left, Y=zebra. The answer is "Yes". Now we can easily change Rel'=to the right, resulting in the (question, answer) pair ("Is the puddle to the right of the zebra?", "No").
We highlight the following: (1) This process is done entirely automatically (we validate it in Section 2.3); (2) The answer is deterministic given the information in the scene graph; (3) We do not produce unanswerable questions: if we cannot find an alternative atom for which the presuppositions hold, we do not create the perturbed (question, answer) pair; (4) Grounding objects from the question to the scene graph can be tricky. It can involve exact match, number match (dogs in the question vs. dog in the scene graph), hyponyms (animal in the question vs. dog in the scene graph), and synonyms (motorbike in the question vs. motorcycle in the scene graph). The details are in the published code; (5) The only difference between the original and the perturbed instance is a single atom: an object, relationship, or attribute.
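Item (4) can be illustrated with the following sketch; the tiny synonym and hyponym lexicons are stand-ins for the hard-coded lists mentioned above, and the plural stripping is deliberately naive:

```python
# Illustrative grounding of a question mention to a scene-graph object name,
# mirroring the heuristics in item (4); the lexicons here are stand-ins for
# the hard-coded lists used in the released code.
SYNONYMS = {"motorbike": "motorcycle"}
HYPONYMS = {"animal": {"dog", "cat", "zebra"}}

def singularize(word):
    """Naive plural stripping ('dogs' -> 'dog'); real code needs more cases."""
    return word[:-1] if word.endswith("s") else word

def ground(mention, scene_object_names):
    """Return the scene-graph object a question mention refers to, or None."""
    if mention in scene_object_names:                # exact match
        return mention
    if singularize(mention) in scene_object_names:   # number match
        return singularize(mention)
    if SYNONYMS.get(mention) in scene_object_names:  # synonym match
        return SYNONYMS[mention]
    for name in HYPONYMS.get(mention, ()):           # hyponym match
        if name in scene_object_names:
            return name
    return None

print(ground("dogs", {"dog", "tree"}))       # -> 'dog'
print(ground("animal", {"zebra", "fence"}))  # -> 'zebra'
```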

Validating Perturbed Instances
To verify the correctness of our automatic process, we sampled 553 images, each with an original and a perturbed QA pair, for a total of 1,106 instances (≈4% of the validation contrast pairs). The (image, question) pairs were answered independently by human annotators on Amazon Mechanical Turk (see Fig. 3 in Appendix A.4), who were not told whether a question originated from GQA or from our automatic contrast set. We found that the workers were able to correctly answer 72.3% of the perturbed questions, slightly lower than their performance on the original questions (76.6%). Note that each perturbation aims to change the label in a predetermined manner, e.g., from "yes" to "no". We observed high agreement between annotators (κ = 0.679).
Our analysis shows that the difference in human performance between the perturbed and the original questions can be attributed to scene graph annotation errors in the GQA dataset: 3.5% of the 4% difference is caused by discrepancies between the image and the scene graph (objects appearing in the image but not in the graph, and vice versa). Examples are available in Fig. 5 in Appendix A.5.
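The reported agreement (κ = 0.679) can be computed with standard tooling; below is a minimal sketch using scikit-learn's Cohen's κ for a pair of annotators. The label vectors are placeholders, not the actual annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Sketch: inter-annotator agreement on the validated sample.
# Placeholder yes/no labels from two annotators, not the real data.
annotator_a = ["yes", "no", "no", "yes", "no", "yes"]
annotator_b = ["yes", "no", "yes", "yes", "no", "yes"]
print(cohen_kappa_score(annotator_a, annotator_b))
```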

Experiments
We experiment with two top-performing GQA models, MAC (Hudson and Manning, 2018) and LXMERT (Tan and Bansal, 2019), to test their generalization on our automatic contrast sets. This evaluation leads to several key observations.
Models struggle with our contrast set. Both models drop 13-17% in accuracy on the contrast sets compared to the original validation set (see Table 2). The color identification questions had the smallest performance drop, potentially because the models' performance on such multi-class, subjective questions is relatively low to begin with.
Training on the perturbed set leads to more robust models. Previous works tried to mitigate spurious dataset biases by explicitly balancing labels during dataset construction (Goyal et al., 2017; Zhu et al., 2016; Zhang et al., 2016) or using adversarial filtering (Zellers et al., 2018, 2019). In this work we take an inoculation approach (Liu et al., 2019) and augment the original GQA training set with contrast training data, resulting in a total of 1,023,607 training samples. We retrain both models on the augmented training data, and observe in Table 2 that their performance on the contrast set almost matches that of the original validation set, with no loss (MAC) or only minor loss (LXMERT) to original validation accuracy. These results indicate that the perturbed training set is a valuable signal, which helps models recognize more patterns.
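Since the generated contrast questions share the GQA format, augmentation reduces to concatenating the perturbed training questions with the original ones. A minimal sketch, with hypothetical file names and assuming GQA's convention of JSON dictionaries keyed by question id:

```python
import json

# Sketch of the training-set augmentation; file names are hypothetical.
with open("gqa_train_questions.json") as f:
    train = json.load(f)        # 943K original samples
with open("contrast_train_questions.json") as f:
    contrast = json.load(f)     # 80K automatically perturbed samples

# Merge, assuming disjoint question ids for original and perturbed samples.
augmented = {**train, **contrast}
assert len(augmented) == len(train) + len(contrast)

with open("gqa_train_augmented.json", "w") as f:
    json.dump(augmented, f)
```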
Contrast Consistency. Our method can be used to generate many augmented questions by simply sampling more items for replacement (Section 2). This allows us to measure the contrast consistency (Gardner et al., 2020) of our contrast set, defined as the percentage of contrast sets for which a model's predictions are correct on all examples in the set (including the original example). For example, in Fig. 1, a model is counted as consistent only if it correctly answers the original question and all of its perturbations.
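A sketch of how this metric can be computed, assuming predictions and gold answers keyed by question id and a grouping of each original question with its perturbations (the names here are our own, not the paper's evaluation code):

```python
def contrast_consistency(predictions, gold, groups):
    """Fraction of contrast sets answered entirely correctly.
    `groups` maps a group id to the question ids of the original question
    plus all of its perturbations; a sketch of the Gardner et al. (2020)
    metric, not the paper's evaluation code."""
    consistent = sum(
        all(predictions[q] == gold[q] for q in qids)
        for qids in groups.values()
    )
    return consistent / len(groups)

# Toy usage: one group answered fully correctly, one with a single miss.
gold = {"q1": "yes", "q1a": "no", "q2": "yes", "q2a": "no"}
pred = {"q1": "yes", "q1a": "no", "q2": "yes", "q2a": "yes"}
print(contrast_consistency(pred, gold, {"g1": ["q1", "q1a"],
                                        "g2": ["q2", "q2a"]}))  # -> 0.5
```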

Discussion and Conclusion
Our results suggest that both MAC and LXMERT under-perform when tested out of distribution. A remaining question is whether this is due to model architecture or to dataset design. Bogin et al. (2020) claim that both models are prone to fail at compositional generalization because they do not decompose the problem into smaller sub-tasks; our results support this claim. On the other hand, it is possible that a different dataset could prevent these models from finding shortcuts. Is there a dataset that can prevent all shortcuts? Our automatic method for creating contrast sets allows us to ask such questions, and we believe that future work on better training mechanisms, as suggested by Bogin et al. (2020) and Jin et al. (2020), could help build more robust models.

We proposed an automatic method for creating contrast sets for VQA datasets that use annotated scene graphs. We created contrast sets for the GQA dataset, which is designed to be compositional, balanced, and robust to statistical biases, and observed a large performance drop between the original and augmented sets. As our contrast sets can be generated cheaply, we further augmented the GQA training data with additional perturbed questions, and showed that this improves both models' performance on the contrast set. Our proposed method can be extended to other VQA datasets.

A Appendix

Ethical Considerations
We created contrast sets automatically, and verified their correctness via crowdsourced annotation of a sample of roughly 1K instances. Section 2.3 describes the annotation process on Amazon Mechanical Turk. The images and original questions were sampled from the public GQA dataset (Hudson and Manning, 2019), in the English language. Fig. 3 in Appendix A.4 provides an example of the annotation task. Overall, the crowdsourcing task amounted to ≈6 hours of work, paid at an average of 11 USD per hour per annotator.
Reproducibility. The augmentations were performed on a MacBook Pro laptop. Augmenting the validation data takes < 1 hour per question template, and the training data < 3 hours per question template; the overall process takes < 24 hours.
The model configurations were modified to exclude the validation set from the training process. The experiments were performed on a Linux virtual machine with an NVIDIA Tesla V100 GPU. Training took ∼1-2 days for each model; validation took ∼30 minutes.

Table 3 shows the breakdown of the performance of the MAC and LXMERT models per question type, on both the original GQA validation set and the validation contrast sets.

A.1 Generated Contrast Sets Statistics
A.2 LXMERT Pre-training

The LXMERT model has two stages of training: pre-training on several datasets (which include the GQA training and validation data) and fine-tuning. To avoid inflating results on the validation data, we re-ran the pre-training stage without the GQA data, and fine-tuned on the training sets. Compared to the results in Table 2, we observed lower performance on the original set (∼5% lower) with both models, but the same improvement on the augmented set (∼10% higher).

A.3 Linguistic Heuristics for Question Generation
For each question type, we select an object in the image scene graph and update the question by substituting the reference to this object with another object. When substituting one object for another, we need to adjust the question to keep it fluent. Table 10 shows the specific linguistic rules we verify when performing this substitution.
A.4 Annotation Task for Verifying Generated Contrast Sets

Fig. 3 shows the annotation task presented to Turkers to validate the QA pairs generated by our method.

A.5 Examples

Fig. 5 shows examples of discrepancies between the images and their scene graph annotations.

Table 10: Linguistic rules verified when substituting one object for another (referenced in Appendix A.3).

Linguistic rule | Explanation | Examples
Singular vs. plural | If the noun is singular and countable, add "a" or "an"; if needed, replace "Are" with "Is" (and vice versa) | "a fence", "men", "a boy", "an elephant"
Definite vs. indefinite | Do not change definite articles to indefinite articles, and vice versa | "Is there any fence near the boy?" presupposes that there is a boy in the scene graph, which is not always correct
General vs. specific | Meaning can change when replacing a term with a more general or more specific one | "Cats in the image" => "Animals in the image"; "Animals not in the image" => "Cats not in the image"; the opposite directions do not necessarily hold
Countable vs. uncountable | If the noun is uncountable, do not add "a" or "an" | "a cat", "water"
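A sketch of how the singular/plural and countable/uncountable rules above can be implemented follows; the vowel-initial test and the uncountable-noun list are simplifications of the released heuristics (e.g., "an hour" would be mishandled):

```python
# Sketch of the determiner rules in Table 10; the vowel test and the
# uncountable-noun list are simplifications of the released heuristics.
UNCOUNTABLE = {"water", "grass", "sand"}  # illustrative, not exhaustive

def with_determiner(noun, plural=False):
    """'fence' -> 'a fence', 'elephant' -> 'an elephant'; plural and
    uncountable nouns ('men', 'water') are left unchanged."""
    if plural or noun in UNCOUNTABLE:
        return noun
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"

def fix_copula(noun_phrase, plural=False):
    """Choose 'Is'/'Are' to agree with the substituted noun."""
    return ("Are there " if plural else "Is there ") + noun_phrase

print(fix_copula(with_determiner("elephant")))  # -> Is there an elephant
print(fix_copula("men", plural=True))           # -> Are there men
```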