MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present \textit{MUTANT}, a training paradigm that exposes the model to perceptually similar, yet semantically distinct \textit{mutations} of the input, to improve OOD generalization on benchmarks such as the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in the input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, \textit{MUTANT} does not rely on knowledge about the nature of the train and test answer distributions. \textit{MUTANT} establishes a new state-of-the-art accuracy on VQA-CP with a $10.57\%$ improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.


Introduction
Availability of large-scale datasets has enabled the use of statistical machine learning in vision and language understanding, and has led to significant advances. However, the commonly used evaluation criterion is the performance of models on test samples drawn from the same distribution as the training dataset, which is not a true measure of generalization. Training under this "independent and identically distributed" (i.i.d.) setting can drive decision making to be highly influenced by dataset biases and spurious correlations, as shown in both natural language inference (Kaushik and Lipton, 2018; Poliak et al., 2018; McCoy et al., 2019) and visual question answering (Goyal et al., 2017; Agrawal et al., 2018a; Selvaraju et al., 2020). As such, evaluation on out-of-distribution (OOD) samples has emerged as a metric for generalization.
Visual question answering (VQA) (Antol et al., 2015) is a task at the crucial intersection of vision and language. The aim of VQA models is to provide an answer, given an input image and a question about it. Large datasets (Antol et al., 2015) have been extensively used for developing VQA models. However, over-reliance on datasets can cause models to learn spurious correlations such as linguistic priors (Agrawal et al., 2018a) that are specific to certain datasets and do not generalize to "Out-of-Distribution" (OOD) samples, as shown in Figure 1. While learning patterns in the data is important, learning dataset-specific spurious correlations is not a feature of robust VQA models. Developing robust models has thus become a key pursuit for recent work in visual question answering, through data augmentation (Goyal et al., 2017) and dataset reorganization (Agrawal et al., 2018a).
Every dataset contains biases; indeed, inductive bias is necessary for machine learning algorithms to work. Mitchell (1980) states that an unbiased learner's ability to classify is no better than a lookup from memory. However, this bias has a component which is useful for generalization (positive bias) and a component due to spurious correlations (negative bias). We use the term "positive bias" to denote the correlations that are necessary to perform a task: for instance, the answer to a "What sport is . . . " question is correlated with the name of a sport. The term "negative bias" is used for spurious correlations that may be learned from the data: for instance, always predicting "tennis" as the answer to "What sport . . . " questions. The goal of OOD generalization is to mitigate negative bias while learning to perform the task. However, existing methods such as LMH (Clark et al., 2019) try to remove all biases between question-answer pairs by penalizing examples that can be answered without looking at the image; we believe this to be counterproductive. The analogy of antibiotics, which are designed to remove pathogenic bacteria but also end up removing useful gut microbiome (Willing et al., 2011), is useful to understand this phenomenon.
We present a method that focuses on increasing positive bias and mitigating negative bias, to address the problem of OOD generalization in visual question answering. Our approach is to enable the mutation of inputs (questions and images) in order to expose the VQA model to perceptually similar yet semantically dissimilar samples. The intuition is to implicitly allow the model to understand the critical changes in the input which lead to a change in the answer. This concept of mutations is illustrated in Figure 1. If the color of the frisbee is changed, or the child removed, i.e. when an image-mutation is performed, the answer to the question changes. Similarly, if a word is substituted by an adversarial word (bins→bottles), an antonym, or a negation (healthy→not healthy), i.e. when a question-mutation is performed, the answer also changes. Notice that neither mutation significantly changes the input: most of the pixels in the image and words in the question are unchanged, and the type of reasoning required to answer the question is unchanged. However, the mutation significantly changes the answer.
In this work, we use this concept of mutations to enable models to focus on parts of the input that are critical to the answering process, by training our models to produce answers that are consistent with such mutations. We present a question-type exposure framework which teaches the model that although such linguistic priors may exist in training data (such as the dominant answer "tennis" to "What sport is ..." questions), other sports can also be answers to such questions, thus mitigating negative bias. This is in contrast to Chen et al. (2020a) who focus on using data augmentation as a means for mitigating language bias.
Our method uses a pair-wise training protocol to ensure consistency between answer predictions for the original sample and the mutant sample. Our model includes a projection layer, which projects cross-modal features and true answers to a learned manifold, and uses a Noise-Contrastive Estimation loss (Gutmann and Hyvärinen, 2010) to minimize the distance between these two vectors. Our results establish a new state-of-the-art accuracy of 69.52% on the VQA-CP-v2 benchmark, outperforming the current best models by 10.57%. At the same time, our model achieves the best accuracy (70.24%) on VQA-v2 among models designed for the VQA-CP task.
This work takes a step away from explicit debiasing as a method for OOD generalization and instead proposes amplification of positive bias and implicit attenuation of spurious correlations as the objective. Our contributions are as follows.
• We introduce the Mutant paradigm for training VQA models and a sample-generation mechanism which takes advantage of semantic transformations of the input image or question, for the goal of OOD generalization.
• In addition to the conventional classification task, we formulate a novel training objective using Noise Contrastive Estimation over the projections of cross-modal features and answer embeddings on a shared projection manifold, to predict the correct answer.
• Our pairwise consistency loss acts as a regularizer that seeks to bring the distance between ground-truth answer vectors closer to the distance between predicted answer vectors for a pair of original and mutant inputs.
• Extensive experiments and analyses demonstrate advantages of our method on the VQA-CP dataset, and establish a new state-of-the-art of 69.52%, an improvement of 10.57%.
We consider the open-ended VQA problem as a multi-class classification problem. The VQA dataset consists of questions Q_i ∈ Q, images I_i ∈ I, and answers a_i ∈ A. Many contemporary VQA models such as Up-Dn (Anderson et al., 2018) and LXMERT (Tan and Bansal, 2019) first extract cross-modal features from the image and question using attention layers, and then use these features as inputs to a neural network answering module which predicts the answer classes. In this section we define our Mutant paradigm under this formulation of the VQA task.

Concept of Mutations
Let X = (Q, I) denote an input to the VQA system with true answer a. A Mutant input X* is created by a small transformation in the image (Q, I*) or in the question (Q*, I) such that this transformation leads to a new answer a*, as shown in Figure 1. There are three categories of transformation T that create the mutant input X* = T(X): addition, removal, and substitution. For image mutations, these correspond to addition or removal of objects, and morphing the attributes of objects, such as color, texture, and lighting condition. For instance, addition or removal of a person from the image in Figure 3 changes the answer to the question "How many persons are pictured". Question mutations can be performed by addition of a negative word ("no", "not", etc.) to the question, masking critical words in the question, or substituting an object-word with an antonym or adversarial word. Thus for each sample in the VQA dataset, we can obtain a mutant sample and use it for training.
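The notion of a mutation as a small, answer-changing transformation can be sketched in a few lines of Python. The `VQASample` container and `substitute` helper are illustrative names of ours, not part of any released code:

```python
from dataclasses import dataclass

@dataclass
class VQASample:
    question: str
    image_id: str
    answer: str

def substitute(sample: VQASample, old_word: str, new_word: str,
               new_answer: str) -> VQASample:
    """Question mutation T: swap one critical word and update the answer.

    Most words are unchanged, so the mutant X* = T(X) stays perceptually
    close to X while its answer a* differs."""
    mutated = " ".join(new_word if w == old_word else w
                       for w in sample.question.split())
    return VQASample(mutated, sample.image_id, new_answer)

x = VQASample("Is it healthy to eat the item in the bowl ?", "img_42", "yes")
x_star = substitute(x, "healthy", "unhealthy", "no")  # mutant sample
```

The mutant keeps the same image and nearly the same question, but its ground-truth answer flips.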

Training with Mutants
Our method of training with mutant samples relies on three key concepts that supplement the conventional VQA classification task.
Answer Projection: The traditional learning strategy of VQA models optimizes for a standard classification task using softmax cross-entropy: $\mathcal{L}_{CE} = -\sum_i y_i \log \hat{y}_i$, where $y$ is the one-hot vector of the answer class and $\hat{y}$ is the predicted distribution. QA as a classification task is popular since the answer vocabulary follows a long-tailed distribution over the dataset. However, this formulation is problematic since it does not consider the meaning of the answer while making a decision, but instead learns a correlation between the one-hot vector of the answer-class and the input features. Thus to answer the question "What is the color of the banana", models learn a strong correlation between the question features and the answer-class for "yellow", but do not encode the notion of yellowness or greenness of bananas. This key drawback negatively impacts the generalizability of these models to raw green or over-ripe black bananas at test-time.
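For concreteness, the classification objective can be written out in plain Python over a toy three-class answer vocabulary (no framework code implied):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy against a one-hot answer class."""
    return -math.log(softmax(logits)[target_idx])

# Answer classes: 0 = "yellow", 1 = "green", 2 = "black".
# With equal scores on the two wrong answers, the loss cannot tell that
# "green" is semantically closer to "yellow" than "black" is: the
# one-hot target carries no notion of answer semantics.
logits = [2.0, 0.0, 0.0]
loss_green = cross_entropy(logits, 1)
loss_black = cross_entropy(logits, 2)  # identical to loss_green
```

This blindness to answer meaning is exactly the drawback the answer-projection objective below is meant to address.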
To mitigate this, in addition to the classification task, we propose a training objective that operates in the space of answer embeddings. The key idea is to map inputs (image-question pairs) and outputs (answers) to a shared manifold in order to establish a metric of similarity on that manifold. We train a projection layer that learns to project features and answers to the manifold, as shown in Figure 2. We then use Noise Contrastive Estimation (Gutmann and Hyvärinen, 2010) as a loss function to minimize the distance between the projection of the cross-modal features z and the projection of the GloVe vector for the ground-truth answer a, given by:
$\mathcal{L}_{NCE} = -\log \frac{\exp(sim(z_{feat}, z_a))}{\exp(sim(z_{feat}, z_a)) + \sum_{a' \neq a} \exp(sim(z_{feat}, z_{a'}))}$,
where $z_{feat} = f_{proj}(z)$ and $z_a = f_{proj}(\mathrm{GloVe}(a))$.
It is important to note that this similarity metric is not between the true and predicted answers, but between the projections of the input features and the answer, in order to incorporate context in the answering task.
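A minimal pure-Python sketch of a contrastive objective of this form is shown below. The cosine similarity, temperature value, and negative set are our assumptions for illustration; the trained projection layers are not reproduced here:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nce_loss(z_feat, z_pos, z_negs, temperature=0.1):
    """Contrastive loss pulling the projected input features toward the
    projection of the ground-truth answer embedding (z_pos) and away
    from projections of other answers (z_negs)."""
    sims = [cosine(z_feat, z_pos)] + [cosine(z_feat, zn) for zn in z_negs]
    logits = [s / temperature for s in sims]
    m = max(logits)  # log-sum-exp for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

When the feature projection aligns with the correct answer's projection, the loss approaches zero; alignment with a negative drives it up.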
Type Exposure: Linguistic priors in datasets have led models to learn spurious correlations between questions and answers. For instance, in VQA, the most common answer to "What sport ..." is "tennis", and "two" for "How many ..." questions. Our aim is to remove this negative bias from models. Instead of removing all bias, we teach models to identify the question type and learn which answers can be valid for a particular question type, irrespective of their frequency of occurrence in the dataset. For instance, the answers to "How many ..." can be all numbers, answers to "What color ..." can be all colors, and answers to questions such as "Is the / Are there ..." are either yes or no. We call this Type Exposure since it instructs the model that although a strong correlation may exist between a question-answer pair, there are other answers which are also valid for the specific type of question. Our Type Exposure model uses a feedforward network to predict the question type and to create a binary mask over answer candidates that correspond to this type.
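Type Exposure amounts to a binary mask over the answer vocabulary conditioned on the predicted question type. A toy sketch, with a hand-written type-to-answers table standing in for the learned feedforward predictor:

```python
# Illustrative type-to-valid-answers table; in the model this mapping is
# driven by a feedforward question-type predictor, not a lookup.
TYPE_VALID = {
    "what color": {"red", "blue", "green", "yellow", "white", "black"},
    "how many":   {"0", "1", "2", "3", "4", "5"},
    "is the":     {"yes", "no"},
}

def type_mask(question_type, answer_vocab):
    """Binary mask over answer candidates: 1 if valid for this type."""
    valid = TYPE_VALID.get(question_type, set(answer_vocab))
    return [1 if a in valid else 0 for a in answer_vocab]

vocab = ["yes", "no", "2", "tennis", "red"]
mask = type_mask("how many", vocab)  # only "2" survives the mask
```

All type-valid answers remain candidates regardless of their training-set frequency, which is the point of the exposure.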
Pairwise-Consistency: The final component of Mutant is pairwise consistency. We jointly train our models with the original and mutant sample pair, with a loss function that ensures that the distance between the two predicted answer vectors is close to the distance between the two ground-truth answer vectors:
$\mathcal{L}_{cons} = \left| d(z_a^{pred}, z_{a_m}^{pred}) - d(z_a^{GT}, z_{a_m}^{GT}) \right|$,
where $z_a$ is the vector for answer $a$, $d$ is a distance in the answer embedding space, and $m$, $GT$ denote the mutant sample and ground truth respectively.
This pairwise consistency is designed as a regularization that incorporates the notion of semantic shift in answer space as a consequence of a mutation. For instance, consider the image mutation in Figure 3 which changes the ground-truth answer from "two" to "one". This shift in answer-space should be reflected by the predictor.
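Under the stated notation, the regularizer can be sketched in a few lines; Euclidean distance is our assumption here, as the description does not fix the distance function:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_consistency(pred, pred_m, gt, gt_m):
    """Penalize mismatch between the predicted answer-vector shift and
    the ground-truth answer-vector shift caused by a mutation."""
    return abs(euclidean(pred, pred_m) - euclidean(gt, gt_m))
```

If the predictor moves its answer vectors by the same amount that the ground-truth vectors move under the mutation, the penalty is zero.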

Generating Input Mutations for VQA
In order to train VQA models under the mutant paradigm, we need a mechanism to create mutant samples. Mutations are transformations that act on semantic entities in either the image or the question, in ways that can reliably lead to a new answer. For the question, semantic entities are words, while for images, semantic entities are objects. It is important to note that our mutation process is automated and does not use the knowledge about the test set distribution in order to create new samples. In this section, we delineate our automated generation process for both image and question-mutation.

Image Mutations
For image mutation, we first identify critical objects in the image whose manipulation results in a change in the answer, and either remove instances of these objects (removal) or morph their color (substitution).
Removing Object Instances: Removing an instance of an object class can be either critical to the question (i.e., the answer to the question changes) or non-critical (i.e., the answer is unchanged). If an object (or its synonym or hypernym) is mentioned in the question, we deem it critical to the question; otherwise it is deemed non-critical. For each object with M instances in the image, we randomly remove m instances from the image, with m ∈ {0, . . . , M}, using polygon annotations from the COCO (Lin et al., 2014) dataset. Thus for each image we get multiple masked images, as shown in Figure 3. These masked images are fed to a GAN-based inpainting network (Yu et al., 2018) that makes the mutant image photorealistic, and also avoids the model getting cues from the shape of the mask. In the case of numeric questions, if m critical objects are removed, the answer for the mutant image changes from n to n−m. For yes-no questions, removal of all critical objects (m = n) will flip the answer from "yes" to "no", while removing m < n critical objects will not. Note that m = 0 corresponds to the original image and does not result in a change in the answer.
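The answer-update rules for the removal mutation reduce to a small function (illustrative only; names are ours):

```python
def updated_answer(answer, answer_type, n_critical, m_removed):
    """New ground-truth answer after removing m_removed of the
    n_critical instances of a critical object. Mirrors the rules in the
    text: counts shift n -> n - m; yes/no flips only when every critical
    instance is gone; m = 0 leaves the answer unchanged."""
    if m_removed == 0:
        return answer
    if answer_type == "number":
        return str(n_critical - m_removed)
    if answer_type == "yes/no":
        return "no" if m_removed == n_critical else answer
    return answer
```

For example, removing one of two people turns a "How many ..." answer of "2" into "1", but leaves a "yes" answer intact.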
Color Inversion: For the color-change mutation, we use samples with questions about the color of objects in the image. We identify the critical object in the image and change its color by pixel-level inversion in RGB-space. The true answer is replaced with the new color of the critical object. To get objects with new colors, we deliberately avoid using knowledge about the typical colors of objects in the world. In some cases, the new colors of the object may not correspond to real-world scenes, thus forcing the model to actually identify colors rather than answer from language priors such as "bananas are yellow".

Question Mutations
We use three types of question mutations, as shown in the example in Table 1. We identify the critical object as explained in the previous section, and then apply one of three operators. The first operator is negation for yes-no questions, achieved by a template-based procedure that negates the question by adding a "no" or "not" before a verb, preposition, or noun phrase. The second is the use of antonyms or adversarial object-words to substitute critical words. The third mutation masks words in the question and thus introduces ambiguity. Questions for which the new answer cannot be deterministically identified are annotated with a broad category label such as color, location, or fruit, instead of exact answers such as red, library, or apple, which the model cannot be expected to produce since some words have been masked or replaced with adversarial words. Yet, we want the model to be able to identify this broad category of answers even under partially occluded inputs. The answer remains unchanged for mutations with non-critical objects or words.

Mutant Statistics:
We use the training set of VQA-CP-v2 (Agrawal et al., 2018a) to generate mutant samples. For each original sample, we generate on average ∼ 1.5 mutant samples, thus obtaining a total of 679k samples.

Results on VQA-CP-v2 and VQA-v2
We report performance on the two benchmarks VQA-CP-v2 and VQA-v2. On VQA-v2, our model achieves the best performance amongst methods designed specifically for OOD generalization, with an accuracy of 70.24%. This is closest among baselines to the SOTA established by LXMERT trained explicitly for the balanced, i.i.d. setting. To make this point clear, we report the gap between the overall scores for VQA-CP and VQA-v2, following the protocol from Chen et al. (2020a), in Table 3.
Results on VQA-v2 without re-training: Additionally, we use our best model trained on VQA-CP and evaluate it on the VQA test-standard set without re-training on VQA-v2 data. This gives us an overall accuracy of 67.63%, comprising 88.56% on yes-no questions, 50.76% on number-based questions, and 54.56% on other questions. This is better than all existing VQA-CP models that are explicitly trained on VQA-v2, and thus demonstrates the generalizability of our approach.

Analysis
Effect of Training with Mutant Samples: In this analysis, we measure the effect of adding mutant samples to the training data without any architectural changes. We evaluate this on UpDn and LXMERT, as shown in Table 4. Both models improve when exposed to the mutant samples, UpDn by 10.42% and LXMERT by 13.46%. There is a marked jump for both models in the yes-no and number categories. UpDn especially benefits from Mutant samples in terms of number accuracy (a boost of 23.94%).
We also compare our models trained with only image mutations or only question mutations in Table 4. While either alone is worse than training with both types of mutations, question mutations are better than image mutations for yes-no and other questions, while image mutations are better on numeric questions.

Ablation Study:
We conduct ablations to evaluate the efficacy of each component of our method, namely Answer Projection, Type Exposure, and Pairwise Consistency, on both baselines, as shown in Table 5. Introduction of Answer Projection significantly improves yes-no performance, while Type Exposure improves performance on other questions. Introduction of the pairwise consistency loss significantly boosts performance on number questions and yes-no questions. Note that there is only a minor difference between the original and the mutant sample; the model needs to understand this difference, which in turn enables it to reason about the question and predict the new answer. For instance, the model can now learn the correlation between one missing object and a change in answer from "two" to "one" in Figure 3. This improves the counting ability of the VQA model.

Effect of LMH Debiasing on Mutant:
We compare the results of our model when trained with or without the explicit de-biasing method LMH (Clark et al., 2019). LMH is an ensemble-based method trained to avoid dataset biases, and is the best performer on VQA-CP among de-biasing methods. It uses a learned mixing strategy that combines the main model with a bias-only model trained only on the question, without the image; the bias-only model is used to remove biases from the main model. It can be seen from Table 6 that LMH leads to a drop in performance when used in combination with Mutant. This is potentially because, in the process of debiasing, LMH ends up attenuating the positive bias introduced by Mutant that is useful for generalization.

Related Work
De-biasing of VQA datasets: The VQA-v1 dataset (Antol et al., 2015) contained imbalances and language priors between question-answer pairs. This was mitigated by VQA-v2 (Goyal et al., 2017), which balanced the data by collecting complementary images such that each question was associated with two images leading to two different answers. Identifying that the distribution of answers in the VQA dataset led models to learn superficial correlations, Agrawal et al. (2018a) proposed the VQA-CP dataset by re-organizing the train and test splits such that the distribution of answers per question-type was significantly different for each split. Gokhale et al. (2020) explore robustness to logical transformations of questions using first-order logic connectives (and, or, not). Removal of bias has been a focus of Ramakrishnan et al. (2018) and Clark et al. (2019) for the VQA-CP task. We distinguish our work from these by amplifying positive bias and attenuating negative bias.
Data Augmentation: It is important to note that the above work on data de-biasing and robust models focuses on the language priors in VQA; much less attention has been given to visual priors. Recently there has been interest in augmenting VQA training data with counterfactual images (Agarwal et al., 2019; Chen et al., 2020a; Teney et al., 2020a). Our work is the first to address OOD generalization by providing a novel architecture and training paradigm which uses consistency between original and mutant samples as a training objective.
Answer Embeddings: In the early days of VQA, Teney and Hengel (2016) used a combination of image and question representations and answer embeddings to predict the final answer. Hu et al. (2018) learn two embedding functions that transform image-question pairs and answers to a shared latent space. Our method differs from this since we use a combination of classification and an NCE loss on projections of answer vectors, as opposed to a single objective. This means that although the predicted answer is obtained as the most probable answer from a set of candidates, the NCE loss in the answer space embeds the notion of semantic similarity between answers.

Discussion and Conclusion
In this paper, we present a method that uses input mutations to train VQA models with the goal of Out-of-Distribution generalization. Our novel answer projection module trained for minimizing distance between answer and input projections complements the canonical VQA classification task. Our Type Exposure model allows our network to consider all valid answers per question type as equally probable answer candidates, thus moving away from the negative question-answer linguistic priors. Coupled with pairwise consistency, these modules achieve a new state-of-the-art accuracy on the VQA-CP-v2 dataset and reduce the gap between model performance on VQA-v2 data.
We differentiate our work from methods using random adversarial perturbations for robust learning (Madry et al., 2018). Instead, we view input mutations as structured perturbations which lead to a semantic change in the input space and a deterministic change in the output space. We envision that the concept of input mutations can be extended to other vision and language tasks for robustness. Concurrent work in the image classification domain shows that carefully designed perturbations or manipulations of the input can benefit generalization and lead to performance improvements (Chen et al., 2020b; Hendrycks et al., 2019). While perception is a cornerstone of understanding, the ability to imagine changes in the scene or language query, and to predict outputs for that imagined input, allows models to supplement "what" decision making (based on observed inputs) with "what if" decision making (based on imagined inputs).

A.1 VQA-CP
VQA-CP (Agrawal et al., 2018a) is a reorganization of the VQA dataset (Antol et al., 2015; Goyal et al., 2017). The aim of VQA-CP is that the distribution of answers per question type differs between the train and test splits. There are 65 question types based on the prefix of the questions, such as "how many", "what color", "what sport", "is there", "what is the", and "which". In VQA-v2, samples are drawn randomly and independently and assigned either to train or test, thus resulting in the same distribution for both splits:
$P^{VQA}_{train}(A|Q, I) = P^{VQA}_{test}(A|Q, I)$.
In VQA-CP, however, samples are assigned to train or test using a greedy re-splitting algorithm, in a way that makes sure that questions with the same type and same answer are not shared by train and test. It is important to note that there is no leakage between the train and test splits, unlike the original VQA splits.
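A toy sketch of such a re-split: group samples by (question type, answer) and assign each group wholly to one split, so no (type, answer) pair leaks across splits. The real greedy algorithm also shapes per-type answer distributions; this sketch only balances split sizes:

```python
from collections import defaultdict

def resplit(samples):
    """Assign whole (question_type, answer) groups to train or test so
    that no (type, answer) pair is shared between the two splits.
    Greedily processes the largest groups first to balance sizes."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["qtype"], s["answer"])].append(s)
    train_split, test_split = [], []
    for key in sorted(groups, key=lambda k: -len(groups[k])):
        target = train_split if len(train_split) <= len(test_split) else test_split
        target.extend(groups[key])
    return train_split, test_split

samples = ([{"qtype": "what sport", "answer": "tennis"}] * 3
           + [{"qtype": "what sport", "answer": "soccer"}] * 2
           + [{"qtype": "how many", "answer": "2"}] * 2
           + [{"qtype": "how many", "answer": "3"}])
train_split, test_split = resplit(samples)
```

After the split, a model that memorizes the dominant train answer for a question type cannot score on the test split for that (type, answer) pair.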
The train set for VQA-CP-v2 contains 121k images, 245k questions and 2.5M answers, while the test set contains 98k images, 220k questions and 2.2M answers.

A.2 COCO
The source of images in both VQA and VQA-CP is the MS-COCO dataset (Lin et al., 2014). COCO contains natural images representing complex, real-world scenes containing common objects from 91 categories such as "person", "chair", "fork", "horse", and "sports-ball". For each image, COCO provides 5 captions along with bounding boxes and polygon annotations for each object instance in the image.

B Image Mutant Generation Process
In this section we provide additional details about our process for generating mutant samples from original question-image-answer triplets (Q-I-A) in the VQA-CP dataset. For all linguistic operations we use a combination of SpaCy (Honnibal and Montani, 2017) and the LemmInflect library for lemmatization and inflection.

B.1 Selection of Objects
For each VQA sample, a list of words W is created, which contains words from the ground-truth answers and the question. All nouns in W are converted to their singular form. For yes-no questions, numeric questions, and questions about colors of objects, a list of objects O is obtained from COCO. Background and crowd objects are filtered out from O. From O, critical objects O_C and non-critical objects O_NC are obtained. Critical objects are those objects in the image that, when manipulated or removed, may change the answer to the question being asked. We follow a simple heuristic: if an object-word or its synonym or hyponym is present in W, then it is a critical object. Then a critical object o ∈ O_C is chosen at random, and m instances of this object are chosen at random. The polygon annotations (a polygon border) for this object are obtained from the COCO dataset, as shown in Figure 4. Using these annotations, either a removal or color-inversion operation is applied to create the mutant image.
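The critical/non-critical heuristic can be sketched as a simple word-overlap test; the synonym table here is a stand-in of ours for the synonym/hyponym lists used in practice:

```python
def critical_objects(question, answers, image_objects, synonyms=None):
    """Split image objects into critical / non-critical: an object is
    critical if the object word (or a synonym) appears in the set W of
    question and answer words."""
    synonyms = synonyms or {}
    words = set(question.lower().replace("?", " ").split())
    words |= {w for a in answers for w in a.lower().split()}
    crit, non_crit = [], []
    for obj in image_objects:
        names = {obj} | set(synonyms.get(obj, []))
        (crit if names & words else non_crit).append(obj)
    return crit, non_crit
```

For "How many people are pictured?", "person" is critical (via its synonym "people") while a background "frisbee" is not.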

B.2 Object Removal and In-painting
After the object instance is selected, it is removed from the image by replacing all pixel values by 1 (white). This masked image is then input to a GAN-based image inpainting network (Yu et al., 2018) that fills in the pixels in the mask, making the image photorealistic. This network is one of the best available off-the-shelf blind image inpainting models, and is trained on ImageNet (Deng et al., 2009). The masked image could also be used as the mutant image; however, we prefer photorealistic images for two main reasons. First, masked images do not lie in the same distribution as natural images; second, the mask boundary may give the network clues about the shape or outline of the missing object.

B.3 Color Inversion Process
For mutations that involve a change in the color of the object, we perform a simple pixel-wise color-inversion operation on each pixel in the mask to get the mutant image, as shown in Figure 5. This ensures that we do not use any prior knowledge about valid colors of a specific object. For instance, bananas can typically be yellow, green, or black. However, if we only changed the color of a banana to one of these three colors, we would be using domain knowledge and inadvertently introducing answers from the test set, defeating the purpose of OOD generalization. Although the simple inversion process can introduce unnatural colors like blue bananas, it forces the model to understand colors in the image to answer the question instead of simply answering from linguistic priors (such as the memorized knowledge that bananas can be green, yellow, or black).
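The inversion itself is a per-pixel operation restricted to the object mask. A sketch over nested lists of RGB tuples (a real implementation would operate on image arrays):

```python
def invert_pixels(image, mask):
    """Pixel-wise RGB inversion restricted to the masked
    (critical-object) region; pixels outside the mask are untouched."""
    out = []
    for row, mrow in zip(image, mask):
        out.append([tuple(255 - c for c in px) if m else px
                    for px, m in zip(row, mrow)])
    return out

# A yellow pixel (255, 255, 0) inside the mask inverts to blue (0, 0, 255).
tiny = invert_pixels([[(255, 255, 0), (10, 20, 30)]], [[1, 0]])
```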

B.4 Answer Generation
The new answers are generated based on the type of question. For yes-no questions, if all instances of the object are removed, the answer changes from yes to no. If only some instances are removed or if the object is non-critical, the answer remains the same. For number questions, if m instances of a critical object are removed, the answer changes from n to n − m; otherwise the answer remains the same. For color-based questions, we convert the answer color to its HEX value using the Webcolors library, invert the value, and find the closest color among the CSS 2.1 colors to generate the new answer.
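The color-answer update can be sketched as follows; a small built-in table of the CSS 2.1 named colors stands in for the Webcolors lookup:

```python
# The 17 CSS 2.1 named colors and their RGB values.
CSS21 = {
    "aqua": (0, 255, 255), "black": (0, 0, 0), "blue": (0, 0, 255),
    "fuchsia": (255, 0, 255), "gray": (128, 128, 128), "green": (0, 128, 0),
    "lime": (0, 255, 0), "maroon": (128, 0, 0), "navy": (0, 0, 128),
    "olive": (128, 128, 0), "orange": (255, 165, 0), "purple": (128, 0, 128),
    "red": (255, 0, 0), "silver": (192, 192, 192), "teal": (0, 128, 128),
    "white": (255, 255, 255), "yellow": (255, 255, 0),
}

def new_color_answer(color_name):
    """Invert the answer color in RGB space, then snap to the nearest
    CSS 2.1 named color (squared Euclidean distance in RGB)."""
    r, g, b = CSS21[color_name]
    inv = (255 - r, 255 - g, 255 - b)
    return min(CSS21, key=lambda n: sum((a - c) ** 2
                                        for a, c in zip(CSS21[n], inv)))
```

For instance, a "yellow" ground truth becomes "blue" after inversion.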

C Question Mutant Generation Process
For generating question mutants, we use three operators: negation, substitution by antonyms or adversarial words, and masking critical words.

C.1 Negation
For yes-no questions and color-based questions, we use a template-based negation technique that puts a negative word such as "not" or "no" before a preposition, noun phrase, or verb. For instance "Is this chair broken?" is negated to "Is this chair not broken?". We show examples of negation in Table 7. Negation simply flips the answer from yes to no or no to yes.
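A toy version of such a template is below; a real implementation would rely on POS tags (e.g., from SpaCy), whereas this sketch only handles the auxiliary-initial pattern of yes-no questions:

```python
def negate(question, neg_word="not"):
    """Template negation sketch: for a question opening with an
    auxiliary verb, insert the negative word before the final predicate,
    e.g. "Is this chair broken?" -> "Is this chair not broken?"."""
    aux = {"is", "are", "was", "were", "does", "do", "did", "can"}
    words = question.split()
    if words and words[0].lower() in aux:
        return " ".join(words[:-1] + [neg_word, words[-1]])
    return question  # non-yes-no questions are left unchanged
```

The ground-truth answer is then flipped from yes to no (or vice versa) for the negated question.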

C.2 Adversarial Words and Masking
Another form of question mutation is substituting object-words with adversarial words. To do so, we create a list of all object words and their synonyms and use BERT (Devlin et al., 2018) similarity to rank the most similar words. To replace a word, we choose the most similar word which is not present in the image. The third type of mutation is masking, where a critical object word is removed from the question and replaced with the token "[MASK]".
For both these types of mutations, determining the correct answer in some cases is not possible as can be seen from examples in Table 7. Thus we use the broad category as the answer. For instance, when a question such as "How big is the book" is replaced with either "How big is the plane" or "How big is the [MASK]", it is clear that the question is about the size of an object. Thus we annotate this question with this broad category "size" as the answer. In other cases, where even a broad category cannot be ascertained, the answer is replaced with "can't say" or "don't know".
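The masking mutation and its category-level answer can be sketched as follows; the prefix-to-category table is illustrative:

```python
# Toy mapping from question prefixes to broad answer categories.
PREFIX_CATEGORY = {"how big": "size", "what color": "color",
                   "where is": "location"}

def mask_question(question, critical_word):
    """Hide the critical object word and fall back to a broad answer
    category when the exact answer is no longer determinable."""
    words = [("[MASK]" if w.rstrip("?") == critical_word else w)
             for w in question.split()]
    new_q = " ".join(words)
    prefix = " ".join(question.lower().split()[:2])
    new_a = PREFIX_CATEGORY.get(prefix, "can't say")
    return new_q, new_a

q, a = mask_question("How big is the book?", "book")
# -> ("How big is the [MASK]", "size")
```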
Webcolors: https://pypi.org/project/webcolors/
To generate answer clusters and representative answer categories, we extract GloVe (Pennington et al., 2014) word vectors for each answer phrase/word using SpaCy. We use k-means clustering (Lloyd, 1982) with the Euclidean distance metric and a varying number of clusters K. We manually tune the number of clusters until a clear set of categories appears, at K = 50. We then manually annotate the category names.

D Dataset Analysis
Here we provide dataset analysis in terms of the distribution of answers by question-type, the number of samples for each type of mutation, and the final distribution of the dataset in terms of answer-type.

D.1 Distribution by Question Type
We show the distribution of answers per question type in Figure 6 for three categories, "How many", "What sport", and "What color", for the top-10 answers. It can be seen that the distribution is distinct from the test data and close to the VQA-CP train data, apart from the introduction of categorical answers such as "number" and "sports" during question mutation. Our mutation method does not leak information about answers from the test set to the train set.

D.2 Distribution by Mutation Type
Table 9 shows the number of samples generated by each type of mutation.

D.3 Distribution by Answer Type
There are three answer types in both the VQA-CP and Mutant datasets: yes/no, number, and other (Table 10). Examples of answer categories and member answers per category are shown in Table 8.