SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency

Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower-level visual concepts in the image that models should ideally understand in order to answer the higher-level question correctly. To address this, we first present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image, and use this to evaluate VQA models on their ability to identify the relevant sub-questions needed to answer a reasoning question. Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT) which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning question> pair. We show that SOrT improves model consistency by up to 6.5% points over existing baselines, while also improving visual grounding.


Introduction
Current Visual Question Answering (VQA) models have problems with consistency. They often correctly answer complex reasoning questions, i.e., those requiring common-sense knowledge and/or logic on top of perceptual capabilities, but fail on associated low-level perception questions, i.e., those directly related to the visual content in the image. For example, in Fig 1, models answer the reasoning question "Was this taken in the daytime?" correctly, but fail on the associated perception question "Is the sky bright?", indicating that the models likely answered the reasoning question correctly for the wrong reason(s). In this work, we explore the usefulness of leveraging information about sub-questions, i.e., low-level perception questions relevant to a reasoning question, and irrelevant questions, i.e., any other questions about the image that are unrelated to the reasoning question, to improve consistency in VQA models. Selvaraju et al. (2020) studied this problem and introduced the VQA-Introspect dataset, which draws a distinction between higher-level reasoning questions and lower-level perception sub-questions. We augment this dataset with additional perception questions from the VQAv2 dataset such that each <image, reasoning question> pair contains a set of relevant perception questions, which we refer to as sub-questions (e.g., "Is the sky bright?" in Fig 1), and irrelevant perception questions, which we refer to as irrelevant questions (e.g., "Is the train moving?" in Fig 1) throughout this paper.
We use Gradient-based Class Activation Mapping (Grad-CAM) vectors (Selvaraju et al., 2019a), a faithful function of the model's parameters, question, answer and image, to propose an interpretability technique that determines the questions most strongly correlated with a reasoning question for a model. This is measured by ranking questions based on the cosine similarity of their Grad-CAM vectors with that of the reasoning question. We find that even top-performing VQA models often rank irrelevant questions higher than relevant sub-questions.
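As a concrete illustration, this ranking step can be sketched as follows, assuming each question's Grad-CAM vector (defined formally in the Approach section) has already been extracted as a flat tensor; `rank_questions` is a hypothetical helper, not code from the paper:

```python
import torch
import torch.nn.functional as F

def rank_questions(reasoning_vec, candidate_vecs):
    """Rank candidate perception questions by the cosine similarity of
    their Grad-CAM vectors with that of the reasoning question.
    Returns candidate indices, most strongly correlated first."""
    sims = torch.stack([F.cosine_similarity(reasoning_vec, v, dim=0)
                        for v in candidate_vecs])
    return torch.argsort(sims, descending=True).tolist()
```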
Motivated by this, we introduce a new approach called contrastive gradient learning to fine-tune a VQA model by adding a loss term that enforces relevant sub-questions to be ranked higher than irrelevant questions while answering a reasoning question. This is achieved by forcing the cosine similarity of the reasoning question's Grad-CAM vector with that of a sub-question to be higher than with that of an irrelevant question. We find that our approach improves the model's consistency, defined as the frequency with which the model correctly answers a sub-question given that it correctly answers the reasoning question. Additionally, we assess the effects of our approach on visual grounding (i.e., does the model look at the right visual regions while answering the question?) by comparing Grad-CAM heatmaps with human attention maps collected in the VQA-HAT dataset (Das et al., 2016). We find that our approach of enforcing this language-based alignment through better ranking of sub-questions also improves visual grounding.

[Figure 1: The approach for SOrT. The reasoning question "Was this taken in the daytime?" has the sub-question "Is the sky bright?" and an irrelevant question "Is the train moving?" We tune the model using cross-entropy losses and a contrastive gradient loss to bring the reasoning question's Grad-CAM vector closer to that of its sub-question and farther from that of an irrelevant question.]

Related Work
Visual Question Answering. The VQA task (Agrawal et al., 2015) requires answering a free-form natural language question about the visual content of an image. Previous work has shown that models often do well on the task by exploiting language and dataset biases (Agrawal et al., 2017; Zhang et al., 2015). To evaluate whether these models reason consistently, Selvaraju et al. (2020) collected a new dataset, VQA-Introspect, containing human explanations in the form of sub-questions and answers for questions in the VQA dataset that require higher-level reasoning.

Model Interpretability. While prior work has attempted to explain VQA decisions in the visual modality (Selvaraju et al., 2019a,b; Qiao et al., 2017), the multi-modal task of VQA has a language component which cannot always be explained visually, i.e., visual regions can be insufficient to express underlying concepts (Goyal et al., 2016b; Hu et al., 2017). Liang et al. (2019) introduced a spatio-temporal attention mechanism to interpret a VQA model's decisions across a set of consecutive frames. Our work, operating on a single image, interprets the model's decisions using Grad-CAM vectors, which are more faithful to the model's parameters than attention maps. Park et al. (2018) and Wu and Mooney (2019) generate textual justifications through datasets curated with human explanations. Our approach differs by being fully self-contained and faithful to the model, requiring no additional parameters for interpreting its decisions.

Aligning network importances. Selvaraju et al. (2019b) introduced an approach to align visual explanations with regions deemed important by humans, thereby improving visual grounding in VQA models. In follow-up work, Selvaraju et al. (2020) introduced an approach to align attention maps for the reasoning question and associated perception sub-questions from VQA-Introspect to improve language-based grounding. In contrast to attention maps, our work encourages Grad-CAM vectors of a reasoning question to be closer to those of sub-questions and farther away from those of irrelevant questions. Intuitively, this makes the neurons used while answering a reasoning question similar to those used while answering a sub-question, and dissimilar to those used while answering an irrelevant question. Our experiments show that this alignment improves the model's consistency and visual grounding.

Approach

Grad-CAM vectors. In this work, we adopt Grad-CAM to compute the contribution of each neuron at the layer in a VQA model where the vision and language modalities are combined. This is computed by taking the gradient of the predicted output class score with respect to the neuron activations in that layer, and point-wise multiplying it with the corresponding activations to obtain the Grad-CAM importance vector. Specifically, if $y^c$ denotes the score of the ground-truth output class and $A_k$ the activations of layer $k$ of the model, the Grad-CAM importance vector $G^c_k$ (or simply, Grad-CAM vector) is computed as

$$G^c_k = \frac{\partial y^c}{\partial A_k} \odot A_k$$

Unlike Grad-CAM visualizations, these vectors are not visually interpretable, as they are not computed on the final convolutional layer of the CNN; a code sketch of this computation is given below.

Dataset. We construct our dataset by augmenting VQA-Introspect with perceptual question-answer pairs from VQAv2 (Goyal et al., 2016a). For every reasoning question in the VQA-Introspect dataset, we have a set of <sub-question, answer> pairs and a set of <irrelevant question, answer> pairs.

Consistency in VQA models. As defined in Selvaraju et al. (2020), the consistency of a VQA model is the proportion of sub-questions answered correctly, given that their corresponding reasoning questions were answered correctly. An inconsistent model is likely relying on incorrect perceptual signals or dataset biases to answer questions, and may therefore be inaccurate when evaluated on data with different language priors. Models that are consistent and rely on appropriate perceptual signals are more likely to generalize reliably.
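Returning to the Grad-CAM importance vector defined above, here is a minimal sketch of its computation, assuming the fusion layer's activations have been retained from the forward pass with gradients enabled; the helper name is ours:

```python
import torch

def gradcam_vector(class_score, activations):
    """G^c_k = (dy^c / dA_k) * A_k, flattened into a vector.

    class_score: scalar logit y^c of the ground-truth answer class
    activations: activations A_k of the fusion layer, retained from the
                 forward pass with requires_grad=True
    """
    grads, = torch.autograd.grad(class_score, activations, retain_graph=True)
    return (grads * activations).flatten()  # point-wise product, flattened
```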

Sub-question Oriented Tuning
The key idea behind Sub-question Oriented Tuning (SOrT) is to encourage the neurons most strongly relied upon (as assessed by Grad-CAM vectors) while answering a reasoning question ("Was this taken in the daytime?" in Fig 1) to be similar to those used while answering the sub-questions ("Is the sky bright?") and dissimilar to those used while answering the irrelevant questions ("Is the train moving?"). This encourages the model to use the same visual and linguistic concepts while making predictions on the reasoning question and the sub-questions. Our loss has the following two components.

Contrastive Gradient Loss. With the Grad-CAM vectors of the reasoning question ($G_R$), sub-question ($G_S$) and irrelevant question ($G_I$), we formalize our intuition as a contrastive gradient loss $\mathcal{L}_{CG}$:

$$\mathcal{L}_{CG} = \max\left(0, \cos(G_R, G_I) - \cos(G_R, G_S)\right)$$

Binary Cross Entropy Loss. To retain the performance of the model on the base task of answering questions correctly, we add a binary cross-entropy loss term ($\mathcal{L}_{BCE}$) that penalizes incorrect answers for all the questions.

Total Loss. Let $o_R$, $gt_R$, $o_S$, $gt_S$, $o_I$ and $gt_I$ represent the predicted and ground-truth answers for the reasoning, sub- and irrelevant questions respectively, and let $\lambda_1$, $\lambda_2$, $\lambda_3$ be tunable hyperparameters. Our total loss $\mathcal{L}_{SOrT}$ is

$$\mathcal{L}_{SOrT} = \mathcal{L}_{CG} + \lambda_1 \mathcal{L}_{BCE}(o_R, gt_R) + \lambda_2 \mathcal{L}_{BCE}(o_S, gt_S) + \lambda_3 \mathcal{L}_{BCE}(o_I, gt_I)$$
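Under these definitions, the objective can be sketched as follows. This assumes Pythia-style answer prediction with soft multi-label targets (hence BCE with logits) and the hinge formalization of $\mathcal{L}_{CG}$ above; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def sort_loss(g_r, g_s, g_i,                      # Grad-CAM vectors G_R, G_S, G_I
              o_r, gt_r, o_s, gt_s, o_i, gt_i,    # answer logits and soft targets
              lambdas=(1.0, 1.0, 1.0)):
    """L_SOrT = L_CG + lambda_1*L_BCE(o_R, gt_R)
                     + lambda_2*L_BCE(o_S, gt_S) + lambda_3*L_BCE(o_I, gt_I)."""
    # Contrastive gradient loss: penalize whenever the irrelevant question's
    # Grad-CAM vector is closer to the reasoning question's than the sub-question's.
    l_cg = torch.clamp(F.cosine_similarity(g_r, g_i, dim=0)
                       - F.cosine_similarity(g_r, g_s, dim=0), min=0.0)
    l1, l2, l3 = lambdas
    return (l_cg
            + l1 * F.binary_cross_entropy_with_logits(o_r, gt_r)
            + l2 * F.binary_cross_entropy_with_logits(o_s, gt_s)
            + l3 * F.binary_cross_entropy_with_logits(o_i, gt_i))
```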

Experiments
Dataset. As mentioned in Sec 3.1, our dataset pools VQA-Introspect and VQAv2. The training/val splits contain 54,345/20,256 <image, reasoning question> pairs, with an average of 2.58/2.81 sub-questions and 7.63/5.80 irrelevant questions per pair respectively.

Baselines. We compare SOrT against the following baselines: 1) Pythia; 2) SQuINT, in which, as discussed in Sec 2, Selvaraju et al. (2020) fine-tuned Pythia with an attention alignment loss to ensure that the model looks at the same regions when answering the reasoning question and the sub-questions; and 3) SOrT with only sub-questions, in which we discard the irrelevant questions associated with a reasoning question and align only the Grad-CAM vectors of the sub-questions with that of the reasoning question. This ablation benchmarks the usefulness of the contrastive nature of our loss function.

Metrics
Ranking. We use four metrics to assess the capability of a model to correctly rank its sub-questions (a code sketch follows the list):

1. Mean Precision@1 (MP@1). The proportion of <image, reasoning question> pairs for which the highest-ranked question is a sub-question.

2. Ranking Accuracy. The proportion of <image, reasoning question> pairs whose sub-questions are all ranked above their irrelevant questions.

3. Mean Reciprocal Rank (MRR). The average, over all <image, reasoning question> pairs, of the highest reciprocal rank of a sub-question. Higher is better.

4. Weighted Pairwise Rank (WPR) Loss. This finds pairs of incorrectly ranked <sub-question, irrelevant question> and computes the differences of their similarity scores with the reasoning question. Averaged across all pairs, this measures the extent to which rankings are incorrect. Lower is better.
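The sketch below computes all four metrics for a single <image, reasoning question> pair; the exact weighting and normalization of WPR in the paper may differ from this per-pair average:

```python
import numpy as np

def ranking_metrics(sub_sims, irr_sims):
    """Per-pair ranking metrics, given cosine similarities of the
    sub-questions' and irrelevant questions' Grad-CAM vectors with
    the reasoning question's vector."""
    sub_sims, irr_sims = np.asarray(sub_sims), np.asarray(irr_sims)
    order = np.argsort(-np.concatenate([sub_sims, irr_sims]))
    is_sub = order < len(sub_sims)             # True where a sub-question is ranked
    mp1 = float(is_sub[0])                     # Precision@1 for this pair
    rank_acc = float(sub_sims.min() > irr_sims.max())  # all subs above all irrelevants
    mrr = 1.0 / (int(np.argmax(is_sub)) + 1)   # highest-ranked sub-question
    # WPR: similarity gap over every incorrectly ordered <sub, irrelevant> pair
    gaps = np.maximum(0.0, irr_sims[None, :] - sub_sims[:, None])
    wpr = float(gaps.sum() / gaps.size)
    return mp1, rank_acc, mrr, wpr
```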
Model Performance. 1) Quadrant Analysis. We partition <reasoning question, sub-question> pairs into four quadrants: a. R✓S✓, where the reasoning question and sub-question are both answered correctly; b. R✓S✗, where the reasoning question is answered correctly but the sub-question incorrectly; c. R✗S✓, where the reasoning question is answered incorrectly but the sub-question correctly; and d. R✗S✗, where both are answered incorrectly.
2) Consistency. The frequency with which a model correctly answers a sub-question given that it correctly answers the reasoning question.
3) Reasoning Accuracy. The accuracy on the reasoning split of the VQAv2 dataset. 4) Overall Accuracy. The accuracy on the VQAv2 validation set. More details on the metrics are in the Appendix; a sketch of the consistency and quadrant computations follows.
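As a sketch, consistency and the quadrant counts can be computed from per-pair correctness flags as follows (hypothetical helper, ASCII quadrant labels):

```python
import numpy as np

def consistency_and_quadrants(r_correct, s_correct):
    """Consistency and quadrant counts over aligned <reasoning question,
    sub-question> pairs, given boolean correctness arrays."""
    r, s = np.asarray(r_correct, bool), np.asarray(s_correct, bool)
    quadrants = {"R+S+": int((r & s).sum()),    # both correct
                 "R+S-": int((r & ~s).sum()),   # reasoning right, sub wrong
                 "R-S+": int((~r & s).sum()),   # reasoning wrong, sub right
                 "R-S-": int((~r & ~s).sum())}  # both wrong
    consistency = (r & s).sum() / max(int(r.sum()), 1)  # P(S correct | R correct)
    return consistency, quadrants
```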

Results
We attempt to answer the following questions.

Does SOrT help models better identify the perception questions relevant for answering a reasoning question? As described in Sec 3.2, the model ranks the perception questions (sub-questions and irrelevant questions) associated with an <image, reasoning question> pair according to the cosine similarities of their Grad-CAM vectors with that of the reasoning question. As seen in Table 1, our approach outperforms its baselines on nearly all the ranking metrics. We observe gains of 4-6% points on MP@1 and MRR, and 1.5-2.5% points on Ranking Accuracy. Likewise, the improvement in WPR, the soft metric that measures the extent to which rankings are incorrect, is a substantial 12% points over Pythia. This confirms that SOrT-ing VQA models helps them distinguish between (and better rank) the relevant and irrelevant perceptual concepts needed for answering a reasoning question.

Does recognizing relevant sub-questions make models more consistent? We find that the improved ranking of sub-questions through SOrT improves consistency by 6.5% points over Pythia, 3.25% points over SQuINT, and 0.5% points over using just the sub-questions and discarding the irrelevant questions. As seen in Table 1, the consistency gains stem from significant improvements in the R✓S✓ and R✓S✗ quadrants. This, however, comes at the expense of a drop in overall accuracy and reasoning accuracy of ∼1% point, likely due to the active disincentivization of memorizing language priors and dataset biases (Agrawal et al., 2017; Ramakrishnan et al., 2018; Guo et al., 2019; Manjunatha et al., 2018) through our contrastive gradient learning approach. Gradient-based explanations have been shown to be more faithful to model decisions than attention maps (Selvaraju et al., 2019b). Our results support this by showing that aligning Grad-CAM vectors for reasoning questions and sub-questions makes models more consistent than SQuINT, which aligns their attention maps. Fig 2 shows an example of improved consistency using SOrT. The Pythia model answers its sub-question incorrectly; after SOrT-ing, our model ranks the relevant sub-question higher than the irrelevant ones and also answers it correctly, thus improving consistency. These results illustrate a common trade-off among multiple desirable characteristics of VQA models: accuracy, consistency, and interpretability. Building models that maximize all of these presents a challenging problem for future work.

Does enforcing language-based alignment lead to better visual grounding? To evaluate this, we compute visual grounding through Grad-CAM applied to the final convolutional layer. We then compute the correlation of these Grad-CAM heatmaps with the validation split of the VQA Human ATtention (VQA-HAT) dataset (Das et al., 2016), comprising 4,122 attention maps. This dataset contains human-annotated 'ground truth' attention maps which indicate where humans chose to look while answering questions about images in the VQAv1 dataset. The method proposed in that work for comparing human and model attention maps is to rank their pixels according to their spatial attention and then compute the correlation between the two ranked lists, as sketched below.
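A minimal sketch of this comparison, assuming both maps have already been resized to a common spatial resolution (e.g., with standard image resampling):

```python
import numpy as np
from scipy.stats import spearmanr

def grounding_correlation(gradcam_map, human_map):
    """Spearman rank correlation between a model's Grad-CAM heatmap and a
    human attention map, both given as 2-D arrays over the same grid."""
    rho, _ = spearmanr(gradcam_map.ravel(), human_map.ravel())
    return rho
```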
We find that SOrT achieves a Spearman rank correlation of 0.103 ± 0.008, versus 0.080 ± 0.009 for Pythia and 0.060 ± 0.008 for SQuINT. These improvements indicate that enforcing language-based alignment during training improves visual grounding on an unseen dataset.
A qualitative example demonstrating the superior visual grounding of SOrT compared to its baselines is shown in Fig 3. For the question Is the baby using the computer? and its corresponding answer Yes, we see that the Grad-CAM heatmap generated by SOrT is closest to the human attention map. It is also the only heatmap in this example that actually points to the fingers of the child, which are the essential visual component for answering the question.

Discussion
In this work, we developed language-based interpretability metrics that measure the relevance of a lower-level perception question for answering a higher-level reasoning question. Evaluating state-of-the-art VQA models on these metrics reveals that they often rank irrelevant questions higher than relevant ones. To address this, we presented SOrT (Sub-question Oriented Tuning), a contrastive gradient learning based approach for teaching VQA models to distinguish between relevant and irrelevant concepts while answering a complex reasoning question. SOrT aligns Grad-CAM vectors of reasoning questions with those of their sub-questions, while de-aligning them from those of their irrelevant questions. We demonstrated SOrT's effectiveness at making VQA models more consistent without significantly affecting their overall predictive performance. We also showed that this alignment achieves better visual grounding.