Faithful Multimodal Explanation for Visual Question Answering

AI systems’ ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods using both automated metrics and human evaluation.


Introduction
Deep neural networks have made significant progress on visual question answering (VQA), the challenging AI problem of answering natural-language questions about an image (Antol et al. 2015). However, successful systems based on deep neural networks are difficult to comprehend because of their many layers of abstraction and large numbers of interconnected parameters. This makes it hard to develop user trust. Partly due to the opacity of current deep models, there has been a recent resurgence of interest in explainable AI: systems that can effectively explain their reasoning to human users. In particular, there has been some recent development of explainable VQA systems (Selvaraju et al. 2017; Park et al. 2018; Hendricks et al. 2016).
One approach to explainable VQA is to generate visual explanations, which highlight image regions that most contributed to the system's answer, as determined by attention mechanisms (Lu et al. 2016) and gradient analysis (Selvaraju et al. 2017). However, such simple visualizations do not explain how these regions support the answer. An alternate approach is to generate a textual explanation, a natural-language sentence that provides reasons for the answer. Some recent work has generated textual explanations for VQA by training a recurrent neural network (RNN) on examples of human explanations (Hendricks et al. 2016). A multimodal approach that integrates both a visual and a textual explanation provides the advantages of both: words and phrases in the text can point to relevant regions in the image. An illustrative multimodal explanation generated by our system is shown in Figure 1.

Figure 1: Example of our multimodal explanation. It highlights relevant image regions together with a textual explanation with corresponding words in the same color.
Some initial work on such multimodal VQA explanation is presented by Park et al. (2018), who employ a form of "post hoc rationalization" that does not truly reflect the system's actual processing. First, a textual explanation is generated using an independent RNN that is trained on human explanations. Next, a subsequent "clean-up" step tries to connect phrases in this post-hoc explanation to actual detected objects in the image, removing phrases that cannot be properly grounded in the image. We believe that explanations should more faithfully reflect the actual processing of the underlying system in order to provide users with a deeper understanding of the system that increases trust for the right reasons, rather than simply trying to convince them of the system's reliability (Bilgic and Mooney 2005).
Several recent VQA models incorporate a Bottom-Up-Top-Down (BUTD) attention mechanism (Anderson et al. 2018) that attends over detected objects. Like Park et al. (2018), we train an RNN to produce human-like textual explanations. However, our approach is more faithful in that the generated explanations are strongly biased to include terms (words and phrases) from semantic image segments that are highly attended by the VQA module when computing the answer. This also provides direct links between terms in the textual explanation and segmented items in the image, as shown in Figure 1. The result is a synthesis of a faithful explanation that highlights concepts actually used to compute the answer and a comprehensible, human-like, linguistic explanation. Below we describe the details of our approach and present extensive experimental results on the VQA-X (Park et al. 2018) dataset that demonstrate the advantages of our approach compared to prior work on this data (Park et al. 2018) in terms of both automatic metrics and human evaluation.

Related Work
In this section, we review related work including visual and textual explanation generation and VQA.

VQA
Answering visual questions (Antol et al. 2015) has been widely investigated in both the NLP and computer vision communities. Most VQA models (Fukui et al. 2016; Lu et al. 2016) jointly embed an image's CNN features and a question's RNN features and then train an answer classifier to predict answers from a pre-extracted answer set. Attention mechanisms are frequently applied to recognize important visual features and filter out irrelevant parts. A recent advance is the Bottom-Up-Top-Down (BUTD) attention mechanism (Anderson et al. 2018), which attends over high-level objects instead of convolutional features to avoid emphasizing irrelevant portions of the image. We adopt this mechanism but replace object detection (Ren et al. 2015) with instance segmentation to obtain more precise object boundaries.

Visual Explanation
A number of approaches have been proposed to visually explain decisions made by vision systems by highlighting relevant image regions. GradCam (Selvaraju et al. 2017) analyzes the gradient space to find visual regions that most affect the decision. Attention mechanisms in VQA models can also be directly used to determine highly-attended regions and generate visual explanations. Unlike conventional visual explanations, ours highlight segmented objects that are linked to words in an accompanying textual explanation, thereby focusing on more precise regions and filtering out noisy attention weights.

Textual and Multimodal Explanation
Visual explanations highlight key image regions behind the decision; however, they do not explain the reasoning process and crucial relationships between the highlighted regions. Therefore, there has been some work on generating textual explanations for decisions made by visual classifiers (Hendricks et al. 2016). As mentioned in the introduction, there has also been some work on multimodal explanations that link textual and visual explanations (Park et al. 2018).
A recent extension of this work first generates multiple textual explanations and then filters out those that cannot be grounded in the image. We argue that a good explanation should instead directly focus on covering the visual detections that influenced the system's decision, thereby generating more faithful explanations.

Approach
Our goal is to generate more faithful multimodal VQA explanations that specifically include the segmented objects in the image that are the focus of the VQA system. Figure 2 illustrates our model's pipeline consisting of instance segmentation, phrase detection, answer prediction, and textual explanation generation. We first segment the objects in the image and then train an answer prediction module equipped with an attention mechanism over the semantically segmented objects. We also detect possible relevant phrases for the explanation with a phrase detection module. Finally, the explanation module generates textual explanations based on the question, answer, attended image segments in the VQA process, and detected phrases. At each step, an RNN in the explanation module determines whether the next word or phrase should be based on attended image content or linguistic content learned from human explanations.

Notation
We use f^n to denote the output of the n-th fully-connected (fc) layer of the neural network, omitting n when n = 1. These fc layers do not share parameters. We denote the sigmoid function by σ. The subscript i indexes either elements of the segmented object sets from images or phrases in the detected phrase sets. Bold letters denote vectors, an overline denotes averaging, and [·, ·] denotes concatenation.

VQA Module
We base the VQA module on BUTD (Anderson et al. 2018) with some simplifications. First, we replace the two-branch gated tanh answer classifier with single-branch fc layers with Leaky ReLU activation (Maas, Hannun, and Ng 2013). In order to ground the explanations in more precise visual regions, we adopt instance segmentation to segment out objects in over 3,000 common categories instead of the bounding-box detection used in the original BUTD model. Specifically, we extract at most the top V (V < 80) objects in terms of detection scores and concatenate each object's fc6 features from the bounding-box classification branch with its mask fcn features.

Figure 2: Overview of our explanation model: (a) shows the input image; (b) illustrates the segmented instances in the image; (c) shows the highly attended segments in the VQA process; (d) shows sample detected phrases from the phrase detection module; and (e) depicts the visual-textual correspondence generated in the explanation module.
Like Wu, Hu, and Mooney (2018), we embed the questions using a standard single-layer GRU (Cho et al. 2014) into a vector q. An attention mechanism over all the image segments, taking q and V as input, is introduced in Eq. 1 to weight each instance in the image feature set V in order to compute the question-attended image feature set V_q in Eq. 2. Note that we adopt a sigmoid instead of the softmax used in previous work (Anderson et al. 2018; Wu, Hu, and Mooney 2018), since multiple objects may contribute to the answer. For answer prediction, we feed the joint question-image representation h, computed in Eq. 3 as the elementwise multiplication of the embedding of the average of the features in V_q and the question features q, to the answer classifier. We frame answer prediction as multi-label classification (Anderson et al. 2018; Wu, Hu, and Mooney 2018), where we use soft scores (Antol et al. 2015) as labels to supervise the sigmoid-normalized predictions (Eq. 4) via cross-entropy loss as shown in Eq. 5, where the index j runs over the N candidate answers and the ŷ are the aforementioned soft answer scores. The soft scores model the confidence of each feasible answer annotated by humans, in line with the VQA evaluation metric. As possible answers, we extract the 3,129 candidates that appear more than 8 times in the VQAv2 (Antol et al. 2015) training set.
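The attention and answer-representation steps (Eqs. 1-3) can be sketched as follows; this is a minimal illustration in which the single scoring vector W stands in for the model's learned fc parameters, and the extra fc embedding of the averaged features in Eq. 3 is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend_segments(V, q, W):
    """Eqs. 1-2 (sketch): score each segment feature against the question
    and weight it with an independent sigmoid rather than a softmax,
    since several segments may support the answer.
    V: (num_segments, d) segment features; q: (d,) question embedding;
    W: (2*d,) hypothetical scoring weights for a single fc layer."""
    scores = np.array([np.concatenate([v, q]) @ W for v in V])
    alpha = sigmoid(scores)          # per-segment attention weights in (0, 1)
    V_q = alpha[:, None] * V         # question-attended image features
    return alpha, V_q

def joint_representation(V_q, q):
    """Eq. 3 (simplified): elementwise product of the averaged attended
    features and the question embedding; the paper's fc embedding of the
    average is omitted here."""
    return V_q.mean(axis=0) * q
```

Because each weight is an independent sigmoid, several segments can receive weights near 1 simultaneously, which a softmax would not allow.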
VQS pretraining. In order to provide supervision for where the VQA module should attend, we pre-train the VQA model on the training set of the VQS (Gan et al. 2017) data, a subset of the VQAv2 dataset in which the key segmentations relevant to the answers are annotated.
In particular, we pretrain the model on the entire VQAv2 training set and provide segmentation supervision when it exists (i.e., when the training example is in the VQS dataset). Because the VQS annotations only cover the 80 categories from COCO (Lin et al. 2014), we match the segmentation supervision to the segmentations pre-extracted by our model: a pre-extracted segmentation is assigned label 1 when its IoU with a segmentation in the VQS dataset exceeds 0.5. In addition to the general VQA loss, we then add an attention loss that minimizes the cross entropy between α_vqa and these labels.
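The IoU-based matching above can be sketched as follows; representing segmentations as boolean masks is an assumption for illustration:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean segmentation masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def attention_labels(pred_masks, vqs_masks, thresh=0.5):
    """Label a pre-extracted segment 1 (i.e., it should be attended)
    when it overlaps some VQS ground-truth segment with IoU > thresh."""
    return [int(any(mask_iou(p, g) > thresh for g in vqs_masks))
            for p in pred_masks]
```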

Question and Answer Embedding
As suggested in (Park et al. 2018), we also encode questions and answers as input features to the explanation module. However, directly encoding the answer vectors ŝ can introduce extra noise to the explanation module, since the predicted scores of correct answers in testing are usually much lower than during training. Therefore, to adjust for the difference in answer probabilities between training and testing, we instead regard the normalized answer prediction output as a multinomial distribution and sample one answer from this distribution at each time step. We further re-embed it as a one-hot vector a_s, as shown in Eq. 6, where δ denotes one-hot embedding. Next, we element-wise multiply the embeddings of the question features q and the sampled answer's features a_s with the question-attended image features V_q to compute the joint representation u. Note that u faithfully represents the focus of the VQA process, in that it is derived from the VQA-attended image features.
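The sampling and re-embedding step (Eq. 6) amounts to a short routine like the following sketch:

```python
import numpy as np

def sample_answer_onehot(scores, rng):
    """Eq. 6 (sketch): normalize the sigmoid answer scores into a
    multinomial distribution, sample one answer from it, and re-embed
    the sampled index as a one-hot vector a_s."""
    p = scores / scores.sum()
    idx = rng.choice(len(scores), p=p)
    a_s = np.zeros(len(scores))
    a_s[idx] = 1.0
    return a_s
```

Sampling (rather than always taking the argmax) keeps the explanation module exposed to the same kind of answer uncertainty at training time that it will face at test time.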

Phrase Detection
Since our segmentation categories are mainly common nouns representing objects, an explanation generation process based only on these segmentations would have to determine the activities (e.g. throwing, standing, etc.) and attributes (e.g. red car, tall building, etc.), which are significantly harder to learn. Therefore, to reduce these difficulties, we detect a set of significant phrases and determine whether what they describe actually appears in the image. We extract frequently appearing n-grams from the human explanations in our data, with the constraint that the last word must be a common noun, to ensure it can describe an object. Specifically, we first part-of-speech tag (Kitaev and Klein 2018) the ground truth explanations in the training set and extract all 2-5 grams ending in an NN or NNS. We remove phrases appearing less than 10 times, resulting in 1,735 candidate phrases. We then train a multilabel classifier to predict the phrases appearing in an explanation from the segmentation features v_i of the image. The classifier consists of gated tanh units with sigmoid attention over the visual features v_i, followed by two fc layers, as shown in Eq. 8. To ensure the accuracy of the detected phrases, we apply a post-selection process that keeps only the high-performance phrase detectors. Specifically, we first train the phrase detector on the training set using binary cross-entropy loss for 12 epochs with the Adam optimizer (Kingma and Ba 2014). Next, we run the detector on the validation set and regard phrases with an output score over 0.5 as detected phrases. Finally, we evaluate each phrase on the validation set and filter out those with an F1 score less than 0.7 to form our final phrase set P = {P_1, P_2, ..., P_M}.
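The n-gram mining step can be sketched as follows, assuming the explanations have already been POS-tagged into (word, tag) pairs with Penn Treebank tags as in the text:

```python
from collections import Counter

def extract_phrase_candidates(tagged_explanations, min_count=10):
    """Collect 2-5 grams whose last word is tagged as a common noun
    (NN or NNS) and keep those appearing at least min_count times.
    tagged_explanations: list of sentences, each a list of (word, tag)."""
    counts = Counter()
    for sent in tagged_explanations:
        for n in range(2, 6):                      # 2-5 grams
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                if gram[-1][1] in ("NN", "NNS"):   # must end in a common noun
                    counts[" ".join(w for w, _ in gram)] += 1
    return [p for p, c in counts.items() if c >= min_count]
```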
In the explanation module, we re-embed at most 10 phrases with detection scores over 0.5 as additional features for textual explanation generation. The re-embedded features are constructed by summing each word's pretrained GloVe vector (Pennington, Socher, and Manning 2014), as shown in Eq. 9. This produces a phrase set P containing P 300-d vectors p_i in the same form as the visual features v_i. Therefore, we apply the same question and answer embedding procedure shown in Eq. 7 to produce a joint representation u_p.
p_i = Σ_{w ∈ P_i} GloVe(w),  i = 1, 2, ..., P  (9)

This novel phrase detection technique provides several advantages. First, the additional phrase features can guide explanation generation by providing direct access to embeddings of common phrases found in human explanations. Second, it can detect common, higher-level concepts useful in explanations by modeling complex compositional phrases (e.g. baseball player, holding a bat, etc.), thereby reducing the burden of discovering these complex concepts in the explanation module.
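The phrase selection and GloVe-sum re-embedding (Eq. 9) can be sketched as below; the tiny `glove` lookup table is a toy stand-in for the pretrained 300-d vectors, and all phrase words are assumed to be in the vocabulary:

```python
import numpy as np

def select_and_embed_phrases(phrases, scores, glove, k=10, thresh=0.5):
    """Keep at most k phrases with detection score over thresh and embed
    each one as the sum of its words' GloVe vectors (Eq. 9), yielding
    features in the same form as the visual segment features."""
    keep = sorted(
        (i for i, s in enumerate(scores) if s > thresh),
        key=lambda i: -scores[i],
    )[:k]
    return [np.sum([glove[w] for w in phrases[i].split()], axis=0)
            for i in keep]
```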

Explanation Generation
The explanation module consists of both a visual and a linguistic component, each containing a two-layer LSTM similar to BUTD (Anderson et al. 2018). The first layer produces an attention vector over the elements in either the image segmentation feature set V or the previously re-embedded phrase features P. The second layer learns a hidden representation for predicting the next word in the textual explanation from the features generated by the first layer. The visual attention LSTM takes as input the concatenation x^1_t of the language LSTM's previous output h^2_{t-1}, the average pooling of u_i, and the previous word's embedding, and generates the hidden representation h^1_t. Then, an attention mechanism re-weights the image features u_i using the generated h^1_t, as shown in Eq. 10. For the detailed module structure, please refer to (Anderson et al. 2018).
For the purpose of faithfully grounding the generated explanation in the image, we argue that the generator should be able to determine if the next word should be based on image content attended to by the VQA system or on learned linguistic content.
To achieve this, we introduce a "source identifier" to balance the total amount of attention paid to the visual features u_i versus the recurrent hidden representation h^1_t at each time step. In particular, given the output h^1_t from the attention LSTM and the average pooling of u_i, we train a two-layer softmax network, as shown in Eq. 12, to produce a 2-d output s = (s_0, s_1) that identifies which source the current generated word should be based on (s_0 for the output of the attention LSTM¹ and s_1 for the attended image features).
We use the following approach to obtain training labels ŝ for the source identifier. For visual features u_i, we assign label 1 (indicating the use of attended visual information) when there exists a segmentation u_i whose category name's GloVe representation has cosine similarity over 0.6 with the current generated word's GloVe representation. For phrase features p_i, we assign label 1 when the language LSTM is generating the phrase. Given the labeled data, we train the source identifier using the cross-entropy loss L_source shown in Eq. 13.
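The labeling rule for the visual features can be sketched as follows, with GloVe vectors passed in as plain arrays:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def source_label(word_vec, segment_name_vecs, thresh=0.6):
    """Label 1 (use attended visual information) when the generated
    word's GloVe vector is within cosine similarity thresh of some
    attended segment's category-name GloVe vector; else label 0."""
    return int(any(cosine(word_vec, v) > thresh for v in segment_name_vecs))
```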
Next, we concatenate the re-weighted h^1_t and u_i with the output of the source identifier as the input x^2_t for the language LSTM. For the detailed structure of the language LSTM, please refer to (Anderson et al. 2018).
The phrase module takes the phrase feature u_p as input and follows the same architecture, producing a hidden representation h^{p,2}_t analogous to h^2_t. With hidden layers from both visual features and phrases, we model the next word's conditional probability as shown in Eq. 15 and minimize the cross-entropy loss in Eq. 16:

p(y_t | y_{1:t-1}) = softmax(f(h^2_t + h^{p,2}_t))  (15)

L_xe = -Σ_t log p(y_t | y_{1:t-1})  (16)

In addition to the standard cross-entropy loss, we also introduce a cover loss to encourage the explanation module to cover all of the image segments attended to by the VQA process. Specifically, this loss minimizes the KL divergence between the VQA attention and the normalized summation of the explanation module's attention over all time steps, as demonstrated in Eq. 17, where Z = Σ_i Σ_t s_{1,t} α_{i,t} is a normalization term and s_{1,t} is the output of the source identifier at time step t. This encourages the generated explanation to faithfully include as many of the visual detections used by the VQA system as possible. Finally, when using the language LSTM to generate the final textual explanation, we use beam search with a beam size of 3.

¹We tried directly using the source weights s_0 in the language LSTM's hidden representation h^2_{t-1} and found that using h^1_t works better: directly constraining h^2_{t-1} makes the language LSTM forget previously encoded content and prevents it from learning long-term dependencies.
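The fused next-word distribution (Eq. 15) and the cover loss (Eq. 17) can be sketched as follows; `W_out` and the toy attention matrices in the usage are illustrative assumptions, not the model's actual parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_word_probs(h_visual, h_phrase, W_out):
    """Eq. 15 (sketch): fuse the visual-branch and phrase-branch hidden
    states by addition, project with a hypothetical output matrix W_out,
    and normalize with a softmax over the vocabulary."""
    return softmax(W_out @ (h_visual + h_phrase))

def cover_loss(vqa_attention, expl_attention, source_weights):
    """Eq. 17 (sketch): KL divergence between the VQA attention over
    segments and the explanation module's attention, gated by the source
    identifier output s_1 and summed over time, then normalized.
    expl_attention: (T, num_segments); source_weights: (T,)."""
    acc = (source_weights[:, None] * expl_attention).sum(axis=0)
    acc = acc / acc.sum()                       # normalized by Z
    p = vqa_attention / vqa_attention.sum()
    return float((p * np.log(p / np.maximum(acc, 1e-12))).sum())
```

When the explanation's accumulated attention matches the VQA attention exactly, the cover loss is zero; mass on segments the VQA module attended to but the explanation never mentions drives the loss up.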

Training
We pre-train the phrase detector for 12 epochs with human explanations on the VQA-X dataset (Park et al. 2018), which contains 29,459 question-answer pairs, each associated with a human explanation. Meanwhile, the VQA module is pretrained on the entire VQAv2 training set combined with the VQS annotations using the Adam optimizer (Kingma and Ba 2014). We then fine-tune the parameters of the VQA model and train the explanation module on the human explanations in the VQA-X dataset by minimizing the combined loss L in Eq. 18, with λ_1, λ_2, and λ_3 set to 1.0, 0.5, and 0.2, respectively. We ran the Adam optimizer for 25 epochs with a batch size of 128 explanations. The learning rate was initialized to 5e-4 and decayed by a factor of 0.8 every three epochs.

Multimodal Explanation Generation
As a last step, we link words in the generated textual explanation to image segments in order to generate the final multimodal explanation. To determine which words to link, we extract all common nouns whose source identifier weight s_1 in Eq. 12 exceeds 0.5. We then link each one to the segmented object with the highest attention weight α_t (Eq. 10) when the corresponding output word y_t was generated, but only if this weight is greater than 0.2.²
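This linking rule can be sketched as follows, where `noun_mask` marks which generated words are common nouns:

```python
import numpy as np

def link_words(words, s1, alphas, noun_mask, s_thresh=0.5, a_thresh=0.2):
    """Link each generated common noun with source weight s1 > s_thresh
    to its most-attended segment, provided that segment's attention
    weight exceeds a_thresh.
    alphas: (T, num_segments) explanation attention per time step."""
    links = {}
    for t, word in enumerate(words):
        if noun_mask[t] and s1[t] > s_thresh:
            seg = int(np.argmax(alphas[t]))
            if alphas[t][seg] > a_thresh:
                links[word] = seg
    return links
```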

Experimental Evaluation
This section presents experimental results that evaluate both the textual and visual aspects of our multimodal explanations, including comparisons to competing methods and ablations that study the impact of the various components of our overall system. As shown in Table 1, we outperform the current state-of-the-art PJ-X model (Park et al. 2018) on all automated metrics by a clear margin. This indicates that constructing explanations that more faithfully reflect the VQA process can actually generate explanations that match human explanations better than just training to directly match human explanations, possibly by avoiding over-fitting and focusing more on important aspects of the test image.

Textual Explanation Evaluation
Human Evaluation: The first Amazon Mechanical Turk (AMT) evaluation asks human judges to directly compare our textual explanations with those of PJ-X (Park et al. 2018) and the ground-truth human explanations. For a fair comparison, we choose 1,000 correctly answered questions from the 1,968 questions in the VQA-X test data and ask workers to rank the explanations produced by our model, the PJ-X model, and one of the three ground-truth explanations (randomly ordered), allowing for ties. Each AMT HIT (Human Intelligence Task) contains 5 ranking tasks, one of which is a "validation" item with an obviously correct answer. For quality control, we discard data from HITs where the validation item is answered incorrectly. We report the percentage of tasks where an automatically generated explanation ranks higher than or ties with the ground-truth explanation. The AMT results in Table 1 show that our explanations match or exceed the judged quality of human explanations 55% of the time, compared to 45% for the PJ-X explanations. We also found that 76.8% of our explanations are ranked as high as or higher than PJ-X's.
Our second AMT experiment asked workers to compare our explanation with all three of the ground-truth human explanations. We randomly chose 1,000 correctly answered questions, showed judges our explanation and all 3 human ones (randomly ordered), and asked them to rank all 4 explanations, allowing for ties. Table 3 shows the number of our explanations assigned to each rank (from best to worst).

        Rank 1   Rank 2   Rank 3   Rank 4
ours    294      228      179      299

Table 3: Ranks compared to 3 human explanations.

Our explanations are clearly competitive with human explanations: 29.4% of them rank first and only 29.9% rank last.
Ablation Study: In this section, we present ablation results evaluating three different aspects of our overall model: phrase detection, the source identifier, and the cover loss. We successively added these aspects to our model, in this order, and evaluated the overall quality of the resulting textual explanations using the automated metrics. As shown in Table 2, we first report the benefit of building our model on image segmentation rather than bounding-box detection by comparing our initial base model (labeled BUTD-S) to the PJ-X approach. Segmentations provide our system with higher-level concepts and more semantic information about the objects that are the focus of the VQA module, making our model both more faithful and easier to explain. The effectiveness of phrase detection is also shown in Table 2: it improves all of the metrics except METEOR, and it particularly improves BLEU-4 and CIDEr, where phrases directly take part in the evaluation metric. The source identifier helps the model by allowing it to choose which content to take from the attended image segments, and the cover loss encourages the system to include as many of these attended segments as possible; both improve overall performance.

Visual and Multimodal Explanation Evaluation
Automated Evaluation: As in previous work (Selvaraju et al. 2017; Park et al. 2018), we first used Earth Mover's Distance (EMD) (Pele and Werman 2008) to compare the image regions highlighted in our explanation to the image regions highlighted by human judges. In order to compare fairly to prior results, we resize all the images in the test split to 14×14 and adjust the segmentations in the images accordingly. Next, we sum the products of the attention values and the source identifier values in Eq. 10 over time t and assign the accumulated attention weight to each corresponding segmentation region. We then normalize the attention weights over the 14×14 resized images to sum to 1, and finally compute the EMD between the normalized attentions and the ground truth.
As shown in the Visual results in Table 1, our approach matches human attention maps more closely than the PJ-X approach. We attribute this improvement to the following reasons. First, we base our approach on detailed image segmentation which avoids the risk of focusing on background and is much more precise than bounding-box detection. Second, our visual explanation is focused by textual explanation where the segmented visual objects are linked to specific words in the textual explanation. As a result, we filter out many of the noisy attentions in a purely visual explanation method.
Human Evaluation: We also asked AMT workers to evaluate our final multimodal explanations that link words in the textual explanation directly to segments in the image. Specifically, we randomly selected 1,000 correctly answered questions and asked Turkers "How well do the highlighted image regions support the answer to the question?", providing a Likert-scale set of possible answers: "Very supportive", "Supportive", "Neutral", "Unsupportive", and "Completely unsupportive". The second task was to evaluate the quality of the links between words and image regions in the explanations. We asked Turkers "How well do the colored image segments highlight the appropriate regions for the corresponding colored words in the explanation?" with the Likert-scale choices: "Very Well", "Well", "Neutral", "Not Well", and "Poorly". We assign five questions to each AMT HIT, with one "validation" item to control the HIT's quality. Table 4 shows the results for both questions. For both tasks, over 75% of the evaluations are positive and over 45% of them are strongly positive. This indicates that our multimodal explanations provide good connections between the visual explanations, the textual explanations, and the VQA process. Fig. 5 presents some sample positively-rated multimodal explanations.

Faithfulness Evaluation
In this section, we try to quantitatively measure the faithfulness of our explanations, i.e., how well they actually reflect the underlying VQA system's reasoning. First, we measured how many words in a generated explanation are actually linked to a visual segmentation in the image. Next, we measured the fraction of the common nouns, whose cosine similarity with at least one of the 3,000 segmentation categories' GloVe representations exceeds 0.6, that are linked to a segmentation.
We analyzed the explanations from 1,000 correctly answered questions from the VQA-X test data. On average, our model is able to link 1.6 words in an explanation to an image segment, and 81.7% of the common nouns are linked to visual segments. This indicates that, at least with respect to nouns that might conceivably reference observable objects, the explanations are faithfully making reference to visual detections that are actually utilized by the underlying VQA system.

Conclusion and Future Work
This paper has presented a new approach to generating multimodal explanations for visual question answering systems that aims to more faithfully represent the reasoning of the underlying VQA system while maintaining the style of human explanations. The approach generates textual explanations with words linked to relevant image regions actually attended to by the underlying VQA system. Experimental evaluations of both the textual and visual aspects of the explanations using both automated metrics and crowdsourced human judgments were presented that demonstrate the advantages of this approach compared to a previouslypublished competing method. In the future, we would like to incorporate more information from the VQA network into the explanations. In particular, we would like to integrate the output of network dissection (Bau et al. 2017) to allow recognizable concepts in the learned hidden-layer representations to be included in explanations. We would also like to move beyond objects and include attributes, relations, and activities actually detected and used by a VQA system in textual explanations.