Generating Question Relevant Captions to Aid Visual Question Answering

Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improving VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the test-standard set using a single model) by simultaneously generating question-relevant captions.


Introduction
In recent years, visual question answering (VQA) (Antol et al., 2015) and image captioning (Donahue et al., 2015;Rennie et al., 2017) have been widely studied in both the computer vision and NLP communities. Most recent VQA research (Lu et al., 2017;Pedersoli et al., 2017;Anderson et al., 2018;Lu et al., 2018) concentrates on directly utilizing visual input features including detected objects, attributes, and relations between pairs of objects.
However, little VQA research has exploited textual features from the image, which can tersely encode the information necessary to answer the questions. This information could be richer than the visual features in that the sentences have fewer structural constraints and can easily include the attributes of and relations among multiple objects. In fact, we observe that appropriate captions can be very useful for many VQA questions. In particular, we trained a model to answer visual questions for the VQA v2 challenge (Antol et al., 2015) using only the human-annotated captions, without images, and achieved a score of 59.6%, outperforming a large number of VQA models that use image features. Existing work using captions for VQA has generated question-agnostic captions using a pretrained captioner (Li et al., 2018a). This approach can provide additional general information; however, this information is not guaranteed to be relevant to the given VQA question.
Figure 1: Human captions and our generated question-relevant captions for an example image.
Human Captions:
1) A man on a blue surfboard on top of some rough water.
2) A young surfer in a wetsuit surfs a small wave.
3) A young man rides a surf board on a small wave while a man swims in the background.
4) A young man is on his surf board with someone in the background.
5) A boy riding waves on his surf board in the ocean.
Question 1: Does this boy have a full wetsuit on? Caption: A young man wearing wetsuit surfing on a wave.
Question 2: What color is the board? Caption: A young man riding a wave on a blue surfboard.
We explore a novel approach that generates question-relevant image descriptions, which contain information that is directly relevant to a particular VQA question. Fig. 1 shows examples of our generated captions given different questions. In order to encourage the generation of relevant captions, we propose a novel greedy algorithm that aims to minimize the cross entropy loss only for the most relevant and helpful gold-standard captions. Specifically, helpfulness is measured using the inner-product of the gradients from the caption generation loss and the VQA answer prediction loss. A positive inner-product means the two objective functions share some descent directions in the optimization process, and therefore indicates that the corresponding captions help the VQA training process.
In order to incorporate the caption information, we propose a novel caption embedding module that, given the question and image features for a visual question, recognizes important words in the caption, and produces a caption embedding tailored for answer prediction. In addition, the caption embeddings are also utilized to adjust the visual top-down attention weights for each object.
Furthermore, generating question-relevant captions ensures that both image and question information is encoded in their joint representations, which reduces the risk of learning from question bias (Li et al., 2018a) and ignoring the image content when high accuracy can be achieved from the questions alone.
Experimental evaluation of our approach shows significant improvements on VQA accuracy over our baseline Up-Down (Anderson et al., 2018) model on the VQA v2 validation set (Antol et al., 2015), from 63.2% to 67.1% with gold-standard human captions from the COCO dataset (Chen et al., 2015) and 65.8% with automatically generated question-relevant captions. Our single model is able to score 68.4% on the test-standard split, and an ensemble of 10 models scores 69.7%.

Visual Question Answering
Recently, a large number of attention-based deep-learning methods have been proposed for VQA, including top-down (Ren et al., 2015a;Fukui et al., 2016;Wu et al., 2016;Goyal et al., 2017;Li et al., 2018a) and bottom-up attention methods (Anderson et al., 2018;Li et al., 2018b;Wu and Mooney, 2019). Specifically, a typical model first extracts image features using a pre-trained CNN, and then trains an RNN to encode the question, using an attention mechanism to focus on specific features of the image. Finally, both question and attended image features are used to predict the final answer.
However, answering visual questions requires not only information about the visual content but also common knowledge, which is usually too hard to learn directly from only a limited number of images with human-annotated answers as supervision. Yet comparatively little previous VQA research has worked on enriching the knowledge base. We are aware of two related papers. Li et al. (2018a) use a pre-trained captioner to generate general captions and attributes with a fixed annotator and then use them to predict answers. As a result, the captions they generate are not necessarily relevant to the question, and they may ignore image features needed for answer prediction. Narasimhan et al. (2018) employed an out-of-the-box knowledge base and trained their model to filter out irrelevant facts. After that, graph convolutional networks use this knowledge to build connections to the relevant facts and predict the final answer. Unlike them, we generate captions to provide information that is directly relevant to the VQA process.

Image Captioning
Most recent image captioning models are also attention-based deep-learning models (Donahue et al., 2015;Karpathy and Fei-Fei, 2015;Vinyals et al., 2015;Luo et al., 2018;Liu et al., 2018). With the help of large image description datasets (Chen et al., 2015), these models have demonstrated remarkable results. Most of them encode the image using a CNN, and build an attentional RNN (e.g. a GRU (Cho et al., 2014) or an LSTM (Hochreiter and Schmidhuber, 1997)) on top of the image features as a language model to generate image captions.
However, deep neural models still tend to generate general captions based on the most significant objects (Vijayakumar et al., 2016). Although previous works (Luo et al., 2018;Liu et al., 2018) build captioning models that are encouraged to generate different captions with discriminability objectives, the captions are usually less informative and fail to describe most of the objects and their relationships diversely. In this work, we develop an approach to generating captions that directly focus on the critical objects in the VQA process and provide information that can help the VQA module predict the answer.

Approach
We first describe the overall structure of our joint model in Sec. 3.1 and explain the foundational feature representations (i.e. image, question and caption embeddings) in Sec. 3.2. Then, the VQA module is presented in Sec. 3.3, which takes advantage of the generated image captions to improve the VQA performance. In Sec. 3.4, we explain the image captioning module which generates question-relevant captions. Finally, the training and implementation details are provided in Sec. 3.5.
Figure 2: Overall structure of our model that generates question-relevant captions to aid VQA. Our model is first trained to generate question-relevant captions as determined in an online fashion in phase 1. Then, the VQA model is fine-tuned with generated captions from the first phase to predict answers. ⊗ denotes element-wise multiplication and ⊕ denotes element-wise addition. Blue arrows denote fully-connected layers (fc) and yellow arrows denote attention embedding.

Overview
As illustrated in Fig. 2, the proposed model first extracts image features $V = \{v_1, v_2, \dots, v_K\}$ using bottom-up attention and question features $q$ to produce their joint representation, and then generates question-related captions. Next, our caption embedding module encodes the generated captions as caption features $c$, as detailed in Sec. 3.2. After that, both the question features $q$ and the caption features $c$ are used to generate the visual attention $A^{cv}$ that weights the image feature set $V$, producing attended image features $v^{qc}$. Finally, we add $v^{qc}$ to the caption features $c$ and further perform element-wise multiplication with the question features $q$ (Anderson et al., 2018) to produce the joint representation of the question, image and caption, which is then used to predict the answer.

Feature Representation
In this section, we explain the details of this joint representation. We use $f(x)$ to denote a fully-connected layer, where $f(x) = \mathrm{LReLU}(Wx + b)$ with input features $x$. For simplicity, we omit the weights and biases from the notation; the individual fc layers do not share weights. LReLU denotes a Leaky ReLU (He et al., 2015).
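As a concrete reading of the definition $f(x) = \mathrm{LReLU}(Wx + b)$, the following minimal sketch implements one such layer in plain Python. The negative slope of 0.01 and the helper names `leaky_relu` and `fc_layer` are assumptions made for illustration; the slope used in our experiments is not stated in this section.

```python
def leaky_relu(z, negative_slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise.

    The slope value 0.01 is an illustrative assumption.
    """
    return z if z > 0 else negative_slope * z


def fc_layer(x, W, b, negative_slope=0.01):
    """Compute f(x) = LReLU(W x + b).

    x: list of input features
    W: weight matrix given as a list of rows (one row per output unit)
    b: list of biases (one per output unit)
    """
    out = []
    for row, bias in zip(W, b):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + bias
        out.append(leaky_relu(pre_activation, negative_slope))
    return out
```

Each occurrence of $f$ in the model would correspond to a separate call with its own `W` and `b`, since the layers do not share weights.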

Image and Question Embedding
We use object detection as bottom-up attention (Anderson et al., 2018), which provides salient image regions with clear boundaries. In particular, we use a Faster R-CNN head (Ren et al., 2015b) in conjunction with a ResNet-101 base network (He et al., 2016) as our detection module. The detection head is first pre-trained on the Visual Genome dataset (Krishna et al., 2017) and is capable of detecting 1,600 object categories and 400 attributes. To generate an output set of image features $V$, we take the final detection outputs and perform non-maximum suppression (NMS) for each object category using an IoU threshold of 0.7. Finally, a fixed number of 36 detected objects for each image are extracted as the image features (a 2,048-dimensional vector for each object) as suggested by Teney et al. (2017).
For the question embedding, we use a standard GRU (Cho et al., 2014) with 1,280 hidden units and extract the output of the hidden units at the final time step as the question features $q$. Following Anderson et al. (2018), the question features $q$ and image feature set $V$ are further embedded together to produce a question-attended image feature set $V^q$ via the question visual attention $A^{qv}$, as illustrated in Fig. 2.
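The NMS step can be illustrated with a short, self-contained sketch of greedy non-maximum suppression at an IoU threshold of 0.7. This is a generic textbook version, not the detector's actual implementation; the box format `(x1, y1, x2, y2)` and the function names `iou` and `nms` are assumptions. In the setting above it would be applied separately to the detections of each object category.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    by more than iou_threshold, and repeat. Returns the kept indices in
    order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two boxes of the same category that overlap with an IoU of about 0.82 collapse to the higher-scoring one, while a distant third box is kept.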

Caption Embedding
Our novel caption embedding module takes as input the question-attended image feature set $V^q$, the question features $q$, and $C$ captions $W^c_i$ ($i = 1, \dots, C$). The goals of the caption module are to serve as a knowledge supplement to aid VQA, and to provide additional clues that better identify the relevant objects and adjust the top-down attention weights. To achieve this, as illustrated in Fig. 3, we use a two-layer GRU architecture. The first-layer GRU (called the Word GRU) sequentially encodes the words in a caption $W^c_i$ at each time step as $h^1_{i,t}$:
$$h^1_{i,t} = \mathrm{GRU}(W_e \Pi^c_{i,t},\; h^1_{i,t-1}) \qquad (1)$$
where $W_e$ is the word embedding matrix, and $\Pi^c_{i,t}$ is the one-hot embedding for the word $w^c_{i,t}$. Then, we design a caption attention module $A^c$ which uses the question-attended feature set $V^q$, the question features $q$, and $h^1_{i,t}$ to generate the attention weight on the current word in order to indicate its importance. Specifically, the Word GRU first encodes the word embedding $\Pi^c_{i,t}$ in Eq. 1, and then we feed the outputs $h^1_{i,t}$ and $V^q$ to the attention module $A^c$ as shown in Eq. 4, where $\sigma$ denotes the sigmoid function and $K$ is the number of objects in the bottom-up attention.
Next, the attended words in the caption are used to produce the final caption representation in Eq. 5 via the Caption GRU. Since the goal is to gather more information, we perform element-wise max pooling across the representations $c_i$ of all of the input captions in Eq. 7:
$$c = \max(c_1, \dots, c_C) \qquad (7)$$
where max denotes the element-wise max pooling across all of the caption representations $c_i$ of the image.
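The pooling step is simple enough to show directly. The sketch below takes the $C$ caption representations $c_i$ as equal-length lists and returns their element-wise maximum; the function name `max_pool_captions` is a hypothetical helper, and the Word GRU, attention, and Caption GRU that produce each $c_i$ are not modeled here.

```python
def max_pool_captions(caption_embeddings):
    """Element-wise max over a list of caption representations c_i.

    caption_embeddings: list of C vectors, each a list of the same length.
    Returns a single vector c with c[d] = max_i c_i[d].
    """
    if not caption_embeddings:
        raise ValueError("at least one caption representation is required")
    dim = len(caption_embeddings[0])
    if any(len(c_i) != dim for c_i in caption_embeddings):
        raise ValueError("all caption representations must share one dimension")
    return [max(values) for values in zip(*caption_embeddings)]
```

Because the maximum is taken per dimension, a feature that is strongly expressed by any single caption survives in $c$, which matches the stated goal of gathering more information across captions.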

VQA Module
This section describes the details of the VQA module. The generated captions are usually capable of capturing relations among the question-relevant objects; however, these relations are absent in the bottom-up attention. Therefore, our VQA module uses the caption embeddings $c$ to adjust the top-down attention weights in VQA in order to produce the final caption-attended features $v^{qc}$ in Eq. 10, where $k$ traverses the $K$ object features.
To better incorporate the information from the captions into the VQA process, we add the caption features $c$ to the attended image features $v^{qc}$, and then element-wise multiply by the question features $q$, as shown in Eq. 11.
We frame the answer prediction task as a multi-label regression problem (Anderson et al., 2018). In particular, we use the soft scores in the gold-standard VQA v2 data (which are used in the evaluation metric) as labels to supervise the sigmoid-normalized predictions, as shown in Eq. 13, where the index $j$ runs over the $N$ candidate answers and $s$ are the soft answer scores.
In case of multiple feasible answers, the soft scores capture the occasional uncertainty in the ground-truth annotations. As suggested by Teney et al. (2017), we collect the candidate answers that appear more than 8 times in the training set, which results in 3,129 answer candidates.
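To make the supervision with soft scores concrete, the sketch below computes a binary cross-entropy between sigmoid-normalized predictions and soft target scores, summed over the candidate answers. It assumes the standard soft-target cross-entropy form commonly used with this setup; the function names and the clamping constant `eps` are illustrative choices and are not taken from our implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_score_bce(logits, soft_scores, eps=1e-12):
    """Binary cross-entropy with soft targets, summed over candidates.

    logits:      one raw prediction per candidate answer
    soft_scores: one target in [0, 1] per candidate answer
    """
    loss = 0.0
    for z, s in zip(logits, soft_scores):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return loss
```

With soft targets, several answers to the same question can each receive partial credit, so the model is not forced to commit to a single label when annotators disagree.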

Image Captioning Module
We adopt an image captioning module similar to that of Anderson et al. (2018), which takes the object detection features as inputs and learns attention weights over those objects' features in order to predict the next word at each step. The key difference between our module and theirs lies in the input features and the caption supervision. Specifically, we use the question-attended image features V q as inputs, and only use the most relevant caption, which is automatically determined in an online fashion (detailed below), for each question-image pair to train the captioning module. This ensures that only question-relevant captions are generated.

Selecting Relevant Captions for Training
Previously, Li et al. (2018b) selected relevant captions for VQA based on word similarities between captions and questions; however, their approach does not take into account the details of the VQA process. In contrast, during training, our approach dynamically determines, for each problem, the caption that will most improve VQA. We do this by updating with a shared descent direction that decreases the loss for both captioning and VQA. This ensures a consistent target for both the image captioning module and the VQA module in the optimization process.
During training, we compute the cross-entropy loss for the i-th caption using Eq. 14, and backpropagate the gradients only from the most relevant caption determined by solving Eq. 15.
In particular, we require the inner product of the current gradient vectors from the predicted answer and the human captions to be greater than a positive constant ξ, and further select the caption that maximizes that inner product.
where $\hat{s}_{pred}$ is the logit for the predicted answer, $W^c_i$ denotes the $i$-th human caption for the image, and $k$ traverses the $K$ object features.
Therefore, given the solution $i^*$ to Eq. 15, the final loss of our joint model is the sum of the VQA loss and the captioning loss for the selected caption, as shown in Eq. 16. If Eq. 15 has no feasible solution, we ignore the caption loss.
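The selection rule described in this section can be sketched as follows, assuming the relevant gradients have already been computed by an automatic differentiation framework and are passed in as plain lists. The data layout (one gradient vector per object feature, summed over the $K$ objects) and the names `select_caption` and `joint_loss` are assumptions made for illustration; the exact quantities being differentiated are those of Eq. 15 and are not reproduced here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_caption(answer_grads, caption_grads, xi):
    """Pick the caption whose gradients best agree with the answer gradients.

    answer_grads:  list of K vectors, the answer gradient for each object
    caption_grads: list over captions; each entry is a list of K vectors
    xi:            positive threshold the inner product must exceed
    Returns the index of the selected caption, or None if no caption has
    an inner product greater than xi.
    """
    best_index, best_score = None, xi
    for i, grads in enumerate(caption_grads):
        score = sum(dot(g_a, g_c) for g_a, g_c in zip(answer_grads, grads))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def joint_loss(vqa_loss, caption_losses, selected):
    """VQA loss plus the loss of the selected caption, if there is one."""
    if selected is None:
        return vqa_loss
    return vqa_loss + caption_losses[selected]
```

When every candidate caption falls at or below the threshold, `select_caption` returns `None` and the step trains on the VQA loss alone, which mirrors the infeasible case described above.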

Training and Implementation Details
We train our joint model using the AdaMax optimizer (Kingma and Ba, 2015) with a batch size of 384 and a learning rate of 0.002 as suggested by Teney et al. (2017). We use the validation set for VQA v2 to tune the initial learning rate and the number of epochs, yielding the highest overall VQA score. We use 1,280 hidden units in the question embedding and attention model in the VQA module with 36 object detection features for each image. For captioning models, the dimensions of the LSTM hidden state, image feature embedding, and word embedding are all set to 512. We also use GloVe vectors (Pennington et al., 2014) to initialize the word embedding matrix in the caption embedding module.
We initialize the training process with human-annotated captions from the COCO dataset (Chen et al., 2015) and pre-train the VQA and caption-generation modules for 20 epochs with the final joint loss in Eq. 16. After that, we generate question-relevant captions for all question-image pairs in the COCO train, validation, and test sets. In particular, we sample 5 captions per question-image pair. We fine-tune our model using the generated captions with 0.25 × learning-rate for another 10 epochs.

Experiments
We perform extensive experiments and ablation studies to evaluate our joint model on VQA.

VQA Dataset
We use the VQA v2.0 dataset (Antol et al., 2015) for the evaluation of our proposed joint model, where the answers are balanced in order to minimize the effectiveness of learning dataset priors. This dataset is used in the VQA 2018 challenge and contains over 1.1M questions from the over 200K images in the MSCOCO 2015 dataset (Chen et al., 2015). Following Anderson et al. (2018), we perform standard text pre-processing and tokenization. In particular, questions are first converted to lower case and then trimmed to a maximum of 14 words, and the words that appear fewer than 5 times are replaced with an "<unk>" token. To evaluate answer quality, we report accuracies using the official VQA metric with soft scores, which accounts for the occasional disagreement between annotators on the ground-truth answers.
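The question pre-processing described above can be sketched in a few lines. Whitespace tokenization, the handling of punctuation (left attached to words here), and counting word frequencies after trimming are simplifying assumptions; the function names are hypothetical and do not refer to any released code.

```python
from collections import Counter


def tokenize(question, max_len=14):
    """Lower-case, split on whitespace, and trim to max_len words."""
    return question.lower().split()[:max_len]


def build_vocab(questions, min_count=5):
    """Keep words that appear at least min_count times in the questions."""
    counts = Counter(tok for q in questions for tok in tokenize(q))
    return {tok for tok, n in counts.items() if n >= min_count}


def encode(question, vocab):
    """Replace out-of-vocabulary words with the "<unk>" token."""
    return [tok if tok in vocab else "<unk>" for tok in tokenize(question)]
```

For instance, a word seen only once in the training questions is mapped to "<unk>", while the frequent words around it are kept unchanged.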

Image Captioning Dataset
We use the MSCOCO 2014 dataset (Chen et al., 2015) for the image caption module. To maintain consistency with the VQA tasks, we use the dataset's official configuration that includes 82,372 images for training and 40,504 for validation. Similar to the VQA question pre-processing, we first convert all sentences to lower case, tokenize on white space, and filter out words that do not occur at least 5 times.

Results on VQA
We first report the experimental results on the VQA task and compare our results with the state-of-the-art methods in this section. After that, we perform ablation studies to verify the contribution of additional knowledge from the generated captions, and the effectiveness of using caption representations to adjust the top-down visual attention weights.
As demonstrated in Table 1, our single model outperforms other state-of-the-art single models by a clear margin, i.e. 2.06%, which indicates the effectiveness of including caption features as additional inputs. In particular, we observe that our single model outperforms other methods, especially in the 'Num' and 'Other' categories. This is because the generated captions are capable of providing more numerical clues for answering the 'Num' questions, since the captions can describe the number of relevant objects and provide general knowledge for answering the 'Other' questions. Furthermore, an ensemble of 10 models with different initialization seeds results in a score of 69.7% for the test-standard set. Fig. 4 shows several examples of our generated question-relevant captions.
These examples illustrate how different captions are generated for the same image when the question is changed. They also show how the objects in the image that are important to answering the question are described in the question-relevant captions.

Comparison Between Using Generated and Human Captions
Next, we analyze the difference between using automatically generated captions and using those provided by human annotators. In particular, we train our model with generated question-agnostic captions using the Up-Down (Anderson et al., 2018) captioner, question-relevant captions from our caption generation module, and human annotated captions from the COCO dataset.
As demonstrated in Table 2, our model gains about 4% improvement from using human captions and 2.6% improvement from our generated question-relevant captions on the validation set. This indicates the insufficiency of directly answering visual questions using a limited number of detection features, and the utility of incorporating additional information about the images. We also observe that our generated question-relevant captions trained with our caption selection strategy provide more helpful clues for the VQA process than the question-agnostic Up-Down captions, outperforming their captions by 1.2%.

Table 2: VQA accuracy (%) on the validation set.
Model                                 Validation
Up-Down (Anderson et al., 2018)       63.2
Ours with Up-Down captions            64.6
Ours with our generated captions      65.8
Ours with human captions              67.1

Effectiveness of Adjusting Top-Down Attention
In this section, we quantitatively analyze the effectiveness of using captions to adjust the top-down attention weights, in addition to the advantage of providing additional information. In particular, we compare our model with a baseline version where the top-down attention-weight adjustment factor $A^{cv}$ is manually set to 1.0 (resulting in no adjustment). As demonstrated in Tables 3 and 4, we observe an improvement when using caption features to adjust the attention weights. This indicates that the caption features help the model to more robustly locate the objects that are helpful to the VQA process. We use w CAA to indicate with caption attention adjustment and w/o CAA to indicate without it.
Fig. 5 illustrates an example of caption attention adjustment. Without CAA, the top-down visual attention focuses on both the yellow surfboard and the blue sail, generating the incorrect answer "yellow and blue." However, with "yellow board" in the caption, the caption attention adjustment (CAA) helps the VQA module focus attention just on the yellow surfboard, thereby generating the correct answer "yellow and red" (since there is some red coloring in the surfboard).
Next, in order to directly demonstrate that our generated question-relevant captions help the model to focus on more relevant objects via attention adjustment, we compare the differences between the generated visual attention and human-annotated important objects from the VQA-X dataset (Park et al., 2018), which has been used to train and evaluate multimodal (visual and textual) VQA explanation (Wu and Mooney, 2018). The VQA-X dataset contains 2,000 question-image pairs from the VQA v2 validation set with human annotations indicating the objects which most influence the answer to the question. In particular, we used Earth Mover's Distance (EMD) (Rubner et al., 2000) to compare the highly-attended objects in the VQA process to the objects highlighted by human judges. This style of evaluation using EMD has previously been employed to compare automatic visual explanations to human-attention annotations (Selvaraju et al., 2017;Park et al., 2018).
We resize all of the 2,000 human annotations in the VQA-X dataset to 14 × 14 and adjust the object bounding boxes in the images accordingly. Next, we assign the top-down attention weights to the corresponding bounding boxes, both before and after caption attention adjustment, and add up the weights of all 36 detections. Then, we normalize the attention weights over the 14 × 14 resized images to sum to one, and finally compute the EMD between the normalized visual attentions and the human annotations. Table 5 reports the EMD results for the attention weights both before and after caption attention adjustment. The results indicate that caption attention adjustment improves the match between automated attention and human-annotated attention, even though the approach is not trained on supervised data for human attention. Not surprisingly, human captions provide a bit more improvement than automatically generated ones.
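The construction of the normalized attention map can be sketched as below. The sketch assumes that bounding boxes have already been rescaled to integer cell coordinates on the 14 × 14 grid, given as half-open ranges `(x1, y1, x2, y2)`, and that each detection adds its full attention weight to every cell it covers; both are illustrative assumptions, as is the function name `attention_map`. The EMD itself is not computed here, since it requires an optimal-transport solver.

```python
def attention_map(boxes, weights, size=14):
    """Accumulate per-box attention weights on a size x size grid and
    normalize the grid to sum to one.

    boxes:   list of (x1, y1, x2, y2) in grid cells, half-open ranges
    weights: one top-down attention weight per box
    """
    grid = [[0.0] * size for _ in range(size)]
    for (x1, y1, x2, y2), w in zip(boxes, weights):
        for row in range(max(0, y1), min(size, y2)):
            for col in range(max(0, x1), min(size, x2)):
                grid[row][col] += w
    total = sum(sum(r) for r in grid)
    if total > 0:
        grid = [[v / total for v in r] for r in grid]
    return grid
```

Building one such map from the attention weights before adjustment and one from the weights after adjustment gives the two distributions that are then compared against the human annotation.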

Conclusion
In this work, we have explored how generating question-relevant image captions can improve VQA performance. In particular, we present a model which jointly generates question-related captions and uses them to provide additional information to aid VQA. This approach only utilizes existing image-caption datasets, automatically determining which captions are relevant to a given question. In particular, we design the training algorithm to only update the network parameters in the optimization process when the caption generation and VQA tasks agree on the direction of change. Our single model joint system outperforms the current state-of-the-art single model for VQA.