Improving Visual Question Answering by Referring to Generated Paragraph Captions

Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.


Introduction
Understanding visual information along with natural language has been studied in different ways. In visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017; Lu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Zhu et al., 2016; Anderson et al., 2018), models are trained to choose the correct answer given a question about an image. On the other hand, in image captioning tasks (Karpathy and Fei-Fei, 2015; Johnson et al., 2016; Anderson et al., 2018; Krause et al., 2017; Liang et al., 2017; Melas-Kyriazi et al., 2018), the goal is to generate sentences that describe a given image. Similar to the VQA task, image captioning models should also learn the relationship between partial areas in an image and the generated words or phrases. While these two tasks seem to have different directions, they have the same purpose: understanding visual information with language. If their goal is similar, can the tasks help each other?
In this work, we propose an approach to improve a VQA model by exploiting textual information from a paragraph captioning model. Suppose you are assembling furniture by looking at a visual manual. If you are stuck at a certain step and you are given a textual manual which more explicitly describes the names and shapes of the related parts, you could complete that step by reading this additional material and also by comparing it to the visual counterpart. With a similar intuition, paragraph-style descriptive captions can more explicitly (via intermediate symbolic representations) explain what objects are in the image and their relationships, and hence VQA questions can be answered more easily by matching the textual information with the questions.
We provide a VQA model with such additional 'textual manual' information to enhance its ability to answer questions. We use descriptive captions generated from a paragraph captioning model, which capture more detailed aspects of an image than a single-sentence caption (which only conveys the most obvious or salient single piece of information). We also extract properties of objects, i.e., names and attributes, from images to create simple sentences in the form of "[object name] is [attribute]". Our VTQA model takes these paragraph captions and attribute sentences as input in addition to the standard input image features. The VTQA model combines the information from text and image with early fusion, late fusion, and later fusion. With early fusion, visual and textual features are combined via cross-attention to extract related information. Late fusion collects the scores of candidate answers from each module to come to an agreement. In later fusion, expected answers are given an extra score if they are in the recommendation list, which is created from the properties of detected objects. Empirically, each fusion technique provides complementary gains from paragraph caption information to improve VQA model performance, overall achieving significant improvements over a strong baseline VQA model. We also present several ablation studies and attention visualizations.

Related Work
Visual Question Answering (VQA): VQA has been one of the most active areas among efforts to connect language and vision (Malinowski and Fritz, 2014; Tu et al., 2014). The recent success of deep neural networks, attention modules, and object plus salient region detection has made more effective approaches possible (Antol et al., 2015; Goyal et al., 2017; Lu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Zhu et al., 2016; Anderson et al., 2018).
Paragraph Image Captioning: Another thread of research which deals with combined vision-and-language problems is the translation of visual content to natural language. The first approach to this used a single-sentence image captioning model (Karpathy and Fei-Fei, 2015). However, this task is not able to accommodate the variety of aspects of a single image. Johnson et al. (2016) expanded single-sentence captioning to describe each object in an image via a dense captioning model. Recently, paragraph captioning models (Krause et al., 2017; Liang et al., 2017; Melas-Kyriazi et al., 2018) attempt to capture the many aspects in an image more coherently.

Models
The basic idea of our approach is to provide the VQA model with extra text information from paragraph captions and object properties (see Fig. 1).

Paragraph Captioning Model
Our paragraph captioning module is based on the work of Melas-Kyriazi et al. (2018), which uses CIDEr (Vedantam et al., 2015) directly as a reward to train the model. They make the approach possible by employing self-critical sequence training (SCST) (Rennie et al., 2017). However, employing only RL training causes repeated sentences. As a solution, they apply an n-gram repetition penalty to prevent the model from generating such duplicated sentences. We adopt their model and approach to generate paragraph captions.
Paragraph Captions: These provide diverse aspects of an image by describing the whole scene. We use GloVe (Pennington et al., 2014) for the word embeddings. The embedded words are sequentially fed into the encoder, for which we use a GRU (Cho et al., 2014), to create a sentence representation $s_i \in \mathbb{R}^{d}$: $s_i = \mathrm{ENC}_{sent}(w_{0:T})$, where $T$ is the number of words. The paragraph feature is a matrix which contains one sentence representation in each row, $P \in \mathbb{R}^{K \times d}$, where $K$ is the number of sentences in a paragraph.
Object Property Sentences: The other text we use comes from the properties of detected objects in images (name and attributes), which can provide explicit information about the corresponding object to a VQA model. We create simple sentences of the form "[object name] is [attributes]". We then obtain sentence representations by following the same process as for the paragraph captions above. Each sentence vector is then attached to the corresponding visual feature, like a 'name tag', to allow the model to identify objects in the image and their corresponding traits.
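As an illustration of the sentence template above, the following minimal sketch builds one "[object name] is [attributes]" string per detected object, keeping the output aligned with the object order so each sentence can later be paired with its visual feature. The function name `property_sentences`, the `(name, attributes)` input format, joining several attributes with "and", and falling back to the bare name when no attribute is detected are all assumptions made for this example, not details given in the paper.

```python
def property_sentences(detections):
    """Build one simple sentence per detected object.

    `detections` is a list of (name, attributes) pairs; this input format
    is hypothetical and chosen only for illustration.
    """
    sentences = []
    for name, attributes in detections:
        if attributes:
            # Assumption: multiple attributes are joined with "and".
            sentences.append(f"{name} is {' and '.join(attributes)}")
        else:
            # Assumption: with no attribute, keep the bare name so the
            # output stays aligned one-to-one with the detected objects.
            sentences.append(name)
    return sentences


example = property_sentences([
    ("cow", ["brown"]),
    ("grass", ["green", "short"]),
    ("fence", []),
])
```

Keeping one entry per object, even when no attribute is available, preserves the one-to-one alignment with the visual features that the 'name tag' idea relies on.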

[Figure: model overview diagram with stages labeled Early Fusion, Attention (Att.), Late Fusion, and Later Fusion.]

Three Fusion Levels
Early Fusion: In the early fusion stage, visual features are fused with paragraph caption and object property features to extract relevant information. For the visual features $V$ and the paragraph caption features $P$, cross-attention is applied to obtain the similarity between each component of the visual features (objects) and of the paragraph caption (sentences). We follow the approach of Seo et al. (2016) to compute the similarity matrix $S \in \mathbb{R}^{O \times K}$, where $O$ is the number of objects. From the similarity matrix we compute $V^p = \mathrm{softmax}(S^T)V$, and the new paragraph representation $P^f$ is obtained by concatenating $P$ and $P * V^p$: $P^f = [P; P * V^p]$, where $*$ is the element-wise product. The visual features and the object property features $C$ are already aligned, so the new visual feature $V^f$ becomes $V^f = [V; V * C]$. Given the fused representations, an attention mechanism is applied over the rows of each representation to give more weight to the features that are relevant to the question.
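The early-fusion step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the similarity matrix here is a plain dot product between object and sentence vectors, standing in for the similarity function of Seo et al. (2016), which is not reproduced; the helper names (`softmax`, `matmul`, `transpose`, `early_fusion`) are hypothetical; and visual and sentence features are assumed to share the same dimension d.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def early_fusion(V, P, C):
    """V: O x d object features, P: K x d sentence features, C: O x d property features."""
    # Assumption: dot-product similarity, S[o][k] for object o and sentence k.
    S = matmul(V, transpose(P))                      # O x K
    attn = [softmax(row) for row in transpose(S)]    # softmax(S^T): K x O
    Vp = matmul(attn, V)                             # V^p: K x d
    # P^f = [P ; P * V^p] and V^f = [V ; V * C], with * element-wise.
    Pf = [p + [a * b for a, b in zip(p, vp)] for p, vp in zip(P, Vp)]
    Vf = [v + [a * b for a, b in zip(v, c)] for v, c in zip(V, C)]
    return Pf, Vf


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # O = 3 objects, d = 2
P = [[1.0, 0.0], [0.0, 2.0]]               # K = 2 sentences
C = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # property features aligned with V
Pf, Vf = early_fusion(V, P, C)
```

Each row of `Pf` and `Vf` has twice the original feature dimension because of the concatenation.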
Here, $s^f_i$ is a row vector of the new fused paragraph representation and $q$ is the representation vector of the question, which is encoded with a GRU. $w_a^T$, $W_{sa}$, and $W_{qa}$ are trainable weights. Given the attention weights $\alpha_i$, the weighted sum of the row vectors $s^f_i$ gives the final paragraph vector $p = \sum_{i=1}^{K} \alpha_i s^f_i$. The paragraph vector is fed to a nonlinear layer and combined with the question vector by element-wise product.
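The weighted-sum pooling $p = \sum_i \alpha_i s^f_i$ can be illustrated with a short sketch. It takes the unnormalized attention scores as an input rather than computing them, since the exact scoring function (which uses $w_a$, $W_{sa}$, and $W_{qa}$) is not reproduced here; normalizing the scores with a softmax and the name `attention_pool` are assumptions of this example.

```python
import math


def attention_pool(rows, scores):
    """Softmax-normalize `scores` and return (alphas, weighted sum of `rows`).

    `rows` plays the role of the fused sentence vectors s^f_i; `scores`
    are assumed to come from some question-conditioned scoring function.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(rows[0])
    pooled = [sum(a * r[j] for a, r in zip(alphas, rows)) for j in range(dim)]
    return alphas, pooled


# With equal scores the pooled vector is simply the mean of the rows.
alphas, p = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```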
In this step, $W_p$ and $W_q$ are trainable weights, and $L_p$ contains the scores for each candidate answer. The same process is applied to the visual features to obtain $L_v = \mathrm{classifier}(v^q)$.
Late Fusion: In late fusion, the logits from each module are integrated into one vector. We adopt the approach of Wang et al. (2016). Instead of just adding the logits, we create two more vectors by max-pooling and averaging those logits, and add them to create a new logit $L_{new} = L_1 + L_2 + \dots + L_n + \dots + L_{max} + L_{avg}$, where $L_n$ is the $n$th logit, and $L_{max}$ and $L_{avg}$ are obtained by max-pooling and averaging all other logits. The intuition behind creating these logits is that they can act as extra voters, making the model more robust.
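A minimal sketch of this late-fusion rule, assuming each module produces a logit vector of the same length and that the max and average are taken element-wise across modules (the function name `late_fusion` and the toy values are illustrative only):

```python
def late_fusion(logits):
    """Return L_new = sum_n L_n + L_max + L_avg, computed per candidate answer."""
    if not logits:
        raise ValueError("need at least one logit vector")
    n = len(logits)
    fused = []
    for scores in zip(*logits):  # scores of one candidate answer across modules
        total = sum(scores)
        fused.append(total + max(scores) + total / n)
    return fused


visual_logits = [2.0, 0.5, -1.0]
paragraph_logits = [1.0, 1.5, 0.0]
fused = late_fusion([visual_logits, paragraph_logits])
```

For the first candidate answer, for example, the sum is 3.0, the max is 2.0, and the average is 1.5, giving 6.5.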

Answer Recommendation or 'Later Fusion':
Salient regions of an image can draw people's attention, and thus questions and answers are much more likely to be related to those areas. Objects often denote the most prominent locations of these salient areas. From this intuition, we introduce a way to directly connect the salient spots with candidate answers. We collect the properties (name and attributes) of all detected objects and search over the answers to determine which answers can be extracted from the properties. Answers in this list of expected answers are given extra credit to increase their chance of being selected. If the logit $L_{before}$ from the final layer contains the scores of each answer, we want to raise the scores to the logit $L_{after}$ if the corresponding answers are in the list $l_c$:

$L_{before} = \{a_1, a_2, \dots, a_n, \dots\}$, $L_{after} = \{\hat{a}_1, \hat{a}_2, \dots, \hat{a}_n, \dots\}$ (5)

$\hat{a}_n = a_n + c \cdot \mathrm{std}(L_{before})$ if $n \in l_c$, and $\hat{a}_n = a_n$ otherwise (6)

where the $\mathrm{std}(\cdot)$ operation calculates the standard deviation of a vector and $c$ is a tunable parameter. $l_c$ is the list of the word indices of detected objects and their corresponding attributes. The indices of the objects and the attributes are converted to the indices of candidate answers.
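The score adjustment in Eqs. (5) and (6) can be sketched as below. The text does not say whether the standard deviation is the population or the sample version, so the use of the population standard deviation (`statistics.pstdev`) is an assumption, as are the function name `recommend_answers` and the toy logits.

```python
import statistics


def recommend_answers(logits, recommended, c=1.0):
    """Add c * std(logits) to the score of every answer index in `recommended`.

    `recommended` plays the role of l_c: candidate-answer indices derived
    from the names and attributes of detected objects.
    """
    boost = c * statistics.pstdev(logits)  # assumption: population std
    return [a + boost if n in recommended else a for n, a in enumerate(logits)]


before = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # population std is 2.0
after = recommend_answers(before, recommended={1, 6}, c=1.0)
```

Scaling the bonus by the standard deviation keeps it proportional to the spread of the scores, so the same value of c behaves comparably across questions with flatter or sharper logits.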

Experimental Setup
Paragraph Caption: We use the paragraph annotations of images from Visual Genome collected by Krause et al. (2017), since this is the only dataset (to our knowledge) that annotates long-form paragraph image captions. We follow the dataset split of 14,575 / 2,487 / 2,489 (train / validation / test).
Visual Question Answering Pairs: We also use the VQA pairs dataset from Visual Genome so as to match it with the provided paragraph captions. We follow almost the same image dataset split as for the paragraph caption data, except that we do not include images that do not have their own question-answer pairs in the train and evaluation sets. The total number of candidate answers is 177,424. Because that number is too large to train on, we remove the question-answer pairs whose answer's frequency is under 30, which gives us a list of 3,453 answers. The final numbers of question-answer pairs are 171,648 / 29,759 / 29,490 (train / validation / test).
Training Details: Our hyperparameters are selected using the validation set. The size of the visual feature of each object is set to 2048, and the dimensions of the hidden layers of the question encoder and the caption encoder are 1024 and 2048, respectively. We use AdaMax (Kingma and Ba, 2014) as the optimizer, with a learning rate of 0.002. We modulate the final credit, which is added to the final logit of the model, by multiplying it by a scalar value c (we tune this to 1.0).
[...] (Yu et al., 2017). This implies that our textual data helps improve VQA model performance by providing clues to answer questions. We run each model five times with different seeds and report the average. For each of the five runs, our VTQA model performs significantly better (p < 0.001) than the VQA baseline model.
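The answer-frequency truncation described in the setup above (dropping question-answer pairs whose answer occurs fewer than 30 times) can be sketched as follows. The function name, the `(question, answer)` tuple format, and counting frequencies over whichever list of pairs is passed in are assumptions for this example; the text does not state over which split the frequencies were counted.

```python
from collections import Counter


def filter_by_answer_frequency(qa_pairs, min_count=30):
    """Keep only pairs whose answer appears at least `min_count` times.

    Returns the kept pairs and a mapping from each kept answer to its
    candidate-answer index.
    """
    counts = Counter(answer for _, answer in qa_pairs)
    kept_answers = sorted(a for a, n in counts.items() if n >= min_count)
    answer_to_index = {a: i for i, a in enumerate(kept_answers)}
    kept_pairs = [(q, a) for q, a in qa_pairs if a in answer_to_index]
    return kept_pairs, answer_to_index


# Toy data with a threshold of 2 instead of 30.
toy_pairs = [
    ("what color is the cow?", "brown"),
    ("what color is the dog?", "brown"),
    ("how many cows?", "two"),
    ("how many birds?", "two"),
    ("what is on the table?", "vase"),
]
kept, vocab = filter_by_answer_frequency(toy_pairs, min_count=2)
```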

Late Fusion and Later Fusion Ablations
As shown in row 2 of Table 2, late fusion improves the model by 0.95%, indicating that visual and textual features complement each other. As shown in rows 3 and 4 of Table 2, giving an extra score to the expected answers increases the accuracy by 1.54% over the base model (row 1) and by 1.24% over the result of late fusion (row 2), respectively. This could imply that salient parts (in our case, objects) can give direct cues for answering questions.

Ground-Truth vs. Generated Paragraphs
We manually investigate (on 300 examples) how many questions can be answered only from the ground-truth (GT) versus the generated paragraph (GenP) captions. We also train a TextQA model (which uses a cross-attention mechanism between question and caption) to evaluate the performance of the GT and GenP captions. As shown in Table 3, the GT captions can answer more questions correctly than the GenP captions in the TextQA model evaluation. Human evaluation with GT captions also shows better performance than with GenP captions, as seen in Table 4. However, the results from the man- [...]

Attention Analysis
Finally, we also visualize the attention over each sentence of an input paragraph caption w.r.t. a question. As shown in Figure 2, a sentence which contains a direct clue for a question gets much higher weights than the others. This explicit textual information helps a VQA model handle what might be hard to reason about only visually, e.g., 'two (2) cows'. Please see Appendix A for more attention visualization examples.

Conclusion
We presented a VTQA model that combines visual and paragraph-captioning features to significantly improve visual question answering accuracy, via a model that performs early, late, and later fusion. While our model showed promising results, it still used a pre-trained paragraph captioning model to obtain the textual symbolic information. In future work, we plan to investigate whether the VTQA model can be jointly trained with the paragraph captioning model.

Footnote 2: We also ran our full VTQA model with the ground-truth (GT) paragraph captions and got an accuracy of 48.04% on the validation dataset (we ran the model five times with different seeds and averaged the scores), whereas the VTQA result from generated paragraph captions was 47.43%. This again implies that our current VTQA model is not able to extract enough of the information from the GT paragraph captions for answering questions, and hence improving the model to better capture clues from GT captions is useful future work.