Multi-grained Attention with Object-level Grounding for Visual Question Answering

Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method that learns explicit word-object correspondence through two types of word-level attention complementary to the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves performance competitive with state-of-the-art models, and the visualized attention maps demonstrate that adding object-level grounding leads to a better understanding of the images and locates the attended objects more precisely.


Introduction
Visual Question Answering (Antol et al., 2015; Goyal et al., 2017a) is a multi-modal task that requires answering a question with reference to a given image. Most current VQA systems resort to deep neural networks and solve the problem by end-to-end learning. First, the question and the image are encoded into semantic representations independently. Then the multi-modal features are fused into one unified representation, from which the answer is predicted (Malinowski et al., 2015; Fukui et al., 2016; Anderson et al., 2018).
A key point for a successful VQA system is to discover the image regions most relevant to the question. This is commonly resolved by attention mechanisms, where a spatial attention distribution highlighting the visual focus is computed according to the similarity between the whole question and image regions (Xu et al., 2015; Lu et al., 2016). Although such coarse sentence-image alignment reports promising results in general, it sometimes fails to locate small objects or understand a complicated scenario. For the example in Figure 1, the question is "What is the man wearing around his face?". Humans have no difficulty finding the visual clue on the people's faces, and accordingly provide the correct answer "glasses". However, by visualizing the attention map of a state-of-the-art VQA model, we find that the attention is mistakenly focused on the men's bodies rather than their faces.

To identify related objects more precisely, this paper proposes a multi-grained attention mechanism that involves object-level grounding complementary to the sentence-image association. Specifically, a matching model is trained on an object-detection dataset to learn explicit correspondence between the content words in the question and their visual counterparts. In addition, the labels of the detected objects are considered and their similarity with the question is computed. A more sophisticated language model is also adopted for better representation of the question. Finally, the three types of attention (word-object, word-label, and sentence-image) are accumulated to enhance performance.
The contributions of this paper are twofold. First, it proposes a multi-grained attention mechanism integrating two types of object features that were not previously used in VQA attention approaches. Second, the deep contextualized word representation ELMo (Peters et al., 2018) is adopted in the VQA task for the first time, to facilitate better question encoding.

Proposed Model
The flowchart of the proposed model is illustrated in Figure 2. We start from the bottom-up top-down (up-down) model (Teney et al., 2017; Anderson et al., 2018), the winning entry to the 2017 VQA challenge. This model is then enhanced with two types of object-level groundings to explore fine-grained information, and with a more sophisticated language model for better question representation.

Image Features
We adopt an object-detection-based approach to represent the input image. Specifically, following Anderson et al. (2018), a state-of-the-art object detection model, Faster R-CNN (Ren et al., 2015) with ResNet-101 as its backbone, is trained on the Visual Genome (VG) (Krishna et al., 2016) dataset. The trained model is then applied to identify instances of objects with bounding boxes belonging to certain categories. The target categories of this detection model comprise 1600 objects and 400 attributes. For each input image, the top-K objects with the highest confidence scores are selected to represent the image. For each object, the output of ResNet's pool-flat-5 layer is used as its visual feature, a 2048-dimensional vector v_k. Besides, the label of each object's category c_k is also kept as visually grounded evidence; c_k is an N-dimensional one-hot vector, where N is the vocabulary size. The input image is then represented by the object features V = [v_1, ..., v_K] together with their labels [c_1, ..., c_K].
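As a minimal numpy sketch of the selection step described above, the following keeps the top-K detections by confidence and returns their visual features and labels. The function name and the toy random "detections" are illustrative, not the paper's code:

```python
import numpy as np

def build_image_representation(features, scores, labels, top_k=36):
    """Select the top-K detected objects by confidence to represent an image.

    features: (M, 2048) per-object pool-flat-5 feature vectors v_k
    scores:   (M,) detector confidence scores
    labels:   length-M list of category labels c_k
    Returns (V, kept_labels), with V of shape (top_k, 2048).
    """
    order = np.argsort(-scores)[:top_k]        # indices, highest confidence first
    V = features[order]                        # visual features of kept objects
    kept_labels = [labels[i] for i in order]   # grounded category labels
    return V, kept_labels

# toy usage with random stand-in detections
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 2048)).astype(np.float32)
scores = rng.random(100)
labels = [f"obj{i}" for i in range(100)]
V, kept = build_image_representation(feats, scores, labels, top_k=36)
```

In the actual model the features would come from the trained Faster R-CNN rather than random vectors; only the top-K bookkeeping is shown here.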

Text Features
In our model, text features include token features and sentence features for the question, which are respectively used for fine-grained and coarse-grained attention computation.
Word Features Let Q = [q_1, ..., q_T] ∈ R^{N×T} denote the one-hot representation of the input question tokens, where T is the question length and N is the vocabulary size. Each token q_t is then turned into two word embeddings: a GloVe embedding (Pennington et al., 2014) x_t^G ∈ R^{D1}, looked up from E_G, and an ELMo embedding x_t^E ∈ R^{D2}, where D1 and D2 are the dimensions of the GloVe and ELMo embeddings respectively. E_G is the GloVe matrix pre-trained on Wikipedia & Gigaword. The ELMo embedding is dynamically computed by an L-layer bi-LSTM (Hochreiter and Schmidhuber, 1997) language model. We use the publicly available pre-trained ELMo model to get the contextualized embeddings.
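The per-token representation can be sketched as a static lookup concatenated with a contextual vector. In this sketch the GloVe matrix and the ELMo outputs are random stand-ins for the pre-trained models:

```python
import numpy as np

D1, D2, N, T = 300, 1024, 5000, 14
rng = np.random.default_rng(1)
E_G = rng.standard_normal((N, D1)).astype(np.float32)  # stand-in GloVe matrix

def embed_tokens(token_ids, elmo_embeddings):
    """Concatenate static GloVe and contextual ELMo embeddings per token.

    token_ids:       (T,) integer ids into the vocabulary
    elmo_embeddings: (T, D2) contextual vectors (here random stand-ins for
                     a pre-trained ELMo's output)
    Returns a (T, D1 + D2) matrix of token features.
    """
    glove = E_G[token_ids]  # (T, D1) static lookup, one row per token
    return np.concatenate([glove, elmo_embeddings], axis=1)

ids = rng.integers(0, N, size=T)
elmo = rng.standard_normal((T, D2)).astype(np.float32)
X = embed_tokens(ids, elmo)
```

With D1 = 300 and D2 = 1024 as in the paper's settings, each token is a 1324-dimensional vector.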

Sentence Features
The two sets of token embeddings above are concatenated and fed into a GRU (Cho et al., 2014) to encode the question sentence. The final hidden state of the GRU, h_T ∈ R^{D3}, is taken as the sentence feature, where D3 is the GRU hidden state size.
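A bare-bones GRU makes the sentence-encoding step concrete: the token features are consumed one step at a time and the last hidden state serves as h_T. The weights here are untrained random stand-ins; a real system would use a trained recurrent layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalGRU:
    """A minimal GRU whose final hidden state h_T is the sentence feature."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1  # small init so activations stay in a sensible range
        self.Wz = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * s
        self.Wr = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * s
        self.Wh = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * s
        self.hidden_dim = hidden_dim

    def encode(self, X):
        """X: (T, input_dim) token embeddings -> h_T: (hidden_dim,)."""
        h = np.zeros(self.hidden_dim)
        for x in X:
            xh = np.concatenate([x, h])
            z = sigmoid(self.Wz @ xh)  # update gate
            r = sigmoid(self.Wr @ xh)  # reset gate
            h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
            h = (1 - z) * h + z * h_tilde
        return h

gru = MinimalGRU(input_dim=1324, hidden_dim=1024)
h_T = gru.encode(np.random.default_rng(2).standard_normal((14, 1324)))
```

The dimensions (1324-dimensional inputs, D3 = 1024 hidden units) match the paper's settings; biases are omitted for brevity.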

Multi-grained Attentions
Word-Label Matching Attention (WL) Object category labels are high-level semantic representation compared to visual pixels, and have proven to be useful for both visual tasks like scene classification (Li et al., 2010) and multi-modal tasks like image caption and VQA (Wu et al., 2018).
For the VQA task, we observe that the semantic similarity between the object category labels and the words in the question helps to locate the referred objects. For the input image in Figure 1, Faster R-CNN detects objects with labels such as "man" and "head". Some labels are exactly the same as, or semantically close to, the words in the question "What is the man wearing around his face?". Therefore, we compute the WL attention vector, which indicates how much weight to give to each of the K objects in the image, in terms of the semantic similarity between the category labels of the objects and the words in the question. The k-th object with label c_k is encoded into a GloVe embedding l_k^G = c_k E_G, and its attention score is computed by measuring its similarity to the question GloVe embeddings and normalizing over the K objects, where X^G = [x_1^G, ..., x_T^G] ∈ R^{D1×T} is the GloVe embeddings of the question tokens, L^G = [l_1^G, ..., l_K^G] ∈ R^{D1×K} is the GloVe embeddings of the object labels, and a^{WL} ∈ R^K is the WL attention vector. In contrast to Anderson et al. (2018), who only use objects' visual features without the labels, and unlike Wu et al. (2018), who discard the visual features once the labels are generated, we utilize both the category labels and the visual features to enhance fine-grained attention with object-level grounding.
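The WL attention can be sketched as follows: for each object label, take its similarity to the best-matching question word, then normalise over the K objects. The max-then-softmax aggregation is one plausible reading of the (garbled) equation in the extracted text, not a verbatim reproduction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def wl_attention(X_G, L_G):
    """Word-label attention over K objects.

    X_G: (D1, T) GloVe embeddings of the question tokens
    L_G: (D1, K) GloVe embeddings of the object labels
    Returns a_WL: (K,) attention weights summing to 1.
    """
    sim = L_G.T @ X_G         # (K, T) label-word similarity matrix
    scores = sim.max(axis=1)  # best-matching question word per object
    return softmax(scores)    # normalise over the K objects

rng = np.random.default_rng(3)
a_WL = wl_attention(rng.standard_normal((300, 14)),
                    rng.standard_normal((300, 36)))
```

With D1 = 300, T = 14, and K = 36 as in the paper's settings, the output is a length-36 distribution over the detected objects.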
Word-Object Matching Attention (WO) A word-object matching module is exploited to directly evaluate how likely a question word matches a visual object. The pairwise training structure of the module is shown in Figure 3. The training set is constructed from the VG object detection data. Let (c, b) be a positive sample consisting of an annotated object bounding box b with category label c; a negative sample (c, b̃) is then constructed by randomly replacing b with an object b̃ from the same image, provided b̃ is not labelled with c. (GloVe embeddings alone, rather than ELMo, are used for object labels because labels have no context sentence from which to derive context-sensitive ELMo embeddings.) Each sample (c, b) is represented as a feature pair (x_c^G, v_b), where x_c^G is the GloVe embedding of c and v_b is extracted with the same Faster R-CNN model as described in Section 2.1. Finally, a margin-based pairwise ranking loss is used to train the model:

L = max(0, λ − s^{WO}(c, b) + s^{WO}(c, b̃)), with s^{WO}(c, b) = σ(W_s (f(W_c x_c^G) ∘ f(W_v v_b))),

where f is the ReLU and σ the sigmoid activation function, ∘ denotes element-wise multiplication, and W_c, W_v, W_s are weight parameters. The margin is set to λ = 0.5.
After s^{WO} is pre-trained, we select at most B noun tokens in the question and compute the WO attention a^{WO}(X, V) over the K objects by applying s^{WO} to every selected-word/object pair and normalizing the aggregated scores over the K objects; the parameters of s^{WO} are fine-tuned in the downstream VQA task.
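The matching score, its margin-based training loss, and the resulting WO attention can be sketched as below. The exact form of s^{WO} and the max-over-words aggregation are reconstructions consistent with the symbols listed in the text (f = ReLU, σ = sigmoid, element-wise product, weights W_c, W_v, W_s), not the authors' released code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

D1, Dv, H = 300, 2048, 512  # GloVe dim, visual dim, assumed hidden dim
rng = np.random.default_rng(4)
W_c = rng.standard_normal((H, D1)) * 0.02  # projects the word/label GloVe
W_v = rng.standard_normal((H, Dv)) * 0.02  # projects the object feature
W_s = rng.standard_normal(H) * 0.02        # scoring vector

def s_wo(x_g, v):
    """Word-object matching score in (0, 1)."""
    return float(sigmoid(W_s @ (relu(W_c @ x_g) * relu(W_v @ v))))

def ranking_loss(x_g, v_pos, v_neg, margin=0.5):
    """Margin-based pairwise loss: the positive pair should outscore
    the negative pair by at least the margin."""
    return max(0.0, margin - s_wo(x_g, v_pos) + s_wo(x_g, v_neg))

def wo_attention(noun_glove, V):
    """noun_glove: (B, D1) GloVe vectors of selected noun tokens.
    V: (K, Dv) object features. Score every (word, object) pair, keep
    the best word per object, softmax over the K objects."""
    scores = np.array([[s_wo(x, v) for v in V] for x in noun_glove])  # (B, K)
    return softmax(scores.max(axis=0))

a_WO = wo_attention(rng.standard_normal((3, D1)),
                    rng.standard_normal((36, Dv)))
```

With B = 3 noun tokens and K = 36 objects, the result is again a length-36 attention distribution, directly addable to the WL and SO attentions.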

Sentence-Object Attention (SO)
Following previous methods of sentence-level, question-guided visual attention, we also use the global semantics of the whole sentence to guide the focus on relevant objects. Taking the sentence feature h_T and the object features V as input, the SO attention vector a^{SO} is computed by fusing each projected object feature with the projected sentence feature and normalizing the resulting scores over the K objects, where f is the ReLU and σ the sigmoid activation function, and W_j, W_v, W_t are weight parameters.
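One plausible reading of the SO attention, consistent with the listed ingredients (ReLU, sigmoid, weights W_j, W_v, W_t, element-wise fusion) and with the up-down model this work builds on, is sketched below; the exact gating form is a reconstruction, not the paper's equation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Dv, D3, H = 2048, 1024, 512  # visual dim, sentence dim, assumed hidden dim
rng = np.random.default_rng(5)
W_v = rng.standard_normal((H, Dv)) * 0.02  # projects each object feature
W_t = rng.standard_normal((H, D3)) * 0.02  # projects the sentence feature
W_j = rng.standard_normal(H) * 0.02        # scoring vector

def so_attention(h_T, V):
    """h_T: (D3,) sentence feature; V: (K, Dv) object features.
    Fuse each projected object with the projected sentence by
    element-wise product, score, and softmax over the K objects."""
    q = relu(W_t @ h_T)  # shared question gate
    scores = np.array([W_j @ sigmoid(relu(W_v @ v) * q) for v in V])
    return softmax(scores)

a_SO = so_attention(rng.standard_normal(D3), rng.standard_normal((36, Dv)))
```

As with the WL and WO branches, the output is a length-K distribution, so the three attentions live on the same scale before being summed.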

Multi-modal Fusion and Answer Prediction
The above three attention vectors are summed to form the final attention, which is used to compute the weighted visual feature v_a = Σ_k a_k v_k ∈ R^{2048} for the image. The question feature h_T and the attended visual feature v_a are then transformed into the same dimension and fused by element-wise multiplication to get the joint representation vector r ∈ R^{D4}.
Specifically, r = f(W_rt h_T) ∘ f(W_rv v_a), where f is the ReLU activation and W_rt, W_rv are weight parameters. Following Teney et al. (2017), we treat VQA as a classification problem and use a binary cross-entropy loss to take multiple annotated answers into account:

L = − Σ_a [ s_a log σ(ŝ_a) + (1 − s_a) log(1 − σ(ŝ_a)) ],

where ŝ ∈ R^A is the predicted score over all A answer candidates and s_a is the target accuracy score of answer a.
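The fusion and prediction steps fit in a few lines of numpy. The output projection W_o is an assumed final classifier layer (the paper does not name it), and the three input attentions are assumed to be already normalised per branch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Dv, D3, D4, A = 2048, 1024, 2048, 3129
rng = np.random.default_rng(6)
W_rt = rng.standard_normal((D4, D3)) * 0.02  # projects the question feature
W_rv = rng.standard_normal((D4, Dv)) * 0.02  # projects the visual feature
W_o = rng.standard_normal((A, D4)) * 0.02    # assumed answer classifier

def fuse_and_predict(h_T, V, a_wl, a_wo, a_so):
    """Sum the three attentions, pool the object features, fuse with the
    question by element-wise product, and score the answer candidates."""
    a = a_wl + a_wo + a_so               # combined attention over K objects
    v_a = a @ V                          # (Dv,) attended visual feature
    r = relu(W_rt @ h_T) * relu(W_rv @ v_a)  # joint representation (D4,)
    return W_o @ r                       # answer logits s_hat, shape (A,)

def bce_loss(logits, targets):
    """Binary cross-entropy against soft accuracy targets in [0, 1]."""
    p = sigmoid(logits)
    eps = 1e-9
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))

K = 36
logits = fuse_and_predict(rng.standard_normal(D3),
                          rng.standard_normal((K, Dv)),
                          rng.random(K), rng.random(K), rng.random(K))
```

The soft targets s_a allow the loss to credit every answer that some annotators gave, rather than forcing a single hard label.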

Settings
Experiments are conducted on the VQA v2 dataset (Goyal et al., 2017b). Questions are trimmed to a maximum of T = 14 words. The number of detected boxes is set to K = 36, and the dimensions of the GloVe and ELMo embeddings are D1 = 300 and D2 = 1024, respectively. The GRU hidden size for the question sentence is D3 = 1024, and the joint representation r has dimension D4 = 2048. The number of noun tokens is fixed at B = 3, with padding. Candidate answers are restricted to the correct answers in the training set that appear more than a threshold number of times, which results in A = 3129 answer candidates. The Adamax optimizer (Kingma and Ba, 2014) is used with an initial learning rate of 0.002, and a schedule that reduces the learning rate by a factor of 0.1 every 3 epochs after the first 8 epochs. The batch size is 512. An answer's target accuracy score is min(#humans that provided that answer / 3, 1), i.e., an answer is considered fully correct if at least three annotators provided it.

Model Analysis
To understand the effects of the different components, the performance obtained by adding each proposed component to the baseline is reported in Table 2. Adding the two proposed branches of fine-grained WL and WO attention significantly improves the baseline performance. The results also verify that ELMo embeddings combined with GloVe embeddings provide more sophisticated text representations, thereby improving overall performance.

Study on Attention Maps
To validate the effectiveness of the enhanced attention mechanism, we visualize the attentions and compare them with those of the up-down model. As Figure 4 shows, the addition of object-level groundings leads to a better understanding of the images and locates the attended objects more precisely. For example, in Figure 4(a), for the question "Can you see its paws?", the attention generated by our method is focused on the paws, while the baseline does not focus on the key regions as accurately. In Figure 4(b), for the Number-type question "How many dog ears are shown?", our model gives the strongest attention to the ear region of the dog, while the baseline model attends to the whole dog body. For small object clues, our model shows an even clearer advantage, as the examples in Figures 4(c) and 4(d) illustrate.
We also notice cases where, although the final answer is wrong, our model generates appropriate attention maps. As shown in Figure 4(e), for the Yes/No question "Does his bow tie match his pants?", our model correctly finds the "tie" and "pants" object regions, but we suspect that the model does not understand the meaning of "match".
A mean opinion score (MOS) test is also performed to quantitatively compare our attention mechanism with the baseline model. Specifically, we randomly select 100 cases and generate their attention maps. Subjects are then asked to rate each attention map with a score of 0 (bad quality), 0.5 (medium quality), or 1 (excellent quality). The distribution of MOS ratings is summarized in Figure 5. The mean score of our model, 0.8125, exceeds that of the baseline model, 0.7315, by a large margin, indicating that the attention maps generated by our attention mechanism are preferred by humans.

Conclusion
This paper proposes a multi-grained attention mechanism that involves both word-object grounding and sentence-image association, capturing the images at different degrees of granularity with improved interpretability. Visualizations of object-level attention show a clear improvement in the model's ability to attend to small details in complicated scenes.