Aligned Dual Channel Graph Convolutional Network for Visual Question Answering

Visual question answering aims to answer a natural language question about a given image. Existing graph-based methods focus only on the relations between objects in an image and neglect the importance of the syntactic dependency relations between words in a question. To simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question, we propose a novel dual channel graph convolutional network (DC-GCN) that better combines visual and textual strengths. The DC-GCN model consists of three parts: an I-GCN module to capture the relations between objects in an image, a Q-GCN module to capture the syntactic dependency relations between words in a question, and an attention alignment module to align image representations and question representations. Experimental results show that our model achieves comparable performance with the state-of-the-art approaches.


Introduction
As a form of visual Turing test, visual question answering (VQA) has drawn much attention. The goal of VQA (Antol et al., 2015; Goyal et al., 2017) is to answer a natural language question related to the contents of a given image. Attention mechanisms serve as the backbone of the previous mainstream approaches (Lu et al., 2016; Yu et al., 2017); however, they tend to capture only the most discriminative information, ignoring other rich complementary clues.
Recent VQA studies have been exploring higher level semantic representations of images, notably using graph-based structures for better image understanding, such as scene graph generation (Xu et al., 2017; Yang et al., 2018), visual relationship detection (Yao et al., 2018), object counting (Zhang et al., 2018a), and relation reasoning (Cao et al., 2018; Li et al., 2019; Cadene et al., 2019a). Representing images as graphs allows one to explicitly model interactions between two objects in an image, so as to seamlessly transfer information between graph nodes (e.g., objects in an image).
Very recent methods (Cadene et al., 2019a) have achieved remarkable performance, but there is still a big gap between them and humans. As shown in Figure 1(a), given an image of a group of persons and the corresponding question, a VQA system needs to not only recognize the objects in the image (e.g., batter, umpire and catcher), but also grasp the textual information in the question "what color is the umpire's shirt". However, many competitive VQA models, including state-of-the-art methods, struggle to process such cases accurately and consequently predict the incorrect answer (black) rather than the correct answer (blue).
Although the relations between two objects in an image have been considered, attention-based VQA models lack building blocks to explicitly capture the syntactic dependency relations between words in a question. As shown in Figure 1(c), these dependency relations can reflect which object is being asked about (e.g., the word umpire's modifies the word shirt) and which aspect of the object is being asked about (e.g., the word color is the direct object of the word is). If a VQA model only knows the word shirt rather than the relation between the words umpire's and shirt in a question, it is difficult to distinguish which object is being asked about. In fact, we do need such modifier relations to discriminate the correct object from multiple similar objects. Therefore, we consider it necessary to explore the relations between words at the linguistic level in addition to constructing the relations between objects at the visual level.
Motivated by this, we propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question. Our proposed DC-GCN model consists of an Image-GCN (I-GCN) module, a Question GCN (Q-GCN) module, and an attention alignment module. The I-GCN module captures the relations between objects in an image, the Q-GCN module captures the syntactic dependency relations between words in a question, and the attention alignment module is used to align two representations of image and question. The contributions of this work are summarized as follows: 1) We propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the visual and textual relations, and design the attention alignment module to align the multimodal representations, thus reducing the semantic gaps between vision and language.
2) We explore how to construct the syntactic dependency relations between words at linguistic level via graph convolutional networks as well as the relations between objects at visual level.
3) We conduct extensive experiments and ablation studies on VQA-v2 and VQA-CP-v2 datasets to examine the effectiveness of our DC-GCN model. Experimental results show that the DC-GCN model achieves competitive performance with the state-of-the-art approaches.

Related Works
Visual Question Answering
Attention mechanisms have been proven effective on many tasks, such as machine translation (Bahdanau et al., 2014) and image captioning (Pedersoli et al., 2017). A number of methods have been developed so far, in which question-guided attention on image regions is commonly used. These can be categorized into two classes according to the type of image features employed. One class uses visual features from region proposals, which are generated by a Region Proposal Network (Ren et al., 2015). The other class uses convolutional features (i.e., activations of convolutional layers).
To learn a better representation of the question, the Stacked Attention Network, which can search for question-related image regions, performs multi-step visual attention operations. A co-attention mechanism that jointly performs question-guided visual attention and image-guided question attention is proposed to solve the problems of which regions to look at and which words to listen to (Shih et al., 2016). To obtain more fine-grained interaction between image and question, some researchers introduce rather sophisticated fusion strategies. Bilinear pooling (Kim et al., 2018; Yu et al., 2017, 2018) is one of the pioneering approaches to efficiently and expressively combining multimodal features by using an outer product of two vectors. Recently, some researchers have devoted effort to overcoming the priors in VQA datasets, proposing methods such as GVQA, UpDn + Q-Adv + DoE (Ramakrishnan et al., 2018), and RUBi (Cadene et al., 2019b) to address the language biases on the VQA-CP-v2 dataset.
Graph Networks
Graph networks are powerful models that can perform relational inference through message passing. The core idea is to enable communication between image regions to build contextualized representations of these regions. Below we review some of the recent works that rely on graph networks and other contextualized representations for VQA.
Recent research works (Cadene et al., 2019a; Li et al., 2019) focus on how to deal with complex scenes and relation reasoning to obtain better image representations. Based on multimodal attentional networks, (Cadene et al., 2019a) introduces an atomic reasoning primitive that represents interactions between the question and image regions by a rich vectorial representation and models region relations with pairwise combinations. GCNs, which can better explore the visual relations between objects and aggregate a node's own features with its neighbors' features, have been applied to various tasks, such as text classification (Yao et al., 2019), relation extraction (Guo et al., 2019; Zhang et al., 2018b), and scene graph generation (Yang et al., 2018; Yao et al., 2018).
To answer complicated questions about an image, a relation-aware graph attention network (ReGAT) is proposed to encode each image into a graph and model multi-type inter-object relations, such as spatial, semantic, and implicit relations, via a graph attention mechanism. One limitation of ReGAT lies in the fact that it solely considers the relations between objects in an image while neglecting the importance of textual information. In contrast, our DC-GCN simultaneously captures visual relations in an image and textual relations in a question.

Feature Extraction
Similar to (Anderson et al., 2018), we extract image features using a pretrained Faster R-CNN (Ren et al., 2015). We select µ object proposals for each image, where each object proposal is represented by a 2048-dimensional feature vector; the obtained visual region features are denoted as V = {v_1, …, v_µ}. To extract the question features, each word is embedded into a 300-dimensional GloVe vector (Pennington et al., 2014). The word embeddings are fed into an LSTM (Hochreiter and Schmidhuber, 1997), which produces the initial question representations Q = {q_1, …, q_λ}.
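As a rough sketch of the shapes involved (the random projections below are stand-ins for the LSTM question encoder and the visual pipeline, and µ = 36 is merely an illustrative choice, not the paper's setting):

```python
import numpy as np

# Illustrative dimensions: mu object proposals per image, lambda
# (padded) words per question, 2048-d region features, 300-d GloVe.
mu, lam = 36, 14
d_img, d_emb, d_h = 2048, 300, 512

rng = np.random.default_rng(0)
V = rng.standard_normal((mu, d_img))   # stand-in Faster R-CNN region features
E = rng.standard_normal((lam, d_emb))  # stand-in GloVe word embeddings

# Project both modalities to a common hidden size (a stand-in for the
# LSTM encoder and the visual projection; weights are hypothetical).
W_v = rng.standard_normal((d_img, d_h)) * 0.01
W_q = rng.standard_normal((d_emb, d_h)) * 0.01
H_v0 = V @ W_v   # initial object representations, shape (mu, d_h)
H_q0 = E @ W_q   # initial word representations, shape (lam, d_h)
```

These (µ, 512) and (λ, 512) matrices are the node features consumed by the two GCN channels described next.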

I-GCN Module
Image Fully-connected Relations Graph
By treating each object region in an image as a vertex, we construct a fully-connected undirected graph, as shown in Figure 3(b). Each edge represents a relation between two object regions.

Pruned Image Graph with Spatial Relations
Spatial relations represent an object's position in an image and correspond to a 4-dimensional spatial coordinate [x1, y1, x2, y2], where (x1, y1) is the coordinate of the top-left point of the bounding box and (x2, y2) is the coordinate of the bottom-right point. Identifying the correlation between objects is a key step; we calculate it using spatial relations. The steps are as follows: (1) The features of two nodes are fed into a multi-layer perceptron respectively, and the corresponding elements are multiplied to obtain a relatedness score.
(2) The intersection over union (IoU) of the two object regions is calculated. According to the overlapping part of two object regions, spatial relations are classified into 11 different categories, such as inside, cover, and overlap (Yao et al., 2018). Following (Yao et al., 2018), we use the overlapping region between two object regions to judge whether there is an edge between them. A large overlap indicates a strong correlation between the two objects; if two object regions have no overlapping part, we consider the two objects weakly correlated and connect no edge between the two nodes. According to the spatial relations, we prune irrelevant relations between objects and obtain a sparse graph, as shown in Figure 3(c).

Image Graph Convolutions
Following previous studies (Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the representations of objects. Given a graph with µ nodes, each object region in an image is a node. We represent the graph structure with a µ × µ adjacency matrix A, where A_ij = 1 if there is an overlapping region between node i and node j, and A_ij = 0 otherwise.
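The overlap test behind the adjacency matrix A can be sketched as follows (the IoU threshold is a hypothetical knob; the paper itself only distinguishes overlapping from non-overlapping regions):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_adjacency(boxes, threshold=0.0):
    """Build the mu x mu adjacency matrix A: A[i][j] = 1 iff regions i
    and j overlap (IoU above the threshold), so non-overlapping pairs
    are pruned from the fully-connected graph."""
    n = len(boxes)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and iou(boxes[i], boxes[j]) > threshold:
                A[i][j] = 1
    return A
```

For example, two boxes sharing a corner region get an edge, while disjoint boxes do not.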
Given a target node i and a neighboring node j ∈ N(i) in an image, where N(i) is the set of nodes neighboring node i, let the representations of nodes i and j be h_vi and h_vj, respectively. To obtain the correlation score s_ij between nodes i and j, we learn a fully connected layer over the concatenated node features h_vi and h_vj:

s_ij = w_a^T σ(W_a [h_vi ; h_vj]),  (1)

where w_a and W_a are learned parameters, σ is a non-linear activation function, and [h_vi ; h_vj] denotes the concatenation operation. We apply a softmax function over the correlation scores s_ij to obtain the weight α_ij, as shown in Figure 3(c), where the numbers in red represent the weight scores:

α_ij = exp(s_ij) / Σ_{j∈N(i)} exp(s_ij).  (2)

The l-th layer representations of the neighboring nodes h^(l)_vj are first transformed via a learned linear transformation W_b. Those transformed representations are then gathered with weight α_ij, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

h^(l+1)_vi = σ( Σ_{j∈N(i)} α_ij W_b h^(l)_vj ).  (3)

After the stacked L-layer GCN, the output of the I-GCN module H_v can be denoted as:

H_v = [h^(L)_v1, h^(L)_v2, …, h^(L)_vµ].  (4)
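A minimal NumPy sketch of one such attention-weighted propagation step (single head, ReLU standing in for σ, hypothetical weight shapes; a real implementation would use a deep learning framework):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gcn_layer(H, A, W_a, w_a, W_b):
    """One attention-weighted graph convolution step: score each
    neighbor from concatenated node features, softmax over the
    neighborhood, then gather transformed neighbor features."""
    n, _ = H.shape
    H_new = np.zeros((n, W_b.shape[0]))
    for i in range(n):
        nbrs = [j for j in range(n) if A[i, j] == 1]
        if not nbrs:
            continue
        # correlation scores s_ij over concatenated node features
        s = np.array([w_a @ np.maximum(W_a @ np.concatenate([H[i], H[j]]), 0.0)
                      for j in nbrs])
        alpha = softmax(s)  # attention weights over the neighborhood
        # weighted sum of linearly transformed neighbor features + ReLU
        H_new[i] = np.maximum(
            sum(a * (W_b @ H[j]) for a, j in zip(alpha, nbrs)), 0.0)
    return H_new
```

Stacking L of these layers over the pruned adjacency matrix A yields the I-GCN output; the Q-GCN channel applies the same propagation over its own adjacency matrix.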

Q-GCN Module
In practice, we observe that two words in a sentence usually hold certain relations. Such relations can be identified by the universal Stanford Dependencies (De Marneffe et al., 2014). In Table 1, we list some commonly-used dependency relations. For example, the sentence what color is the umpire's shirt is parsed to obtain the relations between words (e.g., cop, det and nmod), as shown in Figure 4. The words in blue are the dependency relations. The end of an arrow indicates that the word is a modifier. The word root in purple indicates which word is the root node of the dependency relations.

Figure 4: The question is processed by syntactic dependency parsing. The word is is the root node of the dependency relations, while the words in blue (e.g., det, dobj) are dependency relations. The direction of an arrow indicates that two words hold a relation.

Question Fully-connected Relations Graph
By treating each word in a question as a node, we construct a fully-connected undirected graph, as shown in Figure 5(a). Each edge represents a relation between two words.

Pruned Question Graph with Dependency Relations
Irrelevant relations between two words may introduce noise, so we prune unrelated relations to reduce it. By parsing the dependency relations of a question, we obtain the relations between words (cf. Figure 4). According to the dependency relations, we prune the edges between nodes that do not hold a dependency relation, obtaining a sparse graph, as shown in Figure 5(b).

Question Graph Convolutions
Following previous works (Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the node representations of words. Given a graph with λ nodes, each word in a question is a node. We represent the graph structure with a λ × λ adjacency matrix B, where B_ij = 1 if there is a dependency relation between node i and node j, and B_ij = 0 otherwise. Given a target node i and a neighboring node j ∈ Ω(i) in a question, where Ω(i) is the set of nodes neighboring node i, let the representations of nodes i and j be h_qi and h_qj, respectively. To obtain the correlation score t_ij between nodes i and j, we learn a fully connected layer over the concatenated node features h_qi and h_qj:

t_ij = w_c^T σ(W_c [h_qi ; h_qj]),  (5)

where w_c and W_c are learned parameters, σ is a non-linear activation function, and [h_qi ; h_qj] denotes the concatenation operation. We apply a softmax function over the correlation scores t_ij to obtain the weight β_ij:

β_ij = exp(t_ij) / Σ_{j∈Ω(i)} exp(t_ij).  (6)

As shown in Figure 5(c), the numbers in red are the weight scores.
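Concretely, the pruned adjacency matrix B can be built from the head indices produced by any dependency parser; the parse below is a hand-made illustration of the question in Figure 4, not actual parser output:

```python
def dependency_adjacency(heads):
    """Build the lambda x lambda adjacency matrix B from a dependency
    parse given as head indices: heads[i] is the index of word i's
    head, or -1 for the root. B[i][j] = 1 iff words i and j are
    linked by a dependency arc (the graph is kept undirected)."""
    n = len(heads)
    B = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            B[i][h] = 1
            B[h][i] = 1
    return B

# "what color is the umpire's shirt" -- hypothetical head indices
# mirroring the parse sketched in Figure 4 (root = "is").
words = ["what", "color", "is", "the", "umpire's", "shirt"]
heads = [1, 2, -1, 5, 5, 2]
B = dependency_adjacency(heads)
```

Edges exist only along dependency arcs, so unrelated word pairs (e.g., what and the) stay disconnected after pruning.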
The l-th layer representations of the neighboring nodes h^(l)_qj are first transformed via a learned linear transformation W_d. Those transformed representations are gathered with weight β_ij, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

h^(l+1)_qi = σ( Σ_{j∈Ω(i)} β_ij W_d h^(l)_qj ).  (7)

After the stacked L-layer GCN, the output of the Q-GCN module H_q is denoted as:

H_q = [h^(L)_q1, h^(L)_q2, …, h^(L)_qλ].  (8)

Attention Alignment Module
Based on previous works (Gao et al., 2019), we use the self-attention mechanism (Vaswani et al., 2017) to enhance the correlation between words in a question and the correlation between objects in an image, respectively. To enhance the correlation between words and highlight the important ones, we utilize self-attention to update the question representation H_q. The updated question representation H̃_q is obtained as follows:

H̃_q = softmax(H_q H_q^T / √d_q) H_q,  (9)

where H_q^T is the transpose of H_q and d_q is the dimension of H_q. The number of levels of this self-attention is set to 4.
To obtain the image representation related to the question representation, we align the image representation H_v by using the question representation H̃_q as the guiding vector. The similarity score r between H_v and H̃_q is calculated as follows:

r = H̃_q H_v^T / √d_v,  (10)

where H_v^T is the transpose of H_v and d_v is the dimension of H_v. A softmax function is used to normalize the score r over the µ image regions to obtain the weight score r̃:

r̃ = softmax(r).  (11)

By multiplying the weight r̃ and the image representation H_v, we obtain the updated image representation H̃_v:

H̃_v = r̃ H_v.  (12)

The number of levels of this question-guided image attention is set to 4. The final outputs of the attention alignment module are H̃_q and H̃_v.
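Both attention steps reduce to scaled dot-product attention; a single-head NumPy sketch follows (the multi-head details and the 4-level stacking of the full model are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(H):
    """Self-attention update of a representation H:
    softmax(H H^T / sqrt(d)) H, a simplified single-head form."""
    d = H.shape[-1]
    return softmax(H @ H.T / np.sqrt(d)) @ H

def question_guided_align(H_v, H_q):
    """Question-guided image attention: score every image region
    against every word, softmax-normalize over the mu regions, and
    reweight the image representation H_v."""
    d_v = H_v.shape[-1]
    r = H_q @ H_v.T / np.sqrt(d_v)   # (lam, mu) similarity scores
    r_hat = softmax(r, axis=-1)      # normalize over image regions
    return r_hat @ H_v               # aligned image representation
```

With a (14, 512) question matrix and a (36, 512) image matrix, the aligned image output has shape (14, 512): one question-conditioned visual vector per word position.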

Answer Prediction
We apply a linear multimodal fusion method to fuse the two representations H̃_q and H̃_v as follows:

pred = σ(W_e (W_v H̃_v + W_q H̃_q) + b_e),  (13)

where W_v, W_q, W_e, and b_e are learned parameters, and pred denotes the probabilities of the classified answers over an answer vocabulary containing M candidate answers. Following previous work, we use a binary cross-entropy loss function to train the answer classifier.
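A sketch of this classifier head; the exact fusion form (project both modalities, add, then a per-answer sigmoid for the binary cross-entropy objective) is our assumption from the parameters the paper names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(h_q, h_v, W_q, W_v, W_e, b_e):
    """Linear multimodal fusion of the question and image vectors,
    followed by per-answer sigmoid scores over M candidates."""
    fused = W_q @ h_q + W_v @ h_v      # project both modalities and add
    return sigmoid(W_e @ fused + b_e)  # probabilities, one per answer

def bce_loss(pred, target, eps=1e-9):
    """Binary cross-entropy averaged over the M answer scores."""
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))
```

At inference time the answer with the highest score would be returned; during training the soft VQA ground-truth scores serve as the BCE targets.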

Datasets
VQA-v2 (Goyal et al., 2017) is the most commonly used VQA benchmark dataset and is split into train, val, and test-standard sets. Among the test-standard set, 25% serves as the test-dev set. Each question has 10 answers from different annotators, and the answer with the highest frequency is treated as the ground truth. All answer types can be divided into Yes/No, Number, and Other. VQA-CP-v2 (Agrawal et al., 2018) is a derivation of the VQA-v2 dataset, introduced to evaluate and reduce question-oriented bias in VQA models. Due to the significant difference in distribution between its train and test sets, the VQA-CP-v2 dataset is harder than VQA-v2.

Experimental Setup
We use the Adam optimizer (Kingma and Ba, 2014) with parameters α = 0.0001, β1 = 0.9, and β2 = 0.99. The size of the answer vocabulary is set to M = 3,129 as in (Anderson et al., 2018). The base learning rate is set to 0.0001. After 15 epochs, the learning rate is decayed by 1/5 every 2 epochs. All models are trained for up to 20 epochs with the same batch size of 64 and hidden size of 512. Each image has µ ∈ [10, 100] object regions, and all questions are padded or truncated to the same length of 14, i.e., λ = 14. The number of stacked layers L and the level of the attention alignment module are both 4.

Table 2: Comparison with previous state-of-the-art methods on the VQA-v2 test dataset. "-" means data absence. Answer types consist of the Yes/No, Num, and Other categories. All means the overall accuracy. All results in our paper are based on single-model performance.

UpDn (Anderson et al., 2018) proposes to use features based on Faster R-CNN (Ren et al., 2015) instead of ResNet. The Dense Co-Attention Network (DCN) (Nguyen and Okatani, 2018) utilizes a dense stack of multiple layers of the co-attention mechanism. The Counting method (Zhang et al., 2018a) is good at counting questions by utilizing the information of bounding boxes. DFAF (Gao et al., 2019) dynamically fuses intra- and inter-modality information. ReGAT models semantic, spatial, and implicit relations via a graph attention network. MCAN utilizes deep modular networks to learn multimodal feature representations and is a state-of-the-art approach on the VQA-v2 dataset. As shown in Table 2, our model improves the overall accuracy over DFAF and MCAN by 1.2% and 0.6% on the test-std set, respectively. Although our DC-GCN still cannot achieve comparable performance on the Num category with respect to ReGAT (the best model on the counting sub-task), it outperforms ReGAT on the other categories (e.g., Y/N by 1.2%, Other by 1.1%, and Overall by 0.9%).
These results show that DC-GCN can capture the relations needed to answer all kinds of questions by sufficiently exploiting the semantics of both object appearances and object relations. In summary, our DC-GCN achieves outstanding performance on the VQA-v2 dataset.

Experimental Results
To demonstrate the generalizability of our DC-GCN model, we also conduct experiments on the VQA-CP-v2 dataset. To overcome the language biases of the VQA-v2 dataset, (Agrawal et al., 2018) designed the VQA-CP-v2 dataset and specifically proposed the GVQA model to reduce the influence of language biases. Table 3 shows the results on the VQA-CP-v2 test split. MuRel (Cadene et al., 2019a) and ReGAT, the state-of-the-art models, build relations between objects to perform the reasoning and question answering tasks. Our DC-GCN model surpasses both MuRel and ReGAT on VQA-CP-v2, with the performance gain lifted to +1.05%. Although our proposed method is not designed for the VQA-CP-v2 dataset, our model still has a slight advantage.

Table 3: Overall accuracy on the VQA-CP-v2 test split. MuRel (Cadene et al., 2019a) 39.54; ReGAT-Sem 39.54; ReGAT-Imp 39.58; ReGAT-Spa 40.30; ReGAT 40.42; GVQA -.

Qualitative Analysis
In Figure 6, we visualize the learned attentions from the I-GCN module, the Q-GCN module, and the attention alignment module. Due to space limitations, we show only one example and visualize six attention maps from different attention units and different layers. From the results, we have the following observations.

Question GCN Module: The attention maps of Q-GCN(2) focus on the words color and shirt, as shown in Figure 6(a), while the attention maps of Q-GCN(4) correctly focus on the words color, umpire's, and shirt, as shown in Figure 6(b). These words have larger weights than the others; that is, the keywords color, umpire's, and shirt are identified correctly.

Image GCN Module: For the sake of presentation, we consider only 20 object regions in an image. The indexes within [1, 20] shown on the axes of the attention maps correspond to the objects in the image. Among these, indexes 4, 6, 9, and 12 are the most relevant to the question. Compared with I-GCN(2), which focuses on the 4-th, 6-th, 9-th, 12-th, and 14-th objects (cf. Figure 6(c)), I-GCN(4) focuses more on the 4-th, 6-th, and 12-th objects, where the 4-th object has a larger weight than the 6-th and 12-th objects, as shown in Figure 6(d). The 4-th object region is the ground-truth region, while the 6-th, 9-th, and 12-th object regions are the most relevant ones.

Attention Alignment Module: Given a specific question, a model needs to align image objects guided by the question to update the representations of objects. As shown in Figure 6(e), the focused regions are scattered, with the key regions mainly being the 4-th, 9-th, and 12-th object regions. Through the guidance of the identified words color, umpire's, and shirt, the DC-GCN model gradually pays more attention to the 4-th, 9-th, and 12-th object regions rather than other irrelevant ones, as shown in Figure 6(f). This alignment process demonstrates that our model can capture the relations among multiple similar objects.
We also visualize some negative examples predicted by our DC-GCN model, as shown in Figure 7. They can be classified into three categories: (1) limitations of object detection; (2) text semantic understanding in scenarios; (3) subjective judgment. In Figure 7(a), although the question how many sheep are pictured is not difficult, the image content is genuinely confusing: without careful observation, it is easy to arrive at the wrong answer 2 instead of 3. The causes of this error include object occlusion, varying distances of objects, and the limitations of object detection. The image feature extractor is based on the Faster R-CNN model (Ren et al., 2015), so the accuracy of object detection indirectly affects the accuracy of feature extraction; the counting sub-task in VQA still has large room for improvement. In Figure 7(b), the question what time should you pay can only be answered by understanding the text in the image. Text semantic understanding belongs to another task, namely text visual question answering (Biten et al., 2019), which requires recognizing the numbers, symbols, and proper nouns in a scene. In Figure 7(c), subjective judgment is needed to answer the question is this man happy. Making this judgment requires common sense knowledge and real-life experience: someone is pointing a banana at the man as if holding a gun at him, so he is unhappy. Our model cannot carry out such human-like analysis to make a subjective judgment and predict the correct answer yes.
Finally, to understand the distribution of the three error types, we randomly sample 100 examples from the dev set of VQA-v2. The counts of the three error types (i.e., overlapping objects, text semantic understanding, and subjective judgment) are 3, 3, and 29, respectively. The predicted answers for the first two question types are all incorrect. The last type has 12 incorrect answers, i.e., an error rate of 41.4% for this question type. These observations are helpful for making further improvements in the future.

Ablation Study
We perform extensive ablation studies on the VQA-v2 validation dataset (cf. Table 4). Firstly, we investigate the influence of the GCN types. There are two GCN types, I-GCN and Q-GCN, as shown in Table 4. When removing the I-GCN, the performance of our model decreases from 66.57% to 65.52% (p-value = 3.22E-08 < 0.05). When removing the Q-GCN, the performance slightly decreases from 66.57% to 66.15% (p-value = 2.04E-07 < 0.05). We see two reasons for this. One is that the image content is more complex than the question content and hence carries richer semantic information; building the relations between objects helps clarify what the image represents and helps align with the question representations. The other is that questions are short and contain less information (e.g., what animal is this? and what color is the man's shirt?).
Then, we perform an ablation study on the influence of dependency relations (cf. Table 1). Relations such as nsubj, nmod, dobj, and amod are crucial to the semantic representations, so we do not remove them from the sentence. As shown in Table 4, removing relations such as det, case, aux, and advmod individually has a trivial influence on the semantic representations of the question. However, accuracy decreases significantly when we simultaneously remove the relations det, case, and cop. The reason may be that the sentence loses too much information and can no longer fully express the meaning of the original sentence. For example, consider the two phrases on the table and under the table. If we remove the relation case, which means the words on and under are removed, it becomes hard to distinguish whether something is on the table or under the table.

Conclusion
In this paper, we propose a dual channel graph convolutional network to explore the relations between objects in an image and the syntactic dependency relations between words in a question. Furthermore, we explicitly construct the relations between words via the dependency tree and align the image and question representations with an attention alignment module to reduce the gaps between vision and language. Extensive experiments on the VQA-v2 and VQA-CP-v2 datasets demonstrate that our model achieves comparable performance with the state-of-the-art approaches. We will explore more complicated object relation modeling in future work.