GraphVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering

Images are more than a collection of objects or attributes; they represent a web of relationships among interconnected objects. The scene graph has emerged as a new modality: a structured graphical representation of an image that encodes objects as nodes connected via pairwise relations as edges. To support question answering on scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing among graph nodes. We explore the design space of the GraphVQA framework and discuss the trade-offs of different design choices. Our experiments on the GQA dataset show that GraphVQA outperforms the state-of-the-art accuracy by a large margin (94.78% vs. 88.43%).


Introduction
Images are more than a collection of objects or attributes. Each image represents a web of relationships among interconnected objects. Towards formalizing a representation for images, Visual Genome (Krishna et al., 2017a) defined scene graphs, a structured formal graphical representation of an image that is similar to the form widely used in knowledge base representations. As shown in Figure 1, a scene graph encodes objects (e.g., girl, burger) as nodes connected via pairwise relationships (e.g., holding) as edges. Scene graphs have been introduced for image retrieval (Johnson et al., 2015), image generation (Johnson et al., 2018), image captioning (Anderson et al., 2016), understanding instructional videos (Huang et al., 2018), and situational role classification (Li et al., 2017).
Figure 1: Scene Graph: A scene graph encodes objects (e.g., girl, burger) as nodes connected via pairwise relationships (e.g., holding) as edges. GraphVQA Framework: Our core insight is to translate and execute a natural language question (e.g., "What is the red object left of the girl that is holding a hamburger?") as multiple iterations of message passing among graph nodes (e.g., hamburger → small girl → red tray). The final state after message passing represents the answer (e.g., tray).

¹ Equal contribution. Authors listed in alphabetical order.

To support question answering on scene graphs, we propose GraphVQA, a language-guided graph
neural network framework for Scene Graph Question Answering (Scene Graph QA). Our core insight is to translate a natural language question into multiple iterations of message passing among graph nodes. Figure 1 shows an example question "What is the red object left of the girl that is holding a hamburger?". This question can be naturally answered by the following iterations of message passing: "hamburger → small girl → red tray". The final state after message passing represents the answer (e.g., tray), and the intermediate states reflect the model's reasoning. Each message passing iteration is accomplished by a graph neural network (GNN) layer. We explore various message passing designs in GraphVQA, and discuss the trade-offs of different design choices. Scene Graph QA is closely related to Visual Question Answering (VQA). Although there are many research efforts in scene graph generation, Scene Graph QA remains relatively underexplored. Sporadic attempts in scene graph based VQA (Hu et al., 2019; Li et al., 2019; Santoro et al., 2017) mostly propose various attention mechanisms designed primarily for fully-connected graphs, thereby failing to model and capture the important structural information of the scene graphs.
We found that GraphVQA with de facto GNNs can outperform the state-of-the-art model by a large margin (94.78% vs. 88.43% accuracy). We discuss additional related work in Appendix A. Our results suggest the importance of incorporating recent advances from graph machine learning into our community.

Machine Learning with Graphs
Modeling graphical data has historically been challenging for the machine learning community. Traditionally, methods have relied on Laplacian regularization through label propagation, manifold regularization, or learning embeddings. Today's de facto choice is the graph neural network (GNN), an operator on local neighborhoods of nodes.
GNNs follow the message passing scheme. The high-level idea is to update each node's features using those of its local neighborhood. Specifically, node i's representation at the l-th layer, h_i^{(l)}, can be calculated from the previous layer's node representations as

h_i^{(l)} = γ^{(l)}( h_i^{(l-1)}, h_{N_i}^{(l)} ),  where  h_{N_i}^{(l)} = AGG( { φ^{(l)}( h_i^{(l-1)}, h_j^{(l-1)}, e_{ji} ) : j ∈ N(i) } ),

where e_{ji} denotes the feature of the edge from node j to node i, h_{N_i}^{(l)} denotes the aggregated neighborhood information, γ^{(l)} and φ^{(l)} denote differentiable functions such as MLPs, and AGG denotes an aggregation function such as mean or sum pooling.
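The generic update above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: φ and γ are stood in by single linear maps (`W_phi`, `W_gamma`), and AGG is mean pooling.

```python
import numpy as np

def message_passing_layer(h, edges, e_feat, W_phi, W_gamma):
    """One generic GNN layer: h'_i = gamma(h_i, AGG_j phi(h_i, h_j, e_ji)).

    h:      (N, d) node features
    edges:  list of (j, i) pairs, messages flow j -> i
    e_feat: per-edge features, aligned with `edges`
    """
    N, d = h.shape
    msgs = [[] for _ in range(N)]
    for (j, i), e in zip(edges, e_feat):
        # phi: a linear map over the concatenation [h_i, h_j, e_ji]
        msgs[i].append(np.concatenate([h[i], h[j], e]) @ W_phi)
    # AGG: mean pooling; nodes with no incoming edges keep a zero summary
    h_agg = np.stack([np.mean(m, axis=0) if m else np.zeros(d) for m in msgs])
    # gamma: a linear map over [h_i, h_{N_i}]
    return np.concatenate([h, h_agg], axis=1) @ W_gamma
```

Note that a node with no in-edges is updated from its own features and a zero neighborhood summary, which keeps the layer well-defined on sparse scene graphs.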

Question Parsing Module
The Question Parsing Module uses a sequence-to-sequence transformer architecture to translate the question [q_1, ..., q_Q] into a sequence of instruction vectors [i^{(1)}, ..., i^{(M)}] with a fixed M.
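One way to picture the decoding of a fixed number M of instruction vectors is as cross-attention pooling: M learned query vectors each attend over the encoded question tokens. The sketch below is a toy single-head stand-in for the transformer decoder, with `inst_queries` as hypothetical learned queries.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_instructions(q_enc, inst_queries):
    """Pool Q encoded question tokens into M instruction vectors.

    q_enc:        (Q, d) encoder outputs for the question tokens
    inst_queries: (M, d) learned decoder queries, one per instruction step
    """
    # scaled dot-product cross-attention (single head)
    scores = inst_queries @ q_enc.T / np.sqrt(q_enc.shape[1])  # (M, Q)
    return softmax(scores, axis=1) @ q_enc                     # (M, d)
```

Each instruction vector is thus a convex combination of question-token representations, so different instructions can focus on different question words.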

Scene Graph Encoding Module
The Scene Graph Encoding Module first initializes node features X̃ = [x̃_1, ..., x̃_N] with the word embeddings of each object's name and attributes, and edge features E with the word embeddings of the edge types. We then obtain contextualized node features X = [x_1, x_2, ..., x_N] via one round of message passing over the scene graph, where σ denotes the activation function used in this pass and e_ij denotes the feature of the edge that connects node i and node j.
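A minimal sketch of the feature initialization, assuming each node's initial feature is the average of the embeddings of its object name and attribute words (`emb` is a hypothetical word-embedding lookup; the averaging is an illustrative choice, not necessarily the paper's exact combination):

```python
import numpy as np

def init_node_features(names, attrs, emb):
    """x~_i = average of the word embeddings of object i's name and attributes.

    names: list of object-name strings, one per node
    attrs: list of attribute-string lists, one per node (may be empty)
    emb:   dict mapping word -> embedding vector
    """
    return np.stack([
        np.mean([emb[w] for w in [n] + a], axis=0)
        for n, a in zip(names, attrs)
    ])
```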

Graph Reasoning Module
The Graph Reasoning Module is the core of the GraphVQA framework. It executes the M instruction vectors step by step, with M graph neural network layers. One major difference between our Graph Reasoning Module and a standard GNN is that we want the message passing in layer L to be conditioned on the L-th instruction vector. Inspired by language-model-style conditioning (Liang et al., 2020b), we adopt a general design that is compatible with any graph neural network: before running the L-th GNN layer, we concatenate the L-th instruction vector to every node and edge feature from the previous layer. Specifically,

ĥ_i^{(L)} = [ h_i^{(L-1)} ; i^{(L)} ],  ê_{ji}^{(L)} = [ e_{ji} ; i^{(L)} ],

where ĥ and ê denote the node and edge features used as inputs to the L-th GNN layer. Next, we introduce three standard GNNs that we have explored, starting from the simplest one.
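The conditioning step described above can be sketched as a single concatenation over all node and edge features:

```python
import numpy as np

def condition_on_instruction(h, e_feat, inst):
    """Concatenate the step-L instruction vector to every node and
    edge feature before running the L-th GNN layer."""
    h_cond = np.concatenate([h, np.tile(inst, (h.shape[0], 1))], axis=1)
    e_cond = np.concatenate([e_feat, np.tile(inst, (e_feat.shape[0], 1))], axis=1)
    return h_cond, e_cond
```

Because the concatenation happens outside the GNN layer itself, any message passing design (GCN, GINE, GAT) can be dropped in unchanged, as long as its input dimension accounts for the extra instruction dimensions.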

Graph Convolution Networks (GCN)
GCN (Kipf and Welling, 2017) treats neighborhood nodes as equally important sources of information, and simply averages the transformed features of neighborhood nodes.
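A minimal NumPy sketch of this averaging update, simplified to a plain mean over in-neighbors (the original GCN uses symmetric degree normalization and an activation, both omitted here):

```python
import numpy as np

def gcn_layer(h, edges, W):
    """GCN-style update: average the transformed features of neighbor nodes."""
    N = h.shape[0]
    out = np.zeros((N, W.shape[1]))
    deg = np.zeros(N)
    for j, i in edges:          # message j -> i
        out[i] += h[j] @ W
        deg[i] += 1
    # avoid division by zero for nodes with no incoming edges
    return out / np.maximum(deg, 1)[:, None]
```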

Graph Isomorphism Network (GINE)
GIN (Xu et al., 2019) is provably as powerful as the Weisfeiler-Lehman graph isomorphism test. GINE (Hu et al., 2020) augments GIN by also considering edge features during message passing:

h_i^{(l)} = Θ( (1 + ε) · h_i^{(l-1)} + Σ_{j ∈ N(i)} ReLU( h_j^{(l-1)} + e_{ji} ) ),

where Θ denotes an expressive function such as an MLP, and ε is a scale factor for the emphasis of the central node.
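The GINE update can be sketched directly, with `theta` standing in for the MLP Θ:

```python
import numpy as np

def gine_layer(h, edges, e_feat, theta, eps=0.0):
    """GINE update: h'_i = Theta((1 + eps) * h_i + sum_j ReLU(h_j + e_ji))."""
    agg = np.zeros_like(h)
    for (j, i), e in zip(edges, e_feat):
        agg[i] += np.maximum(h[j] + e, 0.0)   # ReLU(h_j + e_ji)
    return theta((1.0 + eps) * h + agg)
```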

Graph Attention Network (GAT)
Different from GIN and GINE, GAT (Veličković et al., 2018) uses a learned attention mechanism to weight neighbor nodes differently. Intuitively, GAT fits more naturally with our Scene Graph QA task, since we want to emphasize different neighbor nodes given different instruction vectors. Specifically, the attention score α_{ij}^{(L)} for message passing from node j to node i at the L-th layer is calculated as

α_{ij}^{(L)} = Softmax_{N_i}( LeakyReLU( a^T [ W ĥ_i^{(L)} ; W ĥ_j^{(L)} ] ) ),

where Softmax_{N_i} is a normalization that ensures the attention scores from one node to its neighbor nodes sum to 1. After calculating the attention scores, we calculate each node's new representation as a weighted average over its neighbor nodes:

h_i^{(L)} = σ( Σ_{j ∈ N(i)} α_{ij}^{(L)} W ĥ_j^{(L)} ),

where σ denotes the activation function. Similar to transformer models, we use multiple attention heads in practice. In addition, many modern deep learning techniques can be incorporated into GNNs, such as batch normalization, dropout, gating mechanisms, and residual connections.
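A single-head GAT layer following these two equations can be sketched as below (multi-head concatenation and the final activation σ are omitted for brevity):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, edges, W, a):
    """Single-head GAT: attention-weighted average over in-neighbors.

    Scores: LeakyReLU(a . [W h_i || W h_j]), softmax-normalized per node i.
    """
    z = h @ W
    out = np.zeros_like(z)
    for i in range(z.shape[0]):
        nbrs = [j for j, k in edges if k == i]
        if not nbrs:
            continue
        s = np.array([leaky_relu(np.concatenate([z[i], z[j]]) @ a)
                      for j in nbrs])
        alpha = np.exp(s - s.max())
        alpha /= alpha.sum()              # Softmax over the neighborhood N_i
        out[i] = alpha @ np.stack([z[j] for j in nbrs])
    return out
```

With a single neighbor the softmax weight is exactly 1, so node i's update reduces to that neighbor's transformed feature, which is a quick sanity check on the normalization.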

Answering Module
After executing the Graph Reasoning Module, we obtain the final states of all graph nodes after M iterations of message passing, [h_1^{(M)}, ..., h_N^{(M)}]. We first summarize the final states after message passing, and then predict the answer token y with the question summary vector q. We note that GraphVQA does not require any explicit supervision on how to solve the question step by step; we supervise only the final answer prediction.
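A hypothetical sketch of this answering step, assuming dot-product attention pooling against q and a linear answer classifier `W_out` (the exact summarization and classifier are illustrative assumptions, not the paper's stated architecture):

```python
import numpy as np

def predict_answer(h_final, q, W_out):
    """Summarize final node states with question-guided attention pooling,
    then score the answer vocabulary with a linear classifier."""
    scores = h_final @ q                    # one relevance score per node
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax over nodes
    summary = alpha @ h_final               # (d,) weighted node summary
    logits = np.concatenate([summary, q]) @ W_out
    return int(np.argmax(logits))           # index of the predicted token
```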

Experiments
Setup We evaluate our GraphVQA framework on the GQA dataset (Hudson and Manning, 2019a), which contains 110K scene graphs, 1.5M questions, and over 1,000 different answer tokens. We use the official train/validation split of GQA. Since the scene graphs of the test set are not publicly available, we use the validation split as our test set. We set the number of instructions M = 5. More dataset and training details are included in Appendix C.

Models and Metrics
We evaluate three instantiations of GraphVQA: GraphVQA-GCN, GraphVQA-GINE, GraphVQA-GAT. We compare with the state-of-the-art model LCGN (Hu et al., 2019). We discuss LCGN in appendix B.3. We also compare with a simple GCN without instruction vector concatenation discussed in § 3.3 to study the importance of language guidance. We report the standard evaluation metrics defined in Hudson and Manning (2019a) such as accuracy and consistency.

Results
The first take-away message is that GraphVQA outperforms the state-of-the-art approach LCGN, even with the simplest GraphVQA-GCN. Moreover, GraphVQA-GAT outperforms LCGN by a large margin (94.78% vs. 88.43% accuracy), highlighting the benefits of incorporating recent advances from graph machine learning. The second take-away message is that conditioning on instruction vectors is important: removing such conditioning drops performance (GCN vs. GraphVQA-GCN, 85.7% vs. 90.18%). The third take-away message is that the attention mechanism is important for Scene Graph QA, as GraphVQA-GAT also outperforms both GraphVQA-GCN and GraphVQA-GINE by a large margin (94.78% vs. 90.38%), even though GINE is provably more expressive than GAT (Xu et al., 2019).
Analysis Figure 3 shows the accuracy breakdown over question semantic types. We found that GraphVQA-GAT achieves significantly higher accuracy on relationship questions (95.53%), which shows the strength of the attention mechanism in modeling the relationships in scene graphs. Figure 4 shows the accuracy breakdown over question word count. As expected, longer questions are harder to answer for all models. In addition, we found that as questions become longer, the accuracy of GraphVQA-GAT deteriorates more slowly than that of the other methods, showing that GraphVQA-GAT is better at answering long questions.

Conclusion
In this paper, we present GraphVQA to support question answering on scene graphs. GraphVQA translates and executes a natural language question as multiple iterations of message passing using graph neural networks. We explore the design space of the GraphVQA framework and find that GraphVQA-GAT (Graph Attention Network) is the best design. GraphVQA-GAT outperforms the state-of-the-art model by a large margin (94.78% vs. 88.43% accuracy). Our results suggest the potential benefits of revisiting existing Vision + Language multimodal models from the perspective of graph machine learning.

A.1 Visual Question Answering
VQA requires an interplay of visual perception with reasoning about the question semantics grounded in perception. The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder, where each image is usually represented as a bag of object features, each describing the local appearance within a bounding box detected by the object detection backbone. However, representing images as collections of objects fails to capture relationships, which are crucial for visual question answering. Recent studies have further demonstrated some unsettling behaviors of those models: they tend to ignore important question terms (Mudrakarta et al., 2018), look at wrong image regions (Das et al., 2016), or undesirably adhere to superficial or even potentially misleading statistical associations. In addition, it has been shown that recent advances are primarily driven by perception improvements (e.g., object detection) rather than reasoning (Amizadeh et al., 2020).

A.2 Scene Graph Question Answering
Although there are many research efforts in scene graph generation, using scene graphs for visual question answering remains relatively underexplored (Hudson and Manning, 2019b; Hu et al., 2019). The former framework requires specifying the detailed ontology of the dataset (e.g., color: red, blue, ...; material: wooden, metallic), and is thus not directly generalizable. Other attempts in graph-based VQA (Hu et al., 2019; Li et al., 2019) mostly explore attention mechanisms on fully-connected graphs, thereby failing to capture the important structural information of the scene graphs. Figure 6 provides another accuracy breakdown, over question structural types. We found that GraphVQA-GAT achieves the best accuracy for all types of questions except the verify type. Specifically, GraphVQA-GAT significantly outperforms the other methods on answering queries, comparing among objects, and making choices. This intuitively matches the principle of the attention mechanism and again shows its advantages in modeling structural information in scene graphs.

B.2 Expressive Ability Analysis of GraphVQA-GINE
As mentioned in Section 3.3.1 and Section 3.3.2, an expressive function Θ is used in the GINE layer. When Θ is just a single-layer MLP, the corresponding GIN/GINE structure is very similar to the GCN structure. Since in Section 4 we implemented Θ as a single-layer MLP, the performance of GraphVQA-GCN and GraphVQA-GINE remains very similar. As GIN and GINE are now very popular as basic components for large-scale graph neural network design, one may ask whether using a more expressive Θ would help performance. The short answer is no. We provide a simple ablation study on different choices of Θ, using a two-layer MLP-style network with a (FC, ReLU, FC, ReLU, BN) structure. Table 4 shows that GraphVQA-GINE-2 degrades to the worst result. One possible reason is that each scene graph is generally small, so a single-layer MLP may already provide sufficient expressive power, and a more complex Θ may lead to a harder optimization problem and thus degrade performance. This hypothesis could be further investigated in future work. In addition, scene graph-based VQA as in this work might offer an opportunity for further accelerating real-world image-based applications (Liang and Zou, 2020). Exploring such deployment benefits is another direction of future work.

B.3 Brief Introduction of LCGN
Language-Conditioned Graph Networks (LCGN) (Hu et al., 2019) update node representations recurrently using the same single-layer graph neural network. Given a set of instruction vectors, LCGN first computes context representations [x_ctx,1, ..., x_ctx,n], and then concatenates them with node representations to form initial local features. Under the assumption that all nodes are connected, LCGN computes edge weights w^{(t)}_{j,i} for each node pair (i, j); the messages m^{(t)}_{i,j} are then computed from these weights. Finally, LCGN aggregates the neighborhood message information to update the context local representation x_ctx,i,t. Note that the graph neural structure of LCGN can be regarded as a variant of a recurrently-applied single standard GAT layer, but with more self-designed learnable parameters. The main difference between LCGN and the other graph neural structures considered here is that the output node and edge features are recurrently fed into the same layer again for each reasoning step, leading to an RNN-style network structure instead of a sequential-style network. Moreover, our LCGN implementation is a variant of the original LCGN, including a few improvements. First, we use a transformer encoder and decoder to obtain instruction vectors instead of a Bi-LSTM (Liang et al., 2020d). Second, we incorporate the true scene graph relations as edges instead of densely connected edges. Third, edge attributes are also used in the generation of initial node features.
C Implementation Details

C.1 Data Pre-processing

The edges in the original scene graphs are directed. This means that in most cases, where only one directed edge connects two nodes in the graph, messages can only flow in one direction. However, this does not match the natural way of human reasoning. For example, a relation "A is to the left of B" should obviously entail the opposite relation "B is to the right of A". Therefore, to enhance the connectivity of our graphs, we introduce a synthetic symmetric edge for every non-paired edge, pointing back to the source node. To encode this reversed relationship, we negate the original edge's feature vector and use it as the representation of the synthetic symmetric edge.
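This pre-processing step can be sketched as follows: for every directed edge without a reverse counterpart, add the reverse edge with the negated feature vector.

```python
import numpy as np

def add_symmetric_edges(edges, e_feat):
    """For every non-paired directed edge (j, i), add the reverse edge (i, j)
    whose feature is the negated original feature vector."""
    existing = set(edges)
    new_edges = list(edges)
    new_feats = [np.asarray(f) for f in e_feat]
    for (j, i), f in zip(edges, e_feat):
        if (i, j) not in existing:
            new_edges.append((i, j))
            new_feats.append(-np.asarray(f))
            existing.add((i, j))
    return new_edges, np.stack(new_feats)
```

Edges that already come in both directions are left untouched, so only genuinely one-way relations gain a synthetic reverse.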

C.2 Additional Dataset Information
These scene graphs are generated from 113k images from COCO and Flickr using the Visual Genome (Krishna et al., 2017b) scene graph annotations. Specifically, each node in a GQA scene graph represents an object, such as a person, a window, or an apple. Along with the positional information of its bounding box, each object is also annotated with 1-3 different attributes. These attributes are the adjectives used to describe the associated objects. For example, there can be color attributes like "white", size attributes like "large", and action attributes like "standing". Attributes are important sources of information beyond the coarse-grained object classes (Liang et al., 2020c). Each edge in the scene graph denotes a relation between two connected objects. These relations can be action verbs, spatial prepositions, and comparatives, such as "wearing", "below", and "taller".
We use the official split of the GQA dataset, taking the two files "val_sceneGraphs.json" and "train_sceneGraphs.json" obtained directly from the GQA website as our raw dataset. Since each image (graph) is independent, GQA splits the dataset by individual graphs with rough split percentages of 88%/12% for train/validation. Table 2 summarizes the statistics that we collected from the dataset. We do not report statistics for the test set since its scene graphs are not publicly available.

C.3 Training details
We train the models using the Adam optimizer, with a learning rate of 10^-4, a batch size of 256, and a learning rate drop (divide by 10) every 90 epochs. We train all models for 100 epochs. Both hidden states and word embedding vectors have a dimension of 300, the latter being initialized using GloVe (Pennington et al., 2014). The instruction vectors have a dimension of 512. All results reported are for a single-model setting (i.e., without ensembling). We use cross-validation for hyper-parameter tuning.
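The step learning-rate schedule described above (base rate 10^-4, divided by 10 every 90 epochs) can be written as a simple function:

```python
def learning_rate(epoch, base_lr=1e-4, drop_every=90, factor=10.0):
    """Step schedule: divide the learning rate by `factor` every
    `drop_every` epochs, starting from `base_lr`."""
    return base_lr / (factor ** (epoch // drop_every))
```

With 100 training epochs, this yields exactly one drop, at epoch 90.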