Multimodal Neural Graph Memory Networks for Visual Question Answering

We introduce a new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for visual question answering. The MN-GMN uses graph structure with different region features as node attributes and applies a recently proposed powerful graph neural network model, Graph Network (GN), to reason about objects and their interactions in an image. The input module of the MN-GMN generates a set of visual features plus a set of encoded region-grounded captions (RGCs) for the image. The RGCs capture object attributes and their relationships. Two GNs are constructed from the input module using the visual features and encoded RGCs. Each node of the GNs iteratively computes a question-guided contextualized representation of the visual/textual information assigned to it. Then, to combine the information from both GNs, the nodes write the updated representations to an external spatial memory. The final states of the memory cells are fed into an answer module to predict an answer. Experiments show MN-GMN rivals the state-of-the-art models on Visual7W, VQA-v2.0, and CLEVR datasets.


Introduction
Visual question answering (VQA) has been recently introduced as a grand challenge for AI. Given an image and a free-form question about it, the VQA task is to produce an accurate natural language answer. VQA has many applications, such as image retrieval and search. This paper proposes a new neural network architecture for VQA based on the recent Graph Network (GN) (Battaglia et al., 2018).
The pairwise interactions between various regions of an image and spatial context in both horizontal and vertical directions are important to answer questions about objects and their interactions in the scene context. For example, to answer How many cats are in the picture? (see Figure 1), a model needs to aggregate information from multiple, possibly distant, regions; hence applying a convolutional neural network may not be sufficient to perform reasoning over the regions. Our new architecture (see Figure 2), Multimodal Neural Graph Memory Network (MN-GMN), uses a graph structure to represent pairwise interactions between visual/textual features (nodes) from different regions of an image. GNs provide a context-aware neural mechanism for computing a feature for each node that represents complex interactions with other nodes. This enables our MN-GMN to answer questions that need reasoning about complex arrangements of objects in a scene.
Figure 1: An example from Visual Genome (https://visualgenome.org/). The region-grounded captions provide useful clues to answer questions. For example, to answer Where are the cats?, orange and white cat laying on a wooden bench is informative.
Previous approaches such as Memory Networks (MN) (Sukhbaatar et al., 2015) and Dynamic Memory Networks (DMN) (Kumar et al., 2015) combined a memory component and an attention mechanism to reason about a set of inputs. The DMN was first proposed for text QA. The text QA task is composed of a question, and a set of statements, called facts, in the order that describes a short story. Only a subset of the facts is required to answer a question. DMN includes four modules: input, question, episodic memory, and answer. The input and question modules encode the question and the facts. Then, the episodic memory takes as input the question and aggregates the facts to produce a vector representation of the relevant information. This vector is passed to the answer module to predict an answer. Previous applications of the MN and DMN for VQA either represent each image region independently as a single visual fact (Xu and Saenko, 2015) or represent the regions of an image like facts of a story with a linear sequential structure (Xiong et al., 2016). But, whereas a linear order may be sufficient for text QA, it is insufficient to represent the 2D context of an image.
The major novel aspect of our approach is that we exploit the flexibility of GNs to combine information from two different sources: visual features from different image regions and textual features based on region-grounded captions (RGCs). An RGC detector is learned by transfer learning from a dataset with region-grounded captions. Like visual features, an RGC is specified with a bounding-box. The RGCs capture object attributes and relationships that are often useful to answer visual questions. For example, in Figure 2, to answer Is the water calm?, a wave in the ocean is informative; the water is blue specifies an attribute of water; surfer riding a wave describes an interaction between objects. Captions also incorporate commonsense knowledge. Our multimodal graph memory network comprises a visual GN and a textual GN, one for each information source. Each node of the two GNs iteratively computes a question-guided contextualized representation of the visual/textual information at the bounding-box assigned to it. The third component in our multimodal graph memory module is an external spatial memory, which is designed to combine information across the modalities. Each node writes its updated representation to the external spatial memory, which is composed of memory cells arranged in a 2D grid. The final state of the memory cells is then fed into the answer module to predict an answer. The external spatial memory resolves the redundancy introduced by overlapping bounding-boxes, which causes difficulties, for example, with counting questions.
To summarize, our main contributions are:
• We introduce a new memory network architecture, based on graph neural networks, which can reason about complex arrangements of objects in a scene to answer visual questions.
• To the best of our knowledge, this is the first work that explicitly incorporates local textual information (RGCs) of the image via a transfer learning technique into a multimodal memory network to answer visual questions.
• Our architecture, which can be seen as a multimodal relational extension to DMN, rivals the state-of-the-art on three VQA datasets.

Related Work
An important part of the VQA task is to understand the given question. Most approaches utilize a neural network architecture that can handle sequences of flexible length and learn complex temporal dynamics using a sequence of hidden states. Such architectures include Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). To encode a given image, most VQA approaches employ a Convolutional Neural Network (CNN) pre-trained on ImageNet, such as VGGNet and ResNet, to extract visual information from an image. CNNs and RNNs have also been applied successfully to image captioning and visual grounding (Johnson et al., 2015) tasks. Grounding connects words to their visual meaning. Our approach sees VQA as first grounding the question in the image and then predicting an answer. Most early deep neural-based VQA models produce an answer conditioned on a global visual feature vector and the embedded question. However, since many questions and answers relate to a specific region in an image, these models often cannot predict a precise answer. To overcome this issue, many attention-based models have been proposed. The attention-based models compute an attention weight of spatially localized CNN features based on the question to predict an answer (Xu and Saenko, 2015; Xiong et al., 2016). Other work used the Bottom-Up Attention model to obtain a set of features at different regions of the image and computed an attention weight for each region based on the encoded question to predict an answer. In Lu et al. (2016), the authors proposed a hierarchical co-attention model that jointly implements both image-guided question attention and question-guided visual attention. Fukui et al. (2016) proposed a VQA model based on multimodal compact bilinear (MCB) pooling to get a joint representation for image and question. Similarly, Kim et al. (2018), among others, utilized higher-order fusion techniques to combine the question with visual features more efficiently. Cadene et al. (2019) proposed a bilinear fusion algorithm to represent interactions between question and image regions.
In Jabri et al. (2016), the authors introduced a model called Relation Networks, which uses multilayer perceptron models to reason over all pairs of local image features extracted from a grid of image regions. Dynamic tree structures have been used in VQA to capture the visual context of image objects (Tang et al., 2019). Yi et al. (2018) proposed a model called neural-symbolic visual question answering (NS-VQA). The NS-VQA uses symbolic structure as prior knowledge to answer questions that need complex reasoning. This model first extracts a structural scene representation from the scene and a program trace from the given question. Then, it applies the program to the scene representation to predict an answer.
Recently, a few models have been proposed that can learn the interactions between image regions. The graph learner model (Norcliffe-Brown et al., 2018) merges a graph representation of the image based on the question with a graph convolutional network, to learn visual features that can represent question-specific interactions. Yang et al. (2018) proposed to reason over a visual representation of the image called a scene graph, which represents objects and their relationships explicitly. Li et al. (2019) introduced a VQA model called Relation-aware Graph Attention Network (ReGAT). Guided by the question, ReGAT encodes an image into a graph that represents relations among visual objects. ReGAT is trained on the Visual Genome dataset (Krishna et al., 2016).
Most of the above models need datasets with annotated object relationship triplets for training. Because annotating triplets is difficult, such datasets are relatively small. Instead, our VQA architecture exploits the rich textual information of an image via incorporating the RGCs to learn the attributes of an image region and the interactions between a set of image regions enclosed by an RGC bounding-box. This information is much easier to obtain because large caption datasets are available.
More recently, Hudson and Manning (2019a) proposed a model called Neural State Machine (NSM) for the visual questions that need compositionality and multi-step inference. Given an image, the NSM first predicts a probabilistic graph as a structured semantic representation of the image. Then, NSM executes sequential reasoning guided by the input question over the predicted graph, by iteratively traversing the nodes of the graph. The authors show that the proposed model can achieve state-of-the-art results on VQA-CP (Agrawal et al., 2018) and GQA (Hudson and Manning, 2019b) datasets. Shrestha et al. (2019) introduced a VQA model called Recurrent Aggregation of Multimodal Embeddings Network (RAMEN), which is suitable for both natural image understanding and the synthetic datasets that need compositional reasoning. The RAMEN processes visual and question features in three steps: early fusion of spatially localized image features with question features, learning bimodal embeddings, and aggregating them across the image by applying a bidirectional GRU to capture the interactions between bimodal embeddings.

Graph Networks
In this section, we briefly explain the graph network (GN) framework (Battaglia et al., 2018). The GN extends several other graph neural networks, such as message-passing neural networks (Gilmer et al., 2017) and non-local neural networks. In a GN framework, a graph is represented by a 3-tuple $G = (u, V, E)$, where $u$ is a graph-level attribute. $V = \{v_i\}_{i=1:N}$ is a set of node attributes, where $v_i$ is the attribute of node $i$ and $N$ is the number of nodes. $E = \{(e_k, r_k, s_k)\}_{k=1:M}$ is a set of edges, where $e_k$ is the attribute of the edge going from node $s_k$ to node $r_k$, and $M$ is the number of edges.
A GN block has three update functions $\phi$ and three aggregation functions $\rho$. Given an input graph, a GN block updates the graph using the update and aggregation functions. The computational steps in a GN are represented in Algorithm 1. The function $\phi^e$ is mapped over all edges to compute per-edge updates, $\phi^v$ is mapped over all nodes to compute per-node updates, and $\phi^u$ is used to update the global attribute. The $\rho$ functions must be invariant to permutations of their inputs and must accept a varying number of arguments; maximum and summation are examples.
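The update and aggregation steps of a GN block can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: the names `gn_block` and `vec_sum` are hypothetical, every aggregation function is assumed to be an element-wise summation, and the update functions are passed in as arbitrary callables (the MN-GMN uses GRUs for them).

```python
def vec_sum(vectors, dim):
    """Element-wise sum of equal-length vectors (a zero vector if the list is empty)."""
    total = [0.0] * dim
    for vec in vectors:
        total = [a + b for a, b in zip(total, vec)]
    return total


def gn_block(u, V, E, receivers, senders, phi_e, phi_v, phi_u):
    """One pass of a Graph Network block with summation as every aggregator.

    u: global attribute (list of floats); V: list of node attributes;
    E: list of edge attributes; receivers[k] / senders[k]: node indices of edge k.
    phi_e, phi_v, phi_u: caller-supplied update functions (assumed, not from the paper).
    """
    # Per-edge update from the edge, its receiver node, its sender node, and u.
    E_new = [phi_e(E[k], V[receivers[k]], V[senders[k]], u) for k in range(len(E))]
    edge_dim = len(E_new[0]) if E_new else 0

    V_new = []
    for i in range(len(V)):
        # Aggregate the updated edges that arrive at node i, then update the node.
        incoming = [E_new[k] for k in range(len(E)) if receivers[k] == i]
        V_new.append(phi_v(vec_sum(incoming, edge_dim), V[i], u))

    # Aggregate all edges and all nodes for the whole graph, then update u.
    e_bar = vec_sum(E_new, edge_dim)
    v_bar = vec_sum(V_new, len(V_new[0]))
    u_new = phi_u(e_bar, v_bar, u)
    return u_new, V_new, E_new
```

Because summation is permutation invariant and accepts any number of inputs, it satisfies the requirements stated for the aggregation functions; a maximum would work equally well in this sketch.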
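Algorithm 1: Computational steps in a Graph Network block.
Input: graph $G = (u, V, E)$
for each edge $k = 1, \ldots, M$:
    $e'_k \leftarrow \phi^e(e_k, v_{r_k}, v_{s_k}, u)$ (compute new edge attributes)
for each node $i = 1, \ldots, N$:
    $\bar{e}'_i \leftarrow \rho^{e \to v}(E'_i)$, with $E'_i = \{(e'_k, r_k, s_k)\}_{r_k = i}$ (aggregate edge attributes for each node)
    $v'_i \leftarrow \phi^v(\bar{e}'_i, v_i, u)$ (compute new node attributes)
$\bar{e}' \leftarrow \rho^{e \to u}(E')$ (aggregate edge attributes for the whole graph)
$\bar{v}' \leftarrow \rho^{v \to u}(V')$ (aggregate node attributes for the whole graph)
$u' \leftarrow \phi^u(\bar{e}', \bar{v}', u)$ (compute new global attribute)
Output: updated graph $G' = (u', V', E')$

Our Proposed Architecture
Figure 2 shows our MN-GMN architecture, which is composed of four modules: input, question, multimodal graph memory network, and answer. We now describe these modules.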

Input Module
The input module has two components: a deep CNN (e.g., Bottom-Up Attention or ResNet (He et al., 2015)) and a region-grounded caption (RGC) encoder, which encodes the RGCs. The RGCs are generated by a dense captioning model. Then, they are encoded with a GRU and a parser (Schuster et al., 2015). The RGCs are useful to answer questions about object attributes and their relationships. We now describe the details and motivation for these components.
Visual Feature Extraction. To extract visual features, we use the Bottom-Up Attention model. The features are obtained via Faster R-CNN and 101-layer ResNet, which attend to specific image regions. Using a fixed threshold on object detection, we extract $N$ 2048-dimensional image features from $N$ different regions of the image. The value of $N$ depends on the image and ranges from 10 to 100. Each feature vector has a bounding-box specified by its coordinates $r = (r_x, r_y, r'_x, r'_y)$, where $(r_x, r_y)$ and $(r'_x, r'_y)$ are the top-left and bottom-right corners of the bounding-box, normalized to values between 0 and 1 based on the height and width of the image. We concatenate each feature vector with its bounding-box to obtain a vector denoted by $x_i$ $(i = 1, \ldots, N)$. Note that $x_i$ only describes the image at its bounding-box without exploiting the global spatial context.

Captions.
To extract a set of RGCs for the image, we use a dense captioning model proposed by Johnson et al. (2015). This model contains a CNN, a dense localization layer, and an RNN language model that generates the captions (https://github.com/jcjohnson/densecap). The model is trained on RGCs from the Visual Genome dataset. The training set that we use does not include VQA-v2.0/Visual7W test images. Through transfer learning, our model leverages the caption annotations. Each RGC has a caption, a bounding-box, and a confidence score. To encode a caption, we first create a dictionary using all words in the captions and questions. We preprocess the captions and questions with basic tokenization by converting all sentences to lower case and discarding non-alphanumeric characters.
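The preprocessing described above (lower-casing, dropping non-alphanumeric characters, and building one dictionary shared by captions and questions) might look as follows. This is a sketch under assumptions: the function names are hypothetical, and the exact tokenizer used by the authors is not specified in the text.

```python
import re


def tokenize(sentence):
    """Lower-case a sentence and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def build_dictionary(sentences):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab
```

Sharing one dictionary means that a word such as water receives the same id whether it occurs in a caption or in a question, which is what allows the two to be matched later.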
We map the words to a dense vector representation using a trainable word embedding matrix $\mathbf{L} \in \mathbb{R}^{L \times D}$, where $D$ is the dimensionality of the semantic space and $L$ is the size of the dictionary. To initialize the word embeddings, we use the pretrained GloVe vectors. Words that do not occur in the pretrained word embedding model are initialized with zeros. We encode a caption using a GRU and a parser. The parser takes a caption and parses it into a set of objects with their attributes and a set of relationship triplets. The encoded RGC is a vector representation denoted by $\tilde{x} \in \mathbb{R}^D$. See Appendix A for more detail about the RGC encoding.

Question Module
We encode a question using the same dictionary as we use for captions. This enables our model to match the words in a caption with the words in a question and attend to the relevant caption. The final hidden state of a GRU, denoted by q, is used as the representation of the question.

Multimodal Graph Memory Network
Given a set of visual feature vectors, a set of encoded RGCs, and the encoded question, the multimodal graph memory network module produces a representation of the relevant information based on the encoded question. The memory chooses which parts of the inputs to focus on using an attention mechanism. Unlike previous work (Xu and Saenko, 2015;Xiong et al., 2016), our memory network module is multimodal and relational. That is, it employs both textual and visual information of the input image regions, and it exploits pair-wise interactions between each pair of visual/textual features using a visual/textual GN. Similar to visual features, most of the RGCs may be irrelevant to the given question. Thus, the memory module needs to learn an attention mechanism for focusing on the relevant RGCs.
Formally, the multimodal graph memory network is composed of a visual GN $G = (u, V, E)$ with $N$ nodes, a textual GN $\tilde{G} = (\tilde{u}, \tilde{V}, \tilde{E})$ with $\tilde{N}$ nodes, and an external spatial memory. Each node of the visual GN represents a visual feature with an associated bounding-box. Similarly, each node of the textual GN has a bounding-box that corresponds to a detected RGC of the image. In both GNs, we connect two nodes via a forward and a backward edge if they are nearby. That is, we connect two nodes if the Euclidean distance between the normalized centers of their bounding-boxes is less than $\gamma = 0.5$. Note that even if two nodes of a GN are not neighbors, they may still communicate via the message passing mechanism of the GN.
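The neighborhood rule above can be illustrated with a short sketch. The function name `build_edges` is hypothetical, boxes are assumed to be given as normalized (x1, y1, x2, y2) tuples, and self-loops are assumed to be excluded, which the text does not state explicitly.

```python
import math


def build_edges(boxes, gamma=0.5):
    """Return (sender, receiver) pairs for nodes whose box centers are closer than gamma.

    boxes: list of (x1, y1, x2, y2) tuples with coordinates normalized to [0, 1].
    Each nearby pair yields two directed edges, one in each direction.
    """
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    edges = []
    for s, (sx, sy) in enumerate(centers):
        for r, (rx, ry) in enumerate(centers):
            if s != r and math.hypot(sx - rx, sy - ry) < gamma:
                edges.append((s, r))
    return edges
```

Since normalized centers are at most $\sqrt{2}$ apart, a threshold of 0.5 links boxes within roughly a third of the image diagonal and leaves distant pairs to communicate through repeated message passing.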
The external memory is a network of memory cells arranged in a P × Q grid. Each cell has a fixed location that corresponds to a specific (H/P )×(W/Q) region in the image, where H and W are height and width of the image. Each node of the visual/textual GN sends its information to a memory cell if its bounding-box covers the location of the cell. Since the bounding-boxes may overlap, a cell may get information from multiple nodes. The external memory network is responsible for aggregating the information from both GNs and eliminating redundancy introduced by overlapping bounding-boxes. This makes our architecture less sensitive to the number of detected bounding-boxes. Since the input to the spatial memory is the output of the GNs, the state of the GN nodes can be seen as an internal memory, and the state of the spatial memory can be seen as an "external" memory like Neural Turing Machines (Graves et al., 2014).
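As an illustration of how a node could be linked to memory cells, the sketch below returns the grid cells covered by a bounding-box. It assumes that a box "covers" a cell when the cell center lies inside the box; the paper does not spell out the exact criterion, and the function name `cells_covered` is hypothetical.

```python
def cells_covered(box, P=14, Q=14):
    """Return the (p, q) cells of a P x Q grid whose centers lie inside a box.

    box: (x1, y1, x2, y2) with coordinates normalized to [0, 1].
    p indexes rows (vertical axis) and q indexes columns (horizontal axis).
    """
    x1, y1, x2, y2 = box
    rows = [p for p in range(P) if y1 <= (p + 0.5) / P <= y2]
    cols = [q for q in range(Q) if x1 <= (q + 0.5) / Q <= x2]
    return [(p, q) for p in rows for q in cols]
```

Two overlapping boxes return overlapping cell lists, so a cell in the overlap receives information from both nodes, as described above.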
Initialization. To initialize each node attribute of the visual GN, we combine a visual feature vector extracted from a region of the image with the encoded question using MCB pooling as $v_i = q \star x_i$, where $\star$ represents the MCB pooling. Similarly, we initialize each node attribute of the textual GN as $\tilde{v}_i = q \odot \tilde{x}_i$, where $\odot$ is the element-wise multiplication. We use the MCB to combine the visual features with the encoded question since the question and visual features are from different modalities. The global attribute $u$ is initialized by a global feature vector of the image extracted from the last layer of the 101-layer ResNet. This helps to answer questions that need the global features of the scene. The global attribute $\tilde{u}$ is initialized with the encoded question. The edge features of the GNs and the memory cells are initialized with zero vectors.
Updates. At each iteration, we first update the GNs. Then, we update the content of the memory cells. We update the edge attributes, node attributes, and global attribute of both GNs as described in Algorithm 1. For each GN, we use three different GRUs to implement the functions $\phi^e$, $\phi^v$, and $\phi^u$. The $\rho^{e \to v}$ is an element-wise summation. The $\rho^{v \to u}$ and $\rho^{e \to u}$ for the visual GN are implemented with question-guided attention, in which $\sigma$ and $\psi$ are the sigmoid and hyperbolic tangent activation functions, and $W_i$, $b_i$, $i = 1, \ldots, 4$, are trainable parameters. This allows the model to incorporate information from the question when computing the attention weight of each node/edge with the sigmoid function. The $\rho^{v \to u}$ and $\rho^{e \to u}$ for the textual GN are implemented in a similar way. Let $\bar{v}_{p,q} = \frac{1}{|N_{p,q}|} \sum_{i \in N_{p,q}} v_i$ and $\bar{\tilde{v}}_{p,q} = \frac{1}{|\tilde{N}_{p,q}|} \sum_{i \in \tilde{N}_{p,q}} \tilde{v}_i$, where $N_{p,q}$ and $\tilde{N}_{p,q}$ are the sets of nodes that are connected to the memory cell $(p, q)$ in the visual and textual GNs, respectively. Each memory cell is then updated from these averages using a neural network layer $f$, which aggregates the memories from the neighboring cells. We repeat these steps for two iterations. Using only one iteration decreases the accuracy by about 2.0 points. As observed by Kumar et al. (2015), iterating over the inputs allows the memory network to take several reasoning steps, which some questions require.
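The per-cell averages that feed the memory update can be computed as in the following sketch. It covers only the averaging step: the memory-update layer itself is not reproduced, the name `cell_means` is hypothetical, and cells with no connected node are simply omitted from the result (an assumption).

```python
from collections import defaultdict


def cell_means(node_attrs, node_cells):
    """Average the attributes of the nodes connected to each memory cell.

    node_attrs: list of node attribute vectors (lists of floats).
    node_cells: for each node, the list of (p, q) cells it is connected to.
    Returns a dict {(p, q): mean attribute vector}.
    """
    members = defaultdict(list)
    for attr, cells in zip(node_attrs, node_cells):
        for cell in cells:
            members[cell].append(attr)
    return {
        cell: [sum(column) / len(attrs) for column in zip(*attrs)]
        for cell, attrs in members.items()
    }
```

Averaging rather than summing keeps the magnitude of a cell's input independent of how many overlapping boxes cover it, which is consistent with the stated goal of reducing sensitivity to the number of detected bounding-boxes.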

Answer Module
The answer module predicts an answer using a GN called the answer GN. The nodes of the answer GN are the external spatial memory cells. However, there is an edge between every ordered pair of nodes (cells), hence the answer GN is a complete graph. This supports reasoning across distant regions of the image. Let $m^{\bullet}_{p,q}$ be the final state of the memory cell at location $(p, q)$. We initialize the node attributes of the answer GN, denoted by $v^{\bullet}_{p,q}$, as $v^{\bullet}_{p,q} = m^{\bullet}_{p,q}$. The edge attributes are initialized using the one-hot representation of the location of the sender and receiver memory cells. That is, the edge attribute of the edge going from the memory cell at location $(p, q)$ to $(p', q')$ is initialized with a vector of size $2P + 2Q$, which is computed by concatenating the one-hot representations of $p$, $q$, $p'$, and $q'$. The global attribute of the answer GN is initialized with a vector of zeros.
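The edge initialization described above can be written as a small helper. This is a sketch with a hypothetical function name and zero-based cell indices assumed.

```python
def edge_init(p, q, p2, q2, P=14, Q=14):
    """Concatenate the one-hot codes of sender cell (p, q) and receiver cell (p2, q2).

    The result has length 2P + 2Q, laid out as [one-hot p | one-hot q | one-hot p2 | one-hot q2].
    """
    vec = [0.0] * (2 * P + 2 * Q)
    vec[p] = 1.0                   # sender row
    vec[P + q] = 1.0               # sender column
    vec[P + Q + p2] = 1.0          # receiver row
    vec[2 * P + Q + q2] = 1.0      # receiver column
    return vec
```

With $P = Q = 14$, each edge attribute has 56 entries, exactly four of which are non-zero.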
Then, we update the edge attributes, the node attributes, and the global attribute of the answer GN as described in Algorithm 1. As before, we use three different GRUs to implement the functions $\phi^e$, $\phi^v$, and $\phi^u$. The $\rho^{e \to v}$ is a simple element-wise summation. The $\rho^{v \to u}$ and $\rho^{e \to u}$ are implemented as before, but with a different set of parameters. The answer module predicts an answer as $\hat{p} = \sigma(W g(u^{\bullet}) + \tilde{W} \tilde{g}(u^{\bullet}) + b)$, where $u^{\bullet}$ is the updated global attribute of the answer GN, $W \in \mathbb{R}^{Y \times 2048}$, $\tilde{W} \in \mathbb{R}^{Y \times 300}$, and $b \in \mathbb{R}^{Y}$ are trainable parameters, $g$ and $\tilde{g}$ are non-linear layers, and $Y$ is the number of possible answers.
Following prior work, to exploit prior linguistic information about the candidate answers, the GloVe embeddings of the answer words are used to initialize the rows of $\tilde{W}$. Initialization with the GloVe embeddings improves the performance by about 1.0 point. Similarly, to utilize prior visual information about the candidate answers, a visual embedding is used to initialize the rows of $W$. The visual embedding is obtained by retrieving 10 images from Google Images for each word. Then, the images are encoded using the ResNet-101 pretrained on ImageNet to obtain a feature vector of size 2048. For each word, the average of the feature vectors is used to initialize a row of $W$. The loss for a single sample is defined in terms of $\hat{p}_i$, the $i$th element of $\hat{p}$, and $p_i$, the $i$th element of the ground-truth vector $p$ ($p_i = 1.0$ if $A \geq 3$ annotators give the $i$th answer word, otherwise $p_i = A/3$). For the multiple-choice task, the candidate answers are encoded by the last state of a GRU and concatenated with $u^{\bullet}$ using a neural network layer as $\acute{p} = \sigma(\acute{w} \acute{f}([u^{\bullet}, a]) + \acute{b})$, where $a$ is an encoded answer choice, $\acute{f}$ is a non-linear layer, and $\acute{w}$, $\acute{b}$ are trainable parameters. For the multiple-choice task, the binary logistic loss $-p \log(\acute{p}) - (1 - p) \log(1 - \acute{p})$ is used, where $p$ is 1.0 for an (image, question, answer) triplet if the answer choice is correct, and 0 otherwise.
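The soft ground-truth scores and the binary logistic loss stated above can be made concrete with a short sketch. The function names are hypothetical, and the clipping constant `eps` is an added numerical safeguard that is not part of the paper's formula.

```python
import math


def soft_target(num_annotators):
    """Ground-truth score of an answer: 1.0 if at least 3 annotators gave it, else A/3."""
    return min(num_annotators / 3.0, 1.0)


def binary_logistic_loss(p, p_hat, eps=1e-12):
    """Compute -p*log(p_hat) - (1-p)*log(1-p_hat), with p_hat clipped away from 0 and 1."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -p * math.log(p_hat) - (1.0 - p) * math.log(1.0 - p_hat)
```

For example, an answer given by two of the ten annotators receives a target of 2/3, so a prediction is rewarded for assigning it partial rather than full confidence.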
Training Details and Optimization. The MN-GMN is implemented in TensorFlow. We use a library from https://github.com/deepmind/graph_nets to implement the GNs. We follow established VQA training tips from prior work to train our models. More specifically, to apply an ensemble technique, 20 instances of the model are trained with different initial random seeds. For test images, the scores for the answers by all models are summed, and the answer is predicted using the highest summed score. To minimize the loss, we apply the RMSprop optimization algorithm with a learning rate of 0.0001 and minibatches of size 100.
Dropout with probability 0.5 and early stopping are applied to prevent overfitting. Dropout is used after the layer that computes the updated global attribute of the answer GN. During training, all parameters are tuned except for the weights of the CNN and RGC detector to avoid overfitting. For VQA-v2.0 and Visual7W datasets, we augment the training dataset with Visual Genome/GQA images and QA pairs. The training set that we use does not include the VQA-v2.0/Visual7W test or Visual7W validation images. The output dimension of the MCB and the dimension of the hidden layer in both RGC and question GRUs are set to 512. Also, we set P, Q = 14 and D = 512. The full model takes around 6 hours to train on two Titan X GPUs.
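The ensembling step described in the training details (summing the answer scores of all model instances and choosing the answer with the highest total) amounts to the following sketch; the function name is hypothetical.

```python
def ensemble_predict(score_lists):
    """Sum per-model answer scores and return the index of the highest-scoring answer.

    score_lists: one list of Y answer scores per trained model instance.
    """
    totals = [sum(scores) for scores in zip(*score_lists)]
    return max(range(len(totals)), key=totals.__getitem__)
```

Note that the winning answer need not be the top choice of every individual model; it only needs the highest summed score.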

Experiments
We explain the datasets, baseline models, and evaluation metric that we use in our experiments. Then, the experimental results are discussed.
Datasets. VQA-v2.0 (Antol et al., 2015) includes 82,783 training images, 40,504 validation images, and 81,434 testing images. There are 443,757 training questions, 214,354 validation questions, and 447,793 test questions in this dataset. A subset of the standard test set, called test-dev, contains 107,394 questions. Each question has 10 candidate answers generated by humans. We choose correct answers that appear more than 8 times. This gives $Y = 3{,}110$ candidate answers. We use the standard metric (Antol et al., 2015), under which an answer is counted as correct if at least 3 annotators agree on it.
We also experiment on the CLEVR dataset (Johnson et al., 2017a). CLEVR evaluates different aspects of visual reasoning, such as attribute recognition, counting, comparison, logic, and spatial relationships. Each object in an image has the following attributes: shape (cube, sphere, or cylinder), size (large or small), color (8 colors), and material (rubber or metal). An object detector with 96 classes, one for each combination of the attributes, is trained using the TensorFlow Object Detection API. We use Faster R-CNN NasNet trained on the MS-COCO dataset as the pretrained model. Given an image, the output of the object detector is a set of object bounding-boxes with their feature vectors. For CLEVR, we omit the textual GN, since CLEVR images do not have rich textual information.
Baselines. We compare our model with several recently developed architectures, including the state-of-the-art models ReGAT, BAN, VCTREE, and MuRel. For comparison, we also include three related models in Table 1 that were proposed more recently in arXiv preprints during the preparation of this work: LXRT, MSM@MSRA, and MIL@HDU. ReGAT exploits supervision from Visual Genome relationships. MAN is a memory-augmented neural network which attends to each training exemplar to answer visual questions, even when the answers occur infrequently in the training set. Count is a neural network model designed to count objects from object proposals. For Visual7W, we compare our models with Zhu et al. (2015), MCB, MAN, and MLP. The MCB leverages the Visual Genome QA pairs as additional training data and the 152-layer ResNet as a pretrained model. The MLP method uses (image, question, answer) triplets to score answer choices. For CLEVR, we compare our models with several baselines proposed by Johnson et al. (2017a) as well as the state-of-the-art models RAMEN, PROGRAM-GEN, and NS-VQA. N2NMN learns to predict a layout based on the question and composes a network using a set of neural modules. The CNN+LSTM+RN learns to infer a relation using a neural network model called Relation Networks. PROGRAM-GEN exploits supervision from the functional programs that are used to generate CLEVR questions.
Ablation Study. We implement several lesion architectures. The MN+ResNet model does not use any GNs and is designed to evaluate the effect of using a GN. This model is similar to MN (Sukhbaatar et al., 2015). It applies soft attention over 14 × 14 ResNet feature maps (the last 14 × 14 pooling layer) and generates a representation from the attended features using non-linear layers. The attention weight $\alpha_i$ is computed as $\alpha_i = \mathrm{softmax}(w\, h([x_i, q]))$, where $w$ is a learned parameter vector and $h$ is a non-linear layer. Then, an answer is predicted as described before.
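The soft attention weights of the MN+ResNet baseline can be sketched as follows. The function name is hypothetical, and the non-linear layer `h` is passed in as a callable because its exact form is not given in the text.

```python
import math


def soft_attention(X, q, w, h):
    """Attention weights: softmax over i of w . h([x_i, q]).

    X: list of region feature vectors; q: question vector;
    w: parameter vector; h: callable mapping the concatenation [x_i, q] to a vector.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h(list(x) + list(q)))) for x in X]
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The weights are positive and sum to one, so they can be used directly to form a weighted combination of the 14 × 14 region features.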
The N-GMN model only uses the visual GN (no textual GN nor spatial memory). This model evaluates the effect of incorporating RGCs. After two iterations, the global feature vector of the visual GN is used as $u^{\bullet}$ to generate an answer. The N-GMN+ model only uses the visual GN and the external spatial memory components (no textual GN). This model is used for the CLEVR dataset since CLEVR images do not have rich textual information. The MN-GMN− model does not use the external spatial memory. After two iterations, the global feature vectors of the visual and textual GNs are concatenated and fed into a non-linear layer to generate $u^{\bullet}$. Finally, MN-GMN is our full model.

Results and Discussion.
Our experimental results on the VQA-v2.0 dataset are reported in Table 1. For LXRT, MSM@MSRA, and MIL@HDU, the numbers are reported from the VQA Challenge 2019 Leaderboard (using an ensemble of models). Across all question types, N-GMN outperforms MN+ResNet. This shows that applying the visual GN with explicit object bounding-boxes provides a usefully richer representation than a grid of fixed visual features. MN-GMN− outperforms N-GMN. This shows that RGCs help to improve accuracy. RGCs are especially useful for answering the Other and Yes/No question types. Our full model MN-GMN outperforms MN-GMN−. This shows that applying external spatial memory is effective, especially for Number questions. The full model's accuracy is higher than the baselines.
Our results on Visual7W are reported in Table 2. Our N-GMN, MN-GMN−, and MN-GMN outperform the baselines MLP, MAN, and MCB+ATT. The results for our N-GMN+ on CLEVR in Table 3 are competitive with the state-of-the-art RAMEN, PROGRAM-GEN, and NS-VQA. We emphasize that, unlike PROGRAM-GEN, our algorithm does not exploit supervision from functional programming. Also, unlike NS-VQA, our model is not tailored to synthetic datasets only, since it performs well on both natural and artificial datasets that need multi-step compositional reasoning.
Figure 4 illustrates the visualization of the attention weights with MN-GMN to answer a Number question. We compute the attention weights that are used to obtain $\bar{v}$ for each spatial memory cell. More precisely, the magnitude of the sigmoid output that implements $\rho^{v \to u}$ for the spatial memory is visualized. Each attention weight shows the importance of a fixed region in a 14 × 14 grid of cells to the question. Figure 5 shows a VQA example on the CLEVR dataset. Appendix B provides more examples.

Conclusions
Multi-modal Neural Graph Memory Networks are a new architecture for the VQA task. The MN-GMN represents bimodal local features as node attributes in a graph. It leverages a graph neural network model, Graph Network, to reason about objects and their interactions in a scene. In experiments on three datasets, the MN-GMN showed superior quantitative and qualitative performance compared to the lesion approaches and rivals the state-of-the-art models. A future research direction is to combine RGCs with distant supervision by an external knowledge base to answer the visual questions that need external knowledge; for example, Which animal in this photo can climb a tree?

A RGC Encoding
We obtain a fixed-length representation for the output of the parser, denoted by $\tilde{c} \in \mathbb{R}^{14D}$, by allocating the embeddings of 14 words: 6 words for up to two relationship triplets and 8 words for up to 4 objects and their attributes. For example, for the caption orange and white cat laying on a wooden bench, the fixed-length representation is the concatenation of the embedding of each word in the sequence ≺cat-lay on-bench, x-x-x, bench-wooden, cat-orange, cat-white, x-x≻, where x is a special token that represents an empty slot. To create the sequences, we use a fixed arbitrary order. Each RGC also has a bounding-box specified by its coordinates $\tilde{r} = (\tilde{r}_x, \tilde{r}_y, \tilde{r}'_x, \tilde{r}'_y)$, where $(\tilde{r}_x, \tilde{r}_y)$ and $(\tilde{r}'_x, \tilde{r}'_y)$ are the top-left and bottom-right corners of the bounding-box, normalized to values between 0 and 1 based on the height and width of the image. For each RGC, we project the concatenation of $\tilde{r}$, $c$, and $\tilde{c}$ to a space of dimensionality $D$ using a densely-connected layer with ReLU activation function to obtain a vector representation denoted by $\tilde{x} \in \mathbb{R}^{D}$.
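The 14-slot layout described above (two relationship triplets followed by four object-attribute pairs, padded with the empty token) can be illustrated with a small helper. The function name and the truncation of extra items are assumptions made for this sketch; the parser itself is not reproduced.

```python
def rgc_slots(triplets, objects, pad="x"):
    """Flatten parser output into 14 word slots: 2 triplets, then 4 object-attribute pairs.

    triplets: list of (subject, relation, object) tuples.
    objects: list of (object, attribute) tuples.
    Missing entries are filled with the pad token; extra entries are dropped.
    """
    trip = list(triplets)[:2] + [(pad, pad, pad)] * max(0, 2 - len(triplets))
    objs = list(objects)[:4] + [(pad, pad)] * max(0, 4 - len(objects))
    return [word for t in trip for word in t] + [word for o in objs for word in o]
```

Each of the 14 slots is then replaced by its $D$-dimensional word embedding, which gives the $14D$-dimensional vector.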
B More Examples for VQA Task
Figure 6 shows a VQA example on the CLEVR dataset. Figure 7 shows how MN-GMN can answer a question correctly by incorporating the region-grounded captions, whereas N-GMN gives the wrong answer. Figure 8 illustrates the visualization of the attention weights with MN-GMN to answer a Number question. For this example, we compute the attention weights that are used to obtain $\bar{v}$ for each spatial memory cell. More precisely, the magnitude of the sigmoid output that implements $\rho^{v \to u}$ for the external spatial memory is visualized. Each attention weight shows the importance of a fixed region in a 14 × 14 grid of memory cells to the question.