Towards Knowledge-Augmented Visual Question Answering

Visual Question Answering (VQA) remains algorithmically challenging even though it is effortless for humans, who combine visual observations with general and commonsense knowledge to answer questions about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our graph-based approach combines scene graphs with concept graphs and learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to assign higher importance to the knowledge instances that are most relevant to each question. We exploit ConceptNet as the source of general knowledge and evaluate the performance of our model on the challenging OK-VQA dataset.


Introduction
The task of Visual Question Answering (VQA) (Antol et al., 2015) was introduced to bridge the gap between natural language processing and image understanding applications. Most VQA methods focus on the visual aspect of the task and predict the answer by combining the question and image representations. However, such visual-only approaches are practical only when no insight beyond the visual content is required.
A complementary line of work incorporates external knowledge into VQA models, combining visual observations with facts drawn from outside the image (Garderes et al., 2020). Organizing external knowledge in structured databases, known as Knowledge Bases (KBs), has made them important resources for representing general knowledge. A typical KB can be represented as a graph, i.e., a collection of triples, also known as facts. A triple specifies that two entities (nodes) are connected by a particular relation (edge), e.g., (Shakespeare, writerOf, Hamlet), where Shakespeare and Hamlet are entities and writerOf is the relation between them. Examples of such graph structures are YAGO (Suchanek et al., 2007), DBpedia (Auer et al., 2007), NELL (Carlson et al., 2010), Freebase (Bollacker et al., 2008), and the Google Knowledge Graph (Steiner et al., 2012).
In recent years, a significant amount of research has been devoted to visual-based VQA, while knowledge-based VQA remains relatively underexplored. To address this challenge, recent work (Wang et al., 2017; Wang et al., 2018; Shah et al., 2019) introduced new datasets for evaluating VQA algorithms capable of answering questions that require higher-level knowledge. We use the OK-VQA dataset (Marino et al., 2019) for our experiments since it requires handling outside knowledge. In addition to understanding the question and the image, the model needs to learn what knowledge is necessary to answer OK-VQA questions, since no ground-truth facts are provided with the dataset.
In this work, our main motivation is to develop a technique that integrates general knowledge while leveraging the relations between objects in the visual scene and entities in the knowledge graph. Exploring OK-VQA questions, we found questions that need information from visual concepts and beyond. For instance, given the question "What vegetable is on the lowermost portion of the plate?" (cf. Fig. 1), the model needs to learn which objects are located on the lowermost plate as well as which of them is a vegetable in order to retrieve the correct answer "carrot". To capture this type of information, we need to represent the dynamics and interactions between different objects in an image and entities from a relevant external knowledge source. The outline of our proposed model is depicted in Fig. 1. As shown in the figure, we first compute embeddings for the question and the objects detected in the input image. We then construct a Scene graph, a relational representation of visual concepts, where the nodes are the detected objects (e.g., plate, carrot) and the edges are the relationships between objects (e.g., located). To incorporate general knowledge, we design a knowledge retrieval module based on sentence-level similarity scores and pass the retrieved knowledge entities to a Concept graph. We use Graph Attention Networks (GATs) (Velickovic et al., 2018) to assign higher weights to the objects and knowledge instances that are most relevant to each question. The outputs of these three steps (question embedding, Scene graph, and Concept graph) are fused into a joint language-vision-knowledge embedding and then fed to a classifier to predict the answer.
In summary, the main contributions of our work are: 1) a novel methodology to incorporate general knowledge into VQA models (Figure 1); unlike existing models, we avoid the step of query construction and do not use ground-truth facts, which makes it feasible to incorporate any knowledge resource into our model; 2) we use sentence-level embeddings rather than word-level embeddings to retrieve knowledge instances, which captures the semantic context of the questions and knowledge instances (Section 4.1); 3) we develop Concept graphs using GATs, which operate on neighboring nodes to attend to key knowledge instances (Section 4.2); and 4) we use both Scene graphs and Concept graphs to capture the relations between objects and entities.
Related Work

Scene Graph Representation for Visual Question Answering
One line of work proposes a model that builds graphs over the scene objects and over the question words. Each object in the scene corresponds to a node in a fully-connected scene graph, with each edge representing the relative position of the objects in the image; the model is evaluated on a clip-art dataset. Shi et al. (2019) propose a model using scene graphs to represent objects as nodes and the pairwise relationships as edges for both abstract scenes and real images; explicit relations between objects are not taken into account in their work. Li et al. (2019) propose a VQA model that encodes each image into a graph and represents explicit and implicit inter-object relations using a graph attention mechanism. The graph for learning implicit relations is fully-connected and includes relative geometry features, while the explicit relations contain visual relationships extracted from the Visual Genome dataset (Krishna et al., 2017). Norcliffe-Brown et al. (2018) propose a graph learner module using spatial graph convolutions (Monti et al., 2017). Their model learns a graph representation of the input image that is conditioned on the question and models the relevant interactions between objects in the scene.

Knowledge-based Visual Question Answering
Knowledge-based VQA is still relatively unexplored compared to visual-based VQA. Some methods (Wang et al., 2017; Wang et al., 2018) convert an input question into fixed templates to query an external KB and process the returned knowledge to form the final answer. Other models retrieve the facts most relevant to a question-answer pair based on word-level GloVe embeddings; however, they require ground-truth facts per question to classify the relation that a given question refers to. Marino et al. (2019) propose ArticleNet for OK-VQA questions. ArticleNet retrieves articles from Wikipedia for each question-image pair and then predicts whether and where the ground-truth answers appear in the article. While this method handles external knowledge resources, it requires an expensive hand-crafted process to extract Wikipedia articles, including collecting possible search queries for each question-image pair, using the Wikipedia search API to get the top retrieved article for each query, and extracting a subset of each article that is most relevant for the query.

Problem Formulation
We define the knowledge-based VQA task as follows: given a question q ∈ Q grounded in an image I ∈ I and a knowledge base G, the goal is to predict a meaningful answer a ∈ A. Let Θ be the parameters of the model p to be trained. The predicted answer â of our model is:

â = argmax_{a ∈ A} p(a | q, I, G; Θ).

In order to retrieve the correct answer, we aim to learn a joint representation z ∈ R^{d_z} of q, I, and G such that:

a* = argmax_{a ∈ A} p(a | z; Θ),

where a* is the ground-truth answer. The hyperparameter d_z is the dimension of the joint space z and is selected based on a trade-off between the capacity of the representation and the computational cost.

Our Approach
Our proposed approach, outlined in Fig. 1, consists of Image Representation, Question Representation, Knowledge Retrieval, Graph Construction, and Multimodal Fusion modules. For the Image Representation, we use pre-trained Faster R-CNN features (Anderson et al., 2017), where each object i is associated with a visual feature vector v_i ∈ R^{d_v} and bounding-box coordinates.
For the Question Representation, we use a bidirectional RNN (GRU) and perform self-attention on the sequence of RNN hidden states to generate question embeddings q ∈ R^{d_q}. The following subsections explain the Knowledge Retrieval, Graph Construction, and Multimodal Fusion modules in detail.
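As a concrete sketch of this pooling step, the self-attention can be seen as a softmax-weighted sum over the RNN hidden states. The snippet below is a minimal illustration with toy numbers; in the actual model the attention scores come from a learned scoring layer over the GRU states, and the function names are ours, not the paper's:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_pool(hidden_states, scores):
    # softmax-weighted sum of per-token hidden states -> one question vector q
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]

# toy example: 3 token states of dimension 2, middle token scored highest
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = attend_pool(states, scores=[0.1, 2.0, 0.1])
```

The pooled vector is dominated by the highest-scoring token, which is the behavior the attention layer learns to exploit for question words that matter most.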

Knowledge Retrieval
Given a question-image pair (q, I), the knowledge retrieval module outputs a set of knowledge entities E and an adjacency matrix A that captures the relations between the entities. These outputs are obtained in four steps: i) generating a question-image instance; ii) generating knowledge instances; iii) instance embedding; and iv) knowledge instance ranking. We discuss each of these steps in the following.
Generating question-image instance: We first obtain a question-image instance x by concatenating the tokens of the question q = {w_i | i = 1, ..., n_t} and the object labels L = {l_i | i = 1, ..., n_v} detected in the image, where n_t is the number of tokens and n_v is the number of detected objects. For instance, given the question "What vegetable is on the lowermost portion of the plate?" and the object labels "spoon, carrot, meat, plate" detected from the given image, we obtain x = "what vegetable is on the lowermost portion of the plate spoon carrot meat plate".
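For illustration, this concatenation step can be sketched as follows (a simplified version of the pre-processing; tokenization details such as punctuation handling are our assumptions):

```python
def build_instance(question, object_labels):
    # lowercase the question, drop the trailing question mark,
    # and append the detected object labels
    tokens = question.lower().rstrip("?").split()
    return " ".join(tokens + object_labels)

x = build_instance("What vegetable is on the lowermost portion of the plate?",
                   ["spoon", "carrot", "meat", "plate"])
# x == "what vegetable is on the lowermost portion of the plate spoon carrot meat plate"
```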
Generating knowledge instances: We use ConceptNet (Li et al., 2016) as the source of general knowledge, which consists of a set of facts denoted as G = {f_i | i = 1, ..., n_kb}, where n_kb is the number of facts. In ConceptNet, a fact f_i is represented as a triple of the form f_i = (r, h, t), where r is a relation between two entities, h is a head entity, and t is a tail entity. We select the 20 most frequent of the 34 relations in ConceptNet, i.e., r ∈ R = {RelatedTo, FormOf, IsA, PartOf, HasA, UsedFor, CapableOf, AtLocation, Causes, HasSubevent, HasFirstSubevent, HasProperty, HasPrerequisite, MotivatedByGoal, DerivedFrom, DefinedAs, SimilarTo, CausesDesire, MadeOf, Desires}.
We pre-process each triple and convert it to a short phrase to create knowledge instances y. For example, the triple (IsA, carrot, orange vegetable) is converted to "carrot is orange vegetable" after the pre-processing step.
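A minimal sketch of this conversion is shown below; the relation-to-phrase mapping covers only a few relations, and the exact surface forms are our assumption:

```python
# hypothetical mapping from ConceptNet relation names to surface phrases
REL_PHRASES = {
    "IsA": "is",
    "UsedFor": "used for",
    "AtLocation": "at location",
    "MadeOf": "made of",
    "CapableOf": "capable of",
}

def triple_to_phrase(triple):
    # (relation, head, tail) -> "head <relation phrase> tail"
    r, h, t = triple
    return f"{h} {REL_PHRASES.get(r, r.lower())} {t}"

phrase = triple_to_phrase(("IsA", "carrot", "orange vegetable"))
# phrase == "carrot is orange vegetable"
```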
Instance embedding: We propose to use pre-trained language representation (LR) models to embed the question-image instance x and the knowledge instances y generated in the previous steps. While pre-trained LR models such as BERT (Devlin et al., 2018) are an emerging direction, there is little work on fusing them with external knowledge in VQA tasks. We are particularly interested in LR models trained for sentence similarity tasks, since we use the generated embedding vectors to compute a sentence-level similarity between the question-image instance and each knowledge instance.
We use two state-of-the-art sentence embedding models, i.e., the Universal Sentence Encoder (USE) (Cer et al., 2018) and Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), to generate the question-image representation x ∈ R^{d_s} and the knowledge instance representations y ∈ R^{n_kb × d_s}. USE provides two models for producing sentence embeddings; we use the transformer-based model since it works best in our setting. Transformer-based USE uses an attention mechanism to compute context-aware representations of words in a sentence that take into account both the ordering and the identity of all the other words. SBERT, on the other hand, is a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings.
Knowledge instance ranking: We obtain a similarity score per knowledge instance by computing the cosine similarity between the question-image representation x and each knowledge instance representation y. We rank the knowledge instances by their similarity scores and retrieve the top-k knowledge instances and the corresponding set of k facts, F_k, for each question.
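The ranking step reduces to cosine similarity plus a top-k selection. A pure-Python sketch (the embedding vectors here are toy stand-ins for the USE/SBERT outputs):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_facts(x_emb, y_embs, facts, k):
    # rank facts by cosine similarity of their instance embedding to x_emb
    order = sorted(range(len(facts)),
                   key=lambda i: cosine(x_emb, y_embs[i]), reverse=True)
    return [facts[i] for i in order[:k]]

facts = ["carrot is orange vegetable", "sky is blue", "carrot at location plate"]
y_embs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
best = top_k_facts([1.0, 0.2], y_embs, facts, k=2)
# best == ["carrot is orange vegetable", "carrot at location plate"]
```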
For every question, we extract the unique entities E = {e_i | i = 1, ..., 2k} from the facts F_k (at most 2 unique entities per fact) and construct an adjacency matrix A that encodes the relation between the entities in each fact. The size of A is 2k × 2k. The value of an entry A_ij is between 0 and 20 depending on the type of the relation r between entity i and entity j (0 if the two entities do not belong to the same fact).
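Under these definitions, the entity set and adjacency matrix can be built as in the sketch below (a three-relation subset stands in for the 20 relations; entries hold relation indices, with 0 meaning no shared fact):

```python
RELATIONS = ["RelatedTo", "IsA", "AtLocation"]  # subset; the full model uses 20

def build_graph(top_facts):
    # top_facts: list of (relation, head, tail) triples retrieved for a question
    entities = []
    for _, h, t in top_facts:
        for e in (h, t):
            if e not in entities:
                entities.append(e)
    n = len(entities)
    A = [[0] * n for _ in range(n)]
    for r, h, t in top_facts:
        i, j = entities.index(h), entities.index(t)
        A[i][j] = A[j][i] = RELATIONS.index(r) + 1  # relation id in 1..|R|
    return entities, A

entities, A = build_graph([("IsA", "carrot", "vegetable"),
                           ("AtLocation", "carrot", "plate")])
```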

Graph Construction
In order to capture both inter-object and knowledge entity relations, we construct a scene graph using objects detected in the image and their relations, as well as a concept graph that represents the explicit relations between entities. The details of the scene and concept graph construction are as follows.
Scene graph construction: We first construct a scene graph G_s, where nodes are the set of objects detected in the image and edges encode the relations between objects. Given two object regions, the goal of scene graph construction is to determine which relation exists between them. We use pre-trained scene graphs covering 14 explicit relations from the Visual Genome dataset (Krishna et al., 2017), with an additional no-relation class. An illustration of the scene graph is shown in Fig. 2(a).
Inspired by (Krishna et al., 2017), we use an attention mechanism to inject information from the question into the scene graph. We first concatenate the question embedding q with each of the visual features v_i to generate node embeddings. Self-attention is then performed on each node to obtain attended visual features based on the relations between a target object and its neighboring objects. The attended visual features are added to the original visual features v_i to serve as the final visual features V.

Concept graph construction: Our proposed model incorporates general knowledge into the VQA model by integrating the explicit relations between knowledge entities. For questions that need outside knowledge, different knowledge instances carry different weights. Therefore, instead of treating all entities equally, we use a question-conditioned attention mechanism that dynamically assigns higher weights to the knowledge instances most relevant to each question.
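Both graphs form question-conditioned node embeddings by concatenating the question embedding onto each node's feature vector. A toy sketch of this step (in the model, the visual features are 2048-d Faster R-CNN vectors and q is the 1024-d GRU question embedding; the tiny vectors here are placeholders):

```python
def question_conditioned_nodes(node_feats, q_emb):
    # concatenate the question embedding onto each node's feature vector
    return [v + q_emb for v in node_feats]

nodes = question_conditioned_nodes([[0.1, 0.2], [0.3, 0.4]], [0.9])
# each node now carries both its own features and the question information
```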
Considering each entity e i as one node, we construct a concept graph G e where edges are extracted from the adjacency matrix A obtained by the knowledge retrieval module (Section 4.1). An illustration of the concept graph is shown in Fig. 2(b).
We pass the entities e_i through a GRU model to generate entity embeddings e_i ∈ R^{d_e} (d_e = 1024 in our experiments). To obtain a question-conditioned attention mechanism, we first concatenate each entity embedding e_i with the question embedding q to generate node embeddings:

n_i = [e_i; q].

GATs are used to assign different weights to the N neighbours of a target node. The graph attention mechanism is employed by applying multi-head attention, where each head computes:

e*_i = φ( Σ_{j ∈ N_i} α_ij · C n_j ),

where φ(.) is a nonlinear activation function such as ReLU. The attention weight α_ij depends on the node embeddings as well as the relation between entities i and j:

α_ij = softmax_j( α^e_ij ),   α^e_ij = ( (B n_i + b_{rel(i,j)})ᵀ (D n_j + d_{rel(i,j)}) ) / √d_h,

where B, C, D ∈ R^{d_h × (d_e + d_q)} are projection matrices, b, d are bias terms, and rel(i, j) represents the label of each edge. α^e_ij measures the similarity between the entity features, computed by scaled dot-product, and is sensitive to the relation between entities.
After encoding all the entities via the above graph attention mechanism, the outputs e*_i of the attention heads are concatenated and added to the original entity embeddings e_i to generate the final entity features E.
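To make the attention step concrete, the sketch below implements a single simplified attention head over the concept graph: scaled dot-product scores restricted to fact-sharing neighbours, a softmax, an aggregation, and a residual connection. The learned projections B, C, D and the relation-dependent bias terms are omitted for brevity, so this illustrates the mechanism rather than the exact model:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gat_head(nodes, adj):
    # nodes: node embeddings [e_i; q]; adj[i][j] > 0 marks an edge from a shared fact
    d = len(nodes[0])
    out = []
    for i, n_i in enumerate(nodes):
        nbrs = [j for j in range(len(nodes)) if adj[i][j] > 0 or j == i]
        # scaled dot-product attention scores over the neighbourhood
        scores = [sum(a * b for a, b in zip(n_i, nodes[j])) / math.sqrt(d)
                  for j in nbrs]
        alphas = softmax(scores)
        agg = [sum(a * nodes[j][k] for a, j in zip(alphas, nbrs)) for k in range(d)]
        # residual connection: add the aggregate back to the original embedding
        out.append([x + y for x, y in zip(n_i, agg)])
    return out

updated = gat_head([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                   [[0, 1, 0], [1, 0, 2], [0, 2, 0]])
```

In the full model, multiple such heads run in parallel and their outputs are concatenated before the residual addition.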

Multimodal Fusion
We fuse the question embeddings, scene graphs, and concept graphs to create a joint language-vision-knowledge representation. The fusion method needs to detect high-level interactions between these three features to provide a meaningful answer, without erasing the lower-level interactions extracted in the previous steps.
Popular fusion methods such as BAN (Kim et al., 2018) or MUTAN (Ben-younes et al., 2017) are not suitable for our work since we have three types of features to fuse. Therefore, we design a fusion method by applying the Compact Trilinear Interaction (CTI) (Do et al., 2019) to the question embeddings, scene graph visual features, and concept features, generating a single vector that jointly represents the three.
Given V ∈ R^{n_v × d_v}, Q ∈ R^{n_h × d_q}, where n_h is the number of hidden states (n_h = 1 in our experiments, i.e., Q = q), and E ∈ R^{2k × d_e} generated from the previous steps, we generate a joint representation z ∈ R^{d_z}. The joint representation is computed by applying CTI to (V, Q, E):

z = Σ_{i=1}^{n_v} Σ_{j=1}^{n_h} Σ_{l=1}^{2k} M_{ijl} ( V_i W_{zv} ∘ Q_j W_{zq} ∘ E_l W_{ze} ),

where M ∈ R^{n_v × n_h × 2k} is an attention map:

M = Σ_{r=1}^{R} G_r ×_1 (V W_{vr}) ×_2 (Q W_{qr}) ×_3 (E W_{er}),

where W_{zv}, W_{zq}, W_{ze}, W_{vr}, W_{qr}, W_{er} are learnable factor matrices, ∘ is the Hadamard product, and ×_n denotes the n-mode tensor product. R is a slicing parameter establishing a trade-off between the decomposition rate and the performance, and G_r ∈ R^{d_{q_r} × d_{v_r} × d_{e_r}} is a learnable Tucker tensor.
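As a shape-level sketch of the trilinear interaction (with the attention map M given directly rather than computed from the low-rank Tucker factors, and with V, Q, E assumed to be already projected into a common dimension), the joint vector can be written as:

```python
def trilinear_fusion(V, Q, E, M):
    # z[t] = sum over (i, j, l) of M[i][j][l] * V[i][t] * Q[j][t] * E[l][t]
    # assumes V, Q, E are already projected into a common dimension
    d = len(V[0])
    z = [0.0] * d
    for i, v_i in enumerate(V):
        for j, q_j in enumerate(Q):
            for l, e_l in enumerate(E):
                m = M[i][j][l]
                for t in range(d):
                    z[t] += m * v_i[t] * q_j[t] * e_l[t]
    return z

# toy: 2 objects, 1 question state, 2 entities, common dimension 2
V = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 1.0]]
E = [[0.5, 0.5], [1.0, 0.0]]
M = [[[0.5, 0.5]], [[0.5, 0.5]]]
z = trilinear_fusion(V, Q, E, M)
```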
The joint embedding yields a more efficient and more compact representation than simply concatenating the embeddings. In addition, it avoids the dimensionality issue faced when concatenating large matrices. The output of the fusion model is then fed to a classifier to predict the answer.

Experiments
We evaluate the performance of our proposed model using the standard evaluation metric recommended in the VQA challenge (Agrawal et al., 2017):

Acc(ans) = min( #{humans that provided ans} / 3, 1 ).

All experiments are performed on the OK-VQA dataset. OK-VQA is composed of 14,031 images and 14,055 questions, divided into ten knowledge categories: Vehicles and Transportation (VT); Brands, Companies and Products (BCP); Objects, Materials and Clothing (OMC); Sports and Recreation (SR); Cooking and Food (CF); Geography, History, Language and Culture (GHLC); People and Everyday Life (PEL); Plants and Animals (PA); Science and Technology (ST); and Weather and Climate (WC). A question classified as belonging to different categories by different annotators is assigned to an additional "Other" category.
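This metric can be implemented directly:

```python
def vqa_accuracy(predicted, human_answers):
    # Acc(ans) = min(#humans that provided ans / 3, 1)
    return min(human_answers.count(predicted) / 3.0, 1.0)

# an answer given by 4 of 10 annotators scores 1.0;
# an answer given by a single annotator scores 1/3
full = vqa_accuracy("carrot", ["carrot"] * 4 + ["onion"] * 6)
partial = vqa_accuracy("onion", ["carrot"] * 4 + ["onion"])
```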
Implementation details: Each image has a total of 36 image region features (n_v = 36), each represented by a bounding box and an embedding vector computed by pre-trained Faster R-CNN features with d_v = 2048. The input questions are embedded using a GRU with d_q = 1024 and n_t = 14. We use USE and SBERT to embed instances; USE outputs a 512-dimensional vector (d_s = 512), whereas d_s = 768 for SBERT. Our GATs use 16 attention heads. In the fusion model, R = 32 as suggested in (Do et al., 2019), and d_z = 512 since it leads to the best results in our model.
To train our proposed model, we use a binary cross-entropy loss with a batch size of 64 over a maximum of 20 epochs on 8 Tesla GPUs. We use the Adamax optimizer with an initial learning rate of 1e-3 and a linear-decay learning rate schedule with warm-up. The details of the different experimental setups and results are provided in the following subsections.

Table 2: Performance on the OK-VQA validation set for different graph setups.
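The learning-rate schedule can be sketched as a linear warm-up followed by linear decay; the warm-up length below is an assumption, as the paper does not specify it:

```python
def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=1000):
    # linear warm-up to base_lr, then linear decay to 0
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - frac)
```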

Knowledge Retrieval Experiments
Our knowledge retrieval module uses sentence-level embeddings to represent question-image and knowledge instances. For this purpose, we use two different methods based on popular Transformer and BERT networks.
We first passed each question-image and knowledge instance through a bert-as-service model (Xiao, 2018) and derived a fixed-size vector from the output of the special CLS token. This technique yielded rather poor performance; our hypothesis is that, since the BERT model is not trained for sentence-level similarity tasks, its embeddings are not well suited to our retrieval module. To bypass this limitation, we use SBERT, a modification of the pre-trained BERT network that generates semantically meaningful sentence embeddings. We tested SBERT with different backbones, e.g., BERT-base with mean-tokens pooling and with CLS-token pooling, BERT-large with mean-tokens pooling and with CLS-token pooling, RoBERTa-base with mean-tokens pooling, and RoBERTa-large with mean-tokens pooling. The best results were achieved by the BERT-base model with mean-tokens pooling. As the second approach, we compute the sentence embeddings using the USE-large model, which is trained with a Transformer encoder.
We then compute similarity scores as the cosine similarity between a question-image instance embedding and the knowledge instance embeddings. The top-k knowledge instances with the highest similarity scores are retrieved as the knowledge most relevant to each question. The accuracy of the VQA task using USE and SBERT for different values of k is reported in Table 1.
As indicated in Table 1, we observe higher accuracy with USE. We choose k = 30 and USE to retrieve knowledge instances for the rest of the experiments, as this setting gives the best accuracy.

Graph Construction Experiments
In this subsection, we evaluate the advantage of capturing inter-object relations as well as incorporating general knowledge. Table 2 shows the results of this experiment, with the setups explained below.

BAN: In this setup, we fuse the question embedding q with the object features V using Bilinear Attention Networks (BAN). The output of the fusion network is then fed to the classifier to predict the answer. We found that setting γ (the number of glimpses) to 2 in the BAN model yields the best performance. Neither scene graphs nor concept graphs are used in this experiment.
G_s + BAN: The same fusion network as BAN, but the visual features are the V produced by the scene graph. The question embedding q is fused with V and passed to the classifier to predict the answer.

G_s + G_e: This is the complete form of our model. q, V, and E are fused in this setup, using CTI as the fusion technique as explained in Section 4.3.
Table 2 shows that a consistent performance gain is obtained across most categories by combining the scene graph and the concept graph. Adding the scene graph alone improves the results in the BCP, SR, PEL, PA, ST, and Other categories. Combining the scene graph and concept graph boosts the performance on the remaining categories except "Geography, History, Language and Culture" (GHLC) and "Weather and Climate" (WC).

Table 3: Performance on the OK-VQA validation set for the ablation study and comparison with SOTA.

Ablation Study and Comparison with SOTA
In Table 3, we compare two ablated instances of our model (Q-Only and Q + G_e) with its complete form. We also report the accuracy of the state-of-the-art baselines on the OK-VQA dataset. ArticleNet (AN) (Marino et al., 2019) is a knowledge-based approach that retrieves articles from Wikipedia. Moreover, we applied the XNM Net model (Shi et al., 2019) to the OK-VQA dataset and provide the results in the table. Table 3 shows the accuracy on the OK-VQA validation set in the following settings.

Q-Only: Only the question embedding q is fed to the classifier.

Q + G_e: General knowledge is integrated without using visual features in the pipeline. The question embedding q and entity embedding E are fused using BAN and fed to the classifier. G_s is removed from the model in this experiment.
MUTAN + AN: SOTA; incorporates hidden states of ArticleNet for the top retrieved sentences into MUTAN.
BAN + AN: SOTA; incorporates hidden states of ArticleNet for the top retrieved sentences into BAN.

XNM Net: SOTA; a scene-graph-based VQA model. The explicit relationships between objects are not considered in this work; node embeddings are concatenated and used as edge features.
Our model: The complete form of our model. Question embedding q, scene graph embedding V , and concept graph embedding E are fused using CTI and fed to the classifier.
From the table, we observe that adding general knowledge to the model leads to a gain of 5.35% in overall performance (Q-Only = 15.08% vs. Q + G_e = 20.43%).
We also note that the G_s + BAN setup (cf. Table 2) achieves better performance than Q + G_e. The reason is that most of the questions in the OK-VQA dataset are related to objects found in the images; therefore, the accuracy drops when the visual features are not provided.
Furthermore, we observe that our model surpasses the SOTA models in most of the categories. Our model performs especially well in the SR, ST, and Other categories, with a gain larger than 3%. Our performance is slightly below ArticleNet (MUTAN + AN) in OMC, PEL, PA, and WC. However, unlike ArticleNet, our model does not require query construction and search APIs to retrieve relevant knowledge. In Fig. 3, the question, predicted answer, and human-provided answers are denoted by Q, A, and GT, respectively; the number in parentheses in GT shows how many annotators provided that answer. For example, in the first example of the first row, 4 annotators provided the answer "wool".

Qualitative Results
The qualitative results in Fig. 3 show that incorporating external knowledge can improve the relational representation between entities in the question and objects in the image. For instance, the third example in the first row asks "What kind of animal this resemble?". To answer this question, one needs to know that the question refers to the airplane (visual information) and that airplanes are designed to imitate birds (general knowledge).
The last row in Fig. 3 shows failure cases. In the first example, our model fails because the retrieved knowledge instances are not relevant to the question. In the second example, relevant knowledge instances are retrieved, but the model is not able to predict the correct answer. The answer predicted in the fourth example does not belong to the list of the provided answers; however, the predicted answer could be correct. This example shows that we need a better evaluation metric for VQA tasks, one that covers semantic matches such as this example.

Figure 3: Qualitative results on the OK-VQA validation set. The first two rows show success cases and the last row shows failure cases. For each example, the question (Q), predicted answer (A), ground-truth answers (GT), and top-3 retrieved knowledge instances are shown.

Conclusion
In this paper, we proposed a novel VQA model for questions that require knowledge from external content. We developed a knowledge retrieval module to extract the facts most relevant to each question based on sentence-level embeddings. We then combined visual observations with the retrieved knowledge by learning graphs that capture the interactions between objects and knowledge entities. The experimental results demonstrate the performance of our proposed model on the OK-VQA dataset.
For future work, we will explore how to integrate the fact retrieval module into the main VQA pipeline to obtain an end-to-end trainable model. We will also investigate how to capture the semantic similarity between provided and predicted answers in the evaluation metric.