ConceptBert: Concept-Aware Representation for Visual Question Answering

Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. A VQA model combines visual and textual features in order to answer questions grounded in an image. Current works in VQA focus on questions which are answerable by direct analysis of the question and image alone. We present a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content. Given an image and a question in natural language, ConceptBert requires visual elements of the image and a Knowledge Graph (KG) to infer the correct answer. We introduce a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture. We exploit ConceptNet KG for encoding the common sense knowledge and evaluate our methodology on the Outside Knowledge-VQA (OK-VQA) and VQA datasets.


Introduction
Visual Question Answering (VQA) was first introduced to bridge the gap between natural language processing and image understanding applications in the joint space of vision and language (Malinowski and Fritz, 2014).
Most VQA benchmarks compute a question representation using word embedding techniques and Recurrent Neural Networks (RNNs), together with a set of object descriptors comprising bounding-box coordinates and image feature vectors. The word and image representations are then fused and fed to a network to train a VQA model. However, these approaches are only practical when no knowledge beyond the visual content is required.
Incorporating external knowledge introduces several advantages. External knowledge and supporting facts can improve the relational representation between the objects detected in the image, or between entities in the question and objects in the image. They also provide information on how the answer can be derived from the question. Therefore, the complexity of the questions can grow with the supporting knowledge base.
By organizing the world's facts and storing them in a structured database, large-scale Knowledge Bases (KBs) have become important resources for representing external knowledge. A typical KB consists of a collection of subject-predicate-object triplets, also known as facts. A KB in this form is often called a Knowledge Graph (KG) (Bollacker et al.) due to its graphical representation: the entities are nodes and the relations are the directed edges that link them. Each triple specifies that two entities are connected by a particular relation, e.g., (Shakespeare, writerOf, Hamlet).
A VQA system that exploits KGs is an emerging research topic and is not yet well studied. Recent research has started integrating knowledge-based methods into VQA models (Wang et al., 2017, 2016; Zhu et al., 2015; Marino et al., 2019). These methods incorporate external knowledge through two approaches: i) they exploit a set of associated facts provided for each question in VQA datasets, or ii) they collect possible search queries for each question-image pair and use a search API to retrieve the answers (Wang et al., 2017, 2016; Zhu et al., 2015; Marino et al., 2019). We go one step further and implement an end-to-end VQA model that is fully trainable. Our model requires neither knowledge annotations in VQA datasets nor search queries. Most recent works are still based on context-free word embeddings rather than pre-trained language representation (LR) models. While pre-trained LR models such as BERT (Devlin et al., 2018) are an emerging direction, there is little work on fusing them with KG and image representations in VQA tasks. Liu et al. propose a knowledge-based language representation and use BERT as the token embedding method. However, this model is also query-based: it collects the entity names involved in questions, queries their corresponding triples from the KG, and injects the retrieved entities into the questions.
In this paper, we introduce a model which jointly learns from visual, language, and KG embeddings and captures image-question-knowledge specific interactions. The pipeline of our approach is shown in Figure 1. We compute a set of object, question, and KG embeddings. The embedded inputs are then passed through two main modules: i) the vision-language representation, and ii) the concept-language representation. The vision-language representation module jointly enhances both the image and question embeddings, each improving its context representation with the other one. The concept-language representation uses a KG embedding to incorporate relevant external information into the question embedding. The outputs of these two modules are then aggregated into concept-vision-language embeddings and fed to a classifier to predict the answer.
Our model is different from the previous methods since we use pre-trained image and language features and fuse them with KG embeddings to incorporate the external knowledge into the VQA task. Therefore, our model does not need additional knowledge annotations or search queries and reduces computational costs. Furthermore, our work represents an end-to-end pipeline that is fully trainable.
In summary, the main contributions of our work are:
1. A novel methodology to incorporate common sense knowledge into VQA models (Figure 1);
2. A concept-aware representation that uses knowledge graph embeddings in VQA models;
3. Novel multi-modal Concept-Vision-Language embeddings (Section 3.4).

Problem formulation
Given a question q ∈ Q grounded in an image I ∈ I and a knowledge graph G, the goal is to predict a meaningful answer a ∈ A. Let Θ be the parameters of the model p to be trained. The predicted answer â of our model is:

â = arg max_{a∈A} p(a | q, I, G; Θ)

In order to retrieve the correct answer, we aim to learn a joint representation z ∈ R^{d_z} of q, I, and G such that:

a* = arg max_{a∈A} p(a | z; Θ)

where a* is the ground-truth answer and d_z is a hyper-parameter that represents the dimension of the joint space z. d_z is selected based on a trade-off between the capability of the representation and the computational cost.
Our approach

Input representations
The input to our model, ConceptBert, consists of an image representation, a question representation, and a knowledge graph representation module (cf. the blue-dashed box in Figure 1) which are discussed in detail below.
Image representation: We use pre-trained Faster R-CNN features (Anderson et al., 2017) to extract a set of objects V = {v_i | i = 1, ..., n_v} per image, where each object v_i is associated with a visual feature vector v_i ∈ R^{d_v} and a bounding-box feature vector b_i ∈ R^{d_b}.

Question representation: Given a question consisting of n_T tokens, we use BERT embeddings (Devlin et al., 2018) to generate the question representation q ∈ R^{n_T × d_q}. BERT operates over sequences of discrete tokens consisting of vocabulary words and a small set of special tokens, i.e., SEP, CLS, and MASK. The representation of each token is the sum of a token-specific learned embedding and encodings for position and segment. Position refers to the token's index in the sequence, and segment indicates the index of the token's sentence when multiple sentences exist.
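The token/position/segment sum described above can be sketched as follows. This is a minimal illustration with tiny, randomly initialized lookup tables (real BERT-BASE learns them and uses d_q = 768); the sizes and token ids are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (BERT-BASE uses d_q = 768).
vocab_size, max_pos, n_segments, d_q = 100, 16, 2, 8

# Learned lookup tables (randomly initialized here).
tok_emb = rng.standard_normal((vocab_size, d_q))
pos_emb = rng.standard_normal((max_pos, d_q))
seg_emb = rng.standard_normal((n_segments, d_q))

def embed(token_ids, segment_ids):
    """Each token's representation is the sum of its token, position,
    and segment embeddings, as in BERT."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

# e.g. [CLS] w1 w2 [SEP], all belonging to the first (only) sentence.
q = embed([2, 17, 45, 3], [0, 0, 0, 0])
print(q.shape)  # (4, 8): one d_q-dimensional vector per token
```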
Knowledge graph representation: We use ConceptNet (Speer et al., 2016) as the source of common sense knowledge. ConceptNet is a multilingual knowledge base representing the words and phrases that people use and the common sense relationships between them. It is a knowledge graph built from several different sources, mostly Wiktionary, Open Mind Common Sense (Singh et al., 2002), and games with a purpose (Ahn et al.). It contains over 21 million edges and over 8 million nodes. In this work, we focus on the English vocabulary, which contains approximately 1.5 million nodes. To avoid the query-construction step and take full advantage of the large-scale KG, we exploit the ConceptNet embeddings proposed by Malaviya et al. (2020) and generate the KG representation k ∈ R^{n_T × d_k}.
This method uses Graph Convolutional Networks (Kipf and Welling, 2016) to incorporate information from the local neighborhood of a node in the graph. It includes an encoder and a decoder.
A graph convolutional encoder takes a graph as input, and encodes each node. The encoder operates by sending messages from a node to its neighbors, weighted by the relation type defined by the edge. This operation occurs in multiple layers, incorporating information multiple hops away from a node. The last layer's representation is used as the graph embedding of the node.
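The encoder's message passing can be sketched with a minimal, untyped-relation graph convolution (the embeddings of Malaviya et al. (2020) additionally weight messages by relation type, which this illustration omits). All shapes and weights here are toy values.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer.
    H: node features (n, d); A: adjacency with self-loops (n, n); W: (d, d')."""
    deg = A.sum(axis=1, keepdims=True)   # node degrees for normalization
    msgs = (A @ H) / deg                 # average messages from neighbors
    return np.maximum(msgs @ W, 0.0)     # learned projection + ReLU

rng = np.random.default_rng(0)
n, d = 5, 4
A = np.eye(n)                            # self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0   # a small chain of edges
H = rng.standard_normal((n, d))
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Stacking two layers incorporates information from two hops away;
# the last layer's output serves as the node's graph embedding.
emb = gcn_layer(gcn_layer(H, A, W1), A, W2)
print(emb.shape)  # (5, 4)
```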

Vision-Language representation
To learn joint representations of the language q and visual content V, we generate language-attended visual features V and vision-attended language features Q (cf. the orange box in Figure 1), inspired by the ViLBERT model (Lu et al., 2019).
Our vision-language module is mainly based on two parallel BERT-style streams, which operate over image regions and text segments (cf. Figure 2-a). Each stream is a succession of transformer blocks and co-attentional transformer layers that enable information exchange between the image and text modalities. These exchanges are restricted to specific layers, and the text features go through more processing than the visual features. The final image features thus incorporate high-level language information, and the final text features incorporate high-level visual information.
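The core of a co-attentional exchange can be sketched as follows: each stream's queries attend over the other stream's keys and values, so each modality is conditioned on the other. This is a single-head, untrained toy version; real ViLBERT layers use learned multi-head projections, residual connections, and feed-forward blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def co_attention(X, Y, d):
    """One co-attentional exchange between two streams:
    X's queries attend over Y's keys/values, and vice versa."""
    Wq, Wk, Wv = rng.standard_normal((3, d, d))   # toy (shared) projections
    X_new = softmax((X @ Wq) @ (Y @ Wk).T / np.sqrt(d)) @ (Y @ Wv)
    Y_new = softmax((Y @ Wq) @ (X @ Wk).T / np.sqrt(d)) @ (X @ Wv)
    return X_new, Y_new

d = 8
text = rng.standard_normal((16, d))    # 16 question tokens
vision = rng.standard_normal((36, d))  # 36 image regions
Q_out, V_out = co_attention(text, vision, d)
print(Q_out.shape, V_out.shape)  # (16, 8) (36, 8)
```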

Concept-Language representation
The vision-language module represents the interactions between the image and the question. However, this module alone is not able to answer questions that require insights that are neither in the image nor in the question. To this end, we propose the concept-language representation to produce language features conditioned on knowledge graph embeddings (cf. the red box in Figure 1). It performs knowledge-conditioned language attention in the concept stream (Figure 2-b). With this system, the model is able to incorporate common sense knowledge into the question and enhance question comprehension with the information found in the knowledge graph.
The entities in the knowledge graph have both contextual and relational information that we wish to integrate into the question embedding. For this purpose, we use an attentional transformer layer, which is a multi-layer bidirectional Transformer based on the encoder part of the original Transformer (Vaswani et al., 2017).
The concept-language module is a series of Transformer blocks. The input consists of "queries" from the question embeddings and "keys" and "values" from the KG embeddings. We use Multi-Head Attention with scaled dot-product. Therefore, we pack a set of q into a matrix Q_w, and k into matrices K_G and V_G.
Attention(Q_w, K_G, V_G) = softmax(Q_w K_G^T / √d_k) V_G   (3)

The output of the final Transformer block, G, is a new representation of the question, enhanced with common sense knowledge extracted from the knowledge graph. Figure 2-b shows an intermediate representation H_C.
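A toy version of this knowledge-conditioned multi-head attention is sketched below: queries come from the question, keys and values from the KG embeddings, with scaled dot-product attention computed per head. Head count and dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kg_conditioned_attention(Qw, Kg, Vg, n_heads=4):
    """Scaled dot-product attention with queries from the question (Qw)
    and keys/values from the KG embedding (Kg, Vg), split across heads."""
    n, d = Qw.shape
    dh = d // n_heads
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Qw[:, s] @ Kg[:, s].T / np.sqrt(dh)   # (n_T, n_T) attention
        heads.append(softmax(scores) @ Vg[:, s])        # attend over KG values
    return np.concatenate(heads, axis=1)                # re-assemble heads

n_T, d = 16, 8
Qw = rng.standard_normal((n_T, d))   # question token embeddings
Kg = rng.standard_normal((n_T, d))   # KG keys
Vg = rng.standard_normal((n_T, d))   # KG values
H_C = kg_conditioned_attention(Qw, Kg, Vg)
print(H_C.shape)  # (16, 8)
```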

Concept-Vision-Language embedding module
We aggregate the outputs of the three streams to create a joint concept-vision-language representation. The aggregator needs to detect high-level interactions between the three streams to provide a meaningful answer, without erasing the lower-level interactions extracted in the previous steps. We design the aggregator by applying the Compact Trilinear Interaction (CTI) (Do et al., 2019) to the question, concept, and image features, generating a vector that jointly represents the three features.
Given V ∈ R^{n_v × d_v}, Q ∈ R^{n_T × d_q}, and G ∈ R^{n_T × d_k}, we generate a joint representation z ∈ R^{d_z} of the three embeddings. The joint representation z is computed by applying CTI to the three inputs, where W_zv, W_zq, W_zg, W_vr, W_qr, W_gr are learnable factor matrices, ∘ is the Hadamard product, R is a slicing parameter establishing a trade-off between the decomposition rate and the performance, and G_r ∈ R^{d_{q_r} × d_{v_r} × d_{g_r}} is a learnable Tucker tensor.
The joint embedding yields a more efficient and more compact representation than simply concatenating the embeddings. It creates a joint representation of the three different embedding spaces in a single space. In addition, it avoids the dimensionality issues of concatenating large matrices.
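A simplified fusion in the low-rank spirit of CTI can be sketched as follows: each (pooled) modality is projected into R rank-slices of the joint space, the slices are combined with Hadamard products, and the results are summed. This is a toy surrogate, not the exact CTI decomposition of Do et al. (2019); mean-pooling and all dimensions are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def trilinear_fuse(V, Q, G, d_z, R=4):
    """Project each pooled modality into R rank-slices of the joint space,
    combine with Hadamard products, and sum over slices."""
    v, q, g = V.mean(axis=0), Q.mean(axis=0), G.mean(axis=0)  # pool regions/tokens
    Wv = rng.standard_normal((R, v.size, d_z))
    Wq = rng.standard_normal((R, q.size, d_z))
    Wg = rng.standard_normal((R, g.size, d_z))
    # Sum of elementwise (Hadamard) products across the R slices.
    return sum((v @ Wv[r]) * (q @ Wq[r]) * (g @ Wg[r]) for r in range(R))

V = rng.standard_normal((36, 8))   # image region features
Q = rng.standard_normal((16, 8))   # question token features
G = rng.standard_normal((16, 8))   # concept-enhanced question features
z = trilinear_fuse(V, Q, G, d_z=12)
print(z.shape)  # (12,)
```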
The output of the aggregator is a joint concept-vision-language representation, which is then fed to a classifier to predict the answer.

Experiments
We evaluate the performance of our proposed model using the standard evaluation metric recommended in the VQA challenge (Agrawal et al., 2017):

Acc(ans) = min(1, #{humans that provided ans} / 3)
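This metric can be computed directly; an answer counts as fully correct when at least 3 of the 10 annotators gave it, and is given partial credit otherwise. The normalization (lower-casing and stripping whitespace) is an assumption of this sketch.

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """VQA accuracy: min(1, #humans who gave this answer / 3)."""
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(1.0, counts[predicted.strip().lower()] / 3)

# 10 crowd-sourced answers for one question:
humans = ["umbrella"] * 6 + ["parasol"] * 2 + ["canopy"] * 2
print(vqa_accuracy("umbrella", humans))  # 1.0 (6 annotators >= 3)
print(vqa_accuracy("parasol", humans))   # ~0.667 (2 annotators)
```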

Datasets
All experiments have been performed on the VQA 2.0 (Goyal et al., 2016) and Outside Knowledge-VQA (OK-VQA) (Marino et al., 2019) datasets. VQA 2.0 is a public dataset containing about 1.1 million questions on 204,721 images drawn from the 265,016 images of the COCO dataset. At least 3 questions (5.4 on average) are provided per image, and each question is associated with 10 different answers obtained by crowdsourcing. Since VQA 2.0 is a large dataset, we only consider questions for which at least 9 of the answers are identical. This common practice casts aside questions with ambiguous answers. The questions are divided into three categories: Yes/No, Number, and Other. We are especially interested in the "Other" category, which can require external knowledge to find the correct answer.
OK-VQA: To evaluate the performance of our proposed model, we require questions which are not answerable by direct analysis of the objects detected in the image or the entities in the question. Most knowledge-based VQA datasets impose hard constraints on their questions, such as being generated by templates (KB-VQA (Wang et al., 2015)) or derived directly from existing knowledge bases (FVQA (Wang et al., 2016)). We select OK-VQA, which is the only VQA dataset that requires handling unstructured knowledge to answer natural questions about images.
The OK-VQA dataset is composed of 14,031 images and 14,055 questions. For each question, we select the unanimous answer as the ground-truth answer. OK-VQA is divided into eleven categories: Vehicles and Transportation (VT); Brands, Companies and Products (BCP); Objects, Materials and Clothing (OMC); Sports and Recreation (SR); Cooking and Food (CF); Geography, History, Language and Culture (GHLC); People and Everyday Life (PEL); Plants and Animals (PA); Science and Technology (ST); Weather and Climate (WC); and Other, used when a question was classified into different categories by different annotators.

Implementation details
In this section, we provide the implementation details of our proposed model in different building blocks.
Image embedding: Each image has a total of 36 image region features (n_v = 36), each represented by a bounding box and an embedding vector computed from pre-trained Faster R-CNN features, with d_v = 2048. Each bounding box is encoded as a 5-dimensional spatial feature (d_b = 5) comprising the coordinates of the top-left point of the bounding box, the coordinates of the bottom-right point of the bounding box, and the fraction of the image area covered.
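The 5-dimensional spatial feature can be computed as follows; normalizing the corner coordinates by the image size is an assumption of this sketch (the paper does not state whether coordinates are normalized).

```python
def box_features(x1, y1, x2, y2, img_w, img_h):
    """5-d spatial feature for a region: top-left corner, bottom-right
    corner (both normalized here), and fraction of image area covered."""
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac]

# A region covering the top-left quarter of a 100x100 image:
print(box_features(0, 0, 50, 50, 100, 100))  # [0.0, 0.0, 0.5, 0.5, 0.25]
```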
Question embedding: The input questions are embedded using the BERT-BASE model, so each word is represented by a 768-dimensional word embedding (d_q = 768). Each question is divided into 16-token blocks (n_T = 16), starting with a [CLS] token and ending with a [SEP] token. The answers are transformed into one-hot encoding vectors.

Vision-Language representation:
We initialize our vision-language representation with pre-trained ViLBERT features. The ViLBERT model is trained on the Conceptual Captions dataset (Sharma et al., 2018), a collection of 3.3 million image-caption pairs, to capture the diversity of visual content and learn interactions between images and text. Our vision-language module includes 6 layers of Transformer blocks with 8 and 12 attention heads in the visual and linguistic streams, respectively.

Concept-Language representation:
We train the concept stream of our ConceptBert from scratch. The module includes 6 layers of Transformer blocks with 12 attention heads.

Concept-Vision-Language embedding:
We have tested our concept-vision-language representation with d z = 512 and d z = 1024. The best results were reached using d z = 1024. Our hypothesis is that we can improve the capability of the module by increasing d z . However, it leads to an increase in the computational cost. We set R = 32 in Equation 5, the same value as in the CTI (Do et al., 2019) for the slicing parameter.
Classifier: We use a binary cross-entropy loss with a batch size of 1024 over a maximum of 20 epochs on 8 Tesla GPUs. We use the BertAdam optimizer with an initial learning rate of 4e-5, together with a linear-decay learning rate schedule with warm-up.
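The warm-up-then-linear-decay schedule can be sketched as a simple function of the training step; the warm-up fraction below is an assumed value, as the paper does not specify it.

```python
def warmup_linear_lr(step, total_steps, base_lr=4e-5, warmup_frac=0.1):
    """Linear warm-up to base_lr, then linear decay to zero, as commonly
    paired with BertAdam. warmup_frac is an illustrative assumption."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)   # ramp up
    # decay linearly from base_lr down to zero
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

total = 1000
print(warmup_linear_lr(0, total))     # 0.0 (start of warm-up)
print(warmup_linear_lr(100, total))   # 4e-05 (peak, end of warm-up)
print(warmup_linear_lr(1000, total))  # 0.0 (fully decayed)
```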

Experimental results
This sub-section provides experimental results on the VQA 2.0 and OK-VQA datasets.

Ablation Study: In Table 1, we compare three ablated instances of ConceptBert with its complete form. Specifically, we validate the importance of incorporating external knowledge into VQA pipelines on top of the vision and language embeddings. Table 1 reports the overall accuracy on the VQA 2.0 and OK-VQA validation sets in the following settings:

• L: Only the question features q are fed to the classifier.
• VL: The outputs of the Vision-Language representation module [V; Q] are concatenated and fed to the classifier.
• CL: Only the output of the Concept-Language representation module G is fed to the classifier.
• CVL: ConceptBert in its complete form; the outputs of both the Vision-Language and Concept-Language modules are fused (cf. Section 3.4) and fed to the classifier.

Comparing the L and CL instances shows the importance of incorporating external knowledge to accurately predict answers. Adding the KG embeddings to the model leads to gains of 11.56% and 7.19% on the VQA and OK-VQA datasets, respectively.
We also note that the VL model outperforms the CL model. The reason is that most of the questions in both the VQA 2.0 and OK-VQA datasets are related to objects found in the images, so accuracy drops when the detected object features are not provided. Compared to VL and CL, the CVL model gives the highest accuracy, which indicates the effectiveness of the joint concept-vision-language representation.
Results on VQA 2.0 dataset: The performance of our complete model on the VQA 2.0 validation set is compared with existing models in Table 2. The Up-Down model (Anderson et al., 2017) combines bottom-up and top-down attention mechanisms, enabling attention to be calculated at the level of objects. XNM Net (Shi et al., 2018) and ReGAT (Li et al., 2019) are designed to answer semantically complicated questions. In addition to the existing approaches, we elaborated two other baselines: (i) SIMPLE: we first create the embedding G, the output of the concept-language module; we then feed G and the image embedding to the vision-language module and send its output to a classifier. (ii) CONCAT: we concatenate the embeddings from the question and ConceptNet to form a mixed embedding Q_KB; we then send Q_KB and the image embedding to the vision-language module and feed its output to a classifier. It is worth noting that SIMPLE and CONCAT do not involve CTI. The results show that our model outperforms the existing models. Since we report our results on the validation set, we removed the validation set from the training phase, so that the model relies only on the training set.
Results on OK-VQA dataset: Table 3 shows the performance of our complete model on the OK-VQA validation set. Since there is only one prior work on the OK-VQA dataset in the literature, we apply a few state-of-the-art models to OK-VQA and report their performance. We also evaluated the SIMPLE and CONCAT baselines on OK-VQA. In the OK-VQA study (Marino et al., 2019), the best results are obtained by fusing MUTAN and ArticleNet (MUTAN + AN) as a knowledge-based baseline. AN retrieves articles from Wikipedia for each question-image pair and then trains a network to predict whether and where the ground-truth answers appear in each article and sentence.
From the table, we observe that our model surpasses the baselines and SOTA models in almost every category, which indicates the usefulness of external knowledge in predicting answers. ConceptBert performs especially well in the "Cooking and Food" (CF), "Plants and Animals" (PA), and "Science and Technology" (ST) categories, with gains larger than 3%. The answers to these types of questions are often entities beyond those mentioned in the question or detected in the image. Therefore, the information extracted from the knowledge graph plays an important role in determining the answer. ViLBERT performs better than ConceptBert in the "Geography, History, Language and Culture" (GHLC) category, since dates are not entities in ConceptNet.

Qualitative results
We illustrate some qualitative results of the complete ConceptBert model (CVL) by comparing it with the VL model. In particular, we aim to illustrate the advantage of adding (i) the external knowledge extracted from the ConceptNet knowledge graph, and (ii) the concept-vision-language embedding representation. Figures 3 and 4 show qualitative results on the VQA 2.0 and OK-VQA validation sets, respectively. From the figures, we observe that the VL model is driven by the objects detected in the picture, whereas the CVL model is able to identify the correct answer without focusing only on the visual features. For example, in the third row of Figure 4, the CVL model uses the facts that an elephant is herbivorous and that a black cat is associated with Halloween to find the correct answers.
It is worth noting that the CVL answers remain semantically consistent even when they are wrong. For example, for How big is the distance between the two players?, the CVL model produces a distance, whereas the VL model provides a Yes/No answer (cf. Figure 5). In another example, for the question Sparrows need to hide to avoid being eaten by what?, the CVL model names an animal species that can eat sparrows, while the VL model returns an object found in the image. From these visualization results, we observe that the knowledge strongly favours the capture of interactions between objects, which contributes to a better alignment between image regions and questions.

Conclusions
In this paper, we present ConceptBert, a concept-aware end-to-end pipeline for questions that require knowledge from external structured content. We introduce a new representation of questions enhanced with external knowledge, exploiting Transformer blocks and knowledge graph embeddings. We then aggregate vision, language, and concept embeddings to learn a joint concept-vision-language embedding. The experimental results demonstrate the performance of our proposed model on the VQA 2.0 and OK-VQA datasets.
For future work, we will investigate how to integrate the explicit relations between entities and objects. We believe that exploiting the provided relations in knowledge graphs and integrating them with relations found between objects in questions/images can improve the predictions.