Fusion of Detected Objects in Text for Visual Question Answering

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The “Bounding Boxes in Text Transformer” (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark, achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided.


Introduction
It has long been understood that the meaning of a word is systematically and predictably linked to the context in which it occurs (e.g., Firth 1957; Harris 1954; Deerwester et al. 1990; Mikolov et al. 2013). Different notions of context have resulted in different levels of success with downstream NLP tasks. Recent neural architectures, including the Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018), have dramatically increased our ability to include a broad window of potential lexical hints. However, the same capacity allows for multimodal context, which may help model the meaning of words in general and sharpen the understanding of specific instances of words in context (e.g., Bruni et al. 2014).

In this paper, we consider visual context in addition to language and show that the right integration of visual and linguistic information can yield improvements in visual question answering. The challenge we consider is to answer natural questions related to a given image. The more general question we address in the context of this problem is how to encode visual and verbal information in a neural architecture. How to best do that is still unclear. How are text entities bound to objects seen in images? Are text and image best integrated late, allowing for independent analysis (late fusion), or should the processing of one be conditioned on the analysis of the other (early fusion)? How is cross-modal co-reference best encoded at all? Does it make sense to ground words in the visual world before encoding sentence semantics?

* Work done as part of the Google AI residency.
1: https://visualcommonsense.com
2: https://github.com/google-research/language/tree/master/language/question_answering/b2t2
In this work we gather evidence to answer these questions by designing the Bounding Boxes in Text Transformer, B2T2 for short, a neural architecture for multimodal encoding of natural language and images, and we evaluate B2T2 on the Visual Commonsense Reasoning benchmark (VCR, Zellers et al. 2019). Figure 1 shows an illustrative example from the VCR benchmark. VCR is well suited to test rich multimodal representations because it requires the analysis of images depicting people engaged in complex activities; it presents questions, answers and rationales created by human annotators rather than automatically generated; it has a clean multiple-choice interface for evaluation; and yet it is still challenging thanks to a careful selection of answer choices through adversarial matching. VCR has much longer questions and answers compared to other popular Visual Question Answering (VQA) datasets, such as VQA v1 (Antol et al., 2015), VQA v2 (Goyal et al., 2017) and GQA (Hudson and Manning, 2019), requiring more modeling capacity for language understanding.
In our experiments, we found that early fusion of co-references between textual tokens and visual features of objects was the most critical factor in obtaining improvements on VCR. We found that the more visual object features we included in the model's input, the better the model performed, even if they were not explicitly co-referent to the text, and that positional features of objects in the image were also helpful. We finally discovered that our models for VCR could be trained much more reliably when they were initialized from pretraining on Conceptual Captions (Sharma et al., 2018), a public dataset of about 3M images with captions. From the combination of these modeling improvements, we obtained a new model for visual question answering that achieves state-of-the-art on VCR, reducing error rates by more than 25% relative to the best published and documented model (Zellers et al., 2019).

Problem Formulation
In this work, we assume data comprised of 4-tuples (I, B, T, l) where

1. I is an image,
2. B = [b1, . . . , bm] is a list of bounding boxes referring to regions of I,
3. T = [t1, . . . , tn] is a passage of tokenized text, with the peculiarity that some of the tokens are not natural language but explicit references to elements of B, and
4. l is a binary label in {0, 1}.

While it might seem surprising to mix natural text with explicit references to bounding boxes, this is actually quite a natural way for people to discuss objects in images, and the VCR dataset is annotated in exactly this way.
We assume an image representation function Φ that converts an image, perhaps after resizing and padding, to a fixed size vector representation of dimension d.
We similarly assume a pretrained textual representation capable of converting any tokenized passage of text, perhaps after truncating or padding, into a vector representation of dimension h. We assume a context independent token representation E in the shape of a vector of dimension h for each token and a passage level representation Ψ which operates on E(T ) and returns a passage level vector representation of dimension h.
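As a concrete illustration of these interfaces, the sketch below uses random toy stand-ins for Φ, E and Ψ; only the shapes matter here. The dimensions d = 2048 and h = 1024 match the implementation details given later, but the functions themselves are placeholders, not the actual ResNet-152 and BERT-Large models.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 2048, 1024     # visual and textual representation sizes used in this paper
VOCAB = 1000          # toy vocabulary size, just for illustration

# Random stand-in for the context-independent token embedding table E.
W_token = rng.normal(size=(VOCAB, h)) * 0.02

def phi(image):
    """Phi: image -> fixed-size vector of dimension d (stand-in for ResNet-152)."""
    return rng.normal(size=d)

def E(token_ids):
    """E: token ids -> context-independent embeddings, one row of size h per token."""
    return W_token[token_ids]

def psi(token_embeddings):
    """Psi: token embeddings -> passage-level vector of dimension h
    (stand-in for BERT's final-layer [CLS] representation)."""
    return token_embeddings.mean(axis=0)

img_vec = phi(None)                  # Phi(I), shape (2048,)
tok_emb = E(np.array([7, 42, 9]))    # E(T), shape (3, 1024)
passage_vec = psi(tok_emb)           # Psi(E(T)), shape (1024,)
print(img_vec.shape, tok_emb.shape, passage_vec.shape)
```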
We refer the reader to Table 1 for an overview of the notation used in this work. Full details on how the VCR dataset is encoded into this formalism are given in Section 4.

Models and Methods
We evaluate two main architectures: "Dual Encoder", a late fusion architecture where image and text are encoded separately and answer scores are computed as an inner product, and the full B2T2 model, an early fusion architecture where visual features are embedded on the same level as input word tokens. Section 5.2 will summarize experiments with model variants to answer the research questions laid out in the introduction and to analyze what works, and why.

Dual Encoder
Dual Encoders, discussed for example by Wu et al. (2018) and Gillick et al. (2018), are models that embed objects of potentially different types into a common representation space where a similarity function can be expressed e.g. as a dot product or a cosine similarity. A notable example of a dual encoder for image classification is WSABIE, proposed by Weston et al. (2011).
Our Dual Encoder architecture is shown in Figure 2. We model the class distribution as

p(l = 1 | I, T) = σ(Φ(I)⊤ D Ψ(E(T)))

where D is a learned matrix of size d × h and σ is the logistic sigmoid. In this model, co-reference information is completely ignored, and the model must rely on fixed dimensional vectors for the late fusion of textual and visual contexts. However, we found this to be surprisingly competitive on VCR compared to published baselines, perhaps due to our choice of powerful pretrained models.
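A minimal sketch of this bilinear scoring function, assuming the sigmoid form implied by the binary cross entropy loss used for training; all tensors are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 2048, 1024

phi_I = rng.normal(size=d)           # Phi(I): fixed-size image vector
psi_T = rng.normal(size=h)           # Psi(E(T)): passage-level text vector
D = rng.normal(size=(d, h)) * 0.01   # learned d x h bilinear matrix

score = phi_I @ D @ psi_T            # scalar similarity in the shared space
p = 1.0 / (1.0 + np.exp(-score))     # p(l = 1 | I, T)
print(0.0 < p < 1.0)
```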

B2T2
Our B2T2 architecture is shown in Figure 3. We model the class distribution as

p(l | I, B, R, T) ∝ exp(a_l · Ψ(E′(I, B, R, T)) + b_l)

where a_l ∈ R^h and b_l ∈ R are learned parameters. E′(I, B, R, T) is a non-contextualized representation not only of each token and of its position in text, but also of the content and position of the bounding boxes. The key difference from "Dual Encoder" is that text, image and bounding boxes are combined at the level of the non-contextualized token representations rather than right before the classification decision.
The computation of E′(I, B, R, T) is depicted in Figure 4. More formally, for a given example, let matrix R ∈ {0, 1}^(m×n) encode the references between the bounding boxes in B and the tokens in T, so that R_ij is 1 if and only if bounding box i is referenced by token j. Then Φ(crop(I, b_i)) denotes cropping image I to bounding box b_i and then extracting a visual feature vector of size d, and π(b_i) denotes the embedding of b_i's shape and position information in a vector of size d.
To embed the position and size of a bounding box b, we introduce two new learnable embedding matrices X and Y of dimension k × d/4. Let the coordinates of the opposite corners of b be (x1, y1) and (x2, y2), after normalizing so that a bounding box covering the entire image would have x1 = y1 = 0 and x2 = y2 = k. Position embeddings are thus defined to be

π(b) = [X_x1 ; Y_y1 ; X_x2 ; Y_y2]

the concatenation of the four corner coordinate embeddings, each of dimension d/4, which yields a vector of dimension d.
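The per-token embedding computation can be sketched with toy dimensions as follows. One caveat: the projection matrix M mapping visual features (size d) into the token embedding space (size h) is our assumption, introduced because d and h differ in the actual implementation; it is not named in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 8, 6, 4      # tiny dimensions for illustration (real model: d=2048, h=1024)
m, n = 2, 5            # m bounding boxes, n text tokens

E_T = rng.normal(size=(n, h))   # context-independent token embeddings E(T)
R = np.zeros((m, n))            # R[i, j] = 1 iff box i is referenced by token j
R[0, 1] = 1
R[1, 3] = 1

# Learnable coordinate embedding tables X and Y, one row of size d/4 per coordinate.
X = rng.normal(size=(k + 1, d // 4))
Y = rng.normal(size=(k + 1, d // 4))

def pi(box):
    """Position/shape embedding pi(b): concatenate the embeddings of the two
    opposite corners, with coordinates already normalized to the range [0, k]."""
    x1, y1, x2, y2 = box
    return np.concatenate([X[x1], Y[y1], X[x2], Y[y2]])  # vector of size d

def phi_crop(i):
    """Stand-in for Phi(crop(I, b_i)): visual feature of the cropped region."""
    return rng.normal(size=d)

# Assumed d -> h projection so visual features live in token-embedding space.
M = rng.normal(size=(h, d)) * 0.1

boxes = [(0, 0, 4, 4), (1, 1, 2, 3)]  # integer corner coordinates in [0, k]
V = np.stack([M @ (phi_crop(i) + pi(b)) for i, b in enumerate(boxes)])  # (m, h)

# Add each box's projected feature to every token that references it.
E_prime = E_T + R.T @ V
print(E_prime.shape)
```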

Loss
All of our models are trained with binary cross entropy loss using label l. Denoting p := p(l = 1|I, B, R, T), we have for each example

L_BCE = −l · log p − (1 − l) · log(1 − p)
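As a quick numeric check of this loss: a confident correct prediction incurs a small penalty, while a confident wrong one incurs a large penalty.

```python
import math

def bce(p, l):
    """Binary cross entropy: -l*log(p) - (1-l)*log(1-p)."""
    return -(l * math.log(p) + (1 - l) * math.log(1 - p))

print(round(bce(0.9, 1), 4))  # 0.1054  (confident and correct -> small loss)
print(round(bce(0.9, 0), 4))  # 2.3026  (confident and wrong -> large loss)
```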

Pretraining on Conceptual Captions
Before training on VCR, we pretrain B2T2 on image and caption pairs using a Mask-LM pretraining technique like the one used in BERT (Devlin et al., 2018). The setup used during pretraining is shown in Figure 5, where the model uses the image as additional context when filling in the mask.
We use two tasks for pretraining: (1) impostor identification and (2) masked language model prediction. For the impostor task, we sample a random negative caption for each image and ask the model to predict whether the caption is correctly associated with the image. For mask-LM, we randomly replace tokens in the caption with the [MASK] token, and the model must predict the original token (see Devlin et al. (2018) for details). There are no bounding boxes during pretraining, so B = ∅. The binary label l indicates whether the caption is an impostor or not. The loss for impostor identification is binary cross entropy L_BCE with label l as in Section 3.3. We denote the loss for mask-LM as L_MLM, which is the summed cross entropy of the predicted token distributions against the true tokens.
To ensure that our model correctly grounds the language to the image with the mask-LM loss, we only use it for positive captions, zeroing it out for negative captions. Our final objective is the sum of the losses:

L = L_BCE + I[l = 1] · L_MLM

where I[l = 1] is an indicator for the label l being positive for the image and caption pair.
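The combined objective follows directly from the definition above; the loss values in the sketch below are illustrative placeholders.

```python
def pretraining_loss(l_bce, l_mlm, label):
    """Total pretraining objective: impostor-identification BCE plus mask-LM loss,
    with the mask-LM term zeroed out for impostor (negative) captions."""
    indicator = 1.0 if label == 1 else 0.0
    return l_bce + indicator * l_mlm

print(pretraining_loss(0.2, 1.5, label=1))  # positive caption: both terms count
print(pretraining_loss(0.2, 1.5, label=0))  # impostor: mask-LM loss is zeroed
```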
We pretrain on Conceptual Captions (Sharma et al., 2018), a dataset with over 3M images paired with captions. We found empirically that pretraining improves our model slightly on VCR, but more importantly, allows our model to train stably. Without pretraining, results on VCR exhibit much higher variance. We refer the reader to Section 5.2 for an ablation analysis on the effect of pretraining.

Implementation Details
We use ResNet-152 (He et al., 2016), pretrained on ImageNet and publicly available at tfhub.dev, for Φ, which yields a vector representation of size d = 2048. BERT-Large (Devlin et al., 2018) provides both E and Ψ. The latter is a pretrained Transformer with 24 layers, 16 attention heads, and hidden size 1024. For BERT, E corresponds to its token embeddings and Ψ to the [CLS] token representation in the final layer, so Ψ(E(T)) corresponds to the BERT passage representation of size h = 1024. We found empirically that it was slightly better to keep Φ fixed rather than fine-tuning it, but that it was of critical importance to fine-tune Ψ and E for the new task. We also tried pretraining on MS-COCO images and captions (Lin et al., 2014), but found this to be ineffective; this could be because MS-COCO is smaller (around 80k images and 400k captions).

[Figure 4: How input embeddings are computed in our B2T2 architecture.]
In all of our fine-tuning experiments we used the Adam optimizer (Kingma and Ba, 2014) and trained our models over a grid of hyperparameters: learning rates of 2 · 10^−5 and 3 · 10^−5; 3, 4, and 5 epochs with linear learning rate decay; and two random seeds for initialization.
To maximize performance on VCR, we also evaluate an ensemble of B2T2 models. Our ensemble is comprised of 5 identical B2T2 models, trained for 3 epochs with an initial learning rate of 2 · 10 −5 , but initialized with 5 different random seeds. The resulting class logits are then summed to obtain the ensemble scores.
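The ensembling step reduces to summing the per-model logits before taking the argmax; a sketch with random stand-in logits:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_choices = 5, 4   # 5 B2T2 models, 4 answer choices per VCR question

# Per-model class logits for one question (one row per ensemble member).
logits = rng.normal(size=(n_models, n_choices))

ensemble_scores = logits.sum(axis=0)          # sum logits across the 5 models
prediction = int(np.argmax(ensemble_scores))  # the ensemble's chosen answer
print(ensemble_scores.shape, prediction)
```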

Visual Commonsense Reasoning

Visual Commonsense Reasoning (VCR, visualcommonsense.com, Zellers et al. 2019) is a corpus that contains a sample of stills from movies. Questions and answers revolve around conclusions or assumptions that require knowledge external to the images. The associated task is not only to select a correct answer but also to provide reasoning in line with common sense. Matching our problem formulation given before, a VCR sample is defined as a tuple (I, O, Q, A, R).
Here, I is the image, and O is a sequence of objects identified in the image. A question Q = [q0, . . . , qk] is given, where tokens are either textual words or deictic references to objects in O. Each question is paired with a set of four answers A = {A1, A2, A3, A4}, with exactly one correct answer A*. Answers follow the same schema as questions, mixing words and object references. Finally, there is a set of four rationales R = {R1, R2, R3, R4}, with exactly one rationale R* identified as correct in supporting A*.
Each object in O = [(b1, l1), . . . , (b|O|, l|O|)] is identified in the image I by a bounding box b_i and labeled with its class by a text token l_i. The Q → A task is to choose A* given (I, O, Q, A). The QA → R task is to choose R* given (I, O, Q, A*, R). Finally, the Q → AR task is a pipeline of the two, where a model must first correctly choose A* from A, then correctly choose R* given A*.
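The pipelined Q → AR metric can be stated precisely in a few lines: an example counts as correct only if both stages are correct.

```python
def q2ar_correct(pred_answer, gold_answer, pred_rationale, gold_rationale):
    """Q -> AR is a pipeline: the example is correct only if the model first
    picks the right answer AND then picks the right rationale."""
    return pred_answer == gold_answer and pred_rationale == gold_rationale

print(q2ar_correct(2, 2, 1, 1))  # True
print(q2ar_correct(2, 2, 1, 3))  # False (right answer, wrong rationale)
print(q2ar_correct(0, 2, 1, 1))  # False (wrong answer)
```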
We adapt VCR to our problem formulation by converting each VCR example to four instances for the Q → A task, one per answer in A, and four instances for the QA → R task, one per rationale in R. We construct the text for the instances in the Q → A task as

[CLS] [b0] q0 . . . [SEP] a0 . . . [SEP]

and for the QA → R task as

[CLS] [b0] q0 . . . a*0 . . . [SEP] r0 . . . [SEP]

Here, [b0] is a bounding box referring to the entire input image, q0, . . . are the question tokens, a0, . . . answer tokens, a*0, . . . answer tokens for the correct answer, and r0, . . . rationale tokens. We append the first p bounding boxes in O with class labels to the end of the sequence (in our experiments, we use p = 8), and for objects referenced in Q, A, R, we prepend the class label token. We assign the binary label l to every instance to represent whether the answer or rationale choice is the correct one.
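A sketch of this conversion for the Q → A task; the exact [CLS]/[SEP] layout is our reading of the construction described above, and the tokens are purely illustrative.

```python
def make_qa_instances(question_tokens, answers, correct_index):
    """Convert one VCR example into four binary Q -> A instances, one per
    candidate answer, each labeled 1 iff it is the correct answer.
    '[b0]' stands for the whole-image bounding box token."""
    instances = []
    for i, answer_tokens in enumerate(answers):
        text = ["[CLS]", "[b0]"] + question_tokens + ["[SEP]"] + answer_tokens + ["[SEP]"]
        instances.append((text, 1 if i == correct_index else 0))
    return instances

instances = make_qa_instances(
    ["why", "is", "[person1]", "smiling", "?"],
    [["he", "got", "a", "gift"], ["he", "is", "sad"],
     ["it", "is", "raining"], ["he", "lost"]],
    correct_index=0,
)
print(len(instances), sum(label for _, label in instances))  # 4 1
```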

VCR Task Performance
Our final results on the VCR task are shown in Table 2. Our Dual Encoder model worked surprisingly well compared to Zellers et al. (2019), surpassing the baseline without making use of bounding boxes. We also evaluate a Text-Only baseline, which is similar to the Dual Encoder model but ignores the image. The ensemble of B2T2 models, pretrained on Conceptual Captions, obtained absolute accuracy improvements of 8.9%, 9.8% and 13.1% over the published R2C baseline for the Q → A, QA → R, and Q → AR tasks respectively. At the time of this writing (May 22, 2019), both our single and ensemble B2T2 models outperform all other systems on the VCR leaderboard.

Ablations
To better understand the reason for our improvements, we performed a number of ablation studies on our results, summarized in Table 3. We consider ablations in order of decreasing impact on the VCR dev set Q → A accuracy.
Use of Bounding Boxes. The number of bounding boxes considered by our model turns out to be the most important factor in its accuracy. Without any bounding boxes we obtain 67.5% accuracy, just above the accuracy of the dual encoder. With 4 instead of 8 appended bounding boxes we obtain 71% accuracy. With 8 bounding boxes but no textual labels for the bounding boxes in the text, we obtain 70.9% accuracy, showing that our model can make use of labels for detected objects. Example 1 in Table 4 shows an example that our models can only get right if bounding box 5 is available.

Late Fusion vs. Early Fusion. The second most important architectural choice in our model is to combine visual information at the level of context-independent token embeddings, rather than at the highest levels of the neural representation. If in the full B2T2 model we add visual embeddings in the last layer of BERT rather than in the first, we lose 3.3% accuracy.
Effect of Textual Model Size. The original VCR work by Zellers et al. (2019) made use of BERT-base, while we use BERT-large to initialize our models. To test how much of our improvements are simply due to our model being larger, we retrained B2T2 models using BERT-base and found that we lose 2.9% accuracy.
Effect of Visual Model Size. How important is the choice of the visual model to the performance of B2T2? As further discussed in the error analysis section of this work, we suspect that B2T2 could be significantly improved by extending the visual features to represent not only objects, but also activities, expressions and more. However, it appears that even the size of the object detection model is important. If we swap out ResNet-152 for ResNet-50, accuracy decreases by 1.5%.

[Figure 6: Boxplot of dev Q → A accuracy on VCR with and without pretraining. Pretraining on Conceptual Captions lowers variance when fine-tuning on VCR, over a grid search on multiple random seeds, learning rates, and VCR training epochs.]
Pretraining. We found that performance improvements from pretraining are quite small, around 0.4% accuracy, but initializing from a pretrained model heavily reduces variance of results. We show this effect in Figure 6 over the grid of learning rates, random seeds, and training epochs described in Section 3.5.
Position of Bounding Boxes. We additionally investigated the effect of removing position information from the model. The benefit of having bounding box positional embeddings is the smallest of the ones we considered: a model trained without positional embeddings loses only 0.3% accuracy compared to the full model.

Error Analysis
We picked some examples, shown in Table 4, to illustrate the kinds of correct and incorrect choices that B2T2 is making, compared to our dual encoder and to a text only model.
In Example 1 we show how our model picks the right answer only when it is able to make use of all provided bounding boxes. Bounding box 5 in particular contains the clue that allows the observer to infer that the man in the picture might have just gone shopping. In Examples 2 and 3, no specific bounding box appears to contain critical clues for answering the question, but B2T2 outperforms models without access to the image or without access to bounding boxes. It is possible that B2T2 gains a deeper understanding of a scene by combining information from important regions of the image.
In Examples 4 and 5, we see failure cases of both the dual encoder and B2T2 compared to the text-only model. Both these examples appear to point to a limitation in the amount of information that we are able to extract from the image. Indeed, our vision model is trained on ImageNet, so it might be very good at recognizing objects but unable to recognize human expressions and activities. Our models could have correctly answered the question in Example 4 if they were able to recognize smiles. Similarly, our models could have ruled out the incorrect answer they picked for the question in Example 5 if they were able to see that both people in the picture are sitting down and not moving.

Related Work
Modeling visual contexts can aid in learning useful sentence representations (Kiela et al., 2017) and even in training language models (Ororbia et al., 2019). This paper takes these more general ideas to a downstream task that requires modeling of visual input. Similar to B2T2, VideoBERT (Sun et al., 2019) jointly processes video frames and text tokens with a Transformer architecture. However, VideoBERT cannot answer questions, nor does the model consider bounding boxes.
Our B2T2 model is similar to the Bottom-Up Top-Down attention model (Anderson et al., 2018) in how bounding boxes generated at preprocessing time are attended to by the VQA model. "Bottom-Up" refers to the idea of attending from the text to the bounding boxes of objects detected in the image, while "Top-Down" refers to the idea of attending to regions constructed as a regular grid over the image. The Bottom-Up Top-Down model however reduces the text to a fixed length vector representation before attending to image regions, while B2T2 instead treats image regions as special visual tokens mixed in the text. In this sense, Bottom-Up Top-Down model is a late fusion model, while B2T2 is early fusion.
The Neuro-Symbolic Concept Learner (Mao et al., 2019) also uses bounding boxes to learn visually grounded concepts through language. The Neuro-Symbolic Concept Learner however relies on a semantic parser to interpret language, while B2T2 uses a Transformer to construct a joint representation of textual tokens and visual tokens.
Another recently proposed model for VQA is MAC (Hudson and Manning, 2018). As presented, MAC does not make use of bounding boxes, which makes it a Top-Down model in the nomenclature of Anderson et al. (2018). MAC also reduces the textual information to a vector of fixed length. However MAC makes use of a new neural architecture designed to perform an explicit multi-step reasoning process and is reported to perform better than Anderson et al. (2018) on the GQA dataset (Hudson and Manning, 2019).
After the submission of this paper, several new works were published with excellent results on VCR, in some cases exceeding the performance of our system. In particular we mention ViLBERT (Lu et al., 2019), VL-BERT (Su et al., 2019), Unicoder-VL (Li et al., 2019a), and VisualBERT (Li et al., 2019b).
VCR is only one of several recent datasets pertaining to the visual question answering task. VQA (Antol et al., 2015;Goyal et al., 2017) contains photos and abstract scenes with questions and several ground-truth answers for each, but the questions are less complex than VCR's. CLEVR (Johnson et al., 2017) is a visual QA task with compositional language, but the scenes and language are synthetic. GQA (Hudson and Manning, 2019) uses real scenes from Visual Genome, but the language is artificially generated. Because VCR has more complex natural language than other datasets, we consider it the best evaluation of a model like B2T2, which has a powerful language understanding component.

Conclusion
In this work we contrast different ways of combining text and images when powerful text and vision models are available. We picked BERT-Large (Devlin et al., 2018) as our text model, ResNet-152 (He et al., 2016) as our vision model, and the VCR dataset (Zellers et al., 2019) as our main benchmark.
The early-fusion B2T2 model, which encodes sentences along with links to bounding boxes around identified objects in the images, produces the best available results on the visual question answering tasks we consider. A control model implementing late fusion, but otherwise identical, performs substantially worse. Thus, grounding words in the visual context should be done early rather than late.
We also demonstrated competitive results with a Dual Encoder model, matching the prior state of the art on the VCR dataset even when textual references to image bounding boxes are ignored. We then showed that this model can be substantially improved by incorporating, at the level of the textual embeddings, visual features extracted from the entire image and from bounding boxes. We finally showed that pretraining our deep model on Conceptual Captions with a Mask-LM loss yields a small additional improvement as well as much more stable fine-tuning results.