What Does BERT with Vision Look At?

Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvements on vision-and-language tasks, but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntactic relations between non-entity words and image regions, tracking, for example, associations between verbs and regions corresponding to their arguments. We denote this ability as syntactic grounding. We verify grounding both quantitatively and qualitatively, using Flickr30K Entities as a testbed.


Introduction
Recently, BERT (Devlin et al., 2019) variants with vision such as ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), and UNITER (Chen et al., 2019) have achieved new records on several vision-and-language reasoning tasks, e.g., VQA (Antol et al., 2015), NLVR2 (Suhr et al., 2019), and VCR (Zellers et al., 2019). These pre-trained visually grounded language models use Transformers (Vaswani et al., 2017) to jointly model words and image regions. They are pre-trained on paired image-text data, where, given parts of the input, the model is trained to predict the missing pieces. Despite their strong performance, it remains unclear whether these models have learned the desired cross-modal representations.
Meanwhile, a large body of work (Liu et al., 2019; Tenney et al., 2019; Clark et al., 2019) has focused on understanding the internal behaviours of pre-trained language models (Peters et al., 2018b; Radford et al., 2018; Devlin et al., 2019) and finds that they capture linguistic features such as parts of speech, syntactic structures, and coreference. This inspires us to ask: what do visually grounded language models learn during pre-training?
Following Clark et al. (2019), we find that certain attention heads of a visually grounded language model acquire an intuitive yet fundamental ability that is often believed to be a prerequisite for advanced visual reasoning (Plummer et al., 2015): grounding of language to image regions.
We first observe that some heads can perform entity grounding, where entities that have direct semantic correspondences in the image are mapped to the correct regions. For example, in Figure 1, the word "man" attends to the person on the left of the image. Further, non-entity words often attend to image regions that correspond to their syntactic neighbors; we call this syntactic grounding. For example, "wearing" attends to its subject, the man in the image. We argue that syntactic grounding complements entity grounding and is a natural byproduct of cross-modal reasoning. For example, to ground "man" to the person on the left rather than the other pedestrians, the model needs to identify the syntactic relationships among "man", "wearing", "white", and "shirt", and then ground "shirt" and subsequently "man". During this process, it is helpful and natural for "wearing" to attend to the man in the image.
We verify these phenomena by treating each attention head as a ready-to-use classifier (Clark et al., 2019) that, given an input word, always outputs the most-attended-to image region. Using Flickr30K Entities (Plummer et al., 2015) as a test bed, we demonstrate that certain heads perform entity and syntactic grounding with accuracy significantly higher than a rule-based baseline. Further, higher layers tend to have higher grounding accuracy, suggesting that the model refines its understanding of vision and language layer by layer. Additionally, we provide a qualitative analysis exemplifying these phenomena. A long version of this paper is at https://arxiv.org/abs/1908.03557. Our code is available at https://github.com/uclanlp/visualbert.

Figure 1: Attention weights of selected heads in a pre-trained visually grounded language model. In high layers (e.g., the 10th and 11th), the model implicitly grounds visual concepts (e.g., "other pedestrians" and "man wearing white shirt"). The model also captures certain syntactic dependency relations (e.g., "walking" is aligned to the man region in the 6th layer) and refines its understanding over the layers, incorrectly aligning "man" and "shirt" in the 3rd layer but correcting them in higher layers.

Figure 2: The architecture of VisualBERT. Image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision. The model is pre-trained with a masked language modeling objective (Objective 1) and a sentence-image prediction task (Objective 2) on caption data, and then fine-tuned for different tasks.
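The head-as-classifier evaluation described above can be sketched as follows. This is a minimal illustration, not code from the paper; the array layout (text tokens first, then image regions, with `num_text_tokens` marking the boundary) is an assumption made for the example:

```python
import numpy as np

def predict_region(attn_row, num_text_tokens):
    """Treat one attention head as a ready-to-use grounding classifier:
    given a source word's attention distribution over all positions
    (text tokens first, then image regions), predict the image region
    that receives the most attention weight."""
    region_weights = attn_row[num_text_tokens:]
    return int(np.argmax(region_weights))

# Toy example: 4 text tokens followed by 3 image regions.
attn = np.array([0.2, 0.1, 0.05, 0.05, 0.1, 0.4, 0.1])
print(predict_region(attn, num_text_tokens=4))  # region index 1
```

Note that the prediction ignores attention mass assigned to other text tokens, mirroring the evaluation protocol in the quantitative analysis below.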

Model
Several pre-trained visually grounded models have been proposed recently; they are conceptually similar yet vary in design details, making them complicated and difficult to evaluate. For simplicity, we therefore propose a simple and performant baseline, VisualBERT (see Figure 2), and base our analysis on this model. We argue that our analysis of VisualBERT generalizes to other similar models, as they all share two core ideas: (1) image features extracted from object detectors such as Faster R-CNN (Ren et al., 2015) are fed into a Transformer-based model along with text; (2) the model is pre-trained on image-text data with a masked visually grounded language modeling objective. Below we introduce VisualBERT briefly and leave details to Appendix A.

Input to VisualBERT includes a text segment and an image. The image is represented as a set of visual embeddings, where each embedding vector corresponds to a bounding region in the image, derived from an object detector (Ren et al., 2015). Text and visual embeddings are then passed through multiple Transformer layers to build joint representations. VisualBERT is pre-trained on the COCO dataset (Chen et al., 2015), consisting of around 100K images with 5 captions each. We use two objectives for pre-training. (1) Masked language modeling with the image: some elements of the text input are masked, and the model learns to predict the masked words based on the remaining text and visual context. (2) Sentence-image prediction: for COCO, where multiple captions correspond to one image, we provide a text segment consisting of two captions. One caption describes the image, while the other has a 50% chance to be another corresponding caption and a 50% chance to be a randomly drawn caption. The model is trained to distinguish these two situations.

Figure 3: Entity grounding accuracy of the attention heads organized by layer. The rule-based baseline is drawn as the grey line. We find that certain heads achieve high accuracy and that accuracy peaks at higher layers.
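To make the sentence-image prediction objective concrete, here is a minimal sketch of how one training example might be assembled. The function name and data layout (`captions_by_image` as a dict mapping image ids to caption lists) are our own illustrative assumptions, not the paper's actual data pipeline:

```python
import random

def make_sentence_image_example(image_id, captions_by_image, all_image_ids, rng):
    """Build one sentence-image prediction example: the first caption always
    matches the image; the second matches with probability 0.5 and is
    otherwise drawn from a random other image. Label 1 = matching pair."""
    caps = captions_by_image[image_id]
    first = caps[0]
    if rng.random() < 0.5:
        second, label = rng.choice(caps[1:]), 1   # another matching caption
    else:
        other = rng.choice([i for i in all_image_ids if i != image_id])
        second, label = rng.choice(captions_by_image[other]), 0  # mismatched
    return (first, second), label
```

The model then sees both captions alongside the image regions and is trained to predict the binary label.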
Extensive experiments on four vision-and-language datasets (Goyal et al., 2017; Zellers et al., 2018; Suhr et al., 2019; Plummer et al., 2015) verify that pre-trained VisualBERT significantly exceeds all comparable baselines. A summary of the results is presented in Table 1; see Appendix B for details. Some of the aforementioned pre-trained visually grounded language models use additional pre-training data or parameters and achieve better performance; as this paper focuses on analysis, we do not dwell on comparing the performance of VisualBERT and other similar models. For the rest of the paper, we analyze a VisualBERT configured the same as BERT-Base, with 12 layers and 144 self-attention heads in total, pre-trained on COCO. To mitigate the domain difference between the diagnostic dataset Flickr30K and COCO, we perform additional pre-training on the training set of Flickr30K with the aforementioned masked-language-modeling-with-the-image objective.

Quantitative analysis
Entity Grounding We first focus on entity grounding and use the validation set of Flickr30K Entities for evaluation. The dataset contains image-caption pairs and annotates the entities in the captions along with the corresponding image regions. For each annotated entity and each attention head of VisualBERT, we take the bounding region that receives the most attention weight as the prediction. An entity could attend not only to the image regions but also to other words in the text; for this evaluation, we regard the image region that receives the most attention weight, compared to other image regions, as the prediction, without considering attention to other words. The predicted region is considered correct if it overlaps with the gold bounding region with an IoU ≥ 0.5 (Kim et al., 2018). We also consider a rule-based baseline that always chooses the region with the highest detection confidence.

We report the accuracy for all 144 attention heads in VisualBERT, along with the baseline, in Figure 3. Some heads that are accurate at entity grounding are nevertheless not actively attending to the image regions: a head might allocate only 10% of its attention weight to all image regions, yet assign most of that 10% to the correct region. We regard heads that on average pay more than 20% of their attention weight from the entities to the regions as "actively attending to the image" and draw them as dark, large dots, while the others are drawn as light, small dots. We make the following two observations. First, certain heads perform entity grounding with remarkably high accuracy, consistent with the observations of Clark et al. (2019) on BERT. Second, accuracy peaks at higher layers, in that the model, like BERT, refines its understanding of the input over the layers.

Table 2: The best-performing heads on grounding the 10 most common dependency relationships. We only consider heads that on average allocate more than 20% of the attention from source words to image regions. A particular attention head is denoted as <layer>-<head number>.
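The IoU ≥ 0.5 correctness criterion used in this evaluation can be written out as follows; this is the standard formulation (boxes as `(x1, y1, x2, y2)` corner coordinates), not code from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct(pred_box, gold_box, threshold=0.5):
    """A predicted region counts as correct if it overlaps the gold
    bounding region with IoU at or above the threshold."""
    return iou(pred_box, gold_box) >= threshold
```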
Syntactic Grounding As motivated above, alignments between non-noun words and image regions could also be helpful for visual reasoning. More specifically, if two words are connected by a dependency relation r (w_1 ←r→ w_2), and w_1 is an entity aligned to an image region, we would like to know how often the attention heads attend from w_2 to the regions corresponding to w_1. For evaluation, we parse all sentences in the validation set of Flickr30K using AllenNLP (Dozat and Manning, 2017; Gardner et al., 2018) and use the parser output as gold parsing annotations.

Figure 4: Accuracy of VisualBERT attention heads on syntactic grounding for specific dependency relationships ("pobj", "nsubj", "amod"). The grey lines denote a baseline that always chooses the region with the highest detection confidence. We observe that VisualBERT is capable of detecting these dependency relationships without direct supervision.
We find that for each dependency relationship, there exists at least one head that significantly outperforms guessing the most confident bounding region. We report the 10 most common relations in Table 2 and plot the syntactic grounding accuracy of three particularly interesting dependency relationships in Figure 4. Similar to what we observe for entity grounding, the model becomes more accurate on syntactic grounding in higher layers.
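The per-head metric, including the 20% active-attention filter described in the entity grounding section, amounts to something like the following sketch; the row-per-example layout and the function signature are illustrative assumptions:

```python
import numpy as np

def head_grounding_accuracy(attn_rows, num_text_tokens, gold_regions,
                            min_image_attn=0.2):
    """attn_rows: one attention distribution per evaluation example
    (text positions first, then image regions). gold_regions: for each
    example, the set of acceptable region indices. Heads that on average
    send less than `min_image_attn` of their attention mass to image
    regions are skipped (returned as None)."""
    image_attn = attn_rows[:, num_text_tokens:]
    if image_attn.sum(axis=1).mean() < min_image_attn:
        return None  # head is not actively attending to the image
    preds = image_attn.argmax(axis=1)
    return float(np.mean([p in gold for p, gold in zip(preds, gold_regions)]))
```

The same routine covers both entity grounding (source word = entity) and syntactic grounding (source word = the dependency neighbor w_2, gold regions = those of w_1).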

Qualitative Analysis
Finally, we showcase several interesting examples of how VisualBERT performs grounding in Figure 1 and Figure 5. To generate these examples, for each ground-truth box, we show the predicted bounding region closest to it and manually group the bounding regions into different categories. We also include regions that the model is actively attending to, even if they are not present in the gold annotations (marked with an asterisk). We then aggregate the attention weights from words to the regions in the same category. We show the heads from 6 layers that achieve the highest entity grounding accuracy and find that they also exhibit a certain level of syntactic grounding.
We observe the same behaviours as in the quantitative analysis, in that VisualBERT not only performs grounding but also refines its predictions through successive Transformer layers. For example, in the bottom image in Figure 5, the words "husband" and "woman" initially both assign significant attention weight to regions corresponding to the woman. By the end of the computation, VisualBERT has disentangled the woman and man, correctly aligning both. Furthermore, there are many examples of syntactic alignments: in the same image, the word "teased" aligns to both the man and woman, while "by" aligns to the man.

Related Work
Our work is inspired by papers analyzing pre-trained language models. One line of work uses probing tasks to study the internal representations of such models, while another analyzes their attention maps (Clark et al., 2019). We, however, present a quantitative study on whether visually grounded language models acquire the grounding ability during pre-training without explicit supervision.

Figure 5: Attention weights of 6 selected heads in VisualBERT where alignments match Flickr30K annotations.

Conclusion and Future Work
We have presented an analysis of the attention maps of VisualBERT, our proposed visually grounded language model. We note that the grounding behaviour we have found is linguistically inspired: entity grounding can be regarded as cross-modal entity coreference resolution, while syntactic grounding can be regarded as cross-modal parsing. Moreover, VisualBERT exhibits a hint of cross-modal pronoun resolution: in the bottom image of Figure 5, the word "her" is resolved to the woman. For future work, it would be interesting to see whether more linguistically inspired phenomena can be systematically found in cross-modal models.

Acknowledgement
We would like to thank Xianda Zhou for help with experiments as well as Patrick H. Chen, members of UCLA NLP, and anonymous reviewers for helpful comments. We also thank Rowan Zellers for evaluation on VCR and Alane Suhr for evaluation on NLVR2. Cho-Jui Hsieh acknowledges the support of NSF IIS-1719097 and Facebook Research Award. This work was supported in part by the DARPA MCS program under Cooperative Agreement N66001-19-2-4032. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

A VisualBERT
First we give background on BERT, then summarize the adaptations we made to allow processing images and text jointly, and finally explain our training procedure.

A.1 Background

BERT is a Transformer-based model that operates on a sequence of subword tokens, each mapped to an input embedding; we denote the set of input embeddings as E. Each embedding e ∈ E is computed as the sum of 1) a token embedding e_t, specific to the subword, 2) a segment embedding e_s, indicating which part of the text the token comes from (e.g., the hypothesis from an entailment pair), and 3) a position embedding e_p, indicating the position of the token in the sentence. The input embeddings E are then passed through a multi-layer Transformer that builds up a contextualized representation of the subwords.
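As a toy illustration of the input embedding sum e = e_t + e_s + e_p, with random lookup tables standing in for learned BERT parameters (the table sizes here are arbitrary, not BERT's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_segments, max_len, dim = 100, 2, 16, 8
tok_table = rng.normal(size=(vocab_size, dim))    # e_t lookup table
seg_table = rng.normal(size=(num_segments, dim))  # e_s lookup table
pos_table = rng.normal(size=(max_len, dim))       # e_p lookup table

def input_embedding(token_id, segment_id, position):
    """e = e_t + e_s + e_p: token, segment, and position embeddings summed."""
    return tok_table[token_id] + seg_table[segment_id] + pos_table[position]
```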
BERT is commonly trained with two steps: pretraining and fine-tuning. Pre-training is done using a combination of two language modeling objectives: (1) masked language modeling, where some parts of the input tokens are randomly replaced with a special token (i.e., [MASK]), and the model needs to predict the identity of those tokens and (2) next sentence prediction, where the model is given a sentence pair and trained to classify whether they are two consecutive sentences from a document. Finally, to apply BERT to a particular task, a taskspecific input, output layer, and objective are introduced, and the model is fine-tuned on the task data from pre-trained parameters.

A.2 Model
The core of our idea is to reuse the self-attention mechanism within the Transformer to implicitly align elements of the input text and regions in the input image. In addition to all the components of BERT, we introduce a set of visual embeddings, F, to model an image. Each f ∈ F corresponds to a bounding region in the image, derived from an object detector. It is computed by summing three embeddings: (1) f_o, a visual feature representation of the bounding region of f, computed by a convolutional neural network, (2) f_s, a segment embedding indicating that it is an image embedding as opposed to a text embedding, and (3) f_p, a position embedding, which is used when alignments between words and bounding regions are provided as part of the input, and is set to the sum of the position embeddings corresponding to the aligned words (see Section B.2). The visual embeddings are then passed to a multi-layer Transformer along with the original set of text embeddings, allowing the model to implicitly discover alignments between both sets of inputs and build up a joint representation.
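The visual embedding sum f = f_o + f_s + f_p can be sketched in the same style. Treating segment id 1 as the image segment is our own assumption for the example; summing position embeddings over aligned words follows the description above:

```python
import numpy as np

def visual_embedding(f_o, seg_table, pos_table, aligned_word_positions=()):
    """f = f_o + f_s + f_p. f_o is the detector's region feature; f_s marks
    the embedding as visual (segment id 1 here is an assumption); f_p, used
    only when word-region alignments are provided, is the sum of the position
    embeddings of the aligned words, and zero otherwise."""
    f_s = seg_table[1]
    f_p = sum((pos_table[p] for p in aligned_word_positions),
              np.zeros_like(f_o))
    return f_o + f_s + f_p
```

This alignment-aware position term is what VCR exploits (Section B.2), since that dataset provides gold word-region alignments.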

A.3 Training VisualBERT
We would like to adopt a training procedure similar to BERT's, but VisualBERT must learn to accommodate both language and visual input. We therefore turn to a resource of paired data: COCO (Chen et al., 2015), which contains images each paired with 5 independent captions. Our training procedure contains three phases.

Task-Agnostic Pre-Training As introduced before, we pre-train VisualBERT on COCO using two visually grounded language model objectives. (1) Masked language modeling with the image: some elements of the text input are masked and must be predicted, but vectors corresponding to image regions are not masked.
(2) Sentence-image prediction: we supply two captions in one training example, one of which has a 50% chance of not matching the image. The model is trained to determine whether the provided caption describes the image.
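The masking step of Objective 1 can be sketched as follows. The 15% masking rate follows BERT's convention and is an assumption here, as is the simplified always-[MASK] replacement (BERT itself sometimes keeps or swaps the token instead):

```python
import random

def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly mask text tokens for masked language modeling with the image.
    Image-region vectors are never masked, so only text tokens appear here.
    Returns the masked sequence and, per position, the target word (or None
    where nothing was masked)."""
    out, targets = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            out.append(mask_token)
            targets.append(t)
        else:
            out.append(t)
            targets.append(None)
    return out, targets
```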
Task-Specific Pre-Training Before fine-tuning VisualBERT to a downstream task, we find it beneficial to train the model using the data of the task with the masked language modeling with the image objective. This step allows the model to adapt to the new target domain.
Fine-Tuning This step mirrors BERT fine-tuning: a task-specific input, output, and objective are introduced, and the model is trained to maximize performance on the task.

We evaluate VisualBERT on four tasks: VQA, VCR, NLVR2, and Flickr30K Entities (Plummer et al., 2015), each described in more detail in the following sections. For all tasks, we use the Karpathy train split (Karpathy and Fei-Fei, 2015) of COCO for task-agnostic pre-training, which has around 100k images with 5 captions each. The Transformer encoder in all models has the same configuration as BERT-Base: 12 layers, a hidden size of 768, and 12 self-attention heads. The parameters are initialized from BERT-Base released by Devlin et al. (2019).

B Experiment
For the image representations, each dataset we study has a different standard object detector used to generate region proposals and region features. To compare with prior work, we follow their settings, and as a result different image features are used for different tasks (see details in the subsections). For consistency, during task-agnostic pre-training on COCO, we use the same image features as in the end tasks. For each dataset, we evaluate three variants of our model:

VisualBERT: The full model, with parameters initialized from BERT, that undergoes pre-training on COCO, pre-training on the task data, and fine-tuning for the task.
VisualBERT w/o Early Fusion: VisualBERT but where image representations are not combined with the text in the initial Transformer layer but instead at the very end with a new Transformer layer. This allows us to test whether interaction between language and vision throughout the whole Transformer stack is important for performance.
VisualBERT w/o COCO Pre-training: Visual-BERT but where we skip task-agnostic pre-training on COCO captions. This allows us to validate the importance of this step.
Following Devlin et al. (2019), we optimize all models with the Adam optimizer (Kingma and Ba, 2015). We set the number of warm-up steps to 10% of the total training steps unless specified otherwise. Batch sizes are chosen to meet hardware constraints, and text sequences longer than 128 tokens are capped. Experiments are conducted on Tesla V100s and GTX 1080Tis, and all experiments can be replicated on 4 Tesla V100s, each with 16GB of GPU memory. Pre-training on COCO generally takes less than a day on 4 cards, while task-specific pre-training and fine-tuning usually take less. Other task-specific training details are in the corresponding subsections.
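The warm-up schedule amounts to something like the sketch below. Only the 10% warm-up fraction comes from the text; the linear decay after warm-up mirrors common BERT fine-tuning practice and is an assumption:

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warm-up over the first `warmup_frac` of training steps,
    followed by (assumed) linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```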

B.1 VQA
Given an image and a question, the task is to correctly answer the question. We use VQA 2.0 (Goyal et al., 2017), consisting of over 1 million questions about COCO images. We train the model to predict the 3,129 most frequent answers and use image features from a ResNeXt-based Faster R-CNN pre-trained on Visual Genome (Jiang et al., 2018). We report the results in Table 3, including baselines using the same visual features and number of bounding region proposals as our methods (first section), our models (second section), and other incomparable methods (third section) that use external question-answer pairs from Visual Genome (+VG), multiple detectors (Yu et al., 2019a) (+Multiple Detectors), and ensembles of their models. In comparable settings, our method is significantly simpler and outperforms existing work.

B.2 VCR
VCR consists of 290k questions derived from 110k movie scenes, where the questions focus on visual commonsense. The task is decomposed into two multiple-choice sub-tasks for which we train individual models: question answering (Q → A) and answer justification (QA → R). Image features are obtained from a ResNet50 (He et al., 2016), and the "gold" detection bounding boxes and segmentations provided in the dataset are used. The dataset also provides alignments between words and the bounding regions referenced in the text, which we utilize by using the same position embeddings for matched words and regions. Results on VCR are presented in Table 4. We compare our methods against the model released with the dataset, which builds on BERT (R2C), and list the top-performing single model on the leaderboard at the time we submitted VisualBERT (VL-BERT).

Table 5: Comparison with the state-of-the-art models on NLVR2. The two ablation models significantly outperform MaxEnt, while the full model widens the gap.

B.4 Flickr30K Entities
The Flickr30K Entities dataset tests the ability of systems to ground phrases in captions to bounding regions in the image. The task is, given spans from a sentence, to select the bounding regions they correspond to. The dataset consists of 30k images and nearly 250k annotations. We follow the setting of BAN (Kim et al., 2018), where image features from a Faster R-CNN pre-trained on Visual Genome are used. For task-specific fine-tuning, we introduce an additional self-attention block and use the average attention weights from each head to predict the alignment between boxes and phrases. For a phrase to be grounded, we take whichever box receives the most attention from the last sub-word of the phrase as the model prediction. Results are listed in Table 6. VisualBERT outperforms the previous state-of-the-art model, BAN. In this setting, we do not observe a significant difference between the ablation model without early fusion and our full model, suggesting that a shallower architecture may be sufficient for grounding when direct supervision is available.

In a preliminary experiment on the effect of the number of object proposals kept per image, we tested models with 9, 18, 36, 72, and 144 proposals, which achieve accuracies of 64.8, 65.5, 66.7, 67.1, and 67.4 respectively on the development set.
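The decision rule for phrase grounding can be sketched as follows; the matrix layout (head-averaged attention as a positions x positions matrix with image-region columns last) is an illustrative assumption:

```python
import numpy as np

def ground_phrase(avg_attn, phrase_subword_positions, num_text_tokens):
    """Predict the box for a phrase: take the attention row of the phrase's
    last sub-word and pick the image-region column with the highest
    (head-averaged) attention weight."""
    last = phrase_subword_positions[-1]
    return int(np.argmax(avg_attn[last, num_text_tokens:]))
```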

C Ablation Study
In this section we conduct an ablation study of which parts of our approach are important to VisualBERT's strong performance. We compare the two ablation models from the Experiment section and four additional variants on NLVR2. For computational efficiency, these models (including the full model) are trained with only 36 features per image. Our analysis (Table 7) investigates the contributions of the following four components of VisualBERT.

C1: Task-Agnostic Pre-training We investigate the contribution of task-agnostic pre-training by entirely skipping such pre-training (VisualBERT w/o COCO Pre-training) and also by pre-training with only text but no images from COCO (VisualBERT w/o Grounded Pre-training). Both variants underperform, showing that pre-training on paired vision and language data is important.

Table 7: Performance of the ablation models on NLVR2. Results confirm the importance of task-agnostic pre-training (C1) and early fusion of vision and language (C2).

C2: Early Fusion
We include VisualBERT w/o Early Fusion to justify allowing early interaction between image and text features, confirming again that multiple interaction layers between vision and language are important.
C3: BERT Initialization All models discussed so far are initialized from a pre-trained BERT. To understand its contribution, we introduce a variant that is randomly initialized and then trained in the same way as the full model. While weights from language-only pre-trained BERT do seem important, performance does not degrade as much as we expected, suggesting that the model is likely learning many of the same useful aspects of grounded language during COCO pre-training.

C4: The sentence-image prediction objective
We introduce a model without the sentence-image prediction objective during pre-training (VisualBERT w/o Objective 2). Results suggest that this objective has a positive but less significant effect compared to the other components. Overall, the results confirm that the most important design choices are task-agnostic pre-training (C1) and early fusion of vision and language (C2). In pre-training, both the inclusion of additional COCO data and the use of both images and captions are paramount.