Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct “mask-and-predict” pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. We find that such a simple approach achieves performance close to a model pre-trained with aligned data, on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.


Introduction
Pre-trained contextual vision-and-language (V&L) models (Tan and Bansal, 2019; Li et al., 2019; Su et al., 2019; Chen et al., 2020c) have achieved high performance on various V&L tasks. However, unlike contextual language models such as BERT (Devlin et al., 2019a), which are trained on easily accessible unannotated text corpora, existing V&L models are still a step away from self-supervision. They require a massive amount of aligned text-image pairs for "mask-and-predict" pre-training. Such aligned data are costly to collect and hard to scale up. For example, the widely used MS-COCO dataset (Chen et al., 2015) requires extensive annotation from crowd workers. In this paper, we explore unsupervised V&L pre-training with unaligned image and text corpora. This research direction aligns with the theme of unsupervised and self-supervised learning that moves from heavily annotated data to unannotated data, e.g., unsupervised machine translation (Lample et al., 2018) and unsupervised image captioning (Feng et al., 2019). Unsupervised V&L pre-training is highly desirable because in many domains aligned data are scarce (e.g., multimodal hate speech detection (Kiela et al., 2020) and the medical domain (Li et al., 2020c)), while unaligned text and images are easier to collect. In addition to its practical implications, our endeavour challenges the widely held notion that image-caption corpora are indispensable for pre-training and brings valuable insight into the role that aligned data play in V&L pre-training.
* The two authors contributed equally.
We are inspired by works on multi-lingual contextual language models (Pires et al., 2019). If we treat an image as a set of regions and each region as a visual token (Dosovitskiy et al., 2020), V&L models share a similar goal with multi-lingual models as they both learn shared representations across different domains. Although a multi-lingual language model pre-trained on non-parallel corpora such as mBERT (Devlin et al., 2019b) cannot align or translate languages out-of-the-box, its representation spaces for different languages can be easily aligned with a linear probe (Conneau et al., 2020). This property suggests the existence of universal latent symmetries in the unaligned contextual embedding spaces and is believed to contribute to mBERT's cross-lingual transfer ability. Thus we hypothesize that strong V&L representations can be similarly learned by "mask-and-predict" pre-training on unaligned language and vision data.
Figure 1: An illustration of pre-training without aligned data. Given text, the model is trained to predict masked words; given an image, the model is trained to predict masked regions and detector tags. The semantic class "cake" appears in both the language modality and the visual modality and is linked through the detector tags. Note that we do not require a text segment with the word cake to appear together with the image. Rather, we assume that as long as the text corpora are general enough, the word cake will appear in the textual modality eventually. The model can thus learn V&L representations from such weak supervision signals.
We propose unsupervised V&L pre-training with unaligned text and images (see an illustration in Figure 1). Specifically, we take VisualBERT as a running example and apply unsupervised pre-training, resulting in Unsupervised VisualBERT (U-VisualBERT). The model takes the form of a single Transformer that can accept inputs from both modalities. During each step of pre-training, unlike existing models that observe a batch of text-image pairs, our model observes either a batch of text segments or a batch of images. When provided with text, some of the words are masked and the model is trained to predict them; when provided with an image, some of the image regions are masked and the model is trained to predict properties of the masked regions.
To further encourage cross-modal fusion, we leverage the tags from an object detector as "anchor points" (Li et al., 2020b). For every object, we append its detected tag as a word to the visual input. The mask-and-predict objective is applied to the tags. For instance, for the image in Figure 1, the model can observe that "cake" appears naturally as a word, a tag, and an image region. The direct tying of image regions and words can be learned and serves as a starting point for further alignment. The function of the detector tags resembles that of the "overlapping vocabulary" in multi-lingual language models, i.e., identical strings that appear in different languages with the same meanings (e.g., "DNA" appears in both English and French). As the "overlapping vocabulary" improves cross-lingual transfer (Wu and Dredze, 2019), we argue the detector tags can improve cross-modal grounding.
We first conduct controlled experiments by pre-training on an English image-caption corpus without providing the alignment, following unsupervised machine translation and image captioning (Gu et al., 2019). Results on four English V&L benchmarks (VQA, NLVR2 (Suhr et al., 2019), Flickr30K Image Retrieval (Plummer et al., 2015), and RefCOCO+) show that U-VisualBERT achieves performance comparable to that of models with access to text-image pairs (Section 4).
Additionally, our approach is effective in practical settings: 1) when using independently collected images and captions and 2) when using images and general-domain text (BookCorpus (Zhu et al., 2015)) without any captions (Section 5.1). Quantitative and qualitative analyses confirm the anchoring effect of the detector tags (Section 5.2). As a byproduct, we conduct preliminary experiments to show the promise of the approach in a semi-supervised setting, where a hybrid model pre-trained with both aligned and additional unaligned data surpasses a model pre-trained only on aligned data (Section 6). The above experiments demonstrate the wide applicability of our method. We will open-source the project under https://github.com/uclanlp/visualbert.

Related Work
Pre-trained V&L Transformers
Various V&L models that are pre-trained with a "mask-and-predict" objective on aligned text-image data have been proposed (Tan and Bansal, 2019; Li et al., 2019; Su et al., 2019; Chen et al., 2020c; Li et al., 2020a; Zhou et al., 2020; Huang et al., 2020). Two kinds of designs have been proposed. Two-stream models (Tan and Bansal, 2019) use a separate Transformer (Vaswani et al., 2017) for each modality together with a cross-modality module. Single-stream models (Su et al., 2019; Chen et al., 2020c) feed the text and visual embeddings directly into a single Transformer. These models have been widely used in downstream tasks (Kiela et al., 2020). Probing tasks (Cao et al., 2020) confirm that they capture useful V&L information after pre-training.
Two studies also incorporate "tag" information during pre-training. Oscar (Li et al., 2020b) adds detected tags as additional signals when pre-training with aligned data. We, however, do so for pre-training with unaligned data and show that the tags play a more important role in unsupervised pre-training (Section 5.2). VIVO targets novel object captioning. It uses manually annotated image-tag data for pre-training and image-caption data for fine-tuning. We do not use manually annotated data, and our tags are noisily generated by a detector.

Self-supervised Representation Learning
Self-supervision involves creating supervision objectives from natural data, often by corrupting the input and training the model to reconstruct it (Kolesnikov et al., 2019) or through contrastive learning (Chen et al., 2020b). Self-supervised training on language (Peters et al., 2018; Devlin et al., 2019a), as in BERT, has proven useful for various NLP tasks, while self-supervised visual representation learning has centered on learning low-level visual features, in the hope of enhancing the backbone CNN (Doersch et al., 2015; Pathak et al., 2016; Noroozi and Favaro, 2016; Chen et al., 2020b). In this paper, we conduct V&L pre-training by optimizing a reconstructive objective on unlabeled language-only and image-only data. Thus, our proposed model could be regarded as "self-supervised". Notably, our contextual visual representation is built on top of a pre-trained detector, operating at a level above local visual features.

Unsupervised Multi-lingual Language Model
This work is inspired by multi-lingual representations trained without parallel corpora (Devlin et al., 2019b). They are effective for cross-lingual transfer, which involves learning a model in one language and applying it to another with no additional training. Studies (Wu and Dredze, 2019;Conneau et al., 2020) have confirmed several design choices that facilitate such transfer, e.g. shared parameters and overlapping vocabularies across languages, and we make similar design choices in U-VisualBERT (Section 3.2). We argue that multi-lingual representations bear resemblance to multi-modal representations as both seek to encode the alignment between two domains (Chen et al., 2020a).

Unsupervised Grounding Learning
Prior works have explored learning grounding with weak or no supervision (Rohrbach et al., 2016; Xiao et al., 2017). Closest to this paper is unsupervised image captioning (Feng et al., 2019; Laina et al., 2019; Gu et al., 2019), which conducts image captioning with unpaired images and captions. As in this work, the detector tags serve as the anchor points for image captioning. However, unsupervised image captioning still requires captions, while our approach works with easy-to-collect general-domain text without any caption text (Section 5.1).

Approach
We first take Supervised VisualBERT (S-VisualBERT) as an example and illustrate how a typical V&L model is pre-trained with aligned data. Then we introduce unsupervised V&L pre-training, and the resulting model Unsupervised VisualBERT (U-VisualBERT).

Background
As mentioned in Section 2, there are several V&L representation learning methods based on BERT. We take Supervised VisualBERT (S-VisualBERT) as an example, which will also be used as a baseline in the experiments. S-VisualBERT is modified from the original VisualBERT and augmented with the visual objectives from LXMERT (Tan and Bansal, 2019) and detector tags similar to Oscar (Li et al., 2020b) (discussed in detail in Section 3.2).
Every input to S-VisualBERT contains a text segment $T$ and an image $I$. The text and the image are first mapped into embedding vectors. The text embedding $T$ is a matrix in which each column vector is the embedding of a subword in the text sequence, i.e., $T = [w_{1:n}]$. Following BERT, each subword embedding $w_i$ is the sum of its token, position, and segment embeddings. The image embedding $I$ includes both the image region embeddings $r_{1:m}$ and the detector tag embeddings $d_{1:l}$ (see Section 3.2 for details). Each region embedding $r_i$ is the sum of a visual feature vector from the detector and a spatial box coordinate embedding (Tan and Bansal, 2019). The text and visual embeddings are then passed through a Transformer to build contextual representations.
The model is pre-trained with a mask-and-predict objective. Given a text-image pair $[T, I]$ from the aligned dataset $D$, we randomly mask out some words $w_i$, some regions $r_j$, and some tags $d_k$ to obtain the masked pair $[\tilde{T}, \tilde{I}]$. The model is trained to predict the masked words, the properties of the masked regions, and the masked tags given $[\tilde{T}, \tilde{I}]$. The pre-training objective can be summarized as:
$$\min_{\theta} \sum_{[T, I] \in D} L_{T+I+M}\big(f_{\theta}([\tilde{T}, \tilde{I}]), [T, I]\big)$$
where $f_{\theta}$ represents the embedding layer and the multi-layer Transformer. $L_{T+I+M}$ is the sum of 1) the masked language model loss $L_T$, 2) the image reconstruction loss $L_I$, and 3) a "text-image match" objective $L_M$. Specifically, $L_I$ includes a tag reconstruction loss $L^{\text{tag}}_I$ (more details in Section 3.2) and the two visual losses as in LXMERT (Tan and Bansal, 2019): the region feature regression loss $L^{\text{ref}}_I$, which forces the model to regress to the visual vector, and the noisy label classification loss $L^{\text{cls}}_I$, which predicts the detected labels of masked objects with the cross-entropy loss. With a probability of 0.5, we provide the model with a mismatched text-image pair instead of a matched pair, and $L_M$ asks the model to predict whether the image matches the text. After the model is pre-trained, it can be fine-tuned for V&L tasks similar to how BERT is fine-tuned for NLP tasks.
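The construction of inputs for the "text-image match" objective can be illustrated with a small sampling routine. The sketch below is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `make_match_example`, the toy data, and the choice to build a mismatch by borrowing the image of another randomly chosen pair are hypothetical. Only the mismatch probability of 0.5 comes from the description above.

```python
import random

def make_match_example(pairs, index, rng, mismatch_prob=0.5):
    """Return (text, image, label) for a text-image match objective.

    label is 1 when the image is the one originally paired with the text
    and 0 when it was swapped for the image of another pair.
    """
    text, image = pairs[index]
    if len(pairs) > 1 and rng.random() < mismatch_prob:
        # Assumed strategy: pick a different pair uniformly and borrow its image.
        other = rng.randrange(len(pairs) - 1)
        if other >= index:
            other += 1
        return text, pairs[other][1], 0
    return text, image, 1

pairs = [
    ("a cake on a table", "img_0"),
    ("a dog in a park", "img_1"),
    ("a red car", "img_2"),
]
rng = random.Random(0)
examples = [make_match_example(pairs, i % len(pairs), rng) for i in range(1000)]
mismatch_rate = sum(1 for _, _, label in examples if label == 0) / len(examples)
print(f"mismatch rate: {mismatch_rate:.2f}")
```

Over many draws the mismatch rate should be close to 0.5, so the binary classifier behind $L_M$ sees a roughly balanced mix of matched and mismatched pairs.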

Unsupervised Pre-training
We introduce the two core design choices of unsupervised pre-training: mask-and-predict pre-training with unaligned data and the detector tags.

Mask-and-Predict Pre-training with Unaligned Data
We assume access to a text corpus $D_T$ and an image corpus $D_I$ for pre-training. During every pre-training step, we randomly sample either a batch of text from $D_T$ or a batch of images from $D_I$. No alignment between text and images is provided to the model. When pre-training with a text segment $T$, the model is trained to reconstruct $T$ given the masked $\tilde{T}$. When pre-training with an image $I$, the model is trained to reconstruct $I$ given the masked $\tilde{I}$. A single Transformer is used for both modalities (i.e., $\theta$ is shared across modalities). The pre-training objective can be summarized as:
$$\min_{\theta} \sum_{T \in D_T} L_T\big(f_{\theta}(\tilde{T}), T\big) + \sum_{I \in D_I} L_I\big(f_{\theta}(\tilde{I}), I\big)$$
After pre-training, the model is fine-tuned on downstream tasks just like its supervised counterpart, with the input being a text-image pair.
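The alternation between modalities during pre-training can be sketched as a batch generator. This is a minimal sketch under assumptions: the name `unaligned_batches`, the toy corpora, and the equal probability of choosing either modality at each step are hypothetical, since the text only states that each step draws a batch from one of the two corpora.

```python
import random

def unaligned_batches(text_corpus, image_corpus, batch_size, num_steps, rng):
    """Yield (modality, batch) tuples; every batch holds a single modality."""
    for _ in range(num_steps):
        if rng.random() < 0.5:  # assumed: both modalities are equally likely
            yield "text", rng.sample(text_corpus, batch_size)
        else:
            yield "image", rng.sample(image_corpus, batch_size)

text_corpus = [f"sentence {i}" for i in range(100)]
image_corpus = [f"image_{i}" for i in range(100)]
steps = list(
    unaligned_batches(text_corpus, image_corpus, batch_size=4, num_steps=10,
                      rng=random.Random(0))
)
for modality, batch in steps[:3]:
    print(modality, batch)
```

Because no batch ever mixes text and images, the shared parameters $\theta$ are the only place where information from the two corpora can meet.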

Detector Tags
While mask-and-predict pre-training with unaligned data in itself achieves non-trivial performance (Section 5.2), we find it beneficial to provide noisy alignment signals in the form of detector tags. When modeling an image $I$, for each detected region, we append the tag output by the object detector to the input. The detector (Ren et al., 2015) is pre-trained on a general object detection dataset (Krishna et al., 2017; Anderson et al., 2018), and the tags are essentially a bag of words that provides noisy grounding signals to the model. During pre-training, we apply the mask-and-predict objective to the tags, which further encourages grounding. We process the detector tags as a subword sequence $d_{1:l}$ with spatial coordinates. Every tag subword is embedded as the sum of its token embedding and a spatial coordinate embedding. The token embedding is the same as the token embedding used in text modeling, while the spatial coordinate embedding is the same as the coordinate embedding of the corresponding region. The coordinate embedding allows the model to distinguish tags from different regions. With the detector tags added, the image $I$ is embedded as a sequence of image region features $r_{1:m}$ followed by a sequence of detector tag embeddings $d_{1:l}$, i.e., $I = [r_{1:m}; d_{1:l}]$. The tags are added during both pre-training and fine-tuning. Further, during pre-training, certain tag subwords are masked and the tag reconstruction loss $L^{\text{tag}}_I$ supervises the model to predict the masked tags. The tags are predicted just as masked subwords are predicted in text modeling. The prediction softmax layer is shared between the tag and text subwords.
The parameters involved in modeling tags include the token embedding, the coordinate embedding, and the subword softmax embedding. These embedding parameters are shared across modalities and encourage the model to project text, visual, and tag representations into the same space (see Section 5.2 for an example). This resembles the design in multi-lingual language models, which use shared BPE embeddings and softmax weights across languages (Wu and Dredze, 2019).
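The way regions and tags are assembled into one visual input can be sketched with toy vectors. Everything named here is an assumption made for illustration: `token_emb` and `coord_emb` stand in for learned parameters, whitespace splitting stands in for subword tokenization, and the 4-dimensional vectors are arbitrary. The sketch only shows that each tag subword reuses the coordinate embedding of its region and that tags are appended after all regions.

```python
import random

DIM = 4
rng = random.Random(0)

def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(DIM)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Toy lookup table standing in for the token embedding shared with text (hypothetical).
token_emb = {tok: rand_vec() for tok in ["cake", "dining", "table"]}

def coord_emb(box):
    # Stand-in for a learned projection of (x1, y1, x2, y2) to DIM values.
    return [float(c) for c in box]

def embed_image(regions):
    """regions: list of dicts with 'feat', 'box', and 'tag' (a detected label)."""
    region_embs, tag_embs = [], []
    for region in regions:
        coords = coord_emb(region["box"])
        region_embs.append(add(region["feat"], coords))
        # A multi-word tag yields several entries that share the region's coordinates.
        for subword in region["tag"].split():
            tag_embs.append(add(token_emb[subword], coords))
    return region_embs + tag_embs  # I = [r_1:m ; d_1:l]

regions = [
    {"feat": rand_vec(), "box": (0.1, 0.2, 0.5, 0.6), "tag": "cake"},
    {"feat": rand_vec(), "box": (0.0, 0.4, 1.0, 1.0), "tag": "dining table"},
]
sequence = embed_image(regions)
print(len(sequence))  # 5 = 2 regions + 3 tag subwords
```

Since the tag "dining table" spans two subwords, both entries carry the same coordinate component, which is what lets the model attribute them to the same region.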

Experiment
As the domain and quality of data may affect the model performance, the conventional practice in unsupervised learning is to use aligned corpora without providing alignments, allowing for controlled comparison with a supervised model. For example, unsupervised machine translation creates unaligned corpora by splitting up parallel corpora (Lample et al., 2018), while unsupervised image captioning (Gu et al., 2019) creates an unaligned corpus by shuffling images and captions from MSCOCO (Chen et al., 2015). Following prior work, we first conduct experiments by using Conceptual Captions (CC) (Sharma et al., 2018) as the source of images and text for both the supervised and unsupervised models. Later in Section 5.1, we show that our method is effective when the images and captions are collected independently and when no caption text is used.

U-VisualBERT
The model is pre-trained with shuffled captions and images. At each training step, we sample either a batch of images or a batch of text. Following VL-BERT (Su et al., 2019), we find it beneficial to include BookCorpus (Zhu et al., 2015), a general-domain text corpus, during pre-training. In sum, U-VisualBERT is trained on 3M images from CC, 3M captions from CC, and 2.5M text segments from BookCorpus.
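Discarding the alignment of a paired corpus amounts to shuffling the two sides independently. The following is a minimal sketch with a hypothetical helper `split_unaligned` and toy data; it is not the actual preprocessing script.

```python
import random

def split_unaligned(pairs, rng):
    """Turn aligned (image, caption) pairs into two independently shuffled corpora."""
    images = [image for image, _ in pairs]
    captions = [caption for _, caption in pairs]
    rng.shuffle(images)
    rng.shuffle(captions)
    return images, captions

pairs = [(f"img_{i}", f"caption {i}") for i in range(5)]
images, captions = split_unaligned(pairs, random.Random(0))
print(images)
print(captions)
```

After the split, the position of an image in its list carries no information about which caption it originally came with, while the content of both corpora is unchanged.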

S-VisualBERT
We introduce a Supervised VisualBERT (S-VisualBERT) trained with aligned data as introduced in Section 3.1. S-VisualBERT is pre-trained on 3M caption-image pairs from CC and 2.5M text segments from BookCorpus.

Compared Models
Additionally, we list the performance of a Base VisualBERT that is initialized from BERT and does not undergo further pre-training. Previously reported supervised models that are trained on CC are also listed, including ViLBERT, VL-BERT, and UNITER. For UNITER, we include the version that is trained only on CC (UNITER_cc). Although their network architectures differ from ours and cannot be directly compared, they jointly paint a picture of the performance we should expect from pre-training on CC. Models developed before BERT are also listed.

Setup
For all the VisualBERT variants introduced in the paper, we initialize them from BERT base and pre-train for 10 epochs on their respective pre-training datasets with a batch size of 144. All models can be trained within 3 days on 4 V100s, each with 16GB of memory. We use the Adam optimizer (Kingma and Ba, 2015) with a linearly decayed learning-rate schedule (Devlin et al., 2019a) and a peak learning rate of $6 \times 10^{-5}$. We conduct evaluations by fine-tuning on four downstream tasks: Visual Question Answering (VQA 2.0), Natural Language for Visual Reasoning (NLVR2) (Suhr et al., 2019), Image Retrieval (Flickr30K), and Referring Expression (RefCOCO+). We use a Faster R-CNN pre-trained on the Visual Genome dataset to extract region features (Anderson et al., 2018). For each task, we follow the recommended setting in previous works. For details, please refer to the appendix.
Note: For models initialized from BERT, we do not count the BERT pre-training data. VL-BERT uses both BookCorpus and Wikipedia during V&L pre-training. We estimate that the two corpora roughly have 50M segments with 64 words per segment. With a different pre-processing style (e.g., longer segments), the number of segments may change.

Results
To control for randomness, we report the means and standard deviations of U-VisualBERT and S-VisualBERT across three runs. U-VisualBERT outperforms the Base model on all benchmarks, while lagging only slightly behind S-VisualBERT on VQA, NLVR2, and RefCOCO+. U-VisualBERT even surpasses or rivals some supervised models (e.g., ViLBERT on VQA and RefCOCO+, VL-BERT on RefCOCO+, and UNITER_cc on RefCOCO+). This shows that a model trained through unsupervised pre-training can perform comparably with supervised models.
On Flickr30K Image Retrieval, the difference between U-VisualBERT and S-VisualBERT is more evident. The task focuses on identifying whether an image and a text segment are coherent. S-VisualBERT is provided with explicit signals for such a task through the "text-image match" objective $L_M$ during pre-training (Section 3.1). While U-VisualBERT is not provided with such explicit signals, it still performs better than the Base model. Further, if we remove the explicit signal (i.e., the "text-image match" objective) when pre-training on aligned data, S-VisualBERT without $L_M$ achieves only 57.98 on R@1, much closer to U-VisualBERT.
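The R@1 metric quoted for Flickr30K is the percentage of caption queries for which the ground-truth image is ranked first; more generally, R@K counts a query as a hit when the ground-truth image appears among the top K candidates. A minimal sketch, with a hypothetical function name and toy rankings:

```python
def recall_at_k(rankings, gold, k=1):
    """Percentage of queries whose gold item appears in the top-k of its ranking."""
    hits = sum(1 for ranking, target in zip(rankings, gold) if target in ranking[:k])
    return 100.0 * hits / len(gold)

rankings = [
    ["img_3", "img_1", "img_2"],  # gold img_3 is ranked first
    ["img_2", "img_1", "img_3"],  # gold img_1 is ranked second
    ["img_1", "img_3", "img_2"],  # gold img_2 is ranked third
    ["img_4", "img_2", "img_1"],  # gold img_4 is ranked first
]
gold = ["img_3", "img_1", "img_2", "img_4"]
print(recall_at_k(rankings, gold, k=1))  # 50.0
print(recall_at_k(rankings, gold, k=2))  # 75.0
```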

Analysis
In this section, we analyze the effect of the text data and the role of the detector tags.

The Effect of Text Data
The assumption behind unsupervised pre-training is that the detector tags should appear both in the images and text corpus, serving as the grounding anchor points. When the images and captions come from the same corpus, such an assumption clearly holds, and unsupervised pre-training works well (Section 4). However, we are curious whether such an assumption still holds 1) if images and captions come from independently collected corpora (U-VisualBERT_SBU) and 2) if no caption text but only general-domain text is provided (U-VisualBERT_NC).
The latter setting bears great practical value. Conceptually, collecting caption-style text could be as hard as collecting image-caption data, as images and captions seldom appear separately. It is desirable to explore training V&L representations without caption-style text. Thus we experiment with pre-training on general-domain text, which could be easier to collect.

U-VisualBERT_SBU
We use 3M images from CC and 1M captions from SBU Captions (Ordonez et al., 2011). To compensate for the different amounts of text between CC and SBU, we upsample the BookCorpus so that the amount of text data used by U-VisualBERT_SBU is roughly the same as that used by U-VisualBERT.
Table 3: Detector tags show a larger impact in the unsupervised setting (U-VisualBERT_NT vs. U-VisualBERT) than in the supervised setting (S-VisualBERT_NT vs. S-VisualBERT). Semi-supervised pre-training (H-VisualBERT) shows marginal improvement over supervised pre-training (S-VisualBERT).

U-VisualBERT_NC
The model is trained on images from CC and text from BookCorpus, a general-domain corpus.

Results
Unsupervised pre-training is effective in both scenarios (Table 1). When the pre-training images and text are collected independently, U-VisualBERT_SBU achieves performance similar to that of U-VisualBERT, with the latter higher on VQA and the former higher on the other three tasks. When no caption text is used, the performance on NLVR2 and RefCOCO+ remains unaffected while the performance on VQA and Flickr30K drops slightly, potentially because the language style of VQA and Flickr30K is similar to that of captions, which benefits U-VisualBERT. Such results are not surprising. In general-domain corpora like Wikipedia, grounded words make up a sizable portion (>25%) (Tan and Bansal, 2020). Thus the tags appear in the pre-training text corpora with non-trivial frequency and U-VisualBERT_NC learns from such signals. The above results suggest the applicability of unsupervised pre-training to many language-only and image-only datasets, which are easier to collect than image-caption datasets (Trinh and Le, 2018; Sun et al., 2017).

The Detector Tags as Anchor Points
We study the effect of the detector tags in unsupervised and supervised pre-training, respectively.

U-VisualBERT_NT
U-VisualBERT_NT observes no tags and only dense region features for image embeddings during pre-training and fine-tuning. For comparison, a base model without tags is introduced (Base_NT), which is initialized from BERT and does not undergo further pre-training.

S-VisualBERT_NT
To study the effect of the detector tags when aligned data are present, we introduce S-VisualBERT_NT, which is trained on aligned data but observes no tags for image embeddings.

Result
We first find that even without tags, unsupervised pre-training benefits downstream tasks (Table 3). U-VisualBERT_NT outperforms Base_NT on all metrics by a large margin. We attribute this to the (unaligned) contextual V&L representation learned through pre-training. This resembles the observation in multi-lingual language models that the shared vocabulary across languages (i.e., anchor points) is not necessary for cross-lingual transfer (Conneau et al., 2020).
Further, while the detector tags are beneficial for both supervised and unsupervised pre-training, the performance improvement is more evident for the latter. For example, the performance difference on VQA between U-VisualBERT and U-VisualBERT_NT is 0.95 (70.82 vs. 69.87), while the difference between S-VisualBERT and S-VisualBERT_NT is 0.41 (70.90 vs. 70.49). The results are expected. When aligned data are present, object tags serve as additional signals, while in unsupervised pre-training they serve as the only source from which grounding is learned.

Visualization
To gain a direct sense of how the detector tags help bridge the modalities, we visualize the contextual representation spaces of S-VisualBERT, U-VisualBERT, and U-VisualBERT_NT in Figure 2. For each of the 15 most frequent object classes in the COCO dataset (Chen et al., 2015), we randomly sample at most 50 instances, take the last-layer contextual representations of the words, the objects, and the tags (when available), and visualize them with t-SNE (Maaten and Hinton, 2008). We highlight the representations of six selected classes.
Figure 2: The tags help to fuse text and visual representations for S-VisualBERT and U-VisualBERT. In U-VisualBERT_NT, common structures emerge in the text and visual representation spaces even though they are not aligned.
Though trained without aligned data, U-VisualBERT can group text, tag, and visual representations by their semantic classes. Similar phenomena can be observed in S-VisualBERT. U-VisualBERT_NT, lacking any signal to align the two spaces, does not show signs of such behaviour. In U-VisualBERT_NT, text and visual representations are almost completely separated (e.g., the two disjoint red rectangles in the figure on the right). However, some common structures emerge in both modalities. For instance, representations for "car", "truck", and "motorcycle", three semantically related classes, are close to each other in both the textual and visual modalities (the red rectangles); representations for "cup", "bottle", and "bowl" are close (the blue rectangles). This also holds for the other two models and resembles what is observed in Li et al. (2020b) and Ilharco et al. (2020).

Semi-Supervised Pre-Training
Unsupervised pre-training in itself has great practical and research value in many domains where aligned data is scarce. As a byproduct, we wonder if the approach could find its use in a semi-supervised setting, where we pre-train a model with both aligned data and unaligned data.

H-VisualBERT
We introduce a hybrid model that is trained on the 3M aligned data from Conceptual Captions (CC) and additional unaligned 1.7M images from Open Images (OI) (Kuznetsova et al., 2020). When a training sample comes from CC, we provide the model with a text-image pair, and when the training sample comes from OI, we provide only the image. We do not use any manually annotated visual labels provided in OI.
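The mixing of aligned and unaligned samples can be sketched as a stream of inputs in which the text slot is empty for images that come without captions. The helper `hybrid_stream`, the toy data, and the uniform sampling over the combined pool are assumptions made for illustration; the text only specifies what the model receives for each kind of sample.

```python
import random

def hybrid_stream(aligned_pairs, extra_images, num_samples, rng):
    """Yield dicts with 'text' and 'image'; 'text' is None for unaligned images."""
    pool = [(text, image) for text, image in aligned_pairs]
    pool += [(None, image) for image in extra_images]
    for _ in range(num_samples):
        text, image = rng.choice(pool)  # assumed: uniform over the combined pool
        yield {"text": text, "image": image}

aligned_pairs = [("a slice of cake", "cc_0"), ("two dogs playing", "cc_1")]
extra_images = ["oi_0", "oi_1", "oi_2"]
samples = list(hybrid_stream(aligned_pairs, extra_images, 20, random.Random(0)))
for sample in samples[:3]:
    print(sample)
```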

Result
We control for randomness by running H-VisualBERT three times and reporting the means and standard deviations. We observe that H-VisualBERT brings consistent improvement over S-VisualBERT on most tasks (Table 3) except Flickr30K. This preliminary result is promising, as the dataset scale in this experiment is relatively small (million-scale). Meanwhile, unannotated data generally cannot significantly improve upon a model trained with annotated data unless drastically scaled up (He et al., 2020). We leave large-scale experiments to future work.

Conclusion
In this paper, we explore unsupervised pre-training with unaligned data. We conduct mask-and-predict pre-training on textual data and visual data and the detector tags are used as anchor points to bridge the two modalities. Experiments show that unsupervised pre-training can achieve performance similar to supervised pre-training.

Ethical Considerations
One caveat of the proposed method is that data collected from the web may contain biases (Zhao et al., 2017), toxic content (Schmidt and Wiegand, 2017), and other ethical issues. This problem is common to ML models, and we stress that de-biasing and a rigorous examination are needed before deploying the system.

VQA
Given an image and a question, the task is to correctly answer the question. We use VQA 2.0 with the Karpathy split for training and validation (Karpathy and Fei-Fei, 2015). We fine-tune with a binary cross-entropy loss. The model is trained with a batch size of 32 and a peak learning rate of $5 \times 10^{-5}$ over 8 epochs.

NLVR2
NLVR2 involves determining whether a natural language caption is true about a pair of images. While more sophisticated fine-tuning strategies exist (Chen et al., 2020c), we follow LXMERT (Tan and Bansal, 2019) to pair the caption with each image, concatenate the "[CLS]" representations of the two pairs, and build a classifier on top. We find it beneficial to conduct a moderate amount of "task-specific pre-training", where we use the data from the dataset to conduct mask-and-predict pre-training as suggested by VisualBERT. We conduct task-specific pre-training for at most 5 epochs and fine-tune from the epoch with the best validation LM loss. Fine-tuning is conducted for 8 epochs with a batch size of 32 and a peak learning rate of $2 \times 10^{-5}$.

Flickr30K
The task of image retrieval involves finding the corresponding image from a collection of images given a caption. We follow the split of prior work and use 1,000 images each for validation and test, training on the rest of the dataset. During fine-tuning, we follow UNITER (Chen et al., 2020c) and sample two negative text-image pairs along with a positive sample. We train for 5K steps with a batch size of 8 and a peak learning rate of $5 \times 10^{-5}$.
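The sampling of one positive and two negative pairs can be sketched as follows. This is an illustration under assumptions: the function name `retrieval_examples` and the toy data are hypothetical, and negatives are formed here by keeping the caption and swapping in the image of another pair, which is only one possible way to build a negative.

```python
import random

def retrieval_examples(pairs, index, rng, num_negatives=2):
    """Return one positive (caption, image, 1) followed by `num_negatives` negatives."""
    caption, image = pairs[index]
    examples = [(caption, image, 1)]
    others = [i for i in range(len(pairs)) if i != index]
    for other in rng.sample(others, num_negatives):
        # Assumed: a negative keeps the caption and uses another pair's image.
        examples.append((caption, pairs[other][1], 0))
    return examples

pairs = [(f"caption {i}", f"img_{i}") for i in range(5)]
triplet = retrieval_examples(pairs, 2, random.Random(0))
for example in triplet:
    print(example)
```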

RefCOCO+
The referring expression task involves locating an image region given a natural language phrase. We follow ViLBERT and conduct evaluation on the RefCOCO+ dataset. We use the bounding box proposals provided by Yu et al. (2018). For each box proposal, the model is trained to classify whether it matches the reference phrase. A proposal box is considered correct if its IoU with the gold box is larger than 0.5. We train for 12 epochs with a batch size of 32 and a peak learning rate of $5 \times 10^{-5}$.
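The correctness criterion for a proposal can be made concrete with a short intersection-over-union computation. The box format (x1, y1, x2, y2) and the function names are assumptions made for this sketch; the threshold of 0.5 and the strict inequality follow the description above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(proposal, gold, threshold=0.5):
    # The IoU must be strictly larger than the threshold.
    return iou(proposal, gold) > threshold

gold = (0, 0, 10, 10)
print(iou((0, 0, 10, 5), gold))         # 0.5, which does not pass the strict threshold
print(is_correct((0, 0, 10, 8), gold))  # True, since the IoU is 0.8
```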
The other datasets we used, including Conceptual Captions, Open Images, VQA, NLVR2, Flickr30K, and RefCOCO+, are publicly available.