Cross-lingual Cross-modal Pretraining for Multimodal Retrieval

Recent pretrained vision-language models have achieved impressive performance on cross-modal retrieval tasks in English. Their success, however, heavily depends on the availability of large annotated image-caption datasets for pretraining, while in downstream applications the texts are not necessarily in English. Although machine translation (MT) tools can translate non-English text into English, the resulting performance still largely depends on MT quality and may suffer from high latency in real-world applications. This paper proposes a new approach to learning cross-lingual cross-modal representations for matching images with their relevant captions in multiple languages. We seamlessly combine cross-lingual and cross-modal pretraining objectives in a unified framework that learns images and text in a joint embedding space from available English image-caption data together with monolingual and parallel text corpora. We show that our approach achieves SOTA performance in retrieval tasks on two multimodal multilingual image-caption benchmarks: Multi30k with German captions and MSCOCO with Japanese captions.


Introduction
Recent pretrained vision-language models (Su et al., 2020; Luo et al., 2020) based on the Transformer (Vaswani et al., 2017) have achieved remarkable performance on cross-modal retrieval (Yu et al., 2021b), image captioning, and visual question answering (VQA) (Su et al., 2020) tasks in English. For instance, most leading competitors in the VQA challenge (https://visualqa.org/roe.html) rely on Transformer-based pretrained vision-language models.
However, their success heavily depends on the availability of large amounts of annotated image-caption pretraining data (e.g., Conceptual Captions (Sharma et al., 2018)). In reality, such data are limited in other languages. When generalizing to cross-lingual cross-modal downstream tasks, a straightforward approach is to use machine translation (MT) tools to translate non-English text into English and reuse the models fine-tuned in English. Nevertheless, the performance strongly relies on the MT tool's capability and suffers from high latency in real-world applications.
To learn multilingual multimodal representations, recent researchers have utilized multilingual datasets to model images and text captions in a joint embedding space. Based on how the shared feature space is learned, there are two categories: word-level alignments (Mohammadshahi et al., 2019) and sentence-level alignments (Wehrmann et al., 2019; Rajendran et al., 2016). These models can capture a certain level of semantic similarity among languages and images. However, they only model the relevance of text and images in a global manner, which may prevent them from effectively detecting relevance locally.
In parallel, cross-lingual language models such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), as well as pretrained vision-language models (Su et al., 2020), have become prevalent in bridging different languages and modalities. These models use the Transformer (Vaswani et al., 2017) architecture, pretrained simultaneously on multiple languages or on image-caption pairs, to construct an encoder, and then fine-tune the encoder on downstream applications with task-specific objectives. The whole process enables sufficient interaction across languages and modalities via cross-attention. However, current cross-lingual models and cross-modal models are trained separately on multilingual corpora and English-caption data. Hence the resulting pretrained models are not directly applicable to downstream cross-modal tasks involving non-English languages.

MLM: masked language modeling task
TLM: translation language modeling task
MRC: masked region classification task
CLTR: cross-lingual text recovery task
CMTR: cross-modal text recovery task (introduced in this paper)

This paper proposes a cross-lingual cross-modal pretraining framework to learn a language-invariant representation across image and text modalities. We hypothesize that introducing pretraining tasks involving different languages and modalities, and modeling the interaction among them, leads to a more powerful joint representation that generalizes well to downstream tasks. Extending previous vision-language pretraining works (e.g., Su et al. (2020)) that learn parameters solely from English image-caption data, we introduce monolingual and parallel corpora involving other languages to further refine the shared latent space.
In Figure 1, we provide a skeleton of our pretraining framework, which is built on top of vision-language BERT models (Su et al., 2020) with more pretraining tasks and data sources. In particular, we use masked language modeling (MLM) (Devlin et al., 2019) on the monolingual text corpus, and translation language modeling (TLM), adopted from XLM (Conneau and Lample, 2019), on the parallel text corpus. We follow standard vision-language pretraining models for the English image-caption data and use MLM on text captions and masked region classification (MRC) on image regions. In addition, motivated by the success of the cross-lingual text recovery (CLTR) task in Unicoder (Huang et al., 2019), we propose a cross-modal text recovery (CMTR) task. Like CLTR, CMTR leverages the attention matrix between image-caption pairs to learn the alignment between words and regions of interest in images.
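As a concrete illustration of the MLM corruption applied to the text streams, the following is a minimal sketch of the standard BERT 80/10/10 masking recipe; the `[MASK]` id, masking rate, and vocabulary size are illustrative placeholders, not the authors' exact implementation:

```python
import random

MASK_ID = 103  # illustrative [MASK] token id


def mask_tokens(token_ids, mask_prob=0.15, vocab_size=105879, seed=0):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random token, keep 10% unchanged.
    Labels hold the original token at corrupted positions, -100 elsewhere
    (the conventional ignore index for the cross-entropy loss)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i in range(len(inputs)):
        if rng.random() < mask_prob:
            labels[i] = inputs[i]
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token, but still predict it
    return inputs, labels
```

The same corruption is applied to the caption side of image-caption pairs (vision-grounded MLM) and to the monolingual text corpora.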
We performed text-to-image and image-to-text retrieval on two multimodal multilingual image-caption benchmarks: Multi30k (English and German captions) and MSCOCO (English and Japanese captions). We achieve SOTA results on retrieval tasks involving Japanese and German, compared with a machine translation baseline and other recently published works.

Vision-language Pretrained Model
Recently, BERT-based (Devlin et al., 2019) vision-language pretraining models (Su et al., 2020; Luo et al., 2020) have emerged. In these models, pretraining typically consists of three types of tasks: 1) masked language modeling, 2) masked region modeling, and 3) text-image matching. By exploiting cross-modal attention and pretraining on large-scale datasets, cross-modal BERT methods have achieved state-of-the-art performance on many text-vision understanding tasks. Nevertheless, all the above models deal with a single language (English) and the image or video domain.

Cross-lingual Pretrained Model
Cross-lingual pretrained language models (Devlin et al., 2019; Conneau and Lample, 2019) are capable of simultaneously encoding texts from multiple languages. Most notably, multilingual BERT (Devlin et al., 2019) uses the same model structure and training objective as BERT but was pretrained on Wikipedia text covering more than 100 languages. The XLM model (Conneau and Lample, 2019) is pretrained with MLM and TLM to take advantage of parallel sentence resources where available. Evaluations on a series of cross-lingual transfer tasks (Fei and Li, 2020; Yu et al., 2021a) have shown that these cross-lingual LMs have significant utility for transferring knowledge between languages. Therefore, we propose integrating cross-lingual pretraining tasks with vision-language pretraining to obtain a universal multilingual multimodal representation.

Methodology
Our framework adopts the network structure of VL-BERT (Su et al., 2020). VL-BERT is a single-stream cross-modal model that concatenates word features from the text with bounding-box features from the image and feeds the concatenated sequence into a series of Transformer blocks.

Pretraining tasks
Both the vision-grounded masked language model (MLM) and the text-grounded masked region classification (MRC) task on image-caption data are used in our model by default, as they have shown strong performance in VL-BERT (Su et al., 2020). Since we introduce auxiliary multilingual text corpora, we also apply MLM to the texts in other languages by default. Motivated by Unicoder (Huang et al., 2019), which shows that pretrained models can be further improved by involving more tasks, we introduce two additional cross-lingual pretraining tasks and one cross-modal task to improve performance.

Cross-modal Text Recovery. This task (CMTR) is motivated by the multilingual pretraining model Unicoder (Huang et al., 2019). As shown in Figure 2, CMTR takes image-caption pairs as input, but it does not use the original caption words directly. Instead, it computes an alignment between word features and bounding-box features extracted by external tools (e.g., Faster R-CNN (Anderson et al., 2018)), and uses the attended features to simultaneously recover all input words. In particular, let (B, E) be an image-caption input pair, where B = (b_1, b_2, ..., b_n) are bounding-box feature embeddings and E = (e_1, e_2, ..., e_m) are word embeddings. CMTR first calculates an attended representation for the caption words from the bounding-box features as ê_i = Σ_{j=1}^{n} ã_{ij} b_j, where ã_{ij} = softmax(A_{i,:})[j], b_j ∈ R^h, e_i ∈ R^h, and h denotes the embedding dimension. A ∈ R^{m×n} is the attention matrix calculated by bi-linear attention as A_{ij} = e_i^T W b_j, where W is a trainable parameter. Finally, we take Ê = tanh((ê_1, ê_2, ..., ê_m)) as input and predict the original caption words. The objective function is

    L_CMTR = Δ(d(e(Ê)), E),    (1)

where Δ(·, ·) is the sum of token-level cross-entropy losses, e(·) is the encoder component including the input layer, the attention layer, and the Transformer layers, and d(·) is the decoder applied to the output of the Transformer, a linear projection layer shared with the other MLM tasks and the CLTR task introduced below.
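The attend-and-recover step can be sketched as follows (a simplified numpy sketch under our reading of the equations above; feature extraction, the encoder e(·), and the decoder d(·) are omitted, and all shapes are illustrative):

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)


def cmtr_attended_input(E, B, W):
    """E: (m, h) word embeddings; B: (n, h) bounding-box features;
    W: (h, h) trainable bi-linear attention weights.
    Returns Ê, the tanh of the attention-weighted box features (one row
    per caption word), and the row-stochastic attention matrix ã."""
    A = E @ W @ B.T               # (m, n) bi-linear scores A_ij = e_i^T W b_j
    A_tilde = softmax(A, axis=1)  # row-wise softmax over the n boxes
    E_hat = np.tanh(A_tilde @ B)  # (m, h) attended representation Ê
    return E_hat, A_tilde
```

Ê is then fed through the shared encoder and the linear decoder to predict the original caption tokens under the token-level cross-entropy loss of Eq. (1).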
Cross-lingual Text Recovery. This task (CLTR) is adopted from Unicoder (Huang et al., 2019). It takes a pair of parallel sentences (X, Y) and lets the pretrained model learn the underlying word alignments between the two languages. Similar to CMTR, we use the bi-linear attention mechanism to compute an attended representation X̂ for the input sentence X in the source language from its parallel sentence Y, and then try to recover X from the attended input X̂. In the CLTR task, we optimize the same objective function as in Eq. (1). Note that CLTR and CMTR do not share attention parameters, since there is still a large modality gap between text and images before applying cross-attention.
Translation Language Model. This task (TLM) is adopted from XLM (Conneau and Lample, 2019). It takes as input a pair of parallel sentences in different languages with randomly masked tokens. The model is trained to predict the masked tokens by attending to both the local context and the distant context in the other language. Interested readers may refer to Conneau and Lample (2019) for more details about its objective function.
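A TLM training instance can be built roughly as follows (a hedged sketch: the real XLM implementation additionally resets position ids per sentence and adds language embeddings, which we omit; the `[MASK]` id and rate are illustrative):

```python
import random

MASK_ID = 103  # illustrative [MASK] token id


def make_tlm_example(src_ids, tgt_ids, mask_prob=0.15, seed=0):
    """Concatenate a parallel sentence pair and mask tokens on both sides,
    so each masked word can be predicted from either language's context.
    Labels hold the original token at masked positions, -100 elsewhere."""
    rng = random.Random(seed)
    inputs = list(src_ids) + list(tgt_ids)
    labels = [-100] * len(inputs)
    for i in range(len(inputs)):
        if rng.random() < mask_prob:
            labels[i] = inputs[i]
            inputs[i] = MASK_ID
    return inputs, labels
```

Because source and target tokens share one attention window, recovering a masked German word, say, can rely directly on its English counterpart, which is what aligns the two languages.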

Fine-tuning for Cross-modal Retrieval
For fine-tuning, we minimize a triplet ranking loss. To boost performance, we use the hard negative mining strategy from SCAN (Lee et al., 2018). For each text query, there is only one positive image sample and the rest are negative. Denoting a mini-batch of training samples by {(q_i, I_i)}_{i=1}^{K}, where a query q_i is only relevant to the image I_i, we penalize only the hardest negative image in the mini-batch:

    L_query = Σ_{i=1}^{K} [m − R(q_i, I_i) + max_{j≠i} R(q_i, I_j)]_+,

where m is the margin, set to 0.2 by default, and [x]_+ = max(0, x) is a clip function. R(q, I) is the function that evaluates the similarity between query q and image I, parameterized by u and b:

    R(q, I) = u^T e(q, I) + b,

where e(q, I) denotes the encoder output for the concatenated query-image pair. On the other hand, for each image, we penalize only the hardest negative query in the mini-batch:

    L_image = Σ_{i=1}^{K} [m − R(q_i, I_i) + max_{j≠i} R(q_j, I_i)]_+.
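Given a precomputed K×K similarity matrix, the two hardest-negative losses can be sketched as follows (a numpy sketch, not the authors' code; `S[i, j]` stands for R(q_i, I_j), so the diagonal holds the positive pairs):

```python
import numpy as np


def hard_negative_triplet_loss(S, margin=0.2):
    """S: (K, K) similarity matrix with S[i, j] = R(q_i, I_j).
    For each query, penalize the hardest negative image (row-wise max
    off the diagonal); symmetrically, for each image penalize the
    hardest negative query (column-wise max off the diagonal)."""
    K = S.shape[0]
    pos = np.diag(S)                    # R(q_i, I_i)
    off_diag = ~np.eye(K, dtype=bool)   # mask out the positives
    hard_img = np.where(off_diag, S, -np.inf).max(axis=1)  # per query
    hard_qry = np.where(off_diag, S, -np.inf).max(axis=0)  # per image
    loss_q = np.maximum(0.0, margin - pos + hard_img).sum()
    loss_i = np.maximum(0.0, margin - pos + hard_qry).sum()
    return loss_q + loss_i
```

The loss is zero whenever every positive pair beats its hardest in-batch negative by at least the margin, so training focuses on the violating pairs only.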

Experiment
For pretraining, we utilize two public English image-caption datasets: SBU Captions (Ordonez et al., 2011) and Conceptual Captions (Sharma et al., 2018). Due to broken URLs, we collected only around 3.7M text-image pairs in total. For the monolingual (en, de, ja) text and parallel (en-de) corpora, we randomly sample 20M sentences from Wikipedia and 9M parallel sentences from the MultiUN corpus. We also collected 2.8M en-ja parallel sentences from Pryzant et al. (2018).

Experiment Setting
We use the uncased multilingual BERT (Devlin et al., 2019) to initialize our model, which has 12 layers of Transformer blocks. Each block has 768 hidden units and 12 self-attention heads, and the vocabulary size is 105,879. The maximum sequence length is set to 64. Following prior work, we detect 100 bounding boxes per image using Faster R-CNN (Anderson et al., 2018) pretrained on Visual Genome (Krishna et al., 2017).
Our pretraining is conducted on 16 NVIDIA V100 GPUs (16GB memory each), and fine-tuning on 8 NVIDIA V100 GPUs. We use FP16 to speed up training and reduce memory usage. We use the Adam optimizer (Kingma and Ba, 2015) and set the batch size per GPU to 16. The initial learning rate is 1e-5. We pretrain the model for 50 epochs and select the fine-tuned retrieval model based on the average of R@{1,5,10} on the validation set. We repeat our experiments five times and report the average metrics on the test set.

Baselines
We compare our models with several recent competitive methods. VL-BERT (Su et al., 2020) and Unicoder-VL are two well-known vision-language BERT-based models. For VL-BERT, we reproduce the English results by fine-tuning their official pretrained model, and generate the non-English results from their released code following the same configuration as ours. For Unicoder-VL, we adopt the English results reported in their paper. Besides pretraining-based models, we also compare several other methods, including the cross-attention-based model SCAN (Lee et al., 2018), the multilingual word-embedding-alignment-based model AME (Mohammadshahi et al., 2019), and the multilingual sentence-alignment-based model LIME (Wehrmann et al., 2019). We directly use the performance of SCAN, AME, and LIME reported in their papers. Finally, we compare with a machine translation baseline, "Translate-test", which translates the test data in Japanese or German into English using Google Translate and then evaluates the fine-tuned English VL-BERT retrieval model. Compared with Unicoder-VL, our model performs slightly worse on English retrieval but obtains better results than VL-BERT. A possible reason is that Unicoder-VL is initialized with English BERT, which is specifically optimized for English.

Experimental Results
The benefit of our model for cross-modal retrieval tasks involving non-English languages is demonstrated in Table 3. We first observe that the machine translation baseline "Translate-test" achieves better results than VL-BERT pretrained with only the MLM objective on multilingual corpora and fine-tuned in the target language, proving the importance of aligning different languages.
Moreover, the average recall of "Translate-test" is around 1-2% lower than that of our method. These results indicate that pretraining with additional cross-lingual objectives is more effective than translating the target language into English on these two benchmarks. Although combining more powerful machine translation tools with better fine-tuned English retrieval models may lead to slightly better performance, our method learns a universal representation without depending on external machine translation tools for particular language pairs, making it more suitable for real-world applications. Finally, compared with VL-BERT (Su et al., 2020), which is pretrained only with the MLM task on multilingual corpora, our additional cross-lingual pretraining tasks bring a clear performance improvement.

Ablation Study
To understand the effect of different components, we conduct an ablation study on the test set and report the average Recall@1 in Table 4. Although the cross-lingual pretraining tasks (TLM and CLTR) do not help English-related retrieval much, they contribute more than 1% improvement for Japanese and German. This result matches our expectation, since those tasks effectively link non-English languages with the vision domain using English as a bridge. Among all the components, CMTR consistently contributes around 1 point of improvement.

Conclusion
In this work, we introduce multilingual corpora and three pretraining objectives to improve Transformer-based vision-language models for retrieval tasks. Extensive experiments demonstrate the effectiveness of our contributions on cross-modal retrieval tasks, and detailed ablation studies justify our modeling choices. In future work, we will explore the zero-shot transfer capability of our framework.