Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.


Introduction
One of the key challenges at the intersection of computer vision (CV) and natural language processing (NLP) is building versatile vision-language models that not only work in English, but in all of the world's approximately 7,000 languages. Since collecting and annotating task-specific parallel multimodal data in all languages is impractical, a framework that makes vision-language models generalize across languages is highly desirable.
One technique that has shown promise to greatly improve the applicability of NLP models to new languages is zero-shot cross-lingual transfer, where models trained on a source language are applied as-is to a different language without any additional annotated training data (Täckström et al., 2012; Klementiev et al., 2012; Cotterell and Heigold, 2017; Chen et al., 2018; Neubig and Hu, 2018). In particular, recent techniques for cross-lingual transfer have demonstrated that by performing unsupervised learning of language or translation models on many languages, followed by downstream task fine-tuning using only English annotation, models can nonetheless generalize to a non-English language (Wu and Dredze, 2019a; Lample and Conneau, 2019; Huang et al., 2019a; Artetxe et al., 2020; Hu et al., 2020). This success is attributed to the fact that many languages share a considerable amount of underlying vocabulary or structure. At the vocabulary level, languages often have words that stem from the same origin, for instance, "desk" in English and "Tisch" in German both come from the Latin "discus". At the structural level, all languages have a recursive structure, and many share traits of morphology or word order.
For cross-lingual transfer of vision-language models, the visual information is clearly an essential element. To this end, we make an important yet under-explored step to incorporate visual-textual relationships for improving multilingual models (Devlin et al., 2019; Artetxe et al., 2020). While spoken languages may differ, all humans share similar vision systems, and many visual concepts can be understood universally (Sigurdsson et al., 2020; Zhang et al., 2020). For example, while the same animal is termed "cat" by an English speaker and "chat" by a French speaker, both understand it similarly. We leverage this observation to learn to associate sentences in different languages with visual concepts, promoting cross-lingual transfer of vision-language models.
In this work, we focus on multilingual text-to-video search tasks and propose a Transformer-based video-text model to learn contextual multilingual multimodal representations. Our vanilla model yields state-of-the-art performance in multilingual text→video search when trained with multilingual annotations. However, under the zero-shot setting, rather surprisingly, there is a significant performance gap between English and non-English queries (see §5.5 for details). To resolve this problem, motivated by recent advances in large-scale language model pre-training (Artetxe et al., 2020) and multimodal pre-training (Lu et al., 2019; Patrick et al., 2020), we propose a multilingual multimodal pre-training (MMP) strategy to exploit the weak supervision from large-scale multilingual text-video data. We construct the Multilingual HowTo100M dataset (Multi-HowTo100M), which extends the English HowTo100M dataset to contain subtitles in 9 languages for 1.2 million instructional videos.
Our method has two important benefits. First, compared to pre-training on English-video data only, pre-training on multilingual text-video data exploits the additional supervision from a variety of languages and therefore enhances the search performance for each individual language. Second, by exploiting the visual data as an implicit "pivot" at scale, our method learns better alignments in the multilingual multimodal embedding space (e.g., "cat"--"chat"), which leads to improvements in zero-shot cross-lingual transfer (e.g., from "cat" to "chat") of vision-language models.
In our experiments on VTT and VATEX, our method yields state-of-the-art English→video search performance. For zero-shot cross-lingual transfer, the proposed multilingual multimodal pre-training improves over English-video pre-training by 2∼2.5 average R@1 across 9 languages. Additionally, when trained with in-domain multilingual annotations as other baselines, our method outperforms them by a large margin in multilingual text→video search on VATEX and text→image search on Multi30K (Elliott et al., 2016).
To summarize, we make the following contributions: (1) We propose a Transformer-based video-text model that learns contextual multilingual multimodal representations (§3.1). (2) We empirically demonstrate that vision-language models, unlike NLP models, have limited zero-shot cross-lingual transferability (§5.5). (3) We introduce the multilingual multimodal pre-training strategy and construct a new Multi-HowTo100M dataset (§4) for pre-training to improve the zero-shot cross-lingual capability of vision-language models. (4) We demonstrate the effectiveness of our approach by achieving state-of-the-art multilingual text→video search performance in both the zero-shot (§5.5) and fully supervised setup (§5.6).
Recently, XTREME (Hu et al., 2020) was proposed to evaluate the cross-lingual transfer capabilities of multilingual representations across a diverse set of NLP tasks and languages. However, a comprehensive evaluation of multilingual multimodal models on zero-shot cross-lingual transfer capabilities is still missing. To the best of our knowledge, this is the first work to investigate and improve zero-shot cross-lingual transfer of vision-language models.

Method
We consider the problem of learning multilingual multimodal representations from a corpus C of video-text pairs, where v_i is a video clip and x_i is its corresponding text (caption or transcription) written in one of K languages. Our goal is to learn a shared multilingual text encoder c_x = Φ(x) and a video encoder c_v = Ψ(v), both of which project the input into a shared D-dimensional embedding space, c_v, c_x ∈ R^D, where semantically similar instances (i.e., paired (x_i, v_i)) are closer to each other than dissimilar ones (i.e., (x_i, v_j), i ≠ j). In the following, we denote a sampled batch of multilingual text-video pairs by B.

Multilingual Multimodal Transformers

Figure 1 gives an overview of the proposed method. Our text encoder consists of a multilingual Transformer (e.g., multilingual BERT (Devlin et al., 2019)) and a text Transformer pooling head (explained below). Similarly, our video encoder consists of a 3D-CNN (e.g., an R(2+1)D network) and a video Transformer pooling head. We use these multilingual multimodal Transformers to encode text and video for alignment. Unlike prior multilingual text-image models (Gella et al., 2017; Kim et al., 2020; Huang et al., 2019b) that utilize word embeddings and RNNs, our multilingual text encoder is built on a multilingual Transformer that generates contextual multilingual representations e_x ∈ R^{N×D} to encode a sentence x containing N words. We employ an additional 2-layer Transformer, which we call a "Transformer pooling head (TP)", as it serves as a pooling function to selectively encode variable-length sentences and align them with the corresponding visual content. We use the first output token of the second Transformer layer as the final sentence representation. Precisely, we set c_x = Trans_x(e_x)_0, where Trans_x is a 2-layer stack of Transformers (Vaswani et al., 2017) with e_x as the (query, key, value) in the multi-head attention and (·)_0 denotes the first output token. Note that we use the same text encoder to encode sentences in all languages.
For encoding videos, our model uses pre-trained 3D-CNNs that encode the spatial-temporal context in a video. For an M-second video v, we apply R(2+1)D and S3D networks to its frames, concatenate the network outputs, and apply a linear layer to produce the visual input e_v ∈ R^{M×D} to our model. Similarly to the text side, we employ a 2-layer Transformer as the pooling head to encode videos of different lengths into fixed-length representations. Formally, we set c_v = Trans_v(e_v)_0. Since videos are typically long and have a high frame rate (e.g., 30 fps), it is infeasible to update the 3D-CNNs simultaneously; therefore, we use pre-extracted video features. Our model is parameterized by θ = θ_mBERT ∪ θ_Trans_x ∪ θ_Trans_v.
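As a concrete sketch (not the authors' released code), the Transformer pooling head described above can be written in PyTorch roughly as follows; the dimensions and layer counts follow the paper, while the class and variable names are our own:

```python
import torch
import torch.nn as nn

class TransformerPoolingHead(nn.Module):
    """2-layer Transformer that pools variable-length D-dim features
    into a single D-dim vector by taking the first output token."""
    def __init__(self, dim=1024, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats, pad_mask=None):
        # feats: (batch, seq_len, dim); pad_mask is True at padded positions
        out = self.encoder(feats, src_key_padding_mask=pad_mask)
        return out[:, 0]  # first token of the last layer as the pooled vector

# e.g., pool per-second video features e_v in R^{M x 1024} into c_v in R^{1024}
pool = TransformerPoolingHead()
e_v = torch.randn(2, 12, 1024)  # 2 videos, M = 12 seconds of features
c_v = pool(e_v)                 # shape: (2, 1024)
```

The same module, with token-level features e_x as input, would serve as the text pooling head.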

Multilingual Text-Video Alignment
For learning multimodal representations, the common practice is to minimize a contrastive objective that maps the associated (video, text) embeddings to be near each other in a shared embedding space. The inter-modal max-margin triplet loss has been widely studied in video-text (Yu et al., 2018; Liu et al., 2019) and image-text (Kim et al., 2020; Burns et al., 2020; Huang et al., 2019b) research. In this work, we generalize and model all inter-modal, intra-modal, and cross-lingual instances with a noise contrastive estimation (NCE) objective (Gutmann and Hyvärinen, 2010; van den Oord et al., 2018; Chen et al., 2020b).

Inter-modal NCE. Let X and V denote the subsets of the sampled sentences in multiple languages and videos in B, respectively, and let s(a, b) = a^T b / (‖a‖‖b‖) be the cosine similarity measure. We use an (inter-modal) NCE objective defined as:

L(X, V) = − Σ_{(x_i, v_i)} log [ exp(s(x_i, v_i)/τ) / (exp(s(x_i, v_i)/τ) + Σ_{(x', v') ∈ N} (exp(s(x_i, v')/τ) + exp(s(x', v_i)/τ))) ]   (2)

In inter-modal NCE, L_inter = L(X, V), the noise N is a set of "negative" video-text pairs sampled to enforce that the similarity of paired instances is high and that of non-paired instances is low. Following , we set the negatives of (x_i, v_i) as the other x_j and v_j, j ≠ i, in B.
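A minimal sketch of the inter-modal NCE with in-batch negatives (our own simplification, written in the common symmetric text→video and video→text form; τ is the softmax temperature, set to 0.1 in the experiments):

```python
import torch
import torch.nn.functional as F

def inter_modal_nce(c_x, c_v, tau=0.1):
    """InfoNCE over a batch: (x_i, v_i) are positives; the other
    in-batch texts x_j and videos v_j (j != i) act as the noise set N."""
    x = F.normalize(c_x, dim=-1)
    v = F.normalize(c_v, dim=-1)
    sim = x @ v.t() / tau                 # (B, B) scaled cosine similarities
    targets = torch.arange(x.size(0))
    # cross-entropy along rows (text -> video) and columns (video -> text)
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))

loss = inter_modal_nce(torch.randn(8, 1024), torch.randn(8, 1024))
```

Note that sentences in all languages share the same batch here, mirroring the paper's language-agnostic treatment of X.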
Intuitively, inter-modal NCE draws paired (semantically similar) instances closer and pushes apart non-paired (dissimilar) instances. Note that we do not distinguish language types in X; the sentences in all languages are drawn towards their corresponding videos in the shared multilingual text-video embedding space.

Intra-modal NCE. Beyond cross-modality matching, we leverage an intra-modal contrastive objective to learn and preserve the underlying structure within the video and text modalities. For example, a Corgi should be closer to a Husky than to a Balinese. Prior image-text work (Gella et al., 2017; Huang et al., 2019c) utilizes a triplet loss to maintain such neighborhood relationships. Inspired by recent success in self-supervised image and video representation learning (Yalniz et al., 2019; Ghadiyaram et al., 2019), our model leverages an intra-modal NCE that simultaneously constrains the learned representations to be invariant against noise and maintains the within-modality structure. We minimize the following intra-modal NCE loss:

L_intra = L(X, X_m) + L(V, V_m)   (3)

where X_m and V_m are the noised versions of the original sentences and videos. For noising, we randomly mask 5% of the multilingual text tokens and video clips. We optimize our model by minimizing:

L = L_inter + L_intra   (4)
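The 5% masking used to create the noised views X_m and V_m can be sketched as below (our own illustration; the masked views are then encoded and contrasted with the originals using the same NCE of Eq. 2):

```python
import torch

def make_noised_view(feats, mask_rate=0.05):
    """Randomly zero-mask a fraction of token/clip positions to build
    the noised views X_m / V_m for the intra-modal NCE."""
    keep = (torch.rand(feats.shape[:2], device=feats.device) >= mask_rate)
    return feats * keep.unsqueeze(-1).to(feats.dtype)

e_x = torch.randn(4, 20, 1024)   # token-level text features
e_x_m = make_noised_view(e_x)    # noised view with ~5% of positions zeroed
```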

When Visually-Pivoted Multilingual Annotations Are Available
In many multilingual multimodal datasets, there are sentences in different languages that describe a shared visual context. For example, 10 English and 10 Chinese descriptions are available for each video in VATEX. With these visually-pivoted (weakly paralleled) sentences (x, y), we further revise the contrastive objectives to leverage this additional supervisory signal. Given a visually-pivoted corpus C_p that contains all possible combinations of visually-pivoted pairs (x_i, v_i, y_i), we sample batches B_p ⊂ C_p and revise the inter-modal objective as:

L_inter = L(X, V) + L(Y, V)   (5)

Visually-pivoted Cross-lingual NCE. Inspired by Translation Language Modeling (TLM) in XLM (Lample and Conneau, 2019), we propose a multimodal TLM-like contrastive objective that promotes alignment of descriptions in different languages that describe the same video. We use the intuition that, conditioned on a video, the descriptions in different languages (which need not be translation pairs) would likely be semantically similar. To this end, we set the cross-lingual NCE as:

L_cross = L(X|V, Y|V)   (6)

For visually-pivoted sentences, as shown in Fig. 1, we generate their representations conditioned on the video they describe. We extend the key and value of the multi-head attention with the additional visual content e_v and generate new c_{x|v} and c_{y|v} for matching. Specifically, our model employs c_{x|v} = Trans_x(e_x, [e_x; e_v], [e_x; e_v])_0, and analogously for c_{y|v}. With access to (visually-pivoted) multilingual annotations, we optimize our model by minimizing:

L = L_inter + L_intra + L_cross   (7)

At inference time, we simply apply c_x = Φ(x) and c_v = Ψ(v) to encode multilingual text queries and videos.
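The video-conditioned sentence encoding (extending the keys and values with e_v) can be sketched with a single attention layer standing in for the 2-layer pooling head; the names and shapes here are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

dim, heads = 1024, 4
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

e_x = torch.randn(2, 20, dim)  # contextual token features of a sentence x
e_v = torch.randn(2, 12, dim)  # per-second video features of its video v

# queries come from the text; keys/values are the text extended with e_v
kv = torch.cat([e_x, e_v], dim=1)
out, _ = attn(e_x, kv, kv)
c_x_given_v = out[:, 0]        # pooled video-conditioned representation
```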
For text-to-video search, we sort videos according to their cosine similarity scores to the text query.
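The search step itself reduces to sorting by cosine similarity; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def rank_videos(c_query, c_videos):
    """Return indices of videos sorted by descending cosine similarity
    to a single text-query embedding c_query."""
    sims = F.cosine_similarity(c_query.unsqueeze(0), c_videos, dim=-1)
    return torch.argsort(sims, descending=True)

c_videos = torch.tensor([[1., 0.], [0., 1.], [0.7, 0.7]])
order = rank_videos(torch.tensor([0., 1.]), c_videos)  # best match: video 1
```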

The Multilingual HowTo100M Dataset
As large-scale pre-training has been shown to be important in recent NLP and vision-language models, we construct the Multilingual HowTo100M dataset (Multi-HowTo100M) to facilitate research in multilingual multimodal learning. The original HowTo100M dataset is a large-scale video collection of 1.2 million instructional videos (around 138 million clips/segments) on YouTube, along with their automatic speech recognition (ASR) transcriptions as the subtitles. For each video in HowTo100M, we crawl and collect the multilingual subtitles provided by YouTube, which either consist of user-generated subtitles or those generated by Google ASR and Translate in the absence of user-generated ones. In total, we collect video subtitles in 9 languages: English (en), German (de), French (fr), Russian (ru), Spanish (es), Czech (cz), Swahili (sw), Chinese (zh), Vietnamese (vi).
At the time of dataset collection (May 2020), there are 1.1 million videos available, each with subtitles in 7-9 languages. The video length ranges from 1 minute to more than 20 minutes. We utilize Multi-HowTo100M for multilingual multimodal pre-training to exploit the weak supervision from large-scale multilingual text-video data. In Fig. 2, we provide a visualization of a few instances sampled from Multi-HowTo100M with the corresponding video frame, timestamp, and transcriptions in different languages. Please refer to the Appendix for more details and dataset statistics.

Experiment
In this section, we first describe our experimental setup (§5.1-5.3). In §5.4, we conduct ablation studies to validate the effectiveness of the proposed multilingual text-video model. With the best models at hand, we investigate their zero-shot cross-lingual transferability in §5.5, where we show that the proposed multilingual multimodal pre-training serves as the key facilitator. We then verify the superior text→video search performance of our method under the monolingual, multilingual, and cross-modality settings in §5.6.

Evaluation Datasets
MSR-VTT (VTT) contains 10K videos, where each video is annotated with 20 captions. Additionally, we created pseudo-multilingual data by translating the English captions into 8 languages with off-the-shelf machine translation models. 1 We use the official training set (6.5K videos) and validation set (497 videos). Multi30K provides two types of annotations: (1) one parallel (English, German, French, Czech) translation pair and (2) five English and five German descriptions collected independently. The training, validation, and testing splits contain 29K, 1K, and 1K images, respectively.

Implementation Details
For the video backbone, we use a 34-layer R(2+1)D network pre-trained on IG65M (Ghadiyaram et al., 2019) and an S3D network pre-trained on HowTo100M. We pre-extract video features and concatenate the two 3D-CNN outputs to form e_v ∈ R^{M×1024} as the video input.
For the text backbone, we use multilingual BERT (mBERT) (Devlin et al., 2019) or XLM-RoBERTa-large (XLM-R) (Artetxe et al., 2020), where the latter achieves near state-of-the-art zero-shot cross-lingual transfer performance on NLP tasks. Following Hu et al. (2020), instead of using the top layer, we output the 12-th layer of XLM-R and mBERT. For vision-language tasks, we freeze the layers below the 9-th, as this setup empirically performs best.
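The layer-freezing setup can be sketched generically as below (a stand-in stack replaces the actual mBERT/XLM-R encoder layers; with a HuggingFace-style model the same loop would run over the model's encoder layers):

```python
import torch.nn as nn

def freeze_lower_layers(layers, n_frozen=9):
    """Freeze the bottom n_frozen layers; keep the upper layers trainable."""
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = (i >= n_frozen)

# stand-in for mBERT's 12 Transformer layers (illustrative only)
stack = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
freeze_lower_layers(stack, n_frozen=9)  # layers 0-8 frozen, 9-11 trainable
```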
Our model employs a 2-layer Transformer with 4-head attention for the text and video Transformer pooling (TP) modules. The embedding dimension D is set to 1024. We use the Adam (Kingma and Ba, 2015) optimizer and a 0.0002 learning rate to train our model for 16 (pre-training) and 10 (fine-tuning) epochs. The softmax temperature in all noise contrastive objectives is set to 0.1.

Experimental Setup
We use Multi-HowTo100M for multilingual multimodal pre-training (MMP). For each video, we randomly sample the start and end time to construct a video clip. For each clip, we randomly sample one of the 9 languages each time and use the consecutive ASR transcriptions that are closest in time to compose (text, video) pairs for training. For simplicity and speed, we follow the training protocol of XLM-R to pre-train on a multilingual corpus without using translation pairs, i.e., we use multilingual text-video pairs (x, v) but no translation pairs from Multi-HowTo100M, and utilize only inter- and intra-modal NCE (Eq. 1-3) for MMP.
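The pre-training sampling described above might look as follows (a simplified sketch; the clip-length bounds and the data layout are our assumptions, not the paper's exact values):

```python
import random

LANGS = ["en", "de", "fr", "ru", "es", "cz", "sw", "zh", "vi"]

def sample_pair(video_len_s, subtitles, rng=random):
    """Sample one (clip, text) pair from a Multi-HowTo100M video.
    `subtitles` maps a language code to a list of (start, end, text)
    ASR segments for that video."""
    start = rng.uniform(0, max(video_len_s - 8, 0))
    end = min(start + rng.uniform(4, 16), video_len_s)  # assumed clip bounds
    lang = rng.choice(LANGS)                            # one language per sample
    mid = (start + end) / 2
    # use the subtitle segment closest in time to the sampled clip
    seg = min(subtitles[lang], key=lambda s: abs((s[0] + s[1]) / 2 - mid))
    return (start, end), seg[2]
```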
We fine-tune our model on VTT, VATEX, and Multi30K to evaluate on text→video search tasks. In the zero-shot cross-lingual transfer experiments, we use only English-video data and fine-tune with Eq. 1-3. We then test the model with non-English queries. When annotations in additional languages are available (created by humans in VATEX and Multi30K, and by MT models (i.e., translate-train) in VTT), we utilize all available multilingual annotations (i.e., fully supervised) and iterate over all possible (x, v, y) pairs to train with Eq. 5-7. This serves as a strong performance target for evaluating zero-shot cross-lingual transfer on VTT and allows a fair comparison with other fully-supervised baselines in multilingual text→video search on VATEX and Multi30K. We report the standard recall at k (R@k) metrics (higher is better).
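The R@k metric used throughout can be computed as follows (assuming the ground-truth video of query i sits at index i; we report it here as a fraction rather than a percentage):

```python
import torch

def recall_at_k(sim, k=1):
    """sim: (num_queries, num_videos) similarity matrix whose diagonal
    holds the ground-truth pairs. Returns R@k as a fraction in [0, 1]."""
    topk = sim.topk(k, dim=1).indices              # (num_queries, k)
    gt = torch.arange(sim.size(0)).unsqueeze(1)    # ground-truth column ids
    return (topk == gt).any(dim=1).float().mean().item()

# each query ranks its own video first -> perfect R@1
print(recall_at_k(torch.eye(4)))  # 1.0
```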

Comparison Experiments and Ablations
In this section, we ablate and compare different text/video encoders, Transformer model architectures, and learning objectives for English→video search on VTT.
Text and Video Encoders. Sharing the video and text Transformer pooling heads slightly degrades the performance; therefore, we choose to keep them separate.

Learning Objective. From Table 3, the intra-modal contrastive objective is important for both the NCE and triplet losses. In general, the NCE loss outperforms the triplet loss. The proposed combination of inter-modal and intra-modal NCE objectives achieves the best performance. When captions in multiple languages are available, cross-lingual NCE provides an additional consistent improvement.

In contrast, the proposed multilingual multimodal pre-training (MMP) is shown to be the key facilitator for zero-shot cross-lingual transfer. MMP improves German→video search (11.1→15.0, +35% for mBERT, and 16.3→19.4, +20% for XLM-R) and achieves a 2.6∼2.8 improvement in average R@1. We attribute the effectiveness of MMP to learning improved alignments between the multilingual textual and visual context in the shared embedding space, as relatively balanced improvements between English→video and non-English→video search are observed after fine-tuning.

Fig. 3 demonstrates the trend of R@1 while incrementally incorporating additional languages for MMP. For XLM-R, the improvement in R@1 asymptotically converges when pre-training with more multilingual text-video pairs. On the other hand, for zero-shot German→video search, pre-training with more languages keeps improving the search performance, even when the additional language (e.g., French) differs from the target language (i.e., German).

VTT Zero-Shot Cross-Lingual Transfer
The lower section of Table 4 shows the results of models fine-tuned with (synthesized) pseudo-multilingual annotations. It can be regarded as the translate-train scenario, which serves as a strong performance target for evaluating zero-shot cross-lingual transfer, as discussed in (Lample and Conneau, 2019; Hu et al., 2020). Both mBERT and XLM-R yield better performance across non-English languages with the in-domain translated pseudo-multilingual annotations. However, for English→video search, a 0.7 degradation is observed compared to the zero-shot setting, likely due to the noise in the translated captions.
Notably, there is still a performance gap between the zero-shot and translate-train settings for models with mBERT. In contrast, the gap is much smaller for models with XLM-R. In the following sections, we refer to Ours-MMP as our best model with XLM-R as the text backbone and compare it with other state-of-the-art methods.
Qualitative Results. Fig. 4 shows the multilingual text→video search results with Ours-MMP (VTT:en-only) on VTT under the zero-shot setup. Note that only one shared English-fine-tuned model is used for text→video search in all languages. As demonstrated, the proposed model successfully retrieves the correct videos for English (en) and Russian (ru) queries; the other top-ranked videos also share a similar visual appearance with the correct one. For zero-shot transfer of the English-fine-tuned model to distant languages such as Vietnamese (vi) and Chinese (zh), we observe that our zero-shot models are still limited in understanding abstract concepts (e.g., "space project") and in associating small objects (e.g., "microphone") with text queries in distant languages.

Comparison to Supervised State of the Art
English→Video Search on VTT. Despite using fewer visual features and training on a smaller (6,513 vs. 9,000) training set 2, our model also outperforms CE (Liu et al., 2019) with or without pre-training.

Multilingual Text→Video Search on VATEX. Table 6 shows the results on VATEX.

Multilingual Text→Image Search on Multi30K. To tackle the feature mismatch between 2D-CNN (image) and 3D-CNN (video) features, we leverage a linear layer with a doubled learning rate to map 2D-CNN features to the same dimension as the 3D-CNN features. Table 7 shows the results on Multi30K. For zero-shot cross-lingual transfer, when trained from scratch (M30K:en-only), our model achieves comparable performance to MHA-D but lags in German→image search, since it uses only English annotations. With Ours-MMP, pre-training improves all recall metrics even with the modality gap; the average R@1 improvement is 3.2. A larger gain is observed for (relatively) low-resource languages such as Czech. Without using any Czech annotations, our zero-shot model with MMP achieves Czech→image search performance comparable to SMALR (Burns et al., 2020), which uses 10 languages including Czech. However, when transferring across modalities and using only English annotations, there are performance gaps between English→image and German/Czech→image search, implying that transferring models across modalities is feasible but remains challenging. We consider zero-shot cross-modal cross-lingual transfer as future work.
For a fair comparison with other baselines, when trained with annotations in all 4 languages provided by Multi30K, our model outperforms all baselines by a large margin in multilingual text→image search.

Conclusion
We have presented a multilingual multimodal pre-training (MMP) strategy, the Multi-HowTo100M dataset, and a Transformer-based text-video model for learning contextual multilingual multimodal representations. The results in this paper have convincingly demonstrated that MMP is an essential ingredient for zero-shot cross-lingual transfer of vision-language models. Meanwhile, many challenges remain, such as resolving the performance gap between zero-shot transfer and training with in-domain non-English annotations, as well as techniques to transfer a variety of vision-language models (e.g., VQA (Goyal et al., 2017), TVQA (Lei et al., 2020)) or visually-enhanced NLP models such as unsupervised multimodal machine translation (Huang et al., 2020b). We believe the proposed methodology, and the corresponding resources we release, will be an important first step towards spurring more research in this direction.

B The Multilingual HowTo100M Dataset
In this section we provide the detailed statistics of the Multilingual HowTo100M (Multi-HowTo100M) dataset. We also provide a comparison to Sigurdsson et al. (2020), who also use HowTo100M for unsupervised word translation. The Multi-HowTo100M dataset is built upon the original English HowTo100M dataset, which contains 1.2 million instructional videos (138 million clips) on YouTube. We reuse the raw English subtitles in HowTo100M, which are either automatic speech recognition (ASR) transcriptions or user-generated subtitles.
For Multi-HowTo100M, we use the same video collection as English HowTo100M. At the time of data collection (May 2020), there were 1.09 million videos accessible. We collect the subtitles provided by YouTube, which either consist of user-generated subtitles or those generated by Google ASR and Translate in the absence of user-generated ones. In total, we collect video subtitles in 9 languages: English (en), German (de), French (fr), Russian (ru), Spanish (es), Czech (cz), Swahili (sw), Chinese (zh), Vietnamese (vi). Table 8 summarizes the dataset statistics for each language. In most cases there are more than 1 billion tokens per language. Fig. 5 further shows the number of tokens per video. There are typically lengthy narrations that contain several hundred tokens in each instructional video. Fig. 6 shows the distribution of the number of tokens in a subtitle. For each subtitle segment, which ranges from 0∼20 seconds, there are typically 15∼25 words. In most cases, subtitles are well aligned in time for non-English languages. Fig. 2 visualizes a few examples in Multi-HowTo100M.
A similar HowTo100M variant was recently reported in MUVE (Sigurdsson et al., 2020), created for unsupervised word translation. Beyond scale, the instructional videos in Multi-HowTo100M are a feasible pre-training resource for many downstream vision-language models. Demonstrators in instructional videos typically perform intentionally and explain the visual objects or actions explicitly. According to the inspection by , for around 51% of clips, at least one object or action mentioned in the caption can be visually seen. Prior work has shown that instructional videos are useful for event recognition (Yu et al., 2014), action localization (Alayrac et al., 2016), and cross-modal alignment (Malmaud et al., 2015). We expect that previous successes at the intersection of natural language processing (NLP) and computer vision (CV) can be further translated into more languages to have a broader impact.

C Implementation and Experiment Details
We re-sample videos to 30 fps and employ a window of size 8 or 30 that takes consecutive frames starting from the beginning of every second for encoding. We simply concatenate the two 3D-CNN outputs and use the 1024-dimension vector as the visual input stream to our model. Notably, instead of using 9 different types of visual features as in CE (Liu et al., 2019), we use only the above 2 features and achieve superior performance. For the Transformer pooling head (TP) modules, we use a 2-layer Transformer with 4-head attention for each TP. The embedding dimension D is set to 1024. We do not use positional embeddings in either the text or video TP, as we did not find them beneficial in our experiments. The softmax temperature in all NCE contrastive objectives is set to 0.1, as used in SimCLR (Chen et al., 2020b).
Note that unlike ViLBERT (Lu et al., 2019) or OAN (Huang et al., 2019d), our model does not employ cross-modality attention and keeps the multi-head self-attention within the same modality. The main reason is to reduce inference time complexity. With cross-modality attention, the complexity is O(TV) to encode T text queries against V videos in a dataset before retrieval (since video and query representations depend on each other). This is clearly not scalable when the dataset contains millions of videos. To this end, our model keeps self-attention within the same modality, which results in O(T + V) complexity compared to O(TV) in prior work with cross-modality attention. In our preliminary experiments, we also incorporated cross-modality attention and achieved a 0.3∼1.8 R@1 improvement. Considering the trade-off between performance and scalability, we opt for the more scalable within-modality design.

Training and Inference Details and Profiling.
For the softmax temperature in NCE, we set it to 0.1 as used in SimCLR (Chen et al., 2020b). We use the Adam (Kingma and Ba, 2015) optimizer with an initial learning rate of 2·10^-4 and clip gradients greater than 0.2 during the training phase. The dropout rate is 0.3. Since the video and token lengths are longer in the pre-training phase, we use a batch size of 64 for pre-training. For fine-tuning, we use a batch size of 128.
Pre-training on the 1.2 million HowTo100M videos takes around 10 GPU hours (NVIDA V100) for 16 epochs. We speed up the pre-training process by distributing the workload over 8 GPUs on a single node of our server. We use 1 GPU for the fine-tuning or training from scratch experiments.
For the MSR-VTT split, it takes 12 GPU hours to train our model on 180K video-text pairs for 20 epochs. For VATEX, it takes 32 GPU hours to train on 260K video-text pairs for 30 epochs. For inference, the encoding speed is around 250-300 videos/sec and 200-250 text queries/sec. The overall text→video search on 1,000 video-text pairs (1,000 text queries over 1,000 videos) takes around 6 seconds, including video/text encoding and ranking their similarity scores.

Experiment Details. Our experiments consider three types of pre-training: (1) multilingual multimodal pre-training (MMP), (2) multimodal pre-training (MP), and (3) no pre-training (from scratch). For (1) and (2), we pre-train for 16 epochs and use the model weights at the 16-th epoch for the fine-tuning experiments.
For multimodal pre-training (MP), we pre-train on the original English HowTo100M dataset. We iterate over all videos in HowTo100M. For each video, we randomly sample the start and end time to construct a video clip. For each clip, we locate the nearest consecutive ASR transcriptions in time and use them to construct the (video, text) pair for training.
For multilingual multimodal pre-training (MMP), we use Multi-HowTo100M for pretraining. For each video, we follow the same strategy as MP. For a clip, we sample one language type each time from 9 languages and use the consecutive ASR transcriptions that are closest in time to compose (video, text) pairs for training.

D Additional Ablation Studies
As investigated in XTREME (Hu et al., 2020), choosing different output layers affects the zero-shot transferability of multilingual Transformers in various NLP tasks. For text→video search tasks, we conduct a series of experiments to identify the hyper-parameter choices in the proposed multilingual multimodal Transformer that lead to the best English→video and (zero-shot) non-English→video search performance. Beyond the ablation studies in Sec. 5, in this part we highlight our trials on the choice of the output layer and the layers to be frozen in our multilingual Transformer backbone (i.e., mBERT and XLM-R). There are 24 layers in XLM-R (large) and 12 layers in mBERT. We perform a grid search on VTT to identify the best choice of these two hyper-parameters. Table 9 and Table 10 compare different choices of output layer and layers to freeze in multilingual Transformers. Our results suggest that the best output layer for both mBERT and XLM-R is the 12-th layer. Surprisingly, while the output layer does not affect English→video search significantly, it greatly affects the zero-shot cross-lingual transfer performance of video-text models. For both XLM-R and mBERT, the performance degrades significantly when fine-tuning all layers.

Choice of Layers to Freeze. Similar to the output layer, the choice of frozen layers greatly affects cross-lingual transferability. For both mBERT and XLM-R, it is desirable to freeze part of the lower layers and make the top-3 layers trainable for video-text models. We observe that freezing all layers (i.e., using pre-extracted contextual multilingual embeddings) does not lead to satisfactory results. For mBERT, R@1 drops from 19.9 to 18.9 in English→video search and from 11.1 to 9.8 in German→video search. For XLM-R, R@1 drops from 21.0 to 18.9 in English→video search and from 16.3 to 14.1 in German→video search.
These results imply that text-only contextual multilingual embeddings alone are unlikely to be directly applicable to vision-language tasks without proper fine-tuning.

Choice of Output Layers
An important observation is that the best English→video search performance corresponds to the best German→video performance. This trend implies that, for model selection, the best configuration for the English→video model usually translates to the best configuration for the (zero-shot) cross-lingual model. This shared trend justifies the English→video ablation studies in the main paper. Note that we utilize the best English→video configuration for all (zero-shot) cross-lingual experiments in our experiment section.
For multilingual text→video search, the best configuration we found in our experiments is to output the 12-th layer and freeze the layers below 9 for both mBERT and XLM-R.

E Additional Experimental Results
The coverage of our text→video search experiments is summarized in Table 11. Our experiments cover the following scenarios: In-domain, English: Table 5 (VTT) and Table 6 (VATEX) in the original paper. In-domain, non-English: Table 4 (VTT, 9 languages) and Table 6 (VATEX, Chinese). Out-of-domain, English: Additional (zero-shot) generalization results across datasets are in §E.1. Out-of-domain, non-English: We consider this as our future work.

E.1 Generalizability across English-Video Datasets
In this section, we provide additional experimental results regarding zero-shot generalization of the VTT-finetuned model on an out-of-domain dataset. Specifically, we test on YouTube2Text (Chen and Dolan, 2011). The aim of this experiment is to test the cross-dataset generalizability of our model without using domain-specific training data.