Cross-Lingual Training of Neural Models for Document Ranking

We tackle the challenge of cross-lingual training of neural document ranking models for mono-lingual retrieval, specifically leveraging relevance judgments in English to improve search in non-English languages. Our work successfully applies multi-lingual BERT (mBERT) to document ranking and additionally compares against a number of alternatives: translating the training data, translating documents, multi-stage hybrids, and ensembles. Experiments on test collections in six different languages from diverse language families reveal many interesting findings: model-based relevance transfer using mBERT can significantly improve search quality in (non-English) mono-lingual retrieval, but other “low resource” approaches are competitive as well.


Introduction
This work proposes techniques for leveraging relevance judgments in a source language (English) to train neural models for mono-lingual document retrieval in multiple target (non-English) languages, what we refer to as cross-lingual training. Success in this task would make it easier to develop effective search engines in multiple (potentially low-resource) languages, without gathering expensive relevance judgments in each language. A blog post by Google suggests that the company is exploring this approach to improving web search across a number of languages.1 We are inspired by the work of Wu and Dredze (2019), who explored the cross-lingual potential of multi-lingual BERT as a zero-shot language transfer model for NLP tasks such as named-entity recognition and parsing. Mono-lingual BERT models (Devlin et al., 2019) have also proven effective in document retrieval (Dai and Callan, 2019; MacAvaney et al., 2019; Li et al., 2020). In particular, Akkalyoncu Yilmaz et al. (2019) demonstrated that BERT models fine-tuned with passage-level relevance data can transfer across domains: surprisingly, fine-tuning on social media data is effective for relevance classification on newswire documents without any additional modifications. Building on these results, we wondered if multi-lingual BERT could enable cross-lingual training of neural document ranking models as well.
The contribution of this work is to explore diverse methods to train neural document ranking models cross-lingually. While we are aware of two previous papers along these lines (Shi and Lin, 2019; MacAvaney et al., 2020), this work explores a far broader range of techniques and adds more nuance to previous findings. Beyond the basic approach proposed by these two papers, which we refer to as model-based transfer, we investigate additional approaches involving the translation of the training data, the translation of documents, hybrid models, as well as ensembles, which we broadly characterize into "high resource" and "low resource" settings. We show that various methods, alone and in combination, can yield robust increases in effectiveness across diverse languages with minimal resources, and that model-based cross-lingual transfer is not the only way.

Approach
This work adopts the standard formulation of document ranking: given a user query Q, the task is to produce a ranking of documents from a collection that maximizes some ranking metric, in our case average precision (AP). Given source-language relevance judgments (in English), our task is to train a mono-lingual document ranking model for a target (non-English) language; that is, the queries and the documents are both in, for example, Bengali.

Our relevance model follows Akkalyoncu Yilmaz et al. (2019): the query Q and a candidate sentence S are fed into BERT, and the final representation of the [CLS] token is passed to a single-layer neural network with a softmax, obtaining the probability that sentence S is relevant to the query Q. BERT is fine-tuned with data from the TREC Microblog Tracks (Lin et al., 2014) (MB for short). The authors showed that such a relevance matching model can be directly applied to effectively rank newswire documents, despite the mismatch in domains between training and test data; cf. Rücklé et al. (2020).
For document retrieval (i.e., at inference time), Akkalyoncu Yilmaz et al. (2019) first apply "bag of words" exact term matching to retrieve a candidate set of documents. Each document is split into sentences, and inference is applied on each sentence separately with BERT. The relevance score of each document is determined by combining the top k (by default, k = 3) scoring sentences with the document term-matching score as follows:

S_doc = α · S_r + (1 − α) · Σ_{i=1..k} w_i · S_i

where S_i is the i-th top sentence score according to BERT and S_r is the document-level term-matching score. The parameters α and the w_i's can be tuned via cross-validation. All candidate documents are sorted by the above score S_doc to produce the final output.
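The evidence aggregation described above reduces to a few lines of code. The following is a minimal sketch (function and variable names are ours, not taken from the released implementation):

```python
def aggregate_score(term_match_score, sentence_scores, alpha, weights, k=3):
    """Combine the document-level term-matching score S_r with the top-k
    BERT sentence scores: S_doc = alpha * S_r + (1 - alpha) * sum_i(w_i * S_i)."""
    top = sorted(sentence_scores, reverse=True)[:k]
    bert_evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * term_match_score + (1 - alpha) * bert_evidence
```

Setting alpha = 1 recovers pure term matching, while alpha = 0 ranks documents by BERT sentence evidence alone.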

Cross-Lingual Relevance Transfer
Our main research question is as follows: Given English (source) training data, how can we bootstrap a good document ranking model in non-English (target) languages? We discuss a number of approaches below, which we characterize as "high" or "low" resource in terms of annotation effort.
Model-based transfer. Following Wu and Dredze (2019), the most obvious approach is to fine-tune mBERT using data in the source language, and apply inference directly on input in the target language. In essence, we follow the same setup as Akkalyoncu Yilmaz et al. (2019), with the exception that we use mBERT instead of (English) BERT. Note that this is essentially the approach explored in previous work (Shi and Lin, 2019; MacAvaney et al., 2020). We characterize this approach as "low resource" given that mBERT is pretrained in a self-supervised manner.

Training data translation. Instead of relying on mBERT to transfer models of relevance matching across languages, we can translate the English training data into the target language, and then fine-tune mBERT with the translated data.2 At inference time, we directly apply the model on target-language documents. We considered two translation methods: Google Translate (MB_gt) and a simple embedding-based token-by-token translation approach (MB_wt). We characterize the first as "high resource" given the amount of bitext that is typically necessary to train a high-quality translation system, and the second as "low resource" since bilingual lexicons and aligned word embeddings are far easier to create.
Our token-based translation approach is inspired by Huang et al. (2019). The basic idea is to find the best token translation based on the cosine similarity between the token in the source language and candidate tokens in the target language. Specifically, for each token in the source language, the surface form is used for lookup in a bilingual dictionary. If the token has a unique translation, we use the translation directly. If it has multiple translations, we use an empirical scoring function F(w, w_{t,i}) to select the best translation. This scoring function calculates the cosine similarity between a candidate translation w_{t,i} and the source token w, taking into account the source token's contextual tokens w_{c,j} (in this work, we consider two words in the left context and two words in the right context), as follows:

F(w, w_{t,i}) = γ · cos(E(w), E(w_{t,i})) + (1 − γ) · Σ_j (1 / d_j) · cos(E(w_{c,j}), E(w_{t,i}))

where E(w) is the bilingual embedding of the token w, d_j is the positional distance between the token w and its contextual token w_{c,j}, and γ is a hyperparameter balancing the effects of the translation pair and the contextual tokens. Following previous work, we set γ to 0.5. If the source-language token has no translations, the original surface form is kept unchanged.
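The selection procedure can be sketched as follows. This is our reconstruction of the described behavior, not the authors' code; the data-structure choices (an embedding dictionary, a candidate lexicon, and a list of (context token, distance) pairs) are assumptions:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def translate_token(emb, lexicon, w, context, gamma=0.5):
    """Token-by-token translation sketch. `emb` maps tokens in either
    language to aligned bilingual embeddings; `lexicon` maps a source token
    to its candidate translations; `context` is a list of
    (context_token, positional_distance) pairs for the source token."""
    candidates = lexicon.get(w, [])
    if not candidates:
        return w                 # no translation: keep the surface form
    if len(candidates) == 1:
        return candidates[0]     # unique translation: use it directly

    def score(cand):
        # gamma-weighted similarity of the translation pair, plus
        # distance-discounted similarity to the context tokens
        pair = cosine(emb[w], emb[cand])
        ctx = sum(cosine(emb[c], emb[cand]) / d for c, d in context)
        return gamma * pair + (1 - gamma) * ctx

    return max(candidates, key=score)
```

With an empty context the choice falls back to the translation-pair similarity alone.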
Note that model-based transfer uses the same model across all languages, whereas this approach requires a separate model for each language.

Hybrid transfer. Both approaches above can be combined in a stage-wise fashion: we can first fine-tune mBERT on the English data, and then fine-tune again on the translated training data (we refer to this as the en→gt direction). Alternatively, we can switch the order of fine-tuning (the gt→en direction). In these experiments, we used the output of Google Translate (and hence these are "high resource" approaches).

Document translation. Another way to leverage existing translation capabilities is to translate the documents at search time from the target language into the source language (English), and directly apply the mBERT model trained on MB_en. We used Google Translate in this method, and thus it is "high resource".

Ensembles. Ensembles of the above approaches can exploit multiple signals and resources. One approach is to interpolate scores from multiple sources on a per-document basis: S_agg = β · S_model-transfer + (1 − β) · S_doc-translation. This method, denoted ENS_INT, combines model-based transfer and document translation (from the results, the two most promising techniques). Alternatively, we also experimented with Reciprocal Rank Fusion (Cormack et al., 2009) to aggregate two separate ranked lists, denoted ENS_RRF. These methods are "high resource".
For "low resource" ensembles, we aggregated signals from model-based transfer and the tokenbased approach for translating training data. These signals are either combined by per-document score interpolation or RRF, as per above.

Experimental Setup
We experimented with six test collections (in Chinese, Arabic, French, Hindi, Bengali, and Spanish) from diverse language families (Sino-Tibetan, Semitic, Romance, and Indo-Aryan). Dataset statistics are shown in Table 1. Following standard practice in information retrieval, average precision (AP) up to rank 1000 and precision at rank 20 (P@20) were adopted as the evaluation metrics, computed with the trec_eval tool.
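For reference, both metrics reduce to a few lines; the sketch below mirrors the standard definitions (unjudged documents are treated as non-relevant, as in trec_eval):

```python
def average_precision(ranking, relevant, depth=1000):
    """AP over the top `depth` results: mean of precision at each
    rank where a relevant document is retrieved, normalized by the
    total number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking[:depth], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at(ranking, relevant, k=20):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k
```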
For the token-based translation method, we used the MUSE bilingual dictionary (Lample et al., 2018) and the aligned word embeddings from fastText (Joulin et al., 2018). For fine-tuning mBERT, we followed the same experimental setup as Akkalyoncu Yilmaz et al. (2019). We used data from the Microblog (MB) Tracks from TREC 2011-2014 (Lin et al., 2014) or its translated counterparts, setting aside 75% of the total data for training and the rest for validation, which was used for selecting the best model parameters. We trained each model using cross-entropy loss with a batch size of 16; the Adam optimizer was applied with an initial learning rate of 1 × 10^−5. During fine-tuning, the embeddings were fixed. The model with the highest AP on the validation set was chosen. We ran all experiments on an NVIDIA Tesla V100 16GB with PyTorch version 1.3.0. Each model was trained for up to 15 epochs, with an average running time of approximately two hours.
For retrieval, we used the open-source Anserini IR toolkit (Yang et al., 2018) with minor modifications based on version 0.6.0 to swap in Lucene Analyzers for different languages. Fortunately, Lucene provides analyzers for all the languages in our test collections. The query was used to retrieve the top 1000 hits from the corpus using BM25 or BM25+RM3 query expansion; default Anserini settings were used in both cases. Reranking with mBERT (see Section 2.1) used the approach with higher AP (either BM25 or BM25+RM3); the top three sentences were considered in aggregating sentence-level evidence. We applied five-fold cross-validation on all datasets; the parameters α, the w_i's, and β were obtained by grid search, choosing the values that yielded the highest AP.
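Since the grid is small, the parameter search is just exhaustive enumeration. A sketch (the candidate values and the `evaluate` callback are placeholders we introduce for illustration, not the grid actually used):

```python
import itertools

def grid_search(alphas, weight_values, evaluate, k=3):
    """Exhaustively search over alpha and (w_1, ..., w_k), keeping the
    setting with the highest validation AP. `evaluate(alpha, weights)`
    must return AP on held-out data for that setting."""
    best, best_ap = None, float('-inf')
    for alpha in alphas:
        for weights in itertools.product(weight_values, repeat=k):
            ap = evaluate(alpha, weights)
            if ap > best_ap:
                best, best_ap = (alpha, weights), ap
    return best, best_ap
```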

Results
Our results are shown in Table 2. Models (0) and (1) show the effectiveness of BM25 and BM25 with RM3 query expansion. We see that with the exception of the French and Spanish collections, RM3 actually decreases effectiveness. This interesting finding was not further investigated, as our goal was simply to establish a strong baseline; however, these results are consistent with MacAvaney et al. (2020). For each language, we selected the higher of the two models as the starting point of reranking (see Section 2.1) as well as the baseline for comparisons below. We organize results into five findings below. Unless otherwise stated, Fisher's two-sided, paired randomization test (Smucker et al., 2007) at p < 0.05 was applied to test for statistical significance, with Bonferroni corrections as appropriate.
Finding #1: Model-based transfer, model (2), improves upon the baseline, with significant gains (denoted in Table 2) everywhere except for AP in Arabic and P@20 in Spanish. Since mBERT is widely available, mono-lingual retrieval improvements can be obtained "for free" with microblog relevance judgments in English. These results indicate that mBERT effectively transfers relevance matching across languages. This finding confirms previous work (Shi and Lin, 2019; MacAvaney et al., 2020), but see additional discussion below.
Finding #2: Comparing model-based transfer and the two approaches to translating training data, models (3) and (4), it is difficult to spot trends or reach definitive conclusions. Model-based transfer does not consistently beat simply translating the training data. In terms of AP, Google Translate, model (3), outperforms model-based transfer for Chinese and Arabic; token-based translation, model (4), beats model-based transfer in Hindi and achieves comparable scores in Arabic and Spanish. Interestingly, it is not always the case that Google Translate ("high resource") is better than token-based translation ("low resource"); the latter achieves higher AP for Hindi and Bengali. A Tukey's HSD test across models (2-4) showed no significant differences.
These results suggest that model-based transfer is not the only effective approach, and that simply translating the training data is at least competitive; neither Shi and Lin (2019) nor MacAvaney et al. (2020) compared against such alternatives.

Finding #3: Results show that hybrid two-stage training in the en→gt direction, model (6), can further improve over model-based transfer alone or translating the training data with Google Translate alone, but the gains are not consistent; it achieves lower AP than either model (2) or (3) in Chinese, Bengali, and Spanish. When compared to the baseline, model (6) yields significant improvements on Chinese, Arabic, and Hindi (denoted by §). In the opposite direction, gt→en, while the hybrid model (7) significantly outperforms the baseline in a few cases, it does not seem to be consistently more effective than either model (2) or (3). Note that both hybrid approaches are "high resource" since they require Google Translate.
Finding #4: Document translation, model (5), generally beats model transfer, but it requires substantial resources, such as large amounts of parallel text for training a translation system. Because all our documents are in the newswire domain, the output of Google Translate is quite reasonable. Since this approach avoids language mismatch between training and test, it can outperform the model-based transfer approach: these improvements are significant (denoted by ¶) for the Spanish collection on both metrics, and for the Arabic, Bengali, and French collections on AP.
Finding #5: In general, ensembles outperform model transfer alone, with the "high resource" approaches beating the "low resource" approaches (as expected). Comparing the interpolation and RRF methods, we see no consistent trends. A Tukey's HSD test showed no significant differences between the four ensemble methods.

Discussion
Given the effectiveness of model transfer, we additionally investigated a research question focused on model (2): How much contextual information does mBERT rely on besides term matching?
Inspired by the query-centric assumption (Wu et al., 2007) that relevance information is localized in the contexts around query terms, we conducted the following experiments: for each query term, we kept only the text within a fixed-size window around the matched tokens in each sentence, and used only those contexts for reranking. We tried window sizes of 1 (only the matched query terms are kept), 3 (the matched query terms with their left and right tokens), 5, 7, 11, and "sentence" (the entire sentence is kept if at least one query token matches). If multiple segments come from the same sentence, they are concatenated to form a new "sentence".
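The windowed filtering can be sketched as follows; tokenization and matching are simplified to exact string equality, and the names are ours:

```python
def windowed_context(tokens, query_terms, window):
    """Keep only tokens within the window around each matched query term;
    surviving segments from the same sentence are concatenated in order.
    window=1 keeps only the matches; window=3 adds one token on each side."""
    half = (window - 1) // 2
    keep = set()
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            keep.update(range(max(0, i - half),
                              min(len(tokens), i + half + 1)))
    return [tok for i, tok in enumerate(tokens) if i in keep]
```

Overlapping windows from nearby matches merge naturally, since kept positions are collected into a set before the final pass.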
Experimental results are shown in Figure 1 for two representative collections. For comparison, we also show the results of the baseline, either model (0) or (1), denoted bm25 in the figure, and the results of model (2), denoted full in the figure. We can see that as the window size increases, AP tends to rise as well. This seems intuitive, as context is needed for relevance matching. Furthermore, the results show that some words critical for determining relevance are located quite far from the query terms; these are discarded when the window size is too small, leading to lower AP scores. However, if we only keep sentences that contain at least one query term, ranking effectiveness is already comparable to using all sentences (0.3080 vs. 0.3081 in Arabic, 0.3095 vs. 0.3101 in Bengali). This simple filter can decrease the inference time needed for reranking by 60% to 80%, depending on the characteristics of the collection.

Conclusion
As a high-level summary, our experiments confirm that mBERT can enable cross-lingual training of document ranking models. However, mBERT's "multi-lingual capacity" for direct model-based transfer does not appear to be consistently better than other approaches to bridging language gaps. For example, simple approaches such as token-based translation of the training data also work well. That said, model-based transfer requires only a single model, whereas the latter requires a separate model for each language. Overall, our work contributes to a better understanding of how relevance judgments in high-resource languages can be leveraged to improve search in low(er)-resource languages. Our code is available on GitHub.3