Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations

In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number of relevance labels for low-resource language pairs. Due to the shared cross-lingual word embedding space, the model can also be directly applied to another language pair without any training labels. Experimental results on the MATERIAL dataset show that our model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks.


Introduction
Cross-lingual relevance ranking, or Cross-Lingual Information Retrieval (CLIR), is the task of ranking foreign documents against a user query (Hull and Grefenstette, 1996; Ballesteros and Croft, 1996; Oard and Hackett, 1997; Darwish and Oard, 2003). As multilingual documents become more accessible, CLIR is increasingly important whenever the relevant information is in other languages.
Traditional CLIR systems consist of two components: machine translation and monolingual information retrieval. Based on the translation direction, they can be further categorized into document translation and query translation approaches (Nie, 2010). In both cases, the translation problem is solved first, which reduces the task to the monolingual setting. However, while conceptually simple, this modular approach is fundamentally limited by the quality of machine translation.
Recently, many deep neural IR models have shown promising results on monolingual data sets (Huang et al., 2013; Guo et al., 2016; Pang et al., 2016; Mitra et al., 2016, 2017; Xiong et al., 2017; Hui et al., 2017, 2018; McDonald et al., 2018). They learn a scoring function directly from the relevance labels of query-document pairs. However, it is not clear how to use them when documents and queries are not in the same language. Furthermore, those deep neural networks need a large amount of training data, which is expensive to obtain for low-resource language pairs in our cross-lingual case.
In this paper, we propose a cross-lingual deep relevance ranking architecture based on a bilingual view of queries and documents. As shown in Figure 1, our model first translates queries and documents and then uses four components to match them in both the source and target language. Each component is implemented as a deep neural network, and the final relevance score combines all components, which are jointly trained given the relevance label. We implement this based on state-of-the-art term interaction models because they enable us to make use of cross-lingual embeddings to explicitly encode terms of queries and documents even if they are in different languages. To deal with the small amount of training data, we first perform query likelihood retrieval and include the score as an extra feature in our model. In this way, the model effectively learns to rerank from a small number of relevance labels. Furthermore, since the word embeddings are aligned in the same space, our model can directly transfer to another language pair with no additional training data.
We evaluate our model on the MATERIAL CLIR dataset with three language pairs: English to Swahili, English to Tagalog, and English to Somali. Experimental results demonstrate that our model outperforms other translation-based query likelihood retrieval and monolingual deep relevance ranking approaches.

Figure 2(a): Bilingual POSIT-DRMM. The colored box represents hidden states in bidirectional LSTMs. (Panel labels: Document in Target Language; Cosine Similarity; max and k-max; Term Gating.)

Our Method
In cross-lingual document retrieval, given a user query Q in the source language and a document D in the target language, the system computes a relevance score s(Q, D). As shown in Figure 1, our model first translates the document into the source language as D' or the query into the target language as Q', and then it uses four separate components to match: (1) source query with target document (Q, D), (2) source query with source document (Q, D'), (3) target query with source document (Q', D'), and (4) target query with target document (Q', D). The final relevance score s(Q, D) combines the scores of all four components.
To implement each component, we extend three state-of-the-art term interaction models: PACRR (Position-Aware Convolutional Recurrent Relevance Matching) proposed by Hui et al. (2017), and POSIT-DRMM (POoled SImilariTy DRMM) and PACRR-DRMM proposed by McDonald et al. (2018). In term interaction models, each query term is scored against a document's terms based on the interaction encodings, and the scores for different query terms are aggregated to produce the query-document relevance score.
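As a rough illustration of the four-way matching described above, the sketch below scores one query-document pair with four component scorers and adds the results. The unweighted sum, the function name `bilingual_score`, and the toy `overlap_scorer` are assumptions made for illustration only; in the paper each component is a jointly trained neural network, and the exact form of the combination is not reproduced here.

```python
from typing import Callable, Sequence

Terms = Sequence[str]
Scorer = Callable[[Terms, Terms], float]


def overlap_scorer(query: Terms, doc: Terms) -> float:
    """Toy stand-in for a neural component: fraction of query terms found in doc."""
    if not query:
        return 0.0
    doc_vocab = set(doc)
    return sum(term in doc_vocab for term in query) / len(query)


def bilingual_score(
    q_src: Terms,  # original query, source language
    d_tgt: Terms,  # original document, target language
    q_tgt: Terms,  # translated query, target language
    d_src: Terms,  # translated document, source language
    scorers: Sequence[Scorer],
) -> float:
    """Combine the four matching components (assumed here: an unweighted sum)."""
    pairs = [
        (q_src, d_tgt),  # (1) source query vs. target document
        (q_src, d_src),  # (2) source query vs. source document
        (q_tgt, d_src),  # (3) target query vs. source document
        (q_tgt, d_tgt),  # (4) target query vs. target document
    ]
    if len(scorers) != len(pairs):
        raise ValueError("expected one scorer per component")
    return sum(score(q, d) for score, (q, d) in zip(scorers, pairs))
```

With the toy scorer, only the two same-language components contribute; the cross-language components need cross-lingual embeddings to produce a useful signal, which is what the term interaction models below provide.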

Bilingual POSIT-DRMM
This model is illustrated in Figure 2a. We first use bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) to produce a context-sensitive encoding of each query and document term. We also add a residual connection to combine the pre-trained term embedding with the LSTM hidden states. For source-language query and document terms, we use the pre-trained word embeddings in the source language. For target-language query and document terms, we first align the pre-trained embeddings in the target language to the source language and then use these cross-lingual word embeddings as the input to the LSTM. Thereafter, we produce the document-aware query term encoding by applying max pooling and k-max pooling over the cosine similarity matrix of query and document terms. We then use an MLP to produce term scores, and the relevance score is a weighted sum over all terms in the query with a term gating mechanism.
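The pooling and gating steps can be sketched in a few lines of plain Python. This is a minimal sketch under several assumptions: the term encodings are taken as given (the BiLSTM and residual connection are omitted), k-max pooling is summarized as the mean of the top-k similarities, the MLP that maps pooled features to term scores is left out, and term gating is modeled as a softmax over hypothetical gate logits. The default k = 5 follows the implementation details reported in the experiments.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two term encodings (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pooled_similarity_features(
    query_enc: Sequence[Vector], doc_enc: Sequence[Vector], k: int = 5
) -> List[List[float]]:
    """For each query term, return [max similarity, mean of top-k similarities]."""
    features = []
    for q in query_enc:
        sims = sorted((cosine(q, d) for d in doc_enc), reverse=True)
        top_k = sims[:k]
        if not top_k:  # empty document
            features.append([0.0, 0.0])
        else:
            features.append([top_k[0], sum(top_k) / len(top_k)])
    return features


def gated_relevance(term_scores: Sequence[float], gate_logits: Sequence[float]) -> float:
    """Softmax term gating followed by a weighted sum of per-term scores."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    return sum(e / z * s for e, s in zip(exps, term_scores))
```

In a full model, each row of `pooled_similarity_features` would be passed through the MLP to obtain a term score before `gated_relevance` aggregates the scores.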

Bilingual PACRR and Bilingual PACRR-DRMM
These models are shown in Figure 2b. We first align the word embeddings in the target language to the source language and build a query-document similarity matrix that encodes the similarity between each query term and each document term. Depending on the query language and document language, we construct four similarity matrices (SIM), one for each of the four components. Then, we use convolutional neural networks over the similarity matrix to extract n-gram matching features. We then use max pooling and k-max pooling to produce the feature matrix, where each row is a document-aware encoding of a query term. The final step computes the relevance score: Bilingual PACRR uses an MLP on the whole feature matrix to get the relevance score, while Bilingual PACRR-DRMM first uses an MLP on individual rows to get query term scores and then uses a second layer to combine them.
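The n-gram matching and k-max pooling over a similarity matrix can be illustrated as follows. This is a simplified sketch, not the trained model: the learned convolutional filters are replaced by a fixed diagonal averaging filter (an assumption chosen so the example runs without a deep learning library), and the function names are hypothetical. The filter sizes [1, 2, 3] and k = 2 follow the implementation details reported in the experiments.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def ngram_match(sim: Matrix, n: int) -> List[List[float]]:
    """Slide a fixed n-by-n diagonal averaging filter over the similarity matrix.

    out[i][j] is the mean similarity of the aligned n-gram that starts at
    query position i and document position j (zero-padded at the borders).
    """
    rows = len(sim)
    cols = len(sim[0]) if rows else 0
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            total = 0.0
            for offset in range(n):
                if i + offset < rows and j + offset < cols:
                    total += sim[i + offset][j + offset]
            row.append(total / n)
        out.append(row)
    return out


def kmax_features(
    sim: Matrix, ngram_sizes: Sequence[int] = (1, 2, 3), k: int = 2
) -> List[List[float]]:
    """Build one feature row per query term: top-k responses for each n-gram size."""
    features: List[List[float]] = [[] for _ in sim]
    for n in ngram_sizes:
        conv = ngram_match(sim, n)
        for i, row in enumerate(conv):
            top = sorted(row, reverse=True)[:k]
            top += [0.0] * (k - len(top))  # pad documents shorter than k
            features[i].extend(top)
    return features
```

Each row of the resulting feature matrix corresponds to one query term, which is the shape both the single MLP of Bilingual PACRR and the row-wise MLP of Bilingual PACRR-DRMM operate on.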

Related Work
Cross-lingual Information Retrieval. Traditional CLIR approaches include document translation and query translation, and more research effort has been devoted to the latter (Oard and Hackett, 1997; Oard, 1998; McCarley, 1999; Franz et al., 1999).
Early methods use a dictionary to translate the user query (Hull and Grefenstette, 1996; Ballesteros and Croft, 1996; Pirkola, 1998). Other methods include the single best SMT query translation (Chin et al., 2014) and the weighted SMT translation alternatives known as the probabilistic structured query (PSQ) (Darwish and Oard, 2003; Ture et al., 2012). [...] et al., 2016; Guo et al., 2016; Hui et al., 2017; Xiong et al., 2017; McDonald et al., 2018). The former builds representations of queries and documents independently, and the matching is performed at the final stage. The latter explicitly encodes the interaction between terms to directly capture word-level interaction patterns. For example, DRMM (Guo et al., 2016) first compares the term embeddings of each pair of terms within the query and the document and then generates fixed-length matching histograms.

Experiments
Training and Inference. We first use the Indri system, which uses query likelihood with Dirichlet smoothing (Zhai and Lafferty, 2004), to preselect documents from the collection. To build the training dataset, for each positive example in the returned list, we randomly sample one negative example from the documents returned by Indri. The model is then trained with a binary cross-entropy loss. On the validation or test set, we use our prediction scores to rerank the documents returned by Indri.
Extra Features. Following previous work (Severyn and Moschitti, 2015; Mohan et al., 2017; McDonald et al., 2018), we compute the final relevance score by a linear model to combine the model output with the following set of extra features: (1) the Indri score with the language modeling approach to information retrieval; (2) the percentage of query terms with an exact match in the document, including the regular percentage and the IDF-weighted percentage; (3) the percentage of query term bigram matches in the document.
Cross-lingual Word Embeddings. We apply the supervised iterative Procrustes approach (Xing et al., 2015; Conneau et al., 2018) to align two pre-trained monolingual fastText (Bojanowski et al., 2016) word embeddings using the MUSE implementation. To build the bilingual dictionary, we use the translation pages of Wiktionary. For Swahili, we build a training dictionary of 5301 words and a testing dictionary of 1326 words. For Tagalog, the training and testing dictionaries contain 7088 and 1773 words, respectively. For Somali, the corresponding numbers are 7633 and 1909. We then learn the cross-lingual word embeddings from Swahili to English, from Tagalog to English, and from Somali to English. Therefore, all three languages are in the same word embedding space.
Data Sets and Evaluation Metrics. Our experiments are evaluated on data from the MATERIAL program, as summarized in Table 1. It consists of three language pairs with English queries on Swahili (EN->SW), Tagalog (EN->TL), and Somali (EN->SO) documents.
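The exact-match features listed under Extra Features, and their linear combination with the model output, can be sketched as below. This is an illustrative sketch under stated assumptions: the query and document are assumed to be tokenized in the same language (for example, an English query against a translated document), the `idf` table is a hypothetical input, and the weights of `linear_score` would be learned rather than set by hand.

```python
from typing import Dict, List, Sequence


def exact_match_features(
    query: Sequence[str], doc: Sequence[str], idf: Dict[str, float]
) -> List[float]:
    """Return [term match %, IDF-weighted term match %, bigram match %]."""
    doc_terms = set(doc)
    matched = [t for t in query if t in doc_terms]
    term_pct = len(matched) / len(query) if query else 0.0

    total_idf = sum(idf.get(t, 0.0) for t in query)
    idf_pct = sum(idf.get(t, 0.0) for t in matched) / total_idf if total_idf > 0 else 0.0

    doc_bigrams = set(zip(doc, doc[1:]))
    query_bigrams = list(zip(query, query[1:]))
    bigram_pct = (
        sum(b in doc_bigrams for b in query_bigrams) / len(query_bigrams)
        if query_bigrams
        else 0.0
    )
    return [term_pct, idf_pct, bigram_pct]


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear combination of the neural model output, the Indri score, and extra features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))
```

A full feature vector would be assembled as `[model_output, indri_score] + exact_match_features(query, doc, idf)` before calling `linear_score`.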
We use the TREC ad-hoc retrieval evaluation script to compute Precision@20, Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain@20 (NDCG@20). We also report the Actual Query Weighted Value (AQWV) (NIST, 2017), a set-based metric with a penalty for both missing relevant documents and returning irrelevant ones. We use β = 40.0 and find the best global fixed cutoff over all queries.
Baselines. For traditional CLIR approaches, we use query translation and document translation with the Indri system. For query translation, we use Dictionary-Based Query Translation (DBQT) and Probabilistic Structured Query (PSQ). For document translation, we use Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). For SMT, we use the Moses system (Koehn et al., 2007) with word alignments from mGiza and a 5-gram KenLM language model (Heafield, 2011). For NMT, we use a sequence-to-sequence model with attention (Bahdanau et al., 2015; Miceli Barone et al., 2017) implemented in Marian (Junczys-Dowmunt et al., 2018).
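For readers unfamiliar with the metrics named under Data Sets and Evaluation Metrics, the sketch below computes Precision@k and average precision for a single query, plus a per-query value in the spirit of AQWV. The first two follow the standard definitions. The last one is an assumption: it takes the query value to be 1 − (P_miss + β · P_FA), which matches the description of a penalty for misses and false alarms but is not a transcription of the NIST specification, and the function names are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def query_value(
    returned: Set[str], relevant: Set[str], collection_size: int, beta: float = 40.0
) -> float:
    """Assumed per-query value: 1 - (P_miss + beta * P_FA)."""
    non_relevant = collection_size - len(relevant)
    if not relevant or non_relevant <= 0:
        raise ValueError("need at least one relevant and one non-relevant document")
    p_miss = len(relevant - returned) / len(relevant)
    p_false_alarm = len(returned - relevant) / non_relevant
    return 1.0 - (p_miss + beta * p_false_alarm)
```

MAP is the mean of `average_precision` over all queries. Under the stated assumption, averaging `query_value` over queries at a fixed cutoff gives the set-based score, which is why a single global cutoff has to be tuned.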
For deep relevance ranking baselines, we investigate recent state-of-the-art models including PACRR, PACRR-DRMM, and POSIT-DRMM. These models and our methods all use an SMT-based document translation as input.
Implementation Details. For POSIT-DRMM and Bilingual POSIT-DRMM, we use k-max pooling with k = 5 and a dropout of 0.3 on the BiLSTM output. For PACRR, PACRR-DRMM, and their bilingual counterparts, we use convolutional filter sizes of [1, 2, 3], and each filter size has 32 filters. We use k = 2 in the k-max pooling. The loss function is minimized using the Adam optimizer (Kingma and Ba, 2014) with a training batch size of 32. We monitor the MAP performance on the development set after each epoch of training to select the model that is used on the test data.
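The training setup described in this section (one sampled negative per positive from the Indri results, trained with binary cross-entropy) can be sketched as follows. The data layout, the function names, and the choice to skip queries whose result list contains no negatives are assumptions made for illustration; the neural scorer and the optimizer are omitted.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple

Example = Tuple[str, str, int]  # (query id, document id, label)


def build_training_pairs(
    retrieved: Dict[str, Sequence[str]],
    relevant: Dict[str, Set[str]],
    seed: int = 0,
) -> List[Example]:
    """For each retrieved positive, sample one negative from the same result list."""
    rng = random.Random(seed)
    examples: List[Example] = []
    for query_id, doc_ids in retrieved.items():
        gold = relevant.get(query_id, set())
        positives = [d for d in doc_ids if d in gold]
        negatives = [d for d in doc_ids if d not in gold]
        if not negatives:
            continue  # assumption: skip queries with nothing to sample from
        for pos in positives:
            examples.append((query_id, pos, 1))
            examples.append((query_id, rng.choice(negatives), 0))
    return examples


def binary_cross_entropy(prob: float, label: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a single predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
```

Because negatives come from the same Indri result list as the positives, the model sees the hard cases it will later have to rerank rather than random documents from the collection.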

Results and Discussion
Table 2 shows the results on EN->SW and EN->TL, where we train and test on the same language pair.
Performance of Baselines. For query translation, PSQ is better than DBQT because PSQ uses weighted translation alternatives for query terms and is not limited to the fixed translations from the dictionary as in DBQT. For document translation, we find that SMT and NMT have similar performance, which is close to that of PSQ. The effectiveness of different approaches depends on the language pair (PSQ for EN->SW and SMT for EN->TL), which is consistent with the findings of McCarley (1999) and Franz et al. (1999). In our experiments with deep relevance ranking models, we use SMT and PSQ throughout because they perform strongly on both language pairs, which makes the comparison fair.
Effect of Extra Features and Bilingual Representation. While deep relevance ranking can achieve decent performance, the extra features are critical to achieving better results. Because the extra features include the Indri score, the deep neural model essentially learns to rerank the documents by effectively using a small number of training examples. Furthermore, our models with bilingual representations achieve better results on both language pairs, giving an additional 1-3 MAP improvement over their counterparts. Comparing language pairs, EN->TL shows larger improvements than EN->SW. This is because EN->TL has better query translation, document translation, and query likelihood retrieval results from the baselines, and thus it benefits more from our model. We also find that POSIT-DRMM works better than the other two models, suggesting that term gating is useful, especially when the query translation can provide more alternatives. We then perform ensembling of POSIT-DRMM to further improve the results.
Zero-Shot Transfer Learning. Table 3 shows the results for a zero-shot transfer learning setting where we train on EN->SW + EN->TL and directly test on EN->SO without using any Somali relevance labels. This transfer learning delivers a 1-3 MAP improvement over PSQ and SMT. This presents a promising approach to boosting performance by utilizing relevance labels from other language pairs.

Conclusion
We propose to improve cross-lingual document retrieval by utilizing bilingual query-document interactions and learning to rerank from a small amount of training data for low-resource language pairs. By aligning word embedding spaces for multiple languages, the model can be directly applied under a zero-shot transfer setting when no training data is available for another language pair. We believe the idea of combining bilingual document representations using cross-lingual word embeddings can be generalized to other models as well.

Figure 1: Cross-lingual Relevance Ranking with Bilingual Query and Document Representation.
Figure 2: Model architecture. We only show the component of the source query with the target document. (b) Bilingual PACRR-DRMM. The colored box represents cross-lingual word embeddings. Bilingual PACRR is the same except it uses a single MLP at the final stage.

Table 1: The MATERIAL dataset statistics. For SW and TL, we use the ANALYSIS document set with Q1 for training, Q2 for dev, and Q3 for test. For transfer learning to SO, we use the DEV document set with Q1. Q1 contains open queries where performers can conduct any automatic or manual exploration, while Q2 and Q3 are closed queries where results must be generated with fully automatic systems with no human in the loop.

Table 2: Test set results on English to Swahili and English to Tagalog. We report the TREC ad-hoc retrieval evaluation metrics (MAP, P@20, NDCG@20) and the Actual Query Weighted Value (AQWV).
, we compute the final relevance score by a linear model to combine the model output with the following set of extra fea-

Table 3: Zero-shot transfer learning on the English to Somali test set.