Cross-Lingual Learning-to-Rank with Shared Representations

Cross-lingual information retrieval (CLIR) is a document retrieval task where the documents are written in a language different from that of the user’s query. This is a challenging problem for data-driven approaches due to the general lack of labeled training data. We introduce a large-scale dataset derived from Wikipedia to support CLIR research in 25 languages. Further, we present a simple yet effective neural learning-to-rank model that shares representations across languages and reduces the data requirement. This model can exploit training data in, for example, Japanese-English CLIR to improve the results of Swahili-English CLIR.


Introduction
Multilingual document collections are becoming prevalent. Thus, an important application is cross-lingual information retrieval (CLIR), i.e. document retrieval in which the language of the user's query does not match that of the documents. For example, imagine an investor who wishes to monitor consumer sentiment toward an international brand in Twitter conversations around the world. She might issue a query string in English, and desire all relevant tweets in any language.
There are two main approaches to building CLIR systems. The modular approach involves a pipeline of two components: translation (machine translation or bilingual dictionary look-up) and monolingual information retrieval (IR). These approaches may be further divided into the document translation and query translation approaches (Nie, 2010). In the former, one translates all foreign-language documents to the language of the user query prior to IR indexing; in the latter, one indexes foreign-language documents and translates the query. In both, the idea is to solve the translation problem separately, so that CLIR becomes document retrieval in the monolingual setting.
A distinctly different way to build CLIR systems is what may be called the direct modeling approach (Bai et al., 2010; Sokolov et al., 2013). This assumes the availability of CLIR training examples of the form (q, d, r), where q is an English query, d is a foreign-language document, and r is the corresponding relevance judgment for d with respect to q. One directly builds a retrieval model S(q, d) that scores the query-document pair. While q and d are in different languages, the model directly learns both translation and retrieval relevance on the CLIR training data. Compared to the modular approach, direct modeling is advantageous in that it focuses on learning translations that are beneficial for retrieval, rather than translations that preserve sentence meaning/structure in bitext.
However, there exists no large-scale CLIR dataset that can support direct modeling approaches in a wide variety of languages. To obtain relevance judgments, one typically needs a bilingual speaker who can read a foreign-language document and assess whether it is relevant for a given English query. This can be an expensive process. Here, we present a large-scale dataset that is automatically constructed from Wikipedia: it can support training and evaluation of CLIR systems between English queries and documents in 25 other languages (Section 2). The data is of sufficient size for direct modeling, and can also serve as wide-coverage evaluation data for the modular approaches.
To demonstrate the utility of the data, we further present experiments for CLIR in low-resource languages. First, we introduce a neural CLIR model based on the direct modeling approach (Section 3.1). We then show how we can bootstrap CLIR models for languages with less training data by an appropriate use of parameter sharing among different language pairs (Section 3.2). For example, using the training data for Japanese-English CLIR, we can improve the Mean Average Precision (MAP) results of a Swahili-English CLIR system by 5-7 points (Section 4).
Figure 1: CLIR data construction process. From an English article (E1), we extract the English query. Using the inter-language link, we obtain the most relevant foreign-language document (F1). Any article that has mutual links to and from F1 is labeled as slightly relevant (F2). All other articles are not relevant (F3). The data is a set of tuples (English query q, foreign document d, relevance judgment r), where r ∈ {0, 1, 2} represents the three levels of relevance.

Large-Scale CLIR Dataset
We construct a large-scale CLIR dataset from Wikipedia. The idea is to exploit inter-language links: from an English page, we extract a sentence as the query, and label the linked foreign-language pages as relevant. See Figure 1 for an illustration.
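The labeling scheme illustrated in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the actual construction pipeline: the function name `label_documents`, the link dictionary, and the article identifiers are hypothetical, the mapping of relevance levels to the values 2, 1, and 0 is an assumption, and the link graph is assumed to have been extracted from the Wikipedia dumps beforehand.

```python
def label_documents(linked_doc, out_links, doc_ids):
    """Assign a relevance level to each foreign-language document for one query.

    linked_doc: id of the foreign article reached via the inter-language link.
    out_links:  dict mapping a foreign article id to the set of article ids
                it links to (hypothetical, pre-extracted link graph).
    doc_ids:    all candidate foreign article ids.
    Returns a dict id -> r with r in {0, 1, 2}.
    """
    labels = {}
    for doc in doc_ids:
        if doc == linked_doc:
            labels[doc] = 2  # most relevant: the inter-language-linked article
        elif (doc in out_links.get(linked_doc, set())
              and linked_doc in out_links.get(doc, set())):
            labels[doc] = 1  # slightly relevant: mutual links with linked_doc
        else:
            labels[doc] = 0  # not relevant
    return labels


# Toy link graph: F1 and F2 link to each other, F1 links to F3 one way only.
links = {"F1": {"F2", "F3"}, "F2": {"F1"}, "F3": set()}
labels = label_documents("F1", links, ["F1", "F2", "F3"])
print(labels)
```

Requiring links in both directions keeps one-way mentions such as F3 out of the slightly relevant set.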
This data construction process is similar to that of Schamoni et al. (2014), who built an English-German CLIR dataset, but ours is at a larger scale. Specifically, we use Wikipedia dumps released on August 23, 2017. English queries are obtained by extracting the first sentence of every English Wikipedia article. The intuition is that the first sentence is usually a well-defined summary of its article and should be thematically related to the articles linked to it in other languages. Following Schamoni et al. (2014), title words are removed from the query sentences, because they may be present across different language editions. This deletion prevents the task from becoming an easy keyword matching task.
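As a rough illustration of the query extraction step, the sketch below takes the first sentence of an article and drops every word that also occurs in the title. The function name `extract_query`, the naive period-based sentence split, and the case-insensitive word matching are assumptions made for this example; the text does not specify the tokenization or sentence splitting actually used.

```python
import re


def extract_query(article_title, article_text):
    """Return the first sentence of an article with title words removed."""
    # Naive split at the first period followed by whitespace; a real
    # pipeline would likely rely on a proper sentence splitter.
    first_sentence = re.split(r"(?<=\.)\s+", article_text.strip(), maxsplit=1)[0]
    title_words = {w.lower() for w in re.findall(r"\w+", article_title)}
    kept = [w for w in re.findall(r"\w+", first_sentence)
            if w.lower() not in title_words]
    return " ".join(kept)


query = extract_query(
    "Mount Fuji",
    "Mount Fuji is the highest mountain in Japan. It is an active volcano.",
)
print(query)
```

Only exact word matches with the title are removed, so "mountain" survives even though "Mount" is a title word.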
For practical purposes, each document is limited to the first 200 words of the article. Empty documents and category pages are filtered out. Currently, our dataset consists of more than 2.8 million English queries and relevant documents from 25 other selected languages (see Table 1).
Table 1: CLIR dataset statistics. For each language X, we show the total number of documents in language X and the number of English queries. The number of "most relevant" documents is by definition equal to #Query. The number of "slightly relevant" documents is shown in the column #SR.
In sum, we have created a CLIR dataset that is large-scale in terms of both the number of examples and the number of languages. It can be used in two scenarios: (1) as one mixed-language collection, where an English query may retrieve relevant documents in multiple languages; or (2) as 25 independent datasets for training and evaluating CLIR on English queries against one foreign-language collection. In the experiments in Section 4, we use the dataset according to scenario (2).

Direct Modeling for CLIR

Neural Ranking Model
Given an English query q and a foreign-language document d, our models compute the relevance score S(q, d). First, we represent each word as an n-dimensional vector, so q and d are represented as matrices $Q \in \mathbb{R}^{n \times |q|}$ and $D \in \mathbb{R}^{n \times |d|}$:
$$Q = [E(q_1); E(q_2); \ldots; E(q_{|q|})], \qquad D = [E(d_1); E(d_2); \ldots; E(d_{|d|})]$$
where |q| and |d| are the numbers of tokens in q and d, $q_i$ and $d_i$ denote the i-th term in q and d, E is the embedding function that transforms each term into a dense n-dimensional vector, and ; is the concatenation operator. Then, we apply a convolutional feature map (see footnote 3) to these matrices, followed by tanh activation and average-pooling, to obtain the representation vectors $\hat{q}$ and $\hat{d}$.
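The encoding step (convolution over windows of token embeddings, tanh, then average-pooling) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name `encode` is hypothetical, no padding and no bias term are used, and the toy sizes are far smaller than the 100-dimensional embeddings and 100 filters with window width 4 described in the text.

```python
import math
import random


def encode(token_vectors, filters, window=4):
    """Convolution over token windows, tanh, then average pooling.

    token_vectors: list of embedding vectors (one per token), each of length n.
    filters:       list of kernels; each kernel is a list of `window`
                   weight vectors of length n (an n x window filter).
    Returns one pooled value per filter.
    """
    n_positions = len(token_vectors) - window + 1  # stride 1, no padding (assumed)
    if n_positions < 1:
        raise ValueError("input is shorter than the convolution window")
    pooled = []
    for kernel in filters:
        activations = []
        for start in range(n_positions):
            total = 0.0
            for offset in range(window):
                token = token_vectors[start + offset]
                weights = kernel[offset]
                total += sum(t * w for t, w in zip(token, weights))
            activations.append(math.tanh(total))
        # Average pooling over all window positions.
        pooled.append(sum(activations) / len(activations))
    return pooled


rng = random.Random(0)
n, num_filters = 5, 3  # toy sizes for illustration only
query_matrix = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(6)]
filters = [[[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(4)]
           for _ in range(num_filters)]
q_hat = encode(query_matrix, filters)
print(len(q_hat))
```

Because pooling averages over positions, the output length depends only on the number of filters, so queries and documents of different lengths map to vectors of the same size.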
Next, we define two variations for calculating S(q, d). The first is a cosine model, which computes the cosine similarity between $\hat{q}$ and $\hat{d}$:
$$S_{cos}(q, d) = \frac{\hat{q} \cdot \hat{d}}{\|\hat{q}\| \, \|\hat{d}\|}$$
The second is a deep model, $S_{deep}(q, d)$, with a fully connected layer on top of the concatenation of $\hat{q}$ and $\hat{d}$ (a 200-dimensional vector). Here, $O \in \mathbb{R}^{1 \times h}$ and $W \in \mathbb{R}^{h \times 200}$ are the deep model parameters, and h is the number of dimensions of the hidden state $h_{vec} \in \mathbb{R}^{1 \times h}$. For regularization, we apply dropout with a rate of 0.5 (Srivastava et al., 2014) at the hidden layer.
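The two scoring variants can be sketched as below. The cosine score follows directly from its definition. For the deep score, only the parameter shapes are given in the text, so the tanh hidden activation and the absence of bias terms are assumptions made for this sketch; dropout is omitted because it only applies during training. The function names are hypothetical.

```python
import math


def cosine_score(q_hat, d_hat):
    """Cosine similarity between the query and document representations."""
    dot = sum(a * b for a, b in zip(q_hat, d_hat))
    norm = (math.sqrt(sum(a * a for a in q_hat))
            * math.sqrt(sum(b * b for b in d_hat)))
    return dot / norm if norm > 0 else 0.0


def deep_score(q_hat, d_hat, W, O):
    """Fully connected layer on the concatenation of q_hat and d_hat.

    W: h rows, each of length len(q_hat) + len(d_hat).
    O: h output weights.
    """
    x = q_hat + d_hat  # list concatenation, i.e. [q_hat; d_hat]
    # Hidden state; the tanh activation is an assumption of this sketch.
    h_vec = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
    return sum(o * h for o, h in zip(O, h_vec))


q_hat, d_hat = [1.0, 0.0], [1.0, 0.0]
W = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 1.0]]
O = [1.0, -1.0]
print(cosine_score(q_hat, d_hat))
print(deep_score(q_hat, d_hat, W, O))
```

The cosine model has no parameters beyond the encoders, whereas the deep model adds W and O, which is why it needs more training data to be effective.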
In the training phase, we minimize a pairwise ranking loss, which is widely used for learning-to-rank (Pang et al., 2016; Hui et al., 2017; Xiong et al., 2017; Dehghani et al., 2017). The loss is defined over document pairs (d+, d−), where d+ and d− are a relevant and a non-relevant document for the query, respectively. Only the word embeddings are kept fixed; all other parameters are tuned.
Footnote 3: The n × 4 convolution window has 100 filters and takes a stride of 1.
Figure 2: Illustration of the proposed method. On a low-resource dataset (e.g. Swahili-English), the parameters of the CNN for encoding the query (CNN_En) and the parameters of the fully connected layer (O_En−Sw, W_En−Sw) are initialized with those pre-trained on a high-resource dataset (e.g. Japanese-English).

[Table 2: results on the high-resource datasets (Ja, De, Fr); table contents not recovered.]
We note that there are many other ranking models that can be adapted to CLIR (Huang et al., 2013; Shen et al., 2014; Xiong et al., 2017; Mitra et al., 2017); they share a common framework of extracting features from both query and document and optimizing the scores S(q, d) via some ranking loss.
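The pairwise ranking loss used in training can be illustrated with a hinge formulation, which is a common choice in learning-to-rank. The exact form and the margin value of 1 are assumptions of this sketch, and the function names are hypothetical. The loss is zero once the relevant document outscores the non-relevant one by at least the margin.

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Hinge loss on one (relevant, non-relevant) document pair.

    score_pos: S(q, d+), the score of the relevant document.
    score_neg: S(q, d-), the score of the non-relevant document.
    margin:    required score gap (the value 1.0 is an assumption).
    """
    return max(0.0, margin - (score_pos - score_neg))


def batch_loss(score_pairs, margin=1.0):
    """Mean pairwise loss over a list of (S(q, d+), S(q, d-)) tuples."""
    losses = [pairwise_hinge_loss(p, n, margin) for p, n in score_pairs]
    return sum(losses) / len(losses)


# Three pairs: correctly ranked with a wide gap, correctly ranked with a
# narrow gap, and incorrectly ranked.
score_pairs = [(2.0, 0.5), (0.3, 0.1), (0.2, 0.6)]
loss = batch_loss(score_pairs)
print(loss)
```

Only the score difference matters, so the loss applies equally to the cosine and the deep model.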

Sharing Representations
Training a network like the deep model generally requires a nontrivial amount of data. To address the data requirement for low-resource languages, we propose a simple yet effective method that shares representations across CLIR models trained on different language pairs. We use the same architecture as the deep model (S_deep(q, d), Equation 3), but use the parameters trained on a high-resource dataset (e.g. Japanese-English) to initialize the parameters for a low-resource language pair (e.g. Swahili-English). Figure 2 illustrates the idea. Concretely, we initialize the parameters of the CNN for encoding the query (CNN_q) and the parameters of the fully connected layer (O, W) with the pre-trained parameters. When training on low-resource data, we keep only the word embeddings fixed, and tune the parameters of the CNNs and the fully connected layer.
Table 3: P@1/MAP performance on low-resource datasets. The ∆ columns show the comparison between the basic deep models with in-language training (In) and the deep models with shared parameters (Sh); + indicates that Sh outperforms In, and - indicates that In outperforms Sh. The best value for each dataset is highlighted in bold.
The intuition behind this is that our direct modeling approach encourages $\hat{q}$ and $\hat{d}$ to become language-independent representations of the query and document. The parameters O and W in the deep layer can therefore be used for any language pair. Note that for the cosine model, we can also share the parameters of CNN_q.
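The initialization scheme can be sketched with plain parameter dictionaries. Everything named here (`SHARED_KEYS`, `init_from_pretrained`, `trainable_keys`, and the dictionary keys) is hypothetical and chosen for illustration; the sketch only mirrors the description above: the query CNN and the fully connected layer (O, W) are copied from the high-resource model, the document CNN starts fresh, and word embeddings are excluded from tuning.

```python
import copy

# Components transferred from the high-resource model (hypothetical key names).
SHARED_KEYS = ("cnn_query", "W", "O")


def init_from_pretrained(pretrained, fresh):
    """Build initial parameters for a low-resource language pair.

    pretrained: parameter dict of a model trained on a high-resource pair.
    fresh:      newly initialized parameter dict for the low-resource pair.
    Shared components are copied from `pretrained`; everything else
    (document-side CNN, embeddings) is taken from `fresh`.
    """
    params = copy.deepcopy(fresh)
    for key in SHARED_KEYS:
        params[key] = copy.deepcopy(pretrained[key])
    return params


def trainable_keys(params):
    """All parameters except the word embeddings are tuned."""
    return [k for k in params if not k.startswith("emb_")]


ja_en = {"emb_en": [[0.1]], "emb_doc": [[0.2]], "cnn_query": [[1.0]],
         "cnn_doc": [[2.0]], "W": [[3.0]], "O": [4.0]}
sw_en_fresh = {"emb_en": [[0.1]], "emb_doc": [[0.9]], "cnn_query": [[0.0]],
               "cnn_doc": [[0.0]], "W": [[0.0]], "O": [0.0]}
sw_en = init_from_pretrained(ja_en, sw_en_fresh)
print(sorted(trainable_keys(sw_en)))
```

Deep copies keep the two models independent, so fine-tuning on the low-resource pair does not alter the pre-trained parameters.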

Experiment Results
Setup: We use datasets of 3 high-resource languages (Japanese [Ja], German [De], French [Fr]) and 2 low-resource languages (Tagalog [Tl], Swahili [Sw]). We also subsample the German and French data to match the size of the Swahili data, in order to compare training size effects. Word embeddings of dimension 100 for each language are trained on the Wikipedia corpus using word2vec SGNS (Mikolov et al., 2013). The size of the hidden state in the deep model takes values in {100, 200, 300, 400, 500}. We adopt Adam (Kingma and Ba, 2014) for optimization, train for 20 epochs, and pick the best epoch based on development set loss. For the proposed method of parameter sharing, we use the weights pre-trained on the Japanese-English dataset to initialize the parameters.
High-resource results: Table 2 shows the P@1 (precision at top position) and MAP (mean average precision) for datasets with on the order of 100k+ training queries. The deep models outperformed the cosine models under all conditions, suggesting that the fully connected layer can exploit the large training set to learn more expressive scoring functions.
Low-resource results: Table 3 shows the results on the low-resource datasets under two conditions: training on only the language pair of interest (in-language), or additionally sharing parameters from a pre-trained Japanese-English model. For the in-language case, we observe that the cosine model outperforms the deep model. In contrast to the high-resource results, this implies that deep models, which have many parameters, only become effective when provided with sufficient training data.
For the sharing case, the deep models with parameter sharing outperformed the basic deep models trained only on in-language data under almost all conditions. This indicates that our sharing method reduces the training data requirement. Importantly, by sharing parameters, the deep models are now able to outperform the cosine model and achieve the best results on all datasets.
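For reference, the two metrics reported in Tables 2 and 3 can be computed as in the sketch below. It assumes that a document counts as relevant when its label r is greater than 0 and that each ranked list contains all relevant documents for its query; the text does not state how the three relevance levels are binarized, so both points are assumptions. The function names are hypothetical.

```python
def precision_at_1(ranked_relevance):
    """1.0 if the top-ranked document is relevant, else 0.0.

    ranked_relevance: relevance labels in ranked order, best score first.
    """
    return 1.0 if ranked_relevance and ranked_relevance[0] > 0 else 0.0


def average_precision(ranked_relevance):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:  # any non-zero label counts as relevant (assumption)
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(rankings):
    """Return (P@1, MAP) averaged over all queries."""
    n = len(rankings)
    p_at_1 = sum(precision_at_1(r) for r in rankings) / n
    mean_ap = sum(average_precision(r) for r in rankings) / n
    return p_at_1, mean_ap


# Two toy queries: labels of the returned documents, best score first.
rankings = [[2, 0, 1, 0], [0, 2, 0, 0]]
p_at_1, mean_ap = evaluate(rankings)
print(p_at_1, mean_ap)
```

P@1 only looks at the top position, while MAP rewards placing every relevant document near the top, which is why the two numbers can diverge.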

Conclusion and Future Work
We introduce a large-scale CLIR dataset in 25 languages. This enables the training and evaluation of direct modeling approaches in CLIR. We also present a neural ranking model with shared representations, and demonstrate its effectiveness in bootstrapping CLIR in low-resource languages.
Future work includes: (a) expansion of the dataset to more languages, (b) extraction of different types of queries and relevance judgments from Wikipedia, and (c) development of other ranking models. Importantly, we also plan to evaluate our models on standard CLIR test sets such as TREC (Schäuble and Sheridan, 1997), NTCIR (2007), FIRE (2013) and CLEF (2016). This will help answer the question of whether knowledge learned from automatically generated datasets can be transferred to a wide range of CLIR problems.