CLIRMatrix: A Massively Large Collection of Bilingual and Multilingual Datasets for Cross-Lingual Information Retrieval

We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139 × 138 = 19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information retrieval and is publicly available at https://github.com/ssun32/CLIRMatrix. We provide baseline neural model results on BI-139, and evaluate MULTI-8 in both single-language retrieval and mixed-language retrieval settings.


Introduction
Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages. CLIR can be very useful in some scenarios. For example, a reporter may want to search foreign-language news to obtain different perspectives for her story; an inventor may explore the patents in another country to understand prior art. Traditionally, translation-based approaches have been used to tackle the CLIR task (Zhou et al., 2012; Oard, 1998; McCarley, 1999): the query translation approach translates the query into the language of the documents, whereas the document translation approach translates the documents into the language of the query. Both approaches rely on a machine translation (MT) system or a bilingual dictionary to map queries and documents into the same language, then employ a monolingual information retrieval (IR) engine to find relevant documents.
Recently, the research community has been actively looking at end-to-end solutions that tackle the CLIR task without the need to build MT systems. This line of work builds upon recent advances in neural information retrieval in the monolingual setting, c.f. (Mitra and Craswell, 2018; Craswell et al., 2020). There are proposals to directly train end-to-end neural retrieval models on CLIR datasets (Sasaki et al., 2018) or on MT bitext (Zbib et al., 2019; Jiang et al., 2020). One can also exploit cross-lingual word embeddings to train a CLIR model on disjoint monolingual corpora (Litschko et al., 2018).
Despite the growing interest in end-to-end CLIR, the lack of a large-scale, easily accessible CLIR dataset covering many language directions in high-, mid-, and low-resource settings has detrimentally affected the CLIR community's ability to replicate and compare with previously published work. For example, among the widely used datasets, the CLEF collection (Ferro and Silvello, 2015) covers many languages but is not large enough for training neural models. The more recent IARPA MATERIAL/OpenCLIR collection (Zavorin et al., 2020) is not yet publicly accessible. This motivates us to design and build CLIRMatrix, a massively large collection of bilingual and multilingual datasets for CLIR.
We construct CLIRMatrix from Wikipedia in an automated manner, exploiting its large variety of languages and massive number of documents. The core idea is to synthesize relevance labels via an existing monolingual IR system, then propagate the labels via Wikidata links that connect documents in different languages. In total, we were able to mine 49 million unique queries in 139 languages and 34 billion (query, document, label) triplets, creating a CLIR collection across a matrix of 139 × 138 = 19,182 language pairs.

Figure 1: Illustration of our CLIRMatrix collection. The BI-139 portion of CLIRMatrix supports research in bilingual retrieval and covers a matrix of 139 × 138 language pairs. The MULTI-8 portion of CLIRMatrix supports research in multilingual modeling and mixed-language (ML) retrieval, where queries and documents are jointly aligned over 8 languages.

From this raw collection, we introduce two datasets:
• BI-139 is a massively large bilingual CLIR dataset that covers 139 × 138 = 19,182 language pairs. To encourage reproducibility, we present standard train, validation, and test subsets for every language direction.
• MULTI-8 is a multilingual CLIR dataset in which queries and documents are jointly aligned across 8 languages.
See Figure 1 for a comparison of BI-139 and MULTI-8. The former facilitates the evaluation of bilingual retrieval over a wide variety of languages, while the latter supports research in mixed-language retrieval (a.k.a. multilingual retrieval (Savoy and Braschler, 2019)), which is an interesting yet relatively under-explored problem. For both, the train sets are large enough to enable the training of neural IR models. We hope CLIRMatrix is useful and can empower further developments in this field of research. To summarize, our contributions are the BI-139 and MULTI-8 datasets, together with baseline neural retrieval results on both.

Methodology
Let q^X be a query in language X, and d^Y be a document in language Y. A bilingual CLIR dataset consists of triples {(q^X_i, d^Y_ij, r_ij)}, where d^Y_ij is the j-th document associated with query q^X_i, and r_ij is a label saying how relevant document d^Y_ij is to query q^X_i. Conventionally, r_ij is an integer, with 0 representing "not relevant" and higher values indicating higher relevance.
Suppose there are J documents in total. In the full-collection search setup, the index j ranges over 1, . . . , J, meaning that each query q^X_i searches over the full set of documents {d^Y_ij}, j = 1, . . . , J. In the re-ranking setup, each query q^X_i searches over a subset of documents obtained by an initial full-collection retrieval engine: {d^Y_ij}, j = 1, . . . , K_i, where K_i ≪ J. For practical reasons, machine learning approaches to IR focus on the re-ranking setup with K_i set to 10 to 1,000 (Liu, 2009; Chapelle and Chang, 2011). We follow the re-ranking setup here.
We now describe the main intuition of our construction method and detail various components and design choices in our pipeline.

Intuition and Assumptions
To create a CLIR dataset, one needs to decide how to obtain q^X_i, d^Y_ij, and r_ij. We set q^X_i to be Wikipedia titles and d^Y_ij to be Wikipedia articles, and we synthesize r_ij automatically using a simple yet reliable method. We argue that Wikipedia is the best available resource for building CLIR datasets for two reasons: First, it is freely available and contains articles in more than 300 languages, covering a large variety of topics. Second, Wikipedia articles are mapped to entities in Wikidata, which provides a relatively reliable way to find the same articles written in other languages.
To synthesize relevance labels r_ij, we propose to first generate labels using an existing monolingual IR system in language X, then propagate the labels via Wikidata links to language Y. In other words, we assume: 1. the availability of documents d^X in the same language as the query, and 2. the feasibility of an existing monolingual IR system in language X providing labels r̂_ij on the (q^X_i, d^X_ij) pairs. Then for any d^Y_ij that links to d^X_ij, we assign the relevance label r_ij = r̂_ij.
This intuition is illustrated in Figure 2. Suppose we wish to find Chinese documents that are relevant to the English query "Barack Obama". We first run monolingual IR to find English documents that answer the query. In the figure, 4 documents are returned, and we attempt to link each to its corresponding Chinese version using Wikidata information. When a link is available, we set the relevance label r_ij for the Chinese document using the English-based IR system's prediction r̂_ij; all other documents are deemed not relevant. This gives us the triplet (q^X_i, d^Y_ij, r_ij). Figure 3 shows the mining pipeline that implements the intuition in Figure 2. First, we download the Wikipedia dump of language X and extract the titles and document bodies of every article. We index the documents into an Elasticsearch search engine, which serves as our monolingual IR system. Using the extracted titles as search queries, we retrieve the top 100 relevant documents and their corresponding BM25 scores from Elasticsearch for every query. We then convert the BM25 scores into discrete relevance judgment labels using Jenks natural break optimization. Finally, we propagate these labels to documents in language Y that are linked via Wikidata.
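The label-propagation step above can be sketched as a small pure function. The helper names and data shapes here (dictionaries keyed by document ID) are illustrative assumptions, not our released code:

```python
def propagate_labels(query, mono_results, cross_links):
    """Propagate monolingual relevance labels to another language.

    mono_results: {doc_id_X: label} from the language-X IR system.
    cross_links:  {doc_id_X: doc_id_Y} derived from Wikidata entity IDs.
    Documents without a cross-language link are dropped; anything not
    returned here is treated as irrelevant (label 0) downstream.
    """
    triplets = []
    for doc_x, label in mono_results.items():
        doc_y = cross_links.get(doc_x)
        if doc_y is not None:
            triplets.append((query, doc_y, label))
    return triplets
```

For the "Barack Obama" example, English documents with labels {6, 4, 2} of which two have Chinese counterparts would yield two (query, Chinese document, label) triplets.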

Mining Pipeline
We downloaded Wikidata and Wikipedia dumps released on January 1, 2020. Since Wikipedia dumps contain tremendous amounts of meta-information such as URLs and scripts, it can be expensive to extract the actual text directly from them. Inspired by Schwenk et al. (2019), we instead extracted document IDs, titles, and bodies from Wikipedia's search indices, which contain raw text data without meta-information.
Wikipedia dumps We discarded dumps with fewer than ten thousand documents; these are usually the dumps of Wikipedias for certain dialects and less commonly used languages. We are left with Wikipedia dumps in 139 languages, containing a good mix of high-, mid-, and low-resource languages. For writing systems that do not use whitespace, such as Chinese, Japanese, and Thai, we truncated documents to approximately the first 600 characters. For other languages, we kept roughly the first 200 tokens of every document. Truncating the documents is necessary for several reasons: First, shorter documents are friendlier to neural models that are bounded by GPU memory. Second, the first few hundred tokens of a Wikipedia article usually contain the main points of the full text and are thus more likely to be topically similar across languages. Last but not least, BM25 tends to over-penalize long documents, which can lead to suboptimal IR performance (Lv and Zhai, 2011). We hypothesize that we can obtain better relevance judgment labels if we use shorter documents.
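The truncation rule above can be sketched as follows; the set of whitespace-free language codes and the whitespace tokenization are simplifying assumptions for illustration:

```python
# Scripts written without word delimiters: truncate by characters.
NO_WHITESPACE_LANGS = {"zh", "ja", "th"}

def truncate_document(text, lang, max_chars=600, max_tokens=200):
    """Keep roughly the first 600 characters (whitespace-free scripts)
    or the first 200 whitespace-separated tokens (other languages)."""
    if lang in NO_WHITESPACE_LANGS:
        return text[:max_chars]
    return " ".join(text.split()[:max_tokens])
```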
Wikidata dump We downloaded the JSON dump of Wikidata, a structured knowledge base that links to Wikipedia. We designed a regex rule that efficiently obtains a list of entity IDs from the Wikidata dump. For every entity ID, we also extracted a list of related (language code, document title) pairs. Using our extracted Wikipedia data, we matched the document titles to Wikipedia document IDs. The extracted data allows us to construct two dictionaries: 1) a dictionary that maps a document ID in some language to its Wikidata entity ID, and 2) a reverse dictionary that maps a Wikidata entity ID to document IDs in different languages. This enables us to quickly locate a document's counterpart in another language; we use this information to link relevant documents across languages.
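The two dictionaries can be sketched as below. The entity ID and document IDs in the test are illustrative values, and the flat (entity, [(lang, doc_id), ...]) record shape is an assumption about the post-extraction data, not the raw Wikidata JSON:

```python
def build_link_dictionaries(entity_records):
    """entity_records: iterable of (entity_id, [(lang, doc_id), ...]).
    Returns (doc_to_entity, entity_to_docs), the forward and reverse
    mappings described in the text."""
    doc_to_entity, entity_to_docs = {}, {}
    for entity_id, docs in entity_records:
        entity_to_docs[entity_id] = dict(docs)
        for lang, doc_id in docs:
            doc_to_entity[(lang, doc_id)] = entity_id
    return doc_to_entity, entity_to_docs

def find_counterpart(doc_to_entity, entity_to_docs, lang, doc_id, target_lang):
    """Locate a document's counterpart in target_lang, if linked."""
    entity = doc_to_entity.get((lang, doc_id))
    if entity is None:
        return None
    return entity_to_docs[entity].get(target_lang)
```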

Design Choices
Document titles as search queries We considered several methods for generating search queries. One quick way is to acquire human-generated search queries directly from search logs. However, this is not a viable option because search logs are not publicly available for most languages. Alternatively, we could engage human annotators to manually write search queries, but this is time-consuming and expensive, and it is not possible to scale the process quickly to 139 languages.
We use document titles as search queries for two reasons: (1) They are readily available in large amounts for each of the 139 languages, which enables us to build large datasets (i.e., I is large). (2) In certain real-world search settings, queries are typically short, spanning only two to three tokens (Belkin et al., 2003), and informational, covering a wide variety of topics (Jansen et al., 2008). We leave the investigation of complex queries to future work. We want to emphasize that our mining pipeline is compatible with all query types; for example, we could use the first sentences of documents as queries (Schamoni et al., 2014; Sasaki et al., 2018) if desired.

Note that documents in different languages do not share document IDs: document N in language X does not refer to the same entity as document N in language Y.

We acknowledge that some inter-language links are potentially missing from Wikidata, which implies that our method may miss the labeling of some relevant documents. Wikidata has several policies to improve its data quality, such as requests for editors to link new Wikipedia articles to entities in Wikidata. There are also automated auditing tools that periodically identify articles with missing or inconsistent Wikidata labels and ask human editors for verification. An interesting problem for future work is to find ways to quantify the coverage of these inter-language links.

BM25 and Elasticsearch
The main step of our mining pipeline is to index documents into a monolingual IR system and then retrieve a list of relevant documents and similarity scores for every query. We assume the similarity score between a query and a document accurately reflects the degree of relevance of that document. Since many Wikipedia dumps contain millions of documents, the computation needed to retrieve relevant documents for all 139 languages is non-trivial; we need a retrieval system that can handle the task both efficiently and accurately. For this reason, we chose Elasticsearch as our monolingual IR system.
Elasticsearch is an open-source, highly optimized search engine built on Apache Lucene. It has built-in analyzers that handle language-specific preprocessing such as tokenization and stemming. By default, Elasticsearch implements the BM25 weighting scheme (Robertson et al., 2009), a bag-of-words retrieval function that calculates similarity scores between queries and documents based on term frequencies and inverse document frequencies. BM25 is a strong baseline that frequently outperforms neural IR models on multiple benchmark IR datasets (Chapelle and Chang, 2011; Guo et al., 2016; McDonald et al., 2018).
We used Elasticsearch 6.5.4 and imported the same settings as the official search indices from Wikipedia. For every query, we configured Elasticsearch to search both document titles and document bodies, with twice the weight given to document titles. We limited Elasticsearch to return only the top 100 documents for each query and assume documents not returned by the search engine are irrelevant. We parallelized the retrieval process by running multiple Elasticsearch instances on numerous servers and dedicated one instance to every language.
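The weighting described above corresponds to a `multi_match` query in Elasticsearch's query DSL, where the `^2` suffix doubles a field's weight. The following sketch builds such a request body; the field names `title` and `text` are assumptions for illustration, not necessarily the field names of Wikipedia's indices:

```python
def build_search_body(query, size=100):
    """Build an Elasticsearch request body that searches document titles
    and bodies, giving titles twice the weight, and caps the result
    list at `size` documents (we use the top 100)."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^2", "text"],
            }
        },
    }
```

With the official Python client, this body would be passed to `es.search(index=..., body=build_search_body("Barack Obama"))`, and the BM25 scores read from the returned hits.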
Discrete relevance judgment labels A potential pitfall of using document titles as queries is that short queries can be ambiguous (Allan and Raghavan, 2002). For example, it is impossible to tell whether the search query "Java" refers to the Java programming language or the Indonesian island without other context words. Fortunately, Wikipedia disambiguates document titles by appending category information to them, e.g., Java (Programming Language) and Java (Island). Nevertheless, we do not want to rank retrieved documents solely on their raw BM25 scores. To mitigate potential ambiguity issues, we smooth the BM25 scores into discrete relevance judgment labels. We achieve this using Jenks natural break optimization (McMaster and McMaster, 2002), an algorithm that finds optimal BM25 score intervals for the different labels by iteratively minimizing the variance within labels and maximizing the variance between labels.
More specifically, for each query q^X_i, we normalized the BM25 scores r̂_ij of d^X_ij to the unit range and then used Jenks optimization to distribute the normalized scores into 5 relevance judgment labels {1, 2, 3, 4, 5}. We emphasize that we did not run Jenks optimization globally across all BM25 scores, because the scales of BM25 scores are not consistent across queries. Additionally, documents that are not returned by Elasticsearch or not linked in Wikidata are deemed irrelevant and given label 0. We also assigned label 6 to the document associated with the title query. The final r_ij is thus on a scale of 0 to 6, with 0 being irrelevant and 6 being most relevant.
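For illustration, the per-query labeling step can be sketched as below. This is a compact O(n²k) dynamic-programming formulation of Fisher-Jenks classification, adequate for the roughly 100 scores per query, and a stand-in for whatever optimized implementation one would actually use:

```python
def jenks_breaks(values, n_classes):
    """Optimal 1-D classification: partition sorted values into n_classes
    contiguous groups minimizing within-group squared deviation.
    Returns the upper boundary value of each class."""
    xs = sorted(values)
    n = len(xs)
    ps = [0.0] * (n + 1)   # prefix sums
    ps2 = [0.0] * (n + 1)  # prefix sums of squares
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def ssd(i, j):  # within-class squared deviation of xs[i:j]
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = [[INF] * (n_classes + 1) for _ in range(n + 1)]
    back = [[0] * (n_classes + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, n + 1):
        for k in range(1, n_classes + 1):
            for i in range(k - 1, j):  # last class covers xs[i:j]
                if cost[i][k - 1] == INF:
                    continue
                c = cost[i][k - 1] + ssd(i, j)
                if c < cost[j][k]:
                    cost[j][k], back[j][k] = c, i
    bounds, j = [], n
    for k in range(n_classes, 0, -1):  # recover class upper bounds
        bounds.append(xs[j - 1])
        j = back[j][k]
    return bounds[::-1]

def label_scores(scores, n_classes=5):
    """Normalize one query's BM25 scores to the unit range, then map
    each score to a label in {1, ..., n_classes} via Jenks breaks."""
    hi, lo = max(scores), min(scores)
    norm = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
    bounds = jenks_breaks(norm, n_classes)
    labels = []
    for x in norm:
        for k, b in enumerate(bounds, start=1):
            if x <= b + 1e-12:
                labels.append(k)
                break
    return labels
```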

Bilingual and Multilingual datasets
BI-139 Using the aforementioned pipeline, we built a bilingual dataset {(q^X_i, d^Y_ij, r_ij)}, i = 1, 2, . . . , I, for every X→Y language direction. In the "raw" version, there are 49.28 million unique queries and 34.06 billion (query, document, label) triplets across 139 × 138 = 19,182 language directions. We also generated a "base" version, which contains standard train, validation, test1, and test2 subsets for each language direction. Train sets contain up to I = 10,000 queries, while validation, test1, and test2 sets each contain up to 1,000 queries. We ensured that queries in the train and validation/test sets of one language direction do not overlap with the queries in the test sets of other language directions. For every query, we ensure there are exactly K = 100 candidate documents by filling any shortfall with random irrelevant documents.
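The shortfall-filling step can be sketched as a small helper; the data shapes are illustrative assumptions:

```python
import random

def pad_candidates(candidates, all_doc_ids, k=100, rng=random):
    """candidates: [(doc_id, label)] from retrieval and label propagation.
    Pad the list to exactly k entries by drawing random documents not
    already present and marking them irrelevant (label 0)."""
    have = {d for d, _ in candidates}
    pool = [d for d in all_doc_ids if d not in have]
    fill = rng.sample(pool, k - len(candidates))
    return candidates + [(d, 0) for d in fill]
```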
MULTI-8 This is a multilingual CLIR dataset covering 8 languages from various regions of the world: Arabic, German, English, Spanish, French, Japanese, Russian, and Chinese. First, we restricted queries to those with a relevant document (r_ij = 6) in all 8 languages. Then, for each query q^X_i, we used the monolingual IR systems to collect 100 documents d^X_ij in the same language. Similar to BI-139 base, if Elasticsearch returns fewer than 100 documents with labels r_ij ≥ 1, we fill the shortfall with random irrelevant documents with label r_ij = 0. Finally, we merge these document lists such that for any query in language X, we also have 7 × 100 documents in the other 7 languages.
Similar to the base version of BI-139, the train sets contain 10,000 queries, while the validation, test1, and test2 sets contain 1,000 queries each; note, however, that the query sets differ. This dataset supports two kinds of research: First, one can still evaluate bilingual CLIR (single-language retrieval) as in BI-139, but exploit multilingual training on more than two languages. Second, one can evaluate multilingual CLIR (mixed-language retrieval), where the document list to be re-ranked contains two or more languages. This research direction is relatively unexplored, with the exception of early work in the 2000s in the CLEF campaign (Savoy and Braschler, 2019).

File Formats
For every language direction, we store queries and their relevant document IDs and labels in the JSON Lines format (Figure 4), e.g.:

{"src id": "6267", "src query": "Cultural imperialism", "tgt results": [["3383724", 6], ["19028", 5], ...]}

For each unique language, we store the IDs and texts of documents in TSV files (Figure 5). Note that we will release both the truncated and the original documents.

Experimental Setup
Recall that our Wikidata entity dictionary can map a language-independent entity to query strings (Wikipedia article titles) in any language.

Baseline neural CLIR model We follow the implementation of the vanilla BERT ranker (MacAvaney et al., 2019), which obtained strong results in monolingual IR. As shown in Figure 6, the model encodes a query-document pair with BERT (Devlin et al., 2019) and stacks a linear combination layer on top of the [CLS] token. We extended the ranker to use multilingual BERT (BERT-Base, Multilingual Cased). At training time, we sample document pairs in which the positive document has a higher relevance judgment label than the negative document. For each document pair, we obtain scores for both documents using the same BERT ranker model. We then optimize the parameters with a pairwise hinge loss and the Adam optimizer. We trained all models for 20 epochs and sampled around 1,000 training pairs for each epoch. At inference time, we re-rank documents based on the output scores of the BERT ranker.
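The pair sampling and loss computation can be sketched as follows, with plain floats standing in for the BERT ranker's output scores; the function names and margin value are illustrative assumptions:

```python
import random

def sample_training_pair(scored_docs, rng=random):
    """scored_docs: [(doc_id, relevance_label)]. Draw a (positive,
    negative) pair in which the positive document has a strictly
    higher relevance judgment label."""
    pairs = [(a, b) for a in scored_docs for b in scored_docs if a[1] > b[1]]
    return rng.choice(pairs) if pairs else None

def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Pairwise hinge loss: zero once the positive document's score
    exceeds the negative one's by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))
```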

Evaluation metric
We report all results in NDCG (normalized discounted cumulative gain), an IR metric that measures the usefulness of documents based on their ranks in the search results (Järvelin and Kekäläinen, 2002). Following common practice in the IR community, we calculate NDCG@10, which only evaluates the top 10 returned documents. For a given query, let ρ_i be the relevance judgment label of the i-th document in the predicted document ranking and φ_i be the relevance judgment label of the i-th document in the ideal document ranking. We define DCG@10 and the ideal DCG@10 (IDCG@10) as:

DCG@10 = Σ_{i=1}^{10} (2^{ρ_i} − 1) / log2(i + 1)
IDCG@10 = Σ_{i=1}^{10} (2^{φ_i} − 1) / log2(i + 1)

We can then calculate NDCG@10 for that query as:

NDCG@10 = DCG@10 / IDCG@10

The NDCG@10 of a test set is the arithmetic mean of the NDCG@10 values over all queries. The metric ranges over [0, 1], and a higher NDCG@10 score means the predicted rankings are closer to the ideal rankings.
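The metric can be sketched directly from the definitions above (labels are passed in predicted order for the full candidate list, so sorting them yields the ideal ranking):

```python
import math

def dcg_at_k(labels, k=10):
    """DCG@k over a ranked list of relevance labels, gain 2^label - 1,
    position discount log2(rank + 1) with ranks starting at 1."""
    return sum((2 ** l - 1) / math.log2(i + 2)
               for i, l in enumerate(labels[:k]))

def ndcg_at_k(predicted_labels, k=10):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal (label-sorted) ranking of the same candidates."""
    ideal = dcg_at_k(sorted(predicted_labels, reverse=True), k)
    return dcg_at_k(predicted_labels, k) / ideal if ideal > 0 else 0.0
```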

Results on BI-139
We present results on the 138 target languages for English queries. For each language direction, we trained a baseline CLIR model on the base train set and kept the checkpoint with the best NDCG@10 on the base validation set. We then re-ranked the documents in the base test1 set and calculated NDCG@10. We observe that low-resource languages such as Yiddish, a High German-derived language, and Walloon, a Romance language, benefit from their similarities to other languages within the same language families. For queries such as named entities, it is also possible that some relevant cross-language Wikipedia documents are multilingual and contain the query term untranslated. The details will depend on the query in question.

Results on MULTI-8
Multilingual IR is a field that has been largely unexplored in recent years. MULTI-8 enables evaluation in two kinds of scenarios (see Table 2): Single-language retrieval This scenario is similar to BI-139 in terms of evaluation, i.e., at test time we only have queries in a source language S and documents in a single target language T. We divide the MULTI-8 test set into 8 × 7 = 56 language pairs.
For training, we compare bilingual models (BM_{S→T}), one trained for each language pair, against a multilingual model (MM) trained on data concatenated from all 56 language directions. As we can see in Table 3, the MM model performs better than the respective BM models in most language directions. This suggests that multilingual training is a promising research direction even for single-language retrieval.
Mixed-language retrieval In this scenario, at test time we have a single source query in language S and wish to retrieve documents that can be in any of the 8 MULTI-8 languages. The multilingual model (MM) can be applied directly, but the bilingual models (BM) require some modification: one can run multiple BMs, one for each target language, and then merge the resulting document lists (Savoy, 2003; Tsai et al., 2008). A common merging strategy, which we adopt here, is to z-normalize the output scores of each model and rank all the test documents by their z-scores.
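The z-score merging strategy can be sketched as follows; the run format (one scored document list per bilingual model) is an illustrative assumption:

```python
import statistics

def z_normalize(scores):
    """Standardize one model's output scores to zero mean, unit
    variance, so scores from different models become comparable."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores) or 1.0  # guard against zero variance
    return [(s - mu) / sd for s in scores]

def merge_bilingual_runs(runs):
    """runs: {target_lang: [(doc_id, raw_score), ...]}, one run per
    bilingual model. Z-normalize within each run, then rank all
    documents globally by z-score."""
    merged = []
    for lang, docs in runs.items():
        zs = z_normalize([s for _, s in docs])
        merged += [(doc, z) for (doc, _), z in zip(docs, zs)]
    return sorted(merged, key=lambda t: t[1], reverse=True)
```

Without the per-run normalization, a model whose raw BM-style scores happen to be on a larger scale would dominate the merged list regardless of actual relevance.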
As seen in Table 4, the multilingual model performs significantly better than the merged bilingual models. The average NDCG@10 of the multilingual model is 0.684, which is 17.1% higher than that of the bilingual models with the z-score merging strategy.


Related Work
Information retrieval (IR) has made a tremendous amount of progress, shifting focus from traditional bag-of-words retrieval functions such as tf-idf (Salton and McGill, 1986) and BM25 (Robertson et al., 2009) to neural IR models (Guo et al., 2016; Hui et al., 2018; McDonald et al., 2018), which have shown promising results on multiple monolingual IR datasets. Recent advances in pretrained language models such as BERT (Devlin et al., 2019) have also led to significant improvements in IR tasks. For example, MacAvaney et al. (2019) achieve state-of-the-art performance on benchmark datasets by incorporating BERT's context vectors into existing baseline neural IR models (McDonald et al., 2018). Training on synthetic labels is also a common practice: e.g., Dehghani et al. (2017) show that supervised neural ranking models can greatly benefit from pre-training on BM25 labels.
Cross-lingual Information Retrieval (CLIR) is a sub-field of IR that is becoming increasingly important as new documents in different languages are generated every day. The field has progressed from translation-based methods (Zhou et al., 2012; Oard, 1998; McCarley, 1999; Yarmohammadi et al., 2019) to recent neural CLIR models (Vulić and Moens, 2015; Litschko et al., 2018) that rely on cross-lingual word embeddings. In contrast to the wide availability of monolingual IR datasets (Voorhees, 2005; Craswell et al., 2020), cross-lingual and multilingual IR datasets remain scarce: the CLEF collection (Ferro and Silvello, 2015) focuses primarily on European languages, and the IARPA MATERIAL/OpenCLIR collection (Zavorin et al., 2020) focuses on a few low-resource language directions. Creating a CLIR dataset for more language directions remains an open challenge.
Extracting CLIR datasets from Wikipedia has been explored in previous work. Schamoni et al. (2014) built a German-English bilingual CLIR dataset from Wikipedia, which contains 245,294 German queries and 1,226,741 English documents. They convert the first sentences of German Wikipedia documents into queries and follow Wikipedia's inter-language links to find relevant documents in English. Sasaki et al. (2018) apply the same techniques and release a larger CLIR dataset containing English queries and relevant documents in 25 languages. Both datasets truncate documents to the first 200 tokens and rely on bidirectional inter-article links to find partially relevant documents. Our contribution differs in three important aspects: (i) BI-139 is a significantly larger dataset, covering more languages and more documents. (ii) MULTI-8 provides a new multilingual retrieval setup, not previously available. (iii) We argue that our method can reliably find more relevant documents by propagating search results from monolingual IR systems to other languages via Wikidata; this is in contrast to directly using bidirectional links extracted from Wikipedia documents to determine relevance, which are much sparser. Further, our method allows for finer-grained levels of relevance (as opposed to, e.g., binary relevance), making the dataset more challenging.
A comparison of various existing CLIR datasets is presented in Table 5.

Conclusion and future work
We present CLIRMatrix, the largest and most comprehensive collection of bilingual and multilingual CLIR datasets to date. The BI-139 dataset supports CLIR in 139 × 138 language pairs, whereas the MULTI-8 dataset enables mixed-language retrieval in 8 languages. The large number of supported language directions allows the research community to explore and build new models for many more languages, especially low-resource ones. We document baseline NDCG results using a neural ranker based on multilingual BERT. Our mixed-language retrieval experiments on MULTI-8 show that a single multilingual model can significantly outperform the combination of multiple bilingual models.
For future work, we think it will be interesting to look at: 1. zero-shot CLIR models for low-resource languages; 2. comparison of end-to-end neural rankers with traditional translation+IR pipelines in terms of scalability, cost, and retrieval accuracy; 3. advanced neural architectures and training algorithms that can exploit our large training data; and 4. building universal models for multilingual IR.