XQA: A Cross-lingual Open-domain Question Answering Dataset

Open-domain question answering (OpenQA) aims to answer questions through text retrieval and reading comprehension. Recently, many neural network-based models have been proposed and have achieved promising results in OpenQA. However, the success of these models relies on a massive volume of training data (usually in English), which is not available in many other languages, especially low-resource languages. It is therefore essential to investigate cross-lingual OpenQA. In this paper, we construct a novel dataset, XQA, for cross-lingual OpenQA research. It consists of a training set in English as well as development and test sets in eight other languages. Besides, we provide several baseline systems for cross-lingual OpenQA, including two machine translation-based methods and one zero-shot cross-lingual method (multilingual BERT). Experimental results show that the multilingual BERT model achieves the best results in almost all target languages, while the performance of cross-lingual OpenQA is still much lower than that of English. Our analysis indicates that the performance of cross-lingual OpenQA is related not only to how similar the target language and English are, but also to how difficult the question set of the target language is. The XQA dataset is publicly available at http://github.com/thunlp/XQA.


Introduction
In recent years, open-domain question answering (OpenQA), which aims to answer open-domain questions with a large-scale text corpus, has attracted much attention from natural language processing researchers. Chen et al. (2017) proposed the DrQA model, which uses a text retriever to obtain relevant documents from Wikipedia, and further applies a trained reading comprehension model to extract the answer from the retrieved documents. Moreover, researchers have introduced more sophisticated models, which either aggregate all informative evidence (Lin et al., 2018; Wang et al., 2018b) or filter out noisy retrieved text (Clark and Gardner, 2018; Choi et al., 2017; Wang et al., 2018a) to better predict the answers to open-domain questions. Benefiting from the power of neural networks, these models have achieved remarkable results in OpenQA. However, these neural models must be trained with a huge volume of labeled data. Collecting and labeling large-scale training data for each language is often intractable and unrealistic, especially for low-resource languages. In this case, it is impossible to directly apply existing OpenQA models to many different languages.

* Corresponding author: Maosong Sun
To address this problem, an alternative approach is to build a cross-lingual OpenQA system: it is trained on data in one high-resource source language such as English, and predicts answers for open-domain questions in other target languages. In fact, cross-lingual OpenQA can be viewed as a particular task of cross-lingual language understanding (XLU). Recently, XLU has been applied to many natural language processing tasks such as cross-lingual document classification (Schwenk and Li, 2018), cross-lingual natural language inference (Conneau et al., 2018b), and machine translation (Lample et al., 2018). However, most cross-lingual models focus on word- or sentence-level understanding, while the interaction between questions and documents, as well as the overall understanding of the documents, is essential to OpenQA. To the best of our knowledge, there is still no dataset for cross-lingual OpenQA.
In this paper, we introduce a cross-lingual OpenQA dataset called XQA. It consists of a training set in English, and development and test sets in English, French, German, Portuguese, Polish, Chinese, Russian, Ukrainian, and Tamil.

[Table 1: Example questions and answers in various languages, e.g., English: "Do you know that the <Query> is the largest stingray in the Atlantic Ocean, at up to across and weighing?"]

Moreover, we build several baseline systems for cross-lingual OpenQA that use the information of multilingual data from publicly available corpora, including two translation-based methods that translate the training data and test data respectively, and one zero-shot cross-lingual method (multilingual BERT (Devlin et al., 2019)). We evaluate the performance of the proposed baselines in terms of text retrieval and reading comprehension for different target languages on the XQA dataset.
The experimental results demonstrate that there is a gap between the performance in English and that in the cross-lingual setting. The multilingual BERT model achieves the best performance in almost all target languages, while translation-based methods suffer from the problem of translating named entities. We show that the performance on the XQA dataset depends not only on how similar the target language and English are, but also on how difficult the question set of the target language is. Based on the results, we further discuss potential improvements for cross-lingual OpenQA systems.
We will release the dataset and baseline systems online with the hope that this will contribute to research on cross-lingual OpenQA and cross-lingual language understanding in general.

Open-domain Question Answering
OpenQA, first proposed by Green et al. (1961), aims to answer an open-domain question by utilizing external resources. Over the years, most work in this area has focused on using documents (Voorhees et al., 1999), online webpages (Kwok et al., 2001), and structured knowledge graphs (Bordes et al., 2015). Recently, with the advancement of reading comprehension techniques (Chen et al., 2016; Dhingra et al., 2017; Cui et al., 2017), Chen et al. (2017) utilized both information retrieval and reading comprehension techniques to answer open-domain questions. However, this approach usually suffers from a noise problem, since the data is constructed under the distant supervision assumption. Hence, researchers have made various attempts to alleviate the noise problem in OpenQA. Wang et al. (2018a) and Choi et al. (2017) performed paragraph selection before extracting the answer to the question. Min et al. (2018) proposed to select a minimal set of sentences with sufficient information to answer the question, while Lin et al. (2018) and Wang et al. (2018b) took all informative paragraphs into consideration by aggregating evidence across multiple paragraphs. Moreover, Clark and Gardner (2018) applied a shared-normalization learning objective over sampled paragraphs. All the models mentioned above were only verified in a single language (usually English) with vast volumes of labeled data, and cannot be easily extended to the cross-lingual scenario.

Cross-lingual Language Understanding
In recent years, plenty of work has focused on multilingual word representation learning, including learning from parallel corpora (Gouws et al., 2015; Luong et al., 2015), with a bilingual dictionary (Zhang et al., 2016), and even in a fully unsupervised manner (Conneau et al., 2018a). These multilingual word representation models can easily be extended to multilingual sentence representations by averaging the representations of all words (Klementiev et al., 2012). Nevertheless, this method does not take into account the structural information of sentences. To address this issue, much effort has been devoted to using the context vector of an NMT system as a multilingual sentence representation (Schwenk and Douze, 2017; Espana-Bonet et al., 2017). Recently, Artetxe and Schwenk (2018) proposed to utilize a single encoder to learn joint multilingual sentence representations for 93 languages. Besides, Devlin et al. (2019) released a multilingual version of BERT which encodes over 100 languages with a unified encoder. These models have shown their effectiveness in several cross-lingual NLP tasks such as document classification (Klementiev et al., 2012), textual similarity (Cer et al., 2017), natural language inference (Conneau et al., 2018b), and dialogue systems (Schuster et al., 2019). However, there is still no existing benchmark for cross-lingual OpenQA.
In addition, another line of research attempts to answer questions in one language using documents in other languages (Magnini et al., 2004; Vallin et al., 2005; Magnini et al., 2006). Different from their setting, we focus on building question answering systems for other languages using labeled data from a rich-resource source language such as English, while the documents are in the same language as the questions.

Cross-lingual Open-domain Question Answering
Existing OpenQA models usually first retrieve documents related to the question from a large-scale text corpus using an information retrieval module, and then predict the answer from these retrieved documents with a reading comprehension module. Formally, given a question Q, the OpenQA system first retrieves m documents (paragraphs) P = {p_1, p_2, ..., p_m} corresponding to the question Q through the information retrieval system, and then models the probability distribution of the answer given the question and the documents, Pr(A|Q, P).
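This retrieve-then-read pipeline can be sketched as follows; `retrieve` and `read` here are toy stand-ins for a real IR module and reading comprehension model, purely for illustration:

```python
def retrieve(question, corpus, m=2):
    """Toy retriever: rank documents by word overlap with the question
    (a real system would use an inverted index with BM25 or similar)."""
    q = set(question.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:m]

def read(question, paragraph):
    """Toy reader standing in for Pr(A | Q, P): propose paragraph words
    that do not appear in the question, scored by the paragraph's
    overlap with the question."""
    q = set(question.lower().split())
    words = paragraph.lower().split()
    overlap = len(q & set(words))
    candidates = [(w, overlap) for w in words if w not in q]
    return max(candidates, key=lambda c: c[1])

def answer(question, corpus, m=2):
    """Retrieve m paragraphs, read each one, return the best-scoring span."""
    spans = [read(question, p) for p in retrieve(question, corpus, m)]
    return max(spans, key=lambda s: s[1])[0]
```

The key point of the sketch is the final argmax over spans from all retrieved paragraphs, which is where the calibration issue addressed by shared normalization (Section on baselines) arises.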
In the cross-lingual OpenQA task, we are given labeled training data in a source language s, while questions and documents at test time are in a target language t. The cross-lingual OpenQA system aims to learn language-independent features, and then build an answer predictor that is able to model the answer prediction probability Pr_t(A^t | Q^t_i, P^t_i) for the target language under supervision from the source language.
In the following part of this section, we will introduce our baseline systems for cross-lingual OpenQA, including two translation-based methods and one zero-shot cross-lingual method.

Translation-Based Methods
The most straightforward solution for cross-lingual OpenQA is to combine a machine translation system with a monolingual OpenQA system. In this paper, we consider two ways to use the machine translation system: first, Translate-Train, which translates the training dataset from the source language into the target languages, and then trains a standard OpenQA system on the translated data; second, Translate-Test, in which an OpenQA system is built with the training data in the source language, and questions and retrieved articles are translated from the target languages into the source language. For the OpenQA model, we select two state-of-the-art models:

Document-QA, proposed by Clark and Gardner (2018), is a multi-layer neural network which consists of a shared bi-directional GRU layer, a bi-directional attention layer, and a self-attention layer to obtain the question and paragraph representations. To produce well-calibrated answer scores on each paragraph, Document-QA samples multiple paragraphs and applies a shared-normalization learning objective to them.
The BERT model (short for Bidirectional Encoder Representations from Transformers), proposed by Devlin et al. (2019), pre-trains deep bidirectional representations by jointly conditioning on context in all layers. We use BERT to encode questions and paragraphs, and also adopt the shared-normalization learning objective on top of it to generate well-calibrated answer scores.
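The shared-normalization objective can be illustrated with a small numerical sketch (pure Python, raw span scores assumed): instead of a softmax inside each paragraph, a single softmax is taken over span scores pooled across all sampled paragraphs, so scores remain comparable between paragraphs.

```python
import math

def per_paragraph_probs(span_scores):
    """Baseline: independent softmax inside each paragraph."""
    out = []
    for scores in span_scores:
        z = sum(math.exp(s) for s in scores)
        out.append([math.exp(s) / z for s in scores])
    return out

def shared_norm_probs(span_scores):
    """Shared normalization: one softmax over spans pooled across all
    sampled paragraphs for the same question."""
    z = sum(math.exp(s) for scores in span_scores for s in scores)
    return [[math.exp(s) / z for s in scores] for scores in span_scores]
```

With per-paragraph normalization, a mediocre span in a weak paragraph can still receive probability close to 1 within its own paragraph; shared normalization pushes such spans down relative to strong spans in other paragraphs, which is what "well-calibrated answer scores" refers to above.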
These two translation-based methods are simple and effective, but still have drawbacks. Both the translate-train and translate-test methods rely heavily on the quality of the machine translation system, which varies across language pairs depending on the size of the available parallel data and the similarity of the language pair.
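The two data flows described above can be sketched as follows. Everything here is a toy stand-in: `translate` is a word-by-word lexicon in place of a real MT system, and `simple_trainer` a placeholder for OpenQA model training; only the direction of translation in each flow reflects the actual methods.

```python
# Toy bilingual lexicon; a real system would use an NMT model.
LEX = {("en", "de"): {"capital": "hauptstadt", "france": "frankreich"},
       ("de", "en"): {"hauptstadt": "capital", "frankreich": "france"}}

def translate(text, src, tgt):
    """Word-by-word toy translation; unknown words pass through."""
    table = LEX.get((src, tgt), {})
    return " ".join(table.get(w, w) for w in text.split())

def simple_trainer(train_data):
    """Stand-in for training an OpenQA model: returns a reader that
    predicts the first paragraph word not appearing in the question."""
    def model(question, paragraph):
        q = set(question.split())
        return next(w for w in paragraph.split() if w not in q)
    return model

def translate_train(train_src, test_tgt, tgt):
    """Translate-Train: translate the English training data into the
    target language, then train and predict in the target language."""
    train_tgt = [(translate(q, "en", tgt), translate(p, "en", tgt), a)
                 for q, p, a in train_src]
    model = simple_trainer(train_tgt)
    return [model(q, p) for q, p, _ in test_tgt]

def translate_test(train_src, test_tgt, tgt):
    """Translate-Test: train in English, translate target-language test
    questions and retrieved text into English at prediction time."""
    model = simple_trainer(train_src)
    return [model(translate(q, tgt, "en"), translate(p, tgt, "en"))
            for q, p, _ in test_tgt]
```

Note that Translate-Train pays the translation cost once, offline, over the whole training set, whereas Translate-Test pays it at prediction time for every question and retrieved article.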

Zero-shot Cross-lingual Method
The zero-shot cross-lingual method uses a unified model for both the source and target languages, which is trained with labeled data in the source language and then applied directly to the target language. In this paper, we select the widely used multilingual BERT model, since it has already proven successful on reading comprehension benchmarks such as SQuAD (Devlin et al., 2019).
Multilingual BERT is a multilingual version of BERT, which is trained with the Wikipedia dumps of the top 100 languages in Wikipedia. Similar to the monolingual OpenQA model, we also fine-tune the multilingual BERT model with the shared-normalization learning objective.

The XQA Dataset
In this paper, we collect a novel dataset called XQA to support the cross-lingual OpenQA task.

Data Collection
Wikipedia provides a daily "Did you know" box on the main page of various language editions, which contains several factual questions from Wikipedia editors, with links to the corresponding answers. This serves as a good source for cross-lingual OpenQA.
We collect questions from this section, and use the entity name as well as its aliases from the WikiData knowledge base as gold answers. For each question, we retrieve the top-10 Wikipedia articles ranked by BM25 as relevant documents. Examples in various languages are shown in Table 1.
In Wikipedia articles, the entity name almost always appears at the very beginning of the document. The model may trivially predict the first few words, ignoring the true evidence in relevant documents. In order to avoid this, we remove the first paragraph from each document.
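The two preprocessing steps above, BM25 retrieval and first-paragraph removal, can be sketched in plain Python. This is a simplified illustration; the actual XQA pipeline may use a different BM25 variant, parameters, and tokenization.

```python
import math
from collections import Counter

def bm25_top_k(query_tokens, docs_tokens, k=10, k1=1.5, b=0.75):
    """Rank tokenized documents against a tokenized query with BM25
    and return the indices of the top-k documents."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))

    def idf(t):
        return math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = sum(idf(t) * tf[t] * (k1 + 1) /
                (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
                for t in query_tokens if t in tf)
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])[:k]

def strip_first_paragraph(article):
    """Remove the leading paragraph, where the entity name almost always
    appears, so models cannot trivially read off the answer."""
    paragraphs = article.split("\n\n")
    return "\n\n".join(paragraphs[1:]) if len(paragraphs) > 1 else ""
```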
In total, we collect 90,610 questions in nine languages. For English, we keep around 3,000 questions for the development and test sets respectively, and use the remaining questions as the training set. For the other languages, we evenly split the questions into development and test sets. The detailed statistics for each language are shown in Table 3.

Table 4: Top-10 answer types from WikiData in selected languages.

Rank | English | French | German | Russian | Tamil
1 | human | human | human | human | human
2 | taxon | taxon | taxon | taxon | literary work
3 | film | commune of France | film | film | city
4 | church | film | book | book | film
5 | book | book | song | archaeological site | book
6 | business enterprise | song | archaeological site | battle | chemical compound
7 | song | album | business enterprise | painting | disease
8 | album | sovereign state | painting | song | ethnic group
9 | video game | fossil taxon | album | literary work | archaeological site
10 | single | single | fossil taxon | single | chemical element

Dataset Analysis
We calculate the average length of questions and documents in different languages; the results are shown in Table 2. The average question length for most languages falls in the range of 10 to 20 tokens, and the average over all languages is 18.97 tokens.
The documents in the XQA dataset are considerably long, containing 703.62 tokens and 11.02 paragraphs on average. Documents in Tamil and Polish are among the shortest, with average lengths of 200.45 and 256.87 tokens respectively. Documents in French and Ukrainian contain many more paragraphs than documents in other languages.
To understand whether questions in different languages have different topic distributions, we match the answers to WikiData entities and obtain their types accordingly (note that many answers either cannot be matched to a WikiData entity or do not have a type label in WikiData). The top answer types for some of the languages are displayed in Table 4. As we can see, there are some common topics across all languages, with human ranking first, and film and book ranking high. Besides, many questions in French are related to commune of France, while the topic battle ranks high in Russian. This indicates that XQA captures different data distributions for different languages, which may be influenced by cultural differences to some extent.

Implementation Details
In the translate-test setting, we use our own translation system THUMT (Zhang et al., 2017) to translate German, French, Portuguese, Russian, and Chinese data into English. Google Translate is used for Polish, Ukrainian, and Tamil, as they are not supported by our translation system. Since it is very time-consuming to translate the large training set, we only perform the translate-train experiment for two selected languages, i.e., German and Chinese, using our translation system. To give an idea of the performance of the translation models, we report their BLEU scores on some public benchmarks in Table 5.
To handle multiple paragraphs for a single question, following Clark and Gardner (2018), we adopt shared-normalization over sampled paragraphs as the training objective for all models. Documents are restructured by merging consecutive paragraphs up to 400 tokens. During testing, the model is run on the top-5 restructured paragraphs separately, and the answer span with the highest score is chosen as the prediction.
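The paragraph-merging step can be sketched as follows (a sketch assuming whitespace tokenization; a paragraph longer than the budget is kept whole rather than split):

```python
def restructure(paragraphs, max_tokens=400):
    """Greedily merge consecutive paragraphs until adding the next one
    would exceed max_tokens, as in the document restructuring above."""
    merged, current, length = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and length + n > max_tokens:
            merged.append(" ".join(current))
            current, length = [], 0
        current.append(p)
        length += n
    if current:
        merged.append(" ".join(current))
    return merged
```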
For the DocumentQA model, we use the official implementation and follow the setting for TriviaQA-Wiki in Clark and Gardner (2018). We use GloVe 300-dimensional word vectors in the translate-test setting, and 300-dimensional Skip-gram word vectors trained on Chinese/German Wikipedia dumps in the translate-train setting.
Our BERT model is similar to the BERT model for SQuAD in Devlin et al. (2019), but we use shared-normalization over sampled paragraphs during training. We use the BASE setting with a maximum sequence length of 512. The translate-test model is initialized with the publicly released "BERT-Base, Cased" pretrained model, while the translate-train and multilingual BERT models are initialized with the "BERT-Base, Multilingual Cased" model.
The widely accepted exact match (EM) and F1 over tokens in the answer(s) are used as the evaluation metrics. In the translate-test setting, we translate the gold answers from the target languages into English, and report results based on the translated answers.
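The two metrics can be sketched as follows (a simplified version: whitespace-and-lowercase normalization only, without the article/punctuation stripping typically used in SQuAD-style evaluation):

```python
from collections import Counter

def exact_match(prediction, gold_answers):
    """EM: 1 if the prediction exactly matches any gold answer."""
    norm = lambda s: " ".join(s.lower().split())
    return int(any(norm(prediction) == norm(g) for g in gold_answers))

def f1_score(prediction, gold_answers):
    """Token-level F1 against the best-matching gold answer."""
    def f1(pred, gold):
        p, g = pred.lower().split(), gold.lower().split()
        overlap = sum((Counter(p) & Counter(g)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)
    return max(f1(prediction, g) for g in gold_answers)
```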

Retrieval Results
First, we show the retrieval performance for different languages in Table 7. As we can see, the retrieval performance varies for questions from different language sets. The retrieval results for questions from the English, French, and Chinese sets are among the best, while answers to questions from the Portuguese, Polish, and Russian sets are much harder to retrieve. Figure 1 suggests that as question length increases, retrieval performance grows in all languages. This is not difficult to understand: longer questions provide more information and make the retrieval problem easier.

Overall Results

Table 6 shows the overall results for the different methods in different languages. There is a large gap between the performance in English and that in other target languages, which implies that the task of cross-lingual OpenQA is difficult.

On the English test set, the performance of the multilingual BERT model is worse than that of the monolingual BERT model. In almost all target languages, however, the multilingual model achieves the best result, manifesting its ability to capture answers for questions across various languages.
Comparing DocumentQA to BERT, although they have similar performance in English, BERT consistently outperforms DocumentQA by a large margin in all target languages in both the translate-test and translate-train settings. We conjecture that this is because the BERT model, which has been pretrained on large-scale unlabeled text, has better generalization power and can better handle the distribution mismatch between the original English training data and the machine-translated test data.

[Table 9: Performance with respect to language distance and percentage of "easy" questions.]

Translate-train methods outperform translate-test methods in all cases except for DocumentQA in German. This may be due to the fact that DocumentQA uses space-tokenized words as basic units. In German, there is no space between the parts of compound words, resulting in countless possible combinations. Therefore, many of the words in the translate-train German data do not have pretrained word vectors. In contrast, using the WordPiece tokenizer, BERT is not affected by this.

Reading Comprehension Results across Different Languages
To remove the influence of retrieval, and to compare reading comprehension performance across different target languages, we conduct experiments on the subset of questions whose answers can be found in the top-10 retrieved documents. As BERT consistently outperforms DocumentQA among the translation-based methods, we only report the results of the BERT model in Table 8. We assume that reading comprehension performance in a target language depends on two factors: the degree of similarity between the target language and the source language (i.e., English), and the intrinsic difficulty of the question set in the target language. To quantify the intrinsic difficulty of the question sets in different languages, we calculate the percentage of questions whose answers can be found in the sentence that shares the most words with the question. We refer to those questions as "easy" questions, and use their percentage as a rough indicator of how hard the subset is.
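The "easy"-question indicator described above can be computed as follows (a sketch assuming pre-split sentences and whitespace tokenization; the paper's exact matching procedure may differ):

```python
def is_easy(question, document_sentences, answer):
    """A question counts as "easy" if the answer string appears in the
    sentence that shares the most words with the question."""
    q = set(question.lower().split())
    best = max(document_sentences,
               key=lambda s: len(q & set(s.lower().split())))
    return answer.lower() in best.lower()

def easy_fraction(examples):
    """Fraction of (question, sentences, answer) triples that are easy,
    a rough indicator of a question set's intrinsic difficulty."""
    return sum(is_easy(*ex) for ex in examples) / len(examples)
```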
To measure the degree of similarity between the target language and English, we use the genetic distance of the language pair given by eLinguistics.net. In their model, the score for two languages is calculated by comparing the consonants in certain well-chosen words; the quantification of the consonant relationship is established partially with data from Brown et al. (2013). The larger the distance is, the less similar English and the target language are.
The results in Table 9 verify our assumption. The performance for different languages generally decreases as the genetic distance grows. The exceptions are Chinese and Portuguese, since the percentages of "easy" questions in them are significantly higher than those in other languages. For languages that have similar genetic distances to English (i.e., Russian, Ukrainian, and Portuguese), the performance increases as the percentage of "easy" questions grows.

Limitation of Translation-based Method
Our experiments demonstrate that translation-based methods do not perform well in the cross-lingual OpenQA task. In particular, we observe a large gap between the results of multilingual BERT and translate-test BERT for Chinese and Tamil. Through error analysis, we find that for a large portion of questions in Chinese and Tamil, the answers are translated into different forms under different conditions (i.e., with and without context). This significantly decreases the scores of translation-based systems in these languages. In Figure 2, we show the difference in reading comprehension performance (EM) between translate-test BERT and multilingual BERT, along with the percentage of questions whose answers are translated into different forms in the documents. As we can see, there is a correlation between the two variables.
In fact, the performance of translation-based methods depends heavily on the translation quality of named entities. Named entities are critical for question answering: for many factual questions, the answers are either named entities themselves, or highly related to named entities (e.g., the property of a named entity). Translation errors or inconsistencies in named entities significantly hurt the performance of a translation-based cross-lingual OpenQA system. As shown in Figure 3, the named entity "未央宫 (Weiyang Palace)" is incorrectly translated as "Fuyang Palace" in the question, while it is correctly translated in the retrieved document. In addition, as we can see from the underlined parts, highly similar expressions in the question and the retrieved document are translated into largely different ones.
Compared to other words or phrases that occur more frequently in the training corpus, named entities are more varied and flexible, and thus are translated less reliably by prevailing neural machine translation systems. While some work has focused on solving this problem (Hassan et al., 2007; Jiang et al., 2007; Grundkiewicz and Heafield, 2018), it remains largely under-researched. With a translation system that handles named entities better, translation-based methods could potentially obtain better results.

Zero-shot Cross-lingual Method
Zero-shot cross-lingual methods are trained on pure English data without involving machine translation systems, which saves considerable effort. Moreover, a single model can be applied directly to various languages. Thus, compared to translation-based methods, the zero-shot cross-lingual method seems to be a more practical way to build a cross-lingual OpenQA system.

[Figure 3: A translation result illustrating inconsistent named-entity translation. Question: "<Query> is located on the southwest side of Han Chang'an City. It is connected with the Fuyang Palace." Retrieved Text: "... and built a Jianzhang Palace outside Chang'an City ... and there is a cross between the Weiyang Palace and the city wall ..." Answer: Jianzhang Palace]

Although trained and tested in different languages, the multilingual BERT model achieves relatively good results on the XQA dataset. This may indicate that, by pretraining a unified text encoder, multilingual BERT can transfer the ability to capture common interaction patterns between texts across different languages. To further investigate the cross-lingual transfer power of multilingual BERT, we examine the difference in reading comprehension performance between the English and Chinese test sets, for "easy" questions and other questions respectively. The results in Table 10 show that the performance gap between the source language and the target language is much smaller for "easy" questions than for other questions. This may indicate that multilingual BERT is better at capturing shallow matching information across different languages.
Although multilingual BERT has been shown to have a certain capacity for cross-lingual understanding, no parallel data is used in training it. Another line of research extracts multilingual representations from the context vectors of NMT models trained on parallel data (Schwenk and Douze, 2017; Artetxe and Schwenk, 2018), which may be complementary to multilingual BERT. Very recently, Lample and Conneau (2019) proposed a multilingual language model that leverages both monolingual and parallel data. Incorporating monolingual and parallel data may help to improve performance in cross-lingual OpenQA.

Conclusion
In this paper, we discuss the problem of cross-lingual open-domain question answering, and present a novel dataset, XQA, which consists of a total of 90k question-answer pairs in nine languages.
We further examine the performance of two translation-based methods and one zero-shot cross-lingual method on the XQA dataset. The experimental results show that multilingual BERT achieves the best result in almost all target languages. The performance of translation-based methods could be improved by applying a machine translation system that better translates named entities, while the multilingual BERT model may be improved by incorporating parallel data alongside monolingual data.
We hope our work can contribute to the development of cross-lingual OpenQA systems and further promote research on cross-lingual language understanding in general.