MLQA: Evaluating Cross-lingual Extractive Question Answering

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making building QA systems that work well in other languages challenging. In order to develop such systems, it is crucial to invest in high-quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA has over 12K instances in English and 5K in each other language, with each instance parallel between 4 languages on average. We evaluate state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA. In all cases, transfer results are shown to be significantly behind training-language performance.


Introduction
Question answering (QA) is a central and highly popular area in NLP, with an abundance of datasets available to tackle the problem from various angles, including extractive QA, cloze-completion, and open-domain QA (Richardson, 2013; Rajpurkar et al., 2016; Chen et al., 2017; Kwiatkowski et al., 2019). The field has made rapid advances in recent years, even exceeding human performance in some settings.
Despite such popularity, QA datasets in languages other than English remain scarce, even for relatively high-resource languages (Asai et al., 2018), as collecting such datasets at sufficient scale and quality is difficult and costly. There are two reasons why this lack of data prevents internationalization of QA systems. First, we cannot measure progress on multilingual QA without relevant benchmark data. Second, we cannot easily train end-to-end QA models on the task, and arguably most recent successes in QA have been in fully supervised settings. Given recent progress in cross-lingual tasks such as document classification (Lewis et al., 2004; Klementiev et al., 2012; Schwenk and Li, 2018), semantic role labelling (Akbik et al., 2015) and NLI (Conneau et al., 2018), we argue that multilingual QA training data might be useful but is not strictly necessary, whereas multilingual evaluation data is a must-have.
Recognising this need, several cross-lingual datasets have recently been assembled (Asai et al., 2018). However, these generally cover only a small number of languages, combine data from different authors and annotation protocols, lack parallel instances, or explore less practically-useful QA domains or tasks (see Section 3). Highly parallel data is particularly attractive, as it enables fairer comparison across languages, requires fewer source language annotations, and allows for additional evaluation setups at no extra annotation cost. A purpose-built evaluation benchmark dataset covering a range of diverse languages, and following the popular extractive QA paradigm on a practically-useful domain, would be a powerful testbed for cross-lingual QA models.
With this work, we present such a benchmark, MLQA, and hope that it serves as an accelerator for multilingual QA in the way datasets such as SQuAD (Rajpurkar et al., 2016) have done for its monolingual counterpart. MLQA is a multi-way parallel extractive QA evaluation benchmark in seven languages: English, Arabic, German, Vietnamese, Spanish, Simplified Chinese and Hindi. To construct MLQA, we first automatically identify sentences from Wikipedia articles which have the same or similar meaning in multiple languages. We extract the paragraphs that contain such sentences, then crowd-source questions on the English paragraphs, making sure the answer is in the aligned sentence. This makes it possible to answer the question in all languages in the vast majority of cases. 2 The generated questions are then translated to all target languages by professional translators, and answer spans are annotated in the aligned contexts for the target languages.
The resulting corpus has between 5,000 and 6,000 instances in each language, and more than 12,000 in English. Each instance has an aligned equivalent in multiple other languages (always including English), the majority being 4-way aligned. Combined, there are over 46,000 QA annotations.
We define two tasks to assess performance on MLQA. The first, cross-lingual transfer (XLT), requires models trained in one language (in our case English) to transfer to test data in a different language. The second, generalised cross-lingual transfer (G-XLT), requires models to answer questions where the question and context languages are different, e.g. questions in Hindi and contexts in Arabic, a setting possible because MLQA is highly parallel.
We provide baselines using state-of-the-art cross-lingual techniques. We develop machine translation baselines which map answer spans based on the attention matrices from a translation model, and use multilingual BERT and XLM (Lample and Conneau, 2019) as zero-shot approaches. We use English for our training language and adopt SQuAD as a training dataset. We find that zero-shot XLM transfers best, but all models lag well behind training-language performance.
In summary, we make the following contributions: i) We develop a novel annotation pipeline to construct large multilingual, highly-parallel extractive QA datasets; ii) We release MLQA, a 7-language evaluation dataset for cross-lingual QA; iii) We define two cross-lingual QA tasks, including a novel generalised cross-lingual QA task; iv) We provide baselines using state-of-the-art techniques, and demonstrate significant room for improvement.

The MLQA corpus
First, we state our desired properties for a cross-lingual QA evaluation dataset. We note that whilst some existing datasets exhibit these properties, none exhibit them all in combination (see section 3). We then describe our annotation protocol, which seeks to fulfil these desiderata.
Footnote 2: The automatically aligned sentences occasionally differ in a named entity or information content, or some questions may not make sense without the surrounding context. In these rare cases, there may be no answer for some languages.
Parallel: The dataset should consist of instances that are parallel across many languages. First, this makes comparison of QA performance as a function of transfer language fairer. Second, additional evaluation setups become possible, as questions in one language can be applied to documents in another. Finally, annotation cost is also reduced as more instances can be shared between languages.
Natural Documents: Building a parallel QA dataset in many languages requires access to parallel documents in those languages. Manually translating documents at sufficient scale entails huge translator workloads, and could result in unnatural documents. Exploiting existing naturally-parallel documents is advantageous, providing high-quality documents without requiring manual translation.
Diverse Languages: A primary goal of cross-lingual research is to develop systems that work well in many languages. The dataset should enable quantitative performance comparison across languages with different linguistic resources, language families and scripts.
Extractive QA: Cross-lingual understanding benchmarks are typically based on classification (Conneau et al., 2018). Extracting spans in different languages represents a different language understanding challenge. Whilst there are extractive QA datasets in a number of languages (see Section 3), most were created at different times by different authors with different annotation setups, making cross-language analysis challenging.
Textual Domain: We require a naturally highly language-parallel textual domain. Also, it is desirable to select a textual domain that matches existing extractive QA training resources, in order to isolate the change in performance due to language transfer.
To satisfy these desiderata, we identified the method described below and illustrated in Figure 1. Wikipedia represents a convenient textual domain, as its size and multi-linguality enable collection of data in many diverse languages at scale. It has been used to build many existing QA training resources, allowing us to leverage these to train QA models, without needing to build our own training dataset. We choose English as our source language as it has the largest Wikipedia, and to easily source crowd workers. We choose six other languages which represent a broad range of linguistic phenomena and have sufficiently large Wikipedia.
[Figure 1: overview of the annotation pipeline (figure not reproduced); its example English context is a Wikipedia passage about the Moon and eclipses.]
Our annotation pipeline consists of three main steps:
Step 1) We automatically extract paragraphs which contain a parallel sentence from articles on the same topic in each language (left of Figure 1).
Step 2) We employ crowd-workers to annotate questions and answer spans on the English paragraphs (centre of Figure 1). Annotators must choose answer spans within the parallel source sentence. This allows annotation of questions in the source language with high probability of being answerable in the target languages, even if the rest of the context paragraphs are different.
Step 3) We employ professional translators to translate the questions and to annotate answer spans in the target language (right of Figure 1).
The following sections describe each step in the data collection pipeline in more detail.

Parallel Sentence Mining
Parallel sentence mining allows us to leverage naturally-written documents and avoid translation, which would be expensive and result in potentially unnatural documents. In order for questions to be answerable in every target language, we use contexts containing an N-way parallel sentence. Our approach is similar to WikiMatrix, which extracts parallel sentences for many language pairs in Wikipedia, but we limit the search for parallel sentences to documents on the same topic only, and aim for N-way parallel sentences.
To detect parallel sentences we use the LASER toolkit, 3 which achieves state-of-the-art performance in mining parallel sentences (Artetxe and Schwenk, 2019). LASER uses multilingual sentence embeddings and a distance or margin criterion in the embeddings space to detect parallel sentences. The reader is referred to Artetxe and Schwenk (2018) and Artetxe and Schwenk (2019) for a detailed description. See Appendix A.6 for further details and statistics on the number of parallel sentences mined for all language pairs. We first independently align all languages with English, then intersect these sets of parallel sentences, forming sets of N-way parallel sentences. As shown in Table 1, starting with 5.4M parallel English/German sentences, the number of N-way parallel sentences quickly decreases as more languages are added. We also found that 7-way parallel sentences lack linguistic diversity, and often appear in the first sentence or paragraph of articles.
As a compromise between language-parallelism and both the number and diversity of parallel sentences, we use sentences that are 4-way parallel. This yields 385,396 parallel sentences (see Appendix A.6) which were sub-sampled to ensure parallel sentences were evenly distributed in paragraphs. We ensure that each language combination is equally represented, so that each language has many QA instances in common with every other language. Except for any rejected instances later in the pipeline, each QA instance will be parallel between English and three target languages.
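The intersection step described above (align every language with English independently, then intersect) can be illustrated with a minimal sketch. The data layout, the sentence identifiers and the function name `n_way_parallel` are assumptions made for illustration; they are not taken from the MLQA or LASER code.

```python
def n_way_parallel(alignments, languages):
    """Intersect English-anchored alignments into N-way parallel sets.

    alignments: dict mapping a target language to a dict that maps an
        English sentence id to the aligned sentence id in that language
        (hypothetical layout, e.g. produced by a bitext mining tool).
    languages: the target languages that must all share the English sentence.
    Returns a dict: English sentence id -> {language: aligned sentence id}.
    """
    # English sentences that have an aligned sentence in every requested language.
    shared = set.intersection(*(set(alignments[lang]) for lang in languages))
    return {
        en_id: {lang: alignments[lang][en_id] for lang in languages}
        for en_id in sorted(shared)
    }


# Toy alignments (made up): each language is aligned with English only.
alignments = {
    "de": {"en-1": "de-7", "en-2": "de-9", "en-3": "de-4"},
    "es": {"en-1": "es-3", "en-3": "es-8"},
    "hi": {"en-3": "hi-2"},
}

# English plus three target languages gives 4-way parallel sentences.
four_way = n_way_parallel(alignments, ["de", "es", "hi"])
print(four_way)
```

Each added language can only shrink the intersection, which is consistent with the rapid decrease in N-way parallel sentences reported in Table 1.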

English QA Annotation
We use Amazon Mechanical Turk to annotate English QA instances, broadly following the methodology of Rajpurkar et al. (2016). We present workers with an English aligned sentence, b_en, along with the paragraph that contains it, c_en. Workers formulate a question q_en and highlight the shortest answer span a_en that answers it. a_en must be a subspan of b_en to ensure q_en will be answerable in the target languages. We include a "No Question Possible" button for when no sensible question could be asked. Screenshots of the annotation interface can be found in Appendix A.1. The first 15 questions from each worker are manually checked, after which the worker is contacted with feedback, or their work is auto-approved. Once the questions and answers have been annotated, we run another task to re-annotate English answers. Here, workers are presented with q_en and c_en, and requested to generate an a_en or to indicate that q_en is not answerable. Two additional answer span annotations are collected for each question. The additional answer annotations enable us to calculate an inter-annotator agreement (IAA) score. We calculate the mean token F1 score between the three answer annotations, giving an IAA score of 82%, comparable to the SQuAD v1.1 development set, where this IAA measure is 84%.
Rather than provide all three answer annotations as gold answers, we select a single representative reference answer. In 88% of cases, either two or three of the answers exactly matched, so the majority answer is selected. In the remaining cases, the answer with highest F1 overlap with the other two is chosen. This results both in an accurate answer span, and ensures the English results are comparable to those in the target languages, where only one answer is annotated per question.
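A minimal sketch of this selection rule follows. It assumes whitespace tokenisation, a SQuAD-style token F1, and a mean over the three annotation pairs as the agreement score; the exact tokenisation and averaging used for MLQA are not specified here, and the function names are illustrative.

```python
from collections import Counter


def token_f1(a, b):
    """Token-level F1 between two answer strings (whitespace tokens)."""
    a_tokens, b_tokens = a.split(), b.split()
    common = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(a_tokens)
    recall = common / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f1(answers):
    """Assumed agreement score: mean token F1 over all pairs of annotations."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)


def pick_reference(answers):
    """Majority answer if two or more annotations match exactly,
    otherwise the answer with the highest F1 overlap with the others."""
    answer, freq = Counter(answers).most_common(1)[0]
    if freq >= 2:
        return answer

    def overlap(i):
        return sum(token_f1(answers[i], answers[j])
                   for j in range(len(answers)) if j != i)

    return answers[max(range(len(answers)), key=overlap)]


print(pick_reference(["the Moon", "the Moon", "Moon"]))
print(pick_reference(["new moon", "at new moon", "full moon"]))
```

In the second call no two annotations match exactly, so the fallback picks the annotation that overlaps most with the other two.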
We discard instances where annotators marked the question as unanswerable as well as instances where over 50% of the question appeared as a subsequence of the aligned sentence, as these are too easy or of low quality. Finally, we reject questions where the IAA score was very low (< 0.3) removing a small number of low quality instances. To verify we were not discarding challenging but high quality examples in this step, a manual analysis of discarded questions was performed. Of these discarded questions, 38% were poorly specified, 24% did not make sense/had no answer, 30% had poor answers, and only 8% were high quality challenging questions.

Target Language QA Annotation
We use the One Hour Translation platform to source professional translators to translate the questions from English to the six target languages, and to find answers in the target contexts. We present each translator with the English question q_en, English answer a_en, and the context c_x (containing aligned sentence b_x) in target language x. The translators are only shown the aligned sentence and the sentence on each side (where these exist). This increases the chance of the question being answerable, as in some cases the aligned sentences are not perfectly parallel, without requiring workers to read the entire context c_x. By providing the English answer we try to minimize cultural and personal differences in the amount of detail in the answer. We sample 2% of the translated questions for additional review by language experts. Translators who did not meet the quality standards were removed from the translator pool, and their translations were reallocated. By comparing the distribution of answer lengths relative to the context with the English distribution, we found some cases where annotators selected very long answers, especially for Chinese. We clarified the instructions with these specific annotators, and sent such cases for re-annotation. We discard instances in target languages where annotators indicate there is no answer in that language. This means some instances are not 4-way parallel. "No Answer" annotations occurred for 6.6%-21.9% of instances (Vietnamese and German, respectively). We release the "No Answer" data separately as an additional resource, but do not consider it in our experiments or analysis.

The Resulting MLQA corpus
Contexts, questions and answer spans for all the languages are then brought together to create the
Cross-lingual QA Modelling: Cross-lingual QA as a discipline has been explored in QA for RDF data for a number of years, such as the QALD-3 and 5 tracks (Cimiano et al., 2013; Unger et al., 2015), with more recent work from Zimina et al. (2018). Lee et al. (2018) explore an approach to use English QA data from SQuAD to improve QA performance in Korean using an in-language seed dataset. Kumar et al. (2019) study question generation by leveraging English questions to generate better Hindi questions, and Lee and Lee (2019)    1190 SQuAD instances from 240 paragraphs manually translated into 10 languages. As shown in Table 4, MLQA covers 7 languages, but contains more data per language: over 5k QA pairs from 5k paragraphs per language. MLQA also uses real Wikipedia contexts rather than manual translation.
Aggregated Cross-lingual Benchmarks: Recently, following the widespread adoption of projects such as GLUE (Wang et al., 2019), there have been efforts to compile a suite of high quality multilingual tasks as a unified benchmark system. Two such projects, XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020), incorporate MLQA as part of their aggregated benchmark.

Cross-lingual QA Experiments
We introduce two tasks to assess cross-lingual QA performance with MLQA. The first, cross-lingual transfer (XLT), requires training a model with (c_x, q_x, a_x) training data in language x, in our case English. Development data in language x is used for tuning. At test time, the model must extract answer a_y in language y given context c_y and question q_y. The second task, generalised cross-lingual transfer (G-XLT), is trained in the same way, but at test time the model must extract a_z from c_z in language z given q_y in language y. This evaluation setup is possible because MLQA is highly parallel, allowing us to swap q_z for q_y for parallel instances without changing the question's meaning.
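To make the two test-time setups concrete, the sketch below enumerates (question language, context language) pairings for one parallel instance. The dictionary layout, field names and the toy sentences are assumptions for illustration and are not the MLQA release format or MLQA data.

```python
from itertools import product


def gxlt_pairs(instance):
    """Yield every (question language, context language) test case for one
    parallel QA instance.

    instance: dict mapping language code -> dict with "question",
        "context" and "answer" (hypothetical layout).
    The answer always comes from the context language, since the model
    extracts a span from the context it is given.
    """
    languages = sorted(instance)
    for q_lang, c_lang in product(languages, repeat=2):
        yield {
            "question_lang": q_lang,
            "context_lang": c_lang,
            "question": instance[q_lang]["question"],
            "context": instance[c_lang]["context"],
            "answer": instance[c_lang]["answer"],
        }


# Toy parallel instance (made up for illustration).
instance = {
    "en": {"question": "When do solar eclipses occur?",
           "context": "Solar eclipses occur at new moon.",
           "answer": "at new moon"},
    "de": {"question": "Wann treten Sonnenfinsternisse auf?",
           "context": "Sonnenfinsternisse treten bei Neumond auf.",
           "answer": "bei Neumond"},
    "es": {"question": "¿Cuándo ocurren los eclipses solares?",
           "context": "Los eclipses solares ocurren en luna nueva.",
           "answer": "en luna nueva"},
}

pairs = list(gxlt_pairs(instance))
# XLT cases are the ones where question and context share a language.
xlt = [p for p in pairs if p["question_lang"] == p["context_lang"]]
print(len(pairs), len(xlt))
```

With k parallel languages, one instance yields k XLT cases and k squared G-XLT cases in total, at no extra annotation cost.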
As MLQA only has development and test data, we adopt SQuAD v1.1 as training data. We use MLQA-en as development data, and focus on zero-shot evaluation, where no training or development data is available in target languages. Models were trained with the SQuAD-v1 training method and implemented in PyText (Aly et al., 2018). We establish a number of baselines to assess current cross-lingual QA capabilities:
Translate-Train: We translate instances from the SQuAD training set into the target language using machine translation. 4 Before translating, we enclose answers in quotes, as in Lee et al. (2018). This makes it easy to extract answers from translated contexts, and encourages the translation model to map answers into single spans. We discard instances where this fails (∼5%). This corpus is then used to train a model in the target language.
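The quoting trick used for Translate-Train can be sketched as follows. The helper names are hypothetical, the translation step is replaced by a hand-written string standing in for machine-translation output, and the rule of rejecting any output without exactly one quoted span is an assumption about what "fails" means.

```python
import re


def mark_answer(context, answer_start, answer):
    """Enclose the answer span in double quotes before translation."""
    end = answer_start + len(answer)
    return context[:answer_start] + '"' + answer + '"' + context[end:]


def recover_answer(translated):
    """Recover (clean context, answer start, answer) from a translated,
    quote-marked context. Returns None when there is not exactly one
    quoted span, in which case the instance would be discarded."""
    matches = list(re.finditer(r'"([^"]+)"', translated))
    if len(matches) != 1:
        return None
    match = matches[0]
    answer = match.group(1)
    # Remove the two quote characters; the answer now starts where the
    # opening quote used to be.
    clean = translated[:match.start()] + answer + translated[match.end():]
    return clean, match.start(), answer


context = "Solar eclipses occur at new moon."
answer = "new moon"
marked = mark_answer(context, context.index(answer), answer)
print(marked)

# Stand-in for the output of a translation system (not real MT output).
translated = 'Sonnenfinsternisse treten bei "Neumond" auf.'
print(recover_answer(translated))
```

A context that already contains quotation marks would be rejected by this sketch, which is one plausible source of discarded instances.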
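Translate-Test: The context and question in the target language are translated into English at test time. We use our best English model to produce an answer span in the translated paragraph. For all languages other than Hindi, 5 we use attention scores, a_{ij}, from the translation model to map the answer back to the original language. Rather than aligning spans by attention argmax, as done by Asai et al. (2018), we identify the span in the original context which maximizes F1 score with the English span:

\[
\mathrm{RC} = \frac{\sum_{i \in S_e,\, j \in S_o} a_{ij}}{\sum_{i \in S_e} a_{i*}}, \qquad
\mathrm{PR} = \frac{\sum_{i \in S_e,\, j \in S_o} a_{ij}}{\sum_{j \in S_o} a_{*j}}, \qquad
\mathrm{F1} = \frac{2 \cdot \mathrm{RC} \cdot \mathrm{PR}}{\mathrm{RC} + \mathrm{PR}}, \qquad
\text{answer} = \arg\max_{S_o} \mathrm{F1}(S_o)
\]

where S_e and S_o are the English and original spans respectively, a_{i*} = \sum_j a_{ij} and a_{*j} = \sum_i a_{ij}.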

Cross-lingual Representation Models
We produce zero-shot transfer results from multilingual BERT (cased, 104 languages) and XLM (MLM + TLM, 15 languages) (Lample and Conneau, 2019). Models are trained with the SQuAD training set and evaluated directly on the MLQA test set in the target language. Model selection is also constrained to be strictly zero-shot, using only English development data to pick hyperparameters. As a result, we end up with a single model that we test for all 7 languages.

Evaluation Metrics for Multilingual QA
Most extractive QA tasks use Exact Match (EM) and mean token F1 score as performance metrics. The widely-used SQuAD evaluation also performs the following answer-preprocessing operations: i) lowercasing, ii) stripping (ASCII) punctuation, iii) stripping (English) articles and iv) whitespace tokenisation. We introduce the following modifications for fairer multilingual evaluation: Instead of stripping ASCII punctuation, we strip all unicode characters with a punctuation General Category. 6 When a language has stand-alone articles (English, Spanish, German and Vietnamese) we strip them. We use whitespace tokenization for all MLQA languages other than Chinese, where we use the mixed segmentation method from Cui et al. (2019b).
Footnote 5: [...] answers using another round of translation. Back-translated answers may not map back to spans in the original context, so this Translate-Test performs poorly.
Footnote 6: http://www.unicode.org/reports/tr44/tr44-4.html#General_Category_Values
Table 5 shows the results on the XLT task. XLM performs best overall, transferring best in Spanish, German and Arabic, and competitively with translate-train+M-BERT for Vietnamese and Chinese. XLM is, however, weaker in English. Even for XLM, there is a 39.8% drop in mean EM score (20.9% F1) over the English BERT-large baseline, showing significant room for improvement. All models generally struggle on Arabic and Hindi.
Figure 3: F1 score stratified by English wh* word, relative to overall F1 score for XLM
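As a sketch of the modified answer normalisation described in this section, the code below strips characters whose Unicode General Category starts with "P", removes stand-alone articles per language, and tokenises Chinese character by character. The article lists are illustrative assumptions rather than the official MLQA lists (Vietnamese is omitted), and the Chinese handling is a simplified stand-in for the mixed segmentation method of Cui et al. (2019b).

```python
import unicodedata

# Illustrative article lists (assumed, not the official evaluation lists).
ARTICLES = {
    "en": {"a", "an", "the"},
    "es": {"un", "una", "unos", "unas", "el", "la", "los", "las"},
    "de": {"ein", "eine", "einen", "einem", "eines", "einer",
           "der", "die", "das", "den", "dem", "des"},
}


def strip_punctuation(text):
    """Drop every character whose Unicode General Category is punctuation."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text, lang):
    """Whitespace tokens, except Chinese where each CJK character is a token
    and runs of other characters are kept together."""
    if lang != "zh":
        return text.split()
    tokens, buffer = [], ""
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if buffer:
                tokens.append(buffer)
                buffer = ""
            if is_cjk(ch):
                tokens.append(ch)
        else:
            buffer += ch
    if buffer:
        tokens.append(buffer)
    return tokens


def normalize_answer(text, lang):
    """Lowercase, strip punctuation, tokenise, then drop articles."""
    tokens = tokenize(strip_punctuation(text.lower()), lang)
    return [t for t in tokens if t not in ARTICLES.get(lang, set())]


def exact_match(prediction, gold, lang):
    return normalize_answer(prediction, lang) == normalize_answer(gold, lang)


print(normalize_answer("The Moon!", "en"))
print(normalize_answer("¿La Luna?", "es"))
print(normalize_answer("月球。", "zh"))
```

Using the General Category rather than an ASCII list means that marks such as "¿" and "。" are removed, which an ASCII-only rule would leave in the answer and penalise.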

XLT Results
A manual analysis of cases where XLM failed to exactly match the gold answer was carried out for all languages. 39% of these errors were completely wrong answers, 5% were annotation errors and 7% were acceptable answers with no overlap with the gold answer. The remaining 49% come from answers that partially overlap with the gold span. The variation of errors across languages was small.
To see how performance varies by question type, we compute XLM F1 scores stratified by common English wh-words. Figure 3 shows that "When" questions are the easiest for all languages, and "Where" questions seem challenging in most target languages. Further details are in Appendix A.3.
To explore whether questions that were difficult for the model in English were also challenging in the target languages, we split MLQA into two subsets on whether the XLM model got an English F1 score of zero. Figure 4 shows that transfer performance is better when the model answers well in English, but is far from zero when the English answer is wrong, suggesting some questions may be easier to answer in some languages than others. Table 6 shows results for XLM on the G-XLT task. 7 For questions in a given language, the model performs best when the context language matches the question, except for Hindi and Arabic. For con-

English Results on SQuAD 1 and MLQA
The MLQA-en results in Table 5 are lower than reported results on SQuAD v1.1 in the literature for equivalent models. However, once SQuAD scores are adjusted to reflect only having one answer annotation (picked using the same method used to pick MLQA answers), the discrepancy drops to 5.8% on average (see Table 7). MLQA-en contexts are on average 28% longer than SQuAD's, and MLQA covers a much wider set of articles than SQuAD. Minor differences in preprocessing and answer lengths may also contribute (MLQA-en answers are slightly longer, 3.1 tokens vs 2.9 on average). Question type distributions are very similar in both datasets (Figure 7 in Appendix A).
Table 7: English performance comparisons to SQuAD using our models. * uses a single answer annotation.

Discussion
It is worth discussing the quality of context paragraphs in MLQA. Our parallel sentence mining approach can source independently-written documents in different languages, but, in practice, articles are often translated from English to the target languages by volunteers. Thus our method sometimes acts as an efficient mechanism of sourcing existing human translations, rather than sourcing independently-written content on the same topic. The use of machine translation is strongly discouraged by the Wikipedia community, 8 but from examining edit histories of articles in MLQA, machine translation is occasionally used as an article seed, before being edited and added to by human authors. Our annotation method restricts answers to come from specified sentences. Despite being provided several sentences of context, some annotators may be tempted to only read the parallel sentence and write questions which only require a single sentence of context to answer. However, single sentence context questions are a known issue in SQuAD annotation in general (Sugawara et al., 2018) suggesting our method would not result in less challenging questions, supported by scores on MLQA-en being similar to SQuAD (section 5.3).
MLQA is partitioned into development and test splits. As MLQA is parallel, this means there is development data for every language. Since MLQA will be freely available, this was done to reduce the risk of test data over-fitting in future, and to establish standard splits. However, in our experiments, we only make use of the English development data and study strict zero-shot settings. Other evaluation setups could be envisioned, e.g. by exploiting the target language development sets for hyperparameter optimisation or fine-tuning, which could be fruitful for higher transfer performance, but we leave such "few-shot" experiments as future work. Other potential areas to explore involve training datasets other than English, such as CMRC (Cui et al., 2018), or using unsupervised QA techniques to assist transfer (Lewis et al., 2019).
Finally, a large body of work suggests QA models are over-reliant on word-matching between question and context (Jia and Liang, 2017;Gan and Ng, 2019). G-XLT represents an interesting testbed, as simple symbolic matching is less straightforward when questions and contexts use different languages. However, the performance drop from XLT is relatively small (8.2 mean F1), suggesting word-matching in cross-lingual models is more nuanced and robust than it may initially appear.

Conclusion
We have introduced MLQA, a highly-parallel multilingual QA benchmark in seven languages. We developed several baselines on two cross-lingual understanding tasks on MLQA with state-of-the-art methods, and demonstrate significant room for improvement. We hope that MLQA will help to catalyse work in cross-lingual QA to close the gap between training and testing language performance.

A Appendices
A.1 Annotation Interface
Figure 5 shows a screenshot of the annotation interface. Workers are asked to write a question in the box, and highlight an answer using the mouse in the sentence that is in bold. There are a number of data input validation features to assist workers, as well as detailed instructions in a drop-down window, which are shown in Figure 6.
Figure 6: English annotation instructions screenshot

A.2 Additional MLQA Statistics
Figure 7 shows the distribution of wh words in questions in both MLQA-en and SQuAD v1.1. The distributions are very similar, suggesting that SQuAD is an appropriate choice of training dataset. Table 4 shows the number of Wikipedia articles that feature at least one of their paragraphs as a context paragraph in MLQA, along with the number of unique context paragraphs in MLQA. There are 1.9 context paragraphs from each article on average. This is in contrast to SQuAD, which instead features a small number of curated articles, but more densely annotated, with 43 context paragraphs per article on average. Thus, MLQA covers a much broader range of topics than SQuAD. Table 8 shows statistics about the lengths of contexts, questions and answers in MLQA. Vietnamese has the longest contexts on average and German the shortest, but all languages have a substantial tail of long contexts. Other than Chinese, answers are on average 3 to 4 tokens.

A.3 QA Performance stratified by question and answer types
To examine how performance varies across languages for different types of questions, we stratify MLQA with three criteria: by English wh-word, by answer named-entity type, and by English question difficulty.
By wh-word: First, we split by the English wh* word in the question. The resulting change in F1 score compared to the overall F1 score is shown in Figure 3, and discussed briefly in the main text. The English wh* word provides a clue as to the type of answer the questioner is expecting, and thus acts as a way of classifying QA instances into types. We chose the 5 most common wh* words in the dataset for this analysis. We see that "when" questions are consistently easier than average across the languages, but the pattern is less clear for other question types. "Who" questions also seem easier than average, except for Hindi, where the performance is quite low for these questions. "How"-type questions (such as "how much", "how many" or "how long") are also more challenging to answer than average in English compared to the other languages. "Where" questions also seem challenging for Spanish, German, Chinese and Hindi, but this is not true for Arabic or Vietnamese.
By Named-Entity type: We create subsets of MLQA by detecting which English named entities are contained in the answer span. To achieve this, we run Named Entity Recognition using spaCy (Honnibal and Montani, 2017), and detect where named entity spans overlap with answer spans. The F1 scores for different answer types relative to overall F1 score are shown for various Named Entity types in Figure 8. There are some clear trends: Answer spans that contain named entities are easier to answer than those that do not (the first two rows) for all the languages, but the difference is most pronounced for German. Secondly, "Temporal" answer types (DATE and TIME entity labels) are consistently easier than average for all languages, consistent with the high scores for "when" questions in the previous section. Again, this result is most pronounced in German, but is also very strong for Spanish, Hindi, and Vietnamese. Arabic also performs well for ORG, GPE and LOC answer types, unlike most of the other languages. Numeric questions (CARDINAL, ORDINAL, PERCENT, QUANTITY and MONEY entity labels) also seem relatively easy for the model in most languages.
By English Question Difficulty: Here, we split MLQA into two subsets, according to whether the XLM model got the question completely wrong (no word overlap with the correct answer). We then evaluated the mean F1 score for each language on the two subsets, with the results shown in Figure 4. We see that questions that are "easy" in English also seem to be easier in the target languages, but the drop in performance for the "hard" subset is not as dramatic as one might expect. This suggests that not all questions that are hard in English in MLQA are hard in the target languages. This could be due to the grammar and morphology of different languages leading to questions being easier or more difficult to answer, but another factor is that context documents can be shorter in target languages for questions the model struggled to answer correctly in English, effectively making them easier. Manual inspection suggests that whilst context documents are often shorter when the model is correct in the target language, this effect is not sufficient to explain the difference in performance.

A.4 Additional G-XLT results
Table 6 in the main text shows results for XLM on the G-XLT task, and Table 9 shows results for Multilingual-BERT. XLM outperforms M-BERT for most language pairs, with a mean G-XLT performance of 53.4 F1 compared to 47.2 F1 (mean of off-diagonal elements of Tables 6 and 9). Multilingual BERT exhibits more of a preference for English than XLM for G-XLT, and exhibits a bigger performance drop going from XLT to G-XLT (10.5 mean drop in F1 compared to 8.2).

A.5 Additional preprocessing Details
OpenCC (https://github.com/BYVoid/OpenCC) is used to convert all Chinese contexts to Simplified Chinese, as Wikipedia dumps generally consist of a mixture of simplified and traditional Chinese text.

A.6 Further details on Parallel Sentence mining
Table 10 shows the number of mined parallel sentences found in each language, as a function of how many languages the sentences are parallel between. As the number of languages that a parallel sentence is shared between increases, the number of such sentences decreases. When we look for 7-way aligned examples, we only find 1340 sentences from the entirety of the 7 Wikipedias. Additionally, most of these sentences are the first sentence of the article, or are uninteresting. However, if we choose 4-way parallel sentences, there are plenty of sentences to choose from. We sample evenly from each combination of English and 3 of the 6 target languages. This ensures that we have an even distribution over all the target languages, as well as ensuring we have even numbers of instances that will be parallel between target language combinations.
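The even-sampling argument can be checked with a short sketch: choosing 3 of the 6 target languages gives 20 combinations, every target language appears in the same number of them, and so does every pair of target languages. The language codes and the `sample_evenly` helper (including its seed and per-combination budget) are illustrative assumptions, not the authors' sampling code.

```python
import random
from collections import Counter
from itertools import combinations

TARGETS = ["ar", "de", "es", "hi", "vi", "zh"]


def language_combinations(targets=TARGETS, k=3):
    """All combinations of English plus k of the target languages."""
    return [("en",) + combo for combo in combinations(targets, k)]


def sample_evenly(candidates, per_combo, seed=0):
    """Draw the same number of sentences from every language combination.

    candidates: dict mapping a combination tuple to a list of sentence ids
        (hypothetical layout).
    """
    rng = random.Random(seed)
    return {
        combo: rng.sample(items, min(per_combo, len(items)))
        for combo, items in candidates.items()
    }


combos = language_combinations()
print(len(combos))

# How often each target language, and each pair of targets, is covered.
per_language = Counter(lang for combo in combos for lang in combo[1:])
per_pair = Counter(pair for combo in combos
                   for pair in combinations(combo[1:], 2))
print(per_language)
print(set(per_pair.values()))
```

Because the counts are uniform, sampling the same number of sentences from each combination gives every target language, and every pair of target languages, the same expected number of shared instances.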