Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

The vast majority of language pairs in the world are low-resource because they have little, if any, parallel data available. Unfortunately, machine translation (MT) systems do not currently work well in this setting. Besides the technical challenges of learning with limited supervision, there is also another challenge: it is very difficult to evaluate methods trained on low-resource language pairs because there are very few freely and publicly available benchmarks. In this work, we take sentences from Wikipedia pages and introduce new evaluation datasets for two very low-resource language pairs, Nepali-English and Sinhala-English. These are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.

Unfortunately, MT systems, and in particular neural models, perform rather poorly on low-resource language pairs, for which parallel training data is scarce (Koehn and Knowles, 2017). Improving performance on low-resource language pairs could be very impactful if we consider that, altogether, these languages are spoken by a rather large fraction of the world population.
Technically, there are several challenges to solve in order to improve translation for low-resource languages. First, in the face of the scarcity of clean parallel data, MT systems should be able to use any source of data available, namely monolingual resources and noisy comparable data. Second, we need reliable public evaluation benchmarks to track progress in translation quality. Building evaluation sets for low-resource languages is very expensive because there are often very few fluent bilingual speakers of these languages. Moreover, it is difficult to check the quality of the human translations, because collecting multiple references is often not feasible and the topics of the documents in these low-resource languages may require knowledge of the local culture.
In this work, we introduce new evaluation benchmarks for two very low-resource language pairs: Nepali-English and Sinhala-English. Sentences were randomly extracted from Wikipedia pages in each language and translated by professional translators. The data-sets we release to the community are composed of a tune set of 2,559 and 2,898 sentences, a development set of 2,835 and 2,766 sentences, and a test set of 2,924 and 2,905 sentences for Nepali-English and Sinhala-English, respectively. The test set will be released after the WMT 2019 shared task on parallel corpus filtering.
We describe in §3 the methodology we used to collect the data and to check the quality of translations. The experiments reported in §4 demonstrate that these benchmarks are very challenging for current state-of-the-art methods, which yield very low BLEU scores (Papineni et al., 2002) even when using all available parallel data as well as monolingual data or filtered Paracrawl data. This suggests that these languages and evaluation benchmarks can constitute a useful test-bed for developing and comparing MT systems for low-resource language pairs.

Related Work
There is ample literature on low-resource MT. On the modeling side, one possibility is to design methods that make more effective use of monolingual data. This research avenue has seen a recent surge of interest, starting with semi-supervised methods relying on back-translation (Sennrich et al., 2015) and integration of a language model into the decoder (Gulcehre et al., 2017; Stahlberg et al., 2018), all the way to fully unsupervised approaches (Lample et al., 2018b; Artetxe et al., 2018), which use monolingual data both for learning good language models and for fantasizing parallel data. Another avenue of research has been to extend the traditional supervised learning setting to a weakly supervised one, whereby the original training set is augmented with parallel sentences mined from noisy comparable corpora like Paracrawl. In addition to the challenge of learning with limited supervision, low-resource language pairs often involve distant languages that do not share the same alphabet or that have very different morphology and syntax, which makes the learning problem harder on its own. In terms of low-resource data-sets, DARPA programs like LORELEI (Strassel and Tracey, 2016) have collected translations for several low-resource language pairs like English-Tagalog. Unfortunately, that data is only made available to the program's participants. More recently, the Asian Language Treebank project (Riza et al., 2016) has introduced parallel data-sets for several low-resource language pairs, but these are sampled from text originating in English and thus may not generalize to text sampled from low-resource languages.
In the past, there has been work on extracting high-quality translations from crowd-sourcing using automatic methods (Zaidan and Callison-Burch, 2011) and on cleaning them through voting mechanisms (Post et al., 2012). However, crowd-sourced translations are expected to be of lower quality than professional translations, making them more useful for training systems than for evaluation. In contrast, here we explore the quality checks that had to be put in place to filter professional translations for low-resource languages in order to build a high-quality benchmark set. Moreover, we explicitly aim to have larger test sets that are closer in volume to the sets used for WMT shared tasks.
In practice, there are very few publicly available data-sets for low-resource language pairs, and researchers often simulate learning on low-resource languages by taking a high-resource language pair like English-French and merely limiting how much labeled data they use for training (Johnson et al., 2016; Lample et al., 2018a). While this practice enables easy comparison of different approaches, the practical implications of these methods can be unclear. For instance, low-resource languages are often distant and their corpora are often not comparable, conditions which are far from the simulated setting with high-resource European languages, as has been recently pointed out (Neubig and Hu, 2018).

Methodology & Resulting Data-Sets
In this section, we report the methodology used to select Wikipedia documents and to check the quality of human translations. We conclude with a detailed description of the resulting evaluation data-sets.

Low Resource Languages
For the construction of our benchmark sets, we chose to translate Nepali and Sinhala into and out of English. Both Nepali and Sinhala are Indo-Aryan languages, the former spoken by about 15 million people in Nepal alone, the latter by about 20 million people in Sri Lanka alone. Both languages are SOV (subject-object-verb); Nepali is similar to Hindi in its structure, while Sinhala is characterized by extensive omission of arguments in a sentence. Sinhala and Nepali have very little parallel data publicly available. For instance, most of the parallel corpora for Nepali-English originate from GNOME and Ubuntu handbooks, and account for about 500K sentence pairs. For Sinhala-English, there are an additional 600K sentence pairs automatically aligned from OpenSubtitles (Lison et al., 2018). Overall, the domains and quantity of the existing parallel data are very limited. However, both languages have a rather large amount of monolingual data publicly available (Buck et al., 2014), making them perfect candidates for tracking performance on unsupervised and semi-supervised machine translation tasks.

Document selection
To build the evaluation sets, we selected and professionally translated sentences originating from Wikipedia pages in English, Nepali and Sinhala ({en,ne,si}.wikipedia.org) from a Wikipedia crawl of early May 2018.
To select sentences for translation, we first filtered Wikipedia by retaining, for each source language, the top 25 documents containing the largest number of candidate sentences, and then manually filtered the documents to ensure their quality. We defined candidate sentences as being in the intended source language according to a language-id classifier (Bojanowski et al., 2017); we further excluded sentences that were not adequate for translation because they contained large portions of untranslatable content such as lists of entities. For English, sentences had to start with an uppercase letter and end with a period. For Nepali and Sinhala, we ran a regular expression to exclude symbols such as bullet points, repeated dashes or periods, as well as ASCII characters. The document set, along with the categories of documents, is presented in the appendix (Table 7).
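The candidate-sentence criteria above can be sketched as simple predicates. This is a minimal illustration, not the authors' code: the exact regular expression used for Nepali and Sinhala is not given in the text, so the pattern below (bullets, repeated dashes or periods, ASCII letters) is an assumption based on the description.

```python
import re

def is_candidate_en(sentence: str) -> bool:
    """English candidates must start with an uppercase letter and end with a period."""
    s = sentence.strip()
    return bool(s) and s[0].isupper() and s.endswith(".")

# Hypothetical pattern matching the paper's description: reject bullet points,
# repeated dashes or periods, and ASCII letters.
BAD_SYMBOLS = re.compile(r"[\u2022\u25cf\u25aa]|--+|\.\.+|[A-Za-z]")

def is_candidate_indic(sentence: str) -> bool:
    """Nepali/Sinhala candidates must not contain the symbols above."""
    return not BAD_SYMBOLS.search(sentence)
```

In practice such predicates would be applied after sentence segmentation and language identification, keeping only sentences that pass all checks.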
After the document selection process, we sampled 2,500 random sentences for each language. From English we translated into Nepali and Sinhala, while from Sinhala and Nepali we translated only into English. We requested each string to be translated twice by different translators.

Quality checks
Translating domain-specialized content such as Wikipedia pages from and to low-resource languages is challenging: the pool of available translators is limited, there is limited context available to each translator when translating one string at a time, and some of the sentences can contain code-switching (e.g., text about Buddhism in Nepali or Sinhala can contain Sanskrit or Pali words). As a result, we observed large variations in the level of translation quality coming from professional translators. To ensure good quality, we relied on both automatic and manual methods to detect high- and low-quality translations.
In a first round, we used automatic methods to filter out bad translations and sent them for rework. Once the reworked translations were received, we sent all translations (original or reworked) that passed the automatic checks on to human quality checks. Only the high-quality translations that passed all checks were added to the final pool of translations used to build the test sets. Note that it is possible for some source sentences to have fewer than two (but at least one) translations after two rounds of rework. Below, we describe the automatic and manual quality checks that we applied to the data-sets.
Automatic Filtering. To maximize the quality of the translation sets while minimizing the amount of manual checking required, we relied on automatic methods to ensure the translations followed these principles: (i) translations should be fluent (Zaidan and Callison-Burch, 2011); (ii) they should be sufficiently different from the source text; (iii) translations should be similar to each other, yet not equal; and (iv) translations should not be transliterations. To identify the vast majority of translation issues, we filtered by: (i) applying a count-based n-gram language model trained on Wikipedia monolingual data and removing translations with perplexity above 3000.0 (English translations only); (ii) removing translations with a sentence-level char-BLEU score between the two generated translations below 15 (indicating disparate translations) or above 90 (indicating suspiciously similar translations); (iii) removing sentences that contain at least 33% transliterated words; (iv) removing translations in which at least 50% of words were copied from the source sentence; and (v) removing translations with an out-of-vocabulary ratio above 50% or more than 5 out-of-vocabulary words (English translations only). For this, the vocabulary was computed on the monolingual English Wikipedia data described in Table 3.
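Two of the filters above, the copy-rate check (iv) and the out-of-vocabulary check (v), can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' code: the perplexity and char-BLEU checks require trained models and are omitted, and whitespace tokenization stands in for the actual tokenizers.

```python
def copy_ratio(source: str, translation: str) -> float:
    """Fraction of translation words copied verbatim from the source (check iv)."""
    src_words = set(source.lower().split())
    trans_words = translation.lower().split()
    if not trans_words:
        return 1.0
    return sum(w in src_words for w in trans_words) / len(trans_words)

def oov_stats(translation: str, vocab: set):
    """OOV ratio and OOV count against a monolingual vocabulary (check v)."""
    words = translation.lower().split()
    oov = [w for w in words if w not in vocab]
    return len(oov) / max(len(words), 1), len(oov)

def passes_filters(source: str, translation: str, vocab: set) -> bool:
    # Thresholds from the paper: reject if >= 50% of words are copied,
    # or the OOV ratio exceeds 50%, or there are more than 5 OOV words.
    if copy_ratio(source, translation) >= 0.5:
        return False
    ratio, count = oov_stats(translation, vocab)
    if ratio > 0.5 or count > 5:
        return False
    return True
```

A translation is sent for rework as soon as any single check fails, so the order in which the checks run does not matter.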
Manual Filtering. We performed manual checks on every translation to ensure end-to-end translation quality and target-language fluency. For translation quality assessment, we followed a setup similar to direct assessment (Graham et al., 2013): we asked three different raters to score each sentence from 0 to 100 according to the perceived translation quality. In our guidelines, the 0-10 range represents a translation that is completely incorrect and inaccurate, the 70-90 range represents a translation that closely preserves the semantics of the source sentence, and the 90-100 range represents a perfect translation. To ensure rating consistency, we rejected any rating set in which the range of scores among the three reviewers was above 30 points, and requested a fourth rater to break the tie by replacing the most diverging rating with the new one; this process was repeated until convergence. For each translation, we took the average score over all raters and rejected translations whose score was below 70. This filtering was done for all language pairs.
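The tie-breaking procedure above can be expressed as a small loop. This is a sketch of the described logic, not the authors' implementation; the `extra_ratings` iterator stands in for requesting additional raters, a detail the paper leaves unspecified.

```python
from statistics import mean

def consolidate_ratings(scores, extra_ratings, max_range=30):
    """While the spread of scores exceeds max_range, replace the rating
    that diverges most from the mean with the next tie-breaker rating.
    Returns the final average score, or None if raters run out."""
    scores = list(scores)
    extra = iter(extra_ratings)
    while max(scores) - min(scores) > max_range:
        center = mean(scores)
        worst = max(range(len(scores)), key=lambda i: abs(scores[i] - center))
        try:
            scores[worst] = next(extra)
        except StopIteration:
            return None  # no further raters available
    return mean(scores)

def accept_translation(scores, extra_ratings=()):
    """Accept a translation only if the consolidated average is at least 70."""
    final = consolidate_ratings(scores, extra_ratings)
    return final is not None and final >= 70
```

For example, a rating set of (20, 80, 85) spans 65 points and is rejected, but a fourth rating of 75 replacing the diverging 20 brings the spread within 30 and the average to 80, so the translation is kept.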
To ensure that the translations were as fluent as possible, we also designed an Amazon Mechanical Turk (AMT) monolingual task to judge the fluency of English translations. Regardless of content preservation, translations that are not fluent in the target language should be disregarded. For this task, we asked five independent human annotators to rate the fluency of each English translation from 1 (bad) to 5 (excellent). We rejected translations into English with a fluency score below 3.

Resulting data-sets
We built three evaluation sets for each language pair using all of the data that passed our automatic and manual quality checks: dev (tune), devtest (validation) and test (test). The tune set is used for hyper-parameter tuning and model selection, the validation set is used to measure generalization during development, and the test set will be used as a blind set for the WMT shared task on parallel corpus filtering; the test set will be made available after this competition is over.
To measure performance in both directions (e.g., Sinhala-English and English-Sinhala), we built test sets with a mix of original text and translationese (Baroni and Bernardini, 2005) on the source side. To reduce the effect of the source language on the quality of the resulting translations in the sets used for tracking progress (devtest, test), direct and reverse translations were mixed at an approximate 50-50 ratio for the devtest and test sets. The dev set, on the other hand, was composed of the remainder of the available translations, which were not guaranteed to be balanced. Before selection, the sentences were grouped by document, so as to minimize the number of documents per evaluation set. For sentences with two satisfactory translations, the second translation was merged in as an additional evaluation instance. This yielded on average 1.7 translations per source sentence.
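The balanced mixing described above can be sketched as follows. This is a simplified illustration under stated assumptions: it draws an equal number of sentences from each originating language for the balanced sets and puts the remainder in dev, but omits the document-level grouping and the merging of second translations.

```python
import random

def build_split(en_origin, xx_origin, test_size=850, seed=0):
    """Mix sentences originally in English with sentences originally in the
    low-resource language at a ~50-50 ratio for a progress-tracking set;
    the remainder forms the dev set, which need not be balanced."""
    rng = random.Random(seed)
    en, xx = list(en_origin), list(xx_origin)  # copy; don't mutate inputs
    rng.shuffle(en)
    rng.shuffle(xx)
    balanced = en[:test_size] + xx[:test_size]
    remainder = en[test_size:] + xx[test_size:]
    return remainder, balanced
```

With 850 sentences drawn per direction, this yields the roughly 1,700-source-sentence test sets reported in Table 1 (before the second translations are merged in).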
In Table 1 we present the statistics of the resulting sets. For Sinhala-English, the test set is composed of 850 sentences originally in English and 850 originally in Sinhala. With approximately 1.7 translations per sentence, this yielded 1,465 sentence pairs originally in English and 1,440 originally in Sinhala, for a total of 2,905 sentence pairs. Similarly, for Nepali-English, the test set is composed of 850 sentences originally in English and 850 originally in Nepali. This yielded 1,462 sentence pairs originally in English and 1,462 originally in Nepali, for a total of 2,924 sentence pairs. The composition of the rest of the sets can be found in Table 1.
In Table 2 we present the aggregate distribution of topics per sentence for the Nepali-English and Sinhala-English data-sets. We can observe a diverse representation of topics, ranging from General (e.g., documents about tires, shoes and insurance) and History (e.g., documents about the history of the radar, the Titanic, etc.) to Law and Sports. This richness of topics increases the difficulty of the set, as it requires domain-independent, generalizable approaches to improve quality. The full list of documents and topics is in the appendix (Table 7).

Experiments
In this section, we first describe the data used for training the models, we then discuss the learning settings and models considered, and finally we report the results of these baseline models on the new evaluation benchmarks.

Training Data
Small amounts of parallel data are available for Sinhala-English and Nepali-English. Statistics can be found in Table 3.
This data comes from different sources. Open Subtitles and GNOME/KDE/Ubuntu come from the OPUS repository. Global Voices is an updated version (2018q4) of a data set originally created for the CASMACAT project. Bible translations come from the bible-corpus. The Paracrawl corpus comes from the Paracrawl project. The filtered version (Clean Paracrawl) was filtered with Zipporah (Xu and Koehn, 2017). We also contrast this filtered version with a randomly filtered version (Random Paracrawl) containing the same number of English tokens.

Training Settings
We evaluate models in four training settings. First, we consider a fully supervised training setting using the parallel data listed in Table 3.
Second, we consider a semi-supervised setting whereby, in addition to parallel data, we also leverage monolingual data on the target side. For this setting, we considered the standard back-translation training protocol introduced in Sennrich et al. (2015)'s seminal work: we train a backward MT system, which we use to translate monolingual target sentences into the source language. Then, we pair the resulting noisy (back-translated) source sentences with the original target sentences and add them as additional parallel data for training the original source-to-target MT system. When monolingual data is available for both languages, we can train backward MT systems in both directions and repeat the back-translation process iteratively (He et al., 2016; Lample et al., 2018a). We consider up to two back-translation iterations. At each iteration we generate back-translations using beam search, which has been shown to perform well in low-resource settings (Edunov et al., 2018); we use a beam width of 5 and individually tune the length penalty on the dev set.
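The data-side of one back-translation iteration can be sketched as follows. This is an illustrative sketch, not the paper's pipeline: `backward_translate` is a hypothetical stand-in for a beam-search decode of a trained target-to-source model.

```python
def back_translate_iteration(parallel, mono_target, backward_translate):
    """One back-translation iteration: a backward (target->source) model
    translates monolingual target sentences into noisy source sentences,
    and the synthetic pairs are merged with the genuine parallel data."""
    synthetic = [(backward_translate(tgt), tgt) for tgt in mono_target]
    # The target side of every synthetic pair is genuine text, which is
    # what matters when training the forward source->target system.
    return parallel + synthetic
```

Iterating means retraining the forward model on the augmented data, retraining the backward model likewise, and repeating; the paper stops after two iterations.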

Third, we consider a weakly supervised setting, in which we use a filtering method, in our case Zipporah (Xu and Koehn, 2017), to select sentence pairs from Paracrawl, in order to augment the original training set with a possibly larger but noisier set of parallel sentences.
Finally, we consider a fully unsupervised setting, whereby only monolingual data on both the source and target side are used to train the model (Lample et al., 2018b).

Models & Architectures
We consider both phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) systems in our experiments. The PBSMT systems use Moses (Koehn et al., 2007) with state-of-the-art settings (5-gram language model, hierarchical lexicalized reordering model, operation sequence model) but no additional monolingual data to train the language model. The NMT systems use the Transformer (Vaswani et al., 2017) implementation in the Fairseq toolkit; preliminary experiments showed these to perform better than LSTM-based NMT models. More specifically, in the supervised setting we use a Transformer architecture with 5 encoder and 5 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 2, 512 and 2048, respectively. In the semi-supervised setting, where we augment our small parallel training data with millions of back-translated sentence pairs, we use a larger Transformer architecture with 6 encoder and 6 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 8, 512 and 4096, respectively. We regularize our models with dropout, label smoothing and weight decay, with the corresponding hyper-parameters tuned independently for each language pair. Models are optimized with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e-8, with the inverse square root learning rate schedule of Vaswani et al. (2017). We run experiments on between 4 and 8 Nvidia V100 GPUs with mini-batches of between 10K and 100K target tokens. Code to reproduce our results can be found at https://github.com/facebookresearch/flores.

Preprocessing and Evaluation
We tokenize Nepali and Sinhala using the Indic NLP Library. For the PBSMT systems, we tokenize English sentences using the Moses tokenization scripts. For NMT systems, we instead use a vocabulary of 5K symbols based on a joint source and target Byte-Pair Encoding (BPE; Sennrich et al., 2015) learned using the sentencepiece library over the parallel training data. We learn the joint BPE for each language pair over the raw English sentences and tokenized Nepali or Sinhala sentences. We then remove training sentence pairs with more than 250 source or target BPE tokens. We report detokenized SacreBLEU (Post, 2018) when translating into English, and tokenized BLEU (Papineni et al., 2002) when translating from English into Nepali or Sinhala.
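The length-based cleaning step above can be sketched as a small filter. This is an illustrative sketch: `encode_src` and `encode_tgt` are hypothetical stand-ins for a trained sentencepiece model's encode function, and the 250-token threshold is the one stated in the text.

```python
def filter_by_length(pairs, encode_src, encode_tgt, max_tokens=250):
    """Drop training pairs whose source or target side exceeds max_tokens
    after BPE segmentation (hypothetical encoder callables)."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(encode_src(src)) <= max_tokens and len(encode_tgt(tgt)) <= max_tokens
    ]
```

Filtering on the segmented (BPE) length rather than the raw word count matters because rare words in morphologically rich languages can split into many subword units.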

Results
We run both PBSMT and NMT in the various learning configurations described in §4.2. There are several observations we can make from the results reported in Table 4.
First, these language pairs are very difficult: both NMT and PBSMT supervised baselines achieve BLEU scores below 8. Second, not surprisingly, BLEU scores are higher when translating into English than into the morphologically richer Nepali and Sinhala. Third, the biggest improvements are brought by the semi-supervised approach using back-translation, which nearly doubles BLEU for Nepali-English, from 7.6 to 15.1 (+7.5 BLEU), and Sinhala-English, from 7.2 to 15.1 (+7.9 BLEU), and increases BLEU for English-Nepali from 4.3 to 6.8 (+2.5 BLEU) and English-Sinhala from 1.2 to 6.5 (+5.3 BLEU). Notably, repeating back-translation for a second iteration brings further gains over the first, suggesting that more iterations of back-translation or "online" back-translation (Lample et al., 2018b) may be helpful. The weakly supervised baseline does not work as well as the semi-supervised one, yet it achieves almost 10 BLEU points on Sinhala-English. Finally, unsupervised NMT approaches appear to be ineffective on these language pairs, achieving BLEU scores close to 0. This failure is due to poor initialization: unsupervised lexicon induction techniques (Conneau et al., 2018; Artetxe et al., 2017) perform poorly on these morphologically rich languages, partly because the monolingual corpora used to train word embeddings are not comparable (Neubig and Hu, 2018) and do not have a sufficient number of overlapping strings.

Table 5: Adding parallel data from filtered Paracrawl improves translation quality in some conditions. "Parallel" refers to the data described in Table 3.
We also investigate in more detail the effect of weak supervision in Table 5 for Nepali-English and Sinhala-English. The baseline corresponds to the Supervised NMT setting in Table 4; the data is described in §4.1. For both language pairs, the filtering method applied to the Paracrawl corpus is critical: filtering at random gives a BLEU score close to 0, while the Zipporah filtering method provides an improvement over using the unfiltered Paracrawl directly, +0.5 BLEU for Nepali-English and +6.0 BLEU for Sinhala-English. In the case of Sinhala-English, adding Paracrawl Clean to the initial parallel data improves performance by 2.7 BLEU. However, no improvement over the baseline was observed for Nepali-English with Paracrawl Clean. We surmise that the low availability of parallel data to train the Nepali-English Zipporah model limits the effectiveness of the cleaning technique.

Figure 1: Analysis of the Ne→En devtest set using the semi-supervised machine translation system. Left: sentence-level BLEU versus AMT fluency score of the English reference sentences; source sentences that received more fluent human translations are not easier to translate by machines. Right: average sentence-level BLEU against the Wikipedia document id from which the source sentence was extracted; sentences have roughly the same degree of difficulty across documents, since there is no extreme difference between the shortest and tallest bars. However, source sentences originating from Nepali Wikipedia (blue) are translated more poorly than those originating from English Wikipedia (red). Documents are sorted by BLEU for ease of reading.
Table 6: In-domain vs. out-of-domain translation performance (BLEU) for supervised and semi-supervised NMT models. In-domain performance is measured on a held-out subset of 1,000 sentences from the Open Subtitles training data (see Table 3). Out-of-domain performance is measured on devtest (see §3).

Analysis
In this section, we provide an analysis of the Nepali to English devtest set using the semi-supervised machine translation system (see Figure 1). First, we observe no correlation between the fluency rating of human references and the closeness of system hypotheses to those references, suggesting the benchmark lacks such a bias: fluency of references does not correlate with ease of the translation task, at least at the current level of accuracy.
Second, we observe that source sentences receive rather similar translation quality across all documents, with a difference of only 10 BLEU points between the easiest and the hardest document to translate. This suggests that the random sampling procedure used to construct the data-set was adequate and that no single Wikipedia document produces much harder sentences than others.
However, if we split documents by their originating source (actual Nepali versus English translated into Nepali, a.k.a. translationese), we notice that genuine Nepali documents are harder to translate. The same is true also when performing the evaluation with the supervised MT system: translations of Nepali originating source sentences obtain 4.9 BLEU while translations of English originating sentences obtain 9.1 BLEU. This suggests that the existing parallel corpus is closer to English Wikipedia than Nepali Wikipedia, and that this bias is further reinforced when using English Wikipedia monolingual data during the back-translation process that generates data for the model trained in semi-supervised mode.
In order to better understand the effect of the domain mismatch between the parallel data-set and the Wikipedia evaluation set, we restricted the Sinhala-English training set to only the Open Subtitles portion of the parallel data-set, and we held out one thousand sentences for "in-domain" evaluation of generalization performance. Table 6 shows that translation quality on in-domain data is between 10 and 16 BLEU points higher. This may be due both to domain mismatch and to the sensitivity of the BLEU metric to sentence length. Indeed, there are on average 6 words per sentence in the Open Subtitles test set compared to 16 words per sentence in the Wikipedia devtest set. However, when we train semi-supervised models on back-translated Wikipedia data, whose domain better matches the "out-of-domain" devtest set, we see much larger gains in BLEU on the "out-of-domain" set than on the "in-domain" set, suggesting that domain mismatch is indeed a major problem.

Conclusions
One of the biggest challenges in MT today is learning to translate low resource language pairs. Research in this area not only faces formidable technical challenges, from learning with limited supervision to dealing with very distant languages, but it is also hindered by the lack of freely and publicly available evaluation benchmarks.
In this work, we introduce and freely release to the community two new benchmarks: Nepali-English and Sinhala-English. Nepali and Sinhala have very different syntax and morphology from English, and very little parallel data in these language pairs is publicly available. However, a good amount of monolingual data and Paracrawl data exists in both languages, making these two language pairs perfect candidates for research on low-resource MT.
Our experiments show that current state-of-the-art approaches perform rather poorly on these new evaluation benchmarks, with semi-supervised neural methods outperforming all the other model variants and training settings we considered. We believe that these benchmarks will help the research community on low-resource MT make faster progress by enabling free access to evaluation data on actual low-resource languages and by promoting fair comparison of methods.

Table 8: Examples of sentences from the En-Ne, Ne-En, En-Si and Si-En devtest sets. System hypotheses (System) are generated with the semi-supervised model described in the main paper, using beam search decoding.

Source: The academic research tended toward the improvement of basic technologies, rather than their specific applications.

Reference A: In the past, the assembly that advised the king were called 'parliament'.
Reference B: In old times the counsil that gave advice to the king was called 'parliament'.
System: In old times the council of counsel to the king was 'Senate'.

Reference A: As a worker African Mandela joined the Congress party.
Reference B: He joined the African National Congress as a activist.
System: As a worker, he joined the African National Congress.

Source: Iphone users can and do access the internet frequently, and in a variety of places.

Source: In Serious meets, the absolute score is somewhat meaningless.

Reference A: Threatening, physical violence, property damage, assault and execution are these punishments.
Reference B: Threats,bodily violence,property damages,assaults and killing are these punishments.
System: Threats, physical harassment, property damage, strike and killing this punishment.

Reference A: After education priests leave ordination in order to fulfill duties to the family or due to sickness.
Reference B: Sangha is often abandoned because of education or after fulfilling family responsibilities or because of illness.
System: After education or to fulfill the family's disease or disease conditions, the companion is often removed from substance.