The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.


Introduction
Research in Machine Translation (MT) has seen significant advances in recent years thanks to improvements in modeling, and in particular neural models (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2016; Vaswani et al., 2017), as well as the availability of large parallel corpora for training (Tiedemann, 2012; Smith et al., 2013; Bojar et al., 2017). Indeed, modern neural MT systems can achieve near human-level translation performance on language pairs for which sufficient parallel training resources exist, e.g., Chinese-English translation (Hassan et al., 2018) and English-French translation (Gehring et al., 2016; Ott et al., 2018a).
Unfortunately, MT systems, and in particular neural models, perform poorly on low-resource language pairs, for which parallel training data is scarce (Koehn and Knowles, 2017). Improving translation performance on low-resource language pairs could be very impactful considering that these languages are spoken by a large fraction of the world population.
Technically, there are several challenges to solve in order to improve translation for low-resource languages. First, in the face of the scarcity of clean parallel data, MT systems should be able to use any source of data available, namely monolingual resources, noisy comparable data, as well as parallel data in related languages. Second, we need reliable public evaluation benchmarks to track progress in translation quality.
Building evaluation sets for low-resource languages is both expensive and time-consuming because the pool of professional translators is limited, as there are few fluent bilingual speakers for these languages. Moreover, the quality of professional translations for low-resource languages is not on par with that of high-resource languages, given that the quality assurance processes for low-resource languages are often lacking or under development. Also, it is difficult to verify the quality of the human translations as a non-native speaker, because the topics of the documents in these low-resource languages may require knowledge and context coming from the local culture.
In this work, we introduce new evaluation benchmarks for two very low-resource language pairs: Nepali-English and Sinhala-English. Sentences were extracted from Wikipedia articles in each language and translated by professional translators. The datasets we release to the community are composed of a tune set of 2,559 and 2,898 sentences, a development set of 2,835 and 2,766 sentences, and a test set of 2,924 and 2,905 sentences for Nepali-English and Sinhala-English, respectively.
In §3, we describe the methodology we used to collect the data as well as to check the quality of translations. The experiments reported in §4 demonstrate that these benchmarks are very challenging for current state-of-the-art methods, yielding very low BLEU scores (Papineni et al., 2002) even when using all available parallel data as well as monolingual data or filtered Paracrawl data. This suggests that these languages and evaluation benchmarks can constitute a useful test-bed for developing and comparing MT systems for low-resource language pairs.

Related Work
There is ample literature on low-resource MT. From the modeling side, one possibility is to design methods that make more effective use of monolingual data. This is a research avenue that has seen a recent surge of interest, starting with semi-supervised methods relying on back-translation (Sennrich et al., 2015) and integration of a language model into the decoder (Gulcehre et al., 2017; Stahlberg et al., 2018), all the way to fully unsupervised approaches (Lample et al., 2018b), which use monolingual data both for learning good language models and for fantasizing parallel data. Another avenue of research has been to extend the traditional supervised learning setting to a weakly supervised one, whereby the original training set is augmented with parallel sentences mined from noisy comparable corpora like Paracrawl. In addition to the challenge of learning with limited supervision, low-resource language pairs often involve distant languages that do not share the same alphabet, or that have very different morphology and syntax; accordingly, recent work has begun to explore language-independent lexical representations to improve transfer learning (Gu et al., 2018).

In terms of low-resource datasets, DARPA programs like LORELEI (Strassel and Tracey, 2016) have collected translations for several low-resource language pairs such as English-Tagalog. Unfortunately, the data is only made available to the program's participants. More recently, the Asian Language Treebank project (Riza et al., 2016) has introduced parallel datasets for several low-resource language pairs, but these are sampled from text originating in English and thus may not generalize to text sampled from low-resource languages.
In the past, there has been work on extracting high-quality translations from crowd-sourced workers using automatic methods (Zaidan and Callison-Burch, 2011; Post et al., 2012). However, crowd-sourced translations generally have lower quality than professional translations. In contrast, in this work we explore the quality checks that are required to filter professional translations of low-resource languages in order to build a high-quality benchmark set.
In practice, there are very few publicly available datasets for low-resource language pairs, and oftentimes researchers simulate learning on low-resource languages by using a high-resource language pair like English-French and merely limiting how much labeled data they use for training (Johnson et al., 2016; Lample et al., 2018a). While this practice enables a framework for easy comparison of different approaches, the real practical implications of these methods can be unclear. For instance, low-resource languages are often distant and oftentimes the corresponding corpora are not comparable, conditions which are far from the simulation with high-resource European languages, as has been recently pointed out by Neubig and Hu (2018).

Methodology & Resulting Datasets
For the construction of our benchmark sets, we chose to translate Nepali and Sinhala into and out of English. Both Nepali and Sinhala are Indo-Aryan languages with a subject-object-verb (SOV) structure. Nepali is similar to Hindi in its structure, while Sinhala is characterized by extensive omissions of arguments in a sentence.
Nepali is spoken by about 20 million people if we consider only Nepal, while Sinhala is spoken by about 17 million people just in Sri Lanka (see https://www.ethnologue.com/language/npi and https://www.ethnologue.com/language/sin). Sinhala and Nepali have very little publicly available parallel data. For instance, most of the parallel corpora for Nepali-English originate from GNOME and Ubuntu handbooks, and account for about 500K sentence pairs; Nepali also has about 4K sentences translated from the English Penn Treebank, available at http://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm, which is valuable parallel data. For Sinhala-English, there are an additional 600K sentence pairs automatically aligned from OpenSubtitles (Lison et al., 2018). Overall, the domains and quantity of the existing parallel data are very limited. However, both languages have a rather large amount of monolingual data publicly available (Buck et al., 2014), making them perfect candidates to track performance on unsupervised and semi-supervised tasks for Machine Translation.

Document selection
To build the evaluation sets, we selected and professionally translated sentences originating from Wikipedia articles in English, Nepali and Sinhala, using a Wikipedia snapshot from early May 2018. To select sentences for translation, we first selected the top 25 documents containing the largest number of candidate sentences in each source language. To this end, we defined candidate sentences as: (i) being in the intended source language according to a language-id classifier (Bojanowski et al., 2017), a necessary step since many sentences in foreign-language Wikipedias may be in English or other languages; and (ii) being between 50 and 150 characters long. To extract candidate sentences, we first used HTML markup to split document text into paragraphs, and then used regular expressions to split on punctuation, e.g., the full stop, poorna virama (\u0964) and exclamation marks. Moreover, we considered sentences and documents to be inadequate for translation when they contained large portions of untranslatable content such as lists of entities (for example, the Academy Awards page: https://en.wikipedia.org/wiki/Academy_Award_for_Best_Supporting_Actor). To avoid such lists we used the following rules: (i) for English, sentences have to start with an uppercase letter and end with a period; (ii) for Nepali and Sinhala, sentences should not contain symbols such as bullet points, repeated dashes, repeated periods or ASCII characters. The document set, along with the categories of documents, is presented in Appendix Table 8.
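For concreteness, the selection rules above can be sketched as follows. This is a minimal illustration, not the exact script used to build the dataset: the fastText model file name, the regular expression for list-like content, and the reading of "ASCII characters" as Latin letters are our own assumptions.

```python
import re
import fasttext  # language-id classifier in the spirit of Bojanowski et al. (2017)

# Hypothetical model file; fastText distributes a pre-trained language-id model.
LID_MODEL = fasttext.load_model("lid.176.bin")

def is_candidate(sentence: str, lang: str) -> bool:
    """Apply the candidate-sentence rules described above (lang is 'en', 'ne' or 'si')."""
    s = sentence.strip()
    # (ii) sentence length between 50 and 150 characters
    if not (50 <= len(s) <= 150):
        return False
    # (i) sentence must be in the intended source language
    labels, _ = LID_MODEL.predict(s.replace("\n", " "))
    if labels[0] != f"__label__{lang}":
        return False
    if lang == "en":
        # English: must start with an uppercase letter and end with a period
        if not (s[0].isupper() and s.endswith(".")):
            return False
    else:
        # Nepali/Sinhala: no bullet points, repeated dashes/periods, or ASCII (Latin) characters
        if re.search(r"[\u2022\u25cf]|--+|\.\.+|[A-Za-z]", s):
            return False
    return True
```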
After the document selection process, we randomly sampled 2,500 sentences for each language. From English, we translated into Nepali and Sinhala, while from Sinhala and Nepali, we only translated into English. We requested each string to be translated twice by different translators.

Quality checks
Translating domain-specialized content such as Wikipedia articles from and to low-resource languages is challenging: the pool of available translators is limited, there is limited context available to each translator when translating one string at a time, and some of the sentences can contain code-switching (e.g. text about Buddhism in Nepali or Sinhala can contain Sanskrit or Pali words). As a result, we observed large variations in the level of translation quality, which motivated us to enact a series of automatic and manual checks to filter out poor translations.
We first used automatic methods to filter out poor translations and sent them back for rework. Once the reworked translations were received, we sent all translations (original or reworked) that passed the automatic checks to human quality checks. Translations that failed the human checks were disregarded. Only the translations that passed all checks were added to the evaluation benchmark, although as a result some source sentences may have fewer than two translations. Below, we describe the automatic and manual quality checks that we applied to the datasets.
Automatic Filtering. The guiding principles underlying our choice of automatic filters are: (i) translations should be fluent (Zaidan and Callison-Burch, 2011); (ii) they should be sufficiently different from the source text; (iii) translations should be similar to each other, yet not equal; and (iv) translations should not be transliterations. In order to identify the vast majority of translation issues, we filtered by: (i) applying a count-based n-gram language model trained on Wikipedia monolingual data and removing translations that have perplexity above 3000.0 (English translations only), (ii) removing translations whose sentence-level char-BLEU score between the two generated translations is below 15 (indicating disparate translations) or above 90 (indicating suspiciously similar translations), (iii) removing sentences that contain at least 33% transliterated words, (iv) removing translations where at least 50% of words are copied from the source sentence, and (v) removing translations that contain more than 50% out-of-vocabulary words or more than 5 total out-of-vocabulary words in the sentence (English translations only). For this, the vocabulary was calculated on the monolingual English Wikipedia described in Table 2.
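A simplified sketch of these filters is shown below. The language model, English vocabulary and transliteration detector are passed in as stand-ins, since their exact implementations are not specified here; char-BLEU is computed with sacrebleu over space-separated characters.

```python
from sacrebleu.metrics import BLEU

_char_bleu = BLEU(tokenize="none", effective_order=True)

def char_level_bleu(a: str, b: str) -> float:
    # sentence-level char-BLEU between the two translations of the same source
    return _char_bleu.sentence_score(" ".join(a), [" ".join(b)]).score

def copy_ratio(source: str, translation: str) -> float:
    src_words = set(source.lower().split())
    trg_words = translation.lower().split()
    return sum(w in src_words for w in trg_words) / max(len(trg_words), 1)

def keep_translation_pair(source, trans_a, trans_b, lm=None, en_vocab=None,
                          translit_ratio=lambda t: 0.0):
    """Thresholds follow the description above; `lm` (perplexity model), `en_vocab`
    and `translit_ratio` are placeholders for the Wikipedia n-gram LM, English
    vocabulary and transliteration detector used in practice."""
    for t in (trans_a, trans_b):
        if lm is not None and lm.perplexity(t) > 3000.0:     # (i) fluency (English only)
            return False
        if translit_ratio(t) >= 0.33:                        # (iii) transliterated words
            return False
        if copy_ratio(source, t) >= 0.50:                    # (iv) words copied from source
            return False
        if en_vocab is not None:                             # (v) OOV filter (English only)
            oov = [w for w in t.split() if w not in en_vocab]
            if len(oov) > 5 or len(oov) / max(len(t.split()), 1) > 0.5:
                return False
    # (ii) the two translations should be similar to each other, yet not equal
    return 15 <= char_level_bleu(trans_a, trans_b) <= 90
```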
Manual Filtering. We followed a setup similar to direct assessment (Graham et al., 2013). We asked three different raters to rate each sentence from 0-100 according to the perceived translation quality. In our guidelines, the 0-10 range represents a translation that is completely incorrect and inaccurate, the 70-90 range represents a translation that closely preserves the semantics of the source sentence, and the 90-100 range represents a perfect translation. To ensure rating consistency, we rejected any set of ratings in which the range of scores among the three raters was above 30 points, and requested a fourth rater to break ties, replacing the most diverging rating with the new one. For each translation, we took the average score over all raters and rejected translations whose scores were below 70.
To ensure that the translations were as fluent as possible, we also designed an Amazon Mechanical Turk (AMT) monolingual task to judge the fluency of English translations. Regardless of content preservation, translations that are not fluent in the target language should be disregarded. For this task, we asked five independent human annotators to rate the fluency of each English translation from 1 (bad) to 5 (excellent), and retained only those rated above 3 on average. Additional statistics of the automatic and manual filtering stages can be found in the Appendix.
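The aggregation logic for both manual checks can be summarized as below; how the fourth, tie-breaking rating is collected is abstracted behind a callback, and treating the "most diverging" rating as the one farthest from the median is our reading of the procedure.

```python
from statistics import mean

def adequacy_pass(scores, get_fourth_rating=None):
    """Three 0-100 adequacy ratings; if they disagree by more than 30 points,
    a fourth rating replaces the most diverging one, then the average must be >= 70."""
    scores = list(scores)
    if max(scores) - min(scores) > 30 and get_fourth_rating is not None:
        median = sorted(scores)[1]
        scores.remove(max(scores, key=lambda s: abs(s - median)))
        scores.append(get_fourth_rating())
    return mean(scores) >= 70

def fluency_pass(scores):
    """Five 1-5 AMT fluency ratings; keep only translations rated above 3 on average."""
    return mean(scores) > 3
```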

Resulting Datasets
We built three evaluation sets for each language pair using the data that passed our automatic and manual quality checks: dev (tune), devtest (validation) and test (test). The tune set is used for hyperparameter tuning and model selection, the validation set is used to measure generalization during development, while the test set is used for the final blind evaluation.
To measure performance in both directions (e.g. Sinhala-English and English-Sinhala), we built test sets with a mix of original text and translationese (Baroni and Bernardini, 2005) on the source side. To reduce the effect of the source language on the quality of the resulting evaluation benchmark, direct and reverse translations were mixed at an approximate 50-50 ratio for the devtest and test sets. The dev set, on the other hand, was composed of the remainder of the available translations, which were not guaranteed to be balanced. Before selection, the sentences were grouped by document, to minimize the number of documents per evaluation set (see the sketch below).
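A rough sketch of how such a balanced, document-grouped split can be assembled is given below; the field names and the exact assignment procedure are illustrative rather than the released construction scripts.

```python
import random
from collections import defaultdict

def build_splits(sentences, per_origin_test=850, per_origin_devtest=850, seed=0):
    """`sentences` is a list of dicts with keys 'doc_id' and 'origin' (the language the
    sentence was originally written in). test and devtest take a fixed number of
    sentences per origin language, giving a ~50-50 mix of original text and
    translationese; dev keeps the remainder. Sentences from the same document stay
    together to limit the number of documents per split."""
    rng = random.Random(seed)
    splits = {"test": [], "devtest": [], "dev": []}
    for origin in sorted({s["origin"] for s in sentences}):
        docs = defaultdict(list)
        for s in sentences:
            if s["origin"] == origin:
                docs[s["doc_id"]].append(s)
        doc_ids = list(docs)
        rng.shuffle(doc_ids)
        pool = [s for d in doc_ids for s in docs[d]]
        splits["test"] += pool[:per_origin_test]
        splits["devtest"] += pool[per_origin_test:per_origin_test + per_origin_devtest]
        splits["dev"] += pool[per_origin_test + per_origin_devtest:]
    return splits
```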
In Table 1 we present the statistics of the resulting sets. For Sinhala-English, the test set is composed of 850 sentences originally in English and 850 originally in Sinhala. We have approximately 1.7 translations per sentence. This yielded 1,465 sentence pairs originally in English and 1,440 originally in Sinhala, for a total of 2,905 sentence pairs. Similarly, for Nepali-English, the test set is composed of 850 sentences originally in English and 850 originally in Nepali. This yielded 1,462 sentence pairs originally in English and 1,462 originally in Nepali, for a total of 2,924 sentence pairs. The composition of the rest of the sets can be found in Table 1.
In Appendix Table 6, we present the aggregate distribution of topics per sentence for the Nepali-English and Sinhala-English datasets, which shows a diverse representation of topics ranging from General (e.g. documents about tires, shoes and insurance) and History (e.g. documents about the history of radar, the Titanic, etc.) to Law and Sports. This richness of topics increases the difficulty of the set, as it requires models that are rather domain-independent. The full list of documents and topics is also given in Appendix Table 8.

Experiments
In this section, we first describe the data used for training the models, then discuss the learning settings and models considered, and finally report the results of these baseline models on the new evaluation benchmarks.

Training Data
Small amounts of parallel data are available for Sinhala-English and Nepali-English. Statistics can be found in Table 2.
This data comes from different sources. Open Subtitles and GNOME/KDE/Ubuntu come from the OPUS repository. Global Voices is an updated version (2018q4) of a dataset originally created for the CASMACAT project. Bible translations come from the bible-corpus. The Paracrawl corpus comes from the Paracrawl project. The filtered version (Clean Paracrawl) was generated by using the LASER model (Artetxe and Schwenk, 2018) to select the best sentence pairs up to a total of 1 million English tokens, as specified in Chaudhary et al. (2019). We also contrast this filtered version with a randomly filtered version (Random Paracrawl) containing the same number of English tokens. Finally, our multilingual experiments in Nepali use Hindi monolingual data (about 5 million sentences) and English-Hindi parallel data (about 1.5 million parallel sentences) from the IIT Bombay corpus.
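The Clean Paracrawl selection can be sketched as follows, assuming each candidate pair has already been scored with a cross-lingual sentence encoder such as LASER (how the scores are produced is outside this sketch):

```python
def select_clean_paracrawl(scored_pairs, en_token_budget=1_000_000):
    """`scored_pairs` is a list of (score, english_sentence, foreign_sentence) triples.
    Keep the highest-scoring pairs until the English side reaches the token budget,
    in the spirit of Chaudhary et al. (2019)."""
    kept, n_en_tokens = [], 0
    for score, en, fr in sorted(scored_pairs, key=lambda x: -x[0]):
        if n_en_tokens >= en_token_budget:
            break
        kept.append((en, fr))
        n_en_tokens += len(en.split())
    return kept
```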

Training Settings
We evaluate models in four training settings. First, we consider a fully supervised training setting using the parallel data listed in Table 2.
Second, we consider a fully unsupervised setting, whereby only monolingual data on both the source and target side are used to train the model (Lample et al., 2018b).
Third, we consider a semi-supervised setting where we also leverage monolingual data on the target side using the standard back-translation training protocol (Sennrich et al., 2015): we train a backward MT system, which we use to translate monolingual target sentences into the source language. Then, we merge the resulting pairs of noisy (back-translated) source sentences with the original target sentences and add them as additional parallel data for training the source-to-target MT system. Since monolingual data is available for both languages, we train backward MT systems in both directions and repeat the back-translation process iteratively (He et al., 2016; Lample et al., 2018a). We consider up to two back-translation iterations. At each iteration we generate back-translations using beam search, which has been shown to perform well in low-resource settings (Edunov et al., 2018); we use a beam width of 5 and individually tune the length penalty on the dev set.

Finally, we consider a weakly supervised setting in which we use LASER (Artetxe and Schwenk, 2018) to filter Paracrawl data, following an approach similar to Chaudhary et al. (2019), in order to augment the original training set with a possibly larger but noisier set of parallel sentences.
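The iterative back-translation protocol described above can be summarized schematically as follows; `train` and `translate` are placeholders for full NMT training and beam-search decoding runs, not real library calls.

```python
def iterative_back_translation(parallel, mono_src, mono_trg, train, translate, iterations=2):
    """`parallel` is a list of (source, target) pairs; `mono_src`/`mono_trg` are
    monolingual sentences; `train` maps pairs to a model and `translate` maps
    (model, sentence) to a translation."""
    fwd = train(parallel)                                    # source -> target
    bwd = train([(t, s) for s, t in parallel])               # target -> source
    for _ in range(iterations):
        # back-translate monolingual data with the current models
        bt_fwd = [(translate(bwd, t), t) for t in mono_trg]  # synthetic (source, target) pairs
        bt_bwd = [(translate(fwd, s), s) for s in mono_src]  # synthetic (target, source) pairs
        # retrain both directions on real + synthetic parallel data
        fwd = train(parallel + bt_fwd)
        bwd = train([(t, s) for s, t in parallel] + bt_bwd)
    return fwd, bwd
```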
For Nepali only, we also consider training using Hindi data, both in a joint supervised and semi-supervised setting. For instance, at each iteration of the joint semi-supervised setting, we use models from the previous iteration to back-translate English monolingual data into both Hindi and Nepali, and Hindi and Nepali monolingual data into English. We then concatenate actual parallel data and back-translated data of the same language pair, and train a new model. We also consider using English-Hindi data in the unsupervised scenario. In that setting, a model is first pretrained in an unsupervised way on English, Hindi and Nepali monolingual data using the approach of Lample and Conneau (2019), and it is then jointly trained on both the Nepali-English unsupervised learning task and the Hindi-English supervised task (in both directions).

Models & Architectures
We consider both phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) systems in our experiments. All hyper-parameters have been cross-validated using the dev set. The PBSMT systems use Moses (Koehn et al., 2007), with state-of-the-art settings (5-gram language model, hierarchical lexicalized reordering model, operation sequence model) but no additional monolingual data to train the language model. The NMT systems use the Transformer (Vaswani et al., 2017) implementation in the Fairseq toolkit (Ott et al., 2019); preliminary experiments showed these to perform better than LSTM-based NMT models. More specifically, in the supervised setting, we use a Transformer architecture with 5 encoder and 5 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 2, 512 and 2048, respectively. In the semi-supervised setting, where we augment our small parallel training data with millions of back-translated sentence pairs, we use a larger Transformer architecture with 6 encoder and 6 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 8, 512 and 4096, respectively. When we use multilingual data, the encoder is shared in the {Hindi, Nepali}-English direction, and the decoder is shared in the English-{Hindi, Nepali} direction.
We regularize our models with dropout, label smoothing and weight decay, with the corresponding hyper-parameters tuned independently for each language pair. Models are optimized with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e-8. We use the same learning rate schedule as . We run experiments on between 4 and 8 Nvidia V100 GPUs with mini-batches of between 10K and 100K target tokens following . Code to reproduce our results can be found at https://github.com/facebookresearch/flores.
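For reference, the reported architecture and optimizer settings are written out below as plain dictionaries; how they map onto a particular toolkit's options is not spelled out here.

```python
# Supervised setting: smaller Transformer trained on the parallel data only.
SUPERVISED_TRANSFORMER = dict(
    encoder_layers=5, decoder_layers=5,
    attention_heads=2, embed_dim=512, ffn_dim=2048,
)

# Semi-supervised setting: larger Transformer trained on parallel + back-translated data.
SEMI_SUPERVISED_TRANSFORMER = dict(
    encoder_layers=6, decoder_layers=6,
    attention_heads=8, embed_dim=512, ffn_dim=4096,
)

# Adam settings; dropout, label smoothing and weight decay are tuned per language pair.
ADAM = dict(beta1=0.9, beta2=0.98, eps=1e-8)
```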

Preprocessing and Evaluation
We tokenize Nepali and Sinhala using the Indic NLP Library. For the PBSMT systems, we tokenize English sentences using the Moses tokenization scripts. For NMT systems, we instead use a vocabulary of 5K symbols based on a joint source and target Byte-Pair Encoding (BPE; Sennrich et al., 2015) learned using the sentencepiece library over the parallel training data. We learn the joint BPE for each language pair over the raw English sentences and the tokenized Nepali or Sinhala sentences. We then remove training sentence pairs with more than 250 source or target BPE tokens.
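A sketch of this preprocessing pipeline is shown below, assuming the trivial_tokenize entry point of the Indic NLP Library and the Python API of sentencepiece; the file names are hypothetical.

```python
import sentencepiece as spm
from indicnlp.tokenize import indic_tokenize  # Indic NLP Library

def tokenize_indic(line: str, lang: str) -> str:
    """Tokenize a Nepali ('ne') or Sinhala ('si') sentence; English is left raw for NMT."""
    return " ".join(indic_tokenize.trivial_tokenize(line, lang))

# Learn a joint 5K BPE model over raw English plus tokenized Nepali/Sinhala training text.
spm.SentencePieceTrainer.train(
    input="train.joint.txt",        # hypothetical file with both sides of the parallel data
    model_prefix="flores_bpe",
    vocab_size=5000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="flores_bpe.model")

def too_long(src: str, tgt: str, max_tokens: int = 250) -> bool:
    # drop training pairs with more than 250 source or target BPE tokens
    return len(sp.encode(src)) > max_tokens or len(sp.encode(tgt)) > max_tokens
```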
We report detokenized SacreBLEU (Post, 2018) when translating into English, and tokenized BLEU (Papineni et al., 2002) when translating from English into Nepali or Sinhala.
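One way to reproduce this scoring setup with the sacrebleu package is sketched below; the released repository contains the exact evaluation commands, so this is only illustrative.

```python
import sacrebleu

def bleu_into_english(hypotheses, references):
    # into English: detokenized SacreBLEU on raw (untokenized) outputs and references
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

def bleu_from_english(tokenized_hypotheses, tokenized_references):
    # out of English: tokenize with the Indic tokenizer first, then score without
    # applying any further tokenization inside sacrebleu
    return sacrebleu.corpus_bleu(tokenized_hypotheses, [tokenized_references],
                                 tokenize="none").score
```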

Results
In the supervised setting, PBSMT performed considerably worse than NMT, achieving BLEU scores of 2.5, 4.4, 1.6 and 5.0 on English-Nepali, Nepali-English, English-Sinhala and Sinhala-English, respectively. Table 3 reports results using NMT in all the other learning configurations described in §4.2. There are several observations we can make.
First, these language pairs are very difficult: even supervised NMT baselines achieve BLEU scores below 8. Second, and not surprisingly, the BLEU score is particularly low when translating into the more morphologically rich Nepali and Sinhala. Third, unsupervised NMT approaches seem to be ineffective on these distant language pairs, achieving BLEU scores close to 0. The reason for this failure is poor initialization of the word embeddings, which can be attributed to the monolingual corpora used to train them: these corpora do not share a sufficient number of overlapping strings and are not comparable (Neubig and Hu, 2018; Søgaard et al., 2018).

Table 3: BLEU scores of NMT using various learning settings on devtest (see §3). We report detokenized SacreBLEU (Post, 2018) for {Ne,Si}→En and tokenized BLEU for En→{Ne,Si}.
Fourth, the biggest improvements are brought by the semi-supervised approach using back-translation, which nearly doubles BLEU for Nepali-English, from 7.6 to 15.1 (+7.5 BLEU points), and for Sinhala-English, from 7.2 to 15.1 (+7.9 BLEU points), and which yields gains of +2.5 BLEU points for English-Nepali and +5.3 BLEU points for English-Sinhala.
Fifth, additional parallel data in English-Hindi further improves translation quality for Nepali across all settings. For instance, in the Nepali-English supervised setting, we observe a gain of 6.5 BLEU points, while in the semi-supervised setting (where we also back-translate to and from Hindi) the gain is 6.4 BLEU points. Similarly, in the unsupervised setting, multilingual training with Hindi brings Nepali-English to 3.9 BLEU and English-Nepali to 2.5 BLEU; if, however, the architecture is pretrained as prescribed by Lample and Conneau (2019), the BLEU score improves to 18.8 for Nepali-English and 8.3 for English-Nepali.
Finally, the weakly supervised baseline using the additional noisy parallel data described in §4.1 improves upon the supervised baseline in all four directions. This is studied in more depth in Table 4 for Sinhala-English and Nepali-English. Without any filtering or with random filtering, the BLEU score is close to 0. Applying a filtering method based on LASER scores (Artetxe and Schwenk, 2018) provides an improvement over using the unfiltered Paracrawl of +5.5 BLEU points for Nepali-English and +7.3 BLEU points for Sinhala-English. Adding Clean Paracrawl to the initial parallel data improves performance by +2.0 and +3.7 BLEU points for Nepali-English and Sinhala-English, respectively.

Table 4: Weakly supervised experiments: adding noisy parallel data from filtered Paracrawl improves translation quality in some conditions. "Parallel" refers to the data described in Table 2.

Discussion
In this section, we provide an analysis of the performance on the Nepali to English devtest set using the semi-supervised machine translation system; see Figure 1. Findings on other language directions are similar.

Fluency of references: we observe no correlation between the fluency rating of human references and the quality of translations as measured by BLEU. This suggests that the difficulty of the translation task is not related to the fluency of the references, at least at the current level of accuracy.

Document difficulty: we observe that translation quality is similar across all document ids, with a difference of 10 BLEU points between the documents that are easiest and hardest to translate. This suggests that the random sampling procedure used to construct the dataset was adequate and that no single Wikipedia document produces much harder sentences than others.
Original vs. translationese: we noticed that documents originating from Nepali are harder to translate than documents originating in English. This holds when performing the evaluation with the supervised MT system: translations of original Nepali sentences obtain 4.9 BLEU while Nepali translationese obtains 9.1 BLEU. This suggests that the existing parallel corpus is closer to English Wikipedia than to Nepali Wikipedia.

Figure 1: Analysis of the Ne→En devtest set using the semi-supervised machine translation system. Left: sentence-level BLEU versus AMT fluency score of the reference sentences in English; source sentences that have received more fluent human translations are not easier to translate by machines. Right: average sentence-level BLEU against the Wikipedia document id from which the source sentence was extracted; sentences have roughly the same degree of difficulty across documents, since there is no extreme difference between the shortest and tallest bars. However, source sentences originating from Nepali Wikipedia (blue) are translated more poorly than those originating from English Wikipedia (red). Documents are sorted by BLEU for ease of reading.

Domain drift
To better understand the effect of domain mismatch between the parallel dataset and the Wikipedia evaluation set, we restricted the Sinhala-English training set to only the Open Subtitles portion of the parallel dataset, and we held out 1,000 sentences for "in-domain" evaluation of generalization performance. Table 5 shows that translation quality on in-domain data is between 10 and 16 BLEU points higher. This may be due both to domain mismatch and to the sensitivity of the BLEU metric to sentence length. Indeed, there are on average 6 words per sentence in the Open Subtitles test set, compared to 16 words per sentence in the FLORES devtest set. However, when we train semi-supervised models on back-translated Wikipedia data, whose domain better matches the "Out-of-domain" devtest set, we see much larger gains in BLEU for the "Out-of-domain" set than for the "In-domain" set, suggesting that domain mismatch is indeed a major problem.

Conclusions
One of the biggest challenges in MT today is learning to translate low-resource language pairs. Research in this area not only faces formidable technical challenges, from learning with limited supervision to dealing with very distant languages, but it is also hindered by the lack of freely and publicly available evaluation benchmarks.
In this work, we introduce and freely release to the community the FLORES benchmarks for Nepali-English and Sinhala-English. Nepali and Sinhala are languages with very different syntax and morphology from English; also, very little parallel data in these language pairs is publicly available. However, a good amount of monolingual data, parallel data in related languages, and Paracrawl data exists for both languages, making these two language pairs perfect candidates for research on low-resource MT.
Our experiments show that current state-of-the-art approaches perform rather poorly on these new evaluation benchmarks, with semi-supervised and in particular multilingual neural methods outperforming all the other model variants and training settings we considered. We perform additional analysis to probe the quality of the datasets. We find no evidence of poor construction quality, yet observe that the low BLEU scores are partly due to the domain mismatch between the training and test datasets. We believe that these benchmarks will help the research community on low-resource MT make faster progress by enabling free access to evaluation data on actual low-resource languages and promoting fair comparison of methods.

B Statistics of automatic filtering and manual filtering

Figure 2: Histogram of averaged translation quality scores. We ask three different raters to rate each sentence from 0-100 according to the perceived translation quality. In our guidelines, the 0-10 range represents a translation that is completely incorrect and inaccurate; the 11-29 range represents a translation with a few correct keywords, but whose overall meaning is different from the source; the 30-50 range represents a translation that contains translated fragments of the source string, with major mistakes; the 51-69 range represents a translation which is understandable and conveys the overall meaning of the source string but contains typos or grammatical errors; the 70-90 range represents a translation that closely preserves the semantics of the source sentence; and the 90-100 range represents a perfect translation. Translations with an averaged translation score below 70 (red line) are removed from the dataset.

Figure 3: Histogram of averaged AMT fluency scores of English translations. We ask five different raters to rate each sentence from 1-5 according to its fluency. In our guidelines, the 1-2 range represents a sentence that is not fluent, 3 is neutral, while the 4-5 range is for fluent sentences that raters can easily understand. Translations with an averaged fluency score below 3 (red line) are removed from the dataset.

Table 7: Percentage of translations that did not pass the automatic and manual filtering checks. We first use automatic methods to filter out poor translations and send those translations back for rework. We then collect the translations that pass the automatic filtering and send them to two human quality checks, one for adequacy and the other for fluency. Note that the percentage of sentences that did not pass manual filtering is among those sentences that passed the automatic filtering.

Table 9: Examples of sentences from the En-Ne, Ne-En, En-Si and Si-En devtest sets. System hypotheses (System) are generated using the semi-supervised model described in the main paper using beam search decoding.

Reference A: In the past, the assembly that advised the king were called 'parliament'.
Reference B: In old times the counsil that gave advice to the king was called 'parliament'.
System: In old times the council of counsel to the king was 'Senate'.

Reference A: As a worker African Mandela joined the Congress party.
Reference B: He joined the African National Congress as a activist.
System: As a worker, he joined the African National Congress.

Source: Iphone users can and do access the internet frequently, and in a variety of places.

Source: In Serious meets, the absolute score is somewhat meaningless.

Reference A: Threatening, physical violence, property damage, assault and execution are these punishments.
Reference B: Threats, bodily violence, property damages, assaults and killing are these punishments.
System: Threats, physical harassment, property damage, strike and killing this punishment.

Reference A: After education priests leave ordination in order to fulfill duties to the family or due to sickness.
Reference B: Sangha is often abandoned because of education or after fulfilling family responsibilities or because of illness.
System: After education or to fulfill the family's disease or disease conditions, the companion is often removed from substance.