Findings of the 2019 Conference on Machine Translation (WMT19)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

This year we conducted several official tasks. We report in this paper on the news and similar translation tasks. Additional shared tasks are described in separate papers in these proceedings:
• biomedical translation (Bawden et al., 2019b)
• automatic post-editing (Chatterjee et al., 2019)
• metrics
• quality estimation (Fonseca et al., 2019)
• parallel corpus filtering (Koehn et al., 2019)
• robustness (Li et al., 2019b)
In the news translation task (Section 2), participants were asked to translate a shared test set, optionally restricting themselves to the provided training data ("constrained" condition). We held 18 translation tasks this year, between English and each of Chinese, Czech (into Czech only), German, Finnish, Lithuanian, and Russian. New this year were Gujarati↔English and Kazakh↔English; both pose a lower-resourced data condition on challenging language pairs. System outputs for each task were evaluated both automatically and manually.
This year the news translation task had two additional sub-tracks: an unsupervised language pair (German→Czech) and a language pair not involving English (German↔French). Both sub-tracks were included into the general list of news translation submissions and are described in more detail in the corresponding subsections of Section 2.
The human evaluation (Section 3) involves asking human judges to score sentences output by anonymized systems. We obtained large numbers of assessments from researchers who contributed evaluations proportional to the number of tasks they entered. In addition, we used Mechanical Turk to collect further evaluations. This year, the official manual evaluation metric is again based on judgments of adequacy on a 100-point scale, a method we have explored in previous years with convincing results in terms of the trade-off between annotation effort and reliable distinctions between systems.
The primary objectives of WMT are to evaluate the state of the art in machine translation, to disseminate common test sets and public training data with published performance numbers, and to refine evaluation and estimation methodologies for machine translation. As before, all of the data, translations, and collected human judgments are publicly available. We hope these datasets serve as a valuable resource for research into data-driven machine translation, automatic evaluation, or prediction of translation quality. News translations are also available for interactive visualization and comparison of differences between systems at http://wmt.ufal.cz/ using MT-ComparEval (Sudarikov et al., 2016).
In order to gain further insight into the performance of individual MT systems, we organized a call for dedicated "test suites", each focussing on some particular aspect of translation quality. A brief overview of the test suites is provided in Section 4.

News Translation Task
The recurring WMT task examines translation between English and other languages in the news domain. As in the previous year, we include Chinese, Czech, German, Finnish and Russian (into and out of English, except for Czech, where only the out-of-English direction was included). New language pairs for this year were Gujarati, Lithuanian and Kazakh (to and from English), and French-German. We also used German-Czech (joining the corresponding parts of the English-X test sets) for the unsupervised subtask.

Test Data
The test data for this year's task (except for the French-German set) was selected from online news sources, as in previous years, with translation produced specifically for the task. For language pairs that had appeared before at WMT (and so had previous years' data for development testing) we selected approximately 2000 sentences in each of the languages in the pair and translated them into the other language. The source English sentences were common across all test sets. For the new language pairs (i.e. English-Gujarati, English-Kazakh and English-Lithuanian) we released development sets at the start of the campaign, consisting of approximately 1000 sentences in each language in the pair, translated into the other language. For Gujarati-English the development set was selected from online news in the same way as the test set, whereas for Kazakh-English the development set was selected (and removed) from the news-commentary training set. The test sets for these new language pairs were half the size of the test sets of the existing language pairs. Unlike in previous years, all test sets (except for French-German and German-Czech) only included naturally occurring text on the source side. In previous years, the way we produced an English-X test set was to take 1500 sentences of English text, translate these into language X, then take 1500 sentences in language X, and translate them into English. These 3000 translation pairs were then used for the English-X task and for the X-English task, meaning that 50% of the sentences in each test set had "translationese" on the source side, potentially leading to distortions in automatic and human evaluation (Graham et al., 2019a). This year, we did not include such "flipped" test data in the test sets, meaning that the English-X and X-English sets were non-overlapping.
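The contrast between the earlier "flipped" construction and this year's source-original-only construction can be sketched as follows. This is a simplification with toy "translators", not the actual WMT tooling; triples carry the original language of each source sentence.

```python
# Sketch: pre-2019 "flipped" test set construction vs. WMT19's
# source-original-only construction. Each test item is a
# (source, target, original_language) triple. Names are illustrative.

def flipped_test_sets(en_sents, x_sents, translate_en_to_x, translate_x_to_en):
    """Pre-2019: one shared pool, so ~50% of each test set has a
    translated ("translationese") source side."""
    pool = [(en, translate_en_to_x(en), "en") for en in en_sents]
    pool += [(translate_x_to_en(x), x, "x") for x in x_sents]
    en_x = pool  # English->X task: some sources are translated from X
    x_en = [(tgt, src, orig) for (src, tgt, orig) in pool]
    return en_x, x_en

def wmt19_test_sets(en_sents, x_sents, translate_en_to_x, translate_x_to_en):
    """2019: each direction uses only naturally occurring source text,
    so the two directions' test sets do not overlap."""
    en_x = [(en, translate_en_to_x(en), "en") for en in en_sents]
    x_en = [(x, translate_x_to_en(x), "x") for x in x_sents]
    return en_x, x_en
```

With the old construction, any item whose `original_language` differs from the source-side language is translationese; the new construction eliminates those by design.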
The composition of the test documents is shown in Table 1; the size of the test sets in terms of sentence pairs and words is given in Figure 2.
The translation of the test sets was sponsored by the EU H2020 projects Bergamot and GoURMET (English-Czech and Gujarati-English, respectively), by Yandex (Kazakh-English and Russian-English), Microsoft (Chinese-English and German-English), Tilde (Lithuanian-English), the University of Helsinki (Finnish-English) and Lingua Custodia (part of the French-German test set).
The translations into Czech were carried out by the agency Překlady textu, s.r.o., with the instructions for translators as given to all agencies:
• preserve line and document boundaries,
• translate from scratch, without post-editing,
• translate as literally as possible, but ensure that the translation is still a fluent sentence in the target language,
• do not add or remove information from the translations, and do not add translator's comments,
• the point is to have a linguistically nice document, but one matching the original text as closely as possible in terms of segmentation into sentences (e.g. we do not want 3 English sentences combined into 1 long Czech complex sentence).

Training Data
As in past years we provided parallel corpora to train translation models, monolingual corpora to train language models, and development sets to tune system parameters.
This year, we proposed document-level evaluation for the English-German and English-Czech tasks. We therefore attempted to provide training corpora with document boundaries intact wherever possible. We produced new versions of the Europarl corpora with document boundaries, an updated version of news-commentary with document boundaries, and a release of the Rapid corpus for German-English with document boundaries intact. The CzEng corpus (http://ufal.mff.cuni.cz/czeng/czeng17) already included context for each sentence, so we did not update it. We also produced a WikiTitles corpus this year for all language pairs, and allowed the use of a new ParaCrawl corpus (v3). The UN, CommonCrawl and Yandex corpora were unchanged since last year.
For Gujarati-English, we allowed several extra parallel corpora (the Bible, a localisation corpus from Opus, the Emille corpus, a Wikipedia corpus and a corpus crawled specifically for this task), as well as encouraging participants to experiment with HindEnCorp for transfer learning.
For Kazakh-English, we released a crawled corpus (from KazakhTV) prepared by Bagdat Myrzakhmetov of Nazarbayev University as well as a much larger Kazakh-Russian corpus for transfer learning or pivoting.
We released new monolingual news crawls for each of the languages used in the task. For German and Czech, we released versions of these with the document boundaries intact, for participants wishing to experiment with document-level models.
Some statistics about the training materials are given in Figures 1 and 2.

Unsupervised Sub-Task
Following up on the unsupervised learning challenge from last year, we again invited participants to build unsupervised machine translation systems without the use of any parallel training corpora.
While WMT has been (and is) providing considerable amounts of bitext for most of the language pairs covered in its shared tasks on machine translation of news, there is still a shortage of available parallel resources for many combinations of two human languages. Bridging through a global hub language, such as English, can be a solution in scenarios where no bitext exists between two languages but parallel corpora with the hub language are at hand for each of the two. This "pivot translation" approach of cascading source-English and English-target MT is well-established. More recent research on unsupervised translation, on the other hand, seeks to eliminate the need for parallel training data altogether. Unsupervised translation techniques should be capable of learning translation correspondences from only monolingual data in two different languages, thus potentially offering a solution to machine translation between each and every possible pair of written human languages.
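The cascaded pivot ("bridging") approach described above can be sketched as follows. The dictionary-based "translators" here are toy stand-ins for real source→English and English→target MT systems:

```python
# Toy sketch of "pivot translation": cascading a source->hub system and a
# hub->target system when no direct source-target bitext exists.

def make_translator(lexicon):
    """Word-by-word toy translator; unknown words pass through unchanged."""
    def translate(sentence):
        return " ".join(lexicon.get(w, w) for w in sentence.split())
    return translate

# Illustrative German->English and English->Czech "systems".
de_to_en = make_translator({"Haus": "house", "Katze": "cat"})
en_to_cs = make_translator({"house": "dům", "cat": "kočka"})

def pivot_translate(sentence, src_to_hub, hub_to_tgt):
    """Translate source->hub, then hub->target (the cascading approach)."""
    return hub_to_tgt(src_to_hub(sentence))
```

The cascade requires only source-hub and hub-target resources, which is exactly the scenario motivating this sub-task when the hub is English.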
The previous year's evaluation had indicated that, unsurprisingly, unsupervised translation clearly lags behind supervised translation. But we had also seen promising early-stage research results suggesting that the difficult task of unsupervised learning in machine translation may not be impossible to solve in the long run. When acceptable quality can be reached with unsupervised methods, they will likely not compete directly with supervised translation, but rather be deployed to cover language pairs where supervised translation is inapplicable due to a lack of parallel data.
The language pair for the WMT19 unsupervised sub-task was German-Czech. Only the German→Czech translation direction was evaluated, not the Czech→German direction. German is a compounding language, and German and Czech are both morphologically rich. Linguistic peculiarities on both the source and the target side pose different difficulties from last year's languages, where we paired Turkish, Estonian, and German each with English for the unsupervised sub-task. By choosing German-Czech, we hope to simulate practical application scenarios for fully unsupervised translation. However, note that there actually is German-Czech parallel data, e.g. from European parliamentary proceedings. German-English and English-Czech bitexts likewise exist in large amounts. We asked the participants to avoid any of these corpora, as well as any monolingual or parallel data for other languages and language pairs. Permissible training data for the unsupervised sub-task were only the monolingual corpora from the constrained monolingual WMT News Crawls of German and Czech. Last year's parallel dev and test sets (from the development tarball) were allowed for bootstrapping purposes. Since they contain a few thousand sentences of high-quality German-Czech parallel text, we advised participants to make only very moderate use of this data. Using it directly as a training corpus was strongly discouraged, but we wanted to provide system builders with a means to evaluate and track progress internally during system development. We also did not prohibit its use for lightweight (hyper-)parameter optimization.
Seven German→Czech unsupervised machine translation systems were submitted and marked as primary submissions by the participating teams. The unsupervised submissions were evaluated along with four online systems for the German→Czech language pair, which we assume are all supervised MT engines. The official results of the human evaluation are presented in Table 12 (Section 3).

EU Elections German→French and French→German Sub-Tasks
The second new sub-task this year involved translating news data between French and German (both directions) on the topic of the European Elections. We collected a development and test set from online news websites. Articles were originally in French or in German. Statistics of the corpora are presented in Table 2. In order to analyse the impact of the original source language of a document on systems' performance, we computed METEOR scores on the full corpus (FULL), on the sentences from articles initially written in French (second column), and on those initially written in German (third column). Results are shown in Tables 3 and 4. One can notice some differences depending on the language direction. While the performance of the systems when translating from French to German seems to depend heavily on the original language of the document, this is less the case for the German to French direction. These results suggest that the German text produced by translating French documents is somewhat different from German text originally produced as such, even though native German translators were involved in the process. This is of course not new and is related to translationese (Koppel and Ordan, 2011). As shown in Table 2, only one fifth of the test corpus originates from French documents. With this in mind, Table 4 suggests that the translationese effect is less obvious for French text.
For next year, we plan to produce additional data with documents created during and after the elections.

Submitted Systems
In 2019, we received a total of 153 submissions. The participating institutions are listed in Table 5 and detailed in the rest of this section. Not every system participated in all translation tasks. We also included online MT systems (originating from 5 services), which we anonymized as ONLINE-A, B, G, X and Y. For presentation of the results, systems are treated as either constrained or unconstrained, depending on whether their models were trained only on the provided data. Since we do not know how they were built, the online systems are treated as unconstrained during the automatic and human evaluations.
In the rest of this sub-section, we provide brief details of the submitted systems, in cases where the authors provided such details.
2.5.1 AFRL

AFRL-SYSCOMB19 (Gwinnup et al., 2019) is a system combination of a Marian ensemble system, two distinct OpenNMT systems, a Sockeye-based Elastic Weight Consolidation system, and one Moses phrase-based system. AFRL-EWC (Gwinnup et al., 2019) is a Sockeye Transformer system trained with the default network configuration as described in Vaswani et al. (2017). The model is trained using the prepared parallel corpus used in other AFRL systems. A fine-tuning corpus is created from the 2014-2017 WMT Russian-English test sets. EWC is applied as described in Thompson et al. (2019). The final submission is an ensemble decode of the four best-performing checkpoints from a single training run, selected by scoring newstest2018.

APPRENTICE-C (Li and Specia, 2019)
APPRENTICE-C is an RNN-based encoder-decoder with pre-trained embeddings enhanced by character information. The system is trained on 10.38M Chinese-English sentence pairs after tokenization, filtering by alignment, and BPE. The pre-trained embeddings are trained on monolingual data for 5 iterations and used as an initialization for the RNN model.

XZL-NMT (no associated paper)

Table 5: Participants in the shared translation task. Not all teams participated in all language pairs. The translations from the online systems were not submitted by their respective companies but were obtained by us, and are therefore anonymized in a fashion consistent with previous years of the workshop.
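Byte-pair encoding (BPE), used by this and many other submissions, can be sketched in a few lines in the style of Sennrich et al. (2016). Real systems use full toolkits such as subword-nmt; this toy merge learner is only illustrative:

```python
import re
from collections import Counter

# Minimal sketch of BPE merge learning: repeatedly merge the most frequent
# adjacent symbol pair. Words are space-separated symbol sequences, with
# </w> marking word ends.

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Return the learned merge operations and the merged vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

Applying the learned merge list to new text, in order, yields the subword segmentation used at training and decoding time.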

BTRANS
Unfortunately, no details are available for this system.
2.5.7 BASELINE-RE-RERANK (no associated paper)

BASELINE-RE-RERANK is a standard Transformer, with corpus filtering, pre-processing, post-processing, averaging and ensembling, as well as n-best list reranking.
2.5.8 CAIRE

CAIRE is a hybrid system that took part only in the unsupervised track. The system builds upon phrase-based MT and a pre-trained language model, combining word-level and subword-level NMT. A series of pre-processing and post-processing steps improves the performance, e.g. placeholders for numbers and dates, recasing, and quote normalization.

Charles University (CUNI) Systems
CUNI-T2T-TRANSFER (Kocmi and Bojar, 2019) is a Transformer neural machine translation system (as implemented in Tensor2Tensor) for Kazakh↔English and Gujarati↔English. CUNI-T2T-TRANSFER focused on transfer learning from a high-resource language pair (Russian-English and Czech-English, respectively) followed by iterative back-translation.

CUNI-DOCTRANSFORMER-T2T2019 and CUNI-TRANSFORMER-T2T2019 (Popel et al., 2019)

Both systems are trained in the T2T framework following last year's submission (Popel, 2018), but training on WMT19 document-level parallel and monolingual data. During decoding, each document is split into overlapping multi-sentence segments, where only the "middle" sentences in each segment are used for the final translation. CUNI-TRANSFORMER-T2T2019 is the same system as CUNI-DOCTRANSFORMER-T2T2019, just applied to separate sentences during decoding.
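The overlapping-segment decoding scheme described above can be sketched as follows. The segment size, overlap, and sentence-aligned toy "translation" are illustrative assumptions, not the exact CUNI settings:

```python
# Sketch of document-level decoding with overlapping segments: translate
# multi-sentence windows, then keep only each window's middle sentence,
# falling back to the first/last windows for document boundaries.

def overlapping_segments(sentences, size=3, step=1):
    """Slide a window of `size` sentences over the document."""
    return [sentences[i:i + size] for i in range(0, len(sentences) - size + 1, step)]

def assemble_from_middles(translated_segments, n_sentences, size=3):
    """Take the middle sentence of each translated segment; the document's
    first and last sentences come from the first and last segments."""
    mid = size // 2
    out = list(translated_segments[0][:mid])       # document start
    for seg in translated_segments:
        out.append(seg[mid])
    out.extend(translated_segments[-1][mid + 1:])  # document end
    return out[:n_sentences]
```

The idea is that each sentence is translated with context on both sides wherever the document allows it.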
CUNI-DOCTRANSFORMER-MARIAN (Popel et al., 2019) is a Transformer model as implemented in Marian and trained in a context-aware ("document-level") fashion. The training started with the same technique as last year's submission but the model was fine-tuned on document-level parallel and monolingual data by translating triples of adjacent sentences at once. If possible, only the middle sentence was considered for the final translation hypothesis; otherwise a shorter context of two sentences or just a single sentence was used.
CUNI-TRANSFORMER-T2T2018 (Popel, 2018) is the exact same system as used last year.

CUNI-UNSUPERVISED-NER-POST
(Kvapilíková et al., 2019) follows the strategy of Artetxe et al. (2018), creating a seed phrase-based system whose phrase table is initialized from cross-lingual embedding mappings trained on monolingual data, followed by a neural machine translation system trained on a synthetic parallel corpus. The synthetic corpus is produced by the seed phrase-based MT system or by such a model refined through iterative back-translation. CUNI-UNSUPERVISED-NER-POST further focuses on the handling of named entities, i.e. the part of the vocabulary where the cross-lingual embedding mapping suffers most.

DBMS-KU INTERPOLATION

The system DBMS-KU INTERPOLATION uses Linear Interpolation and Fill-up Interpolation methods with different language models, i.e., 3-gram and 5-gram. It combines a direct phrase table with a pivot phrase table, pivoting through the Russian language.
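The linear interpolation of language models mentioned for DBMS-KU INTERPOLATION can be sketched with toy unigram tables. The real system interpolates 3-gram and 5-gram models inside an SMT pipeline; all numbers and names here are illustrative:

```python
# Toy sketch of linear language-model interpolation:
#   P(w) = sum_i lambda_i * P_i(w),  with the lambda_i summing to 1.
# Unigram probability tables stand in for real n-gram models.

def interpolate(models, weights):
    """Return a probability function mixing several models linearly."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    def prob(word):
        return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))
    return prob

lm_a = {"the": 0.5, "cat": 0.5}   # e.g. in-domain model
lm_b = {"the": 0.2, "dog": 0.8}   # e.g. out-of-domain model
p = interpolate([lm_a, lm_b], [0.6, 0.4])
```

The interpolation weights are typically tuned to minimize perplexity on a held-out development set.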
2.5.11 DFKI-NMT (Zhang and van Genabith, 2019)

The DFKI-NMT systems are Transformer models that make use of synthetic data created from monolingual news. Each Transformer model is fine-tuned on previous years' test sets.
ETRANSLATION Fr-De The Fr-De system is an ensemble of 2 big Transformers (with FFN layers of size 8192). Back-translation data was selected using topic-modelling techniques to tune the model towards the domain defined in the task.
ETRANSLATION En-Lt The En-Lt system is an ensemble of 2 big Transformers (as for Fr-De) and a Transformer-type language model. The training data contains the Rapid corpus and the news-domain back-translated data, oversampled 2 times.
ETRANSLATION Ru-En The Ru-En system is a single base Transformer trained only on true parallel data (including ParaCrawl but excluding the UN corpus), filtered in the same way as in the other submissions and fine-tuned on previous test sets.
2.5.14 FACEBOOK FAIR (Ng et al., 2019)

The Facebook FAIR system is a pure sentence-level system: an ensemble of 3 big Transformer models with FFN layers of size 8192. It was trained on a mix of bitext and back-translated news crawl data; oversampling was used to keep the effective ratio of bitext and back-translated data the same. Sampling from an ensemble of 3 models trained on bitext only was used to generate the back-translations. The models were fine-tuned on in-domain data and a final noisy-channel reranking was applied. All the training data (bitext and monolingual) was cleaned using langid filtering.
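The oversampling used to keep the effective bitext/back-translation ratio constant can be sketched as follows. The function name, counts, and rounding behaviour are illustrative assumptions, not the FAIR implementation:

```python
import math

# Sketch of mixing genuine bitext with back-translated data: the smaller
# corpus is repeated so that the effective ratio between the two stays
# roughly at the desired value (1:1 by default).

def mix_with_ratio(bitext, backtranslated, target_ratio=1.0):
    """Repeat the smaller corpus so len(bitext)/len(back) ~= target_ratio."""
    if len(bitext) < target_ratio * len(backtranslated):
        reps = math.ceil(target_ratio * len(backtranslated) / len(bitext))
        bitext = bitext * reps
    else:
        reps = math.ceil(len(bitext) / (target_ratio * len(backtranslated)))
        backtranslated = backtranslated * reps
    return bitext + backtranslated
```

In practice toolkits express the same idea as per-corpus sampling weights rather than physically duplicating sentence pairs.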

FRANK-S-MT
Unfortunately, no details are available for this system.

GTCOM (Bei et al., 2019)
GTCOM's systems (GTCOM-PRIMARY) mainly focus on back-translation, knowledge distillation and reranking to build a competitive model with the Transformer architecture. A language model is also applied to filter monolingual data, back-translated data and parallel data. The techniques for data filtering include filtering by rules and by language models. Furthermore, they apply knowledge distillation techniques and right-to-left (R2L) reranking.

IIITH-MT

IIITH-MT for Gujarati-English first experimented with an attention-based LSTM encoder-decoder architecture, but later found the results to be more promising with the Transformer architecture. The paper documents that with Hindi-English as an assisting language pair in joint training, the multilingual system obtains significant BLEU improvements for a low-resource language pair like Gujarati-English.

IITP (Sen et al., 2019)
IITP-MT is a Transformer-based NMT system trained using the original parallel corpus and a synthetic parallel corpus obtained through back-translation of monolingual data. All the experiments are performed at the subword level using BPE with 10K merge operations.

JHU

JHU's English-German system is an ensemble of 2 Transformer base models, improved by filtered back-translation with restricted sampling (like Edunov et al., 2018), filtered ParaCrawl and CommonCrawl (Junczys-Dowmunt, 2018a), continued training on newstest15-18 (like JHU's submission to WMT18, Koehn et al., 2018), reranking with R2L models (like Sennrich et al., 2017, or Junczys-Dowmunt, 2018b) and fixing quotation marks to match the German style (as many other teams did). The German-English system was the same, with a 3 Transformer base ensemble, no quotation-mark fixing, and reranking that additionally included a language model (inspired by Junczys-Dowmunt, 2018a).

JUMT (no associated paper)
For training purposes, the preprocessed Lithuanian-English sentence pairs were fed to the Moses toolkit (Koehn et al., 2007). This created an SMT translation model with Lithuanian as the source language and English as the target language. After that, the Lithuanian side of a parallel corpus of 200,000 Lithuanian-English sentence pairs was re-translated into English with the SMT model. These 200,000 machine-translated English sentences and the respective 200,000 gold-standard Lithuanian sentences (from the Lithuanian-English sentence pairs) were given as input to a word-embedding-based NMT model. This resulted in the hybrid model submitted for manual evaluation.

JU_SAARLAND

The systems JU_SAARLAND and JU_SAARLAND_CLEAN_NUM_135_BPE used additional back-translated data and were trained using phrase-based and BPE-based attention models.
2.5.23 KSAI (Guo et al., 2019)

Kingsoft's submissions were based on various NMT architectures with Transformer as the baseline system. Several data filters and back-translation were used for data cleaning and data augmentation, respectively. Several advanced techniques were added to the baseline system, such as Linear Combination and Layer Aggregation. Fine-tuning methods were applied to improve the in-domain translation quality. The final model was a system combination obtained through multi-model ensembling and reranking, followed by post-processing.

KYOTO UNIVERSITY (Cromieres and Kurohashi, 2019)

KYOTO UNIVERSITY used the now-standard Transformer model (with 6 layers for each of encoder/decoder, hidden size of 1024, 16 attention heads, dropout of 0.3). Training data was carefully cleaned, and the 2018 monolingual data was used through back-translation, as it turned out to be necessary for correctly translating recent news items. No ensemble translation was performed, but a small BLEU improvement was obtained by taking a "majority vote" on the final translations from different checkpoints.
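The checkpoint "majority vote" can be sketched as follows. This is a simplification: ties here fall to the earliest checkpoint, which may differ from the actual implementation:

```python
from collections import Counter

# Sketch of a "majority vote" over checkpoints: translate the test set with
# several checkpoints of the same model and, per sentence, keep the output
# string produced by the most checkpoints.

def majority_vote(outputs_per_checkpoint):
    """outputs_per_checkpoint: one list of sentence translations per checkpoint."""
    voted = []
    for candidates in zip(*outputs_per_checkpoint):
        counts = Counter(candidates)
        # max() returns the first candidate reaching the highest count,
        # so ties are broken in checkpoint order.
        voted.append(max(candidates, key=lambda c: counts[c]))
    return voted
```

Unlike ensembling, this requires no joint decoding: each checkpoint is decoded independently and only the output strings are compared.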

LINGUA CUSTODIA (Burlot, 2019)
The German-to-French system LINGUA-CUSTODIA-PRIMARY is an ensemble of eight Transformer base models, fine-tuned on monolingual news data back-translated with constrained decoding for specific terminology control. The training data was created by cross-matching the training data from previous WMT shared tasks. Development and test sets were manually created from news articles focusing on EU elections topics.

LIUM

LIUM participated in both directions of the German-French language pair. The LIUM systems are based on self-attentional Transformer networks using "small" and "big" architectures, together with monolingual data selection and synthetic data through back-translation.

MICROSOFT

Microsoft submitted systems in several settings. They build 12-layer Transformer-Big systems: a) on the sentence level, b) with large document-level context (training on full documents with up to 1024 subwords) and c) hybrid models via 2nd-pass decoding and ensembling. The models are trained on filtered parallel data, large amounts of back-translated documents, and augmented fake and true parallel documents.

MSRA

MSRA.MADL is based on the Transformer (i.e., the standard transformer_big setting with 6 layers, embedding dimension 1024 and hidden-state dimension 4096) and trained with the multi-agent dual learning scheme (briefly, MADL). The core idea of dual learning is to leverage the duality between the primal task (mapping from domain X to domain Y) and the dual task (mapping from domain Y to domain X) to boost the performance of both tasks. MADL extends the dual learning framework by introducing multiple primal and dual models. It was integrated into the submitted system MSRA.MADL for German↔English and German↔French translations.
MSRA.SCA is a combination of the Transformer network, back-translation, knowledge distillation, soft contextual data augmentation, and model ensembling. The Transformer-Big architecture is trained using soft contextual data augmentation to further enhance performance. Following the above procedures, 5 different models are trained and ensembled for the final submission.
MSRA.MASS is based on the Transformer (i.e., the standard transformer_big setting with 6 layers, embedding dimension 1024 and hidden-state dimension 4096) and pre-trained with MASS: masked sequence-to-sequence pre-training for language generation (Song et al., 2019). MASS leverages both monolingual and bilingual sentences for pre-training: a segment of the source sentence is masked on the encoder side, and the decoder predicts this masked segment in the monolingual setting and predicts the whole target sentence in the bilingual setting. After pre-training, back-translation and ensembling/reranking are further leveraged to improve the accuracy of the system. MSRA.MASS handles Chinese→English and English↔Lithuanian translations in the submission.

MSRA.NAO is a system whose architecture is obtained by neural architecture optimization (briefly, NAO). NAO leverages the power of a gradient-based method to conduct optimization and guide the creation of better neural architectures in a continuous and more compact space, given the historically observed architectures and their performances. The search space includes self-attention, convolutional networks, LSTMs, etc. It was applied to English↔Finnish translations in the submitted systems.

2.5.31 NIUTRANS providing the system NEU (Li et al., 2019a)

The NIUTRANS submissions are based on Deep-Transformer-DLCL and its variants. They used back-translation with beam search and sampling methods for data augmentation. Iterative ensemble knowledge distillation was employed to enhance single systems with various teachers. Ensembling and reranking facilitated further system combination.

NICT
NICT (Dabre et al., 2019) submitted supervised neural machine translation (NMT) systems developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions. NICT focused on leveraging multilingual transfer learning and back-translation for the extremely low-resource language pairs: Kazakh↔English and Gujarati↔English translation.
For the Chinese↔English translation, back-translation, fine-tuning, and model ensembling were found to work best. For English→Finnish, the NICT submission from WMT18 remains a strong baseline despite the increase in parallel corpora for this year's task.
NICT (Marie et al., 2019b) also submitted an unsupervised neural machine translation system developed for the news translation task for the German→Czech translation direction, focussing on language model pre-training, n-best list reranking, fine-tuning, and model ensembling technologies. The final primary submission to this task is the result of a simple combination of the unsupervised neural and statistical machine translation systems.

NRC-CNRC (Littell et al., 2019)

The National Research Council Canada (NRC-CNRC) Kazakh-English news translation system is a multi-source, multi-encoder NMT system that takes Russian as an additional source. The constrained Kazakh-Russian parallel corpora are used to train NMT systems for "cross-translation" of resources between the languages, and the final Kazakh/Russian-to-English system is trained on a combination of genuine, back-translated, and cross-translated synthetic data. The submitted model is a partially trained single-run system.

PARFDA (Biçici, 2019)
Biçici (2019) reports on the use of the parfda system, Moses, KenLM, NPLM, and PRO, including the coverage of the test sets and the upper bounds on the translation results achievable with the constrained resources.

PROMT-NMT (Molchanov, 2019)
This is an unconstrained, Transformer-based single system, built using Marian and BPE.
2.5.36 RUG

RUG_KKEN_MORFESSOR (Toral et al., 2019) uses (i) unsupervised morphological segmentation, given the agglutinative nature of Kazakh, (ii) data from an additional language (Russian), given the scarcity of English-Kazakh data, and (iii) synthetic data for the source language filtered using language-independent sentence similarity. RUG_ENKK_BPE (Toral et al., 2019) uses data from an additional language (Russian), given the scarcity of English-Kazakh data, and synthetic data (for both source and target languages) filtered using language-independent sentence similarity.

RWTH AACHEN (Rosendahl et al., 2019)
The systems by RWTH AACHEN are all based on the Transformer architecture and, aside from careful corpus filtering and fine-tuning, they experiment with different types of subword units. For English-German, no gains over last year's setup are observed. Small improvements are reached in Chinese-English. The highest gain, of 11.1 BLEU, is obtained for Kazakh-English, thanks in part to transfer learning techniques.

TALP_UPC_2019_KKEN and TALP_UPC_2019_ENKK (Casas et al., 2019)

The TALP-UPC system was trained on a combination of the original Kazakh-English data (oversampled 3x) together with synthetic corpora obtained by translating, with a BPE-based Moses system, the Russian side of the Kazakh-Russian data to English for the en-kk direction, and the Russian side of the English-Russian data to Kazakh for the kk-en direction. For the final systems, a custom model consisting of a self-attention Transformer decoder that learns joint source-target representations (with BPE tokenization) was used, implemented in the fairseq library.

TARTUNLP-C (Tättar et al., 2019)
TARTUNLP-C is a multilingual multi-domain neural machine translation system in which the output language and domain are specified via input word features (factors). The system was trained on all the parallel data for Latin-alphabet languages and uses self-attention (Transformer) as the base architecture.

TILDE-C-NMT and TILDE-NC-NMT (Pinnis et al., 2019)
Tilde developed both constrained and unconstrained NMT systems for English-Lithuanian and Lithuanian-English using the Marian toolkit. All systems feature ensembles of four to five Transformer models that were trained using the quasi-hyperbolic Adam optimiser (Ma and Yarats, 2018). Data for the systems were prepared using TildeMT filtering (Pinnis, 2018) and preprocessing methods. For unconstrained systems, data were additionally filtered using dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018a). All systems were trained using iterative back-translation (Rikters, 2018) and feature synthetic data that allows training NMT systems to handle unknown phenomena (Pinnis et al., 2017). During translation, automatic named entity and non-translatable phrase post-editing were performed. For constrained systems, named entity and non-translatable phrase lists were extracted from the parallel training data. For unconstrained systems, WikiData (www.wikidata.org) was used to acquire bilingual lists of named entities.

Universitat d'Alacant
UALACANT-NMT (Sánchez-Cartagena et al., 2019) is an ensemble of two RNN and two transformer models. They were trained on a combination of genuine parallel data, synthetic data generated by means of pivot backtranslation (from the available English-Russian and Kazakh-Russian parallel data) and backtranslated monolingual data. The Kazakh text was morphologically segmented with Apertium.
UALACANT-NMT+RBMT (Sánchez-Cartagena et al., 2019) uses the same training data and segmentation as UALACANT-NMT and is likewise an ensemble of two RNN and two Transformer models, but its RNN models are multi-source models with two inputs: the original source-language text and its translation produced by the Apertium RBMT English-Kazakh system.

UCAM (Stahlberg et al., 2019)
The Cambridge University Engineering Department's entry to the WMT19 evaluation campaign focuses on fine-tuning and language modelling. Fine-tuning on former WMT test sets is regularized with elastic weight consolidation (Kirkpatrick et al., 2017). Language models are used at both the sentence level and the document level, with a modified Transformer architecture for document-level language modelling. An SMT system is integrated via a minimum Bayes-risk formulation (Stahlberg et al., 2017).

UDS-DFKI (España-Bonet and Ruiter, 2019)
The UdS-DFKI English→German system uses a standard Transformer architecture where the data is enriched with coreference information gathered at the document level. Training is still done at the sentence level. The English↔Gujarati systems are phrase-based SMT systems enriched with parallel sentences extracted from comparable corpora with a self-supervised NMT system. Back-translations are also used in this case.

UEDIN (Bawden et al., 2019a)
The UEDIN systems are supervised NMT systems based on the Transformer architecture and trained using Marian (Junczys-Dowmunt et al., 2018). For English↔Gujarati, synthetic parallel data from two sources, back-translation and pivoting through Hindi, is produced using unsupervised and semi-supervised NMT models pre-trained with a cross-lingual language modelling objective (Lample and Conneau, 2019). For German→English, the impact of vast amounts of back-translated training data on translation quality is studied, gaining some additional insights over Edunov et al. (2018). Towards the end of training, for German→English and Chinese↔English, the mini-batch size was increased up to fifty-fold by delaying gradient updates (Bogoychev et al., 2018) as an alternative to learning rate cooldown (Smith, 2018). For Chinese↔English, a comparison of different segmentation strategies showed that character-based decoding was superior to the translation of subwords when translating into Chinese. Pre-processing strategies were also investigated for English→Czech, showing that preprocessing can be simplified without loss of MT quality.
UEDIN's main finding on the Chinese↔English translation task is that a character-level model on the Chinese side improves the BLEU score when translating into Chinese; the same does not hold when translating from Chinese.
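The delayed-gradient-update trick mentioned above amounts to gradient accumulation: gradients from k mini-batches are summed before a single parameter update, emulating a k-times larger effective batch. A minimal scalar sketch of the idea (purely illustrative; no MT toolkit involved):

```python
# Gradient accumulation: sum gradients over `accumulate` mini-batches, then
# apply one (averaged) SGD update. A scalar toy example; real systems do
# this over parameter tensors inside the training loop.

def sgd_accumulated(param, batches, grad_fn, lr=0.1, accumulate=4):
    """One SGD update per `accumulate` batches instead of one per batch."""
    acc = 0.0
    for i, batch in enumerate(batches, 1):
        acc += grad_fn(param, batch)            # accumulate, no update yet
        if i % accumulate == 0:
            param -= lr * acc / accumulate      # one update for k batches
            acc = 0.0
    return param

# Toy objective: minimize (param - x)^2 over the data; gradient is 2*(param - x).
data = [1.0, 2.0, 3.0, 4.0]
p = sgd_accumulated(0.0, data, lambda w, x: 2 * (w - x), lr=0.1, accumulate=4)
```

With all four gradients computed at the same parameter value, the single averaged update moves the parameter as one large batch would.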

UMD (Briakou and Carpuat, 2019)
The UMD NMT models are attentional sequence-to-sequence models with long short-term memory (LSTM) units; words are segmented using BPE jointly learned on the concatenation of the Turkish and Kazakh data. The submitted model is an ensemble obtained by averaging the output distributions of 4 models trained on Kazakh, Turkish, and back-translated data using different random seeds.

UNSUPERVISED-6929 and
UNSUPERVISED-6935 Unfortunately, no details are available for these systems.

USTC-MCC (no associated paper)
USTC-MCC is a Transformer model implemented in Fairseq-py. Tokenization and BPE were used and the training data were augmented with back-translation.

USYD (Ding and Tao, 2019)
The University of Sydney's system is based on self-attentional Transformer networks, into which they integrated recent effective strategies from academic research (e.g., BPE, back-translation, multi-feature data selection, data augmentation, greedy model ensembling, reranking, Con-MBR system combination, and post-processing). Furthermore, they proposed a novel augmentation method, Cycle Translation, and a data mixture strategy, Big/Small parallel construction, to fully exploit the synthetic corpus.

XZL-NMT (no associated paper)
XZL-NMT is an ensembled Transformer model as implemented in Marian, using the Moses tokenizer and subword units.

Submission Summary
An overview of techniques used in the submitted systems was obtained in a poll; the full details are available online at https://tinyurl.com/wmt19-systems-descr-summary. Including manually entered data rows, we had more than 60 responses, some of which describe several MT systems at once. Overall, most of the submitted systems were standard bilingual MT systems, optimized to translate one language pair, even when data from other languages are used to support this pair. Truly multilingual systems were TARTUNLP-C, covering 7 of the tested language pairs, DBMS-KU INTERPOLATION (bidirectional Kazakh-English), and AYLIEN_MT_MULTILINGUAL, which was unfortunately tested only on the very low-resource Gujarati-English and not on all the language pairs it covers. In the highly competitive news translation task, these systems ended up in lower ranks, so aiming at multilinguality seems rather a distraction, except for supporting low-resource languages.
As in the previous year, the Transformer architecture (Vaswani et al., 2017) dominates, with more than 80% of submissions reporting to include it. Some diversity is seen at least in the actual implementation of the model, with Marian (Junczys-Dowmunt et al., 2018) being by far the most popular (more than 30%), followed by fairseq (18%), OpenNMT-py (16%), and Tensor2tensor and Sockeye (14% each). Phrase-based MT (primarily Moses; Koehn et al., 2007) is still often in use, with 15-25% of submissions using it in some way. Subword processing is very frequent, with BPE (Sennrich et al., 2016) taking the lead (two thirds) and SentencePiece (Kudo and Richardson, 2018) following (a quarter of submissions). More than 90% of submissions use tokenization (the Moses tokenizer in 40% of cases) before subword splitting, while more language-specific tools such as morphological segmenters are rare. Unicode characters were used only exceptionally (4 mentions) and in rather experimental systems, except for UEDIN; see Section 2.5.44.
More than 40% of submissions used language identification to clean the provided training data. Truecasing or recasing was also quite popular.
Common NMT model and training features are listed in Table 6, documenting that back-translation, ensembling and corpus filtering are a must.

Human Evaluation
A human evaluation campaign is run each year to assess translation quality and to determine the final ranking of systems taking part in the competition.

(Figure caption: English to German assessment from the human evaluation campaign. The annotator is presented with a machine translation output segment randomly selected from competing systems (anonymized) and is asked to rate the translation on a sliding scale.)
This section describes how the preparation of evaluation data, the collection of human assessments, and the computation of the official results of the shared task were carried out this year.

Direct Assessment
Work on evaluation over the past few years has provided fresh insight into ways to collect direct assessments (DA) of machine translation quality (Graham et al., 2013, 2014, 2016), and three years ago the evaluation campaign included parallel assessment of a subset of News task language pairs evaluated with relative ranking (RR) and DA. DA has some clear advantages over RR, namely the evaluation of absolute translation quality and the ability to carry out evaluations through quality-controlled crowd-sourcing. As established in 2016 (Bojar et al., 2016), DA results (via crowd-sourcing) and RR results (produced by researchers) correlate strongly, with Pearson correlation ranging from 0.920 to 0.997 across several source languages into English and at 0.975 for English-to-Russian (the only pair evaluated out-of-English). Since 2017, we have thus employed DA for evaluation of systems taking part in the news task and do so again this year.
Human assessors are asked to rate a given translation by how adequately it expresses the meaning of the corresponding reference translation or source language input on an analogue scale, which corresponds to an underlying absolute 0-100 rating scale. No sentence or document length restriction is applied during manual evaluation.

Styles of Direct Assessment Tested in WMT19
In previous years' evaluations, translated segments for all language pairs were evaluated independently of the wider document context. However, since recent MT evaluations address the comparison of system and human performance, evaluation within document context has become more relevant (Läubli et al., 2018; Toral et al., 2018). Therefore, for a selection of language pairs, human evaluation was carried out within the document context. We denote the two options "+DC" (with document context) and "−DC" (without document context) in the following. Additionally, in past years, test data included text that was created in the opposite direction to testing, in order to achieve a larger test set with limited resources. Inclusion of such data has however been shown to introduce inaccuracies into evaluations, particularly in terms of BLEU scores, and for this reason, this year we only test systems on data that was originally written in the source language.
In previous years we have employed only monolingual human evaluation (denoted "M" in the following) for official results. Last year we trialled source-based evaluation for English to Czech translation, i.e. a bilingual configuration ("B") in which the human assessor is shown the source input and system output only (with no reference translation shown). This approach has the advantage of freeing up the human-generated reference translation so that it can be included in the evaluation as another system and provide an estimate of human performance. Since we would like to restrict human assessors to only evaluate translation into their native language, we restricted bilingual/source-based evaluation to translation for out-of-English language pairs. This is especially relevant since we have a large group of volunteer human assessors with native fluency in non-English languages and high fluency in English, while we generally lack the reverse, native English speakers with high fluency in non-English languages. A summary of the human evaluation configurations run this year in the news task is provided in Table 7, where configurations that correspond to official results are highlighted in bold.

(Figure caption: English to German assessment from the human evaluation campaign. The annotator is presented with a machine translation output document randomly selected from competing systems (anonymized) and is asked to rate the translation on a sliding scale.)
The style of official evaluation used in recent years of WMT corresponds to M SR−DC (Segment Rating without Document Context), i.e. evaluating individual segments against the reference translation and independently of each other.
For language pairs for which our original-style SR−DC evaluation was run this year, the SR−DC configuration was kept as the source of the official results, with additional configurations provided for the purpose of comparison. For the remaining language pairs, official results are based on the SR+DC evaluation, i.e. the assessment of individual segments which are nevertheless provided in their natural order as they appear in the document. Fully document-level evaluation (DR+DC) as trialled this year, where we asked for a single score for the whole document, is problematic in terms of statistical power and inconclusive ties.
In order to maximize the number of human annotations collected while minimizing the amount of reading required by a given human assessor, we combined two evaluation configurations, Document Rating + Document Context (DR+DC) and Segment Rating + Document Context (SR+DC), shown in Table 7 and ran them as a single task. In this configuration, human annotators were shown each segment of a given document (produced by a single MT system) in original sequential order and the human assessor rated each segment in turn. Figure 3 shows a screenshot of this part of the annotation process. This was followed by a screen where the human assessor rated the entire document as a whole comprising the most recently rated segments. Figure 4 shows this later part of the same evaluation set-up. Subsequently when sufficient data is collected, SR+DC results are arrived at by combining ratings attributed to segments, while DR+DC results are a combination of document ratings.
For some language pairs we also ran the standard configuration from past years, in which segments are evaluated in isolation from the wider document context; we call this Segment Rating − Document Context (SR−DC), and a screenshot of this configuration is shown in Figure 5.
As in previous years, the standard SR−DC annotation is organized into "HITs" (following the Mechanical Turk's term "human intelligence task"), each containing 100 such screens and requiring about half an hour to finish. For the additional configuration that included both DR+DC and SR+DC, HITs were simply made up of a random sample of machine translated documents as opposed to segments.

Evaluation Campaign Overview
In terms of the News translation task manual evaluation, a total of 263 individual researcher accounts and 766 Turker accounts were involved. Researchers in the manual evaluation contributed judgments of 242,424 translations, while 487,674 translation assessment scores were submitted in total by the crowd, of which 224,046 were provided by workers who passed quality control.
Under ordinary circumstances, each assessed translation would correspond to a single individual scored segment. However, since distinct systems can produce the same output for a particular input sentence, in previous years we were often able to take advantage of this and use a single assessment for multiple systems. For example, last year we combined human assessments of identical translations produced by multiple systems and achieved up to 17% savings in terms of evaluation resources. However, since our evaluation now includes document context, deduplication of system outputs was not possible for most of the configurations run this year.
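The reuse of a single assessment for identical outputs can be illustrated with a small sketch (all names and data here are hypothetical; this is not the official evaluation tooling):

```python
# Sketch: when several systems produce identical output for the same source
# segment, a single human assessment can serve all of them.
from collections import defaultdict

def pool_identical_outputs(outputs):
    """outputs: iterable of (system, segment_id, translation).
    Returns {(segment_id, translation): [systems sharing that output]}."""
    pool = defaultdict(list)
    for system, seg_id, text in outputs:
        pool[(seg_id, text)].append(system)
    return pool

outputs = [
    ("sysA", 1, "Der Hund bellt."),
    ("sysB", 1, "Der Hund bellt."),   # identical: one judgment serves both
    ("sysC", 1, "Ein Hund bellt."),
]
pool = pool_identical_outputs(outputs)
judgments_needed = len(pool)                    # 2 instead of 3
saving = 1 - judgments_needed / len(outputs)    # a third of the effort saved
```

With document context, the unit of evaluation becomes the whole document shown in sequence, so this pooling is no longer applicable in most configurations.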

Data Collection
System rankings are produced from a large set of human assessments of translations, each of which indicates the absolute quality of the output of a system. Annotations are collected in an evaluation campaign that enlists the help of participants in the shared task; each team is asked to contribute to our data collection, in addition to assessments collected via Amazon Mechanical Turk. Table 8 shows the total numbers of human assessments collected in WMT19 contributing to final scores for systems. The effort that goes into the manual evaluation campaign each year is impressive, and we are grateful to all participating individuals and teams. We believe that human annotation provides the best decision basis for evaluation of machine translation output and it is great to see continued contributions on this large scale.

Crowd Quality Control
In order to trial document-level evaluation, in addition to our standard segment-level human evaluation, we ran two additional evaluations combined into a single HIT structure. Firstly, we collected segment ratings with document context (SR+DC) and secondly document ratings with document context (DR+DC). We refer to our original segment-level evaluation where assessors are shown segments in isolation from the wider document context as segment rating − document context (SR−DC). We describe all three methods of ranking systems in detail below.

Standard DA HIT Structure (SR−DC)
In the standard DA HIT structure (Segment Rating − Document Context), three kinds of quality control translation pairs are employed as described in Table 9: we repeat pairs (expecting a similar judgment), damage MT outputs (expecting significantly worse scores) and use references instead of MT outputs (expecting high scores).
In total, 60 items in a 100-translation HIT serve in quality control checks, but 40 of those are at the same time regular judgments of MT system outputs (we exclude assessments of bad references and ordinary reference translations when calculating final scores). The effort wasted for the sake of quality control is thus 20%.
Also in the standard DA HIT structure, within each 100-translation HIT, the same proportion of translations is included from each participating system for that language pair. This ensures the final dataset for a given language pair contains roughly equivalent numbers of assessments for each participating system. This serves three purposes in making the evaluation fair. Firstly, for the point estimates used to rank systems to be reliable, a sufficient sample size is needed, and the most efficient way to reach a sufficient sample size for all systems is to keep total numbers of judgments roughly equal as more and more judgments are collected. Secondly, it helps to make the evaluation fair because each system will suffer or benefit equally from an overly lenient/harsh human judge. Thirdly, despite DA judgments being absolute, it is known that judges "calibrate" the way they use the scale depending on the general observed translation quality. With each HIT including all participating systems, this effect is averaged out. Furthermore, apart from quality control items, HITs are constructed using translations sampled from the entire set of outputs for a given language pair.

Table 9: Quality control items in a 100-translation HIT:
• Repeat pairs: original system output (10) and an exact repeat of it (10)
• Bad reference pairs: original system output (10) and a degraded version of it (10)
• Good reference pairs: original system output (10) and its corresponding reference translation (10)
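The composition of a standard SR−DC HIT can be sketched as follows. This is a toy illustration with placeholder data and a placeholder `degrade` helper, not the official HIT-generation tooling:

```python
# Toy composition of a 100-translation SR-DC HIT following Table 9's
# proportions: 40 plain system outputs plus 10 repeat pairs, 10 bad-reference
# pairs and 10 good-reference pairs.
import random
from collections import Counter

def compose_hit(system_outputs, references, degrade, rng):
    """system_outputs: pool of (seg_id, mt_text); references: seg_id -> ref."""
    sample = rng.sample(system_outputs, 70)      # 40 plain + 3x10 pair originals
    plain, rep, bad, good = (sample[:40], sample[40:50],
                             sample[50:60], sample[60:70])
    items = [("system", s, t) for s, t in plain]
    for s, t in rep:
        items += [("system", s, t), ("repeat", s, t)]
    for s, t in bad:
        items += [("system", s, t), ("bad_ref", s, degrade(t))]
    for s, t in good:
        items += [("system", s, t), ("good_ref", s, references[s])]
    rng.shuffle(items)
    return items

rng = random.Random(0)
outputs = [(i, "mt %d" % i) for i in range(200)]
refs = {i: "ref %d" % i for i in range(200)}
hit = compose_hit(outputs, refs, lambda t: t + " [degraded]", rng)
kinds = Counter(kind for kind, _, _ in hit)
```

Of the 100 items, 80 (the 70 "system" items plus the 10 exact repeats) are usable judgments of genuine MT output; the 20 bad/good references are the quality-control overhead.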

Document-Level DA HIT Structure (SR+DC and DR+DC)
As mentioned previously, the collection of segment ratings with document context (Segment Rating + Document Context) and document ratings with document context (Document Rating + Document Context) was combined into a single evaluation set-up to save annotator time. This involved constructing HITs so that each segment belonging to a given document (produced by a single MT system) was displayed to and rated by the human annotator before he/she was shown the same entire document again and asked to rate it. Quality control for this set-up was carried out as follows, with the aim of constructing HITs of as close to 100 segments in total as possible:

1. All documents produced by all systems are pooled.

2. Documents are then sampled at random (without replacement) and assigned to the current HIT until the current HIT comprises no more than 70 segments in total.

3. Once documents amounting to close to 70 segments have been assigned to the current HIT, we select a subset of these documents to be paired with quality control documents. This subset is selected by repeatedly checking whether adding the segments of a given document (as quality control items) would keep the total number of segments in the HIT below 100; if so, the document is included, otherwise it is skipped, until all documents have been checked. In doing this, the HIT is structured to bring the total number of segments as close as possible to 100, but without selecting documents in any systematic way (such as preferring those with fewest segments).

4. Once we have selected a core set of original system output documents and a subset of them to be paired with quality control versions for each HIT, the quality control documents are automatically constructed by altering the sentences of a given document into a mixture of the three kinds of quality control items used in the original DA segment-level quality control: bad reference translations, reference translations and exact repeats (see Section 3.5.3 for details of bad reference generation).

5. Finally, the documents belonging to a HIT are shuffled.
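The packing arithmetic in the steps above can be sketched as follows. Document contents and the quality-control document generation are stubbed out, and the exact boundary conditions (at most 70 core segments, at most 100 in total) are our reading of the description:

```python
# Sketch of document-level HIT packing: sample whole documents until the
# core part reaches at most 70 segments, then pair a subset of them with
# quality-control copies while staying within 100 segments in total.
import random

def pack_hit(documents, rng, core_limit=70, hit_limit=100):
    """documents: list of (doc_id, n_segments). Returns (core, qc, total)."""
    pool = documents[:]
    rng.shuffle(pool)                            # random order, no replacement
    core, total = [], 0
    while pool and total + pool[-1][1] <= core_limit:
        doc = pool.pop()                         # fill the core of the HIT
        core.append(doc)
        total += doc[1]
    qc = []
    for doc in core:                             # pair some docs with QC copies
        if total + doc[1] <= hit_limit:
            qc.append(doc)
            total += doc[1]
    return core, qc, total

docs = [("d%d" % i, n) for i, n in enumerate([10, 20, 30, 15, 25, 5])]
core, qc, total = pack_hit(docs, random.Random(1))
```

The invariants of the procedure hold regardless of the random order: the core never exceeds 70 segments, the whole HIT never exceeds 100, and quality-control copies duplicate only documents already in the core.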

Construction of Bad References
In all set-ups employed in the evaluation campaign, and as in previous years, bad reference pairs were created automatically by replacing a phrase within a given translation with a phrase of the same length, randomly selected from n-grams extracted from the full test set of reference translations belonging to that language pair. This means that the replacement phrase will itself comprise a fluent sequence of words (making it difficult to tell that the sentence is low quality without reading the entire sentence), while at the same time its presence is highly likely to sufficiently change the meaning of the MT output so that it causes a noticeable degradation. The length of the phrase to be replaced is determined by the number of words in the original translation.
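For illustration, the bad-reference generation can be sketched as below. The span-length rule used here (about a fifth of the sentence, at least one word) is an assumption of this sketch; the campaign derives the length from the number of words in the original translation by its own rule:

```python
# Sketch of bad-reference generation: replace a span of the MT output with
# an equally long n-gram sampled from the reference side of the test set,
# so the result stays locally fluent but changes the meaning.
import random

def make_bad_reference(mt_tokens, ref_corpus_tokens, rng):
    n = max(1, len(mt_tokens) // 5)              # assumed length rule
    start = rng.randrange(len(mt_tokens) - n + 1)
    src = rng.randrange(len(ref_corpus_tokens) - n + 1)
    fluent_ngram = ref_corpus_tokens[src:src + n]
    return mt_tokens[:start] + fluent_ngram + mt_tokens[start + n:]

rng = random.Random(0)
mt = "the quick brown fox jumps over the lazy dog today".split()
corpus = "a completely different reference text with many words inside it".split()
bad = make_bad_reference(mt, corpus, rng)        # same length, changed meaning
```

Because the replacement is drawn from real reference text, the degraded sentence reads fluently in isolation, which is exactly what makes it a useful attention check.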

Annotator Agreement
When an analogue scale (or a 0-100 point scale, in practice) is employed, agreement cannot be measured using the conventional Kappa coefficient, ordinarily applied to human assessment when judgments are discrete categories or preferences. Instead, to measure consistency, we filter crowd-sourced human assessors by how consistently they rate translations of known distinct quality, using the bad reference pairs described previously. Quality filtering via bad reference pairs is especially important for the crowd-sourced portion of the manual evaluation. Due to the anonymous nature of crowd-sourcing, when collecting assessments of translations, it is likely to encounter workers who attempt to game the service, as well as submissions of inconsistent and even robotic evaluations. We therefore employ DA's quality control mechanism to filter out low-quality data, facilitated by the use of DA's analogue rating scale. Assessments belonging to a given crowd-sourced worker who has not demonstrated that he/she can reliably score bad reference translations significantly lower than corresponding genuine system output translations are filtered out. A paired significance test is applied to test whether degraded translations are consistently scored lower than their original counterparts, and the p-value produced by this test is used as an estimate of human assessor reliability. Assessments of workers whose p-value does not fall below the conventional 0.05 threshold are omitted from the evaluation of systems, since they do not reliably score degraded translations lower than corresponding MT output translations. Table 10 shows the number of workers who met our filtering requirement by showing significantly lower scores for bad reference items compared to corresponding MT outputs, and the proportion of those who simultaneously showed no significant difference in scores they gave to pairs of identical translations.
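The worker filter can be illustrated as below. For simplicity this sketch uses an exact one-sided sign test on the paired (original, degraded) scores; the campaign's actual choice of paired significance test may differ:

```python
# Sketch of the reliability filter: a worker is kept only if their scores
# for degraded ("bad reference") items are significantly lower than for the
# corresponding original outputs (p < 0.05, one-sided).
from math import comb

def sign_test_p(pairs):
    """pairs: (original_score, degraded_score); H1: degraded < original."""
    diffs = [orig - bad for orig, bad in pairs if orig != bad]   # drop ties
    n, wins = len(diffs), sum(d > 0 for d in diffs)
    # P(X >= wins) under Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def keep_worker(pairs, alpha=0.05):
    return sign_test_p(pairs) < alpha

reliable = [(80, 30), (75, 20), (90, 40), (60, 10), (85, 35), (70, 25)]
unreliable = [(60, 55), (55, 60)]
```

A worker who scores every degraded item below its original on six pairs reaches p = 1/64 ≈ 0.016 and is kept; a worker whose scores do not separate the pairs is filtered out.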
The numbers in Table 10 of workers passing quality control criterion (A) vary across language pairs, in line with past DA evaluations. Language pairs were run in the following order on Mechanical Turk: fi-en, gu-en, kk-en, lt-en, ru-en, zh-en, de-en. We observe that the amount of low-quality data we received (with one exception at the beginning) steadily decreased as data collection proceeded, from (100−31=) 69% low-quality data for fi-en to (100−71=) 29% for de-en, the last language pair to be evaluated. This is likely due to the active rejection of low-quality HITs and word spreading among unreliable workers to avoid our HITs. The assessors were least reliable for gu-en, with only 60 out of 301 workers passing the quality control. We removed the data from the unreliable workers in all language pairs.
In terms of the numbers of workers who passed quality control and who also showed no significant difference in scores for exact repeats of the same translation, the two document-level runs, zh-en and de-en, showed lower reliability than the original standard sentence-level DA set-up. Overall, the reliability is still relatively high, however, with the lowest language pair, de-en, still reaching 88% of workers showing no significant difference in scores for repeat assessments of the same translation. In sum, we confirmed this year again that the check on bad references is sufficient and that not many more workers would be ruled out if we also demanded similar judgements for repeated inputs.

Producing the Human Ranking
The data belonging to each individual human evaluation run were compiled individually to produce either one of our official system rankings or a ranking that we would like to compare with the official rankings.
In all set-ups, similar to previous years, system rankings were arrived at in the following way. Firstly, in order to iron out differences in the scoring strategies of distinct human assessors, human assessment scores for translations were standardized according to each individual human assessor's overall mean and standard deviation. For rankings arrived at via segment ratings (SR−DC as well as SR+DC), average standardized scores for individual segments belonging to a given system were then computed, before the final overall DA score for a given system was computed as the average of its segment scores (Ave z in Table 11). For rankings arrived at via document ratings (DR+DC), average standardized scores for individual documents belonging to a given system were then computed, before the final overall DA score for a given system was computed as the average of its document scores (Ave z in Table 11). Results are also reported for average scores for systems, computed in the same way but without any score standardization applied (Ave % in Table 11).
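The standardization and averaging just described can be sketched as follows. This is a minimal illustration with an assumed data layout, not the official scoring scripts:

```python
# Sketch of the "Ave z" computation: standardize each annotator's raw scores
# by their own mean and standard deviation, average per segment, then average
# a system's segment scores to obtain its final DA score.
from collections import defaultdict
from statistics import mean, pstdev

def rank_systems(assessments):
    """assessments: list of (annotator, system, segment_id, raw_score)."""
    by_annotator = defaultdict(list)
    for annotator, _, _, score in assessments:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    seg_z = defaultdict(list)                    # (system, segment) -> z-scores
    for annotator, system, seg, score in assessments:
        mu, sd = stats[annotator]
        seg_z[(system, seg)].append((score - mu) / sd if sd else 0.0)

    sys_scores = defaultdict(list)               # system -> per-segment means
    for (system, _), zs in seg_z.items():
        sys_scores[system].append(mean(zs))
    return sorted(((mean(v), s) for s, v in sys_scores.items()), reverse=True)

# A harsh and a lenient annotator agree on the ordering after standardization.
ranking = rank_systems([("a1", "sysA", 1, 70), ("a1", "sysB", 1, 50),
                        ("a2", "sysA", 1, 90), ("a2", "sysB", 1, 80)])
```

Standardization removes the annotators' different offsets and scale usage: both the harsh and the lenient annotator contribute the same z-scores here, so the ranking reflects only the relative quality.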
Tables 11, 12 and 13 include the official results of the news task, and Tables 14 and 15 include results for alternate human evaluation configurations. Human performance estimates arrived at by evaluation of human-produced reference translations are denoted by "HUMAN" in all tables. Clusters are identified by grouping systems together according to which systems significantly outperform all others in lower-ranking clusters, according to the Wilcoxon rank-sum test. Appendix A shows the underlying head-to-head official significance test results for all pairs of systems.
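The clustering rule can be sketched as follows; the `better_than` callback stands in for the pairwise Wilcoxon rank-sum test at p < 0.05, and the tie-handling here is a simplification of the official procedure:

```python
# Sketch of cluster assignment: walking down the ranking, a system starts a
# new cluster only if every system ranked above it significantly outperforms
# it; otherwise it is tied into the cluster above.
def cluster(ranked_systems, better_than):
    clusters = []
    for system in ranked_systems:
        above = [s for grp in clusters for s in grp]
        if above and all(better_than(s, system) for s in above):
            clusters.append([system])        # significantly below all above
        elif clusters:
            clusters[-1].append(system)      # tied with something above
        else:
            clusters.append([system])        # first (top-ranked) system
    return clusters

# Hypothetical significance results: A and B each beat C and D; no other
# pair differs significantly.
sig = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}
groups = cluster(["A", "B", "C", "D"], lambda a, b: (a, b) in sig)
```

With these hypothetical test results, A and B form the top cluster and C and D the second, mirroring the horizontal lines in the result tables.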

Human Parity
In terms of human parity, fully document-level evaluations incur the problem of low statistical power due to the reduced sample size of documents. The many ties in our DR+DC evaluation results therefore cannot be used to draw conclusions of human parity with MT. In addition, as highlighted by Toral et al. (2018), Läubli et al. (2018) and also by us (Bojar et al., 2018), a tie between human and machine in an evaluation of isolated segments cannot be used to draw conclusions of human parity: given a wider context, human evaluators may draw different conclusions. Our SR+DC human evaluation configuration is an attempt to strike the right balance between assessing a sufficient sample size of translations and, importantly, keeping the document context available to human assessors, a configuration highlighted as suitable for human-parity investigations. The results that can be relied upon for drawing conclusions of human parity therefore include those from our SR+DC configurations. Even with all our precautions, the indications of human parity should not be overvalued. For instance, the super-human performance observed for Facebook-FAIR on English to German is based on standardized scores (Ave z). Without the standardization (Ave %), Facebook-FAIR is on par with the reference and two systems by Microsoft score higher. The same mismatch of Ave % and Ave z happens for English-Czech within the second performance cluster and also a couple of times in German-English and other language pairs. This has happened in the past already, but the English-German case seems to be the first where the Wilcoxon test claims a significant difference.

(Table 10 caption: numbers of workers (A) whose scores for bad reference items were significantly lower than for corresponding MT outputs, and (B) those of (A) whose scores also showed no significant difference for exact repeats of the same translation. The language pairs were submitted for evaluation one after another in the reported order.)
Table 12: Official results of the WMT19 German to Czech Unsupervised News Translation Task. Systems are ordered by DA z-score; systems within a cluster are considered tied; lines indicate clusters according to the Wilcoxon rank-sum test at p < 0.05; the grayed entry indicates resources that fall outside the constraints provided (in particular the use of parallel training data).

In contrast to the previous year, reference translations were scored significantly higher than MT systems in all these settings. It is thus not clear whether the super-human quality observed last year was due to the lower quality of last year's references, a different set of documents, or the segment-level style of evaluation, as thoroughly discussed by Bojar et al. (2018).

Comparing the Different English-Czech Results
The good news is that all the different types of evaluation correlate very well, with the Pearson correlation coefficient ranging from .978 (Ave % of DR+DC vs. the SR−DC run by Microsoft) to .998 (Ave % vs. Ave z of SR+DC). The document-level ranking (DR+DC) correlates with all variants of segment-level ranking with a Pearson correlation of .981 to .996.
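Such correlations between ranking variants can be computed with plain Pearson correlation over the per-system scores; a stdlib-only sketch with hypothetical score vectors:

```python
# Pearson correlation between two lists of per-system scores, as used above
# to compare evaluation configurations.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Two hypothetical score vectors for the same four systems under two
# evaluation configurations (e.g. Ave z vs. Ave %): a near-monotone
# relationship gives a correlation close to 1.
r = pearson([0.30, 0.10, -0.05, -0.40], [62.0, 57.5, 55.0, 48.0])
```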

Test Suites
Following our practice since last year, we issued a call for "test suites", i.e. test sets focussed on particular language phenomena, to complement the standard manual and automatic evaluations of WMT News Translation systems.
Each team in the test suites track provides source texts (and optionally references) for any language pair that is being evaluated by the WMT News Task. We shuffle these additional texts into the inputs of the News Task and ship them jointly with the regular news texts. MT system developers may decide to skip these documents based on their ID, but most of them process test suites along with the main news texts. After collecting the output translations from all WMT News Task participants, we extract the translated test suites, unshuffle them and send them back to the corresponding test-suite team. It was up to the test-suite team to evaluate the MT outputs; some did this automatically, some manually and some both.
When shuffling, test suites this year closely observed document boundaries. If a test suite was marked as sentence-level only by its authors, we treated individual sentences as if they were one-sentence documents. This led to a very high number of input documents for some language pairs, but all News Task participants managed to handle this additional burden.
As in the previous year, we have to note that test suites go beyond the news domain. If News Task systems are too heavily optimized for news, they may underperform on these domains.
The primary motivation in 2018 was to cut through the opacity of evaluations. We wanted to know more details than just which systems perform better or worse on average. This motivation remains this year, but one more reason for teams providing test suites was to examine the human-parity question from additional viewpoints beyond what Bojar et al. (2018) examined.

Test Suite Details
The following paragraphs briefly describe each of the test suites. Please refer to the respective papers for all details of the evaluation.

Audits and Agreements (Vojtěchová et al., 2019)

The test suite provided by the ELITR project (Vojtěchová et al., 2019) focuses on document-level qualities of two types of documents, audit reports and agreements (the latter represented with only one document, in fact), for the top-performing English-to-Czech systems and some English↔German systems.

The English-to-Czech systems were found to match or perhaps even surpass the quality of news reference translations in WMT18 (Bojar et al., 2018), and they also perform very well this year on news. The test suite aimed to validate whether this quality transfers (without any specific domain adaptation) to the domain of reports of supreme audit institutions, which is much more sensitive to terminological choices, and to the domain of agreements, where term consistency is critical.
The main findings are that even for precise texts (even if intended for the general public and written in relatively simple language), current NMT systems are close to matching human translation quality. Terminological choices are a little worse, but syntax and overall understandability were scored on par with or better than the human reference (mixed among the systems in an anonymous way). This can be seen as an indication of human parity even outside the original domain of the systems, although the official evaluation on news this year ranks the reference significantly higher.
A very important observation is that (single) reference translations are insufficient, because they do not reflect the full range of acceptable term translations. Manual non-expert evaluation would also not be sufficiently reliable, because non-experts do not realize the subtle meaning differences among the terms.
On the other hand, the micro-study on agreements reveals that even these very good systems produce practically useless translations of agreements, because none of them handles document-specific terms and their consistent translation at all.

Linguistic Evaluation of German-to-English (Avramidis et al., 2019)
The test suite by DFKI covers 107 grammatical phenomena organized into 14 categories. The test suite is very closely related to the one used last year (Macketanz et al., 2018), which allows an evaluation over time.
The test suite is evaluated semi-automatically on a large set of sentences (over 25k) illustrating each of the examined phenomena and equipped with automatic checks for anticipated good and bad translations. The outputs of these checks are manually verified and refined.
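The automatic part of such checks can be thought of as pattern matching for anticipated correct and incorrect translations, with unmatched outputs deferred to manual verification. A sketch under the assumption that the checks are simple regular expressions; the patterns and sentence below are invented, not taken from the actual test suite:

```python
import re

def check_item(output, positive_patterns, negative_patterns):
    """Return 'pass', 'fail', or 'manual' (needs human verification):
    negative patterns (anticipated bad translations) take precedence."""
    if any(re.search(p, output) for p in negative_patterns):
        return "fail"
    if any(re.search(p, output) for p in positive_patterns):
        return "pass"
    return "manual"

out = "He did not sleep a wink."
print(check_item(out, [r"\bnot sleep a wink\b"], [r"\bnot sleep an eyelid\b"]))
```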
The cross-year comparison is naturally affected by the different set of systems participating in each of the evaluations, but some trends are still observed, namely the improvement in function words, non-verbal agreement and punctuation. The least improvement is seen in terminology and named entities.
Overall, MT systems still translate on average about 25% of the tested sentences wrongly. The worst performance is seen for idioms (88% wrong) and complex German verbal grammar (72-77% wrong). Specific terminology and some grammatical phenomena reach error rates of about 50%. The paper also indicates phenomena with error rates below 10%, e.g. negation or several cases of verb conjugation.

Document-Level Phenomena (Rysová et al., 2019)
The English-to-Czech test suite by Rysová et al. (2019) builds upon discourse linguistics and manually evaluates three phenomena related to document-level coherence, namely topic-focus articulation (information structure), discourse connectives, and alternative lexicalizations of connectives (essentially multi-word discourse connectives). Co-reference is deliberately not included. The 101 test suite documents (3.5k source sentences in total) come from the Penn Discourse Treebank and are specifically of the "essay" or "letter" type. The manual evaluation by trained linguists always considered the whole document: the source English text and one of the MT outputs. Targeted phenomena were highlighted in the source, and the annotators marked whether they agree with the source annotation and (if so) whether the respective source phenomenon is also reflected in the target. The reference translation comes from the Prague Czech-English Dependency Treebank (Hajič et al., 2012) and was included in the annotation in a blind way, as if it were one of the MT systems.
The results indicate that the examined phenomena are handled by the MT systems exceptionally well, matching human quality or even negligibly outperforming humans, e.g. in multi-word discourse connectives. Interestingly, the English-Czech systems trained in some document-level way this year do not seem any better than the segment-level ones.

Producing German Conjunctions from English and French (Popović, 2019)

The test suite by Popović (2019) contains approximately 1000 English and 1000 individual French sentences that were included in the English→German and French→German tasks. The sentences focus on the translation of the English "but" and French "mais", which should be disambiguated into German "aber" or "sondern". Except for 1-2% of cases (when no conjunction or both possibilities are found in the target), the outputs can be evaluated automatically. The results indicate that the situation where "aber" is needed is recognized almost perfectly by all the systems, but the situation which requires "sondern" is sometimes mishandled and the (generally more frequent) "aber" is used instead. The error rate ranges from 3% (TARTUNLP-C) to 14% (ONLINE-X) or 22% (the unclear system called EN-DE-TASK).

Word-Sense Disambiguation (Raganato et al., 2019)

The test suite covers translation from German, Finnish, Lithuanian, and Russian into English and from English into these four languages and Czech.
The ambiguous words were identified with the help of BabelNet (Navigli and Ponzetto, 2012) multilingual synsets, and the granularity was reduced with the help of word embeddings to ensure that the meaning distinctions are sufficiently large. For the WMT use case, there are dozens to a few hundred ambiguous source words (except for Lithuanian, with only very few words), with slightly more than 2 distinct word senses per examined source word on average.
The results show that overall, WMT systems perform word-sense disambiguation quite well when evaluated in the "in-domain" setting (word senses not too common in subtitle corpora), with precision (examples with correct target words over examples with either correct or incorrect target words) ranging from 64-80% (e.g. Finnish→English or English→German) up to 95-97% (English→Czech), depending on the language pair. The recalls (examples with correct target words over all examples) are similarly high, 65-91% across the board.
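The precision and recall definitions in parentheses can be made concrete. In this sketch each evaluated example is labelled `correct` (a known good target word was produced), `incorrect` (a known wrong one), or `other` (neither list matched); the labels are invented:

```python
def wsd_scores(examples):
    """Precision and recall as defined in the text: precision counts only
    examples where a known correct or known incorrect target word appeared;
    recall divides correct examples by all examples."""
    correct = sum(1 for e in examples if e == "correct")
    incorrect = sum(1 for e in examples if e == "incorrect")
    total = len(examples)
    precision = correct / (correct + incorrect) if correct + incorrect else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

labels = ["correct"] * 8 + ["incorrect"] * 2 + ["other"] * 2
p, r = wsd_scores(labels)
print(round(p, 2), round(r, 2))  # 0.8 0.67
```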
The "out-of-domain" evaluation was directed at word senses common in colloquial speech, and in general, research WMT news systems perform a little worse than online systems in these scores, except for English-Czech.

Similar Language Translation
Within the MT and NLP communities, English is by far the most resource-rich language. MT systems are most often trained to translate texts from and to English or they use English as a pivot language to translate between resource-poorer languages. The interest in English is reflected, for example, in the WMT translation tasks (e.g. News, Biomedical) which have always included language pairs in which texts are translated to and/or from English.
With the widespread use of MT technology, there is more and more interest in training systems to translate between languages other than English. One piece of evidence for this is the need to translate directly between pairs of similar languages, varieties, and dialects (Zhang, 1998; Marujo et al., 2011; Hassani, 2017; Costa-jussà et al., 2018). The main challenge is to take advantage of the similarity between languages to overcome the limitation posed by the low amount of available parallel data and produce accurate output.
Given the interest of the community in this topic, we organized, for the first time at WMT, a shared task on "Similar Language Translation" to evaluate the performance of state-of-the-art translation systems on translating between pairs of languages from the same language family. We provided participants with training and testing data for three language pairs: Spanish-Portuguese (Romance languages), Czech-Polish (Slavic languages), and Hindi-Nepali (Indo-Aryan languages). Evaluation was carried out using automatic evaluation metrics.

Data
Training We made available a number of data sources for the Similar Language Translation shared task. Some training datasets were used in previous editions of the WMT News Translation shared task and were updated (Europarl v9, News Commentary v14), while some corpora were newly introduced (Wiki Titles v1, JRC Acquis). For the Hindi-Nepali language pair, parallel corpora were collected from OPUS (Tiedemann and Nygaard, 2004). We used the Ubuntu, KDE, and GNOME corpora available at OPUS for this shared task.

Development and Test Data
The creation of development and test sets for Czech and Polish involved randomly extracting 30 TED talks for the development set and 30 TED talks for the test set in each language. Then unique sentences were extracted and lines containing metadata were removed, resulting in 4.7k sentences in the development sets and 4.8k sentences in the test sets. Further cleaning to retain only sentences between 7 and 100 words limited the development and test sets to 3,050 and 3,412 sentences, respectively.
The development and test sets for Spanish and Portuguese were created from a corpus provided by AT Language Solutions. First, unique sentences were extracted and lines containing metadata were removed, which narrowed the corpus to 11.7k sentences. Then cleaning to retain only sentences between 7 and 100 words reduced it to 6.8k sentences. Finally, 3,000 randomly selected sentences were used for the development set and another 3,000 random sentences were extracted to form the test set. For Hindi-Nepali, all data was initially combined and randomly shuffled. From the combined corpus, we randomly extracted 65,505 sentences for the training set, 3,000 sentences for the development set, and 3,567 for the test set. Finally, the test set was split into two different test sets: 2,000 sentences were used for Hindi to Nepali and 1,557 sentences for Nepali to Hindi.
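The deduplicate-clean-filter pipeline described above can be sketched as follows. The metadata heuristic (dropping lines that start with `<`, as in SGML-style markup) is an assumption; the original pipeline's exact filters are not specified beyond the 7-100 word window:

```python
def clean_corpus(lines, min_words=7, max_words=100):
    """Deduplicate, drop metadata-looking lines, and keep only sentences
    whose length falls in [min_words, max_words], mirroring the dev/test
    set preparation."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("<"):  # crude metadata filter (assumption)
            continue
        if line in seen:  # keep only unique sentences
            continue
        seen.add(line)
        if min_words <= len(line.split()) <= max_words:
            kept.append(line)
    return kept
```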

Participants
The first edition of the WMT Similar Language Translation task attracted more participants than we anticipated. 35 teams signed up to participate in the competition, and 14 of them submitted system outputs for at least one of the three language pairs in either translation direction. By the end of the competition, 10 teams had submitted system description papers, which are referred to in this report. Table 25 summarizes the participation across language pairs and translation directions and includes references to the 10 system description papers. We observed that the majority of teams consist only of members working in universities and research centers (12 teams), whereas only two teams include members working in industry. The participants were distributed across different continents, with seven European teams, two teams based in the Americas, and five Asian teams.
Below we provide summaries of each of the entries we received:

BSC: Team BSC (Barcelona Supercomputing Center) participated with a Transformer-based approach in the Spanish-Portuguese track. As preprocessing, SentencePiece (https://github.com/google/sentencepiece) was applied after concatenating and shuffling the data. For the Portuguese to Spanish direction, BSC made use of back-translation.

CFILT_IITB: The CFILT_IITB submission (Khatri and Bhattacharyya, 2019) for the Hindi↔Nepali task is based on the unsupervised neural machine translation approach described in Artetxe et al. (2018), with a shared encoder and a bidirectional recurrent neural network architecture. They used two hidden layers for both the encoder and the decoder.

CMUMEAN: This system is based on a standard Transformer-based NMT model for the Hindi↔Nepali shared task. To compensate for the insufficient released parallel data, they utilized 7M monolingual sentences for both Hindi and Nepali, taken from CommonCrawl. They augmented the monolingual data by constructing pseudo-parallel datasets. The pseudo-parallel sentences were constructed by word substitutions, based on a mapping of the embedding spaces of the two languages. This mapping was learned from all data and a seed dictionary based on the alignment of the parallel data.
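The word-substitution step can be sketched as below. The embedding-induced dictionary is represented as a plain `dict` and the (transliterated) entries are invented placeholders; CMUMEAN's actual mapping and vocabulary are not reproduced here:

```python
def pseudo_parallel(mono_sentences, induced_dict):
    """Build pseudo-parallel pairs by substituting source words through a
    dictionary induced from a cross-lingual embedding mapping; words
    without an entry are copied through unchanged."""
    pairs = []
    for src in mono_sentences:
        tgt = " ".join(induced_dict.get(w, w) for w in src.split())
        pairs.append((src, tgt))
    return pairs

# Toy placeholder dictionary (transliterated; not real dictionary entries).
induced = {"pani": "paani"}
print(pseudo_parallel(["pani ghar ma"], induced))
```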
Incomslav: Team INCOMSLAV (Chen and Avgustinova, 2019) from Saarland University participated in the Czech to Polish translation task only. The team's primary submission builds on a Transformer-based NMT baseline with back-translation, which was submitted as one of their contrastive submissions. Incomslav's primary system is a phoneme-based system re-scored using their NMT baseline. A second contrastive submission builds on a phrase-based SMT system combined with a joint BPE model.
JUMT: This submission used a phrase-based statistical machine translation model for the Hindi → Nepali task. They used a 3-gram language model and MGIZA++ for word alignment. However, their system achieved poor performance in the shared task.

MLLP-UPV: Team MLLP-UPV (Baquero-Arnal et al., 2019) from the Universitat Politècnica de València (UPV) participated with a Transformer (implemented with FairSeq (Ott et al., 2019)) and a fine-tuning strategy for domain adaptation in the Spanish-Portuguese task. Fine-tuning on the development data provided improvements of almost 12 BLEU points, which may explain their clearly best performance in the task for this language pair. As a contrastive system, the authors provided, only for Portuguese-to-Spanish, a novel 2D alternating RNN model, which did not respond as well to fine-tuning.
KYOTOUNIVERSITY: Kyoto University's submission, listed simply as KYOTO in Table 25, for the PT → ES task is based on a Transformer NMT system. They used different word segmentation strategies during preprocessing. Additionally, they used an optional reverse feature in their preprocessing step. Their submission achieved average scores in the shared task.
NICT: The NICT team (Marie et al., 2019a) participated with a system combination of the Transformer (implemented in Marian (Junczys-Dowmunt et al., 2018)) and a phrase-based machine translation system (implemented with Moses) for the Spanish-Portuguese task. The system combination included features formerly presented in Marie and Fujita (2018), including left-to-right and right-to-left scores, sentence-level translation probabilities, and language model scores. The authors also provide contrastive results with an unsupervised phrase-based MT system, which achieves results quite close to their primary system; they attribute the high performance of the unsupervised system to the language similarity.

A further submission built its SMT system with Moses (Koehn et al., 2007) and KenLM. Their two NMT systems were built using OpenNMT: the first with 2 layers using an LSTM model, while the second was built with 6 layers using the Transformer model.

UBC-NLP:
Team UBC-NLP from the University of British Columbia in Canada (Przystupa and Abdul-Mageed, 2019) compared how LSTM-plus-attention (Bahdanau et al., 2015) and Transformer (Vaswani et al., 2017) models (implemented in the OpenNMT toolkit) perform on the three tasks at hand. The authors use back-translation to introduce monolingual data into their systems. LSTM-plus-attention outperformed the Transformer for Hindi-Nepali, and vice versa for the other two tasks. As reported by the authors, the Hindi-Nepali task provides much shorter sentences than the other two tasks. Additionally, the authors report in their system description interesting insights on how similar the languages in each of the three tasks are.

UDS-DFKI:
The UDS-DFKI team (Pal et al., 2019) is formed by researchers from Saarland University (UDS), the German Research Center for Artificial Intelligence (DFKI), and the University of Wolverhampton. They submitted a transference model that extends the original Transformer to a multi-encoder architecture. The transference model contains two encoders: the first encodes word-form information of the source (CS), and the second encodes sub-word (byte-pair-encoding) information of the source (CS). They report the results obtained by their system in translating from Czech→Polish and comment on the impact of out-of-domain test data on the performance of their system. UDS-DFKI ranked second among ten teams in Czech-Polish translation.
UHelsinki: The University of Helsinki team (Scherrer et al., 2019) participated with the Transformer (Vaswani et al., 2017) implemented in the OpenNMT toolkit. They focused on word segmentation methods and compared a cognate-aware segmentation method, Cognate Morfessor (Grönroos et al., 2018), with character segmentation and unsupervised segmentation methods. As their primary submission they submitted the Cognate Morfessor system, which optimizes subword segmentations consistently for cognates. They participated in all translation directions for Spanish-Portuguese and Czech-Polish; Cognate Morfessor performed better for Czech-Polish, whereas character-based segmentation (Costa-jussà and Fonollosa, 2016), though much less efficient, was superior for Spanish-Portuguese.

UPC-TALP:
The UPC-TALP team (Biesialska et al., 2019) from the Universitat Politècnica de Catalunya submitted a Transformer (implemented with Fairseq (Ott et al., 2019)) for the Czech-to-Polish task and a phrase-based system (implemented with Moses (Koehn et al., 2007)) for Spanish-to-Portuguese. They tested adding monolingual data to the NMT system by copying the same data on the source and target sides, with negative results. Also, their system combination based on sentence-level BLEU in back-translation did not succeed. The authors provide interesting insights on language distance based on previous work by Gamallo et al. (2017), and their results show that the phrase-based system achieves better results than NMT when the language distance between source and target language is lower.

Results
We present results for the three language pairs, each of them in both directions. For this first edition of the Similar Language Translation Task, and differently from the News Task, evaluation was performed only on an automatic basis, using the BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) measures. Each language direction is reported in a separate table containing the team name; the type of system, either contrastive (C) or primary (P); and the BLEU and TER results. In general, primary systems tend to be better than contrastive systems, as expected, but there are some exceptions.
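TER normalizes the number of edits (insertions, deletions, substitutions, and block shifts) by the reference length. A simplified version that omits shift operations reduces to word-level Levenshtein distance over reference length; real TER (e.g. as implemented in sacrebleu) also counts shifts, so this sketch only approximates the reported scores:

```python
def simple_ter(hyp, ref):
    """Word-level edit distance divided by reference length; an
    approximation of TER that ignores block-shift operations."""
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / len(r)

print(simple_ter("el gato negro", "el gato negro"))  # 0.0
```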
Even though we are presenting 3 pairs of languages, each belonging to the same family, translation quality in terms of BLEU varies significantly. While the best systems for Spanish-Portuguese are above 64 BLEU and below 21 TER (see Tables 26 and 27), the best systems for Czech-Polish do not reach 8 BLEU, with a TER of 79.6 for the direction with the lowest TER (Polish-to-Czech). The case of Hindi-Nepali is in between, with a BLEU of 53.7 and a TER of 36.3 for the better direction, Hindi-to-Nepali. Also, we noticed that BLEU and TER do not always correlate: while some systems performed better in BLEU, the ranking is different if ordered by TER. In any case, we chose BLEU as the official metric for ranking.
The highest variance in system performance is found in Hindi-Nepali (both directions), where the best performing systems are around 50 BLEU (53 for Hindi-to-Nepali and 49.1 for Nepali-to-Hindi), while the lowest entry is 1.4 BLEU for Hindi-to-Nepali and 0 for Nepali-to-Hindi. The lowest variance is for Polish-to-Czech, which may be because only two teams participated.

Conclusion of Similar Language Translation
In this section we presented the results of the 2019 WMT Similar Language Translation shared task. The competition featured data in three language pairs: Czech-Polish, Hindi-Nepali, and Portuguese-Spanish. The observed differences in translation quality are consistent with the language distance measures of Gamallo et al. (2017), which indicate a considerably larger distance for Czech-Polish than for Spanish-Portuguese.

Conclusion
We presented the results of the WMT19 News Translation Shared Task. Our main findings rank participating systems in their sentence-level translation quality, as assessed in a large-scale manual evaluation using the method of Direct Assessment (DA). The novelties this year include (1) avoiding effects of translationese by creating reference translations always in the same directions as the MT systems are run, (2) providing human assessors with the context of the whole document when assessing individual segments for a large portion of language pairs, (3) extending the set of languages which are evaluated given the source, not the reference translation, and (4) scoring also whole documents, not only individual segments.
Our results indicate which MT systems perform best across the 18 examined translation pairs, as well as what features are now commonly used in the field. The test suites complement this evaluation by focusing on particular language phenomena such as word-sense disambiguation, document-level coherence, or terminological correctness.
As in the previous year, MT systems seem to reach the quality of human translation in the news domain for some language pairs. This result has to be regarded with great caution, considering the technical details of the (document-aware) DA evaluation method as well as the outcomes of complementary evaluations, such as those included in the test suites. Importantly, the language pairs for which parity was reached last year were not confirmed by this year's evaluation, and a similar situation may repeat itself. As one of the test suites (Vojtěchová et al., 2019) suggests, there are aspects of texts which are handled wrongly even by the best translation systems. The task on similar language translation indicated that performance in this area varies greatly across language pairs as well as across participating teams.

Acknowledgments
This work was supported in part by funding from the European Union's Horizon 2020 research and innovation programme under grant agreement Nos. 825299 (GoURMET) and 825303 (Bergamot), and from the Connecting Europe Facility under agreement No. NEA/CEF/ICT/A2016/1331648 (ParaCrawl).
The human evaluation campaign was very gratefully supported by Apple, Microsoft, and Science Foundation Ireland in the ADAPT Centre for Digital Content Technology (www.adaptcentre.ie) at Dublin City University, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
We are grateful to the large number of anonymous Mechanical Turk workers who contributed their human intelligence to the human evaluation.
Ondřej Bojar would like to acknowledge also the grant no. 19-26934X (NEUREM3) of the Czech Science Foundation.
The organizers of the similar languages task want to thank Magdalena Biesialska for her support in the compilation of the Czech-Polish data set as well as her valuable support as a Polish native speaker. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, through the travel grant José Castillejo, CAS18/00223, the contract TEC2015-69266-P (MINECO/FEDER, EU) and the contract PCIN-2017-079 (AEI/MINECO). The authors also want to thank the German Research Foundation (DFG) under grant number GE 2819/2-1 (project MMPE) and the German Federal Ministry of Education and Research (BMBF) under funding code 01IW17001 (project Deeplee).

A Differences in Human Scores
Tables 32-49 show differences in average standardized human scores for all pairs of competing systems for each language pair. The numbers in each of the tables' cells indicate the difference in average standardized human scores for the system in that column and the system in that row. Because there were so many systems and data conditions, the significance of each pairwise comparison needs to be quantified. We applied the Wilcoxon rank-sum test to measure the likelihood that such differences could occur simply by chance. In the following tables, ∗ indicates statistical significance at p < 0.05, † indicates statistical significance at p < 0.01, and ‡ indicates statistical significance at p < 0.001, according to the Wilcoxon rank-sum test.
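For large samples such as these, the rank-sum test is typically computed with a normal approximation. A stdlib-only sketch (SciPy's `scipy.stats.ranksums` does this more carefully; the version here assigns average ranks to ties but omits the tie correction to the variance):

```python
import math

def ranksum_p(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, "x") for v in xs] + [(v, "y") for v in ys])
    vals = [v for v, _ in pooled]
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and vals[j] == vals[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == "x")
        i = j
    mu = n1 * (n1 + n2 + 1) / 2                      # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value

# Identical score samples: no evidence of a difference.
print(ranksum_p([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
```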