Findings of the 2018 Conference on Machine Translation (WMT18)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any of 7 language pairs in both directions, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. This year, we also opened up the task to additional test suites to probe specific aspects of translation.

This year we conducted several official tasks. We report in this paper on the news translation task. Additional shared tasks are described in separate papers in these proceedings:
• biomedical translation (Neves et al., 2018),
• multimodal machine translation (Barrault et al., 2018),
• metrics (Ma et al., 2018),
• quality estimation,
• automatic post-editing (Chatterjee et al., 2018), and
• parallel corpus filtering (Koehn et al., 2018b).
In the news translation task (Section 2), participants were asked to translate a shared test set, optionally restricting themselves to the provided training data ("constrained" condition). We held 14 translation tasks this year, between English and each of Chinese, Czech, Estonian, German, Finnish, Russian, and Turkish. The Estonian-English language pair was new this year. Like Latvian, which we covered in 2017, Estonian represents a lower-resourced data condition for a challenging language pair. System outputs for each task were evaluated both automatically and manually.
This year the news translation task had two additional sub-tracks: multilingual and unsupervised MT. Both sub-tracks were included in the general list of news translation submissions and are described in more detail in the corresponding subsections of Section 2.
The human evaluation (Section 3) involves asking human judges to score sentences output by anonymized systems. We obtained large numbers of assessments from researchers who contributed evaluations proportional to the number of tasks they entered. In addition, we used Mechanical Turk to collect further evaluations. This year, the official manual evaluation metric is again based on judgments of adequacy on a 100-point scale, a method we explored in the previous years with convincing results in terms of the trade-off between annotation effort and reliable distinctions between systems.
The primary objectives of WMT are to evaluate the state of the art in machine translation, to disseminate common test sets and public training data with published performance numbers, and to refine evaluation and estimation methodologies for machine translation. As before, all of the data, translations, and collected human judgments are publicly available. We hope these datasets serve as a valuable resource for research into data-driven machine translation, automatic evaluation, or prediction of translation quality. News translations are also available for interactive visualization and comparison of differences between systems at http://wmt.ufal.cz/ using MT-ComparEval.
In order to gain further insight into the performance of individual MT systems, we organized a call for dedicated "test suites", each focussing on some particular aspect of translation quality. A brief overview of the test suites is provided in Section 4.

News Translation Task
The recurring WMT task examines translation between English and other languages in the news domain. As in the previous year, we include Chinese, Czech, German, Finnish, Russian, and Turkish. A new language this year is Estonian.
We created a test set for each language pair by translating newspaper articles, and we provided training data.

Test Data
The test data for this year's task was selected from online sources, as in previous years. We took about 1500 English sentences and translated them into the other languages, and took an additional 1500 sentences in each of the other languages and translated them into English. This gave us test sets of about 3000 sentences for our English-X language pairs, which were either originally written in English and translated into X, or vice versa. The composition of the test documents is shown in Table 1, while the size of the test sets in terms of sentence pairs and words is given in Figure 2.
The stories were translated by professional translators funded by the EU Horizon 2020 projects CRACKER and QT21 (German, Czech), by Yandex (http://www.yandex.com/), a Russian search engine company (Turkish, Russian), and by BAULT, a research community on building and using language technology funded by the University of Helsinki (Finnish), and the University of Tartu (Estonian). The Chinese-English task was sponsored by Nanjing University, Xiamen University, the Institutes of Computing Technology and of Automation of the Chinese Academy of Sciences, Northeastern University (China) and Datum Data Co., Ltd. All of the translations were done directly, and not via an intermediate language. Since Estonian-English was run for the first time, both the test and development set had to be translated; each comprised 2000 sentences (4000 in total).

In particular, the Czech and German test sets were translated to/from English by the professional level of service of Translated.net, preserving 1-1 segment translation and aiming for literal translation where possible. Each language combination involved two different translators: the first translator produced the translation, and the second translator evaluated a representative part of the work to give a score to the first translator. All translators translate towards their mother tongue only and need to provide proof of their education or professional experience, or to take a test; they are continuously evaluated to understand how they perform over the long term. The domain knowledge of the translators is ensured by matching translators and documents using T-Rank (http://www.translated.net/en/T-Rank).

Training Data
As in past years we provided parallel corpora to train translation models, monolingual corpora to train language models, and development sets to tune system parameters. Some training corpora were identical to last year's (Europarl, Common Crawl, SETIMES2, Russian-English parallel data provided by Yandex, Wikipedia Headlines provided by CMU) and some were updated (United Nations, CzEng, News Commentary v13, monolingual news data). The new Estonian-English language pair had parallel data from Europarl, EU Press Releases and ParaCrawl, as well as a monolingual corpus of mostly news articles called BigEst. Some statistics about the training materials are given in Figures 1 and 2.

Multilingual and Unsupervised Sub-tracks
This year the news translation task included two sub-tracks: one on multilingual translation and the other on unsupervised MT. The multilingual sub-track covered any submission that used data (monolingual or parallel) from a third language to help translate the language pair in question: for example, using English-Finnish data to improve English-Estonian translation. All entries to this sub-track had to use only the WMT-provided data sets, and thus had to be constrained.
In the unsupervised MT sub-track the participants were further constrained to using only the monolingual training data from WMT; this additionally excluded the monolingual corpora that are largely parallel (the monolingual parts of Europarl and News Commentary). The aim of this task was to see how far one can get in terms of translation quality without any parallel data used for training. While there was no restriction in terms of language pairs, three language pairs were "verbally endorsed": English to/from Turkish, Estonian and German. The motivation behind the choice of languages was to test the effect of multilingual and unsupervised methods on low-resource language pairs (Turkish-English, Estonian-English) and to contrast the results with a resource-rich pair (German-English).
Submissions to both sub-tracks were pooled with the main translation track and evaluated in the same way, without separation.

Submitted Systems
We received 103 submissions from 32 institutions. The participating institutions, organized into 35 teams, are listed in Table 2 and detailed in the rest of this section. Not every system appeared in all translation tasks. We also included 39 online MT systems (originating from 5 services), which we anonymized as ONLINE-A, B, F, G, Y. For presentation of the results, systems are treated as either constrained or unconstrained, depending on whether their models were trained only on the provided data. Since we do not know how they were built, the online systems are treated as unconstrained during the automatic and human evaluations.

AALTO (Grönroos et al., 2018)
Aalto participated in the constrained condition of the multilingual sub-track, with a single system trained to translate from English to both Finnish and Estonian. The system is based on the Transformer (Vaswani et al., 2017) implementation in OpenNMT-py (Klein et al., 2017). It is trained on filtered parallel and filtered back-translated monolingual data. The main contribution is a novel cross-lingual Morfessor (Virpioja et al., 2013) segmentation using cognates extracted from the parallel data, with the aim of improving the consistency of the morphological segmentation. Aalto decodes using an ensemble of 3 (et) or 8 (fi) models.

AFRL-SYSCOMB (Gwinnup et al., 2018)
AFRL-SYSCOMB is a system-combination entry consisting of three inputs. The first is an OpenNMT system trained on the provided parallel data except ParaCrawl and the back-translated corpus used in the AFRL WMT17 system (Gwinnup et al., 2017); this system uses a standard RNN architecture and was fine-tuned with the other available news task test sets. The second is a Marian (Junczys-Dowmunt et al., 2018) system ensembling 5 Univ. Edinburgh "bi-deep" and 6 Transformer models, all trained on the provided WMT18 bitexts, including ParaCrawl. Some models employed pretrained word embeddings built on BPE'd corpora (Sennrich et al., 2016b), and a Marian Transformer model performed right-to-left rescoring for this system. The third system is trained with Moses, using the same data as the Marian system; hierarchical reordering and an Operation Sequence Model were employed, and the 5-gram English language model was trained with KenLM (Heafield, 2011) on the same corpus as the AFRL WMT15 system, with the same BPE used in the Marian systems. Lastly, RWTH Jane's system combination (Freitag et al., 2014) was applied, yielding a gain of approximately +0.5 BLEU.

ALIBABA (Deng et al., 2018)
Alibaba systems are based on the Transformer model architecture, integrating the most recent features from academic research, such as the weighted Transformer and the Transformer with relative position attention. The systems also employ, at industrial scale, most techniques that have proven effective during past WMT years, such as BPE-based subwords, back-translation, fine-tuning on selected data, model ensembling and reranking. For some morphologically rich languages, linguistic knowledge is also incorporated into the neural network.

CUNI-KOCMI (Kocmi et al., 2018)
The CUNI-KOCMI submission focuses on low-resource neural machine translation (NMT). The final submission uses transfer learning: the model is first pretrained on a related high-resource language (here Finnish) and then trained on the child low-resource language (Estonian) without any change in hyperparameters. Averaging and back-translation are also experimented with.

Table 2: Participants in the shared translation task. Not all teams participated in all language pairs. The translations from the online systems were not submitted by their respective companies but were obtained by us, and are therefore anonymized in a fashion consistent with previous years of the workshop. " " indicates invited participation with a late submission, where the team is not considered a regular participant.
CUNI-TRANSFORMER (Popel, 2018)
CUNI-TRANSFORMER is the Transformer model trained according to Popel and Bojar (2018), plus a novel concat-regime back-translation with checkpoint averaging, tuned separately for CZ-domain and non-CZ-domain articles, possibly also handling translation-direction ("translationese") issues. For cs→en, coreference preprocessing was also used, adding the female-gender pronoun where it was pro-dropped in Czech, referred to a human, and could not be inferred from the given sentence.

FACEBOOK-FAIR (Edunov et al., 2018)
FACEBOOK-FAIR is an ensemble of six self-attentional models with back-translation data according to Edunov et al. (2018). Synthetic sources are sampled instead of generated with beam search, and the real bitext is oversampled at a rate of 16, i.e., each bitext sentence pair is sampled 16 times more often per epoch than the back-translated data. At inference time, translations which are copies of the source are filtered out and replaced with the output of a very small model trained only on News Commentary. The system FACEBOOK-FAIR was submitted anonymously as ONLINE-Z, and approval for disclosing the authors' identity was granted only after the final results had become available. Due to this non-standard way of submission, the system is not considered a regular participant, but an invited/late submission, and is marked with " " throughout the paper.

GTCOM-PRIMARY (Bei et al., 2018)
GTCOM-PRIMARY is based on the Transformer "base" model architecture using the Marian toolkit, and it also applies methods that have proven effective in NMT systems, such as BPE, back-translation, right-to-left reranking and ensemble decoding. In this experiment, right-to-left reranking did not help. Another focus is data filtering of both parallel and monolingual data through rules, a translation model, and a language model; the language model is based on the Transformer architecture as well. The final system is trained with four different seeds and mixed data.
HY-AH (Raganato et al., 2018; Hurskainen and Tiedemann, 2017) is a rule-based machine translation system, relying on a rule-based dependency parser for English, a hand-crafted translation lexicon (based on dictionary data extracted from parallel corpora by word alignment), various types of transfer rules, and a morphological generator for Finnish.
HY-NMT (Raganato et al., 2018) submissions are based on the Transformer "base" model, trained with all the parallel data provided by the shared task plus back-translations, with a shared vocabulary between source and target language and a domain label for each source sentence. For the multilingual sub-track, synthetic data for English→Estonian and Estonian→English was also used. Ultimately, a single model for all language pairs was trained and then fine-tuned for each language pair. Sentence selection from the new ParaCrawl improved the effectiveness of that corpus by 0.5 BLEU points, with an overall increase of 0.8 BLEU compared to the baseline of not using ParaCrawl.

LI-MUZE
LI-MUZE is an ensemble of 4 averaged Transformer models with one right-to-left and one target-to-source averaged Transformer model. The configuration of all models is the same as the Transformer "big" model, trained on the official training data plus 4.5M back-translated sentences from the 2016 and 2017 monolingual news data. The English vocabulary size is 36K BPE subwords. Chinese is tokenized into characters, with a vocabulary size of 10K.

LMU-NMT (Huck et al., 2018)
For the WMT18 news translation shared task, LMU Munich (Huck et al., 2018) has trained basic shallow attentional encoder-decoder systems (Bahdanau et al., 2014) with the Nematus toolkit (Sennrich et al., 2017), like last year (Huck et al., 2017a). LMU has participated with these NMT systems for the English-German language pair in both translation directions. The training data is a concatenation of Europarl, News Commentary, Common Crawl, and some synthetic data in the form of backtranslated monolingual news texts. The 2017 monolingual News Crawl is not employed, nor are the parallel Rapid and ParaCrawl corpora. The German data is preprocessed with a linguistically informed word segmentation technique (Huck et al., 2017b). By using a linguistically more sound word segmentation, advantages over plain BPE segmentation are expected in three important aspects: vocabulary reduction, reduction of data sparsity, and open vocabulary translation. The NMT system can learn linguistic word formation processes from the segmented data. In the English→German translation direction, LMU furthermore conducted fine-tuning towards the domain of news articles (Huck et al., 2017a) and reranked the n-best list with a right-to-left neural model (Liu et al., 2016) which is trained for reverse word order (Freitag et al., 2013).
LMU-UNSUP (Stojanovski et al., 2018)
For the unsupervised track of the WMT18 news translation task, LMU Munich submitted the LMU-UNSUP system (Stojanovski et al., 2018), a neural translation model trained without any access to parallel data. The model is trained with ∼4M German and English sentences each, sampled from News Crawl articles from 2007 to 2017. Bilingual word embeddings trained in an unsupervised manner (Conneau et al., 2017) were used to translate the monolingual data word by word, and this synthetically created parallel data is used in training as well. The same model is used for both German→English and English→German translation. The model is based on Lample et al. (2018) and uses denoising and on-the-fly back-translation. Additionally, the model uses the word-by-word translated data in the initial training stages to jump-start the training, and disables the denoising component as the last training step for further improvements. The NMT embeddings are initialized with embeddings obtained from fastText trained jointly on German and English monolingual BPE-level data.

MICROSOFT-MARIAN (Junczys-Dowmunt, 2018)
MICROSOFT-MARIAN is the Transformer-big model implemented in Marian with an updated version of Edinburgh's WMT2017 training scheme, following current common practices: truecasing and tokenization using Moses scripts, BPE subwords, back-translation (using a shallow model), ensembling of four left-to-right deep models, and reranking of the 12-best list with an ensemble of four right-to-left models. The novelties are primarily new data filtering (dual conditional cross-entropy filtering) and sentence weighting methods.

MLLP-UPV (Iranzo-Sánchez et al., 2018)
MLLP-UPV is an ensemble of Transformer architecture-based neural machine translation systems. To train the system under "constrained" conditions, the provided parallel data was filtered with a scoring technique using character-based language models, and was augmented with synthetic source sentences generated from the provided monolingual corpora. The ensemble consists of 4 independent training runs of the Transformer "base" model, trained with 10M filtered sentences (including data from ParaCrawl) and 20M back-translated sentences from News Crawl 2017.

MMT-PRODUCTION
MMT-PRODUCTION is the machine translation system offered by MMT s.r.l. (www.modernmt.eu) as of July 2018. It is a Transformer-based neural MT system trained on public and proprietary data containing about 100M sentence pairs and about 1.5G English words. It uses a single model of "transformer-big" size and single-pass decoding; texts are processed using internal tools.

NEUROTOLGE.EE (Tars and Fishel, 2018)
NEUROTOLGE.EE is a multi-domain NMT system that treats text domain as a language and applies the zero-shot multilingual approach to multiple domains in the training corpus. For WMT18, text domains were replaced with unsupervised clustering into 16 clusters using FastText sentence embeddings. During translation, the input segment is classified using its sentence embedding and translated according to the corresponding cluster/domain.

NICT (Marie et al., 2018)
NICT NMT systems were trained with the Transformer architecture using the provided parallel data enlarged with a large quantity of backtranslated monolingual data generated with a new incremental training framework. The primary submissions to the task are the result of a simple combination between NICT SMT and NMT systems.
NIUTRANS (Wang et al., 2018b)
NIUTRANS baseline systems are based on the Transformer architecture with the "base" model, equipped with checkpoint averaging and back-translation techniques. NIUTRANS further improves translation performance by 2.28-3.83 BLEU points through four types of model variations (larger inner hidden size in the FFN, ReLU and attention dropout, the Swish activation function, relative positional representation), diverse ensemble decoding (ensemble decoding with up to 15 models, generated by different strategies), reranking (up to 14 features), and post-processing (aimed at consistent translation of proper nouns, especially English literals in Chinese sentences).

NJUNMT
The NJUNMT-PRIVATE system is most likely the system developed by the Natural Language Processing Group of Nanjing University based on the high-level API of TensorFlow, https://github.com/zhaocq-nlp/NJUNMT-tf. Further details on training are not available.

NTT (Morishita et al., 2018)
NTT combines the Transformer "big" model, a corpus cleaning technique for the provided and synthetic parallel corpora, and right-to-left n-best re-ranking techniques. Through their experiments, NTT found filtering of noisy training sentences and right-to-left re-ranking to be the keys to better accuracy.
PARFDA (Biçici, 2018)
PARFDA selects a subset of the training and LM data to build task-specific SMT models. PARFDA uses phrase-based Moses and all constrained available resources provided by WMT18. The datasets are available at https://github.com/bicici/parfdaWMT2018.
PROMT-HYB-MARIAN is an ensemble of 5 Transformer models trained on WMT data and in-house news data.
PROMT-HYB-OPENNMT is a hybrid system based on the PROMT rule-based engine and an NMT post-editing (PE) engine. The NMT PE component is a sequence-to-sequence model with attention and a deep biRNN encoder trained with the OpenNMT toolkit.
PROMT-RULE-BASED is a rule-based system, without any specific training or tuning. Fine-tuning on old test sets (newstest2008-newstest2014) was used.

RWTH
The RWTH English→Turkish system is based on a 6-layer encoder-decoder Transformer architecture. Since the task is low-resource, dropout with a rate of 0.3 was applied to all applicable layers. Even though the two languages are not closely related, joint BPE and weight tying helped considerably as part of the regularization. For the final submission, RWTH used training data augmented with 1M back-translated sentences and ensembled four models with different random seeds.

RWTH-UNSUPER (Graça et al., 2018)
The RWTH-UNSUPER unsupervised NMT system is built on recent work by Lample et al. (2018) and Artetxe et al. (2018). The best-performing RWTH-UNSUPER systems follow the batch optimization strategy and are initialized with cross-lingual embeddings. Furthermore, RWTH-UNSUPER found that sharing a vocabulary performs better than having separate ones. Freezing embeddings hurts performance; it was found best to initialize the embeddings with pre-trained ones and train them as usual.

TALP-UPC (Casas et al., 2018)
TALP-UPC is the Transformer "base" model trained with the Tensor2Tensor implementation (Vaswani et al., 2018) and a wordpiece vocabulary. The training corpus is multilingual (concatenating Finnish-English and Estonian-English) and includes ParaCrawl, with noisy sentences cleaned up via langdetect.

TENCENT (Wang et al., 2018a)
TENCENT-ENSEMBLE (called TenTrans) is an improved NMT system built on the self-attention-based Transformer. In addition to the basic Transformer training settings, TENCENT-ENSEMBLE uses multi-model fusion techniques, reranking with multiple features, different segmentation models, and joint learning. Additionally, data selection strategies were adopted to fine-tune the trained system, achieving a stable performance improvement.
An additional system paper (Hu et al., 2018) describes a non-primary submission.

TILDE
TILDE-C-NMT are constrained English-Estonian and Estonian-English NMT systems deployed as ensembles of averaged factored-data Transformer models. The models were trained using filtered parallel data and back-translated data in a 1-to-1 proportion. The parallel data were supplemented with synthetic data (generated from the same parallel data) containing unknown-token identifiers, in order to obtain models that are more robust to unknown phenomena. TILDE-C-NMT-COMB is a constrained Estonian-English NMT system that is a system combination of multiple constrained factored-data NMT systems.
TILDE-C-NMT-2BT systems were trained using Sockeye and Transformer models. Before training the initial systems, the parallel data were cleaned using the parallel-corpora-tools; before back-translation, the monolingual data were also filtered, and after back-translation the resulting synthetic corpora were filtered again. Intermediate systems were trained with the first batch of parallel+synthetic data. The back-translation and filtering process was then performed a second time with additional monolingual data to train the final systems on the parallel data and two sets of synthetic data. TILDE-NC-NMT are unconstrained English→Estonian and Estonian→English NMT systems deployed as averaged Transformer models. These models were also trained using back-translated data, similarly to the constrained systems; however, given its relatively large size, the data was not factored.

UBIQUS
UBIQUS-NMT is a Transformer "base" model trained and run with the OpenNMT implementation. It uses back-translation according to Sennrich et al. (2016a) and does not include ParaCrawl. Subwords are generated with SentencePiece (https://github.com/google/sentencepiece).

UCAM (Stahlberg et al., 2018)
UCAM is a generalization of previous work (de Gispert et al., 2017) to multiple architectures. It is a system combination of two Transformer-like models, a recurrent model, a convolutional model, and a phrase-based SMT system. The output is probably dominated by the Transformer, and to some extent by the SMT system.

UEDIN (Haddow et al., 2018)
For Estonian↔English and Finnish↔English, the UEDIN systems are an ensemble of four left-to-right systems, reranked with four right-to-left systems, built using Marian. Each ensemble consists of two Transformers and two deep RNNs. The RNNs use the UEDIN multi-head / multi-hop variant. All available parallel data were used, plus back-translated data from 2017 (for into-English) and 2014-2017 (for out-of-English). The natural parallel data was generally over-sampled to give an equal mix of parallel and synthetic data. For English↔Estonian, UEDIN selected 30% of ParaCrawl based on translation model perplexity for a model built on the rest of the data.
The UEDIN systems for other language pairs use an ensemble of four deep RNN left-to-right systems, reranked with 4 deep RNN right-to-left systems. The RNN models use the UEDIN multi-head / multi-hop attention variant. All the provided parallel data (including ParaCrawl) were used, applying langid filtering to remove some incorrect sentence pairs. Synthetic data were also used, created by back-translating the 2017 English news crawl, and the 2016 and 2017 Czech news crawls. For Czech→English, the synthetic data was oversampled 2x.
UMD (Xu and Carpuat, 2018)
The best UMD system is an ensemble of three 6-layer left-to-right Transformer models reranked with target-to-source and left-to-right models. Each Transformer model is trained with a 2:1 mixture of parallel and back-translated monolingual data. For the parallel data, duplicates are removed and "bad" sentence pairs are filtered out. Monolingual data is sub-sampled from the 2017 news data (English) and the 2011 news data (Chinese). Subwords (BPE) are used for both English and Chinese sentences.

UNISOUND
The UNISOUND systems are probably developed by the Unisound company (www.unisound.com). No further information is available.

UNSUPTARTU (Del et al., 2018)
UNSUPTARTU is an unsupervised MT system using n-gram embedding cross-lingual mapping to create a phrase table. An RNN LM is used in decoding.

Submission Summary
Next we summarize the general trends in the systems submitted to the translation task and its sub-tracks.
The large majority of the submissions (29 systems) are based on the Transformer approach (Vaswani et al., 2017), with a varying number of encoder/decoder layers and other details. Four more systems use the basic attentional encoder-decoder approach (Bahdanau et al., 2014), three are phrase-based SMT systems and two are rule-based. Several submissions use ensembles of components with different approaches.
Most systems report using back-translated data, some of them filtering the synthetic data and some using a fixed sampling rate between the real and synthetic data.
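A fixed sampling rate between real and synthetic data is usually realized by repeating the smaller corpus before shuffling. Below is a minimal, hypothetical sketch of such mixing in Python (the repeat factor and data are illustrative; actual toolkits typically handle this inside their data loaders):

```python
import random

def mix_training_data(parallel, back_translated, bitext_repeats=1, seed=123):
    """Concatenate real and back-translated sentence pairs for one epoch.

    `parallel` and `back_translated` are lists of (source, target) pairs.
    Repeating the real bitext `bitext_repeats` times makes it appear that
    many times more often; a value of 16 mirrors the oversampling rate
    reported by one WMT18 submission.
    """
    mixed = parallel * bitext_repeats + back_translated
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy usage (the sentence pairs are purely illustrative).
real = [("Das Haus ist alt .", "The house is old .")]
synthetic = [("Ein Beispiel .", "An example .")] * 4
epoch = mix_training_data(real, synthetic, bitext_repeats=2)
print(len(epoch))  # 2 copies of the real pair + 4 synthetic pairs = 6
```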
As far as subwords go, two widely used options are byte-pair encoding (Sennrich et al., 2016b) and sentencepiece (Kudo and Richardson, 2018). Some submissions use linguistically motivated segmentation, especially for the highly agglutinative Finnish.
There were 3 submissions to the multilingual sub-track, all three applying multilingual transfer learning and training systems to translate from English into Finnish and Estonian simultaneously.
Unsupervised MT also had 3 submissions, of which two applied their systems to German to/from English and the third to Estonian-to-English translation. The two German↔English systems use the neural MT method of Lample et al. (2018) with small modifications, while the Estonian-English system used the phrase-based statistical unsupervised approach from the same article.

Human Evaluation
A human evaluation campaign is run each year to assess translation quality and to determine the final ranking of systems taking part in the competition. This section describes how the preparation of evaluation data, the collection of human assessments, and the computation of the official results of the shared task were carried out this year.
Work on evaluation over the past few years has provided fresh insight into ways to collect direct assessments (DA) of machine translation quality (Graham et al., 2013, 2014, 2016), and two years ago the evaluation campaign included parallel assessment of a subset of News task language pairs evaluated with relative ranking (RR) and DA. DA has some clear advantages over RR, namely the evaluation of absolute translation quality and the ability to carry out evaluations through quality-controlled crowd-sourcing. As established in 2016 (Bojar et al., 2016a), DA results (via crowd-sourcing) and RR results (produced by researchers) correlate strongly, with Pearson correlation ranging from 0.920 to 0.997 across several source languages into English and at 0.975 for English-to-Russian (the only pair evaluated out-of-English). Last year, we thus employed DA for evaluation of systems taking part in the news task, and we do so again this year. Where possible, we collect DA judgments via the crowd-sourcing platform Amazon Mechanical Turk, and as in previous years we ask participating teams to provide manual evaluation of system outputs via Appraise. Researcher involvement was needed particularly for translations into Czech, German, Estonian, Finnish and Turkish.
Human assessors are asked to rate a given translation by how adequately it expresses the meaning of the corresponding reference translation (i.e. no bilingual speakers are needed) on an analogue scale, which corresponds to an underlying absolute 0-100 rating scale. Since DA involves evaluation of a single translation per screen, this allows the sentence length restriction usually applied during manual evaluation to be removed for both researchers and crowd-sourced workers (with RR, the maximum sentence length was 30 in WMT16). Figure 3 shows one DA screen as completed by researchers on Appraise, while Figure 4 provides a screenshot of DA as shown to crowd-sourced workers on Amazon's Mechanical Turk.
The annotation is organized into "HITs" (following Mechanical Turk's term "human intelligence task"), each containing 100 such screens and requiring about half an hour to finish. Appraise users were allowed to pause their annotation at any time; the Amazon interface did not allow any pauses. More details on the composition of HITs are given in Section 3.3 below.

Evaluation Campaign Overview
In terms of the News translation task manual evaluation, a total of 584 individual researcher accounts and 915 turker accounts were involved (these numbers do not include the 1,533 workers on Mechanical Turk and 7 on Appraise who did not pass quality control). Researchers in the manual evaluation came from 33 different research groups and contributed judgments of 118,705 translations, while 225,900 translation assessment scores were submitted in total by the crowd (these numbers include quality control items for workers who passed quality control, but omit the additional 347,700 assessments collected on Mechanical Turk where a worker did not pass quality control and the equivalent 1,466 judgments for the small number of Appraise workers who did not meet the quality control threshold; a 40% pass rate for quality control is typical of DA evaluations on Mechanical Turk).

Figure 4: The annotator is presented with a reference translation and a single system output randomly selected from competing systems (anonymized), and is asked to rate the translation on a sliding scale.

Under ordinary circumstances, each assessed translation would correspond to a single individually scored segment. However, since distinct systems can produce the same output for a particular input sentence, we are often able to take advantage of this and use a single assessment for multiple systems. Similar to last year's evaluation, we only combine human assessments in this way if the string of text belonging to multiple systems is exactly identical. For example, even small differences in punctuation disqualify combination of similar system outputs, due to a general lack of evidence about what kinds of minor differences may or may not impact human evaluation. Table 3 shows the numbers of segments for which distinct MT systems participating in the News Translation Task produced identical outputs. The biggest saving in terms of exact duplicate translations produced by multiple systems was for German to English, where combining identical outputs before human evaluation saved 17.4% of resources.
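The saving works by keying annotation items on the exact output string, so that one judgment can later be copied back ("de-collapsed") to every system that produced it. A minimal sketch of this grouping, with illustrative data structures rather than the actual WMT tooling:

```python
from collections import defaultdict

def collapse_identical_outputs(outputs):
    """Share one annotation item among systems with exactly identical output.

    `outputs` maps (segment_id, system) -> translation string. Returns a list
    of annotation items, each remembering every system that produced that
    exact string, so one human score can later be copied to all of them.
    """
    groups = defaultdict(list)
    for (segment_id, system), translation in outputs.items():
        groups[(segment_id, translation)].append(system)
    return [
        {"segment": segment_id, "translation": text, "systems": sorted(systems)}
        for (segment_id, text), systems in groups.items()
    ]

outputs = {
    (7, "sysA"): "The cat sat on the mat.",
    (7, "sysB"): "The cat sat on the mat.",  # identical string: shared assessment
    (7, "sysC"): "The cat sat on the mat",   # differs only in punctuation: kept separate
}
print(len(collapse_identical_outputs(outputs)))  # 2 items instead of 3
```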

Data Collection
System rankings are produced from a large set of human assessments of translations, each of which indicates the absolute quality of the output of a system. Annotations are collected in an evaluation campaign that enlists the help of participants in the shared task. Each team is asked to contribute 8 hours of annotation time, which we estimated at sixteen 100-translation HITs per primary system submitted. We continue to use the open-source Appraise tool (Federmann, 2012) for our data collection, in addition to Amazon Mechanical Turk. Table 4 shows the total numbers of human assessments collected in WMT18 that contribute to the final scores for systems. The effort that goes into the manual evaluation campaign each year is impressive, and we are grateful to all participating individuals and teams. We believe that human annotation provides the best decision basis for evaluation of machine translation output, and it is great to see continued contributions on this large scale.

Crowd Quality Control
This year, two distinct HIT structures were used in the overall evaluation campaign: the standard DA set-up was employed for Mechanical Turk and a portion of the Appraise evaluation, while an additional HIT structure was used for the remaining part of the Appraise evaluation. Below we first describe the standard DA HIT structure and quality control mechanism before describing the additional version used for part of the Appraise evaluation. In both set-ups, translations are arranged into 100-translation HITs to provide control over the assignment and positioning of quality control items shown to human annotators.

Standard DA HIT Structure
In the standard DA HIT structure, three kinds of quality control translation pairs are employed, as described in Table 5: we repeat pairs (expecting a similar judgment), damage MT outputs (expecting significantly worse scores), and use references instead of MT outputs (expecting high scores).

Table 5: Quality control pairs in the standard DA HIT structure. Repeat pairs: original system output (10) and an exact repeat of it (10); bad reference pairs: original system output (10) and a degraded version of it (10); good reference pairs: original system output (10) and its corresponding reference translation (10).
In total, 60 items in a 100-translation HIT serve in quality control checks, but 40 of those are regular judgments of MT system outputs (we exclude assessments of bad references and ordinary reference translations when calculating final scores). The effort wasted for the sake of quality control is thus 20%.
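Putting these proportions together, a 100-item HIT can be drawn as 40 plain outputs plus 10 originals each paired with an exact repeat, a degraded copy, or a reference. The following sketch illustrates that composition (the item layout and helper names are assumptions, not the actual campaign scripts):

```python
import random

def build_standard_da_hit(candidates, make_bad_ref, seed=0):
    """Assemble one 100-item HIT following the standard DA structure.

    `candidates` is a list of dicts with "translation" and "reference" keys
    (ideally sampled evenly across participating systems, as discussed below);
    `make_bad_ref` degrades a translation (see the bad-reference construction
    later in this section). 40 items are plain outputs, and 3 x 10 further
    originals are paired with a repeat, a degraded copy, or the reference.
    """
    rng = random.Random(seed)
    sampled = rng.sample(candidates, 70)  # 40 plain + 30 control originals
    plain, rep, bad, good = sampled[:40], sampled[40:50], sampled[50:60], sampled[60:70]

    items = [dict(c, kind="regular") for c in plain]
    for c in rep:
        items += [dict(c, kind="regular"), dict(c, kind="repeat")]
    for c in bad:
        items += [dict(c, kind="regular"),
                  dict(c, translation=make_bad_ref(c["translation"]), kind="bad_ref")]
    for c in good:
        items += [dict(c, kind="regular"),
                  dict(c, translation=c["reference"], kind="good_ref")]

    rng.shuffle(items)
    assert len(items) == 100
    return items
```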
Also in the standard DA HIT structure, within each 100-translation HIT the same proportion of translations is included from each participating system for that language pair. This ensures the final dataset for a given language pair contains roughly equivalent numbers of assessments for each participating system. This serves three purposes in making the evaluation fair. Firstly, for the point estimates used to rank systems to be reliable, a sufficient sample size is needed, and the most efficient way to reach a sufficient sample size for all systems is to keep the total numbers of judgments roughly equal as more and more judgments are collected. Secondly, it helps to make the evaluation fair because each system will suffer or benefit equally from an overly lenient or harsh human judge. Thirdly, despite DA judgments being absolute, it is known that judges "calibrate" the way they use the scale depending on the general observed translation quality; with each HIT including all participating systems, this effect is averaged out. Furthermore, apart from quality control items, HITs are constructed using translations sampled from the entire set of outputs for a given language pair.

Table 4: Amount of data collected in the WMT18 manual evaluation campaign (assessments after removal of quality control items and "de-collapsing" multi-system outputs). The final seven rows report summary information from previous years of the workshop.

Alternate DA HIT Structure
The alternate DA HIT structure employed by Appraise this year for a subset of researcher HITs is shown in Table 6. This set-up reduces the number of quality control items in a HIT and is therefore more efficient (12% overhead) by omitting repeat pairs and good reference pairs. This comes at the cost of a reduced ability to analyze the quality of data provided by human annotators.
In addition, for this set-up an additional constraint (not originally applied in standard DA) was imposed: as much as possible, within a 100-translation HIT the output of all participating systems was included for each source input. This constraint has the advantage of producing assessments from the same human assessor for translations of the same source input, but it is not ideal in terms of the original aim of DA (to produce scores for translations that are as absolute as possible, as opposed to relative ones), because it positions assessments of competing translations in close proximity within a HIT, and judges may attempt to remember their judgment for a different candidate translation of a given input sentence.

Construction of Bad References
In all set-ups employed in the evaluation campaign, and as in previous years, bad reference pairs were created automatically by replacing a phrase within a given translation with a phrase of the same length randomly selected from n-grams extracted from the full test set of reference translations belonging to that language pair. This means that the replacement phrase will itself comprise a fluent sequence of words (making it difficult to tell that the sentence is low quality without reading the entire sentence), while at the same time its presence is highly likely to sufficiently change the meaning of the translation so that it should receive a lower score.
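A minimal sketch of this construction, assuming whitespace tokenization (the span-length heuristic below is an assumption, since the exact scheme is not specified above):

```python
import random

def make_bad_reference(translation, reference_pool, rng=None):
    """Degrade a translation by overwriting a span with a random fluent n-gram.

    The replacement n-gram is drawn from the full set of reference translations
    of the language pair and has the same length as the removed span, so it
    reads fluently in isolation while breaking the sentence's meaning.
    """
    rng = rng or random.Random()
    words = translation.split()
    # Span length roughly proportional to sentence length (heuristic assumption).
    span_len = min(len(words), max(2, len(words) // 4))
    start = rng.randrange(len(words) - span_len + 1)

    # Collect all same-length n-grams from the reference side of the test set.
    ngrams = []
    for sentence in reference_pool:
        tokens = sentence.split()
        ngrams.extend(tokens[i:i + span_len] for i in range(len(tokens) - span_len + 1))
    replacement = rng.choice(ngrams)

    return " ".join(words[:start] + replacement + words[start + span_len:])
```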

Annotator Agreement
When an analogue scale (or a 0-100 point scale, in practice) is employed, agreement cannot be measured using the conventional Kappa coefficient, ordinarily applied to human assessment when judgments are discrete categories or preferences. Instead, to measure consistency we filter crowd-sourced human assessors by how consistently they rate translations of known distinct quality, using the bad reference pairs described previously. Quality filtering via bad reference pairs is especially important for the crowd-sourced portion of the manual evaluation. Due to the anonymous nature of crowd-sourcing, when collecting assessments of translations one is likely to encounter workers who attempt to game the service, as well as submissions of inconsistent evaluations and even robotic ones. We therefore employ DA's quality control mechanism to filter out low quality data, facilitated by the use of DA's analogue rating scale. Assessments belonging to a given crowd-sourced worker who has not demonstrated that he/she can reliably score bad reference translations significantly lower than corresponding genuine system output translations are filtered out.
A paired significance test is applied to test if degraded translations are consistently scored lower than their original counterparts and the p-value produced by this test is used as an estimate of human assessor reliability. Assessments of workers whose p-value does not fall below the conventional 0.05 threshold are omitted from the evaluation of systems, since they do not reliably score degraded translations lower than corresponding MT output translations.
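For a single assessor, this amounts to a one-sided paired test over the scores given to degraded items and their originals; a sketch using SciPy is below (the specific paired test used in the campaign is not stated above, so the choice of the Wilcoxon signed-rank test here is an assumption):

```python
from scipy.stats import wilcoxon

def assessor_is_reliable(original_scores, bad_ref_scores, alpha=0.05):
    """Filter decision for one assessor based on paired quality-control items.

    `original_scores[i]` and `bad_ref_scores[i]` are the 0-100 ratings this
    assessor gave to the i-th original MT output and to its degraded copy.
    """
    # One-sided test: are bad-reference scores systematically lower?
    _, p_value = wilcoxon(bad_ref_scores, original_scores, alternative="less")
    return p_value < alpha, p_value

keep, p = assessor_is_reliable(
    original_scores=[72, 65, 80, 58, 90, 77, 63, 84, 70, 66],
    bad_ref_scores=[40, 30, 55, 20, 61, 35, 28, 47, 33, 52],
)
print(keep, round(p, 4))
```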
This year's assessment includes the first large-scale DA evaluation where quality control items were applied to assessments of a known-reliable group, comprising the portion of researchers who completed HITs on Appraise with the original DA HIT structure. Although this group should be considered highly reliable compared to Mechanical Turk, for example, we must keep in mind that a small part of this group were in fact hired to complete assessments, and their reliability could vary more than what would be expected of volunteer researchers. Table 7 shows the number of workers in the crowd-sourced and researcher groups who met our filtering requirement by showing a significantly lower score for bad reference items compared to corresponding MT outputs, and the proportion of those who simultaneously showed no significant difference in the scores they gave to pairs of identical translations.
The main observation to be taken from Table 7 is the difference in proportions of human assessors on Mechanical Turk versus researchers who passed the quality filtering criteria for DA, by scoring degraded translations significantly lower than the original MT output counterparts, as 37% of Mechanical Turk workers were deemed reliable compared to 93% of evaluators in the researcher group. This low rate of workers passing quality filtering is in line with past DA evaluations, and the high proportion of annotators passing quality control is expected of a mostly known-reliable group. For crowd-sourced workers, consistent with past DA evaluations, Table 7 shows a substantially higher number of low quality workers encountered for evaluation of languages other than English on Mechanical Turk. For example, in the case of Russian and Chinese only a respective 22% and 10% of workers were considered reliable enough to include their assessments in the evaluation, compared to around 42% on average for English evaluations.
When we examine repeat assessments of the same translation, both filtered groups show similar levels of reliability, with 96% of filtered Mechanical Turk workers and 95% of researchers showing no significant difference in scores for repeat assessments of the same translation. The idea is that the repeated input should receive a very similar score. Assuming that annotators do not remember their previous assessment for the repeated sentence, the "Exact Rep." figure corresponds to intra-annotator agreement, and it reaches very high scores. Within the researcher group, although assessors have high levels of reliability overall, reliability in this respect varies quite a bit across languages. For example, only 75% of assessors in the researcher group completing assessments for Estonian showed no significant difference for repeat assessments of the same translation, and 87% for Turkish, both lower levels of reliability than usually encountered on Mechanical Turk, even though the researcher group is expected to be more reliable than crowd-sourced workers. However, on closer inspection, the number of human assessors who took part in the Turkish and Estonian evaluations is small, and the seemingly large difference in percentages in fact corresponds to as few as three individuals.

Producing the Human Ranking
All researcher and crowd data that passed quality control were combined to produce the overall shared task results. In order to iron out differences in the scoring strategies of distinct human assessors, human assessment scores for translations were first standardized according to each individual human assessor's overall mean and standard deviation, for both researchers and the crowd. Average standardized scores for individual segments belonging to a given system are then computed, before the final overall DA score for that system is computed as the average of its segment scores (Ave z in Table 8). Results are also reported as average scores for systems, computed in the same way but without any score standardization applied (Ave % in Table 8). Table 8 includes final DA scores for all systems participating in the WMT18 News Translation Task. Clusters are identified by grouping systems together according to which systems significantly outperform all others in lower-ranking clusters, according to the Wilcoxon rank-sum test.
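Concretely, each raw score is z-scored with its annotator's own mean and standard deviation, averaged per segment and system, and then averaged over segments. A minimal sketch of this scoring step, with an illustrative record layout:

```python
from collections import defaultdict
from statistics import mean, pstdev

def ave_z_scores(assessments):
    """Compute per-system Ave z from raw DA assessments.

    `assessments` is a list of (annotator, system, segment_id, raw_score)
    tuples, with raw scores on the 0-100 scale.
    """
    # 1) Standardize each score by its annotator's own mean and std deviation.
    per_annotator = defaultdict(list)
    for annotator, _, _, score in assessments:
        per_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in per_annotator.items()}

    # 2) Average standardized scores per (system, segment) ...
    per_segment = defaultdict(list)
    for annotator, system, segment, score in assessments:
        m, sd = stats[annotator]
        per_segment[(system, segment)].append((score - m) / sd)

    # 3) ... then average a system's segment scores into its final Ave z.
    per_system = defaultdict(list)
    for (system, _), zscores in per_segment.items():
        per_system[system].append(mean(zscores))
    return {system: mean(scores) for system, scores in per_system.items()}
```

Cluster boundaries can then be drawn by testing, for each pair of systems, whether the higher-ranked one's segment scores are significantly greater (e.g. with a rank-sum test such as scipy.stats.mannwhitneyu).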
Note that for English→German, the system FACEBOOK-FAIR is not considered a regular participant, but an invited/late submission, see Section 2.4.6.
Appendix A shows the underlying head-to-head significance test results for all pairs of systems.

Source-based Direct Assessment
A secondary bilingual manual evaluation was carried out involving an adaptation of the standard monolingual DA evaluation in which the source language input segment was used in place of the reference. Figure 5 provides a screenshot of this evaluation as implemented in Appraise; we refer to it as source-based DA. In this set-up, system outputs are evaluated by bilinguals who have access to the source language input segment only and no reference translation. The main motivation for doing so was to free up the reference translations so that they could instead be used as a "human system" in the evaluation. Structuring the evaluation as a bilingual task allows the human system to be manually evaluated under exactly the same conditions as all other systems, thus providing an estimate of human performance. The aim of source-based DA is to produce accurate rankings for systems as well as for the human system, allowing direct comparison of system and human performance, motivated by recent indications that machine translation quality may in some cases be approaching human performance (Wu et al., 2016; Hassan et al., 2018). For source-based DA, annotators will ideally be bilingual, i.e., understand the source language sufficiently well, in addition to being native speakers of the target language. However, we did not specifically stipulate in this year's evaluation that human annotators be native speakers of the target language.

Table 7: Number of unique workers: (A) those whose scores for bad reference items were significantly lower than corresponding MT outputs; (B) those of (A) whose scores also showed no significant difference for exact repeats of the same translation. Researcher denotes the portion of the evaluation carried out with the standard DA HIT structure, while Researcher alt denotes the remaining part that employed the altered HIT structure in which some quality control items are omitted.

Figure 5: The annotator is presented with a source text and a single system output randomly selected from competing systems (anonymized), and is asked to rate the translation on a sliding scale.
We ran source-based DA for the evaluation of English to Czech translation. This language pair was selected because sufficient annotators were available, helped by the fact that the set of systems participating in this language pair is small. This part of the campaign employed the alternate HIT structure described in Section 3.3.2 with reduced quality control items, i.e. it did not include exact repeats of translations or reference translations for quality control purposes.
A total of 17 annotators worked on the source-based DA pilot. 100% of annotators proved reliable, meaning that they scored bad reference items significantly lower than corresponding MT outputs (see Table 7 part (A) for corresponding reference-based DA percentages). For six candidate systems we collected 2,574 assessments, resulting in an average of 429 annotations per individual system. Enforcing segment overlap during HIT creation resulted in 423 segments for which all six candidate translations have been scored. In total, annotators worked on 438 distinct segments.

Table 9: Source-based DA results for English→Czech newstest2018, where systems are ordered by standardized mean DA score, though systems within a cluster are considered tied. Lines between systems indicate clusters according to the Wilcoxon rank-sum test at p < 0.05. Systems with gray background indicate use of resources that fall outside the constraints provided for the shared task. NEWSTEST2018-REF denotes the human system comprised of human-produced reference translations.
As can be seen from clusters in Table 9, one system, CUNI-TRANSFORMER, appears to achieve quality better than that of the human reference, NEWSTEST2018-REF, while another, UEDIN, appears to be on par with human performance, and although both systems certainly achieve very impressive results, claims of human parity should be taken with a degree of caution for several reasons which we outline below.

Considerations as to Human Parity
Before making any statements about "machine translation outperforming humans" or "machine-human parity in translation", it is important to consider the following points: • The alternate HIT structure applied in this version of DA has not been tested thoroughly enough to be certain of high reliability. For example, as described in Section 3.3.2, forcing all translations of a given source segment to be assessed by the same human judge within the same HIT could cause individual DA ratings to become highly relative, as opposed to the aim of DA ratings being as close as possible to absolute judgments of translation quality. Furthermore, an additional bias that could cause problems for this HIT structure is one associated with a past evaluation method, relative ranking. When evaluating competing translations of the same source that are situated in close proximity within a HIT, annotators may be primed by high (or low) quality outputs, resulting in overly severe (or lenient) judgments for subsequent translations of the same source segment (Bojar et al., 2011).
• While standard monolingual DA employs annotators who are only required to be speakers of a single language, source-based DA requires fluency in two languages, and it is not known to what degree varying levels of native-language fluency in at least one of the two languages may negatively impact the reliability of DA rankings in the case of bilingual annotators.
• It is likely that the quality of reference translations can vary and this could potentially impact the reliability of human performance estimates in source-based DA. Although reference-based DA assumes high quality reference translations, in the unfortunate case of problematic references, the overall rankings are unlikely to suffer to any large degree in terms of the reliability of system rankings, since all competing systems are likely to suffer equally from any lack of quality in reference translations.
However, in the adapted source-based version of DA, the effect of low quality reference translations is quite different. Firstly, since assessment involves comparison of MT outputs with the source, genuine participating systems will not suffer from the fact that reference translations are of low quality, since references are not involved in their evaluation. Human performance estimates, on the other hand, certainly will, as a drop in reference quality is highly likely to negatively impact the placement of human performance estimates in system rankings. The reliability of comparisons with human performance under source-based DA is therefore highly dependent on high quality reference translations, as employment of a low quality set of references can only lead to underestimates of human performance. Considering that the manual evaluation included several reports of ill-formed reference translations, conclusions of human parity and/or superiority relative to humans should be avoided.
• Since none of the WMT18 systems processes units larger than individual sentences and our evaluation does not include any context beyond individual segments, it is possible that the human estimate is under-rewarded for correct cross-sentential phenomena.
• The sample size employed in the source-based DA evaluation was smaller than the recommended 1,500 judgments per system.
• The test sets were originally created as follows: one half of the test data for a given language pair was translated in one language direction and the other half in the opposite direction. It is well known that the translation direction affects translation quality in training, and this could also be the case for evaluation. For instance, the human reference can be scored lower for "adding" information in cases where it was actually the original source sentence and the translator omitted that information when creating the translation which now serves as the source side of the test set.
• The formal education in linguistics or translatology of human assessors has not been taken into account: whether or not human assessors have received any formal training in translation is likely to influence their acceptance of varying levels of well-formedness in translations. For example, untrained assessors might not be as sensitive to subtle differences in verb conjugation, based on their own experience: in many real-life situations, the exact verb tense or conditional chosen in one sentence may not really impact the overall message because it can be implied from the context (and thus left free to the imagination of the annotator in our sentence-based evaluation) or from general knowledge.
In sum, while we are confident that our source-based evaluation was carried out correctly, we see it only as a pilot, with conclusions limited to this very particular evaluation setting. This pilot nevertheless clearly suggests that for well-resourced language pairs, an update of the WMT evaluation style will be needed to keep up with the progress in machine translation.

Test Suites
Arguably, both the manual and automatic evaluations carried out for the WMT News Translation Task are rather opaque. We learn (for each language pair and with a known confidence) which systems perform better on average over the sentences sampled from the news test set.
This average performance, however, does not provide any insight into which particular phenomena are handled better or worse by the systems. It is quite possible that the overall best-performing system is unreliable for long sentences, for named entities, for pronouns, or for other phenomena. Such targeted evaluations may be important for particular deployment settings and use cases, and they are definitely important for us as MT system developers, so that we can focus on these phenomena in subsequent research.
To this end, WMT18 organizers ran a "call for test suites", asking researchers to design and provide sets of sentences focusing on phenomena of their interest. Table 10 lists the participating test suites and their authors. Most of the test suites were available only for a limited number of language pairs.
Each participating test-suite team provided a set of source sentences (organized into full documents, where relevant for the particular test suite). In some cases, reference translations were also made available to the WMT18 organizers (but not to the translation system teams). We included the test suite source sentences in the source texts distributed to News Translation Task participants and collected the translations produced by their MT systems. These, in turn, were handed over to the test suite authors for evaluation. In some cases the evaluation was fully automatic; in others, an extended manual evaluation was carried out by the test suite team.
It is important to note that the test suite texts do not always adhere to the news domain. News Task systems that are heavily optimized towards this domain may thus underperform on such test suites. As long as this mismatch is taken into consideration, such an evaluation is valid and interesting, because it also tests the cross-domain applicability of WMT18 systems.

Test Suite Details
We now briefly describe each of the participating test suites. More details and the actual evaluation on each test suite are available in the respective test suite paper.

Word Sense Disambiguation (Rios et al., 2018)
The test suite by Rios et al. (2018) presents German→English MT systems with sentences containing one of 20 German words that need to be disambiguated when translating into English, e.g. Schlange, which can mean either a snake or a queue.
The results on this test suite clearly document that performance in word sense disambiguation (WSD) has improved substantially since 2016. While WSD performance generally correlates well with BLEU, there are some exceptions, e.g. the UEDIN-NMT systems from WMT16 and WMT17 or LMU-NMT, which perform slightly better in BLEU than in WSD. Another interesting observation is that the self-attentive Transformer architecture seems to have a considerable advantage over RNN-based systems.
The unsupervised systems are among the worst performing, but this is in line with their low performance as estimated by BLEU.

Linguistic Phenomena (Macketanz et al., 2018)
The test suite used by Macketanz et al. (2018) is a manually designed set of 5,000 sentences covering 106 linguistic phenomena in 14 categories.
The performance on this test suite is evaluated semi-automatically, with automatic checks accepting and rejecting some translations and a human annotator resolving the rest.
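To illustrate what such automatic checks can look like, the sketch below applies toy regular-expression rules that accept a translation, reject it, or defer it to a human annotator. The rules and sentences are invented for illustration only and do not reproduce the actual rule set or tooling of Macketanz et al. (2018).

```python
# Illustrative sketch of a rule-based accept/reject/defer check; the patterns
# and the example sentence are hypothetical, not the actual test suite rules.
import re

# Each test item pairs a source sentence with patterns that an acceptable
# translation must match ("positive") or must not match ("negative").
RULES = [
    {
        "source": "Er hätte das Buch gelesen.",
        "positive": [r"\bwould have read\b"],   # expected conditional perfect
        "negative": [r"\bhas read\b"],          # reject present perfect
    },
]

def check(rule, translation):
    """Return True/False if the rules decide, or None to defer to a human."""
    text = translation.lower()
    if any(re.search(p, text) for p in rule["negative"]):
        return False
    if all(re.search(p, text) for p in rule["positive"]):
        return True
    return None  # undecided: passed on to a human annotator

print(check(RULES[0], "He would have read the book."))   # True
print(check(RULES[0], "He has read the book."))          # False
print(check(RULES[0], "He had been reading the book."))  # None -> human check
```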
The results highlight the overall performance of UCAM, followed by NTT and MLLP-UPV. RWTH, JHU, and UEDIN form the next group.

Morpheval (Burlot et al., 2018)
Burlot et al. (2018) apply the Morpheval test suite (Burlot and Yvon, 2017) and its variations to WMT18 systems translating into Czech, German, and Finnish, and a smaller, similar test suite also to Turkish-to-English systems. The test suite is evaluated semi-automatically and tests selected phenomena primarily reflected in the morphology of Czech, German, Finnish, and Turkish, respectively. The tests check whether a translation preserves a certain contrast (e.g. the gender or number of pronouns, definiteness, verb tense or person), whether agreement is correctly preserved under some alternation of the input (e.g. a pronoun replaced by an adjective and a noun), and similarly whether a particular feature is preserved across lexical variation (using a hyponym). The English-Finnish set also considers rare words, numbers and named entities in particular.
The results for English-to-Czech suggest that the Transformer model (CUNI-TRANSFORMER) may tend to produce more creative translations than the recurrent architecture (UEDIN), because it performs slightly worse in contrast preservation, most notably for the verb past tense, the conditional, and comparative adjectives.
The English-to-German results primarily indicate that current state-of-the-art systems no longer have any real problems with internal agreement in noun phrases, coordinated verbs, preservation of negation, pronoun number, strong/weak adjectives, or superlatives. Phenomena like coreference, compound generation, or generation of future verb forms remain a challenge.
The English-to-Finnish evaluation again confirms the easy phenomena (e.g. verb negation or preservation of numbers) and highlights language-specific hard phenomena (subordinate clause type, verb future, or determiner definiteness). For this language pair, a more thorough manual validation of the test suite was also performed, indicating lower reliability for some phenomena.
Rare words (named entities) are best handled by the online systems, which are probably either trained on more varied data or include specific mechanisms to deal with this type of input, which is of lower concern for research systems.
The Turkish-English tests suggest that none of the systems handles verb particles well, with the accuracy of reflecting e.g. the present vs. future subject particle below one third of cases. Tested verb features are handled better, but apparently still considerably worse than in the other language pairs, with e.g. negation reaching only 70%.
In general, the overall performance according to human evaluation is not necessarily reflected in the performance on the Morpheval tests. A particularly interesting case is FACEBOOK-FAIR (denoted "online-Z" in Burlot et al., 2018), the top English-to-German system according to the manual evaluation, which performs worst in the Morpheval test on preserving morphological features under lexical variation.

Out-of-Domain Sentences (Biçici, 2018)
The test suite by Biçici (2018) consists of only 10 sentences and aims to test the performance of English↔Turkish systems outside of the news domain. Due to its small size, it is difficult to draw any conclusions from this test suite.

Czech-English Grammatical Contrasts (Cinková and Bojar, 2018)
On a set of about 3,000 selected sentences (a subset of the 5,150 distributed to news task participants), Cinková and Bojar (2018) examine the extent to which reference translations and MT outputs follow the most prototypical pattern for certain linguistic phenomena in English-to-Czech translation. The examined MT systems include the primary News Translation Task systems as well as three phrase-based baseline systems (Kocmi et al., 2018). While the test suite cannot be used to rank systems according to their "translation quality", it reveals interesting differences among system types and the reference translation.
In essence, English control and gerund constructions can be translated as Czech finite, non-finite, or subordinate clauses. The test suite focuses on cases where a particular target construction can be expected. According to an automatic evaluation, the reference translation follows this expected choice in about 90% of the sentences of the test suite, while all MT systems score considerably lower. The Moses baseline, ONLINE-G, and ONLINE-A are the lowest, taking the expected route in only about 50% of cases. The top-performing system in the WMT18 manual evaluation, CUNI-TRANSFORMER, and UEDIN perform "best" in this test suite, reaching about 70%, closely followed by the hybrid (non-primary) system Chimera (Kocmi et al., 2018) and ONLINE-B. These results may be related, e.g., to the effects of "translationese", i.e. particular constructions that appear in the target text as an artifact of the translation from a given source language. At the same time, the relation between translation quality (see esp. Section 3.6) and the test suite results of Cinková and Bojar (2018) can be quite intricate. It is conceivable that the reference displays most of the translationese effects, that CUNI-TRANSFORMER and UEDIN are able to escape this pitfall, and that for the remaining systems the scores simply start to indicate lower translation quality.

EVALD Discourse Evaluation
The authors of this test suite present another open-ended evaluation. They provide News Task systems with texts from the areas of academic writing in Humanities and Arts, Social Sciences, Biological and Health Sciences, and finally Physical Sciences. After automatic translation by the WMT18 News Task MT systems, an automatic evaluation tool, EVALD, is used to assess the quality of the discourse. EVALD is trained either to evaluate texts by Czech native speakers or by second-language learners. The version for Czech natives is not sufficiently discerning when applied to MT outputs, but the version for Czech learners displays measurable differences.
Since no reference is available and no manual evaluation of the machine-translated texts was carried out, the test suite authors restrict their examination to the variance of EVALD scores across subsets of the test suite. The nativeness of the original author seems to play the most important role, followed by the identity of the MT system and, with some gap, the genre and the domain of the text. These are promising results, confirming again that current MT systems are reaching a level of translation quality at which it makes sense to compare them with tests designed for human writers. The quality of the source will, however, become the prime factor in such an evaluation, only then followed by the quality of the MT system.

Conclusion
We presented the results of the WMT18 News Translation Shared Task. Our main findings rank the participating systems by their sentence-level translation quality, as assessed in a large-scale manual evaluation using the method of Direct Assessment (DA).
The novelties this year include measuring the reliability of volunteer researchers as assessors of translation quality (as opposed to crowd workers), a pilot in source-based DA evaluation, and additional test suites that shed some light on the differences among individual participating MT systems and take first steps in new avenues of evaluating MT outputs using tests originally designed for humans.
In addition to highlighting the best-performing systems in each of the 14 examined translation directions, the results indicate that for some language pairs, the state of the art in machine translation is very close to the performance of human translators. This result is in line with other recent studies, e.g. Wu et al. (2016) and Hassan et al. (2018), but the style of evaluation (DA for individual sentences) has to be carefully considered before making any strong claims.

Acknowledgments

This work has received funding from EU projects, including grant agreement 645357 (Cracker), and from the Connecting Europe Facility under agreement No. NEA/CEF/ICT/A2016/1331648 (ParaCrawl).
We would also like to thank the University of Helsinki, the University of Tartu, Yandex, and Microsoft for supplying test data for the news translation task.
The human evaluation campaign was generously supported by contributions from Amazon, Microsoft, and Science Foundation Ireland through the ADAPT Centre for Digital Content Technology (www.adaptcentre.ie) at Dublin City University, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
We would also like to give special thanks to the small group of Turkish speakers who rescued our English-Turkish human evaluation at very short notice by contributing their time voluntarily. Finally, we are grateful to the large number of anonymous Mechanical Turk workers who contributed their human intelligence to the human evaluation.

A Differences in Human Scores
Tables 11-24 show differences in average standardized human scores for all pairs of competing systems for each language pair. The number in each cell indicates the difference between the average standardized human score of the system in that column and that of the system in that row. Because of the large number of systems and data conditions, the significance of each pairwise comparison needs to be quantified. We applied the Wilcoxon rank-sum test to measure the likelihood that such differences could occur simply by chance. In the following tables, ∗ indicates statistical significance at p < 0.05, † indicates statistical significance at p < 0.01, and ‡ indicates statistical significance at p < 0.001, according to the Wilcoxon rank-sum test.
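For concreteness, the following minimal sketch shows how such a pairwise comparison can be computed with SciPy's Wilcoxon rank-sum test and how the significance markers could be assigned. The variable names and scores are illustrative toy data, not actual WMT18 assessments.

```python
# Sketch of a pairwise system comparison on standardized DA scores.
# `scores_a` and `scores_b` stand for the per-segment standardized scores
# (z-scores) collected for two competing systems; names are illustrative.
from scipy.stats import ranksums


def compare_systems(scores_a, scores_b):
    """Return the difference in average scores and the rank-sum p-value."""
    diff = sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)
    stat, p_value = ranksums(scores_a, scores_b)
    return diff, p_value


if __name__ == "__main__":
    scores_a = [0.31, 0.12, -0.05, 0.44, 0.27]   # toy data only
    scores_b = [0.02, -0.18, 0.10, -0.07, 0.05]
    diff, p = compare_systems(scores_a, scores_b)
    marker = "‡" if p < 0.001 else "†" if p < 0.01 else "∗" if p < 0.05 else ""
    print(f"difference = {diff:.3f}{marker} (p = {p:.3f})")
```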