Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

We propose the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation. The paraphraser takes a human reference as input and then force-decodes and scores an MT system output. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot "language pair" (e.g., Russian to Russian). We denote our paraphraser "unbiased" because the mode of our model's output probability is centered around a copy of the input sequence, which in our case represents the best-case scenario in which the MT system output matches a human reference. Our method is simple and intuitive, and our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT19 segment-level shared metrics task in all languages, excluding Gujarati, where the model had no training data. We also explore using our model conditioned on the source instead of the reference, and find that it outperforms every system from the WMT19 "QE as a metric" shared task by a statistically significant margin in every language pair.


Introduction
Machine Translation (MT) systems have improved dramatically in the past several years. This is largely due to advances in neural MT (NMT) methods (Sutskever et al., 2014; Bahdanau et al., 2015), but the pace of improvement would not have been possible without automatic MT metrics, which provide immediate feedback on MT quality without the time and expense associated with obtaining human judgments of MT output.
However, the improvements that existing automatic metrics helped enable are now causing the correlation between human judgments and automatic metrics to break down (Ma et al., 2019), especially for BLEU, which has been the de facto standard metric since its introduction almost two decades ago. The problem currently appears limited to very strong systems, but as hardware, methods, and available training data improve, it is likely BLEU will fail more frequently in the future. This could prove extremely detrimental if the MT community fails to adopt an improved metric, as good ideas could quietly be discarded or rejected from publication because they do not correlate with BLEU. In fact, it is possible this is already happening.

We propose using a sentential, sequence-to-sequence paraphraser to force-decode and score MT outputs conditioned on their corresponding human references. Our model effectively stores the entire (exponentially large) set of potential paraphrases of a sentence, both valid and invalid, and we "query" the model with the system output to see how well the system output paraphrases the human reference translation.
The best possible MT output is one which perfectly matches a human reference; therefore, in our application, an ideal paraphraser would be one with an output distribution centered around a copy of its input sentence. We denote such a model an "unbiased paraphraser" to distinguish it from a standard paraphraser trained to produce output which conveys the meaning of the input while also being lexically and/or syntactically different from it. For this reason, we propose using a multilingual NMT system as an unbiased paraphraser by treating paraphrasing as a zero-shot "language pair" (e.g., Russian to Russian). We show that a multilingual NMT model is much closer to an ideal unbiased paraphraser than a generative paraphraser trained on synthetic paraphrases. It also allows us to train a single model for all the languages we wish to evaluate. Figure 1 illustrates our method, which we denote Prism (Probability is the metric). Figure 2 shows how our model (see § 4) penalizes both fluency and adequacy errors given a human reference.
We train a single model in 39 languages and show that it outperforms or statistically ties with every metric and baseline from the WMT 2019 MT metrics task (Ma et al., 2019), as well as the recently published BERTscore method (Zhang et al., 2020), at segment-level human correlation in all languages except Gujarati (which was not included in our training data). We find that our model performs well at judging strong NMT systems, as evidenced by positive human correlation on the top four systems (as judged by humans) submitted to WMT19 in every language pair. In contrast, BLEU has negative correlation in 5 language pairs. Additionally, we show that our method can be applied to the task of "quality estimation (QE) as a metric" (by conditioning on the source instead of the reference) and outperforms all prior methods submitted to the WMT 2019 QE shared task (Fonseca et al., 2019) by a statistically significant margin in every language pair. We release code and models at https://github.com/thompsonb/prism.

Finally, we present analysis which shows that: (1) due to the effort of the human translators, our multilingual NMT system (which we use as an unbiased paraphraser) need not be SOTA at translation in order to judge SOTA MT systems; (2) our method has high correlation with human judgments even when those human judgments were made without using the reference; (3) our unbiased paraphraser outperforms a standard generative paraphraser trained on synthetic paraphrases; and (4) our method outperforms a sentence embedding-based contrastive semantic similarity approach, which is also trained on bitext in many languages, even when that method is augmented with language model (LM) scores to address fluency.

Figure 2: Example token-level log probabilities from our model for various output sentences, conditioned on the input sentence (i.e., human reference) "Jason went to school at the University of Madrid." H(out|in) denotes the average token-level log probability. We observe that our model generally penalizes any deviations (bolded) from the input sentence, but tends to penalize deviations which change the meaning of the sentence or introduce a disfluency more harshly than those which are fluent and adequate. Sentence-level BLEU with smoothing=1 ("sBLEU") and LASER embedding cosine similarity ("LASER") are shown for comparison. We note that LASER appears fairly insensitive to disfluencies.

Method
We propose using a paraphraser to force-decode and estimate probabilities of MT system outputs, conditioned on their corresponding human references. Let p(y_t | y_<t, x) be the probability our paraphraser assigns to the t-th token of output sequence y, given the previous output tokens y_<t and the input sequence x. We consider two ways of combining token-level probabilities from the model: sequence-level log probability,

G(y|x) = Σ_t log p(y_t | y_<t, x),

and average token-level log probability,

H(y|x) = (1/|y|) Σ_t log p(y_t | y_<t, x).

Let sys denote the MT system output, ref the human reference, and src the source. While we expect scoring sys conditioned on ref to be most indicative of the quality of sys, we also explore scoring ref conditioned on sys. This is done because we find qualitatively that output sentences which drop some of the meaning conveyed by the input sentence are penalized less harshly by the model than output sentences which contain extra information not present in the input sentence. Scoring in both directions to penalize the presence of information in one sentence but not the other is similar, at least in spirit, to methods which use bi-directional textual entailment as an MT metric (Padó et al., 2009; Khobragade et al., 2019); conditional probabilities estimated by MT systems have also been shown to be effective at filtering out noisy MT training data (Junczys-Dowmunt, 2018).

We postulate that the output sentence that best represents the meaning of an input sentence is, in fact, simply a copy of the input sentence, as precise word order and choice often convey subtle connotations. As such, we seek a model, which we denote an "unbiased paraphraser," whose output distribution is centered around a copy of the input sentence. While a standard generative paraphraser is trained to retain semantic meaning, it does not meet our criteria because it is simultaneously trained to produce output which is lexically and/or syntactically different from its input, a key element in generative paraphrasing (Bhagat and Hovy, 2013).
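To make the two aggregations concrete, here is a minimal sketch in Python; the token-level log probabilities would come from force-decoding the output against the input with the paraphraser, and the helper names and example values are ours, not from the released toolkit:

```python
from typing import List

def seq_logprob(token_logprobs: List[float]) -> float:
    """G: sequence-level log probability (sum of token log probabilities)."""
    return sum(token_logprobs)

def avg_logprob(token_logprobs: List[float]) -> float:
    """H: average token-level log probability (length-normalized G)."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical log probabilities from force-decoding an MT output
# against its reference:
logprobs = [-0.1, -0.3, -0.05, -2.2, -0.4]
print(seq_logprob(logprobs))  # -3.05
print(avg_logprob(logprobs))  # -0.61
```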
We propose using a multilingual NMT system as an unbiased paraphraser. A multilingual NMT system consists of an encoder which maps a sentence in to an (ideally) language agnostic semantic representation, and decoder to map that representation back to a sentence. The model has only seen bitext in training, but we propose to treat paraphrasing as a zero-shot "language pair" (e.g., Russian to Russian).
As our model is multilingual, we can also score MT system output conditioned on the source instead of the human reference. This task is denoted "quality estimation (QE) as a metric" and was part of the WMT19 QE shared task (Fonseca et al., 2019). We use "Prism-ref" to denote our metric, and "Prism-src" to denote our QE as a metric.
Our final metric and QE metric are defined based on results on our development set (see § 5.2) as follows:

Prism-ref = (H(ref|sys) + H(sys|ref)) / 2
Prism-src = H(sys|src)

To obtain system-level scores, we average segment-level scores over all segments in the test set.
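A sketch of how the final scores compose under the definitions above (function and variable names are ours; the H(·|·) values would be computed by force-decoding with the paraphraser as described in this section):

```python
from statistics import mean
from typing import List

def prism_ref(h_sys_given_ref: float, h_ref_given_sys: float) -> float:
    # Equal weighting of both scoring directions (see § 5.2).
    return 0.5 * (h_sys_given_ref + h_ref_given_sys)

def prism_src(h_sys_given_src: float) -> float:
    # QE as a metric: condition on the source instead of the reference.
    return h_sys_given_src

def system_level_score(segment_scores: List[float]) -> float:
    # System-level score: average of segment-level scores over the test set.
    return mean(segment_scores)
```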

Experiments
We train a multilingual NMT model and explore the extent to which it functions as an unbiased paraphraser. We then conduct several preliminary experiments on the WMT18 MT metrics data to determine how to best utilize the token-level probabilities from the paraphraser, and report results on the WMT19 system- and segment-level metrics tasks and the QE as a metric task.

Data Preparation
Our method relies heavily on a model, which in turn relies heavily on the data on which it is trained, so we describe here the rationale behind the design decisions made regarding the training data. Full details sufficient for replication are provided in Appendix B. Training a single large model consumed the majority of our compute budget, so performing ablations, especially on full-size models, is unfortunately beyond the scope of this work.
Language Agnostic Representations To encourage our intermediate representation to be as language agnostic as possible, we choose datasets with as much language pair diversity as possible (i.e., not just en-* and *-en), as Kudugunta et al. (2019) have shown that the encoder representation is affected by both the source and target language. While it is common to append the target language token to the source sentence, we instead prepend it to the target sentence so that the encoder cannot do anything target-language specific with this tag. At test time, we force-decode the desired language tag prior to scoring.
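For illustration, the tagging scheme might look as follows (a sketch; the exact tag format is an assumption and may differ from the released code):

```python
def make_training_pair(src: str, tgt: str, tgt_lang: str) -> tuple:
    # The language tag is prepended to the *target* sentence, so the
    # encoder never sees it and cannot specialize to the target language.
    return src, f"<{tgt_lang}> {tgt}"

# Training-time example:
print(make_training_pair("Guten Morgen.", "Good morning.", "en"))
# ('Guten Morgen.', '<en> Good morning.')

# Test-time paraphrase scoring: force-decode the tag, then score the
# remaining tokens of the MT output conditioned on the reference, e.g.
# H("<en> " + sys_output | reference).
```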
Noise NMT systems are known to be sensitive to noise, including sentence alignment errors (Khayrallah and Koehn, 2018), so we perform filtering with LASER (Schwenk, 2018; Chaudhary et al., 2019). We also perform language ID filtering using FastText (Joulin et al., 2016) to avoid training the decoder with incorrect language tags.

Number of Languages Aharoni et al. (2019) found that the performance of zero-shot translation in a related language pair increased substantially with the number of languages between 5 and 25 languages, with a plateau somewhere between 25 and 50 languages. We view paraphrasing as zero-shot translation between sentences in the same language, so we expect to need a similar number of languages. (In preliminary experiments using smaller, "Transformer Big" models, we actually saw similar performance for models trained on 10 and 39 languages. This is perhaps due to our choice of datasets with many language pairs, whereas Aharoni et al. (2019) train on language pairs in and out of English; however, we hesitate to draw conclusions from those models, as they performed significantly worse than our full-size model.)
Copies We filter sentence pairs with excessive copies and partial copies, as multiple studies (Ott et al., 2018; Khayrallah and Koehn, 2018) have noted that MT performance degrades substantially when systems are exposed to copies in training.

Model Training Details
Our data comes primarily from Wikimatrix (Schwenk et al., 2019); full details are provided in Appendix B. We train with an update frequency of 200 batches, gradient clipping of 1.2, and dropout of 0.1. We train for 6 epochs, which takes approximately 9 days on a p3.16xlarge instance rented from Amazon AWS (8 Volta V100 GPUs with 16 GB of memory each). The model has approximately 745M parameters.

Baselines and Contrastive Methods
We compare to all baselines and submissions to the WMT19 shared metrics task (Ma et al., 2019), as well as BERTscore F1 (Zhang et al., 2019, 2020), which did not submit to or report on WMT19. We explore several contrastive methods to better understand the performance of our method.
Generative Sentential Paraphraser We compare scoring with our unbiased paraphraser vs. a standard, English-only paraphraser trained on the Parabank2 dataset (Hu et al., 2019c). We train a Transformer with an 8-layer encoder, 8-layer decoder, 1024-dimensional embeddings, feed-forward size of 8192, and 16 attention heads.
LASER We explore using the cosine distance between LASER embeddings of the MT output and human reference, using the pretrained 93-language model provided by the authors. We are particularly interested in LASER as it, like our model, is trained on parallel bitext in many languages.
Language Model We find qualitatively that LASER is fairly insensitive to disfluencies (see Figure 2), so we also explore augmenting it with language model scores of the system outputs. We train a multilingual language model on the same data as our multilingual NMT system. The model architecture is based on GPT-2 (Radford et al., 2019), and we use the fairseq transformer_lm_gpt2_small implementation. We train for 200k updates of approximately 131k tokens each. The model has 369M parameters. We train with shared embeddings and a learning rate of 0.0005, and we stop gradients at sentence boundaries using --sample-break-mode eos.
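The resulting contrastive score is a linear combination of LASER similarity and LM fluency, tuned on the development set (a sketch; the function names and interpolation form are our assumptions):

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def laser_plus_lm(sys_emb: np.ndarray, ref_emb: np.ndarray,
                  lm_logprob_sys: float, alpha: float) -> float:
    # alpha is the interpolation weight found on the development set;
    # the LM term penalizes disfluent output that LASER alone misses.
    return alpha * cosine_sim(sys_emb, ref_emb) + (1.0 - alpha) * lm_logprob_sys
```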

Paraphraser Bias
We explore the extent to which our paraphraser is unbiased in several ways. Qualitatively, we generate from the model using beam search and examine the output. Quantitatively, we contrast the conditional probabilities the model assigns to different outputs for the same input, including the sequence generated by the model via beam search and a copy of the input (see § 5.1).

MT Metrics Evaluation
We report segment-level performance with the Kendall's τ variant used in the shared task. System-level performance is computed, following the shared task, as the Pearson correlation with the mean of the human judgments, following Bojar et al. We use the scripts released for the shared task to estimate confidence intervals for each metric. Metrics with non-overlapping 95% confidence intervals are identified as having a statistically significant difference in performance.
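A sketch of the confidence-interval procedure (we defer to the scripts released with the shared task; this merely illustrates the idea with bootstrap resampling):

```python
import numpy as np
from scipy.stats import pearsonr

def bootstrap_pearson_ci(metric_scores, human_scores,
                         n_boot: int = 1000, seed: int = 0):
    """95% confidence interval for the Pearson correlation between a
    metric's scores and mean human judgments, via bootstrap resampling."""
    metric_scores = np.asarray(metric_scores)
    human_scores = np.asarray(human_scores)
    rng = np.random.default_rng(seed)
    n = len(metric_scores)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        r, _ = pearsonr(metric_scores[idx], human_scores[idx])
        rs.append(r)
    return np.percentile(rs, [2.5, 97.5])
```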

Paraphraser Bias Results
We find that for our model, a copy of the input is almost as probable as the beam search output (see Table 1). In contrast, the gap is much larger for Parabank2. Additionally, beam search from our model produces output which is very similar to the input (BLEU of 82.8 with respect to the input), as desired, while the Parabank2 model tends to change the output more (BLEU of 31.9 with respect to the input). This is supported by qualitative inspection, where we see our model tends to produce copies or near copies of the input, while the Parabank2 model has a clear tendency to make more significant changes, which occasionally also significantly alter the meaning of the sentence (see Appendix A). We expect that sentBLEU, when averaged over many sentences, should track with semantic similarity, and thus our method should track (on average) with sentBLEU as well. We find this to be the case for our multilingual paraphraser, but the Parabank2 model produces nearly the same scores for output with sentBLEU between 0.6 and 1.0 (see Figure 3). All of these findings support our hypothesis that our model is closer to an ideal unbiased paraphraser than the contrastive Parabank2 model, which is trained on synthetic paraphrases.

Preliminary (Development) Results
Preliminary experiments on the WMT18 metrics task data are shown in Figure 4 and Appendix C. We find that length-normalized log probability (H) slightly outperforms un-normalized log probability (G). When using the reference, we find an equal weighting of H(sys|ref) and H(ref|sys) to be approximately optimal, but when using the source, we find that H(src|sys) does not appear to add any useful information to H(sys|src). These findings were used to select the definitions of Prism-ref and Prism-src in § 3.
We find the probability of sys as estimated by an LM [H(sys)] and the cosine distance between LASER embeddings of sys and ref to be complementary, and we use the optimal linear combination of the two found on the development set ("LASER + LM") as a contrastive method.

[Table 2: segment-level metric results for en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh, de-cs, de-fr, and fr-de. Note that our models were not trained on Gujarati (gu). "LASER + LM" denotes the optimal linear combination found on the development set.]

Segment-Level Metric Results
Segment-level metric results are shown in Table 2. On language pairs in to non-English, we outperform prior work by a statistically significant margin in 8 of 11 language pairs and are statistically tied in the rest, with the exception of Gujarati (gu), where the model had no training data. In to English, our metric is statistically tied with the best prior work in every language pair. Our metric tends to significantly outperform our contrastive LASER + LM method, although the contrastive method performs surprisingly well in en-ru.

Table 3 shows system-level metric performance compared to BLEU, BERTscore, and YiSi variants on the top four systems submitted to WMT19. Results for all metrics on the top four systems, with statistical significance, are provided in Appendix E. While correlations with human judgments are not high in all cases for our metric, they are at least positive. In contrast, BLEU has negative correlation in 5 language pairs, and BERTscore and YiSi-1 variants are each negative in at least two.

System-Level Metric Results
We do not find the system-level results on all submitted MT systems (see Appendix F) to be particularly interesting; as noted by Ma et al. (2019), a single weak system can result in high overall system-level correlation, even for an otherwise poorly performing metric.

QE as a Metric Results
We find that our reference-less Prism-src outperforms all "QE as a metric" systems from the WMT19 shared task, by a statistically significant margin, in every language pair at segment-level human correlation (Table 4). Prism-src also outperforms or statistically ties with every "QE as a metric" system at system-level human correlation (Appendix F).

Analysis and Discussion
Are Human References Helpful? The fact that our model is multilingual allows us to explore the extent to which the human reference actually improves our model's ability to judge MT system output, compared to using the source instead. (The underlying assumption in all of MT metrics is that the work done by the human translator makes it easier to automatically judge the quality of MT output; however, if our model or the MT systems being judged were strong enough, we would expect this assumption to break down.) Comparing the performance of our method with access to the human reference (Prism-ref) vs. our method with access to only the source (Prism-src), we find that the reference-based method statistically outperforms the source-based method in all but one language pair. We find the case where they are not statistically different, de-cs, to be particularly interesting: de-cs was the only language pair in WMT19 where the systems were unsupervised (i.e., did not use parallel training data). As a result, it is the only language pair where our model outperformed the best WMT system at translation. In most cases, our model is substantially worse at translation than the best WMT system (see Appendix G); thus the performance difference between Prism-ref and Prism-src suggests that the model needs no help in judging MT systems which are weaker than it is, but that the human references assist our model in evaluating MT systems which are stronger than it is. This means that we have not simply reduced the task of MT evaluation to that of building a SOTA MT system. We see that a reasonably good (but not SOTA) multilingual NMT system, with help from the human translator(s) who produced the references, can be a SOTA MT metric and judge SOTA MT systems.
Does our Method Exhibit Reference Bias? Human judgments of MT system output can be made using either the source (by bilingual annotators) or a human reference (by monolingual annotators). In WMT19, judgments for translations in to English were reference-based, while translations in to non-English were source-based. Fomicheva and Specia (2016) noted that reference-based annotations unfairly reward MT system output which is similar to the reference over equally valid output which is less similar to the reference, denoted "reference bias." It is likely that MT metrics exhibit a similar bias, so we were curious whether we could observe any differences in trends between language pairs with reference-based vs. source-based annotations. With the exception of de-cs (discussed above), we see statistically significant improvements for Prism-ref over Prism-src (which does not use the reference, so cannot be biased by it), both in to English and in to non-English. Thus we conclude that if our method suffers from reference bias, its effects are small compared to the benefit of using the human translation.
Unbiased vs. Generative Paraphraser Our unbiased paraphraser outperforms the generative, English-only Parabank2 paraphraser in 6 of 7 language pairs; however, the wins are statistically significant in only 2 language pairs, with statistical ties in the rest. We believe this is due to the Parabank2 model having a lexical/syntactic bias away from its input (see § 5.1). Additionally, creating synthetic paraphrases and training individual models in many languages would be a substantial undertaking.
Paraphrasing vs. LASER + LM The proposed method significantly outperforms the contrastive LASER-based method in most language pairs, even when LASER is augmented with a language model. This suggests that training a multilingual paraphraser is a better use of multilingual bitext than training a sentence embedder, although the comparison is complicated by the fact that LASER is trained on different data from our model.

Conclusion and Future Work
In this work, we show that a multilingual NMT system can be used as an unbiased, multilingual paraphraser, and we show that the resulting paraphraser can be used as an MT metric. Our single model supports 39 languages and outperforms prior metrics on the most recent WMT metrics shared task evaluation. We present analysis showing our method's high human judgment correlation is not simply the result of reference bias. We also present analysis showing that we have not simply reduced the task of evaluation to that of building a SOTA MT system; the work done by the human translator helps the evaluation model judge systems that are stronger (at translation) than it is, and we do not need a SOTA multilingual NMT model to score SOTA MT systems or be a SOTA MT metric.
Our method outperforms metrics using highly optimized BERT variants, and we are optimistic our method will improve further as stronger multilingual NMT models become publicly available.
In future work, we would like to explore whether the unbiased paraphraser presented in this work is well suited to other tasks, such as data augmentation. We would also like to extend this work to paragraph- or document-level evaluation by training a paragraph- or document-level multilingual NMT system, as there is growing evidence that MT evaluation would be better conducted at the document level rather than the sentence level (Läubli et al., 2018).

A Generation Examples

Figure 5 shows sentences generated from both our model and Parabank2.

REFERENCE: 28-Year-Old Chef Found Dead at San Francisco Mall
THIS WORK: 28-Year-Old Chef Found Dead at San Francisco Mall
PARABANK2: 28-year-old chef found dead in a mall in San Francisco

REFERENCE: A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
THIS WORK: A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
PARABANK2: Earlier this week, a 28-year-old chef who had recently moved to San Francisco was found dead on the steps of a local department store.

REFERENCE: But the victim's brother says he can't think of anyone who would want to hurt him, saying, "Things were finally going well for him."
THIS WORK: But the victim's brother says he can't think of anyone who would want to hurt him, saying, "Things were finally going well for him."
PARABANK2: But the victim's brother said he couldn't think of anyone who'd want to hurt him, and he said he was finally okay.

REFERENCE: The body found at the Westfield Mall Wednesday morning was identified as 28-year-old San Francisco resident Frank Galicia, the San Francisco Medical Examiner's Office said.
THIS WORK: The body found at the Westfield Mall Wednesday morning was identified as 28-year-old San Francisco resident Frank Galicia, the San Francisco Medical Examiner's Office said.
PARABANK2: The body found Wednesday morning at the Westfield Mall has been identified by the San Francisco Medical Examiner's Office as 28-year-old San Franscisco resident Frank Galicia.

REFERENCE: The San Francisco Police Department said the death was ruled a homicide and an investigation is ongoing.
THIS WORK: The San Francisco Police Department said the death was deemed a homicide and an investigation is ongoing.
PARABANK2: The San Francisco P.D. says the death has been ruled a murder and is under investigation.

REFERENCE: A spokesperson for Sons & Daughters said they were "shocked and devastated" by his death.
THIS WORK: A spokesperson for Sons & Daughters said they were "shocked and devastated" by his death
PARABANK2: A spokesman for Sons & Daughters said that his death "shocked and devastated them."

REFERENCE: "We are a small team that operates like a close knit family and he will be dearly missed," the spokesperson said.
THIS WORK: "We are a small team that operates like a close-knit family and he will be dearly missed," the spokesman said.
PARABANK2: "We are a small team, operating as a close-knit family, and we will miss him dearly," said the spokesman.

REFERENCE: Our thoughts and condolences are with Frank's family and friends at this difficult time.
THIS WORK: Our thoughts and condolences are with Frank's family and friends at this difficult time.
PARABANK2: Our thoughts and condolences go out to Frank's family and friends in these difficult times.

REFERENCE: Louis Galicia said Frank initially stayed in hostels, but recently, "Things were finally going well for him."
THIS WORK: Louis Galicia said Frank initially stayed in hostels, but recently, "Things were finally going well for him."
PARABANK2: Louis Galicia said that Frank initially stayed in the dormitory, but lately, "He's finally doing okay."

Figure 5: Sentences generated via beam search (beam width 5) for the multilingual model presented in this work vs. Parabank2. We note that our model tends to produce copies or near copies of the input, which is the desired behavior for our application. Changes are emphasized with bold or strikethrough. The Parabank2 model tends to produce output with lexical/syntactic changes, which occasionally also significantly change the meaning of the sentence (denoted in red). References (paraphraser inputs) are the first ten sentences of WMT17 zh-en.

B Data Details for Replication
The bulk of our data comes from Wikimatrix (Schwenk et al., 2019), a large collection of parallel data extracted from Wikipedia. For more domain variety, we added Global Voices, EuroParl (Koehn, 2005) (a random subset of up to 100k sentence pairs per language pair), SETimes, and United Nations (Eisele and Chen, 2010) (a random sample of 1M sentence pairs per language pair). We also included Kazakh-English and Kazakh-Russian data from WMT, to be able to evaluate on Kazakh; these were limited to the best 1M and 200k sentence pairs, respectively, as judged by LASER. We used a margin threshold of 1.05 for Wikimatrix and a threshold of 1.04 for the remaining datasets, as we expect them to be cleaner. We find that FastText classifies many sentences as non-English when they contain mostly English but also a few non-English words, especially from lower-resource languages. To remedy this, we performed LID on 5-grams and filtered out sentences for which LID did not classify at least half of the 5-grams as the expected language.
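A sketch of the 5-gram LID filter (assuming the pretrained lid.176.bin FastText model; the tokenization and short-sentence fallback are our assumptions):

```python
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # pretrained FastText LID model

def passes_lid_filter(sentence: str, expected_lang: str) -> bool:
    """Keep a sentence only if at least half of its word 5-grams are
    classified as the expected language."""
    words = sentence.split()
    if len(words) < 5:
        ngrams = [sentence]  # too short: fall back to whole-sentence LID
    else:
        ngrams = [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]
    hits = sum(1 for ng in ngrams
               if lid_model.predict(ng)[0][0] == f"__label__{expected_lang}")
    return hits >= len(ngrams) / 2
```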
We filtered out sentence pairs with more than 60% overlap in 3-grams or 40% overlap in 4-grams. Via manual inspection, this seemed to provide a good trade-off between allowing numbers and named entities to be copied and filtering out sentences that were clearly not translated. We perform tokenization with SentencePiece prior to filtering, using a 200k vocabulary for all language pairs, to account for languages like Chinese which do not denote word boundaries. Note that this vocabulary was used only for filtering, not for training the final model.
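A sketch of the copy filter (the side on which the overlap ratio is normalized is our assumption):

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copy_overlap(src_tokens, tgt_tokens, n) -> float:
    src_ngrams = ngram_set(src_tokens, n)
    tgt_ngrams = ngram_set(tgt_tokens, n)
    if not tgt_ngrams:
        return 0.0
    return len(src_ngrams & tgt_ngrams) / len(tgt_ngrams)

def is_partial_copy(src_tokens, tgt_tokens) -> bool:
    # Discard pairs with >60% 3-gram overlap or >40% 4-gram overlap.
    return (copy_overlap(src_tokens, tgt_tokens, 3) > 0.60 or
            copy_overlap(src_tokens, tgt_tokens, 4) > 0.40)
```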
We limited training to languages with at least 1M examples, which resulted in 39 languages. Figure 6 shows the amount of data in each language.

Table 5, Table 6, Table 7, and Table 8 show system- and segment-level results, in to and out of English, for the WMT 2018 MT metrics shared task, along with all baselines and submitted systems.

Table 9, Table 10, and Table 11 show segment-level metric results (excluding QE as a metric), for language pairs in to, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems. Table 12, Table 13, and Table 14 show segment-level QE as a metric results, for language pairs in to, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.

Table 18, Table 19, and Table 20 show system-level results for metrics (excluding QE as a metric), for language pairs in to, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems. Table 21, Table 22, and Table 23 show system-level results for QE as a metric, for language pairs in to, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.

Table 24: BLEU scores for our multilingual NMT system on WMT19 test sets, compared to the best system from WMT19. Our multilingual system achieves SOTA as an MT metric despite substantially underperforming all the best WMT19 MT systems at translation (excluding unsupervised). †: WMT systems were unsupervised (no parallel data). ‡: Multilingual system did not train on Gujarati (gu). Systems are not trained on the same data, so this should not be interpreted as a comparison between multilingual and single-language-pair MT. Language pairs use ISO 639-1 language codes.

Table 24 shows that our system is substantially worse at translation, as measured by BLEU, than the best systems submitted to WMT19 in every language pair except de-cs, where the WMT models were unsupervised (i.e., used no parallel data). This implies that our system is able to judge the quality of state-of-the-art MT systems without itself being state-of-the-art.