BLEURT: Learning Robust Metrics for Text Generation

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgment. We propose BLEURT, a learned evaluation metric for English based on BERT. BLEURT can model human judgment with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG data set. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

Human evaluation is often the best indicator of the quality of a system. However, designing crowdsourcing experiments is an expensive and high-latency process, which does not easily fit in a daily model development pipeline. Therefore, NLG researchers commonly use automatic evaluation metrics, which provide an acceptable proxy for quality and are very cheap to compute. This paper investigates sentence-level, reference-based metrics, which describe the extent to which a candidate sentence is similar to a reference one. The exact definition of similarity may range from string overlap to logical entailment.
The first generation of metrics relied on handcrafted rules that measure the surface similarity between the sentences. To illustrate, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), two popular metrics, rely on N-gram overlap. Because those metrics are only sensitive to lexical variation, they cannot appropriately reward semantic or syntactic variations of a given reference. Thus, they have been repeatedly shown to correlate poorly with human judgment, in particular when all the systems to compare have a similar level of accuracy (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018).
Increasingly, NLG researchers have addressed those problems by injecting learned components in their metrics. To illustrate, consider the WMT Metrics Shared Task, an annual benchmark in which translation metrics are compared on their ability to imitate human assessments. The last two years of the competition were largely dominated by neural net-based approaches, RUSE, YiSi and ESIM (Ma et al., 2018, 2019). Current approaches largely fall into two categories. Fully learned metrics, such as BEER, RUSE, and ESIM, are trained end-to-end, and they typically rely on handcrafted features and/or learned embeddings. Conversely, hybrid metrics, such as YiSi and BERTscore, combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules. The first category typically offers great expressivity: if a training set of human ratings data is available, the metrics may take full advantage of it and fit the ratings distribution tightly. Furthermore, learned metrics can be tuned to measure task-specific properties, such as fluency, faithfulness, grammar, or style. On the other hand, hybrid metrics offer robustness. They may provide better results when there is little to no training data, and they do not rely on the assumption that training and test data are identically distributed.
And indeed, the IID assumption is particularly problematic in NLG evaluation because of domain drifts, which have been the main target of the metrics literature, but also because of quality drifts: NLG systems tend to get better over time, and therefore a model trained on ratings data from 2015 may fail to distinguish top performing systems in 2019, especially for newer research tasks. An ideal learned metric would be able to both take full advantage of available ratings data for training, and be robust to distribution drifts, i.e., it should be able to extrapolate.
Our insight is that it is possible to combine expressivity and robustness by pre-training a fully learned metric on large amounts of synthetic data, before fine-tuning it on human ratings. To this end, we introduce BLEURT, 1 a text generation metric based on BERT (Devlin et al., 2019). A key ingredient of BLEURT is a novel pre-training scheme, which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals.
To demonstrate our approach, we train BLEURT for English and evaluate it under different generalization regimes. We first verify that it provides state-of-the-art results on all recent years of the WMT Metrics Shared task (2017 to 2019, to-English language pairs). We then stress-test its ability to cope with quality drifts with a synthetic benchmark based on WMT 2017. Finally, we show that it can easily adapt to a different domain with three tasks from a data-to-text dataset, WebNLG 2017 (Gardent et al., 2017). Ablations show that our synthetic pretraining scheme increases performance in the IID setting, and is critical to ensure robustness when the training data is scarce, skewed, or out-of-domain.
The code and pre-trained models are available online. 2
1 Bilingual Evaluation Understudy with Representations from Transformers. We refer the intrigued reader to Papineni et al. 2002 for a justification of the term understudy.
2 http://github.com/google-research/bleurt

Preliminaries

Define x = (x_1, ..., x_r) to be the reference sentence of length r, where each x_i is a token, and let x̃ = (x̃_1, ..., x̃_p) be a prediction sentence of length p. Let {(x_n, x̃_n, y_n)}_{n=1}^{N} be a training dataset of size N, where y_n ∈ R is the human rating that indicates how good x̃_n is with respect to x_n. Given the training data, our goal is to learn a function f : (x, x̃) → y that predicts the human rating.

Fine-Tuning BERT for Quality Evaluation
Given the small amounts of rating data available, it is natural to leverage unsupervised representations for this task. In our model, we use BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), an unsupervised technique that learns contextualized representations of sequences of text. Given x and x̃, BERT is a Transformer (Vaswani et al., 2017) that returns a sequence of contextualized vectors:

v_[CLS], v_{x_1}, ..., v_{x̃_p} = BERT(x, x̃)

where v_[CLS] is the representation for the special [CLS] token. As described by Devlin et al. (2019), we add a linear layer on top of the [CLS] vector to predict the rating:

ŷ = f(x, x̃) = W v_[CLS] + b

where W and b are the weight matrix and bias vector respectively. Both the linear layer and the BERT parameters are trained (i.e., fine-tuned) on the supervised data, which typically numbers a few thousand examples. We use the regression loss ℓ_supervised = (1/N) Σ_{n=1}^{N} (y_n − ŷ_n)². Although this approach is quite straightforward, we will show in Section 5 that it gives state-of-the-art results on the WMT Metrics Shared Tasks 2017-2019, which makes it a high-performing evaluation metric. However, fine-tuning BERT requires a sizable amount of IID data, which is less than ideal for a metric that should generalize to a variety of tasks and model drift.
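To make the fine-tuning recipe concrete, the following NumPy sketch implements the regression head and loss under stated assumptions: `encode` is a hypothetical stand-in for BERT that would return the contextualized [CLS] vector, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # toy hidden size; BERT-large uses 1,024

def encode(reference: str, candidate: str) -> np.ndarray:
    # Hypothetical placeholder for BERT: deterministically maps a
    # (reference, candidate) pair to a fake [CLS] vector.
    h = abs(hash((reference, candidate))) % (2**32)
    return np.random.default_rng(h).standard_normal(HIDDEN)

# Linear regression head: y_hat = W v_[CLS] + b
W = rng.standard_normal(HIDDEN) * 0.01
b = 0.0

def predict(reference: str, candidate: str) -> float:
    return float(W @ encode(reference, candidate) + b)

def supervised_loss(batch):
    # l_supervised = (1/N) * sum_n (y_n - y_hat_n)^2
    return sum((y - predict(x, x_tilde)) ** 2
               for x, x_tilde, y in batch) / len(batch)

batch = [("the cat sat", "a cat was sitting", 0.8),
         ("the cat sat", "colorless green ideas", -1.2)]
loss = supervised_loss(batch)
```

In the real model, the loss gradient flows through both the head and the encoder, fine-tuning all BERT parameters.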

Pre-Training on Synthetic Data
The key aspect of our approach is a pre-training technique that we use to "warm up" BERT before fine-tuning on rating data. 3 We generate a large number of synthetic reference-candidate pairs (z, z̃), and we train BERT on several lexical- and semantic-level supervision signals with a multitask loss. As our experiments will show, BLEURT generalizes much better after this phase, especially with incomplete training data.
Any pre-training approach requires a dataset and a set of pre-training tasks. Ideally, the setup should resemble the final NLG evaluation task, i.e., the sentence pairs should be distributed similarly and the pre-training signals should correlate with human ratings. Unfortunately, we cannot have access to the NLG models that we will evaluate in the future. Therefore, we optimized our scheme for generality, with three requirements.
(1) The set of reference sentences should be large and diverse, so that BLEURT can cope with a wide range of NLG domains and tasks. (2) The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities. The aim here is to anticipate all variations that an NLG system may produce, e.g., phrase substitution, paraphrases, noise, or omissions. (3) The pre-training objectives should effectively capture those phenomena, so that BLEURT can learn to identify them. The following sections present our approach.

Generating Sentence Pairs
One way to expose BLEURT to a wide variety of sentence differences is to use existing sentence pairs datasets (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019). These sets are a rich source of related sentences, but they may fail to capture the errors and alterations that NLG systems produce (e.g., omissions, repetitions, nonsensical substitutions). We opted instead for an automatic approach that can be scaled arbitrarily and at little cost: we generate synthetic sentence pairs (z, z̃) by randomly perturbing 1.8 million segments z from Wikipedia. We use three techniques: mask-filling with BERT, backtranslation, and randomly dropping out words. We obtain about 6.5 million perturbations z̃. Let us describe those techniques.
Mask-filling with BERT: BERT's initial training task is to fill gaps (i.e., masked tokens) in tokenized sentences. We leverage this functionality by inserting masks at random positions in the Wikipedia sentences, and fill them with the language model. Thus, we introduce lexical alterations while maintaining the fluency of the sentence. We use two masking strategies: we either introduce the masks at random positions in the sentences, or we create contiguous sequences of masked tokens. More details are provided in the Appendix.
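The two masking strategies can be sketched as follows; `mask_tokens` is a hypothetical helper, and a masked language model such as BERT would then fill the [MASK] slots.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, n_masks=2, contiguous=False, seed=0):
    """Replace tokens with [MASK], either at random positions or as
    one contiguous span; a language model then fills the gaps."""
    rng = random.Random(seed)
    tokens = list(tokens)
    n_masks = min(n_masks, len(tokens))
    if contiguous:
        start = rng.randrange(len(tokens) - n_masks + 1)
        positions = range(start, start + n_masks)
    else:
        positions = rng.sample(range(len(tokens)), n_masks)
    for i in positions:
        tokens[i] = MASK
    return tokens

sent = "the quick brown fox jumps".split()
```

Filling the masks with a language model, rather than random words, keeps the perturbed sentence fluent while changing its lexical content.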
Backtranslation: We generate paraphrases and perturbations with backtranslation, that is, round trips from English to another language and then back to English with a translation model (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016). Our primary aim is to create variants of the reference sentence that preserve semantics. Additionally, we use the mispredictions of the backtranslation models as a source of realistic alterations.
Dropping words: We found it useful in our experiments to randomly drop words from the synthetic examples above to create other examples. This method prepares BLEURT for "pathological" behaviors of NLG systems, e.g., void predictions, or sentence truncation.
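A minimal sketch of the word-dropping transformation; `drop_words` is a hypothetical helper, and the number of dropped words is drawn uniformly up to the sentence length, as described above.

```python
import random

def drop_words(tokens, seed=0):
    """Randomly drop words: the count is drawn uniformly, up to the
    sentence length, so the result may even be empty (a "void"
    prediction, one of the pathological cases we want to cover)."""
    rng = random.Random(seed)
    n_drop = rng.randint(0, len(tokens))
    keep = set(rng.sample(range(len(tokens)), len(tokens) - n_drop))
    # Preserve the original token order among the survivors.
    return [t for i, t in enumerate(tokens) if i in keep]
```

Applying this on top of the mask-filled and backtranslated examples yields candidates with omissions and truncations, which purely fluent perturbations would never produce.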

Pre-Training Signals
The next step is to augment each sentence pair (z, z̃) with a set of pre-training signals {τ_k}, where τ_k is the target vector of pre-training task k.
Good pre-training signals should capture a wide variety of lexical and semantic differences. They should also be cheap to obtain, so that the approach can scale to large amounts of synthetic data. The following section presents our nine pre-training tasks, summarized in Table 1. Additional implementation details are in the Appendix.
Backtranslation Likelihood: The idea behind this signal is to leverage existing translation models to measure semantic equivalence. Given a pair (z, z̃), this training signal measures the probability that z̃ is a backtranslation of z, P(z̃|z), normalized by the length of z̃. Let P_en→fr(z_fr|z) be a translation model that assigns probabilities to French sentences z_fr conditioned on English sentences z, and let P_fr→en(z̃|z_fr) be a translation model that assigns probabilities to English sentences given French sentences. If |z̃| is the number of tokens in z̃, we define our score as τ_en-fr,z̃|z = log P(z̃|z) / |z̃|, with:

P(z̃|z) = Σ_{z_fr} P_fr→en(z̃|z_fr) P_en→fr(z_fr|z)

Because computing the summation over all possible French sentences is intractable, we approximate the sum using z*_fr = argmax_{z_fr} P_en→fr(z_fr|z), and we assume that P_en→fr(z*_fr|z) ≈ 1:

P(z̃|z) ≈ P_fr→en(z̃|z*_fr)

We can trivially reverse the procedure to compute P(z|z̃); thus we create four pre-training signals τ_en-fr,z|z̃, τ_en-fr,z̃|z, τ_en-de,z|z̃, τ_en-de,z̃|z with two pairs of languages (en ↔ de and en ↔ fr), in both directions.
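The single-pivot approximation can be sketched as follows. The toy probability tables are hypothetical stand-ins for the real translation models (Transformers in our setup); only the scoring logic is illustrative.

```python
import math

# Hypothetical toy models standing in for P_en->fr and P_fr->en.
def p_en_fr(z_fr, z):  # P_en->fr(z_fr | z)
    return 0.9 if z_fr == "le chat dort" else 0.1

def p_fr_en(z_tilde, z_fr):  # P_fr->en(z_tilde | z_fr)
    table = {("the cat sleeps", "le chat dort"): 0.7,
             ("a cat is sleeping", "le chat dort"): 0.2}
    return table.get((z_tilde, z_fr), 1e-6)

def backtranslation_score(z, z_tilde, candidates_fr):
    # Approximate P(z_tilde | z) by its value at the single best
    # French pivot z*_fr = argmax P_en->fr(z_fr | z).
    z_star = max(candidates_fr, key=lambda z_fr: p_en_fr(z_fr, z))
    p = p_fr_en(z_tilde, z_star)
    # Length-normalized log-probability, as in the tau signal.
    return math.log(p) / len(z_tilde.split())
```

Candidates closer in meaning to the pivot receive a higher (less negative) score, which is what lets this signal act as a cheap proxy for semantic equivalence.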
Textual Entailment: The signal τ_entail expresses whether z entails or contradicts z̃ using a classifier. We report the probability of three labels: Entail, Contradict, and Neutral, using BERT fine-tuned on an entailment dataset, MNLI (Devlin et al., 2019; Williams et al., 2018).

Backtranslation flag: The signal τ_backtran_flag is a Boolean that indicates whether the perturbation was generated with backtranslation or with mask-filling.

Modeling
For each pre-training task, our model uses either a regression or a classification loss. We then aggregate the task-level losses with a weighted sum. Let τ_k describe the target vector for each task, e.g., the probabilities for the classes Entail, Contradict, and Neutral, or the precision, recall, and F-score for ROUGE. If τ_k is a regression task, then the loss used is the ℓ2 loss, i.e., ℓ_k = ||τ_k − τ̂_k||²₂ / |τ_k|, where |τ_k| is the dimension of τ_k and τ̂_k is computed by using a task-specific linear layer on top of the [CLS] embedding:

τ̂_k = W_{τ_k} ṽ_[CLS] + b_{τ_k}

If τ_k is a classification task, we use a separate linear layer to predict a logit for each class c, τ̂_kc = W_{τ_kc} ṽ_[CLS] + b_{τ_kc}, and we use the multiclass cross-entropy loss. We define our aggregate pre-training loss function as follows:

ℓ_pre-training = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} γ_k ℓ_k(τ_k^m, τ̂_k^m)

where τ_k^m is the target vector for example m, M is the number of synthetic examples, and γ_k are hyperparameter weights obtained with grid search (more details in the Appendix).
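The aggregate loss can be sketched in NumPy as follows. This is a simplified illustration with hypothetical task names; in the real model the predictions come from task-specific linear layers on the [CLS] embedding.

```python
import numpy as np

def l2_loss(tau, tau_hat):
    # ||tau - tau_hat||^2 / |tau|: per-task regression loss.
    tau, tau_hat = np.asarray(tau, float), np.asarray(tau_hat, float)
    return float(np.sum((tau - tau_hat) ** 2) / tau.size)

def cross_entropy(probs, logits):
    # Multiclass cross-entropy against (possibly soft) targets.
    logits = np.asarray(logits, float)
    log_p = logits - np.log(np.sum(np.exp(logits)))
    return float(-np.sum(np.asarray(probs) * log_p))

def pretraining_loss(examples, gammas):
    """Weighted sum of per-task losses, averaged over examples.
    `examples` maps a task name to (target, prediction) pairs;
    `gammas` holds the grid-searched task weights."""
    total = 0.0
    for k, pairs in examples.items():
        loss_fn = cross_entropy if k == "entail" else l2_loss
        task_loss = sum(loss_fn(t, p) for t, p in pairs) / len(pairs)
        total += gammas[k] * task_loss
    return total
```

The γ weights let cheap but noisy signals (e.g., BLEU) contribute less than higher-quality ones (e.g., entailment).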

Experiments
In this section, we report our experimental results for two tasks, translation and data-to-text. First, we benchmark BLEURT against existing text generation metrics on the last 3 years of the WMT Metrics Shared Task (Bojar et al., 2017). We then evaluate its robustness to quality drifts with a series of synthetic datasets based on WMT17. We test BLEURT's ability to adapt to different tasks with the WebNLG 2017 Challenge Dataset (Gardent et al., 2017). Finally, we measure the contribution of each pre-training task with ablation experiments.
Our Models: Unless specified otherwise, all BLEURT models are trained in three steps: regular BERT pre-training (Devlin et al., 2019), pre-training on synthetic data (as explained in Section 4), and fine-tuning on task-specific ratings (translation and/or data-to-text). We experiment with two versions of BLEURT, BLEURT and BLEURTbase, respectively based on BERT-Large (24 layers, 1,024 hidden units, 16 heads) and BERT-Base (12 layers, 768 hidden units, 12 heads) (Devlin et al., 2019), both uncased. We use batch size 32, learning rate 1e-5, and 800,000 steps for pre-training and 40,000 steps for fine-tuning. We provide the full detail of our training setup in the Appendix.

We evaluate the agreement between the automatic metrics and the human ratings. For each year, we report two metrics: Kendall's Tau τ (for consistency across experiments), and the official WMT metric for that year (for completeness). The official WMT metric is either Pearson's correlation or a robust variant of Kendall's Tau called DARR, described in the Appendix. All the numbers come from our own implementation of the benchmark. 4 Our results are globally consistent with the official results, but we report small differences in 2018 and 2019, marked in the tables.

4 The official scripts are public, but they suffer from documentation and dependency issues, as shown by a README file in the 2019 edition which explicitly discourages using them.

Models:
We experiment with four versions of BLEURT: BLEURT, BLEURTbase, BLEURT -pre and BLEURTbase -pre. The first two models are based on BERT-large and BERT-base. In the latter two versions, we skip the pre-training phase and fine-tune directly on the WMT ratings.
For each year of the WMT shared task, we use the test set from the previous years for training and validation. We describe our setup in further detail in the Appendix. We compare BLEURT to participant data from the shared task and to automatic metrics that we ran ourselves. In the former case, we use the best-performing contestants for each year, that is, chrF++, BEER, Meteor++, RUSE, Yisi1, ESIM, and Yisi1-SRL (Mathur et al., 2019). All the contestants use the same WMT training data, in addition to existing sentence or token embeddings. In the latter case, we use Moses sentenceBLEU, BERTscore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). For BERTscore, we use BERT-large uncased for fairness, and roBERTa (the recommended version) for completeness. We run MoverScore on WMT 2017 using the scripts published by the authors.

Results

A BLEURT-based metric dominates the benchmark for each language pair in 2017 and 2018 (Tables 2 and 3). BLEURT and BLEURTbase are also competitive for year 2019: they yield the best results for all language pairs on Kendall's Tau, and they come first for 3 out of 7 pairs on DARR. As expected, BLEURT dominates BLEURTbase in the majority of cases. Pre-training consistently improves the results of BLEURT and BLEURTbase. We observe the largest effect on year 2017, where it adds up to 7.4 Kendall's Tau points for BLEURTbase (zh-en). The effect is milder on years 2018 and 2019, up to 2.1 points (tr-en, 2018). We explain the difference by the fact that the training data used for 2017 is smaller than the datasets used for the following years, so pre-training is likelier to help. In general, pre-training yields higher returns for BERT-base than for BERT-large; in fact, BLEURTbase with pre-training is often better than BLEURT without.

Takeaways:
Pre-training delivers consistent improvements, especially for BLEURT-base. BLEURT yields state-of-the-art performance for all years of the WMT Metrics Shared Task.

Robustness to Quality Drift
We assess our claim that pre-training makes BLEURT robust to quality drifts by constructing a series of tasks for which it is increasingly pressured to extrapolate. All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable. 5

Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics shared task, keeping low-rated translations for training and high-rated translations for test. The key parameter is the skew factor α, which measures how much the training data is left-skewed and the test data is right-skewed. Figure 1 shows the ratings distributions that we used in our experiments. The training data shrinks as α increases: in the most extreme case (α = 3.0), we use only 11.9% of the original 5,344 training records. We give the full detail of our sampling methodology in the Appendix.
We use BLEURT with and without pre-training and we compare to Moses sentBLEU and BERTscore. We use BERT-large uncased for both BLEURT and BERTscore.

Results: Figure 2 presents BLEURT's performance as we vary the train and test skew independently. Our first observation is that the agreements fall for all metrics as we increase the test skew. This effect was already described in the 2019 WMT Metrics report (Ma et al., 2019). A common explanation is that the task gets more difficult as the ratings get closer: it is easier to discriminate between "good" and "bad" systems than to rank "good" systems.
Training skew has a disastrous effect on BLEURT without pre-training: it is below BERTscore for α = 1.0, and it falls under sentBLEU for α ≥ 1.5. Pre-trained BLEURT is much more robust: the only case in which it falls under the baselines is α = 3.0, the most extreme drift, for which incorrect translations are used for training while excellent ones are used for test.
Takeaways: Pre-training makes BLEURT significantly more robust to quality drifts.

WebNLG Experiments
In this section, we evaluate BLEURT's performance on three tasks from a data-to-text dataset, the WebNLG Challenge 2017 (Shimorina et al., 2019). The aim is to assess BLEURT's capacity to adapt to new tasks with limited training data.
Dataset and Evaluation Tasks: The WebNLG challenge benchmarks systems that produce natural language descriptions of entities (e.g., buildings, cities, artists) from sets of 1 to 5 RDF triples. The organizers released the human assessments for 9 systems over 223 inputs, that is, 4,677 sentence pairs in total (we removed null values). Each input comes with 1 to 3 reference descriptions. The submissions are evaluated on 3 aspects: semantics, grammar, and fluency. We treat each type of rating as a separate modeling task. The data has no natural split between train and test; therefore we experiment with several schemes. We allocate 0% to about 50% of the data to training, and we split on either the evaluated systems or the RDF inputs in order to test different generalization regimes.

Systems and Baselines: BLEURT -pre -wmt is a public BERT-large uncased checkpoint directly trained on the WebNLG ratings. BLEURT -wmt was first pre-trained on synthetic data, then fine-tuned on WebNLG data. BLEURT was trained in three steps: first on synthetic data, then on WMT data (2016-2018), and finally on WebNLG data. When a record comes with several references, we run BLEURT on each reference and report the highest value (Zhang et al., 2020).
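The multi-reference rule can be sketched as follows; `overlap` is a toy stand-in metric introduced only for the example, not BLEURT itself.

```python
def multi_reference_score(score_fn, references, candidate):
    """Score a candidate against each reference and keep the
    highest value, as done for multi-reference WebNLG records."""
    return max(score_fn(ref, candidate) for ref in references)

# Toy stand-in metric: Jaccard overlap of token sets.
def overlap(ref, cand):
    r, c = set(ref.split()), set(cand.split())
    return len(r & c) / max(len(r | c), 1)
```

Taking the maximum, rather than the mean, rewards a candidate for matching any one of the acceptable phrasings.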
We report four baselines: BLEU, TER, Meteor, and BERTscore. The first three were computed by the WebNLG competition organizers. We ran the latter one ourselves, using BERT-large uncased for a fair comparison.

Results: Figure 3 presents the correlation of the metrics with human assessments as we vary the share of data allocated to training. The more pre-trained BLEURT is, the quicker it adapts. The vanilla BERT approach, BLEURT -pre -wmt, requires about a third of the WebNLG data to dominate the baselines on the majority of tasks, and it still lags behind on semantics (split by system). In contrast, BLEURT -wmt is competitive with as little as 836 records, and BLEURT is comparable with BERTscore with zero fine-tuning.

Takeaways: Thanks to pre-training, BLEURT can quickly adapt to the new tasks. BLEURT fine-tuned twice (first on synthetic data, then on WMT data) provides acceptable results on all tasks without training data.

Ablation Experiments
Figure 4 presents our ablation experiments on WMT 2017, which highlight the relative importance of each pre-training task. On the left side, we compare BLEURT pre-trained on a single task to BLEURT without pre-training. On the right side, we compare full BLEURT to BLEURT pre-trained on all tasks except one. Pre-training on BERTscore, entailment, and the backtranslation scores yields improvements (symmetrically, ablating them degrades BLEURT). Conversely, BLEU and ROUGE have a negative impact. We conclude that pre-training on high-quality signals helps BLEURT, but that metrics that correlate less well with human judgment may in fact harm the model. 6

Related Work
The WMT shared metrics competition (Bojar et al., 2016; Ma et al., 2018, 2019) has inspired the creation of many learned metrics, some of which use regression or deep learning (Stanojevic and Sima'an, 2014; Ma et al., 2017; Shimanaka et al., 2018; Chen et al., 2017; Mathur et al., 2019). Other metrics have been introduced, such as the recent MoverScore (Zhao et al., 2019), which combines contextual embeddings and Earth Mover's Distance. We provide a head-to-head comparison with the best performing of those in our experiments. Other approaches do not attempt to estimate quality directly, but use information extraction or question answering as a proxy (Wiseman et al., 2017; Goodrich et al., 2019; Eyal et al., 2019). Those are complementary to our work. There has been recent work that uses BERT for evaluation. BERTscore (Zhang et al., 2020) proposes replacing the hard n-gram overlap of BLEU with a soft overlap using BERT embeddings. We use it in all our experiments. Bertr (Mathur et al., 2019) and YiSi (Mathur et al., 2019) also make use of BERT embeddings to capture similarity. Sum-QE (Xenouleas et al., 2019) fine-tunes BERT for quality estimation as we describe in Section 3. Our focus is different: we train metrics that are not only state-of-the-art in conventional IID experimental setups, but also robust in the presence of scarce and out-of-distribution training data. To our knowledge, no existing work has explored pre-training and extrapolation in the context of NLG.
Previous studies have used noising for referenceless evaluation (Dušek et al., 2019). Noisy pre-training has also been proposed before for other tasks such as paraphrasing (Wieting et al., 2016; Tomar et al., 2017), but generally not with synthetic data. Generating synthetic data via paraphrases and perturbations has been commonly used for generating adversarial examples (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018), an orthogonal line of research.

Conclusion
We presented BLEURT, a reference-based text generation metric for English. Because the metric is trained end-to-end, BLEURT can model human assessment with superior accuracy. Furthermore, pre-training makes the metric particularly robust to both domain and quality drifts. Future research directions include multilingual NLG evaluation, and hybrid methods involving both humans and classifiers.

Backtranslation: Consider English and French. Given a forward translation model P_en→fr(z_fr|z_en) and a backward translation model P_fr→en(z_en|z_fr), we generate z̃ as follows:

z̃ = argmax_{z_en} P_fr→en(z_en|z*_fr), where z*_fr = argmax_{z_fr} P_en→fr(z_fr|z)

For the translations, we use a Transformer model (Vaswani et al., 2017), trained on English-German and English-French with the tensor2tensor framework. 7

Word dropping: Given a synthetic example (z, z̃), we generate a pair (z, z̃′) by randomly dropping words from z̃. We draw the number of words to drop uniformly, up to the length of the sentence. We apply this transformation on about 30% of the data generated with the previous method.

A.2 Pre-Training Tasks
We now provide additional details on the signals we used for pre-training.
Automatic Metrics: As shown in the table, we use three types of signals: BLEU, ROUGE, and BERTscore. For BLEU, we used the original Moses sentenceBLEU 8 implementation, using the Moses tokenizer and the default parameters. For ROUGE, we used the seq2seq implementation of ROUGE-N. 9 We used a custom implementation of BERTscore, based on BERT-large uncased. ROUGE and BERTscore return three scores: precision, recall, and F-score. We use all three quantities.
Backtranslation Likelihood: We compute all the losses using a custom Transformer model (Vaswani et al., 2017), trained on two language pairs (English-French and English-German) with the tensor2tensor framework.
Normalization: All the regression labels are normalized before training.

B.1 Training Setup for All Experiments
We use BERT's public checkpoints 10 with Adam (the default optimizer), learning rate 1e-5, and batch size 32. Unless specified otherwise, we use 800,000 training steps for pre-training and 40,000 steps for fine-tuning. We run training and evaluation in parallel: we run the evaluation every 1,500 steps and store the checkpoint that performs best on a held-out validation set (more details on the data splits and our choice of metrics in the following sections). We use Google Cloud TPUs v2 for learning, and Nvidia Tesla V100 accelerators for evaluation and test. Our code uses TensorFlow 1.15 and Python 2.7.

B.2 WMT Metric Shared Task
Metrics. The metrics used to compare the evaluation systems vary across the years. The organizers use Pearson's correlation on standardized human judgments across all segments in 2017, and a custom variant of Kendall's Tau named "DARR" on raw human judgments in 2018 and 2019. The latter metric operates as follows. The organizers gather all the translations for the same reference segment, they enumerate all the possible pairs (translation_1, translation_2), and they discard all the pairs which have a "similar" score (less than 25 points away on a 100-point scale). For each remaining pair, they then determine which translation is the best according to both human judgment and the candidate metric. Let |Concordant| be the number of pairs on which the NLG metrics agree and |Discordant| be those on which they disagree; then the score is computed as follows:

DARR = (|Concordant| − |Discordant|) / (|Concordant| + |Discordant|)

The idea behind the 25-point filter is to make the evaluation more robust, since the judgments collected for WMT 2018 and 2019 are noisy. Kendall's Tau is identical, but it does not use the filter.

Training setup. To separate training and validation data, we set aside a fixed ratio of records in such a way that there is no "leak" between the datasets (i.e., train and validation records that share the same source). We use 10% of the data for validation for years 2017 and 2018, and 5% for year 2019. We report results for the models that yield the highest Kendall's Tau across all records on validation data. The weights associated with each pre-training task (see our Modeling section) are set with grid search, using the train/validation setup of WMT 2017.
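The DARR computation can be sketched as follows: a simplified illustration operating on parallel lists of human and metric scores for translations of the same source segment.

```python
from itertools import combinations

def darr(human, metric, threshold=25.0):
    """Count pairs where the metric agrees with the human ranking,
    after discarding "similar" pairs (human scores within
    `threshold` points on a 100-point scale)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        if abs(human[i] - human[j]) < threshold:
            continue  # filtered: the humans rated them too close
        same_order = (human[i] - human[j]) * (metric[i] - metric[j]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

With `threshold=0` this reduces to a plain Kendall's-Tau-style agreement count, which is why the two measures are described as identical apart from the filter.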
Baselines. We use three metrics: the Moses implementation of sentenceBLEU, 11 BERTscore, 12 and MoverScore, 13 which are all available online. We run the Moses tokenizer on the reference and candidate segments before computing sentenceBLEU.

B.3 Robustness to Quality Drift
Data Re-sampling Methodology: We sample the training and test data separately, as follows. We split the data into 10 bins of equal size. We then sample each record in the dataset with probabilities 1/B^α and 1/(11−B)^α for train and test respectively, where B is the bin index of the record between 1 and 10, and α is a predefined skew factor. The skew factor α controls the drift: a value of 0 has no effect (the ratings are centered around 0), and a value of 3.0 yields extreme differences. Note that the sizes of the datasets decrease as α increases: we use 50.7%, 30.3%, 20.4%, and 11.9% of the original 5,344 training records for α = 0.5, 1.0, 1.5, and 3.0 respectively.

11 https://github.com/moses-smt/mosesdecoder/blob/master/mert/sentence-bleu.cpp
12 https://github.com/Tiiiger/bert_score
13 https://github.com/AIPHES/emnlp19-moverscore
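The sampling procedure can be sketched as follows; `skewed_sample` is a hypothetical helper operating on records that have already been assigned a rating bin.

```python
import random

def skewed_sample(records, alpha, for_test=False, seed=0):
    """Sub-sample rating records with probability 1/B^alpha (train)
    or 1/(11-B)^alpha (test), where B in 1..10 is the rating bin.
    `records` is a list of (rating_bin, payload) pairs."""
    rng = random.Random(seed)
    out = []
    for b, payload in records:
        p = 1.0 / ((11 - b) ** alpha if for_test else b ** alpha)
        if rng.random() < p:
            out.append(payload)
    return out
```

With a large α, low-rated records (small B) dominate the training sample while high-rated records dominate the test sample, producing exactly the left-skewed/right-skewed split described above.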

B.4 Ablation Experiment: How Much Pre-Training Time is Necessary?
To understand the relationship between pre-training time and downstream accuracy, we pre-train several versions of BLEURT and fine-tune them on WMT17 data, varying the number of pre-training steps. Figure 5 presents the results. Most gains are obtained during the first 400,000 steps, that is, after about two epochs over our synthetic dataset.