Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation

Accurate, automatic evaluation of machine translation is critical for system tuning and for evaluating progress in the field. We propose a simple unsupervised metric, as well as additional supervised metrics, which rely on contextual word embeddings to encode the translation and reference sentences. We find that these models rival or surpass all existing metrics in the WMT 2017 sentence-level and system-level tracks, and our trained model has a substantially higher correlation with human judgements than all existing metrics on the WMT 2017 to-English sentence-level dataset.


Introduction
Evaluation metrics are a fundamental component of machine translation (MT) and other language generation tasks. The problem of assessing whether a translation is both adequate and coherent is a challenging text analysis problem, which remains unsolved despite many years of effort by the research community. Shallow surface-level metrics, such as BLEU and TER (Papineni et al., 2002; Snover et al., 2006), still predominate in practice, due in part to their reasonable correlation with human judgements, and to their being parameter free, making them easily portable to new languages. In contrast, trained metrics (Song and Cohn, 2011; Stanojevic and Sima'an, 2014; Ma et al., 2017; Shimanaka et al., 2018), which are trained to match human evaluation data, have been shown to yield a large boost in performance.
This paper aims to improve over existing MT evaluation methods by developing a series of new metrics based on contextual word embeddings (Peters et al., 2018; Devlin et al., 2019), a technique that captures rich and portable representations of words in context, and which has been shown to provide an important signal for many other NLP tasks (Rajpurkar et al., 2018). We propose a simple untrained model that uses off-the-shelf contextual embeddings to compute approximate recall when comparing a reference to an automatic translation, as well as trained models, including: a recurrent model over reference and translation sequences, incorporating attention; and the adaptation of an NLI method (Chen et al., 2017) to MT evaluation. These approaches, though simple in formulation, are highly effective, and rival or surpass the best approaches from WMT 2017. Moreover, we show further improvements in performance when our trained models are learned using noisy crowd-sourced data, i.e., having single annotations for more instances is better than collecting and aggregating multiple annotations for fewer instances. The net result is an approach that is more data efficient than existing methods, while producing substantially better correlations with human judgements.


Related Work
MT metrics attempt to automatically predict the quality of a translation by comparing it to a reference translation of the same source sentence.
Metrics such as BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) use n-gram matching or more explicit word alignment to match the system output with the reference translation. Character-level variants such as BEER, CHRF and CHARACTER overcome the problem of harshly penalising morphological variants, and perform surprisingly well despite their simplicity (Stanojevic and Sima'an, 2014; Popović, 2015; Wang et al., 2016).
In order to allow for variation in word choice and sentence structure, other metrics use information from shallow linguistic tools such as POS taggers, lemmatizers and synonym dictionaries (Banerjee and Lavie, 2005; Snover et al., 2006; Liu et al., 2010), or deeper linguistic information such as semantic roles, dependency relationships, syntactic constituents, and discourse roles (Giménez and Màrquez, 2007; Castillo and Estrella, 2012; Guzmán et al., 2014). On the flip side, such metrics are likely too permissive of mistakes.
More recently, metrics such as MEANT 2.0 (Lo, 2017) have adopted word embeddings (Mikolov et al., 2013) to capture the semantics of individual words. However, classic word embeddings are independent of word context, and context is captured instead using hand-crafted features or heuristics.
Neural metrics such as ReVal and RUSE address this problem by directly learning embeddings of the entire translation and reference sentences. ReVal (Gupta et al., 2015) learns sentence representations of the MT output and reference translation with a Tree-LSTM, and then models their interactions using the element-wise difference and the angle between the two. RUSE (Shimanaka et al., 2018) has a similar architecture, but uses pre-trained sentence representations instead of learning the sentence representations from the data.
The Natural Language Inference (NLI) task is similar to MT evaluation (Padó et al., 2009): a good translation entails the reference, and vice versa, while an irrelevant or wrong translation would be neutral or contradictory with respect to the reference. An additional complexity is that MT outputs are not always fluent. On the NLI datasets, systems that include pairwise word interactions when learning sentence representations achieve higher accuracy than systems that process the two sentences independently (Rocktäschel et al., 2016; Chen et al., 2017; Wang et al., 2017). In this paper, we bring this idea to neural MT metrics.

Model
We wish to predict the score of a translation t of length l_t against a human reference r of length l_r. For all models, we use fixed pre-trained contextualised word embeddings e_k to represent each word in the MT output and reference translation, in the form of matrices W_t and W_r.

Unsupervised Model
We use cosine similarity to measure the pairwise similarity between t and r, based on the maximum similarity score for each word embedding e_i ∈ t with respect to each word embedding e_j ∈ r. We approximate the recall of a word in r by its maximum similarity with any word in t. The final predicted score, y, for a translation is the average recall of its reference:

y = (1 / l_r) Σ_{e_j ∈ r} max_{e_i ∈ t} cos(e_i, e_j)
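As a concrete illustration, this recall computation can be sketched in a few lines of numpy (the matrix layout and function name here are our own, not from a released implementation):

```python
import numpy as np

def bertr_score(W_t, W_r):
    """Approximate recall of the reference against the translation.

    W_t: (l_t, d) matrix of contextual embeddings for the translation.
    W_r: (l_r, d) matrix of contextual embeddings for the reference.
    Each reference word is credited with its maximum cosine similarity
    to any translation word; the score is the mean over the reference.
    """
    # Normalise rows so that dot products become cosine similarities.
    t = W_t / np.linalg.norm(W_t, axis=1, keepdims=True)
    r = W_r / np.linalg.norm(W_r, axis=1, keepdims=True)
    sim = r @ t.T                  # (l_r, l_t) pairwise cosine similarities
    return sim.max(axis=1).mean()  # average best-match recall
```

Because each reference word only needs one close match, a near-synonym that n-gram matching would penalise harshly can still receive high credit if its contextual embedding is close to some translation word.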

Supervised Models
Trained BiLSTM We first encode the embeddings of the translation and reference with a bidirectional LSTM (BiLSTM), and concatenate the max-pooled and average-pooled hidden states of the BiLSTM to generate v_t and v_r, respectively:

H_t = BiLSTM(W_t),  v_t = [max(H_t); avg(H_t)]
H_r = BiLSTM(W_r),  v_r = [max(H_r); avg(H_r)]

To obtain the predicted score, we run a feedforward network over the concatenation of the sentence representations of t and r, together with their element-wise product and difference (a useful heuristic first proposed by Mou et al. (2016)):

y = FFN([v_t; v_r; v_t ⊙ v_r; v_t - v_r])

We train the model by minimizing mean squared error with respect to human scores.
This is similar to RUSE, except that we learn the sentence representation instead of using pre-trained sentence embeddings.
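The pooling and matching steps above can be sketched in numpy (arrays standing in for BiLSTM outputs; the function names are ours):

```python
import numpy as np

def pool(H):
    """Sentence vector from BiLSTM states H (one row per word):
    concatenation of max-pooled and average-pooled states."""
    return np.concatenate([H.max(axis=0), H.mean(axis=0)])

def match_features(v_t, v_r):
    """Input to the feedforward regressor: the two sentence vectors,
    their element-wise product, and their element-wise difference
    (the matching heuristic of Mou et al. (2016))."""
    return np.concatenate([v_t, v_r, v_t * v_r, v_t - v_r])
```

The product and difference terms give the regressor direct access to similarity information that it would otherwise have to recover from the raw concatenation.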
Trained BiLSTM + attention To obtain a sentence representation of the translation which is conditioned on the reference, we compute an attention-weighted representation of each word in the translation. The attention weights are obtained by running a softmax over the dot-product similarities between the hidden states of the translation and reference BiLSTMs, and each attended translation state h̃_t is the corresponding weighted sum of reference states; the attended reference states h̃_r are computed symmetrically. We then pool h̃_t and h̃_r into sentence representations, which replace v_t and v_r in the feedforward regressor above to compute the final scores.
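The dot-product attention described here can be sketched as follows (numpy, with matrices standing in for BiLSTM hidden states; names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attend(H_t, H_r):
    """Attention-weighted representations of each sentence with
    respect to the other.

    H_t: (l_t, d) translation states; H_r: (l_r, d) reference states.
    Returns H_t_tilde (l_t, d) and H_r_tilde (l_r, d).
    """
    e = H_t @ H_r.T                         # (l_t, l_r) dot-product scores
    H_t_tilde = softmax(e, axis=1) @ H_r    # each t word as a mix of r states
    H_r_tilde = softmax(e.T, axis=1) @ H_t  # each r word as a mix of t states
    return H_t_tilde, H_r_tilde
```

When the reference has a single word, every attended translation state collapses to that word's state, which is a handy sanity check on the softmax axis.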

Enhanced Sequential Inference Model (ESIM):
We also directly adapt ESIM (Chen et al., 2017), a high-performing model on the Natural Language Inference task, to the MT evaluation setting. We treat the human reference translation and the MT output as the premise and hypothesis, respectively.
The ESIM model first encodes r and t with a BiLSTM, then computes the attention-weighted representation of each with respect to the other, as in the attention model above. It next "enhances" the representations of the translation (and reference) by capturing the interactions between the encoded states h_t and their attended counterparts h̃_t (and between h_r and h̃_r):

m_t = [h_t; h̃_t; h_t ⊙ h̃_t; h_t - h̃_t]
m_r = [h_r; h̃_r; h_r ⊙ h̃_r; h_r - h̃_r]

We use a feedforward projection layer to project these representations back to the model dimension, and then run a BiLSTM over each representation to compose local sequential information. The final representation of each of the reference and translation sentences is the concatenation of the average-pooled and max-pooled hidden states of this BiLSTM. To compute the final predicted score, we apply a feedforward regressor over the concatenation of the two sentence representations.
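The enhancement step is a per-word concatenation; a numpy sketch (our naming):

```python
import numpy as np

def enhance(H, H_tilde):
    """ESIM enhancement: for each word, concatenate the BiLSTM state,
    its attention-weighted counterpart from the other sentence, and
    their element-wise product and difference."""
    return np.concatenate([H, H_tilde, H * H_tilde, H - H_tilde], axis=1)
```

The result has four times the input dimension, which is why a projection layer is needed before the second BiLSTM.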
For all models, the predicted score of an MT system is the average predicted score of all its translations in the testset.

Experimental Setup
We use human evaluation data from the Conference on Machine Translation (WMT) to train and evaluate our models (Bojar et al., 2016, 2017a), which is based on the Direct Assessment ("DA") method (Graham et al., 2015). Here, system translations are evaluated by humans in comparison to a human reference translation, using a continuous scale. Each annotator assesses a set of 100 items, of which 30 are for quality control, used to filter out annotators who are unskilled or careless. Individual worker scores are first standardised, and the final score of an MT system is then computed as the average score across all translations in the test set. Manual MT evaluation is subjective and difficult, and it is not possible even for a diligent human to be entirely consistent on a continuous scale; any human annotations are thus noisy by nature. To obtain an accurate score for an individual translation, the average score is calculated from the scores of at least 15 "good" annotators. This data is then used to evaluate automatic metrics at the sentence level (Graham et al., 2015).
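The per-annotator standardisation is simple z-normalisation; a minimal sketch (our function name, assuming raw DA scores on a 0-100 scale):

```python
import numpy as np

def standardise(raw_scores):
    """z-normalise one annotator's raw DA scores so that differences
    in individual leniency (mean) and spread (std) are removed
    before scores are averaged across annotators."""
    raw = np.asarray(raw_scores, dtype=float)
    return (raw - raw.mean()) / raw.std()
```

Only relative judgements within one annotator's batch survive this step, which is exactly the signal DA aims to aggregate.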
We train on the human evaluation data of the news domain of WMT 2016, which is entirely crowdsourced. The sentence-level-metric evaluation data consists of accurate scores for 560 translations for each of six to-English language pairs and English-to-Russian (we call this the "TrainS" dataset). The dataset also includes mostly singly-annotated DA scores for around 125 thousand translations from six source languages into English, and 12.5 thousand translations from English into Russian (the "TrainL" dataset), which were collected to obtain human scores for MT systems.
For the validation set, we use the sentence-level DA judgements collected for the WMT 2015 data (Bojar et al., 2015): 500 translation-reference pairs for each of four to-English language pairs and English-to-Russian.
For more details on implementation and training of our models, see Appendix A.
We test our metrics on all language pairs from the WMT 2017 news task (Bojar et al., 2017b), in both the sentence-level and system-level settings, and evaluate using Pearson's correlation between our metrics' predictions and the human DA scores.
For the sentence-level evaluation, insufficient DA annotations were collected for five from-English language pairs, so the scores were instead converted into preference judgements. If two MT system translations of a source sentence were each evaluated by at least two reliable annotators, and the average score of System A is sufficiently greater than that of System B, this is interpreted as a Relative Ranking (DARR) judgement that System A is better than System B. The metrics are then evaluated using (a modified version of) Kendall's Tau correlation over these preference judgements.
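Under stated assumptions about tie handling (conventions differ across WMT years; here metric ties count as discordant), the DARR-based evaluation can be sketched as:

```python
def darr_kendall_tau(preferences, metric_scores):
    """Kendall's-Tau-style correlation over DARR preference judgements.

    preferences: iterable of (better, worse) translation ids from human DARR.
    metric_scores: dict mapping translation id -> metric score.
    Returns (concordant - discordant) / (concordant + discordant).
    """
    conc = disc = 0
    for better, worse in preferences:
        if metric_scores[better] > metric_scores[worse]:
            conc += 1  # metric agrees with the human preference
        else:
            disc += 1  # metric disagrees (or ties)
    return (conc - disc) / (conc + disc)
```

A metric that agrees with every human preference scores 1.0; random scoring tends towards 0.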
We also evaluate on out-of-domain, system level data for five from-English language pairs from the WMT 2016 IT task.

Results
Tab. 1 compares the performance of our proposed metrics against existing metrics on the WMT 17 to-English news dataset. MEANT 2.0 (Lo, 2017), which uses pre-trained word2vec embeddings (Mikolov et al., 2013), is the best untrained metric, and RUSE (Shimanaka et al., 2018) is the best trained metric. We also include SENT-BLEU and CHRF baselines. Our simple average recall metric ("BERTR") has a higher correlation than all existing untrained metrics, and is highly competitive with RUSE. When trained on the sentence-level data (as with RUSE), the BiLSTM baseline does not perform well; however, adding attention makes it competitive with RUSE. The ESIM model, which has many more parameters, underperforms the BiLSTM model with attention.
However, the performance of all models improves substantially when these metrics are trained on the larger, singly-annotated training data (denoted "TrainL"), i.e., using data from only those annotators who passed quality control. Clearly the additional input instances make up for the increased noise level in the prediction variable. The simple BiLSTM model performs as well as RUSE, and both the models with attention substantially outperform this benchmark.
In this setting, we look at how the performance of ESIM improves as we increase the number of training instances (Fig. 1). We find that on the same number of training instances (3360), the model performs better on cleaner data compared to singly-annotated data (r = 0.57 vs 0.64). However, when we have a choice between collecting multiple annotations for the same instances vs collecting annotations for additional instances, the second strategy leads to more gains.
We now evaluate the unsupervised BERTR model and the ESIM model (trained on the large dataset) in the other settings. In the sentence-level tasks out-of-English (Tab. 4), the BERTR model (based on BERT-Chinese) significantly outperforms all metrics on the English-to-Chinese testset. For other language pairs, BERTR (based on multilingual BERT) is highly competitive with other metrics. ESIM performs well on the language pairs that are evaluated using Pearson's correlation, but the results are mixed when evaluated on preference judgements. This could be an effect of our training method: using squared error as the regression loss is better suited to Pearson's r, and might be resolved through a different loss, such as a hinge loss over pairwise preferences, which would better reflect Kendall's Tau (Stanojevic and Sima'an, 2014). Furthermore, ESIM is trained only on to-English and to-Russian data; it is likely that including more language pairs in the training data would increase correlation.
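A pairwise hinge loss of the kind suggested here could look like the following sketch (our formulation; the margin is a free hyperparameter):

```python
def pairwise_hinge_loss(score_better, score_worse, margin=0.1):
    """Hinge loss over a DARR preference pair: zero once the metric
    ranks the preferred translation higher by at least the margin,
    otherwise linear in the violation. Optimising this directly
    targets ranking quality (Kendall's Tau) rather than regression fit."""
    return max(0.0, margin - (score_better - score_worse))
```

Summed over all preference pairs, this gives a training objective aligned with the DARR evaluation, rather than with Pearson's r.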
On the system-level evaluation of the news domain, both metrics are competitive with all other metrics in all language pairs, both to- and out-of-English (see Tab. 3 and Tab. 4 in Appendix B).
In the IT domain, we have mixed results (Tab. 5 in the Appendix). ESIM significantly outperforms all other metrics in English-Spanish, is competitive in two other language pairs, and is outperformed by other metrics in the remaining two language pairs.

Qualitative Analysis
We manually inspect translations in the validation set. Tab. 6 in Appendix C shows examples of good translations, where our proposed metrics correctly recognise synonyms and valid word re-orderings, unlike SENT-BLEU. However, none of the metrics recognise a different way of expressing the same meaning. From Tab. 7, we see that SENT-BLEU gives high scores to translations with high partial overlap with the reference, but ESIM correctly recognises them as low-quality translations. However, in some cases ESIM can be too permissive of bad translations which contain closely related words. There are also examples where a small difference in words completely changes the meaning of the sentence, yet all the metrics score these translations highly.

Table 2: Pearson's r and Kendall's τ on the WMT 2017 from-English system-level evaluation data. The first section represents existing metrics, both trained and untrained. We then present results of our unsupervised metric, followed by our supervised metric trained in the TrainL setting (noisy 125k instances). Correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold (Williams test (Graham and Baldwin, 2014) for Pearson's r, bootstrap (Efron and Tibshirani, 1993) for Kendall's τ).

Conclusion
We show that contextual embeddings are very useful for MT evaluation, both in simple untrained models and in deeper attention-based methods. When trained on a larger, much noisier set of instances, our models deliver a substantial improvement over the state of the art.
In future work, we plan to extend these models by using cross-lingual embeddings, and combine information from translation-source interactions as well as translation-reference interactions. There are also direct applications to Quality Estimation, by using the source instead of the reference.

A Implementation details
We implement our models using AllenNLP in PyTorch. We experimented with both ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) embeddings, and found that BERT consistently performs as well as, or better than, ELMo; we thus report results using only BERT embeddings in this paper. For BERTR, we use the top-layer embeddings of the wordpieces of the MT output and reference translation. We use bert-base-uncased for all to-English language pairs, bert-base-chinese for English-to-Chinese, and bert-base-multilingual-cased for the remaining language pairs. For the trained metrics, we learn a weighted average of all layers of the BERT embeddings.
On the to-English testsets, we use bert-base-uncased embeddings and train on the WMT 2016 to-English data.
On all other testsets, we use bert-base-multilingual-cased embeddings and train on the WMT 2016 English-to-Russian data, as well as all to-English data.
Following the recommendations of the original ESIM paper, we fix the dimension of the BiLSTM hidden states to 300 and set the dropout rate to 0.5. We use the Adam optimizer with an initial learning rate of 0.0004 and a batch size of 32, and use early stopping on the validation set.
Training the ESIM model on the full dataset takes around two hours on a single V100 GPU, and all models take less than two minutes to evaluate a standard WMT dataset of 3000 translations.