Metrics for Evaluation of Word-level Machine Translation Quality Estimation

The aim of this paper is to investigate suitable evaluation strategies for the task of word-level quality estimation of machine translation. We suggest various metrics to replace the F1-score for the "BAD" class, which is currently used as the main metric. We compare the metrics' performance on real system outputs and on synthetically generated datasets, and suggest a reliable alternative to the F1-BAD score: the multiplication of the F1-scores for the different classes. Other metrics have lower discriminative power and are biased by unfair labellings.


Introduction
Quality estimation (QE) of machine translation (MT) is the task of determining the quality of an automatically translated text without any oracle (reference) translation. This task has lately been receiving significant attention: from confidence estimation (i.e. estimation of how confident a particular MT system is on a word or a phrase (Gandrabur and Foster, 2003)) it evolved to system-independent QE, performed at the word level (Luong et al., 2014), sentence level (Shah et al., 2013) and document level.
The emergence of a large variety of approaches to QE led to the need for reliable ways to compare them. The evaluation metrics that have been used to compare the performance of systems participating in QE shared tasks have received some criticism. Graham (2015) shows that Pearson correlation is better suited to the evaluation of sentence-level QE systems than the mean absolute error (MAE) often used for this purpose. Pearson correlation evaluates how well a system captures the regularities in the data, whereas MAE essentially measures the difference between the true and the predicted scores, and in many cases can be minimised by always predicting the average score as given by the training set labels.
Word-level QE is commonly framed as a binary task, i.e., the classification of every translated word as "OK" or "BAD". This task has been evaluated in terms of the F1-score for the "BAD" class, a metric that favours 'pessimistic' systems, i.e. systems that tend to assign the "BAD" label to most words. A trivial baseline strategy that assigns the label "BAD" to all words can thus receive a high score while being completely uninformative (Bojar et al., 2014). However, no analysis of the word-level metrics' performance has been done, and no alternative metrics have been proposed that are more reliable than the F1-BAD score.
In this paper we compare existing evaluation metrics for word-level QE, suggest a number of alternatives, and show that one of these alternatives leads to more objective and reliable results.

Metrics
One of the reasons word-level QE is a challenging problem is the fact that "OK" and "BAD" labels are not equally important: we are generally more interested in finding incorrect words than in assigning a suitable category to every single word. An ideal metric should be oriented towards the recall for the "BAD" class. However, the case of the F1-BAD score shows that this is not the only requirement: in order to be useful, the metric should not favour pessimistic labellings, i.e., all or most words labelled as "BAD". Below we describe possible alternatives to the F1-BAD score.

F1-score variants
Word-level F1-scores. Since the F1-BAD score is too pessimistic, an obvious solution would be to balance it with the F1-score for the "OK" class. However, the widely used weighted average of the F1-scores for the two classes is not suitable, as it is dominated by F1-OK due to label imbalance: any reasonable MT system nowadays generates texts in which most words are correct, so the label distribution is heavily skewed towards the "OK" class. Therefore, we suggest instead the multiplication of the F1-scores for the individual classes: the product is zero if either component is zero, and since both components are in the [0, 1] range, the overall result does not exceed the value of either multiplier.
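As a sketch of how the multiplied metric behaves, the per-class F1-score and its product can be computed directly. This is our own minimal implementation, not code from the shared task; the function names are ours.

```python
def f1_per_class(gold, pred, cls):
    """F1-score for a single class over two parallel label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_multiplied(gold, pred):
    """Product of the per-class F1-scores for "BAD" and "OK"."""
    return (f1_per_class(gold, pred, "BAD")
            * f1_per_class(gold, pred, "OK"))
```

Note how the trivial all-"BAD" baseline still obtains a respectable F1-BAD (its recall for "BAD" is 1.0), but a zero F1-OK and hence a zero product, which is exactly the behaviour we want from the metric.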
Phrase-level F1-scores. One of the characteristic features of MT errors is their phrase-level nature. Errors are not independent: one incorrect word can influence the classification of its neighbours, and if several adjacent words are tagged as "BAD", they are likely to be part of an error which spans a whole phrase.
Therefore, in addition to word-level F1-scores, we also evaluate alternative metrics based on correctly identified erroneous or error-free spans of words. The phrase-level F1-score we suggest is similar to the one used for the evaluation of named entity recognition (NER) systems (Tjong Kim Sang and De Meulder, 2003): there, precision is the percentage of named entities found by a system that are correct, and recall is the percentage of named entities present in the corpus that are found by the system. For the QE task, instead of named entities we have spans of erroneous (or correct) words: precision is the percentage of correctly identified spans among all spans found by a system, and recall is the percentage of correctly identified spans among the spans in the test data.
However, in NER the correct borders of a named entity are of great importance, because failure to identify them results in an incorrect entity. The exact borders of an error span in QE, on the other hand, are less important: the primary goal is to identify the erroneous region of the sentence, and the task of finding the exact borders of an error cannot be solved unambiguously even by human annotators (Wisniewski et al., 2013). In order to give credit to partially correct phrases (e.g. a 4-word "BAD" phrase where the first word was tagged as "OK" by a system and the remaining words were correctly tagged as "BAD"), we compute the number of true positives as the sum of the percentages of words with correctly predicted tags over every "OK" phrase; the number of true negatives is defined analogously.
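A soft phrase-level F1 with this partial-credit scheme can be sketched as follows. This is our reading of the description above, not the official evaluation script: each gold (or predicted) span contributes the fraction of its words whose tags match, rather than a binary hit.

```python
def spans(labels, cls):
    """Contiguous runs of `cls` as (start, end) pairs, end exclusive."""
    out, start = [], None
    for i, lab in enumerate(list(labels) + [None]):
        if lab == cls and start is None:
            start = i
        elif lab != cls and start is not None:
            out.append((start, i))
            start = None
    return out

def soft_phrase_f1(gold, pred, cls):
    """Phrase-level F1 for class `cls` with partial credit per span."""
    gold_spans, pred_spans = spans(gold, cls), spans(pred, cls)
    # recall side: each gold span contributes the fraction of its words
    # that the system tagged with the correct class
    tp_r = sum(sum(pred[i] == cls for i in range(s, e)) / (e - s)
               for s, e in gold_spans)
    # precision side: each predicted span contributes the fraction of
    # its words whose gold tag really is `cls`
    tp_p = sum(sum(gold[i] == cls for i in range(s, e)) / (e - s)
               for s, e in pred_spans)
    rec = tp_r / len(gold_spans) if gold_spans else 0.0
    prec = tp_p / len(pred_spans) if pred_spans else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

On the 4-word example from the text (a gold "BAD" phrase with one word mis-tagged "OK"), the recall side is 3/4 rather than 0, which is the point of the partial-credit scheme.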

Other metrics
Matthews correlation coefficient. MCC (Powers, 2011) was used as a secondary metric in the WMT14 word-level QE shared task (Bojar et al., 2014). It is determined as follows:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The coefficient takes values in the [−1, 1] range. If the reference and hypothesis labellings agree on the majority of the examples, the final figure is dominated by the TP × TN term, which gets close to the value of the denominator; the more false positives and false negatives the predictor produces, the lower the value of the numerator.
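The formula translates directly into code. In this sketch (ours, with "BAD" arbitrarily taken as the positive class) we return 0 when the denominator vanishes, a common convention for degenerate labellings:

```python
import math

def mcc(gold, pred, positive="BAD"):
    """Matthews correlation coefficient over two parallel label lists."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect labelling yields 1, a fully inverted one yields −1, and chance-level agreement sits near 0.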
Sequence correlation. The sequence correlation score was used as a secondary evaluation metric in the QE shared task at WMT15 (Bojar et al., 2015). Analogously to the phrase-level F1-score, it is based on the intersection of spans of correct and incorrect words. It also weights the phrases to give them equal importance, and penalises differences in the number of phrases between the reference and the hypothesis.

Metrics comparison
One of the most reliable ways of comparing metrics is to measure their correlation with human judgements. However, for the word-level QE task, asking humans to rate a system labelling or to compare the outputs of two or more QE systems is a very expensive process. A practical way of obtaining human judgements is the use of quality labels in downstream human tasks, i.e. tasks where quality labels can be used as additional information and where they can influence human accuracy or speed. One such downstream task is computer-assisted translation, where the user translates a sentence using the automatic translation as a draft, and word-level quality labels can highlight incorrect parts of the sentence. Improvements in productivity could then show the degree of usefulness of the quality labels. However, such an experiment is also very expensive to perform. Therefore, we consider indirect ways of comparing the metrics' reliability based on pre-labelled gold-standard test sets.

Comparison on real systems
One of the purposes of system comparison is to identify the best-performing system. Therefore, we expect a good metric to be able to distinguish between systems as well as possible. One quality criterion for a metric will thus be the number of significantly different groups of systems the metric can identify. Another criterion is to compare the real systems' performance with synthetic datasets for which we know the desirable behaviour of the metrics. If a metric gives the expected scores to all artificially generated datasets, it captures properties of the data that are relevant to us, so we can expect it to also work adequately on real datasets.
Here we compare the performance of six metrics:
• F1-BAD: F1-score for the "BAD" class.
• F1-multiplied: the multiplication of F1-BAD and F1-OK.
• phrase-F1-BAD: phrase-level F1-score for the "BAD" class.
• phrase-F1-multiplied: the multiplication of the phrase-level F1-scores for the two classes.
• MCC: Matthews correlation coefficient.
• SeqCor: sequence correlation.
We used these metrics to rank all systems submitted to the word-level track of the WMT15 QE shared task. In addition to that, we test the performance of the metrics on a number of synthetically created labellings that should be ranked low in comparison to real system labellings:
• all-bad: all words are tagged as "BAD".
• all-good: all words are tagged as "OK".
• optimistic: 98% of words are tagged as "OK", with only a small number of "BAD" labels generated; this system should have high precision (0.9) and low recall (0.1) for the "BAD" label.
• pessimistic: 90% of words are tagged as "BAD"; this system should have high recall (0.9) for the "BAD" label, but low recall (0.1) for the "OK" label.
• random: labels are drawn randomly from the label probability distribution.
We rank the systems (the participating systems are listed and described in (Bojar et al., 2015)) according to all the metrics and compute the level of significance for every pair of systems with randomisation tests (Yeh, 2000) with Bonferroni correction (Abdi, 2007). In order to evaluate the metrics' performance, we compute the system distinction coefficient d: the probability of two systems being significantly different, defined as the ratio between the number of significantly different pairs of systems and the total number of pairs. We also compute d separately for the top half and for the bottom half of the ranked systems list, in order to check how well each metric can discriminate between better performing and worse performing systems. The results are shown in Table 1. For every synthetic dataset we show the number of real system outputs that were rated lower than this dataset, with the rightmost column showing the sum of this figure across all the synthetic sets.
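The system distinction coefficient can be sketched as follows. This is our own illustration: we use a simple per-position swap randomisation test in the spirit of Yeh (2000) and omit the Bonferroni correction; `metric` is any function mapping (gold, predicted) label sequences to a score.

```python
import itertools
import random

def randomization_test(gold, pred_a, pred_b, metric, trials=1000, seed=0):
    """Approximate two-sided randomisation test: randomly swap the two
    systems' predictions position by position, and count how often the
    shuffled score difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(metric(gold, pred_a) - metric(gold, pred_b))
    hits = 0
    for _ in range(trials):
        sa, sb = list(pred_a), list(pred_b)
        for i in range(len(sa)):
            if rng.random() < 0.5:
                sa[i], sb[i] = sb[i], sa[i]
        if abs(metric(gold, sa) - metric(gold, sb)) >= observed:
            hits += 1
    return hits / trials  # empirical p-value

def distinction_coefficient(gold, systems, metric, alpha=0.05):
    """d: the share of system pairs whose scores differ significantly."""
    pairs = list(itertools.combinations(systems, 2))
    significant = sum(
        randomization_test(gold, a, b, metric) < alpha for a, b in pairs)
    return significant / len(pairs) if pairs else 0.0
```

With many systems, d close to 1 means the metric separates almost every pair; computing it on the top and bottom halves of the ranking separately shows where the metric loses resolution.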
We can see that three metrics are better at distinguishing synthetic results from real systems: SeqCor and both multiplied F1-scores. In the case of SeqCor this result is explained by the fact that it favours longer spans of "OK" and "BAD" labels and thus penalises arbitrary labellings. The multiplications of F1-scores have two components which penalise different labellings and balance each other. This is confirmed by the fact that the F1-BAD scores become too pessimistic without the "OK" component: they both favour synthetic systems with prevailing "BAD" labels. Phrase-F1-BAD ranks these systems the highest: all-bad and pessimistic outperform 16 out of 17 real systems according to this metric.
MCC, in contrast, is too 'optimistic': the optimistic dataset is rated higher than most real system outputs. In addition, it is not good at distinguishing between systems: its system distinction coefficient is the lowest among all metrics. SeqCor and phrase-F1-multiplied, despite identifying the artificial datasets, cannot reliably discriminate between real systems: SeqCor fails on the top half of the systems, and phrase-F1-multiplied is poor at finding differences in the bottom half of the list.
Overall, F1-multiplied is the only metric that performs well both in the task of distinguishing synthetic systems from real ones and in the task of discriminating among real systems, despite the fact that its d scores are not the best. However, F1-BAD is not far behind: it has high d scores and can identify synthetic datasets quite often.

Table 1: Results for all metrics. Numbers in the synthetic dataset columns denote the number of system submissions that were rated lower than the corresponding synthetic dataset.

Comparison on synthetic datasets
The experiment described above has a notable drawback: we evaluated the metrics on the outputs of systems which had been tuned to maximise the F1-BAD score. This means that the system rankings produced by other metrics may be unfairly considered inaccurate. Therefore, we suggest a more objective metric evaluation procedure which uses only synthetic datasets. We generate datasets with different proportions of errors, compute the metrics' values and their statistical significance, and then compare the metrics' discriminative power. We refer to this procedure as repeated sampling, because we sample artificial datasets multiple times.
Our goal is for the synthetic datasets to simulate real systems' output. We achieve this with the following data generation procedure:
• Choose the proportion of errors to introduce into the synthetic data.
• Collect all sequences that contain incorrect labels from the outputs of real systems.
• Randomly choose sequences from this set until the overall number of errors reaches the chosen threshold.
• Take the rest of the segments from the gold-standard labelling (so that they contain no errors).
Thus our artificial datasets contain a specific number of errors, all of which come from real systems. We can generate datasets with very small differences in quality and identify the metrics for which this difference is more significant.
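One way to realise this generation procedure is sketched below. It is our interpretation of the steps above: an "error" is a disagreement with the gold label, and sentence-level labellings from a real system are mixed into the gold labelling until the target error rate is reached.

```python
import random

def make_synthetic(gold_sents, system_sents, error_rate, seed=0):
    """Mix real-system sentence labellings into the gold labelling until
    the share of wrong labels reaches roughly `error_rate`.

    gold_sents, system_sents: parallel lists of per-sentence label lists.
    """
    rng = random.Random(seed)
    n_words = sum(len(g) for g in gold_sents)
    target = int(error_rate * n_words)
    # sentences where the system labelling disagrees with gold somewhere
    wrong = [i for i, (g, s) in enumerate(zip(gold_sents, system_sents))
             if g != s]
    rng.shuffle(wrong)
    synthetic = [list(g) for g in gold_sents]  # start from gold: no errors
    errors = 0
    for i in wrong:
        if errors >= target:
            break
        synthetic[i] = list(system_sents[i])
        errors += sum(g != s for g, s in zip(gold_sents[i], system_sents[i]))
    return synthetic
```

Because whole sentences are swapped in, the final error count only approximately matches the threshold, but every error pattern comes from a real system, as the procedure requires.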
Let us compare the discriminative power of metrics m1 and m2. We choose two error thresholds e1 and e2. We then sample a relatively small number (e.g. 100) of random datasets with e1 errors, and another 100 random datasets with e2 errors. We compute the values of both metrics on the two sets of random samples, and for each metric we test whether the difference between the results for the two sets is significant (we compute statistical significance using an unpaired t-test with Bonferroni correction). Since we sampled the synthetic datasets only a small number of times, it is likely that neither metric will detect any significant differences. In this case we repeat the process with a larger number of samples (e.g. 200) and compare the p-values of the two metrics again. By gradually increasing the number of samples, at some point we will find that one of the metrics recognises the difference in scores as statistically significant while the other one does not. This means that the former metric has higher discriminative power: it needs fewer samples to determine that the systems are different. The procedure is outlined in Algorithm 1.
In our experiments, in order to make the p-values more stable, we repeat each sampling round (sampling of a set with e_i errors 100, 200, etc. times) 1,000 times and use the average of the p-values. We used fixed sets of sample numbers ([100, 200, 500, 1000, 2000, 5000, 10000]) and error thresholds ([30%, 30.01%, 30.05%, 30.1%, 30.2%]). The significance level α is 0.05.
Since we compare all six metrics on five error thresholds, we have 10 p-values for each metric at every sampling round. We analyse the results in the following way: for every difference in the percentage of errors (e.g. the thresholds of 30% and 30.01% give a 0.01% difference; the thresholds of 30% and 30.2% give a 0.2% difference), we define the minimum number of samplings that a metric needs in order to observe significant differences between datasets which differ in this number of errors.

Algorithm 1: Repeated sampling for metrics m1, m2 and error thresholds e1, e2.
  Result: m_x ∈ {m1, m2}, the metric with the higher discriminative power on error thresholds e1 and e2
  N ← 100
  α ← significance level
  while p-val(m1) ≥ α and p-val(m2) ≥ α do
    s1 ← N random samples with e1 errors
    s2 ← N random samples with e2 errors
    p-val(m1) ← t-test(m1(s1), m1(s2))
    p-val(m2) ← t-test(m2(s1), m2(s2))
    if p-val(m1) < α and p-val(m2) ≥ α then return m1
    else if p-val(m1) ≥ α and p-val(m2) < α then return m2
    else N ← N + 100
  end

Table 2: Repeated sampling: the minimum number of samplings required to discriminate between samples with different proportions of errors.

Table 2 shows the results. Numbers in the cells are the minimum numbers of samplings. We do not show error differences greater than 0.2% because all metrics identify them well. The metrics are sorted by discriminative power from best to worst, i.e. metrics at the top of the table require fewer samplings to tell one synthetic dataset from another.
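The repeated-sampling comparison can be sketched in code as follows. This is an illustrative implementation under our own assumptions: `sample(e, n)` is a caller-supplied function drawing n synthetic datasets with error rate e, the t-test is a Welch test with a normal approximation to the t distribution (adequate for the large sample counts used here), and the p-value averaging and Bonferroni correction from the experiments are omitted for brevity.

```python
import math
import random
import statistics

def welch_p(xs, ys):
    """Two-sided p-value for a Welch (unpaired, unequal-variance) t-test,
    using the normal approximation to the t distribution."""
    nx, ny = len(xs), len(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    t = (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

def compare_metrics(sample, metric1, metric2, e1, e2,
                    alpha=0.05, step=100, max_n=10000):
    """Return "m1" or "m2" for the metric that first separates datasets
    with e1 vs e2 errors; None on a tie or if neither ever succeeds."""
    n = step
    while n <= max_n:
        s1, s2 = sample(e1, n), sample(e2, n)
        p1 = welch_p([metric1(d) for d in s1], [metric1(d) for d in s2])
        p2 = welch_p([metric2(d) for d in s1], [metric2(d) for d in s2])
        if p1 < alpha and p2 >= alpha:
            return "m1"
        if p2 < alpha and p1 >= alpha:
            return "m2"
        if p1 < alpha and p2 < alpha:
            return None  # tie: both detect the difference at this n
        n += step        # neither succeeded: draw more samples
    return None
```

A metric that tracks the true error rate needs few samples to separate nearby thresholds, while a noisy metric keeps forcing n upward, which is exactly the ordering reported in Table 2.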
As in the previous experiment, the discriminative power of the multiplication of F1-scores is the highest here. Surprisingly, MCC performs equally well. Similarly to the experiment with real systems, the F1-BAD metric performs worse than the F1-multiplied metric, but here the difference is more salient. All phrase-motivated metrics show worse results.