Alternative Objective Functions for Training MT Evaluation Metrics

MT evaluation metrics are tested for correlation with human judgments at either the sentence level or the corpus level. Trained metrics ignore corpus-level judgments and are trained for high sentence-level correlation only. We show that training for only one objective (sentence or corpus level) can not only harm performance on the other objective, but can also be suboptimal for the objective being optimized. To this end we present a metric trained for the corpus level and compare it empirically against a metric trained for the sentence level, showing how their performance varies per language pair, type and level of judgment. Subsequently we propose a model trained to optimize both objectives simultaneously and show that it is far more stable than both single-objective models and, on average, outperforms them on both objectives.


Introduction
Ever since BLEU (Papineni et al., 2002), many proposals for an improved automatic evaluation metric for Machine Translation (MT) have been made. Some proposals use additional information for extracting quality indicators, such as paraphrasing (Denkowski and Lavie, 2011), syntactic trees (Liu and Gildea, 2005; Stanojević and Sima'an, 2015) or shallow semantics (Rios et al., 2011; Lo et al., 2012), whereas others use different matching strategies, such as n-grams (Papineni et al., 2002), treelets (Liu and Gildea, 2005) and skip-bigrams (Lin and Och, 2004). Most metrics use several indicators of translation quality, which are often combined in a linear model whose weights are estimated on a training set of human judgments.
Because the most widely available type of human judgments are relative ranking (RR) judgments, the main machine learning methods used for training the metrics have been based on the learning-to-rank framework (Li, 2011). While the effectiveness of this framework for training evaluation metrics has been confirmed many times, e.g., (Ye et al., 2007; Duh, 2008; Stanojević and Sima'an, 2014; Ma et al., 2016), so far there is no prior work exploring alternative objective functions for training learning-to-rank models. Without exception, all existing learning-to-rank models are trained to rank sentences while completely ignoring corpus-level judgments, likely because human judgments come in the form of sentence rankings.
It might seem that sentence and corpus level tasks are very similar but that is not the case. Empirically it has been shown that many metrics that perform well on the sentence level do not perform well on the corpus level and vice versa. By training to rank sentences the model does not necessarily learn to give scores that are well scaled, but only to give higher scores to better translations. Training for the corpus level score would force the metric to give well scaled scores on the sentence level.
Human judgments of sentences can be aggregated in different ways to hypothesize human judgments of full corpora. However, this fact has not been used so far to train learning-to-rank models that are good for ranking different corpora.
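As an illustration of one such aggregation, the following is a minimal sketch of an Expected Wins style score computed from pairwise system comparisons. The function name `expected_wins`, the input format (a list of winner/loser system pairs with ties already removed) and the choice to skip opponents that were never compared are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def expected_wins(judgments):
    """Aggregate pairwise judgments into one score per system.

    judgments: iterable of (winner_system, loser_system) pairs;
    ties are assumed to have been dropped beforehand.
    """
    wins = defaultdict(int)  # wins[(a, b)] = number of times a beat b
    systems = set()
    for winner, loser in judgments:
        wins[(winner, loser)] += 1
        systems.update((winner, loser))

    scores = {}
    for system in systems:
        ratios = []
        for other in systems:
            if other == system:
                continue
            total = wins[(system, other)] + wins[(other, system)]
            if total > 0:  # assumption: skip opponents never compared
                ratios.append(wins[(system, other)] / total)
        # Average win ratio over all opponents the system was compared with.
        scores[system] = sum(ratios) / len(ratios) if ratios else 0.0
    return scores


toy = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B")]
scores = expected_wins(toy)
delta_human = scores["A"] - scores["B"]  # by how much A is judged better than B
```

The difference between two such scores gives a hypothesized human margin between two corpora, which is the kind of quantity a corpus-level objective can be trained against.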
This work fills this gap by exploring the merits of different objective functions that take corpus level judgments into consideration. We first create a learning-to-rank model for ranking corpora and compare it to the standard learning-to-rank model that is trained for ranking sentences. This comparison shows that the performance of these two objectives can vary radically depending on the chosen meta-evaluation method. To tackle this problem we contribute a new objective function, inspired by multi-task learning, in which we train for both objectives simultaneously. This multi-objective model behaves much more stably over all methods of meta-evaluation and achieves a higher correlation than both single-objective models.
Figure 1: Computation Graph

Models
All the models that we define have one basic function in common, which we call the forward(·) function; it maps the features of any sentence to a single real number. That function can be any differentiable function, including multi-layer neural networks as in (Ma et al., 2016), but here we stick with the standard linear model:

$$\mathrm{forward}(\phi) = \phi^{\top} w + b$$

Here φ is a vector with the feature values of a sentence, w is a weight vector and b is a bias term. Usually in training we would like to process a mini-batch of feature vectors Φ, where Φ is a matrix in which each column is the feature vector of an individual sentence in the mini-batch or in the corpus. By using broadcasting we can rewrite the previous definition of the forward(·) function as:

$$\mathrm{forward}(\Phi) = \Phi^{\top} w + b$$

Now we can define the score of a sentence as a sigmoid function applied over the output of the forward(·) function, because we want to get a score between 0 and 1:

$$\mathrm{sentScore}(\Phi) = \sigma(\mathrm{forward}(\Phi))$$

As the corpus level score we use just the average of the sentence level scores:

$$\mathrm{corpScore}(\Phi) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{sentScore}(\Phi)_i$$

where m is the number of sentences in the corpus. Next we present several objective functions that are illustrated by the computation graph in Figure 1.
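The linear forward pass, the sigmoid sentence score and the averaged corpus score can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' TensorFlow code: the names `sent_scores` and `corp_score` are chosen for this example, and sentences are stored as a list of feature vectors (one per sentence) instead of as matrix columns.

```python
import math


def forward(features, w, b):
    """Linear model: one unsquashed real number per sentence."""
    return [sum(wi * xi for wi, xi in zip(w, phi)) + b for phi in features]


def sent_scores(features, w, b):
    """Sigmoid over the forward pass, so every score lies in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in forward(features, w, b)]


def corp_score(features, w, b):
    """Corpus score: the plain average of the sentence scores."""
    scores = sent_scores(features, w, b)
    return sum(scores) / len(scores)


# Toy usage with two sentences and two hypothetical features each.
w, b = [1.0, -1.0], 0.0
features = [[1.0, 1.0], [2.0, 0.0]]
raw = forward(features, w, b)
sent = sent_scores(features, w, b)
corp = corp_score(features, w, b)
```

Because the corpus score is an average of squashed sentence scores, any objective defined on it constrains how the individual sentence scores are scaled, not only how they are ordered.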

Training for Sentence Level Accuracy
Here we use the training objective very similar to BEER (Stanojević and Sima'an, 2014) which is a learning-to-rank framework that finds a separating hyper-plane between "good" and "bad" translations. Unlike BEER, we use a max-margin objective instead of logistic regression.
For each mini-batch we randomly select m human relative ranking pairwise judgments and, after extracting features for all the sentences taking part in these judgments, we put the features in two matrices $\Phi_{\mathrm{swin}}$ and $\Phi_{\mathrm{slos}}$. These matrices are structured in such a way that for judgment i the column i in $\Phi_{\mathrm{swin}}$ contains the features of the "good" translation in the judgment and the column i in $\Phi_{\mathrm{slos}}$ the features of the "bad" translation.
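We would like to maximize the average margin that separates the sentence level scores of the pairs of translations in each judgment. Because the squashing sigmoid function does not influence the ranking, we can directly optimize on the unsquashed forward pass and require that the margin between the "good" and the "bad" translation is at least 1:

$$\Delta_{\mathrm{sent}} = \mathrm{forward}(\Phi_{\mathrm{swin}}) - \mathrm{forward}(\Phi_{\mathrm{slos}})$$

$$\mathrm{Loss}_{\mathrm{sent}} = \frac{1}{m} \sum_{i=1}^{m} \max(0,\, 1 - \Delta_{\mathrm{sent},i})$$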
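A minimal sketch of this max-margin sentence objective, assuming a linear forward pass and a mini-batch given as two parallel lists of feature vectors (winners and losers), could look as follows. The function name `sentence_loss` and the list-based data layout are assumptions of this example.

```python
def forward(features, w, b):
    """Unsquashed linear score for each feature vector."""
    return [sum(wi * xi for wi, xi in zip(w, phi)) + b for phi in features]


def sentence_loss(win_features, los_features, w, b):
    """Average hinge loss: each winner should beat its loser by a margin of 1."""
    win = forward(win_features, w, b)
    los = forward(los_features, w, b)
    hinge = [max(0.0, 1.0 - (s_win - s_los)) for s_win, s_los in zip(win, los)]
    return sum(hinge) / len(hinge)


# Two toy judgments with a single hypothetical feature.
w, b = [1.0], 0.0
loss = sentence_loss([[3.0], [1.0]], [[1.0], [0.5]], w, b)
# First pair is separated by 2.0 (no loss); second by 0.5 (loss 0.5).
```

Note that the bias b cancels in the difference between the two forward passes, so this objective on its own places no constraint on it.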

Training for Corpus Level Accuracy
At the corpus level we would like to do something similar to the sentence level: maximize the distance between the scores of "good" and "bad" corpora. In this case we have additional information that is not present on the sentence level: we know not only which corpus is better (according to humans), but also by how much it is better. For that we can use one of the heuristics such as Expected Wins (Koehn, 2012). We can use this information to tell the learning model by how much it should separate the scores of two corpora. To do this we use an approach similar to Max-Margin Markov Networks (Taskar et al., 2003), where for each training instance we dynamically scale the margin that should be enforced. We want the margin between the metric scores $\Delta_{\mathrm{corp}}$ to be at least as big as the margin between the human scores $\Delta_{\mathrm{human}}$ assigned to these systems. In one mini-batch we use only a randomly chosen pair of corpora, with feature matrices $\Phi_{\mathrm{cwin}}$ and $\Phi_{\mathrm{clos}}$, for which we have a human comparison. With corpScore(·) denoting the corpus level score (the average of the sentence level scores), the corpus level loss function is given by:

$$\Delta_{\mathrm{corp}} = \mathrm{corpScore}(\Phi_{\mathrm{cwin}}) - \mathrm{corpScore}(\Phi_{\mathrm{clos}})$$

$$\mathrm{Loss}_{\mathrm{corp}} = \max(0,\, \Delta_{\mathrm{human}} - \Delta_{\mathrm{corp}})$$
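The dynamically scaled margin can be sketched as a hinge loss whose required margin is the human score difference. The sketch below assumes a linear model with sigmoid sentence scores averaged into a corpus score; the names `corp_score`, `corpus_loss` and `delta_human` are chosen for this example.

```python
import math


def corp_score(features, w, b):
    """Average of sigmoid-squashed linear sentence scores."""
    scores = []
    for phi in features:
        z = sum(wi * xi for wi, xi in zip(w, phi)) + b
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return sum(scores) / len(scores)


def corpus_loss(win_features, los_features, delta_human, w, b):
    """Hinge loss whose margin is the human score difference of the two corpora."""
    delta_corp = corp_score(win_features, w, b) - corp_score(los_features, w, b)
    return max(0.0, delta_human - delta_corp)


# With zero weights both corpora score 0.5, so the whole human margin is lost.
untrained = corpus_loss([[1.0]], [[-1.0]], 0.3, [0.0], 0.0)
# With a large positive weight the corpora are separated by almost 1.
separated = corpus_loss([[1.0]], [[-1.0]], 0.3, [10.0], 0.0)
```

Since the corpus scores lie between 0 and 1, the enforced margin is only attainable when the human score difference is on a comparable scale, which holds for win-ratio style scores.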

Training Jointly for Sentence and Corpus Level Accuracy
In this model we optimize both objectives jointly in the style of multi-task learning (Caruana, 1997).
Here we employ the simplest approach of just taking the interpolation of the previously introduced sentence level and corpus level loss functions:

$$\mathrm{Loss}_{\mathrm{joint}} = \alpha \cdot \mathrm{Loss}_{\mathrm{sent}} + (1 - \alpha) \cdot \mathrm{Loss}_{\mathrm{corp}}$$

The interpolation is controlled by the hyperparameter α, which could in principle be tuned for good performance, but here we just fix it to 0.5 to give both objectives equal importance.
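As a small illustration, the interpolation can be written as a one-line function over two already computed loss values. Which of the two losses is weighted by α and which by 1 − α is an assumption of this sketch; with α = 0.5 the choice makes no difference.

```python
def joint_loss(loss_sent, loss_corp, alpha=0.5):
    """Linear interpolation of the sentence level and corpus level losses."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_sent + (1.0 - alpha) * loss_corp


equal_weight = joint_loss(0.25, 0.75)            # both objectives count equally
sentence_only = joint_loss(0.25, 0.75, alpha=1.0)  # reduces to the sentence loss
```

Setting α to 1 or 0 recovers the two single-objective models, so the joint model can be seen as a point between them.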

Feature Functions
The feature functions that we use are reimplementations of many (but not all) of the feature functions of BEER. Because the point of this paper is the exploration of different objective functions, we did not experiment with more complex feature functions based on paraphrasing, function words or permutation trees.
We use just simple precision, recall and 3 types of F-score (with β parameters 1, 2 and 0.5) over different "pieces" of translation:
• character n-grams of orders 1, 2, 3, 4 and 5
• word n-grams of orders 1, 2, 3 and 4
• skip-bigrams of maximum skip 2 and ∞ (similar to ROUGE-S2 and ROUGE-S* (Lin and Och, 2004))
One final feature deals with length-disbalance. If the lengths of the system and reference translations are a and b respectively, then this feature is computed as

$$\frac{\max(a, b) - \min(a, b)}{\min(a, b)}$$

It is computed both for word and character length.
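The feature families listed above can be sketched as follows. This is a simplified illustration and not BEER's implementation: the use of clipped (multiset) matches, the exact interpretation of "maximum skip" as the number of tokens allowed between the two words, and all function names are assumptions of this example. The length feature assumes both lengths are positive.

```python
from collections import Counter


def ngrams(seq, n):
    """Multiset of n-grams; seq may be a list of words or a string of characters."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def skip_bigrams(seq, max_skip=None):
    """Ordered pairs with at most max_skip tokens in between (None = unlimited)."""
    pairs = Counter()
    for i in range(len(seq)):
        last = len(seq) if max_skip is None else min(len(seq), i + max_skip + 2)
        for j in range(i + 1, last):
            pairs[(seq[i], seq[j])] += 1
    return pairs


def prf(sys_counts, ref_counts, betas=(1.0, 2.0, 0.5)):
    """Precision, recall and one F-score per beta over clipped matches."""
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / max(sum(sys_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f_scores = []
    for beta in betas:
        denom = beta ** 2 * precision + recall
        f = (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0
        f_scores.append(f)
    return [precision, recall] + f_scores


def length_disbalance(a, b):
    """(max(a, b) - min(a, b)) / min(a, b); assumes both lengths are positive."""
    return (max(a, b) - min(a, b)) / min(a, b)


sys_tok = "the cat sat".split()
ref_tok = "the cat sat down".split()
unigram_feats = prf(ngrams(sys_tok, 1), ngrams(ref_tok, 1))
skip_feats = prf(skip_bigrams(sys_tok, 2), skip_bigrams(ref_tok, 2))
word_len_feat = length_disbalance(len(sys_tok), len(ref_tok))
```

Concatenating the five values returned by `prf` for every piece type, together with the two length features, would give one feature vector φ per sentence.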
All of the models are implemented using TensorFlow and trained with L2 regularization (λ = 0.001) and the ADAM optimizer with learning rate 0.001. The mini-batch size for sentence level judgments is 2000, and for the corpus level it is one comparison. Each model is trained for 200 epochs, and the epoch that performs best on the validation set for the objective function being optimized is used at test time.
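The training regime can be illustrated with a deliberately simplified pure-Python loop for the sentence level objective. Several things here are assumptions made to keep the sketch short and dependency-free: plain full-batch subgradient descent replaces ADAM and mini-batching, the bias is omitted because it cancels in pairwise differences, and pairwise ranking accuracy stands in for the validation criterion, which the text does not spell out.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_accuracy(pairs, w):
    """Fraction of (winner, loser) feature pairs that w ranks correctly."""
    return sum(dot(w, win) > dot(w, los) for win, los in pairs) / len(pairs)


def train(train_pairs, valid_pairs, epochs=200, lr=0.001, l2=0.001):
    dim = len(train_pairs[0][0])
    w = [0.0] * dim
    best_w, best_acc = list(w), -1.0
    for _ in range(epochs):
        # Subgradient of the mean hinge loss plus the penalty l2 * ||w||^2.
        grad = [2.0 * l2 * wi for wi in w]
        for win, los in train_pairs:
            diff = [a - b for a, b in zip(win, los)]
            if 1.0 - dot(w, diff) > 0.0:  # margin of 1 is violated
                for k in range(dim):
                    grad[k] -= diff[k] / len(train_pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        # Keep the weights of the epoch that does best on validation.
        acc = pairwise_accuracy(valid_pairs, w)
        if acc > best_acc:
            best_w, best_acc = list(w), acc
    return best_w, best_acc


train_pairs = [([0.9, 0.2], [0.1, 0.2]), ([0.8, 0.5], [0.3, 0.5])]
valid_pairs = [([0.7, 0.1], [0.2, 0.1])]
best_w, best_acc = train(train_pairs, valid_pairs, epochs=5)
```

Selecting the epoch on held-out data rather than taking the last one guards against the later epochs overfitting the training judgments.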
We show the results for correlation with relative ranking (RR) judgments in Table 1. For each language pair of the form en-X we show the result under the column X, and for the language pairs that have English on the target side we present their average under the column en.
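The text does not spell out how sentence level RR correlation is computed. A commonly used formulation for RR judgments is a Kendall's tau-like statistic over pairwise judgments, sketched below under that assumption; the function name, the input format and the decision to count metric ties as discordant are all choices made for this example.

```python
def rr_kendall_tau(judgments, metric_score):
    """Kendall's tau-like agreement between a metric and pairwise RR judgments.

    judgments: list of (better_id, worse_id) pairs, human ties already removed.
    metric_score: dict mapping a translation id to its metric score.
    """
    concordant = discordant = 0
    for better, worse in judgments:
        if metric_score[better] > metric_score[worse]:
            concordant += 1
        else:
            discordant += 1  # assumption: metric ties count as discordant
    return (concordant - discordant) / (concordant + discordant)


toy_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
toy_judgments = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "c")]
tau = rr_kendall_tau(toy_judgments, toy_scores)
```

A statistic of this kind depends only on the ordering of the metric scores, which is why a metric can do well on it without producing well scaled scores.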
RR corpus vs. sentence objective. The corpus objective is better than the sentence objective for both corpus and sentence level RR judgments on 5 out of 7 languages and also on average correlation.
RR joint vs. single objectives. Training for the joint objective improves even more on both levels of RR correlation and outperforms both single-objective models on average and on 4 out of 7 languages.
Making confident conclusions from these results is difficult because, to the best of our knowledge, there is no principled way of measuring statistical significance on the RR judgments. That is why we also tested on direct assessment (DA) judgments available from WMT16. On DA we can measure statistical significance on the sentence level using the Williams test (Graham et al., 2015) and on the corpus level using a combination of hybrid-supersampling and the Williams test (Graham and . The results of correlation with human judgment for the sentence and corpus level are shown in Table 2. This shows that gambling on one objective function (be it the sentence or the corpus level objective) could give unpredictable results. This is precisely the motivation for creating the joint model with multi-objective training.
DA joint vs. single objectives. By choosing to jointly optimize both objectives we get a much more stable model that performs well both on DA and RR judgments and on both levels of judgment. On the DA sentence level, the joint model was not outperformed by any other model, and on 3 out of 7 language pairs it significantly outperforms both alternative objectives. On the corpus level the results are a bit mixed, but the joint objective still outperforms both other models on 4 out of 7 language pairs and also gives a higher correlation on average.

Conclusion
In this work we found that altering the objective function for training MT metrics can have radical effects on performance. The effects of the objective functions can also be unexpected: the sentence objective might not be good for sentence level correlation (in the case of RR judgments) and the corpus objective might not be good for corpus level correlation (in the case of DA judgments). The difference among objectives is better explained by the type of human judgment: the corpus objective is better for RR judgments, while the sentence objective is better for DA judgments.
Finally, the best results are achieved by training for both objectives at the same time. This gives an evaluation metric that is far more stable in its performance over all methods of meta-evaluation.