BEER 1.1: ILLC UvA submission to metrics and tuning task

We describe the submissions of ILLC UvA to the metrics and tuning tasks at WMT15. Both submissions are based on the BEER evaluation metric originally presented at WMT14 (Stanojević and Sima'an, 2014a). The main changes introduced this year are: (i) extending the learning-to-rank trained sentence level metric to the corpus level (while keeping it decomposable to the sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid "gaming of the metric" during tuning.


Introduction
In the 2014 WMT metrics task, BEER turned out to be the best sentence level evaluation metric on average over 10 language pairs (Macháček and Bojar, 2014). We believe that this was due to:
1. learning-to-rank: a type of training that allows a large number of features and trains on the same objective on which the model is going to be evaluated, namely ranking of translations
2. dense features: character n-grams and skip-bigrams that are less sparse on the sentence level than word n-grams
3. permutation trees: a hierarchical decomposition of word order based on (Zhang and Gildea, 2007)
A deeper analysis of (2) is presented in (Stanojević and Sima'an, 2014c) and of (3) in (Stanojević and Sima'an, 2014b).
Here we modify BEER by:
1. incorporating a better scoring function that gives better scaled scores
2. including syntactic features
3. removing the recall bias from BEER
In Section 2 we give a short introduction to BEER after which we move to the innovations for this year in Sections 3, 4 and 5. We show the results from the metric and tuning tasks in Section 6, and conclude in Section 7.

BEER basics
The model underlying the BEER metric is flexible enough to integrate an arbitrary number of new features and has a training method that is targeted at producing good rankings among systems. Two other characteristic properties of BEER are its hierarchical reordering component and its lexical matching component based on character n-grams.

Old BEER scoring
BEER is essentially a linear model in which the score is computed in the following way:
$$ score(h, r) = \sum_i w_i \times \phi_i(h, r) = w \cdot \phi $$
where $w$ is a weight vector and $\phi$ is a feature vector.
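As a minimal sketch of the linear model described above, the score is the dot product of a weight vector and a feature vector. The function name and the toy values are hypothetical and are not taken from the BEER implementation.

```python
from typing import Sequence


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Return the dot product w . phi for one (hypothesis, reference) pair."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical toy values: three features with hand-picked weights.
toy_weights = [0.5, 1.0, -0.25]
toy_features = [0.8, 0.6, 0.4]
print(linear_score(toy_weights, toy_features))
```

Such a score is only meaningful for comparing translations of the same sentence, since nothing constrains its range.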

Learning-to-rank
Since the task on which our model is going to be evaluated is ranking translations, it is natural to train the model using learning-to-rank techniques.
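Our training data consists of pairs of "good" and "bad" translations. Given a feature vector $\phi_{good}$ for the good translation and a feature vector $\phi_{bad}$ for the bad translation, the following equivalences transform the ranking problem into a binary classification problem (Herbrich et al., 1999):
$$
\begin{aligned}
score(h_{good}, r) > score(h_{bad}, r) &\Leftrightarrow w \cdot \phi_{good} > w \cdot \phi_{bad} \\
&\Leftrightarrow w \cdot \phi_{good} - w \cdot \phi_{bad} > 0 \\
&\Leftrightarrow w \cdot (\phi_{good} - \phi_{bad}) > 0 \\
&\Leftrightarrow w \cdot (\phi_{bad} - \phi_{good}) < 0
\end{aligned}
$$
If we treat $\phi_{good} - \phi_{bad}$ as a positive training instance and $\phi_{bad} - \phi_{good}$ as a negative training instance, we can train any linear classifier to find the weight vector $w$ that minimizes ranking mistakes on the training set.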
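The pairwise transformation can be sketched as follows. A plain perceptron stands in here for "any linear classifier"; it is an assumption chosen for brevity and not the classifier used by BEER. All names and toy feature values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pairwise_instances(pairs: Sequence[Tuple[Vector, Vector]]) -> List[Tuple[Vector, int]]:
    """Turn (phi_good, phi_bad) pairs into labelled difference vectors."""
    data = []
    for good, bad in pairs:
        diff = [g - b for g, b in zip(good, bad)]
        data.append((diff, +1))                # phi_good - phi_bad: positive
        data.append(([-d for d in diff], -1))  # phi_bad - phi_good: negative
    return data


def train_perceptron(data: Sequence[Tuple[Vector, int]], epochs: int = 20) -> Vector:
    """Plain perceptron without bias term (the bias cancels in differences)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # misranked pair: move w toward y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical toy data: two (good, bad) pairs with two features each.
toy_pairs = [([0.9, 0.2], [0.4, 0.3]), ([0.7, 0.5], [0.6, 0.1])]
toy_w = train_perceptron(pairwise_instances(toy_pairs))
```

The learned weights are then applied to a single feature vector at test time, which is what makes the scaling issue discussed later in the paper possible.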

Lexical component based on character n-grams
Lexical scoring in BEER relies heavily on character n-grams. Precision, recall and F1-score are computed for character n-gram orders from 1 to 6. These scores are smoother on the sentence level than the word n-gram matching used in other metrics such as BLEU (Papineni et al., 2002) or METEOR (Denkowski and Lavie, 2014). BEER also uses precision, recall and F1-score on the word level (but not with word n-grams). Word matching is computed over METEOR alignments, which use WordNet, paraphrasing and stemming to obtain more accurate alignments.
We also distinguish between function and content words. A more precise description of the features used and of their effectiveness is presented in (Stanojević and Sima'an, 2014c).
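The character n-gram features can be illustrated with a short sketch that computes precision, recall and F1-score for orders 1 to 6. Clipped multiset matching and the treatment of whitespace as an ordinary character are assumptions of this sketch; the exact matching used by BEER may differ.

```python
from collections import Counter
from typing import Dict, Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams; empty if the text is shorter than n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_prf(hyp: str, ref: str, n: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 of hypothesis n-grams against the reference."""
    hyp_counts, ref_counts = char_ngrams(hyp, n), char_ngrams(ref, n)
    matched = sum((hyp_counts & ref_counts).values())  # clipped matches
    p = matched / sum(hyp_counts.values()) if hyp_counts else 0.0
    r = matched / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


def char_features(hyp: str, ref: str, max_n: int = 6) -> Dict[int, Tuple[float, float, float]]:
    """One (P, R, F1) triple per n-gram order from 1 to max_n."""
    return {n: ngram_prf(hyp, ref, n) for n in range(1, max_n + 1)}
```

Because even a short sentence contains many character n-grams, these counts are rarely all zero, which is the sense in which the features are dense.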

Reordering component based on PETs
The word alignments between the system and reference translations can be simplified and viewed as a permutation of the words of the reference translation in the system translation. Previous work by Isozaki et al. (2010) and Birch and Osborne (2010) used this permutation view of word order and applied Kendall τ to measure its distance from the ideal (monotone) word order.
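For illustration, the standard Kendall τ of a permutation relative to the monotone order can be computed by counting concordant pairs. This is a minimal sketch of the textbook definition; the cited papers apply their own normalisations, which are not reproduced here.

```python
from typing import Sequence


def kendall_tau(perm: Sequence[int]) -> float:
    """Kendall tau in [-1, 1]: 1 for monotone order, -1 for fully inverted."""
    n = len(perm)
    if n < 2:
        return 1.0  # a single word cannot be out of order
    total_pairs = n * (n - 1) // 2
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if perm[i] < perm[j]
    )
    # (concordant - discordant) / total, with discordant = total - concordant
    return 2 * concordant / total_pairs - 1
```

Such a measure looks only at pairs of positions, which is the limitation that the hierarchical view in the following paragraph addresses.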
BEER goes beyond this skip-gram based evaluation and decomposes the permutation into a hierarchical structure which shows how subparts of the permutation form small groups that can be reordered all together. Figure 1a shows the PET for the permutation ⟨2, 5, 6, 4, 1, 3⟩. Ideally the permutation tree will be filled with ⟨1, 2⟩ nodes, which would say that there is no need to do any reordering (everything is in the right place). BEER has features that count the number of nodes of each type, and it assigns a different weight to each type. Sometimes there is more than one PET for the same permutation. Consider Figures 1b and 1c, which are just 2 out of 3 possible PETs for the permutation ⟨4, 3, 2, 1⟩. Counting the number of trees that could be built is also a good indicator of the permutation quality. See (Stanojević and Sima'an, 2014b) for details on using PETs for evaluating word order.

Corpus level BEER
Our goal here is to create a corpus level extension of BEER that decomposes trivially at the sentence level. More concretely, we wanted a corpus level BEER that is the average of the sentence level BEER scores of all sentences in the corpus:
$$ BEER_{corpus}(c) = \frac{\sum_{s_i \in c} BEER_{sent}(s_i)}{|c|} $$
The previous scoring function of BEER is not suitable for this. The previous scoring function (and training method) only ensure that the better translation gets a higher score than the worse translation (on the sentence level). For this kind of corpus level computation we have the additional requirement that sentence level scores be scaled proportionally to translation quality.
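The corpus level score described above is a plain average, as the following sketch shows. The function name and the toy scores are hypothetical.

```python
from typing import Sequence


def corpus_beer(sentence_scores: Sequence[float]) -> float:
    """Average of sentence level scores; decomposes trivially by sentence."""
    if not sentence_scores:
        raise ValueError("corpus must contain at least one sentence")
    return sum(sentence_scores) / len(sentence_scores)


# Hypothetical sentence level scores for a three-sentence corpus.
print(corpus_beer([0.25, 0.5, 0.75]))
```

An average is only informative if the individual scores are comparable across sentences, which is why a ranking-only score is not enough here.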

New BEER scoring function
To make the scores on the sentence level better scaled, we transform our linear model into a probabilistic linear model, logistic regression, with the following scoring function:
$$ score(h, r) = \frac{1}{1 + e^{-\sum_i w_i \times \phi_i(h, r)}} $$
There is still a problem with this formulation. During training, the model is trained on the difference between two feature vectors, $\phi_{good} - \phi_{bad}$, while during testing it is applied to a single feature vector $\phi_{test}$. The difference $\phi_{good} - \phi_{bad}$ is usually very close to the separating hyperplane, whereas $\phi_{test}$ is usually very far from it. This is not a problem for ranking, but it is a problem if we want well scaled scores: being extremely far from the separating hyperplane pushes the logistic output to values extremely close to 0 or 1.
Our model was trained to give the probability of the "good" translation being better than the "bad" translation, so we should also use it in that way: to estimate the probability of one translation being better than the other. But which translation? We are given only one translation and we need to compute its score. To avoid this problem, we pretend that we are computing the probability of the test sentence being a better translation than the reference, for the given reference. In the ideal case the system translation and the reference translation have the same features, which makes logistic regression output probability 0.5 (it is uncertain about which translation is the better one). To make the scores range between 0 and 1, we multiply this result by 2. The final scoring formula is the following:
$$ score(h, r) = \frac{2}{1 + e^{-\sum_i w_i \times (\phi_i(h, r) - \phi_i(r, r))}} $$
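The scoring idea described above can be sketched as follows, assuming that the hypothesis features are differenced against the features of the reference scored against itself before the logistic function is applied. The function names are hypothetical, and the numerically stable form of the logistic function is a choice of this sketch.

```python
import math
from typing import Sequence


def logistic(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scaled_score(weights: Sequence[float],
                 phi_hyp: Sequence[float],
                 phi_ref: Sequence[float]) -> float:
    """Twice the modelled probability that the hypothesis beats the reference.

    phi_hyp holds features of the hypothesis against the reference;
    phi_ref holds features of the reference against itself (assumption).
    """
    z = sum(w * (h - r) for w, h, r in zip(weights, phi_hyp, phi_ref))
    return 2.0 * logistic(z)
```

With identical feature vectors the exponent is zero, the logistic output is 0.5, and the score is exactly 1; a hypothesis whose weighted features fall below those of the reference scores below 1.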

BEER + Syntax = BEER Treepel
The standard version of BEER does not use any syntactic knowledge. Since the training method of BEER allows the use of a large number of features, it is trivial to integrate new features that measure how well syntactic attributes of the system and reference translations match.
The syntactic representation we exploit is the dependency tree. It connects the structure directly with the lexical content, and it is fast to compute, which matters when a metric has to evaluate large amounts of data. We used Stanford's dependency parser (Chen and Manning, 2014) because it gives high accuracy parses in a very short time.
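The features we compute on the dependency trees of the system translation and its reference translation are:
1. POS tag matching
2. dependency word bigram matching
3. arc type matching
4. valency matching
For each of these we compute precision, recall and F1-score.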
It has been shown by other researchers (Popović and Ney, 2009) that POS tags are useful for abstracting away from concrete words and measuring the grammatical aspect of translation (for example, they can capture agreement).
Dependency word bigrams (bigrams connected by a dependency arc) are also useful for capturing long distance dependencies.
Most previous metrics that work with dependency trees ignore the type of the dependency that is (un)matched and treat all types equally (Yu et al., 2014). Clearly, not all types are equally important: subject and complement arcs surely matter more than modifier arcs. To capture this we created individual precision, recall and F1-score features for the matching of each arc type, so that our system could learn which arc types deserve more weight.
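A sketch of per-arc-type matching is given below. Representing an arc as a (head word, dependent word, label) triple and counting an arc as matched only when all three parts agree are assumptions of this sketch; the label names in the usage note are illustrative only.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Arc = Tuple[str, str, str]  # (head word, dependent word, arc label)


def arc_type_prf(hyp_arcs: Sequence[Arc],
                 ref_arcs: Sequence[Arc]) -> Dict[str, Tuple[float, float, float]]:
    """One (precision, recall, F1) triple per arc label seen in either tree."""
    labels = {a[2] for a in hyp_arcs} | {a[2] for a in ref_arcs}
    features = {}
    for label in sorted(labels):
        hyp = Counter(a for a in hyp_arcs if a[2] == label)
        ref = Counter(a for a in ref_arcs if a[2] == label)
        matched = sum((hyp & ref).values())  # clipped matches
        p = matched / sum(hyp.values()) if hyp else 0.0
        r = matched / sum(ref.values()) if ref else 0.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        features[label] = (p, r, f1)
    return features
```

Each label contributes its own three features, so a learner can weight, for example, subject arcs differently from modifier arcs.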
All words take some number of arguments (valency), and not matching that number of arguments is a sign of a potentially bad translation. With this feature we hope to capture the failure to produce the right number of arguments for the words in the sentence (especially verbs).
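One possible reading of valency matching is sketched below: each token is paired with its number of dependents, and the resulting (word, valency) items of the system translation are matched against those of the reference. This interpretation, the index-based arc representation and all names are assumptions of the sketch.

```python
from collections import Counter
from typing import Sequence, Tuple

IndexArc = Tuple[int, int]  # (head token index, dependent token index)


def valency_items(words: Sequence[str], arcs: Sequence[IndexArc]) -> Counter:
    """Multiset of (word, number of dependents) items, one per token."""
    n_deps = Counter(head for head, _dep in arcs)
    return Counter((word, n_deps[i]) for i, word in enumerate(words))


def valency_prf(hyp_words: Sequence[str], hyp_arcs: Sequence[IndexArc],
                ref_words: Sequence[str], ref_arcs: Sequence[IndexArc]
                ) -> Tuple[float, float, float]:
    """Precision, recall and F1 over matched (word, valency) items."""
    hyp = valency_items(hyp_words, hyp_arcs)
    ref = valency_items(ref_words, ref_arcs)
    matched = sum((hyp & ref).values())
    p = matched / len(hyp_words) if hyp_words else 0.0
    r = matched / len(ref_words) if ref_words else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

In this reading, a verb that appears in both translations but with a different number of dependents counts as unmatched.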
This model, BEER Treepel, contains 177 features in total, of which 45 come from the original BEER.

BEER for tuning
Metrics that perform well on the metrics task are very often not good for tuning. This is because recall matters much more to human judgment than precision. Metrics that put more weight on recall than on precision correlate better with human judgment, but when used for tuning they produce overly long translations. This bias toward long translations is often resolved by manually setting the weights of recall and precision to be equal (Denkowski and Lavie, 2011; He and Way, 2009).
This problem is even bigger for metrics with many features.
With a metric like BEER Treepel, which has 117 features, it is not clear how to set the weight of each feature manually. Also, some features might not have an easy interpretation as precision or recall of something. Our method for automatically removing this recall bias, presented in (Stanojević, 2015), gives very good results, which can be seen in Table 1.
Before the automatic adaptation of weights for tuning, tuning with standard BEER produces translations that are 15% longer than the reference translations. This behavior is rewarded by recall-heavy metrics like METEOR and BEER and punished by precision-heavy metrics like BLEU. After the automatic adaptation of weights, tuning with BEER matches the length of the reference translations even better than tuning with BLEU does, and achieves a BLEU score that is very close to that of tuning with BLEU. This kind of model is disliked by METEOR and BEER, but just by looking at the length of the produced translations it is clear which approach is preferred.
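The length comparison discussed above can be made concrete with a simple corpus level length ratio. Whitespace tokenisation and the function name are assumptions of this sketch; a ratio of 1.15 would correspond to translations that are 15% longer than the references.

```python
from typing import Sequence


def length_ratio(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Total hypothesis length divided by total reference length, in tokens."""
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    if ref_len == 0:
        raise ValueError("references contain no tokens")
    return hyp_len / ref_len
```

A ratio well above 1 after tuning is a quick symptom of the recall bias described in this section.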

Metric and Tuning task results
The results of the best performing metrics in the WMT15 metrics task are shown in Tables 2 and 3 for the system level and in Tables 4 and 5 for the segment level.
On the sentence level, BEER was on average the best metric for out-of-English language pairs (as it was last year). For into-English pairs, the syntactic version took 2nd place and the original BEER took 4th place.
On the corpus level, BEER is on average second for out-of-English language pairs and 6th for into-English pairs. BEER and BEER Treepel are the best metrics for en-ru and fi-en. The differences between BEER and BEER Treepel are relatively big for de-en, cs-en and ru-en, while for fr-en and fi-en they do not seem to be big.
The results of the WMT15 tuning task are shown in Table 6. The system tuned with BEER without recall bias was the best submitted system for Czech-English; only the strong baseline outperformed it.
Table 6: Results on Czech-English tuning

Conclusion
We have presented the ILLC UvA submissions to the shared metrics and tuning tasks. All submissions are centered around the BEER evaluation metric. On the metrics task we kept the good results we had on the sentence level and extended our metric to the corpus level with high correlation with human judgment, without losing the decomposability of the metric to the sentence level. Integration of syntactic features gave a small improvement on some language pairs. The removal of the recall bias allowed us to go from overly long translations produced in tuning to translations that match the reference relatively closely in length, and won 3rd place in the tuning task. BEER is available at https://github.com/stanojevic/beer.
Table 5: Segment-level Kendall's τ correlations of automatic evaluation metrics and the official WMT human judgments when translating out of English. The last three columns contain average Kendall's τ computed by other variants.