UPF-Cobalt Submission to WMT15 Metrics Task

An important limitation of automatic evaluation metrics is that, when comparing Machine Translation (MT) to a human reference, they are often unable to discriminate between acceptable variation and the differences that are indicative of MT errors. In this paper we present the UPF-Cobalt evaluation system, which addresses this issue by penalizing the differences in the syntactic contexts of aligned candidate and reference words. We evaluate our metric using the data from the WMT workshops of recent years and show that it performs competitively both at segment and at system levels.


Introduction
Current automatic MT evaluation methods are grounded in the following key idea: the closer an MT is to a professional Human Translation (HT), the higher its quality. Thus, metrics typically calculate evaluation scores based on some sort of similarity between machine and human translations. The performance of evaluation systems is in turn evaluated by calculating the correlation with human judgments. Manual quality assessment can be conducted in various ways: adequacy and fluency scoring, calculating post-editing cost or post-editing time, error analysis, ranking, etc. In the latter case, humans are asked to compare the outputs of different MT systems and rank them in terms of quality. Ranking-based evaluation has gained a lot of attention in recent years and is used in important evaluation campaigns such as the Metrics task at the Workshop on Machine Translation (WMT). This setting is preferred, since it has been shown to yield higher inter-annotator agreement than absolute quality assessment (Callison-Burch et al., 2007).
This work was supported by IULA (UPF) and the FI-DGR grant program of the Generalitat de Catalunya.
In our opinion, one of the main reasons why the correlation between automatic evaluation and human rankings is still not satisfactory is that metric scores are not discriminative enough to approximate human comparisons. Given various candidate translations of the same source sentence, all of them different from the reference, evaluation systems are often unable to determine which translation is better, as they cannot tell apart candidate-reference differences related to acceptable linguistic variation from the differences induced by MT errors. Furthermore, if all candidate translations contain a number of translation errors, metrics fail to predict the human ranking because they make no estimation of the relative importance of different types of MT errors for the overall translation quality.
We suggest that the aforementioned limitations can be addressed by enhancing word comparison with contextual information. Variation between two translation options is acceptable if semantically similar words in the corresponding sentences occur in equivalent contexts. In the case of translation errors, either the lexical choice is inappropriate or the syntactic contexts of the words are different (incorrect choice of function words, word order errors, etc.).
Our evaluation metric, UPF-Cobalt, exploits contextual information by weighting the contribution of each pair of lexically similar words in candidate and reference translations depending on whether they occur in similar syntactic environments. The syntactic functions of the words in context are taken into consideration. In this way, more fine-grained distinctions can be made regarding the relative importance of mistranslated material.
In this paper we present the UPF-Cobalt submission to the WMT15 Metrics task. Experiments show that UPF-Cobalt achieves competitive results, both at segment and at system levels. On WMT14 data, our metric would have been ranked as the second-best performing metric at segment level, and tied with the best-performing metric at system level.
The rest of this paper is organized as follows. Section 2 describes UPF-Cobalt. In Section 3 we present the experiments and analyze the results. Section 4 examines relevant pieces of related work. Finally, in Section 5 we give the conclusions and suggest directions for future work.

Metric Description
Following , we argue that for measuring sentence similarity and related tasks, identifying similar words and deciding on the relation between the two sentences should be kept separate. This is especially relevant for MT evaluation where system output may share a high number of similar words with the reference and still be grammatically ill-formed and totally unacceptable. Thus, not only the number but also the characteristics of the correspondences between candidate and reference words must be taken into consideration. Therefore, we follow a two-stage approach to evaluation. First, MT is aligned to the reference. Next, the candidate translation is scored taking into account both the number of aligned words and their roles in the corresponding sentences.

Monolingual Word Aligner
We assume that using better candidate-reference alignment results in better MT evaluation. Research in the area of monolingual alignment demonstrates that exploiting syntactic context to discriminate between candidate pairs for alignment significantly improves the results (MacCartney et al., 2008; Thadani et al., 2012; Yao et al., 2013; Sultan et al., 2014). The alignment module of UPF-Cobalt builds on an existing system, Monolingual Word Aligner (MWA), which takes context information into account and has been shown to significantly improve on state-of-the-art results (Sultan et al., 2014).
MWA exploits lexical similarity and contextual evidence to make alignment decisions. The lexical similarity component identifies possible candidates for alignment. In addition to exact and lemma match, the Paraphrase Database (Ganitkevitch et al., 2013) of lexical and phrasal paraphrases is employed to recognize semantically similar words. We enhance MWA with additional lexical similarity resources to maximize the coverage of the alignment. In addition to the paraphrase database, UPF-Cobalt employs WordNet synsets (Miller and Fellbaum, 2007) and distributional similarity (Turney and Pantel, 2010). WordNet is commonly used in MT evaluation and related fields for dealing with lexical variation. By contrast, to the best of our knowledge, distributional similarity has not yet been exploited for the evaluation task.
We use a publicly available distributional similarity resource (Levy and Goldberg, 2014), which contains dependency-based word embeddings. To minimize the noise, we establish the following restrictions. To be considered candidates for alignment, the words must have a cosine similarity higher than a threshold (based on data observation, we currently define it as 0.25). In addition, they must have at least one pair of exact matching content words in their contexts.
Contextual evidence is used to choose the best alignment candidates and is defined as the number of similar words in the contexts of the words to be aligned. At the syntactic level, the context is constituted by the head and dependent nodes in a dependency graph. Context words are considered as evidence for alignment if they are lexically similar and have the same or equivalent syntactic relations with the words to be aligned. Sultan et al. (2014) have developed a list of mappings between different syntactic functions that instantiate the same semantic relation. Thus, for example, the dependency relation between subject and predicate in an active clause and the relation between by-agent and predicate in a passive clause are defined to be equivalent. We consider this functionality helpful for addressing syntactic variation in reference-based MT evaluation and reuse it for scoring.
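The two restrictions can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the code of UPF-Cobalt: the function names, the representation of embeddings as plain lists of floats, and the representation of a context as a set of content-word forms are all hypothetical. Only the 0.25 threshold comes from the text.

```python
import math

SIM_THRESHOLD = 0.25  # threshold value reported in the text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_distributional_candidate(vec_a, vec_b, context_a, context_b,
                                threshold=SIM_THRESHOLD):
    """Apply both restrictions to a pair of words.

    vec_a, vec_b: embeddings of the two words (hypothetical layout).
    context_a, context_b: content-word forms in each word's context
    (hypothetical layout).
    """
    similar_enough = cosine(vec_a, vec_b) > threshold
    # "At least one pair of exact matching content words" is modelled
    # here as a non-empty intersection of the two context sets.
    shared_context = len(set(context_a) & set(context_b)) > 0
    return similar_enough and shared_context
```

Requiring both conditions means that a high cosine similarity alone never licenses an alignment; the shared context word acts as a second, independent check against noisy embedding neighbors.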
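Counting contextual evidence can be sketched as follows. This is an illustrative simplification, not the MWA implementation: the equivalence table contains a single hypothetical entry (the labels `nsubj` and `agent` stand in for the active subject and the passive by-agent), the direction of each relation (head versus dependent) is ignored, and lexical similarity is passed in as an arbitrary function.

```python
# Hypothetical equivalence classes of dependency labels. The actual list
# of mappings is the one developed by Sultan et al. (2014).
EQUIVALENT_RELATIONS = [frozenset({"nsubj", "agent"})]


def relations_match(rel_a, rel_b, equivalences=EQUIVALENT_RELATIONS):
    """True if the two labels are identical or belong to one equivalence class."""
    if rel_a == rel_b:
        return True
    return any(rel_a in group and rel_b in group for group in equivalences)


def contextual_evidence(context_a, context_b, lexically_similar):
    """Count context words that support aligning two words.

    context_a, context_b: lists of (word, relation) pairs describing the
    heads and dependents of each word (hypothetical layout).
    lexically_similar: function (word, word) -> bool.
    """
    evidence = 0
    used = set()  # each context word of b supports at most one word of a
    for word_a, rel_a in context_a:
        for j, (word_b, rel_b) in enumerate(context_b):
            if j in used:
                continue
            if lexically_similar(word_a, word_b) and relations_match(rel_a, rel_b):
                evidence += 1
                used.add(j)
                break
    return evidence
```

For the verb in an active clause and the verb in its passive counterpart, a subject labelled `nsubj` on one side and a by-agent labelled `agent` on the other would count as one piece of evidence under this table.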

Scoring Method
Given a candidate-reference alignment, we further need to know if the correspondences identified at the alignment stage are actually indicative of MT quality. UPF-Cobalt computes a score for each pair of aligned words as a combination of their lexical similarity and the differences of the syntactic contexts in which the words occur.
Lexical Similarity. The weights for different types of lexical similarity are established heuristically, depending on the accuracy of the lexical resource that was used for aligning the words:
• Word form: 1.0
• Lemma or stem: 0.9
• WordNet synsets: 0.8
• Paraphrase database: 0.6
• Distributional similarity: 0.5
Context Penalty. The context penalty is applied in cases where aligned words play different roles in the corresponding sentences. For each pair of aligned nodes, h in the candidate translation and r in the reference translation, the context penalty is calculated according to Formula (1), where c refers to the words that belong to the syntactic context of the reference word r (its immediate neighbors in the dependency graph). If a context word is found in the set of aligned word pairs A and its counterpart in the candidate translation has the same or an equivalent syntactic relation with the word h, the weight w(c_i) equals 0. Otherwise, the weight is defined according to the relative importance of the dependency function of the context word. Intuitively, mistranslating or omitting words with syntactic functions that correspond to arguments alters the context to a greater extent than dropping a determiner or an adjunct. We define three groups of syntactic functions accordingly and establish the corresponding weights as follows:
• Arguments and complements: 1.0
• Modifiers and adjuncts: 0.8
• Specifiers and auxiliaries: 0.2
The natural logarithm of count(c) in Formula (1) gives a higher value to the contextual difference when the number of context words is high, while limiting the increase if the number of context words continues to grow.
The final value of the context penalty is normalized to the range from 0 to 1 using a logarithmic function. Given the values of lexical similarity and context penalty, a score is defined for each pair of aligned words. The sentence-level score is then calculated as a weighted combination of precision and recall over the sum of the scores for aligned candidate and reference words. To obtain system-level scores, we compute the ratio of sentences in which each system was assigned the highest sentence-level score by our metric.
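The weighting scheme can be written down as two lookup tables and a rule for w(c_i). The sketch below covers only the assignment of weights; it does not implement Formula (1) or the subsequent normalization. The numeric values are those listed above, while the key names, the mapping from dependency labels to the three groups, and the function names are assumptions made for illustration.

```python
# Weights listed in the text; the key names are hypothetical.
LEXICAL_WEIGHTS = {
    "word_form": 1.0,
    "lemma_or_stem": 0.9,
    "wordnet": 0.8,
    "paraphrase": 0.6,
    "distributional": 0.5,
}

FUNCTION_GROUP_WEIGHTS = {
    "argument": 1.0,   # arguments and complements
    "modifier": 0.8,   # modifiers and adjuncts
    "specifier": 0.2,  # specifiers and auxiliaries
}

# Hypothetical assignment of a few dependency labels to the three groups.
LABEL_TO_GROUP = {
    "nsubj": "argument",
    "dobj": "argument",
    "amod": "modifier",
    "advmod": "modifier",
    "det": "specifier",
    "aux": "specifier",
}


def context_word_weight(label, is_matched):
    """Return w(c_i) for one context word of the reference word.

    is_matched: True if the context word is aligned and its counterpart
    has the same or an equivalent relation with the candidate word.
    """
    if is_matched:
        return 0.0
    return FUNCTION_GROUP_WEIGHTS[LABEL_TO_GROUP[label]]


def context_weights(reference_context):
    """reference_context: list of (label, is_matched) pairs."""
    return [context_word_weight(label, matched)
            for label, matched in reference_context]
```

With these tables, a reference word whose subject is matched, whose object is unmatched and whose determiner is unmatched yields the weights 0.0, 1.0 and 0.2, so the missing argument dominates the contextual difference.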
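The system-level aggregation described above can be sketched in a few lines. The function name and the input layout are hypothetical, and one detail is an assumption: when several systems share the highest score on a sentence, each of them is credited, since the text does not say how such ties are resolved.

```python
def system_level_scores(sentence_scores):
    """Ratio of sentences on which each system has the highest score.

    sentence_scores: dict mapping a system name to its list of
    sentence-level scores; all lists cover the same sentences in the
    same order (hypothetical layout).
    """
    systems = list(sentence_scores)
    if not systems or not sentence_scores[systems[0]]:
        raise ValueError("at least one system and one sentence are required")
    n_sentences = len(sentence_scores[systems[0]])
    wins = {system: 0 for system in systems}
    for i in range(n_sentences):
        best = max(sentence_scores[system][i] for system in systems)
        for system in systems:
            # Assumption: every system tied for the best score is credited.
            if sentence_scores[system][i] == best:
                wins[system] += 1
    return {system: wins[system] / n_sentences for system in systems}
```

Because of the tie rule, the ratios of different systems need not sum to 1.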

Experiments
We conduct experiments with the data from WMT13 and WMT14 Metrics tasks (Macháček and Bojar, 2013;Macháček and Bojar, 2014). To evaluate our metric's performance at segment level, we use Kendall's Tau correlation (τ ) with human rankings, as defined in (Macháček and Bojar, 2014). At system level, we use Pearson correlation coefficient (r). Table 1 presents the results averaged over all into-English translation directions. For the sake of comparison, we provide the results for the best performing metrics that participated in WMT13 and WMT14 Metrics tasks, as well as baseline metrics BLEU (Papineni et al., 2002) and Meteor (Denkowski and Lavie, 2014).
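Both correlation measures can be sketched in plain Python. The Kendall's Tau function uses the basic form (concordant minus discordant pairs, divided by their sum) over pairwise human comparisons. Its treatment of ties is a simplifying assumption: pairs on which the metric assigns equal scores are skipped here, whereas the exact handling is the one specified by Macháček and Bojar (2014). The function names and input layouts are hypothetical.

```python
import math


def kendall_tau(pairs):
    """Kendall's Tau over pairwise human comparisons.

    pairs: list of (score_better, score_worse), the metric scores of the
    translation ranked higher by humans and of the one ranked lower.
    Assumption: pairs with equal metric scores are skipped.
    """
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    if concordant + discordant == 0:
        raise ValueError("no untied pairs to compare")
    return (concordant - discordant) / (concordant + discordant)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

At segment level each pair comes from one human ranking judgment; at system level `xs` would hold the metric's system scores and `ys` the human system scores.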
As shown in Table 1, our approach is competitive (UPF-Cobalt would have been ranked as the best performing metric on WMT13 data and as the second best on WMT14 data) and generalizes well.
Context penalty. To estimate the benefit of using our context penalty, we substituted it with the fragmentation penalty from Meteor, which explicitly penalizes differences in sequential word order. As expected, this results in a significant drop in the correlation. Thus, this new component is indeed crucial for our metric's performance.
MWA has been shown to outperform Meteor in the alignment task. However, contrary to our expectations, simply using a more accurate aligner does not suffice to improve the correlation (Meteor achieves 0.354 correlation on this dataset).
Manual inspection of the results shows that this is primarily due to the fact that MWA does not support phrase-level alignments. This functionality is highly relevant for the evaluation task as it allows covering acceptable variation that involves multiword expressions. We plan to integrate phrasal alignments in the metric in the future.
Distributional similarity. Removing this component leads to a considerable decrease in the correlation. Qualitative analysis of the results shows that its main contribution concerns cases of quasi-synonyms, i.e. words that can be considered synonymous only given the similarity of their contexts. The noise introduced by the component is neutralized by the context penalty. If unrelated words are aligned, their context penalty will be high and aligning them will not increase the sentence-level evaluation score. Also, in the ranking formulation of the evaluation task, distributional similarity helps to discriminate between low-quality translations. That is to say, it allows distinguishing sentences where words are at least minimally related from sentences in which, for instance, source-language words are simply left untranslated.
Dependency weights. To test whether giving different weights to contextual differences according to the dependency functions of the words involved is beneficial, we set the values of all the weights to 1. This negatively affects the results, confirming that some differences are stronger indicators of MT errors than others. Thus, using the proposed weighting scheme, the metric is capable of discriminating between more and less serious MT errors based on the relative importance of mistranslated material.
Equivalence of syntactic constructions. Eliminating this functionality produces a smaller decrease in the correlation. Representing syntactic context as immediate neighbors of the word in a dependency graph allows covering a limited set of equivalent constructions, which are not frequent enough to have a significant impact on the results. The framework is flexible and more complex context equivalence definitions can be integrated in the future.
To appreciate the advantages of the metric, Table 3 provides a qualitative comparison of UPF-Cobalt's performance with the strong baseline metric Meteor.
Table 3: Example of candidate and reference translations with the corresponding Meteor and UPF-Cobalt scores
In this example, Meteor assigns low scores to both candidate translations, due to the differences in word order and the presence of function words absent in the reference. However, it is clear that Candidate 1 is perfectly acceptable, whereas Candidate 2 contains an error concerning the relation between the words "voter" and "Obama". UPF-Cobalt correctly assigns a higher score to Candidate 1. Here all the content words are aligned and no context penalty is applied, since the syntactic contexts in which the words occur are equal or equivalent. Thus, the prep_for relation in the candidate translation is equivalent to the noun compound modifier relation nn in the reference, and the prep_of label in the candidate corresponds to the possession modifier poss in the reference. UPF-Cobalt assigns a lower score to Candidate 2 due to the differences in the syntactic contexts of the words "voter" (context penalty -0.426) and "Obama" (context penalty -0.286), which constitute a translation error. Thus, context penalty values calculated for each pair of aligned words can be used for spotting and locating translation errors.
Qualitative analysis of the results also shows an interesting pattern in cases where UPF-Cobalt is outperformed by other metrics. This pattern is particularly relevant in the ranking evaluation setting. Consider the following example.
Ref: Nevada has already completed a pilot.
Cand1: Nevada already has completed the pilot project.
Cand2: Nevada has already completed the pilot project.
When ranking translations, humans tend to avoid ties whenever possible. Both Candidate 1 and Candidate 2 are essentially correct, but the second translation is more adequate with regard to the norms and conventions of target language use. UPF-Cobalt assigns equal scores to both MTs. Thus, it successfully avoids penalizing acceptable differences in word order (the differences that do not affect the output of the dependency parser). However, it is not able to make more fine-grained distinctions regarding the fluency of MT. This issue can be addressed by integrating target language model features in the metric.

Related Work
Metrics based on string-level comparison take context into account in a simplistic manner. For instance, BLEU (Papineni et al., 2002) uses n-grams of length 1 to 4, and Meteor (Denkowski and Lavie, 2014) addresses the differences in sequential word order by means of a fragmentation penalty, based on the number of adjacent aligned words. This often leads to penalizing acceptable differences induced by the use of semantically equivalent expressions. At the same time, spurious matches of words that coincide in their surface form but play totally different roles in the corresponding sentences can incorrectly increase the evaluation score.
To address these limitations a series of linguistically informed approaches have been proposed. Amigó et al. (2006) measure the degree of overlap between the dependency trees of candidate and reference translations. Giménez and Màrquez (2010) propose a combination of specialized similarity measures operating at different linguistic levels (lexical, syntactic and semantic). Guzman et al. (2014) further enrich this metric set with discourse level information. Padó et al. (2009) measure MT quality based on a rich set of features motivated by textual entailment.
Our work follows this line of research and exploits syntactic context to characterize the correspondences between the words in candidate and reference translations. In addition, we address the problem of syntactic variation that has rarely been dealt with in linguistically-informed MT evaluation. As shown in Fomicheva et al. (2015), this kind of variation is a regular source of differences between human reference and MT. Structural shifts (Ahrenberg and Merkel, 2000) are common practice in HT. Translators often introduce optional changes to the original sentence in order to adhere to specific principles of target language use, including stylistic issues and discourse processing conditions. MT may not contain such shifts but still be grammatically well-formed and perfectly deliver the contents of the source sentence. By taking into consideration the equivalence of syntactic constructions it is possible to avoid penalizing MT in these cases.

Conclusions and Future Work
We have shown that using contextual information helps to distinguish candidate translations that are different from the reference and still essentially correct from those that share a high number of words with HT but fail to preserve the meaning of the source sentence due to translation errors.
Also, we enhanced existing methods for addressing meaning-preserving variation by exploiting distributional similarity at lexical level and classes of equivalent dependency types at syntactic level. The results demonstrate that the metric achieves competitive performance on WMT13 and WMT14 data.
As future work, we consider improving the metric by extending the alignment component to the phrase level and refining the equivalent dependency types to increase the coverage of linguistic variation at the syntactic level. Another interesting direction would be to integrate target-language features and take into consideration the properties of non-aligned material. Finally, we plan to test whether the metric can be successfully used for error detection and classification.