UHH Submission to the WMT17 Metrics Shared Task

In this paper we present the UHH submission to the WMT17 Metrics Shared Task, which is based on sequence and tree kernel functions applied to the reference and candidate translations. In addition, we explore the effect of applying the kernel functions to the source sentence and a back-translation of the MT output, as well as to the pair composed of the candidate translation and a pseudo-reference of the source segment. The newly proposed metric was evaluated using the data from WMT16, with the results demonstrating a high correlation with human judgments.


Introduction
The evaluation of Machine Translation (MT) is an important research area, as meaningful, automatic and accurate methods for determining the quality of machine-translated output are a key component in the development cycle of an MT system. However, the task is inherently difficult due to the expressiveness of natural language, which often allows a message to be conveyed in more than one equivalent way. When translating from a source language into a target one, the input data for evaluation conventionally consists of a set of tuples, with each tuple composed of:
• a source segment, representing the sentence to be translated in the source language
• a candidate translation (also known as a target segment), obtained by translating the source segment into the target language using an MT system
• a reference translation, representing a correct human-generated translation of the source segment
As a research field, MT evaluation can be divided into two categories: reference-free evaluation and reference-based evaluation. Reference-free evaluation, also known as Quality Estimation, aims at providing automatic methods for assessing the quality of candidate translations without requiring reference translations. In the case of reference-based evaluation, the target segment is compared with the reference translation, resulting in a score that measures the similarity between the two sentences. Different approaches for computing the comparison have been implemented, with the most frequently used one being BLEU (Papineni et al., 2002), which measures the quality of the candidate translation by counting the number of n-grams it has in common with the reference translations. Nonetheless, multiple disadvantages of BLEU have already been pointed out, as in Callison-Burch et al. (2006), where it is shown that an increase in the BLEU score does not necessarily correlate with a better performing system.
This has motivated further research into additional MT evaluation methods that rely on more than lexical matching by additionally including the syntactic and semantic structure of the sentences (e.g. Popović and Ney, 2009; Gautam and Bhattacharyya, 2014).
We propose a new method for the evaluation of MT output, based on tree and sequence kernel functions applied to the pair of reference and candidate translations. In addition, we study the impact of applying the kernels to the tuple consisting of the source segment and a back-translation, together with the pair consisting of the candidate translation and a pseudo-reference. A pseudo-reference is the result of translating the source segment into the target language, while a back-translation is obtained by translating the target segment into the source language. The evaluation results show that the new metric strongly correlates with human judgments, outperforming the state-of-the-art methods.
A syntactic evaluation method based on tree kernels is proposed in Liu and Gildea (2005). It uses the subtree kernel introduced in Collins and Duffy (2002) to calculate the similarity between the reference and the candidate translations. Besides this, a syntactic metric based on counting the number of fixed-depth subtrees shared by the two translations is also introduced, with both metrics being applied on the constituency trees of the input data. Additionally, a dependency tree based metric is presented, which computes the number of common headword chains, where a headword chain is defined as the concatenation of words that form a path in the dependency tree.
Another MT evaluation method that makes use of tree kernels is introduced in Guzmán et al. (2014). It also uses the subtree kernel introduced in Collins and Duffy (2001), but in this case it calculates the similarity between the discourse trees of the candidate and reference translation. The evaluation combined the newly proposed metric with already existing ones and the results showed that the addition is beneficial for improving the correlation scores.
The role of back-translations has also been investigated before, as in Rapp (2009), where the quality of a candidate translation is assessed by measuring the similarity, in terms of a modified version of BLEU, between its back-translation and the initial source segment. Pseudo-references have been used as an additional source of data for tuning the parameters of MT systems, as in Ammar et al. (2013). An evaluation method based on pseudo-references is presented in Albrecht and Hwa (2007) and then further extended in Albrecht and Hwa (2008), where a metric is trained to correlate with human judgments based on features extracted with the help of three pseudo-references. The features are in the form of 18 kinds of reference-based scores together with an additional set of 25 monolingual fluency scores. The results showed that the new metric correlates well with human assessments and generalizes well across different language pairs.
The novelty of the MT evaluation metric introduced in this paper is twofold. First of all, the method makes use of the Partial Tree Kernel (PTK), a more general type of kernel function, which to the authors' knowledge has not been applied in the context of metrics-based MT evaluation before. Secondly, the proposed method also explores what impact sequence kernels (SK) have on the quality of a kernel evaluation metric, by studying their potential individually, but also in combination with the Partial Tree Kernel. Furthermore, we extend the previous work on pseudo-references and back-translations by studying their impact when they are used as input data for kernel functions.

Methods and implementation
A kernel function makes use of structural representations of the input data in order to calculate the number of substructures they share, without explicitly stating the feature spaces corresponding to the two representations (Moschitti, 2006a). The types of representations taken into account can be, among others, vectorial, sequential or tree-based. The tree kernels developed so far differ from one another in the types of tree fragments (e.g. subsets, subtrees or partial trees) and the type of syntactic trees (constituency or dependency) they employ in their computation, which influences their suitability for certain tasks (see Moschitti (2006a)). In contrast, sequence kernels (e.g. Bunescu and Mooney, 2005; Nguyen et al., 2009) make use of subsequences in the computation of the kernel.
The new method for the evaluation of Machine Translation proposed in this paper, denoted as TSKM, makes use of both tree and sequence kernels, which are applied to the pair of candidate and reference translations. The tree kernel used is the Partial Tree Kernel (PTK) (Moschitti, 2006a). It uses partial tree fragments, which are a generalization over subtrees and subset trees, so that a node and its partial descendants can constitute a valid fragment. An example of a dependency tree is presented in Figure 1, and some possible partial trees for it are (has(carried(out))) or (carried(out talks)). For the sequence kernel (SK), the kernel introduced in Bunescu and Mooney (2005) is utilized, which computes the number of common patterns shared by the two input sentences.
Figure 1: Example of a dependency tree.
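To make the notion of "common patterns shared by the two input sentences" concrete, the following is a minimal sketch of a gap-weighted subsequence kernel over tokens. It is a brute-force illustration only, not the Bunescu and Mooney (2005) kernel or the implementation used for TSKM; the function names, the `max_len` limit and the `decay` parameter are assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_weights(tokens, n, decay):
    """Map each length-n token subsequence (gaps allowed) to the sum of
    decay**span over all of its occurrences, where span is the number of
    positions covered from the first to the last matched token."""
    weights = defaultdict(float)
    for idx in combinations(range(len(tokens)), n):
        span = idx[-1] - idx[0] + 1
        weights[tuple(tokens[i] for i in idx)] += decay ** span
    return weights


def sequence_kernel(s, t, max_len=2, decay=0.5):
    """Sum, over subsequence lengths 1..max_len, of the weighted number of
    subsequences that the token lists s and t have in common."""
    total = 0.0
    for n in range(1, max_len + 1):
        ws = subsequence_weights(s, n, decay)
        wt = subsequence_weights(t, n, decay)
        total += sum(w * wt[u] for u, w in ws.items() if u in wt)
    return total


# Hypothetical reference and candidate token lists.
reference = ["has", "carried", "out", "talks"]
candidate = ["has", "held", "talks"]
print(sequence_kernel(reference, candidate))
```

The enumeration is exponential in `max_len`, so this form is only suitable for short sentences and small subsequence lengths; practical implementations use dynamic programming instead.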
Formally, TSKM can be defined as:

\[ \mathrm{TSKM}_{basic} = \mathrm{TSKM}(r, c) = \frac{\mathrm{PTK}(r, c) + \mathrm{SK}(r, c)}{2} \qquad (1) \]

with r and c denoting the reference and the candidate translations and PTK and SK referring to the scores of the Partial Tree Kernel and the Sequence Kernel.
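Equation (1) is a plain arithmetic mean of the two kernel scores, which can be sketched as follows. The kernels are passed in as functions; `toy_ptk` and `toy_sk` are hypothetical stand-ins that return fixed scores and do not reflect any real kernel implementation.

```python
from typing import Callable

# A kernel takes a reference and a candidate representation and returns a score.
Kernel = Callable[[object, object], float]


def tskm_basic(reference, candidate, ptk: Kernel, sk: Kernel) -> float:
    """Equation (1): mean of the PTK and SK scores for the pair (r, c)."""
    return (ptk(reference, candidate) + sk(reference, candidate)) / 2.0


# Hypothetical stand-in kernels returning already-normalized scores in [0, 1].
def toy_ptk(r, c):
    return 0.75


def toy_sk(r, c):
    return 0.25


print(tskm_basic("reference tree", "candidate tree", toy_ptk, toy_sk))  # 0.5
```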
Furthermore, we experimented with using an additional pseudo-reference and a back-translation in the computation of the metric in order to explore how the different combination schemes influence the performance of TSKM. One possible kind of combination can be represented as:
[...] sis models. The dependency parse trees obtained were converted to tree representations which can be used by the PTK. The lexical-centered-tree approach presented in Croce et al. (2011) was utilized, which required storing both the grammatical relation and the POS-tag information as the rightmost children of a dependency tree node. The scores of the kernel functions were normalized using the formula from Croce et al. (2011):

\[ \mathrm{score}(T_1, T_2) = \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \times K(T_2, T_2)}} \]

with T1 and T2 standing for the input data tuple and K indicating the type of kernel function. Regarding SK, only a tokenization of the data was required, as the SK function was applied to substructures composed of the lexical items.
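The normalization step can be sketched as below, assuming the common cosine-style form K(T1, T2) / sqrt(K(T1, T1) · K(T2, T2)). The `bag_of_words_kernel` is a hypothetical toy kernel used only to make the example runnable; it is neither PTK nor SK.

```python
from collections import Counter
from math import sqrt


def normalized_kernel(kernel, t1, t2):
    """Scale a kernel score so that identical inputs score 1.0.

    Assumes the cosine-style normalization
    K(t1, t2) / sqrt(K(t1, t1) * K(t2, t2)).
    """
    denom = sqrt(kernel(t1, t1) * kernel(t2, t2))
    # Guard against empty inputs whose self-similarity is zero.
    return kernel(t1, t2) / denom if denom > 0 else 0.0


def bag_of_words_kernel(s, t):
    """Toy kernel (hypothetical): dot product of token count vectors."""
    cs, ct = Counter(s), Counter(t)
    return float(sum(cs[w] * ct[w] for w in cs))


print(normalized_kernel(bag_of_words_kernel, ["a", "b"], ["a", "c"]))  # 0.5
```

Without this step, longer sentences would receive larger raw kernel values simply because they contain more substructures.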
For the computation of the kernel functions we used the Partial Tree Kernel and the Sequence Kernel implementations, found in the KeLP

Experimental setup
The evaluation of TSKM was performed using data pertaining to the News domain from the First Conference on Machine Translation (WMT16). For the results obtained in the WMT17 Metrics Task, please refer to the official results paper. The following language pairs were used in the evaluation: English-German, Czech-English, German-English, Finnish-English, Russian-English and Turkish-English. The MT outputs evaluated correspond to systems submitted to the WMT16 News Translation Task (Bojar et al., 2016), which range from statistical phrase-based to neural and syntax-based systems. The test sets consist of approximately 3000 tuples, each incorporating the source segment together with the reference and candidate translations. We evaluated TSKM in terms of Pearson correlation with human judgments. During the manual evaluation phase of WMT16, human judgments were collected by ranking five candidate translations, with ties being allowed. In order to compute a single TSKM score for an MT system, all the individual sentence scores were averaged.
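The system-level procedure described above (averaging sentence scores per system, then correlating with human judgments) can be sketched as follows. The function names and all numbers are hypothetical and serve only to illustrate the computation; they are not results from the evaluation.

```python
from math import sqrt


def system_score(sentence_scores):
    """Average the sentence-level metric scores of one MT system."""
    return sum(sentence_scores) / len(sentence_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sentence-level metric scores for three MT systems.
metric_by_system = [
    system_score([0.2, 0.4, 0.3]),
    system_score([0.5, 0.6, 0.7]),
    system_score([0.8, 0.9, 0.7]),
]
# Hypothetical system-level human scores for the same three systems.
human_by_system = [-0.5, 0.1, 0.6]

print(pearson(metric_by_system, human_by_system))
```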
Different variants of TSKM were taken into account for evaluation. To investigate how the lexical variation affects the performance of the metric, we also implemented versions of the metric where lemmas are used instead of the exact lexical items.

Results
The results of the evaluation are presented in Tables 1 and 2, which contain the correlation scores for the different TSKM variants taken into account. For comparison purposes, the scores for some state-of-the-art MT evaluation methods are also presented: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), PER (Tillmann et al., 1997), CDER (Leusch et al., 2006) and WER. The results were obtained using the evaluation scripts made available by the WMT16 conference. The following metric notation was adopted for each of the TSKM variants evaluated: [...] where Kernel identifies the type of kernel utilized (SK or PTK) and level refers to the input data tuple used in the calculation. The possible tuple types are:
• (r,c) - the pair of reference and candidate translations
• (c,s_t) - the pair of candidate translation and translated source
• (s,c_t) - the pair of source segment and back-translated candidate
In [...] presented. We first experimented with applying TSKM to the (r,c) and the (c,s_t) input data pairs. The best performing TSKM variant, SK(r,c)+PTK(r,c), represents the combination of PTK and SK applied to the reference and candidate translations. Its average correlation score over all language pairs outperforms the state-of-the-art metrics. We can observe that the addition of the pair consisting of the candidate translation and the pseudo-reference generated mixed results.
In the case of Finnish-English there was an obvious drop in performance, possibly due to the complex morphology of Finnish. Another observation is that the 'not exact' TSKM variants are more strongly correlated with the human judgments than their 'exact' counterparts.
In addition to the metric variants presented in Table 1, we further extended the evaluation for the English-German and German-English language pairs by including the source and back-translation tuple, with the results being presented in Table 2. In this case, the best performing method for both language pairs, PTK(r,c)+PTK(c,s_t)+PTK(s,c_t), makes use of all three possible input data tuples and outperforms the state-of-the-art metrics. Another aspect worth pointing out is that, in the case of English-German, the 'exact' metric variants are the ones that display better correlations. This suggests that when choosing between 'not exact' and 'exact' variants for TSKM, the direction of the translation (e.g. into or out of English) should be taken into account. Moreover, we can observe that there is a drastic decrease in correlation in the case of English-German translations, which can possibly be explained by the highly inflectional nature of the German language.
Additional preliminary evaluation experiments, presented in Table 3, were performed after the submission to the Shared Task. Generalizations of the Partial Tree Kernel were used, namely the Smoothed Partial Tree Kernel (SPTK) (Croce et al., 2011) and the Compositional Smoothed Partial Tree Kernel (CSPTK) (Annesi et al., 2013; Annesi et al., 2014). The SPTK uses a term similarity function to semantically match tree nodes. The term similarity function can be obtained through either word vector spaces or distributional analysis. The CSPTK, in turn, represents a generalization of SPTK, which uses Distributional Compositional Semantics to determine the degree of similarity between tree fragments. The implementations of these kernels, together with an example word space for English, are also available in the KeLP package. The results show that by relaxing the matching constraints to allow for lexical variation, these kernels outperform PTK when used by TSKM.
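The idea of a term similarity function derived from a word vector space can be illustrated with cosine similarity between word vectors. The sketch below is an assumption-laden toy: the function names, the fallback to exact matching and the two-dimensional vectors are hypothetical and are not taken from SPTK, CSPTK or KeLP.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def node_similarity(word1, word2, word_space):
    """Soft match between two tree-node labels.

    Identical words match fully; otherwise the score is the cosine of their
    vectors, or 0.0 when either word is missing from the word space.
    """
    if word1 == word2:
        return 1.0
    if word1 in word_space and word2 in word_space:
        return cosine(word_space[word1], word_space[word2])
    return 0.0


# Hypothetical toy word space with two-dimensional vectors.
toy_space = {
    "talks": [0.9, 0.1],
    "negotiations": [0.8, 0.2],
    "banana": [0.0, 1.0],
}
print(node_similarity("talks", "negotiations", toy_space))
print(node_similarity("talks", "banana", toy_space))
```

A hard-matching kernel would score "talks" against "negotiations" as 0, whereas a smoothed match of this kind gives partial credit to semantically related lexical items.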

Conclusions and future work
In this paper, we introduced TSKM, our submission to the WMT17 Metrics Task, which is based on tree and sequence kernels. The metric was evaluated using multiple language pairs, with the evaluation results being very encouraging. We also experimented with applying the kernel functions to additional input data tuples that involve back-translations and pseudo-references. In the case of the pseudo-reference, the results indicate that its addition to TSKM can be beneficial, especially in the case of the PTK. However, the most important aspect to notice is that, with the exception of Finnish-English, the pseudo-reference based methods achieved correlation scores that are very similar to those based on the official reference, which suggests that TSKM could be applied even in the context of artificially generated reference translations. The addition of the back-translations of the target sentences to TSKM generated encouraging results, which prompts us to extend the evaluation to include further language pairs. Based on the evaluation results, we can also observe that the SK metric variants attained correlation scores that are relatively similar to those of the PTK variants. This suggests that the SK metric variant can be used successfully when no syntactic analysis tools are available for the target language.
Future work will concentrate on using constituency trees as a structural input representation for PTK in addition to the dependency trees. The evaluation will also be extended to determine how well TSKM generalizes across domains. We also plan to analyze in more detail the decrease in correlation scores when using the pseudo-reference in the case of Finnish-English, by using different MT systems to generate additional pseudo-references in order to determine whether the type of MT system influences the correlation with human judgments. Another direction for future work is to extend the evaluation of SPTK and CSPTK by including them in different TSKM combinations and evaluating on additional language pairs.