Predicting Machine Translation Adequacy with Document Embeddings

This paper describes USAAR’s submission to the metrics shared task of the Workshop on Statistical Machine Translation (WMT) in 2015. The goal of our submission is to take advantage of the semantic overlap between hypothesis and reference translation for predicting MT output adequacy using language independent document embeddings. Our approach learns a Bayesian Ridge Regressor over document skip-gram embeddings in order to automatically evaluate Machine Translation (MT) output by predicting semantic adequacy scores. The evaluation of our submission, measured by the correlation with human judgements, shows promising results on system-level scores.


Introduction
Translation is becoming a utility in everyday life. The increased availability of real-time machine translation services relying on Statistical Machine Translation (SMT) allows users who do not understand the language of the source text to quickly gist the text and understand its general meaning. For these users, accurate meaning of translated words is more important than the fluency of the translated sentence.
However, SMT suffers from poor lexical choices. Fluent but inadequate translations are commonly produced due to the strong bias towards the language model component that prefers consecutive words based on the data that the system is trained on.
Current state-of-the-art MT evaluation metrics are generally able to identify problems with the grammaticality of a translation, but they are less able to assess the accuracy of the translated semantics, e.g. incorrect translation of ambiguous words or wrong assignment of semantic roles. In the example below, the ideal Machine Translation (MT) evaluation metric should appropriately penalise poor lexical choice, such as braked, and reward or at least allow leeway for semantically similar translations, such as external trade.

Phrase-based MT:
The foreign trade braked the economy.

Neural MT:
External trade also slowed the economy.

Reference (EN):
Foreign goods trade had slowed, too.
The German word bremste is commonly translated as braked in the context of driving, but the appropriate translation in the example above is slowed. Although the phrase external trade differs from foreign goods trade in the reference sentence, it should be considered an acceptable translation.
We propose a semantically grounded, language independent approach using Semantic Textual Similarity (STS) to evaluate the adequacy of the machine translation outputs with respect to their reference translations.
The remainder of this paper is structured as follows. Section 2 gives an overview of the related work in the field of MT evaluation. Section 3 presents the approach behind the USAAR submission to the metrics shared task. In Section 4 we present the data and experiments for this submission. Section 5 covers the evaluation of our metric by the WMT2015 metrics task organisers and in Section 6 we conclude on our WMT2015 metrics task submission.

Related Work
Researchers in the field of MT evaluation have proposed a large variety of methods for assessing the quality of automatically produced translations. Approaches range from fully automatic quality scoring to efforts aimed at the development of "human" evaluation scores that try to exploit the (often tacit) linguistic knowledge of human evaluators.

Automatic Evaluation of MT
MT output is usually evaluated by automatic metrics that are language independent, i.e. applicable regardless of the target language. Automatic metrics typically compute the closeness (adequacy) of a hypothesis to a reference translation and differ from each other by how this closeness is measured. The most popular MT evaluation metrics are IBM BLEU (Papineni et al., 2002) and NIST (Doddington, 2002), used not only for tuning MT systems, but also as evaluation metrics for translation shared tasks, such as the Workshop on Statistical Machine Translation (WMT).
IBM BLEU uses n-gram precision by matching machine translation output against one or more reference translations. It accounts for adequacy and fluency by calculating word precision, i.e. the n-gram precision.
In order to deal with the overgeneration of common words, precision counts are clipped, meaning that a reference word is exhausted after it is matched against the same word in the hypothesis. This is called the modified n-gram precision. For BLEU, the modified n-gram precision is calculated for n-gram orders up to N=4, and the results are combined by using the geometric mean. Instead of recall, BLEU computes the Brevity Penalty (BP) (see formula in 2), thus penalising candidate translations which are shorter than the reference translations.
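The clipping and the brevity penalty described above can be sketched in a few lines of Python. This is a minimal, single-reference, sentence-level illustration under simplifying assumptions (whitespace tokens, no smoothing, uniform n-gram weights); the function names are hypothetical and this is not the reference BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # A hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(hyp_len, ref_len):
    """BP = 1 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)


def bleu(hyp, ref, max_n=4):
    """Geometric mean of the modified precisions for n = 1..max_n, scaled by BP."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hyp), len(ref)) * math.exp(log_mean)
```

For example, the hypothesis "the the the the" scored against the reference "the cat sat" receives a unigram precision of 1/4 rather than 4/4, because the single reference occurrence of "the" is exhausted after the first match.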
The NIST metric is derived from IBM BLEU. The NIST score is the arithmetic mean of modified n-gram precision for N=5 scaled by the BP. Additionally, NIST also considers the information gain of each n-gram, giving more weight to more informative (less frequent) n-grams and less weight to less informative (more frequent) n-grams.
Another often used machine translation evaluation metric is METEOR (Denkowski and Lavie, 2014). Unlike IBM BLEU and NIST, METEOR evaluates a candidate translation by calculating precision and recall on the unigram level and combining them into a parametrised harmonic mean. The result from the harmonic mean is then scaled by a fragmentation penalty which penalizes gaps and differences in word order. METEOR is described in detail in Section 3.1.
Besides these evaluation metrics, several other metrics are used for the evaluation of MT output. Some of these are the WER (word error rate) metric based on the Levenshtein distance (Levenshtein, 1966), the position-independent error rate metric PER (Tillmann et al., 1997) and the translation edit rate metric TER (Snover et al., 2006) with its newer version TERp (Snover et al., 2009).
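As an illustration of the simplest of these edit-based metrics, the following sketch computes a word error rate from the Levenshtein distance over tokens. It assumes unit costs for insertion, deletion and substitution and normalisation by the reference length; the function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences with unit costs."""
    prev = list(range(len(ref) + 1))  # distance from the empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance to the empty reference prefix
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hyp, ref):
    """Word error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)
```

PER and TER differ from this sketch in how they treat word order: PER ignores positions altogether, while TER additionally allows block shifts as edit operations.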
The semantics of both hypotheses and reference translations is considered by MEANT (Lo et al., 2012). MEANT, based on HMEANT (Lo and Wu, 2011a;Lo and Wu, 2011b;Lo and Wu, 2011c), is a fully automatic semantic MT evaluation metric, measuring semantic fidelity by determining the degree of parallelism of verb frames and semantic roles between hypothesis and reference translations. Some approaches aim at combining several linguistic and semantic aspects. Gonzàlez et al. (2014) as well as Comelles and Atserias (2014) introduce their fully automatic approaches to machine translation evaluation using lexical, syntactic and semantic information when comparing the machine translation output with reference translations.

Human Evaluation of MT
Human MT evaluation approaches employ the knowledge of human annotators to assess the quality of automatically produced translations along the two axes of target language correctness and semantic fidelity. The Linguistic Data Consortium (LDC) introduced an MT evaluation task that elicits quality judgements of MT output from human annotators using a numerical scale (Linguistics Data Consortium, 2005). These judgements were split into two categories: adequacy, the degree of meaning preservation, and fluency, target language correctness.
Adequacy judgements require annotators to rate the amount of meaning expressed in the reference translation that is also present in the translation hypothesis. Fluency judgements require annotators to rate how well the translation hypothesis in the target language is formed, disregarding the sentence meaning. Although evaluators are asked to assess the fluency and adequacy of a hypothesis translation on a Likert scale separately, Callison-Burch et al. (2007) reported high correlation between annotators' adequacy and fluency scores.
MT output is also evaluated by measuring human post-editing time for productivity (Guerberof, 2009; Zampieri and Vela, 2014), or by asking evaluators to rank MT system outputs (by ordering a set of translation hypotheses according to their quality). This ranking task has been shown to be very easy for evaluators to accomplish, since it does not require specific skills; a homogeneous group is enough to perform it. This is also the method applied at the WMT campaigns of recent years, where humans are asked to rank machine translation output by using APPRAISE (Federmann, 2012), a software tool that integrates facilities for such a ranking task.
Reading comprehension tests are an indirect human evaluation method that is also employed for error analysis (e.g. Maney et al. (2012), Weiss and Ahrenberg (2012)). Other evaluation metrics try to measure the effort that is necessary for "repairing" MT output, that is, for transforming it into a linguistically correct and faithful translation. One such metric is HTER (Snover et al., 2006), which uses human annotators to generate targeted reference translations by means of post-editing, the rationale being that in this way the shortest path between a hypothesis and its correct version can be found.

Semantic Textual Similarity
Given two snippets of text, the Semantic Textual Similarity (STS) task attempts to measure their semantic equivalence on a scale of 1 to 5 (Agirre et al., 2014). The STS task is organized annually during the SemEval workshop and systems are evaluated based on their Pearson correlation coefficient with the human annotations.
The STS task is similar to the task of determining the adequacy of a translation hypothesis with respect to a reference translation. It is usually treated as a regression task where systems are trained using features such as: (i) linguistic annotation overlaps between the two text snippets, e.g. syntactic dependency, lexical paraphrases, part of speech (Šarić et al., 2012; Han et al., 2012; Pilehvar et al., 2013); (ii) machine translation metrics as features in training a supervised regressor (Rios et al., 2012; Barrón-Cedeño et al., 2013; Huang and Chang, 2014; Tan et al., 2015b); (iii) word/document embedding similarity (Sultan et al., 2015; Arora et al., 2015).
Linguistic annotations are restricted by the availability of the annotation tools, which are often language dependent. Machine translation evaluation metrics generally provide a shallow comparison between hypotheses and reference translations, focusing on capturing the grammatical similarities between the texts, whereas the use of document embeddings focuses on capturing the semantic similarity between texts. Word embeddings range from the traditional Latent Semantic Analysis (LSA) vector spaces used for information retrieval (Landauer and Dumais, 1997) to the current trend of using neural nets for NLP/MT tasks (Bordes et al., 2011; Huang et al., 2012; Bordes et al., 2012; Chen and Manning, 2014; Bowman et al., 2015).

Our Approach
Although consensus exists that lexical-based metrics cannot cover the entire range of linguistic phenomena (Vela et al., 2014a; Vela et al., 2014b), the goal in the MT community remains to have a language independent metric that takes into account lexical, syntactic and semantic information when mapping the MT output against the reference translation. The questions that have to be accounted for in such a language-independent metric are:
(i) Is there a lexical overlap between reference and hypothesis translation?
(ii) Is there a syntactic overlap between reference and hypothesis translation?
(iii) Is there a semantic overlap between reference and hypothesis translation?
In the ideal situation one would also take into account lexical, syntactic and semantic information from the source text. Specific information (on the lexical, syntactic and semantic level) from the source text could help improve not only the translation process, but also the evaluation.
As pointed out in Section 2, several approaches attempt to cover the entire range of linguistic phenomena in the evaluation process. The approach presented in this paper builds on the STS approach mentioned in Section 2.3, aiming to provide a language independent adequacy score using document embedding similarity, as opposed to the traditional synonym and paraphrase overlap approach used in METEOR. The matching of synonyms in METEOR relies on WordNet (Miller, 1995), which is a limited resource, making it impossible to use the synonymy module from METEOR for languages other than English. The provided or self-extracted paraphrase tables for METEOR are available only for languages for which big corpora are available, making it difficult to provide paraphrases for under-resourced languages. Since METEOR relies on WordNet synonymy and language dependent paraphrase tables for its semantic component, our goal is to substitute this component with a language independent one. Different from the STS task, the WMT metrics task provides the ranks of the systems' hypotheses instead of absolute human evaluation scores of the translation hypotheses. To generate the absolute scores, we use the METEOR scores between the translation hypotheses and the reference translations.
The neural nets were trained to produce 400 dense features for 100 epochs with a window size of 5 for all words from the WMT metrics task data.

$$\mathrm{sim}(hyp, ref) = hyp \cdot ref$$
The document embedding similarity is computed as the dot product between the embeddings of the translation hypothesis (hyp) and the reference translation (ref). Geometrically, for length-normalised vectors the dot product yields the cosine similarity between the two vectors. Alternatively, one could also calculate the cosine similarity by summing the squares of the word vectors of the intersecting word embeddings and normalising by the root of the sum of squares for all words in the documents (Tan, 2013).
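The relation between dot product and cosine similarity can be checked with a small sketch. The three-dimensional vectors in the usage note are toy values chosen for illustration (the embeddings in this paper have 400 dimensions), and the helper names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def normalise(u):
    """Scale a vector to unit length."""
    length = norm(u)
    if length == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [a / length for a in u]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (norm(u) * norm(v))
```

With the toy vectors [1, 2, 2] and [2, 1, 2], both `cosine(u, v)` and `dot(normalise(u), normalise(v))` give 8/9, which is why the plain dot product suffices once the embeddings are length normalised.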
Using the similarity scores between the hypothesis and reference embeddings, we train a Bayesian Ridge Regressor targeting the METEOR scores as the desired output.

METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR (Denkowski and Lavie, 2014) is an MT evaluation metric which tries to consider both grammatical and semantic knowledge. The metric is based on the alignment between a hypothesis translation and a reference translation, produced by four modules. The number of modules to be used depends on the availability of resources for a specific language. The first module generates the alignments based on the surface forms of the words in the hypothesis and reference translation. The next module performs the alignment on word stems, followed by the alignment of words listed as synonyms in WordNet (Miller, 1995). The last module is responsible for the paraphrase matching between the hypothesis and reference translation, based on the provided or the self-extracted paraphrase tables. For the final score calculation all matches are generalised to phrase/chunk matches with a start position and phrase length in each sentence.
Different from other evaluation metrics, METEOR makes the distinction between content words and function words in the hypothesis (h_c, h_f) and reference (r_c, r_f) translation. This distinction is made by means of a provided function word list.
From the final alignment between hypothesis and reference translation, precision (P) and recall (R) are calculated by weighting content words and function words differently. This is described by Denkowski and Lavie (2014) as follows. For each of the matchers (m_i), count the number of content and function words covered by matches of this type in the hypothesis (m_i(h_c), m_i(h_f)) and reference (m_i(r_c), m_i(r_f)) translation. The weighted precision (P) and recall (R) are computed by using the matcher weights w_i...w_n and the content-function word weight δ, as shown in equations 5 and 6.
The harmonic mean F_mean is calculated by the formula in equation 7.
METEOR also accounts for word order differences and gaps by scaling F_mean by the fragmentation penalty (Pen). The fragmentation penalty in equation 8 is computed by using the total number of matched words (m) and the number of chunks (ch).
The final score is then obtained by scaling F_mean by the fragmentation penalty. The parameters α, β, γ, δ and w_i...w_n can be used for tuning METEOR for a given task.
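For reference, the quantities described above can be written out as follows. The formulas follow the formulation in Denkowski and Lavie (2014), where δ weights content words against function words and γ scales the fragmentation penalty; the equation numbers are matched to the references in the text and are an assumption.

```latex
\begin{align}
P &= \frac{\sum_i w_i \left( \delta \, m_i(h_c) + (1 - \delta) \, m_i(h_f) \right)}
          {\delta \, |h_c| + (1 - \delta) \, |h_f|} \tag{5} \\
R &= \frac{\sum_i w_i \left( \delta \, m_i(r_c) + (1 - \delta) \, m_i(r_f) \right)}
          {\delta \, |r_c| + (1 - \delta) \, |r_f|} \tag{6} \\
F_{mean} &= \frac{P \cdot R}{\alpha \, P + (1 - \alpha) \, R} \tag{7} \\
Pen &= \gamma \left( \frac{ch}{m} \right)^{\beta} \tag{8} \\
Score &= (1 - Pen) \cdot F_{mean} \tag{9}
\end{align}
```

Here α balances precision against recall, while β and γ control the shape and the weight of the fragmentation penalty: fewer, longer chunks (small ch relative to m) yield a smaller penalty.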

Cosine Similarity
Cosine similarity is a similarity measure that can handle the fact that very similar documents (in our case sentences) may have different lengths. The cosine similarity of two documents is calculated by deriving a vector V(d) for each sentence or document d; the terms in the vector are normalised using TF*IDF. The set of documents in a collection is viewed as a set of vectors in a vector space, each term (meaning a word) having its own axis. With this kind of representation the initial ordering of terms in the document is lost, since cosine similarity does not incorporate context.
The cosine of two vectors can be derived by using the Euclidean dot product formula (10). Derived from the formula in (10), the similarity between two documents d_1 and d_2 can be computed by the cosine similarity of their vector representations V(d_1) and V(d_2), as in (11).
The numerator in (11) represents the dot product of the vectors V(d_1) and V(d_2) and is defined as shown in equation (12).
The denominator corresponds to the product of the Euclidean lengths of the vectors V(d_1) and V(d_2).
The vectors are length normalised by the formulas in (13) and (14).
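The formulas referred to as (10) to (14) can be written in their standard form as follows. The angle θ between the two vectors and the number of vector components M are notation introduced here for illustration, and the equation numbers are matched to the references in the text.

```latex
\begin{align}
\vec{a} \cdot \vec{b} &= \|\vec{a}\| \, \|\vec{b}\| \cos\theta \tag{10} \\
sim(d_1, d_2) &= \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}
                     {|\vec{V}(d_1)| \, |\vec{V}(d_2)|} \tag{11} \\
\vec{V}(d_1) \cdot \vec{V}(d_2) &= \sum_{i=1}^{M} V_i(d_1) \, V_i(d_2) \tag{12} \\
|\vec{V}(d_1)| &= \sqrt{\sum_{i=1}^{M} V_i(d_1)^2} \tag{13} \\
|\vec{V}(d_2)| &= \sqrt{\sum_{i=1}^{M} V_i(d_2)^2} \tag{14}
\end{align}
```

Dividing by the two Euclidean lengths is what makes the measure insensitive to document length: scaling either vector by a positive constant leaves the similarity unchanged.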

ZWICKEL: A Regression-based Metric
Similar to the Semantic Textual Similarity (STS) and MT Quality Estimation approaches, we treat the MT metric task as a regression task with the aim of learning a Bayesian Ridge function that maps the cosine similarity feature to the target METEOR score. A Bayesian Regressor finds a maximum a posteriori solution under a Gaussian prior N over the parameters w with the precision of λ^{-1}. The α and λ parameters are treated as random variables estimated from the data.
$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha)$$
The Bayesian Ridge estimates a probabilistic regression model with a zero-mean prior for the parameter w, given by a spherical Gaussian:
$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I)$$
Without the caveats of mathematical argot, we refer to the cosine similarities as X, and to the METEOR scores as Y. We aim to learn a regressor that outputs the paraphrase and synonym METEOR scores using the cosine similarities, without the paraphrase/synonym tables. Essentially, this leads to a language independent METEOR measure based on cosine similarity between translation and reference vectors.
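The mapping from a single cosine similarity feature to a METEOR score can be illustrated with a simplified stand-in. The sketch below is a plain ridge regression with one feature and a fixed regularisation strength `lam`, which corresponds to a maximum a posteriori estimate under a zero-mean Gaussian prior on w. It is not the full Bayesian Ridge Regressor, which additionally estimates α and λ from the data. The function names and the default value of `lam` are assumptions made for illustration.

```python
def fit_ridge_1d(xs, ys, lam=1.0):
    """Fit y ~ w * x + b, penalising w with lam * w**2 (intercept unpenalised).

    Simplified stand-in for Bayesian Ridge: lam is fixed here instead of
    being estimated from the data.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx + lam == 0.0:
        raise ValueError("constant feature with lam = 0 has no unique solution")
    w = sxy / (sxx + lam)   # the penalty shrinks the slope towards zero
    b = y_mean - w * x_mean
    return w, b


def predict(x, w, b):
    """Predicted target score (here: METEOR) for one similarity value."""
    return w * x + b
```

In the setting of this paper, `xs` would hold the cosine similarities of the training sentence pairs and `ys` their METEOR scores. The shrinkage of the slope is one plausible reason why a regressor of this kind produces conservative predictions that stay away from the extremes of the target range.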

COMET: A Combination of METEOR and ZWICKEL
We noticed that the output of the basic ZWICKEL metric is conservative and, unlike the METEOR score, never reaches an extreme score of 0.0 or 1.0. Thus, we created a "switch-like metric", COMET, that treats the METEOR score as an oracle when METEOR reports 0.0 or 1.0, and otherwise falls back to ZWICKEL.
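The switch described above amounts to a single conditional. The following sketch assumes that both scores are available per sentence as floats; the function and argument names are hypothetical.

```python
def comet(meteor_score, zwickel_score):
    """Return METEOR when it reports an extreme score, otherwise ZWICKEL."""
    if meteor_score in (0.0, 1.0):
        return meteor_score  # METEOR is treated as an oracle at the extremes
    return zwickel_score     # fall back to the regressor output
```

For instance, `comet(1.0, 0.87)` returns 1.0, whereas `comet(0.42, 0.55)` returns the ZWICKEL value 0.55.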

Experiments
This year's USAAR submission to the WMT metrics shared task concentrated on evaluating translations into German and into English, assigning a score both at sentence and system level.

Training Data
For training our system we used the available data from the previous WMT shared tasks by conflating them into a single data set. The into German set consisted of 359545 sentence pairs and the into English set consisted of 1194017 sentence pairs.

Test Data
The test data for our evaluation metrics consist of all system outputs from this year's translation task performed on the newstest2015 data set. Depending on the source language the data sets consist of a different number of sentences. Into English we evaluated MT systems having the following source languages:
• Czech with 10 system submissions and 2655 translated sentences per system
• German with 13 system submissions and 2168 translated sentences per system
• Finnish with 14 system submissions and 1369 translated sentences per system
• Russian with 13 system submissions and 2817 translated sentences per system
Into German we evaluated 16 systems with 2168 translated sentences per system. Based on the sentence scores we also provided a system score for each language pair. The system score was calculated by using different means (median, arithmetic mean, arithmetic geometric mean, harmonic mean and root squared mean) for each proposed metric.
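The aggregation of sentence scores into a system score can be sketched as follows. This illustration covers four of the listed aggregates and reads "root squared mean" as the root mean square, which is an assumption. The arithmetic geometric mean is left out because the text does not say how it is applied to more than two values. Scores are assumed to be non-negative, and the function name is hypothetical.

```python
import math
import statistics


def system_scores(sentence_scores):
    """Aggregate the sentence-level scores of one system in several ways."""
    scores = list(sentence_scores)
    if not scores:
        raise ValueError("at least one sentence score is required")
    return {
        "median": statistics.median(scores),
        "arithmetic_mean": statistics.fmean(scores),
        # harmonic_mean raises StatisticsError for negative values
        "harmonic_mean": statistics.harmonic_mean(scores),
        "root_mean_square": math.sqrt(statistics.fmean([s * s for s in scores])),
    }
```

The choice of aggregate matters when the sentence score distribution is skewed: the harmonic mean is pulled towards low scores, while the root mean square gives more weight to high scores.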

USAAR's Submission to the WMT2015 Metrics Shared Task
In order to evaluate the efficacy of our method we contributed three systems to the metrics task:
• COSINE: the raw document embedding similarity, i.e. sim(hyp, ref)
• ZWICKEL: the cosine-based metric outputs from the regressor described above
• COMET: the combination of ZWICKEL outputs from the regressor and METEOR

Evaluation
All submissions to the metrics task were evaluated at system level by computing their Pearson correlation coefficient with human judgements. For the evaluation of translations into English our best submission is COMET, achieving on average a correlation coefficient of 0.788±0.026. For the evaluation of translations from English into German, COMET is again our best submission with a correlation coefficient of 0.448±0.40. Table 5 shows the system-level Pearson correlation coefficient for COSINE, ZWICKEL and COMET for each language pair into English and for the language pair English-German.
Spearman's correlation coefficient was also computed, but only as the average over all language pairs into English and into German. From the results in Table 5 we notice that COMET was the best-performing metric for translations both into English and into German, achieving a coefficient of 0.665±0.069 for translations into English and 0.588±0.072 for translations from English into German.
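Both correlation coefficients can be computed with a few lines of code. The sketch below takes one metric score and one human score per system. It is an illustration of the two coefficients, not the evaluation script of the task organisers, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    var_y = sum((y - y_mean) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Pearson's coefficient rewards a linear relation between metric and human scores, whereas Spearman's coefficient only checks whether the two induce the same ordering of the systems.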

Conclusion
This paper presents USAAR's submission to the WMT2015 metrics shared task. The aim of our submission was a language independent method for predicting MT adequacy based on the semantic similarity between hypothesis and reference translation by using document embeddings. We contributed three evaluation metrics; COMET, a combination of a cosine-based metric and METEOR, was the one correlating best with the human evaluators.
Previous studies have shown that METEOR systematically underestimates the quality of translations (Vela et al., 2014b). In future work, our approach using document embeddings and cosine similarities could also be used to predict different scores (i.e. other than METEOR). Additionally, further experiments on document/word embeddings would be beneficial to find the best-fit solution for the cosine similarity calculation between a machine translation and its reference translation.