Meteor++: Incorporating Copy Knowledge into Machine Translation Evaluation

In machine translation evaluation, a good candidate translation can be regarded as a paraphrase of the reference. We notice that some words are always copied during paraphrasing, which we call copy knowledge. Given the stability of such knowledge, a good candidate translation should contain all of these words that appear in the reference sentence. Therefore, in our participation in the WMT 2018 metrics shared task, we introduce a simple statistical method for copy knowledge extraction and incorporate it into the Meteor metric, resulting in a new machine translation metric, Meteor++. Our experiments show that Meteor++ integrates copy knowledge well and significantly improves performance on the WMT17 and WMT15 evaluation sets.


Introduction
Automatic metrics for machine translation (MT) evaluation have received significant attention in the past few years. MT evaluation measures how close machine-generated translations are to professional human translations, which can be treated as paraphrase evaluation except when the candidates are identical to the references. The main difference is that MT evaluation only takes correctness into consideration, while paraphrase evaluation also focuses on diversity.
According to previous studies on paraphrasing, paraphrasing knowledge can be divided into two categories: copy knowledge and paraphrasable knowledge. The former reflects stable information that tends to stay intact during paraphrasing, while the latter can be paraphrased in various ways. Previous work has taken copy mechanisms into account in text generation (Vinyals et al., 2015; Gu et al., 2016; See et al., 2017; Li et al., 2017). In this paper, we extend the idea of copying from generation to MT evaluation.
We first introduce copy knowledge extraction from paraphrase corpora and then propose Meteor++, which incorporates this knowledge into Meteor. Our experimental results show that Meteor++ has a higher Pearson correlation with human scores than Meteor on the WMT evaluation sets, demonstrating the efficacy of copy knowledge.

Background
Various metrics for MT evaluation have been proposed; the most widely used are BLEU (Papineni et al., 2002) and Meteor (Banerjee and Lavie, 2005; Lavie, 2011, 2014). The main principle behind BLEU is to measure the n-gram overlap between the machine output and the human reference translations at the corpus level. BLEU emphasizes precision and does not take recall into account directly, while Meteor not only combines the two but also gives a higher weight to recall in general. We choose Meteor in this paper because recall is extremely important for assessing the quality of MT output, as it reflects to what degree the translation covers the entire content of the source sentence.
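To make the distinction between precision and recall concrete, the following minimal sketch computes clipped unigram precision and recall for a single candidate-reference pair. It is only an illustration under simplifying assumptions (whitespace tokenization, exact matching, one reference); the function name is hypothetical and it is neither the BLEU nor the Meteor implementation.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Clipped unigram precision and recall over whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated candidate word cannot be credited more often than it
    # occurs in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# A short candidate can be fully precise while covering only half of
# the reference.
p, r = unigram_precision_recall("the cat sat", "the cat sat on the mat")
print(p, r)
```

In this example every candidate word is found in the reference, yet half of the reference content is missing, which is the situation a recall-oriented metric is meant to penalize.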
The Meteor metric has been shown to correlate highly with human judgments in evaluations such as the 2010 ACL Workshop on Statistical Machine Translation and NIST Metrics MATR (Callison-Burch et al., 2010). It is based on the general concept of flexible unigram matching, unigram precision and unigram recall, including matches between words that are simple morphological variants of each other (sharing an identical stem) and words that are synonyms of each other. Meteor first constructs an alignment between the two sentences in several stages (exact, stem, synonym and paraphrase), each with a different weight. It then calculates the weighted precision P and recall R. For each matcher $m_i$, it counts the number of content and function words covered by matches of the i-th type in the candidate ($m_i(h_c)$, $m_i(h_f)$) and in the reference ($m_i(r_c)$, $m_i(r_f)$); $|h_f|$ and $|r_f|$ denote the total number of function words in the candidate and the reference, and $|h_c|$ and $|r_c|$ denote the total number of content words in the candidate and the reference:

$$P = \frac{\sum_i w_i \cdot (\delta \cdot m_i(h_c) + (1 - \delta) \cdot m_i(h_f))}{\delta \cdot |h_c| + (1 - \delta) \cdot |h_f|} \qquad (1)$$

$$R = \frac{\sum_i w_i \cdot (\delta \cdot m_i(r_c) + (1 - \delta) \cdot m_i(r_f))}{\delta \cdot |r_c| + (1 - \delta) \cdot |r_f|} \qquad (2)$$

The parameterized harmonic mean of precision P and recall R is then calculated:

$$F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R} \qquad (3)$$

To account for gaps and differences in word order, a fragmentation penalty is calculated using the total number of matched words (m, averaged over hypothesis and reference) and the number of chunks (ch):

$$Pen = \gamma \cdot \left(\frac{ch}{m}\right)^{\beta} \qquad (4)$$

The Meteor score is then calculated:

$$Score = (1 - Pen) \cdot F_{mean} \qquad (5)$$

The parameters $\alpha$, $\beta$, $\gamma$, $\delta$ and $w_1 ... w_n$ are tuned to maximize correlation with human judgments.

Table 2: WMT "copy-words" examples; c is the raw count and p is the co-occurrence probability. We select the candidates with human scores greater than or equal to 0.7 and combine them with their references as paraphrase pairs. After filtering, we obtain 1088 paraphrase pairs with a vocabulary of 4619 words. In total, we extract 268 "copy-words" with 2 as the c threshold and 0.8 as the p threshold. Note that all words are lowercased.

Table 3: Copy knowledge classification. We combine the copy knowledge of Quora and WMT and obtain 695 "copy-words" in total; c is the raw count and p is the proportion of each type.
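As a rough illustration of the Meteor scoring described in this section (weighted precision and recall, parameterized harmonic mean, fragmentation penalty), the sketch below computes a score from precomputed match counts. The function name, the dictionary keys and the parameter values in the example are assumptions chosen for illustration; the alignment step that produces the counts and the chunk number is not shown.

```python
def meteor_score(matches, weights, cand_totals, ref_totals, chunks,
                 alpha, beta, gamma, delta):
    """Compute a Meteor-style score from precomputed match counts.

    matches: one dict per matcher with counts 'hc', 'hf', 'rc', 'rf'
             (content/function words matched in candidate/reference).
    weights: one weight w_i per matcher.
    cand_totals, ref_totals: (content_count, function_count) tuples.
    """
    h_c, h_f = cand_totals
    r_c, r_f = ref_totals

    # Weighted precision and recall; content words are weighted by delta
    # and function words by (1 - delta).
    p_den = delta * h_c + (1 - delta) * h_f
    r_den = delta * r_c + (1 - delta) * r_f
    p_num = sum(w * (delta * m["hc"] + (1 - delta) * m["hf"])
                for w, m in zip(weights, matches))
    r_num = sum(w * (delta * m["rc"] + (1 - delta) * m["rf"])
                for w, m in zip(weights, matches))
    precision = p_num / p_den if p_den else 0.0
    recall = r_num / r_den if r_den else 0.0

    # Parameterized harmonic mean of precision and recall.
    f_den = alpha * precision + (1 - alpha) * recall
    f_mean = precision * recall / f_den if f_den else 0.0

    # Fragmentation penalty: m is the number of matched words averaged
    # over candidate and reference.
    matched_h = sum(m["hc"] + m["hf"] for m in matches)
    matched_r = sum(m["rc"] + m["rf"] for m in matches)
    m_avg = (matched_h + matched_r) / 2
    penalty = gamma * (chunks / m_avg) ** beta if m_avg else 0.0

    return {"precision": precision, "recall": recall, "f_mean": f_mean,
            "penalty": penalty, "score": (1 - penalty) * f_mean}


# Toy example: candidate "the cat sat" against reference
# "the cat sat on the mat", exact matcher only, one contiguous chunk.
# All parameter values here are illustrative, not tuned.
result = meteor_score(
    matches=[{"hc": 2, "hf": 1, "rc": 2, "rf": 1}],
    weights=[1.0],
    cand_totals=(2, 1),
    ref_totals=(3, 3),
    chunks=1,
    alpha=0.85, beta=0.2, gamma=0.6, delta=0.75,
)
print(result)
```

Because the candidate is fully matched, its precision is 1, while recall drops because half of the reference is uncovered; the fragmentation penalty then scales the harmonic mean down further.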
are not semantically equivalent. Therefore the recall and precision of copy knowledge play a key role in the quality of translations.
In light of this, we propose a method for copy knowledge extraction in formula (6), where $p_w$ is the co-occurrence probability of a word $w$, $C(w)$ is its raw appearance count and $C(co_w)$ is its co-occurrence count:

$$p_w = \frac{C(co_w)}{C(w)} \qquad (6)$$

We select as "copy-words" the words whose raw counts and co-occurrence probabilities in high-quality candidates and references exceed certain thresholds (F, P).
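The extraction step can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the experiments: it assumes each sentence is a lowercased bag of whitespace tokens, counts a word's raw appearance as the number of pairs in which it occurs on either side, counts co-occurrence as the number of pairs in which it occurs on both sides, and treats the thresholds as inclusive. The function and argument names are hypothetical.

```python
from collections import Counter


def extract_copy_words(pairs, count_threshold, prob_threshold):
    """Return {word: (raw_count, co_occurrence_probability)} for words
    that pass both thresholds.

    pairs: iterable of (sentence_a, sentence_b) paraphrase pairs.
    """
    raw = Counter()  # pairs in which the word appears on either side
    co = Counter()   # pairs in which the word appears on both sides
    for sent_a, sent_b in pairs:
        bag_a = set(sent_a.lower().split())
        bag_b = set(sent_b.lower().split())
        raw.update(bag_a | bag_b)
        co.update(bag_a & bag_b)

    copy_words = {}
    for word, count in raw.items():
        prob = co[word] / count
        # Assumption: thresholds are inclusive.
        if count >= count_threshold and prob >= prob_threshold:
            copy_words[word] = (count, prob)
    return copy_words


pairs = [
    ("how do i learn python", "what is the best way to learn python"),
    ("is python hard to learn", "how difficult is python"),
    ("where is paris", "in which country is paris located"),
]
copy_words = extract_copy_words(pairs, count_threshold=2, prob_threshold=0.8)
print(copy_words)
```

On this toy input only "python" survives: "paris" is always copied but appears in a single pair, so it fails the count threshold, while "learn" and "is" appear often enough but are dropped in some paraphrases.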
Here we test the method described above on the Quora dataset (https://www.kaggle.com/quora/question-pairs-dataset) and the WMT datasets. The Quora dataset consists of over 400,000 lines of potential duplicate question pairs. Each question pair has a binary value that indicates whether the line truly contains a duplicate pair. Here we use only the duplicate question pairs, comprising 142,963 paraphrase pairs with a vocabulary of 32,582 words. The WMT dataset consists of WMT15-17 (Bojar et al., 2017, 2016; Stanojević et al., 2015). We select the candidates with high human scores and combine them with their references as paraphrase pairs. There are 9287 pairs with human scores, of which only about one thousand are useful: we regard as useful the pairs whose human scores exceed a threshold (here set to 0.8). Since the amount of available text with high human scores is quite small, it is still not possible to conclude which words belong to copy knowledge. Table 1 and Table 2 show part of the copy knowledge extraction results on Quora and WMT.
In Table 3, we divide the copy knowledge into several categories and find that it is mainly composed of locations, persons, organizations, miscellaneous entities and some others. We manually label these 695 (427 + 268) "copy-words" and find that about 67% of them are named entities. In general, named entities account for a large proportion of copy knowledge.

Model
Inspired by the observation of copy knowledge, we propose Meteor++ based on Meteor. In Meteor++, we incorporate copy knowledge into precision P and recall R indirectly. Specifically, we penalize the following two conditions from the perspective of recall and precision:
• Recall: some "copy-words" appear only in the reference but not in the candidate.
• Precision: some "copy-words" appear only in the candidate but not in the reference.
Candidates in the first condition may discard important information, and candidates in the second may add extra information. We propose to correct the formulation of precision P and recall R in Meteor as follows. In formula (8), each matcher $m_i$ counts the number of "copy-words" covered by matches of the i-th type in the candidate ($m_i(h_p)$) and in the reference ($m_i(r_p)$); $|h_p|$ and $|r_p|$ denote the total number of "copy-words" in the candidate and the reference, respectively. X is a hyper-parameter introduced for two purposes:
• Smoothing: formulas (1) and (2) already penalize unmatched words, so here we only give an appropriate extra penalty for missing "copy-words".
• Compensation for the gap: in Section 3.1 we only propose a simple statistical method to extract copy knowledge, and its output is still far from the real copy knowledge.
Likewise, the modified recall is given by formula (9). After this correction, the corrected precision and recall replace the original P and R in the subsequent calculation.
These two formulas can be regarded as using the precision and recall of the "copy-words" to penalize the entire sentence. If the "copy-words" are not identical in the candidate-reference pair, P and R are discounted by formulas (8) and (9). Since copy knowledge is of greater importance, a candidate needs a sufficiently high recall and precision of "copy-words" to be considered a high-quality translation.

[Table header: lang-pair, de-en, fi-en, ru-en, ro-en, cs-en, tr-en, lv-en, zh-en; row WMT17; values not recovered]
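The idea of discounting P and R by the matching rate of "copy-words" can be sketched as below. The exact form of formulas (8) and (9) is not reproduced in this text, so the smoothed ratio used here, (weighted matched copy-words + X) / (total copy-words + X), is a hypothetical stand-in chosen only because it behaves as described: it leaves the score unchanged when there are no "copy-words" or all of them are matched, and a larger X softens the penalty. All function and argument names are assumptions.

```python
def copy_discount(copy_matches, weights, total_copy_words, x):
    """Hypothetical smoothed matching rate of copy-words.

    copy_matches: per-matcher counts of matched copy-words.
    weights: per-matcher weights w_i.
    total_copy_words: number of copy-words in the sentence.
    x: smoothing hyper-parameter X (assumed to be positive).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    weighted = sum(w * m for w, m in zip(weights, copy_matches))
    return (weighted + x) / (total_copy_words + x)


def corrected_precision_recall(p, r, cand_copy_matches, ref_copy_matches,
                               weights, n_cand_copy, n_ref_copy, x):
    """Discount Meteor's P and R by the copy-word matching rates
    (an assumed form, not the paper's exact formulas)."""
    p_hat = p * copy_discount(cand_copy_matches, weights, n_cand_copy, x)
    r_hat = r * copy_discount(ref_copy_matches, weights, n_ref_copy, x)
    return p_hat, r_hat


# One of two copy-words is matched on each side, exact matcher only.
p_hat, r_hat = corrected_precision_recall(
    p=0.8, r=0.6,
    cand_copy_matches=[1], ref_copy_matches=[1],
    weights=[1.0], n_cand_copy=2, n_ref_copy=2, x=8,
)
print(p_hat, r_hat)
```

With X = 8, a single missing "copy-word" out of two scales both values by 9/10; a smaller X would make the same mismatch more costly.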

Settings
We evaluate our model on the WMT15 and WMT17 metrics task evaluation sets by calculating the correlation with real human scores. The official human judgments of translation quality are collected using direct assessment (DA) (Graham et al., 2013). The direct assessment protocol gives the annotators the reference and one MT output only and asks them to evaluate the translation adequacy of the MT output on an absolute scale.
The WMT datasets contain 9287 pairs with human scores in total; after filtering out the pairs with lower human scores, only about one thousand can be regarded as paraphrase pairs. As described in Section 3.1, named entities are an important part of copy knowledge, accounting for 67% of it. Because reference-candidate pairs with high human scores are scarce in the WMT datasets, we take named entities as the copy knowledge here, and we use the NLTK toolkit (Loper and Bird, 2002; Bird and Loper, 2004) to recognize named entities as our "copy-words" in the experiments. Table 4 shows the NE density of each language pair on the WMT15-17 datasets; we select the WMT16 evaluation sets as our development sets. Our development experiments show that the parameter X is positively correlated with the NE density. The WMT17 evaluation sets have a higher NE density and the WMT15 evaluation sets a lower one. In the experiments of Table 5, we set X = 14 on WMT17 and X = 8 on WMT15. Table 5 shows the segment-level Pearson correlation with the WMT15 and WMT17 direct assessment of translation adequacy. Meteor++ has a higher average segment-level Pearson correlation with DA human scores than Meteor on all WMT datasets.
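For reference, segment-level Pearson correlation between metric scores and human scores can be computed as in the sketch below. It is a plain standard-library implementation for illustration; the function name and the toy scores are assumptions, not values from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy example: one metric score and one human score per segment.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_scores = [-0.5, 0.1, 0.0, 0.9]
print(pearson(metric_scores, human_scores))
```

Comparing two metrics then amounts to computing this coefficient for each of them against the same human scores, per language pair, and averaging over language pairs.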

Conclusion
In this paper, we describe in detail our metric Meteor++, submitted to the WMT18 Metrics task. From observations of paraphrase corpora, we identify copy knowledge, that is, words that stay intact after paraphrasing. We propose a simple statistical method to extract copy knowledge from parallel monolingual paraphrases. We then present Meteor++ to examine how copy knowledge can be integrated into MT evaluation based on Meteor. Because words in copy knowledge are highly likely to appear in both candidates and references in machine translation, Meteor++ can perform better than Meteor. The experimental results on the WMT datasets for each language pair show that Meteor++ has a higher average segment-level Pearson correlation with DA human scores than Meteor, demonstrating the efficacy of copy knowledge.

Future Work
In this paper, we give a simple statistical method to extract copy knowledge and propose Meteor++, which incorporates it. Although it has already shown promise, we are still enhancing the metric in the following directions:
Copy Knowledge Extraction: We only propose a simple statistical method to extract copy knowledge, which selects the words with a high co-occurrence probability in paraphrase pairs. Here we simply use bags of words to represent sentences and regard their intersection as co-occurrence. Therefore the copy knowledge we extract is still far from the real copy knowledge. Furthermore, we are considering constructing an alignment on a large-scale parallel monolingual corpus and then extracting universal copy knowledge from it for broad use.
Training the hyper-parameter X on Data: The hyper-parameter X was designed to smooth the results and compensate for the gap between the copy knowledge we extract and the real copy knowledge. As our copy knowledge gets closer to the real copy knowledge, we plan to optimize the formulas by training on a separate data set and choosing the X formula that correlates best with human assessment on the training data.