MEANT 2.0: Accurate semantic MT evaluation for any output language

We describe a new version of MEANT, which participated in the metrics task of the Second Conference on Machine Translation (WMT 2017). MEANT 2.0 uses idf-weighted distributional ngram accuracy to determine the phrasal similarity of semantic role fillers, and yields better correlations with human judgments of translation quality than earlier versions. The improved phrasal similarity enables a subversion of MEANT to accurately evaluate translation adequacy for any output language, even languages without an automatic semantic parser. Our results show that MEANT, a non-ensemble and untrained metric, consistently performs as well as the top participants of previous years, including ensemble and trained ones, across different output languages. We also present timing statistics for MEANT to support better estimation of the evaluation cost. MEANT 2.0 is open source and publicly available at http://chikiu-jackie-lo.org/home/index.php/meant.


Introduction
We introduce a new version of MEANT, which participated in evaluating MT systems for all language pairs in the metrics task of the Second Conference on Machine Translation (WMT 2017). MEANT 2.0 is a non-ensemble and untrained metric that requires only a monolingual corpus in the output language, to build the word embeddings, and an automatic shallow semantic parser, to obtain the predicate-argument structure, in order to evaluate MT systems for a language pair. We have also built a degraded subversion, MEANT 2.0 -nosrl, that evaluates MT systems for any output language by removing the dependency on semantic parsers for semantic role labeling (SRL) of the reference and the machine translations. The correlation of MEANT with human judgments has been improved by using both inverse document frequency (idf) and distributional ngram accuracy within the phrasal similarity calculation: the former to weight the importance of each word for better adequacy, the latter to account for word reordering for greater fluency. Our results show that MEANT consistently performs as well as the top participants of previous years, including ensemble and trained metrics, across different output languages. We also present timing statistics that show the relatively low cost of running MEANT. This highly portable and open source semantic MT evaluation metric is a more accurate alternative to BLEU for evaluating translation quality for low-resource languages.

The family of MEANT
MEANT and its variants (Lo et al., 2015, 2014; Lo and Wu, 2011a) evaluate translation adequacy by measuring the similarity of the semantic frames and their role fillers between the human reference and the machine translations. Figure 1 illustrates the concept of MEANT: the semantic roles and their fillers of the reference translation match those of MT2 more closely than those of MT1, therefore MT2 is a more adequate translation than MT1.
MEANT consistently outperforms the commonly used automatic MT evaluation metrics BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Denkowski and Lavie, 2014), TER (Snover et al., 2006), CDER (Leusch et al., 2006) and WER in correlation with human adequacy judgments. It is also relatively easy to port to other languages: the full version of MEANT requires only a monolingual corpus (for evaluating lexical semantic similarity) and an automatic semantic parser (for evaluating frame semantic similarity) of the output language. In section 3, we describe a new subversion of MEANT that can be computed even when a semantic parser for the output language is unavailable.

Figure 1: Example that illustrates the concept of MEANT. The solid role alignments mean the translation is mostly correct while the dotted role alignments mean the translation is partly correct. The semantic roles and fillers of the reference match more with those of MT2 than those of MT1, therefore MT2 is a more adequate translation than MT1.
MEANT is the weighted f-score over corresponding semantic frames and role fillers in the reference and the machine translations. MEANT is generally computed as follows:

1. Apply a shallow semantic parser to both the reference and the machine translations.
2. Apply the maximum weighted bipartite matching algorithm to align the semantic frames between the reference and the machine translations according to the lexical similarities of the predicates.

3. For each pair of aligned frames, apply the maximum weighted bipartite matching algorithm to align the arguments between the reference and the machine translations according to the phrasal similarities of the role fillers.

4. Compute the weighted f-score over the matched predicates and role fillers of the aligned frames.
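The alignment step can be sketched as a maximum weighted bipartite matching over a similarity matrix. The brute-force search below is illustrative only; production code would use an O(n^3) algorithm such as Kuhn-Munkres. The function name and the matrix representation are our own, not taken from the MEANT implementation:

```python
from itertools import permutations

def align_frames(sim):
    """Brute-force maximum-weight bipartite matching, for illustration.

    sim[i][j] is the lexical similarity between predicate i of the
    reference and predicate j of the MT output.  Frames per sentence
    are few, so exhaustive search suffices for a sketch.
    Returns (best total similarity, list of (ref_index, mt_index) pairs).
    """
    n_ref = len(sim)
    n_mt = len(sim[0]) if sim else 0
    k = min(n_ref, n_mt)
    best_score, best_pairs = 0.0, []
    for ref_idx in permutations(range(n_ref), k):
        for mt_idx in permutations(range(n_mt), k):
            pairs = list(zip(ref_idx, mt_idx))
            score = sum(sim[i][j] for i, j in pairs)
            if score > best_score:
                best_score, best_pairs = score, pairs
    return best_score, best_pairs
```

The same routine can serve both step 2 (aligning frames by predicate similarity) and step 3 (aligning arguments by role-filler similarity), with a different similarity matrix in each case.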
where s(e, f) is the lexical similarity of tokens e and f computed using word embeddings (Mikolov et al., 2013). By aggregating the lexical similarities, we obtain the phrasal similarities: s_sent is the phrasal similarity of the whole sentence between the reference and the MT output; s_i,pred is the phrasal similarity of the predicates of the i-th aligned frame between the reference and the MT output; and s_i,j is that of the role fillers of the arguments of role type j. w_pred is the weight of the lexical similarities of the aligned predicates in step 2, and w_j is the weight of the phrasal similarities of the role fillers of the arguments of role type j of the aligned frames, applied in step 3 when their role types match. There is a total of 12 weights for the set of semantic role labels in MEANT (Lo and Wu, 2011b), estimated by heuristics (Lo and Wu, 2012).
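For instance, the lexical similarity s(e, f) can be computed as the cosine of the two embedding vectors. Clipping negative values to zero is our assumption here, not necessarily the exact convention of the released implementation:

```python
import math

def lexical_similarity(vec_e, vec_f):
    """s(e, f): cosine similarity of two word embedding vectors,
    clipped to [0, 1] (the clipping is an assumption of this sketch)."""
    dot = sum(a * b for a, b in zip(vec_e, vec_f))
    norm = math.sqrt(sum(a * a for a in vec_e)) * math.sqrt(sum(b * b for b in vec_f))
    return max(0.0, dot / norm) if norm else 0.0
```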
Finally, the weight α on precision and recall is introduced to support different usages of MEANT: α is set to 1, making MEANT pure recall, when the metric is used for MT evaluation, and to 0.5, balancing precision and recall, when it is used for MT system optimization.
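A minimal sketch of this weighted combination, assuming the common form f = P·R / (α·P + (1−α)·R), which reduces to pure recall at α = 1 and to the balanced harmonic mean at α = 0.5:

```python
def weighted_fscore(precision, recall, alpha=1.0):
    """Weighted f-score assumed to have the common form
    f = P*R / (alpha*P + (1-alpha)*R):
    alpha = 1.0 reduces to pure recall (MT evaluation);
    alpha = 0.5 balances precision and recall (system tuning)."""
    denom = alpha * precision + (1 - alpha) * recall
    return 0.0 if denom == 0 else precision * recall / denom
```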
HMEANT (Lo and Wu, 2011a) is the variant of MEANT for human evaluation, where the semantic roles in the reference and in the MT output are annotated by humans. XMEANT (Lo et al., 2014) is the cross-lingual variant of MEANT, which estimates the translation quality of the MT output against the source sentence, using automatic semantic parsers for the input and output languages and alignment probabilities to determine the cross-lingual lexical semantic similarity.

Improvements in MEANT 2.0
We improve the performance of MEANT on evaluating translation adequacy by weighting the importance of each word by its inverse document frequency when computing phrasal similarity, so that a higher score is given to phrases with more matches for content words than for function words. We also modify the phrasal similarity calculation so that, instead of aggregating lexical similarities over the bag of words in the phrase, it aggregates ngram lexical similarities. Thus, the word order within the semantic role fillers and the whole sentence is taken into account. Our development experiments showed that the optimal value of n is 2.
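A simplified sketch of the idea follows, with the lexical similarity plugged in through a `sim` callback (the real metric uses embedding-based similarities, and its exact aggregation differs from this illustration):

```python
import math

def idf_weights(documents):
    """idf(w) = log(N / df(w)) over a collection of tokenized documents."""
    n = len(documents)
    df = {}
    for doc in documents:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def bigram_phrase_similarity(ref, mt, idf, sim):
    """Sketch of idf-weighted max-aligned bigram precision/recall.

    ref, mt : token lists of the two role fillers
    idf     : word -> idf weight (unseen words default to 1.0)
    sim     : (word, word) -> lexical similarity in [0, 1]
    Each bigram on one side is matched to its best-scoring bigram on
    the other side; a bigram pair's similarity is the idf-weighted
    average of per-word best matches.  This is a simplification of
    the actual MEANT 2.0 formulation.
    """
    def ngrams(toks, n=2):
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)] or [tuple(toks)]

    def pair_sim(a, b):
        num = sum(idf.get(w, 1.0) * max(sim(w, v) for v in b) for w in a)
        den = sum(idf.get(w, 1.0) for w in a)
        return num / den if den else 0.0

    ref_ng, mt_ng = ngrams(ref), ngrams(mt)
    prec = sum(max(pair_sim(g, h) for h in ref_ng) for g in mt_ng) / len(mt_ng)
    rec = sum(max(pair_sim(g, h) for h in mt_ng) for g in ref_ng) / len(ref_ng)
    return prec, rec
```

With this weighting, a mismatch on a high-idf content word ("cat") costs far more than a mismatch on a low-idf function word ("the"), which is exactly the behaviour described above.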
We also generalize the concept of weighted precision and recall when computing phrasal similarities for the semantic role fillers. Lastly, we simplify the computation of the frame semantic similarities by introducing a weight β that linearly combines the phrasal similarity of the whole sentence and the frame semantic similarity of the reference and the MT output into the MEANT score. Our development experiments show that the optimal value of β is 0.1. In summary, equations (1) to (8) are replaced by equations (9) to (16), which begin as follows:

prec_e,f ≡ idf-weighted max-aligned distributional ngram precision   (9)

rec_e,f ≡ idf-weighted max-aligned distributional ngram recall   (10)

As a result, for languages without an automatic semantic parser, or for sentences without a valid predicate-argument structure recognized by the automatic semantic parser, the MEANT score is simply the phrasal similarity of the whole sentence.

Table 1: Statistics of resources used to train the word embeddings and the resulting vocabulary size of the model.
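The β combination and the fallback behaviour described above can be sketched as follows. We assume here that β weights the whole-sentence phrasal similarity against the frame semantic similarity; the function and argument names are ours, not the reference implementation's:

```python
def meant_score(sent_sim, frame_sim, beta=0.1):
    """Hedged sketch of the final MEANT 2.0 combination.

    sent_sim  : phrasal similarity of the whole sentence
    frame_sim : aggregated frame semantic similarity, or None when no
                predicate-argument structure is available (the -nosrl
                case, or a sentence the parser could not analyse)
    beta      : assumed to weight the whole-sentence similarity;
                the development experiments report 0.1 as optimal.
    """
    if frame_sim is None:
        return sent_sim  # fall back to pure phrasal similarity
    return beta * sent_sim + (1 - beta) * frame_sim
```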

Setup
We use the monolingual corpora provided for the WMT translation task (Bojar et al., 2016a) to build the word embeddings for evaluating lexical similarities, using word2vec (Mikolov et al., 2013). Table 1 summarizes the resources used to train the word embeddings and the resulting vocabulary size of the distributional lexical semantic similarity model.
We use mateplus (Roth and Woodsend, 2014) for German and English semantic role labeling and mate-tools (Björkelund et al., 2009) for Chinese semantic role labeling. Instead of the 12 semantic role types used by Lo and Wu (2011b), we merge the semantic role labels of Chinese, English and German into 8 role types (who, did, what, whom, when, where, why, how) for more robust performance.
For all languages except Chinese, the tokenization step simply involves separating punctuation at the end of words in both the reference and the MT output. Chinese, however, does not have clear word boundaries: each individual Chinese character usually carries multiple meanings and relies on the surrounding characters to disambiguate it, so naive character-level segmentation would hurt the accuracy of the vector representations and of the distributional lexical semantic similarity model. Thus, we use ICTCLAS (Zhang et al., 2003) to segment the Chinese monolingual corpus into words before building the word embeddings.
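As an illustration, the punctuation tokenization for the non-Chinese languages could look like the sketch below. It is a rough stand-in (it splits all punctuation, not only word-final punctuation) and is not the exact script used by the metric:

```python
import re

def simple_tokenize(sentence):
    # Split into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)
```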

Experiments and results
We use the WMT 2014-2016 metrics task evaluation sets (Machacek and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016b) for our development experiments. The official human judgments of translation quality were collected using relative ranking: the annotators were given the original input and the reference, and were asked to rank up to 5 different MT outputs according to their translation quality.
Two other kinds of human judgments of translation quality were collected in the WMT 2016 metrics task. The direct assessment evaluation protocol gave the annotators the reference and one MT output only, and asked them to evaluate the translation adequacy of the MT output on an absolute scale. The HUME metric (Birch et al., 2016) is very similar to HMEANT: it evaluates translation adequacy via semantic units in the input sentence annotated by humans following the UCCA (Abend and Rappoport, 2013) guidelines. However, HUME also takes nominal and adjectival argument structures into account, instead of only the predicate-argument structure as in HMEANT.
Due to space limitations, we only report the results of MEANT 2.0, MEANT 2.0 -nosrl, BLEU and the best correlation in each of the individual language pairs. Since we use exactly the same protocol for each of the test sets, our reported results are directly comparable with those reported in Machacek and Bojar (2014); Stanojević et al. (2015); Bojar et al. (2016b). We summarize the observations in the following sections.
Correlation with human judgment at system-level

On relative ranking judgment
Table 2 shows the Pearson's correlation with the WMT 2014-2016 official human relative ranking scores at system-level. As expected, MEANT 2.0 performs significantly better than MEANT 2.0 -nosrl in most of the language pairs. Overall, both MEANT 2.0 and the nosrl variant are very competitive with the other metrics on all test sets.
For the WMT14 test set, MEANT 2.0 is the best metric among all participants in that year in the de-en and en-de directions. For the WMT15 test set, MEANT 2.0 -nosrl is the best metric among all the participants in that year in the en-cs direction. On average over the into-English directions, MEANT 2.0 is in 6th place while the nosrl variant is in 9th place. Both variants of MEANT lose only to ensemble and trained metrics in that year.
For the WMT16 test set, MEANT 2.0 ties for 1st place with the best metric in the ru-en direction in that year. Both variants of MEANT perform as well as the leading metrics in all other directions, except en-de.

On direct assessment judgment
Table 3 shows the Pearson's correlation with the WMT 2016 direct assessment of translation adequacy at system-level. Both MEANT 2.0 and MEANT 2.0 -nosrl beat all the other metrics in that year in the tr-en direction and perform very competitively compared to the leading pack in the other directions. MEANT 2.0 performs better than the nosrl variant in all directions, except fi-en.

Correlation with human judgment at segment-level

On relative ranking judgment
For the WMT15 test set, both MEANT 2.0 and MEANT 2.0 -nosrl beat all the participating metrics in that year in the fr-en direction. MEANT 2.0 -nosrl also has the highest correlation with human judgment in the en-cs and en-ru directions, as well as in the overall average of the from-English directions. Again, on the average of all the into-English directions, MEANT 2.0 and the nosrl variant are in 2nd and 3rd place respectively, losing only to an ensemble and trained metric.
For the WMT16 test set, both MEANT 2.0 and MEANT 2.0 -nosrl beat all other participants in that year in the de-en and en-de directions, while MEANT 2.0 -nosrl is also the champion in the en-cs and en-ru directions. Both variants perform very competitively when compared to the leading metrics in all other directions. On the direct assessment judgment, MEANT 2.0 performs better than the nosrl variant in all directions, except ru-en.

Table 6 shows the Pearson's correlation of MEANT with the HUME human assessment on the himltest test set at segment-level. Both MEANT 2.0 and MEANT 2.0 -nosrl beat all other participating metrics in that year in the en-de direction. MEANT 2.0 -nosrl also has the highest correlation with HUME among all the participants in that year in the en-pl direction.

Evaluation speed
Table 7 shows the time taken to evaluate translations on different output languages using MEANT. The time taken for punctuation tokenization is almost negligible. In common practice for MT system development, the validation and evaluation sets are reused frequently, so the processing of the reference translations is typically precomputed; furthermore, the MT system is trained to output tokenized translations, so it is not necessary to run the tokenization step on the MT output. Therefore, the tokenization step does not affect the time cost of MEANT in practical applications (even in the case of Chinese, where word segmentation takes significantly longer).

The loading time of the word embedding model is proportional to the vocabulary size of the model reported in Table 1; it takes less than a second to load 10k vocabulary entries into memory.
Automatic semantic role labeling (SRL) is the most time-consuming step in running MEANT. The time reported in Table 7 includes parsing both the reference and the MT output. However, as pointed out above, common practice in MT system development is to reuse the validation and evaluation sets frequently, so the semantic role labeling of the reference translations can be precomputed to reduce the time taken by the SRL step in the development cycle.
Finally, the time used in computing the MEANT score is proportional to the size of the evaluation set and of the word embedding model. The scoring step processes around 50 to 100 sentences per second.

Conclusion
We present a new version of MEANT that participated in evaluating MT systems for all language pairs in the metrics task of the Second Conference on Machine Translation (WMT 2017). The correlation of MEANT with human judgment has been improved by better addressing translation adequacy, through weighting the importance of each word in the phrasal similarity computation by its inverse document frequency, and by better addressing translation fluency, through using distributional ngram accuracy to account for word reordering. Our results show that MEANT consistently performs well across different output languages on the previous years' test sets, at both system-level and segment-level.
MEANT 2.0 -nosrl is a non-ensemble and untrained metric that requires only a monolingual corpus in the output language, for building the word embeddings, to evaluate MT systems for a new language pair. Although there is an overhead time cost for semantic role labeling of the sentence pairs in MEANT 2.0, and for loading the word embedding model in both MEANT 2.0 and its nosrl subversion, this cost can be reduced by almost half in real applications. This highly portable and open source semantic MT evaluation metric is a more accurate alternative to BLEU for evaluating translation quality for low-resource languages.