YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources

We present YiSi, a unified automatic semantic machine translation quality evaluation and estimation metric for languages with different levels of available resources. Underneath the interface with different language resource settings, YiSi uses the same representation for the two sentences in assessment. In addition, we show that using contextual embeddings from multilingual BERT (Bidirectional Encoder Representations from Transformers) to evaluate lexical semantic similarity significantly improves the correlation of YiSi-1's scores with human judgment. YiSi is open source and publicly available.


Introduction
A good automatic MT quality metric is one that closely reflects the usefulness of the translation in assisting human readers to understand the meaning of the input sentence. BLEU (Papineni et al., 2002) has long been shown not to correlate well with human judgment on translation quality (Machacek and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016, 2017; Ma et al., 2018). However, it is still the most commonly used metric for reporting the quality of machine translation systems. One of the major reasons is that BLEU is ready to deploy to all languages due to its simplicity. Semantic MT evaluation metrics, such as METEOR (Denkowski and Lavie, 2014) and MEANT (Lo, 2017), require additional linguistic resources to more accurately evaluate the meaning similarity between the MT output and the reference translation. This lower portability hinders the wide adoption of these metrics.
We therefore propose a unified framework, YiSi, for MT quality evaluation and estimation that takes advantage of both metric paradigms by providing options to fall back to surface-level lexical similarity when semantic models are not available for the languages in assessment.
YiSi was first used in the WMT 2018 metrics shared task (Ma et al., 2018) and performed well and consistently at segment level across the tested language pairs in correlating with human judgment. A YiSi-based system also served successfully in the WMT 2018 parallel corpus filtering task (Lo et al., 2018).
This year, instead of using word2vec (Mikolov et al., 2013) to evaluate lexical semantic similarity in YiSi, we use BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018). YiSi is open source and publicly available.

YiSi
YiSi is a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. Inspired by MEANT (Lo, 2017), YiSi-1 is an MT quality evaluation metric that measures the similarity between a machine translation and human references by aggregating the weighted distributional lexical semantic similarities and optionally incorporating shallow semantic structures. YiSi-0 is the degenerate version of YiSi-1 that is ready to deploy to any language. It uses the longest common character substring to measure lexical similarity. YiSi-2 is the bilingual, reference-less version, which uses bilingual embeddings to evaluate cross-lingual lexical semantic similarity between the input and the MT output. Like YiSi-1, YiSi-2 can also exploit shallow semantic structures.
YiSi-0 and YiSi-1 were first used in the WMT 2018 metrics shared task (Ma et al., 2018) and performed well and consistently at segment level across the tested language pairs in correlating with human judgment. While YiSi-1 also served successfully in the WMT 2018 parallel corpus filtering task, YiSi-2 showed comparable accuracy in our internal experiments (Lo et al., 2018).

Overview
Following the guiding principle that a good MT quality metric reflects how well human readers understand the meaning of the input sentence, YiSi is the weighted f-score over corresponding semantic frames and role fillers in the two sentences E and F in assessment. The procedure for computing YiSi is as follows:
1. Apply a shallow semantic parser to both E and F.
2. Apply the maximum weighted bipartite matching algorithm to align the semantic frames between E and F according to the lexical similarities of the predicates.
3. For each pair of aligned frames, apply the maximum weighted bipartite matching algorithm to align the arguments between E and F according to the lexical similarity of role fillers.
4. Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to the following definitions (Figure 1 is the graphical representation of this computation):
w(e) = lexical weight of e
s(e, f) = lexical similarity of e and f
Here, s(e, f) is weighted by w(e) and w(f) for computing phrasal precision and recall, respectively. Different variants of YiSi have different definitions of lexical similarity and lexical weight, depending on the resources available in the assessment setting. The weighted lexical similarities are aggregated into n-gram similarities, and the bags of n-grams in the two sentences are then aligned using maximum alignment on the n-gram similarities. The phrasal similarity precision, s_p, and recall, s_r, are the weighted averages of the similarities of the aligned n-grams.
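The phrasal precision and recall described above can be illustrated with a minimal Python sketch. It assumes the unigram case (n = 1) and reads "maximum alignment" as aligning each unit to its most similar unit in the other sentence; the function names, the default weight of 1.0 for unseen units and the exact-match similarity are hypothetical choices for illustration, not part of the released YiSi implementation.

```python
from typing import Callable, Dict, Sequence, Tuple


def phrasal_precision_recall(
    e_tokens: Sequence[str],
    f_tokens: Sequence[str],
    sim: Callable[[str, str], float],
    w_e: Dict[str, float],
    w_f: Dict[str, float],
) -> Tuple[float, float]:
    """Weighted average of best-match similarities (assumed unigram case)."""
    if not e_tokens or not f_tokens:
        return 0.0, 0.0
    # Precision: every unit of E is aligned to its most similar unit in F
    # and the similarity is weighted by w(e). Unseen units get weight 1.0
    # (an assumption made only for this sketch).
    p_num = sum(w_e.get(e, 1.0) * max(sim(e, f) for f in f_tokens) for e in e_tokens)
    p_den = sum(w_e.get(e, 1.0) for e in e_tokens)
    # Recall: every unit of F is aligned to its most similar unit in E
    # and the similarity is weighted by w(f).
    r_num = sum(w_f.get(f, 1.0) * max(sim(e, f) for e in e_tokens) for f in f_tokens)
    r_den = sum(w_f.get(f, 1.0) for f in f_tokens)
    s_p = p_num / p_den if p_den > 0 else 0.0
    s_r = r_num / r_den if r_den > 0 else 0.0
    return s_p, s_r


def exact_match(e: str, f: str) -> float:
    """Toy lexical similarity: 1.0 for identical units, 0.0 otherwise."""
    return 1.0 if e == f else 0.0


s_p, s_r = phrasal_precision_recall(
    ["the", "cat", "sat"], ["the", "cat", "sat", "down"], exact_match, {}, {}
)
```

With uniform weights, every unit of the shorter sentence finds a match, so precision is perfect while recall is penalized for the unmatched "down".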
With the phrasal semantic precision and recall, we compute the structural semantic precision and recall, where w_t is the weight of the lexical similarities of the aligned predicates in step 2, and w_j is the weight of the phrasal similarities of the role fillers of the arguments of role type j of the aligned frames between the reference translations and the MT output in step 3, if their role types match. As in Lo (2017), we merge the semantic role labels into 8 role types (who, did, what, whom, when, where, why, how) for more robust performance. Thus, there is a total of 8 weights for the set of semantic role types in YiSi, estimated by type counts in the document $\mathcal{F}$.
The frame precision/recall is the weighted sum of the phrasal precision/recall of the aligned role fillers. The token coverages $w^e_i$ and $w^f_i$ estimate the importance of frame i in the sentences E and F. The structural semantic precision and recall are the weighted averages over all the aligned frames in sentences E and F, respectively. The overall precision and recall are then the weighted sums of the structural semantic precision and recall and the phrasal precision and recall of the whole sentences $\vec{e}_{sent}$ and $\vec{f}_{sent}$, with the weight β on the structural part.
It is important to note that the weight β should NOT be interpreted as the importance of the structural semantic similarity in YiSi, because there is a huge overlap between the structural semantic similarity and the phrasal semantic similarity. Instead, we should pay attention to the significant difference in the performance of YiSi with and without structural semantic similarity, especially in YiSi-2, the cross-lingual variant. In this experiment, β is set to 0.1.
Finally, the weight α for the precision and recall is introduced for different usages of YiSi. α should be set to 0.7 to make YiSi more recall-oriented when it is used for MT evaluation. When used for MT system optimization, α should be set to 0.5 to balance precision and recall.
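As a rough illustration of how β and α enter the final score, the sketch below assumes a convex combination β · structural + (1 − β) · phrasal and a weighted f-score of the form P · R / (α · P + (1 − α) · R). Both forms are assumptions chosen to be consistent with the description (β = 0 leaves only the phrasal similarity, and a larger α pulls the score toward recall); they are not equations recovered from the text, and the function names are hypothetical.

```python
def mix_structural_and_phrasal(structural: float, phrasal: float, beta: float) -> float:
    """Assumed convex combination of structural and sentence-level phrasal scores."""
    return beta * structural + (1.0 - beta) * phrasal


def weighted_f_score(precision: float, recall: float, alpha: float) -> float:
    """Assumed weighted f-score; alpha = 1.0 reduces to recall, alpha = 0.0 to precision."""
    denom = alpha * precision + (1.0 - alpha) * recall
    return precision * recall / denom if denom > 0 else 0.0


# Hypothetical component scores for one sentence pair.
precision = mix_structural_and_phrasal(structural=0.5, phrasal=0.9, beta=0.1)
recall = mix_structural_and_phrasal(structural=0.4, phrasal=0.7, beta=0.1)

score_for_evaluation = weighted_f_score(precision, recall, alpha=0.7)
score_for_optimization = weighted_f_score(precision, recall, alpha=0.5)
```

Because recall is lower than precision in this toy example, the recall-oriented setting (α = 0.7) yields a lower score than the balanced setting (α = 0.5).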
In the following, we describe how we estimate the lexical similarity s(e, f ) and lexical weight w(e) under different resource conditions.

YiSi-0: quality evaluation metric for extremely low resource languages
YiSi-0 is the degenerate resource-free variant of YiSi for MT quality evaluation, where sentence E is the MT output and sentence F is the reference. Figure 2 shows the resources used in YiSi-0. YiSi-0 uses the longest common character substring accuracy to evaluate lexical similarity between the MT output and the human reference. Since the MT output and the human reference are in the same language, the lexical weight w(e) of word e in the translation and the lexical weight w(f) of word f in the reference are both estimated by the inverse document frequency of those words in the reference document $\mathcal{F}$. Thus, formally, YiSi-0 is YiSi computed with this lexical similarity and these lexical weights, with E = MT and F = REF.
YiSi-1 is the monolingual variant of YiSi for MT quality evaluation, where sentence E is the MT output and sentence F is the reference. Figure 3 shows the resources used in YiSi-1. YiSi-1 requires an embedding model to evaluate lexical semantic similarity and optionally requires a semantic role labeler in the output language for evaluating structural semantic similarity. The lexical semantic similarity is the cosine similarity of the embeddings from the lexical representation model. Similar to YiSi-0, the lexical weight w(u) of word unit u in the MT output and the reference is estimated by the inverse document frequency of that word in the reference document $\mathcal{F}$. Thus, formally, YiSi-1 is defined as follows:
$$
\begin{aligned}
v(u) &= \text{embedding of unit } u \\
s_1(e, f) &= \cos(v(e), v(f)) \\
w(u) &= \mathrm{idf}(u) = \log\left(1 + \frac{|\mathcal{F}| + 1}{|\mathcal{F}_{\exists u}| + 1}\right) \\
\text{YiSi-1} &= \text{YiSi}(s = s_1, \beta = 0.0, E = \text{MT}, F = \text{REF}) \\
\text{YiSi-1}_{\text{srl}} &= \text{YiSi}(s = s_1, \beta = 0.1, E = \text{MT}, F = \text{REF})
\end{aligned}
$$
Figure 4: Resources used in YiSi-2. Arrows in green depict resources in the target language and arrows in orange depict resources in the source language. The dashed arrows mean that the semantic parsers are optional.
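The lexical similarity and lexical weight used by YiSi-0 and YiSi-1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalization 2 · |lcs| / (|e| + |f|) for the longest common character substring accuracy is assumed, |F| in the idf formula is interpreted as the number of sentences in the reference document, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common contiguous character substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def s0(e: str, f: str) -> float:
    """YiSi-0 style similarity; the 2*|lcs|/(|e|+|f|) normalization is an assumption."""
    if not e and not f:
        return 0.0
    return 2.0 * lcs_length(e, f) / (len(e) + len(f))


def s1(v_e: Sequence[float], v_f: Sequence[float]) -> float:
    """YiSi-1 style similarity: cosine of two embedding vectors."""
    dot = sum(x * y for x, y in zip(v_e, v_f))
    norm_e = math.sqrt(sum(x * x for x in v_e))
    norm_f = math.sqrt(sum(y * y for y in v_f))
    return dot / (norm_e * norm_f) if norm_e > 0 and norm_f > 0 else 0.0


def idf_weights(document: List[List[str]]) -> Dict[str, float]:
    """w(u) = log(1 + (|F| + 1) / (|F_u| + 1)), with |F| taken as the sentence count."""
    n_sentences = len(document)
    doc_freq: Dict[str, int] = {}
    for sentence in document:
        for unit in set(sentence):
            doc_freq[unit] = doc_freq.get(unit, 0) + 1
    return {
        unit: math.log(1.0 + (n_sentences + 1.0) / (count + 1.0))
        for unit, count in doc_freq.items()
    }


# Toy reference document with two tokenized sentences.
weights = idf_weights([["a", "cat"], ["a", "dog"]])
```

Units that occur in fewer reference sentences receive a larger weight, so content words such as "cat" count more than a frequent unit such as "a" when the similarities are averaged.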

YiSi-2: quality estimation metric for languages with access to a bilingual embedding model
YiSi-2 is the cross-lingual variant of YiSi for MT quality estimation, where sentence E is the MT output and sentence F is the input. Figure 4 shows the resources used in YiSi-2. YiSi-2 requires a cross-lingual embedding model for evaluating cross-lingual lexical semantic similarity and optionally requires a semantic role labeler in both the input and the output languages for evaluating structural semantic similarity. The lexical semantic similarity, s_2, is the cosine similarity of the embeddings from the cross-lingual lexical representation model. The lexical weight w(e) of word unit e in the MT output is estimated by the inverse document frequency of the word in the MT document $\mathcal{E}$, while the lexical weight w(f) of word unit f in the input is estimated by the inverse document frequency of the word in the input document $\mathcal{F}$. Thus, formally, YiSi-2 is defined as follows:
$$
\begin{aligned}
\text{YiSi-2} &= \text{YiSi}(s = s_2, \beta = 0.0, E = \text{MT}, F = \text{IN}) \\
\text{YiSi-2}_{\text{srl}} &= \text{YiSi}(s = s_2, \beta = 0.1, E = \text{MT}, F = \text{IN})
\end{aligned}
$$

Using BERT for lexical unit semantic similarity
In the WMT 2018 metrics shared task, YiSi-1 used word2vec (Mikolov et al., 2013) to evaluate lexical semantic similarity between the MT output and the human reference at word level. The shortcoming of this kind of static embedding model (including but not limited to GloVe (Pennington et al., 2014)) is that it provides the same embedding for a given word regardless of the sentence context in which the word appears. In contrast, BERT (Devlin et al., 2018) uses a bidirectional transformer encoder (Vaswani et al., 2017) to capture the sentence context in the output embeddings (at subword unit level), so that the embeddings of the same word or subword unit in different sentences differ and are better represented in the embedding space. Zhang et al. (2019) provided an extensive study on how well the output embeddings of different layers of the BERT model correlate with human adequacy judgments. Following the recommendation from their study, we use embeddings extracted from BERT models with the following settings:
• the 18th layer of the pretrained English cased BERT-Large model to represent the subword units in the reference and MT output in English for computing YiSi-1;
• the 9th layer of the pretrained Chinese BERT-Base model to represent the characters in the reference and MT output in Chinese for computing YiSi-1; and
• the 9th layer of the pretrained multilingual cased BERT-Base model to represent the subword units in the reference and MT output in languages other than Chinese and English for computing YiSi-1, and to represent the subword units in the original input and MT output in all language pairs for computing YiSi-2.

Using MATE/MATEPLUS for structural semantic similarity
There are a handful of shallow semantic parsers available publicly. mate-tools (Björkelund et al., 2009) is an SVM classifier based on features extracted from a dependency parse. Its successor mateplus (Roth and Woodsend, 2014) also uses features extracted from distributional word embeddings. mate-tools and mateplus are integrated into YiSi because of their support for languages other than English. We use mateplus for German and English semantic role labeling and mate-tools for Chinese semantic role labeling.

Experiments and results
We use the WMT 2018 metrics task evaluation set (Ma et al., 2018) for our development experiments.
The official human judgments of translation quality in WMT 2018 were collected using direct assessment. The direct assessment evaluation protocol in WMT 2018 gave the annotators the reference and an MT output and asked them to evaluate the translation adequacy of the MT output on an absolute scale.
Due to space limitations, we only report the results of YiSi, chrF (Popović, 2015), BLEU and the best correlation in each of the individual language pairs. Since we use exactly the same correlation analysis as the official task for each of the test sets, our reported results are directly comparable with those reported in the task's overview paper. We summarize our observations in the following sections.

Correlation with human judgment at system-level
Table 1 shows the Pearson's correlation with the WMT 2018 official aggregated human direct assessment of translation adequacy at system level. YiSi-0 performs more stably than chrF and BLEU in correlating with human judgment on translation quality across all translation directions. YiSi-0 achieves results comparable with chrF and BLEU in most of the translation directions, while significantly outperforming chrF and BLEU in correlating with human judgment in evaluating Turkish-English and English-Turkish translations.
YiSi-1 beats all the WMT 2018 participants in correlation with human judgment at system level for evaluating Czech-English, German-English, Chinese-English, English-German, English-Estonian, English-Finnish and English-Russian translations.
In addition, YiSi-1 srl further improves YiSi-1's correlation with human judgment at system level for evaluating German-English and Chinese-English translations.
For the quality estimation variants, YiSi-2 achieves reasonably good results (with less than 0.1 degradation in correlation with human judgment) in evaluating Czech-English, German-English and Finnish-English translations without using the human translation as a reference. At the same time, YiSi-2 srl improves YiSi-2's correlation with human judgment at system level for evaluating English-German and English-Chinese translations.
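For readers unfamiliar with the system-level analysis, the following minimal Python sketch shows how a Pearson's correlation between metric scores and aggregated human scores can be computed. The system scores are made-up numbers for illustration only and do not correspond to any result in Table 1.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical metric scores and human scores for four MT systems.
metric_scores = [0.71, 0.65, 0.80, 0.58]
human_scores = [0.10, -0.05, 0.30, -0.20]
r = pearson(metric_scores, human_scores)
```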
Correlation with human judgment at segment-level
Table 2 shows the Kendall's correlation with the segment-level rankings obtained from human direct assessment in WMT 2018. YiSi-0 achieves results comparable with chrF and BLEU for evaluating all translation directions at segment level. YiSi-1 beats all the WMT 2018 participants in correlation with human judgment at segment level for evaluating almost all translation directions, except English-Turkish. In addition, YiSi-1 srl further improves YiSi-1's correlation with human judgment at segment level for evaluating Czech-English and Finnish-English translations.
For the quality estimation variants, YiSi-2 performs significantly worse than YiSi-1 due to the lack of a reference translation in the same language for evaluating fluency. As shown by the significant improvement of YiSi-2 srl in evaluating English-German translation without a reference translation, using semantic parsers to extract the semantic frames of the input sentence and the machine translation is very helpful in evaluating translation fluency.
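The segment-level analysis can be sketched as a Kendall's Tau-like statistic over pairwise human preferences, (concordant − discordant) / (concordant + discordant). The sketch below assumes that human direct assessment scores have already been converted into "better than" pairs and that metric ties are counted as discordant; both points, as well as the function name and the toy data, are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def kendall_tau_like(
    human_pairs: List[Tuple[str, str]], metric_scores: Dict[str, float]
) -> float:
    """Tau-like agreement between a metric and human (better, worse) pairs."""
    concordant = 0
    discordant = 0
    for better, worse in human_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            # Metric ties are treated as disagreement (an assumption of this sketch).
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total > 0 else 0.0


# Hypothetical translations of one source segment, identified by system name.
pairs = [("sys_a", "sys_b"), ("sys_b", "sys_c"), ("sys_a", "sys_c")]
scores = {"sys_a": 0.9, "sys_b": 0.5, "sys_c": 0.7}
tau = kendall_tau_like(pairs, scores)
```

In this toy example the metric agrees with two of the three human preferences, so two concordant pairs and one discordant pair give a value of one third.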

Conclusion
We have presented ongoing work on developing a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. Initial experimental results show that the improved variants of YiSi that use BERT contextual embeddings correlate with human judgment significantly better than other trained metrics.