KoBE: Knowledge-Based Machine Translation Evaluation

We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data.


Introduction
Reliable and accessible evaluation is an important catalyst for progress in machine translation (MT) and other natural language processing tasks. While human evaluation is still considered the gold standard when done properly (Läubli et al., 2020), automatic evaluation is a cheaper alternative that allows for rapid development cycles. Today's prominent automatic evaluation methods like BLEU (Papineni et al., 2002) or METEOR (Banerjee and Lavie, 2005) rely on n-gram matching with reference translations. While these methods are widely adopted, they have notable deficiencies:
• Reference translations cover a tiny fraction of all relevant input sentences or domains, and non-professional translators yield low-quality results (Zaidan and Callison-Burch, 2011).
• Different words in the candidate and reference translations that share an identical meaning will be penalized by simple n-gram matching, and multiple references are rarely used to alleviate this (Qin and Specia, 2015).
• Human translations have special traits ("Translationese", Koppel and Ordan, 2011) and reference-based metrics were shown to be biased to produce higher scores for translationese MT outputs than for valid, alternative MT outputs (Freitag et al., 2020).
• N-gram matching enables measurement of relative improvements, but does not provide an interpretable quality signal (Lavie, 2010).

[Figure 1: Entity matches for two candidate translations vs. the source. Source: "ВМС Украины завершили учения в Азовском море". Candidate translations: "The Navy of Ukraine completed the exercise in the azov sea" and "The Ukrainian Navy has completed exercises in the Sea of Azov".]
To alleviate these issues, we propose Knowledge-Based Evaluation (KoBE), an evaluation method based on a large-scale multilingual knowledge base (KB). In our approach, we first ground each source sentence and candidate translation against the KB using entity linking (McNamee et al., 2011; Rao et al., 2013; Pappu et al., 2017; Gillick et al., 2019; Wu et al., 2019). We then measure the recall for entities found in the candidate vs. entities found in the source for all sentence pairs in the test set. Matching entities are ones linked to the same KB entry in both the source and the candidate. Figure 1 shows our entity matches for two candidate translations vs. the source, where different surface forms that convey the same meaning are properly matched.
Our approach does not require reference translations, as it is based on linking entity mentions to the KB. This also makes it language-pair agnostic as long as the KB and entity linking systems cover the languages of interest. Since different words that share the same meaning should be resolved to the same entry in the KB, our method will not penalize different valid translations of the same entity. As our method measures the recall of the entities found in the source sentence, it is useful as an absolute quality signal and not just as a relative one. Finally, we can perform fine-grained error analysis using entity metadata to better understand where a system fails or succeeds in terms of entity types and domains.
To test our approach, we experiment with the "Quality Estimation as a Metric" benchmark (also named "Metrics Without References") from the WMT19 shared task on quality estimation (Fonseca et al., 2019). KoBE performs better than the other participating metrics on 9 language pairs, and obtains better correlation with human raters than BLEU on 4 language pairs, even though BLEU uses reference translations and KoBE is reference-agnostic. This demonstrates that KoBE is a promising step towards MT evaluation without reference translations.
To make our findings reproducible and useful for future work, we release the annotations we used together with scripts to reproduce our results. These entity linking annotations span over 425k sentences in 18 language pairs from 262 different MT systems, and contain 1.8 million entity mentions of 28k distinct entities.¹
To summarize, this work includes the following contributions:
• We introduce KoBE, a novel knowledge-based, reference-less metric for machine translation quality estimation.
• We show this approach outperforms previously published results on 9 out of 18 language pairs from the WMT19 benchmark for evaluation without references.
• We release a data set with 1.8 million grounded entity mentions for the WMT19 benchmark to foster further research on knowledge-based evaluation.
¹ https://github.com/zorikg/KoBE

Method
To obtain a system-level score, we first annotate all source sentences $s_i \in S$ and candidate translations $t_i \in T$ from a test set of $n$ sentence pairs using entity linking pipelines.² As a knowledge base, we used the publicly available Google Knowledge Graph Search API³ which offers entities from various domains. Unfortunately, we are not aware of any open-source multilingual KB and entity linking systems that we could rely on for the same purpose. We then count the matches for each sentence pair; matches are all candidate entities that are linked to the same record in the KB as source entities. Entities mentioned several times are counted as individual matches, and matches are clipped by the number of appearances of each entity in the source. As a pre-processing step, we use an in-house language identification tool to ignore entity mentions in the candidate that are not in the target language, which we found to improve results in early experiments. We then compute recall by summing the number of matching entities across all sentence pairs and dividing by the number of entities mentioned in all source sentences:

$$\mathrm{recall} = \frac{\sum_{i=1}^{n} |\mathrm{matches}(s_i, t_i)|}{\sum_{i=1}^{n} |\mathrm{entities}(s_i)|}$$

Our decision to ignore candidate entities that are not in the correct language came from an observation that for some low-resource language pairs, MT systems fail to translate the input and instead copy most of its content to the output (see Ott et al. (2018) for a similar observation). As our entity linking system is language agnostic, it was detecting the copied entities, which resulted in false matches.
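The clipped matching and corpus-level recall described above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes each sentence has already been reduced to a list of KB entity ids (the ids shown are made up), it omits the language identification filter, and the function names and the choice to return 0.0 for an entity-free corpus are our assumptions.

```python
from collections import Counter


def clipped_matches(source_ids, candidate_ids):
    """Count candidate entities that also appear in the source.

    Each entity's count is clipped by its number of appearances in the
    source, so repeating an entity in the candidate earns no extra credit.
    """
    src = Counter(source_ids)
    cand = Counter(candidate_ids)
    # Counter returns 0 for ids that never occur in the source.
    return sum(min(count, src[eid]) for eid, count in cand.items())


def corpus_recall(sources, candidates):
    """Sum matches over all sentence pairs, divide by all source entities."""
    total_matches = sum(
        clipped_matches(s, t) for s, t in zip(sources, candidates)
    )
    total_source = sum(len(s) for s in sources)
    # Assumption: a corpus with no detected source entities scores 0.0.
    return total_matches / total_source if total_source else 0.0


# Hypothetical entity ids for two sentence pairs.
sources = [["navy_ua", "azov_sea"], ["paris", "paris", "france"]]
candidates = [["navy_ua", "azov_sea", "azov_sea"], ["paris"]]
recall = corpus_recall(sources, candidates)
```

In the first pair the duplicated candidate entity is clipped to one match, and in the second pair only one of the three source mentions is recovered.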
We found precision to have weaker correlation on most language pairs, as it rewards systems producing a lower number of entities, and these systems usually produced lower-quality translations. Recall is more stable as the number of entities in the source is constant for all evaluated systems, and only the match count is changing. Since recall may give inflated scores when over-producing entities, we introduce an entity count penalty (ECP), inspired by BLEU's brevity penalty. ECP penalizes systems producing c entities if c is more than twice the number of entities in the source, s: [equation missing from the extracted text]

The WMT conference holds a Quality Estimation (QE) track that aims to predict the quality of MT systems given the source sentences and candidate translations (without reference translations). While this was usually done at the word or sentence level, one of the novelties in WMT19 was introducing a new task for using QE as a metric at the corpus level, testing the generalization ability of QE approaches in a massive multi-system scenario (Fonseca et al., 2019). To test our approach, we used the same setting as in this shared task. For each of the 18 evaluated language pairs, we use KoBE to score the MT systems participating in the same year's news translation task. We then measure the Pearson correlation of our scores for each system with its human direct-assessment (DA) scores. To ensure a fair comparison, we recompute the correlations for the other participating metrics and confirm that we reproduce the reported scores.⁴

⁴ More implementation details for reproducing our results are available in the supplemental material.
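The system-level meta-evaluation described above, correlating a metric's per-system scores with human DA scores, can be sketched as follows. This is an illustrative sketch only: the system names and score values are invented, and the helper `pearson` is our own plain-Python implementation rather than the script used in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores for one language pair.
metric_scores = {"sys_a": 0.62, "sys_b": 0.55, "sys_c": 0.71}
da_scores = {"sys_a": 0.10, "sys_b": -0.05, "sys_c": 0.20}

# Align the two score lists by system name before correlating.
systems = sorted(metric_scores)
r = pearson(
    [metric_scores[s] for s in systems],
    [da_scores[s] for s in systems],
)
```

One correlation of this kind is computed per language pair, with one data point per participating MT system.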

Results
We compare KoBE with all participating metrics in the shared task. We refer the reader to Fonseca et al. (2019) for more details about the different metrics. We also compare our results with BLEU to have a benchmark for a reference-based metric.
The results for into-English language pairs are available in Table 1. KoBE outperforms all other submissions for German-to-English, Gujarati-to-English, Kazakh-to-English, Lithuanian-to-English and Russian-to-English, making it the best system in this section in terms of the number of wins. Results for from-English language pairs are available in Table 2. In this case KoBE outperforms the submitted systems for English-to-Czech and English-to-Kazakh with Pearson correlations of 0.597 and 0.827, and also obtains high correlations for English-to-German and English-to-Russian with 0.888 and 0.895, respectively. For English-to-Chinese we also obtain the highest correlation, but it is very low overall. Table 3 describes the results on language pairs not involving English (German-to-Czech, German-to-French and French-to-German). In this case KoBE obtains the best result for German-to-Czech with a Pearson correlation of 0.958. For 4 language pairs (German-to-English, Russian-to-English, Chinese-to-English and German-to-Czech), KoBE outperforms BLEU in terms of the correlation with human judgements. This is encouraging given that KoBE does not use reference translations while BLEU does.
In Table 4 we perform additional experiments to test whether our method can also be used as a reference-based metric, by measuring the recall of entities mentioned in the candidate translations vs. entities mentioned in the references. KoBE indeed correlates well with human judgements and outperforms BLEU on 5 out of 7 language pairs, which we find impressive given that it only considers unordered entity mentions rather than all n-grams as BLEU does. Figure 2 shows a comparison of our scores vs. BLEU and human direct-assessment on Russian-to-English. In addition to the higher correlation with human judgements (0.928 vs. 0.879), our metric produces scores which are closer to the human scores on an absolute scale.

Discussion and Analysis
Summarizing the above findings, our method obtains the best results on 9 out of 18 language pairs, which makes it the method with the largest number of wins on the WMT19 metrics-without-references benchmark. This shows that knowledge-based evaluation is a promising path towards MT evaluation without references. In comparison, the next best method is YiSi-2 (Lo, 2019) which is based on token-level cosine-similarity using context-aware token representations from multilingual BERT (Devlin et al., 2019). We believe that combining our knowledge-based approach with such methods may result in even better correlation with human judgements, but leave this for future work.
As our metric is based on the recall of entities in the target with respect to the source, it is important that the entities are properly detected in the target. A failure to detect an entity in the source will just lead KoBE to use fewer entities, while a failure to detect an entity in the target will lead KoBE to penalize an entity that is actually present. Our entity linking pipelines work best in English, which results in much higher correlations with human judgements when English is the target language (Table 1) vs. the correlations when English is the source (Table 2). We believe that as entity linking systems improve for languages other than English, our metric will improve accordingly. Another possible concern is the evaluation of sentences which do not contain any detected entities; our analysis shows that this was the case for less than 8% of the sentences, so it did not have a large effect on the corpus-level metric.⁵
Figure 3 shows matching statistics for different MT systems across several entity categories from the KB. We can see that our scores vary across different categories between and within different systems, which can give an interpretable signal for system developers regarding where improvement efforts should be invested.
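A per-category breakdown of the kind discussed above could be computed along the following lines. This is a sketch under assumptions: the mapping `category_of` from entity id to category, the category names, and the entity ids are all hypothetical, and the actual KB metadata and the statistics behind Figure 3 may be organized differently.

```python
from collections import Counter, defaultdict


def recall_by_category(sentence_pairs, category_of):
    """Entity recall split by the category of each source entity.

    sentence_pairs: iterable of (source_ids, candidate_ids) lists.
    category_of: dict mapping entity id -> category label (assumed input).
    """
    matched = defaultdict(int)
    total = defaultdict(int)
    for source_ids, candidate_ids in sentence_pairs:
        src = Counter(source_ids)
        cand = Counter(candidate_ids)
        for eid, n_src in src.items():
            cat = category_of.get(eid, "unknown")
            total[cat] += n_src
            # Clip matches by the entity's count in the source.
            matched[cat] += min(n_src, cand[eid])
    return {cat: matched[cat] / total[cat] for cat in total}


# Hypothetical categories and entity ids.
category_of = {
    "navy_ua": "organization",
    "azov_sea": "place",
    "kyiv": "place",
}
pairs = [
    (["navy_ua", "azov_sea"], ["navy_ua"]),
    (["kyiv"], ["kyiv"]),
]
per_category = recall_by_category(pairs, category_of)
```

Here the candidate recovers the organization but misses one of the two place mentions, so a developer would see lower recall for places than for organizations.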
Our reproduction of the correlation results raises an issue with the current evaluation methodology in the shared task. In the published results (Fonseca et al., 2019), in order to support both lower-is-better metrics (e.g. TER; Snover et al., 2006) and higher-is-better metrics (e.g. BLEU), the absolute values of the Pearson correlations are reported. However, when looking at Table 2 and Table 3 we see that the same metric may be correlated with different signs in different language pairs. This may result in a wrong ranking of evaluation metrics, as the absolute value may "cover up" such cases. We hope future evaluations will take this detail into account.
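The masking effect described above can be illustrated with a small sketch. The correlation values and language-pair names are invented, and orienting each correlation by the metric's declared direction is only one possible alternative to reporting absolute values, not a procedure taken from the shared task.

```python
def absolute_view(correlations):
    """Report |r| per language pair, as in the published results."""
    return {pair: abs(r) for pair, r in correlations.items()}


def oriented_view(correlations, higher_is_better):
    """Flip the sign only when the metric is declared lower-is-better.

    A negative value then always means the metric ranks systems in the
    opposite order to the human judgements.
    """
    sign = 1.0 if higher_is_better else -1.0
    return {pair: sign * r for pair, r in correlations.items()}


# Hypothetical signed Pearson correlations for a higher-is-better metric.
correlations = {"pair_a": 0.8, "pair_b": -0.6}

abs_scores = absolute_view(correlations)
oriented_scores = oriented_view(correlations, higher_is_better=True)
```

Under the absolute view the metric looks reasonably strong on both pairs, while the oriented view exposes that on the second pair it is anti-correlated with human judgements.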
A possible drawback of our approach is that it only relies on entities, which do not fully cover the sentence semantics. However, in the quality estimation setting, we only have access to the source and candidate translation, which are in different languages. As different languages use different syntactic structures and vocabulary, it is hard to employ other structural cues; for example, the order of the entities may be different due to the grammatical differences between the languages. The strong correlation between our metric and human judgements shows that knowledge-based comparison is a strong indicator of translation quality in this challenging setting. This is in line with the results of Freitag et al. (2020), who showed that BLEU with extensively paraphrased references correlates better with human judgements than BLEU with vanilla references. Our method is "paraphrasing" or "stripping down" the candidate and reference to only contain the mentioned entities during evaluation.

Related Work
Quality estimation for MT has been studied extensively in recent years (see Specia et al. (2018) for a thorough overview). Most work has been on the sentence or word level, using supervised approaches, e.g. OpenKiwi (Kepler et al., 2019). Using semantic knowledge for MT evaluation was proposed in different approaches: METEOR (Denkowski and Lavie, 2014) used paraphrase tables for reference-based evaluation; YiSi (Lo, 2019) and MEANT (Lo, 2017) used semantic role labeling (SRL) annotations; Birch et al. (2016) used the UCCA semantic annotations (Abend and Rappoport, 2013) for human evaluation of MT; Li et al. (2013) proposed a name-aware BLEU score giving more weight to named entities. Babych and Hartley (2004) conducted a comparative evaluation of named entity recognition (NER) from MT outputs, concluding that the success rate of NER does not strongly correlate with human or automatic evaluation scores. We show contradicting results, which may stem from the better NER and MT systems available today, and from the entity linking step we add. To the best of our knowledge, our work is the first to introduce a reference-less MT evaluation method based purely on entity linking against a multilingual knowledge base.

Conclusions and Future Work
We proposed KoBE, a method for reference-less machine translation evaluation using entity linking to a multilingual knowledge base. We demonstrated the applicability of our method by achieving strong results on the WMT19 benchmark for reference-less evaluation across 9 language pairs, where in 4 cases it also outperforms the reference-based BLEU. Our method is simple, interpretable and produces scores closer to human judgements on an absolute scale, while enabling more fine-grained analysis which can be useful to find weak spots in the evaluated model. In future work, we would like to combine knowledge-based signals with unsupervised approaches like YiSi (Lo, 2019) and XMoverScore (Zhao et al., 2020) that use contextualized representations from cross-lingual LMs like multilingual BERT (Devlin et al., 2019). As our method does not require reference translations, we would like to explore scaling it to use much larger or domain-specific monolingual datasets. Our knowledge-based approach can also be applied to other text generation tasks like summarization or text simplification where BLEU was shown to be problematic (Sulem et al., 2018). Finally, performing outlier-aware meta-evaluation, which was recently shown to be important in such settings (Mathur et al., 2020), could be beneficial.

A Supplemental Material
The data used in this paper is taken from the WMT19 results.⁶ We downloaded the news translation task submissions⁷ and annotated them using entity linking pipelines. We make our annotations publicly available to reproduce our results. We downloaded the Metrics task data⁸ and obtained the submitted metrics scores, together with the standardized human direct assessment (DA) scores, from the results/sys-level_scores_metrics.csv file. We recalculated the Pearson correlations for all metrics and made sure we got the same results as reported in the WMT19 official results (Fonseca et al., 2019).
Our submission contains a copy of the sys-level_scores_metrics.csv file, containing the submitted metrics scores, together with the human direct assessment (DA) scores. In addition, we publish the annotations for all WMT19 news translation task submissions. The published data contains a file for each system in each language pair, as well as the annotations for the source text and reference translations. Our annotations are in JSON format and contain all the entities that were detected in each sentence. Each entity has an id and start and end positions in the sentence. In addition, we publish a Python script that, given the sys-level_scores_metrics.csv file and the annotations, first calculates our score