SentSim: Crosslingual Semantic Evaluation of Machine Translation

Machine translation (MT) is currently evaluated in one of two ways: in a monolingual fashion, by comparing the system output to one or more human reference translations, or in a trained crosslingual fashion, by building a supervised model to predict quality scores from human-labelled data. In this paper, we propose a more cost-effective, yet well-performing unsupervised alternative, SentSim: relying on strong pretrained multilingual word and sentence representations, we directly compare the source with the machine-translated sentence, thus avoiding the need for both reference translations and labelled training data. The metric builds on state-of-the-art embedding-based approaches, namely BERTScore and Word Mover's Distance, by incorporating a notion of sentence semantic similarity. By doing so, it achieves better correlation with human scores on different datasets. We show that it outperforms these and other metrics in the standard monolingual setting (MT-reference translation), as well as in the source-MT bilingual setting, where it performs on par with glass-box approaches to quality estimation that rely on MT model information.


Introduction
Automatically evaluating machine translation (MT), as well as other language generation tasks, has been investigated for decades, with substantial progress in recent years due to advances in pretrained contextual word embeddings. The general goal of such evaluation metrics is to estimate the semantic equivalence between the input text (e.g. a source sentence or a document) and an output text that has been modified in some way (e.g. a translation or summary), as well as the general quality of the output (e.g. fluency). As such, by definition, metrics should perform some form of input-output comparison.
* Contributed equally to this work.

However, this direct comparison has proven hard in the past because of the natural differences between the two versions (such as being in different languages). Instead, evaluation metrics have resorted to comparison against one or more correct outputs produced by humans, a.k.a. reference texts, where comparisons at the string level are possible and straightforward. A multitude of evaluation metrics have been proposed following this approach, especially for MT, the application we focus on in this paper. These include the well-known BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) for machine translation, ROUGE (Lin, 2004) for summarization, and CIDEr (Vedantam et al., 2014) for image captioning. These traditional metrics are based on simple word- or n-gram-matching mechanisms, or slight relaxations of these (e.g. synonyms), which are computationally efficient but suffer from various limitations.
In order to overcome the drawbacks of traditional string-based evaluation metrics, recent work (Williams et al., 2018; Bowman et al., 2015; Echizen'ya et al., 2019; Cer et al., 2017) has investigated metrics that perform comparisons in the semantic space rather than at the surface level. Notable examples are applications of Word Mover's Distance (WMD; Kusner et al., 2015), such as WMDo (Chow et al., 2019), VIFIDEL, and MoverScore (Zhao et al., 2019), which compute similarity based on continuous word embeddings from pretrained representations. These have been shown to consistently outperform previous metrics on various language generation evaluation tasks.
However, these metrics have two limitations: (i) they still rely on reference outputs, which are expensive to collect, only cover one possible correct answer, and do not represent how humans perform evaluation; (ii) they are bag-of-embeddings approaches which capture semantic similarity at the token level, but are unable to capture the meaning of the sentence or text as a whole, including correct word order.
In this paper, focusing on MT, we address these limitations as follows. First, we posit that evaluation can be done by directly comparing the source to the machine translation using multilingual pretrained embeddings, such as multilingual BERT, avoiding the need for reference translations. We note that this is different from quality estimation (QE) metrics (Specia et al., 2013; Shah et al., 2015), which also compare source and machine-translated texts directly, but assume an additional step of supervised learning against human labels for quality. Second, we introduce Sentence Semantic Similarity (SSS), an additional component to be combined with bag-of-embeddings distance metrics such as BERTScore. More specifically, we propose to explore semantic similarity at the sentence level, based on sentence embeddings (Sellam et al., 2020; Thakur et al., 2020), and linearly combine it with existing metrics that use word embeddings. By doing so, the resulting metrics have access to both word-level and compositional semantics, leading to improved performance. The combination is a simple weighted sum and does not require training data.
As a motivational example, consider the case in Table 1, from the WMT-17 Metrics task (Zhang et al., 2019). When faced with MT sentences that contain a negated version of the reference (MT3 and MT4), token-level metrics such as BERTScore and WMD cannot correctly penalize these sentences, since they match representations of words in both versions without a full understanding of the semantics of the sentences. As a consequence, they return a high score for these incorrect translations, higher than the score for correct paraphrases of the reference (MT1 and MT2). Sentence similarity, on the other hand, correctly captures this mismatch in meaning, returning relatively lower scores for MT3 and MT4. On its own, however, sentence similarity may be too harsh, since the rest of the sentence has the same meaning. The combination of the two metrics (last column) balances these two sources of information and, as we will show later in this paper, has higher correlation with human scores.
Our main contributions are: 1. We investigate and show the effectiveness of linearly combining sentence-level semantic similarity with different metrics based on token-level semantic similarity. The resulting combined metric, SentSim, consistently achieves higher Pearson correlation with human judgements of translation quality than either word or sentence similarity alone.
2. We show, for the first time, that these metrics can be effective when comparing system-generated sentences directly against source sentences, in a crosslingual fashion.
3. Our SentSim metric outperforms existing metrics on various MT datasets in monolingual and crosslingual settings.

Related Work
Various natural language generation tasks, including machine translation and image captioning, among others, produce sentences as output. These are evaluated either manually or automatically by comparison against one or multiple reference sentences. A multitude of metrics have been proposed for the latter, performing comparisons at various granularity levels, from characters to words to embedding vectors. The goal of such metrics is to replace human judgements. To understand how well they fare at this task, metrics are themselves evaluated by how similar their scores are to human-assigned judgements on held-out datasets. For absolute quality judgements, Pearson correlation is the most commonly used measure for such a comparison (Mathur et al., 2020).
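Concretely, metric scores are correlated against the human judgements for the same segments. The snippet below is a minimal NumPy sketch of this evaluation step; the function name and toy scores are illustrative, not from the paper:

```python
import numpy as np

def pearson_correlation(metric_scores, human_scores):
    """Pearson correlation between a metric's scores and human judgements."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# A metric that scores segments similarly to humans correlates highly.
metric = [0.9, 0.4, 0.7, 0.2]   # hypothetical metric scores
human = [95, 40, 80, 10]        # hypothetical human scores (0-100)
r = pearson_correlation(metric, human)
```

In practice, libraries such as SciPy provide the same computation (`scipy.stats.pearsonr`); the explicit version above just makes the definition visible.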
Recent studies have shown that the new generation of automatic evaluation metrics, based on word semantics using continuous word embeddings such as BERT (Devlin et al., 2019), ELMo (Peters et al., 2019), XLNet or XLM-Roberta (Conneau et al., 2019) rather than on lexical overlap, achieve significantly higher Pearson correlation with human judgements when comparing reference sentences with system-generated sentences. Zhang et al. (2019) introduce BERTScore, an automatic evaluation metric based on contextual word embeddings, and test it on text generation tasks such as machine translation and image captioning, using embeddings including BERT, XLM-Roberta, and XLNet (more details in Section 3.2). Mathur et al. (2019) also investigate embedding-based metrics (Mikolov et al., 2013). Lo (2019) presents YiSi, a unified automatic semantic machine translation quality evaluation and estimation metric using BERT embeddings. There is also a body of work on metrics that take a step further and optimize their scores using machine learning algorithms trained on human quality scores (Sellam et al., 2020; Ma et al., 2017). These often perform even better, but the reliance on human scores for training, in addition to reference translations at inference time, makes them less applicable in practice. A separate strand of work that relies on contextual embeddings is that of quality estimation (Moura et al., 2020; Fomicheva et al., 2020a; Ranasinghe et al., 2020; Specia et al., 2020). These metrics are also trained on human judgements of quality, but machine translations are compared directly to the source sentences rather than against reference translations.
In addition to word embeddings, embeddings for full sentences have been shown to work very well for measuring semantic similarity. These are extracted using Transformer models that are specifically trained to capture sentence semantics on top of BERT, Roberta, and XLM-Roberta embeddings (Reimers and Gurevych, 2019; Thakur et al., 2020), with state-of-the-art pretrained models available for many languages (https://github.com/UKPLab/sentence-transformers). In this paper, we take inspiration from these lines of previous work to propose unsupervised metrics that combine word and sentence semantic similarity, and show that this can be effective for both MT-reference and source-MT comparisons.

Method
In this section, we first describe in more detail the metrics that we use in our experiments, namely semantic sentence cosine similarity, WMD and BERTScore. We then present our simple approach to linearly combining these metrics.

Word Mover's Distance (WMD)
Kusner et al. (2015) present the Word Mover's Distance (WMD) metric, a special case of Earth Mover's Distance (Rubner et al., 2000). It computes the semantic distance between two text documents by aligning semantically similar words and capturing the flow between them, utilizing the vectorial relationship between their word embeddings (Mikolov et al., 2013). WMD has been shown to produce consistently high-quality results on text similarity and text classification tasks (Kusner et al., 2015). A text document is represented as a vector $D = (d_1, \dots, d_n)$, where each element is the normalized frequency of a word in the document:

$$d_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$$

where $c_i$ is the number of times the $i$-th word appears in the document. Given two words $i$ and $j$ from different documents, the "word travelling cost" from embedding $x_i$ in one document to $x_j$ in the other is the Euclidean distance between the embeddings:

$$c(i, j) = \lVert x_i - x_j \rVert_2$$

Now assume two documents: a source document $A$, to which word $i$ belongs, and a target document $B$, to which word $j$ belongs. A flow matrix $T$ is defined, where each element $T_{ij}$ denotes how much of word $i$ in document $A$ travels to word $j$ in document $B$. The flow is normalized so that each word's outgoing and incoming flow matches its normalized frequency:

$$\sum_j T_{ij} = d_i \quad \text{and} \quad \sum_i T_{ij} = d'_j$$

The semantic distance computed by WMD is then:

$$\mathrm{WMD}(A, B) = \min_{T \geq 0} \sum_{i,j} T_{ij}\, c(i, j)$$

WMD can thus be computed by optimizing the values in the flow matrix $T$. In other words, WMD corresponds to the minimal cumulative distance required to move semantically similar words (via their embeddings) from one text document to the other.
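The optimization above is a standard transport problem and can be solved with an off-the-shelf linear programming routine. The sketch below (our own illustration, assuming SciPy is available; the function name is ours) builds the cost matrix and the flow constraints exactly as defined, favouring clarity over efficiency:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(emb_a, freq_a, emb_b, freq_b):
    """Word Mover's Distance as a transport problem.

    emb_a:  (n, dim) word embeddings of document A
    freq_a: (n,) normalized word frequencies d_i (sums to 1)
    emb_b, freq_b: the same for document B
    """
    n, m = len(freq_a), len(freq_b)
    # word travelling cost c(i, j): Euclidean distance between embeddings
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    # flow constraints: outgoing flow of word i equals d_i,
    # incoming flow of word j equals d'_j
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j T_ij = d_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i T_ij = d'_j
    b_eq = np.concatenate([freq_a, freq_b])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
    return float(res.fun)
```

Identical documents yield a distance of zero; shifting every embedding of one document by a fixed offset increases the distance by exactly that offset, matching the intuition of moving words through the embedding space.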

BERTScore
BERTScore (Zhang et al., 2020) is designed to evaluate semantic similarity between sentences in the same language, namely a reference sentence and a machine-generated sentence. Let the reference sentence be denoted as $x = (x_1, \dots, x_k)$ and the candidate sentence as $\hat{x} = (\hat{x}_1, \dots, \hat{x}_l)$. BERTScore uses contextual embeddings such as BERT (Devlin et al., 2019) or ELMo (Peters et al., 2019) to represent the word tokens in the sentences. It finds word matchings between the reference and candidate sentence using cosine similarity, which can optionally be reweighted by the inverse document frequency (IDF) of each word. BERTScore matches each token $x_i$ in the reference sentence to the closest token $\hat{x}_j$ in the candidate sentence to compute recall, and each token $\hat{x}_j$ in the candidate to the closest token $x_i$ in the reference to compute precision, combining recall with precision into an F1 score. In most cases, however, only recall is used for evaluation, defined (with pre-normalized embeddings) as follows:

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j$$

In essence, BERTScore can be viewed as a hard word alignment between a pair of sentences using contextual embeddings, in which each word is aligned to the closest word in the other sentence according to the cosine similarity between their vectors.
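The greedy matching step can be sketched with plain matrix operations. This is our simplified version: IDF reweighting is omitted and the contextual token embeddings are assumed to be precomputed:

```python
import numpy as np

def bertscore_sketch(ref_emb, cand_emb):
    """Greedy matching over contextual token embeddings.

    ref_emb:  (k, dim) token embeddings of the reference sentence
    cand_emb: (l, dim) token embeddings of the candidate sentence
    Returns (precision, recall, F1).
    """
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # pairwise cosine similarities
    recall = sim.max(axis=1).mean()     # each ref token -> closest cand token
    precision = sim.max(axis=0).mean()  # each cand token -> closest ref token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because each token is matched only to its single closest counterpart, this is the "hard alignment" view described above.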

Semantic Sentence Similarity (SSS)
A commonly used method to measure sentence similarity is the cosine similarity between two vectors summarizing the sentences:

$$\cos(\alpha, \beta) = \frac{\alpha \cdot \beta}{\lVert \alpha \rVert \, \lVert \beta \rVert}$$

where $\alpha$ and $\beta$ are the vectors representing the two sentences, obtained from pretrained sentence representations (Reimers and Gurevych, 2019; Thakur et al., 2020). The higher the cosine similarity between the two sentence vectors, the stronger their semantic similarity.
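As a sketch (the function name is ours; in the paper's setting the two vectors would come from a semantically fine-tuned sentence encoder):

```python
import numpy as np

def sss(sent_emb_a, sent_emb_b):
    """Cosine similarity between two sentence embeddings."""
    a = np.asarray(sent_emb_a, dtype=float)
    b = np.asarray(sent_emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Parallel vectors score 1, orthogonal vectors score 0, so the value directly reflects how close the two sentence representations are in the embedding space.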

SentSim
In order to bring the notion of sentence-level semantics to token similarity metrics, we combine the sentence cosine similarity, computed over semantically fine-tuned sentence embeddings, with the metrics based on contextual word embeddings. Assume the score generated by the sentence-level metric is denoted as $A$, the score generated by the token-level metric as $B$, and the gold-truth human judgement as $S$. Our combined metric, SentSim, is:

$$\mathrm{SentSim} = w_1 e^{A} + w_2 e^{B} \quad (7)$$

where $A$ and $B$ are normalized to the range between 0 and 1, and $w_1$ and $w_2$ are the weights given to the two metric scores. If metric $B$ is negatively correlated with $S$, i.e. if it is a distance metric like WMD, we use $e^{1-B}$; for similarity metrics such as SSS and BERTScore we use $e^{B}$. In Equation 7, we apply the exponential because the plain linear addition of the two similarity scores ($A + B$) uses only low-order terms, leading to large variance and inconsistent correlation with human scores. Such low-order models are too simple to fit the relationship between the similarities, so a non-linear model is required to project them into higher orders ($A^n + B^n$). Given the Taylor series expansion of the exponential function (Abramowitz and Stegun, 1965), we obtain a factorially weighted average of the two similarities across all orders:

$$e^{A} + e^{B} = \sum_{n=0}^{\infty} \frac{A^n + B^n}{n!} \quad (8)$$

Our final metric is given in Equation 8, which follows from Equation 7 via the Taylor series expansion. A similar use of the exponential function to convert distance scores into similarities appears in (Kilickaya et al., 2017; Clark et al., 2019).
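The combination in Equation 7 can be sketched as follows (a minimal illustration with our own function name; the normalization of the input scores to [0, 1] is assumed to have been done beforehand):

```python
import math

def sentsim(sss_score, token_score, token_is_distance=False, w1=0.5, w2=0.5):
    """Weighted sum of exponentiated sentence- and token-level scores.

    Both input scores are assumed to already be normalized to [0, 1].
    Distance metrics (e.g. WMD) are flipped via exp(1 - B); similarity
    metrics (e.g. BERTScore) use exp(B) directly.
    """
    a = math.exp(sss_score)
    b = math.exp(1.0 - token_score) if token_is_distance else math.exp(token_score)
    return w1 * a + w2 * b
```

With equal weights, a pair of perfect normalized scores maps to $e$, and translations judged better by both components always receive a higher combined score.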
In Section 5, we report experiments with two linear metric combinations: SSS + WMD and SSS + BERTScore, giving equal weight to each metric (w1 = w2 = 0.5). We also investigated the linear combination of Sentence Mover's Distance (Zhao et al., 2019) with token-level metrics, but its performance is poorer than that of SSS, so we only show those results in Appendix A.1.

Experiment Setup
In this section, we describe two types of experimental scenarios, monolingual and crosslingual evaluation, as well as the three datasets and pretrained embeddings we used.

Task Scenarios
The first evaluation setting we experimented with is the standard monolingual evaluation task scenario (MT-REF), which takes reference sentences and machine generated sentences in the same language as input. The second one is the crosslingual evaluation task scenario (SRC-MT), which directly assesses the similarity between source sentences and machine generated sentences in different languages. We compute our combined metrics for each task scenario separately.

Datasets
We use various datasets with absolute human judgements from recent evaluation campaigns.
Multi-30K (Elliott et al., 2016) is a multilingual (English-German (en-de) and English-French (en-fr)) image description dataset. We use the 2018 test set, in which each language pair contains more than 2K sentence tuples, consisting of source sentences, reference sentences, machine-generated sentences, and the corresponding human judgement scores in a continuous 0-100 range. This dataset can therefore be used for both crosslingual and monolingual task scenarios.
WMT-17 (Bojar et al., 2017) is a dataset containing multiple language pairs from the WMT News Translation task, used for segment-level system evaluation in the Metrics task. We used all seven to-English datasets: German-English (de-en), Chinese-English (zh-en), Latvian-English (lv-en), Czech-English (cs-en), Finnish-English (fi-en), Russian-English (ru-en), Turkish-English (tr-en), and two from-English datasets: English-Russian (en-ru) and English-Chinese (en-zh). Each language pair has 560 sentence tuples, where each tuple has a source sentence, a reference sentence and multiple system-generated sentences, in addition to a human score varying from 0 to 100. WMT-17 can be used in both monolingual and crosslingual evaluation task scenarios, and is our main experimental dataset. More recent WMT Metrics task datasets do not report metric results using absolute judgements, but rather convert these into pairwise judgements. While such relative judgements are useful to assess metrics' ability to rank different MT systems, they are not applicable to assessing metrics' ability to estimate quality in absolute terms, which is what we are interested in.
WMT-20 (Fomicheva et al., 2020b) is the dataset used in the WMT20 quality estimation task, where participants are expected to directly predict the translation quality between source sentences and machine-generated sentences without using reference sentences. This dataset has seven language pairs: Sinhala-English (si-en), Nepalese-English (ne-en), Estonian-English (et-en), English-German (en-de), English-Chinese (en-zh), Romanian-English (ro-en), and Russian-English (ru-en). We use the test set, where each language pair contains 1K tuples with source and machine-generated sentences, as well as human judgements in the 0-100 range. With this dataset we can therefore only perform crosslingual evaluation.

Embeddings
For each language model, we consider embeddings at the token level and sentence level, individually and in combination. In our experiments, we use Roberta-Large and XLM-Roberta-Base for monolingual and crosslingual assessments, respectively.
For crosslingual embeddings we use XLM-Roberta instead of multilingual BERT (mBERT), because the former significantly outperforms the latter (Conneau et al., 2019), as also shown for crosslingual semantic textual similarity (STS) tasks (Cer et al., 2017). For a fair comparison with previous metrics like WMDo, we replaced their original embeddings with XLM-Roberta-Base embeddings.
For the semantic sentence embeddings, we used XLM-Roberta-Base embeddings from Sentence Transformers, which were trained on SNLI (Bowman et al., 2015) + MultiNLI (Williams et al., 2018) and then fine-tuned on the STS benchmark training data. These sentence embeddings have been shown to provide good representations of the semantic relationship between two sentences, but they had not yet been tested for machine translation evaluation. Without such semantically fine-tuned embeddings, the performance of SSS is not consistent across different language pairs on our experimental datasets (see Appendix A.1). XLM-Roberta-Large embeddings are not used in our experiments because they are not yet available in the pretrained Sentence Transformers package.
For monolingual word and semantic sentence embeddings we use the Roberta-Large model, which has shown the best performance with BERTScore (Zhang et al., 2019).

Results
The evaluation results are presented in this section. Our code and data can be found on GitHub: https://github.com/Rain9876/Unsupervisedcrosslingual-Compound-Method-For-MT

SRC-MT Setting
Table 2 shows the Pearson correlation results of our metrics when comparing source sentences with machine-translated sentences, using both single metrics and their combinations, on the Multi-30K dataset. The results reveal that SSS + WMD outperforms all individual metrics and the other combined metrics. It is clear that SSS is better than both WMD and BERTScore, with WMD outperforming BERTScore in this specific crosslingual task.
In Table 3, the benefit of SSS becomes even more evident. It again outperforms WMD and BERTScore, with BERTScore also significantly outperforming WMD in this case. Moreover, SSS + BERTScore shows the best and most stable performance across all language pairs in the WMT-17 dataset. This can be clearly visualised for en-lv as an example in Figure 1, where we plot metric scores on the Y axis against human scores on the X axis.
We believe the differences in the performance of the combined metrics on the Multi-30K and WMT-17 datasets arise because sentence lengths differ significantly between them: sentences in Multi-30K have on average 12-14 words, much shorter than those in the WMT-17 dataset. Because WMD optimizes word alignment globally over the whole sentence, instead of locally like BERTScore, WMD performs better than BERTScore when sentences are short, but the optimization problem becomes harder when sentence length grows. This may explain why the performance of SSS + WMD is better than that of SSS + BERTScore on Multi-30K but lower than that of SSS + BERTScore on the WMT-17 dataset. SSS also outperforms WMD and BERTScore on the WMT-20 dataset, as Table 5 shows. SSS + BERTScore reaches the best performance in three out of seven language pairs and is the best metric in comparison with BERTScore or WMD alone. The metrics that outperform SSS + BERTScore for three language pairs either require multiple passes of the neural machine translation decoder to score or generate multiple translations (D-TP and D-Lex-Sim, respectively), or require supervised machine learning (Leaderboard baseline).

Table 5: Pearson correlation with human scores for the WMT-20 dataset with Roberta-Base in the SRC-MT setting. Metrics like D-TP and D-Lex-Sim (Fomicheva et al., 2020b) are unsupervised metrics which showed good performance in the WMT-20 quality estimation shared task, while the Leaderboard baseline is a supervised model provided by the organizers that uses training data to fine-tune pretrained representations.

MT-REF Setting
In the machine-generated sentence to reference sentence case, as Table 2 shows, SSS + WMD achieves the best result in the monolingual Multi-30K tasks for both German-German and French-French comparisons using XLM-Roberta-Base embeddings. However, for other datasets in this standard setting, where we compare sentences in a monolingual fashion, as can be observed from Table 4 for the WMT-17 dataset, SSS + BERTScore is the best metric. The reason for the difference is again likely to be the sentence lengths in the two datasets. Taken independently, the performance of SSS is not as good here as that of WMD or BERTScore. The two variants of the combined metric still outperform any metric on its own, and reach the best performance on this dataset. Table 4 also shows results for our WMD with Roberta-Large, indicating the importance of using pretrained contextual embeddings as the representation of tokens. A visual example of correlation plots can be seen in Figure 2, again for the en-lv language pair. Generally, the metrics' performance in the SRC-MT case is much lower than in the MT-REF setting. This can be attributed to the embeddings used. First, the models' embeddings are not the same in the two cases. For MT-REF, monolingual embeddings are used, which are known to be stronger; however, these cannot be used for SRC-MT evaluation, where crosslingual embeddings, trained on more than 100 languages, are used instead. Moreover, the way the crosslingual embeddings were generated does not rely on explicit alignments or mappings between tokens or sentences in different languages, which can make them suboptimal. Second, the pretrained model used for MT-REF (Roberta-Large) is much larger than that used for SRC-MT (XLM-Roberta-Base).
As previously mentioned, pretrained semantic sentence embeddings using XLM-Roberta-Large are not available, so we instead provide a comparison with Roberta-Base for the MT-REF case on WMT-17 in Section 5.5 to show the impact of model size.

Effect of Embedding Layers
Since both XLM-Roberta-Base and Roberta-Large have multiple layers, selecting a good layer or combination of layers is important for WMD and BERTScore. Here we use the WMT-17 dataset to study these representation choices. The Pearson correlation of WMD with human judgement scores in the SRC-MT setting, per XLM-Roberta-Base layer, is shown in Figure 3. Selecting layer 9 as the token embeddings for XLM-Roberta-Base leads to the best average Pearson correlation across the 9 language pairs in this SRC-MT setting. Among Roberta-Large's output layers, the best layer seems to be 17. This is in line with the results described in (Zhang et al., 2019), where the best layer of Roberta-Large to use in BERTScore is also found to be layer 17.

Analysis of SSS vs token-level metrics
For illustration purposes, Table 6 shows a few cases where SSS performs better than the token-level metric because it adds the notion of sentence meaning, and where, as a consequence, SentSim performs better (examples E1 and E2). It also shows a case where SSS is too sensitive to semantic changes (example E3). SSS also performs well in the SRC-MT case (example E4). Here, the second machine translation has very different and incorrect word order; the token-level metric (BERTScore) gives a much lower score than SSS, but both the token-level and SSS metrics capture the incorrect word order. The combined metric (SentSim) is therefore very robust.

Effect of Pretrained Embeddings
To analyse the impact of pre-trained embeddings, Table 7 shows the performance of Roberta-Base in the case of WMT-17 MT-REF. As with the general trend in NLP, this confirms that stronger embeddings (Roberta-Large, Table 4) lead to better performance. The same trend was observed for the other test sets.

Conclusions
In this paper, we propose to combine sentence-level and token-level evaluation metrics in an unsupervised way. In our experiments on a number of standard datasets, we demonstrate that this combination is more effective for MT evaluation than the current state-of-the-art unsupervised token-level metrics, substantially outperforming these as well as sentence-level semantic metrics on their own. The sentence-level metric seems to capture higher-level or compositional semantic similarity, which complements the token-level semantic similarity information.
We also show that this combination approach can be applied both in the standard monolingual evaluation setting, where machine translations are compared to reference translations, and in a crosslingual evaluation setting, where reference translations are not available and machine translations are directly compared with the source sentences.
In future work, we will aim to improve the crosslingual metric and explore other types of multilingual embeddings for better mapping across different languages.

A.1 Comparison to Sentence Mover's Distance
Sentence Mover's Distance (SMD) (Zhao et al., 2019) is an alternative sentence-level metric for sentence semantic similarity. It compares two text documents using sentence embeddings which are not semantically fine-tuned, but instead obtained by averaging or pooling the sentences' contextual word embeddings. SMD is defined as follows:

$$\mathrm{SMD}(x, y) := \lVert E(x_1^{l_x}) - E(y_1^{l_y}) \rVert$$

where $E$ is the embedding function that maps a token sequence to its vector representation, and $l_x$ and $l_y$ are the lengths of the two sentences. As a comparison, we experimented with the linear combination of SMD and each of our token-level metrics, WMD and BERTScore. The performance of these metrics for WMT-17 in both the SRC-MT and MT-REF settings, and for WMT-20 SRC-MT, is shown in Tables 8, 9 and 10. The overall performance of SMD is inferior to that of SSS, which is to be expected since it simply averages token-level embeddings. Similarly to SSS, the SMD metric's performance improves when it is combined with token-level metrics. The combined metric's performance drops when there is a large difference between the scores of the two combined metrics, e.g. more than 10%. For example, in Table 8 the gap between BERTScore and SMD for zh-en is 0.115, and the combined SMD + BERTScore only reaches a score of 0.503, compared to 0.51 for BERTScore alone. For other language pairs with closer BERTScore and SMD scores, e.g. ru-en, the performance of the combined metric remains the same or improves.
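The pooling-based definition above can be sketched as follows (our illustration; the contextual token embeddings are assumed to be precomputed):

```python
import numpy as np

def smd(token_embs_x, token_embs_y):
    """Sentence Mover's Distance sketch: average-pool the contextual
    token embeddings of each sentence, then take the Euclidean distance
    between the pooled vectors (no semantic fine-tuning involved)."""
    pooled_x = np.mean(token_embs_x, axis=0)
    pooled_y = np.mean(token_embs_y, axis=0)
    return float(np.linalg.norm(pooled_x - pooled_y))
```

Because the pooling is a plain average, all token-order and compositional information is lost, which is consistent with SMD underperforming the semantically fine-tuned SSS embeddings in our experiments.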

A.2 Plots with Metrics' Performance
To facilitate visualisation of our main tabular results presented in the paper, Figures 5, 6, 7 show them as bar plots.