On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish “translationese”, i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.


Introduction
A standard evaluation setup for supervised machine learning (ML) tasks assumes an evaluation metric which compares a gold label to a classifier prediction. This setup assumes that the task has clearly defined and unambiguous labels and, in most cases, that an instance can be assigned few labels. These assumptions, however, do not hold for natural language generation (NLG) tasks like machine translation (MT) (Bahdanau et al., 2015; Johnson et al., 2017) and text summarization (Rush et al., 2015; Tan et al., 2017), where we do not predict a single discrete label but generate natural language text. Thus, the set of labels for NLG is neither clearly defined nor finite. Yet, the standard evaluation protocols for NLG still predominantly follow the described default paradigm: (1) evaluation datasets come with human-created reference texts and (2) evaluation metrics, e.g., BLEU (Papineni et al., 2002) or METEOR (Lavie and Agarwal, 2007) for MT and ROUGE (Lin and Hovy, 2003) for summarization, count the exact "label" (i.e., n-gram) matches between reference and system-generated text. In other words, established NLG evaluation compares semantically ambiguous labels from an unbounded set (i.e., natural language texts) via hard symbolic matching (i.e., string overlap).
The first remedy is to replace the hard symbolic comparison of natural language "labels" with a soft comparison of texts' meaning, using semantic vector space representations. Recently, a number of MT evaluation methods appeared focusing on semantic comparison of reference and system translations (Shimanaka et al., 2018; Clark et al., 2019; Zhao et al., 2019). While these correlate better than n-gram overlap metrics with human assessments, they do not address inherent limitations stemming from the need for reference translations, namely: (1) references are expensive to obtain; (2) they assume a single correct solution and bias the evaluation, both automatic and human (Dreyer and Marcu, 2012; Fomicheva and Specia, 2016); and (3) they restrict MT evaluation to language pairs with available parallel data.
Reliable reference-free evaluation metrics, directly measuring the (semantic) correspondence between the source language text and system translation, would remove the need for human references and allow for unlimited MT evaluations: any monolingual corpus could be used for evaluating MT systems. However, the proposals of reference-free MT evaluation metrics have been few and far between and have required either non-negligible supervision (i.e., human translation quality labels) (Specia et al., 2010) or language-specific preprocessing like semantic parsing (Lo et al., 2014; Lo, 2019), both hindering the wide applicability of the proposed metrics. Moreover, they have also typically exhibited performance levels well below those of standard reference-based metrics (Ma et al., 2019).
In this work, we comparatively evaluate a number of reference-free MT evaluation metrics that build on the most recent developments in multilingual representation learning, namely cross-lingual contextualized embeddings (Devlin et al., 2019) and cross-lingual sentence encoders (Artetxe and Schwenk, 2019). We investigate two types of cross-lingual reference-free metrics: (1) Soft token-level alignment metrics find the optimal soft alignment between source sentence and system translation using Word Mover's Distance (WMD) (Kusner et al., 2015). Zhao et al. (2019) recently demonstrated that WMD operating on BERT representations (Devlin et al., 2019) substantially outperforms baseline MT evaluation metrics in the reference-based setting. In this work, we investigate whether WMD can yield comparable success in the reference-free (i.e., cross-lingual) setup; (2) Sentence-level similarity metrics measure the similarity between sentence representations of the source sentence and system translation using cosine similarity.
Our analysis yields several interesting findings. (i) We show that, unlike in the monolingual reference-based setup, metrics that operate on contextualized representations generally do not outperform symbolic matching metrics like BLEU, which operate in the reference-based environment. (ii) We identify two reasons for this failure: (a) firstly, cross-lingual semantic mismatch, especially for multilingual BERT (M-BERT), which construes a shared multilingual space in an unsupervised fashion, without any direct bilingual signal; (b) secondly, the inability of the state-of-the-art cross-lingual metrics based on multilingual encoders to adequately capture and punish "translationese", i.e., literal word-by-word translations of the source sentence. As translationese is an especially persistent property of MT systems, this problem is particularly troubling in our context of reference-free MT evaluation. (iii) We show that by executing an additional weakly-supervised cross-lingual re-mapping step, we can to some extent alleviate both previous issues. (iv) Finally, we show that the combination of cross-lingual reference-free metrics and language modeling on the target side (which is able to detect "translationese") surpasses the performance of reference-based baselines.
Beyond designating a viable prospect of web-scale domain-agnostic MT evaluation, our findings indicate that the challenging task of reference-free MT evaluation is able to expose an important limitation of current state-of-the-art multilingual encoders, i.e., the failure to properly represent corrupt input, that may go unnoticed in simpler evaluation setups such as zero-shot cross-lingual text classification or measuring cross-lingual text similarity not involving "adversarial" conditions. We believe this is a promising direction for nuanced, fine-grained evaluation of cross-lingual representations, extending the recent benchmarks which focus on zero-shot transfer scenarios (Hu et al., 2020).

Related Work
Manual human evaluations of MT systems undoubtedly yield the most reliable results, but are expensive, tedious, and generally do not scale to a multitude of domains. A significant body of research is thus dedicated to the study of automatic evaluation metrics for machine translation. Here, we provide an overview of both reference-based MT evaluation metrics and recent research efforts towards reference-free MT evaluation, which leverage cross-lingual semantic representations and unsupervised MT techniques.
Reference-based MT evaluation. Most of the commonly used evaluation metrics in MT compare system and reference translations. They are often based on surface forms such as n-gram overlaps like BLEU (Papineni et al., 2002), SentBLEU, NIST (Doddington, 2002), chrF++ (Popović, 2017) or METEOR++ (Guo and Hu, 2019). They have been extensively tested and compared in recent WMT metrics shared tasks (Bojar et al., 2017a; Ma et al., 2018a, 2019). These metrics, however, operate at the surface level, and by design fail to recognize semantic equivalence lacking lexical overlap. To overcome these limitations, some research efforts exploited static word embeddings (Mikolov et al., 2013b) and trained embedding-based supervised metrics on sufficiently large datasets with available human judgments of translation quality (Shimanaka et al., 2018). With the development of contextual word embeddings (Peters et al., 2018; Devlin et al., 2019), we have witnessed proposals of semantic metrics that account for word order. For example, Clark et al. (2019) introduce a semantic metric relying on sentence mover's similarity and the contextualized ELMo embeddings (Peters et al., 2018). Similarly, Zhang et al. (2019) describe a reference-based semantic similarity metric based on contextualized BERT representations (Devlin et al., 2019). Zhao et al. (2019) generalize this line of work with their MoverScore metric, which computes the mover's distance, i.e., the optimal soft alignment between tokens of the two sentences, based on the similarities between their contextualized embeddings. Mathur et al. (2019) train a supervised BERT-based regressor for reference-based MT evaluation.
Reference-free MT evaluation. Recently, there has been a growing interest in reference-free MT evaluation (Ma et al., 2019), also referred to as "quality estimation" (QE) in the MT community. In this setup, evaluation metrics semantically compare system translations directly to the source sentences. The attractiveness of automatic reference-free MT evaluation is obvious: it does not require any human effort or parallel data. To approach this task, Popović et al. (2011) exploit a bag-of-word translation model to estimate translation quality, which sums over the likelihoods of aligned word pairs between source and translation texts. Specia et al. (2013) estimate translation quality using language-agnostic linguistic features extracted from source language texts and system translations. Lo et al. (2014) introduce XMEANT as a cross-lingual reference-free variant of MEANT, a metric based on semantic frames. Lo (2019) extended this idea by leveraging M-BERT embeddings. The resulting metric, YiSi-2, evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. YiSi-2-SRL optionally combines an additional similarity score based on the alignment over the semantic structures (e.g., semantic roles and frames). Both metrics are reference-free, but YiSi-2-SRL is not resource-lean as it requires a semantic parser for both languages. Moreover, in contrast to our proposed metrics, they do not mitigate the misalignment of cross-lingual embedding spaces and do not integrate a target-side language model, which we identify to be crucial components. Recent progress in cross-lingual semantic similarity (Agirre et al., 2016; Cer et al., 2017) and unsupervised MT (Artetxe and Schwenk, 2019) has also led to novel reference-free metrics. For instance, Yankovskaya et al. (2019) propose to train a metric combining multilingual embeddings extracted from M-BERT and LASER (Artetxe and Schwenk, 2019) together with the log-probability scores from neural machine translation. Our work differs from that of Yankovskaya et al. (2019) in one crucial aspect: the cross-lingual reference-free metrics that we investigate and benchmark do not require any human supervision.
Cross-lingual Representations. Cross-lingual text representations offer a prospect of modeling meaning across languages and support cross-lingual transfer for downstream tasks (Klementiev et al., 2012; Rücklé et al., 2018; Glavaš et al., 2019; Josifoski et al., 2019; Conneau et al., 2020). Most recently, the (massively) multilingual encoders, such as M-BERT (Devlin et al., 2019), XLM-on-RoBERTa (Conneau et al., 2020), and (sentence-based) LASER, have profiled themselves as state-of-the-art solutions for (massively) multilingual semantic encoding of text. While LASER has been jointly trained on parallel data of 93 languages, M-BERT has been trained on the concatenation of monolingual data in more than 100 languages, without any cross-lingual mapping signal. There has been a recent vivid discussion on the cross-lingual abilities of M-BERT (Pires et al., 2019; K et al., 2020; Cao et al., 2020). In particular, Cao et al. (2020) show that M-BERT often yields disparate vector space representations for mutual translations and propose a multilingual re-mapping based on parallel corpora to remedy this issue. In this work, we introduce re-mapping solutions that are resource-leaner and require easy-to-obtain limited-size word translation dictionaries rather than large parallel corpora.

Reference-Free MT Evaluation Metrics
In the following, we use $x$ to denote a source sentence (i.e., a sequence of tokens in the source language), $y$ to denote a system translation of $x$ in the target language, and $y^\star$ to denote the human reference translation for $x$.

Soft Token-Level Alignment
We start from MoverScore (Zhao et al., 2019), a recently proposed reference-based MT evaluation metric designed to measure the semantic similarity between system outputs ($y$) and human references ($y^\star$). It finds an optimal soft semantic alignment between tokens from $y$ and $y^\star$ by minimizing the Word Mover's Distance (Kusner et al., 2015). In this work, we extend the MoverScore metric to operate in the cross-lingual setup, i.e., to measure the semantic similarity between n-grams (unigrams or bigrams) of the source text $x$ and the system translation $y$, represented with embeddings originating from a cross-lingual semantic space.
First, we decompose the source text $x$ into a sequence of n-grams, denoted by $x^n = (x^n_1, \ldots, x^n_m)$, and then do the same for the system translation $y$, denoting the resulting sequence of n-grams by $y^n$. Given $x^n$ and $y^n$, we can then define a distance matrix $C$ such that $C_{ij} = \|E(x^n_i) - E(y^n_j)\|_2$ is the distance between the $i$-th n-gram of $x$ and the $j$-th n-gram of $y$, where $E$ is a cross-lingual embedding function that maps text in different languages to a shared embedding space. The WMD between the two sequences of n-grams $x^n$ and $y^n$ with associated n-gram weights $f_{x^n} \in \mathbb{R}^{|x^n|}$ and $f_{y^n} \in \mathbb{R}^{|y^n|}$ is defined as:
$$\mathrm{WMD}(x^n, y^n) := \min_{F \geq 0} \sum_{i,j} C_{ij} F_{ij} \quad \text{s.t.} \quad F\mathbf{1} = f_{x^n}, \; F^\top \mathbf{1} = f_{y^n},$$
where $F \in \mathbb{R}^{|x^n| \times |y^n|}$ is a transportation matrix with $F_{ij}$ denoting the amount of flow traveling from $x^n_i$ to $y^n_j$.
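To make the optimization above concrete, the following is a minimal sketch that solves the transport problem as a linear program. It assumes the n-gram embeddings are already available as NumPy arrays and, unless weights are passed in, uses uniform n-gram weights; the function name `wmd` and the use of SciPy's generic `linprog` solver are illustrative choices and not the released implementation.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(emb_x, emb_y, f_x=None, f_y=None):
    """Word Mover's Distance between two sets of n-gram embeddings.

    emb_x: (m, d) array, emb_y: (n, d) array.
    f_x, f_y: optional non-negative weights; normalized to sum to 1.
    Uniform weights are an assumption made for this sketch.
    """
    emb_x = np.asarray(emb_x, dtype=float)
    emb_y = np.asarray(emb_y, dtype=float)
    m, n = len(emb_x), len(emb_y)
    f_x = np.full(m, 1.0 / m) if f_x is None else np.asarray(f_x, float) / np.sum(f_x)
    f_y = np.full(n, 1.0 / n) if f_y is None else np.asarray(f_y, float) / np.sum(f_y)

    # C[i, j] = Euclidean distance between the i-th and j-th embeddings.
    C = np.linalg.norm(emb_x[:, None, :] - emb_y[None, :, :], axis=-1)

    # Flatten F row-major; encode row-sum and column-sum constraints.
    row_sums = np.kron(np.eye(m), np.ones((1, n)))  # shape (m, m*n)
    col_sums = np.kron(np.ones((1, m)), np.eye(n))  # shape (n, m*n)
    result = linprog(
        C.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([f_x, f_y]),
        bounds=(0, None),  # flows are non-negative
    )
    return float(result.fun), result.x.reshape(m, n)
```

For example, `wmd(E_src, E_hyp)` returns the distance together with the flow matrix $F$, which can be inspected to see which source n-grams are aligned to which n-grams of the translation. A dedicated optimal-transport solver would scale better than a generic LP for long sentences.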

Sentence-Level Semantic Similarity
In addition to measuring semantic distance between $x$ and $y$ at the word level, one can also encode them into sentence representations with multilingual sentence encoders like LASER (Artetxe and Schwenk, 2019), and then measure their cosine distance, $1 - \frac{E(x)^\top E(y)}{\|E(x)\| \, \|E(y)\|}$.
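As a small illustration of the sentence-level variant, the sketch below computes the cosine distance between two sentence vectors. It assumes the embeddings have already been produced by some sentence encoder; the function name is hypothetical and no encoder API is shown.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two sentence embeddings."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(cos)
```

Identical directions give a distance of 0, orthogonal vectors give 1, and opposite directions give 2.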

Improving Cross-Lingual Alignments
Initial analysis indicated that, despite the multilingual pretraining of M-BERT (Devlin et al., 2019) and LASER (Artetxe and Schwenk, 2019), the monolingual subspaces of the multilingual spaces they induce are far from being semantically well-aligned, i.e., we obtain fairly distant vectors for mutual word or sentence translations. To this end, we apply two simple, weakly-supervised linear projection methods for post-hoc improvement of the cross-lingual alignments in these multilingual representation spaces.
Let $D = \{(w^1_\ell, w^1_k), \ldots, (w^n_\ell, w^n_k)\}$ be a set of matched word or sentence pairs from two different languages $\ell$ and $k$. We define a re-mapping function $f$ such that any $f(E(w_\ell))$ and $E(w_k)$ are better aligned in the resulting shared vector space. We investigate two resource-lean choices for the re-mapping function $f$.
Linear Cross-lingual Projection (CLP). Following related work (Schuster et al., 2019), we re-map contextualized embedding spaces using linear projection. Given $\ell$ and $k$, we stack all vectors of the source language words and target language words for the pairs in $D$, respectively, to form matrices $X_\ell$ and $X_k \in \mathbb{R}^{n \times d}$, with $d$ as the embedding dimension and $n$ as the number of word or sentence alignments. The word pairs we use to calibrate M-BERT are extracted from EuroParl (Koehn, 2005) using FastAlign (Dyer et al., 2013), and the sentence pairs to calibrate LASER are sampled directly from EuroParl. Mikolov et al. (2013a) propose to learn a projection matrix $W \in \mathbb{R}^{d \times d}$ by minimizing the Euclidean distance between the projected source language vectors and their corresponding target language vectors (with the vectors stacked as rows, $W$ acts by right-multiplication):
$$\min_{W} \|X_\ell W - X_k\|_F .$$
Xing et al. (2015) achieve further improvement on the task of bilingual lexicon induction (BLI) by constraining $W$ to an orthogonal matrix, i.e., such that $W^\top W = I$. This turns the optimization into the well-known Procrustes problem (Schönemann, 1966) with the following closed-form solution:
$$W = U V^\top, \quad \text{where} \quad U \Sigma V^\top = \mathrm{SVD}(X_\ell^\top X_k).$$
We note that the above CLP re-mapping is known to have deficits, i.e., it requires the embedding spaces of the involved languages to be approximately isomorphic (Søgaard et al., 2018; Vulić et al., 2019). Recently, some re-mapping methods that reportedly remedy this issue have been suggested (Glavaš and Vulić, 2020; Mohiuddin and Joty, 2020). We leave the investigation of these novel techniques for future work.
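The orthogonal Procrustes solution can be sketched in a few lines of NumPy. The sketch assumes the matched source and target embeddings are stacked as rows of two arrays of equal shape, so the learned map is applied by right-multiplication; the function name is illustrative and not taken from the released code.

```python
import numpy as np


def fit_orthogonal_map(X_src, X_tgt):
    """Orthogonal W minimizing ||X_src @ W - X_tgt||_F (Procrustes).

    X_src, X_tgt: (n, d) arrays whose i-th rows are embeddings of
    the i-th matched word or sentence pair.
    """
    # SVD of the d x d cross-covariance between the two spaces.
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt


# Re-mapping a source-language embedding matrix E_src (rows = vectors):
#   E_src_mapped = E_src @ W
```

Because $W$ is orthogonal, the re-mapping only rotates (or reflects) the source space, so distances and angles within the source language are preserved.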

Universal Language Mismatch-Direction (UMD). Our second post-hoc linear alignment method is inspired by the recent work on removing biases in distributional word vectors (Dev and Phillips, 2019; Lauscher et al., 2019). We adopt the same approaches in order to quantify and remedy the "language bias", i.e., representation mismatches between mutual translations in the initial multilingual space. Formally, given $\ell$ and $k$, we create individual misalignment vectors $E(w^i_\ell) - E(w^i_k)$ for each bilingual pair in $D$. Then we stack these individual vectors to form a matrix $Q \in \mathbb{R}^{n \times d}$. We then obtain the global misalignment vector $v_B$ as the top right singular vector of $Q$, i.e., a unit vector in $\mathbb{R}^d$. The global misalignment vector presumably captures the direction of the representational misalignment between the languages better than the individual (noisy) misalignment vectors $E(w^i_\ell) - E(w^i_k)$. Finally, we modify all vectors $E(w_\ell)$ and $E(w_k)$ by subtracting their projections onto the global misalignment direction vector $v_B$:
$$f(E(w)) = E(w) - \big(E(w)^\top v_B\big)\, v_B .$$

Language Model
BLEU scores often fail to reflect the fluency level of translated texts (Edunov et al., 2019). Hence, we use the language model (LM) of the target language to regularize the cross-lingual semantic similarity metrics, by coupling our cross-lingual similarity scores with a GPT language model of the target language (Radford et al., 2018). We expect the language model to penalize translationese, i.e., unnatural word-by-word translations, and boost the performance of our metrics.
The human judgments against which we evaluate our metrics are based on comparing human references and system predictions. We will discuss this discrepancy in §5.3.
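Returning to the UMD re-mapping defined earlier in this section, the following sketch shows one way to estimate the global misalignment direction and remove it from a set of embeddings. It assumes embeddings are stacked as rows, so the direction in embedding space is taken from the right singular vectors of the stacked difference matrix; the function names are hypothetical.

```python
import numpy as np


def misalignment_direction(E_src, E_tgt):
    """Top singular direction (unit vector in R^d) of the stacked
    differences between matched source and target embeddings."""
    Q = np.asarray(E_src, float) - np.asarray(E_tgt, float)  # (n, d)
    _, _, Vt = np.linalg.svd(Q, full_matrices=False)
    return Vt[0]


def remove_direction(E, v):
    """Subtract from every row of E its projection onto unit vector v."""
    E = np.asarray(E, float)
    return E - np.outer(E @ v, v)
```

After calling `remove_direction` on both the source and the target embeddings, neither space has any component left along the estimated mismatch direction, while all orthogonal components are unchanged. The sign of the returned direction is arbitrary, which does not affect the projection.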
Word-level metrics. We denote our word-level alignment metrics based on WMD as MOVERSCORE-NGRAM + ALIGN(EMBEDDING), where ALIGN is one of our two post-hoc cross-lingual alignment methods (CLP or UMD). For example, MOVER-2 + UMD(M-BERT) denotes the metric combining MoverScore based on bigram alignments, with M-BERT embeddings and UMD as the post-hoc alignment method.
Sentence-level metric. We denote our sentence-level metrics as COSINE + ALIGN(EMBEDDING). For example, COSINE + CLP(LASER) measures the cosine distance between the sentence embeddings obtained with LASER, post-hoc aligned with CLP.

Datasets
We collect the source language sentences and their system and reference translations from the WMT17-19 news translation shared tasks (Bojar et al., 2017b; Ma et al., 2018b, 2019), which contain predictions of 166 translation systems across 16 language pairs in WMT17, 149 translation systems across 14 language pairs in WMT18 and 233 translation systems across 18 language pairs in WMT19. We evaluate for X-en language pairs, selecting X from a set of 12 diverse languages: German (de), Chinese (zh), Czech (cs), Latvian (lv), Finnish (fi), Russian (ru), Turkish (tr), Gujarati (gu), Kazakh (kk), Lithuanian (lt), and Estonian (et). Each language pair in WMT17-19 has approximately 3,000 source sentences, each associated with one reference translation and with the automatic translations generated by participating systems.

[Table 1. Columns: Setting, Metrics, cs-en, de-en, fi-en, lv-en, ru-en, tr-en, zh-en, Average.]

Results
Figure 1 shows that our metric MOVER-2 + CLP(M-BERT) ⊕ LM, operating on modified M-BERT with the post-hoc re-mapping and combining a target-side LM, outperforms BLEU by 5.7 points in segment-level evaluation and achieves comparable performance in the system-level evaluation. Figure 2 shows that the same metric obtains a gain of 15.3 points (73.1 vs. 57.8), averaged over 7 languages, on WMT19 (system-level) compared to the state-of-the-art reference-free metric YiSi-2. Except for one language pair, gu-en, our metric performs on a par with the reference-based BLEU (see Table 8 in the Appendix) at the system level.
In Table 1, we exhaustively compare results for several of our metric variants, based either on M-BERT or LASER. We note that re-mapping has considerable effect for M-BERT (improvements of up to 10 points), but much less so for LASER. We believe that this is because the underlying embedding space of LASER is less 'misaligned' since it has been (pre-)trained on parallel data. While the re-mapping is thus effective for metrics based on M-BERT, we still require the target-side LM to outperform BLEU. We assume the LM can address challenges that the re-mapping apparently is not able to handle properly; see our discussion in §5.1.
Overall, we remark that none of our metric combinations performs consistently best. The reason may be that LASER and M-BERT are pretrained over hundreds of languages with substantial differences in corpora sizes in addition to the different effects of the re-mapping. However, we observe that MOVER-2 + CLP(M-BERT) performs best on average over all language pairs when the LM is not added. When the LM is added, MOVER-2 + CLP(M-BERT) ⊕ LM and COSINE + UMD(LASER) ⊕ LM perform comparably. This indicates that there may be a saturation effect when it comes to the LM or that the LM coefficients should be tuned individually for each semantic similarity metric based on cross-lingual representations. (In the appendix, however, we find that re-mapping LASER using 2k parallel sentences achieves considerable improvements on low-resource languages, e.g., kk-en (from -61.1 to 49.8) and lt-en (from 68.3 to 75.9); see Table 8.)

Analysis
We first analyze preferences of our metrics based on M-BERT and LASER (§5.1) and then examine how much parallel data we need for re-mapping our vector spaces (§5.2). Finally, we discuss whether it is legitimate to correlate our metric scores, which evaluate the similarity of system predictions and source texts, to human judgments based on system predictions and references (§5.3).

Metric preferences
To analyze why our metrics based on M-BERT and LASER perform so poorly on the task of reference-free MT evaluation, we query them for their preferences. In particular, for a fixed source sentence $x$, we consider two target sentences $\tilde{y}$ and $\hat{y}$ and evaluate the following score difference:
$$d(\tilde{y}, \hat{y}; x) := m(x, \tilde{y}) - m(x, \hat{y}).$$
When $d > 0$, metric $m$ prefers $\tilde{y}$ over $\hat{y}$, given $x$, and when $d < 0$, this relationship is reversed.
In the following, we compare preferences of our metrics for specifically modified target sentences $\tilde{y}$ over the human references $y^\star$. We choose $\tilde{y}$ to be (i) a random reordering of $y^\star$, to ensure that our metrics do not have the BOW (bag-of-words) property, (ii) a word-order preserving translation of $x$, i.e., (ii-a) an expert reordering of the human $y^\star$ to have the same word order as $x$ as well as (ii-b) a word-by-word translation, obtained either using experts or automatically. Especially condition (ii-b) tests for preferences for literal translations, a common MT-system property.
Expert word-by-word translations. We had an expert (one of the co-authors) translate 50 German sentences word-by-word into English. Table 2 illustrates this scenario. We note how bad the word-by-word translations sometimes are even for closely related language pairs such as German-English. For example, the word-by-word translations in English retain the original German verb-final positions, leading to quite ungrammatical English translations. Figure 3 shows histograms for the d statistic for the 50 selected sentences. We first check condition (i) for the 50 sentences. We observe that both MOVER + M-BERT and COSINE + LASER prefer the original human references over random reorderings, indicating that they are not BOW models, a reassuring finding. Concerning (ii-a), they are largely indifferent between correct English word order and the situation where the word order of the human reference is the same as the German. Finally, they strongly prefer the expert word-by-word translations over the human references (ii-b).
Condition (ii-a) in part explains why our metrics prefer expert word-by-word translations the most: for a given source text, these have higher lexical overlap than human references and, by (ii-a), they have a favorable target language syntax, viz., where the source and target language word order are equal. Preference for translationese, (ii-b), in turn is apparently a main reason why our metrics do not perform well, by themselves and without a language model, as reference-free MT evaluation metrics. More worryingly, it indicates that cross-lingual M-BERT and LASER are not robust to the 'adversarial inputs' given by MT systems.
Automatic word-by-word translations. For a large-scale analysis of condition (ii-b) across different language pairs, we resort to automatic word-by-word translations obtained from Google Translate (GT). To do so, we go over each word in the source sentence x from left to right, look up its translation in GT independently of context and replace the word by the obtained translation. When a word has several translations, we keep the first one offered by GT. Due to context-independence, the GT word-by-word translations are of much lower quality than the expert word-by-word translations since they often pick the wrong word senses. For example, the German word sein may either be a personal pronoun (his) or the infinitive to be, which would be selected correctly only by chance; cf. Table 2.
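The lookup procedure can be illustrated with a small sketch. It does not call Google Translate; instead it assumes a hypothetical in-memory lexicon, `TOY_LEXICON`, whose entries are filled in from the first German example of Table 2 and whose candidate lists are ordered so that the first entry is the one that would be kept.

```python
# Hypothetical toy lexicon: each source word maps to a ranked list of
# context-independent translations (first entry = the one we keep).
TOY_LEXICON = {
    "dieser": ["this"], "von": ["from", "of"], "langsamkeit": ["slowness"],
    "geprägte": ["embossed", "characterized"], "lebensstil": ["lifestyle"],
    "scheint": ["seems"], "aber": ["but", "however"], "ein": ["on", "a"],
    "patentrezept": ["nostrum"], "für": ["for"], "hohes": ["high"],
    "alter": ["older", "age"], "zu": ["to"], "sein": ["his", "to be"],
}


def word_by_word(sentence, lexicon):
    """Translate left to right, one word at a time, ignoring context."""
    output = []
    for token in sentence.split():
        word = token.strip(".,;:!?")          # drop attached punctuation
        candidates = lexicon.get(word.lower())
        # Keep the first candidate; copy unknown words unchanged.
        output.append(candidates[0] if candidates else word)
    return " ".join(output)


source = ("Dieser von Langsamkeit geprägte Lebensstil scheint aber "
          "ein Patentrezept für ein hohes Alter zu sein.")
literal = word_by_word(source, TOY_LEXICON)
```

With this lexicon, `literal` reproduces the lower-quality literal rendering from Table 2 (up to casing and punctuation), including the wrong sense "his" for sein.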
$x$: Dieser von Langsamkeit geprägte Lebensstil scheint aber ein Patentrezept für ein hohes Alter zu sein.
$y^\star$: However, this slow pace of life seems to be the key to a long life.
$y^\star$-random: To pace slow seems be the this life. life to a key however, of long
$y^\star$-reordered: This slow pace of life seems however the key to a long life to be.
$x'$-GT: This from slowness embossed lifestyle seems but on nostrum for on high older to his.
$x'$-expert: This of slow pace characterized life style seems however a patent recipe for a high age to be.

$x$: Putin teilte aus und beschuldigte Ankara, Russland in den Rücken gefallen zu sein.
$y^\star$: Mr Putin lashed out, accusing Ankara of stabbing Moscow in the back.
$y^\star$-random: Moscow accusing lashed Putin the in Ankara out, Mr of back. stabbing
$y^\star$-reordered: Mr Putin lashed out, accusing Ankara of Moscow in the back stabbing.
$x'$-GT: Putin divided out and accused Ankara Russia in the move like to his.
$x'$-expert: Putin lashed out and accused Ankara, Russia in the back fallen to be.
Table 2: Original German input sentence $x$, together with the human reference $y^\star$ in English, a randomly reordered ($y^\star$-random) and an expertly reordered ($y^\star$-reordered) English sentence, as well as a word-by-word translation ($x'$) of the German source sentence. The latter is obtained either by the human expert or by Google Translate (GT).
Instead of reporting histograms of $d$, we define a "W2W" statistic that counts the relative number of times that $d(x', y^\star)$ is positive, where $x'$ denotes the described literal translation of $x$ into the target language:
$$\mathrm{W2W} := \frac{1}{N} \sum_{(x', y^\star)} I\big(d(x', y^\star) > 0\big).$$
Here $N$ normalizes W2W to lie in $[0, 1]$, and a high W2W score indicates that the metric prefers translationese over human-written references. Table 3 shows that reference-free metrics with original embeddings (LASER and M-BERT) either still prefer literal over human translations (e.g., W2W score of 70.2% for cs-en) or struggle in distinguishing them. Re-mapping helps to a small degree. Only when combined with the LM scores do we get adequate scores for the W2W statistic. Indeed, the LM is expected to capture unnatural word order in the target language and penalize word-by-word translations by recognizing them as much less likely to appear in a language.
Note that for expert word-by-word translations, we would expect the metrics to perform even worse.
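The preference difference and the W2W statistic discussed above reduce to a few array operations once metric scores are available. The sketch below assumes that higher scores mean the metric likes a candidate more, and that scores for the literal translations and for the human references are given in the same sentence order; both function names are illustrative.

```python
import numpy as np


def preference(scores_a, scores_b):
    """d = m(x, a) - m(x, b) per sentence; d > 0 means a is preferred."""
    return np.asarray(scores_a, float) - np.asarray(scores_b, float)


def w2w(scores_literal, scores_reference):
    """Fraction of sentences for which the metric prefers the literal
    word-by-word translation over the human reference."""
    d = preference(scores_literal, scores_reference)
    return float(np.mean(d > 0))
```

A value close to 1 would indicate a metric that systematically rewards translationese, while a value close to 0 indicates that human references are preferred; ties count as not preferring the literal translation.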

Human Judgments
The WMT datasets contain segment- and system-level human judgments that we use for evaluating the quality of our reference-free metrics. The segment-level judgments assign one direct assessment (DA) score to each pair of system and human translation, while system-level judgments associate each system with a single DA score averaged across all pairs in the dataset. We initially suspected the DA scores to be biased for our setup, which compares $x$ with $y$, as they are based on comparing $y^\star$ and $y$. Indeed, it is known that (especially) human professional translators "improve" $y^\star$, e.g., by making it more readable, relative to the original $x$ (Rabinovich et al., 2017). We investigated the validity of DA scores by collecting human assessments in the cross-lingual setting (CLDA), where annotators directly compare source and translation pairs $(x, y)$ from the WMT17 dataset. This small-scale manual analysis hints that DA scores are a valid proxy for CLDA. Therefore, we decided to treat them as reliable scores for our setup and evaluate our proposed metrics by comparing their correlation with DA scores.
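As an illustration of how such a correlation could be computed at the system level, the sketch below averages a metric's segment scores per system and correlates the result with the system-level DA scores using Pearson's r. Averaging the metric scores per system is an assumption of this sketch, and the function and argument names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def system_level_pearson(segment_scores, system_ids, system_da):
    """Pearson correlation between per-system mean metric scores
    and system-level DA scores.

    segment_scores: metric score for each translated segment.
    system_ids:     id of the MT system that produced each segment.
    system_da:      dict mapping system id to its averaged DA score.
    """
    per_system = defaultdict(list)
    for score, system in zip(segment_scores, system_ids):
        per_system[system].append(score)
    systems = sorted(per_system)
    metric_means = np.array([np.mean(per_system[s]) for s in systems])
    human = np.array([system_da[s] for s in systems])
    return float(np.corrcoef(metric_means, human)[0, 1])
```

At least two systems with differing scores are needed for the correlation to be defined.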

Conclusion
Existing semantically-motivated metrics for reference-free evaluation of MT systems have so far displayed rather poor correlation with human estimates of translation quality. In this work, we investigate a range of reference-free metrics based on cutting-edge models for inducing cross-lingual semantic representations: cross-lingual (contextualized) word embeddings and cross-lingual sentence embeddings. We have identified some scenarios in which these metrics fail, prominently their inability to punish literal word-by-word translations (the so-called "translationese"). We have investigated two different mechanisms for mitigating this undesired phenomenon: (1) an additional (weakly-supervised) cross-lingual alignment step, reducing the mismatch between representations of mutual translations, and (2) language modeling (LM) on the target side, which is inherently equipped to punish "unnatural" sentences in the target language. We show that the reference-free coupling of cross-lingual similarity scores with the target-side language model surpasses the reference-based BLEU in segment-level MT evaluation. We believe our results have two relevant implications. First, they portray the viability of reference-free MT evaluation and warrant wider research efforts in this direction. Second, they indicate that reference-free MT evaluation may be the most challenging ("adversarial") evaluation task for multilingual text encoders, as it uncovers some of their shortcomings (prominently, the inability to capture semantically nonsensical word-by-word translations or paraphrases) which remain hidden in their common evaluation scenarios.
We release our metrics publicly under the name XMoverScore: https://github.com/AIPHES/ACL20-Reference-Free-MT-Evaluation.
Our metric allows for estimating translation quality on new domains. However, the evaluation is limited to those languages covered by multilingual embeddings. This is a major drawback for low-resource languages, e.g., Gujarati is not included in LASER. To this end, we take multilingual USE as an illustrating example which covers only 16 languages (in our sample, Czech, Latvian and Finnish are not included in USE). We re-align the corresponding embedding spaces with our re-mapping functions to induce evaluation metrics even for these languages, using only 2k translation pairs. Table 4 shows that our metric with a composition of re-mapping functions can raise correlation from zero to 0.10 for cs-en and to 0.18 for lv-en. However, for one language pair, fi-en, we see that correlation goes from negative to zero, indicating that this approach does not always work. This observation warrants further investigation.
Table 7: Pearson correlations with system-level human judgments on the WMT18 dataset.
Table 8: Pearson correlations with system-level human judgments on the WMT19 dataset. '-' marks the numbers not officially reported in (Ma et al., 2019).