Automatic Machine Translation Evaluation using Source Language Inputs and Cross-lingual Language Model

We propose an automatic evaluation method for machine translation that uses source language sentences as additional pseudo-references. The proposed method evaluates a translation hypothesis with a regression model that takes the paired source, reference, and hypothesis sentences together as input. A pretrained, large-scale cross-lingual language model encodes the input into sentence-pair vectors, from which the model predicts a human evaluation score. Our experiments show that the proposed method, using a Cross-lingual Language Model (XLM) trained with a translation language modeling (TLM) objective, achieves a higher correlation with human judgments than a baseline method that uses only hypothesis and reference sentences. In addition, we confirm that using source sentences in the proposed method improves the evaluation performance.


Introduction
Automatic machine translation evaluation (MTE) has been studied as a substitute for human evaluation in machine translation development because it is low-cost, convenient, and stable. Popular automatic MTE metrics such as BLEU (Papineni et al., 2002) calculate an evaluation score based on the surface-level similarity between a paired reference and translated hypothesis sentence. BLEU in particular evaluates sentence similarity by the n-gram matching rate between a reference and a hypothesis. However, the evaluation score drops when a reference and hypothesis are dissimilar on the surface, even if they share the same meaning.
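As a minimal illustration of this surface-level limitation, consider the following sketch using NLTK's sentence-level BLEU (the example sentences are our own, not taken from any evaluation data):

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = ["the", "cat", "sat", "on", "the", "mat"]
# Same meaning, different surface forms.
paraphrase = ["a", "cat", "was", "sitting", "on", "a", "rug"]

# BLEU only counts n-gram overlap, so the paraphrase scores poorly
# even though a human would judge it an adequate translation.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], paraphrase, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # low despite semantic equivalence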
To counter this problem, METEOR (Banerjee and Lavie, 2005) was proposed to relax exact word matching by allowing synonym matches with a synonym dictionary. Still, even with relaxed word matching, surface-level similarity cannot fully capture semantics; thus word representations, instead of word surface forms, are used in Word Mover's Distance (Kusner et al., 2015) and bleu2vec (Tättar and Fishel, 2017).
Moreover, sentence representations are known to be more effective features than word representations because sentence vectors can capture more global meaning. RUSE (Shimanaka et al., 2018) and the BERT (Devlin et al., 2019) based metric, the BERT regressor (Shimanaka et al., 2019), utilize sentence representations and performed well on the WMT17 Metrics Shared Task (Bojar et al., 2017). The metrics mentioned above compare a hypothesis translation to a reference. However, a reference represents only one possible translation, and these MTE metrics are unlikely to correctly evaluate candidates that share the meaning of the reference but differ on the surface, or that have fatally different meanings due to a few translation errors. This problem can be mitigated by using multiple reference translations, as argued by Dreyer and Marcu (2012) and Qin and Specia (2015), but preparing multiple references is costly.
We therefore propose a method that incorporates the source sentence into MTE as another pseudo-reference, since the source and reference sentences should be semantically equivalent. The proposed method uses the Cross-lingual Language Model (XLM) (Lample and Conneau, 2019) to handle source and target languages in a shared sentence embedding space. The proposed method with XLM trained with a translation language modeling (TLM) objective showed a higher correlation with human judgments than a baseline method using only hypothesis and reference sentences.

Proposed method: Automatic evaluation using XLM
We propose an MTE method that uses source language sentences as additional pseudo-references. We use the cross-lingual language model XLM (Lample and Conneau, 2019) to encode both source and target language sentences into embedding vectors. XLM extends BERT with three techniques: a language-independent subword vocabulary based on Byte Pair Encoding (Sennrich et al., 2016), a language embedding layer, and a translation language modeling (TLM) objective that predicts masked words from the surrounding words or a paired translation. The architecture of XLM is sketched in Figure 1. Lample and Conneau (2019) reported that XLM trained with the TLM objective outperforms multilingual BERT (Devlin et al., 2019) on the XNLI cross-lingual classification task (Conneau et al., 2018).
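The following sketch shows, in simplified form, how a TLM training instance concatenates a sentence pair so that a masked token can be predicted from either language's context (our own illustration of the idea; the separator tokens and the 15% masking rate follow common practice, not necessarily the exact XLM preprocessing):

import random

def make_tlm_instance(src_tokens, tgt_tokens, mask_prob=0.15):
    # Concatenate the source sentence and its translation into one stream.
    tokens = ["</s>"] + src_tokens + ["</s>", "</s>"] + tgt_tokens + ["</s>"]
    # The language embedding layer receives a language ID per position.
    langs = ["en"] * (len(src_tokens) + 2) + ["de"] * (len(tgt_tokens) + 2)
    inputs, labels = [], []
    for tok in tokens:
        if tok != "</s>" and random.random() < mask_prob:
            inputs.append("[MASK]")  # masked position: a prediction target
            labels.append(tok)       # the model must recover this token,
                                     # using context from both languages
        else:
            inputs.append(tok)
            labels.append(None)      # not a prediction target
    return inputs, langs, labels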
The proposed method has two variants in its use of source language sentences, as illustrated in Figure 2. The first, called hyp+src/hyp+ref, uses two sentence-pair vectors, for hypothesis-source and hypothesis-reference, each encoded independently by the cross-lingual language model. These sentence-pair vectors are fed to an MLP-based regression model to predict the human evaluation score. This can be regarded as an ensemble model that combines a monolingual vector based on the reference and a cross-lingual vector based on the source sentence. The second, called hyp+src+ref, takes the concatenation of the hypothesis, source, and reference sentences as input to the cross-lingual language model to obtain a single sentence-pair vector. This vector is expected to directly learn to represent the quality of the translation hypothesis given the two correct sentences aligned alongside it.
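The following PyTorch sketch contrasts the two variants (an illustration under our own naming: encode_pair and encode_triple are hypothetical stand-ins for encoding a concatenated input with XLM and pooling it into a single vector):

import torch
import torch.nn as nn

class HypSrcHypRefRegressor(nn.Module):
    # hyp+src/hyp+ref: two independently encoded sentence-pair vectors.
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, hyp, src, ref):
        v_hs = self.encoder.encode_pair(hyp, src)  # cross-lingual vector
        v_hr = self.encoder.encode_pair(hyp, ref)  # monolingual vector
        return self.mlp(torch.cat([v_hs, v_hr], dim=-1))

class HypSrcRefRegressor(nn.Module):
    # hyp+src+ref: one vector from the three concatenated sentences.
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, hyp, src, ref):
        return self.mlp(self.encoder.encode_triple(hyp, src, ref))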

Experiments
We conducted experiments to evaluate the performance of the proposed method in MTE by comparing it with existing methods.

Setting
The experiments were conducted on the to-English language pairs of the segment-level WMT17 Metrics Shared Task corpus (Bojar et al., 2017). We split the sentences from WMT15 and WMT16 into training and development data at a ratio of 9:1, and all the sentences from WMT17 were used for the evaluation of the MTE methods. The corpus size for each language pair is shown in Table 1.
We used two models from the available XLM family: XLM15, pretrained with MLM and TLM, and XLM100, pretrained only with MLM. XLM15 is expected to perform better owing to the paired bilingual training of TLM, but the number of languages it covers is limited: among the corpus languages, XLM15 is compatible with only German, Russian, Turkish, and Chinese, which confines the model to a subset of the corpus. On the other hand, XLM100 covers all the language pairs in the corpus. Thus the experiments had two corpus settings: one was a small corpus including the {German (de), Russian (ru), Turkish (tr), Chinese (zh)} to English (en) language pairs, and the other was the whole corpus including the {Czech (cz), German (de), Finnish (fi), Latvian (lv), Romanian (ro), Russian (ru), Turkish (tr), Chinese (zh)} to English language pairs. The evaluation was conducted with Pearson's correlation to human judgments on the test set.
We compared the proposed methods with Sent-BLEU (Bojar et al., 2017) and the BERT regressor (Shimanaka et al., 2019), the latter with our own implementation. We also conducted experiments using multilingual BERT, BERT multi (cased), to contrast language models, and experiments limiting the model's input to hypothesis-source only (hyp+src) and hypothesis-reference only (hyp+ref) to study the impact of adding source sentences.
Fine-tuning of the proposed methods and the BERT regressor minimized the Mean Squared Error (MSE) loss on the training set, back-propagated through both the MLP and XLM. The hyperparameters were selected through a grid search. Since the models are affected by randomness in training, we ran ten experiments for each setting and report the average scores.
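A minimal sketch of this fine-tuning loop is shown below (the learning rate and epoch count are placeholders, not the values selected by the grid search):

import torch

def fine_tune(model, train_loader, lr=1e-5, epochs=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for hyp, src, ref, da_score in train_loader:
            optimizer.zero_grad()
            pred = model(hyp, src, ref).squeeze(-1)
            loss = loss_fn(pred, da_score)  # regress to human DA scores
            loss.backward()                 # gradients reach both MLP and XLM
            optimizer.step()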

Results
The results of the small corpus and whole corpus experiments are shown in Tables 2 and 3, respectively. Note that XLM15 was not included in the whole corpus experiments because it covers only four of the eight language pairs.

Performance of each language model
As we can see from Table 2, the proposed method using XLM15 with the hyp+src/hyp+ref structure surpassed the BERT regressor in the small corpus. However, XLM100 did not work well in our experiments: its results were much worse than the others in the small corpus condition, and it did not compete with the BERT regressor in the whole corpus condition, as shown in Table 3. One possible reason is the lack of TLM pretraining in XLM100. Since the TLM objective lets the model learn directly from semantically equivalent cross-lingual sentence pairs, we conclude that TLM pretraining is important for using source sentences in MTE. The results of multilingual BERT are worse than those of the BERT regressor and XLM15, but close to or slightly better than those of XLM100 in general. From this comparison of pretraining objectives and language models, we find that the proposed method is sensitive to the multilinguality of the underlying language model.
hyp+src/hyp+ref vs. hyp+src+ref
The results in Tables 2 and 3 show that the hyp+src/hyp+ref structure is better than hyp+src+ref in most conditions, although we expected hyp+src+ref to perform better because it can access two reference translations at the same time. This is probably because neither XLM nor multilingual BERT was pretrained to handle three sentences in a sequence. However, hyp+src+ref may surpass hyp+src/hyp+ref when the fine-tuning corpus is large enough.

Contribution of adding source sentences
Every model with hyp+src/hyp+ref achieved a better score than both hyp+src and hyp+ref, which indicates that source sentences contribute to the improvement of evaluation.

Analysis
Training data size
We conducted another experiment to see the effect of the training corpus size, using a randomly halved {de, ru, tr, zh}-en small corpus. As the results in Table 4 show, the BERT regressor performed stably even when the training data was reduced to about 1,000 sentences, whereas the performance of XLM15, XLM100, and multilingual BERT deteriorated. Since our proposed hyp+src/hyp+ref is an ensemble model and has a more complex network structure than hyp+ref, it presumably requires more training data to reach its full performance.

Evaluation errors
To see when the models make evaluation errors, we plot evaluation scores against human judgment scores (DA scores) in Figure 3. Although the evaluation scores of our best model, XLM15 hyp+src/hyp+ref, are aligned more linearly than those of the baseline BERT regressor, the scores of all models are widely dispersed in the low DA range (DA < 0.0). This indicates that all the evaluation models listed here tend to misevaluate poor hypotheses. Furthermore, we show Pearson's correlation for the high and low DA score ranges separately in Table 5. As the scatter plots suggest, the correlation in the low DA range is low; the evaluation models work poorly when hypotheses are poor. However, the drop in Pearson's correlation from the high to the low DA range is small for XLM15 hyp+src/hyp+ref and hyp+src. Therefore, adding source sentences helps stabilize the evaluation performance when hypotheses are low-quality.
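The range-wise correlations reported in Table 5 can be computed as in the following sketch (the DA threshold of 0.0 follows the split above; the inputs are arbitrary score arrays):

import numpy as np
from scipy.stats import pearsonr

def rangewise_pearson(da_scores, metric_scores, threshold=0.0):
    da = np.asarray(da_scores)
    metric = np.asarray(metric_scores)
    high = da >= threshold
    # Correlate metric scores with human DA scores in each range separately.
    r_high, _ = pearsonr(da[high], metric[high])
    r_low, _ = pearsonr(da[~high], metric[~high])
    return r_high, r_low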

Conclusion
In this paper, we proposed an MTE framework that utilizes source sentences with XLM. We showed that the proposed method with the TLM-trained XLM achieves a higher correlation with human judgments than the baseline method in the small corpus condition, and that the additional source sentences stabilize the evaluation performance regardless of the quality of the translation hypotheses. We also investigated why the proposed method worked poorly in the other conditions and found the importance of TLM pretraining. In future work, we will address the problem of evaluation errors in the low DA range.