Metric for Automatic Machine Translation Evaluation based on Universal Sentence Representations

Sentence representations can capture a wide range of information that cannot be captured by local features based on character or word N-grams. This paper examines the usefulness of universal sentence representations for evaluating the quality of machine translation. Although it is difficult to train sentence representations using small-scale translation datasets with manual evaluation, sentence representations trained on large-scale data from other tasks can improve the automatic evaluation of machine translation. Experimental results on the WMT-2016 dataset show that the proposed method achieves state-of-the-art performance with sentence representation features only.


Introduction
This paper describes a segment-level metric for automatic machine translation evaluation (MTE). MTE metrics that correlate highly with human evaluation enable the continuous integration and deployment of a machine translation (MT) system. Various MTE metrics have been proposed in the metrics task of the Workshops on Statistical Machine Translation (WMT) that was started in 2008. However, most MTE metrics are obtained by computing the similarity between an MT hypothesis and a reference translation based on character N-grams or word N-grams. Examples include SentBLEU (Lin and Och, 2004), which is a smoothed version of BLEU (Papineni et al., 2002), as well as Blend (Ma et al., 2017), MEANT 2.0 (Lo, 2017), and chrF++ (Popović, 2017), which achieved excellent results in the WMT-2017 Metrics task (Bojar et al., 2017). Therefore, they can exploit only limited information for segment-level MTE. In other words, MTE metrics based on character N-grams or word N-grams cannot make use of sentence-level information; they only check for surface matches. We propose a segment-level MTE metric that uses universal sentence representations capable of capturing information that cannot be captured by local features based on character or word N-grams. The results of an experiment in segment-level MTE conducted using the datasets for to-English language pairs on WMT-2016 indicated that the proposed regression model using sentence representations achieves the best performance.
The main contributions of the study are summarized below:
• We propose a novel supervised regression model for segment-level MTE based on universal sentence representations.
• We achieved state-of-the-art performance on the WMT-2016 dataset for to-English language pairs without using any complex features and models.

Related Work
DPMFcomb (Yu et al., 2015a) achieved the best performance in the WMT-2016 Metrics task (Bojar et al., 2016). It incorporates 55 default metrics provided by the Asiya MT evaluation toolkit (Giménez and Màrquez, 2010), as well as three other metrics, namely DPMF (Yu et al., 2015b), REDp (Yu et al., 2015a), and ENTFp (Yu et al., 2015a), and uses ranking SVM to train the parameters of each metric score. DPMF evaluates the syntactic similarity between an MT hypothesis and a reference translation. REDp evaluates an MT hypothesis based on the dependency tree of the reference translation, which comprises both lexical and syntactic information. ENTFp evaluates the fluency of an MT hypothesis.
After the success of DPMFcomb, Blend 2 (Ma et al., 2017) achieved the best performance in the WMT-2017 Metrics task (Bojar et al., 2017). Similar to DPMFcomb, Blend is essentially an SVR (RBF kernel) model that uses the scores of various metrics as features. It incorporates 25 lexical metrics provided by the Asiya MT evaluation toolkit, as well as four other metrics, namely BEER (Stanojević and Sima'an, 2015), CharacTER (Wang et al., 2016), DPMF, and ENTFp. BEER is a linear model based on character N-grams and permutation trees. CharacTER evaluates an MT hypothesis based on character-level edit distance.
DPMFcomb is trained on relative ranking (RR) human evaluation data, in which five MT hypotheses of the same source segment are ranked from 1 to 5 via comparison with the reference translation. In contrast, Blend is trained on direct assessment (DA) human evaluation data. DA provides absolute quality scores for hypotheses by measuring to what extent a hypothesis adequately expresses the meaning of the reference translation. The results of the experiments in segment-level MTE conducted using the datasets for to-English language pairs on WMT-2016 showed that Blend achieved better performance than DPMFcomb (Table 2). In this study, as with Blend, we propose a supervised regression model trained using DA human evaluation data.
Instead of using local and lexical features, ReVal (Gupta et al., 2015a,b) uses sentence-level features. It is a metric that uses Tree-LSTM (Tai et al., 2015) for training and for capturing the holistic information of sentences. It is trained using datasets of pseudo similarity scores, which are generated by converting RR data, together with out-of-domain similarity score data from SICK. However, the training dataset used in this metric consists of approximately 21,000 sentences; thus, the learning of Tree-LSTM is unstable and accurate learning is difficult (Table 2). The proposed metric uses sentence representations trained using LSTM as sentence information. Further, we apply universal sentence representations to this task; these representations were trained using large-scale data obtained in other tasks. Therefore, the proposed approach avoids the problem of using a small dataset for training sentence representations.

2 http://github.com/qingsongma/blend

Regression Model for MTE Using Universal Sentence Representations
The proposed metric evaluates MT results with universal sentence representations trained using large-scale data obtained in other tasks. First, we explain two types of sentence representations used in the proposed metric in Section 3.1. Then, we explain the proposed regression model and feature extraction for MTE in Section 3.2.

Universal Sentence Representations
Several approaches have been proposed to learn sentence representations. These sentence representations are learned from large-scale data, so they constitute potentially useful features for MTE. They have proved effective in various NLP tasks such as document classification and measurement of semantic textual similarity, and we call them universal sentence representations.
First, Skip-Thought 5 builds an unsupervised model of universal sentence representations trained using three consecutive sentences, s_{i−1}, s_i, and s_{i+1}. It is an encoder-decoder model that encodes sentence s_i and predicts the previous and next sentences, s_{i−1} and s_{i+1}, from the sentence representation of s_i (Figure 1). As a result of training, this encoder can produce sentence representations. Skip-Thought demonstrates high performance, especially when applied to document classification tasks.

Table 1: Number of segments in the DA human evaluation datasets for to-English language pairs in WMT-2015 and WMT-2016 (Bojar et al., 2016).

          cs-en  de-en  fi-en  ro-en  ru-en  tr-en
WMT-2015  500    500    500    -      500    -
WMT-2016  560    560    560    560    560    560

Second, InferSent 6 (Conneau et al., 2017) constructs a supervised model computing universal sentence representations trained using Stanford Natural Language Inference (SNLI) datasets 7 (Bowman et al., 2015). The Natural Language Inference task is a classification task of sentence pairs with three labels, entailment, contradiction and neutral; thus, InferSent can train sentence representations that are sensitive to differences in meaning. This model encodes sentence pairs u and v and generates features by sentence representations u and v with a bi-directional LSTM architecture with max pooling (Figure 2). InferSent demonstrates high performance across various document classification and semantic textual similarity tasks.
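The max pooling step mentioned above can be illustrated with a minimal sketch. It assumes the bi-directional LSTM has already produced one hidden-state vector per token; the function name and the toy numbers are hypothetical and are not taken from the InferSent code.

```python
def max_pool_over_time(hidden_states):
    """Element-wise maximum over time steps.

    hidden_states: list of per-token vectors (lists of floats, equal length),
    standing in for the concatenated forward/backward LSTM states.
    Returns one fixed-size vector regardless of sentence length.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    # zip(*...) groups the values of each dimension across all tokens.
    return [max(dim_values) for dim_values in zip(*hidden_states)]


# Toy example: 3 tokens with 4-dimensional states (made-up numbers).
states = [
    [0.1, -0.5, 0.3, 0.0],
    [0.4, 0.2, -0.1, 0.0],
    [-0.2, 0.1, 0.9, -0.3],
]
sentence_vector = max_pool_over_time(states)
```

Because the maximum is taken per dimension, sentences of different lengths map to vectors of the same size, which is what allows them to be used as fixed-length features.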

Regression Model for MTE
In this paper, we propose a segment-level MTE metric for to-English language pairs. This problem can be treated as a regression problem that estimates translation quality as a real number from an MT hypothesis t and a reference translation r. Once d-dimensional sentence vectors t and r are generated, the proposed model applies the following three matching methods to extract relations between t and r (Figure 3).

5 https://github.com/ryankiros/skip-thoughts
6 https://github.com/facebookresearch/InferSent
7 https://nlp.stanford.edu/projects/snli/
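The list of matching methods did not survive in this copy of the text. The sketch below therefore assumes the three matching methods used by InferSent (Conneau et al., 2017): concatenation of the two vectors, their element-wise product, and their absolute element-wise difference. The function name `matching_features` is hypothetical.

```python
def matching_features(t, r):
    """Build a feature vector from two d-dimensional sentence vectors.

    Assumed matching methods (following InferSent, not confirmed by the text):
      1. concatenation (t, r)              -> 2d values
      2. element-wise product t * r        -> d values
      3. absolute difference |t - r|       -> d values
    The result has 4d values.
    """
    if len(t) != len(r):
        raise ValueError("t and r must have the same dimensionality")
    concatenation = list(t) + list(r)
    product = [a * b for a, b in zip(t, r)]
    abs_difference = [abs(a - b) for a, b in zip(t, r)]
    return concatenation + product + abs_difference


# Toy 2-dimensional vectors (made-up numbers).
hypothesis_vec = [1.0, -2.0]
reference_vec = [0.5, 2.0]
features = matching_features(hypothesis_vec, reference_vec)
```

Under this assumption, the feature vector passed to the regression model is four times the size of a single sentence representation.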

Experiments of Segment-Level MTE for To-English Language Pairs
We performed experiments using evaluation datasets of the WMT Metrics task to verify the performance of the proposed metric.

Setups
Datasets. We used datasets for to-English language pairs from the WMT-2016 Metrics task (Bojar et al., 2016) as summarized in Table 1. Following Ma et al. (2017), we employed all other to-English DA data as training data (4,800 sentences) for testing on each to-English language pair (560 sentences) in WMT-2016.
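The training set size of 4,800 follows from the counts in Table 1. The following sketch reproduces the arithmetic, assuming that "all other to-English DA data" means every DA set in Table 1 except the WMT-2016 set being tested; the dictionary and function names are illustrative only.

```python
# Segment counts per language pair, copied from Table 1.
DA_SEGMENTS = {
    "WMT-2015": {"cs-en": 500, "de-en": 500, "fi-en": 500, "ru-en": 500},
    "WMT-2016": {
        "cs-en": 560, "de-en": 560, "fi-en": 560,
        "ro-en": 560, "ru-en": 560, "tr-en": 560,
    },
}


def training_size(test_year, test_pair):
    """Sum all DA segments except those of the held-out test set."""
    return sum(
        count
        for year, pairs in DA_SEGMENTS.items()
        for pair, count in pairs.items()
        if (year, pair) != (test_year, test_pair)
    )


# 4 x 500 (WMT-2015) + 5 x 560 (remaining WMT-2016 pairs) = 4,800.
sizes = {pair: training_size("WMT-2016", pair) for pair in DA_SEGMENTS["WMT-2016"]}
```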
Features. Publicly available pre-trained sentence representations such as Skip-Thought 5 and InferSent 6 were used as the features mentioned in Section 3. Skip-Thought is a collection of 4,800-dimensional sentence representations trained on 74 million sentences of the BookCorpus dataset. InferSent is a collection of 4,096-dimensional sentence representations trained on both 560,000 sentences of the SNLI dataset (Bowman et al., 2015) and 433,000 sentences of the MultiNLI dataset (Williams et al., 2017).
Model. Our regression model used SVR with the RBF kernel from scikit-learn.
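A minimal sketch of this setup with scikit-learn is given below. The random matrices are stand-ins for the sentence-representation features and DA scores, the sizes are arbitrary, and the SVR hyperparameters are left at the library defaults because the values used in the experiments are not stated here.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder data: in the real setting each row would hold the features
# extracted from a (hypothesis, reference) pair and y the DA human score.
n_train, n_test, n_features = 40, 5, 16
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))

# SVR with an RBF kernel; C, epsilon and gamma are scikit-learn defaults
# here (an assumption, not the paper's settings).
model = SVR(kernel="rbf")
model.fit(X_train, y_train)

# One real-valued quality estimate per test segment.
predicted_scores = model.predict(X_test)
```

In practice the hyperparameters would be tuned on held-out data, and the predicted scores would be compared with the human DA scores by correlation.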

Result
As can be seen in Table 2, the proposed metric, which combines InferSent and Skip-Thought representations, surpasses the best performance in three out of six to-English language pairs and achieves state-of-the-art performance on average.

Discussion
These results indicate that it is possible to adopt universal sentence representations in MTE by training a regression model using DA human evaluation data. Since Blend is an ensemble method that uses combinations of various MTE metrics as features, our results show that universal sentence representations can capture richer information than a complex ensemble model. Since ReVal is also based on sentence representations, we conclude that universal sentence representations trained on a large-scale dataset are more effective for MTE tasks than sentence representations trained on a small or limited in-domain dataset.

Error Analysis
We re-implemented Blend (Ma et al., 2017) and compared the evaluation results with the proposed metric. We analyzed the top 20% of the pairs of MT hypotheses and reference translations (112 sentence pairs × 6 languages = 672 sentence pairs), taken in descending order of DA human score in each language pair. In other words, we analyzed the 20% of MT hypotheses whose meaning was closest to that of the reference translations for each language pair. Among these, only Blend estimates the translation quality as high for 70 sentence pairs, and only our metric estimates the translation quality as high for 88 sentence pairs.
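The selection of the analyzed subset can be sketched as follows. The segment identifiers and scores are synthetic placeholders, and the helper name `top_percent` is hypothetical; only the sizes (560 segments per language pair, top 20%) come from the text.

```python
def top_percent(scored_segments, percent=20):
    """Return the highest-scoring `percent` of (segment_id, da_score) pairs."""
    ranked = sorted(scored_segments, key=lambda item: item[1], reverse=True)
    # Integer arithmetic avoids floating-point rounding in the cut-off.
    cutoff = len(ranked) * percent // 100
    return ranked[:cutoff]


# Synthetic stand-in for one language pair: 560 segments with made-up scores.
segments = [(f"seg{i}", i / 560) for i in range(560)]
analyzed = top_percent(segments)

# 112 segments per language pair, 6 language pairs in total.
total_analyzed = len(analyzed) * 6
```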
Surface. Among pairs estimated to have high translation quality by each method, there were 26 pairs in Blend and 42 pairs in the proposed method with a low word surface matching rate between MT hypotheses and reference translations. This result shows that the proposed metric can evaluate a wide range of sentence information that cannot be captured by Blend.
Unknown words. There were 26 MT hypotheses containing words treated as unknown words by Skip-Thought or InferSent that were correctly evaluated by Blend. On the other hand, there were 26 such MT hypotheses that were correctly evaluated by the proposed metric. This result shows that the proposed metric is affected by unknown words. However, it is also true that some MT hypotheses containing unknown words can be correctly evaluated.
Therefore, we analyzed further by focusing on sentence length. There were 17 short MT hypotheses (15 words or fewer) containing words treated as unknown words by either Skip-Thought or InferSent that were correctly evaluated by Blend. However, only two such MT hypotheses were correctly evaluated by the proposed metric. This result indicates that the shorter the sentence, the more likely the proposed metric is to be affected by unknown words.

Conclusions
In this study, we applied universal sentence representations to MTE based on DA human evaluation data. Our segment-level MTE metric achieved the best performance on the WMT-2016 dataset. We conclude that:
• Universal sentence representations can consider information more comprehensively than an ensemble metric using combinations of various MTE metrics based on features of character or word N-grams.
• Universal sentence representations trained on a large-scale dataset are more effective than sentence representations trained on a small or limited in-domain dataset.
• Although a metric based on SVR with universal sentence representations is not good at handling unknown words, it correctly estimates the translation quality of MT hypotheses with a low word matching rate with reference translations.
Following the success of InferSent (Conneau et al., 2017), many works (Wieting and Gimpel, 2017; Cer et al., 2018; Subramanian et al., 2018) on universal sentence representations have been published. Based on the results of our work, we expect that the MTE metric will be further improved using these better universal sentence representations.