Integrating Meaning into Quality Evaluation of Machine Translation

Machine translation (MT) quality is evaluated through comparisons between MT outputs and the human translations (HT). Traditionally, this evaluation relies on form related features (e.g. lexicon and syntax) and ignores the transfer of meaning reflected in HT outputs. Instead, we evaluate the quality of MT outputs through meaning related features (e.g. polarity, subjectivity) with two experiments. In the first experiment, the meaning related features are compared to human rankings individually. In the second experiment, combinations of meaning related features and other quality metrics are utilized to predict the same human rankings. The results of our experiments confirm the benefit of these features in predicting human evaluation of translation quality in addition to traditional metrics which focus mainly on form.


Introduction
Machine translation (MT) systems translate large chunks of data automatically across languages. Although these systems may achieve high accuracy using form related features (e.g. lexical and syntactic), they often fail to carry over the meaning that accompanies the form. Example (1) highlights the meaning difference between a human translation (HT) and an MT output for the same source sentence:
Example (1)
HT: "Your feet's too big."
MT: "Your feet is too great."
Although the form is often preserved, MT outputs may sound "strange" or "different" in comparison to HT ones due to the loss of meaning. Human translators, by contrast, generally enrich the text with the appropriate tone, style and sentiments during translation. Current quality evaluation metrics like BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) are based on form related features and do not directly consider the transfer of meaning (e.g. sentiment and style) in MT. Some of these metrics check for synonyms and paraphrases, but this approach is still limited to the coverage of the corresponding pair tables. In other words, these metrics do not explicitly evaluate the transfer of meaning in MT. Our main goals are:
• to find out whether the transfer of meaning related features (e.g. sentiment and style) in MT influences the human judgment of translation quality.
• to compare meaning and form related features for quality evaluation of MT.
• to measure whether meaning and form related features can be combined to improve the performance of existing MT quality evaluation metrics.
By using publicly available parallel corpora from the Tenth Workshop on Statistical Machine Translation (WMT15), we pursue these goals with two experiments described in Section 5. Our results indicate that combining meaning related features with form related ones approximates the human-judged rankings better than the BLEU metric. These combined features also improve the performance of other MT quality evaluation metrics by 0.5-2 percentage points.

Related Work
So far, MT studies have focused mostly on features related to form (e.g., lexical and syntactic features) for the automatic evaluation of MT quality (e.g., BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007)). The BLEU metric is based on n-gram matching between the HT and MT texts and is widely used in the MT community to evaluate MT quality. In contrast to BLEU, METEOR employs both word matching scores and linguistic information (e.g., synonyms and stemming). Other studies have evaluated MT quality with various features: POS tags (Dahlmeier et al., 2011), morphemes (Tantug et al., 2008), sentence structure (Li et al., 2012), named entities (Buck, 2012), semantic textual similarity (Castillo and Estrella, 2012), paraphrasing (Snover et al., 2006), semantic roles (Lo and Wu, 2011) and language models (Stanojevic and Simaan, 2014). Recently, Yu et al. (2015) proposed another metric, DPMFComb, which is a combination of a syntax-based metric and some other evaluation metrics in Asiya. At WMT15, DPMFComb obtained the best results in the metrics task for system-level evaluation of translation into English.
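To make the n-gram matching behind BLEU concrete, the following Python sketch computes a clipped n-gram precision between one MT output and one HT reference. It is a simplified illustration and not the full BLEU metric: it omits the brevity penalty and the combination over several n-gram orders. The lowercased whitespace tokenization and the function names are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(mt_text, ht_text, n=1):
    """Share of MT n-grams that also occur in the HT reference.

    Each MT n-gram is credited at most as often as it occurs in the
    reference (clipping). Tokenization is a naive lowercase whitespace
    split, which is an assumption of this sketch.
    """
    mt_counts = Counter(ngrams(mt_text.lower().split(), n))
    ht_counts = Counter(ngrams(ht_text.lower().split(), n))
    total = sum(mt_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ht_counts[gram]) for gram, count in mt_counts.items())
    return matched / total


# Sentence pair taken from Example (1) in the Introduction.
ht = "Your feet's too big."
mt = "Your feet is too great."
unigram_precision = clipped_precision(mt, ht, n=1)
```

Under this tokenization, only "your" and "too" of the five MT tokens match the reference, so the unigram precision is 2/5. The score reflects surface overlap only and says nothing about whether the meaning was preserved.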
Although previous methods require human reference translations, recent methods (e.g. quality estimation metrics) aim to eliminate the need for human translation. These methods apply Machine Learning (ML) techniques using lexical (e.g. average source/target token length), syntactic (e.g. ratio of percentage of POS tags in the source/target sentences), and statistical features (e.g. source/target sentence LM probability, word alignment probabilities, etc.) (Stymne et al., 2014; Langlois, 2015; Shah et al., 2015). Interested readers may also benefit from the survey on MT evaluation metrics by Han and Wong (2016).
Chen and Zhu (2014) explore sentiment consistency between MT and HT texts to improve MT quality by incorporating sentiment related features (e.g. subjectivity, polarity, intensity and negation). By using these features in their MT system, they improved the BLEU score by 1.1 points on the NIST Chinese-to-English translation dataset. Mohammad et al. (2015) also investigate the sentiment consistency between MT and HT texts, with a different motivation. They improve sentiment analysis performance for Arabic by translating available resources (e.g., sentiment lexicon, sentiment annotated data) from English to Arabic. Although sentiment analysis of English translations of Arabic texts obtains results competitive with current state-of-the-art Arabic sentiment analysis systems, they did not evaluate the MT output quality.
There are also studies using MT systems to enrich labeled data for sentiment analysis by translating between languages and leveraging sentiment scores (Wan, 2009; Demirtaş and Pechenizkiy, 2013; Hiroshi et al., 2004). However, none of these studies employ meaning related features to evaluate MT quality.

2015 Workshop on Statistical Machine Translation
We utilized the WMT15 parallel corpora, which include several tasks (e.g., a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task). Twenty-four institutions participated in the translation task with a total of 68 machine translation systems. The WMT15 data includes:

• Machine translations (MT)
• Human judgments (e.g. from 1 (best) to 5 (worst)) for each MT text.
Table 2: Examples of MT errors in comparison to HT.
(2) HT: "Of course I don't hate you." MT: "Of course I hate you."
(3) HT: "This is business news" MT: "This is supposed to be of business news"
(4) HT: "The views of Chinese towards white people is similar!" MT: "The Chinese think like white people!"
The data is available for five language pairs: Czech (ces)-English, French (fre)-English, German (deu)-English, Finnish (fin)-English, and Russian (rus)-English. The domain of the test data is the same for all languages except French: the test data for the French-English language pair was drawn from a news discussion forum instead of news texts. Table 1 shows the statistics for the test data, namely the domain of the source text, the number of sentences and the number of human judgments. The target language is English for all source languages.
In order to evaluate the quality of each MT system, the WMT15 organizers conducted a human evaluation using Appraise (Federmann, 2012), an open source toolkit (similar to Amazon Mechanical Turk). Each segment consists of a source sentence in the original language (e.g. Czech), its corresponding human translation (English), and 5 anonymized MT system translations (English).
To make the task more consistent and to increase the number of data points, the organizers treated nearly identical system translations as one. As a result, even though exactly 5 translations are presented to each judge in a segment, more than 5 MT systems may be ranked. Judges rate the translations in a segment from 1 (best) to 5 (worst) according to their quality (ties are allowed).
In total, there were 29,007 segments, each of which would have produced at least 10 individual system comparisons (e.g., A>B, B>C, A=C, C>B, etc.). To map these individual comparisons to system scores, the organizers fed the bilateral comparisons to TrueSkill (Herbrich et al., 2006), a Bayesian skill ranking algorithm (similar to Elo used in chess (Elo, 1978)). A score is produced for each participating system. In this study, we utilized the HT texts, MT system translations and human judgments in our experiments.
Table 2 provides examples of MT errors in comparison to HT. All example translations (MT vs. HT texts) are selected from the WMT15 dataset based on the lowest (5) rankings by human judges. Although the translations overlap at the word level, they convey quite different meanings. In example (1), the word 'badly' has disappeared in the MT output, leading to a loss of information. In example (2), a negated sentence is translated as an affirmative sentence by the MT system. Example (3) illustrates how the MT system generates a more speculative sentence than the HT sentence. The pair in example (4) differs in terms of formality between the MT and HT outputs. MT evaluation metrics may assign high scores to these pairs since they mainly focus on lexical and syntactic matching. However, as our examples demonstrate, meaning can easily be lost if we rely only on form related MT system evaluation metrics.
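The expansion of one ranked segment into pairwise comparisons can be sketched as follows. This Python example assumes that a segment is given as a mapping from hypothetical system names to ranks, with 1 as best and ties allowed, so that five ranked systems yield 10 comparisons. It only illustrates how the input to a skill ranking algorithm could be prepared and does not implement TrueSkill itself.

```python
from itertools import combinations


def pairwise_comparisons(segment_ranks):
    """Expand one ranked segment into pairwise system comparisons.

    segment_ranks maps a system name to its rank (1 = best, 5 = worst).
    Returns tuples (system_a, relation, system_b), where relation is
    '>' if system_a is ranked better, '<' if worse, and '=' for a tie.
    """
    comparisons = []
    for a, b in combinations(sorted(segment_ranks), 2):
        if segment_ranks[a] < segment_ranks[b]:
            relation = ">"  # a lower rank number means a better translation
        elif segment_ranks[a] > segment_ranks[b]:
            relation = "<"
        else:
            relation = "="
        comparisons.append((a, relation, b))
    return comparisons


# Hypothetical judgment for one segment with five anonymized systems.
segment = {"A": 1, "B": 3, "C": 3, "D": 2, "E": 5}
comparisons = pairwise_comparisons(segment)
```

Each tuple corresponds to one bilateral comparison of the kind listed above (A>B, B=C, and so on).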

Features
To investigate the consistency between MT and HT texts for sentiment and stylistic features, we make use of sentiment polarity, subjectivity, connotation, negation, speculation, readability and formality to measure how these features influence the quality of translation with respect to human rankings.
Sentiment Polarity indicates whether the designated sentence has a positive or negative sentiment. To measure the impact of this feature, we use Vader, a rule based sentiment analysis tool (Hutto and Gilbert, 2014) that relies on grammatical and syntactical rules. In the experiments performed by Hutto and Gilbert (2014), Vader outperforms several competing sentiment analysis approaches.
Additionally, we trained a machine learning (ML) based sentiment analyzer using a deep learning approach described by Yildiz et al. (2016). Their architecture is a Convolutional Neural Network (CNN) which takes pre-trained word vectors as input and applies interleaved convolution and pooling operations. The top layer in the network is a softmax layer which computes the probability of assigning a class (positive, negative). We adopted this architecture and trained a network using the Stanford Twitter Sentiment Corpus. The training set contains 1.6 million tweets from various domains, automatically labeled as positive or negative, while the test set is labeled manually. This ML based sentiment analyzer achieves 90.1% accuracy and outperforms the SVM classifier reported by Go et al. (2009).
Subjectivity indicates whether a text expresses an opinion. In order to compute the subjectivity scores, we trained our architecture using the sentiment polarity and subjectivity dataset (Pang and Lee, 2004), which includes 5000 subjective and 5000 objective sentences. We applied 10-fold cross validation to the data and obtained 91.50% average accuracy.
Connotation indicates cultural or emotional association carried by words that appear in sentences (Feng et al., 2013). In contrast to the sentiment polarity, connotation polarity indicates subtle shades of sentiment beyond denotative or surface meaning of text. The words which do not express sentiment can carry a positive or negative connotation.
For instance, "life" and "home" are considered neutral with regard to sentiment analysis. However, they convey a positive connotation (Carpuat, 2015). We use the connotation polarity of each word in a sentence to compute a connotation score using a normalized version of the formulation proposed by Carpuat (2015). The connotation polarities of the words are obtained by looking up a lexicon constructed by Feng et al. (2013). We used the following formula to compute the connotation score:
$$CS = \frac{\#positive - \#negative}{\#positive + \#negative} \tag{1}$$
where CS is the connotation score, #positive indicates the number of words with positive connotation and #negative indicates the number of words with negative connotation. This formula assigns a continuous value between 1 and −1 to the sentence as a connotation score. Values close to 1 indicate that a given sentence carries a positive connotation, while values close to −1 indicate a negative connotation.
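A connotation score of this kind can be sketched in a few lines of Python. The sketch assumes that the score is the difference between the positive and negative word counts divided by their sum, which is consistent with the stated range of −1 to 1 but is an assumption about the exact normalization. The toy lexicon (in particular the negative entry), the function name and the choice to return 0 when no word carries a connotation are also hypothetical.

```python
def connotation_score(tokens, lexicon):
    """Connotation score in [-1, 1] from counts of connotation-bearing words.

    lexicon maps a word to 'positive' or 'negative'; words that are not
    in the lexicon are ignored.
    """
    positive = sum(1 for token in tokens if lexicon.get(token) == "positive")
    negative = sum(1 for token in tokens if lexicon.get(token) == "negative")
    total = positive + negative
    if total == 0:
        return 0.0  # assumption: treat sentences without connotation words as neutral
    return (positive - negative) / total


# Toy lexicon for illustration only; a real run would load the lexicon
# of Feng et al. (2013).
toy_lexicon = {"life": "positive", "home": "positive", "debt": "negative"}
score = connotation_score("life at home without debt".split(), toy_lexicon)
```

With two positive words and one negative word, the example sentence receives a mildly positive score of 1/3.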
Negation turns an affirmative statement into a negative one. We also measure the effect of the negation feature in our experiments. Konstantinova et al. (2012) present a freely available dataset which contains 400 reviews (50 each from 8 domains such as movies and consumer products) annotated by linguists for negation and speculation. We train our deep learning model on this dataset and obtain 96.65% accuracy for negation.
Speculation is used to express levels of certainty. Using the same dataset and method as for negation, we obtain 95.55% accuracy.
Readability measures the ease of reading and comprehending a text (Dale and Chall, 1948). For readability measurement we use the Flesch reading-ease test, in which higher scores indicate that the text is easier to read. The Flesch readability score (Kincaid et al., 1975) is calculated using the sentence length and the number of syllables per word as presented in the formula below.
$$\text{Flesch score} = 206.835 - 1.015 \cdot \frac{A}{B} - 84.6 \cdot \frac{C}{A} \tag{2}$$
where A is the number of words, B is the number of sentences and C is the number of syllables in a given text. In addition to the rule based readability measurement, we use an ML based readability metric, "simplicity", as described by Vajjala and Meurers (2016), who train a pair-wise classifier. The training data is a sentence-aligned corpus constructed from news articles and Wikipedia pages and their simplified versions. The method correctly classifies the simplified and complex sentences in terms of their reading level with an accuracy of over 80%.
Figure 1: Means of absolute differences between the feature scores of MT and HT outputs. The x-axis denotes the language and human rank pairs. Error bars indicate 95% confidence intervals.
Formality is, according to Heylighen and Dewaele (1999), the most important dimension of variation between styles. They define the formality score as a function of POS tag frequencies. The formality score is given in Equation 3:
$$F = \frac{NF + AdjF + PF + ArtF - PrpF - VF - AdvF - IF + 100}{2} \tag{3}$$
where NF is the frequency of nouns, AdjF is the adjective frequency, PF is the preposition frequency, ArtF is the article frequency, PrpF is the pronoun frequency, VF is the verb frequency, AdvF is the adverb frequency and IF is the interjection frequency. Additionally, we use an ML based formality score obtained by training the mentioned architecture on the dataset introduced by Pavlick and Tetreault (2016). We observed 80.71% accuracy through 10-fold cross validation.
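The two formula-based scores discussed above can be sketched in Python as follows. This is a minimal illustration that assumes the standard Flesch reading-ease constants and the F-score of Heylighen and Dewaele (1999), with POS frequencies given as percentages of all words. The function and parameter names are hypothetical, and the counts are taken as inputs because syllable counting and POS tagging are outside the scope of this sketch.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading-ease score from raw counts (higher = easier to read)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def formality_score(noun, adjective, preposition, article,
                    pronoun, verb, adverb, interjection):
    """Formality F-score from POS frequencies given in percent.

    Noun-like categories raise the score; pronouns, verbs, adverbs and
    interjections lower it.
    """
    formal = noun + adjective + preposition + article
    informal = pronoun + verb + adverb + interjection
    return (formal - informal + 100) / 2


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
readability = flesch_reading_ease(words=100, sentences=5, syllables=150)

# Hypothetical POS frequencies in percent.
formality = formality_score(noun=20, adjective=10, preposition=10, article=10,
                            pronoun=10, verb=15, adverb=5, interjection=0)
```

Both functions return values on their native scales, in line with the decision not to normalize these two features.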
All metrics were normalized to the range (0, 1) except Readability and Formality. Since these two metrics are formula-based, we kept their original scales.

Method
In the WMT15 task, the language pairs are divided into two groups depending on whether English is the source or the target language. We only utilize the pairs where English is the target language because of the richness of resources for English. For each feature, MT texts are ranked using the following approach:
1. Compute the score for the HT text (e.g., 0.65).
2. Compute the score for each MT text.
3. Compute the absolute difference between the MT scores and the HT score (e.g., Â = .14, B̂ = .40, Ĉ = .45, D̂ = .30, Ê = .35).
4. Rank the systems according to these differences, where a smaller value corresponds to a better ranking (e.g. 1=A, 2=D, 3=E, 4=B, 5=C).
Figure 1 shows absolute differences for four features with respect to language and human rankings of MT system output. For instance, the Subjectivity-RB feature captures the differences between ranks when the source language is French but cannot achieve the same performance for Finnish and Russian translations. Moreover, Readability-RuleBased (RB) and Formality-MachineLearning (ML) seem to perform well for all languages, whereas Polarity-ML falls short for French. Figure 2 illustrates the trend for all features, in which the absolute score difference between MT system outputs and the HT text is low for high rankings (e.g., 1) and high for low ones (e.g., 5). Therefore, high ranked translations preserve the meaning better than the low ranked ones. Note that both figures are descriptive and do not correspond to an objective evaluation directly.
Figure 3: Steps to map human rankings and feature rankings (for example, Polarity-ML) to system-wide scores. A to L denote individual rankings and SegID i denotes the i-th segment in the WMT15 test set.
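The ranking procedure described in this section can be sketched in Python as follows. The HT score of 0.65 comes from the example in the text; the individual MT scores are hypothetical values chosen so that their absolute differences match the example (.14, .40, .45, .30, .35). Breaking ties alphabetically is also an assumption of this sketch.

```python
def rank_by_distance_to_ht(ht_score, mt_scores):
    """Rank MT systems by how close their feature score is to the HT score.

    mt_scores maps a system name to its feature score. Returns the system
    names ordered from best (smallest difference) to worst, together with
    the absolute differences.
    """
    differences = {name: abs(score - ht_score) for name, score in mt_scores.items()}
    # Ties are broken alphabetically here; this is an assumption.
    ordered = sorted(differences, key=lambda name: (differences[name], name))
    return ordered, differences


ht_score = 0.65
# Hypothetical MT feature scores for systems A to E.
mt_scores = {"A": 0.79, "B": 0.25, "C": 0.20, "D": 0.95, "E": 0.30}
ranking, differences = rank_by_distance_to_ht(ht_score, mt_scores)
```

For these values the resulting order is A, D, E, B, C, which matches the ranking given in the example.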

Experiment #1: Impact of Individual Features on Translation Quality
This experiment investigates the correlation between each feature and MT quality as evaluated by the rankings of human judges. Using the rankings described in Section 4, we followed the "System-Based Evaluation Methodology" by Stanojević et al. (2015). After obtaining the rankings for each feature as described in the previous section, we used TrueSkill to map segment rankings to system-wide scores (see Figure 3). Next, we compared the TrueSkill scores obtained per feature with the human judgments using Pearson's r correlation, computed with the scripts provided by the WMT15 Metrics Task. As stated in Stanojević et al. (2015), the script performs bootstrap resampling with 1000 samples while calculating the correlation scores and the 95% confidence intervals.
Random-Baseline | 0.0 ± 2.9 | -28.4 ± 1.5 | 47.6 ± 3.2 | -65.9 ± 2.8 | -3.6 ± 3.6 | 50.4 ± 3.4
Table 3: Pearson's r correlation between the TrueSkill scores of a metric and human judgments, with the corresponding 95% confidence intervals. Each row represents either a meaning related feature (top) or a selected metric from WMT15 (bottom). ML stands for machine learning and RB stands for rule based method.
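The correlation step can be illustrated with a small, self-contained Python sketch. It computes Pearson's r between two lists of system-level scores and estimates a 95% confidence interval by bootstrap resampling with 1000 samples. This is not the WMT15 script: resampling the paired system scores, the percentile interval, the fixed seed and all names and values below are assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys, eps=1e-12):
    """Pearson's r for two equal-length sequences; None if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x < eps or var_y < eps:
        return None
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_samples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_samples):
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples with no variance
            estimates.append(r)
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical system-level scores for five systems.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_scores = [1.2, 1.9, 3.2, 3.8, 5.1]
r = pearson_r(human_scores, feature_scores)
ci_lower, ci_upper = bootstrap_ci(human_scores, feature_scores)
```

With only a handful of systems per language pair, some resamples contain a single repeated system and have no variance; the sketch skips them rather than dividing by zero.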
For comparison, we used three metrics from the WMT15 Metrics Task: BLEU, METEOR and DPMFComb. DPMFComb was selected since it achieved the best overall score in the system-based evaluation of the WMT15 Metrics Shared Task and was the best performing evaluation metric for three out of five languages.
Meaning related features outperform the random baseline, as expected. The random baseline is computed by assigning random ranks (1-5) to each translation in each segment. We assigned uniformly random ranks to all sentences without considering the language. Although its performance varies per language, its overall performance is 0.0 (± 2.9).
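The random baseline can be sketched in Python as follows. The segment structure (a list of system names per segment), the fixed seed and the function name are assumptions made for illustration; only the uniform choice of a rank between 1 and 5 for every translation follows the description above.

```python
import random


def random_baseline_ranks(segments, seed=0):
    """Assign a uniformly random rank from 1 to 5 to every translation.

    segments is a list of segments, each a list of system names whose
    translations are ranked in that segment.
    """
    rng = random.Random(seed)
    return [{system: rng.randint(1, 5) for system in systems}
            for systems in segments]


# Two hypothetical segments with five anonymized systems each.
segments = [["A", "B", "C", "D", "E"], ["B", "C", "F", "G", "H"]]
baseline = random_baseline_ranks(segments)
```

The resulting per-segment ranks can then be processed in the same way as the feature-based rankings.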

Experiment #2: Impact of Combined Features on Translation Quality
As discussed in Section 4, our approach is fundamentally different from MT evaluation metrics such as BLEU. The results of our first experiment indicated strong correlations between the quality scores of the features and human rankings. Therefore, we also investigate whether we can predict human rankings of machine translated text by combining these features, since they capture different aspects of translation.
In contrast with Experiment 1, this experiment focuses on training systems that combine several features to predict human rankings. As input, the BLEU, METEOR and DPMFComb metrics are utilized in combination with the feature scores. We experimented with several classifiers from RankLib to train the ensemble systems and opted for a Random Forest (Liaw and Wiener, 2002) based approach, which produced the best 5-fold cross validation score.
First, we obtained scores and rankings for each translation using the Random Forest classifiers for the following combinations:
1. All meaning related features
2. All meaning related features + BLEU

Results
Combined meaning related features outperform the BLEU score (Table 4). Even though the margin is relatively small, it is a promising indication. Moreover, combining them with a metric increases the performance of that metric: 1.9pp for BLEU, 0.9pp for METEOR and 0.6pp for DPMFComb. In other words, these features capture some meaning or style related information which is not captured by the conventional MT evaluation metrics.

Discussion & Conclusion
In this paper, we investigate how meaning related features influence the automatic evaluation of MT systems. Our experiments show the additional benefit of these features in predicting human evaluation of translation quality. More specifically, we find that:
• MT systems that are ranked higher by human judges preserve the meaning (features such as polarity, formality and readability) better than the low ranked ones.
• Rankings of MT output generated according to meaning based features correlate highly with human rankings of translation quality (see Figure 2).
• When meaning related features are combined with form related lexical features, human evaluation of MT system quality can be predicted with higher accuracy (see Table 4).
Extracting meaning related features from text and using form related features for MT evaluation have been studied separately. However, integrating meaning related features into MT quality evaluation can capture the preservation of meaning from source to target languages. Our experiments show that this integrated approach achieves only slightly better performance than the form based metrics (e.g. BLEU). Moreover, our experiments indicate that the meaning related features can boost the performance of the BLEU, METEOR and DPMFComb metrics even without specific optimization. Therefore, integrating meaning related features into MT systems with ranking components can also improve the performance of other metrics, instead of relying only on form based features.
Commonly used evaluation metrics (e.g. BLEU and METEOR) require a reference human translation to assess the quality of MT. We also use human translation as a reference, since most meaning related feature extraction tools are only available for English and limited for other languages. Although there are studies assessing the quality of MT systems without human translation, meaning related features are not yet integrated into MT systems. As new tools for other languages become available, we plan to extend our work to implement MT quality estimation for these languages as well. As future work, we will investigate ways to develop more "human-like" MT systems by employing these meaning related and stylistic features in the training of MT systems or in postprocessing steps such as parameter tuning.