Improving Machine Translation Quality Estimation with Neural Network Features

Machine translation quality estimation is a challenging task in the WMT evaluation campaign. Feature extraction plays an important role in automatic quality estimation. In this paper, we propose neural network features, including embedding features and cross-entropy features of source sentences and machine translations, to improve machine translation quality estimation. The sentence embedding features are extracted through global average pooling over word embeddings trained with the word2vec toolkit, while the sentence cross-entropy features are calculated with a recurrent neural network language model. Experimental results on the development set of the WMT17 machine translation quality estimation tasks show that the neural network features yield significant improvements over the baseline. Furthermore, combining the neural network features with the baseline features improves the system performance further.


Introduction
Quality estimation (QE) of machine translation estimates the quality of machine translation system outputs without human references, using machine learning methods. It is usually divided into two steps: first, various features are extracted from the source sentences, the translation outputs, and external language resources to describe translation complexity, fluency, and adequacy; second, the quality of the translation outputs is predicted with a pre-trained machine learning model. Feature extraction is crucial to the performance of QE. Traditional methods, such as QuEst (Specia et al., 2013), extract linguistically motivated features to improve the correlation between automatic QE and human assessment. However, extracting linguistically motivated features requires part-of-speech, syntactic, or semantic analysis, and these analyses depend on the target language, which limits their application to other languages. To address this problem, Shah et al. (2015a) investigated continuous space language models for sentence-level QE, and Scarton et al. (2016) proposed word embedding features for document-level QE.
Inspired by their work, we propose sentence embedding features and cross-entropy features to improve the correlation between automatic QE and human assessment and to investigate how different sentence embedding dimensions of source sentences and translation outputs, as well as the size of the training corpus, affect the system performance of QE.

Related work
With the great success of deep learning in digital image processing and automatic speech recognition, deep learning has also made tremendous breakthroughs in natural language processing, e.g., neural network language models (Bengio et al. 2003) and neural machine translation encoder-decoder frameworks (Bahdanau et al. 2014). Therefore, many researchers have proposed deep learning approaches for the QE task. In the word-level QE task, Kreutzer et al. (2015) presented deep feedforward neural networks to estimate word confidence, Shah et al. (2015b) exploited word embeddings as features to estimate whether a word's translation in the machine translation output is "good" or "bad", and Patel et al. (2016) applied a recurrent neural network language model to the word-level QE task.
In the sentence-level QE task, Shah et al. (2015a) extracted continuous space language model (Schwenk et al. 2007) probabilities of source sentences and machine translation outputs as features and combined them with baseline features to improve the system performance of QE. In the WMT16 QE task, Shah et al. (2016) further proposed forward sentence cross-entropy, sentence embedding, and neural machine translation log-likelihood features based on their previous work. They extracted the word embedding and cross-entropy features with the continuous space language model.
In contrast to the work of Shah et al., we utilize a continuous bag-of-words model to extract the word embeddings, construct sentence embedding through global average pooling from word embeddings, and utilize a recurrent neural network language model to extract sentence cross-entropy features.

Neural Network Features
To overcome the problem that traditional feature extraction relies heavily on linguistic analysis of the sentences, in this paper we exploit recent deep learning methods to extract translation quality features from the source sentences and their machine translations. The extracted features include sentence embedding features and sentence cross-entropy features.

The Embedding Features
Word representation learning has attracted the attention of many researchers in recent years, especially after Mikolov et al. (2013a) released the open-source word embedding learning tool word2vec. Word2vec implements two models, CBOW (Continuous Bag-of-Words) and Skip-Gram, inspired by the neural network language model proposed by Bengio et al. (2003). The CBOW and Skip-Gram models remove the time-consuming hidden layer of the neural network language model and add the Negative Sampling and Hierarchical Softmax optimization methods (Mikolov et al. 2013b), which improves the accuracy of the model and accelerates its training. The two models are very similar: the CBOW model predicts the conditional probability of the current word given its context words, while the Skip-Gram model predicts the conditional probability of the context words given the current word. Because the CBOW model trains faster than the Skip-Gram model, we use the CBOW model to train the word embeddings of the source language and the target language.
The window size is set to 10, and the negative sampling optimization method is used with the number of negative samples set to 10. To accelerate training, the subsampling threshold for high-frequency words is set to 1e-5, and the number of iterations is set to 15. We try various word embedding dimensions, from 256 to 4096, to achieve the best performance.
After obtaining the embedding of each word in the source sentence and in the machine translation output, the sentence embedding is computed by averaging the word embeddings; this is done for both the source sentence and the machine translation output. The source sentence embedding (Vs) and the machine translation output embedding (Vt) are then concatenated (V = [Vs; Vt]) to form the features for the QE task.
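The pooling and concatenation step can be sketched as follows; the word vectors below are hypothetical 4-dimensional stand-ins for the CBOW embeddings, and the two token lists stand in for a source sentence and its machine translation.

```python
import numpy as np

# Hypothetical pre-trained word embeddings (dimension 4 for illustration);
# in the paper these come from the CBOW model.
emb = {
    "das": np.array([0.1, 0.2, 0.3, 0.4]),
    "ist": np.array([0.5, 0.1, 0.0, 0.2]),
    "gut": np.array([0.3, 0.3, 0.3, 0.3]),
    "this": np.array([0.2, 0.1, 0.4, 0.1]),
    "is": np.array([0.0, 0.2, 0.2, 0.4]),
    "good": np.array([0.6, 0.0, 0.1, 0.2]),
}

def sentence_embedding(tokens, emb):
    """Global average pooling over the word embeddings of a sentence."""
    return np.mean([emb[t] for t in tokens], axis=0)

v_s = sentence_embedding(["das", "ist", "gut"], emb)   # source sentence Vs
v_t = sentence_embedding(["this", "is", "good"], emb)  # MT output Vt
feature = np.concatenate([v_s, v_t])                   # V = [Vs; Vt]
```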

The Cross-Entropy Features
A language model, which occupies a significant position in natural language processing, models the probability distribution over word sequences. In section 3.1, the bag-of-words model is used to obtain the word embedding features; however, the disadvantage of the bag-of-words model is that it ignores the contextual relationships between words.
The recurrent neural network possesses sequentiality and memorability and performs well in modeling sequential data. The Recurrent Neural Network Language Model (RNNLM) (Mikolov et al. 2010) was therefore proposed, first being used in automatic speech recognition and in rescoring machine translation outputs, and the experimental results indicate that the RNNLM is superior to back-off language models. Since the RNNLM accounts for word order, we extract the cross-entropies of the source sentences and their machine translations as features for the QE task.
The RNNLM is trained with the RNNLM toolkit (http://www.fit.vutbr.cz/~imikolov/rnnlm/). The hidden layer size is set to 100, the "bptt" parameter is set to 4, and the number of output layer classes is set to 200. The WMT17 QE development set is used to optimize the parameters of the RNNLM, and the training data are described in section 4.1. The entropy of the WMT17 QE development set under the final trained RNNLM is shown in Table 1.
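Given the per-token probabilities produced by a language model, the sentence cross-entropy feature is the average negative log-probability per token. A minimal sketch, with hypothetical token probabilities standing in for the RNNLM outputs:

```python
import math

# Hypothetical per-token probabilities p(w_i | w_1..w_{i-1});
# in the paper these come from the trained RNNLM.
token_probs = [0.25, 0.10, 0.50, 0.05]

def cross_entropy(probs):
    """Sentence cross-entropy: average negative log2-probability per token."""
    return -sum(math.log2(p) for p in probs) / len(probs)

h = cross_entropy(token_probs)  # feature value for this sentence
```

Lower values indicate a sentence the language model finds more predictable, which is why the feature is informative about translation fluency.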

Experimental Results
To test the performance of the neural network features for the QE task, we conduct experiments on the development set of the WMT17 sentence-level QE task.

Experimental Setup
The WMT17 sentence-level QE task covers two translation directions: English to German (en-de) and German to English (de-en). The en-de corpus is from the IT domain, while the de-en corpus is from the pharmaceutical domain. The en-de training set consists of 23,000 sentences and its development set of 1,000 sentences; the de-en training set consists of 25,000 sentences and its development set of 1,000 sentences. A test set of 2,000 sentences is provided for each direction. HTER (Snover et al. 2006) is provided as the quality score for each sentence in the training and development sets. The task of the participants is to build a QE model that predicts the HTER from the source sentences and their machine translations.
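HTER is the TER-style edit distance between the MT output and its human post-edition, normalized by the post-edition length. A simplified sketch that counts only word-level insertions, deletions, and substitutions (full TER also allows block shifts); the two example sentences are illustrative:

```python
# Simplified HTER sketch: Levenshtein edits / post-edition length.
# Full TER additionally permits phrase shifts, which this omits.
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(hyp)][len(ref)]

mt = "the house is blue".split()   # machine translation output
pe = "the house is red".split()    # human post-edition
hter = edit_distance(mt, pe) / len(pe)  # 1 edit / 4 words
```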
To train the word embeddings and the RNNLM, we use the source and target sides of the bilingual parallel corpora publicly released for the WMT translation task: Europarl v7, the Common Crawl corpus, News Commentary v8 and v11, Batch1 and Batch2, localization PO files, IT-related terms from Wikipedia, and the WMT16 and WMT17 QE task1 corpora. The statistics of the bilingual parallel corpora are shown in Table 2; the corpora are shared between the two translation directions.
The Support Vector Regression (SVR) model is utilized for QE. To implement the model, we use the Python machine learning toolkit scikit-learn; the radial basis function is chosen as the SVR kernel function, and the grid search algorithm is used for parameter optimization. The metrics used to evaluate the performance of the QE model are Pearson's correlation coefficient (Pearson r), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation coefficient (Spearman ρ), and Delta Average (DeltaAvg). Pearson r and Spearman ρ are the primary metrics for the scoring and ranking evaluations, respectively, and higher scores mean better correlation between the QE predictions and HTER.
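The SVR setup above can be sketched with scikit-learn as follows; the random feature matrix and synthetic scores are placeholders for the concatenated sentence embeddings and the gold HTER values, and the parameter grid is illustrative rather than the one actually tuned.

```python
# Hedged sketch of the QE regressor: RBF-kernel SVR with grid search,
# evaluated by Pearson r. X and y are synthetic stand-ins for the
# real features and HTER scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))                         # feature vectors
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=80)    # synthetic HTER

grid = GridSearchCV(
    SVR(kernel="rbf"),                                # RBF kernel, as in the paper
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=3,
)
grid.fit(X, y)

pred = grid.predict(X)
r, _ = pearsonr(y, pred)  # Pearson correlation with the gold scores
```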

Results
We exploit SVR with different features to build the QE model. Experiments are performed on the development set of the WMT17 QE task1. The experimental results for en-de and de-en are shown in Tables 3 and 4, respectively. The rows "Baseline" and "Word2vec" represent systems that use only the 17 baseline features officially released by the evaluation campaign and only the sentence embedding features extracted with the word2vec toolkit, respectively, while the row "Word2vec+Baseline" represents the combination of the baseline features and the sentence embedding features, and so on. The system that we finally submitted uses the combination of all of the features. Mikolov et al. (2013c) tried different word embedding dimensions for the source language and the target language to achieve the best translation quality. Motivated by their work, we test diverse word embedding dimensions for the source language and the target language on the training set. For the en-de direction, the best performance is obtained when the dimensions of the source and target word embeddings are 1024 and 2048, respectively, while for the de-en direction, the best performance is obtained when both dimensions are 2048. Then, on top of the sentence embedding features, we add the cross-entropy features extracted with the RNNLM toolkit or the baseline features. When we add the cross-entropy features, the maximum value of Pearson r increases by 9.9% on the scoring evaluation, and the maximum value of Spearman ρ increases by 7.5% on the ranking evaluation. We find that in the en-de direction, the result obtained by adding the cross-entropy features is superior to that obtained by adding the baseline features. Finally, when we combine all of the features, the maximum value of Pearson r increases by 44.6% on the scoring evaluation, and the maximum value of Spearman ρ increases by 29.0% on the ranking evaluation, compared with the baseline.
Because training the word embeddings and the RNNLM requires a monolingual corpus of a certain size, we also investigate the effect of different corpus scales on the quality of the extracted neural network features. We find that when the training corpus contains more than 1M sentences, the QE system performance is not reduced, while with fewer than 1M sentences the performance decreases gradually as the corpus size decreases. This finding demonstrates that training the word embeddings and the RNNLM does not depend heavily on the scale of the training corpus once it exceeds about 1M sentences.
Finally, Table 5 provides the results of our system and the baseline system on the test set. We take the system "Word2vec+RNNLM+Baseline" as our primary system. In the WMT16 QE task, our system ranks third; in the WMT17 QE task, our best result ranks fifth. Compared with the method proposed by , we use fewer features but achieve a better result on the test set.

Conclusions
In this paper, we train sentence embedding features using the word2vec toolkit and enrich them with cross-entropy features extracted by an RNNLM to improve the correlation between QE and human judgment. The experimental results show that the neural network features can significantly improve the system performance. Compared with traditional linguistically motivated features, the neural network features are independent of the specific language.
In the future, we will train an end-to-end neural network model for QE instead of using the traditional SVR method.