Translation Quality Estimation using Recurrent Neural Network

This paper describes our submission to the shared task on word/phrase level Quality Estimation (QE) at the First Conference on Machine Translation (WMT16). The objective of the shared task was to predict whether a given word/phrase is a correct or incorrect (OK/BAD) translation in the given sentence. We propose a novel approach for word level Quality Estimation using the Recurrent Neural Network Language Model (RNN-LM) architecture. RNN-LMs have been found very effective in different Natural Language Processing (NLP) applications, where they are mainly used for vector space language modeling. For this task, we modify the architecture of the RNN-LM: the modified system predicts a label (OK/BAD) in the slot rather than predicting the word. The input to the system is a word sequence, as in the standard RNN-LM. The approach is language independent and requires only the translated text for QE. To estimate the phrase level quality, we use the output of the word level QE system.


Introduction
Quality estimation is the process of predicting the quality of a translation without any reference translation (Blatz et al., 2004; Specia et al., 2009), whereas Machine Translation (MT) system evaluation does require references (human translations). QE can be done at the word, phrase, sentence or document level. This paper describes our submission to the shared task on word and phrase level QE (Task 2) for English-German (en-de) MT.
The shared task builds on the last five years of research in the field of QE (Callison-Burch et al., 2012; Bojar et al., 2013; Bojar et al., 2015).
In this paper, we use a modified version of the RNN-LM, which accepts a word sequence (context window) as input and predicts a label at the output for the middle word. For example, consider the following input/output sample:
English (MT input): Layer effects are retained by default .
Tags: BAD BAD BAD OK OK OK
Now, if we have to predict the output tag (BAD) for the word "sind" in the MT output, our input sequence to the RNN-LM will be "Effekte sind standardmäßig" (if the context window size is 3), whereas for the standard RNN-LM, "Effekte standardmäßig" would be the input to the network with "sind" as the output. We add padding at the start and end of the sentence according to the context window. A detailed description of the model and its implementation is given in section 3.
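The windowing described above can be sketched in a few lines of Python. This is only an illustration: the function name `context_windows` and the padding symbol `<pad>` are assumptions made here, not names taken from our implementation.

```python
def context_windows(tokens, size=3, pad="<pad>"):
    """Return one window of `size` tokens centred on each token of the sentence."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that there is a middle word")
    half = size // 2
    # Pad both ends so that the first and last words also get a full window.
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]


# Only the three words quoted in the text are used as a toy sentence.
windows = context_windows(["Effekte", "sind", "standardmäßig"], size=3)
print(windows[1])  # the window whose middle word is "sind"
```

Each window is then mapped to word vectors and fed to the network, which predicts the tag of the middle word.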
We used the data provided by the organizers for the shared task on quality estimation (2016), which includes: (i) source sentence, (ii) translated output (word/phrase level), (iii) word/phrase level tagging (OK/BAD), (iv) post-edited translation, (v) 22 baseline features, and (vi) word alignment. The goal of the task is to predict whether a given word/phrase is a correct or incorrect (OK/BAD) translation in the given sentence. The remainder of the paper is organised as follows. Section 2 describes the related work. Section 3 presents the RNN models we use and their implementation. In section 4, we discuss the data distribution, our approaches, and results. Discussion of our methodology and the different models is covered in section 5, followed by concluding remarks in section 6.

Related Work
For word level QE, supervised classification techniques are widely used. Most of these approaches require manually designed features, similar to the feature set provided by the organizers. Logacheva et al. (2015) modeled word level QE using the CRF++ tool with data selection and data bootstrapping. Data selection filters out the sentences that have the smallest proportion of erroneous tokens and are therefore assumed to be less useful for the task. The bootstrapping technique creates additional data instances and boosts the importance of BAD labels occurring in the training data. Shang et al. (2015) tried to solve the problem of label imbalance by creating sub-labels such as OK_B (begin), OK_I (intermediate) and OK_E (end). Shah et al. (2015) used word embeddings as an additional feature (+25 features) with an SVM classifier. A bilingual Deep Neural Network (DNN) based model for word level QE was proposed by Kreutzer et al. (2015), in which the word embeddings were pre-trained and then fine-tuned with the other parameters of the network using stochastic gradient descent. de Souza et al. (2014) used a bidirectional LSTM as a classifier for word level QE.
The architecture of the RNN-LM has been used earlier for Natural Language Understanding (NLU) (Yao et al., 2013; Yao et al., 2014). Our approach is quite similar to that of Kreutzer et al. (2015), but we use an RNN instead of a DNN. We have also tried to address the problem of label imbalance by introducing sub-labels, as suggested by Shang et al. (2015).

RNN Models for QE
For this task, we exploited two extensions of the RNN: Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014). LSTM and GRU have been shown to perform better than the simple RNN at modeling long-range dependencies in the data. The simple RNN also suffers from the problem of exploding and vanishing gradients (Bengio et al., 1994). LSTM and GRU tackle this problem by introducing a gating mechanism. LSTM includes input, output and forget gates with a memory cell, whereas GRU has reset and update gates only (no memory cell). A detailed description of each model is given in the following subsections.

LSTM
Different researchers use slightly different LSTM variants (Graves, 2013; Yao et al., 2014; Jozefowicz et al., 2015). We implemented the version of LSTM described by the following set of equations, where sigm is the logistic sigmoid function and tanh is the hyperbolic tangent function, both of which add non-linearity to the network.
⊙ is the elementwise multiplication of vectors. i, o and f are the input, output and forget gates respectively, j is the new memory content, and c is the updated memory content. In these equations, W_* are the weight matrices and b_* are the bias vectors.
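For reference, a common LSTM formulation without peephole connections that is consistent with the symbols defined above can be written as follows. It is given as an assumption, and the exact variant implemented for the experiments may differ. The subscripted matrix names (W_{xi}, W_{hi}, and so on), the input x_t and the hidden state h_t are notational choices made here.

```latex
\begin{aligned}
i_t &= \mathrm{sigm}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \mathrm{sigm}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \mathrm{sigm}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
j_t &= \tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j) \\
c_t &= f_t \odot c_{t-1} + i_t \odot j_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```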

Deep LSTM
In this paper, we use a deep LSTM with two layers. A deep LSTM is created by stacking multiple LSTMs on top of each other. We feed the output of the lower LSTM as the input to the upper LSTM. For example, if h_t is the output of the lower LSTM, we apply a matrix transform to form the input x_t for the upper LSTM. The matrix transformation allows two consecutive LSTM layers to have different sizes.
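The stacking can be illustrated with a small NumPy sketch. It assumes a standard LSTM step without peephole connections; the helper names (`init_lstm`, `lstm_step`, `deep_lstm`), the projection matrix `W_proj` and the layer sizes of the upper layer are hypothetical and chosen only to show that the two layers may differ in size.

```python
import numpy as np


def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_lstm(n_in, n_hid, rng):
    """Illustrative parameters: one input and one recurrent matrix per gate."""
    p = {}
    for g in "ifoj":
        p["Wx" + g] = rng.normal(0.0, 0.01, (n_hid, n_in))
        p["Wh" + g] = rng.normal(0.0, 0.01, (n_hid, n_hid))
        p["b" + g] = np.zeros(n_hid)
    return p


def lstm_step(p, x, h_prev, c_prev):
    """One time step of a standard (no-peephole) LSTM."""
    pre = {g: p["Wx" + g] @ x + p["Wh" + g] @ h_prev + p["b" + g] for g in "ifoj"}
    i, f, o = sigm(pre["i"]), sigm(pre["f"]), sigm(pre["o"])
    j = np.tanh(pre["j"])          # new memory content
    c = f * c_prev + i * j         # updated memory content
    h = o * np.tanh(c)
    return h, c


def deep_lstm(xs, lower, upper, W_proj):
    """Run a two-layer stack; W_proj maps lower outputs to upper inputs."""
    n_lo, n_up = lower["bi"].shape[0], upper["bi"].shape[0]
    h1, c1 = np.zeros(n_lo), np.zeros(n_lo)
    h2, c2 = np.zeros(n_up), np.zeros(n_up)
    outputs = []
    for x in xs:
        h1, c1 = lstm_step(lower, x, h1, c1)
        x_up = W_proj @ h1         # matrix transform between the layers
        h2, c2 = lstm_step(upper, x_up, h2, c2)
        outputs.append(h2)
    return np.stack(outputs)


rng = np.random.default_rng(0)
lower = init_lstm(n_in=100, n_hid=100, rng=rng)
upper = init_lstm(n_in=60, n_hid=80, rng=rng)    # different size, for illustration
W_proj = rng.normal(0.0, 0.01, (60, 100))
hs = deep_lstm(rng.normal(size=(5, 100)), lower, upper, W_proj)
print(hs.shape)
```

Because `W_proj` has shape (60, 100), the upper layer can take 60-dimensional inputs even though the lower layer produces 100-dimensional outputs.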

GRU
GRU is an architecture quite similar to the LSTM. Chung et al. (2014) found that GRU outperforms LSTM on a suite of tasks. GRU is defined by the following set of equations, in which W_* are the weight matrices and b_* are the bias vectors, and r and z are known as the reset and update gates respectively. GRU does not use a separate memory cell as the LSTM does; however, a gating mechanism controls the flow of information in the unit.
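For reference, one widely used way of writing the GRU with these symbols is sketched below. It is an assumption rather than a statement of the exact variant we implemented: the subscripted matrix names, the candidate state \tilde{h}_t and the convention that z_t weights the previous state are choices made here, and some papers swap the roles of z_t and 1 - z_t.

```latex
\begin{aligned}
r_t &= \mathrm{sigm}(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \\
z_t &= \mathrm{sigm}(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```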

Implementation Details
We implemented all the models (LSTM, deep LSTM and GRU) as described above with the Theano framework (Bergstra et al., 2010; Bastien et al., 2012). For all the models in the paper, the size of a hidden layer is 100, the word embedding dimensionality is 100 and the context word window size is 5.
We initialized all the square weight matrices as random orthogonal matrices. All the bias vectors were initialized to zero. Other weight matrices were sampled from a Gaussian distribution with mean 0 and variance 0.01^2.
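A minimal NumPy sketch of this initialization scheme is given below. The QR-based construction is one common way to draw a random orthogonal matrix and is an assumption here, as are the helper names and the choice of a 2 x 100 output matrix as the example of a non-square weight matrix.

```python
import numpy as np


def orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Flip column signs so that the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


def gaussian(shape, rng, std=0.01):
    """Entries drawn from N(0, std^2), i.e. variance 0.01^2 by default."""
    return rng.normal(0.0, std, size=shape)


rng = np.random.default_rng(0)
n_hidden, n_labels = 100, 2               # hidden size from the text; OK/BAD labels
W_hh = orthogonal(n_hidden, rng)          # square recurrent matrix
W_hy = gaussian((n_labels, n_hidden), rng)  # non-square output matrix (assumed shape)
b_h = np.zeros(n_hidden)                  # biases start at zero
```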
To update the model parameters, we used Truncated Back-Propagation-Through-Time (T-BPTT) (Werbos, 1990) with stochastic gradient descent. We fixed the depth of BPTT to 7 for all the models. We used Adadelta (Zeiler, 2012) to adapt the learning rate of each parameter automatically (ε = 10^−6 and ρ = 0.95). We trained each model for 50 epochs.
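The Adadelta rule with these hyperparameters can be sketched as follows. This is an illustrative NumPy version of the update described by Zeiler (2012), not our Theano code; the function name and the `state` dictionary are assumptions, and the quadratic objective is only a toy example.

```python
import numpy as np


def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """Apply one Adadelta update to `param` in place and return it."""
    # Running average of squared gradients.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    # Step scaled by the ratio of the RMS of past updates to the RMS of gradients.
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    # Running average of squared updates.
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta ** 2
    param += delta
    return param


# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
x = np.array([1.0])
state = {"Eg2": np.zeros_like(x), "Edx2": np.zeros_like(x)}
for _ in range(50):
    x = adadelta_step(x, 2.0 * x, state)
print(x)
```

No global learning rate appears in the update, which is why only ε and ρ need to be set.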

Experiments and Results
In this section, we describe the experiments carried out for the shared task and present the experimental results.

Data distribution
We have used the corpus shared by the organizers for our experiments. The split for training/development/testing is detailed in

Methodology
In the following subsections, we discuss our approaches for word/phrase level quality estimation.

Word Level QE
Our experiments are mainly focused on word level QE. We use the output of the word level QE system to estimate the phrase level quality. As mentioned above, we use the modified RNN-LM architecture for the experiments. The baseline (LSTM) system was developed by training the word embeddings from scratch together with the other parameters of the model. In another set of experiments, we pre-trained the word embeddings with word2vec (Mikolov et al., 2013b) and tuned them further while training the model parameters. For pre-training, we used an additional corpus (approx. 2M sentences) from the English-German Europarl data (Koehn, 2005).
For bilingual models, we restructured the source sentence (English) according to the target (German) using the word alignment provided by the organizers. For many-to-one mappings in the alignment (English-German), we chose the first alignment only. The 'NULL' token was assigned to the words which were not aligned with any word on the target side. The input of the model is constructed by concatenating the context words of the source and the target. For example, given the source word sequence s_1 s_2 s_3 and the target word sequence t_1 t_2 t_3, the input to the network will be s_1 s_2 s_3 t_1 t_2 t_3.
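The construction of the bilingual input can be sketched as below. The representation of the alignment as (source index, target index) pairs, the reading that target positions without an aligned source word receive 'NULL', and all function names are assumptions made for this illustration.

```python
def reorder_source(src_tokens, n_tgt, alignment, null="NULL"):
    """Return one source token per target position, following the alignment.

    `alignment` is a list of (source_index, target_index) pairs. When several
    source words align to the same target word, only the first pair is kept.
    Target positions without any aligned source word receive `null`.
    """
    reordered = [null] * n_tgt
    filled = [False] * n_tgt
    for s, t in alignment:
        if not filled[t]:
            reordered[t] = src_tokens[s]
            filled[t] = True
    return reordered


def window(tokens, k, size, pad="<pad>"):
    """Context window of `size` tokens centred on position k."""
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return padded[k:k + size]


def bilingual_input(src_tokens, tgt_tokens, alignment, k, size=3):
    """Concatenate the source and target context windows for target position k."""
    src_reordered = reorder_source(src_tokens, len(tgt_tokens), alignment)
    return window(src_reordered, k, size) + window(tgt_tokens, k, size)


src = ["s1", "s2", "s3", "s4"]
tgt = ["t1", "t2", "t3"]
alignment = [(0, 0), (2, 1), (3, 1)]   # t2 has two candidates, t3 has none
example = bilingual_input(src, tgt, alignment, k=1)
print(example)
```

In this toy case the second target word keeps only its first aligned source word, and the unaligned third position is filled with 'NULL'.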
In the training data, the distribution of the labels (OK/BAD) is skewed (the OK to BAD ratio is approx. 4:1). To handle the issue, we tried one of the strategies proposed by Shang et al. (2015), in which we replace the 'OK' label with sub-labels to balance the distribution. The sub-labels are OK_B, OK_I and OK_E, depending on the location of the token in the sentence.
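One possible reading of this relabeling is sketched below. The rule used here (first token of the sentence gets the begin label, last token the end label, all others the intermediate label) is an assumption made for illustration; the exact rule of Shang et al. (2015) may differ. The underscore in the label names and both function names are also choices made here.

```python
def to_sublabels(tags):
    """Replace each 'OK' by a position-dependent sub-label; 'BAD' is kept."""
    last = len(tags) - 1
    out = []
    for k, tag in enumerate(tags):
        if tag != "OK":
            out.append(tag)
        elif k == 0:
            out.append("OK_B")   # sentence-initial token
        elif k == last:
            out.append("OK_E")   # sentence-final token
        else:
            out.append("OK_I")   # any other position
    return out


def from_sublabels(tags):
    """Map predicted sub-labels back to the two original classes."""
    return ["OK" if t.startswith("OK") else t for t in tags]


tags = ["BAD", "BAD", "BAD", "OK", "OK", "OK"]
sub = to_sublabels(tags)
print(sub)
```

At prediction time the sub-labels are collapsed again with `from_sublabels`, so the evaluation still uses the two classes OK and BAD.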

Phrase Level QE
For phrase level QE, we have not trained any explicit system. As mentioned by the organizers, a phrase is tagged as 'BAD' if any word in the phrase is an incorrect translation. We therefore took the output of the word level QE system and tagged a phrase as 'BAD' if any word within the phrase boundary is tagged 'BAD'. All other phrases (those in which every word has the OK tag) are simply tagged as 'OK'.
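This rule is simple enough to state in code. The sketch below assumes that the phrase segmentation is given as a list of phrase lengths; that representation and the function name are hypothetical.

```python
def phrase_tags(word_tags, phrase_lengths):
    """Tag each phrase 'BAD' if any of its words is 'BAD', else 'OK'."""
    if sum(phrase_lengths) != len(word_tags):
        raise ValueError("phrase lengths must cover the sentence exactly")
    out, start = [], 0
    for n in phrase_lengths:
        words = word_tags[start:start + n]
        out.append("BAD" if "BAD" in words else "OK")
        start += n
    return out


result = phrase_tags(["OK", "BAD", "OK", "OK", "OK"], [2, 1, 2])
print(result)
```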

Results
To develop a baseline system for word and phrase level QE, the organizers used the baseline features (22 features) to train a Conditional Random Field (CRF) model with the CRFSuite tool. The results of the experiments against Test2 are displayed in Tables 2 and 3. We evaluated our systems using the F1-score. Because the 'OK' class is dominant in the data, a naive system tagging all the words 'OK' will score high; hence, the F1-score of the 'BAD' class is used as the primary metric for system evaluation. We used separate test and development corpora, as shown in Table 2. The evaluation of all the experiments against the Test1 corpus is displayed in Table 2; the results for phrase level QE are shown in Table 3. From the result tables, it is evident that GRU outperforms LSTM for this task as well, as reported by Cho et al. (2014). Pre-training is helpful in all the models. The introduction of sub-labels also handles the problem of label imbalance to some extent. The results of the bilingual models are better than those of the monolingual models, as reported by Kreutzer et al. (2015).
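The reason for reporting the F1-score of the 'BAD' class can be seen in a small example. The sketch below computes a per-class F1-score from gold and predicted tags; the function name and the toy data are assumptions, and this is not the official evaluation script of the shared task.

```python
def f1_for_class(gold, pred, label="BAD"):
    """F1-score of one class, computed from aligned gold and predicted tags."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["OK", "OK", "OK", "OK", "BAD"]   # 4:1 ratio, as in the training data
all_ok = ["OK"] * 5                      # naive system: everything is 'OK'
f1_ok = f1_for_class(gold, all_ok, "OK")
f1_bad = f1_for_class(gold, all_ok, "BAD")
print(f1_ok, f1_bad)
```

The naive tagger obtains a high F1-score on the 'OK' class but zero on the 'BAD' class, which is why the latter is the more informative metric.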

Submission to the shared task
We participated in Task 2, which includes word and phrase level quality estimation. The submitted system setting was GRU + Pretrain + Sublabels, which is also marked in the result tables (2 and 3).
Table 5: Results, phrase level submission.

Discussion
The approach is language independent and uses only the vectors of the context words to predict the tag for a word. In other words, we check whether a word fits (grammatically) in the given slot of words or not. Language specific features could, however, be used to improve the classification accuracy. Experiments with bilingual models are similar in concept to adding more features to any machine learning algorithm: in monolingual models, we use only the vectors of the target (German) words as features, whereas in bilingual models we also use the vectors of the source (English) words.
A challenge that machine learning practitioners often face is how to deal with skewed classes in classification problems. The distribution of classes (OK/BAD) is skewed in our case as well. To handle the issue, we tried to balance the distribution of classes by introducing the sub-labels.
LSTM and GRU are quite similar models, differing mainly in the gating mechanism. It is hard to say which model will perform better under which conditions or in general (Chung et al., 2014). In this paper, and in general as well, this restricts us to an empirical comparison between the LSTM and the GRU units. Deep models generally perform better than shallow models, but the opposite holds for this task, where LSTM outperforms deep LSTM. The reason could be that the data is insufficient for training the deep models.

Conclusion and Future Work
We have developed a language independent word/phrase level Quality Estimation system using RNNs. We used the RNN-LM architecture with LSTM, deep LSTM, and GRU. We showed that these models benefit from pre-training and from the introduction of sub-labels. Also, models with bilingual features outperform the monolingual models.
The work can be extended to sentence and document level quality estimation. Other possibilities include improving the word level quality estimation with data selection and bootstrapping (Logacheva et al., 2015), more effective ways of handling label imbalance, training bigger models, using language specific features, and other variations of the LSTM architecture.