A Recurrent Neural Networks Approach for Estimating the Quality of Machine Translation Output

This paper presents a novel approach using recurrent neural networks for estimating the quality of machine translation output. A sequence of vectors produced by a prediction method is used as the input of the final recurrent neural network. The prediction method uses a bi-directional recurrent neural network architecture on both the source and target sentences to fully utilize the bi-directional quality information they carry. Our experiments show that the proposed recurrent neural networks approach achieves performance comparable to existing state-of-the-art models for estimating the sentence-level quality of English-to-Spanish translation.


Introduction
Estimating the quality of machine translation output, called quality estimation (QE) (Specia et al., 2009; Blatz et al., 2004), is the task of predicting quality scores or categories for unseen machine-translated sentences without reference translations, at various levels of granularity (sentence level, word level, document level). Quality estimation is of growing importance in the field of machine translation (MT) because MT systems are widely used and the quality of machine-translated sentences can vary considerably.
Previous research on QE, which treats it as a regression or classification problem of computing quality scores or categories, has mainly focused on feature extraction and feature selection. Feature extraction finds relevant features, such as baseline features (Specia et al., 2013) and latent semantic indexing (LSI) based features (Langlois, 2015), that capture various aspects of quality from the source and target sentences and from external resources. Feature selection picks the best of the already extracted features by using selection algorithms, such as Gaussian processes (Shah et al., 2015) and heuristics (González-Rubio et al., 2013). Finding desirable features has played a key role in QE research.
In this paper we present a recurrent neural networks approach for estimating the quality of machine translation output at sentence level, which does not require manual effort to find the most relevant features. The remainder of this paper is organized as follows. In Section 2, we propose a recurrent neural networks approach that uses a sequence of vectors made by the prediction method as input for quality estimation. In Section 3, we describe the prediction method, which uses a bi-directional recurrent neural networks architecture. In Section 4, we report evaluation results, and we conclude in Section 5.

Recurrent Neural Networks Approach for Estimating Quality Score
Because recurrent neural networks (RNNs) are well suited to handling sequential data (Goodfellow et al., 2015), we apply RNNs to estimate the quality score of a translation. The input of the final RNN is a sequence of vectors that carry quality information about whether target words in a target sentence are properly translated from a source sentence. We will refer to this sequence of vectors as quality vectors $(q_{y_1}, \ldots, q_{y_{T_y}})$.
Figure 1: An illustration of the proposed recurrent neural networks model for quality estimation.
Each quality vector $q_{y_j}$ carries quality information about how well a target word $y_j$ in a target sentence $y = (y_1, \ldots, y_{T_y})$ is translated from a source sentence $x = (x_1, \ldots, x_{T_x})$. Quality vectors are generated by the prediction method of Section 3.
To predict a quality estimation score (QE score) as an HTER score (Snover et al., 2006) in $[0, 1]$ for each target sentence, a logistic sigmoid function is used such that
$$\mathrm{QE\,score}(y, x) = \mathrm{QE\,score}(q_{y_1}, \ldots, q_{y_{T_y}}) = \sigma(W_{QE}^{\top} s) \tag{1}$$
where $s$ is a summary unit of the whole sequence of quality vectors, $W_{QE} \in \mathbb{R}^{r}$, and $r$ is the dimensionality of the summary unit.
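A minimal sketch of this scoring step in plain Python may help make the data flow concrete. It runs a gated hidden unit in the form given by Cho et al. (2014) over a sequence of quality vectors and applies the logistic sigmoid to a summary unit. Taking the last hidden state as the summary unit is an assumption made here for illustration only, as are all function names, parameter names, the toy dimensions, and the random initialisation; none of them come from the paper.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(q_dim, hidden, seed=0):
    """Hypothetical parameter set: input weights W*, recurrent weights U*, output w_qe."""
    rng = random.Random(seed)
    P = {name: rand_matrix(hidden, q_dim, rng) for name in ("Wz", "Wr", "W")}
    P.update({name: rand_matrix(hidden, hidden, rng) for name in ("Uz", "Ur", "U")})
    P["w_qe"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    return P

def gru_step(q, v_prev, P):
    """One gated hidden unit update, v_j = f(q_j, v_{j-1})."""
    z = [sigmoid(a + b) for a, b in zip(matvec(P["Wz"], q), matvec(P["Uz"], v_prev))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(P["Wr"], q), matvec(P["Ur"], v_prev))]  # reset gate
    gated = [ri * vi for ri, vi in zip(r, v_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(P["W"], q), matvec(P["U"], gated))]
    # Interpolate between the previous state and the candidate state.
    return [zi * vi + (1.0 - zi) * ci for zi, vi, ci in zip(z, v_prev, cand)]

def qe_score(quality_vectors, P):
    """Sigmoid of w_qe . s, where s summarises the hidden states."""
    v = [0.0] * len(P["w_qe"])
    for q in quality_vectors:
        v = gru_step(q, v, P)
    s = v  # ASSUMPTION: the summary unit is the last hidden state
    return sigmoid(sum(w * si for w, si in zip(P["w_qe"], s)))

params = init_params(q_dim=4, hidden=3)
sentence = [[0.2, -0.1, 0.0, 0.5], [0.1, 0.3, -0.2, 0.0]]  # two toy quality vectors
score = qe_score(sentence, params)
```

Because the output passes through a sigmoid, `score` always lies strictly between 0 and 1, which matches the range of an HTER target.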
To get the summary unit $s$, the hidden state $v_j$, which employs $p$ gated hidden units, is computed for the target word $y_j$ by
$$v_j = f(q_{y_j}, v_{j-1}) \tag{2}$$
The gated hidden unit (Cho et al., 2014) is used for the activation function $f$ to learn long-term dependencies.
Since the training data for QE are not sufficient for a neural networks approach to making quality vectors, we use an alternative based on large-scale parallel corpora such as Europarl. We modify the word prediction method of RNN Encoder-Decoder (Cho et al., 2014), trained on parallel corpora, to make the quality vectors.
In subsection 3.1, we describe the underlying word prediction method of RNN Encoder-Decoder. We then (i) extend the prediction method with an additional backward RNN architecture on the target sentence in subsection 3.2 and (ii) modify it to obtain the quality vectors $(q_{y_1}, \ldots, q_{y_{T_y}})$ in subsection 3.3. Figure 1 illustrates the proposed RNNs approach.

Word Prediction Method of RNN Encoder-Decoder
The RNN Encoder-Decoder proposed by Cho et al. (2014) predicts the target word $y_j$ given a source sentence $x$ and all preceding target words $\{y_1, \ldots, y_{j-1}\}$ by using a softmax function. It was extended by Bahdanau et al. (2015) to use information about relevant source words when predicting the target word $y_j$, such that
$$p(y_j \mid \{y_1, \ldots, y_{j-1}\}, x) = g(y_{j-1}, s_{j-1}, c_j) \tag{3}$$
where $g$ is a nonlinear function predicting the probability of $y_j$. $s_{j-1}$ is the hidden state of the forward RNN on the target sentence and contains information about the preceding target words $\{y_1, \ldots, y_{j-1}\}$. $c_j$ is the context vector, which represents the parts of the source sentence relevant to the target word $y_j$. In the word prediction function of (3), $s_{j-1}$ and $y_{j-1}$ are related to all preceding target words $\{y_1, \ldots, y_{j-1}\}$, and $c_j$ is related to $x$.

Additional Backward RNN Architecture on Target Sentence
Bahdanau et al. (2015) introduce a bi-directional RNN architecture only on the source sentence to extend RNN Encoder-Decoder. In our proposed QE model, a bi-directional RNN architecture is used on both the source and target sentences. By doing so, we can fully and bi-directionally utilize the source and target sentences for predicting target words, such that
$$p(y_j \mid y \setminus y_j, x) = g([\overrightarrow{s}_{j-1}; \overleftarrow{s}_{j+1}], [y_{j-1}; y_{j+1}], c_j) = \frac{\exp(y_j^{\top} W_{o_1} W_{o_2} t_j)}{\sum_{k=1}^{K_y} \exp(y_k^{\top} W_{o_1} W_{o_2} t_j)} \tag{4}$$
which is the extended version of (3) using the additional backward RNN architecture. The additional backward RNN on the target sentence uses the context vectors shared with the forward RNN on the target sentence.
To further reflect all following target words $\{y_{j+1}, \ldots, y_{T_y}\}$ when predicting the target word $y_j$, the hidden state $\overleftarrow{s}_{j+1}$ of the backward RNN and the next target word $y_{j+1}$ are added. In the word prediction function of (4), $[\overrightarrow{s}_{j-1}; \overleftarrow{s}_{j+1}]$ and $[y_{j-1}; y_{j+1}]$ are related to $y \setminus y_j$, the target words other than $y_j$, and $c_j$ is related to $x$.
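$W_{o_1} \in \mathbb{R}^{K_y \times q}$ and $W_{o_2} \in \mathbb{R}^{q \times l}$ are the weight matrices of the softmax function. $K_y$ is the vocabulary size of the target language and $q$ is the dimensionality of the quality vectors. $l$ is the dimensionality of the maxout units, such that
$$t_j = \left[\max\{\tilde{t}_{j,2k-1}, \tilde{t}_{j,2k}\}\right]^{\top}_{k=1,\ldots,l} \tag{5}$$
where $\tilde{t}_{j,k}$ is the $k$-th element of a vector $\tilde{t}_j$, and
$$\tilde{t}_j = S_o[\overrightarrow{s}_{j-1}; \overleftarrow{s}_{j+1}] + V_o[E_y y_{j-1}; E_y y_{j+1}] + C_o c_j \tag{6}$$
where $S_o \in \mathbb{R}^{2l \times 2n}$, $V_o \in \mathbb{R}^{2l \times 2m}$, and $C_o \in \mathbb{R}^{2l \times 2n}$. $E_y \in \mathbb{R}^{m \times K_y}$ is the word embedding matrix for the target sentence. $m$ and $n$ are the dimensionalities of the word embeddings and of the hidden states of the forward and backward RNNs, respectively. The hidden state $\overleftarrow{s}_{j+1}$ of the backward RNN and the next target word $y_{j+1}$ are used in (6). With the extended prediction method of (4), the probability of the target word $y_j$ is computed from the relevant source words in the source sentence $x$ and from all target words $y \setminus y_j$ surrounding $y_j$ in the target sentence.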
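The output layer described above can be sketched in plain Python as follows. The sketch reads the definitions as the maxout output layer of Bahdanau et al. (2015) applied to concatenated forward and backward inputs, followed by a softmax over the target vocabulary. That reading, the toy dimensions ($n = 1$, $m = 1$, $l = 2$, $q = 2$, $K_y = 3$), the hand-picked weights, and every function name are assumptions for illustration.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def maxout(t_tilde):
    """Pairwise max over consecutive elements: maps 2l values to l values."""
    return [max(t_tilde[i], t_tilde[i + 1]) for i in range(0, len(t_tilde), 2)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distribution(s_fwd_prev, s_bwd_next, e_prev, e_next, c, P):
    """Distribution over target words at position j, plus the maxout vector t_j.

    e_prev and e_next are the already embedded neighbouring words,
    i.e. E_y y_{j-1} and E_y y_{j+1}.
    """
    s_cat = s_fwd_prev + s_bwd_next   # [s_{j-1}; s_{j+1}], length 2n
    e_cat = e_prev + e_next           # [E_y y_{j-1}; E_y y_{j+1}], length 2m
    t_tilde = [a + b + d for a, b, d in zip(matvec(P["S_o"], s_cat),
                                            matvec(P["V_o"], e_cat),
                                            matvec(P["C_o"], c))]
    t = maxout(t_tilde)               # length l
    logits = matvec(P["W_o1"], matvec(P["W_o2"], t))  # one logit per target word
    return softmax(logits), t

# Hypothetical toy weights with the shapes stated in the text.
P = {
    "S_o": [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]],   # 2l x 2n
    "V_o": [[0.2, 0.0], [0.1, 0.1], [-0.1, 0.3], [0.0, 0.2]],    # 2l x 2m
    "C_o": [[0.1, -0.1], [0.2, 0.0], [0.0, 0.1], [0.3, -0.2]],   # 2l x 2n
    "W_o2": [[0.5, -0.3], [0.1, 0.4]],                           # q x l
    "W_o1": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],                # K_y x q
}
probs, t = predict_distribution([0.4], [-0.2], [1.0], [0.5], [0.3, 0.7], P)
```

Here `probs` holds one probability per vocabulary entry and `t` is the $l$-dimensional maxout vector that the next subsection decodes into a quality vector.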

Quality Vectors on Target Sentence
The word prediction method predicts the probability of a target word as a number between 0 and 1. However, we want $q$-dimensional quality vectors that carry more intrinsic quality information about the target words.
To make quality vectors, we assume that the probability of the target word $y_j$ carries quality information about whether $y_j$ in the target sentence is properly translated from the source sentence. Thus, by decomposing the softmax function of (4), the quality vector $q_{y_j}$ for the target word $y_j$ is computed by
$$q_{y_j} = \left[(y_j^{\top} W_{o_1})^{\top} \circ W_{o_2} t_j\right] \tag{7}$$
where $\circ$ is element-wise multiplication. All quality information about the $K_y$ possible target words at position $j$ of the target sentence is encoded in $t_j$. Thus, by decoding $t_j$, we are able to get the quality vector $q_{y_j}$ for the target word $y_j \in \mathbb{R}^{K_y}$ at position $j$ of the target sentence. Figures 2 and 3 show how the quality vector $q_{y_j}$ is computed.
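As a small worked example of this decomposition, the sketch below assumes that the target word is a one-hot vector, so that multiplying it with $W_{o_1}$ selects one row, and that the quality vector is the element-wise product of that row with $W_{o_2} t_j$. Under that assumption the elements of the quality vector sum to the softmax logit of the word. The matrices and the function name are hypothetical toy values.

```python
def quality_vector(word_index, W_o1, W_o2, t):
    """Element-wise product of the word's row of W_o1 with the projection W_o2 t."""
    row = W_o1[word_index]  # y_j^T W_o1 for a one-hot y_j, length q
    proj = [sum(w * x for w, x in zip(r, t)) for r in W_o2]  # W_o2 t_j, length q
    return [a * b for a, b in zip(row, proj)]

# Toy sizes (hypothetical): K_y = 3 target words, q = 2, l = 3.
W_o1 = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]   # K_y x q
W_o2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]     # q x l
t = [1.0, 2.0, 0.5]                            # maxout vector t_j

q_vec = quality_vector(0, W_o1, W_o2, t)
logit = sum(q_vec)  # equals y_j^T W_o1 W_o2 t_j for word 0
print(q_vec, logit)  # prints [2.0, 3.0] 5.0
```

The sum collapses the vector back to the single number that the softmax would normalise, which is why keeping the unsummed vector retains more information than the probability alone.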

Experiments
The proposed RNNs approach was evaluated on the sentence-level English-Spanish task of the WMT15 Quality Estimation Shared Task (http://www.statmt.org/wmt15/qualityestimation-task.html). We trained the proposed model through a two-step process, using the stochastic gradient descent (SGD) algorithm with an adaptive learning rate (Adadelta) (Zeiler, 2012). First, using the English-Spanish parallel corpus of Europarl v7 (Koehn, 2005), we trained bi-directional RNNs with 1000 hidden units on the source and target sentences to make quality vectors. Next, using the training set of the WMT15 QE task, we trained the final RNN to predict QE scores; it takes the quality vectors generated in the previous step as input and has 100 hidden units.
Tables 1 and 2 present the results of the proposed approach (Bi-RNN) together with the official results (Bojar et al., 2015) for the scoring and ranking variants of the WMT15 Quality Estimation Shared Task at sentence level. In both variants of the task, the proposed RNNs approach outperformed the baseline. Our experiments also showed that the proposed RNNs approach falls within the best-performing group in the scoring variant (Table 1) and is close to the best-performing group in the ranking variant (Table 2).
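For readers unfamiliar with the optimizer used for training, the following is a minimal sketch of the Adadelta update rule (Zeiler, 2012) in plain Python, applied to a one-parameter toy objective. The decay rate `rho = 0.95` and constant `eps = 1e-6` are commonly used defaults and are assumptions here, not settings reported in the paper; the objective, function names, and step count are likewise hypothetical.

```python
import math

def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update for a list of parameters; returns the updated lists."""
    new_x, new_g2, new_d2 = [], [], []
    for xi, gi, g2, d2 in zip(x, grad, avg_sq_grad, avg_sq_delta):
        g2 = rho * g2 + (1.0 - rho) * gi * gi                  # running average of squared gradients
        delta = -math.sqrt(d2 + eps) / math.sqrt(g2 + eps) * gi
        d2 = rho * d2 + (1.0 - rho) * delta * delta            # running average of squared updates
        new_x.append(xi + delta)
        new_g2.append(g2)
        new_d2.append(d2)
    return new_x, new_g2, new_d2

def loss(x):
    return (x[0] - 3.0) ** 2  # toy objective with its minimum at x = 3

def loss_grad(x):
    return [2.0 * (x[0] - 3.0)]

x = [0.0]
g2, d2 = [0.0], [0.0]
initial_loss = loss(x)
for _ in range(100):
    x, g2, d2 = adadelta_step(x, loss_grad(x), g2, d2)
final_loss = loss(x)
```

No global learning rate appears in the update: the step size for each parameter is set by the ratio of the two running averages, which is the property that makes the method attractive when training the two RNN stages with different scales of gradients.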

Conclusion
This paper proposed a recurrent neural networks approach using quality vectors for estimating the quality of machine translation output at sentence level. This approach does not require manual effort to find the most relevant features, which previous QE research has mainly focused on.
To make quality vectors, we used an alternative prediction method based on large-scale parallel corpora, because the QE training data were not sufficient. By extending the prediction method to use a bi-directional RNN architecture on both the source and target sentences, we were able to fully utilize the bi-directional quality information from the source and target sentences for quality estimation.
The proposed RNNs approach achieved performance comparable to existing state-of-the-art models at sentence-level QE. Our experiments have shown that the RNNs approach is a meaningful step for QE research. Applying the RNNs approach to word-level QE and studying better ways of making quality vectors remain for future work.