Contextual Encoding for Translation Quality Estimation

The task of word-level quality estimation (QE) consists of taking a source sentence and a machine-generated translation, and predicting which words in the output are correct and which are wrong. In this paper, we propose a method to effectively encode the local and global contextual information for each target word using a three-part neural network approach. The first part uses an embedding layer to represent words and their part-of-speech tags in both languages. The second part leverages a one-dimensional convolution layer to integrate local context information for each target word. The third part applies a stack of feed-forward and recurrent neural networks to further encode the global context in the sentence before making the predictions. This model was submitted as the CMU entry to the WMT2018 shared task on QE, and achieves strong results, ranking first in three of the six tracks.


Introduction
Quality estimation (QE) refers to the task of measuring the quality of machine translation (MT) system outputs without reference to gold translations (Blatz et al., 2004; Specia et al., 2013). QE research has grown increasingly popular due to the improved quality of MT systems, and the potential for reductions in post-editing time and the corresponding savings in labor costs (Specia, 2011; Turchi et al., 2014). QE can be performed at multiple granularities, including word level, sentence level, or document level. In this paper, we focus on quality estimation at the word level, which is framed as the task of performing binary classification of translated tokens, assigning "OK" or "BAD" labels. Our software is available at https://github.com/junjiehu/CEQE. Early work on this problem mainly focused on hand-crafted features with simple regression/classification models (Ueffing and Ney, 2007; Biçici, 2013). Recent papers have demonstrated that utilizing recurrent neural networks (RNN) can result in large gains in QE performance (Martins et al., 2017). However, these approaches encode the context of the target word by merely concatenating its left and right context words, giving them limited ability to control the interaction between the local context and the target word.
In this paper, we propose a neural architecture, Context Encoding Quality Estimation (CEQE), for better encoding of context in word-level QE. Specifically, we leverage the power of both (1) convolution modules that automatically learn local patterns of surrounding words, and (2) hand-crafted features that allow the model to make more robust predictions in the face of a paucity of labeled data. Moreover, we utilize stacked recurrent neural networks to capture long-term dependencies and global context information from the whole sentence. We tested our model on the official benchmark of the WMT18 word-level QE task, where it achieved highly competitive results: the best performance among all competitors on the English-Czech, English-Latvian (NMT), and English-Latvian (SMT) word-level QE tasks, and second place on the English-German (NMT) and German-English tasks.

Model
The QE module receives as input a tuple ⟨s, t, A⟩, where s = (s_1, ..., s_M) is the source sentence, t = (t_1, ..., t_N) is the translated sentence, and A ⊆ {(m, n) | 1 ≤ m ≤ M, 1 ≤ n ≤ N} is a set of word alignments. It predicts as output a sequence ŷ = (y_1, ..., y_N), with each y_i ∈ {BAD, OK}. The overall architecture is shown in Figure 1. CEQE consists of three major components: (1) embedding layers for words and part-of-speech (POS) tags in both languages, (2) convolutional encoding of the local context for each target word, and (3) encoding of the global context by a recurrent neural network.

Embedding Layer
Inspired by Martins et al. (2017), the first embedding layer represents each target word t_j by concatenating the embedding of that word with those of its aligned words in the source. If a target word is aligned to multiple source words, we average the embeddings of all the source words and concatenate the target word embedding with this averaged source embedding. The immediate left and right contexts of the source and target words are also concatenated, enriching the local context information in the embedding of target word t_j. Thus, the embedding of target word t_j, denoted x_j, is a 6d-dimensional vector, where d is the dimension of the word embeddings. The source and target words share the same embedding parameters, so identical words in both languages, such as digits and proper nouns, have the same embedding vectors. This allows the model to easily identify words that appear in both languages. Similarly, the POS tags in both languages share the same embedding parameters.
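The construction above can be sketched as follows. This is a toy illustration, not the released code: the vocabulary, random embedding table, and the helper `word_repr` are all hypothetical, and padding is used where a context word or alignment is missing.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # toy embedding dimension
vocab = {"<pad>": 0, "the": 1, "cat": 2, "die": 3, "katze": 4}
E = rng.standard_normal((len(vocab), d))     # shared source/target embedding table

def embed(idx):
    return E[idx]

def word_repr(tgt, src, align, j):
    """6d-dimensional vector for target position j: target left/center/right
    context, plus the aligned source word(s) averaged, with the left/right
    context of the aligned span. Hypothetical helper for illustration."""
    pad = vocab["<pad>"]
    t_left = embed(tgt[j - 1]) if j > 0 else embed(pad)
    t_right = embed(tgt[j + 1]) if j + 1 < len(tgt) else embed(pad)
    aligned = [m for (m, n) in align if n == j]
    if aligned:
        # average the embeddings of all aligned source words
        s_center = np.mean([embed(src[m]) for m in aligned], axis=0)
        s_left = embed(src[min(aligned) - 1]) if min(aligned) > 0 else embed(pad)
        s_right = embed(src[max(aligned) + 1]) if max(aligned) + 1 < len(src) else embed(pad)
    else:
        s_center = s_left = s_right = embed(pad)
    return np.concatenate([t_left, embed(tgt[j]), t_right, s_left, s_center, s_right])

src = [vocab["the"], vocab["cat"]]
tgt = [vocab["die"], vocab["katze"]]
x1 = word_repr(tgt, src, align=[(1, 1)], j=1)   # "katze" aligned to "cat"
```

The resulting vector has six d-dimensional parts, matching the 6d dimension stated above.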

One-dimensional Convolution Layer
The main difference between our work and the neural model of Martins et al. (2017) is the one-dimensional convolution layer. Convolutions provide a powerful way to extract local context features, analogous to implicitly learning n-gram features. We now describe this integral part of our model.
After embedding each word in the target sentence (t_1, ..., t_N), we obtain a matrix of embeddings for the target sequence, x_{1:N} = x_1 ⊕ x_2 ⊕ ... ⊕ x_N, where ⊕ is the column-wise concatenation operator. We then apply a one-dimensional convolution (Kim, 2014; Liu et al., 2017) over x_{1:N} along the target sequence to extract the local context of each target word. Specifically, a one-dimensional convolution involves a filter w ∈ R^{hk}, which is applied to a window of h words in the target sequence to produce a new feature c_i = f(w · x_{i:i+h-1} + b), where b ∈ R is a bias term and f is a nonlinear activation function. This filter is applied to each possible window of h words in the embedded target sentence. By padding proportionally to the filter size h at the beginning and the end of the target sentence, we obtain new features c_pad ∈ R^N whose length equals the input sentence length N. To capture various granularities of local context, we consider filters with multiple window sizes H = {1, 3, 5, 7}, and n_f = 64 filters are learned for each window size.
The output of the one-dimensional convolution layer, C ∈ R^{N × |H|·n_f}, is then concatenated with the embeddings of the POS tags of the target words and of their aligned source words, to provide a more direct signal to the following recurrent layers.
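A minimal numpy sketch of this same-length convolution follows; random filters stand in for learned parameters, and `conv1d_same` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

def conv1d_same(x, h, n_f, rng):
    """Apply n_f filters of window size h over x (shape (N, k)) with
    zero-padding of h // 2 on each side, so the output keeps length N
    for odd h. ReLU stands in for the unspecified activation f."""
    N, k = x.shape
    W = rng.standard_normal((n_f, h * k)) * 0.1   # filters w ∈ R^{hk}
    b = np.zeros(n_f)                             # bias terms
    pad = h // 2
    xp = np.vstack([np.zeros((pad, k)), x, np.zeros((pad, k))])
    # each row is one flattened window x_{i:i+h-1}
    windows = np.stack([xp[i:i + h].ravel() for i in range(N)])
    return np.maximum(0.0, windows @ W.T + b)     # (N, n_f)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 12))                  # N = 7 target words, k = 6d = 12
# window sizes H = {1, 3, 5, 7}, n_f = 64 filters each, concatenated
C = np.concatenate([conv1d_same(x, h, 64, rng) for h in (1, 3, 5, 7)], axis=1)
```

The concatenated output has shape (N, |H|·n_f) = (7, 256), matching C above.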

RNN-based Encoding
After we obtain the representation of the source-target word pair from the convolution layer, we follow an architecture similar to that of Martins et al. (2017) to refine the representation of the word pairs, passing it through a stack of alternating feed-forward and bidirectional recurrent layers that ends with two feed-forward layers of size 100 and 50, respectively, with ReLU activations.
We concatenate the 31 baseline features extracted by the Marmot toolkit with the last 50 feed-forward hidden features. The baseline features are listed in Table 2. We then apply a softmax layer on the combined features to predict the binary labels.
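The final prediction step can be sketched as follows; the 50- and 31-dimensional sizes come from the text, while the random features and weight matrix are placeholders for learned values.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
N = 7
hidden = rng.standard_normal((N, 50))             # last feed-forward features
baseline = rng.standard_normal((N, 31))           # Marmot baseline features
feats = np.concatenate([hidden, baseline], axis=1)  # (N, 81) combined features
W = rng.standard_normal((81, 2)) * 0.1            # logits for {OK, BAD}
probs = softmax(feats @ W)                        # per-word label distribution
```

Each row of `probs` is a distribution over the two labels for one target word.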

Training
We minimize the binary cross-entropy loss between the predicted outputs and the target labels. We train our neural model with mini-batch size 8 using Adam (Kingma and Ba, 2015) with learning rate 0.001, and decay the learning rate by a factor of 0.75 whenever the F1-Multi score on the validation set decreases. Gradient norms are clipped to a maximum of 5 to prevent gradient explosion in the feed-forward and recurrent networks. Since the training corpus is rather small, we use dropout (Srivastava et al., 2014) with probability 0.3 to prevent overfitting.
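The learning-rate schedule and gradient clipping described above can be sketched as below; the helper names are ours, not from the released code.

```python
import numpy as np

def decayed_lr(lr, prev_f1, curr_f1, factor=0.75):
    """Multiply the learning rate by `factor` whenever validation
    F1-Multi decreases (hypothetical helper)."""
    return lr * factor if curr_f1 < prev_f1 else lr

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm
    does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

lr = decayed_lr(0.001, prev_f1=0.40, curr_f1=0.38)   # validation score dropped
grads = [np.full((3, 3), 10.0)]                      # global norm = 30
clipped = clip_grad_norm(grads)                      # rescaled to norm 5
```

After the decay step the learning rate is 0.00075, and the clipped gradient has global norm 5.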

Experiment
We evaluate our CEQE model on the WMT2018 Quality Estimation Shared Task for word-level English-German, German-English, English-Czech, and English-Latvian QE. Words in all languages are lowercased. The evaluation metric is F1-Multi, the product of the F1-scores for the "OK" and "BAD" classes against the true labels, where the F1-score is the harmonic mean of precision and recall. As shown in Table 3, our model achieves the best performance on three of the six test sets in the WMT 2018 word-level QE shared task.
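The metric can be computed as in the sketch below; `f1_multi` is our naming for illustration, and the official scorer is the shared task's own script.

```python
def f1(tp, fp, fn):
    """F1-score from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def f1_multi(y_true, y_pred):
    """Product of F1 for the "OK" class and F1 for the "BAD" class."""
    def counts(cls):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        return tp, fp, fn
    return f1(*counts("OK")) * f1(*counts("BAD"))

y_true = ["OK", "OK", "BAD", "OK", "BAD"]
y_pred = ["OK", "BAD", "BAD", "OK", "OK"]
score = f1_multi(y_true, y_pred)   # F1-OK = 2/3, F1-BAD = 1/2, product = 1/3
```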

Ablation Analysis
From the ablation analysis, we make the following observations:
1. Because the number of "OK" tags is much larger than the number of "BAD" tags, the model is easily biased towards predicting the "OK" tag for each target word. The F1-OK scores are therefore higher than the F1-BAD scores across all the language pairs.
2. For German-English, English-Czech, and English-German (SMT), adding the baseline features significantly improves the F1-BAD scores.
3. For English-Czech, English-German (SMT), and English-German (NMT), removing POS tags makes the model more biased towards predicting "OK" tags, which leads to higher F1-OK scores and lower F1-BAD scores.
4. Adding the convolution layer helps to boost F1-Multi performance, especially on the English-Czech and English-German (SMT) tasks. Comparing the F1-OK scores of the model with and without the convolution layer, we find that the convolution layer boosts F1-OK when translating from English into other languages, i.e., English-Czech and English-German (SMT and NMT). We conjecture that the convolution layer captures local information more effectively from the aligned source words in English.

Table 5 shows two examples of quality prediction on the validation data of the WMT2018 English-Czech QE task. In the first example, the model without POS tags and baseline features is biased towards predicting "OK" tags, while the model with full features detects the reordering error. In the second example, the target word "panelu" is an inflected variant of the reference word "panel", and the target word "znaky" is the plural of the reference "znak"; thus, their POS tags have some subtle differences. Note that the target word "změnit" and its aligned source word "change" are both verbs. We observe that POS tags help the model capture such syntactic variants.

Sensitivity Analysis
During training, we find that the model can easily overfit the training data, which yields poor performance on the validation and test sets. To make the model more robust to unseen data, we apply dropout to the word embeddings, POS embeddings, and the vectors after the convolution layers and the stacked recurrent layers. In Figure 2, we examine F1-Multi under dropout rates in {0.1, 0.3, 0.7}. We find that adding dropout alleviates overfitting on the training set. If we reduce the dropout rate to 0.1, i.e., randomly setting values to zero with probability 0.1, the training F1-Multi increases rapidly while the validation F1-Multi is the lowest among all settings. A dropout rate of 0.3 performed best in preliminary experiments, so we use it in all experiments.
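The dropout applied here can be sketched as standard inverted dropout (a common formulation, not necessarily the authors' exact implementation), which rescales surviving activations so their expectation is unchanged at test time:

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Inverted dropout: zero each entry with probability `rate` and
    rescale the survivors by 1 / (1 - rate)."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, rate=0.3, rng=rng)   # entries are either 0 or 1 / 0.7
```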

Conclusion
In this paper, we propose a deep neural architecture for word-level QE. Our framework leverages a one-dimensional convolution over the concatenated word embeddings of each target word and its aligned source words to extract salient local feature maps. In addition, bidirectional RNNs are applied to capture temporal dependencies for better sequence prediction. We conduct thorough experiments on four language pairs in the WMT2018 shared task. The proposed framework achieves highly competitive results, outperforming all other participants on the English-Czech and English-Latvian word-level tasks, and placing second on the English-German and German-English language pairs.