A Neural Network based Approach to Automatic Post-Editing

We present a neural network based automatic post-editing (APE) system to improve raw machine translation (MT) output. Our neural model of APE (NNAPE) is based on a bidirectional recurrent neural network (RNN) model and consists of an encoder that encodes an MT output into a fixed-length vector from which a decoder provides a post-edited (PE) translation. APE translations produced by NNAPE show statistically significant improvements of 3.96, 2.68 and 1.35 BLEU points absolute over the original MT, phrase-based APE and hierarchical APE outputs, respectively. Furthermore, human evaluation shows that the NNAPE generated PE translations are much better than the original MT output.


Introduction
For many applications the performance of state-of-the-art MT systems is useful but often far from perfect. MT technologies have gained wide acceptance in the localization industry. Computer aided translation (CAT) has become the de facto standard in large parts of the translation industry, which has resulted in a surge of demand for professional post-editors. This, in turn, has resulted in substantial quantities of PE data which can be used to develop APE systems.
In the context of MT, "post-editing" (PE) is defined as the correction performed by humans over the translations produced by an MT system (Veale and Way, 1997), often with a minimal amount of manual effort (TAUS Report, 2010) and as a process of modification rather than revision (Loffler-Laurian, 1985).
MT systems primarily make two types of errors: lexical errors and reordering errors. However, due to the statistical and probabilistic nature of modelling in statistical MT (SMT), the currently dominant MT technology, it is non-trivial to rectify these errors in the SMT models. Post-edited data are often used in incremental MT frameworks as additional training material. However, this often does not fully exploit the potential of these rich PE data: e.g., PE data may just be drowned out by a large SMT model. An APE system trained on human post-edited data can serve as an MT post-processing module which can improve overall performance. An APE system can be considered as an MT system, translating predictable error patterns in MT output into their corresponding corrections.
APE systems assume the availability of source language input text ($SL_{IP}$), target language MT output ($TL_{MT}$) and target language PE data ($TL_{PE}$). An APE system can be modelled as an MT system between $SL_{IP}$ combined with $TL_{MT}$ on the one side and $TL_{PE}$ on the other. However, if we do not have access to $SL_{IP}$, but have sufficiently large amounts of parallel $TL_{MT}$-$TL_{PE}$ data, we can still build an APE model between $TL_{MT}$ and $TL_{PE}$.
Translations provided by state-of-the-art MT systems suffer from a number of errors including incorrect lexical choice, word ordering, word insertion, word deletion, etc. The APE work presented in this paper is an effort to improve the MT output by rectifying some of these errors. For this purpose we use a deep neural network (DNN) based approach. Neural MT (NMT) (Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Cho et al., 2014b) is a newly emerging approach to MT. On the one hand, DNNs represent language in a continuous vector space, which eases the modelling of semantic similarities (or distances) between phrases or sentences. On the other hand, they can also consider contextual information, e.g., utilizing all available history information in deciding the next target word, which is not an easy task to model with standard APE systems.
Unlike phrase-based APE systems (Simard et al., 2007a; Simard et al., 2007b; Pal, 2015), our NNAPE system builds and trains a single, large neural network that accepts a 'draft' translation ($TL_{MT}$) and outputs an improved translation ($TL_{PE}$).
The remainder of the paper is organized as follows. Section 2 gives an overview of relevant related work. The proposed NNAPE system is described in detail in Section 3. We present the experimental setup in Section 4. Section 5 presents the results of automatic and human evaluation together with some analysis. Section 6 concludes the paper and provides avenues for future work.

Related Work
APE has proved to be an effective remedy to some of the inaccuracies in raw MT output. APE approaches cover a wide methodological range. Simard et al. (2007a) and Simard et al. (2007b) applied SMT for post-editing, handling the repetitive nature of errors typically made by rule-based MT systems. Lagarda et al. (2009) used statistical information from the trained SMT models for post-editing of rule-based MT output. Rosa et al. (2012) and Mareček et al. (2011) applied a rule-based approach to APE on the morphological level. Denkowski (2015) developed a method for real-time integration of post-edited MT output into the translation model by extracting a grammar for each input sentence. Recent studies have even shown that, in some cases, MT plus PE can exceed human translation in quality (Fiederer and O'Brien, 2009; Koehn, 2009; DePalma and Kelly, 2009) as well as in productivity (Zampieri and Vela, 2014).
Recently, a number of papers have presented the application of neural networks in MT (Kalchbrenner and Blunsom, 2013; ?; Cho et al., 2014b; Bahdanau et al., 2014). These approaches typically consist of two components: an encoder that encodes a source sentence and a decoder that decodes it into a target sentence.
In this paper we present a neural network based approach to automatic PE (NNAPE). Our NNAPE model is inspired by the MT work of Bahdanau et al. (2014), which is based on bidirectional recurrent neural networks (RNN). Unlike Bahdanau et al. (2014), we use LSTMs rather than GRUs as hidden units. RNNs allow processing of arbitrary-length sequences; however, they are susceptible to the problem of vanishing and exploding gradients (Bengio et al., 1994). To tackle vanishing gradients in RNNs, two architectures are generally used: gated recurrent units (GRU) (Cho et al., 2014b) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). According to empirical studies (Chung et al., 2014; Józefowicz et al., 2015), both architectures yield comparable performance. GRUs tend to train faster than LSTMs. On the other hand, given sufficient amounts of training data, LSTMs may lead to better results. Since our task is monolingual and we have more than 200K sentence pairs for training, we use a full LSTM (as the hidden units) to model our NNAPE system.
The model takes $TL_{MT}$ as input and provides $TL_{PE}$ as output. To the best of our knowledge, the work presented in this paper is the first approach to APE using neural networks.

Neural Network based APE
The NNAPE system is based on a bidirectional (forward-backward) RNN based encoder-decoder.

A Bidirectional RNN APE Encoder-Decoder
Our NNAPE model encodes a variable-length sequence of $TL_{MT}$ (e.g. $x = x_1, x_2, x_3, \ldots, x_m$) into a fixed-length vector representation and then decodes this fixed-length vector representation back into a variable-length sequence of $TL_{PE}$ (e.g. $y = y_1, y_2, y_3, \ldots, y_n$). Input and output sequence lengths, $m$ and $n$, may differ. A bidirectional RNN encoder consists of forward and backward RNNs. The forward RNN encoder reads in each $x$ sequentially from $x_1$ to $x_m$ and at each time step $t$, the hidden state $h_t$ of the RNN is updated by using a non-linear activation function $f$ (Equation 1), an elementwise logistic sigmoid with an LSTM unit.
$$h_t = f(h_{t-1}, x_t) \qquad (1)$$
Similarly, the backward RNN encoder reads the input sequence and calculates hidden states in reverse direction (i.e. $x_m$ to $x_1$ and $h_m$ to $h_1$, respectively). After reading the entire input sequence, the hidden state of the RNN provides a summary context vector $c$ ('C' in Figure 1) of the whole input sequence. The decoder is another RNN trained to generate the output sequence by predicting the next word $y_t$ given the hidden state $\eta_t$ and the context vector $c_t$ (cf. Figure 1). The hidden state of the decoder at time $t$ is computed as given below.
$$\eta_t = f(\eta_{t-1}, y_{t-1}, c_t)$$
The context vector $c_t$ can be computed as
$$c_t = \sum_{i=1}^{m} \alpha_{ti} h_i$$
Here, $\alpha_{ti}$ is the weight of each $h_i$ and can be computed as
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{m} \exp(e_{tj})}$$
where $e_{ti} = a(\eta_{t-1}, h_i)$ is an alignment model which provides a matching score between the inputs around position $i$ and the output at position $t$. The alignment score is based on the $i$-th annotation $h_i$ of the input sentence and the RNN hidden state $\eta_{t-1}$. The alignment model itself is a feedforward neural network which directly computes a soft alignment that allows the gradient of the cost function to be backpropagated through it. The gradient is used to train the alignment model and the $TL_{MT}$-$TL_{PE}$ translation model jointly.
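The alignment model is computed $m \times n$ times as follows:
$$a(\eta_{t-1}, h_i) = v_a^{\top} \tanh(W_a \eta_{t-1} + U_a h_i)$$
where $W_a \in \mathbb{R}^{n_h \times n_h}$, $U_a \in \mathbb{R}^{n_h \times 2n_h}$ and $v_a \in \mathbb{R}^{n_h}$ are the weight matrices of $n_h$ hidden units.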
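The attention step described above can be illustrated with a small sketch. It assumes the additive feedforward scoring form of Bahdanau et al. (2014), $v_a^{\top} \tanh(W_a \eta_{t-1} + U_a h_i)$, with softmax-normalised weights. The function names, toy dimensions ($n_h = 2$) and weight values are hypothetical and are not taken from the NNAPE implementation.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def alignment_score(eta_prev, h_i, W_a, U_a, v_a):
    """Assumed scoring form: e_ti = v_a^T tanh(W_a eta_{t-1} + U_a h_i)."""
    pre = [a + b for a, b in zip(matvec(W_a, eta_prev), matvec(U_a, h_i))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, pre))

def softmax(scores):
    """Normalise scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(eta_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_t, c_t) for a single decoder step t."""
    scores = [alignment_score(eta_prev, h, W_a, U_a, v_a) for h in annotations]
    alpha = softmax(scores)
    dim = len(annotations[0])
    # c_t is the alpha-weighted sum of the encoder annotations h_i.
    c_t = [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(dim)]
    return alpha, c_t

# Toy example (hypothetical values): n_h = 2, so each annotation h_i has
# 2 * n_h = 4 components (forward and backward states concatenated).
W_a = [[0.1, -0.2], [0.3, 0.4]]
U_a = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.2, 0.1, -0.3]]
v_a = [0.5, -0.5]
annotations = [
    [1.0, 0.0, 0.5, -0.5],
    [0.0, 1.0, -0.5, 0.5],
    [0.5, 0.5, 0.0, 0.0],
]
eta_prev = [0.2, -0.1]

alpha, c_t = context_vector(eta_prev, annotations, W_a, U_a, v_a)
```

In a full decoder this computation would be repeated for every output position, which is why the alignment model is evaluated $m \times n$ times per sentence pair.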

Experiments
We evaluate the model on an English-Italian APE task, which is detailed in the following subsections.

Data
The training data used for the experiments was developed in the MateCat project and consists of 312K $TL_{MT}$-$TL_{PE}$ parallel sentences. The parallel sentences are (English to) Italian MT output and their corresponding (human) post-edited Italian translations. Google Translate (GT) is the MT engine which provided the original Italian $TL_{MT}$ output. The data includes sentences from the Europarl corpus as well as news commentaries. Since the data contains some non-Italian sentences, we applied automatic language identification (Shuyo, 2010) in order to select only Italian sentences. Automatic cleaning and pre-processing of the data was carried out by sorting the entire parallel training corpus based on sentence length, filtering the parallel data on a maximum allowable sentence length of 80 and a sentence length ratio of 1:2 (either direction), removing duplicates and applying tokenization and punctuation normalization using Moses (Koehn et al., 2007) scripts. After cleaning the corpus we obtained a sentence-aligned $TL_{MT}$-$TL_{PE}$ parallel corpus containing 213,795 sentence pairs. We randomly extracted 1000 sentence pairs each for the development set and test set from the pre-processed parallel corpus and used the remaining 211,795 pairs as the training corpus. The training data features 57,568 and 61,582 unique word types in $TL_{MT}$ and $TL_{PE}$, respectively. We chose the 40,000 most frequent words from both $TL_{MT}$ and $TL_{PE}$ to train our NNAPE model. The remaining words, which are not among the most frequent words, are replaced by a special token ([UNK]). The model was trained for approximately 35 days, which is equivalent to 2,000,000 updates with GPU settings.
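The cleaning and vocabulary steps described above can be sketched roughly as follows. This is a minimal illustration, not the scripts used in the experiments. It assumes the text has already been tokenized (tokens separated by whitespace), that duplicates are removed at the level of whole sentence pairs, and that each side gets its own vocabulary. All function names are hypothetical.

```python
from collections import Counter

MAX_LEN = 80      # maximum allowable sentence length in tokens (from the text)
MAX_RATIO = 2.0   # sentence length ratio of 1:2 in either direction (from the text)
UNK = "[UNK]"     # special token for out-of-vocabulary words

def keep_pair(mt_tokens, pe_tokens):
    """Return True if a (MT, PE) pair passes the length and ratio filters."""
    len_mt, len_pe = len(mt_tokens), len(pe_tokens)
    if len_mt == 0 or len_pe == 0:
        return False
    if len_mt > MAX_LEN or len_pe > MAX_LEN:
        return False
    return max(len_mt, len_pe) / min(len_mt, len_pe) <= MAX_RATIO

def clean_corpus(pairs):
    """Drop exact duplicate pairs (an assumption) and filtered-out pairs."""
    seen, cleaned = set(), []
    for mt, pe in pairs:
        if (mt, pe) in seen:
            continue
        seen.add((mt, pe))
        if keep_pair(mt.split(), pe.split()):
            cleaned.append((mt, pe))
    return cleaned

def build_vocab(sentences, size):
    """Return the set of the `size` most frequent word types."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {tok for tok, _ in counts.most_common(size)}

def replace_rare(sentence, vocab):
    """Replace every token outside the vocabulary with the [UNK] token."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())

# Toy data (invented for illustration only).
pairs = [
    ("la casa è rossa", "la casa è rossa"),
    ("la casa è rossa", "la casa è rossa"),  # exact duplicate, dropped
    ("sì", "sì , certo che lo è"),            # length ratio 1:6, dropped
    ("il gatto", "il gatto nero"),            # length ratio 2:3, kept
]
cleaned = clean_corpus(pairs)
vocab = build_vocab(["a a a b b c"], size=2)
```

With the settings reported in the text, `build_vocab` would be called with `size=40000` once for the $TL_{MT}$ side and once for the $TL_{PE}$ side of the training corpus.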

Experimental Settings
Our bidirectional RNN encoder-decoder contains 1000 hidden units for the forward-backward RNN encoder and 1000 hidden units for the decoder.
The network is basically a multilayer neural network with a single maxout unit as hidden layer (Goodfellow et al., 2013) to compute the conditional probability of each target word. The word embedding vector dimension is 620 and the size of the maxout hidden layer in the deep output is 500. The number of hidden units in the alignment model is 1000. The model has been trained using mini-batch stochastic gradient descent (SGD) with 'Adadelta' (Zeiler, 2012). The main reason behind the use of 'Adadelta' is to automatically adapt the learning rate of each parameter ($\epsilon = 10^{-6}$ and $\rho = 0.95$). Each SGD update direction is computed using a mini-batch of 80 sentences.
We compare our NNAPE system with state-of-the-art phrase-based (Simard et al., 2007b) as well as hierarchical phrase-based APE (Pal, 2015) systems. We also compare the output provided by our system against the original GT output. For building the phrase-based and hierarchical phrase-based APE systems, we set the maximum phrase length to 7. A 5-gram language model built using KenLM (Heafield, 2011) was used for decoding. System tuning was carried out using both k-best MIRA (Cherry and Foster, 2012) and Minimum Error Rate Training (MERT) (Och, 2003) on the held-out development set (devset). After the parameters were tuned, decoding was carried out on the held-out test set.
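To make the role of $\rho$ and $\epsilon$ in the NNAPE training concrete, the sketch below shows the per-parameter Adadelta update as described by Zeiler (2012), using the values reported in this section. It operates on a flat list of scalar parameters and is only an illustration. The function name and the toy objective are hypothetical and do not come from the NNAPE training code.

```python
import math

RHO = 0.95   # decay rate, as reported in the text
EPS = 1e-6   # smoothing constant, as reported in the text

def adadelta_step(params, grads, acc_grad, acc_delta, rho=RHO, eps=EPS):
    """Apply one Adadelta update in place to a flat list of parameters.

    acc_grad holds the running average of squared gradients and
    acc_delta the running average of squared updates (both start at 0).
    """
    for i, g in enumerate(grads):
        acc_grad[i] = rho * acc_grad[i] + (1.0 - rho) * g * g
        # The step size adapts per parameter: RMS of past updates
        # divided by RMS of past gradients.
        delta = -math.sqrt(acc_delta[i] + eps) / math.sqrt(acc_grad[i] + eps) * g
        acc_delta[i] = rho * acc_delta[i] + (1.0 - rho) * delta * delta
        params[i] += delta
    return params

# Toy objective (hypothetical): minimise f(x) = x^2, whose gradient is 2x.
params = [1.0]
acc_grad = [0.0]
acc_delta = [0.0]
for _ in range(100):
    grads = [2.0 * p for p in params]
    adadelta_step(params, grads, acc_grad, acc_delta)
```

No global learning rate appears in the update, which is the property the text refers to: the effective step size for each parameter is derived from its own gradient and update history.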

Evaluation
The performance of the NNAPE system was evaluated using both automatic and human evaluation methods, as described below.

Automatic Evaluation
The output of the NNAPE system on the 1000-sentence test set was evaluated using three MT evaluation metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and Meteor (Denkowski and Lavie, 2011). Table 1 provides a comparison of our neural system performance against the baseline phrase-based APE ($S_1$), baseline hierarchical phrase-based APE ($S_2$) and the original GT output. We use a, b, c, and d to indicate statistical significance over GT, $S_1$, $S_2$ and our NNAPE system (NN), respectively. For example, the $S_2$ BLEU score $63.87^{a,b}$ in Table 1 means that the improvement provided by $S_2$ in BLEU is statistically significant over Google Translate and phrase-based APE. Table 1 shows that $S_1$ provides statistically significant (0.01 < p < 0.04) improvements over GT across all metrics. Similarly, $S_2$ yields statistically significant (p < 0.01) improvements over both GT and $S_1$ across all metrics. The NN system performs best and results in statistically significant (p < 0.01) improvements over all other systems across all metrics. A systematic trend (NN > $S_2$ > $S_1$ > GT) can be observed in Table 1 and the improvements are consistent across the different metrics. The relative performance gain achieved by NN over GT is highest in TER.

Human Evaluation
Human evaluation was carried out by four professional translators, native speakers of Italian, with professional translation experience between one and two years. Since human evaluation is very costly and time consuming, it was carried out on a small portion of the test set consisting of 145 randomly sampled sentences and only compared NN with the original GT output. We used a polling scheme with three different options. Translators chose which of the two outputs (GT or NN) was the better translation or whether there was a tie ('uncertain'). To avoid any bias towards any particular system, the order in which the two system outputs were presented was randomized so that the translators did not know which system they were contributing their votes to.
We analyzed the outcome of the voting process (4 translators each giving 145 votes) and found that the winning NN system received 285 (49.13%) votes compared to 99 (17.07%) votes received by the GT system, while the rest of the votes (196, 33.79%) went to the 'uncertain' option. We measured pairwise inter-annotator agreement between the translators by computing Cohen's κ coefficient (Cohen, 1960), reported in Table 2. The overall κ coefficient is 0.330. According to Landis and Koch (1977), this level of agreement can be interpreted as fair.
Table 2: Pairwise correlation between translators in the evaluation process.
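As an illustration of the pairwise agreement statistic, the sketch below computes Cohen's κ for two annotators who label the same items, using the standard definition $\kappa = (p_o - p_e) / (1 - p_e)$, and recomputes the vote shares from the counts given above. The two annotator label lists are invented toy data, not the votes collected in the evaluation, and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)

# Vote counts reported in the text: 4 translators x 145 sentences.
votes = {"NN": 285, "GT": 99, "uncertain": 196}
total = sum(votes.values())
shares = {name: 100.0 * count / total for name, count in votes.items()}

# Invented toy labels for two annotators over six sentences.
ann_1 = ["NN", "NN", "GT", "uncertain", "NN", "uncertain"]
ann_2 = ["NN", "GT", "GT", "uncertain", "NN", "NN"]
kappa = cohens_kappa(ann_1, ann_2)
```

For four translators this function would be applied to each of the six annotator pairs. How the pairwise values are combined into the overall coefficient is not specified in the text, so that step is left out here.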

Analysis
The results of the automatic evaluation show that NNAPE has advantages over the phrase-based and hierarchical APE approaches. On manual inspection we found that the NNAPE system drastically reduced the preposition insertion and deletion errors in Italian GT output and was also able to handle the improper use of prepositions and determiners (e.g. "states" → "dei stati", "the states" → "gli stati"). The use of a bidirectional RNN makes the model sensitive to context. Moreover, NNAPE captures global reordering by capturing contextual features, which helps to reduce word ordering errors to some extent.

Conclusion and Future Work
The NNAPE system provides statistically significant improvements over existing state-of-the-art APE models and produces significantly better translations than GT, which is very difficult to beat. This enhancement in translation quality through APE should reduce human PE effort. Human evaluation revealed that the NNAPE generated PE translations contain fewer lexical errors, and that NNAPE rectifies erroneous word insertions and deletions and improves word ordering. In future, we would like to test our system in a real-life translation scenario to analyze productivity gains in a commercial environment. We also want to extend the APE system by incorporating source language knowledge into the network and to compare LSTM against GRU hidden units.