Automatic Post-Editing of Machine Translation: A Neural Programmer-Interpreter Approach

Automated Post-Editing (PE) is the task of automatically correcting common and repetitive errors found in machine translation (MT) output. In this paper, we present a neural programmer-interpreter approach to this task, resembling the way that humans perform post-editing using discrete edit operations, which we refer to as programs. Our model outperforms previous neural models for inducing PE programs on the WMT17 APE task for German-English by up to +1 BLEU and -0.7 TER.


Introduction
Automatic post-editing (APE) is the task of automatically correcting common, systematic, and repetitive errors found in machine translation (MT) output. APE systems can also be used to adapt general-purpose MT output to specific domains without re-training the MT models, or to incorporate information that is unavailable or expensive to compute at the MT decoding stage. Post-editing is regarded as modifying machine-translated text with minimal labor, rather than re-translating from scratch.
Previous studies in neural APE have primarily concentrated on formalizing APE as a monolingual MT problem in the target language, with or without conditioning on the source sentence (Pal et al., 2016). This MT approach suffers from over-correction, where the APE system performs unnecessary corrections, leading to paraphrasing and degraded output quality. Recent works (Libovický et al., 2016; Berard et al., 2017) have attempted to learn to predict the sequence of post-editing operations, e.g. insertion and deletion, inducing APE programs that turn the machine-translated text into the desired output. Previous program induction approaches suffer from over-cautiousness, where the APE system tends to keep the machine-translated text without any modification.
In this paper, we propose a programmer-interpreter approach to the APE task to address the over-cautiousness problem. Our architecture includes an interpreter module, which executes the previous editing action before the next one is generated. This is in contrast to previous work, where the full program is induced before it is executed. Executing each action immediately at every time step provides a proper conditioning context, based on the actual partially edited sentence, which supports better prediction of the next operation. Moreover, the execution module can be pre-trained on monolingual target text, enabling our architecture to benefit from monolingual data in addition to PE data, which is hard to obtain. Our model is jointly trained on the translation task and the APE program induction task. The multi-task architecture allows the model to reconstruct the source-target alignment of the black-box MT system and inject it into the post-editing task.
We compare our programmer-interpreter architecture against previous work on English-German APE, using the data for this task from WMT16 and WMT17. Compared to previous work on APE program induction, our architecture achieves improvements of up to +1 BLEU and -0.7 TER. Our analysis also shows that the APE programs generated by our model are not only better at correcting errors but also attempt to perform more editing actions.

Related Work
Pal et al. (2016) applied the SEQ2SEQ model to APE. Their monolingual MT model learned to post-edit English-Italian Google Translation output and was able to reduce preposition-related errors. Because it edits the MT output blindly, monolingual APE has difficulty correcting missing words or missing information from the source sentence. Neural multi-source MT architectures have been applied to better capture the connection between the source sentence, the machine-translated text, and the PE output (Libovický et al., 2016; Varis and Bojar, 2017; Junczys-Dowmunt and Grundkiewicz, 2017). Our work is motivated by Ling et al. (2017), who learn to solve an algebraic word problem indirectly by inducing a program that generates the answer together with an explanation. It further builds on recent work on the neural programmer-interpreter (Reed and De Freitas, 2016), where a neural network programmer learns to program an interpreter. The architecture is then trained using expert action trajectories as programs.

The NPI-APE Approach
Given a source sentence $\mathbf{s}$ and a machine-translated sentence $\mathbf{m}$, the goal is to find a post-edited sentence $\hat{\mathbf{t}} = \arg\max_{\mathbf{t}} P_{\text{ape}}(\mathbf{t} \mid \mathbf{m}, \mathbf{s})$, where $P_{\text{ape}}(\cdot)$ is our probabilistic APE model. In our proposed approach, we aim to find an editing action sequence $\mathbf{z}$ to execute in order to generate the desired post-edited sentence: $P_{\text{ape}}(\mathbf{t} \mid \mathbf{m}, \mathbf{s}) = \sum_{\mathbf{z} \in \mathcal{Z}} P_{\text{ape}}(\mathbf{t}, \mathbf{z} \mid \mathbf{m}, \mathbf{s})$.
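We decompose the joint probability of a program and an output as:
$$P_{\text{ape}}(\mathbf{t}, \mathbf{z} \mid \mathbf{m}, \mathbf{s}) = \prod_{i=1}^{|\mathbf{z}|} P_{\text{ape}}(z_i, t_{j_i} \mid \mathbf{z}_{<i}, \mathbf{t}_{\le j_{i-1}}, \mathbf{m}, \mathbf{s}) \quad (1)$$
$$\approx \prod_{i=1}^{|\mathbf{z}|} P_{\text{prog}}(z_i \mid \mathbf{t}_{\le j_{i-1}}, \mathbf{m}, \mathbf{s}) \, P_{\text{intp}}(t_{j_i} \mid m_{k_i}, z_i) \quad (2)$$
where $P_{\text{prog}}(z_i \mid \mathbf{t}_{\le j_{i-1}}, \mathbf{m}, \mathbf{s})$ is the programmer's probability of producing the next edit operation $z_i$ given the post-edited output $\mathbf{t}_{\le j_{i-1}}$ generated by the operations so far, $\mathbf{z}_{\le i-1}$, and $P_{\text{intp}}(t_{j_i} \mid m_{k_i}, z_i)$ is the interpreter's probability of outputting $t_{j_i}$ given the edit operation $z_i$ and the MT word $m_{k_i}$.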
Following Berard et al. (2017), our action sequence is performed on the MT sentence from left to right. At each position, we can take one of the following editing operations: (i) KEEP to keep the word and go to the next word, (ii) DELETE to delete the word and go to the next word, (iii) INSERT(WORD) to insert a new WORD and stay in that position, or (iv) STOP to terminate the process. In other words, the size of the operation set equals the size of the target vocabulary plus three, where we add the symbols KEEP, DELETE, and STOP as new tokens. Furthermore, $j_i$ is the number of KEEP and INSERT(WORD) operations, and $k_i$ is the number of KEEP and DELETE operations, in the sequence of operations $\mathbf{z}_{\le i}$. This hard attention mechanism is the outcome of the semantics of the operations, and injects task knowledge into the model. Moreover, $P_{\text{intp}}(t \mid m, z)$ is 1 if the output word $t$ is consistent with performing the operation $z$ on $m$, and zero otherwise.
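The semantics of the four operations can be made concrete with a small deterministic interpreter. The following is a minimal sketch, not the authors' implementation: the function name `execute` and the convention of representing INSERT(WORD) by the word token itself are assumptions made for illustration (the latter mirrors the remark that the operation set is the target vocabulary plus three symbols, and it assumes no vocabulary word collides with those symbols).

```python
from typing import List, Tuple

KEEP, DELETE, STOP = "KEEP", "DELETE", "STOP"  # the three added symbols


def execute(mt: List[str], ops: List[str]) -> Tuple[List[str], int, int]:
    """Run edit operations left to right over the MT tokens.

    Returns the post-edited tokens, the counter j (number of KEEP and INSERT
    operations, i.e. the output length) and the counter k (number of KEEP and
    DELETE operations, i.e. the number of MT words consumed).
    """
    out: List[str] = []
    k = 0
    for op in ops:
        if op == STOP:
            break
        if op == KEEP:        # copy the current MT word and advance
            out.append(mt[k])
            k += 1
        elif op == DELETE:    # skip the current MT word and advance
            k += 1
        else:                 # assumption: any other token means INSERT(op)
            out.append(op)    # emit the word and stay at the same MT position
    return out, len(out), k


mt = ["click", "on", "the", "the", "button"]
ops = [KEEP, DELETE, KEEP, DELETE, "OK", KEEP, STOP]
print(execute(mt, ops))  # (['click', 'the', 'OK', 'button'], 4, 5)
```

Because every operation either consumes one MT word, emits one output word, or both, the pair of counters fully determines which MT word and which output position the next operation refers to.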
Our decomposition of the joint probability of a program and post-edited output differs from that proposed in (Berard et al., 2017), $P_{\text{ape}}(\mathbf{t}, \mathbf{z} \mid \mathbf{m}, \mathbf{s}) = P_{\text{intp}}(\mathbf{t} \mid \mathbf{z}, \mathbf{m}) \, P_{\text{prog}}(\mathbf{z} \mid \mathbf{m}, \mathbf{s})$. Crucially, in our decomposition (Eqs. 1 and 2), programming and interpreting are interleaved at each position, whereas in (Berard et al., 2017) the programming is fully done before the interpretation phase and the two are independent.
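The interleaving can be illustrated with a toy control loop in which the programmer is queried once per step and always sees the output produced by executing its earlier operations. This is only a sketch under stated assumptions: `run_interleaved`, the `Programmer` callable signature, and the rule-based `toy_programmer` are hypothetical stand-ins for the neural components, and the source sentence is omitted for brevity.

```python
from typing import Callable, List

KEEP, DELETE, STOP = "KEEP", "DELETE", "STOP"

# Hypothetical programmer interface: (partial output, MT tokens, MT position) -> next operation
Programmer = Callable[[List[str], List[str], int], str]


def run_interleaved(mt: List[str], programmer: Programmer, max_steps: int = 100) -> List[str]:
    """Alternate between choosing an operation and executing it."""
    out: List[str] = []
    k = 0
    for _ in range(max_steps):
        op = programmer(out, mt, k)  # conditions on the actual partial output
        if op == STOP:
            break
        if op in (KEEP, DELETE) and k >= len(mt):
            raise ValueError("no MT word left to keep or delete")
        if op == KEEP:
            out.append(mt[k])
            k += 1
        elif op == DELETE:
            k += 1
        else:                        # assumption: any other token is INSERT(op)
            out.append(op)
    return out


def toy_programmer(out: List[str], mt: List[str], k: int) -> str:
    """Toy rule: delete an MT word that would repeat the last output word."""
    if k >= len(mt):
        return STOP
    if out and out[-1] == mt[k]:
        return DELETE
    return KEEP


print(run_interleaved(["the", "the", "button"], toy_programmer))  # ['the', 'button']
```

The toy rule can only be written because the partial output is available at decision time; in a scheme where the whole program is produced before execution, the programmer would have to track that state implicitly.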

Neural Architecture and Joint Training
The architecture consists of three components: (i) a SEQ2SEQ model to translate the source sentence to the target in the forced-decoding mode (MT), (ii) a SEQ2SEQ model to incrementally generate the sequence of edit operations (Action Generator), and (iii) an RNN to summarize the post-edited sequence of words produced from the execution of the actions generated so far (Interpreter). The encoder and decoder of the MT component are a bidirectional and a unidirectional LSTM, whose states are denoted by $\mathbf{h}_l$ and $\mathbf{g}_k$, respectively. Similarly, the encoder and decoder of the AG (Action Generator) component are a bidirectional and a unidirectional LSTM, whose states are denoted by $\mathbf{h}_k$ and $\mathbf{g}_i$, respectively. The states of the unidirectional RNN in the interpreter are denoted by $\mathbf{v}_j$. The next edit operation $z_i$ is generated from the decoder state of the AG, $\mathbf{g}_i$ (see Figure 2), which is computed from the previous state $\mathbf{g}_{i-1}$ and a context including the following: (i) $\mathbf{h}_{k_i}$, where $k_i$ is the index of the MT output word currently processed, (ii) $\mathbf{c}_{k_i}$, which is the context vector from the MT component when generating the current target word, and (iii) $\mathbf{v}_{j_{i-1}}$, which is the last hidden state of the interpreter RNN encoding the post-edited sentence generated so far.
The model is trained jointly for the translation task (SRC→TGT) and for the post-editing task (SRC,TGT→OP,PE). For the training data, we compute the lowest-cost sequence of editing operations (OP) using dynamic programming, where insertion and deletion each cost 1.
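The extraction of a lowest-cost operation sequence can be sketched with a standard suffix-based dynamic program. The code below is an illustrative sketch rather than the authors' procedure: the function name `min_cost_ops`, the assumption that KEEP costs 0 and is allowed only when the two words match, and the tie-breaking order (KEEP, then DELETE, then INSERT) are all choices made here for illustration.

```python
from typing import List, Tuple

KEEP, DELETE, STOP = "KEEP", "DELETE", "STOP"


def min_cost_ops(mt: List[str], pe: List[str]) -> Tuple[List[str], int]:
    """Return a lowest-cost left-to-right operation sequence turning mt into pe."""
    n, m = len(mt), len(pe)
    # cost[i][j] = minimum cost of turning mt[i:] into pe[j:]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue  # both suffixes empty: cost 0
            best = float("inf")
            if i < n and j < m and mt[i] == pe[j]:
                best = cost[i + 1][j + 1]             # KEEP (assumed cost 0)
            if i < n:
                best = min(best, 1 + cost[i + 1][j])  # DELETE, cost 1
            if j < m:
                best = min(best, 1 + cost[i][j + 1])  # INSERT, cost 1
            cost[i][j] = best
    # Read the operations off the table from left to right.
    ops: List[str] = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and mt[i] == pe[j] and cost[i][j] == cost[i + 1][j + 1]:
            ops.append(KEEP)
            i += 1
            j += 1
        elif i < n and cost[i][j] == 1 + cost[i + 1][j]:
            ops.append(DELETE)
            i += 1
        else:
            ops.append(pe[j])  # INSERT(word), represented by the word itself
            j += 1
    ops.append(STOP)
    return ops, cost[0][0]


mt = ["click", "on", "the", "button"]
pe = ["click", "the", "OK", "button"]
print(min_cost_ops(mt, pe))
# (['KEEP', 'DELETE', 'KEEP', 'OK', 'KEEP', 'STOP'], 2)
```

Several operation sequences can share the same minimum cost, so the tie-breaking rule determines which trajectory the model is trained to imitate.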

Experiments
Dataset. We evaluate the proposed approach on the English-to-German (En-De) post-editing task in the IT domain using the data from WMT16 (http://www.statmt.org/wmt16/ape-task.html) and WMT17 (http://www.statmt.org/wmt17/ape-task.html). The official WMT16 and WMT17 datasets contain 12K and 11K post-editing triplets (English, translated German, post-edited German), respectively, in the IT domain. We concatenate them into a set of 23K triplets. A synthetic corpus of 500K triplets (Junczys-Dowmunt and Grundkiewicz, 2016) is also available as additional training data. We run our experiments in two settings, with and without synthetic data, for comparison with Berard et al. (2017).
The RNN in the interpreter component can be thought of as a language model. This paves the way to pre-training it on monolingual text. We collect in-domain IT text from the following sections of OPUS: GNOME, KDE, KDEdoc, OpenOffice, OpenOffice3, PHP and Ubuntu. After tokenizing, filtering out sentences containing special characters, and removing duplicates, we obtain around 170K sentences.
Setup. There are three components in our architecture: machine translation (MT), action generator (AG), and interpreter (LM). We compare our MT+AG+LM architecture against MT+AG (Berard et al., 2017). Furthermore, we compare against monolingual SEQ2SEQ (TGT→PE) as well as multi-source SEQ2SEQ (SRC+TGT→PE) (Varis and Bojar, 2017). The monolingual SEQ2SEQ (TGT→PE) model is an attentional SEQ2SEQ model (Bahdanau et al., 2015) that takes the target sentence as input and outputs the desired PE sentence. In the multi-source SEQ2SEQ (SRC+TGT→PE) model, we use two encoders for the source and target sentences and concatenate their context vectors. In both models, the encoder and decoder each contain a single layer, of bidirectional and unidirectional LSTM respectively. The sizes of the LSTM hidden dimensions and word embeddings in these models are set to 256 and 128, respectively. This ensures almost the same number of parameters (∼13M) in all architectures.
Training. We use a multi-task scenario to jointly train the parameters of the components in the MT+AG as well as the MT+AG+LM models. For the latter, we warm start the embedding of the target words with  All models are trained with SGD, where the learning rate is initialised to 1 and then decayed: by a factor of 0.8 after every epoch for models trained on the official post-editing data, and by a factor of 0.5 every half epoch for models trained with synthetic data. All models use the same vocabulary under the same data condition. The vocabulary size is 30K for the large dataset experiments, and 27K/19K for the 23K/12K data conditions. In all experiments, the best model is selected based on TER on the validation set. For decoding, we use beam search with a beam size of 10.

Results
Table 1 shows the results on different training datasets, comparing our model against the baselines. Original MT is the strong standard do-nothing baseline, i.e. copying the MT translation as the PE output. In all settings, our MT+AG+LM model outperforms the MT+AG and monolingual/multi-source SEQ2SEQ models. Specifically, our model outperforms MT+AG in the 500K+12K training condition by almost 1 BLEU point on test2017.
As expected, the models trained on 23K data perform better than those trained on 12K; further gains are obtained by adding the 500K synthetic data. Interestingly, training the MT+AG and MT+AG+LM models on 23K data leads to better TER/BLEU than training them on 500K+12K. This implies the importance of in-domain training data, as the synthetic corpus is created from the general-domain Common Crawl corpus.

Analysis
We perform a fine-grained analysis of the changes made by our model vs MT+AG in order to understand the sources of improvement. For different data conditions, Table 2 shows the number of sentences modified by each model as well as the sentence-level precision, defined as the fraction of sentences with improved TER. Moreover, it reports the total number of actions generated by the model on sentences with improved TER, as well as the precision of such actions, i.e. the fraction of those observed in the ground-truth action trajectories.
As reported in Berard et al. (2017), one major challenge of predicting action sequences is class imbalance. The model is often too conservative about its edits, and tends to use KEEP far more than the INSERT and DELETE actions. Our Programmer-Interpreter model tackles this problem, as evidenced by its comparable number of modified sentences, but with higher sentence- and action-level precision in almost all cases.
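The sentence-level statistic can be sketched as follows. This is a hedged illustration only: the function name `sentence_precision` is hypothetical, per-sentence TER scores are assumed to be computed elsewhere and passed in, a sentence is assumed to count as modified when the APE output differs from the MT output, and the precision is assumed to be taken over modified sentences.

```python
from typing import List, Tuple


def sentence_precision(
    mt_outputs: List[str],
    ape_outputs: List[str],
    ter_before: List[float],
    ter_after: List[float],
) -> Tuple[int, float]:
    """Return (number of modified sentences, fraction of them with improved TER)."""
    # Assumption: "modified" means the APE output differs from the MT output.
    modified = [i for i, (m, a) in enumerate(zip(mt_outputs, ape_outputs)) if m != a]
    if not modified:
        return 0, 0.0
    # Lower TER is better, so an improvement is a strict decrease.
    improved = sum(1 for i in modified if ter_after[i] < ter_before[i])
    return len(modified), improved / len(modified)


mt = ["a b", "c d", "e f"]
ape = ["a b", "c x", "e g"]
print(sentence_precision(mt, ape, [0.0, 0.5, 0.5], [0.0, 0.25, 0.75]))  # (2, 0.5)
```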

Conclusion
In this paper, we have presented a neural programmer-interpreter approach to automated post-editing of MT output. Our approach interleaves the generation of edit actions by a programmer component with the execution of those actions by an interpreter component. This better captures the history of past actions when generating the next action. Our approach achieves improvements of up to +1 BLEU and -0.7 TER compared to a variant in which programming is not interleaved with execution. Future work includes inducing macro-actions composed of simpler building-block actions.