UdS Submission for the WMT 19 Automatic Post-Editing Task

In this paper, we describe our submission to the English-German APE shared task at WMT 2019. We utilize and adapt an NMT architecture originally developed for exploiting context information to APE, implement this in our own transformer model and explore joint training of the APE task with a de-noising encoder.


Introduction
The Automatic Post-Editing (APE) task is to automatically correct errors in machine translation outputs. This paper describes our submission to the English-German APE shared task at WMT 2019. Based on recent research on the APE task (Junczys-Dowmunt and Grundkiewicz, 2018) and an architecture for the utilization of document-level context information in neural machine translation (Zhang et al., 2018b), we re-implement a multi-source transformer model for the task. Inspired by Cheng et al. (2018), we try to train a more robust model by introducing a multi-task learning approach which jointly trains APE with a de-noising encoder.
We made use of the artificial eSCAPE data set provided for the task, since the multi-source transformer model contains a large number of parameters, and training with large amounts of supplementary synthetic data can help regularize its parameters and make the model more general. We then computed BLEU scores between the machine translation outputs and the corresponding gold-standard post-editing results on the original development set, the training set and the synthetic data, as shown in Table 1. Table 1 shows that there is a significant gap between the synthetic eSCAPE data set and the real-life data sets (the development set and the original training set from post-editors). One potential reason is that the synthetic data set was generated in a different way compared to Junczys-Dowmunt and Grundkiewicz (2016). Another is that very few post-editing actions are normally required given the good translation quality of neural machine translation (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017), which significantly reduces errors in machine translation results and makes the post-editing results quite similar to the raw machine translation outputs.

Our Approach
We simplify and employ a multi-source transformer model (Zhang et al., 2018b) for the APE task, and try to train a more robust model through multi-task learning.

Our Model
The transformer-based model proposed by Zhang et al. (2018b) for utilizing document-level context information in neural machine translation has two source inputs. These can equally be a source sentence and the corresponding machine translation output, so the model meets the requirements of APE. Since both the source sentence and the machine translation output are important for the APE task (Pal et al., 2016; Vu and Haffari, 2018), we remove the context gate used in their architecture to restrict the information flow from the first input to the final output, and obtain the model we used for our submission, shown in Figure 1.
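As a rough illustration of the two-input design, the following sketch shows the data flow of one layer of the machine translation encoder: self-attention over the machine translation output, followed by cross-attention to the encoded source. It is a simplified toy under stated assumptions, not the submitted implementation: it uses a single attention head, omits the learned projections, layer normalization and feed-forward sub-layer, and does not model the context gate. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def residual(inputs, updates):
    """Element-wise sum of two equally shaped sequences of vectors."""
    return [[a + b for a, b in zip(x, u)] for x, u in zip(inputs, updates)]

def mt_encoder_layer(mt_states, src_states):
    """One simplified layer of the machine translation encoder (assumption)."""
    # Self-attention over the machine translation output.
    hidden = residual(mt_states, attend(mt_states, mt_states, mt_states))
    # Cross-attention to the encoded source, added through a plain residual.
    return residual(hidden, attend(hidden, src_states, src_states))

src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoded source, 3 tokens
mt = [[0.5, 0.5], [1.0, -1.0]]              # toy MT embeddings, 2 tokens
edited = mt_encoder_layer(mt, src)          # one vector per MT token
```

The output keeps one vector per machine translation token, each now mixed with information from the source sentence.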
The model first encodes the given source sentence with stacked self-attention layers, then "post-edits" the corresponding machine translation result by repeatedly encoding the machine translation result (with a self-attention layer) and attending to the encoded source sentence (with a cross-attention layer).
Compared to the multi-source transformer model used by Junczys-Dowmunt and Grundkiewicz (2018), this architecture has one more cross-attention module in the encoder, which lets the machine translation output attend to the source input. This makes sharing layer parameters between the two encoders impossible, but we think this cross-attention module can help the de-noising task. The embedding of the source, machine translation outputs and post-editing results is still shared, as Junczys-Dowmunt and Grundkiewicz (2018) advised.

Joint Training with De-noising Encoder
Table 1 shows a considerable difference between the synthetic data set and the real data set. To enable the model to handle more kinds of errors, we simulate new "machine translation outputs" by adding noise to the corresponding post-editing results. Following Cheng et al. (2018), we add noise directly to the look-up embedding of post-editing results instead of manipulating post-editing sequences.
Since the transformer (Vaswani et al., 2017) does not apply any weight regularization, we assume that the model can easily learn to reduce noise by enlarging weights, and propose to add adaptive noise to the embedding:
$$\mathrm{emb}_{out} = \mathrm{emb} + \mathit{strength} \cdot |\mathrm{emb}| \odot N \quad (1)$$
where emb is the embedding matrix, strength is a number in [0.0, +∞) that controls the strength of the noise, and N is a noise matrix of the same shape as emb. We explore both the standard Gaussian distribution and the uniform distribution over [−1.0, 1.0] for N. In this way, the noise automatically grows with the growing embedding weights.
Given that the transformer translation model (Vaswani et al., 2017) incorporates word order information through adding positional embedding to word embedding, we add noise to the combined embedding. In this case, the noise can both affect the word embedding (replacing words with their synonyms) and positional embedding (swapping word orders).
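The adaptive noise can be sketched as follows. This is a minimal illustration, assuming the noise is scaled element-wise by the absolute value of the (combined word and positional) embedding so that it grows with the embedding weights; the function name `add_adaptive_noise` and its parameters are hypothetical, and a real implementation would operate on tensors rather than nested lists.

```python
import random

def add_adaptive_noise(emb, strength=0.2, kind="gaussian", rng=random):
    """Return emb + strength * |emb| * N, applied element-wise.

    emb is a list of embedding vectors (lists of floats). N is drawn from a
    standard Gaussian or from a uniform distribution over [-1.0, 1.0].
    Scaling by |emb| is an assumption made for this sketch.
    """
    noisy = []
    for row in emb:
        new_row = []
        for weight in row:
            if kind == "gaussian":
                sample = rng.gauss(0.0, 1.0)
            else:
                sample = rng.uniform(-1.0, 1.0)
            new_row.append(weight + strength * abs(weight) * sample)
        noisy.append(new_row)
    return noisy

rng = random.Random(0)  # fixed seed so the example is reproducible
combined = [[0.5, -2.0, 0.0], [1.0, 4.0, -0.1]]  # toy word + positional embedding
noisy = add_adaptive_noise(combined, strength=0.2, kind="uniform", rng=rng)
```

With uniform noise, each perturbed value stays within `strength * |weight|` of the original, so larger embedding weights receive proportionally larger noise and zero entries are left unchanged.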
During training, we use the same model, and achieve joint training by randomly varying inputs: the inputs for the APE task are {source, mt, pe}, while those for the de-noising encoder task are {source, pe+noise, pe}, where "source", "mt" and "pe" stand for the source sentence, the corresponding output from the machine translation system and the correct post-editing result. The final loss for joint training combines the losses of the two tasks, with the weight λ balancing the loss of the APE task against that of the de-noising encoder task.
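The input selection and loss balancing can be illustrated with a short sketch. Two points are assumptions made only for illustration: the two tasks are chosen with a fixed probability `p_denoise`, and the losses are combined as a convex combination weighted by λ (`lam`). The exact form used for the submission may differ, and all names here are hypothetical.

```python
import random

def pick_task_inputs(source, mt, pe, rng=random, p_denoise=0.5):
    """Randomly choose between the APE inputs and the de-noising inputs."""
    if rng.random() < p_denoise:
        # De-noising encoder task: {source, pe+noise, pe}.
        return {"task": "denoise", "source": source,
                "second_input": pe, "add_noise": True, "target": pe}
    # APE task: {source, mt, pe}.
    return {"task": "ape", "source": source,
            "second_input": mt, "add_noise": False, "target": pe}

def joint_loss(loss_ape, loss_denoise, lam=0.5):
    """Balance the two task losses with lam (assumed convex combination)."""
    return lam * loss_ape + (1.0 - lam) * loss_denoise

batch = pick_task_inputs(["a", "house"], ["ein", "Haus"], ["ein", "Haus"],
                         rng=random.Random(0))
total = joint_loss(2.0, 4.0, lam=0.5)
```

In both branches the target is the post-editing result; only the second encoder input changes.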

Experiments
We implemented our approaches based on the Neutron implementation (Xu and Liu, 2019) for transformer-based neural machine translation.

Data and Settings
We only participated in the English to German task, and we used both the training set provided by WMT and the synthetic eSCAPE corpus. We first re-tokenized 1 and truecased both data sets with tools provided by Moses (Koehn et al., 2007), then cleaned the data sets with scripts ported from the Neutron implementation, and up-sampled the original training set 20 times as in Junczys-Dowmunt and Grundkiewicz (2018). We applied joint Byte-Pair Encoding (Sennrich et al., 2016) with 40k merge operations and a vocabulary threshold of 50. We only kept sentences with at most 256 sub-word tokens for training, and obtained a training set of about 6.5M triples with a shared vocabulary of size 42,476. We did not apply any domain adaptation approach for our submission, considering that Junczys-Dowmunt and Grundkiewicz (2018) report few improvements from it, but advanced domain adaptation (Wang et al., 2017) or fine-tuning (Luong and Manning, 2015) methods may still bring some improvements. The training set was shuffled for each training epoch.
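The up-sampling and length filtering can be sketched as below. This is an illustration only: it assumes each example is a (source, mt, pe) triple of sub-word token lists and that the length limit applies to all three sequences; `build_training_set` is a hypothetical helper, not one of the Neutron scripts.

```python
def build_training_set(real_triples, synthetic_triples, upsample=20, max_len=256):
    """Filter over-long triples and up-sample the real training data."""
    def short_enough(triple):
        # Assumption: every sequence of the triple must respect the limit.
        return all(len(seq) <= max_len for seq in triple)

    data = [t for t in real_triples if short_enough(t)] * upsample
    data += [t for t in synthetic_triples if short_enough(t)]
    return data

real = [(["a"], ["b"], ["c"])]
synthetic = [(["x"] * 300, ["y"], ["z"]), (["x"], ["y"], ["z"])]
train = build_training_set(real, synthetic)
```

Here the single real triple is repeated 20 times and the over-long synthetic triple is dropped.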
Like Junczys-Dowmunt and Grundkiewicz (2018), all embedding matrices were tied to the weight of the classifier. For tokens in the shared vocabulary that never appear in post-editing outputs, we additionally removed their weights in the label smoothing loss and set the corresponding biases in the decoder classifier to $-10^{32}$.
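The bias masking can be illustrated with a small sketch. It assumes tokens are integer ids into the shared vocabulary; `masked_biases` is a hypothetical helper, and in practice special tokens such as the end-of-sentence symbol would have to be counted as appearing in the post-editing outputs.

```python
def masked_biases(vocab_size, pe_sequences, bias=None, mask_value=-1e32):
    """Set the classifier bias of tokens unseen in post-editing outputs."""
    seen = {token for seq in pe_sequences for token in seq}
    bias = [0.0] * vocab_size if bias is None else list(bias)
    # Tokens never observed on the post-editing side get a very negative
    # bias, so the softmax assigns them (numerically) zero probability.
    return [b if i in seen else mask_value for i, b in enumerate(bias)]

biases = masked_biases(5, [[0, 2], [2, 3]])
```

With a vocabulary of 5 ids of which only 0, 2 and 3 occur in post-editing outputs, ids 1 and 4 receive the masking value.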
Unlike Zhang et al. (2018b), the source encoder, the machine translation encoder and the decoder all had 6 layers. The hidden dimension of the position-wise feed-forward neural network was 2048, and the embedding dimension and the multi-head attention dimension were 512. We used a dropout probability of 0.1, and employed a label smoothing (Szegedy et al., 2016) value of 0.1. We used the Adam optimizer (Kingma and Ba, 2015) with 0.9, 0.98 and $10^{-9}$ as $\beta_1$, $\beta_2$ and $\epsilon$. The learning rate schedule from Vaswani et al. (2017) with 8,000 as the number of warm-up steps 2 was applied. We trained our models for only 8 epochs with at least 25k post-editing tokens in a batch, since we observed over-fitting afterwards. For the other hyperparameters, we used the same values as the transformer base model (Vaswani et al., 2017).
1 Using the arguments: -a -no-escape
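For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warm-up and then decays it with the inverse square root of the step number. A minimal sketch with the settings above (model dimension 512, 8,000 warm-up steps) follows; the function name is hypothetical, and actual implementations may multiply the result by an additional constant factor.

```python
def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Learning rate schedule from Vaswani et al. (2017)."""
    step = max(step, 1)  # avoid raising 0 to a negative power at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at the end of warm-up and decays afterwards.
rates = [transformer_lr(s) for s in (4000, 8000, 16000)]
```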
During training, we kept the last 20 checkpoints saved with an interval of 1,500 training steps (Vaswani et al., 2017; Zhang et al., 2018a), and obtained 4 models for each run through averaging every 5 adjacent checkpoints.
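Checkpoint averaging can be sketched as follows, assuming each checkpoint is a mapping from parameter names to flat lists of floats and that the 20 checkpoints are split into non-overlapping groups of 5 adjacent ones. Both helper names are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise over a list of checkpoints."""
    count = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / count
               for i in range(len(values))]
        for name, values in checkpoints[0].items()
    }

def averaged_models(checkpoints, group_size=5):
    """Average every group_size adjacent checkpoints into one model."""
    return [average_checkpoints(checkpoints[i:i + group_size])
            for i in range(0, len(checkpoints), group_size)]

# 20 toy checkpoints, each holding a single one-element parameter "w".
saved = [{"w": [float(k)]} for k in range(20)]
models = averaged_models(saved)
```

With 20 saved checkpoints and groups of 5, this yields the 4 averaged models per run.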
For joint training, we simply used 0.2 as the strength of noise (strength) and 0.5 as λ. Other values may provide better performance, but we did not have sufficient time to try them for our submission.
During decoding, we used a beam size of 4 without any length penalty.

Results
We first evaluated case-sensitive BLEU scores 3 on the development set, and results of all our approaches and baselines are shown in Table 2. "MT as PE" is the do-nothing baseline which takes the machine translation outputs directly as post-editing results.
"Processed MT" is the machine translation outputs after pre-processing (re-tokenizing and truecasing) and post-processing (de-truecasing and re-tokenizing without the "-a" argument 4) but without APE. "Base", "Gaussian" and "Uniform" stand for our model trained only for the APE task, jointly trained with Gaussian noise and jointly trained with uniform noise, respectively. We report the minimum and the maximum BLEU scores of the 4 averaged models for each experiment. "Ensemble x5" is the ensemble of 5 models from joint training: 4 were the averaged models with the highest BLEU scores on the development set, and the fifth was the model with the lowest validation perplexity among those saved after each training epoch.
Table 2 shows that the performance was slightly hurt (comparing "Processed MT" with "MT as PE") by the pre-processing and post-processing procedures which are normally applied in training seq2seq models to reduce vocabulary size. The multi-source transformer (Base) model achieved the highest single-model BLEU score without joint training with the de-noising encoder task. We think this is perhaps because there is a gap between the generated machine translation outputs with noise and the real-world machine translation outputs, which biased the training.
2 https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L1623
3 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
4 "-a" indicates tokenizing in the aggressive mode, which normally helps reduce vocabulary size. The official data sets were tokenized without this argument, so we have to recover our post-editing outputs.

Even with the ensembled model, our APE approach does not significantly improve machine translation outputs measured in BLEU (+0.46). We think human post-editing results may contain valuable information to guide neural machine translation models in a way similar to reinforcement learning. Unfortunately, due to the high quality of the original neural machine translation output, only a small part of the real training data in the APE task are actually corrections from post-editors, and most data are generated by the neural machine translation system, which makes it resemble adversarial training of neural machine translation or multi-pass decoding (Geng et al., 2018).
All our submissions were produced by jointly trained models, because the performance gap between the best and the worst of the jointly trained models is smaller, which suggests that jointly trained models may have smaller variance.
Results on the test set from the APE shared task organizers are shown in Table 3. Even the ensemble of 5 models did not result in significant differences.

Related Work
Pal et al. (2016) applied a multi-source sequence-to-sequence neural model for APE, and Vu and Haffari (2018) jointly trained machine translation with the post-editing sequence prediction task (Berard et al., 2017). Though all previous approaches achieve significant improvements over Statistical Machine Translation outputs, the benefits of APE on top of Neural Machine Translation outputs are not very significant.
On the other hand, advanced neural machine translation approaches may also improve the APE task, such as combining advances of the recurrent decoder, the Evolved Transformer architecture (So et al., 2019), Layer Aggregation (Dou et al., 2018) and Dynamic Convolution structures (Wu et al., 2019).

Conclusion
In this paper, we described the details of our approaches for our submission to the WMT 19 APE task. We borrowed a multi-source transformer model from the context-dependent machine translation task and applied joint training with a de-noising encoder task for our submission.