The AMU-UEdin Submission to the WMT 2017 Shared Task on Automatic Post-Editing

This work describes the AMU-UEdin submission to the WMT 2017 shared task on Automatic Post-Editing. We explore multiple neural architectures adapted for the task of automatic post-editing of machine translation output. We focus on neural end-to-end models that combine both inputs mt and src in a single neural architecture, modeling {mt, src} → pe directly. Apart from that, we investigate the influence of hard-attention models, which seem to be well-suited for monolingual tasks, as well as combinations of both ideas.


Introduction
During the WMT 2016 APE shared task, two systems relied on neural models: the CUNI system (Libovický et al., 2016) and the shared task winner, the system submitted by the Adam Mickiewicz University (AMU) team (Junczys-Dowmunt and Grundkiewicz, 2016). The AMU submission explored the application of neural translation models to the APE problem and achieved good results by treating different models as components in a log-linear model, allowing for multiple inputs (the source src and the translated sentence mt) that were decoded to the same target language (post-edited translation pe). Two systems were considered, one using src as the input (src → pe) and another using mt as the input (mt → pe). A simple string-matching penalty integrated within the log-linear model was used to control for higher faithfulness with regard to the raw MT output. The penalty fires if the APE system proposes a word in its output that has not been seen in mt. The influence of the components on the final result was tuned with Minimum Error Rate Training (Och, 2003) with regard to the task metric TER.
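The string-matching penalty described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized input and one penalty point per unseen output token; the function name `unseen_word_penalty` is hypothetical, and the 2016 system may have counted differently (for example, over subword units).

```python
def unseen_word_penalty(hypothesis, mt):
    """Count hypothesis tokens that never occur in the raw MT output.

    Assumption: one penalty point per offending token occurrence.
    """
    seen = set(mt)
    return sum(1 for token in hypothesis if token not in seen)


mt = "das ist ein kleines Haus".split()
hypothesis = "das ist ein kleines Gebäude".split()
print(unseen_word_penalty(hypothesis, mt))  # 1
```

In a log-linear model, such a count would be one feature among several, with its weight tuned together with the other component weights.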
With neural encoder-decoder models, and multi-source models in particular, the combination of mt and src can now be achieved in more natural ways than for previously popular phrase-based statistical machine translation (PB-SMT) systems. Despite this, results for multi-source or double-source models in APE scenarios are incomplete or unsatisfying in terms of performance.
In this work, we explore a number of single-source and double-source neural architectures which we believe to be better fits to the APE task than vanilla encoder-decoder models with soft attention. We focus on neural end-to-end models that combine both inputs mt and src in a single neural architecture, modeling {mt, src} → pe directly. Apart from that, we investigate the influence of hard-attention models, which seem to be well-suited for monolingual tasks. Finally, we create combinations of both architectures.
Following Junczys-Dowmunt and Grundkiewicz (2016), we also attempt to generate more artificial data for the task. Instead of relying on filtering towards specific error rates, we generate text with fitting error rates from the start, which allows us to retain more data.
2 Encoder-Decoder Models with APE-specific Attention Models

Standard Attentional Encoder-Decoder
The attentional encoder-decoder model in Marian is a re-implementation of the NMT model in Nematus (Sennrich et al., 2017). The model differs from the standard model introduced by Bahdanau et al. (2014) in several aspects, the most important being the conditional GRU with attention. The summary provided in this section is based on the description in Sennrich et al. (2017). More details on the specific architectures in this shared task submission are given in Junczys-Dowmunt and Grundkiewicz (2017).
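Given the raw MT output sequence $(x_1, \ldots, x_{T_x})$ of length $T_x$ and its manually post-edited equivalent $(y_1, \ldots, y_{T_y})$ of length $T_y$, we construct the encoder-decoder model using the following formulations.

The forward encoder state is computed by a GRU RNN over the embedded input:

$$\overrightarrow{h}_i = \mathrm{GRU}(\overrightarrow{h}_{i-1}, F[x_i]),$$

where $F$ is the encoder embeddings matrix. The GRU RNN cell (Cho et al., 2014) is defined as:

$$\begin{aligned}
\mathrm{GRU}(s, x) &= (1 - z) \odot \underline{s} + z \odot s, \\
\underline{s} &= \tanh(W x + r \odot U s), \\
r &= \sigma(W_r x + U_r s), \\
z &= \sigma(W_z x + U_z s),
\end{aligned} \tag{1}$$

where $x$ is the cell input, $s$ is the previous recurrent state, $W, U, W_r, U_r, W_z, U_z$ are trained model parameters (biases have been omitted), and $\sigma$ is the logistic sigmoid activation function. The backward encoder state is calculated analogously over a reversed input sequence with its own set of trained parameters. Let $h_i$ be the annotation of the source symbol at position $i$, obtained by concatenating the forward and backward encoder RNN hidden states; the set of encoder states $C = \{h_1, \ldots, h_{T_x}\}$ then forms the encoder context.

Decoder initialization
The decoder is initialized with start state $s_0$, computed as the average over all encoder states, mapped into the decoder state space by a trained matrix $W_{init}$:

$$s_0 = \tanh\left(W_{init} \frac{\sum_{i=1}^{T_x} h_i}{T_x}\right).$$

Conditional GRU with attention
We follow the Nematus implementation of the conditional GRU with attention, $\mathrm{cGRU}_{att}$:

$$s_j = \mathrm{cGRU}_{att}(s_{j-1}, E[y_{j-1}], C), \tag{2}$$

where $s_j$ is the newly computed hidden state, $s_{j-1}$ is the previous hidden state, $C$ is the source context, and $E[y_{j-1}]$ is the embedding of the previously decoded symbol $y_{j-1}$.
The conditional GRU cell with attention, $\mathrm{cGRU}_{att}$, has a complex internal structure, consisting of three parts: two GRU layers and an intermediate attention mechanism ATT.
Layer $\mathrm{GRU}_1$ generates an intermediate representation $s'_j$ from the previous hidden state $s_{j-1}$ and the embedding of the previous decoded symbol $E[y_{j-1}]$:

$$s'_j = \mathrm{GRU}_1(s_{j-1}, E[y_{j-1}]).$$

The attention mechanism, ATT, inputs the entire context set $C$ along with the intermediate hidden state $s'_j$ in order to compute the context vector $c_j$ as follows:

$$\begin{aligned}
c_j &= \mathrm{ATT}(C, s'_j) = \sum_{i=1}^{T_x} \alpha_{ij} h_i, \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{kj})}, \\
e_{ij} &= v_a^{\top} \tanh(U_a s'_j + W_a h_i),
\end{aligned}$$

where $\alpha_{ij}$ is the normalized alignment weight between the source symbol at position $i$ and the target symbol at position $j$, and $v_a, U_a, W_a$ are trained model parameters.
Layer $\mathrm{GRU}_2$ generates $s_j$, the hidden state of the $\mathrm{cGRU}_{att}$, from the intermediate representation $s'_j$ and the context vector $c_j$:

$$s_j = \mathrm{GRU}_2(s'_j, c_j).$$

Deep output
Finally, given $s_j$, $y_{j-1}$, and $c_j$, the output probability $p(y_j \mid s_j, y_{j-1}, c_j)$ is computed by a softmax activation as follows:

$$\begin{aligned}
p(y_j \mid s_j, y_{j-1}, c_j) &= \mathrm{softmax}(t_j W_o), \\
t_j &= \tanh(s_j W_{t_1} + E[y_{j-1}] W_{t_2} + c_j W_{t_3}),
\end{aligned}$$

where $W_{t_1}, W_{t_2}, W_{t_3}, W_o$ are trained model parameters. This rather standard encoder-decoder model with attention is our baseline and is denoted as ENCDEC-ATT.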
The following models reuse most parts of the architecture described above wherever possible; most differences occur in the decoder RNN cell and the attention mechanism. The encoders are identical, and so are the deep output layers.
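As a rough illustration of the baseline decoder step, the following NumPy sketch chains a GRU layer, additive soft attention, and a second GRU layer. It is not the Marian or Nematus code: the dimensions, the random initialization, the helper `init_gru`, and the dictionary-based parameter layout are all assumptions made for the example, and biases are omitted.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_gru(d_in, d_state, rng):
    # Hypothetical helper: small random weights, no biases.
    dims = {"W": d_in, "U": d_state, "Wr": d_in, "Ur": d_state,
            "Wz": d_in, "Uz": d_state}
    return {k: 0.1 * rng.standard_normal((d_state, d)) for k, d in dims.items()}


def gru(s, x, p):
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ s)         # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ s)         # update gate
    cand = np.tanh(p["W"] @ x + r * (p["U"] @ s))  # candidate state
    return (1.0 - z) * cand + z * s


def soft_attention(C, s, att):
    # C has shape (T_x, d_ctx); one energy per source position.
    e = np.tanh(att["Ua"] @ s + C @ att["Wa"].T) @ att["va"]
    w = np.exp(e - e.max())                        # numerically stable softmax
    alpha = w / w.sum()
    return alpha @ C, alpha


def cgru_att(s_prev, y_emb, C, p1, att, p2):
    s_mid = gru(s_prev, y_emb, p1)                 # GRU_1: intermediate state
    c, alpha = soft_attention(C, s_mid, att)       # ATT: context vector
    return gru(s_mid, c, p2), c, alpha             # GRU_2: new hidden state


rng = np.random.default_rng(0)
d_emb, d_state, d_ctx, d_att, T_x = 4, 6, 8, 5, 3
C = rng.standard_normal((T_x, d_ctx))
p1, p2 = init_gru(d_emb, d_state, rng), init_gru(d_ctx, d_state, rng)
att = {"Ua": rng.standard_normal((d_att, d_state)),
       "Wa": rng.standard_normal((d_att, d_ctx)),
       "va": rng.standard_normal(d_att)}
W_init = rng.standard_normal((d_state, d_ctx))
s0 = np.tanh(W_init @ C.mean(axis=0))              # decoder start state
s1, c1, alpha1 = cgru_att(s0, rng.standard_normal(d_emb), C, p1, att, p2)
```

The attention weights `alpha1` sum to one over the source positions, and `s1` has the decoder state dimension regardless of the source length.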

Hard Monotonic Attention
Aharoni and Goldberg (2016) introduce a simple model for monolingual morphological reinflection with hard monotonic attention. This model looks at one encoder state at a time, starting with the left-most encoder state and progressing to the right until all encoder states have been processed.
The target word vocabulary $V_y$ is extended with a special step symbol ($V'_y = V_y \cup \{\mathrm{STEP}\}$), and whenever STEP is predicted as the output symbol, the hard attention is moved to the next encoder state. Formally, the hard attention mechanism is represented as a precomputed monotonic sequence $(a_1, \ldots, a_{T_y})$ which can be inferred from the target sequence $(y_1, \ldots, y_{T_y})$ (containing original target symbols and $T_x$ step symbols) as follows:

$$a_1 = 1, \qquad a_j = \begin{cases} a_{j-1} + 1 & \text{if } y_{j-1} = \mathrm{STEP}, \\ a_{j-1} & \text{otherwise.} \end{cases}$$

For a given context $C = \{h_1, \ldots, h_{T_x}\}$, the attended context vector at time step $j$ is simply $h_{a_j}$.

Following the description by Aharoni and Goldberg (2016) for their LSTM-based model, we now adapt the previously described encoder-decoder model to incorporate hard attention. The encoder as well as the output layer of the previous model remain unchanged. Given the sequence of attention indices $(a_1, \ldots, a_{T_y})$, the conditional GRU cell (Eq. 2) used for hidden state updates of the decoder is replaced with a simple GRU cell (Eq. 1), thus removing the soft-attention mechanism:

$$s_j = \mathrm{GRU}(s_{j-1}, [E[y_{j-1}]; h_{a_j}]), \tag{3}$$

where the cell input is now a concatenation of the embedding of the previous target symbol $E[y_{j-1}]$ and the currently attended encoder state $h_{a_j}$. This model is labeled ENCDEC-HARD.

We find this architecture compelling for monolingual tasks that might require higher faithfulness with regard to the input. With hard monotonic attention, the translation algorithm can enforce certain constraints:
1. The end-of-sentence symbol can only be generated if the hard attention mechanism has reached the end of the input sequence, enforcing full coverage;
2. The STEP symbol cannot be generated once the end-of-sentence position in the source has been reached. It is, however, still possible to generate content tokens.
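The attention indices and the two decoding constraints can be sketched in a few lines of Python. The symbol spellings `<step>` and `</s>`, the function names, and the convention that `source_len` is the 1-based index of the source end-of-sentence position are assumptions made for this illustration.

```python
STEP, EOS = "<step>", "</s>"


def attention_indices(target):
    """Derive the 1-based hard attention index for every target position."""
    indices = [1]  # the first encoder state is attended to at the start
    for previous in target[:-1]:
        # Advance only when the previously emitted symbol was a step.
        indices.append(indices[-1] + 1 if previous == STEP else indices[-1])
    return indices


def is_allowed(symbol, a_j, source_len):
    """Check a candidate output symbol against the two decoding constraints."""
    at_end = a_j >= source_len
    if symbol == EOS:
        return at_end        # constraint 1: full coverage before ending
    if symbol == STEP:
        return not at_end    # constraint 2: no stepping past the source end
    return True              # content tokens are always permitted


print(attention_indices(["a", STEP, "x", STEP, "c"]))  # [1, 1, 2, 2, 3]
```

During decoding, a check like `is_allowed` would typically be applied as a mask over the output distribution before the next symbol is chosen.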
Obviously, this model requires a target sequence with correctly inserted STEP symbols. For the described APE task, using the Longest Common Subsequence algorithm (Hirschberg, 1977), we first generate a sequence of match, delete and insert operations which transform the raw MT output $(x_1, \ldots, x_{T_x})$ into the corrected post-edited sequence $(y_1, \ldots, y_{T_y})$. Next, we map these operations to the final sequence of steps and target tokens according to the following rules:
• For each matched pair of tokens x, y we produce the symbols STEP y;
• For each inserted target token y, we produce the same token y;
• For each deleted source token x, we produce STEP;
• Since at initialization of the model $a_1 = 1$, i.e. the first encoder state is already attended to, we discard the first symbol in the new sequence if it is a STEP symbol.
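The mapping rules above can be sketched as follows. This is an illustrative implementation using a plain dynamic-programming LCS table rather than Hirschberg's linear-space variant; the function names, the `<step>` spelling, and the tie-breaking choice (prefer deletion over insertion) are assumptions.

```python
STEP = "<step>"


def lcs_operations(mt, pe):
    """Return match/delete/insert operations turning mt into pe via an LCS table."""
    n, m = len(mt), len(pe)
    # table[i][j] = LCS length of mt[i:] and pe[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if mt[i] == pe[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if mt[i] == pe[j]:
            ops.append(("match", pe[j]))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("delete", mt[i]))
            i += 1
        else:
            ops.append(("insert", pe[j]))
            j += 1
    ops.extend(("delete", token) for token in mt[i:])
    ops.extend(("insert", token) for token in pe[j:])
    return ops


def to_step_sequence(mt, pe):
    """Map edit operations to the step/token target sequence."""
    out = []
    for op, token in lcs_operations(mt, pe):
        if op == "match":
            out.extend([STEP, token])
        elif op == "insert":
            out.append(token)
        else:  # delete: move past the source token without emitting anything
            out.append(STEP)
    # The first encoder state is already attended to at initialization.
    return out[1:] if out and out[0] == STEP else out


print(to_step_sequence("a b c".split(), "a x c".split()))
# ['a', '<step>', 'x', '<step>', 'c']
```

In the example, the substitution of `b` by `x` is expressed as a deletion (a bare step) followed by an insertion, so the corrected token is emitted while the model attends to the position of the replaced source token.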

Hard and Soft Attention
While the hard attention model can be used to enforce faithfulness to the original input, we would also like the model to be able to look at information anywhere in the source sequence, which is a property of the soft attention model. By re-introducing the conditional GRU cell with soft attention into the ENCDEC-HARD model while also inputting the hard-attended encoder state $h_{a_j}$, we can try to take advantage of both attention mechanisms. Combining Eq. 2 and Eq. 3, we get:

$$s_j = \mathrm{cGRU}_{att}(s_{j-1}, [E[y_{j-1}]; h_{a_j}], C).$$

The rest of the model is unchanged; the translation process is the same as before and we use the same target step/token sequence for training. This model is called ENCDEC-HARD-ATT.

Soft Double-Attention
Neural multi-source models (Zoph and Knight, 2016) seem to be a natural fit for the APE task, as raw MT output and original source language input are available. Although applications to the APE problem have been reported (Libovický and Helcl, 2017), state-of-the-art results seem to be missing.
In this section we give details about our double-source model implementation. We rename the existing encoder $C$ to $C^{mt}$ to signal that the first encoder consumes the raw MT output and introduce a structurally identical second encoder $C^{src} = \{h^{src}_1, \ldots, h^{src}_{T_{src}}\}$ over the source language. To compute the decoder start state $s_0$ for the multi-encoder model, we concatenate the averaged encoder contexts before mapping them into the decoder state space:

$$s_0 = \tanh\left(W_{init}\left[\frac{\sum_{i=1}^{T_{mt}} h^{mt}_i}{T_{mt}}; \frac{\sum_{i=1}^{T_{src}} h^{src}_i}{T_{src}}\right]\right).$$

In the decoder, we replace the conditional GRU with attention with a doubly-attentive cGRU cell (Calixto et al., 2017) over contexts $C^{mt}$ and $C^{src}$:

$$s_j = \mathrm{cGRU}_{2\text{-}att}(s_{j-1}, E[y_{j-1}], C^{mt}, C^{src}).$$

The procedure is similar to the original cGRU, differing only in that, in order to compute the context vector $c_j$, we first calculate context vectors $c^{mt}_j$ and $c^{src}_j$ for each context and then concatenate the results:

$$c_j = [c^{mt}_j; c^{src}_j].$$

This could be easily extended to an arbitrary number of encoders with different architectures.
During training, this model is fed with a tri-parallel corpus; during translation, both input sequences are processed simultaneously to produce the corrected output. This model is denoted as ENCDEC-DOUBLE-ATT.
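A minimal NumPy sketch of the two double-source ingredients, the concatenated start state and the concatenated context vector, is given below. It assumes additive attention with a separate parameter set per encoder; the function names, dimensions, and random weights are illustrative only and are not taken from the submitted system.

```python
import numpy as np


def soft_attention(C, s, Ua, Wa, va):
    """Additive attention over one encoder context C of shape (T, d_ctx)."""
    e = np.tanh(Ua @ s + C @ Wa.T) @ va
    w = np.exp(e - e.max())  # numerically stable softmax
    return (w / w.sum()) @ C


def double_context(C_mt, C_src, s, att_mt, att_src):
    """Attend to both encoders separately, then concatenate the contexts."""
    c_mt = soft_attention(C_mt, s, *att_mt)
    c_src = soft_attention(C_src, s, *att_src)
    return np.concatenate([c_mt, c_src])


def initial_state(C_mt, C_src, W_init):
    """Concatenate the averaged contexts and map them to the decoder space."""
    averaged = np.concatenate([C_mt.mean(axis=0), C_src.mean(axis=0)])
    return np.tanh(W_init @ averaged)


rng = np.random.default_rng(1)
d_state, d_ctx, d_att = 6, 8, 5
C_mt = rng.standard_normal((4, d_ctx))   # 4 MT positions
C_src = rng.standard_normal((7, d_ctx))  # 7 source positions


def make_att():
    # Hypothetical parameter tuple (Ua, Wa, va) for one attention mechanism.
    return (rng.standard_normal((d_att, d_state)),
            rng.standard_normal((d_att, d_ctx)),
            rng.standard_normal(d_att))


s0 = initial_state(C_mt, C_src, rng.standard_normal((d_state, 2 * d_ctx)))
c1 = double_context(C_mt, C_src, s0, make_att(), make_att())
```

Because the two contexts are concatenated, the second GRU layer of the decoder cell receives an input twice as wide as in the single-source model, while the two encoders may have different sequence lengths.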

Hard Attention with Soft Double-Attention
Analogously to the procedure described in section 2.3, we can extend the doubly-attentive cGRU to take the hard-attended encoder context as additional input:

$$s_j = \mathrm{cGRU}_{2\text{-}att}(s_{j-1}, [E[y_{j-1}]; h^{mt}_{a_j}], C^{mt}, C^{src}).$$

In this formulation, only the first encoder context $C^{mt}$ is attended to by the hard monotonic attention mechanism. The target training data consists of the step/token sequences used for all previous hard-attention models. We call this model ENCDEC-HARD+DOUBLE-ATT.

Artificial Data
We also attempt to generate more artificial data for the task. Instead of relying on filtering towards specific error rates, we generate text with fitting error rates from the start, which allows us to retain more data. To obtain the monolingual source data we follow the steps described by Junczys-Dowmunt and Grundkiewicz (2016). Next we train an English-to-German MT system using data from the WMT2016 shared task on IT translation. This system is used to translate its own training data into German. Although the input sentences have been seen during training, the translations are far from perfect.
Next we create an MT system to translate from correct German to imperfect German MT output. This system can now be applied to create raw German MT output from correct German text.
In order to achieve matching TER statistics we use a simple implementation of the Nelder-Mead algorithm for parameter tuning. For unknown reasons, MERT or kb-Mira would not create output with the desired error rates.
Using this system we create a new large set of pseudo-PE data, translating domain-selected monolingual data from German into German pseudo-MT output. The English input is created with a German-to-English phrase-based MT system. We translate about 15 million sentences in this manner, creating new artificial APE triplets.

Training, Development, and Test Data
We perform all our experiments with the official WMT16 (Bojar et al., 2016) APE data, which consists of triplets (src, mt, pe), where src is the original English text, mt is the raw MT output generated by an English-to-German system, and pe is the human post-edited MT output. The MT system used to produce the raw MT output is unknown, and so is the original training data. The task consists of automatically correcting the MT output so that it resembles human post-edited data. The main task metric is TER (Snover et al., 2006), where lower is better, with BLEU (Papineni et al., 2002) as a secondary metric. Table 1 summarizes the data sets used in this work.

To produce our final training data set we oversample the original training data 20 times and add all three artificial data sets (they may overlap). This results in a total of slightly more than 21M training triplets. We keep the development set as a validation set for early stopping and report results on the WMT16 test set. The data is already tokenized; additionally, we truecase all files and apply segmentation into BPE subword units. We reuse the subword units distributed with the artificial data set. For the hard-attention models, we create new target training and development files following the procedure from section 2.2.

Training parameters
All models are trained on the same training data. Models with single input encoders take only the raw MT output (mt) as input; double-encoder models use the raw MT output (mt) and the original source (src). The training procedures and model settings are the same whenever possible:
• All embedding vectors consist of 512 units, the RNN states use 1024 units. We choose a vocabulary size of 40,000 for all inputs and outputs. When hard-attention models are trained, the maximum sentence length is 100 to accommodate the additional step symbols, otherwise 50.
• To avoid overfitting, we use pervasive dropout (Gal, 2015) over GRU steps and input embeddings, with dropout probabilities 0.2, and over source and target words with probabilities 0.2.
• We use Adam (Kingma and Ba, 2014) as our optimizer, with a mini-batch size of 64. All models are trained with Asynchronous SGD (Adam) on three to four GPUs.
• We train all models until convergence (early stopping with a patience of 10 based on dev-set cross-entropy cost), saving model checkpoints every 10,000 mini-batches.
• The best eight model checkpoints w.r.t. dev-set cross-entropy of each training run are averaged element-wise (Junczys-Dowmunt et al., 2016), resulting in new single models with generally improved performance.
• For the multi-source models we repeat the mentioned procedure four times with different randomly initialized weights and random seeds to later form model ensembles.
Training time for one model on four NVIDIA GTX 1080 GPUs or NVIDIA TITAN X (Pascal) GPUs is between two and three days, depending on model complexity.
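The element-wise checkpoint averaging mentioned in the list above can be illustrated with a small pure-Python sketch. The data layout (a list of `(dev_cross_entropy, parameters)` pairs with parameters stored as flat lists of floats) and the function name are assumptions made for the example; real checkpoints would hold tensors.

```python
def average_best_checkpoints(checkpoints, n_best=8):
    """Average the parameters of the n_best checkpoints with the lowest dev cost.

    checkpoints: list of (dev_cross_entropy, params) pairs, where params maps
    a parameter name to a flat list of floats (a hypothetical layout).
    """
    best = sorted(checkpoints, key=lambda item: item[0])[:n_best]
    names = best[0][1].keys()
    return {
        name: [sum(values) / len(best)
               for values in zip(*(params[name] for _, params in best))]
        for name in names
    }


checkpoints = [
    (1.0, {"w": [1.0, 2.0]}),
    (0.5, {"w": [3.0, 4.0]}),
    (2.0, {"w": [100.0, 100.0]}),  # worst dev cost, excluded when n_best=2
]
print(average_best_checkpoints(checkpoints, n_best=2))  # {'w': [2.0, 3.0]}
```

The averaged parameters form a single model, so decoding with it costs the same as decoding with any individual checkpoint.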

Submitted System
We chose an ensemble of four ENCDEC-HARD+DOUBLE-ATT systems (four distinct training runs with different random weight initializations) as our final system. In Table 2, this system is marked as CONTRASTIVE. We also noticed that providing the system output once more as system input to the same system results in a small improvement. This one-time looped system is our primary submission PRIMARY.

Post-submission analysis
[...] Junczys-Dowmunt and Grundkiewicz (2016), discarding the newly created data in this work. To produce our final training data set we oversample the original training data 20 times and add the artificial data sets. This results in a total of slightly more than 5M training triplets. For the hard-attention models, we create new target training and development files following the LCS-based procedure outlined in section 2.2.

Table 3 contains a selection of the most relevant results for the WMT16 APE shared task, both during the task and afterwards. WMT 2016 baseline 1 is the raw uncorrected mt output; baseline 2 is the result of a vanilla phrase-based Moses system (Koehn et al., 2007) trained only on the official 12,000 sentences. Junczys-Dowmunt and Grundkiewicz (2016) is the best system at the shared task. Pal et al. (2017) SYMMETRIC is the currently best reported result on the WMT16 APE test set for a single neural model (single source), whereas Pal et al. (2017) RERANKING, the overall best reported result on the test set, is a system combination of Pal et al. (2017) SYMMETRIC with phrase-based models via n-best list re-ranking.

In Table 4 we present the results for the models discussed in this work. The double-attention models outperform the best WMT16 system and the best currently reported single model, Pal et al. (2017) SYMMETRIC. The ensembles also beat the system combination Pal et al. (2017) RERANKING in terms of TER (though not in terms of BLEU). The simpler double-attention model without hard attention, ENCDEC-DOUBLE-ATT, reaches slightly better results on the test set than its counterpart with added hard attention, ENCDEC-HARD+DOUBLE-ATT, but the situation would have been less clear if only the dev set were used to choose the best model.

Table 3: Results from the literature for the WMT 2016 APE development and test set.

Table 4: Post-submission results; the main task metric is TER (the lower the better).