Neural Automatic Post-Editing Using Prior Alignment and Reranking

We present a second-stage machine translation (MT) system based on a neural machine translation (NMT) approach to automatic post-editing (APE) that improves the translation quality provided by a first-stage MT system. Our APE system (APE_Sym) is an extended version of an attention based NMT model with bilingual symmetry employing bidirectional models, mt–pe and pe–mt. APE translations produced by our system show statistically significant improvements over the first-stage MT, phrase-based APE and the best reported score on the WMT 2016 APE dataset by a previous neural APE system. Re-ranking (APE_Rerank) of the n-best translations from the phrase-based APE and APE_Sym systems provides further substantial improvements over the symmetric neural APE model. Human evaluation confirms that the APE_Rerank generated PE translations improve on the previous best neural APE system at WMT 2016.


Introduction
The ultimate goal of MT systems is to provide fully automatic publishable quality translations. However, existing MT systems often fail to deliver this. To achieve sufficient quality, translations produced by MT systems often need to be corrected by human translators. This task is referred to as post-editing (PE). PE is often understood as the process of improving a translation provided by an MT system with the minimum amount of manual effort (TAUS Report, 2010). Nonetheless, translations produced by MT systems have improved substantially and consistently over the last two decades and are now widely used in the translation and localization industry. To enhance the quality of automatic translation without changing the original MT system itself, an additional plug-in post-processing module, e.g. a second-stage monolingual MT system (an APE system), can be introduced. This may be a more reasonable and feasible solution than rebuilding the first-stage MT system. APE can be defined as an automatic method for improving raw MT output before performing actual human post-editing (Knight and Chander, 1994). APE assumes the availability of source texts (src), the corresponding MT output (mt) and the human post-edited (pe) version of mt. However, APE systems can also be built without src, using only sufficient amounts of target-side "monolingual" parallel mt-pe data. APE tasks usually focus on systematic errors made by first-stage MT systems, acting as an effective remedy for some of the inaccuracies in raw MT output.
APE approaches cover a wide methodological range, such as SMT techniques (Simard et al., 2007a; Simard et al., 2007b; Chatterjee et al., 2015; Pal et al., 2015; Pal et al., 2016d), real-time integration of post-editing in MT (Denkowski, 2015), rule-based approaches to APE (Mareček et al., 2011; Rosa et al., 2012), neural APE (Junczys-Dowmunt and Grundkiewicz, 2016; Pal et al., 2016b), multi-engine and multi-alignment APE (Pal et al., 2016a), etc.
In this paper we present a neural network based APE system to improve raw first-stage MT output quality. Our neural model of APE is based on the work described in Cohn et al. (2016), which implements structural alignment biases in an attention based bidirectional recurrent neural network (RNN) MT model (Bahdanau et al., 2015). Cohn et al. (2016) extend the attentional soft alignment model to traditional word alignment models (IBM models) and to agreement over both translation directions (in our case mt → pe and pe → mt) to ensure better alignment consistency. We follow Cohn et al. (2016) in encouraging our alignment models to be symmetric (Och and Ney, 2003) in both translation directions with embedded prior alignments. Unlike Cohn et al. (2016), we employ prior alignments computed by a hybrid multi-alignment approach. Evaluation results show consistent improvements over the raw first-stage MT system output and over the previous best performing neural APE system (Junczys-Dowmunt and Grundkiewicz, 2016) on the WMT 2016 APE test set. In addition, we show that re-ranking the n-best output from baseline and enhanced PB-SMT APE systems (Section 3) together with our neural APE output provides further statistically significant improvements over all the other systems.
The main contributions of our research are (i) an application of bilingual symmetry of the bidirectional RNN to APE, (ii) the use of a hybrid multi-alignment based approach for the prior alignments, (iii) a way of embedding word alignment information in neural APE by replacing words with word and alignment identifiers, and (iv) the application of re-ranking to the APE task.
The remainder of the paper is structured as follows: Section 2 describes our symmetric neural APE model. Section 3 describes the experimental setup and presents the evaluation results. Section 4 summarizes our work, draws conclusions and presents avenues for future work.

Symmetric Neural Automatic Post Editing Using Prior Alignment
Below we describe bilingual symmetry of bidirectional RNN with embedded prior word alignment for APE.

Hybrid Prior Alignment
The monolingual mt-pe parallel corpus is first word aligned using a hybrid word alignment method that combines three statistical word alignment methods, (i) GIZA++ (Och, 2003) word alignment with the grow-diag-final-and (GDFA) heuristic (Koehn, 2010), (ii) Berkeley word alignment (Liang et al., 2006), and (iii) SymGiza++ (Junczys-Dowmunt and Szał, 2012) word alignment, with two edit distance based word aligners based on Translation Edit Rate (TER) (Snover et al., 2006) and METEOR (Lavie and Agarwal, 2007). We follow the alignment strategy described in Pal et al. (2013) and Pal et al. (2016a). The aligned word pairs are added as additional training examples to train our symmetric neural APE model. Each word in the first-stage MT output is assigned a unique id (sw_id). Each mt-pe word alignment also gets a unique identification number (a_id), and a vector representation is generated for each such a_id. Given a sw_id, the neural APE model is trained to generate a corresponding a_id based on the context in which the sw_id appears. The APE words are generated from the a_id by looking it up in the hybrid prior alignment look-up table (LUT). Neural MT jointly learns alignment and translation. Replacing the source and target words by sw_id and a_id, respectively, implicitly integrates the prior alignment and lessens the burden on the attention model. In addition, our approach bears a resemblance to the sense embedding approach (Li and Jurafsky, 2015), since an embedding is generated for each (sw_id, a_id) pair.
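The identifier scheme described above can be illustrated with a minimal sketch. It assumes one sw_id per MT word type and one a_id per distinct (mt word, pe word) pair; the function names and the toy word pairs are hypothetical and only illustrate the idea, not the actual implementation.

```python
def build_ids(aligned_pairs):
    """Assign a sw_id to each MT word and an a_id to each (mt, pe) alignment pair.

    Returns (sw_ids, a_ids, lut), where lut maps an a_id back to its pe word.
    Assumption: ids are simply assigned in order of first appearance.
    """
    sw_ids = {}  # mt word -> sw_id
    a_ids = {}   # (mt word, pe word) -> a_id
    lut = {}     # a_id -> pe word (the look-up table used at decoding time)
    for mt_word, pe_word in aligned_pairs:
        if mt_word not in sw_ids:
            sw_ids[mt_word] = len(sw_ids)
        pair = (mt_word, pe_word)
        if pair not in a_ids:
            a_ids[pair] = len(a_ids)
            lut[a_ids[pair]] = pe_word
    return sw_ids, a_ids, lut


def decode(a_id_sequence, lut):
    """Map a sequence of predicted a_ids back to pe words via the LUT."""
    return [lut[a] for a in a_id_sequence]


# Toy aligned word pairs (hypothetical); "klicken" aligns to two pe words.
pairs = [("klicken", "klicken"), ("Sie", "Sie"), ("auf", "auf"), ("klicken", "wählen")]
sw_ids, a_ids, lut = build_ids(pairs)
print(decode([a_ids[("klicken", "wählen")], a_ids[("Sie", "Sie")]], lut))
# prints ['wählen', 'Sie']
```

Because one MT word can map to several a_ids, the model still has to choose among them from context, which is where the resemblance to sense embeddings comes from.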

Symmetric Neural APE
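Our symmetric neural APE model (APE_Sym) is inspired by the bilingual symmetry (Cohn et al., 2016) of the bidirectional RNN based MT (Bahdanau et al., 2015). The inferences of both directional attention models are combined. The bidirectional RNN is based on an encoder-decoder architecture, where the first-stage MT output is encoded into a distributed representation, followed by a decoding step which generates the APE translation. The encoder consists of a forward RNN, which reads each input string $x$ sequentially from $x_1$ to $x_{T_x}$ at each time step $i$, and a backward RNN, which reads in the opposite direction, i.e., sequentially from $x_{T_x}$ to $x_1$:

$$\overrightarrow{h}_i = f(\overrightarrow{h}_{i-1}, E x_i), \qquad \overleftarrow{h}_i = f(\overleftarrow{h}_{i+1}, E x_i)$$

with $f$ being an activation function, defined as an elementwise logistic sigmoid with an LSTM unit. The annotation of each input word is the concatenation $h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$. Here, $E \in \mathbb{R}^{m \times k_x}$ is the word embedding matrix of the MT output, $W_r \in \mathbb{R}^{m \times n}$ and $U_r \in \mathbb{R}^{n \times n}$ are the weight matrices of $f$, $m$ is the word embedding dimensionality and $n$ represents the number of hidden units. $k_x$ and $k_y$ correspond to the vocabulary sizes of the source and target languages, respectively. The hidden state of the decoder at time $t$ is computed as $\eta_t = f(\eta_{t-1}, y_{t-1}, c_t)$, where $c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i$ is the context vector. Here, $\alpha_{ti}$ is the weight of each $h_i$ and can be computed as in Equation 1:

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T_x} \exp(e_{tk})} \tag{1}$$

where $e_{ti} = a(\eta_{t-1}, h_i)$ is a word alignment model. Based on the input (mt) and output (pe) sequence lengths, $T_x$ and $T_y$, the alignment model is computed $T_x \times T_y$ times as in Equation 2:

$$a(\eta_{t-1}, h_i) = v_a^{T} \tanh(W_a \eta_{t-1} + U_a h_i) \tag{2}$$

where $W_a \in \mathbb{R}^{m \times n}$, $U_a \in \mathbb{R}^{n \times 2n}$ and $v_a \in \mathbb{R}^{n}$ are the weight matrices of $n$ hidden units. $T$ denotes the transpose of a matrix. Each hidden unit $\eta_t$ is defined in Equation 3, whose parameters are weight matrices. The joint training of the bilingual symmetry models is established using symmetric training with a trace bonus, which is computed as $-\mathrm{tr}(\alpha^{mt \to pe} \, {\alpha^{pe \to mt}}^{T})$. This involves optimizing $L$ as in Equation 4:

$$L = -\log p(pe \mid mt) - \log p(mt \mid pe) + \gamma B \tag{4}$$

where $B$ links the two models as $B = -\sum_{j} \sum_{i} \alpha^{mt \to pe}_{i,j} \, \alpha^{pe \to mt}_{j,i}$, $\gamma$ weights the bonus as in Cohn et al. (2016), and $\alpha$ are alignment (attention) matrices of $T_x \times T_y$ dimensions.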
The advantage of symmetrical alignment cells is that they are normalized using softmax (values between 0 and 1); therefore, the trace term is bounded above by min(T_x, T_y), which corresponds to perfect one-to-one alignments in both directions.
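The trace term and its upper bound can be checked numerically with a small NumPy sketch. The matrix orientation is an assumption made here for illustration: `alpha_mt_pe` has one softmax-normalized row per pe position and `alpha_pe_mt` one row per mt position, so the double sum over α_{i,j} α_{j,i} equals the trace of their product. The function names are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Row-wise softmax, shifted by the maximum for numerical stability."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def trace_term(alpha_mt_pe, alpha_pe_mt):
    """Compute sum_i sum_j alpha_mt_pe[i, j] * alpha_pe_mt[j, i].

    Assumed shapes: alpha_mt_pe is (T_y, T_x), alpha_pe_mt is (T_x, T_y),
    and every row of both matrices sums to 1.
    """
    return float(np.trace(alpha_mt_pe @ alpha_pe_mt))


# Perfect one-to-one alignment in both directions reaches the bound.
print(trace_term(np.eye(3), np.eye(3)))  # prints 3.0

# Random soft alignments for T_y = 4 and T_x = 3 stay below min(T_x, T_y) = 3.
rng = np.random.default_rng(0)
a = softmax(rng.normal(size=(4, 3)))
b = softmax(rng.normal(size=(3, 4)))
assert trace_term(a, b) <= min(3, 4)
```

Since each entry lies between 0 and 1 and each row sums to 1, the double sum can never exceed the number of rows of either matrix, which gives the min(T_x, T_y) bound.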
To train each directional attention model (mt → pe and pe → mt), we follow the work described in Cohn et al. (2016), where absolute positional bias between the MT and PE translation (as in IBM Model 2), fertility relative position bias (as in IBM Models 3, 4, 5) and HMM-based Alignment (Vogel et al., 1996) are incorporated with an attention based soft alignment model.

Experiments and Results
We carried out our experiments on the 12K English-German WMT 2016 APE task training data described in Bojar et al. (2016), and for some experiments we also used the 4.5M artificially developed APE data described in Junczys-Dowmunt and Grundkiewicz (2016). The training data consists of English-German triplets containing source English text (src) from the IT domain, the corresponding German translations (mt) from a first-stage MT system and the corresponding human post-edited version (pe). Development and test data contain 1,000 and 2,000 triplets, respectively.
We considered two baselines: (i) the raw MT output provided by the first-stage MT system serves as Baseline1 (WMT_B1) and (ii) Baseline2 (WMT_B2) is based on statistical APE, a second-stage phrase-based SMT system (Koehn et al., 2007) built using MOSES with default settings and trained on the 12K mt-pe training data.
In addition to the two baselines, we also compared our attention based neural mt-pe symmetric model (APE_Sym) against the best performing system (WMT_Best) in the WMT 2016 APE task and the standard log-linear mt-pe PB-SMT model with hybrid prior alignment as described in Section 2.1 (APE_B2). The APE_B2 and APE_Sym models are trained on 4.55M (4.5M + 12K + pre-aligned word pairs) parallel mt-pe data. The pre-aligned word pairs are obtained from the hybrid prior word alignments (Section 2.1) of the 12K WMT APE training data. For building our APE_B2 system, we set a maximum phrase length of 7 for the translation model, and a 5-gram language model was trained using KenLM (Heafield, 2011). Word alignments between mt and pe (4.5M synthetic mt-pe data + 12K WMT APE data) were established using the Berkeley Aligner (Liang et al., 2006), while word pairs from the hybrid prior alignment (Section 2.1) between mt and pe (12K data) were used as additional training data to build APE_B2. The reordering model was trained with the hierarchical, monotone, swap, left to right bidirectional (hier-mslr-bidirectional) method (Galley and Manning, 2008) and conditioned on both the source and target language. Phrase pairs that occur only once in the training data are assigned an unduly high probability mass (1) in the PB-SMT framework. To compensate for this shortcoming, we smoothed the phrase table using the Good-Turing smoothing technique (Foster et al., 2006). System tuning was carried out using Minimum Error Rate Training (MERT) (Och, 2003).
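The effect of Good-Turing smoothing on singleton phrase pairs can be illustrated with a simplified sketch of the basic count adjustment c* = (c + 1) N_{c+1} / N_c, where N_c is the number of phrase pairs observed c times. This is not the exact phrase table procedure of Foster et al. (2006) or of MOSES; the function name, the fallback for empty N_{c+1} and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def good_turing_counts(pair_counts):
    """Return Good-Turing adjusted counts c* = (c + 1) * N_{c+1} / N_c.

    pair_counts maps a phrase pair to its raw count c.
    Assumption: when no pair has count c + 1, the raw count is kept.
    """
    count_of_counts = Counter(pair_counts.values())  # c -> N_c
    adjusted = {}
    for pair, c in pair_counts.items():
        n_c = count_of_counts[c]
        n_c1 = count_of_counts.get(c + 1, 0)
        adjusted[pair] = (c + 1) * n_c1 / n_c if n_c1 > 0 else float(c)
    return adjusted


# Toy counts (hypothetical): six singletons, two pairs seen twice, one seen three times.
counts = {f"pair{i}": 1 for i in range(6)}
counts.update({"pair6": 2, "pair7": 2, "pair8": 3})
adjusted = good_turing_counts(counts)
print(round(adjusted["pair0"], 3), adjusted["pair6"], adjusted["pair8"])
# prints 0.667 1.5 3.0
```

In this toy case a singleton's count drops from 1 to about 0.67, so a phrase pair seen only once no longer receives the full probability mass.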
To set up our neural network, prior to training the APE_Sym model, we performed a number of preprocessing steps on the mt-pe parallel training data. First, we prepared a LUT containing the mt-pe hybrid prior word alignments (Section 2.1) whose lexical translation probability exceeds a threshold of 0.3. To ensure efficient use of the hybrid prior alignment, we replaced each mt word by a unique identification number (sw_id) and each pe word by a unique alignment identification number (a_id) (cf. Section 2.1). Afterwards, to effectively reduce the number of unknown words to zero, we followed a preprocessing mechanism similar to Junczys-Dowmunt and Grundkiewicz (2016). We built our APE_Sym model with a single-layer LSTM as encoder and a two-layer LSTM as decoder, using 1024 embedding, 1024 hidden and 512 alignment dimensions. Our neural APE model is trained end-to-end using stochastic gradient descent (SGD), allowing up to 20 epochs. The development set was used for regularization by early stopping, which terminated training after 10 epochs. The APE_Sym model maintains bilingual symmetry, and the inferences of both directional models are combined. In a bid to further improve the translation quality, we also performed re-ranking (cf. APE_Rerank in Table 1). For re-ranking, we generated 100-best translations from each participating system (WMT_B2 and APE_B2) along with our APE_Sym model. As with the SMT based APE output, we added log probability features from our neural models. Additionally, we used the following features: n-gram (n = 3...7) language model probability as well as perplexity normalized by sentence length, minimum Bayes risk scores, and mt-pe length ratio. We trained the re-ranking model on the development set using MERT, optimized on BLEU, with the 100 distinct best translations of each participating system. Table 1 provides a comparison of the baselines WMT_B1 and WMT_B2, WMT_Best, APE_B2, APE_Sym and the APE_Rerank system.
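The re-ranking step can be sketched as a log-linear selection over the pooled n-best lists: each hypothesis carries a feature vector, and the hypothesis with the highest weighted sum is chosen. The feature names, values and weights below are hypothetical; in the setup described above the weights would come from MERT on the development set, which this sketch does not implement.

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted feature sum.

    nbest: list of dicts with keys "system", "text" and "features".
    weights: dict mapping feature name -> weight (assumed to be tuned elsewhere).
    """
    def score(hyp):
        return sum(weights[name] * value for name, value in hyp["features"].items())

    return max(nbest, key=score)


# Pooled candidates for one segment (all values are made up for illustration).
nbest = [
    {"system": "APE_B2", "text": "hypothesis A",
     "features": {"model_logprob": -4.0, "lm_logprob": -12.0, "length_ratio": 1.0}},
    {"system": "APE_Sym", "text": "hypothesis B",
     "features": {"model_logprob": -3.0, "lm_logprob": -10.0, "length_ratio": 1.1}},
]
weights = {"model_logprob": 1.0, "lm_logprob": 0.5, "length_ratio": -0.2}

best = rerank(nbest, weights)
print(best["system"])  # prints APE_Sym
```

Keeping the originating system name with each hypothesis makes it straightforward to count how often each system contributes the selected translation.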
Automatic evaluation was carried out in terms of BLEU (Papineni et al., 2002), METEOR and TER. Some general trends can be observed across all metrics. Automatic post-editing, even when trained on a small amount of training data (WMT_B2), provides improvements over raw MT output in general. Additional training data, even artificially generated, helps improve system performance (compare APE_B2 with WMT_B2). Neural MT performs better than PB-SMT based approaches for the post-editing task on large amounts of training data (compare WMT_Best and APE_B2 with WMT_B2). Our APE_Sym system based on Cohn et al. (2016) with hybrid embedded prior word alignment provides the best performance among all the individual APE systems and surpasses the WMT_Best system. The APE_Rerank system performs significantly better than all the individual systems. The scores marked with * in Table 1 indicate statistically significant improvements (p < 0.01), as measured by bootstrap resampling (Koehn, 2004), over the corresponding score in the previous row. We observed that APE_Sym contributed the majority (70.65%) of the translations selected by APE_Rerank. Our re-ranking approach is inspired by Och et al. (2004).
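The significance test can be sketched as paired bootstrap resampling in the spirit of Koehn (2004): test segments are resampled with replacement, and the fraction of resampled sets on which one system beats the other is recorded. As a simplifying assumption, the sketch sums per-segment scores instead of recomputing corpus-level BLEU on each sample; the function name, sample count and toy scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    scores_a, scores_b: per-segment quality scores for the same segments
    (higher is better). Assumption: the set-level score is the sum of
    per-segment scores.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Draw n segment indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples


# Toy case: system A is better on every segment, so it wins every resample.
print(paired_bootstrap([0.6] * 50, [0.5] * 50))  # prints 1.0
```

Under this scheme, a win fraction of at least 0.99 would correspond to significance at the p < 0.01 level.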

Human Evaluation
In order to assess the performance of the APE system, we conducted experiments with human evaluators comparing our best APE system (APE_Rerank) against the WMT 2016 winning APE system (WMT_Best). Human evaluation was carried out using CATaLog Online, an online CAT tool (Pal et al., 2016c). Our human evaluators were 18 undergraduate students enrolled in a Translation Studies programme and attending a translation technologies class that included sessions on MT and MT evaluation. All students were native speakers of German with at least a B2 level of English. During evaluation, students were presented with an English source sentence and two German MT outputs (APE_Rerank and WMT_Best), with the ordering of the MT outputs alternated for each presentation. They had to decide between the two MT outputs by marking the translation they considered to be of better quality in terms of both adequacy and fluency. Each student received a set of 30 sentences for evaluation, with 20 sentences drawn randomly and 10 sentences common to all students, allowing us to compare the distribution of decisions across all sentences and across the 10 common sentences. The outcome of the evaluation is presented in Table 2.

Conclusions and Future Work
In this paper we presented a neural APE model that extends the attention based NMT model to traditional word alignment models and utilizes agreement of bidirectional models for alignment symmetry. The attention models are encouraged to be symmetric in both translation directions. To the best of our knowledge, this is the first work on integrating hybrid prior alignment into NMT. Evaluation results show significant improvements over the first-stage raw MT system. Although the APE_Sym system provided only small (but significant) improvements over the WMT_Best system, re-ranking of the n-best outputs of the multiple APE engines yields large improvements. Human evaluation also revealed the superiority of the APE_Rerank system over the WMT_Best system. As future work, we plan to integrate source knowledge into the neural APE framework. We will also further study the use of standard word alignment information to influence the attention mechanism in neural APE.