Learning to Copy for Automatic Post-Editing

Automatic post-editing (APE), which aims to correct errors in the output of machine translation systems in a post-processing step, is an important task in natural language processing. While recent work has achieved considerable performance gains by using neural networks, how to model the copying mechanism for APE remains a challenge. In this work, we propose a new method for modeling copying for APE. To better identify translation errors, our method learns the representations of source sentences and system outputs in an interactive way. These representations are used to explicitly indicate which words in the system outputs should be copied. Finally, CopyNet (Gu et al., 2016) can be combined with our method to place the copied words in correct positions in post-edited translations. Experiments on the datasets of the WMT 2016-2017 APE shared tasks show that our approach outperforms the best published results.


Introduction
Automatic post-editing (APE) is an important natural language processing (NLP) task that aims to automatically correct errors made by machine translation systems (Knight and Chander, 1994). It can be considered as an efficient way to adapt translations to a specific domain or to incorporate additional information into translations, rather than translating from scratch (McKeown et al., 2012; Chatterjee et al., 2015, 2018). Approaches to APE can be roughly divided into two broad categories: statistical and neural approaches. While early efforts focused on statistical approaches relying on manual feature engineering (Simard et al., 2007; Béchara et al., 2011), neural network based approaches capable of learning representations from data have shown remarkable superiority over their statistical counterparts (Varis and Bojar, 2017; Chatterjee et al., 2017; Junczys-Dowmunt and Grundkiewicz, 2017; Unanue et al., 2018).

src: I ate a cake yesterday
mt:  Ich esse einen Hamburger
pe:  Ich hatte gestern einen Kuchen gegessen

Table 1: Example of automatic post-editing (APE). Given a source sentence (src) and a machine translation (mt), the goal of APE is to post-edit the erroneous translation to obtain a correct translation (pe). Our work aims to explicitly model how to copy words from mt to pe (highlighted in bold), which is a common phenomenon in APE.

Most of these approaches cast APE as a multi-source sequence-to-sequence learning problem (Zoph and Knight, 2016): given a source sentence (src) and a machine translation (mt), APE outputs a post-edited translation (pe). A common phenomenon in APE is that many words in mt can be copied to pe. As shown in Table 1, the two German words "Ich" and "einen" occur in both mt and pe. Note that the positions of copied words in mt and pe are not necessarily identical (e.g., "einen" in Table 1). Our analysis on the datasets of the WMT 2016 and 2017 APE shared tasks shows that over 80% of the words in mt are copied to pe. As APE models not only need to decide which words in mt should be copied, but also need to place the copied words in appropriate positions in pe, it is challenging to model copying for APE. Our experiments show that the state-of-the-art APE method (Junczys-Dowmunt and Grundkiewicz, 2018) only achieves a copying accuracy of 64.63% (see Table 7).
Figure 1: Learning to copy for APE. Our work is based on two key ideas. As both src and mt play important roles in APE, the first idea is that src and mt should "interact" with each other during representation learning to better generate words from src and copy words from mt during inference. The second idea is to predict which target words in mt should be copied, since it is easy to obtain labeled data automatically by comparing mt and pe. The words to be copied (e.g., "Ich") in mt are labeled with 1's and other words (e.g., "esse") with 0's.

We believe that existing approaches to APE (Varis and Bojar, 2017; Chatterjee et al., 2017; Junczys-Dowmunt and Grundkiewicz, 2017; Unanue et al., 2018) suffer from two major drawbacks when modeling the copying mechanism. First, the representations of src and mt are learned separately. APE is a two-source sequence-to-sequence learning problem in which both src and mt play important roles. On the one hand, if src is ignored, it is difficult to identify translation errors related to adequacy in mt, especially for fluent but inadequate translations (e.g., mt in Figure 1). On the other hand, mt serves as a major source for generating pe since many words (e.g., "Ich" and "einen" in Figure 1) are copied from mt to pe. Intuitively, it is likely to be easier to decide which words in mt should be copied if src and mt fully "interact" with each other during representation learning. Although CopyNet (Gu et al., 2016) can be adapted to explicitly model the copying mechanism in multi-source sequence-to-sequence learning, the lack of interaction between src and mt remains a problem. Second, there is no explicit labeling that indicates which target words in mt should be copied. Existing approaches only rely on the attention between the encoder and decoder to implicitly choose target words to be copied. Given mt and pe, it is easy to decide whether a target word in mt should be copied or not. In Figure 1, the words in mt that should be copied are labeled with 1's; other words, which should be re-generated from src, are labeled with 0's. These labels can serve as useful supervision signals to help better copy words from mt to pe, even when CopyNet is used.
In this work, we propose a new method for modeling the copying mechanism for APE. As shown in Figure 1, our work is based on two key ideas. First, our method is capable of learning the representations of the input in an interactive way by enabling src and mt to attend to each other during representation learning. This might be useful for deciding when to generate words from src and when to copy words from mt during post-editing. Second, it is possible to predict which words in mt should be copied because it is easy to automatically construct labeled data by comparing mt and pe. Such predictions can be combined with CopyNet to better model copying for APE. Experiments show that our approach outperforms the best published results on the datasets of the WMT 2016-2017 APE shared tasks.

Multi-source Sequence-to-Sequence Learning
Multi-source sequence-to-sequence learning has been widely used in APE in recent years (Junczys-Dowmunt and Grundkiewicz, 2018; Pal et al., 2018; Tebbifakhr et al., 2018; Shin and Lee, 2018). The architecture of the multi-source Transformer is shown in Figure 2(a). It can be equipped with CopyNet (see Section 2.2) to serve as a baseline in our experiments. It is worth noting that src and mt are encoded separately. Let x = x_1 ... x_I be a source sentence (i.e., src) with I words, ỹ = ỹ_1 ... ỹ_K be a translation output by a machine translation system (i.e., mt) with K words, and y = y_1 ... y_J be the post-edited translation (i.e., pe) with J words. The APE model is given by

P(y | x, ỹ; θ) = ∏_{j=1}^{J} P(y_j | x, ỹ, y_{<j}; θ),    (1)

where y_j is the j-th target word in pe, y_{<j} = y_1 ... y_{j-1} is a partial translation, θ is a set of model parameters, and P(y_j | x, ỹ, y_{<j}; θ) is a word-level translation probability.

Figure 2: (a) The architecture of the multi-source Transformer with CopyNet (Gu et al., 2016) and (b) the architecture of our approach. While the existing work learns the representations of src and mt separately, our approach allows for learning the representations of src and mt in an interactive way by concatenating them as a single input. In addition, our approach introduces a Predictor module to explicitly indicate which words in mt should be copied.
The word-level translation probability in Eq. (1) is computed as

H_src = Encoder_src(x)
H_mt = Encoder_mt(ỹ)
h^pe_j = Decoder(H_src, H_mt, y_{<j})
P(y_j | x, ỹ, y_{<j}; θ) = softmax(h^pe_j W_g),

where Encoder_src(·) is the encoder for src, H_src is the real-valued representation of src, Encoder_mt(·) is the encoder for mt, H_mt is the representation of mt, Decoder(·) is the decoder, and h^pe_j is the representation of the j-th target word y_j in pe. W_g ∈ R^{d×V_y} is a weight matrix, d is the dimension of hidden states, and V_y is the target vocabulary size.
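For concreteness, the following is a minimal PyTorch sketch of such a two-encoder, one-decoder model. It is our illustration rather than the authors' implementation: the module names (MultiSourceAPE, encoder_src, encoder_mt, W_g) are invented for this sketch, positional encodings are omitted, and the decoder simply attends to the concatenation of the two encoders' outputs.

    import torch
    import torch.nn as nn

    class MultiSourceAPE(nn.Module):
        """Illustrative multi-source Transformer: separate encoders for src and mt."""

        def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder_src = nn.TransformerEncoder(enc_layer, num_layers)
            self.encoder_mt = nn.TransformerEncoder(enc_layer, num_layers)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
            self.W_g = nn.Linear(d_model, vocab_size)  # output projection (plays the role of W_g)

        def forward(self, src, mt, pe_prefix):
            H_src = self.encoder_src(self.embed(src))  # representation of src
            H_mt = self.encoder_mt(self.embed(mt))     # representation of mt
            # nn.TransformerDecoder accepts a single memory, so in this sketch we
            # concatenate the two encoder outputs along the length dimension.
            memory = torch.cat([H_src, H_mt], dim=1)
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(pe_prefix.size(1))
            h_pe = self.decoder(self.embed(pe_prefix), memory, tgt_mask=tgt_mask)
            return torch.log_softmax(self.W_g(h_pe), dim=-1)  # word-level log-probabilities

    model = MultiSourceAPE(vocab_size=32000)
    src = torch.randint(0, 32000, (2, 7))
    mt = torch.randint(0, 32000, (2, 6))
    pe = torch.randint(0, 32000, (2, 5))
    print(model(src, mt, pe).shape)  # torch.Size([2, 5, 32000])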
A limitation of the aforementioned model is that src and mt are encoded separately without interacting with each other, which might make it difficult to find which src words are untranslated and which mt words are incorrect. For example, the mt sentence in Figure 1 is fluent and meaningful. Without src, the APE system is unable to identify its translation errors. In addition, the multi-source Transformer does not explicitly model the copying between mt and pe in either the Encoder or the Decoder.

CopyNet
CopyNet (Gu et al., 2016) is a widely used method for modeling copying in sequence-to-sequence learning. It has been successfully applied to single-turn dialogue (Gu et al., 2016), text summarization (See et al., 2017), and grammar error correction (Zhao et al., 2019).
It is possible to extend the multi-source Transformer with CopyNet to explicitly model the copying mechanism, as shown in Figure 2(a). CopyNet defines the word-level translation probability in Eq. (1) as a linear interpolation of copying and generating probabilities:

P(y_j | x, ỹ, y_{<j}; θ) = γ_j P_copy(y_j) + (1 − γ_j) P_gen(y_j),    (6)

where P_copy(y_j) is the copying probability for y_j, P_gen(y_j) is the generating probability for y_j, and γ_j is a gating weight. They are defined using non-linear functions g(·) and u(·); see Zhao et al. (2019) for more details. Copying in APE involves two kinds of decisions: (1) choosing the words in mt to be copied and (2) placing the copied words in appropriate positions in pe. CopyNet makes the two kinds of decisions simultaneously. We conjecture that if the words in mt that should be copied can be explicitly indicated, it might be easier for CopyNet to copy words from mt to pe correctly. Therefore, it is necessary to design a new method for identifying words to be copied.
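As a toy numerical illustration of this interpolation (ours, not the authors' implementation), the snippet below scatters copy attention from mt positions onto vocabulary ids to form P_copy, draws a random generate distribution P_gen and gating weight γ, and mixes them as in Eq. (6); all tensor values are placeholders.

    import torch

    vocab = ["<unk>", "Ich", "esse", "einen", "Hamburger", "hatte", "Kuchen"]
    mt_tokens = ["Ich", "esse", "einen", "Hamburger"]

    p_gen = torch.softmax(torch.randn(len(vocab)), dim=0)          # generate distribution P_gen
    copy_attn = torch.softmax(torch.randn(len(mt_tokens)), dim=0)  # attention over mt positions
    gamma = torch.sigmoid(torch.randn(()))                         # gating weight

    # Scatter the copy attention onto vocabulary ids to obtain P_copy.
    p_copy = torch.zeros(len(vocab))
    for pos, tok in enumerate(mt_tokens):
        p_copy[vocab.index(tok)] += copy_attn[pos]

    # Linear interpolation of copying and generating probabilities (Eq. (6)).
    p_word = gamma * p_copy + (1.0 - gamma) * p_gen
    print({w: round(float(p), 3) for w, p in zip(vocab, p_word)})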

Approach
Figure 2(b) shows the overall architecture of our approach. It differs from previous work in two aspects. First, we propose to let src and mt "interact" with each other to learn better representations (Section 3.1). Second, our approach introduces a Predictor module to predict words to be copied (Section 3.2). Section 3.3 describes how to train our APE model.

Interactive Representation Learning
We propose an interactive representation learning method that makes src and mt attend to each other. Following Lample and Conneau (2019) and He et al. (2018), we concatenate src and mt in the dimension of sentence length with additional position and language embeddings:

X_i = E_token(x_i) + E_pos(i) + E_lang(src)
Ỹ_k = E_token(ỹ_k) + E_pos(k) + E_lang(mt),

where X_i is the embedding of the i-th source word x_i, Ỹ_k is the embedding of the k-th target word ỹ_k, and E_token, E_pos, and E_lang are the token, position, and language embedding matrices. As shown in Figure 3, the representations of src and mt can be learned jointly:

H = Encoder_inter([X; Ỹ]),

where Encoder_inter(·) is the interactive Encoder and [X; Ỹ] is the concatenation of X and Ỹ in the dimension of sentence length. As shown in Figure 2(b), the multi-source Encoders are replaced by the interactive Encoder, which enables src and mt to attend to each other. We expect that enabling the interactions between them can help strengthen the ability of the model to find which words in src are untranslated and which words in mt are correct. Note that interactive representation learning is used in both the Predictor and the Encoder. In the following, we describe how to predict which words in mt should be copied based on these learned representations.

Figure 3: Interactive representation learning. Concatenated to serve as a single input, src and mt attend to each other during representation learning. Note that the learnable weights for src and mt are shared. Interactive representation learning is used in both the Predictor and the Encoder. For example, copying scores are predicted based on the learned representations (see Eq. (13)).
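A minimal PyTorch sketch of this interactive encoding is given below, assuming XLM-style summed token, position, and language embeddings and a single shared Transformer encoder over the concatenated sequence; the class and embedding names are ours, and details such as padding masks and dropout are omitted.

    import torch
    import torch.nn as nn

    class InteractiveEncoder(nn.Module):
        """Illustrative joint encoder over the concatenation of src and mt."""

        def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, max_len=512):
            super().__init__()
            self.E_token = nn.Embedding(vocab_size, d_model)
            self.E_pos = nn.Embedding(max_len, d_model)
            self.E_lang = nn.Embedding(2, d_model)  # 0 = source language, 1 = target language
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)

        def forward(self, src_ids, mt_ids):
            # src_ids: [batch, I], mt_ids: [batch, K]
            x = torch.cat([src_ids, mt_ids], dim=1)  # concatenation along sentence length
            pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
            lang = torch.cat([torch.zeros_like(src_ids), torch.ones_like(mt_ids)], dim=1)
            emb = self.E_token(x) + self.E_pos(pos) + self.E_lang(lang)
            return self.encoder(emb)  # joint representation: src and mt attend to each other

    encoder = InteractiveEncoder(vocab_size=32000)
    src = torch.randint(0, 32000, (2, 7))
    mt = torch.randint(0, 32000, (2, 6))
    print(encoder(src, mt).shape)  # torch.Size([2, 13, 512])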

Predicting Words to be Copied
Given mt and pe, we can label each word in mt as 0 or 1. We use 1 to denote that the word is to be copied (e.g., "Ich" and "einen" in Figure 1) and 0 that it is not to be copied (e.g., "esse" and "Hamburger" in Figure 1). It is possible to use the Longest Common Subsequence (LCS) algorithm (Wagner and Fischer, 1974) to obtain a common subsequence between mt and pe. If a word in mt appears in the common subsequence, it is labeled 1; otherwise, it is labeled 0. We denote these labels as l_1 ... l_K.
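This labeling procedure can be made concrete with a short sketch (ours; token-level details such as sub-word handling are ignored): run the standard LCS dynamic program over mt and pe and mark the mt positions that lie on the backtraced common subsequence.

    def copy_labels(mt, pe):
        """Label each mt word 1 if it lies on an LCS with pe, else 0."""
        K, J = len(mt), len(pe)
        lcs = [[0] * (J + 1) for _ in range(K + 1)]
        for k in range(1, K + 1):
            for j in range(1, J + 1):
                if mt[k - 1] == pe[j - 1]:
                    lcs[k][j] = lcs[k - 1][j - 1] + 1
                else:
                    lcs[k][j] = max(lcs[k - 1][j], lcs[k][j - 1])
        # Backtrack to find which mt positions belong to the common subsequence.
        labels = [0] * K
        k, j = K, J
        while k > 0 and j > 0:
            if mt[k - 1] == pe[j - 1]:
                labels[k - 1] = 1
                k, j = k - 1, j - 1
            elif lcs[k - 1][j] >= lcs[k][j - 1]:
                k -= 1
            else:
                j -= 1
        return labels

    mt = "Ich esse einen Hamburger".split()
    pe = "Ich hatte gestern einen Kuchen gegessen".split()
    print(copy_labels(mt, pe))  # [1, 0, 1, 0]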
We propose a Predictor module to predict the words to be copied. As discrete labels are non-differentiable during training, the Predictor module instead outputs copying scores for the target words in mt:

s = sigmoid([H^pred_{I+1}; ...; H^pred_{I+K}] W_s),    (13)

where s ∈ R^{K×1} is a vector of copying scores corresponding to the K words in mt, H^pred ∈ R^{(I+K)×d} is the representation of src and mt learned by the Predictor in the interactive way described in Section 3.1, i.e.,

H^pred = Predictor([X; Ỹ]),

and W_s ∈ R^{d×1} is a weight matrix. Only the representation of mt (i.e., [H^pred_{I+1}; ...; H^pred_{I+K}]) is used for calculating copying scores. As shown in Figure 2(b), copying scores can be incorporated into three parts of our model: the Encoder, the Decoder, and the CopyNet. Inspired by Yang et al. (2018)'s strategy of integrating localness into self-attention, we propose to incorporate copying scores into our model by modifying the attention weights involved in the aforementioned three modules.
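A small sketch of such a Predictor head follows. It assumes the copying scores are obtained by projecting the mt part of the joint representation to scalars and squashing them into (0, 1) with a sigmoid; the exact non-linearity is our assumption, and the tensors here are random placeholders.

    import torch
    import torch.nn as nn

    d, I, K = 512, 5, 4
    H_pred = torch.randn(1, I + K, d)              # joint representation of src and mt (placeholder)
    W_s = nn.Linear(d, 1, bias=False)              # plays the role of the weight matrix W_s

    H_pred_mt = H_pred[:, I:, :]                   # keep only the K positions belonging to mt
    s = torch.sigmoid(W_s(H_pred_mt)).squeeze(-1)  # copying scores, shape [1, K]
    print(s)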
The original scaled dot-product attention (Vaswani et al., 2017) is defined as

energy = q K^T / √d
Attention(q, K) = softmax(energy),

where q ∈ R^{1×d} is the query vector, K ∈ R^{(I+K)×d} is the key matrix, and energy ∈ R^{1×(I+K)} is the "energy" vector. As shown in Figure 4, the copying scores can be used to form a scaling mask on the attention sub-layer (in practice, we subtract its minimum value from the "energy" vector to keep it non-negative):

Attention(q, K) = softmax(energy ⊙ [m; s]^T),

where m = {1.0}^I is a masking vector corresponding to src and s ∈ R^{K×1} is the vector of copying scores corresponding to mt, calculated by Eq. (13). Note that the copying scores are used only to change the attention weights related to mt, while the src part is left unchanged.
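The following sketch shows one way to realize this scaling mask, under our reading that the shifted, non-negative energies are multiplied element-wise by [m; s] before the softmax; the authors' implementation may differ in detail.

    import torch

    def masked_attention(q, K_mat, s, I, d):
        """q: [1, d] query, K_mat: [I+K, d] keys, s: [K] copying scores for mt."""
        energy = q @ K_mat.t() / d ** 0.5       # scaled dot-product energies, [1, I+K]
        energy = energy - energy.min()          # keep the energies non-negative
        m = torch.ones(I)                       # the src part is left unchanged
        scale = torch.cat([m, s]).unsqueeze(0)  # scaling mask [m; s], [1, I+K]
        return torch.softmax(energy * scale, dim=-1)

    I, K, d = 5, 4, 64
    q, K_mat = torch.randn(1, d), torch.randn(I + K, d)
    s = torch.rand(K)                           # copying scores from the Predictor
    print(masked_attention(q, K_mat, s, I, d))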

Training
The training objective of our approach, L_all(θ), consists of three parts:

L_all(θ) = L(θ) + α L_copy(θ) + λ L_pred(θ),    (18)

where α and λ are hyper-parameters and L(θ) is the standard translation loss derived from Eq. (1). The second part is related to the CopyNet:

L_copy(θ) = − Σ_{k=1}^{K} [ l_k log c_k + (1 − l_k) log(1 − c_k) ],

where l_k is the ground-truth label (see Section 3.2) and c_k is a quantity that measures how likely the k-th word in mt is to be copied by CopyNet. Note that γ × P_copy(y) is the term related to copying the target word y in Eq. (6), from which c_k is computed. The third part is a cross-entropy loss related to the Predictor:

L_pred(θ) = − Σ_{k=1}^{K} [ l_k log s_k + (1 − l_k) log(1 − s_k) ],

where s_k is the copying score of the k-th word ỹ_k in mt.
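A toy sketch of how the three parts might be combined during training is shown below; the per-token translation loss and the two binary cross-entropy terms operate on random placeholder tensors, and the exact computation of c_k is not reproduced here.

    import torch
    import torch.nn.functional as F

    alpha, lam = 0.9, 1.0
    log_probs = torch.log_softmax(torch.randn(6, 100), dim=-1)  # decoder outputs for 6 pe tokens
    pe_ids = torch.randint(0, 100, (6,))                        # reference pe token ids
    labels = torch.tensor([1.0, 0.0, 1.0, 0.0])                 # copy labels l_k for 4 mt words
    c = torch.rand(4)                                           # copy likelihood c_k from CopyNet
    s = torch.rand(4)                                           # copying scores s_k from the Predictor

    loss_mle = F.nll_loss(log_probs, pe_ids)                    # translation loss L(theta)
    loss_copy = F.binary_cross_entropy(c, labels)               # L_copy(theta)
    loss_pred = F.binary_cross_entropy(s, labels)               # L_pred(theta)
    loss_all = loss_mle + alpha * loss_copy + lam * loss_pred   # overall loss, Eq. (18)
    print(float(loss_all))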
Finally, we use an optimizer to find the model parameters that minimize the overall loss function:

θ̂ = argmin_θ L_all(θ).


Experiments

Setup

Datasets
We evaluated our approach on the WMT APE datasets, which distinguish between two tasks: phrase-based statistical machine translation (i.e., PBSMT) and neural machine translation (i.e., NMT). All these APE datasets consist of English-German triplets containing the source text (src), the translation (mt) from a "black-box" MT system, and the corresponding human post-edit (pe). The statistics of the WMT APE datasets are shown in Table 2. In addition to the official datasets, the organizers also recommend using additional artificial datasets.

We used the WMT official dataset for the PBSMT task and the NMT task separately. The artificial training data (Junczys-Dowmunt and Grundkiewicz, 2016) was also used for both tasks. More precisely, we used the concatenation of the official training data and the artificial-small data to learn a truecasing model (Koehn et al., 2007) and to obtain sub-word units using byte-pair encoding (BPE) (Sennrich et al., 2015) with 32k merges. Then, we applied truecasing and BPE to all datasets. We oversampled the official training data 20 times and concatenated it with both the artificial-small and artificial-big datasets (Junczys-Dowmunt and Grundkiewicz, 2018). Finally, we obtained a dataset containing nearly 5M triplets for both tasks. To test our approach on a larger PBSMT dataset, we used the eSCAPE synthetic dataset (Negri et al., 2018), which contains 7.2M sentences. By including the eSCAPE dataset, the training set is enlarged to nearly 12M sentences.

Hyper-Parameter Settings
For the original Transformer model, CopyNet, and our approach, the hidden size was set to 512 and the filter size to 2,048. The number of individual attention heads was set to 8 for multi-head attention. We set N = N_e = N_d = 6 and N_p = 3, and we tied the src, mt, and pe embeddings to save memory. The embeddings and softmax weights were also tied. In training, we used Adam (Kingma and Ba, 2014) for optimization. Each mini-batch contains approximately 25K tokens. We used the learning rate decay policy described by Vaswani et al. (2017). In decoding, the beam size was set to 4. We used the length penalty (Wu et al., 2016) and set its hyper-parameter to 1.0. The other hyper-parameter settings were the same as those of the Transformer model (Vaswani et al., 2017). We implemented our approach on top of the open-source toolkit THUMT (Zhang et al., 2017).
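For reference, the settings above can be summarized as a plain configuration dictionary (the key names are ours, for illustration only):

    config = {
        "hidden_size": 512,
        "filter_size": 2048,
        "num_heads": 8,
        "encoder_layers": 6,       # N_e
        "decoder_layers": 6,       # N_d
        "predictor_layers": 3,     # N_p
        "tie_embeddings": True,    # src, mt, and pe embeddings shared; softmax weights tied
        "optimizer": "Adam",
        "tokens_per_batch": 25000,
        "beam_size": 4,
        "length_penalty": 1.0,
    }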

Evaluation Metrics
We used the same evaluation metrics as the official WMT APE task (Chatterjee et al., 2018): case-sensitive BLEU and TER. BLEU is computed by multi-bleu.perl (Koehn et al., 2007). TER is calculated using TERcom.

Baselines
We compared our approach with the following seven baselines:

1. ORIGINAL: the original mt without any post-editing.

2. COPYNET: the multi-source Transformer equipped with CopyNet (Gu et al., 2016), as described in Section 2.2.

3. NPI-APE (Vu and Haffari, 2018): a neural programmer-interpreter approach.

4. MS UEDIN (Junczys-Dowmunt and Grundkiewicz, 2018): a multi-source Transformer-based APE system that shares the encoders of src and mt. It is the champion of the WMT 2018 APE shared task.
5. USAAR DFKI (Pal et al., 2018): a multi-source Transformer-based APE system with a joint encoder that attends over a combination of two encoded sequences. It is a participant of the WMT 2018 APE shared task.

Table 4: Results on the English-German PBSMT sub-task. "TEST16+17" is the concatenation of "TEST16" and "TEST17". MS UEDIN ensemble and Ours ensemble used ensembles of four models.
6. POSTECH (Shin and Lee, 2018): a multi-source Transformer-based APE system with two encoders. It is a participant of the WMT 2018 APE shared task.
7. FBK (Tebbifakhr et al., 2018): a multi-source Transformer-based APE system with two encoders. It is a participant of the WMT 2018 APE shared task.
We also implemented COPYNET on top of THUMT (Zhang et al., 2017). The results of all other baselines were taken from the corresponding original papers.

Effect of Hyper-parameters
We first investigated the effect of the hyper-parameters α and λ in Eq. (18). As shown in Table 3, using α = 0.9 and λ = 1.0 achieves the best performance in terms of TER and BLEU on the WMT 2016 development set, suggesting that both the Predictor and CopyNet play important roles in our approach. Therefore, we set α = 0.9 and λ = 1.0 in the following experiments.

Main Results
Results on the PBSMT Sub-task

Table 4 shows the results on the PBSMT sub-task. Under the small-data training condition (i.e., 5M), our single model (i.e., System 7) outperforms all single-model baselines on all test sets. The superiority over COPYNET (i.e., System 6) suggests that interactive representation learning and incorporating copying scores are effective in improving APE. Our approach with an ensemble of four models (i.e., System 8) also improves over the best published result (i.e., System 5).
Results on the NMT Sub-task

Table 5 shows the results on the NMT sub-task. Besides ORIGINAL and COPYNET, we also compared our approach with POSTECH (Shin and Lee, 2018), which is the only participating system that released results on the development set of the WMT 2018 NMT sub-task. As POSTECH only used small data for training, the eSCAPE dataset was not used in this experiment for a fair comparison. We find that our approach also outperforms all baselines.

Table 6 shows the results of the ablation study. It is clear that interactive representation learning plays a critical role, since removing it impairs post-editing performance (line 3). As shown in line 4, the Predictor is also an essential part of our approach. CopyNet and joint training are also shown to be beneficial for improving APE (lines 5 and 6), but they seem to have relatively smaller contributions than interactive representation learning and predicting words to be copied.

Results on Prediction Accuracy
The Predictor is an important module in our approach as it predicts which words in mt should be copied. Given the ground-truth labels, it is easy to calculate prediction accuracy by casting the prediction as a binary classification problem. We find that the Predictor achieves a prediction accuracy of 85.09% on the development set.
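Concretely, the prediction accuracy can be computed by thresholding the copying scores and comparing them with the ground-truth labels; the sketch below uses a 0.5 threshold, which is our assumption.

    def prediction_accuracy(scores, labels, threshold=0.5):
        """Fraction of mt words whose thresholded copying score matches the label."""
        preds = [1 if score >= threshold else 0 for score in scores]
        return sum(1 for p, l in zip(preds, labels) if p == l) / len(labels)

    print(prediction_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))  # 1.0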

Comparison of Copying Accuracies
A target word in a machine translation ỹ is said to be correctly copied to an automatically edited translation ŷ if and only if the positions where it occurs in ŷ and in the ground-truth edited translation y are identical. Therefore, it is easy to define a copying accuracy to measure how well the copying mechanism works. Table 7 shows the comparison of copying accuracies between MS UEDIN, COPYNET, and our approach. We find that our approach outperforms the two baselines. However, the copying accuracy of our approach is almost 20% lower than the prediction accuracy (i.e., 65.61% vs. 85.09%), indicating that it is much more challenging to place the copied words in correct positions.
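The metric can be sketched as follows under our reading of the definition above: an mt word that should be copied counts as correct if the positions at which it appears in the system output match its positions in the ground-truth pe; the authors' scorer may differ in details such as tokenization.

    def positions(word, sentence):
        return [i for i, w in enumerate(sentence) if w == word]

    def copying_accuracy(mt, hyp, ref, labels):
        """Fraction of to-be-copied mt words placed at exactly the reference positions."""
        copied = [w for w, l in zip(mt, labels) if l == 1]
        if not copied:
            return 1.0
        correct = sum(1 for w in copied if positions(w, hyp) == positions(w, ref))
        return correct / len(copied)

    mt = "Ich esse einen Hamburger".split()
    ref = "Ich hatte gestern einen Kuchen gegessen".split()
    hyp = "Ich hatte gestern einen Kuchen gegessen".split()
    print(copying_accuracy(mt, hyp, ref, [1, 0, 1, 0]))  # 1.0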

Visualization
We find that words "mit" and "dem" are identified by the Predictor. Accordingly, the attention weights corresponding to these words are decreased since the columns corresponding to these words have lighter color. As a result, all words in mt other than "mit" and "dem" are copied to pe.  Figure 5: Example of the heatmap of attention and copying scores. The x-axis is mt and the y-axis is pe. The Predictor successfully detects the incorrect word "mit" and "dem" and gives low copying scores to these words and then decrease the importance of them in attention. Other words in mt are correctly copied to pe.

Related Work

Multi-source Sequence-to-Sequence Learning
Recently, multi-source Transformer-based APE systems (Junczys-Dowmunt and Grundkiewicz, 2018; Pal et al., 2018; Tebbifakhr et al., 2018; Shin and Lee, 2018) have achieved state-of-the-art results on the datasets of the WMT APE shared tasks. Our work differs from prior studies by enabling interactions between src and mt and by explicitly detecting the words to be copied.

The Copying Mechanism
Zhao et al. (2019) apply CopyNet (Gu et al., 2016) to grammar error correction. Their approach generates labels similar to ours, but only uses them for multi-task learning. Libovický et al. (2016) first introduce CopyNet to APE but do not provide a detailed description of their method and experimental results. We show that interactive representation learning and explicit indication of the words to be copied are important for modeling copying in APE.

Interactive Representation Learning
Niehues et al. (2016) simply concatenate the output of the PBSMT system and the source sentence to serve as the input to the NMT system, without enabling multi-layer interactive learning. Lample and Conneau (2019) use a cross-lingual setting to enable cross-lingual language model pre-training. We propose to let src and mt fully interact with each other to make it easier to decide which words in mt should be copied.

Conclusion
We have presented a new method for modeling the copying mechanism for automatic post-editing. By making the source sentence and machine translation attend to each other, representations learned in such an interactive way help to identify whether a target word should be copied or be re-generated. We also find that explicitly predicting words to be copied is beneficial for improving the performance of post-editing. Experiments show that our approach achieves new state-of-the-art results on the WMT 2016 & 2017 APE PBSMT sub-tasks.