Can Automatic Post-Editing Improve NMT?

Automatic post-editing (APE) aims to improve machine translations, thereby reducing human post-editing effort. APE has had notable success when used with statistical machine translation (SMT) systems but has not been as successful with neural machine translation (NMT) systems, raising questions about the continued relevance of the APE task. However, the training of APE models has relied heavily on large-scale artificial corpora combined with only limited human post-edited data. We hypothesize that APE models have underperformed in improving NMT translations due to the lack of adequate supervision. To test our hypothesis, we compile a larger corpus of human post-edits of English-to-German NMT output. We empirically show that a state-of-the-art neural APE model trained on this corpus can significantly improve a strong in-domain NMT system, challenging the current understanding in the field. We further investigate the effects of varying training data sizes, using artificial training data, and domain specificity for the APE task. We release this new corpus under the CC BY-NC-SA 4.0 license at https://github.com/shamilcm/pedra.


Introduction
Automatic Post-Editing (APE) aims to reduce manual post-editing effort by automatically fixing errors in machine-translated output. Knight and Chander (1994) first proposed APE to cope with systematic errors in selecting appropriate articles in Japanese-to-English translation. Earlier applications of statistical phrase-based models to APE treated it as a monolingual re-writing task without considering the source sentence (Simard et al., 2007; Béchara et al., 2011). Modern APE models take the source text and the machine-translated text as input and output the post-edited text in the target language (see Figure 1).

(Figure 1: An example of APE. Given the English source text "Will he send the gifts to the house?" and its machine translation, the APE model outputs a corrected post-edited translation in German.)

APE models are usually trained and evaluated in a black-box scenario where the underlying MT model and the decoding process are inaccessible, making it difficult to improve the MT system directly. APE can be effective in this case to improve the MT output or to adapt its style or domain.
Recent advances in APE have shown remarkable success on statistical machine translation (SMT) outputs (Junczys-Dowmunt and Grundkiewicz, 2018; Correia and Martins, 2019) even when trained with a limited number of post-edited training instances (generally "triplets" consisting of source, translated, and post-edited segments), with or without additional large-scale artificial data (Junczys-Dowmunt and Grundkiewicz, 2016). Substantial improvements have been reported, especially on the English-German (EN-DE) WMT APE shared tasks on SMT (Bojar et al., 2017; Chatterjee et al., 2018), when models were trained with fewer than 25,000 human post-edited triplets. However, on NMT, strong APE models have failed to show any notable improvement (Chatterjee et al., 2018, 2019; Ive et al., 2020) when trained on similar-sized human post-edited data. This has led to questions regarding the usefulness of APE with current NMT systems, which produce better translations than SMT. Junczys-Dowmunt and Grundkiewicz (2018) concluded that the results of the WMT'18 APE (NMT) task "might constitute the end of neural automatic post-editing for strong neural in-domain systems" and that "neural-on-neural APE might not actually be useful". Contrary to this belief, we hypothesize that a competitive neural APE model can still further improve strong state-of-the-art in-domain NMT systems when trained on adequate human post-edited data.
We compile a new large post-edited corpus, SubEdits, which consists of actual human post-edits of translations of drama and movie subtitles produced by a strong in-domain proprietary NMT system. We use this corpus to train a state-of-the-art neural APE model (Correia and Martins, 2019), with the goal of answering the following three research questions to better assess the relevance of APE going forward:
• Can APE substantially improve in-domain NMT with adequate data size?
• How much does artificial APE data help?
• How significant is domain shift for APE?
Spoilers: Through automatic and human evaluation, we confirm our hypothesis that, in order to notably improve over the original NMT output (the "do-nothing" baseline), state-of-the-art APE models need to be trained on a larger number of human post-edits, unlike the case with SMT. Training on datasets at the scale of those from the WMT APE tasks, even with large-scale in-domain artificial APE corpora, leads to underperformance. Our experimental results also highlight that APE models are highly sensitive to domain differences: to effectively exploit an out-of-domain post-edited corpus such as SubEdits in other domains, it has to be carefully mixed with available in-domain data.

SubEdits Corpus
Human post-edited corpora of NMT outputs from previous WMT APE shared tasks usually consist of fewer than 25,000 instances. Large-scale artificial corpora such as eSCAPE (Negri et al., 2018) do not adequately serve the primary APE objective of correcting systematic errors in MT output, since their pseudo "post-edits" are independent human-translated references that often differ greatly from the MT output. Due to the paucity of larger human post-edited corpora of NMT outputs, a study of APE performance under adequate supervised training data conditions was not possible previously. To enable such a study, we introduce the SubEdits EN-DE post-editing corpus with over 161K triplets of source sentences, NMT translations, and human post-edits of the NMT translations.

Corpus Collection
The SubEdits corpus is collected from a subtitle database of a popular video streaming platform, Rakuten Viki (https://www.viki.com/). Every subtitle segment was originally manually transcribed and translated into English before being translated into German by a proprietary NMT system employed by the platform and specialized in translating subtitles. Viki community members who volunteer as subtitle translators would then post-edit the machine-translated subtitles to further improve them, if necessary.

Corpus Filtering
We use an adaptation of the Gale-Church filtering (Tan and Pal, 2014) used for machine translation to filter the triplets. The global character mean ratio r_c is computed as the ratio between the number of characters in the source and machine-translated portions of the entire corpus. We remove triplets (src, mt, pe) from the corpus where the ratio between the number of characters of the source (src) and the post-edit (pe) does not lie within a threshold range of (1 − t)r_c and (1 + t)r_c, with t = 0.2. We normalize punctuation before computing the ratios. Among the triplets that share the same src and mt segments, we keep only the one with the longest pe. Finally, we remove triplets whose source and target language is not correctly identified by a language identification tool (Lui and Baldwin, 2012). We set aside 10,000 triplets as a development set and 10,000 triplets as a test set.
The final statistics are shown in Table 2.
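A minimal sketch of this filtering pipeline, assuming triplets are plain (src, mt, pe) string tuples (the function names and data representation are ours, not from the released scripts):

```python
def ratio_filter(triplets, t=0.2):
    """Keep (src, mt, pe) triplets whose src/pe character ratio lies
    within (1 - t) * r_c and (1 + t) * r_c, where r_c is the global
    ratio of source to machine-translated characters."""
    r_c = (sum(len(src) for src, _, _ in triplets)
           / sum(len(mt) for _, mt, _ in triplets))
    lo, hi = (1 - t) * r_c, (1 + t) * r_c
    return [(src, mt, pe) for src, mt, pe in triplets
            if pe and lo <= len(src) / len(pe) <= hi]

def dedup_longest_pe(triplets):
    """Among triplets sharing the same (src, mt), keep the longest pe."""
    best = {}
    for src, mt, pe in triplets:
        key = (src, mt)
        if key not in best or len(pe) > len(best[key]):
            best[key] = pe
    return [(src, mt, pe) for (src, mt), pe in best.items()]
```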

BERT Encoder-Decoder APE Model
BERT Encoder-Decoder APE (Correia and Martins, 2019) is a state-of-the-art neural APE model based on the Transformer (Vaswani et al., 2017), with the encoder and decoder initialized with pre-trained multilingual BERT (Devlin et al., 2019) weights and fine-tuned on post-editing data. A single encoder encodes both the source text and the machine-translated text, which are concatenated with the separator token [SEP]. The encoder is identical to the original Transformer encoder and is initialized with pre-trained multilingual BERT weights. For the decoder, Correia and Martins (2019) initialize the context attention weights with the corresponding BERT self-attention weights and tie the self-attention layers of the encoder and decoder; all other decoder weights are likewise initialized from the same multilingual BERT model.
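To make the joint encoding concrete, the following sketch shows how a source/MT pair can be packed into a single BERT input using the Transformers tokenizer; it illustrates only the input format, not the full implementation of Correia and Martins (2019):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

src = "Will he send the gifts to the house?"  # source (English)
mt = "Wird er die Geschenke schicken?"        # NMT output (German)

# Passing a text pair yields [CLS] src [SEP] mt [SEP]; BERT's segment
# (token type) embeddings distinguish the source from the MT side.
enc = tok(src, mt)
print(tok.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0s for the src part, 1s for the mt part
```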

Model Hyperparameters
For the BERT Encoder-Decoder model (BERT Enc-Dec), we use the implementation and model hyperparameters of Correia and Martins (2019) and initialize the encoder and decoder with cased multilingual BERT (base) from the Transformers library (Wolf et al., 2019). The encoder and decoder follow the architecture of BERT (base) with 12 layers and 12 attention heads, an embedding size of 768, and a feed-forward layer size of 3072. We set the effective batch size to 4096 tokens for parameter updates. We train BERT Enc-Dec on a single NVIDIA Quadro RTX6000 GPU; training on our SubEdits corpus took approximately 5 hours to converge. We validate and save checkpoints every 2000 steps and use early stopping (patience of 4 checkpoints) to select the model with the best perplexity. We use a decoding beam size of 5.
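The checkpoint selection schedule amounts to the following generic loop (a sketch only; `run_training_steps`, `validate`, and `save_checkpoint` are hypothetical stand-ins for the training framework's hooks):

```python
def train_with_early_stopping(model, steps_per_val=2000, patience=4):
    """Validate every 2,000 steps; stop after 4 consecutive
    checkpoints without improvement in validation perplexity."""
    best_ppl, bad_checkpoints, step = float("inf"), 0, 0
    while bad_checkpoints < patience:
        run_training_steps(model, steps_per_val)  # hypothetical helper
        step += steps_per_val
        ppl = validate(model)                     # hypothetical helper
        if ppl < best_ppl:
            best_ppl, bad_checkpoints = ppl, 0
            save_checkpoint(model, step)          # hypothetical helper
        else:
            bad_checkpoints += 1
    return best_ppl
```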
As a control, we compare BERT Enc-Dec against two vanilla Transformer APE models using automatic metrics. The Transformer APE models use BERT vocabularies and tokenization and employ a single encoder to encode the concatenation of src and mt, but they are not initialized with pre-trained weights. The two Transformer APE baselines are:

TF (base) A Transformer (base) (Vaswani et al., 2017) model with 6 hidden layers implemented in OpenNMT-py. The embedding size is 512 with 2048 feed-forward units. We use the default learning parameters in OpenNMT-py: the Adam optimizer with a learning rate of 2 and the Noam scheduler.

TF (BERT size) A bigger Transformer with the same number of layers, attention heads, embedding dimensions, and hidden and feed-forward dimensions as BERT Enc-Dec, but without any pre-training or tying of self-attention layers. All learning hyperparameters follow those of the TF (base) model.

Pre-processing and Post-processing
The SubEdits corpus contains HTML tags such as line breaks (<br>) and italics tags (<i>), symbols denoting musical notes (♪, ♫), and segments that often begin with hyphens (-). We applied several processing steps to make the data as close as possible to the natural sentences on which BERT has been pre-trained. Triplets with multi-line src, mt, and pe containing <br> tags are split into separate training instances, and we remove italics and other HTML tags, musical note symbols, and leading hyphens. Thereafter, the input is tokenized with BERT tokenization and word-piece segmentation from the Transformers library. At test time, we keep track of the changes made to the input, such as deletion of leading hyphens, music symbols, and italics tags, and splitting at <br> tags. After decoding, the outputs are detokenized, post-processed to re-introduce the tracked changes, and evaluated.
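A simplified sketch of this clean-and-restore step (illustrative regexes and tracking scheme; the <br> splitting and the exact rules of the released scripts are omitted):

```python
import re

def preprocess(segment):
    """Strip subtitle markup before APE decoding, recording what was
    removed so it can be restored afterwards."""
    changes = {
        "lead_hyphen": segment.lstrip().startswith("-"),
        "italics": "<i>" in segment,
        "music": "♪" in segment or "♫" in segment,
    }
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", segment)   # HTML/italics tags
    text = text.replace("♪", "").replace("♫", "")      # musical notes
    text = re.sub(r"^\s*-\s*", "", text)               # leading hyphen
    return text.strip(), changes

def postprocess(decoded, changes):
    """Re-introduce the tracked markup on the detokenized output."""
    if changes["italics"]:
        decoded = f"<i>{decoded}</i>"
    if changes["music"]:
        decoded = f"♪ {decoded} ♪"
    if changes["lead_hyphen"]:
        decoded = f"- {decoded}"
    return decoded
```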

Evaluation
We evaluate the models using three automatic metrics: BLEU (Papineni et al., 2002), ChrF (Popović, 2015), and TER (Snover et al., 2006). For our evaluation on the SubEdits test set, differing from the WMT APE task evaluation, we post-process and detokenize the outputs and use SacreBLEU (Post, 2018) to evaluate BLEU and ChrF, and TERCOM to compute TER with normalization. Significance testing is done by bootstrap re-sampling on BLEU with 1,000 samples (Koehn, 2004). Additionally, we conduct a human evaluation to ascertain the improvement of the BERT Enc-Dec APE model and to determine the human upper-bound performance for the SubEdits benchmark (see Section 5.3).
We also compare the APE model on the canonical WMT APE dataset (Section 5.6 and Table 7). We follow the WMT evaluation method and use the released tokenized post-edited references to compute BLEU, ChrF, and TER on the tokenized output.
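For the SubEdits evaluation, computing the corpus-level scores looks roughly like this (a sketch assuming detokenized hypothesis/reference lists; the paper uses TERCOM for TER, so sacrebleu's TER below is only a stand-in):

```python
import sacrebleu

# Detokenized, post-processed APE outputs and human post-edit references.
hyps = ["Wird er die Geschenke zum Haus schicken?"]
refs = ["Wird er die Geschenke zum Haus schicken?"]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
ter = sacrebleu.corpus_ter(hyps, [refs])  # stand-in for TERCOM
print(f"BLEU = {bleu.score:.1f}  ChrF = {chrf.score:.1f}  TER = {ter.score:.1f}")
```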

Proprietary In-domain NMT
We first assess the quality of the proprietary in-domain NMT system that is used for compiling the SubEdits corpus. We treat it as a black-box system and use the evaluation results in Table 3 to demonstrate that it is a strong baseline for studying APE performance on NMT outputs.
We compare the proprietary NMT system to three leading commercial EN-DE NMT systems: Google Translate, Microsoft Translator, and SYSTRAN, on a separate in-domain EN-DE test set of 5,136 subtitle segments with independent reference translations (i.e., not post-edits of any system) fetched from the same video streaming platform as the SubEdits corpus. The results (as of May 2020) are summarized in Table 3. Unsurprisingly, the proprietary NMT system specialized in translating drama subtitles substantially outperforms the other, general-purpose MT systems.

Table 4 reports the performance of the vanilla Transformer and BERT Enc-Dec APE models and compares them to the do-nothing NMT baseline (the output produced by the proprietary in-domain NMT system). TF (base) APE improves over the do-nothing NMT baseline output (p < 0.05), particularly on TER. However, TF (BERT size) APE shows a smaller improvement on ChrF and TER and a drop in BLEU: even with the SubEdits corpus, large networks such as TF (BERT size) tend to overfit. With pre-trained BERT initialization, however, BERT Enc-Dec APE shows substantial improvement across all metrics. Unlike previous studies that report marginal improvements (Chatterjee et al., 2018, 2019), our results show that a strong APE model trained on large human post-edits can significantly outperform (p < 0.001) a strong in-domain NMT system.

Human Evaluation
To validate the improvement in automatic evaluation scores and to estimate the human upper-bound performance on SubEdits, we conducted a human evaluation. We hired five German native freelance translators who are also proficient in English and have prior experience with English/German translation. Given the original English text, the annotators were asked to rate the adequacy (from 1 to 5) of three German translations: (1) the do-nothing baseline output (NMT), (2) the BERT Enc-Dec APE output (APE), and (3) the human post-edited text (Human). Figure 2 shows the interface presented to the annotators for rating the translations. The three translations are presented on the same screen in random order, and the annotators are unaware of their origin.
Following recent WMT APE tasks (Bojar et al., 2017; Chatterjee et al., 2018, 2019), our human evaluation is based solely on adequacy assessments; previous studies reported a high correlation of fluency judgments with adequacy (Callison-Burch et al., 2007), making separate fluency assessments superfluous (Przybocki et al., 2009). Unlike the recent WMT APE tasks, we did not opt for direct assessment (Graham et al., 2013), since we wanted to evaluate the degradation or improvement in the quality of the NMT output due to APE and human post-edits on the same English source segments. We elicit judgments for all test set instances where the APE model modified the NMT output beyond simple edits on punctuation, HTML tags, spacing, or casing; 2,815 out of the 10,000 instances in our test set contain such non-simple edits. A set of 50 instances out of the 2,815 was evaluated by all annotators to compute inter-annotator agreement. After evaluation, we filtered out instances where an annotator was unable to decide a score for any of the three translations. The average scores of each annotator (A to E) and the overall average scores are shown in Table 5. The numerator of the "# Eval." column indicates the number of evaluations used for the average score computation after filtering out the "I can't decide" annotations. The results of our human evaluation (Table 5) show that all five annotators rate the APE output better than the baseline NMT output by at least +0.5 on average, reaching an overall score of 3.9. All five annotators also rated the human post-edited output substantially better than both the NMT and APE outputs, which indicates that the quality of the post-edits in the SubEdits corpus is high; human post-edits received an overall average score of 4.3.
Using the repeated set of 46 instances, we compute inter-annotator agreement as the average pairwise Cohen's kappa κ (Cohen, 1960), obtaining 0.27, which is considered fair (Landis and Koch, 1977) and is similar to that observed for adequacy judgments in WMT tasks (Callison-Burch et al., 2007). However, the ranges of scores used by the annotators differ considerably (especially for annotator 'D'). Hence, a measure such as the weighted kappa κ_w (Cohen, 1968), which assigns partial credit to smaller disagreements and works better with ordinal data (such as our adequacy judgments), is more suitable. We compute the average pairwise quadratically weighted kappa κ_w to be 0.50, and consider the agreement to be moderate.
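Both agreement figures can be computed with scikit-learn; a small sketch (the layout of the ratings is our own assumption):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def avg_pairwise_kappa(ratings, weights=None):
    """Average pairwise Cohen's kappa over annotators.

    `ratings` is a list of per-annotator score lists over the same
    items, e.g., five lists of 1-5 adequacy scores for the shared
    instances. Pass weights="quadratic" for the weighted variant."""
    kappas = [cohen_kappa_score(a, b, weights=weights)
              for a, b in combinations(ratings, 2)]
    return sum(kappas) / len(kappas)

# avg_pairwise_kappa(ratings)                       -> plain kappa
# avg_pairwise_kappa(ratings, weights="quadratic")  -> weighted kappa
```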

Can APE substantially improve in-domain NMT with adequate data size?
To analyze the effect of training data size on APE performance, we train BERT Enc-Dec APE with varying amounts of training data from the SubEdits corpus and evaluate the models on the SubEdits development set. For each training data size, ranging from 6,250 to 125,000, we train three models on three random samples of the respective size from the SubEdits training set. Each point in Figure 3 denotes the mean score of the three models (the vertical error bars at each point denote the minimum and maximum scores). The do-nothing NMT baseline score is represented by a horizontal dotted line. As a reference, we mark the size equivalent to that of the WMT'18 APE EN-DE (NMT) training set (13,441 triplets) with a vertical dotted line. The rightmost point on each graph represents the score when the full training corpus is used. Although the sizes of the WMT APE dataset and the SubEdits corpus are not directly comparable, we see that size does matter for APE performance: when the APE model is trained on a subset of the SubEdits corpus of the same size as the WMT'18 APE training data, it performs worse than the baseline in terms of BLEU and only marginally improves in ChrF and TER (see the intersection points of the vertical and horizontal lines in Figure 3).
Interestingly, doubling the amount of training data from 12,500 to 25,000 provides slight BLEU gains over the do-nothing baseline, and increasing the data size to 50,000 training instances improves the model by a further +1 BLEU. The curves continue to show an increasing trend; after 100,000 training instances, the effect of data size on score improvement slows down. This experiment suggests that previous work on APE for NMT outputs may have reached a plateau simply due to the lack of human post-edited data rather than the limited usefulness of APE models.
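The ablation itself is a simple loop over random subsets; a sketch (with `train_ape` and `evaluate` as hypothetical stand-ins for the training and scoring pipeline described above):

```python
import random

SIZES = [6_250, 12_500, 25_000, 50_000, 100_000, 125_000]

def size_ablation(train_triplets, dev_set, seeds=(1, 2, 3)):
    """Train three models per data size on random samples and collect
    their development-set scores (mean/min/max as plotted in Figure 3)."""
    results = {}
    for size in SIZES:
        scores = []
        for seed in seeds:
            subset = random.Random(seed).sample(train_triplets, size)
            model = train_ape(subset)                # hypothetical
            scores.append(evaluate(model, dev_set))  # hypothetical
        results[size] = scores
    return results
```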

How much does artificial APE data help?
Previous work using strong neural APE models (Junczys-Dowmunt and Grundkiewicz, 2018; Tebbifakhr et al., 2018) relied predominantly on artificial corpora, such as the one released by Junczys-Dowmunt and Grundkiewicz (2016) and the eSCAPE corpora (Negri et al., 2018). However, artificial post-edits are generated either from monolingual corpora or from independent reference translations, and they do not directly address the errors made by the MT system that APE is meant to fix.
We compare APE model performance when trained on large-scale in-domain and out-of-domain artificial data (on the order of millions of triplets) to training on the human post-edited SubEdits corpus (over 141K human post-edits). As out-of-domain artificial data, we use the eSCAPE EN-DE NMT corpus and keep sentences of up to 200 characters, resulting in 5.3 million triplets. As in-domain artificial data, we generate an artificial APE corpus following the same approach used to create eSCAPE: we decode the source sentences of the OpenSubtitles2016 parallel corpus (Lison and Tiedemann, 2016), which is also from the subtitle domain, with the same proprietary NMT system used to create SubEdits; the corresponding reference translations become the artificial post-edits. We apply the same filtering criteria and pre-processing methods as for SubEdits (Sections 2.2 and 4.2), resulting in 5.6 million artificial triplets. We refer to this artificial corpus as SubEscape. We set aside 10,000 triplets from each artificial corpus to use as a development set when training solely on the corresponding corpus.
Table 6 compares the performance of BERT Enc-Dec APE trained on the SubEdits corpus to that of the same model trained on the artificial corpora. We find that training on artificial corpora alone, irrespective of their domain, cannot improve over the do-nothing baseline and, in fact, degrades performance substantially. However, when we combine SubEscape with the up-sampled (10×) SubEdits corpus, we get a small improvement, particularly in terms of ChrF and TER.
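The best mixture is straightforward to construct; a sketch (the shuffle and seed are our own additions):

```python
import random

def build_mixed_corpus(subedits, subescape, upsample=10, seed=0):
    """Concatenate the artificial SubEscape corpus with the human
    post-edited SubEdits corpus repeated `upsample` times, so the
    ~141K human triplets are not drowned out by ~5.6M artificial ones."""
    mixed = list(subescape) + list(subedits) * upsample
    random.Random(seed).shuffle(mixed)
    return mixed
```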

How significant is domain shift for APE?
While NMT performance is known to be particularly domain-dependent (Chu and Wang, 2018), domain shift between NMT and APE training has not been investigated previously. To assess this, we evaluate BERT Enc-Dec APE on the canonical WMT'18 APE EN-DE (NMT) dataset. The baseline NMT system and datasets used for the WMT'18 task are from the Information Technology (IT) domain, which is notably different from the domain of SubEdits. We experiment with different methods of combining SubEdits (out-of-domain) with the WMT APE training data (in-domain). For all experiments, we use 1,000 instances held out from the WMT'18 APE training data as the validation set. The results are reported in Table 7: combining the WMT APE training data with SubEdits yields some improvement, particularly in terms of BLEU (p < 0.05), over training with the WMT APE data alone. These results show that in-domain training data is crucial for training APE models to improve in-domain NMT.

Impact of APE with varying NMT quality
To study the impact of APE with varying quality of NMT output, we conduct an analysis on subsets of our development set with varying translation quality (Figure 4). We split the SubEdits development set into 10 subsets by grouping triplets whose NMT output scores TER > 90 (lowest quality), 81-90, ..., 11-20, and TER ≤ 10 (highest quality). They are ordered from left to right on the x-axis of Figure 4 by increasing MT quality. The y-axis denotes the difference (∆) between the TER of the APE output and that of the NMT output for each subset; a more negative ∆TER indicates a larger improvement due to APE. We find that on the lower-quality subsets, APE improves over NMT substantially. This improvement margin shrinks as NMT quality improves, and APE can even deteriorate the NMT output when NMT quality is at its highest. The experiment shows that APE contributes to overall NMT performance predominantly by fixing poorer-quality NMT outputs. When the NMT output is nearly perfect (i.e., when very few or no post-edits are made on it, as indicated by sentence-level TER ≤ 10), the APE model's own errors dominate and APE can become counter-productive. The APE task remains relevant until NMT systems reach this state, which, as our experiments indicate, is still not the case even for strong in-domain NMT systems.
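A sketch of this bucketed analysis (with `sentence_ter` as a hypothetical stand-in for a sentence-level TER implementation such as TERCOM; the paper's exact per-bucket aggregation may differ):

```python
def bucket_delta_ter(examples):
    """Group examples by the sentence-level TER of the NMT output
    against the human post-edit, and report the mean TER change
    (APE minus NMT) per bucket; negative means APE helped."""
    buckets = {}  # bucket upper bound (10, 20, ..., 100) -> deltas
    for ex in examples:
        nmt_ter = sentence_ter(ex["nmt"], ex["pe"])  # hypothetical
        ape_ter = sentence_ter(ex["ape"], ex["pe"])  # hypothetical
        # TER <= 10 -> bucket 10; 11-20 -> 20; ...; > 90 -> 100.
        upper = min(100, (max(int(nmt_ter) - 1, 0) // 10 + 1) * 10)
        buckets.setdefault(upper, []).append(ape_ter - nmt_ter)
    return {b: sum(d) / len(d) for b, d in sorted(buckets.items())}
```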

Qualitative Analysis
We qualitatively analyze the output produced by APE on the SubEdits development set to better understand the improvements and errors made by the APE model. Table 8 shows three example outputs produced by the APE model along with the original English text (SRC), the do-nothing baseline output (NMT), and the human post-edits (Human).
APE is able to fix incorrect named-entity translations made by the NMT system. Example 1 shows a case ("Zhongyuan Palast"→"Palast Zhongcui") where the incorrect entity is corrected by the APE model to match the human post-edit.
NMT often under-translates and misses phrases, and the APE model can usually patch these under-translations; in Example 2, the prepositional phrase "to the resort"→"zum Resort" is missing in the MT output, and the APE model is able to mend the translation. However, although sentence-level APE works well empirically, the lack of context can still lead to errors: in Example 3, a pronoun-dropped source text causes the NMT system to infer a wrong pronoun, and the APE model merely substitutes yet another wrong pronoun. Often, prior or future context from the video, audio, or other subtitle instances is necessary to fill these contextual gaps. Sentence-level APE cannot address such issues robustly, which calls for further research on multimodal and context-aware APE.

Related Work
Until 2018, APE models were benchmarked on SMT outputs through various WMT APE tasks (Bojar et al., 2015, 2016, 2017). The scale of post-edited data provided by these tasks was on the order of 10,000 to 25,000 triplets. The largest collection of human post-edits, released by Zhechev (2012), was however also on SMT and consisted of 30,000 to 410,000 triplets across 12 language pairs. On SMT output, participating systems showed impressive gains even with the small training datasets from the WMT APE tasks (Junczys-Dowmunt and Grundkiewicz, 2017; Tebbifakhr et al., 2018). The results of the subsequent APE (NMT) tasks were not as promising, with only marginal improvements on English-German and no improvement on English-Russian (Chatterjee et al., 2019).
Previously, no study had assessed the necessity of larger human post-edited training data for APE performance on NMT outputs, which we address in this paper. APE models were predominantly trained on large-scale artificial data combined with a few thousand human post-edits. Junczys-Dowmunt and Grundkiewicz (2016) proposed generating large-scale artificial APE training data via a round-trip translation approach inspired by back-translation (Sennrich et al., 2016), and combined the artificial training data with the real data provided by the WMT APE tasks to train their model. Using a similar approach to generating artificial APE data, Freitag et al. (2019) trained a monolingual re-writing APE model on the generated artificial training data alone. Contrary to the round-trip translation approach, the eSCAPE corpus (Negri et al., 2018) was built by simply translating source sentences with NMT and SMT systems and treating the reference translations as the "pseudo" post-edits. Using the eSCAPE English-Italian APE corpus, a follow-up study assessed the performance of an online APE model in a simulated environment where the APE model is updated at test time with new user inputs, and found that online APE models trained on eSCAPE struggled to improve specialized in-domain NMT systems. Such analyses based on training with artificial corpora may not adequately assess the actual potential of APE, since these corpora do not fully cater to the task and can be noisy: the "synthetic" post-edits are independent of, or only loosely coupled with, the MT outputs, and often differ drastically from them. This makes analyzing APE performance over competitive NMT systems on actual post-edited data an important step in understanding the potential of APE research. Contrary to previous conclusions, our analysis shows that a competitive in-domain NMT system can be markedly improved by a strong neural APE model trained on sufficient human post-edited training data.

Conclusion
APE has been an effective option for fixing systematic MT errors and improving translations from black-box MT services. However, on NMT outputs, APE had shown hardly any improvement, as training had been done on limited human post-edited data. The newly collected SubEdits corpus is the largest corpus of human post-edits of NMT output collected so far, and we used it to reassess the usefulness of APE on NMT.
We showed that with a larger human post-edited corpus, a strong neural APE model can substantially improve a strong in-domain NMT system. While artificial APE corpora help, the APE model performs better when trained on adequate human post-edited data (SubEdits) than on large-scale artificial corpora. Our experiments comparing in-domain and out-of-domain APE show that the domain specificity of the training data affects APE performance drastically, and that a combination of in-domain and out-of-domain data with suitable up-sampling alleviates the domain-shift problem. Finally, we find that APE mostly contributes to improving NMT performance by fixing the poorer-quality outputs that still occur even with strong in-domain NMT systems. We release the post-editing datasets used in this paper (SubEdits and SubEscape) along with pre-/post-processing scripts in the PEDRa GitHub repository (https://github.com/shamilcm/pedra).