Computer Assisted Translation with Neural Quality Estimation and Automatic Post-Editing

With the advent of neural machine translation, there has been a marked shift towards leveraging and consuming machine translation results. However, the gap between machine translation systems and human translators still needs to be closed manually by post-editing. In this paper, we propose an end-to-end deep learning framework for quality estimation and automatic post-editing of machine translation output. Our goal is to provide error correction suggestions and to further relieve the burden on human translators through an interpretable model. To imitate the behavior of human translators, we design three efficient delegation modules (quality estimation, generative post-editing, and atomic operation post-editing) and construct a hierarchical model based on them. We evaluate this approach on the English-German dataset from the WMT 2017 APE shared task and achieve state-of-the-art performance. We also verify through human evaluation that certified translators can significantly expedite their post-editing process with our model.


Introduction
The explosive advances in sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) have enabled deep learning based neural machine translation (NMT) to approximate, and in some specific language pairs and scenarios even achieve, human parity. Instead of translating from scratch, human translators now work within a new translation paradigm: the computer assisted translation (CAT) system, which combines machine translation and human post-editing. Post-editing is the process whereby humans amend machine-generated translations to achieve
an acceptable final product. In practice, the estimated average translation time can be reduced by 17.4% (from 1957.4 to 1617.7 seconds per text) (Läubli et al., 2013). However, utilizing NMT poses two key challenges. First, neural machine translation quality still varies a great deal across different domains and genres, more or less in proportion to the availability of parallel training corpora. Second, a zero tolerance policy is a common choice in the vast majority of important applications. For example, when business legal documents are translated, even a single incorrect word could bring serious financial or property losses. Therefore, subsequent human post-editing is indispensable in situations like this. Unfortunately, while NMT systems save time by providing preliminary translations, the time spent on error corrections by humans (Läubli et al., 2013) remains substantial to the extent that it offsets the efficiency gained by the NMT systems. In this paper, we explore automatic post-editing (APE) in a deep learning framework. Specifically, we adopt an imitation learning approach, where our model first screens the translation candidates by quality prediction and then decides whether to post-edit with the generative or the atomic operation method.
Starting with a wide range of features used in the CAT system, we carefully analyze the human post-editing results to narrow down our framework design into three key modules: quality estimation (QE), generative post-editing and atomic operation post-editing. These modules are tightly integrated into the transformer neural networks (Vaswani et al., 2017). Our main innovation is a hierarchical model with two modular post-editing algorithms which are conditionally used based on a novel fine-grained quality estimation model. For each machine translation, our model i) runs the QE model to predict the detailed token level errors, which will be further summarized as an overall quality score to decide whether the machine translation quality is high or not, and ii) conditional on the previous decision, employs the atomic operation post-editing algorithm on the high quality sentence or the generative model to rephrase the translation for the low one.
We examine our approach on the public English-German dataset from the WMT 2017 APE shared task (http://www.statmt.org/). Our system outperforms the top-ranked methods in both the BLEU and TER metrics. In addition, following a standard human evaluation process aimed at achieving impartiality with respect to the efficiency of the CAT system, we ask several certified translators to edit the machine translation outputs with or without our APE assistance. Evaluation results show that our system significantly improves translators' efficiency.

Related Work
Our work relates to and builds on several intertwined threads of research in machine translation, including QE and APE. We briefly survey the traditional methods and differentiate our approach.

Quality Estimation
Quality estimation is often a desired component for developing and deploying automatic language technologies, and has been extensively researched in machine translation (Barrault et al., 2019). Its purpose is to provide metrics measuring the overall translation quality. The current state-of-the-art models mostly originate from the predictor-estimator framework (Kim et al., 2017), where a sequence-to-sequence model is pre-trained to extract sophisticated sequence features, which are then fed into a sequence-level regression or classification network. Tan et al. (2017) proposed neural post-editing based quality estimation by streamlining together the traditional QE and APE models. Since our proposed QE module will eventually serve the APE module as well, we consider two modifications accordingly. First, we re-define QE as a fine-grained multi-class problem, whose output indicates the number of tokens in four categories: missing, redundant, erroneous, or kept tokens. A similar idea was initially proposed in (Gu et al., 2017) to predict the number of copy occurrences in non-autoregressive neural machine translation.

Automatic Post-Editing
Automatic post-editing aims to improve the quality of an existing MT system by learning from human-edited samples, converting "translationese" output into natural text. Traditional APE is based on a round-trip translation loop to mimic errors similar to the ones produced by NMT, and can achieve acceptable performance with large-scale monolingual data only (Freitag et al., 2019). However, the prevalent trend in this area prefers the dual-source encoder-decoder architecture with parallel data (Chatterjee et al., 2017b; Junczys-Dowmunt and Grundkiewicz, 2018; Pal et al., 2018; Lopes et al., 2019), which obtained the best results in WMT competitions (Chatterjee et al., 2019). The dual-source encoder encodes the source text and the machine translation output separately, and the decoder decodes the post-edited results. All these approaches encode each source independently and apply an auto-regressive decoder; they differ in their parameter sharing mechanisms. While our approach still employs the multi-source APE framework, there are two fundamental differences. First, our APE module, as mentioned above, is built on our re-designed QE model, with which the source and the machine translation are entangled by the encoder and memory-encoder QE module. Second, our decoder is a versatile architecture that can choose between the left-to-right auto-regressive generative model and the atomic-operation based parallel model, dynamically determining which model to engage at runtime. Parallelizable models have been broadly explored in insertion- or deletion-based transformers (Gu et al., 2019), while our decoder supports more functional operations.

Model and Objective
In order to achieve the automatic post-editing goal, it is essential for the model to find the exact errors appearing in the machine translation and learn how to fix them. Breaking the problem into several subtasks, our proposed pipeline includes three major models, as shown in Figure 1. Skipping the pre-training temporarily, the first step is to investigate the fine-grained quality estimation model with respect to the source text and machine-translated text. Its output provides a fine-grained quality estimation of the machine translation. Based on the corresponding quality, an atomic APE or a generative APE model will be called for further processing.

Figure 1: The overall pipeline. The QE model outputs fine-grained metrics on the translation quality.
High quality machine translations then proceed to the atomic APE model for minor fixes, while low quality machine translations go through a generative APE model for complete rephrasing. Note that the parameters of the encoder and memory encoder are shared across the three steps. The detailed computational graph is shown in Figure 2.

Label:      k > 1               k = 1   k = 0    k = -1
Definition: insert k-1 tokens   keep    delete   replace

Table 2: Definition of the integer QE labels.

As described in the related work, compared to the traditional translation QE task in WMT, our QE module is more fine-grained and is recast as a multi-class {-1, 0, 1, ..., K} sequence labeling problem. The definition of the integer labels is shown in Table 2. If k <= 1, the label denotes a single token operation; otherwise, it means inserting k-1 extra tokens after the current one. The QE tag q for a training pair (m, e) can be deterministically calculated by the dynamic programming Algorithm 4 in the Appendix, which is basically a string matching algorithm. We define a conditionally independent sequence tagging model for the error prediction.
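As an illustration, the tag derivation can be sketched with a plain Levenshtein alignment. This is a simplification of Algorithm 4 in the Appendix: the function names are ours, and insertions that follow a replace or delete (not expressible in the label set) are skipped for brevity.

```python
def align_ops(mt, pe):
    """Levenshtein alignment between MT tokens and post-edit tokens.

    Returns a list of operations: 'K' (keep), 'S' (substitute),
    'D' (delete an MT token), 'I' (insert a post-edit token).
    """
    n, k = len(mt), len(pe)
    dp = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(k + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            cost = 0 if mt[i - 1] == pe[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,
                           dp[i - 1][j] + 1,   # delete mt[i-1]
                           dp[i][j - 1] + 1)   # insert pe[j-1]
    ops, i, j = [], n, k
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if mt[i - 1] == pe[j - 1] else 1)):
            ops.append('K' if mt[i - 1] == pe[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return ops[::-1]


def qe_tags(mt, pe):
    """Map alignment operations to the integer labels of Table 2:
    1 = keep, 0 = delete, -1 = replace, k > 1 = keep and insert k-1 tokens."""
    tags = []
    for op in align_ops(mt, pe):
        if op == 'K':
            tags.append(1)
        elif op == 'S':
            tags.append(-1)
        elif op == 'D':
            tags.append(0)
        else:  # 'I': fold the insertion into the preceding kept token's label
            if tags and tags[-1] >= 1:
                tags[-1] += 1
            # insertions at the start, or after a replace/delete, are not
            # expressible in this simplified label set and are skipped here
    return tags
```

For example, `qe_tags("a b c".split(), "a x c d".split())` yields `[1, -1, 2]`: "b" is replaced, and one token is inserted after "c".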

Fine-Grained Quality Estimation
A transformer based neural network is employed. We present a novel encoder-memory encoder framework with memory attention, as shown in the decomposition of the following equation:

p(q | s, m) = ∏_{i=1}^{|m|} p(q_i | Enc_M(m, Enc(s)))    (2)

where Enc(·) is the standard transformer encoder (Vaswani et al., 2017) and Enc_M(·) is the memory encoder adapted from the standard transformer decoder. It removes the future masking of the transformer decoder and uses the last states as the output, which contain contexts from both SRC and MT.
During inference, neither the ground truth post-editing nor the golden translation reference is available. The fine-grained QE model can instead predict the human translation edit rate (HTER) h through the inferred QE tags q̂.
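Under the label semantics above, an HTER estimate can be recovered from the predicted tags alone. The following sketch (our own naming) uses the implied post-edit length as the denominator, since no reference is available at inference time:

```python
def hter_from_tags(tags):
    """Estimate HTER from fine-grained QE tags.

    Edits: substitutions (-1), deletions (0), and the k-1 insertions
    implied by each tag k > 1. The denominator is the implied
    post-edit length: each tag k >= 1 yields k post-edit tokens,
    each substitution yields 1, and deletions yield none.
    """
    subs = sum(1 for t in tags if t == -1)
    dels = sum(1 for t in tags if t == 0)
    ins = sum(t - 1 for t in tags if t > 1)
    pe_len = sum(t for t in tags if t >= 1) + subs
    return (subs + dels + ins) / max(pe_len, 1)
```

For the tag sequence `[1, -1, 2]` (one substitution, one insertion, implied post-edit length 4), the estimate is 0.5.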
On the one hand, the overall metric h quantifies the quality of the machine translation and determines which APE algorithm will be used. On the other hand, the detailed QE tags can guide the APE as to which atomic operation should be applied. Thus, the QE tagging and the atomic operation APE are trained simultaneously and iteratively, which will be elaborated in Sections 3.2 and 3.5.

Atomic Operation Automatic Post-Editing
The key idea of atomic operation APE is to reduce all predefined operations (insertion, deletion, substitution) to a special substitution operation by introducing an artificial placeholder token [PLH]. First, we align the machine translation m and the post-edits e by inserting [PLH]s, resulting in a new m̃ of the same length as e. Second, the original APE task is transformed into another sequence tagging problem, since |m̃| = |e|.
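A minimal sketch of this reduction, assuming the integer tags of Table 2 and that replaced tokens are also masked so the tagger can re-predict them:

```python
PLH = "[PLH]"

def make_template(mt, tags):
    """Reduce insertion/deletion/substitution to substitution:
    drop deleted tokens, mask replaced tokens, and append one
    placeholder per implied insertion, so len(output) == |e|."""
    out = []
    for tok, tag in zip(mt, tags):
        if tag == 0:            # deletion: the token disappears
            continue
        if tag == -1:           # replacement: mask it for re-prediction
            out.append(PLH)
        else:                   # keep, plus tag-1 insertions after it
            out.append(tok)
            out.extend([PLH] * (tag - 1))
    return out
```

For example, `make_template(["a", "b", "c"], [1, -1, 2])` yields `["a", "[PLH]", "c", "[PLH]"]`, the same length as the post-edit "a x c d".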
Notice that i) the encoder and memory encoder share their parameters with the QE model in Equation (2); ii) the softmax layer is different, because the output layer of the APE has a different size, equal to the vocabulary size. An intuitive visualization is shown in Figure 3, and the holistic pipeline in Figure 1.

Generative Automatic Post-Editing
The larger the HTER h is, the lower the quality of m is, and the more atomic operations are required. In this case, the previous APE model may not be powerful enough to learn such complicated editing behaviors. We therefore propose a backup APE model via an auto-regressive approach for deteriorated translations. Concretely, we write the dual-source language model in its probabilistic formulation:

p(e | s, m) = ∏_{t=1}^{|e|} p(e_t | e_{<t}, Enc_M(m, Enc(s)), Enc(s))
Notice that i) the encoder and memory encoder are still reused here; ii) Dec(·; ·; ·) is a transformer decoder with hierarchical attention, since the two memory blocks Enc_M(m, Enc(s)) and Enc(s) are both conditional variables of the auto-regressive language model; iii) unlike sequence tagging, the inference of the generative APE is intrinsically non-parallelizable.
Algorithm 1 Imitation Learning Algorithm

Pre-training and Imitation Learning
Because of the lack of human post-editing data, training from scratch is typically difficult. We thus employ two workaround methods to improve the model performance.
Pre-training. It is worth noting that the reduced atomic operation APE is actually equivalent to the masked language modeling problem popularized by BERT (Devlin et al., 2019). Therefore, we pre-train the encoder-memory encoder model as a conditional BERT with the data pairs (s, t) and (m, ê), aiming at learning the syntactic and alignment information of the ground truth. To make the pre-training valid on downstream tasks, we consistently use the [PLH] token to randomly mask the reference / post-editing sentence.
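The masking step can be sketched as follows. The uniform sampling policy and function names are our assumptions; only the [PLH] token and random masking of the target sentence come from the paper (the 20% rate is stated in the training details).

```python
import random

PLH = "[PLH]"

def mask_for_pretraining(tokens, rate=0.2, rng=None):
    """Replace a random `rate` fraction of tokens with [PLH];
    the original tokens at the masked positions become the
    prediction targets for the conditional masked-LM objective."""
    rng = rng or random.Random()
    n_mask = max(1, int(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [PLH if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets
```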
Imitation Learning. As mentioned in Section 3.1, during inference the predicted QE tags causally tie to the successive APE algorithm, because m̃ is derived from (m, q̂). Although we would want the model to learn to predict all three atomic operations together, the small size of real post-editing data severely limits the performance of joint QE tagging. Therefore, we propose a model specialization strategy where the model learns three separate tasks: deletion, insertion, and substitution. A reasonable amount of training data can be generated for each of the tasks, and the model learns to specialize in each operation. The details are summarized in Algorithm 1.
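One way to realize the specialization data is sketched below, under our own conventions: a single corruption per sentence, an explicit corruption position, and a hypothetical `<noise>` token standing in for a sampled wrong word. The paper's Algorithm 1 is the authoritative procedure.

```python
NOISE = "<noise>"  # hypothetical stand-in for a sampled wrong token

def make_pseudo_example(pe, op, i):
    """Corrupt post-edit `pe` with one operation of type `op` at
    position `i`, returning (pseudo_mt, gold_tags) for training a
    specialized model. Tags follow Table 2:
    1 keep, 0 delete, -1 replace, 2 keep-and-insert-one-after."""
    if op == "delete":        # extra token in MT -> model must delete it
        mt = pe[:i] + [NOISE] + pe[i:]
        tags = [1] * i + [0] + [1] * (len(pe) - i)
    elif op == "insert":      # missing token in MT -> previous token tagged 2
        assert i > 0, "leading insertions need a BOS token (omitted here)"
        mt = pe[:i] + pe[i + 1:]
        tags = [1] * (i - 1) + [2] + [1] * (len(pe) - i - 1)
    elif op == "substitute":  # wrong token in MT -> model must replace it
        mt = pe[:i] + [NOISE] + pe[i + 1:]
        tags = [1] * i + [-1] + [1] * (len(pe) - i - 1)
    else:
        raise ValueError(op)
    assert len(mt) == len(tags)
    return mt, tags
```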

Training and Inference Algorithms
In this section, we assemble all the modules into the final system.

Training usually requires minimizing the loss function (the negative data log-likelihood of the probabilistic models) by stochastic gradient descent (SGD) with respect to the trainable parameters. Our QE and atomic operation APE are both sequence tagging tasks, while the generative APE is a sequence generation task. The three loss functions are uniformly defined as the sequential cross entropy between the predicted and the true sequences. Note that the QE and atomic operation APE share the encoder-memory encoder, so these two losses can be summed together for optimization. However, the generative APE model has an isolated hierarchical transformer decoder, so we need a second update that optimizes the corresponding loss alone.
Inference in our APE system is not quite the same as training. First, the overall inference is a continuously alternating procedure between QE and APE, where the APE prediction is assigned as a new machine translation for iterative updating; in contrast, the inner loop of the training algorithm iterates over the augmented data points. Second, we introduce an early stop after the first QE tagging prediction: if the predicted quality is very low (i.e. the HTER is larger than a cross-validated threshold), the generative APE is called and the inference immediately exits without further iterations. Lastly, the APE results are utilized by professional translators for further editing. In the next section, we validate the efficiency gain of APE over raw machine translation.
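The alternating procedure with early stop can be sketched as follows. Here `qe`, `aom`, and `gm` stand for the three trained components, and the threshold and iteration cap correspond to the cross-validated τ and S; the callables are placeholders for illustration, not the paper's implementation.

```python
def hter_from_tags(tags):
    """HTER estimated from QE tags (edits over implied post-edit length)."""
    subs = sum(1 for t in tags if t == -1)
    dels = sum(1 for t in tags if t == 0)
    ins = sum(t - 1 for t in tags if t > 1)
    pe_len = sum(t for t in tags if t >= 1) + subs
    return (subs + dels + ins) / max(pe_len, 1)


def hierarchical_ape(src, mt, qe, aom, gm, tau=0.3, max_iters=5):
    """Hierarchical inference: route low-quality MT to the generative
    model; otherwise alternate QE tagging and atomic-operation edits."""
    tags = qe(src, mt)
    if hter_from_tags(tags) > tau:      # low quality: rephrase entirely
        return gm(src, mt)
    for _ in range(max_iters):          # high quality: iterative minor fixes
        if all(t == 1 for t in tags):   # nothing left to edit
            break
        mt = aom(src, mt, tags)
        tags = qe(src, mt)
    return mt
```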

Experiments on our Proposed Model
We verify the validity and efficiency of the proposed APE model by conducting a series of APE experiments and a human evaluation on the WMT'17 APE dataset. For convenience, in this section we denote the generative post-editing model as GM, the atomic operation post-editing model as AOM, and the final hierarchical model as HM.

Setup
Dataset. The open public WMT17 Automatic Post-Editing Shared Task (Bojar et al., 2017) data on English-German (En-De) is widely used for APE experiments. It consists of 23K real triples (source, machine translation, and post-editing) for training and another 2K triples for testing, from the Information Technology (IT) domain. Besides, the shared task also provides a large-scale artificial synthetic corpus containing around 500K high-quality and 4 million low-quality synthetic triples. We oversample the real APE data 20 times and merge it with the synthetic data, resulting in roughly 5 million triples for both pre-training and APE training. The details of the training set are shown in Appendix Table 6. We adopt the test set of the same task in WMT16 as the development set. Furthermore, we apply a truecaser (Koehn et al., 2007) to all files and encode every sentence into subword units (Kudo, 2018) with a 32K shared vocabulary.
Evaluation Metrics. We mainly evaluate our systems with the bilingual evaluation understudy (BLEU) (Papineni et al., 2002) and translation edit rate (TER) (Snover et al., 2006) metrics, since they are standard and widely employed in the APE shared task. BLEU indicates how similar the candidate texts are to the reference texts, with values closer to 100 representing higher similarity. TER measures how many edits are required to turn the predicted sentence into the ground truth sentence; it is calculated by Equation (3) as well and multiplied by 100.

Training Details. All experiments are trained on 8 NVIDIA P100 GPUs for a maximum of 100,000 steps over about two days until convergence, with a total batch size of around 17,000 tokens per step and the Adam optimizer (Kingma and Ba, 2014). Only the source and post-edited sentence pairs are used for pre-training. During pre-training, 20% of the tokens in the post-editing sentence are masked as [PLH]. Parameters are tuned with 12,000 steps of learning rate warm-up (Vaswani et al., 2017) for both the GM and AOM models. In addition, 5 automatic post-editing iterations (i.e. S = 5 in Algorithm 3) are applied during inference for the AOM model, owing to its fine-grained editing behavior. Except for these modifications, we follow the default transformer-based configuration (Vaswani et al., 2017) for the other hyper-parameters of our models.
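For reference, the core of TER is token-level edit distance normalized by reference length. The sketch below omits the phrase shifts counted by the official metric and is our own simplification:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # substitute / match
                           prev[j] + 1,             # delete
                           cur[-1] + 1))            # insert
        prev = cur
    return prev[-1]


def ter(hyp, ref):
    """Simplified TER (x100): edits / reference length.
    The official metric additionally counts phrase shifts."""
    return 100.0 * edit_distance(hyp, ref) / max(len(ref), 1)
```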

APE Systems Comparison
The main results of the automatic post-editing systems are presented in Table 3, compared with the results of recent years' winners of the WMT APE shared task and several other top systems. Our hierarchical single model achieves state-of-the-art performance on both the BLEU and TER metrics, outperforming not only all other single models but also the ensemble models of the top-ranked systems in the WMT APE tasks.
Note that our hierarchical system is not a two-model ensemble. The standard ensemble method requires inference and combination of results from more than one model. In contrast, our hierarchical model contains multiple parameter-sharing modules that accomplish multiple tasks, and only needs to run inference once on the selected model.

Results of Generative APE Model
As mentioned in section 3.3, the decoder of our generative model receives the encoder-memory encoder outputs, referring to the SRC memory and the SRC-MT joint memory. A transformer attention layer encodes the SRC into the SRC memory, and the joint memory is produced by another layer, which encodes the original MT conditionally on the SRC memory. These two encoders are pre-trained with sources and post-edits from the full training data. We design a set of systematic experiments, shown in Figure 4, to verify that our model benefits from this design: (1) To verify that the memory encoder has the ability to learn cross-lingual knowledge, we replace the memory encoder with an ordinary multi-head self-attention encoder, which does not accept the source memory as input, marked as w/o Joint.
(2) To prove that the shortcut from the SRC memory to the decoder input is necessary, the shortcut is removed in the w/o Shortcut experiment. (3) To verify that our model can leverage representations from pre-training, we conduct an experiment without pre-training, denoted as w/o Pre-training.
The ablation results clearly demonstrate that our model does benefit from the memory encoder, the SRC memory shortcut, and pre-training; removing any of them results in a performance loss.

Results of Atomic Operation APE Model
In each iteration, based on the QE model's output, our AOM refines the MT in parallel at all placeholders. Unlike the GM, the time cost of the AOM depends only on the number of iteration steps, regardless of the length of the sentence. To evaluate the decoding efficiency, we collect the AOM's performance at different iteration steps, as shown in Figure 5.

The Role of Pseudo Data. As noted in section 3.4, the model specialization algorithm is applied to train the model to learn the different kinds of atomic operations. We compare our AOM on the test set with and without pseudo data in Table 4. The results demonstrate that our model specialization algorithm plays a key role by providing powerful guidance for training and compensating for the lack of large amounts of real APE data.

Results of QE Model
The QE model is the prerequisite of the final hierarchical model as well as the basis of our atomic operation model. Therefore, it is necessary to make the QE results as accurate as possible. Unlike the traditional OK/BAD word-level QE task in WMT (Bojar et al., 2017), our model aims to predict fine-grained quality tags, so we cannot make a completely fair comparison with previous works.

Table 5: Results of the Fine-Grained QE Model (Pearson = 0.664). Quality tag prediction is evaluated in terms of multi-classification accuracy via F1-scores. The overall MT quality estimation is measured by the Pearson correlation coefficient, indicating the correlation between the predicted and the real MT quality w.r.t. TER.
The fine-grained quality tag of each word predicted by the model can be classified into one of four labels: K for Kept, E for Erroneous, R for Redundant, and M for Missing. Furthermore, we convert the predicted fine-grained QE tags to OK/BAD tags directly by treating tags K and M as OK and the other two tags as BAD, according to the tagging rules of the WMT17 QE Shared Task.
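The conversion rule amounts to a one-line mapping (K and M map to OK because the MT token itself is correct even when tokens are missing after it):

```python
def to_ok_bad(fine_tags):
    """Collapse fine-grained tags to WMT17 OK/BAD word-level tags:
    K (kept) and M (missing-after) -> OK;
    E (erroneous) and R (redundant) -> BAD."""
    return ["OK" if t in ("K", "M") else "BAD" for t in fine_tags]
```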
We provide our fine-grained QE results on the test dataset of the WMT17 APE Task in Table 5, where the ground-truth tags are produced by Algorithm 4 in Appendix A.1. Note that the TER score can be easily computed from the predicted quality tags. The predicted TER score is regarded as an indicator of MT quality in our hierarchical model: MTs with predicted TER higher than τ in Algorithm 3 are fed to the GM, otherwise they are sent to the AOM. The hyper-parameter τ = 0.3 is determined by cross validation on the WMT16 development dataset. Afterwards, we apply it to the WMT17 test dataset to select the potentially preferable model between GM and AOM to generate the final APE result for each SRC and MT pair.
More than 75% of the tokens in the training set are tagged as Keep. Given the huge challenge posed by this unbalanced dataset, our fine-grained quality estimation results are quite remarkable. The performance of our final hierarchical model in Table 3 further proves their effectiveness.

Results of Human Evaluation
We conduct real post-editing experiments with professional translators. There are 6 independent participating translators, randomly divided into 2 groups. They are all native speakers of German and have 10+ years of experience in En-De translation in IT-related domains. We follow two different flows in our experiments. For a fair comparison, both groups see the same 100 source sentences picked from the WMT17 test dataset. The MTs are provided to the first group for post-editing, while our model's generated APEs are provided to the second group. The category of each translation is not revealed to the translators. The translators are asked to record the total elapsed time of their work.
The statistics of the average post-editing time for the different translators are summarized in Figure 6. Besides the total time, we also analyze the durations for low and high quality translations separately (as determined by the QE model). In either case, post-editing from the APE costs less time. We also present a case study of high-quality vs. low-quality APE in Appendix A.3. From these different perspectives of experimental validation, we conclude that the APE generated by our model eases the burden of translators and substantially improves post-editing efficiency.

Conclusion
In this paper, we propose a hierarchical model that utilizes fine-grained word-level QE prediction to select one of two proposed APE models to automatically generate better translations, achieving state-of-the-art performance. In particular, we design a dynamic deep learning model using imitation learning, which intuitively mimics the editing behaviors of human translators. Our hierarchical model is not a standard ensemble model in the conventional sense; we merely share the parameters of the different modules to accomplish different objectives, including QE, AOM, and GM. Our experimental findings show that if the characteristics of errors in the machine translation can be accurately simulated, it is highly likely that MT output can be automatically refined by an APE model. Towards this end, we conduct a rigorous comparison of manual post-editing based on raw machine translation versus automatic post-editing, and observe that the latter significantly increases post-editing efficiency.

A.3 Case Study and Runtime Efficiency
As mentioned in the paper, the AOM is more suitable for translations that require only a few edit operations, while the GM is preferable for low quality translations. To demonstrate this conclusion and prove the effectiveness of our QE-based automatic selector, some cases of translations of different qualities are shown in Table 7.
In case 1 and case 2, the translation is quite close to pe. Therefore, the AOM only needs to predict tokens for a small number of [PLH]s. When relatively complete contexts are provided, the AOM can achieve higher performance than the GM. Moreover, after reading the source and the final output, the human translators did not need to take any additional action to improve the translation quality.
Conversely, as shown in case 3 and case 4, there is a huge gap between mt and pe, and the input for the AOM contains a considerable number of placeholders, lacking enough contextual information. In these cases, our GM can auto-regressively regenerate the translation based on the given mt to guarantee a higher quality final output. Based on the QE selector, the translators only need to make very little effort to correct the errors in the final generated APE of our model.
A practical concern of computer assisted translation via APE is its expense and computational cost. Compared with the traditional computer assisted translation crowdsourcing pipeline of machine translation + human post-editing, our additional automatic post-editing does increase the computational cost, roughly equivalent to another machine translation model. In general, crowdsourcing is charged by the hour, and the numbers in our findings suggest a promising budget cut for CAT crowdsourcing. However, this extra APE module may increase latency by about 400 ms, which is still far below the average time cost of human post-editing. Even for an online crowdsourcing system, a well-designed concurrent mechanism should ensure that translators do not perceive any delay. From the perspective of architecture scale, the APE model can be deployed on the same processing unit as the machine translation model and called successively in a pipeline. The only concern is that the memory storage capacity should be large enough to hold the extra parameters.