The Transference Architecture for Automatic Post-Editing

In automatic post-editing (APE) it makes sense to condition post-editing (pe) decisions on both the source (src) and the machine translated text (mt) as input. This has led to multi-encoder based neural APE approaches. A research challenge now is the search for architectures that best support the capture, preparation and provision of src and mt information and its integration with pe decisions. In this paper we present an efficient multi-encoder based APE model, called transference. Unlike previous approaches, it (i) uses a transformer encoder block for src, (ii) followed by a decoder block, but without masking for self-attention on mt, which effectively acts as a second encoder combining src → mt, and (iii) feeds this representation into a final decoder block generating pe. Our model outperforms the best performing systems by 1 BLEU point on the WMT 2016, 2017, and 2018 English–German APE shared tasks (PBSMT and NMT). Furthermore, the results of our model on the WMT 2019 APE task using NMT data show performance comparable to the state-of-the-art system. The inference time of our model is similar to that of the vanilla transformer-based NMT system, although our model deals with two separate encoders. We further investigate the importance of our newly introduced second encoder and find that too few layers hurt performance, while reducing the number of decoder layers does not matter much.


Introduction
Although machine translation (MT) systems are improving rapidly, the resulting translations may still require manual post-editing (PE) to achieve human-acceptable translation quality. Automatic post-editing (APE) is a method that aims to automatically correct errors in machine translated text before performing actual human post-editing (Knight and Chander, 1994), thereby reducing the post-editors' workload and increasing productivity (Pal et al., 2016a). APE systems learned from human PE data serve as downstream MT post-processing modules to improve the overall performance. APE can therefore be viewed as a 2nd-stage MT system, translating predictable error patterns in MT output to their corresponding corrections. APE training data minimally involves MT output (mt) and the human post-edited (pe) version of mt, but additionally using the source (src) has been shown to provide further benefits (Bojar et al., 2015). To provide awareness of errors in mt originating from src, attention mechanisms (Bahdanau et al., 2015) allow modeling of non-local dependencies in the input or output sequences, and importantly also global dependencies between them (in our case src, mt and pe). The transformer architecture (Vaswani et al., 2017) is built solely upon such attention mechanisms, completely replacing recurrence and convolutions. The transformer uses positional encoding to encode the input and output sequences, and computes both self- and cross-attention through so-called multi-head attention, which can be easily parallelized. Multi-head attention allows the model to jointly attend to information at different positions from different representation subspaces, e.g. utilizing and combining information from src, mt, and pe.
In this paper, we present a multi-encoder based neural APE architecture called transference. Our model contains a source encoder which encodes src information, a second encoder (enc src→mt ) which takes the encoded representation from the source encoder (enc src ), combines this with the self-attention-based encoding of mt (enc mt ), and prepares a representation for the decoder (dec pe ) via cross-attention. Our second encoder (enc src→mt ) can also be viewed as a standard transformer decoding block without masking, which thus acts as an encoder. We thus recombine the different blocks of the transformer architecture and repurpose them for the APE task in a simple yet effective way. The suggested architecture is inspired by the two-step approach professional translators tend to use during post-editing: first, the source segment is compared to the corresponding translation suggestion (similar to what our enc src→mt is doing), then corrections to the MT output are applied based on the encountered errors (in the same way that our dec pe uses the encoded representation of enc src→mt to produce the final translation).
The paper makes the following contributions: (i) we propose a multi-encoder model for APE that consists only of standard transformer encoding and decoding blocks, (ii) by using a mix of self- and cross-attention we provide a representation of both src and mt for the decoder, allowing it to better capture errors in mt originating from src; this advances over Junczys-Dowmunt and Grundkiewicz (2018), the WMT 2018 best system (wmt18 smt best ), in terms of BLEU and TER, (iii) we analyze the effect of varying the number of encoder and decoder layers (Domhan, 2018), indicating that the encoders contribute more than decoders in neural APE, and (iv) we present and evaluate an APE architecture inspired by a two-step approach professional translators often use during post-editing.
In comparison to the shared task system description paper (Pal et al., 2019), this paper (i) provides more detailed explanations and reformation of different components of the transference architecture, (ii) compares it to a single-encoder transformer architecture where only mt, or src concatenated with mt, is used as input, (iii) analyzes results when swapping mt and src in the multi-encoder setup, and (iv) investigates the importance of the encoders and the decoder by varying the number of layers.
The rest of the paper is organized as follows. In §2, we survey existing literature on APE. In §3, we describe the multi-encoder architecture. §4 describes our experimental setup. §5 reports the results of our approach against a number of baselines. Finally, §6 concludes the paper with future directions.

Related Research
Recent advances in APE research are directed towards neural APE, which was first proposed by Pal et al. (2016b) and Junczys-Dowmunt and Grundkiewicz (2016) for the single-source APE scenario which does not consider src, i.e. mt → pe. Junczys-Dowmunt and Grundkiewicz (2016) also generated a large synthetic training dataset, which we also use as additional training data.
Exploiting source information as an additional input can help neural APE to disambiguate corrections applied at each time step; this naturally leads to multi-source APE ({src, mt} → pe). A multi-source neural APE system can be configured either by using a single encoder that encodes the concatenation of src and mt (Niehues et al., 2016) or by using two separate encoders for src and mt and passing the concatenation of both encoders' final states to the decoder (Libovický et al., 2016). A few approaches to multi-source neural APE were proposed in the WMT 2017 APE shared task. Junczys-Dowmunt and Grundkiewicz (2017) combine both mt and src in a single neural architecture, exploring different combinations of attention mechanisms including soft attention and hard monotonic attention. Another approach built upon the two-encoder architecture of multi-source models (Libovický et al., 2016) by means of concatenating both weighted contexts of encoded src and mt. Varis and Bojar (2017) compared two multi-source models, one using a single encoder with the concatenation of src and mt sentences, and a second one using two character-level encoders for mt and src along with a character-level decoder.
In the WMT 2018 APE shared task, several adaptations of the transformer architecture were presented for multi-source APE. Pal et al. (2018) introduced a joint encoder that attends over a combination of the two encoded sequences from mt and src. Tebbifakhr et al. (2018), the NMT-subtask winner of WMT 2018 (wmt18 nmt best ), employed sequence-level loss functions in order to avoid exposure bias during training and to be consistent with the automatic evaluation metrics. Shin and Lee (2018) proposed a multi-source transformer where on the decoder side, they added two additional multi-head attention layers for src → mt and src → pe. Thereafter another multi-head attention between the output of those attention layers helps the decoder to capture common words in mt which should remain in pe. The APE PBSMT-subtask winner of WMT 2018 (wmt18 smt best ) (Junczys-Dowmunt and Grundkiewicz, 2018) also presented another transformer-based multi-source APE which uses two encoders and stacks an additional cross-attention component for src → pe above the previous cross-attention for mt → pe.
In contrast to other multi-encoder based approaches and Libovický et al. (2018)'s approach, where the authors focused on cross-attention of two encoders with respect to the decoder within the transformer architecture, we propose a novel architecture where the second encoder block is similar to the transformer decoder block but without masking.
In the latest edition of WMT (2019), the submissions are mostly multi-source models extending the transformer implementation (Pal et al., 2019; Lee et al., 2019) and adapting BERT (Devlin et al., 2018) to the transformer-based framework (Lopes et al., 2019). The winning system (Lopes et al., 2019) (wmt19 nmt best ) uses a single pre-trained BERT encoder that receives both the src and mt strings and applies a BERT-based encoder-decoder model. Additionally, they add a conservativeness penalty factor during beam decoding to avoid over-corrections in APE.
Our method outperforms the WMT 2016, 2017, and 2018 winners by 1 BLEU point, and yields comparable performance to the WMT 2019 winner, however, without using a BERT-based architecture.

The Transference Multi-Encoder Transformer for APE
We propose a multi-source transformer model called transference ({src, mt} tr → pe, Figure 1), which takes advantage of both the encodings of src and mt and attends over a combination of both sequences while generating the post-edited sentence. The second encoder, enc src→mt , makes use of the first encoder enc src and a sub-encoder enc mt for considering src and mt. Here, the enc src encoder and the dec pe decoder are equivalent to the original transformer for neural MT (Vaswani et al., 2017). Our enc src→mt follows an architecture similar to the transformer's decoder, the difference being that the multi-head self-attention used to process mt is not masked.
The self-attended encoder for src, s = (s_1, s_2, ..., s_k), returns a sequence of continuous representations, enc src , and the second self-attended sub-encoder for mt, m = (m_1, m_2, ..., m_l), returns another sequence of continuous representations, enc mt . Self-attention at this point provides the advantage of aggregating information from all of the words, including src and mt, and successively generates a new representation per word informed by the entire src and mt context. To do this, the internal enc mt representation performs cross-attention over enc src and prepares a final representation (enc src→mt ) for the decoder (dec pe ). The decoder then generates the pe output in sequence, p = (p_1, p_2, ..., p_n), one word at a time from left to right by attending to previously generated words as well as the final representations (enc src→mt ) generated by the encoder.
To summarize, our multi-source APE implementation extends Vaswani et al. (2017) by introducing an additional encoding block by which src and mt communicate with the decoder.
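The data flow described in this section can be sketched in a few lines of plain Python. This is only a toy illustration under strong simplifying assumptions: a single attention head, a single layer per block, no learned projections, no feed-forward sub-layers, no layer normalization and no positional encoding. The function names (attention, transference) and the random toy embeddings are hypothetical and are not taken from the actual implementation. The point is solely to show where masking is and is not applied.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=True, query position i only attends to key positions <= i.
    """
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:limit]]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values[:limit]))
                    for j in range(d)])
    return out

def add(xs, ys):
    """Residual connection: element-wise sum of two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def transference(src, mt, pe_prefix):
    # enc_src: ordinary (unmasked) self-attention over the source.
    enc_src = add(src, attention(src, src, src))
    # enc_src->mt: decoder-style block, but self-attention over mt is NOT masked,
    # followed by cross-attention from mt positions over enc_src.
    enc_mt = add(mt, attention(mt, mt, mt, causal=False))
    enc_src_mt = add(enc_mt, attention(enc_mt, enc_src, enc_src))
    # dec_pe: masked self-attention over the pe prefix, then a single
    # cross-attention over enc_src_mt (src is only seen through enc_src_mt).
    dec = add(pe_prefix, attention(pe_prefix, pe_prefix, pe_prefix, causal=True))
    dec = add(dec, attention(dec, enc_src_mt, enc_src_mt))
    return enc_src_mt, dec

# Toy embeddings (hypothetical): 3 src, 4 mt and 5 pe positions of dimension 4.
rng = random.Random(0)
def rand_seq(n, d=4):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

src, mt, pe = rand_seq(3), rand_seq(4), rand_seq(5)
enc_a, dec_a = transference(src, mt, pe)

# Changing the LAST mt token changes the FIRST enc_src->mt state (no masking).
enc_b, _ = transference(src, mt[:-1] + [[9.0] * 4], pe)
# Changing the LAST pe token leaves earlier decoder states untouched (masking).
_, dec_c = transference(src, mt, pe[:-1] + [[9.0] * 4])
```

In this sketch, every enc src→mt state depends on the whole mt sequence, whereas a dec pe state at position i depends only on pe positions up to i.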
Our proposed approach differs from the WMT 2018 PBSMT winner system (wmt18 smt best ) in several ways: (i) we use the original transformer's decoder without modifications; (ii) one of our encoder blocks (enc src→mt ) is identical to the transformer's decoder block but uses no masking in the self-attention layer, thus having one self-attention layer and an additional cross-attention for src → mt; and (iii) in the decoder layer, the cross-attention is performed between the encoded representation from enc src→mt and pe. Moreover, because the src cross-attention is placed within the enc src→mt sub-layer rather than the dec pe sub-layer as in wmt18 smt best , enc src→mt is forward propagated only once during inference instead of multiple times, i.e., once per decoding step.
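The inference argument above can be made concrete with a rough counting model. The sketch below only counts how many cross-attention sub-layers are evaluated while decoding one sentence. It ignores sequence lengths, caching of key and value projections, and all other sub-layers, so it is an illustration of the argument rather than a timing model. The function name and the value of 30 decoding steps are hypothetical.

```python
def cross_attention_calls(n_mt_layers, n_pe_layers, decoding_steps,
                          src_cross_attention_in_decoder):
    """Count cross-attention sub-layer evaluations for decoding one sentence."""
    if src_cross_attention_in_decoder:
        # Stacked variant: every decoder layer attends over mt AND over src,
        # and the decoder runs once per generated token.
        return 2 * n_pe_layers * decoding_steps
    # Transference-style variant: the src cross-attention lives in enc_src->mt,
    # which runs once per sentence; the decoder keeps one cross-attention per layer.
    return n_mt_layers + n_pe_layers * decoding_steps

# 6-6-6 layer setup, hypothetical output length of 30 tokens.
transference_calls = cross_attention_calls(6, 6, 30, False)
stacked_calls = cross_attention_calls(6, 6, 30, True)
```

Under these assumptions, the gap between the two counts grows with the output length, because the extra cross-attention is paid at every decoding step in the stacked variant but only once in the encoder-side variant.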
Our approach also differs from the WMT 2018 NMT winner system: (i) wmt18 nmt best concatenates the encoded representation of two encoders and passes it as the key to the attention layer of the decoder, and (ii), the system additionally employs sequence-level loss functions based on maximum likelihood estimation and minimum risk training in order to avoid exposure bias during training.
Comparing with wmt19 nmt best , the winner system of WMT 2019 uses a pre-trained deep bidirectional transformer (multilingual BERT) (Devlin et al., 2018), while our model does not.
The main intuition is that our enc src→mt attends over the src and mt and informs the pe to better capture, process, and share information between src-mt-pe, which efficiently models error patterns and the corresponding corrections. Our model performs better than past transformer-based approaches and similar to the BERT-based approach (wmt19 nmt best ) without adding the overhead of the pre-trained model, as the experiment section will show.

Experiments
We explore our approach on both APE sub-tasks of WMT 2018, where the black-box MT system to which APE is applied (which we refer to as the 1st-stage MT system) is either a phrase-based statistical machine translation (PBSMT) or a neural machine translation (NMT) model.
For the PBSMT task, we compare against four baselines: the raw SMT output provided by the 1st-stage PBSMT system, the best-performing systems from WMT APE 2018 (wmt18 smt best ), which are a single model and an ensemble model by Junczys-Dowmunt and Grundkiewicz (2018), as well as a transformer directly translating from src to pe (Transformer (src → pe)), thus performing translation instead of APE. We evaluate the systems using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
For the NMT task, we consider three baselines: the raw NMT output provided by the 1st-stage NMT system, and the best-performing systems from the WMT 2018 (wmt18 nmt best ) (Tebbifakhr et al., 2018) and WMT 2019 (wmt19 nmt best ) (Lopes et al., 2019) NMT APE tasks. Apart from the multi-encoder transference architecture described above ({src, mt} tr → pe) and ensembling of this architecture, two simpler versions are also analyzed: first, a 'mono-lingual' (mt → pe) APE model using only parallel mt-pe data and therefore only a single encoder, and second, an identical single-encoder architecture that uses the concatenated src and mt text as input ({src + mt} → pe) (Niehues et al., 2016).
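The three input configurations compared here differ only in what the encoder side receives. The following minimal sketch makes that difference explicit. The function name, the variant labels and the separator token are assumptions made for illustration. In particular, the text does not state how src and mt are delimited in the concatenated setup.

```python
def encoder_inputs(variant, src_tokens, mt_tokens, sep="[SEP]"):
    """Return the list of token streams fed to the encoder(s) for each variant.

    The separator token is a hypothetical choice, not taken from the paper.
    """
    if variant == "mt->pe":
        # Single encoder, mt only.
        return [mt_tokens]
    if variant == "{src+mt}->pe":
        # Single encoder over the concatenation of src and mt.
        return [src_tokens + [sep] + mt_tokens]
    if variant == "{src,mt}tr->pe":
        # Two encoders: src goes to enc_src, mt goes to enc_src->mt.
        return [src_tokens, mt_tokens]
    raise ValueError(f"unknown variant: {variant}")

# Hypothetical example triplet sides.
src_tokens = ["click", "the", "button"]
mt_tokens = ["klicken", "Sie", "die", "Taste"]
single = encoder_inputs("mt->pe", src_tokens, mt_tokens)
concat = encoder_inputs("{src+mt}->pe", src_tokens, mt_tokens)
multi = encoder_inputs("{src,mt}tr->pe", src_tokens, mt_tokens)
```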

Data
For our experiments, we use the English-German WMT 2016, 2017, 2018, and 2019 (Chatterjee et al., 2019) APE task data. All these released APE datasets consist of English-German triplets containing source English text (src) from the IT domain, the corresponding German translations (mt) from a 1st-stage MT system, and the corresponding human-post-edited version (pe). The sizes of the datasets (train; dev; test), in terms of number of sentences, are (12,000; 1,000; 2,000), (11,000; 0; 2,000), and (13,442; 1,000; 1,023), for the 2016 PBSMT, the 2017 PBSMT, and the 2018 NMT data, respectively. The 2019 version of the APE dataset released in WMT is the same as the WMT 2018 NMT data. It is to be noted that for WMT 2018, we carried out experiments only for the NMT sub-task and ignored the data for the PBSMT task.
Since the WMT APE datasets are small in size, we use 'artificial training data' (Junczys-Dowmunt and Grundkiewicz, 2016) containing 4.5M sentences as additional resources, 4M of which are weakly similar to the WMT 2016 training data, while 500K are very similar according to TER statistics.
For experimenting on the NMT data, we additionally use the synthetic eScape APE corpus, consisting of ∼7M triples. To clean this noisy eScape dataset, which contains many words from unrelated languages (e.g. Chinese), (i) we use the cleaning process described in Tebbifakhr et al. (2018), and (ii) we use the Moses (Koehn et al., 2007) corpus cleaning scripts with minimum and maximum number of tokens set to 1 and 100, respectively. After cleaning, we perform punctuation normalization, and then use the Moses tokenizer (Koehn et al., 2007) to tokenize the eScape corpus with the 'no-escape' option. Finally, we apply true-casing. The cleaned version of the eScape corpus contains ∼6.5M triplets.
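The length-based part of this cleaning step can be sketched as a simple filter over triplets. This is not the Moses script itself: it only mimics the minimum and maximum token limits (1 and 100) mentioned above, uses whitespace splitting as a stand-in for proper tokenization, and omits any other checks the real script performs. Applying the limits to all three sides of a triplet is an assumption, and the function name is hypothetical.

```python
def keep_triplet(src, mt, pe, min_tokens=1, max_tokens=100):
    """Keep a triplet only if every side has between min_tokens and max_tokens tokens."""
    return all(min_tokens <= len(side.split()) <= max_tokens
               for side in (src, mt, pe))

# Hypothetical toy corpus of (src, mt, pe) triplets.
corpus = [
    ("click the button", "klicken Sie die Taste", "klicken Sie auf die Schaltfläche"),
    ("", "leer", "leer"),                       # empty source side
    ("word " * 150, "Wort " * 150, "Wort " * 150),  # too long
]
cleaned = [t for t in corpus if keep_triplet(*t)]
```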

Experiment Setup
To build models for the PBSMT tasks from 2016 and 2017, we first train a generic APE model using all the training data (4M + 500K + 12K + 11K) described in Section 4.1. Afterwards, we fine-tune the trained model using the 500K artificial and 23K (12K + 11K) real PE training data. We use the WMT 2016 development data (dev2016) containing 1,000 triplets to validate the models during training. To test our system performance, we use the WMT 2016 and 2017 test data (test2016, test2017) as two subexperiments, each containing 2,000 triplets (src, mt and pe). We compare the performance of our system with the four different baseline systems described above: raw MT, wmt18 smt best single and ensemble, as well as transformer (src → pe).
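As a quick check of the data budget described above, the arithmetic can be written out as follows. The artificial corpus sizes are the nominal figures quoted in the text (4M and 500K), so the exact sentence counts may differ slightly. The variable names are chosen here for illustration.

```python
# Nominal sizes quoted in the text (numbers of triplets).
ARTIFICIAL_WEAKLY_SIMILAR = 4_000_000
ARTIFICIAL_VERY_SIMILAR = 500_000
WMT16_TRAIN = 12_000
WMT17_TRAIN = 11_000

# Generic model: all available training data.
generic_size = (ARTIFICIAL_WEAKLY_SIMILAR + ARTIFICIAL_VERY_SIMILAR
                + WMT16_TRAIN + WMT17_TRAIN)

# Fine-tuning: the 500K artificial data plus the real PE data.
real_pe_size = WMT16_TRAIN + WMT17_TRAIN
finetune_size = ARTIFICIAL_VERY_SIMILAR + real_pe_size
```

With these nominal figures, the fine-tuning set is roughly 11.6% of the generic training set, and the real PE data makes up about 4.4% of the fine-tuning set.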
Additionally, we check the performance of our model on the WMT 2018 NMT APE task (where, unlike in previous tasks, the 1st-stage MT system is an NMT system). For this, we explore two experimental setups: (i) we use the PBSMT task's APE model as a generic model which is then fine-tuned to a subset (12K) of the NMT data ({src, mt} nmt tr → pe generic,smt ). One should note that it has been argued that the inclusion of SMT-specific data could be harmful when training NMT APE models (Junczys-Dowmunt and Grundkiewicz, 2018). (ii) We train a completely new generic model on the cleaned eScape data (∼6.5M) along with a subset (12K) of the original training data released for the NMT task ({src, mt} nmt tr → pe generic,nmt ). The aforementioned 12K NMT data are the first 12K of the overall 13.4K NMT data. The remaining 1.4K are used as validation data. The released development set (dev2018) is used as test data for our experiment, alongside test2018, for which we could only obtain results for a few models from the WMT 2019 task organizers. We also explore an additional fine-tuning step of {src, mt} nmt tr → pe generic,nmt towards the 12K NMT data (called {src, mt} nmt tr → pe ft ), and a model averaging the 8 best checkpoints of {src, mt} nmt tr → pe ft , which we call {src, mt} nmt tr → pe ft avg .
During post-editing, professional translators have to understand the source and analyze whether the MT correctly represents the source, which corresponds to our enc src and enc src→mt . To investigate whether following this realistic understanding of the post-editing process is beneficial, we compare the model to a version with swapped inputs (mt, src), called {mt, src} smt tr → pe generic . We carried out this experiment with the PBSMT task's APE dataset. Moreover, we fine-tune the {mt, src} smt tr → pe generic model with 500K artificial and 23K real PE training data and compare the fine-tuned model ({mt, src} smt tr → pe ft ) with {src, mt} smt tr → pe ft .
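The checkpoint averaging used for the averaged model can be illustrated with a small sketch. It assumes that averaging means the element-wise mean of the parameters of the selected checkpoints, which is the common reading of the term. How the best checkpoints are selected is not specified in the text, so the selection by lowest validation loss shown here is a labeled assumption, as are the function names and the toy parameter values.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters; all checkpoints share names and shapes."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n
               for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

def best_k(scored_checkpoints, k):
    """Pick the k checkpoints with the lowest validation loss (assumed criterion)."""
    ranked = sorted(scored_checkpoints, key=lambda pair: pair[0])
    return [ckpt for _, ckpt in ranked[:k]]

# Toy example: (validation_loss, parameters) pairs with flat parameter vectors.
scored = [
    (0.9, {"w": [9.0, 9.0], "b": [9.0]}),
    (0.2, {"w": [1.0, 3.0], "b": [0.0]}),
    (0.3, {"w": [3.0, 5.0], "b": [2.0]}),
]
averaged = average_checkpoints(best_k(scored, k=2))
```

In the setup described above, k would be 8, and the parameters would be the full set of model tensors rather than short lists.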
Last, we analyze the importance of our second encoder (enc src→mt ), compared to the source encoder (enc src ) and the decoder (dec pe ), by varying the number of layers in the encoders and the decoder. Our standard setup, which we use for fine-tuning, ensembling etc., is fixed to 6-6-6 for N src -N mt -N pe (cf. Figure 1). We investigate what happens in terms of APE performance if we change this setting to 6-6-4 and 6-4-6.
To handle out-of-vocabulary words and reduce the vocabulary size, instead of considering words, we consider subword units (Sennrich et al., 2016) by using byte-pair encoding (BPE). In the preprocessing step, instead of learning an explicit mapping between BPEs in the src, mt and pe, we define BPE tokens by jointly processing all triplets. Thus, src, mt and pe share a single BPE vocabulary. Since mt and pe belong to the same language (German) and src is a close language (English), they naturally share a good fraction of BPE tokens, which reduces the vocabulary size to 28K. We implemented our approach based on the Neutron implementation of the Transformer (Xu and Liu, 2019).
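The idea of deriving one BPE vocabulary from all three sides of the triplets can be illustrated with a toy BPE learner. This is a simplified sketch in the spirit of Sennrich et al. (2016), not the tool used in the experiments. The function name, the end-of-word symbol "_" and the example sentences are all hypothetical.

```python
from collections import Counter

def learn_bpe(sentences, num_merges):
    """Learn BPE merge operations from whitespace-tokenized sentences."""
    # Each word is a tuple of symbols, ending with the end-of-word marker "_".
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + ("_",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Joint processing: src, mt and pe sentences are pooled before learning merges.
src_side = ["click the button"]
mt_side = ["klicken sie auf die taste"]
pe_side = ["klicken sie auf die schaltfläche"]
joint_merges = learn_bpe(src_side + mt_side + pe_side, num_merges=10)
```

Because the merges are learned on the pooled text, subwords that occur in both English and German (or in both mt and pe) are represented by the same vocabulary entries.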

Hyper-parameter Setup
We follow a similar hyper-parameter setup for all reported systems. All encoders (for {src, mt} tr → pe), and the decoder, are composed of a stack of N src = N mt = N pe = 6 identical layers (except for the layer experiment) followed by layer normalization. The learning rate is varied throughout the training process: it increases for the first warmup_steps = 8000 training steps and decreases afterwards, as described in Vaswani et al. (2017). All remaining hyper-parameters are set analogously to those of the transformer's base model. At training time, the batch size is set to 25K tokens, with a maximum sentence length of 256 subwords. After each epoch, the training data is shuffled. During decoding, we perform beam search with a beam size of 4. We use shared embeddings between mt and pe in all our experiments.
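The learning rate schedule of Vaswani et al. (2017) referred to above is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A minimal sketch follows, assuming d_model = 512 as in the transformer base model and the warmup of 8000 steps stated above. Any additional scaling factor applied by a particular implementation is not covered here.

```python
def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Learning rate schedule from Vaswani et al. (2017).

    Linear increase during warmup, then decay proportional to step ** -0.5.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at the end of warmup and decays afterwards.
lr_mid_warmup = transformer_lr(4000)
lr_peak = transformer_lr(8000)
lr_later = transformer_lr(32000)
```

With these values the peak rate at step 8000 is roughly 4.9e-4, half of that is reached at step 4000 during warmup, and the rate has halved again by step 32000.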

Results
The results of our four models, single-source (mt → pe), multi-source single encoder ({src + mt} → pe), transference model ({src, mt} smt tr → pe, {mt, src} smt tr → pe), and ensemble, in comparison to the four baselines, raw SMT, wmt18 smt best (Junczys-Dowmunt and Grundkiewicz, 2018) single and ensemble, as well as Transformer (src → pe), are presented in Table 1 for test2016 and test2017. Table 2 reports the results obtained by our transference model ({src, mt} nmt tr → pe) on the WMT 2018 and 2019 NMT data for dev2018 (which we use as a test set) and test2018/2019 (when the test set was available), compared to the baselines raw NMT, wmt18 nmt best , and wmt19 nmt best .

Baselines
The raw SMT output in Table 1 is a strong black-box PBSMT system (i.e., 1st-stage MT). We report its performance observed with respect to the ground truth (pe), i.e., the post-edited version of mt. The original PBSMT system scores over 62 BLEU points and below 25 TER on test2016 and test2017. Using a Transformer (src → pe), we test if APE is really useful, or if potential gains are only achieved due to the good performance of the transformer architecture. While we cannot do a full training of the transformer on the data that the raw MT engine was trained on due to the unavailability of the data, we use our PE datasets in an equivalent experimental setup as for all other models. The results of this system (Exp. 1.2 in Table 1) show that the performance is actually lower across both test sets, -5.52/-9.43 absolute points in BLEU and +5.21/+7.72 absolute in TER, compared to the raw SMT baseline.
We report four results from wmt18 smt best : (i) wmt18 smt best (single), which is the core multi-encoder implementation without ensembling but with checkpoint averaging, and (ii) wmt18 smt best (x4), which is an ensemble of four identical 'single' models trained with different random initializations. The results of wmt18 smt best (single) and wmt18 smt best (x4) (Exp. 1.3 and 1.4) reported in Table 1 are from Junczys-Dowmunt and Grundkiewicz (2018). Since their training procedure slightly differs from ours, we also trained the wmt18 smt best system using exactly our experimental setup in order to make a fair comparison. This yields the baselines (iii) wmt18 smt,generic best (single) (Exp. 1.5), which is similar to wmt18 smt best (single) but with the training parameters and data kept in line with our transference general model (Exp. 2.3), and (iv) wmt18 smt,ft best (single) (Exp. 1.6), which is also trained with an experimental setup equivalent to that of the fine-tuned version of the transference general model (Exp. 3.3). Compared to both raw SMT and transformer (src → pe) we see strong improvements for this state-of-the-art model, with BLEU scores of at least 68.14 and TER scores of at most 20.98 across the PBSMT test sets. wmt18 smt best , however, performs better in its original setup (Exp. 1.3 and 1.4) compared to our experimental setup (Exp. 1.5 and 1.6).
The results on the WMT 2018 and 2019 NMT datasets (dev2018 and test2018) are presented in Table 2. The raw NMT system serves as one baseline against which we compare the performance of the different models. We evaluate the system hypotheses with respect to the ground truth (pe), i.e., the post-edited version of mt. The baseline original NMT system scores 76.76 BLEU points and 15.08 TER on dev2018, and 74.73 BLEU points and 16.80 TER on test2018.

Single-Encoder Transformer for APE
The two transformer architectures mt → pe and {src + mt} → pe use only a single encoder. Table 1 shows that mt → pe (Exp. 2.1) provides better performance (+4.42 absolute BLEU on test2017) compared to the original SMT, while {src + mt} → pe (Exp. 2.2) provides further improvements by additionally using the src information. {src + mt} → pe improves over mt → pe by +1.62/+1.35 absolute BLEU points on test2016/test2017. After fine-tuning, both single-encoder transformers (Exp. 3.1 and 3.2 in Table 1) show further improvements, +0.87 and +0.31 absolute BLEU points, respectively, for test2017 and a similar improvement for test2016.

Transference Transformer for APE
In contrast to the two models above, our transference architecture uses multiple encoders. The fine-tuned version of the {src, mt} smt tr → pe model (Exp. 3.3 in Table 1) outperforms wmt18 smt best (single) (Exp. 1.3) in BLEU on both test sets; however, the TER score for test2016 increases. When ensembling the 4 best checkpoints of our {src, mt} smt tr → pe model (Exp. 4.1), the result beats the wmt18 smt best (x4) system, which is an ensemble of four different randomly initialized wmt18 smt best (single) systems. Our ensemble smt (x3) combines two {src, mt} smt tr → pe (Exp. 2.3) models initialized with different random weights with the ensemble of the fine-tuned transference model Exp3.3 smt ens4ckpt (Exp. 4.1). This ensemble provides the best results for all datasets, providing roughly +1 BLEU point and -0.5 TER when comparing against wmt18 smt best (x4). In terms of the number of parameters, wmt18 smt best and our {src, mt} smt tr → pe model are the same. Moreover, our {src, mt} smt tr → pe model uses a single multi-head cross-attention in the decoder sub-layer, compared to two multi-head cross-attention mechanisms in wmt18 smt best ; therefore, our model requires less inference time. Furthermore, using more non-autoregressive encoder layers with fewer autoregressive decoder layers can significantly accelerate inference (Xu et al., 2020). Instead of aggregating src and mt with the autoregressive pe decoder (Junczys-Dowmunt and Grundkiewicz, 2018), our approach aggregates src and mt with the non-autoregressive mt encoder and is therefore significantly faster than wmt18 smt best in inference, which is of practical value.
Additionally, we compare our {src, mt} smt tr → pe model with {mt, src} smt tr → pe, where we reverse the input order, i.e., enc 1 and enc 2 take mt and src, respectively, as input. Exp. 7 is the {src, mt} nmt tr → pe generic,smt model, which is the model from Exp. 3.3 fine-tuned towards NMT data as described in Section 4.2. Table 2 shows that our PBSMT APE model fine-tuned towards NMT (Exp. 7) can even slightly improve over the already very strong NMT system by about +0.3 BLEU and -0.1 TER, although these improvements are not statistically significant.
The overall results improve when we train our model on eScape and NMT data instead of using the PBSMT model as a basis. Our proposed generic transference model (Exp. 8, {src, mt} nmt tr → pe generic,nmt ) shows statistically significant improvements in terms of BLEU and TER compared to the baseline even before fine-tuning, and further improvements after fine-tuning (Exp. 9, {src, mt} nmt tr → pe ft ). Finally, after averaging the 8 best checkpoints, our {src, mt} nmt tr → pe ft avg model (Exp. 10) also shows consistent improvements in comparison to the baseline and other experimental setups. Overall, our fine-tuned model averaging the 8 best checkpoints achieves +1.02 absolute BLEU points and -0.69 absolute TER improvements over the baseline on test2018. Table 2 also shows the performance of our model compared to the winning system of WMT 2018 (wmt18 nmt best ) for the NMT task (Tebbifakhr et al., 2018). wmt18 nmt best scores 14.78 in TER and 77.74 in BLEU on dev2018, and 16.46 in TER and 75.53 in BLEU on test2018. In comparison to wmt18 nmt best , our model (Exp. 10) achieves better TER scores on both dev2018 and test2018; in terms of BLEU, however, our model scores slightly lower on dev2018, while some improvements are achieved on test2018. Compared to wmt19 nmt best (Exp. 6.3), our model scores slightly lower; however, the performance loss is not statistically significant. It is to be noted that the training strategy in wmt19 nmt best is different: (i) they used their own synthetic corpus prepared using the parallel data provided by the Quality Estimation shared task, (ii) they oversampled the APE training data 20 times, and (iii) they applied multilingual BERT.
The number of layers (N src -N mt -N pe ) in all encoders and the decoder for these results is fixed to 6-6-6. In Exp. 5.1 and 5.2 in Table 1, we see the results of changing this setting to 6-6-4 and 6-4-6. This can be compared to the results of Exp. 2.3, since no fine-tuning or ensembling was performed for these three experiments. Exp. 5.1 shows that decreasing the number of layers on the decoder side does not hurt the performance. In fact, in the case of test2016, we got some improvement, while for test2017, the scores got slightly worse. In contrast, reducing the enc src→mt encoder block's depth (Exp. 5.2) does indeed reduce the performance for all four scores, showing the importance of this second encoder.

Discussion
The proposed multi-encoder based transformer architecture ({src, mt} smt tr → pe, Exp. 2.3) shows slightly worse results than wmt18 smt best (single) (Exp. 1.3) before fine-tuning, and roughly similar results after fine-tuning (Exp. 3.3). After ensembling, however, our transference model (Exp. 4.2) shows consistent improvements when comparing against the best baseline ensemble wmt18 smt best (x4) (Exp. 1.4). Due to the unavailability of the sentence-level scores of wmt18 smt best (x4), we could not test if the improvements (roughly +1 BLEU, -0.5 TER) are statistically significant. Interestingly, our approach of taking the model optimized for PBSMT and fine-tuning it to the NMT task (Exp. 7) does not hurt the performance as was reported in the previous literature (Junczys-Dowmunt and Grundkiewicz, 2018). In contrast, some small, albeit statistically insignificant improvements over the raw NMT baseline were achieved. When we train the transference architecture directly for the NMT task (Exp. 8), we get slightly better and statistically significant improvements compared to raw NMT. Fine-tuning this NMT model further towards the actual NMT data (Exp. 9), as well as performing checkpoint averaging using the 8 best checkpoints improves the results even further. Compared to wmt18 smt best and wmt19 nmt best , our architecture is simpler, faster during inference, it follows the two-step approach of professional post-editors, and has no additional overhead like BERT.
The reasons for the effectiveness of our approach can be summarized as follows.
(1) Our enc src→mt contains two attention mechanisms: one is self-attention and another is cross-attention. The self-attention layer is not masked here; therefore, the cross-attention layer in enc src→mt is informed by both previous and future time-steps from the self-attended representation of mt (enc mt ) and additionally from enc src . As a result, each state representation of enc src→mt is learned from the context of src and mt. This might produce better representations for dec pe which can access the combined context. In contrast, in wmt18 smt best , the dec pe accesses representations from src and mt independently, first using the representation from mt and then using that of src.
(2) The position-wise feed-forward layer in the enc src→mt of our model requires processing information from two attention modules, while in the case of wmt18 smt best , the position-wise feed-forward layer in dec tgt needs to process information from three attention modules, which may increase the learning difficulty of the feed-forward layer.
(3) Since pe is a post-edited version of mt, sharing the same language, mt and pe are quite similar compared to src. Therefore, attending over a fine-tuned representation from mt along with src, which is what we have done in this work, might be a reason for the better results than those achieved by attending over src directly.
Evaluating the influence of the depth of our encoders and decoder shows that while the decoder depth appears to have limited importance, reducing the encoder depth indeed hurts performance, which is in line with Domhan (2018).

Conclusions
In this paper, we presented a multi-encoder transformer-based APE model that repurposes the standard transformer blocks in a simple and effective way for the APE task: our transference architecture uses a transformer encoder block for src, followed by a decoder block without masking on mt that effectively acts as a second encoder combining src → mt, and feeds this representation into a final decoder block generating pe. The proposed model outperforms the best-performing system of WMT 2018 on the test2016, test2017, dev2018, and test2018 data. Moreover, our model is on par with, but simpler than, the best WMT 2019 system, since our model does not apply BERT or any conservativeness factor during inference. Departing from traditional transformer-based encoders, which perform self-attention only, our second encoder also performs cross-attention to produce representations for the decoder based on both src and mt. We also show that the encoder plays a more pivotal role than the decoder in transformer-based APE, which could also be the case for transformer-based generation tasks in general. Our architecture is generic and can be used for any multi-source task, e.g., (i) multi-source translation, (ii) document translation to model the associated context during translation, (iii) question generation to generate a question from a given passage and a short answer text, (iv) question answering from a given passage and question text, and (v) summarization.