Transformer-based Automatic Post-Editing Model with Joint Encoder and Multi-source Attention of Decoder

This paper describes POSTECH’s submission to the WMT 2019 shared task on Automatic Post-Editing (APE). We propose a new multi-source APE model that extends Transformer. The main contributions of our study are that we 1) reconstruct the encoder to generate a joint representation of the translation (mt) and its src context, in addition to the conventional src encoding, and 2) suggest two types of multi-source attention layers that compute attention between the two encoder outputs and the decoder state. Furthermore, we train our model with various teacher-forcing ratios to alleviate exposure bias. Finally, we adopt an ensemble technique across variations of our model. Experiments on the WMT19 English-German APE dataset show improvements in terms of both TER and BLEU scores over the baseline. Our primary submission achieves -0.73 in TER and +1.49 in BLEU compared to the baseline.


Introduction
Automatic Post-Editing (APE) is the task of automatically correcting errors in a given machine translation (MT) output to generate a better translation (Chatterjee et al., 2018). Because APE can be regarded as a sequence-to-sequence problem, MT techniques have previously been applied to this task. It is therefore natural that neural APE was proposed following the appearance of neural machine translation (NMT).
Among the initial approaches to neural APE, a log-linear combination model (Junczys-Dowmunt and Grundkiewicz, 2016) that combines bilingual and monolingual translations yielded the best results. Since then, in order to leverage information from both MT outputs (mt) and their corresponding source sentences (src), multi-encoder models (Libovický et al., 2016) based on multi-source translation (Zoph and Knight, 2016) have become the prevalent approach (Bojar et al., 2017). Recently, with the advent of Transformer (Vaswani et al., 2017), most of the participants in the WMT18 APE shared task proposed Transformer-based multi-encoder APE models (Chatterjee et al., 2018).
* Both authors equally contributed to this work.
Previous multi-encoder APE models employ separate encoders for each input (src, mt) and combine their outputs in various ways: by 1) sequentially applying attention between the hidden state of the decoder and the two outputs (Junczys-Dowmunt and Grundkiewicz, 2018; Shin and Lee, 2018) or 2) simply concatenating them (Pal et al., 2018; Tebbifakhr et al., 2018). However, these approaches seem to overlook one of the key differences between general multi-source translation and APE: because the errors that mt may contain depend on the MT system, the encoding process for mt should reflect its relationship with the source sentence. Furthermore, we believe it is helpful to incorporate information from the source sentence, which should ideally be error-free, in addition to the jointly encoded mt, when generating the post-edited sentence.
From these points of view, we propose a multi-source APE model that extends Transformer with a joint multi-source encoder and a decoder containing a multi-source attention layer to combine the outputs of the encoder. In addition, we apply various teacher-forcing ratios at training time to alleviate exposure bias. Finally, we ensemble model variants for our submission. The remainder of the paper is organized as follows: Section 2 describes our model architecture.

Section 3 summarizes the experimental results, and Section 4 gives the conclusion.

Model Description
We adapt Transformer to the APE problem, so that the model takes multiple inputs (src, mt) to generate a post-edited sentence (pe). In the following subsections, we describe our modified encoder and decoder.

Encoder
The proposed encoder structure for multi-source inputs, as shown in Figure 1, is an extension of the encoder introduced in Vaswani et al. (2017), which was designed for single-source input. Similar to recent APE studies, our encoder receives two sources, src x = (x_1, …, x_n) and mt y = (y_1, …, y_m), where n and m denote their sequence lengths respectively, but it produces the joint representation E_joint = (e_1, …, e_m), in addition to the encoded src E_src = (e_1, …, e_n).

Joint representation.
Unlike previous studies, which independently encode the two input sources using separate encoding modules, we incorporate src context information into each hidden state of mt through a single encoding module, resulting in a joint representation of the two sources. As shown by the dashed square in Figure 1, the jointly represented hidden states are obtained from a residual connection and a multi-head attention that takes the src encoding E_src^l ∈ ℝ^(n×d) as keys and values and the mt hidden states H^l ∈ ℝ^(m×d) as queries. Therefore, the joint representation at each level l of the stack (l = 1, …, L) can be expressed with the MultiHead(Q, K, V) and LayerNorm operations described in Vaswani et al. (2017) as follows:

E_joint^l = LayerNorm(H^l + MultiHead(H^l, E_src^l, E_src^l))   (1)

Stack-level attention. When applying attention across source and target, the original Transformer only considers the source hidden states retrieved from the final stack, whereas our encoder feeds into each attention layer the src embeddings from the same level, as can be seen in (1).
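The joint-encoding step can be sketched in plain NumPy. This is an illustrative single-head sketch, not the paper's actual implementation: the multi-head split, projections, and LayerNorm are omitted, and all names and shapes are assumptions.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ v

def joint_layer(h_mt, e_src):
    """mt hidden states attend to the src encodings of the same
    stack level (keys/values = src, queries = mt), followed by a
    residual connection; LayerNorm is omitted for brevity."""
    return h_mt + attention(h_mt, e_src, e_src)

rng = np.random.default_rng(0)
h_mt = rng.standard_normal((5, 8))    # m = 5 mt positions, d = 8
e_src = rng.standard_normal((7, 8))   # n = 7 src positions
e_joint = joint_layer(h_mt, e_src)    # joint representation, shape (5, 8)
```

Note that the joint representation has the length of mt (5 positions here), not of src, since mt supplies the queries.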

Masking option.
The self-attention layer, which is the first attention layer of the mt encoding module, optionally includes a future mask that mimics the general decoding process of MT systems, which depends only on previously generated words. We conduct experiments (§3.2) for two cases: with and without this option.
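A minimal sketch of such a future mask, under the common convention that True marks score positions to be blocked (set to -inf) before the softmax:

```python
import numpy as np

def future_mask(m):
    """Boolean mask for m positions: position i may attend only to
    positions j <= i, mimicking the left-to-right decoding of the
    upstream MT system."""
    return np.triu(np.ones((m, m), dtype=bool), k=1)

mask = future_mask(4)
# row 0 blocks columns 1..3; the diagonal is never masked
```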

Decoder
Our decoder is an extension of the Transformer decoder, in which the second multi-head attention layer, which originally refers only to single-source encoder states, is replaced with a multi-source attention layer. Figure 2 shows our decoder architecture, including the multi-source attention layer that attends to both outputs of the encoder. Furthermore, we construct two types of multi-source attention layers by adopting different strategies for combining attention over the two encoder output states.
Multi-source parallel attention. Figure 3a illustrates the structure of parallel attention. The decoder's hidden state D ∈ ℝ^(k×d) simultaneously attends to each output of the multi-source encoder, each followed by a residual connection, and the results are linearly combined by summing them at the end. Note that D denotes the hidden states for the decoder input pe z = (z_1, …, z_k).
Multi-source sequential attention. As shown in Figure 3b, the two outputs of the encoder are sequentially combined with the decoder's hidden state: one encoder output and the decoder's hidden state are first passed through multi-head attention and residual connection layers, and then the same operation is performed between the result and the other encoder output.
This approach is structurally equivalent to Junczys-Dowmunt and Grundkiewicz (2018), except that the encoder states being passed on are different.
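The two combination strategies can be contrasted in a short NumPy sketch. As before, this is a single-head illustration without projections or LayerNorm, and the argument order in the sequential variant is an assumption of this sketch, not a claim about the paper's configuration.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def parallel_attention(d, e_src, e_joint):
    """Attend to both encoder outputs independently, apply a residual
    connection to each, and combine the results by summation."""
    a_src = d + attention(d, e_src, e_src)
    a_joint = d + attention(d, e_joint, e_joint)
    return a_src + a_joint

def sequential_attention(d, e_first, e_second):
    """Combine one encoder output first, then the other; which output
    comes first is left open in this sketch."""
    x = d + attention(d, e_first, e_first)
    return x + attention(x, e_second, e_second)

rng = np.random.default_rng(1)
d = rng.standard_normal((4, 8))        # k = 4 pe positions, d = 8
e_src = rng.standard_normal((7, 8))    # encoded src
e_joint = rng.standard_normal((5, 8))  # joint representation
```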

Dataset
We used the WMT19 official English-German APE dataset (Chatterjee et al., 2018), which consists of a training and a development set. In addition, we adopted the eSCAPE NMT dataset as additional training data. We extracted sentence triplets from the eSCAPE-NMT dataset according to the following criteria, to which the official training dataset mostly adheres: selected triplets have no more than 70 words in each sentence, a TER of at most 75, and a reciprocal length ratio within the monolingual pair (mt, pe) of less than 1.4. Table 1 summarizes the statistics of the datasets.
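The filtering criteria above translate directly into a predicate over triplets. In this sketch, `ter` is assumed to be a precomputed TER percentage from an external tool (e.g., tercom); whitespace tokenization stands in for whatever tokenization was actually used.

```python
def keep_triplet(src, mt, pe, ter):
    """Return True if the (src, mt, pe) triplet passes the
    eSCAPE-NMT filtering criteria described above."""
    if max(len(src.split()), len(mt.split()), len(pe.split())) > 70:
        return False                       # no sentence longer than 70 words
    if ter > 75:
        return False                       # TER must be <= 75
    m, p = len(mt.split()), len(pe.split())
    return max(m, p) / min(m, p) < 1.4     # (mt, pe) length ratio
```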

Training Details
Settings. We modified the OpenNMT-py (Klein et al., 2017) implementation of Transformer to build our models. Most hyperparameters, such as the dimensionality of hidden states, optimizer settings, and dropout ratio, were copied from the "base model" described in Vaswani et al. (2017). We adjusted the warm-up steps and the batch size to 18k and approximately 25k triplets, respectively. For data preprocessing, we employed subword encoding (Kudo, 2018) with a 32k shared vocabulary.

Two-step training. We separated the training process into two steps: the first phase trains a generic model, and the second phase fine-tunes it. For the first phase, we trained the model on a union dataset, namely the concatenation of eSCAPE-NMT-filtered and the official training set upsampled by copying it 20 times. After reaching convergence in the first phase, we fine-tuned the model in the second phase using only the official training set.
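The phase-1 training-set construction amounts to a simple concatenation with upsampling; the list-of-triplets representation here is an assumption for illustration.

```python
def phase1_data(official, escape_filtered, upsample=20):
    """Generic-model (phase-1) training set: the filtered eSCAPE-NMT
    triplets concatenated with the official training set copied
    `upsample` times (20 in the paper)."""
    return escape_filtered + official * upsample
```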

Model variations.
In our experiments, we constructed four types of models according to the presence of the encoder future mask and the type of multi-source attention layer in the decoder:
- Parallel w/ masking: the model uses the multi-source parallel attention layer with the encoder mask.
- Parallel w/o masking: the encoder mask is excluded from Parallel w/ masking.
- Sequential w/ masking: the model uses the multi-source sequential attention layer with the encoder mask.
- Sequential w/o masking: the encoder mask is excluded from Sequential w/ masking.
Teacher-forcing ratio. During training, because the decoder takes as input the target shifted to the right, the ground-truth words are passed to the decoder. At inference time, however, the decoder consumes only its previously produced output words, causing exposure bias. To overcome this problem, we empirically adjusted the teacher-forcing ratio in the second phase of training, so that teacher forcing is applied stochastically: given a ratio r, the greedy decoding output of the previous step is fed into the next input with a probability of 1 − r.
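The per-step decision can be sketched as follows; the function name and signature are illustrative, not the paper's implementation.

```python
import random

def next_decoder_input(gold_prev, model_prev, ratio, rng=random):
    """With probability `ratio`, feed the ground-truth previous token;
    otherwise feed the model's own greedy prediction from the previous
    step. ratio = 1.0 recovers standard teacher forcing; lower ratios
    expose the model to its own outputs during training."""
    return gold_prev if rng.random() < ratio else model_prev
```

At the extremes the behavior is deterministic: ratio 1.0 always feeds the gold token, and ratio 0.0 always feeds the model's prediction.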

Ensemble.
To leverage all variants across different architectures and teacher-forcing ratios, we combined them using an ensemble approach according to the following three criteria:
- Ens_set_1: the top-N candidates among all variants, in terms of TER.
- Ens_set_2: the top-N candidates for the variants of each architecture, in terms of TER.
- Ens_set_3: two candidates for the variants of each architecture, achieving the best TER and BLEU scores, respectively.
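The three selection criteria can be made concrete with a small sketch. The dictionary layout and all metric values here are hypothetical; only the selection logic mirrors the criteria above (lower TER is better, higher BLEU is better).

```python
def ens_set_1(variants, n):
    """Top-n candidates among all variants by TER."""
    return sorted(variants, key=lambda v: v["ter"])[:n]

def ens_set_2(variants, n):
    """Top-n candidates by TER within each architecture."""
    by_arch = {}
    for v in variants:
        by_arch.setdefault(v["arch"], []).append(v)
    return [c for vs in by_arch.values()
            for c in sorted(vs, key=lambda v: v["ter"])[:n]]

def ens_set_3(variants):
    """Per architecture: the best-TER and the best-BLEU candidate."""
    by_arch = {}
    for v in variants:
        by_arch.setdefault(v["arch"], []).append(v)
    picks = []
    for vs in by_arch.values():
        picks.append(min(vs, key=lambda v: v["ter"]))
        picks.append(max(vs, key=lambda v: v["bleu"]))
    return picks
```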

Results
We trained a generic model for each of the four model variations mentioned in §3.2, and then fine-tuned those models with various teacher-forcing ratios. For evaluation, we used TER (Snover et al., 2006) and BLEU (Papineni et al., 2002) scores on the WMT official development set. Table 2 shows the scores of the generic and fine-tuned models according to their architectures and teacher-forcing ratios. The results show that adjusting the teacher-forcing ratio helps improve the post-editing performance of the models. Table 3 gives the results of the ensemble models. The ensemble models had slightly worse TER scores (+0.02 to +0.13) than the best TER score among the fine-tuned variants, but better BLEU scores (+0.09 to +0.27) than the best BLEU score. We selected the three best ensemble models for submission, expecting to benefit from leveraging different architectures in the decoding process. The names and types of the submissions are noted in Table 3.
Submission results. The results of the primary and contrastive submissions on the official test set are reported in Table 4. Our primary submission achieves improvements of -0.73 in TER and +1.49 in BLEU compared to the baseline, and surpasses the state of the art from the last round by -0.35 in TER and +0.69 in BLEU. While our primary system ranks second among the 18 systems submitted this year, it achieves the highest BLEU score.

Conclusion
In this paper, we present POSTECH's submissions to the WMT19 APE shared task. We propose a new Transformer-based APE model comprising a joint multi-source encoder and a decoder with two types of multi-source attention layers. The proposed encoder generates a joint representation of the MT output, with optional masking, in addition to the encoded source sentence. The proposed decoder employs one of two types of multi-source attention layers according to the post-editing strategy. We refine the eSCAPE-NMT dataset and apply two-step training with various teacher-forcing ratios. Finally, our ensemble models showed improvements in terms of both TER and BLEU, outperforming not only the baseline but also the best model from the previous round of the task.

Table 3: Results of ensemble models. "Submission Name" indicates the names (types) for the submission. The bold values indicate the best result in each metric.