A Simple and Effective Unified Encoder for Document-Level Machine Translation

Most existing models for document-level machine translation adopt dual-encoder structures, in which the representations of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the document-level contexts, they do not fully model the interaction between the contexts and the source sentences, and they cannot be directly adapted to recent pre-trained models (e.g., BERT), which encode multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that outperforms dual-encoder baseline models in terms of BLEU and METEOR scores. Moreover, pre-trained models can further boost the performance of our proposed model.


Introduction
Thanks to the development of deep learning methods, machine translation systems have achieved good performance that is even comparable with human translation in the news domain (Hassan et al., 2018). However, there are still some problems with machine translation in the document-level context (Läubli et al., 2018). Therefore, more recent work (Jean et al., 2017; Wang et al., 2017; Tiedemann and Scherrer, 2017; Maruf and Haffari, 2018; Bawden et al., 2018; Voita et al., 2019a; Junczys-Dowmunt, 2019) focuses on document-level machine translation.
Most of the existing models (Maruf et al., 2019; Werlen et al., 2018) for document-level machine translation use two encoders to model the source sentences and the document-level contexts. (In this work, document-level contexts denote the surrounding sentences of the current source sentence.) Figure 1a illustrates the structure of these models. They extend the standard Transformer model with a new context encoder, and the encoder for source sentences is conditioned on this context encoder. However, they do not fully model the interaction between the contexts and the source sentences because the self-attention layers operate inside each encoder separately. Moreover, they cannot be directly adapted to recent pre-trained models (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019; Dong et al., 2019; Song et al., 2019; Lample and Conneau, 2019), which encode multiple sentences with a single encoder.
Different from the dual-encoder structure, the uni-encoder structure takes the concatenation of contexts and source sentences as the input (as shown in Figure 1b). Therefore, when modeling the contexts, it can make full use of the interaction between the source sentences and the contexts, while the dual-encoder model fails to exploit this information. Moreover, the uni-encoder structure is identical to that of recent pre-trained models (e.g., BERT). However, the previous uni-encoder structure suffers from two problems for document-level machine translation. First, the attention is distracted due to longer sequences. Second, the source sentences and the contexts are modeled equally, which is contrary to the fact that the translation is more related to the current source sentences.
To address these problems, we propose a novel flat structure with a unified encoder called Flat-Transformer. It separates the encoder of the standard Transformer into two parts so that the attention can concentrate on both the global level and the local level. At the bottom of the encoder blocks, the self-attention is applied to the whole sequence. At the top of the blocks, it is applied only at the positions of the source sentences. We evaluate this model on three document-level machine translation datasets. Experiments show that it achieves better performance than the dual-encoder baseline models in terms of BLEU and METEOR scores. Moreover, pre-trained models can further boost the performance of the proposed structure.

Flat-Transformer
In this section, we introduce our proposed flat structured model, which we denote as Flat-Transformer.

Document-Level Translation
Formally, we denote $X = \{x_1, x_2, \cdots, x_N\}$ as the source document with $N$ sentences, and $Y = \{y_1, y_2, \cdots, y_M\}$ as the target document with $M$ sentences. We assume that $N = M$ because sentence mismatches can be fixed by merging sentences with sentence alignment algorithms (Sennrich and Volk, 2011). Therefore, we can assume that $(x_i, y_i)$ is a parallel sentence pair.
Following previous work, the target-side history $y_{<i}$ can be omitted because $x_{<i}$ and $y_{<i}$ convey the same information. As a result, the probability can be approximated as:
$$P(Y \mid X) \approx \prod_{i=1}^{N} P(y_i \mid x_i, x_{<i}, x_{>i})$$
where $x_i$ is the source sentence aligned to $y_i$, and $(x_{<i}, x_{>i})$ is the document-level context used to translate $y_i$.
Figure 2: The architecture of the proposed Flat-Transformer model.

Segment Embedding
The flat structure adopts a unified encoder that does not distinguish between the context sentences and the source sentences. Therefore, we introduce a segment embedding to identify these two types of inputs. Formally, given the surrounding context $c$ and the current sentence $x$ as the source input, we project them into word embeddings and segment embeddings. Then, we perform a concatenation operation to unify them into a single input:
$$e = [E(c); E(x)], \qquad s = [S(c); S(x)]$$
where $[\,;\,]$ denotes the concatenation operation, $E$ is the word embedding matrix, and $S$ is the segment embedding matrix. Finally, we add $e$ and $s$ to form the input of the encoder.
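The lookup-concatenate-add procedure described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy vocabulary size, the embedding dimension, the random tables standing in for learned matrices, and the choice of segment ids 0 (context) and 1 (current sentence) are all assumptions made for illustration, and positional embeddings are left out.

```python
import random


def make_table(num_rows, dim, seed):
    """Random table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_input(context_ids, source_ids, word_table, segment_table):
    """Build one input vector per token of the concatenated sequence."""
    token_ids = context_ids + source_ids
    # Assumed convention: segment 0 = context tokens, segment 1 = current sentence.
    segment_ids = [0] * len(context_ids) + [1] * len(source_ids)
    e = [word_table[t] for t in token_ids]        # word embeddings
    s = [segment_table[g] for g in segment_ids]   # segment embeddings
    # The encoder receives the element-wise sum of the two embeddings.
    return [[a + b for a, b in zip(e_row, s_row)] for e_row, s_row in zip(e, s)]


VOCAB_SIZE, DIM = 10, 4                # toy sizes, not the paper's settings
E = make_table(VOCAB_SIZE, DIM, seed=0)
S = make_table(2, DIM, seed=1)

# Token id 5 occurs once in the context and once in the current sentence.
inputs = embed_input([3, 5, 7], [2, 5], E, S)
print(len(inputs), len(inputs[0]))     # sequence length, embedding size
print(inputs[1] != inputs[4])          # same word, different segment
```

The same word receives a different input vector depending on whether it sits in the context or in the current sentence, which is the only signal the unified encoder has for telling the two apart.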

Unified Flat Encoder
Given the document context, the input sequences of Flat-Transformer are much longer than those of the standard Transformer, which brings additional challenges. First, the attention is distracted, and its weights become much smaller after the normalization operation. Second, the memory consumption and the computation cost increase, so it is difficult to enlarge the model size, which hinders the adaptation to pre-trained models. To address these problems, we introduce a unified flat encoder. As shown in Figure 2, at the bottom of the encoder blocks, we apply the self-attention and feed-forward layers to the concatenated sequence of the contexts and the current sentence:
$$H_1 = \mathrm{Transformer}(e + s; \theta)$$
where $\theta$ denotes the parameters of the Transformer blocks. At the top of the encoder blocks, each self-attention and feed-forward layer is applied only at the positions of the current sentence:
$$H_2 = \mathrm{Transformer}(H_1[s:t]; \theta)$$
where $s$ and $t$ are the starting and ending positions of the source sentence in the concatenated sequence (here $s$ is a position index, not the segment embedding). In this way, the attention can focus more on the current sentence, while the contexts serve as supplemental semantics for it. Note that the total number of bottom and top blocks is equal to the number of blocks in the standard Transformer, so the model has no more parameters than the standard Transformer.
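The two-stage encoding can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each "block" is reduced to a parameter-free single-head self-attention with a residual connection (learned projections, layer normalization, and the feed-forward layer are omitted), the function names are hypothetical, and the span is treated as a half-open Python slice `[start:end)`. The default block counts (1 bottom, 5 top) follow the experimental setup reported in the paper.

```python
import math


def self_attention(h):
    """Parameter-free single-head self-attention with a residual connection."""
    dim = len(h[0])
    out = []
    for q in h:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in h]
        top = max(scores)                         # subtract max for stability
        weights = [math.exp(v - top) for v in scores]
        total = sum(weights)
        mixed = [sum(w / total * k[j] for w, k in zip(weights, h))
                 for j in range(dim)]
        out.append([qj + mj for qj, mj in zip(q, mixed)])
    return out


def flat_encoder(inputs, start, end, num_bottom=1, num_top=5):
    """Bottom blocks see the whole sequence; top blocks see only [start:end)."""
    h = inputs
    for _ in range(num_bottom):
        h = self_attention(h)     # global level: contexts + current sentence
    h = h[start:end]              # keep only the current-sentence positions
    for _ in range(num_top):
        h = self_attention(h)     # local level: current sentence only
    return h


# Toy sequence: 3 context tokens, 2 current-sentence tokens, 2 context tokens.
inputs = [[0.1 * (i + j) for j in range(4)] for i in range(7)]
out = flat_encoder(inputs, start=3, end=5)
print(len(out), len(out[0]))
```

Only the current-sentence vectors leave the encoder, so a decoder attending to `out` never sees context positions directly; the contexts influence the result solely through the bottom blocks.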

Training and Decoding
The training of Flat-Transformer is consistent with that of the standard Transformer, using the cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^{N} \log P(y_i \mid x_i, x_{<i}, x_{>i})$$
At the decoding step, it translates the document sentence by sentence. When translating each sentence, it predicts the target sequence with the highest probability given the current sentence $x_i$ and the surrounding contexts $x_{<i}, x_{>i}$:
$$\hat{y}_i = \arg\max_{y_i} P(y_i \mid x_i, x_{<i}, x_{>i})$$
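The sentence-by-sentence decoding loop can be sketched as below. The helper names are hypothetical, `dummy_translate` merely stands in for a trained model's search procedure, and the window of one previous and one next sentence is taken from the experimental setup rather than being a requirement of the method.

```python
def context_window(document, i, num_prev=1, num_next=1):
    """Return (previous sentences, current sentence, next sentences)."""
    prev_ctx = document[max(0, i - num_prev):i]
    next_ctx = document[i + 1:i + 1 + num_next]
    return prev_ctx, document[i], next_ctx


def translate_document(document, translate_sentence):
    """Translate one sentence at a time, passing its surrounding context."""
    outputs = []
    for i in range(len(document)):
        prev_ctx, current, next_ctx = context_window(document, i)
        outputs.append(translate_sentence(current, prev_ctx + next_ctx))
    return outputs


def dummy_translate(current, context):
    """Hypothetical stand-in for the model; reports how much context it got."""
    return f"<{current}|ctx={len(context)}>"


doc = ["s1", "s2", "s3"]
print(translate_document(doc, dummy_translate))
```

The first and last sentences of a document receive a shorter context, since there is no previous or next sentence to include.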

Comparison with Existing Models
Here, we summarize some significant differences compared with the existing models for document-level machine translation:
1. Compared with the dual-encoder models, our model uses a unified encoder. To combine the representations of two encoders for the decoder, the dual-encoder models must add a layer inside the encoders. Flat-Transformer does not put any layer on top of the standard Transformer, so it is consistent with recent pre-trained models.
2. Compared with the previous uni-encoder models, our model limits the top Transformer layers to model only the source sentences. In this way, our model has an inductive bias toward modeling the current sentences more than the contexts, because the translation is more related to the current sentences.
3. There are also some alternative approaches to limit the use of context vectors. For example, one can limit only the top attention layers to attend to the source sentence while keeping the feed-forward layers the same. Compared with this approach, our model does not feed the output vectors of the context positions to the decoder, so the decoder attention is not distracted by the contexts. The context vectors in our model serve only to help encode a better representation of the current source sentences.

Experiments
We evaluate the proposed model and several state-of-the-art models on three document-level machine translation benchmarks. We denote the proposed model as Flat-Transformer.

Datasets
Following previous work (Maruf et al., 2019), we use three English-German datasets as benchmarks: TED, News, and Europarl. The statistics of these datasets can be found in Table 1. We obtain the processed datasets from Maruf et al. (2019), so that our results can be compared directly with those they report. We use the scripts of the Moses toolkit to tokenize the sentences. We also split the words into subword units (Sennrich et al., 2016) with 30K merge operations. The evaluation metrics are BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005).

Implementation Details
The batch size is limited to 4,000 tokens for all models. We set the hidden units of the multi-head component and the feed-forward layer to 512 and 1024, respectively. The embedding size is 512, the number of heads is 4, and dropout (Srivastava et al., 2014) is applied. The number of blocks for the top encoder is 5, while that for the bottom encoder is 1. When fine-tuning on the pre-trained BERT, we adopt the base setting, in which the hidden size, the feed-forward dimension, and the number of heads are 768, 3072, and 12, respectively. To balance accuracy and computation cost, we use one previous sentence and one next sentence as the surrounding contexts. We use the Adam optimizer (Kingma and Ba, 2014) to train the models, with momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.98$, and $\epsilon = 1 \times 10^{-8}$. The learning rate increases linearly from 0 to $5 \times 10^{-4}$ over the first 4,000 warm-up steps and then decreases in proportion to the inverse square root of the number of updates. We also apply label smoothing to the cross-entropy loss with a smoothing rate of 0.1. We use early stopping with a patience of 10 epochs: training stops once the loss on the validation set has not decreased for 10 epochs.
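The learning-rate schedule described above (linear warm-up followed by inverse-square-root decay) can be written as a short function. This is a sketch of one common way to implement such a schedule, with the decay scaled so that the two phases meet at the peak value; the function name and the exact form of the decay constant are assumptions, while the peak rate and warm-up length are the values stated in the text.

```python
def learning_rate(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        # Warm-up phase: grows linearly from 0 at step 0.
        return peak_lr * step / warmup_steps
    # Decay phase: equals peak_lr at step == warmup_steps, then shrinks.
    return peak_lr * (warmup_steps / step) ** 0.5


for step in (0, 2000, 4000, 16000, 64000):
    print(step, learning_rate(step))
```

Under this form, quadrupling the number of updates after warm-up halves the learning rate.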

Baselines
We compare our models with two categories of baseline models: the dual-encoder models and the uni-encoder models.
Uni-encoder: RNNSearch (Bahdanau et al., 2015) is an RNN-based sequence-to-sequence model with the attention mechanism. Transformer (Vaswani et al., 2017) is a popular model for machine translation, based solely on attention mechanisms. For a fair comparison, we use the same hyper-parameters as for our model, which are described in Section 3.2.
Dual-encoder: The dual-encoder Transformer extends the Transformer model with a new context encoder to represent the contexts. HAN (Werlen et al., 2018) is the first to use a hierarchical attention model to capture the context in a structured and dynamic manner. SAN (Maruf et al., 2019) proposes a new selective attention model that uses sparse attention to focus on relevant sentences in the document context. QCN proposes a query-guided capsule network to cluster context information into different perspectives.

Results
We compare our Flat-Transformer model with the above baselines. Table 2 summarizes the results of these models. It shows that Flat-Transformer obtains scores of 24.87/23.55/30.09 on the three datasets in terms of BLEU, and 47.05/43.97/48.56 in terms of METEOR, significantly outperforming the previous flat models (RNNSearch and Transformer).
By fine-tuning on BERT, Flat-Transformer achieves improvements of +1.74/+0.97/+1.90 BLEU as well as +1.48/+1.43/+1.20 METEOR. This shows that Flat-Transformer is compatible with the pre-trained BERT model. Except for the BLEU score on the News dataset, Flat-Transformer significantly outperforms the dual-encoder models, achieving state-of-the-art performance in terms of both BLEU and METEOR scores. In contrast, the dual-encoder Transformer is not compatible with BERT. It performs slightly worse on two datasets, mainly because the model size becomes larger to adapt to the setting of BERT. Moreover, BERT does not provide a good prior initialization for modeling the uni-directional relationship from contexts to source sentences.
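As a worked check of the figures above, the snippet below adds the reported BERT gains to the reported Flat-Transformer scores. It rests on two assumptions that the text does not state explicitly: that the three slash-separated values follow the dataset order TED/News/Europarl, and that the gains are measured relative to the Flat-Transformer scores without BERT. The resulting numbers are therefore implied values, not figures quoted from Table 2.

```python
# (BLEU, METEOR) per dataset, copied from the text; dataset order is assumed.
base = {"TED": (24.87, 47.05), "News": (23.55, 43.97), "Europarl": (30.09, 48.56)}
gain = {"TED": (1.74, 1.48), "News": (0.97, 1.43), "Europarl": (1.90, 1.20)}

# Implied scores after fine-tuning on BERT: base score plus reported gain.
implied = {
    name: tuple(round(b + g, 2) for b, g in zip(base[name], gain[name]))
    for name in base
}

for name, (bleu, meteor) in implied.items():
    print(f"{name}: BLEU {bleu:.2f}, METEOR {meteor:.2f}")
```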

Ablation Study
To analyze the effect of each component of Flat-Transformer, we conduct an ablation study on the TED dataset by removing the components one at a time. Table 3 summarizes the results. We first remove the segment embedding but retain the unified structure. The results show that the segment embedding contributes an improvement of 0.51 BLEU and 0.85 METEOR, demonstrating the importance of explicitly identifying the contexts and the source sentences. After further removing the unified structure of Flat-Transformer, the model becomes a standard Transformer. The results show that the unified structure contributes a gain of 1.08 in terms of BLEU and 2.03 in terms of METEOR. The reason is that the unified structure encourages the model to focus more on the source sentences, while the contexts can be regarded as semantic supplements.

Related Work
Here we summarize the recent advances in document-level neural machine translation. Some work focuses on improving the architectures of document-level machine translation models. Tiedemann and Scherrer (2017) and Wang et al. (2017) explore possible solutions to exploit the cross-sentence contexts for neural machine translation. The dual-encoder Transformer extends the Transformer model with a new context encoder to represent document-level context. Werlen et al. (2018) and Maruf et al. (2019) propose two different hierarchical attention models to model the contexts. QCN introduces a capsule network to improve these hierarchical structures. There are also some works analyzing the contextual errors (Voita et al., 2019b; Bawden et al., 2018) and providing test suites (Müller et al., 2018). More recently, Voita et al. (2019a) explore approaches to incorporate monolingual data to augment the document-level bilingual dataset. Different from these works, this paper mainly discusses the comparison between dual-encoder models and uni-encoder models and proposes a novel method to improve the uni-encoder structure.

Conclusions
In this work, we explore solutions to improve the uni-encoder structure for document-level machine translation. We propose a Flat-Transformer model with a unified encoder, which is simple and can model the bi-directional relationship between the contexts and the source sentences. In addition, Flat-Transformer is compatible with pre-trained models, yielding better performance than both the existing uni-encoder models and the dual-encoder models on two datasets.