Improving the Transformer Translation Model with Document-Level Context

Although the Transformer translation model (Vaswani et al., 2017) has achieved state-of-the-art performance in a variety of translation tasks, how to use document-level context to deal with discourse phenomena problematic for Transformer still remains a challenge. In this work, we extend the Transformer model with a new context encoder to represent document-level context, which is then incorporated into the original encoder and decoder. As large-scale document-level parallel corpora are usually not available, we introduce a two-step training method to take full advantage of abundant sentence-level parallel corpora and limited document-level parallel corpora. Experiments on the NIST Chinese-English datasets and the IWSLT French-English datasets show that our approach improves over Transformer significantly.


Introduction
The past several years have witnessed the rapid development of neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015), which investigates the use of neural networks to model the translation process. Showing remarkable superiority over conventional statistical machine translation (SMT), NMT has been recognized as the new de facto method and is widely used in commercial MT systems. A variety of NMT models have been proposed to map between natural languages, such as RNNencdec (Sutskever et al., 2014), RNNsearch (Bahdanau et al., 2015), ConvS2S (Gehring et al., 2017), and Transformer (Vaswani et al., 2017). Among them, the Transformer model has achieved state-of-the-art translation performance. Its capability to minimize the path length between long-distance dependencies in neural networks contributes to its exceptional performance.
However, the Transformer model still suffers from a major drawback: it performs translation only at the sentence level and ignores document-level context. Document-level context has proven to be beneficial for improving translation performance, not only for conventional SMT but also for NMT. Prior work indicates that it is important to exploit document-level context to deal with context-dependent phenomena that are problematic for machine translation, such as coreference, lexical cohesion, and lexical disambiguation.
While document-level NMT has attracted increasing attention from the community in the past two years (Tiedemann and Scherrer, 2017; Maruf and Haffari, 2018), to the best of our knowledge, only one existing work has endeavored to model document-level context for the Transformer model. Most previous approaches to document-level NMT have concentrated on the RNNsearch model (Bahdanau et al., 2015). It is challenging to adapt these approaches to Transformer because they are designed specifically for RNNsearch.
In this work, we propose to extend the Transformer model to take advantage of document-level context. The basic idea is to use multi-head self-attention (Vaswani et al., 2017) to compute the representation of document-level context, which is then incorporated into the encoder and decoder using multi-head attention. Since large-scale document-level parallel corpora are usually hard to acquire, we propose to train sentence-level model parameters on sentence-level parallel corpora first and then estimate document-level model parameters on document-level parallel corpora while keeping the learned sentence-level Transformer model parameters fixed. Our approach has the following advantages:
1. Increased capability to capture context: the use of multi-head attention, which significantly reduces the path length between long-range dependencies, helps to improve the capability to capture document-level context;
2. Small computational overhead: as all newly introduced modules are based on highly parallelizable multi-head attention, there is no significant slowdown in either training or decoding;
3. Better use of limited labeled data: our approach is capable of maintaining its superiority over the sentence-level counterpart even when only small-scale document-level parallel corpora are available.
Experiments show that our approach achieves an improvement of 1.96 and 0.89 BLEU points over Transformer on Chinese-English and French-English translation respectively by exploiting document-level context. It also outperforms a state-of-the-art cache-based method adapted for Transformer.

Problem Statement
Our goal is to enable the Transformer translation model (Vaswani et al., 2017) as shown in Figure 1(a) to exploit document-level context.
Formally, let $X = x^{(1)}, \ldots, x^{(k)}, \ldots, x^{(K)}$ be a source-language document composed of $K$ source sentences. We use $x^{(k)} = x^{(k)}_1, \ldots, x^{(k)}_i, \ldots, x^{(k)}_I$ to denote the $k$-th source sentence containing $I$ words. $x^{(k)}_i$ denotes the $i$-th word in the $k$-th source sentence.
Likewise, the corresponding target-language document is denoted by $Y = y^{(1)}, \ldots, y^{(k)}, \ldots, y^{(K)}$ and $y^{(k)} = y^{(k)}_1, \ldots, y^{(k)}_j, \ldots, y^{(k)}_J$ represents the $k$-th target sentence containing $J$ words. $y^{(k)}_j$ denotes the $j$-th word in the $k$-th target sentence. We assume that $\langle X, Y \rangle$ constitutes a parallel document and each $\langle x^{(k)}, y^{(k)} \rangle$ forms a parallel sentence pair. Therefore, the document-level translation probability is given by
$$P(Y \mid X; \theta) = \prod_{k=1}^{K} P\big(y^{(k)} \mid X, Y_{<k}; \theta\big), \tag{1}$$
where $Y_{<k} = y^{(1)}, \ldots, y^{(k-1)}$ is a partial translation. For generating $y^{(k)}$, the source document $X$ can be divided into three parts: (1) the $k$-th source sentence $X_{=k} = x^{(k)}$, (2) the source-side document-level context on the left $X_{<k} = x^{(1)}, \ldots, x^{(k-1)}$, and (3) the source-side document-level context on the right $X_{>k} = x^{(k+1)}, \ldots, x^{(K)}$. As the languages used in our experiments (i.e., Chinese and English) are written left to right, we omit $X_{>k}$ for simplicity.
We also omit the target-side document-level context $Y_{<k}$ due to the translation error propagation problem: errors made in translating one sentence will be propagated to the translation process of subsequent sentences. Interestingly, we find that using source-side document-level context $X_{<k}$, which conveys the same information as $Y_{<k}$, helps to compute better representations on the target side (see Table 8).
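As a result, the document-level translation probability can be approximated as
$$P(Y \mid X; \theta) \approx \prod_{k=1}^{K} P\big(y^{(k)} \mid X_{<k}, x^{(k)}; \theta\big) \tag{2}$$
$$= \prod_{k=1}^{K} \prod_{j=1}^{J} P\big(y^{(k)}_j \mid X_{<k}, x^{(k)}, y^{(k)}_{<j}; \theta\big), \tag{3}$$
where $y^{(k)}_{<j} = y^{(k)}_1, \ldots, y^{(k)}_{j-1}$ is a partial translation.
In this way, the document-level translation model can still be defined at the sentence level without sacrificing efficiency, except that the source-side document-level context $X_{<k}$ (or context for short) is taken into account.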
In the following, we will introduce how to represent the context (Section 2.2), how to integrate the context (Section 2.3), and how to train the model especially when only limited training data is available (Section 2.4).
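The sentence-level factorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `preceding_context`, `document_log_prob`, and `toy_scorer` are hypothetical names, the scorer is a stand-in for a trained model, and the default window of two preceding sentences is the setting chosen later in the experiments.

```python
from typing import Callable, List, Sequence

Sentence = List[str]

def preceding_context(doc: Sequence[Sentence], k: int, max_sentences: int = 2) -> List[str]:
    """Words of up to `max_sentences` source sentences before index k (0-based)."""
    start = max(0, k - max_sentences)
    return [word for sentence in doc[start:k] for word in sentence]

def document_log_prob(
    src_doc: Sequence[Sentence],
    tgt_doc: Sequence[Sentence],
    sentence_log_prob: Callable[[List[str], Sentence, Sentence], float],
) -> float:
    """Sum of per-sentence log-probabilities, each conditioned on source context only."""
    total = 0.0
    for k, (x_k, y_k) in enumerate(zip(src_doc, tgt_doc)):
        context = preceding_context(src_doc, k)
        total += sentence_log_prob(context, x_k, y_k)
    return total

def toy_scorer(context: List[str], x_k: Sentence, y_k: Sentence) -> float:
    """Hypothetical scorer: charges one unit of log-probability per target word."""
    return -float(len(y_k))

src = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
tgt = [["A", "B"], ["C"], ["D", "E", "F"], ["G"]]
print(preceding_context(src, 3))
print(document_log_prob(src, tgt, toy_scorer))
```

Because each sentence is scored independently given its source-side context, the sentences of a document can still be batched like ordinary sentence pairs.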

Document-level Context Representation
As document-level context often includes several sentences, it is important to capture long-range dependencies and identify relevant information. We use multi-head self-attention (Vaswani et al., 2017) to compute the representation of document-level context because it is capable of reducing the maximum path length between long-range dependencies to O(1) (Vaswani et al., 2017) and determining the relative importance of different locations in the context (Bahdanau et al., 2015). Because of this property, multi-head self-attention has proven to be effective in other NLP tasks such as constituency parsing (Kitaev and Klein, 2018).
As shown in Figure 1(b), we use a self-attentive encoder to compute the representation of $X_{<k}$. The input to the self-attentive encoder is a sequence of context word embeddings, represented as a matrix. Suppose $X_{<k}$ is composed of $M$ source words: $X_{<k} = x_1, \ldots, x_m, \ldots, x_M$. We use $\mathbf{x}_m \in \mathbb{R}^{D \times 1}$ to denote the vector representation of $x_m$, which is the sum of its word embedding and positional encoding (Vaswani et al., 2017). Therefore, the matrix representation of $X_{<k}$ is given by
$$X_c = [\mathbf{x}_1; \ldots; \mathbf{x}_M], \tag{4}$$
where $X_c \in \mathbb{R}^{D \times M}$ is the concatenation of the vector representations of all source contextual words.
The self-attentive encoder is composed of a stack of $N_c$ identical layers. Each layer has two sub-layers. The first sub-layer is a multi-head self-attention:
$$A^{(1)} = \mathrm{MultiHead}(X_c, X_c, X_c), \tag{5}$$
where $A^{(1)} \in \mathbb{R}^{D \times M}$ is the hidden state calculated by the multi-head self-attention at the first layer, and $\mathrm{MultiHead}(Q, K, V)$ is a multi-head attention function that takes a query matrix $Q$, a key matrix $K$, and a value matrix $V$ as inputs. In this case, $Q = K = V = X_c$, which is why it is called self-attention. Please refer to Vaswani et al. (2017) for more details. Note that we follow Vaswani et al. (2017) to use residual connection and layer normalization in each sub-layer, which are omitted in the presentation for simplicity. For example, the actual output of the first sub-layer is:
$$\mathrm{Norm}(A^{(1)} + X_c). \tag{6}$$
The second sub-layer is a simple, position-wise fully connected feed-forward network:
$$C^{(1)} = \big[\mathrm{FNN}(A^{(1)}_{\cdot,1}); \ldots; \mathrm{FNN}(A^{(1)}_{\cdot,M})\big], \tag{7}$$
where $C^{(1)} \in \mathbb{R}^{D \times M}$ is the annotation of $X_{<k}$ after the first layer, $A^{(1)}_{\cdot,m} \in \mathbb{R}^{D \times 1}$ is the column vector for the $m$-th contextual word, and $\mathrm{FNN}(\cdot)$ is a position-wise fully connected feed-forward network (Vaswani et al., 2017).
This process iterates $N_c$ times as follows:
$$A^{(n)} = \mathrm{MultiHead}\big(C^{(n-1)}, C^{(n-1)}, C^{(n-1)}\big), \tag{8}$$
$$C^{(n)} = \big[\mathrm{FNN}(A^{(n)}_{\cdot,1}); \ldots; \mathrm{FNN}(A^{(n)}_{\cdot,M})\big], \tag{9}$$
where $A^{(n)}$ and $C^{(n)}$ ($n = 1, \ldots, N_c$) are the hidden state and annotation at the $n$-th layer, respectively. Note that $C^{(0)} = X_c$.
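As an illustration of the context encoder just described, the following NumPy sketch stacks self-attention and feed-forward sub-layers, each wrapped in a residual connection and layer normalization. It is a minimal sketch, not the authors' implementation: matrices are stored row-wise as (M, D) rather than column-wise as (D, M), layer normalization has no learned gain or bias, and `init_layer` with its random weights and the small dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Layer normalization without learned gain/bias (omitted for brevity)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def multi_head(q, k, v, p, num_heads):
    """Multi-head attention; q is (Lq, D), k and v are (Lk, D)."""
    d_model = q.shape[-1]
    d_head = d_model // num_heads

    def split(x, w):
        # Project, then reshape to (heads, length, d_head).
        return (x @ w).reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    qh, kh, vh = split(q, p["wq"]), split(k, p["wk"]), split(v, p["wv"])
    weights = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ vh  # (heads, Lq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(q.shape[0], d_model)
    return concat @ p["wo"]

def ffn(x, p):
    """Position-wise feed-forward network with a ReLU hidden layer."""
    return np.maximum(x @ p["w1"] + p["b1"], 0.0) @ p["w2"] + p["b2"]

def init_layer(rng, d_model, d_ff):
    """Hypothetical random initialization of one layer's parameters."""
    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))
    return {
        "wq": mat(d_model, d_model), "wk": mat(d_model, d_model),
        "wv": mat(d_model, d_model), "wo": mat(d_model, d_model),
        "w1": mat(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": mat(d_ff, d_model), "b2": np.zeros(d_model),
    }

def context_encoder(x_c, layers, num_heads):
    """Apply N_c layers of self-attention + FFN to the context matrix."""
    c = x_c  # C^(0) = X_c
    for p in layers:
        a = layer_norm(c + multi_head(c, c, c, p, num_heads))  # self-attention sub-layer
        c = layer_norm(a + ffn(a, p))                          # feed-forward sub-layer
    return c

rng = np.random.default_rng(0)
D, M, N_C, HEADS = 16, 5, 1, 4  # toy sizes; the paper uses D = 512 and 8 heads
x_c = rng.normal(size=(M, D))
layers = [init_layer(rng, D, 4 * D) for _ in range(N_C)]
c_out = context_encoder(x_c, layers, HEADS)
print(c_out.shape)  # (5, 16)
```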

Document-level Context Integration
We use multi-head attention to integrate $C^{(N_c)}$, which is the representation of $X_{<k}$, into both the encoder and the decoder.

Integration into the Encoder
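Given the $k$-th source sentence $x^{(k)}$, we use $\mathbf{x}^{(k)}_i \in \mathbb{R}^{D \times 1}$ to denote the vector representation of the $i$-th source word $x^{(k)}_i$, which is a sum of word embedding and positional encoding. Therefore, the initial matrix representation of $x^{(k)}$ is
$$X = [\mathbf{x}^{(k)}_1; \ldots; \mathbf{x}^{(k)}_I], \tag{10}$$
where $X \in \mathbb{R}^{D \times I}$ is the concatenation of all vector representations of source words.
As shown in Figure 1(b), we follow Vaswani et al. (2017) to use a stack of $N_s$ identical layers to encode $x^{(k)}$. Each layer consists of three sub-layers. The first sub-layer is a multi-head self-attention:
$$B^{(n)} = \mathrm{MultiHead}\big(S^{(n-1)}, S^{(n-1)}, S^{(n-1)}\big), \tag{11}$$
where $S^{(0)} = X$. The second sub-layer is context attention that integrates document-level context into the encoder:
$$D^{(n)} = \mathrm{MultiHead}\big(B^{(n)}, C^{(N_c)}, C^{(N_c)}\big). \tag{12}$$
The third sub-layer is a position-wise fully connected feed-forward neural network:
$$S^{(n)} = \big[\mathrm{FNN}(D^{(n)}_{\cdot,1}); \ldots; \mathrm{FNN}(D^{(n)}_{\cdot,I})\big], \tag{13}$$
where $S^{(n)} \in \mathbb{R}^{D \times I}$ is the representation of the source sentence $x^{(k)}$ at the $n$-th layer ($n = 1, \ldots, N_s$).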
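The ordering of the three encoder sub-layers can be sketched as follows. This is a simplified illustration under explicit assumptions: single-head scaled dot-product attention without learned projections stands in for MultiHead, matrices are stored row-wise as (length, D), plain residual connections are used around every sub-layer (the gated variant for context attention is introduced in the Context Gating section), and all weights and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Layer normalization without learned gain/bias."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attend(q, k, v):
    """Single-head scaled dot-product attention (stand-in for MultiHead)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def encoder_layer(s_prev, c_ctx, w1, w2):
    """One context-aware encoder layer: self-attention, context attention, FFN."""
    b = layer_norm(s_prev + attend(s_prev, s_prev, s_prev))  # self-attention (Eq. 11)
    d = layer_norm(b + attend(b, c_ctx, c_ctx))              # context attention (Eq. 12)
    return layer_norm(d + np.maximum(d @ w1, 0.0) @ w2)      # feed-forward (Eq. 13)

rng = np.random.default_rng(0)
D, I, M = 8, 6, 5                    # toy sizes: model dim, sentence length, context length
x = rng.normal(size=(I, D))          # S^(0) = X, the embedded source sentence
c_ctx = rng.normal(size=(M, D))      # C^(N_c), output of the context encoder
w1 = rng.normal(0.0, 0.1, size=(D, 4 * D))
w2 = rng.normal(0.0, 0.1, size=(4 * D, D))
s_out = encoder_layer(x, c_ctx, w1, w2)
print(s_out.shape)  # (6, 8)
```

The queries of the context attention come from the current sentence while the keys and values come from the context, so the output keeps the sentence length I regardless of the context length M.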

Integration into the Decoder
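When generating the $j$-th target word $y^{(k)}_j$, the partial translation is denoted by $y^{(k)}_{<j} = y^{(k)}_1, \ldots, y^{(k)}_{j-1}$. We follow Vaswani et al. (2017) to offset the target word embeddings by one position, resulting in the following matrix representation of $y^{(k)}_{<j}$:
$$Y = [\mathbf{y}^{(k)}_0; \ldots; \mathbf{y}^{(k)}_{j-1}], \tag{14}$$
where $\mathbf{y}^{(k)}_0 \in \mathbb{R}^{D \times 1}$ is the vector representation of a begin-of-sentence token and $Y \in \mathbb{R}^{D \times j}$ is the concatenation of all vectors.
As shown in Figure 1(b), we follow Vaswani et al. (2017) to use a stack of $N_t$ identical layers to compute target-side representations. Each layer is composed of four sub-layers. The first sub-layer is a multi-head self-attention:
$$E^{(n)} = \mathrm{MultiHead}\big(T^{(n-1)}, T^{(n-1)}, T^{(n-1)}\big), \tag{15}$$
where $T^{(0)} = Y$. The second sub-layer is context attention that integrates document-level context into the decoder:
$$F^{(n)} = \mathrm{MultiHead}\big(E^{(n)}, C^{(N_c)}, C^{(N_c)}\big). \tag{16}$$
The third sub-layer is encoder-decoder attention that integrates the representation of the corresponding source sentence:
$$G^{(n)} = \mathrm{MultiHead}\big(F^{(n)}, S^{(N_s)}, S^{(N_s)}\big). \tag{17}$$
The fourth sub-layer is a position-wise fully connected feed-forward neural network:
$$T^{(n)} = \big[\mathrm{FNN}(G^{(n)}_{\cdot,1}); \ldots; \mathrm{FNN}(G^{(n)}_{\cdot,j})\big], \tag{18}$$
where $T^{(n)} \in \mathbb{R}^{D \times j}$ is the representation at the $n$-th layer ($n = 1, \ldots, N_t$).
Finally, the probability distribution of generating the next target word $y^{(k)}_j$ is defined using a softmax layer:
$$P\big(y^{(k)}_j \mid X_{<k}, x^{(k)}, y^{(k)}_{<j}; \theta\big) \propto \exp\big(W_o T^{(N_t)}_{\cdot,j}\big), \tag{19}$$
where $W_o \in \mathbb{R}^{|V_y| \times D}$ is a model parameter, $V_y$ is the target vocabulary, and $T^{(N_t)}_{\cdot,j} \in \mathbb{R}^{D \times 1}$ is a column vector for predicting the $j$-th target word.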
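The four decoder sub-layers and the output softmax can be sketched in the same simplified style. The assumptions are the same as for any toy version: single-head attention without learned projections replaces MultiHead, matrices are row-wise, plain residual connections are used, and a causal mask (standard in Transformer decoders, though not spelled out in the text above) keeps each position from attending to later ones. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Layer normalization without learned gain/bias."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attend(q, k, v, causal=False):
    """Single-head scaled dot-product attention, optionally causally masked."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Position j may only attend to positions <= j.
        mask = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def decoder_layer(t_prev, c_ctx, s_src, w1, w2):
    """One context-aware decoder layer with four sub-layers."""
    e = layer_norm(t_prev + attend(t_prev, t_prev, t_prev, causal=True))  # Eq. 15
    f = layer_norm(e + attend(e, c_ctx, c_ctx))                           # Eq. 16
    g = layer_norm(f + attend(f, s_src, s_src))                           # Eq. 17
    return layer_norm(g + np.maximum(g @ w1, 0.0) @ w2)                   # Eq. 18

def next_word_distribution(t_top, w_o):
    """Eq. 19: softmax over the vocabulary, computed from the last position."""
    return softmax(w_o @ t_top[-1])

rng = np.random.default_rng(0)
D, J, M, I, V = 8, 4, 5, 6, 10       # toy sizes; V is a hypothetical vocabulary size
y = rng.normal(size=(J, D))          # T^(0) = Y, shifted target embeddings
c_ctx = rng.normal(size=(M, D))      # C^(N_c)
s_src = rng.normal(size=(I, D))      # S^(N_s)
w1 = rng.normal(0.0, 0.1, size=(D, 4 * D))
w2 = rng.normal(0.0, 0.1, size=(4 * D, D))
w_o = rng.normal(size=(V, D))
t_out = decoder_layer(y, c_ctx, s_src, w1, w2)
probs = next_word_distribution(t_out, w_o)
print(t_out.shape, probs.shape)  # (4, 8) (10,)
```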

Context Gating
In our model, we follow Vaswani et al. (2017) to use residual connections around each sub-layer to shortcut its input to its output:
$$\mathrm{Residual}(H) = H + \mathrm{SubLayer}(H), \tag{20}$$
where $H$ is the input of the sub-layer. While residual connections prove to be effective for building deep architectures, there is one potential problem for our model: the residual connections after the context attention sub-layer might increase the influence of document-level context $X_{<k}$ in an uncontrolled way. This is undesirable because the source sentence $x^{(k)}$ usually plays a more important role in target word generation.
To address this problem, we replace the residual connections after the context attention sub-layer with a position-wise context gating sub-layer:
$$\mathrm{Gating}(H) = \lambda H + (1 - \lambda)\,\mathrm{SubLayer}(H). \tag{21}$$
The gating weight is given by
$$\lambda = \sigma\big(W_i H + W_s\,\mathrm{SubLayer}(H)\big), \tag{22}$$
where $\sigma(\cdot)$ is a sigmoid function, and $W_i$ and $W_s$ are model parameters.
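A minimal NumPy sketch of the context gate follows. Two points are assumptions rather than statements from the text: the gate is taken to be a matrix of the same shape as $H$ that is applied elementwise, and matrices are stored row-wise as (length, D), so the products are written `h @ w_i` instead of $W_i H$. The weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(h, sub_out, w_i, w_s):
    """Blend the sub-layer input h with its output sub_out using a learned gate."""
    lam = sigmoid(h @ w_i + sub_out @ w_s)   # gate values in (0, 1), one per element
    return lam * h + (1.0 - lam) * sub_out   # convex combination (assumed elementwise)

rng = np.random.default_rng(0)
L, D = 3, 4                                  # toy sizes: sequence length and model dim
h = rng.normal(size=(L, D))                  # input of the context attention sub-layer
sub_out = rng.normal(size=(L, D))            # output of the context attention sub-layer
w_i = rng.normal(0.0, 0.1, size=(D, D))
w_s = rng.normal(0.0, 0.1, size=(D, D))
gated = context_gate(h, sub_out, w_i, w_s)
print(gated.shape)  # (3, 4)
```

Since every gate value lies strictly between 0 and 1, each output element lies between the corresponding elements of the input and the context attention output, which is what bounds the influence of the context compared with an unweighted residual sum.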

Training
Given a document-level parallel corpus $D_d$, the standard training objective is to maximize the log-likelihood of the training data:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{\langle X, Y \rangle \in D_d} \log P(Y \mid X; \theta). \tag{23}$$
Unfortunately, large-scale document-level parallel corpora are usually unavailable, even for resource-rich languages such as English and Chinese.
Under small-data training conditions, document-level NMT is prone to underperform sentence-level NMT because of poor estimates of low-frequency events.
To address this problem, we adopt the idea of freezing some parameters while tuning the remaining part of the model (Zoph et al., 2016). We propose a two-step training strategy that uses an additional sentence-level parallel corpus $D_s$, which can be larger than $D_d$. We divide model parameters into two subsets: $\theta = \theta_s \cup \theta_d$, where $\theta_s$ is the set of original sentence-level model parameters (highlighted in blue in Figure 1(b)) and $\theta_d$ is the set of newly introduced document-level model parameters (highlighted in red in Figure 1(b)).
In the first step, sentence-level parameters $\theta_s$ are estimated on the combined sentence-level parallel corpus $D_s \cup D_d$:
$$\hat{\theta}_s = \operatorname*{argmax}_{\theta_s} \sum_{\langle x, y \rangle \in D_s \cup D_d} \log P(y \mid x; \theta_s). \tag{24}$$
Note that the newly introduced modules (highlighted in red in Figure 1(b)) are inactivated in this step. $P(y \mid x; \theta_s)$ is identical to the original Transformer model, which is a special case of our model. In the second step, document-level parameters $\theta_d$ are estimated on the document-level parallel corpus $D_d$ only:
$$\hat{\theta}_d = \operatorname*{argmax}_{\theta_d} \sum_{\langle X, Y \rangle \in D_d} \log P(Y \mid X; \hat{\theta}_s, \theta_d). \tag{25}$$
Our approach is also similar to pre-training, which has been widely used in NMT (Shen et al., 2016). The major difference is that our approach keeps $\hat{\theta}_s$ fixed when estimating $\theta_d$ to prevent the model from overfitting on the relatively smaller document-level parallel corpora.
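The two-step schedule can be illustrated on a toy problem. The sketch below is not the NMT model: it assumes a linear regressor with a "sentence" weight vector `theta_s` and a "context" weight vector `theta_d`, and it uses mean squared error as a stand-in for the negative log-likelihood. All data, sizes, and function names are hypothetical; only the freezing pattern mirrors the procedure described above.

```python
import numpy as np

def loss_and_grads(theta_s, theta_d, x, c, y):
    """Mean squared error and its gradients w.r.t. both parameter groups."""
    err = x @ theta_s + c @ theta_d - y
    loss = float(np.mean(err ** 2))
    return loss, 2.0 * x.T @ err / len(y), 2.0 * c.T @ err / len(y)

def fit(theta_s, theta_d, x, c, y, train_s, train_d, steps=300, lr=0.1):
    """Gradient descent that only updates the parameter groups flagged as trainable."""
    theta_s, theta_d = theta_s.copy(), theta_d.copy()
    for _ in range(steps):
        _, g_s, g_d = loss_and_grads(theta_s, theta_d, x, c, y)
        if train_s:
            theta_s -= lr * g_s
        if train_d:
            theta_d -= lr * g_d
    return theta_s, theta_d

rng = np.random.default_rng(0)
true_s, true_d = np.array([1.0, -2.0, 0.5]), np.array([0.8, -0.3])

# Large "sentence-level" corpus without context; small "document-level" corpus with context.
x_sent = rng.normal(size=(400, 3))
y_sent = x_sent @ true_s
x_doc = rng.normal(size=(40, 3))
c_doc = rng.normal(size=(40, 2))
y_doc = x_doc @ true_s + c_doc @ true_d

# Step 1: estimate theta_s on the union of both corpora with the context module inactive.
x_all = np.vstack([x_sent, x_doc])
y_all = np.concatenate([y_sent, y_doc])
no_context = np.zeros((len(x_all), 2))
theta_s, theta_d = fit(np.zeros(3), np.zeros(2), x_all, no_context, y_all, True, False)
loss_before, _, _ = loss_and_grads(theta_s, theta_d, x_doc, c_doc, y_doc)

# Step 2: estimate theta_d on the document-level corpus only, keeping theta_s frozen.
theta_s2, theta_d2 = fit(theta_s, theta_d, x_doc, c_doc, y_doc, False, True)
loss_after, _, _ = loss_and_grads(theta_s2, theta_d2, x_doc, c_doc, y_doc)
print(np.array_equal(theta_s, theta_s2), loss_after < loss_before)
```

Freezing `theta_s` in the second step means the small document-level corpus can only adjust the new parameters, which is the mechanism the paper relies on to limit overfitting.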

Setup
We evaluate our approach on Chinese-English and French-English translation tasks. In the Chinese-English translation task, the training set contains 2M Chinese-English sentence pairs with 54.8M Chinese words and 60.8M English words. The document-level parallel corpus is a subset of the full training set, including 41K documents with 940K sentence pairs. On average, each document in the training set contains 22.9 sentences. We use the NIST 2006 dataset as the development set and the NIST 2002, 2003, 2004, 2005, and 2008 datasets as test sets. The development and test sets contain 588 documents with 5,833 sentences. On average, each document contains 9.9 sentences.
In the French-English translation task, we use the IWSLT bilingual training data (Mauro et al., 2012), which contains 1,824 documents with 220K sentence pairs, as the training set. For development and testing, we use the IWSLT 2010 development and test sets, which contain 8 documents with 887 sentence pairs and 11 documents with 1,664 sentence pairs, respectively. The evaluation metric for both tasks is case-insensitive BLEU score as calculated by the multi-bleu.perl script.
In preprocessing, we use byte pair encoding (Sennrich et al., 2016) with 32K merges to segment words into sub-word units for all languages. For the original Transformer model and our extended model, the hidden size is set to 512 and the filter size is set to 2,048. The multi-head attention has 8 individual attention heads. We set $N = N_s = N_t = 6$. In training, we use Adam (Kingma and Ba, 2015) for optimization. Each mini-batch contains approximately 24K words. We use the learning rate decay policy described by Vaswani et al. (2017). In decoding, the beam size is set to 4. We use the length penalty and set the hyper-parameter $\alpha$ to 0.6. We use four Tesla P40 GPUs for training and one Tesla P40 GPU for decoding. We implement our approach on top of the open-source toolkit THUMT (Zhang et al., 2017).
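For readers unfamiliar with the two decoding and optimization settings mentioned above, here is a hedged sketch. The learning rate schedule is the one given by Vaswani et al. (2017); the warm-up of 4,000 steps is that paper's default and is an assumption here, since this setup does not state it. The length penalty formula is assumed to be the commonly used form ((5 + length) / 6)^α, which the value α = 0.6 suggests but the text does not confirm.

```python
def transformer_learning_rate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warm-up followed by inverse square-root decay (step must be >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Assumed length penalty; equals 1.0 for a one-word hypothesis."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Length-normalized beam score: log-probability divided by the penalty."""
    return log_prob / length_penalty(length, alpha)

for step in (1000, 4000, 16000):
    print(step, transformer_learning_rate(step))
print(rescore(-6.0, 5), rescore(-6.0, 10))
```

With this normalization, a longer hypothesis with the same total log-probability receives a better score, which counteracts the bias of beam search toward short outputs.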

Effect of Context Length
We first investigate the effect of context length (i.e., the number of preceding sentences) on our approach. As shown in Table 1, using two preceding source sentences as document-level context achieves the best translation performance on the development set. Using more preceding sentences does not bring any improvement and increases computational cost. This is consistent with previously reported findings that long-distance context has only limited influence. Therefore, we set the number of preceding sentences to 2 in the following experiments.
Table 2 shows the effect of the number of self-attention layers used for computing representations of document-level context (see Section 2.2) on translation quality. Surprisingly, using only one self-attention layer suffices to achieve good performance. Increasing the number of self-attention layers does not lead to any improvements. Therefore, we set $N_c$ to 1 for efficiency.

Comparison with Previous Work
In the Chinese-English translation task, we compare our approach with the following previous methods:
1. Hierarchical RNN: uses a hierarchical RNN to integrate document-level context into the RNNsearch model. This method uses a document-level parallel corpus containing 1M sentence pairs. Table 3 gives the BLEU scores reported in the original paper.
2. Cache-based method: uses a cache that stores previously translated words and topical words to incorporate document-level context into the RNNsearch model. This method uses a document-level parallel corpus containing 2.8M sentence pairs. Table 3 gives the BLEU scores reported in the original paper.
3. Transformer (Vaswani et al., 2017): the state-of-the-art NMT model that does not exploit document-level context. We use the open-source toolkit THUMT (Zhang et al., 2017) to train and evaluate the model. The training dataset is our sentence-level parallel corpus containing 2M sentence pairs.
4. Cache-based method adapted for Transformer (marked * in Table 3): adapts the cache-based method to the Transformer model. We implement it on top of the open-source toolkit THUMT. We also use the same training data (i.e., 2M sentence pairs) and the same two-step training strategy to estimate sentence- and document-level parameters separately.
As shown in Table 3, using the same data, our approach achieves significant improvements over the original Transformer model (Vaswani et al., 2017) (p < 0.01). The gain on the concatenated test set (i.e., "All") is 1.96 BLEU points. It also significantly outperforms the cache-based method adapted for Transformer (p < 0.01), which also uses the two-step training strategy. Table 4 shows that our model also outperforms Transformer by 0.89 BLEU points on the French-English translation task.

Table 3: Comparison with previous works on the Chinese-English translation task. The evaluation metric is case-insensitive BLEU score. The first method uses a hierarchical RNN to incorporate document-level context into RNNsearch. The second uses a cache to exploit document-level context for RNNsearch. The entry marked * is an adapted version of the cache-based method for Transformer. Note that "MT06" is not included in "All".

Table 5: Subjective evaluation of the comparison between the original Transformer model and our model. ">" means that Transformer is better than our model, "=" means equal, and "<" means worse.

Subjective Evaluation
We also conducted a subjective evaluation to validate the benefit of exploiting document-level context. Three human evaluators were asked to compare the outputs of the original Transformer model and our model on 20 documents containing 198 sentences, which were randomly sampled from the test sets. Table 5 shows the results of the subjective evaluation. The three evaluators generally made consistent judgements. On average, around 19% of Transformer's translations are better than those of our model, 51% are equal, and 31% are worse. This evaluation confirms that exploiting document-level context helps to improve translation quality.

Evaluation of Efficiency
We evaluated the efficiency of our approach. It takes the original Transformer model about 6.7 hours to converge during training and the training speed is 41K words/second. The decoding speed is 872 words/second. In contrast, it takes our model about 7.8 hours to converge in the second step of training. The training speed is 31K words/second. The decoding speed is 364 words/second. Therefore, the training speed is only reduced by 25% thanks to the high parallelism of multi-head attention used to incorporate document-level context. The gap is larger in decoding because target words are generated in an autoregressive way in Transformer.

Method        Training   Decoding
Transformer   41K        872
this work     31K        364

Training and decoding speed (words/second) of the original Transformer model and our model.

Table 7 shows the effect of the proposed two-step training strategy. The first two rows only use the sentence-level parallel corpus to train the original Transformer model (see Eq. 24) and achieve BLEU scores of 39.53 and 45.97. The third row only uses the document-level parallel corpus to directly train our model (see Eq. 23) and achieves a BLEU score of 36.52. The fourth and fifth rows use the two-step strategy to take advantage of both sentence- and document-level parallel corpora and achieve BLEU scores of 40.22 and 47.93, respectively. We find that document-level NMT achieves much worse results than sentence-level NMT (i.e., 36.52 vs. 39.53) when only small-scale document-level parallel corpora are available. Our two-step training method is capable of addressing this problem by exploiting sentence-level corpora, which leads to significant improvements across all test sets.

Table 8: Effect of context integration. "none" means that no document-level context is integrated, "encoder" means that the document-level context is integrated only into the encoder, "decoder" means that the document-level context is integrated only into the decoder, and "both" means that the context is integrated into both the encoder and the decoder.

Table 8 shows the effect of integrating document-level context into the encoder and decoder (see Section 2.3). It is clear that integrating document-level context into the encoder (Eq. 12) brings significant improvements (i.e., 45.97 vs. 47.51). Similarly, it is also beneficial to integrate document-level context into the decoder (Eq. 16). Combining both leads to further improvements. This observation suggests that document-level context does help to improve Transformer.
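The relative slowdowns implied by the speed figures reported in the efficiency evaluation can be checked with a short calculation. The helper name `relative_slowdown` is hypothetical, and the training speeds are the rounded values given in the text (41K and 31K words/second), so the result only approximates the quoted 25%.

```python
def relative_slowdown(baseline: float, ours: float) -> float:
    """Fraction of the baseline speed that is lost."""
    return (baseline - ours) / baseline

training = relative_slowdown(41_000, 31_000)   # rounded training speeds, words/second
decoding = relative_slowdown(872, 364)         # decoding speeds, words/second
print(f"training: {training:.1%}, decoding: {decoding:.1%}")
```

On these figures the training slowdown comes out at roughly a quarter, while decoding loses more than half of its speed, in line with the observation that the gap is larger in decoding.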

Effect of Context Gating
As shown in Table 9, we also validated the effectiveness of context gating (see Section 2.3.3). We find that replacing residual connections with context gating leads to an overall improvement of 0.38 BLEU points.

Analysis
We use an example to illustrate how document-level context helps translation (Table 10). In order to translate the source sentence, NMT has to disambiguate the multi-sense word "yundong", which is actually impossible without the document-level context. The exact meaning of "rezhong" is also highly context dependent. Fortunately, the sense of "yundong" can be inferred from the word "saiche" (car racing) in the document-level context and "rezhong" is the antonym of "yanjuan" (tired of). This example shows that our model learns to resolve word sense ambiguity and lexical cohesion problems by integrating document-level context.

Context      · · · ziji ye yinwei queshao jingzheng duishou er dui saiche youxie yanjuan shi · · ·
Source       wo rengran feichang rezhong yu zhexiang yundong.
Reference    I'm still very fond of the sport.
Transformer  I am still very enthusiastic about this movement.
Our work     I am still very keen on this sport.

Table 10: An example of Chinese-English translation. In the source sentence, "yundong" (sport or political movement) is a multi-sense word and "rezhong" (fond of) is an emotional word whose meaning is dependent on its context. Our model takes advantage of the words "saiche" (car racing) and "yanjuan" (tired of) in the document-level context to translate the source words correctly.

Related Work
Previous approaches to document-level translation can be roughly divided into two broad categories: computing the representation of the full document-level context (Tiedemann and Scherrer, 2017; Maruf and Haffari, 2018) and using a cache to memorize the most relevant information in the document-level context. Our approach falls into the first category. We use multi-head attention to represent and integrate document-level context. One existing work also extended Transformer to model document-level context, but our work differs in its modeling and training strategies. The experimental part is also different. While that work focuses on anaphora resolution, our model is able to improve the overall translation quality by integrating document-level context.

Conclusion
We have presented a method for exploiting document-level context inside the state-of-the-art neural translation model Transformer. Experiments on Chinese-English and French-English translation tasks show that our method is able to improve over Transformer significantly. In the future, we plan to further validate the effectiveness of our approach on more language pairs.