Context-Interactive Pre-Training for Document Machine Translation

Document machine translation aims to translate the source sentence into the target language in the presence of additional contextual information. However, it typically suffers from a lack of doc-level bilingual data. To remedy this, we propose a simple yet effective context-interactive pre-training approach that benefits from external large-scale corpora. The proposed model performs inter sentence generation to capture the cross-sentence dependency within the target document, and cross sentence translation to make better use of valuable contextual information. Comprehensive experiments show that our approach achieves state-of-the-art performance on three benchmark datasets, significantly outperforming a variety of baselines.


Introduction
Document machine translation (Doc-MT) aims at utilizing the surrounding contexts of the source sentence to tackle linguistic consistency problems (e.g., deixis, ellipsis, and lexical cohesion) in translation (Tiedemann and Scherrer, 2017). However, the introduction of extra contexts also presents several intractable challenges: (1) Data scarcity of document-level bilingual corpora. Since most bilingual corpora are preserved by sentence, well-aligned document-level data is relatively scarce, especially for low-resource languages or domains. Such data scarcity not only impairs the effective training of neural machine translation (NMT) models, but also tends to result in overfitting.
(2) Effective utilization of valuable information contained in extra contexts. Although some efforts (Wang et al., 2017; Tu et al., 2018) have strived to incorporate contextual information via various architectures, they observe only minor performance gains compared with traditional sentence machine translation (Sent-MT). Recent work also reveals that contextual information cannot be fully leveraged by some existing approaches, where the source contexts tend to act as data noise that enriches the training signals.
(3) Modeling of cross-sentence dependency within the target document. Since the input of Doc-MT consists of documents with multiple sentences, the decoder should be able to deal with discourse phenomena such as coreference resolution, lexical cohesion, and lexical disambiguation (Voita et al., 2019b). This goal requires modeling the cross-sentence dependency within the target document.
To tackle the above three challenges, we propose a simple yet effective context-interactive pre-training approach for Doc-MT. The proposal consists of three pre-training tasks, whose sketch is presented in Figure 1. Specifically, the cross sentence translation task (CST in Figure 1 (A)) strives to generate the target sentence in the absence of the source sentence, based only on the source contexts. With such a goal, the model is encouraged to maximize the utilization of extra contexts. To capture interactions between multiple sentences in the target document so that discourse phenomena can be modeled, we conduct inter sentence generation (ISG in Figure 1 (B)), which aims to predict the inter sentence based on the target surrounding contexts. This task can be regarded as discourse language modeling that injects the cross-sentence dependency within the target document into the decoder of the translation model. We also introduce parallel sentence translation (PST in Figure 1 (C)) to alleviate the lack of doc-level bilingual corpora and achieve knowledge transfer from abundant sent-level parallel data to limited doc-level parallel data. To avoid catastrophic forgetting of the pre-trained model in downstream fine-tuning, elastic weight consolidation (EWC) regularization is introduced to further enhance the model performance.
We perform the evaluation on three benchmark datasets, and the results illustrate that our approach achieves state-of-the-art performance, outperforming a variety of baselines.
[Figure 1 panels: (A) Cross Sentence Translation (CST); (B) Inter Sentence Generation (ISG); (C) Parallel Sentence Translation (PST); (D) Downstream Fine-Tuning.]

Related Work
Document machine translation (Doc-MT) aims to translate the source sentence into another language in the presence of additional contextual information. The mainstream advances of this research field can be divided into three lines: uni-encoder structure, dual-encoder structure, and pre-trained models.
Uni-encoder structure. This line of research aims at performing Doc-MT based on a universal Transformer, which takes the concatenation of the additional contexts and the source sentence as the input. Tiedemann and Scherrer (2017) explore multiple concatenation strategies and show that translation with extended source achieves the best performance. Bawden et al. (2018) present several new discourse test sets, which aim to evaluate the ability of models to exploit previous source and target sentences. Other work utilizes a dynamic or topic cache to model coherence for Doc-MT by capturing contextual information either from recently translated sentences or from the entire document. Going a step further, an inter-sentence gate model has been presented that encodes two adjacent sentences and uses the gate to control the amount of information flowing from the preceding sentence to the translation of the current sentence. Tu et al. (2018) augment the translation model with a cache-like memory network that stores recent hidden representations as translation history. Yang et al. (2019) introduce query-guided capsule networks into document-level translation to capture high-level capsules related to the current source sentence. A further approach proposes a unified encoder to process the concatenated source information that only attends to the source sentence at the top of encoder blocks.
Dual-encoder structure. This line of work tends to adopt two encoders or other components to model the source sentences and the document-level contexts. Wang et al. (2017) summarize the source history in a hierarchical way and then integrate the historical representation into the translation model with multiple strategies. Maruf and Haffari (2018) take both source and target document context into account using memory networks, modeling Doc-MT as a structured prediction problem with inter-dependencies among the observed and hidden variables. Another approach introduces a light context encoder to represent source context and performs information fusion with unidirectional multi-head attention. Werlen et al. (2018) use a hierarchical attention network (HAN) with two levels of abstraction: word-level abstraction allows attention to words in previous sentences, and sentence-level abstraction allows access to relevant previous sentences. Both source and target context can be exploited. Voita et al. (2019b) introduce a two-pass framework that first translates each sentence with a context-agnostic model, and then refines it using the context of several previous sentences. Furthermore, Voita et al. (2019a) present a monolingual Doc-Repair model that performs automatic post-editing on a sequence of sentence-level translations to correct inconsistencies among them. A study of multi-encoder approaches in Doc-MT finds that the context encoder does not only encode the surrounding sentences but also behaves as a noise generator. Maruf et al. (2019) present a hierarchical context-aware translation model, which selectively focuses on relevant sentences in the document context and then attends to key words in those sentences.

Methodology
Following prior work, we translate the i-th source sentence $x_i$ into the i-th target sentence $y_i$ in the presence of extra source contexts $c = (x_{i-1}, x_{i+1})$, where $x_{i-1}$ and $x_{i+1}$ refer to the predecessor and successor of $x_i$, respectively. We adopt Transformer as the model architecture for pre-training and machine translation. The model is trained by minimizing the negative log-likelihood of the target sequence $y$ conditioned on the source sequence $x$, i.e., $L = -\log p(y \mid x)$. Readers can refer to Vaswani et al. (2017) for more details. We introduce our approach based on EN→DE Doc-MT. Figure 1 shows the sketch of our context-interactive pre-training approach, which is elaborated on as follows.
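As a small worked illustration of this training objective, the sketch below computes the sequence-level negative log-likelihood from per-token probabilities. It assumes the usual autoregressive factorization of p(y|x) over target tokens; the function name `sequence_nll` and the probability values are hypothetical and only serve the example.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood -log p(y|x) of one target sequence.

    token_probs[t] is the (hypothetical) probability the model assigns to the
    t-th reference token given the source and the previous target tokens, so
    -log p(y|x) is the sum of the per-token negative log-probabilities.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Made-up probabilities for a three-token target sentence.
loss = sequence_nll([0.5, 0.25, 0.8])
print(loss)
```

Summing log-probabilities rather than multiplying probabilities is the common choice because it avoids numerical underflow on long sequences.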

Pre-Training Tasks
Cross Sentence Translation (CST) When translating the i-th sentence $x_i$ and the source context $c = (x_{i-1}, x_{i+1})$ into the i-th target sentence $y_i$, prior approaches tend to pay most attention to $x_i$ (Li et al., 2019), resulting in the neglect of $c$. To maximize the use of the source context $c$, we propose cross sentence translation (CST) to encourage the model to utilize the valuable information contained in $c$ more effectively. We mask the whole source sentence $x_i$ in the model input, and force the model to generate the target sentence $y_i$ based only on $c = (x_{i-1}, x_{i+1})$. To be specific, we pack both the source context $c$ and the mask token <mask> as a continuous span, and employ a special token </s> to indicate the end of each sentence. To distinguish texts from different languages, we add a language identifier (e.g., <en> for English and <de> for German) to the ends of both the source input and the target output. Figure 1(A) illustrates this task on EN-DE translation, where the input of Transformer is the concatenation of $(x_{i-1}, \text{<mask>}, x_{i+1})$ and the target output is $y_i$.
Inter Sentence Generation (ISG) Voita et al. (2019b) have demonstrated that the cross-sentence dependency within the target document can effectively improve the translation quality. The Transformer decoder should be able to model the corresponding historical information to improve coherence, lexical cohesion, and other aspects during translation. Motivated by this, we propose inter sentence generation (ISG) to capture the cross-sentence dependency within the target output. The ISG task aims to predict the inter sentence $y_i$ based on its predecessor $y_{i-1}$ and successor $y_{i+1}$. In this way, the model is trained to capture the interactions between the sentences in the target document. Besides, the training of ISG only requires monolingual document corpora of the target language, which effectively alleviates the lack of doc-level parallel data in Doc-MT. Figure 1(B) presents the detailed illustration, where the model input is the concatenation of $(y_{i-1}, \text{<mask>}, y_{i+1})$ and the target output is $y_i$. Both the source and target language identifiers are <de>.
Parallel Sentence Translation (PST) In practice, the available sent-level parallel corpora are usually larger in scale than doc-level parallel corpora. Thus, we introduce parallel sentence translation (PST), which performs context-agnostic sentence translation and only requires sent-level parallel data. This further alleviates the lack of doc-level parallel data in Doc-MT. The illustration of PST is presented in Figure 1(C), where the input is the concatenation of $(\text{<none>}, x_i, \text{<none>})$ and the target output is $y_i$. The source and target language identifiers are <en> and <de>, respectively.
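The input and output formats of the three pre-training tasks can be sketched as plain string construction. This is a minimal illustration, not the authors' preprocessing code: the helper names (`pack`, `cst_example`, `isg_example`, `pst_example`) are hypothetical, and two details are assumptions the text does not settle, namely that the <mask> and <none> placeholders are also closed with </s>, and that only sentences with both a predecessor and a successor are used.

```python
from typing import List, Tuple

MASK, NONE, EOS = "<mask>", "<none>", "</s>"


def pack(sentences: List[str], lang_id: str) -> str:
    """Join sentences into one span, close each with </s>, append the language id.

    Assumption: placeholder tokens are treated like sentences and also get </s>.
    """
    return " ".join(f"{s} {EOS}" for s in sentences) + f" {lang_id}"


def _check_inner(doc: List[str], i: int) -> None:
    # Assumption: sentence i needs both a predecessor and a successor.
    if not 1 <= i <= len(doc) - 2:
        raise IndexError("sentence index must have both neighbours")


def cst_example(src_doc: List[str], tgt_doc: List[str], i: int,
                src_lang: str = "<en>", tgt_lang: str = "<de>") -> Tuple[str, str]:
    """CST: the source sentence x_i is masked; y_i is generated from the context."""
    _check_inner(src_doc, i)
    return pack([src_doc[i - 1], MASK, src_doc[i + 1]], src_lang), pack([tgt_doc[i]], tgt_lang)


def isg_example(tgt_doc: List[str], i: int, tgt_lang: str = "<de>") -> Tuple[str, str]:
    """ISG: y_i is predicted from y_{i-1} and y_{i+1}; both sides use the target id."""
    _check_inner(tgt_doc, i)
    return pack([tgt_doc[i - 1], MASK, tgt_doc[i + 1]], tgt_lang), pack([tgt_doc[i]], tgt_lang)


def pst_example(src_sent: str, tgt_sent: str,
                src_lang: str = "<en>", tgt_lang: str = "<de>") -> Tuple[str, str]:
    """PST: context slots are filled with <none>; only sentence pairs are needed."""
    return pack([NONE, src_sent, NONE], src_lang), pack([tgt_sent], tgt_lang)


# Toy three-sentence document pair (made-up text).
en_doc = ["I saw a cat .", "It was black .", "It ran away ."]
de_doc = ["Ich sah eine Katze .", "Sie war schwarz .", "Sie lief weg ."]
print(cst_example(en_doc, de_doc, 1))
print(isg_example(de_doc, 1))
print(pst_example(en_doc[1], de_doc[1]))
```

Keeping the three-slot layout identical across tasks means the model always sees the same input shape, which is what allows the pre-trained weights to be reused directly when the middle slot holds the real source sentence.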
EWC-Based Fine-Tuning. After pre-training, the pre-trained Transformer is used as the model initialization for subsequent fine-tuning on downstream datasets. As shown in Figure 1, the input of Transformer in this scenario is $(x_{i-1}, x_i, x_{i+1})$, i.e., the concatenation of the i-th source sentence $x_i$ and its surrounding context $c = (x_{i-1}, x_{i+1})$. The desired output is the i-th target sentence $y_i$. The source and target language identifiers are the same as in PST. However, obvious catastrophic forgetting has been observed during fine-tuning: as fine-tuning continues, the model performance degrades. Due to large-scale model capacity and limited downstream datasets, pre-trained models usually suffer from overfitting. To remedy this, we introduce Elastic Weight Consolidation (EWC) regularization (Kirkpatrick et al., 2016). EWC regularizes the weights individually based on their importance to that task, which forces the model to remember the original language modeling tasks. Formally, in the standard formulation of Kirkpatrick et al. (2016), the EWC regularization is computed as:
$$R = \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta_i^{*}\right)^2 \tag{1}$$
where $F_i$ is the diagonal Fisher information of the i-th parameter, $\theta_i^{*}$ is its value after pre-training, $\lambda$ is a hyperparameter weighting the importance of the old LM tasks compared to the new MT task, and $i$ labels each parameter. The final loss $J$ for fine-tuning is the sum of the negative log-likelihood in all pre-training tasks and the newly introduced $R$, i.e., $J = L_{\mathrm{CST}} + L_{\mathrm{ISG}} + L_{\mathrm{PST}} + R$.
We summarize the key information of our approach in Table 1, which also shows the available data of different tasks.
Table 1: The detailed illustration of different tasks. "SLI" and "TLI" denote the source and target language identifier, respectively. "Use Mono-Doc", "Use Bi-Doc" and "Use Bi-Sent" mean that the corresponding task can use monolingual doc-level, bilingual doc-level, and bilingual sent-level corpora, respectively.
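The EWC regularizer described above can be illustrated with a few lines of plain Python. This is a sketch of the standard penalty of Kirkpatrick et al. (2016), R = Σ_i (λ/2) F_i (θ_i − θ*_i)², under two assumptions that the text does not confirm: the Fisher information is approximated by the mean of squared per-example gradients (the empirical diagonal Fisher), and parameters are flattened into simple lists. The function names and all numbers are hypothetical.

```python
from typing import List, Sequence


def diagonal_fisher(per_example_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-example gradients.

    Assumption: gradients of the log-likelihood are taken at the pre-trained
    parameters; one inner sequence per example, one entry per parameter.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[j] ** 2 for g in per_example_grads) / n for j in range(dim)]


def ewc_penalty(theta: Sequence[float], theta_star: Sequence[float],
                fisher: Sequence[float], lam: float = 0.01) -> float:
    """R = sum_i (lam / 2) * F_i * (theta_i - theta_star_i)^2."""
    if not (len(theta) == len(theta_star) == len(fisher)):
        raise ValueError("theta, theta_star and fisher must have equal length")
    return sum(0.5 * lam * f * (t - ts) ** 2
               for t, ts, f in zip(theta, theta_star, fisher))


# Toy example with two parameters and two made-up gradient samples.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 2.0]])
penalty = ewc_penalty(theta=[1.0, 2.0], theta_star=[0.0, 2.0], fisher=fisher)
print(fisher, penalty)
```

Parameters with a large Fisher value are pulled strongly back toward their pre-trained values, while parameters the pre-training tasks were insensitive to remain free to adapt during fine-tuning.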

Settings
We train a Transformer with 12 encoder layers and 12 decoder layers, a hidden size of 1024, and 16 attention heads. We adopt the public mBART.CC25 released by Liu et al. (2020) as the initialization. For the CST task, the pre-training data consists of the TED, Europarl, News Commentary, and Rapid corpora. The monolingual target documents used in the ISG task are extracted from Wikipedia. For the PST task, we sample bilingual sentences from NewsCrawl until 2018. We use a SentencePiece model (Kudo and Richardson, 2018) to tokenize all data. Gradient accumulation is used to simulate a batch size of 128K tokens. We use the Adam optimizer with linear learning rate decay. The learning rate and dropout are set to 3e−5 and 0.1, respectively. We set λ in Eq. 1 to 0.01. We evaluate on three EN-DE Doc-MT datasets provided by Maruf et al. (2019), namely TED, News, and Europarl, and perform a limited grid search over hyperparameters.
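To make two of these settings concrete, the sketch below shows how the number of gradient-accumulation steps follows from a target batch size in tokens, and what a linear learning-rate decay from the reported peak of 3e−5 looks like. Several values are assumptions for illustration only: 128K is read as 128,000 tokens, the per-step capacity of 4,096 tokens and the 1,000 total steps are made up, and no warmup phase is modeled because none is described.

```python
import math


def accumulation_steps(target_tokens: int, tokens_per_step: int) -> int:
    """Micro-batches to accumulate so that one update covers >= target_tokens."""
    if target_tokens <= 0 or tokens_per_step <= 0:
        raise ValueError("token counts must be positive")
    return math.ceil(target_tokens / tokens_per_step)


def linear_decay_lr(step: int, total_steps: int, peak_lr: float = 3e-5) -> float:
    """Learning rate decayed linearly from peak_lr at step 0 to 0 at total_steps."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule's range
    return peak_lr * (1.0 - step / total_steps)


# Hypothetical hardware budget: 4,096 tokens fit into one forward/backward pass.
steps = accumulation_steps(target_tokens=128_000, tokens_per_step=4_096)
schedule = [linear_decay_lr(s, total_steps=1_000) for s in (0, 500, 1_000)]
print(steps, schedule)
```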

Baselines
Unpretrained models. Transformer (Vaswani et al., 2017) performs context-agnostic sent-level translation, and HAN (Werlen et al., 2018) employs hierarchical attention to capture extra contexts. SAN (Maruf et al., 2019) utilizes top-down attention to selectively focus on relevant sentences, and QCN (Yang et al., 2019) uses query-guided capsule networks to capture the related capsules.
Pretrained models. Flat-Transformer applies BERT as the initialization of the encoder. We also implement parallel sentence translation-based pre-training with mBART (Liu et al., 2020) initialization as the most comparable baseline.
For a fair comparison, we adopt multi-BLEU as the evaluation metric. We first conduct SPM-based detokenization on the generated texts and then use Moses to re-tokenize all texts, as the baselines do. Table 2 shows the performance of different systems. The results first confirm that large-scale pre-training can effectively accomplish model transfer and advance the performance of Doc-MT. Besides, we observe a significant performance gain for our approach compared to the baselines. For instance, it surpasses the mBART-initialized model with PST by 0.72 BLEU. With the proposed pre-training tasks, our approach succeeds in acquiring more effective knowledge from external large-scale corpora, leading to better translation quality.
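The SPM-based detokenization step mentioned above can be sketched without any external library, since SentencePiece marks word boundaries with the meta symbol "▁" (U+2581). The function name `spm_detokenize` and the example pieces are hypothetical; the subsequent Moses re-tokenization and BLEU scoring rely on external scripts and are not reproduced here.

```python
from typing import Sequence

SPM_SPACE = "\u2581"  # SentencePiece meta symbol that stands for a whitespace


def spm_detokenize(pieces: Sequence[str]) -> str:
    """Merge subword pieces back into plain text.

    Pieces are concatenated without separators, then every meta symbol is
    turned back into a space and the leading space is stripped.
    """
    return "".join(pieces).replace(SPM_SPACE, " ").strip()


# Made-up segmentation of a short German sentence.
pieces = ["\u2581Das", "\u2581ist", "\u2581ein", "\u2581Bei", "spiel", "."]
text = spm_detokenize(pieces)
print(text)
```

Detokenizing before applying a shared word-level tokenizer puts all systems on the same token inventory, which is what makes BLEU scores comparable across models with different subword vocabularies.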

Incremental Analysis
Here we perform further incremental analysis. We treat Transformer with mBART initialization as the base model and cumulatively add each pre-training task until the full approach is rebuilt. The results are shown in Table 3. We observe that the removal of the parallel sentence translation (PST) task results in the largest performance degradation. First, the scale of parallel sentences used for PST far exceeds that for the other two tasks, bringing significant performance gains. In addition, PST closely resembles the downstream Doc-MT task, encouraging more effective knowledge transfer. Besides, Table 3 also reveals that the other two tasks, CST and ISG, play an active role in improving translation quality. By masking the whole source sentence in the input via CST, the model is encouraged to more effectively extract and utilize valuable information from extra contexts. With the target doc-level language modeling, the cross-sentence dependency within the document is better captured. Both contribute to improving the quality of Doc-MT.
Table 3: The results of incremental analysis on the News dataset. A mark indicates that the corresponding pre-training task is adopted.

Effectiveness of EWC Regularization
To avoid the catastrophic forgetting of pre-trained models in downstream fine-tuning, we introduce EWC regularization to force the model to remember the original language modeling task. Table 4 presents the comparison of our approach with or without EWC regularization, demonstrating its effectiveness in improving model performance. Results show that EWC regularization can achieve consistent improvements on various datasets, increasing the average BLEU score from 29.24 to 29.54. By weighing the original LM task and newly introduced NMT task based on the importance of parameters, the overfitting of the pre-trained model on the limited downstream data is effectively alleviated, bringing consistent performance gains.
Table 4: The comparison of our approach with or without elastic weight consolidation (EWC) regularization.

Conclusion
This work presents context-interactive pre-training, which lets document machine translation benefit from external large-scale monolingual or bilingual corpora. The proposed approach strives to capture the cross-sentence dependency within the target document via inter sentence generation, and to utilize valuable information contained in the source context via cross sentence translation. Extensive experiments illustrate that our approach consistently outperforms a wide range of baselines, achieving state-of-the-art performance on various benchmark datasets.