Multi-Hop Transformer for Document-Level Machine Translation

Document-level neural machine translation (NMT) has proven valuable for its effectiveness in capturing contextual information. Nevertheless, existing approaches 1) simply introduce the representations of context sentences without explicitly characterizing the inter-sentence reasoning process; and 2) feed ground-truth target contexts as extra inputs at training time, and thus face the problem of exposure bias. We approach these problems with inspiration from human behavior: human translators ordinarily form a translation draft in their mind and progressively revise it according to their reasoning over the discourse. To this end, we propose a novel Multi-Hop Transformer (MHT), which gives NMT the ability to explicitly model this human-like draft-editing and reasoning process. Specifically, our model treats the sentence-level translation as a draft and refines its representations by iteratively attending to multiple antecedent sentences. Experiments on four widely used document translation tasks demonstrate that our method significantly improves document-level translation performance and can tackle discourse phenomena such as coreference errors and polysemy.


Introduction
Neural machine translation (NMT) employs an end-to-end framework (Sutskever et al., 2014) and has achieved promising results on various sentence-level translation tasks (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Wan et al., 2020). However, most NMT models handle sentences independently, regardless of the linguistic context that may appear outside the current sentence (Tiedemann and Scherrer, 2017a). This makes NMT insufficient to fully resolve typical context-dependent phenomena, e.g. coreference (Guillou, 2016), lexical cohesion (Carpuat, 2009), and lexical disambiguation (Gonzales et al., 2017). Recent studies (Tu et al., 2018; Maruf et al., 2019; Tan et al., 2019; Kim et al., 2019; Zheng et al., 2020; Sun et al., 2020; Ma et al., 2020) have proven effective at tackling discourse phenomena by feeding NMT with contextual information, e.g. source-side (Wang et al., 2017; Voita et al., 2018) or target-side context sentences (Bawden et al., 2018; Miculicich et al., 2018). Despite their successes, these methods simply merge the representations of context sentences, lacking a mechanism to explicitly characterize the inter-sentence reasoning over the context. Another shortcoming of existing document-level NMT is the problem of exposure bias. Most methods use the ground-truth target context for training but the generated translations for inference, leading to inconsistent inputs at training and testing time (Ranzato et al., 2015; Koehn and Knowles, 2017).
† These authors contributed equally to this work. * Corresponding author.
Intuitively, human translators tend to acquire useful context information from the reasoning process among sentences, and thus figure out the correct meaning when they encounter ambiguity during translation. Sukhbaatar et al. (2015) empirically verified that modeling multi-hop reasoning among sentences benefits language understanding tasks, e.g. text comprehension. Voita et al. (2019) showed that document-level NMT models can profit from relative positions with respect to context sentences, which to some extent confirms the importance of the relationship among sentences. Meanwhile, Xia et al. (2017) demonstrated that sentence-level NMT can be improved by a two-pass draft-editing process, in which a second-pass decoder refines the target sentence generated by a standard first-pass decoder.
Accordingly, we propose to improve document-level NMT using a novel framework, the Multi-Hop Transformer, which imitates the draft-editing and reasoning process of human translators. Specifically, we implement an explicit reasoning process by exploiting source and target antecedent sentences with concurrently stacked attention layers, thus progressively refining the representations of the current sentence and its translation. Besides, we leverage the draft to present context information on the target side during both training and testing, alleviating the problem of exposure bias.
We conduct experiments on four widely used document translation tasks: English-German and Chinese-English TED, English-Russian Opensubtitles, and English-German Europarl-7. Experimental results demonstrate that our method significantly outperforms both context-agnostic and context-aware methods. The qualitative analysis confirms the effectiveness of the proposed multi-hop reasoning mechanism in resolving many linguistic phenomena, such as word sense disambiguation and coreference resolution. Our main contributions are:
• We propose the Multi-Hop Transformer. To the best of our knowledge, this is the first investigation that introduces multi-hop reasoning into document-level NMT.
• The proposed model takes target context drafts into account at training time, which helps avoid the training-generation discrepancy.
• Our approach significantly improves document-level translation performance on four document-level translation tasks in terms of BLEU scores and resolves some context-dependent phenomena, such as coreference errors and polysemy.

Preliminary
Transformer. NMT is an end-to-end framework for building translation models. Vaswani et al. (2017) propose an architecture called Transformer, which adopts self-attention networks for both encoding and decoding. Both its encoder and decoder consist of multiple layers, each of which includes a multi-head self-attention sub-layer and a feed-forward sub-layer. Additionally, each layer of the decoder applies multi-head cross-attention to capture information from the encoder. Transformer has shown superiority in a variety of NLP tasks. Therefore, we build our models upon this architecture.
Document-level NMT. In order to correctly translate a sentence with discourse phenomena, NMT models need to look beyond the current sentence and integrate contextual sentences as auxiliary inputs. Formally, let $X = (x^1, x^2, \dots, x^I)$ be a source-language document composed of $I$ sentences, where $x^i = (x^i_1, x^i_2, \dots, x^i_N)$ denotes the $i$-th sentence containing $N$ words. Correspondingly, the target-language document also consists of $I$ sentences, $Y = (y^1, y^2, \dots, y^I)$, where $y^i = (y^i_1, y^i_2, \dots, y^i_M)$ denotes the $i$-th sentence containing $M$ words. Document-level NMT incorporates contextual information from both the source side and the target side to autoregressively generate the translation with the highest probability:
$$\hat{y}^i = \arg\max_{y^i} \prod_{m=1}^{M} P\left(y^i_m \mid y^i_{<m}, x^i, X_{-i}, Y_{-i}\right) \tag{1}$$
where $y^i_{<m}$ is the sequence of preceding tokens before position $m$, and $X_{-i}$ and $Y_{-i}$ denote the context sentences of the $i$-th sentence.
Related Work. Several studies have explored multi-input models to leverage the contextual information from source-side (Jean et al., 2017) or target-side sentences (Miculicich et al., 2018). For the former, a new encoder has been proposed to represent document-level context from previous source-side sentences. Tiedemann and Scherrer (2017b) and Junczys-Dowmunt (2019) utilize the concatenation of previous source-side sentences as input, while Voita et al. (2018) make use of a gate mechanism to balance the weight between the current source sentence and its context. For the latter, Miculicich et al. (2018) propose a hierarchical attention (HAN) framework to capture the target contextual information in the decoder. Bawden et al. (2018) and Maruf et al. (2019) take both source-side and target-side context into account.
Motivation. As seen, both kinds of existing methods simply introduce the context sentences without explicitly characterizing the inter-sentence reasoning. Intuitively, when humans have difficulty in translation, such as when encountering ambiguity, they can acquire more information from the contexts sentence by sentence and then perform reasoning to figure out the exact meaning. We hypothesize that such a reasoning process is also beneficial to the machine translation task. Recent successes in the text comprehension community support this hypothesis to some extent (Hill et al., 2015; Kumar et al., 2016). For example, Sukhbaatar et al. (2015) propose a multi-hop end-to-end memory network, which renews the query representation over multiple computational steps (which they term "hops"). Dhingra et al. (2016) extend an attention-sum reader to multi-turn reasoning with a gating mechanism. In addition, multi-hop attention has been introduced, which uses multiple turns to effectively exploit and reason over the relations among queries and documents.

Figure 1: The architecture of the Multi-Hop Transformer. $c_s^{i-j}$ and $c_t^{i-j}$ indicate the $j$-th previous sentence on the source side and target side respectively. $d$ denotes the draft of the current source sentence. All drafts are generated by a pre-trained sentence-level NMT model. The modules inside the dashed box are the proposed multi-hop attention layers, which gradually refine the representation of the current sentence. Finally, the context gate $\alpha$ is used to control the contextual information.
In this paper, we propose to bring the idea of multi-hop reasoning into document translation, aiming to mimic the multi-step comprehension and revision process of human translators. In contrast to text comprehension models, which scan the query and document over multiple passes, our model iteratively focuses on different context sentences, capturing the inter-sentence reasoning semantics of the contextual sentences to incrementally refine the representation of the current sentence.

Multi-Hop Transformer
With this in mind, we propose a novel method called Multi-Hop Transformer, which models the reasoning process among multiple contextual sentences on both the source side and the target side. The source-side contexts are directly acquired from the document. The target-side contexts, called target-side drafts in this paper, are generated by a sentence-level NMT model. These contexts are fed into the Multi-Hop Transformer with pre-trained encoders. The overall architecture of our proposed model is illustrated in Figure 1 and consists of three components:
• Sentence Encoder: This component contains two pre-trained encoders, a source-side sentence encoder and a target-side sentence encoder. These encoders generate representations for the source-side contexts and the target-side drafts respectively.
• Multi-Hop Encoder: We extend the original Transformer encoder with a novel multi-hop encoder to efficiently perform sentence-by-sentence reasoning on the source-side contexts and generate the representation for the current sentence.
• Multi-Hop Decoder: Similarly, a multi-hop decoder is proposed to acquire information from the target-side drafts and model the translation probability distribution.

Sentence Encoder
We use a multi-layer, multi-head self-attention architecture (Vaswani et al., 2017) to obtain the representations for the source-side contexts and target-side drafts. Similar to the Transformer encoder, the sentence encoder contains a stack of six identical layers, each of which consists of two sub-layers. The first sub-layer is a multi-head attention MHA(Q, K, V), which takes a query Q, a key K and a value V as inputs. The second sub-layer is a fully connected feed-forward network (FFN).
Source-Side Sentence Encoder. This encoder is utilized to generate the representations for the source-side contexts, as shown in Figure 1. For the current sentence $s = x^i$ to be translated, we use the previous sentences $X_{-i} = (x^{i-k}, x^{i-k+1}, \dots, x^{i-1})$ in the same document as the source-side context, denoted as $c_s^{i-k}, c_s^{i-k+1}, \dots, c_s^{i-1}$ for clarity, where $k$ is the context window size. For the $j$-th context, we obtain $A^{(n)}_{c_s^{i-j}}$, the $n$-th hidden-layer representation of $c_s^{i-j}$, with the standard Multi-Head Attention function MHA (Vaswani et al., 2017), where $n = 1, 2, \dots, 6$ and $j$ denotes the distance between the context sentence and the current sentence.
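The selection of the context window described above can be illustrated with a minimal Python sketch. The function name `source_context`, the 0-based sentence indexing, and the handling of sentences near the start of a document (they simply receive fewer context sentences) are assumptions made for illustration; the text does not specify them.

```python
from typing import List, Tuple

def source_context(doc: List[str], i: int, k: int = 3) -> List[Tuple[int, str]]:
    """Return (distance j, sentence) pairs for up to k sentences preceding doc[i].

    Pairs are ordered from the farthest context (largest j) to the nearest (j = 1).
    Assumption: sentences near the document start just get fewer contexts.
    """
    if not 0 <= i < len(doc):
        raise IndexError("sentence index out of range")
    start = max(0, i - k)
    return [(i - p, doc[p]) for p in range(start, i)]

doc = ["s0", "s1", "s2", "s3", "s4"]
print(source_context(doc, 4, k=3))  # [(3, 's1'), (2, 's2'), (1, 's3')]
print(source_context(doc, 1, k=3))  # [(1, 's0')]
```

Returning the distance j together with each sentence keeps the ordering explicit, which matters later because the hops are indexed by that distance.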
Target-Side Sentence Encoder. Most existing works use ground-truth target-side contexts as the input of decoder during training (Voita et al., 2019). However, the target contexts at training and testing are drawn from different distributions, leading to the inconsistency between training and testing.
To alleviate this problem, we instead make use of target-side context drafts generated by a pre-trained sentence-level translation model. Similar to the source-side sentence encoder, this target-side context draft encoder is used to obtain the context representation $A^{(n)}_{c_t^{i-j}}$ of the $j$-th target-side draft $c_t^{i-j}$. Besides, we obtain a draft translation $d$ of the current sentence from the pre-trained sentence-level translation model and use a target-side draft encoder to obtain the representation $A^{(n)}_d$.

Multi-Hop Encoder
The multi-hop encoder contains a stack of 6 identical layers, each of which contains the following sub-layers:
Self-Attention Layer. The first sub-layer makes use of multi-head self-attention to encode the information of the current source sentence $s$ and obtains the representation $A^{(n)}_s$.
Multi-Hop Attention Layer. A multi-hop reasoning process is then performed over the source-side contexts, one context sentence per hop. The first hop attends to the farthest context sentence ($j = k$), and the other hops follow with $j = k-1, k-2, \dots, 1$, where $j$ denotes the distance between the context sentence and the current sentence.
Context Gating. The information of the current source sentence is crucial in translation, while the contextual information is auxiliary. To avoid excessive use of contextual information, a context gating mechanism (Yang et al., 2019) is introduced to dynamically control the weight between the context sentences and the current sentence. The context gate $\alpha$ is computed with the logistic sigmoid function $\sigma$ and the weight matrices $W_a$ and $W_b$. Finally, we obtain the representation $Enc_s = H^{(6)}_s$ as the final output of the multi-hop encoder.
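To make the hop-by-hop refinement and the gating step concrete, the following NumPy sketch shows one possible reading of a multi-hop encoder layer. It is an illustration under stated assumptions, not the exact formulation of the model: single-head scaled dot-product attention stands in for MHA; residual connections, layer normalization and the feed-forward sub-layer are omitted; and the gate form `alpha = sigmoid(current @ w_a + hopped @ w_b)`, with `alpha` weighting the current sentence, is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Single-head scaled dot-product attention, a stand-in for MHA."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def multi_hop(current, contexts):
    """Refine `current` with one attention hop per context sentence.

    `contexts` is ordered from the farthest sentence (j = k) to the nearest
    (j = 1); the output of each hop is the query of the next one.
    """
    state = current
    for ctx in contexts:
        state = attention(state, ctx, ctx)
    return state

def context_gate(current, hopped, w_a, w_b):
    """Mix current-sentence and context-refined states with a sigmoid gate.

    Assumed form: alpha = sigmoid(current @ w_a + hopped @ w_b).
    """
    alpha = 1.0 / (1.0 + np.exp(-(current @ w_a + hopped @ w_b)))
    return alpha * current + (1.0 - alpha) * hopped

rng = np.random.default_rng(0)
d = 8
current = rng.normal(size=(5, d))                        # 5 tokens in the current sentence
contexts = [rng.normal(size=(n, d)) for n in (4, 6, 3)]  # k = 3 context sentences
w_a, w_b = rng.normal(size=(d, d)), rng.normal(size=(d, d))
hopped = multi_hop(current, contexts)
output = context_gate(current, hopped, w_a, w_b)
print(output.shape)  # (5, 8)
```

Because each hop uses the previous hop's output as its query, the representation of the current sentence is refined sequentially rather than merged with all contexts at once, which is the contrast with concatenation-style context modeling.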

Multi-Hop Decoder
Similarly, the multi-hop decoder involves a stack of 6 identical layers. Each of them contains five sub-layers.
Self-Attention Layer. The first sub-layer utilizes multi-head self-attention to encode the information of the current target sentence $t$ and obtains the representation $A^{(n)}_t$.
Draft-Attention Layer. Inspired by Xia et al. (2017), we introduce the complete draft $d$ translated from the current source sentence by a sentence-level NMT model. The draft is encoded by the target-side draft encoder in Section 3.1 into the representation $A^{(n)}_d$, and the draft attention is computed by multi-head attention over this representation.
Multi-Hop Attention Layer. Similar to the encoder, a multi-hop reasoning process is performed on the target-side contexts. The target-side drafts are generated from the corresponding source sentences by a pre-trained sentence-level NMT model. The first hop takes the representation $F$ and attends to the farthest context draft ($j = k$), and the other hops follow with $j = k-1, k-2, \dots, 1$, where $j$ denotes the distance between the context draft and the current target draft.
Context Gating. As in the multi-hop encoder, a context gate $\alpha$ is used to regulate the weight of the target-side contextual information in the output of this sub-layer.
Encoder-Decoder Attention Layer. Finally, we use an encoder-decoder attention layer to integrate the output of the multi-hop encoder $Enc_s$ with the current target representation $G$, producing $H^{(n)}_t$, the final representation of the decoder.
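The order of the five decoder sub-layers can be summarized in equation form. The block below is a sketch of one plausible formulation, not a reproduction of the model's exact equations. The hop outputs $\tilde{F}^{(n)}_{t,j}$, the use of $F^{(n)}_t$ for the draft-attention output, and the gate form with $W_a$ and $W_b$ are assumptions, and residual connections and layer normalization are omitted.

```latex
\begin{align}
A^{(n)}_t &= \mathrm{MHA}\big(H^{(n-1)}_t, H^{(n-1)}_t, H^{(n-1)}_t\big)
  && \text{self-attention} \\
F^{(n)}_t &= \mathrm{MHA}\big(A^{(n)}_t, A^{(n)}_d, A^{(n)}_d\big)
  && \text{draft attention} \\
\tilde{F}^{(n)}_{t,k} &= \mathrm{MHA}\big(F^{(n)}_t, A^{(n)}_{c_t^{i-k}}, A^{(n)}_{c_t^{i-k}}\big)
  && \text{first hop} \\
\tilde{F}^{(n)}_{t,j} &= \mathrm{MHA}\big(\tilde{F}^{(n)}_{t,j+1}, A^{(n)}_{c_t^{i-j}}, A^{(n)}_{c_t^{i-j}}\big),
  \quad j = k-1, \dots, 1
  && \text{other hops} \\
\alpha &= \sigma\big(W_a F^{(n)}_t + W_b \tilde{F}^{(n)}_{t,1}\big)
  && \text{context gate} \\
G^{(n)}_t &= \alpha \odot F^{(n)}_t + (1 - \alpha) \odot \tilde{F}^{(n)}_{t,1} \\
H^{(n)}_t &= \mathrm{MHA}\big(G^{(n)}_t, Enc_s, Enc_s\big)
  && \text{encoder-decoder attention}
\end{align}
```

Under this reading, the target-side context only enters through the hops and is then balanced against the draft-informed representation by the gate before the source information is integrated.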

Datasets
To evaluate the effectiveness of the proposed MHT, we conduct experiments on four widely used document translation tasks, including TED Talk (Cettolo et al., 2012) with two language pairs, Opensubtitles and Europarl7. All datasets are tokenized and truecased with the Moses toolkit (Koehn et al., 2007), and split into sub-word units with a joint BPE model (Sennrich et al., 2016) with 30K merge operations. The datasets are described as follows:
• TED Talk (English-German): We use the dataset of the IWSLT 2017 MT English-German track for training, which contains transcripts of TED talks aligned at the sentence level. dev2010 is used for development and tst2016-2017 for evaluation. There are 0.21M sentences in the training set, 9K sentences in the development set, and 2.3K sentences in the test set.
• TED Talk (Chinese-English): We use the corpus consisting of 0.2M sentence pairs extracted from IWSLT 2014 and 2015 Chinese-English track for training. dev2010 involves 0.8K sentences for development and tst2010-2013 contains 5.5K sentences for test.
• Opensubtitles (English-Russian): We make use of the parallel corpus from . The training set includes 0.3M sentence pairs. There are 6K sentence pairs in development set, and 9K in test set.
• Europarl7 (English-German): The raw Europarl v7 corpus (Koehn, 2005) contains SPEAKER and LANGUAGE tags, where the latter indicates the language the speaker was actually using. We process the raw data and extract the parallel corpus in the same way as Maruf

Baselines
We compare our model against four NMT systems as follows:
• Transformer: The state-of-the-art context-agnostic NMT model (Vaswani et al., 2017).
• CA-Transformer: A context-aware Transformer model with an additional context encoder that incorporates document contextual information into the model.
• CA-HAN: A context-aware hierarchical attention network that integrates document contextual information from both the source side and the target side (Miculicich et al., 2018).
• CADec: A two-pass machine translation model (Context-Aware Decoder, CADec) which first produces a draft translation of the current sentence, then corrects it using context (Voita et al., 2019).

Implementation Details
Our model is implemented on the open-source toolkit THUMT (Zhang et al., 2017). The Adam optimizer (Kingma and Ba, 2014) is applied with an initial learning rate of 0.1. The hidden dimension and the feed-forward layer size are set to 512 and 2048 respectively. The encoder and decoder each have 6 layers with 8-head multi-head attention. Dropout is 0.1 and the batch size is set to 4096. Beam size is 4 for inference. Translation quality is evaluated by the traditional metric BLEU (Papineni et al., 2002) on tokenized text. The context window size is set to 3, consistent with the experiments in Section 5.2.
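For reference, the hyperparameters listed above can be collected in one place. The sketch below is a plain Python dataclass, not a THUMT configuration file; the class and field names are hypothetical, and the unit of the batch size is not stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MHTConfig:
    """Hyperparameters as reported in the text (field names are illustrative)."""
    hidden_size: int = 512
    ffn_size: int = 2048
    num_layers: int = 6          # encoder and decoder depth
    num_heads: int = 8
    dropout: float = 0.1
    batch_size: int = 4096       # unit (tokens vs. sentences) is not stated in the text
    learning_rate: float = 0.1   # initial learning rate for Adam
    beam_size: int = 4
    context_window: int = 3      # k, number of preceding sentences
    bpe_merges: int = 30_000

    @property
    def head_dim(self) -> int:
        """Per-head dimension implied by the hidden size and head count."""
        return self.hidden_size // self.num_heads

config = MHTConfig()
print(config.head_dim)  # 64
```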
To initialize the source-side sentence encoder in Section 3.1, a sentence-level NMT model is trained from the source language to the target language using the corresponding datasets without any additional corpus. The encoder of this trained model is used to initialize the source-side context encoder. We also use the trained model to translate the source-side sentences and obtain the target-side drafts. Similarly, we train a sentence-level model from the target language to the source language to initialize the target-side encoders in Section 3.1. To reduce the computational overhead, we share the parameters among the sentence encoders on the same side. The settings of these two sentence-level NMT models are consistent with our baseline Transformer model.
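The data preparation implied by this procedure can be sketched as follows. The sketch assumes a hypothetical `translate` callable standing in for the pre-trained sentence-level model, and the dictionary keys are illustrative. The point it shows is that the target-side contexts are drafts produced by the model, never reference translations, so the same construction applies at training and at inference time.

```python
from typing import Callable, Dict, List

def build_examples(
    src_doc: List[str],
    tgt_doc: List[str],
    translate: Callable[[str], str],  # hypothetical sentence-level NMT model
    k: int = 3,
) -> List[Dict[str, object]]:
    """Pair every sentence with its source contexts and target-side drafts."""
    drafts = [translate(s) for s in src_doc]  # one draft per source sentence
    examples = []
    for i, (src, tgt) in enumerate(zip(src_doc, tgt_doc)):
        lo = max(0, i - k)
        examples.append({
            "source": src,
            "target": tgt,                    # reference, used only as the training label
            "draft": drafts[i],               # draft of the current sentence
            "source_contexts": src_doc[lo:i],
            "target_drafts": drafts[lo:i],    # drafts, not references, as target context
        })
    return examples

# Toy stand-in for the translation model: upper-casing the input.
examples = build_examples(["a", "b", "c"], ["x", "y", "z"], str.upper, k=2)
print(examples[2]["target_drafts"])  # ['A', 'B']
```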

Results
Table 1 summarizes the BLEU scores of different systems on the four tasks. As seen, our baseline and our re-implementations of existing methods outperform the reported results on the same data, which we believe makes the evaluation convincing. Our model MHT significantly improves translation quality in terms of BLEU on these tasks, and obtains the best average results, gaining 0.38, 0.69 and 1.57 BLEU points over CADec, CA-Transformer and CA-HAN respectively. These results demonstrate the universality and effectiveness of the proposed approach. Moreover, without introducing large-scale pre-trained language models, our translation systems achieve new state-of-the-art translation quality on three of the examined translation tasks: TED (En-De), Opensubtitles (En-Ru) and Europarl7 (En-De). Overall, our experiments indicate the following two points: 1) explicitly modeling the underlying reasoning semantics with a multi-hop mechanism indeed benefits neural machine translation, and 2) the improvements of our model do not come from enlarging the network.

Analysis
In this section, to gain further insight, we explore the effectiveness of several factors of our model, including 1) multi-hop attention; 2) context window size; 3) reasoning direction; 4) sides for introducing context; and 5) target contexts. Moreover, we show qualitative analysis on discourse phenomena to better understand the advantage of our model.

Multi-Hop Attention
To further investigate the effect of multi-hop reasoning, we compare our multi-hop attention with two baseline context modeling methods, "Concat" and "Hierarchical Attention". Table 2 shows the results of the three context modeling modules on TED, all of which use the same inputs containing the original training data and drafts. "Concat" denotes the MHT model that simply concatenates the representations of the three context sentences to get the final context representation. "Hierarchical Attention" denotes the MHT model with a hierarchical attention to model context, which consists of a sentence-level attention and a token-level attention to capture information from the appropriate context sentences and tokens, as in Miculicich et al. (2018). For both baselines, we replace multi-hop attention with the corresponding module. As shown in Table 2, "Hierarchical Attention" slightly outperforms "Concat", while multi-hop attention leads both of them by a much larger margin. The results demonstrate that multi-hop attention is capable of providing a more fine-grained representation of the reasoning state over the context, and consequently captures context semantic information more accurately.

Context Window Size
As shown in Figure 2, we conduct experiments with different context window sizes to explore their effect. When the window size is less than 4, the model obtains more information from the contexts and achieves better performance as the window size gets larger. However, when the window size is increased to 4, the performance does not improve further but decreases slightly. This shows that contexts far from the target sentence may be less relevant and introduce noise (Kim et al., 2019). Therefore, we choose a window size of 3 for our model MHT.

Reasoning Direction
In Table 3, R2L indicates the MHT model that encodes the context sentences in the opposite direction. We observe that integrating the reasoning process through multi-hop attention in either direction improves over Transformer, owing to the incorporation of extra context information. Besides, the MHT model reasoning in natural sentence order (L2R) outperforms the MHT model with the opposite reasoning direction. This is in line with our expectation, since L2R reasoning is consistent with the reading and reasoning direction of human beings.
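The two reasoning directions differ only in the order in which the hops visit the context sentences. A minimal sketch, assuming that L2R (natural sentence order) visits the farthest context first and R2L reverses this order; the function name `hop_order` is hypothetical:

```python
from typing import List

def hop_order(k: int, direction: str = "L2R") -> List[int]:
    """Return the context distances j in the order the hops visit them.

    L2R follows natural sentence order: farthest context (j = k) first.
    R2L is the reverse: nearest context (j = 1) first.
    """
    if k < 1:
        raise ValueError("context window size k must be at least 1")
    if direction == "L2R":
        return list(range(k, 0, -1))
    if direction == "R2L":
        return list(range(1, k + 1))
    raise ValueError(f"unknown direction: {direction!r}")

print(hop_order(3, "L2R"))  # [3, 2, 1]
print(hop_order(3, "R2L"))  # [1, 2, 3]
```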

Different Sides for Introducing Context
As shown in Table 4, we conduct an ablation study to explore how the MHT model benefits from contexts on the source side and the target side. "None" indicates the MHT model without the multi-hop attention module on either side, using only the draft of the current sentence. "Source", "Target" and "Source & Target" indicate the MHT models with the multi-hop attention module introducing context on only the source side, only the target side, and both sides respectively. We find that integrating source-side or target-side context into the model brings improvements over "None", which ignores context on both sides. Besides, MHT with context on both sides achieves the best performance, indicating that the beneficial context information captured by multi-hop attention on the source side and on the target side is different and complementary.

Draft vs. Reference
In training, the context draft sentences can be the drafts from a pre-trained MT system or the context references, while only the generated drafts are accessible during inference. Table 5 shows the BLEU scores of the MHT models using generated drafts and context references during training. We can see that the MHT model using drafts as contexts outperforms the MHT model directly using target-side context references, possibly because using context references faces the problem of exposure bias and the drafts generated from pre-trained translation system can bridge the gap between training and testing data.

Qualitative Analysis
We present the translated results from the baselines and our model in Table 6 to explore how multi-hop reasoning mitigates the impact of common discourse phenomena in the translation process. According to Case 1 in Table 6, the noun "hum" in the source sentence is translated to "der Summen" by Transformer and CA-Transformer, which fail to understand the correct coreference. In German, "der" is a masculine article. The correct article is the neuter article "das" because the "hum" is from a machine. MHT can perform a reasoning process to leverage the context information effectively and figure out that the "hum" is from an engine according to Context 2. Case 2 indicates that MHT can understand the exact meaning of a polysemous word, benefiting from the reasoning process among the contexts. In this case, Transformer, CA-Transformer and CA-HAN all translate the noun "show" into "zeigt", which means "display". The translation is clearly wrong in this context. The correct meaning of "show" is a TV show like "Breaking Bad" according to Context 1. In contrast, our model can take the previous contexts into consideration and reason out the exact meaning of the polysemous word.

Table 6 (excerpt):
Context3 | Case 1: They were 12, 3, and 1 when the hum stopped. | Case 2: So if your show gets a rating of nine points ...
Context2 | Case 1: The hum of the engine died.

Conclusion
In this paper, we propose a novel document-level translation model called Multi-Hop Transformer, inspired by human reasoning behavior, to explicitly model the human-like draft-editing and reasoning process. Experimental results on four widely used tasks show that our model achieves better performance than both context-agnostic and context-aware strong baselines. Furthermore, the qualitative analysis shows that the multi-hop reasoning mechanism is capable of resolving some discourse phenomena by capturing context semantics more accurately.