Modeling Discourse Structure for Document-level Neural Machine Translation

Recently, document-level neural machine translation (NMT) has become a hot topic in the machine translation community. Despite its success, most existing studies ignore the discourse structure information of the input document to be translated, which has been shown to be effective in other tasks. In this paper, we propose to improve document-level NMT with the aid of discourse structure information. Our encoder is based on a hierarchical attention network (HAN) (Miculicich et al., 2018). Specifically, we first parse the input document to obtain its discourse structure. Then, we introduce a Transformer-based path encoder to embed the discourse structure information of each word. Finally, we combine the discourse structure information with the word embedding before it is fed into the encoder. Experimental results on the English-to-German dataset show that our model can significantly outperform both Transformer and Transformer+HAN.


Introduction
Neural machine translation (NMT) has made great progress in the past decade. In practical applications, the need for NMT systems has expanded from individual sentences to complete documents. Therefore, document-level NMT has gradually drawn much more attention. Contextual information is particularly important for obtaining high-quality document translation. To get better contextual information, researchers have proposed many methods (e.g., memory networks and hierarchical attention networks) for document-level translation (Sim Smith, 2017; Tiedemann and Scherrer, 2017; Wang et al., 2017a; Tu et al., 2017; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Maruf and Haffari, 2018; Maruf et al., 2019; Yang et al., 2019). Discourse structure, as well as raw contextual sentences, is a major component of the document. It has been proved to be effective in many other tasks, such as automatic document summarization (Yoshida et al., 2014; Isonuma et al., 2019) and sentiment classification (Schouten and Frasincar, 2016; Nejat et al., 2017). However, to the best of our knowledge, discourse structure has not been explicitly used in document-level NMT.
To address the above problem, we propose to improve document-level NMT with the aid of discourse structure information. First, we represent each input document with a Rhetorical Structure Theory-based discourse tree (Mann and Thompson, 1988). Then, we use a Transformer-based path encoder to embed the discourse structure path of each word and combine it with the corresponding word embedding before feeding it into the sentence encoder. In this way, discourse structure information can be fully exploited to enrich word representations and guide the context encoder to capture the relevant context of the current sentence. Finally, we adopt HAN (Miculicich et al., 2018) as our context encoder to model context information in a hierarchical manner.
Our contributions are as follows: (i) We propose a novel and efficient approach to explicitly exploit discourse structure information for document-level NMT. In particular, our approach is applicable to any other context encoder of document-level NMT; (ii) We carry out experiments on the English-to-German translation task, and the experimental results show that our model outperforms competitive baselines.

Related Work
In the era of statistical machine translation, document-level machine translation became one of the research focuses of the machine translation community (Xiao et al., 2011; Su et al., 2012; Xiao et al., 2012; Su et al., 2015). Recently, with the rapid development of NMT, document-level NMT has also gradually attracted attention (Voita et al., 2018; Maruf and Haffari, 2018; Miculicich et al., 2018; Maruf et al., 2019; Yang et al., 2019). Typically, existing studies aim to improve document-level translation quality with the help of document context, which is usually extracted from the neighboring sentences of the current sentence. For example, some researchers applied cache-based models to selectively remember the most relevant context information of the document (Voita et al., 2018; Maruf and Haffari, 2018; Kuang et al., 2018), while others employed hierarchical context networks to capture document context information for Transformer (Miculicich et al., 2018; Maruf et al., 2019; Yang et al., 2019). Specifically, Miculicich et al. (2018) proposed a hierarchical attention network to model contextual information, Maruf et al. (2019) applied a selective attention method to select the contextual information that is most relevant to the current sentence, and Yang et al. (2019) employed a capsule network to model multi-angle context information.

arXiv:2006.04721v1 [cs.CL] 8 Jun 2020
Although these methods have made some progress in document-level NMT, they all ignore discourse structure information, which can be used not only to enrich word embeddings but also to guide the selection of relevant context for the current sentence.

Our Encoder
We propose a novel document-level NMT model based on HAN (Miculicich et al., 2018). The difference between our model and HAN is that we introduce the RST-based discourse structure to represent the document-level context, which is incorporated into HAN to refine translation.
Figure 1 gives the architecture of our proposed encoder. In addition to the standard encoder for the current sentence, it contains HAN (Miculicich et al., 2018) as the context encoder and a novel path encoder for the discourse structure. We first use the Transformer-based path encoder to model discourse structure information. Then, we combine the embedding of each input word with its corresponding path embedding vector and feed the combined vector into the sentence encoder. Finally, we use the hierarchical attention network to capture the global contextual embedding and update the hidden states of the current sentence as the final output of the whole encoder.
In our model, the translation of a document is made by translating each of its sentences sequentially. We introduce discourse structure for both the current sentence and contextual sentences. Given a source document $X$, the translation probability of the target document $Y$ can be defined as:
$$P(Y \mid X; \theta) = \prod_{j} P(Y_j \mid X_j, D_j, S; \theta),$$
where $X_j$ and $Y_j$ denote the $j$-th source sentence and its target translation respectively, $D_j$ denotes the contextual sentences, $S$ represents the discourse structure of the document to be translated, and $\theta$ is the parameter set of the model.

Figure 2: An example discourse tree with six EDUs. N and S denote the relative importance labels NUCLEUS and SATELLITE, respectively. Sentence 3 is the current sentence to be translated, with two previous context sentences 1 and 2. On the tree, the path marked with dotted lines from the root node to the leaf node $e_5$ is used to represent the discourse structure of $e_5$.
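The sentence-by-sentence factorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `context_window`, `document_log_prob`, and `toy_scorer` are hypothetical names, the context D_j is taken to be the up-to-k preceding source sentences, and the discourse structure S is assumed to be handled inside the scoring function.

```python
from typing import Callable, List, Sequence


def context_window(sentences: Sequence[str], j: int, k: int = 3) -> List[str]:
    """Return D_j: up to k sentences that precede sentence j (0-based)."""
    return list(sentences[max(0, j - k):j])


def document_log_prob(
    source: Sequence[str],
    target: Sequence[str],
    sentence_log_prob: Callable[[str, str, List[str]], float],
    k: int = 3,
) -> float:
    """Sum per-sentence log-probabilities: log P(Y|X) = sum_j log P(Y_j | X_j, D_j, S)."""
    total = 0.0
    for j, (x_j, y_j) in enumerate(zip(source, target)):
        d_j = context_window(source, j, k)
        total += sentence_log_prob(y_j, x_j, d_j)
    return total


def toy_scorer(y_j: str, x_j: str, d_j: List[str]) -> float:
    """Hypothetical stand-in for a trained model: charges 1.0 per target token."""
    return -float(len(y_j.split()))


source = ["a b", "c", "d e f", "g", "h i"]
target = ["A B", "C", "D E F", "G", "H I"]
print(context_window(source, 4))                       # the three sentences before the last one
print(document_log_prob(source, target, toy_scorer))   # sum of the toy per-sentence scores
```

Working in log space turns the product over sentences into a sum, which is how such a factorization is normally evaluated in practice.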

RST-based Discourse Structure
For each document to be translated, we parse it to obtain its discourse structure based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). RST is one of the most influential theories of document coherence. According to RST, we represent each document with a hierarchical tree. As shown in Figure 2, the discourse tree contains some leaf nodes, each of which indicates an Elementary Discourse Unit (EDU). Adjacent leaf nodes are recursively connected into units by certain coherence relations (e.g., ELABORATION, BACKGROUND) until the entire tree is built. In addition, NUCLEUS or SATELLITE is used to mark the relative importance of child node units in the relation.
In this work, we represent the discourse structure information of each word using its discourse path from the root node to its corresponding leaf node. Each path is a mixed label sequence composed of the discourse relation and the importance label (e.g., NUCLEUS ELABORATION, SATELLITE BACKGROUND). Please note that all tokens in the same EDU share the same discourse structure. For example, the discourse structure of EDU $e_5$ is "SATELLITE ELABORATION SATELLITE ELABORATION SATELLITE CONTRAST".
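As a rough illustration of the path extraction described above, the following sketch walks a small discourse tree and collects the root-to-leaf label sequence for every EDU. The `Node` class, the function names, and the three-EDU tree are hypothetical and do not reproduce Figure 2 or the output format of any particular parser; the sketch assumes each non-root node carries one importance label and one relation label.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A discourse tree node; the root carries no labels."""
    nuclearity: Optional[str] = None      # "NUCLEUS" or "SATELLITE"
    relation: Optional[str] = None        # e.g. "ELABORATION"
    children: List["Node"] = field(default_factory=list)
    edu: Optional[str] = None             # EDU identifier, set only on leaves


def edu_paths(root: Node) -> Dict[str, List[str]]:
    """Map each EDU to its label sequence along the root-to-leaf path."""
    paths: Dict[str, List[str]] = {}

    def walk(node: Node, prefix: List[str]) -> None:
        if node.edu is not None:
            paths[node.edu] = prefix
            return
        for child in node.children:
            walk(child, prefix + [child.nuclearity, child.relation])

    walk(root, [])
    return paths


def token_paths(edu_tokens: Dict[str, List[str]],
                paths: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Give every token the path of the EDU that contains it."""
    return [(tok, paths[edu]) for edu, toks in edu_tokens.items() for tok in toks]


# Hypothetical toy tree with three EDUs.
tree = Node(children=[
    Node("NUCLEUS", "ELABORATION", edu="e1"),
    Node("SATELLITE", "ELABORATION", children=[
        Node("NUCLEUS", "CONTRAST", edu="e2"),
        Node("SATELLITE", "CONTRAST", edu="e3"),
    ]),
])
paths = edu_paths(tree)
print(" ".join(paths["e3"]))
print(token_paths({"e1": ["The", "cat"], "e3": ["but", "not"]}, paths))
```

Because the path is stored per EDU and only looked up per token, every token of an EDU receives exactly the same label sequence, matching the sharing rule stated above.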

Discourse Structure Path Embedding
To integrate the structural information of words into our HAN-based document-level NMT model, we additionally introduce a Transformer-based path encoder to encode the discourse structure paths of words. Specifically, for each word $w_i$, we directly treat its discourse structure path $p_i$ as a sequence and then employ the path encoder to learn its contextual hidden states, which are finally averaged to produce the overall discourse embedding vector $d_i$. Then, we enrich each input word embedding with its corresponding discourse embedding vector before it is fed into the context encoder or the translation encoder. Concretely, for the word $w_i$, we define its enriched vector as the sum of its word embedding and its discourse embedding $d_i$.
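The encode, average, and add steps can be illustrated with a minimal pure-Python sketch. It is not the path encoder of the paper: a single parameter-free self-attention pass stands in for the Transformer-based path encoder, the embeddings are random, and all function names and dimensions are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def make_embeddings(symbols: List[str], dim: int, seed: int) -> Dict[str, Vector]:
    """Random embedding table (stand-in for learned parameters)."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for s in symbols}


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(states: List[Vector]) -> List[Vector]:
    """One scaled dot-product self-attention pass without learned projections."""
    dim = len(states[0])
    out = []
    for q in states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in states]
        weights = softmax(scores)
        out.append([sum(w * v[t] for w, v in zip(weights, states)) for t in range(dim)])
    return out


def discourse_embedding(path: List[str], label_emb: Dict[str, Vector]) -> Vector:
    """Encode the label sequence p_i and average the hidden states into d_i."""
    hidden = self_attention([label_emb[label] for label in path])
    dim = len(hidden[0])
    return [sum(h[t] for h in hidden) / len(hidden) for t in range(dim)]


def enrich(word_vec: Vector, d_vec: Vector) -> Vector:
    """Enriched input vector: word embedding plus discourse embedding."""
    return [w + d for w, d in zip(word_vec, d_vec)]


DIM = 4
label_emb = make_embeddings(["NUCLEUS", "SATELLITE", "ELABORATION", "CONTRAST"], DIM, seed=0)
word_emb = make_embeddings(["but", "not"], DIM, seed=1)
path = ["SATELLITE", "ELABORATION", "SATELLITE", "CONTRAST"]
d = discourse_embedding(path, label_emb)
enriched = {w: enrich(vec, d) for w, vec in word_emb.items()}
print(enriched["but"])
```

Since both words here belong to the same EDU, they share the same vector `d`, so only their word embeddings distinguish their enriched inputs.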

HAN-based Context Modeling
Following Miculicich et al. (2018), we apply a hierarchical attention network (HAN) as our context encoder. Owing to its ability to accurately capture different levels of context, HAN has been widely used in many tasks, such as document classification (Yang et al., 2016), stance detection (Sun et al., 2018), and sentence-level NMT (Su et al., 2018b). Using this encoder, we mainly focus on two levels of context modeling:
Sentence-level Context Modeling For the $i$-th word of the current sentence, we employ multi-head attention (Vaswani et al., 2017) to summarize the context from the $k$-th context sentence:
$$cs_i^k = \mathrm{MultiHead}(f_s(h_i), H^k),$$
where $f_s$ is a linear transformation function, $h_i$ denotes the hidden state representation of the $i$-th token of the current sentence, and $H^k$ is the hidden state representation of the $k$-th context sentence, which is used as the key and value for attention. By doing so, our context encoder can exploit different types of relations between words to better capture sentence-level context.
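To make the sentence-level step concrete, here is a minimal sketch that uses single-head scaled dot-product attention as a simplification of multi-head attention. The query is a linear transformation of the current token's hidden state, and the hidden states of one context sentence serve as both keys and values. The function names, the matrix `W_s`, and the toy values are assumptions for illustration only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(W: Matrix, x: Vector) -> Vector:
    """Apply a linear transformation (matrix-vector product)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query: Vector, keys_values: Matrix) -> Vector:
    """Single-head scaled dot-product attention; keys and values coincide."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, kv)) / math.sqrt(dim) for kv in keys_values]
    weights = softmax(scores)
    return [sum(w * kv[t] for w, kv in zip(weights, keys_values)) for t in range(len(keys_values[0]))]


def sentence_context(h_i: Vector, H_k: Matrix, W_s: Matrix) -> Vector:
    """Summarize context sentence k for token i: attend(f_s(h_i), H_k)."""
    return attend(linear(W_s, h_i), H_k)


W_s = [[1.0, 0.0], [0.0, 1.0]]   # identity, standing in for a learned f_s
h_i = [2.0, 0.0]                 # hidden state of the current token
H_k = [[1.0, 0.0], [0.0, 1.0]]   # hidden states of one context sentence
cs = sentence_context(h_i, H_k, W_s)
print(cs)
```

The result is a weighted average of the context sentence's hidden states, with more weight on the tokens whose states align with the query.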

Document-level Context Modeling
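Unlike the above modeling, here we mainly focus on capturing the context information from the previous $K$ sentences for the $i$-th word of the current sentence:
$$cd_i = \mathrm{FFN}(\mathrm{MultiHead}(f_d(h_i), CS_i)),$$
where $f_d$ is a linear transformation, and $CS_i = [cs_i^1, \ldots, cs_i^K]$ collects the sentence-level context vectors of the $K$ context sentences.
Integrating Document-level Context into the Translation Encoder Finally, we integrate the above-mentioned document-level context into the translation encoder via a gating operation:
$$\lambda_i = \sigma(W_h h_i + W_{cd}\, cd_i), \qquad \tilde{h}_i = \lambda_i h_i + (1 - \lambda_i)\, cd_i,$$
where $\sigma$ is the sigmoid function, $W_h$ and $W_{cd}$ denote parameter matrices for $h_i$ and $cd_i$, and $\tilde{h}_i$ is the final output of the encoder.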
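The gating operation can be sketched as follows, assuming it takes the sigmoid-interpolation form used by HAN (Miculicich et al., 2018): a gate computed from the hidden state and the document-level context decides, per dimension, how much of each to keep. The function names and the toy matrices are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_output(h_i: Vector, cd_i: Vector, W_h: Matrix, W_cd: Matrix) -> Vector:
    """Assumed gate: lam = sigmoid(W_h h_i + W_cd cd_i); output = lam*h_i + (1-lam)*cd_i."""
    pre = [a + b for a, b in zip(linear(W_h, h_i), linear(W_cd, cd_i))]
    lam = [sigmoid(p) for p in pre]
    return [l * h + (1.0 - l) * c for l, h, c in zip(lam, h_i, cd_i)]


h_i = [1.0, -2.0]                  # sentence-encoder hidden state
cd_i = [3.0, 4.0]                  # document-level context vector
zeros = [[0.0, 0.0], [0.0, 0.0]]   # zero matrices make every gate value 0.5
print(gated_output(h_i, cd_i, zeros, zeros))
```

With zero parameter matrices every gate value is 0.5, so the output is the plain average of the two vectors; learned matrices would shift the balance toward the hidden state or the context for each dimension.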

Settings
Datasets We conduct our experiments on the English-to-German translation task. The sentence-aligned, document-delimited News Commentary v11 corpus¹, the WMT16 newstest2015, and the newstest2016 are used as the training set, development set, and test set, respectively. We download all of the above corpora from Maruf et al. (2019); their statistics are provided in Table 1.
Settings We use Transformer (Vaswani et al., 2017) as our context-agnostic baseline system and Transformer+HAN (Miculicich et al., 2018) as our context-aware baseline system. We conduct experiments using the same configuration as HAN. Specifically, both the sentence encoder and the decoder are composed of 6 hidden layers, while the path encoder is composed of 2 hidden layers. We use the three previous sentences as contextual sentences for the current sentence. The hidden size and point-wise FFN size are 512 and 2048, respectively. The dropout rates for all hidden states are set to 0.1. The source and target vocabulary sizes are both 30K. In the training phase, we use the Adam optimizer (Kingma and Ba, 2015), and the batch sizes of the context-agnostic model and the context-aware model are 4096 and 1024, respectively. Finally, we use case-sensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to measure the translation quality.
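For quick reference, the hyperparameters listed above can be gathered into a single configuration object. The sketch below is only a convenience: the class and field names are hypothetical, "30K" is assumed to mean 30,000, and the unit of the batch sizes is not stated in the text, so the values are recorded as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Settings as reported in the text; field names are illustrative only."""
    sentence_encoder_layers: int = 6
    decoder_layers: int = 6
    path_encoder_layers: int = 2
    context_sentences: int = 3               # previous sentences used as context
    hidden_size: int = 512
    ffn_size: int = 2048
    dropout: float = 0.1
    vocab_size: int = 30_000                 # "30K", assumed to mean 30,000
    batch_size_context_agnostic: int = 4096  # unit not stated in the text
    batch_size_context_aware: int = 1024     # unit not stated in the text
    optimizer: str = "adam"


config = ExperimentConfig()
print(config)
```

Freezing the dataclass keeps the reported settings from being modified accidentally when the object is passed around.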
Data Preprocessing All datasets are tokenized and truecased using the scripts of the Moses Toolkit (Koehn et al., 2007). We split them into subword units using a joint byte pair encoding model with 30K merge operations. To get the discourse structure of the input documents, we first apply the open-source tool NeuralEDUSeg (Wang et al., 2018) to obtain non-overlapping EDUs. Then, we employ StageDP (Wang et al., 2017b) to obtain discourse structure trees of the segmented documents. Afterwards, we extract the path from the root node to the leaf node as the discourse structure information for the corresponding EDU, where all words share the same discourse structure path.

Results and Analysis
The sentence-level Transformer integrated with discourse structure achieves an improvement of 0.83 on BLEU and a decline of 0.8 on TER. By contrast, our model integrated with both contextual information and discourse structure information performs better still: it is 2.06 higher than Transformer and 0.39 higher than Transformer+HAN on BLEU, and 2.9 lower than Transformer and 0.5 lower than Transformer+HAN on TER.
Our experimental results show that discourse structure features indeed provide helpful information to enhance the translation quality of both context-agnostic and context-aware document-level NMT models. Please note that our approach is also applicable to other document-level NMT models.

Conclusion
This paper has presented a novel discourse structure-based encoder for document-level NMT. The main idea of our encoder lies in enriching the input word embeddings with their path embeddings based on discourse structure. Experimental results on English-to-German translation verify the effectiveness of our proposed encoder.
In the future, we plan to extend our encoder to other NLP tasks, such as simultaneous translation. Like document-level NMT, simultaneous translation has difficulty modeling long contexts, so the structure information modeled by our encoder may help to improve retranslation. Finally, we will focus on refining document-level NMT with variational neural networks, which have been shown to be effective in previous studies of sentence-level NMT (Zhang et al., 2016; Su et al., 2018a).

Figure 1: The architecture of our proposed encoder.

Table 1: The statistics of our datasets. #Sentence indicates the number of sentences, and Document length means the average number of sentences per document.

1 http://www.casmacat.eu/corpus/news-commentary.html

Table 2: BLEU and TER scores for different models. The best scores are marked in bold. HAN denotes the Hierarchical Attention Network, which is used to obtain context information, while DS denotes Discourse Structure information.