Document-Level Neural Machine Translation with Hierarchical Attention Networks

Neural Machine Translation (NMT) can be improved by including document-level contextual information. For this purpose, we propose a hierarchical attention model to capture the context in a structured and dynamic manner. The model is integrated in the original NMT architecture as another level of abstraction, conditioning on the NMT model’s own previous hidden states. Experiments show that hierarchical attention significantly improves the BLEU score over a strong NMT baseline with the state-of-the-art in context-aware methods, and that both the encoder and decoder benefit from context in complementary ways.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) trains an encoder-decoder network on sentence pairs to maximize the likelihood of predicting a target-language sentence given the corresponding source-language sentence, without considering the document context. By ignoring discourse connections between sentences and other valuable contextual information, this simplification potentially degrades the coherence and cohesion of a translated document (Hardmeier, 2012; Meyer and Webber, 2013; Sim Smith, 2017). Recent studies (Tiedemann and Scherrer, 2017; Wang et al., 2017) have demonstrated that adding contextual information to the NMT model improves the general translation performance and, more importantly, improves the coherence and cohesion of the translated text (Bawden et al., 2018; Lapshinova-Koltunski and Hardmeier, 2017). Most of these methods use an additional encoder (Wang et al., 2017) to extract contextual information from previous source-side sentences. However, this requires additional parameters and does not exploit the representations already learned by the NMT encoder. More recently, a cache-based memory network has been shown to perform better than the above encoder-based methods. The cache-based memory keeps past context as a set of words, where each cell corresponds to one unique word and keeps the hidden representations learned by the NMT model while translating it. However, in this method, the word representations are stored irrespective of the sentences where they occur, and those vector representations are disconnected from the original NMT network.
We propose to use a hierarchical attention network (HAN) (Yang et al., 2016) to model the contextual information in a structured manner using word-level and sentence-level abstractions. In contrast to the hierarchical recurrent neural network (HRNN) used by Wang et al. (2017), here the attention allows dynamic access to the context by selectively focusing on different sentences and words for each predicted word. In addition, we integrate two HANs in the NMT model to account for target and source context. The HAN encoder helps in the disambiguation of source-word representations, while the HAN decoder improves the target-side lexical cohesion and coherence. The integration is done by (i) re-using the hidden representations from both the encoder and decoder of previous sentence translations and (ii) providing input to both the encoder and decoder for the current translation. This integration method enables the model to be optimized jointly over multiple sentences. Furthermore, we extend the original HAN with multi-head attention (Vaswani et al., 2017) to capture different types of discourse phenomena.
Our main contributions are the following: (i) We propose a HAN framework for translation to capture context and inter-sentence connections in a structured and dynamic manner. (ii) We integrate the HAN in a very competitive NMT architecture (Vaswani et al., 2017) and show significant improvement over two strong baselines on multiple data sets. (iii) We perform an ablation study of the contribution of each HAN configuration, showing that the contextual information obtained from the source and target sides is complementary.

The Proposed Approach
The goal of NMT is to maximize the likelihood of a set of sentences in a target language, represented as sequences of words $y = (y_1, \dots, y_t)$, given a set of input sentences in a source language $x = (x_1, \dots, x_m)$, as:
$$\max_{\Theta} \frac{1}{N} \sum_{n=1}^{N} \log P_{\Theta}(y^n \mid x^n)$$
where $\Theta$ denotes the model parameters and $N$ the number of sentence pairs; so the translation of a document $D$ is made by translating each of its sentences independently. In this study, we introduce dependencies on the previous sentences from the source and target sides:
$$\max_{\Theta} \frac{1}{N} \sum_{n=1}^{N} \log P_{\Theta}(y^n \mid x^n, D_{x^n}, D_{y^n})$$
where $D_{x^n} = (x^{n-k}, \dots, x^{n-1})$ and $D_{y^n} = (y^{n-k}, \dots, y^{n-1})$ denote the previous $k$ sentences from the source and target sides respectively. The contexts $D_{x^n}$ and $D_{y^n}$ are modeled with HANs.
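As a small illustration of the context definition above, the following sketch pairs each sentence of a document with its previous k sentences. The function name `context_windows` and the list-of-strings document representation are assumptions made for this example, not part of the described system.

```python
def context_windows(document, k=3):
    """Pair each sentence with its (up to) k preceding sentences.

    `document` is a list of sentences in order; this helper is hypothetical
    and only illustrates D_n = (s_{n-k}, ..., s_{n-1}).
    """
    windows = []
    for n, sentence in enumerate(document):
        # Sentences near the start of the document have fewer than k predecessors.
        context = document[max(0, n - k):n]
        windows.append((context, sentence))
    return windows


doc = ["s1", "s2", "s3", "s4", "s5"]
for context, current in context_windows(doc, k=3):
    print(current, "<-", context)
```

The same windowing would apply to the source and target sides; at test time the target-side context consists of previously generated translations rather than references.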

Hierarchical Attention Network
The proposed HAN has two levels of abstraction. The word-level abstraction summarizes information from each previous sentence $j$ into a vector $s^j$ as:
$$q_w = f_w(h_t)$$
$$s^j = \mathrm{MultiHead}_i(q_w, h_i^j)$$
where $h$ denotes a hidden state of the NMT network. In particular, $h_t$ is the last hidden state of the word to be encoded or decoded at time step $t$, and $h_i^j$ is the last hidden state of the $i$-th word of the $j$-th sentence of the context. The function $f_w$ is a linear transformation to obtain the query $q_w$. We used the MultiHead attention function proposed by Vaswani et al. (2017) to capture different types of relations among words. It matches the query against each of the hidden representations $h_i^j$ (used as value and key for the attention). The sentence-level abstraction summarizes the contextual information required at time $t$ in $d_t$ as:
$$q_s = f_s(h_t)$$
$$d_t = \mathrm{FFN}(\mathrm{MultiHead}_j(q_s, s^j))$$
where $f_s$ is a linear transformation, $q_s$ is the query for the attention function, and FFN is a position-wise feed-forward layer (Vaswani et al., 2017). Each layer is followed by a normalization layer (Lei Ba et al., 2016).
Figure 1: Integration of HAN during encoding at time step $t$; $\tilde{h}_t$ is the context-aware hidden state of the word $x_t$. A similar architecture is used during decoding.
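The two-level attention described in this section can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the model itself: it uses single-head scaled dot-product attention instead of multi-head attention, takes the query transformations f_w and f_s to be the identity, and omits the FFN and normalization layers. The function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, vectors):
    """Single-head scaled dot-product attention; each vector is both key and value."""
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    summary = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return summary, weights


def hierarchical_context(h_t, context_sentences):
    """context_sentences: list of sentences, each a list of hidden-state vectors."""
    # Word level: one summary vector per previous sentence (f_w = identity, assumed).
    sentence_summaries = [attend(h_t, sentence)[0] for sentence in context_sentences]
    # Sentence level: attend over the summaries (f_s = identity, FFN/norm omitted).
    d_t, sentence_weights = attend(h_t, sentence_summaries)
    return d_t, sentence_weights


h_t = [1.0, 0.0]
context = [
    [[1.0, 0.0], [0.0, 1.0]],  # sentence 1: one word aligned with the query
    [[0.0, 1.0], [0.0, 1.0]],  # sentence 2: no word aligned with the query
]
d_t, weights = hierarchical_context(h_t, context)
print(d_t, weights)
```

In this toy input the first context sentence contains a word whose state matches the query, so it receives the larger sentence-level weight; the query thus selects both which words and which sentences contribute to the context vector.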

Context Gating
We use a gate to regulate the information at sentence level $h_t$ and the contextual information at document level $d_t$. The intuition is that different words require different amounts of context for translation:
$$\lambda_t = \sigma(W_h h_t + W_p d_t)$$
$$\tilde{h}_t = \lambda_t h_t + (1 - \lambda_t) d_t$$
where $W_h$, $W_p$ are parameter matrices, and $\tilde{h}_t$ is the final hidden representation for a word $x_t$ or $y_t$.
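A minimal sketch of such a gate is given below. It assumes an element-wise sigmoid gate computed from W_h h_t + W_p d_t that interpolates between the sentence-level state and the document-level context; this exact form, the helper names, and the toy weights are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def context_gate(h_t, d_t, W_h, W_p):
    """Blend sentence-level state h_t with document-level context d_t.

    Assumed form: lam = sigmoid(W_h h_t + W_p d_t), output = lam*h_t + (1-lam)*d_t,
    with all products taken element-wise.
    """
    pre_activation = [a + b for a, b in zip(matvec(W_h, h_t), matvec(W_p, d_t))]
    lam = [sigmoid(z) for z in pre_activation]
    return [l * h + (1.0 - l) * d for l, h, d in zip(lam, h_t, d_t)]


# With zero weight matrices the gate is 0.5 everywhere, so the output is the average.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(context_gate([2.0, 0.0], [0.0, 4.0], zeros, zeros))
```

Because the gate is computed per dimension and per word, a word that needs little context can keep most of its sentence-level representation while another word in the same sentence relies more on the document-level vector.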

Integrated Model
The context can be used during encoding or decoding a word, and it can be taken from previously encoded source sentences, previously decoded target sentences, or from previous alignment vectors (i.e. context vectors (Bahdanau et al., 2015)). The different configurations define the input query and values of the attention function. In this work we experiment with five of them: one at encoding time, three at decoding time, and one combining both. At encoding time, the query is a function of the hidden state $h_{x_t}$ of the current word to be encoded $x_t$, and the values are the encoded states of previous sentences $h^j_{x_i}$ (HAN encoder). At decoding time, the query is a function of the hidden state $h_{y_t}$ of the current word to be decoded $y_t$, and the values can be (a) the encoded states of previous sentences $h^j_{x_i}$ (HAN decoder source), (b) the decoded states of previous target sentences, or (c) the previous alignment vectors. The fifth configuration combines the encoding-time and decoding-time contexts.

Experimental Setup

Datasets and Evaluation Metrics
We carry out experiments with Chinese-to-English (Zh-En) and Spanish-to-English (Es-En) sets on three different domains: talks, subtitles, and news. TED Talks is part of the IWSLT 2014 and 2015 (Cettolo et al., 2012, 2015) evaluation campaigns. We use dev2010 for development, and tst2010-2012 (Es-En) and tst2010-2013 (Zh-En) for testing. The Zh-En subtitles corpus is a compilation of TV subtitles designed for research on context (Wang et al., 2018). In contrast to the other sets, it has three references for comparison. The Es-En corpus is a subset of OpenSubtitles2018 (Lison and Tiedemann, 2016). We randomly select two episodes each for development and testing. Finally, we use the Es-En News-Commentaries11 corpus, which has document-level delimitation. We evaluate on WMT sets (Bojar et al., 2013): newstest2008 for development, and newstest2009-2013 for testing. A similar corpus for Zh-En is too small to be comparable. Table 2 shows the corpus statistics.
For evaluation, we use the BLEU score (Papineni et al., 2002) (multi-bleu) on tokenized text, and we measure significance with the paired bootstrap resampling method proposed by Koehn (2004) (implementations by Koehn et al. (2007)).
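The idea behind paired bootstrap resampling can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: it compares two systems using summed per-sentence scores as a stand-in for corpus-level BLEU, whereas a faithful version would recompute corpus BLEU on every resampled test set. The function name and parameters are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores higher than B.

    scores_a[i] and scores_b[i] are scores of the two systems on the same
    test sentence i (an assumed simplification of corpus-level BLEU).
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; both systems share the draw.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_samples


win_rate = paired_bootstrap([0.4, 0.5, 0.6, 0.7], [0.3, 0.5, 0.5, 0.6])
print(win_rate)
```

If system A wins on, say, at least 95% of the resampled sets, the difference is commonly reported as significant at p < .05. Sharing the same resampled indices between the two systems is what makes the test paired.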

Model Configuration and Training
As baselines, we use an NMT transformer and a context-aware NMT transformer with cache memory, which we implemented for comparison following the best model described in prior work on that method, with a memory size of 25 words. We used the OpenNMT (Klein et al., 2017) implementation of the transformer network. The configuration is the same as the model called "base model" in the original paper (Vaswani et al., 2017). The encoder and decoder are composed of 6 hidden layers each. All hidden states have a dimension of 512, with dropout of 0.1 and 8 heads for the multi-head attention. The target and source vocabulary size is 30K. The optimization and regularization methods were the same as proposed by Vaswani et al. (2017). Following prior work, we trained the models in two stages: first we optimize the parameters for the NMT without the HAN, then we proceed to optimize the parameters of the whole network. We use k = 3 previous sentences, which gave the best performance on the development set.
Table 1 shows the BLEU scores for different models. The baseline NMT transformer already has better performance than previously published results on these datasets, and we replicate previous improvements from the cache method over this stronger baseline. All of our proposed HAN models perform at least as well as the cache method. The best scores are obtained by the combined encoder and decoder HAN model, which is significantly better than the cache method on all datasets without compromising training speed (2.3K vs 2.6K tok/sec). An important portion of the improvement comes from the HAN encoder, which can be attributed to the fact that the source side always contains correct information, while the target side may contain erroneous predictions at testing time. However, combining the HAN decoder with the HAN encoder further improves translation performance, showing that they contribute complementary information. The three ways of incorporating information into the decoder all perform similarly.
Table 3 shows the performance of our best HAN model with a varying number k of previous sentences on the test set. We can see that the best performance for TED talks and news is achieved with k = 3, while for subtitles it is similar between 3 and 7.

Accuracy of Pronoun/Noun Translations
We evaluate coreference and anaphora using a reference-based metric, the accuracy of pronoun translation (Miculicich Werlen and Popescu-Belis, 2017b), which can be extended for nouns. The list of evaluated pronouns is predefined in the metric, while the list of nouns was extracted using NLTK POS tagging (Bird, 2006). The upper part of Table 4 shows the results. For nouns, the joint HAN achieves the best accuracy with a significant improvement compared to other models, showing that target and source contextual information are complementary. Similarly, for pronouns, the joint model has the best result for TED talks and news. However, the HAN encoder alone is better in the case of subtitles. Here the HAN decoder produces mistakes by repeating past translated personal pronouns. Subtitles is a challenging corpus for personal pronoun disambiguation because it usually involves dialogue between multiple speakers.
Table 1: BLEU score for the different configurations of the HAN model, and two baselines. The highest score per dataset is marked in bold. ∆ denotes the difference in BLEU score with respect to the NMT transformer. The significance values with respect to the NMT and the cache method are denoted by * and † respectively. The repetitions correspond to the p-values: *, † < .05; **, †† < .01; ***, ††† < .001.

Cohesion and Coherence Evaluation
We use the metric proposed by Wong and Kit (2012) to evaluate lexical cohesion. It is defined as the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document. The lexical similarity is obtained using WordNet. Table 4 (bottom-left) displays the average ratio per tested document. In some cases, the HAN decoder achieves the best score because it produces a larger quantity of repetitions than other models. However, as shown in the evaluation of pronoun and noun translations, repetitions do not always make the translation better. Although HAN boosts lexical cohesion, the scores are still far from the human reference, so there is room for improvement in this aspect.
For coherence, we use a metric based on Latent Semantic Analysis (LSA) (Foltz et al., 1998). LSA is used to obtain sentence representations, then cosine similarity is calculated from one sentence to the next, and the results are averaged to get a document score. We employed the pre-trained LSA model Wiki-6 from Stefanescu et al. (2014). Table 4 (bottom-right) shows the average coherence score of documents. The joint HAN model consistently obtains the best coherence score, although it is close to the other HAN models. Most of the improvement comes from the HAN decoder.
Table 5 shows an example where HAN helped to generate the correct translation. The first box shows the current sentence with the analyzed word in bold, and the second shows the past context at source and target. For the context visualization we use the toolkit provided by Pappas and Popescu-Belis (2017). Red corresponds to sentences, and blue to words. The intensity of color is proportional to the weight. We see that HAN correctly translates the ambiguous Spanish pronoun "su" into the English "his". The HAN decoder highlighted a previous mention of "his", and the HAN encoder highlighted the antecedent "Nathaniel". This shows that HAN can capture interpretable inter-sentence connections.
More samples with different attention heads are shown in the Appendix.
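The coherence score used in this section, the average cosine similarity between consecutive sentence representations, can be sketched as below. The sentence vectors are assumed to be given (in the experiments they come from a pre-trained LSA model); the toy vectors and function names here are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coherence_score(sentence_vectors):
    """Average cosine similarity between each sentence and the next one."""
    pairs = list(zip(sentence_vectors, sentence_vectors[1:]))
    if not pairs:
        # A document with fewer than two sentences has no adjacent pairs.
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


document = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(coherence_score(document))
```

Only adjacent sentences are compared, so the score rewards local topical continuity rather than global document structure.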

Conclusion
We proposed a hierarchical multi-head HAN NMT model to capture inter-sentence connections. We integrated context from source and target sides by directly connecting representations from previous sentence translations into the current sentence translation. The model significantly outperforms two competitive baselines, and the ablation study shows that target and source contexts are complementary. It also improves lexical cohesion and coherence, and the translation of nouns and pronouns. The qualitative analysis shows that the model is able to identify important previous sentences and words for the correct prediction. In future work, we plan to explicitly model discourse connections with the help of annotated data, which may further improve translation quality.