Exploiting Cross-Sentence Context for Neural Machine Translation

In translation, considering the document as a whole can help to resolve ambiguities and inconsistencies. In this paper, we propose a cross-sentence context-aware approach and investigate the influence of historical contextual information on the performance of neural machine translation (NMT). First, this history is summarized in a hierarchical way. We then integrate the historical representation into NMT in two strategies: 1) a warm-start of encoder and decoder states, and 2) an auxiliary context source for updating decoder states. Experimental results on a large Chinese-English translation task show that our approach significantly improves upon a strong attention-based NMT system by up to +2.1 BLEU points.


Introduction
Neural machine translation (NMT) has been rapidly developed in recent years (Kalchbrenner and Blunsom, 2013;Sutskever et al., 2014;Bahdanau et al., 2015;Tu et al., 2016). The encoderdecoder architecture is widely employed, in which the encoder summarizes the source sentence into a vector representation, and the decoder generates the target sentence word by word from the vector representation. Using the encoder-decoder framework as well as gating and attention techniques, it has been shown that the performance of NMT has surpassed the performance of traditional statistical machine translation (SMT) on various language pairs (Luong et al., 2015).
The continuous vector representation of a symbol encodes multiple dimensions of similarity, equivalent to encoding more than one meaning of * Corresponding Author: Zhaopeng Tu a word. Consequently, NMT needs to spend a substantial amount of its capacity in disambiguating source and target words based on the context defined by a source sentence (Choi et al., 2016). Consistency is another critical issue in documentlevel translation, where a repeated term should keep the same translation throughout the whole document (Xiao et al., 2011;Carpuat and Simard, 2012). Nevertheless, current NMT models still process a documents by translating each sentence alone, suffering from inconsistency and ambiguity arising from a single source sentence. These problems are difficult to alleviate using only limited intra-sentence context.
The cross-sentence context, or global context, has proven helpful to better capture the meaning or intention in sequential tasks such as query suggestion (Sordoni et al., 2015) and dialogue modeling (Vinyals and Le, 2015;Serban et al., 2016). The leverage of global context for NMT, however, has received relatively little attention from the research community. 1 In this paper, we propose a cross-sentence context-aware NMT model, which considers the influence of previous source sentences in the same document. 2 Specifically, we employ a hierarchy of Recurrent Neural Networks (RNNs) to summarize the cross-sentence context from source-side previous sentences, which deploys an additional documentlevel RNN on top of the sentence-level RNN encoder (Sordoni et al., 2015). After obtaining the global context, we design several strategies to integrate it into NMT to translate the current sentence: • Initialization, that uses the history represen-1 To the best of our knowledge, our work and Jean et al. (2017) are two independently early attempts to model crosssentence context for NMT.
2 In our preliminary experiments, considering target-side history inversely harms translation performance, since it suffers from serious error propagation problems. tation as the initial state of the encoder, decoder, or both; • Auxiliary Context, that uses the history representation as static cross-sentence context, which works together with the dynamic intrasentence context produced by an attention model, to good effect.
• Gating Auxiliary Context, that adds a gate to Auxiliary Context, which decides the amount of global context used in generating the next target word at each step of decoding.
Experimental results show that the proposed initialization and auxiliary context (w/ or w/o gating) mechanisms significantly improve translation performance individually, and combining them achieves further improvement.

Approach
Given a source sentence x m to be translated, we consider its K previous sentences in the same document as cross-sentence context C = {x m−K , ..., x m−1 }. In this section, we first model C, which is then integrated into NMT.

Summarizing Global Context
As shown in Figure 1, we summarize the representation of C in a hierarchical way: Sentence RNN For a sentence x k in C, the sentence RNN reads the corresponding words {x 1,k , ..., x n,k , . . . , x N,k } sequentially and updates its hidden state: where f (·) is an activation function, and h n,k is the hidden state at time n. The last state h N,k stores order-sensitive information about all the words in x k , which is used to represent the summary of the whole sentence, i.e. S k ≡ h N,k . After processing each sentence in C, we can obtain all sentencelevel representations, which will be fed into document RNN.
Document RNN It takes as input the sequence of the above sentence-level representations {S 1 , ..., S k , ..., S K } and computes the hidden state as: where h k is the recurrent state at time k, which summarizes the previous sentences that have been processed to the position k. Similarly, we use the last hidden state to represent the summary of the global context, i.e. D ≡ h K .

Integrating Global Context into NMT
We propose three strategies to integrate the history representation D into NMT: Initialization We use D to initialize either NMT encoder, NMT decoder or both. For encoder, we use D as the initialization state rather than all-zero states as in the standard NMT (Bahdanau et al., 2015). For decoder, we rewrite the calculation of the initial hidden state where h N is the last hidden state in encoder and {W s , W D } are the corresponding weight metrices.
Auxiliary Context In standard NMT, as shown in Figure 2 (a), the decoder hidden state for time i is computed by where y i−1 is the most recently generated target word, and c i is the intra-sentence context summarized by NMT encoder for time i. As shown in Figure 2 (b), Auxiliary Context method adds the representation of cross-sentence context D to jointly update the decoding state s i : In this strategy, D serves as an auxiliary information source to better capture the meaning of the source sentence. Now the gated NMT decoder has four inputs rather than the original three ones. The concatenation [c i , D], which embeds both intraand cross-sentence contexts, can be fed to the decoder as a single representation. We only need to modify the size of the corresponding parameter matrix for least modification effort.  Gating Auxiliary Context The starting point for this strategy is an observation: the need for information from the global context differs from step to step during generation of the target words. For example, global context is more in demand when generating target words for ambiguous source words, while less by others. To this end, we extend auxiliary context strategy by introducing a context gate (Tu et al., 2017a) to dynamically control the amount of information flowing from the auxiliary global context at each decoding step, as shown in Figure 2 (c).
Intuitively, at each decoding step i, the context gate looks at decoding environment (i.e., s i , y i−1 , and c i ), and outputs a number between 0 and 1 for each element in D, where 1 denotes "completely transferring this" while 0 denotes "completely ignoring this". The global context vector D is then processed with an element-wise multiplication before being fed to the decoder activation layer.
Formally, the context gate consists of a sigmoid neural network layer and an element-wise multiplication operation. It assigns an element-wise weight to D, computed by Here σ(·) is a logistic sigmoid function, and {W z , U z , C z } are the weight matrices, which are trained to learn when to exploit global context to maximize the overall translation performance. Note that z i has the same dimensionality as D, and thus each element in the global context vector has its own weight. Accordingly, the decoder hidden state is updated by 3 Experiments

Setup
We carried out experiments on Chinese-English translation task. As the document information is necessary when selecting the previous sentences, we collect all LDC corpora that contain document boundary. The training corpus consists of 1M sentence pairs extracted from LDC corpora 3 with 25.4M Chinese words and 32.4M English words. We chose the NIST05 (MT05) as our development set, and NIST06 (MT06) and NIST08 (MT08) as test sets. We used case-insensitive BLEU score (Papineni et al., 2002) as our evaluation metric, and sign-test (Collins et al., 2005) for calculating statistical significance. We implemented our approach on top of an open source attention-based NMT model, Nematus 4 (Sennrich and Haddow, 2016;Sennrich et al., 2017). We limited the source and target vocabularies to the most frequent 35K words in Chinese and English, covering approximately 97.1% and 99.4% of the data in the two languages respectively. We trained each model on sentences of length up to 80 words in the training data with early stopping. The word embedding dimension was 600, the hidden layer size was 1000, and the batch size was 80. All our models considered the previous three sentences (i.e., K = 3) as crosssentence context.  Table 1: Evaluation of translation quality. "Init" denotes Initialization of encoder ("enc"), decoder ("dec"), or both ("enc+dec"), and "Auxi" denotes Auxiliary Context. " †" indicates statistically significant difference (P < 0.01) from the baseline NEMATUS. Table 1 shows the translation performance in terms of BLEU score. Clearly, the proposed approaches significantly outperforms baseline in all cases.

Results
Baseline (Rows 1-2) NEMATUS significantly outperforms Moses -a commonly used phrasebased SMT system (Koehn et al., 2007), by 2.3 BLEU points on average, indicating that it is a strong NMT baseline system. It is consistent with the results in (Tu et al., 2017b) (i.e., 26.93 vs. 29.41) on training corpora of similar scale.
Initialization Strategy (Rows 3-5) Init enc and Init dec improve translation performance by around +1.0 and +1.3 BLEU points individually, proving the effectiveness of warm-start with crosssentence context. Combining them achieves a further improvement.
Auxiliary Context Strategies (Rows 6-7) The gating auxiliary context strategy achieves a significant improvement of around +1.0 BLEU point over its non-gating counterpart. This shows that, by acting as a critic, the introduced context gate learns to distinguish the different needs of the global context for generating target words.
Combining (Row 8) Finally, we combine the best variants from the initialization and auxiliary context strategies, and achieve the best performance, improving upon NEMATUS by +2.1 BLEU points. This indicates the two types of strategies are complementary to each other.

Analysis
We first investigate to what extent the mistranslated errors are fixed by the proposed system. We randomly select 15 documents (about 60 sentences) from the test sets. As shown in Table 2, we count how many related errors: i) are made by NMT (Total), and ii) fixed by our method (Fixed); as well as iii) newly generated (New). About Ambiguity, while we found that 38 words/phrases were translated into incorrect equivalents, 76% of them are corrected by our model. Similarly, we solved 75% of the Inconsistency errors including lexical, tense and definiteness (definite or indefinite articles) cases. However, we also observe that our system brings relative 21% new errors.  Ref.

Errors Ambiguity Inconsistency All
Can it inhibit and deter corrupt officials?
NMT Can we contain and deter the enemy?

Our
Can it contain and deter the corrupt officials? Table 3: Example translations. We italicize some mis-translated errors and highlight the correct ones in bold. Table 3 shows an example. The word "腐官" (corrupt officials) is mis-translated as "enemy" by the baseline system. With the help of the similar word "贪官" in the previous sentence, our approach successfully correct this mistake. This demonstrates that cross-sentence context indeed helps resolve certain ambiguities.

Related Work
While our approach is built on top of hierarchical recurrent encoder-decoder (HRED) (Sordoni et al., 2015), there are several key differences which reflect how we have generalized from the original model. Sordoni et al. (2015) use HRED to summarize a single representation from both the current and previous sentences, which limits itself to (1) it is only applicable to encoder-decoder framework without attention model, (2) the representation can only be used to initialize decoder. In contrast, we use HRED to summarize the previous sentences alone, which provides additional cross-sentence context for NMT. Our approach is more flexible at (1) it is applicable to any encoderdecoder frameworks (e.g., with attention), (2) the cross-sentence context can be used to initialize either encoder, decoder or both. While both our approach and Serban et al. (2016) use Auxiliary Context mechanism for incorporating cross-sentence context, there are two main differences: 1) we have separate parameters to better control the effects of the cross-and intrasentence contexts, while they only have one parameter matrix to manage the single representation that encodes both contexts; 2) based on the intuition that not every target word generation requires equivalent cross-sentence context, we introduce a context gate (Tu et al., 2017a) to control the amount of information from it, while they don't.
At the same time, some researchers propose to use an additional set of an encoder and attention to model more information. For example, Jean et al. (2017) use it to encode and select part of the previous source sentence for generating each target word. Calixto et al. (2017) utilize global image features extracted using a pre-trained convolutional neural network and incorporate them in NMT. As additional attention leads to more computational cost, they can only incorporate limited information such as single preceding sentence in Jean et al. (2017). However, our architecture is free to this limitation, thus we use multiple preceding sentences (e.g. K = 3) in our experiments.
Our work is also related to multi-source (Zoph and Knight, 2016) and multi-target NMT (Dong et al., 2015), which incorporate additional source or target languages. They investigate one-tomany or many-to-one languages translation tasks by integrating additional encoders or decoders into encoder-decoder framework, and their experiments show promising results.

Conclusion and Future Work
We proposed two complementary approaches to integrating cross-sentence context: 1) a warmstart of encoder and decoder with global context representation, and 2) cross-sentence context serves as an auxiliary information source for updating decoder states, in which an introduced context gate plays an important role. We quantitatively and qualitatively demonstrated that the presented model significantly outperforms a strong attention-based NMT baseline system. We release the code for these experiments at https:// www.github.com/tuzhaopeng/LC-NMT.
Our models benefit from larger contexts, and would be possibly further enhanced by other document level information, such as discourse relations. We propose to study such models for full length documents with more linguistic features in future work.