Exploiting Sentential Context for Neural Machine Translation

In this work, we present novel approaches to exploit sentential context for neural machine translation (NMT). Specifically, we show that a shallow sentential context, extracted from the top encoder layer only, can improve translation performance by contextualizing the encoding representations of individual words. Next, we introduce a deep sentential context, which aggregates the sentential context representations from all of the internal layers of the encoder to form a more comprehensive context representation. Experimental results on the WMT14 English-German and English-French benchmarks show that our model consistently improves performance over the strong Transformer model, demonstrating the necessity and effectiveness of exploiting sentential context for NMT.


Introduction
Sentential context, which involves deep syntactic and semantic structure of the source and target languages (Nida, 1969), is crucial for machine translation. In statistical machine translation (SMT), the sentential context has proven beneficial for predicting local translations (Meng et al., 2015; Zhang et al., 2015). The exploitation of sentential context in neural machine translation (NMT; Bahdanau et al., 2015), however, is not well studied. Recently, Lin et al. (2018) showed that the translation at each time step should be conditioned on the whole target-side context. They introduced a deconvolution-based decoder to provide the global information from the target-side context for guidance of decoding.
In this work, we propose simple yet effective approaches to exploiting source-side global sentence-level context for NMT models. We use encoder representations to represent the source-side context, which are summarized into a sentential context vector. The source-side context vector is fed to the decoder, so that translation at each step is conditioned on the whole source-side context. Specifically, we propose two types of sentential context: 1) the shallow one that only exploits the top encoder layer, and 2) the deep one that aggregates the sentence representations of all the encoder layers. The deep sentential context can be viewed as a more comprehensive global sentence representation, since different types of syntactic and semantic information are encoded in different encoder layers (Shi et al., 2016; Peters et al., 2018; Raganato and Tiedemann, 2018).
We validate our approaches on top of the state-of-the-art TRANSFORMER model (Vaswani et al., 2017). Experimental results on the benchmark WMT14 English⇒German and English⇒French translation tasks show that exploiting sentential context consistently improves translation performance across language pairs. Among the model variations, the deep strategies consistently outperform their shallow counterparts, which confirms our claim. Linguistic analyses (Conneau et al., 2018) on the learned representations reveal that the proposed approach indeed provides richer linguistic information.
The contributions of this paper are:
• Our study demonstrates the necessity and effectiveness of exploiting source-side sentential context for NMT, which benefits from fusing useful contextual information across encoder layers.
• We propose several strategies to better capture useful sentential context for neural machine translation. Experimental results empirically show that the proposed approaches achieve improvement over the strong baseline model TRANSFORMER.

Approach
The encoding process is analogous to a human translator reading a sentence in the source language and summarizing its meaning (i.e., sentential context) for generating the equivalents in the target language. When humans translate a source sentence, they generally scan the sentence to form an overall understanding, with which in mind they incrementally generate the target sentence by selecting parts of the source sentence to translate at each decoding step. In current NMT models, the attention model plays the role of selecting parts of the source sentence, but there is no mechanism to guarantee that the decoder is aware of the whole meaning of the sentence. In response to this problem, we propose to augment NMT models with sentential context, which represents the whole meaning of the source sentence.

Framework
Figure 1 illustrates the framework of the proposed approach. Let $g = g(X)$ be the sentential context vector, where $g(\cdot)$ denotes the function that summarizes the source sentence $X$, which we discuss in the next sections. There are many possible ways to integrate the sentential context into the decoder. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that sentential context helps. In this work, we incorporate the sentential context into the decoder as in Equations 1 and 2, where $d^l_i$ is the $l$-th layer decoder state at decoding step $i$, $c^l_i$ is a dynamic vector that selects certain parts of the encoder output, and $\mathrm{FFN}_l(\cdot)$ is a distinct feed-forward network associated with the $l$-th layer of the decoder, which reads the $(l-1)$-th layer output $D^{l-1}$ and the sentential context $g$. In this way, at each decoding step $i$, the decoder is aware of the sentential context $g$ embedded in $D^{l-1}$.
In the following sections, we discuss the choice of $g(\cdot)$, namely shallow sentential context (Figure 1b) and deep sentential context (Figure 1c), which differ in the encoder layers to be exploited. It should be pointed out that the new parameters introduced in the proposed approach are jointly updated with the NMT model parameters in an end-to-end manner.
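As a rough illustration of the fusion step described above, the following sketch feeds each position of the previous decoder layer output, together with the sentential context g, through a feed-forward map. The concatenate-then-linear-plus-ReLU form, the function name `ffn_fuse`, and the parameters `weight` and `bias` are assumptions made for illustration; the text only states that the feed-forward network reads both inputs.

```python
def ffn_fuse(prev_layer, g, weight, bias):
    """Fuse the sentential context g into every position of a decoder layer output.

    prev_layer: list of n vectors of size d (the (l-1)-th decoder layer output).
    g:          sentential context vector of size d.
    weight:     d x 2d matrix; bias: vector of size d (hypothetical parameters).
    """
    fused = []
    for state in prev_layer:
        x = state + g  # list concatenation: [state ; g], size 2d
        row = [
            max(0.0, sum(w * v for w, v in zip(w_row, x)) + b)  # linear + ReLU
            for w_row, b in zip(weight, bias)
        ]
        fused.append(row)
    return fused


# Toy example: one decoder position, d = 2; the weights simply add state and g.
toy_out = ffn_fuse(
    prev_layer=[[1.0, -2.0]],
    g=[3.0, 4.0],
    weight=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```

Because the same g is combined with every position, each decoding step sees the whole-sentence summary regardless of where attention is focused.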

Shallow Sentential Context
Shallow sentential context is a function of the top encoder layer output $H^L$:

$$g = \mathrm{GLOBAL}(H^L), \quad (3)$$

where $\mathrm{GLOBAL}(\cdot)$ is the composition function.
Choices of GLOBAL(·). Two intuitive choices are mean pooling (Iyyer et al., 2015) and max pooling (Kalchbrenner et al., 2014):

$$\mathrm{GLOBAL}_{\mathrm{MEAN}} = \mathrm{MEAN}(H^L), \quad (4)$$

$$\mathrm{GLOBAL}_{\mathrm{MAX}} = \mathrm{MAX}(H^L). \quad (5)$$

Recently, Lin et al. (2017) proposed a self-attention mechanism to form sentence representations, which is appealing for its flexibility in extracting implicit global features. Inspired by this, we propose an attentive mechanism to learn the sentence representation:

$$\mathrm{GLOBAL}_{\mathrm{ATT}} = \mathrm{ATT}(g^0, H^L), \qquad g^0 = \mathrm{MAX}(H^0),$$

where $H^0$ is the word embedding layer, and its max pooling vector $g^0$ serves as the query to extract features to form the final sentential context representation.
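The three choices of GLOBAL(·) discussed above can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: the attention is assumed to be plain dot-product attention with a softmax, and the helper names (`mean_pool`, `max_pool`, `attentive_pool`) and toy matrices are hypothetical.

```python
import math


def mean_pool(H):
    """Average the token vectors (rows of H) dimension by dimension."""
    return [sum(col) / len(H) for col in zip(*H)]


def max_pool(H):
    """Take the maximum over tokens in each dimension."""
    return [max(col) for col in zip(*H)]


def attentive_pool(query, H):
    """Weighted sum of the rows of H, with softmax dot-product weights (an assumption)."""
    scores = [sum(q * h for q, h in zip(query, row)) for row in H]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * row[k] for w, row in zip(weights, H)) for k in range(len(H[0]))]


# Toy sentence with three tokens and two hidden dimensions.
H0 = [[0.1, 0.4], [0.3, 0.2], [0.0, 0.5]]  # word embedding layer
HL = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # top encoder layer

g_mean = mean_pool(HL)
g_max = max_pool(HL)
g0 = max_pool(H0)               # query built from the embeddings
g_att = attentive_pool(g0, HL)  # attentive sentential context
```

Unlike the two pooling operators, the attentive variant lets the query decide how much each token contributes, so tokens that align with the embedding-level summary receive larger weights.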

Deep Sentential Context
Deep sentential context is a function of all encoder layer outputs $\{H^1, \dots, H^L\}$:

$$g = \mathrm{DEEP}(g^1, \dots, g^L),$$

where $g^l$ is the sentence representation of the $l$-th layer $H^l$, which is calculated by Equation 3. The motivation for this mechanism is that recent studies reveal that different encoder layers capture linguistic properties of the input sentence at different levels (Peters et al., 2018), and aggregating layers to better fuse semantic information has proven to be of profound value (Shen et al., 2018; Dou et al., 2018; Wang et al., 2018; Dou et al., 2019). In this work, we propose to fuse the global information across layers.

Choices of DEEP(•)
In this work, we investigate two representative functions to aggregate information across layers, which differ in whether the decoding information is taken into account.
RNN. Intuitively, we can treat $G = \{g^1, \dots, g^L\}$ as a sequence of representations and read them recurrently with an RNN:

$$r^l = \mathrm{RNN}(r^{l-1}, g^l).$$

We use the last RNN state as the sentence representation: $g = r^L$. As seen, the RNN-based aggregation repeatedly revises the sentence representation with each recurrent step. As a side benefit, the added recurrent inductive bias of RNNs has proven beneficial for many sequence-to-sequence learning tasks such as machine translation (Dehghani et al., 2018).
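The RNN-based aggregation described above can be sketched as a recurrence over the layer-wise sentence vectors, keeping only the final state. The text does not specify the recurrent cell, so a simple tanh (Elman-style) cell with a zero initial state is assumed here; `rnn_aggregate`, `W`, and `U` are hypothetical names.

```python
import math


def rnn_aggregate(G, W, U):
    """Read layer-wise sentence vectors G bottom-up and return the last RNN state.

    G: list of L vectors (one sentence representation per encoder layer).
    W: d x d recurrent matrix; U: d x d input matrix (hypothetical parameters).
    """
    d = len(W)
    r = [0.0] * d  # initial state, assumed to be zeros
    for g_l in G:
        r = [
            math.tanh(
                sum(W[k][j] * r[j] for j in range(d))
                + sum(U[k][j] * g_l[j] for j in range(len(g_l)))
            )
            for k in range(d)
        ]
    return r


# Toy example: two layers, d = 2, no recurrence (W = 0) and identity input map.
G_toy = [[1.0, 0.0], [0.0, 1.0]]
W_zero = [[0.0, 0.0], [0.0, 0.0]]
U_eye = [[1.0, 0.0], [0.0, 1.0]]
g_rnn = rnn_aggregate(G_toy, W_zero, U_eye)
```

With a non-zero `W`, each step revises the running summary using the next layer's vector, which is the behaviour the paragraph above attributes to the recurrent aggregation.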
TAM. Recently, Bapna et al. (2018) proposed a novel transparent attention model (TAM) to train very deep NMT models. In this work, we apply TAM to aggregate sentence representations:

$$g_i = \sum_{j=1}^{L} \beta_{i,j}\, g^j, \qquad \beta_i = \mathrm{ATT}_g(d^l_{i-1}, G),$$

where $\mathrm{ATT}_g(\cdot)$ is an attention model with its own parameters, which specifies which context representations are relevant for each decoding output. Again, $d^l_{i-1}$ is the decoder state in the $l$-th layer. Compared with its RNN counterpart, the TAM mechanism has two appealing strengths. First, TAM dynamically generates the weights $\beta_i$ based on the decoding information $d^l_{i-1}$ at every decoding step, while the RNN is unaware of the decoder states and its associated parameters are fixed after training. Second, TAM allows the model to adjust the gradient flow to different layers in the encoder depending on its training phase.
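The decoder-dependent aggregation described above can be sketched as attention over the layer-wise sentence vectors, queried by the previous decoder state. Dot-product scoring with a softmax is an assumption made for brevity (the text only says the attention model has its own parameters), and `tam_aggregate` is a hypothetical name.

```python
import math


def tam_aggregate(decoder_state, G):
    """Return (weights, context) for one decoding step.

    decoder_state: previous decoder state, used as the attention query.
    G: list of L layer-wise sentence vectors.
    """
    scores = [sum(q * v for q, v in zip(decoder_state, g_l)) for g_l in G]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * g_l[k] for w, g_l in zip(weights, G)) for k in range(len(G[0]))
    ]
    return weights, context


# Two encoder layers, d = 2; different decoder states yield different mixtures.
G_layers = [[1.0, 0.0], [0.0, 1.0]]
w_a, ctx_a = tam_aggregate([10.0, 0.0], G_layers)  # favours the first layer
w_b, ctx_b = tam_aggregate([0.0, 0.0], G_layers)   # uniform weights
```

The contrast with a fixed aggregation is that the weights are recomputed at every decoding step, so the sentential context can lean on lower or higher encoder layers as decoding proceeds.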

Experiment
We conducted experiments on the WMT14 En⇒De and En⇒Fr benchmarks, which contain 4.5M and 35.5M sentence pairs, respectively. We reported experimental results with the case-sensitive 4-gram BLEU score. We used byte-pair encoding (BPE) (Sennrich et al., 2016) with 32K merge operations to alleviate the out-of-vocabulary problem. We implemented the proposed approaches on top of the TRANSFORMER model (Vaswani et al., 2017). We followed Vaswani et al. (2017) to set the model configurations, and reproduced their reported results. We tested both Base and Big models, which differ in the layer size (512 vs. 1024) and the number of attention heads (8 vs. 16).

Ablation Study
We first investigated the effect of components in the proposed approaches, as listed in  1) which has a similar model size as the proposed deep sentential context model. We change the filter size from 1024 to 3072 in the decoder's feed-forward network (Eq. 2). As seen, the proposed deep sentential context models also outperform the MEDIUM model by over 0.5 BLEU points.

Main Result
Experimental results on both WMT14 En⇒De and En⇒Fr translation tasks are listed in vs. 264.1M, not shown in the table). Furthermore, DEEP (TAM) consistently outperforms DEEP (RNN) in the TRANSFORMER-BIG configuration. One possible reason is that the big models benefit more from the improved gradient flow with the transparent attention (Bapna et al., 2018).

Linguistic Analysis
To gain linguistic insights into the global and deep sentence representation, we conducted probing tasks (Conneau et al., 2018) to evaluate the linguistic knowledge embedded in the encoder output and the sentence representation in the variations of the Base model that are trained on the En⇒De translation task. The probing tasks are classification problems that focus on simple linguistic properties of sentences. The 10 probing tasks are categorized into three groups: (1) surface information, (2) syntactic information, and (3) semantic information. For each task, we trained the classifier on the train set, and validated the classifier on the validation set. We followed Hao et al. (2019) and Li  (Raganato and Tiedemann, 2018).
• Integrating the shallow sentence representation ("+ SSR") obtains improvement over the baseline on semantic tasks (75.33 vs. 74.61), while it fails to improve on the surface (77.32 vs. 77.60) and syntactic tasks (64.88 vs. 65.00). This may indicate that the shallow representations, which exploit only the top encoder layer ("L6 in BASE"), encode more semantic information.
• Introducing the deep sentence representation ("+ DSR") brings more improvements. The reason is that our deep sentence representation is induced from the sentence representations of all the encoder layers, so lower layers that contain abundant surface and syntactic information are exploited.
Along with the above translation experiments, we believe that the sentential context is necessary for NMT because it enriches the source sentence representation. The deep sentential context, which is induced from all encoder layers, can improve translation performance by offering different types of syntactic and semantic information.

Related Work
Sentential context has been successfully applied in SMT (Meng et al., 2015; Zhang et al., 2015). In these works, a sentential context representation generated by CNNs is exploited to guide the target sentence generation. In broad terms, sentential context can be viewed as a sentence abstraction from a specific aspect. From this point of view, domain information (Foster and Kuhn, 2007; Hasler et al., 2014; Wang et al., 2017b) and topic information (Xiao et al., 2012; Xiong et al., 2015; Zhang et al., 2016) can also be treated as sentential context, the exploitation of which we leave for future work.
In the context of NMT, several researchers leverage document-level context for NMT (Wang et al., 2017a; Choi et al., 2017; Tu et al., 2018), while we opt for sentential context. In addition, contextual information is used to improve the encoder representations (Yang et al., 2018, 2019; Lin et al., 2018). Our approach is complementary to theirs by better exploiting the encoder representations for the subsequent decoder. Concerning guiding the NMT generation with source-side context, Zheng et al. (2018) split the source content into translated and untranslated parts, while we focus on exploiting global sentence-level context.

Conclusion
In this work, we propose to exploit sentential context for neural machine translation. Specifically, the shallow and the deep strategies exploit the top encoder layer and all the encoder layers, respectively. Experimental results on WMT14 benchmarks show that exploiting sentential context improves performance over the state-of-the-art TRANSFORMER model. Linguistic analyses reveal that the proposed approach indeed captures more linguistic information as expected.

Figure 1: Illustration of the proposed approach on a 3-layer encoder: (a) vanilla model without sentential context; (b) shallow sentential context representation (i.e., blue square) by exploiting the top encoder layer only; and (c) deep sentential context representation (i.e., brown square) by exploiting all encoder layers. The circles denote hidden states of individual tokens in the input sentence, and the squares denote the sentential context representations. The red up arrows denote that the representations are fed to the subsequent decoder. This figure is best viewed in color.

Figure 2: Illustration of the deep functions. The "TAM" model dynamically aggregates sentence representations at each decoding step with state $d_{i-1}$.

Table 3: Performance on the linguistic probing tasks of evaluating linguistics embedded in the encoder outputs. "BASE" denotes the representations from the TRANSFORMER-BASE encoder. "SSR" denotes shallow sentence representation. "DSR" denotes deep sentence representation. "AVG" denotes the average accuracy of each category.

Table 3.
From the table, we can see that