Does Multi-Encoder Help? A Case Study on Context-Aware Neural Machine Translation

In encoder-decoder neural models, multiple encoders are in general used to represent the contextual information in addition to the individual sentence. In this paper, we investigate multi-encoder approaches in document-level neural machine translation (NMT). Surprisingly, we find that the context encoder does not only encode the surrounding sentences but also behaves as a noise generator. This makes us rethink the real benefits of multi-encoder in context-aware translation - some of the improvements come from robust training. We compare several methods that introduce noise and/or well-tuned dropout setup into the training of these encoders. Experimental results show that noisy training plays an important role in multi-encoder-based NMT, especially when the training data is small. Also, we establish a new state-of-the-art on IWSLT Fr-En task by careful use of noise generation and dropout methods.


Introduction
Sentence-level neural machine translation (NMT) systems ignore discourse phenomena and encode individual source sentences without using any context. In recent years, context-aware models that learn contextual information from surrounding sentences have shown promising results in generating consistent and coherent translations (Voita et al., 2018; Kim et al., 2019; Voita et al., 2019; Bawden et al., 2018; Miculicich et al., 2018; Maruf and Haffari, 2018; Maruf et al., 2019).
There are two common approaches to incorporating contexts into NMT: the simple way is to concatenate the context and the current sentence to form a context-aware input sequence (Agrawal et al., 2018; Tiedemann and Scherrer, 2017), whereas a more widely used approach utilizes additional neural networks to encode context sentences (Jean et al., 2017; Voita et al., 2018).
Here we refer to the former as the single-encoder approach and to the latter as the multi-encoder approach. However, large-scale document corpora are not easily available. Most context-aware NMT systems are evaluated on small datasets, and significant BLEU improvements are reported (Wang et al., 2017). In our experiments, we find that the improvement persists if we feed pseudo sentences into the context encoder, especially when we train the system on small-scale data. A natural question here is: how much of the improvement comes from the use of contextual information in the multi-encoder approach?
In this work, we aim to investigate what kinds of information the context-aware model captures. We re-implement several widely used context-aware architectures based on the multi-encoder paradigm, and do an in-depth analysis to study whether the context encoder captures the contextual information. By conducting extensive experiments on several document-level translation benchmarks, we observe that:
• The BLEU gaps between sentence-level and context-aware models decrease when the sentence-level baselines are carefully tuned, e.g., by proper use of dropout.
• The multi-encoder systems are insensitive to the context input. Even randomly sampled sentences can bring substantial improvements.
• The model trained with the correct context can achieve better performance during inference without the context input.

Our contribution is twofold: (i) We find that the benefit of the multi-encoder context-aware approach does not come from the use of contextual information. Instead, the context encoder acts more like a noise generator that provides richer training signals. (ii) This finding inspires us to develop a simple yet effective training strategy: we add Gaussian noise to the encoder output, which effectively alleviates overfitting, especially on small datasets.

Approaches to Incorporating Contexts into NMT
Here we describe two ways of introducing contextual information into NMT systems.

The Single-Encoder Approach
The input of the single-encoder system is the concatenation of the context sentences and the current sentence, with a special symbol inserted to distinguish them (Tiedemann and Scherrer, 2017;Agrawal et al., 2018). Then the extended sentence is fed into the standard Transformer. These systems may face the challenge of encoding extremely long inputs, resulting in inefficient computation.
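The construction of the single-encoder input can be sketched as follows. This is only an illustration under stated assumptions: the separator token `<sep>` and the function name `build_single_encoder_input` are hypothetical and do not come from the cited systems, which may use a different symbol or ordering.

```python
from typing import List

SEP = "<sep>"  # hypothetical separator symbol; the actual symbol is system-specific


def build_single_encoder_input(
    context: List[List[str]], current: List[str], sep: str = SEP
) -> List[str]:
    """Concatenate context sentences and the current sentence into one sequence.

    Each context sentence is followed by the separator so that the model can
    tell context tokens apart from the tokens of the sentence to translate.
    """
    tokens: List[str] = []
    for sentence in context:
        tokens.extend(sentence)
        tokens.append(sep)
    tokens.extend(current)
    return tokens


extended = build_single_encoder_input([["how", "are", "you"]], ["fine", "thanks"])
```

The extended sequence grows with every added context sentence, which is the source of the long-input cost mentioned above.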

The Multi-Encoder Approach
The multi-encoder models take the surrounding sentences as the context and employ an additional neural network to encode the context, that is, we have a source-sentence encoder and a context encoder. Figure 1 shows two methods of integrating the context into NMT in the multi-encoder paradigm. Next we show that most of the multi-encoder approaches (Voita et al., 2018) are instances of the models described below.
• Outside integration. As shown in Figure 1(a), the representations of the context and the current sentence are first transformed into a new representation by an attention network. Then the attention output and the source sentence representation are fused by a gated sum.
• Inside integration. Alternatively, the decoder can attend to the two encoders separately (Figure 1(b)). Then, the gating mechanism inside the decoder is employed to obtain the fusion vector.

segment words into sub-word units. The Chinese sentences were word segmented by the tool provided within NiuTrans (Xiao et al., 2012). For the Fr-En and Zh-En tasks, we lowercased all sentences to obtain results comparable with previous work. We also conducted experiments on a larger English-Russian (En-Ru) dataset provided by Voita et al. (2018), consisting of 2M sentence pairs selected from the publicly available OpenSubtitles2018 corpus. The data statistics of each language pair can be seen in Table 1. We chose the Transformer-base model as the sentence-level baseline. The context encoder used the same setting as the sentence-level baseline.
We used Adam (Kingma and Ba, 2014) for optimization, and trained the systems on a single Titan V GPU. The learning rate strategy was the same as that used in Vaswani et al. (2017). Our implementation was based on Fairseq (Ott et al., 2019). More details can be found in our repository.

Results and Discussion
To study whether the context-encoder network captures contextual information in training, we present three types of context as the input of the context-encoder:
• Context: the previous sentence of the current sentence.
• Random: a sentence consisting of words randomly sampled from the source vocabulary.
• Fixed: a fixed sentence input for the context-encoder.
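The three context types above can be sketched as a small selection function. This is a hedged illustration, not the authors' implementation: the function name `make_context`, the `random_len` parameter (the length of the sampled sentence), and the choice to return an empty context for the first sentence of a document are all assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence


def make_context(
    mode: str,
    document: Sequence[List[str]],
    index: int,
    vocab: Sequence[str],
    fixed_sentence: List[str],
    random_len: int = 10,  # assumed length of the pseudo sentence
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the context-encoder input for sentence `index` of `document`."""
    if mode == "context":
        # Previous sentence; empty for the first sentence (an assumption).
        return list(document[index - 1]) if index > 0 else []
    if mode == "random":
        # Words sampled uniformly from the source vocabulary.
        rng = rng or random.Random()
        return [rng.choice(vocab) for _ in range(random_len)]
    if mode == "fixed":
        # The same sentence for every training example.
        return list(fixed_sentence)
    raise ValueError(f"unknown context mode: {mode}")
```

Only the "context" mode depends on the document; the other two carry no information about the surrounding sentences, which is what makes them useful as controls.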

Baseline Selection
Weight sharing (Voita et al., 2018) and two-stage training strategies have been proven essential to build strong context-aware systems. The former shares the first N-1 blocks of the context encoder with the source encoder, and the latter first trains a standard sentence-level Transformer and then fine-tunes the document-level Transformer with an extra context-encoder. We first evaluated the importance of the two training strategies for multi-encoder systems. We selected the multi-encoder with Outside integration (see Section 2) as the context-aware model and trained systems with each of the two training strategies on the En-De task. As shown in Table 2, both strategies outperform the sentence-level baseline by a large margin. The model with two-stage training performs slightly better than the weight-sharing system in terms of BLEU. To our surprise, the context-encoder with a single layer can compete with a six-layer model. We suspect that this is because the training data is limited and we do not need a sophisticated model to fit it. Therefore, we choose two-stage training and a single-layer context-encoder for all experiments in the remainder of this paper.

Table 3 shows the results of several context-aware models on different datasets. We see, first of all, that all multi-encoder models, including both Inside and Outside approaches, outperform the sentence-level baselines by a large margin on the Zh-En and En-De datasets with a small dropout value p. Also, there are modest BLEU improvements on the Fr-En and En-Ru tasks. When the models are regularized by a larger dropout, all systems obtain substantial improvements, but the gaps between sentence-level and multi-encoder systems decrease significantly. We reason that if the context-aware systems relied on the contextual information from the preceding sentence, the performance of Random and Fixed should decrease dramatically due to the incorrect context.
Surprisingly, both Random and Fixed systems achieve comparable performance or even higher BLEU scores than Context in most cases (see Table 3). A possible explanation is that the context encoder does not only model the context. Instead, it acts more like a noise generator to provide additional supervised signals to train the sentence-level model.

Robust Training
To verify the assumption of robust training, we followed Srivastava et al. (2014) and Berger et al. (1996). We turned off the context-encoder during inference, so that the inference system performs as the sentence-level baseline. Table 4 shows that both Context and Random, when run without the context-encoder at inference, obtain modest BLEU improvements. This confirms that the information extracted by the context-encoder plays a role similar to introducing randomness into training (e.g., dropout), which is a popular method used in robust statistics. We argue that all three types of context provide noise signals that disturb the distribution of the sentence-level encoder output. The BLEU improvements of both Outside and Inside are mainly due to the richer noise signals, which can effectively alleviate overfitting.

Inspired by the Outside integration manner, we designed a simple yet effective method to regularize the training process: Gaussian noise is added to the encoder output instead of the embedding (Cheng et al., 2018). We sample a vector $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ from a Gaussian distribution with variance $\sigma^2$, where $\sigma$ is a hyper-parameter. As seen in Table 5, the systems with Gaussian noise significantly outperform the sentence-level baselines, and are slightly better than the Outside-Context counterpart. Moreover, a natural question is whether further improvement can be achieved by combining Context with the Gaussian-noise method. From the last line in Table 5, we observe no further improvement at all. This observation again supports the assumption that the context-encoder plays a role similar to that of a noise generator.
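The Gaussian-noise regularizer described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, with the encoder output represented as a list of hidden vectors; a real system would apply the same operation to a tensor. The function name `add_gaussian_noise` and the `training` flag (noise is applied during training only, following the description of the method as a training-time regularizer) are assumptions of this sketch.

```python
import random
from typing import List, Optional

Matrix = List[List[float]]  # one hidden vector per source position


def add_gaussian_noise(
    encoder_out: Matrix,
    sigma: float,
    training: bool = True,
    rng: Optional[random.Random] = None,
) -> Matrix:
    """Return encoder_out + eps, with each element of eps drawn from N(0, sigma^2).

    At inference time (training=False) the output is returned unchanged,
    so the system behaves as a plain sentence-level model.
    """
    if not training or sigma == 0.0:
        return [row[:] for row in encoder_out]
    rng = rng or random.Random()
    # Independent noise per element corresponds to the covariance sigma^2 * I.
    return [[h + rng.gauss(0.0, sigma) for h in row] for row in encoder_out]


hidden = [[0.0] * 4 for _ in range(3)]  # toy encoder output: 3 positions, 4 dims
noisy = add_gaussian_noise(hidden, sigma=0.3, rng=random.Random(0))
```

Because the noise is added after the encoder rather than to the embeddings, the decoder sees a perturbed source representation at every step, which mirrors the effect attributed to the context encoder.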

Large Scale Training
Most previous results are reported on small training datasets. Here we examine the effects of the noise-based method on datasets of different sizes. We trained the Inside-Random model and the Gaussian-noise model on datasets consisting of 500K to 5M sentence pairs. As seen in Figure 2, the baseline model achieves better translation performance as the data size increases. More interestingly, Inside-Random and Gaussian-noise perform slightly better than the baseline, and the gaps gradually shrink as the data volume grows. This is reasonable, since models trained on large-scale data may suffer less from overfitting.

Related Work
Context-aware NMT systems that incorporate contextual information generate more consistent and coherent translations than sentence-level NMT systems. Most of the current context-aware NMT models can be classified into two main categories: single-encoder systems (Tiedemann and Scherrer, 2017) and multi-encoder systems (Jean et al., 2017; Voita et al., 2018). Voita et al. (2018) integrated an additional encoder to incorporate contextual information into Transformer-based NMT systems. Miculicich et al. (2018) employed a hierarchical attention network to model the contextual information. Maruf and Haffari (2018) built a context-aware NMT system using a memory network, and Maruf et al. (2019) encoded the whole document with a selective attention network. However, most of the work mentioned above utilized more complex modules to capture the contextual information, and these can be approximately regarded as multi-encoder systems. For a fair evaluation of context-aware NMT methods, we argue that one should build a strong enough sentence-level baseline with carefully tuned regularization, especially on small datasets (Kim et al., 2019; Sennrich and Zhang, 2019). Beyond this, Bawden et al. (2018) and Voita et al. (2019) acknowledged that the BLEU score is insufficient to evaluate context-aware models, and they emphasized that multi-encoder architectures alone had a limited capacity to exploit discourse-level context. In this work, we take a further step to explore the main cause, showing that the context-encoder acts more like a noise generator, and that the BLEU improvements mainly come from robust training instead of the use of contextual information. Additionally, Cheng et al. (2018) added Gaussian noise to the word embeddings to simulate lexical-level perturbations for more robust training. In contrast, we add the Gaussian noise to the encoder output, where it plays a role similar to that of the context-encoder and provides additional training signals.

Conclusions
We have shown that, in multi-encoder context-aware NMT, the BLEU improvement is not attributable to the use of contextual information. Even when we feed incorrect context into training, the NMT system can still obtain substantial BLEU improvements on several small datasets. Another observation is that the NMT models can achieve even better translation quality when the context encoder is turned off at inference. This leads to the interesting finding that the context-encoder acts more like a noise generator, which provides rich supervised training signals for robust training. Motivated by this, we significantly improve the sentence-level systems with Gaussian noise imposed on the encoder output. Experiments on large-scale training data demonstrate the effectiveness of this method.