Depth Growing for Neural Machine Translation

While very deep neural networks have shown effectiveness for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem. Directly stacking more blocks onto an NMT model yields no improvement and can even reduce performance. In this work, we propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which results in significant improvements over the strong Transformer baselines on the WMT14 English→German and English→French translation tasks.

Training deep networks has always been a challenging problem, mainly due to the difficulties in optimization for deep architectures. Breakthroughs have been made in computer vision to enable deeper model construction via advanced initialization schemes (He et al., 2015), multi-stage training strategies (Simonyan and Zisserman, 2015), and novel model architectures (Srivastava et al., 2015; He et al., 2016b). While constructing very deep neural networks with tens and even more than a hundred blocks has shown effectiveness in image recognition (He et al., 2016b), question answering and text classification (Devlin et al., 2018; Radford et al., 2019), scaling up model capacity with very deep networks remains challenging for NMT. NMT models are generally constructed with up to 6 encoder and decoder blocks in both state-of-the-art research work and champion systems of machine translation competitions. For example, LSTM-based models are usually stacked for 4 (Stahlberg et al., 2018) or 6 (Chen et al., 2018) blocks, and the state-of-the-art Transformer models are equipped with a 6-block encoder and decoder (Vaswani et al., 2017; Junczys-Dowmunt, 2018; Edunov et al., 2018). Increasing the NMT model depth by directly stacking more blocks results in no improvement or a performance drop (Figure 1), and can even lead to optimization failure (Bapna et al., 2018).
Figure 1: Performances of Transformer models with different numbers of encoder/decoder blocks (recorded on the x-axis) on the WMT14 En→De translation task. † denotes the result reported in Vaswani et al. (2017).
There have been a few attempts in previous works at constructing deeper NMT models. Zhou et al. (2016) and Wang et al. (2017) propose increasing the depth of LSTM-based models by introducing linear units between internal hidden states to eliminate the problem of vanishing gradients. However, their methods are specially designed for the recurrent architecture, which has been significantly outperformed by the state-of-the-art Transformer model. Bapna et al. (2018) propose an enhancement to the attention mechanism to ease the optimization of models with deeper encoders. While gains have been reported over different model architectures including LSTM and Transformer, their improvements are not made over the best-performing baseline model configuration. How to construct and train deep NMT models to push forward the state-of-the-art translation performance with larger model capacity remains a challenging and open problem.
In this work, we explore the potential of leveraging deep neural networks for NMT and propose a new approach to construct and train deeper NMT models. As noted above, constructing deeper models is not as straightforward as directly stacking more blocks, but requires new mechanisms to boost the training and utilize the larger capacity with minimal increase in complexity. Our solution is a new two-stage training strategy, which "grows" a well-trained NMT model into a deeper network with three components specially designed to overcome the optimization difficulty and best leverage the capability of both shallow and deep architectures. Our approach can effectively construct a deeper model with significantly better performance, and is generally applicable to any model architecture.
We evaluate our approach on two large-scale benchmark datasets, WMT14 English→German and English→French translation. Empirical studies show that our approach can significantly improve translation quality with an increased model depth. Specifically, we achieve improvements of 1.0 and 0.6 BLEU over the strong Transformer baseline in English→German and English→French translation, respectively.

Approach
We introduce the details of our proposed approach in this section. The overall framework is illustrated in Figure 2. Our model consists of a bottom module with N blocks of encoder and decoder (the grey components in Figure 2), and a top module with M blocks (the blue and green components). We denote the encoder and decoder of the bottom module as enc_1 and dec_1, and the corresponding two parts of the top module as enc_2 and dec_2. An encoder-decoder attention mechanism is used in the decoder blocks of the NMT models, and here we use attn_1 and attn_2 to represent such attention in the bottom and top modules respectively.
The model is constructed via a two-stage training strategy: in Stage 1, the bottom module (i.e., enc_1 and dec_1) is trained and subsequently held constant; in Stage 2, only the top module (i.e., enc_2 and dec_2) is optimized.
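The two-stage schedule can be illustrated with a toy example. The sketch below is not the authors' implementation: the scalar "modules", the plain gradient descent, and all names are assumptions chosen only to show that Stage 2 updates the top parameters while the bottom ones receive no updates. Stage 1 is deliberately cut short here so that the top module has something left to learn.

```python
def mse(data, predict):
    """Mean squared error of `predict` over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def train_two_stage(data, stage1_steps=5, stage2_steps=200, lr=0.05):
    """Toy depth growing with scalar weights (hypothetical stand-ins).

    Bottom module: h = w1 * x, trained alone in Stage 1.
    Top module:    out = h + w2 * h (residual path), trained in Stage 2
                   while w1 is kept fixed.
    """
    # Stage 1: optimize the bottom module only.
    w1 = 0.0
    for _ in range(stage1_steps):
        grad = sum(2 * (w1 * x - y) * x for x, y in data) / len(data)
        w1 -= lr * grad
    w1_after_stage1 = w1
    loss_stage1 = mse(data, lambda x: w1 * x)

    # Stage 2: optimize the top module; w1 is read but never updated.
    w2 = 0.0  # with w2 = 0 the grown model reproduces the bottom model
    for _ in range(stage2_steps):
        grad = sum(
            2 * ((1 + w2) * w1 * x - y) * (w1 * x) for x, y in data
        ) / len(data)
        w2 -= lr * grad
    loss_stage2 = mse(data, lambda x: (1 + w2) * w1 * x)

    return {
        "w1_after_stage1": w1_after_stage1,
        "w1_final": w1,
        "w2": w2,
        "loss_stage1": loss_stage1,
        "loss_stage2": loss_stage2,
    }
```

In a real framework the same effect would typically be obtained by excluding the bottom module's parameters from the optimizer during Stage 2.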
Let x and y denote the embeddings of the source and target sequences. Let l_y denote the number of words in y, and y_{<t} denote the elements before time step t. Our proposed model works as described by Eqn. (1)-(3), and contains three key components specially designed for deeper model construction:
(1) Cross-module residual connections: As shown in Eqn. (1), the encoder enc_1 of the bottom module encodes the input x to a hidden representation h_1, then a cross-module residual connection is introduced to the top module and the representation h_2 is eventually produced.
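One form of Eqn. (1)-(3) that is consistent with the prose in this section is given below. It is reconstructed from the descriptions of the three components rather than copied from the authors' typesetting, so the decoder state symbols s_{1,t} and s_{2,t} and the exact argument layout should be read as assumptions.

```latex
\begin{align}
h_1 &= \mathrm{enc}_1(x), & h_2 &= \mathrm{enc}_2(x + h_1), \tag{1}\\
s_{1,t} &= \mathrm{dec}_1\bigl(y_{<t},\ \mathrm{attn}_1(h_1)\bigr), & & \forall t \in [l_y], \tag{2}\\
s_{2,t} &= \mathrm{dec}_2\bigl(y_{<t} + s_{1,<t},\ \mathrm{attn}_2(h_2)\bigr). \tag{3}
\end{align}
```

Under this reading, the sums x + h_1 and y_{<t} + s_{1,<t} are the cross-module residual connections, and the different arguments of attn_1 and attn_2 express the hierarchical encoder-decoder attention.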

Experiments
We evaluate our proposed approach on two large-scale benchmark datasets. We compare our approach with multiple baseline models, and analyze the effectiveness of our deep training strategy.

Experiment Design
Datasets. We conduct experiments to evaluate the effectiveness of our proposed method on two widely adopted benchmark datasets: the WMT14 English→German translation (En→De) and the WMT14 English→French translation (En→Fr). We use 4.5M parallel sentence pairs for En→De and 36M pairs for En→Fr as our training data. We use the concatenation of Newstest2012 and Newstest2013 as the validation set, and Newstest2014 as the test set. All words are segmented into sub-word units using byte pair encoding (BPE) (Sennrich et al., 2016b), forming a vocabulary shared by the source and target languages with 32k and 45k tokens for En→De and En→Fr respectively.
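For readers unfamiliar with BPE, the merge-learning step can be sketched in a few lines. This is a simplified illustration of the algorithm described by Sennrich et al. (2016b), not the tool used in these experiments; the function name, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dictionary.

    Returns the list of merged symbol pairs, in the order learned.
    """
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so that
        # the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

The number of merges controls the vocabulary size; a shared vocabulary is obtained by learning the merges on the concatenation of source and target text.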
Architecture. The basic encoder-decoder framework we use is the strong Transformer model. We adopt the big Transformer configuration following Vaswani et al. (2017), with the dimensions of word embeddings, hidden states and the non-linear layer set to 1024, 1024 and 4096 respectively. The dropout rate is 0.3 for En→De and 0.1 for En→Fr.
We set the number of encoder/decoder blocks for the bottom module as N = 6 following common practice, and set the number of additionally stacked blocks of the top module as M = 2. Our models are implemented based on the PyTorch implementation of Transformer, and the code can be found in the supplementary materials.
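The hyperparameters stated above can be collected in a small configuration object. The class and field names below are hypothetical and do not come from the authors' code; only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthGrowingConfig:
    """Hypothetical container for the settings described in the text."""
    embed_dim: int = 1024      # word embedding dimension
    hidden_dim: int = 1024     # hidden state dimension
    ffn_dim: int = 4096        # non-linear (feed-forward) layer dimension
    dropout: float = 0.3       # 0.3 for En→De, 0.1 for En→Fr
    bottom_blocks: int = 6     # N, blocks in the bottom module
    top_blocks: int = 2        # M, blocks added by the top module

    @property
    def total_blocks(self) -> int:
        """Depth of the grown model, N + M."""
        return self.bottom_blocks + self.top_blocks


EN_DE = DepthGrowingConfig(dropout=0.3)
EN_FR = DepthGrowingConfig(dropout=0.1)
```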
Training. We use the Adam (Kingma and Ba, 2015) optimizer, following the optimization settings and default learning rate schedule in Vaswani et al. (2017) for model training. All models are trained on 8 M40 GPUs.
Evaluation. We evaluate the model performances with tokenized case-sensitive BLEU score (Papineni et al., 2002) for the two translation tasks. We use beam search with a beam size of 5 and with no length penalty.
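For reference, the learning rate schedule published in Vaswani et al. (2017) increases linearly during warmup and then decays with the inverse square root of the step number. The sketch below implements that published formula; the default values d_model = 1024 and warmup_steps = 4000 are assumptions here, since the text does not state the exact values used in these experiments.

```python
def transformer_lr(step, d_model=1024, warmup_steps=4000):
    """Learning rate at a given training step (1-indexed).

    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two arguments of `min` coincide at `step == warmup_steps`, which is where the learning rate peaks.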

Overall Results
We compare our method (Ours) with the Transformer baselines of 6 blocks (6B) and 8 blocks (8B), and a 16-block Transformer with transparent attention (Transparent Attn (16B)) (Bapna et al., 2018). We also reproduce a 6-block Transformer baseline, which performs better than the result reported in Vaswani et al. (2017), and we use it to initialize the bottom module in our model.
From the results in Table 1, we see that our proposed approach enables effective training of deeper networks and achieves significantly better performance compared to the baselines. With our method, the performance of a well-optimized 6-block model can be further boosted by adding two additional blocks, while simply using Transformer (8B) leads to a performance drop. Specifically, we achieve a 29.92 BLEU score on En→De translation, a 1.0 BLEU improvement over the strong baselines, and achieve a 0.6 BLEU improvement for En→Fr. The improvements are statistically significant with p < 0.01 in paired bootstrap sampling (Koehn, 2004).
We further make an attempt to train a deeper model with M = 4 additional blocks, which has 10 blocks in total, for En→De translation. The bottom module is also initialized from our reproduced 6-block Transformer baseline. This model achieves a 30.07 BLEU score on En→De translation and surpasses the performance of our 8-block model, which further demonstrates that our approach is effective for training deeper NMT models.
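The significance test mentioned above can be sketched as follows. This is a minimal illustration of paired bootstrap resampling in the spirit of Koehn (2004), not the evaluation script used for Table 1. The function name, the default of 1000 resamples, and the use of `sum` as the corpus-level metric are assumptions; for BLEU, the per-sentence entries would be n-gram match statistics and `metric` would compute corpus BLEU from them.

```python
import random


def paired_bootstrap(stats_a, stats_b, metric=sum, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    stats_a[i] and stats_b[i] are per-sentence statistics for the same
    test sentence i; `metric` aggregates a list of them into one score.
    """
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        # Draw a test set of the same size, with replacement; the same
        # sentence indices are used for both systems (hence "paired").
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([stats_a[i] for i in idx])
        score_b = metric([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / n_samples
```

If A wins on at least 99% of the resamples, the improvement is commonly reported as significant at p < 0.01.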

Analysis
To further study the effectiveness of our proposed framework, we present additional comparisons in En→De translation with two groups of baseline approaches in Figure 3:
(1) Direct stacking (DS): we extend the 6-block baseline to 8 blocks by directly stacking 2 additional blocks. We can see that both training from scratch (DS scratch) and "growing" from a well-trained 6-block model (DS grow) fail to improve performance in spite of the larger model capacity. The comparison with this group of models shows that directly stacking more blocks is not a good strategy for increasing network depth, and demonstrates the effectiveness and necessity of our proposed mechanisms for training deep networks.
(2) Ensemble learning (Ensemble): we present two-model ensemble results for a fair comparison with our approach, which involves a two-pass deep-shallow decoding. Specifically, we present the ensemble performance of two independently trained 6-block models (Ensemble 6B/6B), and of an ensemble of one 6-block and one 8-block model independently trained from scratch (Ensemble 6B/8B). As expected, the ensemble method improves translation quality over the single-model baselines by a large margin (over 0.8 BLEU improvement). Regarding training complexity, it takes 40 GPU days (5 days on 8 GPUs) to train a single 6-block model from scratch, 48 GPU days for an 8-block model, and 8 GPU days to "grow" a 6-block model into an 8-block model with our approach. Therefore, our model is better than the two-model ensemble in terms of both translation quality (more than 0.3 BLEU improvement over the ensemble baseline) and training complexity.
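The training-cost comparison follows from simple arithmetic on the GPU-day figures quoted above. In the sketch below, the total for our approach is derived under the assumption that the cost of training the 6-block bottom module in Stage 1 is counted as well; if an already trained 6-block model is reused, only the 8 GPU days of growing apply.

```python
# GPU-day figures as stated in the text.
GPU_DAYS_6B = 40    # one 6-block model from scratch
GPU_DAYS_8B = 48    # one 8-block model from scratch
GPU_DAYS_GROW = 8   # growing a trained 6-block model into 8 blocks

# Derived totals (not reported in the text, computed here for comparison).
costs = {
    "Ensemble 6B/6B": 2 * GPU_DAYS_6B,
    "Ensemble 6B/8B": GPU_DAYS_6B + GPU_DAYS_8B,
    "Ours (Stage 1 + Stage 2)": GPU_DAYS_6B + GPU_DAYS_GROW,
}

for name, days in costs.items():
    print(f"{name}: {days} GPU days")
```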

Conclusion
In this paper, we proposed a new training strategy with three specially designed components, namely cross-module residual connections, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep NMT models. We showed that our approach can effectively construct a deeper model with significantly better performance than the state-of-the-art Transformer baseline. Although only empirical studies on the Transformer are presented in this paper, our proposed strategy is a general approach that is applicable to other model architectures, including LSTM and CNN. In future work, we will further explore efficient strategies that can jointly train all modules of the deep model with minimal increase in training complexity.

Figure 2: The overall framework of our proposed deep model architecture. N and M are the numbers of blocks in the bottom module (i.e., grey parts) and top module (i.e., blue and green parts). Parameters of the bottom module are fixed during the top module training. The dashed parts denote the original training/decoding of the bottom module. The weights of the two linear operators before softmax are shared.

Figure 3: The test performances of WMT14 En→De translation task.
The decoders work in a similar way, as shown in Eqn. (2) and (3). This enables the top module to have direct access to both the low-level input signals from the word embedding and the high-level information generated by the bottom module. Similar principles can be found in Wang et al. (2017) and Wu et al. (2018).
(2) Hierarchical encoder-decoder attention: We introduce a hierarchical encoder-decoder attention calculated with different contextual representations as shown in Eqn. (2) and (3), where h_1 is used as key and value for attn_1 in the bottom module, and h_2 for attn_2 in the top module. Hidden states from the corresponding previous decoder block are used as queries for both attn_1 and attn_2 (omitted for readability). In this way, the strong capability of the well-trained bottom module is preserved regardless of the influence of the top module, while the newly stacked top module can leverage the higher-level contextual representations. More details can be found in the source code in the supplementary materials.
(3) Deep-shallow decoding: At the decoding phase, enc_1 and dec_1 work together according to Eqn. (1) and Eqn. (2) as a shallow network net_S, while the bottom and top modules together work as a deep network net_D according to Eqn. (1)-(3). net_S and net_D generate the final translation results through reranking.
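The way the three components fit together can be sketched with toy stand-ins. The code below is an illustration only: the modules are passed in as plain callables over lists of floats, the wiring follows the prose description of the cross-module residual connections and hierarchical attention, and the reranking rule (a weighted sum of the two networks' scores) is an assumption, since the text does not specify how net_S and net_D are combined.

```python
def deep_forward(x, y_prev, enc1, enc2, dec1, dec2):
    """Wire the bottom and top modules (hypothetical stand-in callables).

    x, y_prev: source and target-prefix embeddings as lists of floats.
    dec1 and dec2 take (decoder input, encoder memory for attention).
    Returns the decoder states of the shallow and the deep network.
    """
    def add(a, b):
        return [u + v for u, v in zip(a, b)]

    h1 = enc1(x)                    # bottom encoder
    h2 = enc2(add(x, h1))           # top encoder sees x plus h1 (residual)
    s1 = dec1(y_prev, h1)           # bottom decoder attends to h1
    s2 = dec2(add(y_prev, s1), h2)  # top decoder attends to h2
    return s1, s2


def rerank(candidates, score_shallow, score_deep, weight=0.5):
    """Pick the candidate with the best interpolated score.

    score_shallow / score_deep map a candidate to a score such as a
    log-probability under net_S / net_D; `weight` is an assumed
    interpolation factor.
    """
    def combined(candidate):
        return (weight * score_shallow(candidate)
                + (1 - weight) * score_deep(candidate))

    return max(candidates, key=combined)
```

Because s1 is computed without any input from the top module, the shallow network's behaviour is unchanged by growing, which matches the statement that the bottom module's capability is preserved.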

Table 1: The test set performances on the WMT14 En→De and En→Fr translation tasks. '†' denotes performance figures reported in previous works.