Layer-Wise Multi-View Learning for Neural Machine Translation

Traditional neural machine translation is limited to the topmost encoder layer's context representation and cannot directly perceive the lower encoder layers. Existing solutions usually rely on adjustments to the network architecture, making the computation more complicated or introducing additional structural restrictions. In this work, we propose layer-wise multi-view learning to solve this problem, circumventing the need to change the model structure. We regard each encoder layer's off-the-shelf output, a by-product of layer-by-layer encoding, as a redundant view of the input sentence. In this way, in addition to the topmost encoder layer (referred to as the primary view), we also incorporate an intermediate encoder layer as the auxiliary view. We feed the two views to a partially shared decoder to maintain independent predictions. Consistency regularization based on KL divergence is used to encourage the two views to learn from each other. Extensive experimental results on five translation tasks show that our approach yields stable improvements over multiple strong baselines. As another bonus, our method is agnostic to network architectures and maintains the same inference speed as the original model.


Introduction
Neural Machine Translation (NMT) adopts the encoder-decoder paradigm to model the entire translation process (Bahdanau et al., 2015). Specifically, the encoder computes a multi-layer representation of the source sentence, and the decoder queries the topmost encoding representation to produce the target sentence through a cross-attention mechanism (Wu et al., 2016; Vaswani et al., 2017). However, such over-reliance on the topmost encoder layer is problematic in two respects: (1) it is prone to over-fitting, especially when the encoder is under-trained, such as in low-resource tasks; (2) it cannot make full use of the representations extracted by lower encoder layers, which are syntactically and semantically complementary to the higher layers (Peters et al., 2018; Raganato and Tiedemann, 2018).
Researchers have proposed many methods that make the model aware of encoder layers other than the topmost one. Almost all of them resort to adjusting the network structure, and they fall into two categories. The first merges the feature representations extracted by distinct encoder layers before feeding them to the decoder (Dou et al., 2018; Wang et al., 2019b). The differences among them lie in the design of the merge function: self-attention, a recurrent neural network (Wang et al., 2019b), or a tree-like hierarchical merge (Dou et al., 2018). The second makes each decoder layer explicitly attend to a parallel encoder layer (He et al., 2018) or to all encoder layers (Bapna et al., 2018). However, these methods either complicate the original model (Dou et al., 2018; Wang et al., 2019b; Bapna et al., 2018) or limit its flexibility, e.g., by requiring the number of encoder layers to equal the number of decoder layers (He et al., 2018).
Instead, in this work we propose layer-wise multi-view learning, which addresses this problem from the perspective of model training, without changing the model structure. A highlight of our method is that it only modifies the training process, so the inference speed is guaranteed to be the same as that of the standard model. The core idea is to regard the off-the-shelf output of each encoder layer as a view of the input sentence; constructing multiple views is therefore straightforward and cheap during standard layer-by-layer encoding. In addition to the output of the topmost encoder layer used in standard models (referred to as the primary view), we incorporate an intermediate encoder layer as the auxiliary view. We feed the two views to a partially shared decoder for independent predictions, and an additional regularization loss based on prediction consistency between views encourages the auxiliary view to mimic the primary view. Thanks to co-training on the two views, gradients during back-propagation flow into both views simultaneously, which implicitly realizes knowledge transfer.
Extensive experimental results on five translation tasks (Ko→En, IWSLT'14 De→En, WMT'17 Tr→En, WMT'16 Ro→En, and WMT'16 En→De) show that our method stably outperforms multiple baseline models (Vaswani et al., 2017; Dou et al., 2018; Bapna et al., 2018). In particular, we achieve new state-of-the-art results of 10.8 BLEU on Ko→En and 36.23 BLEU on IWSLT'14 De→En. Further analysis shows that our method's success lies in its robustness to encoding representations and in the dark knowledge (Hinton et al., 2015) provided by consistency regularization.

Approach
In this section, we take the Transformer model (Vaswani et al., 2017) as an example to show how to train a model with our multi-view learning. We first briefly introduce the Transformer in § 2.1, then describe the proposed multi-view Transformer (called MV-Transformer) and its training and inference in detail in § 2.2. Finally, we discuss why our method works in § 2.3. See Figure 1 for an overview of the proposed approach.

Transformer
The Transformer follows the encoder-decoder paradigm. On the encoder side, there are M identical stacked layers, each comprising a self-attention network (SAN) sub-layer and a feed-forward network (FFN) sub-layer. To ease optimization, layer normalization (LN) (Ba et al., 2016) and residual connections (He et al., 2016) are used around these sub-layers. There are two ways to combine them, namely the PreNorm Transformer and the PostNorm Transformer (Wang et al., 2019a). Without loss of generality, here we only describe the PreNorm Transformer, but we test our method in both cases. The l-th encoder layer of the PreNorm Transformer is:

$$\dot{H}^{(l)} = H^{(l-1)} + \mathrm{SAN}\big(\mathrm{LN}(H^{(l-1)})\big), \qquad H^{(l)} = \dot{H}^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(\dot{H}^{(l)})\big), \qquad (1)$$

where $\dot{H}^{(l)}$ denotes the intermediate encoding state after the first sub-layer. Besides, an extra layer normalization is applied behind the topmost layer to prevent excessive accumulation of the unnormalized outputs of each layer, i.e., $H^* = \mathrm{LN}(H^{(M)})$, where $H^*$ denotes the final encoding result. Likewise, the decoder stacks another N identical layers, but an additional cross-attention network (CAN) sub-layer is inserted between SAN and FFN compared to the encoder layer:

$$\ddot{Z}^{(l)} = Z^{(l-1)} + \mathrm{SAN}\big(\mathrm{LN}(Z^{(l-1)})\big), \quad \dot{Z}^{(l)} = \ddot{Z}^{(l)} + \mathrm{CAN}\big(\mathrm{LN}(\ddot{Z}^{(l)}), H^*\big), \quad Z^{(l)} = \dot{Z}^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(\dot{Z}^{(l)})\big), \qquad (2)$$

where CAN(·) is similar to SAN(·) except that its key and value come from the encoding output $H^*$ instead of the query itself. Analogous to $H^*$, the final feature extracted by the decoder is $Z^* = \mathrm{LN}(Z^{(N)})$. Thus, given a sentence pair ⟨x, y⟩, where x = (x₁, …, x_m) and y = (y₁, …, y_n), we train the model parameters θ by minimizing the negative log-likelihood:

$$\mathcal{L}_{nll} = -\sum_{j=1}^{n} \log p_\theta(y_j \mid x, y_{<j}), \qquad (3)$$

where $p_\theta(y_j \mid x, y_{<j}) = \mathrm{Softmax}(W_o Z^*_j + b_o)$, and $W_o$ and $b_o$ are the parameters of the output layer.
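The PreNorm residual pattern above can be made concrete with a minimal NumPy sketch. This is a toy, single-head illustration of Eq. 1 (no multi-head attention, dropout, or masking), with hypothetical parameter names; it is not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector to zero mean, unit scale
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention (toy version)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def ffn(x, W1, W2):
    # position-wise feed-forward network with ReLU
    return np.maximum(0.0, x @ W1) @ W2

def prenorm_encoder_layer(H, params):
    # Eq. 1: H_dot = H + SAN(LN(H));  H_next = H_dot + FFN(LN(H_dot))
    H_dot = H + self_attention(layer_norm(H), *params["san"])
    return H_dot + ffn(layer_norm(H_dot), *params["ffn"])

rng = np.random.default_rng(0)
d, m = 8, 5  # hidden size, source length (toy values)
params = {"san": [0.1 * rng.normal(size=(d, d)) for _ in range(3)],
          "ffn": [0.1 * rng.normal(size=(d, 2 * d)),
                  0.1 * rng.normal(size=(2 * d, d))]}
H = rng.normal(size=(m, d))
H_next = prenorm_encoder_layer(H, params)
H_star = layer_norm(H_next)  # the extra LN behind the topmost layer
```

The final `layer_norm` call mirrors the extra normalization $H^* = \mathrm{LN}(H^{(M)})$ that keeps unnormalized residual outputs from accumulating.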

Multi-View Transformer
Multi-view. Multi-view learning has achieved great success in conventional machine learning by exploiting redundant views of the same input data (Xu et al., 2013), where one of the keys is view construction. In our scenario, a view is a hidden representation of the input sentence (an array of hidden vectors, one per token, e.g., H*). In this work, we propose to take the off-the-shelf output of each encoder layer (i.e., H^(l)) to construct the redundant views. In NLP, previous implementations of view construction generally require the model to recompute over a reconstructed input, such as using different orders of n-grams in the bag-of-words model (Matsubara et al., 2005) or randomly masking the input tokens. In contrast, our views are very cheap, as they are by-products of the standard layer-by-layer encoding process. According to the definition of a view, the vanilla Transformer is a single-view model, since only the topmost encoder layer (also called the primary view) is fed to the decoder. MV-Transformer additionally takes an intermediate layer M_a (1 ≤ M_a < M) as the auxiliary view. The choice of M_a can be arbitrary, and we discuss its effect in § 4.2. Our goal is to learn a better single model with the help of the auxiliary view.
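The "views as free by-products" idea can be sketched as follows: encode once, keep every layer's output, and read off the primary and auxiliary views. The function name, the stand-in "layers", and the shared final layer norm are illustrative assumptions, not the paper's code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encode_with_views(H0, layers, M_a):
    """Run layer-by-layer encoding once; every H^(l) is a candidate view.
    `layers` is a list of callables standing in for encoder layers."""
    H = H0
    outputs = []  # off-the-shelf per-layer outputs, collected for free
    for layer in layers:
        H = layer(H)
        outputs.append(H)
    primary = layer_norm(outputs[-1])         # H*   = LN(H^(M))
    auxiliary = layer_norm(outputs[M_a - 1])  # H*_a = LN(H^(M_a)); in practice its own LN
    return primary, auxiliary

rng = np.random.default_rng(1)
d, m, M = 8, 5, 6
# stand-in "layers": near-identity random linear maps (hypothetical, shapes only)
mats = [np.eye(d) + 0.05 * rng.normal(size=(d, d)) for _ in range(M)]
layers = [(lambda W: (lambda H: H @ W))(W) for W in mats]
H0 = rng.normal(size=(m, d))
pri, aux = encode_with_views(H0, layers, M_a=3)
```

Note that, unlike n-gram or masking-based view construction, no second encoding pass is needed: both views fall out of the same forward loop.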
Partially shared parameters. In the encoder, except for the final layer normalization, all parameters are shared between the two views, so the corresponding view representations are obtained by encoding only once. The situation is different for the decoder: we empirically find that a fully shared decoder lacks the capacity to accommodate two different views simultaneously, especially on medium or large translation tasks (see § 4.4). On the other hand, the difference between the two views only directly affects the CANs in the decoder and has nothing to do with the other sub-layers (i.e., the SANs and FFNs), so using a separate decoder for each view would waste an enormous number of decoder parameters. As a trade-off, we extend the decoder by using independent CANs for each view while sharing all SANs and FFNs (see Figure 1).
Two-stream decoder. Given the two views, we use a two-stream decoder during training. Like Eq. 2, the auxiliary view is queried as:

$$\dot{Z}^{(l)}_a = \ddot{Z}^{(l)} + \mathrm{CAN}_a\big(\mathrm{LN}_a(\ddot{Z}^{(l)}), H^*_a\big), \qquad (4)$$

where the subscript a indicates the auxiliary view. In each decoding step, one stream queries $\dot{Z}^{(l)}$ from the primary view $H^*$ like a standard Transformer, while the other queries $\dot{Z}^{(l)}_a$ from the auxiliary view $H^*_a$ through the separate CAN sub-layers. In this way, each stream yields distinct predictions based on the different context semantics in the views. We use p_pri(·) and p_aux(·) to denote the prediction distributions of the primary view and the auxiliary view, respectively.
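A minimal NumPy sketch of one partially shared decoder layer may clarify the parameter layout: the SAN and FFN weights are shared across the two streams, while each view gets its own CAN weights. The causal mask and multi-head structure are omitted, and all names are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    # scaled dot-product attention; for CAN, kv_in is an encoder view
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def two_stream_decoder_layer(Z, views, shared, cans):
    """One partially shared decoder layer (toy; causal mask omitted).
    shared: SAN/FFN parameters used by both streams.
    cans:   one CAN parameter set per view."""
    Z_dd = Z + attention(layer_norm(Z), layer_norm(Z), *shared["san"])  # shared SAN
    streams = []
    for H_view, can in zip(views, cans):                 # one stream per view
        Z_dot = Z_dd + attention(layer_norm(Z_dd), H_view, *can)  # separate CAN
        W1, W2 = shared["ffn"]
        streams.append(Z_dot + np.maximum(0.0, layer_norm(Z_dot) @ W1) @ W2)  # shared FFN
    return streams

rng = np.random.default_rng(2)
d, m, n = 8, 5, 4  # hidden size, source length, target length
shared = {"san": [0.1 * rng.normal(size=(d, d)) for _ in range(3)],
          "ffn": [0.1 * rng.normal(size=(d, 2 * d)),
                  0.1 * rng.normal(size=(2 * d, d))]}
cans = [[0.1 * rng.normal(size=(d, d)) for _ in range(3)] for _ in range(2)]
H_pri, H_aux = rng.normal(size=(m, d)), rng.normal(size=(m, d))
Z = rng.normal(size=(n, d))
Z_pri, Z_aux = two_stream_decoder_layer(Z, [H_pri, H_aux], shared, cans)
```

Because only the CAN weights differ, the extra parameter cost of the second stream is a small fraction of a full second decoder.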
Training. To jointly train the two views and transfer knowledge between them, the training objective of MV-Transformer consists of two terms. The first term $\tilde{\mathcal{L}}_{nll}$ is similar to the negative log-likelihood in Eq. 3, but additionally considers the log-likelihood of the auxiliary view's prediction:

$$\tilde{\mathcal{L}}_{nll} = \frac{1}{2}\left(\mathcal{L}^{pri}_{nll} + \mathcal{L}^{aux}_{nll}\right), \qquad (5)$$

where $\mathcal{L}^{pri}_{nll}$ and $\mathcal{L}^{aux}_{nll}$ are based on the distributions p_pri(·) and p_aux(·) respectively, and the factor 1/2 numerically scales $\tilde{\mathcal{L}}_{nll}$ to $\mathcal{L}_{nll}$. The second term $\tilde{\mathcal{L}}_{cr}$ is the consistency regularization loss between views, where we use the Kullback-Leibler (KL) divergence to let the student (played by the auxiliary view) imitate the prediction of the teacher (played by the primary view):

$$\tilde{\mathcal{L}}_{cr} = \sum_{j=1}^{n} \mathrm{KL}\left(p^{(j)}_{pri} \,\big\|\, p^{(j)}_{aux}\right) \qquad (6)$$

$$= \sum_{j=1}^{n} \sum_{v} p^{(j)}_{pri}(v) \log \frac{p^{(j)}_{pri}(v)}{p^{(j)}_{aux}(v)}, \qquad (7)$$

where $p^{(j)}(v)$ is the probability of generating token v at step j. We note that our consistency regularization differs from traditional knowledge distillation, where a typical implementation detaches the teacher's prediction $p^{(j)}_{pri}$ as a constant (Hinton et al., 2015). On the contrary, our method treats $p^{(j)}_{pri}$ as a variable that requires gradients during back-propagation. In this way, the entire model is optimized to give good predictions in both views instead of only one, which implicitly makes the model learn from different encoder layers. One might argue that it is enough for the student to learn from the teacher and that the reverse is unreasonable. However, we believe the information in different views is complementary, so the potential of mutual learning between views may be greater than that of one-way learning; our empirical comparison in § 3.3 confirms this assumption. Finally, we interpolate the two losses with a hyper-parameter α to obtain the overall loss for multi-view learning:

$$\tilde{\mathcal{L}} = (1 - \alpha)\,\tilde{\mathcal{L}}_{nll} + \alpha\,\tilde{\mathcal{L}}_{cr}. \qquad (8)$$

Intuitively, when α is low, the loss degrades into Eq. 5, which only focuses on the ground-truth labels. On the contrary, a high α overemphasizes the consistency over the entire vocabulary between the two views and neglects the provided ground-truth. We discuss α's effect in § 4.2.
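The training objective can be sketched in a few lines of PyTorch. This is a minimal single-step illustration with hypothetical tensor names; crucially, the teacher distribution is not detached, so the consistency term back-propagates into both views.

```python
import torch
import torch.nn.functional as F

def multiview_loss(logits_pri, logits_aux, gold, alpha=0.4):
    """Sketch of the overall objective: (1 - alpha) * joint NLL + alpha * KL.
    logits_*: (batch, vocab); gold: (batch,) token ids."""
    # joint NLL over the two views, averaged (cf. the 1/2 scaling)
    nll = 0.5 * (F.cross_entropy(logits_pri, gold)
                 + F.cross_entropy(logits_aux, gold))
    # KL(p_pri || p_aux): the teacher p_pri is NOT detached, so gradients
    # flow into both views (mutual learning, not one-way distillation)
    cr = F.kl_div(F.log_softmax(logits_aux, dim=-1),
                  F.softmax(logits_pri, dim=-1),
                  reduction="batchmean")
    # interpolate the two terms with alpha
    return (1 - alpha) * nll + alpha * cr

torch.manual_seed(0)
logits_pri = torch.randn(4, 10, requires_grad=True)  # primary-view logits
logits_aux = torch.randn(4, 10, requires_grad=True)  # auxiliary-view logits
gold = torch.tensor([1, 3, 5, 7])
loss = multiview_loss(logits_pri, logits_aux, gold, alpha=0.4)
loss.backward()
```

Note that `F.kl_div` takes the student's log-probabilities as input and the teacher's probabilities as target, matching the KL(teacher ‖ student) direction.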
Inference. Instead of maintaining both views as in training, we can switch to either single view at inference time. Taking the primary view as an example, we simply discard all modules attached to the auxiliary view, including CAN_a and LN_a in the decoder as well as the newly added layer normalization in the encoder. This makes the decoding speed precisely the same as that of the standard model. Alternatively, we can switch to the auxiliary view, which involves fewer encoder layers, for a slightly faster speed at the risk of some performance degradation.

Discussion
In this section, we discuss why our method works from two aspects: robustness to encoding representation and dark knowledge. See § 4.1 for more experimental analysis.
Robustness to encoding representation. Over-reliance on the top encoder layer (the primary view) makes the model prone to over-fitting. Our method reduces the sensitivity to the primary view by also feeding an auxiliary view. Figure 2 shows that the vector similarity between the i-th encoder layer and the topmost layer grows as i increases, so we can regard the middle layer's auxiliary view as a noisy version of the primary view. Training with noise has been widely shown to improve a model's generalization ability, e.g., dropout (Srivastava et al., 2014) and adversarial training (Miyato et al., 2017; Cheng et al., 2019). We also experimentally confirm that our model is more robust than the single-view model when random noise is injected into the encoding representation.
Dark knowledge. Typically, the prediction target in L_nll is a one-hot distribution: only the gold label is 1, while all others are 0. A better alternative is label smoothing (Szegedy et al., 2016), which reduces the probability of the gold label by ε and redistributes it evenly over all non-gold labels. However, label smoothing ignores the relationships among non-gold labels. For example, if the current ground-truth is "improve", then "promote" should have a higher probability than "eat". In contrast, in our method the target for the auxiliary view is the primary view's prediction, which contains more information about the non-gold labels, also known as dark knowledge (Hinton et al., 2015).
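The contrast between label smoothing and dark knowledge can be seen on a toy vocabulary. The vocabulary indices and logit values below are made up for illustration; the point is that smoothing treats all non-gold labels identically, while a teacher distribution ranks them.

```python
import numpy as np

V, gold = 5, 2                       # toy vocab size; index 2 = "improve" (hypothetical)
one_hot = np.eye(V)[gold]

# label smoothing as described: take eps off the gold label and
# spread it evenly over the V - 1 non-gold labels
eps = 0.1
smoothed = np.where(one_hot == 1.0, 1.0 - eps, eps / (V - 1))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# "dark knowledge": a teacher's softmax output ranks non-gold labels,
# e.g. a related word ("promote", index 3) outscores an unrelated one
# ("eat", index 4) -- logits here are invented for the example
logits = np.array([0.1, 0.2, 3.0, 1.5, -1.0])
teacher = softmax(logits)
```

Under smoothing every non-gold label receives the same ε/(V−1) mass, whereas the teacher distribution carries the relative plausibility of each alternative.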
Models and hyperparameters. We test all models both in shallow networks based on PostNorm and in deep networks based on PreNorm. Concretely, for shallow models we use M=N=3 for the small-scale Ko-En task and M=N=6 for the other tasks. We use the Small configuration (embed=512, ffn=1024, head=4) for {Ko, De, Tr}-En and the Base configuration (embed=512, ffn=2048, head=8) for Ro-En and En-De. For deep models, we double the encoder depth of the corresponding PostNorm counterparts; e.g., a 6-layer encoder in the vanilla Transformer becomes a 12-layer encoder in the deep Transformer. For MV-Transformer, we use the 1st/3rd/6th encoder layer as the auxiliary view when the encoder depth is 3/6/12, respectively. Following Vaswani et al. (2017), we use the inverse-sqrt learning rate schedule with warm-up and label smoothing of 0.1. Some training hyperparameters differ across tasks due to the different data sizes; detailed hyperparameters are listed in Appendix A.
Decoding and evaluation. To compare with previous work, we use a beam size of 4 and average the last 5 checkpoints on De→En, while for the other tasks we use a beam size of 5 and the checkpoint with the best BLEU score on the development set. For evaluation, Ko→En uses sacrebleu, while all other datasets are evaluated with multi-bleu.perl. Only De→En is reported with case-insensitive BLEU.

[Table 1 caption: Aux./Pri. denotes the view used at inference time. ∆ denotes the BLEU improvement over the Transformer baseline when using multi-view learning at the same encoder depth. † denotes our implementation. Boldface and * represent local and global best results, respectively. All MV-Transformer results are significantly better (p<0.01) than their Transformer counterparts, measured by paired bootstrap resampling (Koehn, 2004).]

Main results
In addition to the Transformer, we also re-implement three previously proposed models that incorporate multiple encoder layers: multi-layer representation fusion (MLRF), hierarchical aggregation (HieraAgg) (Dou et al., 2018), and transparent attention (TA) (Bapna et al., 2018). Table 1 shows the results of the five translation tasks under PostNorm and PreNorm. First, our MV-Transformer outperforms all baselines across the board. Specifically, for PostNorm models, with the help of multi-view learning, both views improve over the Transformer baselines by about 0.4-1.5 BLEU points. Consistent improvements of 0.5-1.7 BLEU points are also obtained even over the stronger PreNorm baselines, which benefit from the encoder's increased depth. We achieve new state-of-the-art results of 10.8 and 36.23 BLEU on Ko→En and De→En, respectively. Note that these five tasks include both low-resource (Ko→En) and rich-resource (En→De) scenarios, which indicates that our method generalizes well across data scales.

Compare to knowledge distillation and model ensemble
MV-Transformer can be thought of as consisting of two models: a large model as the primary view, and a small model (with a shallower encoder) as the auxiliary view. Here we compare with three other methods of integrating multiple models:
• Oneway-KD. Similar to Eq. 7 but with the teacher's prediction detached, i.e., gradients of the teacher's prediction are not tracked, yielding a one-way transfer from the primary view to the auxiliary view.
• Seq-KD. Train the large model first and then translate the original training set by beam search to construct the distilled training set for the small model (Kim and Rush, 2016).
• Ensemble. Independently train the two models and combine their predictions at inference time, e.g., by arithmetic averaging.
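The difference between Oneway-KD and our mutual consistency term comes down to a single `detach` call. The sketch below (hypothetical names, toy logits) shows that detaching the teacher blocks any gradient from reaching the primary view through the consistency loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
logits_pri = torch.randn(4, 10, requires_grad=True)  # primary view (teacher)
logits_aux = torch.randn(4, 10, requires_grad=True)  # auxiliary view (student)

def consistency(detach_teacher):
    # KL(p_pri || p_aux), optionally with the teacher treated as a constant
    p_pri = F.softmax(logits_pri, dim=-1)
    if detach_teacher:
        p_pri = p_pri.detach()  # Oneway-KD: no gradient into the teacher
    return F.kl_div(F.log_softmax(logits_aux, dim=-1), p_pri,
                    reduction="batchmean")

# Oneway-KD: the consistency term sends no gradient to the primary view
consistency(detach_teacher=True).backward()
grad_detached = logits_pri.grad   # remains None

# mutual consistency (our setting): both views receive gradients
logits_aux.grad = None
consistency(detach_teacher=False).backward()
grad_mutual = logits_pri.grad     # now populated
```

With the mutual variant, the primary view is also pushed to stay predictable from the auxiliary view, which is the two-way learning compared in Table 2.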
Experiments are conducted on IWSLT'14 De→En, where the small model has a 3-layer encoder. As shown in Table 2: (1) Oneway-KD suffers much more severe degradation than MV when the primary view is detached, which indicates that mutual learning between the primary and auxiliary views is critical; (2) Seq-KD is almost useless or even hurts performance (vs. Baseline (3L)), contradicting the common belief that Seq-KD helps the small model learn a lot from the teacher; we suspect this is because our student's performance is already close to the teacher's; (3) Ensemble achieves a significant improvement over a single model, but at the cost of nearly doubled decoding time. In contrast, our approach achieves comparable or even better results than the model ensemble while maintaining the decoding speed of a single model.

Why does multi-view learning work?
Robustness to encoding noises. In § 2.3, we hypothesize that MV-Transformer is less sensitive to noisy encoding representations due to the introduction of the auxiliary view. To verify this, we add random Gaussian noise sampled from N(0, ε) to the normalized input of the last layer normalization in the encoder. As shown in Figure 3, while both models degrade with stronger noise, MV-Transformer is less sensitive, e.g., the maximum gap is 15.72/2.09 BLEU at ε=0.8/1.0 for the 6/12-layer encoder, respectively. This indicates that our model generalizes better even when the test distribution differs greatly from the training distribution. We also observe that the PreNorm-style Transformer is less sensitive than its PostNorm counterpart, e.g., at ε=1.0 the PostNorm Transformer loses 33.55 BLEU points, while the PreNorm one only loses 6.86.
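The noise-injection probe itself is simple to reproduce in outline. The NumPy sketch below (hypothetical function names; BLEU measurement replaced by a representation-drift proxy) perturbs the normalized encoder output with N(0, ε) noise of increasing strength.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def noisy_encoder_output(H_top, noise_eps, rng):
    """Perturb the normalized input of the encoder's last layer norm
    with Gaussian noise N(0, noise_eps), as in the robustness probe."""
    H_norm = layer_norm(H_top)
    return H_norm + rng.normal(0.0, noise_eps, size=H_norm.shape)

rng = np.random.default_rng(3)
H_top = rng.normal(size=(5, 8))        # toy topmost-layer states
clean = noisy_encoder_output(H_top, 0.0, rng)
# stronger noise => the representation drifts further from the clean one
drift = [np.abs(noisy_encoder_output(H_top, e, rng) - clean).mean()
         for e in (0.2, 0.6, 1.0)]
```

In the paper's experiment, the perturbed representation is fed to the decoder and the BLEU drop is measured; here `drift` merely stands in for that degradation.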
Dark knowledge. As shown in Table 3, we study the effect of dark knowledge in our multi-view learning. First, we test the cases where we use only gold knowledge (the ground-truth label, #2) or only dark knowledge (the non-ground-truth labels, #3). Obviously, without the help of dark knowledge, multi-view learning fails to boost performance. Going further along this line, we study which part of the dark knowledge is most important (#4-5). Specifically, we split all non-gold labels into two parts from high to low according to their probability: the top-ranked part [1, 100] and the remaining long tail. We can see that our approach's success lies in the top-ranked non-gold labels, not the long-tail part.

Hyperparameter sensitivity
Position of auxiliary view. In Figure 4, we plot BLEU score curves against the auxiliary view position L_a on IWSLT'14 De→En. In general, we obtain better performance when L_a is closer to L_e, but this is not always true. Intuitively, when L_a is far below L_e, the auxiliary view is numerically very different from the primary view, which is difficult for a partially shared decoder to learn. On the contrary, if L_a is too close to the top, the slight difference may not provide many learnable signals. In this work we use the middle layer as the auxiliary view, though better results could be obtained by tuning L_a more carefully (e.g., L_a=10 vs. L_a=6 in the 12-layer encoder).
Interpolation coefficient. In Figure 5, we show the curve of BLEU score against the hyperparameter α on the IWSLT'14 De→En test set. The BLEU score improves as α increases but starts to decrease when α is too large (e.g., α > 0.6); in particular, we failed to train the model at α=1.0. On the other hand, a large α reduces the gap between the two views, as expected. Although it is difficult to know the optimal α in advance, we empirically find that α ∈ [0.3, 0.5] is robust across these distinct tasks.

Transparency to network architecture
In addition to the Transformer, we also test our method on the recently proposed DynamicConv (Wu et al., 2019). The original DynamicConv is composed of a 7-layer encoder and a 6-layer decoder. As before, our method takes the topmost layer as the primary view and uses the 4-th layer as the auxiliary view. The results on the IWSLT'14 De→En task are listed in Table 4. DynamicConv with multi-view learning consistently outperforms the single-view model by about 0.5 BLEU, which indicates that our method is transparent to the network architecture and has the potential to be widely used.

Ablation study
We conducted ablation studies to understand the effects of (a) separate CANs and (b) using a lower encoder layer as the auxiliary view. Experimental results are listed in Table 5. We can see that: (1) With almost the same parameter size, sharing CANs (#3) obtains +0.7 BLEU in both tasks compared to the baseline (#1), which indicates the improvement comes from our multi-view training rather than the increased parameters; (2) Separate CANs are more helpful than shared CANs when the training data is large enough (#3 vs. #2); (3) Thanks to the separate CANs, the decoder obtains distinguishable context representations even if the auxiliary view is identical to the primary view (#4 vs. #1); (4) An auxiliary view from a high layer (#4, M_a=6) performs worse than one from a low layer (#2, M_a=3), which strongly suggests that the diversity between views matters more than the quality of the auxiliary view.

Related Work
Multi-view learning. One of the most fundamental problems in multi-view learning is view construction. Most previous works study random sampling in the feature space (Ho, 1998) or feature vector transformation by reshaping (Wang et al., 2011). For natural language processing, Matsubara et al. (2005) obtain multiple views of one document by taking different n-grams as terms in the bag-of-words model. Perhaps the most related line of work randomly masks input tokens to generate different sequences. Different from that approach, we take the off-the-shelf outputs of the encoder layers as views, which is more general for multi-layer networks and incurs no construction cost.
Consistency regularization. Knowledge distillation (KD) is a typical application of consistency regularization, which achieves knowledge transfer by letting a student model imitate a teacher model (Hinton et al., 2015). There are many ways to construct the student model: the student may be a peer model of the teacher, or one branch of a multi-branch network architecture (Lan et al., 2018). In our case, the student model is a shallower network coupled to the teacher in a partially shared manner. Another important application of consistency regularization is semi-supervised learning, such as Temporal Ensembling (Laine and Aila, 2017), Mean Teacher (Tarvainen and Valpola, 2017), and Virtual Adversarial Training (Miyato et al., 2017). In contrast, our method works in supervised learning without requiring unlabeled data.

Conclusion
We studied how to incorporate different encoder layers through multi-view learning in neural machine translation. In addition to the primary view from the topmost layer, the proposed model introduces an auxiliary view from an intermediate encoder layer and encourages knowledge transfer between the two views. Our method is agnostic to the network architecture and maintains the same inference speed as the original model. We tested our method on five translation tasks with multiple strong baselines: Transformer, deep Transformer, and DynamicConv. Experimental results show that our multi-view learning method stably outperforms the baseline models, achieving new state-of-the-art results on the Ko→En and IWSLT'14 De→En tasks.