Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

As computing power grows, neural machine translation (NMT) models grow accordingly and become better. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large, accurately trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions, we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for Portuguese--English, Turkish--English, and English--German directions. Students trained using our technique have 50% fewer parameters and can still deliver results comparable to those of 12-layer teachers.


Introduction
In almost all deep learning tasks, including neural machine translation (NMT), an ensemble of models outperforms a single model. In fact, ensemble modelling (training multiple models and ensemble decoding) is supported by most publicly available NMT frameworks (Klein et al., 2017; Junczys-Dowmunt et al., 2018; Vaswani et al., 2018; Ott et al., 2019). However, dealing with multiple models can be challenging, especially in deep learning scenarios. To tackle the issue, one effective solution is to compress the knowledge in an ensemble into a single model through distillation (Buciluǎ et al., 2006; Hinton et al., 2015).
The core part of any knowledge distillation (KD) pipeline is a component that matches different models' predictions, which is usually implemented via multiple cost functions (see Section 2). Furthermore, we also need to take care of the architecture mismatch that may exist between student (S) and teacher (T) models. In KD, these two models can have different architectures (Jiao et al., 2019; Sun et al., 2019), and the motivation is to be able to compress a large teacher into a smaller student.
* These authors contributed equally.
This research focuses on the aforementioned issue. If we distill from intermediate layers of a teacher that has more layers than its student, we have to select a subset of T layers and skip others as there are no peers for all of them on the S side. Clearly, we do not benefit from the skipped layers in this scenario. This type of KD introduces a problem of finding an optimal subset of T layers (to distill from). Although this might, to some extent, be mitigated via a search mechanism, our experimental results show that the problem is severe in NMT and each layer plays a unique role. Therefore, we prefer to keep all layers rather than skip them.
KD has recently become popular in NMT but, to the best of our knowledge, all NMT models (Kim and Rush, 2016; Tan et al., 2019) are still trained using the original idea of KD (Hinton et al., 2015), which is referred to as Regular KD (RKD) throughout this paper. RKD only matches S and T outputs, regardless of their internal architecture. However, there exist techniques such as Patient KD (PKD) (Sun et al., 2019), proposed for other tasks, that not only match final predictions but also focus on internal components and distill their information too (Sun et al., 2020). In this research, we borrowed those ideas and adapted them to NMT. This is the first contribution of the paper.
PKD and other similar models suffer from the skip problem, which happens when T has more layers than S and some T layers have to be skipped in order to carry out layer-to-layer distillation. In this paper, we propose a model to distill from all teacher layers so we do not have to skip any of them. This is our second contribution, by which we are able to outperform PKD. Moreover, for the first time we report experimental results for Transformer-based (Vaswani et al., 2017) models trained with a layer-level KD technique in the context of NMT. This set of results is our third and last contribution in this paper.
The remainder of the paper is organized as follows: In Section 2 we explain the fundamentals of KD. Section 3 discusses the methodology. We describe the advantages of our model and support our claims with experimental results in Section 4. Finally, in Section 5, we conclude the paper and outline our future work.

Background
Usually, in multi-class classification scenarios the training criterion is to minimize the negative log-likelihood of samples, as shown in Equation 1:
$$ \mathcal{L}(\theta) = -\sum_{v=1}^{|V|} \mathbb{1}(y = v) \times \log p(y = v \mid x; \theta) \qquad (1) $$
where 1(.) is an indicator function, (x, y) is an input-output training tuple, and θ and |V| are the parameter set of the model and the number of classes, respectively. The network returns no feedback for classes other than the gold label, since 1(y = v) = 0 whenever v ≠ y. This issue is resolved in KD by extending L with an additive term (Kim and Rush, 2016; Tan et al., 2019), as shown in Equation 2:
$$ \mathcal{L}_{KD}(\theta_T, \theta_S) = -\sum_{v=1}^{|V|} q(y = v \mid x; \theta_T) \times \log p(y = v \mid x; \theta_S) \qquad (2) $$
where there is a student model with the parameter set θ_S whose predictions are penalized with its own loss as well as with T predictions given by q(y = v|x; θ_T). In KD, the component of the loss that involves q is usually referred to as the soft loss and the S model's own loss is known as the hard loss. This form of training provides richer feedback compared to the previous one and leads to higher-quality results. KD for NMT follows the same principle, where V is the target-language vocabulary set and L_KD is computed for each word during decoding.
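To make the hard and soft losses described above concrete, the following is a minimal sketch in plain Python for a single target word. The function names (`softmax`, `hard_loss`, `soft_loss`) and the toy logits are illustrative assumptions, not part of the paper's implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_loss(student_probs, gold_index):
    """Negative log-likelihood: only the gold class contributes (indicator = 1)."""
    return -math.log(student_probs[gold_index])

def soft_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher distribution q and student distribution p."""
    return -sum(q * math.log(p) for q, p in zip(teacher_probs, student_probs))

# Toy 3-word vocabulary; the logits are made-up numbers for illustration.
student = softmax([2.0, 1.0, 0.1])
teacher = softmax([3.0, 2.5, 0.2])

print("hard loss:", hard_loss(student, gold_index=0))
print("soft loss:", soft_loss(teacher, student))
```

Unlike the hard loss, the soft loss receives a contribution from every vocabulary entry to which the teacher assigns probability mass, which is the richer feedback the paragraph refers to.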
With the matching strategy proposed in KD, S learns to mimic its T. A teacher could be a deep model trained on a large dataset, but we do not necessarily need to have the same complex architecture for S. We can distill the teacher's knowledge into a smaller model and replicate its results with fewer resources. Kim and Rush (2016) studied this problem and proposed a sequence-level extension of Equation 2 for NMT models. They evaluated their idea on recurrent, LSTM-based models (Hochreiter and Schmidhuber, 1997) and could run the final model on a cellphone. Freitag et al. (2017) extended the original two-class idea (one S with one T) to distill from multiple teachers. They trained an attention-based recurrent model (Bahdanau et al., 2015) for their experiments. Tan et al. (2019) proposed a setting to train a multilingual Transformer for different language directions. In order to have a high-quality multilingual model, they distill knowledge from separately trained bilingual models. Their work is one of the few papers that report KD results for NMT on Transformers. However, their results are not directly comparable to ours as they benefit from rich, multilingual corpora.
Wei et al. (2019) introduced a pipeline where a student model learns from different checkpoints. At each validation step, if the current checkpoint is a better model than the best existing checkpoint, S learns from it; otherwise, the best stored checkpoint is considered as the teacher.
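The checkpoint selection rule described above can be sketched as follows. This is only an illustration of the rule as summarized here, not the implementation of Wei et al. (2019); the function name `select_teacher`, the checkpoint names, and the validation scores are hypothetical.

```python
def select_teacher(current, best):
    """Return the checkpoint the student should learn from at this validation step.

    Each checkpoint is a (name, validation_score) pair; higher scores are better.
    """
    if best is None or current[1] > best[1]:
        return current  # the current checkpoint becomes the new teacher
    return best         # otherwise keep the best stored checkpoint

# Hypothetical validation scores for four successive checkpoints.
checkpoints = [
    ("step-1000", 20.1),
    ("step-2000", 22.4),
    ("step-3000", 21.9),
    ("step-4000", 23.0),
]

best = None
teachers = []
for ckpt in checkpoints:
    best = select_teacher(ckpt, best)
    teachers.append(best[0])

print(teachers)
```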
In all models discussed so far, i) S usually has the same architecture as its teacher(s), but recent NMT models, particularly Transformers, are deep models, which makes them challenging to run on edge devices. Moreover, ii) the training criterion in the aforementioned models is to combine final predictions. Transformers have new components (e.g. self-attention) and multiple (sub-)layers that contain valuable information (Clark et al., 2019), and we need more than an output-level combination to efficiently distill for/from these models. Therefore, a new technique that is capable of addressing i and ii is required.
The authors of PKD spotted the problem and focused on internal layers (Sun et al., 2019). They studied the limitations of RKD for BERT (Devlin et al., 2019) models and introduced a layer-to-layer cost function. They select a subset of layers from T whose values are compared to S layers. They also showed that different internal components are important and play critical roles in KD.
The layer-level supervision idea was successful for monolingual models but, so far, no one has tried it in the context of NMT. In this paper, we investigate whether the same idea holds for bilingual models or whether NMT requires a different type of KD. Moreover, we address the skip problem in PKD (shown in Figure 1). It seems that with deep teacher models we do not need to skip layers and can distill from all of them.

Methodology
In RKD, distillation only happens at the output level whereas PKD introduces layer-wise supervision. This idea is illustrated in Figure 1. In PKD, finding a skippable layer is the main challenge. Accordingly, we propose a combinatorial idea, CKD, by which we are able to fuse layers and benefit from all information stored in all layers. Our idea can be formulated as follows:
$$ \mathcal{L}_{CKD} = \sum_{l_s^i \in L_s} MSE(l_s^i, f_t^i), \qquad f_t^i = F(\{ l_t^j \in L_t : j \in M(i) \}) $$
where L_s and L_t indicate the set of all hidden layers of S and T, respectively. MSE() is the mean-square error function and l_s^i is the i-th hidden layer of S. In PKD, f_t^i is the teacher's i-th layer whereas in our case f_t^i is the result of a fusion applied through the function F() to a particular subset of T layers. This subset is defined via a mapper function M() which takes an index (pointing to a layer on the student side) and returns a set of indices from the teacher model. Based on these indices, teacher layers are combined and passed to the distillation process, e.g. M(2) = {1, 3} means that F is fed the first (l_t^1) and third (l_t^3) layers of T and the distillation happens between l_s^2 and f_t^2 (the result of fusion).
For F(), a simple concatenation followed by a linear projection provided the best results in our experiments, so in the previous example:
$$ f_t^2 = W [l_t^1 \bullet l_t^3] + b $$
where • indicates concatenation, and W ∈ R^{d×2d} and b ∈ R^d are learnable parameters of KD. All of l_t^1, l_t^3, l_s^2, and f_t^2 are d-dimensional vectors. The mapper function M() defines our combination strategy, for which we have four variations: regular combination (RC), overlap combination (OC), skip combination (SC), and cross combination (CC). Figure 2 visualizes these variations. As the figure shows, PKD is a particular case of our model, but CKD gives us more flexibility in terms of distilling from different teacher configurations.
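The fusion and layer-level loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the toy vectors with d = 2, the fixed values of W and b (which would be learned in practice), the use of one projection per student layer, and the choice to concatenate teacher layers in the order listed by the mapper are all assumptions. Only the mapping M(2) = {1, 3} comes from the text.

```python
def concat(vectors):
    """Concatenate a list of vectors into one long vector."""
    return [x for v in vectors for x in v]

def linear(W, b, x):
    """Linear projection W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mse(a, b):
    """Mean-square error between two vectors of the same length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def ckd_loss(student_layers, teacher_layers, mapper, projections):
    """Sum over student layers of MSE(l_s^i, f_t^i), with f_t^i = W [concat] + b.

    student_layers, teacher_layers: dict mapping a 1-based layer index to a vector.
    mapper: dict mapping a student index i to the teacher indices M(i).
    projections: dict mapping a student index i to its (W, b) pair (assumed per layer).
    """
    total = 0.0
    for i, l_s in student_layers.items():
        fused_input = concat([teacher_layers[j] for j in mapper[i]])
        W, b = projections[i]
        f_t = linear(W, b, fused_input)
        total += mse(l_s, f_t)
    return total

# Toy setting with d = 2 and the mapping from the text, M(2) = {1, 3}.
teacher_layers = {1: [1.0, 2.0], 3: [3.0, 4.0]}
student_layers = {2: [1.0, 3.0]}
mapper = {2: [1, 3]}

# W is d x 2d; these fixed values simply average the two teacher layers.
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
b = [0.0, 0.0]

loss = ckd_loss(student_layers, teacher_layers, mapper, {2: (W, b)})
print(loss)
```

With a mapper that returns a single teacher index per student layer and an identity projection, the same function reduces to a PKD-style layer-to-layer loss, which matches the remark that PKD is a particular case of CKD.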

Experimental Study
Although our proposed model is a general KD technique and can be applied in different settings, we narrow down the scope of this paper to low-resource NMT settings. The motivation behind our project was to train NMT models for small datasets, so we report experimental results accordingly.
To evaluate CKD, we trained multiple models to translate from English (En) into German (De), and from Portuguese (Pt) and Turkish (Tr) into English (En). For the Pt|Tr→En directions we use the IWSLT-2014 dataset, and the En→De experiment uses the WMT-2014 dataset.
In Pt→En, we use the original split of datasets from IWSLT with 167K, 7590, and 5388 sentences for training, development, and test sets, respectively. For Tr→En, the split is 142K, 1958, [...] (Vaswani et al., 2017), namely the training set includes 4.5M sentences, newstest2013 is used as the validation set and newstest2014 is our test set with 3000 and 3003 sentences, respectively. We selected this dataset to be comparable to a well-known baseline and to make sure our training pipeline yields high-quality engines.

Models
We preprocess datasets with SentencePiece (Kudo and Richardson, 2018). For Pt→En, we extracted a shared vocabulary set for both source and target sides with 32K subwords. Both S and T are trained using the same training set. Tr→En follows the same setting. For En→De, we conduct two experiments. Since our focus in this paper is to work with low-resource settings, in En→De 1, S and T are trained on a dataset of 200K sentences randomly sampled from the main dataset (4.5M). For this experiment the vocabulary set size is 15K. In En→De 2, we slightly changed the setting: we use the entire set of 4.5M sentences to train T, but S still uses the same 200K dataset. In this scenario, we assumed that there already exists a high-quality teacher trained on a large dataset but we only have a small in-house dataset to train the student. For this experiment the vocabulary size is 37K.
Table 1 reports BLEU (Papineni et al., 2002) scores computed using sacreBLEU (Post, 2018). As the table shows, our students outperform all other students trained with different KD techniques. Moreover, students in Pt|Tr→En and En→De 1 settings are even comparable to accurately trained, deep teachers. All teachers are 12-layer Transformers (6 for encoding and 6 for decoding), whereas students only have 4 layers (2 encoder layers and 2 decoder layers). All settings in our experiments are identical to those of Vaswani et al. (2017), which means that hyperparameters whose values are not explicitly stated in this paper use the same values as the original Transformer model.
CKD makes it possible to reduce the number of parameters in our students by 50% and yet deliver the same high-quality translations. Accordingly, this enables us to run these translation engines on edge devices. Table 2 shows the exact number of parameters for each model. For results reported in Table 1, cross-model layer mappings between teacher and student layers are as follows: [...] We tried a simple (and somewhat arbitrary) configuration for layer connections and there is no systematic strategy behind it. However, better results can be achieved with better heuristics or through a search process. Moreover, as the mappings show, there is no connection between student and teacher models' decoder layers. In our experiments, we noticed that any KD technique applied to the decoder considerably decreases performance, so we only use KD on the encoder side. More specifically, each student model has two decoder layers which only receive inputs from the same model's encoder layers and are not connected to the teacher side.
To train students with different KD techniques we use different loss functions. In T and No-KD we only have a single loss function (L) as described in the original Transformer model (Vaswani et al., 2017). For models trained with RKD, an additional loss is involved to match teacher and student predictions (L_KD). The final loss in this case is an interpolation of the aforementioned losses: (β × L) + (η × L_KD). In our experiments, β = (1 − η), where η = 0.1 is obtained through a search process over the set {0.1, 0.3, 0.5, 0.7, 0.9}.
For students trained using PKD and CKD, a third loss is also used in addition to L and L_KD. Similar to the other losses, the third one is multiplied by a weight value (λ) to incorporate its impact into the training process. In this new setting, β = (1 − η − λ), η = 0.1, and λ = 0.7. The high value of λ compared to the other weights shows the importance of intermediate KD for deep models. All these values are chosen through an empirical study in order to minimize the final loss of translation engines.
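The weighting scheme described above can be written as a short sketch. The function name `total_loss` and the example loss values are hypothetical; only the weights (η = 0.1, λ = 0.7, and β derived from them) come from the text.

```python
def total_loss(hard, soft, intermediate=0.0, eta=0.1, lam=0.0):
    """Interpolate the hard loss (L), the soft loss (L_KD) and the layer-level loss.

    beta is derived as 1 - eta - lam, so the three weights always sum to 1.
    With lam = 0 this is the RKD objective: (beta * L) + (eta * L_KD).
    """
    beta = 1.0 - eta - lam
    return beta * hard + eta * soft + lam * intermediate

# Made-up loss values for one training step.
hard, soft, layer = 2.0, 1.5, 0.4

rkd = total_loss(hard, soft)                   # beta = 0.9, eta = 0.1
ckd = total_loss(hard, soft, layer, lam=0.7)   # beta = 0.2, eta = 0.1, lambda = 0.7
print(rkd, ckd)
```

With η = 0.1 and λ = 0.7, the hard loss keeps a weight of only 0.2, so most of the training signal comes from the intermediate layers.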

How Powerful is CKD?
In order to study the behaviour of CKD, we designed multiple small experiments in addition to those reported in Table 1. PKD proposes a solution to define a loss between internal components of teacher and student models. The original model implemented this idea for intermediate layers. In one of our experiments we extended PKD by adding an extra loss function for self-attention components. Therefore, this new extension compares final outputs of student and teacher models as well as their intermediate layers and self-attention parameters. In this experiment, BLEU for Pt→En increased from 42.27 to 43.28, but our model is still superior with a BLEU score of 43.78. For this setting, CKD outperforms even a very complicated variation of PKD, which could be an indication of our model's capacity for training high-quality students. For Tr→En and En→De 1 we also observed slight improvements by matching teacher and student self-attention components, but the results were not statistically significant and CKD was still better.
We also studied how CKD behaves in large experimental settings, for which we used En→De and En→French (Fr) datasets with 4.5M and 36M training samples, respectively, and trained 12-layer teachers and 4-layer students. For this experiment, we used the same settings, and the same test and development sets, as suggested in Vaswani et al. (2017). As the table shows, CKD is better than PKD in large experimental settings too. However, in order to have a better understanding of the large-dataset scenario we need to explore more configurations. We emphasize that for this paper our focus was to work with small students and datasets.

Conclusion and Future Work
In this paper, we proposed a novel model to distill from intermediate layers as well as final predictions. Moreover, we addressed the skip problem of PKD. We applied our technique in NMT and showed its potential in training high-quality and compact models. In our future work, i) we are interested in distilling from deep NMT models into extremely small students with CKD, in the hope of achieving the same results as large models with much smaller counterparts. ii) We also plan to improve the combination module and find a better alternative to concatenation. iii) Finally, we plan to evaluate CKD on other tasks such as language modeling.