Sequence-Level Knowledge Distillation

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with only a decrease of 0.2 BLEU. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) is a deep learning-based method for translation that has recently shown promising results as an alternative to statistical approaches. NMT systems directly model the probability of the next word in the target sentence simply by conditioning a recurrent neural network on the source sentence and previously generated target words.
While both simple and surprisingly accurate, NMT systems typically need to have very high capacity in order to perform well: Sutskever et al. This issue of excessively large networks has been observed in several other domains, with much focus on fully-connected and convolutional networks for multi-class classification. Researchers have particularly noted that large networks seem to be necessary for training, but learn redundant representations in the process (Denil et al., 2013). Therefore, compressing deep models into smaller networks has been an active area of research. As deep learning systems obtain better results on NLP tasks, compression also becomes an important practical issue, with applications such as running deep learning models for speech and translation locally on cell phones.
Existing compression methods generally fall into two categories: (1) pruning and (2) knowledge distillation. Pruning methods (LeCun et al., 1990; He et al., 2014; Han et al., 2016) zero out weights or entire neurons based on an importance criterion: LeCun et al. (1990) use (a diagonal approximation to) the Hessian to identify weights whose removal minimally impacts the objective function, while Han et al. (2016) remove weights based on thresholding their absolute values. Knowledge distillation approaches (Bucila et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015) learn a smaller student network to mimic the original teacher network by minimizing the loss (typically L2 or cross-entropy) between the student and teacher output.
In this work, we investigate knowledge distillation in the context of neural machine translation. We note that NMT differs from previous work which has mainly explored non-recurrent models in the multiclass prediction setting. For NMT, while the model is trained on multi-class prediction at the word-level, it is tasked with predicting complete sequence outputs conditioned on previous decisions. With this difference in mind, we experiment with standard knowledge distillation for NMT and also propose two new versions of the approach that attempt to approximately match the sequence-level (as opposed to word-level) distribution of the teacher network. This sequence-level approximation leads to a simple training procedure wherein the student network is trained on a newly generated dataset that is the result of running beam search with the teacher network.
We run experiments to compress a large state-of-the-art 4 × 1000 LSTM model, and find that with sequence-level knowledge distillation we are able to learn a 2 × 500 LSTM that roughly matches the performance of the full system. We see similar results compressing a 2 × 500 model down to 2 × 100 on a smaller data set. Furthermore, we observe that our proposed approach has other benefits, such as not requiring any beam search at test time. As a result we are able to perform greedy decoding on the 2 × 500 model 10 times faster than beam search on the 4 × 1000 model with comparable performance. Our student models can even be run efficiently on a standard smartphone. [1] Finally, we apply weight pruning on top of the student network to obtain a model that has 13× fewer parameters than the original teacher model. We have released all the code for the models described in this paper. [2]

[1] https://github.com/harvardnlp/nmt-android
[2] https://github.com/harvardnlp/seq2seq-attn

Background

Sequence-to-Sequence with Attention
Let s = [s_1, . . . , s_I] and t = [t_1, . . . , t_J] be (random variable sequences representing) the source/target sentence, with I and J respectively being the source/target lengths. Machine translation involves finding the most probable target sentence given the source:

$$\operatorname*{argmax}_{t \in T} \; p(t \mid s)$$

where T is the set of all possible sequences. NMT models parameterize p(t | s) with an encoder neural network which reads the source sentence and a decoder neural network which produces a distribution over the target sentence (one word at a time) given the source. We employ the attentional architecture from Luong et al. (2015), which achieved state-of-the-art results on English → German translation.

Knowledge Distillation
Knowledge distillation describes a class of methods for training a smaller student network to perform better by learning from a larger teacher network (in addition to learning from the training data set). We generally assume that the teacher has previously been trained, and that we are estimating parameters for the student. Knowledge distillation suggests training by matching the student's predictions to the teacher's predictions. For classification this usually means matching the probabilities either via L2 on the log scale (Ba and Caruana, 2014) or by cross-entropy (Li et al., 2014; Hinton et al., 2015).
Concretely, assume we are learning a multi-class classifier over a data set of examples of the form (x, y) with possible classes V. The usual training criterion is to minimize NLL for each example from the training data,

$$\mathcal{L}_{\text{NLL}}(\theta) = -\sum_{k=1}^{|V|} \mathbb{1}\{y = k\} \log p(y = k \mid x; \theta)$$

where 1{·} is the indicator function and p the distribution from our model (parameterized by θ). This objective can be seen as minimizing the cross-entropy between the degenerate data distribution (which has all of its probability mass on one class) and the model distribution p(y | x; θ).

Figure 1: In word-level knowledge distillation (left), cross-entropy is minimized between the student/teacher distributions (yellow) for each word in the actual target sequence (ECD), as well as between the student distribution and the degenerate data distribution, which has all of its probability mass on one word (black). In sequence-level knowledge distillation (center) the student network is trained on the output from beam search of the teacher network that had the highest score (ACF). In sequence-level interpolation (right) the student is trained on the output from beam search of the teacher network that had the highest sim with the target sequence (ECE).
In knowledge distillation, we assume access to a learned teacher distribution q(y | x; θ_T), possibly trained over the same data set. Instead of minimizing cross-entropy with the observed data, we minimize the cross-entropy with the teacher's probability distribution,

$$\mathcal{L}_{\text{KD}}(\theta; \theta_T) = -\sum_{k=1}^{|V|} q(y = k \mid x; \theta_T) \log p(y = k \mid x; \theta)$$

where θ_T parameterizes the teacher distribution and remains fixed. Note the cross-entropy setup is identical, but the target distribution is no longer a sparse distribution. [4] Training on q(y | x; θ_T) is attractive since it gives more information about other classes for a given data point (e.g. similarity between classes) and has less variance in gradients (Hinton et al., 2015).
Since this new objective has no direct term for the training data, it is common practice to interpolate between the two losses,

$$\mathcal{L}(\theta; \theta_T) = (1 - \alpha)\,\mathcal{L}_{\text{NLL}}(\theta) + \alpha\,\mathcal{L}_{\text{KD}}(\theta; \theta_T)$$

where α is a mixture parameter combining the one-hot distribution and the teacher distribution.

[4] In some cases the entropy of the teacher/student distribution is increased by annealing it with a temperature term τ > 1. After testing τ ∈ {1, 1.5, 2} we found that τ = 1 worked best.
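As a rough illustration of the classification setting described above, the following sketch computes the NLL loss, the distillation loss, and their interpolation for a single example. The three-class probabilities are invented for illustration, and the `anneal` helper assumes the common form of temperature annealing (raising probabilities to the power 1/τ and renormalizing), which is an assumption rather than something the text specifies.

```python
import math

def nll_loss(p, y):
    """Cross-entropy against the one-hot data distribution: -log p(y | x)."""
    return -math.log(p[y])

def kd_loss(p, q):
    """Cross-entropy against the teacher distribution q: -sum_k q_k log p_k."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0.0)

def interpolated_loss(p, q, y, alpha):
    """(1 - alpha) * L_NLL + alpha * L_KD for one example."""
    return (1.0 - alpha) * nll_loss(p, y) + alpha * kd_loss(p, q)

def anneal(p, tau):
    """Assumed temperature annealing: p_k^(1/tau), renormalized."""
    scaled = [pk ** (1.0 / tau) for pk in p]
    z = sum(scaled)
    return [s / z for s in scaled]

# Toy three-class example; all numbers are made up for illustration.
student = [0.7, 0.2, 0.1]   # p(y | x; theta)
teacher = [0.6, 0.3, 0.1]   # q(y | x; theta_T), held fixed
gold = 0                    # index of the observed class
loss = interpolated_loss(student, teacher, gold, alpha=0.5)
```

With α = 0 the interpolated loss reduces to the usual NLL, and with α = 1 it reduces to pure distillation. Passing a one-hot vector as the teacher makes `kd_loss` coincide with `nll_loss`, which mirrors the remark that the two objectives share the same cross-entropy setup.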

Knowledge Distillation for NMT
The large sizes of neural machine translation systems make them an ideal candidate for knowledge distillation approaches. In this section we explore three different ways this technique can be applied to NMT.

Word-Level Knowledge Distillation
NMT systems are trained directly to minimize word NLL, L_WORD-NLL, at each position. Therefore if we have a teacher model, standard knowledge distillation for multi-class cross-entropy can be applied. We define this distillation for a sentence as,

$$\mathcal{L}_{\text{WORD-KD}} = -\sum_{j=1}^{J} \sum_{k=1}^{|V|} q(t_j = k \mid s, t_{<j}) \log p(t_j = k \mid s, t_{<j})$$

where V is the target vocabulary set. The student can further be trained to optimize the mixture of L_WORD-KD and L_WORD-NLL. In the context of NMT, we refer to this approach as word-level knowledge distillation and illustrate this in Figure 1 (left).
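A minimal sketch of the word-level objective follows, assuming the student and teacher next-word distributions have already been computed for every target position while conditioning on the same gold prefix. The two-position, three-word example and all probabilities are hypothetical.

```python
import math

def word_nll_loss(student_dists, gold_ids):
    """Sum over positions j of -log p(t_j = y_j | s, t_<j)."""
    return -sum(math.log(p_j[y_j]) for p_j, y_j in zip(student_dists, gold_ids))

def word_kd_loss(student_dists, teacher_dists):
    """Sum over positions j of the teacher/student cross-entropy over the vocabulary."""
    total = 0.0
    for p_j, q_j in zip(student_dists, teacher_dists):
        total -= sum(q * math.log(p) for q, p in zip(q_j, p_j) if q > 0.0)
    return total

# Hypothetical distributions over a 3-word vocabulary at 2 target positions.
student_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_dists = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
gold_ids = [0, 1]

alpha = 0.5  # mixture weight between the two word-level losses
mixed = ((1.0 - alpha) * word_nll_loss(student_dists, gold_ids)
         + alpha * word_kd_loss(student_dists, teacher_dists))
```

Because both losses decompose over positions, this variant needs only one teacher forward pass per training sentence and no search.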

Sequence-Level Knowledge Distillation
Word-level knowledge distillation allows transfer of these local word distributions. Ideally, however, we would like the student model to mimic the teacher's actions at the sequence level. The sequence distribution is particularly important for NMT, because wrong predictions can propagate forward at test time.
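First, consider the sequence-level distribution specified by the model over all possible sequences t ∈ T,

$$p(t \mid s) = \prod_{j=1}^{J} p(t_j \mid s, t_{<j})$$

for any length J. The sequence-level negative log-likelihood for NMT then involves matching the one-hot distribution over all complete sequences,

$$\mathcal{L}_{\text{SEQ-NLL}} = -\sum_{t \in T} \mathbb{1}\{t = y\} \log p(t \mid s) = -\sum_{j=1}^{J} \sum_{k=1}^{|V|} \mathbb{1}\{y_j = k\} \log p(t_j = k \mid s, t_{<j}) = \mathcal{L}_{\text{WORD-NLL}}$$

where y = [y_1, . . . , y_J] is the observed sequence. Of course, this just shows that from a negative log-likelihood perspective, minimizing word-level NLL and sequence-level NLL are equivalent in this model.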
But now consider the case of sequence-level knowledge distillation. As before, we can simply replace the distribution from the data with a probability distribution derived from our teacher model. However, instead of using a single word prediction, we use q(t | s) to represent the teacher's sequence distribution over the sample space of all possible sequences,

$$\mathcal{L}_{\text{SEQ-KD}} = -\sum_{t \in T} q(t \mid s) \log p(t \mid s)$$

Note that L_SEQ-KD is inherently different from L_WORD-KD, as the sum is over an exponential number of terms. Despite its intractability, we posit that this sequence-level objective is worthwhile. It gives the teacher the chance to assign probabilities to complete sequences and therefore transfer a broader range of knowledge. We thus consider an approximation of this objective.
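Our simplest approximation is to replace the teacher distribution q with its mode,

$$q(t \mid s) \sim \mathbb{1}\{t = \operatorname*{argmax}_{t \in T} q(t \mid s)\}$$

Observing that finding the mode is itself intractable, we use beam search to find an approximation. The loss is then

$$\mathcal{L}_{\text{SEQ-KD}} \approx -\sum_{t \in T} \mathbb{1}\{t = \hat{y}\} \log p(t \mid s) = -\log p(t = \hat{y} \mid s)$$

where ŷ is now the output from running beam search with the teacher model.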
Using the mode seems like a poor approximation for the teacher distribution q(t | s), as we are approximating an exponentially-sized distribution with a single sample. However, previous results showing the effectiveness of beam search decoding for NMT lead us to believe that a large portion of q's mass lies in a single output sequence. In fact, in experiments we find that with beam of size 1, q(ŷ | s) (on average) accounts for 1.3% of the distribution for German → English, and 2.3% for Thai → English (Table 1: p(t = ŷ)).
To summarize, sequence-level knowledge distillation suggests to: (1) train a teacher model, (2) run beam search over the training set with this model, (3) train the student network with cross-entropy on this new dataset.
Step (3) is identical to the word-level NLL process except now on the newly-generated data set. This is shown in Figure 1 (center).
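The data-generation step of this procedure can be sketched as follows. This is an illustrative sketch only: `next_log_probs` is a hypothetical teacher interface that returns next-token log-probabilities, and `toy_teacher` is a made-up stand-in that mostly copies the source, not a trained NMT model.

```python
import math

EOS = "</s>"

def beam_search(next_log_probs, source, beam_size=5, max_len=20):
    """Return the highest-scoring finished hypothesis found by beam search.

    next_log_probs(source, prefix) is a hypothetical teacher interface that
    returns {token: log q(token | source, prefix)}.
    """
    beam = [(0.0, [])]   # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for tok, lp in next_log_probs(source, prefix).items():
                candidates.append((score + lp, prefix + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, toks in candidates[:beam_size]:
            if toks[-1] == EOS:
                finished.append((score, toks))
            else:
                beam.append((score, toks))
        if not beam:
            break
    pool = finished if finished else beam
    return max(pool, key=lambda c: c[0])[1]

def toy_teacher(source, prefix):
    """Made-up teacher: copies the source word by word with probability 0.8."""
    i = len(prefix)
    target = source[i] if i < len(source) else EOS
    probs = {target: 0.8, "<unk>": 0.2}
    return {tok: math.log(p) for tok, p in probs.items()}

def build_seq_kd_data(next_log_probs, sources, beam_size=5):
    """Step (2): pair every source with the teacher's beam-search output."""
    return [(src, beam_search(next_log_probs, src, beam_size)) for src in sources]

sources = [["guten", "morgen"], ["danke"]]
seq_kd_data = build_seq_kd_data(toy_teacher, sources)
```

The student is then trained on `seq_kd_data` with the ordinary word-level NLL, exactly as it would be on the original parallel data, so no change to the training code is needed.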

Sequence-Level Interpolation
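Next we consider integrating the training data back into the process, such that we train the student model as a mixture of our sequence-level teacher-generated data (L_SEQ-KD) with the original training data (L_SEQ-NLL),

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\text{SEQ-NLL}} + \alpha\,\mathcal{L}_{\text{SEQ-KD}} = -(1 - \alpha) \log p(y \mid s) - \alpha \sum_{t \in T} q(t \mid s) \log p(t \mid s)$$

where y is the gold target sequence.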
Since the second term is intractable, we could again apply the mode approximation from the previous section, and train on both observed (y) and teacher-generated (ŷ) data. However, this process is non-ideal for two reasons: (1) unlike for standard knowledge distillation, it doubles the size of the training data, and (2) it requires training on both the teacher-generated sequence and the true sequence, conditioned on the same source input. The latter concern is particularly problematic since we observe that y and ŷ are often quite different.
As an alternative, we propose a single-sequence approximation that is more attractive in this setting. This approach is inspired by local updating (Liang et al., 2006), a method for discriminative training in statistical machine translation (although to our knowledge not for knowledge distillation). Local updating suggests selecting a training sequence which is close to y and has high probability under the teacher model,

$$\tilde{y} = \operatorname*{argmax}_{t \in T} \; \text{sim}(t, y)\, q(t \mid s)$$

where sim is a function measuring closeness (e.g. Jaccard similarity or BLEU (Papineni et al., 2002)). Following local updating, we can approximate this sequence by running beam search and choosing

$$\tilde{y} \approx \operatorname*{argmax}_{t \in T_K} \; \text{sim}(t, y)$$

where T_K is the K-best list from beam search. We take sim to be smoothed sentence-level BLEU (Chen and Cherry, 2014).
We justify training on ỹ from a knowledge distillation perspective with the following generative process: suppose that there is a true target sequence (which we do not observe) that is first generated from the underlying data distribution D. And further suppose that the target sequence that we observe (y) is a noisy version of the unobserved true sequence: i.e. (i) t ∼ D, (ii) y ∼ ε(t), where ε(t) is, for example, a noise function that independently replaces each element in t with a random element in V with some small probability. [6] In such a case, ideally the student's distribution should match the mixture distribution,

$$D_{\text{SEQ-Inter}} \sim (1 - \alpha)\, D + \alpha\, q(t \mid s)$$

In this setting, due to the noise assumption, D now has significant probability mass around a neighborhood of y (not just at y), and therefore the argmax of the mixture distribution is likely something other than y (the observed sequence) or ŷ (the output from beam search). We can see that ỹ is a natural approximation to the argmax of this mixture distribution between D and q(t | s) for some α. We illustrate this framework in Figure 1 (right) and visualize the distribution over a real example in Figure 2.
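The selection of the training target from the teacher's K-best list can be sketched as below. The similarity function here is a simple add-one-smoothed sentence-level BLEU written for illustration; it is an assumption and not necessarily the smoothing variant of Chen and Cherry (2014) used in the experiments. The K-best list and gold sentence are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on every n-gram precision."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = sum(h.values())
        log_prec += math.log((matches + 1.0) / (total + 1.0)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)

def select_seq_inter(k_best, gold, sim=smoothed_bleu):
    """Pick the hypothesis on the K-best list that is closest to the gold target."""
    return max(k_best, key=lambda hyp: sim(hyp, gold))

# Hypothetical gold sentence and teacher K-best list.
gold = "the cat sat on the mat".split()
k_best = [
    "a cat is on the mat".split(),
    "the cat sat on a mat".split(),
    "the dog ran".split(),
]
y_tilde = select_seq_inter(k_best, gold)
```

Restricting the argmax to the K-best list means every candidate already has high probability under the teacher, so the similarity score alone decides among them.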

Experimental Setup
To test out these approaches, we conduct two sets of NMT experiments: high resource (English → German) and low resource (Thai → English).
The English-German data comes from WMT 2014. [7] The training set has 4m sentences and we take newstest2012/newstest2013 as the dev set and newstest2014 as the test set. We keep the top 50k most frequent words, and replace the rest with UNK. The teacher model is a 4 × 1000 LSTM (as in Luong et al. (2015)) and we train two student models: 2 × 300 and 2 × 500. The Thai-English data comes from IWSLT 2015. [8] There are 90k sentences in the training set and we take 2010/2011/2012 data as the dev set and 2012/2013 as the test set, with a vocabulary size of 25k. The teacher model is 2 × 500 (which performed better than 4 × 1000 and 2 × 750 models), and the student model is 2 × 100. Other training details mirror Luong et al. (2015).

[6] While we employ a simple (unrealistic) noise function for illustrative purposes, the generative story is quite plausible if we consider a more elaborate noise function which includes additional sources of noise such as phrase reordering, replacement of words with synonyms, etc. One could view translation as having two sources of variance that should be modeled separately: variance due to the source sentence (t ∼ D), and variance due to the individual translator (y ∼ ε(t)).
[7] http://statmt.org/wmt14
[8] https://sites.google.com/site/iwsltevaluation2015/mt-track

We evaluate on tokenized BLEU with multi-bleu.perl, and experiment with the following variations:

Word-Level Knowledge Distillation (Word-KD)
Student is trained on the original data and additionally trained to minimize the cross-entropy of the teacher distribution at the word-level. We tested α ∈ {0.5, 0.9} and found α = 0.5 to work better.

Sequence-Level Knowledge Distillation (Seq-KD)
Student is trained on the teacher-generated data, which is the result of running beam search and taking the highest-scoring sequence with the teacher model. We use beam size K = 5 (we did not see improvements with a larger beam).

Sequence-Level Interpolation (Seq-Inter)
Student is trained on the sequence on the teacher's beam that had the highest BLEU (beam size K = 35). We adopt a fine-tuning approach where we begin training from a pretrained model (either on original data or Seq-KD data) and train with a smaller learning rate (0.1). For English-German we generate Seq-Inter data on a smaller portion of the training set (∼ 50%) for efficiency.
The above methods are complementary and can be combined with each other. For example, we can train on teacher-generated data but still include a word-level cross-entropy term between the teacher/student (Seq-KD + Word-KD in Table 1), or fine-tune towards Seq-Inter data starting from the baseline model trained on original data (Baseline + Seq-Inter in Table 1).

Results and Discussion
Results of our experiments are shown in Table 1. We find that while word-level knowledge distillation (Word-KD) does improve upon the baseline, sequence-level knowledge distillation (Seq-KD) does better on English → German and performs similarly on Thai → English. Combining them (Seq-KD + Word-KD) results in further gains for the 2 × 300 and 2 × 100 models (although not for the 2 × 500 model), indicating that these methods provide orthogonal means of transferring knowledge from the teacher to the student: Word-KD is transferring knowledge at the local (i.e. word) level while Seq-KD is transferring knowledge at the global (i.e. sequence) level.
Sequence-level interpolation (Seq-Inter), in addition to improving models trained via Word-KD and Seq-KD, also improves upon the original teacher model that was trained on the actual data but fine-tuned towards Seq-Inter data (Baseline + Seq-Inter). In fact, greedy decoding with this fine-tuned model has similar performance (19.6) to beam search with the original model (19.5), allowing for faster decoding even with an identically-sized model. We hypothesize that sequence-level knowledge distillation is effective because it allows the student network to only model relevant parts of the teacher distribution (i.e. around the teacher's mode) instead of 'wasting' parameters on trying to model the entire space of translations. Our results suggest that this is indeed the case: the probability mass that Seq-KD models assign to the approximate mode is much higher than is the case for baseline models trained on original data (Table 1: p(t = ŷ)). For example, on English → German the (approximate) argmax for the 2 × 500 Seq-KD model (on average) accounts for 16.9% of the total probability mass, while the corresponding number is 0.9% for the baseline. This also explains the success of greedy decoding for Seq-KD models: since we are only modeling around the teacher's mode, the student's distribution is more peaked and therefore the argmax is much easier to find. Seq-Inter offers a compromise between the two, with the greedily-decoded sequence accounting for 7.6% of the distribution.
Finally, although past work has shown that models with lower perplexity generally tend to have

Decoding Speed
Run-time complexity for beam search grows linearly with beam size. Therefore, the fact that sequence-level knowledge distillation allows for greedy decoding is significant, with practical implications for running NMT systems across various devices. To test the speed gains, we run the teacher/student models on GPU, CPU, and smartphone, and check the average number of source words translated per second (Table 2). We use a GeForce GTX Titan X for GPU and a Samsung Galaxy 6 smartphone. We find that we can run the student model 10 times faster with greedy decoding than the teacher model with beam search on GPU (1051.3 vs 101.9 words/sec), with similar performance.

Weight Pruning
Although knowledge distillation enables training faster models, the number of parameters for the student models is still somewhat large: on the 2 × 500 English → German model the word embeddings account for approximately 63% (50m out of 84m) of the parameters. The size of the word embeddings has little impact on run-time, as the word embedding layer is a simple lookup table that only affects the first layer of the model. We therefore focus next on reducing the memory footprint of the student models further through weight pruning. Weight pruning for NMT was recently investigated by See et al. (2016), who found that up to 80-90% of the parameters in a large NMT model can be pruned with little loss in performance. We take our best English → German student model (2 × 500 Seq-KD + Seq-Inter) and prune x% of the parameters by removing the weights with the lowest absolute values. We then retrain the pruned model on Seq-KD data with a learning rate of 0.2 and fine-tune towards Seq-Inter data with a learning rate of 0.1. As observed by See et al. (2016), retraining proved to be crucial. The results are shown in Table 3.
Our findings suggest that compression benefits achieved through weight pruning and knowledge distillation are orthogonal. Pruning 80% of the weights in the 2 × 500 student model results in a model with 13× fewer parameters than the original teacher model with only a decrease of 0.4 BLEU. While pruning 90% of the weights results in a more appreciable decrease of 1.0 BLEU, the model is drastically smaller with 8m parameters, which is 26× fewer than the original teacher model.
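The magnitude-based pruning step used in this section can be sketched as follows. This is a minimal illustration on a flat list of made-up weights; it covers only the removal of the smallest-magnitude weights and omits the retraining that the experiments found to be crucial.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute values."""
    n_prune = int(len(weights) * fraction)
    if n_prune == 0:
        return list(weights)
    # Indices ordered by increasing magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical weight vector; a real model would prune its parameter matrices.
weights = [0.05, -1.2, 0.3, -0.01, 0.8, 0.002, -0.4, 0.09, 1.5, -0.07]
pruned = magnitude_prune(weights, 0.8)
```

In practice the zeroed positions would be kept fixed by a mask during retraining, so that the pruned weights stay at zero while the remaining ones are updated.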

Further Observations
• For models trained with word-level knowledge distillation, we also tried regressing the student network's top-most hidden layer at each time step to the teacher network's top-most hidden layer as a pretraining step, noting that Romero et al. (2015) obtained improvements with a similar technique on feed-forward models. We found this to give comparable results to standard knowledge distillation and hence did not pursue this further.
• There have been promising recent results on eliminating word embeddings completely and obtaining word representations directly from characters with character composition models, which have many fewer parameters than word embedding lookup tables (Ling et al., 2015a; Kim et al., 2016; Ling et al., 2015b; Jozefowicz et al., 2016; Costa-Jussa and Fonollosa, 2016). Combining such methods with knowledge distillation/pruning to further reduce the memory footprint of NMT systems remains an avenue for future work.

Related Work
Compressing deep learning models is an active area of current research. Pruning methods involve pruning weights or entire neurons/nodes based on some criterion. LeCun et al. (1990) prune weights based on an approximation of the Hessian, while Han et al. (2016) show that a simple magnitude-based pruning works well. Prior work on removing neurons/nodes includes Srinivas and Babu (2015) and Mariet and Sra (2016). See et al. (2016) were the first to apply pruning to neural machine translation, observing that different parts of the architecture (input word embeddings, LSTM matrices, etc.) admit different levels of pruning. Knowledge distillation approaches train a smaller student model to mimic a larger teacher model, by minimizing the loss between the teacher/student predictions (Bucila et al., 2006; Ba and Caruana, 2014; Li et al., 2014; Hinton et al., 2015). Romero et al. (2015) additionally regress on the intermediate hidden layers of the student/teacher network as a pretraining step, while Mou et al. (2015) obtain smaller word embeddings from a teacher model via regression. There has also been work on transferring knowledge across different network architectures: Chan et al. (2015b) show that a deep non-recurrent neural network can learn from an RNN; Geras et al. (2016) train a CNN to mimic an LSTM for speech recognition. Kuncoro et al. (2016) recently investigated knowledge distillation for structured prediction by having a single parser learn from an ensemble of parsers.

Conclusion
In this work we have investigated existing knowledge distillation methods for NMT (which work at the word-level) and introduced two sequence-level variants of knowledge distillation, which provide improvements over standard word-level knowledge distillation.
We have chosen to focus on translation as this domain has generally required the largest capacity deep learning models, but the sequence-to-sequence framework has been successfully applied to a wide range of tasks including parsing (Vinyals et al., 2015a), summarization (Rush et al., 2015), dialogue (Vinyals and Le, 2015; Serban et al., 2016; Li et al., 2016), NER/POS-tagging (Gillick et al., 2016), image captioning (Vinyals et al., 2015b), video generation (Srivastava et al., 2015), and speech recognition (Chan et al., 2015a). We anticipate that methods described in this paper can be used to similarly train smaller models in other domains.