XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models can be a deterrent to using them in practice. Some recent and concurrent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multilingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations, that is agnostic of teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency, to name a few. We show that our approach leads to massive compression of teacher models like mBERT by up to 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95% of its F1-score for NER over 41 languages.


Introduction
Motivation: Pre-trained language models have shown state-of-the-art performance for various natural language processing applications like text classification, named entity recognition and question answering. A significant challenge facing practitioners is how to deploy these huge models in practice. For instance, models like BERT Large (Devlin et al., 2019), GPT-2 (Radford et al., 2019), Megatron (Shoeybi et al., 2019) and T5 (Raffel et al., 2019) have 340M, 1.5B, 8.3B and 11B parameters respectively. Although these models are trained offline, during prediction we need to traverse the deep neural network architecture stack involving a large number of parameters. This significantly increases latency and memory requirements.
Knowledge distillation (Hinton et al., 2015; Ba and Caruana, 2014), earlier used in computer vision, provides one of the techniques to compress huge neural networks into smaller ones. In this approach, shallow models (called students) are trained to mimic the output of huge models (called teachers) based on a transfer set. Similar approaches have been recently adopted for language model distillation.
Limitations of existing work: Recent works (Zhu et al., 2019; Tang et al., 2019; Turc et al., 2019) leverage soft logits from teachers as optimization targets for distilling students, with some notable exceptions from concurrent work. Sun et al. (2019); Sanh (2019); Aguilar et al. (2019); Zhao et al. (2019) additionally use internal representations from the teacher as additional signals. However, these methods are constrained by architectural considerations like the embedding dimension in BERT and transformer architectures. This makes it difficult to massively compress these models (without being able to reduce network width) or adopt alternate architectures. For instance, we observe BiLSTMs as students to be more accurate than Transformers for low-latency configurations. Some of the concurrent works (Turc et al., 2019; Zhao et al., 2019) adopt pre-training or dual training to distil students of arbitrary architecture. However, pre-training is expensive both in terms of time and computational resources.
Additionally, most of the above works are geared towards distilling language models for GLUE tasks (Wang et al., 2018). There has been some limited exploration of such techniques for sequence tagging tasks like NER (Izsak et al., 2019; Shi et al., 2019) or multilingual tasks (Tsai et al., 2019). However, these works also suffer from similar drawbacks as mentioned before.

Overview of XtremeDistil: In this work, we compare the distillation strategies used in all the above works and propose a new scheme that outperforms prior ones. In this scheme, we leverage teacher internal representations to transfer knowledge to the student. However, in contrast to prior work, we are not restricted by the choice of student architecture. This allows representation transfer from a Transformer-based teacher model to a BiLSTM-based student model with different embedding dimensions and disparate output spaces. We also propose a stage-wise optimization scheme to sequentially transfer the most general to the most task-specific information from teacher to student for better distillation.

Overview of our task: Unlike prior works mostly focusing on GLUE tasks in a single language, we employ our techniques to study distillation for massive multilingual Named Entity Recognition (NER) over 41 languages. Prior work on multilingual transfer for the same task (Rahimi et al., 2019) (MMNER) requires knowledge of the source and target languages, whereby pairs are judiciously selected for effective transfer, resulting in a customized model for each target language. In our work, we adopt Multilingual Bidirectional Encoder Representations from Transformers (mBERT) as our teacher and show that it is possible to perform language-agnostic joint NER for all languages with a single model that has similar performance but is massively compressed in contrast to mBERT and MMNER.
Perhaps the closest work to ours is that of Tsai et al. (2019), where mBERT is leveraged for multilingual NER. We discuss this in detail and use their strategy as one of our baselines. We show that our distillation strategy is better, leading to much higher compression and faster inference. We also investigate several unexplored dimensions of distillation, like the impact of unlabeled transfer data and annotation resources, the choice of multilingual word embeddings, architectural variations and inference latency, to name a few.
Our techniques obtain massive compression of teacher models like mBERT by up to 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95% of its performance for massive multilingual NER, and matching or outperforming it for classification tasks. Overall, our work makes the following contributions:
• Method: We propose a distillation method leveraging internal representations and parameter projection that is agnostic of teacher architecture.
• Optimization: To learn model parameters, we propose a stage-wise optimization schedule with gradual unfreezing that outperforms prior schemes.
• Experiments: We perform distillation for multilingual NER on 41 languages with massive compression and performance comparable to huge models. We also perform classification experiments on four datasets where our compressed models perform at par with significantly larger teachers.
• Study: We study the influence of several factors on distillation, like the availability of annotation resources for different languages, model architecture, quality of multilingual word embeddings, memory footprint and inference latency.

Problem Statement: Consider a sequence x = {x_k} with K tokens and y = {y_k} as the corresponding labels. Consider D_l = {x_{k,l}, y_{k,l}} to be a set of n labeled instances with X = {x_{k,l}} denoting the instances and Y = {y_{k,l}} the corresponding labels. Consider D_u = {x_{k,u}} to be a transfer set of N unlabeled instances from the same domain, where n ≪ N. Given a teacher T(θ^t), we want to train a student S(θ^s), with θ being the trainable parameters, such that |θ^s| ≪ |θ^t| and the student is comparable in performance to the teacher based on some evaluation metric. In the following sections, the superscript 't' always represents the teacher and 's' denotes the student.

Related Work
Model compression and knowledge distillation: Prior works in the vision community dealing with huge architectures like AlexNet and ResNet have addressed this challenge in two ways. Works in model compression use quantization (Gong et al., 2014), low-precision training and pruning the network, as well as their combination (Han et al., 2016) to reduce the memory footprint. On the other hand, works in knowledge distillation leverage student-teacher models. These approaches include using soft logits as targets (Ba and Caruana, 2014), increasing the temperature of the softmax to match that of the teacher (Hinton et al., 2015), as well as using teacher representations (Romero et al., 2015) (refer to Cheng et al. (2017) for a survey). Aguilar et al. (2019) train student models leveraging architectural knowledge of the teacher models, which adds architectural constraints (e.g., embedding dimension) on the student. In order to address this shortcoming, more recent works combine task-specific distillation with pre-training the student model with an arbitrary embedding dimension, but still rely on transformer architectures (Turc et al., 2019; Jiao et al., 2019; Zhao et al., 2019). Izsak et al. (2019); Shi et al. (2019) extend these techniques to sequence tagging for Part-of-Speech (POS) tagging and Named Entity Recognition (NER) in English. The one closest to our work, Tsai et al. (2019), extends the above for multilingual NER.
Most of these works rely on general corpora for pre-training and task-specific labeled data for distillation. To harness additional knowledge, Turc et al. (2019) leverage task-specific unlabeled data. Tang et al. (2019) and Jiao et al. (2019) use rule- and embedding-based data augmentation in the absence of such unlabeled data.

Models
The Student: The input to the model are E-dimensional word embeddings for each token. In order to capture sequential information in the sentence, we use a single-layer Bidirectional Long Short Term Memory network (BiLSTM). Given a sequence of K tokens, a BiLSTM computes a set of K vectors h(x_k) as the concatenation of the states generated by a forward LSTM (→h(x_k)) and a backward LSTM (←h(x_k)). Assuming the number of hidden units in the LSTM to be H, each hidden state h(x_k) is of dimension 2H. The probability distribution for the token label at timestep k is given by:

p^s(x_k) = softmax(W^s · h(x_k))    (1)

where W^s ∈ R^{C×2H} and C is the number of labels. Consider the one-hot encoding of the token labels, such that y_{k,l,c} = 1 for y_{k,l} = c, and y_{k,l,c} = 0 otherwise, for c ∈ C. The overall cross-entropy loss, computed over each token obtaining a specific label in each sequence, is given by:

L_CE = − Σ_{x_l ∈ D_l} Σ_k Σ_{c ∈ C} y_{k,l,c} · log p^s_c(x_{k,l})    (2)

We train the student model end-to-end by minimizing the above cross-entropy loss over labeled data.

The Teacher: Pre-trained language models like ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019) have shown state-of-the-art performance for several tasks. We adopt BERT as the teacher, specifically the multilingual version of BERT (mBERT) with 179M parameters trained over the 104 languages with the largest Wikipedias. mBERT does not use any markers to distinguish languages during pre-training and learns a single language-agnostic model trained via masked language modeling over Wikipedia articles from all languages.

Tokenization: Similar to mBERT, we use WordPiece tokenization with a 110K shared WordPiece vocabulary. We preserve casing, remove accents, and split on punctuation and whitespace.

Fine-tuning the Teacher: The pre-trained language models are trained for general language modeling objectives. In order to adapt them to the given task, the teacher is fine-tuned end-to-end with task-specific labeled data D_l to learn parameters θ^t using the cross-entropy loss as in Equation 2.
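For concreteness, the following is a minimal sketch of such a student tagger in TensorFlow/Keras; the specific dimensions (E, H) and the maximum sequence length are illustrative and not necessarily the configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 110_000                   # shared WordPiece vocabulary
E, H, C, MAX_LEN = 300, 600, 11, 32    # illustrative sizes

def build_student():
    """Single-layer BiLSTM tagger: E-dim embeddings -> 2H-dim states h(x_k) -> C-way softmax per token."""
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    emb = layers.Embedding(VOCAB_SIZE, E, mask_zero=True)(tokens)                   # word embeddings (theta_w)
    states = layers.Bidirectional(layers.LSTM(H, return_sequences=True))(emb)       # h(x_k) of dimension 2H
    probs = layers.TimeDistributed(layers.Dense(C, activation="softmax"))(states)   # softmax(W^s . h(x_k))
    return tf.keras.Model(tokens, probs)

student = build_student()
student.compile(optimizer="adam", loss="sparse_categorical_crossentropy")  # cross-entropy loss of Eq. (2)
```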

Distillation Features
Fine-tuning the teacher gives us access to its task-specific representations for distilling the student model. To this end, we use different kinds of information from the teacher.

Teacher Logits
Logits, as logarithms of predicted probabilities, provide a better view of the teacher by emphasizing the different relationships learned by it across different instances. Consider p^t(x_k) to be the classification probability of token x_k as generated by the fine-tuned teacher, with logit(p^t(x_k)) representing the corresponding logits. Our objective is to train a student model with these logits as targets. Given the hidden state representation h(x_k) for token x_k, we can obtain the corresponding classification score (since the targets are logits) as:

r^s(x_k) = W^r · h(x_k) + b^r    (3)

where W^r ∈ R^{C×2H} and b^r ∈ R^C are trainable parameters and C is the number of classes. We want to train the student neural network end-to-end by minimizing the element-wise mean-squared error between the classification scores given by the student and the target logits from the teacher:

L_LL = Σ_{x_u ∈ D_u} Σ_k || r^s(x_{k,u}) − logit(p^t(x_{k,u})) ||²    (4)
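As a sketch (not the authors' exact implementation), this logit loss can be expressed as an element-wise squared error between the student scores and the teacher logits, averaged over tokens:

```python
import tensorflow as tf

def logit_loss(student_scores, teacher_logits, mask=None):
    """Squared error between student scores r^s(x_k) and teacher logits, both of shape (batch, K, C).
    The optional (batch, K) mask for excluding padding tokens is an assumption."""
    sq_err = tf.reduce_sum(tf.square(student_scores - teacher_logits), axis=-1)
    if mask is not None:
        sq_err = sq_err * tf.cast(mask, sq_err.dtype)
    return tf.reduce_mean(sq_err)
```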

Internal Teacher Representations
Hidden representations: Recent works (Sun et al., 2019; Romero et al., 2015) have shown the hidden state information from the teacher to be helpful as hint-based guidance for the student. Given a large collection of task-specific unlabeled data, we can transfer the teacher's knowledge to the student via its hidden representations. However, this poses a challenge in our setting as the teacher and student models have different architectures with disparate output spaces. Consider h^s(x_k) and z^t_l(x_k; θ^t) to be the representations generated by the student and the l-th deep layer of the fine-tuned teacher, respectively, for a token x_k. Consider x_u ∈ D_u to be the set of unlabeled instances. We will later discuss the choice of the teacher layer l and its impact on distillation.

Projection: To make the output spaces compatible, we perform a non-linear projection of the student representation h^s to have the same shape as the teacher representation z^t_l for each token x_k:

z^s(x_k) = Gelu(W^f · h^s(x_k) + b^f)    (5)

where W^f ∈ R^{|z^t_l| × 2H} is the projection matrix, b^f ∈ R^{|z^t_l|} is the bias, and Gelu (Gaussian Error Linear Unit) (Hendrycks and Gimpel, 2016) is the non-linear projection function. |z^t_l| represents the embedding dimension of the teacher. This transformation aligns the output spaces of the student and teacher and allows us to accommodate an arbitrary student architecture. Also note that the projections (and therefore the parameters) are shared across tokens at different timesteps.
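A minimal sketch of this projection, assuming a dense layer with GELU activation shared across timesteps (a recent TensorFlow with the built-in "gelu" activation is assumed; layer names are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

TEACHER_DIM = 768   # |z^t_l|, the embedding dimension of the teacher

# One shared projection <W^f, b^f> applied to every token representation h^s(x_k).
project = layers.TimeDistributed(layers.Dense(TEACHER_DIM, activation="gelu"))

def student_to_teacher_space(student_states):
    """Map (batch, K, 2H) student states to the (batch, K, 768) teacher representation space."""
    return project(student_states)
```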
The projection parameters are learned by minimizing the KL-divergence (KLD) between the student and the l-th layer teacher representations:

L_RL = Σ_{x_u ∈ D_u} Σ_k KLD( z^s(x_{k,u}), z^t_l(x_{k,u}; θ^t) )    (6)

Multilingual word embeddings: A large number of parameters reside in the word embeddings. For mBERT, a shared multilingual WordPiece vocabulary of V = 110K tokens and an embedding dimension of D = 768 lead to 92M parameters. To obtain massive compression, we cannot directly incorporate the mBERT embeddings in our model. Since we use the same WordPiece vocabulary, we are likely to benefit more from these embeddings than from GloVe (Pennington et al., 2014) or FastText (Bojanowski et al., 2016).
We use a dimensionality reduction algorithm, Singular Value Decomposition (SVD), to project the mBERT word embeddings to a lower-dimensional space. Given the mBERT word embedding matrix of dimension V × D, SVD finds the best E-dimensional representation that minimizes the sum of squares of the projections (of rows) to the subspace.
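As an illustration of this reduction, a truncated SVD over the V × D embedding matrix keeps the top E singular directions (a sketch; the exact procedure may differ):

```python
import numpy as np

def reduce_embeddings(mbert_embeddings: np.ndarray, E: int) -> np.ndarray:
    """Project a (V, D) embedding matrix to (V, E) via truncated SVD.
    U[:, :E] * S[:E] is the best rank-E least-squares approximation of the rows."""
    U, S, _ = np.linalg.svd(mbert_embeddings, full_matrices=False)
    return U[:, :E] * S[:E]

# Usage sketch: reduced = reduce_embeddings(mbert_matrix, E=300) to initialize the student embeddings.
```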

Training
We want to optimize the loss functions for representations (L_RL), logits (L_LL) and cross-entropy (L_CE). These optimizations can be scheduled differently to obtain different training regimens, as follows.

Joint Optimization
In this regimen, we optimize the following losses jointly:

L = α · L_RL(D_u) + β · L_LL(D_u) + γ · L_CE(D_l)    (7)

where α, β and γ weigh the contribution of the different losses. A high value of α makes the student focus more on the easy targets, whereas a high value of γ shifts the focus to the difficult ones. The above loss is computed over two different task-specific data segments: the cross-entropy loss over labeled data, and the representation and logit losses over unlabeled data.
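The weighted combination can be sketched as below; how labeled and unlabeled batches are interleaved during training is an implementation detail not fixed here, and the weight-to-loss mapping follows the equation above.

```python
def joint_loss(rl_loss, ll_loss, ce_loss, alpha=1.0, beta=1.0, gamma=1.0):
    """Joint objective: representation (L_RL) and logit (L_LL) losses over unlabeled batches,
    cross-entropy (L_CE) over labeled batches, weighted by alpha, beta and gamma."""
    return alpha * rl_loss + beta * ll_loss + gamma * ce_loss
```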

Stage-wise Training
Instead of optimizing all loss functions jointly, we propose a stage-wise scheme to gradually transfer the most general to the most task-specific representations from teacher to student. In this scheme, we first train the student to mimic the teacher representations from its l-th layer by optimizing L_RL on unlabeled data. In this stage the student learns the parameters for the word embeddings (θ_w), the BiLSTM and the projection (W^f, b^f). In the second stage, we optimize the cross-entropy loss L_CE and the logit loss L_LL jointly, on labeled and unlabeled data respectively, to learn the corresponding parameters W^s and W^r, b^r.
The second stage above can be further broken down into two stages, where we sequentially optimize the logit loss L_LL on unlabeled data and then optimize the cross-entropy loss L_CE on labeled data. Every stage learns parameters conditioned on those learned in the previous stage, followed by end-to-end fine-tuning.
Algorithm 1: Stage-wise distillation with gradual unfreezing.

Fine-tune teacher on D_l and update θ^t;
for stage in {1, 2, 3} do
    Freeze all student layers l ∈ {1 · · · L};
    if stage = 1 then
        output = z^s(x_u); target = teacher representations on D_u from the l-th layer, z^t_l(x_u; θ^t); loss = L_RL;
    end
    if stage = 2 then
        output = r^s(x_u); target = teacher logits on D_u, logit(p^t(x_u; θ^t)); loss = L_LL;
    end
    if stage = 3 then
        output = p^s(x_l); target = ground-truth labels y_l on D_l; loss = L_CE;
    end
    for layer l from L down to 1 do
        Unfreeze layer l;
        Update parameters θ^s_l, θ^s_{l+1}, ..., θ^s_L by minimizing the loss between the student output and the target;
    end
end

Gradual Unfreezing
One potential drawback of end-to-end fine-tuning for stage-wise optimization is 'catastrophic forgetting' (Howard and Ruder, 2018), where the model forgets information learned in earlier stages. To address this, we adopt gradual unfreezing, where we tune the model one layer at a time, starting from the configuration at the end of the previous stage.
We start from the top layer that contains the most task-specific information and allow the model to configure the task-specific layer first while the others remain frozen. The remaining layers are gradually unfrozen one by one and the model is trained till convergence. Once a layer is unfrozen, it stays unfrozen in subsequent steps. When the last layer (word embeddings) is unfrozen, the entire network is trained end-to-end. The order of this unfreezing scheme (top-to-bottom) is the reverse of that in Howard and Ruder (2018), and we find this to work better in our setting with the following intuition. At the end of the first stage of optimizing L_RL, the student has learned to generate representations similar to those of the l-th layer of the teacher. Now, we need to add only a few task-specific parameters (W^r, b^r) to optimize for the logit loss L_LL with all other layers frozen. Next, we gradually give the student more flexibility to optimize for the task-specific loss by tuning the layers below, where the number of parameters increases. We tune each layer for n epochs and restore the model to the best configuration based on the validation loss on a held-out set. Therefore, the model retains the best possible performance from any iteration. Algorithm 1 shows the overall processing scheme.
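A simplified sketch of one stage with gradual unfreezing in Keras follows; the helper make_datasets and the bottom-up ordering of student.layers (embeddings first, output layer last) are assumptions for illustration.

```python
import tensorflow as tf

def run_stage_with_unfreezing(student, make_datasets, loss_fn, epochs_per_layer=2):
    """Unfreeze layers one at a time from the top (task-specific) layer down to the word
    embeddings, training after each unfreeze and restoring the best weights by validation loss."""
    for layer in student.layers:
        layer.trainable = False

    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True)

    # Top-to-bottom order: the reverse of Howard and Ruder (2018).
    for layer in reversed(student.layers):
        layer.trainable = True                             # once unfrozen, a layer stays unfrozen
        student.compile(optimizer="adam", loss=loss_fn)    # re-compile after changing trainability
        train_ds, val_ds = make_datasets()
        student.fit(train_ds, validation_data=val_ds,
                    epochs=epochs_per_layer, callbacks=[checkpoint])
        student.load_weights("best.weights.h5")            # retain the best configuration seen so far
```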

Experiments
Dataset Description: We evaluate our model XtremeDistil for multilingual NER on 41 languages and the same setting as in (Rahimi et al., 2019). This data has been derived from the WikiAnn NER corpus and partitioned into training, development and test sets. All the NER results are reported on this test set for a fair comparison with existing works. We report both the average F1-score (µ) and the standard deviation (σ) between scores across the 41 languages for phrase-level evaluation. Refer to Figure 2 for language codes and the distribution of training labels across languages. We also perform experiments with data from four other domains (refer to Table 1): IMDB (Maas et al., 2011), SST-2 (Socher et al., 2013) and Elec (McAuley and Leskovec, 2013) for sentiment analysis of movie and electronics product reviews, and DBpedia (Zhang et al., 2015) and AG News (Zhang et al., 2015) for topic classification.

NER tags: The NER corpus uses the IOB2 tagging strategy with entities like LOC, ORG and PER. Following mBERT, we do not use language markers and share these tags across all languages. We use additional syntactic markers {CLS, SEP, PAD} and 'X' for marking segmented wordpieces, contributing a total of 11 tags (with a shared 'O').

Evaluating Distillation Strategies
Baselines: A trivial baseline (D0) is to learn one model per language using only the corresponding labels for learning. This can be improved by merging all instances and sharing information across all languages (D0-S). Most of the concurrent and recent works (refer to Table 2 for an overview) leverage logits as optimization targets for distillation (D1). A few exceptions also use teacher internal representations along with soft logits (D2). For our model we consider multi-stage distillation, where we first optimize the representation loss followed by jointly optimizing the logit and cross-entropy losses (D3.1), and further improve it by gradual unfreezing of the neural network layers (D3.2). Finally, we optimize the loss functions sequentially in three stages (D4.1) and improve it further by the unfreezing mechanism (D4.2). We further compare all strategies while varying the amount of unlabeled transfer data for distillation (hyper-parameter settings in the Appendix).

Results: From Table 3, we observe that all strategies that share information across languages work better (D0-S vs. D0), with the soft logits adding more value than hard targets (D1 vs. D0-S). Interestingly, we observe that simply combining the representation loss with logits (D3.1 vs. D2) hurts the model. We observe this strategy to be vulnerable to the hyper-parameters (α, β, γ in Eqn. 7) used to combine multiple loss functions. We vary the hyper-parameters in multiples of 10 and report the best numbers. Stage-wise optimization removes these hyper-parameters and improves performance. We also observe the gradual unfreezing scheme to improve both stage-wise distillation strategies significantly.

Focusing on the data dimension, we observe all models to improve as more and more unlabeled data is used for transferring teacher knowledge to the student. However, we also observe the improvement to slow down after a point, where additional unlabeled data does not yield significant benefits. Table 4 shows the gradual performance improvement in XtremeDistil after every stage and after unfreezing various neural network layers.

Performance, Compression and Speedup
Performance: We observe XtremeDistil in Table 5 to perform competitively with the other models. mBERT-single models are fine-tuned per language with the corresponding labels, whereas mBERT is fine-tuned with data across all languages. MMNER results are reported from Rahimi et al. (2019). Figure 2 shows the variation in F1-score across different languages with variable amounts of training data for the different models. We observe all the models to follow the general trend, with some aberrations for languages with fewer training labels.

Parameter compression: XtremeDistil performs at par with MMNER in terms of F1-score while obtaining at least 41x compression. Given L languages, MMNER learns (L − 1) ensembled and distilled models, one for each target language. Each of the MMNER language-specific models is comparable in size to our single multilingual model. We learn a single model for all languages, thereby obtaining a compression factor of at least L = 41. Figure 1a shows the variation in F1-scores of XtremeDistil and compression against mBERT with different configurations corresponding to the embedding dimension (E) and the number of BiLSTM hidden states (2×H). We observe that reducing the embedding dimension leads to great compression with minimal performance loss, whereas reducing the BiLSTM hidden states impacts the performance more and contributes less to the compression.
Inference speedup: We compare the runtime inference efficiency of mBERT and our model on a single P100 GPU for batch inference (batch size = 32) on 1000 queries of sequence length 32. We average the time taken for predicting labels for all the queries for each model, aggregated over 100 runs. Compared to batch inference, the speedups are smaller for online inference (batch size = 1) at 17x on an Intel(R) Xeon(R) CPU (E5-2690 v4 @2.60GHz) (refer to the Appendix for details).

Table 6: F1-score comparison for the low-resource setting with 100 labeled samples per language and transfer sets of different sizes for XtremeDistil.

Figure 1b shows the variation in F1-scores of XtremeDistil and inference speedup against mBERT with different (linked) parameter configurations as before. As expected, the performance degrades with gradual speedup. We observe that parameter compression does not necessarily lead to an inference speedup. A reduction in the word embedding dimension leads to massive model compression; however, it does not have a similar effect on the latency. The BiLSTM hidden states, on the other hand, constitute the real latency bottleneck. One of the best configurations leads to 35x compression and 51x speedup over mBERT while retaining nearly 95% of its performance.
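Batch-inference latency of the kind reported above can be measured with a harness along the following lines (the query generation, warm-up run and averaging details here are assumptions, not the exact setup used):

```python
import time
import numpy as np

def measure_batch_latency(model, num_queries=1000, seq_len=32, batch_size=32, runs=100):
    """Average wall-clock time to label `num_queries` sequences with batched inference."""
    queries = np.random.randint(1, 110_000, size=(num_queries, seq_len))
    model.predict(queries[:batch_size], batch_size=batch_size)   # warm-up run
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model.predict(queries, batch_size=batch_size)
        timings.append(time.perf_counter() - start)
    return float(np.mean(timings))
```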

Low-resource NER and Distillation
Models in all prior experiments are trained on 705K labeled instances across all languages. In this setting, we consider only 100 labeled samples for each language with a total of 4.1K instances. From Table 6, we observe mBERT to outperform MMNER by more than 17 percentage points with XtremeDistil closely following suit.
Furthermore, we observe our model's performance to improve with the transfer set size, depicting the importance of unlabeled transfer data for knowledge distillation. As before, additional data beyond a point has only a marginal contribution.

Architectural Considerations
Which teacher layer to distil from? The topmost teacher layer captures more task-specific knowledge. However, it may be difficult for a shallow student to capture this knowledge given its limited capacity. On the other hand, the less deep representations in the middle of the teacher model are easier to mimic for a shallow student. From Table 8 we observe the student to benefit most from distilling the 6th or 7th layer of the teacher.

Which student architecture to use for distillation? Recent works in distillation leverage both BiLSTMs and Transformers as students. In this experiment, we vary the embedding dimension and hidden states for BiLSTM-based students, and the embedding dimension and depth for Transformer-based students, to obtain configurations with similar inference latency. Each of the 13 configurations in Figure 3 depicts the F1-scores obtained by students of different architectures but similar latency (refer to Table 15 in the Appendix for statistics), for strategy D0-S in Table 3. We observe that for low-latency configurations, BiLSTMs with hidden states {2×100, 2×200} work better than 2-layer Transformers, whereas the latter start performing better with more than 3 layers, although with a higher latency compared to the aforementioned BiLSTM configurations.

Distillation for Text Classification
We now switch gears and focus on classification tasks. In contrast to sequence tagging, we use the last hidden state of the BiLSTM as the final sentence representation for projection, regression and softmax. Table 9 shows the distillation performance of XtremeDistil with different teachers on four benchmark text classification datasets. We observe the student to almost match the teacher performance for all of the datasets. The performance also improves with a better teacher, although the improvement is marginal as the student capacity saturates.

Table 10 shows the distillation performance with only 500 labeled samples per class. The distilled student improves over the non-distilled version by 19.4 percent and matches the teacher performance for all of the tasks, demonstrating the impact of distillation in low-resource settings.

Comparison with other distillation techniques: SST-2 (Socher et al., 2013) from GLUE (Wang et al., 2018) has been used as a test bed for other distillation techniques for single-instance classification tasks (as in this work). Table 11 shows the accuracy comparison of such methods reported on the SST-2 development set with the same teacher.
We extract 11.7M sentences from all the IMDB movie reviews in Table 1 to form the unlabeled transfer set for distillation. We obtain the best performance on distilling with BERT Large (uncased, whole-word masking model) rather than BERT Base, demonstrating a better student performance with a better teacher and outperforming other methods.

Summary
Teacher hidden representations and distillation schedule: Internal teacher representations help in distillation, although a naive combination hurts the student model. We show that a distillation schedule with stage-wise optimization and gradual unfreezing with a cosine learning rate scheduler (D4.1 + D4.2 in Table 3) obtains the best performance. We also show that the middle layers of the teacher are easier to distil for shallow students and result in the best performance (Table 8). Additionally, the student performance improves with bigger and better teachers (Tables 9 and 11).

Student architecture: We compare different student architectures like BiLSTM and Transformer in terms of configuration and performance (Figure 3, Table 15 in the Appendix), and observe the BiLSTM to perform better at low-latency configurations, whereas the Transformer outperforms it with more depth and a higher latency budget.

Unlabeled transfer data: We explored the data dimension in Tables 3 and 6 and observed unlabeled data to be the key for knowledge transfer from deep pre-trained teachers to shallow students and for bridging the performance gap.
We observed that a moderate amount of unlabeled transfer samples (0.7 to 1.5 million) leads to the best student, whereas larger amounts of transfer data do not result in significant gains. This is particularly helpful for low-resource NER (with only 100 labeled samples per language, as in Table 6).

Performance trade-off: Parameter compression does not necessarily reduce inference latency, and vice versa. We explore model performance in terms of parameter compression, inference latency and F1 to show the trade-offs of distillation in Figure 1 and Table 16 in the Appendix.

Multilingual word embeddings: Random initialization of word embeddings works well. A better initialization, which is also parameter-efficient, is given by Singular Value Decomposition (SVD) over fine-tuned mBERT word embeddings, with the best performance on the downstream task (Table 7).

Generalization: The outlined distillation techniques and strategies are model-, architecture- and language-agnostic and can be easily extended to arbitrary tasks and languages, although we only focus on NER and classification in this work.

Massive compression: Our techniques demonstrate massive compression (35x for parameters) and inference speedup (51x for latency) while retaining 95% of the teacher performance, allowing deep pre-trained models to be deployed in practice.

Conclusions
We develop XtremeDistil for massive multilingual NER and classification, which performs close to huge pre-trained models like mBERT but with massive compression and inference speedup. Our distillation strategy, leveraging teacher representations agnostic of its architecture together with a stage-wise optimization schedule, outperforms existing ones. We perform an extensive study of several distillation dimensions, like the impact of the unlabeled transfer set, embeddings and student architectures, and make interesting observations outlined in the summary.

A.1 Implementation
The model uses a TensorFlow backend. Code and resources are available at: https://aka.ms/XtremeDistil.

A.2 Parameter Configurations
All the analyses in the paper, except the compression and speedup experiments that vary the embedding dimension E and the BiLSTM hidden states H, are done with the model configuration in Table 12 with the best F1-score. The Adam optimizer is used with a cosine learning rate scheduler (lr_high = 0.001, lr_low = 1e-8).
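A sketch of this optimizer setup in a recent TensorFlow; the total number of decay steps is an assumption:

```python
import tensorflow as tf

LR_HIGH, LR_LOW, TOTAL_STEPS = 1e-3, 1e-8, 50_000   # decay steps are illustrative

# Cosine schedule decaying from lr_high towards lr_low (alpha is the final fraction of the initial rate).
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=LR_HIGH, decay_steps=TOTAL_STEPS, alpha=LR_LOW / LR_HIGH)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```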