Structure-Level Knowledge Distillation For Multilingual Sequence Labeling

Multilingual sequence labeling is the task of predicting label sequences using a single unified model for multiple languages. Compared with relying on multiple monolingual models, a multilingual model offers a smaller model size, easier online serving, and generalizability to low-resource languages. However, current multilingual models still significantly underperform individual monolingual models due to model capacity limitations. In this paper, we propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student). We propose two novel knowledge distillation (KD) methods based on structure-level information: (1) approximately minimizing the distance between the student’s and the teachers’ structure-level probability distributions, and (2) aggregating the structure-level knowledge into local distributions and minimizing the distance between the two local probability distributions. Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and the teacher models.


Introduction
Sequence labeling is an important task in natural language processing. Many tasks such as named entity recognition (NER) and part-of-speech (POS) tagging can be formulated as sequence labeling problems, and these tasks provide extra information to many downstream tasks and products such as search engines, chatbots and syntactic parsing (Jurafsky and Martin, 2009). Most of the previous work on sequence labeling focused on monolingual models, and the work on multilingual sequence labeling mainly focused on cross-lingual transfer learning to improve the performance of low-resource or zero-resource languages (Johnson et al., 2019; Huang et al., 2019a; Rahimi et al., 2019; Huang et al., 2019b; Keung et al., 2019), but that work still trains monolingual models. Training monolingual models for all of the 7,000+ languages in the world would be very resource consuming. Besides, many languages have only limited labeled data for training. It is therefore beneficial to have a single unified multilingual sequence labeling model that handles multiple languages, yet unified multilingual models have received less attention because of the significant differences between languages. Recently, Multilingual BERT (M-BERT) (Devlin et al., 2019) has been shown to be surprisingly good at zero-shot cross-lingual model transfer on tasks such as NER and POS tagging (Pires et al., 2019). M-BERT bridges multiple languages and makes it possible to train a multilingual sequence labeling model with high performance (Wu and Dredze, 2019). However, the accuracy of the multilingual model is still inferior to that of monolingual models that utilize different kinds of strong pretrained word representations, such as the contextual string embeddings (Flair) proposed by Akbik et al. (2018).
To diminish the performance gap between monolingual and multilingual models, we propose to utilize knowledge distillation to transfer the knowledge from several monolingual models with strong word representations into a single multilingual model. Knowledge distillation (Buciluǎ et al., 2006;Hinton et al., 2015) is a technique that first trains a strong teacher model and then trains a weak student model through mimicking the output probabilities (Hinton et al., 2015;Lan et al., 2018;Mirzadeh et al., 2019) or hidden states (Romero et al., 2014;Seunghyun Lee, 2019) of the teacher model. The student model can achieve an accuracy comparable to that of the teacher model and usually has a smaller model size through KD. Inspired by KD applied in neural machine translation (NMT) (Kim and Rush, 2016) and multilingual NMT (Tan et al., 2019), our approach contains a set of monolingual teacher models, one for each language, and a single multilingual student model. Both groups of models are based on BiLSTM-CRF (Lample et al., 2016;Ma and Hovy, 2016), one of the state-of-the-art models in sequence labeling. In BiLSTM-CRF, the CRF layer models the relation between neighbouring labels which leads to better results than simply predicting each label separately based on the BiLSTM outputs. However, the CRF structure models the label sequence globally with the correlations between neighboring labels, which increases the difficulty in distilling the knowledge from the teacher models. In this paper, we propose two novel KD approaches that take structure-level knowledge into consideration for multilingual sequence labeling. To share the structure-level knowledge, we either minimize the difference between the student's and the teachers' distribution of global sequence structure directly through an approximation approach or aggregate the global sequence structure into local posterior distributions and minimize the difference of aggregated local knowledge. 
Experimental results show that our proposed approach boosts the performance of the multilingual model in 4 tasks with 25 datasets. Furthermore, our approach has better performance in zero-shot transfer compared with the baseline multilingual model and several monolingual teacher models.

Sequence Labeling
BiLSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016) is one of the most popular approaches to sequence labeling. Given a sequence of $n$ word tokens $x = \{x_1, \cdots, x_n\}$ and the corresponding sequence of gold labels $y^* = \{y^*_1, \cdots, y^*_n\}$, we first feed the token representations of $x$ into a BiLSTM to get the contextual token representations $r = \{r_1, \cdots, r_n\}$. The conditional probability $p(y|x)$ is defined by:

$$\psi(y', y, r_i) = \exp(W_y^T r_i + b_{y',y}) \tag{1}$$

$$p(y|x) = \frac{\prod_{i=1}^{n} \psi(y_{i-1}, y_i, r_i)}{\sum_{y' \in \mathcal{Y}(x)} \prod_{i=1}^{n} \psi(y'_{i-1}, y'_i, r_i)} \tag{2}$$

where $\mathcal{Y}(x)$ denotes the set of all possible label sequences for $x$, $\psi$ is the potential function, $W_y$ and $b_{y',y}$ are parameters, and $y_0$ is defined to be a special start symbol. $W_y^T r_i$ and $b_{y',y}$ are usually called emission and transition scores respectively. During training, the negative log-likelihood loss for an input sequence is defined by:

$$\mathcal{L}_{\text{NLL}} = -\log p(y^*|x)$$

The BiLSTM-Softmax approach to sequence labeling reduces the task to a set of independent label classification problems by disregarding label transitions and simply feeding the emission scores $W^T r_i$ into a softmax layer to get the probability distribution of each variable $y_i$:

$$p(y_i|x) = \text{softmax}(W^T r_i) \tag{3}$$

The loss function then becomes:

$$\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^{n} \log p(y^*_i|x)$$

In spite of its simplicity, this approach ignores correlations between neighboring labels and hence does not adequately model the sequence structure. Consequently, it empirically underperforms the first approach in many applications.
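To make the difference between the two losses concrete, the following minimal Python sketch evaluates both on a toy example in log space. The score tables `emissions` (standing in for the emission scores), `transitions` (the transition scores) and `start` (the transition from the start symbol) hold made-up numbers, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))


def sequence_score(emissions, transitions, start, labels):
    """Log of the unnormalized product of potentials for one label sequence."""
    score = start[labels[0]] + emissions[0][labels[0]]
    for i in range(1, len(labels)):
        score += transitions[labels[i - 1]][labels[i]] + emissions[i][labels[i]]
    return score


def log_partition(emissions, transitions, start):
    """Log of the denominator of the CRF probability, via the forward algorithm."""
    num_labels = len(start)
    alpha = [start[y] + emissions[0][y] for y in range(num_labels)]
    for i in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[p] + transitions[p][y] for p in range(num_labels)])
            + emissions[i][y]
            for y in range(num_labels)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, start, gold):
    """Sequence-level negative log-likelihood of the gold label sequence."""
    return log_partition(emissions, transitions, start) - sequence_score(
        emissions, transitions, start, gold
    )


def softmax_nll(emissions, gold):
    """Token-level loss that ignores transitions (BiLSTM-Softmax)."""
    return sum(logsumexp(row) - row[g] for row, g in zip(emissions, gold))


# Toy scores (assumed values): 3 tokens, 2 labels.
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
transitions = [[0.6, -0.4], [-0.2, 0.8]]
start = [0.0, 0.0]

print(crf_nll(emissions, transitions, start, (0, 1, 1)))
print(softmax_nll(emissions, (0, 1, 1)))
```

The forward recursion keeps the cost linear in the sentence length, whereas summing over all label sequences directly would be exponential.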

Knowledge Distillation
A typical approach to KD is training a student network by imitating a teacher's predictions (Hinton et al., 2015). The simplest approach to KD on BiLSTM-Softmax sequence labeling follows Eq. 3 and performs token-level distillation by minimizing the cross-entropy loss between the individual label distributions predicted by the teacher model and the student model:

$$\mathcal{L}_{\text{Token}} = -\sum_{i=1}^{n} \sum_{j=1}^{|V|} p_t(y_i = j|x) \log p_s(y_i = j|x) \tag{4}$$

where $p_t(y_i = j|x)$ and $p_s(y_i = j|x)$ are the label distributions predicted by the teacher model and the student model respectively and $|V|$ is the number of possible labels. The final loss of the student model combines the KD loss and the negative log-likelihood loss:

$$\mathcal{L} = \lambda \mathcal{L}_{\text{Token}} + (1 - \lambda) \mathcal{L}_{\text{NLL}}$$

where $\lambda$ is a hyperparameter. As pointed out in Section 2.1, however, sequence labeling based on Eq. 3 has the problem of ignoring structure-level knowledge. In the BiLSTM-CRF approach, we can also apply Emission distillation by feeding the emission scores into Eq. 3 to get emission probabilities $\tilde{p}(y_i|x)$; the loss function then becomes:

$$\mathcal{L}_{\text{Emission}} = -\sum_{i=1}^{n} \sum_{j=1}^{|V|} \tilde{p}_t(y_i = j|x) \log \tilde{p}_s(y_i = j|x) \tag{5}$$
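The token-level loss can be illustrated with a short Python sketch. The per-token score rows below are made-up numbers and the function names are assumptions chosen for this example; the same function covers Emission distillation if the rows are CRF emission scores.

```python
import math


def softmax(scores):
    """Turn one row of label scores into a probability distribution."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_kd_loss(teacher_scores, student_scores):
    """Cross-entropy between teacher and student label distributions,
    summed over tokens and labels."""
    loss = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        p_t = softmax(t_row)
        p_s = softmax(s_row)
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return loss


# Toy scores (assumed values): 2 tokens, 3 labels.
teacher_scores = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_scores = [[1.0, 1.0, 0.0], [0.0, 0.5, 0.5]]

print(token_kd_loss(teacher_scores, student_scores))
```

The loss is smallest when the student reproduces the teacher's distribution at every token, in which case it equals the sum of the teacher's per-token entropies.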

Approach
In this section, we propose two approaches to learning a single multilingual sequence labeling model (student) by distilling structure-level knowledge from multiple monolingual models. The first approach approximately minimizes the difference between structure-level probability distributions predicted by the student and teachers. The second aggregates structure-level knowledge into local posterior distributions and then minimizes the difference between local distributions produced by the student and teachers. Our approaches are illustrated in Figure 1.
Both the student and the teachers are BiLSTM-CRF models (Lample et al., 2016;Ma and Hovy, 2016), one of the state-of-the-art models in sequence labeling. A BiLSTM-CRF predicts the distribution of the whole label sequence structure, so token-level distillation is no longer possible and structure-level distillation is required.

Top-K Distillation
Inspired by Kim and Rush (2016), we propose to encourage the student to mimic the teachers' global structural probability distribution over all possible label sequences:

$$\mathcal{L}_{\text{Str}} = -\sum_{y \in \mathcal{Y}(x)} p_t(y|x) \log p_s(y|x) \tag{6}$$

However, $|\mathcal{Y}(x)|$ is exponentially large as it represents all possible label sequences. We propose two methods to alleviate this issue through efficient approximations of $p_t(y|x)$ using the k-best label sequences.

Top-K Eq. 6 can be seen as computing the expected student log probability with respect to the teacher's structural distribution:

$$\mathcal{L}_{\text{Str}} = -\mathbb{E}_{p_t(y|x)}[\log p_s(y|x)] \tag{7}$$

The expectation can be approximated by sampling from the teacher's distribution $p_t(y|x)$. However, unbiased sampling from the distribution is difficult. We instead apply a biased approach that regards the k-best label sequences predicted by the teacher model as our samples. We use a modified Viterbi algorithm to predict the k-best label sequences $\mathcal{T} = \{\hat{y}_1, \ldots, \hat{y}_k\}$. Eq. 7 is then approximated as:

$$\mathcal{L}_{\text{Top-K}} = -\frac{1}{k} \sum_{\hat{y} \in \mathcal{T}} \log p_s(\hat{y}|x) \tag{8}$$

This can also be seen as data augmentation through generating k pseudo target label sequences for each input sentence by the teacher.

Table 1: Example of computing the structural knowledge for a sequence of 3 tokens with a label set of {T, F}. $\psi(y_{k-1}, y_k, r_k)$ represents the potential formulated in Eq. 1. Each Label Seq. Probs. is defined in Eq. 2 for the corresponding label sequence. Top-2 represents the two label sequences with the highest scores and Weights are their corresponding weights for KD (Eq. 8, 9). $\alpha(y_k)$, $\beta(y_k)$ and the posterior distribution $q(y_k|x)$ are computed based on Eq. 11, 12 and 10 respectively. We assume that $\psi(y_0, y_1, r_1) = 1$ regardless of whether $y_1$ is T or F.

Weighted Top-K The Top-K method is highly biased in that the approximation becomes worse with a larger k. A better method is to associate weights with the k samples to better approximate $p_t(y|x)$:

$$p'_t(\hat{y}_i|x) = \frac{p_t(\hat{y}_i|x)}{\sum_{\hat{y} \in \mathcal{T}} p_t(\hat{y}|x)}$$

Eq. 7 is then approximated as:

$$\mathcal{L}_{\text{Top-WK}} = -\sum_{\hat{y} \in \mathcal{T}} p'_t(\hat{y}|x) \log p_s(\hat{y}|x) \tag{9}$$

This can be seen as the student learning weighted pseudo target label sequences produced by the teacher for each input sentence.

The Top-K approach is related to the previous work on model compression in neural machine translation (Kim and Rush, 2016) and multilingual neural machine translation (Tan et al., 2019). In neural machine translation, producing the k-best label sequences is intractable in general, and in practice beam search decoding has been used to approximate them. For a linear-chain CRF model, however, the k-best label sequences can be produced exactly with the modified Viterbi algorithm.
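The k-best decoding and the two weighting schemes can be sketched in a few lines of Python. This is one straightforward way to extend Viterbi to k-best lists, not necessarily the modified algorithm used by the authors; the toy scores, the function names and the `student_log_prob` callable are assumptions made for illustration.

```python
import heapq
import math


def k_best_viterbi(emissions, transitions, start, k):
    """Return the k highest-scoring (log_score, label_path) pairs, best first.

    For every position and label we keep the k best prefixes ending in that
    label, which is sufficient for an exact k-best list in a linear chain.
    """
    num_labels = len(start)
    beams = [[(start[y] + emissions[0][y], (y,))] for y in range(num_labels)]
    for i in range(1, len(emissions)):
        new_beams = []
        for y in range(num_labels):
            candidates = [
                (score + transitions[prev][y] + emissions[i][y], path + (y,))
                for prev in range(num_labels)
                for score, path in beams[prev]
            ]
            new_beams.append(heapq.nlargest(k, candidates))
        beams = new_beams
    return heapq.nlargest(k, [item for beam in beams for item in beam])


def top_k_weights(k_best):
    """Teacher probabilities renormalized over the k-best list.

    The partition function cancels, so unnormalized log scores suffice.
    """
    scores = [score for score, _ in k_best]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_loss(k_best, student_log_prob, weights=None):
    """Top-K loss with uniform weights, or Top-WK loss if weights are given."""
    if weights is None:
        weights = [1.0 / len(k_best)] * len(k_best)
    return -sum(w * student_log_prob(path) for w, (_, path) in zip(weights, k_best))


# Toy teacher scores (assumed values): 3 tokens, 2 labels.
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
transitions = [[0.6, -0.4], [-0.2, 0.8]]
start = [0.0, 0.0]

k_best = k_best_viterbi(emissions, transitions, start, k=3)
weights = top_k_weights(k_best)
for weight, (score, path) in zip(weights, k_best):
    print(path, round(score, 3), round(weight, 3))
```

In a real setup `student_log_prob` would return the student CRF's log probability of a label sequence; here it is left as a parameter so that the sketch stays independent of any model code.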

Posterior Distillation
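The Top-K approach is approximate with respect to the teacher's structural distribution and is still slow for large k. Our second approach distills structure-level knowledge based on tractable local (token-wise) distributions $q(y_k|x)$, which can be computed exactly:

$$q(y_k|x) = \sum_{\{y_1, \ldots, y_n\} \setminus y_k} p(y_1, \ldots, y_n|x) = \frac{\alpha(y_k)\,\beta(y_k)}{Z} \tag{10}$$

$$\alpha(y_k) = \sum_{\{y_1, \ldots, y_{k-1}\}} \prod_{i=1}^{k} \psi(y_{i-1}, y_i, r_i) \tag{11}$$

$$\beta(y_k) = \sum_{\{y_{k+1}, \ldots, y_n\}} \prod_{i=k+1}^{n} \psi(y_{i-1}, y_i, r_i) \tag{12}$$

where $Z$ is the denominator of Eq. 2, usually called the partition function, and $\alpha(y_k)$ and $\beta(y_k)$ are calculated in the forward and backward passes of the forward-backward algorithm. We assume that $\beta(y_n) = 1$. Given the local probability distribution for each token, we define the KD loss function in a similar manner to the token-level distillation in Eq. 5:

$$\mathcal{L}_{\text{Pos.}} = -\sum_{i=1}^{n} \sum_{j=1}^{|V|} q_t(y_i = j|x) \log q_s(y_i = j|x) \tag{13}$$

The difference between token-level distillation and posterior distillation is that posterior distillation is based on BiLSTM-CRF and conveys global structural knowledge through the local posterior distributions.

Algorithm 1 KD for Multilingual Sequence Labeling
Input: Training corpora $D = \{D_1, \ldots, D_l\}$ with $l$ languages, monolingual models $T = \{T_1, \ldots, T_l\}$ pretrained on the corresponding training corpus, learning rate $\eta$, multilingual student model $M$ with parameters $\theta$, total training epochs $S$, loss interpolation coefficient $\lambda$, interpolation annealing rate $\tau$.
Initialize: Randomly initialize the multilingual model parameters $\theta$. Set the current training epoch $s = 0$ and the current loss interpolation $\lambda = 1$. Create a new empty training dataset $\hat{D}$.
for each corpus $D_i$ in $D$ do
    for each pair $(x^i_j, y^i_j)$ in $D_i$ do
        Teacher model $T_i$ reads the input $x^i_j$ and predicts the probability distributions $\hat{p}^i_j$ required for KD.
        Append $(x^i_j, y^i_j, \hat{p}^i_j)$ to the new training dataset $\hat{D}$.
    end for
end for
while $s < S$ do
    $s = s + 1$.
    for each mini-batch $(x, y, \hat{p})$ sampled from $\hat{D}$ do
        Compute the KD loss $\mathcal{L}_{\text{KD}}$ and the gold target loss $\mathcal{L}_{\text{NLL}}$.
        Compute the final loss $\lambda \mathcal{L}_{\text{KD}} + (1 - \lambda) \mathcal{L}_{\text{NLL}}$ and update $\theta$ with learning rate $\eta$.
    end for
    Decrease the interpolation coefficient $\lambda$ by the annealing rate $\tau$.
end while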
Posterior distillation has not been used in the related research of knowledge distillation in neural machine translation because of intractable computation of local distributions. In sequence labeling, however, local distributions in a BiLSTM-CRF can be computed exactly using the forward-backward algorithm.
An example of computing the structural knowledge discussed in this and last subsections is shown in Table 1.
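The forward-backward computation of the local posteriors and the resulting KD loss can be sketched in Python as follows. The toy scores are made-up log potentials (not the values of Table 1), and the function names are assumptions for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))


def token_posteriors(emissions, transitions, start):
    """Exact token-wise posteriors of a linear-chain CRF.

    Works in log space: alpha sums over all prefixes ending in a label,
    beta sums over all suffixes following it, and their product is
    normalized by the partition function.
    """
    n, num_labels = len(emissions), len(start)
    alpha = [[start[y] + emissions[0][y] for y in range(num_labels)]]
    for i in range(1, n):
        alpha.append([
            logsumexp([alpha[i - 1][p] + transitions[p][y] for p in range(num_labels)])
            + emissions[i][y]
            for y in range(num_labels)
        ])
    # log beta at the last position is 0, i.e. beta = 1.
    beta = [[0.0] * num_labels for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [
            logsumexp([
                transitions[y][nxt] + emissions[i + 1][nxt] + beta[i + 1][nxt]
                for nxt in range(num_labels)
            ])
            for y in range(num_labels)
        ]
    log_z = logsumexp(alpha[-1])
    return [
        [math.exp(alpha[i][y] + beta[i][y] - log_z) for y in range(num_labels)]
        for i in range(n)
    ]


def posterior_kd_loss(q_teacher, q_student):
    """Cross-entropy between teacher and student posteriors over tokens and labels."""
    return -sum(
        qt * math.log(qs)
        for t_row, s_row in zip(q_teacher, q_student)
        for qt, qs in zip(t_row, s_row)
    )


# Toy teacher scores (assumed values): 3 tokens, 2 labels.
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
transitions = [[0.6, -0.4], [-0.2, 0.8]]
start = [0.0, 0.0]

q_teacher = token_posteriors(emissions, transitions, start)
for row in q_teacher:
    print([round(p, 3) for p in row])
```

Unlike softmax over emission scores, each row here depends on the transition scores and on every other position, which is how the local distribution carries sequence-level information.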

Multilingual Knowledge Distillation
Let $D = \{D_1, \ldots, D_l\}$ denote a set of training data with $l$ languages. $D_i$ denotes the corpus of the $i$-th language, which contains multiple sentence and label sequence pairs $\{(x^i_j, y^i_j)\}$. To train a single multilingual student model from multiple monolingual pretrained teachers, for each input sentence we first use the teacher model of the corresponding language to predict the pseudo targets (k-best label sequences for Top-K distillation or posterior distributions for posterior distillation). Then the student jointly learns from the gold targets and the pseudo targets in training by optimizing the following loss function:

$$\mathcal{L}_{\text{ALL}} = \lambda \mathcal{L}_{\text{KD}} + (1 - \lambda) \mathcal{L}_{\text{NLL}} \tag{14}$$

where $\lambda$ decreases from 1 to 0 throughout training following Clark et al. (2019), and $\mathcal{L}_{\text{KD}}$ is one of Eq. 5, 8, 9, 13, or an average of Eq. 9 and 13. The overall distillation process is summarized in Algorithm 1.
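The data flow of the multilingual distillation can be sketched with plain Python. Everything here is a simplified assumption for illustration: teachers are modeled as callables keyed by a language code, the pseudo targets are opaque objects, and the linear schedule with a step of one over the number of epochs is only one possible way to make the interpolation coefficient decrease from 1 towards 0.

```python
def build_distillation_set(corpora, teachers):
    """Attach the pseudo target of the matching monolingual teacher to each example.

    corpora:  {language: [(sentence, gold_labels), ...]}
    teachers: {language: callable mapping a sentence to its pseudo target}
    """
    distilled = []
    for language, corpus in corpora.items():
        teacher = teachers[language]
        for sentence, gold in corpus:
            distilled.append((sentence, gold, teacher(sentence)))
    return distilled


def annealed_lambdas(total_epochs):
    """Interpolation coefficient per epoch, decreasing linearly from 1.

    Assumption: the annealing rate is 1 / total_epochs.
    """
    tau = 1.0 / total_epochs
    return [max(0.0, 1.0 - tau * epoch) for epoch in range(total_epochs)]


def combined_loss(kd_loss, nll_loss, lam):
    """Interpolate the KD loss and the gold-label loss."""
    return lam * kd_loss + (1.0 - lam) * nll_loss


# Hypothetical two-language example with stand-in teachers.
corpora = {"en": [("New York", ["B-LOC", "E-LOC"])], "de": [("Berlin", ["S-LOC"])]}
teachers = {"en": lambda s: ("en-teacher", s), "de": lambda s: ("de-teacher", s)}

dataset = build_distillation_set(corpora, teachers)
for epoch, lam in enumerate(annealed_lambdas(4)):
    print(epoch, lam, combined_loss(kd_loss=2.0, nll_loss=4.0, lam=lam))
```

Because the teacher predictions are computed once before training, the student never needs the teachers' parameters in memory during optimization.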

Setup
Dataset We use datasets from 4 sequence labeling tasks in our experiments.
• CoNLL NER: We collect the corpora of 4 languages from the CoNLL 2002 and 2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).
• WikiAnn NER (Pan et al., 2017): The dataset contains silver-standard NER tags that are annotated automatically for 282 languages that exist in Wikipedia. We select the data of 8 languages from different language families or from different language subgroups of Indo-European languages. We randomly choose 5,000 sentences from the dataset for each language except English, and choose 10,000 sentences for English to reflect the abundance of English corpora in practice. We split the dataset 8:1:1 into training/development/test sets.
• Universal Dependencies (UD) (Nivre et al., 2016): We use the universal POS tagging annotations in the UD datasets. We choose 8 languages from different language families or language subgroups and one dataset for each language.
• Aspect Extraction: The dataset is from an aspect-based sentiment analysis task in SemEval-2016 Task 5 (Pontiki et al., 2016). We choose subtask 1 of the restaurants domain, which covers the most languages among all domains, and hold out 10% of the training data as the development data.

Model Configurations In our experiments, all the word embeddings are fixed and M-BERT token embeddings are obtained by average pooling. We feed the token embeddings into the BiLSTM-CRF for decoding. The hidden size of the BiLSTM layer is 256 for the monolingual teacher models and 600 or 800 for the multilingual student model depending on the dataset, as a larger hidden size for the multilingual model results in better performance in our experiments.
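One common reading of average pooling for M-BERT token embeddings is that the vectors of all subword pieces belonging to the same word are averaged. The sketch below illustrates that reading only; the alignment list `word_ids` and the function name are hypothetical, and the source does not specify the exact pooling procedure.

```python
def average_pool(subword_vectors, word_ids):
    """Average the vectors of subwords that share a word index.

    subword_vectors: one vector (list of floats) per subword piece
    word_ids:        index of the word each subword piece belongs to
    Returns one vector per word, ordered by word index.
    """
    sums, counts = {}, {}
    for vector, word_id in zip(subword_vectors, word_ids):
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [a + b for a, b in zip(sums[word_id], vector)]
        counts[word_id] += 1
    return [
        [value / counts[word_id] for value in sums[word_id]]
        for word_id in sorted(sums)
    ]


# Toy example: the first word is split into two pieces, the second into one.
vectors = [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]]
print(average_pool(vectors, word_ids=[0, 0, 1]))
```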

Results
We report results of the following approaches.
• Baseline represents training the multilingual model with the datasets of all the languages combined and without knowledge distillation.
• Emission is the KD method based on Eq. 5.
• Pos.+Top-WK is a mixture of posterior and weighted Top-K distillation.
We also report the results of the monolingual models as Teachers, and of the multilingual BiLSTM-Softmax model and its variant with token-level KD based on Eq. 4 as Softmax and Token, for reference. Tables 2, 3, and 4 show the effectiveness of our approach on 4 tasks over 25 datasets. In all the tables, we report scores averaged over 5 runs.

Observation #0. BiLSTM-Softmax models underperform BiLSTM-CRF models in most cases in the multilingual setting: The results show that the BiLSTM-CRF approach is stronger than the BiLSTM-Softmax approach on three of the four tasks, which is consistent with previous work on sequence labeling (Ma and Hovy, 2016; Reimers and Gurevych, 2017; Yang et al., 2018). The token-level KD approach performs almost the same as the BiLSTM-Softmax baseline in most of the tasks except the Aspect Extraction task.

Observation #1. Monolingual teacher models outperform multilingual student models: This is probably because the monolingual teacher models are based on both the multilingual M-BERT embeddings and strong monolingual embeddings (Flair/fastText). The monolingual embeddings may provide additional information that is not available to the multilingual student models. Furthermore, note that the learning problem faced by a multilingual student model is much more difficult than that of a teacher model, because a student model has to handle all the languages using roughly the same model size as a teacher model.

Observation #2. Emission outperforms the baseline only on 12 out of 25 datasets. This shows that simply following the standard approach of knowledge distillation from emission scores is not sufficient for the BiLSTM-CRF models.

Observation #3. Top-K and Top-WK outperform the baseline: Top-K outperforms the baseline on 15 datasets. It outperforms Emission on average on WikiAnn NER and Aspect Extraction and is competitive with Emission in the other two tasks. Top-WK outperforms the baseline on 18 datasets and outperforms Top-K in all the tasks.

Observation #4. Posterior achieves the best performance on most of the tasks: The Posterior approach outperforms the baseline on 21 datasets and only underperforms the baseline by 0.12 on 2 languages in WikiAnn and by 0.01 on one language in UD POS tagging. It outperforms the other methods on average in all the tasks, except that it slightly underperforms Pos.+Top-WK in the CoNLL NER task.

Observation #5. Top-WK+Posterior stays in between: Pos.+Top-WK outperforms both Top-WK and Posterior only in the CoNLL NER task. In the other three tasks, its performance is above that of Top-WK but below that of Posterior.

Table 6: Averaged results of zero-shot transfer on another 28 languages of the NER task and 24 languages of the POS tagging task.

Zero-shot Transfer
We use the monolingual teacher models, the multilingual baseline models, and our Posterior and Pos.+Top-WK models trained on the CoNLL NER datasets to predict NER tags on the test sets of the 7 languages in WikiAnn that are used in Section 4.2. Table 5 shows the results. For the teacher models, we report the maximum score over all the teachers for each language.
We also conduct zero-shot transfer experiments on another 28 languages of the WikiAnn NER datasets and 24 languages of the UD POS tagging datasets. The averaged results are shown in Table 6. The NER experiment shows that our approaches outperform Baseline on 24 out of 28 languages and that Posterior is stronger than Pos.+Top-WK by 0.29 F1 on average. The POS tagging experiment shows that our approach outperforms Baseline on 20 out of 24 languages. For more details, please refer to Appendix A.

KD with Weaker Teachers
To show the effectiveness of our approach, we train weaker monolingual teachers using only M-BERT embeddings on the four datasets of the CoNLL NER task. We run Posterior distillation and keep the setting of the student model unchanged. In this setting, Posterior not only outperforms the baseline but also outperforms the teacher models on average. This shows that our approach still works when the teachers use the same token embeddings as the student. Comparing Tables 7 and 2, we can also see that stronger teachers lead to better students.

k Value in Top-K
To show how the value of k affects the performance of the Top-K and Top-WK distillation methods, we compare models trained with the two methods and different k values on the CoNLL NER task. Figure 2 shows that the performance of Top-K drops dramatically as k gets larger, while Top-WK remains stable. Therefore Top-WK is less sensitive to the hyper-parameter k and might be more practical in real applications.

Training Time and Memory Consumption
We compare the training time of different approaches on the CoNLL NER task and report the results in Table 8. Our Top-WK and Posterior approaches take 1.45 and 1.63 times the training time of the Baseline approach, respectively. As for memory consumption in training, the GPU memory cost does not vary significantly across the approaches, while the CPU memory cost of all the KD approaches is about 2 times that of the baseline model, because training models with KD requires storing the predictions of the teachers in CPU memory.

Related Work
Multilingual Sequence Labeling Many important tasks such as NER and POS tagging can be reduced to a sequence labeling problem. Most of the recent work on multilingual NER (Täckström, 2012;Fang et al., 2017;Enghoff et al., 2018;Rahimi et al., 2019;Johnson et al., 2019) and POS tagging (Snyder et al., 2009;Plank and Agić, 2018) focuses on transferring the knowledge of a specific language to another (low-resource) language.
For example, Johnson et al. (2019) proposed cross-lingual transfer learning for NER focusing on bootstrapping Japanese from English, which has a different character set than Japanese. Previous work has discussed and empirically investigated two ways of adapting monolingual pretrained embedding models to monolingual downstream tasks (Peters et al., 2019): either fixing the models and using them for feature extraction, or fine-tuning them on the downstream tasks. They found that both settings have comparable performance in most cases. Wu and Dredze (2019) found that fine-tuning M-BERT with the bottom layers fixed provides further performance gains in the multilingual setting. In this paper, we mainly focus on the first approach and use the pretrained embeddings as fixed feature extractors, because fine-tuning Flair/M-BERT is too slow for our large-scale experimental design of multilingual KD. Designing a cheap and fast fine-tuning approach for pretrained embedding models might be an interesting direction for future work.

Conclusion
In this paper, our major contributions are two structure-level methods for distilling the knowledge of monolingual models into a single multilingual model for sequence labeling: Top-K knowledge distillation and posterior distillation. The experimental results show that our approach improves the performance of multilingual models on 4 tasks over 25 datasets. The analysis also shows that our model has stronger zero-shot transfer ability on unseen languages in the NER and POS tagging tasks. Our code is publicly available at https://github.

A Appendices
In this appendix, we use ISO 639-1 codes to represent each language for simplicity.
A.1 Zero-shot Transfer