Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language

To better tackle the named entity recognition (NER) problem on languages with little/no labeled data, cross-lingual NER must effectively leverage knowledge learned from source languages with rich labeled data. Previous works on cross-lingual NER are mostly based on label projection with pairwise texts or direct model transfer. However, such methods either are not applicable if the labeled data in the source languages is unavailable, or do not leverage information contained in unlabeled data in the target language. In this paper, we propose a teacher-student learning method to address such limitations, where NER models in the source languages are used as teachers to train a student model on unlabeled data in the target language. The proposed method works for both single-source and multi-source cross-lingual NER. For the latter, we further propose a similarity measuring method to better weight the supervision from different teacher models. Extensive experiments for 3 target languages on benchmark datasets well demonstrate that our method outperforms existing state-of-the-art methods for both single-source and multi-source cross-lingual NER.


Introduction
Named entity recognition (NER) is the task of identifying text spans that belong to pre-defined categories, such as locations and person names. It is a fundamental component in many downstream tasks and has been greatly advanced by deep neural networks (Lample et al., 2016; Chiu and Nichols, 2016; Peters et al., 2017). However, these approaches generally require massive amounts of manually labeled data, which prohibits their adaptation to low-resource languages due to high annotation costs.
One solution is to transfer knowledge from a source language with rich labeled data to a target language with little or even no labeled data, which is referred to as cross-lingual NER (Wu and Dredze, 2019; Wu et al., 2020). In this paper, following Wu and Dredze (2019) and Wu et al. (2020), we focus on the extreme scenario of cross-lingual NER where no labeled data is available in the target language, which is challenging in itself and has attracted considerable attention from the research community in recent years. Previous works on cross-lingual NER are mostly based on label projection with pairwise texts or direct model transfer. Label-projection based methods focus on using labeled data in a source language to generate pseudo-labeled data in the target language for training an NER model. For example, Ni et al. (2017) create automatically labeled NER data for the target language via label projection on comparable corpora and develop a heuristic scheme to select good-quality projection-labeled data. Mayhew et al. (2017) and Xie et al. (2018) translate the source-language labeled data at the phrase/word level to generate pairwise labeled data for the target language. In contrast, model-transfer based methods (Wu and Dredze, 2019; Wu et al., 2020) focus on training a shared NER model on the labeled data in the source language with language-independent features, such as cross-lingual word representations (Devlin et al., 2019), and then directly testing the model on the target language.
However, both label-projection based methods and model-transfer based methods have limitations. The former rely on labeled data in the source language for label projection, and thus are not applicable in cases where the required labeled data is inaccessible (e.g., due to privacy/sensitivity issues). Meanwhile, the latter do not leverage unlabeled data in the target language, which can be much cheaper to obtain and probably contains very useful language information.
In this paper, we propose a teacher-student learning method for cross-lingual NER to address the mentioned limitations. Specifically, we leverage multilingual BERT (Devlin et al., 2019) as the base model to produce language-independent features. A previously trained NER model for the source language is then used as a teacher model to predict the probability distribution of entity labels (i.e., soft labels) for each token in the non-pairwise unlabeled data in the target language. Finally, we train a student NER model for the target language using the pseudo-labeled data with such soft labels. The proposed method does not rely on labeled data in the source language, and it also leverages the available information from unlabeled data in the target language, thus avoiding the mentioned limitations of previous works. Note that we use the teacher model to predict soft labels rather than hard labels (i.e., one-hot label vectors), as soft labels can provide much more information (Hinton et al., 2015) for the student model. Figure 1 shows the differences between the proposed teacher-student learning method and the typical label-projection or model-transfer based methods.
We further extend our teacher-student learning method to multi-source cross-lingual NER, considering that there are usually multiple source languages available in practice and we would prefer transferring knowledge from all source languages rather than a single one. In this case, our method still enjoys the same advantages in terms of data availability and inference efficiency, compared with existing works (Täckström, 2012; Chen et al., 2019; Enghoff et al., 2018; Rahimi et al., 2019). Moreover, we propose a method to measure the similarity between each source language and the target language, and use this similarity to better weight the supervision from the corresponding teacher model.
We evaluate our proposed method for 3 target languages on benchmark datasets, using different source language settings. Experimental results show that our method outperforms existing state-of-the-art methods for both single-source and multi-source cross-lingual NER. We also conduct case studies and statistical analyses to discuss why teacher-student learning reaches better results.
The main contributions of this work are:
• We propose a teacher-student learning method for single-source cross-lingual NER, which addresses limitations of previous works w.r.t. data availability and usage of unlabeled data.
• We extend the proposed method to multi-source cross-lingual NER, using a measure of the similarities between source/target languages to better weight teacher models.
• We conduct extensive experiments validating the effectiveness and reasonableness of the proposed methods, and further analyse why they attain superior performance.

Related Work
Single-Source Cross-Lingual NER: Such approaches consider one single source language for knowledge transfer. Previous works can be divided into two categories: label-projection and model-transfer based methods. Label-projection based methods aim to build pseudo-labeled data for the target language to train an NER model. Some early works proposed to use bilingual parallel corpora and project model expectations (Wang and Manning, 2014) or labels (Ni et al., 2017) from the source language to the target language with external word alignment information. But obtaining parallel corpora is expensive or even infeasible. To tackle that, recent methods proposed to first translate source-language labeled data at the phrase level (Mayhew et al., 2017) or word level (Xie et al., 2018), and then directly copy labels across languages. But translation introduces extra noise due to sense ambiguity and word order differences between languages, thus hurting the trained model.
Model-transfer based methods generally rely on language-independent features (e.g., cross-lingual word embeddings (Ni et al., 2017; Huang et al., 2019; Wu and Dredze, 2019; Moon et al., 2019), word clusters (Täckström et al., 2012), gazetteers (Zirikly and Hagiwara, 2015), and wikifier features (Tsai et al., 2016)), so that a model trained with such features can be directly applied to the target language. For further improvement, Wu et al. (2020) proposed constructing a pseudo-training set for each test case and fine-tuning the model before inference. However, these methods do not leverage any unlabeled data in the target language, though such data can be easy to obtain and can benefit language/domain adaptation.
Multi-Source Cross-Lingual NER: Multi-source cross-lingual NER considers multiple source languages for knowledge transfer. Täckström (2012) and Moon et al. (2019) concatenated the labeled data of all source languages to train a unified model, and performed cross-lingual NER in a direct model transfer manner. Chen et al. (2019) leveraged adversarial networks to learn language-independent features, and learned a mixture-of-experts model (Shazeer et al., 2017) to weight source models at the token level. However, both methods rely directly on the availability of labeled data in the source languages.
In contrast, Enghoff et al. (2018) implemented multi-source label projection and studied how source data quality influences performance. Rahimi et al. (2019) applied truth inference to model the transfer annotation bias from multiple source-language models. However, both methods make predictions via an ensemble of source-language models, which is cumbersome and computationally expensive, especially when each source-language model has a massive parameter space.
Teacher-Student Learning: Early applications of teacher-student learning targeted model compression (Buciluǎ et al., 2006), where a small student model is trained to mimic a pre-trained, larger teacher model or ensemble of models. It was soon applied to various tasks like image classification (Hinton et al., 2015; You et al., 2017), dialogue generation (Peng et al., 2019), and neural machine translation (Tan et al., 2019), which demonstrated the usefulness of the knowledge transfer approach. In this paper, we investigate teacher-student learning for the task of cross-lingual NER, in both single-source and multi-source scenarios. Different from previous works, our proposed method does not rely on the availability of labeled data in source languages or any pairwise texts, while it can also leverage extra information in unlabeled data in the target language to enhance the cross-lingual transfer. Moreover, compared with using an ensemble of source-language models, our method uses a single student model for inference, which is more efficient.

Methodology
Named entity recognition can be formulated as a sequence labeling problem: given a sentence $x = \{x_i\}_{i=1}^{L}$ with $L$ tokens, an NER model is supposed to infer the entity label $y_i$ for each token $x_i$ and output a label sequence $y = \{y_i\}_{i=1}^{L}$. Under the paradigm of cross-lingual NER, we assume there are $K$ source-language models previously trained with language-independent features. Our proposed teacher-student learning method then uses those $K$ source-language models as teachers to train an effective student NER model for the target language on its unlabeled data $D_{tgt}$.

Single-Source Cross-Lingual NER
Here we first consider the case of only one source language ($K = 1$) for cross-lingual NER. The overall framework of the proposed teacher-student learning method for single-source cross-lingual NER is illustrated in Figure 2.

NER Model Structure
As shown in Figure 2, for simplicity, we employ the same neural network structure for both teacher (source-language) and student (target-language) NER models. Note that the student model is flexible and its structure can be determined according to the trade-off between performance and training/inference efficiency.
Here the adopted NER model consists of an encoder layer and a linear classification layer. Specifically, given an input sequence $x = \{x_i\}_{i=1}^{L}$ with $L$ tokens, the encoder layer $f_\theta$ maps it into a sequence of hidden vectors $h = \{h_i\}_{i=1}^{L}$:
$$h = f_\theta(x) \qquad (1)$$
Here $f_\theta(\cdot)$ can be any encoder model that produces cross-lingual token representations, and $h_i$ is the hidden vector corresponding to the $i$-th token $x_i$.
With each $h_i$ derived, the linear classification layer computes the probability distribution of entity labels for the corresponding token $x_i$, using a softmax function:
$$p(x_i, \Theta) = \mathrm{softmax}(W h_i + b) \qquad (2)$$
where $p(x_i, \Theta) \in \mathbb{R}^{|C|}$ with $C$ being the entity label set, and $\Theta = \{f_\theta, W, b\}$ denotes the to-be-learned model parameters.
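The classification layer of Eq. 2 can be illustrated with a minimal sketch. It assumes the encoder has already produced a hidden vector for the token; the function names and the toy weights below are hypothetical and only serve the illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_token(h_i, W, b):
    """Eq. 2 for one token: softmax(W h_i + b), with W given as |C| rows."""
    logits = [sum(w * h for w, h in zip(row, h_i)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Toy sizes (hypothetical): 2-dimensional hidden vector, |C| = 3 labels.
h_i = [1.0, -1.0]
W = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
p = classify_token(h_i, W, b)  # one probability per entity label
```

The output has one entry per label in $C$ and sums to one, which is what allows it to serve as a soft label for the student.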

Teacher-Student Learning
Training: We train the student model to mimic the output probability distribution of entity labels produced by the teacher model, on the unlabeled data in the target language $D_{tgt}$. Knowledge from the teacher model is expected to transfer to the student model, while the student model can also leverage helpful language-specific information available in the unlabeled target-language data. Given an unlabeled sentence $x' \in D_{tgt}$ in the target language, the teacher-student learning loss w.r.t. $x'$ is formulated as the mean squared error (MSE) between the output probability distributions of entity labels by the student model and those by the teacher model, averaged over tokens. Note that here we follow Yang et al. (2019) and use the MSE loss, because it is symmetric and mimics all probabilities equally. Suppose that for the $i$-th token in $x'$, i.e., $x'_i$, the probability distribution of entity labels output by the student model is denoted as $p(x'_i, \Theta_S)$, and that output by the teacher model as $p(x'_i, \Theta_T)$. Here $\Theta_S$ and $\Theta_T$ denote the parameters of the student and the teacher models, respectively. The teacher-student learning loss w.r.t. $x'$ is then defined as:
$$\mathcal{L}(x', \Theta_S) = \frac{1}{L} \sum_{i=1}^{L} \mathrm{MSE}\big(p(x'_i, \Theta_S),\, p(x'_i, \Theta_T)\big) \qquad (3)$$
The whole training loss is the summation of the losses w.r.t. all sentences in $D_{tgt}$:
$$\mathcal{L}(\Theta_S) = \sum_{x' \in D_{tgt}} \mathcal{L}(x', \Theta_S) \qquad (4)$$
Minimizing $\mathcal{L}(\Theta_S)$ yields the student model.
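The training objective can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the per-token MSE is taken as the mean over label dimensions (a sum over dimensions would differ only by a constant factor), and all function names are hypothetical.

```python
def token_mse(p_student, p_teacher):
    """MSE between two label distributions for one token.
    Assumption: mean over the |C| label dimensions."""
    return sum((s - t) ** 2 for s, t in zip(p_student, p_teacher)) / len(p_student)

def sentence_loss(student_probs, teacher_probs):
    """Sentence-level loss: per-token MSE averaged over the L tokens."""
    per_token = [token_mse(ps, pt) for ps, pt in zip(student_probs, teacher_probs)]
    return sum(per_token) / len(per_token)

def corpus_loss(student_corpus, teacher_corpus):
    """Training loss: sum of sentence-level losses over the unlabeled corpus."""
    return sum(sentence_loss(s, t) for s, t in zip(student_corpus, teacher_corpus))

# Toy sentence with two tokens and two labels (hypothetical values).
teacher = [[0.5, 0.5], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]
loss = sentence_loss(student, teacher)
```

In this toy case only the first token contributes, so the student is pushed towards the teacher's uncertain distribution there rather than towards a single hard label.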
Inference: For inference in the target language, we only utilize the learned student model to predict the probability distribution of entity labels for each token $x_i$ in a test sentence $x$. Then we take the entity label $c \in C$ with the highest probability as the predicted label $y_i$ for $x_i$:
$$y_i = \arg\max_{c \in C}\; p(x_i, \Theta_S)_c \qquad (5)$$
where $p(x_i, \Theta_S)_c$ denotes the predicted probability corresponding to the entity label $c$ in $p(x_i, \Theta_S)$.
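The decoding rule above amounts to a per-token argmax. A short sketch, with a hypothetical three-label set and made-up probabilities:

```python
def predict_labels(student_probs, label_set):
    """Pick, for every token, the label with the highest student probability."""
    predictions = []
    for p in student_probs:
        best = max(range(len(p)), key=p.__getitem__)
        predictions.append(label_set[best])
    return predictions

# Hypothetical label set and student outputs for a two-token sentence.
labels = ["O", "B-PER", "I-PER"]
probs = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
predicted = predict_labels(probs, labels)
```

Each token is decoded independently here; no constraint on valid label transitions is assumed.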

Multi-Source Cross-Lingual NER
The framework of the proposed teacher-student learning method for multi-source ($K > 1$) cross-lingual NER is illustrated in Figure 3.

Extension to Multiple Teacher Models
As illustrated in Figure 3, we extend the single-teacher framework in Figure 2 into a multi-teacher one, while keeping the student model unchanged.
Note that, for simplicity, all teacher models and the student model use the same model structure as in Section 3.1.1 (NER Model Structure). Take the $k$-th teacher model as an example, and denote its parameters as $\Theta_T^{(k)}$. Given a sentence $x' = \{x'_i\}_{i=1}^{L}$ with $L$ tokens from the unlabeled data $D_{tgt}$ in the target language, the output probability distribution of entity labels w.r.t. the $i$-th token $x'_i$ can be derived as in Eq. 1 and 2, and is denoted as $p(x'_i, \Theta_T^{(k)})$. To combine all teacher models, we add up their output probability distributions with a group of weights $\{\alpha_k\}_{k=1}^{K}$ as follows:
$$p(x'_i, \Theta_T) = \sum_{k=1}^{K} \alpha_k\, p(x'_i, \Theta_T^{(k)}) \qquad (6)$$
where $p(x'_i, \Theta_T)$ is the combined probability distribution of entity labels, $\Theta_T = \{\Theta_T^{(k)}\}_{k=1}^{K}$ is the set of parameters of all teacher models, and $\alpha_k$ is the weight corresponding to the $k$-th teacher model, with $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \geq 0$, $\forall k \in \{1, \ldots, K\}$.
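The weighted combination of teacher outputs can be sketched for a single token as below. The function name and the toy distributions are hypothetical; the check on the weights mirrors the constraints stated above (non-negative, summing to one).

```python
def combine_teachers(teacher_probs, alphas):
    """Weighted sum of K teacher label distributions for one token."""
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    num_labels = len(teacher_probs[0])
    return [sum(a * p[c] for a, p in zip(alphas, teacher_probs))
            for c in range(num_labels)]

# Two hypothetical teachers that disagree on a token.
teachers = [[0.8, 0.2], [0.4, 0.6]]
combined = combine_teachers(teachers, [0.5, 0.5])
```

Because the weights form a convex combination, the result is again a valid probability distribution and can be used as the soft label for the student.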

Weighting Teacher Models
Here we elaborate on how to derive the weights $\{\alpha_k\}_{k=1}^{K}$ in cases with or without unlabeled data in the source languages. Source languages more similar to the target language should generally be assigned higher weights to transfer more knowledge.
Without Any Source-Language Data: It is straightforward to average over all teacher models:
$$\alpha_k = \frac{1}{K}, \quad \forall k \in \{1, \ldots, K\} \qquad (7)$$
With Unlabeled Source-Language Data: As no labeled data is available, existing supervised language/domain similarity learning methods for a target task (i.e., NER) (McClosky et al., 2010) are not applicable here. Inspired by Pinheiro (2018), we propose to introduce a language identification auxiliary task for calculating similarities between source and target languages, and then weight teacher models based on this metric.
In the language identification task, for the $k$-th source language, each unlabeled sentence $u^{(k)}$ in it is associated with the language index $k$ to build its training dataset, denoted as $D_{src}^{(k)} = \{(u^{(k)}, k)\}$. We also assume that in the $m$-dimensional language-independent feature space, sentences from each source language should be clustered around the corresponding language embedding vector. We thus introduce a learnable language embedding vector $\mu^{(k)} \in \mathbb{R}^{m}$ for the $k$-th source language, and then utilize a bilinear operator to measure similarity between a given sentence $u$ and the $k$-th source language:
$$s(u, \mu^{(k)}) = g^{T}(u)\, M\, \mu^{(k)} \qquad (8)$$
where $g(\cdot)$ can be any language-independent model that outputs sentence embeddings, and $M \in \mathbb{R}^{m \times m}$ denotes the parameters of the bilinear operator.
By building a language embedding matrix $P \in \mathbb{R}^{m \times K}$ with each $\mu^{(k)}$ column by column, and applying a softmax function over the bilinear operator, we can derive language-specific probability distributions w.r.t. $u$ as below.
$$q(u, M, P) = \mathrm{softmax}\big(g^{T}(u)\, M\, P\big) \qquad (9)$$
Then the parameters $M$ and $P$ are trained to identify the language of each sentence in $\{D_{src}^{(k)}\}_{k=1}^{K}$, via minimizing the cross-entropy (CE) loss plus a regularizer on $P$ (Eq. 10), in which $\|\cdot\|_F^2$ denotes the squared Frobenius norm and $I$ is an identity matrix. The regularizer in $\mathcal{L}(P, M)$ is to encourage different dimensions of the language embedding vectors to focus on different aspects, with $\gamma \geq 0$ being its weighting factor.
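The bilinear similarity and the resulting language distribution (Eq. 8 and 9) can be sketched as below. The sentence embedding is assumed to be given, the language embeddings are passed as a list of vectors (the columns of $P$), and all names and toy values are hypothetical. The training loss of Eq. 10 is not reproduced here.

```python
import math

def bilinear_score(g_u, M, mu_k):
    """Similarity g(u)^T M mu^(k) between a sentence embedding and one language embedding."""
    M_mu = [sum(m_ij * mu_j for m_ij, mu_j in zip(row, mu_k)) for row in M]
    return sum(g_i * v for g_i, v in zip(g_u, M_mu))

def language_distribution(g_u, M, language_embeddings):
    """Softmax over the bilinear scores against all K language embeddings."""
    scores = [bilinear_score(g_u, M, mu) for mu in language_embeddings]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setting: m = 2, K = 2, identity bilinear operator.
M = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
g_u = [1.0, 0.0]
q = language_distribution(g_u, M, embeddings)
```

With an identity operator the score reduces to a dot product, so the sentence is assigned mostly to the language whose embedding it aligns with; a learned $M$ reweights the feature dimensions before that comparison.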
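With learned $M$ and $P = [\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(K)}]$, we compute the weights $\{\alpha_k\}_{k=1}^{K}$ using the unlabeled data in the target language $D_{tgt}$:
$$\alpha_k = \frac{1}{|D_{tgt}|} \sum_{x' \in D_{tgt}} \frac{\exp\big(s(x', \mu^{(k)}) / \tau\big)}{\sum_{i=1}^{K} \exp\big(s(x', \mu^{(i)}) / \tau\big)} \qquad (11)$$
where $\tau$ is a temperature factor to smooth the output probability distribution. In our experiments, we set it as the variance of all values in $\{s(x', \mu^{(k)})\}$, $\forall x' \in D_{tgt}$, $\forall k \in \{1, \ldots, K\}$, so that $\alpha_k$ would not be too biased to either 0 or 1.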
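A minimal sketch of this weighting step is given below. It assumes that the weights are obtained by applying a temperature-scaled softmax to the similarity scores of each target sentence and averaging over the target data, and that the variance is the population variance; both are assumptions of the sketch, as are the function names.

```python
import math

def variance(values):
    """Population variance (assumption: the text does not say which estimator is used)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def teacher_weights(scores):
    """scores[n][k] is the similarity between target sentence n and source language k.
    Returns one weight per source language."""
    tau = variance([s for row in scores for s in row])
    if tau <= 0:
        raise ValueError("temperature must be positive")
    num_sources = len(scores[0])
    alphas = [0.0] * num_sources
    for row in scores:
        peak = max(s / tau for s in row)
        exps = [math.exp(s / tau - peak) for s in row]
        total = sum(exps)
        for k in range(num_sources):
            alphas[k] += exps[k] / total
    return [a / len(scores) for a in alphas]

# Two target sentences, two source languages (hypothetical scores).
alphas = teacher_weights([[2.0, 1.0], [3.0, 1.0]])
```

Dividing by the temperature before the softmax flattens or sharpens the per-sentence distribution, which is how the choice of $\tau$ keeps the weights away from 0 and 1.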

Teacher-Student Learning
Training: With the combined probability distribution of entity labels from multiple teacher models, i.e., $p(x'_i, \Theta_T)$ in Eq. 6, the training loss for the student model is identical to Eq. 3 and 4.
Inference: For inference on the target language, we only use the learned student model and make predictions as in the single-source scenario (Eq. 5).

Experiments
We conduct extensive experiments for 3 target languages (i.e., Spanish, Dutch, and German) on standard benchmark datasets, to validate the effectiveness and reasonableness of our proposed method for single- and multi-source cross-lingual NER.

Settings
Datasets We use two NER benchmark datasets: CoNLL-2002 (Spanish and Dutch) (Tjong Kim Sang, 2002) and CoNLL-2003 (English and German) (Tjong Kim Sang and De Meulder, 2003). Both are annotated with 4 entity types: PER, LOC, ORG, and MISC. Each language-specific dataset is split into training, development, and test sets. Table 1 reports the dataset statistics. All sentences are tokenized into sequences of subwords with WordPiece (Wu et al., 2016). Following Wu and Dredze (2019), we also use the BIO entity labeling scheme.
In our experiments, for each source language, an NER model is trained beforehand on its corresponding labeled training set. As for the target language, we discard the entity labels from its training set, and use it as unlabeled target-language data $D_{tgt}$. Similarly, unlabeled source-language data for learning language similarities (Eq. 10) is simulated by discarding the entity labels of each training set.
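For readers unfamiliar with the BIO scheme mentioned above, the sketch below builds the label set implied by the 4 entity types and converts entity spans to tags. The span format (start, exclusive end, type) and the function name are assumptions made for this illustration; the handling of WordPiece subwords is not covered.

```python
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]

# "O" plus a B- and an I- tag per entity type.
LABEL_SET = ["O"] + [f"{prefix}-{etype}"
                     for etype in ENTITY_TYPES
                     for prefix in ("B", "I")]

def spans_to_bio(num_tokens, spans):
    """Convert (start, end_exclusive, entity_type) spans into BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

# A five-token sentence with a two-token person name and a one-token location.
tags = spans_to_bio(5, [(0, 2, "PER"), (4, 5, "LOC")])
```

Under this scheme the label set $C$ has 9 elements, which is the dimensionality of each soft label.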

Network Configurations
We leverage the cased multilingual BERT-Base (Wu and Dredze, 2019) for both $f(\cdot)$ in Eq. 1 and $g(\cdot)$ in Eq. 8, with 12 Transformer blocks, 768 hidden units, 12 self-attention heads, GELU activations (Hendrycks and Gimpel, 2016), and learned positional embeddings. We use the final hidden vector of the first [CLS] token as the sentence embedding for $g(\cdot)$, and use the mean value of sentence embeddings w.r.t. the $k$-th source language to initialize $\mu^{(k)}$ in Eq. 8.
Network Training We implement our proposed method based on HuggingFace Transformers. Following Wolf et al. (2019), we use a batch size of 32 and 3 training epochs to ensure convergence of optimization. Following Wu and Dredze (2019), we freeze the parameters of the embedding layer and the bottom three layers of BERT-Base. For the optimizers, we use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 5e-5 for teacher models (Wolf et al., 2019), and 1e-4 for the student model (Yang et al., 2019) to converge faster. As for language similarity measuring (i.e., Eq. 10), we set $\gamma = 0.01$ following Pinheiro (2018). Besides, we use a low-rank approximation for the bilinear operator $M$, i.e., $M = U^{T} V$ where $U, V \in \mathbb{R}^{d \times m}$ with $d \ll m$, and we empirically set $d = 64$.
Performance Metric We use the phrase-level F1-score as the evaluation metric, following Tjong Kim Sang (2002). For each experiment, we conduct 5 runs and report the average F1-score.
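The effect of the low-rank factorization of the bilinear operator can be checked with a short sketch. It takes $m = 768$ from the hidden size stated above and $d = 64$ from the text; the helper names and toy matrices are hypothetical.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def low_rank_score(g_u, U, V, mu_k):
    """g^T (U^T V) mu computed as (U g) . (V mu), without forming the m x m matrix."""
    Ug = matvec(U, g_u)
    Vmu = matvec(V, mu_k)
    return sum(a * b for a, b in zip(Ug, Vmu))

# Parameter counts for the sizes stated in the text.
m, d = 768, 64
full_params = m * m          # dense bilinear operator
low_rank_params = 2 * d * m  # U and V together

# Tiny numeric check with d = 1, m = 2 (hypothetical values).
score = low_rank_score([2.0, 3.0], [[1.0, 0.0]], [[0.0, 1.0]], [4.0, 5.0])
```

For these sizes the factorized operator has 98,304 parameters instead of 589,824, a sixfold reduction.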
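The phrase-level metric can be illustrated with a simplified sketch for a single sentence: an entity counts as correct only if its boundaries and type both match. This is not the official CoNLL evaluation script; the function names are hypothetical, and a corpus-level score would pool the span counts over all sentences before computing precision and recall.

```python
def extract_spans(tags):
    """Collect (start, end_exclusive, type) entity spans from a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        begins = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype)
        if start is not None and (tag == "O" or begins):
            spans.add((start, i, etype))
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return spans

def phrase_f1(gold_tags, pred_tags):
    """Exact-match F1 over entity spans for one sentence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
f1 = phrase_f1(gold, pred)
```

Here one of the two gold entities is recovered and nothing spurious is predicted, so precision is 1, recall is 0.5, and F1 is 2/3.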

Performance Comparison
Single-Source Cross-Lingual NER Table 2 reports the results of different single-source cross-lingual NER methods. All results are obtained with English as the source language and others as target languages.
It can be seen that our proposed method outperforms the previous state-of-the-art methods. In particular, compared with the strong baselines of Wu and Dredze (2019) and Moon et al. (2019), which use nearly the same NER model as our method but are based on direct model transfer, our method obtains significant and consistent improvements in F1-scores, ranging from 0.51 for Dutch to 1.80 for German. This demonstrates the benefits of teacher-student learning over unlabeled target-language data, compared to direct model transfer. Moreover, compared with the latest meta-learning based method (Wu et al., 2020), our method requires much lower computational costs for both training and inference, while reaching superior performance.
Table 3: Performance comparisons of multi-source cross-lingual NER. Ours-avg: averaging teacher models (Eq. 7). Ours-sim: weighting teacher models with learned language similarities (Eq. 11). † denotes the reported results w.r.t. freezing the bottom three layers of BERT-Base.
Method              es     nl     de
Täckström (2012)    61.90  59.90  36.40
Multi-Source Cross-Lingual NER Here we select source languages in a leave-one-out manner, i.e., all languages except the target one are regarded as source languages. For fair comparisons, we take Spanish, Dutch, and German as target languages, respectively. Table 3 reports the results of different methods for multi-source cross-lingual NER. Both our teacher-student learning methods, i.e., Ours-avg (averaging teacher models, Eq. 7) and Ours-sim (weighting teacher models with learned language similarities, Eq. 11), outperform previous state-of-the-art methods on Spanish and German by a large margin, which demonstrates their effectiveness. We attribute the large performance gain to the teacher-student learning process, which further leverages helpful information from unlabeled data in the target language. Though Moon et al. (2019) achieve superior performance on Dutch, their method is not applicable in cases where the labeled source-language data is inaccessible, and thus it still suffers from the aforementioned limitation w.r.t. data availability.
Moreover, compared with Ours-avg, Ours-sim brings consistent performance improvements. That means, if unlabeled data in source languages is available, using our proposed language similarity measuring method for weighting different teacher models can be superior to simply averaging them.
Table 4: Ablation study of the proposed teacher-student learning method for cross-lingual NER. HL: Hard Label; MT: Direct Model Transfer; *-avg: averaging source-language models; *-sim: weighting source-language models with learned language similarities.

Ablation Study
Analyses on Teacher-Student Learning To validate the reasonableness of our proposed teacher-student learning method for cross-lingual NER, we introduce the following baselines. 1) Hard Label (HL), which rounds the probability distribution of entity labels (i.e., soft labels output by teacher models) into a one-hot label vector (i.e., hard labels) to guide the learning of the student model. Note that in multi-source cases, we use the combined probability distribution of multiple teacher models (Eq. 6) to derive the hard labels. To be consistent with Eq. 3, we still adopt the MSE loss here. In fact, both the MSE loss and the cross-entropy loss lead to the same observation described in this subsection. 2) Direct Model Transfer (MT), where no unlabeled target-language data is available to perform teacher-student learning, and thus it degenerates into: a) directly applying the source-language model in single-source cases, or b) directly applying a weighted ensemble of source-language models in multi-source cases, with weights derived via Eq. 6 and Eq. 11.
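The rounding used by the Hard Label baseline can be sketched in a few lines. The function name and the example distributions are hypothetical; ties are resolved in favour of the first maximal label, which is an assumption of this sketch.

```python
def to_hard_label(soft_label):
    """Round a probability distribution to a one-hot vector at its argmax."""
    best = max(range(len(soft_label)), key=soft_label.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(soft_label))]

# A confident and an uncertain teacher prediction map to the same hard label.
confident = to_hard_label([0.05, 0.90, 0.05])
uncertain = to_hard_label([0.30, 0.40, 0.30])
```

Both inputs yield the same one-hot target, so the teacher's confidence is no longer visible to the student after rounding.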
Table 4 reports the ablation study results. It can be seen that using hard labels (i.e., HL-*) results in consistent performance drops in all cross-lingual NER settings, which validates that the soft labels used in our proposed teacher-student learning method convey more information for knowledge transfer than hard labels. Moreover, we can also observe that using direct model transfer (i.e., MT-*) leads to even more significant performance drops in all cross-lingual NER settings (up to 1.46 F1-score). Both observations demonstrate that leveraging unlabeled data in the target language can be helpful, and that the proposed teacher-student learning method is capable of leveraging such information effectively for cross-lingual NER.

Analyses on Language Similarity Measuring
We further compare the proposed language similarity measuring method with other commonly used unsupervised metrics, i.e., cosine similarity and $\ell_2$ distance. Specifically, $s(x', \mu^{(k)})$ in Eq. 11 is replaced by the cosine similarity or negative $\ell_2$ distance between $x'$ and the mean value of sentence embeddings w.r.t. the $k$-th source language.
As shown in Table 5, replacing the proposed language similarity measuring method with either the cosine or the $\ell_2$ metric leads to consistent performance drops across all target languages. This further demonstrates the benefits of our language identification based similarity measuring method.
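The two baseline metrics in this comparison can be written down directly. The sketch assumes that both operate on a sentence embedding and a per-language mean embedding given as plain vectors; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def negative_l2_distance(a, b):
    """Negated Euclidean distance, so that larger values mean more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
distance_score = negative_l2_distance([0.0, 0.0], [3.0, 4.0])
```

Unlike the bilinear operator, neither metric has trainable parameters, so neither can learn which feature dimensions are informative about language identity.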

Why Teacher-Student Learning Works?
By analyzing which failed cases of directly applying the source-language model are corrected by the proposed teacher-student learning method, we try to offer insights on why teacher-student learning works, in the case of single-source cross-lingual NER. Firstly, teacher-student learning can probably help to learn label preferences for some specific words in the target language. Specifically, if a word appears in the unlabeled target-language data and the teacher model consistently predicts it to be associated with an identical label with high probabilities, the student model would learn the preferred label w.r.t. that word, and predict it in cases where the sentence context may not provide enough information. Such label preference can help the predictions for tokens that are less ambiguous and generally associated with an identical entity label. As illustrated in Figure 4, in example #1, the source-language (teacher) model fails to identify "EFE" as an ORG in the test sentences, while the student model (i.e., Ours) can correctly label it, because it has seen "EFE" labeled as ORG by the teacher model with high probabilities in the unlabeled target-language data $D_{tgt}$. Similar results can also be observed in examples #2 and #3.
Moreover, teacher-student learning may help to find a better classifying hyperplane for the student NER model with unlabeled target-language data. We notice that the source-language model generally makes correct label predictions with higher probabilities, and makes mispredictions with relatively lower probabilities. By calculating the proportion of its mispredictions that are corrected by our teacher-student learning method in different probability intervals, we find that our method tends to correct the low-confidence mispredictions, as illustrated in Figure 5. We conjecture that, with the help of unlabeled target-language data, our method can probably find a better classifying hyperplane for the student model, so that the low-confidence mispredictions, which are closer to the classifying hyperplane of the source-language model, can be corrected.
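The statistic behind this analysis can be sketched as follows, assuming that each test token is summarized by the teacher's predicted probability, whether the teacher was correct, and whether the student was correct. The record format, the equal-width bins, and the function name are all assumptions of this illustration.

```python
def correction_rate_by_confidence(records, num_bins=5):
    """records: (teacher_confidence, teacher_correct, student_correct) per token.
    Returns, per confidence bin, the share of teacher mispredictions the student
    gets right, or None for bins that contain no teacher misprediction."""
    mispredicted = [0] * num_bins
    corrected = [0] * num_bins
    for confidence, teacher_ok, student_ok in records:
        if teacher_ok:
            continue  # only teacher mispredictions are analysed
        bin_index = min(int(confidence * num_bins), num_bins - 1)
        mispredicted[bin_index] += 1
        if student_ok:
            corrected[bin_index] += 1
    return [corrected[b] / mispredicted[b] if mispredicted[b] else None
            for b in range(num_bins)]

# Hypothetical records: two low-confidence errors, one high-confidence error,
# and one correct teacher prediction that is ignored.
records = [(0.30, False, True), (0.35, False, False),
           (0.90, False, False), (0.95, True, True)]
rates = correction_rate_by_confidence(records)
```

A pattern in which the rates are higher in the low-confidence bins than in the high-confidence ones is what the conjecture about the classifying hyperplane would predict.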

Conclusion
In this paper, we propose a teacher-student learning method for single-/multi-source cross-lingual NER, using source-language models as teachers to train a student model on unlabeled data in the target language. The proposed method does not rely on labeled data in the source languages and is capable of leveraging extra information in the unlabeled target-language data, which addresses the limitations of previous label-projection based and model-transfer based methods. We also propose a language similarity measuring method based on language identification, to better weight different teacher models. Extensive experiments on benchmark datasets show that our method outperforms the existing state-of-the-art approaches.

Figure 1: Comparison between previous cross-lingual NER methods (a/b) and the proposed method (c). (a): direct model transfer; (b): label projection with pairwise texts; (c): proposed teacher-student learning method. $M_{src/tgt}$: learned NER model for source/target language; $\{X, Y\}_{src}$: labeled data in source language; $\{X'\}_{tgt}$: unlabeled data in target language; $\{X', Y'\}_{tgt}$ / $\{X', P'\}_{tgt}$: pseudo-labeled data in target language with hard labels / soft labels.

Figure 2: Framework of the proposed teacher-student learning method for single-source cross-lingual NER.

Figure 3: Framework of the proposed teacher-student learning method for multi-source cross-lingual NER.

Figure 4: Case study on why teacher-student learning works. The green (red) highlight indicates a correct (incorrect) label. The real-valued numbers indicate the predicted probability corresponding to the entity label.

Table 1: Statistics of the benchmark datasets.

Table 2: Performance comparisons of single-source cross-lingual NER. † denotes the reported results w.r.t. freezing the bottom three layers of BERT-Base as in this paper.

Table 5: Comparison between the proposed language similarity measuring method and the commonly used cosine/$\ell_2$ metrics for multi-source cross-lingual NER.