Massively Multilingual Transfer for NER

In cross-lingual transfer, NLP models over one or more source languages are applied to a low-resource target language. While most prior work has used a single source model or a few carefully selected models, here we consider a “massive” setting with many such models. This setting raises the problem of poor transfer, particularly from distant languages. We propose two techniques for modulating the transfer, suitable for zero-shot or few-shot learning, respectively. Evaluating on named entity recognition, we show that our techniques are much more effective than strong baselines, including standard ensembling, and our unsupervised method rivals oracle selection of the single best individual model.


Introduction
Supervised learning remains king in natural language processing, with most tasks requiring large quantities of annotated corpora. The majority of the world's 6,000+ languages, however, have limited or no annotated text, and therefore much of the progress in NLP has yet to be realised widely. Cross-lingual transfer learning can compensate for this dearth of data by transferring knowledge from high- to low-resource languages. It has typically taken the form of annotation projection over parallel corpora or other multilingual resources (Yarowsky et al., 2001; Hwa et al., 2005), or of transferable representations, such as phonetic transcriptions (Bharadwaj et al., 2016), closely related languages (Cotterell and Duh, 2017) or bilingual dictionaries (Mayhew et al., 2017; Xie et al., 2018).
Most methods proposed for cross-lingual transfer rely on a single source language, which limits the transferable knowledge to only one source.
The target language might be similar to many source languages, on the grounds of the script, word order, loan words etc, and transfer would benefit from these diverse sources of information.
There are a few exceptions, which use transfer from several languages, ranging from multitask learning (Duong et al., 2015; Ammar et al., 2016; Fang and Cohn, 2017) to annotation projection from several languages (Täckström, 2012; Fang and Cohn, 2016; Plank and Agić, 2018). However, to the best of our knowledge, none of these approaches adequately accounts for the quality of transfer; rather, they "weight" the contribution of each language uniformly.
In this paper, we propose a novel method for zero-shot multilingual transfer, inspired by research in truth inference in crowd-sourcing, a related problem, in which the 'ground truth' must be inferred from the outputs of several unreliable annotators (Dawid and Skene, 1979). In this problem, the best approaches estimate each model's reliability, and their patterns of mistakes (Kim and Ghahramani, 2012). Our proposed model adapts these ideas to a multilingual transfer setting, whereby we learn the quality of transfer, and language-specific transfer errors, in order to infer the best labelling in the target language, as part of a Bayesian graphical model. The key insight is that while the majority of poor models make lots of mistakes, these mistakes are diverse, while the few good models consistently provide reliable input. This allows the model to infer which are the reliable models in an unsupervised manner, i.e., without explicit supervision in the target language, and thereby make accurate inferences despite the substantial noise.
In the paper, we also consider a supervised setting, where a tiny annotated corpus is available in the target language. We present two methods to use this data: 1) estimate reliability parameters of the Bayesian model, and 2) explicit model selection and fine-tuning of a low-resource supervised model, thus allowing for more accurate modelling of language specific parameters, such as character embeddings, shown to be important in previous work (Xie et al., 2018).
Experimenting on two NER corpora, one with as many as 41 languages, we show that single model transfer has highly variable performance, and uniform ensembling often substantially underperforms the single best model. In contrast, our zero-shot approach does much better, exceeding the performance of the single best model, and our few-shot supervised models result in further gains.

Approach
We frame the problem of multilingual transfer as follows. We assume a collection of H models, all trained in a high-resource setting, denoted $M^h = \{M^h_i, i \in [1, H]\}$. None of these models is well matched to our target data setting; for instance, they may be trained on data from different domains or on different languages. We evaluate the latter case in our experiments, using cross-lingual embeddings for model transfer. This is a problem of transfer learning, namely, how best to use the H models in the target language. 2 Simple approaches in this setting include a) choosing a single model $M \in M^h$, on the grounds of practicality or of the similarity between the model's native data condition and the target, and using this model to label the target data; or b) allowing all models to 'vote' in a classifier ensemble, such that the most frequent outcome is selected as the ensemble output. Unfortunately, neither of these approaches is very accurate in a cross-lingual transfer setting: as we show in §4, a fixed source language model (en) dramatically underperforms oracle selection of the source language, and the same is true for uniform voting.
Motivated by these findings, we propose novel methods for learning. For the "zero-shot" setting, where no labelled data is available in the target, we propose the BEA uns method, inspired by work in truth inference from crowd-sourced datasets or diverse classifiers (§2.1). To handle the "few-shot" case, §2.2 presents a rival supervised technique, RaRe, which uses very limited annotations in the target language for model selection and classifier fine-tuning.
2 We limit our attention to transfer in a 'black-box' setting, that is, given predictive models, but not assuming access to their data, nor their implementation. This is the most flexible scenario, as it allows for application to settings with closed APIs, and private datasets. It does, however, preclude multitask learning, as the source models are assumed to be static.
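As a point of reference for baseline (b), uniform voting over token-level predictions can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `majority_vote` and the toy predictions are assumptions made for the example, and ties are broken randomly.

```python
import random
from collections import Counter


def majority_vote(predictions, seed=0):
    """Uniform ensemble over H label sequences of equal length N.

    predictions[h][i] is the label that transfer model h assigns to token i.
    Returns one label per token: the most frequent one, ties broken randomly.
    """
    rng = random.Random(seed)
    n_tokens = len(predictions[0])
    voted = []
    for i in range(n_tokens):
        counts = Counter(model_labels[i] for model_labels in predictions)
        best = max(counts.values())
        # Sort the tied labels so the random choice is reproducible.
        tied = sorted(label for label, c in counts.items() if c == best)
        voted.append(rng.choice(tied))
    return voted


# Hypothetical predictions from three transfer models on a 3-token sentence.
preds = [
    ["B-PER", "I-PER", "O"],
    ["B-PER", "O", "O"],
    ["B-ORG", "I-PER", "O"],
]
print(majority_vote(preds))
```

Every model contributes equally here, which is exactly the property that the methods in §2.1 and §2.2 relax.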

Zero-Shot Transfer
One way to improve the performance of the ensemble system is to select a subset of component models carefully, or more generally, learn a non-uniform weighting function. Some models do much better than others on their own, so it stands to reason that identifying this handful of models will give rise to better ensemble performance. How might we proceed to learn the relative quality of models in the setting where no annotations are available in the target language? This is a classic unsupervised inference problem, for which we propose a probabilistic graphical model, inspired by Kim and Ghahramani (2012).
We develop a generative model, illustrated in Figure 1, of the transfer models' predictions, $y_{ij}$, where $i \in [1, N]$ is an instance (a token or an entity span), and $j \in [1, H]$ indexes a transfer model. The generative process assumes a 'true' label, $z_i \in [1, K]$, which is corrupted by each transfer model in producing the prediction, $y_{ij}$. The corruption process is described by $P(y_{ij} = l \mid z_i = k, V^{(j)}) = V^{(j)}_{kl}$, where $V^{(j)} \in \mathbb{R}^{K \times K}$ is the confusion matrix specific to transfer model j.
To complete the story, the rows of the confusion matrices are drawn independently from vague Dirichlet priors with parameter α = 1, and the true labels are governed by a categorical distribution with parameter π, which is itself drawn from an uninformative Dirichlet distribution with parameter β = 1. This generative model is referred to as BEA.
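The generative story can be made concrete by sampling from it. The sketch below is illustrative only: it uses 0-based labels rather than the 1-based indices of the text, and the names `sample_dirichlet` and `sample_bea` are hypothetical. It draws π and the row-wise confusion matrices from symmetric Dirichlet distributions, then draws true labels and corrupted predictions.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_bea(n_instances, n_models, n_labels, alpha=1.0, beta=1.0, seed=0):
    """Sample (pi, V, z, y) from the generative process described above."""
    rng = random.Random(seed)
    labels = list(range(n_labels))
    # Label prior pi ~ Dirichlet(beta, ..., beta).
    pi = sample_dirichlet([beta] * n_labels, rng)
    # One confusion matrix per transfer model; each row ~ Dirichlet(alpha).
    V = [[sample_dirichlet([alpha] * n_labels, rng) for _ in labels]
         for _ in range(n_models)]
    # True labels z_i ~ Categorical(pi).
    z = rng.choices(labels, weights=pi, k=n_instances)
    # Each model j corrupts z_i using row z_i of its confusion matrix.
    y = [[rng.choices(labels, weights=V[j][z_i])[0] for j in range(n_models)]
         for z_i in z]
    return pi, V, z, y


pi, V, z, y = sample_bea(n_instances=5, n_models=3, n_labels=4)
print(z)
print(y)
```

Inference runs this process in reverse: only `y` is observed, and `z`, `V` and `pi` must be recovered.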
Inference under the BEA model involves explaining the observed predictions Y in the most efficient way. Where several transfer models make the same prediction, k, on an instance, this can be explained by letting $z_i = k$ and having the confusion matrices of those transfer models assign high probability to $V^{(j)}_{kk}$. Other, less reliable, transfer models will have divergent predictions, which are less likely to be in agreement, or else are heavily biased towards a particular class. Accordingly, the BEA model can better explain these predictions through label confusion, using the off-diagonal elements of the confusion matrix. Aggregated over a corpus of instances, the BEA model can learn to differentiate between reliable transfer models, with high $V^{(j)}_{kk}$, and less reliable ones, with high $V^{(j)}_{kl}$, $l \neq k$. This procedure applies per label, and thus the 'reliability' of a transfer model is with respect to a specific label and may differ between classes. This helps in the NER setting, where many poor transfer models have excellent accuracy for the outside label but considerably worse performance for entity labels.
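For inference, we use mean-field variational Bayes (Jordan, 1998), which learns a variational distribution, $q(Z, V, \pi)$, to optimise the evidence lower bound (ELBO), assuming a fully factorised variational distribution, $q(Z, V, \pi) = q(Z)q(V)q(\pi)$. This gives rise to an iterative learning algorithm with update rules:
$E_q \log \pi_k = \psi\big(\beta + \sum_i q(z_i = k)\big) - \psi(K\beta + N)$, $E_q \log V^{(j)}_{kl} = \psi\big(\alpha + \sum_i q(z_i = k)\,\mathbb{1}[y_{ij} = l]\big) - \psi\big(K\alpha + \sum_i q(z_i = k)\big)$ (1)
$q(z_i = k) \propto \exp\big\{E_q \log \pi_k + \sum_j E_q \log V^{(j)}_{k\,y_{ij}}\big\}$ (2)
where ψ is the digamma function, defined as the logarithmic derivative of the gamma function. The sets of rules (1) and (2) are applied alternately, to update the values of $E_q \log \pi_k$ and $E_q \log V^{(j)}_{kl}$, and of $q(z_i = k)$, respectively. This repeats until convergence, when the difference in the ELBO between two iterations is smaller than a threshold.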
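The alternating updates can be sketched in plain Python. This is a minimal illustration under several assumptions rather than the authors' code: labels are 0-based, q(Z) is initialised from per-instance vote proportions, the digamma function is approximated by a standard recurrence plus asymptotic series, and convergence is tested on the largest change in q(Z) instead of the ELBO difference. The names `digamma` and `bea_unsupervised` are hypothetical.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:  # recurrence: psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def bea_unsupervised(Y, K, alpha=1.0, beta=1.0, max_iter=100, tol=1e-6):
    """Mean-field inference; Y[i][j] is model j's label for instance i.

    Returns q, where q[i][k] approximates the posterior of z_i = k.
    """
    N, H = len(Y), len(Y[0])
    # Assumed initialisation: the proportion of models voting for each label.
    q = [[sum(1 for y in row if y == k) / H for k in range(K)] for row in Y]
    for _ in range(max_iter):
        # Expected log parameters given the current q(Z).
        n_k = [sum(q[i][k] for i in range(N)) for k in range(K)]
        e_log_pi = [digamma(beta + n_k[k]) - digamma(K * beta + N)
                    for k in range(K)]
        e_log_v = []
        for j in range(H):
            counts = [[alpha] * K for _ in range(K)]
            for i in range(N):
                for k in range(K):
                    counts[k][Y[i][j]] += q[i][k]
            e_log_v.append([[digamma(counts[k][l]) - digamma(sum(counts[k]))
                             for l in range(K)] for k in range(K)])
        # Re-estimate q(z_i = k), normalised in a numerically stable way.
        new_q = []
        for i in range(N):
            scores = [e_log_pi[k] + sum(e_log_v[j][k][Y[i][j]] for j in range(H))
                      for k in range(K)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            new_q.append([w / total for w in weights])
        change = max(abs(a - b) for old, new in zip(q, new_q)
                     for a, b in zip(old, new))
        q = new_q
        if change < tol:
            break
    return q


# Toy data: two models copy the true label, a third always predicts label 0.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
Y = [[t, t, 0] for t in truth]
q = bea_unsupervised(Y, K=2)
print([max(range(2), key=lambda k: row[k]) for row in q])
```

In the toy data the third model is biased towards a single class, the situation the confusion matrices are meant to explain away.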
The final prediction of the model is based on q(Z), using the maximum a posteriori label $\hat{z}_i = \arg\max_z q(z_i = z)$. This method is referred to as BEA uns . In our NER transfer task, classifiers are diverse, with F1 scores ranging from almost 0 to around 80, motivating spammer removal (Raykar and Yu, 2012) to filter out the worst of the transfer models. We adopt a simple strategy that first estimates the confusion matrices for all transfer models on all labels, then ranks the models by their mean recall over the entity categories (the corresponding diagonal elements of their confusion matrices), and then runs the BEA model again using labels from the top k transfer models only. We call this method BEA uns×2 and its results are reported in §4.
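The ranking step of this two-pass strategy can be sketched as below. The confusion matrices are assumed to be row-normalised estimates (for example, posterior means from a first BEA pass), `entity_labels` lists the indices of the non-outside labels, and the name `top_k_models` is hypothetical.

```python
def top_k_models(confusions, entity_labels, k):
    """Rank transfer models by mean diagonal recall over entity labels.

    confusions[j][c][l] estimates P(model j predicts l | true label c).
    Returns the indices of the k best-ranked models, best first.
    """
    def mean_entity_recall(V):
        return sum(V[c][c] for c in entity_labels) / len(entity_labels)

    ranked = sorted(range(len(confusions)),
                    key=lambda j: mean_entity_recall(confusions[j]),
                    reverse=True)
    return ranked[:k]


# Hypothetical estimates for three models; label 0 is the outside label.
V0 = [[0.90, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
V1 = [[0.99, 0.005, 0.005], [0.7, 0.2, 0.1], [0.6, 0.1, 0.3]]
V2 = [[0.80, 0.10, 0.10], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]]
print(top_k_models([V0, V1, V2], entity_labels=[1, 2], k=2))
```

The second pass then reruns inference on the prediction columns of the selected models only. Note that `V1` has the best accuracy on the outside label yet ranks last, which is why the outside label is excluded from the ranking criterion.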

Token versus Entity Granularity
Our proposed aggregation method in §2.1 is based on an assumption that the true annotations are independent from each other, which simplifies the model but may generate undesired results. That is, entities predicted by different transfer models could be mixed, resulting in labels inconsistent with the BIO scheme. Table 1 shows an example, where a sentence with 4 words is annotated by 5 transfer models with 4 different predictions, among which at most one is correct as they overlap. However, the aggregated result in the token view is a mixture of two predictions, which is supported by no transfer models.
To deal with this problem, we consider aggregating the predictions in the entity view. As shown in Table 1, we convert the predictions for tokens to predictions for ranges, aggregate labels for every range, and then resolve remaining conflicts. A prediction is ignored if it conflicts with another one with higher probability. This greedy strategy resolves the conflicts that arise in entity-level aggregation. We use superscripts tok and ent to denote token-level and entity-level aggregations, i.e. BEA tok uns and BEA ent uns .
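The two mechanical steps of the entity view, converting BIO tags to ranges and greedily resolving overlaps, can be sketched as follows. The per-range probabilities are taken as given here (in the paper they come from the aggregation model), and the function names and the lenient treatment of a stray `I-` tag as the start of a span are assumptions made for this example.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to (start, end, type) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        # A B- tag, or an I- tag with no open span, starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def resolve_conflicts(scored_spans):
    """Keep non-overlapping spans greedily, highest probability first.

    scored_spans is a list of ((start, end, type), probability) pairs.
    """
    kept = []
    for span, _prob in sorted(scored_spans, key=lambda sp: sp[1], reverse=True):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


print(bio_to_spans(["B-PER", "I-PER", "O", "B-LOC"]))
# Hypothetical aggregated ranges: the ORG range overlaps the more probable PER.
candidates = [((0, 2, "PER"), 0.6), ((1, 3, "ORG"), 0.3), ((3, 4, "LOC"), 0.9)]
print(resolve_conflicts(candidates))
```

Because only whole ranges are kept or dropped, the output can always be converted back to a tag sequence that is consistent with the BIO scheme.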

Few-Shot Transfer
Until now, we have assumed no access to annotations in the target language. However, when some labelled text is available, how might this best be used? In our experimental setting, we assume a modest set of 100 labelled sentences, in keeping with a low-resource setting (Garrette and Baldridge, 2013). We propose two models, BEA sup and RaRe, in this setting.
Supervising BEA (BEA sup ) One possibility is to use the labelled data to find the posterior for the parameters $V^{(j)}$ and π of the Bayesian model described in §2.1. Let $n_k$ be the number of instances in the labelled data whose true label is k, and $n_{jkl}$ the number of instances whose true label is k and which classifier j labels as l. Then the quantities in Equation (1) can be calculated as
$E \log \pi_k = \psi(\beta + n_k) - \psi\big(K\beta + \sum_{k'} n_{k'}\big)$, $E \log V^{(j)}_{kl} = \psi(\alpha + n_{jkl}) - \psi\big(K\alpha + \sum_{l'} n_{jkl'}\big)$.
These are used in Equation (2) for inference on the test set. We refer to this setting as BEA sup .
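A sketch of the supervised estimate: count gold labels and (gold, predicted) pairs on the labelled sentences, then convert the counts to expected log parameters. The version below keeps the Dirichlet pseudo-counts α and β in the posterior, which is an assumption of this example. Labels are 0-based, and `digamma` and `supervised_expectations` are hypothetical helper names.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:  # recurrence: psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def supervised_expectations(gold, preds, K, alpha=1.0, beta=1.0):
    """gold[i] is the true label of instance i; preds[i][j] is model j's label."""
    N, H = len(gold), len(preds[0])
    n_k = [0] * K
    n_jkl = [[[0] * K for _ in range(K)] for _ in range(H)]
    for z, row in zip(gold, preds):
        n_k[z] += 1
        for j, y in enumerate(row):
            n_jkl[j][z][y] += 1
    e_log_pi = [digamma(beta + n_k[k]) - digamma(K * beta + N) for k in range(K)]
    # Each instance gets exactly one prediction per model, so the counts
    # n_jkl summed over l equal n_k.
    e_log_v = [[[digamma(alpha + n_jkl[j][k][l]) - digamma(K * alpha + n_k[k])
                 for l in range(K)] for k in range(K)] for j in range(H)]
    return n_k, n_jkl, e_log_pi, e_log_v


# Toy labelled set: 4 instances, 2 models, 2 labels.
gold = [0, 0, 1, 1]
preds = [[0, 0], [0, 1], [1, 1], [1, 0]]
n_k, n_jkl, e_log_pi, e_log_v = supervised_expectations(gold, preds, K=2)
print(n_k, n_jkl)
```

The returned expectations can be plugged directly into the update for q(Z) on the test set, with no further iteration over the parameters.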
Ranking and Retraining (RaRe) We also propose an alternative way of exploiting the limited annotations, RaRe, which first ranks the systems, and then uses the top ranked models' outputs alongside the gold data to retrain a model on the target language. The motivation is that the above technique is agnostic to the input text, and is therefore unable to exploit regularities such as common words or character patterns that are indicative of specific class labels, including names, titles, etc. These signals are unlikely to be consistently captured by cross-lingual transfer. Training a model on the target language with a character encoder component can distil the signals that are captured by the transfer models, while relating them to generalisable lexical and structural evidence in the target language. This on its own will not be enough, as many tokens will be consistently misclassified by most or all of the transfer models, and for this reason we also perform model fine-tuning using the supervised data.
The ranking step in RaRe proceeds by evaluating each of the H transfer models on the target gold set, to produce scores $s_h$ (using the F1 score). The scores are then truncated to the top k ≤ H values, such that $s_h = 0$ for those systems h not ranked in the top k, and normalised, $\omega_h = s_h / \sum_{h'=1}^{H} s_{h'}$. The range of scores is quite wide, covering 0.00 to 0.81 (see Figure 2), and accordingly this simple normalisation conveys a strong bias towards the top scoring transfer systems.
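Truncation and normalisation of the scores can be sketched in a few lines. The function name `mixing_weights` and the toy F1 scores are assumptions made for this example.

```python
def mixing_weights(scores, k):
    """Zero out all but the top-k scores, then normalise them to sum to one."""
    top = set(sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k])
    truncated = [s if h in top else 0.0 for h, s in enumerate(scores)]
    total = sum(truncated)
    if total == 0:
        raise ValueError("all retained scores are zero; cannot normalise")
    return [s / total for s in truncated]


# Hypothetical F1 scores of four transfer models on the target gold set.
f1_scores = [0.81, 0.40, 0.05, 0.60]
print(mixing_weights(f1_scores, k=2))
```

With k = 2 only the first and last models keep non-zero weight, in the ratio 0.81 to 0.60.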
The next step is a distillation step, where a model is trained on a large unannotated dataset in the target language, such that the model predictions match those of a weighted mixture of transfer models, using $\omega = (\omega_1, \ldots, \omega_H)$ as the mixing weights. This process is implemented as minibatch scheduling, where the labels for each minibatch are randomly sampled from transfer model h with probability $\omega_h$. This is repeated over the course of several epochs of training. Finally, the model is fine-tuned using the small supervised dataset, in order to correct for phenomena that are not captured by model transfer, particularly character-level information, which is not likely to transfer well for all but the most closely related languages. Fine-tuning proceeds for a fixed number of epochs on the supervised dataset, to limit overtraining of richly parameterised models on a tiny dataset. Note that the same supervised dataset is used in both ranking and fine-tuning, and moreover, we do not use a development set. This is not ideal, and generalisation performance would likely improve were we to use additional annotated data; however, our meagre use of data is designed for a low-resource setting where labelled data is at a premium.
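Minibatch scheduling can be sketched as a generator that picks one transfer model per minibatch according to the mixing weights and yields that model's labels for the batch. The data layout (`model_labels[h][i]` holding model h's label sequence for sentence i) and the name `distillation_batches` are assumptions made for this example; the training step itself is omitted.

```python
import random


def distillation_batches(sentences, model_labels, weights, batch_size, seed=0):
    """Yield (batch_sentences, batch_labels, source_model) triples.

    For each minibatch a transfer model h is sampled with probability
    weights[h], and the whole batch is labelled with that model's predictions.
    """
    rng = random.Random(seed)
    models = list(range(len(weights)))
    for start in range(0, len(sentences), batch_size):
        h = rng.choices(models, weights=weights)[0]
        idx = range(start, min(start + batch_size, len(sentences)))
        yield ([sentences[i] for i in idx],
               [model_labels[h][i] for i in idx],
               h)


# Toy data: four one-token sentences labelled by three hypothetical models.
sentences = [["Paris"], ["Anna"], ["Oslo"], ["IBM"]]
model_labels = [
    [["B-LOC"], ["B-PER"], ["B-LOC"], ["B-ORG"]],
    [["B-LOC"], ["O"], ["B-LOC"], ["O"]],
    [["O"], ["O"], ["O"], ["O"]],
]
for batch, labels, h in distillation_batches(sentences, model_labels,
                                             weights=[0.6, 0.4, 0.0],
                                             batch_size=2):
    print(h, batch, labels)
```

A model with weight zero, such as the third one here, is never sampled, so truncating the ranking to the top k removes the weakest models from distillation entirely.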

Data
Our primary evaluation is over a subset of the Wikiann NER corpus (Pan et al., 2017), using 41 out of 282 languages, where the languages were chosen based on their overlap with multilingual word embedding resources from Lample et al. (2018). 6 The NER tags are in IOB2 format, comprising LOC, PER, and ORG. The distribution of labels is highly skewed, so we created balanced datasets, and partitioned them into training, development, and test sets, details of which are in the Appendix. For comparison with prior work, we also evaluate on the CoNLL 2002 and 2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), which we discuss further in §4.
For language-independent word embedding features we use fastText 300-dimensional Wikipedia embeddings (Bojanowski et al., 2017), and map them to the English embedding space using character-identical words as the seed for the Procrustes rotation method for learning bilingual embedding spaces from MUSE (Lample et al., 2018). 7 Similar to Xie et al. (2018), we don't rely on a bilingual dictionary, so the method can be easily applied to other languages.
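The seeding idea can be illustrated with a single orthogonal Procrustes solve in NumPy. This is a sketch, not the MUSE implementation: MUSE adds refinement steps that are omitted here, the function name `procrustes_from_identical_strings` is hypothetical, and embeddings are assumed to be given as arrays whose rows align with the vocabulary lists.

```python
import numpy as np


def procrustes_from_identical_strings(lang_vocab, lang_emb, en_vocab, en_emb):
    """Learn an orthogonal map W so that lang_emb @ W lies in the English space.

    Word pairs that are spelled identically in both vocabularies act as the
    seed dictionary. W minimises ||X W - Y||_F subject to W being orthogonal,
    which has the closed form W = U V^T with U S V^T = SVD(X^T Y).
    """
    en_index = {w: i for i, w in enumerate(en_vocab)}
    pairs = [(i, en_index[w]) for i, w in enumerate(lang_vocab) if w in en_index]
    X = lang_emb[[i for i, _ in pairs]]  # seed words in the source language
    Y = en_emb[[j for _, j in pairs]]    # the same strings in English
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check: the "English" space is an exact rotation of the source space.
rng = np.random.default_rng(0)
lang_vocab = [f"w{i}" for i in range(20)]
en_vocab = list(reversed(lang_vocab))
lang_emb = rng.normal(size=(20, 5))
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
en_emb = (lang_emb @ R)[::-1]                 # rows follow en_vocab order
W = procrustes_from_identical_strings(lang_vocab, lang_emb, en_vocab, en_emb)
print(np.allclose(W, R))
```

On real embeddings the identical strings are mostly names, numerals and loan words, so the seed is noisy and the recovered map is only approximate.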

Model Variations
As the sequential tagger, we use a BiLSTM-CRF (Lample et al., 2016), which has been shown to achieve state-of-the-art results in high-resource settings (Ma and Hovy, 2016; Lample et al., 2016). This model includes both word embeddings (for which we used fixed cross-lingual embeddings) and character embeddings, to form a parameterised potential function in a linear chain conditional random field. With the exception of batch size and learning rate, which were tuned (details in Appendix), we kept the architecture and the hyperparameters the same as the published code. 8
We trained models on all 41 languages in both high-resource (HSup) and naive supervised low-resource (LSup) settings, where HSup pre-trained models were used for transfer in a leave-one-out setting, i.e., taking the predictions of 40 models into a single target language. The same BiLSTM-CRF is also used for RaRe.
6 With ISO 639-1 codes: af, ar, bg, bn, bs, ca, cs, da, de, el, en, es, et, fa, fi, fr, he, hi, hr, hu, id, it, lt, lv, mk, ms, nl, no, pl, pt, ro, ru, sk, sl, sq, sv, ta, tl, tr, uk and vi.
7 We also experimented with other bilingual embedding methods, including: supervised learning over bilingual dictionaries, which barely affected system performance; and pure-unsupervised methods (Lample et al., 2018; Artetxe et al., 2018), which performed substantially worse. For this reason we use identical word type seeding, which is preferred as it imposes no additional supervision requirement.
8 https://github.com/guillaumegenthial/sequence_tagging
To avoid overfitting, we use early stopping based on a validation set for the HSup, and LSup baselines. For RaRe, given that the model is already trained on noisy data, we stop fine-tuning after only 5 iterations, chosen based on the performance for the first four languages.
We compare the supervised HSup and LSup monolingual baselines with our proposed transfer models: MV, a uniform ensemble, a.k.a. "majority vote"; BEA uns×2 and BEA uns , unsupervised aggregation models, applied to entities or tokens (see §2.1); BEA sup , supervised estimation of the BEA prior (§2.2); RaRe and RaRe uns , the supervised ranking and retraining model (§2.2) and a variant with uniform ranking and no fine-tuning, respectively; and Oracle, which selects the best performing single transfer model based on test performance.
We also compare with BWET (Xie et al., 2018) as state-of-the-art in unsupervised NER transfer. BWET transfers the source English training and development data to the target language using bilingual dictionary induction (Lample et al., 2018), and then uses a transformer architecture to compensate for missing sequential information. We used BWET in both CoNLL, and Wikiann datasets by transferring from their corresponding source English data to the target language. 9

Results
We report the results for single source direct transfer, and then show that our proposed multilingual methods outperform majority voting. Then we analyse the choice of source languages, and how it affects transfer. 10 Finally we report results on CoNLL NER datasets.
9 Because BWET uses identical characters for bilingual dictionary induction, we observed many English loan words in the target language mapped to the same word in the induced bilingual dictionaries. Filtering such dictionary items might improve BWET.
10 For detailed results see Table 4.
[Figure caption, start missing] The x axis shows the annotation requirement of each model in the target language, where "200" means 100 sentences each for training and development, and "5K+" means using all the available annotation for training and development sets. Points with the same colour/shape have equal data requirement.
Direct Transfer The first research question we consider is the utility of direct transfer, and of the simple majority vote ensembling method. As shown in Figure 2, using a single model for direct transfer (English: en) is often a terrible choice. The oracle choice of source language model does much better; however, it is not always a closely related language (e.g., Italian: it does best for Indonesian: id, despite the target being closer to Malay: ms). Note the collection of Cyrillic languages (bg, mk, uk) where the oracle is substantially better than the majority vote, which is likely due to script differences. The role of script appears to be more important than language family, as seen for Slavic languages, where direct transfer works well between pairs of languages using the same alphabet (Cyrillic versus Latin), but much more poorly when there is an alphabet mismatch. The transfer relationship is not symmetric, e.g., Persian: fa does best for Arabic: ar, but German: de does best for Persian. Figure 2 also shows that ensemble voting is well below the oracle best language, which is likely to be a result of overall high error rates coupled with error correlation between models, so that little can be gained from ensembling.

Multilingual Transfer
We report the results for the proposed low-resource supervised models (RaRe and BEA sup ), and unsupervised models (BEA uns and BEA uns×2 ), summarised as an average over the 41 languages in Figure 3 (see Appendix A for the full table of results). The figure compares against high- and low-resource supervised baselines (HSup and LSup, respectively), and BWET. The best performance is achieved with high supervision (HSup, F1 = 89.2), while very limited supervision (LSup) results in a considerably lower F1 of 62.1. The results for MV tok show that uniform ensembling of multiple source models is even worse, by about 5 points.
Unsupervised zero-shot learning dramatically improves upon MV tok , and BEA ent uns outperforms BEA tok uns , showing the effectiveness of inference over entities rather than tokens. Further analysis shows that majority voting works reasonably well for Romance and Germanic languages, which are well represented in the dataset, but fails miserably compared to the single best model for Slavic languages (e.g. ru, uk, bg), where there are only a few related languages. For most of the isolated languages (ar, fa, he, vi, ta), explicitly training a model in RaRe outperforms BEA ent sup , showing that relying only on aggregation of annotated data has limitations, in that it cannot exploit character and structural features.
Choice of Source Languages An important question is how the other models, particularly the unsupervised variants, are affected by the number and choice of source languages. Figure 4 charts the performance of MV, BEA, and RaRe against the number of source models, comparing the use of ideal or realistic selection methods to attempt to find the best source models. MV ent , BEA ent sup , and RaRe use a small labelled dataset to rank the source models. BEA ent uns, oracle has access to the perfect ranking of source models based on their real F1 on the test set. BEA uns×2 is completely unsupervised in that it uses its own estimates to rank all source models.
MV doesn't show any benefit with more than 3 source models. 12 In contrast, BEA and RaRe continue to improve with up to 10 languages. We show that BEA in two realistic scenarios (unsupervised: BEA ent uns×2 , and supervised: BEA ent sup ) is highly effective at discriminating between good and bad source models, and thus filtering out the bad models gives the best results. The BEA ent uns×2 curve shows the effect of filtering using purely unsupervised signal, which has a positive, albeit mild, effect on performance. Although the source model ranking in BEA ent uns, oracle is perfect, it only narrowly outperforms BEA. Note also that neither of the BEA curves shows evidence of the sawtooth pattern, i.e., they largely benefit from more inputs, irrespective of their parity. Finally, adding supervision in the target language in RaRe further improves upon the unsupervised models.
12 The sawtooth pattern arises from the increased numbers of ties (broken randomly) with even numbers of inputs.
CoNLL Dataset Finally, we apply our model to the CoNLL-02/03 datasets, to benchmark our technique against related work. This corpus is much less rich than the Wikiann corpus used above, as it includes only four languages (en, de, nl, es), and furthermore, the languages are closely related and share the same script. Results in Table 2 show that our methods are competitive with benchmark methods, and, moreover, the use of 100 annotated sentences in the target language (RaRe l ) gives good improvements over unsupervised models. Results also show that MV does very well, especially MV ent , and its performance is comparable to BEA's. Note that there are only 3 source models and none of them is clearly bad, so BEA estimates that they are similarly reliable, which results in little difference in performance between BEA and MV.

Related Work
Two main approaches for cross-lingual transfer are representation projection and annotation projection. Representation projection learns a model in a high-resource source language using representations that are cross-linguistically transferable, and then directly applies the model to data in the target language. This can include the use of cross-lingual word clusters and word embeddings (Ammar et al., 2016; Ni et al., 2017), multitask learning with a closely related high-resource language (e.g. Spanish for Galician) (Cotterell and Duh, 2017), or bridging [...] (Tsai et al., 2016). In annotation projection, the annotations of tokens in a source sentence are projected to their aligned tokens in the target language through a parallel corpus. [...] et al., 2017; Xie et al., 2018). Most annotation projection methods, with few exceptions (Täckström, 2012; Plank and Agić, 2018), use only one language (often English) as the source language. In the multi-source setting, majority voting is often used to aggregate noisy annotations (e.g. Plank and Agić (2018)). Fang and Cohn (2016) show the importance of modelling the annotation biases that the source language(s) might project to the target language.
Transfer from multiple source languages: Previous work has shown the improvements of multi-source transfer in NER (Täckström, 2012; Fang et al., 2017; Enghoff et al., 2018), POS tagging (Snyder et al., 2009; Plank and Agić, 2018), and parsing (Ammar et al., 2016) compared to single source transfer. However, multi-source transfer might be noisy as a result of divergence in script, phonology, morphology, syntax, and semantics between the source languages and the target language. To capture such differences, various methods have been proposed: latent variable models (Snyder et al., 2009), majority voting (Plank and Agić, 2018), utilising typological features (Ammar et al., 2016), or explicitly learning annotation bias (Fang and Cohn, 2017). Our work is also related to knowledge distillation from multiple source models applied in parsing (Kuncoro et al., 2016) and machine translation (Kim and Rush, 2016; Johnson et al., 2017). In this work, we use truth inference to model the transfer annotation bias from diverse source models. Finally, our work is related to truth inference from crowd-sourced annotations (Whitehill et al., 2009; Welinder et al., 2010), and most importantly from diverse classifiers (Kim and Ghahramani, 2012; Ratner et al., 2017). Nguyen et al. (2017) propose a hidden Markov model for aggregating crowd-sourced sequence labels, but only learn per-class accuracies for workers instead of full confusion matrices, in order to address the data sparsity problem in crowd-sourcing.

Conclusion
Cross-lingual transfer does not work out of the box, especially when using large numbers of source languages and distantly related target languages. In an NER setting using a collection of 41 languages, we showed that simple methods such as uniform ensembling do not work well. We proposed two new multilingual transfer models (RaRe and BEA), suited to unsupervised transfer or to supervised transfer with a small labelled dataset of 100 sentences in the target language. We also compared our results with BWET (Xie et al., 2018), a state-of-the-art unsupervised single source (English) transfer model, and showed that multilingual transfer outperforms it. Our work is nonetheless orthogonal to theirs: if training data from multiple source models is created, RaRe and BEA can still combine them and outperform majority voting. Our unsupervised method, BEA uns , provides a fast and simple way of annotating data in the target language, which is capable of reasoning under noisy annotations, and outperforms several competitive baselines, including the majority voting ensemble, a low-resource supervised baseline, and the oracle single best transfer model. We show that light supervision improves performance further, and that our second approach, RaRe, based on ranking transfer models and then retraining on the target language, results in further and more consistent performance improvements.

A.1 Hyperparameters
We tuned the batch size and the learning rate using development sets in four languages, and then fixed these hyperparameters for all other languages in each model. The batch size was 1 sentence in low-resource scenarios (the LSup baseline and fine-tuning of RaRe), and 100 sentences in high-resource settings (HSup and the pretraining phase of RaRe). The learning rate was set to 0.001 and 0.01 for the high-resource and low-resource baseline models, respectively, and to 0.005 and 0.0005 for the pretraining and fine-tuning phases of RaRe, based on development results for the four languages. For CoNLL datasets, we had to decrease the batch size of the pretraining phase from 100 to 20 (because of GPU memory issues).
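For quick reference, the reported settings can be collected in one configuration structure. The dictionary layout and key names below are illustrative assumptions; only the numeric values come from the text above.

```python
# Batch sizes and learning rates as reported above; key names are illustrative.
HYPERPARAMS = {
    "HSup": {"batch_size": 100, "learning_rate": 0.001},
    "LSup": {"batch_size": 1, "learning_rate": 0.01},
    "RaRe": {
        "pretrain": {"batch_size": 100, "learning_rate": 0.005},
        "finetune": {"batch_size": 1, "learning_rate": 0.0005},
    },
}

# On the CoNLL datasets the pretraining batch size was reduced from 100 to 20.
CONLL_OVERRIDES = {"RaRe": {"pretrain": {"batch_size": 20}}}

print(HYPERPARAMS["RaRe"]["finetune"])
```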

A.2 Cross-lingual Word Embeddings
We experimented with Wiki and CommonCrawl monolingual embeddings from fastText (Bojanowski et al., 2017). Each of the 41 languages is mapped to the English embedding space using three methods from MUSE: 1) supervised with bilingual dictionaries; 2) seeding using identical character sequences; and 3) unsupervised training using adversarial learning (Lample et al., 2018). The cross-lingual mappings are evaluated by precision at k = 1. The resulting cross-lingual embeddings are then used in NER direct transfer in a leave-one-out setting for the 41 languages (41×40 transfers), and we report the mean F1 in Table 3. CommonCrawl doesn't perform well in bilingual induction despite having larger text corpora, and underperforms in direct transfer NER. It is also evident that using identical character strings instead of a bilingual dictionary as the seed for learning a supervised bilingual mapping barely affects the performance. This finding also applies to few-shot learning over larger ensembles: running RaRe over 40 source languages achieves an average F1 of 77.9 when using embeddings trained with a dictionary, versus 76.9 using string identity instead. For this reason we have used the string identity method in the paper (e.g., Table 4), providing greater portability to language pairs without a bilingual dictionary. Experiments with unsupervised mappings performed substantially worse than supervised methods, and so we didn't explore these further.
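Precision at k = 1 for an induced bilingual dictionary can be sketched as follows. Plain cosine nearest-neighbour retrieval is assumed here, and MUSE's own evaluation may use a different retrieval criterion. The function names, the toy vectors and the gold dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def precision_at_1(mapped_src, tgt_emb, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    mapped_src: {source word: vector already mapped into the target space}
    tgt_emb:    {target word: vector}
    gold:       {source word: set of acceptable target words}
    """
    correct = 0
    for word, translations in gold.items():
        nearest = max(tgt_emb, key=lambda t: cosine(mapped_src[word], tgt_emb[t]))
        correct += nearest in translations
    return correct / len(gold)


# Toy 2-d example with a deliberately wrong gold entry for "katze".
mapped = {"hund": [1.0, 0.0], "katze": [0.0, 1.0]}
english = {"dog": [0.9, 0.1], "cat": [0.1, 0.9], "car": [0.7, 0.7]}
gold = {"hund": {"dog"}, "katze": {"car"}}
print(precision_at_1(mapped, english, gold))
```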

A.3 Direct Transfer Results
Figure 5 shows the performance of an NER model trained in a high-resource setting on a source language and applied to the other 40 target languages (leave-one-out). An interesting finding is that symmetry does not always hold (e.g. id vs. ms or fa vs. ar).

A.4 Detailed Low-resource Results
The results of applying the baselines, the proposed models and their variations, and the unsupervised transfer model of Xie et al. (2018) are shown in Table 4.
Table 4: The size of the training and test sets (development set size equals test set size) in thousand sentences, and the precision at 1 for bilingual dictionaries induced from mapping languages to the English embedding space using identical characters (BiDic.P@1). F1 scores on the test set, comparing baseline supervised models (HSup, LSup), multilingual transfer from top k source languages (RaRe, 5 runs, k = 1, 10, 40), an unsupervised RaRe with uniform expertise and no fine-tuning (RaRe uns ), and aggregation methods: majority voting (MV tok ), BEA tok uns and BEA ent uns (Bayesian aggregation at token and entity level), and the oracle single best annotation (Oracle). We also compare with BWET (Xie et al., 2018), an unsupervised transfer model with state-of-the-art results on CoNLL NER datasets. The mean and standard deviation over all 41 languages, µ and σ, are also reported.