Low-resource named entity recognition via multi-source projection: Not quite there yet?

Projecting linguistic annotations through word alignments is one of the most prevalent approaches to cross-lingual transfer learning. Conventional wisdom suggests that annotation projection “just works” regardless of the task at hand. We carefully consider multi-source projection for named entity recognition. Our experiment with 17 languages shows that to detect named entities in true low-resource languages, annotation projection may not be the right way to move forward. On a more positive note, we also uncover the conditions that do favor named entity projection from multiple sources. We argue these are infeasible under noisy low-resource constraints.


Motivation
Annotation projection plays a crucial role in cross-lingual NLP. For instance, the state-of-the-art approaches to low-resource part-of-speech tagging (Das and Petrov, 2011; Täckström et al., 2013) and dependency parsing (Ma and Xia, 2014; Rasooli and Collins, 2015) all make use of parallel corpora under the source-target language dichotomy in one way or another. Beyond syntactic tasks, aligned corpora facilitate cross-lingual transfer through multilingual embeddings (Ruder et al., 2017) across diverse tasks.
What about named entity recognition (NER)? This sequence labeling task with ample source languages appears to be an easy target for projection. However, as recently argued by Mayhew et al. (2017), the issue is more complex: "For NER, the received wisdom is that parallel projection methods work very well, although there is no consensus on the necessary size of the parallel corpus. Most approaches require millions of sentences, with a few exceptions which require thousands. Accordingly, the drawback to this approach is the difficulty of finding any parallel data, let alone millions of sentences. Religious texts (such as the Bible and the Koran) exist in a large number of languages, but the domain is too far removed from typical target domains (such as newswire) to be useful. As a simple example, the Bible contains almost no entities tagged as organization."
Our paper is a thorough empirical assessment of the quoted conjecture for named entity (NE) tagging in true low-resource languages. Specifically, we ask the following questions:
- Are there conditions under which the projection of named entity labels from multiple sources yields feasible NE taggers?
- If yes, do these conditions scale down to real low-resource languages?
To answer these questions, we conduct an extensive study of annotation projection from multiple sources for low-resource NER. It includes 17 diverse languages with heterogeneous datasets, and 2 massive parallel corpora. In terms of cross-lingual breadth, ours is one of the largest NER experiments to date, and the only one that focuses on standalone annotation projection. We find that the specific conditions that do make NER projection work are not trivially met at a feasibly large scale by true low-resource languages.

Multilingual projection
We project NE labels from multiple sources into multiple targets through sentence and word alignments. Our projection requires source NE taggers and parallel corpora that are ideally large in both breadth (across many languages) and depth (number of parallel sentences). Evidently, we require that i) the source language texts in the corpus are tagged for named entities, and that ii) the parallel corpora are aligned. Both conditions are typically met under some noise: by applying source-language NE taggers, and unsupervised sentence and word aligners, respectively.

Figure 1: An illustration of named entity projection from two source sentences (Danish, English) to one target (Croatian). In this example, the voting of entity labels is weighted by tagger confidence and alignment probability. The outside label (O) is omitted for simplicity.

Algorithm 1 (closing steps): LABELING(v_t) = arg max_l BALLOT(l | v_t); return BALLOT, LABELING

We view a parallel corpus as a large collection of multilingual sentences. A multilingual sentence is a graph G = (V, A) comprising a target sentence t and n source sentences. The vertex sets V = V_0 ∪ ... ∪ V_n represent words in sentences, where the words v_t ∈ V_0 belong to the target sentence V_0 = V_t, while all other words v_s ∈ V_i belong to their respective source sentences V_i, i ∈ {1, ..., n}. The graph is bipartite between source vertices V_s = V \ V_t and target vertices V_t, where the edges are word alignments with aligner confidences a(v_s, v_t) ∈ (0, 1) as weights. Each source token v_s is associated with a label distribution p(l | v_s) that comes from the respective source-language tagger and indicates its confidence over labels l ∈ L. Here, the labels L are NE tags, but elsewhere they could instantiate other sequence labeling tasks such as POS tagging or shallow parsing.
Under these assumptions, we implement projection as weighted voting of source contributions to target words, such that for each target word v_t we collect votes into a ballot:

    BALLOT(l | v_t) = Σ_{v_s ∈ V_s} p(l | v_s) · a(v_s, v_t)

Here, each source token v_s gets to cast a vote for the future label of v_t. Each vote is weighted by its own tagger confidence and the reliability of its alignment to target token v_t: p(l | v_s) · a(v_s, v_t). The individual votes are then summed and the tags for the target tokens are elected. We can train an NE tagger directly from BALLOT provided some normalization to (0, 1), or we can decode a single majority tag for each target word:

    LABELING(v_t) = arg max_l BALLOT(l | v_t)

The process is further detailed as Algorithm 1 and also depicted in Figure 1 for two source vertex sets V_i and V_j, and one target set V_t. This simple procedure was proven to be markedly robust and effective in massively multilingual transfer of POS taggers, especially for truly low-resource languages, by Agić et al. (2016).

Dataset                                                  Languages        Domain
CoNLL 2002 (Tjong Kim Sang, 2002)                        es nl            news
CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003)         en de            news
OntoNotes 5.0 (Weischedel et al., 2011)                  en ar            news
NER FIRE 2013 (Rao and Devi, 2013)                       ta hi            wiki
ANERCorp (Benajiba et al., 2007)                         ar               news
BSNLP 2017 (Piskorski et al., 2017)                      cs hu pl sk sl   news
Estonian NER (Tkachenko et al., 2013)                    et               news
Europeana NER (Neudecker, 2016)                          fr               news
I-CAB (Magnini et al., 2006)                             it               news
HAREM (Santos et al., 2006)                              pt               -
Stockholm Internet Corpus (Östling, 2013)                sv               blogs

Table 1: The NER datasets in our experiment. We indicate the languages and domains they cover.

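The weighted voting behind BALLOT and LABELING can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `project_labels`, the input layout (per-token label distributions plus alignment triples), and the fallback to the outside label O for target tokens that receive no votes are all assumptions made for illustration.

```python
from collections import defaultdict


def project_labels(target_len, source_sentences):
    """Weighted voting of source labels onto target tokens (illustrative sketch).

    source_sentences: one (label_dists, alignments) pair per source sentence.
      label_dists[i] maps label -> p(l | v_s) for source token i.
      alignments is a list of (source_idx, target_idx, weight) edges,
      where weight plays the role of a(v_s, v_t).
    """
    ballot = [defaultdict(float) for _ in range(target_len)]
    for label_dists, alignments in source_sentences:
        for s_idx, t_idx, weight in alignments:
            for label, prob in label_dists[s_idx].items():
                # Each vote is tagger confidence times alignment confidence.
                ballot[t_idx][label] += prob * weight
    # Assumption: target tokens without any incoming vote default to "O".
    labeling = [max(votes, key=votes.get) if votes else "O" for votes in ballot]
    return ballot, labeling


# Hypothetical toy input: two source sentences voting on three target tokens.
danish = ([{"PER": 0.9, "O": 0.1}, {"O": 1.0}], [(0, 0, 0.8), (1, 1, 0.9)])
english = ([{"PER": 0.6, "O": 0.4}, {"LOC": 0.7, "O": 0.3}], [(0, 0, 0.5), (1, 1, 0.4)])
ballot, labeling = project_labels(3, [danish, english])
print(labeling)
```

The unnormalized ballot is returned alongside the decoded tags, mirroring the two uses named in the text: training from (normalized) vote distributions or from single majority tags.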
We take into account a set of additional design choices in multi-source NER projection beyond what the algorithm itself encodes.
Sentence selection. We compare two ways to sample the target sentences for training: at random vs. through word-alignment coverage ranking. A target word is covered if it has an incoming alignment edge from at least one source word. We score each target sentence by its percentage of covered words from each source, and rank the sentences by mean coverage across sources. We then select the top k ranked sentences to train a tagger. We optimize k for maximum NER scores on development data.
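As a rough illustration of coverage-based selection, the sketch below computes per-source coverage for each target sentence, averages it across sources, and keeps the top k. The function names and the data layout (a target length plus one list of (source_idx, target_idx) edges per source) are hypothetical choices, not taken from the paper.

```python
def coverage(target_len, alignments):
    """Fraction of target tokens with at least one incoming alignment edge."""
    if target_len == 0:
        return 0.0
    covered = {t_idx for _, t_idx in alignments}
    return len(covered) / target_len


def select_top_k(sentences, k):
    """Return indices of the k target sentences with highest mean coverage.

    sentences: list of (target_len, per_source_alignments), where
    per_source_alignments holds one edge list per source language.
    """
    def mean_coverage(item):
        target_len, per_source = item
        if not per_source:
            return 0.0
        return sum(coverage(target_len, a) for a in per_source) / len(per_source)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: mean_coverage(sentences[i]),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus with two sources per target sentence.
corpus = [
    (4, [[(0, 0), (1, 1)], [(0, 0)]]),          # mean coverage 0.375
    (2, [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]),  # mean coverage 1.0
    (3, [[], [(0, 2)]]),                        # mean coverage about 0.17
]
print(select_top_k(corpus, 2))
```

In the experiment k is tuned on development data; here it is simply passed in as an argument.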
Language similarity. Some source languages arguably help some targets more than others. We model this relation through language similarity between the source and target WALS feature vectors v_s and v_t (Dryer and Haspelmath, 2013). We implement language similarity as the inverse normalized Hamming distance between the two vectors, taking only the non-null fields into account. Similarity is contrasted with random selection in our experiment.
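One possible reading of this similarity is sketched below in Python. The sketch assumes that "inverse" means one minus the normalized Hamming distance, that null WALS fields are represented as `None`, and that a pair with no shared non-null fields gets similarity 0; none of these details are confirmed by the text, and the function name is hypothetical.

```python
def wals_similarity(source_vec, target_vec):
    """1 - normalized Hamming distance over fields that are non-null in both vectors."""
    pairs = [(s, t) for s, t in zip(source_vec, target_vec)
             if s is not None and t is not None]
    if not pairs:
        # Assumption: no comparable fields means no evidence of similarity.
        return 0.0
    mismatches = sum(1 for s, t in pairs if s != t)
    return 1.0 - mismatches / len(pairs)


# Hypothetical feature vectors; None marks a missing WALS value.
print(wals_similarity([1, 2, None, 4], [1, 3, 5, 4]))
```

Normalizing by the number of compared fields rather than by the full vector length keeps sparsely documented languages from looking artificially similar or dissimilar.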
Tagger performance. Some source NE taggers perform better than others monolingually. We thus consider the option to weight the source contributions not just by language similarity but also by their monolingual NER accuracy, so that the contributions of more accurate source taggers are selected more often.
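Combining the two signals, a source ordering by the product of monolingual F1 and language similarity could look like the following sketch. The function name `rank_sources` and all scores in the example are made up for illustration; only the idea of multiplying the two quantities comes from the paper.

```python
def rank_sources(f1_scores, similarities, n):
    """Order source languages by monolingual F1 times similarity, keep the top n."""
    scored = {lang: f1_scores[lang] * similarities[lang]
              for lang in f1_scores if lang in similarities}
    return sorted(scored, key=scored.get, reverse=True)[:n]


# Hypothetical scores, not results from the experiment.
f1 = {"en": 0.86, "de": 0.75, "es": 0.80, "nl": 0.78}
sim = {"en": 0.5, "de": 0.9, "es": 0.4, "nl": 0.8}
print(rank_sources(f1, sim, 3))
```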
Experiment setup
Sources and targets. Table 1 shows the NER-annotated datasets we used. These datasets adhere to differing standards of NE encoding. In a non-trivial effort, we semi-automatically normalize the data into 3-class CoNLL IO encoding (Tjong Kim Sang and De Meulder, 2003), as the common denominator for the widely heterogeneous datasets. We thus detect names of locations (LOC), organizations (ORG), and persons (PER). Languages with more than 5k monolingual training sentences serve as sources and development languages for parameter tuning, while the remainder pose as low-resource targets; see Table 2. For languages that have multiple datasets, we concatenate the data. We end up with typologically diverse sets of sources and targets. We use the predefined train-dev-test splits if available; if not, we split the data 70-10-20%.
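To make the normalization step concrete, the sketch below collapses IOB-style tags into three classes plus the outside label. It covers only the simplest case and rests on assumptions: input tags look like `B-PER` or `I-LOC`, any class other than LOC, ORG, and PER (for example MISC) is mapped to O, and the output uses bare class names. The actual semi-automatic normalization across the datasets in Table 1 is more involved.

```python
KEEP = {"LOC", "ORG", "PER"}


def to_io(tag):
    """Map an IOB-style tag to a 3-class IO label (illustrative sketch)."""
    if tag == "O":
        return "O"
    prefix, sep, cls = tag.partition("-")
    if not sep:
        # Tag carries no B-/I- prefix, so treat the whole tag as the class.
        cls = prefix
    # Assumption: classes outside LOC/ORG/PER collapse to the outside label.
    return cls if cls in KEEP else "O"


tags = ["B-PER", "I-PER", "O", "B-MISC", "B-LOC", "ORG"]
print([to_io(t) for t in tags])
```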
Parallel text. We contrast two sources of parallel data: Europarl (Koehn, 2005) and Watchtower (Agić et al., 2016). The former covers only 21 resource-rich languages but with 400k-2M parallel sentences for each language pair, while the latter currently spans over 300 languages, but with only 10-100k sentences per pair. Europarl comes with near-perfect sentence alignment and tokenization, and we align its words using IBM2 (Dyer et al., 2013). For Watchtower we inherit the original noisy preprocessing: simple whitespace tokenization, automatic sentence alignment, and IBM1 word alignments by Agić et al. (2016) as they show that IBM1 in particular helps debias for low-resource languages.
Tagger. We implement a bi-LSTM NE tagger inspired by Lample et al. (2016) and Plank et al. (2016). We tune it on English development data at two bi-LSTM layers (d = 300), a final dense layer (d = 4), 10 training epochs with SGD, and regular and recurrent dropout at p = 0.5. We use pretrained fastText embeddings (Bojanowski et al., 2017). Currently fastText supports 294 languages and is superior to random initialization in our tagger. Other than through fastText, we do not make explicit use of sub-word embeddings. Our monolingual F1 score on English is 86.35 under the more standard IOB2 encoding. We do not aim to produce a state-of-the-art model, but to contrast the scores for various annotation projection parameters. We use our tagger both to annotate the source sides of parallel corpora, and to train projected target language NER models. All reported NE tagging results are means over 4 runs.

Results
Europarl sweet spots. With Europarl we show that the combination of monolingual F1 source scores and WALS similarities is the optimal source language ordering. The respective optimal number of sources is n = 3 for Europarl. We show that the best way to sample parallel sentences is through mean word alignment coverage, where we find k = 70000 to roughly be the optimal number of target sentences. Of the different weighting schemes in voting, we select the product of word alignment probability and NER tagger confidence as best. We visualize these experiments in Figure 2. Table 2 shows stable performance on Europarl across the languages, with mean F1 at 60.7 for n = 3 and only +1.53 higher for n_max, which is in fact lower than 3.

Figure 2: Projection tuning on Europarl, with panels (a) Source language ordering, (b) Optimal # of sources, (c) Parallel sentence sampling, (d) Weights in label voting. a) Ordering the sources by their monolingual F1 scores × WALS similarity works best; b) At n = 3 sources the average rank of F1 scores across development languages is lowest, which indicates that n = 3 is the optimal number of sources in Europarl projection; c) Parallel sentences are best selected by mean word alignment coverage, in contrast to tagger confidence or random sampling; d) Weighted voting for LABELING performs best when weights are word alignment weights × tagger confidences. Results under (b), (c), and (d) all use the best source ordering approach from (a). For random sampling under (a), the sources were randomly selected 5 times for each n.

Moving to Watchtower. Table 2 shows that the performance plunges across languages when Watchtower religious text replaces Europarl, with a mean F1 of 16.3. There, the gap between n = 3 and mean n_max = 4.82 is much larger: Watchtower needs more sources, and even then the benefits are low, as the +4.82 increase gets us to an infeasible mean F1 of 21.12. In target sentence selection we find k = 20000 to be roughly optimal for Watchtower, but we also observe very little change in F1 when moving to its full size of around 120 thousand target sentences.

Table 2: F1 scores for NER tagging in the experiment languages, shown separately for Europarl and Watchtower, also for fixed number of source languages n = 3 and optimal n_max. Full supervision scores are reported for the source languages. All scores are given for 3-class IO encoding.

To put the Watchtower results into perspective, we implement another simple baseline. Namely, we train a new monolingual English NER system, but instead of using monolingual fastText embeddings, we create simple cross-lingual embeddings following  over Europarl for Dutch, German, and Spanish. In effect, the change to cross-lingual embeddings yields a multilingual tagger for these four languages. The respective F 1 scores of this tagger are low (27-28%), but they still surpass Watchtower projection.

Discussion
We further depict the breakdown of Watchtower projection in two figures. Figure 3 shows precision, recall, and F1 learning curves for the best projection setup on both parallel corpora. For Europarl, adding more sources always increases recall at the cost of precision: new weaker sources increase the noise, but also improve coverage. For Watchtower, precision slightly increases with more sources, but the recall stays very low throughout, at around 5-12%. The distribution of labels in the source sides of the two parallel corpora (see Figure 4) explains the Watchtower learning curves. Namely, for both corpora the optimal word alignment coverage cutoff for selecting target sentences is around 80% covered words (best k = 70000 for Europarl, while k = 20000 for Watchtower). However, these cutoffs result in Europarl projections with nearly two orders of magnitude more named entities than in Watchtower (LOC: 65 times more, ORG: 60, PER: 15), and with different distributions.
To summarize, our results show that there exists a setup in which standalone annotation projection from multiple sources does work for cross-lingual NER. Europarl is an instance of such a setup, with its large data volume per language, high-quality preprocessing, and domain rich in named entities. Arguably, there are no parallel corpora of such volume and quality that cover a multitude of true low-resource languages, and we have to make do with more limited resources such as Watchtower. In turn, our experiment shows that in such a setup standalone projection yields infeasible NE taggers, while it still may yield workable POS taggers or dependency parsers (cf. Agić et al., 2016).
Alternatives. In search of feasible alternatives, we conducted a proof-of-concept replication of the work by Mayhew et al. (2017), who rely on "cheap translation" of training data from multiple sources using bilingual lexicons. The replication involved only one language, Dutch, and we limited the time investment in the effort. We used three translation sources: German, English, and Spanish. Together with instance selection through alignment coverage, we reach a top F1 score of 69.35 (with 3-class IO encoding), which surpasses even our best Europarl projection for Dutch by 4.56 points.

Related work
There is ample work in cross-lingual NER that exploits cross-lingual representations, comparable or parallel corpora together with entity dictionaries, translation, and the like (Täckström et al., 2012; Kim et al., 2012; Wang et al., 2013; Nothman et al., 2013; Tsai et al., 2016; Ni and Florian, 2016; Ni et al., 2017). We highlight a set of contributions that boast a larger cross-linguistic breadth. out of those, 20 are evaluated for NE linking and 9 for NER on human annotations that are not from Wikipedia. Cotterell and Duh (2017) jointly predict NE for high- and low-resource languages with a character-level neural CRF model. Their evaluation involves 15 diverse languages across 5 language families. The DARPA LORELEI program (Christianson et al., 2018) features challenges in low-resource NER development for "surprise" languages under time constraints.

Conclusions
Our work addresses an important gap in cross-lingual NER research. In an experiment with 17 languages, we show that while standalone multi-source annotation projection for NER can work when resources are rich in both quality and quantity, it is infeasible at a larger scale due to parallel corpora constraints. For NER in true low-resource languages, our results suggest it is better to choose an alternative approach.