Cross-lingual Name Tagging and Linking for 282 Languages

The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for the 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework identifies name mentions, assigns a coarse-grained or fine-grained type to each mention, and links it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating "silver-standard" annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.


Introduction
Information provided in languages which people can understand saves lives in crises. For example, the language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and grounding the information into an existing knowledge base which is accessible to a user in his/her own language (e.g., a reporter from the World Health Organization who speaks English only). Our resources are available at http://nlp.cs.rpi.edu/wikiann.
Wikipedia is a massively multi-lingual resource that currently hosts 295 languages and contains naturally annotated markups and rich informational structures, built through crowd-sourcing, for 35 million articles comprising 3 billion words. Name mentions in Wikipedia are often labeled as anchor links to their corresponding referent pages. Each entry in Wikipedia is also mapped to external knowledge bases such as DBpedia, YAGO (Mahdisoltani et al., 2015) and Freebase (Bollacker et al., 2008) that contain rich properties. Figure 1 shows an example of Wikipedia markups and KB properties. We leverage these markups to develop a framework that identifies names in documents in 282 languages and links them to an English KB (Wikipedia in this work). The major challenges and our new solutions are summarized as follows.
Creating "Silver-standard" through crosslingual entity transfer. The first step is to classify English Wikipedia entries into certain entity types and then propagate these labels to other languages. We exploit the English Abstract Meaning Representation (AMR) corpus (Banarescu et al., 2013) which includes both name tagging and linking annotations for fine-grained entity types to train an automatic classifier. Furthermore, we exploit each entry's properties in DBpedia as features and thus eliminate the need of language-specific features and resources such as part-of-speech tagging as in previous work (Section 2.2).
Refining annotations through self-training. The initial annotations obtained above are incomplete and inconsistent. Previous work used name string matching to propagate labels. In contrast, we apply self-training to label mentions without links in Wikipedia articles, even if they have different surface forms from the linked mentions (Section 2.4).
Customizing annotations through cross-lingual topic transfer. For the first time, we propose to customize name annotations for specific downstream applications. Again, we use a cross-lingual knowledge transfer strategy, leveraging widely available English corpora to choose entities with specific Wikipedia topic categories (Section 2.5).
Deriving morphology analysis from Wikipedia markups. Another unique challenge for morphologically rich languages is to segment each token into its stem form and affixes. Previous methods relied on either high-cost supervised learning (Roth et al., 2008; Mahmoudi et al., 2013; Ahlberg et al., 2015) or low-quality unsupervised learning (Grönroos et al., 2014; Ruokolainen et al., 2016). We exploit Wikipedia markups to automatically learn affixes as language-specific features (Section 2.3).
Mining word translations from cross-lingual links. Name translation is a crucial step in generating candidate entities for cross-lingual entity linking. Only a small percentage of names can be directly translated by matching against cross-lingual Wikipedia title pairs. Based on the observation that Wikipedia titles within any language tend to follow a consistent style and format, we propose an effective method to derive word translation pairs from these titles through automatic alignment (Section 3.2).

Overview
Our first step is to generate "silver-standard" name annotations from Wikipedia markups and train a universal name tagger. Figure 2 shows our overall procedure, and the following subsections elaborate on each component.

Figure 2: Name Tagging Annotation Generation and Training

Initial Annotation Generation
We start by assigning an entity type or "other" to each English Wikipedia entry. We utilize the AMR corpus, where each entity name mention is manually labeled as one of 139 types and linked to Wikipedia if it is linkable. In total we obtain 2,756 entity mentions, along with their AMR entity types, Wikipedia titles, YAGO entity types and DBpedia properties. For each pair of AMR entity type t_a and YAGO entity type t_y, we compute the Pointwise Mutual Information (PMI) (Church and Hanks, 1990) of mapping t_a to t_y across all mentions in the AMR corpus. Each name mention is therefore also assigned a list of YAGO entity types, ranked by their PMI scores with AMR types. In this way, our framework produces three levels of entity typing schemas with different granularity: 4 main types (Person (PER), Organization (ORG), Geo-political Entity (GPE), Location (LOC)), 139 types in AMR, and 9,154 types in YAGO.

Then we leverage an entity's properties in DBpedia as features for assigning types. For example, an entity with a birth date is likely to be a person, while an entity with a population property is likely to be a geo-political entity. Using all DBpedia entity properties as features (60,231 in total), we train Maximum Entropy models to assign types at three levels of granularity to all English Wikipedia pages. In total we obtained 10 million English pages labeled as entities of interest.

Nothman et al. (2013) manually annotated 4,853 English Wikipedia pages with 6 coarse-grained types (Person, Organization, Location, Other, Non-Entity, Disambiguation Page). Using this data set for training and testing, we achieved 96.0% F-score on this initial step, slightly better than their results (94.6% F-score).
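The PMI-based mapping between AMR and YAGO types can be sketched as follows. This is a minimal illustration with made-up type names and toy counts; the function name `pmi_type_mapping` and the input format are ours, not from the paper's code.

```python
import math
from collections import Counter

def pmi_type_mapping(mentions):
    """Rank candidate YAGO types for each AMR type by Pointwise Mutual
    Information over mention-level co-occurrence counts.
    `mentions` is a list of (amr_type, yago_types) pairs, one per mention."""
    amr_c, yago_c, pair_c = Counter(), Counter(), Counter()
    total = 0
    for amr_type, yago_types in mentions:
        for yago_type in yago_types:
            amr_c[amr_type] += 1
            yago_c[yago_type] += 1
            pair_c[(amr_type, yago_type)] += 1
            total += 1
    ranking = {}
    for (ta, ty), joint in pair_c.items():
        pmi = math.log(joint * total / (amr_c[ta] * yago_c[ty]))
        ranking.setdefault(ta, []).append((ty, pmi))
    for ta in ranking:
        ranking[ta].sort(key=lambda pair: -pair[1])
    return ranking

# Toy mention list with made-up type names for illustration.
mentions = [
    ("country", ["wordnet_country", "yagoLegalActor"]),
    ("country", ["wordnet_country"]),
    ("person", ["wordnet_person", "yagoLegalActor"]),
]
ranked = pmi_type_mapping(mentions)
# ranked["country"][0][0] is the highest-PMI YAGO type for "country"
```

Types that co-occur with many different AMR types (here the shared "yagoLegalActor") receive lower PMI than type pairs that co-occur exclusively.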
Next, we propagate the label of each English Wikipedia page to all entity mentions in all languages in the entire Wikipedia through monolingual redirect links and cross-lingual links.
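The propagation step can be sketched as a lookup over the KB link structure. All mappings below are toy examples; the real inputs come from Wikipedia redirect and inter-language link dumps, and the function name is illustrative.

```python
def propagate_labels(english_labels, redirects, cross_lingual_links):
    """Propagate entity-type labels from English Wikipedia pages to other
    pages. `english_labels` maps English titles to types, `redirects` maps
    redirect titles to canonical English titles, and `cross_lingual_links`
    maps (lang, foreign_title) to English titles."""
    labels = {}
    # Monolingual redirects inherit the label of their target page.
    for alias, target in redirects.items():
        if target in english_labels:
            labels[("en", alias)] = english_labels[target]
    # Cross-lingual links transfer the English label to the foreign page.
    for (lang, title), en_title in cross_lingual_links.items():
        en_title = redirects.get(en_title, en_title)  # resolve redirects
        if en_title in english_labels:
            labels[(lang, title)] = english_labels[en_title]
    return labels

english_labels = {"Mediterranean Sea": "LOC"}
redirects = {"The Mediterranean": "Mediterranean Sea"}
links = {("tr", "Akdeniz"): "Mediterranean Sea"}
labels = propagate_labels(english_labels, redirects, links)
# the English label reaches both the redirect and the Turkish page
```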

Learning Model and KB Derived Features
We use a typical neural network architecture consisting of a Bi-directional Long Short-Term Memory network and Conditional Random Fields (CRFs) (Lample et al., 2016) as the underlying learning model for the name tagger for each language. In the following we describe how we acquire linguistic features. When a Wikipedia user links an entity mention in a sentence to an existing page, she/he marks the title (the entity's canonical form, without affixes) within the mention. For example, from Turkish markups of the title "Akdeniz (Mediterranean Sea)", such as "[[Akdeniz]]den", we can learn the following suffixes: "den", "ne", "nden" and "na". We use such affix lists to perform basic word stemming, and use them as additional features to determine name boundaries and types. For example, "den" is a noun suffix which indicates ablative case in Turkish.
[[Akdeniz]]den means "from Mediterranean Sea". Note that this approach can only perform morphology analysis for words whose stem forms and affixes are directly concatenated. Table 1 summarizes name tagging features.

Table 1: Name Tagging Features

Feature      Description
Form         Lowercase forms of (w−1, w0, w+1)
Case         Case of w0
Syllable     The first and the last character of w0
Stem         Stems of (w−1, w0, w+1)
Affix        Affixes of (w−1, w0, w+1)
Gazetteer    Cross-lingual gazetteers learned from training data
Embeddings   Character embeddings and word embeddings learned from training data
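The affix-learning idea above can be sketched as follows, assuming anchor markups of the form `[[Title]]suffix`. The regex and the count threshold are illustrative; a production version would also need per-language tuning and handling of prefixes.

```python
import re
from collections import Counter

# Anchor markup where extra characters follow the linked title,
# e.g. "[[Akdeniz]]den": title "Akdeniz", candidate suffix "den".
ANCHOR = re.compile(r"\[\[([^\[\]|]+)\]\](\w+)")

def learn_suffixes(wiki_text, min_count=2):
    """Collect candidate suffixes from anchor links whose surface form
    extends the canonical title, keeping those seen at least min_count
    times (threshold is illustrative)."""
    counts = Counter(suffix for _, suffix in ANCHOR.findall(wiki_text))
    return {suffix for suffix, c in counts.items() if c >= min_count}

text = "[[Akdeniz]]den gelen ... [[Akdeniz]]de ... [[Ankara]]den"
suffixes = learn_suffixes(text)
# {"den"}: seen twice, while "de" appears only once
```

Lowering `min_count` trades precision for recall in the learned affix list.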

Self-Training to Enrich and Refine Labels
The name annotations acquired from the above procedure are far from complete compared with manually labeled gold-standard data. For example, if a name mention appears multiple times in a Wikipedia article, only the first mention is labeled with an anchor link. We apply self-training to propagate and refine the labels. We first train an initial name tagger using seeds selected from the labeled data. We adopt an idea from (Guo et al., 2014) which computes the Normalized Pointwise Mutual Information (NPMI) (Bouma, 2009) between a tag and a token:

NPMI(tag, token) = PMI(tag, token) / (-log p(tag, token)),
where PMI(tag, token) = log [ p(tag, token) / (p(tag) p(token)) ]     (1)

Then we select as seeds the sentences in which all annotations satisfy NPMI(tag, token) > τ (τ = 0 in our experiments). For all Wikipedia articles in a language, we cluster the unlabeled sentences into n clusters (n = 20 in our experiments) by collecting sentences with low cross-entropy into the same cluster. Then we apply the initial tagger to the first unlabeled cluster, select the automatically labeled sentences with high confidence, add them back into the training data, and re-train the tagger. This procedure is repeated n times until we scan through all unlabeled data.
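The NPMI-based seed selection can be sketched as follows. The sentences are toy examples, and `select_seeds` and its input format are illustrative names, not the paper's code.

```python
import math
from collections import Counter

def npmi_scores(labeled_sentences):
    """NPMI(tag, token) (Bouma, 2009) over (token, tag) pairs."""
    tag_c, tok_c, pair_c = Counter(), Counter(), Counter()
    n = 0
    for sent in labeled_sentences:
        for token, tag in sent:
            tag_c[tag] += 1
            tok_c[token] += 1
            pair_c[(tag, token)] += 1
            n += 1
    scores = {}
    for (tag, token), c in pair_c.items():
        p_joint = c / n
        pmi = math.log(p_joint / ((tag_c[tag] / n) * (tok_c[token] / n)))
        scores[(tag, token)] = pmi / -math.log(p_joint)
    return scores

def select_seeds(labeled_sentences, tau=0.0):
    """Keep sentences whose every (token, tag) annotation has NPMI > tau."""
    scores = npmi_scores(labeled_sentences)
    return [sent for sent in labeled_sentences
            if all(scores[(tag, token)] > tau for token, tag in sent)]

sentences = [
    [("Ankara", "GPE"), ("dedi", "O")],
    [("Ankara", "GPE"), ("geldi", "O")],
    [("Ankara", "O"), ("dedi", "O")],   # inconsistent label, filtered out
]
seeds = select_seeds(sentences, tau=0.0)
# the first two sentences survive; NPMI("O", "Ankara") is negative
```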

Final Training Data Selection for Populous Languages
For some populous languages that have many millions of pages in Wikipedia, we obtain many sentences from self-training. In some emergent settings such as natural disasters, it is important to train a system rapidly. Therefore we develop the following effective methods to rank and select high-quality annotated sentences. Commonness: we prefer sentences that include common entities appearing frequently in Wikipedia. We rank names by their frequency and dynamically set the frequency threshold to select a list of common names. We first initialize the name frequency threshold S to 40. If the number of selected sentences is more than a desired training size D (D = 30,000 in our experiments), we set S = S + 5; otherwise S = S - 5. We iteratively run the selection algorithm until the size of the training set reaches D for a certain S.
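The commonness selection loop can be sketched as follows. The `tried` set is our addition to guarantee termination if the threshold oscillates around the target; names and data are illustrative.

```python
def select_common_sentences(sentences, name_freq, target_size,
                            init_threshold=40, step=5):
    """Adjust the name-frequency threshold S (starting at 40, in steps
    of 5) until the sentences containing at least one 'common' name
    reach the desired training size D."""
    s, tried, selected = init_threshold, set(), []
    while s not in tried:
        tried.add(s)
        common = {name for name, freq in name_freq.items() if freq >= s}
        selected = [text for text, names in sentences
                    if any(name in common for name in names)]
        if len(selected) > target_size:
            s += step                      # too many sentences: be stricter
        elif len(selected) < target_size and s > step:
            s -= step                      # too few: relax the threshold
        else:
            break
    return selected

name_freq = {"World Health Organization": 50, "Akdeniz": 30, "RarePlace": 2}
sentences = [("s1", {"World Health Organization"}), ("s2", {"Akdeniz"}),
             ("s3", {"RarePlace"}), ("s4", {"Akdeniz", "RarePlace"})]
selected = select_common_sentences(sentences, name_freq, target_size=3)
# "s3" is dropped: it only mentions a rare name
```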
Topical Relatedness: various criteria should be adopted for different scenarios. Our previous work on event extraction (Li et al., 2011) found that by carefully selecting 1/3 of the topically related training documents for a test set, we can achieve the same performance as a model trained from the entire training set. Using an emergent disaster setting as a use case, we prefer sentences that include entities related to disaster topics. We run an English name tagger and entity linker (Pan et al., 2015) on the Leidos corpus released by the DARPA LORELEI program, which consists of documents related to various disaster topics. Based on the linked Wikipedia pages, we rank the frequency of Wikipedia categories and select the top 1% of categories (4,035 in total) for our experiments. Some top-ranked topic labels include "International medical and health organizations", "Human rights organizations", "International development agencies", "Western Asian countries", "Southeast African countries" and "People in public health". Then we select the annotated sentences in all languages that include names (e.g., "World Health Organization") labeled with these topic labels to train the final model.
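The topical selection can be sketched as follows, assuming we already have entity-to-category mappings from the linked Wikipedia pages. All names and data are toy illustrations.

```python
from collections import Counter

def top_topic_categories(linked_entities, entity_categories, top_frac=0.01):
    """Rank the Wikipedia categories of entities linked in a topic-relevant
    corpus by frequency and keep the top fraction (top 1% in the paper)."""
    counts = Counter()
    for entity in linked_entities:
        counts.update(entity_categories.get(entity, ()))
    k = max(1, int(len(counts) * top_frac))
    return {cat for cat, _ in counts.most_common(k)}

def topical_sentences(sentences, entity_categories, topic_cats):
    """Keep annotated sentences mentioning at least one entity whose
    categories intersect the selected topic categories."""
    return [text for text, entities in sentences
            if any(set(entity_categories.get(e, ())) & topic_cats
                   for e in entities)]

entity_categories = {
    "World Health Organization":
        ["International medical and health organizations"],
    "The Beatles": ["English rock bands"],
}
linked = ["World Health Organization"] * 5 + ["The Beatles"]
topics = top_topic_categories(linked, entity_categories)
kept = topical_sentences([("s1", ["World Health Organization"]),
                          ("s2", ["The Beatles"])], entity_categories, topics)
# only the disaster-topic sentence "s1" is kept
```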

Overview
After we extract names from test documents in a source language, we translate them into English using automatically mined word translation pairs (Section 3.2), and then link the translated English mentions to an external English KB (Section 3.3). The overall linking process is illustrated in Figure 3.

Name Translation
The cross-lingual Wikipedia title pairs, generated through crowd-sourcing, generally follow a consistent style and format in each language. From Table 2 we can see that the order of modifier and head word is consistent between Turkish and English titles. For each name mention, we generate all possible combinations of contiguous tokens. For example, no Wikipedia title contains the full Turkish name "Pekin Teknoloji Enstitüsü (Beijing Institute of Technology)". We generate the following 6 combinations: "Pekin", "Teknoloji", "Enstitüsü", "Pekin Teknoloji", "Teknoloji Enstitüsü" and "Pekin Teknoloji Enstitüsü", and then extract all cross-lingual Wikipedia title pairs containing each combination. Finally we run GIZA++ (Och and Ney, 2003) to extract word-for-word translations from these title pairs, as shown in Table 2.
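Generating the contiguous token spans is straightforward; a sketch is below (the alignment itself is done by GIZA++ and is not reproduced here).

```python
def contiguous_combinations(tokens):
    """All contiguous token spans of a name mention; each span is then
    used to retrieve cross-lingual Wikipedia title pairs containing it."""
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

spans = contiguous_combinations(["Pekin", "Teknoloji", "Enstitüsü"])
# 6 spans, from single tokens up to the full mention
```

Note that non-contiguous combinations such as "Pekin Enstitüsü" are never generated, matching the paper's 6 combinations for a 3-token name.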

Entity Linking
Given a set of tagged name mentions M = {m_1, m_2, ..., m_n}, we first obtain their English translations T = {t_1, t_2, ..., t_n} using the approach described above. Then we apply an unsupervised collective inference approach to link T to the KB, similar to our previous work (Pan et al., 2015). The only difference is that we construct knowledge networks (KNs) g(t_i) for T based on their co-occurrence within a context window instead of their AMR relations, because AMR parsing is not available for foreign languages. For each translated name mention t_i, an initial list of candidate entities E(t_i) = {e_1, e_2, ..., e_k} is generated based on a surface form dictionary mined from KB properties (e.g., redirects, names, aliases). If no surface form can be matched, we mark the mention as unlinkable. Then we construct a KN g(e_j) for each entity candidate e_j in t_i's candidate list E(t_i). We compute the similarity between g(t_i) and g(e_j) based on three measures: salience, similarity and coherence, and select the candidate entity with the highest score.
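Candidate generation against the surface-form dictionary can be sketched as a case-insensitive lookup. The dictionary contents below are toy examples; the real dictionary is mined from KB redirects, names and aliases.

```python
def generate_candidates(translated_mention, surface_forms):
    """Look up a translated English mention in a surface-form dictionary
    mined from KB properties; an empty result marks the mention as
    unlinkable (NIL)."""
    return surface_forms.get(translated_mention.lower(), [])

# Toy surface-form dictionary for illustration.
surface_forms = {
    "who": ["World Health Organization", "The Who"],
    "beijing institute of technology": ["Beijing Institute of Technology"],
}
candidates = generate_candidates("Beijing Institute of Technology",
                                 surface_forms)
nil = generate_candidates("Sometown", surface_forms)  # unlinkable
```

When a mention like "WHO" yields several candidates, the collective inference step scores each candidate's knowledge network against the mention's context to pick one.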

Performance on Wikipedia Data
We first conduct an evaluation using Wikipedia data as "silver-standard". For each language, we use 70% of the selected sentences for training and 30% for testing. For entity linking, we don't have ground truth for unlinkable mentions, so we only compute linking accuracy for linkable name mentions. Table 3 presents the overall performance for three coarse-grained entity types (PER, ORG and GPE/LOC), sorted by the number of name mentions. Figure 4 and Figure 5 summarize the performance, with some example languages marked for various ranges of data size.

Not surprisingly, name tagging performs better for languages with more training mentions. The F-score is generally higher than 80% when there are more than 10K mentions, and it drops significantly when there are fewer than 250 mentions. The languages with low name tagging performance can be categorized into three types: (1) languages with fewer than 2K mentions, such as the Atlantic-Congo (Wolof), Berber (Kabyle), Chadic (Hausa), Oceanic (Fijian), Hellenic (Greek), Igboid (Igbo), Mande (Bambara), Kartvelian (Georgian, Mingrelian), Timor-Babar (Tetum), Tupian (Guarani) and Iroquoian (Cherokee) language groups; precision is generally higher than recall for most of these languages, because the small number of linked mentions is not enough to cover a wide variety of entities; (2) languages with no spaces between words, including Chinese, Thai and Japanese; (3) languages not written in Latin script, such as the Dravidian group (Tamil, Telugu, Kannada, Malayalam).
The training instances for various entity types are quite imbalanced for some languages. For example, the Latin data includes 11% PER names, 84% GPE/LOC names and 5% ORG names. As a result, performance on ORG is the lowest, while GPE and LOC achieve higher than 75% F-scores for most languages. The linking accuracy is higher than 80% for most languages. Also note that since we don't have perfect annotations on Wikipedia data for any language, these results can be used to estimate how predictable our "silver-standard" data is, but they are not directly comparable to traditional name tagging results measured against gold-standard data annotated by humans.

Performance on Non-Wikipedia Data
In order to have more direct comparison with state-of-the-art name taggers trained from human annotated gold-standard data, we conduct experiments on non-Wikipedia data in 9 languages for which we have human annotated ground truths from the DARPA LORELEI program. Table 4 shows the data statistics. The documents are from news sources and discussion fora.
For a fair comparison, we use the same learning method and feature set as described in Section 2.3 to train the models using gold-standard data. Therefore the results of our models trained from gold-standard data are slightly different from some previous work such as (Tsai et al., 2016), mainly due to different learning algorithms and different feature sets. For example, the gazetteers we used are different from those in (Tsai et al., 2016), and we did not use Brown clusters as additional features.
The name tagging results on the LORELEI data set are presented in Table 5. We can see that our approach advances state-of-the-art language-independent methods (Zhang et al., 2016a; Tsai et al., 2016) on the same data sets for most languages, and achieves 6.5%-17.6% lower F-scores than the models trained from manually annotated gold-standard documents that include thousands of name mentions. To close this gap, we would need to exploit more linguistic resources.

Prior work constructed a cross-lingual entity linking collection for 21 languages, which covers ground truth for the largest number of languages to date. Therefore we compare our approach with theirs, which uses a supervised name transliteration model. The entity linking results on non-NIL mentions are presented in Table 6. We can see that, except for Romanian, our approach outperforms or achieves accuracy comparable to their method on all languages, without using any additional resources or tools such as name transliteration.

Impact of KB-derived Morphological Features
We measured the impact of our affix lists derived from Wikipedia markups on two morphologically-rich languages: Turkish and Uzbek.

Impact of Self-Training
Using Turkish as a case study, the learning curves of self-training on Wikipedia and non-Wikipedia test sets are shown in Figure 6. We can see that self-training provides significant improvement on both Wikipedia (6% absolute gain) and non-Wikipedia test data (12% absolute gain). As expected, the learning curve on Wikipedia data is smoother and converges more slowly than that on non-Wikipedia data. This indicates that when the training data is incomplete and noisy, the model can benefit from self-training through iterative label correction and propagation.

Figure 6: Learning Curve of Self-training

Impact of Topical Relatedness
We also found that the topical relatedness measure proposed in Section 2.5 not only significantly reduces the size of the training data, and thus speeds up the training process for many languages, but also consistently improves quality. For example, the Turkish name tagger trained from the entire data set without topic selection yields 49.7% F-score on the LORELEI data set, and the performance improves to 51.5% after topic selection.

Related Work
Wikipedia markup based silver-standard generation: Our work was mainly inspired by previous work that leveraged Wikipedia markups to train name taggers (Nothman et al., 2008; Dakka and Cucerzan, 2008; Mika et al., 2008; Ringland et al., 2009; Alotaibi and Lee, 2012; Nothman et al., 2013; Althobaiti et al., 2014). Most of these previous methods manually classified many English Wikipedia entries into pre-defined entity types. In contrast, our approach does not need any manual annotations or language-specific features, while generating both coarse-grained and fine-grained types. Many fine-grained entity typing approaches (Fleischman and Hovy, 2002; Giuliano, 2009; Ekbal et al., 2010; Ling and Weld, 2012; Yosef et al., 2012; Nakashole et al., 2013; Gillick et al., 2014; Yogatama et al., 2015; Del Corro et al., 2015) also created annotations based on Wikipedia anchor links. Our framework performs both name identification and typing, and takes advantage of richer structures in the KBs. Previous work on Arabic name tagging (Althobaiti et al., 2014) extracted entity titles as a gazetteer for stemming, and thus cannot handle unknown names. We developed a new method to derive generalizable affixes for morphologically rich languages based on Wikipedia markups.

Wikipedia as background features for IE:
Wikipedia pages have been used as additional features to improve various Information Extraction (IE) tasks, including name tagging (Kazama and Torisawa, 2007), coreference resolution (Ponzetto and Strube, 2006), relation extraction (Chan and Roth, 2010) and event extraction (Hogue et al., 2014). Other automatic name annotation generation methods have been proposed, including KB-driven distant supervision (Mintz et al., 2009; Ren et al., 2015) and cross-lingual projection (Li et al., 2012; Kim et al., 2012; Wang and Manning, 2014; Zhang et al., 2016b).
Multi-lingual name tagging: Some recent research (Zhang et al., 2016a; Littell et al., 2016; Tsai et al., 2016) under the DARPA LORELEI program has focused on developing name tagging techniques for low-resource languages. These approaches require English annotations for projection (Tsai et al., 2016) or some input from a native speaker, either through manual annotations (Littell et al., 2016) or a linguistic survey (Zhang et al., 2016a). Without using any manual annotations, our name taggers outperform previous methods on the same data sets for many languages.
Multi-lingual entity linking: NIST TAC-KBP Tri-lingual Entity Linking (Ji et al., 2016) focused on three languages: English, Chinese and Spanish. Later work extended it to 21 languages, but those methods required labeled data and name transliteration. We share the same goal as (Sil and Florian, 2016) of extending cross-lingual entity linking to all languages in Wikipedia. They exploited Wikipedia links to train a supervised linker. We mine reliable word translations from cross-lingual Wikipedia titles, which enables us to adopt unsupervised English entity linking techniques such as (Pan et al., 2015) to directly link translated English name mentions to the English KB.
Efforts to save annotation cost for name tagging: Some previous work, including (Ji and Grishman, 2006; Richman and Schone, 2008; Althobaiti et al., 2013), exploited semi-supervised methods to save annotation cost. We observed that self-training can provide further gains when the training data contains a certain amount of noise.

Conclusions and Future Work
We developed a simple yet effective framework that can extract names from 282 languages and link them to an English KB. This framework follows a fully automatic training and testing pipeline, without the need for any manual annotations or knowledge from native speakers. We evaluated our framework on both Wikipedia articles and external formal and informal texts and obtained promising results. To the best of our knowledge, our multi-lingual name tagging and linking framework covers the largest number of languages to date. We release the following resources for each of the 282 languages: "silver-standard" name tagging and linking annotations with multiple levels of granularity, a morphology analyzer for morphologically-rich languages, and an end-to-end name tagging and linking system. In this work, we treat all languages independently when training their corresponding name taggers. In the future, we will explore the topological structure of related languages and exploit cross-lingual knowledge transfer to enhance the quality of extraction and linking. The general idea of deriving noisy annotations from KB properties can also be extended to other IE tasks such as relation extraction.