ELISA-EDL: A Cross-lingual Entity Extraction, Linking and Localization System

We demonstrate ELISA-EDL, a state-of-the-art re-trainable system that extracts entity mentions from low-resource languages, links them to external English knowledge bases, and visualizes locations related to disaster topics on a world heatmap. We make all of our data sets, resources, and system training and testing APIs publicly available for research purposes.


Introduction
Our cross-lingual entity extraction, linking and localization system is capable of extracting named entities from unstructured text in any of 282 Wikipedia languages, translating them into English, and linking them to English knowledge bases (Wikipedia and GeoNames). The system then produces visualizations of the results, such as heatmaps, so that an English speaker can use it to monitor disasters and coordinate rescue and recovery efforts reported from incident regions in low-resource languages. In the rest of the paper, we present a comprehensive overview of the system components (Section 2 and Section 3), APIs (Section 4), interface (Section 5), and visualization (Section 6).

Entity Extraction
Given a text document as input, the entity extraction component identifies entity name mentions and classifies them into pre-defined types: Person (PER), Geo-political Entity (GPE), Organization (ORG) and Location (LOC). We treat name tagging as a sequence labeling problem: each token in a sentence is tagged as the Beginning (B), Inside (I) or Outside (O) of an entity mention with a certain type. Our model is based on a bi-directional long short-term memory (LSTM) network with a Conditional Random Fields (CRF) layer (Chiu and Nichols, 2016). It is challenging to perform entity extraction across a massive variety of languages because most languages do not have sufficient data to train a machine learning model. To tackle the low-resource challenge, we developed creative methods of deriving noisy training data from Wikipedia, exploiting non-traditional language-universal resources (Zhang et al., 2016) and cross-lingual transfer learning (Cheung et al., 2017).
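The B/I/O labeling scheme described above can be illustrated with a small decoding routine. This is only a minimal sketch of how a tagged token sequence might be turned back into typed mentions; the function name `decode_bio` and the tag spelling `B-PER` / `I-PER` / `O` are assumptions made for illustration, not the system's actual code or output format.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (mention_text, entity_type) pairs from a BIO-tagged sentence."""
    mentions: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # Start a new mention. An I- tag that does not continue the
            # current type is treated as a start (a lenient assumption).
            if current_tokens:
                mentions.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], etype
        elif prefix == "I":
            # Continue the current mention.
            current_tokens.append(token)
        else:
            # Outside any mention: close the open mention, if there is one.
            if current_tokens:
                mentions.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        mentions.append((" ".join(current_tokens), current_type))
    return mentions


tokens = ["Stephane", "Dujarric", "visited", "Bentiu", "."]
tags = ["B-PER", "I-PER", "O", "B-GPE", "O"]
print(decode_bio(tokens, tags))
# [('Stephane Dujarric', 'PER'), ('Bentiu', 'GPE')]
```

In the real system the tags would come from the bi-directional LSTM-CRF model; here they are written by hand to show the labeling scheme only.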

Entity Linking and Localization
After we extract entity mentions, we link GPE and LOC mentions to GeoNames, and PER and ORG mentions to Wikipedia. We adopt a name translation approach to translate each tagged entity mention into English, then we apply an unsupervised collective inference approach (Pan et al., 2015) to link each translated mention to the target KB. Figure 2 shows an example output for a Hausa document. The extracted entity mentions "Stephane Dujarric" and "birnin Bentiu" are linked to their corresponding entries in Wikipedia and GeoNames, respectively.
Compared to traditional entity linking, the unique challenge of linking to GeoNames is that it is very sparse, without rich linked structures or text descriptions. Only 500k out of 4.7 million entities in Wikipedia are linked to GeoNames. Therefore, we associate mentions with entities in the KBs in a collective manner, based on salience, similarity and coherence measures (Pan et al., 2015). We calculate topic-sensitive PageRank scores for the 500k overlapping entities between GeoNames and Wikipedia as their salience scores. Then we construct knowledge networks from source language texts, where each node represents an entity mention, and each link represents a sentence-level co-occurrence relation. If two mentions co-occur in the same sentence, we prefer their entity candidates in GeoNames to share an administrative code and type, or to be geographically close in the world, as measured in terms of latitude and longitude. Table 3 shows the performance of our system on some representative low-resource languages for which we have ground-truth annotations from the DARPA LORELEI programs, prepared by the Linguistic Data Consortium.

APIs and descriptions:

/status
Retrieve the current server status, including supported languages, language identifiers, and the state (offline, online, or pending) of each model.

/status/{identifier}
Retrieve the current status of a given language.

/entity_discovery_and_linking/{identifier}
Main entry of the EDL system. Take input in either plain text or *.ltf format, tag names that are PER, ORG or LOC/GPE, and link them to Wikipedia.

/name_transliteration/{identifier}
Transliterate a name to Latin script.

/entity_linking/{identifier}
Query-based entity linking. Link each mention to KBs.

/entity_linking_amr
English entity linking for Abstract Meaning Representation (AMR) style input (Pan et al., 2015). AMR (Banarescu et al., 2013) is a structured semantic representation scheme. The rich semantic knowledge in AMR boosts linking performance.

/localize/{identifier}
Localize a LOC/GPE name based on the GeoNames database.
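The preference for co-occurring mentions whose GeoNames candidates share an administrative code or lie close together can be sketched as a pairwise score. The sketch below is an illustration under stated assumptions, not the system's scoring function: the `Candidate` fields, the `coherence` weighting (a bonus of 1.0 for a shared code plus a term that decays with distance), and the administrative code strings are all hypothetical, and the coordinates are approximate.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    lat: float
    lon: float
    admin_code: str  # hypothetical stand-in for a GeoNames administrative code


def haversine_km(a: Candidate, b: Candidate) -> float:
    """Great-circle distance in kilometres between two candidates."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def coherence(a: Candidate, b: Candidate) -> float:
    """Illustrative score: shared admin code bonus plus a distance-decay term."""
    bonus = 1.0 if a.admin_code == b.admin_code else 0.0
    return bonus + 1.0 / (1.0 + haversine_km(a, b) / 100.0)


def best_pair(cands_a: List[Candidate], cands_b: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Pick the most coherent candidate pair for two co-occurring mentions."""
    return max(product(cands_a, cands_b), key=lambda pair: coherence(*pair))


# Two mentions from the same sentence, each with two candidate entries.
paris = [Candidate("Paris, France", 48.8566, 2.3522, "FR.11"),
         Candidate("Paris, Texas", 33.6609, -95.5555, "US.TX")]
versailles = [Candidate("Versailles, France", 48.8049, 2.1204, "FR.11"),
              Candidate("Versailles, Kentucky", 38.0526, -84.7300, "US.KY")]

first, second = best_pair(paris, versailles)
print(first.name, "|", second.name)
```

A full collective inference step would combine such a coherence term with the salience and similarity measures and operate over the whole knowledge network rather than a single pair.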

Training and Testing APIs
In this section, we introduce our back-end APIs. The back-end is a set of RESTful APIs built with Python Flask, a lightweight framework that includes template rendering and server hosting capabilities. We use Swagger for documentation management. Besides the online hosted APIs, we also publish a Docker image on Docker Hub for software distribution.
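As a rough illustration of how a client might address such a RESTful back-end, the sketch below only constructs a request object and does not send it. Everything specific in it is an assumption: the base URL is a placeholder, the underscore-separated endpoint path is a guess at how the endpoint name is spelled in the deployed API, and sending the document as a plain-text POST body is not confirmed by the paper. The Swagger documentation is the authoritative reference for the real parameters.

```python
from urllib.parse import quote
from urllib.request import Request

# Placeholder host; replace with the address of a real deployment.
BASE_URL = "http://localhost:3300"


def build_edl_request(base_url: str, identifier: str, text: str) -> Request:
    """Build (but do not send) a request to the assumed EDL endpoint.

    `identifier` is a language identifier such as those reported by /status.
    The path spelling, HTTP method and body format are assumptions.
    """
    path = "/entity_discovery_and_linking/" + quote(identifier, safe="")
    return Request(
        url=base_url.rstrip("/") + path,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_edl_request(BASE_URL, "ha", "Stephane Dujarric ya ziyarci birnin Bentiu.")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would issue the call against a running server; the response format is left out here because the paper does not specify it.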
In general, we categorize the APIs into two sections: RUN and TRAIN. The RUN section is responsible for running the pre-trained models for 282 languages, and the TRAIN section provides a re-training function for users who want to train their own customized name tagging models using their own datasets. We have also published our training and test data sets, as well as resources related to morphology analysis and name translation, at https://elisa-ie.github.io/wikiann. Table 1 and Table 2 present the detailed functionality and usage of the APIs in these two sections. Besides the core components described in Section 2 and Section 3, we also provide APIs for additional components, including a re-trainable name transliteration component (Lin et al., 2016) and a universal name and word translation component based on word alignment derived from cross-lingual Wikipedia links. More detailed usage and examples can be found in our Swagger documentation: https://elisa-ie.github.io/api.

Testing Interface
Figure 1 shows the test interface, where a user can select one of the 282 languages, enter a text or select an example document, and run the system. Figure 2 shows an output example. In addition to the entity extraction and linking results, we also display the top 5 images for each entity retrieved from Google Image Search. In this way, even a user who cannot read a document in a low-resource language can obtain a high-level summary of the entities involved in the document.

Heatmap Visualization
Using disaster monitoring as a use case, we detect the following ten topics from the input multilingual data based on translating 117 English disaster keywords via PanLex: (1) water supply, (2) food supply, (3) medical assistance, (4) terrorism or other extreme violence, (5) utilities, energy or sanitation, (6) evacuation, (7) shelter, (8) search and rescue, (9) civil unrest or widespread crime, and (10) infrastructure, as defined in the NIST LoreHLT2017 Situation Frame detection task. If a sentence includes one of these topics and also a location or geo-political entity, we visualize the entity on a world heatmap using Mapbox, based on its coordinates in the GeoNames database obtained from the entity linker. We also show the entire context sentence and its English translation produced by our state-of-the-art machine translation system for low-resource languages (Cheung et al., 2017). Figure 3 illustrates an example of the visualized heatmap.
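The rule that a sentence contributes to the heatmap only when it contains both a topic keyword and a location or geo-political entity can be sketched as follows. This is a simplified illustration under several assumptions: the keyword sets are tiny invented English examples (the system uses keywords translated via PanLex), the entity dictionaries use hypothetical field names, the coordinates are approximate, and GeoJSON point features are just one plausible input format for a map layer.

```python
from typing import Dict, List

# Tiny illustrative keyword sets for three of the ten topics (assumed values).
TOPIC_KEYWORDS: Dict[str, set] = {
    "water supply": {"water", "drought"},
    "food supply": {"food", "famine"},
    "shelter": {"shelter", "tent"},
}


def detect_topics(sentence: str) -> List[str]:
    """Return the sorted topics whose keywords appear as words in the sentence."""
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    return sorted(topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws)


def heatmap_features(sentence: str, entities: List[dict]) -> List[dict]:
    """Emit one GeoJSON point per linked GPE/LOC entity if a topic is present."""
    topics = detect_topics(sentence)
    if not topics:
        return []
    features = []
    for ent in entities:
        if ent["type"] not in ("GPE", "LOC"):
            continue  # only locations and geo-political entities are mapped
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [ent["lon"], ent["lat"]]},
            "properties": {"name": ent["name"], "topics": topics, "context": sentence},
        })
    return features


sentence = "Residents of Bentiu need food and clean water."
entities = [
    {"name": "Bentiu", "type": "GPE", "lat": 9.23, "lon": 29.83},
    {"name": "Stephane Dujarric", "type": "PER", "lat": None, "lon": None},
]
for feature in heatmap_features(sentence, entities):
    print(feature["properties"]["name"], feature["properties"]["topics"])
# Bentiu ['food supply', 'water supply']
```

Keeping the context sentence in each feature's properties mirrors the way the interface shows the sentence alongside the mapped entity.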
We use different colors and icons to stand for different languages and frame topics, respectively (e.g., the bread icon represents "food supply"). Users can also specify the language, the frame topic, or both to filter out irrelevant results on the map. When a user clicks an icon, its context sentence is displayed in a pop-up with automatic translation and highlighted mentions and keywords. We provide various map styles (light, dark, satellite, and streets) for different needs, as shown in Figure 4.

Related Work
Some recent work has also focused on low-resource name tagging (Tsai et al., 2016; Littell et al., 2016; Zhang et al., 2016; Yang et al., 2017) and cross-lingual entity linking (McNamee et al., 2011; Spitkovsky and Chang, 2011; Sil and Florian, 2016), but the system demonstrated in this paper is the first publicly available end-to-end system to perform both tasks in all of the 282 Wikipedia languages.

Conclusions and Future Work
Our publicly available cross-lingual entity extraction, linking and localization system allows an English speaker to gather information related to entities from 282 Wikipedia languages. In the future, we will apply common semantic space construction techniques to transfer knowledge and resources from these Wikipedia languages to all of the thousands of living languages. We also plan to significantly expand the entity types to the thousands of fine-grained types defined in YAGO (Suchanek et al., 2007) and WordNet (Miller, 1995).