Multi-lingual Entity Discovery and Linking

The primary goals of this tutorial are to review the framework of cross-lingual EL and motivate it as a broad paradigm for the Information Extraction task. We will start by discussing the traditional EL techniques and metrics and address questions relevant to their adequacy across domains and languages. We will then present more recent approaches such as Neural EL, discuss the basic building blocks of a state-of-the-art neural EL system and analyze some of the current results on English EL. We will then proceed to Cross-lingual EL and discuss methods that work across languages. In particular, we will discuss and compare multiple methods that make use of multi-lingual word embeddings. We will also present EL methods that work for both name tagging and linking in very low-resource languages. Finally, we will discuss the uses of cross-lingual EL in a variety of applications such as search engines and commercial product selling applications. Unlike the 2014 EL tutorial, we will also focus on Entity Discovery, which is an essential component of EL.


Overall
We live in a golden age of information, where we have access to vast amounts of data in various forms: text, video and audio. Over the last few years, one of the key tasks studied in support of natural language understanding and information extraction from text has been Entity Linking (previously studied as Wikification). Entity Linking (henceforth, EL) (Bunescu and Pasca, 2006; Cucerzan, 2007; Ratinov et al., 2011) is the task of mapping mentions of entities in a text document to entries in a large catalog of entities such as Wikipedia or another knowledge base (KB). It has also been one of the major tasks in the Knowledge-Base Population track at the Text Analysis Conference (TAC) (McNamee and Dang, 2009b; Ji and Grishman, 2011). Most works in the literature have used Wikipedia as this target catalog of entities because of its wide coverage and its frequent updates by the community. The previous Entity Linking tutorial at ACL 2014 (Roth et al., 2014) mostly addressed EL research focused on English, the most prevalent language on the web and the one with the largest Wikipedia datasets. However, in the last few years research has shifted to address the EL task in other languages, some of which have a very large web presence, such as Spanish (Fahrni et al., 2013) and Chinese (Cao et al., 2014; Shi et al., 2014), but also in others. In particular, there has been interest in cross-lingual EL (Tsai and Roth, 2016; Sil and Florian, 2016): given a mention in a foreign language document, map it to the corresponding page in the English Wikipedia. Beyond the motivation that drives the English EL task (knowledge acquisition and information extraction), in the cross-lingual case, and especially when dealing with low-resource languages, the hope is to provide improved natural language understanding capabilities for the many languages for which we have few linguistic resources and annotations and no machine translation technology.
The LoReHLT 2016-2017 evaluations and the TAC 2017 pilot evaluation target very low-resource languages such as Northern Sotho or Kikuyu, which have only about 4,000 Wikipedia pages (about 1/1000 the size of the English Wikipedia).
The primary goals of this tutorial are to review the framework of cross-lingual EL and motivate it as a broad paradigm for the Information Extraction task. We will start by discussing the traditional EL techniques and metrics and address questions relevant to their adequacy across domains and languages. We will then present more recent approaches such as Neural EL, discuss the basic building blocks of a state-of-the-art neural EL system and analyze some of the current results on English EL. We will then proceed to Cross-lingual EL and discuss methods that work across languages. In particular, we will discuss and compare multiple methods that make use of multi-lingual word embeddings. We will also present EL methods that work for both name tagging and linking in very low-resource languages. Finally, we will discuss the uses of cross-lingual EL in a variety of applications such as search engines and commercial product selling applications. Unlike the 2014 EL tutorial, we will also focus on Entity Discovery, which is an essential component of EL.
The tutorial will be useful for both senior and junior researchers (in academia and industry) with interests in cross-source information extraction and linking, knowledge acquisition, and the use of acquired knowledge in natural language processing and information extraction. We will try to provide a concise road-map of recent approaches, perspectives, and results, as well as point to some of our EL resources that are available to the research community.

Motivation and Overview [20 mins]
We will motivate the general EL problem (for English) by teaching the general methods that incorporate distance measures (Ratinov et al., 2011; Sil and Yates, 2013; Cheng and Roth, 2013). We will then briefly discuss multi-lingual IE problems and motivate cross-lingual EL (Sil and Florian, 2016). Then we will motivate the new trend of modeling distributional representations instead of distance.

Key Challenges and Multi-lingual Embeddings [20 mins]
We will present some key challenges facing high-performing traditional EL systems, such as candidate generation from a knowledge base and transliteration (Tsai and Roth, 2018). We will also present the models for traditional cross-lingual EL (Sil and Florian, 2016; Tsai and Roth, 2016) and discuss one of their main challenges: matching the context of non-English documents against the English Wikipedia. Recently, neural Entity Discovery and Linking (henceforth, EDL) techniques have addressed some of these challenges. These systems use multi-lingual embeddings, which are essential building blocks of these neural architectures. Hence, before diving into the architectures, we will survey multi-lingual embedding techniques (Mikolov et al., 2013c; Faruqui and Dyer, 2014; Ammar et al., 2016), discuss which ones work best for neural EL systems, and motivate neural EL.
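
To make the context-matching challenge concrete, the following minimal sketch ranks English Wikipedia candidates for a non-English mention by cosine similarity in a shared embedding space. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-picked toy values, not the output of any of the cited embedding methods, and the helper names (average, cosine, rank_candidates) are hypothetical.

```python
import math

DIM = 3
# Toy multi-lingual embeddings: Spanish and English words are assumed to
# already live in one shared space (hand-picked, hypothetical values).
EMBEDDINGS = {
    "río": [0.9, 0.1, 0.0],
    "agua": [0.8, 0.2, 0.1],
    "river": [0.9, 0.1, 0.1],
    "water": [0.8, 0.1, 0.2],
    "money": [0.0, 0.9, 0.2],
    "finance": [0.1, 0.9, 0.1],
}

def average(words):
    """Average the vectors of all known words; zero vector if none is known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(context_words, candidates):
    """Sort candidate titles by similarity between the mention context
    and the words describing each candidate page, best first."""
    ctx = average(context_words)
    scored = [(cosine(ctx, average(desc)), title)
              for title, desc in candidates.items()]
    return sorted(scored, reverse=True)

# A Spanish mention of "banco" whose context talks about a river.
candidates = {
    "Bank (geography)": ["river", "water"],
    "Bank": ["money", "finance"],
}
ranking = rank_candidates(["río", "agua"], candidates)
best_title = ranking[0][1]
```

With these toy values the river-related context ranks "Bank (geography)" above "Bank"; a real system would replace the averaged bag of words with a learned context encoder.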

Neural Methods for EDL [30 mins]
Various shared tasks such as TAC-KBP, ACE and CoNLL, along with corpora like OntoNotes and ERE, have provided the community with a substantial amount of annotations for both entity mention extraction (1,500+ documents) and entity linking (5,000+ query entities). Therefore supervised models have become popular again for each step of EDL. Among the supervised learning frameworks for mention extraction, the most popular is a combined deep neural network architecture consisting of Bidirectional Long Short-Term Memory networks (Bi-LSTM) (Graves et al., 2013) and CRFs (Lample et al., 2016). In TAC-KBP2017 many teams trained this framework on the same training data (KBP2015 and KBP2016 EDL corpora) with the same set of features (word and entity embeddings), but got very different results. The mention extraction F-score gap between the best system and the worst system is about 24%. We will provide a systematic comparison and analysis of the reasons for this large gap. We will also introduce techniques to make the framework more robust to noise in low-resource settings. We will then teach neural EL architectures (Globerson et al., 2016; Gupta et al., 2017a; Sil et al., 2018) that can tackle some of the challenges of the traditional systems. Then we will proceed to cross-lingual neural EL and survey the pipeline components that most of these EL systems employ: cross-lingual NER and in-document coreference resolution. We will talk about how to model the contexts using various neural techniques such as CNNs and LSTMs, and how systems compute similarity metrics of varying granularity (Francis-Landau et al., 2016; Sil et al., 2018).
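
For readers unfamiliar with how a mention extraction F-score is obtained, here is a minimal sketch of an exact-match scorer. It is a simplification under stated assumptions: a mention is represented as a hypothetical (document id, start, end, type) tuple and counts as correct only if all four fields match. The official TAC-KBP scorer defines its own metrics and is not reproduced here.

```python
def mention_f1(gold, predicted):
    """Exact-match precision, recall and F1 over mention tuples.

    Each mention is assumed to be (doc_id, start, end, entity_type);
    a prediction is correct only if the whole tuple appears in gold.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy data: one type error (GPE vs. LOC) and one spurious mention.
gold = [("doc1", 0, 2, "PER"), ("doc1", 5, 6, "GPE"), ("doc1", 9, 11, "ORG")]
pred = [("doc1", 0, 2, "PER"), ("doc1", 5, 6, "LOC"),
        ("doc1", 9, 11, "ORG"), ("doc1", 12, 13, "PER")]

precision, recall, f1 = mention_f1(gold, pred)
```

In this toy case two of four predictions are correct and two of three gold mentions are found, so precision is 1/2, recall is 2/3 and F1 is 4/7.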

Language Universal Methods for Cross-lingual EDL [30 mins]
We will then present some recent advances in developing low-cost approaches to perform cross-lingual EL for 282 Wikipedia languages, such as deriving silver-standard annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links (Pan et al., 2017a). We will also introduce some recent extensions along this line of work, including extending the number of entity types from five to thousands, and its impact on other NLP applications such as Machine Translation.
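
The idea of deriving silver-standard annotations from cross-lingual links can be sketched as follows. This is an illustrative toy only, not the procedure of Pan et al. (2017a): the two dictionaries stand in for Wikipedia language links and KB type properties, the Spanish sentence is made up, and the function name project_annotations is hypothetical.

```python
# Hypothetical stand-ins for real resources.
LANG_LINKS = {            # foreign-language page title -> English page title
    "Angela Merkel": "Angela Merkel",
    "Hamburgo": "Hamburg",
}
KB_TYPES = {              # English page title -> entity type from KB properties
    "Angela Merkel": "PER",
    "Hamburg": "GPE",
}

def project_annotations(segments):
    """Turn anchor-linked text into (token, BIO tag, English title) triples.

    `segments` is a list of (text, link_target) pairs, where link_target is
    the foreign-language page an anchor points to, or None for plain text.
    Anchors without a language link or a known type are tagged "O".
    """
    tagged = []
    for text, target in segments:
        english = LANG_LINKS.get(target) if target else None
        etype = KB_TYPES.get(english) if english else None
        for i, token in enumerate(text.split()):
            if etype is None:
                tagged.append((token, "O", None))
            else:
                prefix = "B" if i == 0 else "I"
                tagged.append((token, f"{prefix}-{etype}", english))
    return tagged

# "Angela Merkel nació en Hamburgo", with two anchor links.
sentence = [("Angela Merkel", "Angela Merkel"), ("nació", None),
            ("en", None), ("Hamburgo", "Hamburgo")]
silver = project_annotations(sentence)
tags = [tag for _, tag, _ in silver]
```

The resulting triples carry both a name tag and an English Wikipedia title, so the same silver data could in principle supervise name tagging and linking.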

Multiple Knowledge Bases [25 mins]
A task that is similar to multi-lingual EL in both definition and approaches is domain-specific linking of entities in documents based on a given set of domains and their corresponding knowledge repositories (Gao and Cucerzan, 2017). This task is important for applications such as the analysis and indexing of corporate document repositories, in which many of the entities of interest are not part of general knowledge but are rather company-specific and may need to be kept private. Conflating such terminologies and knowledge into one single knowledge model would be both daunting and undesirable. Thus, similarly to handling multiple languages, a system built for multiple-domain linking has to model each domain separately. We will discuss a multi-KB entity linking framework that employs one general-knowledge KB and a large set of domain-specific KBs as linking targets, extending the work of Cucerzan (2007, 2014a), as well as a supervised model with a large and diverse set of features to detect when a domain-specific KB matches a document targeted for entity analysis (Gao and Cucerzan, 2017).
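
The control flow of multi-KB linking can be illustrated with a small sketch. Note the substitution: where the framework above uses a supervised model to decide whether a domain-specific KB matches a document, this toy uses a plain overlap threshold (min_hits) instead. The KB contents, domain names and function names are all hypothetical.

```python
# Hypothetical KBs: lower-cased surface form -> entity entry.
GENERAL_KB = {
    "python": "Python (programming language)",
    "mercury": "Mercury (planet)",
}
DOMAIN_KBS = {
    "acme-products": {"mercury": "ACME Mercury Router",
                      "falcon": "ACME Falcon Switch"},
    "acme-hr": {"onboarding portal": "HR Onboarding Portal"},
}

def matching_domains(doc_mentions, min_hits=2):
    """Domains whose KB covers at least `min_hits` distinct mentions.
    A simple stand-in for a learned KB-to-document matcher."""
    mentions = {m.lower() for m in doc_mentions}
    return [name for name, kb in DOMAIN_KBS.items()
            if len(mentions & set(kb)) >= min_hits]

def link(doc_mentions, min_hits=2):
    """Link each mention to a matched domain KB first, else the general KB."""
    domains = matching_domains(doc_mentions, min_hits)
    links = {}
    for mention in doc_mentions:
        key = mention.lower()
        entry = None
        for name in domains:
            if key in DOMAIN_KBS[name]:
                entry = (name, DOMAIN_KBS[name][key])
                break
        if entry is None and key in GENERAL_KB:
            entry = ("general", GENERAL_KB[key])
        links[mention] = entry
    return links

in_domain = link(["Mercury", "Falcon", "Python"])
out_of_domain = link(["Mercury", "Python"])
```

In the first document two mentions hit the product KB, so "Mercury" resolves to the company-specific entry; in the second only one does, the domain KB is not activated, and "Mercury" falls back to the general-knowledge entry. Each domain KB stays a separate dictionary, mirroring the point that domains are modeled separately rather than conflated.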

New Tasks, Trends and Open Questions [15 mins]
Here, we will address some of the new settings, such as multi-lingual EL for search engines (Pappu et al., 2017; Tan et al., 2017). We will discuss some open questions, such as improving the title candidate generation process for situations where the corresponding titles only exist in the English Wikipedia, and investigating the topological structure of related languages and exploiting cross-lingual knowledge transfer to enhance the quality of extraction and linking (Tsai and Roth, 2018). We will also discuss EL for noisy data like social media (Meij et al., 2012; Guo et al., 2013). Finally, we will discuss the possibilities of extending the ideas taught in this EL tutorial to other multi-lingual IE tasks.

System Demos and Resources [10 mins]
Finally, we will show some demos of multi-lingual EL systems from the industry and academia. We will also provide pointers to resources, including data sets and software.

Tutorial Instructors
Roth has published broadly in machine learning, natural language processing, knowledge representation and reasoning, and has developed several machine learning based natural language processing systems that are widely used in the computational linguistics community and in industry. Over the last few years he has worked on Entity Linking and Wikification. He has taught several tutorials at ACL/NAACL/ECL and other forums. Dan has co-taught the "Wikification and Beyond: The Challenges of Entity and Concept Grounding" tutorial with Heng Ji at ACL 2014.