The Interplay between Lexical Resources and Natural Language Processing

Incorporating linguistic, world and common-sense knowledge into AI/NLP systems is currently an important research area, with several open problems and challenges. At the same time, processing and storing this knowledge in lexical resources is not a straightforward task. This tutorial addresses these complementary goals from two methodological perspectives: the use of NLP methods to help construct and enrich lexical resources, and the use of lexical resources to improve NLP applications. Two main audiences can benefit from this tutorial: those working on language resources who are interested in becoming acquainted with automatic NLP techniques, with the goal of speeding up and/or easing the process of resource curation; and researchers in NLP who would like to leverage the knowledge encoded in lexical resources to improve their systems and models. The slides of the tutorial are available at https://bitbucket.org/luisespinosa/lr-nlp/


Description
The manual construction of lexical resources is a prohibitively time-consuming process, and even in the most restricted knowledge domains and less-resourced languages, the use of language technologies to ease this process is becoming standard practice. NLP techniques can be effectively leveraged to reduce creation and maintenance efforts. In this tutorial we will present open problems and research challenges concerning the interplay between lexical resources and NLP. Additionally, we will summarize existing attempts in this direction, such as modeling linguistic phenomena like terminology, definitions and glosses, examples and relations, phraseological units, or clustering techniques for senses and topics, as well as the integration of resources of different natures.
As far as the integration of lexical resources in NLP applications is concerned, we will explain some of the current challenges in Word Sense Disambiguation and Entity Linking, as key tasks in natural language understanding which also enable a direct integration of knowledge from lexical resources. We will explain some knowledge-based and supervised methods for these tasks which play a decisive role in connecting lexical resources and text data. Moreover, we will present the field of knowledge-based representations, in particular word sense embeddings, as flexible techniques which act as a bridge between lexical resources and applications. Finally, we will briefly present some recent work on the integration of this encoded knowledge from lexical resources into neural architectures for improving downstream NLP applications.

Introduction and Motivation
Adding explicit knowledge into AI/NLP systems is currently an important challenge due to the gains that can be obtained in many downstream applications. At the same time, these resources can be further enriched and better exploited by making use of NLP techniques. In this context, the main motivation of this tutorial is to show how Natural Language Processing and lexical resources have interacted so far, and to offer a view of potential scenarios in the near future.
As an introduction we first present an overview of current lexical resources, starting from the de facto standard lexical resource for English, WordNet (Fellbaum, 1998). We provide a concise overview of WordNet, showing what synsets are and how the resource can be viewed as a semantic network. We then briefly discuss some of the limitations of WordNet and how these can be alleviated to some extent with the help of collaboratively-constructed resources such as Freebase (Bollacker et al., 2008), Wikidata (Vrandečić, 2012) and BabelNet (Navigli and Ponzetto, 2012). As the main building blocks of these resources, we show how collaboratively-constructed projects such as Wikipedia and Wiktionary can serve as massive multilingual sources of lexical information. The session on lexical resources concludes with a short introduction to the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) and to a domain-specific lexical resource, SNOMED, one of the major ontologies for the medical domain.
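To make the notions of a synset and of WordNet as a semantic network concrete, the following toy sketch traverses hypernym edges in a hand-built fragment. The synset identifiers and the tiny taxonomy are illustrative only, not the real WordNet or its API:

```python
# A toy fragment of a WordNet-style semantic network (hand-built for
# illustration; the real WordNet is far larger and is usually accessed
# through a library such as NLTK's WordNet interface).
synsets = {
    "dog.n.01":       {"lemmas": ["dog", "domestic_dog"], "hypernym": "canine.n.01"},
    "canine.n.01":    {"lemmas": ["canine"],              "hypernym": "carnivore.n.01"},
    "carnivore.n.01": {"lemmas": ["carnivore"],           "hypernym": "mammal.n.01"},
    "mammal.n.01":    {"lemmas": ["mammal"],              "hypernym": None},
}

def hypernym_path(synset_id):
    """Follow hypernym edges up to the top of this toy taxonomy."""
    path = [synset_id]
    while synsets[path[-1]]["hypernym"] is not None:
        path.append(synsets[path[-1]]["hypernym"])
    return path

print(hypernym_path("dog.n.01"))
# ['dog.n.01', 'canine.n.01', 'carnivore.n.01', 'mammal.n.01']
```

Viewing the resource this way, as nodes (synsets) connected by typed edges (hypernymy, meronymy, etc.), is what enables the graph-based algorithms discussed later in the tutorial.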
The tutorial is then divided into two main blocks. First, we delve into NLP for Creation and Enrichment of Lexical Resources, where we address a range of NLP problems aimed specifically at improving repositories of linguistically expressible knowledge. Second, we cover different use cases in which Lexical Resources for NLP have been leveraged successfully. The last part of the tutorial focuses on lessons learned from work in which we tried to reconcile both worlds, as well as our own view of what the future holds for knowledge-based approaches to NLP.

NLP for Lexical Resources
The application of language technologies to the automatic construction and extension of lexical resources has proven successful, providing various tools for optimizing this often prohibitively costly process. NLP techniques offer methods that can tackle many of the challenges in the language resource creation and maintenance pipeline. In this tutorial we summarize existing efforts in this direction, including the extraction from text of linguistic phenomena like terminology, definitions and glosses, examples and relations, as well as clustering techniques for senses and topics.

Definition extraction.
Techniques for extracting definitional text snippets from corpora (Navigli and Velardi, 2010; Boella ...).

Topic/domain clustering techniques.
Relevant techniques for filtering general-domain resources via topic grouping (Roget, 1911; Navigli and Velardi, 2004).

Alignment of lexical resources.
Alignment of heterogeneous lexical resources contributes to the creation of large resources combining different sources of knowledge. We will present approaches for the construction of such resources, such as Yago (Suchanek et al., 2007), UBY (Gurevych et al., 2012), BabelNet (Navigli and Ponzetto, 2012) and ConceptNet (Speer et al., 2017), as well as other work aiming to improve the automatic procedures used to align lexical resources (Matuschek and Gurevych, 2013; Pilehvar and Navigli, 2014).
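As a minimal illustration of one signal used for resource alignment, glosses from two resources can be compared by lexical overlap. The sense identifiers, glosses, and threshold below are invented for the sketch; the systems cited above combine much richer similarity evidence:

```python
# Toy gloss-overlap alignment between two hypothetical sense inventories.
def jaccard(gloss_a, gloss_b):
    """Jaccard similarity between the word sets of two glosses."""
    a, b = set(gloss_a.lower().split()), set(gloss_b.lower().split())
    return len(a & b) / len(a | b)

resource_a = {
    "bank#1": "a financial institution that accepts deposits",
    "bank#2": "sloping land beside a body of water",
}
resource_b = {
    "Q1": "an institution that accepts deposits and makes loans",
    "Q2": "the land alongside a river or lake",
}

def align(res_a, res_b, threshold=0.15):
    """Link each sense in res_a to its most similar sense in res_b,
    keeping only links above a similarity threshold."""
    pairs = []
    for id_a, gloss_a in res_a.items():
        best_id, best_gloss = max(res_b.items(), key=lambda kv: jaccard(gloss_a, kv[1]))
        if jaccard(gloss_a, best_gloss) >= threshold:
            pairs.append((id_a, best_id))
    return pairs

print(align(resource_a, resource_b))
# [('bank#1', 'Q1'), ('bank#2', 'Q2')]
```

Even this crude overlap correctly pairs the financial and the river senses across the two toy inventories, which is the core intuition behind gloss-based alignment features.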

Lexical Resources for NLP
In addition to the (semi-)automatic efforts for easing the construction and enrichment of lexical resources presented in the previous section, we present NLP tasks to which lexical resources have made important contributions. Effectively leveraging linguistically expressible cues together with their associated knowledge remains a difficult task. Knowledge may be extracted from roughly three types of resource (Hovy et al., 2013): unstructured, e.g. text corpora; semi-structured, such as encyclopedic collaborative repositories like Wikipedia; and structured, which includes lexicographic resources like WordNet. In this section we present some of the applications in which different kinds of lexical resource (including their combination) play an important role. We begin by explaining some of the problems and challenges in Word Sense Disambiguation (WSD) and Entity Linking, key tasks in natural language understanding which enable the direct integration of knowledge from lexical resources. We describe the most relevant WSD systems: knowledge-based, whether relying on definitions (Lesk, 1986; Banerjee and Pedersen, 2003; Basile et al., 2014) or on graphs (Agirre et al., 2014; Moro et al., 2014); and supervised, from linear models (Zhong and Ng, 2010; Iacobacci et al., 2016) to the most recent branch exploiting neural networks (Melamud et al., 2016; Raganato et al., 2017b). We present an analysis of the main advantages and limitations of each kind of approach (Raganato et al., 2017a).
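As a concrete illustration of the definition-based family, the simplified Lesk algorithm picks the sense whose gloss shares the most words with the target word's context. The two-sense inventory and stopword list below are hand-built for the sketch; real implementations draw glosses from a resource such as WordNet:

```python
# Simplified Lesk word sense disambiguation over a toy sense inventory.
SENSES = {
    "bank": [
        ("bank.n.01", "a financial institution that accepts deposits and lends money"),
        ("bank.n.02", "sloping land beside a river or lake"),
    ]
}
STOPWORDS = {"a", "the", "and", "of", "to", "that", "at"}

def lesk(word, context):
    """Return the sense id whose gloss overlaps most with the context."""
    ctx = {w for w in context.lower().split() if w not in STOPWORDS}

    def overlap(gloss):
        return len(ctx & {w for w in gloss.split() if w not in STOPWORDS})

    return max(SENSES[word], key=lambda sense: overlap(sense[1]))[0]

print(lesk("bank", "I deposited money at the bank"))
# bank.n.01
```

The overlap here is counted on raw word sets; the extensions cited above enrich it with glosses of related senses (Banerjee and Pedersen, 2003) or distributional similarity (Basile et al., 2014).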
Then, we summarize the field of knowledge-based representations, in particular sense vectors and embeddings, as flexible techniques connecting lexical resources and downstream applications. We first present techniques which leverage WordNet as the main source of knowledge (Chen et al., 2014; Rothe and Schütze, 2015; Jauhar et al., 2015; Johansson and Pina, 2015; Pilehvar and Collier, 2016) and then present other techniques exploiting multilingual resources such as Wikipedia or BabelNet (Iacobacci et al., 2015; Camacho-Collados et al., 2016; Mancini et al., 2017).
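The intuition behind sense embeddings can be sketched with toy vectors: each sense of an ambiguous word gets its own vector, so similarity can be measured at the sense level rather than the word level. The three-dimensional vectors below are invented for illustration; real sense embeddings are learned from corpora linked to a resource such as WordNet or BabelNet and have hundreds of dimensions:

```python
from math import sqrt

# Toy sense vectors: one vector per sense, not per word form.
sense_vectors = {
    "bank_finance": (0.9, 0.1, 0.0),
    "bank_river":   (0.1, 0.8, 0.3),
    "money":        (0.85, 0.15, 0.05),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# The financial sense of "bank" lies much closer to "money"
# than the river sense does.
print(cosine(sense_vectors["bank_finance"], sense_vectors["money"]))
print(cosine(sense_vectors["bank_river"], sense_vectors["money"]))
```

Because each sense has a distinct vector, polysemy no longer collapses into a single point, which is what makes these representations a useful bridge between lexical resources and downstream applications.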
Finally, we briefly present a few successful approaches integrating knowledge-based representations into downstream tasks such as sentiment analysis (Flekova and Gurevych, 2016), lexical substitution (Cocos et al., 2017) or visual object discovery (Young et al., 2017). As a case study, we present an analysis on the integration of knowledge-based embeddings into neural architectures via WSD for text classification (Pilehvar et al., 2017), discussing its potential and current open challenges.

Open problems and challenges
In this last section we introduce some of the open problems and challenges in automating the resource creation and enrichment process, as well as in integrating knowledge from lexical resources into NLP applications.

Instructors
Jose Camacho Collados is a Research Associate at Cardiff University. Previously he was a Google Doctoral Fellow and completed his PhD at Sapienza University of Rome. His research focuses on Natural Language Processing and, more specifically, on the area of lexical and distributional semantics. Jose has experience in utilizing lexical resources for NLP applications, while enriching and improving these resources by extracting and processing knowledge from textual data. In this area he has co-organized the SemEval 2018 shared task on Hypernym Discovery. Previously, he co-organized a workshop on Sense, Concept and Entity Representations and their Applications at EACL 2017 and a tutorial on the same topic at ACL 2016. His educational background includes an Erasmus Mundus Master in Natural Language Processing and Human Language Technology and a 5-year BSc degree in Mathematics.
Luis Espinosa Anke received his BA in English Philology in 2006 (Univ. of Alicante, Spain), and his PhD in Natural Language Processing in 2017 (Univ. Pompeu Fabra, Spain). He holds two MAs, one in English-Spanish Translation (Univ. of Alicante), and an Erasmus Mundus MA in Natural Language Processing (NLP) (Univ. of Wolverhampton and Univ. Autonoma de Barcelona). His research interests lie at the intersection of structured representations of knowledge and NLP, specifically computational lexicography and distributional semantics. He has co-organized the SemEval 2018 shared tasks on Hypernym Discovery and Multilingual Emoji Prediction. Previously, he co-organized the Spanish NLP conference (2014) and the Focused NER task (Open Knowledge Extraction challenge) at ESWC 2017.
Mohammad Taher Pilehvar is a Research Associate at the University of Cambridge. Taher's research lies in lexical semantics, mainly focusing on semantic representation and similarity. In the past, he has co-instructed three tutorials on these topics (EMNLP 2015, ACL 2016, and EACL 2017) and co-organised three SemEval tasks. He has also co-authored several conference and journal papers (including two ACL best paper nominations, in 2013 and 2017), covering different semantic representation techniques based on heterogeneous lexical resources.