Scalable Construction and Reasoning of Massive Knowledge Bases

In today’s information-based society, abundant knowledge is carried in natural language text (e.g., news articles, social media posts, scientific publications) that spans many domains (e.g., corporate documents, advertisements, legal acts, medical reports) and grows at an astonishing rate. Yet this knowledge is mostly inaccessible to computers and overwhelming for human experts to absorb. How to turn such massive, unstructured text data into structured, actionable knowledge, and furthermore how to teach machines to reason over and complete the extracted knowledge, is a grand challenge for the research community. Traditional information extraction (IE) systems assume abundant human annotations for training high-quality machine learning models, which is impractical when deploying IE systems across a broad range of domains, settings and languages. In the first part of the tutorial, we introduce how to extract structured facts (i.e., entities and their relations for types of interest) from text corpora to construct knowledge bases, with a focus on weakly supervised, domain-independent methods that allow timely knowledge base construction across application domains. In the second part, we introduce how to leverage other knowledge, such as the distributional statistics of characters and words, annotations for other tasks and other domains, and linguistic and problem structures, to combat inadequate supervision and conduct low-resource information extraction. In the third part, we describe recent advances in knowledge base reasoning. We start with a gentle introduction to the literature, focusing on path-based and embedding-based methods. We then describe DeepPath, a recent attempt to use deep reinforcement learning to combine the best of both worlds for knowledge base reasoning.

Introduction
Motivation.
The success of data mining and artificial intelligence technology is largely attributed to the efficient and effective analysis of structured data. Constructing a well-structured, machine-actionable knowledge base (KB) from raw (unstructured or loosely structured) data sources is often the prerequisite for subsequent applications. Although the majority of data generated in our society is unstructured, big data brings big opportunities to uncover the structure of real-world entities (e.g., person, product), attributes (e.g., age, weight), and relations (e.g., employee of, manufacture) from massive text corpora. By integrating these semantic structures, one can construct a powerful KB as a conceptual abstraction of the original corpus. The constructed knowledge base facilitates browsing information and inferring knowledge that is otherwise widely scattered across the text corpora. Machines can then perform algorithmic analysis over these KBs at large scale and apply the resulting insights to improve human productivity in various downstream tasks.

Our Focus.
In this tutorial, we focus our discussion on two tightly related problems: automatic construction of knowledge bases from text, and knowledge reasoning for knowledge base completion. While traditional information extraction techniques rely heavily on human-annotated data, our tutorial will devote more time to methods that reduce human effort in the process, by leveraging external knowledge sources (e.g., distant supervision) and exploiting rich data redundancy in massive text corpora (e.g., weak supervision). We also discuss how data sources from various domains and languages open up tremendous opportunities to leverage and transfer existing knowledge about domains, tasks and languages, and so help knowledge extraction in low-resource settings with minimal supervision. In the reasoning part, we aim to leverage existing background knowledge and design algorithms that fill in the missing links between entities in the KB, given that extracted KBs are likely to be incomplete. More specifically, this part will introduce two lines of research for KB reasoning: path-based and embedding-based methods.
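To make the distant supervision idea mentioned above concrete, here is a minimal sketch under stated assumptions: a sentence that mentions both entities of a known KB fact is heuristically labeled with that fact's relation. The toy KB, the relation names, and the function `distant_label` are hypothetical and chosen only for illustration; plain substring matching stands in for a real entity linker.

```python
from typing import Dict, List, Tuple

# Hypothetical toy KB mapping (head entity, tail entity) -> relation name.
TOY_KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Microsoft", "Bill Gates"): "founded_by",
}


def distant_label(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Return (sentence, head, tail, relation) for every sentence that
    mentions both entities of some KB fact.

    Assumption: substring matching is used in place of entity linking.
    """
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


corpus = [
    "Barack Obama was born in Honolulu.",
    "Barack Obama visited Chicago.",
    "Bill Gates started Microsoft in 1975.",
]
training_examples = distant_label(corpus, TOY_KB)
```

Labels produced this way are noisy, since a sentence can mention both entities without expressing the KB relation, which is one reason such automatically labeled data usually needs noise-aware training.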
Topics to be covered in this tutorial. The first two-thirds of this tutorial presents a comprehensive overview of the information extraction techniques developed in recent years for constructing knowledge bases (see also Section 2 for a more detailed outline). We will discuss the following key issues: (1) data-driven approaches for mining quality phrases from massive, unstructured text corpora; (2) entity recognition and typing: preliminaries, challenges, and methodologies; (3) relation extraction: previous efforts, limitations, recent progress, and a joint entity and relation extraction method using distant supervision; (4) multi-task and multi-domain learning for low-resource information extraction; and (5) distilling linguistic knowledge into neural models to help low-resource information extraction. The remaining third of the tutorial presents a comprehensive overview of KB reasoning techniques. For path-based methods, we will first describe the Path-Ranking Algorithm (PRA) (Lao et al., 2011) and briefly describe extensions such as ProPPR. Our tutorial will also cover the recent integration of PRA with recurrent neural networks. For embedding-based methods, we will briefly describe RESCAL (Nickel et al., 2011) and TransE (Bordes et al., 2013). Finally, we discuss DeepPath (Xiong et al., 2017), a novel deep reinforcement learning model that combines the embedding-based and path-based approaches for the learning-to-reason problem.
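As a rough illustration of the embedding-based methods named above, the sketch below scores a candidate triple (head, relation, tail) in the style of TransE and RESCAL. TransE treats a relation as a translation vector, so a plausible triple has h + r close to t; RESCAL models each relation as a matrix and scores a triple with the bilinear form h^T M_r t. The function names and the hand-picked two-dimensional vectors are assumptions for illustration; in practice the embeddings are learned, and TransE may use the L1 norm instead of the L2 norm used here.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def transe_score(head: Vector, relation: Vector, tail: Vector) -> float:
    """TransE-style score: negative L2 distance between head + relation and tail.

    Higher (closer to zero) means the triple is more plausible.
    """
    squared = sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    return -math.sqrt(squared)


def rescal_score(head: Vector, relation: Matrix, tail: Vector) -> float:
    """RESCAL-style bilinear score: head^T * relation_matrix * tail."""
    return sum(
        head[i] * relation[i][j] * tail[j]
        for i in range(len(head))
        for j in range(len(tail))
    )


# Hand-picked toy embeddings (assumed, not learned).
h = [1.0, 0.0]
t = [1.0, 1.0]
r_vec = [0.0, 1.0]                # translation that maps h exactly onto t
r_mat = [[0.0, 1.0], [0.0, 0.0]]  # relation matrix for the bilinear score

good = transe_score(h, r_vec, t)        # h + r equals t, so distance is zero
bad = transe_score(h, [3.0, 4.0], t)    # h + r is far from t
bilinear = rescal_score(h, r_mat, t)
```

For KB completion, such scores are typically used to rank candidate tail (or head) entities for a query with a missing link.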
Research Impact. Our phrase mining tool, SegPhrase (Liu et al., 2015), won the grand prize of the Yelp Dataset Challenge and was used by TripAdvisor in production. Our entity recognition and typing system, ClusType, was shipped as part of products at Microsoft Bing and the U.S. Army Research Lab. We built the first named entity recognizer for Chinese social media (Peng and Dredze, 2015, 2016) and closed the gap between NER on English and Chinese social media. The same technique was applied to build the first relation extractor for cross-sentence, n-ary relation extraction among drugs, genes, and mutations (Peng et al., 2017).
Duration and Sessions. The duration of the tutorial is flexible: it is expected to be 3 hours, but it can be extended to 6 hours, depending on the needs of the conference. The outline presented here is for the 3-hour tutorial. For a longer tutorial, we plan to extend the entity and relation extraction parts and add more case studies and applications.
Relevance to ACL. Machine "reading" of and "reasoning" over large text corpora have long been of interest to the CL and NLP communities, especially now that people are exposed to an explosion of information in the form of free text. Extracting structured information is key to understanding messy and scattered raw data, and effective reasoning tools are critical for the use of KBs in downstream tasks such as QA. This tutorial will present an organized picture of recent research on knowledge base construction and reasoning. We will show how exciting and surprising knowledge can be discovered from your own not-so-well-structured raw corpora, and how the resulting incomplete KBs can be used to derive new insights and more complex knowledge with reasoning techniques.

Outline
This tutorial presents a comprehensive overview of techniques for automatic knowledge base construction from text data (especially from a large, domain-specific text corpus), and techniques for reasoning over large-scale knowledge bases. We will discuss the following key issues:

Previous Editions and Related Tutorials
A list of tutorials on the most related topics: Most of the previous tutorials focused exclusively on knowledge base construction. In the proposed tutorial, we will also give a systematic discussion of knowledge base reasoning, which has been studied extensively in recent years but still lacks systematic tutorials. This tutorial also presents recent advances in applying distant and weak supervision to the extraction of structured facts for knowledge base construction, in addition to traditional supervised techniques and rule-based approaches.
Target audience and prerequisites.
Researchers and practitioners in the fields of natural language processing, computational linguistics, text mining, information retrieval, semantic web and machine learning. While an audience with a good background in these areas would benefit most from this tutorial, we believe the material will give a general audience and newcomers an introduction to current work and important research topics in this field, and inspire them to learn more. Only preliminary knowledge of NLP, algorithms and their applications is needed. We expect around 70 people to be interested in our tutorial.
Tutorial material and equipment. We will provide attendees with a website and upload our tutorial materials (slides, references, software). There are no copyright issues. Standard equipment will be enough for our tutorial.