Zero-Shot Open Entity Typing as Type-Compatible Grounding

The problem of entity typing has been studied predominantly as a supervised learning problem, mostly with task-specific annotations (for coarse types) and sometimes with distant supervision (for fine types). While such approaches have strong performance within datasets, they often lack the flexibility to transfer across text genres and to generalize to new type taxonomies. In this work we propose a zero-shot entity typing approach that requires no annotated data and can flexibly identify newly defined types. Given a type taxonomy, the entries of which we define as Boolean functions of FREEBASE "types," we ground a given mention to a set of type-compatible Wikipedia entries, and then infer the target mention's type using an inference algorithm that makes use of the types of these entries. We evaluate our system on a broad range of datasets, including standard fine-grained and coarse-grained entity typing datasets, and on a dataset in the biological domain. Our system is shown to be competitive with state-of-the-art supervised NER systems, and to outperform them on out-of-training datasets. We also show that our system significantly outperforms other zero-shot fine typing systems.


Introduction
Entity type classification is the task of connecting an entity mention to a given set of semantic types. Commonly used type sets range in size and level of granularity, from a small number of coarse-grained types (Tjong Kim Sang and De Meulder, 2003) to over a hundred fine-grained types (Ling and Weld, 2012). Semantic typing is understood to be a key component in many natural language understanding tasks, including Question Answering (Toral et al., 2005; Li and Roth, 2005) and Textual Entailment (Dagan et al., 2010, 2013). Consequently, the ability to type mentions semantically across domains and text genres, and to use a flexible type hierarchy, is essential for solving many important challenges. Nevertheless, most commonly used approaches and systems for semantic typing (e.g., CORENLP (Manning et al., 2014), COGCOMPNLP (Khashabi et al., 2018), NLTK (Loper and Bird, 2002), SPACY) are trained in a supervised fashion and rely on high-quality, task-specific annotations. Scaling such systems to other domains and to a larger set of entity types faces fundamental restrictions.
Coarse typing systems, which are mostly fully supervised, are known to fit a single dataset very well. However, their performance drops significantly on different text genres and even on new datasets. Moreover, adding a new coarse type requires manual annotation and retraining. For fine typing systems, a distant-supervision approach has been adopted. Nevertheless, the number of types used remains small: the distantly supervised FIGER dataset covers only 113 types, a small fraction of the most conservative estimates of the number of types in the English language (the FREEBASE (Bollacker et al., 2008) and WORDNET (Miller, 1995) hierarchies consist of more than 1k and 1.5k unique types, respectively). More importantly, adapting these systems, once trained, to new type taxonomies cannot be done flexibly.
As was argued in Roth (2017), there is a need to develop new training paradigms that support scalable semantic processing; specifically, there is a need to scale semantic typing to flexible type taxonomies and to multiple domains.
In this work, we introduce ZOE, a zero-shot entity typing system with open type definitions. Given a mention in a sentence and a taxonomy of entity types with their definitions, ZOE identifies a set of types that are appropriate for the mention in this context.

Figure 1: ZOE maps a given mention to its type-compatible entities in Wikipedia and infers a collection of types using this set of entities. While the mention "Oarnniwsf," a football player at the U. of Washington, does not exist in Wikipedia, we ground it to other entities with approximately the same types (§3).

ZOE does not require any training, and it makes use of existing data resources (e.g., Wikipedia) and tools developed without any task-specific annotation. The key idea is to ground each mention to a set of type-compatible Wikipedia entities. The benefit of using a set of Wikipedia titles as an intermediate representation for a mention is that Wikipedia contains a wealth of human-curated information: categories associated with each page, FREEBASE types, and DBpedia types. This information was put there independently of the task at hand and can be harnessed for many tasks, in particular for determining the semantic types of a given mention in its context. In this grounding step, the guiding principle is that type-compatible entities often appear in similar contexts. We rely on contextual signals and, when available, surface forms, to rank Wikipedia titles and choose those that are most compatible with a given mention. Importantly, our algorithm does not require a given mention to exist in Wikipedia; in fact, in many cases (such as nominal mentions) the mentions are not available in Wikipedia. We hypothesize that any entity that can be mentioned in English corresponds to some type-compatible entities in Wikipedia. We can then rely mostly on the context to reveal a set of compatible titles: those that are likely to share semantic types with the target mention. The fact that our system is not required to ground to the exact concept is a key difference between our grounding and "standard" Wikification approaches (Mihalcea and Csomai, 2007; Ratinov et al., 2011). As a consequence, while entity linking approaches rely heavily on priors associated with the surface forms and do not consider mentions that do not link to Wikipedia titles, our system relies mostly on context, regardless of whether the exact grounding exists. Figure 1 shows a high-level visualization of our system. Given a mention, our system grounds it into type-compatible entries in Wikipedia. The target mention "Oarnniwsf" is not in Wikipedia, yet it is grounded to entities with approximately correct types. In addition, while some of the grounded Wikipedia entries are inaccurate in terms of entity types, the resulting aggregated decision is correct.
ZOE is an open type system, since it is not restricted to a closed set of types. In our experiments, we build on FREEBASE types as primitive types and use them to define types across seven different datasets. Note, however, that our approach is not fundamentally restricted to FREEBASE types; in particular, we allow types to be defined as Boolean formulae over these primitives (considering a type to be a set of entities). Furthermore, we support other primitives, e.g., DBpedia or Wikipedia entries. Consequently, our system can be used across type taxonomies; there is no need to restrict it to previously observed types or to retrain with annotations of new types. If one wants to use types that are outside our current vocabulary, one only needs to define the target type taxonomy in terms of the primitives used in this work.
In summary, our contributions are as follows:
• We propose a zero-shot open entity typing framework that does not require training on entity-typing-specific supervised data.
• The proposed system outperforms existing zero-shot entity typing systems.
• Our system is competitive with fully-supervised systems in their respective domains across a broad range of coarse- and fine-grained typing datasets, and it outperforms these systems in out-of-domain settings.

Related Work
There is a considerable amount of work on Named Entity Recognition (NER), in which the goal is to discover mention boundaries in addition to typing, often using a small set of mutually exclusive types (Grishman and Sundheim, 1996; Mikheev et al., 1999; Tjong Kim Sang and De Meulder, 2003; Florian et al., 2003; Ratinov and Roth, 2009).
There have been many proposals to scale the systems to support a bigger type space (Fleischman and Hovy, 2002; Sekine et al., 2002). This direction was followed by the introduction of datasets with large label sets, either manually annotated, like BBN (Weischedel and Brunstein, 2005), or distantly supervised, like FIGER (Ling and Weld, 2012). With larger datasets available, supervised-learning systems were proposed to learn from the data (Yosef et al., 2012; Abhishek et al., 2017; Shimaoka et al., 2017; Xu and Barbosa, 2018; Choi et al., 2018). Such systems have achieved remarkable success, mostly when restricted to their observed domain and labels.
There is a handful of works aiming to pave the road towards zero-shot typing by extracting cheap signals, often to help supervised algorithms: e.g., by generating gazetteers (Nadeau et al., 2006) or by using the anchor texts in Wikipedia (Nothman et al., 2008, 2009). Ren et al. (2016) project labels into a high-dimensional space and use label correlations to suppress noise and better model their relations. In our work, we choose not to use the supervised-learning paradigm and instead rely merely on a general entity linking corpus and the signals in Wikipedia. Prior work has already shown the importance of Wikipedia information for NER. Tsai et al. (2016a) use a cross-lingual WIKIFIER to facilitate cross-lingual NER. However, they do not explicitly address the case where the target entity does not exist in Wikipedia.
The zero-shot paradigm for entity typing has only recently been studied. Yogatama et al. (2015) proposed an embedding representation for user-defined features and labels, which facilitates information sharing among labels and reduces the dependence on the labels observed in the training set. The work of Yuan and Downey (2018) can also be seen in the same spirit, i.e., systems that rely on a form of representation of the labels. In a broader sense, such works, including ours, are part of a more general line of work on zero-shot learning (Chang et al., 2008; Palatucci et al., 2009; Norouzi et al., 2013; Romera-Paredes and Torr, 2015; Song and Roth, 2014). Our work can be thought of as a continuation of the same research direction. A critical step in the design of zero-shot systems is the characterization of the output space. For supervised systems, the output representations are trivial, as they are just indices. For zero-shot systems, the output space is often represented in a high-dimensional space that encodes the semantics of the labels. In OTYPER (Yuan and Downey, 2018), each type embedding is computed by averaging the word embeddings of the words comprising the type. The same idea is used in PROTOLE (Ma et al., 2016), except that the averaging is done only over a few prototypical instances of each type. In our work, we choose to define types using information in Wikipedia. This flexibility allows our system to perform well across several datasets without retraining. On a conceptual level, the work of Lin et al. (2012) and related clustering-based approaches is close to ours; the governing idea in these works is to cluster mentions and then propagate type information from representative mentions. Table 1 compares our proposed system with several recently proposed models.

Table 1: Comparison of recent work on entity typing, by approach (concept-embedding, clustering, and our type-compatible concepts), whether the approach is zero-shot, and whether it uses labeled data. Our system does not require any labeled data for entity typing; therefore it works on new datasets without retraining.

Zero-Shot Open Entity Typing
Types are conceptual containers that bind entities together to form a coherent group. Among entities of the same type, type-compatibility creates a network of loosely connected entities:

Definition 1 (Weak Type Compatibility) Two entities are type-compatible if they share at least one type with respect to a type taxonomy and the contexts in which they appear.
In our approach, given a mention in a sentence, we aim to discover type-compatible entities in Wikipedia and then infer the mention's types using all the type-compatible entities together. The advantage of using Wikipedia entries is that the rich information associated with them allows us to infer the types more easily. Note that this problem is different from the standard entity linking or Wikification problem in which the goal is to find the corresponding entity in Wikipedia. Wikipedia does not contain all entities in the world, but an entity is likely to have at least one type-compatible entity in Wikipedia.
In order to find type-compatible entities, we use the context of mentions as a proxy. Formally:

Definition 2 (Context Consistency) A mention m (in a context sentence s) is context-consistent with another well-defined mention m′ if m can be replaced by m′ in the context s and the new sentence still makes logical sense.
Hypothesis 1 Context consistency is a strong proxy for type compatibility.
Based on this hypothesis, given a mention m in a sentence s, we find other context-consistent mentions in a Wikified corpus. Since the mentions in the Wikified corpus are linked to the corresponding Wikipedia entries, we can infer m's types by aggregating information associated with these Wikipedia entries. Figure 2 shows the high-level architecture of our proposed system. The inputs to the system are a mention m in a sentence s, and a type definition T. The output of the system is a set of types {t_Target} ⊆ T in the target taxonomy that best represents the given mention. The type definitions characterize the target entity-type space. In our experiments, we choose to use FREEBASE types to define the types across 7 datasets; that is, T is a mapping from the set of FREEBASE types to the set of target types: T : {t_FB} → {t_Target}. This definition comprises many atomic definitions; for example, we can define the type location as the disjunction of FREEBASE types like FB.location and FB.geography:

location := FB.location ∨ FB.geography.

The type definitions of a dataset reflect the understanding of a domain expert and the assumptions made in dataset design. Such definitions are often much cheaper to create than annotating full-fledged supervised datasets. It is important to emphasize that, to use our system on different datasets, one does not need to retrain it; there is a single system used across different datasets, working with different type definitions.
For notational simplicity, we define a few conventions for the rest of the paper. The notation t ∈ T simply means that t is a member of the image of the map T (i.e., t is one of the target types). For a fixed concept c, the notation T(c) denotes the application of T(·) to the FREEBASE types attached to the concept c. For a collection of concepts C, T(C) is defined as ∪_{c∈C} T(c). We use T_coarse(·) to refer to the subset of coarse types of T(·), and T_fine(·) for the subset of fine types.
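To make the type-definition interface concrete, the following is a minimal Python sketch of T as Boolean functions over FREEBASE types. The specific definitions shown are illustrative examples, not the ones used in our experiments (those are in Appendix D).

```python
# Each target type is a Boolean function over the set of FREEBASE types
# attached to a concept. These particular definitions are illustrative.
TYPE_DEFINITIONS = {
    "location": lambda fb: "FB.location" in fb or "FB.geography" in fb,  # disjunction
    "person": lambda fb: "FB.people" in fb,
    # Boolean formulae may also use conjunction and negation:
    "company": lambda fb: "FB.organization" in fb and "FB.education" not in fb,
}

def T_of_concept(freebase_types):
    """T(c): the target types licensed by the FREEBASE types of a concept c."""
    fb = set(freebase_types)
    return {t for t, defn in TYPE_DEFINITIONS.items() if defn(fb)}

def T_of_collection(collection):
    """T(C) = union of T(c) over c in C; `collection` holds per-concept
    FREEBASE type sets."""
    out = set()
    for fb in collection:
        out |= T_of_concept(fb)
    return out
```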
Components in Figure 2 are described in the following sections.

Initial Concept Candidate Generation
Given a mention, the goal of this step is to quickly generate a set of candidate Wikipedia entries based on the other words in the sentence. Since there are millions of entries in Wikipedia, it is extremely inefficient to go through all of them for each mention. We adopt ideas from explicit semantic analysis (ESA) (Gabrilovich and Markovitch, 2007), an approach that represents words with vectors of Wikipedia concepts and provides fast retrieval of the relevant Wikipedia concepts via an inverted index.
In our construction we use the WIKILINKS (Singh et al., 2012) corpus, which contains a total of 40 million mentions over 3 million concepts. Each mention in WIKILINKS is associated with a Wikipedia concept. Formally, in the WIKILINKS corpus, each concept c comes with a set of example sentences sent(c) = {s_i}.
Offline computation: The first step is to construct an ESA representation for each word in the WIKILINKS corpus. We create a mapping from each word in the corpus to the relevant concepts associated with it. The result is a map S from tokens to weighted concepts, S : w ↦ {(c, score(c|w))} (see Figure 3), where score(c|w) denotes the association of the word w with concept c, calculated as the sum of the TF-IDF values of the word w in the sentences describing c:

score(c|w) = Σ_{s ∈ sent(c)} TF-IDF(w, s).

That is, we treat each sentence as a document and compute TF-IDF scores for the words in it.
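A minimal sketch of this offline step, with a toy in-memory corpus standing in for WIKILINKS; tokenization and smoothing are simplified, and the `sentences_by_concept` input format is a hypothetical stand-in.

```python
import math
from collections import Counter, defaultdict

def build_word_concept_map(sentences_by_concept):
    """Build S : w -> {c: score(c|w)}, where score(c|w) is the sum of the
    TF-IDF values of w over the sentences sent(c). Each sentence is
    treated as its own document when computing TF-IDF."""
    docs = [(c, s.lower().split())
            for c, sents in sentences_by_concept.items() for s in sents]
    n_docs = len(docs)
    df = Counter()                       # document frequency per word
    for _, toks in docs:
        df.update(set(toks))
    S = defaultdict(lambda: defaultdict(float))
    for concept, toks in docs:
        tf = Counter(toks)
        for w, f in tf.items():
            tfidf = (f / len(toks)) * math.log(n_docs / df[w])
            S[w][concept] += tfidf       # accumulate over sent(c)
    return S
```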
Online computation: For a given mention m and its sentence context s, we use the offline word-concept map S to find the concepts associated with each word and aggregate them into a single list of weighted concepts, i.e., Σ_{w∈s} S(w). The resulting concepts are sorted by their weights, and the top ESA candidates form a set C_ESA, which is passed to the next step.
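Continuing the sketch above, the online step is a weighted aggregation over the words of the sentence; here `top_k` plays the role of the ESA parameter (300 in our experiments).

```python
from collections import defaultdict

def esa_candidates(sentence, S, top_k=300):
    """Sum the per-word concept scores over the sentence and keep the
    top-scoring concepts as C_ESA."""
    scores = defaultdict(float)
    for w in sentence.lower().split():
        for concept, sc in S.get(w, {}).items():
            scores[concept] += sc
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]
```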

Context-Consistent Re-Ranking
After quick retrieval of the initial concept candidates, we re-rank the concepts in C_ESA based on context consistency between the input mention and concept mentions in WIKILINKS.
For this step, assume we have a representation that encodes the sentential information anchored on the mention. We denote this mention-aware context representation as SentRep(s|m). We define a measure of consistency between a concept c and a mention m in a sentence s:

Consistency(c, s, m) := cos(SentRep(s|m), ConceptRep(c)),    (1)

where ConceptRep(c) is the representation of a concept,

ConceptRep(c) := (1/|sent(c)|) Σ_{s′ ∈ sent(c)} SentRep(s′|m_c),

i.e., the average of the representations of all the sentences in WIKILINKS that describe the given concept (with m_c the mention linked to c in each sentence). We use pre-trained ELMO (Peters et al., 2018), a state-of-the-art contextual and mention-aware word representation. To generate SentRep(s|m), we run ELMO on the sentence s, where the tokens of the mention m are concatenated into a single token, and retrieve the mention's ELMO vector as SentRep(s|m).
According to this consistency measure, we select the top ELMO concepts for each mention. We call this set of concepts C_ELMO.
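A sketch of the re-ranking step, abstracting the encoder behind a `sent_rep(sentence, mention)` callable that stands in for the mention-aware ELMO representation; `wikilinks_sents`, mapping a concept to its (sentence, mention) pairs, is likewise a hypothetical name.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def concept_rep(concept, wikilinks_sents, sent_rep):
    """ConceptRep(c): average mention-aware representation of the
    WIKILINKS sentences describing the concept."""
    vecs = [sent_rep(s, m) for s, m in wikilinks_sents[concept]]
    return np.mean(vecs, axis=0)

def rerank(mention, sentence, c_esa, wikilinks_sents, sent_rep, top_k=20):
    """Score each ESA candidate with Consistency(c, s, m) (Eq. 1) and keep
    the top-k as C_ELMO, returned together with their scores."""
    target = sent_rep(sentence, mention)              # SentRep(s|m)
    scored = [(c, cosine(target, concept_rep(c, wikilinks_sents, sent_rep)))
              for c in c_esa]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
```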

Surface-Based Concept Generation
While context is often a key signal for typing, one should not ignore the information included in the surface form of a mention. If the corresponding concept or entity exists in Wikipedia, many mentions can be accurately grounded using only the prior probability Pr(concept|surface). This prior distribution is precomputed by calculating the frequency with which a given surface string refers to different concepts within Wikipedia.
At test time, for a given mention m, we use the precomputed probability distribution to obtain the most likely concept, c_surf = argmax_c Pr(c|m).
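A minimal sketch of this surface-based step, assuming the prior has been precomputed into a nested dictionary; the λ filter from the type inference step (§3.4) is folded in here for brevity.

```python
def surface_concept(mention, prior, lam=0.5):
    """Return c_surf = argmax_c Pr(c|mention), or None when the surface
    form is unknown or its best prior falls below the threshold lambda.
    `prior` maps a surface string to {concept: Pr(concept|surface)}."""
    dist = prior.get(mention.lower())
    if not dist:
        return None
    concept, p = max(dist.items(), key=lambda kv: kv[1])
    return concept if p >= lam else None
```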

Type Inference
Our inference algorithm starts with selection of concepts, followed by inference of coarse and fine types. Our approach is outlined in Algorithm 1 and explained below.
Concept inference. To integrate surface-based and context-based concepts, we follow a simple rule: if the prior probability of the surface-based concept c_surf is below a threshold λ, we ignore it; otherwise we include it among the concepts selected from context (C_ELMO) and choose the coarse and fine types from c_surf only.
To map the selected concepts to the target entity types, we retrieve the FREEBASE types of each concept and then apply the type definition T (defined just before §3.1). In Algorithm 1, the set of target types of a concept c is denoted T(c). This is followed by an aggregation step that selects a coarse type t_coarse ∈ T_coarse(·), and ends with the selection of a set of fine types {t_fine} ⊆ T_fine(·).
Coarse type inference. Our type inference algorithm follows a relatively simple confidence-analysis procedure. To this end, we define Count(t; C) to be the number of occurrences of type t in a collection of concepts C:

Count(t; C) := |{c : c ∈ C and t ∈ T(c)}|.
In theory, for a sensible type t, the normalized count of context-consistent concepts that have this type should be higher than that of the initial concept candidates; in other words,

(Count(t; C_ELMO)/|C_ELMO|) / (Count(t; C_ESA)/|C_ESA|) > 1.

We select the first concept (in the C_ELMO ranking) that has some coarse type matching this criterion. If there is no such concept, we use the coarse types of the highest-scoring concept. To select one of the coarse types of the selected concept, we let each concept of C_ELMO vote with its consistency score. We name this voting-based procedure SelectCoarse(c); it selects one coarse type from a given concept:

SelectCoarse(c) := argmax_{t ∈ T_coarse(c)} Σ_{c′ ∈ C_ELMO : t ∈ T(c′)} Consistency(c′, s, m),

where consistency is defined in Equation (1).
Fine type inference. Given the selected coarse type, we keep only the fine types that are compatible with it (e.g., the fine type /people/athlete is compatible with the coarse type /people).
Among the compatible fine types, we further filter out the ones that lack support from the context. Specifically, we select the fine types t_f such that

Count(t_f; C_ELMO) / Count(t_c; C_ELMO) ≥ η_c,

where t_c is the previously selected coarse type compatible with t_f. Intuitively, this ratio filters out fine-grained candidate types that do not have enough support relative to the selected coarse type.
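Putting the inference steps together, here is a condensed sketch under simplifying assumptions: C_ELMO is non-empty, every concept has at least one coarse type, fine/coarse compatibility is approximated by path prefixes (e.g., /people/athlete vs. /people), and the helper names are illustrative rather than the actual implementation of Algorithm 1.

```python
def count(t, concepts, T):
    """Count(t; C) = |{c in C : t in T(c)}|."""
    return sum(1 for c in concepts if t in T(c))

def infer_types(c_elmo_scored, c_esa, T, coarse_only, eta_c=0.3):
    """c_elmo_scored: ranked (concept, consistency) pairs (C_ELMO);
    c_esa: initial candidates (C_ESA); T(c): target types of concept c;
    coarse_only(ts): the coarse subset of a type set."""
    c_elmo = [c for c, _ in c_elmo_scored]
    score = dict(c_elmo_scored)

    def supported(t):
        # the normalized count of t should grow from C_ESA to C_ELMO
        r_elmo = count(t, c_elmo, T) / max(len(c_elmo), 1)
        r_esa = count(t, c_esa, T) / max(len(c_esa), 1)
        return r_esa > 0 and r_elmo / r_esa > 1.0

    # first concept in the C_ELMO ranking with a supported coarse type,
    # falling back to the highest-scoring concept
    selected = next((c for c in c_elmo
                     if any(supported(t) for t in coarse_only(T(c)))),
                    c_elmo[0])
    # SelectCoarse: concepts in C_ELMO vote with their consistency scores
    t_coarse = max(coarse_only(T(selected)),
                   key=lambda t: sum(score[c] for c in c_elmo if t in T(c)))
    # fine types compatible with t_coarse that have enough relative support
    denom = max(count(t_coarse, c_elmo, T), 1)
    fine = {t for c in c_elmo for t in T(c)
            if t != t_coarse and t.startswith(t_coarse)
            and count(t, c_elmo, T) / denom >= eta_c}
    return t_coarse, fine
```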

Experiments
Empirically, we study the behavior of our system in comparison with published results. All results are reproduced by us except the ones indicated by *, which are cited directly from their corresponding papers.
Datasets. In our experiments, we use a wide range of typing datasets:
• For coarse entity typing, we use MUC (Grishman and Sundheim, 1996), CoNLL (Tjong Kim Sang and De Meulder, 2003), and OntoNotes (Hovy et al., 2006).
• For fine-grained entity typing, we use FIGER (Ling and Weld, 2012), BBN (Weischedel and Brunstein, 2005), and OntoNotes_fine (§4.1).
• In addition, we evaluate on a biology entity-typing dataset (Deléger et al., 2016) (§4.4).

Table 2: Evaluation of fine-grained entity typing (§4.1): we compare our system with state-of-the-art systems. For each column, the best zero-shot and overall results are bold-faced and underlined, respectively. Numbers are F1 in percentage. For supervised systems, we report their in-domain performance, since they do not transfer to other datasets with different labels. For OTYPER, cells with gray color indicate in-domain evaluation, the setting in which it performs best.
Our system outperforms all the other zero-shot baselines and achieves competitive results compared to the best supervised systems.

Table 3: Evaluation of coarse entity typing (§4.2): we compare two supervised entity typers with our system. For the supervised systems, cells with gray color indicate in-domain evaluation. For each column, the best out-of-domain and overall results are bold-faced and underlined, respectively. Numbers are F1 in percentage. In most of the out-of-domain settings our system outperforms the supervised systems.
ZOE's parameters. We use a different type definition for each dataset. To design the type definitions for each dataset, we follow Abhishek et al. (2017) and randomly sample 10% of the test set; we exclude this sampled set from the experiments. For completeness, we include the type definitions of the major experiments in Appendix D.
The parameters are set universally across the different experiments. For the parameters that determine the number of extracted concepts, we use ESA = 300 and ELMO = 20, based on the upper-bound analysis in Appendix A. For the other parameters, we set λ = 0.5, η_s = 0.8, and η_c = 0.3, based on the FIGER dev set. We emphasize that these parameters are universal across our evaluations.
Evaluation metrics. Given a collection of mentions M, denote the sets of gold types and predicted types of a mention m ∈ M as T_g(m) and T_p(m), respectively. We define the following metrics for our evaluations:
• Macro precision: P_ma = (1/|M|) Σ_{m∈M} |T_g(m) ∩ T_p(m)| / |T_p(m)|; the Macro recall and F1 follow the same pattern.
• Micro precision: P_mi = Σ_{m∈M} |T_g(m) ∩ T_p(m)| / Σ_{m∈M} |T_p(m)|, and the Micro recall and F1 follow the same pattern.
A code sketch of these standard metrics is given at the end of this subsection. In the experiment in §4.3, to evaluate systems on unseen types, we use modified versions of these metrics. Let G(t) be the number of mentions with gold type t, P(t) the number of mentions predicted to have type t, and C(t) the number of mentions correctly predicted to have type t:
• The precision corresponding to F1_ma^type is defined as Σ_t (C(t)/P(t)) · (G(t)/Σ_{t′} G(t′)); recall follows the same pattern.
• The precision corresponding to F1_mi^type is defined as Σ_t C(t) / Σ_t P(t); recall follows the same pattern.

Baselines. In addition to the best published results on each dataset, we create two simple and effective baselines. The first baseline, ELMONN, selects the nearest-neighbor types to a given mention, where mentions and types are represented by ELMO vectors. To create a representation for each type t, we average the representations of the WIKILINKS sentences that contain mentions of type t (as explained in §3.2). Our other baseline, WIKIFIERTYPER, uses Wikifier (Tsai et al., 2016b) to map the mention to a Wikipedia concept, followed by a mapping to FREEBASE types, and finally a projection to the target types via the type definition function T(·). Additionally, to compare with published zero-shot systems, we compare our system to OTYPER, a recently published open-typing system. Unfortunately, to the best of our knowledge, the systems proposed by Ma et al. (2016), among others, are not available online for empirical comparison.
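As referenced above, a minimal sketch of the standard Ma/Mi metrics, assuming a non-empty mention collection and that `gold` and `pred` map every mention to a (possibly empty) set of types; conventions for empty predictions differ slightly across implementations.

```python
def safe_div(a, b):
    return a / b if b else 0.0

def macro_micro_f1(gold, pred):
    """Macro and Micro F1 over mentions, per the definitions above."""
    def f1(p, r):
        return safe_div(2 * p * r, p + r)
    n = len(gold)
    p_ma = sum(safe_div(len(gold[m] & pred[m]), len(pred[m])) for m in gold) / n
    r_ma = sum(safe_div(len(gold[m] & pred[m]), len(gold[m])) for m in gold) / n
    inter = sum(len(gold[m] & pred[m]) for m in gold)
    p_mi = safe_div(inter, sum(len(pred[m]) for m in gold))
    r_mi = safe_div(inter, sum(len(gold[m]) for m in gold))
    return f1(p_ma, r_ma), f1(p_mi, r_mi)
```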

Fine-Grained Entity Typing
We evaluate our system on fine-grained named-entity typing. Table 2 shows the evaluation results on three datasets: FIGER, BBN, and OntoNotes_fine. For each dataset, we report our system's performance, our zero-shot baselines, and two supervised systems (AFET plus the state of the art). There is no easy way to transfer the supervised systems across datasets, hence no out-of-domain numbers for those systems. For each dataset, we train OTYPER and evaluate it on the test sets of all three datasets. In order to run OTYPER on different datasets, we disabled its dataset-specific entity and type features. Among the open typing systems, our system obtains significantly better results. In addition, our system achieves competitive scores compared to the supervised systems.

Coarse Entity Typing
In Table 3 we study entity typing for coarse types on three datasets. We focus on the three types that are shared among the datasets: PER, LOC, and ORG. In coarse entity typing, the best available systems are heavily supervised. In this evaluation, we use gold mention spans; i.e., we force the decoding algorithm of the supervised systems to select the best of the three classes for each gold mention. As expected, the supervised systems have strong in-domain performance. However, they suffer a significant drop when evaluated on a different domain. Our system, while not trained on any supervised data, achieves performance better than or comparable to the supervised baselines in the out-of-domain evaluations.

Typing of Unseen Types within Domain
We compare the quality of open typing, in which the target type(s) have not been seen before. We compare our system to OTYPER, which relies on supervised data to create representations for each type but is not restricted to the observed types. We follow a setting similar to Yuan and Downey (2018): we split the FIGER test set into folds (one fold per type) and perform cross-validation. For each fold, mentions of only one type are used for evaluation, and the rest are used for training OTYPER (a sketch of this protocol follows). To evaluate on unseen types (only for this experiment), we use the modified metrics F1_ma^type and F1_mi^type that measure the per-type quality of a system (§4). In this experiment, we focus on a within-domain setting and show the results of transfer across genres in the next experiments. The results are summarized in Table 4. We observe a significant margin between ZOE and the other systems, including OTYPER.
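For reference, a sketch of this leave-one-type-out protocol (the names are illustrative): each type defines a fold whose mentions are held out for evaluation on that unseen type, while the remaining mentions are available for training the supervised comparison system.

```python
def per_type_folds(mentions, gold):
    """Yield (type, eval_fold, train_rest): mentions whose gold types
    include t form the evaluation fold for the unseen type t; the rest
    may be used to train a system such as OTYPER."""
    types = sorted({t for m in mentions for t in gold[m]})
    for t in types:
        eval_fold = [m for m in mentions if t in gold[m]]
        train_rest = [m for m in mentions if t not in gold[m]]
        yield t, eval_fold, train_rest
```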

Biology Entity Typing
We go beyond the scope of popular entity-typing tasks and evaluate the quality of our system on a dataset that contains sentences from scientific papers (Deléger et al., 2016), which makes it different from other entity-typing datasets. The mentions refer either to "bacteria" or to some miscellaneous class (two-class typing). As indicated in Table 5, our system's overall scores are higher than those of our baselines.

Table 6: Ablation study of the different ways in which concepts are generated in our system (§4.5). The first row shows the performance of our system on each dataset, followed by the change in performance upon dropping a component. While both signals are crucial, contextual information plays a more important role than the mention-surface signal.

Ablation Study
We carry out ablation studies that quantify the contributions of surface information (§3.3) and context information (§3.2). As Table 6 shows, both factors are crucial and complementary for the system; however, contextual information has a bigger role overall. We complement our qualitative analysis with the quantitative share of each component: our system uses the context information (and ignores the surface) for 69.3%, 54.6%, and 69.7% of mentions in the FIGER, BBN, and OntoNotes_fine datasets, respectively, underscoring the importance of contextual information.

Error Analysis
We provide insights into the specific reasons for the mistakes made by the system. For our analysis, we use the erroneous decisions on the FIGER dev set. Two independent annotators label the cause(s) of the mistakes, with 83% agreement between the annotators; disagreements are reconciled in an adjudication step.
1. Incompatible concept, due to context information: ambiguous or short contexts often contribute to inaccurate mappings to concepts. In our manual annotation, 23.3% of the errors are caused, at least partly, by this issue.
2. Incompatible concept, due to surface information: even when the prior probability is high, the surface-based concept can be wrong. About 26% of the errors are partly due to surface signal errors.
3. Incorrect type, due to type inference: even when the system finds several type-compatible concepts, it can fail due to inference errors. This happens when the types attached to the type-compatible concepts are not the majority among the types attached to the other concepts. This is the major cause behind 56.6% of the errors.
4. Incorrect type, due to type definition: some errors are caused by an inaccurate definition of the type mapping function T. About 23% of the mistakes are partly caused by this issue.
Note that each mistake can have multiple causes; in other words, the above categories are not mutually disjoint. A slightly more detailed analysis is included in Appendix C.

Conclusion
Moving beyond the fully supervised paradigm and scaling entity-typing systems to support bigger type sets is a crucial challenge for NLP. In this work, we have presented ZOE, a zero-shot open entity typing framework. The significance of this work is threefold. First, the proposed system does not require task-specific labeled data; it relies on type definitions, which are much cheaper to obtain than annotations of thousands of examples. Second, our system outperforms existing state-of-the-art zero-shot systems by a significant margin. Third, we show that, without reliance on task-specific supervision, one can achieve relatively robust transfer across datasets.