AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding

Distant supervision has been widely used in current fine-grained entity typing systems to automatically assign categories (entity types) to entity mentions. However, the types so obtained from knowledge bases are often incorrect for the entity mention’s local context. This paper proposes a novel embedding method that models “clean” and “noisy” mentions separately and incorporates the given type hierarchy to induce loss functions. We formulate a joint optimization problem to learn embeddings for mentions and type-paths, and develop an iterative algorithm to solve the problem. Experiments on three public datasets demonstrate the effectiveness and robustness of the proposed method, with an average 15% improvement in accuracy over the next best compared method.


Introduction
Assigning types (e.g., person, organization) to mentions of entities in context is an important task in natural language processing (NLP). The extracted entity type information can serve as primitives for relation extraction (Mintz et al., 2009) and event extraction (Ji and Grishman, 2008), and assists a wide range of downstream applications including knowledge base (KB) completion (Dong et al., 2014), question answering (Lin et al., 2012) and entity recommendation (Yu et al., 2014). While traditional named entity recognition systems (Ratinov and Roth, 2009; Nadeau and Sekine, 2007) focus on a small set of coarse types (typically fewer than 10), recent studies (Ling and Weld, 2012; Yosef et al., 2012) work on a much larger set of fine-grained types (usually over 100) which form a tree-structured hierarchy (see the blue region of Fig. 1). Fine-grained typing allows one mention to have multiple types, which together constitute a type-path (not necessarily ending in a leaf node) in the given type hierarchy, depending on the local context (e.g., sentence). Consider the example in Fig. 1: "Arnold Schwarzenegger" could be labeled as {person, businessman} in S3 (investment), but he could also be labeled as {person, politician} in S1 or {person, artist, actor} in S2. Such a fine-grained type representation provides more informative features for other NLP tasks. For example, since relation and event extraction pipelines rely on an entity recognizer to identify possible arguments in a sentence, fine-grained argument types help distinguish hundreds or thousands of different relations and events (Ling and Weld, 2012).
Traditional named entity recognition systems adopt manually annotated corpora as training data (Nadeau and Sekine, 2007). But manually labeling a training set with a large number of fine-grained types is too expensive and error-prone (it is hard for annotators to distinguish over 100 types consistently). Current fine-grained typing systems annotate training corpora automatically using knowledge bases (i.e., distant supervision) (Ling and Weld, 2012; Ren et al., 2016a). A typical workflow of distant supervision is as follows (see Fig. 1): (1) identify entity mentions in the documents; (2) link mentions to entities in the KB; and (3) assign, to the candidate type set of each mention, all KB types of its KB-linked entity. However, existing distant supervision methods encounter the following limitations when doing automatic fine-grained typing.
• Noisy Training Labels. Current practice of distant supervision may introduce label noise into the training data, since it fails to take a mention's local context into account when assigning type labels (e.g., see Fig. 1). Many previous studies ignore the label noise, which appears in a majority of training mentions (see Table 1, row (1)), and assume all types obtained by distant supervision are "correct" (Yogatama et al., 2015; Ling and Weld, 2012). The noisy labels may mislead the trained models and hurt performance. A few systems try to denoise the training corpora using simple pruning heuristics such as deleting mentions with conflicting types. However, such strategies significantly reduce the size of the training set (Table 1, rows (2a-c)) and lead to performance degradation (later shown in our experiments). The larger the target type set, the more severe the loss.
• Type Correlation. Most existing methods (Yogatama et al., 2015; Ling and Weld, 2012) treat every type label in a training mention's candidate type set equally and independently when learning the classifiers, but ignore the fact that types in the given hierarchy are semantically correlated (e.g., actor is more relevant to singer than to politician). As a consequence, the learned classifiers may be biased toward popular types but perform poorly on infrequent types, since training data on infrequent types is scarce. Intuitively, one should impose a smaller penalty on types which are semantically more relevant to the true types. For example, in Fig. 1 singer should receive a smaller penalty than politician does, given that actor is a true type for "Arnold Schwarzenegger" in S2. This provides classifiers with additional information to distinguish between two types, especially infrequent ones.
In this paper, we approach the problem of automatic fine-grained entity typing as follows: (1) Use different objectives to model training mentions with correct type labels and mentions with noisy labels, respectively. (2) Design a novel partial-label loss to model true types within the noisy candidate type set, which requires only the "best" candidate type to be relevant to the training mention, and progressively estimate the best type by leveraging various text features extracted for the mention. (3) Derive type correlation based on two signals: (i) the given type hierarchy, and (ii) the shared entities between two types in the KB, and incorporate the correlation so induced by enforcing adaptive margins between different types for mentions in the training set.
To integrate these ideas, we develop a novel embedding-based framework called AFET. First, it uses distant supervision to obtain candidate types for each mention, and extracts a variety of text features from the mentions themselves and their local contexts. Mentions are partitioned into a "clean" set and a "noisy" set based on the given type hierarchy. Second, we embed mentions and types jointly into a low-dimensional space in which objects (i.e., features and types) that are semantically close to each other also have similar representations. In the proposed objective, an adaptive margin-based rank loss models the set of clean mentions to capture type correlation, and a partial-label rank loss models the "best" candidate type for each noisy mention. Finally, with the learned embeddings (i.e., mapping matrices), one can predict the type-path for each mention in the test set in a top-down manner, using its text features.
The major contributions of this paper are as follows:
1. We propose an automatic fine-grained entity typing framework, which reduces label noise introduced by distant supervision and incorporates type correlation in a principled way.
2. A novel optimization problem is formulated to jointly embed entity mentions and types into the same space. It models the noisy type set with a partial-label rank loss and type correlation with an adaptive-margin rank loss.
3. We develop an iterative algorithm for solving the joint optimization problem efficiently.
4. Experiments with three public datasets demonstrate that AFET achieves significant improvement over the state of the art.

Automatic Fine-Grained Entity Typing
Our task is to automatically uncover the type information for entity mentions (i.e., token spans representing entities) in natural language sentences. The task takes a document collection $\mathcal{D}$ (automatically labeled using a KB $\Psi$ in conjunction with a target type hierarchy $\mathcal{Y}$) as input and predicts a type-path in $\mathcal{Y}$ for each mention from the test set $\mathcal{D}_t$.
Type Hierarchy and Knowledge Base. Two key factors in distant supervision are the target type hierarchy and the KB. A type hierarchy, $\mathcal{Y}$, is a tree whose nodes represent types of interest from $\Psi$. Previous studies manually created several clean type hierarchies using types from Freebase (Ling and Weld, 2012) or WordNet (Yosef et al., 2012). In this study, we adopt the existing hierarchies constructed using Freebase types (we use the Freebase dump as of 2015-06-30). To obtain types for the entities $\mathcal{E}_\Psi$ in $\Psi$, we use the human-curated entity-type facts in Freebase, denoted as $\mathcal{F}_\Psi = \{(e, y)\} \subset \mathcal{E}_\Psi \times \mathcal{Y}$.

Automatically Labeled Training Corpora. There exist publicly available labeled corpora such as Wikilinks (Singh et al., 2012) and ClueWeb (Gabrilovich et al., 2013). In these corpora, entity mentions are identified and mapped to KB entities using anchor links. In specific domains (e.g., product reviews) where such public corpora are unavailable, one can utilize distant supervision to automatically label the corpus (Ling and Weld, 2012). Specifically, an entity linker detects mentions $m_i$ and maps them to one or more entities $e_i$ in $\mathcal{E}_\Psi$. The types of $e_i$ in the KB are then associated with $m_i$ to form its type set $\mathcal{Y}_i$, i.e., $\mathcal{Y}_i = \{y \mid (e_i, y) \in \mathcal{F}_\Psi,\ y \in \mathcal{Y}\}$. Formally, a training corpus $\mathcal{D}$ consists of a set of extracted entity mentions $\mathcal{M}$, where each mention $m_i$ comes with its context $c_i$ and its candidate type set $\mathcal{Y}_i$.
Problem Description. For each test mention, we aim to predict the correct type-path in $\mathcal{Y}$ based on the mention's context. More specifically, the test set $\mathcal{T}$ is defined as a set of mention-context pairs $(m, c)$, where the mentions in $\mathcal{T}$ (denoted as $\mathcal{M}_t$) are extracted from their sentences using existing extractors such as a named entity recognizer (Finkel et al., 2005). We denote the gold type-path for a test mention $m$ as $\mathcal{Y}^*$. This work focuses on learning a typing model from the noisy training corpus $\mathcal{D}$, and estimating $\mathcal{Y}^*$ from $\mathcal{Y}$ for each test mention $m$ (in set $\mathcal{M}_t$), based on the mention $m$, its context $c$, and the learned model.
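The candidate type set construction described above can be illustrated with a minimal Python sketch. The function names, the entity identifier and the toy facts are hypothetical and only mirror the Arnold Schwarzenegger example; a real system would obtain the linked entity from an entity linker and the facts from the KB.

```python
from collections import defaultdict

def build_type_index(facts):
    """Group entity-type facts (e, y) by entity."""
    index = defaultdict(set)
    for entity, type_label in facts:
        index[entity].add(type_label)
    return index

def candidate_types(linked_entity, type_index, target_types):
    """Candidate type set of a mention: all KB types of its linked entity
    that also belong to the target type hierarchy."""
    return type_index.get(linked_entity, set()) & set(target_types)

# Toy entity-type facts (hypothetical data, not taken from Freebase).
facts = [
    ("Arnold_Schwarzenegger", "person"),
    ("Arnold_Schwarzenegger", "politician"),
    ("Arnold_Schwarzenegger", "artist"),
    ("Arnold_Schwarzenegger", "actor"),
    ("Arnold_Schwarzenegger", "businessman"),
    ("Arnold_Schwarzenegger", "bodybuilder"),  # outside the target hierarchy
]
target_types = {"person", "politician", "artist", "actor", "businessman", "athlete"}

type_index = build_type_index(facts)
print(sorted(candidate_types("Arnold_Schwarzenegger", type_index, target_types)))
```

Note that every mention linked to the same entity receives the same candidate set regardless of its sentence, which is exactly the source of label noise discussed in the introduction.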
Framework Overview. At a high level, the AFET framework (see also Fig. 2) learns low-dimensional representations for entity types and text features, and infers type-paths for test mentions using the learned embeddings. It consists of the following steps:

1. Extract text features for entity mentions in the training set $\mathcal{M}$ and the test set $\mathcal{M}_t$ using their surface names as well as their contexts (Sec. 3.1).
2. Partition the training mentions $\mathcal{M}$ into a clean set (denoted as $\mathcal{M}_c$) and a noisy set (denoted as $\mathcal{M}_n$) based on their candidate type sets (Sec. 3.2).
3. Perform joint embedding of the entity mentions $\mathcal{M}$ and the type hierarchy $\mathcal{Y}$ into the same low-dimensional space, in which close objects also share similar types (Secs. 3.3-3.6).
4. For each test mention $m$, estimate its type-path $\mathcal{Y}^*$ (on the hierarchy $\mathcal{Y}$) in a top-down manner using the learned embeddings (Sec. 3.6).

The AFET Framework
This section introduces the proposed framework and formulates an optimization problem for learning embeddings of text features and entity types jointly.

Text Feature Generation
We start with a representation of entity mentions. To capture the shallow syntax and distributional semantics of a mention $m_i \in \mathcal{M}$, we extract various features from both $m_i$ itself and its context $c_i$. Table 2 lists the set of text features used in this work, which is similar to those used in (Yogatama et al., 2015; Ling and Weld, 2012). We denote the set of $M$ unique features extracted from $\mathcal{D}$ as $\mathcal{F} = \{f_j\}_{j=1}^{M}$.
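As an illustration of how a mention can be turned into sparse feature counts, the following Python sketch extracts a few of the surface features listed in Table 2 (head, tokens, character trigrams, word shape, length and context n-grams). It is only a sketch under stated assumptions: the feature name prefixes (`HEAD_`, `TKN_`, `CHAR_`, `SHAPE_`, `LEN_`, `CXT_B:`, `CXT_A:`) are hypothetical, the head token index is assumed to be given by a parser, and POS, Brown cluster and dependency features are omitted because they require external tools.

```python
from collections import Counter

def char_trigrams(token):
    """Character trigrams of a lowercased token, padded with ':' boundaries."""
    padded = ":" + token.lower() + ":"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_shape(token):
    """Collapse a token into its shape, e.g. 'Turing' -> 'Aa'."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "A"
        elif ch.islower():
            symbol = "a"
        elif ch.isdigit():
            symbol = "0"
        else:
            symbol = ch
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)

def extract_features(tokens, start, end, head_index):
    """Count features for the mention tokens[start:end]."""
    mention = tokens[start:end]
    feats = Counter()
    feats["HEAD_" + tokens[head_index]] += 1
    for tok in mention:
        feats["TKN_" + tok] += 1
        feats["SHAPE_" + word_shape(tok)] += 1
    for tri in char_trigrams(tokens[head_index]):
        feats["CHAR_" + tri] += 1
    feats["LEN_%d" % len(mention)] += 1
    # Unigram and bigram immediately before and after the mention.
    before = tokens[max(0, start - 2):start]
    after = tokens[end:end + 2]
    if before:
        feats["CXT_B:" + before[-1]] += 1
    if len(before) == 2:
        feats["CXT_B:" + " ".join(before)] += 1
    if after:
        feats["CXT_A:" + after[0]] += 1
    if len(after) == 2:
        feats["CXT_A:" + " ".join(after)] += 1
    return feats

tokens = ["a", "member", "of", "Maserati", ",", "Turing", "Machine", "and", "The", "Juan", "MacLean"]
feats = extract_features(tokens, start=5, end=7, head_index=5)
print(sorted(feats.items()))
```

The resulting counts correspond to the non-zero entries of the mention's feature vector once each feature string is mapped to an index.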

Training Set Partition
A training mention $m_i$ (in set $\mathcal{M}$) is considered a "clean" mention if its candidate type set obtained by distant supervision (i.e., $\mathcal{Y}_i$) is not ambiguous, i.e., the candidate types in $\mathcal{Y}_i$ form a single path in the tree $\mathcal{Y}$. Otherwise, if its candidate types form multiple type-paths in $\mathcal{Y}$, the mention is considered a "noisy" mention. Following this hypothesis, we judge each mention $m_i$ (in set $\mathcal{M}$) and place it in either the "clean" set $\mathcal{M}_c$ or the "noisy" set $\mathcal{M}_n$. Finally, we have $\mathcal{M} = \mathcal{M}_c \cup \mathcal{M}_n$.
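The partition rule can be sketched in Python as follows. The hierarchy is represented by a hypothetical `parent` dictionary, and a candidate set is treated as a single path when its types, sorted by depth, form a chain in which each type is the parent of the next. Two details are assumptions of this sketch rather than statements from the text: the path is not required to start at a top-level type, and an empty candidate set is treated as not clean.

```python
def depth(type_label, parent):
    """Number of edges between a type and the top of its tree."""
    d = 0
    while parent[type_label] is not None:
        type_label = parent[type_label]
        d += 1
    return d

def is_single_path(candidates, parent):
    """True if the candidate types form one chain in the hierarchy."""
    if not candidates:
        return False  # assumption: mentions without candidate types are not clean
    chain = sorted(candidates, key=lambda t: depth(t, parent))
    return all(parent[child] == ancestor for ancestor, child in zip(chain, chain[1:]))

def partition(mentions, parent):
    """Split (mention_id, candidate_set) pairs into clean and noisy mention ids."""
    clean, noisy = [], []
    for mention_id, candidates in mentions:
        (clean if is_single_path(candidates, parent) else noisy).append(mention_id)
    return clean, noisy

# Toy hierarchy (hypothetical), child -> parent.
parent = {
    "person": None,
    "politician": "person",
    "businessman": "person",
    "artist": "person",
    "actor": "artist",
}
mentions = [
    ("m1", {"person", "politician", "artist", "actor", "businessman"}),
    ("m2", {"person", "artist", "actor"}),
    ("m3", {"person"}),
]
clean, noisy = partition(mentions, parent)
print(clean, noisy)
```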

The Joint Mention-Type Model
We propose to learn mappings into a low-dimensional vector space in which both entity mentions and type labels (in the training set) are represented, and in which two objects are embedded close to each other if and only if they share similar types. In doing so, we can later derive the representation of a test mention from its text features and the learned mappings. The mapping functions for entity mentions and entity type labels are different, as they have different representations in the raw feature space, but they are jointly learned by optimizing a global objective designed to handle the aforementioned challenges. Each entity mention $m_i \in \mathcal{M}$ can be represented by an $M$-dimensional feature vector $\mathbf{m}_i \in \mathbb{R}^M$, where $m_{i,j}$ is the number of occurrences of feature $f_j$ (in set $\mathcal{F}$) for $m_i$. Each type label $y_k \in \mathcal{Y}$ is represented by a $K$-dimensional binary indicator vector $\mathbf{y}_k \in \{0, 1\}^K$, where $y_{k,k} = 1$ and all other entries are 0.

Specifically, we aim to learn a mapping function from the mention's feature space to a low-dimensional vector space, i.e., $\Phi_{\mathcal{M}}(m_i): \mathbb{R}^M \to \mathbb{R}^d$, and a mapping function from the type label space to the same low-dimensional space, i.e., $\Phi_{\mathcal{Y}}(y_k): \mathbb{R}^K \to \mathbb{R}^d$. In this work, we adopt linear maps, similar to the mapping functions used in (Weston et al., 2011):
$$\Phi_{\mathcal{M}}(m_i) = U\,\mathbf{m}_i, \qquad \Phi_{\mathcal{Y}}(y_k) = V\,\mathbf{y}_k,$$
where $U \in \mathbb{R}^{d \times M}$ and $V \in \mathbb{R}^{d \times K}$ are the projection matrices for mentions and type labels, respectively.
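To make the linear maps concrete, here is a small dependency-free Python sketch that projects a mention feature vector and a type indicator vector with two projection matrices and scores the pair by an inner product. The matrices and sizes are toy values chosen for illustration, and plain lists are used instead of a numerical library purely for self-containment.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def embed_mention(U, m):
    """Project an M-dimensional feature count vector into d dimensions."""
    return matvec(U, m)

def embed_type(V, k):
    """Project the indicator vector of type k; this selects column k of V."""
    num_types = len(V[0])
    indicator = [1.0 if j == k else 0.0 for j in range(num_types)]
    return matvec(V, indicator)

def score(U, V, m, k):
    """Similarity of a mention and type k: inner product of their embeddings."""
    return sum(a * b for a, b in zip(embed_mention(U, m), embed_type(V, k)))

# Toy sizes: d = 2 dimensions, M = 3 features, K = 2 types.
U = [[1.0, 0.0, 0.5],
     [0.0, 2.0, 0.0]]
V = [[1.0, 0.0],
     [0.0, 1.0]]
m = [2.0, 0.5, 2.0]

print(embed_mention(U, m), embed_type(V, 1), score(U, V, m, 0))
```

Because the type vector is an indicator, the type embedding is simply a column lookup, so only the mention side requires a real matrix-vector product.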

Modeling Type Correlation
Table 2: Text features used in this paper. "Turing Machine" is used as an example mention from "The band's former drummer Jerry Fuchs-who was also a member of Maserati, Turing Machine and The Juan MacLean-died after falling down an elevator shaft.".

| Feature | Description | Example |
|---|---|---|
| Head | Syntactic head token of the mention | "HEAD Turing" |
| Token | Tokens in the mention | "Turing", "Machine" |
| POS | Part-of-Speech tag of tokens in the mention | "NN" |
| Character | All character trigrams in the head of the mention | ":tu", "tur", ..., "ng:" |
| Word Shape | Word shape of the tokens in the mention | "Aa" for "Turing" |
| Length | Number of tokens in the mention | "2" |
| Context | Unigrams/bigrams before and after the mention | "CXT B:Maserati ,", "CXT A:and the" |
| Brown Cluster | Brown cluster ID for the head token (learned using $\mathcal{D}$) | "4 1100", "8 1101111", "12 111011111111" |
| Dependency | Stanford syntactic dependency (Manning et al., 2014) associated with the head token | "GOV:nn", "GOV:turing" |

In the type hierarchy (tree) $\mathcal{Y}$, types closer to each other (i.e., connected by a shorter path) tend to be more related (e.g., actor is more related to artist than to person in the right column of Fig. 2). In the KB $\Psi$, types assigned to similar sets of entities should be more related to each other than those assigned to quite different entities (Jiang et al., 2015) (e.g., actor is more related to director than to author in the left column of Fig. 3). Thus, the type correlation between $y_k$ and $y_{k'}$ (denoted as $w_{kk'}$) can be measured either as one over the length of the shortest path between them in $\mathcal{Y}$, or as the normalized number of entities the two types share in the KB.
Although a shortest path is efficient to compute, its accuracy is limited: it is not always true that a type (e.g., athlete) is more related to its parent type (i.e., person) than to its sibling types (e.g., coach), or that all sibling types are equally related to each other (e.g., actor is more related to director than to author). We later compare these two methods in our experiments.
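The two correlation measures can be sketched in Python as below. The hierarchy-based measure follows the text directly (one over the shortest-path length in the tree). For the KB-based measure, the exact normalization is not spelled out here, so the sketch assumes one plausible choice: the average of the shared-entity fraction relative to each type's own entity set. The `parent` and `entities_of` dictionaries are hypothetical toy data.

```python
def ancestors(type_label, parent):
    """The type itself followed by its ancestors up to the top of the tree."""
    chain = [type_label]
    while parent[type_label] is not None:
        type_label = parent[type_label]
        chain.append(type_label)
    return chain

def path_length(a, b, parent):
    """Number of edges on the shortest path between two types in the tree."""
    position_in_b = {t: i for i, t in enumerate(ancestors(b, parent))}
    for i, t in enumerate(ancestors(a, parent)):
        if t in position_in_b:
            return i + position_in_b[t]
    raise ValueError("types do not share an ancestor")

def hierarchy_correlation(a, b, parent):
    """One over the shortest-path length; defined for two distinct types."""
    return 1.0 / path_length(a, b, parent)

def kb_correlation(a, b, entities_of):
    """Assumed normalization: mean of the overlap fraction w.r.t. each type."""
    shared = len(entities_of[a] & entities_of[b])
    if shared == 0:
        return 0.0
    return 0.5 * (shared / len(entities_of[a]) + shared / len(entities_of[b]))

parent = {"person": None, "artist": "person", "politician": "person",
          "actor": "artist", "director": "artist", "author": "artist"}
entities_of = {"actor": {"e1", "e2", "e3", "e4"},
               "director": {"e3", "e4", "e5"},
               "author": {"e4", "e6"}}

print(hierarchy_correlation("actor", "director", parent),
      hierarchy_correlation("actor", "author", parent))
print(kb_correlation("actor", "director", entities_of),
      kb_correlation("actor", "author", entities_of))
```

In this toy example the hierarchy assigns the same correlation to both sibling pairs, whereas the entity overlap separates them, which is the limitation of the shortest-path measure noted above.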
With the type correlation computed, we propose to apply adaptive penalties to different negative type labels (for a training mention), instead of treating all of the labels equally as in most existing work (Weston et al., 2011). The hypothesis is intuitive: given the positive type labels for a mention, we force the negative type labels which are related to the positive type labels to receive a smaller penalty. For example, in the right column of Fig. 3, the negative label businessman receives a smaller penalty (i.e., margin) than athlete does, since businessman is more related to politician.
Hypothesis 1 (Adaptive Margin) For a mention, if a negative type is correlated to a positive type, the margin between them should be smaller.
We propose an adaptive-margin rank loss to model the set of "clean" mentions (i.e., $\mathcal{M}_c$), based on the above hypothesis. The intuition is simple: for each mention, rank all the positive types ahead of the negative types, where the ranking score is measured by the similarity between mention and type. We denote by $f_k(m_i)$ the similarity between $m_i$ and $y_k$, defined as the inner product of $\Phi_{\mathcal{M}}(m_i)$ and $\Phi_{\mathcal{Y}}(y_k)$.
In this loss, $\gamma_{k,\bar{k}}$ is the adaptive margin between positive type $k$ and negative type $\bar{k}$, defined as $\gamma_{k,\bar{k}} = 1 + 1/(w_{k,\bar{k}} + \alpha)$ with a smoothing parameter $\alpha$.
$L(x) = \sum_{i=1}^{x} 1/i$ transforms a rank into a weight, which is then multiplied by the max-margin loss $\Theta_{i,k,\bar{k}}$ to optimize precision at $x$ (Weston et al., 2011).
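The adaptive-margin rank loss for a single clean mention can be sketched in Python as follows. The precise form is an assumption of this sketch, modeled on the WARP loss of Weston et al. (2011): the max-margin term is taken to be $\max\{0, \gamma_{k,\bar{k}} - f_k(m_i) + f_{\bar{k}}(m_i)\}$, the rank of a positive type is taken to be the number of negative types that violate their margin, and the rank is converted to a weight by a harmonic sum. The function names and toy values are hypothetical.

```python
def rank_weight(rank):
    """Harmonic weight 1 + 1/2 + ... + 1/rank (0 for rank 0)."""
    return sum(1.0 / i for i in range(1, rank + 1))

def adaptive_margin(W, k, k_neg, alpha):
    """Margin 1 + 1/(w + alpha): more correlated types get a smaller margin."""
    return 1.0 + 1.0 / (W[k][k_neg] + alpha)

def clean_loss(scores, positives, W, alpha=0.5):
    """Adaptive-margin rank loss of one clean mention (sketch, see assumptions).

    scores[k] is the similarity between the mention and type k, positives is
    the set of positive type indices, W[k][k'] is the type correlation, and
    alpha must be positive so that the margin is defined when W is zero.
    """
    negatives = [k for k in range(len(scores)) if k not in positives]
    total = 0.0
    for k in positives:
        hinges = [max(0.0, adaptive_margin(W, k, kn, alpha) - scores[k] + scores[kn])
                  for kn in negatives]
        rank = sum(1 for h in hinges if h > 0.0)  # negatives violating the margin
        total += rank_weight(rank) * sum(hinges)
    return total

# Toy example: type 0 is positive; type 1 is correlated with it, type 2 is not.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
print(adaptive_margin(W, 0, 1, 0.5), adaptive_margin(W, 0, 2, 0.5))
print(clean_loss([1.0, 0.0, 0.0], {0}, W, alpha=0.5))
print(clean_loss([5.0, 0.0, 0.0], {0}, W, alpha=0.5))
```

With these toy values the correlated negative type must be beaten by a margin of about 1.67, while the uncorrelated one must be beaten by a margin of 3, which is the behavior stated in Hypothesis 1.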

Modeling Noisy Type Labels
True type labels for the noisy entity mentions $\mathcal{M}_n$ (i.e., mentions with ambiguous candidate types in the given type hierarchy) in each sentence are not available in knowledge bases. To effectively model the set of noisy mentions, we propose not to treat all candidate types (i.e., $\{\mathcal{Y}_i\}$) as true labels. Instead, we model the "true" label among the candidate set as a latent value, and try to infer it using text features.
Hypothesis 2 (Partial-Label Loss) For a noisy mention, the maximum score associated with its candidate types should be greater than the scores associated with any other non-candidate types.
We extend the partial-label loss in (Nguyen and Caruana, 2008) (used to learn linear classifiers) to enforce Hypothesis 2, and integrate it with the adaptive margin to define the loss for $m_i$ (in set $\mathcal{M}_n$).
Minimizing $\ell_n$ encourages a large margin between the maximum score over the candidate types, $\max_{y_k \in \mathcal{Y}_i} f_k(m_i)$, and the maximum score over the non-candidate types $\bar{\mathcal{Y}}_i$, $\max_{y_{\bar{k}} \in \bar{\mathcal{Y}}_i} f_{\bar{k}}(m_i)$. This forces $m_i$ to be embedded closer to the most "relevant" type in the noisy candidate type set, i.e., $y^* = \arg\max_{y_k \in \mathcal{Y}_i} f_k(m_i)$, than to any non-candidate type (i.e., Hypothesis 2). This contrasts sharply with multi-label learning (Yosef et al., 2012), where a large margin is enforced between all candidate types and non-candidate types without considering noisy types.
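A minimal Python sketch of the partial-label loss for one noisy mention is given below. Its exact form is an assumption: the hinge compares the best-scoring candidate type with the best-scoring non-candidate type under the adaptive margin $1 + 1/(w + \alpha)$, and the result is weighted by a WARP-style harmonic rank weight. The function names and toy scores are hypothetical.

```python
def rank_weight(rank):
    """Harmonic weight 1 + 1/2 + ... + 1/rank (0 for rank 0)."""
    return sum(1.0 / i for i in range(1, rank + 1))

def noisy_loss(scores, candidates, W, alpha=0.5):
    """Partial-label loss of one noisy mention (sketch, see assumptions).

    Only the best-scoring candidate type has to beat the non-candidate types;
    the remaining candidates are neither rewarded nor penalized.
    Returns the loss and the index of the currently best candidate type.
    """
    non_candidates = [k for k in range(len(scores)) if k not in candidates]
    best = max(candidates, key=lambda k: scores[k])
    best_neg = max(non_candidates, key=lambda k: scores[k])
    margin = 1.0 + 1.0 / (W[best][best_neg] + alpha)
    hinge = max(0.0, margin - scores[best] + scores[best_neg])
    # Rank of the best candidate: non-candidates that violate their margin.
    rank = sum(1 for kn in non_candidates
               if 1.0 + 1.0 / (W[best][kn] + alpha) + scores[kn] > scores[best])
    return rank_weight(rank) * hinge, best

# Toy example with 4 types; types 0 and 1 are the noisy candidates.
W = [[0.0] * 4 for _ in range(4)]
loss, best = noisy_loss([0.2, 1.0, 0.5, 0.0], {0, 1}, W, alpha=1.0)
print(loss, best)
# Lowering the score of the weaker candidate does not change the loss.
print(noisy_loss([-10.0, 1.0, 0.5, 0.0], {0, 1}, W, alpha=1.0))
```

The second call illustrates the contrast with multi-label learning: a candidate type that the model considers irrelevant is simply ignored instead of being pushed above the non-candidate types.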

Hierarchical Partial-Label Embedding
Our goal is to embed the heterogeneous graph $G$ into a $d$-dimensional vector space, following the hypotheses proposed in this section. Intuitively, one can collectively minimize the two kinds of loss functions, $\ell_c$ and $\ell_n$, across all the training mentions. To achieve this goal, we formulate a joint optimization problem as follows.
$$\min_{U, V} \; O = \sum_{m_i \in \mathcal{M}_c} \ell_c(m_i) + \sum_{m_i \in \mathcal{M}_n} \ell_n(m_i)$$
We use an alternating minimization algorithm based on block-wise coordinate descent (Tseng, 2001) to jointly optimize the objective $O$. One can also apply stochastic gradient descent to perform online updates.
Type Inference. With the learned mention embeddings $\{u_i\}$ and type embeddings $\{v_k\}$, we perform a top-down search in the given type hierarchy $\mathcal{Y}$ to estimate the correct type-path $\mathcal{Y}^*_i$. Starting from the tree's root, we recursively find the best type among the children types by measuring the dot product of the corresponding mention and type embeddings, i.e., $\mathrm{sim}(u_i, v_k)$. The search process stops when we reach a leaf type, or when the similarity score falls below a pre-defined threshold $\eta > 0$.
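The top-down type inference described above can be sketched in Python as follows. The `children` dictionary, the virtual `ROOT` node and the toy embeddings are hypothetical; the sketch assumes that the mention embedding has already been computed from its text features.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def infer_type_path(u, type_vectors, children, root, eta):
    """Greedy top-down search for the type-path of one mention.

    At each level the child type with the highest dot product is chosen.
    The search stops at a leaf or when the best score drops below eta.
    """
    path = []
    node = root
    while children.get(node):
        best = max(children[node], key=lambda t: dot(u, type_vectors[t]))
        if dot(u, type_vectors[best]) < eta:
            break
        path.append(best)
        node = best
    return path

# Toy hierarchy and 2-dimensional embeddings (hypothetical values).
children = {
    "ROOT": ["person", "organization"],
    "person": ["politician", "artist"],
}
type_vectors = {
    "person": [1.0, 0.0],
    "organization": [0.0, 1.0],
    "politician": [0.9, 0.0],
    "artist": [0.1, 0.5],
}
u = [1.0, 0.2]

print(infer_type_path(u, type_vectors, children, "ROOT", eta=0.5))
print(infer_type_path(u, type_vectors, children, "ROOT", eta=0.95))
```

A higher threshold yields shorter, more conservative type-paths, since the search stops as soon as no child type is similar enough to the mention.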

Data Preparation
Datasets. Our experiments use three public datasets (Wiki, OntoNotes and BBN); see Table 3.
Training Data. We followed the process in (Ling and Weld, 2012) to generate training data for the Wiki dataset. For the BBN and OntoNotes datasets, we used DBpedia Spotlight for entity linking. We discarded types which cannot be mapped to Freebase types in the BBN dataset (47 of 93). Table 2 lists the set of features used in our experiments, which are similar to those used in (Yogatama et al., 2015; Ling and Weld, 2012) except for topics and ReVerb patterns. We discarded the features which occur only once in the corpus.

Evaluation Settings
For the Wiki and OntoNotes datasets, we used the provided test sets. Since the BBN corpus is fully annotated, we followed an 80/20 ratio to partition it into training/test sets. We report Accuracy (Strict-F1), Micro-averaged F1 (Mi-F1) and Macro-averaged F1 (Ma-F1) scores commonly used in the fine-grained typing problem (Ling and Weld, 2012; Yogatama et al., 2015). Since we use the gold mention set for testing, the Accuracy (Acc) we report is the same as the Strict F1.
Baselines. We compared the proposed method (AFET) and its variants with state-of-the-art typing methods, embedding methods and partial-label learning methods, including FIGER (Ling and Weld, 2012), HYENA, ClusType, HNM, WSABIE, PTE, PL-SVM and CLPL.
We compare AFET and its variants: (1) AFET: complete model with KB-induced type correlation; (2) AFET-CoH: with hierarchy-induced correlation (i.e., shortest path distance); (3) AFET-NoCo: without type correlation (i.e., all margins are 1) in the objective $O$; and (4) AFET-NoPa: without the partial-label loss in the objective $O$.

Performance Comparison and Analyses
Table 4 shows the results of AFET and its variants.
Comparison with the other typing methods. AFET outperforms both the FIGER and HYENA systems, demonstrating the predictive power of the learned embeddings and the effectiveness of modeling type correlation information and noisy candidate types. We also observe that pruning methods do not always improve performance, since they aggressively filter out rare types in the corpus, which may lead to low recall. ClusType is not as good as FIGER and HYENA because it is intended for coarse types and only utilizes relation phrases.
Comparison with the other embedding methods. AFET performs better than all other embedding methods. HNM does not use any linguistic features. None of the other embedding methods considers the label noise issue; they all treat the candidate type sets as clean. Although AFET adopts the WARP loss in WSABIE, it uses an adaptive margin in the objective to capture the type correlation information.
Comparison with partial-label learning methods. Compared with PL-SVM and CLPL, AFET obtains superior performance. PL-SVM assumes that only one candidate type is correct and does not consider type correlation. CLPL simply averages the model output for all candidate types, and thus may generate results biased toward frequent false types. The superior performance of AFET mainly comes from modeling type correlation derived from the KB.
Comparison with its variants. AFET always outperforms its variants on all three datasets. It gains performance from capturing type correlation, as well as from handling type noise in the embedding process.

Case Analyses
Example output on news articles. Table 5 shows the types predicted by AFET, FIGER, PTE and WSABIE on two news sentences from the OntoNotes dataset: AFET predicts fine-grained types with better accuracy (e.g., person title) and avoids overly-specific predictions (e.g., news company). Figure 5 shows the types estimated by AFET, PTE and WSABIE on a training sentence from the OntoNotes dataset. We found that AFET could discover the best type from noisy candidate types.
Testing the effect of training set size and dimension. Experimenting with the same settings for model learning, Fig. 6(a) shows the performance trend on the Wiki dataset when varying the sampling ratio (subset of mentions randomly sampled from the training set $\mathcal{D}$). Fig. 6(b) analyzes the performance sensitivity of AFET with respect to $d$, the embedding dimension, on the BBN dataset. The accuracy of AFET improves as $d$ becomes large, but the gain decreases when $d$ is large enough.
Testing sensitivity of the tuning parameter. Fig. 7(b) analyzes the sensitivity of AFET with respect to $\alpha$ on the BBN dataset. Performance increases as $\alpha$ becomes large. When $\alpha$ is larger than 0.5, the performance becomes stable.
Testing at different type levels. Fig. 7(a) compares the methods at different levels of the type hierarchy (e.g., person and location on level-1, politician and artist on level-2, author and actor on level-3). The results show that it is more difficult to distinguish among more fine-grained types. AFET always outperforms the other two methods, and achieves a 22.36% improvement in Ma-F1 compared to FIGER on level-3 types. The gain mainly comes from explicitly modeling the noisy candidate types.
Testing for frequent/infrequent types. We also evaluate the performance on frequent and rare types (Table 6). Note that we use a different evaluation metric, introduced in (Yosef et al., 2012), to calculate the F1 score for a type. We find that AFET always performs better than the other baselines and that it works for both frequent and rare types.

Related Work
There has been considerable work on named entity recognition (NER) (Manning et al., 2014), which focuses on three types (e.g., person, location) and casts the problem as multi-class classification following the type mutual exclusion assumption (i.e., one type per mention) (Nadeau and Sekine, 2007). Recent work has focused on a much larger set of fine-grained types (Yosef et al., 2012; Ling and Weld, 2012). As the type mutual exclusion assumption no longer holds, these studies cast the problem as multi-label multi-class (hierarchical) classification (Yosef et al., 2012; Ling and Weld, 2012). Embedding techniques have also recently been applied to jointly learn feature and type representations (Yogatama et al., 2015; Dong et al., 2015). Del Corro et al. (2015) proposed an unsupervised method to generate context-aware candidate types, and subsequently select the most appropriate type. Previous work has also discussed the label noise issue in fine-grained typing and proposed three pruning heuristics. However, these heuristics aggressively delete training examples and may suffer from low recall (see Table 4).
In the context of distant supervision, the label noise issue has been studied for other information extraction tasks such as relation extraction (Takamatsu et al., 2012). In relation extraction, label noise is introduced by false positive textual matches of entity pairs. In entity typing, however, label noise comes from the assignment of types to entity mentions without considering their contexts. The forms of distant supervision are different in these two problems. Recently, Ren et al. (2016b) tackled the problem of label noise in fine-grained entity typing, but focused on how to generate a clean training set instead of doing entity typing.
Partial label learning (PLL) (Zhang, 2014; Nguyen and Caruana, 2008; Cour et al., 2011) deals with the problem where each training example is associated with a set of candidate labels, of which only one is correct. Unlike existing PLL methods, our method considers type hierarchy and correlation.

Conclusion and Future Work
In this paper, we study automatic fine-grained entity typing and propose a hierarchical partial-label embedding method, AFET, that models "clean" and "noisy" mentions separately and incorporates a given type hierarchy to induce loss functions. AFET builds on a joint optimization framework, learns embeddings for mentions and type-paths, and iteratively refines the model. Experiments on three public datasets show that AFET is effective, robust, and outperforms the compared methods.
As future work, it would be interesting to study topical features as context cues for entity mentions, to leverage multi-sense embeddings to represent linguistic features with multiple senses, and to exploit other effective modeling methods to inject type hierarchy information. The proposed objective function is general and could be extended to incorporate various language features, to conduct integrated modeling of multiple sources, and to handle distantly-supervised relation extraction.