Embedding Methods for Fine Grained Entity Type Classification

We propose a new approach to the task of fine grained entity type classification based on label embeddings that allows for information sharing among related labels. Specifically, we learn an embedding for each label and each feature such that labels which frequently co-occur are close in the embedded space. We show that it outperforms state-of-the-art methods on two fine grained entity-classification benchmarks and that the model can exploit the finer-grained labels to improve classification of standard coarse types.


Introduction
Entity type classification is the task of assigning type labels (e.g., person, location, organization) to mentions of entities in documents. These types are useful for deeper natural language analysis such as coreference resolution (Recasens et al., 2013), relation extraction (Yao et al., 2010), and downstream applications such as knowledge base construction (Carlson et al., 2010) and question answering (Lin et al., 2012).
Standard entity type classification tasks use a small set of coarse labels, typically fewer than 20 (Hirschman and Chinchor, 1997; Sang and Meulder, 2003; Doddington et al., 2004). Recent work has focused on a much larger set of fine grained labels (Ling and Weld, 2012; Yosef et al., 2012; Gillick et al., 2014). Fine grained labels are typically subtypes of the standard coarse labels (e.g., artist is a subtype of person and author is a subtype of artist), so the label space forms a tree-structured is-a hierarchy. See Figure 1 for the label sets used in our experiments. A mention labeled with type artist should also be labeled with all ancestors of artist. Since we allow mentions to have multiple labels, this is a multilabel classification task. Multiple labels typically correspond to a single path in the tree (from root to a leaf or internal node).
An important aspect of context-dependent fine grained entity type classification is that mentions of an entity can have different types depending on the context. Consider the following example: Madonna starred as Breathless Mahoney in the film Dick Tracy. In this context, the most appropriate label for the mention Madonna is actress, since the sentence talks about her role in a film. In the majority of other cases, Madonna is likely to be labeled as a musician.
The main difficulty in fine grained entity type classification is the lack of labeled training examples. Training data is typically generated automatically (e.g., by mapping Freebase labels of resolved entities) without taking context into account, so it is common for mentions to have noisy labels. In our example, the labels for the mention Madonna would include musician, actress, author, and potentially others, even though not all of these labels apply here. Ideally, a fine grained type classification system should be robust to such noisy training data, as well as capable of exploiting relationships between labels during learning. We describe a model that uses a ranking loss, which tends to be more robust to label noise, and that learns a joint representation of features and labels, which allows for information sharing among related labels. (A related idea, learning output representations for multiclass document classification and part-of-speech tagging, was considered by Srikumar and Manning (2014).) We show that our model outperforms state-of-the-art methods on two fine grained entity-classification benchmarks. We also evaluate our model on standard coarse type classification and find that training embedding models on all fine grained labels gives better results than training them on just the coarse labels.

Models
In this section, we describe our approach, which is based on the WSABIE model.

Notation
We use lower case letters to denote variables, bold lower case letters to denote vectors, and bold upper case letters to denote matrices. Let $\mathbf{x} \in \mathbb{R}^D$ be the feature vector for a mention, where $D$ is the number of features and $x_d$ is the value of the $d$-th feature. Let $\mathbf{y} \in \{0, 1\}^T$ be the corresponding binary label vector, where $T$ is the number of labels; $y_t = 1$ if and only if the mention is of type $t$. We use $\mathbf{y}^t$ to denote a one-hot binary vector of size $T$, where $y_t = 1$ and all other entries are zero.
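As a small illustration of this notation, the sketch below builds the binary label vector and a one-hot vector for a toy hierarchy. The four labels, the `PARENT` map, and the helper names are hypothetical and chosen only for the example; they are not the label sets used in the experiments.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level label).
PARENT = {
    "person": None,
    "artist": "person",
    "author": "artist",
    "location": None,
}
LABELS = sorted(PARENT)  # fixed label order, so T = len(LABELS)


def with_ancestors(label):
    """Return the label together with all of its ancestors, root first."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]


def label_vector(labels):
    """Build the binary vector y in {0, 1}^T, switching on every ancestor too."""
    active = {a for lab in labels for a in with_ancestors(lab)}
    return [1 if lab in active else 0 for lab in LABELS]


def one_hot(label):
    """Build the one-hot vector y^t for a single label t."""
    return [1 if lab == label else 0 for lab in LABELS]


# LABELS is ["artist", "author", "location", "person"].
y = label_vector(["artist"])   # artist and its ancestor person are set
y_t = one_hot("author")        # only the entry for author is set
```

A mention typed as artist therefore has two active entries (artist and person), which corresponds to a single root-to-node path in the tree.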
Model To leverage the relationships among the fine grained labels, we would like a model that can learn an embedding space for labels. Our model, based on WSABIE, learns to map both feature vectors and labels to a low dimensional space $\mathbb{R}^H$ (where $H$ is the embedding dimension) such that each instance is close to its label(s) in this space; see Figure 2 for an illustration. Relationships between labels are captured by their distances in the embedded space: co-occurring labels tend to be closer, whereas mutually exclusive labels are further apart.
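Formally, we are interested in learning the mapping functions:

$$f(\mathbf{x}) : \mathbb{R}^D \to \mathbb{R}^H, \qquad g(\mathbf{y}^t) : \{0, 1\}^T \to \mathbb{R}^H.$$

In this work, we parameterize them as linear functions $f(\mathbf{x}, \mathbf{A}) = \mathbf{A}\mathbf{x}$ and $g(\mathbf{y}^t, \mathbf{B}) = \mathbf{B}\mathbf{y}^t$, where $\mathbf{A} \in \mathbb{R}^{H \times D}$ and $\mathbf{B} \in \mathbb{R}^{H \times T}$ are parameters. The score of a label $t$ (represented as a one-hot label vector $\mathbf{y}^t$) and a feature vector $\mathbf{x}$ is the dot product between their embeddings:

$$s(\mathbf{x}, \mathbf{y}^t; \mathbf{A}, \mathbf{B}) = f(\mathbf{x}, \mathbf{A}) \cdot g(\mathbf{y}^t, \mathbf{B}) = \mathbf{A}\mathbf{x} \cdot \mathbf{B}\mathbf{y}^t.$$

For brevity, we denote this score by $s(\mathbf{x}, \mathbf{y}^t)$. Note that the total number of parameters is $(D + T) \times H$, which is typically less than the number of parameters in standard classification models that use regular conjunctions of input features with label classes (e.g., logistic regression) when $H < T$.

Figure 2: $\mathbf{x}$ is the feature vector extracted from a mention, and $\mathbf{y}^t$ is its label. Here, black cells indicate non-zero and white cells indicate zero values. The parameters are matrices $\mathbf{A}$ and $\mathbf{B}$, which are used to map the feature vector $\mathbf{x}$ and the label vector $\mathbf{y}^t$ into an embedding space.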
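The linear scoring function can be sketched in a few lines of plain Python. The matrices below are toy values chosen for the example, the helper names are hypothetical, and the parameter comparison assumes D = 1,000,000 features purely for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(A, B, x, t):
    """s(x, y^t) = (A x) . (B y^t); B y^t is simply column t of B."""
    feature_embedding = matvec(A, x)          # f(x, A), a vector in R^H
    label_embedding = [row[t] for row in B]   # g(y^t, B), a vector in R^H
    return dot(feature_embedding, label_embedding)


def embedding_params(D, T, H):
    """Parameter count of the embedding model: (D + T) * H."""
    return (D + T) * H


def conjunction_params(D, T):
    """Parameter count of a model with one weight per (feature, label) pair."""
    return D * T


# Toy example with H = 2, D = 3, T = 2 (illustrative values only).
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[1.0, 0.0],
     [0.0, 2.0]]
x = [1.0, 0.0, 1.0]

s0 = score(A, B, x, 0)  # A x = [3, -1], column 0 of B = [1, 0], so s0 = 3.0
s1 = score(A, B, x, 1)  # column 1 of B = [0, 2], so s1 = -2.0
```

With the assumed D = 1,000,000, T = 86, and H = 50, `embedding_params` gives 50,004,300 while `conjunction_params` gives 86,000,000, which illustrates the saving when H < T.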
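Learning Since we expect the training data to contain some extraneous labels, we use a ranking loss that encourages the model to place positive labels above negative labels without forcing the positive labels to compete with each other. Let $Y$ denote the set of positive labels for a mention, and let $\bar{Y}$ denote its complement. Intuitively, we try to rank labels in $Y$ higher than labels in $\bar{Y}$. Specifically, we use the weighted approximate pairwise (WARP) loss of the WSABIE model. For a mention $\{\mathbf{x}, \mathbf{y}\}$, the WARP loss is:

$$\sum_{t \in Y} \sum_{\bar{t} \in \bar{Y}} R(\mathrm{rank}(\mathbf{x}, \mathbf{y}^t)) \max\bigl(1 - s(\mathbf{x}, \mathbf{y}^t) + s(\mathbf{x}, \mathbf{y}^{\bar{t}}),\, 0\bigr),$$

where $\mathrm{rank}(\mathbf{x}, \mathbf{y}^t)$ is the margin-infused rank of label $t$:

$$\mathrm{rank}(\mathbf{x}, \mathbf{y}^t) = \sum_{\bar{t} \in \bar{Y}} I\bigl(1 + s(\mathbf{x}, \mathbf{y}^{\bar{t}}) > s(\mathbf{x}, \mathbf{y}^t)\bigr),$$

$I(\cdot)$ is the indicator function, and $R(\mathrm{rank}(\mathbf{x}, \mathbf{y}^t))$ is a function that transforms this rank into a weight. In this work, since each mention can have multiple positive labels, we choose to optimize precision at $k$ by setting $R(k) = \sum_{i=1}^{k} \frac{1}{i}$. Favoring precision over recall in fine grained entity type classification makes sense because if we are not certain about a particular fine grained label for a mention, we should use its ancestor label in the hierarchy.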
In order to learn the parameters with this WARP loss, we use stochastic (sub)gradient descent.
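To make the loss concrete, the sketch below evaluates the WARP loss for a single mention from a list of precomputed scores. It computes the margin-infused rank exactly over all negative labels, which is an assumption made for clarity (WSABIE-style training usually estimates the rank by sampling negatives), and the function names and toy scores are hypothetical.

```python
def rank_weight(k):
    """R(k) = sum_{i=1}^{k} 1/i, with R(0) = 0."""
    return sum(1.0 / i for i in range(1, k + 1))


def warp_loss(scores, positives):
    """WARP loss for one mention.

    scores:    list with scores[t] = s(x, y^t) for every label t
    positives: set of label indices Y; all other indices form the complement
    """
    negatives = [t for t in range(len(scores)) if t not in positives]
    total = 0.0
    for t in positives:
        # Margin-infused rank: negatives that score within a margin of 1 of t.
        rank = sum(1 for n in negatives if 1.0 + scores[n] > scores[t])
        weight = rank_weight(rank)
        for n in negatives:
            # Hinge term max(1 - s(x, y^t) + s(x, y^n), 0).
            total += weight * max(1.0 - scores[t] + scores[n], 0.0)
    return total


# Toy example: labels 0 and 1 are positive, labels 2 and 3 are negative.
scores = [2.0, 0.5, 1.5, 0.0]
loss = warp_loss(scores, {0, 1})
# Label 0: rank 1, weight 1.0, hinge terms 0.5 and 0.0 -> contributes 0.5
# Label 1: rank 2, weight 1.5, hinge terms 2.0 and 0.5 -> contributes 3.75
# Total loss: 4.25
```

A stochastic (sub)gradient step would then update only the embeddings involved in the violated pairs, weighted by the same rank-dependent factor.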
Inference During inference, we consider the top-$k$ predicted labels, where $k$ is the maximum depth of the label hierarchy, and greedily remove labels that are not consistent with other labels (i.e., not on the same path of the tree). For example, if the (ordered) top-$k$ labels are person, artist, and location, we output only person and artist as the predicted labels. We use a threshold $\delta$ such that $\hat{y}_t = 1$ if $s(\mathbf{x}, \mathbf{y}^t) > \delta$ and $\hat{y}_t = 0$ otherwise.
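The inference procedure can be sketched as follows. The toy hierarchy, the scores, and the function names are hypothetical, and applying the threshold inside the greedy pass is one possible reading of how the two steps combine, not necessarily the exact procedure used in the experiments.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level label).
PARENT = {
    "person": None,
    "artist": "person",
    "author": "artist",
    "location": None,
    "city": "location",
}


def ancestors(label):
    """Return the set of strict ancestors of a label."""
    found = set()
    label = PARENT[label]
    while label is not None:
        found.add(label)
        label = PARENT[label]
    return found


def same_path(a, b):
    """True if a and b lie on one root-to-node path of the tree."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def predict(scores, k, delta):
    """Keep the top-k labels that pass the threshold and are mutually consistent."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    kept = []
    for label in ranked:
        if scores[label] <= delta:
            continue  # below the threshold delta
        if all(same_path(label, other) for other in kept):
            kept.append(label)  # consistent with every higher-ranked kept label
    return kept


scores = {"person": 2.1, "artist": 1.4, "location": 0.9,
          "author": 0.2, "city": -0.5}
labels = predict(scores, k=3, delta=0.0)
# The top-3 labels are person, artist, location; location is not on the
# person/artist path, so labels == ["person", "artist"].
```

Raising the threshold trades recall for precision: with `delta=1.5` only person survives in this toy example.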

Kernel extension
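We extend the WSABIE model to include a weighting function between each feature and label. Recall that the WSABIE scoring function is:

$$s(\mathbf{x}, \mathbf{y}^t; \mathbf{A}, \mathbf{B}) = \mathbf{A}\mathbf{x} \cdot \mathbf{B}\mathbf{y}^t = \sum_{d} (\mathbf{A}_d x_d)^\top \mathbf{B}_t,$$

where $\mathbf{A}_d$ and $\mathbf{B}_t$ denote the column vectors of $\mathbf{A}$ and $\mathbf{B}$. We can weight each (feature, label) pair by a kernel function prior to computing the embedding:

$$s(\mathbf{x}, \mathbf{y}^t; \mathbf{A}, \mathbf{B}) = \sum_{d} K_{d,t} (\mathbf{A}_d x_d)^\top \mathbf{B}_t,$$

where $\mathbf{K} \in \mathbb{R}^{D \times T}$ is the kernel matrix. We use an $N$-nearest neighbor kernel and set $K_{d,t} = 1$ if $\mathbf{A}_d$ is one of the $N$ nearest neighbors of the label vector $\mathbf{B}_t$, and $K_{d,t} = 0$ otherwise. In all our experiments, we set $N = 200$.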
To incorporate the kernel weighting function, we only need to make minor modifications to the learning procedure. At every iteration, we first compute the similarity between each feature embedding and each label embedding. For each label t, we then set the kernel values for the N most similar features to 1, and the rest to 0 (update K). We can then follow the learning algorithm for the standard WSABIE model described above. At inference time, we fix K so this extension is only slightly slower than the standard model.
The nearest-neighbor kernel introduces nonlinearities to the embedding model. It implicitly plays the role of a label-dependent feature selector, learning which features can interact with which labels and turning off potentially noisy features that are not in the relevant label's neighborhood.
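The kernel update and the kernelized score can be sketched as below. The text does not specify the similarity measure, so the dot product between embedding columns is an assumption here, and the toy matrices, N = 2, and the function names are hypothetical.

```python
def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest_neighbor_kernel(A, B, N):
    """K[d][t] = 1 if feature d is among the N features most similar to label t."""
    D, T = len(A[0]), len(B[0])
    K = [[0] * T for _ in range(D)]
    for t in range(T):
        b_t = column(B, t)
        # Assumption: similarity is the dot product of the two embeddings.
        sims = [dot(column(A, d), b_t) for d in range(D)]
        nearest = sorted(range(D), key=lambda d: sims[d], reverse=True)[:N]
        for d in nearest:
            K[d][t] = 1
    return K


def kernel_score(A, B, K, x, t):
    """s(x, y^t) = sum_d K[d][t] * x_d * (A_d . B_t)."""
    b_t = column(B, t)
    return sum(K[d][t] * x[d] * dot(column(A, d), b_t) for d in range(len(x)))


# Toy example with H = 2, D = 3, T = 2 and N = 2 (illustrative values only).
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[1.0, 0.0],
     [0.0, 2.0]]
K = nearest_neighbor_kernel(A, B, N=2)   # [[1, 1], [0, 1], [1, 0]]
x = [1.0, 1.0, 1.0]
s0 = kernel_score(A, B, K, x, 0)         # 3.0, same as the linear score
s1 = kernel_score(A, B, K, x, 1)         # 2.0; the linear score would be 0.0
```

In this toy case, feature 2 is outside the neighborhood of label 1, so its negative contribution is switched off and the score for label 1 rises from 0.0 to 2.0.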

Experiments
Setup and Baselines We evaluate our methods on two publicly available datasets that are manually annotated with gold labels for fine grained entity type classification: GFT (Google Fine Types; Gillick et al., 2014) and FIGER (Ling and Weld, 2012). On the GFT dataset, we compare with state-of-the-art baselines from Gillick et al. (2014): flat logistic regression (FLAT), an extension of multiclass logistic regression for multilabel classification problems; and multiple independent binary logistic regression (BINARY), one per label t ∈ {1, 2, . . . , T }. On the FIGER dataset, we compare with a state-of-the-art baseline from Ling and Weld (2012).
We denote the standard embedding method by WSABIE and its kernel extension by K-WSABIE. We fix our embedding size to $H = 50$. We report micro-averaged precision, recall, and F1-score for each of the competing methods (this is called Loose Micro by Ling and Weld). When development data is available, we use it to tune $\delta$ by optimizing F1-score.
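For reference, micro-averaged precision, recall, and F1 over label sets can be computed as in the sketch below, which pools true positives, predicted labels, and gold labels across all mentions. The function name and the toy label sets are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1.

    gold, predicted: lists of label sets, one set per mention, in the same order.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    num_predicted = sum(len(p) for p in predicted)
    num_gold = sum(len(g) for g in gold)
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [{"person", "artist"}, {"location"}]
predicted = [{"person"}, {"location", "city"}]
p, r, f1 = micro_prf(gold, predicted)
# 2 correct labels out of 3 predicted and 3 gold: p = r = f1 = 2/3
```

Tuning the threshold on development data then amounts to evaluating this function for a grid of candidate values and keeping the one with the highest F1.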
Training data Because we have no manually annotated training data, we create training data using the technique described in Gillick et al. (2014). A set of 133,000 news documents is automatically annotated by a parser, a mention chunker, and an entity resolver that assigns Freebase types to entities, which we map to fine grained labels. This approach results in approximately 3 million training examples, which we use to train all the models evaluated below. The only difference between models trained for different tasks is the mapping from Freebase types. See Gillick et al. (2014) for details. Table 1 lists the features we use, which are the same set as used by Gillick et al. (2014) and very similar to those used by Ling and Weld. String features are randomly hashed to a value in 0 to 999,999, which simplifies feature extraction and adds some additional regularization (Ganchev and Dredze, 2008).
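Feature hashing of this kind can be sketched as follows. The text only says that string features are randomly hashed to 0 through 999,999; the use of MD5 as the hash function, the `name=value` feature strings, and the binary indicator values are all assumptions made for this example.

```python
import hashlib

NUM_BUCKETS = 1_000_000  # hashed ids range over 0..999,999


def hash_feature(name):
    """Map a string feature to a bucket id (MD5 is an assumed stand-in hash)."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def featurize(string_features):
    """Build a sparse feature vector {bucket id: value}; collisions add up."""
    x = {}
    for name in string_features:
        bucket = hash_feature(name)
        x[bucket] = x.get(bucket, 0.0) + 1.0
    return x


# Hypothetical feature strings in the spirit of Table 1.
features = ["head=Obama", "nonhead=Barack", "shape=Aa A. Aa", "role=subj"]
x = featurize(features)
```

Because the bucket id is computed directly from the string, no feature dictionary has to be built or stored, and occasional collisions act as a mild form of regularization.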

| Feature | Description | Example |
|---|---|---|
| Head | The syntactic head of the mention phrase | "Obama" |
| Non-head | Each non-head word in the mention phrase | "Barack", "H." |
| Cluster | Word cluster id for the head word | "59" |
| Characters | Each character trigram in the mention head | ":ob", "oba", "bam", "ama", "ma:" |
| Shape | The word shape of the words in the mention phrase | "Aa A. Aa" |
| Role | Dependency label on the mention head | "subj" |
| Context | Words before and after the mention phrase | "B:who", "A:first" |
| Parent | The head's lexical parent in the dependency tree | "picked" |
| Topic | The most likely topic label for the document | "politics" |

Table 1: The features we use, with an example value for each.

GFT evaluation There are $T = 86$ fine grained labels in the GFT dataset, as listed in Figure 1. The four top-level labels are person, location, organization, and other; the remaining labels are subtypes of these labels. The maximum depth of a label is 3. We split the dataset into a development set (for tuning hyperparameters) and a test set (see Table 2). The overall experimental results are shown in Table 3. Embedding methods performed well: both WSABIE and K-WSABIE outperformed the baselines by substantial margins in F1-score, though the advantage of the kernel version over the linear version is only marginally significant.
To visualize the learned embeddings, we project label embeddings down to two dimensions using PCA in Figure 3. Since there are only 4 top-level labels here, the fine grained labels are color-coded according to their top-level labels for readability. We can see that related labels are clustered together, and the four major clusters correspond to the top-level labels. We note that these first two components only capture 14% of the total variance of the full 50-dimensional space.

Table 3: Precision (P), Recall (R), and F1-score on the GFT test dataset for four competing models. The improvements for WSABIE and K-WSABIE over both baselines are statistically significant (p < 0.01).

FIGER evaluation Our second evaluation dataset is FIGER from Ling and Weld (2012). In this dataset, there are $T = 112$ labels organized in a two-level hierarchy; however, only 102 appear in our training data (see Figure 1, taken from their paper, for the complete set of labels). The training labels include 37 top-level labels (e.g., person, location, product, art) and 75 second-level labels (e.g., actor, city, engine). The FIGER dataset is much smaller than the GFT dataset (see Table 2).
Our experimental results are shown in Table 4. Again, K-WSABIE performed the best, followed by the standard WSABIE model. Both of these methods significantly outperformed Ling and Weld's best result.

Feature learning We investigate whether having a large fine grained label space is helpful in learning a good representation for feature vectors (recall that WSABIE learns representations for both feature vectors and labels). We focus on the task of coarse type classification, where we want to classify a mention into one of the four top-level GFT labels. We fix the training mentions and learn WSABIE embeddings for feature vectors and labels by (1) training only on coarse labels and (2) training on all labels; we evaluate the models only on coarse labels. Training with all labels gives an improvement of about 2 points (F1 score) over training with just coarse labels, as shown in Ta

Discussion
Design of fine grained label hierarchy Results at different levels of the hierarchies in Table 6 show that it is more difficult to discriminate among deeper labels. However, it appears that the depth-2 FIGER types are easier to discriminate than the depth-2 (and depth-3) GFT labels. This may simply be an artifact of the very small FIGER dataset, but it suggests that it may be worthwhile to flatten the other subtree in GFT, since many of its subtypes do not obviously share any information.

Table 6: The WSABIE model's Precision (P), Recall (R), and F1-score at each level of the label hierarchies for GFT (top) and FIGER (bottom).

Conclusion
We introduced embedding methods for fine grained entity type classification that outperform state-of-the-art methods on benchmark entity-classification datasets. We showed that these methods learned reasonable embeddings for fine-type labels, which allowed information sharing across related labels.