Fine-Grained Entity Type Classification by Jointly Learning Representations and Label Embeddings

Fine-grained entity type classification (FETC) is the task of classifying an entity mention into a broad set of types. The distant supervision paradigm is extensively used to generate training data for this task. However, the generated training data assigns the same set of labels to every mention of an entity without considering its local context. Existing FETC systems have two major drawbacks: they assume the training data to be noise free, and they rely on hand-crafted features. Our work overcomes both drawbacks. We propose a neural network model that jointly learns representations of entity mentions and their context, eliminating the use of hand-crafted features. Our model treats training data as noisy and uses a non-parametric variant of the hinge loss function. Experiments show that the proposed model outperforms previous state-of-the-art methods on two publicly available datasets, namely FIGER (GOLD) and BBN, with an average relative improvement of 2.69% in micro-F1 score. Knowledge learnt by our model on one dataset can be transferred to other datasets, either within the same model or to other FETC systems. These approaches of transferring knowledge further improve the performance of the respective models.


Introduction
Entity type classification is the task of assigning types or labels, such as organization or location, to entity mentions in a document. This classification is useful for many natural language processing (NLP) tasks such as relation extraction (Mintz et al., 2009), machine translation (Koehn et al., 2007), question answering (Lin et al., 2012) and knowledge base construction (Dong et al., 2014).
There has been a considerable amount of work on Named Entity Recognition (NER) (Collins and Singer, 1999; Tjong Kim Sang and De Meulder, 2003; Ratinov and Roth, 2009), which classifies entity mentions into a small set of mutually exclusive types, such as Person, Location, Organization and Misc. However, these types are not enough for some NLP applications such as relation extraction, knowledge base construction (KBC) and question answering. In relation extraction and KBC, knowing fine-grained types for entities can significantly increase the performance of the relation extractor (Ling and Weld, 2012; Koch et al., 2014; Mitchell et al., 2015), since this helps in filtering out candidate relation types that do not follow the type constraints. Fine-grained entity types also provide additional information while matching questions to their potential answers and significantly improve performance. For example, Li and Roth (2002) rank questions based on their expected answer types (will the answer be food, vehicle or disease).
Typically, FETC systems use over a hundred labels, arranged in a hierarchical structure. An important aspect of FETC is that, based on local context, two different mentions of the same entity can have different labels. We illustrate this through an example in Figure 1. All three sentences S1, S2, and S3 mention the same entity, Barack Obama. However, looking at the context, we can infer that S1 mentions Obama as a person/author, S2 mentions Obama only as a person, and S3 mentions Obama as a person/politician.
Available training data for FETC has noisy labels. Creating manually annotated training data for FETC is time consuming, expensive, and error prone. Note that a human annotator will have to assign a subset of correct labels from a set of around a hundred labels for each entity mention in the corpus. Existing FETC systems use the distant supervision paradigm (Craven and Kumlien, 1999) to automatically generate training data. Distant supervision maps each entity in the corpus to knowledge bases such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and YAGO (Suchanek et al., 2007). This method assigns the same set of labels to all mentions of an entity across the corpus. For example, Barack Obama is a person, politician, lawyer, and author. If a knowledge base has these four matching labels for Barack Obama, then distant supervision assigns all of them to every mention of Barack Obama. Training data generated with distant supervision will therefore fail to distinguish between the mentions of Barack Obama in sentences S1, S2, and S3.
Figure 1: Noise introduced via the distant supervision process. S1-S3 indicate sentences where only a subset of the labels for the entity mention (bold typeface) is relevant given the context, highlighted in T1-T3.
Existing FETC systems have one or both of the following drawbacks:
1. Assuming training data to be noise free (Ling and Weld, 2012; Yosef et al., 2012; Yogatama et al., 2015; Shimaoka et al., 2016).
2. Use of hand-crafted features (Ling and Weld, 2012; Yosef et al., 2012; Yogatama et al., 2015; Ren et al., 2016).
We have observed that for real-world datasets, more than twenty-five percent of the training data has noisy labels. The first drawback propagates this noise in the training data to the FETC model. To extract hand-crafted features, various NLP tools are used. Since errors inevitably exist in such tools, the second drawback propagates the errors of these tools to the FETC model.
We propose a neural network based model to overcome the two drawbacks of existing FETC systems. First, we separate the training data into clean and noisy partitions using the same method as in the AFET system (Ren et al., 2016). For these partitions, we use a simple yet effective non-parametric variant of the hinge loss function while training. To avoid the use of hand-crafted features, we learn representations for a given entity mention and its context. Additionally, we investigate the effectiveness of transfer learning (Pratt, 1993) for the FETC task at both the feature and the model level. We show that feature level transfer learning can be used to improve the performance of other FETC systems such as AFET by up to 4.5% in micro-F1 score. Similarly, model level transfer learning can be used to improve the performance of the same model on a different dataset by up to 3.8% in micro-F1 score.
Our contributions can be summarized as follows:
1. We propose a simple neural network model that learns representations for an entity mention and its context, and incorporates noisy label information using a non-parametric variant of the hinge loss function. Experimental results on two publicly available datasets demonstrate the effectiveness of the proposed model, with an average relative improvement of 2.69% in micro-F1 score.
2. We investigate the use of feature level and model level transfer learning strategies for the FETC task. The proposed transfer learning strategies further improve the state of the art on the BBN dataset by 3.8% in micro-F1 score.

Related Work
Ling and Weld (2012) proposed the first system for the FETC task, which used 112 overlapping labels. They used a linear perceptron classifier for multi-label classification. Yosef et al. (2012) used multiple binary SVM classifiers in a hierarchy to classify an entity mention into a set of 505 types. While the initial work assumed that all labels present in a training dataset for an entity mention are correct, Gillick et al. (2014) introduced context dependent FETC and proposed a set of heuristics for pruning labels that might not be relevant given the entity mention's local context. Yogatama et al. (2015) proposed an embedding based model where user-defined features and labels were embedded into a low dimensional feature space to facilitate information sharing among labels. Shimaoka et al. (2016) proposed an attentive neural network model that used LSTMs to encode the entity mention's context and used an attention mechanism to allow the model to focus on relevant expressions in the entity mention's context. However, the model assumed that all labels obtained via distant supervision are correct. In contrast, our model does not assume that all labels are correct. To learn the entity representation, we propose a scheme which is simpler yet more effective.
Figure 2: (a) α models label-label correlation; the higher the α, the lower the margin between non-correlated labels. (b) During inference, labels above this threshold are predicted as positive.
Most recently, Ren et al. (2016) proposed AFET, an FETC system. AFET separates the loss function for clean and noisy entity mentions. AFET uses label-label correlation information obtained from the given data in its parametric loss function (model parameter α). During inference, AFET uses a threshold to separate positive types from negative types (similarity threshold parameter d). However, AFET's loss function is sensitive to changes in these parameters, which are data dependent. Figure 2 shows the effect of the parameters α and d on AFET's performance evaluated on different datasets. In contrast, our model uses a simple yet effective variant of the hinge loss function, which does not require tuning a similarity threshold.
Transfer learning has been applied successfully to many NLP applications, such as cross-domain document classification (Shi et al., 2010), multi-lingual word clustering (Täckström et al., 2012) and sentiment classification (Mou et al., 2016). Initialization of word vectors with pre-trained word vectors in neural network models can be considered one of the best examples of transfer learning in NLP. Wang et al. (2015) provide a broad overview of transfer learning techniques used for language processing.

Problem description
Our task is to automatically classify type information of entity mentions present in natural language sentences. Figure 3 shows a general overview of our proposed approach.
Input: The input to the model is a training and a testing corpus, each consisting of a set of sentences in which entity mentions have been identified. In the training corpus, every entity mention has corresponding labels according to a given hierarchy. Formally, a training corpus $D_{train}$ consists of a set of sentences, $S = \{s^i\}_{i=1}^{N}$. Each sentence $s^i$ has one or more entity mentions denoted by $m^i_{j,k}$, where $j$ and $k$ denote the indices of the start and end tokens, respectively. The set $M$ consists of all the entity mentions $m^i_{j,k}$. For every entity mention $m^i_{j,k}$, there is a corresponding label vector $l^i_{j,k} \in \{0, 1\}^K$, a binary vector in which $l^i_{j,k_t} = 1$ if the $t$-th type is true and zero otherwise. $K$ denotes the total number of labels in a given hierarchy $\Psi$. The testing corpus $D_{test}$ contains only sentences and entity mentions.
Output: For the entity mentions in the testing corpus $D_{test}$, predict their corresponding labels.

Training set partition
Similar to AFET, we partition the mention set $M$ of the training corpus $D_{train}$ into two parts: a set $M_c$, consisting only of clean entity mentions, and a set $M_n$, consisting only of noisy entity mentions. An entity mention $m^i_{j,k}$ is said to be clean if its labels $l^i_{j,k}$ belong to only a single path (not necessarily ending at a leaf) in the hierarchy $\Psi$, that is, its labels are not ambiguous; otherwise, it is noisy. For example, as per the hierarchy given in Figure 1, an entity mention with labels person, artist and politician will be considered noisy, whereas an entity mention with labels person, artist and actor will be considered clean.
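The clean/noisy test described above can be sketched in a few lines. This is a minimal illustration, not the AFET or the authors' implementation: the `PARENT` map is a hypothetical toy hierarchy built from the example labels in the text, and the function names are assumptions introduced here.

```python
from typing import Dict, Optional, Set

# Hypothetical toy hierarchy (child -> parent); None marks a top-level type.
PARENT: Dict[str, Optional[str]] = {
    "person": None,
    "artist": "person",
    "actor": "artist",
    "politician": "person",
}


def path_to_root(label: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the label together with all of its ancestors."""
    path: Set[str] = set()
    node: Optional[str] = label
    while node is not None:
        path.add(node)
        node = parent[node]
    return path


def is_clean(labels: Set[str], parent: Dict[str, Optional[str]] = PARENT) -> bool:
    """True if every label lies on a single path of the hierarchy."""
    if not labels:
        raise ValueError("a mention needs at least one label")
    # The deepest label determines the only path that could contain all labels.
    deepest = max(labels, key=lambda lab: len(path_to_root(lab, parent)))
    return labels <= path_to_root(deepest, parent)
```

With this toy hierarchy, `is_clean({"person", "artist", "actor"})` holds, while `{"person", "artist", "politician"}` is flagged as noisy because artist and politician sit on different branches.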

Feature representations
Mention representation: This representation captures information about the entity mention's morphology and orthography. We decompose an entity mention into a character sequence, and use a vanilla LSTM encoder (Hochreiter and Schmidhuber, 1997) to encode the character sequence into a fixed dimensional vector. Formally, for entity mention $m^i_{j,k}$, we decompose it into a sequence of character tokens $c^i_1, c^i_2, \ldots, c^i_{|m^i_{j,k}|}$, where $|m^i_{j,k}|$ denotes the total number of characters present in the entity mention. For an entity mention containing multiple tokens, we join these tokens with a space in between. Every character has a corresponding vector representation in a lookup table for characters. The character sequence is then fed one character at a time to an LSTM encoder, and the final output is used as the feature representation for entity mention $m^i_{j,k}$. We denote this process by a function $F_m : M \to \mathbb{R}^{D_m}$, where $D_m$ is the number of dimensions of the mention representation. The whole process is illustrated in Figure 4 (Mention representation).
Context representation: This representation captures information about the context surrounding the entity mention. The context representation is further divided into two parts, the left and the right context representation. The left context consists of the sequence of tokens from the start of the sentence up to the last token of the entity mention. The right context consists of the sequence of tokens from the start of the entity mention up to the end of the sentence. We use bi-directional LSTM encoders (Graves et al., 2013) to encode the token level sequences of both contexts into fixed dimensional vectors. Formally, for an entity mention $m^i_{j,k}$ present in a sentence $s^i$, we decompose $s^i$ into a sequence of tokens $s^i_1, s^i_2, \ldots, s^i_k$ for the left context, and $s^i_j, s^i_{j+1}, \ldots, s^i_{|s^i|}$ for the right context, where $|s^i|$ denotes the number of tokens in the sentence. Every token has a corresponding vector representation in a lookup table for tokens. The token sequence is then fed one token at a time to a bi-directional LSTM encoder, and the final output is used as the feature representation. We denote this whole process by the function $F_{lc} : (M, S) \to \mathbb{R}^{D_{lc}}$ for computing the left context and $F_{rc} : (M, S) \to \mathbb{R}^{D_{rc}}$ for computing the right context. $D_{lc}$ and $D_{rc}$ are the number of dimensions of the left context and the right context representation, respectively. The whole process is illustrated in Figure 4 (Left and right context representation).
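The way the encoder inputs are carved out of a sentence can be illustrated without any neural network code. The sketch below only prepares the three input sequences (mention characters, left context, right context); the function name and the 0-based, inclusive span convention are assumptions made here, whereas the text uses 1-based token indices.

```python
from typing import List, Tuple


def encoder_inputs(tokens: List[str], j: int, k: int) -> Tuple[List[str], List[str], List[str]]:
    """Build the inputs of the three encoders for a mention spanning tokens j..k.

    Indices are 0-based and inclusive (an assumption of this sketch).
    Returns (mention characters, left context tokens, right context tokens).
    """
    if not 0 <= j <= k < len(tokens):
        raise ValueError("invalid mention span")
    # Multi-token mentions are joined with a single space before splitting
    # into characters.
    mention_chars = list(" ".join(tokens[j : k + 1]))
    # Left context: sentence start through the LAST mention token.
    left_context = tokens[: k + 1]
    # Right context: FIRST mention token through the sentence end.
    right_context = tokens[j:]
    return mention_chars, left_context, right_context


sentence = ["The", "book", "by", "Barack", "Obama", "sold", "well"]
chars, left, right = encoder_inputs(sentence, 3, 4)
```

Note that both contexts contain the mention tokens themselves, so each context encoder sees the mention at the end (left) or the start (right) of its input.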
The context representation described above is slightly different from what was proposed in (Shimaoka et al., 2016): here we include the entity mention tokens within both the left and the right context, to explicitly encode context relative to an entity mention.
In the end, we concatenate the entity mention and its context representations into a single $D_f$ dimensional vector, where $D_f = D_m + D_{lc} + D_{rc}$. This complete process is denoted by a function $F : (M, S) \to \mathbb{R}^{D_f}$ given by:
$$F(m^i_{j,k}, s^i) = F_m(m^i_{j,k}) \oplus F_{lc}(m^i_{j,k}, s^i) \oplus F_{rc}(m^i_{j,k}, s^i) \quad (1)$$
where $\oplus$ denotes vector concatenation. For brevity, we will now omit the subscript $j, k$ from $m^i_{j,k}$ and $l^i_{j,k}$, and will use $f^i$ to denote the feature representation for an entity mention and its context obtained via equation 1.

Feature and label embeddings
Similar to Yogatama et al. (2015) and Ren et al. (2016), we embed feature representations and labels in the same dimensional space such that an object is embedded closer to the objects that share similar types than to the objects that do not. Formally, we are trying to learn linear mapping functions $\phi_M : \mathbb{R}^{D_f} \to \mathbb{R}^{D_e}$ and $\phi_L : \mathbb{R}^{K} \to \mathbb{R}^{D_e}$, where $D_e$ is the size of the embedding space. These mappings are given by:
$$\phi_M(f^i) = {f^i}^{\top} U, \qquad \phi_L(l^i_t) = {l^i_t}^{\top} V \quad (2)$$
where $U \in \mathbb{R}^{D_f \times D_e}$ and $V \in \mathbb{R}^{K \times D_e}$ are projection matrices for feature representations and type labels respectively, and $l^i_t$ is the one-hot vector representation for label $t$. We assign a score to each label type $t$ and feature vector as the dot product of their embeddings. Formally, we denote the score as:
$$s(f^i, l^i_t) = \phi_M(f^i) \cdot \phi_L(l^i_t) \quad (3)$$
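The scoring step can be made concrete with a small dependency-free sketch. It assumes that the feature vector is projected by a matrix `U` of shape D_f x D_e and that row `t` of a matrix `V` of shape K x D_e is the embedding of label `t` (which is what multiplying a one-hot label vector by `V` selects). The helper names and the toy numbers are illustrative assumptions, not values from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: x (length n) times w (n x d) gives length d."""
    if len(x) != len(w):
        raise ValueError("dimension mismatch")
    return [sum(x[r] * w[r][c] for r in range(len(w))) for c in range(len(w[0]))]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def label_scores(f: Vector, u: Matrix, v: Matrix) -> List[float]:
    """Score each label t as the dot product of the embedded feature and row t of v."""
    f_emb = project(f, u)
    return [dot(f_emb, v_t) for v_t in v]


# Toy example with D_f = 2, D_e = 2 and K = 3 labels.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [-1.0, 0.0], [0.0, -2.0]]
scores = label_scores([1.0, 2.0], U, V)
```

In a trained model `U`, `V` and the encoders producing `f` would be learned jointly; here they are fixed only to show the arithmetic.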

Optimization
We use two different loss functions to model clean and noisy entity mentions. For clean entity mentions, we use a hinge loss function. The intuition is simple: maintain a margin, centered at zero, between positive and negative type scores. The scores $s(f^i, l^i_t)$ are computed as the similarity between an entity mention and a label type (eq. 3). The hinge loss function has two advantages. First, it intuitively separates positive and negative labels during inference. Second, it is independent of any data dependent parameter. Formally, for a given entity mention $m^i$ and its label $l^i$, we compute the associated loss as:
$$L_c(m^i, l^i) = \sum_{t \in \gamma} \max\left(0, 1 - s(f^i, l^i_t)\right) + \sum_{t \in \bar{\gamma}} \max\left(0, 1 + s(f^i, l^i_t)\right) \quad (4)$$
where $\gamma$ and $\bar{\gamma}$ are the sets of indices that have positive and negative labels, respectively.
For noisy entity mentions, we propose a variant of the hinge loss where, as in $L_c$, the score for all negative labels should go below $-1$. However, for positive labels, as we do not know which labels are relevant to the entity mention's local context, we propose that the maximum score from the set of given positive labels should be greater than one. This maintains a margin between all negative types and the most relevant positive type. Formally, the noisy label loss $L_n$ is defined as:
$$L_n(m^i, l^i) = \max\left(0, 1 - s(f^i, l^i_{t^*})\right) + \sum_{t \in \bar{\gamma}} \max\left(0, 1 + s(f^i, l^i_t)\right), \qquad t^* = \arg\max_{t \in \gamma} s(f^i, l^i_t) \quad (5)$$
Again, using this loss function makes it intuitive to set a threshold of zero during inference. These loss functions differ from the loss functions used in (Yogatama et al., 2015; Ren et al., 2016) in that we impose a strict absolute criterion to distinguish between positive and negative labels, whereas in (Yogatama et al., 2015; Ren et al., 2016) positive labels are only required to have a higher score than negative labels. As their scoring is relative, the final result varies with the threshold used to separate positive and negative labels.
To train on the partitioned dataset jointly, we formulate the joint objective as:
$$\min_{\theta} O = \sum_{m \in M_c} L_c(m, l) + \sum_{m \in M_n} L_n(m, l) \quad (6)$$
where $\theta$ is the collection of all model parameters that need to be learned. To jointly optimize the objective $O$, we use Adam (Kingma and Ba, 2014), a stochastic gradient-based optimization algorithm.
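The two losses and the joint objective described above can be written out directly from their verbal definitions. The following is a plain-Python sketch that evaluates the loss values for given scores; it does not compute gradients, and the function names and the `(scores, positive indices)` input format are assumptions of this example rather than the authors' code.

```python
from typing import List, Sequence, Set, Tuple


def clean_loss(scores: Sequence[float], positive: Set[int]) -> float:
    """Hinge loss for a clean mention: every positive score should exceed +1
    and every negative score should fall below -1."""
    loss = 0.0
    for t, s in enumerate(scores):
        if t in positive:
            loss += max(0.0, 1.0 - s)
        else:
            loss += max(0.0, 1.0 + s)
    return loss


def noisy_loss(scores: Sequence[float], positive: Set[int]) -> float:
    """Hinge variant for a noisy mention: only the best-scoring candidate
    positive label must exceed +1; negatives are treated as in clean_loss."""
    if not positive:
        raise ValueError("a mention needs at least one candidate label")
    best_positive = max(scores[t] for t in positive)
    loss = max(0.0, 1.0 - best_positive)
    loss += sum(max(0.0, 1.0 + s) for t, s in enumerate(scores) if t not in positive)
    return loss


Example = Tuple[Sequence[float], Set[int]]


def joint_objective(clean: List[Example], noisy: List[Example]) -> float:
    """Sum of the clean losses over M_c and the noisy losses over M_n."""
    return sum(clean_loss(s, p) for s, p in clean) + sum(noisy_loss(s, p) for s, p in noisy)
```

For the scores `[2.0, 0.5, -3.0, -0.5]` with candidate positives `{0, 1}`, the clean loss penalizes the second positive for scoring below one, while the noisy loss does not, since the best positive already clears the margin.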

Inference
For every entity mention in the set $M$ from $D_{test}$, we perform a top-down search in the given type hierarchy $\Psi$ and estimate the correct type path $\Psi^*$. Starting from the tree root, we recursively compute the best type among the current node's children by computing their scores with the obtained feature representation, and select the child with the maximum score. We continue this process until a leaf node is encountered or the score associated with a node falls below an absolute threshold of zero. The threshold is fixed across all datasets used.
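The top-down search can be sketched as a short loop over the hierarchy. This is an illustrative sketch only: the `children` map, the `"ROOT"` sentinel and the precomputed `scores` dictionary are assumptions made here, and a score of exactly zero is treated as not falling below the threshold.

```python
from typing import Dict, List


def predict_type_path(
    scores: Dict[str, float],
    children: Dict[str, List[str]],
    root: str = "ROOT",
    threshold: float = 0.0,
) -> List[str]:
    """Greedy top-down search: at each level keep the best-scoring child and
    stop at a leaf or when that child's score falls below the threshold."""
    path: List[str] = []
    node = root
    while children.get(node):
        best = max(children[node], key=lambda child: scores[child])
        if scores[best] < threshold:
            break
        path.append(best)
        node = best
    return path


# Hypothetical hierarchy and label scores for one mention.
CHILDREN = {
    "ROOT": ["person", "location"],
    "person": ["artist", "politician"],
    "artist": ["actor"],
}
SCORES = {"person": 1.2, "location": -0.4, "artist": -0.3, "politician": 0.7, "actor": 2.0}
predicted = predict_type_path(SCORES, CHILDREN)
```

The high score of actor is never consulted in this example because its parent artist loses to politician one level up, which is the expected behavior of a greedy path search.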

Transfer learning
We want to investigate whether the feature representations learnt for an entity mention are useful beyond the proposed model. In particular, we study what contribution these feature representations make to an existing feature engineering based method such as AFET. We train the proposed model on one training dataset, namely the Wiki dataset, which has the highest number of entity mentions among the datasets, and use this model to generate the representations $F(m^i_{j,k}, s^i)$ for the training and testing data of another dataset. These representations, which are $D_f$ dimensional vectors, are used as features for an existing state-of-the-art model, AFET, in place of the hand-crafted features that were originally used. The AFET model is then trained using these feature representations. We call this feature level transfer learning. We also evaluate model level transfer learning, where we initialize the weights of the LSTM encoders for a new dataset with the weights learnt by the model trained on another dataset, namely the Wiki dataset.
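Model level transfer amounts to a selective weight initialization, which the following sketch illustrates with plain dictionaries standing in for model parameters. The parameter names and prefixes are hypothetical, and the sketch assumes that only the LSTM encoder weights are copied while all other parameters (for example the label projection, whose size depends on the target label set) keep their own initialization; the text does not spell out this detail.

```python
import copy
from typing import Any, Dict, Tuple

# Hypothetical parameter-name prefixes for the three LSTM encoders.
ENCODER_PREFIXES: Tuple[str, ...] = ("char_lstm/", "left_lstm/", "right_lstm/")


def init_from_source(
    target_params: Dict[str, Any],
    source_params: Dict[str, Any],
    prefixes: Tuple[str, ...] = ENCODER_PREFIXES,
) -> Dict[str, Any]:
    """Return target parameters with encoder weights replaced by copies of the
    source model's encoder weights; everything else is left untouched."""
    initialized = dict(target_params)
    for name, value in source_params.items():
        if name.startswith(prefixes) and name in initialized:
            initialized[name] = copy.deepcopy(value)
    return initialized


source = {"char_lstm/W": [1.0, 2.0], "left_lstm/W": [3.0], "V": [[9.0]]}
target = {"char_lstm/W": [0.0, 0.0], "left_lstm/W": [0.0], "right_lstm/W": [0.5], "V": [[0.1], [0.2]]}
warm_start = init_from_source(target, source)
```

Training on the new dataset would then start from `warm_start` instead of a random initialization.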

Datasets used
We evaluate the proposed model on three publicly available datasets, provided in a pre-processed tokenized format by Ren et al. (2016). Statistics of the datasets used in this work are shown in Table 1. The details of the datasets are as follows:
Wiki/FIGER(GOLD): The training data consists of Wikipedia sentences and was automatically generated in the distant supervision paradigm, by mapping hyperlinks in Wikipedia articles to Freebase. The test data, mainly consisting of sentences from news reports, was manually annotated as described in (Ling and Weld, 2012).
OntoNotes: The OntoNotes dataset consists of sentences from newswire documents present in the OntoNotes text corpus (Weischedel et al., 2013). DBpedia Spotlight (Daiber et al., 2013) was used to automatically link entity mentions in sentences to Freebase. For this corpus, manually annotated test data was shared by Gillick et al. (2014).
BBN: The BBN dataset consists of sentences from Wall Street Journal articles and is completely manually annotated (Weischedel and Brunstein, 2005).
Please refer to (Ren et al., 2016) for more details.

Experimental setup
We use Accuracy (Strict-F1) score, Macro-averaged F1 score, and Micro-averaged F1 score as metrics for evaluation. Existing methods for FETC use the same measures (Ling and Weld, 2012; Yogatama et al., 2015; Shimaoka et al., 2016; Ren et al., 2016). We removed entity mentions that do not have any label from the training as well as the test sets. We also removed entity mentions that have spurious indices (i.e., an entity mention length of 0). For all three datasets, we randomly sampled 10% of the test set and used it as a development set, on which we tuned model parameters. The remaining 90% is used for final evaluation. For all our experiments, we train each model five times using the same hyperparameters and report the performance in terms of micro-F1 score on the development set, as shown in Figure 5. On the Wiki dataset, we observed a large variance in performance compared to the other two datasets. This might be because the Wiki dataset has a very small development set. From these five runs, we pick the best performing model based on the development set and report its result on the test set.
Hyperparameter setting: All the neural network based models in this paper used 300 dimensional pre-trained word embeddings distributed by Pennington et al. (2014). The hidden-layer size of the word level bi-directional LSTM was 100, and that of the character level LSTM was 200. Vectors for character embeddings were randomly initialized and were of size 200. We use dropout with a probability of 0.5 on the output of the LSTM encoders. The embedding dimension used was 500. We use Adam (Kingma and Ba, 2014) as the optimization method, with a learning rate of 0.0005-0.001 and a mini-batch size in the range of 800 to 1500. The proposed model and some of the baselines were implemented using the TensorFlow framework.
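For readers unfamiliar with the three metrics, the sketch below computes them over per-mention label sets, following the definitions commonly attributed to Ling and Weld (2012). It is a simplified illustration under stated assumptions: strict accuracy is taken as the fraction of mentions whose predicted set matches the gold set exactly, and macro precision and recall are averaged over mentions with non-empty predictions and non-empty gold sets, respectively. The function names are introduced here for illustration.

```python
from typing import List, Set


def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def strict_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of mentions whose predicted label set equals the gold set."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """F1 of per-mention precision and recall averaged over mentions."""
    with_pred = [(g, p) for g, p in zip(gold, pred) if p]
    with_gold = [(g, p) for g, p in zip(gold, pred) if g]
    precision = sum(len(g & p) / len(p) for g, p in with_pred) / len(with_pred) if with_pred else 0.0
    recall = sum(len(g & p) / len(g) for g, p in with_gold) / len(with_gold) if with_gold else 0.0
    return f1(precision, recall)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """F1 of precision and recall pooled over all (mention, label) pairs."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return f1(precision, recall)


gold_labels = [{"person", "artist"}, {"location"}]
pred_labels = [{"person"}, {"location"}]
```

In this toy case the first mention is only partially correct, so it counts as a miss for strict accuracy but still earns partial credit under the macro and micro scores.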

Transfer learning
In feature level transfer learning, we use the best performing proposed model trained on the Wiki dataset to generate representations, that is, a $D_f$ dimensional vector for every entity mention present in the train, development, and test sets of the BBN and OntoNotes datasets. Figure 4 illustrates an example of the encoding process. We then use these representations as feature vectors in place of the user-defined features and train the AFET model. Its hyper-parameters were tuned on the development set. These results are shown in Table 2 as feature level transfer learning.
In model level transfer learning, we use the learnt weights of the LSTM encoders from the best performing proposed model trained on the Wiki dataset and initialize the LSTM encoders of the same model with these weights while training on the BBN and OntoNotes datasets. These results are shown in Table 2 as model level transfer learning.
Table 2 shows the results of the proposed method, its variants and the baseline methods.
Comparison with other feature learning methods: The proposed model and its variants (our-AllC, our-NoM) consistently perform better than the existing feature learning method by Shimaoka et al. (2016) (Attentive) on all datasets. This indicates the benefits of the proposed representation scheme and of the joint learning of representations and label embeddings.
Comparison with feature engineering methods: The proposed model performs better than the existing feature engineered methods (FIGER, HYENA, AFET-NoCo, AFET-CoH) consistently across all datasets on the Micro-F1 and Macro-F1 evaluation metrics. These methods do not model label-label correlation based on data. In comparison with AFET, the proposed model outperforms AFET on the Wiki and BBN datasets in terms of the Micro-F1 evaluation metric. This indicates the benefits of feature learning as well as of data-driven label-label correlation. We do a type-wise perfor-

Case analysis: OntoNotes dataset
We observed three things: (i) all models perform relatively poorly on the OntoNotes dataset compared to their performance on the other two datasets; (ii) the proposed model outperforms other models, including AFET, on the other two datasets, but gave worse performance on the OntoNotes dataset; (iii) the two variants of transfer learning significantly improve the performance of the proposed model on the BBN dataset but resulted in only a subtle performance change on the OntoNotes dataset. Statistics of the datasets (Table 1) indicate that pronominal or other kinds of mentions are relatively more frequent in OntoNotes (6.78% in the test set) than in the other two datasets (0% in the test set). Examples of such mentions are 100 people, It, the director, etc. Table 3 shows 20 randomly sampled entity mentions from the test set of the OntoNotes dataset. Some of these mentions are very generic and likely to be dependent on previous sentences. As all the methods use features solely based on the current sentence, they fail to transfer cross-sentence boundary knowledge. Removing pronominal mentions from the test set increases the performance of all feature learning methods by around 3%.
Next we analyse where the proposed model fails compared to AFET. For this, we look at type-wise performance for the top-10 most frequent types in the OntoNotes test dataset. Results are shown in Table 4. Compared to AFET, the proposed model performs better on all of the top-10 frequent types except other. The other type is dominant in the test set (42.6% of entity mentions are of type other) and is a collection of multiple broad subtypes such as product, event, art, living thing, and food. The performance of AFET drops significantly (AFET-NoCo) when data-driven label-label correlation is ignored, which indicates that modeling data-driven correlation helps. However, as shown in Figure 2a, the use of label-label correlation depends on appropriate values of parameters, which vary from one dataset to another.
Table 2 notes: * These results are from (Ren et al., 2016), which also uses 10% of the test set as development set and the remainder for evaluation. ‡ We used the publicly available code distributed by Ren et al. (2016). † All of these results are on the exact same train, development and test sets.

Conclusion and Future Work
In this paper, we propose a neural network based model for the task of fine-grained entity type classification. The proposed model learns representations for an entity mention and its context, and incorporates label noise information in a non-parametric variant of the hinge loss function. Experiments show that the proposed model outperforms existing state-of-the-art models on two publicly available datasets without explicitly tuning data dependent parameters. Our analysis leads to the following observations. First, the OntoNotes dataset has a different distribution of entity mentions compared with the other two datasets. Second, if the data distributions are similar, then transfer learning is very helpful. Third, incorporating data-driven label-label correlation helps in the case of labels of mixed types. Fourth, there is an inherent limitation in assuming all labels to be clean if they belong to the same path of the hierarchy. Fifth, the proposed model fails to learn label types that are very noisy.
Future work could analyse the effect of label noise reduction techniques on the proposed model, revisit the definition of clean and noisy labels, and model label-label correlation in a principled way that does not depend on dataset specific parameters.