Neural Architectures for Fine-grained Entity Type Classification

In this work, we investigate several neural network architectures for fine-grained entity type classification and make three key contributions. Although it is a natural comparison and addition, previous work on attentive neural architectures has not considered hand-crafted features; we combine these with learnt features and establish that they complement each other. Additionally, through quantitative analysis we establish that the attention mechanism learns to attend over syntactic heads and the phrase containing the mention, both of which are known to be strong hand-crafted features for our task. We introduce parameter sharing between labels through a hierarchical encoding method, whose low-dimensional projections show clear clusters for each type hierarchy. Lastly, despite using the same evaluation dataset, the literature frequently compares models trained using different data. We demonstrate that the choice of training data has a drastic impact on performance, which decreases by as much as 9.85% loose micro F1 score for a previously proposed method. Despite this discrepancy, our best model achieves state-of-the-art results with 75.36% loose micro F1 score on the well-established FIGER (GOLD) dataset, and we report the best results for models trained using publicly available data on the OntoNotes dataset with 64.93% loose micro F1 score.


Introduction
Entity type classification aims to label entity mentions in their context with their respective semantic types. Information regarding entity type mentions has proven to be valuable for several natural language processing tasks, such as question answering (Lee et al., 2006), knowledge base population (Carlson et al., 2010), and co-reference resolution (Recasens et al., 2013). A natural extension to traditional entity type classification has been to divide the set of types, which may be too coarse-grained for some applications (Sekine, 2008), into a larger set of fine-grained entity types (Lee et al., 2006; Ling and Weld, 2012; Yosef et al., 2012; Gillick et al., 2014; Del Corro et al., 2015); for example, person into actor, artist, etc.
Given the recent successes of attentive neural models for information extraction (Globerson et al., 2016; Shimaoka et al., 2016; Yang et al., 2016), we investigate several variants of an attentive neural model for the task of fine-grained entity classification (e.g. Figure 1). This model category uses a neural attention mechanism, which can be likened to a soft alignment, that enables the model to focus on informative words and phrases. We build upon this line of research and our contributions are three-fold: 1. Although it is a natural comparison and addition, previous work on attentive neural architectures does not consider hand-crafted features. We combine learnt and hand-crafted features and observe that they complement each other. Additionally, we perform extensive analysis of the attention mechanism of our model and establish that it learns to attend over syntactic heads and the tokens prior to and after a mention, both of which are known to be highly relevant to successfully classifying a mention.
2. We introduce label parameter sharing using a hierarchical encoding that improves performance on one of our datasets, and whose low-dimensional projections of the embedded labels form clear, coherent clusters.
3. While research on fine-grained entity type classification has settled on using two evaluation datasets, a wide variety of training datasets have been used, the impact of which has not been established. We demonstrate that the choice of training data has a drastic impact on performance, observing decreases of as much as 9.85% loose micro F1 score for a previously proposed method. However, even when comparing to models trained using different datasets, we report state-of-the-art results of 75.36% loose micro F1 score on the FIGER (GOLD) dataset.

Related Work
Our work primarily draws upon two strains of research: fine-grained entity classification and attention mechanisms for neural models. In this section we introduce both of these research directions. By expanding a set of coarse-grained types into a set of 147 fine-grained types, Lee et al. (2006) were the first to address the task of fine-grained entity classification. Their end goal was to use the resulting types in a question answering system, and they developed a conditional random field model that they trained and evaluated on a manually annotated Korean dataset to detect and classify entity mentions. Other early work includes Sekine (2008), which emphasised the need for access to a large set of entity types for several NLP applications. That work primarily discussed design issues for a fine-grained set of entity types and served as a basis for much of the subsequent work on fine-grained entity classification.
The first work to use distant supervision (Mintz et al., 2009) to induce a large, but noisy, training set and manually label a significantly smaller dataset to evaluate a fine-grained entity classification system was Ling and Weld (2012), who introduced both a training dataset and an evaluation dataset, FIGER (GOLD). Arguing that fine-grained sets of types should be organised in a very fine-grained hierarchical taxonomy, Yosef et al. (2012) introduced such a taxonomy covering 505 distinct types. This new set of types led to improvements on FIGER (GOLD), and they also demonstrated that the fine-grained labels could be used as features to improve coarse-grained entity type classification performance. More recently, continuing this very fine-grained strategy, Del Corro et al. (2015) introduced the most fine-grained entity type classification system to date, covering the more than 16,000 types contained in the WordNet hierarchy.
While initial work largely assumed that mention assignments could be made independently of the mention context, Gillick et al. (2014) introduced the concept of context-dependent fine-grained entity type classification, where the types of a mention are constrained to what can be deduced from its context, and introduced a new OntoNotes-derived manually annotated evaluation dataset. In addition, they addressed the problem of label noise induced by distant supervision and proposed three label cleaning heuristics. Building upon the noise reduction aspects of this work, Ren et al. (2016) introduced a method to reduce label noise even further, leading to significant performance gains on both the evaluation dataset of Ling and Weld (2012) and that of Gillick et al. (2014). Yogatama et al. (2015) proposed to map hand-crafted features and labels to embeddings in order to facilitate information sharing between both related types and features. A pure feature-learning approach was proposed by Dong et al. (2015). They defined 22 types and used a two-part neural classifier: a recurrent neural network obtained a vector representation of each entity mention, while a fixed-size window captured the context of a mention. A recent workshop paper (Shimaoka et al., 2016) introduced an attentive neural model that, unlike previous work, obtained vector representations for each mention context by composing it with a recurrent neural network, and employed an attention mechanism to allow the model to focus on relevant expressions in the mention context. Although not pointed out in Shimaoka et al. (2016), the attention mechanism used differs from previous work in that it does not condition the attention; rather, they used global weights optimised to provide attention for every fine-grained entity type classification decision.
To the best of our knowledge, the first work to utilise an attention architecture within the context of NLP was Bahdanau et al. (2014), which allowed a machine translation decoder to attend over the source sentence. In doing so, they showed that the attention mechanism significantly improved their machine translation results, as the model was capable of learning to align the source and target sentences. Moreover, in their qualitative analysis, they concluded that the model can correctly align mutually related words and phrases. For the set of neural models proposed by , attention mechanisms are used to focus on the aspects of a document that help the model answer a question, as well as providing a way to qualitatively analyse the inference process. Rocktäschel et al. (2015) demonstrated that by applying an attention mechanism to a textual entailment model, they could attain state-of-the-art results, as well as analyse how the entailing sentence aligns to the entailed sentence.
Our work differs from previous work on fine-grained entity classification in that we use the same publicly available training data when comparing models. We also believe we are the first to consider the direct combination of hand-crafted features and an attentive neural model.

Models
In this section we describe the neural model variants used in this paper, as well as a strong feature-based baseline from the literature. We pose fine-grained entity classification as a multi-class, multi-label classification problem. Given a mention in a sentence, the classifier predicts the types t ∈ {0, 1}^K, where K is the size of the set of types. Across all the models, we compute a probability y_k ∈ [0, 1] for each of the K types using logistic regression. Variations of the models stem from the ways of computing the input to the logistic regression.
At inference time, we enforce the assumption that at least one type is assigned to each mention by first assigning the type with the largest probability. We then assign any additional types whose corresponding probabilities exceed a threshold of 0.5, which was determined by tuning on development data.
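The inference rule above can be sketched in a few lines of NumPy; the function name is hypothetical and not part of the paper's released code:

```python
import numpy as np

def predict_types(probs, threshold=0.5):
    """Multi-label type assignment: always assign the highest-probability
    type, then add every other type whose probability exceeds the threshold."""
    probs = np.asarray(probs)
    assigned = set(np.flatnonzero(probs > threshold))
    assigned.add(int(np.argmax(probs)))  # guarantees at least one type
    return sorted(assigned)
```

For instance, probabilities of (0.1, 0.3, 0.2) still yield one prediction (the argmax), while (0.9, 0.6, 0.2) yields two.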

Sparse Feature Model
For each entity mention m, we create a binary feature indicator vector f(m) ∈ {0, 1}^D_f and feed it to the logistic regression layer. The features used are described in Table 1 and are comparable to those used by Gillick et al. (2014) and Yogatama et al. (2015). It is worth noting that we aimed for this model to resemble the independent classifier model of Gillick et al. (2014) so that it constitutes a meaningful, well-established baseline; however, there are two noteworthy differences. Firstly, we use the more commonly used clustering method of Brown et al. (1992), as opposed to that of Uszkoreit and Brants (2008), as Gillick et al. (2014) did not make the data used for their clusters publicly available. Secondly, we learned a set of 15 topics from the OntoNotes dataset using the LDA (Blei et al., 2003) implementation from the popular gensim software package (http://radimrehurek.com/gensim/), in contrast to Gillick et al. (2014), who used a supervised topic model trained on an unspecified dataset. Despite these differences, we argue that our set of features is comparable and enables a fair comparison, given that the original implementation and some of the data used are not publicly available.
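Constructing the indicator vector f(m) amounts to a simple lookup; the sketch below uses hypothetical feature strings (the actual feature templates are listed in Table 1):

```python
def feature_vector(mention_features, feature_index):
    """Map a mention's active hand-crafted features to a binary
    indicator vector f(m) in {0,1}^D_f; features unseen during
    training (absent from the index) are ignored."""
    vec = [0] * len(feature_index)
    for feat in mention_features:
        idx = feature_index.get(feat)
        if idx is not None:
            vec[idx] = 1
    return vec
```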

Neural Models
The neural models from Shimaoka et al. (2016) process embeddings of the words of the mention and its context, and we adopt the same formalism when introducing these models and our variants. First, the mention representation v_m ∈ R^{D_m × 1} and the context representation v_c ∈ R^{D_c × 1} are computed separately. Then, the concatenation of these representations is used to compute the prediction. Let the words in the mention be m_1, m_2, ..., m_{|m|}. The representation of the mention is computed as the average of the corresponding embeddings:

v_m = (1 / |m|) Σ_{i=1}^{|m|} u(m_i)

where u is a mapping from a word to its embedding. This relatively simple method for composing the mention representation is motivated by it being less prone to overfitting.
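The averaging of mention embeddings and the final logistic regression layer can be sketched as follows; this is a minimal NumPy illustration of the formalism, not the paper's TensorFlow implementation, and the function names are our own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mention_representation(mention_vectors):
    """v_m: the average of the embeddings u(m_1), ..., u(m_|m|)."""
    return np.mean(np.asarray(mention_vectors, dtype=float), axis=0)

def predict(v_m, v_c, W_y):
    """Per-type probabilities y_k from the concatenated representation."""
    return sigmoid(W_y @ np.concatenate([v_m, v_c]))
```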
Next, we describe the three methods from Shimaoka et al. (2016) for computing the context representations; namely, Averaging, LSTM, and Attentive Encoder.

Averaging Encoder
Similarly to the method for computing the mention representation, the Averaging encoder computes the averages of the words in the left and right contexts. Formally, let l_1, ..., l_C and r_1, ..., r_C be the words in the left and right contexts respectively, where C is the window size. For each sequence of words, we compute the average of the corresponding word embeddings. The two resulting vectors are then concatenated to form the representation of the context v_c.
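A minimal sketch of the Averaging encoder, assuming an embedding dictionary; the handling of empty contexts with a zero vector is our own assumption:

```python
import numpy as np

def averaging_context(left, right, embeddings, dim):
    """v_c: concatenation of the averaged left- and right-context
    embeddings; an empty context contributes a zero vector."""
    def avg(words):
        vecs = [embeddings[w] for w in words if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([avg(left), avg(right)])
```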

LSTM Encoder
For the LSTM Encoder, the left and right contexts are encoded by an LSTM (Hochreiter and Schmidhuber, 1997), whose high-level formulation can be written as:

h_i, c_i = LSTM(u_i, h_{i-1}, c_{i-1})

where u_i ∈ R^{D_m × 1} is an input embedding and h_i and c_i are the hidden state and cell state. For the left context, the LSTM is applied to the sequence l_1, ..., l_C from left to right and produces the outputs →h_{l_1}, ..., →h_{l_C}. For the right context, the sequence r_C, ..., r_1 is processed from right to left to produce the outputs ←h_{r_C}, ..., ←h_{r_1}. The concatenation of the final outputs of the two directions then serves as the context representation v_c.
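The recurrence above can be made concrete with a standard LSTM cell in NumPy. This is a didactic sketch: the paper uses separate LSTMs for the two contexts, whereas a single weight set is shared here purely for brevity:

```python
import numpy as np

def lstm_step(u, h_prev, c_prev, W, b):
    """One step of h_i, c_i = LSTM(u_i, h_{i-1}, c_{i-1}); W packs the
    input, forget, output, and candidate weights row-wise (4H rows)."""
    z = W @ np.concatenate([u, h_prev]) + b
    H = h_prev.size
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    i, f, o = sig(z[:H]), sig(z[H:2 * H]), sig(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c_prev + i * g          # new cell state
    return o * np.tanh(c), c        # new hidden state, cell state

def lstm_context(left, right, W, b, H):
    """Run the left context left-to-right and the right context
    right-to-left; v_c concatenates the two final hidden states."""
    def run(seq):
        h, c = np.zeros(H), np.zeros(H)
        for u in seq:
            h, c = lstm_step(u, h, c, W, b)
        return h
    return np.concatenate([run(left), run(list(reversed(right)))])
```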

Attentive Encoder
An attention mechanism aims to encourage the model to focus on salient local information that is relevant for the classification decision. The attention mechanism variant used in this work is defined as follows. First, bi-directional LSTMs (Graves, 2012) are applied to both the right and left contexts. We denote the output layers of the bi-directional LSTMs as →h_i and ←h_i. For each output layer, a scalar value ã_i ∈ R is computed using a feed-forward neural network with hidden layer e_i ∈ R^{D_a × 1} and weight matrices W_e ∈ R^{D_a × 2D_h} and W_a ∈ R^{1 × D_a}:

e_i = tanh(W_e [→h_i; ←h_i])
ã_i = W_a e_i

Next, the scalar values are normalised such that they sum to one:

a_i = exp(ã_i) / Σ_j exp(ã_j)

These normalised scalar values a_i are referred to as attentions. Finally, we compute the sum of the output layers of the bi-directional LSTMs, weighted by the attentions a_i, as the representation of the context:

v_c = Σ_i a_i [→h_i; ←h_i]

An illustration of the attentive encoder model variant can be found in Figure 1.
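The attention computation above can be expressed compactly in NumPy; the bi-LSTM outputs are taken as given (rows of `H_states`), and the function name is ours:

```python
import numpy as np

def attentive_context(H_states, W_e, W_a):
    """H_states rows are bi-LSTM outputs [h_fwd; h_bwd] of size 2*D_h.
    e_i = tanh(W_e h_i); score ~a_i = W_a e_i; attentions a_i are the
    softmax of the scores; v_c is the attention-weighted sum."""
    e = np.tanh(H_states @ W_e.T)   # (T, D_a) hidden layers
    scores = (e @ W_a.T).ravel()    # (T,) unnormalised scores ~a_i
    a = np.exp(scores - scores.max())
    a = a / a.sum()                 # normalised to sum to one
    return a @ H_states, a          # v_c and the attention weights
```

Returning the weights alongside v_c is what makes the later attention analysis (Section on Attention Analysis) possible.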

Hybrid Models
To allow model variants to use both human background knowledge through hand-crafted features and features learnt from data, we extended the neural models to create new hybrid model variants as follows. Let v_f ∈ R^{D_l × 1} be a low-dimensional projection of the sparse feature vector f(m):

v_f = W_f f(m)

where W_f ∈ R^{D_l × D_f} is a projection matrix. Each hybrid model variant then computes its prediction from the concatenation [v_m; v_c; v_f]. These models can thus draw upon learnt features through v_m and v_c as well as hand-crafted features through v_f when making classification decisions. While existing work on fine-grained entity type classification has used either sparse, manually designed features or dense, automatically learnt embedding vectors, our work is the first to propose and evaluate a model using the combination of both.
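The hybrid input construction is a projection followed by a concatenation; a minimal sketch (function name hypothetical):

```python
import numpy as np

def hybrid_input(v_m, v_c, f_m, W_f):
    """Project the sparse feature vector, v_f = W_f f(m), then
    concatenate it with the learnt representations: [v_m; v_c; v_f]."""
    return np.concatenate([v_m, v_c, W_f @ f_m])
```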

Hierarchical Label Encoding
Since the fine-grained types tend to form a forest of type hierarchies (e.g. musician is a subtype of artist, which in turn is a subtype of person), we investigated whether the encoding of each label could exploit this structure to enable parameter sharing. Concretely, we compose the weight matrix W_y for the logistic regression layer as the product of a learnt weight matrix V_y and a constant sparse binary matrix S:

W_y = V_y S

We encode the type hierarchy formed by the set of types in the binary matrix S as follows. Each type is mapped to a unique column in S, in which membership at each level of the hierarchy is marked with a non-zero entry, so that a type's column activates both the type itself and its ancestors. This enables us to share parameters between labels in the same hierarchy, potentially making learning easier for infrequent types, which can now draw upon annotations of other types in the same hierarchy.
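One way to build such an S is sketched below, assuming types are written as slash-separated paths (e.g. /person/artist); the square K×K shape and the path-string encoding are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def hierarchy_matrix(types):
    """Binary matrix S with one column per type; each column marks the
    type itself and every ancestor on its path (e.g. /person/artist
    also activates /person), so W_y = V_y S shares parameters within
    a hierarchy."""
    index = {t: j for j, t in enumerate(types)}
    S = np.zeros((len(types), len(types)))
    for t, j in index.items():
        node = t
        while node:
            if node in index:
                S[index[node], j] = 1.0
            node = node.rsplit("/", 1)[0]  # step up to the parent type
    return S
```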

Datasets
Despite the research community having largely settled on the manually annotated datasets FIGER (GOLD) (Ling and Weld, 2012) and OntoNotes (Gillick et al., 2014) for evaluation, there is still a remarkable difference in the data used to train models (Table 2) that are then evaluated on the same manually annotated datasets. Also worth noting is that some of this data is not publicly available, making a fair comparison between methods even more difficult. For evaluation, we use the two well-established manually annotated datasets FIGER (GOLD) and OntoNotes, where, like Gillick et al. (2014), we discarded pronominal mentions, resulting in a total of 8,963 mentions. For training, we use the automatically induced, publicly available datasets provided by Ren et al. (2016), who aimed to eliminate the label noise generated in the process of distant supervision; we use the "raw" noisy data 2 provided by them for training our models.

Pre-trained Word Embeddings
We use pre-trained word embeddings that are not updated during training, to help the model generalise to words not appearing in the training set (Rocktäschel et al., 2015). For this purpose, we used the freely available 300-dimensional cased word embeddings trained on 840 billion tokens of Common Crawl data supplied by Pennington et al. (2014). For words not present in the pre-trained word embeddings, we use the embedding of the "unk" token.
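The out-of-vocabulary fallback amounts to a lookup with a default; a trivial sketch (function name ours):

```python
def lookup(word, embeddings):
    """Embedding lookup with a fallback to the "unk" token's vector
    for out-of-vocabulary words; the vectors stay fixed during training."""
    return embeddings.get(word, embeddings["unk"])
```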

Evaluation Criteria
We adopt the same criteria as Ling and Weld (2012); that is, we evaluate model performance by strict accuracy, loose macro, and loose micro F1 scores.
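For concreteness, two of the metrics can be sketched as follows; this is our reading of the standard definitions from Ling and Weld (2012), with gold and predicted type sets given per mention:

```python
def strict_accuracy(gold_sets, pred_sets):
    """A mention counts as correct only if the predicted type set
    exactly equals the gold type set."""
    exact = sum(g == p for g, p in zip(gold_sets, pred_sets))
    return exact / len(gold_sets)

def loose_micro_f1(gold_sets, pred_sets):
    """Loose micro F1: precision and recall are computed from type
    counts aggregated over all mentions."""
    overlap = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    precision = overlap / sum(len(p) for p in pred_sets)
    recall = overlap / sum(len(g) for g in gold_sets)
    return 2 * precision * recall / (precision + recall)
```

Loose macro differs only in averaging per-mention precision and recall before combining them.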

Hyperparameter Settings
Values for the hyperparameters were obtained from preliminary experiments by evaluating model performance on the development sets. Concretely, all neural and hybrid models used the same D_m = 300-dimensional word embeddings, the hidden size of the LSTM was set to D_h = 100, the hidden-layer size of the attention module was set to D_a = 100, and the size of the low-dimensional projection of the sparse features was set to D_l = 50. We used Adam (Kingma and Ba, 2014) as our optimisation method with a learning rate of 0.001 and a mini-batch size of 1,000, and iterated over the training data for five epochs. As a regulariser we used dropout (Hinton et al., 2012) with probability 0.5 applied to the mention representation and the sparse feature representation. The context window size was set to C = 10, and if the length of a context extended beyond the sentence, we used a padding symbol in place of a word. After training, we picked the best model on the development set as our final model and report its performance on the test sets. Our model implementation was done in Python using the TensorFlow (Abadi et al., 2015) machine learning library.

2 Although Ren et al. (2016) provided both "raw" data and code to "denoise" the data, we were unable to replicate the performance benefits reported in their work after running their pipeline. We have contacted them regarding this, as we would be interested in comparing the benefit of their denoising algorithm for each model, but at the time of writing we have not yet received a response.

Table 4: Performance on FIGER (GOLD) for models using different training data.

Results
When presenting our results, it should be noted that we aim to make a clear separation between results from models trained using different datasets.

FIGER (GOLD)
We first analyse the results on FIGER (GOLD) (Tables 3 and 4). The performance of the baseline model that uses the sparse hand-crafted features is relatively close to that of the FIGER system of Ling and Weld (2012). This is consistent with the fact that both systems use linear classifiers, similar sets of features, and training data of the same size and domain.
Looking at the results of the neural models, we observe a consistent pattern: adding hand-crafted features boosts performance significantly, indicating that the learnt and hand-crafted features complement each other. The effects of adding the hierarchical label encoding were inconsistent, sometimes increasing and sometimes decreasing performance. We thus opted not to include them in the results table due to space constraints, and hypothesise that, given the size of the training data, parameter sharing may not yield any large performance benefits. Among the neural models, we see that the averaging encoder performs considerably worse than the others. Both the LSTM and attentive encoders show strong results, and the attentive encoder with hand-crafted features achieves the best performance among all the models we investigated.

Table 5: Performance on OntoNotes for models using the same W2M training data.
When comparing our best model to previously introduced models trained using different training data, we find that we achieve state-of-the-art results in terms of both loose macro and loose micro scores. The closest competitor, FIGER + PLE (Ren et al., 2016), achieves higher accuracy at the expense of lower F1 scores; we suspect that this is due to an accuracy focus in their label pruning strategy. It is worth noting that we achieve state-of-the-art results without the need for any noise reduction strategies. Also, even with 600,000 fewer training examples, our variant of the attentive model from Shimaoka et al. (2016) with hand-crafted features outperforms its feature-learning counterpart.
Regarding the impact of the choice of training set, we observe that the model introduced in Shimaoka et al. (2016) drops by as much as 3.36 points of loose micro score when using a smaller dataset, casting doubt upon the comparability of results between fine-grained entity classification models trained on different data.
OntoNotes

Turning to the results on OntoNotes, we once again observe performance improvements when the sparse hand-crafted features are added to the neural models.
In the absence of hand-crafted features, the averaging encoder suffers relatively poor performance and the attentive encoder achieves the best performance. However, when the hand-crafted features are added, a significant improvement occurs for the averaging encoder, making the performance of the three neural models much more similar. We speculate that some of the hand-crafted features, such as the dependency role and parent word of the head noun, provide crucial information for the task that cannot be captured by the plain averaging model, but can be learnt if an attention mechanism is present. Another speculative reason is that, because the training dataset is noisy compared to FIGER (GOLD) (FIGER (GOLD) uses anchors to detect entities, whereas OntoNotes uses an external tool) and the size of the dataset is small, the robustness of the simpler averaging model becomes clearer when combined with the hand-crafted features. Another interesting observation concerns the models with the hierarchical label encoding, for which consistent performance increases occur. This can be explained by the type ontology used in OntoNotes being better formed than its FIGER counterpart. While we do not obtain state-of-the-art performance when considering models using different training data, we note that in terms of F1 score we perform within one point of the state of the art, despite having trained our models on different, non-proprietary, noisy data.
Once again we have an opportunity to study the impact of the choice of training data, by comparing the results of the hand-crafted features of Gillick et al. (2014) to our own comparable set of features. We find that the performance drop is very dramatic: 9.85 points of loose micro score. Given that the training data for the previously introduced model is not publicly available, we hesitate to speculate as to exactly why this drop is so dramatic, but similar observations have been made for entity linking (Ling et al., 2015). This clearly underlines how essential it is to compare models on an equal footing using the same training data.

PCA visualisation of label embeddings
By visualising the learnt label embeddings (Figure 3) and comparing the non-hierarchical and hierarchical label encodings, we can observe that the hierarchical encoding forms clear distinct clusters.

Attention Analysis
While visualising the attention weights for specific examples has become commonplace, it is still not clear exactly which syntactic and semantic patterns are learnt by the attention mechanism. To better understand this, we first qualitatively analysed a large set of attention visualisations and observed that head words and the words contained in the phrase forming the mention tended to receive the highest levels of attention. To quantify this notion, we calculated how frequently the most strongly attended word, over all mentions of a specific type, was the syntactic head or one of the words immediately before or after the mention. What we found through our analysis (Table 7) was that our attentive model without hand-crafted features does indeed learn that head words and the phrase surrounding the mention are highly indicative of the mention type, without any explicit supervision. Furthermore, we believe that this may in part explain why the performance benefit of adding hand-crafted features was smaller for the attentive model than for our other two neural variants.
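The quantification described above can be sketched as a simple argmax check over stored attention vectors; the function name and input layout are our own assumptions:

```python
import numpy as np

def head_attention_rate(attention_rows, head_positions):
    """Fraction of mentions whose most strongly attended context token
    is the syntactic head (positions index into the context window)."""
    hits = sum(int(np.argmax(a) == h)
               for a, h in zip(attention_rows, head_positions))
    return hits / len(head_positions)
```

The same routine applies unchanged to the tokens before and after the mention by swapping in their positions.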

Conclusions and Future Work
In this paper, we investigated several model variants for the task of fine-grained entity type classification. The experiments clearly demonstrated that the choice of training data, which has until now been ignored for our task, has a significant impact on performance. Our best model achieved state-of-the-art results with 75.36% loose micro F1 score on FIGER (GOLD), despite being compared to models trained using larger datasets, and we were able to report the best results for any model trained using publicly available data for OntoNotes, with 64.93% loose micro F1 score. The analysis of the behaviour of the attention mechanism demonstrated that it can successfully learn to attend over expressions that are important for the classification of fine-grained types. It is our hope that our observations can inspire further research into the limitations of what linguistic phenomena attentive models can learn and how they can be improved. As future work, we see the re-implementation of more methods from the literature as a desirable target, so that they can be evaluated using the same training data. Additionally, we would like to explore alternative hierarchical label encodings that may lead to more consistent performance benefits.
To ease the reproducibility of our work, we make the code used for our experiments available at https://github.com/shimaokasonse/NFGEC.

Acknowledgments

Curie Career Integration Award, and an Allen Distinguished Investigator Award. We would like to thank Dan Gillick for answering several questions related to his 2014 paper.