An Attentive Fine-Grained Entity Typing Model with Latent Type Representation

We propose a fine-grained entity typing model with a novel attention mechanism and a hybrid type classifier. We advance existing methods in two aspects: feature extraction and type prediction. To capture richer contextual information, we adopt contextualized word representations instead of the fixed word embeddings used in previous work. In addition, we propose a two-step mention-aware attention mechanism that enables the model to focus on important words in mentions and contexts. We also present a hybrid classification method that goes beyond binary relevance to exploit type inter-dependency with latent type representation. Instead of predicting each type independently, we predict a low-dimensional vector that encodes latent type features and reconstruct the type vector from this latent representation. Experimental results on multiple data sets show that our model significantly advances the state of the art on fine-grained entity typing, obtaining up to 6.1% and 5.5% absolute gains in macro-averaged F-score and micro-averaged F-score, respectively.


Introduction
Fine-grained entity typing aims to assign one or more types to each entity mention given a certain context. For example, in the sentence "If Rogers is in the game, the Huskies will be much better equipped to match the Cougars in that aspect", the mention "Rogers" should be labeled as athlete in addition to person according to the context (e.g., game, Huskies). These fine-grained entity types have proven effective in supporting a wide range of downstream applications such as relation extraction (Yao et al., 2010), question answering, and coreference resolution (Recasens et al., 2013).
Fine-grained entity typing is usually formulated as a multi-label classification problem. Previous approaches (Ling and Weld, 2012; Choi et al., 2018; Xin et al., 2018) typically address it with binary relevance, which decomposes the problem into isolated binary classification subproblems and predicts each type independently. However, this method is commonly criticized for its label independence assumption, which does not hold for fine-grained entity typing. For example, if the model is confident in predicting the type artist, it should promote its parent type person but discourage organization and its descendant types. To capture inter-dependencies between types, we propose a hybrid model that incorporates latent type representation in addition to binary relevance. Specifically, the model learns to predict a low-dimensional vector that encodes latent type features obtained through Principal Label Space Transformation (Tai and Lin, 2012) and reconstructs the sparse, high-dimensional type vector from this latent representation.
Another major challenge in fine-grained entity typing is to differentiate similar types, such as director and actor, which requires the model to capture subtle nuances in texts. Previous neural models (Shimaoka et al., 2016; Xin et al., 2018; Choi et al., 2018; Xu and Barbosa, 2018) generally extract features from pre-trained word embeddings. Instead, we adopt contextualized word representations (Peters et al., 2018), which can capture context-aware word semantics and better represent out-of-vocabulary words. We further propose a two-step attention mechanism to actively extract the most relevant information from the sentence. In particular, we calculate the attention for context words in a mention-aware manner, allowing the model to focus on different parts of the sentence for different mentions. For example, in the sentence "In 2005 two federal agencies, the US Geological Survey and the Fish and Wildlife Service, began to identify fish in the Potomac and tributaries ...", the model should use "federal agencies" to help classify "US Geological Survey" and "Fish and Wildlife Service" as government agency, but focus on "fish" and "tributaries" to determine that "Potomac" should be a body of water (river) instead of a city.
Figure 1 illustrates our fine-grained entity typing framework. We represent the input sentence using pre-trained contextualized word representations. Next, we apply a two-step mention-aware attention mechanism to extract the most relevant features from the sentence to form the feature vector. On top of the model, we employ a hybrid classifier to predict the types of each mention.

Sentence Encoder
Contextual information plays a key role, as we often need to determine types, especially subtypes, from the context. Hence, unlike previous neural models that generally use fixed word embeddings, we employ contextualized word representations (ELMo; Peters et al., 2018) that can capture word semantics in different contexts. Furthermore, because ELMo takes characters instead of words as input, it can leverage sub-word information to better represent out-of-vocabulary words, which are prevalent in entity mentions. Given a sentence of $S$ words, the encoder generates a sequence of word vectors $\{r_1, \ldots, r_S\}$, where $r_i \in \mathbb{R}^{d_r}$ is the representation of the $i$-th word.

Mention Representation
Previous attentive models (Shimaoka et al., 2017; Xu and Barbosa, 2018; Xin et al., 2018; Choi et al., 2018) only apply attention mechanisms to the context. However, some words in an entity mention may provide more useful information for typing, such as "Department" in Figure 1. To allow the model to focus on more informative words, we represent a mention $m$ consisting of $M$ words as a weighted sum of its contextualized word representations with an attention mechanism (Bahdanau et al., 2015) as

$$m = \sum_{i=1}^{M} a_i^m r_i,$$

where the attention score $a_i^m$ is computed as

$$a_i^m = \mathrm{Softmax}(e_i^m) = \frac{\exp(e_i^m)}{\sum_{k=1}^{M} \exp(e_k^m)}, \qquad e_i^m = {v^m}^\top \tanh(W^m r_i),$$

where parameters $W^m \in \mathbb{R}^{d_a \times d_r}$ and $v^m \in \mathbb{R}^{d_a}$ are learned during training, and the hidden attention dimension $d_a$ is set to $d_r$ in our experiments.
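The mention attention described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, and the random parameters are hypothetical, and the word vectors stand in for ELMo outputs.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def mention_representation(R_m, W_m, v_m):
    """Attention-weighted sum of the mention's word vectors.

    R_m: (M, d_r) contextualized vectors of the M mention words
    W_m: (d_a, d_r) projection matrix
    v_m: (d_a,) scoring vector
    Returns the mention vector (d_r,) and the attention weights (M,).
    """
    e = np.tanh(R_m @ W_m.T) @ v_m  # (M,) unnormalized scores e_i^m
    a = softmax(e)                  # (M,) attention weights a_i^m
    return a @ R_m, a


# Toy example with random stand-ins for ELMo vectors and learned parameters.
rng = np.random.default_rng(0)
M, d_r = 3, 8
d_a = d_r  # the text sets the attention dimension equal to d_r
R_m = rng.normal(size=(M, d_r))
W_m = rng.normal(size=(d_a, d_r))
v_m = rng.normal(size=d_a)
m_vec, weights = mention_representation(R_m, W_m, v_m)
```

With an all-zero scoring vector every word receives the same score, so the attention weights become uniform and the mention vector reduces to the plain average of the word vectors, which is a convenient sanity check.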

Context Representation
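Given the context of mention $m$, we form its representation from the contextualized vectors of the context words with a mention-aware attention mechanism

$$c = \sum_{i=1}^{C} a_i^c r_i,$$

where $C$ is the number of context words, and

$$a_i^c = \mathrm{Softmax}(e_i^c) = \frac{\exp(e_i^c)}{\sum_{k=1}^{C} \exp(e_k^c)}, \qquad e_i^c = {v^c}^\top \tanh\left(W^c (m \oplus r_i \oplus p_i)\right),$$

where $\oplus$ represents concatenation, and $v^c \in \mathbb{R}^{d_a}$ and $W^c \in \mathbb{R}^{d_a \times (2d_r + 1)}$ are trainable parameters.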
We introduce a relative position term $p_i$ that indicates the distance from the $i$-th word to the mention. It is computed from $a$ and $b$, the indices of the first and last words of the mention, and a hyperparameter $\mu$, which is set to 0.1. Finally, the feature vector of mention $m$ is formed by concatenating its mention representation $m$ and context representation $c$.
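The mention-aware context attention and the final feature vector can be sketched as follows. Several parts are assumptions made only for illustration: the exact form of the position term is not given in the text here, so `relative_position` uses a hypothetical clipped linear decay in the distance to the nearest mention boundary; the context is taken to be every sentence word outside the mention span; and the mention vector is a simple average standing in for the attentive mention representation.

```python
import numpy as np


def relative_position(i, a, b, mu=0.1):
    """Hypothetical position term: 1.0 next to the mention, decaying linearly to 0."""
    distance = min(abs(i - a), abs(i - b))
    return max(0.0, 1.0 - mu * (distance - 1))


def context_representation(R, span, m_vec, W_c, v_c, mu=0.1):
    """Mention-aware attention over the context words.

    R:     (S, d_r) contextualized vectors of the whole sentence
    span:  (a, b) inclusive indices of the first and last mention word
    m_vec: (d_r,) mention representation
    W_c:   (d_a, 2 * d_r + 1) projection matrix
    v_c:   (d_a,) scoring vector
    Returns the context vector (d_r,) and the attention weights (C,).
    """
    a, b = span
    # Assumption: context = all words outside the mention span.
    ctx_idx = [i for i in range(R.shape[0]) if i < a or i > b]
    # Each attention input is the concatenation m (+) r_i (+) p_i.
    feats = np.stack([
        np.concatenate([m_vec, R[i], [relative_position(i, a, b, mu)]])
        for i in ctx_idx
    ])                                 # (C, 2 * d_r + 1)
    e = np.tanh(feats @ W_c.T) @ v_c   # (C,) unnormalized scores e_i^c
    w = np.exp(e - e.max())
    w /= w.sum()                       # (C,) attention weights a_i^c
    return w @ R[ctx_idx], w


# Toy example with random stand-ins for ELMo vectors and learned parameters.
rng = np.random.default_rng(0)
S, d_r = 6, 4
d_a = d_r
R = rng.normal(size=(S, d_r))
span = (2, 3)
m_vec = R[2:4].mean(axis=0)  # placeholder for the attentive mention vector
W_c = rng.normal(size=(d_a, 2 * d_r + 1))
v_c = rng.normal(size=d_a)
c_vec, ctx_weights = context_representation(R, span, m_vec, W_c, v_c)
feature = np.concatenate([m_vec, c_vec])  # (2 * d_r,) input to the classifiers
```

Because the mention vector is part of every attention input, two mentions in the same sentence produce different score vectors over the same context words, which is what makes the attention mention-aware.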

Hybrid Classification Model
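We propose a hybrid type classification model consisting of two classifiers, as Figure 1 shows. We first learn a matrix $W^b \in \mathbb{R}^{d_t \times 2d_r}$ to predict type scores by

$$\tilde{y}^b = W^b (m \oplus c),$$

where $\tilde{y}^b_i$ is the score for the $i$-th type and $d_t$ is the number of types. However, this method predicts each type independently and does not consider their inter-dependencies. To tackle this issue, we introduce an additional classifier inspired by Principal Label Space Transformation (Tai and Lin, 2012). Under the hypercube sparsity assumption that the number of training examples is much smaller than $2^{d_t}$, Tai and Lin (2012) project high-dimensional type vectors into a low-dimensional space to find underlying type correlations behind the first-order co-occurrence through Singular Value Decomposition (SVD)

$$Y \approx \tilde{Y} = L \Sigma U^\top,$$

where $Y \in \mathbb{R}^{N \times d_t}$ stacks the $N$ training type vectors as rows, $U \in \mathbb{R}^{d_t \times d_l}$, $\Sigma \in \mathbb{R}^{d_l \times d_l}$, $L \in \mathbb{R}^{N \times d_l}$, and $d_l \ll d_t$. This low-dimensional space is similar to the hidden concept space in Latent Semantic Analysis (Deerwester et al., 1990). The $i$-th row of $L$ is the latent representation of the $i$-th type vector. After that, we learn to predict the latent type representation from the feature vector using

$$l = {V^l}^\top (m \oplus c),$$

where $V^l \in \mathbb{R}^{2d_r \times d_l}$ is trainable. We then reconstruct the type vector from $l$ using a linear projection $\tilde{y}^l = W^l l = U \Sigma l$. Next, by combining scores from both classifiers, we have

$$\tilde{y} = \sigma\left(\tilde{y}^b + \gamma \tilde{y}^l\right) = \sigma\left(W^b (m \oplus c) + \gamma W^l l\right),$$

where $\sigma$ is the sigmoid function and $\gamma$ is a scalar initialized to 0.1 and updated during training. Finally, our training objective is to minimize the following cross-entropy-based loss function

$$J(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{d_t} \left[ y_{n,i} \log \tilde{y}_{n,i} + (1 - y_{n,i}) \log (1 - \tilde{y}_{n,i}) \right],$$

where $y_{n,i} \in \{0, 1\}$ indicates whether the $n$-th training instance has the $i$-th type. In the test phase, we predict every type whose probability $\tilde{y}_i$ exceeds 0.5, or the single type $\arg\max_i \tilde{y}_i$ if all probabilities are lower than 0.5.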
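The following sketch illustrates the latent label space and the hybrid scoring described above. It is an illustration under assumptions rather than the authors' code: the toy type matrix, the five type names in the comment, the function names, and the random weights are all hypothetical, and training of the weights is not shown.

```python
import numpy as np


def fit_latent_label_space(Y, d_l):
    """Truncated SVD of the (N, d_t) binary type matrix: Y ~= L @ Sigma @ U.T."""
    L_full, s, Ut = np.linalg.svd(Y, full_matrices=False)
    L = L_full[:, :d_l]       # (N, d_l) latent representation of each type vector
    Sigma = np.diag(s[:d_l])  # (d_l, d_l) top singular values
    U = Ut[:d_l].T            # (d_t, d_l)
    return L, Sigma, U


def hybrid_probabilities(feature, W_b, V_l, U, Sigma, gamma=0.1):
    """Combine binary relevance scores with scores reconstructed from the latent space."""
    y_b = W_b @ feature   # (d_t,) independent per-type scores
    l = V_l.T @ feature   # (d_l,) predicted latent type representation
    y_l = U @ Sigma @ l   # (d_t,) reconstructed type scores
    return 1.0 / (1.0 + np.exp(-(y_b + gamma * y_l)))


def decode(probs, threshold=0.5):
    """Types above the threshold, or the single best type if none qualifies."""
    picked = np.flatnonzero(probs > threshold).tolist()
    return picked if picked else [int(np.argmax(probs))]


# Toy type matrix; columns: person, artist, athlete, organization, company.
Y = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
], dtype=float)
L, Sigma, U = fit_latent_label_space(Y, d_l=2)

# Random stand-ins for the feature vector and the (untrained) weights.
rng = np.random.default_rng(0)
d_r = 4
feature = rng.normal(size=2 * d_r)
W_b = rng.normal(size=(Y.shape[1], 2 * d_r))
V_l = rng.normal(size=(2 * d_r, 2))
probs = hybrid_probabilities(feature, W_b, V_l, U, Sigma)
types = decode(probs)
```

In this sketch $U$ and $\Sigma$ are computed once from the training type matrix and kept fixed in the reconstruction step, so only `W_b`, `V_l`, and `gamma` would be learned; whether the original system also fine-tunes the projection is not stated in the text.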

Data Sets
In our experiments, we evaluate the proposed model on the following data sets.
The OntoNotes fine-grained entity typing data set is derived from the OntoNotes corpus (Weischedel et al., 2013) and annotated by Gillick et al. (2014) using a three-layer set of 87 types. We use the augmented data set created by Choi et al. (2018).
FIGER (Ling and Weld, 2012) contains 2.7 million automatically labeled training instances from Wikipedia and 434 manually annotated sentences from news reports. We use ground truth mentions in our experiments and sample 0.1M training instances as the development set.
KNET (Xin et al., 2018) is another data set derived from Wikipedia. It consists of an automatically annotated subset (WIKI-AUTO) and a manually annotated test set (WIKI-MAN).

Experimental Setup
We use the pre-trained original-5.5b ELMo model and freeze its weights during training. We use an Adam optimizer with a learning rate of 5e-5, L2 weight decay of 0.01, a warmup rate of 0.1, and linear learning rate decay. We use a minibatch size of 200. To reduce overfitting, we apply dropout (Srivastava et al., 2014) to the word representation, relative position term, and final feature vector with probabilities 0.5, 0.2, and 0.2, respectively. We evaluate the performance by strict accuracy (Acc), macro-averaged F-score (Macro F1), and micro-averaged F-score (Micro F1) (Ling and Weld, 2012).
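For reference, the three evaluation metrics can be sketched as below, following the strict, loose macro, and loose micro definitions of Ling and Weld (2012). The function names and the toy gold and predicted type sets are illustrative only, and the handling of empty sets (counted as zero precision or recall) is an assumption of this sketch.

```python
def _f1(precision, recall):
    """Harmonic mean of precision and recall, 0.0 if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def strict_accuracy(gold, pred):
    """Fraction of mentions whose predicted type set equals the gold set exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Precision and recall are averaged over mentions before computing F1."""
    n = len(gold)
    precision = sum(len(g & p) / len(p) if p else 0.0 for g, p in zip(gold, pred)) / n
    recall = sum(len(g & p) / len(g) if g else 0.0 for g, p in zip(gold, pred)) / n
    return _f1(precision, recall)


def micro_f1(gold, pred):
    """Type counts are pooled over all mentions before computing precision and recall."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return _f1(precision, recall)


# Toy example: one exact match, one over-prediction, one under-prediction.
gold = [{"person", "athlete"}, {"organization"}, {"location", "city"}]
pred = [{"person", "athlete"}, {"organization", "company"}, {"location"}]

acc = strict_accuracy(gold, pred)  # 1 of 3 mentions matches exactly
macro = macro_f1(gold, pred)       # per-mention P and R both average to 2.5 / 3
micro = micro_f1(gold, pred)       # pooled P and R are both 4 / 5
```

Strict accuracy gives no credit for partially correct type sets, which is why it is typically much lower than the two F-scores on data sets with deep type hierarchies.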

Evaluation Results
We compare the performance of our model with state-of-the-art methods on OntoNotes, FIGER, and KNET in Table 2.
We compare the outputs of both classifiers ($\tilde{y}^b$ and $\tilde{y}^l$). The reconstructed type vector $\tilde{y}^l$ alone does not predict entity types accurately, while adding this classifier substantially improves the performance of the model.
We also show results on the BBN data set. To evaluate the influence of individual components of our model, we conduct an ablation study as shown in Table 6. We implement a baseline model similar to that of Shimaoka et al. (2017) without hand-crafted features, the state of the art on this data set. We observe that each component added to the model improves its performance.
We visualize mention and context attention in Figure 2. The first example shows the impact of mention attention. The baseline model mistakenly classifies the mention as location, probably because "Asian" generally appears in location mentions, while our model successfully predicts organization by assigning higher weights to "Student Commission". In Examples #2 and #3, we compare context attention between "Terry Martino" and "APA". Although they occur in the same sentence, our model is able to focus on different context words for different mentions with the mention-aware attention mechanism.

Related Work
As entity types are usually organized as a forest of hierarchies, several models have been proposed to leverage this structure. Yosef et al. (2012) build a set of classifiers based on the taxonomy of YAGO (Hoffart et al., 2013) and perform top-down hierarchical classification. Shimaoka et al. (2017) propose a hierarchical label encoding method to share parameters between types in the same hierarchy. Xu and Barbosa (2018) propose a hierarchy-aware loss function to reduce the penalty when predicted types are related. By contrast, our model automatically finds type inter-dependency via matrix factorization and is able to capture inter-dependencies between types regardless of whether they are in the same hierarchy.
Attention mechanisms are widely used in neural fine-grained entity typing models (Shimaoka et al., 2016, 2017; Xu and Barbosa, 2018; Xin et al., 2018; Choi et al., 2018) to weight context words. We also apply attention to mention words and introduce a position term to make the attention over context words more mention-aware.

Conclusions and Future Work
We propose an attentive architecture for fine-grained entity typing with latent type representation. Experiments on multiple data sets demonstrate that our model achieves state-of-the-art performance. In the future, we will further improve the performance on fine-grained types, which is still lower than that on general types due to fewer training instances and distant supervision noise. We also plan to utilize fine-grained entity typing results in more downstream applications, such as coreference resolution and event extraction.