Fine-grained Entity Typing through Increased Discourse Context and Adaptive Classification Thresholds

Fine-grained entity typing is the task of assigning fine-grained semantic types to entity mentions. We propose a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context -- both document and sentence level information -- than prior work. We find that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds. Experiments show that our approach without reliance on hand-crafted features achieves the state-of-the-art results on three benchmark datasets.


Introduction
Named entity typing is the task of detecting the type (e.g., person, location, or organization) of a named entity in natural language text. Entity type information has been shown to be useful in natural language tasks such as question answering (Lee et al., 2006), knowledge-base population (Carlson et al., 2010; Mitchell et al., 2015), and coreference resolution (Recasens et al., 2013). Motivated by its application to downstream tasks, recent work on entity typing has moved beyond standard coarse types towards finer-grained semantic types with richer ontologies (Lee et al., 2006; Ling and Weld, 2012; Yosef et al., 2012; Gillick et al., 2014; Del Corro et al., 2015). Rather than assuming an entity can be uniquely categorized into a single type, the task has been approached as a multi-label classification problem: e.g., in "... became a top seller ... Monopoly is played in 114 countries. ..." (Figure 1), "Monopoly" is considered both a game as well as a product.
The state-of-the-art approach (Shimaoka et al., 2017) for fine-grained entity typing employs an attentive neural architecture to learn representations of the entity mention as well as its context. These representations are then combined with hand-crafted features (e.g., lexical and syntactic features), and fed into a linear classifier with a fixed threshold. While this approach outperforms previous approaches which only use sparse binary features (Ling and Weld, 2012; Gillick et al., 2014) or distributed representations (Yogatama et al., 2015), it has a few drawbacks: (1) the representations of left and right contexts are learnt independently, ignoring their mutual connection; (2) the attention on context is computed solely upon the context, considering no alignment to the entity; (3) document-level contexts which could be useful in classification are not exploited; and (4) hand-crafted features heavily rely on system or human annotations.
To overcome these drawbacks, we propose a neural architecture (Figure 1) which learns more context-aware representations by using a better attention mechanism and taking advantage of semantic discourse information available in both the document-level as well as sentence-level contexts. Further, we find that adaptive classification thresholds lead to additional improvements. Experiments demonstrate that our approach, without any reliance on hand-crafted features, outperforms prior work on three benchmark datasets.

Model
Fine-grained entity typing is considered a multi-label classification problem: each entity e in the text x is assigned a set of types T* drawn from the fine-grained type set T. The goal of this task is to predict, given entity e and its context x, the assignment of types to the entity. This assignment can be represented by a binary vector y ∈ {0, 1}^|T|, where |T| is the size of T and y_t = 1 iff the entity is assigned type t ∈ T.
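As a toy illustration of this encoding (the type names below are hypothetical examples of ours, not the datasets' ontologies), the type assignment of the "Monopoly" example can be written as a binary vector over T:

```python
# Toy illustration: a multi-label type assignment as a binary vector y
# over a small hypothetical fine-grained type set T.
T = ["/person", "/organization", "/other/product", "/other/product/game"]

def encode_types(assigned, type_set):
    """Return the binary vector y with y_t = 1 iff type t is assigned."""
    return [1 if t in assigned else 0 for t in type_set]

# "Monopoly" is considered both a product and a game:
y = encode_types({"/other/product", "/other/product/game"}, T)
print(y)  # [0, 0, 1, 1]
```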

General Model
Given a type embedding vector w_t and a featurizer ϕ that takes entity e and its context x, we employ logistic regression (as shown in Figure 1) to model the probability of e being assigned t (i.e., y_t = 1):

    P(y_t = 1 | e, x) = 1 / (1 + exp(−w_t · ϕ(e, x))),    (1)

and we seek to learn a type embedding matrix W = [w_1, ..., w_|T|] and a featurizer ϕ that maximize the likelihood of the gold type assignments over the training data:

    max over W, ϕ of Σ_(e,x) [ Σ_{t ∈ T*} log P(y_t = 1 | e, x) + Σ_{t ∈ T \ T*} log(1 − P(y_t = 1 | e, x)) ].    (2)

At inference, the predicted type set T̂ assigned to entity e is computed by

    T̂ = { t ∈ T : P(y_t = 1 | e, x) ≥ r_t },

with r_t the threshold for predicting that e has type t.
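The scoring and inference steps can be sketched as follows. This is a minimal sketch under assumed toy shapes; W, ϕ, and r follow the paper's notation, but the values are illustrative only.

```python
import numpy as np

# Minimal sketch of the general model: W is the type embedding matrix
# (|T| x d), phi a d-dimensional feature vector phi(e, x), and r the
# per-type thresholds r_t. All toy values below are ours.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_types(W, phi, r):
    """Return {t : P(y_t = 1 | e, x) = sigmoid(w_t . phi) >= r_t}."""
    probs = sigmoid(W @ phi)  # one probability per type
    return {t for t, p in enumerate(probs) if p >= r[t]}

W = np.array([[1.0, -1.0], [0.5, 2.0], [-2.0, 0.0]])  # |T| = 3, d = 2
phi = np.array([1.0, 1.0])
r = np.array([0.5, 0.5, 0.5])  # fixed thresholds, as in prior work
print(predict_types(W, phi, r))  # {0, 1}
```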

Featurizer
As shown in Figure 1, the featurizer ϕ in our model contains three encoders which encode entity e and its context x into feature vectors. In contrast to prior work which only takes sentence-level context (Gillick et al., 2014; Shimaoka et al., 2017), we consider both the sentence-level context x_s and the document-level context x_d. The output of featurizer ϕ is the concatenation of these feature vectors:

    ϕ(e, x) = [f(e); g_s(x_s, e); g_d(x_d)].

We define the computation of these feature vectors in what follows.

Entity Encoder: The entity encoder f computes the average of the embeddings of all tokens in entity e.

Sentence-level Context Encoder: The encoder g_s for sentence-level context x_s employs a single bi-directional RNN to encode x_s. Formally, let the tokens in x_s be x_s^1, ..., x_s^n. The hidden state h_i for token x_s^i is a concatenation of a left-to-right hidden state →h_i and a right-to-left hidden state ←h_i:

    h_i = [→h_i; ←h_i],  →h_i = →f(x_s^i, →h_{i−1}),  ←h_i = ←f(x_s^i, ←h_{i+1}),

where →f and ←f are L-layer stacked LSTM units (Hochreiter and Schmidhuber, 1997). This is different from Shimaoka et al. (2017), who use two separate bi-directional RNNs for the context on each side of the entity mention.

Attention: The feature representation for x_s is a weighted sum of the hidden states: g_s(x_s, e) = Σ_{i=1}^{n} a_i h_i, where a_i is the attention to hidden state h_i. We employ dot-product attention (Luong et al., 2015), which computes attention based on the alignment between the entity and its context:

    a_i = exp(s_i) / Σ_{j=1}^{n} exp(s_j),  s_i = h_i^⊤ W_a f(e),

where W_a is a weight matrix. The dot-product attention differs from the self attention (Shimaoka et al., 2017), which only considers the context.

Document-level Context Encoder: The encoder g_d for document-level context x_d is a multi-layer perceptron:

    g_d(x_d) = tanh(W_d2 tanh(W_d1 DM(x_d))),

where DM is a pretrained distributed memory model (Le and Mikolov, 2014) which converts the document-level context into a distributed representation, and W_d1 and W_d2 are weight matrices.
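The dot-product attention step can be sketched in NumPy as below. The shapes are toy values of our choosing (not the paper's configuration), and `attend` is a hypothetical helper name; it shows only the alignment-then-weighted-sum computation.

```python
import numpy as np

# Sketch of dot-product attention (Luong et al., 2015 style): scores align
# the entity representation f(e) with each hidden state h_i through a
# learned matrix W_a; the output is the attention-weighted sum g_s.
def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(H, f_e, W_a):
    """H: (n, d_h) hidden states; f_e: (d_e,) entity vector; W_a: (d_h, d_e)."""
    scores = H @ (W_a @ f_e)  # s_i = h_i^T W_a f(e)
    a = softmax(scores)       # attention weights, sum to 1
    return a @ H              # weighted sum of hidden states

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))  # n = 5 tokens, hidden size 4 (toy)
f_e = rng.standard_normal(3)     # entity feature size 3 (toy)
W_a = rng.standard_normal((4, 3))
g_s = attend(H, f_e, W_a)
print(g_s.shape)  # (4,)
```

With W_a = 0 the scores are uniform and the output reduces to the mean of the hidden states, which is a quick sanity check on the weighting.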

Adaptive Thresholds
In prior work, a fixed threshold (r t = 0.5) is used for classification of all types (Ling and Weld, 2012;Shimaoka et al., 2017). We instead assign a different threshold to each type that is optimized to maximize the overall strict F 1 on the dev set. We show the definition of strict F 1 in Section 3.1.
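One plausible realization of this tuning (not necessarily the authors' exact procedure) is a coordinate-wise sweep: for each type, try candidate thresholds and keep the one maximizing strict F1 (exact-match accuracy of predicted type sets) on the dev set. The helper names and candidate grid below are our assumptions.

```python
# Hedged sketch of per-type threshold tuning on a dev set.
def strict_f1(probs, gold, r):
    """Strict F1: fraction of instances whose predicted set equals gold."""
    correct = sum(
        1 for p, g in zip(probs, gold)
        if {t for t, pt in enumerate(p) if pt >= r[t]} == g
    )
    return correct / len(gold)

def tune_thresholds(probs, gold, n_types, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Coordinate-wise search: optimize each r_t with the others held fixed."""
    r = [0.5] * n_types
    for t in range(n_types):
        r[t] = max(candidates,
                   key=lambda c: strict_f1(probs, gold, r[:t] + [c] + r[t+1:]))
    return r

# Toy dev set: two instances, two types.
probs = [[0.45, 0.9], [0.45, 0.2]]
gold = [{0, 1}, {0}]
r = tune_thresholds(probs, gold, 2)
print(strict_f1(probs, gold, r))        # tuned thresholds fit the dev set
```

On this toy dev set the fixed threshold 0.5 gets every instance wrong, while the tuned thresholds recover both gold sets.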

Experiments
We conduct experiments on three publicly available datasets. Table 1 shows the statistics of these datasets.

Metrics
We adopt the metrics used in Ling and Weld (2012), where results are evaluated via strict, loose macro, and loose micro F1 scores. For the i-th of N instances, let the predicted type set be T̂_i and the reference type set be T_i. The precision (P) and recall (R) for each metric are computed as follows.

Strict:

    P = R = (1/N) Σ_{i=1}^{N} 1(T̂_i = T_i).

Loose Macro:

    P = (1/N) Σ_{i=1}^{N} |T̂_i ∩ T_i| / |T̂_i|,  R = (1/N) Σ_{i=1}^{N} |T̂_i ∩ T_i| / |T_i|.

Loose Micro:

    P = Σ_{i=1}^{N} |T̂_i ∩ T_i| / Σ_{i=1}^{N} |T̂_i|,  R = Σ_{i=1}^{N} |T̂_i ∩ T_i| / Σ_{i=1}^{N} |T_i|.

We make the source code and data publicly available at https://github.com/sheng-z/figet.
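The three metrics can be implemented directly from the definitions in Ling and Weld (2012); each function below takes lists of predicted and gold type sets and returns (P, R).

```python
# Strict, loose macro, and loose micro precision/recall over type sets.
def strict(pred, gold):
    p = r = sum(1 for P, G in zip(pred, gold) if P == G) / len(gold)
    return p, r

def loose_macro(pred, gold):
    # Skip empty predicted sets in precision to avoid division by zero.
    p = sum(len(P & G) / len(P) for P, G in zip(pred, gold) if P) / len(gold)
    r = sum(len(P & G) / len(G) for P, G in zip(pred, gold)) / len(gold)
    return p, r

def loose_micro(pred, gold):
    inter = sum(len(P & G) for P, G in zip(pred, gold))
    p = inter / sum(len(P) for P in pred)
    r = inter / sum(len(G) for G in gold)
    return p, r

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

pred = [{"a"}, {"a", "b"}]
gold = [{"a"}, {"a"}]
print(strict(pred, gold))       # (0.5, 0.5)
print(loose_macro(pred, gold))  # (0.75, 1.0)
```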

Hyperparameters
We use open-source GloVe vectors (Pennington et al., 2014) trained on Common Crawl 840B with 300 dimensions to initialize the word embeddings used in all encoders. All weight parameters are sampled from U(−0.01, 0.01). The encoder for sentence-level context is a 2-layer bi-directional RNN with 200 hidden units. The DM output size is 50. The sizes of W_a, W_d1, and W_d2 are 200×300, 70×50, and 50×70 respectively. The Adam optimizer (Kingma and Ba, 2014) with mini-batch gradient descent is used for optimization. Batch size is 200. Dropout (rate = 0.5) is applied to the three feature functions. To avoid overfitting, we choose the models which yield the best strict F1 on the dev sets.

Results
We compare experimental results of our approach with previous approaches, and study the contribution of our base model architecture, document-level contexts, and adaptive thresholds via ablation. To ensure our findings are reliable, we run each experiment twice and report the average performance. Overall, our approach significantly increases the state-of-the-art macro F1 on both the OntoNotes and BBN datasets.
On OntoNotes (Table 3), our approach improves the state of the art across all three metrics. Note that (1) without adaptive thresholds or document-level contexts, our approach still outperforms other approaches on macro F1 and micro F1; and (2) adding hand-crafted features (Shimaoka et al., 2017) does not improve the performance. On BBN (Table 4), while Ma et al. (2016)'s label embedding algorithm holds the best strict F1, our approach notably improves both macro F1 and micro F1. The performance drops to a level competitive with other approaches if adaptive thresholds or document-level contexts are removed.
Integrating label embedding into our proposed approach is an avenue for future work.

On FIGER (Table 5), where no document-level context is currently available, our proposed approach still achieves the state-of-the-art strict and micro F1. If compared with the ablation variant of the NEURAL approach, i.e., w/o hand-crafted features, our approach gains a significant improvement. We notice that removing adaptive thresholds only causes a small performance drop; this is likely because the train and test splits of FIGER are from different sources, and the adaptive thresholds do not generalize well enough to the test data. KWASIBIE, Attentive and FNET were trained on a different dataset, so their results are not directly comparable.

Table 5: Results on FIGER.

Approach                             Strict  Macro  Micro
KWASIBIE (Yogatama et al., 2015)     N/A     N/A    72.25
Attentive (Shimaoka et al., 2016)    58.97   77.96  74.94
FNET (Abhishek et al., 2017)         65.80   81.20  77.40
Ling and Weld (2012)                 52.30   69.90  69.30
PLE (Ren et al., 2016b)              49.44   68.75  64.54
Ma et al. (2016)                     53.54   68.06  66.53
AFET (Ren et al., 2016a)             53.30   69.30  66.40
NEURAL (Shimaoka et al., 2017)

Table 2 shows examples illustrating the benefits brought by our proposed approach. Example A illustrates that sentence-level context sometimes is not informative enough, and attention, though already placed on the head verbs, can be misleading. Including document-level context (i.e., "Canada's declining crude output" in this case) helps preclude wrong predictions (i.e., /other/health and /other/health/treatment). Example B shows that the semantic patterns learnt by our attention mechanism help make the correct prediction.

As we observe in Table 3 and Table 5, adding hand-crafted features to our approach does not improve the results. One possible explanation is that hand-crafted features mostly capture syntactic-head or topic information, and such information is already covered by our attention mechanism and document-level contexts, as shown in Table 2. Compared to hand-crafted features that heavily rely on system or human annotations, the attention mechanism requires significantly less supervision, and document-level or paragraph-level contexts are much easier to obtain.
Through experiments, we observe no improvement by encoding type hierarchical information (Shimaoka et al., 2017). 5 To explain this, we compute cosine similarity between each pair of fine-grained types based on the type embeddings learned by our model, i.e., w t in Eq. (1). Table 6 shows several types and their closest types: these types do not always share coarse-grained types with their closest types, but they often co-occur in the same context.
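The similarity analysis above can be sketched as follows; the toy embedding matrix stands in for the type embeddings w_t learned by the model, and `closest_type` is a hypothetical helper name of ours.

```python
import numpy as np

# Sketch of the analysis: cosine similarity between learned type embedding
# rows w_t of W, used to find each type's closest type.
def closest_type(W, t):
    """Return the index of the type whose embedding is most similar to w_t."""
    norms = np.linalg.norm(W, axis=1)
    sims = (W @ W[t]) / (norms * norms[t])  # cosine similarity to every type
    sims[t] = -np.inf                       # exclude the type itself
    return int(np.argmax(sims))

# Toy 2-d embeddings for three types; rows 0 and 1 point in similar directions.
W = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(closest_type(W, 0))  # 1
```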

Conclusion
We propose a new approach for fine-grained entity typing. The contributions are: (1) we propose a neural architecture which learns a distributional semantic representation that leverages both document- and sentence-level information; (2) we find that increasing the context with document-level information improves performance; and (3) we utilize adaptive classification thresholds to further boost performance. Experiments show our approach achieves new state-of-the-art results on three benchmark datasets.