End-to-End Trainable Attentive Decoder for Hierarchical Entity Classification

We address fine-grained entity classification and propose a novel attention-based recurrent neural network (RNN) encoder-decoder that generates paths in the type hierarchy and can be trained end-to-end. We show that our model performs better on fine-grained entity classification than prior work that relies on flat or local classifiers that do not directly model hierarchical structure.


Introduction
Many tasks in natural language processing involve hierarchical classification, e.g., fine-grained morphological and part-of-speech tags form a hierarchy (Mueller et al., 2013), as do many large topic sets (Lewis et al., 2004). The task definition can either specify that a single path is correct, corresponding to a single-label classification problem at the lowest level of the hierarchy, e.g., in fine-grained morphological tagging; or that multiple paths can be correct, corresponding to a multilabel classification problem at the lowest level of the hierarchy, e.g., in topic classification.
In this paper, we address fine-grained entity mention classification, another problem with a hierarchical class structure. In this task, each mention can have several fine-grained types, e.g., Obama is both a politician and an author in a context in which his election is related to his prior success as a best-selling author; thus, the problem is multilabel at the lowest level of the hierarchy.
Two standard approaches to hierarchical classification are flat and local classification. In flat classification (e.g., FIGER (Ling and Weld, 2012), Attentive Encoder (Shimaoka et al., 2016; Shimaoka et al., 2017)), the task is formalized as a flat multiclass multilabel problem. In local classification (Gillick et al., 2014; Yosef et al., 2012; Yogatama et al., 2015), a separate local classifier is learned for each node of the hierarchy. In both approaches, some form of postprocessing is necessary to make the decisions consistent, e.g., an entity can only be a celebrity if it is also a person.
In this paper, we propose an attentive RNN encoder-decoder for hierarchical classification. The encoder-decoder performs classification by generating paths in the hierarchy from the top node to leaf nodes. Thus, we model the structure of the hierarchy more directly than prior work. On each step of the path, part of the input to the encoder-decoder is an attention-weighted sum of the states of a bidirectional Gated Recurrent Unit (GRU) run over the context of the mention to be classified. Unlike prior work on hierarchical entity classification, our architecture can be trained end-to-end. We show that our model performs better than prior work on the FIGER dataset (Ling and Weld, 2012).
This paper is structured as follows. In Section 2, we provide a detailed description of our model PthDCode. In Section 3, we describe and analyze our experiments. In Section 4, we discuss related work. Section 5 concludes.

Model
Figure 1 displays our model PthDCode.
We use lowercase italics for variables, uppercase italics for sequences, lowercase bold for vectors and uppercase bold for matrices. Sentence $S = x_1, \ldots, x_{|S|}$ is a sequence of words, represented as embeddings $\mathbf{x}_i$, each of dimension $d$. The classes of an entity are represented as $\mathbf{y}$, a vector of $l$ binary indicators, each indicating whether the corresponding class is correct. Hidden states of forward and backward encoders and of the decoder have dimensionality $p$.
Figure 1: PthDCode, the attentive encoder-decoder for hierarchical entity classification.
PthDCode extracts mention $x_b, \ldots, x_r$, right context $R_c = x_{r+1}, \ldots, x_{r+w}$ and left context $L_c = x_{b-w}, \ldots, x_{b-1}$. The representation $\mathbf{m}$ of the mention is computed as the average of its $r - b + 1$ vectors. The context is represented by $\mathbf{C}$, a matrix of size $2w \times 2p$; each column of $\mathbf{C}$ consists of two hidden state vectors $\mathbf{h}$ (each of dimension 2p), corresponding to forward and backward GRUs run on $L_c$ and $R_c$. The initial state $\mathbf{s}_0$ of PthDCode's decoder RNN is computed using the mention representation $\mathbf{m}$ compressed to $p$ dimensions by an extra hidden layer (not shown in the figure). Initial output $\mathbf{y}_0$ is a dummy symbol SOL (Start Of Label), and initial attention weights $\mathbf{c}_0$ are set to zero.
At each path generation step $i$, attention weights $\alpha_{ij}$ are computed by $\mathrm{att}$, a feedforward network with softmax output layer; $\mathbf{C}_{.j}$ denotes the $j$th column of $\mathbf{C}$. The final context representation for the decoder is then computed as $\mathbf{c}_i = \sum_{j=1}^{2w} \alpha_{ij} \mathbf{C}_{.j}$. In Figure 1, dashed objects indicate involvement in calculating attention weights.
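The attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The text specifies only that att is a feedforward network with a softmax output, so the additive scoring form, the parameter names `W_c`, `W_s`, `v`, the hidden size `k` and the storage of the context with one position per column (shape `(2p, 2w)`) are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_context(C, s_prev, W_c, W_s, v):
    """Return attention weights alpha (2w,) and context vector c (2p,).

    C      : (2p, 2w) context matrix, one context position per column (assumed layout)
    s_prev : (p,) previous decoder state
    W_c, W_s, v : hypothetical parameters of the scoring network
    """
    # One tanh hidden layer scores every context column against the decoder state.
    hidden = np.tanh(W_c @ C + (W_s @ s_prev)[:, None])  # (k, 2w)
    scores = v @ hidden                                   # (2w,)
    alpha = softmax(scores)                               # weights sum to 1
    c = C @ alpha                                         # sum_j alpha_j * C[:, j]
    return alpha, c

# Toy dimensions (illustrative only): p = 4, w = 3, k = 5.
rng = np.random.default_rng(0)
p, w, k = 4, 3, 5
C = rng.normal(size=(2 * p, 2 * w))
s_prev = rng.normal(size=p)
W_c = rng.normal(size=(k, 2 * p))
W_s = rng.normal(size=(k, p))
v = rng.normal(size=k)

alpha, c = attention_context(C, s_prev, W_c, W_s, v)
```

Because the weights are recomputed from the decoder state at every step, each level of the hierarchy can attend to different context words.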
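The attention-weighted sum $\mathbf{c}_i$ and the current state $\mathbf{s}_{i-1}$ are used to predict the distribution $\mathbf{y}_i$ over entity classes (non-dashed *-nodes in Figure 1):
$$\mathbf{y}_i = g(\mathbf{s}_{i-1}, \mathbf{c}_i) \tag{3}$$
where $g$ is a feedforward network with elementwise sigmoid. Finally, PthDCode uses prediction $\mathbf{y}_i$, weighted average $\mathbf{c}_i$ and previous state $\mathbf{s}_{i-1}$ to compute the next state:
$$\mathbf{s}_i = f(\mathbf{s}_{i-1}, \mathbf{y}_i, \mathbf{c}_i) \tag{4}$$
where $f$ is the recurrent update of the decoder RNN. The loss function at each step or level is binary cross-entropy:
$$L_i = -\sum_{k=1}^{l} \left[ t_{ik} \log y_{ik} + (1 - t_{ik}) \log (1 - y_{ik}) \right] \tag{5}$$
where $\mathbf{y}_i$ and $\mathbf{t}_i$ are prediction and truth and $l$ the number of classes. The objective is to minimize the total loss, i.e., the sum of the losses at each level.
During inference, we compute the Cartesian product of the predicted types at each level and filter out those paths that do not occur in the training set.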
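The inference step (Cartesian product of the level-wise predictions, followed by filtering against paths seen in training) can be sketched as follows. The 0.5 decision threshold on the sigmoid outputs, the helper names and the toy labels and probabilities are assumptions for illustration; the text does not state how per-level predictions are binarized.

```python
from itertools import product

def predicted_types(probs, labels, threshold=0.5):
    # Keep every label whose sigmoid output reaches the (assumed) threshold.
    return {label for label, prob in zip(labels, probs) if prob >= threshold}

def decode_paths(level_predictions, train_paths):
    """Combine per-level predictions into full paths.

    level_predictions : list with one set of predicted labels per level
    train_paths       : set of tuples, the paths observed in the training data
    """
    candidates = product(*level_predictions)  # Cartesian product across levels
    return sorted(path for path in candidates if path in train_paths)

# Toy two-level example with made-up probabilities.
labels_l1 = ["/person", "/organization", "/location"]
labels_l2 = ["/person/politician", "/person/author", "/location/city"]
level1 = predicted_types([0.9, 0.2, 0.1], labels_l1)
level2 = predicted_types([0.8, 0.7, 0.6], labels_l2)

train_paths = {
    ("/person", "/person/politician"),
    ("/person", "/person/author"),
    ("/location", "/location/city"),
}

paths = decode_paths([level1, level2], train_paths)
```

In this toy case `/location/city` clears the threshold at level 2 but is dropped, because `/location` was not predicted at level 1 and the pair `("/person", "/location/city")` never occurs in training.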

Experiments and results
Dataset. We use the Wiki dataset (Ling and Weld, 2012) published by Ren et al. (2016). It consists of 2.69 million mentions obtained from 1.5 million sentences sampled from Wikipedia articles. These mentions are tagged with 113 types with a maximum of two levels of hierarchy. Ling and Weld (2012) also created a test set of 434 sentences that contain 562 gold entity mentions. Similar to prior work (Ling and Weld, 2012; Ren et al., 2016; Yogatama et al., 2015; Shimaoka et al., 2017), we randomly sample a training set of 2 million and a disjoint dev set of size 500.
Evaluation. Like prior work, we use three $F_1$ metrics, strict, loose macro and loose micro, that differ in the definition of precision $P$ and recall $R$. Let $n$ be the number of mentions, $T_i$ the true set of tags of mention $i$ and $Y_i$ the predicted set. Then, we define $P = R = \frac{1}{n} \sum_{i=1}^{n} \delta(Y_i = T_i)$ for strict; $P = \frac{1}{n} \sum_{i=1}^{n} \frac{|T_i \cap Y_i|}{|Y_i|}$ and $R = \frac{1}{n} \sum_{i=1}^{n} \frac{|T_i \cap Y_i|}{|T_i|}$ for loose macro; and $P = \frac{\sum_{i=1}^{n} |T_i \cap Y_i|}{\sum_{i=1}^{n} |Y_i|}$ and $R = \frac{\sum_{i=1}^{n} |T_i \cap Y_i|}{\sum_{i=1}^{n} |T_i|}$ for loose micro.
Parameter Settings. We use pre-trained word embeddings of size 300 provided by Pennington et al. (2014). OOV vectors are randomly initialized. Similar to Shimaoka et al. (2017), all hidden states $\mathbf{h}$ of the encoder-decoder are set to 100 dimensions and the mention length $m$ to 5. Window size is $w = 15$. We bracket left and right contexts with special start and end symbols. For short left / right contexts, we bracket with additional different start / end symbols that are masked out for calculation of loss and attention weights. Another special symbol EOL (End Of Label) is appended to short paths, so that all hierarchical paths have the same length. We use ADAM (Kingma and Ba, 2014) with learning rate .001 and batch size 500. Following Srivastava et al. (2014), we regularize learning by dropout of the states used in computing the prediction as in Eq. 3, with probability .5. Similarly, we drop out the feedback connections used in computing the next state as in Eq. 4, with probability .2. We also add Gaussian noise with a probability of .1 to feedforward weights. The weights of feedforward units are initialized from an isotropic Gaussian distribution with mean 0 and standard deviation .02, while the weights of recurrent units are initialized with a random orthogonal matrix.
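The three evaluation metrics (strict, loose macro and loose micro) can be sketched in plain Python as follows. The function names are hypothetical, and the handling of empty prediction sets (they contribute zero to the loose macro precision sum) is an assumption. The type labels in the example are invented for illustration and are not results from the paper.

```python
def f1(p, r):
    # Harmonic mean of precision and recall, defined as 0 when both are 0.
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def strict(gold, pred):
    # A mention counts only if the predicted set equals the true set exactly.
    acc = sum(1 for t, y in zip(gold, pred) if t == y) / len(gold)
    return acc, acc, f1(acc, acc)

def loose_macro(gold, pred):
    # Per-mention precision and recall, averaged over all n mentions.
    n = len(gold)
    p = sum(len(t & y) / len(y) for t, y in zip(gold, pred) if y) / n
    r = sum(len(t & y) / len(t) for t, y in zip(gold, pred) if t) / n
    return p, r, f1(p, r)

def loose_micro(gold, pred):
    # Overlap counts pooled over all mentions before dividing.
    overlap = sum(len(t & y) for t, y in zip(gold, pred))
    n_pred = sum(len(y) for y in pred)
    n_gold = sum(len(t) for t in gold)
    p = overlap / n_pred if n_pred else 0.0
    r = overlap / n_gold if n_gold else 0.0
    return p, r, f1(p, r)

# Two toy mentions: T_i are the true tag sets, Y_i the predicted tag sets.
gold = [{"/person", "/person/politician"}, {"/location"}]
pred = [{"/person"}, {"/location", "/location/city"}]

strict_scores = strict(gold, pred)
macro_scores = loose_macro(gold, pred)
micro_scores = loose_micro(gold, pred)
```

Strict is the most sensitive of the three: a single missing or extra type makes the whole mention count as wrong, whereas the loose metrics give partial credit.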
Results. As shown in Figure 2, we evaluate our model on the dev and test sets after every 2k iterations and report the performance of the models that are stable on all metrics on the dev set. We evaluate a range of models rather than a single one because of the way the dev and test data were collected. We use $c_v = \sigma/\mu$, the coefficient of variation (Brown, 1998), to select and combine models in application. After an initial training stage, we compute $c_v$ for each of the three metrics for windows of 10,000 iterations whose start points have the form 4000 + 6000s. For a given window starting at iteration 2000t, we compute $c_v$ of the three metrics based on the six iterations 2000(t + i), 0 ≤ i ≤ 5. We select the range with the lowest average $c_v$; this was the interval [40000, 50000]; cf. Figure 2. Since train and test data are collected from different sources, the sensitive strict measure varies with a larger standard deviation than the other metrics.
Table 1 shows the performance of PthDCode on test, based on the interval [40000, 50000]; average and standard deviation are computed for 2000(20 + i), 0 ≤ i ≤ 5, as described above. PthDCode achieves clearly better results than the baseline methods FIGER (Ling and Weld, 2012), Yogatama et al. (2015) and Shimaoka et al. (2017) when trained on raw (i.e., not denoised) datasets of a similar size. The attentive encoder (Shimaoka et al., 2017) is a neural baseline for PthDCode; the comparison in Table 1 suggests that decoding paths in the hierarchy, rather than flat classification, significantly improves performance. The implementation of FIGER (Ling and Weld, 2012) by Ren et al. (2016), trained on the denoised corpus, performs better on the strict and loose micro metrics, but as the training data are different, the results are not directly comparable. An important observation in Table 1 is that most of the improved systems (Ren et al., 2016; Yogatama et al., 2015) consider entity classification in a hierarchical setup, either through denoising or through classification.
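The window selection by coefficient of variation can be sketched as below. This is only an illustration of the procedure: the function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say whether the population or sample estimate is used), and the dev scores are made-up numbers, not results from the paper.

```python
import statistics

def coefficient_of_variation(values):
    # c_v = sigma / mu; population standard deviation is an assumption here.
    return statistics.pstdev(values) / statistics.mean(values)

def select_window(scores, starts, step=2000, n_points=6):
    """Return (start, average c_v) of the most stable window.

    scores : dict mapping metric name -> dict mapping iteration -> dev score
    starts : candidate window start iterations
    Each window covers the n_points checkpoints start, start + step, ...
    """
    best = None
    for start in starts:
        iterations = [start + step * i for i in range(n_points)]
        cvs = [
            coefficient_of_variation([per_iter[it] for it in iterations])
            for per_iter in scores.values()
        ]
        average_cv = sum(cvs) / len(cvs)
        if best is None or average_cv < best[1]:
            best = (start, average_cv)
    return best

# Made-up dev curves that fluctuate early and settle later.
checkpoints = list(range(4000, 20001, 2000))
toy_curves = {
    "strict":      [0.50, 0.60, 0.52, 0.58, 0.57, 0.58, 0.58, 0.57, 0.58],
    "loose_macro": [0.70, 0.76, 0.71, 0.75, 0.74, 0.75, 0.75, 0.74, 0.75],
    "loose_micro": [0.68, 0.74, 0.69, 0.73, 0.72, 0.73, 0.73, 0.72, 0.73],
}
scores = {name: dict(zip(checkpoints, curve)) for name, curve in toy_curves.items()}

# Start points of the form 4000 + 6000s for s = 0, 1.
best_start, best_cv = select_window(scores, starts=[4000, 10000])
```

With these toy curves the later window is selected, since all three metrics vary less there than in the early window.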
One can also observe that our model achieves a relatively large increase in loose macro $F_1$. This is mostly because macro $F_1$ depends directly on average precision and average recall, and the latter is relatively high in our case because of the large improvement in recall. Table 2 shows that for level-wise comparisons on loose micro $F_1$, PthDCode improves recall compared to the precision-oriented system of Yogatama et al. (2015). We attribute this increase in recall and $F_1$ to the fact that PthDCode at each step collects feedback from the preceding level.
Table 3 shows, for some examples, which five words received the highest attention on level 1 (L1) and on level 2 (L2). The words are ordered from highest to lowest attention. We see that PthDCode attends to "from" for the location "Glasgow", but not for the organization "University of Glasgow". We also see that some words appear only on one of the two levels, e.g., for the mention "Glasgow", the context word "Glasgow" only appears on level 2. This indicates the benefit of level-wise attention. The last row shows an example of two types, /PEOP and /PEOP/Ethnc, that are correct but are not part of the gold standard, so we count them as errors.

Related work
Named entity recognition (NER) is the joint problem of entity mention segmentation and entity mention classification (Finkel et al., 2005; McCallum and Li, 2003). Most work on NER uses a small set of coarse-grained labels like person and location, e.g., MUC-7 (Chinchor and Robinson, 1998). Most work on the fine-grained FIGER (Ling and Weld, 2012) and HYENA (Yosef et al., 2012) taxonomies has cast NER as a two-step process (Elsner et al., 2009; Ritter et al., 2011; Collins and Singer, 1999) of entity mention segmentation followed by entity mention classification. The reason for the two-step approach is the high complexity of joint models for fine-grained entity recognition. A joint model like a CRF (Lafferty et al., 2001) has a state space corresponding to segmentation types times semantic types. Introducing a larger class set into joint models already increases the complexity of learning drastically; furthermore, the multilabel nature of fine-grained entity mention classification explodes the state space of the exponential model even further (Ling and Weld, 2012).
Utilizing fine-grained entity information enhances performance on tasks like named entity disambiguation (Yosef et al., 2012), relation extraction (Ling and Weld, 2012) and question answering (Lin et al., 2012; Lee et al., 2006). A major challenge for fine-grained entity mention classification is the scarcity of human-annotated datasets. Currently, most datasets are collected through distant supervision, utilizing Wikipedia texts with anchor links to obtain entity mentions and using knowledge bases like Freebase and YAGO to obtain candidate types for the mention. This introduces noise and complexities like unrelated labels, redundant labels and large candidate label sets. To address these challenges, Ling and Weld (2012) mapped Freebase types to their own tag set with 113 types, Yosef et al. (2012) derived a 505-subtype fine-grained taxonomy using the YAGO knowledge base, Gillick et al. (2014) devised heuristics to filter candidate types and, most recently, Ren et al. (2016) proposed a heterogeneous partial-label embedding framework to denoise candidate types by jointly embedding entity mentions, context features and the entity type hierarchy.
We address fine-grained entity mention classification in this paper. A related problem is fine-grained entity typing: the problem of predicting the complete set of types of the entity that a mention refers to (Yaghoobzadeh and Schütze, 2017). For the sentences "Obama was elected president" and "Obama graduated from Harvard in 1991", fine-grained entity mention classification should predict "politician" for the first and "lawyer" for the second. In contrast, given a corpus containing these two sentences, fine-grained entity typing should predict the types {"politician", "lawyer"} for "Obama".
A common approach for solving hierarchical problems has been flat classification, i.e., not making direct use of the hierarchy. But exploiting the hierarchical organization of the classes reduces complexity, makes better use of training data in learning and enhances performance. Gillick et al. (2014) addressed the entity classification problem with a hierarchical approach. Yosef et al. (2012) used a set of support vector machine classifiers corresponding to each node in the hierarchy and then postprocessed them during inference through a metaclassifier. Yogatama et al. (2015), using a kernel-enhanced WSABIE embedding method (Weston et al., 2011), learned an embedding for each type in the hierarchy and during inference filtered out predicted types that exceeded a threshold limit and did not fit into a path. Ren et al. (2016) showed that mapping a set of correlations, more specifically correlations of the types in the hierarchy, into an embedding space generates embeddings for mentions and types. These embeddings were then used for filtering the noisy candidate types and for denoising the training corpus. Ren et al. (2016) also showed that using the denoised corpus with the baseline methods of Ling and Weld (2012) and Yosef et al. (2012) enhanced the performance of those baseline methods significantly.
Recurrent neural networks (RNNs) have been a successful model for sequence modeling tasks. The introduction of RNN-based encoder-decoder architectures addressed the end-to-end sequence-to-sequence learning problem in a way that does not depend strongly on the lengths of the sequences. An attention mechanism was later added to the encoder-decoder architecture, and several subsequent methods used it to improve performance on a range of tasks, e.g., machine translation, image captioning (Xu et al., 2015), question answering (Kumar et al., 2016) and morphological reinflection (Kann and Schütze, 2016). Recently, Shimaoka et al. (2016) and Shimaoka et al. (2017) included attention-weighted contextual information in their logistic-classification-based entity classification model and showed improvement over traditional and non-attention-based LSTM models.
In this paper, we describe the first decoder for hierarchical classification. It is trained end-to-end to predict paths from root to leaf nodes and also leverages attention-weighted sums of hidden state vectors of context when predicting classes at each level of the hierarchy.

Conclusion
We introduced an entity mention classification model that learns to predict types from an entity type hierarchy using an encoder-decoder with a level-wise contextual attention mechanism. Compared to models trained in a comparable setting, we observe a clear improvement in performance at each level as well as in overall type hierarchy prediction; performance is also close to that of models trained on datasets that have been denoised. We attribute this good performance to the fact that our method is the first neural network model for hierarchical classification that can be trained end-to-end while taking into account the tree structure of the entity classes through direct modeling of paths in the hierarchy.