ToHRE: A Top-Down Classification Strategy with Hierarchical Bag Representation for Distantly Supervised Relation Extraction

Distantly Supervised Relation Extraction (DSRE) has proven effective at finding relational facts in text, but it still suffers from two main problems: the wrong labeling problem and the long-tail problem. Most existing approaches address these two problems through flat classification, which lacks the hierarchical information of relations. To leverage the informative relation hierarchies, we formulate DSRE as a hierarchical classification task and propose a novel hierarchical classification framework, which extracts the relation in a top-down manner. Specifically, in our proposed framework, 1) we use a hierarchically-refined representation method to achieve hierarchy-specific representation; 2) we introduce a top-down classification strategy instead of training a set of local classifiers. Experiments on the NYT dataset demonstrate that our approach significantly outperforms other state-of-the-art approaches, especially on the long-tail problem.


Introduction
Knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and NELL (Carlson et al., 2010) currently play an essential role in NLP tasks including information retrieval and question answering. As current KBs are still far from complete compared with the infinite real-world facts, relation extraction (RE), which aims to extract relations between two entities in texts and enrich KBs, has attracted a surge of research interest. Most existing supervised RE methods require high-quality labeled data, which is time-consuming and labor-intensive to obtain. Therefore, Mintz et al. (2009) propose the distant supervision (DS) approach to automatically generate a large amount of training data for RE by aligning KBs with large-scale unlabeled corpora.
Two problems have to be addressed in building an efficient DS model. The main problem, namely the wrong labeling problem, comes from the assumption underlying DS: if two entities have a relation in KBs, all sentences that contain these two entities will express this relation. This strong and unrealistic assumption inevitably introduces noise. For example, a DS method may mistakenly label the sentence "Steve Jobs passed away the day before Apple unveiled iPhone 4s in late 2011" with the relation business/company/founders. Secondly, although DS methods can automatically generate large-scale training data, this training data covers only a limited part of the relations. Nearly 70% of the relations in the widely used New York Times (NYT) dataset are long-tail and still suffer from data deficiency.
To alleviate the wrong labeling problem, Riedel et al. (2010) and Hoffmann et al. (2011) propose the multi-instance learning (MIL) framework, which assigns a label to a bag of sentences containing a common entity pair. Based on the MIL framework, many efforts (Zeng et al., 2015; Lin et al., 2016; Ye and Ling, 2019) have been devoted to identifying valid sentences from labeled bags. However, they ignore the hierarchical relation information. For example, the relations in Freebase are labeled as shown in Figure 1 (LEFT). Considering the correlations among relations, relations can also be naturally organized as a class hierarchy, typically a tree like the one shown in Figure 1 (RIGHT). Each layer in the relation hierarchy has its own semantic information. Recent work takes advantage of these relation hierarchies and proposes a hierarchical attention scheme that simultaneously addresses the noise problem and achieves state-of-the-art performance in extracting long-tail relations. Nevertheless, these methods are based on a flat classification approach and do not fully explore the informative relation hierarchies.
To leverage the inherent hierarchical structure of relations, we formulate DSRE as a hierarchical classification task, which extracts relations in a top-down manner. Intuitively, coarse-grained relations at the top level are easy to classify, whereas fine-grained relations at the bottom level are harder. In this way, we can first extract the relation at the top level and then use the top-level relation to boost the performance on the relation at the bottom level. There are two challenges in conducting hierarchical top-down classification: capturing specific bag representations at different levels and training a large number of classifiers. For the challenge of the bag representation, a bag of sentences expresses different relations at different levels. Hence, it is necessary to dynamically adjust the bag representations at different relation levels. To capture the hierarchy-specific bag representation, we propose a hierarchical bag representation method, which incorporates a hierarchically-refined selective attention mechanism to dynamically adjust the bag representation at different levels. For the challenge of massive classifiers, traditional hierarchical classification methods that adopt the top-down manner need to train a set of local classifiers (D'Alessio et al., 2000; Clare and King, 2003; Holden and Freitas, 2009). The number of local classifiers depends on the size of the label hierarchy, which makes such hierarchical classification difficult to scale. To handle the massive local classifiers problem, we introduce a top-down classification strategy that shares most of its parameters across levels to avoid training massive local classifiers, thus making our method applicable to various relation hierarchies.
We name our approach A Top-Down Classification Strategy for Hierarchical Relation Extraction (ToHRE). Our main contributions are summarized as follows:
• To the best of our knowledge, we are the first to explore the feasibility of hierarchical classification in Distantly Supervised Relation Extraction.
• We design a hierarchically-refined representation method to enhance the bag representation in different relation levels and a top-down classification strategy to avoid training massive local classifiers.
• We conduct thorough experiments on the widely-used NYT dataset and achieve significant improvements over state-of-the-art models, especially for long-tail relations.

Overview
Problem Definition. Following the MIL setting, we split all sentences into multiple entity-pair bags $\{B_{h_1,t_1}, B_{h_2,t_2}, \dots\}$. Each entity-pair bag $B_{h_i,t_i}$ contains $m$ sentences $\{s_1, s_2, \dots, s_m\}$ mentioning both entities $h_i$ and $t_i$. Each sentence is a sequence of tokens, i.e., $s = \{w_1, w_2, \dots, w_n\}$, where $n$ is the length of the sentence. Besides, we define the relation classes as $R = \{r_1, r_2, \dots\}$. Given an entity-pair bag and its two corresponding entities, previous works all focus on flat classification, which directly labels the bag with a predicted relation $r$ from the pre-defined relation classes $R$.
To leverage the inherent hierarchical structure of relations for hierarchical relation extraction, we define a relation hierarchy $H = (L, E)$ as a tree-structured hierarchy with a set of nodes (i.e., relations) $L$ and a set of edges $E$ indicating the relationship between a parent node and its child node. As illustrated in Figure 1, the leaf nodes in $H$ are made up of the pre-defined relation classes $R$. Hence, all leaf nodes are base-level relations (e.g., /people/person/place_of_birth). We generate the corresponding higher-level relations (e.g., /people/person and /people) as their parent nodes. Specifically, for a relation $r$ at a leaf node, we generate parent relations $k$ times to construct its hierarchical chain $r^0, r^1, \dots, r^k$, where $r^{i-1}$ is the parent relation of $r^i$. It is worth noting that $r^0$ is the virtual root relation and $r^k$ is the base-level relation, namely $r$.
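The construction of the hierarchical chain can be illustrated with a short Python sketch. It assumes that relation names are slash-separated Freebase-style paths, as in the examples above; the `ROOT` label and the function names `relation_chain` and `build_hierarchy` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

ROOT = "<root>"  # hypothetical label for the virtual root relation r^0


def relation_chain(relation):
    """Return the chain [r^0, r^1, ..., r^k] for a base-level relation.

    Assumes a slash-separated path such as "/people/person/place_of_birth".
    """
    parts = relation.strip("/").split("/")
    chain = [ROOT]
    for depth in range(1, len(parts) + 1):
        # Each prefix of the path is one level of the hierarchy.
        chain.append("/" + "/".join(parts[:depth]))
    return chain


def build_hierarchy(relations):
    """Map every parent node to the sorted list of its child nodes."""
    children = defaultdict(set)
    for rel in relations:
        chain = relation_chain(rel)
        for parent, child in zip(chain, chain[1:]):
            children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


hierarchy = build_hierarchy([
    "/people/person/place_of_birth",
    "/people/person/religion",
    "/location/country/capital",
])
```

Here `hierarchy[ROOT]` lists the top-level relations, and each further key lists the candidate child relations of one parent node.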
Different from previous methods, our proposed model aims to explore the relation hierarchy $H$ in a top-down manner and output the relation probability at each relation level. The entity-pair bag, along with the history of parent relations, is used to predict a relation from the pre-defined relation classes $R$.
Framework. Our model consists of two key components: a hierarchical bag representation module and a top-down classification strategy. The hierarchical bag representation module is shown in Figure 2. It takes an entity-pair bag as input and outputs the bag representations at different relation levels. First, each sentence in the entity-pair bag is transformed into a matrix with the entity-aware embedding. Then, a Piecewise Convolution layer (Zeng et al., 2015) is used to obtain the sentence representation. After that, a hierarchically-refined selective attention is leveraged to select the sentences in the bag that actually express the corresponding hierarchical relation and to output the hierarchy-specific bag representation. The top-down classification strategy is illustrated in Figure 3. The designed strategy takes the hierarchical bag representation and the corresponding relation embeddings as input to discriminate among the child relations, then walks down the relation hierarchy to predict the next relation until it reaches a leaf node. The details of the two components are introduced in Section 2.2 and Section 2.3, respectively.

Hierarchical Bag Representation
As mentioned in Section 2.1, the hierarchically-refined representation method aims to obtain bag representations at different relation levels. Specifically, given a bag of sentences $B_{h,t} = \{s_1, s_2, \dots, s_m\}$ and the two corresponding entities $w_h$ and $w_t$, we aim to obtain $k$-level bag representations $\mathbf{r}^1_{h,t}, \mathbf{r}^2_{h,t}, \dots, \mathbf{r}^k_{h,t}$.
Entity-Aware Embedding. For a sentence $s = \{w_1, w_2, \dots, w_n\}$ in the bag, each word $w_i$ is transformed into a low-dimensional dense vector, giving $[v_1, \dots, v_n] \in \mathbb{R}^{d_a \times n}$, where $d_a$ denotes the dimension of the word embedding. Besides, relative position (Zeng et al., 2015) is a crucial feature for relation extraction models: it specifies the target entity pair and makes the model pay more attention to the words close to the target entities. For the $i$-th word, the relative position features are represented by two dense vectors $p^{w_h}_i, p^{w_t}_i \in \mathbb{R}^{d_b}$. Previous works concatenate the word embedding and the position embeddings for the $i$-th word, i.e., $x_i = [v_i; p^{w_h}_i; p^{w_t}_i]$. In addition to relative position features, Li et al. (2020) verify that both the head entity embedding $v_h$ and the tail entity embedding $v_t$ are vitally important for the relation extraction task. They use a position-wise gate mechanism to dynamically select features between relative position embeddings and entity embeddings. To go a step further, we argue that the head entity and the tail entity are not equally important in extracting hierarchical relations. To this end, we propose an entity-aware embedding method to dynamically enhance entity information in the word representation. Formally, we denote the head-entity-enhanced embedding as $X^{(h)}$; the tail-entity-enhanced embedding is denoted as $X^{(t)}$ in a similar way. Then we use a selective mechanism to dynamically select features between $X^{(h)}$ and $X^{(t)}$.
The selective mechanism yields the input representation $X$ that is passed to the sentence encoder.
Piecewise Convolution Neural Network. We employ the Piecewise Convolution Neural Network (PCNN) as the sentence encoder to map the aforementioned input representation $X$ into a sentence representation. Note that our hierarchically-refined representation method is designed independently of the sentence encoder. Hence, other neural networks such as RNNs can be used as the sentence encoder in our approach instead. Since previous works show that PCNN achieves state-of-the-art performance and is time-efficient, we select it as the sentence encoder in this paper. PCNN mainly consists of two parts: the convolution layer and piecewise max-pooling.
The convolution layer slides a kernel of window size $l$ over the input representation $[x_1, x_2, \dots, x_n]$ and outputs the $d_c$-dimensional hidden embedding $h$, where $h \in \mathbb{R}^{d_c \times n}$ and $d_c$ is the number of feature maps.
Then, a piecewise max-pooling method (Zeng et al., 2015) is applied to the hidden embeddings. The hidden embedding $h$ is first divided into three parts $h^{(1)}, h^{(2)}, h^{(3)}$ by the positions of the head and tail entities. After that, we perform max-pooling on each part and concatenate the pooling results to get the final embedding $u$:
$$u = [\max(h^{(1)}); \max(h^{(2)}); \max(h^{(3)})]$$
where $u \in \mathbb{R}^{3 d_c}$ is the final sentence representation.
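As an illustration of the pooling step, the following NumPy sketch splits a hidden embedding at the two entity positions and max-pools each part. The exact segment boundaries (whether an entity token closes the segment before it) and the zero fallback for an empty segment are assumptions of this sketch, not details given in the text; the function name `piecewise_max_pool` is hypothetical.

```python
import numpy as np


def piecewise_max_pool(h, head_pos, tail_pos):
    """Pool a hidden embedding h of shape (d_c, n) into u of shape (3 * d_c,).

    head_pos and tail_pos are the token indices of the two entities.
    """
    first, second = sorted((head_pos, tail_pos))
    # Assumed boundaries: [0, first], (first, second], (second, n).
    segments = [
        h[:, : first + 1],
        h[:, first + 1 : second + 1],
        h[:, second + 1 :],
    ]
    pooled = []
    for seg in segments:
        if seg.shape[1] == 0:
            # Assumption: an empty segment contributes zeros.
            pooled.append(np.zeros(h.shape[0]))
        else:
            pooled.append(seg.max(axis=1))
    return np.concatenate(pooled)


h_example = np.arange(12, dtype=float).reshape(2, 6)  # d_c = 2, n = 6
u_example = piecewise_max_pool(h_example, head_pos=1, tail_pos=3)
```

The output has three times as many entries as there are feature maps, matching the stated dimension of $u$.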
Hierarchically-Refined Selective Attention. Given a bag of sentences $B_{h,t} = \{s_1, s_2, \dots, s_m\}$, we have already obtained the sentence embeddings $U_{h,t} = \{u_1, u_2, \dots, u_m\}$ through the input representation and the PCNN encoder. For a relation $r \in R$, we generate its $k$-level chain of parent relations $r^0, r^1, \dots, r^k$. In this section, we aim to identify valid sentences at different relation levels and obtain hierarchy-specific bag representations, i.e., $\mathbf{r}^1_{h,t}, \mathbf{r}^2_{h,t}, \dots, \mathbf{r}^k_{h,t}$.
We propose a hierarchically-refined selective attention mechanism to identify valid sentences in the $k$-level relation hierarchy. Specifically, the bag representation at level $j$ ($1 \le j \le k$) is formulated as a combination of the sentence embeddings $u_i$ weighted by $\beta^j_i$, the hierarchically-refined attention weight of the $i$-th sentence at the $j$-th relation level. The weight $\beta^j_i$ is in turn defined through $e^j_i$, a query-based function that scores how well the input sentence $u_i$ matches the $j$-th level of the predicted relation $r$. To compute $e^j_i$, we use $q_{r^j}$, the layer-wise query vector associated with the $j$-th level of relation $r$, together with a weighted diagonal matrix $A$. Finally, the hierarchical bag representation $\mathbf{r}^1_{h,t}$, $\mathbf{r}^2_{h,t}$, ...
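A minimal NumPy sketch of level-wise selective attention is given below. It assumes the standard selective-attention form of Lin et al. (2016): a bilinear score $e^j_i = u_i A q_{r^j}$, a softmax over the sentences of the bag, and a weighted sum of sentence embeddings. Any additional refinement of the weights across levels used in the actual model is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def level_bag_representations(U, A_diag, queries):
    """Compute one bag representation per relation level.

    U:       (m, d) sentence embeddings of one bag
    A_diag:  (d,)   diagonal of the weight matrix A
    queries: (k, d) one query vector per level of the relation chain
    Returns (bag_reps of shape (k, d), weights of shape (k, m)).
    """
    # Assumed score: e_i^j = u_i A q_{r^j}, computed for all i and j at once.
    scores = (U * A_diag) @ queries.T  # shape (m, k)
    weights = np.stack(
        [softmax(scores[:, j]) for j in range(queries.shape[0])]
    )
    bag_reps = weights @ U  # weighted sum of sentence embeddings per level
    return bag_reps, weights


U_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
q_demo = np.array([[5.0, 0.0], [0.0, 5.0]])
reps_demo, w_demo = level_bag_representations(U_demo, np.ones(2), q_demo)
```

In this toy bag, the first level attends mostly to the first sentence and the second level mostly to the second, which is the behavior the hierarchy-specific representation is meant to allow.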

Top-Down Classification Strategy
In this section, a novel top-down classification strategy for DSRE is proposed to explore the relation hierarchy H step by step. We first define some notations used and then introduce the novel strategy in detail.
Definition. For each relation in the hierarchy $H$, the relation embedding $l \in \mathbb{R}^{C}$ is randomly initialized and updated during training. At level $j$, we define the current parent node as $r^j$ and the child nodes of $r^j$ as $C(r^j)$. Furthermore, the embedding matrix consisting of the embeddings of $C(r^j)$ is denoted as $C_j \in \mathbb{R}^{|C(r^j)| \times C}$.
Top-Down Manner. The proposed top-down manner starts from the virtual root node, goes down the hierarchy one level at a time, and stops at a leaf node. Notably, only the ground-truth hierarchical relations are visited during the training phase, e.g., for a bag labeled with relation $r$, only the nodes in its hierarchical chain $r^0, r^1, \dots, r^k$ are visited. In the testing phase, by contrast, the proposed top-down manner visits each node in the relation hierarchy $H$ and outputs local probabilities for each node.
Local Classification Strategy. At each level $j$, we aim to conduct local classification, which outputs the probability distribution over $C(r^j)$, i.e., the child node probabilities of $r^j$. Traditional methods train a massive set of classifiers, one for each node or each parent node in the relation hierarchy $H$, which makes them difficult to scale. Inspired by Mao et al. (2019), who propose a top-down supervised method to pre-train a reinforcement learning model for hierarchical classification, we design a local classification strategy for DSRE that calculates the matching score between the hierarchical bag representation and the candidate embedding matrix.
Specifically, when conducting local classification at level $j$, the embedding of the current relation $l_j$ and the bag representation $\mathbf{r}^{j+1}_{h,t}$ are concatenated and projected to a hidden state vector $s_j \in \mathbb{R}^{C}$ via a two-layer feed-forward network (FFN). Then the candidate embedding matrix $C_j$ is multiplied with the hidden state vector $s_j$ to obtain the local probability $p(r^{j+1} \mid r^j, \mathbf{r}^{j+1}_{h,t}, \theta)$, i.e., the probability of the ground-truth child relation of $r^j$. The aforementioned process is formulated as:
$$s_j = \mathrm{FFN}([l_j; \mathbf{r}^{j+1}_{h,t}])$$
$$p(r^{j+1} \mid r^j, \mathbf{r}^{j+1}_{h,t}, \theta) = \mathrm{softmax}(C_j s_j)$$
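The local classification step can be sketched in NumPy as follows. The hidden size, the tanh activation, and the parameter names `W1`, `b1`, `W2`, `b2` are assumptions of this sketch; the text only specifies a two-layer feed-forward network followed by a softmax over the product of the candidate embedding matrix and the hidden state.

```python
import numpy as np


def local_child_distribution(l_j, bag_rep, W1, b1, W2, b2, C_j):
    """Return the probability distribution over the children of r^j.

    l_j:     (C,)          embedding of the current parent relation
    bag_rep: (d,)          bag representation for level j + 1
    W1, b1:  first layer   (hidden, C + d), (hidden,)
    W2, b2:  second layer  (C, hidden), (C,)
    C_j:     (n_children, C) embeddings of the candidate child relations
    """
    x = np.concatenate([l_j, bag_rep])
    hidden = np.tanh(W1 @ x + b1)      # activation choice is an assumption
    s_j = W2 @ hidden + b2             # hidden state vector in R^C
    logits = C_j @ s_j                 # one matching score per child
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


rng = np.random.default_rng(0)
C_dim, d_dim, hidden_dim, n_children = 4, 6, 5, 3
p_demo = local_child_distribution(
    rng.normal(size=C_dim),
    rng.normal(size=d_dim),
    rng.normal(size=(hidden_dim, C_dim + d_dim)),
    np.zeros(hidden_dim),
    rng.normal(size=(C_dim, hidden_dim)),
    np.zeros(C_dim),
    rng.normal(size=(n_children, C_dim)),
)
```

Because only `C_j` changes from node to node, the feed-forward parameters are shared across all levels, which is what avoids training one classifier per node.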

Training and Testing
Here we introduce the learning and optimization details of our model. During the training process, we minimize the cross-entropy loss function. Given the collection of entity-pair bags $B = \{B_{h_1,t_1}, B_{h_2,t_2}, \dots\}$ and the corresponding labeled relations $\{r_1, r_2, \dots\}$, we define the hierarchical loss as the cross-entropy of the local classifications along the hierarchical chain of each labeled relation plus the regularization term $\lambda \|\theta\|_2^2$, where $\lambda$ is a harmonic factor and $\|\theta\|_2^2$ is the $L_2$ regularizer. All models are optimized using stochastic gradient descent (SGD).
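A hedged sketch of such a loss for a single bag is shown below. It assumes an unweighted sum of per-level negative log-likelihoods along the ground-truth chain plus an $L_2$ penalty; the exact form used in the paper, including any per-level weighting, is not recoverable from the text, and the function name `hierarchical_loss` is hypothetical.

```python
import numpy as np


def hierarchical_loss(level_probs, gold_children, params, lam):
    """Cross-entropy along the gold chain plus an L2 penalty.

    level_probs:   list of 1-D arrays, the local distribution at each level
    gold_children: list of ints, index of the ground-truth child per level
    params:        list of parameter arrays (theta)
    lam:           regularization factor (lambda)
    """
    # Assumption: every level contributes equally to the loss.
    nll = -sum(np.log(p[g]) for p, g in zip(level_probs, gold_children))
    l2 = sum(np.sum(w ** 2) for w in params)
    return nll + lam * l2


loss_demo = hierarchical_loss(
    level_probs=[np.array([0.5, 0.5]), np.array([0.25, 0.75])],
    gold_children=[0, 1],
    params=[np.array([1.0, 2.0])],
    lam=0.1,
)
```

Note that the exponential of the negated cross-entropy part equals the product of the per-level probabilities along the chain.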
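In the testing phase, we output the final probability of relation $r$ for $B_{h,t}$ by multiplying the probabilities along its hierarchical chain of parent relations:
$$p(r \mid B_{h,t}, \theta) = \prod_{j=0}^{k-1} p(r^{j+1} \mid r^{j}, \mathbf{r}^{j+1}_{h,t}, \theta)$$

Experiments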

Dataset and Evaluation Metrics
We evaluate our proposed model on the New York Times (NYT) dataset (Riedel et al., 2010), which has been widely used for distantly supervised relation extraction (Zeng et al., 2015; Lin et al., 2016). This dataset is constructed by aligning relation facts in Freebase with the New York Times corpus. There are 52 actual relations and a special NA relation indicating that there is no relation between two entities. Specifically, we put the NA relation as a leaf node in $H$ and directly link it to the virtual root node. Following the previous works of Zeng et al. (2015) and Lin et al. (2016), we evaluate our model using the held-out evaluation. In the held-out evaluation, the relations extracted from test data are automatically compared with those in Freebase. It is an approximate measure of the model that does not require costly human evaluation. We report the precision-recall (PR) curves, top-N precision (P@N) and Hits@K accuracy in our experiments.
Figure 4: Precision-recall curves.

Setup
We use the pre-trained word embeddings released by Lin et al. (2016) for initialization. The vocabulary contains the words that appear more than 100 times in the NYT corpus. We apply the dropout strategy (Srivastava et al., 2014) to the hidden state vector $s_j$ to prevent overfitting. Besides, we pre-train the PCNN before training our hierarchical classification model. Table 1 shows all the hyper-parameters used in our experiments.

Overall Performance
Precision-Recall Curves. We compare the precision-recall curves of our model with several major baselines in Figure 4 to evaluate the overall performance. We report the results of previous RE models that adopt the PCNN architecture as the sentence encoder: the original PCNN and its variants with different attention-based modules. More specifically, +ONE (Zeng et al., 2015) indicates the original PCNN under the MIL setting; +ATT (Lin et al., 2016) is a plain selective attention over sentences; +ATT+SL (Liu et al., 2017) combines the ATT scheme with a soft-label method to address bag-level noise; +HATT leverages relation hierarchies to calculate a coarse-to-fine grained attention and is the principal baseline of our work; ToHRE is the abbreviation of our proposed framework. As shown in Figure 4, our model achieves the best result among all attention-based models. Even compared with PCNN+HATT, our model achieves higher precision over most of the recall range. This indicates the ability of our proposed model to handle the noise problem in the RE task.
P@N. To further verify the effectiveness of our proposed model, we compare it with previous state-of-the-art approaches on top-N precision. We briefly introduce these SOTA models: RESIDE (Vashishth et al., 2018) utilizes the available side information from knowledge bases, including entity types and relation alias information; DISTRE (Alt et al., 2019) is a Transformer that incorporates an attentive selection mechanism for the multi-instance learning scenario; PCNN+BAG ATT (Ye and Ling, 2019) combines intra-bag and inter-bag attention to cope with the noisy sentence and noisy bag problems in DSRE and achieves SOTA performance in terms of the top-N precision metric. The evaluation results are shown in Table 2. It can be observed that: (1) ToHRE outperforms previous methods in most cases from P@100 to P@2000, indicating that our model performs consistently. (2) ToHRE has the highest precision at P@100, which is critical to NLP tasks such as Knowledge Base Completion that need reliable top-100 predictions to obtain high-quality relational triples.
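For reference, top-N precision can be computed as in the sketch below. It assumes that every predicted fact is available as a pair of a confidence score and a flag saying whether the fact is found in the knowledge base; the function name `precision_at_n` and the input layout are hypothetical.

```python
def precision_at_n(scored_facts, n):
    """Fraction of correct facts among the n highest-scoring predictions.

    scored_facts: list of (score, is_correct) pairs, one per predicted fact.
    """
    ranked = sorted(scored_facts, key=lambda item: item[0], reverse=True)[:n]
    if not ranked:
        return 0.0
    return sum(1 for _, is_correct in ranked if is_correct) / len(ranked)


facts_demo = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
p_at_2 = precision_at_n(facts_demo, 2)
```

In the held-out setting, correctness is judged automatically against Freebase, so the resulting values are approximate.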

Evaluation Result for Long-tail Relations
To demonstrate the improvements in performance for long-tail relations, we follow previous work and evaluate on subsets of the test data whose relations have fewer than 100/200 training instances. The Hits@K metric is used for evaluation: for each entity pair, the golden relation must appear among the first K candidate relations recommended by the model. Following previous work, we select K from {10, 15, 20} and report the macro-average Hits@K accuracies on these subsets. We compare our model with PCNN+HATT, which is the first work to evaluate long-tail relations under such settings, and PCNN+KATT, which follows prior work and achieves SOTA performance in long-tail relation extraction. From the evaluation results in Table 3, we observe that ToHRE improves on the previous SOTA approach by a large margin, i.e., 27.6% and 26.5% in Hits@10 for the two training-instance thresholds, and performs consistently on the Hits@15 and Hits@20 settings. The result indicates that relation hierarchies are better exploited in our hierarchical classification framework than in previous methods. Although ToHRE improves performance on long-tail relation extraction, the results of all methods are still far from satisfactory, and more sophisticated models are required to handle this problem.
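The macro-average Hits@K described above can be sketched as follows. The sketch assumes that the macro average is taken over relations (each relation contributes equally regardless of how many test examples it has); the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def macro_hits_at_k(examples, k):
    """Macro-averaged Hits@K over relations.

    examples: list of (gold_relation, ranked_candidates), where
              ranked_candidates is sorted from most to least likely.
    """
    hits = defaultdict(list)
    for gold, ranked in examples:
        hits[gold].append(1.0 if gold in ranked[:k] else 0.0)
    # Assumption: average within each relation first, then across relations.
    per_relation = [sum(v) / len(v) for v in hits.values()]
    return sum(per_relation) / len(per_relation)


examples_demo = [
    ("a", ["a", "b", "c"]),
    ("a", ["b", "c", "a"]),
    ("b", ["c", "a", "b"]),
]
hits_at_1 = macro_hits_at_k(examples_demo, 1)
```

Averaging per relation first prevents a few frequent relations from dominating the score, which matters when the evaluated relations are long-tail.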

Ablation Study
We conduct an ablation study to verify the effectiveness of the entity-aware word embedding module.
To this end, we denote by ToHRE w/o EW the variant that uses the position-aware word embedding (Zeng et al., 2015) instead of the proposed entity-aware embedding. From the corresponding P@N results shown in Table 2, we observe that the prediction results drop noticeably without the entity-aware word embedding, with a 10.3% decrease at P@100 in particular. This indicates that the proposed entity-aware word embedding method can successfully capture the correlations between each word and the two corresponding entities.

Case study
In this section, we conduct a case study to show the prediction process of our framework. Table 4 shows the predicted scores at different relation levels for $B_1$ and $B_2$. $B_1$ contains two sentences, the second of which does not express the labeled relation /location/country/administrative_divisions. However, we can observe that our model assigns high scores at all relation levels despite the noisy sentence, which shows that our hierarchically-refined selective attention can alleviate the noise problem at different levels. $B_2$ is labeled with the relation /people/person/religion, which has few training instances. Previous flat models cannot extract such relations due to data deficiency. However, our model can successfully extract the data-rich relations at the top levels, i.e., /people and /people/person. Even though the score for the base-level relation /people/person/religion is low, we combine it with the high top-level scores to assign the bag an overall probability.
Table 4: A case study where the entities are marked in bold. $P_1$, $P_2$ and $P_3$ show the predicted relation probability at level 1, level 2 and level 3, respectively.

Related Work

Distantly Supervised Relation Extraction
Distant supervision, proposed by Mintz et al. (2009), is an efficient approach that automatically generates large-scale training data for the RE task. However, the training data generated by DS usually contains many wrongly labeled sentences and suffers from the long-tail problem. To alleviate the wrong labeling problem, Riedel et al. (2010) and Hoffmann et al. (2011) propose a MIL framework, where training sentences are arranged in bags and a label is provided for a bag of sentences instead of each sentence individually. This framework is well-suited for the DS setting, and thus many works adopt it to select informative sentences from the noisy bags. For example, Zeng et al. (2015) propose PCNN to automatically extract features from sentences and select the most important sentence. Attention mechanisms (Lin et al., 2016; Lei et al., 2018; Yuan et al., 2019; Ye and Ling, 2019) are investigated to assign high attention to informative sentences. Reinforcement learning (Zeng et al., 2018; Feng et al., 2018; Qin et al., 2018) is adopted to filter the noisy sentences. As for the long-tail problem in DS, relation hierarchies have been leveraged to calculate a coarse-to-fine attention for better extracting long-tail relations. Other work takes advantage of the knowledge from data-rich relations at the head of the distribution to boost the performance of the data-poor relations in the tail. However, most previous works formulate DSRE as a flat classification problem, which does not fully exploit the inherent hierarchical structure of relations. Indeed, hierarchical classification has been widely used in other tasks, such as text classification (Gopal and Yang, 2013; Mao et al., 2019), question answering (Qu et al., 2012) and online advertising (Agrawal et al., 2013), and has demonstrated its effectiveness for hierarchical structures. It is notable that in some works, such as those of Mao et al. (2019) and Silla and Freitas (2010), flat classification is also regarded as a special case of hierarchical classification in which only the label at the leaf node is predicted. To distinguish our model from previous works, however, we consider flat classification and hierarchical classification as two independent methods in this paper. To the best of our knowledge, we are the first to conduct hierarchical classification for the DSRE task.

Local Approach vs. Global Approach for Classification
We group hierarchical classification methods into two categories based on how the hierarchy is explored: local and global approaches. Local approaches explore the hierarchy in a top-down manner by training a set of local classifiers for each node (D'Alessio et al., 2000), each parent node (Holden and Freitas, 2009), or each level in the hierarchy (Clare and King, 2003). The disadvantage of local approaches is that the number of local classifiers depends on the size of the label hierarchy, which makes them difficult to scale. The global approaches (Kiritchenko et al., 2005; Cai and Hofmann, 2004) train a single classifier for all classes in the hierarchy. Compared with local classifiers, global classifiers have received less research attention due to the complexity of the problem (Silla and Freitas, 2010).

Conclusion
In this paper, we take advantage of the inherent hierarchical structure of relations and propose a top-down classification strategy with a hierarchical bag representation. In this way, we formulate DSRE as a hierarchical classification task. The experimental results indicate that our proposed model outperforms previous state-of-the-art flat methods, especially for long-tail relations. In the future, we plan to explore the following directions: (1) incorporating entity information from external knowledge graphs to enhance the hierarchical bag representation; (2) utilizing more sophisticated training strategies for the long-tail relation problem.