Hierarchical Relation Extraction with Coarse-to-Fine Grained Attention

Distantly supervised relation extraction employs existing knowledge graphs to automatically collect training data. While distant supervision is effective in scaling relation extraction up to large-scale corpora, it inevitably suffers from the wrong labeling problem. Many efforts have been devoted to identifying valid instances from noisy data. However, most existing methods handle each relation in isolation, regardless of the rich semantic correlations embodied in relation hierarchies. In this paper, we aim to incorporate the hierarchical information of relations into distantly supervised relation extraction and propose a novel hierarchical attention scheme. The multiple layers of our hierarchical attention scheme provide coarse-to-fine granularity to better identify valid instances, which is especially effective for extracting long-tail relations. The experimental results on a large-scale benchmark dataset demonstrate that our models are capable of modeling the hierarchical information of relations and significantly outperform other baselines. The source code of this paper can be obtained from https://github.com/thunlp/HNRE.


Introduction
Relation extraction (RE) aims to predict relational facts from plain text. Conventional supervised RE models (Zelenko et al., 2003; Mooney and Bunescu, 2006) usually suffer from the lack of high-quality training data, because manual labeling of training data is time-consuming and human-intensive. Mintz et al. (2009) propose distant supervision to automatically label training instances by aligning existing knowledge graphs (KGs) with text: for an entity pair in KGs, the sentences containing both entities are labeled with the corresponding relation of the entity pair in KGs. RE relies on distant supervision to scale up to large-scale training corpora. However, this automatic mechanism is inevitably accompanied by the wrong labeling problem, because not all sentences containing two entities exactly express their relation in KGs; e.g., we may mistakenly label "Bill Gates retired from Microsoft" with the relation business/company/founders.
To alleviate the wrong labeling problem, many efforts (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015) have been devoted to identifying valid instances from noisy data, especially the recent state-of-the-art attention-based methods (Lin et al., 2016; Ji et al., 2017; Wu et al., 2017). Nevertheless, most existing methods handle each relation in isolation. For each relation, there is often a separate model (e.g., a neural attention scheme) to select relation-related informative instances from noisy data, regardless of the rich semantic correlations among relations, typically embodied in relation hierarchies.
We take the KG Freebase (Bollacker et al., 2008) as an example, in which relations are labeled as hierarchical structures. For example, the relation /location/province/capital in Freebase indicates the relation between a province and its capital. It is labeled under the location branch. Under this branch, there are some other relations /location/location/contains and /location/country/capital, which are closely correlated to each other. The rich correlations among relations are well revealed by these relation hierarchies. In fact, McCallum et al. (1998) take advantage of hierarchies of classes to improve classification models and inspire many later models (Rousu et al., 2005; Weinberger and Chapelle, 2009). Furthermore, the hierarchical information of entities in KGs has also been utilized and demonstrated to be effective for model enhancement (Hu et al., 2015; Xie et al., 2016).
To take advantage of the rich correlated information among relations, we propose a novel hierarchical attention scheme that utilizes the relation hierarchies, rather than directly using hierarchical information as features for models. Similar to conventional attention-based methods, our method computes an attention score for each instance according to its significance in expressing the corresponding relation. The key difference is that, as illustrated in Figure 1, our hierarchical attention scheme follows the relation hierarchies to compute scores for the instances containing the same entity pair on each layer of the hierarchies.
The hierarchical attention scheme provides coarse-to-fine granularity for identifying valid instances. The attention on the bottom layer can capture the more specific features of a relation, giving it a fine-grained instance selection ability comparable to conventional attention-based methods. The attention on the top layer can capture the common features shared by several related sub-relations, which provides coarse-grained instance selection. Since there are more abundant data for training the top-layer attention, the whole hierarchical attention scheme can enhance RE models on long-tail relations.
We conduct experiments on a large-scale benchmark dataset for RE in this paper. The experimental results show that the proposed coarse-to-fine grained attention scheme based on relation hierarchies significantly outperforms other baseline methods, even compared to the recent state-of-the-art attention-based models, especially for those long-tail relations.

Related Works
Supervised models (Zelenko et al., 2003; Zhou et al., 2005; Mooney and Bunescu, 2006) for RE require adequate amounts of annotated data for training, and it is time-consuming to manually label large-scale training data. Hence, Mintz et al. (2009) propose distant supervision to automatically label data. Distant supervision is inevitably accompanied by the wrong labeling problem. To alleviate the noise caused by distant supervision, Riedel et al. (2010) and Hoffmann et al. (2011) propose multi-instance learning (MIL) mechanisms. Riedel et al. (2013) propose universal schema to transmit information between KG relations and textual patterns to enhance extraction performance.
These early RE methods mainly extract semantic features with NLP tools to build relation classifiers. Recently, neural models have been widely used for RE; they can accurately capture textual relations without explicit linguistic analysis (Zeng et al., 2014; Xu et al., 2015; Santos et al., 2015; Zhang and Wang, 2015). Zeng et al. (2015) employ the MIL scheme by selecting the single most likely valid instance for distantly supervised neural relation extraction (NRE), whose denoising capability is far from satisfactory because most informative instances are neglected. Lin et al. (2016) and Zhang et al. (2017) propose neural attention schemes to select the informative instances. To further improve the attention performance, some works incorporate knowledge information (Zeng et al., 2017; Ji et al., 2017; Han et al., 2018) and advanced training strategies (Huang and Wang, 2017). More sophisticated mechanisms, such as reinforcement learning (Feng et al., 2018; Zeng et al., 2018) and adversarial training (Wu et al., 2017), have also been adapted for RE recently.
However, most existing works model each relation in isolation when identifying informative instances, neglecting the rich correlations among relations, especially the hierarchical information of those relations. Hierarchical information is widely applied for model enhancement, especially for classification models (McCallum et al., 1998; Rousu et al., 2005; Weinberger and Chapelle, 2009; Zhao et al., 2011; Bi and Kwok, 2011; Zhou et al., 2011; Verma et al., 2012). Many efforts have also been devoted to utilizing hierarchical information in KGs. Leacock and Chodorow (1998) and Ponzetto and Strube (2007) adopt hierarchical information derived from KGs to measure concept relatedness. Morin and Bengio (2005) propose a neural language model that utilizes hierarchical information in WordNet. Further, Hu et al. (2015) learn entity representations by considering the whole entity hierarchy of Wikipedia and inspire many works (Krompaß et al., 2015; Xie et al., 2016) that utilize hierarchical type structures to help the representation learning of KGs.
Different from recent hierarchical models, which mainly focus on entity hierarchies and directly use hierarchical information as simple features, we incorporate relation hierarchies to build a hierarchical attention scheme with coarse-to-fine granularity to enhance RE performance. Compared with existing RE models, our models can take advantage of relation correlations to better identify informative instances, especially for long-tail relations, by transferring knowledge from their high-frequency related relations.

Methodology
In this section, we introduce the overall framework of our hierarchical attention for RE, starting with notations and definitions.

Notations
We denote a KG as G = {E, R, F}, where E, R and F indicate the sets of entities, relations and facts respectively. (h, r, t) ∈ F indicates that there is a relation r ∈ R between h ∈ E and t ∈ E. We follow the MIL setting and split all instances into multiple entity-pair bags {S_{h_1,t_1}, S_{h_2,t_2}, . . .}. Each bag S_{h_i,t_i} contains multiple instances {s_1, s_2, . . .} mentioning both the entities h_i and t_i. The distant supervision mechanism labels each bag with the corresponding relation of the mentioned entity pair. Each instance s in these bags is denoted as a word sequence s = {w_1, w_2, . . .}.
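The MIL grouping above can be sketched as follows; the (head, tail, sentence) tuple layout is a hypothetical data format chosen for illustration, not the format used in the released code:

```python
from collections import defaultdict

def build_bags(instances):
    """Group distantly supervised instances into entity-pair bags.

    `instances` is a list of (head, tail, sentence) triples; the MIL
    setting collects every sentence mentioning the same (head, tail)
    pair into one bag S_{h,t}.
    """
    bags = defaultdict(list)
    for head, tail, sentence in instances:
        bags[(head, tail)].append(sentence)
    return dict(bags)
```

Distant supervision would then label each bag with the KG relation of its entity pair.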

Framework
Given an entity pair (h, t) and its entity-pair bag S_{h,t}, we adopt our models to measure the probability of each relation r ∈ R holding between the pair. As shown in Figure 2, the overall framework of our models includes a sentence encoder and a coarse-to-fine grained hierarchical attention. The sentence encoder adopts convolutional neural networks to represent sentence semantics with embeddings, and the hierarchical attention is used to select those informative instances that exactly express the relations.
For each instance s_i ∈ S_{h,t}, we use the sentence encoder to represent its semantic information as an embedding s_i. The details of the sentence encoder will be introduced in Section 3.3. Since not all instances in the bag S_{h,t} positively express the relation between h and t, we apply the hierarchical attention to compute an instance weight α_i for each instance s_i. The details of the hierarchical attention will be introduced in Section 3.4. We build the global textual relation representation r_{h,t} as the weighted sum of the instance embeddings,

r_{h,t} = Σ_i α_i s_i, (1)

where α_i is the instance weight for the i-th instance embedding s_i. By taking r_{h,t} as the textual relation representation of the entity pair (h, t), we estimate its probability over each relation r ∈ R, i.e., whether there is a specific relation r between h and t. We define the conditional probability as

P(r|h, t, S_{h,t}) = exp(o_r) / Σ_{r̃∈R} exp(o_{r̃}), (2)

where o is the vector of scores over all relations,

o = M r_{h,t},

with M the representation matrix used to calculate the relation scores.
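A minimal sketch of Eq. 1 and Eq. 2, assuming plain Python lists for embeddings and a score matrix M with one row per relation; the attention weights α_i are taken as given here:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def relation_probs(instance_embs, alphas, M):
    """Eq. 1 and Eq. 2 sketch: r_{h,t} = sum_i alpha_i s_i,
    o = M r_{h,t}, P(r|h,t,S) = softmax(o)."""
    dim = len(instance_embs[0])
    # weighted sum of instance embeddings (Eq. 1)
    r = [sum(a * s[j] for a, s in zip(alphas, instance_embs)) for j in range(dim)]
    # one score per relation, then softmax (Eq. 2)
    o = [sum(row[j] * r[j] for j in range(dim)) for row in M]
    return softmax(o)
```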

Sentence Encoder
Given an instance s containing two entities, we apply several neural architectures to encode the instance into its corresponding embedding s.

Input Layer
The input layer of the sentence encoder aims to embed both the semantic information and the positional information of words into their input embeddings.

Word Embedding is proposed by Hinton (1986), aiming to transform words into distributed representations that capture their syntactic and semantic meanings. Given a sentence s consisting of multiple words s = {w_1, . . . , w_n}, we adopt Skip-Gram (Mikolov et al., 2013) to learn a k_w-dimensional word embedding w_i for each word.

Position Embedding is proposed by Zeng et al. (2014). It is used to embed the relative distances of each word to the two entities into two k_p-dimensional vectors. By concatenating the distance embeddings of the current word w_i to both the head and tail entities, we get a unified position embedding p_i ∈ R^{k_p×2}.
For each word w_i, we concatenate its word embedding w_i and position embedding p_i to build its input embedding x_i = [w_i; p_i].
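The input layer can be sketched as follows, assuming small lookup tables (dicts) for word embeddings and for position embeddings keyed by relative distance; names and shapes are illustrative, and distance clipping is omitted:

```python
def input_embeddings(words, head_idx, tail_idx, word_vecs, pos_vecs):
    """Build x_i = [w_i; p_i]: concatenate the word embedding with the
    two position embeddings indexed by the relative distance of w_i to
    the head and tail entity positions."""
    xs = []
    for i, w in enumerate(words):
        p_head = pos_vecs[i - head_idx]   # distance to head entity
        p_tail = pos_vecs[i - tail_idx]   # distance to tail entity
        xs.append(word_vecs[w] + p_head + p_tail)  # list concatenation
    return xs
```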

Encoding Layer
The encoding layer aims to compose the input embeddings of the given instance into its corresponding instance embedding. In this paper, we choose two convolutional neural architectures, CNN (Zeng et al., 2014) and PCNN (Zeng et al., 2015), to encode input embeddings into instance embeddings.
Other neural architectures, such as recurrent neural architectures (Zhang and Wang, 2015), can also be used as sentence encoders. Since previous works show that both convolutional and recurrent architectures achieve comparable state-of-the-art performance, we simply select convolutional architectures in this paper. Note that our hierarchical attention scheme is designed independently of the encoder choice, so it can easily be adapted to fit other encoder architectures.
CNN slides a convolution kernel with window size m over the input sequence {x_1, . . . , x_n} to get the k_h-dimensional hidden embeddings {h_1, . . . , h_n}. A max-pooling is then applied over these hidden embeddings to output the final instance embedding s,

[s]_j = max_i [h_i]_j,

where [·]_j is the j-th value of a vector.

PCNN is an extension of CNN, which also adopts a convolution kernel to obtain hidden embeddings. Then, a piecewise max-pooling is applied over the hidden embeddings,

[s^{(1)}]_j = max_{1 ≤ i ≤ i_1} [h_i]_j,
[s^{(2)}]_j = max_{i_1 < i ≤ i_2} [h_i]_j,
[s^{(3)}]_j = max_{i_2 < i ≤ n} [h_i]_j,

where [·]_j is the j-th value of a vector, and i_1 and i_2 are the entity positions. The final instance embedding s is obtained by concatenating these three pooling results,

s = [s^{(1)}; s^{(2)}; s^{(3)}].
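The two pooling strategies can be sketched in a few lines, assuming the hidden embeddings are plain Python lists and that each of the three segments delimited by the entity positions is non-empty:

```python
def max_pool(hidden):
    """CNN pooling: [s]_j = max_i [h_i]_j over all positions."""
    dim = len(hidden[0])
    return [max(h[j] for h in hidden) for j in range(dim)]

def piecewise_max_pool(hidden, i1, i2):
    """PCNN pooling: max-pool the three segments delimited by the head
    and tail entity positions i1 <= i2, then concatenate the results,
    so the output is three times as long as a plain max-pool."""
    segments = [hidden[: i1 + 1], hidden[i1 + 1 : i2 + 1], hidden[i2 + 1 :]]
    out = []
    for seg in segments:
        out.extend(max_pool(seg))
    return out
```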

Hierarchical Selective Attention
Given the entity pair (h, t) and its bag of instances S_{h,t} = {s_1, s_2, . . . , s_m}, we obtain the instance embeddings {s_1, s_2, . . . , s_m} using the sentence encoder. Afterwards, we apply a hierarchical selective attention over them to get the textual relation representation r_{h,t} for extracting relations. In this part, we first introduce a plain selective attention, and then our hierarchical attention.

Plain Selective Attention
The plain selective attention scheme computes an attention score α_i for each instance s_i to indicate how well the instance expresses the relation between the two entities. We assign a query vector q_r to each relation r ∈ R, and the attention weight for each sentence in S_{h,t} = {s_1, s_2, . . . , s_m} is defined as follows,

α_i = exp(q_r^T W_s s_i) / Σ_{j=1}^{m} exp(q_r^T W_s s_j),

where W_s is the weight matrix. The attention scores can be used in Eq. 1 to compute textual relation representations. For simplicity, we denote such a plain selective attention operation as

ATT(q_r, {s_1, s_2, . . . , s_m}) = Σ_{i=1}^{m} α_i s_i.

Hierarchical Selective Attention

The inherent hierarchical structure of relations leads us to model hierarchical attention. Generally, given a relation set R of a KG G (e.g., Freebase) consisting of base-level relations (e.g., /location/province/capital), we can generate the corresponding higher-level relation set R^H. The relations in the high-level set (e.g., location) are more general and common, and each usually subsumes several sub-relations in the base-level set. We assume that the sub-relations of different relations are disjoint; in other words, the relation hierarchies are tree-structured. This generation process can be done recursively. In practice, we start from R^0 = R, the set of all relations we focus on for RE, and generate k − 1 times to get a total of k hierarchical relation sets {R^0, R^1, . . . , R^{k−1}}. As shown in Figure 2, for a relation r = r^0 ∈ R = R^0, which is the focus for RE, we construct its hierarchical chain of parent relations (r^0, r^1, . . . , r^{k−1}) by backtracking the relation hierarchy, where r^{i−1} is a sub-relation of r^i. As with the plain attention, we assign a query vector q_r to each relation r ∈ ∪_{i=0}^{k−1} R^i. With the hierarchical chain, we compute an attention operation on each layer of the relation hierarchies to obtain the corresponding textual relation representations,

r^i_{h,t} = ATT(q_{r^i}, {s_1, s_2, . . . , s_m}).
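A minimal sketch of the ATT operation, assuming a bilinear score q_r^T W_s s_i followed by a softmax; the square W_s and the small shapes are illustrative simplifications:

```python
import math

def att(q, W_s, instance_embs):
    """Plain selective attention: alpha_i = softmax_i(q^T W_s s_i),
    ATT(q, {s_i}) = sum_i alpha_i s_i. Returns the weighted
    representation and the attention weights."""
    def matvec(M, v):
        return [sum(m_j * v_j for m_j, v_j in zip(row, v)) for row in M]
    # unnormalized scores e_i = q^T W_s s_i
    scores = [sum(q_j * h_j for q_j, h_j in zip(q, matvec(W_s, s)))
              for s in instance_embs]
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(instance_embs[0])
    rep = [sum(a * s[j] for a, s in zip(alphas, instance_embs)) for j in range(dim)]
    return rep, alphas
```

The same operation is reused once per layer of the hierarchy, with a different query vector each time.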
During the training process, the relation query vectors of high-level relations (i.e., q_{r^i} with larger i) have more instances for training than the query vectors of base-level relations. Hence, the high-level query vectors are more robust for instance selection, but only with coarse-grained capability. In contrast, the base-level query vectors (i.e., q_{r^i} with smaller i) always suffer from data sparsity, especially for those long-tail base relations. Hence, the base-level query vectors can perform fine-grained instance selection, but their performance is not stable. Based on the hierarchical selective attention, we simply concatenate the textual relation representations of the different layers as the final representation,

r_{h,t} = Concat(r^{k−1}_{h,t}, . . . , r^0_{h,t}).

The representation r_{h,t} is finally fed to compute the conditional probability P(r|h, t, S_{h,t}) in Eq. 2. Note that the high-level representations (i.e., r^i_{h,t} with larger i) are coarse-grained, and the base-level representations (i.e., r^i_{h,t} with smaller i) are fine-grained. These hierarchical representations provide richer information than single-layered attention for relation prediction, especially for those long-tail relations.
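The per-layer attention and concatenation can be sketched as follows; for brevity the weight matrix W_s is folded away (the score is simply e_i = q · s_i), which is a simplification of the scheme described above:

```python
import math

def att(q, instance_embs):
    # simplified plain attention with e_i = q . s_i (W_s omitted)
    scores = [sum(a * b for a, b in zip(q, s)) for s in instance_embs]
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(instance_embs[0])
    return [sum(al * s[j] for al, s in zip(alphas, instance_embs)) for j in range(dim)]

def hierarchical_repr(layer_queries, instance_embs):
    """One attention pass per hierarchy layer (one query q_{r^i} per
    layer, ordered coarse-to-fine), concatenated into the final
    representation r_{h,t} = Concat(r^{k-1}, ..., r^0)."""
    rep = []
    for q in layer_queries:
        rep.extend(att(q, instance_embs))
    return rep
```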

Initialization and Implementation Details
Here we introduce the learning and optimization details of our hierarchical attention model. During training, we minimize the cross-entropy loss function. Given the collection of entity-pair bags π = {S_{h_1,t_1}, S_{h_2,t_2}, . . .} and the corresponding labeled relations {r_1, r_2, . . .}, we define the loss function as follows,

J(θ) = − Σ_i log P(r_i|h_i, t_i, S_{h_i,t_i}) + λ ||θ||_2^2,

where λ is a harmonic factor and ||θ||_2^2 is the L2 regularizer. All models are optimized using stochastic gradient descent (SGD).
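A sketch of this loss, assuming the per-bag relation distributions are already computed; the flat parameter list `theta` is an illustrative stand-in for all model parameters:

```python
import math

def bag_loss(probs_per_bag, labels, theta, lam):
    """Cross-entropy over entity-pair bags plus an L2 regularizer:
    J(theta) = -sum_i log P(r_i | h_i, t_i, S_i) + lambda * ||theta||_2^2.
    `probs_per_bag[i]` is the predicted relation distribution for bag i;
    `labels[i]` is the index of its distantly supervised relation."""
    nll = -sum(math.log(p[y]) for p, y in zip(probs_per_bag, labels))
    l2 = lam * sum(w * w for w in theta)
    return nll + l2
```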

Datasets and Evaluation
We evaluate our models on the New York Times (NYT) dataset developed by Riedel et al. (2010), which is widely used in recent works (Lin et al., 2016; Zeng et al., 2017; Ji et al., 2017; Han et al., 2018; Wu et al., 2017; Huang and Wang, 2017; Feng et al., 2018; Zeng et al., 2018). The dataset has 53 relations, including the NA relation, which indicates that no target relation holds for an instance. The training set has 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The test set has 172,448 sentences, 96,678 entity pairs and 1,950 relational facts. In both the training and test sets, we truncate sentences with more than 120 words to 120 words.
We evaluate all models with the held-out evaluation, which compares the relational facts discovered from the test articles with those in Freebase and provides an approximate measure of precision without human evaluation. For evaluation, we draw precision-recall curves for all models. Besides precision-recall curves, we also report the precision values at specific recall rates for a more direct comparison, and calculate the micro and macro average precision scores to show the overall effect of the different models. To further verify the effect of our hierarchical attention for few-shot entity pairs, we follow previous works to report Precision@N results. The dataset and baseline code can be found on GitHub (Lin et al., 2016; Wu et al., 2017).

Parameter Settings
To fairly compare the results of our hierarchical attention models with those baselines, we also set most of the experimental parameters following Lin et al. (2016). Table 1 shows all experimental parameters used in the experiments. We apply dropout on the output layers of our models to prevent overfitting. For CNN, we set the dropout rate to 0.5. For PCNN, we observe that this model tends to overfit on the training set very quickly, and hence we set the dropout rate to 0.9 to alleviate the overfitting problem. We also pre-train the sentence encoder of PCNN before training our hierarchical attention.

Overall Evaluation Results
To evaluate the performance of our proposed hierarchical models, we compare the precision-recall curves of our models with those of various previous relation extraction models. The evaluation results are shown in Figure 3 and Figure 4. We report the results of the neural architectures including CNN and PCNN with various attention-based methods: +HATT is our hierarchical attention method; +ATT is the plain selective attention method over instances (Lin et al., 2016); +ATT+ADV is the denoising attention method that adds a small adversarial perturbation to instance embeddings (Wu et al., 2017); +ATT+SL is the attention-based model using a soft-labeling method to mitigate the side effect of the wrong labeling problem at the entity-pair level; +ONE is a vanilla MIL neural model without attention schemes (Zeng et al., 2015). We also compare our method with feature-based models, including Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIML (Surdeanu et al., 2012). From the results, we observe that: (1) All methods have reasonable precision when recall is smaller than 0.05. As recall grows, the performance of the feature-based methods drops much faster than that of the neural models. This shows that human-designed features are very limited compared to neural models, especially in a noisy data environment. Hence, for simplicity, we mainly show the results of our models and other attention-based neural models in the following experiments.
(2) For both CNNs and PCNNs, the models with attention schemes outperform the vanilla models without them. Though vanilla neural models are powerful for relation classification, they still struggle with data noise. The attention-based methods apply attention over multiple instances and dynamically reduce the influence of noisy instances, which effectively improves the performance of RE and achieves state-of-the-art results.
(3) As shown in both figures, the models using hierarchical attention (HATT) achieve the best results among all the attention-based models. Even compared with PCNN+ATT+ADV and PCNN+ATT+SL, which adopt sophisticated denoising schemes and extra information, our models still maintain significant advantages. This indicates that, compared to conventional plain attention schemes that handle each relation in isolation, our method can better take advantage of the rich correlations among relations. We believe the performance of our hierarchical attention scheme can be further improved by adopting extra mechanisms such as adversarial training, reinforcement learning and soft-labeling at the entity-pair level, which we leave as future work.

Effect of Hierarchical Attention for Different Relations
To further verify the effectiveness of our hierarchical attention method for different relations, we evaluate the RE performance of our method and conventional attention methods. Since we focus more on the performance of the top-ranked results, we report the precision scores at recall 0.1, 0.2 and 0.3, as well as their mean. We also report micro and macro average scores in this experiment. As an approximation of the area under the precision-recall curve, the micro average score gives a more complete view of model performance. Since the micro average score generally overlooks the influence of long-tail relations, we use the macro average score to put more emphasis on the long-tail relations in the test set, which are often neglected by previous works.
The evaluation results are shown in Table 2, from which we observe that our HATT method achieves consistent and significant improvements over the plain ATT method. The micro and macro average precision scores show that our HATT method effectively improves RE performance, especially on long-tail relations, by taking advantage of the correlations among relations.
To further demonstrate the improvements on long-tail relations after introducing relation hierarchies, we extract subsets of the test dataset in which all the relations have fewer than 100/200 training instances. We employ the Hits@K metric for evaluation: for each entity pair, the model is considered correct if the gold relation appears among the top K candidate relations recommended by the model. Because it is difficult for the existing models to extract long-tail relations, we select K from {10, 15, 20}. We report both the micro and macro average Hits@K accuracies for these subsets. From the evaluation results in Table 3, we observe that: (1) For both CNN and PCNN models, our hierarchical attention outperforms the plain attention model. By taking advantage of the relation hierarchy, our models can learn long-tail relations better via the correlation information among relations. We also observe that even our hierarchical CNN model presents better performance than the plain PCNN model. This shows the power of relation hierarchies, which makes our simpler CNN model outperform the PCNN model on those long-tail relations.
(2) Although our HATT method achieves obvious progress on long-tail relations compared with the plain ATT method, the results of all these methods are still far from satisfactory. This indicates that distantly supervised RE models suffer not only from the wrong labeling problem, but also from the long-tail relation problem. We will incorporate more schemes and extra information to address this problem in the future.
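The Hits@K metric used above can be computed straightforwardly from per-pair relation scores; the list-of-lists layout is an illustrative assumption:

```python
def hits_at_k(scores_per_pair, gold_labels, k):
    """Hits@K: fraction of entity pairs whose gold relation appears
    among the K highest-scoring candidate relations."""
    hits = 0
    for scores, gold in zip(scores_per_pair, gold_labels):
        # indices of the top-k relations by predicted score
        topk = sorted(range(len(scores)), key=lambda r: scores[r], reverse=True)[:k]
        hits += gold in topk
    return hits / len(gold_labels)
```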

Effect of Hierarchical Attention with Different Instances
Since our method mainly focuses on modifications of selective attention, we also conduct Precision@N tests on those entity pairs with few instances, following Lin et al. (2016). We use three test settings for this experiment: the ONE test set, where we randomly select one instance for each entity pair for evaluation; the TWO test set, where we randomly select two instances for each entity pair; and the ALL test set, where we use all instances of each entity pair for evaluation. With the ONE and TWO test sets, we intend to show that taking correlation information among relations into consideration leads to a better relation classifier. The ALL test set is designed to show the effect of our attention over multiple instances. We report the precision values of the top N extracted triples, where N ∈ {100, 200, 300}.
The evaluation results are shown in Table 4, from which we observe that: (1) The performance of all methods generally improves as the instance number increases. This shows that the selective attention model can effectively take advantage of information from multiple noisy instances by combining useful instances while discarding useless ones.
(2) Our HATT method achieves higher precision values on the ONE test set. This indicates that even in an insufficient-information scenario, correlations among relations can be captured by our hierarchical attention.

Case Study
We give some examples of how our hierarchical selective attention takes effect in selecting the sentences. In Table 5, we display the sentences that are scored highest ("Good") or lowest ("Bad") by the attention of different hierarchical levels ("High" and "Base").
The relation /people/person/children has fewer than 1,000 training instances and is thus a long-tail relation. For this relation, the instance selected by the higher-level attention straightforwardly expresses the relational fact that Nathan is the child of David by stating that Nathan is David's son, while the sentence with the low attention score actually describes a relationship within the same generation. On the contrary, the lower-level attention mistakenly assigns a high attention score to the incorrect sentence. This example shows that our hierarchical attention is beneficial for these long-tail relations.

Conclusion and Future Work
In this paper, we take advantage of relation hierarchies and propose a novel hierarchical instance-level attention for relation extraction. As compared with previous attention-based methods, our hierarchical attention provides coarse-to-fine granularity in instance selection and performs better extraction for long-tail relations. We conduct various experiments and the evaluation results show that incorporating the inherent hierarchical structure of relations into attention schemes can take advantage of correlations among relations and improve the performance significantly.
In the future, we plan to explore the following directions: (1) It will be promising to adopt extra information to help train more efficient models for solving the long-tail relation problem. (2) We may also combine our attention method with recent denoising methods to further improve model performance.