Improving Long-Tail Relation Extraction with Collaborating Relation-Augmented Attention

The wrong labeling problem and long-tail relations are two main challenges caused by distant supervision in relation extraction. Recent works alleviate wrong labeling through selective attention via multi-instance learning, but cannot well handle long-tail relations even when hierarchies of the relations are introduced to share knowledge. In this work, we propose a novel neural network, Collaborating Relation-augmented Attention (CoRA), to handle both wrong labeling and long-tail relations. Particularly, we first propose a relation-augmented attention network as the base model. It operates on a sentence bag with a sentence-to-relation attention to minimize the effect of wrong labeling. Then, facilitated by the proposed base model, we introduce collaborating relation features, shared among relations in the hierarchies, to promote the relation-augmenting process and balance the training data for long-tail relations. Besides the main training objective to predict the relation of a sentence bag, an auxiliary objective is utilized to guide the relation-augmenting process for a more accurate bag-level representation. In experiments on the popular benchmark dataset NYT, the proposed CoRA improves the prior state-of-the-art performance by a large margin in terms of Precision@N, AUC and Hits@K. Further analyses verify its superior capability in handling long-tail relations in contrast to the competitors.


Introduction
Relation extraction, a fundamental task in natural language processing (NLP), aims to discriminate the relation between two given entities in plain text. As recent data-driven algorithms, e.g., deep neural networks (Du et al., 2018; Ye and Ling, 2019), have shown their capabilities in tackling NLP tasks, labor-intensive annotation and scarce training data have become the major obstacles to achieving promising performance on relation extraction.
Instead of relying on costly manual labeling, the distant supervision method (Mintz et al., 2009) is proposed to auto-label the training data for relation extraction. It labels a sentence containing a pair of entities with the relation between them in a knowledge graph, e.g., Freebase (Bollacker et al., 2008). A strong assumption behind this is that a sentence containing two entities only expresses the relation existing in the knowledge graph, but this assumption does not always hold. Hence, it leads to two main problems regarding the training data. First, when the assumption is invalid, the wrong labeling problem appears and degrades algorithms by introducing noisy supervision signals. This problem has been well-studied by recent works (Ji et al., 2017; Li et al., 2020) operating at bag level for multi-instance learning (Hoffmann et al., 2011), where a "bag" denotes a set of sentences with the same entity pair. Most of them use selective attention to avoid wrongly-labeled sentences.
Second, the long-tail problem is caused by using a knowledge graph as distant supervision to auto-label a domain-specific corpus, since the knowledge graph usually suffers from long-tail relations. For example, to build the NYT dataset, applying Freebase to a news corpus, New York Times, leaves ∼70% of the relations long-tail, as illustrated in Figure 1 (left). This problem seriously disrupts the data balance and thus becomes one of the main barriers to improvement.

[Figure 1. Left: Proportion of long-tail relations, where the criterion for a long-tail relation is that the number of corresponding training sentences is less than 1000. Middle: Relation hierarchies in the Freebase knowledge graph, where a relation marked with "*" is long-tail. Through the common high-level relation, 1) multiple semantically-related low-level relations complement each other and 2) semantic knowledge is transferred from data-rich to long-tail relations. Right: Empirical AUC results of competitive approaches on data-rich and long-tail test subsets.]
To alleviate the long-tail problem, two existing approaches naturally share knowledge from data-rich relations to long-tail ones when those relations have semantic overlap. This semantic overlap or relatedness is usually stored in the relation hierarchies of a knowledge graph, e.g., Freebase in Figure 1 (middle). Specifically, these approaches extend selective attention by introducing the embeddings of high-level (i.e., coarse-grained) relations as complements to the original low-level (i.e., fine-grained) relations. As such, the high-level relation embeddings are used as queries of selective attention to derive extra bag-level features. To learn the relation embeddings, one approach randomly initializes them followed by supervised learning in an end-to-end fashion, whereas the other combines the embeddings from both TransE (Bordes et al., 2013) pre-trained on Freebase and a graph convolutional network (Defferrard et al., 2016) applied to the relation hierarchies.
Despite being proven to improve both overall and long-tail performance, these approaches pose two issues: 1) Limited by the selective attention framework, the relation embeddings are only used as the attention's queries and thus are not well-exploited to share knowledge. 2) Despite their capability in mitigating the long-tail problem, graph embeddings pre-trained on a large-scale knowledge graph are time-consuming to obtain and not always off-the-shelf, hence coming at the cost of practicality.
In this work, we propose a novel neural network, named Collaborating Relation-augmented Attention (CoRA), to tackle distantly supervised relation extraction, in which no external knowledge is introduced and the relation hierarchies are fully utilized to alleviate the long-tail problem. Specifically, as an alternative to the selective attention framework, we first propose a base model, relation-augmented attention, operating at bag level to minimize the effect of wrong labeling, where the relation-augmenting process is fulfilled by a sentence-to-relation attention. Empowered by the base model, we then leverage the high-level relations for collaborating features in light of the relation hierarchies. Besides further relieving wrong labeling, such features facilitate knowledge transfer among the low-level relations inheriting a common high-level relation.
Intuitively, selective attention and its hierarchical extensions learn relation label embeddings to score each sentence in a bag. In contrast, the proposed relation-augmented attention network achieves the same goal via a memory network-like structure: sentences equipped with relation features are passed into an attention-pooling (i.e., a kind of self-attention (Lin et al., 2017)) for bag-level representations. Our method is especially effective when extended to multi-granular relations: the features are enriched by cross-relation sharing, which hence benefits long-tail relations. As shown in Figure 1 (right), our proposed approach achieves consistently outstanding performance on both data-rich and long-tail relations.
We use two objectives to jointly train the CoRA. The first is predicting the relation label at bag level, which is the goal of relation extraction. As auxiliary objective, the second is guiding the model to equip each sentence with correct multi-granular relation embeddings during the augmenting process. It aims to boost downstream attention-pooling and is fulfilled by applying the multi-granular labels to the sentence-to-relation attention during training.
Our main contributions are summarized as: • We propose a base model, named relation-augmented attention network, to handle wrong labeling problem in multi-instance learning.
• We then propose to extend the base model with the relation hierarchies, called CoRA, to further promote the performance on long-tail relations.
• We evaluate CoRA on the popular benchmark dataset NYT and set state-of-the-art results in Precision@N, AUC and Hits@K. We also verify its capability in alleviating both wrong labeling and long-tail problems via insightful analyses. The source code of this work is released at https://github.com/YangLi1221/CoRa.

Proposed Approach
This section begins with a definition of distantly supervised relation extraction with multi-granular relation labels. Then an embedding method is introduced to represent sentences. Lastly, our base model and its hierarchical extension are presented to handle wrong labeling and long-tail relations. An illustration of the model is shown in Figure 2.

Task Definition
Given a bag of sentences B = {s1, . . . , sm} in which each sentence contains a pair of head e(h) and tail e(t) entities in common, distant supervision (Mintz et al., 2009) assigns this bag a relation label r(0) according to the entity pair in a knowledge graph. The goal of relation extraction is to predict the relation label r̂(0) of an entity pair based on the corresponding sentence bag when the pair is not included in the knowledge graph. Following the hierarchical setting, labels of coarse-grained relations, [r(1), . . . , r(M)], can be used to share knowledge across relations.

Sentence-Level Representation
To embed each sentence sj in a bag B = {s1, . . . , sm} into a latent semantic space, we derive a sentence representation from three kinds of features: word embedding (Mikolov et al., 2013), position embedding (Zeng et al., 2015) and entity embedding (Li et al., 2020). Their integration has been proven crucial and effective for relation extraction by previous work (Li et al., 2020). In the following, we omit the sentence index j for clarity. Basically, a sentence s is first tokenized into a sequence of n words, s = [w1, . . . , wn]; then a word2vec method (Mikolov et al., 2013) is used to transform the discrete tokens into low-dimensional, real-valued vector embeddings, i.e., word embeddings.
On the one hand, position-aware embeddings offer rich positional information for downstream modules (Zeng et al., 2014). For the i-th word, the relative position is represented as the distances from the word to the head e(h) and tail e(t) entities respectively. The two integer distances are then transformed into low-dimensional vectors, x(ph)_i and x(pt)_i ∈ R^dp, by a learnable weight matrix. Consequently, a sequence of position-aware embeddings is denoted as X(p) = [x(p)_1, . . . , x(p)_n], where x(p)_i = [x(w)_i; x(ph)_i; x(pt)_i] and [;] denotes the operation of vector concatenation. On the other hand, entity-aware embeddings are also crucial since the goal of relation extraction is to discriminate the relation between two entities. The embedding of the head or tail entity is represented by the corresponding word embedding; note that each entity is one entry in the vocabulary of word embeddings even if it is usually composed of multiple words. Hence, a sequence of entity-aware embeddings is denoted as X(e) = [x(e)_1, . . . , x(e)_n], where x(e)_i = [x(w)_i; x(h); x(t)]. To integrate the embeddings above, a position-wise gating procedure is employed following Li et al. (2020), i.e., α_i = σ(λ · W(g1) x(e)_i) and x_i = α_i ◦ (W(g2) x(p)_i), where "◦" denotes element-wise product, W(g1) ∈ R^{dx×3dw} and W(g2) ∈ R^{dx×(dw+2dp)} are learnable parameters, λ is a hyper-parameter to control smoothness, and X = [x1, . . . , xn] ∈ R^{dx×n} is the resulting sequence of word embeddings specialized for relation extraction.
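As a concrete illustration, the gated integration above can be sketched in NumPy. This is a minimal toy sketch, not the authors' implementation: the shapes follow the text (W(g1) consuming the entity-aware view, W(g2) the position-aware view), while the exact gate formula and all variable names are assumptions to be checked against Li et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(0)
n, dw, dp, dx = 6, 8, 4, 12          # sentence length and embedding sizes (toy values)

xw = rng.normal(size=(n, dw))        # word embeddings
xh, xt = rng.normal(size=dw), rng.normal(size=dw)              # head/tail entity embeddings
xph, xpt = rng.normal(size=(n, dp)), rng.normal(size=(n, dp))  # relative-position embeddings

# entity-aware view: [word; head; tail]  -> (n, 3*dw)
xe = np.concatenate([xw, np.tile(xh, (n, 1)), np.tile(xt, (n, 1))], axis=1)
# position-aware view: [word; pos_head; pos_tail] -> (n, dw + 2*dp)
xp = np.concatenate([xw, xph, xpt], axis=1)

Wg1 = rng.normal(size=(dx, 3 * dw)) * 0.1        # W(g1) in R^{dx x 3dw}
Wg2 = rng.normal(size=(dx, dw + 2 * dp)) * 0.1   # W(g2) in R^{dx x (dw+2dp)}
lam = 1.0                                        # smoothness hyper-parameter

gate = 1.0 / (1.0 + np.exp(-lam * (xe @ Wg1.T)))  # sigmoid gate from entity-aware view
X = gate * (xp @ Wg2.T)                           # gated, projected position-aware features
assert X.shape == (n, dx)
```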
Piecewise Convolutional Neural Network. As a common practice in distantly supervised relation extraction, a piecewise convolutional neural network (PCNN) (Zeng et al., 2015) is used to generate contextualized representations over an input sequence of word embeddings. Compared to the typical 1D-CNN with max-pooling (Zeng et al., 2014), piecewise max-pooling has the capability to capture the structural information between the two entities by considering their positions. Specifically, a 1D-CNN (Kim, 2014) is first invoked over the input sequence for contextualized representations. Then a piecewise max-pooling is performed over the output sequence to obtain the sentence-level embedding. These steps are written as H = 1D-CNN(X; W(c)) and s = [maxpool(H(1)); maxpool(H(2)); maxpool(H(3))], where W(c) ∈ R^{dc×Q×dx} is a convolution kernel with window size Q, and H(1), H(2), H(3) are three consecutive parts of H, obtained by dividing H w.r.t. the indices of the head e(h) and tail e(t) entities. Consequently, s ∈ R^{dh}, where dh = 3dc, is the resulting sentence-level representation.
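The piecewise max-pooling step can be sketched as follows. This is a toy illustration: the 1D-CNN output is replaced by a small feature map, and the exact segment boundaries around the entity positions are an assumption.

```python
import numpy as np

def pcnn_pool(H, head_idx, tail_idx):
    """Piecewise max-pooling: split the conv feature map H (n, dc) into three
    segments around the head/tail entity positions, max-pool each segment,
    and concatenate -> sentence vector of size 3*dc."""
    lo, hi = sorted((head_idx, tail_idx))
    parts = [H[: lo + 1], H[lo + 1 : hi + 1], H[hi + 1 :]]
    pooled = [p.max(axis=0) if len(p) else np.zeros(H.shape[1]) for p in parts]
    return np.concatenate(pooled)

H = np.arange(24, dtype=float).reshape(8, 3)   # toy conv feature map (n=8, dc=3)
s = pcnn_pool(H, head_idx=2, tail_idx=5)
assert s.shape == (9,)                         # d_h = 3 * d_c
```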

Relation-Augmented Attention Network
Due to the effectiveness of selective attention in multi-instance learning, most recent works employ selective attention as the baseline and then propose their own approaches for improvements on wrong labeling and/or long-tail relations. However, selective attention has gradually become a bottleneck of performance improvement. For example, Li et al. (2020) find that using a simple gating mechanism in place of selective attention further alleviates the wrong labeling problem and significantly promotes extraction results. Intuitively, on the one hand, employing the basic PCNN and a vanilla attention mechanism inevitably limits the expressive power of this framework and thus sets a barrier. On the other hand, the relation embeddings, similar to label embeddings (Bengio et al., 2010), are crucial to distantly supervised relation extraction, but are only used as attention queries to score a sentence and thus are not well-exploited.
In contrast, we aim to augment each sentence in a bag with the relation embeddings via a sentence-to-relation attention, and pass the relation-augmented representations of a bag's sentences into an attention-pooling module. The attention-pooling, a kind of self-attention (Lin et al., 2017; Shen et al., 2018b; Shen et al., 2018a), is used to derive an accurate bag-level representation for relation classification.

In detail, we first define a relation embedding matrix R(0) ∈ R^{dh×N(0)}, where dh denotes the size of hidden states and N(0) denotes the number of distinct relations r(0) in a distantly supervised relation extraction task. Then, we formulate a sentence-to-relation (sent2rel) attention, as opposed to selective attention, which aims to augment the sentence representation from §2.2 with relation information. The sentence representation s is used as a query to attend over the relation embedding matrix R(0) via a dot-product compatibility function, i.e., α(0) = softmax(s^T R(0)) and c(0) = R(0) α(0), where softmax(·) denotes a normalization function along the last dimension and c(0) is the resulting relation-aware representation corresponding to the sentence s. Then we merge the relation-aware representation c(0) into the original sentence representation s by an element-wise gate mechanism with residual connection (He et al., 2016) and layer normalization (Ba et al., 2016), i.e., g = σ(MLP([s; c(0)])) and u(0) = LayerNorm(s + g ◦ c(0)), where MLP(·) denotes a multi-layer perceptron to increase nonlinearity. Finally, we define the relation-augmented sentence representation in our base model as u = u(0).

Next, moving to multi-instance learning, we put each sentence back into its bag B = {s1, . . . , sm}, so the bag of sentences with relation augmentation is represented as U = [u1, . . . , um] ∈ R^{dh×m}. Differing from the selective attention framework, our sentence representations are augmented by the relation embeddings as elaborated above. Hence, we straightforwardly introduce an attention-pooling module to derive a bag-level representation while denoising the wrongly-labeled sentences.
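A minimal NumPy sketch of the sent2rel attention and the gated merge described above. The exact gate input, MLP placement and normalization in the paper may differ; this is a hedged toy reading, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sent2rel(s, R, W_mlp, W_gate):
    """Sentence-to-relation attention: the sentence vector s queries the
    relation embedding matrix R (dh, N); the relation-aware vector c is
    merged back via a sigmoid gate, residual connection and layer norm."""
    alpha = softmax(s @ R)             # attention distribution over relations
    c = R @ alpha                      # relation-aware representation
    g = 1.0 / (1.0 + np.exp(-(W_gate @ np.concatenate([s, c]))))  # element-wise gate
    u = s + g * np.tanh(W_mlp @ c)     # gated merge with residual connection
    u = (u - u.mean()) / (u.std() + 1e-6)   # layer normalization
    return u, alpha

rng = np.random.default_rng(1)
dh, N = 6, 4
s = rng.normal(size=dh)                # sentence representation from the PCNN
R = rng.normal(size=(dh, N))           # relation embedding matrix R(0)
u, alpha = sent2rel(s, R, rng.normal(size=(dh, dh)), rng.normal(size=(dh, 2 * dh)))
assert u.shape == (dh,) and abs(alpha.sum() - 1.0) < 1e-9
```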
Specifically, the attention-pooling learns to assign each sentence an importance score according to its representation. It then performs a weighted sum over the bag of sentence representations, where the weights are proportional to the scores. This attention is formulated as a = softmax(w^T U) and b = U a^T, where w is a learnable weight vector and b denotes the resulting bag-level representation. Lastly, an MLP is used to obtain a categorical distribution over all relations as the bag-level prediction, i.e., P(r|B) = softmax(MLP(b)).
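The attention-pooling and bag-level classifier can be sketched as follows; the weight shapes and the linear classifier standing in for the MLP are illustrative assumptions.

```python
import numpy as np

def attention_pooling(U, w):
    """Score each relation-augmented sentence with a learned vector w,
    then take a score-weighted sum as the bag representation b."""
    scores = U @ w                      # one importance score per sentence, (m,)
    a = np.exp(scores - scores.max())
    a /= a.sum()                        # softmax over the sentences of the bag
    return a @ U                        # b in R^{dh}

rng = np.random.default_rng(2)
m, dh, N = 3, 6, 4
U = rng.normal(size=(m, dh))            # bag of m relation-augmented sentences
b = attention_pooling(U, rng.normal(size=dh))
W_cls = rng.normal(size=(N, dh))        # toy classifier standing in for the MLP
p = np.exp(W_cls @ b); p /= p.sum()     # bag-level distribution over N relations
assert b.shape == (dh,) and abs(p.sum() - 1.0) < 1e-9
```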

Collaborating Relation-Augmented Attention Network
Beyond the fine-grained relations used above, high-level relation embeddings, as hierarchical knowledge, can collaborate with the low-level embeddings to boost performance by alleviating the long-tail problem. Intuitively, a high-level relation, shared across several low-level relations, represents the common knowledge of those low-level relations. Therefore, via the common high-level relation, 1) several low-level long-tail relations with semantic overlap mutually benefit each other, and 2) semantic knowledge is easily transferred from data-rich relations to long-tail ones. This common knowledge is implicitly utilized to distinguish the coarse-grained relation of a bag and thus benefits the final relation prediction. With the relation-augmented sentence representation further enriched via such collaboration, we name the model Collaborating Relation-augmented Attention (CoRA).

Empowered by the structural design of our base model, high-level relation embeddings can be easily integrated by re-defining Eq.(11). In particular, given the coarse-grained relation labels from low to high level, i.e., [r(1), . . . , r(M)], we define a list of relation embedding matrices [R(1), . . . , R(M)] in addition to the R(0) defined in the last section. With these relation embedding matrices, we individually generate the corresponding relation-augmented sentence representations, i.e., [u(1), . . . , u(M)], via the same procedure defined in Eq.(6-10) of §2.3. Then, we concatenate [u(1), . . . , u(M)] with u(0) to re-formulate Eq.(11) as u = [u(0); u(1); . . . ; u(M)]. The following procedure is identical to that of the base model elaborated above, except that the learnable weight matrices are scaled up linearly with the depth of the relation hierarchies.
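A toy sketch of the collaborating concatenation: one relation embedding matrix per hierarchy level, and the per-level augmented vectors concatenated. The relation counts per level are illustrative, and the `augment` helper is a simplified stand-in for the full sent2rel step (gate and layer norm omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(3)
dh = 6
level_sizes = [53, 36, 9]           # e.g. N(0), N(1), N(2); illustrative counts

def augment(s, R):                  # simplified stand-in for the sent2rel step
    alpha = np.exp(s @ R); alpha /= alpha.sum()
    return s + R @ alpha            # residual merge; gate/LayerNorm omitted

s = rng.normal(size=dh)             # sentence representation
u_levels = [augment(s, rng.normal(size=(dh, N)) * 0.1) for N in level_sizes]
u = np.concatenate(u_levels)        # collaborating multi-granular representation
assert u.shape == (dh * len(level_sizes),)
```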

Training Objectives
The main objective for relation extraction is to minimize a cross-entropy loss at bag level, i.e., L(re) = −Σ_{B∈D} log P(r(0)|B), where D is the training set consisting of sentence bags. Besides, an auxiliary objective guides the sentence-to-relation attention modules to augment each sentence with the correct relation embeddings. This is critical for the downstream attention-pooling and for overcoming the challenges presented by distant supervision. Given the sent2rel attention scores α(l) and relation label r(l) at an arbitrary level l, the loss function for this objective is defined as L(att) = −Σ_{B∈D} Σ_{l=0}^{M} Σ_{j=1}^{m} log α_j(l)[r(l)], where M = 0 for the base model in §2.3 and M > 0 for CoRA in §2.4. Finally, we optimize the proposed model by jointly minimizing the two loss functions, i.e., L = L(re) + L(att).
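The joint objective can be illustrated on a single toy bag; the bag prediction, per-sentence attention distributions and labels below are fabricated for illustration only.

```python
import numpy as np

def nll(probs, label):
    """Negative log-likelihood of the gold label under a distribution."""
    return -np.log(probs[label] + 1e-12)

# Toy bag: p_bag is the bag-level prediction; alphas[l][j] is the sent2rel
# attention distribution of sentence j at hierarchy level l.
p_bag = np.array([0.7, 0.2, 0.1]); r = [0, 1]       # labels r(0), r(1)
alphas = [
    np.array([[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]),   # level 0, two sentences
    np.array([[0.5, 0.5], [0.9, 0.1]]),             # level 1
]
L_re = nll(p_bag, r[0])                              # main objective L(re)
L_att = sum(nll(a_j, r[l]) for l, A in enumerate(alphas) for a_j in A)
L = L_re + L_att                                     # joint objective
```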

Experiments
We evaluate our proposed network on a popular benchmark dataset and conduct several analyses for insights into our proposed model.
Dataset and Evaluation Metrics. Following previous works (Zeng et al., 2015), we employ the popular distantly supervised relation extraction dataset, New York Times (NYT) (Riedel et al., 2010). It contains 53 distinct relations, including an NA class denoting that the relation between the entity pair is unavailable, and consists of 570K and 172K sentences in the training and test sets respectively. Two metrics, 1) area under the precision-recall curve (AUC) and 2) top-n precision (P@N), are commonly used to measure effectiveness. We also use Hits@K for long-tail relations, following prior work.

Comparative Approaches. We compare the proposed approach with extensive previous works, summarized as follows. A model marked with "*" is proposed for the long-tail problem.
• PCNN+ATT proposes selective attention to alleviate wrong labeling.
• PCNN+BAG-ATT (Ye and Ling, 2019) proposes intra-bag and inter-bag attentions to handle wrongly-labeled sentences at sentence level and bag level respectively.
• PCNN+KATT* integrates externally pre-trained graph embeddings with relation hierarchies for long-tail relations. Note that standard AUC and P@N values are not available in its paper; only Hits@K is defined and reported for the long-tail setting.
• SeG (Li et al., 2020) focuses on one-sentence bags and proposes selective gate mechanism.
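For reference, the two standard metrics described above (P@N and area under the precision-recall curve) can be sketched as follows; the toy confidences and gold flags are illustrative, and the trapezoidal integration is one common way to compute the area.

```python
import numpy as np

def precision_at_n(confidences, labels, n):
    """P@N: fraction of correct predictions among the n most confident ones."""
    order = np.argsort(confidences)[::-1][:n]
    return labels[order].mean()

def pr_auc(confidences, labels):
    """Area under the precision-recall curve (trapezoidal rule over the
    curve traced by sorting predictions by descending confidence)."""
    order = np.argsort(confidences)[::-1]
    hits = np.cumsum(labels[order])
    precision = hits / np.arange(1, len(labels) + 1)
    recall = hits / labels.sum()
    return float(np.sum((recall[1:] - recall[:-1])
                        * (precision[1:] + precision[:-1]) / 2))

conf = np.array([0.9, 0.8, 0.7, 0.4, 0.2])
gold = np.array([1, 0, 1, 1, 0])    # 1 = prediction matches a true relation fact
assert precision_at_n(conf, gold, 2) == 0.5
```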

Evaluation on Benchmark
As shown in Table 1 and Figure 3 (left), we compare our CoRA with previous competitive approaches on the distantly supervised relation extraction benchmark in terms of top-n precision, AUC and the PR curve. Specifically, CoRA significantly outperforms the selective attention baseline, i.e., PCNN+ATT. It also surpasses the selective gate framework, which shows inferior performance on long-tail relations as in Figure 1 of §1. In addition, compared to PCNN+HATT, which also utilizes relation hierarchies, CoRA achieves much better results in both P@N and AUC.

Ablation Study
To further evaluate the effectiveness of each module in the proposed framework, we conduct an extensive ablation study at the bottom of Table 1 and in Figure 3 (middle). Since the performance drop is consistent across P@N and AUC, we mainly use AUC as the metric in the following study. Compared to CoRA, the base model without collaborating relation features shows only a marginal precision drop when recall > 0.3 on the PR curve, but a significant drop on long-tail relations (detailed in the next section). Also, as an alternative to selective attention, our base model outperforms PCNN+ATT by a large margin. Then, removing the simple entity embeddings of §2.2 leads to remarkable degradation, verifying their importance. It is also reasonable to compare PCNN+ATT with "Base w/o Ent Emb" (+0.06 AUC) to demonstrate that our relation-augmented framework is indeed better than selective attention. Finally, removing "Sent2rel Attention", "Attention-pooling" and "Aux Obj" reduces the AUC by 0.10, 0.01 and 0.12 respectively.

[Table 2: Hits@K (Macro) on the relations whose number of training instances is <100 or <200. "Hits@K" denotes whether a test sentence bag's gold relation label r(0) falls into the top-K relations ranked by prediction confidence. "Macro" denotes that a macro average is applied over relation labels.]
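A sketch of how the Hits@K (Macro) metric described in the caption can be computed; the toy scores, labels and the argsort-based top-K selection are illustrative assumptions.

```python
import numpy as np

def hits_at_k_macro(pred_scores, gold, k):
    """Macro Hits@K: for each relation label, the fraction of its test bags
    whose gold label ranks in the top-k predictions; then average over labels."""
    per_label = {}
    for scores, g in zip(pred_scores, gold):
        topk = np.argsort(scores)[::-1][:k]
        per_label.setdefault(g, []).append(g in topk)
    return np.mean([np.mean(v) for v in per_label.values()])

scores = np.array([[0.1, 0.7, 0.2],     # gold 1 -> ranked 1st (hit at k=2)
                   [0.5, 0.3, 0.2],     # gold 2 -> ranked 3rd (miss at k=2)
                   [0.4, 0.1, 0.5]])    # gold 2 -> ranked 1st (hit at k=2)
assert hits_at_k_macro(scores, [1, 2, 2], k=2) == 0.75
```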

Evaluation on Long-Tail Relations
To prove the capability of CoRA in handling long-tail relations, we conduct an evaluation solely on long-tail relations. Our evaluation setting is identical to that of prior work, where Hits@K (Macro) is used to represent statistical performance on long-tail relations. As shown in Table 2, we compare CoRA with the competitors and our base models. It is observed that CoRA improves the performance on long-tail relations by a large margin and delivers new state-of-the-art results. Compared to previous works (PCNN+HATT/+KATT) that also leverage the relation hierarchies, our relation-augmented attention (Base) without any hierarchy even achieves competitive results, not to mention without the pre-trained graph embeddings used in PCNN+KATT. Further comparing our base model with selective attention (PCNN+ATT), the huge performance gap demonstrates the advantages of our framework in handling both wrong labeling and long-tail relations. Finally, as shown in the table's last row, removing the proposed sent2rel attention leads to a significant decrease, which emphasizes its importance for long-tail relations.

Analysis and Case Study
Distributions of Sent2rel Attention Scores. The sent2rel attention used to incorporate multi-granular relation embeddings is an essential module in CoRA, so its normalized attention scores (i.e., attention probabilities) derived from Eq.(6) are critical for measuring the knowledge transfer across relations. We show the probability distribution of the maximum attention score in Figure 3 (right). Evidently, a high-level sent2rel attention tends to produce a larger maximum attention score and a more accurate attention target. It can be inferred that 1) accurate attention at the high level promotes knowledge transfer through the relation hierarchies, and 2) the attention probability distribution is smoother at the low level, further boosting embedding sharing across relations. To examine this, in Table 3, we conduct a case study showing the top-3 attention scores at all three relation levels. It is observed that the attention scores and the corresponding relations are consistent with the analyses above. One exception is that the NA class tends to be assigned a high attention score by the low-level sent2rel attention, which indirectly explains why 1) our base model without collaborating relation features delivers inferior performance and 2) the sent2rel attention over low-level relations is less accurate.

[Table 3: Two example sentences with their top-3 sent2rel attention scores α(2), α(1), α(0) at all relation levels, e.g., Example Sentence 1: "Muhammad Yunus, who won the Nobel Peace Prize last year, demonstrated with Grameen Bank the power of microfinancing." Both sentences express the same long-tail relation "/business/company/founders".]
Performance Based Solely on the Sent2rel Module. Multi-granular relation labels are used as supervision signals for the sent2rel attention modules, and the accuracy of each module is greater than 90%, as in Figure 3. Therefore, it is interesting to check whether the attention scores can be directly used to predict relations at bag level. We present two settings: 1) only using the attention scores over fine-grained relations, i.e., α(0), and 2) using the products of attention scores at all three levels to make the best of the relation hierarchies. As a result, settings 1 and 2 deliver AUCs of 0.41 and 0.43 respectively, which surprisingly outperform several previous works in Table 1.
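The two settings can be sketched as follows, assuming a hypothetical `child_of` mapping from each fine-grained relation to its coarse-grained parent (the mapping and scores below are purely illustrative).

```python
import numpy as np

# Setting 1: predict from fine-grained sent2rel scores alone.
# Setting 2: multiply each fine-grained score by its parent's coarse score.
alpha0 = np.array([0.2, 0.5, 0.3])      # fine-grained scores (3 toy relations)
alpha1 = np.array([0.7, 0.3])           # coarse-grained scores (2 toy relations)
child_of = [0, 0, 1]                    # hypothetical parent of each fine relation

setting1 = int(np.argmax(alpha0))                    # fine-grained only
combined = alpha0 * alpha1[np.array(child_of)]       # product across levels
setting2 = int(np.argmax(combined))
assert setting1 == 1 and setting2 == 1
```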

Error Analysis.
To investigate the possible reasons for misclassification, we manually check several randomly-sampled error examples from the test set and find the following factors can cause wrong predictions. 1) Most error cases demonstrate that the proposed model still struggles to handle the wrong labeling problem, possibly because the limited expressive power of the text representation is incompetent at handling noisy, imbalanced data. 2) The sent2rel attention can be invalid when sibling relations have totally distinct meanings, which has a negative effect on relation extraction. For example, /people/person/children and /people/person/profession refer to entirely different concepts. 3) Since a sentence embedding is augmented by multiple semantically-related relation embeddings, the relation ambiguity problem is exacerbated and produces errors. For example, it is hard to distinguish /people/deceased_person/place_of_death from /people/deceased_person/place_of_burial.

Related Work
Relation Extraction. Supervised relation extraction models (Zelenko et al., 2003; GuoDong et al., 2005) require large amounts of annotated data, which is time-consuming and labor-intensive to obtain. To acquire a large amount of labeled data, Mintz et al. (2009) propose the distant supervision method to automatically annotate data. However, it inevitably leads to the wrong labeling problem due to its strong assumption. To reduce the effect of the wrong labeling problem, the multi-instance learning paradigm (Riedel et al., 2010; Hoffmann et al., 2011) is proposed. To introduce the merits of deep learning into relation extraction, Zeng et al. (2014; 2015) specifically design the position embedding and the piecewise convolutional neural network to better extract the features of each sentence. To further alleviate the effect of wrong labeling, the selective attention framework is proposed under the multi-instance learning paradigm. Recently, many works (Du et al., 2018; Li et al., 2020) have been built upon the selective attention framework to handle wrong labeling in distantly supervised relation extraction. For example, Ye and Ling (2019) propose bag-level selective attention to share training information among bags with the same label. Hu et al. (2019) propose a multi-layer attention-based model with joint label embedding. Li et al. (2020) propose to replace the attention with a gate mechanism especially for one-sentence bags.
Hierarchical Relation Extraction. More related to our work, to alleviate the long-tail problem posed by distant supervision, it is natural to utilize relation hierarchies for knowledge transfer across relations. Two existing works fall into this paradigm. Besides using the embedding of a fine-grained relation as the query of selective attention, one also uses the embeddings of coarse-grained relations as extra queries to perform a hierarchical attention. The other enhances the embeddings of multi-granular relations by merging embeddings from a pre-trained graph model and a GCN to alleviate the long-tail problem.

Conclusion
In this paper, we propose a novel multi-instance learning framework, relation-augmented attention, as our base model for distantly supervised relation extraction. It operates at bag level to minimize the effect of wrong labeling and leverages a sent2rel attention to alleviate the long-tail problem. By fully exploiting the hierarchical knowledge of relations, we extend the base model to CoRA to boost the performance on long-tail relations. The experiments conducted on the NYT dataset show that CoRA delivers new state-of-the-art performance in terms of both standard metrics (i.e., top-n precision, PR curve, AUC) and long-tail metrics (i.e., Hits@K). Extensive analyses provide comprehensive insights into the proposed model and verify its capability in handling both wrong labeling and long-tail problems.