Denoising Relation Extraction from Document-level Distant Supervision

Distant supervision (DS) has been widely used to generate auto-labeled data for sentence-level relation extraction (RE), which improves RE performance. However, the existing success of DS cannot be directly transferred to the more challenging document-level relation extraction (DocRE), since the inherent noise in DS may be amplified at the document level and significantly harm RE performance. To address this challenge, we propose a novel pre-trained model for DocRE, which denoises document-level DS data via multiple pre-training tasks. Experimental results on a large-scale DocRE benchmark show that our model can capture useful information from noisy DS data and achieves promising results.


Introduction
Relation extraction (RE) aims to identify relational facts between entities in text. Recently, neural relation extraction (NRE) models have proven effective in sentence-level RE (Zeng et al., 2014). Distant supervision (DS) (Mintz et al., 2009) provides large-scale distantly supervised data that multiplies training instances and enables sufficient model training.
Sentence-level RE focuses on extracting intra-sentence relations between entities within a single sentence. However, it is severely restricted in generality and coverage in practice, since plenty of inter-sentence relational facts are hidden across multiple sentences. Statistics on a large-scale RE dataset constructed from Wikipedia documents show that at least 40.7% of relational facts can only be inferred from multiple sentences (Yao et al., 2019). Therefore, document-level relation extraction (DocRE) has been proposed to jointly extract both inter- and intra-sentence relations (Christopoulou et al., 2019). Fig. 1 gives a brief illustration of DocRE.
Most DocRE models rely heavily on high-quality human-annotated training data, which is time-consuming and labor-intensive to obtain. However, it is extremely challenging to extend sentence-level DS to the document level. The challenges of conducting document-level DS mainly come from three sources: (1) each entity has multiple mentions, and mentions without relational context bring noise into entity representations; (2) the inherent noise of DS is amplified at the document level: statistics in Yao et al. (2019) show that 61.8% of inter-sentence relation instances generated by document-level DS are actually noise; (3) it is difficult to capture useful relational semantics from long documents, since most content in a document may be irrelevant to the given entities and relations. In sentence-level RE, several efforts (Lin et al., 2016; Feng et al., 2018) have been devoted to denoising the DS corpus by jointly considering multiple instances. However, these denoising methods cannot be directly adapted to DocRE, since they are specially designed for bag-level RE evaluations.
In this work, we introduce document-level DS to DocRE after denoising. To alleviate the noise, we propose a pre-trained model with three specially designed tasks that denoise the document-level DS corpus and leverage its useful information. The three pre-training tasks are: (1) Mention-Entity Matching, which aims to capture useful information from multiple mentions to produce informative entity representations. It consists of intra-document and inter-document sub-tasks: the intra-document sub-task matches masked mentions to entities within a document to grasp coreference information, while the inter-document sub-task matches entities between two documents to grasp entity associations across documents. (2) Relation Detection, which focuses on denoising "Not-A-Relation" (NA) and incorrectly labeled instances by detecting the entity pairs that hold relations, i.e., positive instances. It is specially designed as the document-level denoising task; we also train a pre-denoising module with this task to filter out NA instances before pre-training. (3) Relational Fact Alignment, which requires the model to produce similar representations for the same entity pair under diverse expressions. This lets the model focus on diverse relational expressions and denoise irrelevant information in the document.
In experiments, we evaluate our model on an open DocRE benchmark and achieve significant improvements over competitive baselines. We also conduct detailed analyses and ablation tests, which further highlight the value of DS data and verify the effectiveness of our pre-trained model for DocRE. To the best of our knowledge, we are the first to denoise document-level DS with pre-trained models. We will release our code in the future.

Related Work

Sentence-level RE. In sentence-level RE, denoising methods (Lin et al., 2016; Feng et al., 2018) jointly consider multiple instances of an entity pair to select informative instances. It is hard to directly adopt these models for DocRE, since DocRE should extract multiple relational facts from each document. Soares et al. (2019) propose a pre-trained model for sentence-level RE.

Document-level RE. Document-level RE attempts to extend the scope of knowledge acquisition to the document level, which has attracted great attention recently (Yao et al., 2019). Some works use linguistic features (Xu et al., 2016; Gu et al., 2017) and graph-based models (Christopoulou et al., 2019; Sahu et al., 2019) to extract inter-sentence relations from human-annotated data. Quirk and Poon (2017) and Peng et al. (2017) attempt to extract inter-sentence relations with distantly supervised data; however, they only consider entity pairs within three consecutive sentences. Different from these works, we bring document-level DS to DocRE and conduct pre-training to denoise the DS data.

Methodology
In this section, we present our proposed model in detail. Fig. 2 gives an illustration of our framework. We first apply the pre-denoising module to screen out NA instances from all documents. Then we pre-train the document encoder with three pre-training tasks on the document-level distantly supervised dataset. Finally, we fine-tune the model on the human-annotated dataset.

Document Encoder
We adopt BERT (Devlin et al., 2019) as the document encoder to encode documents into representations of entity mentions, entities, and relational instances. Let $D = \{w_i\}_{i=1}^{n}$ denote the input document consisting of $n$ tokens, and let $\mathcal{E} = \{e_i\}$ be the set of entities mentioned in the document, where each entity $e_i = \{m_i^j\}_{j=1}^{l_i}$ has $l_i$ mentions in the document. Following Soares et al. (2019), we use entity markers [Ei] and [/Ei] for each entity $e_i$: the start marker [Ei] is inserted at the beginning of every mention of $e_i$, and the end marker [/Ei] is inserted at its end.
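For concreteness, here is a minimal sketch of the marker insertion step; the input format (a token list plus an entity-to-mention-span mapping) is an illustrative assumption of ours, not taken from the paper:

```python
def insert_entity_markers(tokens, entities):
    """Insert [Ei] / [/Ei] around every mention of each entity.

    `entities` maps an entity index i to its mention spans (start, end),
    with `end` exclusive -- an assumed input format for illustration.
    """
    inserts = []
    for i, spans in entities.items():
        for start, end in spans:
            inserts.append((start, f"[E{i}]"))
            inserts.append((end, f"[/E{i}]"))
    # Insert from the back so earlier positions remain valid.
    out = list(tokens)
    for pos, marker in sorted(inserts, key=lambda x: -x[0]):
        out.insert(pos, marker)
    return out

# insert_entity_markers(["Bill", "Gates", "founded", "Microsoft"],
#                       {0: [(0, 2)], 1: [(3, 4)]})
# -> ['[E0]', 'Bill', 'Gates', '[/E0]', 'founded', '[E1]', 'Microsoft', '[/E1]']
```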
We use BERT to encode the token sequence with entity markers into a sequence of hidden states $\{\mathbf{h}_1, \dots, \mathbf{h}_n\}$, where $n$ here denotes the length of the sequence including entity markers. We define the representation $\mathbf{m}_i^j$ of each entity mention as the hidden state of its start marker. A max-pooling operation then aggregates the representation of entity $e_i$ from its mentions: $\mathbf{e}_i = \mathrm{MaxPooling}(\{\mathbf{m}_i^j\}_{j=1}^{l_i})$. Finally, for each entity pair $(e_i, e_k)$, we use a bilinear layer to compute the relational representation: $\mathbf{r}_{i,k} = \mathrm{Bilinear}_E(\mathbf{e}_i, \mathbf{e}_k)$.
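A sketch of this representation layer, assuming PyTorch, a BERT-Base hidden size of 768, and an input layout of our own choosing (this is not the authors' released code):

```python
import torch
import torch.nn as nn

class RelationalRepresentations(nn.Module):
    """Mention, entity, and relational representations on top of BERT
    hidden states (shapes and inputs are illustrative assumptions)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Bilinear_E: maps an entity pair to a relational representation.
        self.bilinear_e = nn.Bilinear(hidden_size, hidden_size, hidden_size)

    def entity_reps(self, hidden: torch.Tensor, start_positions):
        """`hidden` is the (seq_len, hidden) BERT output; `start_positions[i]`
        lists the token indices of the start markers [Ei] of entity e_i."""
        entities = []
        for idx in start_positions:
            mentions = hidden[torch.tensor(idx)]         # (l_i, hidden)
            entities.append(mentions.max(dim=0).values)  # max-pooling
        return torch.stack(entities)                     # (num_entities, hidden)

    def relation_rep(self, e_i: torch.Tensor, e_k: torch.Tensor):
        # r_{i,k} = Bilinear_E(e_i, e_k); inputs are (batch, hidden).
        return self.bilinear_e(e_i, e_k)
```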

Pre-training Tasks
We design three pre-training tasks that help the model denoise document-level DS data and learn informative representations at both the mention/entity level and the relation level from large-scale DS data.

Mention-Entity Matching. An entity is usually mentioned multiple times in a document, and expressive entity representations must capture relational information from all these mentions. Hence, we propose the mention-entity matching task to help the model produce expressive representations for mentions and entities; it includes intra-document and inter-document sub-tasks.
The intra-document sub-task requires the model to grasp coreference information within a document. We randomly mask an entity mention and require the model to predict which entity in the document it belongs to. Formally, given the masked entity mention $m_q$ and $k_m$ entities $\{e_m^i\}_{i=1}^{k_m}$ from the same document, we compute the matching score between $e_m^i$ and $m_q$ with a bilinear layer:

$$s_i^{\text{intra}} = \mathrm{Bilinear}_M(\mathbf{m}_q, \mathbf{e}_m^i). \quad (1)$$

The inter-document sub-task requires the model to link the same entity across two different documents, encouraging it to encode useful contextual information into the representations. Given the entities $\{e_A^i\}_{i=1}^{k_e}$ from document $d_A$, where $k_e$ is the size of the entity set, and an entity $e_B^q$ from document $d_B$ that is also mentioned in $d_A$, we define the matching score between $e_B^q$ and $e_A^i$ as:

$$s_i^{\text{inter}} = \mathrm{Bilinear}_M(\mathbf{e}_B^q, \mathbf{e}_A^i), \quad (2)$$

where $\mathrm{Bilinear}_M$ is the same bilinear layer as in the intra-document sub-task. In both sub-tasks, the matching scores are fed into a softmax function over the candidate entities.
Relation Detection. The NA relation dominates in DocRE, so models must denoise NA instances and identify true positive instances among NA noise. We therefore design this task, which requires the model to distinguish positive entity pairs from NA instances. Formally, given $k_n$ instances $\{r_n^i\}_{i=1}^{k_n}$ sampled from given documents, of which exactly one is positive, we compute the positive score of each instance as:

$$s_i = \mathbf{w}_n^{\top} \mathbf{r}_n^i + b_n, \quad (3)$$

where $\mathbf{w}_n$ and $b_n$ are weights and bias. We then apply a softmax function to compute the probability that the $i$-th instance is positive. Similar to the previous mention/entity-level task, this task is also divided into intra- and inter-document sub-tasks: in the intra-document sub-task, all instances are sampled from a single document; in the inter-document sub-task, the instances are sampled from different documents.
Relational Fact Alignment. To grasp useful information from long documents and denoise irrelevant content, we design this relation-level task, which requires the representations of the same entity pair in different documents to be similar. Formally, assume $d_A$ and $d_B$ are two documents from the training set that share several relational facts. Let $\{r_A^i\}_{i=1}^{k_s}$ denote the relational instances in $d_A$, and let $\mathbf{r}_B^q$ denote the representation of the relational instance in $d_B$ whose relational fact is also contained in $d_A$. The model is required to find the relational instance among $\{r_A^i\}_{i=1}^{k_s}$ that shares the same relational fact as $r_B^q$. First, we compute the similarity score of two relational instances:

$$s_i = {\mathbf{r}_A^i}^{\top} \mathbf{W}_s \, \mathbf{r}_B^q + b_s, \quad (4)$$

where $\mathbf{W}_s$ and $b_s$ are weights and bias. The similarity scores are then fed into a softmax over the instances in $d_A$.
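All three tasks share the same objective pattern: score a query against a set of candidates (Eqs. 1–4) and apply a softmax cross-entropy over the scores. A minimal sketch of this shared pattern, with layer shapes and module choices assumed by us:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 768  # assumed BERT-Base hidden size

bilinear_m = nn.Bilinear(HIDDEN, HIDDEN, 1)  # Bilinear_M, Eqs. 1-2
detect = nn.Linear(HIDDEN, 1)                # w_n, b_n, Eq. 3
align = nn.Bilinear(HIDDEN, HIDDEN, 1)       # W_s, b_s, Eq. 4

def softmax_ce(scores: torch.Tensor, gold: int) -> torch.Tensor:
    """Softmax over candidate scores, cross-entropy against the gold index."""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold]))

def matching_loss(query, candidates, gold):
    # Score a masked mention (intra-doc) or an entity from another
    # document (inter-doc) against the candidate entities.
    q = query.unsqueeze(0).expand(candidates.size(0), -1)
    return softmax_ce(bilinear_m(q, candidates).squeeze(-1), gold)

def detection_loss(instances, gold):
    # Exactly one of the k_n relational representations is positive.
    return softmax_ce(detect(instances).squeeze(-1), gold)

def alignment_loss(r_q, candidates, gold):
    # Find the instance in d_A sharing the relational fact with r_B^q.
    q = r_q.unsqueeze(0).expand(candidates.size(0), -1)
    return softmax_ce(align(q, candidates).squeeze(-1), gold)
```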
Finally, the overall pre-training loss $\mathcal{L}$ is the sum of the cross-entropy losses of the three tasks:

$$\mathcal{L} = \mathcal{L}_{\text{MM}} + \mathcal{L}_{\text{RD}} + \mathcal{L}_{\text{RA}}. \quad (5)$$

Note that this loss could be easily minimized by an entity linking system without any relational knowledge. To avoid this problem, following Soares et al. (2019), we replace all mentions of an entity in a document with a special blank symbol [BLANK] with probability α. In that case, the model can only learn representations from the context, so minimizing $\mathcal{L}$ requires the model to do more than just memorize named entities.
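A sketch of the entity-blanking step; the input format is hypothetical, and alpha=0.7 is an assumed default (the value used by Soares et al. (2019)), not necessarily this paper's setting:

```python
import random

def blank_entities(tokens, entity_mentions, alpha=0.7):
    """Replace all mentions of an entity with [BLANK] with probability alpha.

    `entity_mentions` maps an entity id to the (start, end) token spans of
    its mentions (an assumed input format, `end` exclusive)."""
    tokens = list(tokens)
    for spans in entity_mentions.values():
        if random.random() < alpha:
            for start, end in spans:
                # Overwrite every token of the mention; a real implementation
                # must keep the entity markers [Ei] ... [/Ei] intact.
                for t in range(start, end):
                    tokens[t] = "[BLANK]"
    return tokens
```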

Pre-denoising Module
As stated before, document-level DS generates more noise. To alleviate this issue, we propose to screen out entity pairs with low relational probability from all documents using a rank model. We train the rank model with the Relation Detection task on the human-annotated training set, so that it assigns high scores to positive instances and low scores to NA instances. During pre-denoising, we compute positive scores for all entity pairs as in Eq. 3. Then, for each document, we rank all its entity pairs by their positive scores and keep the top $k_d$ entity pairs for pre-training, fine-tuning, and evaluation. The architecture of the pre-denoising module is the same as the model used for pre-training; please refer to the previous section for details. With the pre-denoising module, both the wrong-labeling problem in the DS corpus and the label imbalance problem in the human-annotated corpus (i.e., most entity pairs are NA instances) can be alleviated.
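A sketch of the filtering step, assuming a trained scorer implementing Eq. 3 and a per-document list of candidate entity pairs; the value $k_d = 50$ below is purely illustrative:

```python
import torch

def predenoise(pairs, reps, scorer, k_d=50):
    """Keep the k_d entity pairs with the highest positive scores.

    `pairs` is a list of (head, tail) entity indices, `reps` the matching
    relational representations of shape (n_pairs, hidden), and `scorer`
    a module implementing Eq. 3 (e.g., nn.Linear(hidden, 1))."""
    with torch.no_grad():
        scores = scorer(reps).squeeze(-1)            # (n_pairs,)
    keep = torch.topk(scores, k=min(k_d, len(pairs))).indices
    return [pairs[i] for i in keep.tolist()]
```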
Experiments

In experiments, we use the document-level DS data to pre-train our model and then fine-tune and evaluate it on the human-annotated data. Following Yao et al. (2019), we use F1 and Ign F1 as evaluation metrics, where Ign F1 denotes the F1 score excluding relational facts that appear in both the training and dev/test sets. Please refer to the appendix for details about DocRED and the experimental settings.
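For reference, a sketch of the Ign F1 computation in the style of the DocRED evaluation: correct predictions whose fact also appears in the training set are discounted from precision, while recall keeps the full gold set. The (head, relation, tail) triple format is an assumption:

```python
def ign_f1(pred, gold, train_facts):
    """Ign F1 over sets of (head, relation, tail) triples (a sketch, not
    the official evaluation script)."""
    correct = pred & gold
    in_train = len(correct & train_facts)
    denom = len(pred) - in_train
    p = (len(correct) - in_train) / denom if denom > 0 else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```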

Implementation Details
We pre-train our model based on BERT-Base. All hyper-parameters were selected by manual tuning. The learning rate is set to $3 \times 10^{-5}$ for pre-training and $10^{-5}$ for fine-tuning.
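In code, the two stages would differ only in the learning rate; the optimizer choice below is our assumption, as the paper specifies only the rates:

```python
from torch.optim import AdamW

def make_optimizer(model, pretraining: bool):
    # 3e-5 for pre-training, 1e-5 for fine-tuning, as stated above.
    return AdamW(model.parameters(), lr=3e-5 if pretraining else 1e-5)
```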

Ablation Study
To explore the contribution of each pre-training task, we report the results of an ablation study in Tab. 2, where the pre-training tasks are turned off one at a time. MM, RD, and RA denote the three pre-training tasks: Mention-Entity Matching, Relation Detection, and Relational Fact Alignment. We observe that all three pre-training tasks contribute to the main model, as performance deteriorates when any task is missing. Notably, removing the RD pre-training task leads to a large drop in both F1 and Ign F1 scores, even below those of our model without pre-training (BERT+D). This is because without RD the model cannot identify positive instances, which is crucial in document-level RE, and the label imbalance problem then drags the scores down. Moreover, we conduct another ablation study to explore the effectiveness of the intra- and inter-document sub-tasks. The results are also shown in Tab. 2, where w/o Intra and w/o Inter refer to pre-training without the intra- and inter-document sub-tasks, respectively. We find that both sub-tasks contribute to the main model in general.

Conclusion
In this work, we propose to denoise distantly supervised data for DocRE via multiple pre-training tasks. Experimental results verify the effectiveness of our model. In the future, we will explore how to improve the efficiency of our pre-training.

Figure 1: An example of DocRE. Given a document, DocRE models should capture the relational semantics across sentences to extract multiple relational facts.

Figure 2: The framework of our proposed model.

Table 1: Main results on DocRED. Results with *, ♣, and ♠ are from Yao et al. (2019), Wang et al. (2019), and Tang et al. (2020), respectively.

Specifically, D refers to the pre-denoising module and P indicates the pre-training tasks. From the results, we observe that our model outperforms all baselines by a significant margin, which is due to the effectiveness of the pre-denoising mechanism.

Table 2: Results of the ablation study on DocRED.