Double Graph Based Reasoning for Document-level Relation Extraction

Document-level relation extraction aims to extract relations among entities within a document. Different from sentence-level relation extraction, it requires reasoning over multiple sentences across a document. In this paper, we propose Graph Aggregation-and-Inference Network (GAIN) featuring double graphs. GAIN first constructs a heterogeneous mention-level graph (hMG) to model complex interaction among different mentions across the document. It also constructs an entity-level graph (EG), based on which we propose a novel path reasoning mechanism to infer relations between entities. Experiments on the public dataset, DocRED, show GAIN achieves a significant performance improvement (2.85 on F1) over the previous state-of-the-art. Our code is available at https://github.com/DreamInvoker/GAIN.


Introduction
The task of identifying semantic relations between entities from text, namely relation extraction (RE), plays a crucial role in a variety of knowledge-based applications, such as question answering (Yu et al., 2017) and large-scale knowledge graph construction. Previous methods (Zeng et al., 2014; Zeng et al., 2015; Xiao and Liu, 2016; Zhang et al., 2017; Baldini Soares et al., 2019) focus on sentence-level RE, which predicts relations among entities in a single sentence. However, sentence-level RE models suffer from an inevitable limitation: they fail to recognize relations between entities across sentences. Hence, extracting relations at the document level is necessary for a holistic understanding of the knowledge in text.
There are several major challenges in effective relation extraction at the document level. Firstly, the subject and object entities involved in a relation may appear in different sentences, so a relation cannot be identified based solely on a single sentence. Secondly, the same entity may be mentioned multiple times in different sentences, so cross-sentence context information has to be aggregated to represent the entity better. Thirdly, the identification of many relations requires logical reasoning: these relations can only be extracted successfully when other entities and relations, usually spread across sentences, are identified implicitly or explicitly.

Figure 1: An example document and its desired relations from DocRED (Yao et al., 2019). Entity mentions and relations involved in these relation instances are colored. Other mentions are underlined for clarity.

As Figure 1 shows, it is easy to recognize the intra-sentence relations (Maryland, country, U.S.), (Baltimore, located in the administrative territorial entity, Maryland), and (Eldersburg, located in the administrative territorial entity, Maryland), since the subject and object appear in the same sentence. However, it is non-trivial to predict the inter-sentence relations between Baltimore and U.S., as well as between Eldersburg and U.S., whose mentions do not appear in the same sentence and have long-distance dependencies. Besides, identifying these two relation instances also requires logical reasoning: for example, Eldersburg belongs to the U.S. because Eldersburg is located in Maryland, which belongs to the U.S.

Recently, Yao et al. (2019) proposed a large-scale human-annotated document-level RE dataset, DocRED, to push sentence-level RE forward to the document level; it contains massive relation facts. Figure 1 shows an example from DocRED. We randomly sample 100 documents from the DocRED dev set and manually analyze the bad cases predicted by a BiLSTM-based model proposed by Yao et al. (2019).

Table 1: Statistics of bad cases in 100 documents randomly sampled from the DocRED dev set for BiLSTM (Yao et al., 2019), with 1150 bad cases in total.

As shown in Table 1, inter-sentence errors and logical-reasoning errors take up a large proportion of all bad cases, 53.5% and 21.0% respectively. Therefore, in this paper, we aim to tackle these problems to extract relations from documents better.

Previous work in document-level RE either does not consider reasoning (Gupta et al., 2019; Jia et al., 2019; Yao et al., 2019) or only uses graph-based or hierarchical neural networks to conduct reasoning implicitly (Peng et al., 2017; Sahu et al., 2019; Nan et al., 2020). In this paper, we propose a Graph Aggregation-and-Inference Network (GAIN) for document-level relation extraction, designed to tackle the challenges mentioned above directly. GAIN constructs a heterogeneous Mention-level Graph (hMG) with two types of nodes, namely mention nodes and a document node, and three types of edges, i.e., intra-entity edges, inter-entity edges and document edges, to capture the context information of entities in the document. Then, we apply a Graph Convolutional Network (Kipf and Welling, 2017) on hMG to get a document-aware representation for each mention. An Entity-level Graph (EG) is then constructed by merging mentions that refer to the same entity in hMG, on top of which we propose a novel path reasoning mechanism. This reasoning mechanism allows our model to infer multi-hop relations between entities.
In summary, our main contributions are as follows:
• We propose a novel method, Graph Aggregation-and-Inference Network (GAIN), which features a double graph design, to better cope with the document-level RE task.
• We introduce a heterogeneous Mention-level Graph (hMG) with a graph-based neural network to model the interaction among different mentions across the document and offer document-aware mention representations.
• We introduce an Entity-level Graph (EG) and propose a novel path reasoning mechanism for relational reasoning among entities.
We evaluate GAIN on the public DocRED dataset. It significantly outperforms the previous state-of-the-art model by 2.85 F1 score. Further analysis demonstrates the capability of GAIN to aggregate document-aware context information and to infer logical relations over documents.

Task Formulation
We formulate the document-level relation extraction task as follows. Given a document comprised of N sentences D = {s_i}_{i=1}^N and a set of P entities E = {e_i}_{i=1}^P, where s_i refers to the i-th sentence consisting of M words, e_i = {m_j}_{j=1}^Q, and m_j refers to a span of words belonging to the j-th mention of the i-th entity, the task aims to extract the relations between different entities in E, namely {(e_i, r_ij, e_j) | e_i, e_j ∈ E, r_ij ∈ R}, where R is a pre-defined relation type set.
In our paper, a relation r_ij between entities e_i and e_j is defined as inter-sentential if and only if S_{e_i} ∩ S_{e_j} = ∅, where S_{e_i} denotes the set of sentences containing mentions of e_i. Conversely, a relation r_ij is defined as intra-sentential if and only if S_{e_i} ∩ S_{e_j} ≠ ∅. We also define K-hop relational reasoning as predicting the relation r_ij based on a K-length chain of existing relations, with e_i and e_j being the head and tail of the reasoning chain, i.e., e_i → e_{o_1} → · · · → e_{o_{K−1}} → e_j.
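These definitions can be made concrete with a small sketch; the sentence indices below are hypothetical and not taken from the dataset:

```python
def relation_scope(sents_head, sents_tail):
    """Classify a candidate relation as intra- or inter-sentential.

    sents_head / sents_tail: sets of sentence indices containing
    mentions of the head / tail entity (S_{e_i} and S_{e_j} above).
    """
    return "intra" if sents_head & sents_tail else "inter"

# Entities whose mention sentence sets overlap yield an intra-sentential
# relation; disjoint sets yield an inter-sentential one.
assert relation_scope({0}, {0, 3}) == "intra"
assert relation_scope({2}, {0, 3}) == "inter"
```

A two-hop reasoning chain in this notation is simply e_i → e_o → e_j: the relation between e_i and e_j is predicted from the two existing relations along the chain.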

Graph Aggregation and Inference Network (GAIN)

Figure 2: The overall architecture of GAIN. First, a context encoder consumes the input document to obtain a contextualized representation of each word. Then, the Mention-level Graph is constructed with mention nodes and a document node. After applying GCN, the graph is transformed into the Entity-level Graph, where the paths between entities are identified for reasoning. Finally, the classification module predicts target relations based on the above information. Different entities are shown in different colors. The number i in a mention node denotes that it belongs to the i-th sentence.

GAIN consists of four modules: the encoding module (Sec. 3.1), the mention-level graph aggregation module (Sec. 3.2), the entity-level graph inference module (Sec. 3.3), and the classification module (Sec. 3.4), as shown in Figure 2.

Encoding Module
In the encoding module, we convert a document D = {w_i}_{i=1}^n containing n words into a sequence of vectors {g_i}_{i=1}^n. Following Yao et al. (2019), for each word w_i in D, we first concatenate its word embedding with an entity type embedding and a coreference embedding:

x_i = [E_w(w_i); E_t(t_i); E_c(c_i)]   (1)

where E_w(·), E_t(·) and E_c(·) denote the word embedding layer, entity type embedding layer and coreference embedding layer, respectively, and t_i and c_i are the named entity type and entity id of w_i. We introduce a None entity type and id for words not belonging to any entity. The vectorized word representations are then fed into an encoder to obtain a context-sensitive representation for each word:

[g_1, g_2, . . . , g_n] = Encoder([x_1, x_2, . . . , x_n])   (2)

where the Encoder can be an LSTM or another model.
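The embedding concatenation described above can be sketched as follows; the embedding tables, dimensions, and example values are invented for illustration and are not the model's actual parameters:

```python
# Minimal sketch of the input featurization: each word vector is the
# concatenation of a word embedding, an entity-type embedding, and a
# coreference (entity-id) embedding. Toy 2-/1-dimensional tables.
E_w = {"Eldersburg": [0.1, 0.2], "is": [0.0, 0.3]}
E_t = {"LOC": [1.0], "None": [0.0]}
E_c = {1: [0.5], None: [0.0]}

def encode_word(word, ent_type, ent_id):
    # List concatenation stands in for vector concatenation here.
    return E_w[word] + E_t[ent_type] + E_c[ent_id]

x = encode_word("Eldersburg", "LOC", 1)
assert x == [0.1, 0.2, 1.0, 0.5]
```

The resulting sequence [x_1, ..., x_n] would then be fed to the encoder (a BiLSTM or BERT in the paper's settings) to produce the contextualized vectors g_i.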

Mention-level Graph Aggregation Module
To model the document-level information and the interactions between mentions and entities, a heterogeneous Mention-level Graph (hMG) is constructed. hMG has two kinds of nodes: mention nodes and a document node. Each mention node denotes one particular mention of an entity. hMG also has a single document node that aims to model the overall document information. We argue that this node can serve as a pivot to interact with different mentions and thus reduce the long distances between them in the document.
There are three types of edges in hMG:
• Intra-Entity Edge: Mentions referring to the same entity are fully connected with intra-entity edges. In this way, the interaction among different mentions of the same entity can be modeled.
• Inter-Entity Edge: Two mentions of different entities are connected with an inter-entity edge if they co-occur in a single sentence. In this way, interactions among entities could be modeled by co-occurrences of their mentions.
• Document Edge: All mentions are connected to the document node with the document edge.
With such connections, the document node can attend to all mentions, enabling interactions between the document and its mentions. Besides, the distance between any two mention nodes is at most two with the document node as a pivot, so long-distance dependencies can be better modeled.
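The three edge types can be sketched as edge-list construction over mentions annotated with their entity and sentence ids; the mention tuples below are hypothetical:

```python
from itertools import combinations

# Each mention: (entity_id, sentence_id, mention_node_id). Toy data.
mentions = [(0, 0, 0), (0, 2, 1), (1, 0, 2), (1, 1, 3)]
DOC = "doc"  # the single document node

# Intra-entity: all pairs of mentions of the same entity.
intra = [(a, b) for (ea, _, a), (eb, _, b) in combinations(mentions, 2)
         if ea == eb]
# Inter-entity: mentions of different entities co-occurring in a sentence.
inter = [(a, b) for (ea, sa, a), (eb, sb, b) in combinations(mentions, 2)
         if ea != eb and sa == sb]
# Document edge: every mention is linked to the document node.
doc_edges = [(m, DOC) for (_, _, m) in mentions]

assert intra == [(0, 1), (2, 3)]    # same entity, fully connected
assert inter == [(0, 2)]            # different entities, shared sentence 0
assert len(doc_edges) == len(mentions)
```

Via `doc_edges`, any two mention nodes are at most two hops apart, which is the pivot property argued for above.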
Next, we apply a Graph Convolutional Network (Kipf and Welling, 2017) on hMG to aggregate features from neighbors. Given node u at the l-th layer, the graph convolutional operation is defined as:

h_u^{(l+1)} = σ( Σ_{k∈K} Σ_{v∈N_k(u)} W_k^{(l)} h_v^{(l)} + b_k^{(l)} )   (3)

where K is the set of edge types, N_k(u) denotes the neighbors of node u connected by the k-th type of edge, W_k^{(l)} and b_k^{(l)} are trainable parameters, and σ is an activation function (e.g., ReLU).
Different layers of the GCN express features at different abstraction levels. Therefore, to cover features at all levels, we concatenate the hidden states of each layer to form the final representation of node u:

m_u = [h_u^{(0)}; h_u^{(1)}; . . . ; h_u^{(L)}]   (4)

where h_u^{(0)} is the initial representation of node u. For a mention node spanning the s-th to the t-th word of the document, h_u^{(0)} is initialized with the average of the word representations, i.e., h_u^{(0)} = (1/(t−s+1)) Σ_{j=s}^{t} g_j; the document node is initialized with the document representation output by the encoding module.
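The per-edge-type aggregation and the cross-layer concatenation described above can be sketched with scalar node features; scalar "weights" per edge type stand in for the real weight matrices, so this is an illustrative toy, not the paper's implementation:

```python
def relu(x):
    return max(0.0, x)

def gcn_step(h, neighbors_by_type, W):
    """One graph-convolution step: for each node, sum neighbour features
    transformed per edge type, then apply the activation."""
    new_h = {}
    for u in h:
        total = sum(W[k] * h[v]
                    for k, nbrs in neighbors_by_type.items()
                    for v in nbrs.get(u, []))
        new_h[u] = relu(total)
    return new_h

# Two mentions of the same entity plus the document node (toy features).
h0 = {"m1": 1.0, "m2": 2.0, "doc": 0.5}
nbrs = {"intra": {"m1": ["m2"], "m2": ["m1"]},
        "doc":   {"m1": ["doc"], "m2": ["doc"], "doc": ["m1", "m2"]}}
W = {"intra": 1.0, "doc": 0.5}

h1 = gcn_step(h0, nbrs, W)
assert h1["m1"] == 2.25        # 1.0*2.0 via intra edge + 0.5*0.5 via doc edge

# Final node representation: concatenation of the states at every layer.
final_m1 = [h0["m1"], h1["m1"]]
assert final_m1 == [1.0, 2.25]
```

Concatenating the layer states, rather than keeping only the last layer, preserves both the initial (lexical) and the aggregated (contextual) views of each node.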

Entity-level Graph Inference Module
In this subsection, we introduce the Entity-level Graph (EG) and the path reasoning mechanism. First, mentions that refer to the same entity are merged into an entity node to obtain the nodes of EG. Note that the document node is not included in EG. The i-th entity node e_i, with N mentions, is represented by the average of its N mention representations:

e_i = (1/N) Σ_{n=1}^{N} m_n   (5)

Then, we merge all inter-entity edges that connect mentions of the same two entities to obtain the edges of EG. The representation of the directed edge from e_i to e_j in EG is defined as:

ε_{i→j} = σ(W_q [e_i; e_j] + b_q)   (6)

where W_q and b_q are trainable parameters and σ is an activation function (e.g., ReLU). Based on these vectorized edge representations, the i-th path between head entity e_h and tail entity e_t passing through intermediate entity e_o is represented as:

p_{h,t}^i = [ε_{h→o}; ε_{o→t}]   (7)

Note that we only consider two-hop paths here, although this easily extends to multi-hop paths. We also introduce an attention mechanism (Bahdanau et al., 2015), using the entity pair (e_h, e_t) as the query, to fuse the information of the different paths between e_h and e_t:
s_i = σ([e_h; e_t]^T W_l p_{h,t}^i)

α_i = exp(s_i) / Σ_j exp(s_j)

p_{h,t} = Σ_i α_i p_{h,t}^i

where α_i is the normalized attention weight for the i-th path, W_l is a trainable parameter, and σ is an activation function. Consequently, the model pays more attention to useful paths. With this module, an entity can be represented by fusing information from its mentions, which are usually spread across multiple sentences. Moreover, potential reasoning clues are modeled by the different paths between entities, which are then integrated with the attention mechanism so that latent logical reasoning chains are taken into account when predicting relations.
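The path-attention fusion described above can be sketched as follows; plain dot-product scoring stands in for the paper's learned scorer, and all vectors are toy values:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_paths(query, paths):
    """Score each candidate path against the query, softmax-normalize
    the scores, and return the attention-weighted sum of the paths."""
    scores = [dot(query, p) for p in paths]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    fused = [sum(a * p[d] for a, p in zip(alphas, paths))
             for d in range(len(paths[0]))]
    return alphas, fused

query = [1.0, 0.0]                 # stand-in for the query built from (e_h, e_t)
paths = [[2.0, 0.0], [0.0, 2.0]]   # two candidate reasoning paths
alphas, fused = fuse_paths(query, paths)
assert alphas[0] > alphas[1]       # the path aligned with the query dominates
assert abs(sum(alphas) - 1.0) < 1e-9
```

The softmax guarantees the weights form a distribution, so uninformative paths are down-weighted rather than discarded outright.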

Classification Module
For each entity pair (e_h, e_t), we concatenate the following representations: (1) the head and tail entity representations e_h and e_t derived from the Entity-level Graph, with the comparing operation (Mou et al., 2016) to strengthen features, i.e., the absolute value of the element-wise subtraction of the two entity representations, |e_h − e_t|, and their element-wise multiplication, e_h ⊙ e_t; (2) the representation of the document node in the Mention-level Graph, m_doc, as it helps aggregate cross-sentence information and provides a document-aware representation; (3) the comprehensive inferential path information p_{h,t}.
Finally, we formulate the task as a multi-label classification task and predict the relations between entities:

P(r | e_h, e_t) = sigmoid(W_b σ(W_a [e_h; e_t; |e_h − e_t|; e_h ⊙ e_t; m_doc; p_{h,t}] + b_a) + b_b)

where W_a, W_b, b_a, b_b are trainable parameters and σ is an activation function (e.g., ReLU). We use binary cross entropy as the classification loss to train our model end-to-end:

L = − Σ_{D∈S} Σ_{h≠t} Σ_{r∈R} [ I(r ∈ T_{h,t}) log P(r | e_h, e_t) + (1 − I(r ∈ T_{h,t})) log(1 − P(r | e_h, e_t)) ]

where S denotes the whole corpus, T_{h,t} denotes the set of gold relations of the entity pair (e_h, e_t), and I(·) refers to the indicator function.
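The feature construction and per-relation scoring can be sketched as follows; the single linear scorer and toy weights below are stand-ins for the trained two-layer predictor:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_features(e_h, e_t, m_doc, p_ht):
    """Concatenate [e_h; e_t; |e_h - e_t|; e_h * e_t; m_doc; p_ht]."""
    absdiff = [abs(a - b) for a, b in zip(e_h, e_t)]
    hadamard = [a * b for a, b in zip(e_h, e_t)]
    return e_h + e_t + absdiff + hadamard + m_doc + p_ht

feats = build_features([1.0], [0.5], [0.2], [0.3])
assert feats == [1.0, 0.5, 0.5, 0.5, 0.2, 0.3]

# Multi-label prediction: one independent sigmoid score per relation type,
# here with a toy weight vector in place of learned parameters.
w = [0.1] * len(feats)
prob = sigmoid(sum(a * b for a, b in zip(w, feats)))
assert 0.0 < prob < 1.0
```

Because each relation type is scored independently with a sigmoid, an entity pair can legitimately receive several relation labels at once, which binary cross entropy handles naturally.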

Experimental Settings
In our GAIN implementation, we use 2 layers of GCN and set the dropout rate to 0.6 and the learning rate to 0.001. We train GAIN with AdamW (Loshchilov and Hutter, 2019) as the optimizer with weight decay 0.0001, and implement GAIN with PyTorch (Paszke et al., 2017) and DGL (Wang et al., 2019b). We implement three settings of GAIN. GAIN-GloVe uses GloVe (100d) as word embeddings and a BiLSTM (256d) as the encoder. GAIN-BERT base and GAIN-BERT large use BERT base and BERT large as the encoder, respectively, with the learning rate set to 1e-5.

Baselines and Evaluation Metrics
We use the following models as baselines. Yao et al. (2019) proposed models that encode the document into a sequence of hidden state vectors {h_i}_{i=1}^n using CNN (Fukushima, 1980), LSTM (Hochreiter and Schmidhuber, 1997), or BiLSTM (Schuster and Paliwal, 1997) as the encoder, and predict relations between entities from these representations. Pre-trained models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and CorefBERT (Ye et al., 2020) have also been used as encoders (Wang et al., 2019a; Ye et al., 2020) for the document-level RE task.
Context-Aware, also implemented by Yao et al. (2019) on DocRED and adapted from Sorokin and Gurevych (2017), uses an LSTM to encode the text and further utilizes an attention mechanism to absorb contextual relational information for prediction.
BERT-Two-Step base, proposed by Wang et al. (2019a) on DocRED, is similar to BERT-RE base but first predicts whether two entities have a relationship and then predicts the specific target relation.
HIN-GloVe/HIN-BERT base, proposed by Tang et al. (2020). The Hierarchical Inference Network (HIN) aggregates information at the entity, sentence, and document levels to predict target relations, using GloVe (Pennington et al., 2014) or BERT base for word embeddings.
LSR-GloVe/LSR-BERT base, proposed by Nan et al. (2020) recently, constructs a graph based on the dependency tree and predicts relations through latent structure induction and GCN. Nan et al. (2020) also adapted four graph-based state-of-the-art RE models to DocRED, including GAT (Veličković et al., 2018).

Following Yao et al. (2019), we use the widely used metrics F1 and AUC in our experiments. We also report Ign F1 and Ign AUC, which compute F1 and AUC after excluding the relation facts shared between the training set and the dev/test sets.
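One plausible reading of the Ign F1 metric can be sketched as follows; the triples are hypothetical and the official evaluation script may differ in details:

```python
def f1_ignoring(pred, gold, train_facts):
    """F1 over (head, relation, tail) triples after discarding facts
    that already appear in the training set."""
    pred = pred - train_facts
    gold = gold - train_facts
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {("e1", "r1", "e2"), ("e2", "r2", "e3")}
pred = {("e1", "r1", "e2"), ("e2", "r2", "e3"), ("e1", "r9", "e3")}
train = {("e1", "r1", "e2")}  # this fact was already seen in training

# After ignoring the leaked fact: 1 true positive out of 2 predictions
# and 1 gold fact, so precision 0.5, recall 1.0, F1 ~ 0.67.
assert round(f1_ignoring(pred, gold, train), 2) == 0.67
```

The point of the metric is to prevent a model from scoring well merely by memorizing relation facts it has already seen during training.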

Results
We show GAIN's performance on the DocRED dataset in Table 2, in comparison with other baselines.
Table 2: Performance on DocRED. Results with * are reported in (Nan et al., 2020). Results with † are based on our implementation.

Among the models not using BERT or its variants, GAIN-GloVe consistently outperforms all strong sequence-based and graph-based baselines by 0.9∼12.82 F1 on the test set. Among the models using BERT or its variants, GAIN-BERT base improves F1/Ign F1 on the dev and test sets by 2.22/6.71 and 2.19/2.03, respectively, compared with the strong baseline LSR-BERT base. GAIN-BERT large also improves F1/Ign F1 on the test set by 2.85/2.63 compared with the previous state-of-the-art method, CorefRoBERTa-RE large. This suggests that GAIN is more effective for document-level RE tasks. We can also observe that LSR-BERT base improves F1 by 3.83 and 4.87 on the dev and test sets when its GloVe embeddings are replaced with BERT base, whereas GAIN-BERT base yields improvements of 5.93 and 6.16, which indicates that GAIN makes better use of the BERT representations.

Ablation Study
To further analyze GAIN, we conduct ablation studies to illustrate the effectiveness of its different modules and mechanisms; the results are shown in Table 3. First, we remove the heterogeneous Mention-level Graph (hMG). Specifically, we initialize an entity node in the Entity-level Graph (EG) with Eq. 5 but replace m_n with h_n^{(0)}, and apply the GCN to EG instead. Features from different GCN layers are concatenated to obtain e_i. Without hMG, the performance of GAIN-GloVe/GAIN-BERT base drops sharply, by 2.08/2.02 Ign F1 on the dev set. This drop shows that hMG plays a vital role in capturing interactions among mentions of the same and of different entities, as well as document-aware features.
Next, we remove the inference module. Specifically, the model abandons the path information p_{h,t} between head and tail entities obtained from the Entity-level Graph and predicts relations based only on the entity representations e_h and e_t and the document node representation m_doc. Removing the inference module degrades performance across all metrics, for instance a 2.21/2.17 Ign F1 decrease on the dev set for GAIN-GloVe/GAIN-BERT base. This suggests that our path inference mechanism helps capture the potential K-hop inference paths needed to infer relations and therefore improves document-level RE performance.
Moreover, removing the document node in hMG leads to a 2.19/1.88 Ign F1 decrease on the dev set for GAIN-GloVe/GAIN-BERT base. The document node helps GAIN aggregate document information and works as a pivot that facilitates information exchange among different mentions, especially those far away from each other within the document.

Analysis & Discussion
In this subsection, we further analyze both the inter-sentential and the inferential performance on the development set. Following Nan et al. (2020), we report Intra-F1 and Inter-F1 scores in Table 4, which consider only intra- or inter-sentence relations, respectively. Similarly, to evaluate the reasoning ability of the models, we report Infer-F1 scores in Table 5, which consider only relations engaged in the relational reasoning process. For example, we take the gold relation facts r_1, r_2, and r_3 into account when computing Infer-F1 only if the relation facts (e_1, r_1, e_2), (e_2, r_2, e_3), and (e_1, r_3, e_3) all exist.

Table 4: Intra- and Inter-F1 results on the dev set of DocRED. Results with * are reported in (Nan et al., 2020).
As Table 4 shows, GAIN outperforms the other baselines not only in Intra-F1 but also in Inter-F1, and removing hMG leads to a larger decrease in Inter-F1 than in Intra-F1. This indicates that our hMG does help model interactions among mentions, especially those distributed in different sentences with long-distance dependencies.
Besides, Table 5 suggests that GAIN handles relational inference better. For example, GAIN-BERT base improves Infer-F1 by 5.11 compared with RoBERTa-RE base. The inference module also plays an important role in capturing potential inference chains between entities: without it, GAIN-BERT base drops by 1.78 Infer-F1.

Related Work
Previous approaches focus on sentence-level relation extraction (Zeng et al., 2014; Zeng et al., 2015; Wang et al., 2016; Zhou et al., 2016; Xiao and Liu, 2016; Zhang et al., 2017; Feng et al., 2018; Zhu et al., 2019). However, sentence-level RE models face an inevitable restriction in practice: many real-world relation facts can only be extracted across sentences. Therefore, many researchers have gradually shifted their attention to document-level relation extraction.

Figure 3: A case study of our proposed GAIN and the baseline models. The models take the document as input and predict relations among the entities shown in different colors. Due to space limitations, we only show some of the entities within the documents and the corresponding sentences.

Several approaches (Peng et al., 2017; Gupta et al., 2019; Song et al., 2018; Jia et al., 2019) leverage dependency graphs to better capture document-specific features, but they ignore the relational inference that is ubiquitous in documents. Recently, many models have been proposed to address this problem. Tang et al. (2020) proposed a hierarchical inference network that considers information at the entity, sentence, and document levels. However, it conducts relational inference implicitly through a hierarchical network, whereas we adopt a path reasoning mechanism, which is more explicit. Christopoulou et al. (2019) propose one of the most powerful recent systems for document-level RE. Compared with Christopoulou et al. (2019) and other graph-based approaches to relation extraction, our architecture features many different designs with different motivations behind them. First, the ways of constructing the graphs differ. We create two separate graphs at different levels to capture long-distance document-aware interactions and entity path inference information, respectively, whereas Christopoulou et al. (2019) put mentions and entities in the same graph.
Moreover, they do not perform graph node representation learning such as GCN to aggregate interactive information over the constructed graph, instead representing nodes only with features from BiLSTMs. Second, the path inference processes differ. Christopoulou et al. (2019) use a walk-based method to iteratively generate a path for every entity pair, which requires extra hyper-parameter tuning to control the inference process. Instead, we use an attention mechanism to selectively fuse all possible path information for an entity pair without such overhead.
While we were writing this paper, Nan et al. (2020) made their work public as a preprint; it adopts the dependency tree to capture semantic information in the document. They put mention and entity nodes in the same graph and conduct inference implicitly with GCN. Unlike their work, our GAIN places mention nodes and entity nodes in different graphs to better aggregate inter-sentence information and to infer relations more explicitly.
Some other attempts (Verga et al., 2018; Sahu et al., 2019) study document-level RE in specific domains such as biomedical RE. However, the datasets they use usually contain very limited relation and entity types. For instance, CDR (Li et al., 2016) has only one relation type and two entity types, which may not be an ideal testbed for relational reasoning.

Conclusion
Extracting inter-sentence relations and conducting relational reasoning are challenging in document-level relation extraction.
In this paper, we introduce the Graph Aggregation-and-Inference Network (GAIN) to better cope with document-level relation extraction; it features two graphs at different granularities. GAIN utilizes a heterogeneous Mention-level Graph to model the interaction among different mentions across the document and to capture document-aware features. It also uses an Entity-level Graph with a proposed path reasoning mechanism to infer relations more explicitly.
Experimental results on the large-scale human-annotated dataset DocRED show that GAIN outperforms previous methods, especially for inter-sentence and inferential relations. The ablation study also confirms the effectiveness of the different modules of our model.