Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Document-level relation extraction requires integrating information within and across multiple sentences of a document and capturing complex interactions between inter-sentence entities. However, effective aggregation of relevant information in the document remains a challenging research question. Existing approaches construct static document-level graphs based on syntactic trees, co-references or heuristics from the unstructured text to model the dependencies. Unlike previous methods that may not be able to capture rich non-local interactions for inference, we propose a novel model that empowers relational reasoning across sentences by automatically inducing the latent document-level graph. We further develop a refinement strategy, which enables the model to incrementally aggregate relevant information for multi-hop reasoning. Specifically, our model achieves an F1 score of 59.05 on a large-scale document-level dataset (DocRED), significantly improving over the previous results, and also yields new state-of-the-art results on the CDR and GDA datasets. Furthermore, extensive analyses show that the model is able to discover more accurate inter-sentence relations.


Introduction
Relation extraction aims to detect relations among entities in the text and plays a significant role in a variety of natural language processing applications. Early research efforts focus on predicting relations between entities within a single sentence (Zeng et al., 2014; Xu et al., 2015a,b). However, in real-world scenarios, valuable relational information between entities, such as biomedical findings, is expressed by multiple mentions across sentence boundaries (Peng et al., 2017). Therefore, the scope of extraction in the biomedical domain has recently been expanded to the cross-sentence level (Gupta et al., 2018; Song et al., 2019).

Figure 1: An example adapted from the DocRED dataset. The example has four entities: Lutsenko, internal affairs, Yulia Tymoshenko and Ukrainian. Here the entity Lutsenko has two mentions: Lutsenko and He. Mentions corresponding to the same entity are highlighted with the same color. The solid and dotted lines represent intra- and inter-sentence relations, respectively.
A more challenging, yet practical, extension is document-level relation extraction, where a system needs to comprehend multiple sentences to infer the relations among entities by synthesizing relevant information from the entire document (Jia et al., 2019; Yao et al., 2019). Figure 1 shows an example adapted from the recently proposed document-level dataset DocRED (Yao et al., 2019). In order to infer the inter-sentence relation (i.e., country of citizenship) between Yulia Tymoshenko and Ukrainian, one first has to identify the fact that Lutsenko works with Yulia Tymoshenko. Next, we identify that Lutsenko manages internal affairs, which is a Ukrainian authority. After incrementally connecting the evidence in the document and performing step-by-step reasoning, we are able to infer that Yulia Tymoshenko is also Ukrainian.
Prior efforts show that interactions between mentions of entities facilitate the reasoning process in document-level relation extraction. Thus, Verga et al. (2018) and Jia et al. (2019) leverage Multi-Instance Learning (Riedel et al., 2010; Surdeanu et al., 2012). On the other hand, structural information has been used to perform better reasoning, since it models non-local dependencies that are obscure from the surface form alone. Peng et al. (2017) construct a dependency graph to capture interactions among n-ary entities for cross-sentence extraction. Sahu et al. (2019) extend this approach by using co-reference links to connect the dependency trees of sentences into a document-level graph. Instead, Christopoulou et al. (2019) construct a heterogeneous graph based on a set of heuristics, and then apply an edge-oriented model (Christopoulou et al., 2018) to perform inference.
Unlike previous methods, where a document-level structure is constructed from co-references and rules, our proposed model treats the graph structure as a latent variable and induces it in an end-to-end fashion. Our model is built based on the structured attention (Kim et al., 2017; Liu and Lapata, 2018). Using a variant of the Matrix-Tree Theorem (Tutte, 1984; Koo et al., 2007), our model is able to generate task-specific dependency structures for capturing non-local interactions between entities. We further develop an iterative refinement strategy, which enables our model to dynamically build the latent structure based on the last iteration, allowing it to incrementally capture the complex interactions needed for better multi-hop reasoning (Welbl et al., 2018).
Experiments show that our model significantly outperforms the existing approaches on DocRED, a large-scale document-level relation extraction dataset with a large number of entities and relations, and also yields new state-of-the-art results on two popular document-level relation extraction datasets in the biomedical domain. The code and pretrained model are available at https://github.com/nanguoshun/LSR.
Our contributions are summarized as follows:
• We construct a document-level graph for inference in an end-to-end fashion without relying on co-references or rules, which may not always yield optimal structures. With the iterative refinement strategy, our model is able to dynamically construct a latent structure for improved information aggregation in the entire document.
• We perform quantitative and qualitative analyses to compare with the state-of-the-art models.

Model
In this section, we present our proposed Latent Structure Refinement (LSR) model for the document-level relation extraction task. Our LSR model consists of three components: node constructor, dynamic reasoner, and classifier. The node constructor first encodes each sentence of an input document and outputs contextual representations.
Representations that correspond to mentions and tokens on the shortest dependency path in a sentence are extracted as nodes. The dynamic reasoner is then applied to induce a document-level structure based on the extracted nodes. Representations of nodes are updated based on information propagation on the latent structure, which is iteratively refined. Final representations of nodes are used to calculate classification scores by the classifier.

Node Constructor
Node constructor encodes sentences in a document into contextual representations and constructs representations of mention nodes, entity nodes and meta dependency paths (MDP) nodes, as shown in Figure 2. Here MDP indicates a set of shortest dependency paths for all mentions in a sentence, and tokens in the MDP are extracted as MDP nodes.

Context Encoding
Given a document d, each sentence d_i in it is fed to the context encoder, which outputs the contextualized representation of each word in d_i. The context encoder can be a bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2019). Here we use the BiLSTM as an example:

→h_j^i = LSTM(→h_{j−1}^i, γ_j^i)    (1)
←h_j^i = LSTM(←h_{j+1}^i, γ_j^i)    (2)

where →h_j^i and ←h_j^i represent the hidden representations of the j-th token of sentence d_i in the two directions, computed from the (j−1)-th and (j+1)-th hidden states respectively, and γ_j^i denotes the word embedding of the j-th token. The contextual representation of each token in the sentence is obtained by concatenating the hidden states of the two directions, h_j^i = [→h_j^i; ←h_j^i], where h_j^i ∈ R^d and d is the dimension.
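The bidirectional encoding step can be sketched as follows. This is a minimal illustration using a plain tanh RNN in place of the LSTM cells; all names and dimensions (`rnn_pass`, `encode_sentence`, the toy sizes) are ours, not the paper's:

```python
import numpy as np

def rnn_pass(embeds, W, U, b):
    """One directional pass of a simple tanh RNN over a sentence."""
    h = np.zeros(W.shape[0])
    states = []
    for x in embeds:
        h = np.tanh(W @ h + U @ x + b)
        states.append(h)
    return states

def encode_sentence(embeds, params_fwd, params_bwd):
    """Contextual token representations: the concatenation of the forward
    and backward hidden states, h_j = [h_fwd_j ; h_bwd_j]."""
    fwd = rnn_pass(embeds, *params_fwd)
    bwd = rnn_pass(embeds[::-1], *params_bwd)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_emb, d_hid, n_tok = 8, 6, 5           # toy dimensions
make = lambda: (rng.normal(size=(d_hid, d_hid)) * 0.1,
                rng.normal(size=(d_hid, d_emb)) * 0.1,
                np.zeros(d_hid))
sent = [rng.normal(size=d_emb) for _ in range(n_tok)]
reps = encode_sentence(sent, make(), make())
print(len(reps), reps[0].shape)  # one representation per token, of size 2 * d_hid
```

Each token ends up with a 2d-dimensional vector; the model projects or uses this as the node dimension d.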

Figure 2: Overview of the Node Constructor. A context encoder is applied to obtain the contextualized representations of the sentences. The representations of mentions and of words on the meta dependency paths are extracted as mention nodes and MDP nodes, respectively. An average pooling is used to construct each entity node from its mention nodes. For example, the entity node Lutsenko is constructed by averaging the representations of its mentions Lutsenko and He. All figures are best viewed in color.

Node Extraction
We construct three types of nodes for the document-level graph, as shown in Figure 2: mention nodes, entity nodes and meta dependency path (MDP) nodes. Mention nodes correspond to different mentions of entities in each sentence. The representation of an entity node is computed as the average of its mention representations. To build a document-level graph, existing approaches use all nodes in the dependency tree of a sentence (Sahu et al., 2019) or a single sentence-level node obtained by averaging all token representations of the sentence (Christopoulou et al., 2019). Alternatively, we use the tokens on the shortest dependency paths between mentions in the sentence. The shortest dependency path has been widely used in sentence-level relation extraction, as it effectively retains relevant information while ignoring irrelevant information (Bunescu and Mooney, 2005; Xu et al., 2015a,b). Unlike sentence-level extraction, where each sentence contains only two entities, each sentence here may involve multiple mentions.
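The two pooling steps above can be sketched in a few lines. This is a minimal illustration with made-up spans and dimensions, not the paper's implementation:

```python
import numpy as np

def mention_node(token_reps, span):
    """A mention node: the average of the token representations in its span."""
    start, end = span
    return np.mean(token_reps[start:end], axis=0)

def entity_node(mention_reps):
    """An entity node: the average of the representations of its mentions."""
    return np.mean(mention_reps, axis=0)

# Toy document: 7 tokens with 4-dimensional contextual representations.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(7, 4))

# Two mentions of the same entity (e.g. "Lutsenko" and "He").
m1 = mention_node(tokens, (0, 2))
m2 = mention_node(tokens, (5, 6))
e = entity_node([m1, m2])
assert np.allclose(e, (m1 + m2) / 2)
```

MDP nodes need no pooling: each retained token on the shortest dependency path contributes its contextual representation directly.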

Dynamic Reasoner
The dynamic reasoner has two modules, structure induction and multi-hop reasoning as shown in Figure 3. The structure induction module is used to learn a latent structure of a document-level graph. The multi-hop reasoning module is used to perform inference on the induced latent structure, where representations of each node will be updated based on the information aggregation scheme. We stack N blocks in order to iteratively refine the latent document-level graph for better reasoning.

Structure Induction
Unlike existing models that use co-reference links (Sahu et al., 2019) or heuristics (Christopoulou et al., 2019) to construct a document-level graph for reasoning, our model treats the graph as a latent variable and induces it in an end-to-end fashion. The structure induction module is built based on the structured attention (Kim et al., 2017; Liu and Lapata, 2018). Inspired by Liu and Lapata (2018), we use a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984; Koo et al., 2007) to induce the latent dependency structure.

Figure 3: Overview of the Dynamic Reasoner. Representations of nodes are fed into two feed-forward networks before the bilinear transformation. The latent document-level structure is computed by the Matrix-Tree Theorem. The second module takes the structure as input and updates representations of nodes by using the densely connected graph convolutional networks. We stack N blocks which correspond to N rounds of refinement. Each iteration outputs the latent structure for inference.
Let u_i denote the contextual representation of the i-th node, where u_i ∈ R^d. We first calculate the pair-wise unnormalized attention score s_ij between the i-th and the j-th node from the node representations u_i and u_j. The score is computed by two feed-forward neural networks and a bilinear transformation:

s_ij = (tanh(W_p u_i))^T W_b (tanh(W_c u_j))    (3)

where W_p ∈ R^{d×d} and W_c ∈ R^{d×d} are the weights of the two feed-forward neural networks, d is the dimension of the node representations, tanh is applied as the activation function, and W_b ∈ R^{d×d} holds the weights of the bilinear transformation. Next we compute the root score s_i^r, which represents the unnormalized probability of the i-th node being selected as the root node of the structure:

s_i^r = W_r u_i    (4)

where W_r ∈ R^{1×d} is the weight of the linear transformation. Following Koo et al. (2007), we calculate the marginal probability of each dependency edge of the document-level graph. For a graph G with n nodes, we first assign non-negative weights P ∈ R^{n×n} to the edges of the graph:

P_ij = 0 if i = j;  P_ij = exp(s_ij) otherwise    (5)

where P_ij is the weight of the edge between the i-th and the j-th node. We then define the Laplacian matrix L ∈ R^{n×n} of G in Equation (6), and its variant L̂ ∈ R^{n×n} in Equation (7) for further computations (Koo et al., 2007):

L_ij = Σ_{i'=1}^{n} P_{i'j} if i = j;  L_ij = −P_ij otherwise    (6)

L̂_ij = exp(s_j^r) if i = 1;  L̂_ij = L_ij if i > 1    (7)
We use A_ij to denote the marginal probability of the dependency edge between the i-th and the j-th node. Then, A_ij can be derived based on Equation (8), where δ is the Kronecker delta (Koo et al., 2007):

A_ij = (1 − δ(1, j)) P_ij [L̂^{−1}]_jj − (1 − δ(i, 1)) P_ij [L̂^{−1}]_ji    (8)
Here, A ∈ R n×n can be interpreted as a weighted adjacency matrix of the document-level entity graph. Finally, we can feed A ∈ R n×n into the multi-hop reasoning module to update the representations of nodes in the latent structure.
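The edge-marginal computation above can be sketched concretely in numpy. This is a minimal illustration with made-up scores (the actual model computes the scores from learned weights and runs batched on GPU); `edge_marginals` is our name, not the paper's:

```python
import numpy as np

def edge_marginals(s, s_root):
    """Marginal probability A[i, j] of the dependency edge i -> j under the
    distribution over single-root spanning trees defined by the scores
    (Koo et al., 2007; Liu and Lapata, 2018)."""
    n = s.shape[0]
    P = np.exp(s) * (1.0 - np.eye(n))        # Eq. (5): non-negative weights, no self-loops
    L = np.diag(P.sum(axis=0)) - P           # Eq. (6): graph Laplacian
    L_hat = L.copy()
    L_hat[0, :] = np.exp(s_root)             # Eq. (7): first row holds the root scores
    inv = np.linalg.inv(L_hat)
    A = np.zeros((n, n))
    for i in range(n):                       # Eq. (8), with node 0 playing the
        for j in range(n):                   # role of the special first index
            A[i, j] = ((j != 0) * P[i, j] * inv[j, j]
                       - (i != 0) * P[i, j] * inv[j, i])
    return A

rng = np.random.default_rng(0)
n = 3
s = rng.normal(size=(n, n))       # pair-wise scores s_ij
s_root = rng.normal(size=n)       # root scores s_i^r
A = edge_marginals(s, s_root)
print(np.round(A, 3))
```

Because everything reduces to a matrix inverse (via the determinant of L̂), the marginals are differentiable, which is what allows the structure to be induced end-to-end.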

Multi-hop Reasoning
Graph neural networks have been widely used in different tasks to perform multi-hop reasoning (Song et al., 2018a; Yang et al., 2019; Tu et al., 2019), as they are able to effectively collect relevant evidence based on an information aggregation scheme. Specifically, our model is based on graph convolutional networks (GCNs) (Kipf and Welling, 2017). Formally, given a graph G with n nodes, which can be represented by an n × n adjacency matrix A induced by the structure induction module, the convolution computation for node i at the l-th layer, which takes the representation u_i^{l−1} from the previous layer as input and outputs the updated representation u_i^l, is defined as:

u_i^l = σ( Σ_{j=1}^{n} A_ij W^l u_j^{l−1} + b^l )    (9)

where W^l and b^l are the weight matrix and bias vector of the l-th layer, respectively, and σ is the ReLU (Nair and Hinton, 2010) activation function. u_i^0 ∈ R^d is the initial contextual representation of the i-th node constructed by the node constructor.
Following Guo et al. (2019b), we add dense connections to the GCNs in order to capture more structural information on a large document-level graph. With the help of dense connections, we are able to train a deeper model, allowing richer local and non-local information to be captured for learning a better graph representation. The computations at each graph convolution layer are similar to Equation (9).
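A single GCN layer (Eq. 9) and the dense-connection scheme can be sketched as follows. This is an illustrative numpy version with made-up sizes; function names and the exact layer dimensions are ours:

```python
import numpy as np

def gcn_layer(A, U, W, b):
    """One graph convolution (Eq. 9): aggregate neighbours weighted by the
    soft adjacency matrix A, then apply a linear map and ReLU."""
    return np.maximum(A @ U @ W.T + b, 0.0)

def densely_connected_gcn(A, U0, layers):
    """Dense connections (following Guo et al., 2019b): every sub-layer
    receives the concatenation of all preceding outputs as input."""
    outputs = [U0]
    for W, b in layers:
        inp = np.concatenate(outputs, axis=1)   # dense connectivity
        outputs.append(gcn_layer(A, inp, W, b))
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(3)
n, d, h = 4, 6, 3                               # nodes, input dim, sub-layer dim
A = rng.random((n, n)); A /= A.sum(axis=1, keepdims=True)
U0 = rng.normal(size=(n, d))
layers = [(rng.normal(size=(h, d + l * h)) * 0.1, np.zeros(h)) for l in range(2)]
out = densely_connected_gcn(A, U0, layers)
print(out.shape)  # (4, 12): the initial 6 dims plus two sub-layers of 3
```

Because each sub-layer sees all earlier outputs, gradients reach early layers directly, which is what makes the deeper stack trainable.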

Iterative Refinement
Though structured attention (Kim et al., 2017;Liu and Lapata, 2018) is able to automatically induce a latent structure, recent research efforts show that the induced structure is relatively shallow and may not be able to model the complex dependencies for document-level input (Liu et al., 2019b;Ferracane et al., 2019). Unlike previous work (Liu and Lapata, 2018) that only induces the latent structure once, we repeatedly refine the document-level graph based on the updated representations, allowing the model to infer a more informative structure that goes beyond simple parent-child relations.
As shown in Figure 3, we stack N blocks of the dynamic reasoner in order to induce the documentlevel structure N times. Intuitively, the reasoner induces a shallow structure at early iterations since the information propagates mostly between neighboring nodes. As the structure gets more refined by interactions with richer non-local information, the induction module is able to generate a more informative structure.
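The refinement loop can be sketched as below. To keep the sketch short, the Matrix-Tree induction is abstracted into a simple bilinear-softmax attention; the key point being illustrated is only that each block re-induces the structure from the *updated* node representations. All names (`refine`, `W_ind`, `W_gcn`) and dimensions are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine(U, W_ind, W_gcn, b_gcn, n_blocks=2):
    """N stacked dynamic-reasoner blocks: induce a latent graph from the
    current node representations, propagate over it (Eq. 9), repeat."""
    structures = []
    for _ in range(n_blocks):
        A = softmax(U @ W_ind @ U.T, axis=1)            # structure induction (simplified)
        U = np.maximum(A @ U @ W_gcn.T + b_gcn, 0.0)    # multi-hop reasoning
        structures.append(A)
    return U, structures

rng = np.random.default_rng(4)
n, d = 5, 6
U0 = rng.normal(size=(n, d))
U, structures = refine(U0, rng.normal(size=(d, d)) * 0.1,
                       rng.normal(size=(d, d)) * 0.1, np.zeros(d))
```

Note how the second induced structure is a function of representations that already carry one hop of propagated information, which is what lets later blocks connect more distant nodes.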

Classifier
After N rounds of refinement, we obtain the representations of all the nodes. Following Yao et al. (2019), for each entity pair (e_i, e_j) we use a bilinear function to compute the probability for each relation type r:

P(r | e_i, e_j) = σ(e_i^T W_e e_j + b_e)_r    (10)

where W_e ∈ R^{d×k×d} and b_e ∈ R^k are trainable weights and bias, with k being the number of relation categories, σ is the sigmoid function, and the subscript r on the right-hand side refers to the relation type.
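The bilinear scoring can be sketched as below, with one d × d bilinear form per relation type. The toy dimensions and the function name are ours:

```python
import numpy as np

def relation_probs(e_i, e_j, W_e, b_e):
    """P(r | e_i, e_j) = sigmoid(e_i^T W_e e_j + b_e)_r, where W_e has
    shape (d, k, d): one bilinear form per relation type r."""
    logits = np.einsum('d,dke,e->k', e_i, W_e, e_j) + b_e
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(5)
d, k = 4, 3                                   # toy node dim and relation count
e_i, e_j = rng.normal(size=d), rng.normal(size=d)
probs = relation_probs(e_i, e_j, rng.normal(size=(d, k, d)), np.zeros(k))
```

Each relation gets an independent sigmoid rather than a softmax, since an entity pair may hold several relations at once in DocRED.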

Data
We evaluate our model on DocRED (Yao et al., 2019), the largest human-annotated dataset for document-level relation extraction, and on two other popular document-level relation extraction datasets in the biomedical domain: Chemical-Disease Reactions (CDR) (Li et al., 2016a) and Gene-Disease Associations (GDA) (Wu et al., 2019). DocRED contains 3,053 documents for training, 1,000 for development and 1,000 for test, with a total of 132,375 entities and 56,354 relational facts. CDR consists of 500 training, 500 development and 500 test instances. GDA contains 29,192 documents for training and 1,000 for test. We follow Christopoulou et al. (2019) in splitting the training set of GDA 80/20 for training and development.
With more than 40% of its relational facts requiring reading and reasoning over multiple sentences, DocRED significantly differs from previous sentence-level datasets (Doddington et al., 2004; Hendrickx et al., 2009). Unlike existing document-level datasets (Li et al., 2016a; Peng et al., 2017; Verga et al., 2018; Jia et al., 2019), which come from the biomedical domain and consider only drug-gene-disease relations, DocRED covers a broad range of categories with 96 relation types.

Setup
We use spaCy to obtain the meta dependency paths of the sentences in a document. Following Yao et al. (2019), we use F1 and Ign F1 as the evaluation metrics. Ign F1 denotes F1 scores excluding relational facts shared by the training and dev/test sets. F1 scores for intra- and inter-sentence entity pairs are also reported. Evaluation on the test set is done through CodaLab.

Main Results
We compare our proposed LSR with the following three types of competitive models on the DocRED dataset, and show the main results in Table 2.
• Sequence-based Models. These models leverage different neural architectures to encode the sentences in the document, including convolutional neural networks (CNN) (Zeng et al., 2014), LSTM, bidirectional LSTM (BiLSTM) (Cai et al., 2016) and attention-based LSTM (ContextAware) (Sorokin and Gurevych, 2017).
• Graph-based Models. These models construct document-level graphs for inference, including GCNN (Sahu et al., 2019), EoG (Christopoulou et al., 2019), GAT (Veličković et al., 2018) and AGGCN (Guo et al., 2019a). AGGCN is a state-of-the-art sentence-level relation extraction model, which constructs the latent structure by self-attention. GAT and AGGCN are able to dynamically construct task-specific structures.
• BERT-based Models. These models fine-tune BERT (Devlin et al., 2019) for DocRED. Specifically, Two-Phase BERT (Wang et al., 2019) is the best reported model. It is a pipeline model, which predicts whether a relation exists between an entity pair in the first phase and predicts the type of the relation in the second phase.
As shown in Table 2, LSR with GloVe achieves 54.18 F1 on the test set, which is the new state-of-the-art result for models with GloVe. In particular, our model consistently outperforms sequence-based models by a significant margin. For example, LSR improves upon the best sequence-based model, BiLSTM, by 3.1 F1 points. This suggests that models which directly encode the entire document are unable to capture the inter-sentence relations present in documents.
Under the same setting, our model consistently outperforms graph-based models based on static graphs or attention mechanisms. Compared with EoG, our LSR model achieves 3.0 and 2.4 higher F 1 on development and test set, respectively. We also have similar observations for the GCNN model, which shows that a static document-level graph may not be able to capture the complex interactions in a document. The dynamic latent structure induced by LSR captures richer non-local dependencies. Moreover, LSR also outperforms GAT and AGGCN. This empirically shows that compared to the models that use local attention and self-attention (Veličković et al., 2018;Guo et al., 2019a), LSR can induce more informative document-level structures for better reasoning. Our LSR model also shows its superiority under the setting of Ign F 1 .
In addition, LSR with GloVe obtains better results than the two BERT-based models. This empirically shows that our model is able to capture long-range dependencies even without a powerful context encoder. Following Wang et al. (2019), we leverage BERT as the context encoder. As shown in Table 2, our LSR model with BERT achieves a 59.05 F1 score on DocRED, which is a new state-of-the-art result. As of the ACL deadline on the 9th of December 2019, we held the first position on the CodaLab scoreboard under the alias diskorak.

Intra-and inter-sentence performance
In this subsection, we analyze intra- and inter-sentence performance on the development set. An entity pair requires inter-sentence reasoning if the two entities of the same document have no mentions in the same sentence. In DocRED's development set, about 45% of entity pairs require information aggregation over multiple sentences.
Under the same setting, our LSR model outperforms all other models in both the intra- and inter-sentence settings. The differences in F1 scores between LSR and the other models tend to be larger in the inter-sentence setting than in the intra-sentence setting. These results demonstrate that the majority of LSR's superiority comes from the inter-sentence relational facts, suggesting that the latent structure induced by our model is indeed capable of synthesizing information across multiple sentences of a document. Furthermore, LSR with GloVe also proves better in the inter-sentence setting than the two BERT-based models, indicating the latent structure's superiority in resolving long-range dependencies across the whole document compared with the BERT encoder.

Table 3 shows the results on the CDR dataset, where our LSR performs worse than the state-of-the-art models. It is challenging for an off-the-shelf parser to produce high-quality dependency trees in the biomedical domain: we observe that the MDP nodes extracted by the spaCy parser from the CDR dataset contain much less informative context than the nodes from DocRED. We therefore introduce a simplified LSR model, denoted "LSR w/o MDP Nodes", which removes the MDP nodes and builds a fully-connected graph using all tokens of a document. "LSR w/o MDP Nodes" consistently outperforms sequence-based and graph-based models, indicating the effectiveness of the latent structure. Moreover, the simplified LSR outperforms most of the models that use external resources, except for Li et al. (2016b), which leverages co-training with additional unlabeled training data. We believe such a setting would also benefit our LSR model.

Table 4 shows the results on the distantly supervised GDA dataset. Here "Full" indicates the EoG model with a fully connected graph as input, while "NoInf" is a variant of the EoG model without the inference component (Christopoulou et al., 2018). The simplified LSR model achieves a new state-of-the-art result on GDA.
The "Full" model (Christopoulou et al., 2019) yields a higher F1 score in the inter-sentence setting while having a relatively low score in the intra-sentence setting, likely because this model neglects the differences between relations expressed within a sentence and across sentences.

Model Analysis
In this subsection, we use the development set of DocRED to demonstrate the effectiveness of the latent structure and refinements.

Does Latent Structure Matter?
We investigate the extent to which the latent structures, that are induced and iteratively refined by the proposed dynamic reasoner, help to improve the overall performance. We experiment with the three different structures defined below. For fair comparisons, we use the same GCN model to perform multi-hop reasoning for all these structures.
Rule-based Structure: We use the rule-based structure in EoG (Christopoulou et al., 2019). Also, we adapt the rules from De Cao et al. (2019) for multi-hop question answering, i.e., each mention node is connected to its entity node and to the same mention nodes across sentences, while mention nodes and MDP nodes that reside in the same sentence are fully connected. This model is termed QAGCN.

Figure 5: Case study of an example from the development set of DocRED. We visualize the reasoning process for predicting the relation of the entity pair (Japan, World War II) by LSR and AGGCN in two refinement steps, using the attention scores of the mention World War II in each step. We scale all attention scores by 1000 to illustrate them more clearly. Some sentences are omitted due to space limitations.
Attention-based Structure: This structure is induced by AGGCN (Guo et al., 2019a) with multi-head attention (Vaswani et al., 2017). We extend the model from the sentence level to the document level. We explore multiple settings of these models with different numbers of blocks, ranging from 1 to 4, where a block is composed of a graph construction component and a densely connected GCN component. As shown in Figure 4, LSR outperforms QAGCN, EoG and AGGCN in terms of overall F1. This empirically confirms our hypothesis that the latent structure induced by LSR is able to capture a more informative context for the entire document.

Does Refinement Matter?
As shown in Figure 4, our LSR yields the best performance in the second refinement, outperforming the first induction by 0.72% in terms of overall F1. This indicates that the proposed LSR is able to induce more accurate structures by iterative refinement. However, too many iterations may lead to an F1 drop due to over-fitting. We also conduct an ablation study, removing one component of the model at a time. We observe that most of the components contribute to the main model, as the performance deteriorates with any of the components missing. The most significant difference is visible in the structure induction module: its removal leads to a 3.26-point drop in F1 score. This result indicates that the latent structure plays a key role in the overall performance.

Case Study
In Figure 5, we present a case study to analyze why the latent structure induced by our proposed LSR performs better than the structures learned by AGGCN. We use the entity World War II to illustrate the reasoning process, and our goal is to predict the relation of the entity pair (Japan, World War II). As shown in Figure 5, in the first refinement of LSR, World War II interacts with several local mentions with higher attention scores, e.g., 0.43 for the mention Lake Force, which serves as a bridge between the mentions Japan and World War II. In the second refinement, the attention scores of several non-local mentions, such as Japan and Imperial Japanese Army, increase significantly from 0.09 to 0.41 and from 0.17 to 0.37, respectively, indicating that information is propagated globally at this step. With such intra- and inter-sentence structures, the relation of the entity pair (Japan, World War II) can be predicted as "participant of", denoted by P1344. Compared with LSR, the attention scores learned by AGGCN are much more balanced, e.g., the highest score is 0.27 in the second head and most of the scores are near 0.11, indicating that the model may not be able to construct an informative structure for inference. We also depict the predicted relations of ContextAware, AGGCN and LSR in the graph on the right side of Figure 5. Interested readers may refer to Yao et al. (2019) for the definitions of relations such as P607, P17, etc. The LSR model proves capable of filling in the missing relation for (Japan, World War II), which requires reasoning across sentences. However, LSR also attends to the mention New Ireland with a high score, and thus fails to predict that the entity pair (New Ireland, World War II) actually has no relation (NIL type).

Related Work
Document-level relation extraction. Early efforts focus on predicting relations between entities within a single sentence by modeling interactions in the input sequence (Zeng et al., 2014; Zhou et al., 2016; Zhang et al., 2017; Guo et al., 2020) or the corresponding dependency tree (Xu et al., 2015a,b; Liu et al., 2015; Miwa and Bansal, 2016). These approaches do not consider interactions across mentions and ignore relations expressed across sentence boundaries. Recent work begins to explore cross-sentence extraction (Peng et al., 2017; Gupta et al., 2018; Song et al., 2018c, 2019). Instead of using discourse-structure understanding techniques (Liu et al., 2019a; Lei et al., 2017, 2018), these approaches leverage the dependency graph to capture inter-sentence interactions, but their scope is still limited to a few sentences. More recently, the extraction scope has been expanded to the entire document in the biomedical domain (Verga et al., 2018; Jia et al., 2019; Sahu et al., 2019), though only a few relations among chemicals are considered. Unlike previous work, we focus on document-level relation extraction datasets (Yao et al., 2019; Li et al., 2016a; Wu et al., 2019) from different domains with a large number of relations and entities, which require understanding a document and performing multi-hop reasoning.
Structure-based relational reasoning. Structural information has been widely used for relational reasoning in various NLP applications, including question answering (Dhingra et al., 2018; De Cao et al., 2019; Song et al., 2018a) and relation extraction (Sahu et al., 2019). Song et al. (2018a) and De Cao et al. (2019) leverage co-reference information and sets of rules to construct document-level entity graphs; GCNs (Kipf and Welling, 2017) or GRNs (Song et al., 2018b) are then applied to perform reasoning for multi-hop question answering (Welbl et al., 2018). Sahu et al. (2019) also utilize co-reference links to construct the dependency graph and use labelled-edge GCNs (Marcheggiani and Titov, 2017) for document-level relation extraction. Instead of using GNNs, Christopoulou et al. (2019) use the edge-oriented model (Christopoulou et al., 2018) for logical inference on a heterogeneous graph constructed by heuristics. Unlike previous approaches that use syntactic trees, co-references or heuristics, our LSR model treats the document-level structure as a latent variable and induces it in an iteratively refined fashion, allowing the model to dynamically construct the graph for better relational reasoning.

Conclusion
We introduce a novel Latent Structure Refinement (LSR) model for better reasoning in the document-level relation extraction task. Unlike previous approaches that rely on syntactic trees, co-references or heuristics, LSR dynamically learns a document-level structure and makes predictions in an end-to-end fashion. There are multiple avenues for future work. One possible direction is to extend the scope of structure induction to the construction of nodes without relying on an external parser.