Global-to-Local Neural Networks for Document-Level Relation Extraction

Relation extraction (RE) aims to identify the semantic relations between named entities in text. Recent years have witnessed the task being raised to the document level, which requires complex reasoning over entities and mentions throughout an entire document. In this paper, we propose a novel model for document-level RE that encodes document information in terms of entity global representations, entity local representations and context relation representations. Entity global representations model the semantic information of all entities in the document, entity local representations aggregate the contextual information of multiple mentions of specific entities, and context relation representations encode the topic information of other relations. Experimental results demonstrate that our model achieves superior performance on two public datasets for document-level RE. It is particularly effective in extracting relations between entities that are far apart and have multiple mentions.


Introduction
Relation extraction (RE) aims to identify the semantic relations between named entities in text. While previous work (Zeng et al., 2014; Zhang et al., 2015) focuses on extracting relations within a sentence, a.k.a. sentence-level RE, recent studies (Verga et al., 2018; Sahu et al., 2019; Yao et al., 2019) have escalated it to the document level, since a large number of relations between entities span multiple sentences in the real world. According to an analysis of a Wikipedia corpus (Yao et al., 2019), at least 40.7% of relations can only be extracted at the document level.
Compared with sentence-level RE, document-level RE requires more complex reasoning, such as logical reasoning, coreference reasoning and common-sense reasoning. A document often contains many entities, and some entities have multiple mentions, possibly under the same phrase or an alias. To identify the relations between entities appearing in different sentences, document-level RE models must be capable of modeling the complex interactions between multiple entities and synthesizing the context information of multiple mentions. Figure 1 shows an example of document-level RE:

[Figure 1. An example document. S1: "Pacific Fair is a major shopping centre in Broadbeach Waters on the Gold Coast, Queensland, Australia." S11: "Pacific Fair fronts Little Tallebudgera Creek and is the southern end of the Surfers Riverwalk."]

Assume that one wants to extract the relation between "Surfers Riverwalk" in S11 and "Queensland" in S1. One has to find that "Surfers Riverwalk" contains "Pacific Fair" (from S11), and "Pacific Fair" (coreference) is located in "Queensland" (from S1). This chain of interactions helps infer the inter-sentential relation "located in" between "Surfers Riverwalk" and "Queensland".
State-of-the-art. Early studies (Peng et al., 2017) confined document-level RE to short text spans (e.g., within three sentences). Some other studies (Nguyen and Verspoor, 2018; Gupta et al., 2019) were restricted to handling two entity mentions in a document. We argue that they are incapable of dealing with the example in Figure 1, which requires considering multiple mentions of entities integrally. To encode the semantic interactions of multiple entities over long distances, recent work defined document-level graphs and proposed graph-based neural network models. For example, Sahu et al. (2019) and Gupta et al. (2019) interpreted words as nodes and constructed edges according to syntactic dependencies and sequential information. However, there is still a big gap between word representations and relation prediction. Later work introduced the notion of document graphs with three types of nodes (mentions, entities and sentences) and proposed an edge-oriented graph neural model for RE. However, it indiscriminately integrated various information throughout the whole document, so irrelevant information would be involved as noise and damage the prediction accuracy.
Our approach and contributions. To cope with the above limitations, we propose a novel graph-based neural network model for document-level RE. Our key idea is to make full use of document semantics and predict relations by learning the representations of the involved entities from both coarse-grained and fine-grained perspectives, as well as other context relations. Towards this goal, we address the three challenges below. First, how to model the complex semantics of a document? We use the pre-trained language model BERT (Devlin et al., 2019) to capture semantic features and common-sense knowledge, and build a heterogeneous graph with heuristic rules to model the complex interactions between all mentions, entities and sentences in the document.
Second, how to learn entity representations effectively? We design a global-to-local neural network to encode coarse-grained and fine-grained semantic information of entities. Specifically, we learn entity global representations by employing R-GCN (Schlichtkrull et al., 2018) on the created heterogeneous graph, and entity local representations by aggregating multiple mentions of specific entities with multi-head attention (Vaswani et al., 2017).
Third, how to leverage the influence from other relations? In addition to target relation representations, other relations imply the topic information of a document. We learn context relation representations with self-attention (Sorokin and Gurevych, 2017) to make final relation prediction.
In summary, our main contribution is twofold: • We propose a novel model, called GLRE, for document-level RE. To predict relations between entities, GLRE synthesizes entity global representations, entity local representations and context relation representations integrally. For details, please see Section 3. • We conducted extensive experiments on two public document-level RE datasets. Our results demonstrate the superiority of GLRE compared with many state-of-the-art competitors. Our detailed analysis further shows its advantage in extracting relations between entities that are far apart and have multiple mentions. For details, please see Section 4.

Related Work
RE has been intensively studied over a long history. In this section, we review closely related work.
Recently, deep learning-based work has advanced the state of the art without heavy feature engineering. Various neural networks have been exploited, e.g., CNN (Zeng et al., 2014), RNN (Zhang et al., 2015; Cai et al., 2016) and GNN. Furthermore, to cope with the wrong labeling problem caused by distant supervision, Zeng et al. (2015) combined piecewise CNN with multi-instance learning. Besides, a few models (Levy et al., 2017; Qiu et al., 2018) borrowed reading comprehension techniques for document-level RE. However, they require domain knowledge to design question templates, and may perform poorly in zero-answer and multi-answer scenarios, which are very common for RE.

Proposed Model
We model document-level RE as a classification problem. Given a document annotated with entities and their corresponding textual mentions, the objective of document-level RE is to identify the relations of all entity pairs in the document. Figure 2 depicts the architecture of our model, named GLRE. It receives an entire document with annotations as input. First, in the (a) encoding layer, it uses a pre-trained language model such as BERT (Devlin et al., 2019) to encode the document. Then, in the (b) global representation layer, it constructs a global heterogeneous graph with different types of nodes and edges, and encodes the graph using a stacked R-GCN (Schlichtkrull et al., 2018) to capture entity global representations. Next, in the (c) local representation layer, it aggregates multiple mentions of specific entities using multi-head attention (Vaswani et al., 2017) to obtain entity local representations. Finally, in the (d) classifier layer, it combines these with the context relation representations obtained via self-attention (Sorokin and Gurevych, 2017) to make the final relation prediction. Please see the rest of this section for technical details.

Encoding Layer
Let D = {w_1, w_2, ..., w_k} denote an input document, where w_j is the j-th word in it. We use BERT to encode D as follows:

H = [h_1, h_2, ..., h_k] = BERT([w_1, w_2, ..., w_k]),

where h_j ∈ R^{d_w} is the hidden state of w_j at the output of the last layer of BERT. Limited by the input length of BERT, we encode a long document sequentially in the form of short paragraphs.
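The chunked encoding described above can be sketched as follows. This is a minimal sketch: `encoder` stands in for a BERT forward pass that returns one hidden-state vector per input token, and is an assumed interface rather than the paper's exact implementation.

```python
def chunk(token_ids, max_len=512):
    """Split a long token sequence into consecutive pieces of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def encode_document(token_ids, encoder, max_len=512):
    """Encode each piece separately and concatenate the hidden states,
    so the output H aligns one state per input token."""
    states = []
    for piece in chunk(token_ids, max_len):
        states.extend(encoder(piece))  # encoder: ids -> list of d_w-dim vectors
    return states
```

In practice, the per-piece encoder call would be a batched BERT forward pass; the key point is that the concatenated states H preserve the original token order.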

Global Representation Layer
Based on H, we construct a global heterogeneous graph with different types of nodes and edges to capture different dependencies (e.g., co-occurrence dependencies, coreference dependencies and order dependencies), inspired by prior work. Specifically, there are three types of nodes:
• Mention nodes, which model different mentions of entities in D. The representation of a mention node m_i is defined by averaging the representations of the words it contains. To distinguish node types, we concatenate a node type representation t_m ∈ R^{d_t}. Thus, the representation of m_i is n_{m_i} = [avg_{w_j ∈ m_i}(h_j); t_m], where [;] is the concatenation operator.
• Entity nodes, which represent entities in D. The representation of an entity node e_i is defined by averaging the representations of the mention nodes that refer to it, together with a node type representation t_e ∈ R^{d_t}. Therefore, the representation of e_i is n_{e_i} = [avg_{m_j ∈ e_i}(n_{m_j}); t_e].
• Sentence nodes, which encode sentences in D. Similar to mention nodes, the representation of a sentence node s_i is formalized as n_{s_i} = [avg_{w_j ∈ s_i}(h_j); t_s], where t_s ∈ R^{d_t}.
Then, we define five types of edges to model the interactions between the nodes:
• Mention-mention edges. We add an edge between any two mention nodes in the same sentence.
• Mention-entity edges. We add an edge between a mention node and an entity node if the mention refers to the entity.
• Mention-sentence edges. We add an edge between a mention node and a sentence node if the mention appears in the sentence.
• Entity-sentence edges. We create an edge between an entity node and a sentence node if at least one mention of the entity appears in the sentence.
• Sentence-sentence edges. We connect all sentence nodes to model the non-sequential information (i.e., break the sentence order).
Note that there are no entity-entity edges, because such pairs form the relations to be predicted.
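The edge construction rules above can be sketched with plain Python. This is an illustrative sketch, not the paper's code: mentions are assumed to be given as (entity id, sentence id) pairs, and nodes are identified by (type, index) tuples.

```python
from collections import defaultdict

def build_graph(mentions, num_sents):
    """Build the typed edge lists of the global heterogeneous graph.
    mentions: list of (entity_id, sent_id), one per mention, in document order.
    Nodes are ("M", i), ("E", e) and ("S", s) tuples."""
    edges = defaultdict(list)
    ent_sents = defaultdict(set)
    for i, (e, s) in enumerate(mentions):
        edges["ME"].append((("M", i), ("E", e)))   # mention refers to entity
        edges["MS"].append((("M", i), ("S", s)))   # mention appears in sentence
        ent_sents[e].add(s)
        for j, (_, s2) in enumerate(mentions[:i]):
            if s2 == s:                            # two mentions in one sentence
                edges["MM"].append((("M", j), ("M", i)))
    for e, sents in ent_sents.items():
        for s in sents:                            # entity has a mention in sentence
            edges["ES"].append((("E", e), ("S", s)))
    for a in range(num_sents):                     # sentence nodes fully connected
        for b in range(a + 1, num_sents):
            edges["SS"].append((("S", a), ("S", b)))
    return edges
```

Note that no "EE" edge type is produced, mirroring the design choice that entity-entity pairs are the relations to be predicted rather than graph edges.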
Finally, we employ an L-layer stacked R-GCN (Schlichtkrull et al., 2018) to convolute the global heterogeneous graph. Different from GCN, R-GCN considers various types of edges and can better model multi-relational graphs. Specifically, its node forward-pass update for the (l+1)-th layer is defined as follows:

n_i^{l+1} = σ( Σ_{x ∈ X} Σ_{j ∈ N_i^x} (1 / |N_i^x|) W_x^l n_j^l + W_0^l n_i^l ),

where σ(·) is the activation function, N_i^x denotes the set of neighbors of node i linked by edges of type x, and X denotes the set of edge types. W_x^l, W_0^l ∈ R^{d_n × d_n} are trainable parameter matrices (d_n is the dimension of node representations). We refer to the representations of entity nodes after graph convolution as entity global representations, which encode the semantic information of entities throughout the whole document. We denote the entity global representation of e_i by e_i^glo.
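The R-GCN update can be illustrated with a minimal NumPy sketch (the actual model is implemented in PyTorch with trainable parameters; here W, W0 and the activation are supplied by the caller):

```python
import numpy as np

def rgcn_layer(nodes, typed_edges, W, W0, sigma=np.tanh):
    """One R-GCN forward pass. nodes: (n, d) matrix of node representations;
    typed_edges: maps an edge type x to (i, j) pairs, meaning node j is a
    neighbor of node i via x; W[x] and W0: (d, d) matrices for the typed
    neighbor term and the self-connection term."""
    out = nodes @ W0.T                               # self-connection: W0 n_i
    for x, pairs in typed_edges.items():
        nbrs = {}
        for i, j in pairs:
            nbrs.setdefault(i, []).append(j)
        for i, js in nbrs.items():
            # normalized typed-neighbor sum: (1/|N_i^x|) sum_j W_x n_j
            out[i] += (nodes[js] @ W[x].T).sum(axis=0) / len(js)
    return sigma(out)
```

Stacking L such layers (each with its own parameters) yields the L-layer convolution over the heterogeneous graph.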

Local Representation Layer
We learn entity local representations for specific entity pairs by aggregating the associated mention representations with multi-head attention (Vaswani et al., 2017). The "local" can be understood from two angles: (i) It aggregates the original mention information from the encoding layer. (ii) For different entity pairs, each entity would have multiple different local representations w.r.t. the counterpart entity. However, there is only one entity global representation.
Multi-head attention enables an RE model to jointly attend to the information of an entity composed of multiple mentions from different representation subspaces. Its calculation involves a set of queries Q and key-value pairs (K, V):

Attention(Q, K, V) = softmax(Q K^T / √d) V,
MHead(Q, K, V) = [head_1; ...; head_h] W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

In this paper, Q is related to the entity global representations, K to the initial sentence node representations before graph convolution (i.e., the input features of sentence nodes in R-GCN), and V to the initial mention node representations. Specifically, given an entity pair (e_a, e_b), we define their local representations as follows:

e_a^loc = LN( MHead_0(e_b^glo, S_a, M_a) ),
e_b^loc = LN( MHead_1(e_a^glo, S_b, M_b) ),

where LN(·) denotes layer normalization (Ba et al., 2016), M_a is the set of mention node representations of e_a, and S_a is the set of node representations of the sentences in which the mention nodes of M_a are located. M_b and S_b are similarly defined for e_b. Note that MHead_0 and MHead_1 learn independent model parameters for the two entity local representations. Intuitively, if a sentence contains two mentions m_a, m_b corresponding to e_a, e_b, respectively, then the mention node representations n_{m_a}, n_{m_b} should contribute more to predicting the relation of (e_a, e_b), and the attention weights should be greater in computing e_a^loc, e_b^loc. More generally, a higher semantic similarity between the node representation of a sentence containing m_a and e_b^glo indicates that this sentence and m_b are more semantically related, so n_{m_a} should receive a higher attention weight in e_a^loc.
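The aggregation can be illustrated with a single-head NumPy sketch. This is an assumption-laden simplification: the real model uses multi-head attention with learned per-head projections and layer normalization, whereas here the query, keys and values are used directly.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_representation(query, sent_keys, mention_values):
    """Single-head sketch of entity local aggregation.
    query: the counterpart entity's global representation, shape (d,);
    sent_keys: (m, d) matrix holding, for each of the m mentions, the initial
    representation of the sentence node containing it;
    mention_values: (m, d) matrix of initial mention node representations."""
    d = query.shape[0]
    weights = softmax(sent_keys @ query / np.sqrt(d))  # attend over mentions
    return weights @ mention_values                    # weighted aggregation
```

A mention whose sentence representation is more similar to the counterpart entity's global representation receives a larger weight, matching the intuition described above.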

Classifier Layer
To classify the target relation r of an entity pair (e_a, e_b), we first concatenate entity global representations, entity local representations and relative distance representations to generate entity final representations:

ê_a = [e_a^glo; e_a^loc; ∆(δ_ab)],   ê_b = [e_b^glo; e_b^loc; ∆(δ_ba)],

where δ_ab denotes the relative distance from the first mention of e_a to that of e_b in the document, and δ_ba is similarly defined. The relative distance is first divided into several bins {1, 2, ..., 2^b}; each bin is associated with a trainable distance embedding, and ∆(·) maps each δ to the embedding of its bin. Then, we concatenate the final representations of e_a and e_b to form the target relation representation o_r = [ê_a; ê_b]. Furthermore, all relations in a document implicitly indicate the topic information of the document; for example, "director" and "character" often appear in documents about movies. In turn, the topic information implies possible relations: relations under similar topics are likely to co-occur, while those under different topics are not. Thus, we use self-attention (Sorokin and Gurevych, 2017) to capture context relation representations, which reveal the topic information of the document:

o_c = Σ_{i=1}^{p} θ_i o_i,   θ_i = exp(o_i^T W o_r) / Σ_{j=1}^{p} exp(o_j^T W o_r),

where W ∈ R^{d_r × d_r} is a trainable parameter matrix, d_r is the dimension of target relation representations, o_i (o_j) is the relation representation of the i-th (j-th) entity pair, θ_i is the attention weight of o_i, and p is the number of entity pairs.
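A possible realization of the distance binning is sketched below. The exact scheme is not fully specified above, so this is a hypothetical variant: |δ| is mapped to power-of-two bins with boundaries 1, 2, 4, ..., and the sign of δ selects a separate set of bins for each direction.

```python
import math

def distance_bin(delta, b=9):
    """Hypothetical binning for the relative distance delta.
    |delta| falls into one of b bins with power-of-two upper boundaries
    (1, 2, 4, ...); negative deltas use a second, offset set of bins.
    Each returned index would be tied to a trainable distance embedding."""
    d = max(1, min(abs(delta), 2 ** (b - 1)))   # clamp into [1, 2^(b-1)]
    k = int(math.floor(math.log2(d)))           # log-scale bin index
    return k if delta >= 0 else b + k           # direction-specific bins (assumption)
```

Log-scale bins keep short distances fine-grained while collapsing long distances, which is the usual motivation for this kind of scheme.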
Finally, we use a feed-forward neural network (FFNN) over the target relation representation o_r and the context relation representation o_c to make the prediction. Besides, considering that an entity pair may hold several relations, we transform the multi-class classification problem into multiple binary classification problems. The predicted probability distribution of r over the set R of all relations is defined as follows:

y_r = sigmoid( FFNN([o_r; o_c]) ),

where y_r ∈ R^{|R|}. We define the loss function as follows:

L = − Σ_{r ∈ R} ( y_r^* log(y_r) + (1 − y_r^*) log(1 − y_r) ),

where y_r^* ∈ {0, 1} denotes the true label of r. We employ the Adam optimizer (Kingma and Ba, 2015) to optimize this loss function.
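The classifier layer can be sketched end-to-end in NumPy. This is a minimal sketch under assumptions: `ffnn` is a caller-supplied stand-in for the trainable feed-forward network, and the context attention follows the bilinear form given above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(o_r, O, W, ffnn):
    """o_r: target relation representation, shape (d_r,);
    O: (p, d_r) matrix of relation representations of all p entity pairs;
    W: (d_r, d_r) bilinear attention matrix;
    ffnn: maps the concatenated (2 * d_r,) vector to |R| logits."""
    theta = softmax(O @ (W @ o_r))    # attention weights over context relations
    o_c = theta @ O                   # context relation representation
    return sigmoid(ffnn(np.concatenate([o_r, o_c])))

def bce_loss(y_pred, y_true):
    """Per-relation binary cross-entropy, averaged over the relation set R."""
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

Treating each relation as an independent binary decision is what allows an entity pair to hold several relations at once.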

Experiments and Results
We implemented GLRE with PyTorch 1.5. The source code and datasets are available online. In this section, we report our experimental results.

Datasets
We evaluated GLRE on two public document-level RE datasets. Table 1 lists their statistics: • The Chemical-Disease Relations (CDR) dataset (Li et al., 2016) was built for the BioCreative V challenge and manually annotated with one relation, "chemical-induced disease". • The DocRED dataset (Yao et al., 2019) was built from Wikipedia and Wikidata, covering various relations related to science, art, personal life, etc. Both manually-annotated and distantly-supervised data are offered. We only used the manually-annotated data.

Comparative Models
First, we compared GLRE with five sentence-level RE models adapted to the document level, as well as a number of document-level RE models. Among the latter, the model of Tang et al. (2020) also leveraged BERT, designing a hierarchical inference network to aggregate inference information from the entity level to the sentence level, and then to the document level.

Experiment Setup
Due to the small size of CDR, some work (Zhou et al., 2016; Verga et al., 2018; Zheng et al., 2018) created a new split by taking the union of the training and development sets, denoted by "train + dev". Under this setting, a model was trained on the train + dev set, while the best epoch was found on the development set. To make a comprehensive comparison, we also measured the corresponding precision, recall and F1 scores. For consistency, we used the same experiment setting on DocRED. Additionally, the gold standard of the test set of DocRED is unknown, and only F1 scores can be obtained via an online interface. Besides, it was noted that some relation instances are present in both the training and development/test sets (Yao et al., 2019). We also measured F1 scores ignoring those duplicates, denoted by Ign F1. For CDR, we chose a BERT model re-trained from the BERT-Base-cased checkpoint on biomedical corpora. For DocRED, we picked the BERT-Base-uncased model. For the comparative models not using BERT, we selected the PubMed pre-trained word embeddings (Chiu et al., 2016) for CDR and GloVe (Pennington et al., 2014) for DocRED. For the models with source code, we made our best efforts to tune the hyperparameters. Limited by space, we refer interested readers to the appendix for more details.

Main Results
Tables 2 and 3 list the results of the comparative models and GLRE on CDR and DocRED, respectively. We have four findings: (1) The sentence-level RE models (Yao et al., 2019) obtained medium performance. They still fell behind a few document-level models, indicating the difficulty of directly applying them to the document level. (2) The graph-based RE models (Panyam et al., 2018; Verga et al., 2018) and the non-graph models (Zhou et al., 2016; Gu et al., 2017) achieved broadly comparable performance. (3) As exemplified by Tang et al. (2020), the BERT-based models showed stronger prediction power for document-level RE. They outperformed the other comparative models on both CDR and DocRED. (4) GLRE achieved the best results among all the models. We owe this to entity global and local representations. Furthermore, BERT and context relation representations also boosted the performance. See our analysis below.

Detailed Analysis
Entity distance. We examined the performance of the open-source models in terms of entity distance, which is defined as the shortest sentence distance between all mentions of two entities. Figure 3 depicts the comparison results on CDR and DocRED using the training set only. We observe that: (1) GLRE achieved significant improvement in extracting the relations between entities of long distance, especially when the distance ≥ 3. This is because the global heterogeneous graph can effectively model the interactions of the semantic information of different nodes (i.e., mentions, entities and sentences) in a document. Furthermore, entity local representations can reduce the influence of the noisy context of multiple mentions of distant entities. (2) According to the results on CDR, the graph-based model performed better than the sentence-level model and the BERT-based model in extracting inter-sentential relations. The main reason is that it leveraged heuristic rules to construct the document graph at the entity level, which can better model the semantic information across sentences and avoid the error accumulation introduced by NLP tools, e.g., dependency parsers. (3) On DocRED, the other models outperformed the graph-based model, due to the power of BERT and the higher accuracy of dependency parsing in the general domain.
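The entity-distance metric used in this analysis can be computed directly from the sentence indices of the two entities' mentions:

```python
def entity_distance(sents_a, sents_b):
    """Shortest sentence distance between any mention of one entity and any
    mention of the other; each argument lists the sentence indices of that
    entity's mentions."""
    return min(abs(sa - sb) for sa in sents_a for sb in sents_b)
```

For example, two entities whose mentions all share a sentence have distance 0, while distance ≥ 3 marks the long-distance pairs discussed above.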
Number of entity mentions. To assess the effectiveness of GLRE in aggregating the information of multiple entity mentions, we measured the performance in terms of the average number of mentions per entity pair. Similar to the previous analysis, Figure 4 shows the results on CDR and DocRED using the training set only. We see that: (1) GLRE achieved great improvement in extracting the relations with an average number of mentions ≥ 2, especially ≥ 4. The major reason is that entity local representations aggregate the contextual information of multiple mentions selectively. As an exception, when the average number of mentions was in [1, 2), the performance of GLRE was slightly lower than Christopoulou et al. (2019) on CDR. This is because both GLRE and Christopoulou et al. (2019) relied on modeling the interactions between entities in the document, which made them indistinguishable in this case. In fact, the performance of all the models decreased when the average number of mentions was small, because less relevant information was provided in the document, which made relations harder to predict. We will consider external knowledge in our future work. (2) Compared with the other models, the BERT-based model performed better in general, except for one interval. When the average number of mentions was in [1, 2) on CDR, its performance was significantly lower than the other models. The reason is twofold. On the one hand, it is more difficult to capture the latent knowledge in the biomedical field. On the other hand, the BERT-based model only relied on the semantic information of the mentions of target entity pairs to predict the relations. When the average number was small, the prediction became more difficult. Furthermore, when the average number was large, its performance increase was not significant. The main reason is that, although BERT brought rich knowledge, the model indiscriminately aggregated the information of multiple mentions and introduced much noisy context, which limited its performance.
Ablation study. To investigate the effectiveness of each layer in GLRE, we conducted an ablation study using the training set only. Table 4 shows the comparison results. We find that: (1) BERT had a greater influence on DocRED than CDR. This is mainly because BERT introduced valuable linguistic knowledge and common-sense knowledge to RE, but it was hard to capture latent knowledge in the biomedical field. (2) F1 scores dropped when we removed entity global representations, entity local representations or context relation representations, which verified their usefulness in documentlevel RE. (3) Particularly, when we removed entity local representations, F1 scores dropped more dramatically. We found that more than 54% and 19% of entities on CDR and DocRED, respectively, have multiple mentions in different sentences. The local representation layer, which uses multi-head attention to selectively aggregate multiple mentions, can reduce much noisy context.
Pre-trained language models. To analyze the impact of pre-trained language models on GLRE, as well as its performance upper bound, we replaced BERT-Base with BERT-Large, XLNet-Large (Yang et al., 2019) or ALBERT-xxLarge (Lan et al., 2020). Table 5 shows the comparison results using the training set only, from which we observe that larger models boosted the performance of GLRE to some extent. When the "train + dev" setting was used on DocRED, the Ign F1 and F1 scores of XLNet-Large even reached 58.5 and 60.5, respectively. However, due to the lack of biomedical versions, XLNet-Large and ALBERT-xxLarge did not bring improvement on CDR. We note that selecting the best pre-trained model is not our primary goal.
Case study. To help understanding, we list a few examples from the CDR test set in Table 6. See Appendix for more cases from DocRED.
(1) From Case 1, we find that logical reasoning is necessary. Predicting the relation between "rofecoxib" and "GI bleeding" depends on the bridge entity "non-users of aspirin". GLRE used R-GCN to model the document information based on the global heterogeneous graph, so it dealt with complex inter-sentential reasoning better. (2) From Case 2, we observe that, when a sentence contained multiple entities connected by conjunctions (such as "and"), the comparative model might miss some associations between them. GLRE solved this issue by building the global heterogeneous graph and considering the context relation information, which broke the word sequence. (3) Prior knowledge is required in Case 3. One must know beforehand that "fatigue" belongs to "adverse effects". Then, the relation between "bepridil" and "dizziness" can be identified correctly. Unfortunately, both GLRE and the comparative model lacked this knowledge, and we leave it as future work. We also analyzed all 132 inter-sentential relation instances in the CDR test set that were incorrectly predicted by GLRE. Four major error types are as follows: (1) Logical reasoning errors, which occurred when GLRE could not correctly identify the relations established indirectly by bridge entities, account for 40.9%. (2) Component missing errors, which happened when some component of a sentence (e.g., the subject) was missing, account for 28.8%. In this case, GLRE needed the whole document information to infer the missing component and predict the relation, which was not always accurate. (3) Prior knowledge missing errors account for 13.6%. (4) Coreference reasoning errors, which were caused by pronouns that could not be understood correctly, account for 12.9%.

Conclusion
In this paper, we proposed GLRE, a global-to-local neural network for document-level RE. Entity global representations model the semantic information of an entire document with R-GCN, entity local representations aggregate the contextual information of mentions selectively using multi-head attention, and context relation representations encode the topic information of other relations using self-attention. Our experiments demonstrated the superiority of GLRE over many comparative models, especially its big leads in extracting relations between entities that are far apart and have multiple mentions. In future work, we plan to integrate knowledge graphs and explore other ways of modeling document graphs (e.g., hierarchical graphs) to improve the performance.

A Notations
To help understanding, Table 7 summarizes the key notations used in this paper.
B Datasets

The DocRED dataset (Yao et al., 2019) is available at https://github.com/thunlp/DocRED. Note that the gold standard of the test set of DocRED is unknown, and only F1 scores can be obtained via an online interface at https://competitions.codalab.org/competitions/20717.

C Experimental Setup
In this section, we provide more details of our experiments. We implemented GLRE with PyTorch 1.5 and trained it on a server with an Intel Xeon Gold 5117 CPU, 120 GB memory, two NVIDIA Tesla V100 GPU cards and Ubuntu 18.04. Analogous to previous work, we pre-processed the CDR dataset, including sentence splitting, word tokenization and hypernym filtering.
When using the training set only, we trained a model on the training set, searched for the best epoch in terms of F1 scores on the development set, and tested on the test set. Under the "train + dev" setting, we first trained on the training set and evaluated on the development set in order to find the best epoch. Then, we re-ran training on the union of the training and development sets until the best epoch and evaluated on the test set. In both cases, we employed dropout and layer normalization (Ba et al., 2016) to prevent overfitting.
The parameters of GLRE were initialized with a Gaussian distribution (mean = 0 and SD = 1.0) using a fixed initialization seed. We trained GLRE with the Adam optimizer (Kingma and Ba, 2015) using mini-batches. The hidden size of BERT was set to 768. A transformation layer was used to project the BERT output into a low-dimensional space of size 256. All hyperparameter values used in the experiments are shown in Table 8. With context relation representations, the topic information makes the target relation easier to be predicted; in contrast, GLRE without context relation representations imprecisely predicts it as "creator" (for general work).