Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs

Document-level relation extraction is a complex human process that requires logical inference to extract relationships between named entities in text. Existing approaches use graph-based neural models with words as nodes and edges as relations between them, to encode relations across sentences. These models are node-based, i.e., they form pair representations based solely on the two target node representations. However, entity relations can be better expressed through unique edge representations formed as paths between nodes. We thus propose an edge-oriented graph neural model for document-level relation extraction. The model utilises different types of nodes and edges to create a document-level graph. An inference mechanism on the graph edges enables the model to learn intra- and inter-sentence relations using multi-instance learning internally. Experiments on two document-level biomedical datasets for chemical-disease and gene-disease associations show the usefulness of the proposed edge-oriented approach.


Introduction
The extraction of relations between named entities in text, known as Relation Extraction (RE), is an important task of Natural Language Processing (NLP). Lately, RE has attracted a lot of attention from the field, in an effort to improve the inference capability of current methods (Zeng et al., 2017; Christopoulou et al., 2018; Luan et al., 2019).
In real-world scenarios, a large number of relations are expressed across sentences. The task of identifying these relations is named inter-sentence RE. Typically, inter-sentence relations occur in textual snippets with several sentences, such as documents. In these snippets, each entity is usually repeated with the same phrases or aliases, the occurrences of which are often named entity mentions and regarded as instances of the entity. The multiple mentions of the target entities in different sentences can be useful for the identification of inter-sentential relations, as these relations may depend on the interactions of their mentions with other entities in the same document.
Figure 1: Example document (Li et al., 2016a). The solid and dotted lines represent intra- and inter-sentence relations, respectively.
Source code available at https://github.com/fenchri/edge-oriented-graph
As shown in the example of Figure 1, the entities bilateral optic neuropathy, ethambutol and isoniazid have two mentions each, while the entity scotoma has one mention. The relation between the chemical ethambutol and the disease scotoma is clearly inter-sentential. Their association can only be determined if we consider the interactions between the mentions of these entities in different sentences. A mention of bilateral optic neuropathy interacts with a mention of ethambutol in the first sentence. Another mention of the former interacts with the mention of scotoma in the third sentence. This chain of interactions can help us infer that the entity ethambutol has a relation with the entity scotoma.
The most common technique that is currently used to deal with multiple mentions of named entities is Multi-Instance Learning (MIL). Initially, MIL was introduced by Riedel et al. (2010) in order to reduce noise in distantly supervised (DS) corpora (Mintz et al., 2009). In DS, training instances are created from large, raw corpora using Knowledge Base (KB) entity linking and automatic annotation with heuristic rules. MIL in this setting considers multiple sentences (bags) that contain a pair of entities serving as the multiple instances of this pair. Verga et al. (2018) introduced another MIL setting for relation extraction between named entities in a document. In this setting, entities mapped to the same KB ID are considered as mentions of an entity concept and pairs of mentions correspond to the pair's multiple instances. However, document-level RE is not common in the general domain, as the entity types of interest can often be found in the same sentence (Banko et al., 2007). On the contrary, in the biomedical domain, document-level relations are particularly important given the numerous aliases that biomedical entities can have.
To deal with document-level RE, recent approaches assume that only two mentions of the target entities reside in the document (Nguyen and Verspoor, 2018; Verga et al., 2018) or utilise different models for intra- and inter-sentence RE (Gu et al., 2016; Li et al., 2016b; Gu et al., 2017). In contrast with approaches that employ sequential models (Nguyen and Verspoor, 2018; Gu et al., 2017), graph-based neural approaches have proven useful in encoding long-distance, inter-sentential information (Peng et al., 2017; Gupta et al., 2019). These models interpret words as nodes and connections between them as edges. They typically operate on the nodes, updating the node representations during training. However, a relation between two entities depends on different contexts. It could thus be better expressed with an edge connection that is unique for the pair. A straightforward way to address this is to create graph-based models that rely on edge representations rather than focusing on node representations, which are shared between multiple entity pairs.
In this work, we tackle document-level, intra- and inter-sentence RE using MIL with a graph-based neural model. Our objective is to infer the relation between two entities by exploiting other interactions in the document. We construct a document graph with heterogeneous types of nodes and edges to better capture different dependencies between nodes. In the proposed graph, a node corresponds to either entities, mentions, or sentences, instead of words. We connect distinct nodes based on simple heuristic rules and generate different edge representations for the connected nodes. To achieve our objective, we design the model to be edge-oriented in the sense that it learns edge representations (between the graph nodes) rather than node representations. An iterative algorithm over the graph edges is used to model dependencies between the nodes in the form of edge representations. The intra- and inter-sentence entity relations are predicted by employing these edges. Our contributions can be summarised as follows:
• We propose a novel edge-oriented graph neural model for document-level relation extraction. The model deviates from existing graph models as it focuses on constructing unique nodes and edges, encoding information into edge representations rather than node representations.
• The proposed model is independent of syntactic dependency tools and can achieve state-of-the-art performance on a manually annotated, document-level chemical-disease interaction dataset.
• Analysis of the model components indicates that the document-level graph can effectively encode document-level dependencies. Additionally, we show that inter-sentence associations can be beneficial for the detection of intra-sentence relations.

Proposed Model
We build our model as a significant extension of our previously proposed sentence-level model (Christopoulou et al., 2018) for document-level RE. The most critical difference between the two models is the introduction and construction of a partially-connected document graph, instead of a fully-connected sentence-level graph. Additionally, the document graph consists of heterogeneous types of nodes and edges in comparison with the sentence-level graph that contains only entity-nodes and single edge types among them. Furthermore, the proposed approach utilises multi-instance learning when mention-level annotations are available.
As illustrated in Figure 2, the proposed model consists of four layers: sentence encoding, graph construction, inference and classification layers. The model receives as input a document with identified concept-level entities and their textual mentions. Next, a document-level graph with multiple types of nodes and edges is constructed. An inference algorithm is applied on the graph edges to generate concept-level pair representations. In the final layer, the edge representations between the target concept-entity nodes are classified into relation categories.
Figure 2: Abstract architecture of the proposed approach. The model receives a document and encodes each sentence separately. A document-level graph is constructed and fed into an iterative algorithm to generate edge representations between the target entity nodes. Some node connections are not shown for brevity.
For the remainder of this section, we first briefly introduce the document-level RE task setting and then explain the four layers of the proposed model.

Task Setting
In document-level RE, the input is considered to be an annotated document. The annotations include concept-level entities (with assigned KB IDs), as well as multiple occurrences of each entity under the same phrase or alias, i.e., entity mentions. We consider the associations of mentions to concept entities as given (also known as entity linking (Shen et al., 2014)). The objective of the task is, given an annotated document, to identify all the related concept-level pairs in that document. In this work, we refer to concept-level annotations as entities and mention-level annotations as mentions.
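The task setting can be illustrated with a minimal sketch of the input and of the candidate pairs. The class name `Entity`, the helper names, the KB IDs and the sentence indices below are all hypothetical placeholders chosen for illustration; they are not taken from the datasets or from the released code.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Entity:
    kb_id: str                      # concept-level identifier (placeholder values below)
    etype: str                      # e.g. "Chemical" or "Disease"
    mention_sentences: list = field(default_factory=list)  # sentence index of each mention

def candidate_pairs(entities, type_a, type_b):
    """All concept-level (type_a, type_b) pairs in one document."""
    first = [e for e in entities if e.etype == type_a]
    second = [e for e in entities if e.etype == type_b]
    return list(product(first, second))

def is_intra_sentence(a, b):
    """True if some mention of a shares a sentence with some mention of b."""
    return bool(set(a.mention_sentences) & set(b.mention_sentences))

# Hypothetical document with one chemical and two diseases.
doc = [
    Entity("C1", "Chemical", [0, 1]),
    Entity("D1", "Disease", [0, 2]),
    Entity("D2", "Disease", [2]),
]
pairs = candidate_pairs(doc, "Chemical", "Disease")
labels = {(a.kb_id, b.kb_id): is_intra_sentence(a, b) for a, b in pairs}
```

In this toy document the pair (C1, D1) is an intra-sentence candidate, while (C1, D2) can only be related across sentences.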

Sentence Encoding Layer
First, each word in the sentences of the input document is transformed into a dense vector representation, i.e., a word embedding. The vectorised words of each sentence are then fed into a Bidirectional LSTM network (BiLSTM) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997), named the encoder. The output of the encoder results in contextualised representations for each word of the input sentence.

Graph Layer
The contextualised word representations from the encoder are used to construct a document-level graph structure. The graph layer comprises two sub-layers, a node construction layer and an edge construction layer. We compose the representations of the graph nodes in the first sub-layer and the representations of the edges in the second.

Node construction
We form three distinct types of nodes in the graph: mention nodes (M) $n_m$, entity nodes (E) $n_e$, and sentence nodes (S) $n_s$. Each node representation is computed as the average of the embeddings of different elements. Firstly, mention nodes correspond to different mentions of entities in the input document. The representation of a mention node is formed as the average of the representations of the words ($w$) that the mention contains. Secondly, entity nodes represent unique entity concepts. The representation of an entity node is computed as the average of the mention ($m$) representations associated with the entity. Finally, sentence nodes correspond to sentences. A sentence node is represented as the average of the word representations in the sentence. In order to distinguish different node types in the graph, we concatenate a node type ($t$) embedding to each node representation. The final node representations are then estimated as
$$n_m = [\mathrm{avg}_{w_i \in m}(w_i); t_m], \quad n_e = [\mathrm{avg}_{m_i \in e}(m_i); t_e], \quad n_s = [\mathrm{avg}_{w_i \in s}(w_i); t_s]. \qquad (1)$$
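The averaging and type-embedding concatenation described above can be sketched in plain Python. The vectors, the mention spans and the one-dimensional type embeddings are hypothetical toy values; in the model the word vectors come from the encoder and the type embeddings are learned.

```python
from statistics import fmean

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]

# Hypothetical 2-d contextualised word vectors for one sentence of four words.
words = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# Hypothetical 1-d node type embeddings (learned in the real model).
type_emb = {"M": [0.1], "E": [0.2], "S": [0.3]}

# Two mentions of the same entity, given as word spans [start, end).
mention_spans = [(0, 2), (2, 3)]
mention_avgs = [average(words[a:b]) for a, b in mention_spans]

n_m = [avg + type_emb["M"] for avg in mention_avgs]  # mention nodes
n_e = average(mention_avgs) + type_emb["E"]           # entity node
n_s = average(words) + type_emb["S"]                  # sentence node
```

List concatenation (`+`) plays the role of the `[ ; ]` concatenation, so every node vector ends with its type embedding.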

Edge construction
We initially construct non-directed edges between the graph nodes using heuristic rules that stem from the natural associations between the elements of a document, i.e., mentions, entities and sentences. As we cannot know in advance if two entities are related, we do not directly connect entity nodes. Connections between nodes are based on pre-defined document-level interactions. The model objective is to generate entity-to-entity (EE) edge representations using other existing edges in the graph and consequently infer entity-to-entity relations. The different pre-defined edge types are described below.
Mention-Mention (MM): Co-occurrence of mentions in a sentence might be a weak indication of an interaction. For this reason, we create mention-to-mention edges only if the corresponding mentions reside in the same sentence. The edge representation between each mention pair $m_i$ and $m_j$ is generated by concatenating the representations of the nodes, the context $c_{m_i,m_j}$ and a distance embedding $d_{m_i,m_j}$ associated with the distance between the two mentions, in terms of intermediate words: $x_{MM} = [n_{m_i}; n_{m_j}; c_{m_i,m_j}; d_{m_i,m_j}]$. Here, we generate the context representation for these pairs in order to encode local, pair-centric information. We use an argument-based attention mechanism to measure the importance of other words in the sentence towards the mention, denoting $k \in \{1, 2\}$ as the mention arguments. In this mechanism, $n_{m_k}$ is a mention node representation, $w_i$ is a sentence word representation, $a_i$ is the attention weight of word $i$ for the mention pair $m_1, m_2$, $H \in \mathbb{R}^{w \times d}$ is the matrix of sentence word representations, $a \in \mathbb{R}^{w}$ is the attention weights vector for the pair and $c_{m_1,m_2}$ is the final context representation for the mention pair.
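One way to realise an argument-based attention of this kind is sketched below. The scoring function (a dot product between each mention argument and each word), the softmax normalisation and the averaging of the two argument weights are assumptions made for illustration; only the symbols $H$, $a$ and $c_{m_1,m_2}$ are taken from the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pair_context(H, n_m1, n_m2):
    """Attention weights a and context vector c for one mention pair.

    H holds one row per sentence word; n_m1 and n_m2 are the two mention
    vectors and must have the same dimensionality as the rows of H.
    """
    # Assumption: each mention argument scores every word by dot product.
    a1 = softmax([dot(n_m1, w) for w in H])
    a2 = softmax([dot(n_m2, w) for w in H])
    # Assumption: the pair weight of a word is the mean of both argument weights.
    a = [(x + y) / 2 for x, y in zip(a1, a2)]
    # Context: attention-weighted sum of the word vectors (H^T a).
    c = [sum(a[i] * H[i][j] for i in range(len(H))) for j in range(len(H[0]))]
    return a, c

# Hypothetical toy sentence with three 2-d word vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a, c = pair_context(H, [1.0, 0.0], [0.0, 1.0])
```

The weights in `a` sum to one, so `c` is a convex combination of the word vectors of the sentence.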
Mention-Sentence (MS): Mention-to-sentence nodes are connected only if the mention resides in the sentence. Their initial edge representation is constructed as a concatenation of the mention and sentence nodes, $x_{MS} = [n_m; n_s]$.
Mention-Entity (ME): We connect a mention node to an entity node if the mention is associated with the entity, $x_{ME} = [n_m; n_e]$.
Sentence-Sentence (SS): Motivated by prior work, we connect sentence nodes to encode non-local information. The main differences from prior work are that our edges are unlabelled, non-directed and span multiple sentences. To encode the distance between sentences, we concatenate to the sentence node representations their distance in the form of an embedding: $x_{SS} = [n_{s_i}; n_{s_j}; d_{s_i,s_j}]$. We connect all sentence nodes in the graph. We consider $SS_{direct}$ as direct, ordered edges (distance equal to 1) and $SS_{indirect}$ as indirect, non-ordered edges (distance > 1) between S nodes, respectively. In our setting, SS denotes the combination of $SS_{direct}$ and $SS_{indirect}$.
Entity-Sentence (ES): To directly model entity-to-sentence associations, we connect an entity node to a sentence node if at least one mention of the entity resides in this sentence, $x_{ES} = [n_e; n_s]$.
In order to obtain edge representations of equal dimensionality, we use different linear reduction layers for different edge representations,
$$e_z^{(1)} = W_z x_z, \qquad (2)$$
where $e_z^{(1)}$ is an edge representation of length 1, $W_z \in \mathbb{R}^{d_z \times d}$ corresponds to a learned matrix and $z \in \{MM, MS, ME, SS, ES\}$.
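The connectivity rules for the five edge types can be sketched as follows. Only the graph structure is built here; edge vectors are omitted. The toy document, the node identifiers and the function name `build_edges` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy document: mention id -> (entity id, sentence index).
mentions = {
    "m1": ("e1", 0),
    "m2": ("e2", 0),
    "m3": ("e2", 2),
    "m4": ("e3", 2),
}
num_sentences = 3

def build_edges(mentions, num_sentences):
    """Return {edge_type: set of node pairs} following the heuristic rules."""
    edges = {"MM": set(), "MS": set(), "ME": set(), "SS": set(), "ES": set()}
    for (a, (_, sa)), (b, (_, sb)) in combinations(sorted(mentions.items()), 2):
        if sa == sb:                          # MM: mentions share a sentence
            edges["MM"].add((a, b))
    for m, (e, s) in mentions.items():
        edges["MS"].add((m, f"s{s}"))         # MS: mention resides in the sentence
        edges["ME"].add((m, e))               # ME: mention is linked to the entity
        edges["ES"].add((e, f"s{s}"))         # ES: entity has a mention in the sentence
    for i, j in combinations(range(num_sentences), 2):
        edges["SS"].add((f"s{i}", f"s{j}"))   # SS: all sentence pairs are connected
    return edges

edges = build_edges(mentions, num_sentences)
```

No entity-to-entity edges are created at this stage; they only arise later from the inference over paths.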

Inference Layer
We utilise an iterative algorithm to generate edges between different nodes in the graph, as well as to update existing edges. We initialise the graph only with the edges described in Section 2.3.2, meaning that direct entity-to-entity (EE) edges are absent. We can only generate EE edge representations by representing a path between their nodes. This implies that entities can be associated through an edge path of minimum length equal to 3.
For this purpose, we adapt our two-step inference mechanism, proposed in Christopoulou et al. (2018), to encode interactions between nodes and edges in the graph and hence model EE associations.
At the first step, we aim to generate a path between two nodes $i$ and $j$ using intermediate nodes $k$. We thus combine the representations of two consecutive edges $e_{ik}$ and $e_{kj}$, using a modified bilinear transformation. This action generates an edge representation of double length. We combine all existing paths between $i$ and $j$ through $k$. The $i$, $j$, and $k$ nodes can be any of the three node types E, M, or S. Intermediate nodes without adjacent edges to the target nodes are ignored.
$$f(e_{ik}^{(l)}, e_{kj}^{(l)}) = \sigma\left(e_{ik}^{(l)} \odot (W e_{kj}^{(l)})\right), \qquad (3)$$
where $\sigma$ is the sigmoid non-linear function, $W \in \mathbb{R}^{d_z \times d_z}$ is a learned parameter matrix, $\odot$ refers to element-wise multiplication, $l$ is the length of the edge and $e_{ik}$ corresponds to the representation of the edge between nodes $i$ and $k$.
During the second step, we aggregate the original (short) edge representation and the new (longer) edge representation resulting from Equation (3) with linear interpolation as follows:
$$e_{ij}^{(2l)} = \beta \, e_{ij}^{(l)} + (1 - \beta) \sum_{k \neq i,j} f(e_{ik}^{(l)}, e_{kj}^{(l)}), \qquad (4)$$
where $\beta \in [0, 1]$ is a scalar that controls the contribution of the shorter edge representation. In general $\beta$ is larger for shorter edges as we expect that the relation between two nodes is better expressed through the shortest path between them (Xu et al., 2015; Borgwardt and Kriegel, 2005).
The two steps are repeated a finite number of times $N$. The number of iterations is correlated with the final length of the edge representations. With initial edge length $l$ equal to 1, the first iteration results in edges of length up to 2. The second iteration results in edges of length up to 4. Similarly, after $N$ iterations, the length of edges will be up to $2^N$.
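The two inference steps can be sketched in plain Python over a dictionary of directed edge vectors. The function names and the toy graph are hypothetical. Treating a missing short edge as a zero vector is an assumption of this sketch, not something stated in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def walk(e_ik, e_kj, W):
    """Step 1: combine two consecutive edges as sigmoid(e_ik * (W e_kj))."""
    return [sigmoid(a * b) for a, b in zip(e_ik, matvec(W, e_kj))]

def inference_step(edges, nodes, W, beta):
    """One iteration over a dict {(i, j): vector}; returns the updated dict."""
    dim = len(W)
    updated = {}
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            total, found = [0.0] * dim, False
            for k in nodes:
                if k == i or k == j:
                    continue
                # Intermediate nodes lacking an edge to i or j are ignored.
                if (i, k) not in edges or (k, j) not in edges:
                    continue
                step = walk(edges[(i, k)], edges[(k, j)], W)
                total = [t + s for t, s in zip(total, step)]
                found = True
            old = edges.get((i, j))
            if not found:
                if old is not None:
                    updated[(i, j)] = old      # nothing to aggregate, keep as is
                continue
            # Assumption: a missing short edge contributes a zero vector.
            base = old if old is not None else [0.0] * dim
            # Step 2: linear interpolation of the short and the long edge.
            updated[(i, j)] = [beta * o + (1 - beta) * t for o, t in zip(base, total)]
    return updated

# Hypothetical toy graph: two entities that share one mention-like neighbour.
nodes = ["e1", "m1", "e2"]
W = [[1.0, 0.0], [0.0, 1.0]]
edges = {
    ("e1", "m1"): [1.0, 0.0], ("m1", "e1"): [1.0, 0.0],
    ("m1", "e2"): [0.0, 1.0], ("e2", "m1"): [0.0, 1.0],
}
out = inference_step(edges, nodes, W, beta=0.8)
```

After one step, `out` contains a new edge between `e1` and `e2` that did not exist in `edges`, while the original edges are unchanged because no intermediate node connects their endpoints.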
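The relation between the number of iterations and the reachable edge length can be checked with a small sketch. The helper names are hypothetical. Since entity-to-entity paths have a minimum length of 3, the sketch shows that at least two iterations are needed before any EE edge can be formed.

```python
def max_edge_length(num_iterations, initial_length=1):
    """Maximum edge length reachable after the given number of iterations."""
    return initial_length * 2 ** num_iterations

def min_iterations(target_length):
    """Fewest iterations whose maximum edge length reaches target_length."""
    n = 0
    while max_edge_length(n) < target_length:
        n += 1
    return n

lengths = [max_edge_length(n) for n in range(4)]
needed_for_ee = min_iterations(3)
```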

Classification Layer
To classify the concept-level entity pairs of interest, we incorporate a softmax classifier, using the entity-to-entity edges (EE) of the document graph that correspond to the concept-level entity pairs,
$$y = \mathrm{softmax}(W_c \, e_{EE} + b_c), \qquad (5)$$
where $W_c \in \mathbb{R}^{r \times d_z}$ and $b_c \in \mathbb{R}^{r}$ are learned parameters of the classification layer and $r$ is the number of relation categories.
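A softmax classifier over a single EE edge representation can be sketched as below. The function name and the toy parameter values are hypothetical; in the model $W_c$ and $b_c$ are learned.

```python
import math

def classify(e_ee, W_c, b_c):
    """Probabilities over the r relation categories for one EE edge vector."""
    logits = [sum(w * x for w, x in zip(row, e_ee)) + b
              for row, b in zip(W_c, b_c)]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical parameters for r = 2 categories and d_z = 2.
W_c = [[1.0, 0.0], [0.0, 1.0]]
b_c = [0.0, 0.0]
probs = classify([1.0, 2.0], W_c, b_c)
predicted = max(range(len(probs)), key=probs.__getitem__)
```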

Experimental Settings
The model was developed using PyTorch (Paszke et al., 2017). We incorporated early stopping to identify the best training epoch and used Adam (Kingma and Ba, 2015) as the model optimiser.

Data and Task Settings
We evaluated the proposed model on two datasets:
CDR (BioCreative V): The Chemical-Disease Reactions dataset was created by Li et al. (2016a) for document-level RE. It consists of 1,500 PubMed abstracts, which are split into three equally sized sets for training, development and testing. The dataset was manually annotated with binary interactions between Chemical and Disease concepts. For this dataset, we utilised PubMed pre-trained word embeddings (Chiu et al., 2016).
GDA (DisGeNet): The Gene-Disease Associations dataset was introduced by Wu et al. (2019), containing 30,192 MEDLINE abstracts, split into 29,192 articles for training and 1,000 for testing. The dataset was annotated with binary interactions between Gene and Disease concepts at the document-level, using distant supervision. Associations between concepts were generated by aligning the DisGeNet (Piñero et al., 2016) platform with PubMed abstracts. We further split the training set into an 80/20 percentage split as training and development sets. For the GDA dataset, we used randomly initialized word embeddings.

Model Settings
We explore multiple settings of the proposed graph using different edges (MM, ME, MS, ES, SS) and enhancements (node type embeddings, mention-pairs context embeddings, distance embeddings). We name our model EoG, an abbreviation of Edge-oriented Graph. We briefly describe the model settings in this section. EoG refers to our main model with edges {MM, ME, MS, ES, SS}. The EoG (Full) setting refers to a model with a fully connected graph, where the graph nodes are all connected to each other, including E nodes. For this purpose, we introduce an additional linear layer for the EE edges as in Equation (2). The EoG (NoInf) setting refers to a no inference model, where the iterative inference algorithm (Section 2.4) is ignored. The concatenation of the entity node embeddings is used to represent the target pair. In this case, we also make use of an additional EE linear layer for EE edges. Finally, the EoG (Sent) setting refers to a model that was trained on sentences instead of documents. For each entity-level pair we merge the predictions of the mention-level pairs in different sentences using a maximum assumption: if at least one mention-level prediction indicates a relation then we predict the entity pair as related, similarly to Gu et al. (2017). All of the settings incorporate node type embeddings, contextual embeddings for MM edges and distance embeddings for MM and SS edges, unless otherwise stated.
Gu et al. (2017) develop separate models for intra- and inter-sentence pairs. As can be observed, the proposed model outperforms the state of the art on the CDR dataset by 1.3 percentage points of overall performance. We also show the methods that take advantage of syntactic dependency tools. Li et al. (2016b) use co-training with additional unlabeled training data. Our model performs significantly better on intra- and inter-sentential pairs, even compared to most of the models with external knowledge, except for Li et al. (2016b).
In addition, we report the performance of three baseline models. The EoG model outperforms all baselines for all pair types. In particular, for the inter-sentence pairs, performance significantly drops with a fully connected graph (Full) or without inference (NoInf). The former might indicate the existence of certain reasoning paths that should be followed in order to relate entities residing in different sentences. It is also important to note that the intra-sentence pairs substantially benefit from the document-level information, as EoG surpasses the performance of training on single sentences (Sent) by 3%. Finally, the performance drop in intra-sentence pairs, as a result of the inference algorithm removal (NoInf), suggests that multiple entity associations exist in sentences (Christopoulou et al., 2018). Their interactions can be beneficial in cases of lack of word context information.

Results
We also apply our model on the distantly supervised GDA dataset. As shown in Table 2, results for intra-sentence pairs are consistent with the findings of the CDR dataset for both development and test sets. This indicates that document-level information is helpful. However, performance differs for inter-sentence pairs and in particular for the fully connected graph (Full) baseline. We partially attribute this behavior to the small number of inter-sentence pairs in the GDA dataset (only 13% compared to 30% in the CDR dataset) that results in inadequate learning patterns for EoG. We leave further investigation as part of future work.

Analysis & Discussion
We first analyse the performance of our main model (EoG) using different pre-trained word embeddings. Table 3 shows the performance difference between domain-specific (PubMed) (Chiu et al., 2016), general-domain (GloVe) (Pennington et al., 2014) and randomly initialized (random) word embeddings. As observed, our proposed model performs consistently with both in-domain and out-of-domain pre-trained word embeddings. The low performance of random embeddings is due to the small size of the dataset, which results in lower quality embeddings.
For further analysis, we choose the CDR dataset as it is manually annotated. To better analyse the behaviour of our model, we conduct analysis on the effect of direct and indirect sentence-to-sentence edges as a function of the inference steps. Figures 3a, 3b and 3c illustrate the performance of both graphs for overall, intra- and inter-sentence pairs respectively.
The first observation is that usage of direct edges only, reduces the overall performance al-  Table 1 where we showed that inter-sentence information can act as complementary evidence for intra-sentence pairs.
We additionally conduct ablation analysis on the graph edges and nodes, as shown in Table 4. Usage of EE edges only results in poor performance across pairs. Removal of MM and ME edges does not significantly affect the performance as ES edges can replace their impact. Complete removal of connections to M nodes results in low inter-sentence performance. This behaviour pinpoints the importance of some local dependencies in identifying cross-sentence relations.
Removal of ES edges reduces the performance of all pairs, as encoding of EE edges becomes more difficult. We further observe very poor identification of inter-sentence pairs without sentence-to-sentence connections. This is complementary to the inability of the model to identify any inter-sentence pairs without connections to S nodes. In this scenario, we enable identification of pairs across sentences only through MM and ME edges, as shown in Figure 4a. In the CDR dataset, 78% of inter-sentential pairs have at least one argument that is mentioned only once in the document. The identification of these pairs, without S nodes, requires very long inference paths. As shown in Figure 4b, the introduction of S nodes results in a path with half the length, which we expect to better represent the relation. Longer inference representations are much weaker than shorter ones. This suggests that the inference mechanism has limited capability in identifying very complex associations.
We then investigate the additional enhancements of the graph edges in Table 5. In general, intra-sentence pairs are not affected by these settings. However, for inter-sentence pairs, removal of node type embeddings and distance embeddings results in a 2% and 5% drop in terms of F1-score. These results indicate that the interactions between different elements in a document, along with the distance between sentences and mentions, play an important role in inter-sentence pair inference. Removing all of these settings does not perform worse than removing one of them, which might indicate model overfitting. We plan to further investigate this as part of future work.
We examine the performance of different models on inter-sentence pairs, based on their sentence-level distances. Figure 5 illustrates that for long-distanced pairs, EoG has lower performance, indicating the difficulty in predicting them and a possible requirement for other, latent document-level information (EoG (Full)).
As a final analysis, we investigate some of the cases where the graph models are unable to identify inter-sentence related pairs. For this purpose, we randomly check some of the common false negative errors among the EoG models. We identify three frequent cases of errors, as shown in Table 6. In the first case, when multiple entities reside in the same sentence and are connected with conjunctions (e.g., 'and') or commas, the model often failed to find associations with all of them. The second error derives from missing coreference connections. For instance, pyeloureteritis cystica is referred to as disease. Although our model cannot directly create these edges, S nodes potentially simulate such links, by encoding the co-referring entities into the sentence representation. Finally, incomplete entity linking results in additional model errors. For instance, in the third example, hemorrhage and intracranial bleeding are synonymous terms. However, they are assigned different KB IDs, hence treated as different entities. The model can find the intra-sentential relation but not the inter-sentential one.

Related Work
Traditional approaches focus on intra-sentence supervised RE, utilising CNN or RNN, ignoring multiple entities in a sentence (Zeng et al., 2014; Nguyen and Grishman, 2015) as well as incorporating external syntactic tools (Miwa and Bansal, 2016). Christopoulou et al. (2018) considered intra-sentence entity interactions without domain dependencies by modelling long dependencies between the entities of a sentence.
Other approaches deal with distantly supervised datasets but are also limited to intra-sentential relations. They utilise Piecewise Convolutional Neural Networks (PCNN) (Zeng et al., 2015), attention mechanisms (Lin et al., 2016; Zhou et al., 2018), entity descriptors (Jiang et al., 2016) and graph CNNs (Vashishth et al., 2018) to perform MIL on bags-of-sentences that contain multiple mentions of an entity pair. Recently, Zeng et al. (2017) proposed a method for extracting paths between entities using the target entities' mentions in several different sentences (in possibly different documents) as intermediate connectors. They allow mention-mention edges only if these mentions belong to the same entity and consider that a single mention pair exists in a sentence. On the contrary, we not only allow interactions between all mentions in the same sentence, but also consider multiple edges between mentions, entities and sentences in a document.
Current approaches that try to deal with document-level RE are mostly graph-based.  introduced the notion of a document graph, where nodes are words and edges represent intra-and inter-sentential relations between the words. They connected words with different dependency edges and trained a binary logistic regression classifier. They evaluated their model on distantly supervised full-text articles from PubMed for Gene-Drug associations, restricting pairs within a window of consecutive sentences. Following this work, other approaches incorporated graphical models for document-level RE such as graph LSTM (Peng et al., 2017), graph CNN (Song et al., 2018) or RNNs on dependency tree structures (Gupta et al., 2019). Recently, Jia et al. (2019) improved n-ary RE using information from multiple sentences and paragraphs in a document. Similar to our approach, they choose to directly classify concept-level pairs rather than multiple mention-level pairs. Although they consider sub-relations to model related tuples, they ignore interactions with other entities outside of the target tuple in the discourse units.
Non-graph-based approaches utilise different intra- and inter-sentence models and merge the resulting predictions (Gu et al., 2016, 2017). Other approaches extract document-level representations for each candidate entity pair (Zheng et al., 2018; Wu et al., 2019), or use syntactic dependency structures. Verga et al. (2018) proposed a Transformer-based model for document-level relation extraction with multi-instance learning, merging multiple mention pairs. Nguyen and Verspoor (2018) used a CNN with additional character-level embeddings. Singh and Bhatia (2019) also utilised Transformer and connected two target entities by combining them directly and via a contextual token. However, they consider a single target entity pair per document.

Conclusion
We presented a novel edge-oriented graph neural model for document-level relation extraction using multi-instance learning. The proposed model constructs a document-level graph with heterogeneous types of nodes and edges, modelling intra- and inter-sentence pairs simultaneously with an iterative algorithm over the graph edges. To the best of our knowledge, this is the first approach to utilise an edge-oriented model for document-level RE.
Analysis on intra- and inter-sentence pairs indicated that the proposed, partially-connected, document graph structure can effectively encode dependencies between document elements. Additionally, we deduce that document-level information can contribute to the identification of intra-sentence pairs leading to higher precision and F1-score.
As future work, we plan to improve the inference mechanism and potentially incorporate additional information in the document-graph structure. We hope that this study will inspire the community to further investigate the usage of edge-oriented models on RE and other related tasks.
The chosen batch size was equal to 2. For the GDA dataset, EoG and EoG (Full) performed best with l = 16 and EoG (Sent) with l = 8 inference steps. The chosen batch size was equal to 3. For all experiments performance was measured in terms of micro precision (P), recall (R) and F1-score (F1). We list the hyper-parameters used to train the proposed model in Table 9.
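The micro-averaged scores mentioned above can be computed as in the following sketch, which pools predictions over all documents before computing the ratios. The function name, the tuple layout (document, first entity, second entity) and the toy identifiers are hypothetical; this is not the evaluation script of the released code.

```python
def micro_prf(gold, predicted):
    """Micro precision, recall and F1 over sets of (doc, entity_a, entity_b)."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical related pairs pooled over two documents.
gold = {("d1", "c1", "x1"), ("d1", "c2", "x1"), ("d2", "c3", "x2")}
predicted = {("d1", "c1", "x1"), ("d2", "c3", "x9")}
p, r, f1 = micro_prf(gold, predicted)
```

Here one of the two predicted pairs is correct and one of the three gold pairs is found, giving a precision of 0.5 and a recall of one third.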