The Dots Have Their Values: Exploiting the Node-Edge Connections in Graph-based Neural Models for Document-level Relation Extraction

The goal of Document-level Relation Extraction (DRE) is to recognize the relations between entity mentions that can span beyond sentence boundaries. The current state-of-the-art method for this problem involves a graph-based edge-oriented model in which the entity mentions, entities, and sentences of the documents are used as the nodes of the document graphs for representation learning. However, this model does not compute representations for the nodes in the graphs, preventing it from effectively encoding the specific and relevant information of the nodes for DRE. To address this issue, we propose to explicitly compute the representations for the nodes in the graph-based edge-oriented model for DRE. These node representations allow us to introduce two novel representation regularization mechanisms to improve the representation vectors for DRE. The experiments show that our model achieves state-of-the-art performance on two benchmark datasets.


Introduction
An important task of Information Extraction is Relation Extraction (RE), which seeks to identify the semantic relationships between entities mentioned in text. Prior work has mainly focused on the intra-sentence scenario where the two entity mentions appear in the same sentence (Zhou et al., 2005; Zeng et al., 2014; Nguyen and Grishman, 2015). In this work, we study a more recent setting for RE, called document-level RE (DRE), that additionally considers relations between two entity mentions in different sentences of a document (i.e., inter-sentence relations).
The current methods for document-level RE have relied intensively on deep learning to induce effective representation vectors for relation prediction. Among these deep learning models, graph-based neural networks have been demonstrated as one of the most effective approaches for DRE due to their ability to capture long-distance and inter-sentential information in text (Peng et al., 2017; Gupta et al., 2019). In particular, recent work has introduced a graph-based edge-oriented network that achieves state-of-the-art performance for DRE.
The key idea in this model is to build an interaction graph for each input document whose nodes include the entity mentions, the entities, and the sentences. Note that this is fundamentally different from prior graph-based models for RE, which have mostly used words as the nodes of the graphs (Gupta et al., 2019). In this model, the edges between the nodes are determined by the coreference of the entity mentions and the appearance of the entity mentions in the sentences. The representation vectors for the edges of the graph (hence "edge-oriented") are then computed via several inference layers, serving as the features to predict the relations between the pairs of entities in the documents. In this way, the model can leverage the interactions between nodes and edges of different types to obtain richer representation vectors for the edges between the entity nodes.
Despite its good performance, a major limitation of the graph-based edge-oriented model for DRE is that it only focuses on the edge representations and ignores the representations for the nodes of the graphs. On the one hand, this edge-only representation approach cannot explicitly encode the information that is specific to the entities/entity mentions in the documents (as it only captures the representations of pairs of entities/entity mentions), potentially missing an important clue to boost the performance for DRE (e.g., entity subtypes). On the other hand, the lack of node representations prevents the models from exploiting the relations between the node and edge representations (e.g., the translation relation (Bordes et al., 2013)) and the similarity between the representation vectors of the entity mention nodes of the same entities to enhance the representation learning for DRE. Consequently, in this work, we propose to improve the graph-based edge-oriented model for DRE by explicitly computing the representations for the nodes in the graphs. Based on such node representations, we introduce two novel regularization techniques that capture the similarities between the node and edge representations of the same edges or the same entities, improving the representations for DRE. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, yielding state-of-the-art performance for this task on two benchmark datasets.

Model
Formally, in the DRE problem, the input involves a document D with S, M, and E as the sets of sentences, entity mentions, and entities (respectively) in D. The goal of DRE is to predict the semantic relationships between each pair of entities in E (including the type NONE for entity pairs with no relation). In this section, we first describe the graph-based edge-oriented model (EoG). Afterward, we present our novel node representation computation and regularization mechanisms for this model.

The Graph-based Edge-oriented Model
In EoG, the words in the sentences in S are first transformed into vectors using pre-trained word embeddings. A bidirectional Long Short-Term Memory (BiLSTM) network is then applied over the sentences in S (treated as sequences of word embedding vectors) to obtain contextualized representation vectors for the words in D (called the BiLSTM vectors for the words). Afterward, EoG constructs an interaction graph G = (V, E) to compute the representations for the entity pairs in E for relation prediction. In particular, the node set V in G involves three types of nodes: a sentence node n_s for each sentence s in S, an entity mention node n_m for each entity mention m in M, and an entity node n_e for each entity e in E. In EoG, each node in V is associated with an initial embedding vector to facilitate the edge representation computation later. The embedding vectors n_s, n_m, and n_e for the sentence node n_s, the mention node n_m, and the entity node n_e (respectively) are formed by: n_s = [avg_{w ∈ s} w, t_s], n_m = [avg_{w ∈ m} w, t_m], and n_e = [avg_{m ∈ e} n_m, t_e], where avg is the averaging operation over a set of vectors, w is the BiLSTM vector of the word w, [] is vector concatenation, and t_s, t_m, and t_e are embedding vectors that specify the types of the nodes.
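As an illustration, the initial node embeddings can be sketched as follows (a minimal numpy sketch; the dimensions, random data, and function names are ours, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # BiLSTM vector size (illustrative)
# Learned type embeddings t_s, t_m, t_e that tag each node with its type.
type_emb = {"s": rng.normal(size=2), "m": rng.normal(size=2), "e": rng.normal(size=2)}

def init_node(vectors, node_type):
    """[average of constituent vectors, type embedding], as in the EoG construction."""
    return np.concatenate([np.mean(vectors, axis=0), type_emb[node_type]])

sent_vecs = rng.normal(size=(3, d))     # BiLSTM vectors of a 3-word sentence
n_s = init_node(sent_vecs, "s")         # sentence node: average of its word vectors
n_m = init_node(sent_vecs[:1], "m")     # mention node: average of its (here 1) word vectors
n_e = init_node(np.stack([n_m]), "e")   # entity node: average of its mention node vectors
```

Note that, following the formulas above, the entity node averages mention node vectors (which already include a type embedding), so its dimension differs from that of the sentence and mention nodes.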
Given the nodes in V, the edges in E in EoG are undirected and include five major types, i.e., Mention-Mention, Mention-Sentence, Mention-Entity, Sentence-Sentence, and Entity-Sentence. These edge types essentially employ the coreference information of the entity mentions, their correspondence with the entities, and their appearance in the sentences to establish the edges in E.
In EoG, each edge z = (i, j) ∈ E is assigned an initial embedding vector e^1_z based on the initial embedding vectors of its end nodes n_i and n_j described above. These initial edge embedding vectors are then fed into N inference layers to produce the representation vectors for the entity pairs in E for relation prediction. In particular, in the l-th inference layer (1 ≤ l ≤ N), EoG computes the edge embedding vector e^{2^l}_{(i,j)} for two nodes i ≠ j ∈ V by:

e^{2^l}_{(i,j)} = β · e^{2^{l-1}}_{(i,j)} + (1 − β) · Σ_{k ≠ i,j} σ(e^{2^{l-1}}_{(i,k)} ⊙ e^{2^{l-1}}_{(k,j)})

where σ is the sigmoid function, ⊙ is the element-wise product, and β ∈ [0, 1] is a controlling constant. Note that the representation vector e^{2^l}_{(i,j)} is able to capture paths of length up to 2^l in G. The representation vectors for the entity node pairs in the last inference layer (i.e., e^{2^N}_{(e_i,e_j)}) are used to perform relation prediction in EoG.
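One inference layer, under our reading of this description, can be sketched as follows (the dense edge tensor and the exact gating are assumptions; shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inference_layer(E, beta=0.8):
    """One inference step on a dense edge tensor E of shape (n, n, d):
    keep the direct edge (weight beta) and add sigmoid-gated two-hop
    interactions i -> k -> j (weight 1 - beta), doubling the path
    length captured by the edge representations."""
    n, _, d = E.shape
    E_new = np.array(E)                  # copy; diagonal (i == j) stays untouched
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            agg = np.zeros(d)
            for k in range(n):
                if k in (i, j):
                    continue
                agg += sigmoid(E[i, k] * E[k, j])   # element-wise product, gated
            E_new[i, j] = beta * E[i, j] + (1 - beta) * agg
    return E_new

edges = np.zeros((3, 3, 2))              # toy graph: 3 nodes, edge dimension 2
out = inference_layer(edges)
```

Stacking N such layers gives the e^{2^N} representations used for prediction; the dense O(n^2) tensor here is only for clarity.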

The Proposed Model
The Node Representation: As mentioned in the introduction, a problem with the original EoG model is its failure to exploit the representations of the nodes in V to improve the representations for DRE. In this work, we propose to explicitly compute the node representation vectors and use them to aid the representation learning for DRE. In particular, starting with the initial embedding vectors for the nodes (i.e., n^1_s = n_s, n^1_m = n_m, and n^1_e = n_e for the nodes in V), we obtain the representation for the node i ∈ V at the l-th inference layer (1 ≤ l ≤ N), motivated by the edge representation computation in EoG, via:

n^{2^l}_i = γ · n^{2^{l-1}}_i + (1 − γ) · Σ_{j ≠ i} σ(n^{2^{l-1}}_j ⊙ e^{2^{l-1}}_{(i,j)})

where γ is a controlling factor. With these node representation vectors, we predict the relation between the entities e_i, e_j ∈ E by first forming a feature vector V_{e_i,e_j} = [n^{2^N}_{e_i}, n^{2^N}_{e_j}, e^{2^N}_{(e_i,e_j)}] (i.e., the vectors from the last inference layer). V_{e_i,e_j} is then sent to a feed-forward network with a softmax layer at the end to produce the distribution P(·|e_i, e_j) over the possible relations for DRE. Finally, the negative log-likelihood L_pred over all the entity pairs in E is employed to train the models in this work:

L_pred = − Σ_{e_i, e_j ∈ E} log P(y_{e_i,e_j} | e_i, e_j)

where y_{e_i,e_j} is the gold relation between e_i and e_j.

The Node-Edge Representation Consistency: The introduction of the node representation vectors allows us to leverage the relations between the representation vectors of an edge (i, j) ∈ E and its two end nodes i, j ∈ V to regularize the representations, potentially improving their quality for DRE.
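As an illustration of the prediction step just described, the following minimal numpy sketch forms the feature vector V_{e_i,e_j} and the negative log-likelihood of a gold label; the linear scoring layer W stands in for the paper's feed-forward network, and all names and sizes are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_and_loss(n_i, n_j, e_ij, W, gold):
    """Concatenate [n_i, n_j, e_ij], score with a linear layer W (a stand-in
    for the feed-forward network), and return (probs, NLL of the gold label)."""
    V = np.concatenate([n_i, n_j, e_ij])
    probs = softmax(W @ V)
    return probs, -np.log(probs[gold])

rng = np.random.default_rng(1)
d = 3
W = rng.normal(size=(2, 3 * d))          # 2 relation labels, e.g., NONE vs. a relation
probs, loss = predict_and_loss(rng.normal(size=d), rng.normal(size=d),
                               rng.normal(size=d), W, gold=1)
```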
In particular, motivated by knowledge graph embedding methods (Bordes et al., 2013), we propose to enforce the representation vectors for the nodes and edges in G to follow the translation relation (i.e., the sum of the representation for the node i ∈ V and the representation for the edge (i, j) ∈ E should be as close as possible to the representation for the node j ∈ V). Formally, this amounts to adding the following term L_rel to the overall loss function to train the models:

L_rel = Σ_{l=1..N} Σ_{(i,j) ∈ E} ||n^{2^l}_i + e^{2^l}_{(i,j)} − n^{2^l}_j||

The Mention Representation Consistency: In addition to the node-edge consistency, the explicit representations for the nodes in G facilitate the use of the coreference information between the entity mentions to constrain the representations for DRE. Specifically, to further improve the representation vectors for DRE, we propose to encourage the embedding vectors for the entity mention nodes of the same entities to be similar to each other. This is based on the intuitive assumption that the embedding vectors for coreferring entity mentions should capture the underlying semantics/information of the entity they refer to, and should thus be close to each other. We expect this explicit similarity regularization between the entity mention representations to help the embedding vectors encode more meaningful information for DRE. In particular, to achieve such similarities, we incorporate the following loss term L_const into the overall loss function:

L_const = Σ_{l=1..N} Σ_{e ∈ E} Σ_{m_i ≠ m_j ∈ e} (1 − cosine(n^{2^l}_{m_i}, n^{2^l}_{m_j}))

Here, 1 − cosine(n^{2^l}_{m_i}, n^{2^l}_{m_j}) measures the dissimilarity between n^{2^l}_{m_i} and n^{2^l}_{m_j}. To summarize, the overall loss function L in this work is: L = L_pred + α_rel · L_rel + α_const · L_const, with α_rel and α_const as trade-off parameters. The EoG model Augmented with Node Representations in this work is called EoGANE for convenience.
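The two regularizers can be sketched per layer as follows (a minimal numpy sketch; the summations over layers and entities are omitted, and the variable names are ours):

```python
import numpy as np

def l_rel(n_i, e_ij, n_j):
    """Translation (TransE-style) term ||n_i + e_ij - n_j|| for one edge at one layer."""
    return np.linalg.norm(n_i + e_ij - n_j)

def l_const(mentions):
    """Sum of (1 - cosine) over all mention-node pairs of one entity at one layer."""
    total = 0.0
    for a in range(len(mentions)):
        for b in range(a + 1, len(mentions)):
            u, v = mentions[a], mentions[b]
            total += 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return total

# A perfectly consistent triple (n_j = n_i + e_ij) incurs zero translation loss.
zero = l_rel(np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([1.5, 1.0]))
```

Collinear mention vectors incur zero consistency loss, so the term only penalizes mentions of the same entity that point in different directions.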

Experiments
Datasets & Parameters: We evaluate the models on two benchmark datasets for DRE, i.e., CDR and GDA. The CDR (Chemical-Disease Reactions) dataset is manually annotated for the binary interactions between Chemical and Disease concepts (Li et al., 2016a), while GDA (Gene-Disease Associations) (Ye et al., 2019) provides annotations for the binary interactions between Gene and Disease concepts obtained with distant supervision. For both datasets, we follow the same data preprocessing and splits (i.e., training/development/test data) as the prior work to ensure a fair comparison. Also, similar to prior work, we use the PubMed pre-trained word embeddings (Chiu et al., 2016) for the models on CDR, while randomly initialized word embeddings are employed for GDA. These word embeddings are optimized during the training process of the models in this work.
We implement the EoGANE model in this work by extending the code for the EoG model provided in its original paper. As such, we inherit the values for the hyper-parameters shared between EoG and EoGANE (e.g., 0.002 for the learning rate of the Adam optimizer, 2 and 3 for the batch sizes on the CDR and GDA datasets respectively, and 100 for the dimension of the node/edge representation vectors in the inference layers). The values for the hyper-parameters specific to EoGANE (i.e., tuned on the development data of each dataset) are: γ = 0.4 for the controlling constant in the node representation computation of the inference layers (for both CDR and GDA), α_rel = 0.5 and α_const = 0.1 for the trade-off parameters in the overall loss function L on the CDR dataset, and α_rel = 0.4 and α_const = 0.6 on the GDA dataset.
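For reference, these settings can be grouped into a single configuration sketch (the key names are ours; the values come from the text above):

```python
# Hypothetical grouping of the reported hyper-parameters; key names are
# illustrative, values are those stated in the paper's text.
HPARAMS = {
    "learning_rate": 0.002,                  # Adam optimizer
    "batch_size": {"CDR": 2, "GDA": 3},
    "repr_dim": 100,                         # node/edge vectors in inference layers
    "gamma": 0.4,                            # node-update controlling constant
    "alpha_rel": {"CDR": 0.5, "GDA": 0.4},   # weight of the node-edge consistency loss
    "alpha_const": {"CDR": 0.1, "GDA": 0.6}, # weight of the mention consistency loss
}
```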
Comparing to the State of the Art: This part compares the proposed model EoGANE with the state-of-the-art models for DRE. Table 1 reports the performance of the models on the CDR test set. In particular, the direct baseline for EoGANE is the EoG model, which also has the best reported performance on the CDR dataset. In this table, we distinguish the models for DRE based on whether they use external knowledge/resources (e.g., syntactic dependency tools) or not. As we can see from the table, among the models that do not rely on any external knowledge/resources (i.e., Gu et al. (2017), Verga et al. (2018), Nguyen and Verspoor (2018), EoG, and the proposed model EoGANE), EoGANE achieves the best performance in the overall, intra-, and inter-sentence settings. In particular, EoGANE is 2.5% better than the second-best model and direct baseline EoG in absolute overall F1 score. This difference is significant with p < 0.01 and clearly demonstrates the effectiveness of the proposed model for DRE. Compared to the models with external resources (e.g., additional training data, external tools) (Zhou et al., 2016; Zheng et al., 2018), EoGANE is also significantly better over the different settings (i.e., the overall, intra-, and inter-sentence performance). The only exception is the overall F1 score of the model in (Li et al., 2016b) that utilizes additional unlabeled training data. This further testifies to the benefits of EoGANE for DRE.

Model                          Overall  Intra  Inter
(Gu et al., 2017)              61.3     57.2   11.7
(Verga et al., 2018)           62.1     -      -
(Nguyen and Verspoor, 2018)    62.3     -      -
EoG                            63.6     68.2   50.9
EoGANE (ours)

Table 1: Performance on the CDR test set.

Finally, Table 2 shows the performance of EoG and EoGANE on the GDA test set. Note that on this dataset, EoG also has the best reported performance. It is clear from the table that EoGANE is still significantly better than EoG in all performance settings (with p < 0.01), thus further confirming the advantage of EoGANE for DRE.
Ablation Study: There are three major components of EoGANE in this work, i.e., the node representation computation in the inference layers for G (called NodeRep), the L_rel loss term for the consistency of the node and edge representations in the inference layers, and the L_const loss term for the similarity of the mention representations. This part evaluates the contribution of these components by removing each of them from EoGANE and evaluating the performance of the remaining models. Note that if the node representation computation is not included in the inference layers (i.e., EoGANE-NodeRep), the two losses L_rel and L_const are also not used, and the feature vector V_{e_i,e_j} only involves the edge representation e^{2^N}_{(e_i,e_j)} (i.e., V_{e_i,e_j} = [e^{2^N}_{(e_i,e_j)}]). In order to further show the benefits of including the node representations in the inference layers, we also evaluate the performance of the EoGANE-NodeRep model when the initial embedding vectors for the nodes (i.e., n^1_i) are incorporated into the feature vector V_{e_i,e_j} for relation prediction (i.e., V_{e_i,e_j} = [n^1_{e_i}, n^1_{e_j}, e^{2^N}_{(e_i,e_j)}]) (called EoGANE-NodeRep+Init). Table 3 presents the overall performance of the models on the CDR development set. As we can see from the table, eliminating any component of EoGANE significantly hurts the performance, clearly verifying the effectiveness of the node representations and the proposed consistency constraints for DRE.
Analysis: In order to better understand the contribution of the node representation vectors in the proposed model EoGANE, we examine the examples in the CDR test set that are correctly predicted by EoGANE but lead to incorrect predictions for EoG. Our analysis reveals that these examples tend to involve entities where the local/specific information of the individual entities is crucial to determine the relations between them. As EoGANE explicitly induces the node representations for the entities and includes them in the feature vector for relation prediction, it can learn to capture such specific information of the entities to make correct predictions for these examples. Consider the following document (with two sentences) as an example: "The annual incidence of warfarin-related bleeding at Brigham and Women's Hospital increased from 0.97/1,000 patient admissions in the first time period (January 1995 to October 1998) to 1.19/1,000 patient admissions in the second time period (November 1998 to August 2002) of this study. The proportion of patients with major and intracranial bleeding increased from 20.2% and 1.9%, respectively, in the first time period, to 33.3% and 7.8%, respectively, in the second." The two entities of interest in this document are "warfarin" (a chemical) and "intracranial bleeding" (a disease), whose entity mentions are in bold (i.e., "warfarin" and "intracranial bleeding"). In order to successfully predict the interaction relation between these two entities, the most important information for the models is that both entities are connected to the topic phrase "warfarin-related bleeding" of the document. In particular, for the entity "warfarin", the appearance of its only mention "warfarin" in the topic phrase "warfarin-related bleeding" directly helps to identify "warfarin" as the chemical causing or being related to the bleeding in the phrase.
The entity "intracranial bleeding" likewise has only one mention in the document (i.e., "intracranial bleeding"). Based on the appearance of the word "bleeding" in both this mention and the topic phrase "warfarin-related bleeding", we can infer that the entity "intracranial bleeding" refers to the bleeding type expressed by the topic phrase in the document. Consequently, combining these pieces of evidence, we can conclude that the entity "intracranial bleeding" is caused by the entity "warfarin" in this case.
A notable point in our argument is that for both entities "warfarin" and "intracranial bleeding", their connections to the topic phrase can only be induced from the local context of their own entity mentions (i.e., the phrases "warfarin-related bleeding" and "intracranial bleeding"), highlighting the importance of the local/specific context of entity mention nodes for DRE. As EoGANE explicitly computes representation vectors for the nodes in the document graphs, it can learn to encode such local/specific context information of the entity mentions/entities into its representation vectors for the entities. These entity representation vectors, once incorporated into the feature vector V_{e_i,e_j} for relation prediction, can help the model emphasize this entity-specific information to perform appropriate reasoning and produce the correct prediction. This is in contrast to EoG, which only focuses on the representation vectors of the edges, potentially blurring the information specific to the individual entities/entity mentions and failing to predict the relation in this case.

Conclusion
We present a novel deep learning model for DRE that explicitly computes the node representations for the document graphs in the graph-based edge-oriented models for DRE. This enables the models to better capture the specific information of the nodes and facilitates the incorporation of two novel consistency constraints to improve the representation vectors. The experiments demonstrate the effectiveness of the proposed method for DRE. In the future, we plan to apply the models in this work to related tasks in information extraction.