Graph Enhanced Dual Attention Network for Document-Level Relation Extraction

Document-level relation extraction requires inter-sentence reasoning capabilities to capture local and global contextual information for multiple relational facts. To improve inter-sentence reasoning, we propose to characterize the complex interaction between sentences and potential relation instances via a Graph Enhanced Dual Attention network (GEDA). In GEDA, the sentence representation generated by the sentence-to-relation (S2R) attention is refined and synthesized by a Heterogeneous Graph Convolutional Network before being fed into the relation-to-sentence (R2S) attention. We further design a simple yet effective regularizer based on the natural duality of the S2R and R2S attention, whose weights are also supervised by the supporting evidence of relation instances during training. An extensive set of experiments on an existing large-scale dataset shows that our model achieves competitive performance, especially for inter-sentence relation extraction, while the neural predictions are also interpretable and easily observed.


Introduction
Relation extraction (RE) is an important research topic in natural language processing with plenty of applications. The task of RE is to detect the relationship between target entities in a given context. Depending on the given context, RE can be divided into two types: sentence-level RE (Zeng et al., 2015; Zhou et al., 2016; Ji et al., 2017; He et al., 2018; Fu et al., 2019; Guo et al., 2019) and document-level RE (Sahu et al., 2019; Gupta et al., 2019; Wang et al., 2019). Compared with the widely studied sentence-level RE, document-level RE is a more challenging task that requires more investigation.
One of the main challenges in document-level RE is how to perform inter-sentence reasoning over a long document to synthesize local and global information for potential relational facts. According to the statistics of DocRED (Yao et al., 2019), a large-scale human-annotated document-level RE dataset, 40.7% of its relational facts can only be extracted from multiple sentences. This means that a desirable neural model for document-level RE requires sophisticated inter-sentence reasoning capabilities.
In this paper, we propose to improve inter-sentence reasoning by better characterizing the complex interaction between multiple sentences and multiple potential relation instances in a document. Since one relation instance may be expressed by multiple sentences and one sentence may reveal multiple relational facts (or parts of relational facts), it is natural and straightforward to use attention between sentences and potential relation instances to capture this complex many-to-many interaction. To this end, we introduce a bi-directional attention mechanism consisting of the attention paid by a sentence to relation instances (sentence-to-relation, S2R) and the attention paid by a relation instance to sentences (relation-to-sentence, R2S). Though bi-directional attention has been widely used in other NLP tasks (e.g., machine comprehension (Seo et al., 2017) and sentiment analysis (Zhao et al., 2020)), what makes our architecture unique is the following three designs built on the classic bi-directional attention.
• Graph-enhancing operation. Sentences that express a specific relational fact may be located in different parts of a document, e.g., far apart from each other. Synthesizing cross-sentence information with many noisy sentences in between may not be accurate enough if we generate sentence representations with a classic attention mechanism. Since sentences and entities in a document naturally form a graph with rich semantics, we refine and synthesize the sentence representations generated by S2R attention via a Heterogeneous Graph Convolutional Network before feeding them into the R2S attention layer, to generate more accurate representations of potential relation instances. As demonstrated in the experiments, this graph-enhancing operation benefits inter-sentence reasoning significantly.
• Regularizer of attention duality. Intuitively, the more attention a sentence pays to a relation instance, the more supporting evidence the sentence contains for the relation instance. Conversely, the relation instance should also pay more attention to the sentence to obtain a more accurate representation. This observation suggests that there is a duality between S2R and R2S attention. This natural duality can provide our architecture with a useful inductive bias as a simple and effective regularizer.
• Attention supervision from supporting evidence. The above two designs give us an architecture with graph-enhanced dual attention. Normally, the attention weights are trained implicitly with the signal from the ground truth of relation instances. We further leverage the supporting evidence for each relation instance as a supervised signal for the weights of R2S attention, which also provides our model with more interpretability.
To evaluate our approach, we carried out extensive experiments on the DocRED dataset (Yao et al., 2019). From the experimental results, we found that characterizing the interaction between sentences and relation instances with our graph-enhanced dual attention network could significantly improve the performance of document-level RE. Our main contributions are:
• We propose a Graph Enhanced Dual Attention network (GEDA) for document-level relation extraction, which improves inter-sentence reasoning by better characterizing the complex interaction between sentences and potential relation instances.
• The novelty of GEDA lies in its three components: a graph-enhancing operation, a regularizer of attention duality, and attention supervision from supporting evidence. These are shown to be effective in improving the performance of document-level RE and in providing sound interpretability.

Proposed Approach
In this section, we present GEDA in detail. The overall architecture is shown in Fig. 1. Given a document D containing n words, m sentences, and k different entities, there are k · (k − 1) potential relation instances, and our goal is to extract all of them in parallel. GEDA mainly consists of 5 components: 1) the encoding layer, which generates the preliminary representations of sentences, entities and relation instances; 2) the graph-enhanced bi-directional attention layer, which synthesizes the intra-sentence and inter-sentence information to generate a refined representation of relation instances; 3) the constraint of attention duality as a regularizer; 4) the evidence-supervision loss; and 5) the final classification layer, which projects the representation of a potential relation instance to the probabilities for each relation type.

Encoding Layer
The encoding layer first converts the input document into a real-valued matrix, which contains three types of embeddings: 1) the word embedding; 2) the entity type embedding, which indicates the entity type information of each word; and 3) the entity order embedding, which represents the order of an entity's first appearance in the document (Yao et al., 2019). A BiLSTM layer with h hidden units is used as an encoder to extract the semantic information, and its output is denoted as the semantic representation $H \in \mathbb{R}^{n \times 2h}$ of the document. We then generate preliminary representations of sentences, entities and relation instances based on $H$.
Preliminary representations of sentences. We use max-pooling to obtain the preliminary representation of each sentence. We use $l_i \in \mathbb{R}^{1 \times 2h}$ to denote the preliminary representation of the i-th sentence.
Preliminary representations of entities. Since a given entity may have several mentions in the document, we first extract entity mention representations from $H$ to obtain the preliminary entity representation. For an entity mention ranging from the a-th to the b-th word, the mention representation is calculated as $t_j = \frac{1}{b-a+1} \sum_{loc=a}^{b} H_{loc}$. The entity representation $e_j \in \mathbb{R}^{1 \times 2h}$ is the average of all mention vectors of the j-th entity.
Preliminary representations of relation instances. We generate the preliminary relation representation for every entity pair $\langle e_p, e_q \rangle$ using a bilinear function, where $p, q \in [1, k]$ and $p \neq q$. For the $k \cdot (k-1)$ potential relation instances, $k \cdot (k-1)$ vectors are generated. We stack them together as the preliminary relation instance representation $T \in \mathbb{R}^{k(k-1) \times d}$.
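The three preliminary representations can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the toy matrix `H`, the span layout, and the helper names are hypothetical, and the bilinear function is assumed to use one weight matrix per output dimension, which the text does not specify.

```python
from itertools import permutations

def max_pool(rows):
    """Column-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*rows)]

def mean_pool(rows):
    """Column-wise mean over a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def bilinear(x, W, y):
    """Assumed bilinear form: output[c] = x W[c] y^T, one matrix per output dim."""
    return [sum(x[a] * Wc[a][b] * y[b]
                for a in range(len(x)) for b in range(len(y)))
            for Wc in W]

# Toy encoder output H: n = 4 words, 2h = 2 features per word (made-up values).
H = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 1.0]]

# Sentence spans as (start, end) word indices, end inclusive (hypothetical layout).
sentences = [(0, 1), (2, 3)]
sent_repr = [max_pool(H[a:b + 1]) for a, b in sentences]

# Each entity lists its mention spans. A mention vector is the mean of its word
# vectors; an entity vector is the mean of its mention vectors.
entities = [[(0, 0), (2, 2)], [(1, 1)], [(3, 3)]]
ent_repr = [mean_pool([mean_pool(H[a:b + 1]) for a, b in mentions])
            for mentions in entities]

# One bilinear output per ordered entity pair (p != q): k * (k - 1) rows of size d.
W = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # d = 2 matrices, 2h x 2h
pairs = list(permutations(range(len(entities)), 2))
T = [bilinear(ent_repr[p], W, ent_repr[q]) for p, q in pairs]
```

With k = 3 entities the sketch yields 6 ordered pairs, matching the k · (k − 1) count of potential relation instances.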

Graph-Enhanced Bi-directional Attention
The graph-enhanced bi-directional attention layer models the complex interactions between sentences and relation instances, generating refined representations of relation instances by synthesizing both intra-sentence and inter-sentence information. This component consists of the S2R layer, the GCN layer, and the R2S layer.

S2R Layer
The S2R layer outputs the relation-oriented representation of each sentence, where the query vector is the preliminary sentence representation $l_i$ and the key vector $v_j$ is the j-th row of $T$. The weight of the attention paid by sentence $i$ to relation instance $j$ is denoted as $\alpha_{ij}$ and computed as follows:
$$\alpha_{ij} = \frac{\exp(w_{ij})}{\sum_{j'=1}^{k(k-1)} \exp(w_{ij'})}, \tag{1}$$
where $w_{ij} = l_i \cdot W_1 \cdot v_j^{\top}$ and $l_i \in \mathbb{R}^{1 \times 2h}$. We take the attention-weighted combination of the rows of $T$ as the relation-oriented representation $l_i'$ of sentence $i$, where $l_i' = \sum_{j=1}^{k(k-1)} \alpha_{ij} \cdot v_j$. Via the S2R layer, we obtain an attention weight matrix $W_{S2R} \in \mathbb{R}^{m \times k(k-1)}$.
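As a rough illustration of the S2R step, the sketch below scores each sentence against each relation instance with a bilinear form, normalizes the scores with a softmax over relation instances (the usual choice for attention weights, assumed here), and returns both the weight matrix and the weighted sums. The function names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, W):
    """Bilinear-score attention.

    weights[i][j] = softmax over j of (queries[i] W keys[j]^T)
    outputs[i]    = sum over j of weights[i][j] * keys[j]
    """
    weights, outputs = [], []
    for q in queries:
        # Project the query once: qW has the same length as a key vector.
        qW = [sum(q[a] * W[a][b] for a in range(len(q))) for b in range(len(W[0]))]
        scores = [sum(x * y for x, y in zip(qW, key)) for key in keys]
        alpha = softmax(scores)
        weights.append(alpha)
        outputs.append([sum(a * key[c] for a, key in zip(alpha, keys))
                        for c in range(len(keys[0]))])
    return weights, outputs

# Toy sizes: m = 2 sentences (2h = 2), k(k-1) = 2 relation instances (d = 2).
L_sent = [[1.0, 0.0], [0.0, 0.0]]
T = [[2.0, 0.0], [0.0, 2.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]  # 2h x d, identity here for readability

W_S2R, L_prime = attend(L_sent, T, W1)
```

The first sentence scores higher against the first relation instance and therefore puts most of its weight there, while the all-zero second sentence attends uniformly. The same routine, with relation instances as queries and sentences as keys, would give the R2S direction.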

GCN Layer
In GEDA, we build a heterogeneous GCN with two types of nodes: entity nodes and sentence nodes. There are three types of edges: 1) sentence-sentence edges, which link two sentence nodes if the two sentences contain the same entity; 2) entity-entity edges, which link two entity nodes if the two entities co-occur in a sentence; 3) entity-sentence edges, which link an entity node and a sentence node if the entity appears in the sentence.
Since the entity representation $e_i$ has a different dimension from the relation-oriented sentence representation $l_j'$ produced by the S2R layer, a matrix $W_2 \in \mathbb{R}^{2h \times d}$ is used to transform $e_i$ into $e_i' \in \mathbb{R}^{1 \times d}$. The feature matrix of the GCN is then $X = [e_1'; e_2'; \ldots; e_k'; l_1'; l_2'; \ldots; l_m']$, where $X \in \mathbb{R}^{(k+m) \times d}$. For the adjacency matrix $A \in \mathbb{R}^{(k+m) \times (k+m)}$, we set the diagonal elements to 1 to add self-loops, and set the weight of an off-diagonal entry to 1 if an edge exists between the two nodes and to 0 otherwise. For a one-layer GCN, we obtain the new node feature matrix $L \in \mathbb{R}^{(k+m) \times s}$ by the following equation:
$$L = \rho(\hat{A} X W_3),$$
where $\hat{A}$ is the normalized symmetric adjacency matrix, $W_3 \in \mathbb{R}^{d \times s}$ is a weight matrix, and $\rho$ is an activation function. The output of the GCN can be interpreted as two parts: 1) the refined entity representations, given by the first k rows of $L$ and denoted as $(\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_k)$; 2) the refined sentence representations, given by the (k + 1)-th to the (k + m)-th rows of $L$ and denoted as $(\hat{l}_1, \hat{l}_2, \ldots, \hat{l}_m)$. The bilinear function is then applied to the refined entity representations to obtain the refined relation instance representation $\hat{T}$.
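The graph construction and one propagation step can be sketched as follows. This is a minimal illustration under stated assumptions: the normalization is taken to be the common $D^{-1/2} A D^{-1/2}$ form, the activation is taken to be ReLU, and the helper names and toy inputs are hypothetical.

```python
import math

def build_adjacency(k, sent_entities):
    """Adjacency over k entity nodes followed by m sentence nodes.

    sent_entities[s] is the set of entity ids mentioned in sentence s.
    """
    m = len(sent_entities)
    size = k + m
    A = [[0.0] * size for _ in range(size)]
    for i in range(size):
        A[i][i] = 1.0                              # self-loops
    for s, ents in enumerate(sent_entities):
        for e in ents:
            A[e][k + s] = A[k + s][e] = 1.0        # entity-sentence edge
            for f in ents:
                if e != f:
                    A[e][f] = 1.0                  # entity-entity edge (co-occurrence)
        for t, other in enumerate(sent_entities):
            if s != t and ents & other:
                A[k + s][k + t] = 1.0              # sentence-sentence edge (shared entity)
    return A

def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (assumed form)."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def gcn_layer(A_hat, X, W):
    """One propagation step A_hat X W followed by ReLU (activation is an assumption)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, X), W)]

# k = 2 entities, m = 2 sentences: sentence 0 mentions both entities,
# sentence 1 mentions only entity 1.
k = 2
A = build_adjacency(k, [{0, 1}, {1}])
A_hat = normalize(A)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]  # (k + m) x d
W3 = [[1.0, 0.0], [0.0, 1.0]]                          # d x s
L_out = gcn_layer(A_hat, X, W3)
ent_refined, sent_refined = L_out[:k], L_out[k:]
```

Placing entity nodes before sentence nodes mirrors the row order of $X$, so slicing the output at row k separates the refined entity and sentence representations.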

R2S Layer
Similar to the S2R layer, the R2S layer is used to obtain the sentence-oriented representation of relation instances, with two differences: 1) the query vector is $\hat{v}_i$, the i-th row of $\hat{T}$, which represents a relation instance, and 2) the key vector is the refined representation $\hat{l}_j$ of each sentence. Finally, the R2S layer outputs the sentence-oriented representation matrix $\tilde{T}$ for all the potential relation instances, in which the i-th row $\tilde{v}_i$ corresponds to the i-th relation instance. We also obtain a weight matrix $W_{R2S} \in \mathbb{R}^{k(k-1) \times m}$.

Regularizer of Attention Duality
The attention paid by a sentence to a relation instance is generally consistent with the attention in the opposite direction paid by the relation instance to the sentence, which means there exists a natural duality between the two weight matrices $W_{S2R}$ and $W_{R2S}$. We leverage this duality to design a simple regularization term that introduces this useful inductive bias. Since $W_{S2R} \in \mathbb{R}^{m \times k(k-1)}$ and $W_{R2S} \in \mathbb{R}^{k(k-1) \times m}$, one of the two matrices is transposed before they are compared. The regularizer is
$$\mathcal{L}_{dual} = \| W_{S2R} - W_{R2S}^{\top} \|_2,$$
where $\| \cdot \|_2$ is the L2 norm.
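A small sketch of such a duality penalty is given below, assuming it takes the form of the L2 norm of $W_{S2R} - W_{R2S}^{\top}$, where the transpose is needed to align the two shapes. The function name and the toy weight matrices are hypothetical.

```python
import math

def duality_penalty(W_s2r, W_r2s):
    """L2 (Frobenius) norm of W_s2r - W_r2s^T.

    W_s2r has shape m x R (sentence -> relation instance weights).
    W_r2s has shape R x m (relation instance -> sentence weights).
    """
    m, R = len(W_s2r), len(W_r2s)
    assert all(len(row) == R for row in W_s2r), "W_s2r must be m x R"
    assert all(len(row) == m for row in W_r2s), "W_r2s must be R x m"
    squared = sum((W_s2r[i][j] - W_r2s[j][i]) ** 2
                  for i in range(m) for j in range(R))
    return math.sqrt(squared)

# m = 2 sentences, R = 3 relation instances (toy weights, each row sums to 1).
W_s2r = [[0.5, 0.5, 0.0],
         [0.0, 0.0, 1.0]]
W_r2s_consistent = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # agrees with W_s2r
W_r2s_opposed = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]     # points the other way

low = duality_penalty(W_s2r, W_r2s_consistent)
high = duality_penalty(W_s2r, W_r2s_opposed)
```

Because each matrix is normalized along a different axis, the penalty is generally not zero even for well-aligned attention. It only pushes the two directions toward agreement.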

Evidence Supervision
Supporting evidence information identifies which sentences contribute to a specific relation instance. We can transform this information into a real-valued vector. For example, given a document that has m sentences, if the first two sentences are the supporting evidence for the i-th relation instance, then the evidence vector is $c_i = [\frac{1}{2}, \frac{1}{2}, 0, \ldots, 0] \in \mathbb{R}^{1 \times m}$. If a given relation instance cannot be assigned to any relation type, the values in $c_i$ are all set to $1/m$.
Note that the i-th row of $W_{R2S}$, termed $w_i$, contains the attention weights that the i-th relation instance pays to all sentences. Intuitively, $w_i$ should be close to the evidence vector so that the model focuses on the most relevant sentences. Thus, we use the Kullback-Leibler divergence (Kullback and Leibler, 1951) to measure the difference between the distributions $c_i$ and $w_i$ as an extra loss. The loss over all potential relation instances in a document is as follows:
$$\mathcal{L}_{evi} = \sum_{i=1}^{k(k-1)} D_{KL}(c_i \,\|\, w_i),$$
where $D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.
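The evidence vector and the KL term can be sketched in a few lines. The sketch assumes that evidence mass is spread uniformly over the evidence sentences, which is consistent with the uniform $1/m$ fallback described above. The function names, the `eps` guard against zero weights, and the toy attention rows are hypothetical.

```python
import math

def evidence_vector(m, evidence_ids):
    """Uniform mass over evidence sentences; uniform over all m if there are none."""
    if not evidence_ids:
        return [1.0 / m] * m
    share = 1.0 / len(evidence_ids)
    return [share if s in evidence_ids else 0.0 for s in range(m)]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / max(qx, eps)) for px, qx in zip(p, q) if px > 0.0)

m = 4
c = evidence_vector(m, {0, 1})        # first two sentences are the evidence
focused = [0.45, 0.45, 0.05, 0.05]    # R2S weights close to the evidence
diffuse = [0.25, 0.25, 0.25, 0.25]    # R2S weights spread evenly

loss_focused = kl_divergence(c, focused)
loss_diffuse = kl_divergence(c, diffuse)
```

Attention that concentrates on the evidence sentences yields a smaller divergence than uniform attention, which is the behavior the extra loss rewards.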

Classification Layer
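Since the relation prediction in our scenario is a multi-label problem, we use $\tilde{v}_i$, the i-th row of $\tilde{T}$, to predict whether the i-th relation instance has the relation type $r$:
$$P(r \mid \tilde{v}_i) = \sigma(W_4 \tilde{v}_i + b_4)_r,$$
where $\sigma$ is the sigmoid function and $W_4$ and $b_4$ are trainable parameters. Finally, for a given document containing m sentences, k different entities and t pre-defined relation types, the loss function is defined as:
$$\mathcal{L} = -\sum_{i=1}^{k(k-1)} \sum_{r=1}^{t} \left[ y_i^r \log P(r \mid \tilde{v}_i) + (1 - y_i^r) \log\left(1 - P(r \mid \tilde{v}_i)\right) \right] + \alpha \mathcal{L}_{dual} + \beta \mathcal{L}_{evi} + \lambda \|\theta\|_2,$$
where $y_i^r$ is the binary label of the i-th relation instance for the relation type $r$, $\mathcal{L}_{dual}$ is the attention-duality regularizer, $\mathcal{L}_{evi}$ is the evidence-supervision loss, $\theta$ denotes the parameters to be regularized, $\|\theta\|_2$ is the L2 regularization term, and $\alpha$, $\beta$, and $\lambda$ are the coefficients.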
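The multi-label prediction and the combined objective can be sketched as below. Two things are assumptions rather than statements from the text: the classification term is taken to be a binary cross-entropy summed over relation types, and the coefficients are paired with the terms in the order duality, evidence, L2. The default value 1e-3 follows the experiment settings. All names and toy parameters are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(v, W4, b4):
    """Per-relation-type probabilities for one relation-instance vector v."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(W4, b4)]

def bce(probs, labels, eps=1e-12):
    """Binary cross-entropy summed over relation types (assumed form)."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
                for p, y in zip(probs, labels))

def total_loss(cls_loss, dual, evidence, l2, alpha=1e-3, beta=1e-3, lam=1e-3):
    """Classification loss plus weighted duality, evidence, and L2 terms.

    The pairing of alpha/beta/lam with the three terms is an assumption.
    """
    return cls_loss + alpha * dual + beta * evidence + lam * l2

# One relation instance, t = 2 relation types, feature size 2 (toy parameters).
v = [1.0, -1.0]
W4 = [[2.0, 0.0], [0.0, 2.0]]
b4 = [0.0, 0.0]
probs = predict(v, W4, b4)      # type 0 is likely, type 1 is unlikely
good = bce(probs, [1, 0])       # labels that agree with the prediction
bad = bce(probs, [0, 1])        # labels that contradict it
loss = total_loss(good, dual=0.7, evidence=0.1, l2=4.0)
```

Using an independent sigmoid per relation type, rather than a softmax across types, is what allows one entity pair to receive several relation labels at once.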

Dataset and Evaluation Metrics
The dataset we use is DocRED (Yao et al., 2019), a large-scale document-level RE dataset. DocRED has 3053 training documents, 1000 development documents and 1000 test documents, with 96 relation types. Since an entity pair may be assigned one or more relations in DocRED, we formulate RE as a multi-label classification problem in the experiments.
Following prior work (Yao et al., 2019), we use F1 and IgnF1 as the evaluation metrics, where IgnF1 is calculated after removing the entity pairs that have appeared in the training set. Besides, to evaluate inter-sentence reasoning capabilities, we split the development set into two parts based on whether the two entities of a pair appear in the same sentence. F1 scores on both splits are reported, named intra-F1 and inter-F1 respectively.
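The metrics can be illustrated with a simplified sketch over sets of (head, tail, relation) facts. It is not the official DocRED scorer: here IgnF1 simply drops facts seen in training from both predictions and gold before scoring, and predictions are split into intra- and inter-sentence groups in the same way as gold facts. The entity names, relation labels, and helper names are hypothetical.

```python
def f1_score(predicted, gold):
    """Micro F1 over sets of (head, tail, relation) facts."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def ign_f1(predicted, gold, seen_in_train):
    """F1 after dropping facts already seen in training (simplified)."""
    return f1_score(predicted - seen_in_train, gold - seen_in_train)

def split_by_scope(facts, sentences_of):
    """Split facts by whether head and tail share at least one sentence."""
    intra = {f for f in facts if sentences_of[f[0]] & sentences_of[f[1]]}
    return intra, facts - intra

# Hypothetical document: entity -> indices of the sentences that mention it.
sentences_of = {"A": {0}, "B": {0, 1}, "C": {2}}
gold = {("A", "B", "r1"), ("A", "C", "r2"), ("B", "C", "r3")}
pred = {("A", "B", "r1"), ("A", "C", "r2"), ("B", "A", "r1")}

gold_intra, gold_inter = split_by_scope(gold, sentences_of)
pred_intra, pred_inter = split_by_scope(pred, sentences_of)

overall = f1_score(pred, gold)
intra_f1 = f1_score(pred_intra, gold_intra)
inter_f1 = f1_score(pred_inter, gold_inter)
ignored = ign_f1(pred, gold, seen_in_train={("A", "B", "r1")})
```

In this toy case the score drops once the fact shared with training is removed, which shows why IgnF1 is the stricter measure of generalization to unseen facts.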

Baseline Models
In this paper, we compare GEDA with three types of models: 1) vanilla models, including CNN (Zeng et al., 2014), BiLSTM (Cai et al., 2016) and Context-Aware (Sorokin and Gurevych, 2017); 2) graph-based models, such as Graph LSTM (Peng et al., 2017), GCNN (Sahu et al., 2019) and EoG; 3) BERT-based models, including the BERT model and the BERT-Two-Step model (Wang et al., 2019). For comparison with BERT-based models, we build a variant of our model named BERT-GEDA, which uses Uncased BERT-Base (Devlin et al., 2019) as the encoder instead of the word embedding layer.

Experiment Settings
Most of the experiment settings are the same as in Yao et al. (2019). Specifically, 1) words are initialized with 100-dimensional GloVe embeddings (Pennington et al., 2014), which are fixed during training; 2) the dimensions of the entity order embedding and the entity type embedding are both 20; 3) the optimizer is Adam (Kingma and Ba, 2015) with a learning rate of 0.001; 4) the hidden size of the LSTM is 128; 5) all coefficients are 1e-3; 6) for the BERT-based GEDA model (named BERT-GEDA), we use a transformation layer to project the BERT embedding of each word into a low-dimensional space of size 100, the same as the word embedding, and the learning rate of the BERT-base model is $10^{-5}$; 7) the batch size is 20.
Table 1: [...] (Yao et al., 2019) or (Wang et al., 2019); the other results are reproduced by ourselves. We also present the intra-F1 and inter-F1 for further analysis. Significance tests are conducted to test the robustness of the approaches.

Main Results
The experimental results for all models are shown in Table 1, from which we can observe that:
1. GEDA (Ours) and BERT-GEDA (Ours) outperform the other models significantly, showing the effectiveness of our graph enhanced dual attention network. For example, compared with the highest score among all previous non-BERT models, GEDA (Ours) improves F1 by 2.15% and IgnF1 by 2.92% on the development set, and F1 by 1.72% and IgnF1 by 2.74% on the test set. Note that BERT-based methods leverage external knowledge from large-scale corpora and greatly improve RE. Our proposed BERT-GEDA (Ours) enhances BERT-based methods with reasoning capacities, and thus gains a further significant improvement over them. This observation verifies that GEDA can bring consistent and robust improvement, even over very competitive baseline models.
2. Context-Aware performs similarly to BiLSTM, even though it adds an attention mechanism on top of BiLSTM. Compared with Context-Aware, however, GEDA (Ours) improves the F1 score by over 2.5%. This indicates that document-level RE requires a more sophisticated attention mechanism to handle the complex interaction between sentences and relation instances, and that our technique is more suitable for document-level RE. We attribute this superiority to the graph-enhancing operation and the effective regularizers built on a classic bi-directional attention mechanism.
3. GNN-based methods achieve better performance than vanilla methods, especially on Inter-F1, which shows that a GNN is capable of characterizing the document-level context and thus learns the latent patterns of relations across sentences better.

Analysis of Inter-Sentence Reasoning
To investigate the reasoning ability of different methods, we report their performance on both intra-sentence and inter-sentence instances on the development set, as the last two columns of Table 1 show.
Taking non-BERT models as an example, GEDA (Ours) outperforms other models in both scenarios, and the relative improvement in inter-sentence cases is larger than in intra-sentence cases (3.20% in inter-F1 vs. 1.08% in intra-F1). The experimental results of BERT-based models follow a similar trend. The results verify that modeling the complex interaction between sentences and relation instances with our graph-enhanced dual attention can enrich inter-sentence reasoning skills. Besides, the improvements in intra-sentence cases are also notable, which reveals that contextual information is also useful for identifying intra-sentence relational facts and that GEDA (Ours) captures it well.
To evaluate the effect of the three core components, we test three GEDA variants by removing attention duality, evidence supervision, and GCN, respectively. Due to space limitations, we only report the results of F1, IgnF1, and Inter-F1 on the development set. The results in Table 2 show that all three components yield significant improvements, verifying the effectiveness of the three novel designs built on the classic bi-directional attention mechanism. Meanwhile, the performance drops of all GEDA variants on IgnF1 and inter-F1 are much more notable than those on F1. This observation demonstrates that all three components not only improve inter-sentence reasoning but also give GEDA a better generalization ability to predict unseen potential relation instances more accurately.

Analysis of Attention Weights
We further investigate the impact of the two regularizers on attention weights. For each relation instance, we calculate the sum of the R2S attention weights on the relevant sentences (supporting evidence), and then average this value over all the relation instances. As Table 3 shows, the average weight of GEDA (Ours) is larger than that of the other two scenarios. The results indicate that both regularizers help GEDA pay more attention to the relevant sentences.
To promote understanding of the neural predictions, we use a document with 8 sentences from DocRED for a case study, as shown in Fig. 2(a). There are 3 relevant sentences for the entity pair {United States, Colorado}. Due to its lack of reasoning ability over multiple sentences, BiLSTM mislabels the instance, while GEDA can perform inter-sentence reasoning and thus predicts the correct relation for the entity pair. Meanwhile, we visualize the R2S attention weights generated by GEDA. As shown in Fig. 2(b), the "Evidence" row shows the ground-truth attention weights for sentences within the document. B-weight is generated by bi-directional attention, B-C-weight is generated by bi-directional attention with the duality constraint, and B-C-S-weight is generated by further adding supporting evidence supervision. By incorporating the duality constraint and supervised attention, the correct sentences are assigned more weight. The visualization results not only verify the effectiveness of the regularizers, but also reveal the interpretability of our proposed GEDA.

Related Work
For document-level RE, Peng et al. (2017) used graph-LSTM networks for cross-sentence n-ary RE, and Gupta et al. (2019) modeled inter-sentential dependency through a similar graph. Nguyen and Verspoor (2018) applied CNN to cross-sentence RE with improved character encoding. Verga et al. (2018) used a modified Transformer (Vaswani et al., 2017) with CNN, followed by bi-affine pairwise scoring, and applied distant supervision and multi-task learning. Jia et al. (2019) addressed document-level n-ary RE with multi-scale representation learning, which aims to learn the representation of entity tuples at both the mention level and the entity level. Sahu et al. (2019) used GCN as the encoder together with a multi-instance based classifier. Other work applied GCN to a graph consisting of different types of nodes and edges, aiming to infer those representations from other edges. Unlike previous work, we focus on inter-sentence reasoning and model the interaction between relation instances and sentences.

Conclusion
We have introduced our neural architecture GEDA for document-level relation extraction. The intuition behind GEDA is to characterize the interaction between sentences and relation instances better to improve inter-sentence reasoning over the whole document. The novelty of GEDA mainly lies in the graph-based refinement of sentence representation and two simple yet effective regularizers based on attention duality and supporting evidence respectively. Experiments verified the superiority of the graph-enhanced dual attention mechanism, especially for the inter-sentence relation extraction. In the future, we will investigate more sophisticated reasoning techniques targeting more specific scenarios of inter-sentence relation extraction, e.g., involving common-sense reasoning.