Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network

Inter-sentence relation extraction deals with a number of complex semantic relationships in documents, which require local, non-local, syntactic and semantic dependencies. Existing methods do not fully exploit such dependencies. We present a novel inter-sentence relation extraction model that builds a labelled edge graph convolutional neural network model on a document-level graph. The graph is constructed using various inter- and intra-sentence dependencies to capture local and non-local dependency information. In order to predict the relation of an entity pair, we utilise multi-instance learning with bi-affine pairwise scoring. Experimental results show that our model achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. Our analysis shows that all the edge types in the graph are effective for inter-sentence relation extraction.


Introduction
Semantic relationships between named entities often span multiple sentences. In order to extract inter-sentence relations, most approaches utilise distant supervision to automatically generate document-level corpora (Peng et al., 2017; Song et al., 2018). Recently, Verga et al. (2018) introduced multi-instance learning (MIL) (Riedel et al., 2010; Surdeanu et al., 2012) to treat multiple mentions of target entities in a document.
Inter-sentential relations depend not only on local but also on non-local dependencies. Dependency trees are often used to extract local dependencies of semantic relations (Culotta and Sorensen, 2004; Liu et al., 2015) in intra-sentence relation extraction (RE). However, such dependencies are not adequate for inter-sentence RE, since different sentences have different dependency trees. Figure 1 illustrates such a case between Oxytocin and hypotension. To capture their relation, it is essential to connect the co-referring entities Oxytocin and Oxt. RNNs and CNNs, which are often used for intra-sentence RE (Zeng et al., 2014; dos Santos et al., 2015; Zhou et al., 2016b; Lin et al., 2016), are not effective on longer sequences (Sahu and Anand, 2018), thus failing to capture such non-local dependencies.
Figure 1: Sentences with non-local dependencies between named entities. The red arrow represents a relation between co-referred entities and yellow arrows represent semantically dependent relations. Example adapted from the CDR dataset (Wei et al., 2015).
We propose a novel inter-sentence RE model that builds a labelled edge Graph CNN (GCNN) model (Marcheggiani and Titov, 2017) on a document-level graph. The graph nodes correspond to words and edges represent local and non-local dependencies among them. The document-level graph is formed by connecting words with local dependencies from syntactic parsing and sequential information, as well as non-local dependencies from coreference resolution and other semantic dependencies (Peng et al., 2017). We infer relations between entities using an MIL-based bi-affine pairwise scoring function (Verga et al., 2018) on the entity node representations.
Our contribution is threefold. Firstly, we propose a novel model for inter-sentence RE using GCNN to capture local and non-local dependencies. Secondly, we apply the model to two biochemistry corpora and show its effectiveness. Finally, we develop a novel, distantly supervised dataset with chemical reactant-product relations from PubMed abstracts. 1
Figure 2: Proposed model architecture. The input word sequence is mapped to a graph structure, where nodes are words and edges correspond to dependencies. We omit several edges, such as self-node edges of all words and syntactic dependency edges of different labels, for brevity. GCNN is employed to encode the graph and a bi-affine layer aggregates all mention pairs.

Proposed Model
We formulate the inter-sentence, document-level RE task as a classification problem. Let $[w_1, w_2, \dots, w_n]$ be the words in a document $t$ and $e_1$ and $e_2$ be the entity pair of interest in $t$. We name the multiple occurrences of these entities in the document entity mentions. A relation extraction model takes a triple $(e_1, e_2, t)$ as input and returns a relation for the pair, including the "no relation" category, as output. We assume that the relationship of the target entities in $t$ can be inferred based on all their mentions. We thus apply multi-instance learning on $t$ to combine all mention-level pairs and predict the final relation category of a target pair.
We describe the architecture of our proposed model in Figure 2. The model takes as input the entire abstract of a scientific article and two target entities with all their mentions in the input layer. It then constructs a graph structure with words as nodes and labelled edges that correspond to local and non-local dependencies. Next, it encodes the graph structure using a stacked GCNN layer and classifies the relation between the target entities by applying MIL (Verga et al., 2018) to aggregate all mention pair representations.
1 The dataset is publicly available at http://nactem.ac.uk/CHR/.

Input Layer
In the input layer, we map each word $i$ and its relative positions to the first and second target entities into real-valued vectors $w_i$, $d_i^1$ and $d_i^2$, respectively. As entities can have more than one mention, we calculate the relative position of a word from the closest target entity mention. For each word $i$, we concatenate the word and position representations into an input representation, $x_i = [w_i; d_i^1; d_i^2]$.
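The position features can be illustrated with a minimal sketch. It assumes token-level mention spans with exclusive end offsets and a signed distance (negative before a mention, positive after); the text does not state whether distances are signed or clipped, and the function names are hypothetical.

```python
def relative_positions(num_words, mention_spans):
    """Distance from every word to the closest mention of one entity.

    mention_spans: list of (start, end) token offsets, end exclusive.
    Signed distances are an assumption of this sketch.
    """
    positions = []
    for i in range(num_words):
        best = None
        for start, end in mention_spans:
            if i < start:
                dist = i - start        # word precedes the mention (negative)
            elif i >= end:
                dist = i - (end - 1)    # word follows the mention (positive)
            else:
                dist = 0                # word is part of the mention
            if best is None or abs(dist) < abs(best):
                best = dist
        positions.append(best)
    return positions


def input_representation(word_vecs, pos1_vecs, pos2_vecs):
    """Concatenate [w_i; d1_i; d2_i] for every word (vectors as Python lists)."""
    return [w + d1 + d2 for w, d1, d2 in zip(word_vecs, pos1_vecs, pos2_vecs)]


# Toy document of 8 tokens; the entity is mentioned at tokens 1 and 6.
print(relative_positions(8, [(1, 2), (6, 7)]))  # [-1, 0, 1, 2, -2, -1, 0, 1]
```

In a full model, each distance would index a learned position-embedding table before concatenation; that lookup is omitted here.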

Graph Construction
In order to build a document-level graph for an entire abstract, we use the following categories of inter- and intra-sentence dependency edges, as shown with different colours in Figure 2.
Syntactic dependency edge: The syntactic structure of a sentence reveals helpful clues for intra-sentential RE (Miwa and Bansal, 2016). We thus use labelled syntactic dependency edges between the words of each sentence, by treating each syntactic dependency label as a different edge type.
Coreference edge: As coreference is an important indicator of local and non-local dependencies (Ma et al., 2016), we connect co-referring phrases in a document using coreference type edges.
Adjacent sentence edge: We connect the syntactic root of a sentence with the roots of the previous and next sentences with adjacent sentence type edges (Peng et al., 2017) for non-local dependencies between neighbouring sentences.
Adjacent word edge: In order to keep sequential information among the words of a sentence, we connect each word with its previous and next words with adjacent word type edges.
Self-node edge: GCNN learns a node representation based solely on its neighbour nodes and their edge types. Hence, to include the node information itself into the representation, we form self-node type edges on all the nodes of the graph.
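The five edge categories can be sketched as a single pass over parsed sentences. This is an illustrative sketch, not the authors' implementation: the input layout (the `tokens`, `root` and `deps` keys), the label strings, and the choice to link consecutive members of a coreference cluster at token level are all assumptions.

```python
def build_document_graph(sentences, coref_clusters):
    """Return labelled edges (source, target, label) over document-level token ids.

    sentences: list of dicts with
        "tokens": token ids of the sentence, in order
        "root":   token id of the syntactic root
        "deps":   list of (head_id, dependent_id, dependency_label)
    coref_clusters: list of lists of co-referring token ids
    """
    edges = []
    for sent in sentences:
        # Syntactic dependency edges: one edge type per dependency label.
        for head, dep, label in sent["deps"]:
            edges.append((head, dep, "dep:" + label))
        # Adjacent word edges keep sequential information within a sentence.
        toks = sent["tokens"]
        for left, right in zip(toks, toks[1:]):
            edges.append((left, right, "adj_word"))
        # Self-node edges let each node contribute to its own representation.
        for tok in toks:
            edges.append((tok, tok, "self"))
    # Adjacent sentence edges connect the roots of neighbouring sentences.
    for prev, nxt in zip(sentences, sentences[1:]):
        edges.append((prev["root"], nxt["root"], "adj_sent"))
    # Coreference edges connect consecutive members of each cluster.
    for cluster in coref_clusters:
        for a, b in zip(cluster, cluster[1:]):
            edges.append((a, b, "coref"))
    return edges


# Toy two-sentence document; dependency labels are illustrative only.
toy_sentences = [
    {"tokens": [0, 1, 2], "root": 1, "deps": [(1, 0, "arg1"), (1, 2, "arg2")]},
    {"tokens": [3, 4], "root": 4, "deps": [(4, 3, "arg1")]},
]
toy_edges = build_document_graph(toy_sentences, coref_clusters=[[0, 3]])
print(len(toy_edges))  # 13
```

Edges are stored once per pair here; reverse-direction edges can be added afterwards if direction-specific parameters are wanted.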

GCNN Layer
We compute the representation of each input word $i$ by applying GCNN (Kipf and Welling, 2017; Defferrard et al., 2016) on the constructed document graph. GCNN is an advanced version of CNN for graph encoding that learns semantic representations for the graph nodes, while preserving its structural information. In order to learn edge type-specific representations, we use a labelled edge GCNN, which keeps separate parameters for each edge type (Vashishth et al., 2018). The GCNN iteratively updates the representation of each input word $i$ as follows:

$$x_i^{k+1} = f\Big(\sum_{u \in \nu(i)} \big(W^{k}_{l(i,u)}\, x_u^{k} + b^{k}_{l(i,u)}\big)\Big),$$

where $x_i^{k+1}$ is the $i$-th word representation resulting from the $k$-th GCNN block, $f$ is a non-linear activation function, $\nu(i)$ is the set of neighbouring nodes of $i$, and $W^{k}_{l(i,u)}$ and $b^{k}_{l(i,u)}$ are the parameters of the $k$-th block for edge type $l$ between nodes $i$ and $u$. We stack $K$ GCNN blocks to accumulate information from distant neighbouring nodes and use edge-wise gating to control information from neighbouring nodes.
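A labelled-edge graph convolution block can be sketched in plain Python as follows. The sketch assumes a ReLU non-linearity and leaves out edge-wise gating, so it is a simplified illustration of the per-edge-type parameterisation rather than the full layer; `matvec`, `gcnn_block` and `encode` are hypothetical names.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gcnn_block(node_vecs, edges, params):
    """One block: each node sums edge-type-specific messages from its neighbours.

    edges:  (source, target, label) triples; information flows source -> target
    params: dict mapping an edge label to its own (W, b) pair
    """
    dim_out = len(next(iter(params.values()))[1])
    summed = [[0.0] * dim_out for _ in node_vecs]
    for src, tgt, label in edges:
        W, b = params[label]
        message = [m + bias for m, bias in zip(matvec(W, node_vecs[src]), b)]
        summed[tgt] = [s + m for s, m in zip(summed[tgt], message)]
    # ReLU is an assumption; edge-wise gating is omitted for brevity.
    return [[max(0.0, v) for v in row] for row in summed]


def encode(node_vecs, edges, blocks):
    """Stack several blocks so that information reaches more distant nodes."""
    for params in blocks:
        node_vecs = gcnn_block(node_vecs, edges, params)
    return node_vecs


# Toy graph: two nodes with self edges and one dependency edge 0 -> 1.
identity, zero = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
toy_params = {"self": (identity, zero), "dep": (identity, zero)}
toy_nodes = [[1.0, 0.0], [0.0, 2.0]]
toy_graph = [(0, 0, "self"), (1, 1, "self"), (0, 1, "dep")]
print(encode(toy_nodes, toy_graph, [toy_params, toy_params]))  # [[1.0, 0.0], [2.0, 2.0]]
```

With two stacked blocks, node 1 receives the contribution of node 0 twice, which shows how depth widens the receptive field over the graph.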
Similar to Marcheggiani and Titov (2017), we maintain separate parameters for each edge direction. We, however, tune the number of model parameters by keeping separate parameters only for the top-N types and using the same parameters for all the remaining edge types, named "rare" type edges. This can avoid possible overfitting due to over-parameterisation for different edge types.
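The parameter-sharing scheme can be sketched as a frequency-based label mapping. The function names, the `"rare"` and `"_rev"` label strings, and the decision not to duplicate self edges in the reverse direction are assumptions made for illustration.

```python
from collections import Counter


def edge_type_mapper(training_edge_labels, top_n):
    """Keep the top_n most frequent labels; all other labels share the 'rare' type."""
    counts = Counter(training_edge_labels)
    frequent = {label for label, _ in counts.most_common(top_n)}

    def map_label(label):
        return label if label in frequent else "rare"

    return map_label


def add_directions(edges, map_label):
    """Map labels and add a reverse edge with its own label for each non-self edge."""
    directed = []
    for src, tgt, label in edges:
        label = map_label(label)
        directed.append((src, tgt, label))
        if src != tgt:
            directed.append((tgt, src, label + "_rev"))
    return directed


# Toy label frequencies gathered from a training set.
toy_labels = ["adj_word"] * 5 + ["self"] * 4 + ["dep:arg1"] * 3 + ["dep:arg2"]
toy_mapper = edge_type_mapper(toy_labels, top_n=3)
print(add_directions([(0, 1, "dep:arg2"), (1, 1, "self")], toy_mapper))
```

Each distinct label returned by this mapping would then own one parameter set in the GCNN, so the parameter count grows with N rather than with the full label inventory.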

MIL-based Relation Classification
Since each target entity can have multiple mentions in a document, we employ a multi-instance learning (MIL)-based classification scheme to aggregate the predictions of all target mention pairs using bi-affine pairwise scoring (Verga et al., 2018). As shown in Figure 2, each word i is firstly projected into two separate latent spaces using two-layered feed-forward neural networks (FFNN), which correspond to the first (head) or second (tail) argument of the target pair.
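
$$x_i^{\text{head}} = W^{(1)}_{\text{head}}\Big(\mathrm{ReLU}\big(W^{(0)}_{\text{head}}\, x_i^{K}\big)\Big), \qquad x_i^{\text{tail}} = W^{(1)}_{\text{tail}}\Big(\mathrm{ReLU}\big(W^{(0)}_{\text{tail}}\, x_i^{K}\big)\Big),$$

where $x_i^{K}$ corresponds to the representation of the $i$-th word after $K$ blocks of GCNN encoding, $W^{(0)}$ and $W^{(1)}$ are the parameters of the two FFNNs for head and tail, respectively, and $x_i^{\text{head}}, x_i^{\text{tail}} \in \mathbb{R}^{d}$ are the resulting head/tail representations for the $i$-th word.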
Then, mention-level pairwise confidence scores are generated by a bi-affine layer and aggregated to obtain the entity-level pairwise score.
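
$$\mathrm{score}(e^{\text{head}}, e^{\text{tail}}) = \log \sum_{i \in E^{\text{head}},\, j \in E^{\text{tail}}} \exp\big((x_i^{\text{head}} R)\, x_j^{\text{tail}}\big),$$

where $R \in \mathbb{R}^{d \times r \times d}$ is a learned bi-affine tensor with $r$ the number of relation categories, and $E^{\text{head}}$, $E^{\text{tail}}$ denote the sets of mentions for entities $e^{\text{head}}$ and $e^{\text{tail}}$, respectively.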
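The scoring step can be sketched in plain Python. The tensor is indexed as `R[a][r][b]` to match the $d \times r \times d$ shape given above; log-sum-exp is used as one smooth way to pool over mention pairs, and that pooling choice, like the function names, should be read as an assumption of the sketch.

```python
import math


def biaffine_scores(x_head, R, x_tail):
    """Score one mention pair for every relation r: sum_a sum_b x_head[a] * R[a][r][b] * x_tail[b]."""
    num_rel = len(R[0])
    return [
        sum(x_head[a] * R[a][r][b] * x_tail[b]
            for a in range(len(x_head)) for b in range(len(x_tail)))
        for r in range(num_rel)
    ]


def entity_pair_scores(head_mentions, tail_mentions, R):
    """Pool the scores of all head/tail mention pairs into one score per relation."""
    pair_scores = [biaffine_scores(h, R, t) for h in head_mentions for t in tail_mentions]
    pooled = []
    for r in range(len(pair_scores[0])):
        column = [scores[r] for scores in pair_scores]
        top = max(column)  # subtract the maximum for numerical stability
        pooled.append(top + math.log(sum(math.exp(v - top) for v in column)))
    return pooled


# Toy setting: d = 2, r = 2; relation 0 is a dot product, relation 1 is always zero.
toy_R = [[[1.0, 0.0], [0.0, 0.0]],
         [[0.0, 1.0], [0.0, 0.0]]]
toy_scores = entity_pair_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]], toy_R)
print(toy_scores)
```

The predicted relation for the entity pair is then the argmax (or a softmax) over the pooled scores, including the "no relation" category.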

Experimental Settings
We first briefly describe the datasets on which the proposed model is evaluated, along with their pre-processing. We then introduce the baseline models we use for comparison. Finally, we show the training settings.

Data Sets
We evaluated our model on two biochemistry datasets.

Chemical-Disease Relations dataset (CDR): The CDR dataset is a document-level, inter-sentence relation extraction dataset developed for the BioCreative V challenge (Wei et al., 2015).
CHemical Reactions dataset (CHR): We created a document-level dataset with relations between chemicals using distant supervision. Firstly, we used the back-end of the semantic faceted search engine Thalia 2 (Soto et al., 2018) to obtain abstracts annotated with several biomedical named entities from PubMed. We selected chemical compounds from the annotated entities and aligned them with the graph database Biochem4j (Swainston et al., 2017). Biochem4j is a freely available database that integrates several resources such as UniProt, KEGG and NCBI Taxonomy 3. If two chemical entities have a relation in Biochem4j, we consider them as positive instances in the dataset, otherwise as negative.
Table 1 shows the statistics for the CDR and CHR datasets. For both datasets, the annotated entities can have more than one associated Knowledge Base (KB) ID. If there is at least one common KB ID between mentions, we consider all these mentions to belong to the same entity. This technique results in fewer negative pairs. We ignored entities that were not grounded to a known KB ID and removed relations between the same entity (self-relations). For the CDR dataset, we performed hypernym filtering similar to Gu et al. (2017) and Verga et al. (2018). In the CHR dataset, both directions were generated for each candidate chemical pair, as chemicals can be either a reactant (first argument) or a product (second argument) in an interaction.
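The entity grouping and pair generation can be sketched as follows. The sketch assumes that sharing a KB ID is applied transitively (a mention with two IDs can bridge two existing groups), which is one reading of the rule described here; the function names and the toy IDs are hypothetical.

```python
from itertools import permutations


def group_mentions_by_entity(mentions):
    """Merge mentions that share at least one KB ID into the same entity.

    mentions: list of (mention_id, set_of_kb_ids)
    Returns a list of [kb_ids, mention_ids] groups.
    """
    groups = []
    for mention_id, kb_ids in mentions:
        if not kb_ids:
            continue  # mentions not grounded to a KB ID are ignored
        merged_ids, merged_mentions = set(kb_ids), [mention_id]
        remaining = []
        for ids, members in groups:
            if ids & merged_ids:
                merged_ids |= ids                      # absorb the overlapping group
                merged_mentions = members + merged_mentions
            else:
                remaining.append([ids, members])
        remaining.append([merged_ids, merged_mentions])
        groups = remaining
    return groups


def candidate_pairs(num_entities):
    """Ordered pairs of distinct entities: both directions, no self-relations."""
    return list(permutations(range(num_entities), 2))


toy_mentions = [("m1", {"A"}), ("m2", {"B"}), ("m3", {"A", "B"}), ("m4", {"C"}), ("m5", set())]
toy_groups = group_mentions_by_entity(toy_mentions)
print(len(toy_groups), candidate_pairs(len(toy_groups)))  # 2 [(0, 1), (1, 0)]
```

Generating ordered pairs matches the CHR setting, where direction matters; for an undirected or typed-argument setting such as chemical-disease pairs, only one ordering per pair would be kept.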

Data Pre-processing
We processed the datasets using the GENIA Sentence Splitter 4 and GENIA tagger (Tsuruoka et al., 2005) for sentence splitting and word tokenisation, respectively. Syntactic dependencies were obtained using the Enju syntactic parser (Miyao and Tsujii, 2008) with predicate-argument structures. Coreference type edges were constructed using the Stanford CoreNLP software (Manning et al., 2014).

Baseline Models
For the CDR dataset, we compare with five state-of-the-art models: SVM, an ensemble of feature-based and neural-based models (Zhou et al., 2016a), CNN and Maximum Entropy (Gu et al., 2017), Piece-wise CNN (Li et al., 2018) and Transformer (Verga et al., 2018). We additionally prepare and evaluate the following models: CNN-RE, a re-implementation from Kim (2014) and Zhou et al. (2016a), and RNN-RE, a re-implementation from Sahu and Anand (2018). In all models we use bi-affine pairwise scoring to detect relations.

Model Training
We used 100-dimensional word embeddings trained on PubMed with GloVe (Pennington et al., 2014; TH et al., 2015). Unlike Verga et al. (2018), we used the pre-trained word embeddings in place of sub-word embeddings to align with our word graphs. Due to the size of the CDR dataset, we merged the training and development sets to train the models, similarly to Xu et al. (2016a) and Gu et al. (2017). We report the performance as the average of five runs with different parameter initialisation seeds in terms of precision (P), recall (R) and F1-score. We used the frequencies of the edge types in the training set to choose the top-N edges in Section 2.3. We refer to the supplementary materials for the details of the training and hyper-parameter settings.
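The reporting protocol can be sketched as below. The sketch assumes that predictions and gold annotations are sets of (document, head entity, tail entity, relation) tuples and that P, R and F1 are each averaged across runs; both points are assumptions, since the exact evaluation script is not described here.

```python
def precision_recall_f1(predicted, gold):
    """Compute P, R and F1 from sets of predicted and gold relation tuples."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def average_runs(run_scores):
    """Average (P, R, F1) triples over runs with different initialisation seeds."""
    n = len(run_scores)
    return tuple(sum(scores[k] for scores in run_scores) / n for k in range(3))


# Toy example: three predictions, four gold relations, two of them shared.
toy_pred = {("d1", "c1", "x1", "rel"), ("d1", "c2", "x1", "rel"), ("d2", "c1", "x2", "rel")}
toy_gold = {("d1", "c1", "x1", "rel"), ("d1", "c2", "x1", "rel"),
            ("d2", "c3", "x2", "rel"), ("d3", "c1", "x1", "rel")}
print(precision_recall_f1(toy_pred, toy_gold))
```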

Results
We show the results of our model for the CDR and CHR datasets in Table 2. We report the performance of state-of-the-art models without any additional enhancements, such as joint training with NER, model ensembling and heuristic rules, to avoid any effects from the enhancements in the comparison. We observe that the GCNN outperforms the baseline models (CNN-RE/RNN-RE) on both datasets. However, on the CDR dataset, the performance of GCNN is 1.6 percentage points lower than the best performing system of Gu et al. (2017). In fact, Gu et al. (2017) incorporate two separate neural and feature-based models for intra- and inter-sentence pairs, respectively, whereas we utilise a single model for both pairs. Additionally, GCNN performs comparably to the second state-of-the-art neural model of Li et al. (2018), which requires a two-step process for mention aggregation, unlike our unified approach.
Figure 3 illustrates the performance of our model on the CDR development set when using a varying number of most frequent edge types N. While tuning N, we observed that the best performance was obtained for the top-4 edge types, but it slightly deteriorated with more. We chose the top-4 edge types in the other experiments.
We perform ablation analysis on the CDR dataset by separating the development set into intra- and inter-sentence pairs (approximately 70% and 30% of pairs, respectively). Table 3 shows the performance when removing one edge category at a time. In general, all dependency types have positive effects on inter-sentence RE and the overall performance, although self-node and adjacent sentence edges slightly harm the performance on intra-sentence relations. Additionally, coreference does not affect intra-sentence pairs.

Related Work
Inter-sentence RE is a recently introduced task. Peng et al. (2017) and Song et al. (2018) used graph-based LSTM networks for n-ary RE across multiple sentences for protein-drug-disease associations. They restricted the relation candidates to spans of up to two sentences. Verga et al. (2018) considered multi-instance learning for document-level RE. Our work differs from Verga et al. (2018) in that we replace the Transformer with a GCNN model for full-abstract encoding using non-local dependencies such as entity coreference.
GCNN was first proposed by Kipf and Welling (2017) and applied on citation networks and knowledge graph datasets. It was later used for semantic role labelling (Marcheggiani and Titov, 2017), multi-document summarization (Yasunaga et al., 2017) and temporal relation extraction (Vashishth et al., 2018). Zhang et al. (2018) used a GCNN on a dependency tree for intra-sentence RE. Unlike previous work, we introduce a GCNN on a document-level graph, with both intra- and inter-sentence dependencies, for inter-sentence RE.

Conclusion
We proposed a novel graph-based method for inter-sentence RE using a labelled edge GCNN model on a document-level graph. The graph is constructed with words as nodes and multiple intra- and inter-sentence dependencies between them as edges. A GCNN model is employed to encode the graph structure and MIL is incorporated to aggregate the multiple mention-level pairs. We show that our method achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. We tuned the number of labelled edges to maintain the number of parameters in the labelled edge GCNN. Analysis showed that all edge types are effective for inter-sentence RE.
Although the model is applied to biochemistry corpora for inter-sentence RE, our method is also applicable to other relation extraction tasks. As future work, we plan to incorporate joint named entity recognition training as well as sub-word embeddings in order to further improve the performance of the proposed model.

A Training and Hyper-parameter Settings
We implemented all models using TensorFlow 5. The development set was used for hyper-parameter tuning. For all models, parameters were optimised using the Adam optimisation algorithm with exponential moving average (Kingma and Ba, 2015), a learning rate of 0.0005, a learning rate decay of 0.75 and gradient clipping at 10. We used early stopping with patience equal to 5 epochs in order to determine the best training epoch. For other hyper-parameters, we performed a non-exhaustive hyper-parameter search based on the development set. We used the same hyper-parameters for both the CDR and CHR datasets. The best hyper-parameter values are shown in Table 4.
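The early-stopping rule can be sketched independently of any framework. The callbacks `train_epoch` and `evaluate` are hypothetical placeholders for one training pass and one development-set evaluation; the optimiser, learning rate decay and gradient clipping are left out.

```python
def train_with_early_stopping(train_epoch, evaluate, max_epochs=50, patience=5):
    """Stop once the development score has not improved for `patience` epochs.

    Returns the best epoch (1-based) and its development score.
    """
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        score = evaluate(epoch)
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # patience exhausted; keep the best epoch seen so far
    return best_epoch, best_score


# Toy development F1 curve: the peak at epoch 3 is followed by five worse epochs.
toy_curve = [0.50, 0.60, 0.62, 0.61, 0.60, 0.61, 0.59, 0.58, 0.70]
toy_result = train_with_early_stopping(
    train_epoch=lambda epoch: None,
    evaluate=lambda epoch: toy_curve[epoch - 1],
    max_epochs=len(toy_curve),
)
print(toy_result)  # (3, 0.62)
```

In the toy curve, training halts after epoch 8, so the later score of 0.70 is never observed; this is the usual trade-off of a fixed patience.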