N-ary Relation Extraction using Graph-State LSTM

Cross-sentence n-ary relation extraction detects relations among n entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The current state-of-the-art method splits the input graph into two DAGs, adopting a DAG-structured LSTM for each. Although this approach can model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. We propose a graph-state LSTM model, which uses a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTMs, our graph LSTM keeps the original graph structure, and speeds up computation by allowing more parallelization. On a standard benchmark, our model shows the best result in the literature.


Introduction
As a central task in natural language processing, relation extraction has been investigated on news, web text and biomedical domains. It has been shown to be useful for detecting explicit facts, such as cause-effect (Hendrickx et al., 2009), and predicting the effectiveness of a medicine on a cancer caused by mutation of a certain gene in the biomedical domain (Peng et al., 2017). While most existing work extracts relations within a sentence (Zelenko et al., 2003; Palmer et al., 2005; Zhao and Grishman, 2005; Jiang and Zhai, 2007; Plank and Moschitti, 2013; Li and Ji, 2014; Gormley et al., 2015; Miwa and Bansal, 2016; Zhang et al., 2017), the task of cross-sentence relation extraction has received increasing attention (Gerber and Chai, 2010; Yoshikawa et al., 2011). Recently, Peng et al. (2017) extend cross-sentence relation extraction by further detecting relations among several entity mentions (n-ary relation). Table 1 shows an example, which conveys the fact that cancers caused by the 858E mutation on EGFR gene can respond to the gefitinib medicine. The three entity mentions form a ternary relation yet appear in distinct sentences.
Table 1: The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the 858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.
* Equal contribution
Peng et al. (2017) proposed a graph-structured LSTM for n-ary relation extraction. As shown in Figure 1 (a), graphs are constructed from input sentences with dependency edges, links between adjacent words, and inter-sentence relations, so that syntactic and discourse information can be used for relation extraction. To calculate a hidden state encoding for each word, Peng et al. (2017) first split the input graph into two directed acyclic graphs (DAGs) by separating left-to-right edges from right-to-left edges (Figure 1 (b)).
Then, two separate gated recurrent neural networks, which extend tree LSTM (Tai et al., 2015), were adopted for each single-directional DAG, respectively. Finally, for each word, the hidden states of both directions are concatenated as the final state. The bi-directional DAG LSTM model showed superior performance over several strong baselines, such as tree-structured LSTM (Miwa and Bansal, 2016), on a biomedical-domain benchmark.
However, the bidirectional DAG LSTM model suffers from several limitations. First, important information can be lost when converting a graph into two DAGs. [...]
A potential solution to the problems above is to model a graph as a whole, learning its representation without breaking it into two DAGs. Due to the existence of cycles, a naive extension of tree LSTMs cannot serve this goal. Recently, graph convolutional networks (GCN) (Kipf and Welling, 2017; Bastings et al., 2017) and graph recurrent networks (GRN) have been proposed for representing graph structures for NLP tasks. Such methods encode a given graph by hierarchically learning representations of neighboring nodes in the graphs via their connecting edges. While GCNs use CNN for information exchange, GRNs take gated recurrent steps to this end. For fair comparison with DAG LSTMs, we build a graph LSTM by extending a GRN, strictly following the configurations of Peng et al. (2017), such as the source of features and hyperparameter settings. In particular, the full input graph is modeled as a single state, with words in the graph being its sub-states. State transitions are performed on the graph recurrently, allowing word-level states to exchange information through dependency and discourse edges. At each recurrent step, each word advances its current state by receiving information from the current states of its adjacent words. Thus, with an increasing number of recurrent steps, each word receives information from a larger context. Figure 2 shows the recurrent transition steps, where all nodes are updated simultaneously within each transition step.
Compared with bidirectional DAG LSTM, our method has several advantages. First, it keeps the original graph structure, and therefore no information is lost. Second, sibling information can be easily incorporated by passing information up and then down from a parent. Third, information exchange allows more parallelization, and thus can be very efficient in computation.
Results show that our model outperforms a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state-of-the-art system of Peng et al. (2017) by 1.2%. Our code is available at https://github.com/freesunshine0316/nary-grn.
Our contributions are summarized as follows.
• We empirically compared graph LSTM with DAG LSTM for n-ary relation extraction tasks, showing that the former is better by more effective use of structural information;
• To our knowledge, we are the first to investigate a graph recurrent network for modeling dependency and discourse relations.

Task Definition
Formally, the input for cross-sentence n-ary relation extraction can be represented as a pair $(E, T)$, where $E = (\epsilon_1, \dots, \epsilon_N)$ is the set of entity mentions, and $T = [S_1; \dots; S_M]$ is a text consisting of multiple sentences. Each entity mention $\epsilon_i$ belongs to one sentence in $T$. There is a predefined relation set $R = (r_1, \dots, r_L, \text{None})$, where None represents that no relation holds for the entities.
This task can be formulated as a binary classification problem of determining whether $\epsilon_1, \dots, \epsilon_N$ together form a relation (Peng et al., 2017), or a multi-class classification problem of detecting which relation holds for the entity mentions. Take Table 1 as an example. The binary classification task is to determine whether gefitinib would have an effect on this type of cancer, given a cancer patient with 858E mutation on gene EGFR. The multi-class classification task is to detect the exact drug effect: response, resistance, sensitivity, etc.

Baseline: Bi-directional DAG LSTM
Peng et al. (2017) formulate the task as a graph-structured problem in order to adopt rich dependency and discourse features. In particular, the Stanford parser is used to assign syntactic structure to input sentences, and heads of two consecutive sentences are connected to represent discourse information, resulting in a graph structure. For each input graph $G = (V, E)$, the nodes $V$ are words within input sentences, and each edge $e \in E$ connects two words that either have a relation or are adjacent to each other. Each edge is denoted as a triple $(i, j, l)$, where $i$ and $j$ are the indices of the source and target words, respectively, and the edge label $l$ indicates either a dependency or discourse relation (such as "nsubj") or a relative position (such as "next tok" or "prev tok"). Throughout this paper, we use $E_{in}(j)$ and $E_{out}(j)$ to denote the sets of incoming and outgoing edges for word $j$. For a bi-directional DAG LSTM baseline, we follow Peng et al. (2017), splitting each input graph into two separate DAGs by separating left-to-right edges from right-to-left edges (Figure 1). Each DAG is encoded by using a DAG LSTM (Section 3.2), which takes both source words and edge labels as inputs (Section 3.1). Finally, the hidden states of entity mentions from both LSTMs are taken as inputs to a logistic regression classifier to make a prediction:
$$\hat{y} = \mathrm{softmax}(W_0 [h_{\epsilon_1}; \dots; h_{\epsilon_N}] + b_0)$$
where $h_{\epsilon_j}$ is the hidden state of entity $\epsilon_j$, and $W_0$ and $b_0$ are parameters.
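The splitting step can be illustrated with a minimal sketch. It assumes edges are stored as (source, target, label) triples as defined above; the function names `split_into_dags` and `incoming_edges` and the toy edges are hypothetical and are not taken from the released code.

```python
from collections import defaultdict

def split_into_dags(edges):
    """Partition (source, target, label) triples into forward and backward DAGs.

    An edge is treated as left-to-right when source index < target index,
    and as right-to-left when source index > target index.
    """
    forward = [(i, j, l) for (i, j, l) in edges if i < j]
    backward = [(i, j, l) for (i, j, l) in edges if i > j]
    return forward, backward

def incoming_edges(edges):
    """Build E_in: map each target word index j to the list of its incoming edges."""
    e_in = defaultdict(list)
    for (i, j, l) in edges:
        e_in[j].append((i, j, l))
    return dict(e_in)

# Toy graph over 4 words (indices 0..3); the labels are illustrative only.
toy_edges = [
    (0, 1, "next_tok"), (1, 0, "prev_tok"),
    (1, 2, "next_tok"), (2, 1, "prev_tok"),
    (2, 3, "next_tok"), (3, 2, "prev_tok"),
    (3, 1, "nsubj"),
]
fwd, bwd = split_into_dags(toy_edges)
e_in = incoming_edges(toy_edges)
```

Each of the two edge lists is acyclic by construction, since every edge in `fwd` moves to a larger word index and every edge in `bwd` to a smaller one. Paths that mix the two directions in the original graph are no longer available within either DAG.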

Input Representation
Both nodes and edge labels are useful for modeling a syntactic graph. As the input to our DAG LSTM, we first calculate the representation for each edge $(i, j, l)$ by:
$$x^l_{i,j} = W_1 [e_l; e_i] + b_1$$
where $W_1$ and $b_1$ are model parameters, $e_i$ is the embedding of the source word indexed by $i$, and $e_l$ is the embedding of the edge label $l$.
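As a small numerical sketch of this edge representation, the snippet below applies a linear map with parameters W1 and b1 to the concatenation of the label embedding and the source-word embedding. The dimensions, the random embedding tables, the concatenation order and the name `edge_representation` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden_dim = 4, 6  # illustrative sizes, not taken from the paper

# Hypothetical embedding tables for two words and one edge label.
word_emb = {0: rng.normal(size=emb_dim), 1: rng.normal(size=emb_dim)}
label_emb = {"nsubj": rng.normal(size=emb_dim)}

# Model parameters W1 and b1 of the edge representation.
W1 = rng.normal(size=(hidden_dim, 2 * emb_dim))
b1 = np.zeros(hidden_dim)

def edge_representation(i, l):
    """Represent an edge with source word i and label l as W1 [e_l; e_i] + b1."""
    concat = np.concatenate([label_emb[l], word_emb[i]])
    return W1 @ concat + b1

x = edge_representation(0, "nsubj")
```

Because the label enters only through its embedding, adding a new edge type adds one embedding vector rather than a new weight matrix.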

State transition
The baseline LSTM model learns DAG representations sequentially, following word order. Taking the edge representations (such as $x^l_{i,j}$) as input, gated state transition operations are executed on both the forward and backward DAGs. For each word $j$, the representations of its incoming edges $E_{in}(j)$ are summed up as one vector:
$$x^{in}_j = \sum_{(i,j,l) \in E_{in}(j)} x^l_{i,j}$$
Similarly, for each word $j$, the states of all incoming nodes are summed to a single vector before being passed to the gated operations:
$$h^{in}_j = \sum_{(i,j,l) \in E_{in}(j)} h_i$$
Finally, the gated state transition operation for the hidden state $h_j$ of the $j$-th word is defined over these two vectors (Equation 5), where $i_j$, $o_j$ and $f_{i,j}$ are a set of input, output and forget gates, respectively, and $W_x$, $U_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are model parameters.
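The gate equations themselves are not legible in this copy. One plausible form is sketched below, assuming the child-sum tree-LSTM layout (Tai et al., 2015) that the baseline extends. Here $x^{in}_j$ and $h^{in}_j$ denote the summed incoming edge representations and the summed incoming hidden states of word $j$, $c_j$ is an assumed memory cell, and the use of tanh for the candidate update $u_j$ is also an assumption.

```latex
\begin{aligned}
i_j     &= \sigma(W_i x^{in}_j + U_i h^{in}_j + b_i) \\
o_j     &= \sigma(W_o x^{in}_j + U_o h^{in}_j + b_o) \\
f_{i,j} &= \sigma(W_f x^{l}_{i,j} + U_f h_i + b_f) \\
u_j     &= \tanh(W_u x^{in}_j + U_u h^{in}_j + b_u) \\
c_j     &= i_j \odot u_j + \sum_{(i,j,l) \in E_{in}(j)} f_{i,j} \odot c_i \\
h_j     &= o_j \odot \tanh(c_j)
\end{aligned}
```

In this form, one forget gate is computed per incoming edge, so each predecessor's memory cell is weighted separately. The state of word $j$ can be computed only after all of its predecessors, which is what makes the baseline sequential.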

Comparison with Peng et al. (2017)
Our baseline is computationally similar to Peng et al. (2017), but differs in how edge labels are utilized in the gated network. In particular, Peng et al. (2017) consider two variants, FULL and EMBED. FULL assigns distinct $U$s (in Equation 5) to different edge types, so that each edge label is associated with a 2D weight matrix to be tuned in training. On the other hand, EMBED assigns each edge label to an embedding vector, but complicates the gated operations by changing the $U$s to be 3D tensors (see footnote 1). In contrast, we take edge labels as part of the input to the gated network. In general, the edge labels are first represented as embeddings, before being concatenated with the node representation vectors (Equation 2). We choose this setting for both the baseline and our graph state LSTM model in Section 4, since it requires fewer parameters compared with FULL and EMBED, thus being less exposed to overfitting on small-scale data.

Graph State LSTM
Our input graph formulation strictly follows Section 3. In particular, our model adopts the same methods for calculating input representation (as in Section 3.1) and performing classification as the baseline model. However, different from the baseline bidirectional DAG LSTM model, we leverage a graph-structured LSTM to directly model the input graph, without splitting it into two DAGs. Figure 2 shows an overview of our model. Formally, given an input graph $G = (V, E)$, we define a state vector $h^j$ for each word $v_j \in V$. The state of the graph consists of all word states, and thus can be represented as:
$$g = \{h^j\}|_{v_j \in V}$$
In order to capture non-local information, our model performs information exchange between words through a recurrent state transition process, resulting in a sequence of graph states $g_0, g_1, \dots, g_t$, where $g_t = \{h^j_t\}|_{v_j \in V}$. The initial graph state $g_0$ consists of a set of initial word states $h^j_0 = h_0$, where $h_0$ is a zero vector.
Footnote 1: For more information, please refer to Section 3.3 of Peng et al. (2017).

State transition
Following prior work on graph recurrent networks, a recurrent neural network is utilized to model the state transition process. In particular, the transition from $g_{t-1}$ to $g_t$ consists of a hidden state transition for each word, as shown in Figure 2. At each step $t$, we allow information exchange between a word and all words that are directly connected to the word. To avoid vanishing or exploding gradients, gated LSTM cells are adopted, where a cell $c^j_t$ is taken to record memory for $h^j_t$. We use an input gate $i^j_t$, an output gate $o^j_t$ and a forget gate $f^j_t$ to control information flow from the inputs and into $h^j_t$. The inputs to a word $v_j$ include representations of edges that are connected to $v_j$, where $v_j$ can be either the source or the target of the edge. Similar to Section 3.1, we define each edge as a triple $(i, j, l)$, where $i$ and $j$ are indices of the source and target words, respectively, and $l$ is the edge label.
$x^l_{i,j}$ is the representation of edge $(i, j, l)$. The inputs for $v_j$ are distinguished by incoming and outgoing directions (marked by the superscripts $i$ and $o$), where:
$$x^i_j = \sum_{(k,j,l) \in E_{in}(j)} x^l_{k,j}, \qquad x^o_j = \sum_{(j,k,l) \in E_{out}(j)} x^l_{j,k}$$
Here $E_{in}(j)$ and $E_{out}(j)$ denote the sets of incoming and outgoing edges of $v_j$, respectively. In addition to edge inputs, a cell also takes the hidden states of its incoming and outgoing words during a state transition. In particular, the states of all incoming words and outgoing words are summed up, respectively:
$$h^i_j = \sum_{(k,j,l) \in E_{in}(j)} h^k_{t-1}, \qquad h^o_j = \sum_{(j,k,l) \in E_{out}(j)} h^k_{t-1}$$
Based on the above definitions of $x^i_j$, $x^o_j$, $h^i_j$ and $h^o_j$, the recurrent state transition from $g_{t-1}$ to $g_t$, as represented by $h^j_t$, is computed by a gated LSTM cell, where $i^j_t$, $o^j_t$ and $f^j_t$ are the input, output and forget gates, respectively.
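The transition equations are not legible in this copy. A plausible form is sketched below, assuming a standard LSTM cell applied to the four inputs named above (summed incoming and outgoing edge representations $x^i_j$, $x^o_j$, and summed incoming and outgoing neighbor states $h^i_j$, $h^o_j$). The parameter names $W_x$, $\hat{W}_x$, $U_x$, $\hat{U}_x$, $b_x$ and the tanh candidate $u^j_t$ are assumptions.

```latex
\begin{aligned}
i^j_t &= \sigma(W_i x^i_j + \hat{W}_i x^o_j + U_i h^i_j + \hat{U}_i h^o_j + b_i) \\
o^j_t &= \sigma(W_o x^i_j + \hat{W}_o x^o_j + U_o h^i_j + \hat{U}_o h^o_j + b_o) \\
f^j_t &= \sigma(W_f x^i_j + \hat{W}_f x^o_j + U_f h^i_j + \hat{U}_f h^o_j + b_f) \\
u^j_t &= \tanh(W_u x^i_j + \hat{W}_u x^o_j + U_u h^i_j + \hat{U}_u h^o_j + b_u) \\
c^j_t &= f^j_t \odot c^j_{t-1} + i^j_t \odot u^j_t \\
h^j_t &= o^j_t \odot \tanh(c^j_t)
\end{aligned}
```

Unlike the DAG case, the forget gate here acts on the word's own previous cell $c^j_{t-1}$, and every quantity on the right-hand side comes from step $t-1$, so all words can be updated at once.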
Graph State LSTM vs bidirectional DAG LSTM. A contrast between the baseline DAG LSTM and our graph LSTM can be made from the perspective of information flow. For the baseline, information flow follows the natural word order in the input sentence, with the two DAG components propagating information from left to right and from right to left, respectively. In contrast, information flow in our graph state LSTM is relatively more concentrated at individual words, with each word exchanging information with all its graph neighbors simultaneously at each state transition. As a result, holistic contextual information can be leveraged for extracting features for each word, as compared to the separate handling of bi-directional information flow in DAG LSTM. In addition, arbitrary structures, including arbitrary cyclic graphs, can be handled.
From an initial state with isolated words, information of each word propagates to its graph neighbors after each step. Information exchange between non-neighboring words can be achieved through multiple transition steps. We experiment with different transition step numbers to study the effectiveness of global encoding. Unlike the baseline DAG LSTM encoder, our model allows parallelization in node-state updates, and thus can be highly efficient using a GPU.
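The propagation behavior described above can be checked with a minimal NumPy sketch of the recurrent transition. This is an illustration under stated assumptions, not the released implementation: the names `graph_state_lstm` and `init_params`, the dimensions, the random weights and the tanh candidate are hypothetical. A single weight matrix per gate is applied to the concatenation of incoming edge inputs, outgoing edge inputs, incoming neighbor states and outgoing neighbor states, which is equivalent to using a separate matrix per input block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(in_dim, hidden_dim, rng, scale=0.5):
    """One (weight, bias) pair per gate, acting on [x_in; x_out; h_in; h_out]."""
    size = 2 * in_dim + 2 * hidden_dim
    return {g: (rng.normal(scale=scale, size=(size, hidden_dim)), np.zeros(hidden_dim))
            for g in "iofu"}

def graph_state_lstm(edges, edge_repr, num_words, hidden_dim, steps, params):
    """Run `steps` state transitions, updating every word in parallel at each step.

    edges:     list of (source, target, label) triples
    edge_repr: dict mapping each triple to its input vector
    """
    in_dim = len(next(iter(edge_repr.values())))
    x_in = np.zeros((num_words, in_dim))     # summed incoming edge inputs per word
    x_out = np.zeros((num_words, in_dim))    # summed outgoing edge inputs per word
    a_in = np.zeros((num_words, num_words))  # a_in[j, i] counts edges i -> j
    for (i, j, l) in edges:
        x_in[j] += edge_repr[(i, j, l)]
        x_out[i] += edge_repr[(i, j, l)]
        a_in[j, i] += 1.0
    a_out = a_in.T

    h = np.zeros((num_words, hidden_dim))    # initial word states are zero vectors
    c = np.zeros((num_words, hidden_dim))
    for _ in range(steps):
        # Neighbor sums use the states from the previous step only.
        z = np.concatenate([x_in, x_out, a_in @ h, a_out @ h], axis=1)
        pre = {g: z @ W + b for g, (W, b) in params.items()}
        i_g, o_g, f_g = (sigmoid(pre[g]) for g in "iof")
        u = np.tanh(pre["u"])
        c = f_g * c + i_g * u
        h = o_g * np.tanh(c)
    return h

# Toy chain 0 -> 1 -> 2 -> 3 with random edge inputs.
rng = np.random.default_rng(0)
chain = [(0, 1, "next"), (1, 2, "next"), (2, 3, "next")]
reprs = {e: rng.normal(size=3) for e in chain}
params = init_params(in_dim=3, hidden_dim=4, rng=rng)
h5 = graph_state_lstm(chain, reprs, num_words=4, hidden_dim=4, steps=5, params=params)
```

In this toy chain, a change to the input of edge (0, 1) can affect the state of word 3 only from the third transition step onward, since information travels one edge per step. The whole update is a handful of matrix products per step, regardless of the number of words.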

Training
We train our models with a cross-entropy loss over a set of gold standard data:
$$l = -\sum_{i} \log p(y_i \mid X_i; \theta)$$
where $X_i$ is an input graph, $y_i$ is the gold class label of $X_i$, and $\theta$ denotes the model parameters. Adam (Kingma and Ba, 2014) is used as the optimizer.
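A minimal sketch of the classifier and the loss follows. It assumes that the class distribution is a softmax over a linear map of the concatenated entity-mention states and that the loss is summed (not averaged) over instances; the function names `predict` and `cross_entropy` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(entity_states, W0, b0):
    """Class distribution from the concatenated hidden states of the entity mentions."""
    return softmax(W0 @ np.concatenate(entity_states) + b0)

def cross_entropy(probs_per_instance, gold_labels):
    """Sum of negative log-probabilities assigned to the gold class of each instance."""
    return -sum(np.log(p[y]) for p, y in zip(probs_per_instance, gold_labels))

# Three entity states of size 2 and two classes; all-zero weights give a uniform prediction.
p = predict([np.zeros(2), np.zeros(2), np.zeros(2)], np.zeros((2, 6)), np.zeros(2))
loss = cross_entropy([p, p], [0, 1])
```

With a uniform prediction over two classes, each instance contributes log 2 to the loss, which is a convenient sanity check for an untrained binary model.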

Experiments
We conduct experiments for the binary relation detection task and the multi-class relation extraction task discussed in Section 2.

Data
We use the dataset of Peng et al. (2017), which is a biomedical-domain dataset focusing on drug-gene-mutation ternary relations, extracted from PubMed. It contains 6987 ternary instances about drug-gene-mutation relations, and 6087 binary instances about drug-mutation sub-relations. Table 2 shows statistics of the dataset. Most instances of ternary data contain multiple sentences, and the average number of sentences is around 2. There are five classification labels: "resistance or non-response", "sensitivity", "response", "resistance" and "None". We follow Peng et al. (2017) and binarize multi-class labels by grouping all relation classes as "Yes" and treating "None" as "No".
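The binarization rule amounts to a simple label mapping, sketched below. The label strings are copied from the description above and their exact spelling in the data files may differ; the function name `binarize` is hypothetical.

```python
# Relation classes named in the dataset description (spelling in the data may differ).
RELATION_LABELS = {"resistance or non-response", "sensitivity", "response", "resistance"}

def binarize(label):
    """Map a five-way label to the binary detection label ("Yes" or "No")."""
    if label == "None":
        return "No"
    if label in RELATION_LABELS:
        return "Yes"
    raise ValueError(f"unknown label: {label!r}")

binary = [binarize(y) for y in ["response", "None", "sensitivity"]]
```

Raising an error on unknown labels, rather than defaulting to "No", avoids silently mislabeling instances if the label spelling in the data differs from the one assumed here.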

Development Experiments
We first analyze our model on the drug-gene-mutation ternary relation dataset, taking the first among 5-fold cross validation settings for our data setting. Figure 3 shows the devset accuracies for different numbers of state transitions, where forward and backward execute our graph state model only on the forward or backward DAG, respectively. Concat concatenates the hidden states of forward and backward. All executes our graph state model on original graphs. The performance of forward and backward lags behind concat, which is consistent with the intuition that both forward and backward relations are useful (Peng et al., 2017). In addition, all gives better accuracies compared with concat, demonstrating the advantage of simultaneously considering forward and backward relations during representation learning. For all the models, more state transition steps result in better accuracies, since larger contexts can be integrated in the representations of graphs. The performance of all starts to converge after 4 and 5 state transitions, so we set the number of state transitions to 5 in the remaining experiments.

Final results
Table 3 compares our model with the bidirectional DAG baseline and the state-of-the-art results on this dataset, where EMBED and FULL have been briefly introduced in Section 3.3. +multitask applies joint training of both ternary (drug-gene-mutation) relations and their binary (drug-mutation) sub-relations. A further system uses a statistical method with a logistic regression classifier and features derived from shortest paths between all entity pairs. Bidir DAG LSTM is our bidirectional DAG LSTM baseline, and GS GLSTM is our graph state LSTM model.
[Table 3: columns Model, Single, Cross; table body not recovered.]
Using all instances (column Cross in Table 3), our graph state LSTM model shows the highest test accuracy among all methods, which is 5.9% higher than our baseline. The accuracy of our baseline is lower than EMBED and FULL of Peng et al. (2017), which is likely due to the differences mentioned in Section 3.3. Our final results are better than Peng et al. (2017), despite the fact that we do not use multi-task learning.
We also report accuracies only on instances within single sentences (column Single in Table 3), which exhibit similar contrasts. Note that all systems show performance drops when evaluated only on single-sentence relations, which are actually more challenging. One reason may be that some single sentences cannot provide sufficient context for disambiguation, making it necessary to study cross-sentence context. Another reason may be overfitting caused by relatively fewer training instances in this setting, as only 30% of instances are within a single sentence. One interesting observation is that our baseline shows the smallest performance drop of 1.7 points, in contrast to up to 4.1 for other neural systems. This can be supporting evidence for overfitting, as our baseline has fewer parameters than at least FULL and EMBED.

Analysis
Efficiency. Table 4 shows the training and decoding time of both the baseline and our model. Our model is 8 to 10 times faster than the baseline in training and decoding speeds, respectively. By revisiting [...] 74, which means that the baseline model has to execute 74 recurrent transition steps for calculating a hidden state for each input word. On the other hand, our model only performs 5 state transitions, and calculations between each pair of nodes for one transition are parallelizable. This accounts for the better efficiency of our model.
Accuracy against sentence length. Figure 5 (a) shows the test accuracies on different sentence lengths. We can see that the accuracies of both GS GLSTM and Bidir DAG LSTM increase with input sentence length. This is likely because longer contexts provide richer information for relation disambiguation. GS GLSTM is consistently better than Bidir DAG LSTM, and the gap is larger on shorter instances. This demonstrates that GS GLSTM is more effective in utilizing a smaller context for disambiguation.
Accuracy against the maximal number of neighbors. Figure 5 (b) shows the test accuracies against the maximum number of neighbors. Intuitively, it is easier to model graphs containing nodes with more neighbors, because these nodes can serve as "supernodes" that allow more efficient information exchange. The performances of both GS GLSTM and Bidir DAG LSTM increase with the maximal number of neighbors, which coincides with this intuition. In addition, GS GLSTM shows a larger advantage over Bidir DAG LSTM on inputs with a lower maximal number of neighbors, which further demonstrates the superiority of GS GLSTM over Bidir DAG LSTM in utilizing context information.
Case study. Figure 4 visualizes the merits of GS GLSTM over Bidir DAG LSTM using two examples. GS GLSTM makes the correct predictions for both cases, while Bidir DAG LSTM fails to.
The first case generally mentions that Gefitinib does not have an effect on T790M mutation on EGFR gene. Note that both "However" and "was not" serve as indicators; thus incorporating them into the contextual vectors of these entity mentions is important for making a correct prediction. However, both indicators are leaves of the dependency tree, making it impossible for Bidir DAG LSTM to incorporate them into the contextual vectors of entity mentions up the tree through dependency edges. On the other hand, it is easier for GS GLSTM. For instance, "was not" can be incorporated into "Gefitinib" through $\text{suppressed} \xrightarrow{\text{agent}} \text{treatment} \xrightarrow{\text{nn}} \text{Gefitinib}$. The second case is to detect the relation among "cetuximab" (drug), "EGFR" (gene) and "S492R" (mutation), which does not exist. However, the context introduces further ambiguity by mentioning another drug "Panitumumab", which does have a relation with "EGFR" and "S492R". As a sibling node of "cetuximab" in the dependency tree, "can not" is an indicator for the relation of "cetuximab". GS GLSTM is correct, because "can not" can be easily included into the contextual vector of "cetuximab" in two steps via $\text{bind} \xrightarrow{\text{nsubj}} \text{cetuximab}$.
[Table 5, partially recovered: columns Model, Single, Cross. Row with missing label: 73.9, 75.2. Miwa and Bansal (2016): 75.9, 75.9. Peng et al. (2017): values not recovered.]

Results on Binary Sub-relations
Following previous work, we also evaluate our model on drug-mutation binary relations. Table 5 shows the results, where Miwa and Bansal (2016) is a state-of-the-art model using sequential and tree-structured LSTMs to jointly capture linear and dependency contexts for relation extraction. Other models have been introduced in Section 6.4.
Similar to the ternary relation extraction experiments, GS GLSTM outperforms all the other systems by a large margin, which shows that the message passing graph LSTM is better at encoding rich linguistic knowledge within the input graphs. Binary relations being easier, both GS GLSTM and Bidir DAG LSTM show increased or similar performances compared with the ternary relation experiments. On this set, our bidirectional DAG LSTM model is comparable to FULL using all instances ("Cross") and slightly better than FULL using only single-sentence instances ("Single").

Fine-grained Classification
Our dataset contains five classes as mentioned in Section 6.1. However, previous work only investigates binary relation detection. Here we also study the multi-class classification task, which can be more informative for applications. Table 6 shows accuracies on multi-class relation extraction, which makes the task more ambiguous compared with binary relation extraction. The results show similar comparisons with the binary relation extraction results. However, the performance gaps between GS GLSTM and Bidir DAG LSTM dramatically increase, showing the superiority of GS GLSTM over Bidir DAG LSTM in utilizing context information.

Related Work
[...] 1998), which focuses on entity-attribution relations. It has also been studied in the biomedical domain (McDonald et al., 2005), but only the instances within a single sentence are considered. Previous work on cross-sentence relation extraction relies on either explicit co-reference annotation (Gerber and Chai, 2010; Yoshikawa et al., 2011), or the assumption that the whole document refers to a single coherent event (Wick et al., 2006; Swampillai and Stevenson, 2011). Both simplify the problem and reduce the need for learning better contextual representations of entity mentions. A notable exception is prior work that adopts distant supervision and integrates contextual evidence of diverse types without relying on these assumptions. However, that work only studies binary relations. We follow Peng et al. (2017) by studying ternary cross-sentence relations.
Graph encoder. Liang et al. (2016) build a graph LSTM model for semantic object parsing, which aims to segment objects within an image into more fine-grained, semantically meaningful parts. The nodes of an input graph come from image superpixels, and the edges are created by connecting spatially neighboring nodes. Their model is similar to Peng et al. (2017) in calculating node states sequentially: for each input graph, a start node and a node sequence are chosen, which determines the order of recurrent state updates. In contrast, our graph LSTM does not need an ordering of graph nodes, and is highly parallelizable.
Graph convolutional networks (GCNs) and very recently graph recurrent networks (GRNs) have been used to model graph structures in NLP tasks, such as semantic role labeling, machine translation (Bastings et al., 2017), text generation, text representation and semantic parsing (Xu et al., 2018b,a). In particular, one approach uses GRN to represent raw sentences by building a graph structure of neighboring words and a sentence-level node, showing that the encoder outperforms BiLSTMs and Transformer (Vaswani et al., 2017) on classification and sequence labeling tasks; another builds a GRN for encoding AMR graphs, showing that the representation is superior compared to BiLSTM on serialized AMR. Our work is in line with this work in the investigation of GRN on NLP. To our knowledge, we are the first to use GRN for representing dependency and discourse structures. Under the same recurrent framework, we show that modeling the original graphs with one GRN model is more useful than two DAG LSTMs for our relation extraction task. We choose GRN as our main method because it gives a fairer comparison with DAG LSTM. We leave it to future work to compare GCN and GRN for our task.

Conclusion
We explored a graph-state LSTM model for cross-sentence n-ary relation extraction, which uses a recurrent state transition process to incrementally refine a neural graph state representation capturing graph structure contexts. Compared with a bidirectional DAG LSTM baseline, our model has several advantages. First, it does not change the input graph structure, so that no information is lost. For example, it can easily incorporate sibling information when calculating the contextual vector of a node. Second, it is more parallelizable. Experiments show significant improvements over the previously reported numbers, including that of the bidirectional graph LSTM model.
For future work, we consider adding coreference information, as an entity mention can have coreferences, which help with information collection. Another possible direction is including word sense information. Confusion caused by word senses can be a severe problem. Not only content words, but also prepositions can introduce word sense problems (Gong et al., 2018).