End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures

We present a novel end-to-end neural model to extract entities and relations between them. Our recurrent neural network based model captures both word sequence and dependency tree substructure information by stacking bidirectional tree-structured LSTM-RNNs on bidirectional sequential LSTM-RNNs. This allows our model to jointly represent both entities and relations with shared parameters in a single model. We further encourage detection of entities during training and use of entity information in relation extraction via entity pretraining and scheduled sampling. Our model improves over the state-of-the-art feature-based model on end-to-end relation extraction, achieving 12.1% and 5.7% relative error reductions in F1-score on ACE2005 and ACE2004, respectively. We also show that our LSTM-RNN based model compares favorably to the state-of-the-art CNN based model (in F1-score) on nominal relation classification (SemEval-2010 Task 8). Finally, we present an extensive ablation analysis of several model components.


Introduction
Extracting semantic relations between entities in text is an important and well-studied task in information extraction and natural language processing (NLP). Traditional systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) (Nadeau and Sekine, 2007; Ratinov and Roth, 2009) and relation extraction (Zelenko et al., 2003; Zhou et al., 2005), but recent studies show that end-to-end (joint) modeling of entities and relations is important for high performance (Li and Ji, 2014; Miwa and Sasaki, 2014) since relations interact closely with entity information. For instance, to learn that Toefting and Bolton have an Organization-Affiliation (ORG-AFF) relation in the sentence Toefting transferred to Bolton, the entity information that Toefting and Bolton are Person and Organization entities is important. Extraction of these entities is in turn encouraged by the presence of the context words transferred to, which indicate an employment relation. Previous joint models have employed feature-based structured learning. An alternative approach to this end-to-end relation extraction task is to employ automatic feature learning via neural network (NN) based models.
There are two ways to represent relations between entities using neural networks: recurrent/recursive neural networks (RNNs) and convolutional neural networks (CNNs). Among these, RNNs can directly represent essential linguistic structures, i.e., word sequences (Hammerton, 2001) and constituent/dependency trees (Tai et al., 2015). Despite this representation ability, for relation classification tasks, the previously reported performance using long short-term memory (LSTM) based RNNs (Xu et al., 2015b; Li et al., 2015) is worse than that using CNNs (dos Santos et al., 2015). These previous LSTM-based systems mostly include limited linguistic structures and neural architectures, and do not model entities and relations jointly. We are able to achieve improvements over state-of-the-art models via end-to-end modeling of entities and relations based on richer LSTM-RNN architectures that incorporate complementary linguistic structures.
Word sequence and tree structure are known to be complementary information for extracting relations. For instance, dependencies between words are not enough to predict that source and U.S. have an ORG-AFF relation in the sentence "This is ...", one U.S. source said, and the context word said is required for this prediction. Many traditional, feature-based relation classification models extract features from both sequences and parse trees (Zhou et al., 2005). However, previous RNN-based models focus on only one of these linguistic structures (Socher et al., 2012).
We present a novel end-to-end model to extract relations between entities on both word sequence and dependency tree structures. Our model allows joint modeling of entities and relations in a single model by using both bidirectional sequential (left-to-right and right-to-left) and bidirectional tree-structured (bottom-up and top-down) LSTM-RNNs. Our model first detects entities and then extracts relations between the detected entities using a single incrementally-decoded NN structure, and the NN parameters are jointly updated using both entity and relation labels. Unlike traditional incremental end-to-end relation extraction models, our model further incorporates two enhancements into training: entity pretraining, which pretrains the entity model, and scheduled sampling (Bengio et al., 2015), which replaces (unreliable) predicted labels with gold labels with a certain probability. These enhancements alleviate the problem of low-performance entity detection in early stages of training, and also allow entity information to further help downstream relation classification.
On end-to-end relation extraction, we improve over the state-of-the-art feature-based model, with 12.1% (ACE2005) and 5.7% (ACE2004) relative error reductions in F1-score. On nominal relation classification (SemEval-2010 Task 8), our model compares favorably to the state-of-the-art CNN-based model in F1-score. Finally, we also ablate and compare our various model components, which leads to some key findings (both positive and negative) about the contribution and effectiveness of different RNN structures, input dependency relation structures, different parsing models, external resources, and joint learning settings.

Related Work
LSTM-RNNs have been widely used for sequential labeling, such as clause identification (Hammerton, 2001), phonetic labeling (Graves and Schmidhuber, 2005), and NER (Hammerton, 2003). Recently, Huang et al. (2015) showed that building a conditional random field (CRF) layer on top of bidirectional LSTM-RNNs performs comparably to the state-of-the-art methods in part-of-speech (POS) tagging, chunking, and NER.
For relation classification, in addition to traditional feature/kernel-based approaches (Zelenko et al., 2003; Bunescu and Mooney, 2005), several neural models have been proposed in the SemEval-2010 Task 8 (Hendrickx et al., 2010), including embedding-based models (Hashimoto et al., 2015), CNN-based models (dos Santos et al., 2015), and RNN-based models (Socher et al., 2012). Recently, Xu et al. (2015a) and Xu et al. (2015b) showed that the shortest dependency paths between relation arguments, which were used in feature/kernel-based systems (Bunescu and Mooney, 2005), are also useful in NN-based models. Xu et al. (2015b) also showed that LSTM-RNNs are useful for relation classification, but the performance was worse than CNN-based models. Li et al. (2015) compared separate sequence-based and tree-structured LSTM-RNNs on relation classification, using basic RNN model structures.
Research on tree-structured LSTM-RNNs (Tai et al., 2015) fixes the direction of information propagation from bottom to top, and also cannot handle an arbitrary number of typed children as in a typed dependency tree. Furthermore, no RNN-based relation classification model simultaneously uses word sequence and dependency tree information. We propose several such novel model structures and training settings, investigating the simultaneous use of bidirectional sequential and bidirectional tree-structured LSTM-RNNs to jointly capture linear and dependency context for end-to-end extraction of relations between entities.
As for end-to-end (joint) extraction of relations between entities, all existing models are feature-based systems (and no NN-based model has been proposed). Such models include structured prediction (Li and Ji, 2014; Miwa and Sasaki, 2014), integer linear programming (Roth and Yih, 2007; Yang and Cardie, 2013), card-pyramid parsing (Kate and Mooney, 2010), and global probabilistic graphical models (Yu and Lam, 2010; Singh et al., 2013). Among these, structured prediction methods are state-of-the-art on several corpora. We present an improved, NN-based alternative for end-to-end relation extraction.

Model
We design our model with LSTM-RNNs that represent both word sequences and dependency tree structures, and perform end-to-end extraction of relations between entities on top of these RNNs. Fig. 1 illustrates the overview of the model. The model mainly consists of three representation layers: a word embeddings layer (embedding layer), a word sequence based LSTM-RNN layer (sequence layer), and finally a dependency subtree based LSTM-RNN layer (dependency layer). During decoding, we build greedy, left-to-right entity detection on the sequence layer and realize relation classification on the dependency layers, where each subtree based LSTM-RNN corresponds to a relation candidate between two detected entities. After decoding the entire model structure, we update the parameters simultaneously via backpropagation through time (BPTT) (Werbos, 1990). The dependency layers are stacked on the sequence layer, so the embedding and sequence layers are shared by both entity detection and relation classification, and the shared parameters are affected by both entity and relation labels.

Embedding Layer
The embedding layer handles embedding representations. $n_w$-, $n_p$-, $n_d$- and $n_e$-dimensional vectors $v^{(w)}$, $v^{(p)}$, $v^{(d)}$ and $v^{(e)}$ represent words, part-of-speech (POS) tags, dependency types, and entity labels, respectively.

Sequence Layer
The sequence layer represents words in a linear sequence using the representations from the embedding layer. This layer represents sentential context information and maintains entities, as shown in the bottom-left part of Fig. 1.
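We represent the word sequence in a sentence with bidirectional LSTM-RNNs (Graves et al., 2013). The LSTM unit at the $t$-th word consists of a collection of $n_{ls}$-dimensional vectors: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden state $h_t$. The unit receives an $n$-dimensional input vector $x_t$, the previous hidden state $h_{t-1}$, and the memory cell $c_{t-1}$, and calculates the new vectors using the following equations:
$$
\begin{aligned}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right), \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right), \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right), \\
u_t &= \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right), \\
c_t &= i_t \odot u_t + f_t \odot c_{t-1}, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$
where $\sigma$ denotes the logistic function, $\odot$ denotes element-wise multiplication, $W$ and $U$ are weight matrices, and $b$ are bias vectors. The LSTM unit at the $t$-th word receives the concatenation of word and POS embeddings as its input vector: $x_t = [v_t^{(w)}; v_t^{(p)}]$. We also concatenate the hidden state vectors of the two directions' LSTM units corresponding to each word (denoted as $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$) as its output vector, $s_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, and pass it to the subsequent layers.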
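The sequence layer described above can be sketched in NumPy as follows. This is a minimal sketch assuming the standard LSTM formulation with a tanh candidate vector; the function names (`init_params`, `lstm_step`, `run_lstm`, `bidirectional_states`), the zero initial states, and the random initialization are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hidden, rng):
    # One (W, U, b) triple per gate: input, forget, output, candidate.
    return {
        g: (rng.normal(scale=0.1, size=(n_hidden, n_in)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))
        for g in ("i", "f", "o", "u")
    }


def lstm_step(x_t, h_prev, c_prev, params):
    # Pre-activations W x_t + U h_{t-1} + b for every gate.
    pre = {g: W @ x_t + U @ h_prev + b for g, (W, U, b) in params.items()}
    i_t, f_t, o_t = (sigmoid(pre[g]) for g in ("i", "f", "o"))
    u_t = np.tanh(pre["u"])
    c_t = i_t * u_t + f_t * c_prev      # new memory cell
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t


def run_lstm(xs, params, n_hidden):
    # Assumption: initial hidden state and memory cell are zero vectors.
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    return states


def bidirectional_states(xs, fwd, bwd, n_hidden):
    left_to_right = run_lstm(xs, fwd, n_hidden)
    right_to_left = run_lstm(xs[::-1], bwd, n_hidden)[::-1]
    # Output for each word: concatenation of both directions' hidden states.
    return [np.concatenate(pair) for pair in zip(left_to_right, right_to_left)]
```

With 200-dimensional word and 25-dimensional POS embeddings as input and 100-dimensional LSTM states, each word's output vector has 200 dimensions.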

Entity Detection
We treat entity detection as a sequence labeling task. We assign an entity tag to each word using a commonly used encoding scheme BILOU (Begin, Inside, Last, Outside, Unit) (Ratinov and Roth, 2009), where each entity tag represents the entity type and the position of a word in the entity. For example, in Fig. 1, we assign B-PER and L-PER (which denote the beginning and last words of a person entity type, respectively) to each word in Sidney Yates to represent this phrase as a PER (person) entity type.
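The BILOU encoding can be illustrated with a short sketch. The function name `bilou_tags`, the span format (start index, exclusive end index, type), and the example sentence are assumptions made for illustration only.

```python
def bilou_tags(n_tokens, spans):
    """Convert entity spans (start, end_exclusive, type) into BILOU tags."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"U-{etype}"          # single-word entity
        else:
            tags[start] = f"B-{etype}"          # first word
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inner words
            tags[end - 1] = f"L-{etype}"        # last word
    return tags


# Illustrative sentence (assumed tokenization).
tokens = ["Sidney", "Yates", "was", "born", "in", "Chicago"]
tags = bilou_tags(len(tokens), [(0, 2, "PER"), (5, 6, "LOC")])
# tags == ["B-PER", "L-PER", "O", "O", "O", "U-LOC"]
```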
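We perform entity detection on top of the sequence layer. We employ a two-layered NN with an $n_{he}$-dimensional hidden layer $h^{(e)}$ and a softmax output layer for entity detection:
$$
\begin{aligned}
h_t^{(e)} &= \tanh\left(W^{(e_h)} [s_t; v_{t-1}^{(e)}] + b^{(e_h)}\right), \\
y_t &= \mathrm{softmax}\left(W^{(e_y)} h_t^{(e)} + b^{(e_y)}\right).
\end{aligned}
$$
Here, $s_t$ is the output of the sequence layer for the $t$-th word, $v_{t-1}^{(e)}$ is the label embedding of the previous word, $W$ are weight matrices, and $b$ are bias vectors.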
We assign entity labels to words in a greedy, left-to-right manner. During this decoding, we use the predicted label of a word to predict the label of the next word so as to take label dependencies into account. The NN above receives the concatenation of its corresponding outputs in the sequence layer and the label embedding for its previous word (Fig. 1).
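The greedy, left-to-right decoding loop can be sketched as below. This is a sketch under stated assumptions: `score_fn` stands in for the two-layered NN, `toy_score` is a hand-written stand-in rather than a trained model, and starting from an "O" label before the first word is an assumption not stated in the text.

```python
def greedy_decode(seq_outputs, score_fn, start_label="O"):
    """Label words left to right, feeding each prediction into the next step."""
    labels = []
    prev = start_label
    for s_t in seq_outputs:
        scores = score_fn(s_t, prev)        # dict: label -> score
        prev = max(scores, key=scores.get)  # greedy choice, no search
        labels.append(prev)
    return labels


def toy_score(token, prev_label):
    # Hypothetical scorer: uses the raw token in place of a sequence-layer vector.
    scores = {"O": 0.1, "B-PER": 0.0, "L-PER": 0.0}
    if prev_label == "B-PER":
        scores["L-PER"] = 1.0               # label dependency on the previous word
    elif token[0].isupper():
        scores["B-PER"] = 1.0
    return scores


labels = greedy_decode(["Sidney", "Yates", "was", "born"], toy_score)
# labels == ["B-PER", "L-PER", "O", "O"]
```

The toy scorer only predicts L-PER after a B-PER, which shows why conditioning on the previous label matters.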

Dependency Layer
The dependency layer represents a relation between a pair of two target words (corresponding to a relation candidate in relation classification) in the dependency tree, and is in charge of relation-specific representations, as shown in the top-right part of Fig. 1. This layer mainly focuses on the shortest path between a pair of target words in the dependency tree (i.e., the path between the least common node and the two target words) since these paths are shown to be effective in relation classification (Xu et al., 2015a). For example, we show the shortest path between Yates and Chicago in the bottom of Fig. 1, and this path well captures the key phrase of their relation, i.e., born in.
We employ bidirectional tree-structured LSTM-RNNs (i.e., bottom-up and top-down) to represent a relation candidate by capturing the dependency structure around the target word pair. This bidirectional structure propagates to each node not only the information from the leaves but also information from the root. This is especially important for relation classification, which makes use of argument nodes near the bottom of the tree, and our top-down LSTM-RNN sends information from the top of the tree to such near-leaf nodes (unlike in standard bottom-up LSTM-RNNs). Note that the two variants of tree-structured LSTM-RNNs by Tai et al. (2015) are not able to represent our target structures, which have a variable number of typed children: the Child-Sum Tree-LSTM does not deal with types and the $N$-ary Tree-LSTM assumes a fixed number of children. We thus propose a new variant of tree-structured LSTM-RNN that shares weight matrices $U$s for same-type children and also allows a variable number of children. For this variant, we calculate $n_{lt}$-dimensional vectors in the LSTM unit at the $t$-th node with $C(t)$ children using the following equations:
$$
\begin{aligned}
i_t &= \sigma\Big(W^{(i)} x_t + \sum_{l \in C(t)} U^{(i)}_{m(l)} h_{tl} + b^{(i)}\Big), \\
f_{tk} &= \sigma\Big(W^{(f)} x_t + \sum_{l \in C(t)} U^{(f)}_{m(k)m(l)} h_{tl} + b^{(f)}\Big), \\
o_t &= \sigma\Big(W^{(o)} x_t + \sum_{l \in C(t)} U^{(o)}_{m(l)} h_{tl} + b^{(o)}\Big), \\
u_t &= \tanh\Big(W^{(u)} x_t + \sum_{l \in C(t)} U^{(u)}_{m(l)} h_{tl} + b^{(u)}\Big), \\
c_t &= i_t \odot u_t + \sum_{l \in C(t)} f_{tl} \odot c_{tl}, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$
where $m(\cdot)$ is a type mapping function.
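A single node update of such a typed, variable-arity tree LSTM might look like the following NumPy sketch. It assumes one forget gate per child and weight matrices indexed by child type (the forget-gate matrices by the types of both the gated child and the summed child); the names `init_tree_params` and `tree_lstm_node` and the random initialization are illustrative, and only the bottom-up direction is shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_tree_params(n_in, n_hidden, n_types, rng):
    def mat(*shape):
        return rng.normal(scale=0.1, size=shape)

    p = {}
    for g in ("i", "o", "u"):
        # U holds one (n_hidden x n_hidden) matrix per child type.
        p[g] = (mat(n_hidden, n_in), mat(n_types, n_hidden, n_hidden), np.zeros(n_hidden))
    # Forget-gate matrices: indexed by (type of gated child, type of summed child).
    p["f"] = (mat(n_hidden, n_in), mat(n_types, n_types, n_hidden, n_hidden), np.zeros(n_hidden))
    return p


def tree_lstm_node(x_t, children, p):
    """children: list of (h, c, type_id) triples; may be empty for leaves."""
    n_hidden = p["i"][2].shape[0]

    def child_sum(U):
        # Sum over children of U[type of child] @ h_child.
        total = np.zeros(n_hidden)
        for h_l, _, type_l in children:
            total += U[type_l] @ h_l
        return total

    def gate(name, act):
        W, U, b = p[name]
        return act(W @ x_t + child_sum(U) + b)

    i_t, o_t, u_t = gate("i", sigmoid), gate("o", sigmoid), gate("u", np.tanh)
    W_f, U_f, b_f = p["f"]
    c_t = i_t * u_t
    for _, c_k, type_k in children:
        # Separate forget gate for each child k.
        f_tk = sigmoid(W_f @ x_t + child_sum(U_f[type_k]) + b_f)
        c_t = c_t + f_tk * c_k
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

A top-down pass can reuse the same update by treating a node's parent as its only "child", with a separate parameter set.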
To investigate appropriate structures to represent relations between a pair of target words, we experiment with three structure options. We primarily employ the shortest path structure (SPTree), which captures the core dependency path between a target word pair and is widely used in relation classification models, e.g., (Bunescu and Mooney, 2005; Xu et al., 2015a). We also try two other dependency structures: SubTree and FullTree. SubTree is the subtree under the lowest common ancestor of the target word pair. This provides additional modifier information to the path and the word pair in SPTree. FullTree is the full dependency tree. This captures context from the entire sentence. While we use one node type for SPTree, we define two node types for SubTree and FullTree, i.e., one for nodes on shortest paths and one for all other nodes. We use the type mapping function $m(\cdot)$ to distinguish these two node types.
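The node sets behind the three structure options can be derived from a head-index array, as in the sketch below. The function names, the `-1` root convention, and the hand-written parse of the example sentence are assumptions for illustration; they are not produced by the parser used in the paper.

```python
def path_to_root(heads, i):
    """Return the node indices from i up to the root (head == -1)."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def sp_tree(heads, a, b):
    """Lowest common ancestor and node set of the shortest path between a and b."""
    pa, pb = path_to_root(heads, a), path_to_root(heads, b)
    ancestors_b = set(pb)
    lca = next(n for n in pa if n in ancestors_b)
    nodes = set(pa[:pa.index(lca) + 1]) | set(pb[:pb.index(lca)])
    return lca, nodes


def sub_tree(heads, lca):
    """All nodes in the subtree rooted at the lowest common ancestor."""
    return {i for i in range(len(heads)) if lca in path_to_root(heads, i)}


def node_types(n_tokens, path_nodes):
    """Type mapping m: 0 for nodes on the shortest path, 1 for all others."""
    return [0 if i in path_nodes else 1 for i in range(n_tokens)]


# Assumed parse of: Sidney Yates was born in Chicago
heads = [1, 3, 3, -1, 3, 4]
lca, sp_nodes = sp_tree(heads, 1, 5)        # Yates and Chicago
# lca == 3 (born); sp_nodes == {1, 3, 4, 5}
full_tree = set(range(len(heads)))          # FullTree uses every node
```

For SubTree and FullTree, `node_types` supplies the two node types that distinguish shortest-path nodes from the rest.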

Stacking Sequence and Dependency Layers
We stack the dependency layers (corresponding to relation candidates) on top of the sequence layer to incorporate both word sequence and dependency tree structure information into the output. The dependency-layer LSTM unit at the $t$-th word receives as input $x_t = [s_t; v_t^{(d)}; v_t^{(e)}]$, i.e., the concatenation of its corresponding hidden state vectors $s_t$ in the sequence layer, dependency type embedding $v_t^{(d)}$ (denotes the type of dependency to the parent; see footnote 3), and label embedding $v_t^{(e)}$ (corresponds to the predicted entity label).

Relation Classification
We incrementally build relation candidates using all possible combinations of the last words of detected entities, i.e., words with L or U labels in the BILOU scheme, during decoding. For instance, in Fig. 1, we build a relation candidate using Yates with an L-PER label and Chicago with a U-LOC label. For each relation candidate, we realize the dependency layer $d_p$ (described above) corresponding to the path between the word pair $p$ in the relation candidate, and the NN receives a relation candidate vector constructed from the output of the dependency tree layer, and predicts its relation label. We treat a pair as a negative relation when the detected entities are wrong or when the pair has no relation. We represent relation labels by type and direction, except for negative relations that have no direction.
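Candidate generation from BILOU tags can be sketched as follows. The function name `relation_candidates` is hypothetical, and the sketch only enumerates word-index pairs; scoring each pair in both directions is left to the classifier.

```python
from itertools import combinations


def relation_candidates(tags):
    """Pair up the last words of detected entities (L- and U- tagged words)."""
    entity_ends = [i for i, tag in enumerate(tags) if tag[:2] in ("L-", "U-")]
    return list(combinations(entity_ends, 2))


tags = ["B-PER", "L-PER", "O", "O", "O", "U-LOC"]
candidates = relation_candidates(tags)
# candidates == [(1, 5)], i.e., the pair (Yates, Chicago) in the assumed tokenization
```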
The relation candidate vector is constructed as the concatenation $d_p = [\uparrow h_{p_A}; \downarrow h_{p_1}; \downarrow h_{p_2}]$, where $\uparrow h_{p_A}$ is the hidden state vector of the top LSTM unit in the bottom-up LSTM-RNN (representing the lowest common ancestor of the target word pair $p$), and $\downarrow h_{p_1}$, $\downarrow h_{p_2}$ are the hidden state vectors of the two LSTM units representing the first and second target words in the top-down LSTM-RNN. All the corresponding arrows are shown in Fig. 1.
Footnote 3: We use the dependency to the parent since the number of children varies. Dependency types can also be incorporated into $m(\cdot)$, but this did not help in initial experiments.
Similarly to the entity detection, we employ a two-layered NN with an $n_{hr}$-dimensional hidden layer $h^{(r)}$ and a softmax output layer (with weight matrices $W$, bias vectors $b$):
$$
\begin{aligned}
h_p^{(r)} &= \tanh\left(W^{(r_h)} d_p + b^{(r_h)}\right), \\
y_p &= \mathrm{softmax}\left(W^{(r_y)} h_p^{(r)} + b^{(r_y)}\right).
\end{aligned}
$$
We construct the input $d_p$ for relation classification from tree-structured LSTM-RNNs stacked on sequential LSTM-RNNs, so the contribution of the sequence layer to the input is indirect. Furthermore, our model uses words for representing entities, so it cannot fully use the entity information. To alleviate these problems, we directly concatenate the average of hidden state vectors for each entity from the sequence layer to the input $d_p$ to relation classification, i.e.,
$$
d'_p = \Big[d_p;\ \frac{1}{|I_{p_1}|} \sum_{i \in I_{p_1}} s_i;\ \frac{1}{|I_{p_2}|} \sum_{i \in I_{p_2}} s_i\Big] \quad \text{(Pair)},
$$
where $I_{p_1}$ and $I_{p_2}$ represent sets of word indices in the first and second entities.
Also, we assign two labels to each word pair in prediction since we consider both left-to-right and right-to-left directions. When the predicted labels are inconsistent, we select the positive and more confident label, similar to Xu et al. (2015a).
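The entity-averaging step (Pair) amounts to a mean over each entity's sequence-layer vectors followed by a concatenation. The sketch below assumes NumPy arrays for all vectors; the function name `pair_augmented_input` and the toy dimensions are illustrative.

```python
import numpy as np


def pair_augmented_input(d_p, seq_states, first_idx, second_idx):
    """Append the averaged sequence-layer states of both entities to d_p."""
    avg_first = np.mean([seq_states[i] for i in first_idx], axis=0)
    avg_second = np.mean([seq_states[i] for i in second_idx], axis=0)
    return np.concatenate([d_p, avg_first, avg_second])


# Toy example: 4-dimensional sequence states, 3-dimensional d_p.
seq_states = [np.full(4, float(i)) for i in range(6)]
d_p = np.zeros(3)
augmented = pair_augmented_input(d_p, seq_states, [0, 1], [5])
# augmented has 3 + 4 + 4 = 11 dimensions
```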

Training
We update the model parameters including weights, biases, and embeddings by BPTT and Adam (Kingma and Ba, 2015) with gradient clipping, parameter averaging, and L2-regularization (we regularize weights $W$ and $U$, not the bias terms $b$). We also apply dropout (Srivastava et al., 2014) to the embedding layer and to the final hidden layers for entity detection and relation classification.
We employ two enhancements, scheduled sampling (Bengio et al., 2015) and entity pretraining, to alleviate the problem of unreliable prediction of entities in the early stage of training, and to encourage building positive relation instances from the detected entities. In scheduled sampling, we use gold labels as prediction with probability $\epsilon_i$, which depends on the number of epochs $i$ during training, if the gold labels are legal. As for $\epsilon_i$, we choose the inverse sigmoid decay $\epsilon_i = k/(k + \exp(i/k))$, where $k$ ($\geq 1$) is a hyper-parameter that adjusts how often we use the gold labels as prediction. Entity pretraining is inspired by Pentina et al. (2015), and we pretrain the entity detection model using the training data before training the entire model parameters.
ACE05 defines 7 coarse-grained entity types and 6 coarse-grained relation types between entities. We use the same data splits, preprocessing, and task settings as Li and Ji (2014). We report the primary micro F1-scores as well as micro precision and recall on both entity and relation extraction to better explain model performance. We treat an entity as correct when its type and the region of its head are correct. We treat a relation as correct when its type and argument entities are correct; we thus treat all non-negative relations on wrong entities as false positives.
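The inverse sigmoid decay used for scheduled sampling in training can be sketched as follows. The function names are hypothetical, and the check that gold labels are legal is omitted from this sketch.

```python
import math
import random


def gold_probability(epoch, k):
    """Inverse sigmoid decay: probability of using the gold label at a given epoch."""
    return k / (k + math.exp(epoch / k))


def choose_label(predicted, gold, epoch, k, rng=random):
    """Return the gold label with the decayed probability, else the prediction."""
    return gold if rng.random() < gold_probability(epoch, k) else predicted


# The probability shrinks as training proceeds, so the model gradually
# relies on its own predictions.
schedule = [round(gold_probability(i, k=5), 3) for i in (0, 10, 20, 40)]
```

Larger values of `k` keep gold labels in use for more epochs; for example, with `k = 1` the probability at epoch 0 is exactly 0.5.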

Results and Discussion
ACE04 defines the same 7 coarse-grained entity types as ACE05 (Doddington et al., 2004), but defines 7 coarse-grained relation types.We follow the cross-validation setting of Chan and Roth (2011) and Li and Ji (2014), and the preprocessing and evaluation metrics of ACE05.
SemEval-2010 Task 8 defines 9 relation types between nominals and a tenth type Other when two nouns have none of these relations (Hendrickx et al., 2010).We treat this Other type as a negative relation type, and no direction is considered.The dataset consists of 8,000 training and 2,717 test sentences, and each sentence is annotated with a relation between two given nominals.We randomly selected 800 sentences from the training set as our development set.We followed the official task setting, and report the official macro-averaged F1-score (Macro-F1) on the 9 relation types.
For more details of the data and task settings, please refer to the supplementary material.

Experimental Settings
We implemented our model using the cnn library. We parsed the texts using the Stanford neural dependency parser (Chen and Manning, 2014) with the original Stanford Dependencies. Based on preliminary tuning, we fixed embedding dimensions $n_w$ to 200, $n_p$, $n_d$, $n_e$ to 25, and dimensions of intermediate layers ($n_{ls}$, $n_{lt}$ of LSTM-RNNs and $n_{he}$, $n_{hr}$ of hidden layers) to 100. We initialized word vectors via word2vec (Mikolov et al., 2013) trained on Wikipedia and randomly initialized all other parameters. We tuned hyper-parameters using development sets for ACE05 and SemEval-2010 Task 8 to achieve high primary (Micro- and Macro-) F1-scores. For ACE04, we directly employed the best parameters for ACE05. The hyper-parameter settings are shown in the supplementary material. For SemEval-2010 Task 8, we also omitted the entity detection and label embeddings since only target nominals are annotated and the task defines no entity types. Our statistical significance results are based on the Approximate Randomization (AR) test (Noreen, 1989).
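For readers unfamiliar with the AR test, a simplified paired version is sketched below. It is an illustration under stated assumptions: it uses the difference in mean per-item scores as the test statistic, whereas a test on F1-scores would recompute F1 on each shuffled assignment; the function name, trial count, and seed are arbitrary choices.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Two-sided paired AR test on the difference of mean per-item scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:      # randomly swap the two systems' outputs
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A small p-value indicates that the observed difference is unlikely to arise when system outputs are exchanged at random.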

End-to-end Relation Extraction Results
Table 1 compares our model with the state-of-the-art feature-based model of Li and Ji (2014) on final test sets, and shows that our model performs better than the state-of-the-art model.
To analyze the contributions and effects of the various components of our end-to-end relation extraction model, we perform ablation tests on the ACE05 development set (Table 2). The performance slightly degraded without scheduled sampling, and the performance significantly degraded when we removed entity pretraining or removed both (p<0.05). This is reasonable because the model can only create relation instances when both of the entities are found and, without these enhancements, it may be too late to find some relations. Removing label embeddings did not affect
We also show the performance without sharing parameters, i.e., embedding and sequence layers, for detecting entities and relations (−Shared parameters); we first train the entity detection model, detect entities with the model, and build a separate relation extraction model using the detected entities, i.e., without entity detection. This setting can be regarded as a pipeline model since two separate models are trained sequentially. Without the shared parameters, performance in both entity detection and relation classification drops slightly, although the differences are not significant. When we removed all the enhancements, i.e., scheduled sampling, entity pretraining, label embedding, and shared parameters, the performance is significantly worse than SPTree (p<0.01), showing that these enhancements provide complementary benefits to end-to-end relation extraction.
Next, we show the performance with different LSTM-RNN structures in Table 3. We first compare the three input dependency structures (SPTree, SubTree, FullTree) for tree-structured LSTM-RNNs. Performances on these three structures are almost the same when we distinguish the nodes in the shortest paths from other nodes, but when we do not distinguish them (-SP), the information outside of the shortest path, i.e., FullTree (-SP), significantly hurts performance (p<0.05). We then compare our tree-structured LSTM-RNN (SPTree) with the Child-Sum tree-structured LSTM-RNN on the shortest path of Tai et al. (2015). Child-Sum performs worse than our SPTree model, but not with as big a decrease as above. This may be because the difference in the models appears only on nodes that have multiple children and all the nodes except for the least common node have one child.
We finally show results with two counterparts of sequence-based LSTM-RNNs using the shortest path (last two rows in Table 3). SPSeq is a bidirectional LSTM-RNN on the shortest path. The LSTM unit receives input from the sequence layer concatenated with embeddings for the surrounding dependency types and directions. We concatenate the outputs of the two RNNs for the relation candidate. SPXu is our adaptation of the shortest path LSTM-RNN proposed by Xu et al. (2015b) to match our sequence-layer based model. This has two LSTM-RNNs for the left and right subpaths of the shortest path. We first calculate the max pooling of the LSTM units for each of these two RNNs, and then concatenate the outputs of the pooling for the relation candidate. The comparison with these sequence-based LSTM-RNNs indicates that a tree-structured LSTM-RNN is comparable to sequence-based ones in representing shortest paths.
Overall, the performance comparison of the LSTM-RNN structures in Table 3 shows that for end-to-end relation extraction, selecting the appropriate tree structure representation of the input (i.e., the shortest path) is more important than the choice of the LSTM-RNN structure on that input (i.e., sequential versus tree-based).

Relation Classification Analysis Results
To thoroughly analyze the relation classification part alone, e.g., comparing different LSTM structures, architecture components such as hidden layers and input information, and classification task settings, we use the SemEval-2010 Task 8. This dataset, often used to evaluate NN models for relation classification, annotates only relation-related nominals (unlike ACE datasets), so we can focus cleanly on the relation classification part. We first report official test set results in Table 4. Our novel LSTM-RNN model is comparable to both the state-of-the-art CNN-based models on this task with or without external sources, i.e., WordNet, unlike the previous best LSTM-RNN model (Xu et al., 2015b).
Next, we compare different LSTM-RNN structures in Table 5. As for the three input dependency structures (SPTree, SubTree, FullTree), FullTree performs significantly worse than other structures regardless of whether or not we distinguish the nodes in the shortest paths from the other nodes, which hints that the information outside of the shortest path significantly hurts the performance (p<0.05). We also compare our tree-structured LSTM-RNN (SPTree) with sequence-based LSTM-RNNs (SPSeq and SPXu) and tree-structured LSTM-RNNs (Child-Sum). All these LSTM-RNNs perform slightly worse than our SPTree model, but the differences are small. Overall, for relation classification, although the performance comparison of the LSTM-RNN structures in Table 5 produces different results on FullTree as compared to the results on ACE05 in Table 3, the trend still holds that selecting the appropriate tree structure representation of the input is more important than the choice of the LSTM-RNN structure on that input.
Finally, Table 6 summarizes the contribution of several model components and training settings on SemEval relation classification. We first remove the hidden layer by directly connecting the LSTM-RNN layers to the softmax layers, and found that this slightly degraded performance, but the difference was small. We then skip the sequence layer and directly use the word and POS embeddings for the dependency layer. Removing the sequence layer or entity-related information from the sequence layer (−Pair) slightly degraded performance, and, on removing both, the performance dropped significantly (p<0.05). This indicates that the sequence layer is necessary but the last words of nominals are almost enough for expressing the relations in this task.
When we replace the Stanford neural dependency parser with the Stanford lexicalized PCFG parser (Stanford PCFG), the performance slightly dropped, but the difference was small. This indicates that the selection of parsing models is not critical. We also included WordNet, and this slightly improved the performance (+WordNet), but the difference was small. Lastly, for the generation of relation candidates, generating only left-to-right candidates slightly degraded the performance, but the difference was small and hence the creation of right-to-left candidates was not critical. Treating the inverse relation candidate as a negative instance (Negative sampling) also performed comparably to other generation methods in our model (unlike Xu et al. (2015a), which showed a significant improvement over generating only left-to-right candidates).

Conclusion
We presented a novel end-to-end relation extraction model that represents both word sequence and dependency tree structures by using bidirectional sequential and bidirectional tree-structured LSTM-RNNs. This allowed us to represent both entities and relations in a single model, achieving gains over the state-of-the-art, feature-based system on end-to-end relation extraction (ACE04 and ACE05), and showing favorably comparable performance to recent state-of-the-art CNN-based models on nominal relation classification (SemEval-2010 Task 8).
Our evaluation and ablation led to three key findings. First, the use of both word sequence and dependency tree structures is effective. Second, training with the shared parameters improves relation extraction accuracy, especially when employed with entity pretraining, scheduled sampling, and label embeddings. Finally, the shortest path, which has been widely used in relation classification, is also appropriate for representing tree structures in neural LSTM models.

Data and Task Settings
We evaluate on three datasets: ACE05 and ACE04 for end-to-end relation extraction, and SemEval-2010 Task 8 for relation classification. We use the first two datasets as our primary target, and use the last one to thoroughly analyze and ablate the relation classification part of our model.

Table 1: Comparison with the state-of-the-art on the ACE05 test set and ACE04 dataset.

Table 3: Comparison of LSTM-RNN structures on the ACE05 development dataset.