Going out on a limb: Joint Extraction of Entity Mentions and Relations without Dependency Trees

We present a novel attention-based recurrent neural network for joint extraction of entity mentions and relations. We show that attention along with long short term memory (LSTM) network can extract semantic relations between entity mentions without having access to dependency trees. Experiments on Automatic Content Extraction (ACE) corpora show that our model significantly outperforms feature-based joint model by Li and Ji (2014). We also compare our model with an end-to-end tree-based LSTM model (SPTree) by Miwa and Bansal (2016) and show that our model performs within 1% on entity mentions and 2% on relations. Our fine-grained analysis also shows that our model performs significantly better on Agent-Artifact relations, while SPTree performs better on Physical and Part-Whole relations.


Introduction
Extraction of entities and their relations from text belongs to a very well-studied family of structured prediction tasks in NLP. There are several NLP tasks such as fine-grained opinion mining (Choi et al., 2006), semantic role labeling (Gildea and Jurafsky, 2002), etc., which have a similar structure; thus making it an important and a challenging task.
Several methods have been proposed for entity mention and relation extraction at the sentencelevel. These can be broadly categorized into -1) pipeline models that treat the identification of entity mentions (Nadeau and Sekine, 2007) and relation classification (Zhou et al., 2005) as two separate tasks; and 2) joint models, also the more recent, which simultaneously identify the entity mention and relations (Li and Ji, 2014;Miwa and Sasaki, 2014). Joint models have been argued to perform better than the pipeline models as knowledge of the typed relation can increase the confidence of the model on entity extraction and vice versa.
Recurrent networks (RNNs) (Elman, 1990) have recently become very popular for sequence tagging tasks such as entity extraction that involves a set of contiguous tokens. However, their ability to identify relations between non-adjacent tokens in a sequence, e.g., the head nouns of two entities, is less explored. For these tasks, RNNs that make use of tree structures have been deemed more suitable. Miwa and Bansal (2016), for example, propose an RNN comprised of a sequencebased long short term memory (LSTM) for entity identification and a separate tree-based dependency LSTM layer for relation classification using shared parameters between the two components. As a result, their model depends critically on access to dependency trees, restricting it to sentencelevel extraction and to languages for which (good) dependency parsers exist. Also, their model does not jointly extract entities and relations; they first extract all entities and then perform relation classification on all pairs of entities in a sentence.
In our previous work (Katiyar and Cardie, 2016), we address the same task in an opinion extraction context. Our LSTM-based formulation explicitly encodes distance between the head of entities into opinion relation labels. The output space of our model is quadratic in size of the entity and relation label set and we do not specifically identify the relation type. Unfortunately, adding relation type makes the output label space very sparse, making it difficult for the model to learn.
In this paper, we propose a novel RNN-based model for the joint extraction of entity mentions and relations. Unlike other models, our model does not depend on any dependency tree information. Our RNN-based model is a multi-layer bidirectional LSTM over a sequence. We encode the output sequence from left-to-right. At each time step, we use an attention-like model on the previously decoded time steps, to identify the tokens in a specified relation with the current token. We also add an additional layer to our network to encode the output sequence from right-to-left and find significant improvement on the performance of relation identification using bi-directional encoding.
Our model significantly outperforms the feature-based structured perceptron model of Li and Ji (2014), showing improvements on both entity and relation extraction on the ACE05 dataset. In comparison to the dependency treebased LSTM model of Miwa and Bansal (2016), our model performs within 1% on entities and 2% on relations on ACE05 dataset. We also find that our model performs significantly better than their tree-based model on the AGENT-ARTIFACT relation, while their tree-based model performs better on PHYSICAL and PART-WHOLE relations; the two models perform comparably on all other relation types. The very competitive performance of our non-tree-based model bodes well for relation extraction of non-adjacent entities in low-resource languages that lack good parsers.
In the sections that follow, we describe related work (Section 2); our bi-directional LSTM model with attention (Section 3); the training (Section 4); the experiments on ACE dataset (Section 5); results (Section 6); error analysis (Section 7) and conclusion (Section 8).

Related Work
RNNs (Hochreiter and Schmidhuber, 1997) have been recently applied to many sequential modeling and prediction tasks, such as machine translation (Bahdanau et al., 2015;, named entity recognition (NER) (Hammerton, 2003), opinion mining (Irsoy and Cardie, 2014). Variants such as adding CRF-like objective on top of LSTMs have been found to produce state-of-the-art results on several sequence prediction NLP tasks (Collobert et al., 2011;Huang et al., 2015;Katiyar and Cardie, 2016). These models assume conditional independence at the output layer whereas the model we propose in this paper does not assume any conditional indepen-dence at the output layer, allowing it to model an arbitrary distribution over output sequences.
Relation classification has been widely studied as a stand-alone task, assuming that the arguments of the relations are known in advance. There have been several models proposed including featurebased models (Bunescu and Mooney, 2005;Zelenko et al., 2003) and neural network based models (Socher et al., 2012;dos Santos et al., 2015;Hashimoto et al., 2015;Xu et al., 2015a,b).
For joint-extraction of entities and relations, feature-based structured prediction models (Li and Ji, 2014;Miwa and Sasaki, 2014), joint inference integer linear programming models (Yih and Roth, 2007;Yang and Cardie, 2013), card-pyramid parsing (Kate and Mooney, 2010) and probabilistic graphical models (Yu and Lam, 2010;Singh et al., 2013) have been proposed. In contrast, we propose a neural network model which does not depend on the availability of any features such as part of speech (POS) tags, dependency trees, etc.
Recently, Miwa and Bansal (2016) proposed an end-to-end LSTM based sequence and treestructured model. They extract entities via a sequence layer and relations between the entities via the shortest path dependency tree network. In this paper, we try to investigate recurrent neural networks with attention for extracting semantic relations between entity mentions without using any dependency parse tree features. We also present the first neural network based joint model that can extract entity mentions and relations along with the relation type. In our previous work (Katiyar and Cardie, 2016), as explained earlier, we proposed a LSTM-based model for joint extraction of opinion entities and relations, but no relation types. This model cannot be directly extended to include relation types as the output space becomes sparse making it difficult for the model to learn.
Recent advances in recurrent neural network has seen the application of attention on recurrent neural networks to obtain a representation weighted by the importance of tokens in the sequence model. Such models have been very frequently used in question-answering tasks (for recent examples, see Chen et al. (2016) and Lee et al. (2016)), machine translation (Luong et al., 2015;Bahdanau et al., 2015), and many other NLP applications. Pointer networks (Vinyals et al., 2015), an adaptation of attention models, use these tokenlevel weights as pointers to the input elements.  Figure 1: Gold standard annotation for an example sentence from ACE05 dataset. Zhai et al. (2017), for example, have used these for neural chunking, and Nallapati et al. (2016) and Cheng and Lapata (2016), for summarization. However, to the best of our knowledge, these networks have not been used for joint extraction of entity mentions and relations. We present first such attempt to use these attention models with recurrent neural networks for joint extraction of entity mentions and relations.

Model
Our model comprises of a multi-layer bidirectional recurrent network which learns a representation for each token in the sequence. We use the hidden representation from the top layer for joint entity and relation extraction. For each token in the sequence, we output an entity tag and a relation tag. The entity tag corresponds to the entity type, whereas the relation tag is a tuple of pointers to related entities and their respective relation types. Figure 1 shows the annotation for an example sentence from the dataset. We transform the relation tags from entity level to token level. For example, we separately model the relation "ORG-AFF" for each token in the entity "ITV News". Thus, we model the relations between "ITV" and "Martin Geissler", and "News" and "Martin Geissler" separately. We employ a pointer-like network on top of the sequence layer in order to find the relation tag for each token as shown in Figure 2. At each time step, the network utilizes the information available about all output tags from the previous time steps in order to output the entity tag and relation tag jointly for the current token.

Multi-layer Bi-directional Recurrent Network
We use multi-layer bi-directional LSTMs for sequence tagging because LSTMs are more capable of capturing long-term dependencies between tokens, making it ideal for both entity mention and relation extraction. Using LSTMs, we can compute the hidden state − → h t in the forward direction and ← − h t in the backward direction for every token as below: For every token t in the subsequent layer l, we combine the representations − → h l−1 t and ← − h l−1 t from previous layer l-1 and feed it as an input. In this paper, we only use the hidden state from the last layer L for output layer and compute the top hidden layer representation as below: V are weight matrices for combining hidden representations from the two directions.

Entity detection
We formulate entity detection as a sequence labeling task using BILOU scheme similar to Li and Ji (2014) and Miwa and Bansal (2016). We assign each token in the entity with the tag B appended with the entity type if it is the beginning of the entity, I for inside of an entity, L for the end of the entity or U if there is only one token in the entity. Figure 1 shows an example of the entity tag sequence assigned to the sentence. For each token in the sequence, we perform a softmax over all candidate tags to output the most likely tag: Our network structure as shown in Figure 2 also contains connections from the output y t−1 of the previous time step to the current top hidden layer. Thus our outputs are not conditionally independent from each other. In order to add connections from y t−1 , we transform this output k into a label embedding b k t−1 1 . We represent each label type Figure 2: Our network structure based on bi-directional LSTMs for joint entity and relation extraction. This snapshot shows the network when encoding the relation tag for the word "Safwan" in the sentence. The dotted lines in the figure show that top hidden layer and label embeddings for tokens is copied into relation layer. The pointers at attention layer indicate the probability distribution over tokens, the length of the pointers is used to denote the probability value.
k with a dense representation b k . We compute the output layer representations as: We decode the output sequence from left to right in a greedy manner.

Attention Model
We use attention model for relation extraction. Attention models, over an encoder sequence of representations z, can compute a soft probability distribution p over these learned representations, where d i is the i th token in decoder sequence. These probabilities are an indication of the importance of different tokens in the encoder sequence: v is a weight matrix for attention which transforms the hidden representations into attention scores.
We use pointer networks (Vinyals et al., 2015) in our approach, which are a variation of these attention models. Pointer networks interpret these p i t as the probability distribution over the input encoding sequence and use u i t as pointers to the input elements. We can use these pointers to encode relation between the current token and the previous predicted tokens, making it fit for relation extraction as explained in Section 3.4.

Relation detection
We formulate relation extraction also as a sequence labeling task. For each token, we want to find the tokens in the past that the current token is related to along with its relation type. In Figure 1, "Safwan" is related to the tokens "Martin" as well as "Geissler" by the relation type "PHYS". For simplicity, let us assume that there is only one previous token the current token is related to when training, i.e., "Safwan" is related to "Geissler" via PHYS relation. We can extend our approach to output multiple relations as explained in Section 4.
We use pointer networks as described in Sec-tion 3.3. At each time step, we stack the top hidden layer representations from the previous time steps z ≤t 2 and its corresponding label embeddings b ≤t . We only stack the top hidden layer representations for the tokens which were predicted as non-O's for previous time steps as shown in Figure 2. Our decoding representation at time t is the concatenation of z t and b t . The attention probabilities can now be computed as below: Thus, p t ≤t corresponds to the probability of each token, in the sequence so far, being related to the current token at time step t. For the case of NONE relations, the token at t is related to itself.
We also want to find the type of the relations. In order to achieve this, we add an extra dimension to v corresponding to the size of relation types R space. Thus, u i t is no longer a score but a R dimensional vector. We then take softmax over this vector of size O(|z ≤t |×R) to find the most likely tuple of pointer to the related entity and its relation type.

Bi-directional Encoding
Bi-directional LSTMs have been found to be able to capture context better than plain left-to-right LSTMs, based on their performance on various NLP tasks (Irsoy and Cardie, 2014). Also,  found that their performance on machine translation task improved on reversing the input sentences during training. Inspired by these developments, we experiment with bi-directional encoding at the output layer. We add another top hidden layer on Bi-LSTM in Figure 2 which encodes the output sequence from rightto-left. The two encoding share the same multilayer bi-directional LSTM except for the top hidden layer. Thus, we have two output layers in our network which output the entity tags and relation tags separately. At inference time, we employ heuristics to combine the output from the two directions.

Training
We train our network by maximizing the logprobability of the correct entity E and relation R tag sequences jointly given the sentence S as below: Thus, we can decompose our objective into the sum of log-probabilities over entity sequence and relation sequence. We use the gold entity tags while training. As shown in Figure 2, we input the label embedding from the previous time step to the top hidden layer at the current time step along with the other recurrent inputs. During training, we pass the gold label embedding to the next time step which enables better training of our model. However, at test time when the gold label is not available we use the predicted label at previous time step as input to the current step. At inference time, we can greedily decode the sequence to find the most likely entity E and relation R tag sequences: Since, we add another top layer to encode tag sequences in the reverse order as explained in Section 3.5, there may be conflicts in the output. We select the positive and more confident label similar to Miwa and Bansal (2016). Miwa and Bansal (2016). Miwa and Bansal (2016) present each pair of entities to their model for relation classification. In our approach, we use pointer networks to identify the related entities. Thus, for our approach described so far if we only compute the argmax on our objective then we limit our model to output only one relation label per token. However, from our analysis of the dataset, an entity may be related to more than one entity in the sentence. Hence, we modify our objective to include multiple relations. In Figure 2, token "Safwan" is related to both tokens "Martin" and "Geissler" of the entity "Martin Geissler", hence we assign probability of 0.5 to both these tokens. This can be easily expanded to include tokens from other related entities, such that we assign equal probability 1 N to all tokens 3 depending on the number N of these related tokens.

Multiple Relations Our approach to relation extraction is different from
The log-probability for the entity part remain the same as in our objective discussed in Section 4, however we modify the relation log-probability as below: |j:r i,j >0| r i,j log p(r i,j |e ≤i , r <i , S, θ) where, r i is the true distribution over relation label space and r i is the softmax output from our model. From empirical analysis, we find that r i is generally sparse and hence using a cross entropy objective like this can be useful to find multiple relations. We can also use Sparsemax (Martins and Astudillo, 2016) instead of softmax which is more suitable for sparse distributions. However, we leave it for future work.
At inference time, we output all the labels with probability value above a certain threshold. We adapt this threshold based on the validation set.

Data
We evaluate our proposed model on the two datasets from the Automatic Content Extraction (ACE) program -ACE05 and ACE04. There are 7 main entity types namely Person (PER), Organization (ORG), Geographical Entities (GPE), Location (LOC), Facility (FAC), Weapon (WEA) and Vehicle (VEH). For each entity, both entity mentions and its head phrase are annotated. For the scope of this paper, we only use the entity head phrase similar to Li and Ji (2014) and Miwa and Bansal (2016). Also, there are relation types namely Physical (PHYS), Person-Social (PER-SOC), Organization-Affiliation (ORG-AFF), Agent-Artifact (ART), GPE-Affiliation (GPE-AFF).
ACE05 has a total of 6 relation types including PART-WHOLE. We use the same data splits as Li and Ji (2014) and Miwa and Bansal (2016) such that there are 351 documents for training, 80 for development and the remaining 80 documents for the test set.
ACE04 has 7 relation types with an additional Discourse (DISC) type and split ORG-AFF relation type into ORG-AFF and OTHER-AFF. We perform 5-fold cross validation similar to Chan and Roth (2011) for fair comparison with the state-of-theart.

Evaluation Metrics
In order to compare our system with the previous systems, we report micro F1-scores, Precision and Recall on both entities and relations similar to Li and Ji (2014) and Miwa and Bansal (2016). An entity is considered correct if we can identify its head and the entity type correctly. A relation is considered correct if we can identify the head of the argument entities and also the relation type. We also report a combined score when both argument entities and relations are correct.

Baselines and Previous Models
We compare our approach with two previous approaches. The model proposed by Li and Ji (2014) is a feature-based structured perceptron model with efficient beam-search. They employ a segment-based decoder instead of token-based decoding. Their model outperformed previous stateof-the-art pipelined models. Miwa and Sasaki (2014) (SPTree) recently proposed a LSTM-based model with a sequence layer for entity identification, and a tree-based dependency layer which identifies relations between pairs of candidate entities using the shortest dependency path between them. We also employed our previous approach (Katiyar and Cardie, 2016) for extraction of opinion entities and relations to this task. We found that the performance was not competitive with the two approaches mentioned above, performing upto 10 points lower on relations. Hence, we do not include the results in Table 1. Also, Li and Ji (2014) showed that the joint model performs better than the pipelined approaches. Thus, we do not include any pipeline baselines.

Hyperparameters and Training Details
We train our model using Adadelta (Zeiler, 2012) with gradient clipping. We regularize our network using dropout (Srivastava et al., 2014) with the drop-out rate tuned using development set. We initialized our word embeddings  Table 1: Performance on ACE05 test dataset. The dashed ("-") performance numbers were missing in the original paper (Miwa and Bansal, 2016). 1 We ran the system made publicly available by Miwa and Bansal (2016), on ACE05 dataset for filling in the missing values and comparing our system with theirs at fine-grained level.  with 300-dimensional word2vec (Mikolov et al., 2013) word embeddings trained on Google News dataset. We have 3 hidden layers in our network and the dimensionality of the hidden units is 100. All the weights in the network are initialized from small random uniform noise. We tune our hyperparameters based on ACE05 development set and use them for training on ACE04 dataset. Table 1 compares the performance of our system with respect to the baselines on ACE05 dataset. We find that our joint model significantly outperforms the joint structured perceptron model (Li and Ji, 2014) on both entities and relations, despite the unavailability of features such as dependency trees, POS tags, etc. However, if we compare our model to the SPTree models, then we find that their model has better recall on both entities and relations. In Section 7, we perform error analysis to understand the difference in the performance of the two models in detail. We also compare the performance of various encoding schemes in Table 2. We compare the benefits of introducing multiple relations in our objective and bi-directional encoding compared to leftto-right encoding.

Results
Multiple Relations We find that modifying our objective to include multiple relations improves the recall of our system on relations, leading to slight improvement on the overall performance on relations. However, careful tuning of the threshold may further improve precision.
Bi-directional Encoding By adding bidirectional encoding to our system, we find that we can significantly improve the performance of our system compared to left-to-right encoding. It also improves precision compared to left-toright decoding combined with multiple relations objective.
We find that for some relations it is easier to detect them with respect to one of the entities in the entity pair. PHYS relation is easier identified with respect to GPE entity than PER entity. Thus, our bi-directional encoding of relations allows us to encode these relations with respect to both entities in the relation. Table 3 shows the performance of our model on ACE04 dataset. We believe that tuning the hyperparameters of our model can further improve the results on this dataset. As also pointed out by Li and Ji (2014) that ACE05 has better annotation quality, we focused on ACE05 dataset for this work.

Error Analysis
In this section, we perform a fine-grained comparison of our model with respect to the SPTree (Miwa and Bansal, 2016) model. We compare the performance of the two models with respect to entities, relation types and the distance between the relation arguments and provide examples from the test set in Table 6.  Table 3: Performance on ACE04 test dataset. The dashed ("-") performance numbers were missing in the original paper (Miwa and Bansal, 2016).

Entities
We find that our model has lower recall on entity extraction than SPTree as shown in Table 1. Miwa and Bansal (2016), in one of the ablation tests on ACE05 development set, show that their model can gain upto 2% improvement in recall by entity pretraining. Since we propose a jointmodel, we cannot directly apply their pretraining trick on entities separately. We leave it for future work. Li and Ji (2014) mentioned in their analysis of the dataset that there were many "UNK" tokens in the test set which were never seen during training. We verified the same and we hypothesize that for this reason the performance on the entities depends largely on the pretrained word embeddings being used. We found considerable improvements on entity recall when using pretrained word embeddings, if available, for these "UNK" tokens. Miwa and Bansal (2016) also use additional features such as POS tags in addition to pretrained word embeddings at the input layer.   in Table 4. Interestingly, we find that the performance of the two models is varied over different relation types. The dependency tree-based model significantly outperforms our joint-model on PHYS and PART-WHOLE relations, whereas our model is significantly better than tree-based model on ART relation. We show an example sentence (S1) in Table 6, where SPTree model identifies the entities in ART relation correctly but fails to identify ART relation. We compare the performance with respect to PHYS relation in Section 7.3.

Distance-based Analysis
We also compare the performance of the two models on relations based on the distance between the entities in a relation in Table 5. We find that the performance of both the models is very low for distance greater than 7. SPTree model can identify 36 relations out of 131 such relations correctly, while our model can only identify 20 relations in this category. We manually compare the output of the two systems on these cases on several examples to understand the gain of using dependency tree on longer distances. Interestingly, the majority of these relations belong to PHYS type, thus resulting in lower performance on PHYS as discussed in Section 7.2. We found that there were a few instances of co-reference errors as shown in S2 in Table 6. Our model identifies a PHYS relation between "here" and "baghdad", whereas the gold annotation has PHYS relation between "location" and "baghdad". We think that  Table 6: Examples from the dataset with label annotations from SPTree and our model for comparison. The first row for each example is the gold standard.
incorporating these co-reference information during both training and evaluation will further improve the performance of both systems. Another source of error that we found was the inability of our system to extract entities (lower recall) as in S3. Our model could not identify the FAC entity "residence". Hence, we think an improvement on entity performance via methods like pretraining might be helpful in identifying more relations. For distance less than 7, we find that our model has better recall but lower precision, as expected.

Conclusion
In this paper, we propose a novel attention-based LSTM model for joint extraction of entity mentions and relations. Experimentally, we found that our model significantly outperforms feature-rich structured perceptron joint model by Li and Ji (2014). We also compare our model to an endto-end LSTM model by Miwa and Bansal (2016) which comprises of a sequence layer for entity extraction and a tree-based dependency layer for relation classification. We find that our model, without access to dependency trees, POS tags, etc performs within 1% on entities and 2% on relations on ACE05 dataset. We also find that our model performs significantly better than their treebased model on the ART relation, while their treebased model performs better on PHYS and PART-WHOLE relations; the two models perform com-parably on all other relation types.
In future, we plan to explore pretraining methods for our model which were shown to improve recall on entity and relation performance by Miwa and Bansal (2016). We introduce bi-directional output encoding as well as an objective to learn multiple relations in this paper. However, this presents the challenge of combining predictions from the two directions. We use heuristics in this paper to combine the predictions. We think that using probabilistic methods to combine model predictions from both directions may further improve the performance. We also plan to use Sparsemax (Martins and Astudillo, 2016) instead of Softmax for multiple relations, as the former is more suitable for multi-label classification for sparse labels.
It would also be interesting to see the effect of reranking (Collins and Koo, 2005) on our joint model. We also plan to extend the identification of entities to full entity mention span instead of only the head phrase as in Lu and Roth (2015).