Position-aware Attention and Supervised Data Improve Slot Filling

Organized relational knowledge in the form of “knowledge graphs” is important for many applications. However, the ability to populate knowledge bases with facts automatically extracted from documents has improved frustratingly slowly. This paper simultaneously addresses two issues that have held back prior work. We first propose an effective new model, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction. Then we build TACRED, a large (119,474 examples) supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations. The combination of better supervised data and a more appropriate high-capacity model enables much better relation extraction performance. When the model trained on this new dataset replaces the previous relation extraction component of the best TAC KBP 2015 slot filling system, its F1 score increases markedly from 22.2% to 26.7%.


Introduction
A basic but highly important challenge in natural language understanding is being able to populate a knowledge base with relational facts contained in a piece of text. For the text shown in Figure 1, the system should extract triples, or equivalently, knowledge graph edges, such as hPenner, per:spouse, Lisa Dillmani. Combining such extractions, a system can produce a knowledge graph of relational facts between persons, organizations, and locations in the text. This task involves entity recognition, mention coreference and/or entity linking, and relation extraction; we focus on the Penner is survived by his brother, John, a copy editor at the Times, and his former wife, Times sportswriter Lisa Dillman.  most challenging "slot filling" task of filling in the relations between entities in the text. Organized relational knowledge in the form of "knowledge graphs" has become an important knowledge resource. These graphs are now extensively used by search engine companies, both to provide information to end-users and internally to the system, as a way to understand relationships. However, up until now, automatic knowledge extraction has proven sufficiently difficult that most of the facts in these knowledge graphs have been built up by hand. It is therefore a key challenge to show that NLP technology can effectively contribute to this important problem.
Existing work on relation extraction (e.g., Zelenko et al., 2003;Mintz et al., 2009; has been unable to achieve sufficient recall or precision for the results to be usable versus hand-constructed knowledge bases. Supervised training data has been scarce and, while techniques like distant supervision appear to be a promising way to extend knowledge bases at low cost, in practice the training data has often been too noisy for reliable training of relation extraction systems (Angeli et al., 2015). As a result most systems fail to make correct extractions even in apparently straightforward cases like Figure 1,

Example Entity Types & Label
Carey will succeed Cathleen P. Black, who held the position for 15 years and will take on a new role as chairwoman of Hearst Magazines, the company said.  where the best system at the NIST TAC Knowledge Base Population (TAC KBP) 2015 evaluation failed to recognize the relation between Penner and Dillman. 1 Consequently most automatic systems continue to make heavy use of hand-written rules or patterns because it has been hard for machine learning systems to achieve adequate precision or to generalize as well across text types. We believe machine learning approaches have suffered from two key problems: (1) the models used have been insufficiently tailored to relation extraction, and (2) there has been insufficient annotated data available to satisfy the training of data-hungry models, such as deep learning models. This work addresses both of these problems. We propose a new, effective neural network sequence model for relation classification. Its architecture is better customized for the slot filling task: the word representations are augmented by extra distributed representations of word position relative to the subject and object of the putative relation. This means that the neural attention model can effectively exploit the combination of semantic similarity-based attention and positionbased attention. Secondly, we markedly improve the availability of supervised training data by using Mechanical Turk crowd annotation to produce a large supervised training dataset (Table 1), suitable for the common relations between people, organizations and locations which are used in the TAC KBP evaluations. We name this dataset the TAC Relation Extraction Dataset (TACRED), and will make it available through the Linguistic Data Consortium (LDC) in order to respect copyrights on the underlying text.
Combining these two gives a system with markedly better slot filling performance. This is 1 Note: former spouses count as spouses in the ontology.
shown not only for a relation classification task on the crowd-annotated data but also for the incorporation of the resulting classifiers into a complete cold start knowledge base population system. On TACRED, our system achieves a relation classification F 1 score that is 7.9% higher than that of a strong feature-based classifier, and 3.5% higher than that of the best previous neural architecture that we re-implemented. When this model is used in concert with a pattern-based system on the TAC KBP 2015 Cold Start Slot Filling evaluation data, the system achieves an F 1 score of 26.7%, which exceeds the previous state-of-the-art by 4.5% absolute. While this performance certainly does not solve the knowledge base population problemachieving sufficient recall remains a formidable challenge -this is nevertheless notable progress.

A Position-aware Neural Sequence Model Suitable for Relation Extraction
Existing work on neural relation extraction (e.g., Zeng et al., 2014;Nguyen and Grishman, 2015;Zhou et al., 2016) has focused on convolutional neural networks (CNNs), recurrent neural networks (RNNs), or their combination. While these models generally work well on the datasets they are tested on, as we will show, they often fail to generalize to the longer sentences that are common in real-world text (such as in TAC KBP).
We believe that existing model architectures suffer from two problems: (1) Although modern sequence models such as Long Short-Term Memory (LSTM) networks have gating mechanisms to control the relative influence of each individual word to the final sentence representation (Hochreiter and Schmidhuber, 1997), these controls are not explicitly conditioned on the entire sentence being classified; (2) Most existing work either does not explicitly model the positions of entities (i.e., subject and object) in the sequence, or models the positions only within a local region.
Here, we propose a new neural sequence model with a position-aware attention mechanism over an LSTM network to tackle these challenges. This model can (1) evaluate the relative contribution of each word after seeing the entire sequence, and (2) base this evaluation not only on the semantic information of the sequence, but also on the global positions of the entities within the sequence.
We formalize the relation extraction task as follows: Let X = [x 1 , ..., x n ] denote a sentence, where x i is the i-th token. A subject entity s and an object entity o are identified in the sentence, corresponding to two non-overlapping consecutive spans: Given the sentence X and the positions of s and o, the goal is to predict a relation r 2 R (R is the set of relations) that holds between s and o or no relation otherwise.
Inspired by the position encoding vectors used in Collobert et al. (2011) and Zeng et al. (2014), we define a position sequence relative to the subject entity [p s 1 , ..., p s n ], where Here s 1 , s 2 are the starting and ending indices of the subject entity respectively, and p s i 2 Z can be viewed as the relative distance of token x i to the subject entity. Similarly, we obtain a position sequence [p o 1 , ..., p o n ] relative to the object entities. Let x = [x 1 , ..., x n ] be word embeddings of the sentence, obtained using an embedding matrix E. Similarly, we obtain position embedding vectors p s = [p s 1 , ..., p s n ] and p o = [p o 1 , ..., p o n ] using a shared position embedding matrix P respectively. Next, as shown in Figure 2, we obtain hidden state representations of the sentence by feeding x into an LSTM: We define a summary vector q = h n (i.e., the output state of the LSTM). This summary vector encodes information about the entire sentence. Then for each hidden state h i , we calculate an attention weight a i as: Here W h , W q 2 R da⇥d , W s , W o 2 R da⇥dp and v 2 R da are learnable parameters of the network, where d is the dimension of hidden states, d p is the dimension of position embeddings, and d a is the size of attention layer. Additional parameters of the network include embedding matrices E 2 R |V|⇥d and P 2 R (2L 1)⇥dp , where V is the vocabulary and L is the maximum sentence length. We regard attention weight a i as the relative contribution of the specific word to the sentence representation. The final sentence representation z is computed as: z is later fed into a fully-connected layer followed by a softmax layer for relation classification. Note that our model significantly differs from the attention mechanism in Bahdanau et al. (2015) and Zhou et al. (2016) in our use of the summary vector and position embeddings, and the way our attention weights are computed. An intuitive way to understand the model is to view the attention calculation as a selection process, where the goal is to select relevant contexts over irrelevant ones.  Here the summary vector (q) helps the model to base this selection on the semantic information of the entire sentence (rather than on each word only), while the position vectors (p s i and p o i ) provides important spatial information between each word and the entities.

The TAC Relation Extraction Dataset
Previous research has shown that slot filling systems can greatly benefit from supervised data. For example, Angeli et al. (2014b) showed that even a small amount of supervised data can boost the end-to-end F 1 score by 3.9% on the TAC KBP tasks. However, existing relation extraction datasets such as the SemEval-2010 Task 8 dataset (Hendrickx et al., 2009) and the Automatic Content Extraction (ACE) (Strassel et al., 2008) dataset are less useful for this purpose. This is mainly because: (1) these datasets are relatively small for effectively training high-capacity models (see Table 2), and (2) they capture very different types of relations. For example, the SemEval dataset focuses on semantic relations (e.g., Cause-Effect, Component-Whole) between two nominals.
One can further argue that it is easy to obtain a large amount of training data using distant supervision (Mintz et al., 2009). In practice, however, due to the large amount of noise in the induced data, training relation extractors that perform well becomes very difficult. For example, Riedel et al. (2010) show that up to 31% of the distantly supervised labels are wrong when creating training data from aligning Freebase to newswire text.
To tackle these challenges, we collect a large supervised dataset TACRED, targeted towards the TAC KBP relations.
Data collection. We create TACRED based on query entities and annotated system responses in the yearly TAC KBP evaluations. In each year of the TAC KBP evaluation (2009)(2010)(2011)(2012)(2013)(2014)(2015), 100 entities (people or organizations) are given as queries,  for which participating systems should find associated relations and object entities. We make use of Mechanical Turk to annotate each sentence in the source corpus that contains one of these query entities. For each sentence, we ask crowd workers to annotate both the subject and object entity spans and the relation types.  Table 3.
Discussion. Table 1 presents sampled examples from TACRED. Compared to existing datasets, TACRED has four advantages. First, it contains an order of magnitude more relation instances (Table 2), enabling the training of expressive models. Second, we reuse the entity and relation types of the TAC KBP tasks. We believe these relation types are of more interest to downstream applications. Third, we fully annotate all negative instances that appear in our data collection process, to ensure that models trained on TACRED are not biased towards predicting false positives on realworld text. Lastly, the average sentence length in TACRED is 36.2, compared to 19.1 in the Sem-Eval dataset, reflecting the complexity of contexts in which relations occur in real-world text. Due to space constraints, we describe the data collection and validation process, system interfaces, and more statistics and examples of TAC-RED in the supplementary material. We will make TACRED publicly available through the LDC.

Experiments
In this section we evaluate the effectiveness of our proposed model and TACRED on improving slot filling systems. Specifically, we run two sets of experiments: (1) we evaluate model performance on the relation extraction task using TACRED, and (2) we evaluate model performance on the TAC KBP 2015 cold start slot filling task, by training the models on TACRED.

Baseline Models
We compare our model against the following baseline models for relation extraction and slot filling: TAC KBP 2015 winning system. To judge our proposed model against a strong baseline, we compare against Stanford's top performing system on the TAC KBP 2015 cold start slot filling task (Angeli et al., 2015). At the core of this system are two relation extractors: a pattern-based extractor and a logistic regression (LR) classifier. The pattern-based system uses a total of 4,528 surface patterns and 169 dependency patterns. The logistic regression model was trained on approximately 2 million bootstrapped examples (using a small annotated dataset and high-precision pattern system output) that are carefully tuned for TAC KBP slot filling evaluation. It uses a comprehensive feature set similar to the MIML-RE system for relation extraction (Surdeanu et al., 2012), including lemmatized n-grams, sequence NER tags and POS tags, positions of entities, and various features over dependency paths, etc.
Convolutional neural networks. We follow the 1-dimensional CNN architecture by Nguyen and Grishman (2015) for relation extraction. This model learns a representation of the input sentence, by first running a series of convolutional operations on the sentence with various filters, and then feeding the output into a max-pooling layer to reduce the dimension. The resulting representation is then fed into a fully-connected layer followed by a softmax layer for relation classification. As an extension, positional embeddings are also introduced into this model to better capture the relative position of each word to the subject and object entities and were shown to achieve improved results. We use "CNN-PE" to represent the CNN model with positional embeddings.

Dependency-based recurrent neural networks.
In dependency-based neural models, shortest dependency paths between entities are often used as input to the neural networks. The intuition is to eliminate tokens that are potentially less relevant to the classification of the relation. For the example in Figure 1, the shortest dependency path between the two entities is: We follow the SDP-LSTM model proposed by Xu et al. (2015b). In this model, each shortest dependency path is divided into two separate sub-paths from the subject entity and the object entity to the lowest common ancestor node. Each sub-path is fed into an LSTM network, and the resulting hidden units at each word position are passed into a max-over-time pooling layer to form the output of this sub-path. Outputs from the two sub-paths are then concatenated to form the final representation.
In addition to the above models, we also compare our proposed model against an LSTM sequence model without attention mechanism.

Implementation Details
We map words that occur less than 2 times in the training set to a special <UNK> token. We use the pre-trained GloVe vectors (Pennington et al., 2014) to initialize word embeddings. For all the LSTM layers, we find that 2-layer stacked LSTMs generally work better than one-layer LSTMs. We minimize cross-entropy loss over all 42 relations using AdaGrad (Duchi et al., 2011). We apply Dropout with p = 0.5 to CNNs and LSTMs. During training we also find a word dropout strategy to be very effective: we randomly set a token to be <UNK> with a probability p. We set p to be 0.06 for the SDP-LSTM model and 0.04 for all other models.
Entity masking. We replace each subject entity in the original sentence with a special <NER>-SUBJ token where <NER> is the corresponding NER signature of the subject as provided in TAC-RED. We do the same processing for object entities. This processing step helps (1) provide a model with entity type information, and (2) prevent a model from overfitting its predictions to specific entities.
Multi-channel augmentation. Instead of using only word vectors as input to the network, we augment the input with part-of-speech (POS) and named entity recognition (NER) embeddings. We run Stanford CoreNLP  to obtain the POS and NER annotations.  We describe our model hyperparameters and training in detail in the supplementary material.

Evaluation on TACRED
We first evaluate all models on TACRED. We train each model for 5 separate runs with independent random initializations. For each run we perform early stopping using the dev set. We then select the run (among 5) that achieves the median F 1 score on the dev set, and report its test set performance. Table 4 summarizes our results. We observe that all neural models achieve higher F 1 scores than the logistic regression and patterns systems, which demonstrates the effectiveness of neural models for relation extraction. Although positional embeddings help increase the F 1 by around 2% over the plain CNN model, a simple (2-layer) LSTM model performs surprisingly better than CNN and dependency-based models. Lastly, our proposed position-aware mechanism is very effective and achieves an F 1 score of 65.4%, with an absolute increase of 3.9% over the best baseline neural model (LSTM) and 7.9% over the baseline logistic regression system. We also run an ensemble of our position-aware attention model which takes majority votes from 5 runs with random initializations and it further pushes the F 1 score up by 1.6%.
We find that different neural architectures show a different balance between precision and recall. CNN-based models tend to have higher precision; RNN-based models have better recall. This can be explained by noting that the filters in CNNs are essentially a form of "fuzzy n-gram patterns".

Evaluation on TAC KBP Slot Filling
Second, we evaluate the slot filling performance of all models using the TAC KBP 2015 cold start slot filling task (Ellis et al., 2015). In this task, about 50k newswire and Web forum documents are selected as the evaluation corpus. A slot filling system is asked to answer a series of queries with two-hop slots (Figure 3): The first slot asks about fillers of a relation with the query entity as the subject (Mike Penner), and we term this a hop-0 slot; the second slot asks about fillers with the system's hop-0 output as the subject, and we term this a hop-1 slot. System predictions are then evaluated against gold annotations, and micro-averaged precision, recall and F 1 scores are calculated at the hop-0 and hop-1 levels. Lastly hop-all scores are calculated by combining hop-0 and hop-1 scores. 2 Evaluating relation extraction systems on slot filling is particularly challenging in that: (1) Endto-end cold start slot filling scores conflate the performance of all modules in the system (i.e., entity recognizer, entity linker and relation extractor).
(2) Errors in hop-0 predictions can easily propagate to hop-1 predictions. To fairly evaluate each relation extraction model on this task, we use Stanford's 2015 slot filling system as our basic pipeline. 3 It is a very strong baseline specifically tuned for TAC KBP evaluation and ranked top in the 2015 evaluation. We then plug in the corresponding relation extractor trained on TACRED, keeping all other modules unchanged. Table 5 presents our results. We find that: (1) by only training our logistic regression model on TACRED (in contrast to on the 2 million bootstrapped examples used in the 2015 Stanford system) and combining it with patterns, we obtain a higher hop-0 F 1 score than the 2015 Stanford sys-

Analysis
Model ablation. Table 6 presents the results of an ablation test of our position-aware attention model on the development set of TACRED. The entire attention mechanism contributes about 1.5% F 1 , where the position-aware term in Eq. (3) alone contributes about 1% F 1 score. Figure 4 shows how the slot filling evaluation scores change as we change the amount of negative (i.e., no relation) training data provided to our proposed model. We find that: (1) At hop-0 level, precision increases as we provide more negative examples, while recall stays almost unchanged. F 1 score keeps increasing.

Impact of negative examples.
(2) At hop-all level, F 1 score increases by Performance by sentence length. Figure 5 shows performance on varying sentence lengths. We find that: (1) Performance of all models degrades substantially as the sentences get longer.  Improvement by slot types. We calculate the F 1 score for each slot type and compare the improvement from using our proposed model across slot types. When compared with the CNN-PE model, our position-aware attention model achieves improved F 1 scores on 30 out of the 41 slot types, with the top 5 slot types being org:members, per:country of death, org:shareholders, per:children and per:religion. When compared with SDP-LSTM model, our model achieves improved F 1 scores on 26 out of the 41 slot types, with the top 5 slot types being org:political/religious affiliation, per:country of death, org:alternate names, per:religion and per:alternate names. We observe that slot types with relatively sparse training examples tend to be improved by using the position-aware attention model. Attention visualization. Lastly, Figure 6 shows the visualization of attention weights assigned by our model on sampled sentences from the development set. We find that the model learns to pay more attention to words that are informative for the relation (e.g., "graduated from", "niece" and "chairman"), though it still makes mistakes (e.g., "refused to name the three"). We also observe that the model tends to put a lot of weight onto object entities, as the object NER signatures are very informative to the classification of relations.

Related Work
Relation extraction. There are broadly three main lines of work on relation extraction: first, fully-supervised approaches (Zelenko et al., 2003;Bunescu and Mooney, 2005), where a statisti-cal classifier is trained on an annotated dataset; second, distant supervision (Mintz et al., 2009;Surdeanu et al., 2012), where a training set is formed by projecting the relations in an existing knowledge base onto textual instances that contain the entities that the relation connects; and third, Open IE (Fader et al., 2011;Mausam et al., 2012), which views its goal as producing subject-relationobject triples and expressing the relation in text.
Slot filling and knowledge base population. The most widely-known effort to evaluate slot filling and KBP systems is the yearly TAC KBP slot filling tasks, starting from 2009 (McNamee and Dang, 2009). Participants in slot filling tasks usually make use of hybrid systems that combine patterns, Open IE, distant supervision and supervised systems for relation extraction (Kisiel et al., 2015;Finin et al., 2015;Zhang et al., 2016).
Datasets for relation extraction. Popular general-domain datasets include the ACE dataset (Strassel et al., 2008) and the SemEval-2010 task 8 dataset (Hendrickx et al., 2009). In addition, the BioNLP Shared Tasks ) are yearly efforts on creating datasets and evaluations for biomedical information extraction systems.
Deep learning models for relation extraction. Many deep learning models have been proposed for relation extraction, with a focus on end-to-end training using CNNs (Zeng et al., 2014;Nguyen and Grishman, 2015) and RNNs (Zhang et al., 2015). Other popular approaches include using CNN or RNN over dependency paths between entities (Xu et al., 2015a,b), augmenting RNNs with different components Zhou et al., 2016), and combining RNNs and CNNs (Vu et al., 2016;Wang et al., 2016).  compares the performance of CNN models against traditional approaches on slot filling using a portion of the TAC KBP evaluation data.

Conclusion
We introduce a state-of-the-art position-aware neural sequence model for relation extraction, as well as TACRED, a large-scale, crowd-sourced dataset that is orders of magnitude larger than previous relation extraction datasets. Our proposed model outperforms a strong feature-based classifier and all baseline neural models. In combination with the new dataset, it improves the state-of-the-

per:schools attended
The cause was a heart attack following a case of pneumonia , said PER-SUBJ 's niece , PER-OBJ PER-OBJ .

per:other family
Independent ORG-SUBJ ORG-SUBJ ORG-SUBJ ( ECC ) chairman PER-OBJ PER-OBJ refused to name the three , saying they would be identified when the final list of candidates for the august 20 polls is published on Friday .
org:top members/employees Figure 6: Sampled sentences from the TACRED development set, with words highlighted according to the attention weights produced by our best model. art hop-all F 1 on the TAC KBP 2015 slot filling task by 4.5% absolute.