Context-Aware Representations for Knowledge Base Relation Extraction

We demonstrate that for sentence-level relation extraction it is beneficial to consider other relations in the sentential context while predicting the target relation. Our architecture uses an LSTM-based encoder to jointly learn representations for all relations in a single sentence. We combine the context representations with an attention mechanism to make the final prediction. We use the Wikidata knowledge base to construct a dataset of multiple relations per sentence and to evaluate our approach. Compared to a baseline system, our method results in an average error reduction of 24 on a held-out set of relations. The code and the dataset to replicate the experiments are made available at https://github.com/ukplab/.


Introduction
The main goal of relation extraction is to determine a type of relation between two target entities that appear together in a text. In this paper, we consider the sentential relation extraction task: to each occurrence of the target entity pair e 1 , e 2 in some sentence s one has to assign a relation type r from a given set R (Hoffmann et al., 2011). A triple e 1 , r, e 2 is called a relation instance and we refer to the relation of the target entity pair as target relation. Relation extraction is a fundamental task that enables a wide range of semantic applications from question answering (Xu et al., 2016) to fact checking (Vlachos and Riedel, 2014).
For relation extraction, it is crucial to be able to extract relevant features from the sentential context (Riedel et al., 2010;Zeng et al., 2015). Modern approaches focus just on the relation between the target entities and disregard other relations that might be present in the same sentence (Zeng et al., 2015;Lin et al., 2016). For example, in order to correctly identify the relation type between the movie e 1 and the director e 2 in (1), it is important to separate out the INSTANCE OF relation between the movie and its type e 3 : (1) [e 1 Star Wars VII] is an American [e 3 space opera epic film] directed by [e 2 J. J. Abrams].
We present a novel architecture that considers other relations in the sentence as a context for predicting the label of the target relation. We use the term context relations to refer to them throughout the paper. Our architecture uses an LSTM-based encoder to jointly learn representations for all relations in a single sentence. The representation of the target relation and representations of the context relations are combined to make the final prediction.
To facilitate the experiments we construct a dataset that contains multiple positive and negative relation instances per sentence. We employ a fast growing community managed knowledge base (KB) Wikidata (Vrandečić and Krötzsch, 2014) to build the dataset.
Our main contribution is the new neural network architecture for extracting relations between an entity pair that takes into account other relations in the sentence.

Related Work
We employ a neural network to automatically encode the target relation and the sentential context into a fixed-size feature vector. Mintz et al. (2009) and Riedel et al. (2010) have used manually engineered features based on part-of-speech tags and dependency parses to represent the target relations. Recently, Zeng et al. (2015) and  have shown that one can successfully apply convo-lutional neural networks to extract sentence-level features automatically.
Most of the methods (Riedel et al., 2010;Zeng et al., 2015;Lin et al., 2016) focus on predicting a single relation type based on the combined evidence from all of the occurrences of an entity pair. Hoffmann et al. (2011) and Surdeanu et al. (2012) assign multiple relation types to each entity pair, such that the predictions are tied to particular occurrences of the entity pair. We regard the relation extraction task similarly and predict relation types on the sentence level.
We use a distant supervision approach (Mintz et al., 2009) to construct the dataset. Mintz et al. (2009) and Riedel et al. (2010) have applied it to create relation extraction datasets for a large-scale KB. In contrast to our dataset, their data contains a single relation instance per sentence. That makes it incompatible with our method.
All of the aforementioned approaches consider just the relation between the target entities and disregard other relations that might be present in the same sentence. Our method uses context relations to predict the target relation. One can also use other types of structured information from the nearby context to improve relation extraction. Roth and Yih (2004) have combined named entity recognition and relation extraction in a structured prediction approach to improve both tasks. Later, Miwa and Bansal (2016) have implemented an end-to-end neural network to construct a context representation for joint entity and relation extraction. Finally, Li et al. (2013) have designed global features and constraints to extract multiple events and their arguments from the same sentence.
We don't implement global constraints in our approach, since unlike events and arguments, there are no restrictions as to what relations can appear together. Instead we encode all relations in the same context into fixed-size vectors and use an attention mechanism to combine them.

Data generation with Wikidata
Wikidata is a collaboratively constructed KB that encodes common world knowledge in a form of binary relation instances (e.g. CAPITAL:P36 (Hawaii:Q782, Honolulu:Q18094)) 1 . It contains more than 28 million entities and 160 million re-   (Färber et al., 2015). We use the complete English Wikipedia corpus to generate training and evaluation data. Wikipedia and Wikidata are tightly integrated which enables us to employ manual wiki annotations to extract high quality data. From each sentence in a complete article we extract link annotations and retrieve Wikidata entity IDs corresponding to the linked articles. There is an unambiguous one-to-one mapping between Wikidata entities and Wikipedia articles. Columbia University → Q49088 For further processing, we filter out sentences that contain fewer than 3 annotated entities, since we need to have multiple relations per sentence for training (see Section 4).
We extract named entities and noun chunks from the input sentences with the Stanford CoreNLP toolkit  to identify entities that are not covered by the Wikipedia annotations (e.g. Obama in the sentence above). We retrieve IDs for those entities by searching through entity labels in Wikidata. We use HeidelTime (Strötgen and Gertz, 2013) to extract dates.
For each pair of entities, we query Wikidata for relation types that connect them. We discard an occurrence of an entity pair if the relation is ambiguous, i. e. multiple relation types were retrieved. For comparison, Surdeanu et al. (2012) report that only 7.5% of entity pairs have more than one corresponding relation type in the distantly supervised dataset of Riedel et al. (2010). The entity pairs that have no relation in the knowledge base are stored as negative instances.
The constructed dataset features 353 different relation types (out of approximately 1700 non-meta relation types in the Wikidata scheme). We split  Relation Encoder Figure 1: The architecture of the relation encoder it into train, validation and held-out sets, ensuring that there is no overlap in either sentences or relation triples between the three sets. The relation encoder produces a fixed-size vector representation o s of a relation between two entities in a sentence (see Figure 1). First, each token of the sentence x = {x 1 , x 2 . . . x n } is mapped to a k-dimensional embedding vector using a matrix W ∈ R |V |×k , where |V | is the size of the vocabulary. Throughout the experiments in this paper, we use 50-dimensional GloVe embeddings pre-trained on a 6 billion corpus (Pennington et al., 2014).
Second, we mark each token in the sentence as either belonging to the first entity e 1 , the second entity e 2 or to neither of those. A marker embedding matrix P ∈ R 3×d is randomly initialized (d is the dimension of the position embedding and there are three marker types). For each token, we concatenate the marker embedding with the word embedding: (W n , P n ).
We apply a recurrent neural network (RNN) on the token embeddings. The length n naturally varies from sentence to sentence and an RNN provides a way to accommodate inputs of various  sizes. It maps a sequence of n vectors to a fixedsize output vector o s ∈ R o . We take the output vector o s as the representation of the relation between the target entities in the sentence. We use the Long Short-Term Memory (LSTM) variant of RNN (Hochreiter and Schmidhuber, 1997) that was successfully applied to information extraction before (Miwa and Bansal, 2016).

Model variants
LSTM baseline As the first model variant, we feed the output vector o s of the relation encoder to a softmax layer to predict the final relation type for the target entity (see the upper part of Figure 1): (1) where y i is a weight vector and b i is a bias. ContextSum We argue that for predicting a relation type for a target entity pair other context relations in the same sentence are relevant. Some relation types may tend to co-occur, such as DI-RECTED BY and PRODUCED BY, whereas others may be restrictive (e. g. one can only have a single PLACE OF BIRTH).
Therefore, in addition to the target entity pair, we take other entities from the same sentence that were extracted at the data generation step. We construct a set of context relations by taking each possible pair of entities. 3 Example (2) shows a target entity pair e 1 , e 2 and context entities highlighted in bold.
(2) We apply the same relation encoder on the target and context relations (see Figure 2). That ensures that representation for target and context relations are learned jointly. We sum the context relation representations: where g i computes an attention score for a context relation with respect to the target relation: g(o i , o s ) = o i Ao s , and A is a weight matrix that is learned.

Training the models
All models were trained using the Adam optimizer (Kingma and Ba, 2014) with categorical crossentropy as the loss function. We use an early stopping criterion on the validation data to determine the number of training epochs. The learning rate is fixed to 0.01 and the rest of the optimization parameters are set as recommended in Kingma and Ba (2014): β 1 = 0.9, β 2 = 0.999, ε = 1e − 08. The training is performed in batches of 128 instances.
We apply Dropout (Srivastava et al., 2014) on the penultimate layer as well as on the embeddings layer with a probability of 0.5. We choose the size of the layers (RNN layer size o = 256) and entity marker embeddings (d = 3) with a random search on the validation set. 4

Held-out evaluation
As an additional baseline, we re-implement a sentence-level model based on convolutional neural networks (CNNs) described in Lin et al. (2016). This is a state-of-the-art model for fine-grained relation extraction that was previously tested on the single-relation dataset from Riedel et al. (2010). In addition to CNNs, their architecture uses a different position encoding scheme: position markers encode a relative position of each word with respect to the target entities. 5 We use the same GloVe word embeddings for this model and perform a hyperparameter optimization on the validation set.
Our dataset lets us compare the baseline models and the models that use context relations on the same data. Following the previous work on rela-  tion extraction, we report the aggregated precisionrecall curves for each model on the held-out data ( Figure 3). 6 To compute the curves, we rank the predictions of each model by their confidence and traverse this list top to bottom measuring the precision and recall at each step. The models that take the context into account perform similar to the baselines at the smallest recall numbers, but start to positively deviate from them at higher recall rates. In particular, the ContextAtt model performs better than any other system in our study over the entire recall range. Compared to the competitive LSTM-baseline that uses the same relation encoder, the ContextAtt model achieves a 24% reduction of the average error: from 0.2096 ± 0.002 to 0.1590 ± 0.002. The difference between the models is statistically significant (p = 0.009). 7 We also compute macro precision-recall curves that give equal weights to all relations in the dataset. Figure 4 shows that the ContextAtt model performs best over all relation types. One can also see that the ContextSum doesn't universally outperforms the LSTM-baseline. It demonstrates again that using attention is crucial to extract relevant information from the context relations.
On the relation-specific results (Table 2) we observe that the context-enabled model demonstrates the most improvement on precision and seems to be especially useful for taxonomy relations (see SUBCLASS OF, PART OF). 6 We do not compare against the approach of Surdeanu et al.
(2012) that also performs sentence-level relation extraction, since the provided implementation does not feature the complete pipeline and is only applicable on a particular Freebase dataset. 7 The average error and the standard deviation are estimated on 5 training iterations for each model. The statistical significance is computed using the Wilcoxon rank-sum test on the error rates.

Conclusions
We have introduced a neural network architecture for relation extraction on the sentence level that takes into account other relations from the same context. We have shown by comparison with competitive baselines that these context relations are beneficial for relation extraction with a large set of relation types.
Our approach can be easily applied to other types of relation extraction models as well. For instance, Lin et al. (2016) extract sentence-level features and then combine features from multiple sentences with a selective attention mechanism. It would be possible to replace their sentence-level feature extractor with our model.