RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information

Distantly-supervised Relation Extraction (RE) methods train an extractor by automatically aligning relation instances in a Knowledge Base (KB) with unstructured text. In addition to relation instances, KBs often contain other relevant side information, such as aliases of relations (e.g., founded and co-founded are aliases for the relation founderOfCompany). RE models usually ignore such readily available side information. In this paper, we propose RESIDE, a distantly-supervised neural relation extraction method which utilizes additional side information from KBs for improved relation extraction. It uses entity type and relation alias information for imposing soft constraints while predicting relations. RESIDE employs Graph Convolution Networks (GCN) to encode syntactic information from text and improves performance even when limited side information is available. Through extensive experiments on benchmark datasets, we demonstrate RESIDE’s effectiveness. We have made RESIDE’s source code available to encourage reproducible research.


Introduction
The construction of large-scale Knowledge Bases (KBs) like Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014) has proven useful in many natural language processing (NLP) tasks such as question answering and web search. However, these KBs are not exhaustive. Relation Extraction (RE) attempts to fill this gap by extracting semantic relationships between entity pairs from plain text. Once the entity pairs are specified, this task can be modeled as a simple classification problem. Formally, given an entity pair $(e_1, e_2)$ from the KB and an entity-annotated sentence (or instance), we aim to predict the relation $r$, from a predefined relation set, that exists between $e_1$ and $e_2$. If no relation exists, we simply label it NA.
* Research internship at Indian Institute of Science.
Most supervised relation extraction methods require large amounts of labeled training data, which is expensive to construct. Distant Supervision (DS) (Mintz et al., 2009) helps construct such a dataset automatically, under the assumption that if two entities have a relationship in a KB, then all sentences mentioning those entities express the same relation. While this approach works well for generating large numbers of training instances, the DS assumption does not hold in all cases. Riedel et al. (2010), Hoffmann et al. (2011), and Surdeanu et al. (2012) propose multi-instance learning to relax this assumption. However, they use NLP tools to extract features, which can be noisy.
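The distant supervision assumption can be illustrated with a minimal sketch. The function name `distant_label`, the data layout, and the toy sentences below are assumptions made for illustration only; they are not taken from the RESIDE code base. Every sentence that mentions an entity pair is placed in that pair's bag and inherits the KB relation, including sentences that do not actually express it.

```python
from collections import defaultdict


def distant_label(kb_triples, annotated_sentences):
    """Group sentences into bags per entity pair and label each bag
    with the KB relations of that pair ("NA" if the KB has none)."""
    relations = defaultdict(set)
    for head, relation, tail in kb_triples:
        relations[(head, tail)].add(relation)

    bags = defaultdict(list)
    for text, head, tail in annotated_sentences:
        bags[(head, tail)].append(text)

    labeled = {}
    for pair, texts in bags.items():
        labels = sorted(relations[pair]) if pair in relations else ["NA"]
        labeled[pair] = {"sentences": texts, "labels": labels}
    return labeled


# Hypothetical toy KB and corpus (sentence, subject entity, object entity).
kb = [("Bill Gates", "founderOfCompany", "Microsoft")]
corpus = [
    ("Microsoft was started by Bill Gates.", "Bill Gates", "Microsoft"),
    ("Bill Gates spoke at the new Microsoft office.", "Bill Gates", "Microsoft"),
    ("Bill Gates visited Seattle.", "Bill Gates", "Seattle"),
]
bags = distant_label(kb, corpus)
```

In this toy corpus the second sentence receives the label founderOfCompany although it does not express that relation, which is the kind of noise that multi-instance learning is meant to tolerate.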
Recently, neural models have demonstrated promising performance on RE. Zeng et al. (2014, 2015) employ Convolutional Neural Networks (CNN) to learn representations of instances. To alleviate noise in distantly supervised datasets, attention has been utilized by Lin et al. (2016) and Jat et al. (2018). Syntactic information from dependency parses has been used by Mintz et al. (2009) and He et al. (2018) for capturing long-range dependencies between tokens. Recently proposed Graph Convolution Networks (GCN) (Defferrard et al., 2016) have been effectively employed for encoding this information (Bastings et al., 2017). However, all the above models rely only on the noisy instances from distant supervision for RE.
Relevant side information can be effective for improving RE. For instance, in the sentence "Microsoft was started by Bill Gates.", the type information of Bill Gates (person) and Microsoft (organization) can be helpful in predicting the correct relation founderOfCompany. This is because every relation constrains the type of its target entities. Similarly, the relation phrase "was started by", extracted using Open Information Extraction (Open IE) methods, can be useful, given that the aliases of the relation founderOfCompany, e.g., founded, co-founded, etc., are available. KBs used for DS readily provide such information, which has not been completely exploited by current models.
Figure 1: Overview of RESIDE. RESIDE first encodes each sentence in the bag by concatenating embeddings (denoted by ⊕) from Bi-GRU and Syntactic GCN for each token, followed by word attention. Then, the sentence embedding is concatenated with relation alias information, which comes from the Side Information Acquisition section (Figure 2), before computing attention over sentences. Finally, the bag representation with entity type information is fed to a softmax classifier. Please see Section 5 for more details.
In this paper, we propose RESIDE, a novel distantly supervised relation extraction method which utilizes additional supervision from the KB through its neural network based architecture. RESIDE makes principled use of entity type and relation alias information from KBs to impose soft constraints while predicting the relation. It uses encoded syntactic information obtained from Graph Convolution Networks (GCN), along with embedded side information, to improve neural relation extraction. Our contributions can be summarized as follows:
• We propose RESIDE, a novel neural method which utilizes additional supervision from the KB in a principled manner for improving distantly supervised RE.
• RESIDE uses Graph Convolution Networks (GCN) for modeling syntactic information and has been shown to perform competitively even with limited side information.
• Through extensive experiments on benchmark datasets, we demonstrate RESIDE's effectiveness over state-of-the-art baselines.
RESIDE's source code and the datasets used in the paper are available at http://github.com/malllabiisc/RESIDE.

Related Work
Distant supervision: Relation extraction is the task of identifying the relationship between two entity mentions in a sentence. In the supervised paradigm, the task is treated as a multi-class classification problem but suffers from a lack of large labeled training data. To address this limitation, Mintz et al. (2009) propose the distant supervision (DS) assumption for creating large datasets by heuristically aligning text to a given Knowledge Base (KB). As this assumption does not always hold true, some of the sentences might be wrongly labeled. To alleviate this shortcoming, Riedel et al. (2010) relax distant supervision for multi-instance single-label learning. Subsequently, for handling overlapping relations between entities, Hoffmann et al. (2011) and Surdeanu et al. (2012) propose the multi-instance multi-label learning paradigm.
Neural Relation Extraction: The performance of the above methods relies strongly on the quality of hand-engineered features. Zeng et al. (2014) propose an end-to-end CNN based method which can automatically capture relevant lexical and sentence-level features. This method is further improved through piecewise max-pooling by Zeng et al. (2015). Lin et al. (2016) and Nagarajan et al. (2017) use attention for learning from multiple valid sentences. We also make use of attention for learning sentence and bag representations.
Dependency tree based features have been found to be relevant for relation extraction (Mintz et al., 2009). He et al. (2018) use them to obtain promising results through a recursive tree-GRU based model. In RESIDE, we make use of recently proposed Graph Convolution Networks (Defferrard et al., 2016; Kipf and Welling, 2017), which have been found to be quite effective for modelling syntactic information (Nguyen and Grishman, 2018; Vashishth et al., 2018a).
Side Information in RE: Entity descriptions from the KB have been utilized for RE (Ji et al., 2017), but such information is not available for all entities. Type information of entities has been used by Ling and Weld (2012) as features in their model. Yaghoobzadeh et al. (2017) also attempt to mitigate noise in DS through their joint entity typing and relation extraction model. However, KBs like Freebase readily provide reliable type information which can be directly utilized. In our work, we make principled use of entity type and relation alias information obtained from the KB. We also use unsupervised Open Information Extraction (Open IE) methods (Mausam et al., 2012; Angeli et al., 2015), which automatically discover possible relations without the need for any predefined ontology; their output is used as side information, as described in Section 5.2.

Background: Graph Convolution Networks (GCN)
In this section, we provide a brief overview of Graph Convolution Networks (GCN) for graphs with directed and labeled edges.

GCN on Labeled Directed Graph
For a directed graph $G = (V, E)$, where $V$ and $E$ represent the sets of vertices and edges respectively, an edge from node $u$ to node $v$ with label $l_{uv}$ is represented as $(u, v, l_{uv})$. Since information in a directed edge does not necessarily propagate only along its direction, we define an updated edge set $E'$ which includes inverse edges $(v, u, l_{uv}^{-1})$ and self-loops $(u, u, \top)$ along with the original edge set $E$, where $\top$ is a special symbol denoting self-loops. For each node $v$ in $G$, we have an initial representation $x_v \in \mathbb{R}^d$, $\forall v \in V$. On employing GCN, we get an updated $d$-dimensional hidden representation $h_v \in \mathbb{R}^d$, $\forall v \in V$, by considering only its immediate neighbors (Kipf and Welling, 2017). This can be formulated as:

$$h_v = f\left(\sum_{u \in N(v)} \left(W_{l_{uv}} x_u + b_{l_{uv}}\right)\right).$$

Here, $W_{l_{uv}} \in \mathbb{R}^{d \times d}$ and $b_{l_{uv}} \in \mathbb{R}^d$ are label-dependent model parameters which are trained based on the downstream task. $N(v)$ refers to the set of neighbors of $v$ based on $E'$, and $f$ is any non-linear activation function. In order to capture multi-hop neighborhoods, multiple GCN layers can be stacked. The hidden representation of node $v$ after the $k$-th GCN layer is then given as:

$$h_v^{k+1} = f\left(\sum_{u \in N(v)} \left(W^k_{l_{uv}} h_u^k + b^k_{l_{uv}}\right)\right).$$
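As a rough illustration of the update described in this section, the following sketch applies a single GCN layer to a tiny labeled graph using plain Python lists. The helper names (`extend_edges`, `gcn_layer`), the string labels for inverse edges and self-loops, and the toy parameters are hypothetical choices made for this example; ReLU is used as the non-linearity only for concreteness.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def extend_edges(edges, nodes):
    """Add an inverse edge for every edge and a self-loop for every node."""
    extended = list(edges)
    extended += [(v, u, label + "^-1") for u, v, label in edges]
    extended += [(u, u, "self") for u in nodes]
    return extended


def gcn_layer(x, edges, W, b):
    """For each node v, sum W[l] x_u + b[l] over edges (u, v, l), then apply ReLU."""
    dim = len(next(iter(b.values())))
    totals = {v: [0.0] * dim for v in x}
    for u, v, label in edges:
        message = matvec(W[label], x[u])
        totals[v] = [t + m + bias
                     for t, m, bias in zip(totals[v], message, b[label])]
    return {v: [max(0.0, t) for t in total] for v, total in totals.items()}


# Toy graph: one labeled edge a -> b, identity weights, zero biases.
x = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
edges = extend_edges([("a", "b", "nsubj")], nodes=list(x))
identity = [[1.0, 0.0], [0.0, 1.0]]
W = {label: identity for _, _, label in edges}
b = {label: [0.0, 0.0] for _, _, label in edges}
h = gcn_layer(x, edges, W, b)
```

With identity weights, each node simply sums its own vector and its neighbor's vector, which makes it easy to see that the inverse edge lets information flow from b back to a.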

Integrating Edge Importance
In automatically constructed graphs, some edges might be erroneous and hence need to be discarded. Edgewise gating in GCN (Bastings et al., 2017) alleviates this problem by subduing the noisy edges. This is achieved by assigning a relevance score to each edge in the graph. At the $k$-th layer, the importance of an edge $(u, v, l_{uv})$ is computed as:

$$g^k_{uv} = \sigma\left(h^k_u \cdot \hat{w}^k_{l_{uv}} + \hat{b}^k_{l_{uv}}\right). \tag{1}$$

Here, $\hat{w}^k_{l_{uv}} \in \mathbb{R}^m$ and $\hat{b}^k_{l_{uv}} \in \mathbb{R}$ are trainable parameters and $\sigma(\cdot)$ is the sigmoid function. With edgewise gating, the final GCN embedding for a node $v$ after the $k$-th layer is given as:

$$h^{k+1}_v = f\left(\sum_{u \in N(v)} g^k_{uv} \times \left(W^k_{l_{uv}} h^k_u + b^k_{l_{uv}}\right)\right). \tag{2}$$

RESIDE Overview
In the multi-instance learning paradigm, we are given a bag of sentences (or instances) $\{s_1, s_2, \ldots, s_n\}$ for a given entity pair, and the task is to predict the relation between them. RESIDE consists of three components for learning a representation of a given bag, which is fed to a softmax classifier. We briefly present the components of RESIDE below. Each component is described in detail in the subsequent sections. The overall architecture of RESIDE is shown in Figure 1.
1. Syntactic Sentence Encoding: RESIDE uses a Bi-GRU over the concatenated positional and word embeddings for encoding the local context of each token. For capturing long-range dependencies, a GCN over the dependency tree is employed and its encoding is appended to the representation of each token. Finally, attention over tokens is used to subdue irrelevant tokens and get an embedding for the entire sentence. More details are given in Section 5.1.
2. Side Information Acquisition: In this module, we use additional supervision from KBs and utilize Open IE methods for getting relevant side information. This information is later utilized by the model as described in Section 5.2.
3. Instance Set Aggregation: In this part, the sentence representation from the syntactic sentence encoder is concatenated with the matched relation embedding obtained from the previous step. Then, using attention over sentences, a representation for the entire bag is learned. This is then concatenated with the entity type embedding before being fed into the softmax classifier for relation prediction. Please refer to Section 5.3 for more details.

RESIDE Details
In this section, we provide the detailed description of the components of RESIDE.

Syntactic Sentence Encoding
For each sentence $s_i$ in the bag with $m$ tokens $\{w_1, w_2, \ldots, w_m\}$, we first represent each token by a $k$-dimensional GloVe embedding (Pennington et al., 2014). To incorporate the relative position of tokens with respect to the target entities, we use $p$-dimensional position embeddings, as used by Zeng et al. (2014). The combined token embeddings are stacked together to get the sentence representation $H \in \mathbb{R}^{m \times (k+2p)}$. Then, using a Bi-GRU over $H$, we get the new sentence representation $H^{gru} \in \mathbb{R}^{m \times d_{gru}}$, where $d_{gru}$ is the hidden state dimension. Bi-GRUs have been found to be quite effective in encoding the context of tokens in several tasks (Sutskever et al., 2014; Graves et al., 2013).
Although a Bi-GRU is capable of capturing local context, it fails to capture long-range dependencies, which can be captured through dependency edges. Prior works (Mintz et al., 2009; He et al., 2018) have exploited features from syntactic dependency trees for improving relation extraction. Motivated by their work, we employ Syntactic Graph Convolution Networks for encoding this information. For a given sentence, we generate its dependency tree using Stanford CoreNLP. We then run a GCN over the dependency graph and use Equation 2 for updating the embeddings, taking $H^{gru}$ as the input. Since the dependency graph has 55 different edge labels, incorporating all of them over-parameterizes the model significantly. Therefore, following (Nguyen and Grishman, 2018; Vashishth et al., 2018a), we use only three edge labels based on the direction of the edge: {forward (→), backward (←), self-loop (⊤)}. We define the new edge label $L_{uv}$ for an edge $(u, v, l_{uv})$ as follows:

$$L_{uv} = \begin{cases} \rightarrow & \text{if the edge exists in the dependency parse} \\ \leftarrow & \text{if the edge is an inverse edge} \\ \top & \text{if the edge is a self-loop} \end{cases}$$

For each token $w_i$, the GCN embedding $h^{gcn}_{i_{k+1}} \in \mathbb{R}^{d_{gcn}}$ after the $k$-th layer is defined as:

$$h^{gcn}_{i_{k+1}} = f\left(\sum_{u \in N(i)} g^k_{iu} \times \left(W^k_{L_{iu}} h^{gcn}_{u_k} + b^k_{L_{iu}}\right)\right).$$

Here, $g^k_{iu}$ denotes edgewise gating as defined in Equation 1 and $L_{iu}$ refers to the edge label defined above. We use ReLU as the activation function $f$ throughout our experiments.
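The direction-based relabeling and the gated update can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the made-up parse for a three-token sentence, and the toy parameters are all assumptions. Dependency arcs are given as (head index, dependent index) pairs and their original labels are discarded, so only three parameter sets are needed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def direction_edges(dependency_arcs, num_tokens):
    """Collapse dependency labels into forward, backward and self-loop edges."""
    edges = [(head, dep, "forward") for head, dep in dependency_arcs]
    edges += [(dep, head, "backward") for head, dep in dependency_arcs]
    edges += [(i, i, "self") for i in range(num_tokens)]
    return edges


def gated_gcn_layer(h, edges, W, b, w_gate, b_gate):
    """One GCN layer in which every message is scaled by a sigmoid gate."""
    dim = len(b["self"])
    out = [[0.0] * dim for _ in h]
    for u, i, label in edges:
        gate = sigmoid(dot(h[u], w_gate[label]) + b_gate[label])
        message = [dot(row, h[u]) + bias
                   for row, bias in zip(W[label], b[label])]
        out[i] = [o + gate * m for o, m in zip(out[i], message)]
    return [[max(0.0, o) for o in row] for row in out]  # ReLU


# Made-up parse: token 1 is the head of tokens 0 and 2.
tokens = ["Microsoft", "started", "Gates"]
arcs = [(1, 0), (1, 2)]
h_gru = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for Bi-GRU outputs
labels = ("forward", "backward", "self")
identity = [[1.0, 0.0], [0.0, 1.0]]
W = {label: identity for label in labels}
b = {label: [0.0, 0.0] for label in labels}
w_gate = {label: [0.0, 0.0] for label in labels}  # zero gates give sigmoid(0) = 0.5
b_gate = {label: 0.0 for label in labels}
h_gcn = gated_gcn_layer(h_gru, direction_edges(arcs, len(tokens)),
                        W, b, w_gate, b_gate)
```

With all gate parameters at zero every edge receives a gate of 0.5; in a trained model the gates would differ per edge, letting the network down-weight arcs that the parser got wrong.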
The syntactic graph encoding from the GCN is appended to the Bi-GRU output to get the final token representation, $h^{concat}_i = [h^{gru}_i; h^{gcn}_{i_{k+1}}]$. Since not all tokens are equally relevant for the RE task, we calculate the degree of relevance of each token using attention as used in (Jat et al., 2018). For token $w_i$ in the sentence, the attention weight $\alpha_i$ is calculated as:

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j=1}^{m} \exp(u_j)}, \quad u_i = h^{concat}_i \cdot r.$$

Here, $r$ is a random query vector and $u_i$ is the relevance score assigned to each token. Attention values $\{\alpha_i\}$ are calculated by taking softmax over $\{u_i\}$. The representation of a sentence is given as a weighted sum of its tokens, $s = \sum_{j=1}^{m} \alpha_j h^{concat}_j$.
Figure 2: Relation alias side information extraction for a given sentence. First, the Syntactic Context Extractor identifies relevant relation phrases P between target entities. They are then matched in the embedding space with the extended set of relation aliases R from the KB. Finally, the relation embedding corresponding to the closest alias is taken as relation alias information. Please refer to Section 5.2.
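The token-level attention described above amounts to a softmax over dot products with a query vector, followed by a weighted sum. The sketch below shows this computation on toy vectors; the function names and the fixed query are illustrative assumptions (in a trained model the query would be a learned parameter).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(token_vectors, query):
    """Return attention weights and the attention-weighted sum of the vectors."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[j] for w, vec in zip(weights, token_vectors))
              for j in range(dim)]
    return weights, pooled


# Toy token representations; the query favours the first dimension.
token_vectors = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, sentence = attend(token_vectors, query=[1.0, 0.0])
uniform_weights, _ = attend(token_vectors, query=[0.0, 0.0])
```

A zero query yields uniform weights, so the pooled vector reduces to a plain average; a non-zero query shifts weight toward the tokens that align with it.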

Side Information Acquisition
Relevant side information has been found to improve performance on several tasks (Ling and Weld, 2012;Vashishth et al., 2018b). In distant supervision based relation extraction, since the entities are from a KB, knowledge about them can be utilized to improve relation extraction. Moreover, several unsupervised relation extraction methods (Open IE) (Angeli et al., 2015;Mausam et al., 2012) allow extracting relation phrases between target entities without any predefined ontology and thus can be used to obtain relevant side information. In RESIDE, we employ Open IE methods and additional supervision from KB for improving neural relation extraction.

Relation Alias Side Information
RESIDE uses Stanford Open IE (Angeli et al., 2015) for extracting relation phrases between target entities, which we denote by P. As shown in Figure 2, for the sentence "Matt Coffin, executive of lowermybills, a company..", Open IE methods extract "executive of" between Matt Coffin and lowermybills. Further, we extend P by including tokens at one-hop distance in the dependency path from the target entities. Such features from the dependency parse have been exploited in the past by Mintz et al. (2009) and He et al. (2018). The degree of match between the extracted phrases in P and the aliases of a relation can give important clues about the relevance of that relation for the sentence. Several KBs like Wikidata provide such relation aliases, which can be readily exploited. In RESIDE, we further expand the relation alias set using the Paraphrase Database (PPDB) (Pavlick et al., 2015). We note that even when aliases for relations are not available, providing only the names of relations gives competitive performance. We explore this point further in Section 7.3. For matching P with the PPDB-expanded relation alias set R, we project both into a $d$-dimensional space using GloVe embeddings (Pennington et al., 2014). Projecting phrases using word embeddings helps to further expand these sets, as semantically similar words are closer in the embedding space (Mikolov et al., 2013; Pennington et al., 2014). Then, for each phrase p ∈ P, we calculate its cosine distance from all relation aliases in R and take the relation corresponding to the closest relation alias as a matched relation for the sentence. We use a threshold on cosine distance to remove noisy aliases. In RESIDE, we define a $k_r$-dimensional embedding for each relation, which we call the matched relation embedding ($h^{rel}$). For a given sentence, $h^{rel}$ is concatenated with its representation $s$, obtained from the syntactic sentence encoder (Section 5.1), as shown in Figure 1.
For sentences with |P| > 1, we might get multiple matched relations. In such cases, we take the average of their embeddings. We hypothesize that this helps in improving the performance and find it to be true, as shown in Section 7.
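The matching procedure can be sketched as a nearest-alias search under cosine distance, followed by averaging the embeddings of all matched relations. Everything concrete below is an assumption for illustration: the function names, the two-dimensional vectors standing in for phrase and alias embeddings, the threshold value, and the use of a zero vector when no phrase matches (the text does not specify that case).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def match_relations(phrase_vecs, alias_vecs, threshold):
    """For each phrase, keep the relation of its closest alias if close enough."""
    matched = set()
    for phrase in phrase_vecs:
        relation, distance = min(
            ((rel, cosine_distance(phrase, vec)) for rel, vec in alias_vecs),
            key=lambda pair: pair[1],
        )
        if distance <= threshold:
            matched.add(relation)
    return matched


def matched_relation_embedding(matched, relation_emb, dim):
    """Average the embeddings of all matched relations."""
    if not matched:
        return [0.0] * dim  # assumption: zero vector when nothing matches
    vecs = [relation_emb[rel] for rel in sorted(matched)]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


# Hypothetical alias vectors (relation, vector) and extracted phrase vectors.
alias_vecs = [("founderOfCompany", [1.0, 0.0]), ("placeOfBirth", [0.0, 1.0])]
phrase_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
relation_emb = {"founderOfCompany": [1.0, 3.0], "placeOfBirth": [3.0, 5.0]}

matched = match_relations(phrase_vecs, alias_vecs, threshold=0.3)
h_rel = matched_relation_embedding(matched, relation_emb, dim=2)
```

The third phrase is far from every alias and is filtered out by the threshold, while the first two phrases match different relations, so the resulting embedding is the average of both.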

Entity Type Side Information
Type information of target entities has been shown to give promising results on relation extraction (Ling and Weld, 2012; Yaghoobzadeh et al., 2017). Every relation puts some constraint on the types of entities which can be its subject and object. For example, the relation person/place of birth can only occur between a person and a location. Sentences in distant supervision are based on entities in KBs, where the type information is readily available.
In RESIDE, we use types defined by FIGER (Ling and Weld, 2012) for entities in Freebase. For each type, we define a $k_t$-dimensional embedding, which we call the entity type embedding ($h^{type}$). When an entity has multiple types in different contexts, for instance, Paris may have the types government and location, we take the average over the embeddings of each type. We concatenate the entity type embeddings of the target entities to the final bag representation before using it for relation classification. To avoid over-parameterization, instead of using all 112 fine-grained entity types, we use the 38 coarse types which form the first level of the FIGER type hierarchy.
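A small sketch of the type handling described above: fine-grained types are truncated to their first-level coarse type, and the embeddings of an entity's coarse types are averaged. The path-like type strings, the function names, and the two-dimensional embeddings are illustrative assumptions.

```python
def coarse_type(figer_type):
    """Map a path-like type such as "/location/city" to "/location"."""
    return "/" + figer_type.strip("/").split("/")[0]


def entity_type_embedding(figer_types, type_emb):
    """Average the embeddings of the distinct coarse types of an entity."""
    coarse = sorted({coarse_type(t) for t in figer_types})
    vecs = [type_emb[t] for t in coarse]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


# Hypothetical embeddings for two coarse types.
type_emb = {"/location": [1.0, 0.0], "/government": [0.0, 1.0]}
paris_types = ["/location/city", "/location", "/government"]
h_type = entity_type_embedding(paris_types, type_emb)
```

Because "/location/city" and "/location" collapse to the same coarse type, the example averages over two embeddings rather than three.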

Instance Set Aggregation
To utilize all valid sentences, following (Lin et al., 2016; Jat et al., 2018), we use attention over sentences to obtain a representation for the entire bag. Instead of directly using the sentence representation $s_i$ from Section 5.1, we concatenate the embedding of each sentence with the matched relation embedding $h^{rel}_i$ obtained from Section 5.2. The attention score $\alpha_i$ for the $i$-th sentence is formulated as:

$$\alpha_i = \frac{\exp(\hat{s}_i \cdot q)}{\sum_{j=1}^{n} \exp(\hat{s}_j \cdot q)}, \quad \hat{s}_i = [s_i; h^{rel}_i].$$

Here, $q$ denotes a random query vector. The bag representation $B$, which is the weighted sum of its sentences, is then concatenated with the entity type embeddings of the subject ($h^{type}_{sub}$) and object ($h^{type}_{obj}$) from Section 5.2 to obtain $\hat{B}$:

$$\hat{B} = [B; h^{type}_{sub}; h^{type}_{obj}], \quad B = \sum_{i=1}^{n} \alpha_i \hat{s}_i.$$

Finally, $\hat{B}$ is fed to a softmax classifier to get the probability distribution over the relations.
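The aggregation step can be sketched end to end on toy vectors: each sentence vector is concatenated with its matched relation embedding, attention over sentences produces the bag vector, the entity type embeddings are appended, and a linear layer with softmax gives relation probabilities. The function names, dimensions, and parameter values below are assumptions for illustration; in practice the query, weights, and biases would be learned.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_representation(sentence_vecs, rel_vecs, query, sub_type, obj_type):
    """Attention over [sentence; relation] vectors, then append type embeddings."""
    augmented = [s + r for s, r in zip(sentence_vecs, rel_vecs)]  # concatenation
    weights = softmax([dot(vec, query) for vec in augmented])
    dim = len(augmented[0])
    bag = [sum(w * vec[j] for w, vec in zip(weights, augmented))
           for j in range(dim)]
    return bag + sub_type + obj_type


def classify(bag_vec, W, b):
    """Linear layer followed by softmax over relations."""
    return softmax([dot(row, bag_vec) + bias for row, bias in zip(W, b)])


# Toy bag with two sentences; all values are made up.
sentence_vecs = [[1.0], [3.0]]
rel_vecs = [[0.0], [0.0]]
bag_hat = bag_representation(sentence_vecs, rel_vecs, query=[0.0, 0.0],
                             sub_type=[1.0], obj_type=[0.0])
W = [[1.0, 0.0, 0.0, 0.0],   # one row per relation
     [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0]
probs = classify(bag_hat, W, b)
```

With a zero query both sentences receive equal weight, so the bag vector is their average; the type embeddings are appended after attention and therefore do not influence the sentence weights.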

Experimental Setup

Datasets
In our experiments, we evaluate the models on the Riedel and Google Distant Supervision (GIDS) datasets. Statistics of the datasets are summarized in Table 1. Below, we describe each in detail.

Riedel:
The dataset was developed by Riedel et al. (2010). Its entity mentions are annotated (Finkel et al., 2005) and are linked to Freebase. The dataset has been widely used for RE by (Hoffmann et al., 2011; Surdeanu et al., 2012) and more recently by (Lin et al., 2016; Feng et al.; He et al., 2018).

GIDS:
Jat et al. (2018) created the Google Distant Supervision (GIDS) dataset by extending the Google relation extraction corpus with additional instances for each entity pair. The dataset assures that the at-least-one assumption of multi-instance learning holds. This makes automatic evaluation more reliable and thus removes the need for manual verification.

Baselines
For evaluating RESIDE, we compare against the following baselines:
• Mintz: Multi-class logistic regression model proposed by Mintz et al. (2009) for the distant supervision paradigm.
• MultiR: Probabilistic graphical model for multi-instance learning by Hoffmann et al. (2011).
• MIMLRE: A graphical model which jointly models multiple instances and multiple labels. More details in (Surdeanu et al., 2012).
• PCNN: A CNN based relation extraction model by Zeng et al. (2015) which uses piecewise max-pooling for sentence representation.
• PCNN+ATT: A piecewise max-pooling over CNN based model which is used by Lin et al. (2016) to get sentence representations, followed by attention over sentences.
• BGWA: Bi-GRU based relation extraction model with word and sentence level attention (Jat et al., 2018).
• RESIDE: The method proposed in this paper; please refer to Section 5 for more details.

Evaluation Criteria
Following prior work (Lin et al., 2016; Feng et al.), we evaluate the models using the held-out evaluation scheme. This is done by comparing the relations discovered from test articles with those in Freebase. We evaluate the performance of the models with Precision-Recall curves and the top-N precision (P@N) metric in our experiments.
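Both metrics can be computed from a list of predicted facts sorted by decreasing model confidence and the set of facts present in the KB. The sketch below assumes this representation; the function names and the toy facts are illustrative and not taken from the original evaluation scripts.

```python
def precision_at_n(ranked_facts, gold_facts, n):
    """Fraction of the n most confident predictions that are in the KB."""
    top = ranked_facts[:n]
    return sum(1 for fact in top if fact in gold_facts) / len(top)


def precision_recall_curve(ranked_facts, gold_facts):
    """(precision, recall) after each prediction, in confidence order."""
    points, correct = [], 0
    for rank, fact in enumerate(ranked_facts, start=1):
        if fact in gold_facts:
            correct += 1
        points.append((correct / rank, correct / len(gold_facts)))
    return points


# Toy predictions, already sorted by decreasing confidence.
ranked = [
    ("Bill Gates", "founderOfCompany", "Microsoft"),
    ("Bill Gates", "placeOfBirth", "Microsoft"),
    ("Matt Coffin", "executiveOf", "lowermybills"),
]
gold = {
    ("Bill Gates", "founderOfCompany", "Microsoft"),
    ("Matt Coffin", "executiveOf", "lowermybills"),
    ("Bill Gates", "placeOfBirth", "Seattle"),
}
p_at_2 = precision_at_n(ranked, gold, n=2)
curve = precision_recall_curve(ranked, gold)
```

Since held-out evaluation treats any fact missing from the KB as wrong, precision computed this way is a lower bound when the KB is incomplete.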

Results
In this section, we attempt to answer the following questions:
Q1. Is RESIDE more effective than existing approaches for distantly supervised RE? (7.1)
Q2. What is the effect of ablating different components on RESIDE's performance? (7.2)
Q3. How is the performance affected in the absence of relation alias information? (7.3)

Performance Comparison
For evaluating the effectiveness of our proposed method, RESIDE, we compare it against the baselines stated in Section 6.2. We use only the neural baselines on the GIDS dataset. The Precision-Recall curves on Riedel and GIDS are presented in Figure 3. Overall, we find that RESIDE achieves higher precision over the entire recall range on both datasets. The non-neural baselines do not perform well, as the features they use are mostly derived from NLP tools, which can be erroneous. RESIDE outperforms PCNN+ATT and BGWA, which indicates that incorporating side information helps in improving the performance of the model. The higher performance of BGWA and PCNN+ATT over PCNN shows that attention helps in distantly supervised RE. Following Lin et al. (2016), we also evaluate our method with different numbers of sentences; the results further support the efficacy of our model.

Ablation Results
In this section, we analyze the effect of various components of RESIDE on its performance. For this, we evaluate various versions of our model with cumulatively removed components. The experimental results are presented in Figure 4. We observe that on removing different components from RESIDE, the performance of the model degrades drastically. The results validate that GCNs are effective at encoding syntactic information. Further, the improvement from side information shows that it is complementary to the features extracted from text, thus validating the central thesis of this paper, that inducing side information leads to improved relation extraction.

Effect of Relation Alias Side Information
In this section, we test the performance of the model in a setting where relation alias information is not readily available. For this, we evaluate the performance of the model in four different settings:
• None: Relation aliases are not available.
• One: The name of the relation is used as its alias.
• One+PPDB: Relation name extended using the Paraphrase Database (PPDB).
• All: Relation aliases from the Knowledge Base.
The overall results are summarized in Figure 5. We find that the model performs best when aliases are provided by the KB itself. Overall, we find that RESIDE gives competitive performance even when a very limited amount of relation alias information is available. We observe that performance improves further with the availability of more alias information.

Conclusion
In this paper, we propose RESIDE, a novel neural network based model which makes principled use of relevant side information, such as entity type and relation alias, from the Knowledge Base, for improving distantly supervised relation extraction. RESIDE employs Graph Convolution Networks for encoding syntactic information of sentences and is robust to limited side information. Through extensive experiments on benchmark datasets, we demonstrate RESIDE's effectiveness over state-of-the-art baselines. We have made RESIDE's source code publicly available to promote reproducible research.