Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach

Relation extraction is a fundamental task in information extraction. Most existing methods rely heavily on annotations labeled by human experts, which are costly and time-consuming. To overcome this drawback, we propose a novel framework, REHESSION, to conduct relation extractor learning using annotations from heterogeneous information sources, e.g., knowledge bases and domain heuristics. These annotations, referred to as heterogeneous supervision, often conflict with each other, which brings a new challenge to the original relation extraction task: how to infer the true label from noisy labels for a given instance. Identifying context information as the backbone of both relation extraction and true label discovery, we adopt embedding techniques to learn distributed representations of context, which bridge all components with mutual enhancement in an iterative fashion. Extensive experimental results demonstrate the superiority of REHESSION over the state-of-the-art.


Introduction
One of the most important tasks towards text understanding is to detect and categorize semantic relations between two entities in a given context. For example, in Fig. 1, with regard to the sentence of $c_1$, the relation between Jesse James and Missouri should be categorized as died in. With accurate identification, relation extraction systems can provide essential support for many applications. One example is question answering: for a specific question, relations among entities can provide valuable information, which helps to find better answers (Bao et al., 2014). Similarly, for medical science literature, relations like protein-protein interactions (Fundel et al., 2007) and gene-disease associations (Chun et al., 2006) can be extracted and used in knowledge base population. Additionally, relation extractors can be used in ontology construction (Schutz and Buitelaar, 2005).
Typically, existing methods follow the supervised learning paradigm and require extensive annotations from domain experts, which are costly and time-consuming. To alleviate this drawback, attempts have been made to build relation extractors with a small set of seed instances or human-crafted patterns (Nakashole et al., 2011, 2012; Carlson et al., 2010), based on which more patterns and instances are iteratively generated by bootstrap learning. However, these methods often suffer from semantic drift (Mintz et al., 2009). Besides, knowledge bases like Freebase have been leveraged to automatically generate training data and provide distant supervision (Mintz et al., 2009). Nevertheless, for many domain-specific applications, distant supervision is either non-existent or insufficient (usually less than 25% of relation mentions are covered (Ren et al., 2015; Lin et al., 2012)).
Only recently have preliminary studies been developed to unite different supervisions, including knowledge bases and domain-specific patterns, which are referred to as heterogeneous supervision. As shown in Fig. 1, these supervisions often conflict with each other (Ratner et al., 2016). To address these conflicts, data programming (Ratner et al., 2016) employs a generative model, which encodes supervisions as labeling functions, and adopts the source consistency assumption: a source is likely to provide true information with the same probability for all instances. This assumption is widely used in true label discovery literature (Li et al., 2016) to model reliabilities of information sources like crowdsourcing and infer the true label from noisy labels. Accordingly, most true label discovery methods would trust a human annotator on all instances to the same level.
[Figure 1 examples. Relation mentions: 'Robert Newton "Bob" Ford was an American outlaw best known for killing his gang leader Jesse James in Missouri'; 'Hussein was born in Amman on 14 November 1935.'; 'Gofraid died in 989, said to be killed in Dal Riata.' Labeling functions: return died_in for < , , s> if DiedIn( , ) in KB; return born_in for < , , s> if match(' * born in * ', s); return died_in for < , , s> if match(' * killed in * ', s); return born_in for < , , s> if BornIn( , ).]
However, labeling functions, unlike human annotators, do not make casual mistakes but follow a certain "error routine". Thus, the reliability of a labeling function is not consistent across instances. In particular, a labeling function could be more reliable for a certain subset (Varma et al., 2016) (also known as its proficient subset) compared to the rest. We identify these proficient subsets based on context information, only trust labeling functions on these subsets, and avoid assuming global source consistency.
Meanwhile, embedding methods have demonstrated great potential in capturing semantic meanings, and they also reduce the dimension of overwhelming text features. Here, we present REHESSION, a novel framework that captures the semantic meaning of context through representation learning, and conducts both relation extraction and true label discovery in a context-aware manner. Specifically, as depicted in Fig. 1, we embed relation mentions in a low-dimensional vector space, where similar relation mentions tend to have similar relation types and annotations. 'True' labels are further inferred based on the reliabilities of labeling functions, which are calculated with their proficient subsets' representations. Then, these inferred true labels serve as supervision for all components, including context representation, true label discovery and relation extraction. Besides, the context representation bridges relation extraction with true label discovery, and allows them to enhance each other.
To the best of our knowledge, the framework proposed here is the first method that utilizes representation learning to provide heterogeneous supervision for relation extraction. The high-quality context representations serve as the backbone of true label discovery and relation extraction. Extensive experiments on benchmark datasets demonstrate significant improvements over the state-of-the-art.
The remainder of this paper is organized as follows. Section 2 gives the definition of relation extraction with heterogeneous supervision. We then present the REHESSION model and the learning algorithm in Section 3, and report our experimental evaluation in Section 4. Finally, we briefly survey related work in Section 5 and conclude this study in Section 6.

Preliminaries
In this section, we formally define relation extraction and heterogeneous supervision. For a POS-tagged corpus $D$ with detected entities, we refer to its relation mentions as $C$, where each relation mention $c \in C$ consists of two entity mentions and the sentence containing them, e.g., ("Hussein", "Amman", "Hussein was born in Amman"). Our goal is to annotate entity mentions with relation types of interest ($R = \{r_1, \ldots, r_K\}$) or None. We require users to provide heterogeneous supervision in the form of labeling functions $\Lambda = \{\lambda_1, \ldots, \lambda_M\}$, and mark the annotations generated by $\Lambda$ as $O = \{o_{c,i} \mid \lambda_i \text{ generates annotation } o_{c,i} \text{ for } c \in C\}$. We record relation mentions annotated by $\Lambda$ as $C_l$, and refer to relation mentions without annotation as $C_u$. Then, our task is to train a relation extractor based on $C_l$ and categorize relation mentions in $C_u$.
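The setting above can be illustrated with a minimal Python sketch. It assumes a relation mention is a plain (entity, entity, sentence) tuple, uses a toy knowledge base, and lets a labeling function return Python `None` to abstain; the helper names (`kb_function`, `pattern_function`, `annotate_corpus`) and the last example mention are hypothetical and not taken from the paper.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# A relation mention c = (entity 1, entity 2, sentence); this layout is an assumption.
Mention = Tuple[str, str, str]
# A labeling function returns a relation type, or Python None to abstain.
LabelingFunction = Callable[[Mention], Optional[str]]

# Toy knowledge base (hypothetical contents, for illustration only).
KB = {("died_in", "Jesse James", "Missouri"), ("born_in", "Hussein", "Amman")}


def kb_function(relation: str) -> LabelingFunction:
    """Annotate a mention when the knowledge base holds the fact."""
    def annotate(c: Mention) -> Optional[str]:
        e1, e2, _ = c
        return relation if (relation, e1, e2) in KB else None
    return annotate


def pattern_function(relation: str, pattern: str) -> LabelingFunction:
    """Annotate a mention when its sentence matches a textual pattern."""
    regex = re.compile(pattern)
    def annotate(c: Mention) -> Optional[str]:
        return relation if regex.search(c[2]) else None
    return annotate


def annotate_corpus(mentions: List[Mention], functions: List[LabelingFunction]):
    """Return annotations O, labeled indices C_l and unlabeled indices C_u."""
    annotations: Dict[Tuple[int, int], str] = {}
    for ci, c in enumerate(mentions):
        for fi, lf in enumerate(functions):
            label = lf(c)
            if label is not None:
                annotations[(ci, fi)] = label
    labeled = sorted({ci for ci, _ in annotations})
    unlabeled = [ci for ci in range(len(mentions)) if ci not in labeled]
    return annotations, labeled, unlabeled


functions = [
    kb_function("died_in"),
    pattern_function("born_in", r"\bborn in\b"),
    pattern_function("died_in", r"\bkilled in\b"),
    kb_function("born_in"),
]
mentions = [
    ("Jesse James", "Missouri", "Bob Ford is best known for killing Jesse James in Missouri."),
    ("Hussein", "Amman", "Hussein was born in Amman on 14 November 1935."),
    ("Gofraid", "Dal Riata", "Gofraid died in 989, said to be killed in Dal Riata."),
    ("Alice", "Paris", "Alice visited Paris."),  # hypothetical mention with no annotation
]
O, C_l, C_u = annotate_corpus(mentions, functions)
```

Here `O` is keyed by (mention index, labeling function index), so one mention can carry several annotations, which is where conflicts would surface in a larger corpus.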

The REHESSION Framework
Here, we present REHESSION, a novel framework to infer true labels from automatically generated noisy labels, and categorize unlabeled instances into a set of relation types. Intuitively, errors of annotations ($O$) come from mismatch of contexts, e.g., in Fig. 1, $\lambda_1$ annotates $c_1$ and $c_2$ with 'true' labels but for mismatched contexts 'killing' and 'killed'. Accordingly, we should only trust labeling functions on matched context, e.g., trust $\lambda_1$ on $c_3$ due to its context 'was born in', but not on $c_1$ and $c_2$. On the other hand, relation extraction can be viewed as matching an appropriate relation type to a certain context. These two matching processes are closely related and can enhance each other, while context representation plays an important role in both of them.

Framework Overview.
We propose a general framework to learn the relation extractor from automatically generated noisy labels. As plotted in Fig. 1, the distributed representation of context bridges relation extraction with true label discovery, and allows them to enhance each other. Specifically, it follows the steps below:
1. After being extracted from context, text features are embedded in a low-dimensional space by representation learning (see Fig. 2);
2. Text feature embeddings are utilized to calculate relation mention embeddings (see Fig. 2);
3. With relation mention embeddings, true labels are inferred by calculating labeling functions' reliabilities in a context-aware manner (see Fig. 1);
4. Inferred true labels 'supervise' all components to learn model parameters (see Fig. 1).
We now proceed by introducing these components of the model in further detail.

Modeling Relation Mention
As shown in Table 2, we extract abundant lexical features (Ren et al., 2016; Mintz et al., 2009; Chan and Roth, 2010) to characterize relation mentions. However, this abundance also results in the gigantic dimension of the original text features ($\sim 10^7$ in our case). In order to achieve better generalization ability, we represent relation mentions with low-dimensional ($\sim 10^2$) vectors. In Fig. 2, for example, relation mention $c_3$ is first represented as a bag of features. After learning text feature embeddings, we use the average of feature embedding vectors to derive the embedding vector for $c_3$.
Text Feature Representation. Similar to other principles of embedding learning, we assume text features occurring in the same contexts tend to have similar meanings (also known as the distributional hypothesis (Harris, 1954)). Furthermore, we let each text feature's embedding vector predict other text features occurring in the same relation mentions or context. Thus, text features with similar meanings should have similar embedding vectors. Formally, we mark text features as $F$ (see Table 2, where ("Hussein", "Amman", "Hussein was born in Amman") is used as an example), record the feature set for $\forall c \in C$ as $f_c$, represent the embedding vector for $f_i$ as $v_i \in \mathbb{R}^{n_v}$, and aim to maximize the log likelihood of the conditional probabilities $p(f_i|f_j)$ of text features occurring in the same relation mention.
However, the optimization of this likelihood is impractical because the calculation of $\nabla p(f_i|f_j)$ requires summation over all text features, whose size exceeds $10^7$ in our case. In order to perform efficient optimization, we adopt the negative sampling technique (Mikolov et al., 2013) to avoid this summation. Accordingly, we replace the log likelihood with the negative sampling objective in Eq. 1, where $P$ is the noise distribution used in (Mikolov et al., 2013), $\sigma$ is the sigmoid function and $V$ is the number of negative samples.
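As an illustration of the negative sampling idea, the sketch below scores one (feature, context feature) pair against $V$ sampled negatives. It follows the standard objective of Mikolov et al. (2013), $\log \sigma(v_i^\top v_j^*) + \sum_k \log \sigma(-v_i^\top v_k^*)$, and the commonly used unigram noise distribution raised to the power 0.75; whether Eq. 1 matches this term by term is an assumption, and all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sample_negatives(counts: Dict[str, int], num_samples: int,
                     rng: random.Random) -> List[str]:
    """Draw V features from a smoothed unigram noise distribution (count ** 0.75)."""
    features = list(counts)
    weights = [counts[f] ** 0.75 for f in features]
    return rng.choices(features, weights=weights, k=num_samples)


def negative_sampling_objective(v_i: Sequence[float],
                                v_j_ctx: Sequence[float],
                                negative_ctx: Sequence[Sequence[float]]) -> float:
    """Objective for one observed pair plus its sampled negatives (to be maximized)."""
    # Reward a high score for the observed pair (f_i, f_j).
    positive = math.log(sigmoid(dot(v_i, v_j_ctx)))
    # Reward low scores for the sampled noise features.
    negative = sum(math.log(sigmoid(-dot(v_i, v_k))) for v_k in negative_ctx)
    return positive + negative
```

The cost of one update depends on $V$ rather than on the number of text features, which is the reason for replacing the full softmax.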
With the learned text feature embeddings, we derive the representation of a relation mention as
$$z_c = \tanh\Big(W \cdot \frac{1}{|f_c|} \sum_{f_i \in f_c} v_i\Big)$$
where $z_c$ is the representation of $c \in C_l$, $W$ is an $n_z \times n_v$ matrix, $n_z$ is the dimension of relation mention embeddings and $\tanh$ is the element-wise hyperbolic tangent function.
In other words, we represent a bag of text features with their average embedding, then apply a linear map and the hyperbolic tangent to transform the embedding from the text feature semantic space to the relation mention semantic space. The non-linear tanh function allows non-linear class boundaries in other components, and also regularizes the relation mention representation to the range [−1, 1], which avoids numerical instability issues.
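The average-then-transform step described above can be sketched in a few lines of plain Python. The function name and the toy values of `W` and the feature vectors are hypothetical; only the computation (average, linear map, element-wise tanh) follows the text.

```python
import math
from typing import List, Sequence


def relation_mention_embedding(feature_vectors: Sequence[Sequence[float]],
                               W: Sequence[Sequence[float]]) -> List[float]:
    """Average the text feature embeddings, apply the linear map W, then tanh."""
    n_v = len(feature_vectors[0])
    average = [sum(v[d] for v in feature_vectors) / len(feature_vectors)
               for d in range(n_v)]
    # W has n_z rows of length n_v, so the result lives in the n_z-dimensional space.
    return [math.tanh(sum(w * a for w, a in zip(row, average))) for row in W]


feature_vectors = [[1.0, 0.0], [0.0, 1.0]]   # two text features, n_v = 2
W = [[1.0, 1.0], [2.0, -2.0], [0.0, 0.0]]    # n_z = 3
z_c = relation_mention_embedding(feature_vectors, W)
```

Every coordinate of `z_c` lies in [−1, 1] regardless of how many features the mention has, since the average does not grow with the size of the bag.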

True Label Discovery
Because heterogeneous supervision generates labels in a discriminative way, we suppose its errors follow certain underlying principles, i.e., if a labeling function annotates an instance correctly / wrongly, it would annotate other similar instances correctly / wrongly. For example, $\lambda_1$ in Fig. 1 generates wrong annotations for two similar instances $c_1$, $c_2$ and would make the same errors on other similar instances. Since context representation captures the semantic meaning of a relation mention and would be used to identify relation types, we also use it to identify the mismatch of context and labeling functions. Thus, we suppose that for each labeling function $\lambda_i$, there exists a proficient subset $S_i$ on $\mathbb{R}^{n_z}$, containing instances that $\lambda_i$ can precisely annotate. In Fig. 1, for instance, $c_3$ is in the proficient subset of $\lambda_1$, while $c_1$ and $c_2$ are not. Moreover, the generation of annotations is not really random, and we propose a probabilistic model to describe the level of mismatch from labeling functions to real relation types instead of the generation of annotations.
As shown in Fig. 3, we assume the indicator of whether $c$ belongs to $S_i$, $s_{c,i} = \delta(c \in S_i)$, would first be generated based on the context representation $z_c$. Then the correctness of annotation $o_{c,i}$, $\rho_{c,i} = \delta(o_{c,i} = o_c^*)$, would be generated. Furthermore, we assume $p(\rho_{c,i} = 1 \mid s_{c,i} = 1) = \phi_1$ and $p(\rho_{c,i} = 1 \mid s_{c,i} = 0) = \phi_0$ to be constant for all relation mentions and labeling functions.
Because $s_{c,i}$ would not be used in other components of our framework, we integrate out $s_{c,i}$ and write the log likelihood (Eq. 4) as
$$J_T = \sum_{o_{c,i} \in O} \log \sum_{s \in \{0, 1\}} p(s_{c,i} = s \mid z_c)\, p(\rho_{c,i} \mid s_{c,i} = s)$$
Note that $o_c^*$ is a hidden variable but not a model parameter, and $J_T$ is the likelihood of $\rho_{c,i} = \delta(o_{c,i} = o_c^*)$. Thus, we would first infer $o_c^* = \arg\max_{o_c^*} J_T$, then train the true label discovery model by maximizing $J_T$.
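A small sketch may make the context-aware inference concrete. It assumes that membership in the proficient subset is modeled as $p(s_{c,i} = 1 \mid z_c) = \sigma(z_c^\top l_i)$ with one vector $l_i$ per labeling function (an assumption suggested by the parameter $l$ in the learning section, not stated in this section), restricts candidate true labels to those proposed by some labeling function, and uses hypothetical values for $\phi_1$ and $\phi_0$.

```python
import math
from typing import Dict, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def annotation_log_likelihood(z_c, l_i, is_correct: bool,
                              phi1: float, phi0: float) -> float:
    """log p(rho_{c,i}) with the subset indicator s_{c,i} integrated out."""
    p_in = sigmoid(dot(z_c, l_i))  # assumed form of p(s_{c,i} = 1 | z_c)
    if is_correct:
        likelihood = p_in * phi1 + (1.0 - p_in) * phi0
    else:
        likelihood = p_in * (1.0 - phi1) + (1.0 - p_in) * (1.0 - phi0)
    return math.log(likelihood)


def infer_true_label(z_c, annotations: Dict[int, str],
                     l: Dict[int, Sequence[float]],
                     phi1: float, phi0: float) -> str:
    """Pick the candidate label that maximizes the likelihood of all annotations of c."""
    candidates = sorted(set(annotations.values()))

    def score(candidate: str) -> float:
        return sum(
            annotation_log_likelihood(z_c, l[i], label == candidate, phi1, phi0)
            for i, label in annotations.items()
        )

    return max(candidates, key=score)


# Two labeling functions that disagree on the same mention (toy values).
l = {0: [3.0, 0.0], 1: [-3.0, 0.0]}
annotations = {0: "born_in", 1: "None"}
label_a = infer_true_label([1.0, 0.0], annotations, l, phi1=0.9, phi0=0.1)
label_b = infer_true_label([-1.0, 0.0], annotations, l, phi1=0.9, phi0=0.1)
```

The same pair of conflicting annotations is resolved differently for the two contexts, which is the behavior the source consistency assumption cannot express.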

Modeling Relation Type
We now discuss the model for identifying relation types based on context representation. For each relation mention $c$, its representation $z_c$ implies its relation type, and the distribution of relation types can be described by the soft-max function (Eq. 5):
$$p(r_i \mid z_c) = \frac{\exp(z_c^\top t_i)}{\sum_{j} \exp(z_c^\top t_j)}$$
where $t_i \in \mathbb{R}^{n_z}$ is the representation for relation type $r_i$. Moreover, with the inferred true label $o_c^*$, the relation extraction model can be trained as a multi-class classifier. Specifically, we use Eq. 5 to approach the distribution $p(\cdot \mid o_c^*)$ determined by the inferred true label (Eq. 6). We use KL-divergence to measure the dissimilarity between the two distributions, and formulate model learning as maximizing $J_R$ in Eq. 7, where the two distributions compared by the KL-divergence have the form of Eq. 5 and Eq. 6.
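The soft-max over relation type vectors, and a training signal derived from the inferred true label, can be sketched as follows. The sketch assumes the target distribution is one-hot on $o_c^*$, in which case the KL-divergence from the target to the model reduces to $-\log p(o_c^* \mid z_c)$; both the one-hot target and the direction of the divergence are assumptions, and the type vectors are toy values.

```python
import math
from typing import Dict, Sequence


def relation_type_distribution(z_c: Sequence[float],
                               type_vectors: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Soft-max over the scores z_c . t_i, one score per relation type."""
    scores = {r: sum(z * t for z, t in zip(z_c, vec))
              for r, vec in type_vectors.items()}
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {r: math.exp(s - shift) for r, s in scores.items()}
    total = sum(exps.values())
    return {r: e / total for r, e in exps.items()}


def extraction_loss(z_c, type_vectors, true_label: str) -> float:
    """-log p(o* | z_c): KL from a one-hot target to the model (assumed form)."""
    return -math.log(relation_type_distribution(z_c, type_vectors)[true_label])


type_vectors = {"born_in": [2.0, 0.0], "died_in": [0.0, 2.0], "None": [0.0, 0.0]}
probs = relation_type_distribution([1.0, 0.0], type_vectors)
uniform_loss = extraction_loss([0.0, 0.0], type_vectors, "born_in")
```

Minimizing this loss over $C_l$ corresponds to maximizing an objective of the $J_R$ kind under the stated assumptions.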

Model Learning
Based on Eq. 1, Eq. 4 and Eq. 7, we form the joint optimization problem for model parameters in Eq. 8. Collectively optimizing Eq. 8 allows heterogeneous supervision to guide all three components, while these components refine the context representation and enhance each other. In order to solve the joint optimization problem in Eq. 8 efficiently, we adopt the stochastic gradient descent algorithm to update $\{W, v, v^*, l, t\}$ iteratively, and $o_c^*$ is estimated by maximizing $J_T$ after calculating $z_c$. Additionally, we apply the widely used dropout technique (Srivastava et al., 2014) to prevent overfitting and improve generalization performance.
The learning process of REHESSION is summarized below. In each iteration, we sample a relation mention $c$ from $C_l$, then sample $c$'s text features and conduct the text features' representation learning. After calculating the representation of $c$, we infer its true label $o_c^*$ based on our true label discovery model, and finally update model parameters based on $o_c^*$.

Relation Type Inference
We now discuss the strategy of performing type inference for $C_u$.

Experiments
In this section, we empirically validate our method by comparing it to state-of-the-art relation extraction methods on news and Wikipedia articles.

Datasets and settings
In the experiments, we conduct investigations on two benchmark datasets from different domains: NYT (Riedel et al., 2010), some sentences of which are annotated by the authors of (Hoffmann et al., 2011) and used as test data; and Wiki-KBP, which utilizes 1.5M sentences sampled from 780k Wikipedia articles (Ling and Weld, 2012) as the training corpus, while its test set consists of the 2k sentences manually annotated in the 2013 KBP slot filling assessment results (Ellis et al., 2012).
For both datasets, the training and test set partitions are maintained in our experiments. Furthermore, we create validation sets by randomly sampling 10% of the mentions from each test set and use the remaining part as evaluation sets.
Feature Generation. As summarized in Table 2, we use a 6-word window to extract context features for each entity mention, and apply the Stanford CoreNLP tool (Manning et al., 2014) to generate entity mentions and get POS tags for both datasets. Brown clusters (Brown et al., 1992) are derived for each corpus using a public implementation. All these features are shared with all compared methods in our experiments.

Labeling Functions.
In our experiments, labeling functions are employed to encode two kinds of supervision information. One is the knowledge base; the other is handcrafted domain-specific patterns. For domain-specific patterns, we manually design a number of labeling functions; for the knowledge base, annotations are generated following the procedure in (Ren et al., 2016; Riedel et al., 2010).
Regarding the two kinds of supervision information, the statistics of the labeling functions are summarized in Table 4. We can observe that heuristic patterns can identify more relation types for the KBP dataset, while for the NYT dataset, the knowledge base can provide supervision for more relation types. This observation aligns with our intuition that a single kind of information might be insufficient while different kinds of information can complement each other.
We further summarize the statistics of annotations in Table 5.

Compared Methods
We compare REHESSION with the following methods:
FIGER (Ling and Weld, 2012) adopts multi-label learning with the Perceptron algorithm;
BFK (Bunescu and Mooney, 2005) applies a bag-of-feature kernel to train a support vector machine;
DSL (Mintz et al., 2009) trains a multi-class logistic classifier on the training data;
MultiR (Hoffmann et al., 2011) models training label noise by multi-instance multi-label learning;
FCM (Gormley et al., 2015) performs compositional embedding by a neural language model;
CoType-RM (Ren et al., 2016) adopts a partial-label loss to handle label noise and train the extractor.
Moreover, two different strategies are adopted to feed heterogeneous supervision to these methods. The first is to keep all noisy labels, marked as 'NL'. Alternatively, a true label discovery method, Investment (Pasternack and Roth, 2010), is applied to resolve conflicts; it is based on the source consistency assumption and iteratively updates the inferred true labels and the labeling functions' reliabilities. The second strategy is then to feed only the inferred true labels, referred to as 'TD'.

Evaluation Metrics.
For the relation classification task, which excludes the None type from training / testing, we use the classification accuracy (Acc) for evaluation, and for the relation extraction task, precision (Prec), recall (Rec) and F1 score (Bunescu and Mooney, 2005; Bach and Badaskar, 2007) are employed. Note that both relation extraction and relation classification are conducted and evaluated at the sentence level (Bao et al., 2014).
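For reference, these metrics can be computed as in the sketch below. It assumes the usual convention for relation extraction in which None counts as the negative class, so precision is taken over non-None predictions and recall over non-None gold labels; this convention and the function names are assumptions rather than details stated in the paper.

```python
from typing import Sequence, Tuple


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of mentions whose predicted type equals the gold type."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f1(gold: Sequence[str], predicted: Sequence[str],
                        none_label: str = "None") -> Tuple[float, float, float]:
    """Sentence-level precision, recall and F1 with None as the negative class."""
    correct = sum(1 for g, p in zip(gold, predicted) if p != none_label and g == p)
    num_predicted = sum(1 for p in predicted if p != none_label)
    num_gold = sum(1 for g in gold if g != none_label)
    precision = correct / num_predicted if num_predicted else 0.0
    recall = correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["born_in", "None", "died_in", "born_in"]
predicted = ["born_in", "born_in", "None", "died_in"]
prec, rec, f1 = precision_recall_f1(gold, predicted)
```

In this toy example only the first mention is extracted correctly, out of three non-None predictions and three non-None gold labels.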

Performance Comparison
Given the experimental setup described above, the evaluation scores of relation classification and relation extraction on the two datasets, averaged over 10 runs, are summarized in Table 6.
The comparison shows that the NL strategy yields better performance than the TD strategy, since the true labels inferred by Investment are actually wrong for many instances. On the other hand, as discussed in Sec. 4.4, our method introduces context-awareness to true label discovery, while the inferred true labels guide the relation extractor to achieve the best performance. This observation justifies the motivation of avoiding the source consistency assumption and the effectiveness of the proposed true label discovery model.
One could also observe that the difference between REHESSION and the compared methods is more significant on the NYT dataset than on the Wiki-KBP dataset. This observation accords with the fact that the NYT dataset contains more conflicts than the KBP dataset (see Table 5), and the intuition is that our method has more advantages on more conflicting labels. Among the four tasks, the relation classification of the Wiki-KBP dataset has the highest label quality (i.e., the lowest conflicting label ratio), but the smallest number of training instances. CoType-RM and DSL reach relatively better performance among all compared methods. CoType-RM performs much better than DSL on the Wiki-KBP relation classification task, while DSL gets better or similar performance compared with CoType-RM on the other tasks. This may be because the representation learning method is able to generalize better, and thus performs better when the training set is small. However, it is rather vulnerable to noisy labels compared to DSL. Our method employs embedding techniques, and also integrates context-aware true label discovery to de-noise labels, making the embedding method rather robust, and thus achieves the best performance on all tasks.

Case Study
Context Awareness of True Label Discovery.

Table 8 shows the output of REHESSION and Investment. As mentioned before, most true label discovery methods adopt the source consistency assumption, which means that if they trust a labeling function, they trust it on all annotations. For example, Investment infers None as the true type for all four instances. Our method infers true labels in a context-aware manner, which means we only trust labeling functions on matched contexts. For example, our method infers born-in as the true label for the first two relation mentions; after replacing born with other words (elected and examined), our method no longer trusts born-in since the modified contexts are no longer matched, and then infers None as the true label.
[Table 8 row: "Raila Odinga was examined at ..., in Maseno, Kisumu District, ..."; outputs: None, None]

Effectiveness of True Label Discovery. We explore the effectiveness of the proposed context-aware true label discovery component by comparing REHESSION to its variant REHESSION-TD, which uses the Investment method to resolve conflicts. The averaged evaluation scores are summarized in Table 7. We can observe that REHESSION-TD achieves much worse performance than REHESSION. Since the only difference between REHESSION and REHESSION-TD is the model employed to resolve conflicts, this gap verifies the effectiveness of the proposed context-aware true label discovery method.

Relation Extraction.
Relation extraction is one of the most important tasks in NLP. To alleviate the dependency on annotations given by human experts, weak supervision (Bunescu and Mooney, 2007; Etzioni et al., 2004) and distant supervision (Ren et al., 2016) have been employed to automatically generate annotations based on a knowledge base (or seed patterns/instances). Here we propose a more general framework to consolidate heterogeneous information and further refine the true label from noisy labels, which gives the relation extractor the potential to detect more types of relations in a more precise way.
Word embedding has demonstrated great potential in capturing semantic meaning (Mikolov et al., 2013), and has achieved great success in a wide range of NLP tasks like relation extraction (Zeng et al., 2014; Takase and Inui, 2016; Nguyen and Grishman, 2015). In our model, we employ embedding techniques to represent context information and reduce the dimension of text features, which allows our model to generalize better.

True Label Discovery.
True label discovery methods have been developed to resolve conflicts among multi-source information (Li et al., 2016; Zhao et al., 2012; Zhi et al., 2015). Specifically, in the spammer-hammer model (Karger et al., 2011, 2013), each source could either be a spammer, which annotates instances randomly, or a hammer, which annotates instances precisely. In this paper, we assume each labeling function would be a hammer on its proficient subset, and a spammer otherwise, while the proficient subsets are identified in the embedding space.
Besides data programming, socratic learning (Varma et al., 2016) has been developed to conduct binary classification under heterogeneous supervision. Its true label discovery module supervises the discriminative module at the label level, while the discriminative module influences the true label discovery module by selecting a feature subset. Although delicately designed, it fails to make full use of the connection between these modules, i.e., it does not refine the context representation for the classifier. Thus, its discriminative module might suffer from the overwhelming size of text features.

Conclusion and Future Work
In this paper, we propose REHESSION, an embedding framework to extract relations under heterogeneous supervision. It resolves conflicts based on context representation and avoids the source consistency assumption. Heterogeneous supervision allows our model to extract more relation types, while the inferred high-quality true labels allow our model to be more accurate. Experimental evaluation justifies the effectiveness of the proposed framework on two real-world datasets.
There exist many directions for future work. One is to apply transfer learning techniques to handle the difference between the label distributions of the training set and the test set. Another is to further incorporate information like relation type hierarchy for fine-grained relation extraction.

Figure 1: REHESSION Framework except Extraction and Representation of Text Features

Figure 2: Relation Mention Representation
relation type based on one elementary piece of information, e.g., four examples are listed in Fig. 1.
Problem Definition.

record the feature set for $\forall c \in C$ as $f_c$, and represent the embedding vector for $f_i$ as $v_i \in \mathbb{R}^{n_v}$, and we aim to maximize the following

Table 2: Text features F used in this paper.

Table 3: Proportion of None in Training/Test Set

As shown in Table 3, the proportion of None in $C_u$ is usually much larger than in $C_l$. Additionally, unlike other relation types in $R$, None does not have a coherent semantic meaning. Similar to (Ren et al., 2016), we introduce a heuristic rule: identifying a relation mention as None when (1) our relation extractor predicts it as None, or (2) the entropy of $p(\cdot \mid z_c)$ over $R$ exceeds a pre-defined threshold $\eta$. The entropy is calculated as $H(p(\cdot \mid z_c)) = -\sum_{r_i \in R} p(r_i \mid z_c) \log p(r_i \mid z_c)$. The second situation means that, based on the relation extractor, this relation mention is not likely to belong to any relation type in $R$.
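The heuristic rule for None can be sketched directly from its two conditions. The sketch takes the entropy sum literally over the relation types in $R$ (excluding None, without renormalizing), which is one reading of the formula; the function name, the label string "None" and the threshold value used in the example are hypothetical.

```python
import math
from typing import Dict


def predict_with_none(probs: Dict[str, float], eta: float,
                      none_label: str = "None") -> str:
    """Return None if it is the top prediction or if the entropy over R exceeds eta."""
    best = max(probs, key=probs.get)
    if best == none_label:
        return none_label  # condition (1): the extractor itself predicts None
    # Condition (2): entropy of p(.|z_c) restricted to the relation types in R.
    entropy = -sum(p * math.log(p)
                   for r, p in probs.items()
                   if r != none_label and p > 0.0)
    return none_label if entropy > eta else best


confident = {"born_in": 0.90, "died_in": 0.05, "None": 0.05}
uncertain = {"born_in": 0.45, "died_in": 0.45, "None": 0.10}
pred_confident = predict_with_none(confident, eta=0.5)
pred_uncertain = predict_with_none(uncertain, eta=0.5)
```

In practice the threshold would be tuned on a validation set; the value 0.5 here only serves the example.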

Table 4: Number of labeling functions and the relation types they can annotate w.r.t. two kinds of information

Table 5: Number of relation mentions (RM), relation mentions annotated as None, relation mentions with conflicting annotations and conflicts involving None
It can be observed that a large portion of instances is only annotated as None, while lots of conflicts exist among other instances. This phenomenon justifies the motivation to employ a true label discovery model to resolve the conflicts among supervision. Also, we can observe that most conflicts involve the None type; accordingly, our proposed method should have more advantages over traditional true label discovery methods on the relation extraction task compared to the relation classification task, which excludes the None type.

Table 6: Performance comparison of relation extraction and relation classification
are employed. Note that both relation extraction and relation classification are conducted and evaluated in sentence-level

Table 8: Example output of true label discovery. The first two relation mentions come from Wiki-KBP, and their annotations are {born-in, None}. The last two are created by replacing key words of the first two. Key words are marked in bold and entity mentions in italics.