Joint Bootstrapping Machines for High Confidence Relation Extraction

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed instances. Due to the lack of labeled data, a key challenge in bootstrapping is semantic drift: if a false positive instance is added during an iteration, then all following iterations are contaminated. We introduce BREX, a new bootstrapping method that protects against such contamination by highly effective confidence assessment. This is achieved by using entity and template seeds jointly (as opposed to just one as in previous work), by expanding entities and templates in parallel and in a mutually constraining fashion in each iteration, and by introducing higher-quality similarity measures for templates. Experimental results show that BREX achieves an F1 that is 0.13 (0.87 vs. 0.74) better than the state of the art for four relationships.


Introduction
Traditional semi-supervised bootstrapping relation extractors (REs) such as BREDS (Batista et al., 2015), SnowBall (Agichtein and Gravano, 2000) and DIPRE (Brin, 1998) require an initial set of seed entity pairs for the target binary relation. They find occurrences of positive seed entity pairs in the corpus, which are converted into extraction patterns, i.e., extractors, where we define an extractor as a cluster of instances generated from the corpus. In each iteration, the initial seed entity pair set is expanded with the entity pairs newly extracted by the extractors from the text. The augmented set is then used to extract new relationships until a stopping criterion is met.
Due to the lack of sufficient labeled data, rule-based systems dominate commercial use (Chiticariu et al., 2013). Rules are typically defined by creating patterns around the entities (entity extraction) or entity pairs (relation extraction). Recently, supervised machine learning, especially deep learning techniques (Gupta et al., 2015; Nguyen and Grishman, 2015; Vu et al., 2016a,b; Gupta et al., 2016), have shown promising results in entity and relation extraction; however, they need sufficient hand-labeled data to train models, which can be costly and time consuming for web-scale extractions. Bootstrapping machine-learned rules can make extractions easier on large corpora. Thus, open information extraction systems (Carlson et al., 2010; Fader et al., 2011; Mausam et al., 2012; Mesquita et al., 2013; Angeli et al., 2015) have recently been popular for domain-specific or domain-independent pattern learning.
Hearst (1992) used hand-written rules to generate more rules to extract hypernym-hyponym pairs, without distributional similarity. For entity extraction, Riloff (1996) used seed entities to generate extractors with heuristic rules and scored them by counting positive extractions. Prior work (Lin et al., 2003) investigated different extractor scoring measures; later work improved scores by introducing the expected number of negative entities. Brin (1998) developed the bootstrapping relation extraction system DIPRE that generates extractors by clustering contexts based on string matching. SnowBall (Agichtein and Gravano, 2000) is inspired by DIPRE but computes a TF-IDF representation of each context. BREDS (Batista et al., 2015) uses word embeddings (Mikolov et al., 2013) to bootstrap relationships.
Related work investigated adapting extractor scoring measures in bootstrapping entity extraction with either entities or templates (Table 1) as seeds (Table 2). The state-of-the-art relation extractors bootstrap with only seed entity pairs and suffer from a surplus of unknown extractions and the lack of labeled data, leading to low-confidence extractors and, in turn, low confidence in the system output.

Table 1: Notation.
BREE: Bootstrapping Relation Extractor with Entity pair
BRET: Bootstrapping Relation Extractor with Template
BREJ: Bootstrapping Relation Extractor in Joint learning
type: a named entity type, e.g., person
typed entity: an entity associated with a type, e.g., ("Obama", person)
entity pair: a pair of two typed entities
template: a triple of vectors (v_-1, v_0, v_1) and the types of the two entities
instance: an entity pair and a template (types must be the same)

(1) We propose a Joint Bootstrapping Machine[1] (JBM), an alternative to entity-pair-centered bootstrapping for relation extraction that can take advantage of both entity-pair and template-centered methods to jointly learn extractors consisting of instances due to the occurrences of both entity pair and template seeds. It scales up the number of positive extractions for non-noisy extractors and boosts their confidence scores. We focus on improving the scores of non-noisy low-confidence extractors, resulting in higher recall. The relation extractors bootstrapped with entity pair, template and joint seeds are named BREE, BRET and BREJ (Table 1), respectively.
(2) Prior work on embedding-based context comparison has assumed that relations have consistent syntactic expression and has mainly addressed synonymy by using embeddings (e.g., "acquired" - "bought"). In reality, there is large variation in the syntax of how relations are expressed, e.g., "MSFT to acquire NOK for $8B" vs. "MSFT earnings hurt by NOK acquisition". We introduce cross-context similarities that compare all parts of the context (e.g., "to acquire" and "acquisition") and show that these perform better (in terms of recall) than measures assuming consistent syntactic expression of relations.

[1] github.com/pgcool/Joint-Bootstrapping-Machines
(3) Experimental results demonstrate a 13% gain in F1 score on average for four relationships and suggest eliminating four parameters, compared to the state-of-the-art method.
The motivation and benefits of the proposed JBM for relation extraction are discussed in depth in Section 2.3. The method is applicable to both entity and relation extraction tasks. However, in the context of relation extraction, we call it BREJ.

Notation and definitions
We first introduce the notation and terms (Table 1).
Given a relationship like "x acquires y", the task is to extract pairs of entities from a corpus for which the relationship is true. We assume that the arguments of the relationship are typed, e.g., x and y are organizations. We run a named entity tagger in preprocessing, so that the types of all candidate entities are given. The objects the bootstrapping algorithm generally handles are therefore typed entities (an entity associated with a type).
For a particular sentence in a corpus that states that the relationship (e.g., "acquires") holds between x and y, a template consists of three vectors that represent the context of x and y: v_-1 represents the context before x, v_0 the context between x and y and v_1 the context after y. These vectors are simply sums of the embeddings of the corresponding words. A template is "typed", i.e., in addition to the three vectors it specifies the types of the two entities. An instance joins an entity pair and a template. The types of entity pair and template must be the same.
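The construction of a template from a tagged sentence can be sketched as follows. This is a minimal illustration: the random embedding lookup is a stand-in assumption (the paper uses pretrained embeddings such as GloVe), and the entity spans are supplied directly rather than by a named entity tagger.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
VOCAB = {}

def embed(word):
    # Hypothetical embedding lookup; stands in for pretrained vectors.
    if word not in VOCAB:
        VOCAB[word] = rng.normal(size=DIM)
    return VOCAB[word]

def make_template(tokens, x_span, y_span):
    """Build (v_-1, v_0, v_1): sums of the word embeddings before x,
    between x and y, and after y. Spans are (start, end) token offsets."""
    contexts = (tokens[:x_span[0]],
                tokens[x_span[1]:y_span[0]],
                tokens[y_span[1]:])
    return tuple(np.sum([embed(w) for w in ws], axis=0) if ws
                 else np.zeros(DIM) for ws in contexts)

tokens = "Reuters reports that MSFT acquired NOK yesterday".split()
v_before, v_between, v_after = make_template(tokens, (3, 4), (5, 6))
```

Since the between context here is the single word "acquired", v_0 is just that word's embedding; longer contexts are summed.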
The first step of bootstrapping is to extract a set of instances from the input corpus. We refer to this set as γ. We will use i and j to refer to instances; x(i) is the entity pair of instance i and t(i) is the template of instance i.
A required input to our algorithm are sets of positive and negative seeds for either entity pairs or templates or both (G_p and G_n for each seed type). We define G to be a tuple of all four seed sets.
We run our bootstrapping algorithm for k_it iterations, where k_it is a parameter.
A key notion is the similarity between two instances. We will experiment with different similarity measures. The baseline is Batista et al. (2015)'s measure, given in Figure 4, first line: the similarity of two instances is a weighted sum of the dot products of their before contexts (v_-1), their between contexts (v_0) and their after contexts (v_1), where the weights w_p are parameters. We give this definition for instances, but it also applies to templates since only the context vectors of an instance are used, not the entities.
The similarity between an instance i and a cluster λ of instances is defined as the maximum similarity of i with any member of the cluster; see Figure 2, right, Eq. 5. Again, there is a straightforward extension to a cluster of templates: see Figure 2, right, Eq. 6.
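The baseline measure and its extension to clusters can be sketched directly. The weight values below are illustrative placeholders, not the paper's tuned hyperparameters, and a template is represented as a plain triple of numpy vectors.

```python
import numpy as np

def sim_match(ti, tj, weights=(0.2, 0.6, 0.2)):
    """Baseline similarity (Batista et al., 2015): a weighted sum of the
    dot products of the before/between/after context vectors. The
    weights w_p here are illustrative placeholders."""
    return sum(w * float(np.dot(a, b)) for w, a, b in zip(weights, ti, tj))

def sim_to_cluster(ti, cluster, sim=sim_match):
    """Similarity of an instance to a cluster λ: the maximum similarity
    of the instance with any member of the cluster (Eq. 5)."""
    return max(sim(ti, tj) for tj in cluster)

a = (np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.]))
b = (np.array([1., 0.]), np.array([0., 1.]), np.array([0., 0.]))
```

The same functions apply unchanged to templates, since only the vector triple is consulted.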
The extractors Λ can be categorized by whether they represent the target relation R to be bootstrapped (non-noisy, λ → R, vs. noisy) and by whether their confidence is above a threshold τ_cnf (high-confidence vs. low-confidence): Λ_NNHC (non-noisy high-confidence), Λ_NNLC (non-noisy low-confidence), Λ_NHC (noisy high-confidence) and Λ_NLC (noisy low-confidence). A λ_cat is a member of Λ_cat. For instance, a λ_NNLC is called a non-noisy low-confidence extractor if it represents the target relation (i.e., λ → R) but its confidence is below the threshold τ_cnf. Extractors of types Λ_NNHC and Λ_NLC are desirable, those of types Λ_NHC and Λ_NNLC undesirable within bootstrapping.

The Bootstrapping Machines: BREX
To describe BREX (Figure 1) in its most general form, we use the term item to refer to an entity pair, a template or both. The input to BREX (Figure 2, left, line 01) is a set γ of instances extracted from a corpus and G_seed, a structure consisting of one set of positive and one set of negative seed items. G_yield (line 02) collects the items that BREX extracts in several iterations. In each of k_it iterations (line 03), BREX first initializes the cache G_cache (line 04); this cache collects the items that are extracted in this iteration. The design of the algorithm balances elements that ensure high recall with elements that ensure high precision.
High recall is achieved by starting with the seeds and making three "hops" that consecutively consider order-1, order-2 and order-3 neighbors of the seeds. On line 05, we make the first hop: all instances that are similar to a seed are collected, where "similarity" is defined differently for different BREX configurations (see below). The collected instances are then clustered, similar to work on bootstrapping by Agichtein and Gravano (2000) and Batista et al. (2015). On line 06, we make the second hop: all instances that are within τ_sim of a hop-1 instance are added; each such instance is only added to one cluster, the closest one; see the definition of µ: Figure 2, Eq. 8. On line 07, we make the third hop: we include all instances that are within τ_sim of a hop-2 instance; see the definition of ψ: Figure 2, Eq. 7. In summary, every instance that can be reached by three hops from a seed is being considered at this point. A cluster of hop-2 instances is called an extractor.

High precision is achieved by imposing, on line 08, a stringent check on each instance before its information is added to the cache. The core function of this check is given in Figure 2, Eq. 9. This definition is a soft version of the following hard max, which is easier to explain: we are looking for a cluster λ in Λ that licenses the extraction of i with high confidence. cnf(i, λ, G) (Figure 2, Eq. 10), the confidence of a single cluster (i.e., extractor) λ for an instance, is defined as the product of the overall reliability of λ (which is independent of i) and the similarity of i to λ, the second factor in Eq. 10, i.e., sim(i, λ). This factor sim(i, λ) prevents an extraction by a cluster whose members are all distant from the instance, even if the cluster itself is highly reliable.

Figure 1: Joint Bootstrapping Machine. The red and blue filled circles/rings are the instances generated due to seed entity pairs and templates, respectively. Each dashed rectangular box represents a cluster of instances. Numbers indicate the flow. Follow the notations from Table 1 and Figure 2.
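The three-hop structure of one iteration can be sketched schematically. This is a simplified sketch, not the full algorithm of Figure 2: the confidence check (lines 08-09), the cache and seed bookkeeping are omitted, and instances/seeds are reduced to hashable items compared by a generic sim() function.

```python
def bootstrap_iteration(instances, seeds, sim, tau_sim):
    """One schematic BREX iteration: hop-1 seed matches form cluster
    cores, hop-2 instances attach to their closest cluster, and hop-3
    instances are the candidate extractions."""
    # Hop 1: every instance matching a seed item starts a cluster.
    hop1 = [i for i in instances if i in seeds]
    clusters = [[i] for i in hop1]
    # Hop 2: each remaining instance joins the single closest cluster
    # (compared against its hop-1 core), if it clears tau_sim (Eq. 8).
    for i in instances:
        if i in seeds:
            continue
        s, best = max(((sim(i, c[0]), c) for c in clusters),
                      key=lambda x: x[0], default=(None, None))
        if best is not None and s >= tau_sim:
            best.append(i)
    # Hop 3: candidates are all instances within tau_sim of some
    # hop-2 cluster member (Eq. 7); in the full algorithm they then
    # face the stringent confidence check before entering the cache.
    candidates = {i for i in instances
                  if any(max(sim(i, j) for j in c) >= tau_sim
                         for c in clusters)}
    return clusters, candidates

# Toy run: items are integers, similarity decays with distance.
instances = [0, 1, 2, 9]
clusters, cands = bootstrap_iteration(
    instances, {0}, lambda a, b: 1 - abs(a - b) / 10, 0.9)
```

In the toy run, 1 attaches to the seed cluster in hop 2, and 2 is reachable only in hop 3, so it appears among the candidates but not in the extractor.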

Seed type: BREE uses entity pairs, BRET uses templates and BREJ uses both jointly (entity pairs + templates).

The first factor in Eq. 10, i.e., cnf(λ, G), assesses the reliability of a cluster λ: we compute the ratio N(λ, G_n)/N(λ, G_p), i.e., the ratio between the number of instances in λ that match a negative and a positive gold seed, respectively; see Figure 3, line (i). If this ratio is close to zero, then likely false positive extractions are few compared to likely true positive extractions. For the simple version of the algorithm (for which we set w_n = 1, w_u = 0), this results in cnf(λ, G) being close to 1 and the reliability measure is not discounted. On the other hand, if N(λ, G_n)/N(λ, G_p) is larger, meaning that the relative number of likely false positive extractions is high, then cnf(λ, G) shrinks towards 0, resulting in progressive discounting of cnf(λ, G) and leading to a non-noisy low-confidence extractor, particularly for a reliable λ. Due to the lack of labeled data, the scoring mechanism cannot distinguish between noisy and non-noisy extractors. Therefore, an extractor is judged by its ability to extract more positive and fewer negative extractions. Note that we carefully designed this precision component to give good assessments while at the same time making maximum use of the available seeds.
The reliability statistics are computed on λ, i.e., on hop-2 instances (not on hop-3 instances). The ratio N(λ, G_n)/N(λ, G_p) is computed on instances that directly match a gold seed; this is the most reliable information we have available.
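The behaviour described above can be illustrated with a BREDS-style scoring sketch. The exact form of Eq. 11 may differ; this sketch only reproduces the stated behaviour: a ratio N(λ, G_n)/N(λ, G_p) near 0 gives confidence near 1, and a growing ratio progressively discounts it, with w_n and w_u weighting negative and unknown matches.

```python
def extractor_confidence(n_pos, n_neg, n_unk=0, w_n=1.0, w_u=0.0):
    """Reliability cnf(lambda, G) from seed-match counts (sketch).
    n_pos/n_neg/n_unk count cluster instances matching positive,
    negative, and no gold seeds, respectively."""
    if n_pos == 0:
        # No positive seed matches: nothing supports the extractor.
        return 0.0
    return n_pos / (n_pos + w_n * n_neg + w_u * n_unk)
```

With w_n = 1 and w_u = 0 this reduces to the simple version discussed in the text.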
After all instances have been checked (line 08) and (if they passed muster) added to the cache (line 09), the inner loop ends and the cache is merged into the yield (line 10). Then a new loop (lines 03-10) of hop-1, hop-2 and hop-3 extensions and cluster reliability tests starts.
Thus, the algorithm consists of k_it iterations. There is a tradeoff here between τ_sim and k_it. We will give two extreme examples, assuming that we want to extract a fixed number of m instances where m is given. We can achieve this goal either by setting k_it = 1 and choosing a small τ_sim, which will result in very large hops. Or we can achieve this goal by setting τ_sim to a large value and running the algorithm for a larger number of iterations k_it. The flexibility that the two hyperparameters k_it and τ_sim afford is important for good performance.

sim_match(i, j) = Σ_{p ∈ {-1,0,1}} w_p (v_p(i) · v_p(j));  sim^asym_cc(i, j) = max_{p ∈ {-1,0,1}} v_p(i) · v_0(j)

Figure 4: Similarity measures. These definitions for instances equally apply to templates since the definitions only depend on the "template part" of an instance, i.e., its vectors. (Value is 0 if types are different.)

BREE, BRET and BREJ
The main contribution of this paper is that we propose, as an alternative to entity-pair-centered BREE (Batista et al., 2015), template-centered BRET as well as BREJ (Figure 1), an instantiation of BREX that can take advantage of both entity pairs and templates. The differences and advantages of BREJ over BREE and BRET are: (1) Disjunctive Matching of Instances: The first difference is realized in how the three algorithms match instances with seeds (line 05 in Figure 3). BREE checks whether the entity pair of an instance is one of the entity pair seeds, BRET checks whether the template of an instance is one of the template seeds and BREJ checks whether the disjunction of the two is true. The disjunction facilitates a higher hit rate in matching instances with seeds. The introduction of a few handcrafted templates along with seed entity pairs allows BREJ to leverage discriminative patterns and learn similar ones via distributional semantics. In Figure 1, the joint approach results in hybrid extractors Λ that contain instances due to seed occurrences Θ of both entity pairs and templates.
(2) Hybrid Augmentation of Seeds: On line 09 in Figure 3, we see that the bootstrapping step is defined in a straightforward fashion: the entity pair of an instance is added for BREE, the template for BRET and both for BREJ. Figure 1 demonstrates the hybrid augmentation of seeds via red and blue rings of output instances.
(3) Scaling Up Positives in Extractors: As discussed in section 2.2, a good measure of the quality of an extractor is crucial, and N, the number of instances in an extractor λ that match a seed, is an important component of that. For BREE and BRET, the definition follows directly from the fact that these are entity-pair and template-centered instantiations of BREX, respectively. However, the disjunctive matching of instances for an extractor with entity pair and template seeds in BREJ (Figure 3, line (i)) boosts the likelihood of finding positive instances. In Figure 5, we demonstrate computing the count of positive instances N(λ, G) for an extractor λ within the three systems. Observe that an instance i in λ can scale N(λ, G) by a factor of at most 2 in BREJ if i is matched by both entity pair and template seeds. The reliability cnf(λ, G) (Eq. 11) of an extractor λ is based on the ratio N(λ, G_n)/N(λ, G_p), therefore the scaling boosts its confidence. In Figure 6, we demonstrate with an example how joint bootstrapping scales up the positive instances for a non-noisy extractor λ, resulting in λ_NNHC for BREJ compared to λ_NNLC in BREE.
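The disjunctive counting can be sketched as follows. The representation of an instance as an (entity_pair, template_string) tuple is an illustrative simplification of the actual instance structure.

```python
def count_positives(cluster, seed_pairs, seed_templates, mode):
    """Count positive seed matches N(lambda, G_p) for an extractor
    under the three systems (sketch of Figure 5). In BREJ, an instance
    matched by BOTH its entity pair and its template contributes 2,
    scaling the count by a factor of up to 2."""
    n = 0
    for pair, template in cluster:
        if mode in ("BREE", "BREJ") and pair in seed_pairs:
            n += 1
        if mode in ("BRET", "BREJ") and template in seed_templates:
            n += 1
    return n

# Toy extractor: one instance matches both seed types, one only a template.
cluster = [(("MSFT", "NOK"), "[X] to acquire [Y]"),
           (("GOOG", "YT"), "[X] buys [Y]")]
pairs = {("MSFT", "NOK")}
templates = {"[X] to acquire [Y]", "[X] buys [Y]"}
```

On this toy cluster, BREE counts 1, BRET counts 2 and BREJ counts 3, which is exactly the scaling that shrinks the ratio N(λ, G_n)/N(λ, G_p) and boosts cnf(λ, G).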
Due to unlabeled data, the instances not matching any seed are either ignored as unknowns (N_0) or counted as negatives in the confidence measure (Eq. 11). The former leads to high confidences for noisy extractors by assigning high scores, the latter to low confidences for non-noisy extractors by penalizing them. For a simple version of the algorithm in the illustration, we consider them as negatives and set w_n = 1. Figure 6 shows the three extractors (λ) generated and their confidence scores in BREE, BRET and BREJ. Observe that the scaling up of positives in BREJ due to BRET extractions (without w_n) discounts cnf(λ, G) relatively less than BREE. The discounting results in λ_NNHC in BREJ and λ_NNLC in BREE. The discounting in BREJ is adapted for non-noisy extractors, facilitated by BRET generating mostly non-noisy extractors due to stringent checks (Figure 3, lines (i) and 05). Intuitively, the intermixing of non-noisy extractors (i.e., hybrid) promotes the scaling and boosts recall.

Similarity Measures
The before (v_-1) and after (v_1) contexts around the entities are highly sparse due to the large variation in the syntax of how relations are expressed. SnowBall, DIPRE and BREE assumed that the between (v_0) context mostly defines the syntactic expression of a relation and used a weighting mechanism over the three contextual similarities (Figure 4). They assigned higher weight to the similarity in the between (p = 0) contexts, which resulted in lower recall. We introduce attentive (max) similarity across all contexts (for example, v_-1(i) · v_0(j)) to automatically capture the large variation in the syntax of how relations are expressed, without using any weights. We investigate asymmetric (Eq. 13) and symmetric (Eqs. 14 and 15) similarity measures, and name them cross-context attentive (sim_cc) similarity.
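The asymmetric measure (Eq. 13, as given in Figure 4) can be sketched directly. The symmetric variant below is only an illustrative guess at the flavour of Eqs. 14-15, not their exact form.

```python
import numpy as np

def sim_asym_cc(ti, tj):
    """Asymmetric cross-context attentive similarity (Eq. 13): attend
    over all contexts (v_-1, v_0, v_1) of i against the between-context
    v_0 of j and take the max, so e.g. an after-context "acquisition"
    can match a between-context "to acquire"."""
    return max(float(np.dot(vp, tj[1])) for vp in ti)

def sim_sym_cc(ti, tj):
    """Symmetric variant (sketch): take the max in both directions."""
    return max(sim_asym_cc(ti, tj), sim_asym_cc(tj, ti))

ti = (np.array([0., 1.]), np.array([1., 0.]), np.array([0.5, 0.5]))
tj = (np.zeros(2), np.array([0., 1.]), np.zeros(2))
```

Note that no weights w_p appear: the max over context positions plays the role of attention.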

Dataset and Experimental Setup
We re-run BREE (Batista et al., 2015) as the baseline with a set of 5.5 million news articles from AFP and APW (Parker et al., 2011). We use the processed dataset of 1.2 million sentences (released by BREE) containing at least two entities linked to FreebaseEasy (Bast et al., 2014). We extract four relationships: acquired (ORG-ORG), founder-of (ORG-PER), headquartered (ORG-LOC) and affiliation (ORG-PER) for Organization (ORG), Person (PER) and Location (LOC) entity types. We bootstrap relations in BREE, BRET and BREJ, each with 4 similarity measures, using seed entity pairs and templates (Table 2). See Tables 3, 4 and 5 for the count of candidates, hyperparameters and different configurations, respectively. Our evaluation is based on Bronzi et al. (2012)'s framework to estimate precision and recall of large-scale RE systems using FreebaseEasy (Bast et al., 2014). Also following Bronzi et al. (2012), we use Pointwise Mutual Information (PMI) (Turney, 2001) to evaluate our system automatically, in addition to relying on an external knowledge base. We consider only extracted relationship instances with confidence scores cnf(i, Λ, G) equal to or above 0.5. We follow the same approach as BREE (Batista et al., 2015) to detect the correct order of entities in a relational triple, where we try to identify the presence of passive voice using part-of-speech (POS) tags: any form of the verb "to be", followed by a verb in the past tense or past participle, and ending in the word "by". We use GloVe (Pennington et al., 2014) embeddings. Table 5 shows the experimental results of the three systems for the different relationships with ordered entity pairs and similarity measures (sim_match, sim_cc).

Table 5: Precision (P), Recall (R) and F1 compared to the state-of-the-art (baseline). #out: count of output instances with cnf(i, Λ, G) ≥ 0.5. avg: average. Bold and underline: maximum due to BREJ and sim_cc, respectively.
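The entity-order heuristic described above can be sketched as follows. Penn Treebank tags (VBD/VBN) from an external POS tagger are assumed; for brevity this sketch checks the presence of the cues rather than enforcing their exact order.

```python
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def is_passive(between_words, between_tags):
    """Flag a likely passive between-context: a form of "to be", a
    past-tense/past-participle verb (VBD/VBN), and a final "by".
    If it fires, the entity order in the relational triple is flipped."""
    words = [w.lower() for w in between_words]
    has_be = any(w in BE_FORMS for w in words)
    has_participle = any(t in ("VBD", "VBN") for t in between_tags)
    return bool(words) and has_be and has_participle and words[-1] == "by"
```

For example, the between-context of "NOK was acquired by MSFT" fires the heuristic, so the pair is stored as (MSFT, NOK) rather than (NOK, MSFT).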
Observe that BRET (config 5) is precision-oriented while BREJ (config 9) is recall-oriented when compared to BREE (baseline). We see the number of output instances #out is also higher in BREJ, hence the higher recall. The BREJ system in the different similarity configurations outperforms the baseline BREE and BRET in terms of F1 score. On average for the four relations, BREJ in configurations config 9 and config 10 results in F1 that is 0.11 (0.85 vs 0.74) and 0.13 (0.87 vs 0.74) better than the baseline BREE. We discover that sim_cc improves #out and recall over sim_match correspondingly in all three systems. Observe that sim_cc performs better with BRET than BREE due to non-noisy extractors in BRET. The results suggest an alternative to the weighting scheme in sim_match and therefore the state-of-the-art (sim_cc) performance with the 3 parameters (w_-1, w_0 and w_1) eliminated in bootstrapping. Observe that sim^asym_cc gives higher recall than the two symmetric similarity measures. Table 6 shows the performance of BREJ in different iterations trained with different similarity (τ_sim) and confidence (τ_cnf) thresholds. Table 7 shows a comparative analysis of the three systems, where we consider and evaluate the extracted relationship instances at different confidence scores.

Table 7: Comparative analysis using different thresholds τ to evaluate the extracted instances for acquired.

Disjunctive Seed Matching of Instances
As discussed in section 2.3, BREJ facilitates disjunctive matching of instances (line 05 Figure 3) with seed entity pairs and templates. Table 8 shows #hit in the three systems, where the higher values of #hit in BREJ conform to the desired property. Observe that some instances in BREJ are found to be matched in both the seed types.

Deep Dive into Attributes of Extractors
We analyze the extractors Λ generated in BREE, BRET and BREJ for the 4 relations to demonstrate the impact of joint bootstrapping. Table 9 shows the attributes of Λ. We manually annotate the extractors as noisy and non-noisy. We compute A_NNLC, and the lower values in BREJ compared to BREE suggest fewer non-noisy extractors with lower confidence in BREJ, due to the scaled confidences in BREJ that shrink N(λ, G_n)/N(λ, G_p), i.e., ANP. This facilitates a λ_NNLC to boost its confidence, i.e., λ_NNHC in BREJ, as suggested by AES, resulting in higher #out and recall (Table 5, BREJ).

Weighting Negatives Vs Scaling Positives
As discussed, Table 5 shows the performance of BREE, BRET and BREJ with the parameter w_n = 0.5 in computing extractor confidence cnf(λ, G) (Eq. 11). In other words, config 9 (Table 5) is a combination of both weighted negative and scaled positive extractions. However, we also investigate ignoring w_n (= 1.0) in order to demonstrate the capability of BREJ with only scaling positives and without weighting negatives. In Table 10, observe that BREJ outperformed both BREE and BRET for all the relationships due to higher #out and recall. In addition, the BREJ scores are comparable to config 9 (Table 5), suggesting that the scaling in BREJ is capable enough to remove the parameter w_n. However, the combination of both weighting negatives and scaling positives results in the state-of-the-art performance.

Table 11 lists some of the non-noisy extractors (simplified) learned in different configurations to illustrate boosting extractor confidence cnf(λ, G). Since an extractor λ is a cluster of instances, to simplify, we show one instance (mostly populated) from every λ; each cell in Table 11 represents either a simplified representation of λ or its confidence. Examples for founder-of include "[X] started by [Y]" (1.00), "[X] was founded by [Y]" (0.99-1.00), "Gates co-founded [X] with school friend [Y]", "who co-founded [X] with [Y]" and "to co-found [X] with partner [Y]".

Table 11: Subset of the non-noisy extractors (simplified) with their confidence scores cnf(λ, G) learned in different configurations for each relation. * denotes that the extractor was never learned in config 1 and config 5. ✗ indicates that the extractor was never learned in config 1, config 5 and config 9. [X] and [Y] indicate placeholders for entities.

We demonstrate how the confidence score of a non-noisy extractor in BREE (config 1) is increased in BREJ (config 9 and config 10).
For instance, for the relation acquired, an extractor "[X] acquiring [Y]" is generated by BREE, BRET and BREJ; however, its confidence is boosted from 0.75 in BREE (config 1) to 0.95 in BREJ (config 9). Observe that BRET generates high-confidence extractors. We also show extractors (marked by ✗) learned by BREJ with sim_cc (config 10) but not by config 1, config 5 and config 9.

Entity Pairs: Ordered Vs Bi-Set
In Table 5, we use ordered pairs of typed entities. Additionally, we also investigate using entity sets and observe improved recall due to higher #out in both BREE and BREJ; compare Table 12 with Table 5 (baseline and config 9).

Conclusion
We have proposed a Joint Bootstrapping Machine for relation extraction (BREJ) that takes advantage of both entity-pair-centered and template-centered approaches. We have demonstrated that the joint approach scales up positive instances, which boosts the confidence of NNLC extractors and improves recall. The experiments showed that the cross-context similarity measures improved recall and suggest removing in total four parameters.