Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed relationships while limiting semantic drift. We investigate bootstrapping for relationship extraction using word embeddings to find similar relationships. Experimental results show that relying on word embeddings achieves better performance on the task of extracting four types of relationships from a collection of newswire documents than a baseline that uses TF-IDF to find similar relationships.


Introduction
Relationship Extraction (RE) transforms unstructured text into relational triples, each representing a relationship between two named entities. A bootstrapping system for RE starts with a collection of documents and a few seed instances. The system scans the document collection, collecting occurrence contexts for the seed instances. Then, based on these contexts, the system generates extraction patterns. The documents are scanned again using the patterns to match new relationship instances. These newly extracted instances are then added to the seed set, and the process is repeated until a stopping criterion is met.
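The loop described above can be sketched in a few lines of Python. This is only an illustrative skeleton: the three hook functions, the toy corpus, and the stopping rule (no new instances, or a maximum number of iterations) are assumptions made for the example, and the sketch performs no ranking or drift control.

```python
def bootstrap(seeds, find_contexts, generate_patterns, match_instances,
              max_iterations=4):
    """Generic bootstrapping loop; the three callables are hypothetical hooks."""
    seeds = set(seeds)
    for _ in range(max_iterations):
        contexts = find_contexts(seeds)          # occurrences of the seeds
        patterns = generate_patterns(contexts)   # extraction patterns
        new_instances = set(match_instances(patterns)) - seeds
        if not new_instances:                    # assumed stopping criterion
            break
        seeds |= new_instances                   # expand the seed set
    return seeds


# Toy corpus of (entity_1, connecting phrase, entity_2) segments.
CORPUS = [
    ("Google", "acquired", "YouTube"),
    ("Facebook", "acquired", "Instagram"),
    ("Facebook", "bought", "Instagram"),
    ("Facebook", "bought", "WhatsApp"),
    ("Microsoft", "bought", "Skype"),
    ("Ada", "visited", "Paris"),
]


def find_contexts(seeds):
    return [phrase for e1, phrase, e2 in CORPUS if (e1, e2) in seeds]


def generate_patterns(contexts):
    # Exact-string patterns, purely for illustration.
    return set(contexts)


def match_instances(patterns):
    return [(e1, e2) for e1, phrase, e2 in CORPUS if phrase in patterns]


result = bootstrap({("Google", "YouTube")}, find_contexts,
                   generate_patterns, match_instances)
print(sorted(result))
```

In this toy run the phrase "bought" is only learned in the second iteration, after the pair (Facebook, Instagram) has been added to the seed set, which illustrates how the seed set and the patterns grow together.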
The objective of bootstrapping is thus to expand the seed set with new relationship instances, while limiting the semantic drift, i.e. the progressive deviation of the semantics for the extracted relationships from the semantics of the seed relationships.
State-of-the-art approaches rely on word vector representations with TF-IDF weights (Salton and Buckley, 1988). However, expanding the seed set by relying on TF-IDF representations to find similar instances has limitations, since the similarity between any two relationship instance vectors of TF-IDF weights is only positive when the instances share at least one term. For instance, the phrases "was founded by" and "is the co-founder of" do not have any common words, but they have the same semantics. Stemming techniques can aid in these cases, but only for variations of the same root word (Porter, 1980). We propose to address this challenge with an approach based on word embeddings (Mikolov et al., 2013a). By relying on word embeddings, the similarity of two phrases can be captured even if no common words exist. The word embeddings for "co-founder" and "founded" should be similar, since these words tend to occur in the same contexts. Word embeddings can nonetheless also introduce semantic drift. When using word embeddings, phrases like "studied history at" can, for instance, have a high similarity with phrases like "history professor at". In our approach, we control the semantic drift by ranking the extracted relationship instances, and by scoring the generated extraction patterns.
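The limitation of term-based vectors can be checked with a small example. The sketch below uses plain term-frequency vectors (the optional `idf` argument is a hypothetical hook for IDF weights); whatever positive weights are used, the cosine similarity is zero when two phrases share no term.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_vector(phrase, idf=None):
    """Term-frequency vector, optionally scaled by hypothetical IDF weights."""
    counts = Counter(phrase.lower().split())
    return {t: c * (idf.get(t, 1.0) if idf else 1.0) for t, c in counts.items()}


a = term_vector("was founded by")
b = term_vector("is the co-founder of")
c = term_vector("was founded in")

print(cosine(a, b))  # no shared terms, so the similarity is 0.0
print(cosine(a, c))  # two shared terms, so the similarity is positive
```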
We implemented these ideas in BREDS, a bootstrapping system for RE based on word embeddings. BREDS was evaluated with a collection of 1.2 million sentences from news articles. The experimental results show that our method outperforms a baseline bootstrapping system based on the ideas of Agichtein and Gravano (2000), which relies on TF-IDF representations.

Bootstrapping Relationship Extractors
Brin (1999) developed DIPRE, the first system to apply bootstrapping to RE. DIPRE represents each occurrence of a seed as three string contexts: the words before the first entity (BEF), the words between the two entities (BET), and the words after the second entity (AFT). It generates extraction patterns by grouping contexts based on string matching, and controls semantic drift by limiting the number of instances a pattern can extract.
Agichtein and Gravano (2000) developed Snowball, which is inspired by DIPRE's method of collecting three contexts for each occurrence, but computes a TF-IDF representation for each context. The seed contexts are clustered with a single-pass algorithm based on the cosine similarity between contexts using the three vector representations:
$$\mathrm{Sim}(S_n, S_j) = \alpha \cdot \cos(BEF_n, BEF_j) + \beta \cdot \cos(BET_n, BET_j) + \gamma \cdot \cos(AFT_n, AFT_j) \qquad (1)$$
In the formula, S_n and S_j are the two contexts being compared, and the parameters α, β and γ weight each vector. An extraction pattern is represented by the centroid of the vectors that form a cluster. The patterns are used to scan the text again, and for each segment of text where any pair of entities with the same semantic types as the seeds co-occur, three vectors are generated. If the similarity between the context vectors and an extraction pattern is greater than a threshold τ_sim, the instance is extracted.
Snowball scores the patterns and ranks the extracted instances to control semantic drift. A pattern is scored based on the instances that it extracted, each of which falls into one of three sets: P, N, and U. If an extracted instance contains an entity e_1 that is part of a seed, and the associated entity e_2 in the instance is the same as in the seed, then the extraction is considered positive (included in set P). If the relationship contradicts a relationship in the seed set (i.e., e_2 does not match), then the extraction is considered negative (included in set N). If the relationship is not part of the seed set, the extraction is considered unknown (included in set U).
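As an illustration of the weighted similarity between two occurrences, the following sketch combines the cosine similarities of the three context vectors with the weights α, β and γ. The dictionary layout, the toy vectors, and the weight values are assumptions chosen for the example, not settings taken from Snowball.

```python
import numpy as np


def cos(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))


def context_similarity(x, y, alpha, beta, gamma):
    """Weighted sum of the BEF, BET and AFT cosine similarities."""
    return (alpha * cos(x["BEF"], y["BEF"])
            + beta * cos(x["BET"], y["BET"])
            + gamma * cos(x["AFT"], y["AFT"]))


# Toy 2-dimensional context vectors (hypothetical values).
x = {"BEF": np.array([1.0, 0.0]), "BET": np.array([1.0, 1.0]),
     "AFT": np.array([0.0, 1.0])}
y = {"BEF": np.array([0.0, 1.0]), "BET": np.array([1.0, 1.0]),
     "AFT": np.array([0.0, 1.0])}

score = context_similarity(x, y, alpha=0.2, beta=0.6, gamma=0.2)
print(round(score, 3))
```

Here the BEF vectors are orthogonal while the BET and AFT vectors coincide, so the score is approximately β + γ = 0.8.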
A score is assigned to each pattern p according to:
$$\mathrm{Conf}_{\rho}(p) = \frac{|P|}{|P| + W_{ngt} \cdot |N| + W_{unk} \cdot |U|} \qquad (2)$$
where W_ngt and W_unk are weights associated with the negative and unknown extractions, respectively. The confidence of an instance is calculated based on the similarity scores towards the patterns that extracted it, weighted by the pattern's confidence:
$$\mathrm{Conf}_{\iota}(i) = 1 - \prod_{\xi_j \in \xi} \left(1 - \mathrm{Conf}_{\rho}(\xi_j) \cdot \mathrm{Sim}(C_i, \xi_j)\right) \qquad (3)$$
where ξ is the set of patterns that extracted i, and C_i is the textual context where i occurred. Instances with a confidence above a threshold τ_t are used as seeds in the next iteration.
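A minimal sketch of the two scores follows. It assumes that seeds are stored as a mapping from the first entity to the set of second entities it is related to, and that the instance confidence combines the patterns in the noisy-or style used by Snowball, i.e. one minus the product of (1 - pattern confidence × similarity). The function names and the data layout are hypothetical.

```python
import math


def classify_extraction(e1, e2, seeds):
    """Return 'P', 'N' or 'U' for an extracted pair, given the seed set."""
    if e1 not in seeds:
        return "U"                      # e1 is not part of any seed
    return "P" if e2 in seeds[e1] else "N"


def pattern_confidence(extractions, seeds, w_ngt, w_unk):
    """|P| / (|P| + w_ngt * |N| + w_unk * |U|) for one pattern."""
    counts = {"P": 0, "N": 0, "U": 0}
    for e1, e2 in extractions:
        counts[classify_extraction(e1, e2, seeds)] += 1
    denominator = counts["P"] + w_ngt * counts["N"] + w_unk * counts["U"]
    return counts["P"] / denominator if denominator > 0 else 0.0


def instance_confidence(matches):
    """matches: iterable of (pattern_confidence, similarity) pairs."""
    return 1.0 - math.prod(1.0 - conf * sim for conf, sim in matches)


seeds = {"Google": {"DoubleClick"}}
extractions = [("Google", "DoubleClick"),   # positive
               ("Google", "Yahoo"),         # negative
               ("Microsoft", "Skype")]      # unknown

p_conf = pattern_confidence(extractions, seeds, w_ngt=2, w_unk=0.1)
i_conf = instance_confidence([(0.5, 0.8), (0.9, 0.5)])
print(round(p_conf, 4), round(i_conf, 4))
```

With one positive, one negative and one unknown extraction, the pattern confidence is 1 / (1 + 2 + 0.1); an instance matched by two patterns gets 1 - (1 - 0.4)(1 - 0.45).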

Bootstrapping Relationship Extractors with Word Embeddings
BREDS follows the architecture of Snowball, having the same processing phases: finding seed matches, generating extraction patterns, finding relationship instances, and detecting semantic drift. It differs, however, in that it attempts to find similar relationships using word embeddings, instead of relying on TF-IDF representations.

Find Seed Matches
BREDS scans the document collection and, whenever both entities of a seed instance co-occur in a text segment within a sentence, it extracts the three textual contexts from that segment as in Snowball: BEF, BET, and AFT.
In the BET context, BREDS tries to identify a relational pattern based on a shallow heuristic originally proposed in ReVerb (Fader et al., 2011). The pattern limits a relation context to a verb (e.g., invented), a verb followed by a preposition (e.g., located in), or a verb followed by nouns, adjectives, or adverbs ending in a preposition (e.g., has atomic weight of). These patterns will nonetheless only consider verb-mediated relationships. If no verbs exist between the two entities, BREDS uses all the words between them to build the representation for the BET context.
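The heuristic can be approximated with a regular expression over coarse PoS classes. The sketch below is a simplification written for illustration: it assumes Penn Treebank tags on already tagged input, maps them to a hypothetical four-letter alphabet, and falls back to all the words when no verb is found. The actual ReVerb pattern is richer than this.

```python
import re


def coarse(tag):
    """Map a Penn Treebank tag to a one-letter class (assumed mapping)."""
    if tag.startswith("VB"):
        return "V"                       # verbs
    if tag in ("IN", "TO", "RP"):
        return "P"                       # prepositions and particles
    if tag.startswith(("NN", "JJ", "RB")):
        return "W"                       # nouns, adjectives, adverbs
    return "O"                           # anything else


# One or more verbs, optionally followed by W* and a closing preposition.
RELATION = re.compile(r"V+(?:W*P)?")


def relational_pattern(tagged_tokens):
    """Return the words of the first relational pattern, or all the words."""
    classes = "".join(coarse(tag) for _, tag in tagged_tokens)
    match = RELATION.search(classes)
    if match is None:                    # no verb between the entities
        return [tok for tok, _ in tagged_tokens]
    return [tok for tok, _ in tagged_tokens[match.start():match.end()]]


print(relational_pattern([("has", "VBZ"), ("atomic", "JJ"),
                          ("weight", "NN"), ("of", "IN")]))
print(relational_pattern([(",", ","), ("the", "DT"),
                          ("capital", "NN"), ("of", "IN")]))
```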
Each context is transformed into a single vector by a simple compositional function that starts by removing stop-words and adjectives and then sums the word embedding vectors of each individual word. Representing small phrases by summing each individual word's embedding results in good representations for the semantics in small phrases (Mikolov et al., 2013b).
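The compositional function can be sketched as follows. The stop-word list, the toy three-dimensional embeddings, and the choice to skip out-of-vocabulary words are assumptions made for the example; in practice the vectors would come from a trained embedding model.

```python
import numpy as np

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", ","}


def compose(tagged_tokens, embeddings, dim):
    """Sum the embeddings of the words that are neither stop words nor adjectives."""
    vector = np.zeros(dim)
    for token, tag in tagged_tokens:
        word = token.lower()
        if word in STOP_WORDS or tag.startswith("JJ"):
            continue                     # drop stop words and adjectives
        if word in embeddings:           # skip out-of-vocabulary words
            vector += embeddings[word]
    return vector


# Toy embeddings (hypothetical values).
embeddings = {
    "tech": np.array([1.0, 0.0, 0.0]),
    "company": np.array([0.0, 1.0, 0.0]),
    "big": np.array([0.0, 0.0, 1.0]),
}

v_bef = compose([("The", "DT"), ("big", "JJ"), ("tech", "NN"),
                 ("company", "NN")], embeddings, dim=3)
print(v_bef)
```

In this toy case only "tech" and "company" contribute to the sum, since "The" is a stop word and "big" is tagged as an adjective.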
A relationship instance i is represented by three embedding vectors: V_BEF, V_BET, and V_AFT. Consider the sentence: The tech company Soundcloud is based in Berlin, capital of Germany.
For this sentence, BREDS generates the three vectors of the relationship instance by summing E(x) over the words retained in each context, where E(x) is the word embedding for word x. BREDS also tries to identify the passive voice using part-of-speech (PoS) tags, which can help to detect the correct order of the entities in a relational triple. BREDS identifies the presence of the passive voice by considering any form of the verb to be, followed by a verb in the past tense or the past participle, and ending in the word by.
For instance, the seed <Google, owns, DoubleClick> states that the organisation Google owns the organisation DoubleClick. Using this seed, if BREDS detects a pattern like agreed to be acquired by, it will swap the order of the entities when producing a relational triple, outputting the triple <ORG_2, owns, ORG_1> instead of the triple <ORG_1, owns, ORG_2>.
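The passive-voice rule and the resulting entity swap can be sketched as below. The sketch assumes Penn Treebank tags, a hand-written list of forms of "to be", and that the past-tense or past-participle verb immediately follows the form of "to be"; these are reading choices for the illustration rather than details confirmed by the description above.

```python
# Forms of the verb "to be" (assumed list).
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def is_passive(tagged_tokens):
    """True if a form of 'to be' precedes a VBD/VBN verb and the context ends in 'by'."""
    tokens = [(tok.lower(), tag) for tok, tag in tagged_tokens]
    if not tokens or tokens[-1][0] != "by":
        return False
    for (word, _), (_, next_tag) in zip(tokens, tokens[1:]):
        if word in BE_FORMS and next_tag in ("VBD", "VBN"):
            return True
    return False


def make_triple(first_entity, second_entity, relation, bet_tagged):
    """Order the entities of the triple, swapping them for passive contexts."""
    if is_passive(bet_tagged):
        return (second_entity, relation, first_entity)
    return (first_entity, relation, second_entity)


bet = [("agreed", "VBD"), ("to", "TO"), ("be", "VB"),
       ("acquired", "VBN"), ("by", "IN")]
print(make_triple("DoubleClick", "Google", "owns", bet))
```

For a sentence in which DoubleClick appears first and Google second, the passive context makes the sketch output the triple with Google as the owner.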

Extraction Patterns Generation
Like Snowball, BREDS generates extraction patterns by applying a single-pass clustering algorithm to the relationship instances gathered in the previous step. Each resulting cluster contains a set of relationship instances, represented by their three context vectors.
Algorithm 1 describes the clustering approach taken by BREDS, which takes as input a list of relationship instances and assigns the first instance to a new empty cluster. Next, it iterates through the list of instances, computing the similarity between an instance i_n and every cluster Cl_j. The instance i_n is assigned to the first cluster whose similarity is greater than or equal to a threshold τ_sim. If all the clusters have a similarity lower than τ_sim, a new cluster Cl_m is created, containing the instance i_n.
The similarity function Sim(i_n, Cl_j), between an instance i_n and a cluster Cl_j, returns the maximum of the similarities between i_n and any of the instances in Cl_j, if the majority of the similarity scores are higher than the threshold τ_sim. A value of zero is returned otherwise. The similarity between two instances is computed according to Formula (1). As a result, clustering in Algorithm 1 differs from the original Snowball method, which instead computes similarities towards cluster centroids.
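The clustering procedure and its similarity function can be sketched as follows. The pairwise similarity is passed in as a function so that the sketch stays independent of the vector representation; the scalar "instances" and the toy similarity in the usage example are assumptions that only serve to make the behaviour easy to follow.

```python
def cluster_similarity(instance, cluster, pair_sim, tau_sim):
    """Max pairwise similarity if most scores exceed tau_sim, else 0.0."""
    scores = [pair_sim(instance, member) for member in cluster]
    above = sum(1 for score in scores if score > tau_sim)
    if above > len(scores) / 2:          # majority of the cluster members
        return max(scores)
    return 0.0


def single_pass_clustering(instances, pair_sim, tau_sim):
    """Assign each instance to the first sufficiently similar cluster."""
    clusters = []
    for instance in instances:
        for cluster in clusters:
            if cluster_similarity(instance, cluster, pair_sim, tau_sim) >= tau_sim:
                cluster.append(instance)
                break
        else:                            # no cluster was similar enough
            clusters.append([instance])
    return clusters


# Toy usage: instances are numbers and similarity is 1 - |a - b|.
toy_sim = lambda a, b: 1.0 - abs(a - b)
clusters = single_pass_clustering([0.0, 0.1, 1.0, 0.95, 0.05], toy_sim, tau_sim=0.8)
print(clusters)
```

In the toy run the values near 0 end up in one cluster and the values near 1 in another. Because an instance joins the first cluster that passes the threshold, the result can depend on the order of the input list.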

Find Relationship Instances
After the generation of extraction patterns, BREDS finds relationship instances with Algorithm 2. It scans the documents once again, collecting all segments of text containing entity pairs whose semantic types are the same as those in the seed instances. For each segment, an instance i is generated as described in Section 3.1, and the similarity towards all previously generated extraction patterns (i.e., clusters) is computed. If the similarity between i and a pattern Cl_j is equal to or above τ_sim, then i is considered a candidate instance, and the confidence score of the pattern is updated according to Formula (2). The pattern which has the highest similarity (pattern_best) is associated with i, along with the corresponding similarity score (sim_best). This information is kept in a history of Candidates. Note that the histories of Candidates and Patterns are kept through all the bootstrap iterations, and new patterns or instances can be added, or the scores of existing patterns or instances can change.

Algorithm 1: Single-Pass Clustering.
    Input: Instances = {i_1, i_2, i_3, ..., i_n}
    Output: Patterns
    Cl_1 = {i_1}
    Patterns = {Cl_1}
    for i_n ∈ Instances \ {i_1} do
        if some Cl_j ∈ Patterns has Sim(i_n, Cl_j) >= τ_sim then
            Cl_j = Cl_j ∪ {i_n}    (for the first such Cl_j)
        else
            Cl_m = {i_n}
            Patterns = Patterns ∪ {Cl_m}
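The matching of a candidate instance against the learned patterns can be sketched as follows. The function name, the dictionary of patterns, and the toy similarity are hypothetical; the sketch only covers the selection of pattern_best and sim_best, and leaves out the update of the pattern confidence.

```python
def best_pattern(instance, patterns, sim_fn, tau_sim):
    """Return (pattern_id, similarity) of the most similar pattern, or None."""
    best = None
    for pattern_id, pattern in patterns.items():
        score = sim_fn(instance, pattern)
        if score >= tau_sim and (best is None or score > best[1]):
            best = (pattern_id, score)
    return best


# Toy usage: patterns and instances are numbers, similarity is 1 - |a - b|.
toy_sim = lambda a, b: 1.0 - abs(a - b)
patterns = {"p1": 0.2, "p2": 0.9}

candidates = {}                          # history of candidate instances
for instance in (0.85, 0.5):
    match = best_pattern(instance, patterns, toy_sim, tau_sim=0.8)
    if match is not None:                # keep only instances above tau_sim
        candidates[instance] = match

print(candidates)
```

Only the first toy instance is kept, associated with pattern "p2"; the second one is not similar enough to any pattern and is discarded.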

Semantic Drift Detection
Like Snowball, BREDS ranks the candidate instances at the end of each iteration, based on the scores computed with Formula (3). Instances with a score equal to or above the threshold τ_t are added to the seed set, for use in the next iteration of the bootstrapping algorithm.

Evaluation
In our evaluation we used a set of 5.5 million news articles from AFP and APW (Parker et al., 2011). Our pre-processing pipeline is based on the models provided by the NLTK toolkit (Bird et al., 2009): sentence segmentation, tokenisation, PoS-tagging and named-entity recognition (NER). The NER module in NLTK is a wrapper over the Stanford NER toolkit (Finkel et al., 2005).
We performed weak entity-linking by matching entity names in sentences with FreebaseEasy (Bast et al., 2014). FreebaseEasy is a processed version of Freebase (Bollacker et al., 2008), which contains a unique meaningful name for every entity, together with canonical binary relations. For our experiments, we selected only the sentences containing at least two entities linked to FreebaseEasy, which corresponded to 1.2 million sentences.
With the full set of articles, we computed word embeddings with the skip-gram model using the word2vec implementation from Mikolov et al. (2013a). The TF-IDF representations used by Snowball were calculated over the same set of articles. To estimate precision and recall, we adopted the framework for the evaluation of large-scale RE systems proposed by Bronzi et al. (2012), using FreebaseEasy as the knowledge base.
We considered entity pairs no further away than 6 tokens, and a window of 2 tokens for the BEF and AFT contexts, ignoring the remainder of the sentence. We discarded the clusters with only one relationship instance, and ran a maximum of 4 bootstrapping iterations. The W_unk and W_ngt parameters were set to 0.1 and 2, respectively, based on the results reported by Yu et al. (2003).
We compared BREDS against Snowball in four relationship types, shown in Table 1. For each relationship type we considered several bootstrapping configurations, including two context weighting configurations: Conf_1, which only considers the BET context, and Conf_2, which uses the three contexts while giving more importance to the BET context. Table 2 shows, for each relationship type, the best F_1 score and the corresponding precision and recall, for all combinations of τ_sim and τ_t values, and considering only extracted relationship instances with confidence scores equal to or above 0.5. Table 2a shows the results for the BREDS system, while Table 2b shows the results for Snowball (ReVerb), a modified Snowball in which a relational pattern based on ReVerb is used to select the words for the BET context. Finally, Table 2c shows the results for Snowball, implemented as described in the original paper.
Overall, BREDS achieves better F_1 scores than both versions of Snowball. The F_1 score of BREDS is higher mainly as a consequence of much higher recall scores, which we believe to be due to the relaxed semantic matching enabled by the word embeddings. For some relationship types, the recall more than doubles when using word embeddings instead of TF-IDF. For the acquired relationship, when considering Conf_1, the precision of BREDS drops compared with both versions of Snowball, but without affecting the F_1 score, since the higher recall compensates for the small loss in precision.
Regarding the context weighting configurations, Conf_2 produces a lower recall when compared to Conf_1. This might be caused by the sparsity of both BEF and AFT, which contain many different words that do not contribute to capturing the relationship between the two entities. Although the phrase or word that indicates a relationship sometimes occurs in the BEF or AFT contexts, it is more often the case that these phrases or words occur in the BET context.
The performance results of Snowball (Classic) and Snowball (ReVerb) suggest that selecting words based on a relational pattern to represent the BET context, instead of using all the words, works better for TF-IDF representations.
The results also show that word embeddings can generate more extraction patterns. For instance, for the founder-of relationship, BREDS learns patterns based on words such as founder, cofounder, co-founders or founded, while Snowball only learns patterns that have the word founder, like CEO and founder or founder and chairman.
The implementations of BREDS and Snowball, as described in this paper, are available online.

Conclusions and Future Work
This paper reports on a novel bootstrapping system for relation extraction based on word embeddings. In our experiments, bootstrapped RE achieved better results when using word embeddings to find similar relationships than with similarities between TF-IDF weighted vectors.
We have identified two main sources of errors: NER problems and incorrect extraction of relational patterns, due to the use of a shallow heuristic that only captures local relationships.
In future work, more robust entity-linking approaches, as proposed by Hoffart et al. (2011), could be included in our pre-processing pipeline. This could alleviate NER errors and enable experimentation with other relationship types. Gabbard et al. (2011) have shown that coreference resolution can increase bootstrapping RE performance, and the method of Durrett and Klein (2014) could also be included in our pre-processing pipeline.
Finally, we could explore richer compositional functions, combining word embeddings with syntactic dependencies (SD) (Yu et al., 2014). The shortest path between two entities in an SD tree supports the extraction of local and long-distance relationships (Bunescu and Mooney, 2005).