Relation extraction pattern ranking using word similarity

Our thesis proposal aims at integrating word similarity measures into pattern ranking for relation extraction bootstrapping algorithms. We note that although many contributions have been made on pattern ranking schemas, few have explored the use of word-level semantic similarity. Our hypothesis is that word similarity would allow better pattern comparison and better pattern ranking, resulting in less of the semantic drift that is commonly problematic in bootstrapping algorithms. In this paper, as a first step into this research, we explore different pattern representations, various existing pattern ranking approaches and some word similarity measures. We also present a methodology and evaluation approach to test our hypothesis.


Introduction
In this thesis, we look at the problem of information extraction from the web; more precisely, at the problem of extracting structured information in the form of triples (predicate, subject, object), e.g. (ObjectMadeFromMaterial, table, wood), from unstructured text. Relation Extraction (RE) is a current and popular research topic within NLP, given the large amount of unstructured text on the WWW.
In the literature, machine learning algorithms have been shown to be very useful for RE from textual resources. Although supervised (Culotta and Sorensen, 2004; Bunescu and Mooney, 2005) and unsupervised learning (Hasegawa et al., 2004; Zhang et al., 2005) have been used for RE, in this thesis we focus on semi-supervised bootstrapping algorithms.
In such algorithms (Brin, 1999; Agichtein and Gravano, 2000; Alfonseca et al., 2006a), the input is a set of related pairs called seed instances (e.g., (table, wood), (bottle, glass)) for a specific relation (e.g., ObjectMadeFromMaterial). These seed instances are used to collect a set of candidate patterns representing the relation in a corpus. A subset containing the best candidate patterns is added to the set of promoted patterns. The promoted patterns are used to collect candidate instances. A subset containing the best candidate instances is selected to form the set of promoted instances. The promoted instances are either added to the initial seed set or used to replace it. With the new seed set, the algorithm is repeated until a stopping criterion is met.
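The loop described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any cited system: the function `bootstrap`, its callback parameters, and the cut-offs `top_patterns` and `top_instances` are hypothetical names, and the sketch assumes the variant where promoted instances are added to the seed set rather than replacing it.

```python
from typing import Callable, Iterable, Set, Tuple

Instance = Tuple[str, str]   # a related pair, e.g. ("table", "wood")
Pattern = str                # any hashable pattern representation


def bootstrap(
    seeds: Set[Instance],
    find_patterns: Callable[[Set[Instance]], Iterable[Pattern]],
    find_instances: Callable[[Set[Pattern]], Iterable[Instance]],
    score_pattern: Callable[[Pattern, Set[Instance]], float],
    score_instance: Callable[[Instance, Set[Pattern]], float],
    top_patterns: int = 2,
    top_instances: int = 2,
    max_iterations: int = 3,
) -> Tuple[Set[Pattern], Set[Instance]]:
    """Generic bootstrapping loop (hypothetical sketch).

    Stops after max_iterations, or earlier when no new pattern or
    instance can be promoted.
    """
    promoted_patterns: Set[Pattern] = set()
    instances: Set[Instance] = set(seeds)

    for _ in range(max_iterations):
        # 1. Collect candidate patterns from the current instances.
        candidates = set(find_patterns(instances)) - promoted_patterns
        ranked = sorted(candidates,
                        key=lambda p: score_pattern(p, instances),
                        reverse=True)
        new_patterns = ranked[:top_patterns]
        if not new_patterns:
            break
        promoted_patterns.update(new_patterns)

        # 2. Collect candidate instances with the promoted patterns.
        found = set(find_instances(promoted_patterns)) - instances
        ranked_instances = sorted(
            found,
            key=lambda i: score_instance(i, promoted_patterns),
            reverse=True)
        new_instances = ranked_instances[:top_instances]
        if not new_instances:
            break
        # 3. Promoted instances are added to the seed set.
        instances.update(new_instances)

    return promoted_patterns, instances
```

The pattern ranking measures surveyed in this paper correspond to different choices for the `score_pattern` callback; semantic drift arises when a poorly ranked pattern is promoted in step 1 and pulls wrong instances into step 3.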
The advantage of bootstrapping algorithms is that they require little human annotation. Unfortunately, the system may introduce wrongly extracted instances. Because the approach is iterative, errors can quickly accumulate over the following iterations, and precision suffers. This problem is called semantic drift. Different researchers have studied how to counter semantic drift by using better pattern representations, by filtering unreliable patterns, and by filtering wrongly extracted instances (Brin, 1999; Agichtein and Gravano, 2000; Alfonseca et al., 2006a). Nevertheless, this challenge is far from being resolved, and we hope to make a contribution in that direction.
Semantic drift is directly related to which candidate patterns become promoted patterns. A crucial decision at that point is how to establish pattern confidence so as to rank the patterns. There are many ways to estimate the confidence of a pattern. Blohm et al. (2007) identified general types of pattern filtering functions for well-known systems. As we review pattern ranking approaches, we note that many include a notion of "resemblance", for instance by comparing patterns between successive iterations, or by comparing instances generated at an iteration to instances in the seed set. Although this notion of resemblance seems important to many ranking schemas, we found little research that combines word similarity approaches with pattern ranking. This is where we hope to make a research contribution and where our hypothesis lies: using word similarity would allow for better pattern ranking.
In order to suggest better pattern ranking approaches incorporating word similarity, we need to look at the different pattern representations suggested in the literature and understand how they affect pattern similarity measures. This is introduced in Section 2. Then, Section 3 provides a non-exhaustive survey of pattern ranking approaches with an analysis of commonalities and differences; Section 4 presents a few word similarity approaches; Section 5 presents the challenges we face, as well as our methodology toward the validation of our hypothesis; Section 6 briefly explores other anticipated issues (e.g. seed selection) in relation to our main contribution; and Section 7 presents the conclusion.

Pattern representation
In the literature, pattern representations are classified as lexical or syntactic.
Lexical patterns represent lexical terms around a relation instance as a pattern. For a relation instance (X, Y) where X and Y are valid noun phrases, Brin (1999), Agichtein and Gravano (2000), Pasca et al. (2006) and Alfonseca et al. (2006a) take N words before X, N words after Y and all intervening words between X and Y to form a pattern (e.g., well-known author X worked on Y daily.). Extremes for the choice of N exist: the CPL subsystem of NELL (Carlson et al., 2010) sets N = 0, whereas Espresso (Pantel and Pennacchiotti, 2006) uses the whole sentence.
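As an illustration of the window-based construction above, the following sketch builds a lexical pattern from a tokenized sentence. It is a simplification under stated assumptions: X and Y are single tokens (not full noun phrases), X occurs before Y, and the function name `lexical_pattern` is hypothetical.

```python
from typing import List, Optional


def lexical_pattern(tokens: List[str], x: str, y: str,
                    n: int = 1) -> Optional[str]:
    """Return 'n words before X, X, words between, Y, n words after Y'.

    Returns None when the pair does not occur in the expected order.
    """
    if x not in tokens or y not in tokens:
        return None
    ix = tokens.index(x)
    iy = tokens.index(y)
    if ix >= iy:
        return None  # this sketch only handles X occurring before Y
    before = tokens[max(0, ix - n):ix]
    between = tokens[ix + 1:iy]
    after = tokens[iy + 1:iy + 1 + n]
    return " ".join(before + ["X"] + between + ["Y"] + after)


tokens = "the old table is made of wood today".split()
print(lexical_pattern(tokens, "table", "wood", n=1))
# prints: old X is made of Y today
print(lexical_pattern(tokens, "table", "wood", n=0))
# prints: X is made of Y
```

With n = 0 only the intervening words remain, which corresponds to the CPL setting mentioned above.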
Syntactic patterns convert a sentence containing a relation instance to a structured form such as a parse tree or a dependency tree. Yangarber (2003) and Stevenson and Greenwood (2005) use Subject-Verb-Object (SVO) dependency tree patterns such as [Company appoint Person] or [Person quit]. Culotta and Sorensen (2004) use full dependency trees on which a tree kernel is used to measure similarity. Bunescu and Mooney (2005) and Sun and Grishman (2010) use the shortest dependency path (SDP) between the two arguments of a relation instance in the dependency tree as a pattern (e.g., "nsubj ← met → prep_in"). Zhang et al. (2014) add a semantic constraint to the SDP; they define the semantic shortest dependency path (SSDP) as an SDP containing at least one trigger word representing the relation, if any. Trigger words are defined as the words most representative of the target relation (e.g. home, house, live, for the relation PersonResidesIn).
We anticipate the use of word similarity to be possible when comparing either lexical or syntactic patterns, adapting to either words in sequence, or nodes within parse or dependency trees. In fact, as researchers have explored pattern generalization, some have already looked at ways of grouping similar words. For example, Alfonseca et al. (2006a) present a simple algorithm to generalize the set of lexical patterns using an edit-distance similarity. Also, Pasca et al. (2006) add term generalization to a pattern representation similar to Agichtein and Gravano (2000); terms are replaced with their corresponding classes of distributionally similar words, if any (e.g., let CL3 = {March, October, April,...} in the pattern CL3 00th : X's Birthday (Y)).

Pattern ranking approaches
We now survey pattern ranking algorithms to better understand in which ones similarity measures would be more likely to have an impact. We follow a categorization introduced in Blohm et al. (2007) as they quantified the impact of different relation pattern/instance filtering functions on their generic bootstrapping algorithm. The filtering functions proposed by Brin (1999), Agichtein and Gravano (2000), Pantel and Pennacchiotti (2006) and Etzioni et al. (2004) were described in their work.
Although non-exhaustive, our survey includes further pattern ranking approaches found in the literature, in order to best illustrate the different categories of Blohm et al. (2007). A potential use of those categories would be to define a pattern ranking measure composed of voting experts representing each category. A combination of these votes might provide a better confidence measure for a pattern. We define the following notation so as to describe the different measures in a coherent way. Let p be a pattern and i be an instance; I is the set of promoted instances; P is the set of promoted patterns; H(p) is the set of unique instances matched by p; K(i) is the set of unique patterns matching i; count(i, p) is the number of times p matches i; count(p) is the number of times p occurs in a corpus; S is the set of seed instances.

Syntactic assessment
This filtering assessment is based purely on syntactic criteria of the pattern (e.g., length, structure). Brin (1999) uses the length of the pattern to measure its specificity.

Pattern comparison
Blohm et al. (2007) named this category inter-pattern comparison. Their intuition was that candidate patterns could be rated based on how similar their generated instances are to the instances generated by the promoted patterns. We generalize this category to also include the rating of candidate patterns based directly on their semantic similarity with promoted patterns. Stevenson and Greenwood (2005) assign a score to a candidate pattern based on its similarity with promoted patterns. The pattern scoring function uses the Jiang and Conrath (1997) WordNet-based word similarity for pattern similarity. They represent the SVO pattern as a vector (e.g., [subject_COMPANY, verb_fired, object_ceo], or [subject_chairman, verb_resign]). The similarity between two pattern vectors a and b is measured as:
$$sim(a, b) = \frac{a \, W \, b^{T}}{|a| \, |b|}$$
where W is a matrix that contains the word similarity between every possible pair of element-fillers (e.g., subject_COMPANY, verb_fired, object_ceo) contained in every SVO pattern extracted from a corpus. The top-N (e.g., 4) patterns with a score larger than 95% are promoted. Zhang et al. (2014) define a bottom-up kernel (BUK) to filter undesired relation patterns. The BUK measures the similarity between two dependency tree patterns. The system accepts new patterns that are the most similar to seed patterns. The BUK defines a matching function t and a similarity function k on dependency trees. Let dep be the pair (rel, w) where rel is the dependency relation and w is the word of the relation (e.g., (nsubj, son)). The matching function t is defined with respect to $W_{tr}$, the set of trigger words for the target relation. The similarity function k is defined using $\gamma_1$ and $\gamma_2$, manually defined weights for the attributes dep.w and dep.rel respectively. The word comparison is string-based.
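To make the vector-based pattern similarity of Stevenson and Greenwood (2005) concrete, the sketch below assumes it takes the form sim(a, b) = a W b^T / (|a| |b|), with patterns encoded as binary vectors over element-filler pairs. The function name `pattern_similarity`, the four-entry vocabulary and the similarity value 0.9 between "fired" and "dismissed" are invented for illustration; in the cited work the entries of W come from a WordNet-based measure.

```python
import math
from typing import Sequence


def pattern_similarity(a: Sequence[float], b: Sequence[float],
                       W: Sequence[Sequence[float]]) -> float:
    """Compute (a W b^T) / (|a| |b|) for two pattern vectors."""
    numerator = sum(a[i] * W[i][j] * b[j]
                    for i in range(len(a))
                    for j in range(len(b)))
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Hypothetical vocabulary of element-filler pairs:
# 0: subject_COMPANY, 1: verb_fired, 2: verb_dismissed, 3: object_ceo
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.0],   # fired ~ dismissed (made-up similarity)
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
fired = [1, 1, 0, 1]        # [subject_COMPANY, verb_fired, object_ceo]
dismissed = [1, 0, 1, 1]    # [subject_COMPANY, verb_dismissed, object_ceo]

print(round(pattern_similarity(fired, dismissed, W), 4))
# prints: 0.9667
```

With W set to the identity matrix the measure reduces to the cosine of the two vectors (2/3 for this pair), so the off-diagonal entries of W are exactly where word similarity enters the pattern comparison.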

Support-based assessment
This ranking assessment estimates the quality of a pattern based on the set of occurrences/patterns that generated this pattern. This assessment is usually used for patterns that were created by a generalization procedure. For example, if the pattern X BE mostly/usually made of/from Y was generated by the patterns X is usually made of Y and X are mostly made from Y, then the quality of the generalized pattern will be based on the last two patterns. Brin (1999) filters patterns if (specificity(p) × n) > t, where n is the occurrence count of pattern p applied in a corpus and t is a manually set threshold.

Performance-based assessment
The quality of a candidate pattern can be estimated by comparing its correctly produced instances with the set of promoted instances. Blohm et al. (2007) define a precision formula similar to that of Agichtein and Gravano (2000) to approximate a performance-based precision. Alfonseca et al. (2006b) propose a procedure to measure the precision of candidate patterns in order to filter overly general patterns. For every relation, and every hook X and target Y of the set of promoted instances (X, Y), a hook and target corpus is extracted from corpus C; C contains only sentences which contain X or Y. For every pattern p, instances of H(p) are extracted. Then, a set of heuristics labels every instance as correct or incorrect. The precision of p is the number of correct extracted instances divided by the total number of extracted instances.
NELL (Carlson et al., 2010) ranks relation patterns by their precision:
$$Precision(p) = \frac{\sum_{i \in I} count(i, p)}{count(p)}$$
Sijia et al. (2013) filter noisy candidate relation patterns that generate instances appearing in the seed set of relations other than the target relation.
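A count-based precision estimate of this kind can be sketched with the notation introduced earlier, assuming it takes the form of the sum of count(i, p) over promoted instances divided by count(p). The function name `pattern_precision` and the dictionary layout are hypothetical, and the counts below are made up for the example.

```python
from typing import Dict, Set, Tuple

Instance = Tuple[str, str]


def pattern_precision(pattern: str,
                      promoted_instances: Set[Instance],
                      pair_counts: Dict[Tuple[Instance, str], int],
                      pattern_counts: Dict[str, int]) -> float:
    """Share of a pattern's occurrences that match promoted instances."""
    total = pattern_counts.get(pattern, 0)   # count(p)
    if total == 0:
        return 0.0
    matched = sum(pair_counts.get((i, pattern), 0)   # count(i, p)
                  for i in promoted_instances)
    return matched / total


promoted = {("table", "wood"), ("bottle", "glass")}
pair_counts = {
    (("table", "wood"), "X made of Y"): 3,
    (("bottle", "glass"), "X made of Y"): 1,
}
pattern_counts = {"X made of Y": 8}

print(pattern_precision("X made of Y", promoted, pair_counts, pattern_counts))
# prints: 0.5
```

Because only exact matches with promoted instances count as correct, a pattern that extracts many valid but unseen instances is penalized; this is one place where a similarity-based comparison of instances could change the ranking.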

Instance-Pattern correlation
Pattern quality can be assessed by measuring its correlation with the set of promoted instances. These measures estimate the correlation by counting pattern occurrences, promoted instance occurrences, and pattern occurrences with a specific promoted instance. Blohm et al. (2007) classified Espresso (Pantel and Pennacchiotti, 2006) and KnowItAll (Etzioni et al., 2004) in this category. Pantel and Pennacchiotti (2006) rank candidate relation patterns by the following reliability score:
$$r_\pi(p) = \frac{\sum_{i \in I} \left( \frac{pmi(i, p)}{max_{pmi}} \times r_\iota(i) \right)}{|I|}$$
where $max_{pmi}$ is the maximum PMI between all patterns and all instances, and pmi(i, p) can be estimated using the following formula:
$$pmi(i, p) = \log \frac{|x, p, y|}{|x, *, y| \times |*, p, *|}$$
where i is an instance (x, y), |x, p, y| is the number of occurrences of pattern p with terms x and y, and * represents a wildcard. The reliability of an instance $r_\iota(i)$ is defined as:
$$r_\iota(i) = \frac{\sum_{p \in P} \left( \frac{pmi(i, p)}{max_{pmi}} \times r_\pi(p) \right)}{|P|}$$
Since $r_\iota(i)$ and $r_\pi(p)$ are defined recursively, $r_\iota(i) = 1$ for any seed instance. The top-N patterns are promoted, where N is the number of patterns of the previous bootstrapping iteration plus one. Sun and Grishman (2010) accept the top-N candidate patterns ranked by a confidence formula based on $Sup(p) = \sum_{i \in H(p)} Conf(i)$, the support that candidate pattern p gets from the set of matched instances. Every relation instance in Sun and Grishman (2010) has a cluster membership, where a cluster contains similar patterns. The confidence of a newly extracted instance i is defined in terms of $Semi\_Conf(i)$, the confidence given by the patterns matching the candidate relation instance, and $Cluster\_Conf(i)$, which measures how strongly the candidate instance is associated with the target cluster $C_t$, the cluster to which the patterns of the target relation belong.
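The PMI-based reliability used in Espresso can be sketched as follows, assuming the pattern reliability is the average over promoted instances of (pmi(i, p) / max_pmi) × r(i). The function names, the dictionary-based PMI table and the choice to treat unseen (instance, pattern) pairs as PMI 0 are assumptions of this sketch, not details taken from the cited paper.

```python
import math
from typing import Dict, Tuple

Instance = Tuple[str, str]


def pmi(count_xpy: int, count_xy: int, count_p: int) -> float:
    """pmi(i, p) = log(|x,p,y| / (|x,*,y| * |*,p,*|)), from raw counts."""
    return math.log(count_xpy / (count_xy * count_p))


def pattern_reliability(pattern: str,
                        instance_reliability: Dict[Instance, float],
                        pmi_table: Dict[Tuple[Instance, str], float],
                        max_pmi: float) -> float:
    """Average of (pmi / max_pmi) * r(i) over the promoted instances.

    Pairs missing from pmi_table are treated as PMI 0 (an assumption).
    """
    if not instance_reliability or max_pmi == 0:
        return 0.0
    total = sum((pmi_table.get((i, pattern), 0.0) / max_pmi) * r
                for i, r in instance_reliability.items())
    return total / len(instance_reliability)


# Seed instances start with reliability 1.
seeds = {("table", "wood"): 1.0, ("bottle", "glass"): 1.0}
# Made-up PMI values for one candidate pattern.
table = {
    (("table", "wood"), "X made of Y"): 2.0,
    (("bottle", "glass"), "X made of Y"): 1.0,
}
print(pattern_reliability("X made of Y", seeds, table, max_pmi=2.0))
# prints: 0.75
```

Instance reliability follows the same shape with the roles of patterns and instances swapped, which is what makes the two scores mutually recursive.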

Word similarity
Within the pattern ranking survey, we often saw the idea of comparing patterns and/or instances, but only once was there a direct use of word similarity measures. Stevenson and Greenwood (2005) assign a score to a candidate pattern based on its similarity to promoted patterns using a WordNet-based word similarity measure (Jiang and Conrath, 1997). This measure is only one among many WordNet-based approaches, such as those found in (Lesk, 1986; Wu and Palmer, 1994; Resnik, 1995; Jiang and Conrath, 1997; Lin, 1998; Leacock and Chodorow, 1998; Banerjee and Pedersen, 2002).
There are limitations to these approaches, mainly that WordNet (Miller, 1995), although large, is still incomplete. Other similarity approaches are corpus-based (e.g., Agirre et al., 2009) and measure the distributional similarity between words. Words are no longer primitives; instead, each word is represented by a feature vector. The feature vector could contain the co-occurrences, the syntactic dependencies, etc. of the word, with their corresponding frequencies from a corpus. The cosine similarity (among many possible measures) between the feature vectors of two words indicates their semantic similarity.
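The corpus-based approach described above can be illustrated with a minimal sketch that builds co-occurrence feature vectors from a toy corpus and compares them with cosine similarity. The window size, the three-sentence corpus and the function names are assumptions made for the example; real systems use much larger corpora and usually weight the raw counts.

```python
import math
from collections import Counter
from typing import Dict, List


def cooccurrence_vectors(sentences: List[List[str]],
                         window: int = 2) -> Dict[str, Counter]:
    """Map each word to a Counter of words seen within `window` tokens."""
    vectors: Dict[str, Counter] = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    ["table", "made", "of", "wood"],
    ["desk", "made", "of", "wood"],
    ["bottle", "holds", "water"],
]
vectors = cooccurrence_vectors(corpus, window=2)
print(round(cosine(vectors["table"], vectors["desk"]), 3))    # prints: 1.0
print(round(cosine(vectors["table"], vectors["bottle"]), 3))  # prints: 0.0
```

Here "table" and "desk" share exactly the same contexts and so receive the maximum score, while "bottle" shares none with "table".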
Newer approaches to word similarity are based on neural network word embeddings. Mikolov et al. (2013) present algorithms to learn those distributed word representations which can then be compared to provide word similarity estimations.
Word similarity could in itself be the topic of a thesis. Therefore, we will not attempt to develop new word similarity measures, but rather we will search for measures which are intrinsically good and valuable for the pattern ranking task. The few mentioned above are a good start toward a more extensive survey. The methods found can be evaluated on existing datasets such as RG (Rubenstein and Goodenough, 1965), MC (Miller and Charles, 1991), WordSim353 (Finkelstein et al., 2001; Agirre et al., 2009), MTurk (Radinsky et al., 2011) and MEN (Bruni et al., 2013). However, these datasets are limited, since all except MEN contain only nouns. When using word similarity in pattern ranking schemas, we will likely want to measure similarity between nouns, verbs, adjectives and adverbs. Still, these datasets provide a good starting point for the evaluation of word similarity.
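One common way to score a similarity measure against such datasets is to compute the rank correlation between the measure's scores and the human ratings for the same word pairs. The sketch below assumes Spearman's rank correlation as the evaluation metric, which is our assumption rather than a choice stated in this proposal; the function names are hypothetical and the ratings are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented human ratings and system scores for four word pairs.
human = [9.1, 7.4, 3.2, 0.5]
system = [0.92, 0.60, 0.35, 0.10]
print(round(spearman(human, system), 3))  # prints: 1.0
```

A rank-based metric only asks that the measure order the word pairs as humans do, so measures with different score ranges can be compared on the same dataset.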

Word similarity in pattern ranking
The hypothesis of our research is that the use of word similarity will allow better pattern ranking to better prevent semantic drift. We face three main challenges in supporting this hypothesis. First, we need to understand the interdependence of the three elements presented in the three previous sections: pattern representation, pattern confidence estimation, and word similarity. Second, we need to devise an appropriate set-up to perform our bootstrapping approach. Third, we need to properly evaluate the role of the different variations in preventing semantic drift.
An important exploration will be to decide where word similarity has the largest potential. For example, in the work of Stevenson and Greenwood (2005), similarity is applied directly to parts of the triples found (subject, verb predicate or object), whereas in the work of Zhang et al. (2014), word similarity could be integrated into the matching and similarity functions over dependency trees, in place of string equality.
As we see, the integration of word similarity measures would differ depending on the type of pattern representation used. Furthermore, some representations already include a notion of pattern generalization, such as in the work of Pasca et al. (2006), where words are replaced with more general classes, if any. In such cases, word similarity measures are used at the core of the pattern representation, and will further impact pattern ranking.
As we will eventually be building a complex system, we intend to follow a standard methodology of starting with a baseline system for which we have an evaluation, and then evaluating the different variations to measure their impact. As the number of combinations of possible variations will be too large, time will also be spent on partial evaluation, to determine the most promising candidates among word similarity measures, pattern representations and pattern confidence estimations, and to understand the strengths and weaknesses of each aspect independently of the others.
Our proposed methodology is to take promising ranking approaches among those presented in Section 3, and promising pattern representations from those presented in Section 2. We can evaluate their combined performance through N different iteration intervals and incorporate different similarity measures (the best measures chosen from the evaluation on known datasets) to measure the performance of the system. Our baseline system is inspired by the CPL subsystem of NELL (Carlson et al., 2010), since it is one of the largest currently active bootstrapping systems in the literature. As in NELL, we will use ClueWeb as our corpus, and for the set of relations, we will use the same seed instances and relations as in the evaluation of NELL (Carlson et al., 2010).
As for the bootstrapping RE system, to evaluate its precision, we will randomly sample knowledge from the knowledge base and have it evaluated by several human judges. The extracted knowledge could be validated using a crowdsourcing application such as MTurk. This method is based on NELL (Carlson et al., 2010). To evaluate its recall, we have to concentrate on already annotated relations. For example, Pasca et al. (2006) evaluate the relation Person-BornIn-Year. As a Gold Standard, 6617 instances were automatically extracted from Wikipedia. Instead of measuring recall for a specific relation, we could use relative recall (Pantel et al., 2004; Pantel and Pennacchiotti, 2006). We can evaluate our contributions by the relative recall of system A (our system) given system B (the baseline).
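Relative recall can be computed without a gold standard. Assuming the definition of Pantel et al. (2004), the unknown total number of correct instances cancels out, leaving R(A|B) = (P_A × |A|) / (P_B × |B|), where P is the precision of a system and |A|, |B| are the numbers of instances each system extracted. The sketch below uses a hypothetical function name and invented figures.

```python
def relative_recall(precision_a: float, extracted_a: int,
                    precision_b: float, extracted_b: int) -> float:
    """Relative recall of system A given system B.

    precision * extracted estimates the number of correct instances
    each system produced; the ratio compares those estimates.
    """
    correct_b = precision_b * extracted_b
    if correct_b == 0:
        raise ValueError("system B has no correct instances to compare against")
    return (precision_a * extracted_a) / correct_b


# Invented figures: A extracts 100 instances at 0.8 precision,
# the baseline B extracts 80 instances at 0.5 precision.
print(relative_recall(0.8, 100, 0.5, 80))  # prints: 2.0
```

A value above 1 means system A recovers more correct instances than the baseline; the precision estimates would come from the sampled human judgments described above.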

Related issues in pattern ranking
Our main contribution on the impact of word similarity on pattern ranking will necessarily bring forward other interesting questions that we will address within our thesis.

Choice of seed
As we saw, pattern ranking often depends on the comparison of instances found from one iteration to the next. At iteration 0, we start with a seed set of instances. We can imagine that the manual selection of these seeds will have an impact on the decisions that follow. As our similarity measures are used to compare candidate instances to seed instances, and as we will start with the NELL seed set, we will want to evaluate its impact on the bootstrapping process.
It was shown that the performance of bootstrapping algorithms depends heavily on the seed instance selection (Kozareva and Hovy, 2010). Ehara et al. (2013) proposed an iterative approach where unlabelled instances are chosen to be labelled depending on their similarity with the seed instances, and are then added to the seed set.

Automatic selection of patterns
Something noticeable among our surveyed pattern ranking approaches is the inclusion of empirically set thresholds that will definitely have an impact on semantic drift, but whose impact is not discussed. Most authors (e.g., Carlson et al., 2010; Sun and Grishman, 2010; McIntosh and Yencken, 2011; Zhang et al., 2014, among recent ones) select the top-N best ranked patterns to be promoted to the next iteration. Other authors (Pasca et al., 2006; Dang and Aizawa, 2008; Carlson et al., 2010) select the top-M ranked instances to add to the seed set for the next iteration. Other authors (Brin, 1999; Agichtein and Gravano, 2000; Sijia et al., 2013) only apply a filtering step without limiting pattern/instance selection.
In our work, including word similarity within pattern ranking will certainly impact the decision on the number of patterns to be promoted. We hope to contribute to the development of a pattern selection mechanism based on the pattern confidences themselves rather than on an empirically set N or M.

Conclusion
In this paper, we have presented our research proposal, which aims at determining the impact of employing word similarity measures within pattern ranking approaches in bootstrapping systems for relation extraction. We presented two aspects on which the integration of word similarity will depend: pattern representation and pattern ranking schemas. We showed that there are at least lexical and syntactic pattern representations, to which different methods of generalization can be applied. We performed a non-exhaustive survey of pattern ranking measures classified in five different categories. We also briefly looked into different word similarity approaches.
This sets the ground for the methodology that we will pursue: implementing a baseline bootstrapping system (inspired by NELL, and working with ClueWeb as a corpus), and then measuring the impact of modifying the pattern representation and the pattern ranking approaches, with and without the use of word similarity measures. There is certainly a complex and intricate mutual influence among the preceding aspects, which we need to investigate. Lastly, we briefly discussed two related issues: the choice of the seed set and a better estimation of the number of patterns to promote.