A Lexicalized Tree Kernel for Open Information Extraction

In contrast with traditional relation extraction, which only considers a fixed set of relations, Open Information Extraction (Open IE) aims at extracting all types of relations from text. Because of data sparseness, Open IE systems typically ignore lexical information, and instead employ parse trees and Part-of-Speech (POS) tags. However, the same syntactic structure may correspond to different relations. In this paper, we propose to use a lexicalized tree kernel based on the word embeddings created by a neural network model. We show that the lexicalized tree kernel model surpasses the unlexicalized model. Experiments on three datasets indicate that our Open IE system performs better on the task of relation extraction than the state-of-the-art Open IE systems of Xu et al. (2013) and Mesquita et al. (2013).


Introduction
Relation Extraction (RE) is the task of recognizing relationships between entities mentioned in text. In contrast with traditional relation extraction, for which a target set of relations is fixed a priori, Open Information Extraction (Open IE) is a generalization of RE that attempts to extract all relations (Banko et al., 2007). Although Open IE models that extract N-ary relations have been proposed, here we concentrate on binary relations.
Most Open IE systems employ syntactic information such as parse trees and part of speech (POS) tags, but ignore lexical information. However, previous work suggests that Open IE would benefit from lexical information because the same syntactic structure may correspond to different relations. For instance, the relation <Annacone, coach of, Federer> is correct for the sentence "Federer hired Annacone as a coach", but not for the sentence "Federer considered Annacone as a coach," even though they have the same dependency path structure (Mausam et al., 2012). Lexical information is required to distinguish the two cases.
Here we propose a lexicalized tree kernel model that combines both syntactic and lexical information. In order to avoid lexical sparsity issues, we investigate two smoothing methods that use word vector representations: Brown clustering (Brown et al., 1992) and word embeddings created by a neural network model (Collobert and Weston, 2008). To our knowledge, we are the first to apply word embeddings and to use lexicalized tree kernel models for Open IE.
Experiments on three datasets demonstrate that our Open IE system achieves absolute improvements in F-measure of up to 16% over the current state-of-the-art systems of Xu et al. (2013) and Mesquita et al. (2013). In addition, we examine alternative approaches for including lexical information, and find that excluding named entities from the lexical information results in an improved F-score.

System Architecture
The goal of the Open IE task is to extract from text a set of triples $\{\langle E_1, R, E_2 \rangle\}$, where $E_1$ and $E_2$ are two named entities, and $R$ is a textual fragment that indicates the semantic relation between the two entities. We concentrate on binary, single-word relations between named entities. The candidate relation words are extracted from dependency structures, and then filtered by a supervised tree kernel model.
Our system consists of three modules: entity extraction, relation candidate extraction, and tree kernel filtering. The system structure is outlined in Figure 1. We identify named entities, parse sentences, and convert constituency trees into dependency structures using the Stanford tools (Manning et al., 2014). Entities within a fixed token distance (set to 20 according to development results) are extracted as pairs $\{\langle E_1, E_2 \rangle\}$. We then identify relation candidates $R$ for each entity pair in a sentence, using dependency paths. Finally, the candidate triples $\{\langle E_1, R, E_2 \rangle\}$ are paired with their corresponding tree structures, and provided as input to the SVM tree kernel. Our Open IE system outputs the triples that are classified as positive. In the following sections, we describe the components of the system in more detail.
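The entity-pairing step can be illustrated with a minimal sketch. The entity representation (text, start token, exclusive end token) and the way distance is measured (tokens between the two mentions) are assumptions made for illustration; the text only states that the distance threshold was set to 20.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical representation: (text, start_token, end_token), end exclusive.
Entity = Tuple[str, int, int]

MAX_TOKEN_DISTANCE = 20  # threshold reported in the text


def token_distance(a: Entity, b: Entity) -> int:
    """Number of tokens between two mentions (assumed definition of distance)."""
    first, second = sorted((a, b), key=lambda e: e[1])
    return max(0, second[1] - first[2])


def candidate_pairs(entities: List[Entity],
                    max_distance: int = MAX_TOKEN_DISTANCE) -> List[Tuple[Entity, Entity]]:
    """Return all entity pairs of one sentence that lie within the threshold."""
    ordered = sorted(entities, key=lambda e: e[1])
    return [(a, b) for a, b in combinations(ordered, 2)
            if token_distance(a, b) <= max_distance]


# Toy sentence with three mentions; only the first two are close enough.
toy_entities = [("A", 0, 2), ("B", 5, 6), ("C", 40, 41)]
pairs = candidate_pairs(toy_entities)
```

Each surviving pair would then be passed to relation candidate extraction.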

Relation Candidates
Relation candidates are words that may represent a relation between two entities. We consider only lemmatized nouns, verbs and adjectives that are within two dependency links from either of the entities. Following Wu and Weld (2010) and Mausam et al. (2012), we use dependency patterns rather than POS patterns, which allows us to identify relation candidates which are farther away from entities in terms of token distance.
We extract the first two content words along the dependency path between $E_1$ and $E_2$. In the following example, the path is $E_1$ → encounter → build → $E_2$, and the two relation word candidates between "Mr. Wathen" and "Plant Security Service" are encounter and build, of which the latter is the correct one.
If there are no content words on the dependency path between the two entities, we instead consider words that are directly linked to either of them. In the following example, the only relation candidate is the word battle, which is directly linked to "Edelman." The relation candidates are manually annotated as correct/incorrect in the training data for the tree kernel models described in the following section.
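The candidate extraction rule described in this section can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the dependency parse is given as a hypothetical list of (head, dependent) token-id pairs, each entity is represented by a single head token, and content words are detected from Penn Treebank tag prefixes.

```python
from collections import deque
from typing import Dict, List, Tuple

CONTENT_PREFIXES = ("NN", "VB", "JJ")  # noun, verb, adjective tags


def is_content(pos: str) -> bool:
    return pos.startswith(CONTENT_PREFIXES)


def shortest_path(adj: Dict[int, List[int]], start: int, goal: int) -> List[int]:
    """Breadth-first search over the undirected dependency graph."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []


def relation_candidates(tokens: Dict[int, Tuple[str, str]],
                        edges: List[Tuple[int, int]],
                        e1: int, e2: int) -> List[str]:
    """tokens: id -> (lemma, POS); e1, e2: head token ids of the two entities."""
    adj: Dict[int, List[int]] = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
        adj.setdefault(dep, []).append(head)
    inner = shortest_path(adj, e1, e2)[1:-1]
    on_path = [tokens[i][0] for i in inner if is_content(tokens[i][1])]
    if on_path:
        return on_path[:2]  # first two content words on the path
    # Fallback: content words directly linked to either entity.
    linked = [n for e in (e1, e2) for n in adj.get(e, ()) if n not in (e1, e2)]
    return list(dict.fromkeys(tokens[i][0] for i in linked if is_content(tokens[i][1])))


# Toy graph mirroring the path E1 -> encounter -> build -> E2 (edges are invented).
toy_tokens = {0: ("Wathen", "NNP"), 1: ("encounter", "VBD"),
              2: ("build", "VBD"), 3: ("Service", "NNP")}
toy_edges = [(1, 0), (1, 2), (2, 3)]
candidates = relation_candidates(toy_tokens, toy_edges, 0, 3)
```

On the toy graph the function returns both path words; the tree kernel classifier is then responsible for rejecting the incorrect one.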

Lexicalized Tree Kernel
We use a supervised lexicalized tree kernel to filter negative relation candidates from the results of the candidate extraction module. For semantic tasks, the design of input structures to tree kernels is as important as the design of the tree kernels themselves. In this section, we introduce our tree structure, describe the prior basic tree kernel, and finally present our lexicalized tree kernel function.

Tree Structure
In order to formulate the input for tree kernel models, we need to convert the dependency path to a tree-like structure with unlabelled edges. The target dependency path is the shortest path that includes the triple and other content words along the path. Consider the following example, which is a simplified representation of the sentence "Georgia-Pacific Corp.'s unsolicited $3.9 billion bid for Great Northern Nekoosa Corp. was hailed by Wall Street." The candidate triple identified by the relation candidate extraction module is <Georgia-Pacific Corp., bid, Great Northern Nekoosa Corp.>.
Our unlexicalized tree representation model is similar to the unlexicalized representations of Xu et al. (2013), except that instead of using the POS tag of the path's head word as the root, we create an abstract Root node. We preserve the dependency labels, POS tags, and entity information as tree nodes: (a) the top dependency labels are included as children of the abstract Root node, and other labels are attached to the corresponding parent labels; (b) the POS tag of the head word of the dependency path is a child of the Root; (c) other POS tags are attached as children of the dependency labels; and (d) the relation tag 'R' and the entity tags 'NE' are the terminal nodes attached to their respective POS tags. Figure 2(a) shows the unlexicalized dependency tree for our example sentence.
Our lexicalized tree representation is derived from the unlexicalized representation by attaching words as terminal nodes. In order to reduce the number of nodes, we collapse the relation and entity tags with their corresponding POS tags. Figure 2(b) shows the resulting tree for the example sentence.

Tree Kernels
Tree kernel models extract features from parse trees by comparing pairs of tree structures. The essential distinction between different tree kernel functions is the ∆ function that calculates similarity of subtrees. Our modified kernel is based on the SubSet Tree (SST) Kernel proposed by Collins and Duffy (2002). What follows is a simplified description of the kernel; a more detailed description can be found in the original paper.
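The general function for a tree kernel model over trees $T_1$ and $T_2$ is:

$$K(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} \Delta(n_1, n_2)$$

where $n_1$ and $n_2$ are tree nodes. The $\Delta$ function of the SST kernel is defined recursively:
1. $\Delta(n_1, n_2) = 0$ if the productions (context-free rules) of $n_1$ and $n_2$ are different.
2. Otherwise, $\Delta(n_1, n_2) = 1$ if $n_1$ and $n_2$ are matching pre-terminals (POS tags).
3. Otherwise, $\Delta(n_1, n_2) = \prod_{j} \left(1 + \Delta(c(n_1, j), c(n_2, j))\right)$, where $c(n, j)$ is the $j$-th child of $n$.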
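A minimal sketch of the SST recursion of Collins and Duffy (2002) may help make the computation concrete: differing productions contribute 0, matching pre-terminals contribute 1, and any other matching pair contributes the product over aligned children of one plus their similarity. The nested-tuple tree encoding and the toy trees are assumptions made for illustration, and the decay factor that implementations often add is omitted.

```python
from typing import Union

# Hypothetical encoding: a leaf is a string; an internal node is
# (label, child_1, ..., child_k).
Tree = Union[str, tuple]


def label(t: Tree) -> str:
    return t if isinstance(t, str) else t[0]


def children(t: Tree) -> tuple:
    return () if isinstance(t, str) else t[1:]


def production(t: Tree) -> tuple:
    """The context-free rule at a node: its label plus its children's labels."""
    return (label(t), tuple(label(c) for c in children(t)))


def is_preterminal(t: Tree) -> bool:
    return len(children(t)) == 1 and isinstance(t[1], str)


def internal_nodes(t: Tree) -> list:
    if isinstance(t, str):
        return []
    nodes = [t]
    for c in children(t):
        nodes.extend(internal_nodes(c))
    return nodes


def delta(n1: Tree, n2: Tree) -> int:
    if isinstance(n1, str) or isinstance(n2, str):
        return 0  # leaves are not counted as nodes
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(children(n1), children(n2)):
        result *= 1 + delta(c1, c2)
    return result


def tree_kernel(t1: Tree, t2: Tree) -> int:
    return sum(delta(a, b) for a in internal_nodes(t1) for b in internal_nodes(t2))


# Toy constituency-style trees (not taken from the paper's data).
t1 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
t2 = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "barks")))
print(tree_kernel(t1, t1))  # 24
print(tree_kernel(t1, t2))  # 15
```

Changing a single word lowers the kernel value from 24 to 15 because every fragment containing the mismatched pre-terminal stops matching, which illustrates the sparsity problem that motivates smoothing.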

Lexicalized Tree Kernel
Since simply adding words to lexicalize a tree kernel leads to sparsity problems, a type of smoothing must be applied. Bloehdorn and Moschitti (2007) measure the similarity of words using WordNet.  (2008), but their tree kernel does not incorporate POS tags or dependency labels.
We propose using word embeddings created by a neural network model (Collobert and Weston, 2008), in which words are represented by n-dimensional real valued vectors. Each dimension represents a latent feature of the word that reflects its semantic and syntactic properties. Next, we describe how we embed these vectors into tree kernels.
Our lexicalized tree kernel model is the same as SST, except in the following case: if $n_1$ and $n_2$ are matching pre-terminals (POS tags), then $\Delta(n_1, n_2) = 1 + G(c(n_1), c(n_2))$, where $c(n)$ denotes the word $w$ that is the unique child of $n$, and $G(w_1, w_2) = \exp(-\gamma \lVert w_1 - w_2 \rVert^2)$ is a Gaussian function for two word vectors, which is a valid kernel.
We examine the contribution of different types of words by comparing three methods of including lexical information: (1) relation words only; (2) all words (relation words, named entities, and other words along the dependency path fragment); and (3) all words, except named entities. The words that are excluded are assumed to be different; for example, in the third method, $G(E_1, E_2)$ is always zero, even if the entities $E_1$ and $E_2$ are the same.
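The modified similarity function can be sketched as below. Several details are assumptions made for illustration only: trees are nested tuples, the collapsed relation and entity tags are written with hypothetical `R-` and `NE-` prefixes, embeddings are a toy dictionary with invented values, and words without an embedding are treated like excluded words (G = 0).

```python
import math
from typing import Dict, List, Union

Tree = Union[str, tuple]  # leaf = word string; node = (label, child_1, ...)


def gaussian(v1: List[float], v2: List[float], gamma: float) -> float:
    """G(w1, w2) = exp(-gamma * ||w1 - w2||^2)."""
    squared = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-gamma * squared)


def is_leaf(t: Tree) -> bool:
    return isinstance(t, str)


def is_preterminal(t: Tree) -> bool:
    return not is_leaf(t) and len(t) == 2 and is_leaf(t[1])


def production(t: tuple) -> tuple:
    return (t[0], tuple(c if is_leaf(c) else c[0] for c in t[1:]))


# The three lexicalization methods compared in the text, keyed by tag.
INCLUDE = {
    "relation_only": lambda tag: tag.startswith("R-"),
    "all_words": lambda tag: True,
    "no_entities": lambda tag: not tag.startswith("NE-"),
}


def delta_lex(n1: Tree, n2: Tree, emb: Dict[str, List[float]],
              gamma: float = 1.0, mode: str = "no_entities") -> float:
    if is_leaf(n1) or is_leaf(n2):
        return 0.0
    if is_preterminal(n1) and is_preterminal(n2):
        if n1[0] != n2[0]:
            return 0.0  # different POS tags never match
        w1, w2 = n1[1], n2[1]
        if not INCLUDE[mode](n1[0]) or w1 not in emb or w2 not in emb:
            return 1.0  # excluded or unknown words: G is taken to be 0
        return 1.0 + gaussian(emb[w1], emb[w2], gamma)
    if production(n1) != production(n2):
        return 0.0
    result = 1.0
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1.0 + delta_lex(c1, c2, emb, gamma, mode)
    return result


# Toy two-dimensional embeddings (invented values).
toy_emb = {"hire": [1.0, 0.0], "employ": [0.9, 0.1],
           "consider": [0.0, 1.0], "Federer": [0.5, 0.5]}
close = delta_lex(("R-VB", "hire"), ("R-VB", "employ"), toy_emb)
far = delta_lex(("R-VB", "hire"), ("R-VB", "consider"), toy_emb)
```

Unlike the unsmoothed kernel, two pre-terminals with different words still match, and the degree of match grows as their vectors get closer, so `close` exceeds `far` in the toy example.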

Experiments
Here we evaluate alternative tree kernel configurations, and compare our Open IE system to previous work.
We perform experiments on three datasets (Table 1): the Penn Treebank set (Xu et al., 2013), the New York Times set (Mesquita et al., 2013), and the ClueWeb set, which we created for this project from a large collection of web pages. [1] The models are trained on the Penn Treebank training set and tested on the three test sets, of which the Penn Treebank set is in-domain, and the other two sets are out-of-domain. For word embedding and Brown clustering representations, we use the data provided by Turian et al. (2010). The SVM parameters, as well as the Brown cluster size and code length, are tuned on the development set.
[1] The Treebank set of Xu et al. (2013), with minor corrections, and the ClueWeb set are appended to this publication.
Table 2 shows the effect of different smoothing and lexicalization techniques on the tree kernels. In order to focus on tree kernel functions, we use the relation candidate extraction (Section 3) and tree structure (Section 4.1) proposed in this paper. The results in the first two rows indicate that adding unsmoothed lexical information to the method of Xu et al. (2013) is not helpful, which we attribute to data sparsity. On the other hand, smoothed word representations do improve F-measure. Surprisingly, a neural network approach of creating word embeddings actually achieves a lower recall than the method of Plank and Moschitti (2013) that uses Brown clustering; the difference in F-measure is not statistically significant according to a compute-intensive randomization test (Padó, 2006).
With regard to lexicalization, the inclusion of relation words is important. However, unlike Plank and Moschitti (2013), we found that it is better to exclude the lexical information of entities themselves, which confirms the findings of Riedel et al. (2013). We hypothesize that the correctness of a relation triple in Open IE is not closely related to entities. Consider the example mentioned by Riedel et al. (2013): for relations like "X visits Y", X could be a person or organization, and Y could be a location, organization, or person.
Our final set of experiments evaluates the best-performing version of our system (the last row in Table 2) against two state-of-the-art Open IE systems: Mesquita et al. (2013), which is based on several hand-crafted dependency patterns; and Xu et al. (2013), which uses POS-based relation candidate extraction and an unlexicalized tree kernel. Tree kernel systems are all trained on the Penn Treebank training set, and tuned on the corresponding development sets.
The results in Table 3 show that our system consistently outperforms the other two systems, with absolute gains in F-score between 4% and 16%. We include the reported results of Xu et al. (2013) on the Penn Treebank set, and of Mesquita et al. (2013) on the New York Times set. The ClueWeb results were obtained by running the respective systems on the test set, except that we used our relation candidate extraction method for the tree kernel of Xu et al. (2013). We conclude that the substantial improvement on the Penn Treebank set can be partly attributed to a superior tree kernel, and not only to a better relation candidate extraction method. We also note that word embeddings outperform Brown clustering by a statistically significant margin on the ClueWeb set, but not on the other two sets.
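For readers unfamiliar with randomization-based significance testing, the sketch below shows a generic paired approximate randomization test on the F-measure difference between two systems. It is an illustration of the general technique, not the exact procedure or tool of Padó (2006); the per-item count representation, the number of trials, and the toy data are all assumptions.

```python
import random
from typing import List, Tuple

# Hypothetical per-item counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def f_measure(counts: List[Counts]) -> float:
    """Micro-averaged F1 over a list of per-item counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def randomization_test(a: List[Counts], b: List[Counts],
                       trials: int = 1000, seed: int = 0) -> float:
    """Estimate the p-value of the observed F-measure difference.

    In each trial, the outputs of the two systems are swapped on a random
    subset of items; the p-value is the share of trials whose difference is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(f_measure(a) - f_measure(b))
    at_least = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for ca, cb in zip(a, b):
            if rng.random() < 0.5:
                ca, cb = cb, ca
            shuffled_a.append(ca)
            shuffled_b.append(cb)
        if abs(f_measure(shuffled_a) - f_measure(shuffled_b)) >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)


# Toy data: system A is always right, system B is always wrong.
system_a = [(1, 0, 0)] * 30
system_b = [(0, 1, 1)] * 30
p_value = randomization_test(system_a, system_b)
```

A small estimated p-value indicates that the observed difference would rarely arise if the two systems were interchangeable.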
The ClueWeb set is quite challenging because it contains web pages, which can be quite noisy. As a result, we found that a number of Open IE errors are caused by parsing. Conjunction structures are especially difficult for both parsing and relation extraction. For example, our system extracts the relation triple <Scotland, base, Scott> from the sentence "Set in 17th century Scotland  → Scott. In the future, we will investigate whether adding information from context words that are not on the dependency path between two entities may alleviate this problem.

Conclusion
We have proposed a lexicalized tree kernel model for Open IE, which incorporates word embeddings learned from a neural network model. Our system combines a dependency-based relation candidate extraction method with a lexicalized tree kernel, and achieves state-of-the-art results on three datasets. Our experiments on different configurations of the smoothing and lexicalization techniques show that excluding named entity information is a better strategy for Open IE.
In the future, we plan to mitigate the performance drop on the ClueWeb set by adding information about context words around relation words. We will also investigate other ways of collapsing different types of tags in the lexicalized tree representation.