Relation Schema Induction using Tensor Factorization with Side Information

Given a set of documents from a specific domain (e.g., medical research journals), how do we automatically identify the schemas of relations, i.e., the type signatures of relation arguments (e.g., undergo(Patient, Surgery))? This is a necessary first step towards building a Knowledge Graph (KG) out of the given set of documents. We refer to this problem as Relation Schema Induction (RSI). While Open Information Extraction (OpenIE) techniques aim at extracting surface-level triples of the form (John, underwent, Angioplasty), they do not induce the yet unknown schemas of the relations themselves. Tensors provide a natural representation for such triples, and factorization of such tensors provides a plausible solution to the RSI problem. To the best of our knowledge, tensor factorization methods have not previously been used for the RSI problem. We fill this gap and propose Schema Induction using Coupled Tensor Factorization (SICTF), a tensor factorization method which is able to incorporate additional side information in a principled way for more effective Relation Schema Induction. We report our findings on multiple real-world datasets and demonstrate SICTF's effectiveness over state-of-the-art baselines in terms of both accuracy and speed.


Introduction
Over the last few years, several techniques to build Knowledge Graphs (KGs) from large unstructured text corpora have been proposed; examples include NELL (Mitchell et al., 2015) and Google Knowledge Vault (Dong et al., 2014). Such KGs consist of millions of entities (e.g., Oslo, Norway), their types (e.g., isA(Oslo, City), isA(Norway, Country)), and relationships among them (e.g., cityLocatedInCountry(Oslo, Norway)). These KG construction techniques are called ontology-guided, as they require as input a list of relations, their schemas (i.e., their type signatures, e.g., cityLocatedInCountry(City, Country)), and seed instances of each such relation. Listings of such relations and their schemas are usually prepared by human domain experts.
The reliance on domain expertise poses significant challenges when such ontology-guided KG construction techniques are applied to domains where domain experts are either not available or are too expensive to employ. Even when such a domain expert may be available for a limited time, she may be able to provide only a partial listing of relations and their schemas relevant to that particular domain. Moreover, this expert-mediated model is not scalable when new data in the domain becomes available, bringing with it potential new relations of interest. In order to overcome these challenges, we need automatic techniques which can discover relations and their schemas from unstructured text data itself, without requiring extensive human input. We refer to this problem as Relation Schema Induction (RSI).
In contrast to the ontology-guided KG construction techniques mentioned above, Open Information Extraction (OpenIE) techniques (Etzioni et al., 2011) aim to extract surface-level triples from unstructured text. Such OpenIE triples may provide a suitable starting point for the RSI problem. In fact, KB-LDA, a topic modeling-based method for inducing an ontology from SVO (Subject-Verb-Object) triples, was recently proposed in (Movshovitz-Attias and Cohen, 2015). We note that ontology induction (Velardi et al., 2013) is a more general problem than RSI, as we are primarily interested in identifying categories and relations from a domain corpus, and not necessarily any hierarchy over them. Nonetheless, KB-LDA may be used for the RSI problem, and we use it as a representative of the state of the art in this area. Instead of a topic modeling approach, we take a tensor factorization-based approach to RSI in this paper. Tensors are a higher-order generalization of matrices, and they provide a natural way to represent OpenIE triples. Applying tensor factorization methods over OpenIE triples to identify relation schemas is a natural approach, but one that has not been explored so far. A tensor factorization-based approach also presents a flexible and principled way to incorporate various types of side information. Moreover, as we shall see in Section 4, compared to state-of-the-art baselines such as KB-LDA, the tensor factorization-based approach results in a better and faster solution to the RSI problem. In this paper, we make the following contributions:
• We present Schema Induction using Coupled Tensor Factorization (SICTF), a novel and principled tensor factorization method which jointly factorizes a tensor constructed out of OpenIE triples extracted from a domain corpus, along with various types of additional side information, for relation schema induction.
• We compare SICTF against state-of-the-art baselines on various real-world datasets from diverse domains. We observe that SICTF is not only significantly more accurate than these baselines, but also much faster. For example, SICTF achieves a 14x speedup over KB-LDA (Movshovitz-Attias and Cohen, 2015).
• We have made the data and code publicly available.

Related Work
Schema Induction: Properties of SICTF and other related methods are summarized in Table 1. A method for inducing (binary) relations and the categories they connect was proposed by (Mohamed et al., 2011). However, in that work, categories and their instances were known a priori; in the case of SICTF, both categories and relations are to be induced. A method for event schema induction, the task of learning high-level representations of complex events and their entity roles from unlabeled text, was proposed in (Chambers, 2013). This gives the schemas of slots per event, but our goal is to find schemas of relations. Other prior work deals with the problem of finding semantic slots for unsupervised spoken language understanding, whereas we are interested in finding schemas of relations relevant to a given domain. Methods for link prediction in the Universal Schema setting using matrix factorization and a combination of matrix and tensor factorization are proposed in (Riedel et al., 2013) and (Singh et al., 2015), respectively. Instead of link prediction, where relation schemas are assumed to be given, SICTF focuses on discovering such relation schemas. Moreover, in contrast to such methods, which assume access to existing KGs, the setting in this paper is unsupervised.
Tensor Factorization: Due to their flexibility of representation and effectiveness, tensor factorization methods have seen increasing application in Knowledge Graph (KG) related problems over the last few years. Methods for decomposing ontological KGs such as YAGO (Suchanek et al., 2007) were proposed in (Nickel et al., 2012; Chang et al., 2014b; Chang et al., 2014a). In these cases, relation schemas are known in advance, while we are interested in inducing such relation schemas from unstructured text. A PARAFAC (Harshman, 1970) based method for jointly factorizing a matrix and a tensor for data fusion was proposed in (Acar et al., 2013). In such cases, the matrix is used to provide auxiliary information (Narita et al., 2012; Erdos and Miettinen, 2013). Similar PARAFAC-based ideas are explored in the Rubik system to factorize structured electronic health records. In contrast to such structured data sources, SICTF aims at inducing relation schemas from unstructured text data. PropStore, a tensor-based model for distributional semantics, a problem different from RSI, was presented in (Goyal et al., 2013). Even though coupled factorization of a tensor and matrices constructed out of an unstructured text corpus provides a natural and plausible approach to the RSI problem, it has not yet been explored; we fill this gap in this paper.
Ontology Induction: Relation Schema Induction can be considered a subproblem of Ontology Induction (Velardi et al., 2013). Instead of building a full-fledged hierarchy over categories and relations as in ontology induction, we are particularly interested in finding relations and their schemas from an unstructured text corpus. We consider KB-LDA (Movshovitz-Attias and Cohen, 2015), a topic modeling-based approach for ontology induction, as a representative of this area. Among all prior work, KB-LDA is most closely related to SICTF. While both KB-LDA and SICTF make use of noun phrase side information, SICTF is also able to exploit relational side information in a principled manner. In Section 4, through experiments on multiple real-world datasets, we observe that SICTF is not only more accurate than KB-LDA but also significantly faster, with a 14x speedup.
A method for canonicalizing noun and relation phrases in OpenIE triples was recently proposed in (Galárraga et al., 2014). The main focus of that approach is to cluster lexical variants of a single entity or relation. This is not directly relevant to RSI, as we are interested in grouping multiple entities of the same type into one cluster and using that to induce relation schemas.

Overview
SICTF poses the relation schema induction problem as a coupled factorization of a tensor along with matrices containing relevant side information. The overall architecture of the SICTF system is presented in Figure 1. First, a tensor X ∈ R_+^{n×n×m} is constructed to store OpenIE triples and their scores extracted from the text corpus, where R_+ denotes the set of non-negative reals. Here, n and m represent the number of NPs and relation phrases, respectively. Following (Movshovitz-Attias and Cohen, 2015), SICTF makes use of noun phrase (NP) side information in the form of (noun phrase, hypernym) pairs. Additionally, SICTF also exploits relation-relation similarity side information. These two types of side information are stored in matrices W ∈ {0, 1}^{n×h} and S ∈ {0, 1}^{m×m}, where h is the number of hypernyms extracted from the corpus. SICTF then performs collective non-negative factorization over X, W, and S to output a matrix A ∈ R_+^{n×c} and a core tensor R ∈ R_+^{c×c×m}. Each row of A corresponds to an NP, while each column corresponds to an induced category (latent factor). For brevity, we refer to the induced category corresponding to the q-th column of A as A_q. Each entry A_pq of the output matrix provides a membership score for NP p in induced category A_q. Note that each induced category is represented by the NPs participating in it, ranked by their membership scores in the induced category. Each slice of the core tensor R is a matrix which corresponds to a specific relation, e.g., the matrix R_undergo highlighted in Figure 1 corresponds to the relation undergo. Each cell in this matrix corresponds to an induced schema connecting two induced categories (two columns of the A matrix), with the cell value representing the model's score for that induced schema. For example, in Figure 1, undergo(A_2, A_4) is an induced relation schema.
Table 2: Examples of (NP, hypernym) pairs extracted from the two datasets.
MEDLINE: (hypertension, disease), (hypertension, state), (hypertension, disorder), (neutrophil, blood element), (neutrophil, effector cell), (neutrophil, cell type)
StackOverflow: (image, resource), (image, content), (image, file), (perl, language), (perl, script), (perl, programs)

In Section 3.2, we present details of the side information used by SICTF, and then in Section 3.3 we present details of the optimization problem solved by SICTF.
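As a concrete illustration, the construction of the triple tensor X described above can be sketched as follows. This is a minimal sketch; the function name and input format are our own illustration (the paper uses corpus frequencies of triples as the scores):

```python
import numpy as np

def build_triple_tensor(triples):
    """Build X in R_+^{n x n x m} from scored OpenIE triples of the
    form (subject NP, relation phrase, object NP, score)."""
    nps = sorted({t[0] for t in triples} | {t[2] for t in triples})
    rels = sorted({t[1] for t in triples})
    np_idx = {p: i for i, p in enumerate(nps)}
    rel_idx = {r: k for k, r in enumerate(rels)}
    X = np.zeros((len(nps), len(nps), len(rels)))
    for subj, rel, obj, score in triples:
        # Slice X[:, :, k] holds subject-object scores for relation k.
        X[np_idx[subj], np_idx[obj], rel_idx[rel]] += score
    return X, nps, rels
```

Each frontal slice X[:, :, k] then plays the role of the matrix X_k factorized in Section 3.3.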

Side Information
• Noun Phrase Side Information: Through this type of side information, we would like to capture the type information of as many noun phrases (NPs) as possible. We apply Hearst patterns (Hearst, 1992), e.g., "<Hypernym> such as <NP>", over the corpus to extract such (NP, Hypernym) pairs. Note that neither hypernyms nor NPs are pre-specified; they are all extracted from the data by the patterns. Examples of a few such pairs extracted from the two datasets are shown in Table 2. These extracted pairs are stored in a matrix W ∈ {0, 1}^{n×h} whose rows correspond to NPs and columns correspond to extracted hypernyms. We define

W_{ij} = 1 if the pair (NP_i, Hypernym_j) was extracted from the corpus, and 0 otherwise.

Note that we do not expect W to be a fully specified matrix, i.e., we do not assume that we know all possible hypernyms for a given NP.
• Relation Side Information: In addition to the side information involving NPs, we would also like to take prior knowledge about textual relations into account during factorization. For example, if we know two relations to be similar to one another, then we expect their induced schemas to be similar as well. Consider the sentences "Mary purchased a stuffed animal toy." and "Janet bought a toy car for her son.". From these we can say that both relations purchase and buy have the schema (Person, Item). Even if one of these relations is more abundant than the other in the corpus, we still want to learn similar schemas for both relations. As mentioned before, S ∈ {0, 1}^{m×m} is the relation similarity matrix, where m is the number of textual relations. We define

S_{ij} = 1 if sim(v_i, v_j) ≥ γ, and 0 otherwise,

where v_i is the vector representation of relation i and γ is a threshold. For the experiments in this paper, we use cosine similarity over word2vec (Mikolov et al., 2013) vector representations of the relation phrases. Examples of a few similar relation pairs are shown in Table 3.
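The construction of W from Hearst-pattern matches can be sketched as follows. This is a deliberately minimal illustration covering only the single pattern "<hypernym> such as <NP>" with one-word phrases; the regex and function name are our own, and a real extractor handles all seven patterns and multi-word phrases:

```python
import re
import numpy as np

# Toy version of one Hearst pattern: "<hypernym> such as <NP>",
# matching single-word hypernyms and NPs only.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def build_np_hypernym_matrix(sentences):
    """Return binary matrix W (NPs x hypernyms) plus the index lists."""
    pairs = {(np_, hyp) for s in sentences for hyp, np_ in SUCH_AS.findall(s)}
    nps = sorted({p for p, _ in pairs})
    hyps = sorted({h for _, h in pairs})
    W = np.zeros((len(nps), len(hyps)), dtype=int)
    for np_, hyp in pairs:
        W[nps.index(np_), hyps.index(hyp)] = 1
    return W, nps, hyps
```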
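The relation similarity matrix S can then be computed as sketched below, assuming precomputed word2vec vectors for the relation phrases (the function name is our own):

```python
import numpy as np

def relation_similarity_matrix(rel_vecs, gamma=0.7):
    """S[i, j] = 1 iff cosine(v_i, v_j) >= gamma, for rows v_i of rel_vecs."""
    norms = np.linalg.norm(rel_vecs, axis=1, keepdims=True)
    unit = rel_vecs / norms   # unit-normalize each relation vector
    sims = unit @ unit.T      # pairwise cosine similarities
    return (sims >= gamma).astype(int)
```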

SICTF Model Details
SICTF performs coupled non-negative factorization of the input triple tensor X ∈ R_+^{n×n×m} along with the two side information matrices W ∈ {0, 1}^{n×h} and S ∈ {0, 1}^{m×m} by solving the following optimization problem:

min_{A, R, V}  Σ_{k=1}^{m} ( ||X_k - A R_k A^T||_F^2 + λ_R ||R_k||_F^2 )
             + λ_np ||W - A V||_F^2 + λ_A ||A||_F^2 + λ_V ||V||_F^2
             + λ_rel Σ_{i,j} S_{ij} ||R_i - R_j||_F^2
subject to  A ≥ 0, R ≥ 0, V ≥ 0.        (1)
In the objective above, the first term f(X_k, A, R_k) minimizes the reconstruction error for the k-th relation, with additional regularization on the R_{:,:,k} matrix. For brevity, we refer to R_{:,:,k} as R_k and to X_{:,:,k} as X_k.
The second term, f_np(W, A, V), factorizes the NP side information matrix W into two matrices A ∈ R_+^{n×c} and V ∈ R_+^{c×h}, where c is the number of induced categories. We also enforce A to be non-negative. Typically, we require c ≪ h to get a lower-dimensional embedding of each NP (rows of A). Finally, the third term f_rel(S, R) enforces the requirement that two relations marked similar by the matrix S should have similar signatures (given by the corresponding R_k matrices). Additionally, we require V and R to be non-negative, as marked by the non-negativity constraints. In this objective, λ_R, λ_np, λ_A, λ_V, and λ_rel are all hyperparameters.
We derive non-negative multiplicative updates for A, R_k, and V following the rules proposed in (Lee and Seung, 2000), which have the following general form:

θ_i ← θ_i ( [∂C(θ)/∂θ_i]^- / [∂C(θ)/∂θ_i]^+ )^α

Here C(θ) is the cost function over the non-negative variables θ, and [∂C(θ)/∂θ_i]^- and [∂C(θ)/∂θ_i]^+ are the negative and positive parts of the derivative of C(θ) (Mørup et al., 2008). (Lee and Seung, 2000) proved that for α = 1 the cost function C(θ) monotonically decreases under these multiplicative updates; we also use α = 1. C(θ) for SICTF is given in equation (1), and applying the rule above yields the multiplicative updates for A, R_k, and V. In these updates, * denotes the Hadamard or element-wise product, (A * B)_{ij} = A_{ij} × B_{ij}. In all our experiments, we find the iterative updates to converge in about 10-20 iterations.
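To make the update scheme concrete, the sketch below implements Lee-Seung-style multiplicative updates for a simplified version of the objective that omits the relational coupling term f_rel. The function name, random initialization, and the small eps added for numerical stability are our own choices, not details from the paper:

```python
import numpy as np

def sictf_updates(X, W, c, lam_R=0.1, lam_np=1.0, lam_A=0.1, lam_V=0.1,
                  iters=20, seed=0, eps=1e-9):
    """Multiplicative updates for a simplified coupled objective:
    sum_k ||X_k - A R_k A^T||_F^2 + lam_R ||R_k||_F^2
      + lam_np ||W - A V||_F^2 + lam_A ||A||_F^2 + lam_V ||V||_F^2,
    with A, R, V >= 0 (relational coupling term omitted for brevity)."""
    rng = np.random.default_rng(seed)
    n, _, m = X.shape
    A = rng.random((n, c))
    V = rng.random((c, W.shape[1]))
    R = rng.random((c, c, m))
    for _ in range(iters):
        # A-update: ratio of negative to positive parts of dC/dA.
        num = lam_np * W @ V.T
        den = lam_np * A @ V @ V.T + lam_A * A
        for k in range(m):
            Xk, Rk = X[:, :, k], R[:, :, k]
            num += Xk @ A @ Rk.T + Xk.T @ A @ Rk
            den += A @ (Rk @ (A.T @ A) @ Rk.T + Rk.T @ (A.T @ A) @ Rk)
        A *= num / (den + eps)
        AtA = A.T @ A
        # R_k-updates, one slice of the core tensor per relation.
        for k in range(m):
            Rk = R[:, :, k]
            R[:, :, k] = Rk * (A.T @ X[:, :, k] @ A) / (AtA @ Rk @ AtA + lam_R * Rk + eps)
        # V-update for the hypernym factor matrix.
        V *= (lam_np * A.T @ W) / (lam_np * AtA @ V + lam_V * V + eps)
    return A, R, V
```

Because every update multiplies a non-negative quantity by a non-negative ratio, all factors stay non-negative throughout the iterations.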

Experiments
In this section, we evaluate performance of different methods on the Relation Schema Induction (RSI) task. Specifically, we address the following questions.
• Which method is most effective on the RSI task? (Section 4.3.1)
• How important is the additional side information for RSI? (Section 4.3.2)
• What is the importance of non-negativity in RSI with tensor factorization? (Section 4.3.3)

Experimental Setup
Datasets: We used two datasets for the experiments in this paper; they are summarized in Table 4. For the MEDLINE dataset, we used Stanford CoreNLP (Manning et al., 2014) for coreference resolution and Open IE v4.0 (http://knowitall.github.io/openie/) for triple extraction. Triples with noun phrases that have hypernym information were retained. We obtained the StackOverflow triples directly from the authors of (Movshovitz-Attias and Cohen, 2015), which were also prepared using a very similar process. In both datasets, we use corpus frequency of triples for constructing the tensor.
Side Information: Seven Hearst patterns such as "<hypernym> such as <NP>", "<NP> or other <hypernym>", etc., given in (Hearst, 1992), were used to extract NP side information from the MEDLINE documents. NP side information for the StackOverflow dataset was obtained from the authors of (Movshovitz-Attias and Cohen, 2015).
As described in Section 3, word2vec embeddings of the relation phrases were used to extract relation-similarity-based side information for both datasets. A cosine similarity threshold of γ = 0.7 was used for the experiments in this paper.
Samples of the side information used in the experiments are shown in Table 2 and Table 3. A total of 2,067 unique NP-hypernym pairs were extracted from the MEDLINE data, and 16,639 from the StackOverflow data. 25 unique pairs of relation phrases out of 1,172 were found to be similar in the MEDLINE data, whereas 280 unique pairs out of approximately 3,200 were found similar in the StackOverflow data.
Hyperparameters were tuned using grid search, and the setting which gives minimum reconstruction error for both X and W was chosen. We set λ_np = λ_rel = 100 for StackOverflow, λ_np = 0.05 and λ_rel = 0.001 for MEDLINE, and use c = 50 for all experiments. Please note that our setting is unsupervised, and hence there are no separate train, dev, and test sets.

Evaluation Protocol
In this section, we describe how the induced schemas are presented to human annotators and how final accuracies are calculated. In the factorizations produced by SICTF and its ablated versions, we first select a few top relations with the best reconstruction scores. The schemas induced for each selected relation k are represented by the matrix slice R_k of the core tensor obtained after factorization (see Section 3). From each such matrix, we identify the indices (i, j) with the highest values. The indices i and j select columns of the matrix A. A few top-ranking NPs from the columns A_i and A_j, along with the relation k, are presented to a human annotator, who then evaluates whether the tuple Relation_k(A_i, A_j) constitutes a valid schema for relation k. Examples of a few relation schemas induced by SICTF are presented in Table 5. A human annotator would see the first and second columns of this table and then offer a judgment as indicated in the third column. All such judgments across all top-reconstructed relations are aggregated to get the final accuracy score. This evaluation protocol was also used in (Movshovitz-Attias and Cohen, 2015) to measure learned relation accuracy.
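The selection step described above can be sketched as follows (a minimal illustration; the function and argument names are our own):

```python
import numpy as np

def top_schemas(A, Rk, nps, top_cells=2, top_nps=3):
    """From one relation slice Rk (c x c) and NP-category matrix A (n x c),
    return the top-scoring cells (i, j) with representative NPs for A_i, A_j."""
    # Indices of the largest cells of Rk, scanning the flattened matrix.
    flat = np.argsort(Rk, axis=None)[::-1][:top_cells]
    schemas = []
    for idx in flat:
        i, j = np.unravel_index(idx, Rk.shape)
        # Highest-membership NPs in the two induced categories.
        cat_i = [nps[p] for p in np.argsort(A[:, i])[::-1][:top_nps]]
        cat_j = [nps[p] for p in np.argsort(A[:, j])[::-1][:top_nps]]
        schemas.append((float(Rk[i, j]), cat_i, cat_j))
    return schemas
```

The annotator is then shown each (score, cat_i, cat_j) tuple together with the relation phrase and judges whether it is a valid schema.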
All evaluations were blind, i.e., the annotators were not aware of the method that generated the output they were evaluating. Moreover, the annotators are experts in the software domain and have high-school level knowledge of the medical domain. Though recall is a desirable statistic to measure, it is very challenging to calculate in our setting due to the non-availability of relation-schema-annotated text at large scale.

Effectiveness of SICTF
Experimental results comparing the performance of various methods on the RSI task on the two datasets are presented in Figure 2(a). RSI accuracy is calculated based on the evaluation protocol described in Section 4.2. The performance number of KB-LDA for the StackOverflow dataset is taken directly from the (Movshovitz-Attias and Cohen, 2015) paper; we used our own implementation of KB-LDA for the MEDLINE dataset. Annotation accuracies from two annotators were averaged to get the final accuracy.
From Figure 2(a), we observe that SICTF outperforms KB-LDA on the RSI task. Please note that the inter-annotator agreement for SICTF is 88% and 97% for MEDLINE and StackOverflow datasets respectively. This is the main result of the paper.
In addition to KB-LDA, we also compared SICTF with PARAFAC, a standard tensor factorization method. PARAFAC induced a very small number of relation schemas, of extremely poor quality, and hence we did not consider it any further.
Runtime comparison: Runtimes of SICTF and KB-LDA over both datasets are compared in Figure 2(b). From this figure, we find that SICTF achieves a 14x speedup on average over KB-LDA. In other words, SICTF is not only able to induce better relation schemas, but is also able to do so at significantly faster speed.

Table 6: RSI accuracy comparison of SICTF with its ablated versions: when no relation side information is used (λ_rel = 0), when no NP side information is used (λ_np = 0), when no side information of any kind is used (λ_rel = 0, λ_np = 0), and when additionally there are no non-negativity constraints. We observe that additional side information improves performance, validating one of the central theses of this paper. Please see Section 4.3.2 and Section 4.3.3 for details.
SICTF (λ_rel = 0, λ_np = 0): 0.46, 0.50, 0.48, 0.84, 0.33, 0.59
SICTF (λ_rel = 0, λ_np = 0, no non-negativity constraints): 0.14, 0.10, 0.12, 0.20, 0.14, 0.17

Importance of Side Information
One of the central hypotheses of our approach is that coupled factorization through additional side information should result in better relation schema induction. In order to evaluate this hypothesis, we compare the performance of SICTF with its ablated versions: (1) SICTF (λ_rel = 0), which corresponds to the setting when no relation side information is used; (2) SICTF (λ_np = 0), when no noun phrase side information is used; and (3) SICTF (λ_rel = 0, λ_np = 0), when no side information of any kind is used. Hyperparameters were tuned separately for each variant of SICTF. Results are presented in the first four rows of Table 6. From this, we observe that additional coupling through side information significantly improves SICTF's performance. This further validates the central thesis of our paper.

Importance of Non-Negativity on Relation Schema Induction
In the last row of Table 6, we also present an ablated version of SICTF where neither side information nor non-negativity constraints are used. Comparing the last two rows of this table, we observe that non-negativity constraints over the A matrix and the core tensor R result in significant improvement in performance. We note that the last row in Table 6 is equivalent to RESCAL (Nickel et al., 2011) and the fourth row is equivalent to Non-Negative RESCAL (Krompaß et al., 2013), two tensor factorization techniques. We also note that neither of these tensor factorization techniques has been previously used for the relation schema induction problem.
This improved performance may be explained by the fact that the absence of non-negativity constraints results in an under-constrained factorization problem, where the model often overgenerates incorrect triples and then compensates for this overgeneration by using negative latent factor weights. In contrast, imposing non-negativity constraints restricts the model further, forcing it to commit to specific semantics for the latent factors in A. This improved interpretability also results in better RSI accuracy, as we have seen above. Similar benefits of non-negativity on interpretability have also been observed in matrix factorization (Murphy et al., 2012).

Conclusion
Relation Schema Induction (RSI) is an important first step towards building a Knowledge Graph (KG) out of a text corpus from a given domain. While human domain experts have traditionally prepared listings of relations and their schemas, this expert-mediated model poses significant challenges in terms of scalability and coverage. In order to overcome these challenges, in this paper we presented SICTF, a novel non-negative coupled tensor factorization method for relation schema induction. SICTF is flexible enough to incorporate various types of side information during factorization. Through extensive experiments on real-world datasets, we find that SICTF is not only more accurate but also significantly faster (about 14x speedup) compared to state-of-the-art baselines. As part of future work, we hope to analyze SICTF further, assign labels to induced categories, and apply the model to more domains.