Higher-order Relation Schema Induction using Tensor Factorization with Back-off and Aggregation

Relation Schema Induction (RSI) is the problem of identifying type signatures of arguments of relations from unlabeled text. Most previous work in this area has focused only on binary RSI, i.e., inducing only the subject and object type signatures per relation. However, in practice, many relations are high-order, i.e., they have more than two arguments, and inducing type signatures of all arguments is necessary. For example, in the sports domain, inducing a schema win(WinningPlayer, OpponentPlayer, Tournament, Location) is more informative than inducing just win(WinningPlayer, OpponentPlayer). We refer to this problem as Higher-order Relation Schema Induction (HRSI). In this paper, we propose Tensor Factorization with Back-off and Aggregation (TFBA), a novel framework for the HRSI problem. To the best of our knowledge, this is the first attempt at inducing higher-order relation schemata from unlabeled text. Through experimental analysis on three real-world datasets, we show how TFBA helps in dealing with sparsity and in inducing higher-order schemata.


Introduction
Building Knowledge Graphs (KGs) out of unstructured data is an area of active research. Research in this area has resulted in the construction of several large-scale KGs, such as NELL (Mitchell et al., 2015), Google Knowledge Vault (Dong et al., 2014) and YAGO (Suchanek et al., 2007). These KGs consist of millions of entities and beliefs involving those entities. Such KG construction methods are schema-guided, as they require the list of input relations and their schemata (e.g., playerPlaysSport(Player, Sport)). In other words, knowledge of schemata is an important first step towards building such KGs.
While beliefs in such KGs are usually binary (i.e., involving two entities), many beliefs of interest go beyond two entities. For example, in the sports domain, one may be interested in beliefs of the form win(Roger Federer, Nadal, Wimbledon, London), which is an instance of the high-order (or n-ary) relation win whose schema is given by win(WinningPlayer, OpponentPlayer, Tournament, Location). We refer to the problem of inducing such relation schemata involving multiple arguments as Higher-order Relation Schema Induction (HRSI). In spite of its importance, HRSI is mostly unexplored.
Recently, tensor factorization-based methods have been proposed for binary relation schema induction (Nimishakavi et al., 2016), with gains in both speed and accuracy over previously proposed generative models. To the best of our knowledge, tensor factorization methods have not been used for HRSI. We address this gap in this paper.
Due to data sparsity, straightforward adaptation of tensor factorization from (Nimishakavi et al., 2016) to HRSI is not feasible, as we shall see in Section 3.1. We overcome this challenge in this paper, and make the following contributions.
• We propose Tensor Factorization with Back-off and Aggregation (TFBA), a novel tensor factorization-based method for Higher-order RSI (HRSI). In order to overcome data sparsity, TFBA backs off and jointly factorizes multiple lower-order tensors derived from an extremely sparse higher-order tensor.
• As an aggregation step, we propose a constrained clique mining step which constructs the higher-order schemata from multiple binary schemata.
• Through experiments on multiple real-world datasets, we show the effectiveness of TFBA for HRSI.
The remainder of the paper is organized as follows. We discuss related work in Section 2. In Section 3.1, we first motivate why a back-off strategy is needed for HRSI, rather than factorizing the higher-order tensor. Further, we discuss the proposed TFBA framework in Section 3.2. In Section 4, we demonstrate the effectiveness of the proposed approach using multiple real world datasets. We conclude with a brief summary in Section 5.

Related Work
In this section, we discuss related works in two broad areas: schema induction, and tensor and matrix factorizations.
Schema Induction: Most work on inducing schemata for relations has been in the binary setting (Mohamed et al., 2011; Movshovitz-Attias and Cohen, 2015; Nimishakavi et al., 2016). McDonald et al. (2005) and Peng et al. (2017) extract n-ary relations from biomedical documents, but do not induce the schema, i.e., the type signature of the n-ary relations. There has been a significant amount of work on Semantic Role Labeling (Lang and Lapata, 2011; Titov and Khoddam, 2015; Roth and Lapata, 2016), which can be considered as n-ary relation extraction. However, we are interested in inducing the schemata, i.e., the type signatures of these relations. Event Schema Induction is the problem of inducing schemata for events in the corpus (Balasubramanian et al., 2013; Chambers, 2013; Nguyen et al., 2015). Recently, a model for event representations was proposed in (Weber et al., 2018). Cheung et al. (2013) propose a probabilistic model for inducing frames from text. Their notion of frame is closer to that of scripts (Schank and Abelson, 1977). Script learning is the process of automatically inferring sequences of events from text (Mooney and DeJong, 1985). There is a fair amount of recent work in statistical script learning (Pichotta and Mooney, 2014, 2016). While script learning deals with sequences of events, we try to find the schemata of relations at a corpus level. Ferraro and Durme (2016) propose a unified Bayesian model for scripts, frames and events. Their model tries to capture all levels of the Minsky Frame structure (Minsky, 1974), whereas we work with surface semantic frames.
Tensor and Matrix Factorizations: Matrix factorization and joint tensor-matrix factorizations have been used for the problem of predicting links in the Universal Schema setting (Riedel et al., 2013; Singh et al., 2015). Matrix factorizations have also been used for the problem of finding semantic slots for unsupervised spoken language understanding. Tensor factorization methods are also used in factorizing knowledge graphs (Chang et al., 2014; Nickel et al., 2012). Joint matrix and tensor factorization frameworks, where the matrix provides additional information, are proposed in (Acar et al., 2013). These models are based on PARAFAC (Harshman, 1970), a tensor factorization model which approximates the given tensor as a sum of rank-1 tensors. A boolean Tucker decomposition for discovering facts is proposed in (Erdos and Miettinen, 2013). In this paper, we use a modified version (Tucker2) of Tucker decomposition (Tucker, 1963).
RESCAL (Nickel et al., 2011) is a simplified Tucker model suitable for relational learning. Recently, SICTF (Nimishakavi et al., 2016), a variant of RESCAL with side information, was used for the problem of schema induction for binary relations. SICTF cannot be directly used to induce higher-order schemata, as the higher-order tensors involved in inducing such schemata tend to be extremely sparse. TFBA overcomes this challenge and induces higher-order relation schemata by performing non-negative Tucker-style factorization of sparse tensors while utilizing a back-off strategy, as explained in the next section.

Higher Order Relation Schema Induction using Back-off Factorization
In this section, we start by discussing the approach of factorizing a higher-order tensor and provide the motivation for a back-off strategy. Next, we discuss the proposed TFBA approach in detail. Please refer to Table 1 for notations used in this paper.

Table 1: Notations used in the paper.
$\mathbb{R}_+$: Set of non-negative reals.
$\mathcal{X} \in \mathbb{R}_+^{n_1 \times n_2 \times \ldots \times n_N}$: $N$th-order non-negative tensor.
$\mathcal{X}_{(i)}$: Mode-$i$ matricization of tensor $\mathcal{X}$. Please see (Kolda and Bader, 2009) for details.
$A \in \mathbb{R}_+^{n \times r}$: Non-negative matrix of order $n \times r$.
$*$: Hadamard product: $(A * B)_{i,j} = A_{i,j} \times B_{i,j}$.

Figure 1: Step 1 of TFBA. Rather than factorizing the higher-order tensor $\mathcal{X}$, TFBA performs joint Tucker decomposition of multiple 3-mode tensors, $\mathcal{X}^1$, $\mathcal{X}^2$, and $\mathcal{X}^3$, derived out of $\mathcal{X}$. This joint factorization is performed using shared latent factors $A$, $B$, and $C$. This results in binary schemata, each of which is stored as a cell in one of the core tensors $\mathcal{G}^1$, $\mathcal{G}^2$, and $\mathcal{G}^3$. Please see Section 3.2.1 for details.

Factorizing a Higher-order Tensor
Given a text corpus, we use OpenIEv5 (Mausam, 2016) to extract tuples. Consider the following sentence: "Federer won against Nadal at Wimbledon." Given this sentence, OpenIE extracts the 4-tuple (Federer, won, against Nadal, at Wimbledon). We lemmatize the relations in the tuples and only consider the noun phrases as arguments. Let $T$ represent the set of these 4-tuples. We can construct a 4-order tensor $\mathcal{X} \in \mathbb{R}_+^{n_1 \times n_2 \times n_3 \times m}$ from $T$. Here, $n_1$ is the number of subject noun phrases (NPs), $n_2$ is the number of object NPs, $n_3$ is the number of other NPs, and $m$ is the number of relations in $T$. Values in the tensor correspond to the frequency of the tuples. In the case of 5-tuples of the form (subject, relation, object, other-1, other-2), we split each 5-tuple into two 4-tuples of the form (subject, relation, object, other-1) and (subject, relation, object, other-2), and the frequency of these 4-tuples is taken to be the same as that of the original 5-tuple. Factorizing the tensor $\mathcal{X}$ results in discovering latent categories of NPs, which help in inducing the schemata. We propose the following approach to factorize $\mathcal{X}$.
$$\min_{\mathcal{G}, A, B, C} \; \left\| \mathcal{X} - \mathcal{G} \times_1 A \times_2 B \times_3 C \times_4 I \right\|_F^2 + \lambda_a \|A\|_F^2 + \lambda_b \|B\|_F^2 + \lambda_c \|C\|_F^2,$$
where,
$$A \in \mathbb{R}_+^{n_1 \times r_1}, \quad B \in \mathbb{R}_+^{n_2 \times r_2}, \quad C \in \mathbb{R}_+^{n_3 \times r_3}, \quad \mathcal{G} \in \mathbb{R}_+^{r_1 \times r_2 \times r_3 \times m}, \quad \lambda_a, \lambda_b, \lambda_c \geq 0.$$
Here, $I$ is the identity matrix. Non-negative updates for the variables can be obtained following (Lee and Seung, 2000). Similar to (Nimishakavi et al., 2016), schemata induced will be of the form relation⟨$A_i$, $B_j$, $C_k$⟩. Here, $P_i$ represents the $i$th column of a matrix $P$. $A$ is the embedding matrix of subject NPs in $T$ (i.e., mode-1 of $\mathcal{X}$), and $r_1$ is the embedding rank in mode-1, which is the number of latent categories of subject NPs. Similarly, $B$ and $C$ are the embedding matrices of object NPs and other NPs respectively, and $r_2$ and $r_3$ are the numbers of latent categories of object NPs and other NPs respectively. $\mathcal{G}$ is the core tensor. $\lambda_a$, $\lambda_b$ and $\lambda_c$ are the regularization weights.

Figure 2: Step 2 of TFBA. Induction of higher-order schemata from the tri-partite graph formed from the columns of matrices $A$, $B$, and $C$. Triangles in this graph (solid) represent a 3-ary schema; n-ary schemata for n > 3 can be induced from the 3-ary schemata. Please refer to Section 3.2.2 for details.

However, the 4-order tensors are heavily sparse for all the datasets we consider in this work. The sparsity ratio of this 4-order tensor for all the datasets is of the order 1e-7. As a result of the extreme sparsity, this approach fails to learn any schemata. Therefore, we propose a more successful back-off strategy for higher-order RSI in the next section.
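The tensor construction and the sparsity measurement described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_sparse_tensor` and `sparsity_ratio` are hypothetical, the toy tuples are invented for the example, and the sparsity ratio is assumed to mean the fraction of non-zero cells.

```python
from collections import Counter

def build_sparse_tensor(tuples):
    """Count tuple frequencies and index each mode (hypothetical helper).

    `tuples` holds (subject, relation, object, other_1[, other_2]) entries.
    A 5-tuple is split into two 4-tuples that share its frequency.
    """
    counts = Counter()
    for subj, rel, obj, *others in tuples:
        for other in others:
            counts[(subj, obj, other, rel)] += 1
    # One vocabulary per mode: subjects, objects, others, relations.
    vocabs = [sorted({key[mode] for key in counts}) for mode in range(4)]
    index = [{item: i for i, item in enumerate(v)} for v in vocabs]
    entries = {
        tuple(index[mode][key[mode]] for mode in range(4)): freq
        for key, freq in counts.items()
    }
    shape = tuple(len(v) for v in vocabs)
    return entries, shape

def sparsity_ratio(entries, shape):
    """Fraction of cells that are non-zero (assumed definition)."""
    total = 1
    for n in shape:
        total *= n
    return len(entries) / total

toy_tuples = [
    ("Federer", "win", "Nadal", "Wimbledon"),
    ("Federer", "win", "Nadal", "Wimbledon"),
    ("Federer", "win", "Nadal", "Wimbledon", "London"),  # 5-tuple, split in two
    ("Murray", "beat", "Djokovic", "US Open"),
]
entries, shape = build_sparse_tensor(toy_tuples)
ratio = sparsity_ratio(entries, shape)
```

On this toy input the tensor has shape (2, 2, 3, 2) with only three non-zero cells. On a real corpus the number of cells grows as the product of the four vocabulary sizes while the number of observed tuples grows far more slowly, which is why the ratio becomes so small.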

TFBA: Proposed Framework
To alleviate the problem of sparsity, we construct three tensors $\mathcal{X}^3$, $\mathcal{X}^2$, and $\mathcal{X}^1$ from $T$ as follows:
• $\mathcal{X}^3 \in \mathbb{R}_+^{n_1 \times n_2 \times m}$ is constructed out of the tuples in $T$ by dropping the other argument and aggregating the resulting tuples, i.e., $\mathcal{X}^3_{i,j,p} = \sum_{k=1}^{n_3} \mathcal{X}_{i,j,k,p}$. For example, the 4-tuples ⟨(Federer, Win, Nadal, Wimbledon), 10⟩ and ⟨(Federer, Win, Nadal, Australian Open), 5⟩ will be aggregated to form the triple ⟨(Federer, Win, Nadal), 15⟩.
• $\mathcal{X}^2 \in \mathbb{R}_+^{n_1 \times n_3 \times m}$ is constructed out of the tuples in $T$ by dropping the object argument and aggregating the resulting tuples, i.e., $\mathcal{X}^2_{i,j,p} = \sum_{k=1}^{n_2} \mathcal{X}_{i,k,j,p}$.
• $\mathcal{X}^1 \in \mathbb{R}_+^{n_2 \times n_3 \times m}$ is constructed out of the tuples in $T$ by dropping the subject argument and aggregating the resulting tuples, i.e., $\mathcal{X}^1_{i,j,p} = \sum_{k=1}^{n_1} \mathcal{X}_{k,i,j,p}$.
The proposed framework TFBA for inducing higher-order schemata involves the following two steps.
• Step 1: In this step, TFBA factorizes multiple lower-order overlapping tensors, $\mathcal{X}^1$, $\mathcal{X}^2$, and $\mathcal{X}^3$, derived from $\mathcal{X}$ to induce binary schemata. This step is illustrated in Figure 1 and we discuss details in Section 3.2.1.
• Step 2: In this step, TFBA connects multiple binary schemata identified above to induce higher-order schemata. The method accomplishes this by solving a constrained clique problem. This step is illustrated in Figure 2 and we discuss the details in Section 3.2.2.
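The back-off construction amounts to summing the 4-mode count tensor over one argument mode at a time. A minimal NumPy sketch on a dense toy tensor follows; the variable names and the `density` helper are illustrative assumptions, and a real corpus would need a sparse representation rather than a dense array.

```python
import numpy as np

# Toy 4-mode count tensor: (n1 subjects, n2 objects, n3 others, m relations).
X = np.zeros((2, 2, 3, 1))
X[0, 1, 0, 0] = 10  # <(Federer, Win, Nadal, Wimbledon), 10>
X[0, 1, 1, 0] = 5   # <(Federer, Win, Nadal, Australian Open), 5>

X3 = X.sum(axis=2)  # drop the other argument:   shape (n1, n2, m)
X2 = X.sum(axis=1)  # drop the object argument:  shape (n1, n3, m)
X1 = X.sum(axis=0)  # drop the subject argument: shape (n2, n3, m)

def density(T):
    """Fraction of non-zero cells in a dense array."""
    return np.count_nonzero(T) / T.size
```

Here `X3[0, 1, 0]` holds the aggregated count 15 from the example above, and each backed-off tensor is denser than `X` because the same non-zero mass is spread over far fewer cells.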

Step 1: Back-off Tensor Factorization
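A schematic overview of this step is shown in Figure 1. TFBA first preprocesses the corpus and extracts the OpenIE tuple set $T$ out of it. The 4-mode tensor $\mathcal{X}$ is constructed out of $T$. Instead of factorizing the higher-order tensor $\mathcal{X}$ as in Section 3.1, TFBA creates three tensors out of $\mathcal{X}$: $\mathcal{X}^1 \in \mathbb{R}_+^{n_2 \times n_3 \times m}$, $\mathcal{X}^2 \in \mathbb{R}_+^{n_1 \times n_3 \times m}$ and $\mathcal{X}^3 \in \mathbb{R}_+^{n_1 \times n_2 \times m}$. TFBA performs a coupled non-negative Tucker factorization of the input tensors $\mathcal{X}^1$, $\mathcal{X}^2$ and $\mathcal{X}^3$ by solving the following optimization problem.
$$\min_{\mathcal{G}^1, \mathcal{G}^2, \mathcal{G}^3, A, B, C} \; f(\mathcal{X}^3, \mathcal{G}^3, A, B) + f(\mathcal{X}^2, \mathcal{G}^2, A, C) + f(\mathcal{X}^1, \mathcal{G}^1, B, C) + \lambda_a \|A\|_F^2 + \lambda_b \|B\|_F^2 + \lambda_c \|C\|_F^2, \tag{1}$$
where,
$$f(\mathcal{X}^i, \mathcal{G}^i, P, Q) = \left\| \mathcal{X}^i - \mathcal{G}^i \times_1 P \times_2 Q \times_3 I \right\|_F^2,$$
$$A \in \mathbb{R}_+^{n_1 \times r_1}, \quad B \in \mathbb{R}_+^{n_2 \times r_2}, \quad C \in \mathbb{R}_+^{n_3 \times r_3}, \quad \mathcal{G}^1 \in \mathbb{R}_+^{r_2 \times r_3 \times m}, \quad \mathcal{G}^2 \in \mathbb{R}_+^{r_1 \times r_3 \times m}, \quad \mathcal{G}^3 \in \mathbb{R}_+^{r_1 \times r_2 \times m}.$$
We enforce non-negativity constraints on the matrices $A$, $B$, $C$ and the core tensors $\mathcal{G}^i$ ($i \in \{1, 2, 3\}$). Non-negativity is essential for learning interpretable latent factors (Murphy et al., 2012).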
Each slice of the core tensor $\mathcal{G}^3$ corresponds to one of the $m$ relations. Each cell in a slice corresponds to an induced schema in terms of the latent factors from matrices $A$ and $B$. In other words, $\mathcal{G}^3_{i,j,k}$ is an induced binary schema for relation $k$ involving induced categories represented by columns $A_i$ and $B_j$. Cells in $\mathcal{G}^1$ and $\mathcal{G}^2$ may be interpreted accordingly.
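To make the interpretation of core-tensor cells concrete, the following sketch reads the strongest cell of one relation slice and lists the highest-weighted NPs in the corresponding columns of the two factor matrices. The helper names, the toy vocabularies, and the toy values are all assumptions made for illustration; they are not taken from the experiments.

```python
import numpy as np

def top_nps(factor, column, vocab, k=2):
    """Names of the k NPs with the largest weight in one factor column."""
    order = np.argsort(-factor[:, column], kind="stable")[:k]
    return [vocab[i] for i in order]

def binary_schemata(G, P, Q, p_vocab, q_vocab, relation, top=1, k=2):
    """Read the `top` strongest (P_i, Q_j) cells of the slice for one relation."""
    slice_k = G[:, :, relation]
    flat = np.argsort(-slice_k, axis=None, kind="stable")[:top]
    cells = [np.unravel_index(c, slice_k.shape) for c in flat]
    return [
        (top_nps(P, i, p_vocab, k), top_nps(Q, j, q_vocab, k), float(slice_k[i, j]))
        for i, j in cells
    ]

# Toy factors: two latent subject categories and two latent object categories.
subj_vocab = ["federer", "nadal", "yankees"]
obj_vocab = ["djokovic", "murray", "mets"]
A = np.array([[0.9, 0.0], [0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0, 0.9], [0.1, 0.8], [0.9, 0.0]])
G3 = np.zeros((2, 2, 1))
G3[0, 1, 0] = 5.0  # strongest cell pairs column A_0 with column B_1
G3[1, 0, 0] = 2.0

schemata = binary_schemata(G3, A, B, subj_vocab, obj_vocab, relation=0)
```

The same function applies to the other two core tensors by passing the matching pair of factor matrices and vocabularies.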
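We derive non-negative multiplicative updates for $A$, $B$ and $C$ following the NMF updating rules given in (Lee and Seung, 2000). For the update of $A$, we consider the mode-1 matricization of the first and second terms in Equation 1, along with the regularizer.
$$A \leftarrow A * \frac{\mathcal{X}^3_{(1)} G_{BA}^\top + \mathcal{X}^2_{(1)} G_{CA}^\top}{A \left[ G_{BA} G_{BA}^\top + G_{CA} G_{CA}^\top \right] + \lambda_a A},$$
where,
$$G_{BA} = \mathcal{G}^3_{(1)} (I \otimes B)^\top, \quad G_{CA} = \mathcal{G}^2_{(1)} (I \otimes C)^\top.$$
In order to estimate $B$, we consider the mode-2 matricization of the first term and the mode-1 matricization of the third term in Equation 1, along with the regularization term. We get the following update rule for $B$:
$$B \leftarrow B * \frac{\mathcal{X}^3_{(2)} G_{AB}^\top + \mathcal{X}^1_{(1)} G_{CB}^\top}{B \left[ G_{AB} G_{AB}^\top + G_{CB} G_{CB}^\top \right] + \lambda_b B},$$
where,
$$G_{AB} = \mathcal{G}^3_{(2)} (I \otimes A)^\top, \quad G_{CB} = \mathcal{G}^1_{(1)} (I \otimes C)^\top.$$
For updating $C$, we consider the mode-2 matricization of the second and third terms in Equation 1, along with the regularization term, and we get
$$C \leftarrow C * \frac{\mathcal{X}^2_{(2)} G_{AC}^\top + \mathcal{X}^1_{(2)} G_{BC}^\top}{C \left[ G_{AC} G_{AC}^\top + G_{BC} G_{BC}^\top \right] + \lambda_c C},$$
where,
$$G_{AC} = \mathcal{G}^2_{(2)} (I \otimes A)^\top, \quad G_{BC} = \mathcal{G}^1_{(2)} (I \otimes B)^\top.$$
Finally, we update the three core tensors in Equation 1 following (Kim and Choi, 2007) as follows:
$$\mathcal{G}^1 \leftarrow \mathcal{G}^1 * \frac{\mathcal{X}^1 \times_1 B^\top \times_2 C^\top}{\mathcal{G}^1 \times_1 B^\top B \times_2 C^\top C}, \quad \mathcal{G}^2 \leftarrow \mathcal{G}^2 * \frac{\mathcal{X}^2 \times_1 A^\top \times_2 C^\top}{\mathcal{G}^2 \times_1 A^\top A \times_2 C^\top C}, \quad \mathcal{G}^3 \leftarrow \mathcal{G}^3 * \frac{\mathcal{X}^3 \times_1 A^\top \times_2 B^\top}{\mathcal{G}^3 \times_1 A^\top A \times_2 B^\top B}.$$
In all the above updates, $\frac{P}{Q}$ represents element-wise division, $\otimes$ is the Kronecker product, and $I$ is the identity matrix.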
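One round of the coupled multiplicative updates can be sketched in NumPy as below. This is an illustrative reading of the procedure, not the authors' implementation: it assumes squared Frobenius-norm regularizers, uses `einsum` contractions in place of explicit matricizations and Kronecker products, and adds a small constant `EPS` to each denominator for numerical safety. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero; an assumption of this sketch

def tucker2(G, P, Q):
    """Reconstruct G x_1 P x_2 Q, leaving the relation mode untouched."""
    return np.einsum("ia,jb,abk->ijk", P, Q, G)

def stats(X, G, P, Q, mode):
    """Numerator and Gram matrix one term contributes to P (mode 0) or Q (mode 1)."""
    if mode == 0:
        M = np.einsum("abk,jb->ajk", G, Q)  # G x_2 Q
        return np.einsum("ijk,ajk->ia", X, M), np.einsum("ajk,cjk->ac", M, M)
    M = np.einsum("abk,ia->bik", G, P)      # G x_1 P, indexed by Q's rank first
    return np.einsum("ijk,bik->jb", X, M), np.einsum("bik,cik->bc", M, M)

def update_factor(F, parts, lam):
    """Multiplicative update of a factor shared by the terms in `parts`."""
    num = sum(n for n, _ in parts)
    gram = sum(g for _, g in parts)
    return F * num / (F @ gram + lam * F + EPS)

def update_core(X, G, P, Q):
    """Multiplicative update of one core tensor with its factors held fixed."""
    num = np.einsum("ijk,ia,jb->abk", X, P, Q)
    den = np.einsum("abk,ca,db->cdk", G, P.T @ P, Q.T @ Q)
    return G * num / (den + EPS)

def objective(X1, X2, X3, G1, G2, G3, A, B, C, lam):
    """Sum of the three squared reconstruction errors plus the regularizers."""
    terms = ((X3, G3, A, B), (X2, G2, A, C), (X1, G1, B, C))
    fit = sum(np.sum((X - tucker2(G, P, Q)) ** 2) for X, G, P, Q in terms)
    return fit + sum(l * np.sum(F ** 2) for l, F in zip(lam, (A, B, C)))

def tfba_step(X1, X2, X3, G1, G2, G3, A, B, C, lam):
    """One sweep: update A, B, C in turn, then the three core tensors."""
    la, lb, lc = lam
    A = update_factor(A, [stats(X3, G3, A, B, 0), stats(X2, G2, A, C, 0)], la)
    B = update_factor(B, [stats(X3, G3, A, B, 1), stats(X1, G1, B, C, 0)], lb)
    C = update_factor(C, [stats(X2, G2, A, C, 1), stats(X1, G1, B, C, 1)], lc)
    G1 = update_core(X1, G1, B, C)
    G2 = update_core(X2, G2, A, C)
    G3 = update_core(X3, G3, A, B)
    return G1, G2, G3, A, B, C

def run_demo(steps=30, seed=0, lam=(0.1, 0.1, 0.1)):
    """Run the updates on small random non-negative tensors (toy data)."""
    rng = np.random.default_rng(seed)
    n1, n2, n3, m, r = 6, 5, 4, 3, 2
    X1 = rng.random((n2, n3, m))
    X2 = rng.random((n1, n3, m))
    X3 = rng.random((n1, n2, m))
    shapes = ((r, r, m), (r, r, m), (r, r, m), (n1, r), (n2, r), (n3, r))
    state = [rng.random(s) + 0.1 for s in shapes]
    start = objective(X1, X2, X3, *state, lam)
    for _ in range(steps):
        state = tfba_step(X1, X2, X3, *state, lam)
    return start, objective(X1, X2, X3, *state, lam), state
```

Because every update multiplies a non-negative quantity by a ratio of non-negative quantities, the factors and cores stay non-negative without any explicit projection, which is the main practical appeal of this style of update.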
Initialization: For initializing the component matrices $A$, $B$, and $C$, we first perform a non-negative Tucker2 decomposition of each of the individual input tensors $\mathcal{X}^1$, $\mathcal{X}^2$, and $\mathcal{X}^3$. We then initialize each component matrix with the average of the corresponding matrices obtained from the individual decompositions. We initialize the core tensors $\mathcal{G}^1$, $\mathcal{G}^2$, and $\mathcal{G}^3$ with the core tensors obtained from the individual decompositions.

Step 2: Binary to Higher-Order Schema Induction
In this section, we describe how a higher-order schema is constructed from the factorization described in the previous sub-section. Each relation $k$ has three representations, given by the slices $\mathcal{G}^1_k$, $\mathcal{G}^2_k$ and $\mathcal{G}^3_k$ of the three core tensors. We need a principled way to produce a joint schema from these representations. For a relation, we select the top-n indices $(i, j)$ with the highest values from each of these matrices. The indices $i$ and $j$ from $\mathcal{G}^3_k$ correspond to column numbers of $A$ and $B$ respectively, indices from $\mathcal{G}^2_k$ correspond to columns of $A$ and $C$, and indices from $\mathcal{G}^1_k$ correspond to columns of $B$ and $C$. We construct a tri-partite graph in which the column numbers of each of the component matrices $A$, $B$ and $C$ are the vertices, with the columns of each matrix forming an independent set, and the selected top-n index pairs are the edges between these vertices. From this tri-partite graph, we find all the triangles, each of which gives a schema with three arguments for the relation, as illustrated in Figure 2. We find higher-order schemata, i.e., schemata with more than three arguments, by merging two third-order schemata that share the same column numbers from $A$ and $B$. For example, if we find two schemata $(A_2, B_4, C_{10})$ and $(A_2, B_4, C_8)$, then we merge these two to give $(A_2, B_4, C_{10}, C_8)$ as a higher-order schema. This can be continued further for even higher-order schemata. This process may be thought of as finding a constrained clique over the tri-partite graph. Here, the constraint is that in the maximal clique there can be only one edge between the sets corresponding to columns of $A$ and columns of $B$.
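The triangle search and the merging rule can be sketched in plain Python as follows. The function names are hypothetical, and the sketch makes one simplifying assumption: all triangles that share the same $(A, B)$ edge are merged into a single schema, without any scoring of the resulting cliques.

```python
def top_n_edges(slice_matrix, n):
    """Index pairs (i, j) of the n largest cells in a 2-D list of scores."""
    cells = [(v, i, j) for i, row in enumerate(slice_matrix) for j, v in enumerate(row)]
    cells.sort(key=lambda c: -c[0])  # stable sort: ties keep row-major order
    return {(i, j) for _, i, j in cells[:n]}

def merge_triangles(ab_edges, ac_edges, bc_edges):
    """Map each (a, b) edge to the sorted C columns that close a triangle with it.

    A triangle (a, b, c) needs the edges (a, b), (a, c) and (b, c).
    Triangles sharing the same (a, b) edge are merged into one schema.
    """
    schemata = {}
    for a, b in ab_edges:
        cs = sorted(c for a2, c in ac_edges if a2 == a and (b, c) in bc_edges)
        if cs:
            schemata[(a, b)] = cs
    return schemata

# Edges chosen to mirror the example in the text: A_2, B_4, C_10 and C_8.
ab = {(2, 4)}
ac = {(2, 10), (2, 8), (1, 3)}
bc = {(4, 10), (4, 8), (5, 3)}
merged = merge_triangles(ab, ac, bc)

edges = top_n_edges([[0.1, 0.9], [0.5, 0.2]], n=2)
```

Keying the result by the $(a, b)$ edge enforces the stated constraint directly, since each induced schema then contains exactly one edge between the columns of $A$ and the columns of $B$.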
The procedure above is inspired by (McDonald et al., 2005). However, we note that (McDonald et al., 2005) solved a different problem, viz., n-ary relation instance extraction, while our focus is on inducing schemata. Though we discuss the case of back-off from 4-order to 3-order, ideas presented above can be extended for even higher orders depending on the sparsity of the tensors.

Experiments
In this section, we evaluate the performance of TFBA for the task of HRSI. We also propose a baseline model for HRSI called HardClust.
HardClust: We propose a baseline model called the Hard Clustering Baseline (HardClust) for the task of higher-order relation schema induction. This model induces schemata by grouping per-relation NP arguments from OpenIE extractions. In other words, for each relation, all the noun phrases (NPs) in the first argument form a cluster that represents the subject of the relation, all the NPs in the second argument form a cluster that represents the object, and so on. Then, from each cluster, the most frequent NPs are chosen as the representative NPs for the argument type. We note that this method is only able to induce one schema per relation.
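A minimal sketch of such a frequency-based baseline is given below. The function name `hardclust`, the `top_k` cut-off, and the toy tuples are assumptions for illustration; the number of representative NPs kept per argument is not specified in the text.

```python
from collections import Counter, defaultdict

def hardclust(tuples, top_k=2):
    """Induce one schema per relation from (subject, relation, object, other) tuples.

    Each argument position gets one cluster per relation, represented by its
    `top_k` most frequent noun phrases.
    """
    slots = defaultdict(lambda: [Counter(), Counter(), Counter()])
    for subj, rel, obj, other in tuples:
        for counter, phrase in zip(slots[rel], (subj, obj, other)):
            counter[phrase] += 1
    return {
        rel: [[phrase for phrase, _ in c.most_common(top_k)] for c in counters]
        for rel, counters in slots.items()
    }

toy = [
    ("yankees", "beat", "mets", "stadium"),
    ("yankees", "beat", "mets", "stadium"),
    ("yankees", "beat", "red sox", "stadium"),
    ("team", "beat", "mets", "home"),
]
schema = hardclust(toy)
```

Since the clusters are determined purely by counts, a handful of very frequent NPs tends to dominate every relation, and each relation receives exactly one schema.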
Datasets: We run our experiments on three datasets. The first dataset (Shootings) is a collection of 1,335 documents constructed from a publicly available database of mass shootings in the United States. The second is the New York Times Sports (NYT Sports) dataset, which is a collection of 20,940 sports documents from the period between 2005 and 2007. The third dataset (MUC) is a set of 1,300 Latin American newswire documents about terrorism events. After performing the processing steps described in Section 3, we obtained 357,914 unique OpenIE extractions from the NYT Sports dataset, 10,847 from the Shootings dataset, and 8,318 from the MUC dataset. However, in order to properly analyze and evaluate the model, we consider only the 50 most frequent relations in the datasets and their corresponding OpenIE extractions. This is done to avoid noisy OpenIE extractions, which yields better data quality and aids subsequent manual evaluation of the data. We construct input tensors following the procedure described in Section 3.2. Details on the dimensions of the tensors obtained are given in Table 2.
Model Selection: In order to select appropriate TFBA parameters, we perform a grid search over the space of hyper-parameters, and select the set of hyper-parameters that gives the best Average FIT score (AvgFIT).
$$\mathrm{AvgFIT}(\mathcal{G}^1, \mathcal{G}^2, \mathcal{G}^3, A, B, C, \mathcal{X}^1, \mathcal{X}^2, \mathcal{X}^3) = \frac{1}{3} \left\{ \mathrm{FIT}(\mathcal{X}^1, \mathcal{G}^1, B, C) + \mathrm{FIT}(\mathcal{X}^2, \mathcal{G}^2, A, C) + \mathrm{FIT}(\mathcal{X}^3, \mathcal{G}^3, A, B) \right\},$$
where,
$$\mathrm{FIT}(\mathcal{X}, \mathcal{G}, P, Q) = 1 - \frac{\left\| \mathcal{X} - \mathcal{G} \times_1 P \times_2 Q \right\|_F}{\left\| \mathcal{X} \right\|_F}.$$
We perform a grid search for the rank parameters between 5 and 20; for the regularization weights, we perform a grid search between 0 and 1. Table 3 provides the details of the hyper-parameters set for the different datasets.
Evaluation Protocol: For TFBA, we follow the protocol mentioned in Section 3.2.2 for constructing higher-order schemata. For every relation, we consider the top 5 binary schemata from the factorization of each tensor. We construct a tri-partite graph, as explained in Section 3.2.2, and mine constrained maximal cliques from the tri-partite graphs for schemata. Table 4 provides some qualitative examples of higher-order schemata induced by TFBA. The accuracy of the schemata induced by the model is evaluated by human evaluators. In our experiments, we use human judgments from three evaluators. For every relation, the first and second columns given in Table 4 are presented to the evaluators, and they are asked to validate the schema. We present to the evaluators the top 50 schemata induced by TFBA, ranked by the score of the constrained maximal clique. This evaluation protocol was also used in (Movshovitz-Attias and Cohen, 2015) for evaluating ontology induction. All evaluations were blind, i.e., the evaluators were not aware of the model they were evaluating.
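The model-selection score can be computed as in the sketch below. It assumes the common definition of FIT as one minus the relative Frobenius reconstruction error, averaged over the three backed-off tensors; the function names are hypothetical and the random tensors stand in for real data.

```python
import numpy as np

def fit(X, G, P, Q):
    """1 - ||X - G x_1 P x_2 Q||_F / ||X||_F (assumed definition of FIT)."""
    X_hat = np.einsum("ia,jb,abk->ijk", P, Q, G)
    return 1.0 - np.sqrt(np.sum((X - X_hat) ** 2)) / np.sqrt(np.sum(X ** 2))

def avg_fit(X1, X2, X3, G1, G2, G3, A, B, C):
    """Average FIT over the three coupled Tucker2 reconstructions."""
    return (fit(X1, G1, B, C) + fit(X2, G2, A, C) + fit(X3, G3, A, B)) / 3.0

# Sanity check on tensors that are exactly representable by the factors.
rng = np.random.default_rng(0)
n1, n2, n3, m, r = 4, 5, 6, 3, 2
A, B, C = rng.random((n1, r)), rng.random((n2, r)), rng.random((n3, r))
G1, G2, G3 = (rng.random((r, r, m)) for _ in range(3))
X1 = np.einsum("ia,jb,abk->ijk", B, C, G1)
X2 = np.einsum("ia,jb,abk->ijk", A, C, G2)
X3 = np.einsum("ia,jb,abk->ijk", A, B, G3)

perfect = avg_fit(X1, X2, X3, G1, G2, G3, A, B, C)
empty = avg_fit(X1, X2, X3, 0 * G1, 0 * G2, 0 * G3, A, B, C)
```

A grid search would evaluate `avg_fit` once per hyper-parameter setting after fitting and keep the setting with the highest score; a value near 1 indicates near-exact reconstruction, while all-zero cores give a score of 0.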
Difficulty with Computing Recall: Even though recall is a desirable measure, it is not possible to compute it because no corpus annotated with gold higher-order schemata is available. Although the MUC dataset has gold annotations for a predefined list of events, it does not have annotations for the relations.
Experimental results comparing the performance of various models for the task of HRSI are given in Table 5. We present evaluation results from three evaluators, represented as E1, E2 and E3. As can be observed from Table 5, TFBA achieves better results than HardClust for the Shootings and NYT Sports datasets; however, HardClust achieves better results for the MUC dataset. The percentage agreement of the evaluators for TFBA is 72%, 70% and 60% for the Shootings, NYT Sports and MUC datasets respectively.
HardClust Limitations: Even though HardClust gives better induction for the MUC corpus, this approach has some serious drawbacks. HardClust can only induce one schema per relation. This is a restrictive constraint, as multiple senses can exist for a relation. For example, consider the schemata induced for the relation shoot as shown in Table 4. TFBA induces two senses for the relation, but HardClust can induce only one schema. For a set of 4-tuples, HardClust can only induce ternary schemata; the dimensionality of the schemata cannot be varied. Since the latent factors induced by HardClust are entirely based on frequency, the latent categories induced by HardClust are dominated by a fixed set of noun phrases. For example, in the NYT Sports dataset, the subject category induced by HardClust for all the relations is ⟨team, yankees, mets⟩. In addition to inducing only one schema per relation, HardClust most of the time induces only a fixed set of categories. For TFBA, in contrast, the number of categories depends on the rank of the factorization, which is a user-provided parameter, thus providing more flexibility in choosing the latent categories.

Using Event Schema Induction for HRSI
Event schema induction is defined as the task of learning high-level representations of events, such as a tournament, and their entity roles, such as winning player, from unlabeled text. Even though the main focus of event schema induction is to induce the important roles of the events, as a side result most of the algorithms also provide schemata for the relations. In this section, we investigate the effectiveness of these schemata compared to the ones induced by TFBA.
Event schemata are represented as a set of (Actor, Rel, Actor) triples in (Balasubramanian et al., 2013). Actors represent groups of noun phrases and Rels represent relations. From this style of representation, however, the n-ary schemata for relations cannot be induced. Event schemata generated in (Weber et al., 2018) are similar to those in (Balasubramanian et al., 2013). The event schema induction algorithm proposed in (Nguyen et al., 2015) does not induce schemata for relations, but rather induces the roles for the events. For this investigation, we experiment with the following algorithm.
Chambers-13 (Chambers, 2013): This model learns event templates from text documents. Each event template provides a distribution over slots, where slots are clusters of NPs. Each event template also provides a cluster of relations, which is most likely to appear in the context of the aforementioned slots. We evaluate the schemata of these relation clusters.
As can be observed from Table 5, the proposed TFBA performs much better than Chambers-13. HardClust also performs better than Chambers-13 on all the datasets. From this analysis, we infer that there is a need for algorithms which induce higher-order schemata for relations, a gap we fill in this paper. Please note that the experimental results provided in (Chambers, 2013) for the MUC dataset are for the task of event schema induction, whereas in this work we evaluate the relation schemata. Hence, the results in (Chambers, 2013) and the results in this paper are not comparable. Example schemata induced by TFBA and Chambers-13 are provided as part of the supplementary material.

Table 5: Higher-order RSI accuracies of various methods on the three datasets. Induced schemata for each dataset and method are evaluated by three human evaluators, E1, E2, and E3. TFBA performs better than HardClust for the Shootings and NYT Sports datasets. Even though HardClust achieves better accuracy on the MUC dataset, it has several limitations; see Section 4 for more details. Chambers-13 solves a slightly different problem called event schema induction; for more details about the comparison with Chambers-13, see Section 4.1.

Conclusion
Higher-order Relation Schema Induction (HRSI) is an important first step towards building domain-specific Knowledge Graphs (KGs). In this paper, we proposed TFBA, a tensor factorization-based method for higher-order RSI. To the best of our knowledge, this is the first attempt at inducing higher-order (n-ary) schemata for relations from unlabeled text. Rather than factorizing a severely sparse higher-order tensor directly, TFBA performs back-off and jointly factorizes multiple lower-order tensors derived out of the higher-order tensor. In the second step, TFBA solves a constrained clique problem to induce higher-order schemata out of multiple binary schemata. We are hopeful that the back-off-based factorization idea exploited in TFBA will be useful in other sparse factorization settings.