MCMH: Learning Multi-Chain Multi-Hop Rules for Knowledge Graph Reasoning

Multi-hop reasoning approaches over knowledge graphs infer a missing relationship between entities with a multi-hop rule, which corresponds to a chain of relationships. We extend existing works to consider a generalized form of multi-hop rules, where each rule is a set of relation chains. To learn such generalized rules efficiently, we propose a two-step approach that first selects a small set of relation chains as a rule and then evaluates the confidence of the target relationship by jointly scoring the selected chains. We propose a game-theoretic framework to simultaneously optimize the rule selection and prediction steps. Empirical results show that our multi-chain multi-hop (MCMH) rules outperform standard single-chain approaches, justifying both our formulation of generalized rules and the effectiveness of the proposed learning framework.


Introduction
Knowledge graphs (KGs) represent knowledge of the world as relationships between entities, i.e., triples with the form (subject, predicate, object) (Bollacker et al., 2008; Suchanek et al., 2007; Vrandečić and Krötzsch, 2014; Auer et al., 2007; Carlson et al., 2010). Such knowledge resources provide clean and structured evidence for many downstream applications such as question answering. KGs are usually constructed by human experts, which is time-consuming and leads to highly incomplete graphs (Min et al., 2013). Therefore, automatic KG completion (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2014; Socher et al., 2013; Lao et al., 2011) has been proposed to infer a missing relationship link r between a head entity h and a tail entity t.
Existing KG completion work mainly makes use of two types of information: 1) co-occurrence of entities and relations and 2) deducible reasoning paths of tuples. KG embedding methods encode the first type of information, entities and relations, into a continuous vector space with low-rank tensor approximations (Bordes et al., 2013; Dettmers et al., 2017; Lin et al., 2015; Neelakantan et al., 2015; Shi and Weninger, 2017; Trouillon et al., 2016; Wang et al., 2014; Xie et al., 2016; Yang et al., 2014).

[Figure 1: Examples of reasoning with multiple paths. (a) A standard multi-hop example: the target can be sufficiently inferred with one chain. (b) An example that requires a rule as the conjunction of two chains (the stadium hosts two teams but only one from the NBA). (c) An example where multiple chains cannot sufficiently infer the target but improve its confidence.]
Our approach utilizes the second type of information: reasoning paths of tuples that can be deduced to imply the target tuple (Lao and Cohen, 2010; Xiong et al., 2017; Das et al., 2016, 2017). Here a reasoning path starts with the head entity h and ends with the tail entity t: h --r_1--> t_1 --r_2--> ... --r_N--> t, where r_1 ∧ ... ∧ r_N forms a relation chain that infers the existence of r. Therefore these methods are also referred to as multi-hop reasoning over KGs: they learn a multi-hop chain as a rule to deduce the target r. An example of such a chain is given in Figure 1a to infer whether an athlete plays in a location. Multi-hop reasoning approaches can usually utilize richer evidence and are self-justifiable in terms of the reasoning-path rules used in the predictions, making the prediction of missing relations more interpretable. Despite the advantages and success of the multi-hop reasoning approach (Lin et al., 2018; Xiong et al., 2017; Das et al., 2017; Shen et al., 2018; Zhang et al., 2017), a target relationship may not be perfectly inferred from a single relation chain. There could instead exist multiple weak relation chains that correlate with the target relation. Figure 1 gives examples of such cases. These multiple chains could be leveraged in the following ways: (1) the reasoning process naturally relies on the logical conjunction of multiple chains (Figure 1b); (2) more commonly, there are instances for which none of the chains alone is accurate, but aggregating multiple pieces of evidence improves the confidence (Figure 1c), as also observed in case-based reasoning works (Aamodt and Plaza, 1994; Das et al., 2020). Inspired by these observations, we propose the concept of a multi-chain multi-hop rule set. Here, instead of treating each single multi-hop chain as a rule, we learn rules consisting of a small set of multi-hop chains. The inference of target relationships therefore becomes a joint scoring of such a set of chains.
We treat each set of chains as one rule and, since different query pairs can follow different rules, together we have a set of rules to reason about each relation.
Learning the generalized multi-hop rule set is a combinatorial search problem. We address this challenge with a game-theoretic approach inspired by rationalization methods (Lei et al., 2016; Carton et al., 2018; Yu et al., 2019). Our approach consists of two steps: (1) selecting a generalized multi-hop rule set by employing a Multi-Layer Perceptron (MLP) over the candidate chains; (2) reasoning with the generalized rule set, using another MLP to model the conditional probability of the target relationship given the selected relation chains. The nonlinearity of the MLP reasoner makes it possible to model logical conjunctions among the selected chains in the rule set.
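The two-step scoring can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the hidden sizes, the placeholder chain features, and the helper names (`mlp`, `init_mlp`) are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """ReLU MLP; the final layer is left linear (logits)."""
    for W, b in weights[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = weights[-1]
    return x @ W + b

def init_mlp(sizes):
    """Random-init weight/bias pairs for the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

n_chains, d = 8, 2                     # |R_i| candidate chains, rule size d
generator = init_mlp([n_chains, 16, n_chains])  # per-chain selection logits
predictor = init_mlp([n_chains, 16, 1])         # scores a binary chain mask

x = np.ones(n_chains)                  # placeholder chain features
sel_logits = mlp(x, generator)
top_d = np.argsort(-sel_logits)[:d]    # rule S_i: the d most likely chains

v_S = np.zeros(n_chains)               # binary encoding v_{S_i}
v_S[top_d] = 1.0
p_r = 1.0 / (1.0 + np.exp(-mlp(v_S, predictor)))  # estimate of P(r | S_i)
```

With untrained random weights `p_r` is of course meaningless; the point is only the data flow: generator logits → top-d selection → binary rule encoding → predictor probability.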
We demonstrate the advantage of our method on KG completion tasks on FB15K-237 and NELL-995. Our method outperforms existing single-chain approaches, showing that our generalized rules are necessary for many reasoning tasks.

Background
Problem Formulation We aim to infer missing relationships between two given entities, such as athleteAtLocation between Neymar and Paris, given their other connections in the knowledge graph. Formally, we are given a knowledge graph G, consisting of a set of triplets O = {(h, r, t)}, where r is a relation edge defined in G, h is a head entity, and t is the tail entity. The task is to identify the relation r̂ between a query entity pair ĥ and t̂. For evaluation, we have ground-truth labels indicating whether each pair (ĥ, t̂) has the relationship r̂ or not.
For a given query (ĥ_i, r̂, t̂_i), the i-th sample for r̂, we extract a set of candidate relation chains R. Each chain is a sequence of connected relations between ĥ and t̂ in G. The proposed multi-chain multi-hop rule set is a set of rules, each consisting of multiple relation chains S ⊂ R with size d = |S|. In the experiments, we represent each relation chain R_n with relation names only. Our task is to find such an S for a target relation r̂ over each query pair ĥ_i and t̂_i, and to estimate the confidence P(r̂|S). Note that S and R depend on the query sample (ĥ_i, r̂, t̂_i), but for notational simplicity we omit i and r̂ from S_i^r̂ and R_i^r̂.

Relation Chains Extraction To obtain the set of candidate relation chains R for a target relation r̂, we take the following extraction steps. First, we extract a fixed-hop (k) sub-graph from the original KG. Each sub-graph starts with an entity ĥ, ends with an entity t̂, and satisfies (ĥ, r̂, t̂) ∈ G. The sub-graph consists of a list of m-hop paths connecting the two ends, where 1 ≤ m ≤ k. Each m-hop path has the form (ĥ, r_1, t_1), (t_1, r_2, t_2), ..., (t_{m-1}, r_m, t̂). We call r_1 → r_2 → ... → r_m a candidate relation chain R. High values of k can result in an intractable number of chains, while low values may not provide sufficient coverage. Here we extract chains of length up to k = 3, and for relations r̂ with a large number of chains (|R| ≥ 10^4), we filter out extracted chains whose counts in the positive training data for that relation fall below a set threshold (proportional to the total count of relation chains).
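The extraction step above amounts to enumerating all relation-name sequences along directed paths of at most k hops from ĥ to t̂. A minimal sketch (the function name and the triple representation are illustrative, not from the paper):

```python
from collections import defaultdict

def extract_chains(triples, h, t, k=3):
    """Enumerate candidate relation chains (r_1, ..., r_m), m <= k,
    along directed paths from head entity h to tail entity t."""
    # adjacency list: entity -> list of (relation, next_entity)
    adj = defaultdict(list)
    for (s, r, o) in triples:
        adj[s].append((r, o))

    chains = set()
    # depth-first enumeration; bounding path length by k keeps this
    # finite even when the graph contains cycles
    stack = [(h, ())]
    while stack:
        node, rels = stack.pop()
        if rels and node == t:
            chains.add(rels)          # record the relation-name chain
        if len(rels) < k:
            for r, nxt in adj[node]:
                stack.append((nxt, rels + (r,)))
    return chains
```

For example, with triples (Neymar, playsForTeam, PSG) and (PSG, teamHomeCity, Paris), the query pair (Neymar, Paris) yields the single candidate chain playsForTeam → teamHomeCity.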

A Game-Theoretic Approach for MCMH Rule Learning
A Three-Player Game for Rule Learning Finding a set of chains as the rule is a combinatorial search problem in R. For example, given an input of 1,000 chains between a training entity pair, the selection of a set-rule of 4 chains corresponds to a search space on the order of 10^12. Hence, we propose a game-theoretic approximation to learn to generate predictive chains and reduce the learning complexity. Our method is inspired by the line of rationalization works (Carton et al., 2018; Yu et al., 2019). Specifically, our input is a set of chains R_i ⊂ R for relation r̂ and each training sample (ĥ_i, r̂, t̂_i).

[Figure 2: An example of the workflow. In the prediction phase, S_i is encoded as v_{S_i} = [0, 0, 1, 1] and the predictor estimates the probability of athleteHomeStadium being true as 100%. The complement S_i^c is encoded as v_{S_i^c} = [1, 1, 0, 0] and the complement predictor estimates the probability as 19%.]
Our method consists of three submodels: (1) a rule-set generator that selects the set of chains S_i as a rule, (2) a reasoner (predictor) that predicts the probability of r̂ based on S_i, and (3) a complement predictor that predicts the probability of r̂ based on S_i^c = R_i \ S_i. During training, the predictor and the complement predictor aim to minimize the cross-entropy loss for predicting the existence of r̂. The generator is optimized to make the predictor perform well while decreasing the complement predictor's accuracy. In other words, the generator plays a cooperative game with the predictor so that the selected rule set S_i is useful for inferring the target relationship r̂. At the same time, it plays an adversarial game with the complement predictor to ensure that no critical information is left out, i.e., to ensure the comprehensiveness of the selected S_i. An example of the workflow is given in Figure 2.

Predictors The predictor estimates the probability of r̂ being true conditioned on S_i, denoted as p̂(r̂|S_i).
The complement predictor estimates the probability of r̂ conditioned on S_i^c, denoted as p̂_c(r̂|S_i^c). The two models are optimized to minimize the cross-entropy losses

L_p = H(p(r̂ | ĥ_i, t̂_i); p̂(r̂ | S_i)),  L_c = H(p(r̂ | ĥ_i, t̂_i); p̂_c(r̂ | S_i^c)),

where H(p; q) denotes the cross entropy between p and q, and p(·|·) denotes the empirical distribution. We encode the inputs S_i and S_i^c as binary vectors v_{S_i} and v_{S_i^c}, respectively,¹ both of dimension |R_i|, with each dimension corresponding to one relation chain in the candidate set R_i. The j-th component of v_{S_i} is set to 1 if and only if the j-th chain is selected in S_i, i.e., R_j ∈ S_i, and similarly for v_{S_i^c}. The input vectors are fed into a 3-layer MLP to predict whether r̂ holds for (ĥ_i, t̂_i).

Generator The generator extracts S_i from the input chain set R_i. This function, denoted as g: R_i → S_i, is optimized with the objective L_p − L_c + λ_s L_s, where L_p and L_c are the losses of the predictor and the complement predictor, respectively, and L_s is a sparsity loss that constrains the number of selected chains toward a desired size d. Since the generator makes hard selection decisions for S_i, the losses L_p and L_c are generally not differentiable with respect to it. Hence, we use the policy-gradient (Williams, 1992) reinforcement learning algorithm to optimize the generator. To have bounded rewards, we use the predictors' accuracies instead of the loss values L_p and L_c. The generator is also modeled with an MLP of the same architecture as the predictor. Its output is an |R_i| × 2 matrix representing the probabilities that each chain is selected into S_i or S_i^c.

¹ Our method could use KG embeddings as inputs, as in previous works (Xiong et al., 2017; Das et al., 2017). This may weaken the interpretability of the reasoning model, since embeddings are smoothed representations, but it can potentially improve performance for cases with smaller training data. We leave this investigation to future work.
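The generator's policy-gradient update can be sketched as follows. This is only an illustration of the reward structure described above: the exact sparsity penalty, the coefficient `lam`, and the Bernoulli factorization of the selection policy are assumptions, not details given in the paper.

```python
import numpy as np

def generator_reward(pred_correct, comp_correct, n_selected, d, lam=0.1):
    """Reward for one sampled rule set: reward the predictor's accuracy,
    penalize the complement predictor's accuracy, and penalize deviation
    from the target rule size d (illustrative sparsity term)."""
    return (float(pred_correct) - float(comp_correct)
            - lam * abs(n_selected - d))

def reinforce_grad_logp(probs, mask):
    """Gradient of log pi(mask) w.r.t. the per-chain selection
    probabilities, with pi factorized as independent Bernoullis.
    REINFORCE scales this by the (baseline-adjusted) reward."""
    return mask / probs - (1.0 - mask) / (1.0 - probs)
```

For instance, a sampled mask that makes the predictor correct while leaving the complement predictor wrong, at exactly d chains, gets the maximal reward of 1.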
Rule selection during inference During inference, to select a fixed number d of chains for each instance, we take the top-d chains according to the selection probability predicted by the generator.

Empirical Evaluation
We evaluate our model with MCMH rules on two datasets, FB15K-237 (Toutanova et al., 2015) and NELL-995 (Xiong et al., 2017). We follow the existing setting of treating each target relationship as a separate task, training and evaluating relation-specific reasoning models, and we use the standard data splits (Xiong et al., 2017). Table 1 summarizes the statistics of the two datasets. For each target relation in the datasets, we extract the candidate chain set R following Section 2. Table 2 shows the number of extracted chains for each relation. We compare with previous works in the same setting, DeepPath (Xiong et al., 2017) and MINERVA (Das et al., 2017). Both are single-chain methods, i.e., they learn a reasoning model to find a single multi-hop chain for the inference.

Overall results Table 3 shows that our method with two chains and with five chains outperforms the single-chain baseline (d = 1 in our model) by clear margins on both datasets, demonstrating the advantage of our generalized rules over the single-chain rules studied in existing works. Moreover, when setting d = 5, our generalized rule learning method outperforms existing baselines on both datasets. For some relations (such as teamSports), our method performs worse than the previous works, likely because the relation has less training data while previous works use pre-trained KG embeddings to alleviate this problem.
Effects of the number of chains in one rule (d) The required number of chains differs across datasets: on NELL-995, using d = 2 achieves slightly better performance than d = 5, while on FB15K-237 there is a clear advantage for d = 5 relation chains. This observation shows that on FB15K-237 a relation generally requires more chains as evidence to reach a confident prediction. Moreover, since a conjunctive rule usually does not span 5 chains, for many FB15K-237 test tuples the evidence from few chains is not sufficient for making the decision; adding more chains therefore enhances the confidence and improves results significantly.
Choices of d The average number of candidate chains (i.e., the number of chains that connect a specific entity pair) is 13.8 for NELL-995 and 63.3 for FB15K-237, so selecting d = 5 chains covers a significant portion of the whole input space. Moreover, the MAP of our model using all candidate chains is 0.671 for FB15K-237 and 0.892 for NELL-995, which is close to that of d = 5 (the detailed performance for each relation is shown in Appendix B). From these observations, selecting d = 5 chains is sufficient for the KB completion task. Also, a logical conjunction between 2 chains or among 5 chains is more likely to be human-interpretable than a selection over a large number of chains. Figure 3 of Appendix B shows MAP versus the number of selected chains d for two representative relations, showing that the performance of our model converges after d = 5.

Effects of MLP versus linear predictors
Finally, we study the impact of the two different ways in which our generalized rules contribute to the improved results, namely modeling logical conjunctions and enhancing the confidence of multiple weak rules, as discussed in Section 1. To this end, we replace the MLP predictors with linear models. The rationale is that a linear model is less effective at capturing conjunctions among inputs, so improvements from linear models over the single-chain baseline are more likely due to enhanced confidence rather than to finding a conjunctive rule. We denote this model as Ours (-conj) and show its results in Table 3. We observe that the Ours (-conj) model outperforms the baseline but is generally not as good as the MLP model. Hence most relations mainly benefit from confidence enhancement. However, the results also highlight a few relations with a notable performance gap, e.g., athletePlaysForTeam, indicating that conjunctions of multiple chains are also important for KB completion tasks.

Conclusion
We propose a new approach of multi-chain multi-hop rule learning for knowledge graph completion tasks. First, we formalize the concept of multi-hop rule sets with multiple relation chains from knowledge graphs. Second, we propose a game-theoretic learning approach to efficiently select predictive relation chains for a query relation. Our formulation and learning method demonstrate advantages over existing single-chain approaches on two benchmark datasets. For future work, we plan to investigate rules beyond chains, as well as to integrate KG embeddings into our framework.