DIVINE: A Generative Adversarial Imitation Learning Framework for Knowledge Graph Reasoning

Knowledge graphs (KGs) often suffer from sparseness and incompleteness. Knowledge graph reasoning provides a feasible way to address such problems. Recent studies on knowledge graph reasoning have shown that reinforcement learning (RL) based methods can provide state-of-the-art performance. However, existing RL-based methods require numerous trials for path-finding and rely heavily on meticulous reward engineering to fit a specific dataset, which is inefficient and laborious to apply to fast-evolving KGs. To this end, in this paper, we present DIVINE, a novel plug-and-play framework based on generative adversarial imitation learning for enhancing existing RL-based methods. DIVINE guides the path-finding process, and learns reasoning policies and reward functions self-adaptively by imitating demonstrations automatically sampled from KGs. Experimental results on two benchmark datasets show that our framework improves the performance of existing RL-based methods while eliminating extra reward engineering.


Introduction
Knowledge graphs (Suchanek et al., 2007; Auer et al., 2007; Bollacker et al., 2008; Carlson et al., 2010; Vrandečić and Krötzsch, 2014), typically composed of massive relational triples, are useful resources for many downstream natural language processing applications such as information extraction and question answering. Although existing KGs have an extraordinarily large scale, they are still highly incomplete (Min et al., 2013), prompting extensive research efforts on the automated inference of missing information from observed evidence. In this paper, we focus on the problem of multi-hop reasoning in KGs, which learns explicit inference formulas from existing triples to complete missing ones. To tackle multi-hop reasoning, various path-based methods (Lao et al., 2011; Gardner et al., 2013, 2014; Guu et al., 2015; Neelakantan et al., 2015; Toutanova et al., 2016; Das et al., 2017) have been proposed, which leverage elaborately selected relational paths in KGs as reasoning evidence. However, such evidential paths are obtained by random walks, which inevitably introduces inferior or even noisy paths. To address this problem, RL-based methods such as DeepPath (Xiong et al., 2017) and MINERVA (Das et al., 2018) strive for more reliable evidential paths by policy-conditioned walking and achieve state-of-the-art performance. They formulate the path-finding problem as a Markov decision process where their policy-based agents continuously choose the most promising relation for state transition based on the current state and reasoning policy. Once a relational path is found, the reasoning policy is updated by a reward function according to the path quality. Finally, through such a trial-and-error process, the well-trained policy-based agent can be used to find evidential paths for predictions.
However, these RL-based methods still suffer from the following pain points. Firstly, they tend to require numerous trials from scratch to find a reliable evidential path, since the action space can be very large due to the complexity of KGs, which leads to poor convergence properties. Secondly, and most importantly, an efficient trial-and-error optimization in RL requires manually designing a reward function to fit the specific dataset. However, such reward engineering depends on meticulous artificial design with domain expertise, which can be significantly challenging in practice (Ng et al., 2000). In particular, these RL-based methods are extremely sensitive to their reward functions, where a slight variation may lead to a significant fluctuation in reasoning performance. Therefore, for different datasets, the reward functions in the RL-based methods need manual adjustment to achieve good performance, which is not only inefficient and laborious but also difficult to adapt to the rapid evolution of real-world KGs (Shi and Weninger, 2018).
In this paper, we present a novel plug-and-play framework based on generative adversarial imitation learning (GAIL) (Ho and Ermon, 2016) for enhancing existing RL-based methods, which is referred to as DIVINE for "Deep Inference via Imitating Non-human Experts". DIVINE trains a reasoner, consisting of a generator and a discriminator, from demonstrations by employing generative adversarial training, where the generator can be any of the policy-based agents in existing RL-based methods and the discriminator can be considered as a self-adaptive reward function. In this way, for different datasets, the reward functions can be automatically tuned to approximate the optimal performance, eliminating extra reward engineering and manual interventions. In particular, to enable the policy-based agent to find more diverse evidential paths for predictions, we propose a path-based GAIL method, which can learn the reasoning policy by imitating the path-level semantic features of the demonstrations. In addition, to acquire demonstrations without extra manual labor, we design an automated sampler for our framework to dynamically sample relational paths from KGs as the demonstrations according to the specific environment of each entity.
In summary, our contributions are threefold: • We present a plug-and-play framework based on GAIL to enhance existing RL-based reasoning in KGs by learning reasoning policies and reward functions through imitating demonstrations. To the best of our knowledge, we are the first to introduce GAIL into the field of knowledge graph reasoning.
• We propose a path-based GAIL method to encourage the diversity of evidential paths and design an automated sampler for our framework to sample demonstrations without extra manual labor.
• We conduct extensive experiments on two benchmark datasets. The experimental results illustrate that our framework improves the performance of the current state-of-the-art RL-based methods while eliminating extra reward engineering.

Related Work
Automated reasoning on KGs has been a longstanding task for natural language processing. In recent years, various embedding-based methods using tensor factorization (Nickel et al., 2011;Bordes et al., 2013;Riedel et al., 2013;Yang et al., 2014;Trouillon et al., 2017) or neural network models (Socher et al., 2013) have been developed, where they learn a projection which maps the triples into a continuous vector space for further tensor operations. Despite the impressive results they achieved, most of them lack the ability to capture chains of multi-hop reasoning patterns contained in paths.
To address the limitation of the embedding-based methods, a series of path-based methods have been proposed, which consider the selected relational paths as reasoning evidence. Lao et al. (2011) propose the Path-Ranking Algorithm (PRA), which uses random walks for path-finding. Gardner et al. (2013, 2014) propose a variation on PRA which computes feature similarity in the vector space. To combine with the embedding-based methods, several neural multi-hop models (Neelakantan et al., 2015; Guu et al., 2015; Toutanova et al., 2015, 2016; Das et al., 2017) have been proposed which perform hybrid reasoning. Nevertheless, the evidential paths they use are gathered by random walks, which might be inferior and noisy.
Recently, DeepPath (Xiong et al., 2017) and MINERVA (Das et al., 2018) were proposed to address the problem above by using reinforcement learning; they are committed to learning a policy which guides the agent to find better evidential paths that maximize the expected reward. Specifically, DeepPath parameterizes its policy with a fully-connected neural network and uses manual reward criteria, including global accuracy, efficiency and diversity, to evaluate path quality. In the training phase, DeepPath applies a linear combination of these criteria as the positive reward while using a hand-crafted constant as the negative penalty. As for MINERVA, it parameterizes its policy with a long short-term memory network (LSTM) and considers path validity as the only reward criterion. In the training phase, MINERVA uses a boolean value as a terminal signal to evaluate whether the current path reaches the target entity and manually tunes a moving average of the cumulative discounted reward on different datasets for variance reduction.

Knowledge Graph Reasoning
Given an incomplete knowledge graph G = {(h, r, t) | h ∈ E, t ∈ E, r ∈ R}, where E and R denote the entity set and the relation set, respectively, there are two main tasks in knowledge graph reasoning, namely link prediction and fact prediction. Link prediction involves inferring the tail entity t given the head entity h and the query relation r_q, while fact prediction seeks to predict whether an unknown fact (h, r_q, t) holds or not. Recently, RL-based reasoning has become a popular approach for knowledge graph reasoning, achieving state-of-the-art performance. In general, RL-based reasoning methods strive to find relational paths to tune their reasoning policies for predictions and formulate the path-finding problem as a Markov decision process (MDP). In such a process, the policy-based agent decides to take an action a ∈ A from the current state (i.e., the current entity and its context information) s ∈ S to reach the next one according to its reasoning policy π, where the action space is defined as all the relations in G. In particular, each relational chain in the relational paths can be considered as a reasoning chain.
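The MDP view above can be sketched in a few lines of Python: a toy KG maps each entity to its outgoing (relation, target) edges, and a policy picks one edge per hop. The entities, relations, and the `rollout`/`first_policy` names below are illustrative inventions, not part of any of the cited systems.

```python
# Toy KG: each entity maps to its outgoing (relation, target) edges.
# All entity and relation names here are made up for illustration.
KG = {
    "Alice": [("works_for", "AcmeCorp"), ("lives_in", "Paris")],
    "AcmeCorp": [("based_in", "Paris")],
    "Paris": [("capital_of", "France")],
    "France": [],
}

def rollout(kg, start, policy, max_hops=3):
    """Roll out one relational path from `start` following `policy`.

    `policy(state, actions)` returns the index of the chosen action;
    the state here is simply the current entity."""
    path, entity = [], start
    for _ in range(max_hops):
        actions = kg.get(entity, [])
        if not actions:  # dead end: no outgoing relations
            break
        rel, nxt = actions[policy(entity, actions)]
        path.append(rel)
        entity = nxt
    return path, entity

# The simplest deterministic policy: always take the first action.
first_policy = lambda state, actions: 0
```

A trained agent would replace `first_policy` with a learned distribution over the outgoing relations.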

Imitation Learning
Imitation learning focuses on learning policies from demonstrations and has achieved great success in sidestepping reward engineering. The classical approach is to find the optimal reward function by inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000) to explain expert behaviors. However, IRL requires solving RL inside a learning loop, which can be expensive to run in large environments. Therefore, generative adversarial imitation learning (GAIL) (Ho and Ermon, 2016) has recently been proposed, which learns the expert policy with a generative adversarial network (GAN) (Goodfellow et al., 2014), eliminating any intermediate IRL steps.
In GAIL, a generator G_θ is trained to generate trajectories matching the distribution of expert trajectories (i.e., demonstrations). Each trajectory τ is represented as a state-action sequence [(s_t, a_t)]_{t=0}^∞ (s_t ∈ S, a_t ∈ A). In addition, a discriminator D_ω is learned to distinguish between the generated policy π_θ and the expert policy π_E. For each training epoch, the discriminator is updated first with the gradient

∇_ω = Ê_{τ_θ}[∇_ω log D_ω(s, a)] + Ê_{τ_E}[∇_ω log(1 − D_ω(s, a))],  (1)

where τ_E denotes the expert trajectories generated by π_E, and the trajectory expectation is calculated in the γ-discounted infinite horizon as

E_τ[f(s, a)] = E[Σ_{t=0}^∞ γ^t f(s_t, a_t)].

The discriminator here can be interpreted as a local reward function providing feedback for the policy learning process. Then, the generator is updated with the cost function log(D(s, a)) using trust region policy optimization (TRPO) (Schulman et al., 2015). After sufficient adversarial training, GAIL finds an optimal policy π̂ that rationalizes the expert policy π_E.
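As a minimal numerical sketch of the discriminator update described above (not the actual GAIL implementation), consider a linear discriminator D(x) = sigmoid(w·x) over toy state-action feature vectors, pushed toward 1 on generated pairs and toward 0 on expert pairs, with log D as the generator's surrogate cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_step(w, gen, expert, lr=0.5):
    """One discriminator ascent step for a linear D(x) = sigmoid(w.x).
    Following the GAIL convention, D is pushed toward 1 on generated
    state-action features (`gen`) and toward 0 on expert ones."""
    grad = (gen * (1 - sigmoid(gen @ w))[:, None]).mean(axis=0) \
         - (expert * sigmoid(expert @ w)[:, None]).mean(axis=0)
    return w + lr * grad

def generator_cost(w, gen):
    """Surrogate cost log D(s, a) that the generator seeks to minimize."""
    return np.log(sigmoid(gen @ w)).mean()
```

In the full method, `d_step` would be a gradient step on the discriminator network and the generator would be updated with TRPO rather than a closed-form cost.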

Framework Overview
As shown in Figure 1, our framework DIVINE consists of two modules, namely a generative adversarial reasoner and a demonstration sampler. In particular, the reasoner is composed of a generator and a discriminator. The generator can be any of the policy-based agents in existing RL-based methods, and the discriminator can be interpreted as a self-adaptive reward function. For each query relation, the sampler and the generator are employed to automatically extract demonstrations and generate relational paths from the given KG, respectively. The discriminator is then used to evaluate the semantic similarity between the generated paths and the demonstrations to update the generator. After sufficient rounds of alternating adversarial training between the generator and the discriminator, the well-trained policy-based agent (i.e., the generator) can be used to find evidential paths matching the distribution of the demonstrations and make predictions by synthesizing these evidential paths.

Generative Adversarial Reasoner
In our framework, the reasoner is learned from the demonstrations through generative adversarial training. A straightforward approach is to directly apply GAIL to train the reasoner; in particular, the policy-based agent in the reasoner is trained to find evidential paths by imitating the state-action pairs in each expert trajectory (i.e., demonstration). However, such an approach may lead to poor performance. The main reason is that the agent will tend to choose the same actions as in the expert trajectories under certain states, while ignoring many valuable evidential paths which are semantically similar to the expert trajectories but contain different reasoning chains. Therefore, to encourage the agent to find more diverse evidential paths, it is desirable to train the agent by imitating each trajectory as a whole instead of each of its state-action pairs. In addition, in the scenario of knowledge graph reasoning, since the reasoning chains consist of only relations, the demonstrations do not necessarily contain the state information. In other words, the demonstrations can be composed of only relational paths.
Based on the above analysis, we propose a path-based GAIL method, where the reasoning policy is learned by imitating the path-level semantic features of the demonstrations, which are composed of only relational paths.
In what follows, we first describe the two components of the reasoner, i.e., the generator and the discriminator. Then, we show how to extract the path-level semantic features.

Generator
The generator can be any of the policy-based agents in existing RL-based methods. We strive to enable the generator to find more diverse evidential paths matching the distribution of the demonstrations in the semantic space.

Discriminator
To better semantically distinguish between generated paths and demonstrations, we choose convolutional neural networks (CNNs) to construct our discriminator D, as CNNs have shown high performance in semantic feature extraction from natural language (Kim, 2014).

Semantic Feature Extraction
For each positive entity pair, we respectively pack the current generated paths and the corresponding demonstrations in the same package form. For each package P = {x_1, x_2, ..., x_N} containing N relational paths, we encode the package into a real-valued matrix as

p = x_1 ⊕ x_2 ⊕ ... ⊕ x_N,

where x_n ∈ R^k is the k-dimensional path embedding and ⊕ denotes the concatenation operator for the package representation p ∈ R^{N×k}. In particular, given a relational path x = {r_1, r_2, ..., r_t, ...}, the path embedding x is encoded as the composition of its relation embeddings

x = Σ_t r_t,

where each relation r_t is mapped into a real-valued embedding r_t ∈ R^k pre-trained by TransE (Bordes et al., 2013). After packing, we feed the package representation p into our discriminator D to parameterize its semantic features D(p). Specifically, a convolutional layer activated by ReLU nonlinearity is first used to extract local features by sliding a kernel ω ∈ R^{h×l} over p to produce a feature map

c_i = ReLU(ω · p_{i:i+h−1} + b_c),

where b_c denotes the bias term. Then, a fully-connected hidden layer and an output layer are used for further semantic feature extraction:

D(p) = σ(W_o ReLU(W_h c)),

where the corresponding biases are not shown above for brevity, the output layer is normalized by a sigmoid function σ, and the other layers are activated by ReLU nonlinearity.
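A minimal sketch of the packing step, assuming an additive composition of relation embeddings for each path (the `path_embedding`/`pack` helpers and the toy embeddings are ours for illustration, not the paper's code):

```python
import numpy as np

def path_embedding(path, rel_emb):
    """Path embedding: here taken as the sum of the path's relation
    embeddings (each pre-trained, e.g. by TransE)."""
    return np.sum([rel_emb[r] for r in path], axis=0)

def pack(paths, rel_emb):
    """Package representation p in R^{N x k}: the N path embeddings
    stacked row-wise, ready to be fed to the CNN discriminator."""
    return np.stack([path_embedding(x, rel_emb) for x in paths])
```

The resulting N × k matrix is what the convolutional kernel ω slides over.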

Demonstration Sampler
For imitation learning, the first prerequisite is high-quality demonstrations. However, due to the large scale and complexity of KGs, manually constructing a large number of reasoning demonstrations requires considerable time and expert effort. Therefore, we design an automated sampler to sample reliable reasoning demonstrations from KGs without supervision or extra manual labor.

Static Demo Sampling
For each query relation, we use all positive entity pairs to sample demonstration candidates from the given KG. Specifically, for each positive entity pair, we use bi-directional breadth-first search to explore the shortest paths between the two entities. In particular, since shorter paths tend to characterize more direct correlations between two entities, we prefer to use them for initialization to ensure the quality of the demonstration candidates. As for longer paths, despite their potential utility, they are more likely to contain worthless or even noisy inference steps, so we learn them only in the training phase. In doing so, we obtain a demonstration set Ω_E which contains all the candidates we sampled. Finally, to accommodate the fixed input dimension of the discriminator D, we simply select the subset P_e ⊆ Ω_E with the top-N occurrence frequencies, where N is normally much smaller than |Ω_E|.
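The static sampling step can be sketched as follows. For brevity this uses a plain (rather than bi-directional) breadth-first search, and all function names and the graph format are illustrative assumptions:

```python
from collections import Counter, deque

def shortest_relation_path(kg, head, tail):
    """BFS for one shortest relational path head -> tail.
    `kg` maps an entity to its outgoing (relation, target) edges."""
    queue = deque([(head, [])])
    seen = {head}
    while queue:
        entity, rels = queue.popleft()
        if entity == tail:
            return tuple(rels)
        for rel, nxt in kg.get(entity, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, rels + [rel]))
    return None  # no path between the pair

def top_n_demos(kg, pairs, n):
    """Sample one shortest path per positive pair, then keep the
    n most frequent paths as the demonstration subset P_e."""
    paths = [shortest_relation_path(kg, h, t) for h, t in pairs]
    counts = Counter(p for p in paths if p is not None)
    return [p for p, _ in counts.most_common(n)]
```

Frequency-based selection is what keeps |P_e| = N fixed for the discriminator's input dimension.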

Dynamic Demo Sampling
Despite the simplicity of the static demo sampling method, the demonstrations it obtains are fixed and ignore the specific environment of each entity in the given KG. Therefore, we propose an improved method to dynamically sample demonstrations by taking the topological correlations of entities into consideration. Given a positive entity pair ⟨e_head, e_tail⟩, we introduce a relational set R_h which contains all relations directly connected to e_head. For each reasoning attempt, R_h can be considered as the region of interest (ROI) of the agent to start reasoning, where ROI-related paths tend to be more relevant to the current entity pair. Thus, we refine the demonstration set by filtering it down to the demonstrations which begin from R_h:

Ω′_E = {x ∈ Ω_E | r_1(x) ∈ R_h},

where Ω_E is generated by the static demo sampling method and r_1(x) denotes the first relation in the relational path x = {r_1, r_2, ..., r_t, ...}.
In most cases, we can obtain enough demonstrations in Ω ′ E to select a subset P e ⊆ Ω ′ E in the same way as the static demo sampling method. However, due to the sparsity of data in KG, we may get insufficient demonstrations on long-tail entities. To solve this problem, we perform semantic matching to explore more demonstrations from the remaining candidates C E = Ω E \ Ω ′ E . Since the reasoning policy is updated based on the semantic similarities between the generated paths and demonstrations, candidates which are semantically similar to the current demonstrations are also instructive for the imitation process.
Inspired by the neighborhood attention for one-shot imitation learning (Duan et al., 2017), we use each demonstration in Ω′_E to query other candidates in correlation to itself. We adopt the dot product to measure the semantic matching similarity between two path embeddings:

α_i = Σ_{x_j ∈ Ω′_E} x̂_i · x_j,

where α_i represents the sum of matching scores between the current candidate x̂_i and the existing demonstrations in Ω′_E. Finally, we iteratively select the candidate with the highest α to pad the refined demonstration set Ω′_E until it accommodates the required input dimension N.
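A compact sketch of the dynamic sampling procedure, combining the ROI filter with dot-product padding (the `dynamic_demos` helper and its toy inputs are ours, not the paper's code):

```python
import numpy as np

def dynamic_demos(candidates, r_head, embed, n):
    """Refine the candidate set: keep paths whose first relation lies
    in R_h (`r_head`), then pad with the remaining candidates most
    semantically similar (summed dot product of path embeddings) to
    the kept ones. `candidates` is a list of relation tuples and
    `embed(path)` returns a path embedding vector."""
    kept = [x for x in candidates if x[0] in r_head]
    rest = [x for x in candidates if x[0] not in r_head]
    while len(kept) < n and rest:
        # alpha_i: summed matching score against every kept demo
        scores = [sum(float(np.dot(embed(c), embed(d))) for d in kept)
                  for c in rest]
        kept.append(rest.pop(int(np.argmax(scores))))
    return kept[:n]
```

The padding loop only runs for long-tail entities whose ROI filter leaves fewer than N demonstrations.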

Training
In the training phase, all the positive entity pairs are used to generate the demonstration candidates Ω_E for the imitation learning process. Specifically, for each positive entity pair, the demonstration sampler first chooses the corresponding demonstrations, while the generator is run to generate some relational paths. Then, the demonstrations are packed into a package P_e and the generated paths are packed into different packages {P_g | P_g ⊆ Ω_G} according to their validity, i.e., whether the agent can reach the target entity along the current path, where Ω_G is the collection of all generated paths.
For each package pair ⟨P_g, P_e⟩, we train the discriminator D by minimizing its loss and expect it to become expert at distinguishing between P_e and P_g. In addition, to make the adversarial training process more stable and effective, we adopt the loss function proposed in WGAN-GP (Gulrajani et al., 2017) to update the discriminator:

L_C = E[D(p_g)] − E[D(p_e)],
L_P = E_p̂[(‖∇_p̂ D(p̂)‖_2 − 1)²],
L_D = L_C + λ L_P,

where L_C, L_P and L_D respectively denote the original critic loss, the gradient penalty and the loss of the discriminator, λ is the gradient penalty coefficient, and p̂ is sampled uniformly along straight lines between p_g and p_e. According to the feedback of the discriminator, we calculate the reward R_G as

R_G = δ_g · [E[D(p_g)] − E[D(p_n)]]_+,

where [·]_+ keeps only the positive part, p_n denotes a noise embedding composed of random noise with a continuous uniform distribution, δ_g is a characteristic function which characterizes the validity of package P_g, and Ω+_G is the collection of all valid generated paths. We only give positive rewards to the valid paths in Ω+_G that have higher expectations than the noise embedding p_n, which filters out paths of inferior quality and improves the convergence efficiency of the training process. Once the reward is obtained, we update the generator G by maximizing the expected cumulative reward with the Monte-Carlo policy gradient (i.e., REINFORCE) (Williams, 1992).
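As a toy illustration of the WGAN-GP discriminator objective (not the CNN discriminator itself), consider a linear critic D(p) = w·p, for which the input gradient is simply w everywhere, so the gradient penalty can be computed in closed form:

```python
import numpy as np

def wgan_gp_loss(w, p_g, p_e, lam=5.0, rng=None):
    """WGAN-GP discriminator loss for a linear critic D(p) = w.p
    (a toy stand-in for the CNN discriminator). `p_g` / `p_e` are
    batches of generated / demonstration package embeddings."""
    rng = rng or np.random.default_rng(0)
    critic = (p_g @ w).mean() - (p_e @ w).mean()          # L_C
    eps = rng.uniform(size=(p_g.shape[0], 1))
    p_hat = eps * p_g + (1 - eps) * p_e                   # interpolates
    # For a linear critic, grad_p D(p) = w at every interpolate.
    grad_norm = np.linalg.norm(np.broadcast_to(w, p_hat.shape), axis=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()             # L_P
    return critic + lam * penalty                         # L_D
```

With a neural critic, the gradient at each interpolate would instead be obtained by automatic differentiation.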
We use mini-batch stochastic gradient descent (SGD) to optimize the loss function of the discriminator, while the generator is updated with the Adam algorithm (Kingma and Ba, 2014).

Experiments

The experiments are conducted on two benchmark datasets: NELL-995 (Xiong et al., 2017) and FB15K-237 (Toutanova et al., 2015). The details of the two datasets are described in Table 1. In particular, NELL-995, which is known to be a simple dataset for reasoning tasks, is generated from the 995th iteration of the NELL system (Carlson et al., 2010) by selecting the triples with the top-200 most frequently occurring relations. Compared to NELL-995, FB15K-237 is more challenging and closer to real-world scenarios; its facts are created from FB15K (Bordes et al., 2013) with redundant relations removed. For each triple (h, r, t), both datasets contain the inverse triple (t, r⁻¹, h) so that the agent can step backward in KGs, which makes it possible to recover from a potentially wrong decision taken before. For each reasoning task with a query relation r_q, all the triples with r_q or r_q⁻¹ are removed from the KG and split into train and test samples.

Datasets and Evaluation Metrics
Similar to recent works (Das et al., 2018; Xiong et al., 2017), we use mean average precision (MAP), mean reciprocal rank (MRR) and Hits@k to evaluate the reasoning performance, where Hits@k is the fraction of positive instances ranked in the top k positions.
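These metrics are standard; a minimal reference implementation (with ranks assumed to be 1-indexed) might look like:

```python
def hits_at_k(ranks, k):
    """Fraction of positive instances ranked in the top k positions."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank over the positives' 1-indexed ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def average_precision(labels):
    """AP for one query: `labels` is the ranked list of 0/1 relevance.
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, rel in enumerate(labels, 1):
        if rel:
            hits += 1
            total += hits / i
    return total / max(hits, 1)
```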

Baselines and Implementation Details
In our experiments, we consider two state-of-the-art RL-based methods as baselines: DeepPath (Xiong et al., 2017) and MINERVA (Das et al., 2018). DeepPath feeds the gathered evidential paths to PRA (Lao et al., 2011) for both the link prediction and fact prediction tasks, while MINERVA directly applies the well-trained agent to the link prediction task for question answering. For DeepPath, we use the code released by Xiong et al. (2017). For MINERVA, we use the code released by Das et al. (2018). The experiment settings for the baselines follow the suggestions in the original papers.
In the implementation of our framework, we set the path number N to 5 for each path package P, while the path dimension k is set to 200 which is the same as the relation dimension in baselines. For the discriminator, we set the convolution kernel size to 3 × 5, the hidden layer size to 1024, and the output layer size to the path dimension k, while the gradient penalty coefficient λ is set to 5 and L2 regularization is also used to avoid over-fitting.
Table 2: Overall results on NELL-995 and FB15K-237. "†" denotes the results with settings for question answering and "‡" denotes the results of directly ranking all the positive and negative triples given a query relation.

During testing, we also rank the answer triples against the negative triples used in DeepPath and MINERVA. In particular, there are approximately 10 corresponding negative triples for each positive one. Each negative triple is generated by replacing the answer entity t with a faked one t′ given a positive triple (h, r, t).
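The negative sampling scheme can be sketched as follows; the exclusion rules and the `corrupt_tail` helper name are illustrative assumptions rather than the exact procedure used by DeepPath or MINERVA:

```python
import random

def corrupt_tail(triple, entities, n_neg=10, seed=0):
    """Generate up to `n_neg` negatives for (h, r, t) by replacing the
    answer entity t with a faked entity t' drawn from `entities`.
    Excluding h as well as the true t is a simplifying assumption."""
    h, r, t = triple
    rng = random.Random(seed)
    pool = [e for e in entities if e != t and e != h]
    rng.shuffle(pool)
    return [(h, r, e) for e in pool[:n_neg]]
```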

Results
The main results on the two datasets are shown in Table 2. We use "Div(*)" to denote the RL-based method "*" which adopts our framework DIVINE. For a fair comparison, we follow MINERVA to report the Hits@k and MRR scores (denoted with " †") to evaluate the link prediction performance for question answering, which ranks entities according to the probability that the agent can reach the entity along evidential paths. Moreover, we also follow DeepPath to report the MAP scores (denoted with " ‡") on the fact prediction task, which directly ranks all the positive and negative triples for a given query relation.
From the results shown in Table 2, we can observe that our framework produces consistent improvements, to varying degrees, for the two RL-based methods on both the link prediction and fact prediction tasks. On the one hand, for existing RL-based methods, the results on FB15K-237 are generally lower than those on NELL-995, since FB15K-237 is more complex and arguably makes it more difficult to design proper reward functions manually. Our framework relieves this problem to some extent by dynamically learning superior reward functions, which is why we make greater improvements on the challenging FB15K-237. On the other hand, across the datasets, the improvements our framework brings to DeepPath vary considerably while those for MINERVA do not. This is because MINERVA manually adjusts its hyper-parameters accordingly when calculating the cumulative discounted reward, while DeepPath keeps its reward function unchanged. This validates the necessity for existing RL-based methods to adjust their reward functions to fit different datasets. Enhanced by our framework, these RL-based methods no longer require additional manual adjustments for different datasets, which reveals great robustness.
Similar to existing RL-based methods, we also report the decomposed results of link prediction and use MAP to evaluate the performance for each query relation on NELL-995 in Table 3. From the results, we can observe that the results on different relations are of high variance and that the enhanced RL-based methods achieve better or comparable performance for all query relations.

To investigate the effectiveness of the path-based GAIL method, we train the policy-based agent in DeepPath on NELL-995 with our path-based GAIL method and the original GAIL method, respectively. In particular, when training the agent with the original GAIL method, the demonstrations are composed of state-action trajectories. For each state-action pair (s_t, a_t), the state representation s_t is calculated by (e_t, e_tail − e_t), where e_t and e_tail denote the embeddings of the current entity and the tail entity, respectively. In Figure 2, we show the statistics of the evidential path set P_new, i.e., the paths found by the agent that differ from the demonstrations. In Table 4, we compare the average path number of P_new and the reasoning performance on the two prediction tasks. From Figure 2 and Table 4, we can observe that our path-based GAIL method obtains more evidential paths for most query relations and achieves better performance on both link and fact prediction, which validates the effectiveness of our path-based GAIL method and the rationality of encouraging the agent to find more diverse evidential paths.

Ablation Studies
We conduct ablation studies by embedding DeepPath into our framework to quantify the role of each component. Specifically, we re-train our framework by ablating certain components: • W/O Semantic Matching, where no semantic matching is performed on long-tail entities. Instead, we directly extract some paths from the remaining demonstration candidates C_E according to their occurrence frequency.
• W/O Dynamic Sampling, where no dynamic demo sampling is performed to incorporate the local environment of entities. In other words, we only adopt the static demo sampling method to obtain demonstrations.

We use MAP to evaluate the link prediction performance on both NELL-995 and FB15K-237 in Table 5. From the results, we can observe that: (1) based on imitation learning, our framework can effectively improve the reasoning performance, even if we use the static demo sampling method to obtain demonstrations; (2) high-quality demonstrations are crucial for imitation learning, which indicates that both topology filtering and semantic matching play important roles in the demonstration sampler of our framework.

Conclusion
In this paper, we proposed DIVINE, a novel plug-and-play framework for knowledge graph reasoning based on generative adversarial imitation learning, which enables existing RL-based methods to learn reasoning policies and reward functions self-adaptively so as to adapt to the rapid evolution of real-world KGs. The experimental results show that our framework improves the performance of existing RL-based methods while eliminating extra reward engineering.