Dynamic Anticipation and Completion for Multi-Hop Reasoning over Sparse Knowledge Graph

Multi-hop reasoning has been widely studied in recent years as an effective and interpretable approach to knowledge graph (KG) completion. Most previous reasoning methods are designed for dense KGs with abundant paths between entities, and do not work well on sparse KGs that offer only few paths for reasoning. On the one hand, sparse KGs contain less information, which makes it difficult for the model to choose correct paths. On the other hand, the lack of evidential paths to target entities also makes the reasoning process difficult. To solve these problems, we propose a multi-hop reasoning model over sparse KGs named DacKGR, which applies two novel strategies: (1) The dynamic anticipation strategy injects the latent predictions of embedding-based models so that our model searches more promising paths over sparse KGs. (2) Based on the anticipation information, the dynamic completion strategy adds edges as additional actions during the path search, which further alleviates the sparsity problem of KGs. Experimental results on five datasets sampled from Freebase, NELL and Wikidata show that our method outperforms state-of-the-art baselines. Our code and datasets are available at https://github.com/THU-KEG/DacKGR


Introduction
Knowledge graphs (KGs) represent world knowledge in a structured way, and have been proven helpful for many downstream NLP tasks such as query answering (Guu et al., 2015), dialogue generation (He et al., 2017) and machine reading comprehension (Yang et al., 2019). Despite their wide applications, many KGs still suffer from serious incompleteness (Bordes et al., 2013), which limits their further development and adaptation to related downstream tasks.

[Figure 1: An illustration of the multi-hop reasoning task over a sparse KG. The missing relations (black dashed arrows) between entities can be inferred from existing triples (solid black arrows) through reasoning paths (bold arrows). However, some relations in the reasoning path are missing (red dashed arrows) in a sparse KG, which makes multi-hop reasoning difficult.]
To alleviate this issue, some embedding-based models (Bordes et al., 2013; Dettmers et al., 2018) have been proposed, most of which embed entities and relations into a vector space and perform link prediction to complete KGs. These models predict knowledge efficiently but lack interpretability. To address this problem, Das et al. (2018) and Lin et al. (2018) propose multi-hop reasoning models, which use the REINFORCE algorithm (Williams, 1992) to train an agent to search over KGs. These models give not only the predicted result but also an interpretable path that indicates the reasoning process. As shown in the upper part of Figure 1, for a triple query (Olivia Langdon, child, ?), multi-hop reasoning models can predict the tail entity Susy Clemens through a reasoning path (bold arrows).
Although existing multi-hop reasoning models have achieved good results, they still suffer from two problems on sparse KGs: (1) Insufficient information. Compared with normal KGs, sparse KGs contain less information, which makes it difficult for the agent to choose the correct search direction.
(2) Missing paths. In sparse KGs, some entity pairs do not have enough paths between them to serve as reasoning evidence, which makes it difficult for the agent to carry out the reasoning process. As shown in the lower part of Figure 1, there is no evidential path between Mark Twain and English since the relation publish area is missing. From Table 1 we can see that some sampled KG datasets are in fact sparse. Besides, some domain-specific KGs (e.g., WD-singer) do not contain abundant knowledge and also face the sparsity problem.
As the performance of most existing multi-hop reasoning methods drops significantly on sparse KGs, some preliminary efforts, such as CPL (Fu et al., 2019), explore introducing additional text information to ease the sparsity of KGs. Although these explorations have achieved promising results, they are limited to those specific KGs whose entities have additional text information. Thus, reasoning over sparse KGs remains an important but not fully resolved problem that requires a more general approach.
In this paper, we propose a multi-hop reasoning model named DacKGR, along with two dynamic strategies that address the two problems mentioned above. Dynamic anticipation makes use of the limited information in a sparse KG to anticipate potential targets before the reasoning process. Compared with multi-hop reasoning models, embedding-based models are robust to sparse KGs, because they depend on every single triple rather than on paths in the KG. Our anticipation strategy therefore injects the predictions of a pre-trained embedding-based model as anticipation information into the states of reinforcement learning. This information can keep the agent from searching paths aimlessly.
Dynamic completion temporarily expands part of the KG to enrich the options for path expansion during the reasoning process. In sparse KGs, many entities have only a few relations, which limits the choice space of the agent. Our completion strategy thus dynamically adds some additional relations (e.g., red dashed arrows in Figure 1) according to the state information of the current entity while searching reasoning paths. For the current entity and an additional relation r, we use a pre-trained embedding-based model to predict the tail entity e. The additional relation r and the predicted tail entity e then form a potential action (r, e), which is added to the action space of the current entity for path expansion.
We conduct experiments on five datasets sampled from Freebase, NELL and Wikidata. The results show that our model DacKGR outperforms previous multi-hop reasoning models, which verifies the effectiveness of our model.

Problem Formulation
In this section, we first introduce some symbols and concepts related to normal multi-hop reasoning, and then formally define the task of multi-hop reasoning over sparse KGs.
A knowledge graph can be formulated as KG = {E, R, T}, where E and R denote the entity set and relation set respectively. T = {(e_s, r_q, e_o)} ⊆ E × R × E is the triple set, where e_s and e_o are the head and tail entities respectively, and r_q is the relation between them. For every KG, we can use the average out-degree D_avg^out of its entities (nodes) to define its sparsity. Specifically, if D_avg^out of a KG is larger than a threshold, we say it is a dense or normal KG; otherwise, it is a sparse KG.
Given a graph KG and a triple query (e_s, r_q, ?), where e_s is the source entity and r_q is the query relation, multi-hop reasoning for knowledge graphs aims to predict the tail entity e_o for (e_s, r_q, ?). Different from previous KG embedding tasks, multi-hop reasoning also gives a supporting path {(e_s, r_1, e_1), (e_1, r_2, e_2), . . . , (e_{n−1}, r_n, e_o)} over the KG as evidence. As mentioned above, we mainly focus on the multi-hop reasoning task over sparse KGs in this paper.
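To make the sparsity criterion concrete, the following minimal sketch computes the average out-degree of a triple set and applies the threshold test; the threshold value here is purely illustrative, since the paper does not fix one:

```python
from collections import defaultdict

def avg_out_degree(triples):
    """Average out-degree over all entities appearing as a head or tail."""
    out = defaultdict(int)
    entities = set()
    for h, r, t in triples:
        out[h] += 1
        entities.update((h, t))
    return sum(out.values()) / len(entities)

def is_sparse(triples, threshold=5.0):
    """Sparse if the average out-degree falls below the (illustrative) threshold."""
    return avg_out_degree(triples) < threshold

# Tiny toy KG: 3 triples over 3 entities -> average out-degree 1.0
triples = [("MarkTwain", "spouse", "OliviaLangdon"),
           ("OliviaLangdon", "child", "SusyClemens"),
           ("MarkTwain", "child", "SusyClemens")]
```
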

Methodology
In this section, we first introduce the whole reinforcement learning framework for multi-hop reasoning, and then detail our two strategies designed for sparse KGs, i.e., dynamic anticipation and dynamic completion. The former strategy introduces guidance information from embedding-based models to help multi-hop models find the correct direction on sparse KGs. Based on this strategy, the dynamic completion strategy introduces some additional actions during the reasoning process to increase the number of paths, which can alleviate the sparsity of KGs. Following Lin et al. (2018), the overall framework of DacKGR is illustrated in Figure 2.

[Figure 2: The vector of e_p is the prediction information introduced in Section 3.3. We use the current state to dynamically select some relations, and use the pre-trained embedding-based model to perform link prediction to obtain an additional action space. The original action space is merged with the additional action space to form a new action space.]

Reinforcement Learning Framework
In recent years, multi-hop reasoning over KGs has been formulated as a Markov Decision Process (MDP) over the KG (Das et al., 2018): given a triple query (e_s, r_q, ?), the agent starts from the head entity e_s, repeatedly selects the edge (relation) of the current entity with maximum probability as its direction, and jumps to the next entity until the maximum number of hops T is reached. Following previous work (Lin et al., 2018), the MDP consists of the following components:

State. In the process of multi-hop reasoning, which edge (relation) the agent chooses depends not only on the query relation r_q and the current entity e_t, but also on the previous search history. Therefore, the state at the t-th hop can be defined as s_t = (r_q, e_t, h_t), where h_t is the representation of the historical path. Specifically, we use an LSTM to encode the historical path information; h_t is the output of the LSTM at the t-th step.
Action. For a state s_t = (r_q, e_t, h_t), if there is a triple (e_t, r_n, e_n) in the KG, then (r_n, e_n) is an action of the state s_t. All actions of the state s_t make up its action space A_t = {(r, e) | (e_t, r, e) ∈ T}. Besides, for every state s_t, we also add an additional action (r_LOOP, e_t), where LOOP is a manually added self-loop relation. It allows the agent to stay at the current entity, which is similar to a "STOP" action.
Transition. If the current state is s_t = (r_q, e_t, h_t) and the agent chooses (r_n, e_n) ∈ A_t as the next action, the current state s_t is converted to s_{t+1} = (r_q, e_n, h_{t+1}). In this paper, we limit the maximum number of hops to T, so the transition ends at the state s_T = (r_q, e_T, h_T).
Reward. For a given query (e_s, r_q, ?) with the gold tail entity e_o, if the agent finally stops at the correct entity, i.e., e_T = e_o, the reward is 1; otherwise, the reward is a value between 0 and 1 given by the function f(e_s, r_q, e_T), where f is given by a pre-trained knowledge graph embedding (KGE) model that evaluates the correctness of the triple (e_s, r_q, e_T).
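The action-space and reward definitions above can be sketched as follows; the KGE score is passed in as an opaque callable, since its exact form depends on the pre-trained model:

```python
def action_space(kg_triples, e_t):
    """All outgoing edges (r, e) of e_t, plus the self-loop (LOOP, e_t)."""
    actions = [(r, e) for (h, r, e) in kg_triples if h == e_t]
    actions.append(("LOOP", e_t))  # lets the agent stay put, like a "STOP" action
    return actions

def reward(e_T, e_o, kge_score):
    """1 for an exact hit; otherwise a soft score in [0, 1] from a KGE model.

    kge_score(e_T) stands in for f(e_s, r_q, e_T) with (e_s, r_q) fixed.
    """
    return 1.0 if e_T == e_o else kge_score(e_T)
```
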

Policy Network
For the above MDP, we need a policy network to guide the agent to choose the correct action in different states.
We represent entities and relations in the KG as vectors in a semantic space, so that the action (r, e) at step t can be represented as a_t = [r; e], where r and e are the vectors of r and e respectively. As mentioned in Section 3.1, we use an LSTM to store the historical path information. Specifically, the representation of each action selected by the agent is fed into the LSTM to update the historical path information,

h_t = LSTM(h_{t−1}, a_{t−1}). (1)

The representation of the t-th state s_t = (r_q, e_t, h_t) can be formulated as

s_t = [r_q; e_t; h_t]. (2)

After that, we represent the action space by stacking all actions in A_t as A_t ∈ R^{|A_t|×2d}, where d is the dimension of the entity and relation vectors. The policy network is defined as

π_θ(a_t|s_t) = σ(A_t × W_2 ReLU(W_1 s_t)), (3)

where σ is the softmax operator, W_1 and W_2 are two linear layers, and π_θ(a_t|s_t) is the probability distribution over all actions in A_t.
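The policy forward pass can be sketched in NumPy. The weight shapes and the stacking order of the state vector are illustrative assumptions (following the MultiHopKG-style parameterization); the real model also trains an LSTM to produce h_t:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    z = np.exp(x - x.max())
    return z / z.sum()

def policy(A_t, s_t, W1, W2):
    """pi_theta(a_t | s_t) = softmax(A_t @ W2 @ relu(W1 @ s_t)).

    A_t : (|A_t|, 2d) stacked action vectors [r; e]
    s_t : concatenated state vector, e.g. [r_q; e_t; h_t]
    """
    hidden = np.maximum(0.0, W1 @ s_t)   # ReLU feature of the state
    scores = A_t @ (W2 @ hidden)         # one logit per candidate action
    return softmax(scores)
```
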

Dynamic Anticipation
As reported in previous work (Das et al., 2018; Lin et al., 2018), although KGE models are not interpretable, they achieve better results than multi-hop reasoning models on most KGs. This phenomenon is more obvious on sparse KGs (refer to the experimental results in Table 3), since KGE models do not rely on the connectivity of the KG and are thus more robust. Inspired by this phenomenon, we propose a strategy named dynamic anticipation, which introduces the prediction information of embedding-based models into multi-hop reasoning models to guide the learning. Specifically, for a triple query (e_s, r_q, ?), we use a pre-trained KGE model to obtain the probability vector of all entities being the tail entity. Formally, this probability vector can be formulated as p ∈ R^{|E|}, where the value of the i-th dimension of p represents the probability that e_i is the correct tail entity.
For the dynamic anticipation strategy, we change the state representation in Equation 2 to

s_t = [r_q; e_t; h_t; e_p], (4)

where e_p is the prediction information given by the KGE model. In this paper, we use the following three strategies to generate e_p: (1) Sample strategy. We sample an entity based on the probability distribution p and use its vector as e_p.
(2) Top-one strategy. We select the entity with the highest probability in p.
(3) Average strategy. We take the weighted average of the vectors of all entities according to the probability distribution p as the prediction information e_p. In experiments, we choose the strategy that performs best on the validation set.
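The three ways of turning the KGE probability vector p into e_p can be sketched as follows, given an entity embedding matrix E (one row per entity); the function name and interface are illustrative:

```python
import numpy as np

def anticipation(p, E, strategy, rng=None):
    """Build e_p from a KGE probability vector p (|E|,) and embeddings E (|E|, d).

    strategy: "sample" | "top-one" | "average"
    """
    if strategy == "sample":          # draw one entity index ~ p
        rng = rng or np.random.default_rng()
        return E[rng.choice(len(p), p=p)]
    if strategy == "top-one":         # embedding of the most probable entity
        return E[int(np.argmax(p))]
    if strategy == "average":         # probability-weighted mixture of embeddings
        return p @ E
    raise ValueError(f"unknown strategy: {strategy}")
```

With one-hot embeddings, top-one recovers the argmax row and average recovers p itself, which makes the information loss of top-one easy to see.
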

Dynamic Completion
In sparse KGs, there are often insufficient evidential paths between head and tail entities, so the performance of multi-hop reasoning models drops significantly. To solve this problem, we propose a strategy named dynamic completion, which dynamically augments the action space of each entity during the reasoning process. Specifically, for the current state s_t, the candidate set of additional actions can be defined as C_t = {(r, e) | r ∈ R ∧ e ∈ E ∧ (e_t, r, e) ∉ T}. We need to select the actions with the highest probability from C_t as additional actions, where the probability can be defined as

p((r, e)|s_t) = p(r|s_t) p(e|r, s_t). (5)

However, the candidate set C_t is very large, and it would be time-consuming to calculate the probability of all actions in C_t, so we adopt an approximate pruning strategy. Specifically, we first select the relations with the highest probability using p(r|s_t), and then select the entities with the highest probability for these relations using p(e|r, s_t).
For the current state s_t, we calculate the attention values over all relations as p(r|s_t),

w = σ(W_4 ReLU(W_3 s_t)), (6)

where σ is the softmax operator and W_3 and W_4 are two linear layers. We define a parameter α to control the proportion of actions that need to be added, and a parameter M that represents the maximum number of additional actions. Therefore, the number of additional actions can be defined as

N_add = min(⌈αN⌉, M), (7)

where N is the action space size of the current state. Given the attention vector w, we select the top x relations with the largest attention values in w to form a new relation set R_add = {r_1, r_2, · · · , r_x}. For every relation r_i ∈ R_add and the current entity e_t, we use the pre-trained KGE model to predict the probability distribution of the tail entity for the triple query (e_t, r_i, ?) as p(e|r_i, s_t). We only keep the k entities with the highest probability, which form k additional actions {(r_i, e^1_{r_i}), · · · , (r_i, e^k_{r_i})} for the triple query (e_t, r_i, ?). Finally, all additional actions make up the additional action space A_t^add for s_t. Here, k is a parameter, and x can be calculated from the previous parameters,

x = ⌈N_add / k⌉. (8)

During the multi-hop reasoning process, we dynamically generate the additional action space A_t^add for every state s_t. This additional action space is added to the original action space A_t to make up a new, larger action space,

A_t' = A_t ∪ A_t^add. (9)

Building on the dynamic anticipation strategy, the dynamic completion strategy can generate a more accurate action space, since the state contains more prediction information.
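The pruning logic (cap the number of additional actions by α and M, keep the top relations by attention, then the top-k tails per relation) can be sketched as follows; the KGE predictor is stubbed as a callable, and the interface is an illustrative assumption:

```python
import math
import numpy as np

def dynamic_completion(w, kge_topk, N, alpha=0.33, M=40, k=3):
    """Build the additional action space A_t^add for the current state.

    w        : attention over all relations, p(r | s_t)
    kge_topk : callable r -> top-k predicted tail entities for (e_t, r, ?)
    N        : size of the original action space A_t
    """
    n_add = min(math.ceil(alpha * N), M)   # how many actions to add at most
    x = math.ceil(n_add / k)               # how many relations to keep
    top_rels = np.argsort(-w)[:x]          # relations with the largest attention
    extra = []
    for r in top_rels:
        extra += [(int(r), e) for e in kge_topk(r)]
    return extra[:n_add]
```
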

Policy Optimization
We use the standard REINFORCE algorithm (Williams, 1992) to train the agent and optimize the parameters of the policy network. Specifically, training maximizes the expected reward over every triple query in the training set,

J(θ) = E_{(e_s, r_q, e_o) ∈ T} E_{a_1, ..., a_T ∼ π_θ} [R(s_T | e_s, r_q)], (10)

where R is the reward defined in Section 3.1. The parameters θ of the policy network are optimized as follows,

θ ← θ + β ∇_θ J(θ), (11)

where β is the learning rate.
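The REINFORCE update can be illustrated on a deliberately tiny case: a softmax policy over a fixed action set (a bandit, i.e. a one-step episode), rather than the full path-level objective. The score-function gradient R · ∇_θ log π_θ(a) is the same estimator the paper's training uses:

```python
import numpy as np

def reinforce_update(theta, episodes, beta=0.01):
    """One REINFORCE step for a softmax policy with logits theta.

    episodes : list of (action_index, reward) pairs sampled from the policy
    """
    grad = np.zeros_like(theta)
    for a, R in episodes:
        p = np.exp(theta - theta.max())
        p /= p.sum()
        one_hot = np.zeros_like(theta)
        one_hot[a] = 1.0
        grad += R * (one_hot - p)    # R * grad_theta log pi_theta(a)
    return theta + beta * grad / len(episodes)
```

Rewarded actions get their logits pushed up, so their probability under the policy rises after the update.
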

Datasets
In this paper, we use five datasets sampled from Freebase (Bollacker et al., 2008), NELL (Carlson et al., 2010) and Wikidata (Vrandečić and Krötzsch, 2014) for experiments. Specifically, in order to study the performance of our method on KGs with different degrees of sparsity, we construct three datasets based on FB15K-237 (Toutanova et al., 2015), i.e., FB15K-237-10%, FB15K-237-20% and FB15K-237-50%, which randomly retain 10%, 20% and 50% of the triples of FB15K-237 respectively. In addition, we construct two datasets, NELL23K and WD-singer, from NELL and Wikidata, where WD-singer is a singer-domain dataset from Wikidata. For NELL23K, we first randomly sample some entities from NELL and then sample triples containing these entities to form the dataset. For WD-singer, we first find all concepts related to singers in Wikidata, then use the entities corresponding to these concepts to build an entity list. After that, we expand the entity list appropriately, and finally use the triples containing entities in the entity list to form the final dataset. The statistics of the five datasets are listed in Table 2.
Evaluation Protocol: For each triple query (e_s, r_q, ?) in the test set, we use embedding-based or multi-hop reasoning models to get a ranking list of the tail entity. Following previous work (Bordes et al., 2013), we use the "filter" strategy in our experiments. We use two metrics for evaluation: (1) the mean reciprocal rank of all correct tail entities (MRR), and (2) the proportion of correct tail entities ranked in the top K (Hits@K).

Implementation Details: In our implementation, we set the dimension of the entity and relation vectors to 200, and use the ConvE model as the pre-trained KGE model for both the dynamic anticipation and dynamic completion strategies. In addition, we use a 3-layer LSTM and set its hidden dimension to 200. Following previous work (Das et al., 2018), we use Adam (Kingma and Ba, 2014) as the optimizer. For the parameters α, M and k in the dynamic completion strategy, we choose them from {0.5, 0.33, 0.25, 0.2}, {10, 20, 40, 60} and {1, 2, 3, 5} respectively. We select the best hyperparameters via grid search according to Hits@10 on the validation set. Besides, for every triple (e_s, r_q, e_o) in the training set, we also add a reverse triple (e_o, r_q^inv, e_s).
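The filtered MRR and Hits@K for a single query can be sketched as follows: other known correct tails are removed from the ranking before the gold entity's rank is computed. The dictionary-based interface is illustrative:

```python
def filtered_metrics(scores, gold, known_tails, K=10):
    """Filtered MRR and Hits@K contribution of one query (e_s, r_q, ?).

    scores      : dict entity -> model score (higher is better)
    gold        : the correct tail entity for this query
    known_tails : all tails t with (e_s, r_q, t) in train/valid/test (the "filter")
    """
    # rank = 1 + number of entities scored above gold, excluding filtered ones
    rank = 1 + sum(1 for e, s in scores.items()
                   if s > scores[gold] and e != gold and e not in known_tails)
    return 1.0 / rank, float(rank <= K)
```

Averaging these per-query values over the test set gives the reported MRR and Hits@K.
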

Link Prediction Results
The left part of Table 3 shows the link prediction results on FB15K-237-10%, FB15K-237-20% and FB15K-237-50%. From the table, we can see that our model outperforms previous multi-hop reasoning models on all three datasets, especially on FB15K-237-10%, where our model gains significant improvements over the best multi-hop reasoning baseline MultiHopKG (about 56.0% relative improvement on Hits@10). When we compare the results on these three datasets horizontally (from right to left in Table 3), we find that as the KG becomes sparser, the relative improvement of our model over the baseline models becomes more prominent. This phenomenon shows that our model is more robust to the sparsity of the KG than previous multi-hop reasoning models.
As shown in previous work (Lin et al., 2018; Fu et al., 2019), KGE models often achieve better results than multi-hop reasoning models, and this gap is more evident on sparse KGs. The results of these embedding-based models are reported only for reference, because they are a different type of model from multi-hop reasoning models and are not interpretable.
The right part of Table 3 shows the link prediction results on NELL23K and WD-singer. From the table, we observe a phenomenon similar to that in the left part of Table 3: our model performs better than previous multi-hop reasoning models, which indicates that it generalizes to other knowledge graphs.
From the last three rows of Table 3, we can see that the sample strategy described in Section 3.3 performs better than the top-one and average strategies in most cases. This is because the other two strategies lose some information: the top-one strategy retains only the entity with the highest probability, and the average strategy uses a weighted average of entity vectors, which may cause the features of different vectors to cancel each other out.

Ablation Study
In this paper, we design two strategies for sparse KGs. In order to study their contributions to the performance of our model, we conduct an ablation experiment by removing the dynamic anticipation (DA) or dynamic completion (DC) strategy on the FB15K-237-20% dataset. As shown in Table 4, removing either the DA or the DC strategy reduces the effectiveness of the model, which demonstrates that both strategies contribute to our model. Moreover, using either strategy individually still enables our model to achieve better results than the baseline models. Specifically, the model using the DC strategy alone performs better than the model using the DA strategy alone. This is expected, since the DA strategy only helps the agent make correct choices and does not substantially alleviate the sparsity of KGs.

Analysis
In the dynamic completion (DC) strategy, we dynamically provide some additional actions for every state, which enrich the selection space of the agent and ease the sparsity of KGs. However, will the agent choose these additional actions, or in other words, do these additional actions really work?
In this section, we analyze the DC hits ratio, i.e., the proportion of additional actions selected by the agent (actions chosen from A_t^add for s_t). First, we analyze the change of the DC hits ratio as training progresses, shown in the first row of Figure 3. From this figure, we can see that for most KGs (except FB15K-237-10%), the DC hits ratio is relatively high at the beginning of training, then drops sharply and stabilizes. This is reasonable because there is some noise among the additional actions. At the beginning, the agent cannot distinguish the noisy part and chooses additional actions as often as original ones. But as training proceeds, the agent learns to identify the noisy part and gradually reduces the selection ratio of additional actions. For FB15K-237-10%, the DC hits ratio first decreases and then increases. This is because many triples have been removed from FB15K-237-10%, which exacerbates the incompleteness of the dataset; the additional actions work more effectively in this situation and increase the probability of correct reasoning.
In the second row of Figure 3, we show the effect of the parameter α (the proportion of actions that need to be added) described in Section 3.4 on the DC hits ratio. Specifically, we use the average DC hits ratio over the last five epochs as the final result. From this figure, we can see that for most datasets, the DC hits ratio gradually increases as α increases. This is expected, because a larger α means more additional actions, so the probability that they are selected also increases. It is worth noting that on FB15K-237-50%, the DC hits ratio hardly changes with α. This is because the sparsity of FB15K-237-50% is not severe, so the model does not rely on additional actions.

Case Study
In Table 5, we give an example of a triple query and the three reasoning paths with the top-3 scores given by our model.

Related Work

Knowledge Graph Embedding
Knowledge graph embedding (KGE) aims to represent the entities and relations in KGs with corresponding low-dimensional embeddings. It then defines a score function f(e_s, r_q, e_t) over these embeddings to measure the correctness probability of each triple. Most KGE models fall into three categories: (1) Translation-based models (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Sun et al., 2018) formalize a relation as a translation from head entity to tail entity, and often use distance-based score functions derived from these translation operations.
(2) Tensor-factorization-based models (Nickel et al., 2011; Balazevic et al., 2019) formulate KGE as a three-way tensor decomposition task and define the score function according to the decomposition operations.
(3) Neural network models (Socher et al., 2013; Dettmers et al., 2018; Nguyen et al., 2018; Shang et al., 2019) usually design neural network modules to enhance expressive ability. Generally, given a triple query (e_s, r_q, ?), KGE models select the entity e_o with the highest score f(e_s, r_q, e_o) as the final prediction. Although KGE models are efficient, they lack interpretability for their predictions.

Multi-Hop Reasoning
Different from embedding-based models, multi-hop reasoning for KGs aims to predict the tail entity for a triple query (e_s, r_q, ?) and meanwhile provide a reasoning path to support the prediction. Before the multi-hop reasoning task was formalized, there were models for the relation path reasoning task, which aims to predict the relation between entities, i.e., (e_s, ?, e_o), using path information. DeepPath (Xiong et al., 2017) first adopts a reinforcement learning (RL) framework for relation path reasoning, which inspired much later work (e.g., DIVA and AttnPath). MINERVA (Das et al., 2018) is the first model that uses the REINFORCE algorithm for the multi-hop reasoning task. To make the training process of RL models stable, Shen et al. propose M-Walk, which addresses the reward sparsity problem with off-policy learning. MultiHopKG (Lin et al., 2018) further improves MINERVA with action dropout and reward shaping. Lv et al. (2019) propose MetaKGR to address the new task of multi-hop reasoning over few-shot relations. In order to adapt RL models to a dynamically growing KG, Fu et al. (2019) propose CPL to perform multi-hop reasoning and fact extraction jointly. In addition to the above RL-based reasoning models, there are other neural-symbolic models for multi-hop reasoning. NTP (Rocktäschel and Riedel, 2017) and NeuralLP are two end-to-end reasoning models that can learn logic rules from KGs automatically.
Compared with KGE models, multi-hop reasoning models sacrifice some accuracy for interpretability, which is beneficial for providing fine-grained guidance to downstream tasks.

Conclusion
In this paper, we study the task of multi-hop reasoning over sparse knowledge graphs. The performance of previous multi-hop reasoning models drops significantly on sparse KGs due to the lack of evidential paths. To solve this problem, we propose a reinforcement learning model named DacKGR with two strategies (i.e., dynamic anticipation and dynamic completion) designed for sparse KGs. These strategies ease the sparsity of KGs. We verify the effectiveness of DacKGR on five datasets; experimental results show that our model can alleviate the sparsity problem and achieves better results than previous multi-hop reasoning models. However, there is still some noise in the additional actions given by our model. In future work, we plan to improve the quality of these additional actions.