Incorporating Graph Attention Mechanism into Knowledge Graph Reasoning Based on Deep Reinforcement Learning

Knowledge Graph (KG) reasoning aims at finding reasoning paths for relations, in order to alleviate the incompleteness of KGs. Many previous path-based methods, such as PRA and DeepPath, either lack memory components or get stuck during training, so their performance always relies on careful pretraining. In this paper, we present a deep reinforcement learning based model named AttnPath, which incorporates LSTM and a Graph Attention Mechanism as memory components. We define two metrics, Mean Selection Rate (MSR) and Mean Replacement Rate (MRR), to quantitatively measure how difficult a query relation is to learn, and take advantage of them to fine-tune the model under the reinforcement learning framework. Meanwhile, we propose a novel training mechanism that forces the agent to walk forward at every step, so that it does not stall at the same entity node repeatedly. With this mechanism, the proposed model not only dispenses with the pretraining process, but also achieves state-of-the-art performance compared with other models. We test our model on the FB15K-237 and NELL-995 datasets with different tasks. Extensive experiments show that our model is effective and competitive with many current state-of-the-art methods, and also performs well in practice.


Introduction
Knowledge Graphs (KGs), such as NELL (Carlson et al., 2010), Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995), play an increasingly critical role in many downstream NLP applications, e.g., question answering (Dubey et al., 2018), information retrieval (Liu et al., 2018), personalized recommendation (Wang et al., 2019), etc. However, KGs are always incomplete: many links among the entities may be missing, which affects many downstream tasks. Thus, completing a KG by predicting the missing links between entities through reasoning is an important and challenging task. For instance, if athletePlaysForTeam(X, Y) and teamPlaysInLeague(Y, Z) both exist in the KG, then we can infer athletePlaysInLeague(X, Z), i.e., fill in the missing edge athletePlaysInLeague between X and Z.
There are mainly three families of methods for this task: Rule-Based (Wang and Cohen, 2016), Embedding-Based (Bordes et al., 2013; Lin et al., 2015) and Path-Based (Lao et al., 2011). Meanwhile, bringing Deep Reinforcement Learning (DRL) into the task of predicting missing links provides a new perspective, e.g., DeepPath (Xiong et al., 2017), a path-based method. DeepPath is the first work which incorporates DRL into KG reasoning. It achieves significant improvements compared with PRA, but still has several drawbacks. First, it lacks memory components, so it requires pretraining, which supplies many known (existing) paths to the model. This brute-force operation may make the model prone to overfit the given paths. Second, it is inappropriate to set the same hyperparameters for different relations in a KG during training, as this ignores the diversity of connections among entities. Last, when the agent selects an invalid path, it stops and reselects, which can lead to selecting the same invalid path repeatedly, so that the agent finally gets stuck at one node.
Thus, in this paper, we present a novel deep reinforcement learning model and training algorithm which tackle the drawbacks mentioned above. The proposed model also belongs to the path-based framework. Our contributions can be summarized as follows: • We propose AttnPath, a model which incorporates LSTM and graph attention as memory components, and is no longer subject to pretraining.
• We define two metrics, MSR and MRR, to quantitatively measure how difficult a relation's replaceable paths are to learn, and use them to fine-tune the model.
• We propose a novel reinforcement learning mechanism that forces the agent to walk forward at every step, preventing it from stalling at the same entity node.
We test AttnPath on the FB15K-237 and NELL-995 datasets with two downstream tasks: fact prediction and link prediction. We also evaluate the success rate of finding paths, and show the effectiveness of the Graph Attention Mechanism in the experiments.

Related Works
To date, many works have been proposed to address the problem of KG incompleteness. Rule-Based methods, like ProPPR (Wang and Cohen, 2016) and Neural LP, generate reasoning rules manually or by mathematical logic, and then apply them to fill missing links based on existing triples. Although these methods have a solid mathematical background, they are hard to scale to large KGs, since they operate directly on symbols, and the number of possible reasoning paths is exponential in the number of entities. Embedding-Based methods, like TransE (Bordes et al., 2013) and TransR (Lin et al., 2015), map entities and relations into a low-dimensional, continuous vector space, which captures the distances between entities and relations. They then judge whether a query relation exists by comparing the distance between the two trained entities' embeddings and the query relation's embedding. These methods require all triples in the KG to participate in training, and are only suitable for single-hop reasoning.
Path-Based methods, like PRA (Lao et al., 2011) and DeepPath (Xiong et al., 2017), train an agent to navigate on a KG and find replaceable paths for a certain relation, which are then used as features for downstream tasks. The Path Ranking Algorithm (PRA) is the first path-based reasoning method. Neelakantan et al. develop a compositional model based on RNNs, which composes the implications of a path and reasons about conjunctions of multi-hop relations non-atomically (Neelakantan et al., 2015). Guu et al. propose a soft edge traversal operator, which can be recursively applied to predict paths and reduce the cascaded propagation errors faced by single-hop KG completion methods like TransE and TransR (Guu et al., 2015). Toutanova et al. propose a dynamic programming algorithm which incorporates all relation paths of bounded length in a KG, and models both relations and intermediate nodes in the compositional path representations (Toutanova et al., 2016). Such representations can aid in generating more high-quality reasoning paths. Das et al. improve DeepPath (Xiong et al., 2017) to MINERVA (Das et al., 2018), which views the KG from a QA perspective. It gets rid of pretraining, introduces an LSTM to memorize previously traversed paths, and trains the agent to circulate on an entity if it believes this entity is the right answer. Lin et al. improve these two methods by introducing reward shaping and action dropout (Lin et al., 2018). Reward shaping replaces the fixed penalty for useless selections with a dynamic penalty, which can be based either on margin-based pretrained embeddings like TransE, or on probability-based embeddings like ConvE (Dettmers et al., 2018). Action dropout, in turn, randomly masks a certain proportion of valid actions, in order to reduce paths irrelevant to the query relation. DIVA regards paths as latent variables and relations as observed variables, building a variational inference model to accomplish the KG reasoning task.
It also uses beam search to broaden the search horizon. M-Walk (Shen et al., 2018) utilizes another RL algorithm, Monte Carlo Tree Search (MCTS), to tackle the problem of sparse rewards. The attention mechanism was first introduced into multi-hop KG reasoning in prior work, which, however, only computes the attention weights between the query's embedding and all found paths' embeddings; these weights are used to help judge whether the answer found by the vanilla model is true.

AttnPath: Incorporating Memory Components
In this section, we introduce the details of the proposed AttnPath. We also give the definitions of the Mean Selection Rate (MSR) and Mean Replacement Rate (MRR), and show how they are used to fine-tune the model for different query relations.

RL framework for KG reasoning
Since we use Reinforcement Learning (RL) as the training algorithm of a sequential decision model, we first introduce the basic elements of the RL framework for KG reasoning: environment, state, action, and reward. Environment: In this task, the environment refers to the whole KG, excluding the query relation and its inverse. The environment remains unchanged throughout the training process.
State: The state of an agent is concatenated with three parts: the embedding part, the LSTM part, and the graph attention part. We will show the computation of the LSTM part and the graph attention part in the next section, and introduce the embedding part first.
Following (Xiong et al., 2017), the embedding part m_t is the concatenation of two subparts. The first is e_t, the embedding of the current entity node. The other is e_target − e_t, the distance between the tail entity node and the current node. Different from DeepPath, which uses TransE (Bordes et al., 2013) pretrained embeddings, we take advantage of TransD (Ji et al., 2015), an improvement of TransE and a commonly-used benchmark. Under TransD, for a query relation, we project all entities onto the vector space of this query relation. An entity's projected embedding then becomes e_⊥ = (r_p e_p^T + I) e, where the subscript p indicates the projection vectors. So m_t = [e_t⊥ ; e_target⊥ − e_t⊥]. Action: For the KG reasoning task, an action refers to the agent choosing a relation edge to step forward along, according to the probabilities output by the model. Actions are either valid or invalid: an action is valid if the chosen relation is an outgoing edge of the current entity, and invalid if no such edge exists.
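To make the projection concrete, the following sketch computes the TransD projection and the embedding part m_t with NumPy. The helper names (transd_project, embedding_part) are our own illustration, not from any released code, and the sketch assumes entity and relation vectors share one dimension.

```python
import numpy as np

def transd_project(e, e_p, r_p):
    """TransD projection onto a relation's space: e_perp = (r_p e_p^T + I) e."""
    d = e.shape[0]
    M = np.outer(r_p, e_p) + np.eye(d)  # projection matrix r_p e_p^T + I
    return M @ e

def embedding_part(e_t, e_t_p, e_target, e_target_p, r_p):
    """Embedding part of the state: m_t = [e_t_perp ; e_target_perp - e_t_perp]."""
    e_t_perp = transd_project(e_t, e_t_p, r_p)
    e_tgt_perp = transd_project(e_target, e_target_p, r_p)
    return np.concatenate([e_t_perp, e_tgt_perp - e_t_perp])
```

With zero projection vectors the projection matrix reduces to the identity, so the sketch degrades gracefully to the plain TransE-style distance feature.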
Reward: The reward is feedback to the agent, based on whether an action is valid, and whether a sequence of actions leads to a ground-truth tail entity within a specified number of steps. We adopt the reward shaping trick proposed by (Lin et al., 2018). For invalid actions, the reward is −1. For valid actions which do not lead to the ground truth, we use the output of ConvE (Dettmers et al., 2018) as the reward. Since ConvE outputs probabilities, which lie in the range (0, 1), we apply a logarithm operator to enlarge the range of this reward and improve discrimination. For actions which lead to the ground truth, i.e., a successful episode, the reward is the weighted sum of the global accuracy, the path efficiency, and the path diversity. The global accuracy is set to 1 by convention, and the path efficiency is the reciprocal of the path length, because we encourage the agent to take as few steps as possible. The path diversity is defined as

r_DIVERSITY = −(1/|F|) Σ_{i=1}^{|F|} cos(p, p_i),

where |F| is the number of found paths, and p is the path embedding, simply the sum of the embeddings of all relations in the path. The above definition guarantees that the reward of a valid action is always larger than that of an invalid action, and the reward of a successful episode is always larger than that of an unsuccessful one.
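The reward terms above can be sketched as follows. The lambda defaults mirror the weights reported later in the experiments (0.1 / 0.8 / 0.1); the function names and the use of a plain ConvE probability are illustrative assumptions, not the authors' code.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def episode_reward(path_emb, found_path_embs, path_len,
                   lambdas=(0.1, 0.8, 0.1)):
    """Weighted sum of global accuracy, path efficiency and path diversity
    for a successful episode."""
    r_global = 1.0                      # success, by convention
    r_eff = 1.0 / path_len              # prefer shorter paths
    if found_path_embs:                 # diversity vs. previously found paths
        r_div = -np.mean([cosine(path_emb, p) for p in found_path_embs])
    else:
        r_div = 0.0
    l1, l2, l3 = lambdas
    return l1 * r_global + l2 * r_eff + l3 * r_div

def shaped_reward(conve_prob):
    """Log of the ConvE score enlarges the (0, 1) range for valid but
    unsuccessful episodes; invalid actions simply receive -1."""
    return float(np.log(conve_prob))
```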

LSTM and Graph Attention as Memory Components
In our model, we utilize a three-layer LSTM, enabling the agent to memorize and learn from the actions taken before. Denote the hidden state of the LSTM at step t by h_t, and initialize h_0 = 0. Then we obtain

h_t = LSTM(h_{t−1}, m_t),

where m_t is the embedding part defined above. This is the LSTM part of the state. Typically, an entity has several different aspects; for example, a football player may be connected with professional relations like playsForTeam or playsInLeague, and also family relations like spouse or father. For different query relations, it is better for the agent to focus more on the relations and neighbors which are highly related to the query relation. So we introduce the Graph Attention mechanism (GAT) (Velickovic et al., 2018) into our model.
GAT is essentially self-attention over the entity nodes. We use a single-layer feedforward neural network to calculate attention weights, with a linear transformation matrix W and a weight vector a shared across all entities, and LeakyReLU with negative-input slope α = 0.2 as the nonlinearity. The unnormalized attention weight from entity i to entity j is calculated as

e_ij = LeakyReLU(a^T [W e_i ; W e_j]).

For entity i, we only compute attention weights to its directly-connected neighbors N_i, and normalize them with SoftMax, so the normalized attention weight is

α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik).
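A minimal NumPy sketch of the attention computation above, assuming dense embeddings and a list of neighbor vectors; gat_weights is a hypothetical helper name, not part of any GAT library.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_weights(e_i, neighbors, W, a, alpha=0.2):
    """Normalized attention weights from entity i to its direct neighbors:
    score_ij = LeakyReLU(a^T [W e_i ; W e_j]), then SoftMax over neighbors."""
    wi = W @ e_i
    scores = np.array([leaky_relu(a @ np.concatenate([wi, W @ e_j]), alpha)
                       for e_j in neighbors])
    scores -= scores.max()              # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()
```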
Now we can compute the graph attention part of the state, which is simply the weighted sum of all neighbors' embeddings in the attention space:

g_t = Σ_{j∈N_i} α_ij W e_j.

Thus, the state vector s_{i,t} of entity i at time t is

s_{i,t} = [m_t ; h_t ; g_t],

which is in turn input to a three-layer feedforward neural network whose final output is a SoftMax probability distribution over all relations in the KG. The agent selects an action and obtains a reward. After it successfully reaches the tail entity, or fails to do so within a specified number of steps, the rewards of the whole episode are used to update all parameters. The optimization uses the REINFORCE algorithm (Williams, 1992), which updates θ with the following stochastic gradient:

∇_θ J(θ) = Σ_t R(s_t, a_t) ∇_θ log π_θ(a_t | s_t),

where e_s is the head entity, r the query relation, and π_θ(a_t | s_t) the probability distribution over all relations. Figure 1 shows our AttnPath model.

Mean Selection Rate and Mean Replacement Rate

The selection rate (SR) of relation r with regard to a triple (h, r, t) is defined as the proportion of relation r among h's outgoing edges.
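For a linear softmax policy, a simplified stand-in for the three-layer network described above, the REINFORCE gradient reduces to the following sketch; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta, states, actions, reward):
    """Monte-Carlo policy gradient for a linear softmax policy
    pi_theta(a|s) = softmax(theta^T s): grad = sum_t R * grad log pi(a_t|s_t)."""
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        probs = softmax(theta.T @ s)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # d log pi(a|s) / d theta = s (1_a - probs)^T for a linear policy
        grad += reward * np.outer(s, one_hot - probs)
    return grad
```

Parameters are then updated by gradient ascent, theta += lr * grad, once per episode.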
Thus, MSR is the average of SR over T_r, the set of training triples of relation r:

MSR(r) = (1/|T_r|) Σ_{(h,r,t)∈T_r} SR(r, (h, r, t)).

A lower MSR denotes that r is more difficult to learn, because the entities connected by relation r may have more aspects. The replacement rate (RR) of relation r with regard to a triple (h, r, t) measures the proportion of relations other than r and its inverse that directly connect h and t, and MRR is its average over T_r. A higher MRR denotes that a relation has more replacement relations, so it is easier to learn, since the agent can directly choose an alternative relation to reach the destination. In our model, we have three ways to prevent overfitting: L2 regularization, dropout, and action dropout. For the relations which are easier to learn (high MSR and MRR), we impose more regularization, to encourage the agent to find more diverse paths without overfitting on immediate success. Conversely, for the relations which are more difficult to learn (low MSR and MRR), we focus on the success rate of finding a path, so we impose less regularization.
For simplicity, we use an exponential function to compute the difficulty coefficient of relation r, defined as exp(MSR(r) + MRR(r)) and multiplied with the base rates of the three regularization methods, respectively. The base rates are KG-specific, shared across all relations in the same KG.
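Assuming the KG is given as a list of (h, r, t) tuples, the MSR and difficulty coefficient above can be computed as follows; both helper names are our own.

```python
import math
from collections import defaultdict

def mean_selection_rate(triples, r):
    """MSR(r): average, over triples (h, r, t) of relation r, of the
    proportion of h's outgoing edges labeled r."""
    out = defaultdict(list)
    for h, rel, t in triples:
        out[h].append(rel)
    srs = [out[h].count(r) / len(out[h])
           for h, rel, t in triples if rel == r]
    return sum(srs) / len(srs)

def difficulty_coefficient(msr, mrr):
    """exp(MSR + MRR): multiplied with the base L2 / dropout / action-dropout
    rates, so that easier relations receive more regularization."""
    return math.exp(msr + mrr)
```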

Overall Training Algorithm
Based on the proposed model, we present a novel training algorithm, shown in Algorithm 1. One contribution of our algorithm is that, when the agent selects an invalid path, the model not only penalizes it, but also forces it to select a valid relation to step forward. The probabilities output by the neural network are renormalized over all valid relations, and then serve as the probabilities of the forced action.
After the initializations, Line 6 samples an action according to the output of the network. When the agent selects an invalid action, Lines 7–10 are executed, with Lines 9–10 forcing the agent to step forward. When the agent selects a valid action, Line 12 is executed. Lines 19, 22 and 25 update the parameters for invalid actions, valid actions in a successful episode, and valid actions in an unsuccessful episode, with rewards −1, R_total and R_shaping, respectively.
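The forced-forward step of Lines 7–10 can be sketched as masking and renormalizing the policy output over valid actions; sample_forced_action is an illustrative name, and the penalized first sample stands in for the pair added to the negative memory.

```python
import numpy as np

def sample_forced_action(probs, valid_mask, rng):
    """Sample an action; if it is invalid, renormalize the probabilities
    over valid actions and re-sample, forcing a forward step.
    Returns (action_taken, penalized_action_or_None)."""
    a = rng.choice(len(probs), p=probs)
    if valid_mask[a]:
        return a, None                  # valid on the first try
    masked = probs * valid_mask
    masked /= masked.sum()              # renormalize over valid actions only
    forced = rng.choice(len(probs), p=masked)
    return forced, a                    # forced action, plus action to penalize
```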

Experiments
In this section, we will perform extensive experiments to validate the effectiveness of our proposed AttnPath. For each task, we will mainly focus on three quantitative metrics: success rate (SR) in finding paths, MAP of fact prediction (FP), and MAP of link prediction (LP). We will also demonstrate some reasoning paths and triples to show that graph attention is effective in finding more high-quality paths and mining which aspect of the entity is important in a specific task.

Datasets and Settings
Two datasets, FB15K-237 (Toutanova et al., 2015) and NELL-995 (Xiong et al., 2017), are used in our experiments. Statistics of these two datasets are displayed in Table 1. Following previous works, for each triple (h, r, t), we add its inverse triple (t, r^{-1}, h), to allow the agent to step backward.
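Adding the inverse triples is a one-line preprocessing step; the `_inv` suffix here is only an illustrative encoding of r^{-1}.

```python
def add_inverse_triples(triples):
    """Augment each (h, r, t) with its inverse (t, r_inv, h), so the agent
    can traverse every edge in both directions."""
    return triples + [(t, r + "_inv", h) for h, r, t in triples]
```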
We summarize the hyperparameters used in our experiments here. The pretrained embedding dimension is set to 100, the LSTM hidden dimension to 200, and the attention dimension to 100. Thus, s is a 500-dimensional vector, obtained by concatenating the embedding part (twice the pretrained embedding dimension), the LSTM part, and the attention part. λ_1 is 0.1, λ_2 is 0.8, and λ_3 is 0.1. For the FB15K-237 dataset, we set the base L2 regularization, dropout rate and action dropout rate to 0.005, 0.15 and 0.15, respectively; for NELL-995, we set them to 0.005, 0.1 and 0.1. We choose Adam (Kingma and Ba, 2015) as the optimizer, with a learning rate of 0.001, β_1 = 0.9 and β_2 = 0.999. For each task, we train for 500 episodes, and for each episode, the maximum number of steps is set to 50.
During training, we verify the learned paths for each triple in the FP and LP tasks with the BFS-based method of (Xiong et al., 2017).

Success Rate in Finding Paths
Since our model does not rely on pretraining, it is ignorant of the environment and the triples at first. Thus, we record the total SR, and the SR over the most recent 10 episodes (SR-10), to validate the agent's ability to learn paths. For tasks with fewer than 500 training samples, the samples are iterated over first, then randomly sampled to reach 500 episodes. For tasks with more than 500 training samples, we randomly select 500 of them for training.
We select two relations from NELL-995, athletePlaysInLeague and organizationHeadquarteredInCity, to investigate their total SR and SR-10. The former has relatively low MSR and MRR, while the latter has relatively high values. Figure 2 shows the results. It can be observed that DeepPath outperforms our method at first; however, after 50 ∼ 150 episodes, our models surpass it. From the SR-10 curve of AttnPath Force, we find that the initial SR is approximately equal to the MRR, because the model knows nothing at first and thus selects paths randomly. As training proceeds, the performance improves steadily. Similar results can also be observed for other relations on FB15K-237.

Fact Prediction
Fact prediction (FP) aims at predicting whether an unknown fact is true or false. The ratio of positive triples to negative triples is about 1 : 10. For each relation, we use all found paths, weighted by the reciprocal of their lengths, to accumulate the score of each triple based on whether each path is valid between h and t. Scores are ranked across the whole test set, and Mean Average Precision (MAP) is used as the evaluation metric. The results are shown in Table 2; DeepPath is retrained with TransD for a fair comparison, which performs slightly better than the results reported in (Xiong et al., 2017). It can be seen that AttnPath outperforms TransE / TransR and DeepPath significantly. AttnPath MS/RR, which uses MSR and MRR to fine-tune the hyperparameters, gains further improvement. AttnPath Force is also effective: by forcing the agent to walk forward at every step, it improves the SR of finding a path, which in turn enriches the features for the downstream tasks. This is especially important for relations which lack directly-connected replacement paths and require paths with long-term dependencies. In fact, our methods achieve SOTA results on both datasets.
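The path-based scoring described above can be sketched as follows, assuming the KG is stored as a mapping from (entity, relation) to successor entities; both function names are our own illustration.

```python
def path_exists(graph, h, t, path):
    """Check whether following `path` (a list of relations) from h can
    reach t. `graph` maps (entity, relation) -> set of successors."""
    frontier = {h}
    for rel in path:
        frontier = set().union(*(graph.get((e, rel), set()) for e in frontier))
        if not frontier:
            return False
    return t in frontier

def score_triple(graph, h, t, found_paths):
    """Accumulate 1/len(path) for every found path valid between h and t;
    triples are then ranked by this score for MAP evaluation."""
    return sum(1.0 / len(p) for p in found_paths if path_exists(graph, h, t, p))
```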

Link Prediction
Link prediction (LP) aims at predicting the target entities. For each (h, r) pair, there is one ground-truth t and about 10 generated false ones. The pairs are divided into a training set and a test set. We use the found paths as binary features, train a classification model on the training set, and apply it to the test set. LP also uses MAP as the evaluation metric, and the detailed results are shown in Table 3. As in FP, our model attains better results in LP, including SOTA results on the FB15K-237 dataset. However, we also notice that AttnPath does not attain the best result on a small portion of query relations, sometimes scoring even lower than TransE / TransR. By analyzing the triples related to these relations, we found that: 1) they have more outgoing edges of other relations pointing to entities which are not the true tails, so the MSR of these query relations is low; 2) their tail entities are only connected by edges of the query relation and its inverse, and since these edges are removed during training, the tail entities become isolated, without any possible replacement paths, which also lowers the MRR of these query relations. Take birthPlace and bornLocation as examples: if a person was born in a remote place, this place is hard to connect with other entities, so it is prone to being isolated. However, such one-to-one relations are the strength of TransX methods.

Figure 2: Total SR and SR-10 for two relations of NELL-995. DeepPath / AttnPath TransD means using TransD as the pretrained embeddings. AttnPath MS/RR adds MSR and MRR to fine-tune hyperparameters. AttnPath Force forces the agent to move forward at every step. These abbreviations are used throughout this section.

Paths Found By DeepPath and AttnPath
We take capitalOf from FB15K-237 and athletePlaysInLeague from NELL-995 as examples to analyze the paths found by DeepPath and AttnPath. Table 4 shows the top 5 most frequent paths and their frequencies for all methods. It shows that AttnPath is more capable of capturing long-term dependencies between relations, which is useful for relations lacking directly-connected replaceable paths. AttnPath can also find more important and concentrated paths, so the distribution of paths does not have a heavy long tail. During training, we also find that AttnPath is better at stepping backward when it enters a dead end.

Effectiveness of Graph Attention Mechanism
We sample several entity-relation pairs, compute each entity's attention weights over its neighbors under the given relation, and investigate the neighbors with the top 5 attention weights. Table 5 displays the examples. It shows that GAT is capable of paying more attention to the neighbors related to the query relation, especially for Anthony Minghella and Brandon Marshall, which attend to different neighbors under different query relations.

Conclusion and Future Work
In this paper, we propose AttnPath, a DRL-based model for the KG reasoning task which incorporates LSTM and a Graph Attention Mechanism as memory components, freeing the model from pretraining. We also devise two metrics, MSR and MRR, to measure the learning difficulty of relations, and use them to better fine-tune the training hyperparameters. We improve the training process to prevent the agent from stalling in a meaningless state. Qualitative and quantitative experiments show that our method significantly outperforms DeepPath and embedding-based methods, proving its effectiveness.
In the future, we are interested in utilizing multi-task learning to enable the model to learn reasoning paths for several query relations simultaneously. We would also like to investigate how to apply GAT, MSR and MRR to other KG-related tasks, such as KG representation, relation clustering and KB-QA.