Collaborative Policy Learning for Open Knowledge Graph Reasoning

In recent years, there has been a surge of interest in interpretable graph reasoning methods. However, these models often suffer from limited performance when working on sparse and incomplete graphs, due to the lack of evidential paths that can reach target entities. Here we study open knowledge graph reasoning, a task that aims to reason for missing facts over a graph augmented by a background text corpus. A key challenge of the task is to filter out "irrelevant" facts extracted from the corpus, in order to maintain an effective search space during path inference. We propose a novel reinforcement learning framework to train two collaborative agents jointly, i.e., a multi-hop graph reasoner and a fact extractor. The fact extraction agent generates fact triples from corpora to enrich the graph on the fly, while the reasoning agent provides feedback to the fact extractor and guides it towards promoting facts that are helpful for interpretable reasoning. Experiments on two public datasets demonstrate the effectiveness of the proposed approach.


Introduction
Knowledge graph completion or reasoning, i.e., the task of inferring the missing facts (entity relationships) for a given graph, is an important problem in natural language processing and has a wide range of applications (Bordes et al., 2011; Socher et al., 2013; Trouillon et al., 2016). Recent neural graph reasoning methods, such as MINERVA (Das et al., 2017), DeepPath (Xiong et al., 2017) and Multi-Hop (Lin et al., 2018), have achieved impressive results on the task, offering both good prediction accuracy (compared to embedding-based methods (Trouillon et al., 2016; Dettmers et al., 2018)) and interpretability of the model predictions. These reasoning methods frame the link inference task as a path finding problem over the graph (see Fig. 1 for an example). However, current neural graph reasoning methods encounter two main challenges: (1) their performance is often sensitive to the sparsity and completeness of the graph, since missing edges (i.e., potential false negatives) make it harder to find evidential paths reaching target entities; (2) existing models assume the graph is static, and cannot adapt to dynamically enriched graphs where emerging new facts are constantly added.
Figure 1: Given an entity (e.g., Miami) and a query relation (e.g., located in), we learn to infer reasoning paths over the existing graph structure to help predict the answer entity (i.e., USA).
In this paper, we study the new task of Open Knowledge Graph Reasoning (OKGR), where the new facts extracted from the text corpora will be used to augment the graph dynamically while performing reasoning (as illustrated in Figure 2). Recent joint graph and text embedding methods focus on learning better knowledge graph embeddings for reasoning (Xu et al., 2014; Han et al., 2018), but we consider adding more facts to the graph from the text to improve the reasoning performance and further provide interpretability. A straightforward solution for the OKGR problem is to directly add extracted facts (by a pre-trained relation extraction model) to the graph. However, most facts so extracted may be noisy or irrelevant to the path inference process. Moreover, adding a large number of edges to the graph will create an ineffective search space and cause scalability issues for the path finding models. Therefore, it is desirable to design a method that can filter out irrelevant facts when augmenting the reasoning model.
Figure 2: Overview of our CPL framework for the OKGR problem. To augment the reasoning with the information from a background corpus, CPL extracts relevant facts (e.g., the triple linked by the blue dotted arrow) to augment the KG dynamically. CPL involves two agents: one learns a fact extraction policy to suggest relevant facts; the other learns to reason on dynamically augmented graphs to make predictions (e.g., the red dotted arrow).
To address the above challenges for OKGR, we propose a novel collaborative policy learning (CPL) framework to jointly train two RL agents in a mutually enhancing manner. In CPL, besides training a reasoning agent for path finding, we further introduce a fact extraction agent, which learns a policy to select relevant facts extracted from the corpus, based on the context of the reasoning process and the corpus (see Fig. 2). At inference time, the fact extraction agent dynamically augments the graph with only the most informative edges, and thus enables the reasoning agent to identify positive paths effectively and efficiently. Specifically, during policy learning, the reasoning agent is rewarded when it reaches the target, and this positive feedback is also transferred back to the fact extraction agent if its edge suggestions are adopted by the reasoning agent, i.e., if they form part of a correct reasoning path. This ensures that the fact extraction policy learns to prefer edges that are beneficial to path inference. In this way, the fact extraction agent learns to augment knowledge graphs dynamically to facilitate the reasoning agent, while the reasoning agent performs effective path inference and provides reward signals to the fact extraction agent. Please refer to Sec. 3 and Fig. 3 for more implementation details.
The major contributions of our work are as follows: (1) We study knowledge graph reasoning in an "open-world" setting, where new facts extracted from background corpora can be used to facilitate path finding; (2) We propose a novel collaborative policy learning framework which models the interactions between fact extraction and graph reasoning; (3) Extensive experiments and analysis are conducted to demonstrate the effectiveness and strengths of our proposed method.

Background and Problem
This section introduces basic concepts and notations related to the knowledge graph reasoning task and provides a formal problem definition.
A knowledge graph (KG) can be represented by a set of triples (facts) $G = \{(e_s, r, e_o) \mid e_s, e_o \in E, r \in R\}$, where $E$ is the set of entities and $R$ is the set of relations; $e_s$, $r$, and $e_o$ are the subject entity, relation, and object entity, respectively.
The task of knowledge graph reasoning (KGR) is defined as follows. Given the KG $G$ and a query triple $(e_s, r_q, e_q)$ where $e_q$ is unknown, KGR infers $e_q$ by finding a path from $e_s$ to $e_q$ on $G$, namely $\{(e_s, r_1, e_1), \ldots, (e_n, r_n, e_q)\}$. Usually, KGR methods produce multiple candidate answers by ranking the found paths, while traditional KG completion methods rank all possible answer triples by exhaustive enumeration.
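To make the definition concrete, the following minimal Python sketch stores a KG as a set of triples and checks whether a candidate path leads from the subject entity to an answer. The toy entities and relations, and the helper names `out_edges` and `follow_path`, are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict

# Toy graph; entities and relations are illustrative only.
G = {
    ("Miami", "located_in", "Florida"),
    ("Florida", "part_of", "USA"),
}


def out_edges(graph):
    """Index triples by subject entity so out-edges can be looked up directly."""
    index = defaultdict(set)
    for e_s, r, e_o in graph:
        index[e_s].add((r, e_o))
    return index


def follow_path(graph, e_s, path):
    """Follow a list of (relation, entity) hops from e_s.

    Returns the entity reached, or None if some hop is not an edge of the graph.
    """
    index = out_edges(graph)
    current = e_s
    for r, e_next in path:
        if (r, e_next) not in index[current]:
            return None
        current = e_next
    return current


answer = follow_path(G, "Miami", [("located_in", "Florida"), ("part_of", "USA")])
```

A path that uses an edge missing from `G` returns `None`, which mirrors how a sparse graph can leave the reasoner without an evidential path.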
A background corpus is a set of sentences labeled with respective entity pairs, namely $C = \{(s_i : (e_k, e_j)) \mid s_i \in S,\ e_k, e_j \in E\}$, where $S$ is the set of sentences, and the corpus shares the same entity set with $G$. In our problem setting, we assume the entities have already been extracted; thus, extracting facts from the corpus is equivalent to the relation extraction task. We process the corpus by labeling the sentences with subject and object entity pairs through Distant Supervision (Mintz et al., 2009). There may be many sentences labeled with the same entity pair. Following the formulation of previous work (Lin et al., 2016), we organize the sentences into sentence bags, i.e., a sentence bag contains the sentences which are labeled with the same entity pair.
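As a small illustration of the sentence-bag organization, the sketch below groups entity-pair-labeled sentences by their label. The example sentences and the function name `build_sentence_bags` are hypothetical.

```python
from collections import defaultdict

# Each corpus item is (sentence, (subject entity, object entity)); toy data only.
corpus = [
    ("Miami is a city in Florida.", ("Miami", "Florida")),
    ("Florida's largest metro area is Miami.", ("Miami", "Florida")),
    ("Florida is a state of the USA.", ("Florida", "USA")),
]


def build_sentence_bags(labeled_sentences):
    """Group sentences that share the same entity-pair label into one bag."""
    bags = defaultdict(list)
    for sentence, pair in labeled_sentences:
        bags[pair].append(sentence)
    return dict(bags)


bags = build_sentence_bags(corpus)
```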
Problem. Formally, Open Knowledge Graph Reasoning (OKGR) aims to perform KGR based on both $G$ and $C$, where $G$ is dynamically enriched by the facts extracted from $C$. This paper focuses on OKGR, i.e., empowering KGR with the corpus information and enriching the graph with relevant facts dynamically. Thus, the evaluation of the relation extraction performance is out of the scope of this paper. We leave this as future work.

Proposed Framework
Overview. To resolve the challenges in OKGR, we propose a novel collaborative policy learning (CPL) framework (see Fig. 2), which jointly trains two RL agents, i.e., a path reasoning agent and a fact extraction agent. Given a query $(e_s, r_q, e_q)$, the reasoning agent tries to infer $e_q$ by finding a reasoning path on the (augmented) $G$, while the fact extraction agent aims to select the most informative facts from $C$ to enrich $G$ dynamically. With such an extractor, the framework can effectively overcome the edge sparsity problem while remaining reasonably efficient (compared to the naive solution that adds all possible facts to $G$).
We train the extraction agent by rewarding it according to the reasoning agent's performance. Hence the fact extractor can learn how to extract the most informative facts to benefit the reasoning.

Graph Reasoning Agent
The goal of the reasoner is to learn to reason via path finding on KGs. Specifically, given $e_s$ and $r_q$, the reasoner aims at inferring a path from $e_s$ to some entity $e_o$ regarding $r_q$, and specifying how likely the relationship $r_q$ holds between $e_s$ and $e_o$. The inference path acts as the evidence of the prediction, and thus offers interpretation (see Fig. 1). At each time step, the reasoner tries to select an edge based on the observed information. The Markov Decision Process (MDP) of the reasoner is defined as follows.
State. In path-based reasoning, each succeeding edge is semantically closely related to the preceding edge on the path and to the query. Similar to MINERVA (Das et al., 2017), we want the state to encode all observed information, i.e., we define $s_R^t = (e_s, r_q, h_t) \in S_R$, where $h_t$ encodes the path history, and $(e_s, r_q)$ is the context shared among all states. Specifically, we use an LSTM module to encode the history, $h_t = \mathrm{LSTM}(h_{t-1}, [r_t, e_t])$ (see Fig. 3), where $e_t$ is the current reasoning location and $r_t$ is the previous relation connecting to $e_t$.
Action. At time $t$, the reasoner will select an edge among $e_t$'s out-edges. The reasoner's action space is a union of the edges in the current KG and edges extracted from the corpus. See Sec. 3.3 for details.

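Transition. The transition function $f: S_R \times A_R \to S_R$ maps the current state and the selected edge to the next state, $f(s_R^t, a_R^t) = (e_s, r_q, h_{t+1})$, where $h_{t+1}$ is obtained from the LSTM history update.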
Reward. The reasoner is expected to learn effective reasoning path patterns. We let it explore for a fixed number of steps, which is a hyper-parameter. It receives a terminal reward of 1 only when it reaches the correct target entity, and 0 otherwise. All intermediate states receive reward 0.

Fact Extraction Agent
The fact extractor learns to suggest the most relevant facts w.r.t. the current inference step of the reasoner. Suppose the reasoner arrives at entity $e_t$ on the graph at time $t$; the fact extractor will extract facts of the form $(e_t, r', e') \notin G$ from the corpus and add them to the graph temporarily. Consequently, the reasoner is offered more choices to expand the reasoning path.
State. When the reasoner is at $e_t$, the fact extractor tries to extract information from the corpus (sentences) and suggests promising out-edges of entity $e_t$. Let $b_{e_t}$ denote the sentence bags labeled with $(e_t, e')$, $e' \in E$. We define the state of the fact extractor to encode the currently observed information, i.e., $s_E^t = (b_{e_t}, e_t) \in S_E$, where $S_E$ is the whole state space, containing all possible combinations of entities and corresponding sentence bags.
Action. The goal of the fact extractor is to select a fact from the corpus that is semantically relevant to the reasoning. At step $t$, the reasoner will move to a new entity from $e_t$, and hence only the out-edges of $e_t$ should be considered, e.g., $(e_t, r', e')$ (see Fig. 3). Therefore, for each fact $(e_t, r', e')$ that can be extracted from the sentence bag $b_{e_t}$, we can derive an action $a_E^t = (r', e')$, and the action space at step $t$ can be denoted as $A_E^t = \{(r', e')\}_{(e_t, r', e') \in b_{e_t}} \subset A_E$, where $A_E$ is the whole action space containing all facts in the corpus.
Transition. The transition function $f: S_E \times A_E \to S_E$ maps the current state and action to the next state $s_E^{t+1} = (b_{e_{t+1}}, e_{t+1})$, where $e_{t+1}$ is the reasoner's next location.
Reward. The fact extractor receives a step-wise delayed reward from the reasoner according to how much it improves the reasoner's performance. The extractor will be positively rewarded when its suggestion benefits the reasoning process. Please see Sec. 3.3 for details.

Collaborative Policy Learning
In this section, we introduce the detailed training process and the collaboration mechanism between the two agents. At a high level, we adopt an alternating training procedure to update the two agents jointly: we train one of the agents for a few iterations while freezing the other, and vice versa. The policies of both agents are updated via the REINFORCE algorithm (Williams, 1992) (details appear later in this section). The agent collaboration works as follows.
Augmented Action Space for Reasoning. At time $t$, the fact extractor helps the reasoner by expanding its action space with new edges extracted from the corpus (see Fig. 3). Due to the sparsity and incompleteness of the KG, there may be missing edges preventing the reasoner from inferring the correct reasoning path (Fig. 2). Therefore, we add high-confidence edges extracted by the extractor to the action space of the reasoner. Formally, at time $t$, the reasoner is at location $e_t$ (Fig. 3) and tries to select an edge out of all out-edges of $e_t$. Let $A_K^t$ denote the edge set in the current KG, $A_K^t = \{(r, e) \mid (e_t, r, e) \in G\}$. Let $A_C^t$ denote the edge set suggested by the extractor, $A_C^t = \{(r', e') \mid (e_t, r', e') \in C\}$. The action space of the reasoner at time $t$ is defined as $A_R^t = A_K^t \cup A_C^t \subset A_R$, where $A_R$ denotes the whole action space of the reasoner, i.e., all possible edges in the KG and the corpus. The reasoner learns a policy to select the best edges out of the joint action space for reasoning.
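The joint action space can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: triples are plain tuples, the extractor's suggestions are already available as a set, and the function name `reasoner_action_space` is hypothetical.

```python
def reasoner_action_space(e_t, kg_triples, extracted_triples):
    """Union of KG out-edges and extractor-suggested out-edges of e_t."""
    # Out-edges of e_t that already exist in the KG.
    a_k = {(r, e) for (s, r, e) in kg_triples if s == e_t}
    # Out-edges of e_t suggested from the corpus that are not yet in the KG.
    a_c = {
        (r, e)
        for (s, r, e) in extracted_triples
        if s == e_t and (s, r, e) not in kg_triples
    }
    return a_k | a_c


kg = {("Miami", "located_in", "Florida")}
suggested = {("Miami", "part_of", "USA"), ("Tampa", "located_in", "Florida")}
actions = reasoner_action_space("Miami", kg, suggested)
```

Only suggestions that start at the current location enter the action space, so the edge starting at `"Tampa"` is ignored when the reasoner is at `"Miami"`.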
Reasoning Feedback for Fact Extraction. The reasoner helps the extractor learn the extraction policy by providing feedback on how much the extractor contributes to the reasoning. We therefore define a step-wise delayed reward that the fact extractor receives from the reasoner. Specifically, when the reasoner finishes exploration (at time $T$) and arrives at the correct target, we consider the path effective and positive for reasoning. If the fact extractor contributes to this positive path, it is rewarded positively, i.e., if an edge on the positive path was suggested by the extractor at time $t$, $0 \le t \le T$, the extractor is rewarded 1 at time $t$, and 0 otherwise. Extracted edges that trigger positive rewards are kept in the graph, while the others are removed when both agents move to the next state.
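The reward rule above can be sketched as follows. The representation of a path as a list of chosen edges and the function name `extractor_rewards` are assumptions made for illustration.

```python
def extractor_rewards(path_edges, suggested_edges, reached_target):
    """Step-wise rewards for the extractor after the reasoner finishes a rollout.

    path_edges: the edge the reasoner selected at each step.
    suggested_edges: edges that were proposed by the extractor.
    reached_target: whether the rollout ended at the correct entity.
    """
    if not reached_target:
        # A path that misses the target yields no reward at any step.
        return [0] * len(path_edges)
    # On a positive path, reward exactly the steps whose edge was suggested.
    return [1 if edge in suggested_edges else 0 for edge in path_edges]


path = [("Miami", "located_in", "Florida"), ("Florida", "part_of", "USA")]
suggested = {("Florida", "part_of", "USA")}
rewards = extractor_rewards(path, suggested, reached_target=True)
```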
Policy Update. With the MDPs of both agents defined above, we can use the standard REINFORCE algorithm (Williams, 1992) to train the two agents. Specifically, their goal is to maximize the expected reward, defined as
$$J(\theta) = \mathbb{E}_{\pi_\theta(a|s)}\left[R(s, a)\right], \quad (1)$$
where $R(s, a)$ is the reward of selecting $a$ given $s$, and $\pi_\theta(a|s)$ is the policy learned by the agents, which will be defined formally in Sec. 4. Given a training sequence sampled from $\pi_\theta$, $\{(s_1, a_1, r_1), \ldots, (s_T, a_T, r_T)\}$ with $r_t = R(s_t, a_t)$, the parameters are updated according to the REINFORCE gradient:
$$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t), \quad (2)$$
where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted accumulated reward with discount factor $\gamma$.
According to Eq. (2), REINFORCE updates the parameters at step $t$ only when $G_t$ is non-zero. In other words, the value of $\gamma$ determines how the parameters are updated and to what extent the intermediate states are influenced by future rewards. If $\gamma > 0$, it is easy to verify that, for positive training sequences, the $G_t$ of all states will be non-zero. Thus, the intermediate states will be positively rewarded, and the model parameters will be updated by the gradients of the intermediate states. The value of $\gamma$ should therefore be selected carefully for each task. For the extractor, we set $\gamma = 0$ to avoid policy updates on zero-rewarded state-action experiences, because such experiences are mostly negative examples. Specifically, if a state of the extractor is zero-rewarded, we can infer that either the suggested edge was not selected by the reasoner, or the selected edge did not contribute to reaching the target. We do not want the model to be updated on such experiences, so we set $\gamma = 0$ to remove the influence of future rewards. In contrast, we set $\gamma = 1$ for the reasoner because all the intermediate selected edges are meaningful as long as they finally lead to the target.
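The effect of the discount factor can be checked with a short worked example. The sketch below computes the discounted accumulated reward for every step of a toy reward sequence; the sequences themselves are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    # Accumulate from the last step backwards.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Reasoner: only the terminal step is rewarded; gamma = 1 credits every step.
reasoner_returns = discounted_returns([0, 0, 1], gamma=1.0)

# Extractor: only the step whose suggestion was used is rewarded;
# gamma = 0 keeps the other steps at zero so they trigger no update.
extractor_returns = discounted_returns([0, 1, 0], gamma=0.0)

# For contrast: with gamma = 1 the extractor's unrewarded first step
# would also receive credit from the later reward.
extractor_returns_gamma1 = discounted_returns([0, 1, 0], gamma=1.0)
```

With these inputs, the reasoner's returns are all 1, the extractor's returns with zero discount match its step-wise rewards, and a discount of 1 would spread credit onto the first extractor step.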

Model Implementation
In this section, we introduce the policy network architectures (cf. Fig. 3) of the two agents and provide details on model training and inference.

Policy Network Architectures
Reasoning Agent. We construct the state embedding by concatenating all related embeddings, $s_R^t = [e_s, r_q, h_t]$, where $h_t = \mathrm{LSTM}(h_{t-1}, [r_t, e_t])$. We construct the action embedding by concatenating the relation-entity embedding pair, i.e., $a_R^t = [r, e]$ with $(e_t, r, e) \in G \cup C$. We stack all action embeddings in $A_R^t$. The policy network is defined as $\pi_\theta(a_R^t \mid s_R^t) = \sigma\left(A_R^t W_2\, \mathrm{ReLU}(W_1 s_R^t)\right)$, where $\sigma$ is softmax, and $W_1$, $W_2$ are learnable weights.
Fact Extraction Agent. We use PCNN-ATT as the sentence encoder in our experiments to construct the distributed representations of the sentences. Let $b_{e_t}$ denote all the sentence bags labeled by $(e_t, e')$, $e' \in E$. At time $t$, we input $b_{e_t}$ into PCNN-ATT to obtain the sentence-bag-level embeddings $E_b^t$, which are regarded as the latent state embeddings. As mentioned in Sec. 2, the object entity has been labeled beforehand. We need to select the best relation first, then select the best entity under this relation. Thus, we stack the relation embeddings in $A_E^t \in \mathbb{R}^{|A_E^t| \times d}$, where $d$ is the dimension of the relation embedding. The policy network is defined formally as $\pi_\theta(a_E^t \mid s_E^t) = \sigma\left(A_E^t W E_b^t\right)$, where $W$ is a learnable weight. The extractor predicts a score for each sentence bag regarding the relation. We select the sentence bag with the highest score; more formally, the corresponding relation-entity pair in that sentence bag is chosen as the next action. We train the agents as introduced in Sec. 3.3.
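As a rough illustration of how a policy of this kind scores stacked action embeddings against a state embedding, here is a dependency-free Python sketch. The two-layer form with a ReLU between the learnable weights is an assumption in the style of MINERVA-like policies, not a confirmed detail of the paper, and all dimensions, weights, and function names are toy values chosen for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def reasoner_policy(state, actions, w1, w2):
    """Distribution over actions: softmax(A (W2 ReLU(W1 s))) (assumed form)."""
    projected = matvec(w2, relu(matvec(w1, state)))
    # Dot product of each stacked action embedding with the projected state.
    scores = [sum(a * p for a, p in zip(action, projected)) for action in actions]
    return softmax(scores)


identity = [[1.0, 0.0], [0.0, 1.0]]
state = [1.0, 0.0]                   # toy state embedding
actions = [[1.0, 0.0], [0.0, 1.0]]   # toy stacked action embeddings
probs = reasoner_policy(state, actions, identity, identity)
```

With identity weights the first action is aligned with the state and therefore receives the larger probability.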

Model Training and Inference
Training. We use model pre-training and adaptive sampling to increase training efficiency. In particular, we first train the reasoner on the original KG to get a better initialization. Similarly, we train the extractor on the corpus labeled by distant supervision. Next, we use adaptive sampling to adaptively increase the selection priority of corpus-extracted edges when generating training experiences for the two agents. Adaptive sampling is designed to encourage the reasoner to explore new edges more and to facilitate collaboration during joint training. Replay memories (Mnih et al., 2013) are also used to increase training efficiency. To conduct ablation studies, we develop several model variants, such as removing adaptive sampling or replay memory, or freezing the extractor throughout training. Please see the Supplementary Material for more details.
Inference. At inference (reasoning) time, we use the trained model to predict missing facts via path finding. The process is similar to the experience generation step in the training stage, i.e., the reasoner performs path inference while the extractor continually suggests edges from the corpus. The only differences are that we do not request rewards, and that we use beam search to generate multiple reasoning paths over the graph and rank them by the scores from the reasoner.
Datasets. We construct two datasets for evaluation: FB60K-NYT10 and UMLS-PubMed. The FB60K-NYT10 dataset includes the FB60K KG and the NYT10 corpus; the UMLS-PubMed dataset contains the UMLS KG and the PubMed corpus.
Statistics of both datasets are summarized in Table 1. We find that the relation distributions of the two datasets are very imbalanced. There are not enough reasoning paths for some relation types. Moreover, some relations are meaningless and of no reasoning value. Thus, we select a few meaningful and valuable relations (suggested by domain experts, in Table 2) with enough reasoning paths and construct two subgraphs accordingly. To show the impact of graph size, we sub-sample the KG into different subgraphs. Specifically, for the two datasets, we first partition the whole KG into three parts in the proportion 8:1:1 (training set, validation set, and test set). Next, we create sub-training sets with different ratios via random sampling.
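The partitioning and sub-sampling procedure can be sketched as below. The function names, the use of Python's `random` module, and the seed values are assumptions for illustration; the paper does not specify how the split was implemented.

```python
import random


def split_triples(triples, seed=0):
    """Shuffle and partition triples into train/valid/test with ratio 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def subsample_train(train, ratio, seed=0):
    """Randomly keep a fraction `ratio` of the training triples."""
    rng = random.Random(seed)
    return rng.sample(train, round(len(train) * ratio))


# Toy triples standing in for a real KG.
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(100)]
train, valid, test = split_triples(triples)
train_20 = subsample_train(train, 0.2)
```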
Analysis of corpus-KG alignment. We analyze the information overlap (i.e., alignment) between the corpus and the KG in Table 1. The CT/CE (the ratio of triple quantity to entity quantity) of PubMed is far higher than that of NYT10. A higher CT/CE indicates that adding corpus edges to the KG increases the average degree more significantly, leading to a larger reduction in sparsity. The low CR/KR ratio of FB60K-NYT10 indicates that the overlap between FB60K and NYT10 is lower than that between UMLS and PubMed. We conclude that the alignment level of FB60K-NYT10 is lower than that of UMLS-PubMed. Intuitively, FB60K-NYT10 is a more difficult dataset than UMLS-PubMed.
Compared Algorithms. We compare our algorithm with (1) SOTA methods for KG embedding; (2) methods for joint text and graph embedding; and (3) neural graph reasoning methods.
For triple-ranking-based KG embedding methods, we evaluate DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), and ConvE (Dettmers et al., 2018). For joint text and graph embedding methods, we evaluate RC-Net (Xu et al., 2014) and Joint-NRE (Han et al., 2018). We also construct a baseline, TransE+LINE, by constructing a word-entity co-occurrence network as RC-Net does. We use LINE (Tang et al., 2015) and TransE (Bordes et al., 2011) to jointly learn the entity and relation embeddings to preserve the structure information within the co-occurrence network and the KG. For neural graph reasoning methods, we use MINERVA (Das et al., 2017), a reinforcement-learning-based path reasoning method.
To validate the effectiveness of the fact extraction policy in CPL, we design a two-step baseline (i.e., Two-Step). It first uses PCNN-ATT to extract relational triples from the corpora, and augments the KG with the triples whose prediction confidences are greater than a threshold. PCNN-ATT (Lin et al., 2016) is a fact extraction model, which completes the fact extraction part. We tune the threshold on the dev set. Then, a MINERVA model is trained on the augmented KG for reasoning.
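The augmentation step of the Two-Step baseline amounts to a static confidence filter, sketched below. The function name `two_step_augment`, the toy triples, and the confidence values are illustrative assumptions; in the baseline the confidences would come from PCNN-ATT.

```python
def two_step_augment(kg_triples, scored_triples, threshold):
    """Add every extracted triple whose confidence exceeds the threshold.

    scored_triples: iterable of (triple, confidence) pairs.
    """
    kept = {triple for triple, confidence in scored_triples if confidence > threshold}
    return set(kg_triples) | kept


kg = {("Miami", "located_in", "Florida")}
scored = [
    (("Florida", "part_of", "USA"), 0.9),
    (("Miami", "capital_of", "USA"), 0.3),
]
augmented = two_step_augment(kg, scored, threshold=0.5)
```

Unlike CPL, this filter is applied once before reasoning and does not depend on the reasoner's current location or feedback.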
CPL is our full-fledged model as introduced in Sec. 3. For all the methods, we upload the source code and list the hyper-parameters we used in the supplemental materials.

Evaluation and Experimental Setup
Following previous work on KG completion (Bordes et al., 2011), we use Hits@K and mean reciprocal rank (MRR) to evaluate the effectiveness of KGR and OKGR. Given a query $(e_s, r, ?)$ (for each triple in the test set, we pretend not to know the object entity), we rank the correct entity $e_q$ among a list of candidate entities. Suppose $rank_i$ is the rank of the correct answer entity for the $i$-th query and $N$ is the number of queries. We define $\text{Hits@}K = \frac{1}{N} \sum_i \mathbb{1}(rank_i \le K)$ and $\text{MRR} = \frac{1}{N} \sum_i \frac{1}{rank_i}$.
In our experiments, we use a held-out validation set for all compared methods to search for the best hyper-parameters and the best model for testing (via logging checkpoints). For all methods, we train models using three fixed random seeds (55, 83, 5583), and report the averaged metrics. More details on model training can be found in the Supplementary Material.
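For reference, both metrics can be computed from the list of ranks of the correct answers, as in the minimal sketch below. It assumes 1-indexed ranks and counts a hit when the rank is at most K, which is the usual convention; the function names and the example ranks are illustrative.

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries (ranks are 1-indexed)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 3, 12]  # toy ranks of the correct entity for three queries
hits10 = hits_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
```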

Performance Comparison
The KG reasoning performance of all algorithms is given in Tables 3 and 4 and Figure 6. We can draw the following conclusions:
1. Triple ranking vs. path inference. CPL and MINERVA perform worse than triple-ranking methods when the KGs are small, but outperform them significantly when more triples are added to the KGs (Figure 6). This is because sparse KGs do not contain enough general and evidential paths for reasoning, so path-based models cannot capture the underlying patterns.
2. CPL vs. joint embedding methods. CPL is inferior to RC-Net, TransE+LINE, and Joint-NRE on small KG partitions because those methods are not path-based and the connections on small KGs are too sparse. CPL outperforms them significantly on larger datasets. The reasons are two-fold: 1) the graphs are dense enough to provide sufficient reasoning paths for training; 2) the other algorithms do not filter noisy text information in joint training.
3. CPL vs. other graph reasoning methods. CPL outperforms MINERVA significantly in most settings because CPL makes use of relevant text information for prediction. The exception is the full FB60K-NYT10, where MINERVA is better than CPL because the alignment between FB60K and NYT10 is very limited (Sec. 5.1). The graph is dense at 100%, and the benefits from the corpus information are indiscernible.

Performance Analysis
1. Ablation Study on Model Components. In CPL, we apply multiple learning techniques to improve the performance, including collaboration, replay memory, and adaptive sampling as introduced in Sec. 4. To show the effects of different components, we remove them one by one and train the respective model variants. In addition, to show the effect of collaboration, we train a model variant with the parameters of the extractor frozen. The result is shown in Fig. 5.
Table 4: Performance comparison on KG reasoning on the FB60K-NYT10 dataset (Hits@5, Hits@10, and MRR at KG ratios of 20%, 50%, and 100%). We can observe similar performance trends as those on the UMLS-PubMed dataset.
From the result, we find that 1) replay memory is only effective when adaptive sampling is also enabled. This is because adaptive sampling alleviates the sparse positive sample problem to some extent, so there are enough positive experiences for replay. 2) Collaboration improves performance significantly. CPL with a trainable extractor performs better than with a frozen extractor, which means the suggestions of the extractor can be improved by the reasoner's feedback. 3) The improvement of CPL over MINERVA shrinks as we increase the KG size. This is because with more data for training, the graph becomes denser, and hence the contribution from texts is diluted.

Effectiveness of Fact Selection.
As mentioned above, Two-Step is the naive solution to OKGR. The best performing Two-Step model adds tens of times more edges to the KG than CPL, yet its performance is inferior to CPL and MINERVA on all the datasets (Tables 3 and 4). The reasons are that 1) most of the extracted edges used in the Two-Step model are noisy; 2) adding so many edges significantly enlarges the exploration space for reasoning.

We perform a case study on the FB60K-NYT10 dataset to show the effectiveness of dynamic fact filtering. We check the reasoning performance of MINERVA and CPL periodically during training. The result is shown in Figure 4. We find a few interesting points: 1) the sug edge/pos path ratio curve in Figure 4 suggests that the extractor's contribution increases as training progresses; 2) CPL has a high initial performance because adaptive sampling generates sufficient positive training experiences quickly; 3) the valley in the performance curve arises because the agent has not yet learned a stable exploration policy when adaptive sampling stops, and adaptive sampling has somewhat distorted the true pattern distribution in the dataset. With a good start, however, the agent can explore on its own to approach the true distribution.

Case Study of Reasoning Paths
We randomly sample some reasoning paths from the inference results of CPL as examples. Due to the space limit, please refer to the supplemental materials for these examples. These examples show 1) how the reasoner finds the path patterns for the respective relations; 2) how the reasoner finds the inference paths according to the patterns; 3) how the extractor suggests relevant edges for each positive path; 4) how the extractor extracts the relevant facts from related sentences. In summary, these cases show how CPL performs interpretable knowledge graph reasoning (inferring the query entity through semantically related path search) and how CPL performs interpretable fact filtering (suggesting edges w.r.t. the learned reasoning path patterns).
Joint Embedding of Text and KG. Joint embedding methods aim to unite text corpora and KGs. Contrary to our focus, they mainly utilize KGs to improve performance on other tasks. Toutanova et al. (2015) focus on fact extraction on a corpus labeled via dependency parsing with the aid of KG and word embeddings, while Han et al. (2016) conduct the same task with the raw corpus text. As a newer joint model developed from Han et al. (2016), Han et al. (2018) deal with fact extraction by employing mutual attention.
Open-World KG Completion. Several works focus on topics similar to ours. Shi and Weninger (2018) define an Open-World KG Completion problem, in which they complete the KG with unseen entities. Friedman and Broeck (2019) introduce Open-World Probabilistic Databases, an analogy to KGs. Unlike our setting, they try to complete the KG with logical inferences without extra information. Sun et al. (2018) propose an open, incomplete KB (or KG) environment with text corpora, but they focus on extracting answers from question-specific subgraphs.

Conclusion
In this paper, we focus on a new task named Open Knowledge Graph Reasoning, which aims at boosting knowledge graph reasoning with new knowledge extracted from a background corpus. We propose a novel and general framework, namely Collaborative Policy Learning, for this task. CPL trains two collaborative agents, the reasoner and the fact extractor, which learn the path-reasoning policy and the relevant-fact-extraction policy, respectively. CPL can perform efficient interpretable reasoning on the KG and filtering of noisy facts. Experiments on two large real-world datasets demonstrate the strengths of CPL. CPL can incorporate different path-finding modules, such as Multi-Hop with reward shaping by ConvE or RotatE, and thus its performance can improve as the module improves.

A Supplemental Material
A.1 An Implementation of CPL
CPL is a general framework whose components, the two agents, are both replaceable. In our experiments, we modify MINERVA (Das et al., 2017) to construct the reasoner, and modify PCNN-ATT (Lin et al., 2016) to construct the fact extractor.
Here we introduce in detail a specific implementation of CPL based on PCNN-ATT and MINERVA (see also Fig. 3 for an illustration).
Fact Extractor. PCNN-ATT is an effective relation extraction approach consisting mainly of two parts: the sentence encoder and the attention selector. The sentence encoder encodes each sentence into a vector given the labeled entity pair and their positions in the sentence. We organize the sentences into sentence bags. The sentences in the same bag share the same entity-pair label ($x_\omega = (e_s, e_o)$). For each sentence bag, we modify PCNN-ATT to produce a predictive probability distribution over all relations in the vocabulary: $\Phi(\omega) = F_{\mathrm{pcnn}}(x_\omega)$. Suppose at time step $t$ during inference, the reasoner is at entity $e_t$. We need to suggest several edges pointing from $e_t$ to different entities to enrich the reasoner's action space. We use PCNN-ATT to make predictions on the sentence bags whose labels all contain $e_t$ and obtain a set of distributions, one per bag. This distribution set can be seen as a score set over different edges. We can define a stochastic policy based on the scores by sampling the edges according to the scores. In the original PCNN-ATT setting, the score indicates the confidence of linking $e_s$ and $e_o$ w.r.t. the respective predicted relation. According to the previous reward definition, we can construct the policy distribution over all candidate edges from these output scores via softmax and provide the reasoner with the most relevant edges.
Graph Reasoner. To extend MINERVA as our graph reasoner, we adopt random action dropout (randomly dropping KG edges) to unite the KG edges and corpus-extracted edges into a joint action space of fixed size. Specifically, for a query triple $(e_s, r_q, e_q)$, the reasoner predicts $e_q$ by finding a path from $e_s$ to $e_q$ w.r.t. $r_q$. At time step $t$, the observed state $s_t$ is $(e_s, r_q, e_t, h_t)$ as defined before. The history before $t$ is a sequence of edges, which is encoded into the vector $h_t$ with an LSTM. The reasoner selects one edge from the joint action space defined above w.r.t. $e_t$. The MINERVA model $F_{\text{mine}}$ takes in the state embedding and outputs softmax scores over the actions (out-edges). We then adopt adaptive sampling (discussed below) to select the action to proceed with.

4: for 0 to max-batches do
5: if $e < e_a$ then
6: Generate training sequences with adaptive sampling.

The lack of positive training samples is a common challenge for most RL algorithms. We use two techniques to accumulate positive experiences for the agents: model pre-training and adaptive sampling.
i) Model Pre-training. To obtain a proper initialization, we pre-train the fact extractor and the reasoner. In this way, at the beginning of the joint training, we can expect the agents to generate plausible experiences immediately.
ii) Adaptive Sampling. The policy learned by the agents can be regarded as a distribution over actions given the states. Usually, we sample actions multiple times according to this distribution to generate multiple experiences. In the pre-training stage, the reasoner is unaware of the facts in the texts, so it tends to ignore the new facts suggested by the extractor. To facilitate interactions between the two agents and encourage exploration, we reshape the distribution so that the extracted edges are chosen with higher probability. Specifically, at time step $t$, the action space of the reasoner is the union of KG edges and extracted edges, i.e., $A_t = \{(r, e) \mid (e_t, r, e) \in \text{KG}\} \cup \{(r', e') \mid (e_t, r', e') \in \text{corpus } C\}$. The reasoner scores all the actions in $A_t$, and we increase the scores of extracted edges adaptively so that they have higher priority than the KG edges. However, we cannot keep this priority all the time, because it distorts the true data distribution. Hence, after a number of iterations, we stop the adaptive sampling and sample directly from the policy distribution.
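One way to picture adaptive sampling is the sketch below. The text does not specify how the scores of extracted edges are increased, so the additive `bonus` on the logits, the function names, and the epoch-based switch are assumptions made purely for illustration.

```python
import math
import random


def adaptive_policy(scores, is_extracted, epoch, adaptive_epochs, bonus=2.0):
    """Return a sampling distribution over the joint action space A_t.

    scores: reasoner logits, one per action in A_t.
    is_extracted: parallel list of bools, True for corpus-extracted edges.
    bonus: hypothetical additive boost for extracted edges (an assumption).
    """
    if epoch < adaptive_epochs:
        # Early in training, give extracted edges priority over KG edges.
        scores = [s + bonus if ext else s for s, ext in zip(scores, is_extracted)]
    # Afterwards the reasoner's own scores are used unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(actions, probs, rng):
    """Draw one (relation, entity) action according to probs."""
    return rng.choices(actions, weights=probs, k=1)[0]


# Hypothetical joint action space: one KG edge and one extracted edge.
actions = [("r_kg", "e1"), ("r_text", "e2")]
probs = adaptive_policy([1.0, 1.0], [False, True], epoch=0, adaptive_epochs=5)
action = sample_action(actions, probs, random.Random(0))
```

With equal logits, the extracted edge is favoured while `epoch < adaptive_epochs` and the two edges become equally likely once adaptive sampling stops.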
To increase exploration efficiency, the fact extractor samples multiple edges from its learned policy and adds them to the reasoner's joint action space (Fig. 3). We collect the experiences with the above techniques and store them in two separate replay memories (Mnih et al., 2013), one for each agent.
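A replay memory of the kind referred to above can be sketched as a bounded buffer with uniform sampling. The class name `ReplayMemory`, the capacity value, and the experience format are assumptions for illustration; the text does not describe these details.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer; the oldest experiences are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng):
        """Uniformly sample up to batch_size stored experiences."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# One memory per agent, as in the text; the capacity is a placeholder value.
reasoner_memory = ReplayMemory(capacity=10000)
extractor_memory = ReplayMemory(capacity=10000)

# Hypothetical experience format: (state, action, reward).
reasoner_memory.push((("e_s", "r_q", "e_t"), ("r", "e_next"), 1.0))
extractor_memory.push((("e_t",), ("r_text", "e_new"), 1.0))
```

Keeping the memories separate lets each agent be updated from its own experiences at its own schedule.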

A.3.1 Datasets and Codes
We study the datasets and find that the relation distributions of the two datasets are very imbalanced. There are not enough reasoning paths for some relation types. Moreover, some relations are meaningless and of no reasoning value. We therefore select a subset of the relations for each dataset as the reasoning tasks. For these relations, there are enough reasoning paths for the path-based models to learn from. According to the domain experts we interviewed, they are also informative and of broad interest. The details are in Table 2. Specifically, we first randomly divide each dataset into training, validation, and test sets in the proportion 8 : 1 : 1. Then we keep only the triples of the selected relations in the validation and test sets.
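As a minimal sketch of this split-then-filter procedure: the function name `split_and_filter`, the (head, relation, tail) tuple format, and the fixed seed are assumptions for illustration and are not taken from the released preprocessing code.

```python
import random


def split_and_filter(triples, kept_relations, seed=0):
    """Randomly split triples 8:1:1, then filter validation and test sets.

    triples: list of (head, relation, tail) tuples.
    kept_relations: set of relation names selected as reasoning tasks.
    """
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)

    n = len(shuffled)
    train_end = n * 8 // 10
    valid_end = n * 9 // 10

    train = shuffled[:train_end]
    valid = shuffled[train_end:valid_end]
    test = shuffled[valid_end:]

    # The training set keeps every relation; only validation and test
    # are restricted to the selected relations.
    valid = [t for t in valid if t[1] in kept_relations]
    test = [t for t in test if t[1] in kept_relations]
    return train, valid, test
```

Because filtering happens after the split, the validation and test sets end up smaller than one tenth of the data whenever unselected relations are present.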
embedding sizes and hidden sizes on the MINERVA training and our model training. To obtain our results, we train the model for 400 iterations with a batch size of 64 on FB60K. For UMLS, we set the number of iterations to 1000 and the batch size to 64.
To better reflect the models' capabilities, reverse-edge triples are added for all MINERVA-based models. Considering the inevitable fluctuations of this reinforcement learning model, we use three random seeds (55, 83, and 5583) to initialize training and report the average over the three runs.
Our model. In total, we train for 400 iterations on FB60K (due to time constraints) and 1000 on UMLS. For the first 200 iterations, we use BFS to search for positive paths, with higher priority on PCNN-ATT-suggested edges. In each BFS iteration, 100 samples are selected. The learning rate is set to 0.001 and the batch size to 64, the same as for MINERVA.
A.4 Case Study
1. Two-step is the naive solution to OKGR. For the two-step model, we filter the corpus edges by the output scores (in [0, 1]) of PCNN-ATT. A threshold of 0 means adding all the edges to the KG, while a threshold of 1 means adding nothing. We find that the best threshold (the one producing the best reasoning model) is 0.5 for UMLS-PubMed and 0 for FB60K-NYT10. Two-step adds about 85,000 edges to UMLS and 90,000 to FB60K under the corresponding thresholds, whereas CPL adds about 8,000 edges to UMLS and 1,500 to FB60K.
The two-step model is inferior to both CPL and MINERVA on all datasets (Tables 3 and 4). The reasons are that 1) most of the extracted edges used in the two-step model are noise; and 2) adding so many edges significantly enlarges the search space for reasoning, so selecting the correct out-edge at each step becomes more difficult. Lacking sufficient positive experiences, the two-step model cannot learn the underlying patterns well within the same number of iterations.
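The threshold filtering used by the two-step baseline can be sketched as below. The function name, the edge format, and the strict inequality are assumptions; the strict comparison is chosen so that, for scores strictly between 0 and 1, a threshold of 0 keeps every edge and a threshold of 1 keeps none, matching the description above.

```python
def filter_edges_by_threshold(scored_edges, threshold):
    """Keep corpus edges whose extractor score exceeds the threshold.

    scored_edges: list of ((head, relation, tail), score) pairs with
    scores assumed to lie strictly between 0 and 1.
    """
    return [edge for edge, score in scored_edges if score > threshold]


# Hypothetical scored edges, for illustration only.
scored_edges = [
    (("e1", "r", "e2"), 0.9),
    (("e1", "r", "e3"), 0.4),
    (("e2", "r", "e4"), 0.6),
]
kept = filter_edges_by_threshold(scored_edges, threshold=0.5)
```

Unlike CPL, this filter is fixed before reasoning starts and receives no feedback from the reasoner about which edges were actually useful.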
3. Figure 7 shows inference cases randomly sampled from the FB60K-NYT10 dataset. We select three relations and randomly sample several query cases from the test set. We trace the inference path for each query case and mark the edges suggested by the extractor. Further, we trace back to the raw text to pick out the sentences from which the extractor extracts the relevant facts. For example, for the query triple (gorgonzola, /location/location/contains_inv, m.0bzty), the concerned relation is "/location/location/contains_inv".

Figure 1: Illustration of the Knowledge Graph Reasoning Task. Given an entity (e.g., Miami) and a query relation (e.g., located in), we learn to infer reasoning paths over the existing graph structure to help predict the answer entity (i.e., USA).

Figure 4: KG reasoning performance change w.r.t. time. "sug edge/pos path" denotes the ratio of positive edges suggested by the extractor to the positive paths found by the reasoner.

Figure 5: Ablation study on the UMLS-PubMed dataset. CPL 1 denotes CPL without adaptive sampling and with the extractor frozen during training. CPL 2 denotes CPL without adaptive sampling. CPL denotes our proposed final model (with all the components).

Figure 6: KG reasoning performance change w.r.t. the size of the graph. Triple-ranking-based methods perform well on smaller partitions but are soon surpassed by path-based KG reasoning methods as the graph size increases.
Training pipeline.A formal training algorithm is given in Algorithm 1. Algorithm 1 CPL(G, C, b r , b e , p l , e a , e m ) Require: Knowledge graph G, corpus C, # of batches training the reasoner b r , # of batches training the extractor b e , hyper parameters for learning p l , # of epochs applying adaptive sampling e a , maximal epochs e m .Ensure: CPL model 1: Initialize the reasoner and extractor.2: Register Adam optimizer with p l .3: for e = 0: e m do 4: techniques we will introduce a few techniques we use to increase training efficiency.
Figure 7: A case study of discovered paths on FB60K-NYT10. We randomly pick three relations and show how CPL performs reasoning based on the KG and the text corpus. Red text marks the relations. [xxx]-[xxx] represents [subject entity]-[object entity]. The bold italic words in the sentences mark where the relations are extracted.
Figure 3: Detailed model design of Collaborative Policy Learning (CPL), taking PCNN-ATT as an example of the sentence encoder. The figure shows how CPL works at a certain inference time step $t$: the reasoner is at entity $e_t$ and selects one edge from the joint action space, which consists of the new edges extracted by the extractor and the edges in the original graph.

Table 1: The dataset information. #triples(C) and #triples(G) denote the number of triples in the corpus and the KG, respectively, and so on. S(train) denotes the number of sentences in the training corpus, while S(test) denotes the number of sentences in the testing corpus. CT/CE denotes the triple-entity ratio. A lower triple-entity ratio indicates that fewer triples per entity can be extracted from the corpus on average. CR/KR denotes the ratio of the number of corpus relations to the number of KG relations. A lower CR/KR indicates less information overlap between the corpus and the KG.

Table 2: The selected relations in the two datasets. UP denotes the UMLS-PubMed dataset, while FB denotes the FB60K-NYT10 dataset.

Relation Path pattern Query triple case Inference path Edge suggested by PCNN Related text
Relation: /people/person/place_lived_inv
Related text: in his later years , pomus became an elder statesman in the new_york_city songwriter set , a larger-than-life connection to a lost era , knocking around town with dr._john and lou reed before succumbing to lung cancer in 1991 .

Relation: neighborhood/neighborhood_of
Related text: the st._george neighborhood near the ferry to manhattan is the closest thing to a downtown district in the borough , but it lacks the vibrancy of other sections of new_york_city that have become havens for young professionals and artists , said jonathan bowles , who wrote the study for the center for an urban future , a public policy group .

Table 5: Performance variance of KG reasoning on the FB60K-NYT10 dataset. Reinforcement learning methods do suffer from variance across runs.

Table 6: Performance variance of KG reasoning on the UMLS-PubMed dataset.