Knowledge Base Completion via Coupled Path Ranking

Knowledge bases (KBs) are often greatly incomplete, creating a need for KB completion. The path ranking algorithm (PRA) is one of the most promising approaches to this task. Previous work on PRA usually follows a single-task learning paradigm, building a prediction model for each relation independently with its own training data. It ignores meaningful associations among certain relations, and might not get enough training data for less frequent relations. This paper proposes a novel multi-task learning framework for PRA, referred to as coupled PRA (CPRA). It first devises an agglomerative clustering strategy to automatically discover relations that are highly correlated with each other, and then employs a multi-task learning strategy to effectively couple the prediction of such relations. As such, CPRA takes relation associations into account and enables implicit data sharing among relations. We empirically evaluate CPRA on benchmark data created from Freebase. Experimental results show that CPRA can effectively identify coherent clusters in which relations are highly correlated. By further coupling such relations, CPRA significantly outperforms PRA, in terms of both predictive accuracy and model interpretability.


Introduction
Knowledge bases (KBs) like Freebase (Bollacker et al., 2008), DBpedia (Lehmann et al., 2014), and NELL (Carlson et al., 2010) are extremely useful resources for many NLP tasks (Cucerzan, 2007; Schuhmacher and Ponzetto, 2014). They provide large collections of facts about entities and their relations, typically stored as (head entity, relation, tail entity) triples, e.g., (Paris, capitalOf, France). Although such KBs can be impressively large, they are still quite incomplete and miss crucial facts, which may reduce their usefulness in downstream tasks (West et al., 2014; Choi et al., 2015). KB completion, i.e., automatically inferring missing facts by examining existing ones, has thus attracted increasing attention. Approaches to this task roughly fall into three categories: (i) path ranking algorithms (PRA) (Lao et al., 2011); (ii) embedding techniques (Bordes et al., 2013; Guo et al., 2015); and (iii) graphical models such as Markov logic networks (MLN) (Richardson and Domingos, 2006). This paper focuses on PRA, which is easily interpretable (as opposed to embedding techniques) and requires no external logic rules (as opposed to MLN).
The key idea of PRA is to explicitly use paths connecting two entities to predict potential relations between them. In PRA, a KB is encoded as a graph consisting of heterogeneous edges, each labeled with the relation type that holds between the two entities it connects. Given a specific relation, random walks are first employed to find paths between entity pairs linked by that relation. Here a path is a sequence of relations linking two entities, e.g., h −bornIn→ e −capitalOf→ t.

While KBs are naturally composed of multiple relations, PRA models these relations separately during the inference phase, learning an individual classifier for each relation. We argue, however, that it would be beneficial for PRA to model certain relations in a collective way, particularly when the relations are closely related to each other. For example, given two relations bornIn and livedIn, there must be many paths (features) that are predictive for both, e.g., h −nationality→ e −hasCapital→ t. These shared features make the corresponding relation classification tasks highly related. Numerous studies have shown that learning multiple related tasks simultaneously (a.k.a. multi-task learning) usually leads to better predictive performance, profiting from the relevant information available in different tasks (Carlson et al., 2010; Chapelle et al., 2010). This paper proposes a novel multi-task learning framework that couples the path ranking of multiple relations, referred to as coupled PRA (CPRA). The new model needs to answer two critical questions: (i) which relations should be coupled, and (ii) in what manner they should be coupled.
As to the first question, it is obvious that not all relations are suitable to be learned together. For instance, modeling bornIn together with hasWife might not bring any real benefits, since there are few common paths between these two relations. CPRA introduces a common-path based similarity measure, and accordingly devises an agglomerative clustering strategy to group relations. Only relations that are grouped into the same cluster will be coupled afterwards.
As to the second question, CPRA follows the common practice of multi-task learning (Evgeniou and Pontil, 2004), and couples relations by using classifiers with partially shared parameters. Given a cluster of relations, CPRA builds the classifiers upon (i) relation-specific parameters to address the specifics of individual relations, and (ii) shared parameters to model the commonalities among different relations. These two types of parameters are balanced by a coupling coefficient, and learned jointly for all relations. In this way CPRA couples the classification tasks of multiple relations, and enables implicit data sharing and regularization.
The major contributions of this paper are as follows. (i) We design a novel framework for multi-task learning with PRA, i.e., CPRA. To the best of our knowledge, this is the first study on multi-task PRA. (ii) We empirically verify the effectiveness of CPRA on a real-world, large-scale KB. Specifically, we evaluate CPRA on benchmark data created from Freebase. Experimental results show that CPRA can effectively identify coherent clusters in which relations are highly correlated. By further coupling such relations, CPRA substantially outperforms PRA, in terms of not only predictive accuracy but also model interpretability. (iii) We compare CPRA and PRA to the embedding-based TransE model (Bordes et al., 2013), and demonstrate their superiority over TransE. As far as we know, this is the first work that formally compares PRA-style approaches to embedding-based ones on publicly available Freebase data.
In the remainder of this paper, we first review related work in Section 2, and formally introduce PRA in Section 3. We then detail the proposed CPRA framework in Section 4. Experiments and results are reported in Section 5, followed by the conclusion and future work in Section 6.

Related Work
We first review three lines of related work: (i) KB completion, (ii) PRA and its extensions, and (iii) multi-task learning, and then discuss the connection between CPRA and previous approaches.

KB completion. This task is to automatically infer missing facts from existing ones. Prior work roughly falls into three categories: (i) path ranking algorithms (PRA), which use paths that connect two entities to predict potential relations between them (Lao et al., 2011; Lao and Cohen, 2010); (ii) embedding-based models, which embed entities and relations into a latent vector space and make inferences in that space (Nickel et al., 2011; Bordes et al., 2013); and (iii) probabilistic graphical models such as the Markov logic network (MLN) and its variants (Pujara et al., 2013; Jiang et al., 2012). This paper focuses on PRA, since it is easily interpretable (as opposed to embedding-based models) and requires no external logic rules (as opposed to MLN and its variants).
PRA and its extensions. PRA is a random walk inference technique designed for predicting new relation instances in KBs, first proposed by Lao and Cohen (2010). Recently, various extensions have been explored, ranging from incorporating a text corpus as additional evidence during inference (Gardner et al., 2013; Gardner et al., 2014), to introducing better schemes for generating more predictive paths (Gardner and Mitchell, 2015; Shi and Weninger, 2015), to applying PRA in a broader context such as Google's Knowledge Vault (Dong et al., 2014). All these approaches are based on a single-task version of PRA, whereas our work explores multi-task learning for it.
Multi-task learning. Numerous studies have shown that learning multiple related tasks simultaneously can provide significant benefits relative to learning them independently (Caruana, 1997). A key ingredient of multi-task learning is to model the notion of task relatedness, through either parameter sharing (Evgeniou and Pontil, 2004; Ando and Zhang, 2005) or feature sharing (Argyriou et al., 2007; He et al., 2014). In recent years, there has been increasing work showing the benefits of multi-task learning in NLP-related tasks, such as relation extraction (Jiang, 2009; Carlson et al., 2010) and machine translation (Sennrich et al., 2013; Cui et al., 2013; Dong et al., 2015). This paper investigates the possibility of multi-task learning with PRA, in a parameter sharing manner.
Connection with previous methods. Modeling multiple relations collectively is actually common practice in embedding-based approaches. In such methods, embeddings are learned jointly for all relations over a set of shared latent features (entity embeddings), and hence can capture meaningful associations among different relations. However, as shown by Toutanova and Chen (2015), observed features such as PRA paths usually perform better than latent features for KB completion. In this context, CPRA is designed to obtain the multi-relational benefit of embedding techniques while keeping PRA-style path features. Nickel et al. (2014) and Neelakantan et al. (2015) have explored similar ideas. However, their work focuses on improving embedding techniques with observed features, while our approach aims at improving PRA with multi-task learning.

Path Ranking Algorithm
PRA was first proposed by Lao and Cohen (2010), and later slightly modified in various ways (Gardner et al., 2014; Gardner and Mitchell, 2015). The key idea of PRA is to explicitly use paths that connect two entities as features to predict potential relations between them. Here a path is a sequence of relations ⟨r_1, r_2, ⋯, r_ℓ⟩ that link two entities. For example, ⟨bornIn, capitalOf⟩ is a path linking SophieMarceau to France, through the intermediate node Paris. Such paths are then used as features to predict the presence of specific relations, e.g., nationality. A typical PRA model consists of three steps: feature extraction, feature computation, and relation-specific classification.
Feature extraction. The first step is to generate and select path features that are potentially useful for predicting new relation instances. To this end, PRA first encodes a KB as a multi-relation graph. Given a pair of entities (h, t), PRA then finds paths by performing random walks over the graph, recording those that start from h and end at t within a bounded length. More exhaustive strategies such as breadth-first (Gardner and Mitchell, 2015) or depth-first (Shi and Weninger, 2015) search can also be used to enumerate the paths. After that, a set of paths is selected as features, according to some precision-recall measure (Lao et al., 2011) or simply frequency (Gardner et al., 2014).
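As an illustration, the bounded-length path search described above can be sketched as follows (a minimal depth-first enumeration over a toy graph; entity and relation names are illustrative, and inverse edges labeled r^-1 are added as in PRA-style systems):

```python
from collections import defaultdict

def enumerate_paths(triples, h, t, max_len=3):
    """Enumerate relation paths from h to t with at most max_len edges.
    Each observed triple (s, r, o) also yields an inverse edge labeled
    r + '^-1', as PRA-style methods commonly do to improve connectivity."""
    graph = defaultdict(list)                  # node -> [(relation, neighbor), ...]
    for s, r, o in triples:
        graph[s].append((r, o))
        graph[o].append((r + "^-1", s))

    paths = []
    def dfs(node, path):
        if node == t and path:                 # record every arrival at the target
            paths.append(tuple(path))
        if len(path) == max_len:               # depth bound stops the recursion
            return
        for r, nxt in graph[node]:
            path.append(r)
            dfs(nxt, path)
            path.pop()
    dfs(h, [])
    return paths

triples = [("SophieMarceau", "bornIn", "Paris"), ("Paris", "capitalOf", "France")]
print(enumerate_paths(triples, "SophieMarceau", "France"))
# [('bornIn', 'capitalOf')]
```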
Feature computation. Once path features are selected, the next step is to compute their values. Given an entity pair (h, t) and a path π, PRA computes the feature value as a random walk probability p(t|h, π), i.e., the probability of arriving at t given a random walk that starts from h and follows exactly the relations in π. Computing these random walk probabilities can be quite expensive. Gardner and Mitchell (2015) recently showed that such probabilities offer no discernible benefits, and instead used a binary value to indicate the presence or absence of each path. Similarly, Shi and Weninger (2015) used the frequency of a path as its feature value. Besides paths, other features such as path bigrams and vector space similarities can also be incorporated (Gardner et al., 2014).
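Following the frequency-based variant of Shi and Weninger (2015), feature values can be computed by simply counting how often each selected path occurs between an entity pair. A sketch (`feature_index`, the list of selected paths, is a hypothetical name):

```python
from collections import Counter

def feature_vector(paths_for_pair, feature_index):
    """Frequency-valued features: how often each selected path occurs
    between the entity pair (0 if the path never occurs).
    A binary variant would use int(counts[p] > 0) instead."""
    counts = Counter(paths_for_pair)
    return [counts[p] for p in feature_index]

feats = [("bornIn", "capitalOf"), ("nationality",)]
print(feature_vector([("bornIn", "capitalOf"), ("bornIn", "capitalOf")], feats))
# [2, 0]
```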
Relation-specific classification. The last step of PRA is to train an individual classifier for each relation, so as to judge whether two entities should be linked by that relation. Given a relation and a set of training instances (i.e., pairs of entities that are or are not linked by the relation, with features selected and computed as above), one can train any kind of classifier. Most previous work simply chooses logistic regression.
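A minimal stand-in for the per-relation classifier: plain L2-regularized logistic regression trained by gradient descent on a toy feature matrix. Real PRA implementations typically rely on off-the-shelf solvers such as LIBLINEAR; this sketch only illustrates the single-task setup, with labels in {-1, +1} as in the paper:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=2000, l2=0.01):
    """L2-regularized logistic regression via batch gradient descent.
    Loss per instance: log(1 + exp(-y * (w.x + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margin = y * (X @ w + b)
        g = -y / (1.0 + np.exp(margin))        # d(loss)/d(margin) per instance
        w -= lr * ((X.T @ g) / n + l2 * w)
        b -= lr * g.mean()
    return w, b

# toy data: feature 0 is a predictive path, feature 1 is noise
X = np.array([[1.0, 0.2], [0.9, 0.8], [0.0, 0.5], [0.1, 0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_logistic(X, y)
scores = X @ w + b
print(scores)
```

On this separable toy set the learned scores agree in sign with the labels.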

Coupled Path Ranking Algorithm
As we can see, PRA (as well as its variants) follows a single-task learning paradigm, building a classifier for each relation independently with its own training data. We argue that such a single-task strategy might not be optimal for KB completion: (i) by learning the classifiers independently, it fails to discover and leverage meaningful associations among different relations; (ii) it might not perform well on less frequent relations, for which only a few training instances are available. This section presents coupled PRA (CPRA), a novel multi-task learning framework that couples the path ranking of multiple relations. Through this multi-task strategy, CPRA takes relation associations into account and enables implicit data sharing among relations.

Problem Formulation
Suppose we are given a KB containing a collection of triples O = {(h, r, t)}. Each triple is composed of two entities h, t ∈ E and their relation r ∈ R, where E is the entity set and R the relation set. The KB is then encoded as a graph G, with entities represented as nodes and each triple (h, r, t) as a directed edge labeled r from node h to node t. We formally define KB completion as a binary classification problem: given a particular relation r, for any entity pair (h, t) such that (h, r, t) ∉ O, we would like to judge whether h and t should be linked by r, by exploiting the graph structure of G. Let R ⊆ R denote the set of relations to be predicted.
Each relation r ∈ R is associated with a set of training instances. Here a training instance is an entity pair (h, t), with a positive label if (h, r, t) ∈ O and a negative label otherwise. 1 For each entity pair, path features can be extracted and computed using the techniques described in Section 3. We denote by Π_r the set of path features extracted for relation r, and define its training set as T_r = {(x_ir, y_ir)}. Here x_ir is the feature vector of an entity pair, with each dimension corresponding to a path π ∈ Π_r, and y_ir = ±1 is the label. Note that our primary goal is to verify the possibility of multi-task learning with PRA; exploring better feature extraction or computation is beyond the scope of this paper.
Given the relations and their training instances, CPRA performs KB completion using a multi-task learning strategy. It consists of two components: relation clustering and relation coupling. The former automatically discovers highly correlated relations, and the latter further couples the learning of these relations, described in detail as follows.

Relation Clustering
It is obvious that not all relations are suitable to be coupled. We propose an agglomerative clustering algorithm to automatically discover relations that are highly correlated and should be learned together. Our intuition is that relations sharing more common paths (features) are probably more similar in classification, and hence should be coupled.
Specifically, we start with |R| clusters, each containing a single relation r ∈ R, where |·| denotes the cardinality of a set. We then iteratively merge the two most similar clusters, say C_m and C_n, into a new cluster C. The similarity between two clusters is defined as the overlap between their feature sets:

sim(C_m, C_n) = |Π_{C_m} ∩ Π_{C_n}| / min(|Π_{C_m}|, |Π_{C_n}|),

where Π_{C_i} is the feature set associated with cluster C_i (if C_i contains a single relation, Π_{C_i} is the feature set associated with that relation). The larger the overlap is, the higher the similarity will be. Once two clusters are merged, we update the feature set associated with the new cluster:

Π_C = Π_{C_m} ∪ Π_{C_n}.

The algorithm stops when the highest cluster similarity falls below a predefined threshold δ; this paper empirically sets δ = 0.5. As such, relations sharing a substantial number of common paths are grouped into the same cluster.
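The clustering procedure can be sketched as follows. The similarity here is taken to be the overlap coefficient |Π_i ∩ Π_j| / min(|Π_i|, |Π_j|), one natural reading of the text's "overlap between two feature sets"; relation names and feature sets are toy examples:

```python
def cluster_relations(feature_sets, delta=0.5):
    """Greedy agglomerative clustering of relations by shared path features.
    feature_sets: dict relation -> non-empty set of path features.
    Merging stops once the best pairwise similarity drops below delta."""
    clusters = {(r,): set(fs) for r, fs in feature_sets.items()}
    while len(clusters) > 1:
        names = list(clusters)
        best, best_sim = None, delta
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                a, b = clusters[names[i]], clusters[names[j]]
                sim = len(a & b) / min(len(a), len(b))
                if sim >= best_sim:
                    best, best_sim = (names[i], names[j]), sim
        if best is None:                       # highest similarity below delta
            break
        m, n = best
        clusters[m + n] = clusters.pop(m) | clusters.pop(n)  # union of features
    return [set(name) for name in clusters]

fs = {"bornIn": {"p1", "p2", "p3"}, "livedIn": {"p2", "p3", "p4"},
      "hasWife": {"p9"}}
result = cluster_relations(fs)
print(result)   # bornIn/livedIn share 2 of 3 paths and merge; hasWife stays alone
```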

Relation Coupling
After clustering, the next step of CPRA is to couple the path ranking of different relations within each cluster, i.e., to learn the classification tasks for these relations simultaneously. We employ a multi-task classification algorithm similar to (Evgeniou and Pontil, 2004), and learn the classifiers jointly in a parameter sharing manner.

Consider a cluster containing K relations C = {r_1, r_2, ⋯, r_K}. Recall that during the clustering phase a shared feature set has been generated for that cluster, i.e., Π_C = Π_{r_1} ∪ ⋯ ∪ Π_{r_K}. We first re-represent the training instances for the K relations using this shared feature set, so that all training data lies in the same space. 2 We denote by T_k = {(x_ik, y_ik)}, i = 1, ⋯, N_k, the reformed training data associated with the k-th relation. Our goal is then to jointly learn K classifiers, one per relation.

We assume that the classifier for each relation has a linear form f_k(x) = w_k · x + b_k, where w_k ∈ R^d is the weight vector and b_k the bias. To model associations among different relations, we further assume that all w_k and b_k can be written, for every k ∈ {1, ⋯, K}, as:

w_k = w_0 + v_k,    b_k = b_0.

Here the shared w_0 models the commonalities among different relations, and the relation-specific v_k addresses the specifics of individual relations. If the relations are closely related (v_k ≈ 0), they will have similar weights (w_k ≈ w_0) on the common paths. We use the same bias b_0 for all the relations. 3 We estimate v_k, w_0, and b_0 simultaneously in a joint optimization problem, defined as follows.
Problem 1 CPRA amounts to solving the optimization problem:

min_{{v_k}, w_0, b_0}  Σ_{k=1}^{K} Σ_{i=1}^{N_k} ℓ(x_ik, y_ik) + (λ_1/K) Σ_{k=1}^{K} ∥v_k∥² + λ_2 ∥w_0∥²,

where ℓ(x_ik, y_ik) is the loss on a training instance. It can be instantiated into a logistic regression (LR) or support vector machine (SVM) version, by respectively defining the loss as:

ℓ(x_ik, y_ik) = log(1 + exp(−y_ik f_k(x_ik)))   (logistic loss), or
ℓ(x_ik, y_ik) = [1 − y_ik f_k(x_ik)]_+           (hinge loss).

We call the resulting models CPRA-LR and CPRA-SVM respectively.
In this problem, λ_1 and λ_2 are regularization parameters. By adjusting their values, we control the degree of parameter sharing among different relations. The larger the ratio λ_1/λ_2 is, the more we believe that all w_k should conform to the common model w_0, and the smaller the relation-specific weights v_k will be.
The multi-task learning problem can be directly linked to a standard single-task learning one, built on all training data from different relations.
Proposition 1 Suppose the training data associated with the k-th relation, for every k = 1, ⋯, K, is transformed into:

x̃_ik = ( (1/√(Kρ)) x_ik, 0, ⋯, 0, x_ik, 0, ⋯, 0 ) ∈ R^{(K+1)d},

where x_ik occupies the (k+1)-th block, 0 ∈ R^d is a vector whose coordinates are all zero, and ρ = λ_2/λ_1 is a coupling coefficient. Consider a linear classifier for the transformed data f(x̃) = w̃ · x̃ + b̃, with w̃ and b̃ constructed as:

w̃ = ( √(Kρ) w_0, v_1, ⋯, v_K ),    b̃ = b_0.

Then the objective function of Problem 1 is equivalent to:

Σ_{k=1}^{K} Σ_{i=1}^{N_k} ℓ̃(x̃_ik, y_ik) + λ ∥w̃∥²,

where ℓ̃ = log(1 + exp(−y_ik f(x̃_ik))) is a logistic loss for CPRA-LR, and ℓ̃ = [1 − y_ik f(x̃_ik)]_+ a hinge loss for CPRA-SVM; and λ = λ_1/K. That means, after transforming data from different relations into a unified representation, Problem 1 is equivalent to a standard single-task learning problem built on the transformed data from all the relations. It can thus easily be solved by existing LR or SVM tools.
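One way to realize this reduction in code is the block feature map below: each instance is padded into a (K+1)·d-dimensional space, with a scaled copy in the shared block (whose weights recover w_0) and an unscaled copy in its own task's block (whose weights recover v_k). The scaling constant √(Kρ) is one consistent choice; the exact constant depends on how the regularizers are normalized:

```python
import numpy as np

def transform(X_by_task, rho):
    """Map each task's instance matrix into a shared (K+1)*d space so that
    one linear model over the transformed data plays the role of
    (w_0, v_1, ..., v_K) jointly. X_by_task: list of (n_k, d) arrays."""
    K = len(X_by_task)
    d = X_by_task[0].shape[1]
    c = 1.0 / np.sqrt(K * rho)                 # scaling of the shared block
    out = []
    for k, X in enumerate(X_by_task):
        Z = np.zeros((X.shape[0], (K + 1) * d))
        Z[:, :d] = c * X                       # shared block (recovers w_0)
        Z[:, (k + 1) * d:(k + 2) * d] = X      # task-specific block (recovers v_k)
        out.append(Z)
    return np.vstack(out)

X1 = np.ones((2, 3))                           # task 1: two instances
X2 = 2 * np.ones((1, 3))                       # task 2: one instance
Z = transform([X1, X2], rho=0.5)
print(Z.shape)   # (3, 9)
```

Feeding Z (with the stacked labels) to any standard LR or SVM solver then solves the coupled problem in one shot.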

Experiments
In this section we present an empirical evaluation of CPRA on the KB completion task.

Experimental Setups
We create our data on the basis of FB15K (Bordes et al., 2011) 4 , a relatively dense subgraph of Freebase containing 1,345 relations and the corresponding triples.
KB graph construction. We notice that in most cases FB15K encodes a relation and its reverse relation at the same time. That is, once a new fact is observed, FB15K creates two triples for it, e.g., (x, film/edited-by, y) and (y, editor/film, x). Reverse relations provide no additional knowledge, and may even hurt the performance of PRA-style methods. Actually, to enhance graph connectivity, PRA-style methods usually add an inverse version of each relation in a KB automatically (Lao and Cohen, 2010; Lao et al., 2011): for each observed triple (h, r, t), another triple (t, r^-1, h) is constructed and added to the KB. Consider the prediction of a relation, say film/edited-by. In the training phase, we would probably find that every pair of entities connected by this relation is also connected by the path editor/film^-1, and hence assign an extremely high weight to that path. 5 However, in the testing phase, for any entity pair (x, y) such that (y, editor/film, x) has not been encoded, we might not even find that path, and hence would always make a negative prediction. 6 For this reason, we remove reverse relations from FB15K. Specifically, we regard r_2 as a reverse relation of r_1 if the triple (t, r_2, h) holds whenever (h, r_1, t) is observed, and we randomly discard one of the two relations. 7 As such, we keep 774 out of 1,345 relations in FB15K, covering 14,951 entities and 327,783 triples. We then build a graph based on this data and use it as input to CPRA (and our baseline methods).
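The reverse-relation filter can be sketched as below. This version requires the two relations to mirror each other exactly, a slightly stricter test than the one-directional "whenever" condition in the text; the triples are toy examples:

```python
from collections import defaultdict

def find_reverse_pairs(triples):
    """Detect relation pairs (r1, r2) whose instance sets exactly mirror
    each other, i.e., (h, r1, t) is observed iff (t, r2, h) is observed.
    One relation of each detected pair can then be discarded."""
    by_rel = defaultdict(set)
    for h, r, t in triples:
        by_rel[r].add((h, t))
    pairs = []
    rels = sorted(by_rel)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            mirrored = {(t, h) for h, t in by_rel[r1]}
            if mirrored == by_rel[r2]:
                pairs.append((r1, r2))
    return pairs

triples = [("m1", "film/edited-by", "p1"), ("p1", "editor/film", "m1"),
           ("m2", "film/edited-by", "p2"), ("p2", "editor/film", "m2")]
print(find_reverse_pairs(triples))
# [('editor/film', 'film/edited-by')]
```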
Labeled instance generation. We select 171 relations to test our methods. To do so, we pick 10 popular domains, including award, education, film, government, location, music, olympics, organization, people, and tv. Relations in these domains with at least 50 triples observed for them are selected. For each of the 171 relations, we split the associated triples into roughly 80% training, 10% validation, and 10% testing. Since the triple number varies significantly among the relations, we allow at most 200 validation/testing triples for each relation, so as to make the test cases as balanced as possible. Note that validation and testing triples are not used for constructing the graph.
We generate positive instances for each relation directly from these triples: given a relation r and a triple (h, r, t) observed for it (training, validation, or testing), we take the entity pair (h, t) as a positive instance for that relation. We then follow (Shi and Weninger, 2015; Krompaß et al., 2015) to generate negative instances. For each positive instance (h, t) we generate four negative ones, two by randomly corrupting the head h and two by randomly corrupting the tail t. To make the negative instances as difficult as possible, we corrupt a position using only entities that have appeared in that position. That is, given the relation capitalOf and the positive instance (Paris, France), we could generate a negative instance (Paris, UK) but never (Paris, NBA), since NBA never appears as a tail entity of the relation. We further ensure that the negative instances do not overlap with the positive ones.
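A sketch of this position-constrained negative sampling. For brevity, a draw that collides with a positive instance or with the original entity is simply skipped rather than resampled, so slightly fewer than four negatives may be produced per positive:

```python
import random

def corrupt(positive_pairs, n_head=2, n_tail=2, seed=0):
    """Generate negatives for one relation by corrupting heads/tails,
    drawing replacements only from entities already seen in that position,
    and discarding anything that collides with a positive pair."""
    rng = random.Random(seed)
    heads = sorted({h for h, _ in positive_pairs})
    tails = sorted({t for _, t in positive_pairs})
    pos = set(positive_pairs)
    negatives = []
    for h, t in positive_pairs:
        for _ in range(n_head):
            h2 = rng.choice(heads)
            if h2 != h and (h2, t) not in pos:
                negatives.append((h2, t))
        for _ in range(n_tail):
            t2 = rng.choice(tails)
            if t2 != t and (h, t2) not in pos:
                negatives.append((h, t2))
    return negatives

pos = [("Paris", "France"), ("London", "UK")]
negs = corrupt(pos)
print(negs)   # e.g. pairs such as ('London', 'France'), never ('Paris', 'NBA')
```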
Feature extraction and computation. Given the labeled instances, we extract path features for them using the code provided by Shi and Weninger (2015) 8 . It implements a depth-first search strategy that enumerates all paths between two entities; we set the maximum path length to ℓ = 3. For about 8.2% of the labeled instances no path could be extracted. We remove such cases, leaving on average about 5,250 training, 323 validation, and 331 testing instances per relation. We then remove paths that appear only once in each relation, yielding 5,515 features on average per relation. We simply compute the value of each feature as its frequency in an instance. Table 1 lists the statistics of the data used in our experiments.

Evaluation metrics. We use mean average precision (MAP) and mean reciprocal rank (MRR), following recent work on evaluating KB completion performance (West et al., 2014; Gardner and Mitchell, 2015). Both metrics evaluate a ranking process: if a method ranks the positive instances before the negative ones for each relation, it will get a high MAP or MRR.
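Per-relation average precision and reciprocal rank can be computed as below; MAP and MRR are then their means over the evaluated relations:

```python
def average_precision(scores, labels):
    """AP for one relation: rank instances by score (descending) and
    average the precision at each positive hit (labels in {1, -1})."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(scores, labels):
    """RR for one relation: 1 / rank of the first positive instance."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            return 1.0 / rank
    return 0.0

# a perfect ranking gives AP = RR = 1.0
print(average_precision([0.9, 0.8, 0.2], [1, 1, -1]),
      reciprocal_rank([0.9, 0.8, 0.2], [1, 1, -1]))
# positives at ranks 1 and 3 give AP = (1/1 + 2/3) / 2
print(average_precision([0.9, 0.8, 0.2], [1, -1, 1]))
```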
Baseline methods. We compare CPRA to traditional single-task PRA. CPRA first groups the 171 relations into clusters, and then learns classifiers jointly for relations within the same cluster. We implement two versions: CPRA-LR and CPRA-SVM. As shown in Proposition 1, both can be solved by standard classification tools. PRA learns an individual classifier for each relation, using LR or SVM, denoted by PRA-LR and PRA-SVM respectively. We use LIBLINEAR (Fan et al., 2008) 9 to solve the LR and SVM classification problems. For all these methods, we tune the cost c in the range {2^-5, 2^-4, ⋯, 2^4, 2^5}, and we tune the coupling coefficient ρ = λ_2/λ_1 of CPRA in the range {0.1, 0.2, 0.5, 1, 2, 5, 10}.
We further compare CPRA to TransE, a widely adopted embedding-based method (Bordes et al., 2013). TransE learns vector representations (i.e., embeddings) for entities and relations, and uses the learned embeddings to determine the plausibility of missing facts. Such plausibility scores can then be used to rank the labeled instances. We implement TransE using the code provided by Bordes et al. (2013) 10 . To learn embeddings, we take as input the triples used to construct the graph (from which CPRA and PRA extract their paths). We tune the embedding dimension in {20, 50, 100}, the margin in {0.1, 0.2, 0.5, 1, 2, 5}, and the learning rate in {10^-4, 10^-3, 10^-2, 10^-1, 1}. For details please refer to (Bordes et al., 2013). For each of these methods, we select the optimal configuration that leads to the highest MAP on the validation set and report its performance on the test set.

Table 2: Six largest clusters of relations (with the stopping criterion δ = 0.5): (i) film: film/casting-director, film/cinematography, film/costume-design-by, film/art-direction-by, film/crewmember, film/set-decoration-by, film/production-design-by, film/edited-by, film/written-by, film/story-by; (ii) location: gov-jurisdiction/dist-represent, location/contain, location/adjoin, us-county/county-seat, county-place/county, location/partially-contain, region/place-export; (iii) organization: org/place-founded, org/headquarter-city, org/headquarter-state, org/geographic-scope, org/headquarter-country, org/service-location; (iv) country: country/divisions, country/capital, country/fst-level-divisions, country/snd-level-divisions, admin-division/capital; (v) tv: tv/tv-producer, tv/recurring-writer, tv/program-creator, tv/regular-appear-person, tv/tv-actor; (vi) music: music-group-member/instrument, music-artist/recording-role, music-artist/track-role, music-group-member/role.

Relation Clustering Results
We first test the effectiveness of our agglomerative strategy (Section 4.2) in relation clustering. With the stopping criterion δ = 0.5, 96 of the 171 relations are grouped into clusters containing at least two relations; each of these 96 relations will later be learned jointly with some other relations. The remaining 75 relations cannot be merged and will still be learned individually. Table 2 shows the six largest clusters discovered by our algorithm, with relations in each cluster arranged in the order they were merged. The results indicate that our algorithm can effectively identify coherent clusters in which relations are highly correlated with each other. For example, the film cluster describes relations between a film and its crew members, and the organization cluster relations between an organization and a location.
During clustering we might obtain clusters that contain too many relations, and hence too many training instances, for our CPRA model to learn efficiently. We split such clusters into sub-clusters, either according to the domain (e.g., the film cluster and the tv cluster) or randomly (e.g., the two location clusters).

KB Completion Results
We further test the effectiveness of our multi-task learning strategy (Section 4.3) in KB completion. Table 3 gives the results on the 96 relations that are actually involved in multi-task learning (i.e., grouped into clusters with size larger than one). 11 These 96 relations are grouped into 29 clusters, and relations within the same cluster are learned jointly. Table 3 reports (i) MAP and MRR within each cluster and (ii) overall MAP and MRR on the 96 relations. Numbers marked in bold indicate that CPRA-LR/SVM outperforms PRA-LR/SVM, either within a cluster (with its ID listed in the first column) or on all the 96 relations (ALL). We judge the statistical significance of the overall improvements achieved by CPRA-LR/SVM over PRA-LR/SVM and TransE using a paired t-test, with the average precision (or reciprocal rank) on each relation as paired data. The symbol "**" indicates a significance level of p < 0.0001, and "*" a significance level of p < 0.05.
From the results, we can see that: (i) CPRA outperforms PRA (using either LR or SVM) and TransE on the 96 relations (ALL) in both metrics. All the improvements are statistically significant, at a level of p < 0.0001 for MAP and p < 0.05 for MRR. (ii) CPRA-LR/SVM outperforms PRA-LR/SVM in 22/24 of the 29 clusters in terms of MAP, and most of the improvements are quite substantial. (iii) Improving PRA-LR and PRA-SVM in terms of MRR can be hard, since they already achieve the best possible performance (MRR = 1) in 19 of the 29 clusters. Even so, CPRA-LR/SVM still improves on 7/8 of the remaining 10 clusters. (iv) The PRA-style methods perform substantially better than the embedding-based TransE model in most of the 29 clusters and on all the 96 relations, demonstrating the superiority of observed features (i.e., PRA paths) over latent features.

Table 3: KB completion results on the 96 relations that have been grouped into clusters with size larger than one (with the stopping criterion δ = 0.5), and hence involved in multi-task learning.

Table 4 further shows the top 5 most discriminative paths (i.e., the features with the highest weights) discovered by PRA-SVM (left) and CPRA-SVM (right) for each relation in the 6th cluster. 12 The average precision on each relation is also provided. We can observe that: (i) CPRA generally discovers more predictive paths than PRA. Almost all the top paths discovered by CPRA are easily interpretable and provide sensible reasons for the final prediction, while some of the top paths discovered by PRA are hard to interpret and less predictive. Take org/place-founded as an example. All 5 CPRA paths are useful for predicting the place where an organization was founded; e.g., the 3rd one states that "the organization is headquartered in a city which is located in that place". In contrast, the PRA path "common/class → common/class^-1 → film/debutvenue" is hard to interpret and less predictive. (ii) For the 1st/4th/6th relations, on which PRA gets a low average precision, CPRA learns almost completely different top paths and achieves a substantially higher average precision. For the other relations (2nd/3rd/5th), on which PRA already performs well, CPRA learns similar top paths and achieves comparable average precision. We have conducted the same analyses with CPRA-LR and PRA-LR, and observed similar phenomena. All these observations demonstrate the superiority of CPRA, in terms of not only predictive accuracy but also model interpretability.
This is the first work that investigates the possibility of multi-task learning with PRA, and we provide only a very simple solution; many interesting topics remain to be studied. For instance, the agglomerative clustering strategy can only identify highly correlated relations, i.e., those sharing a lot of common paths. Relations that are only loosely correlated, e.g., those sharing no common paths but many common sub-paths, will not be identified. We would like to design new mechanisms to discover loosely correlated relations, and investigate whether coupling such relations still provides benefits. Another direction concerns the fact that the current method is a two-step approach, performing relation clustering first and then relation coupling. It would be interesting to study whether the clustering step and the coupling step can be merged so as to obtain a richer inter-task dependency structure. We will investigate such topics in our future work.