Quantifying Similarity between Relations with Fact Distribution

We introduce a conceptually simple and effective method to quantify the similarity between relations in knowledge bases. Specifically, our approach is based on the divergence between the conditional probability distributions over entity pairs. In this paper, these distributions are parameterized by a very simple neural network. Although computing the exact similarity is intractable, we provide a sampling-based method to get a good approximation. We empirically show that the outputs of our approach significantly correlate with human judgments. By applying our method to various tasks, we also find that (1) our approach could effectively detect redundant relations extracted by open information extraction (Open IE) models, that (2) even the most competitive models for relational classification still make mistakes among very similar relations, and that (3) our approach could be incorporated into negative sampling and softmax classification to alleviate these mistakes.


Introduction
Relations1, representing various types of connections between entities or arguments, are the core of expressing relational facts in most general knowledge bases (KBs) (Suchanek et al., 2007; Bollacker et al., 2008). Hence, identifying relations is a crucial problem for several information extraction tasks. Although considerable effort has been devoted to these tasks, some nuances between similar relations are still overlooked (Table 1 shows an example); on the other hand, some distinct surface forms carrying the same relational semantics are mistaken as different relations. These severe problems motivate us to quantify the similarity between relations in a more effective and robust method.

Author contributions: Hao Zhu designed the research; Weize Chen prepared the data, and organized data annotation; Hao Zhu and Xu Han designed the experiments; Weize Chen performed the experiments; Hao Zhu, Weize Chen and Xu Han wrote the paper; Zhiyuan Liu and Maosong Sun proofread the paper. Zhiyuan Liu is the corresponding author.
1 Sometimes relations are also named properties.

Sentence: The crisis didn't influence his two daughters OBJ and SUBJ.
Correct: per:siblings
Predicted: per:parents
Similarity Rank: 2
Table 1: An illustration of the errors made by relation extraction models. The sentence contains obvious patterns indicating the two persons are siblings, but the model predicts it as parents. We introduce an approach to measure the similarity between relations. Our result shows "siblings" is the second most similar one to "parents". By applying this approach, we could analyze the errors made by models, and help reduce errors.
In this paper, we introduce an adaptive and general framework for measuring the similarity between pairs of relations. Suppose for each relation r, we have obtained a conditional distribution, P(h, t | r) (h, t ∈ E are head and tail entities, and r ∈ R is a relation), over all head-tail entity pairs given r. We could quantify the similarity between a pair of relations by the divergence between the conditional probability distributions given these relations. In this paper, this conditional probability is given by a simple feed-forward neural network, which can capture the dependencies between entities conditioned on specific relations. Despite its simplicity, the proposed network is expected to cover various facts, even if the facts are not used for training, owing to the good generalizability of neural networks. For example, our network will assign a fact a higher probability if it is "logical": e.g., the network might prefer that an athlete has the same nationality as his/her national team rather than that of another nation.
Intuitively, two similar relations should have similar conditional distributions over head-tail entity pairs P(h, t | r), e.g., the entity pairs associated with be trade to and play for are most likely to be athletes and their clubs, whereas those associated with live in are often people and locations. In this paper, we evaluate the similarity between relations based on their conditional distributions over entity pairs. Specifically, we adopt the Kullback-Leibler (KL) divergence in both directions as the metric. However, computing the exact KL divergence requires iterating over the whole entity pair space E × E, which is intractable. Therefore, we further provide a sampling-based method to approximate the similarity score over the entity pair space for computational efficiency.
Besides developing a framework for assessing the similarity between relations, our second contribution is a survey of applications. We present experiments and analysis aimed at answering five questions: (1) How well does the computed similarity score correlate with human judgment about the similarity between relations? How does our approach compare to other possible approaches based on other kinds of relation embeddings to define a similarity? (§3.4 and §5) (2) Open IE models inevitably extract many redundant relations. How can our approach help reduce such redundancy? (§6) (3) To what extent, quantitatively, do the best relational classification models make errors among similar relations? (§7) (4) Could similarity be used in a heuristic method to enhance negative sampling for relation prediction? (§8) (5) Could similarity be used as an adaptive margin in the softmax-margin training method for relation extraction? (§9) Finally, we conclude with a discussion of valid extensions to our method and other possible applications.

Learning Head-Tail Distribution
As introduced in §1, we quantify the similarity between relations by their corresponding head-tail entity pair distributions. Consider the typical case in which we have a number of facts, but they are still sparse among all facts in the real world. How could we obtain a well-generalized distribution over the whole space of possible triples beyond the training facts? This section proposes a method to parameterize such a distribution.

Formal Definition of Fact Distribution
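A fact is a triple (h, r, t) ∈ E × R × E, where h and t are called head and tail entities, r is the relation connecting them, and E and R are the sets of entities and relations respectively. We consider a score function F_θ : E × R × E → ℝ that maps all triples to a scalar value. As a special case, the function can be factorized into the sum of two parts:
$$F_\theta(h, t; r) \triangleq u_{\theta_1}(h; r) + u_{\theta_2}(t; h, r).$$
We use F_θ to define the unnormalized probability
$$\tilde{P}_\theta(h, t \mid r) \triangleq \exp F_\theta(h, t; r) \qquad (1)$$
for every triple (h, r, t). The real parameter θ can be adjusted to obtain different distributions over facts.
In this paper, we only consider the locally normalized version of F_θ:
$$\tilde{u}_{\theta_1}(h; r) = \log \frac{\exp u_{\theta_1}(h; r)}{\sum_{h'} \exp u_{\theta_1}(h'; r)}, \qquad \tilde{u}_{\theta_2}(t; h, r) = \log \frac{\exp u_{\theta_2}(t; h, r)}{\sum_{t'} \exp u_{\theta_2}(t'; h, r)}, \qquad (2)$$
where ũ_θ1 and ũ_θ2 are directly parameterized by feed-forward neural networks. Through local normalization, P̃_θ(h, t | r) is naturally a valid probability distribution, as the partition function satisfies
$$\sum_{h, t} \exp\big(\tilde{u}_{\theta_1}(h; r) + \tilde{u}_{\theta_2}(t; h, r)\big) = 1.$$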

Neural architecture design
Here we introduce our design of the neural networks. We implement the two scoring functions introduced in equation (2) with multi-layer perceptrons: each MLP_θ is composed of layers of the form y = relu(W x + b), its inputs are the embeddings of h, r, and t, and θ includes the weights and biases in all layers.
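To make the local normalization concrete, here is a minimal Python sketch. It does not reproduce the paper's networks: the MLPs are replaced by hypothetical tables of random scores, and the names `ToyFactDistribution`, `u1`, and `u2` are illustrative assumptions. It only shows that normalizing head scores and tail scores separately yields a distribution over entity pairs that sums to one.

```python
import math
import random


def log_softmax(scores):
    """Turn a list of unnormalized scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


class ToyFactDistribution:
    """Locally normalized P(h, t | r) = P(h | r) * P(t | h, r).

    Assumption: random score tables stand in for the paper's MLPs.
    """

    def __init__(self, num_entities, num_relations, seed=0):
        rng = random.Random(seed)
        # u1[r][h]: unnormalized score of head h given relation r.
        self.u1 = [[rng.gauss(0.0, 1.0) for _ in range(num_entities)]
                   for _ in range(num_relations)]
        # u2[r][h][t]: unnormalized score of tail t given head h and relation r.
        self.u2 = [[[rng.gauss(0.0, 1.0) for _ in range(num_entities)]
                    for _ in range(num_entities)]
                   for _ in range(num_relations)]

    def log_prob(self, h, t, r):
        """log P(h, t | r) as the sum of two locally normalized scores."""
        return log_softmax(self.u1[r])[h] + log_softmax(self.u2[r][h])[t]


model = ToyFactDistribution(num_entities=5, num_relations=2)
total = sum(math.exp(model.log_prob(h, t, 0))
            for h in range(5) for t in range(5))
print(round(total, 6))  # the probabilities of all entity pairs sum to one
```

Because each factor is normalized on its own, no sum over the full E × E space is needed to evaluate the probability of a single entity pair.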

Quantifying Similarity
So far, we have talked about how to use neural networks to approximate the natural distribution of facts. The central topic of our paper, quantifying similarity, will be discussed in detail in this section.

Relations as Distributions
In this paper, we provide a probabilistic view of relations by representing a relation r as a probability distribution P_θ*(h, t | r). After training the neural network on a given set of triples, the model is expected to generalize well on the whole E × R × E space.
Note that it is very easy to calculate P_θ*(h, t | r) in our model thanks to local normalization (equation (2)). Therefore, we can compute it by
$$P_{\theta^*}(h, t \mid r) = \exp\big(\tilde{u}_{\theta_1}(h; r) + \tilde{u}_{\theta_2}(t; h, r)\big).$$

Defining Similarity
As the basis of our definition, we hypothesize that the similarity between the distributions P_θ*(h, t | r) reflects the similarity between relations. For example, if the conditional distributions of two relations put mass on similar entity pairs, the two relations should be quite similar. If they emphasize different ones, the two should have some differences in meaning.
Formally, we define the similarity between two relations as a function of the divergence between the distributions of corresponding head-tail entity pairs:
$$S(r_1, r_2) = g\Big(D_{KL}\big(P_{\theta^*}(h, t \mid r_1) \,\|\, P_{\theta^*}(h, t \mid r_2)\big),\; D_{KL}\big(P_{\theta^*}(h, t \mid r_2) \,\|\, P_{\theta^*}(h, t \mid r_1)\big)\Big),$$
where D_KL(· || ·) denotes the Kullback-Leibler divergence,
$$D_{KL}\big(P_{\theta^*}(h, t \mid r_1) \,\|\, P_{\theta^*}(h, t \mid r_2)\big) = \mathbb{E}_{h, t \sim P_{\theta^*}(h, t \mid r_1)} \log \frac{P_{\theta^*}(h, t \mid r_1)}{P_{\theta^*}(h, t \mid r_2)},$$
and vice versa, and g(·, ·) is a symmetric function. To keep the semantic meaning of "similarity" coherent with our definition, g should be a monotonically decreasing function. Throughout this paper, we choose to use an exponential family4 composed with the max function, i.e., g(x, y) = e^{−max(x, y)}. Note that by taking both directions of the KL divergence into account, our definition incorporates the entity pairs with high probability under both r1 and r2. Intuitively, if P_θ*(h, t | r1) mainly distributes on a proportion of the entity pairs that P_θ*(h, t | r2) emphasizes, r1 is only a hyponym of r2. Considering both directions of the KL divergence helps the model take a more comprehensive view. We will talk about the advantage of this method in detail in §3.4.
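As a small worked illustration of this definition, the sketch below computes the similarity exactly for toy distributions over four entity pairs. The probability values are hypothetical and chosen only for illustration; the function names are likewise assumptions, and the code assumes each distribution is non-zero wherever the other one is.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two distributions over the same entity pairs.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def similarity(p, q):
    """S = g(KL(p||q), KL(q||p)) with g(x, y) = exp(-max(x, y))."""
    return math.exp(-max(kl_divergence(p, q), kl_divergence(q, p)))


# Hypothetical distributions over four entity pairs:
# two (athlete, club) pairs followed by two (person, location) pairs.
play_for = [0.40, 0.40, 0.10, 0.10]
be_trade_to = [0.35, 0.45, 0.10, 0.10]
live_in = [0.05, 0.05, 0.45, 0.45]

print(similarity(play_for, be_trade_to))  # close to 1
print(similarity(play_for, live_in))      # much smaller
```

Taking the maximum of the two directions means a relation whose mass covers only part of another relation's mass is still penalized, which matches the hyponym intuition in the text.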

Calculating Similarity
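As introduced in §1, it is intractable to compute the similarity exactly, as it involves O(|E|^2) computation. Hence, we consider the Monte Carlo approximation:
$$D_{KL}\big(P_{\theta^*}(h, t \mid r_1) \,\|\, P_{\theta^*}(h, t \mid r_2)\big) \approx \frac{1}{|S|} \sum_{(h, t) \in S} \log \frac{P_{\theta^*}(h, t \mid r_1)}{P_{\theta^*}(h, t \mid r_2)},$$
where S is a list of entity pairs sampled from P_θ*(h, t | r1). We use sequential sampling5 to obtain S, which means we first sample h given r from u_θ1(h; r), and then sample t given h and r from u_θ2(t; h, r).6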
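The Monte Carlo approximation with sequential sampling can be sketched as follows. This is a toy illustration, not the paper's code: each relation is represented by hypothetical probability tables rather than a trained network, and the sketch draws a separate sample set for each KL direction, which is an assumption.

```python
import math
import random


def sample_pair(relation, rng):
    """Sequential sampling: draw h from P(h | r), then t from P(t | h, r)."""
    head_probs, tail_probs = relation
    h = rng.choices(range(len(head_probs)), weights=head_probs)[0]
    t = rng.choices(range(len(tail_probs[h])), weights=tail_probs[h])[0]
    return h, t


def log_prob(relation, h, t):
    """log P(h, t | r) = log P(h | r) + log P(t | h, r)."""
    head_probs, tail_probs = relation
    return math.log(head_probs[h]) + math.log(tail_probs[h][t])


def mc_kl(rel1, rel2, num_samples, rng):
    """Monte Carlo estimate of D_KL(P(.|r1) || P(.|r2)) from samples of r1."""
    total = 0.0
    for _ in range(num_samples):
        h, t = sample_pair(rel1, rng)
        total += log_prob(rel1, h, t) - log_prob(rel2, h, t)
    return total / num_samples


def exact_kl(rel1, rel2):
    """Exact KL by enumerating every entity pair (feasible only for toy sizes)."""
    num_entities = len(rel1[0])
    total = 0.0
    for h in range(num_entities):
        for t in range(num_entities):
            lp1 = log_prob(rel1, h, t)
            total += math.exp(lp1) * (lp1 - log_prob(rel2, h, t))
    return total


def mc_similarity(rel1, rel2, num_samples=20000, seed=0):
    """exp(-max(KL estimates)); one sample set per direction (an assumption)."""
    rng = random.Random(seed)
    forward = mc_kl(rel1, rel2, num_samples, rng)
    backward = mc_kl(rel2, rel1, num_samples, rng)
    return math.exp(-max(forward, backward))


# Each toy relation is (P(h | r), P(t | h, r)) over two entities.
rel_a = ([0.7, 0.3], [[0.6, 0.4], [0.5, 0.5]])
rel_b = ([0.5, 0.5], [[0.5, 0.5], [0.4, 0.6]])
print(exact_kl(rel_a, rel_b), mc_kl(rel_a, rel_b, 20000, random.Random(0)))
```

Each sample costs two draws over the entity set instead of one draw over all entity pairs, which is where the O(|E|) versus O(|E|^2) difference comes from.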

Relationship with other metrics
Previous work proposed various methods for representing relations as vectors (Bordes et al., 2013; Yang et al., 2015), as matrices (Nickel et al., 2011), even as angles (Sun et al., 2019), etc. Based on each of these representations, one could easily define various similarity quantification methods.7 We show in Table 2 the best one of them in each category of relation representation.
Here we provide two intuitive reasons for using our proposed probability-based similarity: (1) the capacity of a single fixed-size representation is limited, and some details about the fact distribution are lost during embedding; (2) directly comparing distributions yields better interpretability: one cannot tell how two relations differ given only two relation embeddings, but our model makes it possible to study the detailed differences between probabilities on every entity pair. Figure 1 provides an example. Although the two relations talk about the same topic, they have different meanings. TransE embeds them as the vectors closest to each other, while our model can capture the distinction between the distributions corresponding to the two relations, which can be directly noticed from the figure.
4 We view KL divergences as energy functions.
5 Sampling h and t at the same time requires O(|E|^2) computation, while sequential sampling requires only O(|E|) computation.
6 It seems to be a non-symmetrical method, and sampling from the mixture of both forward and backward should yield a better result. Surprisingly, in practice, sampling from a single direction works just as well as from both directions.
7 Taking the widely used vector representations as an example, we can define the similarity between relations based on cosine distance, dot product distance, L1/L2 distance, etc.

Dataset Construction
We show the statistics of the datasets we use in Table 3, and introduce the construction procedures in this section.

Wikidata
In Wikidata (Vrandečić and Krötzsch, 2014), facts can be described as (Head item/property, Property, Tail item/property). To construct a dataset suitable for our task, we only consider the facts whose head entity and tail entity are both items. We first choose the most common 202 relations and 120000 entities from Wikidata as our initial data. Considering that the facts containing the two most frequently appearing relations (P2860: cites, and P31: instance of) occupy half of the initial data, we drop the two relations to downsize the dataset and make it more balanced. Finally, we keep the triples whose head and tail both come from the selected 120000 entities and whose relation comes from the remaining 200 relations.

ReVerb Extractions
ReVerb (Fader et al., 2011) is a program that automatically identifies and extracts binary relationships from English sentences. We use the extractions from running ReVerb on Wikipedia9. We only keep the relations that appear more than 10 times and their corresponding triples to construct our dataset.

FB15K and TACRED
FB15K (Bordes et al., 2013) is a subset of Freebase. TACRED (Zhang et al., 2017) is a large supervised relation extraction dataset obtained via crowdsourcing. We directly use these two datasets; no extra processing steps were applied.
9 http://reverb.cs.washington.edu/

Human Judgments
Following Miller and Charles (1991); Resnik (1999) and the vast amount of previous work on semantic similarity, we ask nine undergraduate subjects to assess the similarity of 360 pairs of relations from a subset of Wikidata (Vrandečić and Krötzsch, 2014)10 that are chosen to cover from high to low levels of similarity. In our experiment, subjects were asked to rate an integer similarity score from 0 (no similarity) to 4 (perfectly the same)11 for each pair. The inter-subject correlation, estimated by the leave-one-out method (Weiss and Kulikowski, 1991), is r = 0.763, standard deviation = 0.060. This important reference value (marked in Figure 2) could be seen as the highest expected performance for machines (Resnik, 1999).
To get baselines for comparison, we consider other possible methods to define similarity functions, as shown in Table 2. We compute the correlation between these methods and human judgment scores. As the models we have chosen are the ones that work best in knowledge base completion, we do expect the similarity quantification approaches based on them to measure some degree of similarity. As shown in Figure 2, the three baseline models achieve moderate (0.1-0.5) positive correlation. On the other hand, our model shows a stronger correlation (0.63) with human judgment, indicating that considering the probability over the whole entity pair space helps to obtain a similarity closer to human judgments. These results provide evidence for our claim raised in §3.2.
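The evaluation protocol can be sketched as below. This is an illustrative reading, not the authors' script: it assumes the leave-one-out agreement correlates each subject with the mean rating of the remaining subjects, and it uses Spearman correlation throughout (the text does not state which coefficient underlies r = 0.763). The ratings are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def leave_one_out_agreement(ratings):
    """Correlate each subject with the mean rating of all other subjects.

    ratings[s][i] is the score subject s gave to relation pair i.
    Returns the mean and population standard deviation of the correlations.
    """
    correlations = []
    for s, own in enumerate(ratings):
        others = [row for k, row in enumerate(ratings) if k != s]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        correlations.append(spearman(own, mean_others))
    mean = sum(correlations) / len(correlations)
    std = math.sqrt(sum((c - mean) ** 2 for c in correlations)
                    / len(correlations))
    return mean, std


# Hypothetical ratings (0-4) from three subjects on five relation pairs.
ratings = [[4, 3, 2, 1, 0], [4, 2, 3, 1, 0], [3, 4, 2, 0, 1]]
print(leave_one_out_agreement(ratings))
```

The same `spearman` function can be applied to a model's similarity scores and the averaged human scores to obtain the per-model correlations compared in Figure 2.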

Redundant Relation Removal
Open IE extracts concise token patterns from plain text to represent various relations between entities, e.g., (Mark Twain, was born in, Florida). As Open IE is significant for constructing KBs, many effective extractors have been proposed to extract triples, such as TextRunner (Yates et al., 2007), ReVerb (Fader et al., 2011), and Stanford Open IE (Angeli et al., 2015). However, these extractors only yield relation patterns between entities, without aggregating and clustering their results. Accordingly, a fair number of redundant relation patterns remain after extraction. Furthermore, the redundant patterns lead to some redundant relations in KBs.
10 Wikidata provides detailed descriptions of properties (relations), which could help subjects understand the relations better.
11 The detailed instructions are attached in Appendix F.
Recently, some efforts have been devoted to Open Relation Extraction (Open RE) (Lin and Pantel, 2001; Yao et al., 2011; Marcheggiani and Titov, 2016; ElSahar et al., 2017), aiming to cluster relation patterns into several relation types instead of leaving redundant relation patterns. However, on the one hand, these Open RE methods adopt distantly supervised labels as golden relation types, and thus suffer from both false positive and false negative problems. On the other hand, these methods still rely on the conventional similarity metrics mentioned above.
In this section, we will show that our defined similarity quantification could help Open IE by identifying redundant relations. To be specific, we set up a toy experiment to remove redundant relations in KBs for a preliminary comparison (§6.1). Then, we evaluate our model and baselines on the real-world dataset extracted by Open IE methods (§6.2). Considering that the existing evaluation metrics for Open IE and Open RE rely on either labor-intensive annotations or distantly supervised annotations, we propose a metric approximating recall and precision evaluation based on operable human annotations to balance efficiency and accuracy.

Toy Experiment
In this subsection, we propose a toy environment to verify our similarity-based method. Specifically, we construct a dataset from Wikidata12 and implement the Chinese restaurant process13 to split every relation in the dataset into several sub-relations. Then, we filter out those sub-relations appearing fewer than 50 times to eventually get 1165 relations. All these split relations are regarded as different ones during training, and then different relation similarity metrics are adopted to merge those sub-relations into one relation. As Figure 2 shows that the matrix-based approach is less effective than other approaches, we leave this approach out of this experiment. The results are shown in Table 4.
It is nearly impossible to annotate, for all pattern pairs, whether they should be merged, and it is also inappropriate to take distantly supervised annotations as golden results. Hence, we propose a novel metric approximating recall and precision evaluation based on minimal human annotations for evaluation in this experiment.

Approximating Recall and Precision
Recall. Recall is defined as the fraction of true positive instances over the total number of real positive instances. However, we do not have annotations about which pairs of relations are synonymous. Crowdsourcing is a method to obtain a large number of high-quality annotations. Nevertheless, applying crowdsourcing is not trivial in our settings, because it is intractable to enumerate all synonymous pairs in the large space of relation (pattern) pairs, O(|R|^2), in Open IE. A promising method is to use rejection sampling by sampling uniformly from the whole space and keeping only the synonymous pairs judged by crowdworkers. However, this is not practical either, as the synonymous pairs are sparse in the whole space, resulting in low efficiency. Fortunately, we could use normalized importance sampling as an alternative to get an unbiased estimation of recall.
Theorem 1. Suppose every sample x ∈ X has a label f(x) ∈ {0, 1}, and the model to be evaluated also gives its prediction f̂(x) ∈ {0, 1}. The recall can be written as
$$\mathrm{Recall} = \mathbb{E}_{x \sim U}\, \hat{f}(x), \qquad (9)$$
where U is the uniform distribution over all samples with f(x) = 1. If we have a proposal distribution q(x) satisfying ∀x, f(x) = 1 ∧ f̂(x) = 1 ⇒ q(x) ≠ 0, we get an unbiased estimation of recall:
$$\mathrm{Recall} \approx \sum_{i=1}^{n} \hat{w}_i\, \hat{f}(x_i), \qquad (10)$$
where ŵ_i is a normalized version of w_i = f(x_i) / q̃(x_i), q̃ is the unnormalized version of q, and {x_i}, i = 1, ..., n, are drawn i.i.d. from q(x).
Precision. Similar to equation (9), we can write the expectation form of precision:
$$\mathrm{Precision} = \mathbb{E}_{x \sim U'}\, f(x), \qquad (11)$$
where U′ is the uniform distribution over all samples with f̂(x) = 1. As these samples can be found by running the model, we can simply approximate precision by Monte Carlo sampling:
$$\mathrm{Precision} \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i), \qquad (12)$$
where x_i ∼ U′. In our setting, x = (r1, r2) ∈ R × R, f(x) = 1 means r1 and r2 are the same relation, and f̂(x) = 1 means S(r1, r2) is larger than a threshold λ.
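The two estimators can be sketched on a synthetic population as follows. Everything here is a hypothetical illustration: the labels, predictions, and proposal weights are generated by arbitrary rules, the proposal weight plays the role of the similarity score, and larger sample sizes than those in the experiments are used so that the toy estimates are stable.

```python
import random


def estimate_recall(samples):
    """Self-normalized importance sampling estimate of recall.

    samples: (label f, prediction f_hat, unnormalized proposal weight q_tilde)
    for pairs drawn i.i.d. from the proposal q.
    """
    weights = [f / q_tilde for f, _, q_tilde in samples]  # w_i = f(x_i) / q~(x_i)
    total = sum(weights)
    if total == 0:
        raise ValueError("no positive pair was sampled")
    weighted_hits = sum(w * f_hat for w, (_, f_hat, _) in zip(weights, samples))
    return weighted_hits / total


def estimate_precision(labels_of_predicted_positives):
    """Monte Carlo estimate: mean label among sampled predicted positives."""
    return sum(labels_of_predicted_positives) / len(labels_of_predicted_positives)


# Hypothetical population of relation pairs: (f, f_hat, q_tilde).
population = []
for i in range(2000):
    f = 1 if i % 10 == 0 else 0
    f_hat = 1 if (i % 20 == 0 or i % 7 == 0) else 0
    q_tilde = 1.0 if f_hat else 0.2
    population.append((f, f_hat, q_tilde))

rng = random.Random(0)
drawn = rng.choices(population, weights=[q for _, _, q in population], k=20000)
predicted_pos = [f for f, f_hat, _ in population if f_hat == 1]
sampled_labels = rng.choices(predicted_pos, k=2000)

true_recall = (sum(f * f_hat for f, f_hat, _ in population)
               / sum(f for f, _, _ in population))
true_precision = sum(predicted_pos) / len(predicted_pos)
print(true_recall, estimate_recall(drawn))
print(true_precision, estimate_precision(sampled_labels))
```

Only the sampled pairs need human labels, which is the point of the construction: annotators never have to enumerate the full space of relation pairs.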

Results
The results on the ReVerb Extractions dataset that we constructed are described in Figure 3. To approximate recall, we use the similarity scores as the proposal distribution q, and then draw 500 relation pairs from q. To approximate precision, we set thresholds at equal intervals. At each threshold, we uniformly sample 50 to 100 relation pairs whose similarity score given by the model is larger than the threshold. We ask 15 undergraduates to judge whether the two relations in each pair have the same meaning.

Error Analysis for Relational Classification
In this section, we consider two kinds of relational classification tasks: (1) relation prediction and (2) relation extraction.Relation prediction aims at predicting the relationship between entities with a given set of triples as training data; while relation extraction aims at extracting the relationship between two entities in a sentence.

Relation Prediction
We hope to design a simple and clear experiment setup to conduct error analysis for relation prediction. Therefore, we consider a typical method, TransE (Bordes et al., 2013), as the subject and FB15K (Bordes et al., 2013) as the dataset. TransE embeds entities and relations as vectors, and trains these embeddings by minimizing
$$\mathcal{L} = \sum_{(h, r, t) \in D} \big[\gamma + d(\mathbf{h} + \mathbf{r}, \mathbf{t}) - d(\mathbf{h}' + \mathbf{r}', \mathbf{t}')\big]_+,$$
where D is the set of training triples, d(·, ·) is the distance function, [x]_+ = max(0, x), (h′, r′, t′)17 is a negative sample with one element different from (h, r, t) uniformly sampled from E × R × E, and γ is the margin. During testing, for each entity pair (h, t), TransE ranks relations according to d(h + r, t). For each (h, r, t) in the test set, we call the relations ranked higher than r distracting relations. We then compare the similarity between the golden relation and the distracting relations. Note that some entity pairs correspond to more than one relation, in which case we do not count the other correct relations as distracting relations.
16 The figure is shown in Figure 6.
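A minimal sketch of the margin loss for one triple pair and of the "distracting relations" bookkeeping is given below. The two-dimensional embeddings are hypothetical, and the L1 distance is an assumption (the text only says d is a distance function).

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def score(h, r, t, ent, rel):
    """d(h + r, t): smaller means the triple is more plausible."""
    translated = [a + b for a, b in zip(ent[h], rel[r])]
    return l1_distance(translated, ent[t])


def margin_loss(positive, negative, ent, rel, gamma=1.0):
    """Hinge loss [gamma + d(positive) - d(negative)]_+ for one triple pair."""
    return max(0.0, gamma + score(*positive, ent, rel) - score(*negative, ent, rel))


def distracting_relations(h, gold, t, ent, rel, other_valid=()):
    """Relations ranked ahead of the gold relation for the pair (h, t).

    Relations in other_valid (also correct for this pair) are not counted.
    """
    gold_score = score(h, gold, t, ent, rel)
    ahead = [r for r in rel
             if r != gold and r not in other_valid
             and score(h, r, t, ent, rel) < gold_score]
    return sorted(ahead, key=lambda r: score(h, r, t, ent, rel))


# Hypothetical 2-d embeddings.
ent = {"a": [0.0, 0.0], "b": [1.0, 0.0]}
rel = {"r1": [1.0, 0.0], "r2": [0.9, 0.0], "r3": [0.0, 1.0]}
print(distracting_relations("a", "r3", "b", ent, rel))  # ['r1', 'r2']
print(margin_loss(("a", "r3", "b"), ("a", "r1", "b"), ent, rel))  # 3.0
```

The error analysis then looks up, for each distracting relation, its similarity rank with respect to the gold relation.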

Relation Extraction
For relation extraction, we consider the supervised relation extraction setting and the TACRED dataset (Zhang et al., 2017). As for the subject model, we use the best model on the TACRED dataset, the position-aware neural sequence model. This method first passes the sentence into an LSTM and then calculates an attention sum of the hidden states in the LSTM by taking positional features into account. This simple and effective method achieves the best performance on the TACRED dataset.

Results
Figure 4 shows the distribution of similarity ranks of distracting relations of the above-mentioned models' outputs on both relation prediction and relation extraction tasks. From Figures 4a and 4b, we can observe that the most distracting relations are the most similar ones, which corroborates our hypothesis that even the best models on these tasks still make mistakes among the most similar relations. This result also highlights the importance of a heuristic method for guiding models to pay more attention to the boundary between similar relations. We also tried negative sampling with relation type constraints, but saw no improvement compared with uniform sampling. The details of negative sampling with relation type constraints are presented in Appendix E.
17 Note that only head and tail entities are changed in the original TransE when doing link prediction. But changing r results in better performance when doing relation prediction.

Model | P | R | F1
(Xu et al., 2015) | 66.3 | 52.7 | 58.7
LSTM | 65.7 | 59.9 | 62.7
PA-LSTM (Zhang et al., 2017) | 65.7 | 64.5 | 65.1
Neural+Ours
PA-LSTM (Softmax-Margin Loss) | 68.5 | 64.7 | 66.6
Table 5: Improvement of using similarity in softmax-margin loss.

Similarity and Negative Sampling
Based on the observation presented in §7.3, we find that similar relations are often confusing for relation prediction models. Therefore, corrupted triples with similar relations can be used as high-quality negative samples. For a given valid triple (h, r, t), we corrupt the triple by substituting r with r′ with the probability
$$p(r' \mid r) \propto \exp\big(S(r, r') / \alpha\big),$$
where α is the temperature of the exponential function: the bigger α is, the flatter the probability distribution is. When the temperature approaches infinity, the sampling process reduces to uniform sampling.
In training, we set the initial temperature to a high level and gradually reduce the temperature. Intuitively, this enables the model to distinguish among obviously different relations in the early stage, and gives more and more confusing negative triples as training proceeds to help the model distinguish similar relations. This can also be viewed as a process of curriculum learning (Bengio et al., 2009): the data fed to the model gradually changes from simple negative triples to hard ones.
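A sketch of similarity-guided negative sampling with an annealed temperature follows. It assumes the sampling distribution is a softmax of S(r, r′)/α over all relations other than r, and it takes the schedule values (start at 8192, halve every 200 epochs, stop at 16) from Appendix C.1. The similarity values and relation names are hypothetical.

```python
import math
import random


def temperature(epoch, initial=8192.0, halve_every=200, floor=16.0):
    """Halve the temperature every `halve_every` epochs until it hits `floor`."""
    return max(floor, initial * 0.5 ** (epoch // halve_every))


def negative_relation_probs(gold, sim_to_gold, alpha):
    """p(r' | r) proportional to exp(S(r, r') / alpha) over r' != r.

    sim_to_gold maps each relation r' to S(gold, r').
    Excluding the gold relation itself is an assumption of this sketch.
    """
    candidates = [r for r in sim_to_gold if r != gold]
    logits = [sim_to_gold[r] / alpha for r in candidates]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {r: e / z for r, e in zip(candidates, exps)}


def sample_negative_relation(gold, sim_to_gold, alpha, rng):
    """Draw one corrupted relation r' for the gold relation."""
    probs = negative_relation_probs(gold, sim_to_gold, alpha)
    names = list(probs)
    return rng.choices(names, weights=[probs[r] for r in names])[0]


# Hypothetical similarities to the gold relation "parents".
sims = {"parents": 1.0, "siblings": 0.8, "spouse": 0.5, "employer": 0.01}
print(negative_relation_probs("parents", sims, alpha=1000.0))  # nearly uniform
print(negative_relation_probs("parents", sims, alpha=0.1))     # favors "siblings"
print(sample_negative_relation("parents", sims, 0.1, random.Random(0)))
```

At high temperature the sampler behaves like the uniform baseline; as the temperature drops, negatives concentrate on the relations most similar to the gold one.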
We perform the relation prediction task on FB15K with TransE. Following Bordes et al. (2013), we use the "Filtered" setting protocol, i.e., filtering out the corrupted triples that appear in the dataset. Our sampling method is shown to improve the model's performance, especially on Hit@1 (Figure 5). Training details are described in Appendix C.

Similarity and Softmax-Margin Loss
Similar to §8, we find that relation extraction models often make wrong predictions on similar relations. In this section, we use similarity as an adaptive margin in softmax-margin loss to improve the performance of relation extraction models.
As shown in (Gimpel and Smith, 2010), the softmax-margin loss can be expressed as
$$\mathcal{L} = \sum_{i=1}^{n} \Big[ -\theta^\top f(x^{(i)}, r^{(i)}) + \log \sum_{r \in R(x^{(i)})} \exp\big(\theta^\top f(x^{(i)}, r) + \mathrm{cost}(r^{(i)}, r)\big) \Big],$$
where R(x) denotes a structured output space for x, ⟨x^(i), r^(i)⟩ is the i-th example in the training data, and θ^⊤ f(x, r) is the model's score for output r.
We can easily incorporate similarity into the cost function cost(r^(i), r). In this task, we define the cost function as αS(r^(i), r), where α is a hyperparameter.
Intuitively, we give a larger margin between similar relations, forcing the model to distinguish among them, and thus making the model perform better.We apply our method to Position-aware Attention LSTM (PA-LSTM) (Zhang et al., 2017), and Table 5 shows our method improves the performance of PA-LSTM.Training details are described in Appendix C.
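The effect of the similarity-based cost can be sketched for a single training example as follows. The scores and similarities are hypothetical, and setting the cost of the gold label to zero (as in the usual softmax-margin formulation) is an assumption of this sketch rather than something stated in the text.

```python
import math


def softmax_margin_loss(scores, gold, sim_to_gold, alpha):
    """-score[gold] + log sum_r exp(score[r] + cost(gold, r)).

    cost(gold, r) = alpha * S(gold, r) for r != gold and 0 for r == gold
    (the zero cost for the gold label is an assumption of this sketch).
    """
    augmented = []
    for r, s in scores.items():
        cost = 0.0 if r == gold else alpha * sim_to_gold[r]
        augmented.append(s + cost)
    m = max(augmented)
    log_sum = m + math.log(sum(math.exp(a - m) for a in augmented))
    return log_sum - scores[gold]


# Hypothetical classifier scores and similarities to the gold label.
scores = {"per:siblings": 2.0, "per:parents": 1.5, "org:founded": -1.0}
sims = {"per:siblings": 1.0, "per:parents": 0.8, "org:founded": 0.05}
print(softmax_margin_loss(scores, "per:siblings", sims, alpha=0.0))  # cross-entropy
print(softmax_margin_loss(scores, "per:siblings", sims, alpha=2.0))  # larger loss
```

With α = 0 the loss reduces to ordinary softmax cross-entropy; with α > 0 the score of a similar wrong label is inflated more than that of a dissimilar one, so the model must separate similar relations by a wider margin.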

Related Works
Many early works in psychology and linguistics, especially those exploring semantic similarity (Miller and Charles, 1991; Resnik, 1999), have empirically found that there are various categorizations of semantic relations among words and contexts. To promote research on these different semantic relations, Bejar et al. (1991) explicitly define these relations, and Miller (1995) further systematically organizes rich semantic relations between words via a database. To identify correlations and distinctions between different semantic relations so as to support learning semantic similarity, various methods have attempted to measure relational similarity (Turney, 2005, 2006; Zhila et al., 2013; Pedersen, 2012; Rink and Harabagiu, 2012; Mikolov et al., 2013b,a).
For both semantic relations and general relations, identifying them is a crucial problem, requiring systems to provide a fine-grained relation similarity metric.However, the existing methods suffer from sparse data, which makes it difficult to achieve an effective and stable similarity metric.Motivated by this, we propose to measure relation similarity by leveraging their fact distribution so that we can identify nuances between similar relations, and merge those distant surface forms of the same relations, benefitting the tasks mentioned above.

Conclusion and Future Work
In this paper, we introduce an effective method to quantify relation similarity and provide analysis and a survey of applications. We note that there is a wide range of future directions: (1) human prior knowledge could be incorporated into the similarity quantification; (2) similarity between relations could also be considered in multi-modal settings, e.g., extracting relations from images, videos, or even audio; (3) by analyzing the distributions corresponding to different relations, one can also find some "meta-relations" between relations, such as hypernymy and hyponymy.
A Proofs to theorems in the paper

Recall
Proof. If we have a proposal distribution q(x) satisfying ∀x, f(x) = 1 ∧ f̂(x) = 1 ⇒ q(x) ≠ 0, then equation (16) can be further written as
$$\mathrm{Recall} = \mathbb{E}_{x \sim q}\Big[\frac{U(x)}{q(x)}\, \hat{f}(x)\Big].$$
Sometimes, it is hard for us to compute the normalized probability q. To tackle this problem, consider self-normalized importance sampling as an unbiased estimation (Owen, 2013),
$$\mathrm{Recall} \approx \sum_{i=1}^{n} \hat{w}_i\, \hat{f}(x_i),$$
where ŵ_i is the normalized version of w_i = f(x_i) / q̃(x_i).

B Chinese Restaurant Process
Specifically, for a relation r with currently m sub-relations, we assign the next triple of r to a new sub-relation with probability
$$p = \frac{\alpha}{n + \alpha}$$
or to the k-th existing sub-relation with probability
$$p = \frac{n_k}{n + \alpha},$$
where n_k is the size of the k-th existing sub-relation, n is the total size of all sub-relations of r, and α is a hyperparameter; we use α = 1.
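The splitting procedure can be sketched as below, assuming the standard Chinese restaurant process in which each triple of a relation is assigned in turn. The function names and the number of triples are hypothetical; the threshold of 50 follows the filtering step described in the toy experiment.

```python
import random


def crp_split(num_triples, alpha=1.0, seed=0):
    """Assign each triple of one relation to a sub-relation via a CRP.

    Returns (assignment, sizes): the sub-relation index of every triple and
    the number of triples in every sub-relation.
    """
    rng = random.Random(seed)
    sizes = []
    assignment = []
    for n in range(num_triples):  # n triples have been assigned so far
        if rng.random() < alpha / (n + alpha):
            sizes.append(1)  # open a new sub-relation
            assignment.append(len(sizes) - 1)
        else:
            # Overall probability of joining sub-relation k is n_k / (n + alpha).
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1
            assignment.append(k)
    return assignment, sizes


def keep_frequent(sizes, min_count=50):
    """Indices of sub-relations that appear at least `min_count` times."""
    return [k for k, size in enumerate(sizes) if size >= min_count]


assignment, sizes = crp_split(num_triples=1000)
print(sizes, keep_frequent(sizes))
```

Since large sub-relations attract new triples in proportion to their size, the process tends to produce a few large sub-relations and many small ones, which is why the small ones are filtered out afterwards.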

C.1 Training Details on Negative Sampling
The sampling is launched with an initial temperature of 8192.The temperature drops to half every 200 epochs and remains stable once it hits 16.Optimization is performed using SGD, with a learning rate of 1e-3.

C.2 Training Details on Softmax-Margin Loss
The sampling is launched with an initial temperature of 64. The temperature drops by 20% per epoch, and remains stable once it hits 16. The alpha we use is 9. Optimization is performed using SGD, with a learning rate of 1.

D Recall Standard Deviation
As is shown in Figure 6, the max recall standard deviation for our model is 0.4, and 0.11 for TransE.

E Negative Sampling with Relation Type Constraints
In FB15K, if two relations have the same prefix, we regard them as belonging to the same type, e.g., both /film/film/starring./film/performance/actor and /film/actor/film./film/performance/film have the prefix film, so they belong to the same type. Similar to what is mentioned in §8, we expect the model first to learn to distinguish among obviously different relations, and gradually learn to distinguish similar relations. Therefore, we conduct negative sampling with relation type constraints in two ways.

E.1 Add Up Two Uniform Distributions
For each triple (h, r, t), we have two uniform distributions U_all and U_type. U_all is the uniform distribution over all the relations except for those that appear with (h, t) in the knowledge base, and U_type is the uniform distribution over the relations of the same type as r. When corrupting the triple, we sample r′ from the distribution
$$U = \alpha U_{all} + (1 - \alpha) U_{type},$$
where α is a hyperparameter. We set α to 1 at the beginning of training, and every k epochs, α is multiplied by the decrease rate γ. We do grid search for k ∈ {50, 70, 100} and γ ∈ {0.9, 0.95, 0.98}, but no improvement is observed.

E.2 Add Weight
We speculate that the unsatisfactory result produced by adding up two uniform distributions arises because, for types with few relations, a small change of α results in a significant change in U. Therefore, when sampling a negative r′, we instead add weights to relations that are of the same type as r. Concretely, we substitute r with r′ with a probability p that gives extra weight to r′ ∈ T(r), where T(r) denotes all the relations of the same type as r, the extra weight is a hyperparameter, and N is a normalizing constant. We set the extra weight to 0 at the beginning of training, and every k epochs it increases by γ. We do grid search for k ∈ {50, 70, 100} and γ ∈ {0.5, 1}; still no improvement is observed.

F Wikidata annotation guidance
We show the guidance provided for the annotators here.
• A pair of relations should be marked as 4 points if the two relations are only two different expressions for a certain meaning.
Example: (study at, be educated at)
• A pair of relations should be marked as 3 points if the two relations describe the same topic, and the entities that the two relations connect are of the same type respectively.
Example: (be the director of, be the screenwriter of), both relations relate to movie, and the types of the entities they connect are both (person, movie).
• A pair of relations should be marked as 2 points if the two relations describe the same topic, but the entities that the two relations connect are of different types respectively.
Example: (be headquartered in, be founded in), both relations relate to organization, but the types of the entities they connect are different, i.e., (company, location) and (company, time)
• A pair of relations should be marked as 1 point if the two relations do not meet the conditions above but still have a semantic relation.
Example: (be the developer of, be the employer of)
• A pair of relations should be marked as 0 points if the two relations do not have any connection.
Example: (be a railway station locates in, be published in)

Figure 1: Head-tail entity pairs of relation "be an unincorporated community in" (in blue) and "be a small city in" (in red) sampled from our fact distribution model. The coordinates of the points are computed by t-SNE (Maaten and Hinton, 2008) on the concatenation of head and tail embeddings8. The two larger blue and red points indicate the embeddings of these two relations.
8 Embeddings used in this graph are from a trained TransE model.

Figure 2: Spearman correlations between human judgment and models' outputs. The inter-subject correlation is also shown as a horizontal line with its standard deviation as an error band. Our model shows the strongest positive correlation with human judgment, and, in other words, the smallest margin with human inter-subject agreement. Significance: ***/**/* := p < .001/.01/.05.

Table 2: Methods to define a similarity function with different types of relation representations.

Table 3: Statistics of the triple sets used in this paper.

12 The construction procedure is shown in §4.1.
13 The Chinese restaurant process is shown in Appendix B.

Table 4: The experiment results on the toy dataset show that our metric based on probability distribution significantly outperforms other relation similarity metrics.

Figure 4: Similarity rank distributions of distracting relations on different tasks and datasets. Most of the distracting relations have top similarity ranks. Distracting relations are, as defined previously, the relations that have a higher rank in the relation classification result than the ground truth.
The annotators judge whether the two relations in a relation pair have the same meaning; a relation pair is viewed as valid only if 8 of the annotators annotate it as valid. We use the annotations to approximate recall and precision with equation (10) and equation (12). Apart from the confidence interval of precision shown in the figure, the largest 95% confidence interval among thresholds for recall is 0.0416. From the result, we can see that our model outperforms the other models' similarity measures by a large margin.