Type-Sensitive Knowledge Base Inference Without Explicit Type Supervision

State-of-the-art knowledge base completion (KBC) models predict a score for every known or unknown fact via a latent factorization over entity and relation embeddings. We observe that when they fail, they often make entity predictions that are incompatible with the type required by the relation. In response, we enhance each base factorization with two type-compatibility terms between entity-relation pairs, and combine the signals in a novel manner. Without explicit supervision from a type catalog, our proposed modification obtains up to 7% MRR gains over base models, and new state-of-the-art results on several datasets. Further analysis reveals that our models better represent the latent types of entities and their embeddings also predict supervised types better than the embeddings fitted by baseline models.


Introduction
Knowledge bases (KBs) store facts in the form of relations ($r$) between subject entity ($s$) and object entity ($o$), e.g., $\langle$Obama, born-in, Hawaii$\rangle$. Since KBs are typically incomplete (Bollacker et al., 2008), the task of KB Completion (KBC) attempts to infer new tuples from a given KB. Neural approaches to KBC, e.g., Complex (Trouillon et al., 2016) and DistMult (Yang et al., 2015), calculate the score $f(s, r, o)$ of a tuple $(s, r, o)$ via a latent factorization over entity and relation embeddings, and use these scores to predict the validity of an unseen tuple.
A model is evaluated over queries of the form $\langle s^*, r^*, ? \rangle$. It ranks all entities $o$ in descending order of tuple score $f(s^*, r^*, o)$, and credit is assigned based on the rank of the gold entity $o^*$. Our preliminary analysis of DistMult (DM) and Complex (CX) reveals that they frequently err by ranking highly entities that are incompatible with the types expected as arguments of $r^*$. In 19.5% of predictions made by DM on FB15K, the top prediction has a type different from what is expected (see Table 1 for illustrative examples).
In response, we propose a modification to the base models (DM, Complex) that explicitly models type compatibility. Our modified score $f'(s, r, o)$ is the product of three terms: the original tuple score $f(s, r, o)$, a subject type-compatibility term between $r$ and $s$, and an object type-compatibility term between $r$ and $o$. Our type-sensitive models, TypeDM and TypeComplex, do not expect any additional type-specific supervision; they induce all embeddings using only the original KB.
Experiments over three datasets show that all typed models outperform base models by significant margins, obtaining new state-of-the-art results in several cases. We perform additional analyses to assess if the learned embeddings indeed capture the type information well. We find that embeddings from typed models can predict known symbolic types better than base models.
Finally, we note that an older model called E (Riedel et al., 2013) can be seen as modeling type compatibilities. Moreover, previous work has explored additive combinations of DM and E (Garcia-Duran et al., 2015b). We directly compare against these models and find that our proposal outperforms E, DM, and their linear combinations.
We contribute open-source implementations of all models and experiments discussed in this paper for further research.

Background and Related Work
We are given an incomplete KB with entities $E$ and relations $R$. The KB also contains $T = \{\langle s, r, o \rangle\}$, a set of known valid tuples, each with subject and object entities $s, o \in E$ and relation $r \in R$. Our goal is to predict the validity of any tuple not present in $T$. Popular top-performing models for this task are Complex and DM.
In Complex, each entity $e$ (resp., relation $r$) is represented as a complex vector $\mathbf{a}_e \in \mathbb{C}^D$ (resp., $\mathbf{b}_r \in \mathbb{C}^D$). The tuple score is
$$f_{\mathrm{CX}}(s, r, o) = \Re\Big(\sum_{d=1}^{D} a_{sd}\, b_{rd}\, \overline{a_{od}}\Big),$$
where $\Re(z)$ is the real part of $z$ and $\bar{z}$ is the complex conjugate of $z$. Holographic embeddings (Nickel et al., 2016) are algebraically equivalent to Complex. In DM, each entity $e$ is represented as a vector $\mathbf{a}_e \in \mathbb{R}^D$, each relation $r$ as a vector $\mathbf{b}_r \in \mathbb{R}^D$, and the tuple score is the three-way inner product
$$f_{\mathrm{DM}}(s, r, o) = \langle \mathbf{a}_s, \mathbf{b}_r, \mathbf{a}_o \rangle = \sum_{d=1}^{D} a_{sd}\, b_{rd}\, a_{od}.$$
Riedel et al. (2013) proposed a different model called E: relation $r$ is represented by two vectors $\mathbf{v}_r, \mathbf{w}_r \in \mathbb{R}^D$, and the tuple score is
$$f_{\mathrm{E}}(s, r, o) = \mathbf{a}_s \cdot \mathbf{v}_r + \mathbf{a}_o \cdot \mathbf{w}_r.$$
E may be regarded as a relation prediction model that depends purely on type compatibility checking.
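To make the three factorizations concrete, the following is a minimal numpy sketch of the scoring functions above; the embedding values are randomly initialized purely for illustration and carry no trained meaning.

```python
import numpy as np

D = 4  # embedding dimension (tiny, for illustration)
rng = np.random.default_rng(0)

# DistMult: real vectors for entities and relations.
a_s, b_r, a_o = rng.normal(size=(3, D))
f_dm = np.sum(a_s * b_r * a_o)  # three-way inner product <a_s, b_r, a_o>

# Complex: complex vectors; the score is the real part, with a_o conjugated.
ca_s, cb_r, ca_o = rng.normal(size=(3, D)) + 1j * rng.normal(size=(3, D))
f_cx = np.real(np.sum(ca_s * cb_r * np.conj(ca_o)))

# E: two relation vectors; the score decomposes into subject and object parts.
v_r, w_r = rng.normal(size=(2, D))
f_e = a_s @ v_r + a_o @ w_r
```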
Observe that in $\langle \mathbf{a}_s, \mathbf{b}_r, \mathbf{a}_o \rangle$, $\mathbf{b}_r$ mediates a direct compatibility between $s$ and $o$ for relation $r$, whereas in $\mathbf{a}_s \cdot \mathbf{v}_r + \mathbf{a}_o \cdot \mathbf{w}_r$, we are scoring how well $s$ can serve as subject and $o$ as object of the relation $r$. Thus, in the second case, $\mathbf{a}_e$ is expected to encode the type(s) of entity $e$, where, by 'type', we loosely mean "information that helps decide if $e$ can participate in a relation $r$, as subject or object." Heuristic filtering of entities that do not match the desired type at test time has been known to improve accuracy (Krompaß et al., 2015). Our typed models formalize this within the embeddings and allow for discovery of latent types without additional data. Krompaß et al. (2015) also use heuristic typing of entities to generate negative samples while training the model. Our experiments find that this approach is not very competitive against our typed models.

TypeDM and TypeComplex
Representation: We start with DM as the base model; the Complex case is identical. The first key modification (see Figure 1) is that each entity $e$ is now represented by two vectors: $\mathbf{u}_e \in \mathbb{R}^K$ to encode type information, and $\mathbf{a}_e \in \mathbb{R}^D$ to encode relational information. Typically, $K \ll D$. The second, concomitant modification is that each relation $r$ is now associated with three vectors: $\mathbf{b}_r \in \mathbb{R}^D$ as before, and also $\mathbf{v}_r, \mathbf{w}_r \in \mathbb{R}^K$. $\mathbf{v}_r$ and $\mathbf{w}_r$ encode the expected types for subject and object entities.
An ideal way to train type embeddings would be to provide canonical type signatures for each relation and entity. Unfortunately, these aspects of realistic KBs are themselves incomplete (Neelakantan and Chang, 2015). Our models train all embeddings using $T$ only and do not rely on any explicit type supervision. DM uses $(E + R)D$ model weights for a KB with $R$ relations and $E$ entities, whereas TypeDM uses $E(D + K) + R(D + 2K)$. To make comparisons fair, we set $D$ and $K$ so that the total numbers of model weights (real or complex) are about the same for base and typed models.
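As a quick sanity check on this budget-matching arithmetic, the sketch below plugs in FB15K's published sizes (14,951 entities, 1,345 relations) and the typed-model dimensions used later in the paper ($D = 180$, $K = 19$); the assumption that the base models use 200 dimensions is ours, chosen so the two counts line up.

```python
E, R = 14_951, 1_345  # FB15K entity and relation counts

def dm_weights(D):
    # One D-dim vector per entity and per relation.
    return (E + R) * D

def typedm_weights(D, K):
    # Entities: a_e (D) + u_e (K); relations: b_r (D) + v_r, w_r (2K).
    return E * (D + K) + R * (D + 2 * K)

print(dm_weights(200))          # 3,259,200
print(typedm_weights(180, 19))  # 3,268,459 -- roughly the same budget
```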
Prediction: DM's base prediction score for tuple $(s, r, o)$ is $\langle \mathbf{a}_s, \mathbf{b}_r, \mathbf{a}_o \rangle$. We apply a sigmoid nonlinearity
$$f_1(s, r, o) = \sigma(\langle \mathbf{a}_s, \mathbf{b}_r, \mathbf{a}_o \rangle) \quad (1)$$
and then combine it with two additional terms that measure type compatibility between the subject and the relation, and between the object and the relation:
$$f(s, r, o) = f_1(s, r, o) \cdot C(\mathbf{u}_s, \mathbf{v}_r) \cdot C(\mathbf{u}_o, \mathbf{w}_r), \quad (2)$$
where $C(\mathbf{u}_e, \cdot)$ is a function that measures the compatibility between the type embedding of $e$ and a given argument slot of $r$. If each of the three terms in Equation 2 is interpreted as a probability, $f(s, r, o)$ corresponds to a simple logical AND of the three conditions.
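Below is a minimal sketch of the combined TypeDM score of Equation 2, in the same numpy style as the earlier snippet. Taking the compatibility function $C$ to be a sigmoid of a dot product is our assumption here, consistent with the probabilistic reading above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def typedm_score(a_s, b_r, a_o, u_s, u_o, v_r, w_r):
    """Equation 2: base DM score gated by two type-compatibility terms."""
    base = sigmoid(np.sum(a_s * b_r * a_o))  # Eq. 1: sigmoid of DM score
    subj_ok = sigmoid(u_s @ v_r)             # does s fit r's subject slot?
    obj_ok = sigmoid(u_o @ w_r)              # does o fit r's object slot?
    return base * subj_ok * obj_ok           # soft logical AND, in [0, 1]

rng = np.random.default_rng(1)
D, K = 180, 19
a_s, b_r, a_o = rng.normal(size=(3, D))
u_s, u_o = rng.normal(size=(2, K))
v_r, w_r = rng.normal(size=(2, K))
print(typedm_score(a_s, b_r, a_o, u_s, u_o, v_r, w_r))
```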
We want $f(s, r, o)$ to be close to 1 for positive instances (tuples known to be in the KB) and close to 0 for negative instances (tuples not in the KB). For a negative instance, one or more of the three terms may be near zero, and there is no guidance to the learner on which term to drive down.
Because $f \in [0, 1]$ for typed models, we scale it with a hyper-parameter $\beta > 0$ (a form of inverse temperature), allowing $\Pr(o|s, r)$ to take values over the full range $[0, 1]$ during loss minimization:
$$\Pr(o|s, r; \theta) = \frac{\exp(\beta f(s, r, o))}{\sum_{o'} \exp(\beta f(s, r, o'))}.$$
The sum over $o'$ in the denominator is based on contrastive sampling, so the left-hand side is not a formal probability (exactly as in DM). A similar term is defined for $\Pr(s|r, o)$. The log-likelihood loss minimizes
$$-\sum_{\langle s, r, o \rangle \in P} \big[\log \Pr(o|s, r; \theta) + \log \Pr(s|o, r; \theta)\big],$$
where the summation is over $P$, the set of all positive facts. Following Trouillon et al. (2016), we also implement the logistic loss
$$\sum_{\langle s, r, o \rangle \in T'} \log\big(1 + \exp(-Y_{sro}\, \beta f(s, r, o))\big),$$
where $Y_{sro}$ is 1 if the fact $(s, r, o)$ is true and $-1$ otherwise, and $T'$ is the set of all positive facts along with the negative samples. With the logistic loss, model weights $\theta$ are L2-regularized and the gradient norm is clipped at 1.
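A sketch of the sampled log-likelihood objective under these definitions, in PyTorch; the batch layout (one positive plus its sampled negatives per row) and the toy scores are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def log_likelihood_loss(pos_scores, neg_scores, beta=20.0):
    """Sampled-softmax NLL for Pr(o | s, r), as in the text.

    pos_scores: (B,) typed scores f(s, r, o*) in [0, 1] for the gold objects.
    neg_scores: (B, N) typed scores for N sampled negative objects per query.
    """
    # Column 0 holds the gold object; cross_entropy then computes
    # -log softmax(beta * f)[gold], i.e. one sampled log-likelihood term.
    logits = beta * torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage: batch of 2 queries, 400 negatives each. A symmetric term
# for Pr(s | o, r) would be added in the same way.
loss = log_likelihood_loss(torch.rand(2), torch.rand(2, 400))
print(loss.item())
```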

Experiments
Datasets: We evaluate on three standard datasets: FB15K, FB15K-237, and YAGO3-10 (Bordes et al., 2013; Dettmers et al., 2017). We retain the exact train, dev, and test folds used in previous works. TypeDM and TypeComplex are competitive on the WN18 dataset (Bordes et al., 2013), but we omit those results, as WN18 has 18 very generic relations (e.g., hyponym, hypernym, antonym, meronym), which do not give enough evidence for inducing types. For all the typed models, to balance total model sizes (Table 2), we choose $K = 19$ dimensions for $\mathbf{u}_e, \mathbf{v}_r, \mathbf{w}_r$ and $D = 180$ dimensions for $\mathbf{a}_e, \mathbf{b}_r$.
Typed models and E perform best with 400 negative samples per positive tuple when using the log-likelihood loss (which is robust to a larger number of negative facts, as opposed to the logistic loss, which suffers under class imbalance). FB15K and YAGO3-10 use an L2 regularization coefficient of 2.0; it is 5.0 for FB15K-237. Note that the L2 regularization penalty is applied only to those entities and relations that are part of each batch update, as proposed by Trouillon et al. (2016). $\beta$ is set to 20.0 for the typed models, and to 1.0 for other models if they use the log-likelihood loss. For the typed models, entity embeddings are unit-normalized at the end of every epoch. We also find that in TypeDM, scaling the embeddings of the base model to unit norm performs better than using L2 regularization.

Results: Table 3 shows that TypeDM and TypeComplex dominate across all datasets. E by itself is understandably weak, and DM+E does not lift it much. Each typed model improves upon the corresponding base model on all measures, underscoring the value of type compatibility scores. To the best of our knowledge, the results of our typed models are competitive with various reported results for models of similar sizes that do not use any additional information, e.g., soft rules (Guo et al., 2018) or textual corpora.

We also compare against the heuristic generation of type-sensitive negative samples (Krompaß et al., 2015). For this experiment, we train a Complex model using this heuristically generated negative set, and use the standard evaluation, as in all other models. We find that all the models reported in Table 3 outperform this heuristic approach.

Analysis of Typed Embeddings
We perform two further analyses to assess whether the embeddings produced by typed models indeed capture type information better. For these experiments, we try to correlate (and predict) known symbolic types of an entity using the unsupervised embeddings produced by the models. We take a catalog of the 90 most frequent Freebase types over the 14,951 entities in the FB15K dataset (Xie et al., 2016). We exclude /common/topic as it occurs with most entities. On average, each entity has 12 associated types.

Clustering of Entity Embeddings: We focus on entities that belong to one of 5 coarse types (people, location, organization, film, and sports) from the Freebase dataset; these cover 84.88% of FB15K entities. We plot the FB15K entities $e$ using the PCA projections of $\mathbf{u}_e$ and $\mathbf{a}_e$ in Figure 2, color-coding their types. We observe that $\mathbf{u}_e$ separates the type clusters better than $\mathbf{a}_e$, suggesting that the $\mathbf{u}_e$ vectors indeed collect type information. We also perform k-means clustering of the $\mathbf{u}_e$ and $\mathbf{a}_e$ embeddings of these entities, as available from different models. We report cluster homogeneity and completeness scores (Rosenberg and Hirschberg, 2007) in Table 4. Typed models yield superior clusters.
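A small sketch of this clustering evaluation, assuming the type embeddings and coarse labels have already been loaded into arrays (random stand-ins below); homogeneity and completeness are computed with scikit-learn's standard implementations of the Rosenberg and Hirschberg (2007) measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score

# Stand-ins: u_e type embeddings (N x 19) and one coarse label per entity
# (0..4 for people/location/organization/film/sports).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 19))
labels = rng.integers(0, 5, size=1000)

pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print("homogeneity:", homogeneity_score(labels, pred))
print("completeness:", completeness_score(labels, pred))
```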

Prediction of Symbolic Types:
We train a single-layer network that takes embeddings from the various models as input and predicts a set of symbolic types from the KB. This tells us the extent to which the embeddings capture KB type information (which was not provided explicitly during training). Table 4 reports the average macro F1 score (5-fold cross-validation). Embeddings from TypeDM and TypeComplex are generally better predictors than embeddings learned by Complex, DM, and E. For the typed models, $\mathbf{u}_e \in \mathbb{R}^{19}$ is often a better predictor than $\mathbf{a}_e \in \mathbb{R}^{180}$ or larger. DM+E with 199-dimensional embeddings narrowly beats TypeDM's 19-dimensional $\mathbf{u}_e$, but recall that it has poorer KBC scores.
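The type-prediction probe can be set up as below: a sketch of a single-layer multi-label classifier in PyTorch. The array shapes, label density, optimizer, and epoch count are chosen for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Stand-ins: entity embeddings (N x 19, e.g. u_e from TypeDM) and a binary
# multi-label matrix over the 90 frequent Freebase types.
N, DIM, N_TYPES = 1000, 19, 90
X = torch.randn(N, DIM)
Y = (torch.rand(N, N_TYPES) < 0.13).float()  # ~12 types per entity on average

model = nn.Linear(DIM, N_TYPES)      # single-layer probe
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()     # one sigmoid per candidate type

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()

predicted = torch.sigmoid(model(X)) > 0.5  # thresholded types, fed to macro-F1
```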

Conclusion and Future Work
We propose an unsupervised typing gadget that enhances top-of-the-line base models for KBC (DistMult, Complex) with two type-compatibility functions, one between $r$ and $s$ and another between $r$ and $o$. Without explicit supervision from any type catalog, our typed variants (with a similar number of parameters as the base models) substantially outperform the base models, obtaining up to 7% MRR improvements and over 10% improvements in the correctness of the top result. To confirm that our models capture type information better, we correlate the embeddings learned without type supervision with existing type catalogs. We find that our embeddings indeed separate and predict types better. In future work, combining type-sensitive embeddings with a focus on less frequent relations (Xie et al., 2017), more frequent entities (Dettmers et al., 2017), or side information such as inference rules (Guo et al., 2018; Jain and Mausam, 2016) or textual corpora may further increase KBC accuracy. It may also be of interest to integrate the typing approach presented here with combinations of tensor and matrix factorization models for KBC (Jain et al., 2018).