Using Pairwise Occurrence Information to Improve Knowledge Graph Completion on Large-Scale Datasets

Bilinear models such as DistMult and ComplEx are effective methods for knowledge graph (KG) completion. However, they require large batch sizes, which becomes a performance bottleneck when training on large-scale datasets due to memory constraints. In this paper we use occurrences of entity-relation pairs in the dataset to construct a joint learning model and to improve the quality of sampled negatives during training. We show on three standard datasets that combining these two techniques yields a significant improvement in performance, especially when the batch size and the number of generated negative examples are low relative to the size of the dataset. We then apply our techniques to a dataset containing 2 million entities and demonstrate that our model outperforms the baseline by 2.8% absolute on hits@1.


Introduction
A Knowledge Graph (KG) is a collection of facts stored as triples, e.g. (Berlin, is-capital-of, Germany). Even though knowledge graphs are essential for various NLP tasks, open-domain knowledge graphs have many missing facts. To tackle this issue, there has recently been considerable interest in KG completion methods, where the goal is to rank correct triples above incorrect ones.
Embedding methods such as DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) are simple and effective methods for this task, but are known to be sensitive to hyperparameter and loss function choices (Kadlec et al., 2017; Lacroix et al., 2018). When paired with the right loss function, these methods need large minibatches and a large number of corrupted triples per positive triple during training to reach peak performance. This causes memory issues for KGs in the wild, which are several orders of magnitude bigger than the common benchmarking datasets.
*Work done while the author was an intern.
To address the issue of scalability, we develop a framework that can be used with any bilinear KG embedding model. We name our model JoBi (Joint model with Biased negative sampling). Our framework uses occurrences of entity-relation pairs to overcome data sparsity and to bias the model towards scoring plausible triples higher. The framework trains a base model jointly with an auxiliary model that uses occurrences of pairs within a given triple in the data as labels. For example, the auxiliary model would receive the label 1 for the triple (Berlin, is-capital-of, France) if the pairs (Berlin, is-capital-of) and (is-capital-of, France) are both present in the training data, while the base model would receive the label 0.
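The auxiliary labels depend only on which entity-relation pairs occur in the training data. A minimal sketch of how they could be derived (the triples and function names here are illustrative, not taken from the released code):

```python
def build_pair_index(train_triples):
    """Collect the (head, relation) and (relation, tail) pairs seen in training."""
    hr_pairs, rt_pairs = set(), set()
    for h, r, t in train_triples:
        hr_pairs.add((h, r))
        rt_pairs.add((r, t))
    return hr_pairs, rt_pairs

def pair_label(triple, hr_pairs, rt_pairs):
    """Auxiliary label: 1 iff both pairs of the triple occur in training."""
    h, r, t = triple
    return int((h, r) in hr_pairs and (r, t) in rt_pairs)

# Illustrative training data (not from the actual datasets):
train = [("Berlin", "is-capital-of", "Germany"),
         ("Paris", "is-capital-of", "France")]
hr_pairs, rt_pairs = build_pair_index(train)
```

Under this labeling, the false triple (Berlin, is-capital-of, France) still receives the auxiliary label 1, since both of its pairs occur in the data.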
The intuition for using bigram occurrences is to capture information about restrictions on the set of entities that can appear as the object or subject of a given relation; this information should implicitly correspond to underlying type constraints. For example, even though (Berlin, is-capital-of, France) is not a correct triple, Berlin is of the right type for the subject of is-capital-of.
Our framework also utilizes entity-relation pair occurrences to improve the distribution of negative examples for contrastive training, by sampling a false triple such as (Berlin, is-capital-of, France) with higher probability if the pairs (Berlin, is-capital-of) and (is-capital-of, France) both occur in the dataset. This tunes the noise distribution so that the task is more challenging, and hence the model needs only a fraction of the negative examples required under a uniform distribution.
We show empirically that joint training is especially beneficial when the batch size is small, and that biased negative sampling helps the model learn higher-quality embeddings with far fewer negative samples. We show that the two techniques are complementary and perform significantly better when combined. We then test JoBi on a large-scale dataset and demonstrate that JoBi learns better embeddings in very large KGs.

Background
Formally, given a set of entities $E = \{e_0, \ldots, e_n\}$ and a set of relations $R = \{r_0, \ldots, r_m\}$, a Knowledge Graph (KG) is a set of triples $G \subseteq E \times R \times E$, where a triple $(h, r, t) \in G$ indicates that relation $r$ holds between entities $h$ and $t$. Given such a KG, the aim of KG completion is to score each triple in $E \times R \times E$ so that correct triples are assigned higher scores than false ones. KG embedding methods achieve this by learning dense vector representations for entities and relations through optimizing a chosen scoring function. A class of KG completion models such as RESCAL (Nickel et al., 2012), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018) and TUCKER (Balažević et al., 2019) define their scoring function to be a bilinear interaction of the embeddings of the entities and relation in the triple. For this work we consider DistMult, ComplEx and SimplE as our baseline models due to their simplicity.
DistMult. (Yang et al., 2015) is a knowledge graph completion model that defines the scoring function for a triple as a simple bilinear interaction, where an entity has the same representation regardless of whether it appears as the head or the tail entity. For entities $h, t$, relation $r$, and embeddings $\mathbf{h}, \mathbf{t}, \mathbf{r} \in \mathbb{R}^d$, the scoring function is defined as $s(h, r, t) = \mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \mathbf{t}$, where $\mathrm{diag}(\mathbf{r})$ is a diagonal matrix with $\mathbf{r}$ on the diagonal.
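As a concrete illustration (a sketch, not the paper's implementation), the DistMult score reduces to an elementwise triple product over the embedding dimensions:

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult score s(h, r, t) = h^T diag(r) t = sum_i h_i * r_i * t_i."""
    return float(np.sum(h * r * t))
```

Note that the score is unchanged if `h` and `t` are swapped, which is exactly the symmetry that prevents DistMult from modeling antisymmetric relations.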
ComplEx. (Trouillon et al., 2016) is a bilinear model similar to DistMult. Because of its symmetric structure, DistMult cannot model antisymmetric relations. ComplEx overcomes this shortcoming by learning embeddings in a complex vector space, and defining the embedding of an entity in tail position as the complex conjugate of the embedding in the head position.
Let $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{C}^d$ be the embeddings for $h, r, t$. The score for ComplEx is defined as $s(h, r, t) = \mathrm{Re}(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}} \rangle) = \mathrm{Re}\left(\sum_i h_i r_i \bar{t}_i\right)$, where $\bar{\mathbf{a}}$ denotes the complex conjugate of $\mathbf{a}$, and $\mathrm{Re}(\mathbf{a})$ denotes the real part of the complex vector $\mathbf{a}$.
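A corresponding sketch of the ComplEx score; conjugating the tail embedding is what breaks the head-tail symmetry of DistMult:

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx score s(h, r, t) = Re(sum_i h_i * r_i * conj(t_i))."""
    return float(np.real(np.sum(h * r * np.conj(t))))
```

With a purely imaginary relation embedding, swapping head and tail flips the sign of the score, so antisymmetric relations become representable.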
SimplE. (Kazemi and Poole, 2018) is also a bilinear model similar to DistMult. Each entity $e$ has two associated embeddings $\mathbf{e}_1, \mathbf{e}_2 \in \mathbb{R}^d$, where one is the representation of $e$ as the head and the other as the tail entity of the triple. Each relation also has two associated embeddings, $\mathbf{r}$ and $\mathbf{r}^{-1}$, where $\mathbf{r}^{-1}$ is the representation of the reverse of $r$. The scoring function is defined as $s(h, r, t) = \frac{1}{2}\left(\langle \mathbf{h}_1, \mathbf{r}, \mathbf{t}_2 \rangle + \langle \mathbf{t}_1, \mathbf{r}^{-1}, \mathbf{h}_2 \rangle\right)$.
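The SimplE score averages a forward term and an inverse-relation term over the role-specific embeddings; a sketch (the argument names are ours):

```python
import numpy as np

def simple_score(h_head, h_tail, t_head, t_tail, r, r_inv):
    """SimplE score: 0.5 * (<h_head, r, t_tail> + <t_head, r_inv, h_tail>).
    *_head / *_tail are an entity's head-role and tail-role embeddings."""
    forward = float(np.sum(h_head * r * t_tail))
    inverse = float(np.sum(t_head * r_inv * h_tail))
    return 0.5 * (forward + inverse)
```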

Joint framework
JoBi contains two copies of a bilinear model, where one is trained on labels of triples and the other on occurrences of entity-relation pairs within the triples. For the pair module, we label a triple $(h, r, t)$ correct if there are triples $(h, r, t')$ and $(h', r, t)$ in the training set for some $t'$ and $h'$. The scoring functions for the pair and triple modules are $s_{bi}$ and $s_{tri}$ respectively. We tie the weights of the entity embeddings, but let the embeddings for the relations be optimized separately. With ComplEx as the base model, the two scoring functions are $s_{tri}(h, r, t) = \mathrm{Re}(\langle \mathbf{h}, \mathbf{r}_{tri}, \bar{\mathbf{t}} \rangle)$ and $s_{bi}(h, r, t) = \mathrm{Re}(\langle \mathbf{h}, \mathbf{r}_{bi}, \bar{\mathbf{t}} \rangle)$, where $\mathbf{r}_{tri}$ and $\mathbf{r}_{bi}$ are the separate relation embeddings of the two modules. We define the framework for DistMult and SimplE analogously. During training we optimize the two jointly, but use only $s_{tri}$ at test time.
Hence, the addition of the auxiliary module has no effect on the number of parameters of the final trained model. Note that even during training, this does not increase model complexity in any significant way, since the number of relations in a KG is often a small fraction of the number of entities.
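The parameter sharing described above can be sketched as follows. This is a toy illustration with random embeddings, not the released implementation: both modules read the same entity table, while each keeps its own relation table.

```python
import numpy as np

class JoBiComplEx:
    """Toy sketch of JoBi's parameter sharing: one shared entity table,
    separate relation tables for the triple and pair modules."""

    def __init__(self, n_ent, n_rel, dim, seed=0):
        rng = np.random.default_rng(seed)
        def table(n):
            return rng.normal(size=(n, dim)) + 1j * rng.normal(size=(n, dim))
        self.ent = table(n_ent)      # shared between both modules
        self.rel_tri = table(n_rel)  # relation embeddings, triple module
        self.rel_bi = table(n_rel)   # relation embeddings, pair module

    def _score(self, rel_table, h, r, t):
        # ComplEx score: Re(<h, r, conj(t)>)
        return float(np.real(np.sum(self.ent[h] * rel_table[r] * np.conj(self.ent[t]))))

    def score_tri(self, h, r, t):
        """Triple-module score; the only score used at test time."""
        return self._score(self.rel_tri, h, r, t)

    def score_bi(self, h, r, t):
        """Auxiliary pair-module score; used only during training."""
        return self._score(self.rel_bi, h, r, t)
```

Because only `score_tri` is used at test time, the auxiliary relation table can be discarded after training, leaving the parameter count of the final model unchanged.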
For each triple in the minibatch, we generate $n_{neg}$ negative examples per positive triple by randomly corrupting the head or the tail entity. For $s_{tri}$, we use the negative log-likelihood of softmax as the loss function, and for $s_{bi}$ we use binary cross-entropy loss. We combine the two losses via a simple weighted addition with a tunable hyperparameter $\alpha$: $\mathcal{L} = \mathcal{L}_{tri} + \alpha \mathcal{L}_{bi}$.

Biased negative sampling. We also examine the effect of using pair co-occurrence information to make contrastive training more challenging for the model. For this, we keep the model as is, but with probability $p$, instead of corrupting the head or the tail of the triple with an entity chosen uniformly at random, we corrupt it with an entity picked uniformly from the set of entities that occur as the head or tail entity of the relation in the given triple. To illustrate, when sampling a negative tail entity for the tuple (Berlin, is-capital-of), this method causes the model to pick France with higher probability than George Orwell if France, but not George Orwell, occurs as the tail entity of the relation is-capital-of in the training data.
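The tail-corruption side of biased negative sampling can be sketched as follows (head corruption is analogous with a head-entity index; the triples and names are illustrative):

```python
import random
from collections import defaultdict

def build_relation_tails(train_triples):
    """For each relation, the set of entities observed as its tail in training."""
    tails = defaultdict(set)
    for h, r, t in train_triples:
        tails[r].add(t)
    return tails

def sample_negative_tail(h, r, t, entities, rel_tails, p, rng=random):
    """With probability p, corrupt the tail using an entity that has occurred
    as a tail of relation r; otherwise fall back to uniform sampling."""
    biased_pool = sorted(rel_tails[r] - {t})
    if biased_pool and rng.random() < p:
        return rng.choice(biased_pool)
    return rng.choice([e for e in entities if e != t])
```

Raising `p` shifts probability mass toward type-plausible negatives, which is what makes the contrastive task harder at a fixed negative ratio.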

Experiments
We perform our experiments on the standard datasets FB15K (Bordes et al., 2013), FB15K-237 (Toutanova et al., 2015) and YAGO3-10 (Dettmers et al., 2018), and on a new large-scale dataset FB1.9M which we constructed from FB3M (Xu and Barbosa, 2018). We focus on YAGO3-10 since it is 10 times larger than the other two and better reflects how the performance of the models scales. We present a comparison of the sizes of these datasets in Table 1; further details can be found in Appendix A. For evaluation, we rank each triple (h, r, t) in the test set against (h', r, t) for all entities h', and similarly against (h, r, t') for all entities t'. We filter out candidates that occur in the training, validation or test set as described in Bordes et al. (2013), and report average hits@1, 3, 10 and mean reciprocal rank (MRR). We do not perform experiments on the WordNet-derived datasets WN18 or WN18RR because bigram modelling would not provide any information there: all entities are synsets, and almost all can occur as an object or subject of all possible relations.
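The filtered ranking protocol can be sketched as follows (a simplified tail-side version; the head side is analogous, and the names are ours):

```python
def filtered_rank(score_fn, test_triple, entities, known_triples):
    """Rank of the true tail among all candidates, after filtering out
    candidates that form a known (train/valid/test) triple."""
    h, r, t = test_triple
    true_score = score_fn(h, r, t)
    rank = 1
    for e in entities:
        if e == t or (h, r, e) in known_triples:
            continue  # skip the true tail itself and filtered candidates
        if score_fn(h, r, e) > true_score:
            rank += 1
    return rank
```

From the per-triple ranks, hits@k is the fraction of ranks at most k, and MRR is the mean of the reciprocal ranks.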
We re-implement all our baselines and obtain very competitive results. In our preliminary experiments on the baselines, we found that the choice of loss function had a large effect on performance, with negative log-likelihood (NLL) of softmax consistently outperforming both max-margin and logistic losses, and that larger batch sizes lead to better performance. With NLL of sampled softmax, we found that increasing the number of generated negatives steadily increases performance, and state-of-the-art results can be reached by using the full softmax as in Joulin et al. (2017) and Lacroix et al. (2018). This technique is feasible for standard benchmarks but not for large KGs; we report results in Appendix D for all datasets small enough to allow full contrastive training. However, our main experiments use NLL of sampled softmax since our focus is on scalability. Note that the results with full softmax (Appendix D) demonstrate that our implementation of the baselines is very competitive: our implementation of ComplEx performs significantly better than ConvE (Dettmers et al., 2018) on two out of the three datasets, and comes close to the results of Lacroix et al. (2018), who use extremely large embeddings as well as full softmax and thus cannot scale. Our code is publicly available. For most of our experiments, we use ComplEx as the base for our model (JoBi ComplEx), since this configuration consistently outperformed others in preliminary experiments. To test the effect of our techniques on different bilinear models, we also report results with DistMult (JoBi DistMult) and SimplE (JoBi SimplE) on FB15K-237.
Discussion. The main results are shown in Table 2. Our techniques are particularly valuable on the large dataset, where it is not possible to perform softmax over the entire set of entities or use very large embedding sizes due to memory constraints. Although one epoch of JoBi takes slightly longer than the baseline, JoBi converges in fewer epochs, resulting in a shorter running time overall. We report running times on FB1.9M in Table 4.
Comparison with TypeComplex. For TypeComplex, Jain et al. (2018) use a wider set of negative ratios in their grid search than we do. To isolate the effect of the different models from hyperparameter choices, we set the negative ratio for our model to 400 to match the setting of their best-performing models, and keep the other hyperparameters the same as in our previous experiments. Jain et al. (2018) report their results with a modified version of the ranking evaluation procedure, where only the tail entity is ranked against all other entities. To compare our model to theirs, we also report the performance of our framework on this modified metric. The results for these experiments can be found in Table 5.
Our model generally outperforms TypeComplex by a large margin on hits@10. It also outperforms TypeComplex on MRR by a moderate margin, except on FB15K-237, the smallest dataset. On the other hand, TypeComplex outperforms our model on hits@1 on two out of the three datasets. In fact, on FB15K, TypeComplex does worse on hits@10 than the baseline model. This suggests that TypeComplex may be sacrificing hits@k for larger k in order to improve hits@1, which might be undesirable depending on the application.
Qualitative analysis. We analyzed correct predictions made by JoBi ComplEx but not by regular ComplEx. Among the relations in YAGO3-10, major gains can be observed for hasGender (Appendix C). The improvement comes solely from tail-entity predictions, with hits@1 increasing from 0.22 to 0.86. Furthermore, we found that the errors made by ComplEx are exactly of the kind that can be mitigated by enforcing plausibility: ComplEx predicts an object that is not a gender (e.g. a sports team or a person) 65% of the time, while JoBi makes such an obvious mistake only 2% of the time.
Ablation studies. We compare joint training without biased sampling (Joint) and biased sampling without joint training (BiasedNeg) to the full model JoBi on YAGO3-10. The results can be found in Table 6. We also conduct experiments to isolate the effect of our techniques under varying batch sizes and negative ratios; the results are presented in Figures 1 and 2. Training details can be found in Appendix B.

Table 6 shows that Joint on its own gives a slight performance boost over the baseline, while BiasedNeg performs slightly below the baseline on all measures. However, combining the two techniques in JoBi gives a 5.6-point improvement on hits@1. This suggests that biased negative sampling greatly increases the efficacy of joint training, but is not very effective on its own. Figures 1 and 2 show that JoBi not only consistently performs best over the entire range of parameters, but also delivers a performance improvement that is especially large when the batch size or the negative ratio is small. This setting was designed to reflect the training conditions on very large datasets. BiasedNeg is more robust to low negative ratios, and both BiasedNeg and Joint alone show less deterioration in performance as the batch size decreases. When the two methods are combined in JoBi, training becomes more robust to the choice of both of these parameters.
The reason BiasedNeg performs worse on its own but better with Joint could be the choice of binary cross-entropy loss for the pair module. We speculate that as the negative ratio increases, the ratio of negative to positive examples for this module becomes more skewed. Biasing the negative triples during training alleviates this problem by making the classes more balanced, which allows the joint training to be more effective.

Related work
Utilizing pair occurrences in embedding models has been considered before, both as an explicit modelling choice and as a negative sampling strategy. Chang et al. (2014) and Krompaß et al. (2015) use pair occurrences to constrain the set of triples used in the optimization procedure. For methods that rely on SGD with contrastive training, this corresponds to a special case of our biased sampling method with p = 1. Garcia-Durán et al. (2016) present TATEC, a model that combines bigram and trigram interactions. The trigram model uses a full matrix representation for relations, and hence has many more parameters than our model. Jain et al. (2018) present JointDM and JointComplex, which can be viewed as simplifications of TATEC. Unlike our model, both of these methods use the bigram terms in both training and evaluation, do not share any embeddings between the two models, and do not provide supervision based on pair occurrences in the data. Other methods for improving the negative sampling procedure include adversarial (Cai and Wang, 2018) and self-adversarial (Sun et al., 2019) training. None of these methods focus on scaling models to large KGs.

Conclusion
We have presented a joint framework for KG completion that utilizes entity-relation pair occurrences as an auxiliary task, and combined it with a technique that generates informative negative examples with higher probability. We have shown that joint training makes the model more robust to smaller batch sizes, and that biased negative sampling makes it more robust to the number of generated negative samples. Furthermore, these techniques perform well above the baselines when combined, and are effective on a very large KG dataset. Applying JoBi to non-bilinear models is also possible, but is left for future work.