Structure Aware Negative Sampling in Knowledge Graphs

Learning low-dimensional representations for entities and relations in knowledge graphs using contrastive estimation represents a scalable and effective method for inferring connectivity patterns. A crucial aspect of contrastive learning approaches is the choice of corruption distribution that generates hard negative samples, which force the embedding model to learn discriminative representations and find critical characteristics of observed data. While earlier methods either employ overly simple corruption distributions, i.e. uniform, yielding easy, uninformative negatives, or sophisticated adversarial distributions with challenging optimization schemes, they do not explicitly incorporate known graph structure, resulting in suboptimal negatives. In this paper, we propose Structure Aware Negative Sampling (SANS), an inexpensive negative sampling strategy that utilizes the rich graph structure by selecting negative samples from a node's k-hop neighborhood. Empirically, we demonstrate that SANS finds semantically meaningful negatives and is competitive with SOTA approaches while requiring no additional parameters or difficult adversarial optimization.


Introduction
Knowledge Graphs (KGs) are repositories of information organized as factual triples (h, r, t), where head and tail entities are connected via a particular relation r. Indeed, KGs have seen wide application in a variety of domains such as question answering (Yao and Van Durme, 2014; Hao et al., 2017; Moldovan and Rus, 2001) and machine reading (Weissenborn et al., 2018; Yang and Mitchell, 2017), to name a few, and have a rich history within the natural language processing (NLP) community (Berant et al., 2013; Yu and Dredze, 2014; Collobert and Weston, 2008; Peters et al., 2019). While often large, real-world KGs such as FreeBase (Bollacker et al., 2008) and WordNet (Miller, 1995) are known to be incomplete. Consequently, KG completion via link prediction constitutes a fundamental research topic that benefits important NLP tasks (Sun et al., 2019; Angeli and Manning, 2013).
* Equal contribution, names ordered alphabetically.
In recent years, there has been a surge of methods employing graph embedding techniques that encode KGs into a lower-dimensional vector space, facilitating easier data manipulation (Zhang et al., 2019) while providing an attractive framework for handling data sparsity and incompleteness (Wang et al., 2018). To learn such embeddings, contrastive learning has emerged as the de facto gold standard. Indeed, contrastive learning approaches enjoy significant computational benefits over methods that require computing an exact softmax over a large candidate set, such as over all possible tail entities given a head and relation. Another important consideration is modeling needs, as certain assumptions are best expressed as a score or energy in margin-based or un-normalized probability models (Smith and Eisner, 2005). For example, modeling entity relations as translations or rotations in a vector space naturally leads to a distance-based score that is minimized for observed entity-relation-entity triplets (Bordes et al., 2013).
Leveraging contrastive estimation to train KG embedding models involves increasing the model's score on observed positive triplets while simultaneously decreasing its score on negative triplets. Consequently, the choice of negative sampling distribution plays a crucial role in shaping the energy landscape: simple random sampling, e.g. Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), produces negatives that are easily classified and provide little information in the form of a gradient signal. This is easily remedied if the corruption process selects hard negative examples through a more complex negative sampling distribution, such as adversarial samplers (Cai and Wang, 2018; Bose et al., 2018; Sun et al., 2019). However, adversarial negative sampling methods are computationally expensive, while more tractable approaches, e.g. cache-based methods (Zhang et al., 2019), are not tailored to the KG setting, as they fail to incorporate known graph structure as part of the sampling process. This raises the important question of whether we can obtain a computationally inexpensive negative sampling strategy while benefiting from the rich graph structure of KGs.

Present Work. In this work, we introduce Structure Aware Negative Sampling (SANS), an algorithm that utilizes the graph structure of a KG to find hard negative examples. Specifically, SANS constructs negative samples using a subset of entities restricted to either the head or tail entity's k-hop neighborhood. We hypothesize that entities within each other's neighborhood that share no direct relation have a higher chance of being related to one another and are thus good candidates for negative sampling. We also experiment with a dynamic sampling scheme based on random walks to approximate a node's local neighborhood.
Empirically, we find that negative sampling using SANS consistently improves upon uniform sampling and sophisticated Generative Adversarial Network (GAN) (Goodfellow et al., 2014) based approaches at a fraction of the computational cost, and is competitive with other SOTA approaches with no added parameters.

Related Work
Negative Sampling. Negative sampling is a method that can be employed to enable the scaling of log-linear models. In essence, negative sampling resolves the computational intractability of computing the normalization constant by changing the task to distinguishing between observed positive data and fictitious negative examples generated by corrupting the positive examples. This general approach is a simplification of NCE, which is based on a Monte-Carlo approximation of the partition function used in Importance Sampling (IS) (Bengio et al., 2003).

Fixed Negative Sampling. As proposed by Mikolov et al. (2013), negative triplets can be generated using a uniform sampling scheme. However, such uniform and fixed sampling schemes result in easily-classified negative triplets during training, which do not provide any meaningful information (Sun et al., 2019; Zhang et al., 2019). Hence, as training progresses, most of the sampled negative triplets receive small scores and almost zero gradients, impeding the training of the graph embedding model after only a small number of iterations.
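As a concrete illustration, this fixed uniform corruption scheme can be sketched in a few lines of Python (a hedged sketch; the function name and signature are ours for illustration, not from any particular library):

```python
import random

def uniform_corrupt(triple, entities, rng=None):
    """Turn a positive triplet into a negative one by replacing the head
    or the tail with an entity drawn uniformly at random."""
    rng = rng or random.Random()
    h, r, t = triple
    e = rng.choice(entities)
    # A fair coin decides which side of the triplet is corrupted;
    # the relation is always kept intact.
    return (e, r, t) if rng.random() < 0.5 else (h, r, e)
```

Because the replacement entity is drawn from the full entity set, most negatives produced this way are unrelated to the positive triplet, which is exactly the weakness discussed above.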
To address the issue of easy negatives, Sun et al. (2019) propose Self-Adversarial negative sampling, which weights each sampled negative according to its probability under the embedding model. Alternatively, Wang et al. (2018) and Cai and Wang (2018) create high-quality negative samples by exploiting GANs, which, while effective, are expensive to train and require black-box gradient estimation techniques. Another elegant approach, which uses fewer parameters and is easier to train than GAN-based methods, is NSCaching (Zhang et al., 2019), which maintains a cache of high-quality negative triplets, i.e. those with high scores.

Structure Aware Negative Sampling
Given an observed positive triplet (h, r, t), a negative sample can be constructed by corrupting either the head or the tail entity to form a new triplet (h', r, t'), where h', t' ∈ E and E is the set of all entities in the KG. Additionally, we assume that the graph embedding models are trained using a loss function of the following form:

L = -\log \sigma\big(\gamma - d_r(h, t)\big) - \frac{1}{n} \sum_{i=1}^{n} \log \sigma\big(d_r(h'_i, t'_i) - \gamma\big), \quad (1)

where d_r(h, t) denotes the score assigned to the compatibility of head and tail entities under the relation r, γ is a fixed margin, σ is the sigmoid function, and n is the number of negative samples.
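As a sketch, this margin-based objective, with the positive triplet pushed inside the margin γ and each of the n negatives pushed outside it, might be computed as follows (a minimal NumPy sketch; the default margin value is illustrative, not taken from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def margin_loss(pos_score, neg_scores, gamma=12.0):
    """Loss for one positive triplet and its n negatives.

    pos_score:  d_r(h, t) of the observed triplet (a distance; lower = better)
    neg_scores: d_r(h'_i, t'_i) of the n corrupted triplets
    gamma:      fixed margin (the default here is an assumption)
    """
    neg_scores = np.asarray(neg_scores, dtype=float)
    pos_term = -np.log(sigmoid(gamma - pos_score))            # reward small d_r on the positive
    neg_term = -np.mean(np.log(sigmoid(neg_scores - gamma)))  # reward large d_r on negatives
    return pos_term + neg_term
```

Hard negatives are those with small d_r: they sit inside the margin and therefore dominate the gradient of the second term.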
In this paper, we seek to explicitly use the rich graph structure surrounding a particular node when generating negative triplets. We motivate our approach by observing that prior work on learning word embeddings (Mikolov et al., 2013), where negative sampling was historically developed, lacked the rich graph structure that is immediately accessible in the KG setting. Consequently, we hypothesize that enriching the negative sampling process with structural information can yield harder negative examples, crucial to learning effective embeddings. Fig. 1 highlights our approach, whose first step is the construction of the k-hop neighborhood K of each node:

K = S_+\left( \sum_{i=1}^{k} A^i \right) - S_+(A), \quad (2)

where k > 0 is an integer representing the neighborhood radius, A is the KG's adjacency matrix, and S_+ is the element-wise sign function, set to 1 if a path exists and 0 otherwise. To construct negative triplets, we may now simply sample from the nonzero cells of K, which represents a subset of all entities for each node in the KG, i.e. K ∈ {0, 1}^{|E|×|E|}. Intuitively, SANS exploits the locality of an entity's neighborhood: negative samples are entities that are not directly linked under a relation r but can be reached through a path of length at most k. We argue that such local negatives are harder to distinguish and lead to higher scores as evaluated by the embedding model. One important technical detail in constructing K is the existence of multiple relation types, which requires an additional dimension to represent the graph connectivity as adjacency and k-hop tensors.
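A dense-matrix sketch of this construction follows (assuming a single relation type, i.e. a plain adjacency matrix rather than the tensor needed for multi-relational KGs; the function name is ours):

```python
import numpy as np

def k_hop_candidates(A, k):
    """Candidate matrix K from a binary adjacency matrix A.

    K[i, j] = 1 iff entity j is reachable from entity i by a path of
    length at most k, but j is not a direct neighbor of i and j != i.
    """
    A = (np.asarray(A) > 0).astype(np.int64)
    n = A.shape[0]
    reach = np.zeros((n, n), dtype=bool)
    power = np.eye(n, dtype=np.int64)
    for _ in range(k):
        # Re-binarize after each product (the element-wise sign S_+),
        # which keeps entries 0/1 and avoids integer overflow.
        power = (power @ A > 0).astype(np.int64)
        reach |= power.astype(bool)
    K = reach & (A == 0)           # exclude directly linked entities
    np.fill_diagonal(K, False)     # exclude the node itself
    return K.astype(np.int64)
```

For instance, on a path graph 0-1-2-3 with k = 2, node 0's only candidate is node 2: it is reachable in two hops but is not a direct neighbor.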

Variants of SANS
Although SANS requires a one-time preprocessing step to construct K as defined in Eqn. 2, this may still be costly for large and dense KGs. To combat this inefficiency, we introduce RW-SANS in Alg. 1, which uses ω random walks (Perozzi et al., 2014) of length k in the adjacency tensor to approximate the k-hop neighborhood.
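The random-walk approximation might look like the following sketch (our own minimal rendition of the idea in Alg. 1, operating on an adjacency-list graph instead of the adjacency tensor):

```python
import random

def rw_khop(adj, k, omega, seed=0):
    """Approximate each node's k-hop candidate set with omega random
    walks of length k.

    adj: dict mapping node -> list of neighbor nodes.
    Returns a dict mapping node -> set of candidate negative entities
    (reachable within k steps, excluding the node and its neighbors).
    """
    rng = random.Random(seed)
    candidates = {}
    for v, nbrs in adj.items():
        visited = set()
        for _ in range(omega):
            cur = v
            for _ in range(k):
                if not adj.get(cur):    # dead end: stop this walk
                    break
                cur = rng.choice(adj[cur])
                visited.add(cur)
        visited.discard(v)              # never propose the node itself
        visited -= set(nbrs)            # drop direct neighbors, as in exact SANS
        candidates[v] = visited
    return candidates
```

With enough walks the visited set converges to the exact k-hop neighborhood, while nodes sharing many paths with the center node are visited more often, an effect revisited in the results section.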
As SANS constructs a local neighborhood from which negative samples are drawn, it can also be combined with other negative sampling approaches. In this work, we extend the Self-Adversarial approach in (Sun et al., 2019) and combine it with SANS by restricting the negative triplet candidate set to the k-hop neighborhood. In the subsequent sections, we refer to this technique as Self-Adversarial (Self-Adv.) SANS, whereas the former approach is referred to as Uniform SANS.
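The reweighting step borrowed from Sun et al. (2019) amounts to a softmax over the model's scores of the sampled negatives; restricted to k-hop candidates, it can be sketched as follows (the temperature α belongs to the original method, while the function itself is our illustration):

```python
import numpy as np

def self_adv_weights(neg_scores, alpha=1.0):
    """Softmax weights over the scores of sampled negatives: higher-scoring
    (harder) negatives receive larger weight in the loss.

    alpha is the sampling temperature of the self-adversarial scheme.
    """
    s = alpha * np.asarray(neg_scores, dtype=float)
    s -= s.max()                   # shift for numerical stability
    w = np.exp(s)
    return w / w.sum()
```

In Self-Adv. SANS, these weights multiply the per-negative terms of the loss, so the structurally hard candidates that the model also scores highly dominate the gradient.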

Experiments
We investigate the application of SANS-based negatives to train KG embedding models based on TransE, DistMult, and RotatE for the task of KG completion. We evaluate our proposed approach on standard benchmarks consisting of FB15K-237 (Bollacker et al., 2008), WN18, and WN18RR (Miller, 1995). From our experiments, we seek to answer the following questions:
(Q1) Hard Negatives: Can we sample hard negatives purely using graph structure?
(Q2) Can we combine graph structure with other SOTA negative samplers?
(Q3) Can we effectively approximate the adjacency tensor with random walks?
In our experiments, we rely on three representative baselines, namely uniform negative sampling (Bordes et al., 2013), KBGAN (Cai and Wang, 2018), and NSCaching (Zhang et al., 2019). We also compare with the current SOTA approach, Self-Adversarial negative sampling (Sun et al., 2019), and we test whether local graph structure can also be leveraged in this setting.

Results
We now address the core experimental questions.
Q1: SANS achieves strong results with TransE and is the second best-performing algorithm when combined with DistMult, without requiring additional parameters. We observe average ∆ values of 0.0231, −0.0341, and 0.0056 in MRR for TransE, DistMult, and RotatE respectively, which confirms our approach's effectiveness compared to SOTA while remaining computationally efficient.
We also qualitatively investigate the semantic hardness of SANS negatives against negatives generated via uniform sampling. For instance, using the center node "arachnoid" in the WN18RR dataset as an example, the negatives sampled via SANS within a 2-hop neighborhood are "arachnida," "biology," "arthropod," "wolf spider," and "garden spider," while the ones picked by uniform sampling are "diner," "refusal," "landscape," "rise," and "nurser." Clearly, the negatives found via SANS are semantically harder to distinguish, confirming the importance of incorporating graph structure into negative samplers to aid in hard negative mining. A more detailed qualitative analysis of negative samples generated by SANS, including the effect of varying neighborhood sizes, can be found in Appendix C.1.

Q2: We now combine SANS with Self-Adversarial negative sampling (Sun et al., 2019). Our results are presented in Table 2 under Self-Adv. SANS and Self-Adv. RW-SANS, both of which reweight the negative triplets as done in (Sun et al., 2019). We observe comparable performance between the two approaches, but crucially this is achieved while considering only 0.2% to 9% of the entities in datasets such as WN18 and WN18RR, as indicated in Table 3. Since partially-filled adjacency tensors require less memory and allow sparse tensor operations, these results further highlight the appeal of incorporating graph structure when choosing negative samples.

Q3: We now analyze the impact of approximating the local neighborhood using random walks. Fig. 2 depicts the effect on MRR of varying the number of random walks (ω) for neighborhoods of different radii. We report two baselines: the performance of uniform sampling, and the best performance achieved by Uniform SANS combined with TransE, for which the k-hop tensor was explicitly computed.
Interestingly, we find not only that the k-hop tensor can be well approximated with 3000 random walks, but also that RW-SANS beats both baselines. We reconcile this result by noting that certain nodes have a higher probability of being sampled because they share a larger number of paths with the center node, resulting in an implicit weighted negative sampling scheme.

Conclusion and Future Directions
In this work, we introduced SANS, a novel negative sampling strategy that directly leverages information about k-hop neighborhoods to select negative examples. Our work sheds light on the importance of incorporating graph structure when designing negative samplers for KGs, for which SANS can be seen as a cheap yet powerful baseline that requires no additional parameters or difficult optimization. Empirically, we find that SANS-based negatives achieve comparable performance with SOTA approaches and even outperform previous sophisticated GAN-based approaches.

A Experimental Settings
This section provides an overview of the datasets and evaluation protocols used for obtaining our results.

A.1 Datasets
To conduct experiments for our proposed methods, we used the FB15K-237, WN18, and WN18RR datasets. FB15K-237 is a subset of FB15K, which was derived from the FreeBase Knowledge Base (KB) (Bollacker et al., 2008), a large database containing general facts about the world with many different relation types. WN18RR, in turn, is a subset of WN18, which was derived from the WordNet KB (Miller, 1995), a large lexical English database that captures lexical relations, e.g. the super-subordinate relations between words. WN18 and FB15K were first introduced by Bordes et al. (2013) and have been used in the majority of KG-related research. In comparison, WN18 and WN18RR contain fewer relation types than FB15K-237. A summary of the number of entities and relation types in each of these datasets is provided in Table 4.

A.2 Evaluation Protocols
To evaluate our negative sampling approach, we used standard evaluation metrics, consisting of Mean Reciprocal Rank (MRR) and Hits at N (H@N). The train/validation/test split information is provided in Table 5.

B Implementation Details
This section of the supplemental material goes over the implementation details of our RW-SANS algorithms, i.e. Uniform RW-SANS and Self-Adv. RW-SANS, which use random walks to approximate the k-hop adjacency tensor. Other experimental setups are further detailed herein.

B.1 Hyperparameters
The hyperparameters of our negative sampling algorithms are k and ω (the latter applying when the k-hop neighborhood is approximated by Alg. 1). To find the values yielding the highest performance on the validation set of each dataset, we searched k in the range 2 to 8 and ω in the range 1000 to 5000 during the negative sampling step. In other words, the best validation performances in our empirical study were found by manual hyperparameter tuning. More information about the experimental trials can be found in Table 6, which also indicates the total number of trials for training each graph embedding model on each dataset.

B.2 Preprocessing
Building the k-hop neighborhood of the nodes within the KG can be regarded as the preprocessing step, essential to implementing SANS. In this paper, we propose two techniques for doing so, which are: 1. explicit computation of the k-hop neighborhood by manipulating Eqn. 2 while accounting for different relation types and, 2. approximation of the k-hop neighborhood using random walks, as detailed in Alg. 1.

B.3 Infrastructure Settings
The experiments in our study were carried out on a server with one NVIDIA V100 GPU, 10 CPU cores, and 46GB of RAM.

C.1 Qualitative Assessment of Negative Samples
In this section, we assess the semantic meaningfulness of negative samples produced by Uniform SANS against those produced by uniform sampling on the WN18RR dataset. As shown by the examples in Table 8, Uniform SANS yields negative examples that are harder to distinguish semantically than those from uniform sampling. We also notice that the semantic meaningfulness of SANS-based negatives declines as we increase the size of the k-hop neighborhood. This observation is expected, since as the neighborhood grows in size (i.e. k → ∞), Uniform SANS becomes analogous to uniform sampling.

C.2 SOTA Algorithms
Results for the Uniform and Self-Adversarial algorithms are given in Table 1 and Table 2. Table 9 and Table 10 report the performance of the graph embedding models combined with our negative sampling techniques on the validation and test sets with respect to the evaluation metrics. Additionally, they list the hyperparameter values for Uniform/Self-Adv. SANS and Uniform/Self-Adv. RW-SANS that resulted in the best performance on the validation sets.

Table 11: Comparison of different negative sampling algorithms in terms of preprocessing, runtime, and space complexities, given batch size b, negative sample size n, cache size c, cache extension size e, node set V, edge set E, relation set R, embedding dimension d, hop count k, random walk count r, and GAN parameter count t.

C.3 SANS Algorithms
Based on our outcomes, we hypothesize that using random walks to approximate the k-hop neighborhood implicitly removes the nodes with the fewest walks to the center node, i.e. outlier nodes. Table 11 summarizes the time and space complexities of different negative sampling approaches, including SANS.