KBGAN: Adversarial Learning for Knowledge Graph Embeddings

We introduce KBGAN, an adversarial learning framework to improve the performance of a wide range of existing knowledge graph embedding models. Because knowledge graphs typically contain only positive facts, sampling useful negative training examples is a nontrivial task. Replacing the head or tail entity of a fact with a uniformly randomly selected entity is a conventional method for generating negative facts, but the majority of the generated negative facts can be easily discriminated from positive facts, and contribute little to training. Inspired by generative adversarial networks (GANs), we use one knowledge graph embedding model as a negative sample generator to assist the training of our desired model, which acts as the discriminator in GANs. This framework is independent of the concrete form of the generator and the discriminator, and can therefore use a wide variety of knowledge graph embedding models as its building blocks. In experiments, we adversarially train two translation-based models, TRANSE and TRANSD, each with assistance from one of the two probability-based models, DISTMULT and COMPLEX. We evaluate the performance of KBGAN on the link prediction task, using three knowledge base completion datasets: FB15k-237, WN18 and WN18RR. Experimental results show that adversarial training substantially improves the performance of target embedding models under various settings.


Introduction
A knowledge graph (Dong et al., 2014) is a powerful graph structure that can give users direct access to knowledge via applications such as structured search, question answering, and intelligent virtual assistants. A common representation of knowledge graph beliefs is a discrete relational triple such as LocatedIn(NewOrleans,Louisiana).
A main challenge in using discrete representations of knowledge graphs is the inability to assess similarities among different entities and relations. Knowledge graph embedding (KGE) techniques (e.g., RESCAL (Nickel et al., 2011), TRANSE (Bordes et al., 2013), DISTMULT, and COMPLEX (Trouillon et al., 2016)) have been proposed in recent years to deal with this issue. The main idea is to represent the entities and relations in a vector space, and to use machine learning techniques to learn a continuous representation of the knowledge graph in this latent space.
However, even though steady progress has been made in developing novel algorithms for knowledge graph embedding, there is still a common challenge in this line of research. For space efficiency, common knowledge graphs such as Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and NELL (Mitchell et al., 2015) by default store only beliefs, rather than disbeliefs. Therefore, when training embedding models, only positive examples are naturally present. To obtain negative examples, a common method is to remove the correct tail entity and replace it with an entity sampled from a uniform distribution (Bordes et al., 2013). Unfortunately, this approach is not ideal, because the sampled entity could be completely unrelated to the head and the target relation, and thus the quality of randomly generated negative examples is often poor (e.g., LocatedIn(NewOrleans,BarackObama)). Other approaches leverage external ontological constraints such as entity types (Krompaß et al., 2015) to generate negative examples, but such resources do not always exist or are not always accessible.
In this work, we provide a generic solution to improve the training of a wide range of knowledge graph embedding models. Inspired by recent advances in generative adversarial deep models (Goodfellow et al., 2014), we propose a novel adversarial learning framework, namely KBGAN, for generating better negative examples to train knowledge graph embedding models. More specifically, we consider probability-based, log-loss embedding models as the generator to supply better-quality negative examples, and use distance-based, margin-loss embedding models as the discriminator to produce the final knowledge graph embeddings. Since the generator has a discrete generation step, we cannot directly use a gradient-based approach to backpropagate the errors. We therefore consider a one-step reinforcement learning setting, and use a variance-reduced REINFORCE method to achieve this goal. Empirically, we perform experiments on three common KGE datasets (FB15k-237, WN18 and WN18RR), and verify the adversarial learning approach with a set of KGE models. Our experiments show that, across various settings, this adversarial learning mechanism can significantly improve the performance of some of the most commonly used translation-based KGE methods. Our contributions are three-fold:
• We are the first to consider adversarial learning to generate useful negative training examples to improve knowledge graph embedding.
• This adversarial learning framework applies to a wide range of KGE models, without the need for external ontological constraints.
• Our method shows consistent performance gains on three commonly used KGE datasets.

Table 1: Score functions and numbers of parameters of the models. $k$ is the embedding dimension, and the subscript $1/2$ denotes either the $L_1$ or the $L_2$ norm.
Model | Score function $f(h, r, t)$ | Number of parameters
TRANSE | $\|h + r - t\|_{1/2}$ | $k|E| + k|R|$
TRANSD | $\|(I + r_p h_p^T)h + r - (I + r_p t_p^T)t\|_{1/2}$ | $2k|E| + 2k|R|$
DISTMULT | $\langle h, r, t \rangle$ $(= \sum_{i=1}^{k} h_i r_i t_i)$ | $k|E| + k|R|$
COMPLEX | $\langle h, r, \bar{t} \rangle$ $(h, r, t \in \mathbb{C}^k)$ | $2k|E| + 2k|R|$
TRANSH | $\|(I - r_p r_p^T)h + r - (I - r_p r_p^T)t\|_{1/2}$ | $k|E| + 2k|R|$
TRANSR | $\|W_r h + r - W_r t\|_{1/2}$ | $k|E| + (k^2 + k)|R|$
Related Work

Knowledge Graph Embeddings
A large number of knowledge graph embedding models, which represent entities and relations in a knowledge graph with vectors or matrices, have been proposed in recent years. RESCAL (Nickel et al., 2011) is one of the earliest studies on matrix-factorization-based knowledge graph embedding models, using a bilinear form as its score function. TRANSE (Bordes et al., 2013) is the first model to introduce translation-based embedding.
Later variants, such as TRANSH (Wang et al., 2014), TRANSR (Lin et al., 2015) and TRANSD (Ji et al., 2015), extend TRANSE by projecting the embedding vectors of entities into various spaces. DISTMULT simplifies RESCAL by using only a diagonal matrix, and COMPLEX (Trouillon et al., 2016) extends DISTMULT into the complex number field. Nickel et al. (2015) provide a comprehensive survey of these models. Some of the more recent models achieve strong performance. MANIFOLDE (Xiao et al., 2016) embeds a triple as a manifold rather than a point. HOLE (Nickel et al., 2016) employs circular correlation to combine the two entities in a triple. CONVE (Dettmers et al., 2017) uses a convolutional neural network as the score function. However, most of these studies use uniform sampling to generate negative training examples (Bordes et al., 2013). Because our framework is independent of the concrete form of the models, all of them can potentially be incorporated into it, regardless of their complexity. As a proof of principle, our work focuses on simpler models. Table 1 summarizes the score functions and numbers of parameters of several of the models mentioned above.
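To make the two families of score functions concrete, the following is a minimal sketch of the TRANSE distance and the DISTMULT trilinear product, assuming embeddings are plain Python lists of floats. The function names and the toy vectors are illustrative assumptions, not part of any published implementation.

```python
def transe_score(h, r, t, norm=1):
    """TransE distance ||h + r - t||; a smaller value means a more plausible triple."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(diffs)
    return sum(d * d for d in diffs) ** 0.5  # L2 norm


def distmult_score(h, r, t):
    """DistMult trilinear product sum_i h_i * r_i * t_i; larger means more plausible."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


# Hypothetical 2-dimensional embeddings, chosen so that h + r equals t exactly.
h, r, t = [1.0, 0.0], [0.5, 1.0], [1.5, 1.0]
print(transe_score(h, r, t))
print(distmult_score(h, r, t))
```

Note the opposite conventions: for the translation-based model a lower score is better, whereas for the probability-based model a higher score is better.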

Generative Adversarial Networks and Their Variants
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) were originally proposed for generating samples in a continuous space, such as images. A GAN consists of two parts, the generator and the discriminator. The generator accepts a noise input and outputs an image. The discriminator is a classifier which classifies images as "true" (from the ground truth set) or "fake" (generated by the generator). When training a GAN, the generator and the discriminator play a minimax game, in which the generator tries to generate "real" images to deceive the discriminator, and the discriminator tries to tell them apart from ground truth images. GANs are also capable of generating samples that satisfy certain requirements, as in the conditional GAN (Mirza and Osindero, 2014). It is not possible to use GANs in their original form for generating discrete samples like natural language sentences or knowledge graph triples, because the discrete sampling step prevents gradients from propagating back to the generator. SEQGAN is one of the first successful solutions to this problem: it uses reinforcement learning, training the generator with policy gradient and other tricks. IRGAN is a recent work which combines two categories of information retrieval models into a discrete GAN framework. Likewise, our framework relies on policy gradient to train the generator, which provides discrete negative triples.
The discriminator in a GAN is not necessarily a classifier. Wasserstein GAN, or WGAN (Arjovsky et al., 2017), uses a regressor with clipped parameters as its discriminator, based on a solid analysis of the mathematical nature of GANs. GOGAN (Juefei-Xu et al., 2017) further replaces the loss function in WGAN with a marginal loss. Although it originates from a very different field, the form of the loss function in our framework turns out to be more closely related to the one in GOGAN.

Our Approaches
In this section, we first define two types of training objectives in knowledge graph embedding models to show how KBGAN can be applied. Then, we demonstrate a long-overlooked problem with negative sampling, which motivates us to propose KBGAN to address it. Finally, we dive into the mathematical and algorithmic details of KBGAN.

Types of Training Objectives
For a given knowledge graph, let E be the set of entities, R be the set of relations, and T be the set of ground truth triples. In general, a knowledge graph embedding (KGE) model can be formulated as a score function $f(h, r, t)$, with $h, t \in E$ and $r \in R$, which assigns a score to every possible triple in the knowledge graph. The estimated likelihood that a triple is true depends only on its score given by the score function.
Different models formulate their score functions based on different designs, and therefore interpret scores differently, which in turn leads to various training objectives. Two common forms of training objectives are of particular interest to us:
Marginal loss function is commonly used by a large group of models called translation-based models, whose score function models the distance between points or vectors, such as TRANSE, TRANSH, TRANSR, and TRANSD. In these models, a smaller distance indicates a higher likelihood of truth, but only qualitatively. The marginal loss function takes the following form:

$$L_m = \sum_{(h,r,t) \in T} [f(h,r,t) - f(h',r,t') + \gamma]_+ \tag{1}$$

where $\gamma$ is the margin, $[\cdot]_+ = \max(0, \cdot)$ is the hinge function, and $(h', r, t')$ is a negative triple. The negative triple is generated by replacing the head entity or the tail entity of a positive triple with a random entity in the knowledge graph, or formally $(h', r, t') \in \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\}$.
Log-softmax loss function is commonly used by models whose score function has a probabilistic interpretation. Some notable examples are RESCAL, DISTMULT, and COMPLEX. Applying the softmax function to the scores of a given set of triples gives the probability that a triple is the best one among them:

$$p(h,r,t) = \frac{\exp f(h,r,t)}{\sum_{(h',r,t')} \exp f(h',r,t')}$$

The loss function is the negative log-likelihood of this probabilistic model:

$$L_l = \sum_{(h,r,t) \in T} -\log \frac{\exp f(h,r,t)}{\sum_{(h',r,t')} \exp f(h',r,t')}, \quad (h',r,t') \in \{(h,r,t)\} \cup Neg(h,r,t)$$

where $Neg(h, r, t) \subset \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\}$ is a set of sampled corrupted triples.
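As a small illustration of the two objectives just described, the sketch below evaluates each loss for a single positive triple, taking scores as plain floats. The function names are hypothetical, and the log-sum-exp shift is a numerical-stability choice of this sketch rather than something specified in the text.

```python
import math


def marginal_loss(pos_dist, neg_dist, gamma=1.0):
    """Hinge loss [f(pos) - f(neg) + gamma]_+ for distance-based models."""
    return max(0.0, pos_dist - neg_dist + gamma)


def log_softmax_loss(pos_score, neg_scores):
    """Negative log-probability of the positive triple under a softmax over
    the positive triple and its sampled negatives (higher score = better)."""
    scores = [pos_score] + list(neg_scores)
    top = max(scores)  # shift by the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score


# A negative that is already further away than the margin gives zero loss,
# which is why "too easy" negatives contribute nothing to training.
print(marginal_loss(pos_dist=1.0, neg_dist=3.0, gamma=1.0))
print(marginal_loss(pos_dist=2.0, neg_dist=1.0, gamma=1.0))
print(log_softmax_loss(0.0, [0.0]))
```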
Figure 1: An overview of the KBGAN framework. The generator (G) calculates a probability distribution over a set of candidate negative triples, then samples one triple from the distribution as the output. The discriminator (D) receives the generated negative triple as well as the ground truth triple (in the hexagonal box), and calculates their scores. G minimizes the score of the generated negative triple by policy gradient, and D minimizes the marginal loss between positive and negative triples by gradient descent.
Other forms of loss functions exist; for example, CONVE uses a triple-wise logistic function to model how likely a triple is to be true. However, the two described above are by far the most common. In addition, the softmax function gives a probability distribution over a set of triples, which is necessary for a generator to sample from them.

Weakness of Uniform Negative Sampling
Most previous KGE models use uniform negative sampling for generating negative triples, that is, replacing the head or tail entity of a positive triple with any of the entities in E, all with equal probability. Most of the negative triples generated in this way contribute little to learning an effective embedding, because they are too obviously false.
To demonstrate this issue, let us consider the following example. Suppose we have a ground truth triple LocatedIn(NewOrleans,Louisiana), and corrupt it by replacing its tail entity. First, we remove the tail entity, leaving LocatedIn(NewOrleans,?). Because the relation LocatedIn constrains the types of its entities, "?" must be a geographical region. If we fill "?" with a random entity e ∈ E, the probability of e having the wrong type is very high, resulting in ridiculous triples like LocatedIn(NewOrleans,BarackObama) or LocatedIn(NewOrleans,StarTrek). Such triples are considered "too easy", because they can be eliminated solely by type. In contrast, LocatedIn(NewOrleans,Florida) is a very useful negative triple, because it satisfies the type constraints, but cannot be proved wrong without detailed knowledge of American geography. If a KGE model is fed mostly "too easy" negative examples, it will probably learn only to represent types, not the underlying semantics.
The problem is less severe for models using the log-softmax loss function, because they typically sample tens or hundreds of negative triples for one positive triple in each iteration, and a few useful negatives are likely to be among them. For instance, Trouillon et al. (2016) found that a 100:1 negative-to-positive ratio results in the best performance for COMPLEX. However, for models using the marginal loss function, whose negative-to-positive ratio is always 1:1, the low quality of uniformly sampled negatives can seriously damage performance.
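The uniform corruption scheme discussed in this section can be sketched in a few lines. The function name, the 50/50 choice between head and tail, and the toy entity list are assumptions made for illustration only.

```python
import random


def corrupt_uniform(triple, entities, rng):
    """Replace the head or the tail of a positive triple with an entity drawn
    uniformly from the whole entity set. The drawn entity may have the wrong
    type, or may even coincide with the original one."""
    h, r, t = triple
    e = rng.choice(entities)
    if rng.random() < 0.5:
        return (e, r, t)  # corrupt the head
    return (h, r, e)      # corrupt the tail


# Hypothetical toy entity set mixing regions, a person and a TV series.
entities = ["NewOrleans", "Louisiana", "Florida", "BarackObama", "StarTrek"]
rng = random.Random(0)
positive = ("NewOrleans", "LocatedIn", "Louisiana")
negative = corrupt_uniform(positive, entities, rng)
print(negative)
```

With a realistic entity set, only a small fraction of the entities are geographical regions, so most draws produce triples that can be rejected on type alone.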

Generative Adversarial Training for Knowledge Graph Embedding Models
Inspired by GANs, we propose an adversarial training framework named KBGAN, which uses a KGE model with softmax probabilities to provide high-quality negative samples for training a KGE model whose training objective is the marginal loss function. This framework is independent of the score functions of these two models, and is therefore fairly general. Figure 1 illustrates the overall structure of KBGAN. In parallel to the terminology used in the GAN literature, we simply call these two models the generator and the discriminator, respectively, in the rest of this paper. We use softmax probabilistic models as the generator because they can adequately model the "sampling from a probability distribution" process of discrete GANs, and we aim at improving discriminators based on marginal loss because they can benefit more from high-quality negative samples. Note that a major difference between GANs and our work is that the ultimate goal of our framework is to produce a good discriminator, whereas GANs aim at training a good generator. In addition, the discriminator here is not a classifier, as it would be in most GANs. Intuitively, the discriminator should assign a relatively small distance to a high-quality negative sample. To encourage the generator to produce useful negative samples, its objective is to minimize the distance that the discriminator assigns to its generated triples. As in ordinary training, the objective of the discriminator is to minimize the marginal loss between the positive triple and the generated negative triple. In an adversarial training setting, the generator and the discriminator are trained alternately towards their respective objectives.
Algorithm 1 (excerpt): Sample one negative triple $(h'_s, r, t'_s)$ from $Neg(h, r, t)$ according to $\{p_i\}_{i=1 \ldots N_s}$; assume its probability to be $p_s$.
Suppose that the generator produces a probability distribution on negative triples $p_G(h', r, t' \mid h, r, t)$ given a positive triple $(h, r, t)$, and generates negative triples $(h', r, t')$ by sampling from this distribution. Let $f_D(h, r, t)$ be the score function of the discriminator. The objective of the discriminator can be formulated as minimizing the following marginal loss function:

$$L_D = \sum_{(h,r,t) \in T} [f_D(h,r,t) - f_D(h',r,t') + \gamma]_+, \quad (h',r,t') \sim p_G(h',r,t' \mid h,r,t)$$

The only difference between this loss function and Equation 1 is that it uses negative samples from the generator.
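Under the description above, in which the generator seeks negatives to which the discriminator assigns a small distance, the generator's objective and a policy-gradient estimate of its gradient can be written in the standard REINFORCE form. This is a sketch under stated assumptions: $\nabla_G$ denotes the gradient with respect to the generator's parameters, and $N$ (the number of negatives sampled per positive triple) is a notational assumption.

```latex
% Expected reward of the generator: the negated discriminator distance
R_G = \sum_{(h,r,t) \in T}
      \mathbb{E}_{(h',r,t') \sim p_G(h',r,t' \mid h,r,t)}
      \left[ -f_D(h',r,t') \right]

% Score-function (REINFORCE) gradient and its Monte Carlo approximation
\nabla_G R_G
  = \sum_{(h,r,t) \in T}
    \mathbb{E}_{(h',r,t') \sim p_G(h',r,t' \mid h,r,t)}
    \left[ -f_D(h',r,t') \, \nabla_G \log p_G(h',r,t' \mid h,r,t) \right]
  \approx \sum_{(h,r,t) \in T} \frac{1}{N} \sum_{i=1}^{N}
    \left[ -f_D(h'_i,r,t'_i) \, \nabla_G \log p_G(h'_i,r,t'_i \mid h,r,t) \right],
  \quad (h'_i,r,t'_i) \sim p_G(h',r,t' \mid h,r,t)
```

The expectation cannot be differentiated through the discrete sampling step directly, which is why the gradient is moved onto $\log p_G$ and the expectation is approximated by sampling.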
The policy gradient theorem originates in reinforcement learning (RL), so we draw an analogy between our model and an RL model. The generator can be viewed as an agent which interacts with the environment by performing actions, and improves itself by maximizing the reward returned from the environment in response to its actions. Correspondingly, the discriminator can be viewed as the environment. In RL terminology, $(h, r, t)$ is the state (which determines what actions the agent can take), $p_G(h', r, t' \mid h, r, t)$ is the policy (how the agent chooses actions), $(h', r, t')$ is the action, and $-f_D(h', r, t')$ is the reward. The method of optimizing $R_G$ described above is called the REINFORCE algorithm (Williams, 1992) in RL. Our model is a simple special case of RL, called one-step RL. In a typical RL setting, each action performed by the agent changes its state, and the agent performs a series of actions (called an episode) until it reaches certain states or the number of actions reaches a certain limit. However, in the analogy above, actions do not affect the state, and after each action we restart with another, unrelated state, so each episode consists of only one action.
To reduce the variance of the REINFORCE algorithm, it is common to subtract a baseline from the reward. The baseline can be any quantity that depends only on the state, and subtracting it does not affect the expectation of the gradient. In our case, we replace $-f_D(h', r, t')$ with $-f_D(h', r, t') - b(h, r, t)$ in the policy gradient of $R_G$ to introduce the baseline. To avoid introducing new parameters, we simply let $b$ be a constant, the average reward over the whole training set:

$$b = \frac{1}{|T|} \sum_{(h,r,t) \in T} \mathbb{E}_{(h',r,t') \sim p_G(h',r,t' \mid h,r,t)} \left[ -f_D(h',r,t') \right]$$

In practice, $b$ is approximated by the mean of the rewards of recently generated negative triples.
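The baseline-corrected REINFORCE update can be illustrated at the level of the generator's candidate scores. The sketch below assumes a softmax policy over a list of scores, for which the gradient of the log-probability of the sampled index with respect to score j is the indicator of j being sampled minus p_j. The class name, function names and window size are hypothetical.

```python
import math
from collections import deque


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_score_grad(gen_scores, sampled_idx, reward, baseline):
    """Single-sample estimate of d R_G / d score_j for a softmax policy:
    (reward - baseline) * (1[j == sampled] - p_j)."""
    probs = softmax(gen_scores)
    advantage = reward - baseline
    return [advantage * ((1.0 if j == sampled_idx else 0.0) - p)
            for j, p in enumerate(probs)]


class RunningBaseline:
    """Mean reward of the most recent negatives (window size is an assumption)."""

    def __init__(self, window=1000):
        self.rewards = deque(maxlen=window)

    def update(self, reward):
        self.rewards.append(reward)

    def value(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


# Two equally likely candidates; the sampled one earned more than the baseline,
# so its score is pushed up and the other one is pushed down.
grad = reinforce_score_grad([0.0, 0.0], sampled_idx=0, reward=-1.0, baseline=-2.0)
print(grad)
```

When the reward equals the baseline the update vanishes, so only negatives that are better or worse than the recent average move the generator.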
Let the generator's score function be $f_G(h, r, t)$. Given a set of candidate negative triples $Neg(h, r, t) \subset \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\}$, the probability distribution $p_G$ is modeled as:

$$p_G(h',r,t' \mid h,r,t) = \frac{\exp f_G(h',r,t')}{\sum_{(h^*,r,t^*) \in Neg(h,r,t)} \exp f_G(h^*,r,t^*)}, \quad (h',r,t') \in Neg(h,r,t)$$

Ideally, $Neg(h, r, t)$ should contain all possible negatives. However, knowledge graphs are usually highly incomplete, so the "hardest" negative triples are very likely to be false negatives (true facts). To address this issue, we instead generate $Neg(h, r, t)$ by uniformly sampling $N_s$ entities (a small number compared to the number of all possible negatives) from E to replace h or t. Because true negatives are usually far more numerous than false negatives in real-world knowledge graphs, such a set is unlikely to contain any false negative, and the negative selected by the generator is likely to be a true negative. Using a small $Neg(h, r, t)$ also significantly reduces computational complexity.
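The candidate-set construction and the generator's sampling step can be sketched as follows. Several details here are assumptions made for illustration: the side to corrupt is chosen once per positive triple with probability 0.5, `toy_generator_score` stands in for a real generator score function, and the entity list is hypothetical.

```python
import math
import random


def sample_candidates(triple, entities, n_s, rng):
    """Build Neg(h, r, t) by drawing n_s entities uniformly to replace h or t."""
    h, r, t = triple
    replace_head = rng.random() < 0.5  # assumption: one side per positive triple
    drawn = [rng.choice(entities) for _ in range(n_s)]
    return [(e, r, t) if replace_head else (h, r, e) for e in drawn]


def generator_sample(candidates, f_g, rng):
    """Sample one candidate from the softmax of the generator's scores.
    Returns (index, negative triple, probability of that triple)."""
    scores = [f_g(*c) for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return idx, candidates[idx], probs[idx]


def toy_generator_score(h, r, t):
    """Hypothetical stand-in for f_G; a real generator would use embeddings."""
    return 0.1 * len(h) - 0.1 * len(t)


entities = ["NewOrleans", "Louisiana", "Florida", "BarackObama", "StarTrek"]
rng = random.Random(0)
positive = ("NewOrleans", "LocatedIn", "Louisiana")
cands = sample_candidates(positive, entities, n_s=5, rng=rng)
idx, negative, prob = generator_sample(cands, toy_generator_score, rng)
print(negative, prob)
```

Because the softmax is taken only over the small candidate set, each step costs a number of score evaluations equal to the candidate set size rather than to the size of the whole entity set.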
Besides, we adopt the "bern" sampling technique (Wang et al., 2014) which replaces the "1" side in "1-to-N" and "N-to-1" relations with higher probability to further reduce false negatives.
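The "bern" technique can be sketched as below, following the rule from Wang et al. (2014) as we understand it: for each relation, the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail. The function name and the toy triples are assumptions for illustration.

```python
from collections import defaultdict


def bern_head_prob(triples):
    """Per-relation probability of corrupting the head instead of the tail."""
    tails = defaultdict(set)  # (relation, head) -> set of tails
    heads = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails[(r, h)].add(t)
        heads[(r, t)].add(h)
    tph_counts = defaultdict(list)
    hpt_counts = defaultdict(list)
    for (r, _), ts in tails.items():
        tph_counts[r].append(len(ts))
    for (r, _), hs in heads.items():
        hpt_counts[r].append(len(hs))
    probs = {}
    for r in tph_counts:
        tph = sum(tph_counts[r]) / len(tph_counts[r])
        hpt = sum(hpt_counts[r]) / len(hpt_counts[r])
        probs[r] = tph / (tph + hpt)
    return probs


# Hypothetical 1-to-N relation: one head with three tails, so the head
# (the "1" side) is replaced more often than the tail.
toy = [("USA", "HasState", "Louisiana"),
       ("USA", "HasState", "Florida"),
       ("USA", "HasState", "Texas")]
print(bern_head_prob(toy))
```

Replacing the "1" side is less likely to produce a true fact by accident, since a random tail of a 1-to-N relation has a higher chance of being another valid tail.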
Algorithm 1 summarizes the whole adversarial training process. Both the generator and the discriminator require pre-training, which is the same as conventionally training a single KGE model with uniform negative sampling. Formally speaking, one can pre-train the discriminator by minimizing the marginal loss function defined in Equation (1), and pre-train the generator by minimizing the log-softmax loss function described above. Line 14 in the algorithm assumes vanilla gradient descent as the optimization method, but one can substitute any gradient-based optimization algorithm.
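The following is a minimal sketch of one adversarial update for a single positive triple, not a reproduction of Algorithm 1. It assumes a TRANSE discriminator with L1 distance, a DISTMULT generator, plain gradient steps, embeddings stored as dicts of lists, and a precomputed candidate set; all names are hypothetical.

```python
import math
import random


def transe(E, R, h, r, t):
    """Discriminator distance ||h + r - t||_1 (smaller = more plausible)."""
    return sum(abs(a + b - c) for a, b, c in zip(E[h], R[r], E[t]))


def distmult(E, R, h, r, t):
    """Generator score sum_i h_i r_i t_i (larger = more plausible)."""
    return sum(a * b * c for a, b, c in zip(E[h], R[r], E[t]))


def sign(x):
    return (x > 0) - (x < 0)


def kbgan_step(pos, cands, D, G, baseline, gamma=1.0, lr=0.01, rng=random):
    """One update on a single positive triple; returns (reward, discriminator loss)."""
    (DE, DR), (GE, GR) = D, G
    # Generator: softmax over candidate scores, then sample one negative triple.
    scores = [distmult(GE, GR, *c) for c in cands]
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    k = rng.choices(range(len(cands)), weights=probs, k=1)[0]
    neg = cands[k]
    reward = -transe(DE, DR, *neg)
    loss = max(0.0, transe(DE, DR, *pos) - transe(DE, DR, *neg) + gamma)

    # Generator: REINFORCE ascent on (reward - baseline) * log p_k.
    # All gradients are computed from the old parameters before any update.
    grads = []
    for j, (h, r, t) in enumerate(cands):
        coeff = (reward - baseline) * ((1.0 if j == k else 0.0) - probs[j])
        eh, er, et = GE[h], GR[r], GE[t]
        grads.append((h, r, t,
                      [coeff * b * c for b, c in zip(er, et)],
                      [coeff * a * c for a, c in zip(eh, et)],
                      [coeff * a * b for a, b in zip(eh, er)]))
    for h, r, t, gh, gr, gt in grads:
        for i in range(len(gh)):
            GE[h][i] += lr * gh[i]
            GR[r][i] += lr * gr[i]
            GE[t][i] += lr * gt[i]

    # Discriminator: subgradient descent on the marginal loss, if it is active.
    if loss > 0.0:
        (ph, pr, pt), (nh, nr, nt) = pos, neg
        d_pos = [sign(a + b - c) for a, b, c in zip(DE[ph], DR[pr], DE[pt])]
        d_neg = [sign(a + b - c) for a, b, c in zip(DE[nh], DR[nr], DE[nt])]
        for i in range(len(d_pos)):
            DE[ph][i] -= lr * d_pos[i]
            DE[pt][i] += lr * d_pos[i]
            DR[pr][i] -= lr * (d_pos[i] - d_neg[i])
            DE[nh][i] += lr * d_neg[i]
            DE[nt][i] -= lr * d_neg[i]
    return reward, loss


# Hypothetical one-dimensional toy embeddings with a single candidate negative.
D = ({"a": [0.0], "b": [1.0], "c": [0.5]}, {"r": [0.0]})
G = ({"a": [0.1], "b": [0.2], "c": [0.3]}, {"r": [1.0]})
reward, loss = kbgan_step(("a", "r", "b"), [("a", "r", "c")], D, G,
                          baseline=0.0, lr=0.1)
print(reward, loss)
```

In a full training run, this step would be applied to every positive triple in a mini-batch, with the baseline refreshed from recent rewards and with both models pre-trained beforehand.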

Experiments
To evaluate our proposed framework, we test its performance on the link prediction task with different generators and discriminators. For the generator, we choose two classical probability-based KGE models, DISTMULT and COMPLEX, and for the discriminator, we choose two classical translation-based KGE models, TRANSE and TRANSD, resulting in four possible combinations of generator and discriminator in total. See Table 1 for a brief summary of these models.

Datasets
We use three common knowledge base completion datasets for our experiments: FB15k-237, WN18 and WN18RR. FB15k-237 is a subset of FB15k introduced by Toutanova and Chen (2015), which removes redundant relations in FB15k and greatly reduces the number of relations. Likewise, WN18RR is a subset of WN18 introduced by Dettmers et al. (2017), which removes inverse relations and dramatically increases the difficulty of reasoning. Both FB15k and WN18 were first introduced by Bordes et al. (2013) and have been commonly used in knowledge graph research. Statistics of the datasets we used are shown in Table 3.

Evaluation Protocols
Following previous work such as Trouillon et al. (2016), for each run we report two common metrics, mean reciprocal rank (MRR) and hits at 10 (H@10). We only report scores under the filtered setting (Bordes et al., 2013), which removes all triples that appear in the training, validation, or test sets from the candidate triples before obtaining the rank of the ground truth triple.
In the pre-training stage, we train every model to convergence, for up to 1000 epochs, and divide every epoch into 100 mini-batches. To avoid overfitting, we adopt early stopping by evaluating MRR on the validation set every 50 epochs. We tried $\gamma$ = 0.5, 1, 2, 3, 4, 5 and the $L_1$ and $L_2$ distances for TRANSE and TRANSD, and $\lambda$ = 0.01, 0.1, 1, 10 for DISTMULT and COMPLEX, and determined the best hyperparameters, listed in Table 2, based on their performance on the validation set after pre-training. Due to limited computational resources, we deliberately limit the dimension of the embeddings to k = 50, similar to the value used in earlier works, to save time. We also apply certain constraints or regularizations to these models, which are mostly the same as those described in their original publications and are also listed in Table 2.
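The filtered ranking protocol and the two metrics can be sketched as follows. The sketch ranks only tail replacements and assumes a distance-style score function by default; the function names and toy distances are hypothetical.

```python
def filtered_rank(score_fn, test_triple, entities, known_triples,
                  higher_is_better=False):
    """Rank of the true tail among all candidate tails, skipping candidates
    that form another known true triple (the filtered setting)."""
    h, r, t = test_triple
    true_score = score_fn(h, r, t)
    rank = 1
    for e in entities:
        if e == t or (h, r, e) in known_triples:
            continue  # filtered out: the target itself or another true triple
        s = score_fn(h, r, e)
        better = s > true_score if higher_is_better else s < true_score
        if better:
            rank += 1
    return rank


def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and the fraction of ranks that are at most k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


# Hypothetical distances for LocatedIn(NewOrleans, ?); USA is another true tail.
toy_dist = {"Louisiana": 0.10, "Florida": 0.05, "Texas": 0.30, "USA": 0.01}
known = {("NewOrleans", "LocatedIn", "Louisiana"), ("NewOrleans", "LocatedIn", "USA")}
rank = filtered_rank(lambda h, r, t: toy_dist[t],
                     ("NewOrleans", "LocatedIn", "Louisiana"),
                     list(toy_dist), known)
print(rank)
print(mrr_and_hits([1, 2, 20]))
```

Without filtering, the true fact involving USA would push the target one rank lower even though the model made no mistake on it.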

Implementation Details
In the adversarial training stage, we keep all the hyperparameters determined in the pre-training stage unchanged. The number of candidate negative triples, $N_s$, is set to 20 in all cases, which performed best among the candidate set of {5, 10, 20, 30, 50}. We train for 5000 epochs, with 100 mini-batches for each epoch. We also use early stopping in adversarial training by evaluating MRR on the validation set every 100 epochs.

Results
Results of our experiments, as well as baselines, are shown in Table 4. All settings of adversarial training bring a pronounced improvement to the model, which indicates that our method is consistently effective in various cases. TRANSE performs slightly worse than TRANSD on FB15k-237 and WN18, but better on WN18RR. Using DISTMULT or COMPLEX as the generator does not affect performance greatly. TRANSE and TRANSD enhanced by KBGAN significantly beat their corresponding baseline implementations, and outperform stronger baselines in some cases. Because this is a prototypical, proof-of-principle experiment, we did not expect state-of-the-art results. Being simple models proposed several years ago, TRANSE and TRANSD have limitations in expressiveness that are unlikely to be fully compensated for by a better training technique. Future research may incorporate more advanced models into KBGAN, and we believe it has the potential to become state-of-the-art.
Table 4 (caption excerpt): [...] (Lin et al., 2015) with its default parameters. Results marked with ‡ are copied from Dettmers et al. (2017). All other baseline results are copied from their original papers.
To illustrate the training progress, we plot the performance of the discriminator on the validation set over epochs in Figure 2. As these graphs show, performance consistently trends upward and converges toward its maximum as training proceeds, which indicates that KBGAN is a robust GAN that can converge to good results in various settings, even though GANs are well known for being difficult to converge. Fluctuations in these graphs may seem more prominent than for other KGE models, but they are normal for an adversarially trained model. Note that in some cases the curve still tends to rise after 5000 epochs. We do not have sufficient computational resources to train for more epochs, but we believe that these runs would also eventually converge.
Table 5: Examples of negative samples in the WN18 dataset. The first column is the positive fact, and the term in bold is the one to be replaced by an entity in the next two columns. The second column consists of random entities drawn from the whole dataset. The third column contains negative samples generated by the generator in the last 5 epochs of training. Entities in italic are considered to have a semantic relation to the positive one.

Case study
To demonstrate that our approach does generate better negative samples, we list some examples of them in Table 5, using the KBGAN (TRANSE + DISTMULT) model and the WN18 dataset. All hyperparameters are the same as those described in Section 4.1.3.
Compared to uniform random negatives, which are almost always totally unrelated, the generator produces more semantically related negative samples. This is different from the type relatedness we used as an example in Section 3.2, but it also helps training. In the first example, two of the five terms are physically related to the process of distilling liquids. In the second example, three of the five entities are geographical objects. In the third example, two of the five entities express the concept of "gather".
Because we deliberately limited the strength of the generated negatives by using a small $N_s$, as described in Section 3.3, the semantic relation is fairly weak, and there are still many unrelated entities. However, empirical results (obtained when selecting the optimal $N_s$) show that this situation is more beneficial for training the discriminator than generating even stronger negatives.