A Greedy Bit-flip Training Algorithm for Binarized Knowledge Graph Embeddings

This paper presents a simple and effective discrete optimization method for training the binarized knowledge graph embedding model B-CP. Unlike prior work, which uses an SGD-based method with quantization of real-valued vectors, the proposed method directly optimizes binary embedding vectors through a series of bit-flipping operations. On standard knowledge graph completion tasks, the B-CP model trained with the proposed method achieved performance comparable to that of SGD-trained B-CP, as well as to state-of-the-art real-valued models with similar embedding dimensions.


Introduction
Knowledge graph embedding (KGE) has a wide range of applications in AI and NLP, such as knowledge acquisition, question answering, and recommender systems. Most existing KGE models represent entities and relations as real- or complex-valued vectors, and thus consume a large amount of memory (Nickel et al., 2011; Bordes et al., 2013; Socher et al., 2013; Yang et al., 2014; Wang et al., 2014; Lin et al., 2015; Nickel et al., 2016; Trouillon et al., 2016; Hayashi and Shimbo, 2017; Liu et al., 2017; Manabe et al., 2018; Kazemi and Poole, 2018; Dettmers et al., 2018; Balažević et al., 2019a; Xu and Li, 2019; Balažević et al., 2019b). To deal with knowledge graphs with more than a million entities, more lightweight models are desirable for faster processing and reduced memory consumption, as AI applications on mobile devices are becoming more common. Kishimoto et al. (2019b) proposed a binarized KGE model, B-CP, wherein all vector components are binarized, allowing them to be stored compactly in a bitwise representation. Despite reducing memory consumption by more than an order of magnitude, B-CP performed as well as existing real-valued KGE models on benchmark tasks.

* The first and second authors contributed equally to this work.
B-CP is based on the CP decomposition of a knowledge graph (Lacroix et al., 2018;Kazemi and Poole, 2018). It is fully expressive (Kishimoto et al., 2019a), meaning that any knowledge graph can be represented as a B-CP model.
During the training of B-CP, however, real-valued embeddings are maintained and quantized at each training step (Kishimoto et al., 2019b). The loss function is computed with respect to the quantized vectors, but stochastic gradient descent is performed on the real-valued vectors with the help of Hinton's "straight-through" estimator (HSTE) (Bengio et al., 2013). Thus, training does not benefit significantly from the compact bitwise representation, although score computation is accelerated by a bitwise technique. DKGE (Li et al., 2020) is another recently proposed binary KGE model, but it, too, maintains real-valued vectors during training, as it solves a relaxed optimization problem with continuous variables.
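To contrast with the method proposed below, the HSTE-based training scheme can be sketched roughly as follows. This is our simplified illustration, not the authors' implementation; `hste_step`, `grad_fn`, and the update form are assumptions made for exposition.

```python
import numpy as np

def hste_step(w_real, grad_fn, lr=0.1):
    """One HSTE-style update (simplified sketch): the loss gradient is
    evaluated at the quantized (sign) vector, but is passed "straight
    through" and applied to the underlying real-valued vector."""
    w_bin = np.sign(w_real)   # quantize for the forward pass
    g = grad_fn(w_bin)        # gradient w.r.t. the quantized vector
    return w_real - lr * g    # update the real-valued vector

# After training, only sign(w_real) is kept as the binary embedding.
```

The point of the sketch is that real-valued vectors must be stored throughout training, which is exactly what greedy bit flipping avoids.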
In this paper, we propose greedy bit flipping, a new training approach for B-CP in which binary vectors are optimized directly, i.e., without the intervention of real-valued vectors. Bits in the binary vectors are flipped sequentially, in a greedy manner, whenever flipping improves the objective loss. The advantages of greedy bit flipping are: (1) it does not need to maintain real-valued vectors even during training; (2) it is simple and easy to implement; and (3) it has only a few hyperparameters.

Binarized CP Decomposition for Knowledge Graphs
A knowledge graph is a set of triples (e_i, e_j, r_k), where e_i and e_j are the subject and object entities (represented as nodes in the graph), respectively, and r_k is the label of the relation between them (corresponding to labeled arcs in the graph). When a triple is in a knowledge graph, it is called a fact.
A knowledge graph can be equivalently represented by a third-order Boolean tensor X = [x_ijk] ∈ {0, 1}^{N_e × N_e × N_r}, where N_e is the number of entities in the graph and N_r is the number of relation labels; x_ijk = 1 if the triple (e_i, e_j, r_k) is a fact, and x_ijk = 0 otherwise.
CP decomposition (Hitchcock, 1927) is a general technique for decomposing a tensor into a sum of rank-1 tensors. For a third-order tensor X representing a knowledge graph, its approximate CP decomposition is given by X ≈ Σ_{d=1}^{D} a_d ⊗ b_d ⊗ c_d, where ⊗ denotes the outer product, and a_d, b_d ∈ R^{N_e} and c_d ∈ R^{N_r} are real (column) vectors. The matrices A = [a_1 a_2 ··· a_D] ∈ R^{N_e × D}, B = [b_1 b_2 ··· b_D] ∈ R^{N_e × D}, and C = [c_1 c_2 ··· c_D] ∈ R^{N_r × D} are called factor matrices. For any matrix M, let m_i: denote its i-th row vector. Then, the component x_ijk of X can be written as x_ijk ≈ ⟨a_i:, b_j:, c_k:⟩ = Σ_{d=1}^{D} a_id b_jd c_kd. Thus, the vectors a_i:, b_j:, and c_k: can be regarded as D-dimensional vectors representing the subject entity e_i, the object entity e_j, and the relation label r_k, respectively.
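As a toy illustration (our sketch; the dimensions and variable names are arbitrary), the CP score of a triple is simply a three-way inner product of the corresponding rows of the factor matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
Ne, Nr, D = 4, 2, 3                # toy sizes: entities, relations, rank

A = rng.standard_normal((Ne, D))   # subject-entity embeddings
B = rng.standard_normal((Ne, D))   # object-entity embeddings
C = rng.standard_normal((Nr, D))   # relation embeddings

def cp_score(i, j, k):
    """Score of triple (e_i, e_j, r_k): sum_d A[i,d] * B[j,d] * C[k,d]."""
    return float(np.sum(A[i] * B[j] * C[k]))

# The full tensor X is approximated entrywise by this score.
X_approx = np.einsum('id,jd,kd->ijk', A, B, C)
assert np.isclose(X_approx[1, 2, 0], cp_score(1, 2, 0))
```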
The B-CP decomposition of a knowledge graph (Kishimoto et al., 2019b) differs from standard CP in that X is decomposed in terms of binary vectors a_d, b_d ∈ {−1, +1}^{N_e} and c_d ∈ {−1, +1}^{N_r}. As with CP, B-CP decomposition can be cast as a binary classification problem and solved by logistic regression. First, each x_ijk is assumed to be a random variable sampled independently from a probability distribution parameterized by A, B, and C:

p(x_ijk = 1 | A, B, C) = σ(θ_ijk),

where θ_ijk = ⟨a_i:, b_j:, c_k:⟩ is called the score of the triple (e_i, e_j, r_k), and σ(x) = 1 / (1 + e^{−x}) is the sigmoid function. To train the factor matrices to match the observed/unobserved facts encoded in X, we minimize the logistic loss

L(A, B, C) = − Σ_{(i,j,k)} [ x_ijk log σ(θ_ijk) + (1 − x_ijk) log(1 − σ(θ_ijk)) ].   (1)


Greedy Bit-flip Training for B-CP

The proposed training method randomly samples an element (or bit) of the factor matrices A, B, C of B-CP, and negates its sign if this "bit flip" reduces the objective loss. This process is repeated until the loss no longer improves or a specified number of iterations is reached. Pseudocode for the algorithm is given in Algorithms 1 and 2. In Algorithm 1, when a factor matrix is updated, the other two factor matrices are fixed. Because the number N_r of relations is in general considerably smaller than the number N_e of entities, a change in the relation matrix C influences the total loss much more strongly than changes in the entity matrices A and B. For this reason, we update C before A and B in each iteration to promote faster convergence.
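The objective in Eq. (1) can be sketched as follows (our minimal illustration; the interface of the `score` callback is an assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def triple_loss(x, theta):
    """Logistic loss of one triple: x is the 0/1 label, theta the score."""
    p = sigmoid(theta)
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

def total_loss(triples, score):
    """Eq. (1): sum of per-triple losses.
    `triples` holds (i, j, k, label) tuples; `score(i, j, k)` returns theta_ijk."""
    return sum(triple_loss(x, score(i, j, k)) for (i, j, k, x) in triples)
```

Bit flipping only ever needs loss *values*, never gradients, which is what makes a purely discrete optimizer possible.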
The actual update is carried out in Algorithm 2. As remarked on Line 2, the rows of a factor matrix, each of which represents a single entity or relation, can be processed in parallel, because the score of an individual triple depends only on the single subject, object, and relation it contains; for instance, even when all subject embeddings are updated simultaneously, only one of them can change the score of any given triple. This means that, when multiple rows of a factor matrix are updated, the change in the total loss in Eq. (1) is invariant to the order of the updates, as long as the other two factor matrices are fixed. Since Algorithm 2 updates only one matrix, its rows can be processed in parallel.
By contrast, the loss does depend on the order in which the columns (i.e., bits) within a row, the components of an embedding vector, are updated. We thus change the order of the updated columns every time Algorithm 2 is called, by shuffling the set [D] of dimensions in Line 1.
In Algorithm 2, each bit in a factor matrix is examined to see whether it is worth flipping. For instance, consider a component (bit) a_ij of the factor matrix A, and let A′ denote A after a_ij is flipped to −a_ij. The change in the loss is then

Δ(a_ij) = L(A′, B, C) − L(A, B, C),   (2)

where both terms are evaluated by Eq. (1), with the scores θ_ijk computed before the update (i.e., using A, not A′); flipping a single bit changes each affected score by a fixed, easily computed amount. Only if Δ(a_ij) is found to be negative, i.e., the loss decreases, is a_ij actually flipped. The same rule applies to the bits of the factor matrices B and C. Repeated application of this update guarantees that the loss is non-increasing. However, the loss may get stuck in a local minimum, depending on the order of the updates. Training is terminated when the objective loss does not improve, or when a predetermined number of epochs has elapsed.
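The flip test at the heart of Algorithm 2 can be sketched as follows. This is our simplified Python rendering, not the authors' Java implementation; it recomputes scores naively rather than bitwise, and the function name and triple encoding are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss_term(x, theta):
    """Logistic loss of one triple (label x in {0,1}, score theta)."""
    p = sigmoid(theta)
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

def try_flip_subject_bit(A, B, C, i, d, triples_of_subject):
    """Flip A[i][d] in {-1,+1} if doing so lowers the loss over the
    triples whose subject is e_i (only those triples are affected)."""
    delta = 0.0
    for (j, k, x) in triples_of_subject:
        theta = sum(A[i][t] * B[j][t] * C[k][t] for t in range(len(A[i])))
        # Flipping A[i][d] negates one term of the sum, so the new score
        # is theta - 2 * A[i][d] * B[j][d] * C[k][d].
        theta_flipped = theta - 2 * A[i][d] * B[j][d] * C[k][d]
        delta += loss_term(x, theta_flipped) - loss_term(x, theta)
    if delta < 0:            # greedy rule: accept only improving flips
        A[i][d] = -A[i][d]
        return True
    return False
```

Sweeping this test over all bits of one matrix, with the other two fixed, is one call of Algorithm 2.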

Fast Score Computation by Bitwise Operations
In this section, we describe implementation details necessary to speed up training. As Algorithm 2 involves repeated computation of the scores θ_ijk, fast score computation is key to fast training. Although one easy approach is to cache all scores in memory, the number of facts in a knowledge graph may be huge. We therefore use bitwise operations to speed up score computation.
We can compute the B-CP scores by bitwise operations as follows:

θ_ijk = D − 2 h(XNOR(ā_i:, b̄_j:), c̄_k:),

where ā_i:, b̄_j:, and c̄_k: are the bit representations of a_i:, b_j:, and c_k: (with −1 mapped to 0 and +1 to 1), h(·, ·) is the Hamming distance, and XNOR(·, ·) is the componentwise negation of exclusive-or. As shown in Figure 1, bitwise score computation is much faster than naively computing the score as the sum Σ_{d=1}^{D} a_id b_jd c_kd, making the cost of score computation negligible.
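To make the bitwise trick concrete, here is a sketch using Python integers as bit vectors (the paper's implementation is in Java; the helper names are ours). The identity relies on the fact that for ±1 values, a product of two components corresponds to XNOR of their bits, and the sum of D agreements/disagreements equals D − 2h.

```python
D = 8

def to_bits(vec):
    """Pack a ±1 vector into an int: +1 -> bit 1, -1 -> bit 0."""
    bits = 0
    for d, v in enumerate(vec):
        if v == 1:
            bits |= 1 << d
    return bits

def bitwise_score(a_bits, b_bits, c_bits):
    """theta = D - 2 * HammingDistance(XNOR(a, b), c)."""
    mask = (1 << D) - 1
    ab = ~(a_bits ^ b_bits) & mask           # XNOR, truncated to D bits
    hamming = bin((ab ^ c_bits) & mask).count('1')
    return D - 2 * hamming

# Sanity check against the naive sum of triple products.
import random
random.seed(0)
a = [random.choice((-1, 1)) for _ in range(D)]
b = [random.choice((-1, 1)) for _ in range(D)]
c = [random.choice((-1, 1)) for _ in range(D)]
naive = sum(x * y * z for x, y, z in zip(a, b, c))
assert bitwise_score(to_bits(a), to_bits(b), to_bits(c)) == naive
```

With machine-word popcount instructions, one XNOR plus one popcount replaces D multiply-adds.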

Negative Sampling and Reciprocal Relations
Before calling Algorithm 1, for each (e_i, e_j, r_k) in the training set Pos, we add its reciprocal triple (e_j, e_i, r_k^{−1}) to the set, with a new relation label r_k^{−1}. This technique was used by Lacroix et al. (2018) and Kazemi and Poole (2018), and is effective for models such as CP and B-CP, in which an entity has separate embeddings for subject and object.
Following prior work, we also approximate the objective loss by sampling negative examples (Algorithm 1, Line 3) to cope with the enormous size and sparsity of knowledge graphs. Specifically, for each (e_i, e_j, r_k) in the training set, a predetermined number of entities are first sampled randomly. Then, for each sampled entity e, we create a negative triple (e_i, e, r_k) and its reciprocal negative triple (e, e_i, r_k^{−1}).
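The preprocessing described above can be sketched as follows. This is our illustration; the function names and the convention of encoding r_k^{−1} as relation index k + N_r are assumptions.

```python
import random

def add_reciprocals(pos, num_relations):
    """For each (i, j, k), add (j, i, k + num_relations), i.e., the
    reciprocal triple with the new relation label r_k^{-1}."""
    return pos + [(j, i, k + num_relations) for (i, j, k) in pos]

def sample_negatives(pos, num_entities, num_relations, n_neg, rng=random):
    """For each positive (i, j, k), corrupt the object with n_neg random
    entities, and add the reciprocal of each negative as well."""
    neg = []
    for (i, j, k) in pos:
        for _ in range(n_neg):
            e = rng.randrange(num_entities)
            neg.append((i, e, k))                    # (e_i, e, r_k)
            neg.append((e, i, k + num_relations))    # (e, e_i, r_k^{-1})
    return neg
```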
Experiments

In the entity prediction task, a KGE model is given a set of incomplete triples, each generated by hiding one of the entities in a positive triple in the test set; i.e., from a positive triple (e_i, e_j, r_k), the incomplete triples (?, e_j, r_k) and (e_i, ?, r_k) are generated. For each incomplete triple, the KGE model must produce a list of all entities (including the correct entity, e_i or e_j) ranked by the score obtained when each entity is plugged in place of the placeholder "?". The quality of the output ranking is measured by two standard evaluation measures for the KGC task: Mean Reciprocal Rank (MRR) and Hits@10, in the "filtered" setting (Bordes et al., 2013).
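The filtered ranking and the two measures can be sketched as follows (our illustration; `scores` maps candidate entities to scores, and `known_true` is the set of other entities also known to complete the triple, which the filtered setting removes):

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among all candidates, after removing other
    entities that also form known-true triples."""
    s = scores[target]
    rank = 1
    for e, v in scores.items():
        if e == target or e in known_true:
            continue
        if v > s:
            rank += 1
    return rank

def mrr_and_hits10(ranks):
    """Mean Reciprocal Rank and Hits@10 over a list of filtered ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits10 = sum(1 for r in ranks if r <= 10) / len(ranks)
    return mrr, hits10
```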
We selected the hyperparameter D of the proposed method (henceforth denoted "Bit-flip B-CP") via grid search over D ∈ {200, 400, 600}, such that the filtered MRR is maximized on the validation set. The maximum number of training epochs was set to 20. We generated 20 negative triples per positive training triple for FB15k-237 and 5 for WN18RR. Bit-flip B-CP was implemented in Java and run on a laptop PC with a 2.7 GHz Intel Core i7 CPU. Our implementation with D = 400 took about 5 minutes to finish 20 training epochs on the WN18RR training set.

Results
Training Convergence Figure 2 shows the MRR scores on the validation set at each training epoch. For comparison, we also trained B-CP using HSTE-based stochastic gradient descent with the best hyperparameters reported by Kishimoto et al. (2019a).
The figure shows that greedy bit flipping (Bit-flip B-CP) requires far fewer training epochs to converge than HSTE-based training (HSTE B-CP). For both datasets, the best MRR for Bit-flip B-CP was obtained with D = 400, and we therefore used this setting in the subsequent test evaluations.

(Table 1: results for the existing models are taken from Ruffinelli et al. (2020), Kishimoto et al. (2019b), and Li et al. (2020), respectively; their memory consumption is estimated from the reported numbers of parameters.)

To examine the dependence on initial parameter values, we trained five bit-flip-trained B-CP models using different initial values generated with varied random seeds. The figures reported in Table 1 for Bit-flip B-CP are averages over these five models, with standard deviations shown in parentheses. The small standard deviations indicate that bit-flip training is stable across random seeds.

KGC Performance
Notice that B-CP consists of binary vectors, which makes its memory consumption approximately 1/20 of that of the real-valued models DistMult and ConvE. Taking advantage of this small memory footprint, we created an ensemble of five B-CP models; i.e., the score θ_ijk is computed as the sum of the scores of all models in the ensemble. Its performance is shown in the rows titled "†Bit-flip B-CP" of Table 1. For comparison, we also show the result for an ensemble of five HSTE-trained B-CP models ("†HSTE B-CP"). As the table shows, ensembling improved task performance. Note that even the ensemble models consume much less memory than existing models that use 32-bit real-valued embeddings.
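Score-level ensembling is a one-liner given a per-model score function (our sketch; `BCPModel` is a toy stand-in, not the paper's implementation):

```python
class BCPModel:
    """Toy stand-in: holds ±1 factor matrices and scores triples."""
    def __init__(self, A, B, C):
        self.A, self.B, self.C = A, B, C

    def score(self, i, j, k):
        return sum(a * b * c
                   for a, b, c in zip(self.A[i], self.B[j], self.C[k]))

def ensemble_score(models, i, j, k):
    """Score of triple (e_i, e_j, r_k) under the ensemble: the sum of
    the member models' scores."""
    return sum(m.score(i, j, k) for m in models)
```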

Conclusion
In this paper, we have introduced greedy bit flipping, a simple yet effective discrete optimization method for training the binarized KGE model B-CP.
On standard KGC benchmark datasets, B-CP models trained by bit flipping were on par with HSTE-trained B-CP in terms of accuracy. Experimental results also show that KGC performance was stable over different initial values. Building an ensemble of multiple B-CP models is made tractable by the small memory consumption of B-CP, and brought further performance improvement.
Bit flipping is unique in that it does not require the loss function to be differentiable, making it potentially applicable to a wide range of loss functions. We plan to investigate this direction in future work. Applying bit flipping to other binarized KGE models is another interesting direction; a binary version of DistMult is a natural starting point, as such a model would be closely related to DKGE (Li et al., 2020), a recently proposed binarized model.