Knowledge Graph Embedding Compression

Knowledge graph (KG) representation learning techniques that learn continuous embeddings of entities and relations in the KG have become popular in many AI applications. With a large KG, the embeddings consume a large amount of storage and memory. This is problematic and prohibits the deployment of these techniques in many real-world settings. Thus, we propose an approach that compresses the KG embedding layer by representing each entity in the KG as a vector of discrete codes and then composing the embeddings from these codes. The approach can be trained end-to-end with simple modifications to any existing KG embedding technique. We evaluate the approach on various standard KG embedding evaluations and show that it achieves 50-1000x compression of embeddings with a minor loss in performance. The compressed embeddings also retain the ability to perform various reasoning tasks such as KG inference.


Introduction
Knowledge graphs (KGs) are a popular way of storing world knowledge, lending support to a number of AI applications such as search (Singhal, 2012), question answering (Lopez et al., 2013; Berant et al., 2013) and dialog systems (He et al., 2017; Young et al., 2018). Typical KGs are huge, consisting of millions of entities and relations.
With the growth in use of KGs, researchers have explored ways to learn better representations of KGs in order to improve generalization and robustness in downstream tasks. In particular, there has been interest in learning embeddings of KGs in continuous vector spaces (Bordes et al., 2011, 2013; Socher et al., 2013). KG embedding approaches represent entities as learnable continuous vectors while each relation is modeled as an operation in the same space, such as translation, projection, etc. (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Ji et al., 2015). These approaches give us a way to perform reasoning in KGs with simple numerical computation in continuous spaces.
Despite the simplicity and wide applicability of KG embedding approaches, they have a few key issues. A major issue is that the number of embedding parameters grows linearly with the number of entities. This is challenging when we have millions or billions of entities in the KG, especially when many entities or relations are sparse. There is a clear redundancy in the continuous parameterization of embeddings given that many entities are actually similar to each other. This overparameterization can lead to a drop in performance due to overfitting in downstream models. The large memory requirement of continuous representations also prevents models that rely on them from being deployed on modest user-facing computing devices such as mobile phones.
To address this issue, we propose a coding scheme that replaces the traditional KG embedding layer by representing each entity in the KG with a K-way D-dimensional code (KD code) (van den Oord et al., 2017; Chen et al., 2018; Chen and Sun, 2019). Each entity in the KG is represented as a sequence of D codes where each code can take values in {1 . . . K}. The codes are learnt in such a way that they capture the semantics and the relational structure of the KG, i.e. the codes that represent similar or related entities are typically also similar 1 . The coding scheme is much more compact than traditional KG embedding schemes.
We learn the discrete codes for entities using an autoencoder-style model which learns a discretization function that maps continuous entity representations to discrete codes and a reverse-discretization function that maps the discrete codes back to continuous entity representations. The discretization and reverse-discretization functions are jointly learnt end-to-end. The inherent discreteness of the representation learning problem poses several learning issues. We tackle these issues by resorting to the straight-through estimator (Bengio et al., 2013) or the tempering softmax (Maddison et al., 2016; Jang et al., 2016) and by using guidance from existing KG embeddings to smoothly guide learning of the discrete representations.
We evaluate our approach on various standard KG embedding evaluations and we find that we can massively reduce the size of the KG embedding layer while suffering only a minimal loss in performance (if at all). We show that the proposed approach for learning discrete KG representations leads to a good performance in the task of link prediction (cloze entity prediction) as well as in the task of KG reasoning and inference.

Knowledge Graph Embeddings
A knowledge graph (KG) G ⊆ E × R × E can be formalized as a set of triplets (e_i, r, e_j) composed of head and tail entities e_i, e_j ∈ E (E being the set of entities) and a relation r ∈ R (R being the set of relations); we write n_e = |E| and n_r = |R|. The goal of learning KG embeddings is to learn vector embeddings e ∈ R^{d_e} for each entity e ∈ E (and possibly also relation embeddings r ∈ R^{d_r}).
Typical KG embedding approaches are multi-layer neural networks which consist of an embedding component and a scoring component. The embedding component maps each entity to its corresponding embedding. The scoring component learns a scoring function f : E × R × E → R where f(e_i, r, e_j) defines the score of the triplet (e_i, r, e_j). KG embeddings are learnt by defining a loss function L and solving the following optimization problem:

\min_{\Theta} \sum_{(e_i, r, e_j)} \mathcal{L}\big(f(e_i, r, e_j)\big) \qquad (1)

Here Θ includes all embedding parameters and any other neural network parameters. The loss function typically encourages the score of a positive triplet (e_i, r, e_j) to be higher than that of a (corrupted) negative triplet. In Table 1, we summarize the scoring functions of several existing KG embedding approaches as well as their corresponding entity (and relation) representation parameters.
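As an illustration of this scoring-plus-loss setup, the following is a minimal numpy sketch of a TransE-style scorer with a margin-based (hinge) triplet loss. The embedding values are toy numbers chosen for illustration; a real implementation would use PyTorch tensors with autograd:

```python
import numpy as np

def transe_score(e_head, r, e_tail):
    # TransE treats a relation as a translation: the score is the negative
    # L2 distance between (head + relation) and tail, so higher is better.
    return -np.linalg.norm(e_head + r - e_tail)

def margin_loss(pos_score, neg_score, margin=1.0):
    # Hinge loss: push the positive triple's score above the corrupted
    # triple's score by at least `margin` (as in Bordes et al., 2013).
    return max(0.0, margin + neg_score - pos_score)

# Toy triple: head + relation lands exactly on the tail entity.
e_i, r, e_j = np.zeros(4), np.ones(4), np.ones(4)
e_corrupt = np.full(4, 5.0)                 # corrupted tail entity

pos = transe_score(e_i, r, e_j)             # -> 0.0 (perfect translation)
neg = transe_score(e_i, r, e_corrupt)       # -> -8.0
loss = margin_loss(pos, neg)                # -> 0.0 (margin already satisfied)
```

In practice, the summation in (1) runs over all training triplets together with sampled corruptions, and the loss is minimized by stochastic gradient descent over all embedding parameters Θ.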
In all of these KG embedding models, the number of parameters grows with the number of entities and relations in the KG as well as with the size of their representations. This number can be very large, and learning KG embeddings can be a challenge for large, sparse KGs. In this paper, we present a novel coding scheme that significantly reduces the number of embedding parameters. We do so by leveraging recent advances in discrete representation learning, which we summarize below.

Discrete Representation Learning
Typical deep learning methods define an embedding function as F : V → R^d, where V denotes the vocabulary, such as words, sub-words, entities, relations, etc., and each symbol in the vocabulary is mapped to a continuous vector in R^d. The embedding function can be trained separately from the task in a completely unsupervised manner or jointly with other neural net parameters to optimize the target loss function. A common specification of the embedding function in NLP is a lookup table L ∈ R^{n×d} with n = |V|. The total number of bits used to represent this table is O(nd) (32nd if each real number is represented by a 32-bit floating point number). This is problematic for large n and/or d.
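To give a concrete sense of scale for the O(nd) cost above, the table size can be computed directly (the vocabulary and dimension below are illustrative values, not taken from the paper):

```python
# Back-of-the-envelope storage of a dense lookup table: n * d 32-bit floats.
def table_bits(n_vocab, dim, bits_per_float=32):
    return n_vocab * dim * bits_per_float

# e.g. a 15,000-symbol vocabulary with 200-dimensional embeddings:
megabytes = table_bits(15_000, 200) / 8 / 1e6   # bits -> bytes -> MB
# -> 12.0 MB for the embedding table alone
```

For web-scale KGs with millions of entities, the same arithmetic quickly reaches gigabytes, which motivates the compression schemes discussed next.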
Thus, various approaches have been proposed to compress embedding layers in neural networks. These include weight-tying (Press and Wolf, 2016; Inan et al., 2016), matrix-factorization based approaches (Acharya et al., 2019), and approaches that rely on the gumbel softmax (Baevski and Auli, 2018), vector quantization (Chen and Sun, 2019) and codebook learning (Shu and Nakayama, 2017). In this work, we build on discrete representation learning approaches (van den Oord et al., 2017; Chen et al., 2018; Chen and Sun, 2019). Discrete representation learning gives us a way to mitigate this issue by representing each symbol v in the vocabulary as a compact vector of discrete codes. Discrete representations have another clear benefit: they are interpretable and are a natural fit for complex reasoning, planning and predictive learning tasks.
Table 1 (notation): ⟨x_1, . . . , x_k⟩ denotes the generalized dot product, x̄ denotes the conjugate of a complex number, ⊗ denotes circular correlation, σ denotes an activation function, ∗ denotes the convolution operator, ◦ denotes the Hadamard product and ×_k denotes the tensor product along the k-th mode.
Making the coding process differentiable enables end-to-end learning of discrete representations by optimizing task-specific objectives, e.g. in language modelling and machine translation. In this work, we use discrete representation learning to compress KG embeddings. We describe our approach below.

Discrete KG Representation Learning
In order to learn discrete KG representations, we define a quantization function Q : R^{d_e} → R^{d_e}, which (during training) takes raw KG embeddings and produces their quantized representations. Q = D ∘ R (first apply D, then R) is composed of two functions: 1. A discretization function D : R^{d_e} → Z^D that maps a continuous KG embedding into a K-way D-dimensional discrete code with cardinality |Z| = K (we call this a KD code). 2. A reverse-discretization function R : Z^D → R^{d_e} that maps the KD code back to a continuous embedding.
During training, both D and R are learnt. Afterwards, every entity in the KG is represented by a KD code obtained by applying the discretization function D, to save space (compression); the continuous embeddings and the parameters of the discretization function are then no longer needed. In the test/inference stage, the reverse-discretization function R is used to decode the KD codes into regular embedding vectors for every entity. We use vector quantization (Chen et al., 2018; Chen and Sun, 2019) and codebook learning (Cai et al., 2010) to define the discretization and reverse-discretization functions D and R, which we describe below.
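The overall quantize-then-reconstruct pipeline can be sketched in a few lines of numpy. This is a simplified sketch: the sizes (d_e = 8, D = 4, K = 5) are toy values, the key vectors double as the codebook, and no learning happens here, only the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize(e, keys):
    # D: split the embedding into D partitions and assign each partition to
    # its nearest key vector, producing a D-dimensional code over {0..K-1}.
    D, K, d_sub = keys.shape
    parts = e.reshape(D, d_sub)
    dists = ((parts[:, None, :] - keys) ** 2).sum(axis=-1)   # shape (D, K)
    return dists.argmin(axis=1)

def reverse_discretize(code, codebooks):
    # R: look up one continuous sub-vector per code and concatenate them.
    return np.concatenate([codebooks[j, code[j]] for j in range(len(code))])

# Toy sizes: d_e = 8 split into D = 4 partitions of size 2, K = 5 keys each.
keys = rng.normal(size=(4, 5, 2))
e = rng.normal(size=8)
code = discretize(e, keys)                 # KD code, shape (4,)
e_hat = reverse_discretize(code, keys)     # reconstructed embedding, shape (8,)
```

At test time only `code` and the reconstruction parameters are stored; the original continuous embedding `e` is discarded.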

Discretization Function D
The goal of the discretization function is to map continuous KG embedding vectors into KD codes. We model the discretization function using nearest neighbor search (Cayton, 2008). Given continuous KG embeddings {e_i | i = 1 . . . n_e} as query vectors, we define a set of K key vectors {c_k | k = 1 . . . K}. In order to learn D-dimensional discrete codes, we partition the query and key vectors into D partitions, where each partition corresponds to one of the D discrete codes: e_i = [e_i^(1); . . . ; e_i^(D)] and c_k = [c_k^(1); . . . ; c_k^(D)].

Vector Quantization (VQ): Our first alternative for discretization is vector quantization (Ballard, 1997), a classical quantization technique for data compression. The j-th discrete code of the i-th entity, z_i^(j), is computed by calculating distances between the corresponding query vector partition e_i^(j) and the key vector partitions:

z_i^{(j)} = \arg\min_{k} \mathrm{dist}\big(e_i^{(j)}, c_k^{(j)}\big) \qquad (2)

We use the Euclidean distance function dist(a, b) = ||a − b||_2^2 in our experiments. Note that the argmin operation is inherently non-differentiable: the resulting quantization function Q has no gradient towards the input query vectors. Thus, we use the straight-through estimator (Bengio et al., 2013) to compute a pseudo-gradient. This means that during the forward pass we compute Q as defined here, but during the backward pass we pass the gradient straight through to the query vectors.

Tempering Softmax (TS): Vector quantization is a popular method for learning discrete representations. Yet another popular approach is a continuous relaxation of (2) via the tempering softmax (Maddison et al., 2016; Jang et al., 2016). Here we use the dot product for computing the proximity between query and key vectors:

z_i^{(j)} = \arg\max_{k} \big\langle e_i^{(j)}, c_k^{(j)} \big\rangle / \tau

Here, τ is the temperature and ⟨a, b⟩ = a^T b denotes the dot product operation. Note that this function still carries an inherent non-differentiability. Hence, we relax the above and compute probability vectors z̃_i^(j) which represent the probability distribution of the j-th dimension of the discrete code for the i-th entity taking a particular value k:

\tilde{z}_i^{(j)}[k] = \frac{\exp\big(\langle e_i^{(j)}, c_k^{(j)} \rangle / \tau\big)}{\sum_{k'=1}^{K} \exp\big(\langle e_i^{(j)}, c_{k'}^{(j)} \rangle / \tau\big)}
Given the probability vectors z̃_i^(j), we can compute the discrete codes z_i^(j) simply by taking the argmax. To compute discrete KD codes, we set a small value of τ: as τ → 0, the softmax becomes a spike concentrated on the argmax dimension. We again estimate pseudo-gradients by setting a very small τ in the forward pass (i.e. close to the discrete case) and τ = 1 in the backward pass.
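The two discretization alternatives can be contrasted on a single query partition. This numpy sketch uses hand-picked key vectors for illustration; the temperatures show how the tempering softmax interpolates between a near one-hot (discrete) assignment and a near uniform one:

```python
import numpy as np

def vq_code(part, keys):
    # Hard assignment: the argmin of eq. (2). In training, the straight-through
    # estimator copies the gradient from the quantized output back to `part`.
    return int(((keys - part) ** 2).sum(axis=-1).argmin())

def tempering_softmax(part, keys, tau):
    # Soft assignment over the K keys; becomes spiky (near one-hot) as tau -> 0.
    logits = keys @ part / tau
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # K = 3 key vectors
part = np.array([0.9, 0.1])                               # one query partition

k = vq_code(part, keys)                             # -> 0 (nearest key)
p_sharp = tempering_softmax(part, keys, tau=0.05)   # near one-hot on key 0
p_soft = tempering_softmax(part, keys, tau=10.0)    # close to uniform
```

Note that VQ compares by Euclidean distance while TS compares by dot product, matching the two formulations above; both agree on the winning key in this example.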

Reverse-discretization Function R
The goal of the reverse-discretization function is to map discrete KD codes into continuous KG embedding vectors. We first model the reverse-discretization process by a simple linear model which maps the discrete codes to continuous vectors by looking up a learnt codebook. Then, we present an alternative: a non-linear model for reverse-discretization based on recurrent neural networks.

Codebook Lookup (CL): We first define the reverse-discretization function in a simple manner where we substitute every discrete code with a continuous vector from a codebook. Let C be a set of codebooks, with a separate codebook C^(j) for each position j = 1 . . . D in the KD code. We model each codebook simply as a set of vectors C^(j) = {c_k^(j) ∈ R^{d_e/D} | k = 1 . . . K}, and compute the embedding vector for the j-th dimension of the i-th entity as a lookup: e_i^(j) = c^(j)_{z_i^(j)}. The final entity embedding vector e_i is obtained by concatenating the embedding vectors for each dimension: e_i = [e_i^(1); . . . ; e_i^(D)].

Non-linear Reconstruction (NL): While the codebook lookup approach is simple and efficient, due to its linear nature the capacity of the generated KG embedding may be limited. Thus, we also employ a neural network based non-linear approach for embedding reconstruction, built on a Bi-LSTM network.
Given the KD code z_i as a sequence of codes, we map it to a continuous embedding vector by feeding the code to a Bi-LSTM followed by mean pooling. Let h_i^(1), . . . , h_i^(D) be the hidden state representations of the Bi-LSTM cells. We reconstruct the entity embedding ê_i by mean-pooling these code embedding vectors followed by a linear transformation:

\hat{e}_i = W_{rev} \cdot \frac{1}{D} \sum_{j=1}^{D} h_i^{(j)}

We also tried mapping the KD code to a continuous embedding vector by feeding the code to variations of a character-level CNN (Kim et al., 2016). However, the Char-CNN model always performed worse than the Bi-LSTM model in our experiments. This is because our discretization function, which discretizes contiguous partitions of the continuous representation, better suits the Bi-LSTM reconstruction model. In the future, we would like to consider more complex discretization functions together with other non-linear reconstruction models.

Storage Efficiency: A key motivation for learning discrete representations is that we can significantly compress the embedding layer at test time. The size of the embedding layer for typical KG representations is 32 n_e d_e bits (assuming a 32-bit representation), which can be very large. In contrast, with discrete representation learning, we only need to store the entity codes {z_i} and the parameters used in the reverse-discretization function, i.e. the codebooks C or the parameters of the embedding reconstruction Bi-LSTM {Θ_LSTM, W_rev}.
The entity codes require n_e D log_2 K bits. The codebook lookup approach additionally maintains codebooks, which require K d_e parameters (32 K d_e bits). The non-linear reconstruction approach requires 6Dd parameters for the Bi-LSTM (two sets of parameter matrices each for the input, output and forget gates) and d_e d parameters for storing W_rev, i.e. a total of (6D + d_e)d parameters. Here, d is the size of the code embedding vectors.
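Plugging representative sizes into the counts above gives a feel for the savings. The sketch below follows the paper's accounting for the codebook-lookup variant; the entity count and dimensionality are illustrative FB15k-scale values, not numbers reported by the paper:

```python
import math

def dense_bits(n_e, d_e):
    # Dense embedding table: one 32-bit float per dimension per entity.
    return 32 * n_e * d_e

def discrete_bits(n_e, K, D, d_e):
    # Entity codes need n_e * D * log2(K) bits; the codebook-lookup variant
    # additionally stores K * d_e codebook parameters at 32 bits each.
    return n_e * D * math.log2(K) + 32 * K * d_e

n_e, d_e = 14_951, 200            # illustrative FB15k-scale sizes
ratio = dense_bits(n_e, d_e) / discrete_bits(n_e, K=32, D=10, d_e=d_e)
# -> roughly 100x compression, within the ranges reported in the experiments
```

Since the code storage grows only as n_e D log_2 K while the dense table grows as 32 n_e d_e, the ratio improves further as the KG gets larger.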
In both the codebook lookup and non-linear reconstruction formulations, discrete representation learning neatly decouples the KG size (number of entities) from the dimensionality of the continuous embeddings. Thus, the discrete embedding layer can be stored compactly, as D log_2 K is typically much smaller than 32 d_e (considering only the dominating term in n_e).

Test Time Inference of Embeddings: At test time, we retrieve the continuous embedding for an entity from its discrete representation by looking up the codebook or running inference on the reconstruction model. We can further cache the embedding lookups and various intermediate results, such as matrix-vector products, to improve performance. We show in our results that the test time inference overhead is typically very small.

Learning: Similar to previous continuous KG representation learning methods, we learn discrete entity representations by minimizing the triplet loss function. We extend equation (1) as:

\min_{z_e, \theta, \Theta} \sum_{(e_i, r, e_j)} \mathcal{L}\big(f(R(z_{e_i}), r, R(z_{e_j}))\big) \qquad (3)

Here, z_e are the code embeddings, θ are the parameters of the reverse-discretization function (C or {Θ_LSTM, W_rev}) and Θ denotes the parameters of the KG embedding approaches (listed in Table 1). The loss function (3) is differentiable w.r.t. the embedding parameters and the parameters of the entity representation learning methods; however, the discrete codes introduce a non-differentiability. Thus, we use the straight-through estimator (Bengio et al., 2013) or the tempering softmax (Maddison et al., 2016; Jang et al., 2016) to estimate pseudo-gradients, as described in section 3.1.

Guidance from KG embeddings: We find that even with sophisticated discrete representation learning methods, solving the above optimization problem can be challenging in practice. Due to the discreteness of the problem, learning can converge to a suboptimal solution where the discrete codes are of poor quality. Therefore, we also use guidance from continuous KG embeddings to solve (3) when such embeddings are provided 2 .
The key idea is that in addition to optimizing (3), we can encourage the reconstructed embeddings from the learnt discrete codes to mimic continuous embeddings.
In order to provide this guidance from continuous embeddings, during the training, instead of using the reconstructed embedding vector generated from the discrete code, we use a weighted average of the reconstructed embeddings and continuous embeddings obtained using methods described in Table 1: (1 − λ)D • R(e) + λe. Here λ ∈ (0, 1) is a linear interpolant for selecting between reconstructed embeddings and pre-learnt continuous embeddings. We initialize λ to 1 and gradually decrease λ as training proceeds. This enables the method to gradually rely more and more on reconstruction from discrete embeddings. We also add a regularization term ||D • R(e) − e|| 2 2 during the training to encourage the reconstructed embeddings to match the pre-learnt continuous embeddings. This procedure is similar to knowledge-distillation guidance (Hinton et al., 2015) in previous discrete representation learning works (Chen et al., 2018).
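The guidance mechanism can be sketched as follows. This is a numpy illustration only: the linear decay schedule for λ is an assumption, since the text states merely that λ starts at 1 and gradually decreases during training:

```python
import numpy as np

def guided_embedding(e_recon, e_cont, step, total_steps):
    # Interpolate (1 - lam) * D∘R(e) + lam * e, with lam decaying from 1 to 0
    # so training gradually relies on reconstructions from discrete codes.
    lam = max(0.0, 1.0 - step / total_steps)
    blended = (1.0 - lam) * e_recon + lam * e_cont
    reg = float(np.sum((e_recon - e_cont) ** 2))  # ||D∘R(e) - e||^2 regularizer
    return blended, reg

e_cont = np.array([1.0, 2.0])     # pre-learnt continuous embedding
e_recon = np.array([0.5, 2.5])    # reconstruction from discrete codes

start, reg = guided_embedding(e_recon, e_cont, step=0, total_steps=100)
end, _ = guided_embedding(e_recon, e_cont, step=100, total_steps=100)
# start equals the continuous embedding; end equals the reconstruction
```

The regularizer `reg` is added to the training loss so that, even early in training when λ is large, the reconstruction is pulled toward the pre-learnt continuous embedding.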

Experiments
We compare the baseline continuous representations described earlier in Table 1 with the four discrete representation learning techniques described in this paper, obtained by pairing the two discretization functions with the two reverse-discretization functions:
• VQ-CL: D = VQ and R = CL
• VQ-NL: D = VQ and R = NL
• TS-CL: D = TS and R = CL
• TS-NL: D = TS and R = NL

Datasets
We evaluate our approach on four standard link prediction datasets: • FB15k (Bordes et al., 2013) is a subset of Freebase.
• FB15k-237 (Toutanova et al., 2015) is a subset of the FB15k dataset created by removing inverse relations that cause test leakage.
• WN18 (Bordes et al., 2013) is a subset of WordNet.
• WN18RR (Dettmers et al., 2018) is a subset of the WN18 dataset created by removing inverse relations.
We summarize all the data statistics in Table 2. We also use the Countries dataset (Bouchard et al., 2015) for some in-depth analysis of inference abilities of discrete representations.

Implementation Details
We implement discrete KG representation learning by extending OpenKE (Han et al., 2018), an open-source framework for learning KG embeddings implemented in PyTorch 3 . We train and test all our models on a single 2080Ti system. We set K = 32 and D = 10 in our experiments unless stated otherwise. For the linear embedding transformation function in the non-linear reconstruction approach, we use a hidden layer of 100 hidden units.

Results
Link Prediction: We learn discrete representations corresponding to various continuous KG representations (described in Table 1) and compare the obtained discrete representations with their continuous counterparts. We use the same hyper-parameter settings as in the original KG embedding papers. We generate n_e candidate triples for each test triple by combining the test entity-relation pair with all possible entities E. We use the filtered setting (Bordes et al., 2013), i.e. all known true triples are removed from the candidate set except for the current test triple. We use standard evaluation metrics previously used in the literature: mean reciprocal rank (MRR) and hits@10 (H@10). Mean reciprocal rank is the average of the reciprocal of the rank assigned to the true triple among all candidate triples. Hits@10 measures the percentage of times the true triple is ranked within the top 10 candidate triples. In addition, in order to report the compression efficiency of the discrete representations, we also report the compression ratio:

CR = Storage(continuous) / Storage(discrete)

Here, Storage(continuous) is the storage used by the full continuous KG representations and Storage(discrete) is the storage used by the discrete representation learning method at test time. This includes the discrete KG representations as well as the parameters of the reverse-discretization function (i.e. codebook or Bi-LSTM parameters).

Tables 3, 4, 5 and 6 show our results on the link prediction task on the four datasets. In Table 3, we compare various continuous representations with the four discrete representation learning techniques described in this paper. We find that the discrete representations sustain only minor losses in performance (and are sometimes actually better than their continuous counterparts) in terms of both evaluation metrics, MRR and H@10, while obtaining significant embedding compression (42x-585x).
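The ranking metrics used above can be computed in a few lines. The following sketch implements the filtered-ranking protocol over toy candidate scores (the scores are hypothetical; higher means a better candidate):

```python
def filtered_rank(scores, true_idx, known_true):
    # Rank of the true entity after filtering out other known-true candidates,
    # as in the filtered setting of Bordes et al. (2013).
    s_true = scores[true_idx]
    return 1 + sum(1 for i, s in enumerate(scores)
                   if i != true_idx and i not in known_true and s > s_true)

def mrr_and_hits(ranks, k=10):
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Two toy test triples with candidate scores over four entities each.
ranks = [
    filtered_rank([0.9, 0.8, 0.7, 0.6], true_idx=1, known_true={0}),   # -> 1
    filtered_rank([0.9, 0.8, 0.7, 0.6], true_idx=3, known_true=set()), # -> 4
]
mrr, hits10 = mrr_and_hits(ranks)   # -> (0.625, 1.0)
```

In the first toy triple, the higher-scoring candidate at index 0 is a known true triple and is filtered out, so the true entity is ranked first; without filtering it would be ranked second.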
Table 3 also compares the different discrete representation learning approaches. We observe that TS-NL, which uses the tempering softmax and non-linear reconstruction, performs the best in most settings. This observation also holds on the other three datasets. Hence, in Tables 4, 5 and 6, we only compare TS-NL with the continuous representations. We again observe that TS-NL compresses the KG embeddings (71x-952x) while suffering only a minor loss in performance.
Logical Inference with Discrete Representations: KG embeddings give us a way to perform logical inference and reason about knowledge. In this experiment, we explore whether discrete representations retain the ability to perform inference and reasoning in KGs. We evaluate our models on the Countries dataset (Bouchard et al., 2015), which was designed to test the logical inference capabilities of KG embedding models. We use the same evaluation protocol as in the Nickel et al. (2016) experiments.

Table 6: Results of several models and our proposed discrete counterpart (TS-NL) evaluated on the WN18RR dataset.

The Countries dataset contains 2 relations and 272 entities (244 countries, 5 regions and 23 subregions). Three tasks are posed, each requiring longer and harder inference than the previous one:
1. Task S1 poses queries of the form locatedIn(c, ?), where the answer is one of the five regions.
2. Task S2 poses queries requiring inference of the form neighborOf(c1, c2) ∧ locatedIn(c2, r) =⇒ locatedIn(c1, r).
3. Task S3 poses queries requiring inference of the form neighborOf(c1, c2) ∧ locatedIn(c2, s) ∧ locatedIn(s, r) =⇒ locatedIn(c1, r).
We use the AUC-PR metric, which was also used in previous works (Bouchard et al., 2015; Nickel et al., 2016). Table 7 shows our results. We find that TS-NL is a very good KG representation for KG inference. In fact, TS-NL outperforms many of the continuous counterparts.

Table 7: Results of several continuous representations (ct.) and discrete TS-NL (dis.) evaluated on the three tasks (S1, S2 and S3) of logical inference on the Countries dataset.

Additional Inference Cost: A tradeoff of learning discrete KG representations is that inference time increases, since we need to decompress the discrete representation of every entity into a continuous embedding before use, by looking up the codebook or running inference on the LSTM reconstruction model. In practice, we found this additional inference cost to be very small. For example, the additional inference cost of running TransE on the entire FB15k test set was ≈ 1 minute for the codebook lookup approach and ≈ 2.5 minutes for the non-linear reconstruction approach on our single 2080Ti system. The additional inference cost for the other continuous KG representations was similarly low.
Varying K and D: There is an evident tradeoff between the extent of compression (which is dictated by the choice of K and D) and model performance.
In order to explore this tradeoff, we plot heatmaps of performance (MRR) and compression ratio (CR) on the FB15k test set as we vary K and D for TransE in Figure 1. Not surprisingly, performance drops as compression increases. Such heatmaps allow an end user to pick K and D depending on their tolerance for loss in performance.
Dependence on guidance from continuous embeddings: We evaluate the contribution of the guidance from continuous embeddings in learning discrete KG representations. Figure 2 compares the test MRR for TS-NL as training proceeds on the FB15k dataset with and without guidance from the continuous representation (TransE). We observe that learning in the unguided model is much slower than in the guided model; however, the unguided model achieves almost the same performance in the end. Thus, we conclude that while guidance helps us achieve faster and more stable convergence, it is not necessary for learning discrete representations.
Quality of the Discrete Representations: We also assess the quality of the learnt discrete entity representations directly as features for the link prediction task. In this case, we retain only the discrete entity representations learnt by TS-NL and learn a new LSTM based non-linear reverse-discretization function on the validation set. Then, we obtain the link prediction performance on the test set as before (see Table 8 for transfer results on the FB15k dataset). We observe that the performance of this "transfer" model is close to that of the original model, which used a pre-trained reverse-discretization model (compare Table 8 with the shaded part of Table 3). Note that, in the "transfer" setting, we can achieve much higher compression as we do not even need to store the reverse-discretization model.

Table 9: A sample of learned discrete codes and the entities assigned to them.
WN 3-7-0-6-X: animalize, work animal, farm animal, animal husbandry, offspring, animal, invertebrate, marine animal, animal kingdom, predator
WN 5-3-0-X-1: jabalpur, calcutta, bombay, hyderabad, chennai, lucknow, mysore
FB 2-X-7-4-1: novelist, dramatist, actor, writer, cartoonist, poet, songwriter, musician
FB 2-5-X-4-1: albert einstein, voltaire, isaac newton, nikola tesla

Interpretability of discrete representations:
The discrete codes provide us with additional interpretability which continuous representations can lack. In Table 9, we show a sample of learned codes for the two datasets. We observe that semantically similar entities are assigned to close-by codes.

Related Work
Deep learning model compression has attracted many research efforts in the last few years (Han et al., 2015). These efforts include network pruning (Reed, 1993; Castellano et al., 1997), weight sharing (Ullrich et al., 2017), quantization (Lin et al., 2016), low-precision computation (Hwang and Sung, 2014; Courbariaux et al., 2015) and knowledge distillation (Hinton et al., 2015). These techniques can also be used for embedding compression. Press and Wolf (2016) and Inan et al. (2016) propose weight-tying approaches that learn input and output representations jointly. Matrix factorization-based methods (Acharya et al., 2019; Shu and Nakayama, 2017) have also been proposed, which approximate an embedding matrix with smaller matrices or clusters. Closest to our work are Shu and Nakayama (2017), Chen et al. (2018) and Chen and Sun (2019), who present similar approaches to learn discrete codings for word embeddings using multiple codebooks, i.e. product quantization (Jegou et al., 2010). Similar techniques have been used by van den Oord et al. (2017), who extend VAEs to learn discrete representations using vector quantization in the image domain; this allows the VAE model to circumvent its well-known issue of "posterior collapse". All of these previous works have been applied in the image domain, and sometimes in language to learn discrete word embeddings. In this work, we present the first results on compressing KG embeddings, and also show how the compressed embeddings can be used to support various knowledge-based applications such as KG inference.

Conclusion
The embedding layer contains the majority of the parameters in any representation learning approach on knowledge graphs. This is a barrier to the successful deployment of models using knowledge graphs at scale on user-facing computing devices. In this work, we proposed novel and general approaches for KG embedding compression. Our approaches learn to represent entities in a KG as vectors of discrete codes in an end-to-end fashion. At test time, the discrete KG representation can be cheaply and efficiently converted to a dense embedding and then used in any downstream application requiring the use of a knowledge graph. We evaluated our proposed methods on link prediction and KG inference tasks and showed that they can effectively compress the KG embedding table without any significant loss in performance. In this work, we only considered the problem of learning discrete entity representations. In the future, we would like to jointly learn discrete representations of entities as well as relations.