On the Equivalence of Holographic and Complex Embeddings for Link Prediction

We show the equivalence of two state-of-the-art models for link prediction/knowledge graph completion: Nickel et al.'s holographic embeddings and Trouillon et al.'s complex embeddings. We first consider a spectral version of the holographic embeddings, exploiting the frequency domain of the Fourier transform for efficient computation. The analysis of the resulting model reveals that it can be viewed as an instance of the complex embeddings with a certain constraint imposed on the initial vectors upon training. Conversely, any set of complex embeddings can be converted to a set of equivalent holographic embeddings.


Introduction
Recently, there have been efforts to build and maintain large-scale knowledge bases represented in the form of a graph (knowledge graph) (Auer et al., 2007; Bollacker et al., 2008; Suchanek et al., 2007). Although these knowledge graphs contain billions of relational facts, they are known to be incomplete. Knowledge graph completion (KGC) (Nickel et al., 2015) aims at augmenting missing knowledge in an incomplete knowledge graph automatically. It can be viewed as a task of link prediction (Liben-Nowell and Kleinberg, 2003; Hasan and Zaki, 2011) studied in the field of statistical relational learning (Getoor and Taskar, 2007). In recent years, methods based on vector embeddings of graphs have been actively pursued as a scalable approach to KGC (Bordes et al., 2011; Socher et al., 2013; Guu et al., 2015; Yang et al., 2015; Nickel et al., 2016; Trouillon et al., 2016b).
In this paper, we investigate the connection between two models of graph embeddings that have emerged along this line of research: The holographic embeddings (Nickel et al., 2016) and the complex embeddings (Trouillon et al., 2016b). These models are simple yet achieve the current state-of-the-art performance in KGC.
We begin by showing that holographic embeddings can be trained entirely in the frequency domain induced by the Fourier transform, thereby reducing the time needed to compute the scoring function from O(n log n) to O(n), where n is the dimension of the embeddings.
The analysis of the resulting training method reveals that the Fourier transform of holographic embeddings can be regarded as an instance of the complex embeddings, with a specific constraint (viz., the conjugate symmetry property) imposed on the initial values.
Conversely, we also show that every set of complex embeddings has a set of holographic embeddings (with real vectors) that is equivalent, in the sense that their scoring functions are equal up to scaling.

Preliminaries
Let i denote the imaginary unit, R the set of real numbers, and C the set of complex numbers. We write [v]_j to denote the jth component of vector v. A superscript T (e.g., v^T) represents vector/matrix transpose. For a complex scalar z, vector z, and matrix Z, their complex conjugates are written \bar{z}, \bar{z}, and \bar{Z}, and Re(z), Re(z), and Re(Z) denote their real parts, respectively. Let x = [x_0 · · · x_{n-1}]^T ∈ R^n and y = [y_0 · · · y_{n-1}]^T ∈ R^n. Note that the vector indices start from 0 for notational convenience. The circular convolution of x and y, denoted by x * y, is defined by

[x * y]_k = \sum_{j=0}^{n-1} x_j y_{(k-j) mod n},   (1)

where mod denotes the modulus. Likewise, the circular correlation x ⋆ y is defined by

[x ⋆ y]_k = \sum_{j=0}^{n-1} x_j y_{(k+j) mod n}.   (2)

While circular convolution is commutative, circular correlation is not; i.e., x * y = y * x, but x ⋆ y ≠ y ⋆ x in general. As can be verified with Eqs. (1) and (2), x ⋆ y = flip(x) * y, where flip(x) = [x_{n-1} · · · x_0]^T is the vector obtained by arranging the components of x in reverse order.
For n-dimensional vectors, naively computing circular convolution/correlation by Eqs. (1) and (2) requires O(n^2) multiplications. However, we can take advantage of the Fast Fourier Transform (FFT) algorithm to accelerate the computation: For circular convolution, first compute the discrete Fourier transform (DFT) of x and y, and then compute the inverse DFT of their elementwise product, i.e.,

x * y = F^{-1}(F(x) ⊙ F(y)),

where F : R^n → C^n and F^{-1} : C^n → R^n respectively denote the DFT and the inverse DFT, and ⊙ denotes the elementwise product. The DFT and inverse DFT can be computed in O(n log n) time with the FFT algorithm, and thus the computation time for circular convolution is also O(n log n). The same can be said of circular correlation. Since F(flip(x)) = \bar{F(x)}, we have

x ⋆ y = F^{-1}(\bar{F(x)} ⊙ F(y)).   (3)

By analogy to how the Fourier transform is used in signal processing, the original real space R^n is called the "time" domain, and the complex space C^n where DFT vectors reside is called the "frequency" domain.
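As a concrete check of the identity above, the following NumPy sketch (our own illustration, not code from the paper) compares the naive O(n^2) definition of circular correlation with the FFT-based computation:

```python
import numpy as np

def circular_correlation(x, y):
    """Naive circular correlation, Eq. (2): [x ⋆ y]_k = sum_j x_j y_{(k+j) mod n}."""
    n = len(x)
    return np.array([sum(x[j] * y[(k + j) % n] for j in range(n))
                     for k in range(n)])

def circular_correlation_fft(x, y):
    """The same quantity via Eq. (3): x ⋆ y = F^{-1}(conj(F(x)) ⊙ F(y))."""
    return np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y)).real

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)
# Both computations agree up to floating-point rounding.
assert np.allclose(circular_correlation(x, y), circular_correlation_fft(x, y))
```

The FFT route costs O(n log n) instead of O(n^2), which is what makes HolE practical for large n.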
3 Holographic embeddings for knowledge graph completion

Knowledge graph completion
Let E and R be the finite sets of entities and (binary) relations over entities, respectively. For each relation r ∈ R and each pair s, o ∈ E of entities, we are interested in whether r(s, o) holds or not; we write r(s, o) = +1 if it holds, and r(s, o) = -1 if not. To be precise, we are given a training set D consisting of quadruples (r, s, o, y) with y = r(s, o) ∈ {+1, -1}, and the task is to predict the truth values of the triples not observed in D. Dataset D can be regarded as a directed graph in which nodes represent entities E and edges are labeled by relations R. Thus, the task is essentially that of link prediction (Liben-Nowell and Kleinberg, 2003; Hasan and Zaki, 2011). Often, it is also called knowledge graph completion.

Nickel et al. (2016) proposed holographic embeddings (HolE) for knowledge graph completion. Using training data D, this method learns the vector embeddings e_k ∈ R^n of entities k ∈ E and the embeddings w_r ∈ R^n of relations r ∈ R. The score for a triple (r, s, o) is then given by

f_HolE(r, s, o) = w_r · (e_s ⋆ e_o).   (4)

[Table 1: Correspondence between operations in time and frequency domains. r ↔ ρ indicates ρ = F(r) (and also r = F^{-1}(ρ)). Columns: operation, time, frequency.]
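In NumPy-like code (a sketch of Eq. (4), not the authors' implementation), the HolE score can be written as:

```python
import numpy as np

def hole_score(w_r, e_s, e_o):
    """HolE score of Eq. (4): f = w_r · (e_s ⋆ e_o), with the circular
    correlation computed through the FFT identity of Eq. (3)."""
    corr = np.fft.ifft(np.conj(np.fft.fft(e_s)) * np.fft.fft(e_o)).real
    return float(w_r @ corr)

rng = np.random.default_rng(0)
w_r, e_s, e_o = (rng.standard_normal(16) for _ in range(3))
score = hole_score(w_r, e_s, e_o)  # a real-valued plausibility score
```

In training, this raw score is fed into a logistic loss (Eq. (7) below), so no squashing function appears here.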

Spectral training of HolE
To compute the circular correlation in the scoring function of Eq. (4) efficiently, Nickel et al. (2016) used Eq. (3) in Section 2 and the FFT. In this section, we take this technique further and consider training HolE solely in the frequency domain. That is, the real-valued embeddings e_k, w_r ∈ R^n in the original "time" domain are abolished, and instead we train their DFT counterparts ε_k = F(e_k) ∈ C^n and ω_r = F(w_r) ∈ C^n in the frequency domain. This formulation eliminates the need for the FFT and inverse FFT, which are the major computational bottleneck in HolE. As a result, Eq. (4) can be computed in O(n) time directly from ε_k and ω_r. Indeed, equivalent counterparts in the frequency domain exist not only for convolution/correlation but for all other computations needed for HolE: scalar multiplication, summation (needed when vectors are updated by stochastic gradient descent), and dot product (used in Eq. (4)). The frequency-domain equivalents of these operations are summarized in Table 1. All of them can be performed efficiently (in linear time) in the frequency domain.
In particular, the following relation holds for the dot product between any "time" vectors x, y ∈ R^n:

x · y = (1/n) (F(x) · F(y)),   (5)

where the dot product on the right-hand side is the complex inner product defined by a · b = \bar{a}^T b. Eq. (5) is known as Parseval's theorem (also called the power theorem in (Smith, 2007)), and it states that dot products in the two domains are equal up to scaling. After embeddings ε_k, ω_r ∈ C^n are learned in the frequency domain, their time-domain counterparts e_k = F^{-1}(ε_k) and w_r = F^{-1}(ω_r) can be recovered if needed, but this is not required as far as computation of the scoring function is concerned. Thanks to Parseval's theorem, Eq. (4) can be computed directly from the frequency vectors ε_k, ω_r ∈ C^n by

f_HolE(r, s, o) = (1/n) (ω_r · (\bar{ε}_s ⊙ ε_o)).   (6)
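To illustrate, here is a small NumPy sketch (our own, using the unnormalized DFT convention of `np.fft`) checking that the frequency-domain score of Eq. (6) agrees with the time-domain score of Eq. (4):

```python
import numpy as np

def hole_score_time(w_r, e_s, e_o):
    # Eq. (4): computed in the time domain; every call needs FFTs.
    corr = np.fft.ifft(np.conj(np.fft.fft(e_s)) * np.fft.fft(e_o)).real
    return w_r @ corr

def hole_score_freq(omega_r, eps_s, eps_o):
    # Eq. (6): O(n) elementwise work once embeddings live in the frequency domain.
    n = len(omega_r)
    return (np.conj(omega_r) @ (np.conj(eps_s) * eps_o)).real / n

rng = np.random.default_rng(1)
w_r, e_s, e_o = (rng.standard_normal(8) for _ in range(3))
f_t = hole_score_time(w_r, e_s, e_o)
f_f = hole_score_freq(np.fft.fft(w_r), np.fft.fft(e_s), np.fft.fft(e_o))
assert np.allclose(f_t, f_f)  # Parseval's theorem, Eq. (5)
```

Note that the 1/n factor depends on the DFT normalization convention; with a unitary DFT it would disappear.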
The DFT F(x) is conjugate symmetric if and only if x is a real vector. Thus, maintaining the conjugate symmetry of "frequency" vectors is the key to ensuring that their "time" counterparts remain in real space. Below, we verify that this property is indeed preserved under stochastic gradient descent. Moreover, conjugate symmetry provides a sufficient condition under which the dot product takes a real value. It also has implications for space requirements. These topics are covered in the rest of this section.

Vector initialization and update in frequency domain
Typically, at the beginning of training HolE, each individual embedding is initialized to a random vector. When we train HolE in the frequency domain, we could first generate a random real vector, regard it as a HolE vector in the time domain, and compute its DFT to obtain the initial value in the frequency domain. An alternative, easier approach is to directly generate a random complex vector that is conjugate symmetric, and use it as the initial frequency vector. This guarantees that its inverse DFT is a real vector, i.e., there exists a valid corresponding image in the time domain.

Given a training set D (see Section 3.1), HolE is trained by minimizing the following objective function over the parameter matrix Θ = [e_1 · · · e_{|E|} w_1 · · · w_{|R|}] ∈ R^{n×(|E|+|R|)}:

\sum_{(r,s,o,y)∈D} log{1 + exp(-y f_HolE(r, s, o))} + λ ||Θ||_F^2,   (7)

where λ ∈ R is a hyperparameter controlling the degree of regularization, and ||·||_F denotes the Frobenius norm. In our spectral training of HolE, the parameter matrix consists of the frequency vectors ε_k and ω_r instead of e_k and w_r, i.e., Θ = [ε_1 · · · ε_{|E|} ω_1 · · · ω_{|R|}] ∈ C^{n×(|E|+|R|)}. Let us discuss the stochastic gradient descent (SGD) update with respect to these frequency vectors. In particular, we are interested in whether the conjugate symmetry of the vectors is preserved by the update.
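The direct initialization can be sketched as follows (a NumPy illustration of our own; the helper name is hypothetical):

```python
import numpy as np

def random_conjugate_symmetric(n, rng):
    """Draw a random vector in C^n satisfying xi[(n - k) % n] == conj(xi[k]).

    Conjugate symmetry forces xi[0] (and xi[n/2] for even n) to be real;
    the remaining bins come in conjugate pairs.
    """
    xi = np.zeros(n, dtype=complex)
    xi[0] = rng.standard_normal()                  # zero-frequency bin: real
    half = (n - 1) // 2
    xi[1:half + 1] = rng.standard_normal(half) + 1j * rng.standard_normal(half)
    if n % 2 == 0:
        xi[n // 2] = rng.standard_normal()         # Nyquist bin: real for even n
    xi[n - half:] = np.conj(xi[1:half + 1][::-1])  # mirror the conjugates
    return xi

xi = random_conjugate_symmetric(8, np.random.default_rng(0))
# Its inverse DFT is real, so a valid time-domain HolE vector exists.
assert np.allclose(np.fft.ifft(xi).imag, 0.0)
```

An equally valid alternative, as the text notes, is `np.fft.fft(rng.standard_normal(n))`, which is conjugate symmetric by construction.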
Suppose vectors ω_r, ε_s, ε_o are conjugate symmetric. Neglecting the contribution from the regularization term in Eq. (7), we see that in an SGD update step, α ∂f_HolE/∂ω_r, α ∂f_HolE/∂ε_s, and α ∂f_HolE/∂ε_o are respectively subtracted from ω_r, ε_s, ε_o, where α ∈ R is a factor not depending on these parameters. Noting the equalities w_r · (e_s ⋆ e_o) = e_s · (w_r ⋆ e_o) = e_o · (w_r * e_s) (see (Nickel et al., 2016, Eq. (12), p. 1958)) and their frequency counterparts obtained through the translation of Table 1, we can derive

∂f_HolE/∂ω_r = (1/n) (\bar{ε}_s ⊙ ε_o),
∂f_HolE/∂ε_s = (1/n) (\bar{ω}_r ⊙ ε_o),
∂f_HolE/∂ε_o = (1/n) (ω_r ⊙ ε_s).

As seen from above, conjugation, scalar multiplication, summation, and elementwise product are used in the SGD update, and it is straightforward to verify that all these operations preserve conjugate symmetry. It follows that if ω_r, ε_s, ε_o are initially conjugate symmetric, they remain so throughout training, which assures that the inverse DFTs of the learned embeddings are real vectors.
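This preservation property can be checked numerically; the sketch below (our own NumPy illustration, with gradient expressions implied by the Table 1 translation) applies one SGD-style step and tests the symmetry:

```python
import numpy as np

def is_conj_symmetric(v):
    """Check v[(m - k) % m] == conj(v[k]) for all k."""
    m = len(v)
    return np.allclose(v, np.conj(v[(-np.arange(m)) % m]))

rng = np.random.default_rng(2)
n = 8
# DFTs of real vectors are conjugate symmetric by construction.
omega_r, eps_s, eps_o = (np.fft.fft(rng.standard_normal(n)) for _ in range(3))

# Gradients of the frequency-domain score with respect to each embedding:
# only conjugation and elementwise products appear, so symmetry survives.
grad_omega = np.conj(eps_s) * eps_o / n
grad_eps_s = np.conj(omega_r) * eps_o / n
grad_eps_o = omega_r * eps_s / n

alpha = 0.1  # a real step-size factor
for v, g in [(omega_r, grad_omega), (eps_s, grad_eps_s), (eps_o, grad_eps_o)]:
    assert is_conj_symmetric(v - alpha * g)
```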

Real-valued dot product
In the scoring function of HolE (Eq. (4)), the dot product is used to produce a real-valued "score" from two vectors, w_r and e_s ⋆ e_o. Likewise, in Eq. (6), the dot product is applied to ω_r and \bar{ε}_s ⊙ ε_o, which are complex-valued. However, provided that the conjugate symmetry of these vectors is maintained, their dot product is always real. This follows from Parseval's theorem: the inverse DFTs of these frequency vectors are real, and thus their dot product is also real; therefore, by Eq. (5), the dot product of the corresponding frequency vectors is real as well.

Space requirement
A general complex vector ξ ∈ C^n can be stored in memory as 2n floating-point numbers, i.e., one each for the real and imaginary parts of each component. In our spectral representation of HolE, however, the first ⌊n/2⌋ + 1 components suffice to specify the frequency vector ξ, since the vector is conjugate symmetric. Moreover, ξ_0 (and ξ_{n/2} if n is even) is a real value. Thus, a spectral representation of HolE can be specified with exactly n floating-point numbers, which can be stored in the same amount of memory as needed by the original HolE.

Relation to the complex embeddings (ComplEx)

Trouillon et al. (2016b) proposed the complex embeddings (ComplEx) for knowledge graph completion. Like HolE, ComplEx learns vector embeddings e_k and w_r of entities and relations. In their model, however, these vectors are complex-valued, and are based on the decomposition of the complex matrix X_r = E W_r \bar{E}^T that encodes relation r ∈ R over pairs of entities, where X_r ∈ C^{|E|×|E|}, E = [e_1, . . . , e_{|E|}]^T ∈ C^{|E|×n}, and W_r = diag(w_r) ∈ C^{n×n} is a diagonal matrix (with diagonal elements w_r ∈ C^n). In practice, X_r needs to be a real matrix, because its (s, o)-component must define the score for r(s, o). To this end, Trouillon et al. simply extracted the real part; i.e., X_r = Re(E W_r \bar{E}^T). Trouillon et al. (2016a) advocated this approach, showing that any real matrix X_r can be expressed in this form. With this formulation, the score for a triple (r, s, o) is given by

f_ComplEx(r, s, o) = Re(\sum_{j=0}^{n-1} [w_r]_j [e_s]_j [\bar{e}_o]_j).   (8)
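In code, the ComplEx score is a single elementwise product and sum (a NumPy sketch of Eq. (8), not the authors' implementation):

```python
import numpy as np

def complex_score(w_r, e_s, e_o):
    """ComplEx score of Eq. (8): Re(sum_j w_rj * e_sj * conj(e_oj))."""
    return float(np.sum(w_r * e_s * np.conj(e_o)).real)

rng = np.random.default_rng(3)
n = 4
w_r, e_s, e_o = (rng.standard_normal(n) + 1j * rng.standard_normal(n)
                 for _ in range(3))
score = complex_score(w_r, e_s, e_o)
# Swapping the subject and object does not in general give the same score,
# which is what lets ComplEx model asymmetric relations.
```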

Equivalence of holographic and complex embeddings
Now let us rewrite Eq. (8). Noting the definition of the complex dot product, i.e., a · b = \bar{a}^T b, we have

f_ComplEx(r, s, o) = Re(\bar{w}_r · (e_s ⊙ \bar{e}_o)),

and since Re(z) = Re(\bar{z}),

Re(\bar{w}_r · (e_s ⊙ \bar{e}_o)) = Re(w_r · (\bar{e}_s ⊙ e_o)).
Thus, Eq. (8) can be written as

f_ComplEx(r, s, o) = Re(w_r · (\bar{e}_s ⊙ e_o)).   (9)

Here, a marked similarity is noticeable between Eq. (9) and Eq. (6), the scoring function of our spectral version of HolE (spectral HolE): ComplEx extracts the real part of a complex dot product, whereas in the spectral HolE, the dot product is guaranteed to be real because all embeddings satisfy conjugate symmetry. Indeed, Eq. (6) can equally be written as

f_HolE(r, s, o) = (1/n) Re(ω_r · (\bar{ε}_s ⊙ ε_o)),   (10)

although the operator Re(·) in this formula is redundant, since the inner product is guaranteed to be real-valued. Nevertheless, Eq. (10) elucidates the fact that the spectral HolE can be viewed as an instance of ComplEx, with the embeddings constrained to be conjugate symmetric so that the inner product in Eq. (10) is real-valued.

Conversely, given a set of complex embeddings for entities and relations, we can construct equivalent holographic embeddings, in the sense that f_ComplEx(r, s, o) = c f_HolE(r, s, o) for every r, s, o, where c > 0 is a constant. For each n-dimensional complex embedding x ∈ {e_k}_{k∈E} ∪ {w_r}_{r∈R} ⊂ C^n computed by ComplEx, we make a corresponding holographic embedding h(x) ∈ R^{2n+1} as follows: For a given complex embedding x = [x_0 · · · x_{n-1}]^T ∈ C^n, first compute s(x) ∈ C^{2n+1} by

s(x) = [0 x_0 · · · x_{n-1} \bar{x}_{n-1} · · · \bar{x}_0]^T = [0 x^T \bar{flip(x)}^T]^T,   (11)

and then define h(x) = F^{-1}(s(x)). Since s(x) is conjugate symmetric, h(x) is a real vector.
To verify that this conversion defines an equivalent scoring function for every triple (r, s, o), suppose complex embeddings w_r ∈ C^n and e_s, e_o ∈ C^n are given. Since we regard the real vectors h(w_r), h(e_s), h(e_o) ∈ R^{2n+1} as the holographic embeddings of r, s, and o, respectively, the HolE score for the triple (r, s, o) is given by

f_HolE(r, s, o) = h(w_r) · (h(e_s) ⋆ h(e_o))
  = (1/(2n+1)) (s(w_r) · (\bar{s(e_s)} ⊙ s(e_o)))   (by Eqs. (3) and (5))
  = (2/(2n+1)) Re(w_r · (\bar{e}_s ⊙ e_o))
  = (2/(2n+1)) f_ComplEx(r, s, o),

where the third equality holds because the zeroth component of the summand vanishes and the last n components are the conjugates of the first n. This shows that h(·) (or s(·)) gives the desired conversion from ComplEx to HolE, with c = (2n+1)/2.
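The conversion can also be checked numerically. The following sketch (our own NumPy illustration; the helper names are hypothetical) builds h(x) via Eq. (11) and verifies the equivalence with scaling constant c = (2n+1)/2:

```python
import numpy as np

def s_map(x):
    """Eq. (11): prepend 0 and append the reversed conjugate, yielding a
    conjugate-symmetric vector in C^(2n+1)."""
    return np.concatenate(([0.0], x, np.conj(x[::-1])))

def h_map(x):
    # Real up to rounding, since s_map(x) is conjugate symmetric.
    return np.fft.ifft(s_map(x)).real

def complex_score(w_r, e_s, e_o):
    return np.sum(w_r * e_s * np.conj(e_o)).real           # Eq. (8)

def hole_score(w_r, e_s, e_o):
    corr = np.fft.ifft(np.conj(np.fft.fft(e_s)) * np.fft.fft(e_o)).real
    return w_r @ corr                                      # Eq. (4)

rng = np.random.default_rng(4)
n = 4
w, es, eo = (rng.standard_normal(n) + 1j * rng.standard_normal(n)
             for _ in range(3))
c = (2 * n + 1) / 2
assert np.allclose(complex_score(w, es, eo),
                   c * hole_score(h_map(w), h_map(es), h_map(eo)))
```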

Conclusion
In this paper, we have shown that the holographic embeddings (HolE) can be trained entirely in the frequency domain. If stochastic gradient descent is used for training, the conjugate symmetry of the frequency vectors is preserved, which ensures the existence of the corresponding holographic embeddings in the original real space (time domain). Also, this training method eliminates the need for the FFT and inverse FFT, thereby reducing the computation time of the scoring function from O(n log n) to O(n). Moreover, we have established the equivalence of HolE and the complex embeddings (ComplEx): The spectral version of HolE is subsumed by ComplEx as a special case in which conjugate symmetry is imposed on the embeddings. Conversely, every set of complex embeddings can be converted to equivalent holographic embeddings.
Many systems for natural language processing, such as those for semantic parsing and question answering, benefit from access to information stored in knowledge graphs. We plan to further investigate the properties of the spectral HolE and ComplEx in these applications.