Semantically Smooth Knowledge Graph Embedding

This paper considers the problem of embedding Knowledge Graphs (KGs) consisting of entities and relations into low-dimensional vector spaces. Most of the existing methods perform this task based solely on observed facts. The only requirement is that the learned embeddings should be compatible within each individual fact. In this paper, aiming at further discovering the intrinsic geometric structure of the embedding space, we propose Semantically Smooth Embedding (SSE). The key idea of SSE is to take full advantage of additional semantic information and enforce the embedding space to be semantically smooth, i.e., entities belonging to the same semantic category will lie close to each other in the embedding space. Two manifold learning algorithms, Laplacian Eigenmaps and Locally Linear Embedding, are used to model the smoothness assumption. Both are formulated as geometrically based regularization terms to constrain the embedding task. We empirically evaluate SSE in two benchmark tasks of link prediction and triple classification, and achieve significant and consistent improvements over state-of-the-art methods. Furthermore, SSE is a general framework. The smoothness assumption can be imposed on a wide variety of embedding models, and it can also be constructed using other information besides entities' semantic categories.


Introduction
Knowledge Graphs (KGs) like WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), and DBpedia (Lehmann et al., 2014) have become extremely useful resources for many NLP related applications, such as word sense disambiguation (Agirre et al., 2014), named entity recognition (Magnini et al., 2002), and information extraction (Hoffmann et al., 2011). A KG is a multi-relational directed graph composed of entities as nodes and relations as edges. Each edge is represented as a triple of fact ⟨e_i, r_k, e_j⟩, indicating that head entity e_i and tail entity e_j are connected by relation r_k. Although powerful in representing structured data, the underlying symbolic nature makes KGs hard to manipulate.
Recently a new research direction called knowledge graph embedding has attracted much attention (Socher et al., 2013; Bordes et al., 2014; Lin et al., 2015). It attempts to embed components of a KG into continuous vector spaces, so as to simplify the manipulation while preserving the inherent structure of the original graph. Specifically, given a KG, entities and relations are first represented in a low-dimensional vector space, and for each triple, a scoring function is defined to measure its plausibility in that space. Then the representations of entities and relations (i.e. embeddings) are learned by maximizing the total plausibility of observed triples. The learned embeddings can further be used to benefit all kinds of tasks, such as KG completion (Socher et al., 2013), relation extraction (Riedel et al., 2013), and entity resolution (Bordes et al., 2014).
To our knowledge, most existing KG embedding methods perform the embedding task based solely on observed facts. The only requirement is that the learned embeddings should be compatible within each individual fact. In this paper we propose Semantically Smooth Embedding (SSE), a new approach which further imposes constraints on the geometric structure of the embedding space. The key idea of SSE is to make full use of additional semantic information (i.e. semantic categories of entities) and enforce the embedding space to be semantically smooth: entities belonging to the same semantic category should lie close to each other in the embedding space. This smoothness assumption is closely related to the local invariance assumption exploited in manifold learning theory, which requires nearby points to have similar embeddings or labels (Belkin and Niyogi, 2001). Thus we employ two manifold learning algorithms, Laplacian Eigenmaps (Belkin and Niyogi, 2001) and Locally Linear Embedding (Roweis and Saul, 2000), to model the smoothness assumption. The former requires an entity to lie close to every other entity in the same category, while the latter represents that entity as a linear combination of its nearest neighbors (i.e. entities within the same category). Both are formulated as manifold regularization terms to constrain the KG embedding objective function. As such, SSE obtains an embedding space which is semantically smooth and at the same time compatible with the observed facts.
The advantages of SSE are two-fold: 1) By imposing the smoothness assumption, SSE successfully captures the semantic correlation between entities, which exists intrinsically but is overlooked in previous work on KG embedding. 2) KGs are typically very sparse, containing a relatively small number of facts compared to the large number of entities and relations. SSE can effectively deal with data sparsity by leveraging additional semantic information. Both aspects lead to more accurate embeddings in SSE. Moreover, our approach is quite general. The smoothness assumption can actually be imposed on a wide variety of KG embedding models. Besides semantic categories, other information (e.g. entity similarities specified by users or derived from auxiliary data sources) can also be used to construct the manifold regularization terms. And besides KG embedding, similar smoothness assumptions can also be applied in other embedding tasks (e.g. word embedding and sentence embedding).
Our main contributions can be summarized as follows. First, we devise a novel KG embedding framework that naturally requires the embedding space to be semantically smooth. As far as we know, this is the first work that imposes constraints on the geometric structure of the embedding space during KG embedding. By leveraging additional semantic information, our approach can also deal with the data sparsity issue that commonly exists in typical KGs. Second, we evaluate our approach in two benchmark tasks of link prediction and triple classification, and achieve significant and consistent improvements over state-of-the-art models.
In the remainder of this paper, we first provide a brief review of existing KG embedding models in Section 2, and then detail the proposed SSE framework in Section 3. Experiments and results are reported in Section 4. Then in Section 5 we discuss related work, followed by the conclusion and future work in Section 6.

A Brief Review of KG Embedding
KG embedding aims to embed entities and relations into a continuous vector space and model the plausibility of each fact in that space. In general, it consists of three steps: 1) representing entities and relations, 2) specifying a scoring function, and 3) learning the latent representations. In the first step, given a KG, entities are represented as points (i.e. vectors) in a continuous vector space, and relations as operators in that space, which can be characterized by vectors (Bordes et al., 2014; Wang et al., 2014b), matrices (Bordes et al., 2011; Jenatton et al., 2012), or tensors (Socher et al., 2013). In the second step, for each candidate fact ⟨e_i, r_k, e_j⟩, an energy function f(e_i, r_k, e_j) is further defined to measure its plausibility, with the corresponding entity and relation representations as variables. Plausible triples are assumed to have low energies. Then in the third step, to obtain the entity and relation representations, a margin-based ranking loss

$$\mathcal{L} = \sum_{t^+ \in O} \sum_{t^- \in N_{t^+}} \left[ \gamma + f(t^+) - f(t^-) \right]_+ \qquad (1)$$

is minimized. Here, O is the set of observed (i.e. positive) triples, and t^+ = ⟨e_i, r_k, e_j⟩ ∈ O; N_{t^+} denotes the set of negative triples constructed by replacing entities in t^+, and t^- = ⟨e'_i, r_k, e'_j⟩ ∈ N_{t^+}; γ > 0 is a margin separating positive and negative triples; and [x]_+ = max(0, x). The ranking loss favors lower energies for positive triples than for negative ones. Stochastic gradient descent (in mini-batch mode) is adopted to solve the minimization problem. For details please refer to the original papers and references therein.
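As a concrete illustration, the following is a minimal sketch (in Python with NumPy) of this margin-based ranking loss under a TransE-style energy. The function names and the uniform negative-sampling scheme are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def f_energy(E, R, head, rel, tail):
    """TransE-style energy: ||e_head + r_rel - e_tail||_1 (lower = more plausible)."""
    return np.abs(E[head] + R[rel] - E[tail]).sum()

def sample_negative(triple, n_entities, rng):
    """Corrupt the head or the tail of a positive triple uniformly at random."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.integers(n_entities), r, t)
    return (h, r, rng.integers(n_entities))

def ranking_loss(E, R, positives, gamma, rng):
    """Sum over positive/negative pairs of [gamma + f(t+) - f(t-)]_+ , as in Eq. (1)."""
    loss = 0.0
    for t_pos in positives:
        t_neg = sample_negative(t_pos, E.shape[0], rng)
        loss += max(0.0, gamma + f_energy(E, R, *t_pos) - f_energy(E, R, *t_neg))
    return loss

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))   # 5 entities, dimension d = 4
R = rng.normal(size=(2, 4))   # 2 relations
print(ranking_loss(E, R, [(0, 1, 2), (3, 0, 4)], gamma=1.0, rng=rng))
```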
Different embedding models differ in the first two steps: entity/relation representation and energy function definition. Three state-of-the-art embedding models, namely TransE (Bordes et al., 2013), SME (Bordes et al., 2014), and SE (Bordes et al., 2011), are detailed below. Please refer to (Jenatton et al., 2012; Socher et al., 2013; Wang et al., 2014b; Lin et al., 2015) for other methods.

[Table 1: entity/relation embeddings and energy functions of TransE, SME (lin), SME (bilin), and SE, as given in the text below.]

TransE (Bordes et al., 2013) represents both entities and relations as vectors in the embedding space. For a given triple ⟨e_i, r_k, e_j⟩, the relation is interpreted as a translation vector r_k so that the embedded entities e_i and e_j can be connected by r_k with low error. The energy function is defined as f(e_i, r_k, e_j) = ∥e_i + r_k − e_j∥_{ℓ1/ℓ2}, where ∥·∥_{ℓ1/ℓ2} denotes the ℓ1-norm or ℓ2-norm. SME (Bordes et al., 2014) also represents entities and relations as vectors, but models triples in a more expressive way. Given a triple ⟨e_i, r_k, e_j⟩, it first employs a function g_u(·,·) to combine r_k and e_i, and g_v(·,·) to combine r_k and e_j. Then, the energy function is defined as matching g_u(·,·) and g_v(·,·) by their dot product, i.e., f(e_i, r_k, e_j) = g_u(r_k, e_i)^T g_v(r_k, e_j). There are two versions of SME, linear and bilinear (denoted as SME (lin) and SME (bilin) respectively), obtained by defining different g_u(·,·) and g_v(·,·).
SE (Bordes et al., 2011) represents entities as vectors but relations as matrices. Each relation is modeled by a left matrix R^u_k and a right matrix R^v_k, acting as independent projections to head and tail entities respectively. If a triple ⟨e_i, r_k, e_j⟩ holds, R^u_k e_i and R^v_k e_j should be close to each other, i.e., the energy f(e_i, r_k, e_j) = ∥R^u_k e_i − R^v_k e_j∥_{ℓ1} should be low. Table 1 summarizes the entity/relation representations and energy functions used in these models.
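For concreteness, the energy functions just described might be implemented as in the following sketch; the SME (lin) parameter names (Wu1, bu, etc.) are illustrative stand-ins for the paper's notation, not its exact parameterization.

```python
import numpy as np

def transe_energy(e_i, r_k, e_j, norm_ord=1):
    """TransE: f = ||e_i + r_k - e_j|| under the l1- or l2-norm."""
    return np.linalg.norm(e_i + r_k - e_j, ord=norm_ord)

def sme_lin_energy(e_i, r_k, e_j, Wu1, Wu2, bu, Wv1, Wv2, bv):
    """SME (lin): dot product of two linear combinations g_u and g_v."""
    g_u = Wu1 @ r_k + Wu2 @ e_i + bu
    g_v = Wv1 @ r_k + Wv2 @ e_j + bv
    return g_u @ g_v

def se_energy(e_i, Ru_k, Rv_k, e_j):
    """SE: l1 distance between the two independently projected entities."""
    return np.linalg.norm(Ru_k @ e_i - Rv_k @ e_j, ord=1)
```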

Semantically Smooth Embedding
The methods introduced above perform the embedding task based solely on observed facts. The only requirement is that the learned embeddings should be compatible within each individual fact. However, they fail to discover the intrinsic geometric structure of the embedding space. To deal with this limitation, we introduce Semantically Smooth Embedding (SSE), which constrains the embedding task by incorporating geometrically based regularization terms, constructed using additional semantic categories of entities.

Problem Formulation
Suppose we are given a KG consisting of n entities and m relations. The observed facts are stored as a set of triples O. A triple ⟨e_i, r_k, e_j⟩ indicates that entity e_i and entity e_j are connected by relation r_k. In addition, the entities are classified into multiple semantic categories. Each entity e is associated with a label c_e indicating the category to which it belongs. SSE aims to embed the entities and relations into a continuous vector space which is compatible with the observed facts, and at the same time semantically smooth.
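A toy instantiation of this setup (all names and data are purely illustrative) might look like:

```python
# The observed triple set O and the entity-to-category labels {c_e}.
O = [("paris", "capital_of", "france"),
     ("berlin", "capital_of", "germany"),
     ("seine", "flows_through", "france")]
categories = {"paris": "city", "berlin": "city",
              "france": "country", "germany": "country",
              "seine": "river"}
```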
To make the embedding space compatible with the observed facts, we make use of the triple set O and follow the same strategy adopted in previous methods. That is, we define an energy function on each candidate triple (e.g. the energy functions listed in Table 1), and require observed triples to have lower energies than unobserved ones (i.e. the margin-based ranking loss defined in Eq. (1)).
To make the embedding space semantically smooth, we further leverage the entity category information {c_e}, and assume that entities within the same semantic category should lie close to each other in the embedding space. This smoothness assumption is similar to the local invariance assumption exploited in manifold learning theory (i.e. nearby points are likely to have similar embeddings or labels). We therefore employ two manifold learning algorithms, Laplacian Eigenmaps (Belkin and Niyogi, 2001) and Locally Linear Embedding (Roweis and Saul, 2000), to model such semantic smoothness, referred to as LE and LLE for short respectively.

Modeling Semantic Smoothness by LE
Laplacian Eigenmaps (LE) is a manifold learning algorithm that preserves local invariance between pairs of data points (Belkin and Niyogi, 2001). We borrow the idea of LE and enforce semantic smoothness by assuming:

Smoothness Assumption 1: If two entities e_i and e_j belong to the same semantic category, their embeddings e_i and e_j will lie close to each other.
To encode the semantic information, we construct an adjacency matrix W_1 ∈ R^{n×n} among the entities, with the ij-th entry defined as:

$$w^{(1)}_{ij} = \begin{cases} 1, & \text{if } c_{e_i} = c_{e_j}, \\ 0, & \text{otherwise}, \end{cases}$$

where c_{e_i}/c_{e_j} is the category label of entity e_i/e_j. Then, we use the following term to measure the smoothness of the embedding space:

$$\mathcal{R}_1 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| \mathbf{e}_i - \mathbf{e}_j \right\|_2^2 \, w^{(1)}_{ij},$$

where e_i and e_j are the embeddings of entities e_i and e_j respectively. Minimizing R_1 enforces Smoothness Assumption 1: if two entities e_i and e_j belong to the same semantic category (i.e. w^{(1)}_{ij} = 1), the distance ∥e_i − e_j∥²₂ between their embeddings should be small.
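A small sketch of computing R_1, assuming embeddings are stored as the columns of a d × n matrix E; it uses the standard equivalent graph-Laplacian form R_1 = tr(E(D − W_1)Eᵀ), where D is the degree matrix defined below.

```python
import numpy as np

def le_regularizer(E, labels):
    """R1 = 1/2 * sum_ij ||e_i - e_j||^2 w_ij = tr(E (D - W) E^T)."""
    n = E.shape[1]
    W = np.array([[1.0 if labels[i] == labels[j] else 0.0
                   for j in range(n)] for i in range(n)])   # adjacency W1
    D = np.diag(W.sum(axis=1))                              # degree matrix
    return float(np.trace(E @ (D - W) @ E.T))               # graph Laplacian form

E = np.random.default_rng(0).normal(size=(4, 6))            # d = 4, n = 6
print(le_regularizer(E, ["a", "a", "a", "b", "b", "b"]))
```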
We further incorporate R_1 as a regularization term into the margin-based ranking loss (i.e. Eq. (1)) adopted in previous KG embedding methods, and propose our first SSE model. The new model performs the embedding task by minimizing the following objective function:

$$\mathcal{L}_1 = \frac{1}{N} \sum_{(t^+, t^-)} \ell(t^+, t^-) + \lambda_1 \mathcal{R}_1,$$

where ℓ(t^+, t^-) = [γ + f(t^+) − f(t^-)]_+ is the ranking loss on the positive-negative triple pair (t^+, t^-), and N is the total number of such triple pairs. The first term in L_1 enforces the resultant embedding space to be compatible with all the observed triples, and the second term further requires that space to be semantically smooth. The hyperparameter λ_1 makes a trade-off between the two.
The minimization is carried out by stochastic gradient descent. Given a randomly sampled positive triple t^+ = ⟨e_i, r_k, e_j⟩ and the associated negative triple t^- = ⟨e'_i, r_k, e'_j⟩ (constructed by replacing one of the entities in the positive triple), the stochastic gradient w.r.t. e_s (s ∈ {i, j, i′, j′}) can be calculated as:

$$\frac{\partial \mathcal{L}_1}{\partial \mathbf{e}_s} = \frac{1}{N} \frac{\partial \ell(t^+, t^-)}{\partial \mathbf{e}_s} + 2 \lambda_1 \mathbf{E} \left( \mathbf{D} - \mathbf{W}_1 \right) \mathbf{1}_s,$$
where E = [e_1, e_2, ..., e_n] ∈ R^{d×n} is a matrix consisting of entity embeddings; D ∈ R^{n×n} is a diagonal matrix whose i-th diagonal entry is d_ii = ∑_{j=1}^n w^{(1)}_{ij}; and 1_s ∈ R^n is a column vector whose s-th entry is 1 and all other entries are 0. The other parameters are not involved in R_1, and their gradients remain the same as defined in previous work.
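In code, the regularizer's part of this closed-form gradient might look like the following sketch (the ranking-loss part of the gradient is omitted, since it is inherited unchanged from the base embedding model):

```python
import numpy as np

def le_grad_wrt_entity(E, W1, lam1, s):
    """Gradient of lambda1 * R1 w.r.t. e_s: 2 * lam1 * E (D - W1) 1_s."""
    D = np.diag(W1.sum(axis=1))            # degree matrix of W1
    one_s = np.zeros(E.shape[1])
    one_s[s] = 1.0                          # indicator vector 1_s
    return 2.0 * lam1 * E @ (D - W1) @ one_s   # shape (d,)
```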

Modeling Semantic Smoothness by LLE
As opposed to LE, which preserves local invariance within data pairs, Locally Linear Embedding (LLE) expects each data point to be roughly reconstructed by a linear combination of its nearest neighbors (Roweis and Saul, 2000). We borrow the idea of LLE and enforce semantic smoothness by assuming:

Smoothness Assumption 2: Each entity e_i can be roughly reconstructed by a linear combination of its nearest neighbors in the embedding space, i.e., e_i ≈ ∑_{e_j ∈ N(e_i)} α_j e_j. Here nearest neighbors refer to entities belonging to the same semantic category as e_i.
To model this assumption, for each entity e_i, we randomly sample K entities uniformly from the category to which e_i belongs, denoted as the nearest neighbor set N(e_i). We construct a weight matrix W_2 ∈ R^{n×n} by defining:

$$w^{(2)}_{ij} = \begin{cases} 1, & \text{if } e_j \in N(e_i), \\ 0, & \text{otherwise}, \end{cases}$$

and normalize the rows so that ∑_{j=1}^n w^{(2)}_{ij} = 1 for each row i. Note that W_2 is no longer a symmetric matrix. The smoothness of the embedding space can then be measured by the reconstruction error:

$$\mathcal{R}_2 = \sum_{i=1}^{n} \Big\| \mathbf{e}_i - \sum_{j=1}^{n} w^{(2)}_{ij} \mathbf{e}_j \Big\|_2^2.$$

Minimizing R_2 enforces Smoothness Assumption 2: each entity can be linearly reconstructed from its nearest neighbors with low error. By incorporating R_2 as a regularization term into the margin-based ranking loss defined in Eq.
(1), we obtain our second SSE model, which performs the embedding task by minimizing:

$$\mathcal{L}_2 = \frac{1}{N} \sum_{(t^+, t^-)} \ell(t^+, t^-) + \lambda_2 \mathcal{R}_2.$$
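A sketch of building W_2 and computing R_2 follows, assuming K neighbors are sampled uniformly within each entity's category (during training, these neighbor sets are resampled at each stochastic step, as described below). Embeddings are again columns of a d × n matrix E.

```python
import numpy as np

def build_w2(labels, K, rng):
    """Row-normalized weight matrix W2 over K sampled within-category neighbors."""
    n = len(labels)
    W2 = np.zeros((n, n))
    for i in range(n):
        peers = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not peers:
            continue                      # entity without category peers
        chosen = rng.choice(peers, size=min(K, len(peers)), replace=False)
        W2[i, chosen] = 1.0
        W2[i] /= W2[i].sum()              # row-normalize: sum_j w2_ij = 1
    return W2

def lle_regularizer(E, W2):
    """R2 = sum_i ||e_i - sum_j w2_ij e_j||_2^2 (residual of each column of E)."""
    resid = E - E @ W2.T
    return float((resid ** 2).sum())

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 6))
W2 = build_w2(["a", "a", "a", "b", "b", "b"], K=2, rng=rng)
print(lle_regularizer(E, W2))
```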

The resultant embedding space is also semantically smooth and compatible with the observed triples. The hyperparameter λ_2 makes a trade-off between the two. Similar to the first model, stochastic gradient descent is used to solve the minimization problem. Given a positive triple t^+ = ⟨e_i, r_k, e_j⟩ and the associated negative triple t^- = ⟨e'_i, r_k, e'_j⟩, the gradient w.r.t. e_s (s ∈ {i, j, i′, j′}) is calculated as:

$$\frac{\partial \mathcal{L}_2}{\partial \mathbf{e}_s} = \frac{1}{N} \frac{\partial \ell(t^+, t^-)}{\partial \mathbf{e}_s} + 2 \lambda_2 \mathbf{E} \left( \mathbf{I} - \mathbf{W}_2 \right)^{\top} \left( \mathbf{I} - \mathbf{W}_2 \right) \mathbf{1}_s,$$

where I ∈ R^{n×n} is the identity matrix. The other parameters are not involved in R_2, and their gradients remain the same as defined in previous work.
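As with the LE term, the regularizer's contribution to this gradient has a direct closed form; a minimal sketch (ranking-loss part again omitted):

```python
import numpy as np

def lle_grad_wrt_entity(E, W2, lam2, s):
    """Gradient of lambda2 * R2 w.r.t. e_s: 2 * lam2 * E (I - W2)^T (I - W2) 1_s."""
    n = E.shape[1]
    M = np.eye(n) - W2
    one_s = np.zeros(n)
    one_s[s] = 1.0
    return 2.0 * lam2 * E @ M.T @ (M @ one_s)   # shape (d,)
```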
To better capture the cohesion within each category, during each stochastic step we resample the nearest neighbors for each entity, uniformly from the category to which it belongs.

Advantages and Extensions
The advantages of our approach can be summarized as follows: 1) By incorporating geometrically based regularization terms, the SSE models are able to capture the semantic correlation between entities, which exists intrinsically but is overlooked in previous work. 2) By leveraging additional entity category information, the SSE models can deal with the data sparsity issue that commonly exists in most KGs. Both aspects lead to more accurate embeddings. Entity category information has also been investigated in (Nickel et al., 2012; Chang et al., 2014; Wang et al., 2015), but in different manners. Nickel et al. (2012) take categories as pseudo entities and introduce a specific relation to link entities to categories. Chang et al. (2014) and Wang et al. (2015) use entity categories to specify relations' argument expectations, removing invalid triples during training and reasoning respectively. None of them considers the intrinsic geometric structure of the embedding space.
Actually, our approach is quite general. 1) The smoothness assumptions can be imposed on a wide variety of KG embedding models, not only the ones introduced in Section 2, but also those based on matrix/tensor factorization (Nickel et al., 2011; Chang et al., 2013). 2) Besides semantic categories, other information (e.g. entity similarities specified by users or derived from auxiliary data sources) can also be used to construct the manifold regularization terms. 3) Besides KG embedding, similar smoothness assumptions can also be applied in other embedding tasks (e.g. word embedding and sentence embedding).

Experiments
We empirically evaluate the proposed SSE models in two tasks: link prediction (Bordes et al., 2013) and triple classification (Socher et al., 2013).

Data Sets
We create three data sets with different sizes using NELL (Carlson et al., 2010): Location, Sport, and Nell186. Location and Sport are two small-scale data sets, both containing 8 relations, on the topics of "location" and "sport" respectively. The corresponding relations are listed in Table 2. Nell186 is a larger data set containing the 186 most frequent relations. On all the data sets, entities appearing only once are removed. We extract the entity category information from a specific relation called Generalization, and keep only non-overlapping categories (if two categories overlap, the smaller one is discarded). Categories containing fewer than 5 entities on Location and Sport, as well as categories containing fewer than 50 entities on Nell186, are further removed. Table 3 gives some statistics of the three data sets, where # Rel./# Ent./# Trip./# Cat. denotes the number of relations/entities/observed triples/categories respectively, and # c-Ent. denotes the number of entities that have category labels. Note that our SSE models do not require every entity to have a category label. From the statistics, we can see that all three data sets suffer from the data sparsity issue, containing a relatively small number of observed triples compared to the number of entities.
On the two small-scale data sets Location and Sport, triples are split into training/validation/test sets with a ratio of 3:1:1. The first set is used for model training, the second for hyperparameter tuning, and the third for evaluation. All experiments are repeated 5 times by drawing new training/validation/test splits, and results averaged over the 5 rounds are reported. On Nell186, experiments are conducted only once, using a training/validation/test split with 31,134/5,000/5,000 triples respectively. We will release the data upon request.

Link Prediction
This task is to complete a triple ⟨e_i, r_k, e_j⟩ with e_i or e_j missing, i.e., predict e_i given (r_k, e_j) or predict e_j given (e_i, r_k).

Baseline methods. We take TransE, SME (lin), SME (bilin), and SE as our baselines. We then incorporate the manifold regularization terms into these methods to obtain the SSE models. A model with the LE/LLE regularization term is denoted as, e.g., TransE-LE/TransE-LLE. We further compare our SSE models with the setting proposed by Nickel et al. (2012), which also takes into account the entity category information, but in a more direct manner. That is, given an entity e with its category label c_e, we create a new triple ⟨e, Generalization, c_e⟩ and add it into the training set. Such a method is denoted as, e.g., TransE-Cat. A sketch of this setting is given below.
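```python
# A sketch of the *-Cat comparison setting (Nickel et al., 2012): category
# labels become pseudo entities connected to their entities via a dedicated
# Generalization relation, and the new triples are appended to the training set.
def add_cat_triples(train_triples, categories):
    """categories: dict mapping entity -> category label c_e (illustrative)."""
    extra = [(e, "Generalization", c_e) for e, c_e in categories.items()]
    return train_triples + extra
```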
Evaluation protocol. For evaluation, we adopt the same ranking procedure proposed by Bordes et al. (2013). For each test triple ⟨e_i, r_k, e_j⟩, the head entity e_i is replaced by every entity e'_i in the KG, and the energy is calculated for each corrupted triple ⟨e'_i, r_k, e_j⟩. Ranking the energies in ascending order, we get the rank of the correct entity e_i. Similarly, we can get another rank by corrupting the tail entity e_j. Aggregated over all test triples, we report three metrics: 1) the averaged rank, denoted as Mean (the smaller, the better); 2) the median of the ranks, denoted as Median (the smaller, the better); and 3) the proportion of ranks no larger than 10, denoted as Hits@10 (the higher, the better).
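A minimal sketch of this protocol follows; `energy` is any function from Table 1 over entity indices, and ties are broken optimistically (rank = 1 + number of strictly lower energies), which is one common convention and an assumption here.

```python
import numpy as np

def evaluate(energy, test_triples, n_entities):
    """Return (Mean, Median, Hits@10) over head- and tail-corruption ranks."""
    ranks = []
    for h, r, t in test_triples:
        head_scores = np.array([energy(h2, r, t) for h2 in range(n_entities)])
        ranks.append(int((head_scores < head_scores[h]).sum()) + 1)
        tail_scores = np.array([energy(h, r, t2) for t2 in range(n_entities)])
        ranks.append(int((tail_scores < tail_scores[t]).sum()) + 1)
    ranks = np.array(ranks)
    return ranks.mean(), np.median(ranks), (ranks <= 10).mean()
```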
Implementation details. We implement the methods based on the publicly available SME code (https://github.com/glorotxa/SME). For all the methods, we create 100 mini-batches on each data set. On Location and Sport, the dimension of the embedding space d is selected from {10, 20, 50, 100}, the margin γ from {1, 2, 5, 10}, and the learning rate is fixed to 0.1. On Nell186, d and γ are fixed to 50 and 1 respectively, and the learning rate is fixed to 10. In LE and LLE, the regularization hyperparameters λ_1 and λ_2 are tuned in {10^-4, 10^-5, 10^-6, 10^-7, 10^-8}, and the number of nearest neighbors K in LLE is tuned in {5, 10, 15, 20}. The best model is selected by early stopping on the validation sets (by monitoring Mean), with a total of at most 1000 iterations over the training sets.
Results. Table 4 reports the results on the test sets of Location, Sport, and Nell186. From the results, we can see that: 1) SSE (regularized via either LE or LLE) outperforms all the baselines on all the data sets and with all the metrics. The improvements are usually quite significant. The metric Mean drops by about 10% to 65%, Median drops by about 5% to 75%, and Hits@10 rises by about 5% to 190%. This observation demonstrates the superiority and generality of our approach. 2) Even if encoded in a direct way (e.g. TransE-Cat), the entity category information can still help the baseline methods in the link prediction task. This observation indicates that leveraging additional information is indeed useful in dealing with the data sparsity issue and hence leads to better performance. 3) Compared to the strategy which incorporates the entity category information directly, formulating such information as manifold regularization terms results in better and more stable results. The *-Cat models sometimes perform even worse than the baselines (e.g. TransE-Cat on Sport data), while the SSE models consistently achieve better results. This observation further demonstrates the superiority of constraining the geometric structure of the embedding space.
We further visualize and compare the geometric structures of the embedding spaces learned by traditional embedding and semantically smooth embedding. We select the 10 largest semantic categories in Nell186 (specified in Figure 1) and the 5,740 entities therein. We take the embeddings of these entities learned by TransE, TransE-Cat, TransE-LE, and TransE-LLE, with the optimal hyperparameter settings determined in the link prediction task, and create 2D plots using t-SNE (Van der Maaten and Hinton, 2008). The results are shown in Figure 1, where a different color is used for each category. It is easy to see that imposing the semantic smoothness assumptions helps in capturing the semantic correlation between entities in the embedding space: entities within the same category lie closer to each other, while entities belonging to different categories are easily distinguished (see Figure 1(c) and Figure 1(d)). Incorporating the entity category information directly can also help, but it fails on some "hard" entities (i.e., those belonging to different categories but mixed together in the center of Figure 1(b)). We have conducted the same experiments with the other methods and observed similar phenomena.
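A sketch of this visualization step, with placeholder embeddings and labels standing in for the learned ones:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 50))        # placeholder entity embeddings (n x d)
labels = rng.integers(10, size=200)   # placeholder category ids (10 categories)

# Project to 2D with t-SNE and color each point by its category.
xy = TSNE(n_components=2, random_state=0).fit_transform(E)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
plt.show()
```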

Triple Classification
This task is to verify whether a given triple ⟨e_i, r_k, e_j⟩ is correct or not. We test our SSE models in this task, with the same comparison settings as used in the link prediction task.

Evaluation protocol. We follow the same evaluation protocol used in (Socher et al., 2013; Wang et al., 2014b). To create labeled data for classification, for each triple in the test and validation sets, we construct a negative triple by randomly corrupting the entities. To corrupt a position (head or tail), only entities that have appeared in that position are allowed. During triple classification, a triple is predicted as positive if its energy is below a relation-specific threshold δ_r, and as negative otherwise. We report two metrics on the test sets: micro-averaged accuracy and macro-averaged accuracy, denoted as Micro-ACC and Macro-ACC respectively. The former is a per-triple average, while the latter is a per-relation average.

Implementation details. We use the same hyperparameter settings as in the link prediction task. The relation-specific threshold δ_r is determined by maximizing Micro-ACC on the validation sets. Again, training is limited to at most 1000 iterations, and the best model is selected by early stopping on the validation sets (by monitoring Micro-ACC).
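A sketch of the per-relation threshold tuning, assuming a simple exhaustive search over observed validation energies (the exact search procedure is not specified in the text):

```python
import numpy as np

def tune_threshold(val_energies, val_labels):
    """Pick delta_r maximizing accuracy; val_labels is True for positives."""
    best_delta, best_acc = None, -1.0
    for delta in np.sort(val_energies):
        acc = ((val_energies <= delta) == val_labels).mean()
        if acc > best_acc:
            best_delta, best_acc = delta, acc
    return best_delta

energies = np.array([0.2, 0.9, 0.4, 1.5])
labels = np.array([True, False, True, False])
delta_r = tune_threshold(energies, labels)
print(delta_r, energies <= delta_r)   # threshold and predicted positives
```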
Results. From the results, we can see that: 1) SSE (regularized via either LE or LLE) outperforms all the baseline methods on all the data sets in both metrics. The improvements are usually quite substantial. The metric Micro-ACC rises by about 1% to 25%, and Macro-ACC by about 2% to 30%. 2) Incorporating the entity category information directly can also improve the baselines in the triple classification task, again demonstrating the effectiveness of leveraging additional information to deal with the data sparsity issue.
3) It is a better choice to incorporate the entity category information as manifold regularization terms rather than to encode it directly. The *-Cat models sometimes perform even worse than the baselines (e.g. TransE-Cat on Location data and SE-Cat on Sport data), while the SSE models consistently achieve better results. These observations mirror those in the link prediction task, and further demonstrate the superiority and generality of our approach.

Related Work
This section reviews two lines of related work: KG embedding and manifold learning. KG embedding aims to embed a KG composed of entities and relations into a low-dimensional vector space, and model the plausibility of each fact in that space. The literature can be roughly categorized into three major groups: 1) methods based on neural networks, 2) methods based on matrix/tensor factorization, and 3) methods based on Bayesian clustering. The first group performs the embedding task using neural network architectures (Bordes et al., 2014; Socher et al., 2013). Several state-of-the-art neural network-based embedding models have been introduced in Section 2; for other work please refer to (Jenatton et al., 2012; Wang et al., 2014b; Lin et al., 2015). In the second group, KGs are represented as tensors, and embedding is performed via tensor factorization or collective matrix factorization techniques (Singh and Gordon, 2008; Nickel et al., 2011; Chang et al., 2014). The third group embeds factorized representations of entities and relations into a nonparametric Bayesian clustering framework, so as to obtain more interpretable embeddings (Kemp et al., 2006; Sutskever et al., 2009). Our work falls into the first group, but differs in that it further imposes constraints on the geometric structure of the embedding space, which exists intrinsically but is overlooked in previous work. Although this paper focuses on incorporating geometrically based regularization terms into neural network architectures, the approach can easily be extended to matrix/tensor factorization techniques.
Manifold learning is a geometrically motivated framework for machine learning, enforcing the learning model to be smooth w.r.t. the geometric structure of the data (Belkin et al., 2006). Within this framework, various manifold learning algorithms have been proposed, such as ISOMAP (Tenenbaum et al., 2000), Laplacian Eigenmaps (Belkin and Niyogi, 2001), and Locally Linear Embedding (Roweis and Saul, 2000). All these algorithms are based on the so-called local invariance assumption, i.e., nearby points are likely to have similar embeddings or labels. Manifold learning has been widely applied in many different areas, from dimensionality reduction (Belkin and Niyogi, 2001; Cai et al., 2008) and semi-supervised learning (Zhou et al., 2004; Zhu and Niyogi, 2005) to recommender systems (Ma et al., 2011) and community question answering (Wang et al., 2014a). This paper employs manifold learning algorithms to model the semantic smoothness assumptions in KG embedding.

Conclusion and Future Work
In this paper, we have proposed a novel approach to KG embedding, referred to as Semantically Smooth Embedding (SSE). The key idea of SSE is to impose constraints on the geometric structure of the embedding space and enforce it to be semantically smooth. The semantic smoothness assumptions are constructed using entities' category information, and then formulated as geometrically based regularization terms to constrain the embedding task. The embeddings learned in this way are capable of capturing the semantic correlation between entities. By leveraging additional information besides observed triples, SSE can also deal with the data sparsity issue that commonly exists in most KGs. We empirically evaluate SSE in two benchmark tasks of link prediction and triple classification. Experimental results show that by incorporating the semantic smoothness assumptions, SSE significantly and consistently outperforms state-of-the-art embedding methods, demonstrating the superiority of our approach. In addition, our approach is quite general. The smoothness assumptions can be imposed on a wide variety of embedding models, and they can also be constructed using other information besides entities' semantic categories.
As future work, we would like to: 1) Construct the manifold regularization terms using other data sources. The only information required to construct the manifold regularization terms is the similarity between entities (used to define the adjacency matrix in LE and to select nearest neighbors for each entity in LLE). We would try entity similarities derived in different ways, e.g., specified by users or calculated from entities' textual descriptions. 2) Enhance the efficiency and scalability of SSE. Processing the manifold regularization terms can be time- and space-consuming (especially the one induced by the LE algorithm). We would investigate how to address this problem, e.g., via the efficient iterative algorithms introduced in (Saul and Roweis, 2003) or via parallel/distributed computing. 3) Impose the semantic smoothness assumptions on other KG embedding methods (e.g. those based on matrix/tensor factorization or Bayesian clustering), and even on other embedding tasks (e.g. word embedding or sentence embedding).