Interpretable and Compositional Relation Learning by Joint Training with an Autoencoder

Embedding models for entities and relations are extremely useful for recovering missing facts in a knowledge base. Intuitively, a relation can be modeled by a matrix mapping entity vectors. However, relations reside on low dimension sub-manifolds in the parameter space of arbitrary matrices: for one reason, the composition of two relations M1, M2 may match a third M3 (e.g. the composition of the relations currency_of_country and country_of_film usually matches currency_of_film_budget), which imposes compositional constraints on the parameters (i.e. M1·M2 ≈ M3). In this paper we investigate a dimension reduction technique that trains relations jointly with an autoencoder, which is expected to better capture such compositional constraints. We achieve state-of-the-art performance on Knowledge Base Completion tasks, with strongly improved Mean Rank, and show that joint training with an autoencoder leads to interpretable sparse codings of relations, helps discover compositional constraints, and benefits from compositional training. Our source code is released at github.com/tianran/glimvec.


Introduction
Broad-coverage knowledge bases (KBs) such as Freebase (Bollacker et al., 2008) and DBpedia (Auer et al., 2007) store large amounts of facts in the form of ⟨head entity, relation, tail entity⟩ triples (e.g. ⟨The Matrix, country of film, Australia⟩), which can support a wide range of reasoning and question answering applications. The Knowledge Base Completion (KBC) task aims to predict the missing part of an incomplete triple, such as ⟨Finding Nemo, country of film, ?⟩, by reasoning from known facts stored in the KB.

Figure 1: In joint training, relation parameters (e.g. M1) receive updates from both a KB-learning objective, trying to predict entities in the KB, and a reconstruction objective from an autoencoder, trying to recover relations from low dimension codings.
Modeling entities and relations to operate in a low dimension vector space is the most common approach (Wang et al., 2017), and it helps KBC for three conceivable reasons. First, when the dimension is low, entities modeled as vectors are forced to share parameters, so "similar" entities that participate in many relations in common get close to each other (e.g. Australia close to US). This can imply that an entity (e.g. US) "type matches" a relation such as country of film. Second, relations may share parameters as well, which can transfer facts from one relation to similar relations, for example from ⟨x, award winner, y⟩ to ⟨x, award nominated, y⟩. Third, spatial positions might be used to implement composition of relations: relations can be regarded as mappings from head to tail entities, and the composition of two maps can match a third (e.g. the composition of currency of country and country of film matches the relation currency of film budget), which can be captured by modeling composition in a vector space.
However, modeling relations as mappings naturally requires more parameters (a general linear map between d-dimension vectors is represented by a matrix of d² parameters), which are less likely to be shared, impeding transfer of facts between similar relations. Thus, it is desirable to reduce the dimensionality of relations; furthermore, the existence of a composition of two relations (assumed to be modeled by matrices M1, M2) matching a third (M3) also justifies dimension reduction, because it implies a compositional constraint M1·M2 ≈ M3 that can be satisfied only by a lower dimension sub-manifold in the parameter space.¹
Previous approaches reduce the dimensionality of relations by imposing pre-designed hard constraints on the parameter space, such as constraining relations to be translations (Bordes et al., 2013) or diagonal matrices, or assuming they are linear combinations of a small number of prototypes (Xie et al., 2017). However, pre-designed hard constraints do not seem to cope well with compositional constraints, because it is difficult to know a priori which two relations compose to which third relation, and hence difficult to choose a pre-design; moreover, compositional constraints are not always exact (e.g. the composition of currency of country and headquarter location usually matches business operation currency, but not always), so hard constraints are less suited.
In this paper, we investigate an alternative approach: training relation parameters jointly with an autoencoder (Figure 1). During training, the autoencoder tries to reconstruct relations from low dimension codings, with the reconstruction objective back-propagating to the relation parameters as well. We show that this novel technique promotes parameter sharing between different relations and drives them toward low dimension manifolds (Sec.6.2). Moreover, we expect the technique to cope better with compositional constraints, because it discovers low dimension manifolds a posteriori from data, without imposing any explicit hard constraints.

¹ It is noteworthy that similar compositional constraints apply to most modeling schemes of relations, not just matrices.

Yet, joint training with an autoencoder is not simple; one has to keep a subtle balance between the gradients of the reconstruction and KB-learning objectives throughout the training process. We are not aware of any theoretical principles directly addressing this problem, but we found some important settings after extensive pre-experiments (Sec.4). We evaluate our system on standard KBC datasets, achieving state-of-the-art on several of them (Sec.6.1), with strongly improved Mean Rank. We discuss the detailed settings that lead to this performance (Sec.4.1), and we show that joint training with an autoencoder indeed helps discover compositional constraints (Sec.6.2) and benefits from compositional training (Sec.6.3).

Base Model
A knowledge base (KB) is a set T of triples of the form ⟨h, r, t⟩, where h, t ∈ E are entities and r ∈ R is a relation (e.g. ⟨The Matrix, country of film, Australia⟩). A relation r has an inverse r⁻¹ ∈ R, so that for every ⟨h, r, t⟩ ∈ T we regard ⟨t, r⁻¹, h⟩ as also in the KB. Under this assumption, and given T as training data, we consider the Knowledge Base Completion (KBC) task of predicting candidates for the missing tail entity in an incomplete ⟨h, r, ?⟩ triple.
Most approaches tackle this problem by training a score function that measures the plausibility of triples being facts. The model we implement in this work represents entities h, t as d-dimension vectors u_h, v_t respectively, and relation r as a d×d matrix M_r. If u_h, v_t are one-hot vectors of dimension d = |E| corresponding to each entity, one can take M_r as the adjacency matrix of entities joined by relation r, so the set of tail entities filling ⟨h, r, ?⟩ is calculated by u_h⊤ M_r (with each nonzero entry corresponding to an answer). Thus, u_h⊤ M_r v_t > 0 if and only if ⟨h, r, t⟩ ∈ T. This motivates us to use u_h⊤ M_r v_t as a natural parameter for modeling the plausibility of ⟨h, r, t⟩, even in a low dimension space with d ≪ |E|. Thus, we define the score function of the basic model as

    s(h, r, t) := exp(u_h⊤ M_r v_t)    (1)

This is similar to the bilinear model of Nickel et al. (2011), except that we distinguish u_h (the vector for head entities) from v_t (the vector for tail entities). The same model has also been proposed in Tian et al. (2016), but for modeling dependency trees rather than KBs.
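As a quick illustration, here is a minimal numpy sketch of this bilinear score (the exponentiated form, the tiny dimension d = 4, and the Gaussian initialization scale are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (the paper uses d = 256)

# hypothetical parameters: head vector u_h, tail vector v_t, relation matrix M_r
u_h = rng.normal(scale=1 / np.sqrt(d), size=d)
v_t = rng.normal(scale=1 / np.sqrt(d), size=d)
M_r = rng.normal(scale=1 / np.sqrt(d), size=(d, d))

def score(u_h, M_r, v_t):
    """Plausibility of the triple <h, r, t>: exp of the bilinear form."""
    return np.exp(u_h @ M_r @ v_t)

s = score(u_h, M_r, v_t)
assert s > 0  # exponentiation keeps every score positive
```

Positivity matters here because the NCE objective below takes the score as an unnormalized probability.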
More generally, we consider a composition of relations r1/…/rl to model paths in a KB (Guu et al., 2015), defined by r1, …, rl participating in a sequence of facts such that the head entity of each fact coincides with the tail of the previous one. For example, the two facts ⟨The Matrix, country of film, Australia⟩ and ⟨Australia, currency of country, Australian Dollar⟩ form a path of composition country of film / currency of country, because the head of the second fact (i.e. Australia) coincides with the tail of the first. Using the previous d = |E| analogue, one can verify that composition of relations is represented by multiplication of adjacency matrices, so we accordingly define

    s(h, r1/…/rl, t) := exp(u_h⊤ M_{r1} ⋯ M_{rl} v_t)

to measure the plausibility of a path. Guu et al. (2015) explored learning a score function not only for single facts but also for paths. This compositional training scheme is shown to bring valuable information about the structure of the KB and may help KBC. In this work, we conduct experiments both with and without compositional training.
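The adjacency-matrix intuition can be checked directly on a toy KB: with one-hot entity vectors, multiplying adjacency matrices enumerates exactly the tails reachable by the composed path. A small sketch (the three entities and their indices are hypothetical):

```python
import numpy as np

# Toy KB with 3 entities: 0 = The Matrix, 1 = Australia, 2 = Australian Dollar.
# Adjacency matrix of country_of_film: entry (h, t) = 1 iff <h, r, t> is a fact.
country_of_film = np.array([[0, 1, 0],
                            [0, 0, 0],
                            [0, 0, 0]], dtype=float)
currency_of_country = np.array([[0, 0, 0],
                                [0, 0, 1],
                                [0, 0, 0]], dtype=float)

# Composition of the two relations = product of the adjacency matrices.
composed = country_of_film @ currency_of_country

u = np.eye(3)  # one-hot head vectors
# Tail candidates for <The Matrix, country_of_film/currency_of_country, ?>:
tails = np.nonzero(u[0] @ composed)[0]
print(tails)  # entity 2 = Australian Dollar
```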
In order to learn the parameters u_h, v_t, M_r of the score function, we follow Tian et al. (2016) in using a Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012) objective. For each path (or triple) ⟨h, r1/…, t⟩ taken from the KB, we generate negative samples by replacing the tail entity t with some random noise t*. Then, we maximize

    L1 := Σ_path ln [ s(h, r1/…, t) / (k + s(h, r1/…, t)) ] + Σ_noise ln [ k / (k + s(h, r1/…, t*)) ]

as our KB-learning objective. Here, k is the number of noises generated for each path. When the score function is regarded as a probability, L1 represents the log-likelihood of "⟨h, r1/…, t⟩ being an actual path and ⟨h, r1/…, t*⟩ being noise". Maximizing L1 increases the scores of actual paths and decreases the scores of noises.
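A sketch of this objective for a single path, assuming only that scores are positive (as with the exponentiated bilinear form):

```python
import numpy as np

def nce_loss(pos_score, neg_scores, k):
    """KB-learning objective L1 for one path and its k negative samples.

    pos_score  : s(h, r1/.../rl, t) for the actual path
    neg_scores : s(h, r1/.../rl, t*) for the k sampled noise tails
    """
    assert len(neg_scores) == k
    return (np.log(pos_score / (k + pos_score))
            + sum(np.log(k / (k + s)) for s in neg_scores))

# Maximizing L1 pushes positive scores up and noise scores down:
low = nce_loss(pos_score=1.0, neg_scores=[5.0, 5.0], k=2)
high = nce_loss(pos_score=10.0, neg_scores=[0.1, 0.1], k=2)
assert high > low
```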

Joint Training with an Autoencoder
Autoencoders learn efficient codings of high-dimensional data while trying to reconstruct the original data from the coding. By jointly training relation matrices with an autoencoder, we expect the autoencoder to help reduce the dimensionality of the original data (i.e. the relation matrices).
Formally, we define a vectorization m_r for each relation matrix M_r and use it as input to the autoencoder: m_r is M_r flattened into a d²-dimension vector, normalized such that ‖m_r‖ = √d. We define the coding as

    c_r := ReLU(A m_r)    (2)

Here A is a c × d² matrix with c ≪ d², and ReLU is the Rectified Linear Unit function (Nair and Hinton, 2010). We reconstruct the input from c_r by multiplying a d² × c matrix B. We want B c_r to be more similar to m_r than to other relations. For this purpose, we define the similarity

    sim(m_r, B c_r) := (1/√(dc)) m_r⊤ B c_r    (3)

which measures the length of B c_r projected onto the direction of m_r, up to the scaling factor 1/√(dc). In order to learn the parameters A, B, we adopt the same Noise Contrastive Estimation scheme as in Sec.2: we generate random noise relations r* for each relation r and maximize

    L2 := Σ_r ln [ exp(sim(m_r, B c_r)) / (k + exp(sim(m_r, B c_r))) ] + Σ_noise ln [ k / (k + exp(sim(m_r, B c_{r*}))) ]

as our reconstruction objective. Maximizing L2 increases m_r's similarity with B c_r, and decreases its similarity with B c_{r*}.
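A minimal sketch of the encoder, decoder, and similarity (the initialization scales for A and B are our assumptions, and the paper's d = 256, c = 16 are shrunk for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 4, 2  # the paper uses d = 256, c = 16

def vectorize(M):
    """Flatten M_r and normalize so that ||m_r|| = sqrt(d)."""
    d = M.shape[0]
    m = M.reshape(-1)
    return m * np.sqrt(d) / np.linalg.norm(m)

A = rng.normal(scale=1 / np.sqrt(d), size=(c, d * d))  # encoder
B = rng.normal(scale=1 / np.sqrt(d), size=(d * d, c))  # decoder

M_r = rng.normal(scale=1 / np.sqrt(d), size=(d, d))
m_r = vectorize(M_r)

c_r = np.maximum(0.0, A @ m_r)  # coding: ReLU(A m_r), eq. (2)
recon = B @ c_r                 # reconstruction B c_r

# similarity (3): projection of B c_r onto m_r, scaled by 1/sqrt(dc)
sim = m_r @ recon / np.sqrt(d * c)
```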
During joint training, both L 1 and L 2 are simultaneously maximized, and the gradient ∇L 2 propagates to relation matrices as well. Since ∇L 2 depends on A and B, and A, B interact with all relations, they promote indirect parameter sharing between different relation matrices. In Sec.6.2, we further show that joint training drives relations toward a low dimension manifold.

Optimization Tricks
Joint training with an autoencoder is not simple. Relation matrices receive updates from both ∇L1 and ∇L2, but if they follow ∇L1 too much, the autoencoder has no effect; conversely, if they follow ∇L2 too much, all relation matrices collapse into one cluster. Furthermore, the autoencoder should learn from genuine patterns of relation matrices that emerge from fitting the KB, and not the reverse, in which the autoencoder imposes arbitrary patterns on the relation matrices according to its random initialization. It is therefore not surprising that a naive optimization of L1 + L2 does not work.
After extensive pre-experiments, we found some crucial settings for successful training. The most important "magic" is the scaling factor 1/√(dc) in the definition of the similarity function (3), possibly in combination with the other settings discussed below. We tried the factors 1, 1/√d, 1/√c and 1/(dc) instead, with various combinations of d and c, but the autoencoder failed to learn meaningful codings in all of those settings. When the scaling factor is too small (e.g. 1/(dc)), all relations get almost the same coding; conversely, when it is too large (e.g. 1), all codings get very close to 0.
The next important rule is to keep a balance between the updates coming from ∇L1 and ∇L2. We use Stochastic Gradient Descent (SGD) for optimization, and the common practice (Bottou, 2012) is to set the learning rate as

    α(τ) := η / (1 + ηλτ)    (4)

Here, η, λ are hyper-parameters and τ is a counter of processed data points. In this work, in order to control the updates in enough detail to keep a balance, we modify (4) to use a step counter τ_r for each relation r, counting the number of updates instead of data points; that is, whenever M_r gets a nonzero update from a gradient calculation, τ_r increases by 1. Furthermore, we use different hyper-parameters for different types of updates, namely η1, λ1 for updates coming from ∇L1, and η2, λ2 for updates coming from ∇L2. Thus, letting ∆1 be the partial gradient of L1 and ∆2 the partial gradient of L2 with respect to M_r, we update M_r by α1(τ_r)∆1 + α2(τ_r)∆2 at each step, where

    α1(τ_r) := η1 / (1 + η1 λ1 τ_r),    α2(τ_r) := η2 / (1 + η2 λ2 τ_r)    (5)

The rule for setting η1, λ1, η2, λ2 is as follows: η2 should be much smaller than η1, because η1, η2 control the magnitude of the learning rates at the early stage of training, when the autoencoder is still largely random and ∆2 does not make much sense; on the other hand, one has to choose λ1 and λ2 such that ‖∆1‖/λ1 and ‖∆2‖/λ2 are at the same scale, because the learning rates approach 1/(λ1 τ_r) and 1/(λ2 τ_r) respectively as training proceeds. In this way, the autoencoder does not impose random patterns on the relation matrices at the early stage, and a balance is kept between α1(τ_r)∆1 and α2(τ_r)∆2 later.
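The per-relation schedules can be sketched as follows (the default hyper-parameter values are those reported in the experiments section; the class name and interface are hypothetical):

```python
class RelationSGD:
    """Per-relation step counters with separate learning-rate schedules
    for the two gradient sources (KB-learning vs. reconstruction)."""

    def __init__(self, eta1=1/64, lam1=2**-14, eta2=2**-14, lam2=2**-14):
        self.eta1, self.lam1 = eta1, lam1
        self.eta2, self.lam2 = eta2, lam2
        self.tau = {}  # update counter per relation

    def rates(self, r):
        """alpha1(tau_r) and alpha2(tau_r), both approaching 1/(lam*tau)."""
        t = self.tau.get(r, 0)
        a1 = self.eta1 / (1 + self.eta1 * self.lam1 * t)
        a2 = self.eta2 / (1 + self.eta2 * self.lam2 * t)
        return a1, a2

    def update(self, r, M_r, delta1, delta2):
        """One gradient-ascent step on M_r; counts it toward tau_r."""
        a1, a2 = self.rates(r)
        self.tau[r] = self.tau.get(r, 0) + 1
        return M_r + a1 * delta1 + a2 * delta2
```

With these defaults, the early-stage rates are η1 = 1/64 for ∆1 but only 2⁻¹⁴ for ∆2, while both rates decay toward 1/(λτ_r) as τ_r grows.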
But how can we estimate the scales of ∆1 and ∆2? It turns out we can approximately calculate them from the initialization. In this work, we initialize parameters with i.i.d. Gaussians of variance 1/d, so the initial Euclidean norms are ‖u_h‖ ≈ 1, ‖v_t‖ ≈ 1, ‖M_r‖ ≈ √d, and ‖BAm_r‖ ≈ √(dc). Thus, calculating ∇L1 and ∇L2 from (1) and (3), we have approximately

    ‖∆1‖ ≈ ‖u_h‖ · ‖v_t‖ ≈ 1  and  ‖∆2‖ ≈ (1/√(dc)) · ‖BAm_r‖ ≈ 1

That is, thanks to the scaling factor 1/√(dc) in (3), ∆1 and ∆2 are at the same scale, so we can simply set λ1 = λ2. This might not be a mere coincidence.

Training the Base Model
Besides the tricks for joint training, we also found settings that significantly improve the base model on KBC, as briefly discussed below. In Sec.6.3, we will show performance gains by these settings using the FB15k-237 validation set.
Normalization It is better to normalize relation matrices to ‖M_r‖ = √d during training. This might reduce fluctuations in the entity vector updates.
Regularizer It is better to minimize ‖M_r⊤ M_r − (1/d) tr(M_r⊤ M_r) I‖ during training. This regularizer drives M_r toward an orthogonal matrix (Tian et al., 2016) and might reduce fluctuations in the entity vector updates. As a result, all relation matrices trained in this work are very close to orthogonal.
Initialization Instead of pure Gaussian, it is better to initialize matrices as (I + G)/2, where G is random. The identity matrix I helps passing information from head to tail (Tian et al., 2016).
Negative Sampling Instead of a unigram distribution, it is better to use a uniform distribution for generating noises. This is somewhat counter-intuitive, compared to the common practice in training word embeddings.
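The normalization, orthogonality regularizer, and initialization above can be sketched as follows (the penalty is written as a value to be minimized; adding it to the loss and differentiating is left out of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_relation(d):
    """Initialize as (I + G)/2 so information passes from head to tail."""
    G = rng.normal(scale=1 / np.sqrt(d), size=(d, d))
    return (np.eye(d) + G) / 2

def normalize(M):
    """Rescale so that the Frobenius norm ||M_r|| = sqrt(d)."""
    d = M.shape[0]
    return M * np.sqrt(d) / np.linalg.norm(M)

def ortho_penalty(M):
    """|| M^T M - (1/d) tr(M^T M) I ||: zero exactly when M is a scalar
    multiple of an orthogonal matrix."""
    d = M.shape[0]
    MtM = M.T @ M
    return np.linalg.norm(MtM - np.trace(MtM) / d * np.eye(d))

M = normalize(init_relation(4))
assert np.isclose(np.linalg.norm(M), 2.0)        # sqrt(d) with d = 4
assert np.isclose(ortho_penalty(np.eye(4)), 0.0)  # identity is orthogonal
```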
Related Work

Among previous works, TransE (Bordes et al., 2013) is the classic method that represents a relation as a translation of the entity vector space; it is partially inspired by Mikolov et al. (2013)'s vector arithmetic method of solving word analogy tasks. Although competitive in KBC, this method is speculated to be well-suited for 1-to-1 relations but too simple to represent N-to-N relations accurately (Wang et al., 2017). Thus, extensions such as TransR (Lin et al., 2015b) and STransE (Nguyen et al., 2016) have been proposed to map entities into a relation-specific vector space before translation. The ITransF model (Xie et al., 2017) further enhances this approach by imposing a hard constraint that the relation-specific maps should be linear combinations of a small number of prototypical matrices. Our work shares the same motivation as ITransF in terms of promoting parameter sharing among relations.
On the other hand, the base model used in this work originates from RESCAL (Nickel et al., 2011), in which relations are naturally represented as analogues of adjacency matrices (Sec.2). Further developments include HolE (Nickel et al., 2016b) and ConvE (Dettmers et al., 2018), which improve this approach in terms of parameter efficiency by introducing low dimension factorizations of the matrices. We inherit the basic model of RESCAL but draw additional training techniques from Tian et al. (2016), and show that the base model can already achieve near state-of-the-art performance (Sec.6.1, 6.3). This sends a message similar to Kadlec et al. (2017): training tricks might be as important as model designs.
Nevertheless, we emphasize the novelty of this work: the previous models mostly achieve dimension reduction by imposing pre-designed hard constraints (Bordes et al., 2013; Trouillon et al., 2016; Nickel et al., 2016b; Xie et al., 2017; Dettmers et al., 2018), where the constraints themselves are not learned from data; in contrast, our approach of jointly training an autoencoder does not impose any explicit hard constraints, so it leads to more flexible modeling.
Moreover, we additionally focus on leveraging composition in KBC. Although this idea has been explored before (Guu et al., 2015; Neelakantan et al., 2015; Lin et al., 2015a), our discussion of the concept of compositional constraints and its connection to dimension reduction has not been addressed similarly in previous research. In experiments, we show (Sec.6.2, 6.3) that joint training with an autoencoder indeed helps find compositional constraints and benefits from compositional training.
Autoencoders have been used on their own for learning distributed representations of syntactic trees (Socher et al., 2011), words and images (Silberer and Lapata, 2014), and semantic roles (Titov and Khoddam, 2015). They have also been used for pre-training other deep neural networks (Erhan et al., 2010). However, when combined with other models, the learning of autoencoders, or more generally of sparse codings (Rubinstein et al., 2010), is usually conducted in an alternating manner, fixing one part of the model while optimizing the other, as in Xie et al. (2017). To our knowledge, joint training with an autoencoder has not previously been widely used for reducing dimensionality.
Jointly training an autoencoder is not simple, because it takes non-stationary inputs. In this work, we modified SGD so that it shares traits with some modern optimization algorithms such as Adagrad (Duchi et al., 2011), in that both set different learning rates for different parameters. While Adagrad sets them adaptively by keeping track of gradients for all parameters, our modification of SGD is more efficient and lets us form a rough intuition about which parameter gets how much update. We believe our techniques and findings on joint training with an autoencoder could help reduce dimensionality and improve interpretability in other neural network architectures as well.

Experiments
We evaluate on standard KBC datasets: WN18 and FB15k (Bordes et al., 2013), WN18RR (Dettmers et al., 2018), and FB15k-237 (Toutanova and Chen, 2015). The statistical information of these datasets is shown in Table 1. WN18 collects word relations from WordNet (Miller, 1995), and FB15k is taken from Freebase (Bollacker et al., 2008); both have filtered out low frequency entities. However, Toutanova and Chen (2015) report that both WN18 and FB15k have information leaks, because the inverses of some test triples appear in the training set. FB15k-237 and WN18RR fix this problem by deleting such triples from the training and test data. In this work, we do evaluate on WN18 and FB15k, but our models are mainly tuned on FB15k-237.

For all datasets, we set the dimensions d = 256 and c = 16, and the SGD hyper-parameters η1 = 1/64, η2 = 2⁻¹⁴ and λ1 = λ2 = 2⁻¹⁴. The training batch size is 32, and the triples in each batch share the same head entity. We compare the base model (BASE) to our joint training with an autoencoder (JOINT), and the base model with compositional training (BASE+COMP) to our joint model with compositional training (JOINT+COMP). When compositional training is enabled (BASE+COMP, JOINT+COMP), we use a random walk to sample paths of length 1 + X, where X is drawn from a Poisson distribution with mean λ = 1.0.
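The path sampling procedure can be sketched as follows (the adjacency-list KB format and the use of Knuth's Poisson sampler are our assumptions for illustration):

```python
import math
import random

def sample_path(kb, length_mean=1.0, rng=None):
    """Random-walk sampling of a relation path of length 1 + X,
    X ~ Poisson(length_mean). `kb` maps each entity to a list of
    (relation, tail) edges (a hypothetical adjacency-list format)."""
    rng = rng or random.Random(0)

    # Knuth's Poisson sampler (fine for small means)
    L, k, p = math.exp(-length_mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            break
        k += 1

    head = rng.choice(sorted(kb))
    rels, node = [], head
    for _ in range(1 + k):
        if not kb.get(node):
            break  # dead end: truncate the walk
        r, node = rng.choice(kb[node])
        rels.append(r)
    return head, rels, node
```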
For an incomplete triple ⟨h, r, ?⟩ in the KBC test, we calculate the score s(h, r, e) from (1) for every entity e ∈ E such that ⟨h, r, e⟩ does not appear in any of the training, validation, or test sets (Bordes et al., 2013). Then, the calculated scores, together with s(h, r, t) for the gold triple, are converted to ranks, and the rank of the gold entity t is used for evaluation. Evaluation metrics include Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits at 10 (H10). Lower MR, higher MRR, and higher H10 indicate better performance.
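The filtered ranking protocol and the three metrics can be sketched as:

```python
import numpy as np

def filtered_rank(scores, gold, known_tails):
    """Rank of the gold tail, filtering out entities e (other than gold)
    for which <h, r, e> is already a known fact.

    scores      : array of s(h, r, e) for every entity e
    gold        : index of the gold tail entity
    known_tails : indices of entities e with <h, r, e> known, to exclude
    """
    mask = np.ones(len(scores), dtype=bool)
    mask[list(known_tails)] = False
    mask[gold] = True  # the gold entity itself always competes
    return int((scores[mask] > scores[gold]).sum()) + 1

def metrics(ranks):
    ranks = np.asarray(ranks, dtype=float)
    return {"MR": ranks.mean(),          # lower is better
            "MRR": (1.0 / ranks).mean(),  # higher is better
            "H10": (ranks <= 10).mean()}  # higher is better

# entity 0 scores higher than gold (entity 2) but is filtered out:
assert filtered_rank(np.array([0.9, 0.5, 0.7, 0.1]), gold=2, known_tails={0}) == 1
```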
We consult MR and MRR on validation sets to determine training epochs; we stop training when both MR and MRR have stopped improving.

KBC Results
The results are shown in Table 2. We found that joint training with an autoencoder mostly improves performance, and the improvement becomes clearer when compositional training is enabled (i.e., JOINT ≥ BASE and JOINT+COMP > BASE+COMP). This is convincing because, generally, joint training contributes through its regularizing effects, so drastic improvements are less expected.³

³ The source code and trained models are publicly released at https://github.com/tianran/glimvec.

Figure 2: Examples of relation codings learned from FB15k-237. Each row shows a 16 dimension vector encoding a relation; the rows shown include profession, profession⁻¹, film_crew_role⁻¹, film_release_region⁻¹, film_language⁻¹, nationality; currency_of_country, currency_of_company, currency_of_university, currency_of_film_budget; and currency_of_film_budget, release_region_of_film, corporation_of_film, producer_of_film, writer_of_film. Vectors are normalized such that their entries sum to 1.

When compositional training is enabled,
the system usually achieves better MR, though it does not always improve in the other measures. The performance gains are more obvious on the WN18RR and FB15k-237 datasets, possibly because WN18 and FB15k contain many easy instances that can be solved by a simple rule (Dettmers et al., 2018). Furthermore, the numbers demonstrated by our joint and base models are among the strongest in the literature. We have conducted re-experiments of several representative algorithms, and also compare with state-of-the-art published results. For the re-experiments, we use Lin et al. (2015b)'s implementation of TransE (Bordes et al., 2013) and TransR, which represent relations as vector translations, and Nickel et al. (2016b)'s implementation of RESCAL (Nickel et al., 2011) and HolE, where RESCAL is the most similar to our BASE model and HolE is a more parameter-efficient variant. We experimented with the default settings, and found that our models outperform most of them.
Table 2: KBC results on the WN18, FB15k, WN18RR, and FB15k-237 datasets. The first and second sectors compare our joint to the base models with and without compositional training, respectively; the third sector shows our re-experiments and the fourth shows previously published results. Bold numbers are the best in each sector, and (*) indicates the best of all.

Among the published results, STransE (Nguyen et al., 2016), ITransF (Xie et al., 2017), ComplEx (Trouillon et al., 2016) and ConvE (Dettmers et al., 2018) were previously the best. Our models mostly outperform them. Other results include Kadlec et al. (2017)'s simple but strong baseline, and several recent models (Schlichtkrull et al., 2017; Shi and Weninger, 2017; Shen et al., 2017) which achieve the best results on FB15k or WN18 in some measure. Our models have comparable results.

Intuition and Insight
What does the autoencoder look like? How does joint training affect relation matrices? We address these questions by analyses showing that (i) the autoencoder learns sparse and interpretable codings of relations, (ii) the joint training drives relation matrices toward a low dimension manifold, and (iii) it helps discovering compositional constraints.

Sparse Coding and Interpretability
Due to the ReLU function in (2), our autoencoder learns sparse coding, with most relations having large code values at only two or three dimensions. This sparsity makes it easy to find patterns in the model that to some extent explain the semantics of relations. Figure 2 shows some examples.
In the first group of Figure 2, we show a small number of relations that are almost always assigned a near one-hot coding, regardless of initialization. These are high frequency relations joining two large categories (e.g. film and language), which probably constitute the skeleton of a KB.
In the second group, we found that the 12th dimension strongly correlates with currency; in the third group, the 4th dimension strongly correlates with film. As for the relation currency of film budget, it has large code values at both dimensions. This kind of relation clustering also seems independent of initialization. Intuitively, it shows that the autoencoder may discover similarities between relations and promote indirect parameter sharing among them. Yet, as the autoencoder only reconstructs approximations of relation matrices and never constrains them to be exactly equal to the original, relation matrices with very similar codings may still differ considerably. For example, producer of film and writer of film have codings with cosine similarity 0.973, but their relation matrices only have a cosine similarity of 0.338.

Low dimension manifold
In order to visualize the relation matrices learned by our joint and base models, we use UMAP (McInnes and Healy, 2018) to embed M_r into a 2D plane (we also tried t-SNE (van der Maaten and Hinton, 2008), but found UMAP more insightful). We use relation matrices trained on FB15k-237, and compare models trained for the same number of epochs. The results are shown in Figure 3.
We can see that Figures 3a and 3c are mostly similar, with high frequency relations scattered randomly around a low frequency cluster, suggesting that they come from various directions of a high dimension space, with frequent relations probably pulled further by the training updates. On the other hand, in Figures 3b and 3d we found less frequent relations clustered together with frequent ones, and multiple traces of low dimension structures. This suggests that joint training with an autoencoder indeed drives relations toward a low dimension manifold. In addition, Figure 3d shows structures different from those in Figure 3b, which we conjecture could be related to the compositional constraints discovered by compositional training.

Compositional constraints
In order to directly evaluate a model's ability to find compositional constraints, we extracted from FB15k-237 a list of (r1/r2, r3) pairs such that r1/r2 matches r3. Formally, the list is constructed as follows. For any relation r, we define a content set C(r) as the set of (h, t) pairs such that ⟨h, r, t⟩ is a fact in the KB. Similarly, we define C(r1/r2) as the set of (h, t) pairs such that ⟨h, r1/r2, t⟩ is a path. We regard (r1/r2, r3) as a compositional constraint if their content sets are similar; that is, if |C(r1/r2) ∩ C(r3)| ≥ 50 and the Jaccard similarity between C(r1/r2) and C(r3) is ≥ 0.4. Then, after filtering out degenerate cases such as r1 = r3 or r2 = r1⁻¹, we obtained a list of 154 compositional constraints, e.g. (currency_of_country/country_of_film, currency_of_film_budget).
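The constraint extraction can be sketched as follows (the set-of-triples KB format is an assumption):

```python
def content_set(facts, r):
    """C(r): all (h, t) with <h, r, t> in the KB (facts = set of triples)."""
    return {(h, t) for (h, rel, t) in facts if rel == r}

def composed_content_set(facts, r1, r2):
    """C(r1/r2): (h, t) pairs connected by a path h -r1-> m -r2-> t."""
    c1, c2 = content_set(facts, r1), content_set(facts, r2)
    return {(h, t) for (h, m) in c1 for (m2, t) in c2 if m == m2}

def is_constraint(facts, r1, r2, r3, min_overlap=50, min_jaccard=0.4):
    """(r1/r2, r3) is a compositional constraint if the content sets
    overlap enough and are Jaccard-similar enough."""
    c12, c3 = composed_content_set(facts, r1, r2), content_set(facts, r3)
    inter = c12 & c3
    if len(inter) < min_overlap:
        return False
    return len(inter) / len(c12 | c3) >= min_jaccard
```

On a full KB one would loop over candidate (r1, r2, r3) triples and then filter the degenerate cases mentioned above.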
For each compositional constraint (r1/r2, r3) in the list, we take the matrices M1, M2 and M3 corresponding to r1, r2 and r3 respectively, and rank M3 according to its cosine similarity with M1M2, among all relation matrices. Then, we calculate MR and MRR for evaluation. We compare the JOINT+COMP model to BASE+COMP, as well as to a randomized baseline in which M2 is instead selected randomly from the relation matrices of JOINT+COMP (RANDOMM2). The results are shown in Table 3. We evaluated 5 different random initializations for each model, trained for the same number of epochs, and report the mean and standard deviation. We verify that JOINT+COMP performs better than BASE+COMP, indicating that joint training with an autoencoder indeed helps discover compositional constraints. Furthermore, the random baseline RANDOMM2 tests the hypothesis that joint training might merely be clustering M3 and M1, to the extent that M3 and M1 are so close that even a random M2 gives the correct answer; but as it turns out, JOINT+COMP largely outperforms RANDOMM2, excluding this possibility. Thus, joint training performs better not simply because it clusters relation matrices; it indeed learns compositions.
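The ranking evaluation for a single constraint can be sketched as follows (cosine similarity is computed between flattened matrices, which is our reading; rank 1 means M1M2 is closest to M3):

```python
import numpy as np

def cos(X, Y):
    """Cosine similarity between two matrices, flattened to vectors."""
    x, y = X.reshape(-1), Y.reshape(-1)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def rank_of_composition(M1, M2, M3, all_matrices):
    """Rank of M3 among all relation matrices, by cosine similarity
    to the product M1 @ M2 (rank 1 = most similar)."""
    target = M1 @ M2
    sims = sorted((cos(target, M) for M in all_matrices), reverse=True)
    return sims.index(cos(target, M3)) + 1
```

From these per-constraint ranks one can then compute MR and MRR exactly as in the KBC evaluation.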

Losses and Gains
In the KBC task, where are the losses and what are the gains of different settings? With additional evaluations, we show that (i) some settings are crucial for the base model, and (ii) joint training with an autoencoder benefits more from compositional training.

Crucial settings for the base model

It is noteworthy that our base model already achieves strong results. This is due to several detailed but crucial settings discussed in Sec.4.1; Table 4 shows their gains on the FB15k-237 validation data. The most dramatic improvement comes from the regularizer that drives matrices toward orthogonality.

Gains with compositional training
One can force a model to focus more on (longer) compositions of relations by sampling longer paths in compositional training. Since joint training with an autoencoder helps discover compositional constraints, we expect it to be more helpful when the sampled paths are longer. In this work, path lengths are sampled from a Poisson distribution; we thus vary the mean λ of the Poisson distribution to control the strength of compositional training. The results on FB15k-237 are shown in Table 5.
We can see that, as λ gets larger, MR improves considerably but MRR drops slightly. This suggests that in FB15k-237, composition of relations might mainly help in finding more appropriate candidates for a missing entity, rather than in pinpointing the correct one. Yet, joint training improves the base models even more as the paths get longer, especially in MR. This further supports our conjecture that joint training with an autoencoder may strongly interact with compositional training.

Conclusion
We have investigated a dimension reduction technique that trains a KB embedding model jointly with an autoencoder. We have developed new training techniques and achieved state-of-the-art results on several KBC tasks, with strong improvements in Mean Rank. Furthermore, we have shown that the autoencoder learns low dimension sparse codings that can be easily explained; that the joint training technique drives high-dimensional data toward low dimension manifolds; and that the reduction of dimensionality may interact strongly with composition, helping discover compositional constraints and benefiting from compositional training. We believe these findings provide insightful understanding of KB embedding models and might be applied to other neural networks beyond the KBC task.
Handling OOV Entities

Occasionally, a KBC test set may contain entities that never appear in the training data. Such out-of-vocabulary (OOV) entities pose a challenge to KBC systems; while some systems address this issue by explicitly learning an OOV entity vector (Dettmers et al., 2018), our approach is described below. For an incomplete triple ⟨h, r, ?⟩ in the test, if h is OOV, we replace it with the most frequent entity that has ever appeared as a head of relation r in the training data. If the gold tail entity is OOV, we use the zero vector for computing the score and the rank of the gold entity. Usually, OOV entities are rare and negligible in evaluation, except for the WN18RR test data, in which about 6.7% of the triples contain OOV entities. Hence, we also report adjusted scores on WN18RR in the setting where all triples with OOV entities are removed from the test set. The results are shown in Table 6.
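The head-substitution rule can be sketched as follows (the triple format and function names are hypothetical):

```python
from collections import Counter

def build_head_stats(train_triples):
    """Collect the entity vocabulary and, per relation, how often each
    entity appears as a head (assumed triple format: (h, r, t))."""
    vocab, heads = set(), {}
    for h, r, t in train_triples:
        vocab.update((h, t))
        heads.setdefault(r, Counter())[h] += 1
    return vocab, heads

def resolve_head(h, r, vocab, heads):
    """Back off an OOV head to the most frequent training head of r."""
    if h in vocab:
        return h
    return heads[r].most_common(1)[0][0]
```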