Neural Tensor Networks with Diagonal Slice Matrices

Although neural tensor networks (NTNs) have been successful in many NLP tasks, they require a large number of parameters to be estimated, which often leads to overfitting and long training times. We address these issues by applying eigendecomposition to each slice matrix of a tensor to reduce its number of parameters. First, we evaluate our proposed NTN models on knowledge graph completion. Second, we extend the models to recursive NTNs (RNTNs) and evaluate them on logical reasoning tasks. These experiments show that our proposed models learn better and faster than the original (R)NTNs.


Introduction
Alongside nonlinear activation functions, linear mapping by matrix multiplication is an essential component of neural network (NN) models, as it determines the feature interactions and thus the expressiveness of a model. In addition to matrix-based mapping, neural tensor networks (NTNs) (Socher et al., 2013a) employ a 3-dimensional tensor to capture direct interactions among input features. Due to the large expressive capacity of 3D tensors, NTNs have been successful in an array of natural language processing (NLP) and machine learning tasks, including knowledge graph completion (KGC) (Socher et al., 2013a), sentiment analysis (Socher et al., 2013b), and reasoning with logical semantics (Bowman et al., 2015). However, since a 3D tensor has a large number of parameters, NTNs take longer to train than other NN models. Moreover, the millions of parameters often make the model suffer from overfitting (Yang et al., 2015).
To solve these problems, we propose two new parameter reduction techniques for NTNs. These techniques drastically decrease the number of parameters in an NTN without diminishing its expressiveness. We use the matrix decomposition techniques that were utilized for KGC by Yang et al. (2015) and Trouillon et al. (2016). Yang et al. (2015) imposed a constraint that the matrix in the bilinear term of their model had to be diagonal. As explained in a subsequent section, this is essentially equal to assuming that the matrix is symmetric and performing eigendecomposition. Trouillon et al. (2016) also applied eigendecomposition to a matrix by regarding it as the real part of a normal matrix. Following these studies, we perform simultaneous diagonalization on all slice matrices of an NTN tensor. As a result, mapping by a 3D (n × n × k) tensor is replaced with an array of k "triple inner products" of two input vectors and a weight vector. Thus, we obtain two new NTN models in which the number of parameters is reduced from O(n²k) to O(nk).
On a KGC task, these parameter-reduced NTNs (NTN-Diag and NTN-Comp) alleviate overfitting and outperform the original NTN. Moreover, our proposed NTNs can learn faster than the original NTN. We also show that our proposed models perform better and learn faster in a recursive setting by examining a logical reasoning task.

Background
We consider mapping in a neural network (NN) layer that takes two vectors as input, as in recursive neural networks.
Recurrent neural networks also have this structure, with one input vector being the hidden state from the previous time step. As the mapping before activation in an NN layer, linear mapping (matrix multiplication) is commonly used:

g(x_1, x_2) = W_1 x_1 + W_2 x_2.

Here, since x_1, x_2 ∈ R^n and W_1, W_2 ∈ R^{k×n}, this linear mapping is a transformation from R^{2n} to R^k. Linear mapping, a standard component of NNs, has been applied successfully in many tasks. However, it cannot consider the interaction between different components of the two input vectors, which makes it less suitable for modeling complex compositional structures such as trees and graphs.
To alleviate this problem, some models, such as NTNs (Socher et al., 2013a), have explored 3D tensors to yield a more expressive mapping:

g(x_1, x_2) = x_1^T W^{[1:k]} x_2,

where W^{[1:k]} ∈ R^{n×n×k}. The output of this mapping is an array of k bilinear products of the form x_1^T W^[i] x_2. Thus, this is also a transformation from R^{2n} to R^k. The i-th element of the output equals the sum of the elements of W^[i] ⊙ (x_1 ⊗ x_2), where ⊙ and ⊗ represent, respectively, the Hadamard and the outer products. Hence this mapping captures the direct interaction between different components (or "features") of the two input vectors. Thanks to this expressiveness, NTNs are effective in tasks such as knowledge graph completion (Socher et al., 2013a), sentiment analysis (Socher et al., 2013b), and logical reasoning (Bowman et al., 2015).
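As a concrete illustration, the tensor-based bilinear mapping and its Hadamard/outer-product view can be sketched in a few lines of NumPy (function names and shapes here are illustrative, not from the original paper):

```python
import numpy as np

def tensor_bilinear_map(x1, x2, W):
    """Bilinear mapping by a 3D tensor W of shape (k, n, n).

    Output element i is x1^T W[i] x2.
    """
    return np.einsum('n,inm,m->i', x1, W, x2)

def tensor_bilinear_map_outer(x1, x2, W):
    # The same result via the Hadamard/outer-product view from the text:
    # each output element is the sum of W[i] ⊙ (x1 ⊗ x2).
    return np.array([(W_i * np.outer(x1, x2)).sum() for W_i in W])

rng = np.random.default_rng(0)
n, k = 4, 3
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
W = rng.standard_normal((k, n, n))
assert np.allclose(tensor_bilinear_map(x1, x2, W),
                   tensor_bilinear_map_outer(x1, x2, W))
```

Both routes produce the same k-dimensional output; the einsum form is simply the vectorized computation.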
Although mapping by a 3D tensor provides expressiveness, it has a large number, O(n²k), of parameters. Due to this, NTNs often suffer from overfitting and long training times.

Simple Matrix Decomposition (SMD)
To reduce the number of parameters of a slice matrix W^[i] ∈ R^{n×n} in a tensor, simple matrix decomposition (SMD) is commonly used (Bai et al., 2009). SMD factorizes W^[i] into a product of two low-rank matrices S^[i] ∈ R^{n×m} and T^[i] ∈ R^{m×n} (m ≪ n):

W^[i] ≈ S^[i] T^[i].  (1)

By plugging Eq. (1) into the bilinear term x_1^T W^[i] x_2, we obtain the approximation x_1^T S^[i] T^[i] x_2. SMD reduces the number of parameters of W^[i] from n² to 2nm. However, the dimension m of S^[i] and T^[i] is a hyperparameter and must be determined prior to training.
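A minimal sketch of how SMD trades a full slice matrix for two low-rank factors, assuming the factorization of Eq. (1); note that the bilinear term never requires the full n × n product to be materialized:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 5                            # m << n
S = rng.standard_normal((n, m))
T = rng.standard_normal((m, n))
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

# Bilinear term with the factorized slice, computed in O(nm) time.
low_rank = (x1 @ S) @ (T @ x2)

# Reference: materialize W = S T and compute x1^T W x2 in O(n^2).
full = x1 @ (S @ T) @ x2
assert np.isclose(low_rank, full)

# Parameter counts per slice: 2nm for the factors vs. n^2 for W.
assert S.size + T.size == 2 * n * m
```

Associativity of matrix products is what makes the cheap evaluation order valid.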

Simultaneous Diagonalization
This section introduces two techniques that simultaneously diagonalize all slice matrices W^[1], ..., W^[k] ∈ R^{n×n}. As described in Liu et al. (2017), we make use of the fact that if matrices V^{[1:k]} form a commuting family, i.e., V^[i] V^[j] = V^[j] V^[i] for all i, j ∈ {1, 2, ..., k}, they can be diagonalized by a shared orthogonal or unitary matrix. Both techniques reduce the number of parameters of each W^[i] from O(n²) to O(n).

Orthogonal Diagonalization
Many NLP datasets contain symmetric patterns. For example, if the binary relation (Bob, is relative of, Alice) holds in a knowledge graph, then (Alice, is relative of, Bob) should also hold in it. The English phrases "dog and cat" and "cat and dog" have identical meaning. For such symmetric structures, we can reasonably suppose that each slice matrix W^[i] of a 3D tensor is symmetric. Because a symmetric W^[i] ∈ R^{n×n} can be diagonalized by an orthogonal matrix Q as

W^[i] = Q diag(w^[i]) Q^T,

and because the commuting-family assumption lets Q be shared across all slices, we can rewrite each bilinear term x_1^T W^[i] x_2 as follows:

x_1^T W^[i] x_2 = ⟨y_1, w^[i], y_2⟩,  (2)

where y_j = Q^T x_j, w^[i] ∈ R^n, and ⟨a, b, c⟩ denotes a "triple inner product" defined by ⟨a, b, c⟩ = Σ_{l=1}^{n} a_l b_l c_l. Since the shared Q can be absorbed into the mapping that produces the input vectors, this reduces the number of parameters in a single slice matrix from n² to n.
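The claim that a symmetric slice collapses to a triple inner product can be checked numerically; the sketch below builds a random symmetric matrix, eigendecomposes it, and verifies Eq. (2) (variable names are mine):

```python
import numpy as np

def triple_inner(a, b, c):
    # <a, b, c> = sum_l a_l b_l c_l
    return np.sum(a * b * c)

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
W = A + A.T                       # a symmetric slice matrix
w, Q = np.linalg.eigh(W)          # W = Q diag(w) Q^T, Q orthogonal

x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y1, y2 = Q.T @ x1, Q.T @ x2       # rotate inputs into the eigenbasis

# The n^2-parameter bilinear form collapses to an n-parameter triple product.
assert np.isclose(x1 @ W @ x2, triple_inner(y1, w, y2))
```

In the full model, `Q` is shared across slices, which is why only the k eigenvalue vectors remain as tensor parameters.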

Unitary Diagonalization
Since most structures in NLP data are not symmetric, the symmetric matrix assumption is often violated. To obtain a more expressive diagonal representation, we regard each slice matrix W^[i] as the real part of a complex matrix and consider its eigendecomposition.
For any real matrix W^[i], there exists a complex normal matrix Z^[i] whose real part is equal to it: W^[i] = ℜ(Z^[i]). Here ℜ(·) denotes the operation that takes the real part of a complex number, vector, or matrix. Further, any complex normal matrix can be diagonalized by a unitary matrix. With these two properties, any real matrix W^[i] can be diagonalized as follows (Trouillon et al., 2016):

W^[i] = ℜ(U diag(w^[i]) U^†),

where U^† denotes the conjugate transpose of U. Assuming that U is the same unitary matrix for all slice matrices, we can rewrite every bilinear term x_1^T W^[i] x_2 as follows:

x_1^T W^[i] x_2 = ℜ(⟨y_1, w^[i], ȳ_2⟩),  (3)

where y_j = U^T x_j, w^[i] ∈ C^n, and ā denotes the complex conjugate of a. This technique reduces the number of parameters of each slice matrix from n² to 2n. As the right-hand side of Eq. (3) shows, ℜ(⟨y_1, w^[i], ȳ_2⟩) can be computed as the sum of three triple inner products of real vectors minus a fourth:

ℜ(⟨y_1, w^[i], ȳ_2⟩) = ⟨ℜ(y_1), ℜ(w^[i]), ℜ(y_2)⟩ + ⟨ℜ(y_1), ℑ(w^[i]), ℑ(y_2)⟩ + ⟨ℑ(y_1), ℜ(w^[i]), ℑ(y_2)⟩ − ⟨ℑ(y_1), ℑ(w^[i]), ℜ(y_2)⟩.
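The expansion of the triple Hermitian inner product into four real triple inner products can likewise be verified numerically (an illustrative sketch, not the paper's code):

```python
import numpy as np

def triple_inner(a, b, c):
    return np.sum(a * b * c)

rng = np.random.default_rng(1)
n = 6
y1 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y2 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
w  = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Triple Hermitian inner product: Re(<y1, w, conj(y2)>).
hermitian = np.real(np.sum(y1 * w * np.conj(y2)))

# Its expansion into four real triple inner products.
a, b = y1.real, y1.imag
p, q = w.real, w.imag
c, d = y2.real, y2.imag
expanded = (triple_inner(a, p, c) + triple_inner(a, q, d)
            + triple_inner(b, p, d) - triple_inner(b, q, c))
assert np.isclose(hermitian, expanded)
```

This is the same identity that lets the complex formulation be implemented entirely with real-valued arithmetic.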

Neural Network Models
This section introduces the baseline and our proposed models. After describing them, we explain how to extend them to handle compositional structures such as binary trees. First, we describe a standard single-layer neural network (NN) model for two vectors x_1, x_2 ∈ R^n. The model uses a linear mapping V ∈ R^{k×2n} to combine the two input vectors:

y = f(V [x_1; x_2] + b),

where [x_1; x_2] ∈ R^{2n} is the concatenation of x_1 and x_2, b ∈ R^k is a bias term, and f is a non-linear activation function. The NN model has only (2n + 1)k parameters, and it does not consider the direct interactions between x_1 and x_2.

Neural Tensor Network (NTN)
Socher et al. (2013a) proposed the neural tensor network (NTN) model, which uses a 3D tensor W^{[1:k]} ∈ R^{n×n×k} to combine two input vectors:

y = f(x_1^T W^{[1:k]} x_2 + V [x_1; x_2] + b).

Unlike the standard NN model, the NTN can directly relate the two input vectors through the tensor. However, it has many parameters: (n² + 2n + 1)k.
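A minimal NumPy sketch of one NTN layer as defined above, with tanh standing in for the activation f (shapes and names are illustrative):

```python
import numpy as np

def ntn_layer(x1, x2, W, V, b):
    """Single NTN layer: y = f(x1^T W^[1:k] x2 + V [x1; x2] + b).

    W: (k, n, n) tensor, V: (k, 2n) matrix, b: (k,) bias.
    """
    bilinear = np.einsum('n,inm,m->i', x1, W, x2)
    linear = V @ np.concatenate([x1, x2])
    return np.tanh(bilinear + linear + b)

rng = np.random.default_rng(0)
n, k = 4, 3
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
W = rng.standard_normal((k, n, n))
V = rng.standard_normal((k, 2 * n))
b = rng.standard_normal(k)
y = ntn_layer(x1, x2, W, V, b)
assert y.shape == (k,)

# Parameter count matches the (n^2 + 2n + 1)k figure from the text.
assert W.size + V.size + b.size == (n**2 + 2 * n + 1) * k
```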

NTN-SMD
Although the NTN model has tremendous expressive power, it is extremely time-consuming to compute, since a naive 3D tensor product incurs O(n²k) computation time. To overcome this weakness, Zhao et al. (2015), among others, introduced simple matrix decomposition (SMD) into the NTN model by replacing each slice matrix W^[i] with its factorized approximation given by Eq. (1):

y = f(x_1^T S^{[1:k]} T^{[1:k]} x_2 + V [x_1; x_2] + b).

When m ≪ n, the NTN-SMD model drastically reduces the number of parameters compared to the original NTN model, i.e., from (n² + 2n + 1)k to (2mn + 2n + 1)k.

NTNs with Diagonal Slice Matrices
In this paper, we introduce two new NTN models, NTN-Diag and NTN-Comp, both of which reduce the number of parameters in a 3D tensor further than NTN-SMD does, with little loss in generalization performance. Table 1 summarizes the number of parameters in each model.

NTN-Diag
We replace every slice matrix W^[i] of W^{[1:k]} with the triple inner product formulation of Eq. (2) by assuming that the slices are symmetric and commuting. As a result, we derive the following new NTN formulation:

y_i = f(⟨x_1, w^[i], x_2⟩ + V_i [x_1; x_2] + b_i), i = 1, ..., k.

Thus, under the symmetric and commuting matrix constraints, we regard mapping by a 3D tensor as an array of k triple inner products. The total number of parameters is just (3n + 1)k.
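The NTN-Diag layer then needs only a (k, n) matrix of diagonal vectors in place of the (k, n, n) tensor; a sketch under the shapes above:

```python
import numpy as np

def ntn_diag_layer(x1, x2, w, V, b):
    """NTN-Diag sketch: the tensor term is k triple inner products.

    w: (k, n) -- one diagonal (eigenvalue) vector per slice,
    V: (k, 2n), b: (k,).
    """
    tensor_term = (w * x1 * x2).sum(axis=1)   # <x1, w^[i], x2> for each i
    return np.tanh(tensor_term + V @ np.concatenate([x1, x2]) + b)

rng = np.random.default_rng(0)
n, k = 4, 3
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
w = rng.standard_normal((k, n))
V = rng.standard_normal((k, 2 * n))
b = rng.standard_normal(k)
assert ntn_diag_layer(x1, x2, w, V, b).shape == (k,)

# Parameter count: (3n + 1)k, down from (n^2 + 2n + 1)k.
assert w.size + V.size + b.size == (3 * n + 1) * k
```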

NTN-Comp
By assuming that the slice matrices W^[i] are the real parts of normal matrices forming a commuting family, we can replace each slice of the tensor term in the NTN with the triple Hermitian inner product shown in Eq. (3):

y_i = f(ℜ(⟨x_1, w^[i], x̄_2⟩ + V_i [x_1; x_2]) + b_i), i = 1, ..., k,

where x_1, x_2, w^[i] ∈ C^n. Similar to NTN-Diag, we regard mapping by a 3D tensor as an array of k triple Hermitian inner products. The total number of parameters is just (6n + 1)k. As is clear from its form, NTN-Diag is a special case of NTN-Comp in which the vectors x_1, x_2, and w^[i] are constrained to be real.
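A sketch of an NTN-Comp layer under the assumptions above (complex inputs and weights, a real bias, and the real part taken before the activation; the exact placement of ℜ(·) is my reading of the formulation):

```python
import numpy as np

def ntn_comp_layer(x1, x2, w, V, b):
    """NTN-Comp sketch with complex inputs and parameters.

    x1, x2: complex (n,), w: complex (k, n),
    V: complex (k, 2n), b: real (k,).
    Each output is Re(<x1, w^[i], conj(x2)>) plus a linear term.
    """
    tensor_term = np.real((w * x1 * np.conj(x2)).sum(axis=1))
    linear_term = np.real(V @ np.concatenate([x1, x2]))
    return np.tanh(tensor_term + linear_term + b)

rng = np.random.default_rng(0)
n, k = 4, 3
cx = lambda shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
x1, x2, w, V = cx(n), cx(n), cx((k, n)), cx((k, 2 * n))
b = rng.standard_normal(k)
y = ntn_comp_layer(x1, x2, w, V, b)
assert y.shape == (k,) and np.isrealobj(y)

# Real parameter count: 2nk (w) + 4nk (V) + k (b) = (6n + 1)k.
assert 2 * w.size + 2 * V.size + b.size == (6 * n + 1) * k
```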

Recursive Neural Tensor Networks
We extend the above NTN models to handle compositional structures. As a representative compositional structure, we consider a binary tree in which each NTN layer computes a vector representation for a node by combining the two vectors of its child nodes in the lower layer. Except for NTN-Comp, the models implement mappings R^n × R^n → R^k, so each layer can receive its lower layer's output directly if k equals n; thus these models require no modification. However, NTN-Comp cannot receive its lower layer's output as is, because NTN-Comp is a mapping to R^k from complex inputs in C^n. To solve this problem, we set k to 2n and treat the output y′ ∈ R^{2n} as the concatenation of the vectors representing the real and imaginary parts of y ∈ C^n: y′ = [ℜ(y); ℑ(y)]. Note that this approach is valid because Eq. (3) can be defined in real vector space by transforming the complex vectors in C^n into real vectors in R^{2n}.
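Recursive composition over a binary tree can be sketched as follows, using the NTN-Diag layer with k = n so that each node's output feeds the layer above (the tree encoding and names are illustrative):

```python
import numpy as np

def ntn_diag_layer(x1, x2, w, V, b):
    # Triple-inner-product tensor term plus a linear term, as in NTN-Diag.
    return np.tanh((w * x1 * x2).sum(axis=1)
                   + V @ np.concatenate([x1, x2]) + b)

def compose(tree, params, leaf_vecs):
    """Recursively compute a node vector for a binary tree.

    A tree is either a leaf symbol or a (left, right) pair; with
    k = n, every layer's output can feed the layer above directly.
    """
    if isinstance(tree, str):
        return leaf_vecs[tree]
    left, right = tree
    return ntn_diag_layer(compose(left, params, leaf_vecs),
                          compose(right, params, leaf_vecs), *params)

rng = np.random.default_rng(0)
n = 4                                   # k = n for recursion
params = (rng.standard_normal((n, n)),  # w: (k, n) with k = n
          rng.standard_normal((n, 2 * n)),
          rng.standard_normal(n))
leaf_vecs = {s: rng.standard_normal(n) for s in ('p1', 'p2', 'p3')}
vec = compose((('p1', 'p2'), 'p3'), params, leaf_vecs)
assert vec.shape == (n,)
```

In the actual experiments, separate parameter sets would be used per operator (e.g., `and` vs. `or`); a single set is used here for brevity.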

Related Work Knowledge Graph Completion
In KGC, researchers usually design a scoring function Φ for a given triplet (s, r, o) to judge whether it is a fact or not. Here (s, r, o) denotes that entity s is linked to entity o by relation r. RESCAL (Nickel et al., 2011) uses e_s^T W_r e_o as Φ, where e_s, e_o are entity embedding vectors and W_r is an embedding matrix for relation r. This bilinear operation is effective for the task, but its computational cost is high and it suffers from overfitting. To overcome these problems, DistMult (Yang et al., 2015) adopts the triple inner product ⟨e_s, w_r, e_o⟩ as Φ, where w_r is an embedding vector of relation r. This solves those problems, but it degrades the model's ability to capture the directionality of relations, because the scoring function of DistMult is symmetric with respect to s and o; i.e., ⟨e_s, w_r, e_o⟩ = ⟨e_o, w_r, e_s⟩. To reconcile the complexity and expressiveness of a model, ComplEx (Trouillon et al., 2016) uses complex vectors for entity and relation embeddings. As scoring function Φ, they adopted the triple Hermitian inner product ℜ(⟨e_s, w_r, ē_o⟩), where ē_o denotes the complex conjugate of e_o. Since ℜ(⟨e_s, w_r, ē_o⟩) ≠ ℜ(⟨e_o, w_r, ē_s⟩), ComplEx solves the expressiveness problem of DistMult without requiring full matrices as relation embeddings. We can regard DistMult as a special case of RESCAL with a symmetric matrix constraint on W_r. ComplEx is also a RESCAL variant, with W_r taken as the real part of a normal matrix. Our research builds on these works, but to the best of our knowledge, no previous work has applied this approach to reduce the number of parameters in a tensor.
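The symmetry difference between DistMult and ComplEx is easy to demonstrate numerically (a sketch with generic random embeddings):

```python
import numpy as np

def distmult(e_s, w_r, e_o):
    # <e_s, w_r, e_o>: symmetric in subject and object.
    return np.sum(e_s * w_r * e_o)

def complex_score(e_s, w_r, e_o):
    # Re(<e_s, w_r, conj(e_o)>): generally asymmetric.
    return np.real(np.sum(e_s * w_r * np.conj(e_o)))

rng = np.random.default_rng(0)
n = 8
e_s, e_o, w_r = (rng.standard_normal(n) for _ in range(3))
# DistMult cannot distinguish (s, r, o) from (o, r, s) ...
assert np.isclose(distmult(e_s, w_r, e_o), distmult(e_o, w_r, e_s))

cx = lambda: rng.standard_normal(n) + 1j * rng.standard_normal(n)
es_c, eo_c, wr_c = cx(), cx(), cx()
# ... while ComplEx can, so it models directed relations.
assert not np.isclose(complex_score(es_c, wr_c, eo_c),
                      complex_score(eo_c, wr_c, es_c))
```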

NN Architectures
To give additional expressive power to standard (R)NNs, many architectures have been proposed, such as LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), and CNN (LeCun et al., 1998). NTN (Socher et al., 2013a) and RNTN (Socher et al., 2013b) are other such architectures. However, (R)NTNs differ in that they only add 3D tensor mapping to standard neural networks. Thus, they can also be regarded as a powerful basic component of NNs, because 3D tensor mapping can be applied to more complicated architectures such as those listed above.

Parameter Reduction in NN
Several researchers have reduced the number of parameters of NNs by using specific parameter-sharing mechanisms. Cheng et al. (2015) used circulant matrix mapping instead of conventional linear mapping and improved the time complexity of the matrix-vector product by using the Fast Fourier Transform (FFT).

Knowledge Graph Completion

To evaluate their performance for link prediction on knowledge graphs, we compared our proposed methods (NTN-Diag and NTN-Comp) to baseline methods (NTN (Socher et al., 2013a) and NTN-SMD).

Task
Let E and R denote sets of entities and relations, respectively. A relational triplet, or simply a triplet, (s, r, o) is a triple with s, o ∈ E and r ∈ R. It represents the proposition that relation r holds between subject entity s and object entity o. A triplet is called a fact if the proposition it denotes is true.
A knowledge graph is a collection of knowledge triplets, with the understanding that all its member triplets are facts. It is called a graph because each triplet can be regarded as an edge in a directed graph; the vertices in this graph represent entities in E, and each edge is labeled by a relation in R. Let G be a knowledge graph, viewed as a collection of facts. Knowledge graph completion (KGC) is the task of predicting whether unknown triplet (s ′ , r ′ , o ′ ) ̸ ∈ G such that s ′ , o ′ ∈ E, r ′ ∈ R is a fact or not.

Models and Loss Function
The standard approach to KGC is to design a score function Φ : E × R × E → R that assigns a large value when a triplet seems to be a fact. Socher et al. (2013a) defined it as follows.
Φ(s, r, o) = u_r^T f(e_s^T W_r^{[1:k]} e_o + V_r [e_s; e_o] + b_r)

Here, e_s, e_o ∈ R^n are entity embeddings, and W_r, V_r, b_r, u_r are parameters for each relation r. u_r is a k-dimensional vector that maps f's output in R^k to a score in R, and f is the hyperbolic tangent. To compare the performance of the baselines and the proposed models, we change only the mapping before the activation. For NTN-SMD, we replace the term e_s^T W_r^{[1:k]} e_o with e_s^T S_r^{[1:k]} T_r^{[1:k]} e_o; for NTN-Diag and NTN-Comp, we replace it with the triple (Hermitian) inner products of Eqs. (2) and (3), where we assume all slice matrices of the tensors across relations form a commuting family. The loss function used to train the models is shown below:

L = Σ_{i=1}^{N} Σ_{c=1}^{C} max(0, 1 − Φ(T^(i)) + Φ(T_c^(i))) + λ∥Ω∥²_2

where λ∥Ω∥²_2 is an L2 regularization term over all parameters Ω, T^(i) denotes the i-th example of training data of size N, and T_c^(i) is one of C randomly sampled negative examples for the i-th training example. We generated negative samples of a triplet (s, r, o) by corrupting its subject or object entity. (Table 3 caption: we report Hits@n in the filtered setting; results marked * are those reported in Trouillon et al. (2016).)
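The max-margin loss with negative sampling can be sketched as follows, here with DistMult as a stand-in scoring function and hypothetical parameter names:

```python
import numpy as np

def distmult_score(e_s, w_r, e_o):
    # Stand-in scoring function; any Phi from the text fits here.
    return np.sum(e_s * w_r * e_o)

def margin_loss(pos_triplets, neg_triplets, params, lam=1e-4):
    """Max-margin loss with C negatives per positive (a sketch).

    pos_triplets: list of (s, r, o) index triples,
    neg_triplets: parallel list of lists of C corrupted triples,
    params: dict with entity matrix 'E' and relation matrix 'W'.
    """
    E, W = params['E'], params['W']
    loss = 0.0
    for (s, r, o), negs in zip(pos_triplets, neg_triplets):
        pos = distmult_score(E[s], W[r], E[o])
        for (sc, rc, oc) in negs:
            neg = distmult_score(E[sc], W[rc], E[oc])
            loss += max(0.0, 1.0 - pos + neg)   # hinge with margin 1
    # L2 regularization over all parameters Omega.
    loss += lam * (np.sum(E ** 2) + np.sum(W ** 2))
    return loss

rng = np.random.default_rng(0)
params = {'E': rng.standard_normal((5, 4)), 'W': rng.standard_normal((2, 4))}
pos = [(0, 0, 1)]
negs = [[(0, 0, 2), (3, 0, 1)]]         # corrupt object or subject
assert margin_loss(pos, negs, params) >= 0.0
```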

Experimental Setup
We used the WordNet (WN18) and Freebase (FB15k) datasets to verify the benefits of our proposed methods. The dataset statistics are given in Table 2. We selected hyperparameters based on Socher et al. (2013a) and Yang et al. (2015): for all models, the mini-batch size was set to 1000, the dimensionality of the entity vectors to d = 100, and the regularization parameter to 0.0001; the tensor slice size was set to k = 4 for all models, except that for NTN we also tested k = 1 to see the influence of the slice size on performance. We performed 300 epochs of training for WordNet and 100 for Freebase using Adagrad (Duchi et al., 2011) with the initial learning rate set to 0.1.
For evaluation, we removed the subject or object entity of each test example and replaced it with every entity in E. We computed the scores of these corrupted triplets and ranked them in descending order of score. We report results in both the filtered and raw settings. In the filtered setting, given test example (s, r, o), we remove from the ranking all other positive triplets that appear in the training, validation, or test dataset, whereas the raw metrics do not remove these triplets.
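The filtered ranking protocol can be sketched as follows (a toy illustration with a hand-made score function, not the experimental code):

```python
def filtered_rank(test_triplet, all_true, score_fn, num_entities,
                  corrupt='object'):
    """Rank of the true entity in the filtered setting (a sketch).

    Other known-true triplets are removed from the candidate list
    before ranking, so they cannot push the test answer down.
    """
    s, r, o = test_triplet
    target = o if corrupt == 'object' else s
    scores = []
    for e in range(num_entities):
        cand = (s, r, e) if corrupt == 'object' else (e, r, o)
        if cand != test_triplet and cand in all_true:
            continue                      # filter out other positives
        scores.append((score_fn(*cand), e))
    scores.sort(reverse=True)             # descending by score
    return 1 + [e for _, e in scores].index(target)

# Toy check over 4 entities with hard-coded object scores.
true_set = {(0, 0, 1), (0, 0, 2)}
score = lambda s, r, o: {0: 0.2, 1: 0.5, 2: 0.9, 3: 0.1}[o]
# Without filtering, entity 2 (score 0.9) outranks the answer 1;
# with (0, 0, 2) filtered out, the test triplet ranks first.
assert filtered_rank((0, 0, 1), true_set, score, 4) == 1
```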

Result
Experimental results are shown in Table 3. We observe the following: • The performance of NN and NTNs differs considerably; apparently, NN is inadequate for this task.
• By comparing the results of NTNs with different slice sizes, we see that k = 4 performs better than k = 1.
• NTN-SMDs perform better than NN but are all inferior to NTNs, although their results improved as m (the rank of the decomposed matrices) increased.
• NTN-Diag achieved better results than NTN, although it has far fewer parameters than NTN and the datasets contain many asymmetric triplets. This demonstrates that NTN-Diag solves the overfitting problem of NTN without sacrificing expressive power. NTN-Diag also has fewer parameters than the smallest (m = 1) NTN-SMD. Thus, we conclude that NTN-Diag is a better alternative to NTN than NTN-SMD is, in terms of both accuracy and computational cost.
(Table 4: Conjunctive and disjunctive normal forms in propositional logic. A_ij is a literal, i.e., a propositional variable or its negation; for example, p_1 and ¬p_2 are literals, but ¬¬p_3 is not.)
• NTN-Comp outperformed NTN-Diag, showing that its more flexible constraint on the matrices yields additional expressiveness. However, NTN-Diag and NTN-Comp do not exceed DistMult and ComplEx, respectively, on most measures.
Although not shown in the table, in this experiment NTN-Diag and NTN-Comp were, respectively, 3 and 1.7 times as fast to train as NTN.

Logical Reasoning
To validate the performance of our proposed models in a recursive neural network setting, we experimentally tested them by having them solve a semantic compositionality problem in logic.

Task
This task definition basically follows Bowman et al. (2015): given a pair of artificially generated propositional logic formulas, classify the relation between the formulas into one of the seven basic semantic relations of natural logic (MacCartney and Manning, 2009). Table 5 shows these seven relation types. The formulas consist of propositional variables, negation, and the conjunction and disjunction connectives. Although Bowman et al. (2015) generated formulas with no constraint on their form, we restricted them to conjunctive or disjunctive normal form (Table 4). Recall that any propositional formula can be transformed into these forms. (Example pairs: p3 ⊏ (p3 or p2); (p1 or (p2 or p4)) ⊐ (p2 and not p4).)

Models and Loss Function
Following Bowman et al. (2015), we constructed a model that infers the relations between formula pairs, as described in Table 6.
The model consists of two layers: a composition layer and a comparison layer (Figure 1). The composition layer outputs the embeddings of both the left and right formulas via recursive neural networks. Subsequently, the comparison layer compares the two embeddings using a single-layer neural network, and a softmax classifier receives its output. In the composition layer, we use different parameters for the and and or operations. As the loss function, we used cross entropy with L2 regularization. We apply the NTNs of Section 4 to the comparison layer and the corresponding RNTNs to the composition layer.

Experimental Setup
In this experiment, an example is a pair of propositional formulas, and its class label is one of the seven relation types between the pair. We generated examples following the protocol described in Bowman et al. (2015), with the exception that the formulas are restricted to CNF or DNF, as mentioned above.
(Table 7: Results of logical inference for Tests 1-12. An example in Test n has n logical operators in either or both of the left and right formulas. Each score is the average accuracy over five trials with the λ that achieved the best performance on the validation set. "Majority class" denotes the ratio of the majority class (relation "#", i.e., Independence; see Table 5).)

Result
The results are shown in Table 7. From the table, we observe the following: • As with KGC, the large difference in performance between RNN and RNTN suggests that this logical reasoning task requires feature interactions to be captured.
• RNTN-Diag achieved the best accuracy except for Tests 2 and 12 and outperformed RNTN except for Test 2. This is not surprising because both and and or are symmetric: p 1 and p 2 equals p 2 and p 1 . This matches the tensor term in RNTN-Diag which is symmetric with respect to x 1 and x 2 .
• RNTN-Comp was the second best except for Tests 1-3 and 10-12. For all tests, its accuracy was comparable with or superior to that of RNTN.
• RNTN-SMD (m = 1) was inferior to RNTN on most test sets, although some good results were observed with m = 1, 2, 3 on Tests 11 and 12. Indeed, except for Tests 9-12, RNTN-SMD (m = 1) was inferior even to RNN, despite RNTN-SMD's larger number of parameters. RNTN-SMD (m = 2) obtained better results than m = 1, but was still worse than RNTN except on Tests 10-12. Further increases in m (m = 4, 8, 16) worsened the accuracy despite the increased number of parameters.
We also evaluated the stability of the models over different trials and hyperparameters. Table 8 shows the best average accuracy of each compared model (among all tested λ) on the validation set. The parenthesized figures (in the rightmost column) show the standard deviation over the five independent trials used for computing the average, i.e., all five trials used the same λ value, the one that achieved the best average accuracy. We see that RNTN-SMDs have larger standard deviations than the other models. Finally, Figure 3 shows that training times increase quadratically with the dimension for RNTN, which has O(n²k) parameters, but not for our methods, which have only O(nk) parameters. (Footnote fragment: "... reason, we did not test TreeLSTM in this paper.")

Conclusion
We proposed two new parameter reduction methods for tensors in NTNs. The first method constrains the slice matrices to be symmetric, and the second assumes them to be the real parts of normal matrices. In both methods, the number of 3D tensor parameters is reduced from O(n²k) to O(nk) after the constrained matrices are eigendecomposed. By removing the tensor's surplus parameters, our methods learn better and faster, as shown in the experiments. Future work will test the versatility of our proposals, RNTN-Diag and RNTN-Comp, on other tasks that deal with datasets exhibiting various structures.