A Relational Memory-based Embedding Model for Triple Classification and Search Personalization

Knowledge graph embedding methods often suffer from a limited ability to memorize valid triples and exploit them to predict new ones for the triple classification and search personalization problems. To this end, we introduce a novel embedding model, named R-MeN, that explores a relational memory network to encode potential dependencies in relationship triples. R-MeN considers each triple as a sequence of 3 input vectors that recurrently interact with a memory using a transformer self-attention mechanism. Thus, R-MeN encodes new information from the interactions between the memory and each input vector to return a corresponding vector. R-MeN then feeds these 3 returned vectors to a convolutional neural network-based decoder to produce a scalar score for the triple. Experimental results show that our proposed R-MeN obtains state-of-the-art results on SEARCH17 for the search personalization task, and on WN11 and FB13 for the triple classification task.


Introduction
Knowledge graphs (KGs), representing the genuine relationships among entities in the form of triples (subject, relation, object) denoted as (s, r, o), are often insufficient for knowledge representation due to the lack of many valid triples (West et al., 2014). Therefore, research work has been focusing on inferring whether a new triple missing from a KG is likely valid or not (Bordes et al., 2011, 2013; Socher et al., 2013). As summarized in (Nickel et al., 2016; Nguyen, 2017), KG embedding models aim to compute a score for each triple, such that valid triples have higher scores than invalid ones.
Existing embedding models show promising performance mainly for knowledge graph completion, where the goal is to infer a missing entity given a relation and another entity. However, in less-studied real applications such as triple classification (Socher et al., 2013), which aims to predict whether a given triple is valid, and search personalization (Vu et al., 2017), which aims to re-rank the relevant documents returned by a user-oriented search system given a query, these models do not effectively capture the potential dependencies among entities and relations in existing triples when predicting new triples.
To this end, we leverage the relational memory network (Santoro et al., 2018) to propose R-MeN, which infers the validity of new triples. In particular, R-MeN transforms each triple, together with added positional embeddings, into a sequence of 3 input vectors. R-MeN then uses a transformer self-attention mechanism (Vaswani et al., 2017) to guide the memory to interact with each input vector and produce an encoded vector. Finally, R-MeN feeds these 3 encoded vectors to a convolutional neural network (CNN)-based decoder to return a score for the triple. In summary, our main contributions are as follows:

• We present R-MeN, a novel KG embedding model to memorize and encode the potential dependencies among relations and entities for two real applications: triple classification and search personalization.
• Experimental results show that R-MeN obtains better performance than up-to-date embedding models: R-MeN produces new state-of-the-art results on SEARCH17 for the search personalization task, and a new highest accuracy on WN11 and the second-highest accuracy on FB13 for the triple classification task.

Figure 1: An illustration of the processes in our proposed R-MeN. "M" denotes the memory. "MLP" denotes a multi-layer perceptron. "g" denotes a memory gating. "CNN" denotes the convolutional neural network-based decoder.
The proposed R-MeN

Let G be a KG database of valid triples in the form of (subject, relation, object) denoted as (s, r, o). KG embedding models aim to compute a score for each triple, such that valid triples obtain higher scores than invalid triples.
We denote v_s, v_r and v_o ∈ R^d as the embeddings of s, r and o, respectively. Besides, we hypothesize that the relative positions of s, r and o are useful for reasoning over their relationships; hence we add a positional embedding to each position. Given a triple (s, r, o), we obtain a sequence of 3 vectors {x_1, x_2, x_3} as:

x_1 = W(v_s + p_1);  x_2 = W(v_r + p_2);  x_3 = W(v_o + p_3)

where W ∈ R^{k×d} is a weight matrix, p_1, p_2 and p_3 ∈ R^d are positional embeddings, and k is the memory size.
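For illustration, a minimal NumPy sketch of this input encoding; the dimensions, random initial values, and function name are placeholders rather than the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 256  # embedding size d and memory size k (illustrative values)

# Shared projection W and one positional embedding per slot of the (s, r, o) sequence.
W = rng.normal(scale=0.1, size=(k, d))
p = rng.normal(scale=0.1, size=(3, d))  # p_1, p_2, p_3

def encode_triple(v_s, v_r, v_o):
    """Map a triple to the sequence {x_1, x_2, x_3} with x_t = W (v + p_t)."""
    return [W @ (v + p[t]) for t, v in enumerate((v_s, v_r, v_o))]

v_s, v_r, v_o = rng.normal(size=(3, d))
xs = encode_triple(v_s, v_r, v_o)
assert len(xs) == 3 and xs[0].shape == (k,)
```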
We assume we have a memory M consisting of N rows, wherein each row is a memory slot. We use M^(t) to denote the memory at timestep t, and M^(t)_{i,:} ∈ R^k to denote the i-th memory slot at timestep t. We follow Santoro et al. (2018) to take x_t to update M^(t)_{i,:} using the multi-head self-attention mechanism (Vaswani et al., 2017) as:

M̃^(t+1)_{i,:} = ⊕_{h=1}^{H} [ Σ_{j=1}^{N} α_{i,j,h} (W^{h,V} M^(t)_{j,:}) + α_{i,N+1,h} (W^{h,V} x_t) ]

where H is the number of attention heads, and ⊕ denotes a vector concatenation operation. Regarding the h-th head, W^{h,V} ∈ R^{n×k} is a value-projection matrix, in which n is the head size and k = nH. Note that {α_{i,j,h}}_{j=1}^{N} and α_{i,N+1,h} are attention weights, which are computed using the softmax function over scaled dot products as:

α_{i,j,h} = softmax( (W^{h,Q} M^(t)_{i,:})^T (W^{h,K} M^(t)_{j,:}) / √n )
α_{i,N+1,h} = softmax( (W^{h,Q} M^(t)_{i,:})^T (W^{h,K} x_t) / √n )

where W^{h,Q} ∈ R^{n×k} and W^{h,K} ∈ R^{n×k} are query-projection and key-projection matrices, respectively. Following Santoro et al. (2018), we feed a residual connection between x_t and M̃^(t+1)_{i,:} to a multi-layer perceptron followed by a memory gating to produce an encoded vector y_t ∈ R^k for timestep t and the next memory slot M^(t+1)_{i,:} for timestep (t + 1).
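A compact sketch of this multi-head self-attention step over the memory plus the current input, assuming toy dimensions and storing the projections transposed (k × n) for a row-vector convention; this omits the residual connection, MLP, and gating:

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, n = 1, 2, 4          # memory slots, attention heads, head size (illustrative)
k = n * H                  # memory size k = nH

Wq = rng.normal(scale=0.1, size=(H, k, n))  # per-head query projections
Wk = rng.normal(scale=0.1, size=(H, k, n))  # per-head key projections
Wv = rng.normal(scale=0.1, size=(H, k, n))  # per-head value projections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(M, x):
    """One step: each memory slot attends over the N slots plus the input x_t;
    the per-head results are concatenated into the attended memory."""
    rows = np.vstack([M, x[None, :]])                    # (N+1, k) keys/values source
    out = np.zeros((N, k))
    for h in range(H):
        q = M @ Wq[h]                                    # (N, n) queries
        keys, vals = rows @ Wk[h], rows @ Wv[h]          # (N+1, n) each
        for i in range(N):
            alpha = softmax(q[i] @ keys.T / np.sqrt(n))  # N+1 attention weights
            out[i, h * n:(h + 1) * n] = alpha @ vals     # weighted sum of values
    return out

M = rng.normal(size=(N, k))
x = rng.normal(size=(k,))
M_tilde = attend(M, x)
assert M_tilde.shape == (N, k)
```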
As a result, we obtain a sequence of 3 encoded vectors {y_1, y_2, y_3} for the triple (s, r, o). We then use a CNN-based decoder to compute a score for the triple as:

f(s, r, o) = max(ReLU([y_1, y_2, y_3] ∗ Ω))^T w

where we view [y_1, y_2, y_3] as a matrix in R^{k×3}; Ω denotes a set of filters in R^{m×3}, in which m is the window size of filters; w ∈ R^{|Ω|} is a weight vector; ∗ denotes a convolution operator; and max denotes a max-pooling operator. Note that we use the max-pooling operator, instead of the vector concatenation of all feature maps used in ConvKB (Nguyen et al., 2018), to capture the most important feature from each feature map and to reduce the number of weight parameters.
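A minimal sketch of this decoder with window size m = 1; the filter values, sizes, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
k, F = 8, 4  # encoded-vector size and number of filters (illustrative; m = 1)

Omega = rng.normal(scale=0.1, size=(F, 1, 3))  # each 1x3 filter spans all 3 columns
w = rng.normal(scale=0.1, size=(F,))

def score(y1, y2, y3):
    """f(s, r, o): convolve the k x 3 matrix [y1, y2, y3] with each filter,
    max-pool each feature map to one value, then take a weighted sum."""
    Y = np.stack([y1, y2, y3], axis=1)                      # (k, 3)
    feats = []
    for f in range(F):
        fmap = np.maximum(0.0, (Y * Omega[f]).sum(axis=1))  # ReLU feature map (k,)
        feats.append(fmap.max())                            # max-pooling
    return float(np.dot(feats, w))

y1, y2, y3 = rng.normal(size=(3, k))
triple_score = score(y1, y2, y3)
assert np.isfinite(triple_score)
```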
We illustrate our proposed R-MeN in Figure 1. In addition, we employ the Adam optimizer (Kingma and Ba, 2014) to train R-MeN by minimizing the following loss function (Trouillon et al., 2016; Nguyen et al., 2018):

L = Σ_{(s,r,o) ∈ G ∪ G'} log(1 + exp(−l_{(s,r,o)} · f(s, r, o)))

in which l_{(s,r,o)} = 1 for (s, r, o) ∈ G and l_{(s,r,o)} = −1 for (s, r, o) ∈ G', where G and G' are collections of valid and invalid triples, respectively. G' is generated by corrupting valid triples in G.
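This softplus-style loss can be sketched as follows; the function name and the example scores are illustrative:

```python
import numpy as np

def rmen_loss(scores, labels):
    """Sum of log(1 + exp(-l * f(s, r, o))) over valid (label +1)
    and corrupted (label -1) triples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.log1p(np.exp(-labels * scores)).sum())

# A valid triple scored high and an invalid one scored low yield a small loss;
# reversing the scores yields a large loss.
small = rmen_loss([5.0, -5.0], [1, -1])
large = rmen_loss([-5.0, 5.0], [1, -1])
assert small < large
```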
Experimental setup

Task description and evaluation

Triple classification
The triple classification task is to predict whether a given triple (s, r, o) is valid or not (Socher et al., 2013). Following Socher et al. (2013), we use the two benchmark datasets WN11 and FB13, in which each validation or test set consists of equal numbers of valid and invalid triples. Note that, to avoid reversible relation problems, Socher et al. (2013) excluded from the test set any triple whose subject or object entity also appears in the training set in a different relation type or order. Table 1 gives statistics of the experimental datasets. Each relation r has a threshold θ_r computed by maximizing the micro-averaged classification accuracy on the validation set. If the score of a given triple (s, r, o) is above θ_r, the triple is classified as valid; otherwise, it is classified as invalid.
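The per-relation thresholding can be sketched as follows; here candidate thresholds are taken from the observed scores themselves, which is one simple choice among several (midpoints also work), and the toy data are illustrative:

```python
from collections import defaultdict

def fit_thresholds(valid_scored):
    """Pick a per-relation threshold theta_r maximizing validation accuracy.
    `valid_scored` holds (relation, score, is_valid) tuples."""
    by_rel = defaultdict(list)
    for r, score, y in valid_scored:
        by_rel[r].append((score, y))
    thresholds = {}
    for r, items in by_rel.items():
        best_acc, best_t = -1.0, 0.0
        for t, _ in items:  # candidate thresholds drawn from observed scores
            acc = sum((score > t) == y for score, y in items) / len(items)
            if acc > best_acc:
                best_acc, best_t = acc, t
        thresholds[r] = best_t
    return thresholds

dev = [("r1", 0.9, True), ("r1", 0.7, True), ("r1", 0.2, False), ("r1", 0.1, False)]
theta = fit_thresholds(dev)
# classify a test triple as valid iff its score exceeds theta[r]
assert (0.8 > theta["r1"]) is True and (0.15 > theta["r1"]) is False
```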

Search personalization
In search personalization, given a query submitted by a user, we aim to re-rank the documents returned by a search system so that documents more relevant to that query are ranked higher. We follow (Vu et al., 2017; Nguyen et al., 2019a,b) to view the relationship among the submitted query, the user and a returned document as a (s, r, o)-like triple (query, user, document). Therefore, we can adapt our R-MeN to the search personalization task.
We evaluate our R-MeN on the benchmark dataset SEARCH17 (Vu et al., 2017) as follows: (i) We train our model and use the trained model to compute a score for each (query, user, document) triple. (ii) We sort the scores in descending order to obtain a new ranked list. (iii) We employ two standard evaluation metrics: mean reciprocal rank (MRR) and Hits@1. For each metric, higher values indicate better ranking performance.
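These two metrics can be computed as follows; this sketch assumes each ranked list contains exactly one relevant document, and the example data are illustrative:

```python
def mrr_hits1(ranked_lists):
    """ranked_lists: for each query, documents sorted by descending model score,
    with each entry True iff that document is the relevant one."""
    rr, hits = [], []
    for docs in ranked_lists:
        rank = next(i for i, rel in enumerate(docs, start=1) if rel)
        rr.append(1.0 / rank)               # reciprocal rank of the relevant doc
        hits.append(1 if rank == 1 else 0)  # Hits@1: relevant doc ranked first
    return sum(rr) / len(rr), sum(hits) / len(hits)

# relevant document at rank 1 for the first query, rank 2 for the second
mrr, h1 = mrr_hits1([[True, False], [False, True]])
assert abs(mrr - 0.75) < 1e-9 and h1 == 0.5
```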

Triple classification
We use the common Bernoulli strategy (Wang et al., 2014; Lin et al., 2015) when sampling invalid triples. For WN11, we follow Guu et al. (2015) to initialize entity and relation embeddings in our R-MeN by averaging the word vectors of the words in each relation and entity, i.e., v_{american arborvitae} = ½(v_{american} + v_{arborvitae}), in which these word vectors are taken from the GloVe 50-dimensional pre-trained embeddings (Pennington et al., 2014) (i.e., d = 50). For FB13, we use entity and relation embeddings produced by TransE (Bordes et al., 2013) to initialize entity and relation embeddings in our R-MeN, for which we obtain the best result for TransE on the FB13 validation set when using the l_2-norm, a learning rate of 0.01, margin γ = 2 and d = 50.
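A sketch of this word-vector-averaging initialization; the two-dimensional stand-in vectors and the underscore naming convention are assumptions for illustration, not the actual 50-dimensional GloVe vectors:

```python
import numpy as np

# Hypothetical pre-trained word vectors (stand-ins for the GloVe embeddings).
word_vecs = {
    "american": np.array([1.0, 0.0]),
    "arborvitae": np.array([0.0, 1.0]),
}

def init_entity(name):
    """Initialize an entity embedding as the average of its word vectors."""
    words = name.split("_")
    return sum(word_vecs[w] for w in words) / len(words)

v = init_entity("american_arborvitae")
assert np.allclose(v, [0.5, 0.5])
```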
Furthermore, on WN11, we provide our new fine-tuned result for TransE using our experimental setting, wherein we use the same initialization taken from the GloVe 50-dimensional pre-trained embeddings to initialize entity and relation embeddings in TransE. We obtain the best score for TransE on the WN11 validation set when using the l_1-norm, a learning rate of 0.01, margin γ = 6 and d = 50.
In preliminary experiments, we observe the highest accuracies on the validation sets of both datasets when using a single memory slot (i.e., N = 1); this is consistent with the use of a single memory slot in language modeling (Santoro et al., 2018). Therefore, we set N = 1 for the triple classification task. Also from preliminary experiments, we select the batch size bs = 16 for WN11 and bs = 256 for FB13, and set the window size of filters to 1 (i.e., m = 1).
Regarding other hyper-parameters, we vary the number of attention heads H in {1, 2, 3}, the head size n in {128, 256, 512, 1024}, the number of MLP layers l in {2, 3, 4}, and the number of filters F = |Ω| in {128, 256, 512, 1024}. The memory size is set to k = nH. To learn our model parameters, we train our model using the Adam initial learning rate lr in {1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4}. We run up to 30 epochs and use a grid search to select the optimal hyper-parameters. We monitor the accuracy after each training epoch to compute the relation-specific threshold θ_r, to select the optimal hyper-parameters (w.r.t. the highest accuracy) on the validation set, and to report the final accuracy on the test set.
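The search space above can be enumerated as follows; this is only a sketch of the grid itself, not the full training loop:

```python
from itertools import product

# Hyper-parameter grid matching the ranges described in the text.
grid = {
    "H": [1, 2, 3],                               # attention heads
    "n": [128, 256, 512, 1024],                   # head size
    "l": [2, 3, 4],                               # MLP layers
    "F": [128, 256, 512, 1024],                   # filters
    "lr": [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4],   # Adam initial learning rate
}

configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
for cfg in configs:
    cfg["k"] = cfg["n"] * cfg["H"]  # memory size is tied to the head layout

assert len(configs) == 3 * 4 * 3 * 4 * 6  # 864 candidate settings
```

Each candidate would then be trained for up to 30 epochs and scored on the validation set.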

Search personalization
We use the same initialization of user profile, query and document embeddings used by Nguyen et al. (2019b) on SEARCH17 to initialize the corresponding embeddings in our R-MeN. From preliminary experiments, we set N = 1, bs = 16 and m = 1. The other hyper-parameters are varied over the same ranges as in the triple classification task. We monitor the MRR score after each training epoch to obtain the highest MRR on the validation set and report the final scores on the test set.

Main experimental results

Triple classification

Table 2 reports the accuracy results of our R-MeN model and previously published results on WN11 and FB13. R-MeN sets a new state-of-the-art accuracy of 90.5%, significantly outperforming other models on WN11. R-MeN also achieves the second-highest accuracy of 88.9% on FB13. Overall, R-MeN yields the best performance averaged over these two datasets.

Regarding TransE, we obtain the second-best accuracy of 89.2% on WN11 and a competitive accuracy of 88.1% on FB13. Figure 2 shows the accuracy results for TransE and our R-MeN w.r.t. each relation. In particular, on WN11, the accuracy for the one-to-one relation "similar to" increases substantially from 50.0% for TransE to 78.6% for R-MeN. On FB13, R-MeN improves the accuracies over TransE for the many-to-many relations "institution" and "profession".

Effects of hyper-parameters
Next, we present in Figure 3 the effects of two hyper-parameters: the head size n and the number H of attention heads. Using large head sizes (e.g., n = 1024) produces better performance on all 3 datasets. Additionally, using multiple heads gives better results on WN11 and FB13, while using a single head (i.e., H = 1) works best on SEARCH17, presumably because each query usually has a single intention.

Ablation analysis
For the last experiment, we compute and report ablation results over two factors in Table 4. In particular, the scores degrade on FB13 and SEARCH17 when not using the positional embeddings. More importantly, the results degrade on all 3 datasets when not using the relational memory network. These results show that the positional embeddings help capture the relative positions among s, r and o, and that the relational memory network helps to memorize and encode the potential dependencies among relations and entities.

Conclusion
We propose R-MeN, a new KG embedding model that integrates transformer self-attention-based memory interactions with a CNN decoder to effectively capture the potential dependencies in KG triples. Experimental results show that our proposed R-MeN obtains new state-of-the-art performance for both the triple classification and search personalization tasks. In future work, we plan to extend R-MeN to multi-hop knowledge graph reasoning. Our code is available at: https://github.com/daiquocnguyen/R-MeN.