Suggest me a movie for tonight: Leveraging Knowledge Graphs for Conversational Recommendation

Conversational recommender systems suggest products to users based on the flow of the conversation. Recently, the use of external knowledge in the form of knowledge graphs has been shown to improve the performance of recommendation and dialogue systems. Information from knowledge graphs enriches such systems by providing additional information, such as closely related products and textual descriptions of items. However, knowledge graphs are incomplete, since they do not contain all the factual information present on the web. Furthermore, when working on a specific domain, using a knowledge graph in its entirety contributes extraneous information and noise. In this work, we study several subgraph construction methods and compare their performance on the recommendation task. We incorporate pre-trained embeddings from the subgraphs along with positional embeddings in our models. Extensive experiments show that our method yields a relative improvement of at least 5.62% over the state-of-the-art on multiple metrics for the recommendation task.


Introduction
Conversational Recommender Systems (CRSs) are goal-oriented dialogue systems focused on recommending products to users through multi-turn dialogues. In these dialogues, the user and the recommender system take turns to interact with each other. CRSs hold enormous potential in the e-commerce industry, wherein users can be recommended products based directly on an understanding of their requirements. However, users are not always entirely aware of their preferences while purchasing products. A CRS enables the user to make an informed decision while purchasing a product by taking into consideration the information about different products and matching them to their needs. This is especially useful for high-involvement products. By understanding the context and learning the user's preferences, a CRS can suggest products to users, which in turn leads to higher consumer satisfaction and lower buyer's remorse. The three central components of CRSs are the dialogue manager, the user modelling system, and the recommendation engine (Jannach et al., 2020). The dialogue manager manages the dialogue states along with user intents and decides on the next actions to take. The user modelling system models the user profile using the dialogue content. The recommendation engine generates appropriate recommendations by taking the dialogue states and user profiles into account. More recently, works leveraging information from Knowledge Graphs (KGs) have been gaining ground. KGs contain factual information about real-world entities in a structured format. Such external knowledge helps in adapting to a specific domain, as KGs contain rich information about products and their features. This knowledge allows the recommendation engine to utilise additional knowledge about products, which in turn leads to better suggestions.
Leveraging such graphs helps in making better recommendations, as additional information is made available to the system beyond the dialogue content and user behaviour. We therefore focus on improving the recommendations provided to users by incorporating KGs into dialogue systems for movie recommendation. However, to leverage the benefits of KGs, appropriate subgraphs need to be constructed for the targeted domain, thereby reducing the amount of redundant information made available to the system. Subgraphs that are rich in domain information while containing little noise are desirable. We therefore study the benefits of subgraphs created using N-hop and PageRank (Page et al., 1999) approaches, and analyse which subgraphs are best suited for the task at hand. In this work, we build upon Knowledge-Based Recommendation Dialog (KBRD) (Chen et al., 2019). The authors extract movie entities from dialogues and utilise information from a KG to suggest movies to users. We incorporate pre-trained entity embeddings and make use of positional embeddings to improve the performance of the system. The main contributions of our work are as follows:
• We conduct extensive experiments on different subgraphs extracted from DBpedia (Auer et al., 2007) to show that the entire information contained in this resource is not beneficial and that there is a need for an optimal subgraph creation technique.
• We show that using pre-trained entity embeddings supplemented with positional embeddings allows the model to learn better entity representation for recommendation.

Related Works
Rich (1979) suggested the use of natural language to suggest books to users by asking them questions and learning their personalities and choices. Recently, there has been a shift from traditional collaborative filtering techniques for recommendation (Sarwar et al., 2001; Zhang et al., 2016) to CRSs (Christakopoulou et al., 2018). Recommendation systems have moved towards deep neural networks, and so have CRSs (Chen et al., 2019). Christakopoulou et al. (2018) use recurrent neural network-based models to recommend videos to users. Zhang et al. (2016) explore the use of knowledge bases in recommendation tasks; they leverage heterogeneous information from the knowledge base to improve the performance of the recommendation system. Sun et al. (2018) propose an embedding-based approach that learns semantic representations of entities and paths in a KG to characterise user preferences towards items. Other work improves the performance of the recommender system by learning embeddings for the entities in the KG using TransR (Lin et al., 2015) and by refining and discriminating the node embeddings using attention over the neighbouring nodes of a given node. Wang et al. (2018) and Li et al. (2020) focus on goal-oriented conversational recommendation for cold-start users. Wang et al. (2018) simulate the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the graph. While learning entity representations, prior works have used pre-trained embeddings for a node and fine-tuned those embeddings using the node's neighbourhood.
Instead of fine-tuning the pre-trained embeddings, we extract subgraphs well suited to the domain and learn entity embeddings on them, thereby allowing the model to capture information beyond a node's neighbourhood and, in turn, improving the performance of the system. CRSs have been used to solve the goal-oriented recommendation dialogue task for specific domains. Li et al. (2020) generate new venues for recommendation using graph convolutional networks and encode the dialogue contents using a hierarchical recurrent encoder-decoder (HRED) (Sordoni et al., 2015), thereby recommending locations to users. The ReDial dataset, in which users are recommended movies based on the conversation they have with the recommender, paved the way for the use of deep learning in CRSs; its authors use a hierarchical recurrent encoder-decoder (Sordoni et al., 2015) to develop a conversational recommender model trained on the dataset. Chen et al. (2019) extend this work by incorporating a KG to recommend movies to users. They also show that dialogue and recommendation in CRSs are complementary tasks that benefit one another. However, they do not take into account the sequential nature of items. Instead of creating a session graph to learn the user representation, we infuse positional embeddings into the entity embeddings to preserve the temporal preference of the user over the movie entities.

Figure 1: Overview of our model. A subgraph is constructed from the KG, and the entity embeddings are learnt on the extracted subgraph. The entity embeddings belonging to a particular user are enriched with positional embeddings (element-wise addition) and then passed through a soft-attention layer to represent the user. The score of each entity is calculated by taking the dot product of its embedding with the user representation. Finally, the probability of each entity is calculated by passing the scores of all entities through a softmax layer.

Dataset
For this work, we use the ReDial dataset, an annotated set of dialogues in which a recommender suggests movies to a recommendee. The dataset consists of 10,006 dialogues with a total of 182,150 utterances. We split the dataset into 80% for training and 10% each for validation and testing.
We use DBpedia as the KG for this task. It is an open KG that stores structured content from Wikimedia projects. As DBpedia is served as open linked data, entities (movies, actors, directors, etc.) present in the conversations are linked to DBpedia entities, as in the KBRD setup, using the method put forward by Daiber et al. (2013), in which entity linking is performed with a generative probabilistic model. Entities are further split into two categories: movie entities and mentioned entities. Movie entities are the annotated movies present in the ReDial dataset. Mentioned entities are the entities linked to DBpedia as suggested in Chen et al. (2019); entities related to actors, producers, and directors are examples of mentioned entities. The dataset consists of 6,924 movies and 4,765 mentioned entities. Among the movie entities, 824 movies from the dataset could not be linked to DBpedia; similarly, 121 mentioned entities could not be linked. We introduce the unlinked entities as isolated nodes in the graph. Since the numbers of nodes and edges in the DBpedia KG are of the order of 10^6 and 10^7 respectively, we need to extract a relevant subgraph for this task. We build the initial set of subgraphs by considering N-hop neighbours of the movie nodes present in our dataset. When constructing a graph using the N-hop technique, the subgraph grows exponentially, which induces noise. To counteract this issue, we use the PageRank algorithm (Page et al., 1999) to extract subgraphs for the task.

Methodology
To address the recommendation task, we begin by constructing a subgraph from the DBpedia KG. We then learn entity embeddings on the constructed subgraphs using the method described by Balažević et al. (2019). The learned embeddings are used for the user representation. Finally, we compute the similarity of the user with the movie entities. An overview of our model is shown in Figure 1.

Subgraph creation
To analyse the information contributed by different kinds of subgraphs, we construct them using N-hop neighbours and the PageRank algorithm. In all cases, the initial set of seed nodes is the set of all movie entities present in the ReDial dataset, and we preserve the relational information in the subgraphs. When constructing a subgraph using the N-hop method, we include all the edges and nodes reachable from the seed set by paths of length N; we consider N = 2, 3, and 5. The construction of subgraphs using PageRank, on the other hand, is a two-step process. First, an initial subgraph is constructed by considering the 3-hop neighbours of the seed set; PageRank scores of its nodes are computed, and the top-k nodes with the highest scores are retained. Second, to construct the final subgraph, we take these top-k nodes along with the movie nodes and add their 1-hop neighbours. Since the random walk in PageRank can start from any node, we use Personalised PageRank (PPR) (Bahmani et al., 2010) to make the subgraphs more movie-centric: we assign high personalisation scores to the movie entities so that the random walk restarts from the movie nodes with high probability. In our experiments, we distribute α_per among the set of movie nodes and 1 − α_per among the rest of the nodes in the subgraph extracted after the first step. To construct such movie-centric graphs, we choose α_per = 0.7 and 0.9. Table 1 shows the properties of the different subgraphs constructed. The 5-hop subgraph contains the highest number of nodes and edges while having the lowest density. We learn entity embeddings on these subgraphs for use in the model.
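The two construction strategies above can be sketched as follows. This is a minimal illustration over a plain adjacency-dict graph: the function names, the power-iteration PPR, and the parameter defaults are our own simplifications, not the authors' implementation, which operates on the full DBpedia graph with relation labels preserved.

```python
def n_hop_subgraph(adj, seeds, n):
    """Nodes reachable from the seed set within n hops."""
    frontier, visited = set(seeds), set(seeds)
    for _ in range(n):
        frontier = {m for v in frontier for m in adj[v]} - visited
        visited |= frontier
    return visited

def personalized_pagerank(adj, nodes, personalization, damping=0.85, iters=50):
    """Plain power-iteration PPR restricted to `nodes`."""
    total = sum(personalization.values())
    p = {v: personalization.get(v, 0.0) / total for v in nodes}
    rank = dict(p)
    for _ in range(iters):
        new = {v: (1.0 - damping) * p[v] for v in nodes}
        for v in nodes:
            out = [m for m in adj[v] if m in nodes]
            if out:
                share = damping * rank[v] / len(out)
                for m in out:
                    new[m] += share
            else:  # dangling node: spread its mass via the teleport vector
                for m in nodes:
                    new[m] += damping * rank[v] * p[m]
        rank = new
    return rank

def ppr_subgraph(adj, movie_nodes, alpha_per=0.9, top_k=100):
    """Two-step extraction: PPR over the 3-hop subgraph, biased towards
    movie nodes, then 1-hop expansion of the top-k nodes plus movies."""
    base = n_hop_subgraph(adj, movie_nodes, 3)
    movies = set(movie_nodes) & base
    others = base - movies
    # alpha_per is spread over movie nodes, 1 - alpha_per over the rest.
    pers = {v: alpha_per / len(movies) for v in movies}
    pers.update({v: (1.0 - alpha_per) / len(others) for v in others})
    scores = personalized_pagerank(adj, base, pers)
    kept = set(sorted(scores, key=scores.get, reverse=True)[:top_k]) | movies
    return n_hop_subgraph(adj, kept, 1) & base
```

On the real graph, the same procedure would be run with the ReDial movie entities as seeds; only nodes surviving the PPR cut and their 1-hop neighbours enter the final subgraph.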

Entity Embeddings
To incorporate information from the relational graph into our model, we need to represent each entity as a d-dimensional vector; the embedding of an entity i is denoted e_i ∈ R^d. Balažević et al. (2019) learn entity embeddings by solving the link prediction task on relational graphs. Since those embeddings are well suited to link prediction, they contain relevant information for predicting new relations in an incomplete graph. Connectionist models, such as the one demonstrated by Schlichtkrull et al. (2018), leverage the local neighbourhood of a node to learn node embeddings. Although relational graph neural networks help in learning better entity representations for relational graphs, KGs do not contain all factual relations. As shown in Table 2, for all of the subgraphs, at least 18% of the target nodes are unreachable from the previously mentioned entities. Therefore, we choose the former method for learning entity embeddings.

TuckER Embeddings
Graph Neural Networks (Wu et al., 2020) allow information to flow through a graph using connectionist models. However, such models do not allow information flow between nodes that are not connected by a path of any length. The learned embeddings are such that they help in link prediction between two entities. Since, in our case, many target entities are not connected to the source nodes, we use embeddings learned through Tucker decomposition (Tucker, 1966) for better entity prediction.
Tucker decomposition factorises a tensor into a smaller core tensor and a set of matrices. Balažević et al. (2019) show that Tucker decomposition can be used to learn entity embeddings from relational graphs. They also show that embeddings learned with this technique outperform techniques such as TransE (Bordes et al., 2013) or DistMult on the link prediction task over multiple datasets. For a tensor X ∈ R^{I×J×K}, Tucker decomposition gives a core tensor Z ∈ R^{P×Q×R} and matrices A ∈ R^{I×P}, B ∈ R^{J×Q}, and C ∈ R^{K×R} as outputs.
For link prediction tasks, X is the adjacency tensor consisting of the adjacency matrices of each relation in the KG, with a 1 at index (s, r, o) if relation r holds between e_s and e_o, and 0 otherwise. We set A = C to obtain the same entity embeddings in both the subject and object positions. The core tensor Z ∈ R^{d_e×d_r×d_e} is a set of parameters whose size is linearly proportional to d_e and d_r, the entity and relation embedding sizes respectively. Our method learns the entity embedding matrix A ∈ R^{n_e×d_e} and the relation embedding matrix B ∈ R^{n_r×d_r} as shown in Equation 2, where e_s, e_o ∈ R^{d_e} are the corresponding rows of A and w_r ∈ R^{d_r} is the corresponding row of B. The model is trained to minimise the Bernoulli negative log-likelihood, which maximises the estimated probability φ(e_s, w_r, e_o) of known relations.
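The scoring step can be sketched as follows, assuming the standard TuckER formulation φ(e_s, w_r, e_o) = Z ×₁ e_s ×₂ w_r ×₃ e_o followed by a sigmoid. The toy dimensions and random parameters are purely illustrative; in practice A, B, and Z are trained jointly with the Bernoulli negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_r, d_e, d_r = 5, 2, 4, 3          # toy sizes; the real graph is far larger

A = rng.normal(size=(n_e, d_e))          # entity embeddings (shared for subject and object)
B = rng.normal(size=(n_r, d_r))          # relation embeddings
Z = rng.normal(size=(d_e, d_r, d_e))     # core tensor, shared across all triples

def score(s, r):
    """Probabilities of every candidate object for subject s and relation r:
    sigmoid(Z x1 e_s x2 w_r x3 e_o), computed for all objects at once."""
    e_s, w_r = A[s], B[r]
    m = np.einsum('i,ijk->jk', e_s, Z)   # contract the subject into the core tensor
    v = np.einsum('j,jk->k', w_r, m)     # contract the relation
    logits = A @ v                       # score against all entities as objects
    return 1.0 / (1.0 + np.exp(-logits))

probs = score(0, 1)
```

Because the object-side contraction is a single matrix product against A, all candidate objects are scored in one pass, which is what makes the 1-vs-all Bernoulli training of TuckER efficient.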

Positional Embeddings
During the dialogue flow, entities are mentioned sequentially; by virtue of this, entities mentioned later have a larger effect on future recommendations. Wu et al. (2019) construct a session graph to represent the sequential nature of the items belonging to a user and, by leveraging the properties of such graphs, learn latent item embeddings using graph neural networks. To avoid constructing a session graph and increasing the complexity of the model, we instead use positional embeddings as described in Devlin et al. (2019), which allow us to infuse sequence information into the entity embeddings. The pattern of embeddings induced by Equations 3 and 4 encodes the positional information of entities in the model.
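Since Equations 3 and 4 are not reproduced here, the sketch below assumes the standard sinusoidal formulation of positional embeddings; only the element-wise addition into the entity embeddings is taken from the description above, and the function name is ours.

```python
import numpy as np

def positional_embeddings(n_positions, d):
    """Sinusoidal positional embeddings (one plausible reading of the
    Equations 3 and 4 referenced above, for even d): even dimensions
    use sine, odd dimensions cosine, with geometrically growing wavelengths."""
    pos = np.arange(n_positions)[:, None]
    dim = np.arange(0, d, 2)[None, :]
    angle = pos / np.power(10000.0, dim / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Entity embeddings for a user's session are enriched by element-wise
# addition with the embedding of each entity's position in the dialogue.
entity_emb = np.random.default_rng(0).normal(size=(6, 8))   # 6 entities, d = 8
enriched = entity_emb + positional_embeddings(6, 8)
```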

User representation
A user is represented by the set of entities already encountered in the dialogue. If a user has interacted with n entities, the user is represented by the matrix U ∈ R^{n×d}. We then map the matrix U to a vector in R^d to find the similarity of the user with different entities. We borrow part of the soft-attention mechanism used by Wu et al. (2019) to construct the user representation in R^d. While the user could simply be represented by the embedding of the last-seen item, we use soft-attention to obtain a better representation.
Here, U_n and U_i are the representations of the user after n and i entities have been encountered, respectively. q ∈ R^d and W_1, W_2 ∈ R^{d×d} are trainable parameters.
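A minimal sketch of this soft-attention step, following the formulation of Wu et al. (2019): the random arrays stand in for the trainable q, W_1, and W_2, and the exact parametrisation (e.g. bias terms) may differ from our model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                              # 4 entities seen so far, embedding dim 8
U = rng.normal(size=(n, d))              # positional-enriched entity embeddings

q = rng.normal(size=d)                   # trainable in the real model;
W1 = rng.normal(size=(d, d))             # random here for illustration
W2 = rng.normal(size=(d, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# alpha_i = q^T sigmoid(W1 U_n + W2 U_i): attend over all seen entities,
# conditioned on the most recently encountered one (U_n = U[-1]).
alpha = sigmoid(U[-1] @ W1.T + U @ W2.T) @ q
user = alpha @ U                         # weighted sum -> user vector in R^d
```

Conditioning the attention weights on the last-seen entity is what lets recent mentions dominate the user representation, consistent with the positional-embedding motivation above.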

Similarity score of users with products
The final similarity scores of users with the movie entities are calculated by computing the dot product of the user representation with the movie entity representations and passing the result through a softmax layer.
We optimise our model by using the cross-entropy loss on the set of recommended movies.
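The scoring and training objective can be sketched as follows; the random arrays are placeholders for the learned user and movie representations, and the target index is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, d = 10, 8
movie_emb = rng.normal(size=(n_movies, d))   # movie entity embeddings
user = rng.normal(size=d)                    # user vector from the attention step

scores = movie_emb @ user                    # dot-product similarity per movie
exp = np.exp(scores - scores.max())          # numerically stable softmax
probs = exp / exp.sum()

target = 3                                   # index of the gold-standard movie
loss = -np.log(probs[target])                # cross-entropy on the target movie
```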

Experimental settings
This section presents the parameters used to train our system and the evaluation metrics used to evaluate our approach. Furthermore, we describe the previous approaches against which we compare our proposed approach.

Model settings
For learning the entity embeddings using Tucker decomposition, we set the learning rate to 0.0001, the batch size to 1,536, and the number of epochs to 512 for all of the subgraphs, with a dropout of 0.1 on every layer. For the recommendation engine, we set the batch size to 32 and the learning rate to 0.001 to avoid overfitting. Similar to Chen et al. (2019), we use the Adam optimiser (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.999, and ε = 10^-8 to optimise our model. We set the value of the scaling factor β to 1,000. We train the entity recommendation model for 1,000 epochs with a validation patience of 5.

Evaluation Metrics
As our system focuses on improving the performance of the recommendation engine, we select the information retrieval metrics Mean Reciprocal Rank (MRR) and Recall@k. MRR is the mean of the reciprocal rank of the target entity as predicted by the model. Recall@k measures whether the target movie is present in the top-k recommended products.
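These metrics can be computed as follows; the helper names are ours, and each instance is a ranked list of predicted items plus a single gold-standard target.

```python
def mrr(ranked_lists, targets):
    """Mean reciprocal rank of the target item; ranks are 1-based,
    and an absent target contributes 0."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked:
            total += 1.0 / (ranked.index(t) + 1)
    return total / len(targets)

def recall_at_k(ranked_lists, targets, k):
    """Fraction of instances whose target appears in the top-k predictions."""
    hits = sum(t in ranked[:k] for ranked, t in zip(ranked_lists, targets))
    return hits / len(targets)
```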

Comparison to previous approaches
We consider KBRD (Chen et al., 2019) as the baseline for our work. The KBRD system extracts movie entities and mentioned entities from the dialogues and links them to DBpedia using the entity linking technique described by Daiber et al. (2013). The system then extracts a subgraph by considering the 2-hop neighbours of the movie entities. The entity embeddings are learned using a Relational Graph Convolutional Network (Schlichtkrull et al., 2018). The entity embeddings belonging to a particular user are then passed to a self-attention layer to learn the user representation. The dot product of the user representation with the entity embeddings is computed, and the scores are passed through a softmax layer to obtain the recommendation probabilities.
We also compare the performance of our model with ReDial, a CRS with a dialogue generation component built using HRED (Sordoni et al., 2015). Its recommendation module makes use of an autoencoder model and a sentiment analysis model to suggest movies to users.

Results and Discussions
First, we perform a quantitative evaluation of the models. We then analyse the effect of different subgraphs on the performance. Thereafter, we study the performance when recommending disconnected entities and the effect of the session sequence length. Finally, we analyse examples where the model results are not consistent with the gold standard, to gain insights for future work.

Quantitative Evaluation
In this work, we put forward the hypothesis that using pre-trained entity embeddings supplemented with positional information improves the performance of the recommendation engine. Table 3 shows the performance of the different models. To estimate the performance of each model, we performed 24 runs to obtain the distribution of the performance scores on each metric, and we compute confidence intervals using the Student's t-distribution (Student, 1908). A two-sample t-test showed that our results are statistically significant improvements over the previous models (p < 0.001). We achieve 3.36%, 18.05%, 35.70%, and 0.082 on the Recall@1, Recall@10, Recall@50, and MRR metrics respectively with the 5-hop subgraph, improving upon the previous baseline (Chen et al., 2019) by 12.00%, 10.73%, 5.32%, and 24.24% on the four metrics respectively. These results show the efficacy of our choice of subgraphs and pre-trained embeddings.

Performance Analysis
This section presents the performance of our model in different settings of the problem, along with an error analysis.

Table 4: Evaluation on disconnected entities. We report the performance values on different metrics for the instances where the target entity is disconnected, in the subgraphs, from the already seen set of entities.

Effect of different subgraphs
Graphs with a higher number of nodes and edges contain more information. However, in the recommendation scenario, it is desirable to have subgraphs that retain the maximum amount of information from the KG with a minimal amount of noise. Table 3 shows that, for models with subgraphs constructed using the N-hop method, the performance improves as the value of N increases. However, the model performance for N = 5 is not significantly better than for N = 3. This points us towards analysing the performance of the models when using subgraphs constructed with PageRank. The results show that even though subgraphs constructed using PageRank or Personalised PageRank contain fewer nodes and edges (Table 1), they perform on par with the N-hop models.
Since PageRank identifies important nodes in a graph, the PageRank graphs are subgraphs of the 5-hop graph with fewer nodes and edges as well as less noise. Table 3 further shows that giving more importance to movie nodes while constructing the PageRank subgraphs yields better performance: with this weighting, the PageRank algorithm constructs a subgraph focused on the movie nodes. This method can easily be adapted to different domains. When pre-training entity embeddings using the method described by Balažević et al. (2019), the runtime is directly proportional to the number of entities and relations; with a smaller and richer subgraph, pre-training is faster while giving comparable performance.

Recommending Disconnected Entities
Disconnected target entities are entities that are not connected, by a path of any length in the KG, to the entities the user has already interacted with. We compare the 2-hop model against KBRD; the two models use the same DBpedia subgraph and thereby have the same set of disconnected entities. We do not compare KBRD against our best performing model, since the subgraph changes for the 5-hop model. Similarly, we do not compare against ReDial, as it uses auxiliary information, such as sentiment information, that neither KBRD nor our model leverages; relying only on information from KGs also lets our models adapt to a new domain more quickly than ReDial. Moreover, since the ReDial model does not make use of KGs, there is no scenario of disconnected entities for it. Table 4 shows the comparison of our model against KBRD. Our model performs better than KBRD on all four metrics. The results support our assumption that using pre-trained embeddings helps improve the performance of recommendation models that use KGs.

Effect of Session Sequence Length
We define the session sequence length as the number of entities encountered by the user. We analyse the performance of our model on different subgraphs against the baseline. We compare the performance of the different models on Recall@50; the results are shown in Figure 2. The figure shows that as the number of mentioned items increases, the performance of the model improves. This is attributed to the fact that with more mentioned items, the model has more information about the user and can therefore represent the user better. Figure 2 also shows that our model outperforms the baseline on all subgraphs. In addition, embeddings learned on graphs with more entities and relationships perform better, which can be attributed to the fact that more entities and relations help in learning more information and thereby better entity representations.

Table 5 displays three dialogues where our model does not perform correctly. In the case of Dialogue 1, the correct target movie is Get Out, yet our model assigns it only the fourth highest probability. This can be explained by the fact that, when no initial entities are mentioned in the dialogue, the model is unable to infer the context of the conversation; as a result, it gives the same output irrespective of the context. For Dialogues 2 and 3, the model gives the same result even though the contexts and the mentioned entities are different. Both dialogues have a movie in common, and the results are biased towards that movie. The model does not discriminate between movies present in the user utterance and those in the recommendation utterance.

Conclusion
In this work, we propose a new model to improve the performance of the recommendation engine in CRSs. We introduce different subgraphs for the task to show the need for constructing better subgraphs from KGs to propagate information into recommendation models. We show that our model outperforms the current state of the art on multiple metrics through the use of pre-trained KG embeddings and positional embeddings. Through multiple experiments, we demonstrate that our model performs better when recommending entities disconnected in the KG, and that its performance improves as the number of entities mentioned in the text increases.
In future work, we plan to incorporate the context of the dialogue text into the model. We will work towards understanding the differences between the movies mentioned by the user and by the recommender, resulting in a better understanding of the context, which in turn can be used to produce better recommendations. Exploring the performance of different KGs for this task is also left for future work.