OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs

We study a conversational reasoning model that strategically traverses through a large-scale common fact knowledge graph (KG) to introduce engaging and contextually diverse entities and attributes. For this study, we collect a new Open-ended Dialog ↔ KG parallel corpus called OpenDialKG, where each utterance from 15K human-to-human role-playing dialogs is manually annotated with ground-truth reference to corresponding entities and paths from a large-scale KG with 1M+ facts. We then propose the DialKG Walker model that learns the symbolic transitions of dialog contexts as structured traversals over KG, and predicts natural entities to introduce given previous dialog contexts via a novel domain-agnostic, attention-based graph path decoder. Automatic and human evaluations show that our model can retrieve more natural and human-like responses than the state-of-the-art baselines or rule-based models, in both in-domain and cross-domain tasks. The proposed model also generates a KG walk path for each entity retrieved, providing a natural way to explain conversational reasoning.


Introduction
The key element of an open-ended dialog system is its ability to understand conversational contexts and to respond naturally by introducing relevant entities and attributes, which often leads to increased engagement and coherent interactions. While a large-scale knowledge graph (KG) includes vast knowledge of all the related entities connected via one or more factual connections from conversational contexts, the core challenge is in the domain-agnostic and scalable prediction of a small subset from those reachable entities that follows natural conceptual threads that can keep conversations engaging and meaningful. Hence, we study a data-driven reasoning model that maps dialog transitions to KG paths, aimed at identifying a subset of ideal entities to mention as a response to previous dialog contexts. Figure 1 illustrates a motivating dialog example between two conversation participants, which spans multiple related KG entities from a starting seed entity The Catcher in the Rye. To generate a KG entity response at each dialog turn, the model learns walkable paths within the KG that lead to engaging and natural topics or entities given dialog context, while pruning non-ideal (albeit factually correct) KG paths among 1M+ candidate facts. Specifically, we observe that there exists a small subset of walkable patterns within a KG, or a preferred sequence of graph traversal steps, which often leads to more engaging entities or attributes than others (e.g. Literary Realism, Nathaniel Hawthorne, etc. vs. Catch Me If You Can, 277, all connected via one- or multi-hop factual connections). Note also that the walkable degree of each entity varies by dialog contexts and domains, thus making conventional rule-based or entity-to-entity learning approaches intractable or not scalable for open-ended dialogs with 1M+ candidate facts.
Therefore, pruning the search space for entities based on dialog contexts and their relation-based walk paths is a crucial step in operating knowledge-augmented dialog systems at scale.
To this end, we propose a new model called DialKG Walker that can learn natural knowledge paths among entities mentioned over dialog contexts, and reason grounded on a large commonsense KG. Specifically, we propose a novel graph decoder that attends on viable KG paths to predict the most relevant entities from a KG, by associating these paths with the given input contexts: dialog, sentence, and a set of starting KG entities mentioned in the previous turn. We then build a parallel zeroshot learning model that predicts entities in the KG embeddings space, and ranks candidate entities based on decoded graph path output.
To train the DialKG Walker model with ground-truth reference to KG entities, we collect a new human-to-human multi-turn dialog dataset (91K utterances across 15K dialog sessions) using ParlAI (Miller et al., 2017), where conversation participants play a role either as a user or as an assistant, while annotating their mention of an entity in a large-scale common fact KG. This new dataset provides a new way for researchers to study how conversational topics could jump across many different entities within multi-turn dialogs, grounded on KG paths that thread all of them. To the best of our knowledge, our OpenDialKG is the first parallel Dialog ↔ KG corpus where each mention of a KG entity and its factual connection in an open-ended dialog is fully annotated, allowing for in-depth study of symbolic reasoning and natural language conversations.
Note that our approaches are distinct from the previous work on dialog systems in that we completely ground dialogs in a large-scale common-fact KG, allowing for domain-agnostic conversational reasoning in open-ended conversations across various domains and tasks (e.g. chit-chat, recommendations, etc.). We therefore perform extensive cross-domain and transfer learning evaluations to demonstrate its flexibility. See Section 5 for the detailed literature review.
Our contributions are as follows: we propose (1) a novel attention-based graph decoder that walks an optimal path within a large commonsense KG (100K entities, 1.1M facts) to effectively prune unlikely candidate entities, and (2) a zeroshot learning model that leverages previous sentence, dialog, and KG contexts to re-rank candidates from pruned decoder graph output based on their relevance and path scores, which allows for generalizable and robust classification with a large number of candidate classes. We present (3) a new parallel open-ended dialog ↔ KG corpus called OpenDialKG where each mention of an entity in dialog is manually linked with its corresponding ground-truth KG path. We show that the proposed approaches outperform baselines in both in-domain and cross-domain evaluation, demonstrating that the model learns domain-agnostic walking patterns that are generalizable for unseen domains.
Figure 2 illustrates the overall architecture of the DialKG Walker model, which retrieves a set of entities from a provided KG given multiple modalities of dialog contexts. Specifically, for each turn the model takes as input a set of KG entities mentioned at the current turn, the full sentence at the current turn, and all sentences from previous turns of the dialog, which are encoded using Bi-LSTMs with self-attention modules (Section 2.2). The autoregressive graph decoder takes attention-based encoder output at each decoding step to generate a walk path for each starting KG entity, which is combined with zeroshot KG embeddings prediction results to rank candidate entities (Section 2.3).

Notations
We define the knowledge graph $G_{KG} = V_{KG} \times R_{KG}$, which is composed of all common-sense entity nodes $V_{KG}$ and the relation set $R_{KG}$ that connects each pair of nodes. Let us also denote by $V_r(v)$ the set of nodes directly connected to a node $v \in V_{KG}$ by a relation $r \in R_{KG}$. Similarly, we denote by $V_{R,n}(v)$ the set of nodes connected to $v$ via $n$ hops with a set of relations $R$.
Each input $x = \{x_e; x_s; x_d\}$ is composed of three modalities: $x_e$ is a set of entities mentioned in the current turn, $x_s$ is its surrounding sentence context in the same turn, and $x_d$ is its dialog context up to the previous turn.
Each output $y = \{y_e; y_r\}$ is a KG path sequence that connects $x_e$ to the entities to be mentioned in the next turn, composed of an entity path $y_e$ and a relation path $y_r$. We formulate the future entity retrieval task as selecting, among the candidates in $V(x_e)$, the output that best matches $f_{x \to y}(x)$, where $f_{x \to y}$ is a function with learnable parameters that projects input samples at the current turn ($x$) into the same space as the output representations ($y$), i.e. entities to be mentioned in the next turn and their optimal paths. $V(x_e) \subset V_{KG}$ denotes a set of KG entity nodes reachable from $x_e$, defined according to each decoding method.
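To make the neighbor-set notation concrete, the following is a minimal sketch of $V_r(v)$ and $V_{R,n}(v)$ over a toy triple list. The node and relation names (`v0`, `r_a`, ...) are hypothetical, and two readings are assumed rather than taken from the paper: edges are followed from subject to object only, and "via n hops" is read as exactly n hops.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples; all names are hypothetical.
TRIPLES = [
    ("v0", "r_a", "v1"),
    ("v0", "r_b", "v2"),
    ("v1", "r_a", "v3"),
    ("v2", "r_a", "v3"),
    ("v2", "r_b", "v4"),
]


def build_index(triples):
    """Map subject -> relation -> set of objects (directed edges, an assumption)."""
    index = defaultdict(lambda: defaultdict(set))
    for s, r, o in triples:
        index[s][r].add(o)
    return index


def neighbors(index, v, r):
    """V_r(v): nodes directly connected to v by relation r."""
    return set(index.get(v, {}).get(r, set()))


def n_hop(index, v, relations, n):
    """V_{R,n}(v): nodes reached from v in exactly n hops using relations in R."""
    frontier = {v}
    for _ in range(n):
        next_frontier = set()
        for node in frontier:
            for r in relations:
                next_frontier |= neighbors(index, node, r)
        frontier = next_frontier
    return frontier


INDEX = build_index(TRIPLES)
print(sorted(neighbors(INDEX, "v0", "r_a")))
print(sorted(n_hop(INDEX, "v0", {"r_a", "r_b"}, 2)))
```

Restricting the relation set passed to `n_hop` shrinks the reachable set, which is the effect the path decoder relies on when it attends to only a few relations at each step.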

Input Encoding
Entity representation: We construct KG embeddings to encode each entity mention (Bordes et al., 2013), in which semantically similar entities are distributed closer in the embeddings space. In brief formulation, the model for obtaining embeddings from a KG (composed of subject-relation-object $(s, r, o)$ triples) is as follows:
$$P(I_r(s, o) = 1) = \mathrm{score}(e(s), e_r(r), e(o))$$
where $I_r$ is an indicator function of a known relation $r$ for two entities $(s, o)$ (1: valid relation, 0: unknown relation), $e$ is a function that extracts embeddings for entities, $e_r$ extracts embeddings for relations, and $\mathrm{score}(\cdot)$ is a deep neural network that produces a likelihood of a valid triple.
Sentence representation: We represent textual context of surrounding words of a mention with a state-of-the-art attention-based Bi-LSTM language model (Conneau et al., 2017) with GloVe (Pennington et al., 2014) distributed word embeddings trained on the Wikipedia and the Gigaword corpus with a total of 6B tokens.
Dialog representation: To encode previous dialog history, we use a hierarchical Bi-LSTM (Yang et al., 2016) over a sequence of previous sentences with a fixed window size. We apply self-attention over sentences to attenuate and amplify sentence contexts based on their relevance to the task, allowing for more robust and explainable prediction.
Input aggregation: We aggregate input contexts $x$ from entities, sentences and dialogs by applying the modality attention (Moon et al., 2018a,b), which selectively attenuates or amplifies each modality based on its importance to the task:
$$\bar{x} = \sum_{m \in \{e, s, d\}} \alpha_m x_m$$
where $\alpha = [\alpha_e; \alpha_s; \alpha_d] \in \mathbb{R}^3$ is an attention vector, and $\bar{x}$ is a final context vector that maximizes information gain.
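The input aggregation step can be illustrated with a small sketch. It assumes that the three modality vectors share one dimensionality, that the attention logits are already given (how they are produced is not shown here), and that the weights are normalized with a softmax before a weighted sum. These are illustrative assumptions and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [value / total for value in exps]


def aggregate(x_e, x_s, x_d, logits):
    """Combine entity, sentence and dialog vectors with modality attention.

    Assumption: alpha = softmax(logits), and the context vector is the
    alpha-weighted sum of the three modality vectors.
    """
    alpha = softmax(logits)
    modalities = [x_e, x_s, x_d]
    dim = len(x_e)
    context = [
        sum(alpha[m] * modalities[m][i] for m in range(3))
        for i in range(dim)
    ]
    return alpha, context


# With equal logits every modality contributes equally.
ALPHA, CONTEXT = aggregate([3.0, 0.0], [0.0, 3.0], [3.0, 3.0], [0.0, 0.0, 0.0])
print(ALPHA, CONTEXT)
```

Raising one logit relative to the others shifts the context vector toward that modality, which is the amplify/attenuate behavior the text describes.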

Graph Decoder
Using the contextual information extracted from an entity and its surrounding text (Section 2.2), we build a network which predicts a corresponding KG entity based on its knowledge graph embeddings with the following objective:
$$\min_{W} \; L_f(x, y) + L_{walk}(x, y) + R(W)$$
where $L_f(\cdot)$ is a supervised loss for generating the correct entity at the next turn, and $L_{walk}(\cdot)$ is a loss defined for taking the optimal path within a knowledge graph. $W = \{W_f, W_p, W_{input}\}$ are the learnable parameters for the final entity classifier ($W_f$), the path walker model ($W_p$), and the input encoder ($W_{input}$), respectively. $R(W)$ denotes the weight decay regularization term.

Zeroshot Relevance Score
We compute the zeroshot relevance score in the KG embeddings space, thus allowing for robust prediction for KG entities and domains unseen during training as well. Specifically, we use the supervised hinge rank loss for KG embeddings prediction as a choice of $L_f$, defined for each sample (Moon and Carbonell, 2017):
$$L_f(x, y_e) = \sum_i \sum_{\tilde{y} \neq y_e^{(i)}} \max\left[0,\; \tilde{y} \cdot y_e^{(i)} - f(x^{(i)}) \cdot \left(y_e^{(i)} - \tilde{y}\right)\right]$$
where $f(\cdot)$ is a transformation function that walks through the knowledge graph and projects a predicted future entity in the KG embeddings space, and $\tilde{y}$ refers to the embeddings of negative samples randomly sampled from KG entities except the ground-truth label of the instance. Intuitively, the model is trained to produce a higher dot product similarity between the projected embeddings of a sample and its correct label ($f(x^{(i)}) \cdot y_e^{(i)}$) than with an incorrect negative label in the KG label embeddings space ($f(x^{(i)}) \cdot \tilde{y}$), where the margin is defined as the similarity between a ground-truth sample and a negative sample ($\tilde{y} \cdot y_e^{(i)}$).
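The hinge rank loss described above can be sketched for a single sample as follows. This is a minimal illustration that follows the verbal description (dot-product similarity, with the margin set to the similarity between the ground-truth and negative embeddings). The vectors are toy values, and the function names are hypothetical rather than taken from any released implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_rank_loss(projected, positive, negatives):
    """Per-sample hinge rank loss in the KG embeddings space.

    projected: f(x), the projected embedding of the input sample
    positive:  embedding of the ground-truth entity
    negatives: embeddings of sampled negative entities
    """
    loss = 0.0
    for negative in negatives:
        margin = dot(negative, positive)
        # Penalize when the negative is not separated from the positive by the margin.
        loss += max(0.0, margin - dot(projected, positive) + dot(projected, negative))
    return loss


# A projection aligned with the positive incurs no loss ...
GOOD = hinge_rank_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# ... while one aligned with the negative is penalized.
BAD = hinge_rank_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
print(GOOD, BAD)
```

Because the loss only compares similarities in the embeddings space, an entity unseen during training can still be scored at test time as long as it has a KG embedding.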

KG Path Walker
Generating candidate KG entities solely based on their relevance score (Eq. 5) is challenging due to the exponentially large search space. To this end, we define the attention-based DialKG graph decoder model, which prunes unattended paths and thereby effectively reduces the search space. Decoding follows a gated recurrent formulation (bias terms for gates are omitted for simplicity of notation), in which $z_t$ is a context vector at decoding step $t$, produced from the attention over walkable paths. Here $\alpha_t \in \mathbb{R}^{|R_{KG}|}$ is an attention vector over the relation space, $r_k$ are relation embeddings, and $z_t$ is the resulting entity context vector after walking from the previous entity on an attended path. We guide the graph decoder with the ground-truth walk paths by computing the loss $L_{walk}(x, y) = \sum_{i,t} L_{ent} + L_{rel}$ between the predicted paths and each of $\{y_e, y_r\}$, respectively ($L_{ent}$: loss for entity paths, and $L_{rel}$: loss for relation paths). Once the model is trained, at each decoding step we can rank the potential paths based on the sum of their zeroshot relevance score and their soft-attention-based output path score.
Adversarial Transfer Learning: If domain labels ($y_d$) are available (e.g. movie, book, sports, etc.), we can utilize these labels to further aid training by extracting transferable features and learning optimal paths conditioned on domain embeddings (Ganin et al., 2016). We implement adversarial transfer learning for DialKG Walker and study this specific setting in one of our experiments to demonstrate that the model can better generalize over multiple domains.

Dataset: OpenDialKG
To empirically evaluate the proposed approach, we collected a new dataset, OpenDialKG, of chat conversations between two agents engaging in a dialog about a given topic (91K turns across 15K dialog sessions).
Each dialog is paired with its corresponding "KG paths" that weave together the KG entities and relations that are mentioned in the dialog. This parallel corpus of textual dialogs and corresponding KG walks enables learning models that ground the implicit reasoning in human conversations to discrete KG operations.
Wizard-of-Oz setup: The dialogs were generated in a Wizard-of-Oz setting (Shah et al., 2018) by connecting two crowd-workers to engage in a chat session, with the joint goal of creating natural and engaging dialogs. The first agent is given a seed entity and asked to initiate a conversation about that entity. The second agent is provided with a list of facts relevant to that entity, and asked to choose the most natural and relevant facts and use them to frame a free-form conversational response. Each fact is a 1-hop or 2-hop path initiating from the conversation topic. After the second agent sends their response, various new multi-hop facts from the KG are surfaced to include paths initiating from new entities introduced in the latest message. This process allows the conversation participants to annotate any new fact or entity they want to introduce at each turn, along with the ground-truth KG walk path that connects the two KG entities. At this point the first agent is instructed to continue the conversation by choosing among the updated set of facts and framing a new message. This cycle continues for 6 messages per session on average, spanning multiple KG paths, until one of the agents decides to end the conversation (e.g. the task goal is met).
We did two separate collections: a recommendation task where the second agent acts as an assistant who is providing useful recommendations to the user, and a chit-chat task where both agents act as users engaging in open-ended chat about a particular topic. To ensure sufficient separation of the dialog content, we used entities related to movies (titles, actors, directors) and books (titles, authors) for the recommendation task, and entities related to sports (athletes, teams) and music (singers) for the chit-chat task. Seed entities were taken from lists (e.g. athletes list, etc.) and linked with the corresponding KG entities.
KG sources: We use the Freebase (Bast et al., 2014) KG, which is a publicly available and comprehensive source of general-knowledge facts. To reduce noise, we filter out tail-end entities based on their prominence scores; the resulting KG consists of a total of 1,190,658 fact triples over the top 100,813 entities and 1,358 relations.
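The filtering step can be pictured with the short sketch below. The exact procedure is not specified in the text, so this assumes that each entity has a numeric prominence score, that the top-N entities are kept, and that a triple survives only if both of its endpoints are kept. All names and values are hypothetical.

```python
def filter_kg(triples, prominence, top_n):
    """Keep the top_n most prominent entities and the triples among them.

    Assumption: a triple is retained only if both subject and object survive.
    Ties in prominence are broken alphabetically to keep the result deterministic.
    """
    ranked = sorted(prominence, key=lambda entity: (-prominence[entity], entity))
    kept_entities = set(ranked[:top_n])
    kept_triples = [
        (s, r, o) for s, r, o in triples
        if s in kept_entities and o in kept_entities
    ]
    return kept_entities, kept_triples


# Toy data; entity names, relations and scores are illustrative only.
TOY_TRIPLES = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d")]
TOY_PROMINENCE = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}

ENTITIES, FACTS = filter_kg(TOY_TRIPLES, TOY_PROMINENCE, top_n=3)
RELATIONS = {r for _, r, _ in FACTS}
print(len(FACTS), len(ENTITIES), len(RELATIONS))
```

The three printed counts correspond to the kind of summary statistics reported above (fact triples, entities, relations), here for the toy graph only.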

Empirical Evaluation
Task: Given a set of KG entity mentions from the current turn, and the dialog history of all current and previous sentences, the goal is to build a robust model that can retrieve a set of natural entities to mention from a large-scale KG that resemble human responses. Note that end-to-end generation of sentences (e.g. based on the retrieved entities) is not part of this study; instead, we focus on the important challenge of scaling the conversational reasoning and knowledge retrieval task to open-domain dialogs, requiring an aggressive subset selection (from a 1M+ facts subset of Freebase).

Baselines
We choose as baselines the following state-of-the-art approaches that augment external knowledge to dialog systems for various tasks (see Section 5 for details), and modify them accordingly to fit our entity retrieval task (e.g. we use the same 1M-facts Freebase KG for all of the baselines):
• seq2seq (Sutskever et al., 2014) with dialog contexts + zeroshot: we apply the seq2seq approach for entity path generation, given all of the dialog contexts. To make this baseline stronger, we add a zeroshot learning layer in the KG embeddings space (replacing typical softmax layers to improve generality) for entity token decoding.
• Tri-LSTM (Young et al., 2018): encodes each utterance and all of its related facts within 1-hop from a KG to retrieve a response from a small (N=10) pre-defined sentence bank. We modify the retrieval bank to be the facts from the KG instead.
• Extended Enc-Dec (Parthasarathi and Pineau, 2018): conditions response generation with external knowledge vector input. A response entity token is generated at its final softmax layer, hence not utilizing structural information from KG.
We also consider several configurations of our proposed approach to examine contributions of each component (input modalities (E): entities, (S): sentence, (D): dialog contexts).
• (Proposed; E+S+D): is the proposed approach as described in Figure 2.
• (E+S): relies only on its previous sentence and excludes dialog history from input.
• (E): only uses starting KG entities as input contexts, and excludes any textual context.

Results
Parameters: We optimize the model parameters with Adagrad (Duchi et al., 2011) with batch size 10, learning rate 0.01, epsilon $10^{-8}$, and decay 0.1.
In-domain evaluation: Table 2 shows the generation results of the top-k predictions of the model for in-domain train and test pairs (train & test on: all domains / train & test on: movie domain split). It can be seen that the proposed DialKG Walker model outperforms other state-of-the-art baselines, especially for recalls at small k. Specifically, when textual contexts are added as input (E+S and E+S+D), the model learns to condition its walk path output on textual contexts, thus outperforming the non-textual ablation model (E). The seq2seq and Tri-LSTM models consider the nodes connected via all possible relations as candidates in the final layer (without pruning), resulting in an extensive search space and consequently poor recall performance. In addition, Tri-LSTM only considers the facts connected via 1-hop relations as input contexts, which limits its prediction for multi-hop facts. Ext-ED relies on the final softmax layer for its prediction, which typically performs poorly for a large number of output classes, compared to zeroshot learning approaches.
Table 5: Human evaluation: "Which response is the most natural for given dialog context?" (metric: % of cases chosen as top-k response by the raters)
Cross-domain evaluation: Table 3 demonstrates that the DialKG Walker model can generalize to multiple domains better than the baseline approaches (train: movie & test: book / train: movie & test: music). This result indicates that our method also allows for zeroshot pruning by relations based on their proximity in the KG embeddings space, thus effective in cross-domain cases as well. For example, relations 'scenario by' and 'author' are close neighbors in the KG embeddings space, thus allowing for zeroshot prediction in cross-domain tests, although their training examples usually appear in two separate domains: movie and book.
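For reference, the recall at top-k used in the in-domain and cross-domain evaluations can be computed as in the sketch below. It assumes a single ground-truth entity per dialog turn and a ranked candidate list per turn. The entity names are placeholders, and this is not the authors' evaluation script.

```python
def recall_at_k(ranked_entities, gold_entity, k):
    """Return 1.0 if the gold entity appears among the top-k candidates, else 0.0."""
    return 1.0 if gold_entity in ranked_entities[:k] else 0.0


def mean_recall_at_k(predictions, golds, k):
    """Average recall@k over all evaluated dialog turns."""
    hits = [recall_at_k(ranked, gold, k) for ranked, gold in zip(predictions, golds)]
    return sum(hits) / len(hits)


# Two toy turns with ranked candidates and one gold entity each.
PREDICTIONS = [["a", "b", "c"], ["x", "y", "z"]]
GOLDS = ["b", "z"]

for k in (1, 2, 3):
    print(k, mean_recall_at_k(PREDICTIONS, GOLDS, k))
```

Recall at small k is the stricter setting: a model only scores well there if pruning and ranking place the human-chosen entity near the very top of the candidate list.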
Human evaluation: To compare the subjective quality of the models, i.e. the relative naturalness and relevance of the generated KG paths, we performed a human evaluation where paid raters were shown partial dialogs taken from the test dataset, along with the top 2 paths output from each model. The rater was asked to choose the 3 most appropriate paths for continuing the dialog. We evaluated 250 dialogs, showing each dialog to 3 raters, for a total of 750 tasks. We report the % of cases when a top-k chosen fact was generated by each of the models (Table 5). The numbers add up to more than 100% as models can generate identical paths. If such a path is chosen by the rater, it is counted towards each of the models that generated the path.
We show that the responses generated by our proposed methods achieve the highest scores in all top-k evaluations, validating that the model can output more natural human-like responses.
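The counting rule used in the human evaluation, where a chosen path is credited to every model that generated it, explains why the percentages can exceed 100% in total. The sketch below illustrates that rule for the top-1 choice only. The model names, paths and data layout are hypothetical.

```python
def attribution_rates(model_outputs, chosen_paths):
    """Percentage of tasks in which each model generated the rater's chosen path.

    model_outputs: one dict per task, mapping model name -> set of generated paths
    chosen_paths:  one chosen path per task (the rater's top choice)
    A path produced by several models is credited to each of them.
    """
    counts = {}
    for outputs, chosen in zip(model_outputs, chosen_paths):
        for model, paths in outputs.items():
            counts.setdefault(model, 0)
            if chosen in paths:
                counts[model] += 1
    total = len(chosen_paths)
    return {model: 100.0 * count / total for model, count in counts.items()}


# Paths are (relation, entity) tuples; in task 1 both models output the chosen path.
P1, P2, P3, P4, P5 = [("r", "e%d" % i) for i in range(1, 6)]
OUTPUTS = [
    {"model_a": {P1, P2}, "model_b": {P1, P3}},
    {"model_a": {P4}, "model_b": {P5}},
]
CHOSEN = [P1, P5]

RATES = attribution_rates(OUTPUTS, CHOSEN)
print(RATES)
```

In this toy case the two rates sum to 150%, because the first chosen path is shared by both models.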
Transfer learning: In Figure 3, we show that cross-domain performance can greatly improve with a relatively small addition of in-domain target data, via the transfer learning approaches. Specifically, it can be seen that (TL:Adv), which simultaneously trains on both source and target data (effectively doubling the training size) with an additional adversarial discriminator for source and target domains, achieves the best performance, especially for domains that are semantically close (e.g. movie and book). (TL:FT) transfers knowledge from a pre-trained source model via fine-tuning (hence requiring significantly fewer training resources), and effectively avoids "cold start" training (Moon et al., 2015). This result shows that the DialKG model can quickly adapt to other new low-resource domains and improve upon the zeroshot cross-domain performance, demonstrating its potential capability to reason on open-ended conversations.
Error analysis: Table 4 shows some of the example output from each model (as well as ground-truth responses), given dialog contexts. In general, the DialKG Walker tends to explore more multi-hop relations than other baselines in order to generate natural and engaging entities, which consequently improves the diversity of answers. Note that if the graph decoder arrives at a sufficiently good entity to generate, it stops its traversal operation and outputs the most viable entity based on the relevance score. Some of the models do not take into account the dialog history, hence generating redundant topics from previous turns. There are some cases where the final entity prediction is different from the ground truth, whereas its relation path is correctly predicted. The generated entities are often still considered valid and natural, because the proposed model uses the zeroshot relevance score to best predict the candidates.

Discussion and Related Work
Knowledge augmented dialog systems: Young et al. (2018) propose to explicitly augment input text with concepts expanded via 1-hop relations (where KG triples are represented in the sentence embeddings space), and He et al. (2017) propose a system which iteratively updates KG embeddings and attends over connected entities for response generation. However, several challenges remain to scale the simulated knowledge graph used in the study to our open-ended and large-scale KG with 1M+ facts. Another line of work (Parthasarathi and Pineau, 2018; Ghazvininejad et al., 2018; Long et al., 2017) uses embedding vectors obtained from external knowledge sources (e.g. NELL (Carlson et al., 2010), Wikipedia, Freebase (Bast et al., 2014), free-form text, etc.) as an auxiliary input to the model in dialog generation. Our model extends the previous work by (1) explicitly modeling output reasoning paths in a structured KG, and (2) introducing an attention-based multi-hop concept decoder to improve both recall and precision.
End-to-end dialog systems: Several models and corresponding datasets have recently been published. Most work focuses on task or goal oriented dialog systems such as conversational recommendations (Salem et al., 2014; Sun and Zhang, 2018; Dalton, 2018), information querying (Williams et al., 2017; de Vries et al., 2018; Reddy et al., 2018), etc., with datasets collected mostly through bootstrapped simulations, Wizard-of-Oz setups (Wei et al., 2018), or online corpora (Li et al., 2016). Our OpenDialKG corpus is unique in that it includes open-ended natural human conversations over multiple scenarios (e.g. chit-chat and recommendation on various domains), where reasoning paths from each dialog are annotated with their corresponding discrete KG operations. Our work can also be viewed as extending the conventional state-tracking approaches (Henderson et al., 2014) to more flexible KG paths as states.
KG embeddings and inference: Several methods have been proposed for KG inference tasks (e.g. edge prediction), which include neural models trained to discern positive and negative triples (Bordes et al., 2013; Wang et al., 2014; Nickel et al., 2016; Dettmers et al., 2018), or algorithms with discrete KG operations on structured data (Lao et al., 2011; Chen et al., 2015). KG embeddings have been shown effective in other NLP tasks when they are used as target labels for classification tasks, which also allows for effective transfer learning (Moon and Carbonell, 2017). For effective application of KG embeddings in NLP tasks, recent studies (Kartsaklis et al., 2018) proposed to map word embeddings and KG embeddings via end-to-end tasks. In contrast to the line of work on KG edge prediction, we aim to learn an optimal path within existing paths that resemble human reasoning in conversations.
We study conversational reasoning grounded on knowledge graphs, and formulate an approach in which the model learns to navigate a large-scale, open-ended KG given conversational contexts. For this study, we collect a newly annotated Dialog ↔ KG parallel corpus of 15K human-to-human dialogs which includes ground-truth annotation of each dialog turn to its reasoning reference in a large-scale common fact KG. Our proposed DialKG Walker model improves upon the state-of-the-art knowledge-augmented conversation models with (1) a novel attention-based graph decoder that penalizes decoding of unnatural paths, which effectively prunes candidate entities and paths from a large search space (1.1M facts), and (2) a zeroshot learning model that predicts a relevance score in the KG embeddings space; the combined score is used for candidate ranking. The empirical results from in-domain, cross-domain, and transfer learning evaluation demonstrate the efficacy of the proposed model in domain-agnostic conversational reasoning.