AttnIO: Knowledge Graph Exploration with In-and-Out Attention Flow for Knowledge-Grounded Dialogue

Retrieving the proper knowledge relevant to the conversational context is an important challenge in dialogue systems, as it lets them engage users with more informative responses. Several recent works formulate this knowledge selection problem as a path traversal over an external knowledge graph (KG), but make only limited use of the KG structure, leaving room for improvement in performance. To this end, we present AttnIO, a new dialog-conditioned path traversal model that makes full use of the rich structural information in the KG based on two directions of attention flows. Through the attention flows, AttnIO is not only capable of exploring a broad range of multi-hop knowledge paths, but also learns to flexibly adjust the range of plausible nodes and edges to attend to, depending on the dialog context. Empirical evaluations show a marked performance improvement of AttnIO over all baselines on the OpenDialKG dataset. We also find that our model can be trained to generate an adequate knowledge path even when the paths are not available and only the destination nodes are given as labels, making it more applicable to real-world dialogue systems.


Introduction
One of the milestone challenges in conversational AI is to engage users with more informative and knowledgeable responses, rather than merely outputting generic sentences. For instance, given a user's utterance saying "I'm a big fan of Steven Spielberg", it would be more engaging to respond "My favorite movie is his science fiction film A.I.", rather than "I like him too.". An external source of knowledge such as a knowledge graph (KG) can play a crucial role here, as it can provide the conversational agent with informative paths, such as "Steven Spielberg, directed, A.I., has genre, Science Fiction".
* Equal contribution.
This motivation gives rise to a clear need for a path retrieval model on the KG, which can learn to traverse a path consisting of proper entities and relations to mention in the next response, given the dialog context. Previous approaches to this knowledge selection problem rely on either an RL-based agent (Liu et al., 2019) or a recurrent decoder (Moon et al., 2019), which greedily selects the most proper entity to traverse given its previous decision. Despite their novelty, we find room for improvement in several respects, to move toward a more fine-grained modeling of knowledge path retrieval for dialogue systems. First, a model could make use of the rich relational information residing in the neighborhood of each node in the KG. Typically, the number of entities in a KG is large, while each entity is used only a small number of times in actual dialogues. Leveraging the neighborhood information of each entity in the knowledge graph could therefore be crucial, to overcome the sparsity of entity usage and learn proper representations of entities and relations.
Also, we find that the range of knowledge paths plausible for a response to a given dialog may vary, depending on the dialog context and user intent. In response to a closed question such as "Who directed the movie A.I.?", there could be only one or two knowledge paths valid as an answer. In contrast, in response to an open question such as "Do you know Steven Spielberg?", there could be a variety of knowledge paths natural enough to carry on the conversation. Therefore, a model should be able to choose the range of entities to attend to, depending on the characteristics of a given dialog.
Lastly, one should note that it is practically hard to gather a large-scale dialog-KG path parallel corpus, fully annotated with all entities and relations comprising each path. Retrieving the initial and final entities of the KG path is relatively easier, as it only requires post-processing the {query, response} pairs in the dialog. Therefore, it would be more desirable if a model could be trained to traverse a proper knowledge path with only the destination nodes provided as labels.
Figure 1: Path decoding process of AttnIO. By propagating attention values at each step rather than selecting one node to traverse, AttnIO can choose between focusing on a small number of entities or attending to multiple relevant neighbors, depending on the dialog context.
To this end, we propose AttnIO (Attention Inflow and Outflow), a novel KG path traversal model that addresses all the challenges stated above. Aside from the conventional textual encoder which encodes the dialog history and user utterance, AttnIO models the KG traversal mechanism as two subprocesses: incoming attention flow, and outgoing attention flow. Inspired by Attention Flow, the two attention flows explore the KG by propagating the attention value at each node to its reachable neighbor nodes, as shown in Figure 1. The attention propagation mechanism enables our model to start exploring the KG from multiple entities (A.I., and The Truman Show), then find an intermediate node Drama relevant to both movies, and end the multi-hop reasoning by arriving at Catch me if you can. Such a complex interaction between entities cannot be modeled by a greedy decoder, which is limited to considering only one optimal node at each decoding step. In addition, our model provides better interpretation of its path reasoning process, by visualizing the attention distribution of nodes and edges at each step. Lastly, we consider our model in a more challenging, but more realistic setting of the path retrieval task, where no ground-truth path is available for supervision, and only the final destination nodes are given. Even in this setting, we find that AttnIO can be trained to infer a proper knowledge path for the input dialog.
In summary, our contributions are as follows:
(1) We suggest a novel path traversal model, AttnIO, achieving state-of-the-art performance in the dialog-conditioned knowledge path retrieval task on the OpenDialKG dataset.
(2) We demonstrate that AttnIO can be trained even in a challenging setting where only the destination nodes are given, and show through both qualitative and quantitative analysis that the quality of paths generated in this setting does not fall behind that of the all-path supervision setting.
(3) Through visualizing the attention distribution at each decoding step, we show that our model offers better interpretability of the path reasoning process.

Related Works
Recently, much research effort has been devoted to grounding dialogue systems on structured knowledge embedded in knowledge graphs. These works can be broadly classified into two categories, depending on the range of exploration over candidate knowledge. The first line of works, namely breadth-centric approaches, tends to focus on augmenting the dialog context with entity representations, by aggregating their shallow (i.e., 1-hop or 2-hop) neighborhood information from an external knowledge graph (Liu et al., 2018; Parthasarathi and Pineau, 2018). One such work encodes an auxiliary knowledge vector by attentively reading all 1-hop relations of each initial entity that appears in the user's utterance. A follow-up work extends this knowledge encoding scheme to 2-hop relations, encoding all initial entities and their 1-hop neighbors with two independent attention mechanisms. While these works are successful in contextualizing each entity with various relations in the KG, they fall short in retrieving a small set of focused knowledge paths relevant to the dialog, or in generalizing to multi-hop relations. We extend these approaches by suggesting a new framework that generalizes to an arbitrary length of traversal, and that dynamically updates entity features at each decoding step to facilitate multi-hop relational inference.
On the other hand, the second line of works resorts to depth-centric search over candidate knowledge paths. Rather than augmenting entity representations with a shallow but wide range of knowledge, they concentrate on traversing only a specific range of entities and relations directly usable for response generation. Liu et al. (2019) formulate the knowledge selection problem as a Partially Observed Markov Decision Process, employing a policy network to traverse the KG. Meanwhile, Moon et al. (2019) suggest a recurrent path decoder that relies on a hidden state vector to choose the next entity among reachable nodes. Although these models are competent at inferring multi-hop relations, their discrete selection mechanism neglects the rich relational information of nodes and edges they did not explicitly choose to traverse. To address this weakness, AttnIO does not select an optimal node in advance; rather, it first propagates attention to all reachable entities, and then decodes an optimal path from the output attention distribution.
Our work is also closely motivated by recent techniques from the domain of knowledge graph completion. To compensate for the weak representation power of translative embeddings (Bordes et al., 2013; Trouillon et al., 2016) and convolution-based embeddings (Dettmers et al., 2018; Nguyen et al., 2018), several models have adopted graph neural networks (GNN), encoding structural information into entity embeddings (Shang et al., 2019; Nathani et al., 2019). Other works perform traversal-based inference in node prediction tasks based on reinforcement learning (Das et al., 2018; Lin et al., 2018), or attention propagation. We extend these previous works by adopting graph neural networks and attention propagation for the dialog-conditioned path generation problem.
We denote the external knowledge graph as $G_{KG} = V_{KG} \times R_{KG}$, with nodes as entities and edges as relations between a pair of entities. We denote $G_{v,n} \subseteq G_{KG}$ as the subgraph containing all nodes and edges reachable in at most $n$ hops, starting from vertex $v$. Also, we define $\overrightarrow{N}_i$ as the set of incoming neighbor nodes of $v_i$, i.e. nodes possessing edges toward $v_i$, and $\overleftarrow{N}_i$ as the set of outgoing neighbor nodes of $v_i$.
Figure 2 illustrates the overview of AttnIO's path generation process. Given the input multi-turn dialog sequence $\{s_1, \cdots, s_n\}$ and the set of entities $V_{init} = \{e_1, \cdots, e_m\}$ appearing in the user's last utterance $s_n$, AttnIO starts by encoding the input dialog into a fixed-size context vector. It also constructs the dialog-relevant subgraph $G_{input} = \bigcup_i G_{e_i, T}$, where $T$ is a hyperparameter indicating the maximal length of path to traverse.
At each decoding step $t = 1, \cdots, T$, the incoming attention flow iteratively updates the KG entity features by attentively aggregating rich relational features from their incoming neighbor nodes. Then, the outgoing attention flow propagates the attention value of each node to its outgoing neighbor nodes, yielding the node attention distribution $a^t_i$ and the edge attention distribution $a^t_{ij}$ as the output of step $t$. We show in Section 3.4.1 that each candidate entity path $P_v$ and its relation path $P_r$ can easily be ranked from these output attention distributions.
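The construction of the dialog-relevant input subgraph can be sketched as a multi-source breadth-first search. The snippet below is an illustrative sketch rather than the authors' implementation: the function name `t_hop_subgraph`, the representation of edges as (head, relation, tail) triples, and the toy graph are assumptions made for this example.

```python
from collections import deque

def t_hop_subgraph(edges, init_entities, max_hops):
    """Return all (head, relation, tail) edges reachable within max_hops
    from any of the initial entities (union of the per-entity subgraphs)."""
    # Index outgoing edges by head entity.
    out_edges = {}
    for head, rel, tail in edges:
        out_edges.setdefault(head, []).append((rel, tail))

    # Multi-source BFS: depth[v] is the hop distance to the closest initial entity.
    depth = {e: 0 for e in init_entities}
    queue = deque(init_entities)
    kept = set()
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # edges leaving this node would exceed the hop budget
        for rel, tail in out_edges.get(node, []):
            kept.add((node, rel, tail))
            if tail not in depth:
                depth[tail] = depth[node] + 1
                queue.append(tail)
    return kept

# Toy graph (hypothetical triples, for illustration only).
toy_edges = [
    ("Steven Spielberg", "directed", "A.I."),
    ("A.I.", "has genre", "Science Fiction"),
    ("Science Fiction", "genre of", "Some Other Film"),
]
subgraph = t_hop_subgraph(toy_edges, ["Steven Spielberg"], max_hops=2)
```

With `max_hops=2`, the third triple is excluded because its head is already two hops away from the initial entity.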

Dialog Encoder
AttnIO encodes the input multi-turn dialog into a fixed-size contextual representation. Specifically, we employ the state-of-the-art textual representation from ALBERT (Lan et al., 2019), to effectively capture the context and intent of the user's utterance. We concatenate up to the last 3 utterances in the dialog, and feed the result as input to the pretrained ALBERT. We use the final layer's hidden representation of the [CLS] token, as it is typically considered to be an approximation of the sequence context. We denote this context vector as $C$ in the following sections.
Note that our architecture does not require a specific type of textual encoder, and ALBERT can be replaced with any sequence encoder such as a bidirectional RNN. For a fair comparison with previous work, we conduct an ablation study on ALBERT by replacing it with a bidirectional GRU (Section 4.1).

Incoming Attention Flow
In order to find better entity representations, the Incoming Attention Flow iteratively updates each entity feature $h_j$ for all $v_j \in G_{input}$, by aggregating the neighbor information of $v_j$. The recently suggested message-passing mechanism of graph attention networks (GAT) from Veličković et al. (2018) is suitable for this, as it learns to encode each node by selectively attending over its neighbors. Since GAT does not take edge features into account and hence may lose useful relational information integral to the KG, we extend the attention-based message passing scheme of GAT to relational graphs. At each decoding step $t$, the Incoming Attention Flow computes the message from entity $v_i$ to $v_j$ (Eq. 1) from $h_i$, the feature of $v_i$ at step $t$, and $r_{ij}$, the relation feature assigned to the edge between $v_i$ and $v_j$. The new node feature $h_j$ for the next time step $t + 1$ is then computed as an attention-based weighted sum of messages from all incoming neighbor nodes of $v_j$ (Eq. 2). The attention $a_{ij}$ is computed by applying softmax over all incoming neighbor nodes of $v_j$ (Eq. 3), with $\sigma$ denoting the LeakyReLU non-linearity.
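The incoming update described above can be sketched for a single attention head as follows. The exact message and attention equations are not reproduced in this text, so the concrete forms used below are assumptions chosen to match the description, not the authors' definitions: the message is taken to be the sum of the source feature and the relation feature, and the attention logit is the LeakyReLU of the dot product between the target feature and the message. Learned projection matrices are omitted for brevity, and all names are hypothetical.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def incoming_update(h, rel, incoming, j):
    """Update node j from its incoming neighbors (single head, no learned weights).

    h:        dict node -> feature vector (list of floats)
    rel:      dict (i, j) -> relation feature vector for edge i -> j
    incoming: dict node -> list of incoming neighbor nodes (assumed non-empty)
    """
    # ASSUMPTION: message = source feature + relation feature.
    messages = [[a + b for a, b in zip(h[i], rel[(i, j)])] for i in incoming[j]]
    # ASSUMPTION: logit = LeakyReLU(<h_j, m_ij>), normalized over incoming neighbors.
    logits = [leaky_relu(sum(x * y for x, y in zip(h[j], m))) for m in messages]
    attn = softmax(logits)
    # Attention-weighted sum of incoming messages.
    dim = len(h[j])
    new_h = [sum(a * m[d] for a, m in zip(attn, messages)) for d in range(dim)]
    return new_h, attn

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
relations = {(0, 2): [0.0, 0.0], (1, 2): [0.0, 0.0]}
new_feature, weights = incoming_update(features, relations, {2: [0, 1]}, j=2)
```

In this toy case both neighbors are equally relevant to node 2, so the attention splits evenly and the new feature is the average of the two messages.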
In addition, we extend our attentive aggregation scheme to multi-headed attention, which helps to jointly attend to information from different representation subspaces of incoming messages (Vaswani et al., 2017). Thus the message aggregation mechanism in Eq. 2 is extended to $K$ attention heads (Eq. 4), where $K$ denotes the number of attention heads. The attention heads perform independent self-attention over neighborhood features, and their outputs are then concatenated to form the new node feature $h_j$. Another crucial step in the Incoming Attention Flow is to fuse entity features with the dialog context, such that even if the same set of initial entities is given, the decoder can traverse different paths according to the dialog context. We achieve this fusion by concatenating the dialog context vector with the entity feature computed from Eq. 4 and then linearly transforming the result back to the entity embedding dimension.
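One way to write the multi-head aggregation and the context fusion described above is given below. This is a hedged reconstruction from the prose only: the per-head superscript $(k)$, the intermediate symbol $\tilde{h}_j$, and the fusion parameters $W_c$ and $b_c$ are notation assumed for this sketch, and the original equations may differ in detail.

```latex
% Assumed notation: m_{ij}^{(k)} and a_{ij}^{(k)} are the message and attention
% weight of head k, \Vert denotes concatenation, C is the dialog context vector,
% and W_c, b_c are hypothetical parameters of the fusion layer.
\begin{align}
\tilde{h}_j &= \big\Vert_{k=1}^{K} \sum_{i \in \overrightarrow{N}_j} a_{ij}^{(k)} \, m_{ij}^{(k)} \\
h_j^{t+1} &= W_c \left[ \tilde{h}_j \,\Vert\, C \right] + b_c
\end{align}
```

Here $W_c$ maps the concatenated vector back to the entity embedding dimension, so that the updated feature can be used as input to the next decoding step.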

Outgoing Attention Flow
At the core of AttnIO lies the Outgoing Attention Flow, which defines path traversal on the KG as an attention propagation mechanism. At the beginning of decoding, it starts by computing the initial attention value $a^0_i$ of the nodes in $V_{init}$, i.e. the set of entities appearing in the user's last utterance.
The relevance of each candidate node is scored as the dot-product with the dialog context vector. For entities not in $V_{init}$, we initialize the node attention value to zero.
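The initialization step can be sketched as follows. The dot-product scoring and the zero attention outside the initial entity set come from the description above; normalizing the scores with a softmax over the initial entities is an assumption of this sketch, as are the function name and the toy feature values.

```python
import math

def initial_attention(h, context, init_entities):
    """Initial node attention: non-zero only on the entities in init_entities.

    h:       dict node -> feature vector (list of floats)
    context: dialog context vector, same dimension as the node features
    """
    # Relevance of each initial entity: dot product with the context vector.
    scores = {v: sum(x * c for x, c in zip(h[v], context)) for v in init_entities}
    # ASSUMPTION: scores are normalized with a softmax over the initial entities.
    top = max(scores.values())
    exps = {v: math.exp(s - top) for v, s in scores.items()}
    total = sum(exps.values())
    # Every other node in the subgraph starts with zero attention.
    return {v: (exps[v] / total if v in exps else 0.0) for v in h}

toy_features = {"Dan Scanlon": [0.3, 0.7], "Cars": [0.9, 0.1]}
a0 = initial_attention(toy_features, context=[1.0, 2.0], init_entities=["Dan Scanlon"])
```

When the last utterance mentions a single entity, that entity receives all of the initial attention regardless of its score.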
Hereafter, the decoder iterates from step 1 to $T$, where $T$ denotes the maximal possible path length. We add self-loops to each node in $G_{input}$, so that a traversal can end before step $T$. Given the context-fused entity feature $h^t_i$ for all $v_i \in G_{input}$ at each step $t$, the Outgoing Attention Flow essentially computes how much attention value to propagate from $v_i$ to its outgoing neighbor $v_j$. The key quantity here is the transition probability $T_{ij}$, which can be derived from a function of the two relevant node features, $h_i$ and $h_j$. In this work, we compute it by averaging the multi-headed attentions computed over all outgoing neighbor nodes.
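A single propagation step can be sketched as below, under the assumption that the edge attention is the source node's attention multiplied by the transition probability and that the new node attention is the sum of incoming edge attentions. The transition probabilities are passed in directly here rather than computed from node features, and the function name and toy numbers are illustrative only.

```python
def propagate(node_attn, transition):
    """One outgoing attention flow step.

    node_attn:  dict node -> attention value at the previous step
    transition: dict node -> {neighbor: probability}; each row should include a
                self-loop and sum to 1, so that total attention is preserved
    """
    edge_attn = {}
    next_attn = {v: 0.0 for v in node_attn}
    for i, a_i in node_attn.items():
        for j, t_ij in transition.get(i, {}).items():
            flow = a_i * t_ij  # attention sent along edge i -> j
            edge_attn[(i, j)] = flow
            next_attn[j] = next_attn.get(j, 0.0) + flow
    return next_attn, edge_attn

# Toy example with made-up transition probabilities.
attn0 = {"A": 1.0, "B": 0.0, "C": 0.0}
trans = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"B": 1.0},
    "C": {"C": 1.0},
}
attn1, edges1 = propagate(attn0, trans)
```

Because every node carries a self-loop, attention that is not propagated further stays on the node, which is how a path shorter than the maximal length can be represented.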

Scoring candidate paths
Given the output of the Outgoing Attention Flow at each decoding step, i.e. the node attention distributions $a^0_i, \cdots, a^T_i$ and the edge attention distributions $a^1_{ij}, \cdots, a^T_{ij}$, we can score each candidate entity path $P_v$ with the product of the respective node attention values at each step. Likewise, the score of the relation path $P_r$ associated with $P_v$ is given by the product of the respective edge attention values at each step.
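The two scoring rules can be sketched as plain products over the per-step attention tables. The data layout (one dict per step, edges keyed by a pair of nodes) and the function names are assumptions of this example; in particular, keying edges by node pair ignores the case of several relations between the same two entities.

```python
def score_entity_path(entity_path, node_attn_steps):
    """Product of node attention values along the path.

    entity_path:     [v_0, v_1, ..., v_T]
    node_attn_steps: list of dicts, node_attn_steps[t][v] is the attention of v at step t
    """
    score = 1.0
    for step, node in enumerate(entity_path):
        score *= node_attn_steps[step].get(node, 0.0)
    return score

def score_relation_path(entity_path, edge_attn_steps):
    """Product of edge attention values along the path.

    edge_attn_steps: list of dicts, edge_attn_steps[t - 1][(u, v)] is the
                     attention of edge u -> v at step t (steps start at 1)
    """
    score = 1.0
    for step in range(1, len(entity_path)):
        edge = (entity_path[step - 1], entity_path[step])
        score *= edge_attn_steps[step - 1].get(edge, 0.0)
    return score

# Toy attention tables (made-up values).
node_steps = [{"A": 1.0}, {"B": 0.6, "C": 0.4}, {"D": 0.5, "B": 0.1, "C": 0.4}]
edge_steps = [{("A", "B"): 0.6, ("A", "C"): 0.4}, {("B", "D"): 0.5, ("B", "B"): 0.1}]
entity_score = score_entity_path(["A", "B", "D"], node_steps)
relation_score = score_relation_path(["A", "B", "D"], edge_steps)
```

Candidate paths can then be ranked by these scores, for example by enumerating paths that start from an initial entity and sorting them in descending order.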

Training Objective
We train the whole model in an end-to-end manner by directly supervising the attention distribution at each step. In the default setting, where the whole ground-truth paths are available, we use a negative log-likelihood loss on each step's attention distribution. In the target-supervision setting, where only the final entity labels are given, we apply the same loss only to the final step's attention distribution.
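The two supervision settings can be sketched with one helper that switches between them. This is an illustrative sketch: the function name, the small constant added for numerical stability, and the choice to include every step of the gold path in the all-path loss are assumptions, since the exact loss equations are not reproduced in this text.

```python
import math

def path_nll(node_attn_steps, gold_path, target_only=False, eps=1e-12):
    """Negative log-likelihood of the gold path under the node attention.

    node_attn_steps: list of dicts, node_attn_steps[t][v] is the attention of v at step t
    gold_path:       [v_0, ..., v_T], the labeled entity at each step
    target_only:     if True, supervise only the final step (target supervision)
    """
    last = len(gold_path) - 1
    steps = [last] if target_only else range(len(gold_path))
    loss = 0.0
    for t in steps:
        prob = node_attn_steps[t].get(gold_path[t], 0.0)
        loss += -math.log(prob + eps)  # eps avoids log(0) for unreachable labels
    return loss

attn_steps = [{"A": 1.0}, {"B": 0.5, "C": 0.5}, {"D": 0.25, "B": 0.75}]
all_path_loss = path_nll(attn_steps, ["A", "B", "D"])
target_loss = path_nll(attn_steps, ["A", "B", "D"], target_only=True)
```

In the target-only case the intermediate steps receive no direct signal, so the model has to discover a route to the destination on its own.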

Dialog-KG Alignment by Initialization
AttnIO's training phase tends to be unstable in the beginning, as the model has to deal with two completely different modalities: KG entities, and the dialog. In order to align the two different features, we find that initializing each entity feature with the representation from the pretrained ALBERT helps. Just as for the dialogue context representation, we feed each entity phrase as input with a [CLS] token. We then take the hidden representation of the [CLS] token from the last layer of ALBERT, and linearly transform it to create the initial entity feature $h^0_i$. Note that we do not fine-tune ALBERT, but back-propagate to $h^0_i$ during training. This additional process not only narrows the gap between the feature spaces of entities and dialog contexts, but also helps the model better understand each entity in several cases, as some entities span a lengthy phrase of natural language tokens (e.g. Grammy Award for Best Pop Collaboration with Vocals).

Experiments and Results
Dataset We evaluate our proposed method on OpenDialKG (Moon et al., 2019), a dialog-KG parallel corpus designed for the knowledge path retrieval task. The dataset consists of 91k multi-turn conversations in the form of either task-oriented (recommendation) dialog, or chit-chat on a given topic. Each pair of utterances in the conversations is annotated with a KG path, where its initial entity is mentioned in the former turn, and its destination entity is mentioned in the latter turn. As the train/valid/test partitions of OpenDialKG are not publicly available, we create our own split by randomly partitioning the data into train (70%), valid (15%), and test (15%) sets.
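A random 70/15/15 partition of this kind can be sketched as follows. The function name, the fixed seed, and the rounding of partition sizes are assumptions of this example and are not taken from the original experimental setup.

```python
import random

def split_dataset(items, train_frac=0.70, valid_frac=0.15, seed=0):
    """Shuffle items and split them into train / valid / test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

train_set, valid_set, test_set = split_dataset(range(100))
```

Fixing the seed keeps the partition stable across runs, which matters when the official split is unavailable and results must be compared over time.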
Baselines We take 4 models suggested by Moon et al. (2019) as baselines. These models include DialKG Walker, a state-of-the-art model designed to traverse a dialogue-conditioned knowledge path. The other 3 models are Seq2Seq (Sutskever et al., 2014), Extended Encoder-Decoder (Parthasarathi and Pineau, 2018), and Tri-LSTM, all modified to fit the entity path retrieval task.
Implementation Details Our model depends heavily on the message passing scheme of graph neural networks, which may lead to excessive memory usage when $G_{input}$ is large. To scale AttnIO to larger graphs, we reduce the size of the input graph through edge-sampling on $G_{input}$ during training. A detailed explanation of this edge-sampling is presented in Appendix A.
As all ground-truth paths in OpenDialKG are either 1-hop or 2-hop, we set the maximal path length $T = 2$. We search for the best set of hyperparameters using grid search, choosing for each hyperparameter the value with the best path accuracy while all other hyperparameters are fixed. We implemented our model using PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019). Additional implementation details, including hyperparameter search bounds and the best configuration, are provided in Appendix E.

Results
Table 1 presents the overall evaluation results of AttnIO, and its comparison to baseline models. In addition to the recall@k of ground-truth paths (path@k), we report recall@k on the target nodes (tgt@k), as the destination node can be considered as the most important component in knowledge path to generate response.
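For reference, recall@k over a set of examples can be computed as sketched below. The same helper applies to path@k (ranked candidate paths against the ground-truth path) and tgt@k (ranked destination nodes against the ground-truth target); the function name and the toy predictions are illustrative assumptions.

```python
def recall_at_k(ranked_predictions, gold, k):
    """Fraction of examples whose gold item appears in the top-k predictions.

    ranked_predictions: list of lists, each sorted from best to worst candidate
    gold:               list of gold items, one per example
    """
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:k])
    return hits / len(gold)

predictions = [["a", "b", "c"], ["x", "y", "z"]]
labels = ["b", "z"]
r_at_1 = recall_at_k(predictions, labels, 1)
r_at_2 = recall_at_k(predictions, labels, 2)
r_at_3 = recall_at_k(predictions, labels, 3)
```

For path@k the items would be whole paths, for example tuples of entities, so that a path counts as a hit only if every entity matches.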
As can be seen in the table, our model outperforms all baselines in both path@k and tgt@k, when supervised with all entities in each path as labels (AttnIO-AS). In particular, AttnIO-AS shows significantly better performance in metrics with small k. We also report our model in a more challenging setting of target supervision, assuming that only the destination node of each path is available (AttnIO-TS). In this case, our model shows target prediction performance (tgt@k) comparable to AttnIO-AS, while its path@k is relatively poor for small k.
Recurrent decoder based models, such as DialKG Walker and Seq2Path, rely only on a single state vector to model the transition between decoding steps. Therefore, once the model chooses to traverse a sub-optimal entity, it is hard to get back onto the right track without the help of an aggressive beam search. In our method, in contrast, the state of the decoder is essentially distributed over the feature vectors of all walkable entities; the transition is therefore modeled over all the entities with nonzero attention value at each step, making the model more robust to 'misleading' hops. Also, note that AttnFlow shows a consistent performance drop of about 30% relative to AttnIO-AS in all metrics, indicating the importance of the neighborhood encoding step for knowledge path retrieval.
Ablation Study We conduct an ablation study with three different configurations. First, we use a GRU (Cho et al., 2014) as the dialog encoder in place of ALBERT, for a fairer comparison with baseline models. As shown in Table 1, we find that although the performance of AttnIO with GRU slightly degrades from that with ALBERT, it still outperforms all existing models. Next, in order to find out the value of dialog context in the traversal, we train our model with only the initial entities given as input (with a uniform attention prior assigned to each initial entity), but not the dialog context. Recall@1 drops significantly in this case, while metrics with large k degrade only moderately. This implies that although information on the initial nodes appearing in the last utterance might be sufficient to prune improbable paths, the dialog context is essential in finding an optimal path among probable ones. We also find in the third ablation model, where no dialog-KG alignment is applied (Section 3.4.3), that ALBERT initialization of node embeddings helps, leading to a performance gain of about 2% in path@1.

Analysis
Relation Accuracy The poor entity path accuracy of AttnIO-TS may seem natural, as the initial node and intermediate node (in case of multi-hop) are not given as labels in the target supervision setting. However, one should note that there can be a variety of entity paths that match human sense in naturalness and coherence for a specific dialog. In the example shown in Table 2, any film of the comedy genre could replace One Crazy Summer in the GT-path, without loss of naturalness. The path generated by AttnIO-AS could even be an answer, giving more information on the chosen film. The inherent one-to-many relationship between dialog context and probable knowledge makes it hard to correctly assess the performance of knowledge retrieval models. Relation path accuracy could be one way to relieve this problem, as relations represent important attributes shared by similar entities. The relation path accuracy of AttnIO in both supervision settings is shown in Table 3. The relation path accuracy under both settings is clearly higher than the entity path accuracy, implying that our model generalizes by reasoning over relations, rather than depending on specific entities. Notably, AttnIO-TS shows only about 10% relative difference from AttnIO-AS, unlike in entity path@1 in Table 1. This indicates that our model can learn to competently perform relational reasoning, even in this in-the-wild setting of target supervision.
Human Evaluation In order to further examine the quality of paths from the two supervision settings, we conduct a human evaluation. We randomly sample 100 dialogues from the test set, then generate knowledge paths for half of the dialogues from AttnIO-AS, and for the other half from AttnIO-TS. We then perform a pairwise comparison between the path generated by AttnIO and the ground-truth path actually used in the dataset. For each dialogue, we ask 5 crowd-sourced workers to evaluate which of the two knowledge paths is more suitable for response generation.
We report the win/tie/lose statistics of the model-generated paths against ground-truth paths in Table 4. In both the all-path supervision and target supervision settings, more than half of the paths from our model tied with the actual paths. The result attests to the quality of the generated paths, even including those marked as wrong in quantitative measures. AttnIO-TS in particular performs much more comparably to AttnIO-AS than in Table 1, indicating that the destination nodes can function as adequate guidance for our model, in place of the whole path label.
Figure 3: Node attention visualization from the case study. Each figure represents the node attention at the initial state (Top), after the first decoding step (Center), and after the second decoding step (Bottom). We omit the edge attention to avoid visual cluttering. Best viewed in color.

Case Study
We resort to a case study, for a clear presentation of AttnIO's path reasoning process. Figure 3 presents the visualization of the output attention distribution from our model, when the dialog context is given as follows:
A: Can you recommend some films by Dan Scanlon?
B: [RESPONSE]
Note that there are hundreds of neighbor nodes connected to each entity in the external KG, but for the sake of clarity, we pruned most of them in the visualization, leaving only entities relevant to the dialog. Intuitively, there could be diverse knowledge paths as a response to the user's question. Before the initial step, AttnIO starts by assigning an attention value of 1.0 to the only entity mentioned in the utterance, Dan Scanlon. In the first propagation step, our model finds from the dialog context that the most relevant relation in this case is wrote, propagating most of the attention in Dan Scanlon to two movie entities, Monster's University and Cars. In the second step, AttnIO understands that most of the entities directly connected to these two movies can be a good option for the destination node. As a consequence, AttnIO chooses to propagate a fair amount of attention value evenly to all reachable entities, resulting in the distribution visualized in the third figure. Finally, an optimal path can be retrieved as Dan Scanlon ⇒ wrote ⇒ Monster's University ⇒ starred actor ⇒ Steve Buscemi.
Through the case study, we find that AttnIO directly reflects human intuition regarding an open question. It learns to perform relation-centric reasoning, and assigns an even amount of attention to equally likely reachable entities. In contrast, given a closed question such as "Who directed movie Cars?", AttnIO focuses on a small set of relevant entities and relations. A detailed analysis of the contrasting example is provided in Appendix C.

Conclusion
In this work, we suggest AttnIO, a novel path traversal model that reasons over the KG based on two directions of attention flows. The empirical evaluations on the OpenDialKG dataset show the strength of AttnIO in knowledge retrieval compared to baselines. AttnIO can also be trained to generate proper paths even in the more affordable setting of target supervision. Lastly, we show through a case study that our model offers a transparent interpretation of its path reasoning process, and is capable of intuitively modeling knowledge exploration depending on the dialog characteristics.

Zhibin Liu, Zheng-Yu Niu, Hua Wu, and Haifeng Wang. 2019

A Subgraph Sampling
We explain in more detail the subgraph sampling method adopted by AttnIO, as mentioned in Section 4. As shown in Figure 4, the out-degree distribution of $G_{KG}$ follows an extreme power-law distribution, which is typical in relational graphs. Among 100K nodes in $G_{KG}$, about 31K nodes possess only one incoming neighbor node, making the graph extremely sparse. Meanwhile, the node with the highest in-degree has more than 21K incoming neighbor nodes, connecting the node to about 20% of all entities in the whole graph.
We find that the small number of hub nodes with high in/out degrees is the major factor that increases the size of the input graph $G_{input}$. Therefore, we choose to limit the maximal number of neighbors to sample from each entity while constructing $G_{input}$ at training time. We denote this limit as $N_{max}$.
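The neighbor cap can be sketched as below. Sampling uniformly at random without replacement is an assumption of this example, since the sampling distribution is not specified in this text; the function name and the toy adjacency lists are likewise hypothetical.

```python
import random

def sample_neighbors(out_edges, n_max, seed=0):
    """Keep at most n_max outgoing neighbors per node.

    out_edges: dict node -> list of neighbor nodes
    """
    rng = random.Random(seed)
    sampled = {}
    for node, neighbors in out_edges.items():
        if len(neighbors) <= n_max:
            sampled[node] = list(neighbors)  # low-degree nodes are untouched
        else:
            # ASSUMPTION: uniform sampling without replacement for hub nodes.
            sampled[node] = rng.sample(neighbors, n_max)
    return sampled

toy_graph = {"hub": [f"n{i}" for i in range(10)], "leaf": ["hub"]}
capped = sample_neighbors(toy_graph, n_max=3)
```

Only hub nodes are affected by the cap, which is why a moderate limit can remove a large share of edges while leaving most nodes' neighborhoods intact.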
The effect of subgraph sampling with different $N_{max}$ is shown in Figure 5. Setting $N_{max}$ to 100, subgraph sampling effectively reduces the number of edges in the input graph to only 5.67% of the original $G_{input}$ on average, while losing only about 1.0 absolute point in path@1. In all our experiments, we set $N_{max} = 1000$, leaving only 32.4% of the edges originally in $G_{input}$, without compromising retrieval accuracy.
Figure 5: Effect of subgraph sampling. Blue bar denotes path@1 for each $N_{max}$, while red bar denotes the average relative size of the sampled subgraph compared to the original $G_{input}$.
The statistics of the OpenDialKG dataset are shown in Table 5. There are 1358 distinct types of relations comprising 1M+ edges.
Figure 6: Node attention visualization for the additional case study. Each figure represents the node attention at the initial state (Top), after the first decoding step (Center), and after the second decoding step (Bottom). The center and bottom figures might look similar, due to the slight difference in node attention distribution between the two steps. The attention is focused on a small number of entities, in contrast with the even distribution in the former case study.

C Additional Case Study
We provide an additional visualization of the node attention distribution for a given dialogue in Figure 6. This time, the model is given the following dialogue context:
A: Someone suggested the Northanger Abbey book to me. Do you know who the author is?
The dialogue is on a similar topic to the case study provided in Section 4.2, but the query in this case is a closed question that specifically asks for the author of a book.
AttnIO first starts from the only initial entity, Northanger Abbey, and finds from the context that the most relevant relation here is written by. Therefore, in the first decoding step, AttnIO propagates more than half of the attention value (0.63) from Northanger Abbey to its writer, Jane Austen. In the second propagation step, AttnIO chooses not to propagate much attention to any of Jane Austen's neighbor nodes, preserving most of the attention value (0.59) by traversing a self-loop. (We omitted self-loops in the visualization for clarity.) Note that there are a variety of neighbor nodes walkable from Jane Austen, just as in the former case study. However, AttnIO understands that the user's utterance calls for a specific answer, and concentrates most of the attention on the nodes and edges directly related to the dialogue context.

D Generation Examples
In Table 6, we present more path generation examples along with ground-truth paths for the given dialogues. Note that we only sampled cases where the paths generated from our model are different from the ground-truth paths. Dialogs are partially shown to meet the spatial constraints.

Dialog
A: I'm not sure who else was in it, but Ralph Fiennes also starred in Wrath of the Titans.
B: Wrath of the Titans, I didn't know Ralph Fiennes was in that movie. Tell me more about that movie and the stars in it.

MODEL-AS
Wrath of the Titans ⇒ starred ⇒ Liam Neeson