Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs

Human conversations naturally evolve around related concepts and hop to distant concepts. This paper presents a new conversation generation model, ConceptFlow, which leverages commonsense knowledge graphs to explicitly model conversation flows. By grounding conversations in the concept space, ConceptFlow represents the potential conversation flow as traverses in the concept space along commonsense relations. The traverse is guided by graph attentions in the concept graph, moving towards more meaningful directions in the concept space, in order to generate more semantic and informative responses. Experiments on Reddit conversations demonstrate ConceptFlow's effectiveness over previous knowledge-aware conversation models and GPT-2 based models while using 70% fewer parameters, confirming the advantage of explicitly modeling conversation structure. All source code for this work is available at https://github.com/thunlp/ConceptFlow.


Introduction
The rapid advancement of language modeling and natural language generation (NLG) techniques has enabled fully data-driven conversation models, which directly generate natural language responses for conversations (Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2016b). However, generation models commonly degenerate into dull and repetitive content (Holtzman et al., 2019; Welleck et al., 2019), which, in conversation assistants, leads to off-topic and useless responses (Tang et al., 2019; Zhang et al., 2018).
Conversations often develop around knowledge. A promising way to address the degeneration problem is to ground conversations with external knowledge (Xing et al., 2017), such as an open-domain knowledge graph (Ghazvininejad et al., 2018), a commonsense knowledge base, or background documents (Zhou et al., 2018b). Recent research leverages such external knowledge by using it to ground conversations, integrating it as additional representations, and then generating responses conditioned on both the texts and the grounded semantics (Ghazvininejad et al., 2018; Zhou et al., 2018a,b).

* Indicates equal contribution.
† Part of this work was conducted at Tsinghua University.

Integrating external knowledge as extra semantic representations and additional inputs to the conversation model effectively improves the quality of generated responses (Ghazvininejad et al., 2018; Logan et al., 2019). Nevertheless, research on discourse development suggests that human conversations are not "still": people chat around a number of related concepts and shift their focus from one concept to others. Grosz and Sidner (1986) model such concept shifts by breaking discourse into segments and demonstrating that different concepts, such as objects and properties, are needed to interpret different discourse segments. An attentional state is then introduced to represent the concept shift corresponding to each discourse segment. Fang et al. (2018) show that people may switch dialog topics entirely in a conversation. Restricting the use of knowledge to concepts that directly appear in the conversation, effective as it is, does not reach the full potential of knowledge in modeling human conversations.
To model the concept shift in human conversations, this work presents ConceptFlow (Conversation generation with Concept Flow), which leverages commonsense knowledge graphs to model the conversation flow in the explicit concept space. For example, as shown in Figure 1, the concepts of a conversation from Reddit evolve from "chat" and "future" to the adjacent concept "talk", and also hop to the distant concept "dream" along commonsense relations, a typical development in natural conversations. To better capture this conversation structure, ConceptFlow explicitly models conversations as traverses in commonsense knowledge graphs: it starts from the grounded concepts, e.g., "chat" and "future", and generates more meaningful conversations by hopping along commonsense relations to related concepts, e.g., "talk" and "dream".
The traverses in the concept graph are guided by graph attention mechanisms, which derive from graph neural networks to attend to more appropriate concepts. ConceptFlow learns to model conversation development along more meaningful relations in the commonsense knowledge graph. As a result, the model is able to "grow" the grounded concepts by hopping from the conversation utterances, along commonsense relations, to distant but meaningful concepts; this guides the model to generate more informative and on-topic responses. Modeling commonsense knowledge as concept flows is both good practice for improving response diversity by scattering the current conversation focus to other concepts (Chen et al., 2017), and an implementation of the attentional state mentioned above (Grosz and Sidner, 1986).
Our experiments on a Reddit conversation dataset with a commonsense knowledge graph, ConceptNet (Speer et al., 2017), demonstrate the effectiveness of ConceptFlow. In both automatic and human evaluations, ConceptFlow significantly outperforms various seq2seq based generation models (Sutskever et al., 2014), as well as previous methods that also leverage commonsense knowledge graphs, but as static memories (Ghazvininejad et al., 2018; Zhu et al., 2017). Notably, ConceptFlow also outperforms two fine-tuned GPT-2 systems (Radford et al., 2019), while using 70% fewer parameters. Explicitly modeling conversation structure provides better parameter efficiency.
We also provide extensive analyses and case studies to investigate the advantage of modeling conversation flow in the concept space. Our analyses show that many Reddit conversations are naturally aligned with the paths in the commonsense knowledge graph; incorporating distant concepts significantly improves the quality of generated responses with more on-topic semantic information added. Our analyses further confirm the effectiveness of our graph attention mechanism in selecting useful concepts, and ConceptFlow's ability in leveraging them to generate more relevant, informative, and less repetitive responses.
Related Work

Structured knowledge graphs include rich semantics represented via entities and relations (Hayashi et al., 2019). Many previous studies focus on task-oriented dialog systems based on domain-specific knowledge bases (Zhu et al., 2017; Gu et al., 2016). To generate responses with a large-scale knowledge base, Liu et al. (2018) utilize graph attention and knowledge diffusion to select knowledge semantics for utterance understanding and response generation. Moon et al. (2019) focus on the task of entity selection and take advantage of positive entities that appear in the golden response. Different from previous research, ConceptFlow models the conversation flow explicitly with the commonsense knowledge graph and presents a novel attention mechanism over all concepts to guide the conversation flow in the latent concept space.

Methodology
This section presents our Conversation generation model with latent Concept Flow (ConceptFlow). Our model grounds the conversation in the concept graph and traverses to distant concepts along commonsense relations to generate responses.

Preliminary
Given a user utterance $X = \{x_1, \dots, x_m\}$ with $m$ words, conversation generation models often use an encoder-decoder architecture to generate a response $Y = \{y_1, \dots, y_n\}$.

The encoder represents the user utterance $X$ as a representation set $H = \{\vec{h}_1, \dots, \vec{h}_m\}$. This is often done by Gated Recurrent Units (GRU):
$$\vec{h}_i = \text{GRU}(\vec{h}_{i-1}, \vec{x}_i), \quad (1)$$
where $\vec{x}_i$ is the embedding of word $x_i$. The decoder generates the $t$-th word in the response according to the previous $t-1$ generated words $y_{<t} = \{y_1, \dots, y_{t-1}\}$ and the user utterance $X$:
$$y_t \sim P(y_t \mid y_{<t}, X). \quad (2)$$
It then minimizes the cross-entropy loss $L$ and optimizes all parameters end-to-end:
$$L = \sum_{t=1}^{n} -\log P(y_t = y_t^* \mid y_{<t}, X), \quad (3)$$
where $y_t^*$ is the token from the golden response.

The architecture of ConceptFlow is shown in Figure 2. ConceptFlow first constructs a concept graph $G$ with a central graph $G_{central}$ and an outer graph $G_{outer}$, according to the distance (hops) from the grounded concepts (Sec. 3.2).
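The GRU encoder recurrence described above can be sketched in plain NumPy. This is a minimal, hypothetical illustration: the weights are random, the dimensions are arbitrary, and the function names are ours, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, W, U, b):
    """One GRU step. W, U, b each hold three parameter blocks:
    update gate, reset gate, and candidate state."""
    z = sigmoid(W[0] @ x + U[0] @ h_prev + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h_prev + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h_prev) + b[2])  # candidate
    return (1 - z) * h_prev + z * h_tilde

def encode(X_emb, d):
    """Encode word embeddings X_emb (m x d_in) into hidden states H (m x d)."""
    rng = np.random.default_rng(0)
    d_in = X_emb.shape[1]
    W = rng.normal(0, 0.1, (3, d, d_in))
    U = rng.normal(0, 0.1, (3, d, d))
    b = np.zeros((3, d))
    h = np.zeros(d)
    H = []
    for x in X_emb:              # run the recurrence over the utterance
        h = gru_step(h, x, W, U, b)
        H.append(h)
    return np.stack(H)

H = encode(np.random.default_rng(1).normal(size=(5, 8)), d=16)
print(H.shape)  # (5, 16): one hidden state per utterance token
```

Real systems would use a trained GRU (e.g., a deep-learning framework's cell) rather than random weights; this only shows the shape of the computation.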
Then ConceptFlow encodes both central and outer concept flows in central graph G central and outer graph G outer , using graph neural networks and concept embedding (Sec. 3.3).
The decoder, presented in Section 3.4, leverages the encodings of concept flows and the utterance to generate words or concepts for responses.

Concept Graph Construction
ConceptFlow constructs a concept graph G as the knowledge for each conversation. It starts from the grounded concepts (zero-hop concepts V 0 ), which appear in the conversation utterance and are annotated by entity linking systems.
Then, ConceptFlow grows zero-hop concepts V 0 with one-hop concepts V 1 and two-hop concepts V 2 . Concepts from V 0 and V 1 , as well as all relations between them, form the central concept graph G central , which is closely related to the current conversation topic. Concepts in V 1 and V 2 and their connections form the outer graph G outer .
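The multi-hop growth above amounts to breadth-first expansion over knowledge-graph triples. The following sketch illustrates it on a toy, made-up triple set (the triples and concept names are illustrative examples, not actual ConceptNet data):

```python
# Toy sketch of concept graph construction over ConceptNet-style triples.
triples = [
    ("chat", "RelatedTo", "talk"),
    ("future", "RelatedTo", "dream"),
    ("talk", "RelatedTo", "conversation"),
    ("dream", "RelatedTo", "sleep"),
]

def neighbors(concepts, triples):
    """All concepts one hop away from `concepts` (undirected)."""
    out = set()
    for h, _, t in triples:
        if h in concepts:
            out.add(t)
        if t in concepts:
            out.add(h)
    return out - set(concepts)

V0 = {"chat", "future"}          # grounded (zero-hop) concepts
V1 = neighbors(V0, triples)      # one-hop concepts
V2 = neighbors(V0 | V1, triples) # two-hop concepts

# Central graph covers V0 and V1; outer graph covers V1 and V2.
central = V0 | V1
outer = V1 | V2
print(sorted(V1), sorted(V2))  # ['dream', 'talk'] ['conversation', 'sleep']
```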

Encoding Latent Concept Flow
The constructed concept graph provides explicit semantics on how concepts are related by commonsense knowledge. ConceptFlow utilizes it to model the conversation and guide the response generation. It starts from the user utterance and traverses through the central graph G central to the outer graph G outer . This is modeled by encoding the central and outer concept flows according to the user utterance.
Central Flow Encoding. The central concept graph $G_{central}$ is encoded by a graph neural network that propagates information from the user utterance representation $H$ to the central concept graph. Specifically, it encodes concept $e_i \in G_{central}$ to a representation $\vec{g}_{e_i}$:
$$\vec{g}_{e_i} = \text{GNN}(\vec{e}_i, G_{central}, H), \quad (4)$$
where $\vec{e}_i$ is the concept embedding of $e_i$. There is no restriction on which GNN model to use. We choose Sun et al. (2018)'s GNN (GraftNet), which shows strong effectiveness in encoding knowledge graphs. More details of GraftNet can be found in Appendix A.3.
Outer Flow Encoding. The outer flow $f_{e_p}$, hopping from $e_p \in V_1$ to its connected two-hop concepts $e_k$, is encoded to $\vec{f}_{e_p}$ by an attention mechanism:
$$\vec{f}_{e_p} = \sum_{e_k} \theta^{e_k} \cdot [\vec{e}_p \circ \vec{e}_k], \quad (5)$$
where $\vec{e}_p$ and $\vec{e}_k$ are the embeddings of $e_p$ and $e_k$, and $\circ$ denotes concatenation. The attention $\theta^{e_k}$ aggregates the concept triple $(e_p, r, e_k)$ into $\vec{f}_{e_p}$:
$$\theta^{e_k} = \text{softmax}_{e_k}\big((w_r \cdot \vec{r}\,)^\top \tanh(w_h \cdot \vec{e}_p + w_t \cdot \vec{e}_k)\big), \quad (6)$$
where $\vec{r}$ is the embedding of the relation between concept $e_p$ and its neighbor concept $e_k$, and $w_r$, $w_h$, and $w_t$ are trainable parameters. This provides an efficient attention that focuses specifically on the relations to multi-hop concepts.
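The outer flow encoding can be sketched as follows. This is a hedged illustration with random parameters: the score form (a CCM-style triple attention over relation, head, and tail) matches the trainable parameters named in the text, but the exact parameterization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Trainable parameters w_r, w_h, w_t (random here, for illustration only).
w_r, w_h, w_t = (rng.normal(0, 0.1, (d, d)) for _ in range(3))

def outer_flow(e_p, neighbor_triples):
    """Encode f_ep: attention over triples (e_p, r, e_k), then aggregate
    the concatenated head/tail embeddings weighted by the attention."""
    scores = np.array([(w_r @ r) @ np.tanh(w_h @ e_p + w_t @ e_k)
                       for r, e_k in neighbor_triples])
    theta = np.exp(scores - scores.max())
    theta /= theta.sum()                     # softmax over neighbors
    return sum(t * np.concatenate([e_p, e_k])
               for t, (_, e_k) in zip(theta, neighbor_triples))

e_p = rng.normal(size=d)
nbrs = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(3)]
f_ep = outer_flow(e_p, nbrs)
print(f_ep.shape)  # (16,): concatenation of head and tail dimensions
```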

Generating Text with Concept Flow
To consider both the user utterance and related information, the texts from the user utterance and the latent concept flows are incorporated by the decoder using two components: 1) the context representation that combines their encodings (Sec. 3.4.1); 2) the conditioned generation of words and concepts from the context representations (Sec. 3.4.2).

Context Representation
To generate the $t$-th response token, we first calculate the output context representation $\vec{s}_t$ for the $t$-th decoding step from the encodings of the utterance and the latent concept flow. Specifically, $\vec{s}_t$ is calculated by updating the $(t-1)$-th step output representation $\vec{s}_{t-1}$ with the $(t-1)$-th step context representation $\vec{c}_{t-1}$:
$$\vec{s}_t = \text{GRU}(\vec{s}_{t-1}, [\vec{y}_{t-1} \circ \vec{c}_{t-1}]), \quad (7)$$
where $\vec{y}_{t-1}$ is the embedding of the $(t-1)$-th generated token $y_{t-1}$, and the context representation $\vec{c}_{t-1}$ concatenates the text-based representation $\vec{c}^{\,text}_{t-1}$ and the concept-based representation $\vec{c}^{\,concept}_{t-1}$:
$$\vec{c}_{t-1} = [\vec{c}^{\,text}_{t-1} \circ \vec{c}^{\,concept}_{t-1}]. \quad (8)$$
The text-based representation $\vec{c}^{\,text}_{t-1}$ reads the user utterance encoding $H$ with a standard attention mechanism (Bahdanau et al., 2015):
$$\vec{c}^{\,text}_{t-1} = \sum_j \alpha^j_{t-1} \cdot \vec{h}_j, \quad (9)$$
with attentions $\alpha^j_{t-1}$ over the utterance tokens:
$$\alpha^j_{t-1} = \text{softmax}_j(\vec{s}_{t-1} \cdot \vec{h}_j). \quad (10)$$
The concept-based representation $\vec{c}^{\,concept}_{t-1}$ is a combination of the central and outer flow encodings:
$$\vec{c}^{\,concept}_{t-1} = \Big[\sum_{e_i} \beta^{e_i}_{t-1} \cdot \vec{g}_{e_i} \circ \sum_{f_{e_p}} \gamma^{f_{e_p}}_{t-1} \cdot \vec{f}_{e_p}\Big]. \quad (11)$$
The attention $\beta^{e_i}_{t-1}$ weights over the central concept representations:
$$\beta^{e_i}_{t-1} = \text{softmax}_{e_i}(\vec{s}_{t-1} \cdot \vec{g}_{e_i}), \quad (12)$$
and the attention $\gamma^{f_{e_p}}_{t-1}$ weights over the outer flow representations:
$$\gamma^{f_{e_p}}_{t-1} = \text{softmax}_{f_{e_p}}(\vec{s}_{t-1} \cdot \vec{f}_{e_p}). \quad (13)$$
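The construction of the context representation at one decoding step can be sketched as below. The dot-product attention form and all dimensions are simplifying assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context(s_prev, H, G, F):
    """Build the context vector from utterance states H (m x d), central
    concept representations G (n_c x d), and outer flow representations
    F (n_o x d), via attentions conditioned on the decoder state."""
    alpha = softmax(H @ s_prev)   # attention over utterance tokens
    beta = softmax(G @ s_prev)    # attention over central concepts
    gamma = softmax(F @ s_prev)   # attention over outer flows
    c_text = alpha @ H
    c_concept = np.concatenate([beta @ G, gamma @ F])
    return np.concatenate([c_text, c_concept])

d = 8
rng = np.random.default_rng(0)
c = context(rng.normal(size=d), rng.normal(size=(5, d)),
            rng.normal(size=(4, d)), rng.normal(size=(3, d)))
print(c.shape)  # (24,): text part (8) + central part (8) + outer part (8)
```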

Generating Tokens
The $t$-th output representation $\vec{s}_t$ (Eq. 7) includes information from the utterance text, the concepts at different hop steps, and the attentions upon them. The decoder leverages $\vec{s}_t$ to generate the $t$-th token and form more informative responses. It first uses a gate $\sigma^*$ to control the generation by choosing words ($\sigma^* = 0$), central concepts ($V_{0,1}$, $\sigma^* = 1$), or outer concepts ($V_2$, $\sigma^* = 2$):
$$P(\sigma^* = s \mid \vec{s}_t) = \text{softmax}_s(w_s \cdot \vec{s}_t). \quad (14)$$
The generation probabilities of a word $w$, a central concept $e_i$, and an outer concept $e_k$ are calculated over the word vocabulary, the central concept set $V_{0,1}$, and the outer concept set $V_2$:
$$P(w \mid \sigma^*{=}0) = \text{softmax}_w(\vec{w} \cdot \vec{s}_t),\quad P(e_i \mid \sigma^*{=}1) = \text{softmax}_{e_i}(\vec{g}_{e_i} \cdot \vec{s}_t),\quad P(e_k \mid \sigma^*{=}2) = \text{softmax}_{e_k}(\vec{e}_k \cdot \vec{s}_t), \quad (15)$$
where $\vec{w}$ is the word embedding of word $w$, $\vec{g}_{e_i}$ is the central concept representation of concept $e_i$, and $\vec{e}_k$ is the embedding of two-hop concept $e_k$.
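The gated generation over the three sources (words, central concepts, outer concepts) can be sketched as a mixture of softmaxes. The gate parameterization and all shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_step(s_t, word_emb, central_rep, outer_emb, gate_w):
    """The gate chooses a source (word / central concept / outer concept);
    candidates within each source are scored by dot product with s_t.
    Returns one joint distribution over all candidates."""
    gate = softmax(gate_w @ s_t)                  # P(sigma* = 0, 1, 2)
    p_word = gate[0] * softmax(word_emb @ s_t)
    p_central = gate[1] * softmax(central_rep @ s_t)
    p_outer = gate[2] * softmax(outer_emb @ s_t)
    return np.concatenate([p_word, p_central, p_outer])

rng = np.random.default_rng(0)
d = 8
p = generate_step(rng.normal(size=d),
                  rng.normal(size=(10, d)),   # word vocabulary
                  rng.normal(size=(4, d)),    # central concepts
                  rng.normal(size=(6, d)),    # outer (two-hop) concepts
                  rng.normal(size=(3, d)))    # gate parameters
print(round(p.sum(), 6))  # 1.0: a valid probability distribution
```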
The training and prediction of ConceptFlow are conducted following standard conditional language models, i.e. using Eq. 15 in place of Eq. 2 and training it by the Cross-Entropy loss (Eq. 3). Only ground truth responses are used in training and no additional annotation is required.

Experiment Methodology
This section describes the dataset, evaluation metrics, baselines, and implementation details of our experiments.
Dataset. All experiments use a multi-hop extended conversation dataset built on a previous dataset that collects single-round dialogs from Reddit. Our dataset contains 3,384,185 training pairs and 10,000 test pairs. A preprocessed ConceptNet (Speer et al., 2017) is used as the knowledge graph, which contains 120,850 triples, 21,471 concepts, and 44 relation types.
Evaluation Metrics. A wide range of evaluation metrics is used to evaluate the quality of generated responses: PPL (Serban et al., 2016), Bleu (Papineni et al., 2002), Nist (Doddington, 2002), ROUGE (Lin, 2004), and Meteor (Lavie and Agarwal, 2007) are used for relevance and repetitiveness; Dist-1, Dist-2, and Ent-4 are used for diversity, following previous work (Li et al., 2016a; Zhang et al., 2018). These metrics are evaluated with the implementation from previous work. The concept PPL metric mainly targets concept-grounded models and is reported in Appendix A.1.
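The diversity metrics above have simple standard definitions. As a reference sketch (our own minimal implementation, not the evaluation code used in the experiments), Dist-n is the ratio of unique n-grams to total n-grams, and Ent-n is the Shannon entropy of the n-gram distribution:

```python
from collections import Counter
import math

def distinct_n(responses, n):
    """Dist-n: unique n-grams / total n-grams across all responses."""
    grams = [tuple(toks[i:i + n]) for toks in responses
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def entropy_n(responses, n):
    """Ent-n: Shannon entropy (nats) of the n-gram frequency distribution."""
    counts = Counter(tuple(toks[i:i + n]) for toks in responses
                     for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

resps = [["i", "like", "music"], ["i", "like", "it"]]
print(distinct_n(resps, 1))  # 4 unique unigrams / 6 total
```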
Precision, Recall, and F1 scores are used to evaluate the quality of the learned latent concept flow in predicting the golden concepts, i.e., those that appear in ground truth responses.
Baselines. The six baselines come from three groups: standard Seq2Seq models, knowledge-enhanced models, and fine-tuned GPT-2 systems.
Knowledge-enhanced baselines include MemNet (Ghazvininejad et al., 2018), CopyNet (Zhu et al., 2017), and CCM. MemNet maintains a memory to store and read concepts. CopyNet copies concepts for response generation. CCM leverages a graph attention mechanism to model the central concepts. These models mainly focus on the grounded concepts and do not explicitly model conversation structure using multi-hop concepts.
GPT-2 (Radford et al., 2019), the pre-trained model that achieves the state of the art in many language generation tasks, is also compared in our experiments. We fine-tune the 124M GPT-2 in two ways: concatenating all conversations together and training it like a language model (GPT-2 lang), and extending GPT-2 with an encoder-decoder architecture supervised with response data (GPT-2 conv).

Implementation Details. The zero-hop concepts are initialized by matching keywords in the post to concepts in ConceptNet, the same as CCM. The zero-hop concepts are then extended to their neighbors to form the central concept graph. The outer concepts contain a large number of two-hop concepts with considerable noise. To reduce the computational cost, we first train ConceptFlow (select) on 10% of the training data chosen at random, and use the learned graph attention to select the top 100 two-hop concepts over the whole dataset. The standard training and testing are then conducted with the pruned graph. More details of this filtering step can be found in Appendix A.4.
TransE (Bordes et al., 2013) embeddings and GloVe (Pennington et al., 2014) embeddings are used to initialize the representations of concepts and words, respectively. The Adam optimizer with a learning rate of 0.0001 is used to train the model.

Evaluation
Five experiments are conducted to evaluate the generated responses from ConceptFlow and the effectiveness of the learned graph attention.

Response Quality
This experiment evaluates the generation quality of ConceptFlow automatically and manually.
Automatic Evaluation. The quality of generated responses is evaluated with different metrics from three aspects: relevance, diversity, and novelty. Table 1 and Table 2 show the results.
In Table 1, all evaluation metrics calculate the relevance between the generated response and the golden response. ConceptFlow outperforms all baseline models by large margins: the responses generated by ConceptFlow are more on-topic and match better with the ground truth responses.
In Table 2, Dist-1, Dist-2, and Ent-4 measure the word diversity of generated responses, and the rest of the metrics measure novelty by comparing the generated response with the user utterance. ConceptFlow strikes a good balance between generating novel and diverse responses. GPT-2's responses are more diverse, perhaps due to its sampling mechanism during decoding, but are less novel and on-topic compared to those from ConceptFlow.
Human Evaluation. The human evaluation focuses on two aspects: appropriateness and informativeness. Both are important for conversation systems. Appropriateness evaluates whether the response is on-topic for the given utterance; informativeness evaluates a system's ability to provide new information instead of copying from the utterance. All responses for 100 sampled cases are collected from the four best-performing methods: CCM, GPT-2 (conv), ConceptFlow, and the Golden Response. The responses are scored from 1 to 4 by five judges (the higher the better). Table 3 presents the Average Score and Best@1 ratio from the human judges. The former is the mean over the five judges; the latter is the fraction of judges that consider the corresponding response the best among the four systems. ConceptFlow outperforms all other models in all scenarios, while using only 30% of the parameters of GPT-2. This demonstrates the advantage of explicitly modeling conversation flow with structured semantics.
To demonstrate the authenticity of the evaluation results, we test inter-annotator agreement. We first randomly sample 100 cases for the human evaluation. The responses from the four better conversation systems, CCM, GPT-2 (conv), ConceptFlow, and the Golden Responses, are then presented in random order. A group of annotators scores each response from 1 to 4 according to its quality on the two testing scenarios, appropriateness and informativeness. The annotators are given no clues about the source of the generated responses.
The human evaluation agreement for CCM, GPT-2 (conv), and ConceptFlow is presented in Table 4. For each case, the response from ConceptFlow is compared to the responses from the two baseline models, CCM and GPT-2 (conv). The comparison result falls into three categories: win, tie, and loss. The human evaluation agreement is then calculated with Fleiss' Kappa (κ). The κ values range from 0.21 to 0.40, indicating fair agreement, which confirms the quality of the human evaluation.
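For reference, Fleiss' kappa over win/tie/loss labels can be computed as follows. This is a generic textbook implementation on made-up example ratings, not the evaluation script used for Table 4.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each rated by the same number of judges.
    `ratings` is a list of per-item lists of category labels."""
    n = len(ratings[0])                       # judges per item
    cats = sorted({c for row in ratings for c in row})
    table = [[Counter(row)[c] for c in cats] for row in ratings]
    # Per-item observed agreement P_i, then the mean P_bar.
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / len(P_i)
    # Chance agreement P_e from overall category proportions.
    totals = [sum(row[j] for row in table) for j in range(len(cats))]
    p_j = [t / (len(ratings) * n) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement (3 judges, 2 items) yields kappa = 1.
print(fleiss_kappa([["win", "win", "win"], ["tie", "tie", "tie"]]))  # 1.0
```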
Both automatic and human evaluations illustrate the effectiveness of ConceptFlow. The next experiment further studies the effectiveness of multi-hop concepts in ConceptFlow.

Effectiveness of Multi-hop Concepts
This part explores the role of multi-hop concepts in ConceptFlow. As shown in Figure 3, three experiments are conducted to evaluate the performances of concept selection and the quality of generated responses with different sets of concepts.
This experiment considers four variations of outer concept selections. Base ignores two-hop concepts and only considers the central concepts. Rand, Distract, and Full add two-hop concepts in three different ways: Rand selects concepts randomly, Distract selects all concepts that appear in the golden response with random negatives (distractors), and Full is our ConceptFlow (select) that selects concepts by learned graph attentions.
As shown in Figure 3(a), Full covers more golden concepts than Base. This aligns with our motivation that natural conversations do flow from central concepts to multi-hop ones. Compared to Distract setting where all ground truth two-hop concepts are added, ConceptFlow (select) has slightly less coverage but significantly reduces the number of two-hop concepts.
The second experiment studies the model's ability to generate ground truth concepts, by comparing the concepts in generated responses with those in ground truth responses. As shown in Figure 3(b), though Full filtered out some golden two-hop concepts, it outperforms the other variations by large margins. This shows that ConceptFlow's graph attention mechanism effectively leverages the pruned concept graph and generates high-quality concepts when decoding.

Table 5: Statistics of concept graphs with different hops, including the total Amount of connected concepts, and the Ratio and Number of covered golden concepts (those that appear in ground truth responses). ConceptFlow indicates the filtered two-hop graph.
The high-quality latent concept flow leads to better modeling of conversations, as shown in Figure 3(c). Full outperforms Distract in the token-level perplexity of their generated responses, even though Distract includes all ground truth two-hop concepts. This shows that the "negatives" selected by ConceptFlow, while not directly appearing in the target response, are also on-topic and carry meaningful information, as they are selected by graph attention rather than at random.
More studies of multi-hop concept selection strategies can be found in Appendix A.2.

Hop Steps in Concept Graph
This experiment studies the influence of hop steps in the concept graph.
As shown in Table 5, the Number of covered golden concepts increases with more hops. Compared to zero-hop concepts, multi-hop concepts cover more golden concepts, confirming that conversations naturally shift to multi-hop concepts: extending the concept graph from one hop to two hops improves the recall from 39% to 61%, and extending to three hops further improves it to 81%.
However, at the same time, the number of concepts also increases dramatically with multiple hops. Three hops lead to 3,769 concepts on average, about 10% of the entire graph we used. In this work, we choose two hops as a good balance of coverage and efficiency, and use ConceptFlow (select) to filter around 200 concepts to construct the pruned graph. How to efficiently and effectively leverage more distant concepts in the graph is left for future work.

Case Study
Several cases from the three conversation models are listed in Table 6. Responses from CCM often repeat the same content, as CCM does not explicitly model traverses in the concept space; for example, its responses in the first and third cases repeatedly say "I'm not sure". GPT-2, on the other hand, generates more fluent responses than CCM, but in some cases merely copies content or concepts from the given post; in the third case, GPT-2 (conv) mainly discusses the concept music.
In comparison, the generated responses from ConceptFlow are more fluent and informative than those from both CCM and GPT-2. For example, in the third case, ConceptFlow brings the associated concepts "sound" and "check" into the response generation, hopping from the grounded concepts "music" and "advice". Introducing these multi-hop concepts effectively improves the informativeness and diversity of the generated responses.

Figure 4 presents a case study of ConceptFlow. The attention scores β and γ are shown as color intensity. The zero-hop concept "championship", the one-hop concept "fan", and the two-hop concept "team" receive more attention than the others and are used to generate the response. The concept flow from "fans" to "fan" models the concept shift from the user post to the response. The concept flow from "fan" to "team" further describes the concept shift within response generation. In addition, some concepts, such as "win" and "pretty", receive relatively high attention and may help to interpret the one-hop concepts, but are filtered out during response generation by the gate σ* according to their relevance to the conversation topic.

Table 6: Example cases, showing the post and the responses from CCM, GPT-2 (conv), and ConceptFlow:

#1 Post: actually i stayed at the building right next to the lighthouse . i believe it was a boiler room . another group was already sleeping at the lighthouse .
CCM: i 'm not sure if you 're joking , but i 'm not sure if you 're talking about the lighthouse . i 'm not sure if it was a blanket or a blanket .
GPT-2 (conv): i 'm pretty sure it was a room with a door .
ConceptFlow: good luck , i 'm sure there are some good things in the house .

#2 Post: at this point it may be more beneficial for him to keep playing as much as he can before the next season . trick his body into thinking he is still in the season .
CCM: i think he 's just going to be a great player .
GPT-2 (conv): i 'm sure he 'll be fine . he 's a great player , but he 's not a top 10 player .
ConceptFlow: he 's been playing the same game for the last two years . he 's been playing the same way .

#3 Post: i 've known how to change the music , it does n't feel natural . thanks for the advice though .
CCM: i 'm not sure if you 're joking , but i 'm not sure if you 're being sarcastic or not . i 'm not sure if you 're being sarcastic or not , but i 'm not sure if you 're being sarcastic or not . i 'm pretty sure it 's just a video .
GPT-2 (conv): i 've been playing with a lot of different music and it 's always been the same .
ConceptFlow: i did n't even notice the natural sound . i 'll check it out !

#4 Post: are milan and inter no longer top clubs given that they currently do n't have the finances to compete with the best ?
CCM: milan is n't a top club , but they are n't top clubs . they are n't the only ones who are n't in the world .
GPT-2 (conv): i think they are . they are a club that has been in the top 4 for a long time .
ConceptFlow: the best part of the team is the way they are , and we do n't want to play the best team in the country .

#5 Post: no i 've never felt like i had friends whoops
CCM: i 'm sorry , i feel you .
GPT-2 (conv): i 've never felt like i had friends whoops . i 've never felt like i 've had friends whoops .
ConceptFlow: you 're gonna have to call your friend !

Learned Attentions on Concepts
This experiment studies the attention ConceptFlow learns over different groups of concepts. We consider the average attention score (β for central concepts and α (Appendix A.4) for two-hop concepts) across all decoding steps. The probability density of the attention is plotted in Figure 5.
Figure 5(a) shows the attention weights on central concepts. ConceptFlow effectively attends more to golden and zero-hop concepts, which carry more useful information. The attention on two-hop concepts is shown in Figure 5(b). ConceptFlow attends slightly more to the golden two-hop concepts than to the remaining ones, though the margin is smaller, as the two-hop concepts have already been filtered down to high-quality ones in the ConceptFlow (select) step.

Conclusion and Future Work
ConceptFlow models conversation structure explicitly as transitions in the latent concept space, in order to generate more informative and meaningful responses. Our experiments on Reddit conversations illustrate the advantages of ConceptFlow over previous conversational systems. Our studies confirm that ConceptFlow's advantages come from the high-coverage latent concept flow, as well as its graph attention mechanism that effectively guides the flow towards highly related concepts. Our human evaluation demonstrates that ConceptFlow generates more appropriate and informative responses while using far fewer parameters.
In the future, we plan to explore how to combine knowledge with pre-trained language models, e.g., GPT-2, and how to effectively and efficiently introduce more concepts into generation models.

A Appendices
Supplementary results of the overall performance and ablation study for multi-hop concepts are presented here. More details of Central Flow Encoding and Concept Selection are also shown.

A.1 Supplementary Results for Overall Experiments
This part presents further evaluation of the overall performance of ConceptFlow from two aspects: relevance and novelty. Table 7 shows supplementary results on the relevance between generated responses and golden responses. ConceptFlow outperforms the other baselines by large margins on all evaluation metrics. Concept-PPL is the perplexity calculated by the code from previous work, which considers both words and entities. Note that including more entities leads to a better Concept-PPL result, because the entity vocabulary is always smaller than the word vocabulary.
More results for novelty evaluation are shown in Table 8. These supplementary results compare the generated response with the user post to measure how much of the post is repeated in the response. A lower score indicates better performance, because repetitive and dull responses degrade model quality. ConceptFlow presents competitive performance against the baselines, which illustrates that our model provides informative responses for users.
These supplementary results further confirm the effectiveness of ConceptFlow: our model generates the most relevant responses and more informative responses than the other models.

A.2 Supplementary Results for Multi-hop Concepts
The quality of generated responses from the four two-hop concept selection strategies is evaluated to further demonstrate the effectiveness of ConceptFlow. We evaluate the relevance between generated responses and golden responses, as shown in Table 9. Rand outperforms Base on most evaluation metrics, which shows that response quality improves when more concepts are included. Distract outperforms Rand on all evaluation metrics, which indicates that concepts appearing in golden responses are meaningful and important for generating more on-topic and informative responses. On the other hand, Full outperforms Distract significantly, even though not all golden concepts are included. The better performance derives from the underlying related concepts selected by ConceptFlow (select). This experiment further demonstrates the effectiveness of ConceptFlow in generating better responses.

A.3 Model Details of Central Flow Encoding
This part presents the details of our graph neural network to encode central concepts.
A multi-layer graph neural network (GNN) (Sun et al., 2018) is used to encode each concept $e_i \in G_{central}$ in the central concept graph:
$$\vec{g}_{e_i} = \text{GNN}(\vec{e}_i, G_{central}, H),$$
where $\vec{e}_i$ is the concept embedding of $e_i$ and $H$ is the user utterance representation set. The $l$-th layer representation $\vec{g}^{\,l}_{e_i}$ of concept $e_i$ is calculated by a single-layer feed-forward network (FFN) over three states:
$$\vec{g}^{\,l}_{e_i} = \text{FFN}\Big(\vec{g}^{\,l-1}_{e_i} \circ \vec{p}^{\,l-1} \circ \sum_r \sum_{e_j \in N_r(e_i)} \alpha^{e_j}_r \cdot f_r(\vec{g}^{\,l-1}_{e_j})\Big),$$
where $\circ$ is the concatenation operator, $\vec{g}^{\,l-1}_{e_j}$ is concept $e_j$'s representation at the $(l-1)$-th layer, and $\vec{p}^{\,l-1}$ is the user utterance representation at the $(l-1)$-th layer.
The $(l-1)$-th layer user utterance representation is updated with the zero-hop concepts $V_0$:
$$\vec{p}^{\,l-1} = \text{FFN}\Big(\sum_{e_j \in V_0} \vec{g}^{\,l-1}_{e_j}\Big).$$
The term $f_r(\vec{g}^{\,l-1}_{e_j})$ aggregates the concept semantics of the relation-$r$-specific neighbor concept $e_j$. It uses attention $\alpha^{e_j}_r$ to control the concept flow from $e_i$:
$$f_r(\vec{g}^{\,l-1}_{e_j}) = \text{FFN}(\vec{r} \circ \vec{g}^{\,l-1}_{e_j}),$$
where $\circ$ is the concatenation operator and $\vec{r}$ is the relation embedding of $r$. The attention weight $\alpha^{e_j}_r$ is computed over all of concept $e_i$'s neighbor concepts according to the relation weight score and the PageRank score (Sun et al., 2018):
$$\alpha^{e_j}_r = \text{softmax}(\vec{r} \cdot \vec{p}^{\,l-1}) \cdot \text{PageRank}(e^{l-1}_j),$$
where $\text{PageRank}(e^{l-1}_j)$ is the PageRank score that controls propagation of embeddings along paths starting from $e_i$ (Sun et al., 2018), and $\vec{p}^{\,l-1}$ is the $(l-1)$-th layer user utterance representation.
The 0-th layer concept representation $\vec{e}^{\,0}_i$ for concept $e_i$ is initialized with the pre-trained concept embedding $\vec{e}_i$, and the 0-th layer user utterance representation $\vec{p}^{\,0}$ is initialized with the $m$-th hidden state $\vec{h}_m$ from the user utterance representation set $H$. The GNN used in ConceptFlow establishes the central concept flow between concepts in the central concept graph using attentions.
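One GraftNet-style layer of this kind can be sketched as follows. This is a simplified, hypothetical rendering: the relation-utterance scoring and the PageRank values are stand-ins with random or uniform numbers, and the FFN is a single tanh-linear map.

```python
import numpy as np

def gnn_layer(G_prev, p_prev, edges, rel_emb, pagerank, W):
    """One layer: each concept aggregates relation-specific neighbor
    messages, weighted by an attention that combines a relation-utterance
    score with a PageRank term, then passes [self; utterance; messages]
    through a linear map with tanh (a stand-in for the FFN)."""
    n, d = G_prev.shape
    msgs = np.zeros((n, d))
    for i in range(n):
        nbr = [(j, r) for (j, r, k) in edges if k == i]  # incoming edges
        if not nbr:
            continue
        scores = np.array([rel_emb[r] @ p_prev for j, r in nbr])
        att = np.exp(scores - scores.max())
        att *= np.array([pagerank[j] for j, r in nbr])   # PageRank weighting
        att /= att.sum()
        for a, (j, r) in zip(att, nbr):
            msgs[i] += a * np.tanh(G_prev[j])
    feats = np.concatenate([G_prev, np.tile(p_prev, (n, 1)), msgs], axis=1)
    return np.tanh(feats @ W)

rng = np.random.default_rng(0)
n, d = 4, 6
G_out = gnn_layer(rng.normal(size=(n, d)), rng.normal(size=d),
                  [(0, 0, 1), (1, 1, 2), (2, 0, 3)],  # (src, relation, dst)
                  rng.normal(size=(2, d)), np.ones(n) / n,
                  rng.normal(size=(3 * d, d)))
print(G_out.shape)  # (4, 6): one updated representation per concept
```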

A.4 Concept Selection
As the concept graph grows, the number of concepts increases exponentially, which introduces substantial noise. Thus, a selection strategy is needed to pick highly relevant concepts from the large candidate set. This part presents the details of the concept selection in ConceptFlow (select).
The concept selector aims to select the top $K$ related two-hop concepts based on the attention score summed over all decoding steps $t$ for the entire two-hop concept set:
$$\alpha_{e_k} = \sum_t \text{softmax}_{e_k}(\vec{s}_t \cdot \vec{e}_k),$$
where $\vec{s}_t$ is the $t$-th step decoder output representation and $\vec{e}_k$ denotes concept $e_k$'s embedding. The two-hop concepts are then sorted by the attention score $\alpha_{e_k}$. In our setting, the top 100 concepts are kept to construct the two-hop concept set $V_2$. All central concepts are kept because of their high correlation with the conversation topic and acceptable computational complexity. The central concepts and the selected two-hop concepts together construct the concept graph $G$.
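The selection procedure can be sketched as follows: accumulate a per-concept attention score over decoding steps, then keep the top K. Shapes and data here are illustrative.

```python
import numpy as np

def select_two_hop(decoder_states, two_hop_emb, k=100):
    """Rank two-hop concepts by attention (softmax of s_t . e_k) summed
    over all decoding steps t, then keep the indices of the top k."""
    scores = np.zeros(two_hop_emb.shape[0])
    for s_t in decoder_states:
        logits = two_hop_emb @ s_t
        e = np.exp(logits - logits.max())
        scores += e / e.sum()          # add this step's attention weights
    return np.argsort(-scores)[:k]     # indices of the top-k concepts

rng = np.random.default_rng(0)
keep = select_two_hop(rng.normal(size=(5, 8)),    # 5 decoding steps
                      rng.normal(size=(300, 8)),  # 300 candidate concepts
                      k=100)
print(keep.shape)  # (100,)
```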