Conversational Graph Grounded Policy Learning for Open-Domain Conversation Generation

To address the challenge of policy learning in open-domain multi-turn conversation, we propose to represent prior information about dialog transitions as a graph and to learn a graph grounded dialog policy, aimed at fostering a more coherent and controllable dialog. To this end, we first construct a conversational graph (CG) from dialog corpora, in which vertices represent "what to say" and "how to say", and edges represent natural transitions between a message (the last utterance in a dialog context) and its response. We then present a novel CG grounded policy learning framework that conducts dialog flow planning by graph traversal: it learns to identify a what-vertex and a how-vertex from the CG at each turn to guide response generation. In this way, we effectively leverage the CG to facilitate policy learning as follows: (1) it enables more effective long-term reward design, (2) it provides high-quality candidate actions, and (3) it gives us more control over the policy. Results on two benchmark corpora demonstrate the effectiveness of this framework.


Introduction
How to effectively learn dialog strategies is an enduring challenge for open-domain multi-turn conversation generation. To address this challenge, previous works investigate word-level policy models that simultaneously learn dialog policy and language generation from dialog corpora (Li et al., 2016b; Zhang et al., 2018b). But these word-level policy models often lead to a degeneration issue where the utterances become ungrammatical or repetitive (Lewis et al., 2017). To alleviate this issue, utterance-level policy models have been proposed to decouple policy learning from response generation; they focus on how to incorporate high-level utterance representations, e.g., latent variables or keywords, to facilitate policy learning (He et al., 2018; Yao et al., 2018). However, these utterance-level methods tend to produce less coherent multi-turn dialogs, since it is quite challenging to learn semantic transitions in a dialog flow merely from dialog data without the help of prior information. In this paper, we propose to represent prior information about dialog transitions (between a message and its response) as a graph, and to optimize dialog policy based on the graph, to foster a more coherent dialog.

Figure 1: Our system (1) understands the user message by linking it to the CG; we call the linked vertices hit what-vertices (green color). (2) It selects a what-vertex ("sleepy") and a how-vertex (responding mechanism M3, an MLP network) from the one-hop neighbors of the hit vertices. (3) It generates a coherent response in two sub-steps: first, it obtains a response representation r̂ using both M3 and a message representation (from a message encoder); next, it produces a response "It's so ..." with "sleepy" and r̂ as input. Note that all the how-vertices are from the same set rather than completely independent of each other.
To this end, we propose a novel conversational graph (CG) grounded policy learning framework for open-domain multi-turn conversation generation (CG-Policy). It consists of two key components: (1) a CG that captures both local-appropriateness and global-coherence information, and (2) a reinforcement learning (RL) based policy model that learns to leverage the CG to foster a more coherent dialog. In Figure 1, given a user message, our system selects a what-vertex ("sleepy") and a how-vertex (responding mechanism M3) to produce a coherent response. We first construct the CG based on dialog data. We use vertices to represent utterance content, and edges to represent dialog transitions between utterances. Specifically, there are two types of vertices: (1) a what-vertex that contains a keyword, and (2) a how-vertex that contains a responding mechanism (from a multi-mapping based generator, see Section 3.1) to capture the rich variability of expressions. We also use this multi-mapping based method to build edges between two what-vertices, capturing the local appropriateness between the two keywords as a message and a response respectively. It can be seen that what-vertices from the same highly connected region are more likely to constitute a coherent dialog.
We then present a novel graph grounded policy model to plan a long-term-success oriented vertex sequence to guide response generation. Specifically, as illustrated by the three pink lines in Figure 1, given a user message, CG-Policy first links its keywords to the CG to obtain hit what-vertices. Next, the policy model learns to select a what-vertex from the one-hop what-vertex neighbors of all hit what-vertices, and then to select a how-vertex from the how-vertex neighbors of the chosen what-vertex. Finally, the two selected vertices are utilized to guide response generation. Thus we leverage the prior dialog-transition information (as graph edges) to narrow down candidate response content for more effective policy decisions, instead of using the whole set of keywords as candidate actions. Moreover, to facilitate the modeling of the long-term influence of policy decisions in an ongoing dialog, we first present novel CG based rewards to better measure the long-term influence of selected actions. We then employ a graph attention mechanism and graph embedding to encode global structure information of the CG into dialog state representations, enabling global information aware decisions.
(Each responding mechanism is an MLP network that models how to express response content (Chen et al., 2019).)

This paper makes the following contributions:

• This work is the first attempt to represent dialog transitions as a graph and to conduct graph grounded policy learning with RL. Supported by the CG and this policy learning framework, CG-Policy can respond better in terms of local appropriateness and global coherence.
• Our study shows that: (1) one-hop what-vertex neighbors of hit what-vertices provide locally-appropriate and diverse response content; (2) the CG based rewards can supervise the policy model to promote a globally-coherent dialog; (3) the use of how-vertices in CG can improve response diversity; (4) the CG can help our system succeed in the task of target-guided conversation, indicating that it gives us more control over the dialog policy.

Related Work
Policy learning for chitchat generation To address the degeneration issue of word-level policy models (Li et al., 2016b; Zhang et al., 2018b), previous works decouple policy learning from response generation, and then use utterance-level latent variables or keywords (Yao et al., 2018) as RL actions to guide response generation. In this work, we investigate how to use prior dialog-transition information to facilitate dialog policy learning.
Knowledge aware conversation generation There is growing interest in leveraging knowledge bases for the generation of more informative responses (Dinan et al., 2019; Ghazvininejad et al., 2018; Moghe et al., 2018; Zhou et al., 2018; Liu et al., 2019; Bao et al., 2019; Xu et al., 2020). In this work, we employ a dialog-modeling oriented graph built from dialog corpora, instead of an external knowledge base, in order to facilitate multi-turn policy learning rather than to improve dialog informativeness.
Specifically, we are motivated by Xu et al. (2020). Their method has a cross-domain transfer issue, since it relies on labor-intensive knowledge graph grounded multi-turn dialog datasets for model training. Compared with their method, our conversational graph is automatically built from dialog datasets, which introduces very low cost for training data construction. Furthermore, we decouple conversation modeling into two parts: "what to say" modeling and "how to say" modeling.

Our Approach
The overview of CG-Policy is presented in Figure 2. Given a user message, to obtain candidate actions, the NLU module attempts to retrieve contextually relevant subgraphs from the CG. The state/action module maintains the candidate actions, the history keywords (selected by the policy at previous turns or mentioned by the user), and the message. The policy module learns to select a response keyword and a responding mechanism from the above subgraphs. The NLG module first encodes the message into a representation using a message encoder and the selected mechanism, and then employs a Seq2BF model (Mou et al., 2016) to produce a response with the above representation and the selected keyword as input. (Seq2BF decodes a response starting from the input keyword and then generates the remaining previous and future words, so that the keyword is guaranteed to appear in the response.) The models used in CG construction, the policy, NLG, and the rewards are trained separately.

Background: Multi-mapping Generator for NLG
To address the "one-to-many" semantic mapping problem in conversation generation, Chen et al. (2019) proposed an end-to-end multi-mapping model in which each responding mechanism (an MLP network) models how to express response content (e.g., responding with a specific sentence function). At test time, they randomly select a mechanism for response generation. As shown in Figure 3, the generator consists of an RNN based message encoder, a set of responding mechanisms, and a decoder. First, given a dialog message, the message encoder represents it as a vector x. Second, the generator uses a responding mechanism (selected by the policy) to convert x into a response representation r̂. Finally, r̂ and a keyword (selected by the policy) are fed into the decoder for response generation. To ensure that the given keyword appears in generated responses, we introduce another Seq2BF based decoder (Mou et al., 2016) to replace the original RNN decoder. Moreover, this generator is trained on a dataset of [message, keyword extracted from the response]-response pairs.

CG Construction
Given a dialog corpus D, we construct the CG with three steps: what-vertex construction, how-vertex construction, and edge construction.
What-vertex construction To extract content words from D as what-vertices, we use a rule-based keyword extractor to obtain salient keywords from utterances in D. After removing stop words, we use all the remaining keywords as what-vertices.
How-vertex construction We obtain a set of N_r responding mechanisms from the generator described in Section 3.1, and use them as how-vertices. Notice that all the how-vertices in the CG share the same set of responding mechanisms.
Edge construction There are two types of edges in the CG: one joins two what-vertices, and the other joins a what-vertex and a how-vertex. To build the first type of edges, we first construct another dataset that consists of keyword pairs, where each pair contains two keywords extracted from a message and its response respectively in D. To capture natural transitions between keywords, we train another multi-mapping based model on this new dataset. For each what-vertex v_w, we find appropriate response keywords by selecting the top five keywords decoded (with decoding length 1) by each responding mechanism, and then connect v_w to the vertices of these keywords.
To build the second type of edges, for each [message, keyword]-response pair in D (described in Section 3.1), we use the ground-truth response to select the most suitable mechanism for each keyword. Then, given a what-vertex v_w, we select the top five mechanisms that are most frequently selected for v_w's keyword, and build edges to connect v_w to each of these top ranked how-vertices. These edges lead to responding mechanisms that are suitable for expressing v_w.
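The two edge-construction steps above can be sketched with plain adjacency dictionaries. This is a minimal illustration, not the paper's implementation: the scored response-keyword candidates and the mechanism-frequency counts are hypothetical stand-ins for the outputs of the multi-mapping models described above.

```python
from collections import defaultdict

def build_cg(keyword_candidates, mechanism_counts, top_k=5):
    """Sketch of CG edge construction.

    keyword_candidates: {message_keyword: [(response_keyword, score), ...]},
        standing in for keywords decoded by the responding mechanisms.
    mechanism_counts: {keyword: {mechanism_id: frequency}}, standing in for
        how often each mechanism was selected for that keyword in D.
    Returns what->what and what->how adjacency dictionaries.
    """
    what_edges = defaultdict(list)  # what-vertex -> top-k what-vertex neighbors
    how_edges = defaultdict(list)   # what-vertex -> top-k linked mechanisms
    for kw, candidates in keyword_candidates.items():
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        what_edges[kw] = [resp for resp, _ in ranked[:top_k]]
    for kw, counts in mechanism_counts.items():
        ranked = sorted(counts.items(), key=lambda c: c[1], reverse=True)
        how_edges[kw] = [mech for mech, _ in ranked[:top_k]]
    return what_edges, how_edges
```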

NLU
To obtain subgraphs that provide high-quality candidate actions, we first extract keywords from the last utterance of the context (the message) using the same tool as in CG construction, and then link each keyword to the CG through exact string matching, obtaining multiple hit what-vertices. Then we retrieve a subgraph for each keyword, and use the vertices (excluding hit what-vertices) in these subgraphs as candidate actions. Each subgraph consists of three parts: the hit what-vertex, its one-hop neighboring what-vertices, and the how-vertices connected to those neighbors. If no keywords can be extracted from the message or linked to the CG, we reuse the subgraphs retrieved at the previous turn. Thus we leverage the CG to provide high-quality candidate actions, instead of using the whole set of candidates as done in previous work (Yao et al., 2018).
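The retrieval step above can be sketched as follows, assuming the CG is available as two adjacency dictionaries (what-vertex to one-hop what-vertex neighbors, and what-vertex to linked mechanisms); the function and field names are illustrative, not from the paper's code.

```python
def retrieve_subgraphs(message_keywords, what_edges, how_edges, prev_subgraphs=None):
    """For each message keyword that exactly matches a CG vertex, return a
    subgraph: the hit what-vertex, its one-hop what-neighbors, and the
    how-vertices connected to those neighbors."""
    subgraphs = []
    for kw in message_keywords:
        if kw not in what_edges:
            continue  # keyword cannot be linked to the CG
        neighbors = what_edges[kw]
        mechs = sorted({m for n in neighbors for m in how_edges.get(n, [])})
        subgraphs.append({"hit": kw, "what": neighbors, "how": mechs})
    if not subgraphs and prev_subgraphs:
        return prev_subgraphs  # reuse the subgraphs from the previous turn
    return subgraphs
```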

State/Action
This module maintains the candidate actions, the history keywords (selected by the policy or mentioned by the user), and the message. Moreover, we use the message encoder from Section 3.1 to represent the message as a vector x, and then use all the responding mechanisms from Section 3.1 to convert x into N_r candidate response representations {r̂_j}_{j=1}^{N_r}, which will be used in the policy.

Policy
State representation The state representation s_t at the t-th time step is obtained by concatenating a message representation s_t^M and a history-keywords representation s_t^V, each encoded by its own RNN encoder. Formally, s_t = [s_t^M; s_t^V]. To enable global information aware policy decisions, we employ a graph attention mechanism and graph embedding to encode global structure information into the state representation.
Recall that we have a subgraph for each keyword in the message, obtained by the NLU module. Here each subgraph g_i consists of a hit what-vertex, its what-vertex neighbors (how-vertices are removed here), and the edges between them. Formally, g_i = {τ_k}_{k=1}^{N_{g_i}}, where each τ_k is a triple τ_k = (head_k, rel_k, tail_k), and N_{g_i} is the number of triples in g_i. For message words that are not keywords, a NULL subgraph is used.
Then we calculate a subgraph vector g_i as a weighted sum of the head vectors and tail vectors in the triples.
Here e_* represents pretrained graph embeddings (TransE (Bordes et al., 2013)). The what-policy is then defined as a distribution over the candidate what-vertices, where v_j^w (a model parameter, different from both w_i^c and e_*) is the embedding of the j-th candidate what-vertex, and N_act^w is the number of candidate what-vertices.
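The attention-weighted subgraph vector can be sketched as below. This is a minimal sketch under assumptions: the dot-product scoring of each triple against a message-derived query vector is illustrative, since the paper's exact attention form is not reproduced here, and head/tail embeddings stand in for the pretrained TransE vectors.

```python
import numpy as np

def subgraph_vector(triples, emb, query):
    """Weighted sum of [head; tail] embeddings of a subgraph's triples,
    with attention weights from a softmax over dot-product scores."""
    # One representation per triple: concatenated head and tail embeddings.
    reps = np.stack([np.concatenate([emb[h], emb[t]]) for h, _, t in triples])
    q = np.concatenate([query, query])        # match the 2d-dim triple reps
    scores = reps @ q                         # illustrative scoring function
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ reps                     # attention-weighted sum
```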
The how-policy µ_how is defined as a distribution over the candidate response representations, where r̂_i is a candidate response representation from the state module, and λ_i is a mechanism mask: λ_i is set to 1 if the i-th responding mechanism is one of the neighbors of the selected what-vertex, and 0 otherwise.
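The masking above amounts to a masked softmax: mechanisms not linked to the selected what-vertex receive zero probability. A minimal sketch (the scoring inputs are assumed, not the paper's exact network):

```python
import numpy as np

def how_policy(scores, mask):
    """Masked softmax over responding mechanisms.

    scores: unnormalized score per mechanism (e.g., from the state and r̂_i).
    mask: 1 for mechanisms that neighbor the selected what-vertex, else 0.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(mask, dtype=float)
    exp = np.exp(scores - scores.max()) * mask  # zero out masked mechanisms
    return exp / exp.sum()
```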

Rewards
Following previous works, we consider the following utterance-level rewards:

Local relevance We use a state-of-the-art multi-turn response selection model, the DualEncoder of (Lowe et al., 2015), to calculate local relevance.
Repetition Repetition penalty is 1 if the generated response shares more than 60% words with any contextual utterances, otherwise 0.
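The repetition penalty can be sketched as a word-overlap check. One assumption is made explicit here: "shares more than 60% words" is read as the fraction of the response's word types that also appear in a single contextual utterance.

```python
def repetition_penalty(response, context_utterances, threshold=0.6):
    """Return 1 if the response shares more than `threshold` of its word
    types with any single contextual utterance, else 0."""
    resp_words = set(response.split())
    if not resp_words:
        return 0
    for utt in context_utterances:
        overlap = len(resp_words & set(utt.split())) / len(resp_words)
        if overlap > threshold:
            return 1
    return 0
```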
Target similarity For target-guided conversation, we calculate the cosine similarity between the chosen keyword and the target word in pretrained word embedding space as target similarity.

To leverage the global graph structure information of the CG to facilitate policy learning, we propose the following rewards:

Global coherence We calculate the average cosine distance between the chosen what-vertex and each of the history what-vertices (selected or mentioned previously) in the TransE based embedding space (also used in Equation 2) as the coherence reward.
Sustainability It is reasonable to promote what-vertices with a large number of neighbors, in order to generate more sustainable, coherent, and diverse dialogs. For this reward, we calculate a PageRank score (computed on the full CG) for the chosen what-vertex.
Shortest path distance to the target For target-guided conversation, this reward is 1 if the chosen what-vertex is closer to the target what-vertex in terms of shortest path distance than the previously chosen what-vertex, 0 if the distance does not change, and -1 otherwise.
Moreover, we define the final reward as a weighted sum of the above-mentioned factors, where the weights are set to [0.5, -5, 0, 3, 8000, 0] by default. Our rewards can thus fully leverage the dialog transition information in the training data by using not only utterance based rewards (e.g., local relevance) but also graph based rewards (e.g., coherence, sustainability).
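The weighted combination is straightforward; the sketch below assumes the factors follow the order in which the rewards are listed above (local relevance, repetition, target similarity, global coherence, sustainability, shortest path distance), which matches the statement that the third and sixth factors are target-guided ones.

```python
def final_reward(factors, weights=(0.5, -5, 0, 3, 8000, 0)):
    """Weighted sum of the six reward factors, in the assumed order:
    local relevance, repetition, target similarity,
    global coherence, sustainability, shortest-path distance."""
    assert len(factors) == len(weights)
    return sum(w * f for w, f in zip(weights, factors))
```

The very large sustainability weight (8000) compensates for PageRank scores being tiny on a large graph, so the two graph-based terms end up on a comparable scale.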

Policy Optimization
To make the training process more stable, we employ the A2C method (Sutton and Barto, 2018) for optimization. Moreover, we only update the policy parameters; the parameters of the other modules stay intact during RL training. The reward weights are optimized on the Weibo dataset by grid search; the weights of the third and sixth factors are set to 0 by default because those factors are proposed for target-guided conversation. If no keyword is chosen, as in the baseline models, we calculate target similarity for each word in the response and select the closest one.

NLG
As described in Section 3.1, we use the mechanism selected by how-policy to convert x into a response representationr. Then we feed the keyword in the selected what-vertex andr into a Seq2BF decoder (Mou et al., 2016) for response generation.
Experiments and Results

Datasets
We conduct experiments on two widely used opendomain dialog corpora.
Weibo corpus (Shang et al., 2015). This is a large micro-blogging corpus. After data cleaning, we obtain 2.6 million pairs for training, 10k pairs for validation, and 10k pairs for testing. We use publicly available lexical analysis tools (ai.baidu.com) to obtain POS tag features for this dataset, and we further use these features to extract keywords from utterances. We use Tencent AI Lab Embedding (ai.tencent.com/ailab/nlp/embedding.html) for embedding initialization in models.
Persona dialog corpus (Zhang et al., 2018a). This is a crowd-sourced dialog corpus where each participant plays the part of an assigned persona. To evaluate the policy controllability brought by CG-Policy, we conduct an experiment on target-guided conversation on the Persona dataset, as done in (Tang et al., 2019). The training/validation/testing sets contain 101,935/5,602/5,371 utterances respectively. Embeddings are initialized with GloVe (Pennington et al., 2014).

Methods
We carefully select three SOTA methods that focus on dialog policy learning as baselines. Please see the supplemental material for more details.
LaRL A latent-variable driven dialog policy model. We use the released code and choose multivariate categorical latent variables as RL actions, since this setting performs the best. For target-guided conversation, we implement another model, LaRL-Target, where we add the "target similarity" factor to the RL rewards; its weight is set to 4 by grid search.
ChatMore We implement the keyword driven policy model of (Yao et al., 2018), following their original design. For target-guided conversation, we implement ChatMore-Target, where we add the "target similarity" factor to the RL rewards; its weight is set to 4 by grid search.
TGRM A retrieval based model for target-guided conversation, where the keyword chosen at each turn must move strictly closer (in embedding space) to a given target word (Tang et al., 2019). For target-guided conversation, we use the code released by the original authors, denoted as TGRM-Target, and we use their kernel version since it performs the best. To suit the task of open-domain conversation on Weibo, we remove the unnecessary constraint on the keyword's similarity with the target word, denoted as TGRM.
CG-Policy Our system presented in Section 3. For target-guided conversation, we implement another system, CG-Policy-Target, where we use an additional feature, the "shortest path distance to the target" factor, to augment the original what-vertex embedding: v̂_j^w = W_1[v_j^w; e_{d_j}], where v̂_j^w is the augmented representation, W_1 is a weighting matrix, e_{d_j} is an embedding of the distance value d_j, and v̂_j^w has the same size as v_j^w. We also use this factor in reward estimation (its weight is set to 5 by grid search), and we do not use the "target similarity" factor. Moreover, we use the same dialog corpora to construct the CG and to train the user simulator, the reward functions, and the NLG module of CG-Policy.

User Simulator
We use the same user simulator for RL training of LaRL, ChatMore, and CG-Policy. The user simulator is the original multi-mapping based generator with an RNN decoder, which is pretrained on the dialog corpus and not updated during policy training. Please refer to (Chen et al., 2019) for more details. During testing, all systems share this simulator.

Evaluation Settings
Conversation with user simulator Following previous work (Li et al., 2016b; Tang et al., 2019), we use a user simulator to play the role of the human and let each model converse with it. Given a randomly selected model, we randomly select an utterance from all session-initial utterances in the test set for the model to start a conversation. Moreover, we set the maximum allowed number of turns to 8 in our experiments. Finally, we collect 100 model-simulator dialogs for evaluation. For single-turn level evaluation, we randomly sample 100 message-response pairs from these dialogs for each model.
Conversation with human Following previous work (Tang et al., 2019), we also perform human evaluation for a more reliable system comparison. Given a model to be evaluated, we randomly select a dialog from the test set and pick its first utterance for the model to start a conversation with a human. The conversation then continues until 8 turns are reached. Finally, we obtain 50 dialogs for evaluation. For single-turn level evaluation, we randomly sample 100 message-response pairs from these dialogs for each model.

Evaluation Metrics
Metrics such as BLEU and perplexity have been widely used for dialog evaluation (Li et al., 2016a), but it is widely debated how well these automatic metrics correlate with true response quality (Liu et al., 2016). Since the proposed system does not aim at predicting the highest-probability response at each turn, but rather at the long-term success of a dialog (e.g., coherence), we do not employ BLEU or perplexity for evaluation; instead, we propose the following metrics.

Multi-turn Level Metrics
Global coherence We define incoherence problems as follows: (1) inconsistent dialogs, where the model contradicts itself, e.g., it says it is a driver and later says it is a doctor; (2) one-sided dialogs, in which the model ignores the user's topics for two or more consecutive turns. A session is rated "0" if it contains more than three incoherence cases, "+1" if it contains 2 or 3 cases, and "+2" otherwise.
Distinct The metric Dist-i calculates the ratio of distinct i-grams in generated responses (Li et al., 2016a). We use Dist-2 to measure the diversity of generated responses.
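The Dist-i metric can be computed in a few lines; this sketch uses whitespace tokenization for illustration (the paper's exact tokenization is not specified here).

```python
def dist_n(responses, n=2):
    """Ratio of distinct n-grams over total n-grams across responses."""
    ngrams, total = set(), 0
    for resp in responses:
        toks = resp.split()
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0
```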

Dialog-target success rate For target-guided conversation, we measure the success rate of generating the target word within 8 turns.

Single-turn Level Metrics
Local appropriateness A response is rated "0" if it is inappropriate as a reply to the given message, and "1" otherwise.

Setting
We ask three annotators to judge the quality of each dialog (at the multi-turn level) or message-response pair (at the single-turn level) for each model. Model identifiers are masked during evaluation.

Conversation with simulator
As shown in Table 2, CG-Policy significantly outperforms the baselines (sign test, p-value < 0.01) in terms of global coherence and local appropriateness. This indicates that the CG can effectively facilitate policy learning (see the ablation study for further analysis). LaRL's single-turn response quality is worse than that of the other models; a possible explanation is that its latent variables are not fine-grained enough to provide sufficient information to guide response generation. ChatMore tends to select high-frequency or generic keywords, resulting in the worst performance in terms of Dist-2. TGRM performs the best in terms of Dist-2 and informativeness, indicating that retrieval based models can produce more diverse responses than generation based models, which is consistent with the conclusions of previous work (Chen et al., 2017; Zhang et al., 2018a). However, TGRM performs the worst in terms of coherence, since it does not use an RL framework; this indicates the importance of the RL framework for multi-turn dialog modeling. The Kappa value for inter-annotator agreement is above 0.4, indicating moderate agreement.

Conversation with human
As shown in Table 3, CG-Policy outperforms the baselines in terms of both global coherence and local appropriateness (sign test, p-value < 0.01), which is consistent with the results in Table 2. The Kappa value is above 0.4, indicating moderate agreement.

Ablation study
We conduct an ablation study for CG-Policy on the Weibo corpus to investigate why CG-Policy performs better.

First, to evaluate the contribution of the CG, we remove it from CG-Policy (denoted CG-Policy-noCG), so that graph structure information is used neither for action space pruning nor for reward design. Moreover, we use the CG (without how-vertices) to augment the ChatMore model for action space pruning and reward design (denoted ChatMore-CG). As shown in Table 4, the performance of CG-Policy-noCG drops dramatically in terms of coherence, Dist-2, and appropriateness compared to the original model, while the CG boosts the performance of ChatMore on most metrics. This indicates that the use of the CG is crucial to the superior performance of CG-Policy, and that it can also help other models, e.g., ChatMore.

Second, to evaluate the contribution of the CG for action space pruning and reward design respectively, we implement two system variants: (1) CG-Policy-noCGact, which uses all the what-vertices in the CG as action candidates at each turn; and (2) CG-Policy-noCGrwd, which removes all the CG-based factors from the RL rewards. As shown in Table 4, the performance of CG-Policy-noCGact drops significantly in terms of Dist-2, as it tends to select high-frequency keywords like ChatMore, indicating the importance of graph paths for providing locally-appropriate and diverse response keywords. Moreover, the performance of CG-Policy-noCGrwd drops significantly in terms of coherence, indicating that the CG based rewards can effectively guide CG-Policy to promote coherent dialogs.

Third, we remove the how-vertices from the CG (denoted CG-Policy-noCGhow). As shown in Table 4, removing how-vertices hurts performance in terms of Dist-2, indicating the importance of how-vertices for response diversity.

The Task of Target-guided Conversation
Besides maintaining coherence, CG grounded policy learning can enable more control over dialog models, which is important for achieving certain goals with a chatbot, e.g., proactively leading the dialog to certain chatting topics (keywords) or certain products.

Setting
Following the setting in (Tang et al., 2019), we randomly sample a keyword as the target word for each session in the testing procedure. Here we use a multi-mapping based user simulator trained on the Persona dataset for evaluation.

Table 5 presents the results on 100 dialogs for each model. We see that CG-Policy-Target significantly outperforms the baselines in terms of dialog-target success rate (sign test, p-value < 0.01). CG-Policy can successfully lead the dialog to a given target word by learning to walk over the CG, indicating that this graph gives us more control over the policy. LaRL-Target and ChatMore-Target perform badly in terms of success rate, which may be explained by their lack of proactive dialog content planning.

Analysis of Responding Mechanisms
Figure 4 provides representative words of each mechanism; we select words that occur frequently in responses guided by a given mechanism but rarely occur with other mechanisms. For example, the keywords of Mech-1 are mainly subjective words (e.g., think), for generating responses expressing personal opinion or intention, while Mech-2 tends to respond with a specific type of mood.

Figure 4: Representative words of responding mechanisms.

Conclusion
In this paper we present a novel graph grounded policy learning framework for open-domain multi-turn conversation, which can effectively leverage prior information about dialog transitions to foster a more coherent and controllable dialog. Experimental results demonstrate the effectiveness of this framework in terms of local appropriateness, global coherence, and dialog-target success rate. In the future, we will investigate how to extend the CG to support hierarchical topic management in conversational systems.

A Appendices
Training Details and Two Conversation Cases For fair comparison, all models share the same vocabulary (maximum size 50,000 for the Weibo corpus and 20,000 for the Persona corpus), the same initialized word embeddings (size 200), and the same keyword set. Further, a one-layer bidirectional GRU-RNN (hidden size 512) is utilized for all encoders. The dropout rate is 0.3, and the optimizer is Adam (lr = 2e-3) for all models.
We initialize each session with a starting utterance chosen randomly from the training dataset. The maximum number of turns is set to 8 and the discounting weight for rewards is set to 0.95.

Figure 6: Case 2: A conversation between CG-Policy and a human, where "B" is CG-Policy and "U" is the human. The red words are keywords. We translate the original Chinese utterances into English.