Knowledge Aware Conversation Generation with Explainable Reasoning over Augmented Graphs

Two types of knowledge, triples from knowledge graphs and texts from documents, have been studied for knowledge aware open-domain conversation generation, in which graph paths can narrow down vertex candidates for knowledge selection, and texts can provide rich information for response generation. Fusion of a knowledge graph and texts might yield mutually reinforcing advantages, but it has received little study. To address this challenge, we propose a knowledge aware chatting machine with three components: an augmented knowledge graph with both triples and texts, a knowledge selector, and a knowledge aware response generator. We formulate knowledge selection on the graph as a problem of multi-hop graph reasoning to effectively capture conversation flow, which is more explainable and flexible in comparison with previous work. To fully leverage the long text information that differentiates our graph from others, we improve a state-of-the-art reasoning algorithm with machine reading comprehension technology. We demonstrate the effectiveness of our system on two datasets in comparison with state-of-the-art models.


Introduction
One of the key goals of AI is to build a machine that can talk with humans when given an initial topic. To achieve this goal, the machine should be able to understand language with background knowledge, recall knowledge from memory or external resources, reason about these concepts together, and finally output appropriate and informative responses. Much research effort has been devoted to chitchat oriented conversation generation (Ritter et al., 2011; Shang et al., 2015). However, these models tend to produce generic or incoherent responses for a given topic, since it is quite challenging to learn semantic interactions merely from dialogue data without the help of background knowledge. Recently, some studies have introduced external knowledge, either unstructured knowledge texts (Ghazvininejad et al., 2018; Vougiouklis et al., 2016) or structured knowledge triples (Liu et al., 2018), to help open-domain conversation generation by producing responses conditioned on selected knowledge.
Data and code are available at https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/EMNLP2019-AKGCM
In the line of work on structured triples, the knowledge graph can help narrow down knowledge candidates for conversation generation with the use of prior information, e.g., triple attributes or graph paths. Moreover, this prior information can enhance the generalization capability of knowledge selection models. But this line suffers from information insufficiency for response generation, since there is only a single word or entity to facilitate generation. In the line of work on unstructured texts, the knowledge texts, e.g., comments about movies, can provide rich information for generation, but their unstructured representation demands that models be highly capable of performing knowledge selection or attention over the list of knowledge texts. Fusion of graph structure and knowledge texts might yield mutually reinforcing advantages for knowledge selection in dialogue systems, but it has received little study.
To bridge the gap between the two lines of studies mentioned above, we present an Augmented Knowledge Graph based open-domain Chatting Machine (denoted as AKGCM), which consists of a knowledge selector and a knowledge aware response generator. This two-stage architecture and graph based knowledge selection make our system explainable. Explainability is very important, e.g., in information oriented chatting scenarios, where a user needs to know how new knowledge in the chatbot's responses is linked to the knowledge in their utterances, or in business scenarios, where a user cannot make a business decision without justification.
To integrate texts into a knowledge graph, we take a factoid knowledge graph (KG) as its backbone, and align unstructured sentences of non-factoid knowledge with the factoid KG by linking entities from these sentences to vertices (containing entities) of the KG. Thus we augment the factoid KG with non-factoid knowledge, and retain its graph structure. Then we use this augmented KG to facilitate knowledge selection and response generation, as shown in Figure 1.
For knowledge selection on the graph, we adopt a deep reinforcement learning (RL) based reasoning model (Das et al., 2018), MINERVA, in which the reasoning procedure greatly reflects conversation flow as shown in Figure 1. It is as robust as embedding based neural methods, and is as explainable as path based symbolic methods. Moreover, our graph differs from previous KGs in that: some vertices in ours contain long texts, not a single entity or word. To fully leverage this long text information, we improve the reasoning algorithm with machine reading comprehension (MRC) technology (Seo et al., 2017) to conduct fine-grained semantic matching between an input message and candidate vertices.
Finally, for response generation, we use an encoder-decoder model to produce responses conditioned on selected knowledge.
In summary, we make the following contributions:
• This work is the first attempt that unifies knowledge triples and texts as a graph, and conducts flexible multi-hop knowledge graph reasoning in dialogue systems. Supported by such knowledge and knowledge selection method, our system can respond more appropriately and informatively.
• Our two-stage architecture and graph based knowledge selection mechanism provide better model explainability, which is very important for some application scenarios.
• For knowledge selection, to fully leverage long texts in vertices, we integrate machine reading comprehension (MRC) technology into the graph reasoning process.

Related Work
Conversation with Knowledge Graph: There is growing interest in leveraging factoid knowledge (Han et al., 2015; Liu et al., 2018; Zhu et al., 2017) or commonsense knowledge with graph based representation for generation of appropriate and informative responses. Compared with them, we augment previous KGs with knowledge texts and integrate more explainable and flexible multi-hop graph reasoning models into conversation systems. Wu et al. (2018) used a document reasoning network for modeling of conversational contexts, but not for knowledge selection.
Conversation with Unstructured Texts: With the availability of a large amount of knowledge texts from Wikipedia or user generated content, some work focuses on either modeling of conversation generation with unstructured texts (Ghazvininejad et al., 2018; Vougiouklis et al., 2016; Xu et al., 2017), or building benchmark dialogue data grounded on knowledge (Dinan et al., 2019; Moghe et al., 2018). In comparison with them, we adopt a graph based representation scheme for unstructured texts, which enables better explainability and generalization capability of our system.
Knowledge Graph Reasoning: Previous studies on KG reasoning can be categorized into three lines: path-based symbolic models (Das et al., 2017a; Lao et al., 2011), embedding-based neural models (Bordes et al., 2013; Wang et al., 2014), and models unifying embedding-based and path-based techniques (Das et al., 2018; Lin et al., 2018; Xiong et al., 2017), which can predict missing links for completion of a KG. In this work, for knowledge selection on a graph, we follow the third line of work. Furthermore, our problem setting is different from theirs in that some of our vertices contain long texts, which motivates the use of machine reading technology for graph reasoning.
Fusion of KG triples and texts: In the task of QA, combination of a KG and a text corpus has been studied with a strategy of late fusion (Gardner and Krishnamurthy, 2017;Ryu et al., 2014) or early fusion (Das et al., 2017b;Sun et al., 2018), which can help address the issue of low coverage to answers in KG based models. In this work, we conduct this fusion for conversation generation, not QA, and our model can select sentences as answers, not restricted to entities in QA models.

Problem Definition and Model Overview
Our problem is formulated as follows: Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathcal{L}_E\}$ be an augmented KG, where $\mathcal{V}$ is a set of vertices, $\mathcal{E}$ is a set of edges, and $\mathcal{L}_E$ is a set of edge labels (e.g., triple attributes or vertex categories). Given a message $X = \{x_1, x_2, \ldots, x_m\}$ and $\mathcal{G}$, the goal is to generate a proper response $Y = \{y_1, y_2, \ldots, y_n\}$ with supervised models. Essentially, the system consists of two stages: (1) knowledge selection: from the vertex candidates connected to $v_X$, we select as the answer the vertex that maximizes the following probability:
$$v_Y = \arg\max_{v} P_{KS}(v \mid v_X, \mathcal{G}, X). \qquad (1)$$
Here $v_X$ is one of the vertices retrieved from $\mathcal{G}$ using the entities or words in $X$, and it is ranked top-1 based on text similarity with $X$. Please see Equations 8 and 10 for the computation of $P_{KS}(\ast)$; (2) response generation: it estimates the probability
$$P(Y \mid X, v_Y) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, X, v_Y).$$
The overview of our Augmented Knowledge Graph based Chatting Machine (AKGCM) is shown in Figure 2. The knowledge selector first takes as input a message $X = \{x_1, x_2, \ldots, x_m\}$ and retrieves a starting vertex $v_X$ from $\mathcal{G}$ that is closely related to $X$, then performs multi-hop graph reasoning on $\mathcal{G}$, and finally arrives at a vertex $v_Y$ that holds knowledge appropriate for response generation. The knowledge aware response generator produces a response $Y = \{y_1, y_2, \ldots, y_n\}$ with knowledge from $v_Y$. At each decoding position, it attentively reads the selected knowledge text, and then generates a word from the vocabulary or copies a word from the knowledge text.
For model training, each [message, response] pair in the training data is associated with ground-truth knowledge and its vertex ID (ground-truth vertex) in $\mathcal{G}$ for knowledge grounding. These vertex IDs are used as ground truth for training the knowledge selector, while the triples of [message, knowledge text, response] are used for training the knowledge aware generator.
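The retrieval of the starting vertex $v_X$ described above can be illustrated with a minimal sketch. The text only states that candidates are retrieved using the entities or words in $X$ and ranked by text similarity; the whitespace tokenizer, the Jaccard similarity, and the function names below are assumptions made for illustration, not the system's actual implementation.

```python
def tokenize(text):
    """Lowercased whitespace tokens; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_start_vertex(message, vertex_texts):
    """Return the id of the vertex ranked top-1 by similarity to the message.

    vertex_texts: dict mapping a (hypothetical) vertex id to the text it holds.
    Only vertices sharing at least one token with the message are candidates.
    """
    msg = tokenize(message)
    candidates = {}
    for vertex_id, text in vertex_texts.items():
        tokens = tokenize(text)
        if msg & tokens:
            candidates[vertex_id] = tokens
    if not candidates:
        return None
    # Iterate over sorted ids so that ties are broken deterministically.
    return max(sorted(candidates), key=lambda v: jaccard(msg, candidates[v]))
```

In this sketch the function returns `None` when no vertex shares a word with the message; how the actual system handles that case is not stated in the text.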

Augmented Knowledge Graph
Given a factoid KG and related documents containing non-factoid knowledge, we take the KG as a backbone, where each vertex contains a single entity or word, and each edge represents an attribute or a relation. Then we segment the documents into sentences and align each sentence with entries of the factoid KG by mapping entities from these sentences to entity vertices of the KG. Thus we augment the factoid KG with non-factoid knowledge, and retain its structured representation.
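The augmentation step above can be sketched as follows. This is a simplified illustration under stated assumptions: entity linking is done by naive substring matching, the edge to a sentence vertex is labeled with the sentence's category, and all names (`build_augmented_kg`, the toy entities) are hypothetical rather than taken from the released code.

```python
from collections import defaultdict


def build_augmented_kg(triples, sentences):
    """Build an adjacency list: vertex -> list of (edge_label, destination).

    triples:   iterable of (head_entity, relation, tail_entity) from the factoid KG.
    sentences: iterable of (sentence_text, category) pairs of non-factoid knowledge.
    """
    graph = defaultdict(list)
    entities = set()
    # The factoid KG is kept as the backbone.
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        entities.update((head, tail))
    # Each sentence becomes a long-text vertex attached to the entities it mentions.
    for text, category in sentences:
        lowered = text.lower()
        for entity in sorted(entities):
            # Naive entity linking by substring match (an assumption).
            if entity.lower() in lowered:
                graph[entity].append((category, text))
    return dict(graph)
```

A real system would need a proper entity linker; the point of the sketch is only that sentence vertices are attached to existing entity vertices, so the graph structure of the factoid KG is preserved.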

Knowledge Selection on the Graph
Task Definition: We formulate knowledge selection on $\mathcal{G}$ as a finite horizon sequential decision making problem. It supports more flexible multi-hop walking on graphs, not restricted to one-hop walking as done in previous work (Han et al., 2015; Zhu et al., 2017).
As shown in Figure 2, we begin by representing the environment as a deterministic partially observed Markov decision process (POMDP) on $\mathcal{G}$ built in Section 3.2. Our RL based agent is given an input query of the form $(v_X, X)$. Starting from the vertex $v_X$ corresponding to $X$ in $\mathcal{G}$, the agent follows a path in the graph, and stops at a vertex that it predicts as the answer $v_Y$. Using a training set of known answer vertices for message-response pairs, we train the agent using policy gradients (Williams, 1992) with control variates.
The setting of our problem differs from previous KG reasoning in that: (1) the content of our input queries is not limited to entities and attributes; (2) some vertices in our graph contain long texts, while vertices in previous KGs contain just a single entity or short text. This motivates us to make a few improvements on previous models, as shown in Equations (5), (6), and (7).
Next we elaborate the 5-tuple (S, O, A, δ, R) of the environment, and policy network.
States: A state $S_t \in \mathcal{S}$ at time step $t$ is represented by $S_t = (v_t, v_X, X, v_{gt})$, and the state space consists of all valid combinations in $\mathcal{V} \times \mathcal{V} \times \mathcal{X} \times \mathcal{V}$, where $v_t$ is the current location of the RL agent, $v_{gt}$ is the ground-truth vertex, and $\mathcal{X}$ is the set of all possible $X$.
Observations: The complete state of the environment cannot be observed. Intuitively, the agent knows its current location ($v_t$) and $(v_X, X)$, but not the ground-truth vertex ($v_{gt}$), which remains hidden. Formally, the observation function $\mathcal{O}: \mathcal{S} \to \mathcal{V} \times \mathcal{V} \times \mathcal{X}$ is defined as $\mathcal{O}(S_t = (v_t, v_X, X, v_{gt})) = (v_t, v_X, X)$.
Actions: The set of possible actions $\mathcal{A}_{S_t}$ from a state $S_t$ consists of all outgoing edges of the vertex $v_t$ in $\mathcal{G}$.
This means that at each state the agent can choose which outgoing edge to take, given the edge label $l_e$ and the destination vertex $v_d$. We limit the length of the action sequence (horizon length) to a fixed number $T$ of time steps. Moreover, we augment each vertex with a special action called 'NO OP', which goes from a vertex to itself. This allows the agent to remain at a vertex for any number of time steps. It is especially helpful when the agent has reached a correct vertex at a time step $t < T$ and can stay at that vertex for the rest of the time steps.
Transition: The environment evolves deterministically by updating the state to the new vertex according to the edge selected by the agent. Formally, the transition function $\delta: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is defined by $\delta(S_t, A) = (v_d, v_X, X, v_{gt})$, where $S_t = (v_t, v_X, X, v_{gt})$ and $A = (l_e, v_d)$. Here $l_e$ is the label of an edge connecting $v_t$ and $v_d$, and $v_d$ is the destination vertex.
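The observation, action, and transition components described above can be sketched as plain functions over a state tuple (current vertex, start vertex, message, ground-truth vertex). The adjacency-list representation and the function names are assumptions for illustration only.

```python
NO_OP = "NO_OP"  # label of the special self-loop action


def observe(state):
    """The agent sees its location and the query, but not the ground-truth vertex."""
    current, start, message, _truth = state
    return (current, start, message)


def available_actions(graph, vertex):
    """All outgoing edges (label, destination) of `vertex`, plus the NO_OP self-loop.

    graph: dict mapping a vertex to a list of (edge_label, destination) pairs.
    """
    return list(graph.get(vertex, [])) + [(NO_OP, vertex)]


def transition(state, action):
    """Deterministic transition: move to the destination and keep everything else."""
    _current, start, message, truth = state
    _label, destination = action
    return (destination, start, message, truth)
```

Because NO_OP maps a state to itself, an agent that reaches the right vertex before the horizon ends can simply stay there for the remaining steps.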
Rewards: After $T$ time steps, if the current vertex is the ground-truth one, then the agent receives a reward of +1, otherwise 0. Formally, $R(S_T) = \mathbb{I}\{v_T = v_{gt}\}$.
Policy Network: We design a randomized non-stationary policy $\pi = (d_0, d_1, \ldots, d_{T-1})$, where $d_t = P(\mathcal{A}_{S_t})$ is a policy at time step $t$. In this work, for each $d_t$, we employ a policy network with three components to make the decision of choosing an action from all available actions ($\mathcal{A}_{S_t}$) conditioned on $X$.
The first component is a history-dependent feed-forward network (FFN) based model proposed by Das et al. (2018). We first employ an LSTM to encode the history $H_t = (H_{t-1}, A_{t-1}, O_t)$ as a continuous vector $h_t \in \mathbb{R}^{2d}$, where $H_t$ is the sequence of observations and actions taken. It is defined by
$$h_t = \mathrm{LSTM}(h_{t-1}, [a_{t-1}; o_t]),$$
where $a_{t-1}$ is the embedding of the relation corresponding to the label of the edge the agent chose at time $t-1$ and $o_t$ is the embedding of the vertex corresponding to the agent's state at time $t$.
Recall that each possible action represents an outgoing edge with information of the edge relation label $l_e$ and destination vertex $v_d$. So let $[l_e; v_d]$ denote an embedding for each action $A \in \mathcal{A}_{S_t}$, and we obtain the matrix $A_t$ by stacking embeddings for all the outgoing edges. Then we build a two-layer feed-forward network with ReLU nonlinearity which takes in the current history representation $h_t$ and the representation of $X$ ($e_X^{new}$). We use another single-layer feed-forward network for computation of $e_X^{new}$, which accepts the original sentence embedding of $X$ ($e_X$) as input. The updated FFN model for action decision is defined by
$$P_{FFN}(A_t) = A_t \left( W_2\, \mathrm{ReLU}\left( W_1 [h_t; e_X^{new}] + b_1 \right) + b_2 \right),$$
$$e_X^{new} = \mathrm{ReLU}(W_X e_X + b_X).$$
Recall that in our graph, some vertices contain long texts, differentiating our graph from others in previous work. The original reasoning model (Das et al., 2018), MINERVA, cannot effectively exploit the long text information within vertices since it just learns an embedding representation for the whole vertex, without detailed analysis of the text in vertices. To fully leverage the long text information in vertices, we employ two models, a machine reading comprehension (MRC) model (Seo et al., 2017) and a bilinear model, to score each possible $v_d$ from both a global and a local view.
For scoring from the global view, (1) we build a document by collecting sentences from all possible $v_d$, (2) we employ the MRC model to predict an answer span ($span_{aw}$) from the document, and (3) we score each $v_d$ by calculating a ROUGE-L score vector of $v_d$'s sentence with $span_{aw}$ as the reference, as follows:
$$P_{MRC}(A_t) = \mathrm{ROUGE}(\mathrm{Text}(V_d), \mathrm{Text}(span_{aw})). \qquad (6)$$
Here, $V_d$ denotes the candidate destination vertices, $\mathrm{Text}(\cdot)$ represents the operation of getting text contents, and $\mathrm{ROUGE}(\cdot)$ represents the operation of calculating the ROUGE-L score. We see that the MRC model can help to determine which $v_d$ is the best based on global information from the whole document.
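Step (3) of the global scoring can be illustrated with a small ROUGE-L implementation based on the longest common subsequence (LCS). The text does not say which ROUGE-L variant is used, so the F1 form with equal weight on precision and recall, the whitespace tokenization, and the function names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_candidates(candidate_texts, answer_span):
    """Score the text of each candidate vertex against the predicted answer span."""
    return [rouge_l_f1(text, answer_span) for text in candidate_texts]
```

The resulting list has one score per candidate destination vertex, which is the shape needed to combine it with the other per-action scores.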
For scoring from the local view, we use another bilinear model to calculate the similarity between $X$ and $v_d$, as follows:
$$P_{Bi}(A_t) = V_d W_B\, e_X^{new}.$$
Finally, we calculate the sum of the outputs of the three above-mentioned models and output a probability distribution over the possible actions, from which a discrete action is sampled:
$$P(A_t) = \mathrm{softmax}\left( P_{FFN}(A_t) + P_{MRC}(A_t) + P_{Bi}(A_t) \right).$$
Please see Section 3.1 for the definition of $P_{KS}(\ast)$. When the agent finally arrives at $S_T$, we obtain $v_T$ as the answer $v_Y$ for response generation.
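The final combination step can be sketched as below: the three per-action score vectors are summed, normalized with a softmax, and one action index is sampled. An unweighted sum is assumed because the text only mentions "a sum"; the function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_distribution(ffn_scores, mrc_scores, bilinear_scores):
    """Sum the three score vectors element-wise and normalize them."""
    summed = [f + m + b for f, m, b in zip(ffn_scores, mrc_scores, bilinear_scores)]
    return softmax(summed)


def sample_action(probs, rng=random):
    """Sample one action index according to the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when inspecting the paths an agent takes.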
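Training: For the policy network ($\pi_\theta$) described above, we want to find parameters $\theta$ that maximize the expected reward:
$$J(\theta) = \mathbb{E}_{(v_0, X, v_{gt}) \sim D}\; \mathbb{E}_{A_0, \ldots, A_{T-1} \sim \pi_\theta} \left[ R(S_T) \mid S_0 = (v_0, v_X, X, v_{gt}) \right],$$
where $v_0 = v_X$ is the starting vertex, and we assume there is a true underlying distribution $D$ with $(v_0, X, v_{gt}) \sim D$.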
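The policy-gradient training with a control variate can be illustrated on a toy problem. The sketch below is deliberately much simpler than the agent described above: it is a one-step problem with a softmax policy over a few actions, a 0/1 terminal reward, and a moving-average baseline as the control variate. The learning rate, the baseline decay, and all names are assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.5):
    """One REINFORCE update of softmax logits, using a baseline as control variate."""
    probs = softmax(logits)
    advantage = reward - baseline
    # d log pi(action) / d logit_k = 1[k == action] - probs[k]
    return [
        l + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (l, p) in enumerate(zip(logits, probs))
    ]


def train(num_actions=3, correct=2, episodes=500, seed=0):
    """Train on a toy task where only the action `correct` earns reward 1."""
    rng = random.Random(seed)
    logits = [0.0] * num_actions
    baseline = 0.0
    for _ in range(episodes):
        probs = softmax(logits)
        action = rng.choices(range(num_actions), weights=probs, k=1)[0]
        reward = 1.0 if action == correct else 0.0
        logits = reinforce_step(logits, action, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return softmax(logits)
```

In the full system the same idea would apply to whole paths: the log-probabilities of all actions along a sampled path are scaled by the terminal reward minus the baseline.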

Knowledge Aware Generation
Following the work of Moghe et al. (2018), we modify a text summarization model (See et al., 2017) to suit this generation task.
In the summarization task, the input is a document and the output is a summary, but in our case the input is a [selected knowledge, message] pair and the output is a response. Therefore we introduce two RNNs: one computes the representation of the selected knowledge, and the other that of the message. The decoder accepts the two representations and its own internal state representation as input, and then computes (1) a probability score which indicates whether the next word should be generated or copied, (2) a probability distribution over the vocabulary if the next word needs to be generated, and (3) a probability distribution over the input words if the next word needs to be copied. These three outputs are then combined, resulting in $P(y_t \mid y_{<t}, X, v_Y)$, to produce the next word in the response.
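The way the three decoder outputs are combined can be sketched as a mixture in the style of pointer-generator networks (See et al., 2017). The dictionary-based representation and the function name are assumptions for illustration; the actual model operates on tensors.

```python
def mix_distributions(p_gen, vocab_probs, copy_probs, source_tokens):
    """Combine generation and copy distributions into one word distribution.

    p_gen:         probability of generating (rather than copying) the next word.
    vocab_probs:   dict mapping each vocabulary word to its generation probability.
    copy_probs:    copy probability for each position of the input.
    source_tokens: the input words, aligned with copy_probs.
    """
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    for token, attn in zip(source_tokens, copy_probs):
        # A word that occurs several times in the input accumulates copy mass.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * attn
    return final
```

Words that are absent from the vocabulary but present in the input, such as rare names in the selected knowledge, still receive probability through the copy term.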

Datasets
We adopt two knowledge grounded multi-turn dialogue datasets for experiments, as follows:
EMNLP dialog dataset (Moghe et al., 2018): This Reddit dataset contains movie chats between two participants, wherein each response is explicitly generated by copying or modifying sentences from background knowledge such as IMDB's facts/plots, or Reddit's comments about movies. We follow their data split for training, validation and test. Their statistics can be seen in Table 1.
ICLR dialog dataset (Dinan et al., 2019): This dataset contains multi-turn conversations between two participants. One participant selects a beginning topic, and during the conversation the topic is allowed to change naturally. The two participants are not symmetric: one plays the role of a knowledgeable expert while the other is a curious learner. Since we focus on knowledge selection and knowledge aware generation, we filter their training data and test data by removing instances that do not use knowledge, and finally keep 30% of the instances (see footnote 3) for our study. Their statistics can be seen in Table 1. For models without the use of knowledge (Seq2Seq, HRED), we keep the original training data.

Experiment Settings
We follow existing work and conduct both automatic evaluation and human evaluation for our system. We also compare our system with a set of carefully selected baselines, as follows.
Seq2Seq: We implement a sequence-to-sequence model (Seq2Seq) (Sutskever et al., 2014), which is widely used in open-domain conversational systems.

MemNet: We implement an end-to-end knowledge-MemNet based conversation model (Ghazvininejad et al., 2018).
Footnote 3: Their ground-truth responses should have high ROUGE-L scores with the corresponding ground-truth knowledge texts.
GTTP: It is an end-to-end text summarization model (See et al., 2017) studied on the EMNLP data. We use the code released by Moghe et al. (2018), where they modify GTTP to suit knowledge aware conversation generation.
BiDAF+G: It is a Bi-directional Attention Flow based QA model (BiDAF) (Seo et al., 2017) that performs best on the EMNLP dataset. We use the code released by Moghe et al. (2018), where they use it to find the answer span from a knowledge document, taking the input message as the query. Moreover, we use a response generator (the same as ours) for NLG with the predicted knowledge span.
TMemNet: It is a two-stage transformer-MemNet based conversation system that performs best on the ICLR dataset (Dinan et al., 2019). We use the code released by the original authors.
CCM: It is a state-of-the-art knowledge graph based conversation model. We use the code released by the original authors and modify our graph to suit their setting by turning each content word of a long text into an individual vertex that replaces our long-text vertices.
AKGCM: It is our two-stage system presented in Section 3. We implement our knowledge selection model based on the code of Das et al. (2018) and that of Moghe et al. (2018). We use BiDAF as the MRC module, shown in Equation (6), and we train the MRC module on the same training set as our knowledge selection model. We implement the knowledge aware generation model based on the code of GTTP released by Moghe et al. (2018). We also implement a variant, AKGCM-5, in which the top five knowledge texts are used for generation and the other settings are unchanged.

Automatic Evaluations
Metrics: Following the work of Moghe et al. (2018), we adopt BLEU-4 (Papineni et al., 2002), ROUGE-2 (Lin, 2004) and ROUGE-L (Lin and Och, 2004) to evaluate how similar the output response is to the reference text. We use Hit@1 (the top-1 accuracy) to evaluate the performance of knowledge selection.
Table 2 (column headers only): BLEU-4, ROUGE-2, ROUGE-L and Hit@1 on the EMNLP dialog dataset and on the ICLR dialog dataset.
Table 3: Results of human evaluations on the two datasets. AKGCM (or AKGCM-5) outperforms all the baselines significantly (sign test, p-value < 0.05) in terms of the two metrics.
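For concreteness, the Hit@1 metric named in the Metrics paragraph can be computed as the fraction of test instances whose top-ranked vertex equals the ground-truth vertex. The function name and the use of vertex ids as strings are assumptions.

```python
def hit_at_1(predicted, gold):
    """Top-1 accuracy: share of instances where the selected vertex is the gold one.

    predicted: list of selected vertex ids, one per test instance.
    gold:      list of ground-truth vertex ids, aligned with `predicted`.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```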
Results: As shown in Table 2, AKGCM (or AKGCM-5) obtains the highest score on the test set in terms of Hit@1, and the second highest scores in terms of BLEU-4, ROUGE-2 and ROUGE-L, surpassing other models, except BiDAF, by a large margin. This indicates that AKGCM selects knowledge better than BiDAF and TMemNet, and generates more informative and grammatical responses. We notice that from the EMNLP dataset to the ICLR dataset, there is a significant performance drop for almost all the models. This is probably because the quality of the ICLR dataset is worse than that of the EMNLP dataset. A common phenomenon in the ICLR dataset is that the knowledge used in responses is only loosely relevant to the input messages, which increases the difficulty of model learning.

Human Evaluations
Metrics: We resort to a web crowdsourcing service for human evaluations. We randomly sample 200 messages from test set and run each model to generate responses, and then we conduct pair-wise comparison between the response by AKGCM and the one by a baseline for the same message. In total, we have 1400 pairs on each dataset since there are seven baselines. For each pair, we ask five evaluators to give a preference between the two responses, in terms of the following two metrics: (1) appropriateness (Appr.), e.g., whether the response is appropriate in relevance, and logic, (2) informativeness (Infor.), whether the response provides new information and knowledge in addition to the input message, instead of generic responses such as "This movie is amazing". Tie is allowed. Notice that system identifiers are masked during evaluation.
Annotation Statistics: We calculate agreement to measure inter-evaluator consistency. For appropriateness, the percentage of test instances for which at least 2 evaluators give the same label (2/3 agreement) is 98%, and the percentage with 3/3 agreement is 51%. For informativeness, the percentage with at least 2/3 agreement is 98% and that with 3/3 agreement is 55.5%.
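The two agreement figures can be computed as sketched below, assuming three preference labels per test instance as implied by the "2/3" and "3/3" notation. The label strings and the function name are hypothetical.

```python
from collections import Counter


def agreement_rates(labels_per_instance):
    """Return (share with at least 2 matching labels, share with all labels equal).

    labels_per_instance: one list of evaluator labels per test instance,
    e.g. ["win", "win", "tie"].
    """
    n = len(labels_per_instance)
    at_least_two = sum(
        1 for labels in labels_per_instance
        if Counter(labels).most_common(1)[0][1] >= 2
    )
    unanimous = sum(1 for labels in labels_per_instance if len(set(labels)) == 1)
    return at_least_two / n, unanimous / n
```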
Results: In Table 3, each score for win/tie/lose is the percentage of messages for which AKGCM (or AKGCM-5) generates better/almost the same/worse responses, in comparison with a baseline. We see that our model outperforms all the baselines significantly (sign test, p-value < 0.05) in terms of the two metrics on the two datasets. Furthermore, our model beats the strongest baseline, BiDAF. This demonstrates the effectiveness of our graph reasoning mechanism, which can use global graph structure information and exploit long text information. Our data analysis shows that both Seq2Seq and HRED tend to generate safe responses starting with "my favorite character is" or "I think it is". Both MemNet and TMemNet can generate informative responses, but the knowledge in their responses tends to be incorrect, which is a serious problem for knowledge aware conversation generation. Our results show that GTTP and BiDAF are very strong baselines. This indicates that the attention mechanism (from machine reading) for knowledge selection and the copy mechanism can benefit knowledge aware conversation generation. Although CCM has the mechanisms mentioned above, this model is good at dealing with structured triples rather than long texts, which may explain the inferior performance of CCM in our problem setting.
We find that many responses are likely to be a simple copy of the selected knowledge, which reflects the characteristics of the datasets. In the two datasets, some sentences in the background knowledge are directly used as responses to the messages. Therefore, the NLG module is likely to copy as much content as possible from the selected knowledge for generation. Moreover, the summarization model GTTP tends to copy words from its input message as its output due to its generation mechanism. Table 5 presents examples in which AKGCM (AKGCM-5) performs better than other models on the two datasets.

Model Analysis
AKGCM without (w/o) Non-factoid Knowledge: To verify the contribution of non-factoid knowledge, we remove non-factoid knowledge from the augmented KG in the test procedure, and report the performance of our system with only factoid knowledge in Table 4. We see that without non-factoid knowledge from the EMNLP dataset, the performance of our system drops significantly in terms of BLEU and ROUGE. This indicates that non-factoid knowledge is essential for knowledge aware conversation generation.
AKGCM w/o the MRC Model or Bilinear One: For the ablation study, we implement a few variants of our system without the bilinear model or MRC for knowledge selection. Results of these variants are reported in Table 4. Comparing the performance of our full model with its variants, we find that both MRC and the bilinear model bring performance improvement to our system. This indicates that the full interaction between messages and knowledge texts through their attention mechanism is effective for knowledge selection.
Model Generalization: As shown in Figure 3, we gradually reduce the size of the training data, and AKGCM can still manage to achieve acceptable performance, even when given extremely small training data (around 3,400 u-r pairs at the x-axis point of 10%). But the performance of the strongest baseline, BiDAF+G, drops more dramatically in comparison with AKGCM. This indicates that our graph reasoning mechanism can effectively use the graph structure information for knowledge selection, resulting in better generalization capability of AKGCM.

Table 5: Examples in which AKGCM performs better than other models on the two datasets.
Model | EMNLP dialog dataset | ICLR dialog dataset
... | ... | It lies between violet and green on the spectrum of light.
GTTP | I think this movie was worth it , for i will be definitely be seeing it a second and possibly third time. | Yes , green is the color between blue and yellow on the visible spectrum .
CCM | I think it was a worth UNK , but I still enjoyed it intense than the fraternities . | The energy is chemical on the Several minerals and tradition .
TMemNet | Man and i think this movie was worth it , i will definitely be seeing if it was a second time . | It ' s been around since the middle ages , it was associated with royalty .
BiDAF+G | It not only serves as a great climax , but something that we can once again commend Pixar for telling great stories in out of the box thinking that we should think paying for the $ 10 dollar ticket is worth it . | It 's the color between blue and the color yellow .
Selected knowledge | It not only serves as a great climax, but something that we can once again commend Pixar for telling great stories in out of the box thinking that we should think paying for the $10 dollar ticket is worth it. And boy did I think this movie was worth it, ... | Green is the color between blue and yellow on the visible spectrum.
AKGCM (AKGCM-5) | And boy did I think this movie was worth it , for I will be definitely be seeing it a second and possibly third time . | Green is evoked by light which has a dominant wavelength of roughly 495-570 nm .
Selected knowledge | And boy did I think this movie was worth it, for I will be definitely be seeing it a second and possibly third time. | ... It is evoked by light which has a dominant wavelength of roughly 495-570 nm. Several minerals have a green color, including the emerald, ...
Model Explainability: We check the graph paths traversed by our system for knowledge selection and try to interpret what heuristics have been learned. We find that our system can learn to visit different types of vertices conditioned on conversational contexts, e.g., selecting comment vertices as responses for utterances starting with "what do you think". These results suggest that AKGCM may also be a good assistive tool for discovering new algorithms, especially in cases where graph reasoning is new and less well-studied for conversation modeling.

Conclusion
In this paper, we propose to augment a knowledge graph with texts and integrate it into an open-domain chatting machine with both a graph reasoning based knowledge selector and a knowledge aware response generator. Experiments demonstrate the effectiveness of our system on two datasets compared to state-of-the-art approaches.