GraphDialog: Integrating Graph Knowledge into End-to-End Task-Oriented Dialogue Systems

End-to-end task-oriented dialogue systems aim to generate system responses directly from plain text inputs. There are two challenges for such systems: one is how to effectively incorporate external knowledge bases (KBs) into the learning framework; the other is how to accurately capture the semantics of dialogue history. In this paper, we address these two challenges by exploiting the graph structural information in the knowledge base and in the dependency parsing tree of the dialogue. To effectively leverage the structural information in dialogue history, we propose a new recurrent cell architecture which allows representation learning on graphs. To exploit the relations between entities in KBs, the model combines multi-hop reasoning ability based on the graph structure. Experimental results show that the proposed model achieves consistent improvement over state-of-the-art models on two different task-oriented dialogue datasets.


Introduction
Task-oriented dialogue systems aim to help users accomplish specific tasks, such as restaurant reservation, hotel booking, and weather inquiry, via natural language interfaces. There are many commercial applications of this kind (e.g., Amazon Alexa, Google Home, and Apple Siri) which make our lives more convenient. Figure 1 illustrates such an example, where a customer is asking for information about restaurants. By querying the knowledge base (KB), the agent aims to provide the correct restaurant entities from the KB to satisfy the customer, in natural language form. Hence, the ability to understand the dialogue history and to retrieve relevant information from the KB is essential for task-oriented dialogue systems.
Figure 1: An example dialogue in the restaurant booking domain. The top part is the knowledge base (KB) information, represented as a graph; the bottom part is the conversation between a customer and the agent. Our aim is to predict the agent responses given the KB information and the customer utterances.

One approach for designing task-oriented dialogue systems is the pipeline approach (Williams and Young, 2007; Lee et al., 2009; Young et al., 2013), but it suffers from difficulty in credit assignment and adaptation to new domains. Another popular approach is end-to-end models (Serban et al., 2016; Wen et al., 2017; Williams et al., 2017; Zhao et al., 2017; Serban et al., 2017), which directly map the dialogue history to the output responses. This approach has attracted more attention in the research community recently, as it alleviates the drawbacks of the pipeline approach. However, end-to-end dialogue models usually suffer from ineffective use of knowledge bases due to the lack of an appropriate framework for handling KB data.
To mitigate this issue, recent end-to-end dialogue studies (Eric et al., 2017; Madotto et al., 2018) employ memory networks (Weston et al., 2015; Sukhbaatar et al., 2015) to support learning over the KB, and have achieved promising results by integrating memory with copy mechanisms (Gulcehre et al., 2016; Eric and Manning, 2017). By using memory, they assume that the underlying structure of the KB is linear, since memory can be viewed as a list structure. As a result, the relationships between entities are not captured. However, a KB is naturally a graph structure (nodes are entities and edges are relations between entities). By overlooking such relationships, the model fails to capture substantial information embedded in the KB, including the semantics of the entities, which may significantly impact the accuracy of results. Moreover, structural knowledge such as dependency relations has recently been investigated on other tasks (e.g., relation extraction) (Peng et al., 2017; Song et al., 2018) and shown to improve model generalizability. However, such dependency relations (essentially also a graph structure) have not been explored in dialogue systems, again missing great potential for improvement.
With the above insight, we propose a novel graph-based end-to-end task-oriented dialogue model (GraphDialog) that aims to exploit the graph knowledge both in the dialogue history and in KBs. Unlike traditional RNNs such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014), we design a novel recurrent unit (Section 3.1.2) that allows multiple hidden states as inputs at each timestep, so that the dialogue history can be encoded with graph structural information. The recurrent unit employs a masked attention mechanism to handle a variable number of input hidden states at each timestep. Moreover, we incorporate a graph structure (Section 3.2) to handle the external KB information and perform multi-hop reasoning on the graph to retrieve KB entities.
Overall, the contributions of this paper are summarized as follows: • We propose a novel graph-based end-to-end dialogue model for effectively incorporating external knowledge bases into task-oriented dialogue systems.
• We further propose a novel recurrent cell architecture to exploit the graph structural information in the dialogue history. We also combine multi-hop reasoning ability with the graph to exploit the relationships between entities in the KB.
• We evaluate the proposed model on two real-world task-oriented dialogue datasets (i.e., SMD and MultiWOZ 2.1). The results show that our model consistently outperforms the state-of-the-art models.

Related Work
Task-oriented dialogue systems have been a long-standing research topic (Williams and Young, 2007; Lee et al., 2009; Huang et al., 2020b) and can be integrated into many practical applications such as virtual assistants (Sun et al., 2016, 2017). Traditionally, task-oriented dialogue systems are built in the pipeline approach, which consists of four essential components: natural language understanding (Chen et al., 2016), dialogue state tracking (Lee and Stent, 2016; Zhong et al., 2018; Wu et al., 2019a), policy learning (Su et al., 2016; Peng et al., 2018; Su et al., 2018), and natural language generation (Sharma et al., 2017; Chen et al., 2019; Huang et al., 2020a). Another recent approach is end-to-end models (Wu et al., 2018; Lei et al., 2018), which directly map user utterances to responses without heavy annotations. Bordes et al. (2017) apply end-to-end memory networks (Sukhbaatar et al., 2015) to task-oriented dialogues and show that end-to-end models are promising on these tasks. To produce more flexible responses, several generative models have been proposed (Zhao et al., 2017; Serban et al., 2016). They formulate response generation as a translation task and apply sequence-to-sequence (Seq2Seq) models to generate responses. Seq2Seq models have been shown to be effective in language modeling, but they struggle to incorporate external KBs into responses. To mitigate this issue, Eric and Manning (2017) enhance the Seq2Seq model with a copy mechanism. Madotto et al. (2018) combine the idea of pointers with memory networks and obtain improved performance. Wu et al. (2019b) incorporate a global pointer mechanism and achieve further improvement. Our study differs from those works in that we exploit the rich graph information contained both in the dialogue history and in the KBs to effectively incorporate KBs into dialogue systems.

Proposed Model
Our proposed model consists of three components: an encoder (Section 3.1), a decoder (Section 3.3), and a knowledge graph with multi-hop reasoning ability (Section 3.2). Formally, let X = {x_1, ..., x_n} be a sequence of tokens, where each token x_i ∈ X corresponds to a word in the dialogue history.
We first obtain a dialogue graph G (Section 3.1.1), which is the dependency parsing graph of the sentences in the dialogue history X, as the input of the encoder. The encoder then learns a fixed-length vector as the encoding of the dialogue history based on G, which is fed to the decoder for hidden state initialization. The knowledge graph module adopts another graph G = {V, E} to store and retrieve the external knowledge data (Section 3.2.1), where V denotes the entities and E denotes the edges.
The decoder generates the system response Y = {y_1, ..., y_m} token by token, either by copying entities from graph G via querying the knowledge graph or by generating tokens from the vocabulary.
Figure 2 illustrates the overall architecture of the proposed model. In the following sections, we describe each component in detail.

Dialogue Graph
To learn semantically rich representations of words with various relationships, such as adjacency and dependency relations, we first use the off-the-shelf tool spaCy to extract the dependency relations among the words in the dialogue history X. Figure 3 gives an example of a dependency parsing result. The bi-directional edges among words allow information to flow both from dependents to heads and from heads to dependents. The intuition is that the representation learning of the head words should be allowed to be influenced by the dependent words and vice versa, so that the learning process can capture the mutual relationships between head words and dependent words and provide richer representations.
We compose the dialogue graph by combining the obtained dependency relations with the sequential relations (i.e., Next and Pre) among words, which serves as the input to the graph encoder. To further support bi-directional representation learning, we split the obtained dialogue graph into two graphs: a forward graph and a backward graph.
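As an illustration of this construction, the sketch below builds the combined edge list from a tokenized utterance and a precomputed set of dependency arcs (e.g., as produced by spaCy). The arc labels and the `_rev` suffix for reversed dependency edges are illustrative conventions of ours, not part of the model specification.

```python
def build_dialogue_graph(tokens, dep_arcs):
    """Combine dependency arcs with sequential Next/Pre relations.

    tokens:   list of words in the dialogue history
    dep_arcs: (head_index, relation, dependent_index) triples,
              e.g. extracted with a spaCy dependency parse
    Returns a list of directed, labeled edges over token indices.
    """
    edges = []
    for i in range(len(tokens) - 1):
        edges.append((i, i + 1, "Next"))  # left-to-right word order
        edges.append((i + 1, i, "Pre"))   # right-to-left word order
    for head, rel, dep in dep_arcs:
        edges.append((head, dep, rel))            # head -> dependent
        edges.append((dep, head, rel + "_rev"))   # dependent -> head
    return edges

# "There is a supermarket": 'is' heads 'There' and 'supermarket';
# 'a' modifies 'supermarket'. The arc labels below are illustrative.
tokens = ["There", "is", "a", "supermarket"]
arcs = [(1, "expl", 0), (1, "attr", 3), (3, "det", 2)]
graph = build_dialogue_graph(tokens, arcs)
```

The bi-directional edges make every dependency traversable in both directions, matching the intuition above that heads and dependents should influence each other's representations.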

Recurrent Cell Architecture
The recurrent cell architecture (Figure 4) is the core computing unit of the graph encoder and is used to compute the hidden state of each word in the obtained dialogue graph. The cell traverses the words in the dialogue graph sequentially, according to the word order in the dialogue history. Next, we show how to compute the cell hidden state h_t at timestep t.
Let us define x_t as the input word representation at timestep t. P(t) = {p_1, p_2, ..., p_k} is the set of predecessor words of x_t, where each p_i ∈ P(t) denotes a word in the dialogue graph that connects to x_t, and k is the total number of predecessors of x_t. H = {h_1, h_2, ..., h_k} is the set of hidden states, where each element h_j ∈ H denotes the hidden state of the j-th predecessor p_j ∈ P(t).
The input of the cell consists of two parts: the input word vector x_t, and the predecessor hidden states H. First, we loop over the k hidden states in H and compute a reset gate for each of them. Specifically, we compute r_j for the j-th hidden state using:

$r_j = \sigma(W_r x_t + U_r h_j)$ (1)

where σ is the logistic sigmoid function, x_t and h_j are the current input and the hidden state of the j-th predecessor at timestep t, respectively, and W_r and U_r are learnable parameters. We then compute a candidate hidden state $\tilde{h}_t$ using:

$\tilde{h}_t = \phi\Big(W_n x_t + \frac{1}{k}\sum_{j=1}^{k} U_n (r_j \odot h_j)\Big)$ (2)

where φ is the hyperbolic tangent function, k is the number of predecessors of word x_t, and W_n and U_n are the learnable weight matrices. Intuitively, $\tilde{h}_t$ is the contextualized representation of the current input x_t.
Next, we combine the obtained candidate hidden state $\tilde{h}_t$ with the predecessor hidden states H, and use a masked attention mechanism (Equation 6) to aggregate them into the output hidden state h_t at timestep t. To obtain sufficient expressive power, we first apply linear transformations to the input x_t and the hidden states h_j ∈ H using:

$\bar{h}_j = W_z x_t + U_z h_j$ (3)

where W_z and U_z are learnable parameters and t is the current timestep. We denote $\bar{H} = \{\bar{h}_1, \bar{h}_2, ..., \bar{h}_k\}$ as the transformed set of hidden states. Then we add the previously obtained candidate hidden state $\tilde{h}_t$ into the transformed set and obtain $\bar{H}' = \{\bar{h}_1, \bar{h}_2, ..., \bar{h}_k, \tilde{h}_t\}$. The intuition is that the output hidden state depends on both the history information ($\bar{h}_1$ to $\bar{h}_k$) and the current input ($\tilde{h}_t$).
Then we apply an attention mechanism using the hidden states $\bar{H}'$ as keys and the current input x_t as the query. Intuitively, different inputs (e.g., different predecessors in $\bar{H}'$) should have different impacts on the output hidden state h_t, and we expect our model to capture that. However, the inputs may have different numbers of predecessors at different timesteps. To handle this, inspired by Vaswani et al. (2017), we employ a masked attention mechanism to learn the importance of each predecessor at every timestep, preventing padding information from affecting the learning process. We compute the attention using:

$e_j = v^\top \phi(\bar{h}_j)$ (4)

$\mathrm{softmax}(z_i) = e^{z_i} / \textstyle\sum_j e^{z_j}$ (5)

$\alpha_j = \mathrm{softmax}\big([\,e_j\,]_m\big)$ (6)

where v is a learnable parameter, $\bar{h}_j$ is the j-th vector in $\bar{H}'$, $\alpha_j$ denotes the attention weight on the j-th vector in $\bar{H}'$, and $[\cdot]_m$ denotes the mask operation. In our implementation, we simply set the score to negative infinity if the j-th hidden state corresponds to a pad token. Finally, we compute the weighted sum to obtain the cell output hidden state h_t at timestep t using:

$h_t = \textstyle\sum_{j=1}^{k+1} \alpha_j \bar{h}_j$ (7)

Intuitively, the reset gate controls the information flow from the multiple predecessors to the hidden state of the current timestep. If a predecessor word is more correlated with the current input word, then more of its information is expected to flow through the gate and affect the representation at the current timestep.
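A minimal NumPy sketch of one step of this cell is shown below. The parameter shapes (all weight matrices d×d, a single score vector v) and the treatment of k as the padded predecessor count are our own simplifying assumptions; the sketch is meant to make the gating-plus-masked-attention flow concrete, not to reproduce the exact published cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_step(x_t, H, mask, params):
    """One step of the graph recurrent cell (illustrative sketch).

    x_t:  (d,) input word vector
    H:    (k, d) predecessor hidden states, possibly padded
    mask: (k,) 1.0 for real predecessors, 0.0 for pad slots
    """
    Wr, Ur, Wn, Un, Wz, Uz, v = (params[key] for key in
                                 ("Wr", "Ur", "Wn", "Un", "Wz", "Uz", "v"))
    k = H.shape[0]
    # Reset gate per predecessor.
    r = sigmoid(x_t @ Wr + H @ Ur)                       # (k, d)
    # Candidate hidden state from gated predecessors.
    h_cand = np.tanh(x_t @ Wn + (r * (H @ Un)).sum(0) / max(k, 1))
    # Linear transforms of predecessors, plus the candidate state.
    H_bar = np.vstack([x_t @ Wz + H @ Uz, h_cand[None, :]])  # (k+1, d)
    # Masked attention: pad slots get -inf scores before softmax;
    # the candidate slot is never masked.
    scores = np.tanh(H_bar) @ v
    full_mask = np.concatenate([mask, np.ones(1)])
    scores = np.where(full_mask > 0, scores, -np.inf)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # Output hidden state is the attention-weighted sum.
    return alpha @ H_bar

rng = np.random.default_rng(0)
d = 8
params = {key: rng.normal(scale=0.1, size=(d, d)) for key in
          ("Wr", "Ur", "Wn", "Un", "Wz", "Uz")}
params["v"] = rng.normal(scale=0.1, size=d)
h_t = cell_step(rng.normal(size=d), rng.normal(size=(3, d)),
                np.array([1.0, 1.0, 0.0]), params)
```

Because the pad scores are set to negative infinity before the softmax, padded slots receive exactly zero attention, so variable numbers of predecessors can share one batched shape.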

Bi-directional Representation
To obtain a bi-directional representation of the dialogue history, we use the same cell architecture (Section 3.1.2) to loop over the forward graph and the backward graph separately, and compute a forward representation and a backward representation, respectively.

Multi-hop Reasoning Mechanism over Knowledge Graph
A straightforward way to exploit the graph information in the KB is to represent the KB as a graph structure and then query the graph using an attention mechanism with the decoder hidden states. However, our preliminary experiments did not show good performance with this approach. We conjecture that this may be due to the poor reasoning ability of the method. To address this issue, we extend the graph with a multi-hop reasoning mechanism, which aims to strengthen the reasoning ability over the graph as well as to capture the graph structural information between entities via self-attention. We call this the knowledge graph module in the following sections.
Formally, the knowledge graph module contains two sets of trainable parameters: C = {C^1, C^2, ..., C^{K+1}}, where each C^k is an embedding matrix that maps tokens to vector representations, and V = {V^1, V^2, ..., V^{K+1}}, where each V^k is a weight vector for computing self-attention coefficients, and K is the maximum number of hops. Now we describe how to compute the output vector of the knowledge graph. The model loops over K hops on an input graph. At each hop k, a query vector q^k is employed as the reading head. First, the model uses the embedding layer C^k to obtain the continuous vector representation of each node i in the graph as C_i^k, where C_i^k = C^k(n_i) and n_i is the i-th node in the graph. Then we perform a self-attention mechanism on the nodes and compute the attention coefficients using:

$e_{ij} = \varphi\big(V^{k\top} [\,C_i^k \,\|\, C_j^k\,]\big)$ (8)

where ϕ is the LeakyReLU activation function (with negative input slope α = 0.2), V^k is the parametrized weight vector of the attention mechanism at hop k, C_i^k and C_j^k are the node vectors of the i-th and j-th nodes in the graph at hop k, and ‖ is the concatenation operation. We then normalize the coefficients of each node i with respect to all its first-order neighbors using the softmax function:

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l \in N_i} \exp(e_{il})}$ (9)

where N_i is the set of first-order neighbors of node i (including i itself) and exp is the exponential function.
Then we update the representation of each node i by a weighted sum of its neighbors in N_i:

$\hat{C}_i^k = \textstyle\sum_{j \in N_i} \alpha_{ij} C_j^k$ (10)

Next, the query vector q^k is used to attend to the updated nodes in the graph and compute the attention weight for each node i at hop k:

$p_i^k = \mathrm{softmax}\big(q^{k\top} \hat{C}_i^k\big)$ (11)

To obtain the output of the knowledge graph, we apply the same self-attention mechanism (Equations 8 and 9) and update strategy (Equation 10) to the node representations C_i^{k+1}. We use C^{k+1} here since the adjacent weight tying strategy is adopted. Denoting the updated node representation for output as $\hat{C}_i^{k+1}$, the model reads out the graph output o^k by the weighted sum:

$o^k = \textstyle\sum_i p_i^k \hat{C}_i^{k+1}$ (12)

Then the query vector is updated for the next hop using q^{k+1} = q^k + o^k. The final output of the knowledge graph is o^K, which becomes a part of the inputs to the decoder.
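The hop loop above can be sketched as follows. This is a simplified NumPy rendition under our own assumptions: adjacent weight tying is simulated by sharing hop-wise embedding matrices, and the neighbor attention weights are reused when reading out with the next-hop embeddings rather than recomputed, so it illustrates the control flow rather than the exact published module.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_hop_read(node_emb, adj, q0, V, K):
    """K-hop read over the knowledge graph (illustrative sketch).

    node_emb: list of K+1 hop-wise node embedding matrices, each (n, d)
    adj:      (n, n) adjacency with self-loops (first-order neighbors)
    q0:       (d,) initial query vector
    V:        list of K+1 attention weight vectors, each (2*d,)
    """
    q = q0
    for k in range(K):
        C = node_emb[k]
        n, _ = C.shape
        # Self-attention coefficients over all node pairs (Eq. 8).
        pair = np.concatenate([np.repeat(C, n, 0), np.tile(C, (n, 1))], 1)
        s = pair @ V[k]
        e = np.where(s > 0, s, 0.2 * s).reshape(n, n)   # LeakyReLU
        e = np.where(adj > 0, e, -np.inf)               # neighbors only
        alpha = softmax(e, axis=1)                      # Eq. 9
        C_upd = alpha @ C                               # Eq. 10
        p = softmax(C_upd @ q)                          # Eq. 11
        C_next = alpha @ node_emb[k + 1]                # tied next-hop view
        o = p @ C_next                                  # Eq. 12
        q = q + o                                       # query update
    return q, p

rng = np.random.default_rng(1)
n, d, K = 5, 6, 2
node_emb = [rng.normal(size=(n, d)) for _ in range(K + 1)]
adj = np.eye(n) + (rng.random((n, n)) > 0.5)
V = [rng.normal(size=2 * d) for _ in range(K + 1)]
q_final, p_last = multi_hop_read(node_emb, adj, rng.normal(size=d), V, K)
```

The last-hop attention weights `p_last` play the role of the graph distribution that the decoder later uses for copying entities.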

Graph Construction
In practice, dialogue systems usually use KBs (mostly in a relational database format) to provide external knowledge. We convert the original relational database into a graph structure to exploit the relational information between KB entities. First, we take all the entities in the relational database as the nodes of the graph. Then we assign an edge to a pair of entities if there exists a relationship between them according to the records in the relational database. In this way we obtain the graph-structured external knowledge.
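The conversion can be sketched with Python's built-in sqlite3 module. The toy `restaurant(name, cuisine, area)` schema and its column names are our own illustration, not the datasets' actual schemas; the point is simply that every attribute value becomes a node and every record contributes column-labeled edges.

```python
import sqlite3

# Toy relational KB (schema and rows are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE restaurant (name TEXT, cuisine TEXT, area TEXT)")
con.executemany("INSERT INTO restaurant VALUES (?,?,?)",
                [("golden_house", "chinese", "centre"),
                 ("pizza_hut", "italian", "centre")])

nodes, edges = set(), set()
for name, cuisine, area in con.execute("SELECT * FROM restaurant"):
    # Every entity in a record becomes a node; the subject entity is
    # linked to each attribute value, labeled with the column name.
    nodes.update([name, cuisine, area])
    edges.add((name, "cuisine", cuisine))
    edges.add((name, "area", area))
```

Note that shared attribute values (here, "centre") become shared nodes, which is exactly the relational information a flat memory list cannot express.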

Decoder
We use a standard Gated Recurrent Unit (GRU) (Cho et al., 2014) as the decoder to generate the system response word by word. The initial hidden state h_0 consists of two parts: the graph encoder output and the knowledge graph output. We take the output hidden state of the graph encoder h_n^e as the initial query vector q^0 to attend to the knowledge graph and obtain the output o^K. The initial hidden state h_0 is then computed using:

$h_0 = W_1 [\,h_n^e ; o^K\,]$ (13)

where W_1 is a learnable parameter and [;] denotes concatenation. At each decoder timestep t, the GRU takes the previously generated word y_{t−1} and the previous hidden state h_{t−1} as inputs and generates a new hidden state h_t using:

$h_t = \mathrm{GRU}\big(\mathrm{emb}(y_{t-1}), h_{t-1}\big)$ (14)

Next, following (Wu et al., 2019b), the decoder learns to generate a sketch response in which the entities are replaced with certain tags. The tags are obtained from the ontologies provided in the training data. The hidden state h_t is used for two purposes. The first is to generate a vocabulary distribution P_vocab over all the words in the vocabulary using:

$P_{vocab} = \mathrm{softmax}(W_o h_t)$ (15)

where W_o is a learnable parameter. The second is to query the knowledge graph to generate a graph distribution P_graph over all the nodes in the graph. We use the attention weights at the last hop of the knowledge graph, p_t^K, as P_graph. At each timestep t, if the generated word from P_vocab (the word with the maximum posterior probability) is a tag, then the decoder copies the graph entity with the largest attention value according to P_graph. Otherwise, the decoder generates the target word from P_vocab. During training, all the parameters are jointly learned by minimizing the sum of two cross-entropy losses: one between P_vocab and y_t ∈ Y, and the other between P_graph and G_t^Label, where G_t^Label is the node id that corresponds to the current output y_t.
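The copy-versus-generate decision at each decoding step can be sketched as below. The `@`-prefixed sketch tag, the vocabulary, and the entity list are hypothetical illustrations; the real model derives its tags from the training ontologies.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(h_t, Wo, p_graph, vocab, graph_entities, tag_prefix="@"):
    """One decoding step: emit a vocabulary word, or copy the
    highest-attention KB entity when the sketch token is a tag."""
    p_vocab = softmax(Wo @ h_t)              # distribution over the vocab
    word = vocab[int(np.argmax(p_vocab))]    # sketch token
    if word.startswith(tag_prefix):          # tag -> copy from the graph
        word = graph_entities[int(np.argmax(p_graph))]
    return word

rng = np.random.default_rng(2)
vocab = ["the", "nearest", "is", "@poi", "away"]      # illustrative
entities = ["palo_alto_garage", "1_miles", "home"]    # illustrative
Wo = rng.normal(size=(len(vocab), 4))
word = decode_step(rng.normal(size=4), Wo,
                   p_graph=np.array([0.7, 0.2, 0.1]),
                   vocab=vocab, graph_entities=entities)
```

Here `p_graph` stands in for the last-hop attention weights of the knowledge graph module; whenever the sketch token is a tag, the emitted word is the entity with the largest attention value.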

Experiments

Dataset
To validate the efficacy of our proposed model, we evaluate it on two public multi-turn task-oriented dialogue datasets: Stanford multi-domain dialogue (SMD) (Eric et al., 2017) and MultiWOZ 2.1 (Eric et al., 2019). SMD is a human-human dataset for the in-car navigation task. It includes three distinct task domains: point-of-interest navigation, calendar scheduling, and weather information retrieval. The MultiWOZ 2.1 dataset is a recently released human-human dialogue corpus with much larger data size and richer linguistic expressions, which make it a more challenging benchmark for end-to-end task-oriented dialogue modeling. It consists of seven distinct task domains: restaurant, hotel, attraction, train, hospital, taxi, and police. We select four domains (restaurant, hotel, attraction, train) to test our model, since the other three domains (police, taxi, hospital) lack the KB information essential to our task. We will make our code and data publicly available for further study. To the best of our knowledge, we are the first to evaluate end-to-end task-oriented dialogue models on MultiWOZ 2.1. The train/validation/test splits of these two datasets are provided in advance by the dataset providers.

Training Details
We implement our model in TensorFlow and train it on an NVIDIA GeForce RTX 2080 Ti. We use grid search to find the best hyper-parameters for our model on the validation set (using BLEU as the criterion for both datasets). We randomly initialize all the embeddings in our implementation. The embedding size is selected between 16 and 512, and is also the RNN hidden state size (for both the encoder and the decoder). We also use dropout for regularization on both the encoder and the decoder to avoid over-fitting, with the dropout rate set between 0.1 and 0.5. We use the Adam optimizer (Kingma and Ba, 2015) to accelerate convergence, with a learning rate chosen between 1e-3 and 1e-4. We simply use a greedy strategy to search for the target word in the decoder, without advanced techniques such as beam search.

Effect of Models
We compare our model with several existing models: standard sequence-to-sequence (Seq2Seq) models with and without attention (Luong et al., 2015), pointer to unknown (Ptr-Unk) (Gulcehre et al., 2016), GraphLSTM (Peng et al., 2017), BERT (Devlin et al., 2019), Mem2Seq (Madotto et al., 2018), and GLMP (Wu et al., 2019b). Note that the results listed in Table 2 for GLMP differ from the original paper, since we re-implement their model in TensorFlow according to their released PyTorch code for a fair comparison.
Stanford Multi-domain Dialogue. Table 2 shows the results on the SMD dataset. Our proposed model achieves a consistent improvement over all the baselines, with the highest BLEU score of 13.6 and an entity F1 score of 57.4%. The gain in BLEU score suggests that the generation error in the decoder has been reduced. The improvement in entity F1 indicates that our model retrieves entities from the external knowledge data more accurately than those baselines. We also conduct comparisons with BERT to validate the effectiveness of our proposed model. Specifically, we use the bert-base-uncased model (due to the GPU memory limit) from the huggingface library (https://github.com/huggingface) as our encoder to encode the dialogue history, and the remaining parts are the same as in our model. We then fine-tune BERT on our dialogue dataset. We find that our model significantly outperforms the fine-tuned BERT by a large margin, which further demonstrates the effectiveness of our proposed model. We conjecture that the reasons lie in two aspects. First, the context of the corpus used for pretraining BERT differs from our dialogue dataset. Second, the model complexity of BERT may cause overfitting on small-scale datasets such as SMD.

MultiWOZ 2.1. Table 3 shows the results on the more complex MultiWOZ 2.1 dataset. Our model outperforms all the other baselines by a large margin in both entity F1 and BLEU score, which confirms that our model has better generalization ability than those baselines. One may notice the large gap in entity F1 and BLEU score between MultiWOZ 2.1 and SMD. This performance degradation phenomenon has also been observed in other dialogue works (Budzianowski et al., 2018), which implies that the MultiWOZ corpus is much more challenging than the SMD dataset for dialogue tasks.
Ablation Study. Table 4 shows the contribution of each component in our model. "Ours without graph encoder" means that we do not use the dependency relation information or the proposed recurrent cell architecture; we simply use a bi-directional GRU as the encoder, and the other parts of the model remain unchanged. We observe that our model without the graph encoder loses 1.6% in absolute BLEU score (over 25% relatively) and 1.1% in absolute entity F1 (9.8% relatively) on MultiWOZ 2.1, which suggests that the overall quality of the generated sentences is improved by our graph encoder. On the other hand, "ours without knowledge graph" means that we do not use the graph structure to store and retrieve the external knowledge data; instead, we use memory networks (Sukhbaatar et al., 2015), which have been shown to be useful for handling the knowledge base, similar to (Wu et al., 2019b). We find a significant entity F1 drop (3.8% in absolute value and 33.9% relatively) on MultiWOZ 2.1, which verifies the superiority of the proposed graph-based module with multi-hop reasoning ability in retrieving the correct entities, even compared to strong memory-based baselines.
Model Training Time. We also compare the training time of GraphDialog with those baselines. GraphDialog is about 3 times faster than BERT, since its model complexity is smaller. The number of parameters of GraphDialog is almost 90% less than that of BERT, which also saves space for model storage. GraphDialog is slower than GLMP, which is expected, as it needs to encode more information. However, the gap in training time is at most 69%, and we can complete the whole training process within one day, which is reasonable.

Analysis and Discussion
Why do dependency relations help? We conduct in-depth analyses from the edge path distance perspective. Table 5 shows the edge path distance distribution in the dialogue graph (Section 3.1.1) on both SMD and MultiWOZ 2.1. The edge path distance is defined as the number of words between the head word and the tail word along the linear word sequence, plus one. For example, for the sentence "There is a supermarket", the edge path distance of the "Next" edge between "There" and "is" is 1, and the edge path distance of the "nsubj" edge between "is" and "supermarket" is 2. We find that although many edges have small edge path distances, there are still a considerable number of edges with relatively large distances, which can encourage more direct information flow between distant words in the input. This may partly explain the benefits of using information such as dependency relations when encoding the dialogue history.
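Under the definition above, the edge path distance between positions i and j reduces to |i - j| (the words strictly between them, plus one), which makes the distribution in Table 5 easy to recompute. A small sketch, with the edge list for the example sentence filled in by us:

```python
from collections import Counter

def edge_path_distance(i, j):
    """Words strictly between positions i and j in the linear word
    order, plus one -- i.e., simply abs(i - j)."""
    return abs(i - j)

# "There is a supermarket": sequential Next edges plus the
# dependency edge between "is" (1) and "supermarket" (3).
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
dist = Counter(edge_path_distance(i, j) for i, j in edges)
```

Running this over all dialogue graph edges of a corpus yields exactly the kind of distance histogram reported in Table 5, where the long-distance dependency edges are the ones that shortcut information flow between distant words.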
Attention Visualization. To further understand the model dynamics, we analyze the attention weights of the knowledge graph module to show its reasoning process. Figure 5 shows an example of the attention distribution over all the nodes at the last hop of the knowledge graph. Given the user question "Where is a nearby parking garage?", the generated response of our model is "palo alto garage is 1 miles away", while the gold answer is "The nearest one is palo alto garage, it's just 1 miles away". We find that our model has successfully learned to copy the correct entities (i.e., palo alto garage at timestep 0 and 1 miles at timestep 2) from the knowledge graph.
Error Analysis. To inspire future improvements, we also inspect the generated responses manually. We find that the model tends to omit entities when the responses contain multiple KB entities. Besides, about 10% of the generated responses contain duplicate KB entities, for example, "The temperature in New York on Monday is 100F, 100F". This may be attributed to the training of the GRU in the decoder, and we aim to solve this problem in future work.

Conclusion
In this work, we present a novel graph-based end-to-end model for task-oriented dialogue systems. The model leverages the graph structural information in the dialogue history via the proposed recurrent cell architecture to capture the semantics of the dialogue history. The model further exploits the relationships between entities in the KB to achieve better reasoning ability by combining multi-hop reasoning with the graph structure.
We empirically show that our model outperforms the state-of-the-art models on two real-world task-oriented dialogue datasets. Our model may also be applied to end-to-end open-domain chatbots, since their goal is likewise to generate responses given inputs and external knowledge. We will explore this direction in future work.

Figure 2 :
Figure 2: Overview of the proposed architecture. (a) Graph Encoder: top is the forward graph and bottom is the backward graph. (b) Decoder and Knowledge Graph with the multi-hop reasoning mechanism. (c) Self-Attention Mechanism.

Figure 3 :
Figure 3: An example of dialogue graph.

Figure 4 :
Figure 4: Overview of the proposed recurrent unit.
We thus obtain a forward representation $\overrightarrow{h}_n$ and a backward representation $\overleftarrow{h}_n$, and concatenate them to form the final representation of the dialogue history, $h_n^e = [\overrightarrow{h}_n ; \overleftarrow{h}_n]$, which becomes a part of the inputs to the decoder.

Figure 5 :
Figure 5: Knowledge graph attention visualization when generating responses in the SMD navigation domain. Given the question "Where is a nearby parking garage?", the generated response of our model is "palo alto garage is 1 miles away". The attention results at each generation timestep for the knowledge graph of this example are shown in (a), (b), (c), and (d), respectively. The color and size of the nodes represent their attention weights: the darker and bigger a node is, the larger its attention weight. Our model successfully learns to attend to the correct KB entities (i.e., palo alto garage and 1 miles at generation timesteps 0 and 2), which have the highest attention, and copies them to serve as the output words. At timesteps 1 and 3, the model generates output words (i.e., is and away) from the vocabulary.

Table 2 :
Evaluation on the SMD dataset. Human, rule-based, and KV Retrieval Net results are reported from (Eric et al., 2017) and are not directly comparable, since their problem is simplified to canonicalized forms. K denotes the maximum number of hops for the knowledge graph. Ours achieves the highest BLEU and entity F1 scores over the baselines.

Table 3 :
Evaluation on the MultiWOZ 2.1 dataset. Ours achieves the highest BLEU and entity F1 scores over the baselines.

Table 4 :
Model ablation study: effects of the Graph Encoder and the Knowledge Graph. The number in parentheses is the absolute gap between the full version and the ablated one on the corresponding metric.

Table 5 :
Edge path distance distribution on different datasets.