Graph-Based Knowledge Integration for Question Answering over Dialogue

Question answering over dialogue, a specialized machine reading comprehension task, aims to comprehend a dialogue and answer questions about it. Despite many advances, existing approaches to this task do not consider dialogue structure and background knowledge (e.g., relationships between speakers). In this paper, we introduce a new approach to the task, featuring a novel way of structuring a dialogue and integrating background knowledge for reasoning. Specifically, unlike previous "structure-less" approaches, our method organizes a dialogue as a "relational graph", using edges to represent relationships between entities. To encode this relational graph, we devise a relational graph convolutional network (R-GCN), which can traverse the graph's topological structure and effectively encode multi-relational knowledge for reasoning. Extensive experiments justify the effectiveness of our approach over competitive baselines. Moreover, a deeper analysis shows that our model is better at tackling complex questions requiring relational reasoning and at defending against adversarial attacks with distracting sentences.


Introduction
Humans obtain information by engaging in conversations. Question answering (QA) over dialogue, a specialized machine reading comprehension (MRC) task (Hermann et al., 2015), tests the ability of a QA system to comprehend a dialogue by asking it to answer questions about that dialogue. Consider the example shown in Table 1. Given a dialogue and a related question Q1: "What is Joey going to do with Kathy tonight?", the task requires a system to give the correct answer "having a late dinner".

U1 Chandler: Hey-Hey-Hey! Who was that?
U2 Joey: That would be Casey. We're going out tonight.
U3 Chandler: Goin' out, huh? Wow! Wow! So things didn't work out with Kathy, huh? Bummer.
U4 Joey: [I'm having a late dinner with her.]
Q1 What is Joey going to do with Kathy tonight? A1 having a late dinner
Q2 When will Joey have dinner with Casey? A2 tonight

Table 1: Up: a dialogue from the FriendsQA corpus (Yang and Choi, 2019). Down: two related questions with their answers. An evidence sentence for inferring A1 is given in [brackets].
Compared with other MRC tasks, QA over dialogue is more challenging (Yang and Choi, 2019) because conversations often involve complex relationships and background knowledge. In detail, studies show that a dialogue with 12 turns contains 6.1 co-reference chains (Zhou and Choi, 2018) and expresses 4.5 relationships (Yu et al., 2020) on average. Therefore, to excel at this task, a QA system must master background knowledge for reasoning. Consider the reasoning process for Q1 in the above example. To find the correct answer, a QA system should not only locate the evidence sentence "I'm having a late dinner with her" (in U4) but also master the co-reference knowledge that "I" refers to Joey (the speaker) and "her" refers to Kathy. However, how to effectively integrate such background knowledge into this task remains an open question. Existing approaches (Yang and Choi, 2019; Li and Choi, 2020) do not consider background knowledge and only learn reasoning patterns from plain text. This exposes them to the risk of achieving sub-optimal results and being vulnerable to adversarial attacks: models that only learn shallow reasoning patterns are fragile when facing adversarial examples created by adding distracting sentences (Jia and Liang, 2017).
In this paper, we propose a new approach for QA over dialogue, featuring a novel way of structuring a dialogue and integrating background knowledge, specifically co-reference and relation knowledge, for reasoning. Different from previous "structure-less" methods, our approach structures the dialogue as a "relational graph", where nodes correspond to words in the contexts and edges designate their relationships. The graph uses different types of edges to indicate different types of relations and is thus a heterogeneous graph. To encode this graph, we devise a model based on relational graph convolutional networks (R-GCN) (Schlichtkrull et al., 2018), which learns reasoning patterns while considering the topology of the graph. We show that in this way, background knowledge is effectively incorporated to guide the reasoning process of question answering.
To confirm the effectiveness of our method, we conduct extensive experiments on the benchmark dataset FriendsQA (Yang and Choi, 2019). Experimental results demonstrate that our approach achieves superior performance over competitive baselines. Moreover, a deeper analysis reveals that, by integrating background knowledge, our approach is better than the baselines at 1) tackling complex questions requiring relational reasoning, and 2) defending against adversarial attacks with distracting sentences. We have released our code at https://github.com/jianliu-ml/dialogMRC to encourage more studies along this line of research.
To sum up, we make the following contributions:
• We propose a new approach for QA over dialogue, featuring a novel way of structuring a dialogue and integrating background knowledge for reasoning.
• We consider both co-reference and relation knowledge for the task, and devise an R-GCN to exploit the multi-relational characteristics of the heterogeneous relational graph representing a dialogue. To the best of our knowledge, this is the first work introducing R-GCN to QA over dialogue.
• We establish new state-of-the-art performance on the benchmark dataset. Moreover, the results of robustness testing suggest that our method is robust against adversarial examples.

Related Work
QA over Dialogue. QA over dialogue is a specialized MRC task (Hermann et al., 2015), which requires a system to answer questions regarding a dialogue. Many recent studies have benchmarked and advanced this task. To name a few, Reddy et al. (2019) introduce the CoQA corpus, which measures MRC over one-to-one conversations. Ma et al. (2018) introduce a corpus based on transcripts of the TV show Friends and focus on questions whose answers are PERSON entities. Sun et al. (2019) propose DREAM, which focuses on multiple-choice question answering over multi-turn dialogues. Yang and Choi (2019) extend the work of Ma et al. (2018) and propose FriendsQA, a dataset annotated with open-domain questions and answers. QA over dialogue is recognized as more challenging than general MRC tasks. In our study, we choose FriendsQA as the testbed, considering its diversity in question types. Moreover, the extractive QA style is more suitable than the multiple-choice style for building practical QA applications. On this benchmark, the best reported method (Li and Choi, 2020) combines a pre-trained language model (Devlin et al., 2019) with an utterance-level pre-training strategy.
Knowledge Incorporation for MRC. Integrating background knowledge to enhance machine reading is a longstanding goal of artificial intelligence. In the task of MRC, previous studies (Yang and Mitchell, 2017; Mihaylov and Frank, 2018; Weissenborn, 2017; Bauer et al., 2018; Qiu et al., 2019) have exploited external knowledge. However, such work may not directly apply to QA over dialogue, whose contexts are dynamic. It is also worth noting that Qiu et al. (2019) adopt a graph structure to model external knowledge, but in their work relations are not discerned: all edges carry a general "related to" relation. By contrast, our approach uses a heterogeneous graph to incorporate different types of knowledge.
Graph Representation Learning. Graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2017; Schlichtkrull et al., 2017) provide an effective way to model graph-structured data and have shown promising results in many NLP problems (Vashishth et al., 2019; Gui et al., 2019; Qiu et al., 2019). Among GNNs, relational graph convolutional networks (R-GCNs) (Schlichtkrull et al., 2017) are a variant of graph convolutional networks (GCNs) designed for modeling multi-relational data. To the best of our knowledge, this is the first work introducing R-GCNs to model co-reference and relation knowledge for QA over dialogue.
Approach

[Figure 1: The overview of our approach, which structures the dialogue as a relational graph and integrates co-reference and relation knowledge for reasoning.]

Figure 1 schematically visualizes our approach, which involves three major steps:
• Joint dialogue-question representation. In this step, the dialogue and question are jointly encoded to build their representations, and the dialogue representations are taken as the initial node representations of the relational graph.
• Graph-based knowledge integration, where the dialogue is organized as a relational graph and an R-GCN is proposed to integrate co-reference and relation knowledge for reasoning about the answer.
• Answer span prediction. This module reasons over the knowledge-enhanced representation and generates a text span as the answer to the question.
In the following, let $D = \{U_1, U_2, \ldots, U_N\}$ be a dialogue with $N$ utterances. Each utterance $U_i$ is associated with a speaker $s_i$. The text of $U_i$ is denoted as $\{w_{i1}, w_{i2}, \ldots, w_{im}\}$, where $w_{ij}$ is the $j$th token in $U_i$ and $m$ is the length of $U_i$. Let a question be $Q = \{q_1, q_2, \ldots, q_L\}$, where $L$ is the length of $Q$. Given $D$ and $Q$, QA over dialogue requires a model to predict an answer $a$. Note that $a$ is restricted to be a contiguous span in $D$ (Yang and Choi, 2019).

Joint Dialogue and Question Representation
We first encode $D$ and $Q$ into continuous representations to learn their joint representation. We adopt the BERT-based QA architecture (Devlin et al., 2019) considering its effectiveness. Specifically, given $D$ and $Q$, we first construct an input sequence concatenating them, $X = \texttt{[CLS]}\; Q \;\texttt{[SEP]}\; D \;\texttt{[SEP]}$, tokenize it with byte pair encoding (Sennrich et al., 2016), and adopt BERT to encode the sequence. We take the last hidden layer of BERT as the joint representation of $D$ and $Q$, denoted as $H \in \mathbb{R}^{T \times d}$, where $T$ is the length of the extended input sequence (in terms of sub-word pieces) and $d$ is the hidden dimension of BERT. $H$ can be divided into $H^D$ and $H^Q$, indicating dialogue-specific and question-specific representations. $H^D$ is used to initialize the node representations in the relational graph.
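As a concrete illustration of this encoding step, consider the following minimal sketch using the HuggingFace transformers library (the checkpoint name and truncation setting are assumptions, not details from our released code):

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Hypothetical setup; the paper uses the BERT-base architecture (12 layers,
# 768 hidden units), but the exact checkpoint is an assumption.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

question = "What is Joey going to do with Kathy tonight?"
dialogue = "Chandler: Hey-Hey-Hey! Who was that? Joey: That would be Casey."

# Encode "[CLS] Q [SEP] D [SEP]"; the last hidden layer is H, shape (1, T, d).
inputs = tokenizer(question, dialogue, return_tensors="pt", truncation=True)
with torch.no_grad():
    H = model(**inputs).last_hidden_state

# token_type_ids mark the question segment with 0 and the dialogue with 1,
# which lets us split H into H_Q and H_D.
H_D = H[inputs["token_type_ids"].bool()]  # initial node representations
```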

Graph-Based Knowledge Integration
Graph-based knowledge integration involves relational graph construction, knowledge integration via graph convolution, and representation fusion.
Relational Graph Construction. We first organize the dialogue contexts as a "relational graph", where nodes correspond to words in $D$ and edges reflect their relationships. We consider two types of relationships: 1) co-reference knowledge (Chen et al., 2017), which designates expressions referring to the same entity, and 2) relation knowledge (Yu et al., 2020), which reflects semantic relations between two entities (we refer to § 4.1 for how we obtain such knowledge). A heterogeneous graph is proposed to model the knowledge, which uses different types of edges to indicate different types of knowledge. We also add self-loop edges to the graph to facilitate effective computation (Schlichtkrull et al., 2017). Thus, the total number of edge types is $1 + 1 + N_r$ (self-loop + co-reference + the number of semantic relation types).

Knowledge Integration via Graph Convolution. We next encode the relational graph via a relational graph convolutional network (R-GCN) to allow knowledge integration. In R-GCN, the representation of a node is computed by gathering information from its neighbor nodes, using the following rule:

$$h_i^{(l+1)} = \sigma \Big( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \Big)$$

where $\mathcal{R}$ is the set of edge types; $\mathcal{N}_r(i)$ is the neighbor set of node $i$ with respect to relation $r$; $h_j^{(l)}$ is the representation of node $j$ at the $l$th layer; $W_r^{(l)}$ is the parameter matrix associated with relation $r$ at the $l$th layer; $c_{i,r}$ is a normalization term equal to the size of $\mathcal{N}_r(i)$; and $\sigma$ is the sigmoid function. In this way, the representation of a node is encoded by considering all nodes that have relationships with it, while different types of relations are distinguished. We use $H^D$ as the initial representations of the nodes in the relational graph, and the overall graph is updated $k$ times (where $k$ is a hyper-parameter tuned on the development set) to allow long-range dependency. The obtained representations are denoted by $H^G$.
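To make the propagation rule concrete, here is a minimal PyTorch sketch of one R-GCN layer (a simplified stand-in for the DGL-based implementation; the edge-type numbering is only a convention for this example):

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """One R-GCN layer: h_i = sigma(sum_r sum_{j in N_r(i)} W_r h_j / c_{i,r}).
    Assumed edge-type convention: 0 = self-loop, 1 = co-reference,
    2..num_rels-1 = DialogRE semantic relations."""

    def __init__(self, dim: int, num_rels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_rels, dim, dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: (src, dst, rel) triples, self-loops included.
        out = torch.zeros_like(h)
        c = torch.zeros(h.size(0), self.weight.size(0))  # c_{i,r} counts
        for _, dst, rel in edges:
            c[dst, rel] += 1.0
        for src, dst, rel in edges:
            out[dst] = out[dst] + (self.weight[rel] @ h[src]) / c[dst, rel]
        return torch.sigmoid(out)  # the paper uses a sigmoid activation
```

Stacking $k$ such layers lets information propagate along multi-hop relational paths, which is how the model captures long-range dependency.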
Representation Fusion. In practice, we found that combining $H^G$ with the original representation $H^D$ yields better performance, perhaps because $H^G$ may not preserve the position information of the dialogue contexts well. The final fused representation is computed as:

$$H^{enh} = \alpha \cdot H^D \oplus (1 - \alpha) \cdot H^G$$

where $\oplus$ is element-wise addition and $\alpha$ is a balance factor tuned on the development set. Finally, $H^{enh}$ is used as the final dialogue representation to infer the answer.
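In code, the fusion is a single weighted element-wise addition (a minimal sketch; the convex-combination form is one reading of the balance factor described above):

```python
import torch

def fuse(h_d: torch.Tensor, h_g: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend the BERT representation H_D with the graph-enhanced H_G.
    alpha = 0.5 is the tuned balance factor; the exact weighting scheme is
    reconstructed from the text, not verified against released code."""
    return alpha * h_d + (1.0 - alpha) * h_g
```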

Answer Span Prediction
To generate the answer, we compute two probability vectors containing the starting and ending positions of the answer, taking $H^{enh}$ as the input:

$$p^{start} = \mathrm{softmax}(H^{enh} w_{start}), \qquad p^{end} = \mathrm{softmax}(H^{enh} w_{end})$$

where $w_{start}$ and $w_{end}$ are learnable parameters. We rank all legal spans (i.e., spans whose starting position does not come after the ending position) by the sum of their starting and ending probabilities and select the one with the highest value as the answer span. We map BPE positions back to original token positions to generate the answer.
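The span-selection step can be sketched as follows (illustrative only; the maximum answer length cap is an assumption, not from the paper):

```python
import torch

def best_span(p_start: torch.Tensor, p_end: torch.Tensor, max_len: int = 30):
    """Return the legal span (start <= end) with the highest summed
    start/end probability. p_start, p_end: (T,) probability vectors."""
    best, best_score = (0, 0), float("-inf")
    T = p_start.size(0)
    for s in range(T):
        for e in range(s, min(s + max_len, T)):
            score = (p_start[s] + p_end[e]).item()
            if score > best_score:
                best_score, best = score, (s, e)
    return best  # BPE positions, mapped back to original tokens afterwards
```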

Training and Optimization Strategy
We adopt a cross-entropy loss to train our model. Specifically, the training loss for a (dialogue, question, answer) triple $(D, Q, a)$ is:

$$\mathcal{L}(D, Q, a) = -\log p^{start}_{a_s} - \log p^{end}_{a_e}$$

where $a_s$ and $a_e$ indicate the starting and ending positions of the ground-truth answer $a$. The overall training loss sums the cross-entropy losses of all instances in the training set. We adopt Adam (Kingma and Ba, 2014) to optimize our model and a linear decay strategy to smooth the training.
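Equivalently, keeping the start/end scores as logits, this is the standard extractive-QA objective (a minimal sketch):

```python
import torch.nn.functional as F

def span_loss(start_logits, end_logits, a_s, a_e):
    """Cross-entropy over starting and ending positions for a batch.
    start_logits, end_logits: (batch, T); a_s, a_e: (batch,) gold indices."""
    return F.cross_entropy(start_logits, a_s) + F.cross_entropy(end_logits, a_e)
```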

Datasets and Evaluations
Our experiments are conducted on the benchmark dataset FriendsQA (Yang and Choi, 2019), which annotates QA pairs on transcripts of the TV show Friends. We split the dataset into training/development/test sets following the setting of Li and Choi (2020). The co-reference knowledge is obtained by aligning FriendsQA with the Character Identification project (Zhou and Choi, 2018); the relation knowledge is obtained by aligning FriendsQA with DialogRE (Yu et al., 2020), which defines 36 different semantic relations (e.g., per:father, per:date) between entities. The data statistics are shown in Table 2.

For evaluation, we adopt three metrics: utterance matching (UM), which evaluates whether the predicted answer matches the utterance containing the ground truth; span matching (SM), which treats an answer as a bag of tokens and conducts set-level token matching; and exact matching (EM), which measures whether a prediction matches the ground-truth answer exactly. We also adopt three training strategies, the shortest-answer strategy, the longest-answer strategy, and the multiple-answer strategy, following Yang and Choi (2019) and Li and Choi (2020), to evaluate our method.
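To make the metrics concrete, here is a simplified sketch of EM and SM (reading "set-level token matching" as token-overlap F1; the official scorer may normalize differently):

```python
def exact_match(pred: str, gold: str) -> bool:
    """EM: the prediction must equal the gold answer exactly
    (whitespace/case normalization here is an assumption)."""
    return " ".join(pred.lower().split()) == " ".join(gold.lower().split())

def span_match(pred: str, gold: str) -> float:
    """SM: treat answers as bags of tokens and score their overlap."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```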

Implementation Details
In our implementation, we choose the BERT-base architecture, with 12 layers and 768 hidden units, the same as previous methods to ensure comparability (Yang and Choi, 2019; Li and Choi, 2020). The dimension of R-GCN is set to 60, chosen from 50 to 100. The number of R-GCN layers, $k$, is set to 3, chosen from 1 to 5. The learning rate is set to $1.0 \times 10^{-5}$. The balance factor $\alpha$ is set to 0.5, chosen from 0.1 to 0.9. We use the Deep Graph Library (DGL) to build the graph and implement the graph model.

[Table 3: Results on the test set of FriendsQA. SA Strategy, LA Strategy, and MA Strategy are three training strategies using the shortest answer, the longest answer, and all of the answers for training. The best results are denoted in bold. † denotes results taken directly from the original papers.]
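For reference, the relational graph can be assembled in DGL roughly as follows (a toy sketch with made-up edge lists; API details such as the RelGraphConv arguments should be checked against the DGL version in use):

```python
import dgl
import torch

# Toy graph with 5 word nodes. Edge-type ids follow an assumed convention:
# 0 = self-loop, 1 = co-reference, 2+ = DialogRE semantic relations.
src = torch.tensor([0, 1, 2, 3, 4, 0, 2, 1, 3])
dst = torch.tensor([0, 1, 2, 3, 4, 2, 0, 3, 1])
etype = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2, 2])  # typed, bidirectional edges

g = dgl.graph((src, dst), num_nodes=5)
g.edata["etype"] = etype

# One relational convolution; random features stand in for H_D here.
conv = dgl.nn.RelGraphConv(in_feat=768, out_feat=60, num_rels=3)
h = torch.randn(5, 768)
h_out = conv(g, h, g.edata["etype"])  # shape: (5, 60)
```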

Baseline Models
We compare our model with the following baselines: BERT, the standard BERT MRC model; BERT$_{pre}$, a model that uses dialogue contexts to pre-train BERT (Li and Choi, 2020), which corresponds to the best reported method (denoted as SoTA); and R-Net (Wang et al., 2017), an earlier SoTA model that achieved 1st place on the SQuAD leaderboard and builds representations for questions and evidence passages via a self-matching mechanism. Our model is denoted as + Graph (e.g., BERT$_{pre}$ + Graph indicates using BERT$_{pre}$ as the basic encoding model).

Experimental Results
Table 3 compares our approach with the baseline models. Here we adopt gold co-reference and relation knowledge to build the relational graph (the results of using system-predicted knowledge are shown in § 7.1). From the results, our approach consistently outperforms models that do not leverage background knowledge, on every evaluation metric. Moreover, our approach also outperforms the previous best reported system (Li and Choi, 2020), especially in SM evaluation, setting up a new state of the art. Among the training strategies, the multiple-answer strategy yields better results, as expected, since it can leverage more data for training than the other strategies. It is also worth noting that pre-training on dialogue datasets improves performance.

Discussion
We further investigate the performance of our model on different types of questions and conduct robustness testing to understand why our approach is effective.

Results on Different Question Types
We compare BERT$_{pre}$ and our model BERT$_{pre}$ + Graph on different types of questions. The results are shown in Table 4. Among the factoid questions (Who, Where, When, and What), our approach especially excels at answering Who and What questions, outperforming BERT$_{pre}$ by a considerable margin. The reason may be that Who and What questions are more relation-related, which makes them difficult for BERT$_{pre}$ to reason about over plain text. Moreover, our method demonstrates very promising results in answering Why questions (+6.5% over BERT$_{pre}$). This implies that answering Why questions requires background knowledge, which structure-less methods such as BERT$_{pre}$ find difficult to master.

Results of Robustness Testing
We conduct robustness testing on our model and BERT$_{pre}$. Following Jia and Liang (2017), we add distracting sentences to the dialogue that are similar to the What and Who questions. For example, given a Who question "Who is the girlfriend of Joey?", we construct a sentence "X is the girlfriend of Joey.", where X is a randomly selected speaker, and randomly insert this distracting sentence into the dialogue. Distracting sentences are known to confuse models that rely on shallow patterns for reasoning. Results are shown in Table 5. The performance of BERT$_{pre}$ drops severely. By contrast, our approach demonstrates robustness in such adversarial testing scenarios. The reason is that our approach uses background knowledge for reasoning: with such knowledge, it tends not to select answers from the added distracting sentences, because these adversarial sentences have no relationships with the other parts of the dialogue.
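The distractor construction can be sketched as follows (a simplified illustration of the procedure described above; the exact templates are an assumption):

```python
import random

def make_distractor(question: str, speakers: list[str]) -> str:
    """Turn a Who question such as 'Who is the girlfriend of Joey?' into a
    distracting statement, substituting a randomly selected speaker X.
    A simplified sketch; real templates may be more varied."""
    body = question.removeprefix("Who is ").rstrip("?")
    return f"{random.choice(speakers)} is {body}."

def insert_distractor(utterances: list[str], distractor: str) -> list[str]:
    """Randomly insert the distracting sentence into the dialogue."""
    out = list(utterances)
    out.insert(random.randint(0, len(out)), distractor)
    return out
```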

Impact of Different Types of Knowledge
We compare the impact of different types of knowledge on the results, and we also investigate using system-predicted results rather than gold knowledge (we train a co-reference identifier and a relation classifier following Zhou and Choi (2018) and Yu et al. (2020), achieving 74.4% F1 (B³) and 57.0% F1 respectively, matching the state of the art). The results are given in Table 6. From the results, relation knowledge is more effective than co-reference knowledge for this task, and their combination leads to the highest result. We note that using system-predicted knowledge for reasoning leads to a drop in performance, but it still outperforms the SoTA (Li and Choi, 2020).

[Table 6: Results on the test set of FriendsQA compared with existing models, using the predicted relation/co-reference chains. GOLD and PREDICTED denote using the ground-truth annotations or system-predicted results to construct the relational graph for reasoning.]

Comparison with Different Graph Models. We compare R-GCN with GCN (Kipf and Welling, 2016) and GAT (Veličković et al., 2017). We also compare our model with a system that simply concatenates the relational knowledge, in a triple format of (subject, predicate, object), at the end of the dialogue for reasoning. Results are shown in Table 7. From the results, our model, i.e., BERT$_{pre}$ + Graph (R-GCN), achieves the best performance. The reason might be that GCN and GAT use a homogeneous graph to encode knowledge, where edges carry no type information, so they cannot discern different types of knowledge, whereas R-GCN uses a heterogeneous graph with different types of edges indicating different types of relationships. Also note that simply concatenating knowledge hurts performance.
Impact of Graph Depth. Figure 3 shows the impact of graph depth for R-GCN, GAT, and GCN on the development set. For GCN and GAT, more graph layers lead to better performance, while R-GCN is more effective: even with only one layer it yields good performance.

Case Study
We study several cases by comparing the predictions of BERT$_{pre}$ and our approach. The results are shown in Table 8. (1) In the first example, Rach and Rachel Green have a per:alternate_names relation, and our model can integrate this knowledge for reasoning, correctly predicting the answer to "What does Monica call Rachel for short?". By contrast, BERT$_{pre}$ lacks this knowledge and wrongly outputs "my brother" as the answer. (2) In the second example, the question is "When does Doug come?", but the dialogue only contains "his boss approaches" in the scene utterance (a dialogue in FriendsQA can contain a special "scene utterance" describing the background of a scene). To solve this problem, the model should figure out that the relation between Doug and Chandler is per:boss. Our method can reason over such knowledge and find the correct answer. (3) In the third example, note that BERT$_{pre}$ incorrectly predicts Steven as the answer. However, Steven is actually a per:alternate_names of Stephen Waltham (Steven and Stephen Waltham share a co-reference relation) and thus obviously cannot be the answer to "Who did Stephen Waltham tell not to take that tone with him?". Our model is aware of the per:alternate_names knowledge and does not make this mistake.

[Table 8: Results of the case study comparing the predictions of BERT$_{pre}$ and our approach (BERT$_{pre}$ + Graph). The answer to the question is annotated in bold in the dialogue.]

Conclusion and Future Work
In this paper, we study the problem of question answering over dialogue. We propose a new model that can effectively integrate background knowledge for reasoning via a graph-based knowledge integration process. The effectiveness of our approach is verified through extensive experiments. In the current study, we have used the outputs of an additional co-reference identifier and relation extractor to build the relational graph, working in a pipeline style. In the future, we will study the inter-dependency of question answering and co-reference/relation identification in a multi-task setting to boost performance.