FlowDelta: Modeling Flow Information Gain in Reasoning for Conversational Machine Comprehension

Conversational machine comprehension requires a deep understanding of the dialogue flow, and prior work proposed FlowQA to implicitly model the context representations in reasoning for better understanding. This paper proposes to explicitly model the information gain through the dialogue reasoning in order to allow the model to focus on more informative cues. The proposed model achieves state-of-the-art performance on the conversational QA dataset QuAC and the sequential instruction understanding dataset SCONE, which shows the effectiveness of the proposed mechanism and demonstrates its capability of generalizing to different QA models and tasks.


Introduction
Machine reading comprehension, which aims to read a given passage and then answer questions about it correctly, has been increasingly studied in the NLP area. However, humans usually seek answers in a conversational manner by asking follow-up questions given the previous answers. Traditional machine reading comprehension (MC) tasks such as SQuAD (Rajpurkar et al., 2016) focus on a single-turn setting, and there is no connection between different questions and answers about the same passage. To address the multi-turn issue, several datasets for conversational question answering (QA) were introduced, such as CoQA (Reddy et al., 2018) and QuAC.
Most existing machine comprehension models apply single-turn methods and augment the input with question and answer history, ignoring previous reasoning processes in the models. The recently proposed FlowQA attempted to model such multi-turn reasoning in dialogues in order to improve performance for conversational QA. However, the proposed FLOW operation is expected to incorporate salient information in an implicit manner, because the learned representations captured by FLOW would change during multi-turn questions. It is unclear whether such change correlates well with the current answer. In order to explicitly model the information gain in FLOW and further relate the current answer to the corresponding context, we present a novel mechanism, FlowDelta, which focuses on modeling the difference between the learned context representations in multi-turn dialogues, as illustrated in Figure 1. Our code can be found at https://github.com/MiuLab/FlowDelta. The contributions are 3-fold:

• This paper proposes a simple and effective mechanism to explicitly model information gain in flow-based reasoning for multi-turn dialogues, which can be easily incorporated in different MC models.
• FlowDelta consistently improves the performance on various conversational MC datasets, including CoQA and QuAC.
• The proposed method achieves the state-of-the-art results on QuAC and the sequential instruction understanding task (SCONE).

Background
Given a document (context), previous conversation history (i.e., question/answer pairs) and the current question, the goal of conversational QA is to find the correct answer. We denote the context document as a sequence of $m$ words $C = \{c_1, c_2, \ldots, c_m\}$, and the $i$-th question $Q_i = \{q_1, q_2, \ldots, q_n\}$ as a sequence of $n$ words. In the extractive setting, the $i$-th answer $A_i$ is guaranteed to be a span in the context. The main challenge in conversational QA is that the current question may depend on the conversation history, which differs from classic machine comprehension. Therefore, how to incorporate previous history into the QA model is especially important for better understanding. Prior work on FlowQA proposes an effective way to model the reasoning in multi-turn dialogues, summarized below.
FLOW Operation. Instead of only using shallow history like previous questions and answers, FlowQA proposed the FLOW operation, which feeds the model with the entire hidden representations generated during the reasoning process when answering previous questions. FLOW is defined as a sequence of latent representations based on the context tokens and has been demonstrated to be effective for conversational QA tasks, because it incorporates multi-turn information well in dialogue reasoning.
Let the context representation for the $i$-th question be $C_i = \{c_{i,1}, \ldots, c_{i,m}\}$ and let the dialogue length be $t$. When answering questions in the dialogue, there are $t$ context sequences of length $m$, one for each question. We reshape them into $m$ sequences of length $t$, one for each context word, and then pass each sequence into a unidirectional GRU. All context word representations $j$ ($1 \le j \le m$) are processed in parallel in order to model the information along the FLOW direction (the vertical direction illustrated in Figure 1).
Then we reshape the outputs from the GRU back into $t$ sequences of length $m$, one for each question.
FlowQA. The FLOW layer described above is incorporated in FLOWQA for conversational MC, which is built on the single-turn MC model FusionNet (Huang et al., 2017), and the full structure is shown in the left part of Figure 2. Briefly, FLOWQA first performs word-level attention to fuse the information of the $i$-th question $Q_i$ into the context $C$. Then it uses two LSTM cells combined with FLOW layers to integrate the context representations, followed by the context-question attention computation. Finally, FLOWQA performs self-attention (Yu et al., 2018) on the context and predicts the answer span. Modeling FLOW is shown to be effective in improving the performance of conversational MC.
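The reshaping performed by the FLOW operation can be sketched in a few lines of plain Python. This is a minimal illustration rather than the FlowQA implementation: the function name `flow`, the zero initial state, and the toy `running_mean_cell` (used here in place of a real GRU cell) are all assumptions made for the example.

```python
from typing import Callable, List

Vector = List[float]


def flow(contexts: List[List[Vector]],
         cell: Callable[[Vector, Vector], Vector],
         hidden_size: int) -> List[List[Vector]]:
    """Run a recurrent cell along the turn axis, independently per context word.

    contexts[i][j] is the representation of context word j for question i,
    i.e. t sequences of length m. The result has the same (t, m) layout.
    """
    t, m = len(contexts), len(contexts[0])
    # Reshape: t sequences of length m -> m sequences of length t.
    per_word = [[contexts[i][j] for i in range(t)] for j in range(m)]

    outputs_per_word = []
    for seq in per_word:  # the m sequences are independent of each other
        h = [0.0] * hidden_size  # assumed zero initial state
        states = []
        for x in seq:  # unidirectional: turn k only sees turns up to k
            h = cell(x, h)
            states.append(h)
        outputs_per_word.append(states)

    # Reshape back: m sequences of length t -> t sequences of length m.
    return [[outputs_per_word[j][i] for j in range(m)] for i in range(t)]


def running_mean_cell(x: Vector, h: Vector) -> Vector:
    """Toy stand-in for a GRU cell (hypothetical): blends input and state."""
    return [0.5 * xi + 0.5 * hi for xi, hi in zip(x, h)]


# 3 question turns, 2 context words, 1-dimensional representations.
contexts = [[[1.0], [0.0]], [[1.0], [4.0]], [[1.0], [0.0]]]
flowed = flow(contexts, running_mean_cell, hidden_size=1)
```

Because each context word is processed independently along the turn axis, a practical implementation would batch the $m$ sequences and run them through the GRU in parallel instead of looping.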

Proposed Approaches
This paper extends the concept of FLOW and proposes a flow-based approach, FLOWDELTA, to explicitly model information gain in flow during dialogues, as illustrated in Figure 2. The proposed mechanism can be flexibly integrated with different models, including FlowQA and others. To examine such flexibility and generalization capability, we further apply FLOW and FLOWDELTA to BERT (Devlin et al., 2018), a pretrained language understanding model that shows strong performance in MC tasks, to allow the model to grasp dialogue history.

FlowDeltaQA
In the original FLOW operation in (1), the $k$-th step computation of the GRU is $h_{k,j} = \mathrm{GRU}(c_{k,j}, h_{k-1,j})$. We assume that the difference between the previous hidden representations $h_{k-1,j}$ and $h_{k-2,j}$ indicates whether the flow change is important, which can be viewed as the information gain through the reasoning process. For example, consider three consecutive questions $Q_{k-2}$, $Q_{k-1}$, $Q_k$, where $Q_{k-1}$ and $Q_k$ both discuss the same event described in the span $\{c_j, c_{j+1}, \ldots, c_l\}$ of the context, while $Q_{k-2}$ is about another topic. We expect the hidden states $\{h_{k-1,j}, h_{k-1,j+1}, \ldots, h_{k-1,l}\}$ of the span in turn $k-1$ to be dissimilar to the hidden states in turn $k-2$, because their topics are different. By explicitly modeling such difference, our model more easily relates the current reasoning process to the corresponding context.
Following the intuition above, we propose FLOWDELTA by modifying the single step computation of FLOW into:
$$h_{k,j} = \mathrm{GRU}([c_{k,j}; h_{k-1,j} - h_{k-2,j}], h_{k-1,j}),$$
where $[x; y]$ is the concatenation of the vectors $x$ and $y$. We also investigate other variants such as the Hadamard product ($h_{k-1,j} * h_{k-2,j}$), detailed in Appendix C.
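To make the single-step update concrete, the sketch below runs a FLOWDELTA-style recurrence over the turn sequence of one context word, assuming that the difference $h_{k-1,j} - h_{k-2,j}$ is concatenated to the GRU input $c_{k,j}$. It is an illustration and not the released implementation: `TinyGRUCell`, its bias-free random weights, and the zero initialization of the states before the first turn are all assumptions.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def _affine(W: List[Vector], x: Vector) -> Vector:
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


class TinyGRUCell:
    """Minimal GRU cell with random weights and no biases (illustrative only)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        cols = input_size + hidden_size

        def mat() -> List[Vector]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(hidden_size)]

        # One weight matrix per gate, acting on the concatenation [x; h].
        self.W_z, self.W_r, self.W_n = mat(), mat(), mat()

    def __call__(self, x: Vector, h: Vector) -> Vector:
        z = [_sigmoid(v) for v in _affine(self.W_z, x + h)]  # update gate
        r = [_sigmoid(v) for v in _affine(self.W_r, x + h)]  # reset gate
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in _affine(self.W_n, x + rh)]  # candidate state
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def flow_delta(seq: List[Vector],
               cell: Callable[[Vector, Vector], Vector],
               hidden_size: int) -> List[Vector]:
    """Compute h_k = cell([c_k; h_{k-1} - h_{k-2}], h_{k-1}) for every turn k."""
    h_prev2 = [0.0] * hidden_size  # h_{k-2}, assumed zero before the dialogue
    h_prev1 = [0.0] * hidden_size  # h_{k-1}, assumed zero before the dialogue
    states = []
    for c_k in seq:
        delta = [a - b for a, b in zip(h_prev1, h_prev2)]  # information gain
        h_k = cell(c_k + delta, h_prev1)  # list "+" concatenates [c_k; delta]
        states.append(h_k)
        h_prev2, h_prev1 = h_prev1, h_k
    return states


input_size, hidden_size = 3, 4
# The cell input is widened by hidden_size to hold the concatenated difference.
cell = TinyGRUCell(input_size + hidden_size, hidden_size)
turns = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.4, 0.4, 0.4]]
states = flow_delta(turns, cell, hidden_size)
```

The only structural change relative to a plain recurrence is the wider cell input, which is why the mechanism adds few parameters.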

BERT-FlowDelta
BERT (Devlin et al., 2018) with fine-tuning has recently reached the state of the art in many single-turn MC tasks, such as SQuAD (Rajpurkar et al., 2016, 2018). However, how to extend BERT to the multi-turn setting remains unsolved. We propose to incorporate the FLOWDELTA mechanism to deal with the multi-turn problem, where the FLOW layer automatically integrates multi-turn information instead of tuning the number of QA pairs for inclusion. Each layer of BERT is a Transformer block (Vaswani et al., 2017) that consists of multi-head attention (MH) and a fully connected feed-forward network (FFN):

$$\mathrm{SA}(h) = \mathrm{FFN}(\mathrm{LN}(h + \mathrm{MH}(h))),$$
where $h^l$ is the hidden representation of the $l$-th layer, LN is layer normalization (Ba et al., 2016) and SA means self-attention. To utilize the $L$ layers of BERT for the extractive question answering task, we feed the hidden representation from the last layer $h^L$ to a fully connected layer (NN) to predict the answer span, written as $P^S, P^E = \mathrm{NN}(h^L)$, where $P^S$ and $P^E$ are the span start and span end probabilities for each word, respectively.
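As an illustration of the span prediction step, the sketch below maps per-token hidden vectors to start and end probabilities with a single linear layer followed by a softmax over tokens, then selects the most probable span. The weight vectors `w_start` and `w_end`, the `max_len` limit, and the decoding rule (maximizing the product of start and end probabilities subject to start ≤ end) are assumptions for the example; the text above only specifies that a fully connected layer produces the two distributions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_span(hidden: List[Vector],
                 w_start: Vector,
                 w_end: Vector,
                 max_len: int = 30) -> Tuple[Tuple[int, int], List[float], List[float]]:
    """Return the best (start, end) token span plus both distributions."""
    # Linear layer: one start score and one end score per token.
    start_scores = [sum(w * x for w, x in zip(w_start, h)) for h in hidden]
    end_scores = [sum(w * x for w, x in zip(w_end, h)) for h in hidden]
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)

    # Assumed decoding rule: maximise p_start[s] * p_end[e] with s <= e.
    best, best_p = (0, 0), -1.0
    for s in range(len(hidden)):
        for e in range(s, min(len(hidden), s + max_len)):
            p = p_start[s] * p_end[e]
            if p > best_p:
                best, best_p = (s, e), p
    return best, p_start, p_end


# Four tokens with 2-dimensional hidden vectors (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
span, p_start, p_end = predict_span(hidden, w_start=[1.0, 0.0], w_end=[0.0, 1.0])
```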
BERT-FlowDelta incorporates the proposed FLOWDELTA mechanism in two places, as shown in the bottom right corner of Figure 2. First, we add a FLOWDELTA layer before the final prediction layer that produces $P^S$ and $P^E$. Second, we further insert FLOWDELTA into the last BERT layer, considering that modeling dialogue history within BERT may be beneficial.
These two modifications are called exFlowDelta and inFlowDelta respectively, and the latter is also in line with the idea of Stickland and Murray, who added additional parameters into BERT layers to improve the performance of multi-task learning.
In our experiments, we only modify the last BERT layer to avoid substantially increasing the model size.

Experiments
To evaluate the effectiveness of the proposed FLOWDELTA, we conduct experiments on various tasks that require dialogue history for understanding.

Setup
Our models are tested on two conversational MC datasets, CoQA (Reddy et al., 2018) and QuAC, and a sequential instruction understanding dataset, SCONE (Long et al., 2016). For QuAC, we also report the Human Equivalence Score (HEQ). HEQ-Q and HEQ-D represent the percentage of questions and dialogues, respectively, on which the model's performance exceeds that of the human evaluation. While CoQA and QuAC both follow the conversational QA setting, SCONE is a task that requires the model to understand a sequence of natural language instructions and modify the world state accordingly. We follow prior work to reduce instruction understanding to machine comprehension. Appendix A contains an example and the reduction details of SCONE for reference.
... showing the superiority of our model in modeling the whole dialogue. Note that FLOWDELTA introduces few additional parameters compared to FLOW, since it only augments the input dimension of the GRU. The consistent improvement on both datasets demonstrates the generalization capability of applying the proposed mechanism to various models.
Table 2 shows the ablation study of BERT-FlowDelta, where the two proposed modules are both important for achieving such results. Interestingly, the proposed inFlowDelta and exFlowDelta boost the performance more on QuAC. As Yatskar (2018) mentioned, the topics in a dialogue shift more frequently in QuAC than in CoQA, and vanilla BERT also performs well on CoQA in the ablation of FLOW, which provides long-term dialogue history information. Therefore, we conclude that while FLOWDELTA improves the ability to grasp information gain in the dialogue, it brings less performance improvement in settings where little context is needed to answer the question.
Table 3 shows the performance of our FlowDeltaQA on SCONE. Our model outperforms FlowQA and achieves the state of the art in the SCENE and TANGRAMS domains. The small performance drop in ALCHEMY aligns well with the statement in the ablation study.
Because experiments show that removing FLOW affects performance in ALCHEMY less when comparing FlowQA and FusionNet (Huang et al., 2017) (the same models except for FLOW), we claim that the previous dialogue history is less important in this domain. Thus replacing FLOW with FlowDelta does not bring any improvement in the ALCHEMY domain. The detailed qualitative study can be found in Appendix D.
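For reference, the HEQ metrics reported for QuAC can be computed from per-question F1 scores as in the sketch below. The function name `heq_scores` and the input layout are hypothetical, and counting ties as a success (model F1 greater than or equal to human F1) is an assumption about how equal scores are handled.

```python
from typing import List, Tuple

Dialogue = List[Tuple[float, float]]  # (model_f1, human_f1) for each question


def heq_scores(dialogues: List[Dialogue]) -> Tuple[float, float]:
    """Return (HEQ-Q, HEQ-D) as percentages.

    HEQ-Q: share of questions where the model matches or exceeds human F1.
    HEQ-D: share of dialogues where this holds for every question.
    """
    total_questions = sum(len(d) for d in dialogues)
    question_hits = sum(1 for d in dialogues for m, h in d if m >= h)
    dialogue_hits = sum(1 for d in dialogues if all(m >= h for m, h in d))
    heq_q = 100.0 * question_hits / total_questions
    heq_d = 100.0 * dialogue_hits / len(dialogues)
    return heq_q, heq_d


# Two made-up dialogues with two questions each.
dialogues = [
    [(0.8, 0.7), (0.5, 0.5)],  # model matches or beats the human on both
    [(0.3, 0.6), (0.9, 0.8)],  # model falls short on the first question
]
heq_q, heq_d = heq_scores(dialogues)
```

HEQ-D is the stricter of the two, since a single weak answer makes the whole dialogue count as a miss, which is why it is sensitive to how well a model handles the full conversation.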

Conclusion
This paper presents a simple and effective extension of FLOW named FLOWDELTA, which is capable of explicitly modeling the dialogue history in reasoning for better conversational machine comprehension. The proposed FlowDelta can be flexibly applied to other machine comprehension models, including FlowQA and BERT. The experiments on three datasets show that the proposed mechanism can model the information flow in multi-turn dialogues more comprehensively and consistently boosts the performance. In the future, we will investigate more efficient ways to model the dialogue flow for conversational tasks.