Reinforced Dynamic Reasoning for Conversational Question Generation

This paper investigates a new task named Conversational Question Generation (CQG), which is to generate a question based on a passage and a conversation history (i.e., previous turns of question-answer pairs). CQG is a crucial task for developing intelligent agents that can drive question-answering style conversations or test user understanding of a given passage. Towards that end, we propose a new approach named Reinforced Dynamic Reasoning network, which is based on the general encoder-decoder framework but incorporates a reasoning procedure in a dynamic manner to better understand what has been asked and what to ask next about the passage. To encourage producing meaningful questions, we leverage a popular question answering (QA) model to provide feedback and fine-tune the question generator using a reinforcement learning mechanism. Empirical results on the recently released CoQA dataset demonstrate the effectiveness of our method in comparison with various baselines and model variants. Moreover, to show the applicability of our method, we also apply it to create multi-turn question-answering conversations for passages in SQuAD.


Introduction
In this work, we study a novel task of conversational question generation (CQG): given a passage and a conversation history (i.e., previous turns of question-answer pairs), generate the next question. CQG is an important task in its own right for measuring the ability of machines to lead a question-answering style conversation. It can serve as an essential component of intelligent social bots or tutoring systems, asking meaningful and coherent questions to engage users or test student understanding about a certain topic. On the other hand, as shown in Figure 1, large-scale high-quality conversational question answering (CQA) datasets such as CoQA (Reddy et al., 2018) and QuAC can help train models to answer sequential questions. However, manually creating such datasets is quite costly, e.g., CoQA spent 3.6 USD per passage on crowdsourcing for conversation collection, and automatic CQG can potentially help reduce the cost, especially when a large set of passages is available.

Figure 1:
Passage: Shelly is in second grade. She is a new student at her school. Shelly's family has lived in many different places. Shelly was born in Florida. Her family moved to Tennessee when she was two years old. When she was four years old, they moved to Texas. They moved from there to Arizona, where they now live.
Q1: What grade is Shelly in? A1: second R1: Shelly is in second grade.
Q2: Was she a new student? A2: Yes R2: She is a new student at her school.
Q3: Where did she move at 2 years old? A3: Tennessee R3: Her family moved to Tennessee when she was two years old.
The dataset also provides a rationale (R) (i.e., a text span from the passage) to support each answer.

* Work done while visiting the Ohio State University.
In recent years, automatic question generation (QG), which aims to generate natural questions based on a certain type of data source, including structured knowledge bases (Serban et al., 2016b; Guo et al., 2018) and unstructured texts (Rus et al., 2010; Heilman and Smith, 2010; Du et al., 2017; Du and Cardie, 2018), has been widely studied. However, previous work mainly focuses on generating standalone and independent questions based on a given passage. To the best of our knowledge, we are the first to explore CQG, i.e., generating the next question based on a passage and a conversation history.
Compared with previous QG tasks, CQG needs to take into account not only the given passage but also the conversation history. It is potentially more challenging, as producing a coherent conversation requires a deep understanding of what has been asked so far and what information should be asked for in the next round.
In this paper, we present a novel framework named Reinforced Dynamic Reasoning (ReDR) network. Inspired by the recent success of reading comprehension models (Xiong et al., 2017; Seo et al., 2017), ReDR adapts their reasoning procedure (which encodes the knowledge of the passage and the conversation history based on a coattention mechanism) and moreover dynamically updates the encoding representation based on a soft decision maker to generate a coherent question. In addition, to encourage ReDR to generate meaningful and interesting questions, one would ideally employ humans to provide feedback; however, as widely acknowledged, involving humans in the loop for training models can be very costly. Therefore, in this paper, we leverage a popular and effective reading comprehension (or QA) model (Chen et al., 2017) to predict the answer to a generated question and use its answer quality (which can be seen as a proxy for real human feedback) as rewards to fine-tune our model based on a reinforcement learning mechanism (Williams, 1992).
Our contributions are summarized as follows:
• We introduce a new task of Conversational Question Generation (CQG), which is crucial for developing intelligent agents to drive question-answering style conversations and can potentially provide valuable datasets for future relevant research.
• We propose a new and effective framework for CQG, which is equipped with a dynamic reasoning component to generate a conversational question and is further fine-tuned via a reinforcement learning mechanism.
• We show the effectiveness of our method using the recent CoQA dataset. Moreover, we show its wide applicability by using it to create multi-turn QA conversations for passages in SQuAD (Rajpurkar et al., 2016).

Task Definition
Formally, we define the task of Conversational Question Generation (CQG) as: Given a passage $X$ and the previous turns of question-answer pairs $\{(q_1, a_1), (q_2, a_2), ..., (q_{k-1}, a_{k-1})\}$ about $X$, CQG aims to generate the next question $q_k$ that is related to the given passage and coherent with the previous questions and answers, i.e.,

$$q_k = \arg\max_{q_k} P(q_k \mid X, q_{<k}, a_{<k}) \qquad (1)$$

where $P(q_k \mid X, q_{<k}, a_{<k})$ is a conditional probability of generating the question $q_k$.

Methodology
We show our proposed framework named Reinforced Dynamic Reasoning (ReDR) network in Figure 2. Since a full passage is usually too long and makes it hard to focus on the most relevant information for generating the next question, our method first selects a text span from the passage as the rationale at each conversation turn, and then dynamically models the reasoning procedure for encoding the conversation history and the selected rationale, before finally decoding the next question.

Rationale Selection
We simply set each sentence in the passage as the corresponding rationale for each turn of the conversation. When experimenting with CoQA, we use the rationale span provided in the dataset. Besides simplicity and efficiency, another reason that we adopt this rule-based method is that previous research demonstrated that the transition of the dialog attention is smooth (Reddy et al., 2018), meaning that earlier questions in a conversation are usually answerable by the preceding part of the passage while later questions tend to focus on the ending part of the passage. The selected rationale is then leveraged by subsequent modules for question generation.
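The rule-based selection can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the sentence splitter is a naive regular expression, and the function names `split_sentences` and `select_rationale` are hypothetical. The text does not say what happens when a conversation has more turns than the passage has sentences; reusing the last sentence is an assumption made here.

```python
import re


def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def select_rationale(passage, turn):
    """Return the sentence used as the rationale for a 1-based turn index.

    Assumption: turn k maps to the k-th sentence, and turns past the end of
    the passage reuse the last sentence.
    """
    sentences = split_sentences(passage)
    if not sentences:
        raise ValueError("passage contains no sentences")
    index = min(max(turn, 1), len(sentences)) - 1
    return sentences[index]


passage = ("Shelly is in second grade. She is a new student at her school. "
           "Shelly's family has lived in many different places.")
first = select_rationale(passage, 1)
second = select_rationale(passage, 2)
```

On CoQA this step is unnecessary, since the dataset's own rationale spans are used instead.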

Encoding & Reasoning
Figure 2: Overview of our Reinforced Dynamic Reasoning (ReDR) network. The reasoning mechanism iteratively reads the conversation history and at each iteration, its output is dynamically combined with the previous encoding representation through a soft decision maker ($p_d$) as the new encoding representation, which is fed into the next iteration. The model is finally fine-tuned by the reward defined by the quality of the answer predicted from a QA model.

At each turn $k$, we denote the conversation history as a sequence of $m$ tokens, i.e., $c = \{c_1, c_2, ..., c_m\}$, which concatenates the previous questions and answers $\langle q_1, a_1, ..., q_{k-1}, a_{k-1} \rangle$, and represent the rationale as a sequence of $n$ tokens, i.e., $r = \{r_1, r_2, ..., r_n\}$. As mentioned earlier, different from previous question generation tasks, we have two knowledge sources (i.e., the conversation history and the rationale) as the inputs. A good encoding of them is crucial for task performance and might involve a reasoning procedure across previous question-answer pairs and the selected rationale for determining the next question. We feed them respectively into a bidirectional LSTM and obtain their contextual representations $C \in \mathbb{R}^{d \times m}$ and $R \in \mathbb{R}^{d \times n}$.

Inspired by the coattention reasoning mechanism in previous reading comprehension works (Xiong et al., 2017; Seo et al., 2017; Pan et al., 2017), we compute an alignment matrix of $C$ and $R$ to link and fuse the information flow: $S = R^\top C \in \mathbb{R}^{n \times m}$. We normalize this alignment matrix column-wise (i.e., $\mathrm{softmax}(S)$) to obtain the relevance degree of each token in the conversation history to the whole rationale. The new representation of the conversation history w.r.t. the rationale is obtained via:

$$H = R \cdot \mathrm{softmax}(S) \qquad (2)$$

Similarly, we compute the attention over the conversation history for each word in the rationale via $\mathrm{softmax}(S^\top)$ and obtain the context-dependent representation of the rationale by $C \cdot \mathrm{softmax}(S^\top)$. In addition, as in Xiong et al. (2017), we also consider the above new representation of the conversation history and map it to the space of rationale encodings via $H \cdot \mathrm{softmax}(S^\top)$, and finally obtain the co-dependent representation of the rationale and the conversation history:

$$G = [C; H] \cdot \mathrm{softmax}(S^\top) \qquad (3)$$

where $[;]$ means concatenation across the row dimension. To deeply capture the interaction between the rationale and the conversation history, we feed the co-dependent representation $G$ combined with the rationale $R$ into an integration model instantiated by a bi-directional LSTM:

$$u_i^0 = \mathrm{BiLSTM}(u_{i-1}^0, u_{i+1}^0, [g_i; r_i]) \qquad (4)$$

We define the reasoning process in our paper as Eqn. (2-4), and now obtain a matrix $U^0 = [u_1^0, u_2^0, ..., u_n^0]$ as the encoding representation after a one-layer reasoning procedure, which can be fed into the decoder subsequently.
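To make the shapes concrete, the coattention step can be sketched in plain Python. This is an illustrative sketch, assuming the standard coattention form of Xiong et al. (2017): token vectors are stored as lists (one per token) instead of matrix columns, the function name `coattention` is hypothetical, and the BiLSTM encoders and the integration BiLSTM are left out entirely.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def coattention(C, R):
    """C: m history-token vectors, R: n rationale-token vectors (each of size d).

    Returns H (m vectors of size d) and G (n vectors of size 2d).
    """
    m, n = len(C), len(R)
    # Alignment scores S[i][j] = r_i . c_j, i.e. S = R^T C with shape n x m.
    S = [[dot(R[i], C[j]) for j in range(m)] for i in range(n)]
    # Column-wise softmax of S: for each history token j, weights over the
    # rationale tokens; H collects the resulting summaries of the rationale.
    H = [weighted_sum(softmax([S[i][j] for i in range(n)]), R) for j in range(m)]
    # Column-wise softmax of S^T: for each rationale token i, weights over the
    # history tokens, applied to the stacked vectors [c_j; h_j].
    stacked = [C[j] + H[j] for j in range(m)]
    G = [weighted_sum(softmax(S[i]), stacked) for i in range(n)]
    return H, G


C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 history tokens, d = 2
R = [[0.5, 0.5], [2.0, 0.0]]              # n = 2 rationale tokens
H, G = coattention(C, R)
```

Each row of `G` has size 2d and is aligned with one rationale token, which is why it can be concatenated with the rationale encoding before the integration model.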

Dynamic Reasoning
Oftentimes the conversation history is very informative and complicated, and a single layer of reasoning may be insufficient to comprehend the subtle relationship among the rationale, the conversation history, and the to-be-generated question. Therefore, we propose a dynamic reasoning procedure to iteratively update the encoding representation. We regard $U^0$ as a new representation of the rationale and input it to the next layer of reasoning together with $C$:

$$\tilde{U}^1 = F_{reason}(U^0, C) \qquad (5)$$

where $F_{reason}$ is the reasoning procedure (Eqn. 2-4), and $\tilde{U}^1$ is the hidden states of the BiLSTM integration model at the next reasoning layer. To effectively learn what information in $\tilde{U}^1$ and $U^0$ is relevant to keep, we use a soft decision maker $p_d$ to determine their weights; its definition involves an all-ones vector $e_1$ and trainable parameters $w_u$, $w_g$, $w_r$, $b$. $p_d \in \mathbb{R}^n$ is used as a soft switch to choose between different levels of reasoning. $U^1$ is the representation to be used for the next layer of reasoning. This iterative procedure halts when a maximum number of reasoning layers $N$ is reached ($N \geq 1$). The final representation $U^N$ is fed into the decoder.
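The iteration with a soft switch can be illustrated with a generic sigmoid gate. The sketch below is not the paper's exact parameterization: the gate inputs, the parameter names `w_prev`, `w_new` and `bias`, the direction of the mix (gate value on the new output, its complement on the previous encoding), and the convention that `num_layers` counts the additional reasoning layers are all assumptions. The `reason` argument stands in for the reasoning procedure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate(U_prev, U_new, w_prev, w_new, bias):
    """Mix two encodings position by position with a scalar gate per position.

    U_prev, U_new: n vectors of size d (previous encoding, new reasoning output).
    w_prev, w_new, bias: hypothetical gate parameters, for illustration only.
    """
    mixed, gates = [], []
    for u_prev, u_new in zip(U_prev, U_new):
        score = (sum(w * x for w, x in zip(w_prev, u_prev))
                 + sum(w * x for w, x in zip(w_new, u_new)) + bias)
        p = sigmoid(score)
        gates.append(p)
        mixed.append([p * a + (1.0 - p) * b for a, b in zip(u_new, u_prev)])
    return mixed, gates


def dynamic_reasoning(U0, C, reason, gate_params, num_layers):
    """Apply `reason` repeatedly, gating each output against the previous encoding."""
    U = U0
    for _ in range(num_layers):
        U_tilde = reason(U, C)
        U, _ = soft_gate(U, U_tilde, *gate_params)
    return U


# Toy stand-in for the reasoning procedure: shift every value by one.
toy_reason = lambda U, C: [[x + 1.0 for x in u] for u in U]
zero_gate = ([0.0, 0.0], [0.0, 0.0], 0.0)  # gate is exactly 0.5 everywhere
result = dynamic_reasoning([[0.0, 0.0]], None, toy_reason, zero_gate, num_layers=2)
```

With the gate fixed at 0.5, each layer moves the encoding halfway towards the new reasoning output, which shows how the switch interpolates between reasoning depths instead of picking one.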

Decoding
The decoder generates a word by sampling from the probability $P_{gen}(y_t \mid y_{<t}, c, r)$, which is computed from the following quantities: MLP stands for a standard multilayer perceptron network, $y_t$ is the $t$-th word in the generated question, $o_t$ is the hidden state of the decoder at time step $t$, and $\mathrm{Emb}(\cdot)$ indicates the word embedding. $v_t$ is an attentive read of the encoding representation.

Observing that a question may share common words with the rationale that it is based on, and inspired by the widely adopted copy mechanism (Gu et al., 2016; See et al., 2017), we also apply a pointer network for the generator to copy words from the rationale. Now the probability of generating target word $y_t$ becomes:

$$P(y_t \mid y_{<t}, c, r) = \lambda P_{gen}(y_t) + (1 - \lambda) P_{pt}(y_t) \qquad (8)$$

where $P_{gen}(y_t) = P_{gen}(y_t \mid y_{<t}, c, r)$ is defined earlier, $P_{pt}(y_t) = \sum_{i: r_i = y_t} \alpha_{t,i}$ is the probability of copying word $y_t$ from $r$ (only if $r$ contains $y_t$), and $\lambda$ is the weight to balance the two:

$$\lambda = \sigma(w_v^\top v_t + w_o^\top o_t + w_y^\top \mathrm{Emb}(y_{t-1}) + b_{pt}) \qquad (9)$$

where $w_v$, $w_o$, $w_y$ and $b_{pt}$ are to be learnt. To optimize all parameters in ReDR, we adopt the maximum likelihood estimation (MLE) approach, i.e., maximizing the summed log likelihood of words in a target question.
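The mixture in Eqn. (8) can be checked on a toy example. The sketch below assumes that the generation distribution, the attention weights over rationale positions, and the mixing weight are already given; the function name `mix_generate_and_copy` and all numbers are illustrative only.

```python
def mix_generate_and_copy(p_gen, attention, rationale_tokens, lam):
    """Final word distribution: lam * P_gen + (1 - lam) * P_pt.

    p_gen: dict mapping each vocabulary word to its generation probability.
    attention: attention weights alpha_{t,i}, one per rationale position.
    rationale_tokens: the rationale words r_1..r_n.
    lam: mixing weight in [0, 1].
    """
    # P_pt(w) sums the attention mass over every position i where r_i == w.
    p_copy = {}
    for token, alpha in zip(rationale_tokens, attention):
        p_copy[token] = p_copy.get(token, 0.0) + alpha
    words = set(p_gen) | set(p_copy)
    return {w: lam * p_gen.get(w, 0.0) + (1.0 - lam) * p_copy.get(w, 0.0)
            for w in words}


p_gen = {"what": 0.6, "grade": 0.1, "is": 0.3}
rationale = ["shelly", "is", "in", "second", "grade"]
attention = [0.1, 0.1, 0.1, 0.2, 0.5]
final = mix_generate_and_copy(p_gen, attention, rationale, lam=0.7)
```

A word such as "shelly" that is absent from the generation vocabulary still receives probability through the copy term, which is the point of adding the pointer network.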

Reinforcement Learning for Fine-tuning
As shown by recent datasets like CoQA and QuAC, human-created questions tend to be meaningful and interesting. For example, in Figure 1, given the second rationale R2 "She is a new student at her school", humans tend not to ask "Where is she?", and similarly given R3, they usually do not create the question "What happened?". Although both are legitimate questions, they tend to be less interesting and meaningful compared with the human-created ones shown in Figure 1. The interestingness or meaningfulness of a question is subjective and hard to define, and measuring it automatically is itself a difficult problem. Ideally, one can involve humans in the loop to judge the generated question and provide feedback, but it can be very costly, if not impossible.
Driven by such observations, we use the REINFORCE (Williams, 1992) algorithm and adopt one of the state-of-the-art reading comprehension models, DrQA (Chen et al., 2017), as a substitute for humans to provide feedback to the question generator. DrQA answers a question based on the given passage and has achieved a competitive performance on CoQA (Reddy et al., 2018). During training, we apply DrQA to answer a generated question, and compare its answer with the human-provided answer (which is associated with the same rationale used for generating the question). If the answers match well with each other, we consider that our generator has produced a meaningful question, since it asks about the same thing as humans do, and we assign high rewards to such questions.
Formally, we minimize the negative expected reward for a generated question:

$$\mathcal{L}_{RL} = -\mathbb{E}_{q \sim \pi(q \mid r, c)}[R(a, a^*)] \qquad (10)$$

where $\pi(q \mid r, c) = \prod_t P(y_t \mid y_{<t}, c, r)$ is the action policy defined in Eqn. (8) for producing question $q$ given rationale $r$ and conversation history $c$, and $R(a, a^*)$ is the reward function defined by the F1 score between the DrQA predicted answer $a$ and the human-provided answer $a^*$. For computational efficiency, during training we make sure that the ground-truth question is in the sampling pool and use beam search to generate 5 more questions.
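The reward and the resulting training signal can be sketched as follows. This is a minimal illustration under stated assumptions: the F1 score is computed over lowercased whitespace tokens, the loss is the usual REINFORCE surrogate averaged uniformly over the candidate pool with no baseline, and the function names `f1_reward` and `reinforce_loss` are hypothetical. Calling the QA model and computing log-probabilities are outside the sketch.

```python
import math
from collections import Counter


def f1_reward(predicted, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = predicted.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reinforce_loss(candidates):
    """Surrogate loss for REINFORCE over a pool of candidate questions.

    candidates: list of (log_prob_of_question, reward) pairs. Differentiating
    this value w.r.t. the log-probabilities weights each one by -reward, so
    minimizing it raises the probability of highly rewarded questions.
    """
    total = sum(reward * log_prob for log_prob, reward in candidates)
    return -total / len(candidates)


reward = f1_reward("second grade", "second")
loss = reinforce_loss([(math.log(0.5), 1.0), (math.log(0.25), 0.0)])
```

A question whose predicted answer shares no tokens with the human answer gets zero reward and therefore contributes nothing to the update.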
Note that besides providing rewards for fine-tuning our generator, the DrQA model also serves another purpose: when applying our framework to any passage, we can use DrQA to produce an answer to the currently generated question so that the conversation history can be updated for the next turn of question generation. In addition, our framework is not limited to DrQA, and other more advanced QA models can be applied as well.
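The turn-by-turn use of the QA model can be written as a short loop. In the sketch below, `generate_question` and `answer_question` are hypothetical callables standing in for the question generator and the QA model; their signatures are assumptions made for illustration and do not correspond to any released API.

```python
def generate_conversation(passage, rationales, generate_question, answer_question):
    """Build a multi-turn QA conversation, one turn per rationale.

    generate_question(rationale, history) -> str
    answer_question(passage, question, history) -> str
    Both callables are placeholders for trained models.
    """
    history = []  # (question, answer) pairs from earlier turns
    for rationale in rationales:
        question = generate_question(rationale, history)
        answer = answer_question(passage, question, history)
        history.append((question, answer))  # visible to the next turn
    return history


# Toy stand-ins that only record how much history they were given.
def toy_generator(rationale, history):
    return "Question %d about: %s" % (len(history) + 1, rationale)


def toy_answerer(passage, question, history):
    return "answer %d" % (len(history) + 1)


rationales = ["Shelly is in second grade.", "She is a new student at her school."]
conversation = generate_conversation(" ".join(rationales), rationales,
                                     toy_generator, toy_answerer)
```

Because the predicted answer is appended to the history, errors made by the QA model propagate into later turns, which is one reason a stronger QA model can be swapped in.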

Dataset
We use the CoQA dataset (Reddy et al., 2018) to experiment with our ReDR and baseline methods. CoQA contains text passages from diverse domains, conversational questions and answers developed for each passage, as well as rationales (i.e., text spans extracted from given passages) to support answers. The dataset consists of 108k questions in the training set and 8k questions in the development (dev) set, with a large hidden test set for competition purposes; our results are reported on the dev set.

Baselines
As discussed earlier, CQG has been under-investigated so far, and there are few existing baselines for comparison. Because of their high relevance to our task as well as their superior performance demonstrated by previous works, we choose to compare with the following models:

Seq2Seq (Sutskever et al., 2014) is a basic encoder-decoder sequence learning system, which has been widely used for machine translation (Luong et al., 2015) and dialogue generation. We concatenate the rationale and the conversation history as the input sequence in our setting.
NQG (Du et al., 2017) is a strong attention-based neural network approach for the question generation task. The input is the same as for the above Seq2Seq model.

Implementation Details
Our word embeddings are initialized by glove.840B.300d (Pennington et al., 2014). We set the LSTM hidden unit size to 500 and set the number of layers of LSTMs to 2 in both the encoder and the decoder. Optimization is performed using stochastic gradient descent (SGD), with an initial learning rate of 1.0. The learning rate starts decaying at step 15000 with a decay rate of 0.95 for every 5000 steps. The mini-batch size for the update is set at 64. We set the dropout (Srivastava et al., 2014) ratio as 0.3 and the beam size as 5. The maximum number of iterations for the dynamic reasoning is set to be 3. Since CoQA contains abstractive answers, we apply DrQA as our question answering model and follow Yatskar (2018) to separately train a binary classifier to produce "yes" or "no" for yes/no questions. Code is available at https://github.com/ZJULearning/ReDR.
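The learning-rate schedule described above can be written as a small function. This is a sketch of one reading of the description: it assumes the first decay is applied exactly at step 15000 and one further decay every 5000 steps after that. The function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(step, initial=1.0, decay=0.95, start=15000, interval=5000):
    """Step-wise exponential decay of the SGD learning rate.

    Assumption: the rate is constant before `start`, multiplied by `decay`
    at `start`, and multiplied again every `interval` steps afterwards.
    """
    if step < start:
        return initial
    num_decays = (step - start) // interval + 1
    return initial * decay ** num_decays


schedule = [learning_rate(s) for s in (0, 14999, 15000, 19999, 20000)]
```

Under the alternative reading, where the first decay happens one interval after step 15000, the `+ 1` would be dropped.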

Metrics
We follow previous question generation work (Xu et al., 2017; Du et al., 2017) in using BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) to measure the relevance between the generated question and the ground-truth one. To evaluate the diversity of the generated questions, we follow Li et al. (2016a) to calculate Dist-n (n=1,2), which is the proportion of unique n-grams over the total number of n-grams in the generated questions for all passages, and also use the Ent-n (n=4) metric, which reflects how evenly the n-gram distribution is spread over all generated questions. For all the metrics, larger values indicate more relevant or more diverse generated questions.

Table 2 shows the performance of various models on the CoQA dataset. As we can see, our model ReDR and its variants perform much better than the baselines, which indicates that the reasoning procedure can significantly boost the quality of the encoding representations and thus improve the question generation performance.
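The two diversity metrics used in this section can be sketched as follows. Dist-n follows the definition given in the text. For Ent-n, the sketch assumes the Shannon entropy (natural logarithm) of the empirical n-gram distribution, which may differ in detail from the cited definition. Tokenization by lowercased whitespace splitting and the function names are also assumptions.

```python
import math
from collections import Counter


def ngram_counts(questions, n):
    """Count all n-grams over a list of whitespace-tokenized questions."""
    counts = Counter()
    for question in questions:
        tokens = question.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def dist_n(questions, n):
    """Proportion of unique n-grams among all n-grams."""
    counts = ngram_counts(questions, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def ent_n(questions, n):
    """Entropy of the n-gram frequency distribution (assumed definition)."""
    counts = ngram_counts(questions, n)
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())


generated = ["what grade is she in ?", "where is she ?"]
dist1 = dist_n(generated, 1)
dist2 = dist_n(generated, 2)
```

Dist-n only counts how many distinct n-grams appear, while the entropy also penalizes outputs in which a few n-grams dominate.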

Results and Analysis
To investigate the effect of the reasoning procedure and fine-tuning in our model design, we also conduct an ablation study: (1) We first test our model with only one layer of reasoning, i.e., directly feeding the encoding representation $U^0$ into the decoder. The results drop substantially on all the metrics, which indicates that the input text contains abundant semantic information and that multi-layer reasoning is necessary. (2) We then augment our model with two or three layers of reasoning but without the decision maker $p_d$. In other words, we directly use the hidden states of the integration LSTM as the input to the next reasoning layer (formally, $U^j = \tilde{U}^j$). The performance of our model increases with two-layer reasoning but decreases with three-layer reasoning. We conjecture that two-layer reasoning is already sufficient for most of the input text sequences, so uniformly adding another layer for all inputs is not optimal. (3) When we add the decision maker to dynamically compute the encoding representations, the results are greatly improved, which demonstrates that a dynamic procedure can assign a proper weight to each layer for input sequences of different lengths and amounts of information. (4) Finally, we fine-tune the model with the reinforcement learning framework, and the results show that using the answer quality as the reward is helpful for generating better questions.

Human Evaluation
We conduct human evaluation to measure the quality of generated questions. We randomly sampled 50 questions along with their conversation history and the passage, and consider 5 aspects: Naturalness, which indicates the grammaticality and fluency; Relevance, which indicates the connection with the topic of the passage; Coherence, which measures whether the generated question is coherent with the conversation history; Richness, which measures the amount of information contained in the question; and Answerability, which indicates whether the question is answerable based on the passage. For each sample, 5 people are asked to rank three questions (the ReDR question, the NQG question and the human-created question) by assigning each a score from {1,2,3} (the higher, the better). For each aspect, we show the average score across the five annotators on all samples. Table 3 shows the results of human evaluation. We can see that our method outperforms NQG in almost all aspects. For Naturalness, the three methods obtain similar scores, probably because most generated questions are short and fluent, so they do not differ significantly on this aspect. We also observe that on the Relevance, Coherence and Answerability aspects, there is an obvious gap between the generative models and human annotation. This indicates that contextual understanding is still a challenging problem for the task of conversational question generation.

Linguistic Analysis
We further analyze the generated questions in terms of their linguistic features and constitutions in Table 4, from which we draw three observations:
(1) Overall, the distribution of the major types of questions generated by ReDR is closer to that of human-created questions than the distribution for NQG is. For example, ReDR generates a large portion of "what" and "who" questions, as humans do.
(2) We observe that NQG tends to generate many single-word questions such as "Why?", while our method successfully alleviates this problem.
(3) Both ReDR and NQG generate fewer yes/no questions than humans, as a result of generating more "wh"-type questions.

For the relationship between a question and its conversation history, following the analysis in CoQA, we randomly sample 150 questions from each method and observe that about 50% of the questions generated by ReDR contain explicit coreference markers such as "he", "she" or "it", which is similar to the other two methods. However, NQG generates many more questions consisting of implicit coreference markers like "Where?" or "Who?", which can be less meaningful or not answerable, as also verified in Table 3.

Figure 3 (passage): Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters...

Case Study
In Figure 3, we show the output questions of our ReDR and NQG on an example from the CoQA dataset. For the first turn, both ReDR and NQG generate a meaningful and answerable question. For the second turn, NQG generates "What was it?", which is answerable and related to the conversation history but simpler than our question "What kind of house did she live?". For the third turn, NQG generates a coherent but less meaningful question "Why?", while our method generates "Was she alone?", which is very similar to the human-created question. For the last turn, NQG produces a question that is neither coherent nor answerable, while ReDR asks a much better question "Who else?".
To show the applicability of ReDR to generating QA-style conversations for arbitrary passages, we apply it to passages in the SQuAD reading comprehension dataset (Rajpurkar et al., 2016) and show an example in Figure 4. Since no rationales are provided in the dataset for generating consecutive questions, we first apply our rule-based rationale selection as introduced in Section 3.1 and then generate a question based on the selected rationale and the conversation history. The answers are predicted by our modified DrQA. Figure 4 shows that our generated questions are closely related to the passage, e.g., the first question contains "Monday" and the third one mentions "opening ceremony". Moreover, we can also generate interesting questions such as "Where?", which connects to previous questions and makes a coherent conversation.

Related Work
Question Generation. Generating questions from various kinds of sources, such as texts (Rus et al., 2010; Heilman and Smith, 2010; Mitkov and Ha, 2003; Du et al., 2017), search queries (Zhao et al., 2011), knowledge bases (Serban et al., 2016b) and images (Mostafazadeh et al., 2016), has attracted much attention recently. Our work is most related to previous work on generating questions from sentences or paragraphs. Most early approaches are based on rules and templates (Heilman and Smith, 2010; Mitkov and Ha, 2003), while Du et al. (2017) recently proposed to generate a question by a Sequence-to-Sequence neural network model with attention (Luong et al., 2015). Other approaches (e.g., Subramanian et al., 2017) take into account the answer information in addition to the given sentence or paragraph. Du and Cardie (2018) and Song et al. (2018) further modeled the surrounding paragraph-level information of the given sentence. However, most of this work focused on generating standalone questions solely based on a sentence or a paragraph. In contrast, this work explores conversational question generation and has to additionally consider the conversation history in order to generate a coherent question, making the task much more challenging.
Conversation Generation. Building chatbots and conversational agents has been pursued in much previous work (Ritter et al., 2011; Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2016a; Li et al., 2016a,b). Vinyals and Le (2015) used a Sequence-to-Sequence neural network for generating a response given the dialog history. Li et al. (2016a) further optimized the response diversity by maximizing the mutual information between inputs and output responses. Different from these works, where the response can be in any form (usually a declarative statement) and is generated solely based on the dialog history, our task is potentially more challenging as it additionally restricts the generated response to be a follow-up question about a given passage.
Conversational Question Answering (CQA). CQA aims to automatically answer a sequence of questions. It has been studied in the knowledge base setting (Saha et al., 2018; Iyyer et al., 2017) and is often framed as a semantic parsing problem. Recently released large-scale datasets (Reddy et al., 2018) enabled studying it in the textual setting, where the information source used to answer questions is a given passage, and they inspired much significant work (Zhu et al., 2018; Yatskar, 2018). However, collecting such datasets has heavily relied on human efforts and can be very costly. Based on one of the most popular datasets, CoQA (Reddy et al., 2018), we examine the possibility of automatically generating conversational questions, which can potentially reduce the data collection cost for CQA.

Conclusion
In this paper, we introduce the task of Conversational Question Generation (CQG), and propose a novel framework which achieves promising performance on the popular dataset CoQA. We incorporate a dynamic reasoning procedure into the general encoder-decoder model and dynamically update the encoding representations of the inputs. Moreover, we use the quality of the answers predicted by a QA model as rewards and fine-tune our model via reinforcement learning. In the future, we would like to explore how to better select the rationale for each question. Besides, it would also be interesting to consider using linguistic knowledge such as named entities or part-of-speech tags to improve the coherence of the conversation.