Interconnected Question Generation with Coreference Alignment and Conversation Flow Modeling

We study the problem of generating interconnected questions in question-answering style conversations. Compared with previous works which generate questions based on a single sentence (or paragraph), this setting is different in two major aspects: (1) Questions are highly conversational. Almost half of them refer back to conversation history using coreferences. (2) In a coherent conversation, questions have smooth transitions between turns. We propose an end-to-end neural model with coreference alignment and conversation flow modeling. The coreference alignment modeling explicitly aligns coreferent mentions in conversation history with corresponding pronominal references in generated questions, which makes generated questions interconnected to conversation history. The conversation flow modeling builds a coherent conversation by starting questioning on the first few sentences in a text passage and smoothly shifting the focus to later parts. Extensive experiments show that our system outperforms several baselines and can generate highly conversational questions. The code implementation is released at https://github.com/Evan-Gao/conversaional-QG.


Introduction
Question Generation (QG) aims to create humanlike questions from a range of inputs, such as natural language text (Heilman and Smith, 2010), knowledge bases (Serban et al., 2016) and images (Mostafazadeh et al., 2016). QG is helpful for knowledge testing in education, such as in intelligent tutoring systems, where an instructor can actively ask students questions about reading comprehension materials (Heilman and Smith, 2010; Du et al., 2017). Besides, raising good questions in a conversation can enhance the interactiveness and persistence of human-machine interactions. Recent works on question generation for knowledge testing are mostly formalized as a standalone interaction (Yuan et al., 2017; Song et al., 2018), while it is more natural for human beings to test knowledge or seek information through conversations involving a series of interconnected questions (Reddy et al., 2019). Furthermore, the inability of virtual assistants to ask questions based on previous discussions often leads to unsatisfying user experiences. In this paper, we consider a new setting called Conversational Question Generation (CQG). In this scenario, a system needs to ask a series of interconnected questions grounded in a passage through a question-answering style conversation. Table 1 provides an example under this scenario. In this dialogue, a questioner and an answerer chat about the passage. Every question after the first turn is dependent on the conversation history.
[Table 1: an example conversation from CoQA (Reddy et al., 2019). Each turn contains a question $Q_i$ and an answer $A_i$. Passage excerpt: "Incumbent Democratic President Bill Clinton was ineligible to serve a third term due to term limitations in the 22nd Amendment of the Constitution, and Vice President Gore was able to secure the Democratic nomination with relative ease. Bush was seen as the early favorite for the Republican [...]". The question-answer turns are not reproduced here.]
Considering that the goal of the task is to generate interconnected questions in conversational question answering, CQG is challenging in a few aspects. Firstly, a model should learn to generate conversational, interconnected questions depending on the conversation so far. As shown in Table 1, $Q_3$ is a single word 'Why?', which should be 'Why was he ineligible to serve a third term?' in a standalone interaction. Moreover, many questions in this conversation refer back to the conversation history using coreferences (e.g., $Q_2$, $Q_6$, $Q_9$), which is the nature of questions in a human conversation. Secondly, a coherent conversation must have smooth transitions between turns (each turn contains a question-answer pair). We expect that the narrative structure of passages can influence the conversation flow of our interconnected questions. We further investigate this point by conducting an analysis on our experiment dataset CoQA (Reddy et al., 2019). We first split passages and turns of QA pairs into 10 uniform chunks and identify passage chunks of interest for each turn chunk. Figure 1 portrays the conversation flow between passage chunks and turn chunks. We see in Figure 1 that a question-answering style conversation usually starts by focusing on the first few chunks in the passage and, as the conversation advances, the focus shifts to the later passage chunks.
Previous works on question generation employ attentional sequence-to-sequence models on the crowd-sourced machine reading comprehension dataset SQuAD (Rajpurkar et al., 2016). They mainly focus on generating questions based on a single sentence (or paragraph) and an answer phrase (Du et al., 2017; Sun et al., 2018; Zhao et al., 2018), while in our setting, our model needs to not only ask a question on the given passage (paragraph) but also make the questions conversational by considering the conversation history. Meanwhile, some researchers study question generation in dialogue systems to either achieve the correct answer through interactions (Li et al., 2017) or enhance the interactiveness and persistence of conversations. Although questions in our setting are conversational, our work is different from these because our conversations are grounded in the given passages rather than open-domain dialogues.
We propose a framework based on the attentional encoder-decoder model (Luong et al., 2015) to address this task. To generate conversational questions (first challenge), we propose a multi-source encoder to jointly encode the passage and the conversation so far. At each decoding timestep, our model can learn to focus more on the passage to generate content words or on the conversation history to make the question succinct. Furthermore, our coreference alignment modeling explicitly aligns coreferent mentions in conversation history (e.g. Clinton in $Q_1$ of Table 1) with corresponding pronominal references in generated questions (e.g. he in $Q_2$), which makes generated questions interconnected to conversation history. The coreference alignment is implemented by adding extra supervision to bias the attention probabilities through a loss function. The loss function explicitly guides our model to attend to the correct non-pronominal coreferent mentions in the attention distribution and generate the correct pronominal references in target questions. To make the conversations coherent (second challenge), we propose to model the conversation flow so that the focus inside the passage shifts smoothly across turns. The conversation flow modeling achieves this goal via a flow embedding and a flow loss. The flow embedding conveys the correlations between the number of turns and the narrative structure of passages. The flow loss explicitly encourages our model to focus on sentences that contain key information for generating the current turn question and to ignore sentences questioned several turns ago.
In evaluations on a conversational question answering dataset CoQA (Reddy et al., 2019), we find that our proposed framework outperforms several baselines in both automatic and human evaluations. Moreover, the coreference alignment can greatly improve the precision and recall of generated pronominal references. The conversation flow modeling can learn the smooth transition of conversation flow across turns.

Problem Setting
In this section, we define the Conversational Question Generation (CQG) task. Given a passage $P$, a conversation history $C_{i-1} = \{(Q_1, A_1), ..., (Q_{i-1}, A_{i-1})\}$ and the aspect to ask about (the current answer $A_i$), the task of CQG is to generate a question $\overline{Q}_i$ for the next turn:
$$\overline{Q}_i = \arg\max_{Q_i} \mathrm{Prob}(Q_i \mid P, A_i, C_{i-1}),$$
in which the generated question should be as conversational as possible.
Note that we formalize this setting as an answer-aware QG problem (Zhao et al., 2018), which assumes answer phrases are given before generating questions. Moreover, answer phrases are shown as text fragments in passages. Similar problems have been addressed by Du and Cardie (2018), Zhao et al. (2018) and Sun et al. (2018). Our problem setting can also be generalized to the answer-ignorant case: models can first identify which answers to ask about by incorporating question-worthy phrase extraction methods (Du and Cardie, 2018; Wang et al., 2019).

Multi-Source Encoder
Since a conversational question is dependent on a certain aspect of the passage $P$ and the conversation context $C_{i-1}$ so far, we jointly encode information from two sources via a passage encoder and a conversation encoder.
Passage Encoder. The passage encoder is a bidirectional LSTM (bi-LSTM) (Hochreiter and Schmidhuber, 1997), which takes the concatenation of word embeddings $w$ and answer position embeddings $a$ as input $x_i = [w_i; a_i]$. We denote the answer span using the typical BIO tagging scheme and map each token in the paragraph into the corresponding answer position embedding (i.e., B_ANS, I_ANS, O). Then the whole passage can be represented using the hidden states of the bi-LSTM encoder, i.e., $(h^p_1, ..., h^p_m)$, where $m$ is the sequence length.
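As a minimal sketch of the BIO answer tagging described above, each passage token receives one of the three tags before it is mapped to an answer position embedding. The function name `bio_answer_tags` and the half-open span convention are illustrative assumptions, not taken from the released code.

```python
def bio_answer_tags(num_tokens, start, end):
    """Return BIO answer tags for a passage of `num_tokens` tokens.

    The answer span is tokens[start:end] (end exclusive); this indexing
    convention is an assumption made for the sketch.
    """
    if not 0 <= start < end <= num_tokens:
        raise ValueError("answer span must lie inside the passage")
    tags = ["O"] * num_tokens
    tags[start] = "B_ANS"              # first answer token
    for k in range(start + 1, end):    # remaining answer tokens
        tags[k] = "I_ANS"
    return tags


tags = bio_answer_tags(6, start=2, end=4)
```

Each tag would then index a small learned embedding table whose output is concatenated with the word embedding.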
Conversation Encoder. The conversation history $C_{i-1}$ is a sequence of question-answer pairs $\{(Q_1, A_1), ..., (Q_{i-1}, A_{i-1})\}$. We use the segmenters <q> and <a> to concatenate each question-answer pair $(Q, A)$ into a sequence of tokens (<q>, $q_1, ..., q_m$; <a>, $a_1, ..., a_m$). We design a hierarchical structure to model the conversation history. We first employ a token-level bi-LSTM to get a contextualized representation of each question-answer pair, $(h^w_{i-k,1}, ..., h^w_{i-k,m})$, where $h^w_{i-k,j}$ is the hidden state of the $j$-th token in turn $i-k$. To model the dependencies across turns in the conversation history, we adopt a context-level bi-LSTM to learn the contextual dependency $(h^c_1, ..., h^c_{i-1})$ across different turns (denoted by the subscripts $1, ..., i-1$) of question-answer pairs.
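The input to the token-level encoder can be sketched as follows. The helper names and the example turns are hypothetical, and keeping only the last three turns is an assumption that mirrors the history length chosen in the implementation details.

```python
def flatten_turn(question_tokens, answer_tokens):
    """Join one question-answer pair with the <q> and <a> segmenters."""
    return ["<q>", *question_tokens, "<a>", *answer_tokens]


def build_history_input(turns, max_turns=3):
    """Return one token sequence per retained turn, oldest first.

    `turns` is a list of (question_tokens, answer_tokens) pairs; only the
    most recent `max_turns` pairs are kept (an assumption of this sketch).
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return [flatten_turn(q, a) for q, a in turns[-max_turns:]]


# Hypothetical two-turn history, used only for illustration.
history = [
    (["who", "was", "ineligible", "?"], ["Clinton"]),
    (["why", "?"], ["term", "limitations"]),
]
sequences = build_history_input(history)
```

Each returned sequence would be encoded by the token-level bi-LSTM, and the per-turn summaries would feed the context-level bi-LSTM.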

Decoder with Attention & Copy
The decoder is another LSTM that predicts the word probability distribution. At each decoding timestep $t$, it reads the word embedding $w_t$ and the hidden state of the previous timestep $h^d_{t-1}$ to generate the current hidden state $h^d_t = \mathrm{LSTM}(w_t, h^d_{t-1})$. To generate a conversational question grounded in the passage, the decoder itself should decide whether to focus more on the passage hidden states $h^p_j$ or on the hidden states of the conversation history $h^w_{i-k,j}$ at each decoding timestep. Therefore, we flatten the token-level conversation hidden states $h^w_{i,j}$ and aggregate the passage hidden states $h^p_j$ with them into a unified memory:
$$(h^p_1, ..., h^p_m;\; h^w_{1,1}, ..., h^w_{1,m};\; ...;\; h^w_{i-1,1}, ..., h^w_{i-1,m}),$$
where $h^w_{i,j}$ denotes the $j$-th token of the $i$-th turn in the token-level conversation hidden states. Then we attend to the unified memory with the standard attention mechanism (Luong et al., 2015) for the passage attention $(\alpha_1, ..., \alpha_m)$ and the hierarchical attention mechanism for the conversation attention $(\beta_{1,1}, ..., \beta_{1,m}; ...; \beta_{i-1,1}, ..., \beta_{i-1,m})$:
$$e^p_j = {h^p_j}^{\top} W_p h^d_t, \quad e^w_{i-k,j} = {h^w_{i-k,j}}^{\top} W_w h^d_t, \quad e^c_{i-k} = {h^c_{i-k}}^{\top} W_c h^d_t,$$
$$\alpha_j = e^p_j / e_{total}, \quad \beta_{i-k,j} = e^w_{i-k,j} * e^c_{i-k} / e_{total},$$
where $e_{total} = \sum_j e^p_j + \sum_{k,j} e^w_{i-k,j} * e^c_{i-k}$ and $W_p$, $W_w$, $W_c$ are learnable weights.
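Finally, we derive the context vector $c_t$ and the final vocabulary distribution $P_V$:
$$c_t = \sum_j \alpha_j h^p_j + \sum_{k,j} \beta_{i-k,j} h^w_{i-k,j},$$
$$P_V = \mathrm{softmax}(W_v \tanh(W_a [h^d_t; c_t])),$$
where $W_v$, $W_a$ are learnable weights. Please refer to See et al. (2017) for more details on the copy mechanism.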
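The joint normalization over the unified memory can be illustrated with a small pure-Python sketch. It takes precomputed raw scores as input, exponentiates them so that every weight is positive (an assumption of this sketch), and assumes the context vector is the attention-weighted sum of the unified memory; the function names are hypothetical.

```python
import math


def joint_attention(passage_scores, token_scores, turn_scores):
    """Normalize passage and conversation scores over one shared total.

    passage_scores: raw scores e^p_j, one per passage token.
    token_scores:   raw scores e^w_{k,j}, one list per history turn.
    turn_scores:    raw scores e^c_k, one per history turn.
    """
    e_p = [math.exp(s) for s in passage_scores]
    # Hierarchical part: token-level score times turn-level score.
    e_wc = [
        [math.exp(s) * math.exp(c) for s in turn]
        for turn, c in zip(token_scores, turn_scores)
    ]
    e_total = sum(e_p) + sum(sum(turn) for turn in e_wc)
    alpha = [e / e_total for e in e_p]
    beta = [[e / e_total for e in turn] for turn in e_wc]
    return alpha, beta


def context_vector(alpha, beta, passage_states, token_states):
    """Attention-weighted sum over the unified memory."""
    dim = len(passage_states[0])
    context = [0.0] * dim
    for weight, state in zip(alpha, passage_states):
        for d in range(dim):
            context[d] += weight * state[d]
    for turn_weights, turn_states in zip(beta, token_states):
        for weight, state in zip(turn_weights, turn_states):
            for d in range(dim):
                context[d] += weight * state[d]
    return context


alpha, beta = joint_attention([0.0, 0.0], [[0.0, 0.0]], [0.0])
c_t = context_vector(
    alpha, beta,
    passage_states=[[1.0, 0.0], [1.0, 0.0]],
    token_states=[[[0.0, 1.0], [0.0, 1.0]]],
)
```

Because both attention distributions share one normalizer, raising the scores of history tokens directly lowers the mass left for the passage, which is how the decoder trades off the two sources at each timestep.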

Coreference Alignment
Using coreferences to refer back is an essential property of conversational questions. Almost half of the questions in CoQA contain explicit coreference markers such as he, she, it (Reddy et al., 2019). Therefore, we propose coreference alignment to give our model this ability. Take $Q_2$ in Table 1 as an example: a traditional question generation system can only generate a question like "What was Clinton ineligible to serve?", while our system with coreference alignment can align the name "Clinton" to its pronominal reference "he" and generate the more conversational question "What was he ineligible to serve?".
The coreference alignment modeling tells the decoder to look at the correct non-pronominal coreferent mention in the conversation attention distribution to produce the pronominal reference word. We achieve this via two stages. In the pre-processing stage, given the conversation history $C_{i-1}$ and the question $Q_i$ which has a pronominal reference (e.g., he for $Q_2$ in Table 1), we first run a coreference resolution system (Clark and Manning, 2016) to find its coreferent mention $(w^c_1, ..., w^c_m)$ (e.g. Clinton) in the conversation history $C_{i-1}$, where the superscript $c$ denotes tokens identified as the coreferent mention. During training, we introduce a novel loss function built on the conversation attention of coreferent mentions $\beta^c_j$ and the output word probability of its pronominal reference word $p_{coref} \in P_V$. As shown in Figure 2, when our model needs to refer back to the coreferent mention, we ask the model to focus correctly on the antecedent (e.g. Clinton) and maximize the probability of its pronominal reference (e.g. he) $p_{coref}$ in the output vocabulary distribution $P_V$:
$$\mathcal{L}_{coref} = -\Big(\lambda_1 \log \frac{\sum_j \beta^c_j}{\sum_{k,j} \beta_{i-k,j}} + \lambda_2 \log p_{coref}\Big) * s_c,$$
where $\lambda_1$, $\lambda_2$ are hyperparameters and $s_c$ is the confidence score between the non-pronominal coreferent mention and the pronoun obtained during the pre-processing stage.
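One possible form of this loss for a single training example is sketched below. It assumes the attention term is the share of conversation attention that falls on the mention tokens, that both terms enter as negative logs, and that the confidence score scales the whole loss; the function name, the argument layout and the default weights of 1.0 are placeholders rather than the paper's settings.

```python
import math


def coref_alignment_loss(beta, mention_positions, p_coref, confidence,
                         lam1=1.0, lam2=1.0):
    """Coreference alignment loss for one pronoun prediction step.

    beta:              conversation attention, one list of weights per turn.
    mention_positions: (turn_index, token_index) pairs of the coreferent
                       mention found by the coreference resolver.
    p_coref:           model probability of the gold pronoun.
    confidence:        resolver confidence score for the mention-pronoun pair.
    """
    history_mass = sum(sum(turn) for turn in beta)
    mention_mass = sum(beta[k][j] for k, j in mention_positions)
    attention_term = math.log(mention_mass / history_mass)
    pronoun_term = math.log(p_coref)
    return -(lam1 * attention_term + lam2 * pronoun_term) * confidence


loss = coref_alignment_loss(
    beta=[[0.1, 0.3], [0.05, 0.05]],
    mention_positions=[(0, 1)],
    p_coref=0.5,
    confidence=1.0,
)
```

The loss falls as more of the history attention lands on the antecedent and as the pronoun becomes more probable, and low-confidence resolver outputs contribute proportionally less.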

Conversation Flow Modeling
Another key challenge in CQG is that a coherent conversation must have smooth transitions between turns. As illustrated in Figure 1, we find that as the conversations go on, most of the questioners shift their focus from the beginning of passages to the end. Following this direction, we model the conversation flow to learn smooth transitions across turns of the conversation.
Flow Embedding. As shown in Figure 2, we feed our model with the current turn number indicator in the conversation and the relative position of each token in the passage, which, intuitively, are useful for modeling the conversation flow. We achieve this goal via two additional embeddings. The turn number embedding is a learned lookup table $[t_1, ..., t_n]$ that maps the turn number $i$ into its feature embedding space, where $n$ is the maximum turn we consider. To encode the relative position of each token, we split the passage into $L$ uniform chunks. Each token in the passage is mapped to its corresponding chunk embedding $[c_1, ..., c_L]$. The final input to the passage encoder is the concatenation of the word embedding, the answer position embedding (introduced in Section 3.1) and these two additional embeddings:
$$x_i = [w_i; a_i; t_i; c_i].$$
We further add a gated self-attention mechanism (Zhao et al., 2018) in the passage encoder. Two desiderata motivate our use of self-attention. First, self-attention with the answer position embedding can aggregate answer-relevant information from the whole passage for question generation. Second, we want to learn the latent alignment between the turn number embedding and the chunk embedding for better modeling of the conversation flow. We first match the feature-enhanced passage representation $H^p = [h^p_1; ...; h^p_m]$ with itself ($h^p_j$) to compute the self-matching representation $u^p_j$, and then combine it with the original representation $h^p_j$:
$$a^p_j = \mathrm{softmax}({H^p}^{\top} W_s h^p_j), \quad u^p_j = H^p a^p_j, \quad f^p_j = \tanh(W_f [h^p_j; u^p_j]).$$
The final representation $\tilde{h}^p_j$ is derived via a gated summation through a learnable gate vector $g^p_j$:
$$g^p_j = \mathrm{sigmoid}(W_g [h^p_j; u^p_j]), \quad \tilde{h}^p_j = g^p_j \odot f^p_j + (1 - g^p_j) \odot h^p_j,$$
where $W_s$, $W_f$, $W_g$ are learnable weights and $\odot$ is the element-wise multiplication. The self-matching enhanced representation $\tilde{h}^p_j$ takes the place of the passage representation $h^p_j$ for calculating the passage attention.
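The two flow features reduce to integer ids that index the turn and chunk lookup tables. The sketch below shows one way to compute them; the function names, zero-based ids, and the choice to clip turns beyond the maximum to the last table entry are assumptions for illustration.

```python
def chunk_index(position, passage_len, num_chunks):
    """Map a zero-based token position to one of `num_chunks` uniform chunks."""
    if not 0 <= position < passage_len:
        raise ValueError("position must lie inside the passage")
    return position * num_chunks // passage_len


def flow_feature_ids(turn_number, passage_len, num_chunks, max_turn):
    """Return the turn id and per-token chunk ids for one example.

    `turn_number` is one-based; turns beyond `max_turn` share the last
    embedding (a clipping rule assumed for this sketch).
    """
    if turn_number < 1:
        raise ValueError("turn_number is one-based")
    turn_id = min(turn_number, max_turn) - 1
    chunk_ids = [chunk_index(p, passage_len, num_chunks)
                 for p in range(passage_len)]
    return turn_id, chunk_ids


turn_id, chunk_ids = flow_feature_ids(
    turn_number=12, passage_len=10, num_chunks=5, max_turn=9)
```

The same turn id is shared by every token of the passage, while the chunk id varies with position, so the encoder can learn which chunks tend to matter at which turn.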
Flow Loss. In Section 3.1, our answer position embedding can help model the conversation flow by showing the position of answer fragments inside the passage. However, it is still helpful to tell the model explicitly which sentences around the answer are of high informativity to generate the current turn question. The flow loss is designed to help our model to locate the evidence sentences correctly. Firstly, we define two kinds of sentences in the passage. If a sentence is informative to the current question, we call it Current Evidence Sentence (CES). If a sentence is informative to questions in the conversation history and irrelevant to the current question, we call it History Evidence Sentence (HES).
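Then our model is taught to focus on current evidence sentences and ignore the history evidence sentences in the passage attention $\alpha_j$ via the following flow loss:
$$\mathcal{L}_{flow} = -\lambda_3 \log \frac{\sum_{j: w_j \in CES} \alpha_j}{\sum_j \alpha_j} + \lambda_4 \frac{\sum_{j: w_j \in HES} \alpha_j}{\sum_j \alpha_j},$$
where $\lambda_3$, $\lambda_4$ are hyperparameters, and $w_j \in$ CES/HES indicates that the token $w_j$ is inside a sentence with a CES/HES label.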
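A minimal sketch of such a flow loss for one decoding step follows. It assumes the CES term is a negative log of the attention share on current evidence tokens and the HES term is a linear penalty on the share given to history evidence tokens; the function name, the default weights of 1.0, and the small `eps` guard against a zero share are placeholders added for the sketch.

```python
import math


def flow_loss(alpha, ces_mask, hes_mask, lam3=1.0, lam4=1.0, eps=1e-12):
    """Reward attention on CES tokens and penalize attention on HES tokens.

    alpha:    passage attention weights, one per token.
    ces_mask: True where the token lies in a Current Evidence Sentence.
    hes_mask: True where the token lies in a History Evidence Sentence.
    """
    total = sum(alpha)
    ces_share = sum(a for a, m in zip(alpha, ces_mask) if m) / total
    hes_share = sum(a for a, m in zip(alpha, hes_mask) if m) / total
    return -lam3 * math.log(ces_share + eps) + lam4 * hes_share


loss = flow_loss(
    alpha=[0.5, 0.25, 0.25],
    ces_mask=[True, False, False],
    hes_mask=[False, True, False],
)
```

Dividing by the total passage attention keeps the two shares comparable even when part of the overall attention mass is spent on the conversation history.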

Joint Training
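Considering all the aforementioned components, we define a joint loss function as:
$$\mathcal{L} = \mathcal{L}_{nll} + \mathcal{L}_{coref} + \mathcal{L}_{flow},$$
in which $\mathcal{L}_{coref}$ and $\mathcal{L}_{flow}$ are the coreference alignment loss and the flow loss, and $\mathcal{L}_{nll} = -\log \mathrm{Prob}(Q_i \mid P, A_i, C_{i-1})$ is the negative log-likelihood loss in sequence-to-sequence learning (Sutskever et al., 2014).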

Dataset Preparation
We conduct experiments on the CoQA dataset (Reddy et al., 2019). It is a large-scale conversational question answering dataset for measuring the ability of machines to participate in a question-answering style conversation. The authors employ Amazon Mechanical Turk to collect 8k conversations with 127k QA pairs. Specifically, they pair two crowd-workers, a questioner and an answerer, to chat about a passage. The answerers are asked to first highlight extractive spans in the passage as rationales and then write the free-form answers. We first extract each data sample as a quadruple of passage, question, answer and conversation history (previous n turns of QA pairs) from CoQA. Then we filter out QA pairs with yes, no or unknown as answers (28.7% of total QA pairs) because there is too little information to generate a question to the point. Finally, we randomly split the dataset into a training set (80%, 66298 samples), a validation set (10%, 8409 samples) and a testing set (10%, 8360 samples). The average passage, question and answer lengths are 332.9, 6.3 and 3.2 tokens respectively.
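The filtering and splitting steps could be sketched as below. The `answer` field name, the answer normalization, the fixed seed and the sample-level (rather than conversation-level) split are all assumptions of this sketch, not details confirmed by the text.

```python
import random

FILTERED_ANSWERS = {"yes", "no", "unknown"}


def filter_and_split(samples, seed=0):
    """Drop yes/no/unknown samples, then split 80/10/10 at random."""
    kept = [
        s for s in samples
        if s["answer"].strip().lower().rstrip(".") not in FILTERED_ANSWERS
    ]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_train = int(0.8 * len(kept))
    n_valid = int(0.1 * len(kept))
    train = kept[:n_train]
    valid = kept[n_train:n_train + n_valid]
    test = kept[n_train + n_valid:]
    return train, valid, test


# Toy samples with a hypothetical "answer" field, for illustration only.
toy = [{"answer": a} for a in ["yes", "No.", "unknown", "third term"]]
toy += [{"answer": "span %d" % k} for k in range(9)]
train, valid, test = filter_and_split(toy)
```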

Implementation Details
Locating Extractive Answer Spans. As studied by Yatskar (2018), abstractive answers in CoQA are mostly small modifications to spans occurring in the context. The maximum achievable performance by a model that predicts spans from the context is 97.8 F1 score. Therefore, we find the extractive spans from the passage which have the maximum F1 score with answers and treat them as answers for our answer position embedding.
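A brute-force sketch of this span search is shown below: it scores every contiguous passage span against the free-form answer with token-level F1 and keeps the best one. The tokenization, the optional `max_len` cap and the tie-breaking rule (first span found wins) are assumptions of the sketch.

```python
from collections import Counter


def token_f1(pred_tokens, gold_tokens):
    """Token-overlap F1 between two token lists."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_extractive_span(passage, answer, max_len=None):
    """Return (f1, start, end) of the passage span closest to `answer`.

    The span is passage[start:end]; `max_len` optionally caps span length
    to keep the quadratic search cheap on long passages.
    """
    limit = len(passage) if max_len is None else max_len
    best = (0.0, 0, 0)
    for start in range(len(passage)):
        for end in range(start + 1, min(len(passage), start + limit) + 1):
            score = token_f1(passage[start:end], answer)
            if score > best[0]:
                best = (score, start, end)
    return best


passage = "he was ineligible to serve a third term".split()
result = best_extractive_span(passage, ["third", "term"])
```

The returned indices can then be passed to the BIO tagging step to build the answer position embedding input.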
Number of Turns in Conversation History. Reddy et al. (2019) find that in CoQA dataset, most questions in a conversation have a limited dependency within a bound of two turns. Therefore, we choose the number of history turns as n = 3 to ensure the target questions have enough conversation history information to generate and avoid introducing too much noise from all turns of QA pairs.
Labeling Evidence Sentences. As mentioned in Section 4.1, the crowd-workers label the extractive spans in the passage as rationales for actual answers. We treat sentences containing the rationale as Current Evidence Sentence.
Model Settings. We employ teacher-forcing training. In the generating stage, we set the maximum length of the output sequence to 15, block repeated unigrams, and set the beam size k to 5. All hyperparameters and models are selected on the validation set and the results are reported on the test set.

Baselines and Ablations
We compare with the state-of-the-art baselines and conduct ablations as follows: PGNet is the pointer-generator network (See et al., 2017). We concatenate the passage P , the conversation history C i−1 and the current answer A i as a sequence for the input. NQG (Du and Cardie, 2018) is similar to the previous one but it takes current answer features concatenated with the word embeddings during encoding. MSNet is our base model Multi-Source encoder decoder network (Section 3.1 & 3.2). CorefNet is our proposed Coreference alignment model (Section 3.3). FlowNet is our proposed conversation Flow model (Section 3.4). CFNet is the model with both the Coreference alignment and the conversation Flow modeling.

Main Results
Since the average length of questions is 6.3 tokens only, we employ BLEU (1-3) (Papineni et al., 2002) and ROUGE-L (R-L) (Lin, 2004) to evaluate n-gram similarity between the generated questions and the ground truth. We evaluate baselines and our models by predicting the current question given a passage, the current answer, and the ground truth conversation history. Table 2 shows the main results, and we have the following observations:
• NQG outperforms PGNet by a large margin. The improvement shows that the answer position embedding (Zhou et al., 2017) is helpful for asking questions to the point.
• Our base model MSNet outperforms NQG, which reveals that the hierarchical encoding and the hierarchical attention to conversation history can model the dependency across different turns in conversations.
• Both our CorefNet and FlowNet outperform our base model. We will analyze the effectiveness of our coreference alignment and conversation flow modeling in the following two sections respectively.
• Our CFNet is significantly better than the two baselines (PGNet, NQG), our MSNet, and our CorefNet. However, the difference between our CFNet and our FlowNet is not significant. This is because the conversation flow modeling improves all test samples while the coreference alignment contributes only to questions containing pronominal references.

Coreference Alignment Analysis
As we discussed in Section 3.3, it is the nature of conversational questions to use coreferences to refer back. In order to demonstrate the effectiveness of the proposed coreference alignment, we evaluate models on a subset of the test set called the coreference set. Each sample in the coreference set requires a pronoun resolution between the conversation history and the current question (e.g., $Q_2$, $Q_6$ in Table 1). In addition to the BLEU (1-3) and ROUGE-L metrics, we also calculate the Precision (P), Recall (R) and F-score (F) of pronouns in the generated questions with regard to pronouns in the ground truth questions.
[Table 3: Evaluation results on the coreference test set. Precision (P), Recall (R) and F-score (F) of predicted pronouns are also reported. Significance tests with t-test are conducted between CorefNet and models without the coreference alignment (underline: p-value < 0.05, *: p-value < 0.01).]
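The pronoun-level precision, recall and F-score can be computed along the following lines. The pronoun inventory, the bag-of-pronouns matching per question and the micro-averaging over the whole set are assumptions of this sketch; the exact procedure used for Table 3 is not specified in the text.

```python
from collections import Counter

# Illustrative pronoun inventory; the list actually used is not given.
PRONOUNS = {"he", "she", "it", "they", "him", "her",
            "his", "its", "them", "their"}


def pronoun_prf(generated, references):
    """Micro-averaged pronoun precision, recall and F-score.

    `generated` and `references` are parallel lists of token lists.
    """
    matched = n_generated = n_reference = 0
    for gen, ref in zip(generated, references):
        gen_counts = Counter(t.lower() for t in gen if t.lower() in PRONOUNS)
        ref_counts = Counter(t.lower() for t in ref if t.lower() in PRONOUNS)
        matched += sum((gen_counts & ref_counts).values())
        n_generated += sum(gen_counts.values())
        n_reference += sum(ref_counts.values())
    precision = matched / n_generated if n_generated else 0.0
    recall = matched / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


scores = pronoun_prf(
    generated=[["what", "did", "he", "do", "?"], ["where", "was", "it", "?"]],
    references=[["what", "did", "he", "do", "?"],
                ["where", "was", "she", "born", "?"]],
)
```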
The results are depicted in Table 3. With the help of the coreference alignment, CorefNet significantly improves the precision, recall, and F-score of the predicted pronouns. Moreover, the performance on n-gram overlapping metrics is also boosted. To gain more insight into how the coreference alignment model influences the generation process, in Figure 3 we visualize the conversation attention distribution $\beta_j$ at the timestep when the model predicts a pronoun. The conversation attention distribution $\beta_j$ is renormalized so that $\sum_j \beta_j = 1$. Both examples show that our model puts the highest attention probability on the coreferent mentions (i.e. McCain/Clinton) when it generates the pronominal references (his/he). We can conclude that our coreference alignment model can align correct coreferent mentions to generate the corresponding pronouns.

Conversation Flow Modeling Analysis
As discussed in Section 3.4, a coherent conversation should have smooth transitions between turns, and we design our model to follow the narrative structure of the passage. Figure 4 shows an example illustrating the transition of the passage attention distribution $\alpha_j$ (normalized to 1) during the first 11 turns of a conversation. We see that the model shifts its focus smoothly across the first 11 turns from the first sentence in the passage to later parts. Sometimes the model drills down with two questions for the same sentence, such as turns 2 & 3, 4 & 5 and 10 & 11.
To quantitatively validate the effectiveness of our conversation flow modeling, we study the alignment between the passage attention $\alpha_j$ and sentences of interest in the passage. Ideally, a successful model should focus on sentences of interest (i.e., Current Evidence Sentence) and ignore sentences questioned several turns ago (i.e., History Evidence Sentence). We validate this intuition by calculating $\sum_{j: w_j \in CES} \alpha_j$ and $\sum_{j: w_j \in HES} \alpha_j$ for all examples in the test set. Results show that $\sum_{j: w_j \in CES} \alpha_j$ and $\sum_{j: w_j \in HES} \alpha_j$ for our model with conversation flow modeling are 0.9966 and 0.0010 on average, which demonstrates that our conversation flow modeling can locate the current evidence sentences precisely and ignore the history evidence sentences. For the model without the flow modeling (CorefNet), $\sum_{j: w_j \in CES} \alpha_j = 0.4093$ and $\sum_{j: w_j \in HES} \alpha_j = 0.1778$, which supports our intuition in Section 3.4 that the answer position embedding alone cannot have comparable effects on the conversation flow modeling.

Human Evaluation
We randomly sample 93 questions with the associated passage and conversation history to conduct human evaluation. We hire 5 workers to evaluate the questions generated by PGNet, MSNet, and our CFNet. All models are evaluated in terms of the following 3 metrics: "Grammaticality", "Answerability" and "Interconnectedness". "Grammaticality" measures the grammatical correctness and fluency of the generated questions. "Answerability" evaluates whether the generated question can be answered by the current answer. "Interconnectedness" measures whether the generated questions are conversational or not. If a question refers back to the conversation history using coreference or is dependent on the conversation history, such as the incomplete questions 'Why?' and 'Of what?', we define it as a conversational question. All metrics are rated on a 1-3 scale (3 for the best). The results are shown in Table 4. All models achieve high scores on "Grammaticality", owing to the strong language modeling capability of neural models. MSNet and our CFNet perform well on "Answerability" while PGNet does not. This demonstrates that our base model MSNet and our CFNet can ask questions to the point. Finally, our CFNet outperforms the other two models in terms of "Interconnectedness" by a large gap, which indicates that the proposed coreference alignment and conversation flow modeling can effectively make questions conversational.

Related Work
The task of Question Generation (QG) aims at generating natural questions from given input contexts. Some template-based approaches (Vanderwende, 2007; Heilman and Smith, 2010) were proposed initially, where well-designed rules and heavy human labor are required for declarative-to-interrogative sentence transformation. With the rise of data-driven learning approaches and the sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014), Du et al. (2017) first formulate QG as a seq2seq problem with an attention mechanism. They extract sentences and pair them with questions from SQuAD (Rajpurkar et al., 2016), a large-scale reading comprehension dataset. Recent works along this line focus on how to utilize the answer information better to generate questions to the point (Zhou et al., 2017; Gao et al., 2019b; Sun et al., 2018), how to generate questions with specific difficulty levels (Gao et al., 2019a) and how to effectively use the contexts in paragraphs to generate questions that cover context beyond a single sentence (Zhao et al., 2018; Du and Cardie, 2018).
In parallel to question generation for reading comprehension, some researchers have recently investigated question generation in dialogue systems. Li et al. (2017) show that asking questions through interactions can elicit useful feedback for reaching the correct answer. Other work considers asking questions in open-domain conversational systems with typed decoders to enhance the interactiveness and persistence of conversations.
In this paper, we propose a new setting which is related to the above two lines of research. We consider asking questions grounded in a passage via a question-answering style conversation. Since the questions and answers are in the format of a conversation, questions in our setting are highly conversational and interconnected to conversation history. This setting is challenging because we need to jointly model the attention shifting in the passage and the structure of a conversation (Grosz and Sidner, 1986). A limitation of the conversation in our setting is that we can only generate a series of interconnected questions according to predefined answers, whereas in a real dialog the questioner can ask different questions according to the answerer's responses.

Conclusion and Future Work
In this paper, we study the problem of question-answering style Conversational Question Generation (CQG), which, to our knowledge, has not been investigated before. We propose an end-to-end neural model with coreference alignment and conversation flow modeling to solve this problem. The coreference alignment enables our framework to refer back to the conversation history using coreferences. The conversation flow modeling builds a coherent conversation between turns. Experiments show that our proposed framework achieves the best performance in automatic and human evaluations.
There are several future directions for this setting. First, the presented system is still contingent on highlighting answer-like nuggets in the declarative text. Integrating answer span identification into the presented system is a promising direction. Second, in our setting, the roles of the questioner and the answerer are fixed. However, questions can be raised by either party in real scenarios.