Reinforced Multi-task Approach for Multi-hop Question Generation

Question generation (QG) attempts to solve the inverse of the question answering (QA) problem by generating a natural language question given a document and an answer. While sequence-to-sequence neural models surpass rule-based systems for QG, they are limited in their capacity to focus on more than one supporting fact. For QG, we often require multiple supporting facts to generate high-quality questions. Inspired by recent work on multi-hop reasoning in QA, we take up multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context. We employ multi-task learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator. In addition, we propose a question-aware reward function in a Reinforcement Learning (RL) framework to maximize the utilization of the supporting facts. We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset HotPotQA. Empirical evaluation shows that our model outperforms single-hop neural question generation models on both automatic evaluation metrics such as BLEU, METEOR, and ROUGE, and human evaluation metrics for quality and coverage of the generated questions.


Introduction
In natural language processing (NLP), question generation is considered to be an important yet challenging problem. Given a passage and answer as inputs to the model, the task is to generate a semantically coherent question for the given answer. In the past, question generation has been tackled using rule-based approaches such as question templates (Lindberg et al., 2013) or utilizing named entity information and predictive argument structures of sentences (Chali and Hasan, 2015). Recently, neural-based approaches have accomplished impressive results (Du et al., 2017; Sun et al., 2018; Kim et al., 2018) for the task of question generation. The availability of large-scale machine reading comprehension datasets such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), MSMARCO (Nguyen et al., 2016), etc. has facilitated research in the question answering task. The SQuAD (Rajpurkar et al., 2016) dataset itself has been the de facto choice for most of the previous works in question generation. However, 90% of the questions in SQuAD can be answered from a single sentence (Min et al., 2018); hence, former QG systems trained on SQuAD are not capable of distilling and utilizing information from multiple sentences. Recently released multi-hop datasets such as QAngaroo (Welbl et al., 2018), ComplexWebQuestions (Talmor and Berant, 2018) and HotPotQA are more suitable for building QG systems that are required to gather and utilize information across multiple documents as opposed to a single paragraph or sentence.

Single-hop question (SHQ)
Document: A few sects, such as the Bishnoi, lay special emphasis on the conservation of particular species, such as the antelope. (ii) ...
Question (SHQ): Who lay special emphasis on conservation of particular species?

Multi-hop question (MHQ)
Document (1): (i) Stig Lennart Blomqvist (born 29 July 1946) is a Swedish rally driver. (ii) ... (iii) Driving an Audi Quattro for the Audi factory team, Blomqvist won the World Rally Championship drivers' title in 1984 and finished runner-up in 1985.
Document (2): (i) The Audi Quattro is a road and rally car, produced by the German automobile manufacturer Audi, part of the Volkswagen Group. (ii) ...
Question (MHQ): Which car produced by German automobile manufacturer, was driven by Stig Lennart Blomqvist?

Table 1: An example of Single-hop question (SHQ) from the SQuAD dataset and a Multi-hop Question (MHQ) from the HotPotQA dataset. The relevant sentences and answer required to form the question are highlighted in blue and red respectively.

In multi-hop question answering, one has to reason over multiple relevant sentences from different paragraphs to answer a given question. We refer to these relevant sentences as supporting facts in the context. Hence, we frame multi-hop question generation as the task of generating the question conditioned on the information gathered from reasoning over all the supporting facts across multiple paragraphs/documents. Since this task requires assembling and summarizing information from multiple relevant documents, in contrast to a single sentence/paragraph, it is more challenging than the existing single-hop QG task. Further, the presence of irrelevant information makes it difficult to capture the supporting facts required for question generation. Explicit information about the supporting facts in the document is often not readily available, which makes the task more complex. In this work, we provide an alternative way to obtain the supporting facts information from the document with the help of multi-task learning. Table 1 gives examples from the SQuAD and HotPotQA datasets. It is clear from the examples that the single-hop question is formed by focusing on a single sentence/document and answer, while in the multi-hop question, multiple supporting facts from different documents and the answer are accumulated to form the question.
Multi-hop QG has real-world applications in several domains, such as education and chatbots. The questions generated from the multi-hop approach will inspire critical thinking in students by encouraging them to reason over the relationship between multiple sentences to answer correctly. Specifically, solving these questions requires higher-order cognitive skills (e.g., applying, analyzing). Therefore, forming challenging questions is crucial for evaluating a student's knowledge and stimulating self-learning. Similarly, in goal-oriented chatbots, multi-hop QG is an important skill, e.g., for initiating conversations and for asking for and providing detailed information to the user by considering multiple sources of information. In contrast, in single-hop QG, only a single source of information is considered during generation.
In this paper, we propose to tackle the multi-hop QG problem in two stages. In the first stage, we learn a supporting-facts-aware encoder representation by training supporting fact prediction jointly with question generation; in the second stage, we enforce the utilization of these supporting facts. The former is achieved by sharing the encoder weights with an answer-aware supporting facts prediction network, trained jointly in a multi-task learning framework. The latter objective is formulated as a question-aware supporting facts prediction reward, which is optimized alongside the supervised sequence loss. Additionally, we observe that the multi-task framework offers substantial improvements in question generation performance and avoids including information from noisy sentences in the generated question, while reinforcement learning (RL) yields more complete and complex questions than a QG model optimized only with maximum likelihood estimation (MLE).
Our main contributions in this work are: (i) We introduce the problem of multi-hop question generation and propose a multi-task training framework to condition the shared encoder with supporting facts information. (ii) We formulate a novel reward function, the multihop-enhanced reward, via question-aware supporting fact predictions to enforce the maximum utilization of supporting facts to generate a question. (iii) We introduce an automatic evaluation metric to measure the coverage of supporting facts in the generated question. (iv) Empirical results show that our proposed method outperforms the current state-of-the-art single-hop QG models over several automatic and human evaluation metrics on the HotPotQA dataset.

Related Work
Question generation literature can be broadly divided into two classes based on the features used for generating questions. The first class consists of rule-based approaches (Heilman and Smith, 2010; Chali and Hasan, 2015) that rely on human-designed features, such as named-entity information, to leverage the semantic information from a context for question generation. In the second class, question generation is treated as a sequence-to-sequence learning problem, which involves automatic learning of useful features from the context by leveraging the sheer volume of training data. The first neural encoder-decoder model for question generation was proposed in Du et al. (2017). However, this work does not take the answer information into consideration while generating the question. Thereafter, several neural-based QG approaches (Sun et al., 2018; Zhao et al., 2018; Chen et al., 2018) have been proposed that utilize the answer position information and copy mechanism. Wang et al. (2017a) demonstrated an appreciable improvement in the performance of the QG task when trained in a multi-task learning framework.
The models proposed by Seo et al. (2017b) and Weissenborn et al. (2017) for single-document QA experience a significant drop in accuracy when applied in multiple-document settings. This shortcoming of single-document QA datasets is addressed by newly released multi-hop datasets (Welbl et al., 2018; Talmor and Berant, 2018) that promote multi-step inference across several documents. So far, multi-hop datasets have been predominantly used for answer generation tasks (Seo et al., 2017a; Tay et al., 2018). Our work can be seen as an extension of single-hop question generation to the setting where a non-trivial number of supporting facts are spread across multiple documents.

Proposed Approach
Problem Statement: In multi-hop question generation, we consider a document list $L$ with $n_L$ documents and an $m$-word answer $A$. Let the total number of words in all the documents $D_i \in L$ combined be $N$. Let the document list $L$ contain a total of $K$ candidate sentences $CS = \{S_1, S_2, \ldots, S_K\}$ and a set of supporting facts $SF$ such that $SF \subseteq CS$. Our task is to generate an $n_Q$-word question sequence $\hat{Q} = \{y_1, y_2, \ldots, y_{n_Q}\}$ whose answer is based on the supporting facts $SF$ in document list $L$. Our proposed model for multi-hop question generation is depicted in Figure 1.

Multi-Hop Question Generation Model
In this section, we discuss the various components of our proposed Multi-Hop QG model. Our proposed model has four components: (i) the Document and Answer Encoder, which encodes the list of documents and the answer to further generate the question; (ii) Multi-task Learning, which helps the QG model automatically select the supporting facts to generate the question; (iii) the Question Decoder, which generates questions using the pointer-generator mechanism; and (iv) the MultiHop-Enhanced QG component, which forces the model to generate those questions which can maximize the supporting facts prediction based reward.

Document and Answer Encoder
The encoder of the Multi-Hop QG model encodes the answer and documents using a layered Bi-LSTM network.

Answer Encoding:
We introduce an answer tagging feature that encodes the relative position information of the answer in a list of documents. The answer tagging feature is a list of length $N$, where each element has a tag value of either 0 or 1. Elements that correspond to the words in the answer text span have a tag value of 1; otherwise the tag value is 0. We map these tags to embeddings of dimension $d_1$. We represent the answer encoding features using $\{a_1, \ldots, a_N\}$.
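The answer tagging step can be illustrated with a minimal Python sketch. The function name `answer_tags` and the choice to tag every exact occurrence of the answer span are assumptions made for illustration; the text above only specifies that answer words receive tag 1 and all other words tag 0.

```python
def answer_tags(doc_tokens, answer_tokens):
    """Return one 0/1 tag per document token; 1 marks tokens inside an answer span.

    Assumption: every exact occurrence of the answer span is tagged.
    """
    n, m = len(doc_tokens), len(answer_tokens)
    tags = [0] * n
    if m == 0:
        return tags
    for start in range(n - m + 1):
        # Compare the window of m tokens starting at `start` with the answer.
        if doc_tokens[start:start + m] == answer_tokens:
            for k in range(start, start + m):
                tags[k] = 1
    return tags


doc = "the audi quattro is a road and rally car".split()
tags = answer_tags(doc, ["audi", "quattro"])
print(tags)  # [0, 1, 1, 0, 0, 0, 0, 0, 0]
```

In a full model, each tag would then index a two-row embedding table to produce the answer encoding features.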
Hierarchical Document Encoding: To encode the document list $L$, we first concatenate all the documents $D_k \in L$, resulting in a list of $N$ words. Each word in this list is then mapped to a $d_2$-dimensional word embedding $u \in \mathbb{R}^{d_2}$. We then concatenate the document word embeddings with the answer encoding features and feed them to a bi-directional LSTM encoder $\{\mathrm{LSTM}_{fwd}, \mathrm{LSTM}_{bwd}\}$:
$$\overrightarrow{z}_t = \mathrm{LSTM}_{fwd}(\overrightarrow{z}_{t-1}, [u_t; a_t]), \qquad \overleftarrow{z}_t = \mathrm{LSTM}_{bwd}(\overleftarrow{z}_{t+1}, [u_t; a_t]) \tag{1}$$
We compute the forward hidden states $\overrightarrow{z}_t$ and the backward hidden states $\overleftarrow{z}_t$ and concatenate them to get the final hidden state $z_t = [\overrightarrow{z}_t; \overleftarrow{z}_t]$.
The answer-aware supporting facts prediction network (introduced shortly) takes the encoded representation as input and predicts whether each candidate sentence is a supporting fact or not. We represent the predictions with $p_1, p_2, \ldots, p_K$. Similar to answer encoding, we map each prediction $p_i$ to a vector $v_i$ of dimension $d_3$.
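A candidate sentence $S_i$ contains $n_i$ words. In a given document list $L$, we have $K$ candidate sentences such that $\sum_{i=1}^{K} n_i = N$. We generate the supporting fact encoding $sf_i \in \mathbb{R}^{n_i \times d_3}$ for the candidate sentence $S_i$ as follows:
$$sf_i = e_{n_i} v_i^{\top} \tag{2}$$
where $e_{n_i} \in \mathbb{R}^{n_i}$ is a vector of 1s. The rows of $sf_i$ denote the supporting fact encoding of the words present in the candidate sentence $S_i$. We denote the supporting facts encoding of a word $w_t$ in the document list $L$ with $s_t \in \mathbb{R}^{d_3}$. Since we also deal with answer-aware supporting facts prediction in a multi-task setting, we introduce another Bi-LSTM layer to obtain a supporting-facts-induced encoder representation. This layer takes the first-layer hidden state $z_t$ together with the answer encoding $a_t$ and the supporting fact encoding $s_t$ as input:
$$\overrightarrow{h}_t = \mathrm{LSTM}_{fwd}(\overrightarrow{h}_{t-1}, [z_t; a_t; s_t]), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_{bwd}(\overleftarrow{h}_{t+1}, [z_t; a_t; s_t]) \tag{3}$$
Similar to the first encoding layer, we concatenate the forward and backward hidden states to obtain the final hidden state representation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.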
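The supporting fact encoding amounts to repeating each sentence-level vector once per word of that sentence. The sketch below illustrates this in plain Python; the function name, the binarized predictions, and the embedding values are hypothetical and chosen only for illustration.

```python
def supporting_fact_encoding(sentence_lengths, predictions, tag_embedding):
    """Broadcast each sentence-level vector v_i to the n_i words of sentence S_i.

    Assumption: predictions are already binarized (0 or 1), and
    `tag_embedding` maps each value to a d_3-dimensional vector.
    """
    rows = []
    for n_i, p_i in zip(sentence_lengths, predictions):
        v_i = tag_embedding[p_i]
        # Repeating v_i for each of the n_i words is the outer product
        # of a length-n_i vector of ones with v_i.
        rows.extend(list(v_i) for _ in range(n_i))
    return rows


emb = {0: [0.0, 0.0, 0.0], 1: [0.1, 0.2, 0.3]}  # hypothetical 3-d embeddings
s = supporting_fact_encoding([2, 3], [1, 0], emb)
print(len(s))  # 5 word-level encodings, one per word in the document list
```

The resulting word-level list has one row per word, so it can be concatenated position by position with the other encoder inputs.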

Multi-task Learning
We introduce the task of answer-aware supporting facts prediction to condition the QG model's encoder with the supporting facts information. Multi-task learning helps the QG model automatically select the supporting facts conditioned on the given answer. This is achieved by using a multi-task learning framework where the answer-aware supporting facts prediction network and Multi-hop QG share a common document encoder (Section 3.1.1). The network takes the encoded representation of each candidate sentence $S_i \in CS$ as input and produces sentence-wise predictions for the supporting facts. More specifically, we concatenate the first and last hidden state representations of each candidate sentence from the encoder outputs and pass them through a fully-connected layer that outputs a Sigmoid probability for the sentence to be a supporting fact. The architecture of this network is illustrated in Figure 1 (left). This network is then trained with a binary cross-entropy loss and the ground-truth supporting facts labels:
$$L_{sp} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{S} \sum_{j=1}^{S} \left[ \delta_i^j \log p_i^j + (1 - \delta_i^j) \log (1 - p_i^j) \right]$$
where $N$ is the number of document lists, $S$ is the number of candidate sentences in a particular training example, and $\delta_i^j$ and $p_i^j$ represent the ground-truth supporting facts label and the output Sigmoid probability, respectively.
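As a rough illustration of the supporting facts prediction loss, the following Python sketch computes a binary cross-entropy over sentence-level probabilities. Averaging over sentences and over examples, the clamping constant `eps`, and the function name are assumptions of this sketch.

```python
import math


def supporting_fact_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy over candidate sentences.

    labels, probs: one inner list per training example, one entry per
    candidate sentence. Assumption: the loss is averaged over sentences
    and then over examples.
    """
    total = 0.0
    for delta_i, p_i in zip(labels, probs):
        per_example = 0.0
        for d, p in zip(delta_i, p_i):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            per_example += d * math.log(p) + (1 - d) * math.log(1.0 - p)
        total += per_example / len(delta_i)
    return -total / len(labels)


# Two candidate sentences, the first is a supporting fact.
loss = supporting_fact_loss([[1, 0]], [[0.5, 0.5]])
```

With uninformative probabilities of 0.5 the loss equals log 2, and it approaches zero as the predictions approach the labels.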

Question Decoder
We use an LSTM network with a global attention mechanism (Luong et al., 2015) to generate the question $\hat{Q} = \{y_1, y_2, \ldots, y_m\}$ one word at a time. We use a copy mechanism (See et al., 2017; Gulcehre et al., 2016) to deal with rare or unknown words. At each timestep $t$, the attention distribution $\alpha_t$ over the encoder hidden states and the context vector $c_t$ are obtained following the global attention formulation. The probability distribution over the question vocabulary, $P_{vocab}$, is then computed using a weight matrix $W_q$. The probability $P_{gen}$ of picking (generating) a word from the fixed vocabulary, i.e., the probability of not copying a word from the document list $L$ at a given timestep $t$, is computed with the weight matrices $W_a$ and $W_b$ followed by the Sigmoid function $\sigma$. The probability distribution over the words in the document is computed by summing over all the attention scores of the corresponding words:
$$P_{copy}(w) = \sum_{i=1}^{N} \alpha_{t,i} \, \mathbb{1}\{w == w_i\}$$
where $\mathbb{1}\{w == w_i\}$ denotes the vector of length $N$ having the value 1 where $w == w_i$, and 0 otherwise. The final probability distribution over the dynamic vocabulary (document and question vocabulary) is calculated as follows:
$$P(w) = P_{gen} \cdot P_{vocab}(w) + (1 - P_{gen}) \cdot P_{copy}(w) \tag{10}$$
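The mixing of the generation and copy distributions can be sketched in a few lines of Python. The dictionary-based representation, the function name, and the toy numbers are illustrative assumptions; an actual model would operate on tensors.

```python
def final_distribution(p_gen, p_vocab, attention, doc_tokens):
    """Mix the vocabulary and copy distributions with weight p_gen.

    p_vocab: dict word -> probability over the fixed vocabulary.
    attention: one attention weight per document token.
    """
    # P_copy(w): sum the attention weights of every position holding w.
    p_copy = {}
    for weight, token in zip(attention, doc_tokens):
        p_copy[token] = p_copy.get(token, 0.0) + weight
    words = set(p_vocab) | set(p_copy)
    return {
        w: p_gen * p_vocab.get(w, 0.0) + (1.0 - p_gen) * p_copy.get(w, 0.0)
        for w in words
    }


p_vocab = {"which": 0.6, "car": 0.4}   # toy vocabulary distribution
attention = [0.5, 0.3, 0.2]            # toy attention over three tokens
doc = ["audi", "car", "audi"]
dist = final_distribution(0.8, p_vocab, attention, doc)
```

Here "audi" is absent from the fixed vocabulary but still receives probability mass through the copy term, which is how rare or unknown words can be produced.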

MultiHop-Enhanced QG
We introduce a reinforcement learning based reward function and sequence training algorithm to train the RL network. The proposed reward function forces the model to generate those questions which can maximize the reward.
MultiHop-Enhanced Reward (MER): Our reward function is a neural network, which we call the Question-Aware Supporting Fact Prediction network. This network takes as inputs the list of documents $L$ and the generated question $\hat{Q}$, and predicts the supporting fact probability for each candidate sentence. This model subsumes the latest technical advances of question answering, including character-level models, self-attention (Wang et al., 2017b), and bi-attention (Seo et al., 2017b). The network architecture of the supporting facts prediction model is shown in Figure 1 (right). For each candidate sentence in the document list, we concatenate the output of the self-attention layer at the first and last positions, and use a binary linear classifier to predict the probability that the current sentence is a supporting fact. This network is pre-trained for the supporting fact prediction task on the HotPotQA dataset using a binary cross-entropy loss.
For each generated question, we compute the F1 score (as a reward) between the ground-truth supporting facts and the predicted supporting facts. This reward must be used carefully because the QG model can cheat by greedily copying words from the supporting facts to the generated question. In this case, even though a high MER is achieved, the model loses its question generation ability. To handle this situation, we regularize this reward function with an additional Rouge-L reward, which discourages greedily copying words from the supporting facts by ensuring content matching between the ground-truth and generated questions. We also experiment with BLEU as an additional reward, but the Rouge-L reward is shown to outperform the BLEU reward.
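The reward described above can be approximated with the following Python sketch. The set-based F1, the balanced-F form of Rouge-L, and in particular the weighted sum in `combined_reward` (with its hypothetical `weight` parameter) are assumptions; the text does not state how the two rewards are combined.

```python
def supporting_fact_f1(predicted, gold):
    """F1 between predicted and ground-truth supporting fact ids."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Simplified Rouge-L: balanced F-measure over the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)


def combined_reward(pred_sf, gold_sf, question, reference, weight=0.5):
    """Assumed combination: weighted sum of the SF F1 and Rouge-L rewards."""
    return (weight * supporting_fact_f1(pred_sf, gold_sf)
            + (1 - weight) * rouge_l(question, reference))


reward = combined_reward({0, 2}, {0, 1}, "a b c d".split(), "a c d e".split())
```

A question that merely copies supporting fact words may score well on the first term but poorly on the second, which is the regularizing effect the Rouge-L reward is meant to provide.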
Adaptive Self-critical Sequence Training: We use the REINFORCE (Williams, 1992) algorithm to learn the policy defined by the question generation model parameters, which can maximize our expected rewards. To avoid the high variance problem in the REINFORCE estimator, the self-critical sequence training (SCST) (Rennie et al., 2017) framework is used for sequence training, which uses the greedy decoding score as a baseline. In SCST, during training, two output sequences are produced: $y^s$, obtained by sampling from the probability distribution $P(y^s_t \mid y^s_1, \ldots, y^s_{t-1}, D)$, and $y^g$, the greedy-decoding output sequence. We define $r(y, y^*)$ as the reward obtained for an output sequence $y$ when the ground-truth sequence is $y^*$. The SCST loss can be written as
$$L^{scst}_{rl} = -\left( r(y^s, y^*) - r(y^g, y^*) \right) \cdot R$$
where $R = \sum_{t=1}^{n} \log P(y^s_t \mid y^s_1, \ldots, y^s_{t-1}, D)$. However, the greedy decoding method only considers the single-word probability, while sampling considers the probabilities of all words in the vocabulary. Because of this, the greedy reward $r(y^g, y^*)$ has higher variance than the Monte-Carlo sampling reward $r(y^s, y^*)$, and their gap is also very unstable. We experiment with the SCST loss and observe that the greedy strategy causes SCST to be unstable during training. To address this, we introduce a weight history factor similar to Zhu et al. (2018). The history factor is the ratio of the mean sampling reward to the mean greedy strategy reward in the previous $h$ iterations. We update the SCST loss function in the following way:
$$L_{rl} = -\left( r(y^s, y^*) - \alpha \, \frac{\sum_{i=t-h+1}^{t} r_i(y^s, y^*)}{\sum_{i=t-h+1}^{t} r_i(y^g, y^*)} \, r(y^g, y^*) \right) \cdot R$$
where $\alpha$ is a hyper-parameter, $t$ is the current iteration, and $h$ is the history length, which determines how many previous rewards are used in the estimate. The denominator of the history factor is used to normalize the current greedy reward $r(y^g, y^*)$ with the mean greedy reward of the previous $h$ iterations. The numerator of the history factor ensures that the greedy reward has a similar magnitude to the mean sample reward of the previous $h$ iterations.
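A minimal Python sketch of the adaptive baseline is given below. The class name `AdaptiveSCST`, the inclusion of the current iteration in the history window, and the fallback factor of 1.0 when the greedy rewards sum to zero are assumptions; the defaults `alpha=0.9` and `history=5000` follow the hyperparameters reported in this paper.

```python
from collections import deque


class AdaptiveSCST:
    """Self-critical loss whose greedy baseline is rescaled by a history factor."""

    def __init__(self, alpha=0.9, history=5000):
        self.alpha = alpha
        # Rolling windows over the last `history` rewards.
        self.sample_rewards = deque(maxlen=history)
        self.greedy_rewards = deque(maxlen=history)

    def loss(self, sample_reward, greedy_reward, sample_log_prob):
        # Assumption: the current iteration is part of the history window.
        self.sample_rewards.append(sample_reward)
        self.greedy_rewards.append(greedy_reward)
        greedy_sum = sum(self.greedy_rewards)
        # Both windows have equal length, so the ratio of sums equals
        # the ratio of means.
        factor = sum(self.sample_rewards) / greedy_sum if greedy_sum > 0 else 1.0
        baseline = self.alpha * factor * greedy_reward
        return -(sample_reward - baseline) * sample_log_prob


scst = AdaptiveSCST(alpha=1.0, history=2)
first = scst.loss(0.6, 0.3, -2.0)    # factor = 0.6 / 0.3, baseline = 0.6
second = scst.loss(0.8, 0.5, -1.0)   # factor = 1.4 / 0.8, baseline = 0.875
```

Here `sample_log_prob` stands for the summed log-probability of the sampled sequence; in a real training loop it would be a differentiable tensor rather than a float.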

Experimental Setup
With $y^* = \{y^*_1, y^*_2, \ldots, y^*_m\}$ as the ground-truth output sequence for a given input sequence $D$, the maximum-likelihood training objective can be written as
$$L_{ml} = -\sum_{t=1}^{m} \log P(y^*_t \mid y^*_1, \ldots, y^*_{t-1}, D)$$
We use a mixed-objective learning function (Wu et al., 2016; Paulus et al., 2018) to train the final network:
$$L = \gamma_1 L_{rl} + \gamma_2 L_{ml} + \gamma_3 L_{sp}$$
where $\gamma_1$, $\gamma_2$, and $\gamma_3$ correspond to the weights of $L_{rl}$, $L_{ml}$, and $L_{sp}$, respectively.
In our experiments, we use the same vocabulary for both the encoder and decoder. Our vocabulary consists of the 50,000 most frequent words from the training data. We use the development dataset for hyperparameter tuning. Pre-trained GloVe embeddings (Pennington et al., 2014) of dimension 300 are used in the document encoding step. The hidden dimension of all the LSTM cells is set to 512. Answer tagging features and supporting facts position features are embedded to 3-dimensional vectors. The dropout (Srivastava et al., 2014) probability $p$ is set to 0.3. The beam size is set to 4 for beam search. We initialize the model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We first pre-train the network by minimizing only the maximum likelihood (ML) loss. Next, we initialize our model with the pre-trained ML weights and train the network with the mixed-objective learning function. The following values of hyperparameters are found to be optimal: (i) $\gamma_1 = 0.99$, $\gamma_2 = 0.01$, $\gamma_3 = 0.1$, (ii) $d_1 = 300$, $d_2 = d_3 = 3$, (iii) $\alpha = 0.9$, $\beta = 10$, $h = 5000$. The Adam (Kingma and Ba, 2014) optimizer is used to train the model with (i) $\beta_1 = 0.9$, (ii) $\beta_2 = 0.999$, and (iii) $\epsilon = 10^{-8}$. For MTL-QG training, the initial learning rate is set to 0.01. For training our proposed model, the learning rate is set to 0.00001. We also apply gradient clipping (Pascanu et al., 2013) with range $[-5, 5]$.
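The mixed objective and the gradient clipping step can be sketched as follows. The function names are hypothetical, the default weights are the values reported above, and interpreting the clipping "range" as element-wise value clipping is an assumption.

```python
def mixed_objective(loss_rl, loss_ml, loss_sp,
                    gamma_rl=0.99, gamma_ml=0.01, gamma_sp=0.1):
    """Weighted sum of the RL, maximum-likelihood, and supporting fact losses."""
    return gamma_rl * loss_rl + gamma_ml * loss_ml + gamma_sp * loss_sp


def clip_by_value(gradients, low=-5.0, high=5.0):
    """Clip each gradient component to [low, high].

    Assumption: the range [-5, 5] refers to element-wise value clipping.
    """
    return [max(low, min(high, g)) for g in gradients]


total = mixed_objective(1.0, 2.0, 3.0)
clipped = clip_by_value([-7.0, 0.5, 9.0])
print(clipped)  # [-5.0, 0.5, 5.0]
```

With these weights the RL term dominates the objective, while the small ML weight keeps the generated questions anchored to the reference sequences.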
Dataset: We use the HotPotQA dataset to evaluate our methods. This dataset consists of over 113k Wikipedia-based question-answer pairs, with each question requiring multi-step reasoning across multiple supporting documents to infer the answer. While there exist other multi-hop datasets (Welbl et al., 2018; Talmor and Berant, 2018), only the HotPotQA dataset provides the sentence-level ground-truth labels to locate the supporting facts in the list of documents. We combine the training set (90,564) and development set (7,405) and randomly split the resulting data, with 80% for training, 10% for development, and 10% for testing.
[Table 2: automatic evaluation scores on the HotPotQA test set. Columns: Model, BLEU-4, ROUGE-L, SF Coverage. Rows include NQG (Zhou et al., 2017).]
SF Coverage is measured in terms of F1 score. This metric is similar to the MultiHop-Enhanced Reward: we use the question-aware supporting facts prediction network, which takes the generated question and document list as input and predicts the supporting facts. The F1 score measures the average overlap between the predicted and ground-truth supporting facts.

Results and Analysis
We first describe some variants of our proposed MultiHop-QG model.
(1) SharedEncoder-QG: This is an extension of the NQG model (Zhou et al., 2017) with a shared encoder for the QG and answer-aware supporting fact prediction tasks. This model is a variant of our proposed model, where we encode the document list using a two-layer Bi-LSTM which is shared between both tasks. The input to the shared Bi-LSTM is the word and answer encoding as shown in Eq. 1. The decoder is a single-layer LSTM which generates the multi-hop question.
(2) MTL-QG: This variant is similar to SharedEncoder-QG, except that we introduce another Bi-LSTM layer which takes the question, answer and supporting fact embedding as shown in Eq. 3.
The automatic evaluation scores of our proposed method, baselines, and state-of-the-art single-hop question generation model on the HotPotQA test set are shown in Table 2. The performance improvements of our proposed model over the baselines and state-of-the-art models are statistically significant (p < 0.005). For the question-aware supporting fact prediction model (cf. Section 3.1.4), we obtain F1 and EM scores of 84.49 and 44.20, respectively, on the HotPotQA development dataset. We cannot directly compare with the result (21.17 BLEU-4) on the HotPotQA dataset reported in Nema et al. (2019), as their dataset split is different and they only use the ground-truth supporting facts to generate the questions.
We also measure the multi-hopping in terms of SF coverage and report the results in Table 2 and Table 3. We achieve a skyline performance of 80.41 F1 on the ground-truth questions of the HotPotQA test dataset.

Quantitative Analysis
Our results in Table 2 are in agreement with Sun et al. (2018), Zhao et al. (2018), and Zhou et al. (2017), which establish that providing the answer tagging features as input leads to considerable improvement in the QG system's performance. Our SharedEncoder-QG model, which is a variant of our proposed MultiHop-QG model, outperforms all the baseline state-of-the-art models except Semantic-Reinforced. The proposed MultiHop-QG model achieves an absolute improvement of 4.02 [...]

Table 5
Example 1
Document (1): (a) after bedřich smetana, he was the second czech composer to achieve worldwide recognition . ...
Document (2): (a) concert at the end of summer ( czech : koncert na konci léta ) is 1980 czechoslovak historical film . ...
Target Answer: bedřich smetana
Reference: which czech composer achieved worldwide recognition before the subject of " concert at the end of summer " ?
With only Rouge-L reward: who was the composer of the composer of concert at the end of summer ?
With Rouge-L and MER: what was the second czech composer to achieve worldwide recognition for the composer of the concert at the end of summer ?
Example 2
Document (1): (a) seedley railway station is a disused station located in the seedley area of pendleton , salford , on the liverpool and manchester railway . ...
Document (2): (a) pendleton is an inner city area of salford in greater manchester , england . ...
Target Answer: england
Reference: seedley railway station is a disused station located in the seedley area of pendleton , is an inner city area of salford in greater manchester , in which country ?
With only Rouge-L reward: seedley railway station is located in a city area of salford in what country ?
With Rouge-L and MER: seedley railway station is a disused station located in a city area of salford in greater manchester , in which country ?

To analyze the contribution of each component of the proposed model, we perform an ablation study reported in Table 3.
Our results suggest that multi-task learning with a shared encoder helps the model improve the QG performance from 19.55 to 20.64 BLEU-4. Introducing the supporting facts information obtained from the answer-aware supporting fact prediction task further improves the QG performance from 20.64 to 21.28 BLEU-4. Joint training of QG with supporting facts prediction provides stronger supervision for identifying and utilizing the supporting facts information. In other words, by sharing the document encoder between both tasks, the network encodes a better (supporting-facts-aware) representation of the input document. Such a representation is capable of efficiently filtering out irrelevant information when processing multiple documents and performing multi-hop reasoning for question generation. Further, the MultiHop-Enhanced Reward (MER) with the Rouge reward provides a considerable improvement on automatic evaluation metrics.

Qualitative Analysis
We show examples in Table 5, where our proposed reward helps the model maximize the use of all the supporting facts to generate better, more human-like questions. In the first example, the Rouge-L reward based model ignores the information 'second czech composer' from the first supporting fact, whereas our proposed MER reward based model uses it to generate the question. Similarly, in the second example, our model uses the information 'disused station located' from the supporting fact, whereas the former model ignores it while generating the question. We also compare the questions generated from the NQG and our proposed method with the ground-truth questions.
Human Evaluation: For human evaluation, we directly compare the performance of the proposed approach with the NQG model. We randomly sample 100 document-question-answer triplets from the test set and ask four professional English speakers to evaluate them. We consider three modalities: naturalness, which indicates the grammar and fluency; difficulty, which measures the document-question syntactic divergence and the reasoning needed to answer the question; and SF coverage, which is similar to the metric discussed in Section 4 except that we replace the supporting facts prediction network with a human evaluator and measure the relative supporting facts coverage compared to the ground-truth supporting facts. SF coverage provides a measure of the extent of supporting facts used for question generation. For the first two modalities, evaluators are asked to rate the performance of the question generator on a 1-5 scale (5 for the best). To estimate the SF coverage metric, the evaluators are asked to highlight the supporting facts from the documents based on the generated question.
We report the average scores of all the human evaluators for each criterion in Table 4. The proposed approach is able to generate better questions in terms of Difficulty, Naturalness and SF Coverage when compared to the NQG model.

Conclusion
In this paper, we have introduced the multi-hop question generation task, which extends the natural language question generation paradigm to multiple-document QA. We then presented a novel reward formulation to improve multi-hop question generation using reinforcement and multi-task learning frameworks. Our proposed method performs considerably better than the state-of-the-art question generation systems on the HotPotQA dataset. We also introduced SF Coverage, an evaluation metric to compare the performance of question generation systems based on their capacity to accumulate information from various documents. Overall, we propose a new direction for question generation research with several practical applications. In the future, we will focus on improving the performance of multi-hop question generation without any strong supporting facts supervision.