Multi-hop Question Generation with Graph Convolutional Network

Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered pieces of evidence from different paragraphs. It is a more challenging yet under-explored task compared to conventional single-hop QG, where the questions are generated from the sentence containing the answer or nearby sentences in the same paragraph without complex reasoning. To address the additional challenges in multi-hop QG, we propose the Multi-Hop Encoding Fusion Network for Question Generation (MulQG), which performs context encoding in multiple hops with a Graph Convolutional Network and encoding fusion via an Encoder Reasoning Gate. To the best of our knowledge, we are the first to tackle the challenge of multi-hop reasoning over paragraphs without any sentence-level information. Empirical results on the HotpotQA dataset demonstrate the effectiveness of our method in comparison with baselines on automatic evaluation metrics. Moreover, in the human evaluation, our proposed model is able to generate fluent questions with high completeness and outperforms the strongest baseline by 20.8% in the multi-hop evaluation. The code is publicly available at https://github.com/HLTCHKU


Introduction
Question Generation (QG) is the task of automatically generating a question from a given context and, optionally, an answer. Recently, we have observed an increasing interest in text-based QG (Du et al., 2017; Zhao et al., 2018; Scialom et al., 2019; Nema et al., 2019; Zhang and Bansal, 2019).
Most of the existing works on text-based QG focus on generating SQuAD-style (Rajpurkar et al., 2016; Puri et al., 2020) questions, which are generated from the sentence containing the answer or nearby sentences in the same paragraph, via single-hop reasoning (Zhao et al., 2018). Little effort has been put into multi-hop QG, which is a more challenging task. Multi-hop QG requires aggregating several scattered evidence spans from multiple paragraphs, and reasoning over them to generate answer-related, factually coherent questions. It can serve as an essential component in education systems (Heilman and Smith, 2010; Lindberg et al., 2013; Yao et al., 2018), or be applied in intelligent virtual assistant systems (Shum et al., 2018; Pan et al., 2019). It can also be combined with question answering (QA) models as dual tasks to boost QA systems with reasoning ability (Tang et al., 2017).
Table 1: An example of multi-hop QG from the HotpotQA (Yang et al., 2018) dataset. Given that the answer is Location H, to ask where T is located, the model needs bridging evidence to know that T is located in C, and C is located in H (T → C → H). This is done by multi-hop reasoning.
Intuitively, there are two main additional challenges that need to be addressed for multi-hop QG. The first challenge is how to effectively identify scattered pieces of evidence that can connect the reasoning path of the answer and question (Chauhan et al., 2020). As shown in the example in Table 1, to generate a question asking about "Marine Air Control Group 28" given only the answer "Havelock, North Carolina", we need bridging evidence such as "Marine Corps Air Station Cherry Point". The second challenge is how to reason over multiple pieces of scattered evidence to generate factually coherent questions.
Previous works mainly focus on single-hop QG and use neural network based approaches within the sequence-to-sequence (Seq2Seq) framework. Different encoder and decoder architectures have been designed (Nema et al., 2019; Zhao et al., 2018) to incorporate the information of the answer and context for single-hop reasoning. To the best of our knowledge, none of the previous works addresses the two challenges mentioned above for the multi-hop QG task. The only work on multi-hop QG (Chauhan et al., 2020) uses multi-task learning with an auxiliary loss for sentence-level supporting fact prediction, which requires the supporting fact sentences in different paragraphs to be labeled in the training data. Because labeling those supporting facts is labor-intensive and time-consuming, their method cannot be applied to general multi-hop QG cases without supporting facts.
In this paper, we propose a novel architecture named Multi-Hop Encoding Fusion Network for Question Generation (MulQG) to address the aforementioned challenges for multi-hop QG. First of all, it extends the Seq2Seq QG framework from single-hop to multi-hop context encoding. Additionally, it leverages a Graph Convolutional Network (GCN) on an answer-aware dynamic entity graph, which is constructed from entity mentions in the answer and input paragraphs, to aggregate the potential evidence related to the questions. Moreover, we use different attention mechanisms to imitate the reasoning procedures of human beings in the multi-hop generation process; the details are explained in Section 2.
We conduct experiments on the multi-hop QA dataset HotpotQA (Yang et al., 2018) with our model and the baselines. The proposed model outperforms the baselines by a significant margin on automatic evaluation metrics such as BLEU (Papineni et al., 2002). The human evaluation results further validate that our proposed model is more likely to generate high-quality multi-hop questions in terms of Fluency, Answerability and Completeness scores.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to tackle the challenge of multi-hop reasoning over paragraphs without any sentence-level information in QG tasks.
• We propose a new and effective framework for multi-hop QG, which performs context encoding in multiple hops (steps) with a Graph Convolutional Network (GCN).
• We show the effectiveness of our method on both automatic evaluation and human evaluation, and we take the first step toward evaluating model performance in the multi-hop aspect.

Methodology
The intuition is drawn from the human multi-hop question generation process (Davey and McBride, 1986). First, given the answer and context, we skim to establish a general understanding of the texts. Then, we find the mentions of entities in or correlated to the answer from the context, and analyze nearby sentences to extract useful evidence. In addition, we may search for linked information in other paragraphs to gain a further understanding of the entities. Finally, we coherently fuse the knowledge learned from the previous steps and start to generate questions.
To mimic this process, we develop our MulQG framework. The encoding stage is achieved by a novel Multi-hop Encoder. At the decoding stage, we use the maxout pointer decoder proposed by Zhao et al. (2018). The overview of the framework is shown in Figure 1.
The context and answer are split into word-level tokens and denoted as $c = \{c_1, c_2, \dots, c_n\}$ and $a = \{a_1, a_2, \dots, a_m\}$, respectively. Each word is represented by the pre-trained GloVe embedding (Pennington et al., 2014). Furthermore, for the words in the context, we also append the answer tagging embeddings as described in Zhao et al. (2018).
The context and answer embeddings are fed into two bidirectional LSTM-RNNs separately to obtain their initial contextual representations $C_0 \in \mathbb{R}^{d \times n}$ and $A_0 \in \mathbb{R}^{d \times m}$, in which $d$ is the hidden state dimension of the LSTM.
Figure 1: Overview of our MulQG framework. In the encoding stage, we pass the initial context encoding $C_0$ and answer encoding $A_0$ to the Answer-aware Context Encoder to obtain the first context encoding $C_1$; then $C_1$ and $A_0$ are used to update a multi-hop answer encoding $A_1$ via the GCN-based Entity-aware Answer Encoder, and we feed $A_1$ and $C_1$ back to the Answer-aware Context Encoder† to obtain $C_2$. The final context encoding $C_{final}$ is obtained from the Encoder Reasoning Gate, which operates over $C_1$ and $C_2$, and is used in the maxout-based decoding stage.

Answer-aware Context Encoder
Inspired by the co-attention reasoning mechanism in previous machine reading comprehension works (Xiong et al., 2016), we compute the answer-aware context representation via the following steps:
$S = C_0^{\top} A_0 \in \mathbb{R}^{n \times m}$ (1)
$S' = \mathrm{softmax}(S) \in \mathbb{R}^{n \times m}$ (2)
$S'' = \mathrm{softmax}(S^{\top}) \in \mathbb{R}^{m \times n}$ (3)
$A'_0 = C_0 \cdot S' \in \mathbb{R}^{d \times m}$ (4)
$\tilde{C}_1 = [A_0; A'_0] \cdot S'' \in \mathbb{R}^{2d \times n}$ (5)
$C_1 = \mathrm{BiLSTM}([\tilde{C}_1; C_0]) \in \mathbb{R}^{d \times n}$ (6)
First, we compute an alignment matrix $S$ (Eq. 1), and normalize it column-wise and row-wise to get two attention matrices $S'$ (Eq. 2) and $S''$ (Eq. 3). $S'$ represents the relevance of each answer token over the context, and $S''$ represents the relevance of each context token over the answer. The new answer representation $A'_0$ with respect to the context is obtained by Eq. 4. Next, the answer-dependent context representation $\tilde{C}_1$ is calculated by concatenating the old and new answer representations and multiplying the result by the attention weight matrix $S''$ (Eq. 5). Finally, to deeply incorporate the interaction between answer and context, we feed the answer-dependent representation $\tilde{C}_1$, combined with the original $C_0$, into a bi-directional LSTM and obtain the answer-aware context encoding $C_1$ (Eq. 6).
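The co-attention steps of this encoder, up to the point where the result is passed to the bi-directional LSTM, can be sketched in plain Python. This is a minimal illustration under assumptions rather than the released implementation: the alignment score is taken to be a dot product, the helper names (`softmax`, `transpose`, `matmul`, `coattention`) are hypothetical, and matrices are stored as lists of rows with the paper's d × n and d × m layouts.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(mat):
    return [list(row) for row in zip(*mat)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def coattention(C0, A0):
    """C0: d x n context encoding, A0: d x m answer encoding.
    Returns the 2d x n answer-dependent context representation."""
    # Assumption: dot-product alignment between context and answer tokens.
    S = matmul(transpose(C0), A0)                                # n x m
    # Normalise every column of S over the n context tokens.
    S_col = transpose([softmax(col) for col in transpose(S)])    # n x m
    # Normalise every row of S over the m answer tokens, then transpose.
    S_row_t = transpose([softmax(row) for row in S])             # m x n
    A_new = matmul(C0, S_col)           # d x m, answer summarised from context
    stacked = A0 + A_new                # 2d x m, old and new answer stacked
    return matmul(stacked, S_row_t)     # 2d x n
```

Each column of the output is a convex combination of the stacked answer vectors, so every context position receives a summary of the answer tokens most aligned with it.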

GCN-based Entity-aware Answer Encoder
As shown in Figure 2, in order to obtain the multi-hop answer representation, we first compute the entity encoding from the answer-aware context encoding $C_1$, then we apply GCN to propagate multi-hop information on the answer-aware sub-graph. Finally, we obtain the updated answer encoding $A_1$ via a bi-attention mechanism.
Entity Graph Construction. The entity graph is constructed with the named entities in the context as nodes, where we use a BERT-based named entity recognition model to recognize named entities in the context. An edge is created for a pair of entities if they are in the same sentence or appear in the same paragraph. We also connect the entities from each paragraph title to the entities within the same paragraph.
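The edge rules can be illustrated with a small sketch. It assumes that entity recognition has already been run and that each mention is given as an (entity, paragraph id) pair; `build_entity_graph` and its input format are hypothetical. Because two entities in the same sentence are necessarily in the same paragraph, the paragraph rule alone covers both cases here. Nodes are keyed by surface string, which is also an assumption.

```python
from itertools import combinations

def build_entity_graph(mentions, title_entities=None):
    """mentions: list of (entity, paragraph_id) pairs.
    title_entities: optional dict paragraph_id -> entity in that paragraph's title.
    Returns (entities, adj) where adj is a symmetric 0/1 adjacency matrix."""
    title_entities = title_entities or {}
    entities = sorted({e for e, _ in mentions} | set(title_entities.values()))
    index = {e: i for i, e in enumerate(entities)}
    adj = [[0] * len(entities) for _ in entities]

    def connect(a, b):
        if a != b:
            adj[index[a]][index[b]] = 1
            adj[index[b]][index[a]] = 1

    by_paragraph = {}
    for entity, par in mentions:
        by_paragraph.setdefault(par, set()).add(entity)

    for par, ents in by_paragraph.items():
        # Entities that share a paragraph (and hence possibly a sentence).
        for a, b in combinations(sorted(ents), 2):
            connect(a, b)
        # Title entity linked to every entity inside its paragraph.
        if par in title_entities:
            for e in ents:
                connect(title_entities[par], e)
    return entities, adj
```

With the T → C → H pattern from the introduction, T and H end up two hops apart through the bridging entity C, which appears in both paragraphs.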
Entity Encoding. With the answer-aware context encoding $C_1$ obtained from the Answer-aware Context Encoder, we use a mapping matrix $M$ to calculate the entity encoding. $M$ is a binary matrix where $M_{i,j} = 1$ if the $i$-th token in the context is within the span of the $j$-th entity. Each entity's encoding is calculated via mean-max pooling applied over its corresponding span of context token encodings, giving $E_0 = \{e_1, e_2, \dots, e_g\} \in \mathbb{R}^{2d \times g}$, where $g$ is the number of entities and the dimension is $2d$ because we directly concatenate the mean-pooling and max-pooling encodings.
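A minimal sketch of mean-max pooling with a binary token-to-entity mapping follows. For readability the context encoding is stored token-major (n vectors of size d) rather than in the d × n layout used in the text, and the function name `entity_encoding` as well as the zero-vector fallback for an empty span are assumptions.

```python
def entity_encoding(C1, M):
    """C1: n token vectors of size d. M: n x g binary matrix with
    M[i][j] = 1 if token i lies in the span of entity j.
    Returns g vectors of size 2d: mean-pooled part followed by max-pooled part."""
    g = len(M[0])
    d = len(C1[0])
    encodings = []
    for j in range(g):
        span = [C1[i] for i in range(len(C1)) if M[i][j] == 1]
        if not span:
            # Assumption: an entity without tokens gets a zero vector.
            encodings.append([0.0] * (2 * d))
            continue
        mean = [sum(tok[k] for tok in span) / len(span) for k in range(d)]
        peak = [max(tok[k] for tok in span) for k in range(d)]
        encodings.append(mean + peak)
    return encodings
```

For instance, an entity spanning two tokens encoded as [1, 4] and [3, 2] is represented as [2, 3, 3, 4].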
Answer-aware GCN. First we calculate an answer-aware sub-graph, in which irrelevant entities are masked out so that only entity nodes related to the answer are allowed to disseminate information. Similar to Xiao et al. (2019), we compute a soft mask $M = [m_1, m_2, \dots, m_g]$ (Eq. 7), where each $m_i$ indicates the relatedness of entity $i$ to the answer, and then apply $M$ to the original graph entities to obtain the answer-aware dynamic entity sub-graph $E_{sub}$ via Eq. 8. Here $V$ is a linear projection matrix, $a_0$ is the mean pooling over the answer encoding $A_0$, and $\sigma$ is the sigmoid function. Then we calculate the attention matrix of the answer-aware sub-graph as described in Veličković et al. (2017), where $\alpha_{i,j}$ represents the information that will be assigned from entity $i$ to its neighbor $j$, and obtain one layer of information propagation over the sub-graph (Eq. 9). The computation in Eq. 9 can be repeated multiple times to obtain the multi-hop entity representation $E_M$.
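The masking and propagation idea can be sketched as below. This is only an illustration under several assumptions, not the exact formulation: the soft mask is taken to be a sigmoid over a bilinear score between the pooled answer vector and each entity, attention logits are plain dot products (a simplification of the graph attention of Veličković et al.), each node aggregates from its neighbors and itself, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_mask(a0, V, E):
    """a0: pooled answer vector (size p). V: p x q projection matrix.
    E: g entity vectors (size q). Returns one score in (0, 1) per entity."""
    proj = [sum(a0[r] * V[r][c] for r in range(len(a0)))
            for c in range(len(V[0]))]
    return [sigmoid(sum(p * x for p, x in zip(proj, e))) for e in E]

def propagate(E_sub, adj):
    """One round of attention-weighted aggregation over neighbours and self."""
    out = []
    for i, e_i in enumerate(E_sub):
        nbrs = [j for j in range(len(E_sub)) if adj[i][j] or j == i]
        logits = [sum(x * y for x, y in zip(e_i, E_sub[j])) for j in nbrs]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        z = sum(weights)
        agg = [sum(weights[k] / z * E_sub[j][c] for k, j in enumerate(nbrs))
               for c in range(len(e_i))]
        out.append([max(0.0, v) for v in agg])   # ReLU non-linearity
    return out

def answer_aware_gcn(a0, V, E, adj, hops=2):
    m = soft_mask(a0, V, E)
    # Scale each entity by its relatedness so unrelated entities fade out.
    E_sub = [[m_i * x for x in e] for m_i, e in zip(m, E)]
    for _ in range(hops):
        E_sub = propagate(E_sub, adj)
    return m, E_sub
```

Calling `answer_aware_gcn` with `hops=2` corresponds to the two-layer setting, where information can travel from an answer entity to the neighbors of its neighbors.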
Multi-hop Answer Encoding. We use the bi-attention mechanism (Seo et al., 2016), regarding the entities on the sub-graph as memories, to update our multi-hop answer encoding $A_1$.

Encoder Reasoning Gate
We apply a gated feature fusion module to the answer-aware context representations $C_1$ and $C_2$ from the previous context encoder hops, keeping and forgetting information to form the final context representation $C_{final}$.
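One common form of gated fusion is sketched below as an illustration. It assumes a sigmoid gate computed from the concatenation of the two encodings and a convex combination of them; the exact parameterisation used in the model may differ, and `reasoning_gate`, `W` and `b` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reasoning_gate(C1, C2, W, b):
    """C1, C2: n token vectors of size d. W: d x 2d weights, b: d biases.
    Returns n fused vectors g * c1 + (1 - g) * c2, with
    g = sigmoid(W [c1; c2] + b) computed per token and per dimension."""
    fused = []
    for c1, c2 in zip(C1, C2):
        both = c1 + c2                      # concatenation, size 2d
        gate = [sigmoid(sum(w * x for w, x in zip(row, both)) + b_k)
                for row, b_k in zip(W, b)]
        fused.append([g * x + (1.0 - g) * y
                      for g, x, y in zip(gate, c1, c2)])
    return fused
```

A gate value close to 1 keeps the first-hop encoding for that dimension, while a value close to 0 replaces it with the second-hop encoding.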

Maxout Pointer Decoder
A uni-directional LSTM is used as the decoder of our model. Moreover, we introduce the Maxout Pointer proposed by Zhao et al. (2018) into the decoder for the sake of reducing repetitions in the generation. The Pointer Generator enables the decoder to generate the next output token either from the generative probability distribution over the vocabulary or by copying from the input sequence. To compute the copy score, we use the attention of the current decoder hidden state over the input sequence, whose tokens form a vocabulary $V$. In the Maxout Pointer Generator, instead of using all the attention scores over the input tokens, only the maximal one is taken into consideration, to avoid the repetitions caused by the input tokens (as shown in Eq. 13, where $a_{t,k}$ denotes the decoder-encoder attention score).
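The difference between the basic pointer and the maxout pointer is easy to see in code: when a token occurs several times in the input, the basic pointer sums its attention scores, whereas the maxout pointer keeps only the largest one. The sketch below follows that description; `copy_scores` is a hypothetical helper, and tokens absent from the input simply receive no copy score.

```python
def copy_scores(input_tokens, attention, maxout=True):
    """input_tokens: source tokens x_1..x_n.
    attention: decoder-encoder attention a_{t,k} for each position k.
    Returns a dict token -> copy score for the current decoding step t."""
    scores = {}
    for token, a in zip(input_tokens, attention):
        if token not in scores:
            scores[token] = a
        elif maxout:
            scores[token] = max(scores[token], a)   # maxout: keep the peak
        else:
            scores[token] = scores[token] + a       # basic pointer: sum repeats
    return scores
```

With a frequent token such as "the" appearing twice with attention 0.3 each, the summed score (0.6) would dominate rarer tokens, while the maxout score stays at 0.3.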

Breadth-First Search Loss
In addition to the cross-entropy loss, we also introduce the Breadth-First Search (BFS) loss (Xiao et al., 2019), a weakly supervised loss that further assists the training procedure. Given the answer entities, we conduct BFS over the adjacency matrices of the entity graph we build to obtain heuristic masks as a weak supervision signal. The BFS loss is calculated as the binary cross-entropy between the predicted soft masks $M$ in the GCN-based Entity-aware Answer Encoder (Section 2.1.2) and the heuristic masks, and is used as in Eq. 14 to encourage the model to learn the answer-aware dynamic entity graph better. In Eq. 14, $\lambda$ is a heuristically chosen hyperparameter that can be selected using cross-validation.
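The weak supervision signal can be sketched as follows. The sketch assumes that the heuristic mask marks every entity reachable from an answer entity within a fixed number of hops, and that the final objective adds the BFS loss to the generation loss with weight λ; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import deque

def bfs_mask(adj, start_nodes, max_hops):
    """Mark entities reachable from the answer entities within max_hops edges."""
    mask = [0.0] * len(adj)
    for s in start_nodes:
        mask[s] = 1.0
    queue = deque((s, 0) for s in start_nodes)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nbr, connected in enumerate(adj[node]):
            if connected and mask[nbr] == 0.0:
                mask[nbr] = 1.0
                queue.append((nbr, depth + 1))
    return mask

def bfs_loss(predicted, target, eps=1e-12):
    """Mean binary cross-entropy between soft masks and heuristic masks."""
    total = 0.0
    for p, t in zip(predicted, target):
        p = min(max(p, eps), 1.0 - eps)     # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(target)

def total_loss(cross_entropy, predicted, target, lam=0.5):
    # Assumed form: generation loss plus lambda-weighted BFS loss.
    return cross_entropy + lam * bfs_loss(predicted, target)
```

An uninformative soft mask of 0.5 for every entity yields a BFS loss of ln 2, so the auxiliary term only decreases once the predicted mask starts to separate reachable from unreachable entities.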

Dataset
To demonstrate the performance of our model, we conduct experiments using the HotpotQA (Yang et al., 2018) dataset in the reverse direction. In the QG task, the paragraphs and the answers are considered as input, while the corresponding questions are the expected output. HotpotQA is a multi-hop question answering dataset, which contains Wikipedia-based question-answer pairs, with each question requiring multi-hop reasoning across multiple paragraphs to infer the answer. There are mainly two types of multi-hop reasoning in the HotpotQA dataset: bridge and comparison. Focusing on the multi-hop ability of our model, we filter out all the yes/no data samples in the dataset and run our experiments using the remaining train and test sets, which consist of 73k questions in the training set and 8k in the test set.
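The data preparation described here amounts to two small steps, sketched below. The record keys `question`, `answer` and `context` are assumptions about how each example is stored, and both helper names are hypothetical.

```python
def drop_yes_no(examples):
    """Keep only examples whose answer is not a bare yes/no."""
    return [ex for ex in examples
            if ex["answer"].strip().lower() not in {"yes", "no"}]

def to_qg_pair(example):
    """Reverse the QA roles: (paragraphs, answer) is the input,
    the question is the generation target."""
    return {"source": (example["context"], example["answer"]),
            "target": example["question"]}
```

Applying `drop_yes_no` first and `to_qg_pair` second turns a QA-style record into a QG training pair.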

Baselines
Since multi-hop QG has been under-explored so far, there are very few existing baselines for comparison. We choose the following two models because of their high relevance to our task and relatively superior performance:
MP-GSN, proposed by Zhao et al. (2018), is the first QG model to demonstrate a large improvement with paragraph-level inputs for single-hop QG. While they conducted their experiments on SQuAD (Rajpurkar et al., 2016), we use exactly the same experiment settings provided in their configuration file on the HotpotQA dataset.
RefNet, proposed by Nema et al. (2019), is the first work to report QG results on the HotpotQA dataset. However, their inputs are based on the gold supporting sentences, which contain the facts related to the multi-hop question, and no paragraph-level results are reported. We run their released code at the paragraph level, and test their model's performance on both their validation set and the test set of the HotpotQA dataset.
We also fine-tune the large pre-trained language models UniLM (Dong et al., 2019) and BART (Lewis et al., 2019) on the multi-hop QG task as a comparison benchmark, to further show the effectiveness of our method. The details and the results are covered in the Appendix.

Implementation Details
Our word embeddings are initialized by glove.840B.300d (Pennington et al., 2014) and we keep our vocab size as 45000. We use two-layer bi-directional LSTMs for encoder and two-layer uni-directional LSTMs for decoder, and the hidden size is 300 for all the models. We use stochastic gradient descent (SGD) as the optimizer. The initial learning rate is 0.1, and it is reduced during the training stage using a cosine annealing scheduler (Loshchilov and Hutter, 2016). The batch size is 12 and the beam size is 10. We set the dropout probability for LSTM to 0.2 and 0.3 for GCN. The maximum number of epochs is set to 20. We set the maximum number of entities in each context to 80, and we use a two-layer GCN in our GCN-based answer encoder module. After training the model for 10 epochs, we further fine-tune the MulQG model with the help of BFS loss, where the λ in Eq.14 is set to 0.5.
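For reference, the learning-rate curve produced by cosine annealing without restarts can be sketched as below, using the initial rate of 0.1 and the 20-epoch budget stated above. The minimum rate of 0 and per-epoch (rather than per-step) updates are assumptions, and `cosine_annealing_lr` is a hypothetical helper.

```python
import math

def cosine_annealing_lr(epoch, max_epochs=20, lr_max=0.1, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cos = math.cos(math.pi * epoch / max_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)
```

Under these assumptions the rate starts at 0.1, passes through 0.05 at epoch 10 (the point where BFS fine-tuning begins), and decays to 0 at epoch 20.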

Metrics
We use the metrics from previous work on single-hop QG to evaluate the generation performance of our model: the n-gram similarity metrics BLEU (Papineni et al., 2002; see https://github.com/Maluuba/nlg-eval), ROUGE-L (Lin, 2004), and METEOR, using the package released in Lavie and Denkowski (2009). We also report QBLEU4 (Nema and Khapra, 2018a) and the answerability score of our models, which was shown to correlate significantly better with human judgements (Nema and Khapra, 2018b).
Table 2 shows the performance of various models on the HotpotQA test set. We report the results of our proposed model both before and after fine-tuning with the auxiliary BFS loss. As shown in the table, our MulQG model performs much better than the two baseline methods with regard to all of these metrics, which indicates that the multi-hop procedure can significantly boost the quality of the encoding representations and thus improve multi-hop question generation performance. The BFS loss further improves the system performance by encouraging the model to learn the answer-aware dynamic entity graph better, which is a key module and a bottleneck in the MulQG model.

Ablation Study
To further evaluate and investigate the contribution of the different components in our model, we perform an ablation study. As we can see from Table 3, both the GCN-based Entity-aware Answer Encoder module and the Gated Context Reasoning module are important to the model. Each of them provides a relative contribution of 2%-3% to the overall performance improvement.
w/o GEAEnc: Without the GCN-based Entity-aware Answer Encoder, answer-related multi-hop evidence cannot be identified. Without the multi-hop answer encoding being updated, the next step's answer-aware context encoding is affected, and thus the performance drops considerably.
w/o GEAEnc + ACEnc: The performance continues to decrease, but only slightly. This matches our expectation: without an informative input $A_1$ containing multi-hop information from the GCN-based Entity-aware Answer Encoder, the Answer-aware Context Encoder† cannot generate an informative $C_2$, so removing it does not hurt the performance much further.
w/o ERG: When we remove the Encoder Reasoning Gate, the performance drops by around 3% in BLEU-1. This also matches our intuition, since without effective feature reasoning and fusion, the previous encoders cannot produce effective representations, and the generation performance is affected.

Answer: jeremy renner
Baseline: which american actor starred in the 2016 american science fiction film directed by denis villeneuve ?
Ours: which star of the movie arrival was nominated for the academy award for best supporting actor for his performance in " the town " ?
Human: name the actor who has acted in the film arrival and who has been nominated for the academy award for best supporting actor for the film " the town " ?
The tables show the generated questions from different models along with the corresponding paragraphs and the answer. Moreover, we highlight the reasoning paths of our proposed model in green for a more intuitive display. We also use wavy lines to mark out the snippets of the paragraphs that the questions generated by the MP-GSN model derive from.
single-hop QG system level, which proves the contributions of the whole proposed model.

MulQG (1-layer GCN):
When we apply a 1-layer GCN and limit information propagation to each node's immediate neighbors, the answer-related evidence might not be fully obtained; thus the performance is not as good as that of our 2-layer GCN-based model.

Human Evaluation
Human evaluation is conducted to further analyze the performance of our model (Table 4). We compare the questions generated by the MP-GSN model, those generated by our model, and the gold ones on four metrics: Fluency, Completeness, Answerability, and whether the generated question is a multi-hop question or not. Fluency emphasizes the grammatical correctness of the question, while Completeness only focuses on sentence completeness. Answerability mainly indicates the relationship between the answer and the generated question. For the first three metrics, the score for each data sample is chosen from {1, 2, 3} in comparison with the samples generated from the other two sources with the same input, where a higher score indicates better performance on that metric. For the multi-hop evaluation, we only carry out binary discrimination. We randomly sample 100 data samples from the test set. Ten annotators in total are asked to evaluate them on the aforementioned four metrics. Each sample is evaluated by three different annotators.
To present a more convincing analysis, we conduct a t-test on the human evaluation results. All the reported differences between our proposed model and the baseline are statistically significant with a p-value < 0.05. We also calculate the inter-annotator agreement using Fleiss' Kappa (Fleiss, 1971) and achieve high agreement scores on the proposed model. We observe that our MulQG model largely outperforms the MP-GSN model in terms of Fluency, Answerability and Completeness, with more stable quality. Moreover, our model tends to generate more complete questions and achieves a Completeness score comparable to that of the human annotations. In the multi-hop evaluation, we outperform the strongest baseline by 20.8%.
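Fleiss' Kappa for a setting like this one (three ratings per sample, a fixed set of score categories) can be computed with the short sketch below. It implements the standard definition; the input layout and the function name `fleiss_kappa` are choices made here for illustration, and the function is undefined when all ratings fall into a single category.

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of annotators who assigned item i to category j.
    Every item must be rated by the same number of annotators."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed pairwise agreement for each item, then averaged.
    p_items = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1))
               for row in counts]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1.0 - p_exp)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.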

Case Study
We present a case study comparing the strong baseline MP-GSN model, our model and the human annotations. Three cases are presented in Figure 3. The first two examples clearly show that the baseline model tends to copy a long contiguous span of the context as its generation, while our proposed model performs better in this respect. We observe that, since the supporting fact information is not leveraged in our method, the questions generated by our model may follow a reasoning path different from that of the gold question. There can be multiple ways to construct a multi-hop question given the same input, so the generations may differ considerably from the gold label while still being correct questions, as the first two examples indicate. This phenomenon causes lower scores on automatic metrics such as BLEU and METEOR, but we note that the generated questions still follow the multi-hop scheme and can be answered with the given answers.
In Example III, we show an easier data sample. In this case, while the answer entity is in one paragraph, a similar entity (annotated in orange) also appears in another paragraph, which gives a strong clue about the reasoning path and makes it easier for the model to attend to both paragraphs. The generation from our model and the human annotation show almost the same reasoning path. However, we observe that the question generated by the MP-GSN model still tends to attend to the entities that are closer to the answer entities. Moreover, the gold questions in Example I and Example III have fluency problems, which is harmful for QG models; interestingly, even when trained with these labels, our model is still capable of generating relatively fluent outputs.

Related Work
Question Generation. Early single-hop QG work uses rule-based methods to transform sentences into questions (Labutov et al., 2015; Lindberg et al., 2013). More recent neural network based approaches adopt the sequence-to-sequence (Seq2Seq) framework, for which different types of encoders and decoders have been designed (Nema et al., 2019; Zhao et al., 2018). Zhao et al. (2018) propose to incorporate paragraph-level content by using Gated Self Attention and Maxout pointer networks, while Nema et al. (2019) propose a model with two decoders, where the second decoder refines the question generated by the first decoder using reinforcement learning. There are different ways to incorporate answer information in the context encoding stage. Some works directly concatenate answer tagging with the context embedding, while Nema et al. (2019) also apply the bi-attention mechanism proposed by Seo et al. (2016) for QA to build an answer-aware context representation. Chen et al. (2019) is the most recent work; it proposes a reinforcement learning based graph-to-sequence (Graph2Seq) model that uses a bidirectional graph encoder on a syntax-based graph for QG, but it still focuses on single-hop QG.
Multi-hop QA. Popular Graph Neural Network (GNN) frameworks, such as graph convolutional networks (Kipf and Welling, 2016), graph attention networks (Veličković et al., 2017), and graph recurrent networks (Song et al., 2018), have been explored and have shown promising results on multi-hop QA tasks that require reasoning. Xiao et al. (2019) propose a dynamically fused graph network for multi-hop QA on the HotpotQA dataset. De Cao et al. (2018) propose an entity-GCN method to reason across multiple documents for multi-hop QA on the WIKIHOP dataset (Welbl et al., 2018).

Conclusion
The multi-hop QG task is more challenging and worthy of exploration compared to conventional single-hop QG. To address the additional challenges in multi-hop QG, we propose MulQG, which performs multi-hop context encoding with a Graph Convolutional Network and encoding fusion via a Gated Reasoning module. To the best of our knowledge, we are the first to tackle the challenge of multi-hop reasoning over paragraphs without any sentence-level information. The model performance on the HotpotQA dataset demonstrates its effectiveness in aggregating scattered pieces of evidence across paragraphs and fusing information effectively to generate multi-hop questions. The strong reasoning ability of the Multi-hop Encoder in the MulQG model can potentially be leveraged in complex generation tasks in future work.
Table A1: Performance comparison between our MulQG model and fine-tuned state-of-the-art large pre-trained models on the HotpotQA test set.