PathQG: Neural Question Generation from Facts

Existing research for question generation encodes the input text as a sequence of tokens without explicitly modeling fact information. These models tend to generate irrelevant and uninformative questions. In this paper, we explore incorporating facts in the text for question generation in a comprehensive way. We present a novel task of question generation given a query path in the knowledge graph constructed from the input text. We divide the task into two steps, namely, query representation learning and query-based question generation. We formulate query representation learning as a sequence labeling problem for identifying the involved facts to form a query, and employ an RNN-based generator for question generation. We first train the two modules jointly in an end-to-end fashion, and further enforce the interaction between these two modules in a variational framework. We construct the experimental datasets on top of SQuAD, and results show that our model outperforms other state-of-the-art approaches, with a larger performance margin when target questions are complex. Human evaluation also shows that our model is able to generate relevant and informative questions.


Introduction
Question Generation (QG) from text aims to automatically construct questions from textual input (Heilman and Smith, 2010). It has received increasing attention from the research community recently, due to its broad applications in scenarios such as dialogue systems and educational reading comprehension (Piwek et al., 2007). It can also help to augment the question set to enhance the performance of question answering systems. Current QG systems mainly follow the sequence-to-sequence structure with an encoder for modeling the textual input and a decoder for text generation (Du et al., 2017). These neural models have shown promising performance; however, they suffer from generating irrelevant and uninformative questions. Figure 1a presents two sample questions generated by a neural QG model. Q2 contains the irrelevant information "Everton Fc". Although Q1 is correct, it is a safe play that does not mention any specific information in the input text. One possible cause of the problem is that current sequence-to-sequence models learn a latent representation of the input text without explicitly modeling the semantic information it contains. We therefore argue that modeling facts in the input text can help to alleviate this problem of existing neural QG models.
Some researchers explore incorporating the answer entity or a so-called question-worthy phrase (Wang et al., 2019) as the fact to guide the generation of the target question, and make some progress. However, a complex question usually involves multiple facts, so a single word piece or phrase cannot provide enough information for the generation. In this paper, we propose to represent facts in the input text as a knowledge graph (KG) and present a novel task of generating a question given a query path from the KG. More specifically, a KG contains a set of fact triples, and a query path is an ordered sequence of triples in the KG. A fact triple consists of two entities and their relationship. Figure 1b shows the KG of the input text in Figure 1a, which includes two query paths. We can see that not all facts in a query path are mentioned in a specific target question (see Path 2 and GTQ2). Therefore, the model needs to extract the involved facts to form a query before it generates a question. Intuitively, we divide the task of question generation from a query path into two steps, namely, query representation learning and query-based question generation. We formulate the former step as a sequence labeling problem for identifying the involved facts to form a query. For query-based question generation, an RNN-based generator is used to generate the question word by word. We first train the two modules jointly in an end-to-end fashion (PathQG in Section 3). In order to further enforce the interaction between these two modules, we employ a variational framework (Chen et al., 2018) that treats query representation learning as an inference process from the query path, taking the generated question as the target (PathQG-V in Section 4).
For model evaluation, we build the experimental environment on top of the benchmark dataset SQuAD (Rajpurkar et al., 2016). Specifically, we automatically construct the KG for each piece of input text, and pair ground-truth questions with corresponding query paths from the KG. Experimental results show that our generation model outperforms other state-of-the-art QG models, especially when the questions are more complicated. Human evaluation also proves the effectiveness of our model in terms of both relevance and informativeness.

Task Definition
We first introduce some notation for our task:
-x = (x_1, ..., x_n): an input text with n tokens, where x_i is the i-th token;
-G: a knowledge graph constructed from x, which is a set of fact triples {(e_1, r_1, e_2), ...}, where e_i is an entity and r_i is the relation between e_i and e_{i+1};
-s = (e_1, r_1, e_2, ..., e_m): a query path in the knowledge graph, i.e., an ordered sequence of triples forming a subset of G;
-y = (y_1, ..., y_{|y|}): the generated question based on x and s, where y_i is a token.
The task is described as follows: given an input text x and its corresponding knowledge graph G, our model aims to generate a question y based on a query path s from G.
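As a concrete illustration, one training instance of this task can be written as a small data structure. The field names and the toy example below are ours, not from the paper's code:

```python
# Illustrative sketch of one PathQG instance: input text x, KG G,
# query path s, and target question y (all names are our own).
from dataclasses import dataclass

Triple = tuple[str, str, str]  # (entity, relation, entity)

@dataclass
class PathQGInstance:
    text: list[str]        # input text x = (x_1, ..., x_n)
    kg: list[Triple]       # knowledge graph G as a set of fact triples
    query_path: list[str]  # s = (e_1, r_1, e_2, ...): alternating entities/relations
    question: list[str]    # target question y

inst = PathQGInstance(
    text="the club was founded in plymouth in the late 18th century".split(),
    kg=[("club", "founded in", "plymouth"),
        ("club", "founded in", "late 18th century")],
    query_path=["club", "founded in", "plymouth"],
    question="where was the club founded ?".split(),
)
# A query path alternates entities and relations, so it has odd length 2m-1.
assert len(inst.query_path) % 2 == 1
```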

Path-based Question Generation
We divide the task of question generation from a given query path into two steps, namely, query representation learning and query-based question generation. A Query Representation Learner and a Query-based Question Generator are designed for the two steps separately. We directly combine these two modules into a unified framework PathQG and the overall architecture is illustrated in Figure 2.

Query Representation Learner
The query (representation) learner takes a query path s as input and learns the query representation L. Since entities and relations in a query path contribute differently to generating the target question, we calculate their contribution weights for query representation learning.

Contribution Weight Calculation
We treat the task of contribution weight calculation as a sequence labeling task on the query path s = (e_1, r_1, e_2, r_2, ..., e_m), taking entities and relations as tokens.
Context Encoding Considering that the input text x can be useful for identifying the weights of components in the path, we first encode the input text via a context encoder. Following previous work, we use additional entity information (e.g., the start and end entities of the query path) to improve the encoding of x. We use two BIO tag sequences b^a = (b^a_1, ..., b^a_n) and b^e = (b^e_1, ..., b^e_n) to mark the positions of the start and end entities in x. We then concatenate the embeddings of x and the two BIO tag sequences as the input of the context encoder and use a bi-directional LSTM (Huang et al., 2015) to obtain the context states h^c_i as in Eq. 1, where E^w and E^b are the word embedding and tag embedding matrices respectively.
Contribution Weighting Since each entity or relation in the path is itself a sequence of tokens, we take the average pooling of its word embeddings as input. The path is then encoded for sequence labeling analogously. The encoding state h^s_i at each step i attends to h^c, and the attention output is computed as c_i. Then c_i is concatenated with the hidden state h^s_i to calculate the sigmoid probability of the i-th component s_i in the path as its contribution weight w_i, where σ(·) is the sigmoid activation function and FFN_l is an l-layer feed-forward network.
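A minimal sketch of this contribution-weighting step, using plain NumPy dot-product attention in place of the trained BiLSTM states, and a single linear projection standing in for FFN_l (all shapes and parameters here are our simplifications, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(query, keys):
    """Dot-product attention: softmax-weighted sum of `keys`."""
    scores = keys @ query                 # (n,)
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return alphas @ keys                  # attended vector, shape (d,)

def contribution_weights(path_states, context_states, W, b):
    """Sigmoid weight w_i per path component: each path state h^s_i
    attends to the context states h^c, then [h^s_i; c_i] is projected
    and squashed (one linear layer in place of the paper's FFN)."""
    weights = []
    for h_s in path_states:
        c = attend(h_s, context_states)   # attended context c_i
        logit = np.concatenate([h_s, c]) @ W + b
        weights.append(float(1.0 / (1.0 + np.exp(-logit))))
    return weights

d = 8
path_states = rng.normal(size=(5, d))     # h^s for (e1, r1, e2, r2, e3)
context_states = rng.normal(size=(10, d)) # h^c from the context encoder
W, b = rng.normal(size=2 * d), 0.0
ws = contribution_weights(path_states, context_states, W, b)
assert len(ws) == 5 and all(0.0 < w < 1.0 for w in ws)
```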

Query Representation Learning
With the contribution weights of the entities and relations w = (w_1, w_2, ..., w_{2m−1}), we encode the query path s in a weighted manner to learn the query representation. We first utilize the average embeddings of the entities and relations to compose the weighted query path as in Eq. 4.
A path contains two different types of elements, entities and relations, appearing in alternating order, and its basic structural units are triples; an RNN encoder is not able to capture this special structural information. We therefore adopt the recurrent skipping network (RSN) (Guo et al., 2019) instead of a BiLSTM to encode the weighted path sequence and form a query representation as in Eqs. 5 and 6, where L = (L_1, ..., L_{2m−1}) is the learned query representation.
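The weighted-path encoding with skipping connections could look roughly as follows. The skip rule (relation positions additionally receive the subject entity's input) is our reading of RSN (Guo et al., 2019), and a plain tanh cell stands in for the recurrent unit used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def rnn_cell(x, h, Wx, Wh):
    """A minimal recurrent cell standing in for the LSTM."""
    return np.tanh(x @ Wx + h @ Wh)

def rsn_encode(weighted_path, Wx, Wh, S1, S2):
    """Recurrent skipping network over an (e1, r1, e2, ...) sequence:
    at relation positions (odd indices), the output state also receives
    a skip connection from the subject entity's input embedding."""
    h = np.zeros(d)
    states = []
    for i, x in enumerate(weighted_path):
        h = rnn_cell(x, h, Wx, Wh)
        if i % 2 == 1:                          # relation position
            out = h @ S1 + weighted_path[i - 1] @ S2
        else:                                   # entity position
            out = h
        states.append(out)
    return states                               # L = (L_1, ..., L_{2m-1})

w = np.array([0.9, 0.7, 0.4, 0.8, 0.6])         # contribution weights
embeds = rng.normal(size=(5, d))                # avg-pooled component embeddings
weighted_path = w[:, None] * embeds             # weight each component (Eq. 4)
Wx, Wh, S1, S2 = (rng.normal(size=(d, d)) for _ in range(4))
L = rsn_encode(weighted_path, Wx, Wh, S1, S2)
assert len(L) == 5 and L[0].shape == (d,)
```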

Query-based Question Generator
Taking the query representation L as input, we generate the corresponding question. Following the sequence-to-sequence paradigm of the NQG model (Du et al., 2017), we take the concatenation of the final query representation L_{2m−1} and the final hidden state h^c_n of the input text from Eq. 1 as the initial state of the decoder and generate the question word by word. The decoder is an LSTM with an attention mechanism, and both the sentence context and the learned query L are utilized in the attention module.
The decoder attends to the learned query L = (L_1, ..., L_{2m−1}) to obtain an attention-based query representation d^1_t. It also attends to the textual context states h^c = (h^c_1, ..., h^c_n) and computes an attended context d^2_t. Then h_t, d^1_t, and d^2_t are concatenated to calculate the softmax probability distribution over the whole vocabulary, where y_t is the prediction at time t and the generated question is y = (y_1, ..., y_{|y|}).
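One decoding step with the two attention modules can be sketched as below. This is a simplified dot-product version; the real decoder uses trained LSTM states and learned attention parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d, V = 8, 50  # hidden size and toy vocabulary size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention returning a weighted sum of `keys`."""
    return softmax(keys @ query) @ keys

def decode_step(h_t, query_repr, context_states, W_out):
    """One decoding step: attend over the learned query L and the
    context states h^c, then project [h_t; d1_t; d2_t] to the vocab."""
    d1 = attend(h_t, query_repr)      # attention over query representation
    d2 = attend(h_t, context_states)  # attention over textual context
    logits = np.concatenate([h_t, d1, d2]) @ W_out
    return softmax(logits)            # P(y_t | ...) over the vocabulary

h_t = rng.normal(size=d)              # decoder hidden state at step t
query_repr = rng.normal(size=(5, d))  # L = (L_1, ..., L_{2m-1})
context_states = rng.normal(size=(10, d))
W_out = rng.normal(size=(3 * d, V))
p = decode_step(h_t, query_repr, context_states, W_out)
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-6
```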

Variational Path-based Question Generation
Query representation learning can be treated as an inference process from the query path with the input text as the condition. Motivated by variational models for KG reasoning, we propose a variational inference model, PathQG-V, to train the query learner and question generator while further enforcing the interaction between them. It additionally introduces a posterior query learner to infer a posterior query distribution assuming the target question is provided. Compared with the original objective of PathQG in Eq. 8, the variational model aims to minimize the negative evidence lower bound (ELBO) in Eq. 9, where P_{θ_1}(L|x, s), P_{θ_3}(L|y, x, s) and P_{θ_2}(y|L, x) are the prior query distribution, the posterior query distribution, and the likelihood of the question y respectively. The structure of the variational model PathQG-V is shown in Figure 3. Note that the prior query learner and the query-based question generator are the same as the query learner and question generator in Section 3.
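Written out in the paper's notation, the negative ELBO of Eq. 9 is:

```latex
-\mathrm{ELBO}
= \mathrm{KL}\!\left(P_{\theta_3}(L \mid y, x, s) \,\middle\|\, P_{\theta_1}(L \mid x, s)\right)
  - \mathbb{E}_{P_{\theta_3}(L \mid y, x, s)}\!\left[\log P_{\theta_2}(y \mid L, x)\right]
\;\geq\; -\log P(y \mid x, s)
```

so minimizing the negative ELBO jointly pulls the prior query distribution toward the posterior and maximizes the likelihood of the target question.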

Posterior Query Learner
The posterior query learner is designed in a similar manner to the query learner in Section 3.1, except that the target question is given. We incorporate the target question y in the same way as the input text x: a BiLSTM encodes the question y to obtain the hidden states h^q = (h^q_1, ..., h^q_t). In the contribution weighting process, these question states are attended to in the same way as the context states h^c, yielding an attention output q_i at each step i. Then q_i and c_i are concatenated with the encoding hidden state h^s_i to compute the posterior contribution weight of the i-th component s_i in the path. Following Eqs. 4, 5 and 6, the posterior query representation L = (L_1, ..., L_{2m−1}) is then learned.

Optimization and Inference
During training, the posterior learned query representation L is fed to the question generator, and the objective is to minimize the negative ELBO in Eq. 9. The first term of the negative ELBO can be viewed as Eq. 11:

L_1 = KL(P_{θ_3}((s_1, ..., s_{2m−1}) | y, x, s) || P_{θ_1}((s_1, ..., s_{2m−1}) | x, s))

P_{θ_2}(y|L, x) is then the generation probability of the question y, and the log-likelihood can be rewritten as Eq. 12. We use the weighted path to form a query representation instead of sampling from the query distribution; therefore the second term of the negative ELBO can be formulated as Eq. 13, where the expectation over the posterior distribution E_{P_{θ_3}(L|y,x,s)}[·] is omitted.
To ensure the performance of the query representation learner, we also add a contribution weighting loss defined as Eq. 14. We combine all losses in a weighted manner as L = λL_1 + L_2 + βL_3 to jointly train the framework, where λ and β are weighting hyper-parameters.
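The combined objective can be sketched numerically as follows, reading L_1 as a KL divergence between factorized Bernoulli contribution-weight distributions, L_2 as the question negative log-likelihood, and L_3 as a binary cross-entropy against gold weighting labels. The factorized-Bernoulli reading and all concrete values are our assumptions:

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-8):
    """KL between factorized Bernoulli distributions over path components."""
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return float(np.sum(q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))))

def bce(pred, gold, eps=1e-8):
    """Binary cross-entropy for the contribution-weighting supervision."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(gold * np.log(pred) + (1 - gold) * np.log(1 - pred)))

def total_loss(post_w, prior_w, nll, gold_labels, lam=0.5, beta=1.0):
    """L = lambda * L1 + L2 + beta * L3 (weighted combination as in the text)."""
    L1 = bernoulli_kl(post_w, prior_w)  # KL(posterior || prior) over weights
    L2 = nll                            # question negative log-likelihood
    L3 = bce(post_w, gold_labels)       # contribution-weighting loss
    return lam * L1 + L2 + beta * L3

post_w = np.array([0.9, 0.2, 0.8])      # posterior contribution weights
prior_w = np.array([0.7, 0.3, 0.6])     # prior contribution weights
gold = np.array([1.0, 0.0, 1.0])        # gold involvement labels
loss = total_loss(post_w, prior_w, nll=2.3, gold_labels=gold)
assert loss > 0
```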
For the inference, only the prior query learner and the question generator are involved. The process is the same as PathQG.

Experimental Dataset
Our experiments are conducted on SQuAD (Rajpurkar et al., 2016), which consists of 61,623 sentences. Each sentence is annotated with several questions together with their answers extracted from the input text. We build our experimental dataset on top of SQuAD: we automatically construct a knowledge graph for each sentence and identify query paths for the ground truth questions for evaluation. The resulting dataset consists of 89,976 tuples (input sentence x, query path s, ground truth question y).

KG construction
We employ the scene graph parser (Schuster et al., 2015) for KG construction from a textual description. It identifies entities and their relationships in a text and builds a scene graph. The generated scene graph usually misses some key information in the text, so we employ a part-of-speech tagger to extract verb phrases between entities to further enrich the relationship labels. The extended scene graph is used as the knowledge graph for the input text. The average numbers of entities and facts in each KG are 6.53 and 4.68 respectively. The average information coverage rate of the input text by the constructed KG is 68.52%. Note that our question generation models are compatible with KGs constructed by other methods.
Complex question set construction Our setting is motivated by scenarios where questions are related to multiple facts. We are therefore curious about the effectiveness of our model for complex question generation and further construct a complex question set. A question is treated as complex if the corresponding query path contains more than 3 triples. The resulting complex question set contains 16,578 samples. Detailed statistics of the complex and whole datasets are given in Table 1.

Query path and question pairing We then identify corresponding query paths from the KG for the ground truth questions. In practice, a path can be determined by a start node and an end node. We thus use the answer entity of the question as the start node and the entity identified in the question as the end node. If the question contains multiple entities, we take the one farthest from the start node in the KG as the end node. We ignore edge directions to simplify the modeling of the query path.
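The pairing heuristic above (start at the answer entity, end at the question entity farthest away, ignoring edge directions) can be sketched with a breadth-first search; the function and its toy KG are our own illustration:

```python
from collections import deque

def pair_query_path(kg, answer_entity, question_entities):
    """Pick the end node as the question entity farthest (in hops) from
    the answer entity, treating the KG as an undirected graph.
    `kg` is a list of (e1, relation, e2) triples."""
    adj = {}
    for e1, _, e2 in kg:
        adj.setdefault(e1, set()).add(e2)
        adj.setdefault(e2, set()).add(e1)
    # BFS distances from the answer entity, ignoring edge directions.
    dist = {answer_entity: 0}
    queue = deque([answer_entity])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reachable = [e for e in question_entities if e in dist]
    return max(reachable, key=dist.__getitem__) if reachable else None

kg = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
end = pair_query_path(kg, answer_entity="a", question_entities=["b", "d"])
assert end == "d"  # "d" is 3 hops from "a", farther than "b" at 1 hop
```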

Implementation Details
We construct separate vocabularies for input texts and questions by keeping words that appear more than twice. GloVe (Pennington et al., 2014) is used to initialize word embeddings with dimension 300, and the embedding for BIO tags is randomly initialized with size 20. The size of the hidden units of the LSTM cells in all encoders is 300, while that of the generation decoder is 1200. The hyper-parameters balancing the losses are chosen as λ = 0.5 and β = 0. In the decoding process, we use beam search with beam size 5. Refer to Appendix A for further information on training details and parameter counts.
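The vocabulary construction (keep words appearing more than twice) is straightforward; the special tokens below are our own placeholders:

```python
from collections import Counter

def build_vocab(token_lists, min_count=3, specials=("<pad>", "<unk>")):
    """Keep words appearing more than twice (count >= 3), then map
    special tokens and surviving words to integer ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(list(specials) + words)}

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
vocab = build_vocab(corpus)
assert "the" in vocab and "cat" not in vocab  # "cat" appears only twice
```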

Models for Comparison
We compare our approach with several state-of-the-art models. For fair comparison, the start and end nodes of the query path are provided to all models.
-NQG+ is an attention-based encoder-decoder model with the sentence as input, which uses the BIO tagging scheme to incorporate additional entity information (start and end nodes) to generate questions.
-AFPA (Sun et al., 2018) combines an answer-focused model and a position-aware model for question generation. For fair comparison, the model is re-trained with the rich features (NE and POS) removed and the end entity provided.
-ASs2s (Kim et al., 2019) utilizes additional answer information via answer separation. For fair comparison, we do not implement the retrieval-style word generator, and the model is re-trained in our setting with the end entity supplied.
-NQG+ (pl) is an extension of NQG+. Instead of learning a continuous latent query L, we sample entities and relations from the path via sequence labeling. Together with the start and end entities, this identified extra information is encoded using the BIO scheme for question generation.
-PathQG is our proposed generation framework consisting of a query representation learner and a query-based question generator. PathQG-V is the variational version of PathQG with an additional posterior query learner.
-NQG++ is an oracle model that is aware of all path information contained in the target question and encodes it via the BIO scheme. It can be treated as the upper bound of NQG+ (pl). We present this result for reference.

Automatic Evaluation Results
For the automatic evaluation, we utilize several widely adopted metrics, including BLEU 1-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin and Hovy, 2003). We also compare results at the semantic content level using SPICE (Anderson et al., 2016), which evaluates the similarity of the scene graphs generated from candidate and reference questions. Evaluation results on both the whole and complex datasets are shown in Tables 2 and 3. We have several findings:
-PathQG-V outperforms other models in terms of all metrics on both datasets by a considerable margin. This indicates the effectiveness of our variational inference framework for modeling the query path for better question generation.
-PathQG, which identifies the involved entities and relations along the path, performs better than NQG+, AFPA and ASs2s, which demonstrates the effectiveness of introducing more related facts for question generation. The improvement of PathQG over NQG+ (pl) shows the necessity of joint training.
-Our model yields larger improvements on the complex dataset than on the whole dataset. This matches our intuition that questions related to longer query paths are more complicated, and our model has more of an advantage in these cases. Using the length of the query path to control the difficulty of questions is also a novel design (Gao et al., 2018).
-NQG+, AFPA and ASs2s utilize the answer for QG in different ways, and our model PathQG-V follows NQG+ for simplicity. Given the improvements of AFPA and ASs2s over NQG+, our model could be further adapted to follow their strategies for better performance.
-Although PathQG-V achieves good performance, there is still a certain gap between it and the oracle model NQG++. This shows that query learning from the path still has room for improvement.
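For reference, a minimal sentence-level BLEU with clipped n-gram precision, add-1 smoothing, and brevity penalty can be sketched as below. The evaluation above presumably relied on standard scripts; this simplified version is only illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal sentence BLEU: geometric mean of clipped n-gram precisions
    (add-1 smoothed) times a brevity penalty."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        log_p += math.log((overlap + 1) / (total + 1)) / max_n  # add-1 smoothing
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_p)

c = "where was the club founded".split()
assert abs(bleu(c, c) - 1.0) < 1e-9          # perfect match scores 1.0
assert 0.0 < bleu(c, "the dog ran away fast".split()) < 1.0
```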

Human Evaluation Results
To better evaluate the quality of the generated questions, we conduct human evaluation through Amazon Mechanical Turk (AMT). We randomly choose 100 instances, and 3 crowd annotators are invited to compare the questions generated by PathQG-V with those from NQG+, AFPA and ASs2s in a pairwise manner. For each instance, the annotators are asked to read the text with the answer and compare two candidate questions to determine which one is better in terms of three aspects. (1) Fluency: the question is fluent.
(2) Correctness: the question is consistent to the text and the answer.
(3) Informativeness: the question contains specific information from the input text. The comparison results are shown in Figure 4. We can see that our model outperforms the others in all three aspects. This further proves that our model can generate more informative and consistent questions.

Further Analysis
In order to evaluate whether our model can utilize the facts in the input text to generate questions with less irrelevant information, we analyse the relevance of the generated questions to the text. We also present case studies.

Figure 4: Pairwise comparison between the questions generated by PathQG-V and other methods in three characteristics. Each color shows the percentage of annotators who consider the question generated by the corresponding method better. "Tie" means hard to tell.
Relevance of generated questions We evaluate the relevance of the generated questions to the input texts for different models by computing the overlapping rate. The results are presented in Figure 5. On both datasets, PathQG-V achieves the highest overlapping rate among all models, which shows that our model can better utilize facts in the input text to generate more relevant questions. The improvement of PathQG over the other models reveals the effectiveness of learning the involved entities and relations along the path for question generation.

Case studies In sample 1, the question generated by our model PathQG-V is more informative and specific, containing the information "plymouth" and "late 18th". In sample 2, our generated question is consistent with the input text, while the one from NQG+ contains the irrelevant phrase "swazi economy".
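The overlapping rate can be computed as a simple token-overlap statistic. The exact definition below (fraction of non-stopword question tokens found in the input text) is our guess at the metric, and the stopword list is our own:

```python
def overlap_rate(question, text):
    """Fraction of non-stopword question tokens that also appear in the
    input text (our reading of the 'overlapping rate' above)."""
    stop = {"what", "who", "where", "when", "which", "how", "is", "was",
            "the", "a", "an", "of", "in", "to", "did", "does", "?"}
    content = [t for t in question if t not in stop]
    text_set = set(text)
    return sum(t in text_set for t in content) / max(len(content), 1)

text = "the club was founded in plymouth in the late 18th century".split()
good = "where was the club founded ?".split()
bad = "what did the swazi economy do ?".split()
assert overlap_rate(good, text) > overlap_rate(bad, text)
```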

Related Work
Question Generation, aiming at generating questions from a range of inputs such as raw text (Heilman and Smith, 2010), structured data (Serban et al., 2016) and images (Mostafazadeh et al., 2016; Fan et al., 2018a,b), has attracted increasing attention in recent years. Most previous studies on textual question generation are rule-based and transform a declarative sentence into an interrogative sentence according to hand-crafted patterns (Heilman and Smith, 2010; Heilman, 2011). With the advance of neural networks, Du et al. (2017) propose to apply a seq2seq structure with attention for automatic question generation. As follow-ups, Sun et al. (2018) and Kim et al. (2019) propose to utilize the answers to decrease the generation uncertainty. Meanwhile, Li et al. (2019) explore using answer-relevant context to guide question generation. Besides, some studies (Wang et al., 2017; Wang et al., 2019) treat question generation as a subtask and learn it jointly with other tasks, such as question answering and phrase extraction, which also helps to alleviate the uncertainty and improve generation performance.
Another stream of research generates questions from a KG. Reddy et al. (2017) and Elsahar et al. (2018) explore generating questions from a single KG triple using text as context information. This is close to our setting, but differs in two aspects. First, we form a query path consisting of multiple triples for question generation instead of a single triple. Second, the context we process is the text from which the triples are extracted; this setting is more natural and differs from their use of retrieved text as context.

Conclusion and Future Work
In this paper, we propose to model facts in the input text as a knowledge graph for question generation. We present a novel task of generating a question based on a query path from the constructed KG. We propose to learn query representations for question generation in a joint model, and a variational inference model is also proposed. We extend the SQuAD dataset by automatically constructing a KG for each input sentence and identifying the corresponding query paths for ground truth questions. Experimental results prove the effectiveness of our proposed model qualitatively and quantitatively.
In the future, there are two research directions. First, we would like to explore more explainable reasoning methods for question generation, such as symbolic models. Second, novel evaluation metrics for question generation that take consistency and informativeness into consideration would be of interest.