Ask to Learn: A Study on Curiosity-driven Question Generation

We propose a novel text generation task, namely Curiosity-driven Question Generation. We start from the observation that the Question Generation task has traditionally been considered as the dual problem of Question Answering, hence tackling the problem of generating a question given the text that contains its answer. Such questions can be used to evaluate machine reading comprehension. However, in real life, and especially in conversational settings, humans tend to ask questions with the goal of enriching their knowledge and/or clarifying aspects of previously gathered information. We refer to these inquisitive questions as Curiosity-driven: these questions are generated with the goal of obtaining new information (the answer) which is not present in the input text. In this work, we experiment on this new task using a conversational Question Answering (QA) dataset; further, since the majority of QA datasets are not built in a conversational manner, we describe a methodology to derive data for this novel task from non-conversational QA data. We investigate several automated metrics to measure the different properties of Curious Questions, and experiment with different approaches on the Curiosity-driven Question Generation task, including model pre-training and reinforcement learning. Finally, we report a qualitative evaluation of the generated outputs.


Introduction
The growing interest in Machine Reading Comprehension (MRC) has sparked significant research efforts on Question Generation (QG), the dual task to Question Answering (QA). In QA, the objective is to produce an adequate response given a query and a text; conversely, for QG, the task is generally defined as generating relevant questions given a source text and, optionally, a specific target answer included therein. To our knowledge, all works tackling QG have thus far exclusively focused on generating relevant questions which can be answered given the source text: for instance, given "The 1st COLING conference took place in 1965" as input, a question likely to be automatically generated would be "When did the 1st COLING conference take place?", where the answer "1965" is a span of the input. Such questions are useful to evaluate reading comprehension for both machines (Hermann et al., 2015; Eyal et al., 2019) and humans (Mani et al., 1999).
However, the human ability of asking questions goes well beyond evaluation: asking questions is essential in education (Gall, 1970) and has been proven to be fundamental for children's cognitive development (Chouinard et al., 2007). Curiosity is baked into the human experience: it allows one to extend one's comprehension and knowledge by asking questions that, while being relevant to context, are not directly answerable by it, thus being inquisitive and curious. The significance of this kind of question is two-fold: first, they allow for gathering novel relevant information, e.g. a student asking for clarification; second, they are tightly linked to one's understanding of the context, e.g. a teacher testing a student's knowledge by asking questions whose answers require a deeper understanding of the context and more complex reasoning.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

From an applicative point of view, we deem the ability to generate curious, inquisitive, questions as highly beneficial for a broad range of scenarios: i) in the context of human-machine interaction (e.g. robots, chat-bots, educational tools), where the communication with the users could be more natural; and ii) during the learning process itself, which could be partially driven in a self-supervised manner, reminiscent of how humans learn by exploring and interacting with their environment. To our knowledge, this is the first paper attempting to tackle Curiosity-driven neural question generation.
The contributions of this paper can be summarized as follows:
• we propose a new natural language generation task: curiosity-driven question generation;
• we propose a method to derive data for the task from popular non-conversational QA datasets;
• we experiment using language model pre-training and reinforcement learning, on two different datasets;
• we report a human evaluation analysis to assess both the pertinence of the automatic metrics used and the efficacy of the proposed dataset-creation method above.

Related Works
Deep learning models have been widely applied to text generation tasks such as machine translation (Kalchbrenner and Blunsom, 2013), abstractive summarization (Rush et al., 2015) or dialog (Henderson et al., 2013), providing significant gains in performance. The state-of-the-art approaches are based on sequence-to-sequence models (Cho et al., 2014; Sutskever et al., 2014). In recent years, significant research efforts have been directed to the tasks of Machine Reading Comprehension (MRC) and Question Answering (QA) (Hermann et al., 2015; Rajpurkar et al., 2016). The data used for tackling these tasks are usually composed of {context, question, answer} triplets: given a context and the question, a model is trained to predict the answer. Following QA, research on Question Generation (QG) (Amidei et al., 2018) has also seen increasing interest from the community: the QG task (Du et al., 2017) can be considered as the dual task for QA (Duan et al., 2017): given a context and an answer span, the model is trained to generate the corresponding question. One of the main motivations is that an effective QG model can be used to generate synthetic data in order to augment existing QA datasets (Yuan et al., 2017; Alberti et al., 2019). For instance, Yuan et al. (2017) proposed a reinforcement learning setup trained using a QA-based metric reward: given a paragraph and an answer, the model first generates questions; then, the paragraph and the corresponding generated questions are given to a pre-trained QA model which predicts an answer; finally, the reward is computed as the number of overlapping words between the ground truth answer and the predicted answer. In other words, the reward to maximize, for the QG model, corresponds to the QA score. For an extensive evaluation of models trained with different rewards we refer the reader to the work of Hosking and Riedel (2019).
Most of these works follow the approach by Ranzato et al. (2015), who applied reinforcement to neural machine translation: first, a sequence to sequence model is trained under teacher forcing (Williams and Zipser, 1989) to optimize cross-entropy, hence helping to reduce the action space (i.e. the vocabulary size); then, the model is finetuned with a mix of teacher forcing and REINFORCE (Williams, 1992). For automatic evaluation, all previous works on QG resort to BLEU metrics (Papineni et al., 2002), originally developed and widely used in Machine Translation. However, how to evaluate text generation models remains an open research question: Nema and Khapra (2018) pointed out that, on QG tasks, the correlation between BLEU and human evaluation was poor.
A thorough investigation of the behavior of open-domain conversational agents has been recently presented by See et al. (2019). Using controllable neural text generation methods, the authors control important attributes for chit-chat dialogues, including question-asking behavior. One of the take-away messages of this work is that question-asking represents an essential component in an engaging chit-chat pipeline: the authors find, via a large-scale human validation study, that agents with higher rates of question-asking obtain qualitative improvements in terms of inquisitiveness, interestingness and engagingness.
Indeed, in a conversational setting, it can be expected that the nature of follow-up questions significantly differs from those used as target in a traditional QG training setup: as mentioned earlier, QG has so far been framed as the dual task to QA, hence training models to generate questions whose answer is present in the input context. In contrast, we argue that in natural conversations the questions follow the input context but are rather a means to augment one's knowledge (as their answer is not explicit in the input context). In this work, we thus define the task as Curiosity-driven Question Generation.

Dataset
Question Answering datasets are usually composed of a set of questions associated with reading passages (the context) and the corresponding answers contained therein. The QA task is defined as finding the answer to a question given the context. Conversely, the Question Generation (QG) task is to generate the question given the input and (optionally) the answer. Most previous efforts on the QG task have resorted to the widely used Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). It contains roughly 100,000 questions posed by crowd-workers on a selected sample of Wikipedia articles. Several other QA datasets have also been recently published, accounting for characteristics such as requiring multi-passage or discrete reasoning (Yang et al., 2018; Dua et al., 2019); further, conversational QA datasets have been made available: CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) have the desirable property of being in a dialogue-like setting.
In our scenario, Curiosity-driven QG, the reading passage associated with a question should not contain the answer, but rather pave the way for asking a new, curious question, whose answer would eventually enrich the knowledge on the matter at hand. Therefore, a natural choice to build QG data would be to rely on existing datasets for conversational QA. A detailed comparison of the above-mentioned CoQA and QuAC datasets is provided by Yatskar (2019), who reports the proportion of Topic Error (i.e. questions unlikely to be asked in the context) and Entity Salad (i.e. questions unanswerable for any context): compared to QuAC, CoQA is found to include significantly more Topic Error and Entity Salad questions. For this reason, we resort to QuAC in order to derive data for Curiosity-driven QG.
Furthermore, recognizing the fact that the great majority of available QA datasets do not account for conversational characteristics, we propose a methodology to derive data for Curiosity-driven Question Generation from standard QA datasets, and apply it to the popular SQuAD (Rajpurkar et al., 2016).
For both our data sources, and consistently with standard QA and QG tasks, we encode each sample as a triplet $\{P, q, a\}$ where the paragraph $P$ comprises $n$ sentences $[s_0, \ldots, s_n]$, and $a$ represents the answer to the question $q$. A canonical QG approach would thus use $s_a$, i.e. the sentence of $P$ that contains the answer, as source, and $q$ as generation target. On the contrary, for Curiosity-driven QG, any sentence $s_x$ from $P$ can potentially be used as the source sequence, as long as it does not contain the answer, i.e. under the necessary constraint $x \neq a$. In the following subsections, we elaborate on additional constraints depending on the nature of the source data.
In general, we define samples as triplets

$$\{s_x, P', y\} \tag{1}$$

where $s_x$ and $P'$ are, respectively, the input sentence and the paragraph $P$ modified according to the appropriate dataset-dependent constraint, as detailed in the following, and $y$ is the reference (target) question.

Conversational QA Data
Table 1: [...] Du et al. (2017), from which our data is derived. The bottom rows refer to the data we obtain using our methodology, with and without NER constraining.

As mentioned above, we first derive our data from the QuAC dataset, which is built from Wikipedia articles by iterating over the following procedure: given a sentence, a student annotator asks a relevant question for which they do not have the answer; then, the teacher (annotator) retrieves a sentence that contains the answer. Thus, the logical conversational ordering in QuAC makes each question curious by design, given the text that precedes it. More formally, for a question $q$ (our target), we consider the source $s_x$ as the text in $P$ preceding the sentence $s_a$ that contains the answer. In other words, our QuAC-derived dataset is built by applying the stricter constraint $x < a$. Numerically, QuAC amounts to 83,568 questions (on 11,567 articles) for the train set, 7,354 for the validation set and 7,353 for the test set (each covering 1,000 articles). Since the test set is not public, we use the original QuAC validation set for testing. From the training set, we randomly drop 1,000 articles (hence, 7,224 samples) which we use to derive our validation set, thus resulting in 76,345 questions for training.
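As a rough illustration of the constraint x < a, the sketch below builds one sample from a paragraph that has already been split into sentences. The names `Sample` and `quac_style_sample` are hypothetical, the third sentence and the question are invented toy data, and treating the context as the paragraph minus the answer sentence is an assumption based on how the context is described for the QA-based metrics; it is not the preprocessing code used in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    source: str    # s_x: the input text shown to the model
    context: str   # P': paragraph without the answer sentence (assumption)
    target: str    # y: the reference question


def quac_style_sample(sentences: List[str], a: int, question: str) -> Sample:
    """Build a sample whose source is the text preceding sentence s_a (x < a)."""
    if not 0 < a < len(sentences):
        raise ValueError("need at least one sentence before the answer sentence")
    source = " ".join(sentences[:a])
    context = " ".join(s for i, s in enumerate(sentences) if i != a)
    return Sample(source=source, context=context, target=question)


# Toy paragraph; the last sentence and the question are made up.
paragraph = [
    "Nikola Tesla was the fourth of five children.",
    "Nikola had an older brother named Dane.",
    "Tesla later studied engineering.",
]
sample = quac_style_sample(paragraph, a=1, question="Did he have any brothers?")
```

Because only the preceding text is used, the source can never contain the answer sentence, which is what makes the target question curious by construction.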

Standard QA Data
As discussed in Section 2, most of the available QA datasets are not conversational. Thus, we propose a simple method to obtain data for Curiosity-driven QG from standard QA datasets. For this, we use the widely popular SQuAD (Rajpurkar et al., 2016), and specifically the original splits released by Du et al. (2017), which are commonly used for Question Generation. As opposed to QuAC, the questions in SQuAD do not follow a logical ordering. Therefore, any sentence $s_x$ from $P$ can potentially be used as the source sequence, as long as it does not contain the answer $a$ (constraint: $x \neq a$). Nonetheless, as is reasonable for factoid QA datasets, several questions are so specific to their associated sentence $s_a$ that they would be extremely unlikely to be asked without knowing the contents of $s_a$ itself. To exemplify this issue, take the following paragraph from SQuAD: "Nikola Tesla was the fourth of five children. Nikola had an older brother named Dane [..]". Given "Nikola had an older brother named Dane." as $s_a$, and operating under the sole constraint $x \neq a$, the sentence "Nikola Tesla was the fourth of five children" would be eligible as a source $s_x$ for the target question "Who was Dane?". This question can only be asked if either contextual information or background knowledge is available, since it requires knowing that Dane was among Tesla's four siblings. To overcome this problem, we add a further constraint based on Named Entity Recognition (NER): $s_x$ is an acceptable input only if all the entities present in the question $q$ are also present in the input sentence $s_x$. In the previous example, this would thus filter out the target "Who was Dane?" while allowing for "How many brothers and sisters did Nikola have?". For our experiments we used spaCy. In Table 1 we report the number of samples we obtained from SQuAD before and after applying NER filtering.
After applying the above methodology to construct a dataset for Curiosity-driven QG, our dataset contains 25,356 samples for training, 2,076 for development, and 2,087 for testing.
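The NER-based filter can be sketched as follows. This is only an illustration under stated assumptions: `toy_entities` is a hypothetical gazetteer lookup standing in for a real NER system (the experiments used spaCy), `eligible_sources` is an invented helper name, and how entity mentions are normalised and matched is not specified in the text.

```python
from typing import Callable, List, Set

# Toy stand-in for an NER system: a fixed gazetteer of entity tokens.
# A real pipeline would call an NER model here instead.
GAZETTEER = {"Nikola", "Tesla", "Dane"}


def toy_entities(text: str) -> Set[str]:
    tokens = (tok.strip(".,?!\"'") for tok in text.split())
    return {tok for tok in tokens if tok in GAZETTEER}


def eligible_sources(
    sentences: List[str],
    a: int,
    question: str,
    entities: Callable[[str], Set[str]] = toy_entities,
) -> List[int]:
    """Indices x != a whose sentence mentions every entity found in the question."""
    needed = entities(question)
    return [
        x for x, sent in enumerate(sentences)
        if x != a and needed <= entities(sent)
    ]


paragraph = [
    "Nikola Tesla was the fourth of five children.",
    "Nikola had an older brother named Dane.",
]
# "Dane" does not appear in sentence 0, so no source is eligible.
rejected = eligible_sources(paragraph, a=1, question="Who was Dane?")
# "Nikola" appears in sentence 0, so it is kept as a source.
accepted = eligible_sources(
    paragraph, a=1, question="How many brothers and sisters did Nikola have?"
)
```

Passing the entity extractor as a parameter keeps the filtering logic independent of the NER library, so the toy lookup can be swapped for a real model without touching the constraint itself.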

Metrics
Automatic evaluation of Natural Language Generation (NLG) systems is a challenging task (Nema and Khapra, 2018). For QG, n-gram based similarity metrics are commonly used. These measures evaluate how similar the generated text is to the corresponding reference(s). While they are known to suffer from several shortcomings (Liu et al., 2016; Paulus et al., 2017; Scialom et al., 2019a), they make it possible to evaluate specific properties of the developed models. In this work, we use various automatic metrics detailed below, and we assess their quality for our task through a human evaluation (see Section 6).
BLEU One of the most popular metrics for QG, BLEU (Papineni et al., 2002) provides a set of measures to compare automatically generated texts against one or more references. In particular, BLEU-N is based on the count of overlapping n-grams between the candidate and its corresponding reference(s).
Self-BLEU Within the field of Computational Creativity, Diversity is considered a desirable property (Karampiperis et al., 2014). Indeed, always generating the same question, such as "What is the meaning of the universe?", would be an undesirable behavior, reminiscent of the "mode collapse" observed in Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). Therefore, we adopt Self-BLEU, originally proposed by Zhu et al. (2018), as a measure of diversity for the generated text sequences. Self-BLEU is computed as follows: for each generated sentence $s_i$, a BLEU score is computed using $s_i$ as hypothesis while the other generated sentences are used as references. When averaged over all the generated sentences, it thus provides a measure of how diverse the questions are: low Self-BLEU scores indicate high diversity.
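The Self-BLEU procedure described above can be sketched with a small, dependency-free BLEU. This is a simplified, unsmoothed sentence-level variant written for illustration; the function names are hypothetical, and the scores would not match an established BLEU toolkit exactly.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp: List[str], refs: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with clipped n-gram counts."""
    if not hyp or not refs:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(generated: List[List[str]], max_n: int = 4) -> float:
    """Average BLEU of each sentence against all the other generated sentences."""
    scores = [
        bleu(hyp, generated[:i] + generated[i + 1:], max_n)
        for i, hyp in enumerate(generated)
    ]
    return sum(scores) / len(scores)


# Three identical questions versus three questions sharing no words.
# max_n=2 is used because the toy sentences are too short for 4-grams.
collapsed = self_bleu([["what", "is", "this"]] * 3, max_n=2)
diverse = self_bleu(
    [["who", "was", "he"], ["when", "did", "it", "end"], ["where", "is", "that"]],
    max_n=2,
)
```

In this toy case the collapsed set scores 1.0 and the fully disjoint set scores 0.0, which matches the reading that lower Self-BLEU means more diverse outputs.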
QA-based metrics Given a text, a question can be considered curious if the answer is not contained in the input text. In our framing, this implies that a question $q$ should not be answerable given its corresponding input sentence $s_x$. Thanks to the recent improvements obtained on Question Answering tasks (for instance, human-level performance has been achieved on SQuAD-v1), the answerability of a question can be automatically measured. Therefore, given a question-context pair as input to a QA model, two types of metrics can be computed: n-gram-based, measuring the average n-gram overlap between the retrieved answer and the ground truth; and probability-based, i.e. the confidence of the QA model in its retrieved answer, which corresponds to the probability assigned by the QA model to the retrieved answer being correct. This latter metric is more abstractive, allowing more flexibility beyond n-grams.
Since several diverse questions can be generated for a given input, we consider the latter metric (probability-based) to better fit the Curiosity-driven QG task. Hence, given the evaluated question $q$ and the input text $s_x$, we define a metric $QA_{prob}$ as the confidence of the QA model that its predicted answer is correct. This metric measures answerability of $q$ given $s_x$: therefore, the lower this score, the less likely the answer is contained in the input text.
While being non-answerable represents a necessary condition for $q$ being a curious question with respect to its context $s_x$, we also want $q$ to be as relevant and useful as possible. To this end, we compute the above $QA_{prob}$ for question $q$ on $P'$, which represents the source paragraph stripped of the sentence containing the answer (see Eq. 1). The higher this score, the more likely the question is relevant and useful to augment the knowledge provided by $s_x$. Thus, the two proposed metrics are defined as $QA_{source} = QA_{prob}(q, s_x)$ and $QA_{context} = QA_{prob}(q, P')$. Hence, under our definition, Curiosity-driven questions are those that minimize $QA_{source}$ while maximizing $QA_{context}$. In other words, we want a curious question to not be answerable given its input, while being answerable given the context.
To compute these QA-based metrics, we use the HuggingFace implementation 4 of BERT (Devlin et al., 2018).
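The two QA-based metrics reduce to calling the same answerability scorer on two different texts. The sketch below makes that explicit; `qa_metrics` and `make_lookup_qa` are hypothetical names, and the lookup-table scorer is only a placeholder for a trained QA model returning a confidence in [0, 1], with invented values.

```python
from typing import Callable, Dict

# (question, text) -> confidence in [0, 1] that the predicted answer is correct
QAProb = Callable[[str, str], float]


def qa_metrics(question: str, s_x: str, p_prime: str, qa_prob: QAProb) -> Dict[str, float]:
    """Score one question against its source sentence and its context."""
    return {
        "QA_source": qa_prob(question, s_x),
        "QA_context": qa_prob(question, p_prime),
    }


def make_lookup_qa(table: Dict[str, float]) -> QAProb:
    """Placeholder scorer: returns a preset confidence for each known text."""
    return lambda question, text: table.get(text, 0.0)


# Invented confidences, purely for illustration.
toy_qa = make_lookup_qa({"SOURCE SENTENCE": 0.05, "CONTEXT PARAGRAPH": 0.80})
scores = qa_metrics(
    "Did he have any brothers?", "SOURCE SENTENCE", "CONTEXT PARAGRAPH", toy_qa
)
```

Under the definition given above, this toy question would count as curious: it receives a low QA_source and a high QA_context.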

Experiments
Baseline model As baseline architecture we adopt the popular Transformer (Vaswani et al., 2017), which proved to perform well on a wide range of text generation tasks, including neural machine translation (Ott et al., 2018b), automatic summarization (Gehrmann et al., 2018), and question generation (Dong et al., 2019; Scialom et al., 2019b). It can be briefly described as a sequence-to-sequence model with symmetric encoder and decoder based on a self-attention mechanism, which overcomes the inherent obstacles to parallelism present in recurrent models such as Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997).
The copy mechanism (Gulcehre et al., 2016) proved beneficial for QG (Zhao et al., 2018; Scialom et al., 2019b); indeed, the QG task is very sensitive to rare and out-of-vocabulary words such as named entities, and this mechanism helps deal with them efficiently: more than 50% of the answers in SQuAD, for instance, correspond to named entities (see Table 2 in Rajpurkar et al. (2016)). Hence, following Gehrmann et al. (2018) and Scialom et al. (2019b), we include a copy mechanism in our Transformer architecture.
For our experiments, we used the following hyper-parameters for the Transformer: N = 2 (number of blocks); d_model = 256 (hidden state dimension); d_ff = 512 (position-wise feed-forward network dimension); and h = 2 (number of attention heads). Experiments run with the original hyper-parameters as proposed by Vaswani et al. (2017) obtained consistent and numerically similar results. During training, we used mini-batches of size 64 and the Adam optimizer (Kingma and Ba, 2014). At generation time, the decoding steps are computed via beam search with k = 5.
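For reference, the hyper-parameters listed above can be gathered in a small configuration object. The class and field names are hypothetical and do not correspond to any particular library; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    num_blocks: int = 2    # N: number of encoder/decoder blocks
    d_model: int = 256     # hidden state dimension
    d_ff: int = 512        # position-wise feed-forward dimension
    num_heads: int = 2     # h: number of attention heads
    batch_size: int = 64   # mini-batch size used during training
    beam_size: int = 5     # k used for beam search at generation time

    @property
    def head_dim(self) -> int:
        """Per-head dimension; d_model must split evenly across heads."""
        if self.d_model % self.num_heads:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads


config = TransformerConfig()
```

With these values each attention head works on 128 dimensions.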

Reinforcement
Reinforcement Learning (RL) is an efficient technique to maximize discrete metrics for text generation. Previously, Ranzato et al. (2015) used the REINFORCE algorithm (Williams, 1992) to train RNNs for several generation tasks, showing improvements over previous supervised approaches. Moreover, Paulus et al. (2017) combined supervised and reinforcement learning, demonstrating improvements over competing approaches both in terms of ROUGE and on human evaluation. However, the metrics used as reward are often found to overfit, leading to numerical improvements which do not correspond to increased output quality and may even degrade it, reducing the effectiveness of the trained models for practical applications. On this matter, and with a particular focus on QG, Hosking and Riedel (2019) performed a human evaluation of RL models trained with several metrics as reward, finding them to be indeed poorly aligned with human judgments: the models appear to learn to exploit the weaknesses of the reward source. In particular, the model learns to generate questions which are adversarial to a QA model: although they are meaningless, the QA model would systematically be duped into assigning a high probability to their answerability. For more details on adversarial probing of QA systems we refer to Jia and Liang (2017). To overcome this issue, we propose to use a balanced reward:

$$r(q, P, P') = QA_{context} - QA_{source} \tag{2}$$

thus maximizing the probability of finding an answer to the generated question within the input paragraph but not in the source sentence. We hypothesize that such a metric might lead the model to avoid generating adversarial questions, having to find a balance between $QA_{context}$ and $-QA_{source}$.
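A small numeric sketch may help to see why the balanced reward is expected to discourage adversarial questions. The function name and the confidence values below are invented for illustration; the only element taken from the text is the difference in Eq. 2.

```python
def balanced_reward(qa_context: float, qa_source: float) -> float:
    """Reward of Eq. 2: high only when the question looks answerable from
    the context but not from the source sentence."""
    return qa_context - qa_source


# Invented confidences for two hypothetical generated questions.
# A curious question: answerable from the context, not from the source.
curious = balanced_reward(qa_context=0.80, qa_source=0.05)
# An adversarial question that fools the QA model on any text.
adversarial = balanced_reward(qa_context=0.95, qa_source=0.93)
```

A question that tricks the QA model regardless of the text it is paired with raises both terms at once, so its reward stays close to zero, whereas the curious question keeps most of its context score.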
In our experiments, we follow the approach proposed by Ranzato et al. (2015) and Paulus et al. (2017), considering a mixed loss $L_{ml+rl}$ which combines supervised and reinforcement learning schemes. The maximum likelihood loss $L_{ml}$ is defined as

$$L_{ml} = -\sum_{t=0}^{m} \log p(y_t \mid y_0, \ldots, y_{t-1}, X)$$

with $X = [x_1, \ldots, x_n]$ representing the source text of length $n$ and $Y = [y_1, \ldots, y_m]$ the corresponding reference question of length $m$. Conversely, we define the reinforcement loss $L_{rl}$ to be minimized according to the standard RL actor-critic scheme, where $r(q, P, P')$ is the reward function defined in Eq. 2. Greedy decoding according to the conditional distribution $p(y \mid X)$ is used to obtain a sequence $\hat{Y}$. The model is sampled using its Markov property, i.e. one token at a time, producing the output sequence $Y^s$.

Table 2: Results obtained on QuAC-derived data. b1, b3, b5 suffixes indicate the beam size used.

As shown in Table 1, the constrained dataset amounts to roughly three times fewer samples than both QuAC and the original SQuAD dataset it derives from. We thus investigate, for this dataset, the impact of pretraining the model under the traditional (i.e. not Curiosity-driven) QG training setup, using the training set provided by Du et al. (2017). Then, we resume training using the data obtained after applying the NER-based constraints for Curiosity-driven QG to the same training samples. For the QuAC Curiosity-driven dataset, the amount of data is comparable to the original dataset, given the conversational nature of QuAC. Therefore, we do not use pretraining for the experiments on QuAC.
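Returning to the mixed loss introduced above: the text names the mixed loss and the reinforcement loss without spelling out their formulas. As a hedged reconstruction, the self-critical formulation of Paulus et al. (2017), which the text cites, would read as follows. The mixing weight $\gamma$ and the sign convention are assumptions borrowed from that work and are not stated here; $r(\cdot)$ abbreviates the reward of Eq. 2 evaluated on the corresponding generated question.

```latex
\begin{align}
L_{ml+rl} &= \gamma \, L_{rl} + (1 - \gamma) \, L_{ml} \\
L_{rl} &= \bigl( r(\hat{Y}) - r(Y^{s}) \bigr)
          \sum_{t} \log p\bigl( y^{s}_{t} \mid y^{s}_{0}, \ldots, y^{s}_{t-1}, X \bigr)
\end{align}
```

Under this form, minimizing $L_{rl}$ raises the likelihood of a sampled sequence $Y^{s}$ whenever its reward exceeds that of the greedy baseline $\hat{Y}$, and lowers it otherwise.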

Results
Automatic metrics In Table 2 we report the results of our experiments on QuAC for the baseline model (base) and the RL model. We decode with beam search and, in addition to the questions generated with a beam size of k = 5, we also compute the results for k = 1 and k = 3. While one would expect a slight improvement on all metrics with increasing beam size, we observe a strong divergence among the results: increasing values of k correspond to significant improvements in terms of BLEU-4 and notable drops for BLEU-1. A similar phenomenon was observed by Ott et al. (2018a) in the context of machine translation: they found that the presence of 1 or 2% of noisy data is enough to significantly degrade the beam search results. In our case, one of the most frequently generated questions is "Are there any other interesting aspects about this article?". Indeed, this question accounts for 4.18% of the questions in our training set. On the test set we see that roughly 80% of the generated questions start with the token "are". This sequence is not very likely to be generated with a greedy search (k = 1): at any time step during the generation, if any other token has a higher probability, this question will be dismissed. On the other hand, with a larger beam, it is likely to be kept and eventually emerge as the most probable sequence among the remaining beams at the end of inference, consistent with what was observed by Ott et al. (2018a).
Moving to our SQuAD-based experiments, we observe that the models trained on SQuAD do not seem to suffer from this issue, since all the metrics improved when increasing the beam size from k = 1 to k = 5. This is consistent with the results reported by Zhao et al. (2018), where increasing the beam size slightly improves all the metrics. Thus, we only report the results with k = 5 in Table 3. A possible explanation is that SQuAD only contains factoid questions, as opposed to QuAC wherein, for instance, the open-ended question "Are there any other interesting aspects about this article" covers 4.18% of the samples.
We observe that the models trained with RL obtain, as could be expected, higher scores for $QA_{context}$ with respect to those trained without RL. A higher $QA_{context}$ implies that the QA model is more likely to find an answer in the near context of the source. $QA_{source}$ is lower, as expected, for SQuAD-based models, though comparatively higher than the models trained with RL on QuAC. We identify two possible reasons for this: first, the QA model is trained on answerable questions; second, the nature of the QuAC questions is less factoid than the SQuAD ones, and non-factoid questions can arguably be harder for the QA model to evaluate. This could explain why, in the RL setting, $QA_{context}$ (the evaluation on answerable questions) is higher for both SQuAD and QuAC models, but only SQuAD models achieve a lower $QA_{source}$ (the evaluation on non-answerable questions). Further, we observe that pretraining yields higher BLEU scores at the cost of lower Self-BLEU, thus showing an increased accuracy but less diversity in the generated questions. Indeed, we find that pretrained models tend to generate a higher number of questions starting with "What" compared to both other models and the references; the distribution for the first words of the human questions appears closer to that of non-pretrained models.
Human Evaluation In addition to the automatic metrics, we carried out a human evaluation. We chose to use the data from our SQuAD-based experiments in order to also measure the effectiveness of the proposed approach to derive Curiosity-driven QG data from a standard, non-conversational, QA dataset. We randomly selected 50 samples from the test set. Three professional English speakers were asked to evaluate the questions generated by: humans (i.e. the reference questions), and models trained using pre-training (PT) or reinforcement learning (RL), and all combinations of those methods. Before submitting the samples for human evaluation, the questions were shuffled. Ratings were collected on a 1-to-5 Likert scale, to measure to what extent the generated questions were: answerable by looking at their context; grammatically correct; requiring external knowledge to be answered; relevant to their context; and semantically sound.
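The interaction between beam size and a frequent generic question can be reproduced on a toy model. In the sketch below, the next-token probabilities are invented numbers rather than outputs of the trained models, and the beam search is a simplified version in which finished hypotheses leave the beam; it only illustrates how a sequence whose first token is not the most probable one can still win with k > 1.

```python
import math
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]

# Toy next-token distributions (invented numbers, not model outputs).
MODEL: Dict[Prefix, Dict[str, float]] = {
    (): {"what": 0.55, "are": 0.45},
    ("what",): {"happened": 0.4, "year": 0.3, "did": 0.3},
    ("are",): {"there": 0.95, "they": 0.05},
    ("what", "happened"): {"</s>": 1.0},
    ("what", "year"): {"</s>": 1.0},
    ("what", "did"): {"</s>": 1.0},
    ("are", "there"): {"</s>": 1.0},
    ("are", "they"): {"</s>": 1.0},
}


def beam_search(k: int, max_len: int = 5) -> List[str]:
    """Return the most probable finished sequence found with beam size k."""
    beams: List[Tuple[float, Prefix]] = [(0.0, ())]
    finished: List[Tuple[float, Prefix]] = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for token, prob in MODEL[prefix].items():
                candidates.append((logp + math.log(prob), prefix + (token,)))
        candidates.sort(reverse=True)
        beams = []
        # Keep the k best expansions; completed ones leave the beam.
        for logp, seq in candidates[:k]:
            (finished if seq[-1] == "</s>" else beams).append((logp, seq))
        if not beams:
            break
    return list(max(finished)[1][:-1])


greedy = beam_search(k=1)   # follows the locally best token "what"
wider = beam_search(k=2)    # keeps "are" alive and ends up preferring it
```

Here greedy decoding commits to "what" (0.55) and ends with a sequence of probability 0.22, while a beam of two keeps "are" and finds "are there" with probability 0.4275, mirroring the prevalence of questions starting with "are" at larger beam sizes.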
The results of this human evaluation are reported in Table 4.

Discussion
What is the impact of the pretraining? We observe that for pretrained models (i.e. PT and PT+RL) the Correctness is significantly higher than for the models without pretraining (i.e. base and RL). This is consistent with the higher BLEU observed for these models in Table 3. Additionally, we observe that for pretrained models the External Knowledge required to answer the generated questions is lower, while the Relevance is slightly higher. This might be due to the nature of the pretraining, during which the models learn to generate non-curious questions that focus on their inputs. Again, this is consistent with the significantly higher $QA_{source}$ scores obtained by these models (see Table 3).
Does Reinforcement help? From the human assessment we conducted (see Table 4), we observe that the models trained via RL obtain higher scores for Relevance and lower ones for Soundness, as compared to their non-reinforced counterparts. Further, the results reported in Table 3 show the reinforced models obtaining lower BLEU and $QA_{source}$; conversely, they score higher when it comes to $QA_{context}$. We thus conclude that reinforcement brings improvements in terms of diversity of the generated questions, at the price of slightly degraded formulations in the outputs.
How effective is our dataset creation methodology? Looking at the bottom row of Table 4, which shows the scores obtained by the reference (human) questions, we observe the highest relative values for all dimensions, with the exception of Answerability. This indicates that the data we derived from a non-conversational QA dataset (SQuAD) fits the task of Curiosity-driven question generation well. As a side note, we remark that the models we built obtain lower scores than humans in terms of Answerability, a fact we hypothesize is due to the lower quality of the generated questions: the less sound and correct, the less answerable a question would be, regardless of its context.
How well do the metrics fit human judgement? We report the pairwise Spearman correlation and p-value among all the different metrics and human measures in Figure 1. Our analysis shows that BLEU metrics correlate positively with Relevance (B4: .29, p < .005) and Soundness (B4: .19, p < .005), and to a weaker extent with Answerability (B1: .15, p < .05) and Unexpectedness (B1: .13, p < .05). Self-BLEU metrics correlate significantly with Soundness (Self-B1: .17, p < .05) and Correctness (Self-B4: .15, p < .05), while $QA_{context}$ is associated with Relevance (.18, p < .005). The only human measure that does not correlate with any automatic metric is External Knowledge. It is indeed one of the most challenging aspects to evaluate, even for humans. However, as expected, it correlates negatively with Answerability.
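For readers unfamiliar with the statistic, a Spearman correlation between an automatic metric and a human rating is the Pearson correlation of their ranks. The sketch below is a minimal, dependency-free version with hypothetical function names and invented scores; it does not compute p-values and is not the analysis code behind Figure 1.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with ties receiving the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # undefined if either input is constant


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: human ratings versus an automatic metric on five questions.
human = [1, 2, 3, 4, 5]
metric = [0.1, 0.3, 0.2, 0.8, 0.9]
rho = spearman(human, metric)
```

In this toy example only two neighbouring items swap ranks, which gives a correlation of 0.9.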

Conclusions
Asking inquisitive questions allows humans to learn from each other and increase their knowledge. We thus proposed a new task: Curiosity-driven Question Generation, which attempts to address such a key component for several human-machine interaction scenarios. In the absence of data directly usable for this task, we proposed an automatic method to derive it from conversational QA datasets. Further, recognizing that the great majority of QA datasets are not conversational, we also extended the method to standard QA data. Our experiments, which include learning strategies such as pretraining and reinforcement, show promising results under both automatic and human evaluation. In future work, we plan to extend the approach to conditional generation of Curiosity-driven questions.