On the Importance of Diversity in Question Generation for QA

Automatic question generation (QG) has shown promise as a source of synthetic training data for question answering (QA). In this paper we ask: Is textual diversity in QG beneficial for downstream QA? Using top-p nucleus sampling to derive samples from a transformer-based question generator, we show that diversity-promoting QG indeed provides better QA training than likelihood maximization approaches such as beam search. We also show that standard QG evaluation metrics such as BLEU, ROUGE and METEOR are inversely correlated with diversity, and propose a diversity-aware intrinsic measure of overall QG quality that correlates well with extrinsic evaluation on QA.


Question Generation and Diversity
Besides areas such as dialog (Bordes et al., 2017) and tutoring systems (Lindberg et al., 2013), automatic question generation (QG) has recently been applied with great success to generating synthetic training examples for question answering (QA) (Alberti et al., 2019; Dong et al., 2019). Yet an important question has remained unexplored: Does increased textual diversity in automatically generated questions lead to better QA? In Figure 1 we show four questions generated by one of our QG models (details in Section 2) from a SQuAD (Rajpurkar et al., 2016) passage and an answer span (the QG prompt). The questions differ not only lexically, but also in what information about the answer entity they draw upon and even in their use of world knowledge, e.g., Tesla's reputation as a "mad scientist". Intuitively, such sample diversity, if sufficiently accurate, could provide QA models with a rich training signal.
Existing QG work has predominantly relied on customary beam search decoding for generation and n-gram similarity metrics such as BLEU for evaluation (Du et al., 2017; Alberti et al., 2019; Dong et al., 2019; Zhang and Bansal, 2019). Such methods and metrics solely optimize for or reward similarity with human-generated reference questions treated as the ground truth (GT). However, in many open-ended generation tasks where only one or a few of many possible GTs are available through human annotation, this approach directly penalizes diversity by discouraging deviation from the GT(s).
Passage: On Tesla's 75th birthday in 1931, Time magazine put him on its cover. The cover caption "All the world's his power house" noted his contribution to electrical power generation. He received congratulatory letters from more than 70 pioneers in science and engineering, including Albert Einstein.
✏ Who appeared on Time magazine's cover on his 75th birthday?
✏ Which famous scientist was in the cover of Time Magazine in 1931?
✏ Which mad scientist received more than a 70 people congratulating him on his birthday?
✏ What famous scientist was also 75?
Figure 1: A passage with an underlined answer span ("Tesla"), and corresponding questions generated by our model. The generated questions exhibit both lexical and factual diversity.
In recent years, massively pre-trained neural language models (LMs) (Radford et al., 2019) have revolutionized NLP. In open-ended text generation, these models show remarkable robustness under sampling (Radford et al., 2019; Holtzman et al., 2020). This observation, coupled with the examples presented in Figure 1, suggests that treating QG for QA as a more open-ended generation problem and relying on the power of modern text generators to produce diverse yet accurate samples might yield better QA results than the current approach of optimizing for the "most likely" question.
We test this hypothesis by fine-tuning a pre-trained transformer-based masked LM for QG, and sampling questions from it using top-p nucleus sampling (Holtzman et al., 2020). Other diversity-promoting text generation techniques exist, both at training time (e.g., VAEs (Kingma and Welling, 2014)) and during inference (e.g., top-k sampling and diverse beam search (Vijayakumar et al., 2018)), and have been applied to various NLP tasks: language modeling (Bowman et al., 2016), dialog (Cao and Clark, 2017), visual QG (Jain et al., 2017; Fan et al., 2018), image captioning (Vijayakumar et al., 2018) and so on. We choose nucleus sampling because of its effectiveness, simplicity and speed. Our experiments lead to the following discoveries: (1) Nucleus sampling indeed produces better QA results than beam search, even when only one question is generated per prompt. (2) QG metrics that only reward similarity with GT are negatively correlated with diversity and, as a result, are inaccurate predictors of the downstream QA performance of diversity-promoting QG. (3) A measure of QG quality can be devised that combines diversity with similarity to GT and correlates strongly with QA performance.

Question Generation using RoBERTa
We fine-tune a RoBERTa masked LM for QG given an answer span within a textual context (as shown in Figure 1), and use nucleus sampling (Holtzman et al., 2020) for generation.
Model: Various transformer architectures can be used for text generation (Raffel et al., 2019). Following Dong et al. (2019) and Alberti et al. (2019), we fine-tune a pre-trained masked LM as a prefix LM (Raffel et al., 2019) to predict a question token $q_t$ given (1) a prompt $p_{1:N}$: a tokenized textual context with special tokens delimiting an answer span, and (2) question tokens $q_{1:t-1}$, if any, that have already been generated for the given prompt in left-to-right order. A special separator token separates the question prefix from the prompt. The prompt is encoded using bidirectional attention and the question tokens using causal (left-only) attention. We choose RoBERTa as our pre-trained model because of its extended pre-training on large amounts of text. Our implementation of the QG model is based on Hugging Face's (Wolf et al., 2019) PyTorch implementation of RoBERTa.
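The attention pattern described above can be illustrated with a small sketch. It is not the paper's implementation: the function name `prefix_lm_mask` is hypothetical, the separator token is assumed to be counted as part of the prompt, and the mask token used at the prediction position is left out for simplicity.

```python
from typing import List


def prefix_lm_mask(prompt_len: int, question_len: int) -> List[List[int]]:
    """Build a prefix-LM attention mask.

    mask[i][j] == 1 means position i may attend to position j.
    Positions 0 .. prompt_len - 1 hold the prompt (assumed to include the
    separator); the remaining positions hold question tokens.
    """
    total = prompt_len + question_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < prompt_len:
                # Every position sees the whole prompt, so attention
                # inside the prompt is bidirectional.
                mask[i][j] = 1
            elif i >= prompt_len and j <= i:
                # A question token sees itself and earlier question
                # tokens only (causal, left-only attention).
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    for row in prefix_lm_mask(prompt_len=3, question_len=2):
        print(row)
```

Prompt positions never attend to question positions in this sketch, so the prompt encoding does not depend on how much of the question has been generated.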
Fine-Tuning: For each QG training example, the model is asked to predict a single question token $q_t$ given the prompt $p_{1:N}$, the previous question tokens $q_{1:t-1}$ (teacher-forced), and the mask $m$ at timestep $t$. All questions end with an EOS token that marks the end of generation. Training minimizes the masked LM loss, i.e., the negative log-likelihood of the GT token $q_t$ as the prediction for $m$ in position $t$:
$$\mathcal{L} = -\log P(m = q_t \mid p_{1:N}, q_{1:t-1})$$
Inference: During generation, the fine-tuned RoBERTa QG model outputs a probability distribution over the entire vocabulary at each question timestep $t$. Top-p nucleus sampling (NS@p henceforth) samples from the (re-normalized) categorical distribution $P_N$ of the nucleus $N$, which is the smallest subset of vocabulary items that has (1) a cumulative probability mass greater than $p$, and (2) the highest probability among all such subsets:
$$P_N(x) = \begin{cases} P(x) \,/\, \sum_{x' \in N} P(x') & \text{if } x \in N \\ 0 & \text{otherwise} \end{cases}$$
By restricting the pool to a high-likelihood region of the vocabulary, NS reduces, compared to top-k sampling, the chances of generating low-probability items when the original distribution is peaked at one or a few items. Our question generation works by repeated nucleus sampling of question tokens until $q_t = \text{EOS}$.
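As an illustration of the inference procedure, the following is a minimal sketch of top-p nucleus sampling over a toy distribution. The names `nucleus`, `sample_token` and `generate`, the dictionary representation of the vocabulary distribution, and the `next_token_probs` callable standing in for the fine-tuned model are all assumptions made for this example, not part of the paper's code.

```python
import random
from typing import Callable, Dict, List, Optional


def nucleus(probs: Dict[str, float], p: float,
            max_size: Optional[int] = None) -> Dict[str, float]:
    """Return the re-normalized distribution over the top-p nucleus.

    Tokens are added in order of decreasing probability until their
    cumulative mass exceeds p (or an optional size cap is reached).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen = []
    mass = 0.0
    for token, prob in ranked:
        chosen.append((token, prob))
        mass += prob
        if mass > p or (max_size is not None and len(chosen) >= max_size):
            break
    return {token: prob / mass for token, prob in chosen}


def sample_token(probs: Dict[str, float], p: float, rng: random.Random,
                 max_size: Optional[int] = None) -> str:
    """Draw one token from the re-normalized nucleus distribution."""
    nuc = nucleus(probs, p, max_size)
    tokens = list(nuc)
    weights = [nuc[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(next_token_probs: Callable[[List[str]], Dict[str, float]],
             p: float, rng: random.Random, max_len: int = 30) -> List[str]:
    """Sample question tokens one at a time until EOS (or max_len)."""
    question: List[str] = []
    while len(question) < max_len:
        token = sample_token(next_token_probs(question), p, rng)
        if token == "EOS":
            break
        question.append(token)
    return question
```

With a small p the nucleus usually collapses to the single most likely token, which is why a low setting behaves much like greedy decoding; larger p values admit more candidates at each step.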

Experiments and Results
To test the effect of QG diversity on QA, we generate questions with both nucleus sampling and beam search from a number of different QG models and compare their performance.
General Setup: Considering that the performance of different generation methods may vary across models of different capacities, we train eight QG models, each uniquely characterized by: (1) its size (# of parameters), and (2) the amount of training data it was fine-tuned on. The two model sizes are those of RoBERTa: base (125M parameters) and large (355M parameters). For fine-tuning we use the train set of the SQuAD1 split by Du et al. (2017). This is a three-way split of the public portion of SQuAD1 widely adopted in QG literature, with approximately 76k train, 18k dev and 12k test (prompt, question) pairs. We draw varying amounts of samples (ranging from 5% to 100%) at random from the train set to fine-tune each model on, simulating different points on the low- to high-resource spectrum.
For each of the eight QG models, we evaluate beam search (BEAM henceforth) and NS@p for different values of p. Our BEAM experiments with the RoBERTa-base model did not show significant performance differences between beam sizes 5 and 10; we therefore report results only for b = 5 in this paper. An important point to note here is that QG for QA takes paragraph-long input prompts and may require large numbers of synthetic examples in many practical use cases, so large beam sizes can become prohibitively expensive computationally for transformer-based generators.
For NS, we evaluate with $p \in \{.1, .5, .75, .95\}$. Among these, p = .1 closely approximates greedy decoding, as we observed for all models an average nucleus size of practically 1 in this setup. We also set the maximum number of vocabulary items in a nucleus to 20, which even the largest p values rarely reached in our experiments.
Table 1 shows performances (mean over five different seeds) of all generators in BLEU-1 ($B_1$), ROUGE-4 ($R_4$) and METEOR (MT), the variant in each metric family that showed the highest correlation with downstream QA performance. We also show QA performances measured by SQuAD's official $F_1$ score metric, which computes the degree of lexical overlap between the predicted and the target answer. As expected, model performance improves with both model size and # of training instances, both in intrinsic evaluation and on QA. Importantly, however, while BEAM has the best intrinsic evaluation results for all eight models, it is competitive in QA only in the lowest-resource setup (5% training data). On the other hand, NS@.95 has the lowest QG but the highest QA scores, especially when sufficient training data is available (20% or more). Note that in these experiments we generate a single question per prompt; yet generation diversity across different prompts yields higher-quality QA training data for NS, which is also a faster alternative to BEAM. Sampling five questions per prompt from the large-100% model with NS@.95 provides additional improvement ($F_1$ = 86.4).
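To make the extrinsic metric concrete, here is a minimal sketch of a token-overlap $F_1$ in the style of the SQuAD evaluation. It is a simplified illustration rather than the official script: the function names are chosen for this example, and it scores a prediction against a single target answer instead of taking the maximum over several reference answers.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze spaces."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_f1(prediction: str, target: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    target_tokens = normalize(target).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction such as "Nikola Tesla" against the target "Tesla" gets full recall but only half precision, so partial matches are rewarded without being treated as exact.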
Out-of-Domain Experiments: As we increase p to make generation more diverse, the chances of NS@p drawing less likely candidates and thus generating incorrect questions also go up. In Table 1, the gains in QA due to QG diversity are generally greater than any drop in performance likely due to decreased accuracy. To find out if the same holds in a more challenging out-of-domain setup, we perform a zero-shot application (i.e., with no further fine-tuning) of four of the above SQuAD-trained QG models to NewsQA, a reading comprehension dataset of CNN news articles (Trischler et al., 2017). Table 2 shows results on the answerable subset of NewsQA, with 76k train (from which we extract our QG prompts) and 4k test (used for QA evaluation) samples: while the absolute scores are lower than those in SQuAD, the relative performances of BEAM and NS are similar in both intrinsic (the best predictor of QA performance for NewsQA was ROUGE-1) and extrinsic (QA $F_1$) evaluation.

Comparison with and Augmentation of Human Generation:
To assess the quality of our generated questions in absolute terms, in Table 3 we compare the QA performances of the best QG model above (large-100%, NS@.95) and corresponding human annotations (GT). Impressively, in-domain model performance on QA is very similar to that of GT, while the zero-shot score on NewsQA is also within roughly 4 points of GT. We also evaluate the generator's ability to augment human-generated questions. Taking an approach similar to prior augmentation experiments  who, when, why, how, how many, how much, and how long. SYNTH* is used to fine-tune a BERT-wwm LM for QA, which is finally fine-tuned on the target datasets (SQuAD1-Du, NewsQA). As Table 3 shows, SYNTH* achieves improvements of 1.3-2.3 absolute points for the high-performance large BERT-wwm model.

Summary of Results:
The above results empirically show that given enough training data and sufficiently powerful QG models: (1) diverse QG leads to strong in-domain and out-of-domain QA training, (2) asking the "most likely" question (i.e., beam search) every time is less useful, and (3)

Intrinsic Evaluation of Diverse QG
To better understand the performance of existing generation metrics as measures of diverse QG, we take the set of all 32 samplers in Table 1 (e.g., base-100%-p@.75) and randomly generate a large number (100k) of subsets, each consisting of $n$ samplers ($2 \le n \le 32$) to be evaluated. We assign each $n$ (# of samplers) to a bin and measure performances of QG metrics separately in each bin. The process is repeated for Table 2. Note that the member sets of a given bin, say $n = 5$, all contain the same number of generators (5), but the actual selection of generators is generally different in different members of a bin. This setup allows us to evaluate a varying number of generators with different capacities and performance, and to average over a large number of experiments. Figure 2 shows for all bins a rather poor, for some bins negative, median Spearman's $\rho$ score between the best QG metric (SQuAD1-Du: ROUGE-4, NewsQA: ROUGE-1) and downstream QA $F_1$. These results provide quantitative confirmation that ROUGE and similar metrics are inadequate evaluators of diverse QG for QA due to their sole focus on accuracy with respect to available GTs. This leads us to our final research question: How to intrinsically measure the overall quality of QG for QA under diverse nucleus sampling?
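The binned correlation analysis described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the subset size $n$ is assumed to be drawn uniformly at random, and subsets whose scores are all tied (for which the correlation is undefined) are skipped.

```python
import random
import statistics
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Optional, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson's r, or None when either input has zero variance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def median_rho_per_bin(metric: Sequence[float], qa_f1: Sequence[float],
                       trials: int, rng: random.Random) -> Dict[int, float]:
    """Median rho between a QG metric and QA F1 for each subset size n."""
    bins: Dict[int, List[float]] = defaultdict(list)
    total = len(metric)
    for _ in range(trials):
        n = rng.randint(2, total)             # subset size (the bin)
        chosen = rng.sample(range(total), n)  # which samplers to compare
        rho = spearman([metric[i] for i in chosen],
                       [qa_f1[i] for i in chosen])
        if rho is not None:
            bins[n].append(rho)
    return {n: statistics.median(v) for n, v in sorted(bins.items())}
```

Here `metric` and `qa_f1` would hold one score per sampler (e.g., its ROUGE score and the QA $F_1$ obtained from its questions), aligned by index.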
Given the categorical distribution $P_N$ of vocabulary items in a model's nucleus $N$, we propose to measure both its accuracy (relative to GT) and diversity of generation.
Accuracy: Similarly to LM perplexity, for timestep $t$ of evaluation example $s$, we take the probability $P_N(q_{s,t} \mid p, q_{s,1:t-1})$ of the model (more precisely, its nucleus $N$) generating the GT token $q_{s,t}$, given prompt $p$ and GT history $q_{s,1:t-1}$. We then average over all evaluation $(s, t)$ pairs to compute model accuracy $P(\mathrm{GT})$.
Diversity: An intuitive measure of the diversity of a model's nucleus $N$ is the average entropy of $P_N$ over all evaluation timesteps. However, entropy is an unbounded measure, and has a non-linear inverse growth relative to our proposed accuracy metric, which makes their mathematical combination difficult. We instead rely on the observation that as we increase p in NS@p to make generation more diverse, the cardinality of $N$ also goes up, on average, and so does the probability $P(\mathrm{GT} \in N)$ that $N$ contains the GT token. Our experiments on both datasets showed that this measure of diversity, computed as the proportion of times $N$ was found to include GT across all timesteps in the QG evaluation data, has high positive correlations with the entropy of $P_N$ (Pearson's $r$: 98%-99%, Spearman's $\rho$: 87%-95%). Note that unlike the accuracy metric $P(\mathrm{GT})$, at each timestep $t$, the diversity metric $P(\mathrm{GT} \in N)$ is Boolean: the GT token is either in $N$ or it is not. But importantly, its average across many evaluation timesteps is a probability measure of diversity, which enables a straightforward convex combination with our proposed accuracy metric.
Our final QG metric is a weighted sum of accuracy and diversity: $w \cdot P(\mathrm{GT}) + (1 - w) \cdot P(\mathrm{GT} \in N)$, where $w \in [0, 1]$ is a tunable parameter reflecting the weight of accuracy relative to diversity. In our experiments, this metric outperforms all existing metrics by a large margin for a wide range of $w$ values. In Figure 2, the median Spearman's $\rho$ score between this metric and QA $F_1$ in both in-domain ($w = .7$) and out-of-domain ($w = .8$) evaluation is over 90% for all bins. We observe similar performance differences between the proposed and existing metrics with Pearson's $r$.
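The combined metric can be sketched on toy data as below. This is a minimal illustration under stated assumptions: each evaluation timestep is represented as a pair of the model's full next-token distribution (a dictionary) and the GT token, and the names `nucleus_distribution` and `qg_quality` are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Tuple


def nucleus_distribution(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Re-normalized distribution over the smallest top set with mass > p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen = []
    mass = 0.0
    for token, prob in ranked:
        chosen.append((token, prob))
        mass += prob
        if mass > p:
            break
    return {token: prob / mass for token, prob in chosen}


def qg_quality(steps: List[Tuple[Dict[str, float], str]],
               p: float, w: float) -> Tuple[float, float, float]:
    """Return (accuracy, diversity, combined score) for NS@p.

    accuracy  = mean nucleus probability of the GT token (0 if outside N)
    diversity = fraction of timesteps whose nucleus contains the GT token
    combined  = w * accuracy + (1 - w) * diversity
    """
    accuracy_sum = 0.0
    hits = 0
    for probs, gt_token in steps:
        nuc = nucleus_distribution(probs, p)
        accuracy_sum += nuc.get(gt_token, 0.0)
        hits += gt_token in nuc
    accuracy = accuracy_sum / len(steps)
    diversity = hits / len(steps)
    return accuracy, diversity, w * accuracy + (1 - w) * diversity


if __name__ == "__main__":
    toy_steps = [
        ({"who": 0.6, "what": 0.3, "which": 0.1}, "what"),
        ({"did": 0.8, "was": 0.2}, "did"),
    ]
    for p in (0.5, 0.85):
        print(p, qg_quality(toy_steps, p=p, w=0.7))
```

On this toy input, raising p enlarges the nuclei so that both GT tokens fall inside them, which increases the diversity term; in a full evaluation the averages would run over all $(s, t)$ pairs of the evaluation set.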
Given the scope of this paper, we evaluate the combined metric only on QG, but the underlying ideas apply to diverse text generation in general. Further experiments are necessary to evaluate the metric on other generation tasks.

Conclusion
While diversity of generation has received significant attention in other text generation problems (e.g., dialog), we show in this paper that it is also an important and measurable dimension of quality in question generation for QA. We hope that our work will encourage further exploration of diversity-promoting QG and its evaluation. Possible future directions include a systematic study of different aspects of QG diversity (e.g., lexical and factual) and controlled diversification of individual aspects in generation.