Exploring Question-Specific Rewards for Generating Deep Questions

Recent question generation (QG) approaches often utilize the sequence-to-sequence framework (Seq2Seq) to optimize the log likelihood of ground-truth questions using teacher forcing. However, this training objective is inconsistent with actual question quality, which is often reflected by certain global properties such as whether the question can be answered by the document. As such, we directly optimize for QG-specific objectives via reinforcement learning to improve question quality. We design three different rewards targeting the fluency, relevance, and answerability of generated questions. We conduct both automatic and human evaluations, in addition to thorough analysis, to explore the effect of each QG-specific reward. We find that optimizing question-specific rewards generally leads to better performance on automatic evaluation metrics. However, only the rewards that correlate well with human judgement (e.g., relevance) lead to real improvement in question quality. Optimizing for the others, especially answerability, introduces incorrect bias to the model, resulting in poorer question quality. The code is publicly available at https://github.com/YuxiXie/RL-for-Question-Generation.


Introduction
Question Generation (QG) aims to endow machines with the ability to ask relevant and to-the-point questions about a document. QG has important practical applications, such as generating assessments for course materials in education (Heilman and Smith, 2010; Lindberg et al., 2013), prompting user interaction in dialog systems (Shukla et al., 2019), enabling machines to ask clarification questions such as FAQs (Saeidi et al., 2018; Krishna and Iyyer, 2019), and automatically building large-scale QA datasets for the research community (Du et al., 2017; Zhao et al., 2018).
Recent QG approaches (Du et al., 2017; Zhao et al., 2018; Liu et al., 2019) have used Seq2Seq models with attention (Bahdanau et al., 2015), which feed the input document into an encoder and generate a question about the document through a decoder. The training objective is to maximize the log likelihood of the ground-truth question paired with each input document using teacher forcing (Williams and Zipser, 1989). However, as the ground-truth questions are insufficient to account for the many equivalent ways of asking a question, this likelihood-based training suffers from the problem of exposure bias (Ranzato et al., 2016), i.e., the model does not learn how to distribute probability mass over sequences that are valid but different from the ground truth. To address this issue, previous QG works proposed to optimize the model directly on question-specific rewards via Reinforcement Learning (RL). This process decouples the training procedure from the ground-truth data, so that the space of possible questions can be better explored. Moreover, it allows the training to target specific properties we want the question to exhibit, such as being relevant to a specific topic or answerable by the document. Although various rewards have been employed for QG - such as BLEU (Kumar et al., 2019), the answerability reward (Zhang and Bansal, 2019), and the word movers distance - optimizing the reward scores does not always lead to higher question quality in practice, as observed by Hosking and Riedel (2019). How to define robust and effective QG-specific rewards still requires further investigation.
We aim to analyze the effectiveness of question-specific rewards in QG. Instead of using general natural language generation metrics such as BLEU, we target three QG-related metrics that are commonly cited in human evaluations of question quality: (1) Fluency indicates whether the question is grammatically correct and logically coherent; (2) Relevance indicates whether the question is relevant to the document; and (3) Answerability indicates whether the question is answerable given the document. We design a specific RL reward for each metric: a language-model-based reward for fluency, a discriminator-based reward for relevance, and a QA-based reward for answerability. After optimizing each reward via RL, we conduct comprehensive analysis, including automatic and human evaluation, to arrive at the following conclusions: (1) both individual and joint optimization of these rewards can lead to performance gains on automated metrics, but this does not guarantee an improvement in real question quality; (2) the reward for relevance substantially helps to improve question quality, while the reward for answerability reduces quality due to the bias introduced by the QA model; and (3) a reward is more likely to improve question quality if the reward score correlates well with human judgement.

Related Work
Early QG studies focused on using manually-designed rules or templates to transform a piece of given text to questions (Heilman, 2011;Chali and Hasan, 2012), with low generalizability and scalability. To address this, recent neural question generation (NQG) models take advantage of the Seq2Seq framework with attention, which are trained in an end-to-end manner, requiring far less labor and enabling better language flexibility. Many improvements have been made to the original Seq2Seq NQG model (Du et al., 2017), such as encoding answer information (Zhou et al., 2017;Sun et al., 2018) and incorporating linguistic features (Liu et al., 2019). A comprehensive survey of QG can be found in (Pan et al., 2019).
Among these attempts, utilizing RL to optimize QG-specific rewards has been adopted by recent works to address the exposure bias problem. To find a good proxy for question quality, various rewards have been proposed. One common type of reward is the similarity between the generated question and the reference question written by a human. Kumar et al. (2019) adopted BLEU, ROUGE, and METEOR as rewards. Followup works employed more semantics-oriented metrics, such as the word movers distance (Yu et al., 2020) and the paraphrasing probability (Zhang and Bansal, 2019). To generate more passage-relevant questions, Kumar et al. (2019) designed a reward to measure the relevance between the input passage and the generated question based on their degree of overlap. The answerability reward measures whether the generated question can be answered by the input passage. It is designed as either the confidence score that a pre-trained QA model can correctly answer the generated question (Zhang and Bansal, 2019), or the degree of overlap between the target answer and the answer predicted by the QA model. Other types of rewards include that of Yao et al. (2018), who train a discriminator to measure naturalness, i.e., whether the question is human-written or generated.
Most question-specific rewards appear empirically successful, since they achieve performance gains on automatic evaluation metrics after RL training. However, this raises several followup questions that existing works have failed to answer: (1) does optimizing RL rewards really improve question quality by human standards; (2) which rewards are more effective in improving question quality; and (3) how do the rewards interact with each other when jointly optimized. This paper aims to bridge this gap through human evaluation and analytic experiments, providing a better understanding of how different rewards affect the question generation process.

Methodology
Given a document D as input, the objective is to generate a relevant question Ŷ which can be answered by the document D. This is formulated as maximizing the conditional probability p(Y|D):

p(Y|D) = \prod_{t=1}^{|Y|} P_QG(y_t | D, Y_{<t}),

where y_t is the t-th token of the generated question Y, and Y_{<t} represents the previously decoded tokens, i.e., y_1, ..., y_{t-1}. The general framework of our model is shown in Figure 1, consisting of two parts: the Question Generator and the QG-specific Rewards.

[Figure 1: The framework of our model, consisting of the basic question generator (on the left) and the discriminators for QG-specific rewards (on the right). The blue sequence on the right represents the input document, and the green sequence is the generated question.]

The Question Generator uses the Seq2Seq framework with attention (Bahdanau et al., 2015), copying (Gu et al., 2016; See et al., 2017), and coverage mechanisms (Tu et al., 2016), following most existing NQG works. The model is trained by maximizing the likelihood of ground-truth questions. As discussed in the introduction, this basic question generator suffers from the exposure bias problem. Therefore, we design three QG-Specific Rewards aimed at evaluating the fluency, relevance, and answerability of the question generated by the basic model. We then fine-tune the model by optimizing these rewards following the RL framework with a baseline (Rennie et al., 2017). In the following, we describe the design of the three QG-specific rewards in detail.
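Each of the reward-specific losses below instantiates the same REINFORCE-with-baseline pattern. A minimal sketch of that update, using plain Python floats in place of tensors and hypothetical argument names (not the paper's exact implementation):

```python
def rl_loss(token_log_probs, reward, baseline):
    """REINFORCE-with-baseline loss for one sampled question (sketch).

    token_log_probs: per-token log P_QG(y_t | D, Y_<t) of the sampled question
    reward:          scalar reward of the sampled sequence
    baseline:        pre-defined baseline reward (e.g. alpha_flu), which
                     reduces the variance of the gradient estimate
    """
    advantage = reward - baseline
    # Minimizing this loss raises the likelihood of questions whose reward
    # exceeds the baseline and lowers it for questions below the baseline.
    return -advantage * sum(token_log_probs)
```

Note the sign: a question with reward above the baseline yields a loss that decreases as its log-likelihood grows.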

LM-based Reward for Fluency
The perplexity of a sentence under a well-trained Language Model (LM) usually serves as a good indicator of its fluency (Yang et al., 2018b). Therefore, we introduce an LM-based reward to improve the fluency of the generated question. We first pre-train a language model P_LM and then define the fluency reward R_flu of a generated question Ŷ as its negative perplexity evaluated by P_LM, formulated as:

R_flu(Ŷ) = -PPL_{P_LM}(Ŷ).

To optimize the fluency reward in training, we define the following loss function L_flu:

L_flu = -(R_flu(Ŷ) - α_flu) \sum_t \log P_QG(ŷ_t | D, Ŷ_{<t}),

where ŷ_t is the t-th token in the predicted question Ŷ, sampled from the vocabulary distribution P_QG(y_t | D, Y_{<t}) specified by the RNN decoder of the question generator. α_flu is a pre-defined negative perplexity, used as the baseline reward in the RL algorithm to stabilize the training process.
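The fluency reward can be sketched directly from the definition of perplexity. The input format (per-token LM log-probabilities for the generated question) is a hypothetical simplification:

```python
import math

def fluency_reward(lm_token_log_probs):
    """R_flu: negative perplexity of the question under the language model.

    lm_token_log_probs: per-token log P_LM(y_t | y_<t) for the generated
    question (hypothetical input format; a real LM would produce these).
    """
    # perplexity = exp(average negative log-likelihood per token)
    avg_nll = -sum(lm_token_log_probs) / len(lm_token_log_probs)
    return -math.exp(avg_nll)
```

A more probable (i.e., more fluent) question has lower perplexity and therefore a higher (less negative) reward.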

Discriminator-Based Reward for Relevance
We then design a classifier-based discriminator to judge whether the generated question is relevant to the input document. As shown in Figure 1, the discriminator is a binary classifier based on the pre-trained BERT (Devlin et al., 2019), which takes both the input document D and the generated question Y as inputs and outputs the probability that Y is relevant to D. To train the relevance discriminator, we use the human-written ground-truth question Y_G for each document as the positive training data. For a document-question pair (D, Y_G), we create negative samples Y_N for D in the following three ways: • Question Swap. We randomly select a ground-truth question from another document as the negative sample for the document D.
• Inter-Doc Entity Swap. We create the negative sample Y_N by replacing an entity in the ground-truth question Y_G with another entity of the same type that does not occur in the document D. This helps the discriminator learn whether the question involves entities not mentioned in the document.
• Intra-Doc Entity Swap. We also replace an entity in the ground-truth question with a different entity from the same document. This often creates logical errors in the question, e.g., "William Shakespeare is written by the book", which is more challenging for the discriminator to differentiate.
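The three negative-sampling strategies above can be sketched as follows. The data format (entities as plain strings, with `entity` a same-type mention inside `question`) is a hypothetical simplification of the actual pipeline:

```python
def make_negatives(question, entity, doc_entities, foreign_entities, foreign_question):
    """Build the three negative-sample types for one (document, question) pair.

    question:         the ground-truth question Y_G
    entity:           an entity mention appearing in `question`
    doc_entities:     same-type entities occurring in the source document
    foreign_entities: same-type entities drawn from other documents
    foreign_question: a ground-truth question from another document
    """
    negatives = [foreign_question]  # 1) question swap
    # 2) inter-doc entity swap: a same-type entity absent from the document
    ghost = next(e for e in foreign_entities if e not in doc_entities)
    negatives.append(question.replace(entity, ghost))
    # 3) intra-doc entity swap: a different entity from the same document
    local = next(e for e in doc_entities if e != entity)
    negatives.append(question.replace(entity, local))
    return negatives
```

For example, swapping "Hamlet" for an in-document entity like "Macbeth" yields a logically inconsistent but superficially plausible question, the hardest of the three negative types.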
Following the above process, we create three negative samples for each ground-truth question. To address the imbalance between positive and negative training data, we adopt the α-balanced focal loss (Lin et al., 2017) to train the relevance discriminator:

FL(P_t) = -α_t (1 - P_t)^λ \log(P_t),

where P_t is the predicted probability for class t, α_t is the class-balancing weight, and (1 - P_t)^λ is a modulating factor with a tunable focusing parameter λ ≥ 0 that smoothly adjusts the rate at which easy examples are down-weighted. After training the relevance discriminator, we use it to obtain the relevance reward and then fine-tune the question generator by maximizing this reward in RL training. Given a document D and a question Y, the relevance reward R_rel(D, Y) is defined as a scaling of the relevance probability P_rel(D, Y) output by the relevance discriminator:

R_rel(D, Y) = -\log(1 - P_rel(D, Y) + ε),

where ε is a positive factor close to zero to avoid calculating log 0. We scale the relevance probability in this way to augment the reward value for positive samples, i.e., those samples whose rewards are greater than the baseline, because the QG model generally samples more negative samples during training. To optimize the relevance reward in RL training, we define the loss function L_rel analogously to L_flu:

L_rel = -(R_rel(D, Ŷ) - α_rel) \sum_t \log P_QG(ŷ_t | D, Ŷ_{<t}).
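The α-balanced focal loss for a single example can be sketched as below; the default values of `alpha_t` and the focusing parameter `lam` are illustrative, not the paper's settings:

```python
import math

def focal_loss(p_t, alpha_t=0.25, lam=2.0):
    """Alpha-balanced focal loss for one example (Lin et al., 2017).

    p_t: the model's predicted probability for the true class.
    The modulating factor (1 - p_t)**lam shrinks the loss contribution
    of easy, well-classified examples.
    """
    return -alpha_t * (1.0 - p_t) ** lam * math.log(p_t)
```

With `lam = 0` and `alpha_t = 1` this reduces to ordinary cross-entropy; increasing `lam` focuses training on the hard negatives produced by entity swapping.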

QA-Based Reward for Answerability
Answerability indicates whether the question is answerable by the document without the need for external information. We design the answerability reward based on SpanBERT (Joshi et al., 2020), a state-of-the-art model for extractive QA. Given a document D and a question Y as inputs, SpanBERT predicts the start and end positions of the potential answer span in the document D. Formally, it outputs two probability distributions over the tokens in the document, P^s_ans and P^e_ans, where P^s_ans(i) / P^e_ans(i) is the probability that the i-th token is the start / end of the answer span. Based on the pre-trained SpanBERT model, we first fine-tune it on the HotpotQA dataset (Yang et al., 2018a) and then use it to obtain the answerability reward for the generated question Y. Intuitively, when the question is answerable, the model should be quite confident about the start/end of the answer span, so both P^s_ans and P^e_ans should be peaked, i.e., the values of max_i P^s_ans(i) and max_j P^e_ans(j) are both large. Therefore, we use the geometric average of these two values to indicate answerability:

R_ans(D, Y) = -\log(1 - \max_{i \le j \le i+l} \sqrt{P^s_ans(i) \cdot P^e_ans(j)} + ε),

where l represents the maximum allowed length of the answer. Similar to the relevance reward, we scale the probability to balance positive and negative samples during training. To optimize the answerability reward in RL training, we define the loss function L_ans analogously:

L_ans = -(R_ans(D, Ŷ) - α_ans) \sum_t \log P_QG(ŷ_t | D, Ŷ_{<t}).
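The span-confidence part of the answerability reward can be sketched as follows; note that the log-scaling step mirrors the relevance reward and is an assumption here, not a verbatim reproduction of the paper's formula:

```python
import math

def answerability_reward(p_start, p_end, max_answer_len, eps=1e-8):
    """Sketch of R_ans: geometric mean of the best start/end span
    probabilities within the allowed answer length, then log-scaled
    (the exact scaling is an assumption consistent with the text).

    p_start, p_end: answer-span probability distributions over the
    document tokens, as produced by an extractive QA model.
    """
    best = 0.0
    for i, ps in enumerate(p_start):
        # only consider spans of length at most max_answer_len
        for j in range(i, min(i + max_answer_len, len(p_end))):
            best = max(best, math.sqrt(ps * p_end[j]))
    return -math.log(1.0 - best + eps)
```

A sharply peaked span distribution (a confident QA model) receives a much higher reward than a flat one.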

Model Training
We train the whole model following the pre-training and fine-tuning paradigm, as in Hosking and Riedel (2019). We first pre-train the question generation model by minimizing the cross-entropy loss together with the coverage loss, written together as L_base:

L_base = -\sum_t \log P_QG(y_t | D, Y_{<t}) + γ_cov \sum_t \sum_i \min(a^t_i, c^t_i),

where the copy mechanism is involved in the question generator P_QG, a^t_i is the i-th element of the attention score vector over the document at time step t, and c^t_i = \sum_{t'<t} a^{t'}_i is the corresponding coverage value. We then fine-tune the basic QG model trained with L_base to maximize the previously defined QG-specific rewards. This is achieved by linearly combining L_base with the RL-based losses L_flu, L_rel, L_ans:

L = L_base + γ_flu L_flu + γ_rel L_rel + γ_ans L_ans,

where the hyper-parameters γ_flu, γ_rel and γ_ans specify the trade-off between the different rewards. Note that we empirically set baseline rewards α_flu, α_rel, and α_ans to reduce the variance of gradient estimation during RL training, as reflected in the loss definitions above.
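The combined objective can be sketched as two small functions. The coverage-penalty form follows See et al. (2017) and is an assumption consistent with the cited mechanisms; the default trade-off weights are the dev-set values reported in the experiments:

```python
def base_loss(nll_terms, attention, coverage, gamma_cov=0.25):
    """Sketch of L_base: token-level cross-entropy plus a coverage penalty.

    nll_terms: per-step -log P_QG(y_t | D, Y_<t)
    attention: per-step attention vectors over document tokens
    coverage:  per-step coverage vectors (sums of past attention)
    """
    cov_penalty = sum(min(a, c)
                      for a_t, c_t in zip(attention, coverage)
                      for a, c in zip(a_t, c_t))
    return sum(nll_terms) + gamma_cov * cov_penalty

def total_loss(l_base, l_flu, l_rel, l_ans,
               gamma_flu=0.2, gamma_rel=1.0, gamma_ans=1.0):
    """Linear combination of L_base with the three RL losses."""
    return l_base + gamma_flu * l_flu + gamma_rel * l_rel + gamma_ans * l_ans
```

Setting any gamma to zero recovers the single-reward (S1-S3) configurations from the multi-reward (E1-E4) objective.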

Experiments
We conduct experiments on HotpotQA (Yang et al., 2018a), containing ∼100K crowd-sourced questions paired with Wikipedia articles. Generating a fluent, relevant, and answerable question in HotpotQA is a non-trivial task as it requires reasoning over different pieces of information in the input document. We follow the data split of Pan et al. (2020) to get 90,440 / 6,072 examples for training and testing, respectively. We further hold out 6,072 examples from the training data as the development set. The basic question generator is a Seq2Seq framework with copying (Gu et al., 2016), coverage (See et al., 2017), and attention (Hou et al., 2019) mechanisms. We employ a 1-layer bi-directional GRU as the encoder and a 1-layer GRU as the decoder. We use the cased WordPiece tokenizer for the question generator following Joshi et al. (2020). The hidden size of the Seq2Seq model and the maximal input sequence length are set as 512 and 256, respectively.
To train the language model used for evaluating the fluency reward, we fine-tune the pre-trained BERT model (Devlin et al., 2019) on our target dataset, resulting in an LM with a perplexity of 8.85 on the dev set. Our relevance discriminator, also fine-tuned from the pre-trained BERT model, achieves a 91.16 F1 score. The answerability discriminator based on SpanBERT-large obtains a 70.60 Exact Match (EM) score and an 83.44 F1 score on the HotpotQA development set. In RL training, we empirically set the baseline rewards α_flu, α_rel and α_ans to −10, log(2), and log(2), respectively. When jointly training all the rewards, the trade-off parameters γ_cov, γ_flu, γ_rel and γ_ans are tuned on the dev set and set to 0.25, 0.2, 1, and 1, respectively. We provide ancillary material about supplementary experiments on hyper-parameter sensitivity and result analysis in our codebase. Other settings for training follow standard best practice.

Automatic Evaluation
To investigate the effect of different QG-specific rewards, we first report the performance on automatic evaluation metrics when optimizing different rewards. The metrics include: a) BLEU1 and BLEU4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004), which are based on the n-gram similarity between the generated questions and the ground truth; and b) the gain in reward scores (the fluency, relevance, and answerability rewards) after RL training.

[Table 1: Performance of models fine-tuned from B1 by optimizing either a single reward (S1-S3) or multiple rewards together (E1-E4). F, R, and A represent the fluency, relevance, and answerability rewards, respectively.]

We make four major observations:

1. Optimizing a single reward alone (F, R, A) can lead to an improvement in the BLEU score and also in its corresponding reward score (F→R-FLU, R→R-REL, and A→R-ANS). When optimizing one reward, the scores for the other two also slightly increase, showing that the three rewards are correlated. This is in line with our intuition; e.g., a question answerable by the passage is also likely to be fluent.
2. Jointly training multiple rewards in general leads to better performance. For example, the best improvements in R-REL, R-FLU and R-ANS are achieved by E3 and E4. This shows that different rewards can mutually enhance each other in joint training, suggesting a promising future direction for RL reward integration.
3. In general, the increase in rewards does not correlate well with improvement on automatic metrics. For example, E3 has the largest reward gains in fluency and answerability, but achieves relatively worse results in BLEU4, METEOR and ROUGE-L. This shows that the RL rewards capture aspects of question quality other than n-gram based similarity with the ground truth. We further investigate how each reward affects question quality in Section 4.3.
4. We find that our B1 baseline tends to generate longer questions (the average question length is 1.44 times that of the ground truth, compared with 1.13 for E4). The RL rewards thus encourage shortening to lengths which are closer to the ground truth. This explains why the improvements brought by RL rewards are especially significant on BLEU.

Comparison with Baselines
We then compare our best performing model (E4. F + R + A) against several strong baselines in QG. The techniques employed by each model as well as the performance results are summarized in Table 2. Without using the answer information or any external linguistic knowledge, our model achieves a comparable BLEU4 with the state-of-the-art QG model (B7) on HotpotQA. This demonstrates the effectiveness of optimizing QG-specific rewards via RL. Surprisingly, the CGC-QG (B6) model exhibits an unusual pattern, achieving the best METEOR and ROUGE-L, but the worst BLEU1 among all baselines. Our analysis finds that CGC-QG tends to generate irrelevant words during word-level content selection, leading to lengthy questions that are unanswerable or which contain semantic errors (Pan et al., 2020).

Human Evaluation
To further investigate whether optimizing QG-specific rewards leads to real improvement in question quality, we conduct human evaluation on the questions generated for 200 randomly-sampled testing documents. We ask 6 workers to rate the questions generated by 5 different models: the basic question generator (B1), the models fine-tuned with a single reward (S1, S2, S3), and the model with all three rewards (E4). Raters were blinded to the identity of the models. We designed the scale differently for each metric to ease the human rating effort. For each question, we ask three workers to give ratings on four criteria: Fluency (on a scale of 1-5), Relevance (scaled 1-3), Answerability (0 for unanswerable and 1 for answerable), and Complexity (scaled 1-3). To reduce the subjectivity of human rating, we obtain the rating score from the annotators' answers to our designed questionnaire, shown in Table 4.

[Table 4: Questionnaire designed for human evaluation, with single-item and multiple-item selections. It includes, e.g., Q4 on why a question is invalid (ghost entity; no ghost entity, but insufficient information; others) and Q5 on whether the question requires reasoning to answer (yes, and very hard; yes, but simple reasoning; no).]

For more accurate evaluation, we give a question labeled unreadable in Q1 the lowest fluency rating and do not consider its relevance, answerability, and complexity ratings, as it is infeasible to judge them when the question is unreadable. The proportions of unreadable questions generated by B1, S1, S2, S3 and E4 are 11.8%, 10.7%, 4.4%, 11.0% and 10.0%, respectively. For a readable question, the fluency rating is determined by the number of grammar errors it makes (the answer to Q2). The answerability and complexity ratings are given by the answers to Q3 and Q5, respectively. The relevance score depends on both Q3 and Q4. We average the scores from raters on each question, reporting the performance in Table 3. We discuss four major findings:

1. Human ratings do not correlate well with automatic evaluation metrics (BLEU4, METEOR, ROUGE-L), showing that n-gram based metrics are not a good reflection of actual question quality. Similar observations also exist in other language generation tasks (Callison-Burch et al., 2006; Novikova et al., 2017) for fluency, adequacy and coherence, validating our findings.
2. Optimizing the relevance reward (S2) alone leads to a substantial improvement of the human ratings for fluency, relevance, and answerability. Our further analysis in Section 4.5 shows that optimizing the relevance reward reduces ghost entity errors, a major source of error in previous QG models.
3. In contrast, optimizing for answerability (S3) has a surprising negative effect: it reduces scores on all three human ratings compared against the baseline (B1). We believe this is due to the immaturity of current QA models in answering deep questions; i.e., when used as a discriminator, the QA model we used cannot accurately predict whether a question is answerable or not, especially when the question involves reasoning (the case in HotpotQA). We analyze this in more depth in Section 4.4. We also show in Section 4.5 that the model tends to learn spurious correlations for answerability (e.g., a what year question is more likely to be answerable), which also accounts for its poor performance.

Consistency between Rewards and Human Judgement
To understand why certain rewards improve real question quality while others do not, we use violin plots to show the distribution of reward scores at each level of human rating, shown in Figure 2.
We observe that the relevance reward has the highest consistency with human judgement; i.e., both the median and the maximal reward improve as the human rating gets higher. This explains why optimizing the relevance reward leads to the best question quality.
The answerability reward predicted by the QA model, however, exhibits a poor correlation with the true answerability judged by humans. The median answerability reward is low for both answerable and unanswerable questions as labeled by humans. This lends evidence to our claim in Section 4.3 that the innate capability of the QA model is the bottleneck for this reward. We expect the answerability reward to become more effective as deep QA improves, and it could become a key component in future work.
The correlation between the fluency reward and human judgement is also unsatisfactory: the fluency reward score does not increase noticeably as the human rating for fluency increases. As a result, the performance of S1 is similar to the baseline model in Table 3; i.e., the effect of optimizing the fluency reward is not obvious.
Based on the above observations, we conclude that rewards that correlate well with human judgements tend to achieve real improvement in question quality. Therefore, when designing a QG-specific reward, testing its performance on n-gram based metrics such as BLEU may not faithfully reflect its effectiveness. Instead, running an initial test of how well the reward score correlates with human judgement seems more viable.
We further provide a full view of how human ratings correlate with each other and with the reward scores in Figure 3. We find that the relevance rating has strong correlations with both the fluency rating (0.79) and the answerability rating (0.67), indicating that a question is more relevant to the document when it is fluent and answerable. However, the correlation between the fluency and answerability ratings is relatively weak, meaning a fluent question is not necessarily answerable. In Figure 3(b), we further find that the relevance reward has a strong correlation with not only the relevance rating, but also the fluency and answerability ratings. This explains why optimizing the relevance reward alone (S2) leads to improvements in fluency and answerability as well. In contrast, R-ANS has poor correlation with Flu. and Rel., explaining why it decreases the fluency and relevance ratings.
(a) Correlation between human ratings (b) Correlation between reward scores and human ratings Figure 3: Heatmaps of the Pearson correlation coefficient matrices between human ratings and rewards. Flu., Rel., Ans. and Cpx. denote fluency, relevance, answerability, and complexity ratings in human evaluation, respectively. R-FLU / R-REL / R-ANS represents fluency / relevance / answerability reward.

Mesoscopic Analysis of Generated Questions
To further understand why the fluency and the answerability reward fail to produce a consistent judgement with humans, we conduct a mesoscopic analysis on our E4 model by comparing the generated questions receiving high rewards with those with low rewards. We detail our observations for each reward type, guided by the results in Table 5.
• Fluency. From Table 5 Row F, we observe that the fluency reward is sometimes consistent with the human judgement on fluency; e.g., the incomplete question [FL-1] receives a low reward. However, there is often inconsistency between the fluency judged by the language model and that judged by humans. For example, [FH-2] has a repetition error but is assigned a high reward, while [FL-2], with a similar repetition error, receives a low reward. This is caused by the statistical bias in the language model; i.e., the LM tends to assign low rewards to questions with rare or unseen entities (e.g., Kenji Mizoguchi). The lack of commonsense knowledge is another problem of the LM: e.g., in [FH-1] the model fails to replace the word born with founded to make the question logically correct.
• Relevance. Table 5 Row R shows that the relevance discriminator judges the document-question relevance largely based on two aspects: 1) whether the question contains an entity that does not appear in the passage (ghost entity), e.g., Granly in [RL-1], and 2) whether the question has a logical inconsistency with the document, e.g., . These two targets are quite consistent with the human judgement on relevance, which explains its good correlation in Figure 2. However, when the question is asking about an unmentioned aspect of an entity in the document, it is difficult for the model to assign an appropriate relevance score as in . A potential solution is to factor in the judgement of a good answerability discriminator (a challenge itself).
• Answerability. We observe in Row A that the answerability reward follows quite different criteria from humans for whether a question can be answered. First, most of the questions with high rewards ask what year (the text highlighted in pink). We find that 45.0% of the questions generated by S3 are what year questions, compared with 11.2% for the baseline model B1. This may be caused by the data bias of the training set. Since a large portion of questions in HotpotQA ask about a date or time, the QA model learns a spurious correlation that a what year question is more likely to be answered and hence should receive a high reward. Second, when the question becomes complex, i.e., requiring the QA model to conduct reasoning such as comparison (Questions [AL-1] and [AL-2]) or to utilize world knowledge (e.g., United States is a country), the QA model tends to give a low answerability reward. This can be explained by the insufficient ability of current QA models to answer deep questions. To improve answerability via a QA-based reward, we believe it is crucial to address the QA model's bias in prediction and to improve its reasoning ability. Otherwise, optimizing an immature QA-based reward may introduce an incorrect bias, which in turn harms the question quality.

Conclusion
In this paper, we optimize three question-specific rewards via reinforcement learning on a Seq2Seq based question generator, aiming to improve the fluency, relevance and answerability of the generated questions. Through comprehensive analytic experiments, including automatic and human evaluation, consistency validation, and mesoscopic analysis, we show that the effectiveness of a reward is poorly reflected by automatic evaluation metrics such as BLEU. Instead, we find that a reward that correlates well with human judgement generally has a better effect on improving question quality. In future work, we believe these observations can help guide the design of other QG-specific rewards that target unexplored aspects of question generation, such as the informativeness and the utility of questions.