Diversify Question Generation with Continuous Content Selectors and Question Type Modeling

Generating questions based on answers and relevant contexts is a challenging task. Recent work mainly pays attention to the quality of a single generated question. However, question generation is actually a one-to-many problem, as it is possible to raise questions with different focuses on contexts and various means of expression. In this paper, we explore the diversity of question generation and propose methods that address both aspects. Specifically, we relate contextual focuses to content selectors, which are modeled by a continuous latent variable using a conditional variational auto-encoder (CVAE). In the realization of CVAE, a multimodal prior distribution is adopted to allow for more diverse content selectors. To take into account various means of expression, question types are explicitly modeled and a diversity-promoting algorithm is further proposed. Experimental results on public datasets show that our proposed method can significantly improve the diversity of generated questions, especially from the perspective of using different question types. Overall, our proposed method achieves a better trade-off between generation quality and diversity compared with existing approaches.


Introduction
As a reverse task of question answering (QA), question generation (QG) aims to generate questions from a given answer and its relevant context. The task has educational value, for example in generating questions for reading comprehension materials (Heilman and Smith, 2010). It can also be deployed as a chatbot component (Li et al., 2017) for evaluating or improving mental health (Colby, 1975). Moreover, QG can be applied to extend the question-answer pairs (Du and Cardie, 2018) for QA systems.
Traditional methods for QG mainly use rigid heuristic rules to transform a sentence into related questions (Heilman, 2011). However, these approaches rely heavily on manually crafted features, which cannot be easily generalized. In recent years, neural techniques have been applied to this task and have achieved significant progress (Zhou et al., 2017; Du et al., 2017). Most of these methods follow the one-to-one encoder-decoder paradigm and focus on improving the quality of a single generated question (e.g., Zhao et al., 2018). However, given an answer and its associated context, it is possible to raise multiple questions with different focuses on the context and various means of expression. Figure 1 shows some different questions that can be generated from a given source context. The characteristic of diversity is inherent in QG and has the potential to enhance the value of this task. However, this diversity is not fully explored by existing methods. Yao et al. (2018) and Fan et al. (2018b) noticed this problem and modeled the variety with latent variable models. However, the introduced latent variable was regarded as a holistic attribute, whose meaning was opaque and weakly related to the origin of diversity. More recently, Cho et al. (2019) proposed a mixture content selection model for generation, whose diversity is determined by a fixed number of selectors. However, the discrete property confines its variety to a large extent.
[Figure 1 excerpt] Source context: the network was engineered and operated by mci telecommunications under a cooperative agreement with the nsf . Target question: who operated the vbsn network?
In this paper, we use a more flexible continuous latent variable for content selection to deal with different focuses on a context. Moreover, question types are explicitly incorporated to account for different ways of expression. With these components, a question can be generated in three steps. Firstly, a content selector in the form of a continuous latent variable is sampled conditioned on the source context. Secondly, a question type is predicted based on the context as well as the content selector. Lastly, the content of a question is generated with the above information about contextual focuses and means of expression. Because both content selectors and question types can vary, the generated questions are encouraged to be diverse.
Overall, the main contributions of this paper are as follows:
• We explicitly consider the content selection process of QG and model content selectors as a continuous latent variable for different focuses on contexts. CVAE is utilized and the multimodal prior technique is adopted for more diverse selectors.
• We consider various means of expression through the incorporation of question type modeling. A diversity-promoting algorithm concerning the use of distinct question types among generations is proposed further.
• We conduct experiments on the public datasets SQuAD and NewsQA, whose results demonstrate a better trade-off between generation quality and diversity compared with previous methods. Further analysis demonstrates the effectiveness of our proposed components.

Related Work
Automatic question generation has attracted increasing attention from the natural language generation community in recent years, which is reflected in newly published datasets (Zhou et al., 2017; Chen et al., 2018) and sophisticated techniques (Du et al., 2017; Liu et al., 2019). Traditional methods are mainly rule-based: they first transform the source information into a syntactic representation and then use templates to generate related questions (Heilman, 2011). These methods largely depend on rigid heuristic rules and cannot be easily generalized.
In contrast to rule-based methods, neural networks have the potential to learn implicit patterns from labeled data, and have thus become more prevalent in question generation. Du et al. (2017) and Zhou et al. (2017) followed the sequence-to-sequence paradigm and showed promising results when combining rich features with an attention mechanism. Zhou et al. (2019), among others, incorporated answer-focused information to improve the relevance between answers and questions. Liu et al. (2019) and Chen et al. (2020) introduced graph networks to estimate significant contents in the source context.
Most previous work regarded question generation as a one-to-one problem and focused on improving the quality of a single generated question. Some work noticed the diversity inherent in QG and proposed methods to account for this characteristic. Yao et al. (2018) used a latent variable to model the holistic attributes in questions. Similar ideas can also be found in some related work (Jain et al., 2017; Fan et al., 2018b). However, the meaning of the holistic features is opaque and cannot be strongly connected with diversity. More recently, Cho et al. (2019) proposed a mixture content selection model for generation. The diversity was determined by a fixed number of content selectors. Different from their work, we model the latent variable of content selectors in a continuous space, which holds the potential of capturing more of the variety inherent in content selection.
Besides the related work above, other techniques plugged into the general encoder-decoder framework can also be utilized to promote diversity (Li et al., 2016; Shen et al., 2019). However, the particular characteristics of question generation are not fully considered in these approaches.

Method
Question generation aims to model the probability of a question q given an answer a and its context c,  which can be combined as the source information x = {c, a}.
To diversify generated questions, we incorporate a continuous multi-dimensional latent variable z for content selection and explicitly model question types to deal with means of expression. Generation can be factorized into three stages. Firstly, a content selector z is sampled conditioned on the input x. This is used to indicate which parts of the source information should be focused on. Secondly, a question type $q_t$ is predicted considering the specific content selector z and the input x. Lastly, the relevant question content $q_c$ is generated with the selected contents and the predicted question type. The final question q is composed as $(q_t, q_c)$. This factorization is formulated in Equation 1.
The choice of a continuous latent variable as the content selector leads to more variety compared with its discrete counterpart. CVAE (Sohn et al., 2015) is adopted to make training more tractable. The objective function then turns out to be the evidence lower bound (ELBO) of $\log p_\theta(q|x)$, given in Equation 2, where $p_\phi(z|x, q)$ is incorporated to approximate the posterior distribution $p_\theta(z|x, q)$. $\mathcal{L}(\theta, \phi; x, q)$ can be approximated using a Monte Carlo estimate, and learning can be conducted with the re-parameterization trick (Kingma and Welling, 2014) on $p_\phi(z|x, q)$ and $p_\theta(z|x)$, as in Equation 3. The first two components in $\mathcal{L}$ denote the reconstruction error that forces the sampled content selector to be informative of what to focus on. The last two components constitute a kind of regularization that drives the posterior to match the prior.
The overall architecture is illustrated in Figure 2. In the following subsections, we will elaborate the details of each stage.
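The display forms of Equations 1 to 3 are not reproduced in this text. Under the standard CVAE formulation and the three-stage description above, they would plausibly read as follows. This is a reconstruction from the prose rather than the authors' exact notation; the sample count $S$ and the superscripted samples $z^{(s)}$ are symbols introduced here for illustration, and the order of the last two terms in Equation 3 is an assumption.

```latex
% Eq. (1): three-stage factorization (reconstructed from the prose)
p_\theta(q \mid x) = \int p_\theta(z \mid x)\, p_\theta(q_t \mid z, x)\, p_\theta(q_c \mid z, q_t, x)\, dz

% Eq. (2): evidence lower bound of \log p_\theta(q \mid x)
\log p_\theta(q \mid x) \ge \mathcal{L}(\theta, \phi; x, q)
  = \mathbb{E}_{p_\phi(z \mid x, q)}\big[\log p_\theta(q_t \mid z, x) + \log p_\theta(q_c \mid z, q_t, x)\big]
  - \mathrm{KL}\big(p_\phi(z \mid x, q) \,\|\, p_\theta(z \mid x)\big)

% Eq. (3): Monte Carlo estimate with S samples z^{(s)} \sim p_\phi(z \mid x, q)
\mathcal{L}(\theta, \phi; x, q) \approx \frac{1}{S} \sum_{s=1}^{S} \Big[
  \log p_\theta(q_t \mid z^{(s)}, x) + \log p_\theta(q_c \mid z^{(s)}, q_t, x)
  - \log p_\phi(z^{(s)} \mid x, q) + \log p_\theta(z^{(s)} \mid x) \Big]
```

Written this way, the first two terms of the estimate are the reconstruction terms for the question type and the question content, and the last two form a sampled estimate of the negative KL divergence, which is convenient when the prior is a mixture and the divergence has no closed form.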

Content Selector
In our framework, the content selector is modeled as a continuous multi-dimensional latent variable z, which is used to focus on relevant contextual information. Following CVAE, a recognition network $p_\phi(z|x, q)$ is defined to approximate the true posterior distribution. As its form indicates, it is conditioned on the source information x as well as the target question q.
As for the source information, we decompose the context c as a sequence of words $\{x_i\}_{i=1}^{n}$. Following Zhou et al. (2017), we exploit lexical features to enrich the word embeddings. Then a bidirectional recurrent neural network (Bi-RNN) is used to produce a sequence of hidden states $\{h_i\}_{i=1}^{n}$. Finally, the condensed source information s is aggregated with a self-attention operation (Equation 4). We assume the target question has content words $\{y_t\}_{t=1}^{m}$. The target information t can then be calculated with a process similar to Equation 4.
To model the continuous property of the latent variable z, we assume that $p_\phi(z|x, q)$ follows a multivariate Gaussian distribution with a diagonal covariance matrix; the recognition network is calculated accordingly (Equation 5). Given Equation 3, we also need to define the prior distribution $p_\theta(z|x)$ of the latent variable z.
Traditional methods often represent the prior as another Gaussian distribution for the sake of tractable calculation. To enrich the model with more variety and prevent the variational posterior from being over-regularized, we adopt a multimodal prior distribution. A Gaussian mixture distribution has the potential to fit more diverse multi-dimensional data, which makes it suitable for enlarging the divergence between content selectors with different focuses.
Instead of introducing transformation matrices for the mean and variance of each mode, we adopt the multimodal prior technique of VampPrior (Tomczak and Welling, 2018), where only marginal additive parameters are needed and overfitting can be alleviated. More specifically, the multimodal prior distribution is formulated in Equation 6, where $t_k$ denotes a pseudo-input, which is a learnable vector with the same dimension as t, and K is a hyper-parameter denoting the number of modes. Given the above recognition and prior networks, we can use the re-parameterization trick to obtain samples of z from $p_\phi(z|x, q)$ (training) or $p_\theta(z|x)$ (testing). With the sampled latent variable z, we can calculate which parts of the context c to focus on (Equation 7). The word embedding of the question type $q_t$, which appears in this calculation, will be elaborated in subsection 3.2. We use o to represent $\{o_i\}_{i=1}^{n}$ for simplicity.
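To make the sampling step concrete, the following is a minimal sketch of drawing a content selector from a uniform mixture of diagonal Gaussians with the re-parameterization trick. It is a simplification under stated assumptions: in the model each mode's mean and variance would be produced by the recognition network applied to x and a pseudo-input, whereas here the per-mode parameters are simply passed in. The function names and the `modes` structure are hypothetical.

```python
import math
import random


def sample_diag_gaussian(mu, log_var, rng):
    """Re-parameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_mixture_prior(modes, rng, k=None):
    """Sample from a uniform mixture of diagonal Gaussians.

    modes: list of (mu, log_var) pairs, one per prior mode (assumed given).
    k: optional mode index; if None, a mode is drawn uniformly at random.
    Returns the mode index and the sampled vector.
    """
    if k is None:
        k = rng.randrange(len(modes))
    mu, log_var = modes[k]
    return k, sample_diag_gaussian(mu, log_var, rng)


def log_mixture_density(z, modes):
    """log p(z) under the uniform mixture, computed with log-sum-exp."""
    logs = []
    for mu, log_var in modes:
        lp = sum(-0.5 * (math.log(2 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
                 for zi, m, lv in zip(z, mu, log_var))
        logs.append(lp)
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(len(modes))
```

Passing an explicit `k` corresponds to taking one sample from each mode, which is the setting described later for K = N; leaving it unset draws the mode at random.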

Question Type Predictor
Given source information s and sampled content selector z, the question type predictor produces a probability distribution indicating how likely the selected contents are to be inquired about by different question types. In this paper, we categorize question types according to the interrogative words commonly used in general questions. Specifically, they are classified into 8 types: what, who, how, when, which, where, why and other (Zhou et al., 2019). We combine the contextual information s and the selector representation z as the input. Two fully connected layers followed by a softmax layer are used to estimate the final question type distribution for a relevant question. The loss corresponds to the first item in Equation 3 (Equation 8).
Algorithm 1: Pseudo-code for the diversity-promoting question type selection algorithm. P is the N × L matrix of question type distributions of N different samples with L types. −inf represents negative infinity. decay is a hyper-parameter controlling the degree of diversity and is tuned on the development set. The algorithm returns $q_t^i$ for each sample, which is its predicted question type.
Given the question type predictor, we propose a diversity-promoting algorithm for the inference phase. In Algorithm 1, we utilize decay to explicitly control the degree of diversity for multiple generations. Specifically, given multiple samples with their question type distributions as a whole, we iteratively pick the highest probability and assign its type to the corresponding sample. Then, the probability of choosing the same question type for the other samples is restrained by decay. It therefore becomes more likely that different types are allocated to the remaining samples, so the degree of diversity in question types is explicitly promoted.
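The body of Algorithm 1 is not reproduced in this text, so the following is only a sketch of the procedure as described in the prose. It assumes that decay is subtracted from the remaining samples' scores for an already used type, which matches the later observation that larger decay yields more diverse types; the exact penalty form in the original algorithm may differ. The function name is hypothetical.

```python
def assign_question_types(P, decay):
    """Greedy, diversity-promoting assignment of question types.

    P: list of N rows, each a list of L type probabilities.
    decay: penalty applied to a type once it has been assigned
           (subtractive form is an assumption of this sketch).
    Returns a list of N type indices, one per sample.
    """
    n = len(P)
    scores = [list(row) for row in P]  # work on a copy
    assigned = [None] * n
    neg_inf = float("-inf")
    for _ in range(n):
        # Pick the highest remaining score over all (sample, type) pairs.
        i, t = max(((i, t) for i in range(n) for t in range(len(scores[i]))),
                   key=lambda it: scores[it[0]][it[1]])
        assigned[i] = t
        # Mask the chosen sample so it is not picked again.
        scores[i] = [neg_inf] * len(scores[i])
        # Restrain the same type for the samples still unassigned.
        for j in range(n):
            if assigned[j] is None:
                scores[j][t] -= decay
    return assigned
```

With decay set to 0 the procedure reduces to an independent argmax per sample; increasing it pushes later samples toward types that have not yet been used.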

Controlled Generator
We utilize a focused encoder and a focused decoder to make the generation process aware of the selected contents and the predicted question type.

Focused Encoder
The selected contents can be regarded as a clue indicator feature (Liu et al., 2019), which assigns a binary value to each word to signify its importance.
To stabilize training, we use the soft version of this indicator feature, whose weight is given by o in Equation 7. In the inference phase, we discretize this indicator by setting a threshold (Cho et al., 2019). Specifically, this feature is transformed into another embedding (Equation 9), where $E_1$ and $E_0$ correspond to the trainable embeddings for the two values of this clue indicator, and $I(o_i)$ represents the discretized value of the content selection probability $o_i$. This embedding is appended to the word embedding $x_i$ introduced in subsection 3.1. Then another Bi-RNN is applied to the resulting embeddings to obtain focused contextual representations $h = \{h_i\}_{i=1}^{n}$.

Focused Decoder
We assume that the contextual representations h, the content selection indicator o and the question type $q_t$ should be combined to generate the relevant question content $q_c = \{y_t\}_{t=1}^{m}$, which is the remaining part of a question other than its type.
Following the traditional paradigm, a unidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) is employed to form the decoder. It takes the question type $q_t$ as the initial input word $y_0$ and attends over the representations h with an attention mechanism (Bahdanau et al., 2015). More details can be found in the implementation of NQG++ (Zhou et al., 2017).
Traditional methods calculate attention weights using the correlation between the hidden states of the encoder and the decoder, which is defined at the word level. In our method, the content selector z decides what to focus on before generation, and thus has the ability to provide attention at the sentence level. This is similar to the idea used in data-to-text generation (Mei et al., 2016). Therefore, we combine the content selection probability o to refine the attention weights $\alpha_{t,i}$ at position t (Equation 10). Note that incorporating content selection in this way is an independent operation, which can be plugged into any standard attention method. As for the generation distribution, we adopt the copy-generator (See et al., 2017) to deal with the out-of-vocabulary problem. The loss function on the question content, which corresponds to the second term of Equation 3, is then calculated as in Equation 11.
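The exact form of the attention refinement is not reproduced in this text. One plausible reading, in the spirit of the pre-selection used in data-to-text generation, is to scale each word-level attention weight by its content selection probability and renormalize. The sketch below illustrates that assumed form only; the function name and the fallback for an all-zero selection are choices made here.

```python
def refine_attention(alpha, o):
    """Rescale attention weights by selection probabilities and renormalize.

    alpha: attention weights over n source positions at one decoding step.
    o: content selection probabilities for the same n positions.
    Assumed form: alpha'_i = alpha_i * o_i / sum_j(alpha_j * o_j).
    """
    weighted = [a * p for a, p in zip(alpha, o)]
    total = sum(weighted)
    if total == 0.0:
        # Nothing selected: fall back to the unrefined weights.
        return list(alpha)
    return [w / total for w in weighted]
```

Because the operation only post-processes a vector of weights, it does not depend on how those weights were produced, which is consistent with the remark that it can be plugged into any standard attention method.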

Training
As the selected contents play an important role in our model, we assume they are consistent with the final generation. Although this behavior can be learned with Equation 11 in an end-to-end manner, we add an auxiliary loss function to facilitate it. Formally, we set the gold label of content selection $g_i$ to 1 if the source token $x_i$ appears in the target question q and 0 otherwise. Without annotations of real focuses, the above labels serve as proxies to ease learning. The loss function is thus defined in Equation 12.
It is well known that a vanilla CVAE with an RNN decoder risks failing to encode meaningful information in the latent variable (Bowman et al., 2016). Motivated by the same concern as in previous work (Zhao et al., 2017), we also adopt the bag-of-word loss $L_{bow}(\theta, \phi; x, q)$ as an auxiliary loss, which requires the latent variable to predict the words shown in the target question. Moreover, the technique of KL cost annealing (Bowman et al., 2016) is also incorporated to let the divergence between $p_\phi(z|x, q)$ and $p_\theta(z|x)$ gradually influence the learning procedure.
Therefore, the overall loss function of the whole framework is defined in Equation 13, which can be optimized by stochastic gradient descent.
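The proxy labels and the annealing idea described in this section can be sketched as follows. The labeling rule follows the text directly; the mean binary cross-entropy as the form of the auxiliary loss and the linear warm-up schedule for the KL weight are assumptions of this sketch, and all function names are hypothetical.

```python
import math


def gold_selection_labels(source_tokens, question_tokens):
    """g_i = 1 if source token x_i appears in the target question, else 0."""
    in_question = set(question_tokens)
    return [1 if tok in in_question else 0 for tok in source_tokens]


def selection_loss(o, g, eps=1e-12):
    """Mean binary cross-entropy between selection probabilities o and labels g.

    The averaged cross-entropy form is an assumption of this sketch.
    """
    total = 0.0
    for p, y in zip(o, g):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(o)


def kl_weight(step, warmup_steps):
    """Linear KL annealing weight in [0, 1] (schedule shape is an assumption)."""
    return min(1.0, step / warmup_steps)
```

For the example in the introduction, the source tokens "the", "network" and "operated" would receive label 1 because they reappear in the question "who operated the vbsn network", while the remaining source tokens receive 0.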

Experiment Settings
Dataset We conduct experiments on two public datasets, SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017). For SQuAD, we follow the corpus split of Zhou et al. (2017) and directly use their provided lexical features (see footnote 1). There are 86635, 8965 and 8964 sentence-answer-question triples in the training, development and testing sets respectively. For NewsQA, we follow the original split of this dataset, resulting in 92549, 5166 and 5126 triples for training, development and testing.

Implementation Details
The vocabulary is set to contain the 20000 most frequent words in each training set. We set the dimension of word embeddings to 300 and the hidden size to 512. The representations of lexical features and the focus indicator are randomly initialized as 16-dimensional vectors. The dimension of the latent variable z and the hidden size of the question type predictor are set to 128. The number of RNN layers is set to 1 in both the encoder and the decoder. We update the model parameters using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size is set to 64 during training. The development set is used to find the best model and hyper-parameters. Our model is implemented with PyTorch 1.0.0.

Baselines and Metrics
We compare our method with recent diversified generation methods including Truncated Sampling (Fan et al., 2018a), Diverse Beam Search (Vijayakumar et al., 2018), Mixture Decoder (Shen et al., 2019) and Mixture Content Selection (Cho et al., 2019). The implementations and naming conventions of the above baselines follow those of Cho et al. (2019).
As for our method, to get N generations for each passage-answer pair, we sample N content selectors from the multimodal prior defined by Equation 6. Given these content selectors, question types are encouraged to be distinct with Algorithm 1, and greedy search is conducted for a fair comparison. Note that there is no restriction on the number of prior modes (K) used to get N samples. However, it is a natural choice to set K = N and take one sample from each mode. We name this model N-M. Prior. In further analysis, we will also show the influence of setting different values of K.
We use the metrics adopted by Cho et al. (2019) to evaluate generation quality and diversity (see footnote 2).
Pairwise metric (⇓) This measures the within-distribution similarity. The metric computes the average of sentence-level metrics (Self BLEU-4) between one sentence and the rest in a generated collection. A low pairwise metric indicates high diversity.
Given these metrics, we introduce a comprehensive measurement to balance generation quality and diversity.
Overall metric (⇑) This measures the overall performance concerning both quality and diversity: Top-1 metric × Oracle metric ÷ Pairwise metric.
We also introduce two other metrics regarding the diversity of generated question types.
Type coverage metric (⇑) This measures the percentage of cases in which the question type of the target question is covered by the top-N generations.
Type diversity metric (⇑) This measures the average number of distinct question types in top-N generations.
Footnote 1: https://res.qyzhou.me/redistribute.zip
Footnote 2: ⇑ marks a metric for which a higher value means better performance; ⇓ marks a metric for which a lower value is better.
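The overall metric and the two question type metrics can be computed as in the sketch below. It assumes that the Top-1, Oracle and Pairwise scores are already available as numbers and that question types have already been extracted as strings for each generation; the function names are hypothetical, and coverage is returned as a fraction rather than a percentage.

```python
def overall_metric(top1, oracle, pairwise):
    """Overall = Top-1 x Oracle / Pairwise (higher is better)."""
    return top1 * oracle / pairwise


def type_coverage(gold_types, generated_types):
    """Fraction of examples whose gold question type appears among the
    question types of its N generations."""
    hits = sum(1 for gold, gens in zip(gold_types, generated_types)
               if gold in gens)
    return hits / len(gold_types)


def type_diversity(generated_types):
    """Average number of distinct question types among the N generations."""
    return sum(len(set(gens)) for gens in generated_types) / len(generated_types)
```

As a usage example, with generations typed `[["what", "who", "what"], ["how", "how", "how"]]` and gold types `["who", "when"]`, coverage is 0.5 and type diversity is 1.5.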

Results and Analysis
Results compared with baselines The experimental results on SQuAD are displayed in Table 1. The table shows that our method (N-M. Prior) achieves BLEU-4 scores comparable to the state of the art and clearly superior to methods based on beam search and sampling. Moreover, from the perspective of diversity, our method performs evidently better than other mixture models, resulting in the best trade-off between diversity and quality as shown by the overall metric. Furthermore, on the measurements concerning question types, our model demonstrates significant improvements in both coverage and diversity, which result from the explicit modeling and diversifying of question types. A similar pattern, with our method performing better on the diversity metrics, can be observed on NewsQA in Table 2.
We also conduct a human evaluation comparing the diversity of the questions generated by our model 3-M. Prior with those of other mixture model baselines in Table 3. The table shows that our method outperforms its counterparts in terms of diversity with statistical significance.
Diversifying question types As described in Algorithm 1, the diversity of question types can be explicitly controlled by setting different values of decay. The influence is clearly shown in Figure 3(a). As decay gradually increases, the diversity of question types increases, as does their coverage of the gold type. Also, from Figure 3(b), we can see that a small value of decay results in better generation quality metrics. The reason is that incorporating more diverse question types may open up more possibilities for raising good questions. As its value continues to grow, the diversity keeps increasing at the risk of using inappropriate question types, which results in a slight degradation of the generation quality. We can select an appropriate decay value according to the overall metric.
Ablation Analysis To show the effects of important components in our model, we conduct an ablation study on SQuAD. As shown in Table 4, the proposed diversity-promoting algorithm clearly improves the generation diversity with nearly no negative impact on the quality, which can also be seen in Figure 3 when decay is small. As for content selection, incorporating its influence in the encoder-decoder architecture clearly improves the overall metric. We also observe that the auxiliary loss function on selected contents makes a big difference, demonstrating its necessity in making content selectors focus on diverse and valid text pieces. Moreover, the learning tricks for CVAE contribute to a more informative latent variable and evidently improve the diversity.

Influence of multimodal prior distribution
The continuous property of content selectors makes it possible to generate N questions even given a standard Gaussian prior. However, the introduction of a multimodal prior can enrich content selectors with more variety and lead to more diverse generations. As shown in Table 5, the number of prior modes (K = 1, 3, 5) has an effect on the metrics when generating multiple questions (N = 3, 5). First, we can see that the multimodal prior improves the generation diversity compared with the standard one, which agrees with our conjecture. Second, under the setting N = K, almost all of the metrics are better. This can be explained by the fact that samples of content selectors can be taken from different prior modes, which are more diverse. Also, inference accords with the training process in this situation.
Figure 4 shows an example of the generated questions from our model 3-M. Prior and its mixture model counterparts. As shown in this example, our generations often vary in question types and exhibit more diversity. Moreover, we highlight the selected contents of each generation from our model in Figure 1, which shows the effectiveness of our content selection module. As we use the multimodal prior technique, the diversity of generated questions can be reflected both within and across modes. We can see from Figure 5 that, unlike other mixture models which can only generate a fixed number of questions, our continuous modeling makes it possible to produce more generations by sampling from each mode repeatedly. In this example, questions from different modes exhibit a larger divergence than those from the same mode, which demonstrates once more that the use of a multimodal prior makes a difference to the generation diversity.

Conclusion
In this paper, we explicitly diversify the question generation from the perspectives of contextual focuses and means of expression. We model focuses through continuous content selectors and introduce a multimodal prior to allow for more diverse selectors. We consider various means of expression through the modeling of question types and a related diversity-promoting algorithm. On public datasets, our approach achieves the best trade-off between generation quality and diversity. Further analysis also demonstrates the effectiveness of our proposed model components.