A Question Type Driven and Copy Loss Enhanced Framework for Answer-Agnostic Neural Question Generation

Answer-agnostic question generation is a significant and challenging task, which aims to automatically generate questions for a given sentence without being provided an answer. In this paper, we propose two new strategies to deal with this task: question type prediction and a copy loss mechanism. The question type module predicts the types of questions that should be asked, which allows our model to generate multiple types of questions for the same source sentence. The new copy loss enhances the original copy mechanism to make sure that every important word in the source sentence is copied when generating questions. Our integrated model outperforms the state-of-the-art approach in answer-agnostic question generation, achieving a BLEU-4 score of 13.9 on SQuAD. Human evaluation further validates the high quality of our generated questions. We will make our code publicly available for further research.


Introduction
Question Generation (QG) has been investigated for many years because of its huge potential benefits to various fields, especially education (Mitkov et al., 2006; Rus and Arthur, 2009; Heilman and Smith, 2010). QG can also act as an essential component of other comprehensive tasks, such as dialogue systems (Piwek et al., 2007; Shum et al., 2018). Besides, it can supplement the question answering task by automatically constructing a large question set.
Traditional methods for automatic question generation are mostly rule-based, requiring a set of complex empirical rules (Beulen and Ney, 1998; Brown et al., 2005; Heilman and Smith, 2010; Mazidi and Nielsen, 2014). Recently, with the flourishing of deep learning, especially the sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014) with attention mechanism (Bahdanau et al., 2015), many neural models have been proposed to solve the QG task and have achieved rapid progress (Dong et al., 2019; Nema et al., 2019).
However, most previous works are devoted to answer-aware question generation. That is, given a text and also an answer span, the system is required to generate questions. But in real applications for educational purposes, people or machines are often required to generate questions for natural sentences without explicitly annotated answers. Compared with answer-aware QG, the answer-agnostic QG (AG-QG) task is more challenging and attractive. Unfortunately, AG-QG has been much less studied. Du's work is the first to tackle this problem, and (Scialom et al., 2019) achieve the state-of-the-art by employing an extended transformer network.
For the AG-QG task, where the input is only a sentence without any answer, multiple questions might be asked from various perspectives. According to our statistics on SQuAD (Rajpurkar et al., 2016), nearly 34% of the source sentences are paired with multiple gold reference questions, and nearly 20% of the source sentences are paired with different types of questions. Table 1 gives an example, where one source sentence corresponds to four different types of questions. However, most existing approaches can only generate one question per input sentence.
To enable the model to ask different types of questions given the same input sentence, we propose a question type driven framework for the AG-QG task. Specifically, our model first predicts the probability distribution over question types for the input sentence, which allows us to choose the K question types with the highest probability. These question types are then embedded into different vectors, which guide the decoder to pay attention to the informative parts relevant to each question.
Meanwhile, according to our statistics on SQuAD, on average 3.09 non-stop words per reference question are copied from the source sentence. Those non-stop words appearing in both questions and sentences are regarded as keywords, since they act as the connection between the two. To increase the probability of copying keywords from source sentences, we design a new copy loss to enhance the traditional copy mechanism. By minimizing the new copy loss, the model is forced to copy these keywords at least once during decoding.
We conduct experiments on SQuAD. Both the question type module and the new copy loss improve performance over the baseline model, and our full model combining two modules obtains a new state-of-the-art performance with a BLEU-4 of 13.9. Moreover, our model can ask different types of questions for a given sentence.
We summarize our contributions as follows:
• We propose a question type driven framework for AG-QG, which enables the model to generate diverse questions with high quality.
• We design a new copy loss function to enhance the standard copy mechanism, which increases the probability of copying keywords from source sentences.
• Our model achieves a new state-of-the-art performance on the challenging AG-QG task. Human evaluation further validates the high quality of our generated questions in fluency, relevance and answer-ability.

Related Work
Answer-Aware Question Generation. Most previous works on question generation focus on the answer-aware QG task. Yuan proposes three loss terms to enhance the performance of a sequence-to-sequence attention model. Zhou leverages lexical features (part-of-speech and named entity tags) to help the model obtain better encoder representations. Zhao uses paragraph information for the answer-aware QG task. To use answer information more efficiently, Song et al. (2018) use multi-perspective matching and Sun et al. (2018) propose a position-aware model to pay more attention to the surrounding context of the answer span. Nema et al. (2019) use an answer encoder and fuse its output with the paragraph representation. Chen et al. (2019) apply reinforcement learning to improve performance.
Answer-Agnostic Question Generation. AG-QG is more challenging than answer-aware QG. Du's work is the first to tackle this problem, achieving better performance than rule-based approaches by employing a sequence-to-sequence attention model. Later works aim to automatically find question-worthy sentences in a paragraph and then generate questions, or treat QG as a two-stage task: answer phrase extraction followed by answer-aware question generation. Another line of work proposes a multi-agent communication framework, using a local extraction agent to extract question-worthy phrases, and then taking the extracted phrases as assistance to generate questions. (Scialom et al., 2019) employ the transformer network (Vaswani et al., 2017) and extend it with a placeholder strategy, copy mechanism and contextualized embeddings.
Question Word Prediction. The question word is one of the most important components of a question. (Fan et al., 2018) study multi-type visual question generation, feeding the encoded representation to a multi-layer perceptron to calculate the question word distribution. (Sun et al., 2018) propose an answer-focused and position-aware model to generate the first question word. (Kim et al., 2019) propose an answer-separated sequence-to-sequence model to identify the proper question word. They replace the answer span in the source sentence with a special token to make better use of the context information.
Multi-Type Question Generation. Multi-type QG has been much less researched. In Ma's work, in order to generate different types of questions, a question type embedding is used at the first step of decoding. However, because of the difficulty of automatically predicting question types, their model fails to outperform previous works. The question type driven framework has also been tried for visual question generation (Fan et al., 2018), where the question type embedding is concatenated with the encoded representation of the input.
Proposed Framework

Framework Overview
For each sample in our dataset, we have a source sentence S = (x_1, x_2, . . . , x_l), which is a word sequence where l denotes the number of words. Let Q = (y_1, y_2, . . . , y_m) represent the question, another word sequence, where m is its length.
The answer-agnostic QG task can be defined as finding the best question Q̄ such that:

Q̄ = argmax_Q Σ_{i=1}^{m} log P(y_i | S, y_{<i})

Figure 1 shows the framework of our full question type driven QG model. With the development of pointer networks (Vinyals et al., 2015), the copy mechanism (Gu et al., 2016) has been increasingly applied to natural language generation tasks. Thus, our model is based on the general sequence-to-sequence attention model with copy mechanism, which we regard as our baseline. In the following sections, we first describe the baseline model and then separately present our question type module and enhanced copy mechanism.

Baseline Model
Our baseline method is a sequence-to-sequence attention model with copy mechanism. Let x_t represent the t-th word in the source sentence and e(x_t) its corresponding embedding vector. A bi-directional LSTM layer (Hochreiter et al., 1997) is used to encode the embedded vector sequence. The hidden state at time step t is the concatenation of the forward state →u_t and the backward state ←u_t, i.e., u_t = [→u_t; ←u_t], giving the encoded representation U = (u_1, u_2, . . . , u_l), where l is the number of words in the source sequence.
The decoder is another LSTM network, which generates a new hidden state h_t conditioned on the previous state h_{t-1} and the embedding of the previously generated word e(y_{t-1}):

h_t = LSTM(h_{t-1}, e(y_{t-1}))

At the first decoding step, it takes the last encoder hidden state u_l and a special token [SOS], which stands for the start of sequence, as input:

h_1 = LSTM(u_l, e([SOS]))

Given the encoded states U and the decoder state h_t, the baseline model calculates a generating distribution over words at each time step t:

a_ti = softmax(u_i W_b h_t),  v_t = Σ_i a_ti u_i,  g_t = softmax(W_o tanh(W_t [h_t; v_t]))

where v_t is the weighted sum of the encoded representations U, and the attention weights a_t are calculated by a bi-linear scoring function with a softmax normalization. W_t and W_o are both trainable parameters, which amount to applying a multi-layer perceptron to the concatenation of the hidden state h_t and the global attention representation v_t; W_b is also a trainable parameter used to calculate the attention weights. Our baseline model also utilizes the copy mechanism: we use the attention scores a_t obtained from the decoder attention as the copy distribution. The probability of generating (rather than copying) a word is p_gen = sigmoid(W_g [h_t; v_t]). The final distribution is the weighted sum of the generating distribution and the copy distribution:

ĝ_t = p_gen · g_t,  ĉ_t = (1 − p_gen) · a_t

where ĝ_t and ĉ_t are the weighted generating distribution and copy distribution, respectively. We use f_t to denote the final distribution, with f_ti the probability of decoding the i-th word in the vocabulary, and w_i the i-th word in the vocabulary.
The final distribution is calculated as follows:

f_ti = ĝ_ti + Σ_{j: x_j = w_i} ĉ_tj

Given the training corpus D, in which each sample contains a source sentence S and a target question Q, the training objective is to minimize the negative log-likelihood of the target questions:

L = − Σ_{<S,Q> ∈ D} Σ_{i=1}^{m} log P(y_i | S, y_{<i}; θ)

where θ represents all the parameters of our model.
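To make the generate/copy mixture concrete, here is a minimal numpy sketch of how the final distribution could be assembled at one decoding step; the toy vocabulary size, source word ids, and p_gen value are our own illustrative assumptions, not values from the model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def final_distribution(gen_logits, copy_attn, src_ids, vocab_size, p_gen):
    """Mix the generating distribution with the copy distribution.

    gen_logits: (V,) unnormalized scores over the vocabulary
    copy_attn:  (l,) attention weights over source positions (sum to 1)
    src_ids:    (l,) vocabulary id of each source word
    p_gen:      scalar in (0, 1), probability of generating vs. copying
    """
    g = p_gen * softmax(gen_logits)            # weighted generating distribution
    c = np.zeros(vocab_size)
    for pos, wid in enumerate(src_ids):        # scatter copy mass into vocab space
        c[wid] += (1.0 - p_gen) * copy_attn[pos]
    return g + c                               # final distribution f_t

# toy step: vocabulary of 5 words, source sentence of 3 words
f = final_distribution(np.zeros(5), np.array([0.5, 0.3, 0.2]),
                       np.array([1, 3, 1]), vocab_size=5, p_gen=0.6)
```

Words that occur at several source positions (id 1 above) accumulate copy mass from every occurrence, and the mixture remains a valid probability distribution.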

Question Type Prediction
We propose the question type module with two goals. One is to enable our model to generate multiple types of questions for one source sentence, and the other is to improve generation performance. Our question type module first predicts the most proper type and then uses its embedding to assist the decoding process.
As for question types, we count the distribution of question types in SQuAD and categorize all the questions into 7 types: what, who, how, where, when, yes/no and others.
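To illustrate how gold questions might be bucketed into these 7 types, the following rule-based labeler is a sketch under our own assumptions; the keyword lists and precedence order are ours, not the paper's actual annotation procedure:

```python
# Hypothetical labeler mapping a question to one of the 7 types in the paper.
YES_NO_STARTS = {"is", "are", "was", "were", "do", "does", "did",
                 "can", "could", "will", "would", "has", "have"}

def question_type(question: str) -> str:
    tokens = question.lower().split()
    # the wh-word usually appears within the first few tokens
    for ty in ("what", "who", "how", "where", "when"):
        if ty in tokens[:3]:
            return ty
    if tokens and tokens[0] in YES_NO_STARTS:
        return "yes/no"
    return "others"
```

Anything that is neither a recognized wh-question nor a yes/no auxiliary question falls into "others", e.g. "which"-questions under this toy scheme.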
The question type predictor is a multi-layer perceptron, which takes the last hidden state of the encoder as input to predict the probability distribution over question types, denoted by T:

T = softmax(MLP(u_l))

Please note that our model can generate multiple questions for one source sentence. When the number of questions to be generated is set to K, our model selects the K question types with the highest probability as output, and consequently our decoder decodes K times:

ty_1, ty_2, . . . , ty_K = TopK(T)

At each decoding run, for one of the best K question types ty, we embed it into a question type vector qt:

qt = Embedding(ty),  ty ∈ {ty_1, ty_2, . . . , ty_K}

The embedded question type vector is used in decoding; in this way, it guides the model to generate questions that follow the pattern of a specific question type. Specifically, we use the question type vector qt instead of the embedding of the [SOS] token as the input of the decoder at the first decoding step:

h_1 = LSTM(u_l, qt)

Besides, when calculating the generating probability p_gen, we concatenate qt with the global attention representation and the decoder hidden state:

p_gen = sigmoid(W_g [h_t; v_t; qt])

Finally, we train question type prediction and question generation simultaneously in a multi-task learning framework. For each sample <S, Q>, we obtain the ground-truth question type TY and add an additional term to the loss function, the negative log-likelihood of the target question's type:

L' = L + λ_1 · (− log T_TY)

where λ_1 is a hyper-parameter to balance the two parts.
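The top-K type selection step can be sketched as follows; the hidden size, the single linear layer standing in for the MLP, and the random weights are our own stand-ins for the paper's actual predictor:

```python
import numpy as np

QUESTION_TYPES = ["what", "who", "how", "where", "when", "yes/no", "others"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_top_k_types(u_l, W, b, k):
    """Project the last encoder hidden state to a distribution T over the
    7 question types, then keep the k most probable types."""
    T = softmax(W @ u_l + b)
    top = np.argsort(T)[::-1][:k]
    return [(QUESTION_TYPES[i], float(T[i])) for i in top]

rng = np.random.default_rng(0)
u_l = rng.normal(size=8)                    # stand-in for the last hidden state
W, b = rng.normal(size=(7, 8)), np.zeros(7)
top2 = predict_top_k_types(u_l, W, b, k=2)  # the decoder would then run twice,
                                            # once per selected type embedding
```

Each returned type would be embedded and fed to the decoder in place of the [SOS] embedding, yielding one generated question per selected type.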

Enhanced Copy Mechanism
According to our observation on SQuAD, when creating questions people often (and sometimes must) copy some keywords from the source sentence. We therefore regard non-stop words appearing in both the question and the source sentence as keywords, and we assume that these keywords should also occur in the generated sequence. In order to push the model to copy keywords from source sentences, we propose a copy loss to enhance the traditional copy mechanism. For a word x_i in the source sentence, we first define a copy label function cl as follows:

cl(x_i) = 1 if x_i ∈ Q and x_i ∉ stopwords; 0 otherwise

We believe that at some decoding step, the copy probability of each keyword (cl(x_i) = 1) should be close to 1. So we define the copy loss as:

L_c = − Σ_{i=1}^{l} cl(x_i) log ĉ_i,  where ĉ_i = max_{1≤t≤m} c_ti

where c_ti is the copy probability of the i-th word in the source sentence at the t-th decoding step, computed by Equation 4. Thus, ĉ_i is the highest copy probability of the i-th source word among all m decoding steps, and l denotes the number of words in the source sentence.
Finally, we add the copy loss into the total loss:

L' = L + λ_2 L_c

where λ_2 is another hyper-parameter to control the impact of the penalty loss.
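The copy loss above can be sketched in numpy as follows; the toy sentence, the stop-word list, and the copy-probability matrices are our own illustrative assumptions:

```python
import numpy as np

STOPWORDS = {"the", "of", "a", "in", "did", "?", "what", "who"}  # assumed toy list

def copy_loss(src_words, question_words, copy_probs, eps=1e-12):
    """copy_probs[t, i] = c_ti, the probability of copying the i-th source
    word at decoding step t (an m x l matrix). Keywords are non-stop words
    shared with the question; each keyword's best copy probability over
    all m steps should be pushed toward 1."""
    gold = set(question_words)
    cl = np.array([1.0 if (w in gold and w not in STOPWORDS) else 0.0
                   for w in src_words])      # copy labels cl(x_i)
    c_hat = copy_probs.max(axis=0)           # best copy prob per source word
    return float(-(cl * np.log(c_hat + eps)).sum())

src = ["the", "emperor", "ruled", "tibet"]
q = ["who", "ruled", "tibet", "?"]
# two decoding steps: one model copies the keywords confidently, one does not
confident = np.array([[0.0, 0.0, 0.9, 0.1], [0.0, 0.0, 0.05, 0.95]])
diffuse   = np.array([[0.2, 0.2, 0.3, 0.3], [0.25, 0.25, 0.25, 0.25]])
```

A model that assigns high copy probability to "ruled" and "tibet" at some step incurs a much smaller loss than one that spreads its copy attention uniformly, which is exactly the pressure the loss is designed to exert.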

Dataset and Pre-processing
We conduct experiments on the SQuAD dataset (Rajpurkar et al., 2016), which contains more than 70k training samples, 10k development samples and 11k test samples. Within the training, development or test set, multiple samples might share the same source sentence but with different target questions; however, the same source sentence never appears in different splits, which ensures that test sentences are unseen during training. We adopt subword representations (Sennrich et al., 2015) rather than raw words, which not only reduces the vocabulary size, increases the training speed and addresses out-of-vocabulary words, but also improves model performance. By using byte-pair encoding, our vocabulary size is reduced to less than 6k. Due to the vanishing gradient problem in recurrent neural networks (Pascanu et al., 2013), we choose 256 as the maximum input length and 50 as the maximum length of target questions.
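For readers unfamiliar with byte-pair encoding, one merge step can be sketched as below (in the spirit of Sennrich et al., 2015); the toy symbol-level vocabulary is our own:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """One BPE statistic: count adjacent symbol pairs over a vocabulary of
    space-separated symbol sequences, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(vocab, pair):
    """Apply one merge: replace every occurrence of the pair with its join."""
    bigram, merged = " ".join(pair), "".join(pair)
    return {w.replace(bigram, merged): f for w, f in vocab.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
best = most_frequent_pair(vocab)       # ("w", "e"): 2 + 6 = 8 occurrences
vocab = merge_pair(vocab, best)        # "l o w e r" -> "l o we r", etc.
```

Repeating this merge step a fixed number of times yields the subword vocabulary; frequent words survive intact while rare words decompose into subword units, which is how the vocabulary shrinks below 6k here.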

Implementation Details
We adopt a 2-layer bi-directional LSTM for encoding and a 1-layer LSTM for decoding. The number of hidden units is 600, and the dimension of both the word embedding and the question type embedding is 300. We do not use pre-trained word embeddings since we use subword rather than word-level representations. The dropout rate (Krizhevsky et al., 2012) between layers is 0.3. We first use Adam (Kingma and Ba, 2014) with a learning rate of 0.001 for fast training, and after 5 epochs, stochastic gradient descent (SGD) with a learning rate of 0.01 is used for fine-tuning. We train our model for 15 epochs with a mini-batch size of 64. During training, the hyper-parameter K is set to 1, and when decoding, we apply beam search with a beam size of 4. For the hyper-parameters λ_1 and λ_2, we try different settings and choose the best one by observing the descending trend of the total loss and the ascending trend of the BLEU-4 score on the validation set. Finally, both λ_1 and λ_2 are set to 0.1.

Evaluation Metrics
We adopt BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE-L (Flick, 2004) for evaluation, and use the evaluation package released by Chen (Chen et al., 2015). BLEU measures the precision of n-grams against a set of references, with a penalty for overly short generations. METEOR calculates the similarity between generations and references by considering synonyms, stemming and paraphrases. ROUGE measures the recall of n-grams against the set of references.
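For intuition, a simplified single-reference sentence-level BLEU (uniform n-gram weights plus brevity penalty) can be written as follows; the actual evaluation uses the package above, and this sketch omits corpus-level aggregation and smoothing:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times a brevity penalty,
    computed against a single reference; returns 0 if any precision is 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # clipped n-gram matches / candidate n-gram count
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "who persuaded the yongle emperor not to invade tibet ?".split()
```

The brevity penalty is what keeps a model from gaming precision by emitting a single safe n-gram: a correct but truncated prefix scores well below a full-length match.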

Results and Analysis
In this section, we report the automatic evaluation results of our proposed model and conduct an ablation study to demonstrate the effectiveness of its different parts. Then we carry out human evaluation and a case study to assess the quality of the generated questions. Furthermore, we give a detailed analysis of multiple question generation.

Main Results
We compare our model with the following previous works:
• Seq2Seq Attention: a traditional sequence-to-sequence attention model.
• Seq2SeqAtt (Du): the first work on the AG-QG task, a sequence-to-sequence attention model.
• Transformer (Scialom): the previous state-of-the-art for the AG-QG task, which adopts a transformer network with several extensions (Scialom et al., 2019).
We do not take  into comparison because their evaluation is done on a different test set and is not accessible.
The experimental results are shown in Table 2. The full version of our model, which uses both the question type module and the copy loss mechanism, obtains the best results on all metrics, achieving a new state-of-the-art BLEU-4 of 13.90 for the challenging AG-QG task. It outperforms the baseline model by 0.73 points and beats the previous best result by 0.67 points.

Ablation Study
We conduct extensive experiments with different model modules, where K is set to 1 in decoding. The results are reported in Table 3.
• Baseline: Our baseline model is a general sequence-to-sequence attention model enhanced with copy mechanism.
• Baseline+Type: It adds the question type module to the baseline model.
• Baseline+CopyLoss: Based on the baseline model, it calculates and minimizes the additional copy loss.
• Baseline+CopyLoss+Type: This is the full version of our proposed model. That is, the question type module is applied to the baseline model and the extra copy loss is also calculated.
• Upper Bound: Since our full model incorporates the question type prediction part, the accuracy of question type prediction will undoubtedly affect the final quality of generation. If the right question type is given for every test sample, we get the upper bound of our model.

Effect of Question Type Module.
Compared with the baseline model, the question type module brings a slight performance gain. The upper bound shows that if the right type is given for each test sample, the model yields a much better performance with a BLEU-4 score of 15.27, which demonstrates the huge potential of our model. It proves that our model has successfully learned the patterns of different types of questions. However, our question type prediction module cannot achieve 100% accuracy, and once a wrong question type is offered to the decoder, it has a negative influence on generation quality. In fact, our question type prediction module achieves an overall accuracy of 69%, and the prediction results for different question types are shown in Table 4. This shows that, without the answer as input, predicting the types of questions that should be asked for a given sentence is non-trivial.

Table 5: Performance of copying keywords between two models with and without the enhanced copy mechanism.
Effect of Enhanced Copy Mechanism.
Our designed copy loss aims to enhance the copy mechanism. Since it pushes the model to ensure that every keyword is copied, it directly leads to a higher BLEU-1. Encouragingly, the experiments show that the copy loss mechanism also contributes a consistent 0.48-point increase in BLEU-4. To analyze the new copy mechanism in depth, we also conduct experiments with and without the copy loss, counting the average number of keywords in questions generated by each model. The results are shown in Table 5. Our copy loss brings an absolute 4.53% increase in words copied from source sentences, which helps the model generate higher quality questions.

Human Evaluation
We also conduct human evaluation to judge the quality of questions generated by our model and the baseline model, respectively. We take three metrics into consideration: 1) Fluency: whether the question is grammatical; 2) Relevance: whether the question is highly related to the source sentence; and 3) Answer-ability: whether the generated question can be answered with the information in the source sentence. We randomly selected 100 sentence-question pairs generated by the different models, and asked three annotators to score the questions on a 1-5 scale (5 for the best). We also use the Spearman correlation coefficient to measure the inter-annotator agreement. The results are shown in Table 6. They show that the consistency among the three annotators is satisfactory, and that our generated questions are better from all perspectives. Besides, a further two-tailed t-test shows that our generated questions are significantly better than those of the baseline model, with p < 0.001 for every metric.
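For reference, the Spearman coefficient used above can be computed as follows for tie-free score lists (a proper implementation would add tie correction, since 1-5 annotator scores usually contain ties); the example scores are our own:

```python
def spearman(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the per-item difference between the two rankings.
    Assumes no ties within a list."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# two annotators scoring five questions on a 1-5 scale (tie-free example)
rho = spearman([5, 4, 2, 3, 1], [4, 5, 1, 3, 2])
```

A value near 1 means the annotators rank the questions almost identically, which is what "satisfactory consistency" refers to above.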

Case Study
In order to show the effectiveness of our model, we offer two real samples from the test set, as shown in Table 7. In both samples, the baseline model generates a wrong type of question while our model predicts the right type. At the same time, our model successfully copies more keywords from the source sentence (shown in italics), while the baseline model fails. In both samples, our generated questions are more fluent and coherent.

Table 7, Sample 1
Source sentence: the ex-president of def jam l.a. reid has described beyonc as the greatest entertainer alive .
Ground-truth: who has said that beyonc is the best entertainer alive ?
Baseline: what is the greatest entertainer alive ?
Our model (K=1): who described beyonc as the greatest entertainer alive ?

Table 7, Sample 2
Source sentence: tibetan sources say deshin shekpa also persuaded the yongle emperor not to impose his military might on tibet as the mongols had previously done .
Ground-truth: who convinced the yongle emperor not to send military forces into tibet ?
Baseline: what did tibetan sources say deshin ?
Our model (K=1): who persuaded the yongle emperor not to impose his military might on tibet ?

Table 8 sample
Source sentence: buddhist architecture , in particular , showed great regional diversity .
Ground-truth: which cultures architecture showed a lot of diversity ? what type of architectural is especially known for its regional differences ?
Our model (K=2): 1st: which buddhist architecture has showed great regional diversity ? 2nd: what is buddhist architecture ?

Asking Different Types of Questions
For a given sentence, our question type driven framework gives the model the ability to generate different types of questions. In this case, the parameter K is set to more than 1, and the question type predictor gives the K question types with the highest probability. Then the model automatically decodes K times to generate the best K types of questions. We list a sample in Table 8 with K = 2 to show the generation diversity of our model, where two types of questions (what and which) are generated from the same input sentence.
Besides, to further examine the effect of our model, we visualize the decoder attention, as shown in Figure 2. The two attention maps show the attention distributions when our model generates different types (which and what) of questions for the same input sentence, where the x-axis is the source sentence and the y-axis is the generated question. The differences between the attention maps show that our model attends to different information when generating different types of questions.
From the table, we see that our model has the ability to generate multiple questions. However, its limitations are also obvious. First, if K is too large, the generated questions of less probable types are of low quality. Second, since the probability distribution over question types is calculated automatically, the types of the generated questions cannot be known beforehand.

Conclusion
In this paper, we propose two new strategies for answer-agnostic QG: a question type module and a copy loss mechanism. The proposed modules improve the performance over the baseline model, achieving state-of-the-art results. Moreover, our model has the ability and flexibility to generate multiple questions for one source sentence. Hopefully, the ideas of the question type module and the copy loss mechanism can also be applied to the answer-aware QG task or other similar text generation tasks.