How to Ask Good Questions? Try to Leverage Paraphrases

Given a sentence and its relevant answer, asking good questions is a challenging task with many real applications. Inspired by the human ability to paraphrase, that is, to ask questions of the same meaning with diverse expressions, we propose to incorporate paraphrase knowledge into question generation (QG) to generate human-like questions. Specifically, we present a two-hand hybrid model leveraging a self-built paraphrase resource, which is automatically constructed by a simple back-translation method. On the one hand, we conduct multi-task learning with sentence-level paraphrase generation (PG) as an auxiliary task to supply paraphrase knowledge to the task-share encoder. On the other hand, we adopt a new loss function for diversity training to introduce more question patterns to QG. Extensive experimental results show that our proposed model obtains clear performance gains over several strong baselines, and further human evaluation validates that our model can ask questions of high quality by leveraging paraphrase knowledge.


Introduction
Question generation (QG) is an essential task for NLP, which focuses on generating grammatical questions for given paragraphs or sentences. It plays a vital role in various realistic scenarios. For educational purposes, QG can create reading comprehension materials for language learners (Heilman and Smith, 2010). For business use, QG can bring benefits to conversation systems and chat-bots for effective communication with humans (Mostafazadeh et al., 2016). Besides, automatically generated questions can in turn be used for constructing question answering datasets to enhance reading comprehension systems.
Table 1: Real examples of generated questions from SQuAD. We highlight the paraphrase transitions between sentences and questions. Humans create good questions by leveraging paraphrase knowledge, while the automatically generated questions just copy the original sentence, resulting in lower evaluation scores.
Although much progress has been made in QG, existing approaches do not explicitly model the "notorious" lexical and syntactic gaps in the generation process. That is, some parts of two texts (e.g. the input sentence and reference question, or the reference question and generated question) may convey the same meaning but use different words, phrases or syntactic patterns. In real communication, humans often paraphrase a source sentence to ask questions which are grammatical and coherent. Take SQuAD (Rajpurkar et al., 2016) as an example, a popular reading comprehension dataset that has been widely used for QG: a large percentage of its questions are created by paraphrasing (33.3% of the questions contain synonymy variations and 64% contain syntactic variations (Rajpurkar et al., 2016)). Two examples are shown in Table 1. Due to the lack of paraphrase knowledge, the generated questions simply copy certain words from the input sequence, and their quality is thus not competitive with human-created questions.
To address this issue, we introduce paraphrase knowledge into the QG process to generate human-like questions. The sketch of our design is illustrated in Figure 1. To keep our model easy to implement and trainable in an end-to-end fashion, we do not use any extra paraphrase generation (PG) dataset but use a simple back-translation method to automatically create paraphrases for both the input sentences and reference questions. Based on the high-quality expanded data, we propose a two-hand hybrid model. On the left hand, using the expanded sentence paraphrase as the target of PG, we perform multi-task learning with PG and QG, to optimize the task-share encoder with the paraphrase knowledge. On the right hand, with the gold reference question and question paraphrase as QG's multi-targets, we adopt a new min-loss function to enable the QG module to learn more diverse question patterns.
We conduct extensive experiments on SQuAD and MARCO (Nguyen et al., 2016). Results show that both separate modules, the PG auxiliary task and the min-loss function, clearly improve the performance of the QG task, and combining them achieves further improvements. Furthermore, human evaluation results show that our hybrid model can ask better and more human-like questions by incorporating paraphrase knowledge.

Related Work
Among current mainstream neural network-based methods for QG, most approaches utilize the Seq2Seq model with attention mechanism (Du et al., 2017; Zhao et al., 2018b; Zhou et al., 2019a). To obtain better representations of the input sequence and answer, the answer position and token lexical features are treated as supplements for the neural encoder (Song et al., 2018; Kim et al., 2018). Similar to other text generation tasks, many works on QG also employ a copy or pointer mechanism to overcome the OOV problem (Du and Cardie, 2018; Sun et al., 2018; Zhang and Bansal, 2019). Recently, Zhou et al. (2019a) employ language modeling (LM) as an auxiliary task to enrich the encoder representations. In this paper, we adopt this work as one of the baseline models, since their universal model is easy to implement and achieves promising results for QG.
In order to make use of the context information of paragraphs, Zhao et al. (2018b) propose a gated self-attention network to encode context passage. Based on this, Zhang and Bansal (2019) apply reinforcement learning to deal with semantic drift in QG; Nema et al. (2019) use a passage-answer fusion mechanism to obtain answer-focused context representations; Li et al. (2019a) utilize gated attention to fuse answer-relevant relation with context sentence. Besides, Chen et al. (2019) design different passage graphs to capture structure information of passage through graph neural networks. Dong et al. (2019) propose a unified language model pre-training method to obtain better context representations for QG. All these works adopt a whole paragraph as input to generate questions. Different from this, our work only takes a sentence as input and leaves paragraph-level QG for future research.
Paraphrase generation is also a challenging task for NLP. Recent works usually obtain paraphrases by reordering or modifying the syntax or lexicon based on paraphrase databases and rules (Fader et al., 2013; Chen et al., 2016), or by employing neural generation methods (Prakash et al., 2016; Li et al., 2019b). In this paper, we employ a simple and effective paraphrasing method to expand both input sentences and reference questions. Our method can also be replaced with more sophisticated paraphrasing methods.
Paraphrase knowledge has been used to improve many NLP tasks, such as machine translation, question answering, and text simplification. Callison-Burch et al. (2006) use paraphrase techniques to deal with unknown phrases to improve statistical machine translation. Fader et al. (2013) and Dong et al. (2017) employ paraphrase knowledge to enhance question answering models. Kriz et al. (2018) utilize paraphrase and context-based lexical substitution knowledge to improve the simplification task. Similarly, Zhao et al. (2018a) combine paraphrase rules of PPDB (Ganitkevitch et al., 2013) with Transformer (Vaswani et al., 2017) to perform the sentence simplification task. Guo et al. (2018a) propose a multi-task learning framework with PG and simplification. In addition, Yu et al. (2018) and Xie et al. (2019) use paraphrases as data augmentation for their primary tasks. Different from these works, we leverage paraphrase knowledge for question generation, by automatically constructing a built-in paraphrase corpus without using any external paraphrase knowledge bases.

Model Description
In this section, we first describe the two baseline models we use: the feature-enriched pointer-generator and language modeling enhanced QG. Then we explain how to obtain paraphrase resources and show their quality statistics. Furthermore, we describe in detail the two modules that utilize paraphrase knowledge, the PG auxiliary task and the min-loss function, as well as their combination. The overall structure of our hybrid model is shown in Figure 2.

Feature-enriched Pointer-generator
The first baseline adopts a bidirectional LSTM as the encoder, which takes the feature-enriched embedding $e_i$ as input:
$$e_i = [w_i; a_i; n_i; p_i; u_i] \tag{1}$$
where $w_i$, $a_i$, $n_i$, $p_i$, $u_i$ respectively represent the embeddings of the word, answer position, named entity, POS tag and word case. As in the decoder used by See et al. (2017), another unidirectional LSTM with attention mechanism is used to obtain the decoder hidden state $s_t$ and context vector $c_t$. Based on these, the pointer-generator model simultaneously calculates the probabilities of generating a word from the vocabulary and copying a word from the source text. The final probability distribution is the combination of these two modes with a generation probability $p_g$:
$$P(w) = p_g P_{vocab}(w) + (1 - p_g) P_{copy}(w) \tag{2}$$
The training objective is to minimize the negative log-likelihood of the target sequence $q$:
$$\mathcal{L}_{qg} = -\frac{1}{T_{qg}} \sum_{t=1}^{T_{qg}} \log P(y^{qg}_t = q_t) \tag{3}$$
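The mixture of generating and copying described above can be illustrated with a small sketch. It assumes the standard pointer-generator form of See et al. (2017), in which the copy probability of a word is the total attention mass on its occurrences in the source; the function name `final_distribution` and the toy numbers are hypothetical and not taken from the paper.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Combine the generation and copy modes into one distribution.

    p_gen:         generation probability p_g in [0, 1]
    vocab_probs:   dict mapping each vocabulary word to P_vocab(word)
    attention:     attention weights over source positions (sum to 1)
    source_tokens: source words aligned with `attention`
    """
    # Generation mode: scale the vocabulary distribution by p_g.
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    # Copy mode: add (1 - p_g) times the attention on each source occurrence.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example (illustrative numbers only).
vocab_probs = {"what": 0.5, "is": 0.3, "the": 0.2}
attention = [0.6, 0.1, 0.3]
source_tokens = ["basilica", "the", "basilica"]
dist = final_distribution(0.7, vocab_probs, attention, source_tokens)
```

In this toy case the out-of-vocabulary word "basilica" receives probability only through the copy mode, which is how a copy mechanism helps with the OOV problem.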

Language Modeling Enhanced QG
Zhou et al. (2019a) enhance QG with language modeling under a hierarchical structure of multitask learning. The language modeling aims at predicting the next and previous words in the input sequence with forward and backward LSTMs, respectively, which serves as a low-level task to provide semantic information for the high-level QG task.
In general, the input sequence is first fed into the language modeling module to get the semantic hidden states; these states are then concatenated with the input sequence to obtain the input of the feature-rich encoder:
$$e_i = [w_i; a_i; n_i; p_i; u_i; h^{lm}_i] \tag{4}$$
where $h^{lm}_i$ is the semantic hidden state of the LM module. The loss function of language modeling is defined as:
$$\mathcal{L}_{lm} = -\frac{1}{T_{lm}-1} \sum_{t=1}^{T_{lm}-1} \log P_{lm}(w_{t+1} \mid w_{<t+1}) - \frac{1}{T_{lm}-1} \sum_{t=2}^{T_{lm}} \log P_{lm}(w_{t-1} \mid w_{>t-1}) \tag{5}$$
where $P_{lm}(w_{t+1} \mid w_{<t+1})$ and $P_{lm}(w_{t-1} \mid w_{>t-1})$ represent the generation probabilities of the next word and the previous word, respectively.
As a result, the total loss of language modeling enhanced QG is formulated as:
$$\mathcal{L} = \mathcal{L}_{qg} + \beta \mathcal{L}_{lm} \tag{6}$$
where $\beta$ is a hyper-parameter to control the relative importance between language modeling and QG. Following the work of Zhou et al. (2019a), we set $\beta$ to 0.6. We re-implement this unified model to base our method on a strong baseline.

Paraphrase Expansion
The paraphrasing strategy is independent of the neural QG model, and any advanced method can be used to generate paraphrases. In our work, we employ a simple back-translation method to automatically create paraphrases of both sentences and questions. Specifically, we use a mature translation tool, Google Translate, which is a free and accessible online service. We translate an original text into German and then back to English to get its paraphrase. As a result, we obtain $s'$, the paraphrase of the input sentence $s$, and $q'$, the paraphrase of the golden reference question $q$. In the following sections, we illustrate how to use $(s, s')$ as a training pair of the auxiliary PG task, and how to adopt $(q, q')$ as multi-references for the diversity training module. This way of expanding paraphrases does not need extra PG datasets. Besides, it guarantees that the PG and QG tasks share the same input $s$, so we can optimize their shared encoder simultaneously and train the model end-to-end.
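The back-translation step can be sketched as a round trip through a pivot language. The sketch below deliberately does not call any real translation service: `translate` is a hypothetical callable supplied by the user, and `toy_translate` is a word-lookup stand-in used only so the example runs offline.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, pivot: str = "de") -> str:
    """Paraphrase `text` by translating English -> pivot -> English."""
    pivot_text = translate(text, "en", pivot)
    return translate(pivot_text, pivot, "en")


def expand_pairs(texts: List[str], translate: Translator,
                 pivot: str = "de") -> List[Tuple[str, str]]:
    """Return (original, paraphrase) pairs such as (s, s') or (q, q')."""
    return [(t, back_translate(t, translate, pivot)) for t in texts]


# Toy stand-in translator (NOT a real translation system).
_TOY_TABLE = {
    ("en", "de"): {"current": "gegenwärtige", "spot": "ort"},
    ("de", "en"): {"gegenwärtige": "present", "ort": "place"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    mapping = _TOY_TABLE[(src, tgt)]
    return " ".join(mapping.get(word, word) for word in text.split())


pairs = expand_pairs(["the current basilica is located on the spot"], toy_translate)
```

Keeping the translator behind a callable reflects the point made above that the paraphrasing strategy is independent of the QG model and can be swapped for another method.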
Table 2: Human-judged quality of the expanded paraphrases.
                      Synonym   Syntactic   Fluency
sentence-paraphrase   74%       7%          67%
question-paraphrase   58%       44%         67%
To assess the quality of the expanded paraphrases, we randomly select 100 paraphrases each from sentences and questions, and ask two annotators to judge the Synonym conversions and Syntactic transitions, as well as the paraphrase Fluency. As shown in Table 2, 74% of sentence paraphrases and 58% of question paraphrases have synonym conversions with respect to the source sequences, and 7% and 44% of them have sentence pattern transitions. Besides, 67% of paraphrases have no grammar errors. Two real expansion examples are shown in Table 3. This indicates that our expansion method introduces rich and high-quality paraphrasing knowledge into the original data.

Auxiliary PG Task
The multi-task learning mechanism with PG aims at introducing paraphrase knowledge into QG. In general, we employ a parallel architecture to combine PG and QG, where QG is the main task and PG serves as an auxiliary task. To make our model easy to implement and trainable end-to-end, we conduct the multi-task learning in a simultaneous mode. In detail, feature-enriched embeddings are first encoded by the task-share encoder and then fed into the PG and QG decoders respectively. The PG and QG decoders both have two layers; they are identical in structure but have different parameters. In the auxiliary PG task, the input is the original sentence $s$, and the training objective is to minimize the cross-entropy loss:
$$\mathcal{L}_{pg} = -\frac{1}{T_{pg}} \sum_{t=1}^{T_{pg}} \log P(y^{pg}_t = s'_t) \tag{7}$$
where $y^{pg}_t$ is the generated word of PG at time step $t$ and $s'_t$ is the $t$-th word in the expanded sentence paraphrase $s'$.
Table 3: Paraphrase expansion examples.
Input Sentence: the current basilica of the sacred heart is located on the spot of fr.
Sentence Paraphrase: the present basilica of the sacred heart is located in the place of fr.
Input Question: what structure is found on the location of the original church of father sorin at notre dame?
Question Paraphrase: what structure can be found at the location of the original church of father sorin at notre dame?

Soft Sharing Strategy
To enhance the impact of the auxiliary PG task so that the paraphrase knowledge can be absorbed more deeply by the question generation process, we employ a soft sharing strategy between the first layers of the PG and QG decoders. The soft sharing strategy loosely couples parameters and encourages them to be close to each other in representation space. Following the work of Guo et al. (2018b), we minimize the $l_2$ distance between the shared layers of the QG and PG decoders as a regularization. The soft sharing loss is defined as:
$$\mathcal{L}_{sf} = \sum_{d \in D} \lVert \theta_d - \phi_d \rVert_2 \tag{8}$$
where $D$ is the set of shared decoder parameters, and $\theta$ and $\phi$ respectively represent the parameters of the main QG task and the auxiliary PG task.
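As a rough illustration of the soft sharing regularizer, the sketch below sums the l2 distance between corresponding parameter tensors of the two decoders. It is written in plain Python with parameters as flat lists; the names `soft_sharing_loss`, `qg_params` and `pg_params` are hypothetical, and a real implementation would operate on framework tensors.

```python
import math


def soft_sharing_loss(qg_params, pg_params):
    """Sum of l2 distances between matching shared parameters.

    qg_params, pg_params: dicts mapping a parameter name to a flat list of
    floats (theta for the QG decoder, phi for the PG decoder).
    """
    total = 0.0
    for name, theta in qg_params.items():
        phi = pg_params[name]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, phi)))
    return total


# Toy first-layer parameters (illustrative values only).
qg = {"W": [1.0, 2.0], "b": [0.0]}
pg = {"W": [4.0, 6.0], "b": [0.5]}
sharing = soft_sharing_loss(qg, pg)
```

Because the term is only a penalty, the two decoders keep separate parameters and are merely encouraged to stay close, unlike hard sharing where the layer would be a single set of weights.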

Diversity Training with Min-loss Function
For the QG task, a general training goal is to fit the decoded results to the reference questions. To provide more generation patterns, we adjust the training target from one golden reference question to several reference questions by using the expanded paraphrase resources. We adopt a min-loss function among several references, and the loss function defined by Equation 3 can be rewritten as:
$$\mathcal{L}_{qg} = \min_{q \in Q} \left( -\frac{1}{T_{qg}} \sum_{t=1}^{T_{qg}} \log P(y^{qg}_t = q_t) \right) \tag{9}$$
where $Q$ is the set of the gold reference question and the expanded question paraphrase, $\{q, q'\}$. For each generated question, the negative log-likelihood is calculated separately against each of its references, and the final loss is the minimum of them. Under this training process, our model can learn multiple question expressions which are not in the original training dataset, so that the generation can be more diverse. Besides, inspired by the work of Kovaleva et al. (2018), we have tried several loss strategies, such as minimum loss, maximum loss, and weighted loss, to guide the diversity training. Among them, the minimum is the best performing strategy. By employing the minimum strategy, the QG decoder fits the generated question to the most similar sequence among the gold reference question and the question paraphrase. In this way, more question patterns are introduced into the QG process.
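The min-loss idea can be sketched as follows. The sketch assumes the model has already been run with teacher forcing against each reference, so that for each reference we have the probabilities it assigned to the reference tokens; the length normalization and the names `sequence_nll` and `min_loss` are assumptions made for illustration.

```python
import math


def sequence_nll(token_probs):
    """Length-normalized negative log-likelihood of one reference.

    token_probs: probabilities the model assigned to each reference token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def min_loss(token_probs_per_reference):
    """Return (loss, index) of the reference with the smallest NLL."""
    losses = [sequence_nll(probs) for probs in token_probs_per_reference]
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], best


# Toy probabilities for the gold question q and its paraphrase q'.
gold_probs = [0.5, 0.5, 0.5]
paraphrase_probs = [0.9, 0.8, 0.9, 0.7]
loss, chosen = min_loss([gold_probs, paraphrase_probs])
```

In this toy case the paraphrase is the better-fitting target, so only its loss would be back-propagated for this sample; with a different sample the gold question could be selected instead.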

Hybrid Model
Combining the above modules, we get our hybrid model. During training, the feature-enriched inputs are first encoded by the task-share encoder. Then the semantic hidden states are fed into the PG decoder and the QG decoder, respectively. The PG decoder has one fitting target (the expanded sentence paraphrase). The QG decoder calculates the cross-entropy loss with both the gold reference question and the question paraphrase and regards the minimum of the two as the QG loss. The auxiliary PG task and the diversity training strategy simultaneously optimize the question generation process. The combined training loss function can be defined as:
$$\mathcal{L}_{total} = \mathcal{L}_{qg} + \alpha \mathcal{L}_{pg} + \lambda \mathcal{L}_{sf} \tag{10}$$
where $\alpha$ and $\lambda$ are both hyper-parameters. We will describe the choice of these hyper-parameters later.
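A minimal sketch of how the three terms might be combined is given below. The additive weighted form is an assumption inferred from the description of α as balancing QG and PG and λ as the soft sharing coefficient; the default values 0.3 and 1e-6 are the ones reported in the parameter selection section, and `hybrid_loss` is a hypothetical name.

```python
def hybrid_loss(qg_reference_losses, pg_loss, sharing_loss,
                alpha=0.3, lam=1e-6):
    """Combine the QG min-loss, the PG loss and the soft sharing penalty.

    qg_reference_losses: QG losses against each reference (gold, paraphrase)
    pg_loss:             cross-entropy loss of the auxiliary PG task
    sharing_loss:        l2 soft sharing penalty between the two decoders
    """
    qg_loss = min(qg_reference_losses)  # diversity training with min-loss
    return qg_loss + alpha * pg_loss + lam * sharing_loss


# Toy loss values (illustrative only).
total = hybrid_loss([2.0, 1.5], pg_loss=1.0, sharing_loss=1000.0)
```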

Datasets
Our experiments are based on two reading comprehension datasets: SQuAD (2016) and MARCO (2016). On SQuAD, we use two data splits, Du Split and Zhou Split. For Du Split (Du et al., 2017), we use the same settings as Li et al. (2019a), with 74,689, 10,427 and 11,609 sentence-question-answer triples for training, validation and test respectively. For Zhou Split, we use the shared data, with 86,635, 8,965 and 8,964 triples correspondingly. On MARCO, there are 74,097, 4,539 and 4,539 sentence-answer-question triples for the train, development and test sets, respectively (Sun et al., 2018).
We expand the datasets using the paraphrase expansion approach described in Section 3.2. After that, one sample of the expanded dataset is in the form of ((sentence, sentence paraphrase), (question, question paraphrase), answer).
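One way to picture a sample of the expanded dataset is as a small record type. The class and method names below are hypothetical; the sentence and question strings are shortened from the expansion examples, and the answer field is a placeholder because the examples do not state it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExpandedSample:
    sentence: str
    sentence_paraphrase: str
    question: str
    question_paraphrase: str
    answer: str

    def pg_pair(self) -> Tuple[str, str]:
        """(s, s'): input and target of the auxiliary PG task."""
        return (self.sentence, self.sentence_paraphrase)

    def qg_references(self) -> Tuple[str, str]:
        """(q, q'): the two references used for diversity training."""
        return (self.question, self.question_paraphrase)


sample = ExpandedSample(
    sentence="the current basilica of the sacred heart is located on the spot",
    sentence_paraphrase="the present basilica of the sacred heart is located in the place",
    question="what structure is found on the location of the original church?",
    question_paraphrase="what structure can be found at the location of the original church?",
    answer="<answer span>",  # placeholder, not given in the examples
)
```

Both tasks read the same `sentence` field, which mirrors the requirement that PG and QG share one input so the encoder can be optimized jointly.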

Baselines and Metrics
For fair comparison, we report the following recent works on the sentence-level Du and Zhou Splits:
s2s (Du et al., 2017): an attention-based seq2seq model.
M2S+cp (Song et al., 2018): uses different matching strategies to explicitly model the information between answer and context.
A-P-Hybrid (Sun et al., 2018): generates an accurate interrogative word and focuses on important context words.
s2s-a-ct-mp-gsa (Zhao et al., 2018b): employs a gated attention encoder and a maxout pointer decoder to deal with long text inputs.
ASs2s (Kim et al., 2018): proposes an answer-separated Seq2Seq model by replacing the answer in the input sequence with some specific words.
LM enhanced QG (Zhou et al., 2019a): treats language modeling as a low-level task to provide semantic representations for the high-level QG.
Q-type (Zhou et al., 2019b): multi-task learning framework with question word prediction and QG.
Sent-Relation (Li et al., 2019a): extracts answer-relevant relations in the sentence and encodes both sentence and relations to capture answer-focused representations.
We evaluate the performance of our models using BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), which are widely used in previous works for QG.

Implementation Details
We set the vocabulary to the 20,000 most frequent words. We use 300-dimensional GloVe word vectors to initialize the word embeddings. Answer position and token lexical features are randomly initialized to 32-dimensional vectors drawn from a truncated normal distribution. The maximum lengths of the input sequence and output sequence are 100 and 40, respectively. The hidden sizes of the encoder, decoder, and language modeling LSTMs are all 512. We use Adagrad optimization with learning rate 0.15 for training. The batch size is 32 and the beam search decoding size is 12. To alleviate the volatility of the training procedure, we average the 5 checkpoints closest to the best-trained model on the development set.
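Checkpoint averaging can be sketched as an element-wise mean over saved parameters. The sketch assumes that "closest" means closest in training step to the best checkpoint, which is one possible reading; parameters are represented as flat lists, and both function names are hypothetical.

```python
def closest_checkpoints(steps, best_step, k=5):
    """Pick the k checkpoint steps nearest to the best step (assumed reading)."""
    nearest = sorted(steps, key=lambda step: abs(step - best_step))[:k]
    return sorted(nearest)


def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    checkpoints: list of dicts mapping a parameter name to a list of floats.
    """
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Toy example (illustrative values only).
selected = closest_checkpoints(list(range(1000, 9000, 1000)), best_step=5000)
mean_params = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```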

Main Results
The experimental results on the two splits of SQuAD are shown in Table 4. In terms of BLEU-4, which is often regarded as the main evaluation metric for text generation, our hybrid model-2 yields the best results on both splits, with 16.93 on Zhou Split and 17.21 on Du Split. We achieve state-of-the-art results on Du Split for sentence-level QG.
Especially for baseline-1, the performance gains of our model are more obvious. Our hybrid model-1 outperforms baseline-1 by 1.52 points on Zhou Split and 1.34 points on Du Split, which are large margins for this challenging task. Even based on this weak baseline, our method also achieves the state-of-the-art, 16.55 BLEU-4 score on Du Split for sentence-level QG.
The previous work of CGC-QG (Liu et al., 2019) obtains a 17.55 BLEU-4 score on Zhou Split. But their model relies on many heuristic rules and ad-hoc strategies. In their full model with clue prediction, they do graph convolutional network (GCN) operations on dependency trees, while our model does not use any hand-crafted rules and is lightweight without graphs and trees.
We also conduct experiments on MARCO, and the results are shown in Table 5. Our hybrid models obtain obvious improvements over two baselines, achieving a state-of-the-art BLEU-4 score of 21.61.
Notably, SQuAD and MARCO are built in different ways. The questions in SQuAD are generated by crowd-workers, while questions in MARCO are sampled from real user queries. The experimental results on the two datasets support the generalization and robustness of our models.

Effect of Multi-task Learning with PG Task
As shown in Table 4, the auxiliary PG task brings consistent improvements over both baseline models. On Zhou Split, it improves baseline-1 by 1.38 points and baseline-2 by 0.61 points. On Du Split, it improves baseline-1 by 1.16 points and baseline-2 by 0.51 points. The reason is that the PG task injects abundant paraphrase knowledge into the model and allows the task-share encoder to learn more paraphrasing representations.
Table 5: BLEU-4 results on MARCO.
Previous Works                            BLEU-4
s2s (Du et al., 2017)                     10.46
s2s-a-at-mp-gsa (Zhao et al., 2018b)      16.02
A-P-Hybrid (Sun et al., 2018)             19.45
LM enhanced QG (Zhou et al., 2019a)       20.88
Q-type (Zhou et al., 2019b)               21.59
Our Models
baseline-1                                20.13
hybrid model-1                            21.15
baseline-2                                20.79
hybrid model-2                            21.61

Effect of Diversity Training with Min-loss Function
From the results in Table 4, we can see that the min-loss strategy improves performance over both baseline models. On Zhou Split, we get a 0.77 improvement over baseline-1 and a 0.48 improvement over baseline-2. On Du Split, we get similar improvements.

Effect of Data Augmentation
A straightforward way to leverage paraphrase knowledge is data augmentation. To test whether simply adding paraphrase data as external training data works, we also conduct an experiment based on the question paraphrase resource. We add the $(s, q')$ pairs into the training dataset, where $s$ represents the input sentence and $q'$ denotes the paraphrase of the golden reference. Under this setting, we double the training samples. Unfortunately, as shown in Table 4, the baseline-1 model yields much lower BLEU-4 scores on both Zhou Split (13.28) and Du Split (13.36) with such data augmentation. The main reason is that for the same input sentence there are two different training targets ($q$ and $q'$), which makes it hard for the training process to converge.

Diversity Test
To investigate whether the paraphrase knowledge introduces more diverse expressions, we conduct evaluations on the distinct metric (Li et al., 2016), which is calculated as the number of distinct unigrams (distinct-1) and bigrams (distinct-2) divided by the total number of generated words. The experimental results are shown in Table 6. Our hybrid models obtain clear gains over the baseline models on both distinct-1 and distinct-2, indicating that our models generate more diverse questions with the help of paraphrase knowledge.
Table 6: Results of the distinct metric on Zhou Split.
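The distinct metric as described above can be computed with a few lines of code. The sketch assumes whitespace tokenization and pools all generated questions together; `distinct_n` is a hypothetical helper name.

```python
def distinct_n(questions, n):
    """Number of distinct n-grams divided by the total number of words."""
    ngrams = set()
    total_words = 0
    for question in questions:
        tokens = question.split()
        total_words += len(tokens)
        # Collect every contiguous n-gram of this question.
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_words if total_words else 0.0


# Toy generated questions (illustrative only).
generated = ["what is the capital", "what is the name"]
d1 = distinct_n(generated, 1)
d2 = distinct_n(generated, 2)
```

Here five distinct unigrams and four distinct bigrams are found among eight generated words, so the toy scores are 0.625 and 0.5.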

Ablation Study of Soft Sharing
We also verify the effectiveness of the soft sharing mechanism by removing it from the full hybrid models. The results are displayed in

Parameters Selection
The soft sharing coefficient hyper-parameter $\lambda$ is $1 \times 10^{-6}$, intuitively chosen by balancing the cross-entropy and regularization losses according to Guo et al. (2018b). The other hyper-parameter $\alpha$, which controls the balance between QG and PG, is tuned by grid search. We set $\alpha$ to different values to explore the best proportion of the two tasks. The experimental results for different $\alpha$ are shown in Figure 3. Consequently, we set $\alpha$ to 0.3 for our hybrid model.

Human Evaluation
To further assess the quality of generated questions, we perform human evaluation to compare our hybrid model-2 with the strong baseline of language modeling enhanced QG. We randomly select 100 samples from SQuAD (Zhou Split) and ask three annotators to score the generated questions according to three aspects:
Fluency: whether a question is grammatical and fluent;
Relevancy: whether the question is relevant to the input context;
Answerability: whether the question can be answered by the given answer.
The rating score is set to [0, 2]. The evaluation results are shown in Table 8. The Spearman correlation coefficients between annotators are high, which supports the validity of the human evaluation. Our hybrid model receives higher scores on all three metrics, indicating that our generated questions have higher quality in different aspects.
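For readers who want to reproduce the agreement check, the Spearman coefficient between two annotators can be computed as the Pearson correlation of their rank vectors. The sketch below uses average ranks for ties, which matter here because ratings only take values in [0, 2]; all function names and the toy ratings are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # undefined when one annotator gives constant scores
    return cov / math.sqrt(var_x * var_y)


def spearman(ratings_a, ratings_b):
    return pearson(average_ranks(ratings_a), average_ranks(ratings_b))


# Toy ratings from two annotators (illustrative only).
annotator_1 = [0, 1, 2, 2, 1]
annotator_2 = [0, 1, 2, 2, 1]
rho = spearman(annotator_1, annotator_2)
```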

Case Study
We list two examples of generated questions in Table 9. By introducing paraphrase knowledge into generation, the generated questions capture the paraphrase transitions between contexts and references well. The questions generated by our hybrid model are also more grammatical and coherent.

Different Paraphrasing Methods
To further test the generalization of our proposed methods, we use other paraphrasing methods to construct the paraphrase dataset:
PPDB: for each non-stop word and phrase, looking it up in PPDB (2013) and replacing it with its synonyms.
NMT: another back-translation method using a pre-trained Transformer (2017) model.
Mixed: expanding input sentences with Google Translate and expanding reference questions with PPDB.
From the results, we can observe that the Mixed paraphrase method obtains even better results than the mature Google Translate. This suggests that our proposed architecture is effective across different paraphrasing methods and has potential for improvement.

Conclusion and Future Work
In this paper, we propose a two-hand hybrid model leveraging paraphrase knowledge for QG. The experimental results of the independent modules and hybrid models show that our models are effective and transferable. Besides, human evaluation results demonstrate that the paraphrase knowledge helps our model ask more human-like questions of high quality. In the future, we will explore more diverse and advanced paraphrase expansion methods for both sentence-level and paragraph-level QG. Moreover, we will apply our methods to other similar tasks, such as sentence simplification.