Question Generation for Question Answering

This paper presents an approach to generating questions from given passages using neural networks, where large-scale QA pairs are automatically crawled and processed from Community-QA websites and used as training data. The contribution of the paper is two-fold. First, two types of question generation approaches are proposed: a retrieval-based method using a convolutional neural network (CNN), and a generation-based method using a recurrent neural network (RNN). Second, we show how to leverage the generated questions to improve existing question answering systems. We evaluate our question generation method on the answer sentence selection task on three benchmark datasets: SQuAD, MS MARCO, and WikiQA. Experimental results show that, by using generated questions as an extra signal, significant QA improvement can be achieved.

Motivated by this, we explore how to generate questions from given passages using neural networks, with three goals: (1) the training data should require little or no human effort and reflect commonly-asked question intentions; (2) the questions should be generated from natural language passages and be of good quality; (3) the generated questions should be helpful to QA tasks.
To achieve the first goal, we propose to acquire large-scale, high-quality training data from Community-QA (CQA) websites. The motivation for using CQA websites for training data collection is that such websites (e.g., YahooAnswers, Quora, etc.) contain large-scale QA pairs generated by real users; these questions reflect the most common user intentions and are therefore useful in search, QA, and chatbot scenarios.
To achieve the second goal, we explore two ways to generate questions for a given passage: a retrieval-based method using a convolutional neural network (CNN), and a generation-based method using a recurrent neural network (RNN). We evaluate generation quality by BLEU score (Papineni et al., 2002) and human annotation, and discuss the pros and cons of the two methods in Section 9.
To achieve the third goal, we integrate our question generation approach into an end-to-end QA task, i.e., answer sentence selection, and evaluate its impact on three popular benchmark datasets: SQuAD, MS MARCO, and WikiQA. Experimental results show that the generated questions improve QA quality on all three datasets.

Question Generation
Formally, given a passage S, the question generation (QG) engine generates a set of questions {Q_i}, where each generated Q_i can be answered by S. There are four components in our QG engine:

1. Question Pattern Mining, which extracts frequently-asked question patterns from large-scale CQA questions, without any human annotation effort;
2. Question Pattern Prediction, which predicts the top-N question patterns Q_p^1, ..., Q_p^N given S, by either a retrieval-based method or a generation-based method. "Prediction" therefore has two different meanings here: in the retrieval-based method, it means ranking existing question patterns and selecting the highest-ranked ones, while in the generation-based method, it means generating question patterns from S in a sequence-to-sequence manner, each of which could be a totally new question pattern beyond the existing question pattern set;
3. Question Topic Selection, which selects a phrase Q_t from S as the question topic, based on a predicted question pattern Q_p. Q_t will be filled into Q_p to form a complete question;
4. Question Ranking, which ranks all generated questions by a set of features.

Here, multiple questions with different intentions can be generated, as S may contain multiple facts.

Question Pattern Mining
A question pattern (QP) is defined as a word sequence Q_p = (w_1, w_2, ..., w_L). Each QP contains one and only one special word "#" as the placeholder; all the other L − 1 words are terminal words. For example, "who founded # ?" is a question pattern, where "#" denotes the placeholder, which could be filled by, e.g., an organization name. We generate questions only from frequently-asked question patterns, where "frequently-asked" means that a question pattern is extracted from a large-scale question set more than T times, T being a pre-defined threshold. In this paper, a question cluster (QC) based approach is proposed to mine frequently-asked question patterns from large-scale CQA questions.
First, a set of question clusters is collected from CQA webpages, where each question cluster consists of questions that are grouped as related questions 1 by the CQA website. For example, when the query "what is the population of nyc" is issued to YahooAnswers 2 , the returned page contains a list of related questions including "population of nyc", "nyc population", "nyc census", etc.
Second, for each question cluster QC = {Q_1, ..., Q_K}, we enumerate all valid contiguous n-grams as question topic candidates {Q_t^1, ..., Q_t^M}; each candidate must contain at least one content word, and its length must be at most 7 words. We then assign an importance score Impts(·) to each question topic candidate Q_t^m:

Impts(Q_t^m) = |Q_t^m| · Σ_{k=1}^{K} δ(Q_t^m, Q_k)

where Q_t^m denotes the m-th question topic candidate, δ(Q_t^m, Q_k) equals 1 when Q_t^m occurs in Q_k and 0 otherwise, and |Q_t^m| denotes the word count of Q_t^m, which boosts longer question topic candidates.
For each QC, we select the candidate Q_t with the highest importance score as the question topic, and remove it from each question to form a question pattern. We call each removed question topic a historical question topic of its corresponding question pattern. If Q_t does not occur in a question, that question is skipped.
Mining question patterns based on question clusters is motivated by the observation that all questions within a QC tend to ask about an identical "question topic" (e.g., nyc in the above example), either from different aspects, or from the same aspect using different expressions. Thus, we can leverage the consensus among questions to detect the boundary of the question topic: the more often an n-gram occurs in different questions within a question cluster, the more likely it is the question topic of that cluster.
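As a concrete illustration, the cluster-based mining step can be sketched as follows. The helper names, the stopword list, the consensus filter (a candidate must occur in at least two questions), and the exact scoring form (occurrence count times candidate length) are our own illustrative choices, not the paper's released code:

```python
STOPWORDS = {"what", "is", "the", "of", "a", "an", "how", "who", "where", "in"}

def ngrams(tokens, max_len=7):
    """Enumerate contiguous n-grams of length <= max_len that contain
    at least one content (non-stopword) word."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            cand = tokens[i:j]
            if any(w not in STOPWORDS for w in cand):
                yield " ".join(cand)

def mine_patterns(question_cluster):
    """Pick the highest-scoring shared n-gram as the question topic, then
    replace it with '#' in each question to form question patterns."""
    questions = [q.lower() for q in question_cluster]
    candidates = {g for q in questions for g in ngrams(q.split())}
    scores = {}
    for cand in candidates:
        # delta(cand, Q_k): a substring test is a crude stand-in for occurrence
        occurs = sum(1 for q in questions if cand in q)
        if occurs >= 2:  # consensus filter: the topic must be shared
            scores[cand] = occurs * len(cand.split())
    if not scores:
        return None, []
    topic = max(scores, key=scores.get)
    patterns = [q.replace(topic, "#", 1) for q in questions if topic in q]
    return topic, patterns
```

On the "nyc" cluster from the running example, the shared unigram "nyc" wins, and each question is turned into a pattern such as "what is the population of #".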
Although the question pattern mining approach described above is simple, it works surprisingly well. Table 1 shows statistics of question patterns mined from YahooAnswers, and Figure 1 gives examples of frequently-asked question patterns with their corresponding historical question topics. We have two interesting observations: 1. most frequently-asked question patterns (frequency >= 10,000) are of high quality and reflect the most common user intentions; 2. most extracted historical question topics are entities. This is achieved without using any prior semantic knowledge base or dictionary.

Question Pattern Prediction
Given a passage S, question pattern prediction predicts S's most related question patterns, and then uses them to generate questions. For example, given S as "Tesla Motors is an American automaker and energy storage company co-founded by Elon Musk, Martin Eberhard, Marc Tarpenning, JB Straubel and Ian Wright, and is based in Palo Alto.", two question patterns can be derived from S: (1) "who founded # ?", which can be inferred from the context around "co-founded by", and (2) "where is # located ?", which can be inferred from the context around "is based in". Based on these two question patterns, we can generate the questions "who founded Tesla Motors ?" and "where is Tesla Motors located ?" respectively.

Training Data Construction
We collect QA pairs from YahooAnswers. For each QA pair ⟨Q, A⟩, if (1) Q can be matched by a frequently-asked question pattern Q_p, and (2) the matched question topic Q_t of Q based on Q_p exists in A, then we create a training instance ⟨A, Q_p, Q_t⟩. The condition Q_t ∈ A makes sure that the question topic occurs in both Q and A; if the matched question topic only exists in Q, we simply discard the pair. By doing so, we collect a total of 1,984,401 training instances as training data for QP prediction. Two neural network-based question pattern prediction approaches are explored in this paper:
• Retrieval-based QP Prediction, which treats QP prediction as a ranking task;
• Generation-based QP Prediction, which treats QP prediction as a generation task.
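The construction of ⟨A, Q_p, Q_t⟩ triples can be sketched as below; `match_pattern` and `build_instances` are hypothetical helper names, and regex-based matching is one simple way to implement the pattern-matching step:

```python
import re

def match_pattern(question, pattern):
    """If `question` matches `pattern` (with '#' as a wildcard slot),
    return the question topic filling the slot, else None."""
    regex = re.escape(pattern).replace(r"\#", "(.+)")
    m = re.fullmatch(regex, question)
    return m.group(1).strip() if m else None

def build_instances(qa_pairs, patterns):
    """Create <A, Qp, Qt> training triples from CQA QA pairs, keeping a
    pair only when the matched topic also occurs in the answer."""
    instances = []
    for q, a in qa_pairs:
        for p in patterns:
            topic = match_pattern(q, p)
            if topic and topic in a:  # enforce Qt ∈ A
                instances.append((a, p, topic))
    return instances
```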

Retrieval-based QP Prediction
The retrieval-based QP prediction is based on an attention-based convolutional neural network. It takes a passage and a question pattern as input, and outputs their corresponding vector representations. We denote each input pair as ⟨S, Q_p⟩, where S is a passage and Q_p is a question pattern.
In the input layer, given an input pair ⟨S, Q_p⟩, an attention matrix Att ∈ R^{|S|×|Q_p|} is generated from the pre-trained word embeddings of S and Q_p, where each element Att_{i,j} ∈ Att is computed as:

Att_{i,j} = cosine(v_i^S, v_j^{Q_p})

where v_i^S (or v_j^{Q_p}) denotes the embedding vector of the i-th (or j-th) word in S (or Q_p).
Then, row-wise and column-wise max-pooling are applied to Att to generate two attention vectors V^S ∈ R^{|S|} and V^{Q_p} ∈ R^{|Q_p|}, where the k-th elements of V^S and V^{Q_p} are computed as:

V_k^S = max_{j=1,...,|Q_p|} Att_{k,j},    V_k^{Q_p} = max_{i=1,...,|S|} Att_{i,k}

V_k^S (or V_k^{Q_p}) can be interpreted as the attention score of the k-th word in S (or Q_p) with regard to all words in Q_p (or S).
Next, two attention distributions D^S ∈ R^{|S|} and D^{Q_p} ∈ R^{|Q_p|} are generated for S and Q_p from V^S and V^{Q_p} respectively, using a softmax:

D_k^S = exp(V_k^S) / Σ_{l=1}^{|S|} exp(V_l^S),    D_k^{Q_p} = exp(V_k^{Q_p}) / Σ_{l=1}^{|Q_p|} exp(V_l^{Q_p})

D_k^S (or D_k^{Q_p}) can be interpreted as the normalized attention score of the k-th word in S (or Q_p) with regard to all words in Q_p (or S).
Last, we update each pre-trained word embedding by scaling it with its attention weight:

v̂_k^S = D_k^S · v_k^S,    v̂_k^{Q_p} = D_k^{Q_p} · v_k^{Q_p}

The underlying intuition of updating the pre-trained word embeddings is to re-weight the importance of each word in S (or Q_p) based on Q_p (or S), instead of treating all words equally.
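The input layer described above (cosine attention matrix, max-pooling, softmax normalization, and embedding re-weighting) can be sketched with NumPy as follows; function and variable names are our own:

```python
import numpy as np

def attention_reweight(S_emb, Qp_emb):
    """Re-weight pre-trained word embeddings by mutual attention.
    S_emb: (|S|, d) passage embeddings; Qp_emb: (|Qp|, d) pattern embeddings."""
    # cosine-similarity attention matrix Att in R^{|S| x |Qp|}
    S_norm = S_emb / np.linalg.norm(S_emb, axis=1, keepdims=True)
    Q_norm = Qp_emb / np.linalg.norm(Qp_emb, axis=1, keepdims=True)
    att = S_norm @ Q_norm.T
    # row-wise / column-wise max-pooling -> attention vectors
    v_S = att.max(axis=1)   # shape (|S|,)
    v_Q = att.max(axis=0)   # shape (|Qp|,)
    # softmax -> attention distributions
    d_S = np.exp(v_S) / np.exp(v_S).sum()
    d_Q = np.exp(v_Q) / np.exp(v_Q).sum()
    # re-weight each embedding by its normalized attention score
    return d_S[:, None] * S_emb, d_Q[:, None] * Qp_emb
```

Note that after re-weighting, the per-word scaling factors over the passage sum to 1, so the passage's total embedding mass is redistributed toward words that match the pattern.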
In the convolution layer, we first derive an input vector l_t for each position by concatenating the (updated) embeddings of the words within a sliding window centered at the t-th word in S. The convolution layer then performs sliding window-based feature extraction, projecting each vector l_t ∈ Z^S to a contextual feature vector h_t^S:

h_t^S = tanh(W_c · l_t)

where tanh(x) = (1 − e^{−2x}) / (1 + e^{−2x}) is the activation function. The same operation is performed on Q_p as well.
In the pooling layer, we aggregate the local features extracted by the convolution layer from S into a sentence-level global feature vector of fixed size, independent of the length of the input sentence. Here, max-pooling is used to force the network to retain the most useful local features:

l_p^S(i) = max_t h_t^S(i)

The same operation is performed on Q_p as well.
In the output layer, one more non-linear transformation is applied to l_p^S:

y(S) = tanh(W_s · l_p^S)

where W_s is the semantic projection matrix and y(S) is the final sentence embedding of S. The same operation is performed on Q_p to obtain y(Q_p).
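A minimal NumPy sketch of the convolution, pooling, and output layers, assuming a window of 3 words and randomly initialized weights `W_c` and `W_s` (these names follow the text; the window size and dimensions are illustrative):

```python
import numpy as np

def tanh(x):
    # the activation function used in the text: (1 - e^{-2x}) / (1 + e^{-2x})
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def sentence_embedding(word_vecs, W_c, W_s, win=3):
    """Sliding-window convolution, global max-pooling, and a final
    non-linear projection, producing a fixed-size sentence embedding."""
    d = word_vecs.shape[1]
    pad = win // 2
    padded = np.vstack([np.zeros((pad, d)), word_vecs, np.zeros((pad, d))])
    feats = []
    for t in range(len(word_vecs)):
        l_t = padded[t:t + win].reshape(-1)  # window centered at word t
        feats.append(tanh(W_c @ l_t))        # contextual feature h_t
    pooled = np.max(feats, axis=0)           # global max-pooling over positions
    return tanh(W_s @ pooled)                # final embedding y(S)
```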
We train the model parameters W_c and W_s by minimizing the following max-margin ranking loss:

L = max(0, M − cosine(y(S), y(Q_p)) + cosine(y(S), y(Q_p^−)))

where M is a constant margin and Q_p^− is a negative instance. 3 In this paper, the number of negative instances m is set to 3.
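The ranking loss can be sketched as a standard max-margin objective over cosine similarities; the margin value and the sampled-negatives interface below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(y_S, y_pos, y_negs, margin=0.5):
    """Max-margin ranking loss: the positive pattern embedding should score
    at least `margin` higher than every sampled negative pattern embedding."""
    pos = cos(y_S, y_pos)
    return sum(max(0.0, margin - pos + cos(y_S, y_neg)) for y_neg in y_negs)
```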

Generation-based QP Prediction
The generation-based QP prediction is based on a sequence-to-sequence BiGRU with attention (Bahdanau et al., 2015), as commonly used in neural machine translation. The encoder reads the word sequence of an input passage S = (x_1, ..., x_{|S|}), and the decoder predicts the word sequence of an output question pattern Q_p = (y_1, ..., y_{|Q_p|}). The probability of generating a question pattern Q_p is computed as:

P(Q_p | S) = Π_{i=1}^{|Q_p|} P(y_i | y_1, ..., y_{i−1}, S)

where each conditional probability is defined as:

P(y_i | y_1, ..., y_{i−1}, S) = g(y_{i−1}, s_i, c_i)

g(·) denotes a non-linear function that outputs the probability of generating y_i, and s_i denotes the decoder hidden state at time i, computed as:

s_i = (1 − z_i) ∘ s_{i−1} + z_i ∘ s̃_i
s̃_i = tanh(W E[y_{i−1}] + U (r_i ∘ s_{i−1}) + C c_i)
z_i = σ(W_z E[y_{i−1}] + U_z s_{i−1} + C_z c_i)
r_i = σ(W_r E[y_{i−1}] + U_r s_{i−1} + C_r c_i)

where σ(·) is the sigmoid function, ∘ represents element-wise multiplication, E[w] ∈ R^{m×1} denotes the word embedding of a word w, and W, W_z, W_r ∈ R^{n×m}, U, U_z, U_r ∈ R^{n×n}, and C, C_z, C_r ∈ R^{n×2n} are weights. c_i denotes the context vector, computed as:

c_i = Σ_{j=1}^{|S|} α_{ij} h_j,    α_{ij} = exp(e_{ij}) / Σ_{j'} exp(e_{ij'}),    e_{ij} = v_a^T tanh(W_a s_{i−1} + U_a h_j)

where v_a ∈ R^{n'}, W_a ∈ R^{n'×n}, and U_a ∈ R^{n'×2n} are weights, and h_j denotes the j-th encoder hidden state, which is the concatenation of the forward hidden state →h_j and the backward hidden state ←h_j.
For training, stochastic gradient descent (SGD) is used, with Adadelta (Zeiler, 2012) to adapt the learning rate of each parameter. Given a batch D = {⟨S_i, Q_p^i⟩}_{i=1}^{M} of M instances, the objective is to minimize the negative log-likelihood:

L = −(1/M) Σ_{i=1}^{M} log P(Q_p^i | S_i)

For prediction, beam search is used to output the top-N question pattern predictions.
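The attention step of the decoder can be sketched as follows. This is the standard Bahdanau-style formulation described above, with illustrative weight shapes rather than the paper's actual configuration:

```python
import numpy as np

def attention_context(s_prev, H, v_a, W_a, U_a):
    """Bahdanau-style attention: score each encoder state against the
    previous decoder state, softmax-normalize, and build the context vector.
    H: (T, 2n) encoder hidden states (forward/backward concatenation)."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every source position j
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()        # attention weights over source positions
    c = alpha @ H               # context vector c_i
    return c, alpha
```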
The retrieval-based approach can only select from existing question patterns for each passage, but it guarantees that each question pattern comes from real questions and is grammatically well-formed. The generation-based approach, on the other hand, can generate totally new question patterns beyond the existing question pattern set. We compare the two in the experimental part (Section 8).

Question Topic Selection
Given a passage S and a predicted question pattern Q_p, question topic selection selects an n-gram (or phrase) Q_t from S, which can then be filled into Q_p to form a complete question. Since we have two question pattern prediction methods, we also have two ways to select the question topic Q_t.
For Q_p from the retrieval-based method, two types of prior knowledge are used to extract question topic candidates from S:
• entities as question topic candidates, detected based on Freebase 4 entities;
• noun phrases as question topic candidates, detected based on the Stanford parser (Klein and Manning, 2003).
Once a question topic candidate Q_t is extracted from S, we measure how well Q_t fits Q_p by:

s(Q_t, Q_p) = Σ_k (#(Q_p^{t_k}) / N) · dist(v_{Q_t}, v_{Q_p^{t_k}})

where s(Q_t, Q_p) denotes the confidence that Q_t can be filled into Q_p to generate a reasonable question; Q_p^{t_k} denotes the k-th historical question topic of Q_p; #(Q_p^{t_k}) denotes the number of times that Q_p^{t_k} was extracted from different question clusters to generate Q_p; v_p denotes the question topic embedding of a phrase p, computed as the average of the word embeddings in p; dist(·) denotes the cosine similarity between two question topic embeddings; and N = Σ_k #(Q_p^{t_k}) denotes the total number of historical question topics of Q_p. The basic principle of the above equation is that the historical question topics of a given question pattern help measure how likely a question topic candidate can be filled into this question pattern to form a reasonable question. For example, as most historical question topics of "who founded # ?" are organization names, it is very unlikely that a date or a film name is suitable for this question pattern.

4 https://developers.google.com/freebase/
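A small sketch of this scoring function, where `hist_vecs` and `hist_counts` are hypothetical containers for the historical topic embeddings and their extraction counts, and `dist` is taken to be cosine similarity:

```python
import numpy as np

def topic_fit(cand_vec, hist_vecs, hist_counts):
    """Score how well a candidate topic fits a pattern: cosine similarity to
    the pattern's historical topic embeddings, weighted by how often each
    historical topic was extracted (frequencies normalized by N)."""
    n_total = sum(hist_counts)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sum((c / n_total) * cos(cand_vec, h)
               for h, c in zip(hist_vecs, hist_counts))
```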
For Q_p from the generation-based method, suppose the placeholder "#" is the i-th word in Q_p; then we select as the question topic the j-th word w_j ∈ S that satisfies:

j = argmax_{j'} α_{i,j'}

where α_{i,j'} is the attention weight between the i-th decoder position and the j'-th source word. This question topic selection strategy leverages the attention scores between S and Q_p, and can be considered a COPY mechanism.
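This copy-style selection reduces to an argmax over one row of the decoder-to-source attention matrix; the sketch below assumes the attention weights have already been computed:

```python
import numpy as np

def select_topic(attention, placeholder_pos, source_tokens):
    """COPY-style topic selection: pick the source word that receives the
    highest attention weight at the decoder step that emits '#'.
    attention: (|Qp|, |S|) matrix of decoder-to-source attention weights."""
    j = int(np.argmax(attention[placeholder_pos]))
    return source_tokens[j]
```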

Question Ranking
Given a predicted question pattern Q_p and a selected question topic Q_t for an input passage S, a complete question Q can be generated simply by replacing "#" in Q_p with Q_t. We use a set of features to rank the generated question candidates:
• question pattern prediction score, i.e., the prediction score of either the retrieval-based or the generation-based approach;
• question topic selection score: for the retrieval-based approach, this is s(Q_t, Q_p); for the generation-based approach, it is the attention score;
• QA matching score, which measures the relevance between the generated question Q and S;
• word overlap between Q and S, which counts the number of words that co-occur in Q and S;
• question pattern frequency, which equals the extraction frequency of Q_p if Q is generated from or matched by Q_p, and 0 otherwise.
All features are combined by a linear model:

score(Q) = Σ_i λ_i · h_i(Q, S, Q_p, Q_t)

where h_i(Q, S, Q_p, Q_t) is one of the features described above and λ_i is its weight.
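The linear combination can be sketched in a few lines; the feature vectors and weights below are illustrative:

```python
def rank_questions(candidates, weights):
    """Combine per-question features with a linear model and sort by score.
    Each candidate is a (question, feature_vector) pair."""
    scored = [(sum(l * h for l, h in zip(weights, feats)), q)
              for q, feats in candidates]
    return [q for _, q in sorted(scored, reverse=True)]
```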

Question Generation for QA
This section describes how question generation can improve existing QA systems. Several types of QA systems exist, e.g., knowledge-based QA, community-based QA, and text-based QA. In this paper, we focus on the text-based QA task (a.k.a. answer sentence selection), which aims to select one or more answer sentences from a text given an input question. We select this task because it can be considered a dual task of QG. A typical answer sentence selection method, such as (Yin et al., 2016; Santos et al., 2016; Miller et al., 2016; Tymoshenko et al., 2016), computes the relevance score between the input question Q and each answer candidate A, and selects the candidate with the highest relevance score as the final answer:

A* = argmax_A QA(Q, A)

Motivated by Dual Learning (He et al., 2016), we integrate question generation into the answer ranking procedure by changing the above formula to:

A* = argmax_A [ QA(Q, A) + λ · QQ(Q, Q_max^gen) ]

where λ is a hyper-parameter. To compute QQ(Q, Q_max^gen), we generate the top-10 questions {Q_1^gen, ..., Q_10^gen} for the current answer candidate A, and compute the question-to-generated-question matching score as:

QQ(Q, Q_max^gen) = max_{i=1,...,10} sim(Q, Q_i^gen) · p(Q_i^gen)

sim(Q, Q_i^gen) is the similarity between the input question Q and the i-th generated question Q_i^gen, computed as the cosine similarity between the averaged word embeddings of Q and Q_i^gen; p(Q_i^gen) denotes the posterior probability computed from the generation score of each generated question, which is output by the question generation model described in Section 6. The underlying motivation is that questions generated from correct answers are more likely to be similar to labeled questions than questions generated from wrong answers.

Related Work

Yao et al. (2012) proposed a semantic-based question generation approach, which first parses the input sentence into its Minimal Recursion Semantics (MRS) representation, and then generates a question guided by the English Resource Grammar, which includes a large-scale handcrafted lexicon and grammar rules. Labutov et al. (2015) proposed an 'ontology-crowd-relevance' method for question generation: first, Freebase types and Wikipedia section names are used as semantic tags to understand texts; questions are then generated based on question templates that are aligned with types/section names and labeled by crowdsourcing; all generated questions are ranked by a relevance model. Chali and Hasan (2015) proposed a topic-to-question method, which uses about 350 general-purpose rules to transform semantic-role-labeled sentences into corresponding questions. Serban et al. (2016) used the encoder-decoder framework to generate 30M QA pairs, but their inputs are knowledge triples instead of passages. Song and Zhao (2016) proposed a question generation method using question template seeds, and used a search engine for question expansion. Du et al. (2017) proposed a neural question generation method using a vanilla sequence-to-sequence RNN model, which is the work most related to ours; however, this method still relies on a labeled dataset, and only an RNN was tried.
Compared to the related work mentioned above, our question generation approach is unique in two ways: (1) all question patterns used as training data for question generation are automatically extracted from a large-scale CQA question set without any crowdsourcing effort; such question patterns reflect the most common user intentions and are therefore useful to search, QA, and chatbot engines; (2) it is also the first time question generation is integrated and evaluated directly in an end-to-end QA task, showing significant improvements.
Experiment

Dataset

As described in Section 4.1, we collect 1,984,401 ⟨A, Q_p, Q_t⟩ triples from YahooAnswers, and use them as the training set of the question pattern prediction model. We re-use the dev sets and test sets of SQuAD, MS MARCO, and WikiQA to evaluate the quality of the generated questions. The dataset statistics are in Table 2. Besides, an answer sentence selection model (Yin et al., 2016) is trained on the same 1,984,401 QA pairs and used to compute the QA matching score for question ranking, as described in Section 6. Feature weights for question ranking are optimized on the dev set.

Evaluation on Question Generation
We first run a vanilla sequence-to-sequence method (Du et al., 2017) on the original training sets of these three datasets, and show the QG results in Table 3, where BLEU-4 is used as the metric.
We then evaluate the quality of the questions generated from our auto-extracted training set. For each passage in the test set, we generate the top-1 question with the retrieval-based method and the generation-based method respectively, and compare them with the labeled questions using BLEU-4 as the metric. Results are listed in Table 4.
From Tables 3 and 4 we draw two findings. (1) Compared to QG results based on the original labeled training sets, G-QG achieves comparable or better results. We attribute this to two facts: first, the size of the automatically constructed training set is much larger than the labeled training sets; second, as the QA pairs from CQA websites are generated by real users, their quality is good. (2) Generation-based QG performs better than retrieval-based QG. Analyzing the outputs, we find that the two methods perform similarly on question pattern prediction, but generation-based QG performs better on question topic selection. This could be because, in generation-based QG, question topic selection is based on the attention mechanism, which is optimized together with question pattern prediction in an end-to-end way, whereas in retrieval-based QG, question topic selection is a separate task based on the similarity between each question topic candidate and the historical question topics of a given question pattern. The embedding of each question topic is pre-trained and not directly related to the question generation task, so this method cannot handle unseen question topics very well. Another disadvantage of retrieval-based QG is that we have to compute the similarity between the input passage and every question pattern; when the question pattern set is large, this computation is very expensive.
To better understand the question generation quality, we manually check a set of sampled outputs and list the main errors in Figure 2:
• Multi-Fact Error (40%). Most input passages include more than one fact. For such a passage, it is reasonable to generate different questions from different aspects, all of which can be answered by the input passage. For each passage in QAGen, we only label one question as ground truth. In the future, we will extend QAGen into a more comprehensive dataset by labeling multiple questions per passage, for a more reasonable evaluation;
• Paraphrase Error (30%). The same question can be expressed in different ways. Labeling more paraphrased questions per passage can alleviate this issue as well;
• Question Topic Selection Error (15%). This error is caused by selecting either a totally wrong question topic, or a partially right one. In the future, we plan to develop an independent question topic selection model for the question generation task.

Evaluation on QA
As described in Section 7, we combine question generation with the QA system for the answer sentence selection task, and evaluate on SQuAD, MS MARCO, and WikiQA. Evaluation results are shown in Tables 5, 6, and 7, where QA denotes the result of our in-house implementation of DocChat, a retrieval-based answer selection approach, and QA+QG denotes the result of combining the question-to-generated-question matching score with the DocChat score.
SQuAD    MAP     MRR     ACC@1
QA       0.8843  0.8915  0.8160
QA+QG    0.8887  0.8963  0.8232

for CQA websites; while questions from the other datasets are labeled by crowd-sourcing.
To explain these improvements, two datasets, WikiQG+ and WikiQG-, are built from the WikiQA test set: given each document and its labeled question, we pair the question with its CORRECT answer sentence as a QA pair and add it to WikiQG+; we also pair the same question with a randomly selected WRONG answer sentence and add it to WikiQG-. Then, we generate questions for the passages in WikiQG+ and WikiQG- respectively, and compare them with the labeled questions. The BLEU-4 score is 0.2031 on WikiQG+ and 0.1301 on WikiQG-, which indicates that questions generated from correct answers are more likely to be similar to labeled questions than questions generated from wrong answers.

Conclusion

This paper presents a neural question generation method trained on data collected from CQA questions. We integrate the question generation task into an end-to-end QA task and show significant improvements, which indicates that QA and QG are dual tasks that can boost each other. In the future, we will explore more ways to leverage QG for the QA task.