Harvesting Paragraph-level Question-Answer Pairs from Wikipedia

We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. As compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We provide qualitative analysis for the this large-scale generated corpus from Wikipedia.


Introduction
Recently, there has been a resurgence of work in NLP on reading comprehension (Hermann et al., 2015;Rajpurkar et al., 2016; with the goal of developing systems that can answer questions about the content of a given passage or document. Large-scale QA datasets are indispensable for training expressive statistical models for this task and play a critical role in advancing the field. And there have been a number of efforts in this direction. Miller et al. (2016), for example, develop a dataset for open-domain question answering; Rajpurkar et al. (2016) and  do so for reading comprehension (RC); and Hill et al. (2015) and Hermann   (Rajpurkar et al., 2016) dataset. We show in italics the set of mentions that refer to Nikola Tesla -Tesla, him, his, he, etc. et al. (2015), for the related task of answering cloze questions (Winograd, 1972;Levesque et al., 2011). To create these datasets, either crowdsourcing or (semi-)synthetic approaches are used. The (semi-)synthetic datasets (e.g., Hermann et al. (2015)) are large in size and cheap to obtain; however, they do not share the same characteristics as explicit QA/RC questions (Rajpurkar et al., 2016). In comparison, high-quality crowdsourced datasets are much smaller in size, and the annotation process is quite expensive because the labeled examples require expertise and careful design (Chen et al., 2016).
Thus, there is a need for methods that can automatically generate high-quality question-answer pairs. Serban et al. (2016) propose the use of recurrent neural networks to generate QA pairs from structured knowledge resources such as Freebase. Their work relies on the existence of automatically acquired KBs, which are known to have errors and suffer from incompleteness. They are also nontrivial to obtain. In addition, the questions in the resulting dataset are limited to queries regarding a single fact (i.e., tuple) in the KB.
Motivated by the need for large scale QA pairs and the limitations of recent work, we investigate methods that can automatically "harvest" (generate) question-answer pairs from raw text/unstructured documents, such as Wikipediatype articles.
Recent work along these lines ) (see Section 2) has proposed the use of attention-based recurrent neural models trained on the crowdsourced SQuAD dataset (Rajpurkar et al., 2016) for question generation. While successful, the resulting QA pairs are based on information from a single sentence. As described in , however, nearly 30% of the questions in the human-generated questions of SQuAD rely on information beyond a single sentence. For example, in Figure 1, the second and third questions require coreference information (i.e., recognizing that "His" in sentence 2 and "He" in sentence 3 both corefer with "Tesla" in sentence 1) to answer them.
Thus, our research studies methods for incorporating coreference information into the training of a question generation system. In particular, we propose gated Coreference knowledge for Neural Question Generation (CorefNQG), a neural sequence model with a novel gating mechanism that leverages continuous representations of coreference clusters -the set of mentions used to refer to each entity -to better encode linguistic knowledge introduced by coreference, for paragraph-level question generation.
In an evaluation using the SQuAD dataset, we find that CorefNQG enables better question generation. It outperforms significantly the baseline neural sequence models that encode information from a single sentence, and a model that encodes all preceding context and the input sentence itself. When evaluated on only the portion of SQuAD that requires coreference resolution, the gap be-tween our system and the baseline systems is even larger.
By applying our approach to the 10,000 topranking Wikipedia articles, we obtain a question answering/reading comprehension dataset with over one million QA pairs; we provide a qualitative analysis in Section 6. The dataset and the source code for the system are available at https://github.com/xinyadu/ HarvestingQA.

Question Generation
Since the work by Rus et al. (2010), question generation (QG) has attracted interest from both the NLP and NLG communities. Most early work in QG employed rule-based approaches to transform input text into questions, usually requiring the application of a sequence of well-designed general rules or templates (Mitkov and Ha, 2003;Labutov et al., 2015). Heilman and Smith (2010) introduced an overgenerate-and-rank approach: their system generates a set of questions and then ranks them to select the top candidates. Apart from generating questions from raw text, there has also been research on question generation from symbolic representations (Yao et al., 2012;Olney et al., 2012).
With the recent development of deep representation learning and large QA datasets, there has been research on recurrent neural network based approaches for question generation. Serban et al. (2016) used the encoder-decoder framework to generate QA pairs from knowledge base triples; Reddy et al. (2017) generated questions from a knowledge graph;  studied how to generate questions from sentences using an attention-based sequence-to-sequence model and investigated the effect of exploiting sentencevs. paragraph-level information.  proposed a hierarchical neural sentencelevel sequence tagging model for identifying question-worthy sentences in a text passage. Finally, Duan et al. (2017) investigated how to use question generation to help improve question answering systems on the sentence selection subtask.
In comparison to the related methods from above that generate questions from raw text, our method is different in its ability to take into account contextual information beyond the sentencelevel by introducing coreference knowledge.

Question Answering Datasets and Creation
Recently there has been an increasing interest in question answering with the creation of many datasets. Most are built using crowdsourcing; they are generally comprised of fewer than 100,000 QA pairs and are time-consuming to create. We-bQuestions (Berant et al., 2013), for example, contains 5,810 questions crawled via the Google Suggest API and is designed for knowledge base QA with answers restricted to Freebase entities. To tackle the size issues associated with WebQuestions,  introduce SimpleQuestions, a dataset of 108,442 questions authored by English speakers. SQuAD (Rajpurkar et al., 2016) is a dataset for machine comprehension; it is created by showing a Wikipedia paragraph to human annotators and asking them to write questions based on the paragraph. TriviaQA  includes 95k question-answer authored by trivia enthusiasts and corresponding evidence documents.
(Semi-)synthetic generated datasets are easier to build to large-scale (Hill et al., 2015;Hermann et al., 2015). They usually come in the form of cloze-style questions.  Chen et al. (2016) showed that this dataset is quite noisy due to the method of data creation and concluded that performance of QA systems on the dataset is almost saturated.
Closest to our work is that of Serban et al. (2016). They train a neural triple-to-sequence model on SimpleQuestions, and apply their system to Freebase to produce a large collection of human-like question-answer pairs.

Task Definition
Our goal is to harvest high quality questionanswer pairs from the paragraphs of an article of interest. In our task formulation, this consists of two steps: candidate answer extraction and answer-specific question generation. Given an input paragraph, we first identify a set of question-worthy candidate answers ans = (ans 1 , ans 2 , ..., ans l ), each a span of text as denoted in color in Figure 1. For each candidate answer ans i , we then aim to generate a question Q -a sequence of tokens y 1 , ..., y N -based on the sentence S that contains candidate ans i such that: • Q asks about an aspect of ans i that is of potential interest to a human; • Q might rely on information from sentences that precede S in the paragraph.
Mathematically then, where P (Q|S, C) = N n=1 P (y n |y <n , S, C) where C is the set of sentences that precede S in the paragraph.

Methodology
In this section, we introduce our framework for harvesting the question-answer pairs. As described above, it consists of the question generator CorefNQG ( Figure 2) and a candidate answer extraction module. During test/generation time, we (1) run the answer extraction module on the input text to obtain answers, and then (2) run the question generation module to obtain the corresponding questions.

Question Generation
As shown in Figure 2, our generator prepares the feature-rich input embedding -a concatenation of (a) a refined coreference position feature embedding, (b) an answer feature embedding, and (c) a word embedding, each of which is described below. It then encodes the textual input using an LSTM unit (Hochreiter and Schmidhuber, 1997). Finally, an attention-copy equipped decoder is used to decode the question.
More specifically, given the input sentence S (containing an answer span) and the preceding context C, we first run a coreference resolution system to get the coref-clusters for S and C and use them to create a coreference transformed input sentence: for each pronoun, we append its most representative non-pronominal coreferent mention. Specifically, we apply the simple feedforward network based mention-ranking model of Clark and Manning (2016) to the concatenation of C and S to get the coref-clusters for all entities in C and S. The C&M model produces a score/representation s for each mention pair (m 1 , m 2 ), where W m is a 1 × d weight matrix and b is the bias. h m (m 1 , m 2 ) is representation of the last hidden layer of the three layer feedforward neural network.

Decoder LSTMs
For each pronoun in S, we then heuristically identify the most "representative" antecedent from its coref-cluster. (Proper nouns are preferred.) We append the new mention after the pronoun. For example, in Table 1, "the panthers" is the most representative mention in the coref-cluster for "they". The new sentence with the appended coreferent mention is our coreference transformed input sentence S (see Figure 2).

Coreference Position Feature Embedding
For each token in S , we also maintain one position feature f c = (c 1 , ..., c n ), to denote pronouns (e.g., "they") and antecedents (e.g., "the panthers"). We use the BIO tagging scheme to label the associated spans in S . "B_ANT" denotes the start of an antecedent span, tag "I_ANT" continues the antecedent span and tag "O" marks tokens that do not form part of a mention span. Similarly, tags "B_PRO" and "I_PRO" denote the pronoun span. (See Table 1, "coref. feature".)

Refined Coref. Position Feature Embedding
Inspired by the success of gating mecha-nisms for controlling information flow in neural networks (Hochreiter and Schmidhuber, 1997;Dauphin et al., 2017), we propose to use a gating network here to obtain a refined representation of the coreference position feature vectors f c = (c 1 , ..., c n ). The main idea is to utilize the mention-pair score (see Equation 2) to help the neural network learn the importance of the coreferent phrases. We compute the refined (gated) coreference position feature vector f d = (d 1 , ..., d n ) as follows, where denotes an element-wise product between two vectors and ReLU is the rectified linear activation function. score i denotes the mentionpair score for each antecedent token (e.g., "the" and "panthers") with the pronoun (e.g., "they"); score i is obtained from the trained model (Equation 2) of the C&M. If token i is not added later as an antecedent token, score i is set to zero. W a , W b are weight matrices and b is the bias vector.
Answer Feature Embedding We also include an answer position feature embedding to generate answer-specific questions; we denote the answer span with the usual BIO tagging scheme (see, e.g., "the arizona cardinals" in Table 1). During training and testing, the answer span feature (i.e., "B_ANS", "I_ANS" or "O") is mapped to its feature embedding space: f a = (a 1 , ..., a n ).

Word Embedding
To obtain the word embedding for the tokens themselves, we just map the tokens to the word embedding space: x = (x 1 , ..., x n ).
Final Encoder Input As noted above, the final input to the LSTM-based encoder is a concatenation of (1) the refined coreference position feature embedding (light blue units in Figure 2), (2) the answer position feature embedding (red units), and (3) the word embedding for the token (green units), Encoder As for the encoder itself, we use bidirectional LSTMs to read the input e = (e 1 , ..., e n ) in both the forward and backward directions. After encoding, we obtain two sequences of hidden vectors, namely, Question Decoder with Attention & Copy On top of the feature-rich encoder, we use LSTMs with attention (Bahdanau et al., 2015) as the decoder for generating the question y 1 , ..., y m one token at a time. To deal with rare/unknown words, the decoder also allows directly copying words from the source sentence via pointing (Vinyals et al., 2015). At each time step t, the decoder LSTM reads the previous word embedding w t−1 and previous hidden state s t−1 to compute the new hidden state, Then we calculate the attention distribution α t as in Bahdanau et al. (2015), where W c is a weight matrix and attention distribution α t is a probability distribution over the source sentence words. With α t , we can obtain the context vector h * t , Then, using the context vector h * t and hidden state s t , the probability distribution over the target (question) side vocabulary is calculated as, Instead of directly using P vocab for training/generating with the fixed target side vocabulary, we also consider copying from the source sentence. The copy probability is based on the context vector h * t and hidden state s t , and the probability distribution over the source sentence words is the sum of the attention scores of the corresponding words, Finally, we obtain the probability distribution over the dynamic vocabulary (i.e., union of original target side and source sentence vocabulary) by summing over P copy and P vocab , where σ is the sigmoid function, and W d , W e , W f are weight matrices.

Answer Span Identification
We frame the problem of identifying candidate answer spans from a paragraph as a sequence labeling task and base our model on the BiLSTM-CRF approach for named entity recognition (Huang et al., 2015). Given a paragraph of n tokens, instead of directly feeding the sequence of word vectors x = (x 1 , ..., x n ) to the LSTM units, we first construct the feature-rich embedding x for each token, which is the concatenation of the word embedding, an NER feature embedding, and a character-level representation of the word (Lample et al., 2016). We use the concatenated vector as the "final" embedding x for the token, where CharRep i is the concatenation of the last hidden states of a character-based biLSTM. The intuition behind the use of NER features is that SQuAD answer spans contain a large number of named entities, numeric phrases, etc. Then a multi-layer Bi-directional LSTM is applied to (x 1 , ..., x n ) and we obtain the output state z t for time step t by concatenation of the hidden states (forward and backward) at time step t from the last layer of the BiLSTM. We apply the softmax to (z 1 , ..., z n ) to get the normalized score representation for each token, which is of size n × k, where k is the number of tags.
Instead of using a softmax training objective that minimizes the cross-entropy loss for each individual word, the model is trained with a CRF (Lafferty et al., 2001) objective, which minimizes the negative log-likelihood for the entire correct sequence: − log(p y ), where q(x , y) = n t=1 P t,yt + n−1 t=0 A yt,y t+1 , P t,yt is the score of assigning tag y t to the t th token, and A yt,y t+1 is the transition score from tag y t to y t+1 , the scoring matrix A is to be learned. Y represents all the possible tagging sequences.

Dataset
We use the SQuAD dataset (Rajpurkar et al., 2016) to train our models. It is one of the largest general purpose QA datasets derived from Wikipedia with over 100k questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding Wiki passage. The crowdworkers were users of Amazon's Mechanical Turk located in the US or Canada. To obtain high-quality articles, the authors sampled 500 articles from the top 10,000 articles obtained by Nayuki's Wikipedia's internal PageRanks. The question-answer pairs were generated by annotators from a paragraph; and although the dataset is typically used to evaluate reading comprehension, it has also been used in an open domain QA setting Wang et al., 2018). For training/testing answer extraction systems, we pair each paragraph in the dataset with the gold answer spans that it contains. For the question generation system, we pair each sentence that contains an answer span with the corresponding gold question as in .
To quantify the effect of using predicted (rather than gold standard) answer spans on question generation (e.g., predicted answer span boundaries can be inaccurate), we also train the models on an augmented "Training set w/ noisy examples" (see Table 2). This training set contains all of the original training examples plus new examples for predicted answer spans (from the top-performing answer extraction model, bottom row of Table 3) that overlap with a gold answer span. We pair the new training sentence (w/ predicted answer span) with the gold question. The added examples comprise 42.21% of the noisy example training set.
For generation of our one million QA pair corpus, we apply our systems to the 10,000 topranking articles of Wikipedia.

Evaluation Metrics
For question generation evaluation, we use BLEU (Papineni et al., 2002) and ME-TEOR (Denkowski and Lavie, 2014). 1 BLEU measures average n-gram precision vs. a set of reference questions and penalizes for overly short sentences. METEOR is a recall-oriented metric that takes into account synonyms, stemming, and paraphrases.
For answer candidate extraction evaluation, we use precision, recall and F-measure vs. the gold standard SQuAD answers. Since answer boundaries are sometimes ambiguous, we compute Binary Overlap and Proportional Overlap metrics in addition to Exact Match. Binary Overlap counts every predicted answer that overlaps with a gold answer span as correct, and Proportional Overlap give partial credit proportional to the amount of overlap (Johansson and Moschitti, 2010;Irsoy and Cardie, 2014).

Baselines and Ablation Tests
For question generation, we compare to the stateof-the-art baselines and conduct ablation tests as follows: 's model is an attention-based RNN sequence-to-sequence neural network (without using the answer location information feature). Seq2seq + copy w/ answer is the attention-based sequence-to-sequence model augmented with a copy mechanism, with answer features concatenated with the word embeddings during encoding. Seq2seq + copy w/ full context + answer is the same model as the previous one, but we allow access to the full context (i.e., all the preceding sentences and the input sentence itself). We denote it as ContextNQG henceforth for simplicity. CorefNQG is the coreference-based model proposed in this paper. CorefNQG-gating is an   ablation test, the gating network is removed and the coreference position embedding is not refined. CorefNQG-mention-pair score is also an ablation test where all mention-pair score i are set to zero.
For answer span extraction, we conduct experiments to compare the performance of an off-theshelf NER system and BiLSTM based systems.
For training and implementation details, please see the Supplementary Material.
6 Results and Analysis 6.1 Automatic Evaluation Table 2 shows the BLEU-{3, 4} and METEOR scores of different models. Our CorefNQG outperforms the seq2seq baseline of  by a large margin. This shows that the copy mechanism, answer features and coreference resolution all aid question generation. In addition, CorefNQG outperforms both Seq2seq+Copy models significantly, whether or not they have access to the full context. This demonstrates that the coreference knowledge encoded with the gating network explicitly helps with the training and generation: it is more difficult for the neural sequence model to learn the coreference knowledge in a latent way. (See input 1 in Figure A.1 for an example.) Building end-to-end models that take into account coreference knowledge in a latent way is an interesting direction to explore. In the ablation tests, the performance drop of CorefNQG-gating  shows that the gating network is playing an important role for getting refined coreference position feature embedding, which helps the model learn the importance of an antecedent. The performance drop of CorefNQG-mention-pair score shows the mention-pair score introduced from the external system (Clark and Manning, 2016) helps the neural network better encode coreference knowledge.
To better understand the effect of coreference resolution, we also evaluate our model and the baseline models on just that portion of the test set that requires pronoun resolution (36.42% of the examples) and show the results in Table 4. The gaps of performance between our model and the baseline models are still significant. Besides, we see that all three systems' performance drop on this partial test set, which demonstrates the hardness of generating questions for the cases that require pronoun resolution (passage context).
We also show in Table 2 the results of the QG models trained on the training set augmented with noisy examples with predicted answer spans.
There is a consistent but acceptable drop for each model on this new training set, given the inaccuracy of predicted answer spans. We see that CorefNQG still outperforms the baseline models across all metrics. Figure A.1 provides sample output for input sentences that require contextual coreference knowledge. We see that ContextNQG fails in all cases; our model misses only the third example due to an error introduced by coreference resolution -the "city" and "it" are considered coreferent. We can also see that human-generated questions are more natural and varied in form with better paraphrasing.
In Table 3, we show the evaluation results for different answer extraction models. First we see that all variants of BiLSTM models outperform the off-the-shelf NER system (that proposes all NEs as answer spans), though the NER system has a higher recall. The BiLSTM-CRF that encodes the character-level and NER features for each token performs best in terms of F-measure.

Human Study
We hired four native speakers of English to rate the systems' outputs. Detailed guidelines for the raters are listed in the supplementary materials.  Table 5: Human evaluation results for question generation. "Grammaticality", "Making Sense" and "Answerability" are rated on a 1-5 scale (5 for the best, see the supplementary materials for a detailed rating scheme), "Average rank" is rated on a 1-3 scale (1 for the most preferred, ties are allowed.) Two-tailed t-test results are shown for our method compared to ContextNQG (stat. significance is indicated with * (p < 0.05), * * (p < 0.01).) The evaluation can also be seen as a measure of the quality of the generated dataset (Section 6.3). We randomly sampled 11 passages/paragraphs from the test set; there are in total around 70 questionanswer pairs for evaluation. We consider three metrics -"grammaticality", "making sense" and "answerability". The evaluators are asked to first rate the grammatical correctness of the generated question (before being shown the associated input sentence or any other textual context). Next, we ask them to rate the degree to which the question "makes sense" given the input sentence (i.e., without considering the correctness of the answer span). Finally, evaluators rate the "answerability" of the question given the full context. Table 5 shows the results of the human evaluation. Bold indicates top scores. We see that the original human questions are preferred over the two NQG systems' outputs, which is understandable given the examples in Figure A.1. The human-generated questions make more sense and correspond better with the provided answers, particularly when they require information in the preceding context. How exactly to capture the preceding context so as to ask better and more diverse questions is an interesting future direction for research. In terms of grammaticality, however, the neural models do quite well, achieving very close to human performance. In addition, we see that our method (CorefNQG) performs statistically significantly better across all metrics in comparison to the baseline model (ContextNQG), which has access to the entire preceding context in the passage.

The Generated Corpus
Our system generates in total 1,259,691 questionanswer pairs, nearly 126 questions per article.   Figure 4: Example question-answer pairs from our generated corpus.
ure 5 shows the distribution of different types of questions in our dataset vs. the SQuAD training set. We see that the distribution for "In what", "When", "How long", "Who", "Where", "What does" and "What do" questions in the two datasets is similar. Our system generates more "What is", "What was" and "What percentage" questions, while the proportions of "What did", "Why" and "Which" questions in SQuAD are larger than ours. One possible reason is that the "Why", "What did" questions are more complicated to ask (sometimes involving world knowledge) and the answer spans are longer phrases of various types that are harder to identify. "What is" and "What was" questions, on the other hand, are often safer for the neural networks systems to ask.
In Figure 4, we show some examples of the generated question-answer pairs. The answer extractor identifies the answer span boundary well and all three questions correspond to their answers. Q2 is valid but not entirely accurate. For more examples, please refer to our supplementary materials. Table 6 shows the performance of a topperforming system for the SQuAD dataset (Document Reader ) when applied to the development and test set portions of our generated dataset. The system was trained on the training set portion of our dataset. We use the SQuAD evaluation scripts, which calculate exact  match (EM) and F-1 scores. 2 Performance of the neural machine reading model is reasonable. We also train the DocReader on our training set and test the models' performance on the original dev set of SQuAD; for this, the performance is around 45.2% on EM and 56.7% on F-1 metric. DocReader trained on the original SQuAD training set achieves 69.5% EM, 78.8% F-1 indicating that our dataset is more difficult and/or less natural than the crowd-sourced QA pairs of SQuAD.

Conclusion
We propose a new neural network model for better encoding coreference knowledge for paragraphlevel question generation. Evaluations with different metrics on the SQuAD machine reading dataset show that our model outperforms state-ofthe-art baselines. The ablation study shows the effectiveness of different components in our model. Finally, we apply our question generation framework to produce a corpus of 1.26 million questionanswer pairs, which we hope will benefit the QA research community. It would also be interesting to apply our approach to incorporating coreference knowledge to other text generation tasks.