Neural Models for Key Phrase Detection and Question Generation

We propose a two-stage neural model to tackle question generation from documents. Our model first estimates the probability that word sequences in a document compose "interesting" answers using a neural model trained on a question-answering corpus. We thus take a data-driven approach to interestingness. Predicted key phrases then act as target answers that condition a sequence-to-sequence question generation model with a copy mechanism. Empirically, our neural key phrase detection model significantly outperforms an entity-tagging baseline system and existing rule-based approaches. We demonstrate that the question generator formulates good-quality natural language questions from extracted key phrases, and a human study indicates that our system's generated question-answer pairs are competitive with those of an earlier approach. We foresee our system being used in an educational setting to assess reading comprehension and also as a data augmentation technique for semi-supervised learning.


Introduction
Many educational applications can benefit from automatic question generation, including vocabulary assessment Brown, Frishkoff, and Eskenazi (2005), writing support Liu, Calvo, and Rus (2012), and assessment of reading comprehension Mitkov and Ha (2003); Kunichika et al. (2004). Formulating questions that test for certain skills at certain levels requires significant human effort that is difficult to scale, e.g., to massive open online courses (MOOCs). Despite their applications, the majority of existing models for automatic question generation rely on rule-based methods that likewise do not scale well across different domains and/or writing styles. To address this limitation, we propose and compare several neural models for automatic question generation.
We focus specifically on the assessment of reading comprehension. In this domain, question generation typically involves two inter-related components: first, a system to identify interesting entities or events (key phrases) within a passage or document Becker, Basu, and Vanderwende (2012); second, a question generator that constructs questions in natural language that ask specifically about the given key phrases. Key phrases thus act as the "correct" answers for generated questions. This procedure ensures that we can assess a student's performance against a ground-truth target.
We formulate key phrase detection as modeling the probability of potential answers conditioned on a given document, i.e., P(a|d). Inspired by successful work in question answering, we propose a sequence-to-sequence model that generates a set of key-phrase boundaries. This model can flexibly select an arbitrary number of key phrases from a document. To teach it to assign high probability to interesting answers, we train it on human-selected answers from large-scale, crowd-sourced question-answering datasets. We thus take a purely data-driven approach to the concept of interestingness, working from the premise that crowdworkers tend to select entities or events that interest them when they formulate their own comprehension questions. If this premise is correct, then the growing collection of crowd-sourced question-answering datasets Rajpurkar et al. (2016); Trischler et al. (2016) can be harnessed to learn models for key phrases of interest to human readers.
Given a set of extracted key phrases, we approach question generation by modeling the conditional probability of a question given a document-answer pair, i.e., P(q|a, d). For this we use a sequence-to-sequence model with attention Bahdanau, Cho, and Bengio (2014) and the pointer-softmax mechanism Gulcehre et al. (2016). This component is also trained on a QA dataset by maximizing the likelihood of questions in the dataset.
Empirically, our proposed model for key phrase detection outperforms two baseline systems by a significant margin. We support these quantitative findings with qualitative examples of generated question-answer pairs given documents.

Question Generation
Automatic question generation systems are often used to alleviate (or even eliminate) the burden of human generation of questions to assess reading comprehension Mitkov and Ha (2003); Kunichika et al. (2004). Various NLP techniques have been adopted in these systems to improve generation quality, including parsing Heilman and Smith (2010a); Mitkov and Ha (2003), semantic role labeling Lindberg et al. (2013), and the use of lexicographic resources like WordNet Miller (1995); Mitkov and Ha (2003). However, the majority of proposed methods resort to simple rule-based techniques such as slot-filling with templates Lindberg et al. (2013); Chali and Golestanirad (2016); Labutov, Basu, and Vanderwende (2015) or syntactic transformation heuristics Agarwal and Mannem (2011); Ali, Chali, and Hasan (2010) (e.g., subject-auxiliary inversion Heilman and Smith (2010a)). These techniques can be inadequate to capture the diversity of natural language questions.
To address this limitation, end-to-end-trainable neural models have recently been proposed for question generation in both vision Mostafazadeh et al. (2016) and language. For the latter, Du, Shao, and Cardie used a sequence-to-sequence model with an attention mechanism derived from the encoder states. Yuan et al. proposed a similar architecture but in addition improved model performance through policy gradient techniques. Wang, Yuan, and Trischler proposed a generative model that learns jointly to generate questions and answers based on documents.

Key Phrase Detection
A closely related aspect of question generation is identifying which parts of a given document are important or interesting for asking questions. Existing studies formulate key phrase extraction as a two-step process. In the first step, lexical features (e.g., part-of-speech tags) are used to extract a key phrase candidate list of certain types Liu et al. (2011); Wang, Zhao, and Huang (2016); Le, Nguyen, and Shimazu (2016); Yang et al. (2017). In the second step, ranking models are often used to select a key phrase. Medelyan, Frank, and Witten; Lopez and Romary used bagged decision trees, while Lopez and Romary used a Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) to label the candidates in a binary fashion. Mihalcea and Tarau; Wan and Xiao; Le, Nguyen, and Shimazu scored key phrases using PageRank. Heilman and Smith asked crowdworkers to rate the acceptability of computer-generated natural language questions as quiz questions, and Becker, Basu, and Vanderwende solicited quality ratings of text chunks as potential gaps for Cloze-style questions.
These studies are closely related to our proposed work by the common goal of modeling the distribution of key phrases given a document. The major difference is that previous studies begin with a prescribed list of candidates, which might significantly bias the distribution estimate. In contrast, we adopt a dataset that was originally designed for question answering, where crowdworkers presumably tend to pick entities or events that interest them most. We postulate that the resulting distribution, learned directly from data, is more likely to reflect the true importance and appropriateness of answers.
Recently, Meng et al. proposed a generative model for key phrase prediction with an encoder-decoder framework, which is able both to generate words from a vocabulary and to point to words from the document. Their model achieved state-of-the-art results on multiple scientific publication keyword extraction datasets. This model shares similar ideas with our key phrase extractor, i.e., using a single neural model to learn the probabilities that words are key phrases. Yang et al. used a rule-based method to extract potential answers from unlabeled text, generated questions given documents and extracted answers using a pre-trained question generation model, and combined the model-generated questions with human-generated questions for training question answering models. Experiments showed that question answering models can benefit from the augmented data provided by their approach.

Key Phrase Detection
In this section, we describe a simple baseline as well as the two proposed neural models for extracting key phrases (answers) from documents.

Entity Tagging Baseline
Our baseline model (ENT) predicts all entities tagged by spaCy as key phrases. This is motivated by the fact that over 50% of the answers in SQuAD are entities (Table 2 in Rajpurkar et al. (2016)). These include dates (September 1967), numeric entities (3, five), people (William Smith), locations (the British Isles) and other entities (Buddhism).

Neural Entity Selection
The baseline model above naïvely selects all entities as candidate answers. One pitfall is that it exhibits high recall at the expense of precision (Table 1). We first attempt to address this with a neural entity selection model (NES) that selects a subset of entities from a list of candidates. Our neural entity selection model takes a document (i.e., a sequence of words) $D = (w^d_1, \dots, w^d_{n_d})$ and a list of $n_e$ entities as a sequence of (start, end) locations within the document, $E = ((e^{start}_1, e^{end}_1), \dots, (e^{start}_{n_e}, e^{end}_{n_e}))$. The model is then trained on the binary classification task of predicting whether an entity overlaps with any of the gold answers.
Specifically, we maximize $\sum_{i=1}^{n_e} \log P(e_i \mid D)$. We parameterize $P(e_i \mid D)$ using a neural model that first embeds each word $w^d_i$ in the document as a distributed vector using a word embedding lookup table. We then encode the document using a bidirectional Long Short-Term Memory (LSTM) network into annotation vectors $(h^d_1, \dots, h^d_{n_d})$. We then compute $P(e_i \mid D)$ using a three-layer multilayer perceptron (MLP) that takes as input the concatenation of three vectors derived from the annotation vectors $h^d$. During inference, we select the top $k$ entities with the highest likelihood as given by our model. We use $k = 6$ in our experiments, as determined by hyper-parameter search.
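The training targets and the top-k selection step described above can be sketched in a few lines. This is only an illustrative sketch: the function names, the inclusive (start, end) span convention, and the plain list of scores standing in for the model's P(e_i | D) are assumptions made here, not details given in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed convention: inclusive (start, end) token positions


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two inclusive spans share at least one token position."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_labels(entities: List[Span], gold_answers: List[Span]) -> List[int]:
    """Binary targets: 1 if an entity overlaps any gold answer, else 0."""
    return [int(any(overlaps(e, g) for g in gold_answers)) for e in entities]


def select_top_k(entities: List[Span], scores: List[float], k: int = 6) -> List[Span]:
    """Keep the k entities with the highest scores (stand-ins for P(e_i | D))."""
    ranked = sorted(zip(scores, entities), key=lambda pair: pair[0], reverse=True)
    return [entity for _, entity in ranked[:k]]
```

For example, `select_top_k(entities, scores, k=6)` would mirror the inference setting reported above, with `scores` supplied by whatever classifier is used.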

Pointer Networks
While a significant fraction of answers in SQuAD are entities, extracting interesting aspects of a document requires looking beyond entities. Many documents of interest may be entity-less, or sometimes an entity tagger may fail to recognize some of the entities. To this end, we build a neural model that is trained from scratch to extract all of the answer key phrases in a particular document. We parameterize this model as a pointer network Vinyals, Fortunato, and Jaitly (2015) that is trained to point sequentially to the start and end locations of all key phrase answers. As in our entity selection model, we first encode the document into a sequence of annotation vectors $(h^d_1, \dots, h^d_{n_d})$. A decoder LSTM is then trained to point to all of the start and end locations of answers in the document from left to right, conditioned on the annotation vectors, via an attention mechanism. We add a special termination token to the document, which the decoder is trained to attend to when it has generated all key phrases. This provides the flexibility to learn the number of key phrases the model should extract from a particular document. This is in contrast to the work of Meng et al. (2017), where a fixed number of key phrases was generated per document.
A pointer network is an extension of sequence-to-sequence models Sutskever, Vinyals, and Le (2014), where the target sequence consists of positions in the source sequence. We encode the source sequence document into a sequence of annotation vectors $h^d = (h^d_1, \dots, h^d_{n_d})$ using an embedding lookup table followed by a bidirectional LSTM. The decoder also consists of an embedding lookup that is shared with the encoder, followed by a unidirectional LSTM with an attention mechanism. We denote the decoder's annotation vectors as $(h^p_1, h^p_2, \dots, h^p_{2n_a - 1}, h^p_{2n_a})$, where $n_a$ is the number of answer key phrases, and $h^p_1$ and $h^p_2$ correspond to the start and end annotation vectors for the first answer key phrase, and so on. We parameterize the probability of pointing to document position $i$ at decoder step $j$, $P(w^d_i \mid h^p_1, \dots, h^p_j, h^d)$, using the dot product attention mechanism Luong, Pham, and Manning (2015) between the decoder and encoder annotation vectors, $P(w^d_i \mid h^p_1, \dots, h^p_j, h^d) = \mathrm{softmax}(W_1 h^p_j \cdot h^d)_i$, where $W_1$ is an affine transformation matrix. The inputs at each step of the decoder are words from the document that correspond to the start and end locations pointed to by the decoder.
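To make the pointing step concrete, the following sketch computes a distribution over document positions from one decoder annotation vector and the encoder annotation vectors using dot product attention. It is a simplified illustration under stated assumptions: vectors are plain Python lists, the transformation W_1 is applied as a matrix without a bias term, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(W1: Matrix, decoder_state: Vector,
                         encoder_states: List[Vector]) -> Vector:
    """Distribution over document positions for one decoder step.

    Scores are dot products between the transformed decoder annotation
    vector and each encoder annotation vector (bias omitted by assumption).
    """
    query = matvec(W1, decoder_state)
    scores = [sum(q * h for q, h in zip(query, h_d)) for h_d in encoder_states]
    return softmax(scores)
```

In a full model the encoder states would also include the annotation vector of the special termination token, so that the same distribution can signal the end of decoding.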
During inference, we employ a decoding strategy that greedily picks the best location from the softmax vector at every step, then post-process the results to remove duplicate key phrases. Since the output sequence is relatively short, we observed similar performance when using greedy decoding and beam search.
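The greedy decoding and de-duplication procedure can be sketched as follows. For illustration only, the per-step distributions are assumed to be precomputed and passed in as lists, whereas in the actual model each step depends on the previously selected positions; the index of the termination token (`end_index`) and the function name are likewise assumptions.

```python
from typing import List, Tuple


def greedy_decode(step_distributions: List[List[float]],
                  end_index: int) -> List[Tuple[int, int]]:
    """Greedily decode (start, end) key phrase spans from pointer distributions.

    step_distributions: one distribution over positions per decoder step.
    end_index: position of the special termination token (assumed known).
    """
    picks = []
    for dist in step_distributions:
        best = max(range(len(dist)), key=dist.__getitem__)  # argmax position
        if best == end_index:
            break  # the decoder pointed at the termination token
        picks.append(best)

    # Consecutive picks form (start, end) pairs; a dangling start is dropped.
    spans = [(picks[i], picks[i + 1]) for i in range(0, len(picks) - 1, 2)]

    # Post-process: remove duplicate key phrases while preserving order.
    seen = set()
    unique = []
    for span in spans:
        if span not in seen:
            seen.add(span)
            unique.append(span)
    return unique
```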
An input word $w^{\{d,a\}}_i$ is first embedded by concatenating its word and character-level embeddings, $e_i = [e^w_i; e^{ch}_i]$. Character-level information $e^{ch}_i$ is captured with the final states of a BiLSTM on the character sequences of $w_i$. The concatenated embeddings are subsequently encoded with another BiLSTM into annotation vectors $h^d_i$. To take better advantage of the extractive nature of answers in documents, we encode the answer by extracting the document encodings at the answer word positions. Specifically, we encode the hidden states of the document that correspond to the answer phrase with another condition aggregation BiLSTM. We use the final state $h^a$ of this as an encoding of the answer.
The RNN decoder employs the pointer-softmax module Gulcehre et al. (2016). At each step of the question generation process, the decoder decides adaptively whether to (a) generate from a decoder vocabulary or (b) point to a word in the source sequence (and copy over). The pointer-softmax decoder thus has two components: a pointer attention mechanism and a generative decoder.
In the pointing decoder, recurrence is implemented with two cascading LSTM cells $c_1$ and $c_2$ with recurrent states $s_1$ and $s_2$; $y^{(t-1)}$ is the embedding of the decoder output from the previous time step, and $v^{(t)}$ is the context vector defined below.
At each time step $t$, the pointing decoder computes a distribution $\alpha^{(t)}$ over the document word positions (i.e., a document attention, Bahdanau, Cho, and Bengio (2014)). Each element $\alpha^{(t)}_i$ is computed by $f$, a two-layer MLP with tanh and softmax activation, respectively. The context vector $v^{(t)}$ used in the recurrence is the sum of the document encoding weighted by the document attention:
$$v^{(t)} = \sum_{i=1}^{n_d} \alpha^{(t)}_i h^d_i$$
The generative decoder, on the other hand, defines a distribution over a prescribed decoder vocabulary with a two-layer MLP $g$ whose inputs include the answer encoding $h^a$. The switch scalar $s^{(t)}$ at each time step is computed by a three-layer MLP $h$. The first two layers of $h$ use tanh activation and the final layer uses sigmoid. Highway connections are present between the first and the second layer. Finally, the resulting switch is used to interpolate the pointing and the generative probabilities for predicting the next word.
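The final interpolation step can be illustrated with a small sketch. Two assumptions are made here that the text does not fix: the switch value is taken to weight the pointing distribution (with one minus the switch weighting the generative distribution), and the two distributions are merged into a single word-keyed dictionary so that a word reachable by both routes accumulates probability from each. The function and argument names are hypothetical.

```python
from typing import Dict, List


def pointer_softmax_mix(switch: float,
                        pointer_probs: List[float],
                        document_tokens: List[str],
                        vocab_probs: List[float],
                        vocab: List[str]) -> Dict[str, float]:
    """Interpolate pointing and generative probabilities for the next word.

    Assumption: `switch` in [0, 1] weights the pointing branch and
    (1 - switch) weights the generative branch.
    """
    mixed: Dict[str, float] = {}
    # Generative branch: distribution over the decoder vocabulary.
    for word, p in zip(vocab, vocab_probs):
        mixed[word] = mixed.get(word, 0.0) + (1.0 - switch) * p
    # Pointing branch: attention over document positions, copied as words.
    for token, a in zip(document_tokens, pointer_probs):
        mixed[token] = mixed.get(token, 0.0) + switch * a
    return mixed
```

Because both inputs are probability distributions, the mixture also sums to one, so the next word can be chosen by taking the highest-probability entry.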

Dataset
We conduct our experiments on the SQuAD Rajpurkar et al. (2016) and NewsQA Trischler et al. (2016) corpora. Both of these are machine comprehension datasets consisting of over 100k crowd-sourced question-answer pairs. SQuAD contains 536 articles from Wikipedia, while NewsQA was created on 12,744 news articles. Simple preprocessing is performed, including lower-casing and word tokenization using NLTK. Because the test split of SQuAD is hidden from the public, we take 5,158 question-answer pairs (self-contained in 23 Wikipedia articles) from the training set as a validation set, and use the official development data to report test results. We use NewsQA only to evaluate our key phrase detection models in a transfer setting.

Implementation Details
All models were trained with the Adam optimization algorithm using a minibatch size of 32.

Key Phrase Detection
Key phrase detection models used pretrained word embeddings of 300 dimensions, generated using a word2vec extension Ling et al. (2015) trained on the English Gigaword 5 corpus. We used bidirectional LSTMs of 256 dimensions (128 forward and backward) to encode the document and an LSTM of 256 dimensions as our decoder in the pointer network model. A dropout of 0.5 was used at the outputs of every layer in the network.

Question Generation
In question generation, the decoder vocabulary uses the top 2000 words sorted by their frequency in the gold questions in the training data. The word embedding matrix is initialized with the 300-dimensional GloVe vectors Pennington, Socher, and Manning (2014). The dimensionality of the character representations is 32. The number of hidden units is 384 for both of the encoder/decoder RNN cells. Dropout is applied at a rate of 0.3 to all embedding layers as well as between the hidden states in the encoder/decoder RNNs across time steps.
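Building a frequency-based decoder vocabulary of this kind can be sketched as below. The sketch assumes lower-cased, whitespace-tokenized questions purely to stay self-contained (the preprocessing described earlier uses NLTK tokenization), and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def build_decoder_vocab(gold_questions: List[str], size: int = 2000) -> List[str]:
    """Return the `size` most frequent tokens in the training questions.

    Tokenization here is a plain lower-case whitespace split, used only
    as a stand-in for the tokenizer applied to the real data.
    """
    counts = Counter(
        token for question in gold_questions for token in question.lower().split()
    )
    return [word for word, _ in counts.most_common(size)]
```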

Quantitative Evaluation of Key Phrase Extraction
Since each key phrase is itself a multi-word unit, we believe that a naive word-level F1 that considers an entire key phrase as a single unit is not well suited to evaluate these models. We thus propose an extension of the SQuAD F1 evaluation metric (for a single answer span) to multiple spans within a document called multi-span F1 score.
The metric is calculated as follows. Given a predicted phrase $\hat{e}_i$ and a gold phrase $e_j$, we first construct a pairwise, token-level F1 score matrix of elements $f_{i,j}$ between the two phrases $\hat{e}_i$ and $e_j$. Max-pooling along the gold-label axis essentially assesses the precision of each prediction, with partial matches accounted for by the pairwise F1 (identical to the evaluation of a single answer in SQuAD) in the cells: $p_i = \max_j(f_{i,j})$. Analogously, recall for label $e_j$ can be defined by max-pooling along the prediction axis: $r_j = \max_i(f_{i,j})$. The multi-span F1 score is defined from the mean precision $\bar{p} = \mathrm{avg}(p_i)$ and recall $\bar{r} = \mathrm{avg}(r_j)$:
$$\mathrm{F1}_{\text{multi-span}} = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}}$$
Existing evaluations (e.g., that of Meng et al.) can be seen as the above computation performed on the matrix of exact match scores between predicted and gold key phrases. By using token-level F1 scores between phrase pairs, we allow fuzzy matches.
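The metric described above can be sketched directly from its definition. The sketch below is a minimal illustration: phrases are assumed to be whitespace-tokenized strings, the answer normalization applied by the official SQuAD evaluation script (for example, stripping articles and punctuation) is omitted, and empty prediction or gold lists are assumed to score zero.

```python
from collections import Counter
from typing import List


def token_f1(pred_tokens: List[str], gold_tokens: List[str]) -> float:
    """Token-level F1 between one predicted and one gold phrase."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_span_f1(predicted: List[str], gold: List[str]) -> float:
    """Multi-span F1 over all predicted and gold phrases of a document."""
    if not predicted or not gold:
        return 0.0  # assumed convention for empty inputs
    # Pairwise matrix f[i][j] between prediction i and gold phrase j.
    f = [[token_f1(p.split(), g.split()) for g in gold] for p in predicted]
    # Precision: max-pool along the gold axis, then average over predictions.
    mean_p = sum(max(row) for row in f) / len(predicted)
    # Recall: max-pool along the prediction axis, then average over gold phrases.
    mean_r = sum(max(row[j] for row in f) for j in range(len(gold))) / len(gold)
    if mean_p + mean_r == 0:
        return 0.0
    return 2 * mean_p * mean_r / (mean_p + mean_r)
```

Replacing `token_f1` with an exact-match indicator would recover the stricter exact-match style of evaluation mentioned above.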

Human Evaluation of QA pairs
While key phrase extraction has a fairly well defined quantitative evaluation metric, evaluating generated text as in question generation is a harder problem. Instead of using automatic evaluation metrics such as BLEU, ROUGE, METEOR or CIDEr, we performed a human evaluation of our generated questions in conjunction with the answer key phrases.
We used two different evaluation approaches: an ambitious one that compares our generated question-answer pairs to human generated ones from SQuAD and another that compares our model with Heilman and Smith (2010a) (henceforth referred to as H&S).
Comparison to human generated questions - We presented annotators with documents from the SQuAD official development set and two sets of question-answer pairs, one from our model (machine generated) and the other from SQuAD (human generated). Annotators were tasked with identifying which of the question-answer pairs is machine generated. The order in which the question-answer pairs appear in each example was randomized. The annotators were free to use any criterion of their choice to make a distinction, such as poor grammar, the answer phrase not correctly answering the generated question, uninteresting answer phrases, etc.
Implicit comparison to H&S - To compare our system to existing methods (H&S), we use human generated SQuAD question-answer pairs to set up an implicit comparison. Human annotators are presented with a document and two question-answer pairs: one that comes from the SQuAD official development set and another from either our system or H&S (at random). Annotators are not made aware of the fact that there are two different models generating QA pairs. The annotators are once again tasked with identifying which QA pair is "human" generated. We evaluate the accuracy with which annotators can distinguish human and machine when using both of these models.
Comparison to H&S - In a more direct evaluation strategy, we present annotators with documents from the SQuAD official development set, but instead of a human generated question-answer pair, we use one generated by the H&S model and one from ours. We then ask annotators which one they prefer.

Results and Discussion
Our evaluation of the key phrase extraction systems is presented in Table 1. We compare answer phrases extracted by H&S, our baseline entity tagger, the neural entity selection module and the pointer network. As expected, the entity tagging baseline achieved the best recall, likely by overgenerating candidate answers. The NES model, on the other hand, exhibits a much larger advantage in precision and consequently outperforms the entity tagging baseline by notable margins in F1. This trend persists in the comparison between the NES model and the pointer-network model. The H&S model exhibits high recall and lacks precision, similar to the baseline entity tagger. This is not very surprising since the model has not been exposed to the SQuAD answer phrase distribution.
Qualitatively, we observe that the entity-based models have a strong bias towards numeric types, which often fail to capture interesting information in an article.
In addition, we also notice that the entity-based systems tend to select the central topical entity as the answer, which can contradict the distribution of interesting answers selected by humans. For example, given a Wikipedia article on Kenya and the fact "agriculture is the second largest contributor to kenya 's gross domestic product ( gdp )", the entity-based systems propose "kenya" as a key phrase and ask "what country is nigeria 's second largest contributor to ?" Given the same information, the pointer model picked "agriculture" as the answer and asked "what is the second largest contributor to kenya 's gross domestic product ?"
Qualitative results with the question generation and key phrase extraction modules are presented in Table 2 and contrast H&S, our system, and human generated QA pairs from SQuAD.
H&S - Key phrases selected by this model appear to be different from the PtrNet and human generated ones; for example, they may start with prepositions such as "of", "by" and "to" or be very large noun phrases such as "that student motivation and attitudes towards school are closely linked to student-teacher relationships". In addition, their key phrases as seen in Figure 1 (document 1) do not seem "interesting" and appear to contain somewhat arbitrary phrases such as "this theory", "some studies", "a person", etc. Their question generation module appears to produce a few ungrammatical sentences, e.g., "the first time - what was the yuan dynasty that non-native chinese people ruled all of china ?"
Our system - Since our key phrase extraction module was trained on SQuAD, the selected key phrases more closely resemble gold SQuAD answers. However, some of these answers do not answer the questions generated about them, e.g., "eicosanoids and cytokines - what are bacteria produced by ?" (first document in Table 2). Our model is sometimes able to resolve coreferent entities effectively. For example, to generate "the mongol empire - the yuan dynasty is considered to be the continuation of what ?", the model had to resolve the pronoun "it" to "yuan dynasty" in "it is generally considered to be the continuation of the mongol empire" (third document in Table 2).
Comparison to human generated questions - We presented 14 annotators with a total of 740 documents, each containing 2 question-answer pairs. We observed that annotators were able to identify the machine generated question-answer pairs 77.8% of the time with a standard deviation of 8.34%.
Implicit comparison to H&S - We presented 2 annotators with the same 100 documents, 45 of which come from our model and 55 from H&S; all examples are paired with SQuAD gold questions and answers. The first annotator labeled 30 (66.7%) correctly between ours and gold and 45 (81.8%) correctly between H&S and gold, while the second annotator labeled 9 (20%) correctly between ours and gold and 13 (23.6%) correctly between H&S and gold. Neither annotator had substantial prior knowledge of the SQuAD dataset. The experiment shows that both annotators had a harder time distinguishing our generated question-answer pairs from gold than distinguishing H&S pairs from gold.
Comparison to H&S - We presented 2 annotators with the same 200 examples, each of which contains a document with question-answer pairs generated by both our model and the H&S model. The first annotator preferred 107 (53.5%) of the question-answer pairs generated by our model, while the second annotator preferred 90 (45%) from our model. This experiment shows that, without being given the ground-truth question-answer pairs, humans consider both models' outputs to be comparably good.

Conclusion
We proposed a two-stage framework to tackle the problem of question generation from documents. First, we use a question answering corpus to train a neural model to estimate the distribution of key phrases that are interesting to question-asking humans. We proposed two neural models, one that ranks entities proposed by an entity tagging system, and another that points to key-phrase start and end boundaries with a pointer network. When compared to an entity tagging baseline, the proposed models exhibit significantly better results.
We adopt a sequence-to-sequence model to generate questions conditioned on the key phrases selected in the framework's first stage. Our question generator is inspired by an attention-based translation model, and uses the pointer-softmax mechanism to dynamically switch between copying a word from the document and generating a word from a vocabulary. Qualitative examples show that the generated questions exhibit both syntactic fluency and semantic relevance to the conditioning documents and answers, and appear useful for assessing reading comprehension in educational settings.
In future work we will investigate fine-tuning the complete framework end to end. Another interesting direction is to explore abstractive key phrase detection.

Table 2: Qualitative examples of detected key phrases and generated questions. Each Q-A entry is written as "key phrase - question".

Doc.: inflammation is one of the first responses of the immune system to infection . the symptoms of inflammation are redness , swelling , heat , and pain , which are caused by increased blood flow into tissue . inflammation is produced by eicosanoids and cytokines , which are released by injured or infected cells . eicosanoids include prostaglandins that produce fever and the dilation of blood vessels associated with inflammation , and leukotrienes that attract certain white blood cells ( leukocytes ) . . .

Q-A, H&S:
  by eicosanoids and cytokines - who is inflammation produced by ?
  of the first responses of the immune system to infection - what is inflammation one of ?

Q-A, PtrNet:
  leukotrienes - what can attract certain white blood cells ?
  eicosanoids and cytokines - what are bacteria produced by ?

Q-A, Gold SQuAD:
  inflamation - what is one of the first responses the immune system has to infection ?
  eicosanoids and cytokines - what compounds are released by injured or infected cells , triggering inflammation ?

Doc.: research shows that student motivation and attitudes towards school are closely linked to student-teacher relationships . enthusiastic teachers are particularly good at creating beneficial relations with their students . their ability to create effective learning environments that foster student achievement depends on the kind of relationship they build with their students . useful teacher-to-student interactions are crucial in linking academic success with personal achievement . here , personal success is a student 's internal goal of improving himself , whereas academic success includes the goals he receives from his superior . a teacher must guide his student in aligning his personal goals with his academic goals . students who receive this positive influence show stronger self-confidenche and greater personal and academic success than those without these teacher interactions .

Q-A, H&S:
  research - what shows that student motivation and attitudes towards school are closely linked to student-teacher relationships ?
  useful teacher-to-student interactions - what are crucial in linking academic success with personal achievement ?
  to student-teacher relationships - what does research show that student motivation and attitudes towards school are closely linked to ?
  that student motivation and attitudes towards school are closely linked to student-teacher relationships - what does research show to ?

Q-A, PtrNet:
  student-teacher relationships - what are the student motivation and attitudes towards school closely linked to ?
  enthusiastic teachers - who are particularly good at creating beneficial relations with their students ?
  teacher-to-student interactions - what is crucial in linking academic success with personal achievement ?
  a teacher - who must guide his student in aligning his personal goals ?

Q-A, Gold SQuAD:
  student-teacher relationships - what is student motivation about school linked to ?
  beneficial - what type of relationships do enthusiastic teachers cause ?
  aligning his personal goals with his academic goals . - what should a teacher guide a student in ?
  student motivation and attitudes towards school - what is strongly linked to good student-teacher relationships ?

Doc.: the yuan dynasty was the first time that non-native chinese people ruled all of china . in the historiography of mongolia , it is generally considered to be the continuation of the mongol empire . mongols are widely known to worship the eternal heaven . . .

Doc.: on july 31 , 1995 , the walt disney company announced an agreement to merge with capital cities/abc for $ 19 billion . . . . . in 1998 , abc premiered the aaron sorkin-created sitcom sports night , centering on the travails of the staff of a sportscenter-style sports news program ; despite earning critical praise and multiple emmy awards , the series was cancelled in 2000 after two seasons .

Q-A, H&S:
  an agreement to merge with capital cities/abc for $19 billion - what did the walt disney company announce on july 31 , 1995 ?
  the walt disney company - what announced an agreement to merge with capital cities/abc for $19 billion on july 31 , 1995 ?

Q-A, PtrNet:
  2000 - in what year was the aaron sorkin-created sitcom sports night cancelled ?
  walt disney company - who announced an agreement to merge with capital cities/abc for $ 19 billion ?

Q-A, Gold SQuAD:
  july 31 , 1995 - when was the disney and abc merger first announced ?
  sports night - what aaron sorkin created show did abc debut in 1998 ?