Identifying Where to Focus in Reading Comprehension for Neural Question Generation

A first step in the task of automatically generating questions for testing reading comprehension is to identify question-worthy sentences, i.e. sentences in a text passage that humans find it worthwhile to ask questions about. We propose a hierarchical neural sentence-level sequence tagging model for this task, which existing approaches to question generation have ignored. The approach is fully data-driven — with no sophisticated NLP pipelines or any hand-crafted rules/features — and compares favorably to a number of baselines when evaluated on the SQuAD data set. When incorporated into an existing neural question generation system, the resulting end-to-end system achieves state-of-the-art performance for paragraph-level question generation for reading comprehension.


Introduction and Related Work
Automatically generating questions for testing reading comprehension is a challenging task (Mannem et al., 2010;Rus et al., 2010). First and foremost, the question generation system must determine which concepts in the associated text passage are important, i.e. are worth asking a question about.
The little previous work that exists in this area currently circumvents this critical step in passagelevel question generation by assuming that such sentences have already been identified. In particular, prior work focuses almost exclusively on sentence-level question generation: given a text passage, assume that all sentences contain a question-worthy concept and generate one or more questions for each (Heilman and Smith, 2010;Du et al., 2017;Zhou et al., 2017).
In contrast, we study the task of passage-level question generation (QG). Inspired by the large body of research in text summarization on identifying sentences that contain "summary-worthy" content (e.g. Mihalcea (2005), Berg-Kirkpatrick et al. (2011), ), we develop a method to identify the question-worthy sentences in each paragraph of a reading comprehension passage. Inspired further by the success of neural sequence models for many natural language processing tasks (e.g. named entity recognition (Collobert et al., 2011), sentiment classification (Socher et al., 2013), machine translation (Sutskever et al., 2014), dependency parsing (Chen and Manning, 2014)), including very recently document-level text summarization (Cheng and Lapata, 2016), we propose a hierarchical neural sentence-level sequence tagging model for question-worthy sentence identification.
We employ the SQuAD reading comprehension data set (Rajpurkar et al., 2016) for evaluation and show that our sentence selection approach compares favorably to a number of baselines including the feature-rich sentence selection model of Cheng and Lapata (2016) proposed in the context of extract-based summarization, and the convolutional neural network model of Kim (2014) that achieves state-of-the-art results on a variety of sentence classification tasks.
We also incorporate our sentence selection component into the neural question generation system of Du et al. (2017) and show, again using SQuAD, that our resulting end-to-end system achieves state-of-the-art performance for the challenging task of paragraph-level question generation for reading comprehension.

Problem Formulation
In this section, we define the tasks of important (i.e. question-worthy) sentence selection and sentence-level question generation (QG). Our full paragraph-level QG system includes both of these components. For the sentence selection task, given a paragraph D consisting of a sequence of sentences {s 1 , ..., s m }, we aim to select a subset of k question-worthy sentences (k < m). The goal is defined as finding y = {y 1 , ..., y m }, such that, where log P (y|D) is the conditional loglikelihood of the label sequence y; and y i = 1 means sentence i is question-worthy (contains at least one answer), otherwise y i = 0.
For sentence-level QG, the goal is to find the best word sequence z (a question of arbitrary length) that maximizes the conditional likelihood given the input sentence x and satisfies: where P 2 (z|x) is modeled with a global attention mechanism (Section 3).

Model
Important Sentence Selection Our general idea for the hierarchical neural network architecture is illustrated in Figure 1. First, we perform the encoding using sum operation or convolu-tion+maximum pooling operation (Kim, 2014;dos Santos and Zadrozny, 2014) over the word vectors comprising each sentence in the input paragraph. For simplicity and consistency, we denote the sentence encoding process as ENC. Given the t th sentence x = {x 1 , ..., x n } in the paragraph, we have its encoding: Then we use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the paragraph, Sentence Encoder Figure 1: Hierarchical neural network architecture for sentence-level sequence labeling. The input is a paragraph consisting of sentences, whose encoded representation is fed into each hidden unit.
We use the concatenation of the two, namely, , as the hidden state h t at time stamp t, and feed it to the upper layers to get the probability distribution of y t (∈ {0, 1}), where MLP is multi-layer neural network and tanh is the activation function.
Question Generation Similar to Du et al. (2017), we implement the sentence-level question generator with an attention-based sequence-to-sequence learning framework (Sutskever et al., 2014;Bahdanau et al., 2015), to map a sentence in the reading comprehension article to natural questions. It consists of an LSTM encoder and decoder. The encoder is a bi-directional LSTM network; it encodes the input sentence x into a sequence of hidden states q 1 , q 2 , ..., q |x| .  (Kim, 2014) 68 The decoder is another LSTM that uses global attention over the encoder hidden states. The entire encoder-decoder structure learns the probability of generating a question given a sentence, as indicated by equation 2. To be more specific, where W s , W t are parameter matrices; h t is the hidden state of the decoder LSTM; and c t is the context vector created dynamically by the encoder LSTM -the weighted sum of the hidden states computed for the source sentence: The attention weights a i,t are calculated via a bilinear scoring function and softmax normalization: Apart from the bilinear score, alternative options for computing the attention can also be used (e.g. dot product). Readers can refer to Luong et al. (2015) for more details.
During inference, beam search is used to predict the question. The decoded UNK token at time step t, is replaced with the token in the input sentence with the highest attention score, the index of which is arg max i a i,t .
Henceforth, we will refer to our sentence-level Neural Question Generation system as NQG.
Note that generating answer-specific questions would be easy for this architecture -we can append answer location features to the vectors of tokens in the sentence. To better mimic the real life case (where questions are generated with no prior knowledge of the desired answers), we do not use such location features in our experiments.

Dataset and Implementation Details
We use the SQuAD dataset (Rajpurkar et al., 2016) for training and evaluation for both important sentence selection and sentence-level NQG. The dataset contains 536 curated Wikipedia articles with over 100k questions posed about the articles. The authors employ Amazon Mechanical Turk crowd-workers to generate questions based on the article paragraphs and to annotate the corresponding answer spans in the text. Later, to make the evaluation of the dataset more robust, other crowd-workers are employed to provide additional answers to the questions.
We split the public portion of the dataset into training (∼80%), validation (∼10%) and test (∼10%) sets at the paragraph level. For the sentence selection task, we treat sentences that contain at least one answer span (question-worthy sentences) as positive examples (y = 1); all remaining sentences are considered negative (y = 0). Not surprisingly, the training set is unbalanced: 52332 (∼60%) sentences contain answers, while 29693 sentences do not. Because of the variabil-   Elkan and Noto, 2008) to augment the set of known summaryworthy sentences. In contrast, we adopt a conservative approach rather than predict too many sentences as being question-worthy: we pair up source sentences with their corresponding questions, and use just these sentence-question pairs to training the encoder-decoder model. We use the glove.840B.300d pre-trained embeddings (Pennington et al., 2014) for initialization of the embedding layer for our sentence selection model and the full NQG model. glove.6B.100d embeddings are used for calculating sentence similarity feature of the baseline linear model (LREG). Tokens outside the vocabulary list are replaced by the UNK symbol. Hyperparameters for all models are tuned on the validation set and results are reported on the test set.

Sentence Selection Results
We compare to a number of baselines. The Random baseline assigns a random label to each sentence. The Majority baseline assumes that all sentences are question-worthy. The convolutional neural networks (CNN) sentence classification model (Kim, 2014) has similar structure to our CNN sentence encoder, but the classification is done only at the sentence-level rather than jointly at paragraph-level. LREG w/ BOW is the logistic regression model with bag-of-words features. LREG w/ para.-level is the feature-rich LREG model designed by Cheng and Lapata (2016); the features include: sentence length, position of sentence, number of named entities in the sentence, number of sentences in the paragraph, sentence-tosentence cohesion, and sentence-to-paragraph relevance. Sentence-to-sentence cohesion is obtained a a a a a a Table 3: For a source sentence in SQuAD, given the prediction from the sentence selection system and the corresponding NQG output, we provide conservative and liberal evaluations.
by calculating the embedding space similarity between it and every other sentence in the paragraph (similar for sentence-to-paragraph relevance). In document summarization, graph-based extractive summarization models (e.g. TGRAPH Parveen et al. (2015) and URANK Wan (2010)) focus on global optimization and extract sentences contributing to topical coherent summaries. Because this does not really fit our task -a summaryworthy sentence might not necessarily contain enough information for generating a good question -we do not include these as comparisons.
Results are displayed in Table 1. Our models with sum or CNN as the sentence encoder significantly outperform the feature-rich LREG as well as the other baselines in terms of F-measure.

Evaluation of the full QG system
To evaluate the full systems for paragraph-level QG, we introduce in Table 3 the "conservative" and "liberal" evaluation strategies. Given an input source sentence, there will be in total four possibilities: if both the gold standard data and prediction include the sentence, then we use its n-gram matching score (by BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014)); if neither the gold data nor prediction include the sentence, then the sentence is discarded from the evaluation; if the gold data includes the sentence while the prediction does not, we assign a score of 0 for it; and if gold data does not include the sentence while prediction does, the generated question gets a 0 for conservative, while it gets full Wikipedia paragraph: arnold schwarzenegger has been involved with the special olympics for many years after they were founded by his ex-mother-in-law , eunice kennedy shriver . after they were founded by his ex-mother-in-law , eunice kennedy shriver . in 2007 , schwarzenegger was the official spokesperson for the special olympics which were held in shanghai , china .
Our questions: Q1: who founded the special olympics ? Q2: who was the official adviser for the special olympics ? Q3: when was the inner city games foundation founded ? Q4: how many schools does icg have ? Gold questions: Q1: schwarzenegger was the spokesperson for the special olympic games held in what city in china ? Q2: what nonprofit did schwarzenegger found in 1995 ? Q3: about how many schools across the country is icg active in ? Figure 2: Sample output from our full NQG system, the four questions correspond to the four highlighted sentences in the paragraph in the same order. Darkness indicates sentence importance, the score for deciding the darkness is obtained from the softmax results. Wave-lined sentences bear label y = 1, and 0 otherwise. The three gold questions also correspond to the wave-lined sentences in the same order. Please refer to the appendix for sample output on more Wikipedia articles.
score for liberal evaluation. Table 2 shows that the QG system incorporating our best performing sentence extractor outperforms its LREG counterpart across metrics. Note that to calculate the score for the matching case, similar to our earlier work (Du et al., 2017), we adapt the image captioning evaluation scripts of Chen et al. (2015) since there can be several gold standard questions for a single input sentence.
In Figure 2, we provide questions generated by the full NQG system (Q1-4) and according to the gold standard (Q1-3) for the selected Wikipedia paragraph. The sentences they were drawn from are shown with wavy lines (gold standard) and via highlighting (our system). Darkness of the highlighting is proportional to the softmax score provided by the sentence extractor.

Conclusion
In this work we introduced the task of identifying important sentences -good sentences to ask a question about -in the reading comprehension setting. We proposed a hierarchical neural sentence labeling model and investigated encoding sentences with sum and convolution operations. The question generation system that uses our sentence selection model consistently outperforms previous approaches and achieves state-ofthe-art paragraph-level question generation performance on the SQUAD data set.
In future work, we would like to investigate approaches to identify question-worth concepts rather than question-worthy sentences. It would also be interesting to see if the generated questions can be used to help improve question answering systems.