Improving Question Generation With to the Point Context

Question generation (QG) is the task of generating a question from a reference sentence and a specified answer within the sentence. A major challenge in QG is to identify answer-relevant context words to finish the declarative-to-interrogative sentence transformation. Existing sequence-to-sequence neural models achieve this goal by proximity-based answer position encoding, under the intuition that neighboring words of answers are highly likely to be answer-relevant. However, such intuition may not apply to all cases, especially for sentences with complex answer-relevant relations. Consequently, the performance of these models drops sharply when the relative distance between the answer fragment and other non-stop sentence words that also appear in the ground truth question increases. To address this issue, we propose a method to jointly model the unstructured sentence and the structured answer-relevant relation (extracted from the sentence in advance) for question generation. Specifically, the structured answer-relevant relation acts as the to the point context and thus naturally helps keep the generated question to the point, while the unstructured sentence provides the full information. Extensive experiments show that to the point context helps our question generation model achieve significant improvements on several automatic evaluation metrics. Furthermore, our model is capable of generating diverse questions for a sentence that conveys multiple relations of its answer fragment.


Introduction
Question Generation (QG) is the task of automatically creating questions from a range of inputs, such as natural language text (Heilman and Smith, 2010), knowledge base (Serban et al., 2016) and image (Mostafazadeh et al., 2016). QG is an increasingly important area in NLP with various application scenarios such as intelligent tutoring systems, open-domain chatbots and question answering dataset construction. In this paper, we focus on question generation from reading comprehension materials like SQuAD (Rajpurkar et al., 2016). As shown in Figure 1, given a sentence in the reading comprehension paragraph and the text fragment (i.e., the answer) that we want to ask about, we aim to generate a question that is asked about the specified answer.
* These two authors contributed equally.
Figure 1:
Sentence: The daily mean temperature in January, the area's coldest month, is 32.6 °F (0.3 °C); however, temperatures usually drop to 10 °F (-12 °C) several times per winter and reach 50 °F (10 °C) several days each winter month.
Reference Question: What is New York City 's daily January mean temperature in degrees celsius ?
Baseline Prediction: What is the coldest temperature in Celsius ?
Structured Answer-relevant Relation: (The daily mean temperature in January; is; 32.6 °F (0.3 °C))
Question generation for reading comprehension was first formalized as a declarative-to-interrogative sentence transformation problem with predefined rules or templates (Mitkov and Ha, 2003; Heilman and Smith, 2010). With the rise of neural models, Du et al. (2017) propose to model this task under the sequence-to-sequence (Seq2Seq) learning framework with attention mechanism (Luong et al., 2015). However, question generation is a one-to-many sequence generation problem, i.e., several aspects can be asked given a sentence. Zhou et al. (2017) propose the answer-aware question generation setting which assumes the answer, a contiguous span inside the input sentence, is already known before question generation. To capture answer-relevant words in the sentence, they adopt a BIO tagging scheme to incorporate the answer position embedding in Seq2Seq learning. Furthermore, Sun et al. (2018) propose that tokens close to the answer fragments are more likely to be answer-relevant. Therefore, they explicitly encode the relative distance between sentence words and the answer via position embedding and position-aware attention.
Table 1: Performance for the average relative distance between the answer fragment and other non-stop sentence words that also appear in the ground truth question. (Bn: BLEU-n, MET: METEOR, R-L: ROUGE-L)
Although existing proximity-based answer-aware approaches achieve reasonable performance, we argue that such intuition may not apply to all cases, especially for sentences with complex structure. Figure 1 shows an example where those approaches fail. This sentence contains a few facts and, due to the parenthetical phrase (i.e., "the area's coldest month"), some facts intertwine: "The daily mean temperature in January is 0.3 °C" and "January is the area's coldest month". From the question generated by a proximity-based answer-aware baseline, we find that it wrongly uses the word "coldest" but misses the correct word "mean" because "coldest" has a shorter distance to the answer "0.3 °C".
In summary, their intuition that "the neighboring words of the answer are more likely to be answer-relevant and have a higher chance to be used in the question" is not reliable. To quantitatively show this drawback of these models, we implement the approach proposed by Sun et al. (2018) and analyze its performance under different relative distances between the answer and other non-stop sentence words that also appear in the ground truth question. The results are shown in Table 1. We find that the performance drops by up to 36% when the relative distance increases from "0 ∼ 10" to "> 10". In other words, when the useful context is located far away from the answer, current proximity-based answer-aware approaches become less effective, since they overly emphasize neighboring words of the answer.
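To make the distance analysis concrete, the following is a minimal sketch of how such an average relative distance could be computed for one example. The exact definition used for Table 1 is not spelled out in the text, so the choices here are assumptions: distance is counted in tokens to the nearest answer boundary, the stop-word list is a small hypothetical one, and the function name `average_relative_distance` is ours.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = frozenset({"the", "a", "an", "in", "of", "is", "what", "to", "and", "'s"})


def average_relative_distance(sentence, question, answer_start, answer_end,
                              stop_words=STOP_WORDS):
    """Average token distance between the answer span and the non-stop
    sentence words that also appear in the question.

    sentence, question: lists of tokens.
    The answer is the span sentence[answer_start:answer_end].
    Returns None when no such overlapping word exists.
    """
    # Non-stop question words, lowercased for matching.
    question_vocab = {tok.lower() for tok in question} - stop_words
    distances = []
    for i, tok in enumerate(sentence):
        if answer_start <= i < answer_end:
            continue  # skip the answer tokens themselves
        if tok.lower() not in question_vocab:
            continue  # stop word, or a word the question does not use
        if i < answer_start:
            distances.append(answer_start - i)
        else:
            distances.append(i - (answer_end - 1))
    if not distances:
        return None
    return sum(distances) / len(distances)
```

Under this definition, an example falls into the "0 ∼ 10" or "> 10" group depending on the returned value.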
To address this issue, we extract the structured answer-relevant relations from sentences and propose a method to jointly model such structured relation and the unstructured sentence for question generation. The structured answer-relevant relation is likely to be to the point context and thus can help keep the generated question to the point. For example, Figure 1 shows that our framework can extract the right answer-relevant relation ("The daily mean temperature in January", "is", "32.6 °F (0.3 °C)") among multiple facts. With the help of such structured information, our model is less likely to be confused by sentences with a complex structure. Specifically, we first extract multiple relations with an off-the-shelf Open Information Extraction (OpenIE) toolbox (Saha and Mausam, 2018), then we select the relation that is most relevant to the answer with carefully designed heuristic rules.
Nevertheless, it is challenging to train a model to effectively utilize both the unstructured sentence and the structured answer-relevant relation because both of them could be noisy: the unstructured sentence may contain multiple facts which are irrelevant to the target question, while the limitations of the OpenIE tool may lead to less accurate extracted relations. To exploit their advantages simultaneously and avoid the drawbacks, we design a gated attention mechanism and a dual copy mechanism based on the encoder-decoder framework, where the former learns to control the information flow between the unstructured and structured inputs, while the latter learns to copy words from two sources to maintain the informativeness and faithfulness of generated questions.
In the evaluations on the SQuAD dataset, our system achieves significant and consistent improvement compared to all baseline methods. In particular, we demonstrate that the improvement is more significant with a larger relative distance between the answer and other non-stop sentence words that also appear in the ground truth question. Furthermore, our model is capable of generating diverse questions for a single sentence-answer pair where the sentence conveys multiple relations of its answer fragment.

Framework Description
In this section, we first introduce the task definition and our protocol to extract structured answer-relevant relations. Then we formalize the task under the encoder-decoder framework with gated attention and dual copy mechanism.

Problem Definition
We formalize our task as an answer-aware Question Generation (QG) problem (Zhao et al., 2018), which assumes answer phrases are given before generating questions. Moreover, answer phrases are shown as text fragments in passages. Formally, given the sentence $S$, the answer $A$, and the answer-relevant relation $M$, the task of QG aims to find the best question $\overline{Q}$ such that
$$\overline{Q} = \arg\max_{Q} P(Q \mid S, A, M), \qquad (1)$$
where $A$ is a contiguous span inside $S$.

Answer-relevant Relation Extraction
We utilize an off-the-shelf toolbox of OpenIE to derive the structured answer-relevant relations from sentences as to the point contexts. Relations extracted by OpenIE can be represented either in a triple format or in an n-ary format with several secondary arguments, and we employ the latter to keep the extractions as informative as possible and avoid extracting too many similar relations in different granularities from one sentence. We join all arguments in the extracted n-ary relation into a sequence as our to the point context. Figure 2 shows n-ary relations extracted from OpenIE. As we can see, OpenIE extracts multiple relations for complex sentences. Here we select the most informative relation according to three criteria in the order of descending importance: (1) having the maximal number of overlapped tokens between the answer and the relation; (2) being assigned the highest confidence score by OpenIE; (3) containing the maximum number of non-stop words. As shown in Figure 2, our criteria can select answer-relevant relations (waved in Figure 2), which is especially useful for sentences with extraneous information. In the rare cases where OpenIE cannot extract any relation, we treat the sentence itself as the to the point context.
Table 2 (notes): Overlapped words are those non-stop tokens co-occurring in the source (sentence/relation) and the target question. Copy ratio means the proportion of source tokens that are used in the question.
Table 2 shows some statistics to verify the intuition that the extracted relations can serve as more to the point context. We find that the tokens in relations are 61% more likely to be used in the target question than the tokens in sentences, and thus they are more to the point. On the other hand, on average the sentences contain one more question token than the relations (2.87 vs. 1.86). Therefore, it is still necessary to take the original sentence into account to generate a more accurate question.
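The three selection criteria above can be read as a lexicographic ranking. The sketch below illustrates that reading only: the `Relation` container, the stop-word list, and the way overlap is counted (distinct lowercased answer tokens found in the relation) are assumptions for illustration and do not reflect the output format of any particular OpenIE toolbox.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = frozenset({"the", "a", "an", "in", "of", "is", "to", "and", "'s"})


@dataclass
class Relation:
    """Hypothetical container for one extracted n-ary relation."""
    arguments: tuple   # argument strings, e.g. (subject, predicate, object, ...)
    confidence: float  # extractor confidence score

    def tokens(self):
        # Join all arguments into one whitespace-tokenized sequence.
        return " ".join(self.arguments).split()


def select_relation(relations, answer_tokens, sentence_tokens,
                    stop_words=STOP_WORDS):
    """Return the token sequence used as the to the point context."""
    if not relations:
        # Fallback described in the text: use the sentence itself.
        return list(sentence_tokens)
    answer_set = {tok.lower() for tok in answer_tokens}

    def rank(relation):
        toks = [tok.lower() for tok in relation.tokens()]
        overlap = len(answer_set & set(toks))                    # criterion 1
        non_stop = sum(1 for tok in toks if tok not in stop_words)  # criterion 3
        # Tuples compare element by element, giving descending importance.
        return (overlap, relation.confidence, non_stop)

    return max(relations, key=rank).tokens()
```

Ranking with a tuple keeps the "descending importance" order explicit: confidence only matters among relations with equal answer overlap, and the non-stop word count only breaks remaining ties.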

Our Proposed Model
Overview. As shown in Figure 3, our framework consists of four components: (1) Sentence Encoder and Relation Encoder, (2) Decoder, (3) Gated Attention Mechanism and (4) Dual Copy Mechanism. The sentence encoder and relation encoder encode the unstructured sentence and the structured answer-relevant relation, respectively. To select and combine the source information from the two encoders, a gated attention mechanism is employed to jointly attend both contextualized information sources, and a dual copy mechanism copies words from either the sentence or the relation.
Answer-aware Encoder. We employ two encoders to integrate information from the unstructured sentence $S$ and the answer-relevant relation $M$ separately. The sentence encoder takes in feature-enriched embeddings including word embeddings $w$, linguistic embeddings $l$ and answer position embeddings $a$. We follow Zhou et al. (2017) to transform POS and NER tags into continuous representations ($l^p$ and $l^n$) and adopt a BIO labelling scheme to derive the answer position embedding (B: the first token of the answer, I: tokens within the answer fragment except the first one, O: tokens outside of the answer fragment). For each word $w_i$ in the sentence $S$, we simply concatenate all features as input:
$$x^s_i = [w_i; l^p_i; l^n_i; a_i].$$
Here $[a; b]$ denotes the concatenation of vectors $a$ and $b$.
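As a small illustration of the BIO labelling scheme described above, the sketch below derives the answer position tags for a tokenized sentence. The function name `bio_tags` and the half-open span convention are assumptions for illustration; in the model each tag would then be mapped to a learned embedding.

```python
def bio_tags(sentence_tokens, answer_start, answer_end):
    """Tag each token relative to the answer span
    sentence_tokens[answer_start:answer_end].

    B: first token of the answer
    I: remaining tokens inside the answer
    O: tokens outside the answer
    """
    tags = []
    for i in range(len(sentence_tokens)):
        if i == answer_start and answer_start < answer_end:
            tags.append("B")
        elif answer_start < i < answer_end:
            tags.append("I")
        else:
            tags.append("O")
    return tags
```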
We use bidirectional LSTMs to encode the sentence $(x^s_1, x^s_2, ..., x^s_n)$ to get a contextualized representation for each token:
$$\overrightarrow{h}^s_i = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}^s_{i-1}, x^s_i), \qquad \overleftarrow{h}^s_i = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}^s_{i+1}, x^s_i),$$
where $\overrightarrow{h}^s_i$ and $\overleftarrow{h}^s_i$ are the hidden states at the $i$-th time step of the forward and the backward LSTMs. The output state of the sentence encoder is the concatenation of forward and backward hidden states: $h^s_i = [\overrightarrow{h}^s_i; \overleftarrow{h}^s_i]$. The contextualized representation of the sentence is $(h^s_1, h^s_2, ..., h^s_n)$.
For the relation encoder, we first join all items in the n-ary relation $M$ into a sequence. Then we only take the answer position embedding as an extra feature for the sequence: $x^m_i = [w_i; a_i]$. Similarly, we use another bidirectional LSTM to encode the relation sequence and derive the corresponding contextualized representation $(h^m_1, h^m_2, ..., h^m_n)$.
Decoder. We use an LSTM as the decoder to generate the question. The decoder predicts the word probability distribution at each decoding timestep. At the $t$-th timestep, it reads the word embedding $w_t$ and the hidden state $u_{t-1}$ of the previous timestep to generate the current hidden state:
$$u_t = \mathrm{LSTM}(u_{t-1}, w_t). \qquad (2)$$
Gated Attention Mechanism. We design a gated attention mechanism to jointly attend the sentence representation and the relation representation.
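For the sentence representation $(h^s_1, h^s_2, ..., h^s_n)$, we employ the attention mechanism of Luong et al. (2015) to obtain the sentence context vector $c^s_t$:
$$\alpha^s_{t,i} = \frac{\exp(u_t^\top W_a h^s_i)}{\sum_j \exp(u_t^\top W_a h^s_j)}, \qquad c^s_t = \sum_i \alpha^s_{t,i} h^s_i,$$
where $W_a$ is a trainable weight. Similarly, we obtain the vector $c^m_t$ from the relation representation $(h^m_1, h^m_2, ..., h^m_n)$. To jointly model the sentence and the relation, a gating mechanism is designed to control the information flow from the two sources:
$$g_t = \mathrm{sigmoid}(W_g [c^s_t; c^m_t]), \qquad c_t = g_t \odot c^s_t + (1 - g_t) \odot c^m_t, \qquad \tilde{h}_t = \tanh(W_h [u_t; c_t]),$$
where $\odot$ represents the element-wise product and $W_g$, $W_h$ are trainable weights. Finally, the predicted probability distribution over the vocabulary $V$ is computed as:
$$P_V = \mathrm{softmax}(W_V \tilde{h}_t + b_V),$$
where $W_V$ and $b_V$ are parameters.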
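The gated attention step can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: attention uses a bilinear score between the decoder state and each encoder state with the weight `W_a`, the gate is assumed to be a sigmoid over the concatenated context vectors, and all helper names are ours. Vectors are Python lists and matrices are lists of rows.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W v.
    return [dot(row, v) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(u_t, states, W_a):
    """Bilinear attention: score_i = u_t^T W_a h_i.
    Returns the attention weights and the context vector."""
    scores = [dot(u_t, matvec(W_a, h)) for h in states]
    alpha = softmax(scores)
    dim = len(states[0])
    context = [sum(a * h[k] for a, h in zip(alpha, states)) for k in range(dim)]
    return alpha, context


def gated_context(c_s, c_m, W_g):
    """Mix sentence and relation contexts with an element-wise gate
    (assumed form: g = sigmoid(W_g [c_s; c_m]))."""
    gate = [sigmoid(z) for z in matvec(W_g, c_s + c_m)]  # list + is concatenation
    return [g * s + (1.0 - g) * m for g, s, m in zip(gate, c_s, c_m)]
```

With an all-zero `W_g` the gate is 0.5 everywhere, so the mixed context is the plain average of the two context vectors; training moves the gate toward whichever source is more useful at each decoding step.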
Dual Copy Mechanism. To deal with rare and unknown words, the decoder applies the pointing method (See et al., 2017; Gu et al., 2016) to allow copying a token from the input sentence at the $t$-th decoding step. We reuse the attention scores $\alpha^s_t$ and $\alpha^m_t$ to derive the copy probability over the two source inputs:
$$P_S(w) = \sum_{i: w_i = w} \alpha^s_{t,i}, \qquad P_M(w) = \sum_{i: w_i = w} \alpha^m_{t,i}.$$
Different from the standard pointing method, we design a dual copy mechanism to copy from two sources with two gates. The first gate determines whether to copy a token from the two input sources or to generate the next word from $P_V$, and is computed as $g^v_t = \mathrm{sigmoid}(w^v_g \tilde{h}_t + b^v_g)$. The second gate selects the source (sentence or relation) to copy from, and is computed as $g^c_t = \mathrm{sigmoid}(w^c_g [c^s_t; c^m_t] + b^c_g)$. Finally, we combine all probabilities $P_V$, $P_S$ and $P_M$ through the two soft gates $g^v_t$ and $g^c_t$. The probability of predicting $w$ as the $t$-th token of the question is:
$$P(w) = (1 - g^v_t) P_V(w) + g^v_t g^c_t P_S(w) + g^v_t (1 - g^c_t) P_M(w).$$
Training and Inference. Given the answer $A$, sentence $S$ and relation $M$, the training objective is to minimize the negative log-likelihood with regard to all parameters:
$$\mathcal{L} = -\sum_{Q \in \{Q\}} \log P(Q \mid A, S, M; \theta),$$
where $\{Q\}$ is the set of all training instances, $\theta$ denotes model parameters and $\log P(Q \mid A, S, M; \theta)$ is the conditional log-likelihood of $Q$.
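The way the two gates of the dual copy mechanism combine the three distributions can be sketched as follows. The orientation of the gates is an assumption made for illustration: `g_v` is taken as the probability of copying rather than generating, and `g_c` as the probability of copying from the sentence rather than the relation. Distributions are plain dictionaries and the function names are ours.

```python
def copy_distribution(attention, source_tokens):
    """Sum attention weights over all positions holding the same token."""
    probs = {}
    for weight, token in zip(attention, source_tokens):
        probs[token] = probs.get(token, 0.0) + weight
    return probs


def dual_copy_mixture(p_vocab, alpha_s, sentence_tokens,
                      alpha_m, relation_tokens, g_v, g_c):
    """Combine generation and the two copy distributions.

    p_vocab: dict word -> generation probability
    alpha_s, alpha_m: attention weights over sentence / relation tokens
    g_v: assumed probability of copying (vs. generating)
    g_c: assumed probability of copying from the sentence (vs. the relation)
    """
    p_s = copy_distribution(alpha_s, sentence_tokens)
    p_m = copy_distribution(alpha_m, relation_tokens)
    words = set(p_vocab) | set(p_s) | set(p_m)
    return {
        w: (1.0 - g_v) * p_vocab.get(w, 0.0)
           + g_v * g_c * p_s.get(w, 0.0)
           + g_v * (1.0 - g_c) * p_m.get(w, 0.0)
        for w in words
    }
```

Because each of the three input distributions sums to one and the gate weights sum to one, the mixture is itself a valid distribution. A word attended to in both the sentence and the relation receives copy mass from both sources.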
In testing, our model aims to generate a question $\overline{Q}$ by maximizing:
$$\overline{Q} = \arg\max_{Q} \log P(Q \mid A, S, M; \theta).$$

Experimental Setting

Dataset & Metrics
We conduct experiments on the SQuAD dataset (Rajpurkar et al., 2016). It contains 536 Wikipedia articles and 100k crowd-sourced question-answer pairs. The questions are written by crowd-workers and the answers are spans of tokens in the articles. We employ two different data splits, following Zhou et al. (2017) and Du et al. (2017). In Zhou et al. (2017), the original SQuAD development set is evenly divided into dev and test sets, while Du et al. (2017) treat the SQuAD development set as the test set and split the SQuAD training set into a training set (90%) and a development set (10%). We also filter out questions which do not have any overlapped non-stop words with the corresponding sentences and perform some preprocessing steps, such as tokenization and sentence splitting. The data statistics are given in Table 3.
We evaluate with all commonly-used metrics in question generation (Du et al., 2017): BLEU-1 (B1), BLEU-2 (B2), BLEU-3 (B3), BLEU-4 (B4) (Papineni et al., 2002), METEOR (MET) (Denkowski and Lavie, 2014) and ROUGE-L (R-L) (Lin, 2004). We use the evaluation script released by Chen et al. (2015).

Baseline Models
We compare with the following models.
• s2s (Du et al., 2017) proposes an attention-based sequence-to-sequence neural network for question generation.
• NQG++ (Zhou et al., 2017) takes the answer position feature and linguistic features into consideration and equips the Seq2Seq model with copy mechanism.
• M2S+cp (Song et al., 2018) conducts multi-perspective matching between the answer and the sentence to derive an answer-aware sentence representation for question generation.
• s2s+MP+GSA (Zhao et al., 2018) introduces a gated self-attention into the encoder and a maxout pointer mechanism into the decoder. We report their sentence-level results for a fair comparison.
• Hybrid (Sun et al., 2018) is a hybrid model which considers the answer embedding for the question word generation and the position of context words for modeling the relative distance between the context words and the answer.
• ASs2s (Kim et al., 2019) replaces the answer in the sentence with a special token to avoid its appearance in the generated questions.

Implementation Details
We take the most frequent 20k words as our vocabulary and use the GloVe word embeddings (Pennington et al., 2014) for initialization. The embedding dimensions for POS, NER, and answer position are set to 20. We use two-layer LSTMs in both the encoder and the decoder, and the LSTM hidden unit size is set to 600. We use dropout (Srivastava et al., 2014) with probability p = 0.3. All trainable parameters, except word embeddings, are randomly initialized with the Xavier uniform in (−0.1, 0.1) (Glorot and Bengio, 2010). For optimization, we use SGD with a minibatch size of 64 and an initial learning rate of 1.0. We train the model for 15 epochs and start halving the learning rate after the 8th epoch. We set the gradient norm upper bound to 3 during training.
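The optimization settings above can be sketched as two small helpers. The text does not say how often the learning rate is halved after the 8th epoch, so halving once per epoch is an assumption here, as are the function names. Gradient clipping is shown as rescaling by the global L2 norm, which is the common reading of a gradient norm upper bound.

```python
import math


def learning_rate(epoch, initial_lr=1.0, decay_start=8, factor=0.5):
    """Learning rate used during a given 1-indexed epoch.
    Assumption: the rate is halved once per epoch after `decay_start`."""
    if epoch <= decay_start:
        return initial_lr
    return initial_lr * factor ** (epoch - decay_start)


def clip_gradient(grad, max_norm=3.0):
    """Rescale a flat gradient vector so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# Learning rates for the 15 training epochs under the stated assumption.
schedule = [learning_rate(epoch) for epoch in range(1, 16)]
```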
We adopt teacher forcing for training. In testing, we select the model with the lowest perplexity, and beam search with size 3 is employed for generating questions. All hyperparameters and models are selected on the validation dataset.

Main Results
Table 4 shows automatic evaluation results for our model and baselines (copied from their papers). Our proposed model, which combines structured answer-relevant relations and unstructured sentences, achieves significant improvements over proximity-based answer-aware models (Zhou et al., 2017; Sun et al., 2018) on both dataset splits. Presumably, our structured answer-relevant relation is a generalization of the context explored by the proximity-based methods because they can only capture short dependencies around answer fragments while our extractions can capture both short and long dependencies given the answer fragments. Moreover, our proposed framework is a general one to jointly leverage structured relations and unstructured sentences. All compared baseline models which only consider unstructured sentences can be further enhanced under our framework.
Table 5: Performance for the average relative distance between the answer fragment and other non-stop sentence words that also appear in the ground truth question (BLEU is the average over BLEU-1 to BLEU-4). Values in parentheses are the improvement percentage of Our Model over Hybrid. (a) is based on all sentences while (b) only considers long sentences with more than 20 words.
Recall that existing proximity-based answer-aware models perform poorly when the distance between the answer fragment and other non-stop sentence words that also appear in the ground truth question is large (Table 1). Here we investigate whether our proposed model using the structured answer-relevant relations can alleviate this issue, by conducting experiments for our model under the same setting as in Table 1. The performance broken down by relative distance is shown in Table 5a. We find that our proposed model outperforms Hybrid (our re-implemented version for this experiment) on all ranges of relative distances, which shows that the structured answer-relevant relations can capture both short and long term answer-relevant dependencies of the answer in sentences. Furthermore, comparing the performance difference between Hybrid and our model, we find the improvements become more significant when the distance increases from "0 ∼ 10" to "> 10". One reason is that our model can extract relations with distant dependencies to the answer, which greatly helps our model ignore the extraneous information. Proximity-based answer-aware models may overly emphasize the neighboring words of answers and become less effective as the useful context becomes further away from the answer in complex sentences. In fact, the breakdown intervals in Table 5a naturally bound the sentence length: for example, the sentences in the "> 10" group must be longer than 10 tokens. Thus, the length variances in these two intervals could be significant. To further validate whether our model can extract long term dependency words, we rerun the analysis of Table 5a only for long sentences (length > 20) of each interval. The improvement percentages over Hybrid are shown in Table 5b, which become more significant when the distance increases from "0 ∼ 10" to "> 10".
Sentence: Beyoncé received critical acclaim and commercial success, selling one million digital copies worldwide in six days; The New York Times noted the album 's unconventional, unexpected release as significant.
Reference Question: How many digital copies of her fifth album did Beyoncé sell in six days?
Baseline Prediction: How many digital copies did the New York Times sell in six days ?
Structured Answer-relevant Relation: (Beyoncé; received commercial success selling; one million digital copies worldwide; in six days)
Our Model Prediction: How many digital copies did Beyoncé sell in six days ?
Sentence: The daily mean temperature in January, the area's coldest month, is 32.6 °F (0.3 °C); however, temperatures usually drop to 10 °F (-12 °C) several times per winter and reach 50 °F (10 °C) several days each winter month.
Reference Question: What is New York City 's daily January mean temperature in degrees celsius ?
Baseline Prediction: What is the coldest temperature in Celsius ?
Structured Answer-relevant Relation: (The daily mean temperature in January; is; 32.6 °F (0.3 °C))
Our Model Prediction: In degrees Celsius , what is the average temperature in January ?
Figure 4: Example questions (with answers highlighted) generated by crowd-workers (ground truth questions), the baseline model and our model.

Case Study
Figure 4 provides example questions generated by crowd-workers (ground truth questions), the baseline Hybrid (Sun et al., 2018), and our model. In the first case, there are two subsequences in the input and the answer has no relation with the second subsequence. However, we see that the baseline model prediction copies the irrelevant words "The New York Times" while our model can avoid using the extraneous subsequence "The New York Times noted ..." with the help of the structured answer-relevant relation. Compared with the ground truth question, our model cannot capture cross-sentence information like "her fifth album", where the techniques in paragraph-level QG models (Zhao et al., 2018) may help. In the second case, as discussed in the Introduction, the sentence contains a few facts and some facts intertwine. We find that our model can capture distant answer-relevant dependencies such as "mean temperature" while the proximity-based baseline model wrongly uses neighboring words of the answer like "coldest" in the generated question.

Diverse Question Generation
Another interesting observation is that for the same answer-sentence pair, our model can generate diverse questions by taking different answer-relevant relations as input. Such capability improves the interpretability of our model because the model is given not only what is to be asked (i.e., the answer) but also the related fact (i.e., the answer-relevant relation) to be covered in the question. In contrast, proximity-based answer-aware models can only generate one question given the sentence-answer pair, regardless of how many answer-relevant relations there are in the sentence. We think such capability can also validate our motivation: questions should be generated according to the answer-relevant relations instead of neighboring words of answer fragments. Figure 5 shows two examples of diverse question generation. In the first case, the answer fragment 'Hugh L. Dryden' is the appositive to 'NASA Deputy Administrator' but the subject of the following tokens 'announced the Apollo program ...'. Our framework can extract these two answer-relevant relations, and by feeding them to our model separately, we obtain two questions asking about different relations with regard to the answer.

Related Work
The topic of question generation, initially motivated for educational purposes, is tackled by designing many complex rules for specific question types (Mitkov and Ha, 2003;Rus et al., 2010). Heilman and Smith (2010) improve rule-based question generation by introducing a statistical ranking model. First, they remove extraneous information in the sentence to transform it into a simpler one, which can be transformed easily into a succinct question with predefined sets of general rules. Then they adopt an overgenerate-and-rank approach to select the best candidate considering several features.
With the rise of dominant neural sequence-to-sequence learning models, Du et al. (2017) frame question generation as a sequence-to-sequence learning problem. Compared with rule-based approaches, neural models (Yuan et al., 2017) can generate more fluent and grammatical questions. However, question generation is a one-to-many sequence generation problem, i.e., several aspects can be asked given a sentence, which confuses the model during training and prevents concrete automatic evaluation. To tackle this issue, Zhou et al. (2017) propose the answer-aware question generation setting which assumes the answer is already known and acts as a contiguous span inside the input sentence. They adopt a BIO tagging scheme to incorporate the answer position information as learned embedding features in Seq2Seq learning. Song et al. (2018) explicitly model the information between answer and sentence with a multi-perspective matching model. Kim et al. (2019) also focus on the answer information and propose an answer-separated Seq2Seq model by masking the answer with special tokens. All answer-aware neural models treat question generation as a one-to-one mapping problem, but existing models perform poorly for sentences with a complex structure (as shown in Table 1).
Our work is inspired by the process of extraneous information removal in Heilman and Smith (2010) and Cao et al. (2018). Different from Heilman and Smith (2010), who directly use the simplified sentence for generation, and Cao et al. (2018), who only consider aggregating two sources of information via gated attention in summarization, we propose to combine the structured answer-relevant relation and the original sentence. Factoid question generation from structured text is initially investigated by Serban et al. (2016), but our focus here is leveraging structured inputs to help question generation over unstructured sentences. Our proposed model can take advantage of unstructured sentences and structured answer-relevant relations to maintain informativeness and faithfulness of generated questions. The proposed model can also be generalized to other conditional sequence generation tasks which require multiple sources of inputs, e.g., distractor generation for multiple choice questions (Gao et al., 2019b).

Conclusions and Future Work
In this paper, we propose a question generation system which combines unstructured sentences and structured answer-relevant relations for generation. The unstructured sentences maintain the informativeness of generated questions while structured answer-relevant relations keep the faithfulness of questions. Extensive experiments demonstrate that our proposed model achieves state-of-the-art performance across several metrics. Furthermore, our model can generate diverse questions with different structured answer-relevant relations. For future work, there are some interesting dimensions to explore, such as difficulty levels (Gao et al., 2019a), paragraph-level information (Zhao et al., 2018) and conversational question generation (Gao et al., 2019c).