Leveraging Context Information for Natural Question Generation

The task of natural question generation is to generate a corresponding question given the input passage (fact) and answer. It is useful for enlarging the training set of QA systems. Previous work has adopted sequence-to-sequence models that take a passage with an additional bit to indicate answer position as input. However, they do not explicitly model the information between answer and other context within the passage. We propose a model that matches the answer with the passage before generating the question. Experiments show that our model outperforms the existing state of the art using rich features.


Introduction
The task of natural question generation (NQG) is to generate a fluent and relevant question given a passage and a target answer. Recently NQG has received increasing attention from both the industrial and academic communities because of its value for improving QA systems by automatically increasing the training data. It can also be used for educational purposes such as language learning (Heilman and Smith, 2010).
One example is shown in Table 1, where a question "when was nikola tesla born ?" is generated given a passage and a target answer "1856". Existing work for NQG uses a sequence-to-sequence model (Sutskever et al., 2014), which takes a passage as input for generating a question. These approaches either entirely ignore the target answer (Du et al., 2017), or directly hard-code answer positions (Tang et al., 2017; Wang et al., 2017a). These methods can neglect rich potential interactions between the passage and the target answer. In addition, they fail when the target answer does not occur in the passage verbatim. In Table 1 the answer "1856" is the year when nikola tesla was born. This can be easily determined by leveraging the contextual information of "10 july 1856 -7 january 1943", while it is relatively hard when only the answer position information is adopted.

Table 1
Passage: nikola tesla ( serbian cyrillic : Nikola Tesla ; 10 july 1856 -7 january 1943 ) was a serbian american inventor , electrical engineer , mechanical engineer , physicist , and futurist best known for his contributions to the design of the modern alternating current ( ac ) electricity supply system .
Question: when was nikola tesla born ?

* Work done during an internship at IBM.
We investigate explicit interaction between the target answer and the passage, so that contextual information can be better considered by the encoder. In particular, matching is used between the target answer and the passage for collecting relevant contextual information. We adopt the multi-perspective context matching (MPCM) algorithm (Wang et al., 2017b), which takes two texts as input and produces a vector of numbers representing their similarity under different perspectives.
Results on SQuAD (Rajpurkar et al., 2016) show that our model gives better BLEU scores than the state of the art. Furthermore, the questions generated by our model help to improve a strong extractive QA system. Our code is available at https://github.com/freesunshine0316/MPQG.

Baseline: sequence-to-sequence
Our baseline is a sequence-to-sequence model (Bahdanau et al., 2015) with the copy mechanism (Gulcehre et al., 2016; Gu et al., 2016). It uses an LSTM encoder to encode a passage and an LSTM decoder to synthesize a question.

Encoder
The encoder is a bi-directional LSTM (Hochreiter and Schmidhuber, 1997), whose input $x_j$ at step $j$ is $[e_j; b_j]$, the concatenation of the current word embedding $e_j$ with an additional bit $b_j$ indicating whether the word belongs to the answer.
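As a minimal sketch of how the baseline encoder input could be assembled, the snippet below appends the answer bit to each word embedding. The function name `encoder_inputs`, the toy embeddings, and the half-open answer span `[answer_start, answer_end)` are illustrative assumptions, not details taken from the released code.

```python
def encoder_inputs(embeddings, answer_start, answer_end):
    """Build one input vector per passage word: the embedding plus an answer bit.

    The answer is assumed to occupy the half-open token span
    [answer_start, answer_end) of the passage (hypothetical convention).
    """
    inputs = []
    for j, e_j in enumerate(embeddings):
        # b_j is 1 when word j lies inside the answer span, else 0.
        b_j = 1.0 if answer_start <= j < answer_end else 0.0
        inputs.append(list(e_j) + [b_j])
    return inputs


# Toy passage of three words with 2-dimensional embeddings;
# the answer is the single word at position 1.
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
x = encoder_inputs(toy_embeddings, answer_start=1, answer_end=2)
```

Note that this representation can only mark answers that occur in the passage, which is the limitation the matching-based encoder is meant to address.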

Decoder with the copy mechanism
The decoder is an attentional LSTM model, with the attention memory $H$ being the concatenation of all encoder states. Each encoder state $h_j$ is the concatenation of two bi-directional LSTM states:

$$h_j = [\overleftarrow{h}_j; \overrightarrow{h}_j], \qquad H = [h_1; \ldots; h_N],$$

where $N$ is the number of encoder states. At each step $t$, the decoder state $s_t$ and context vector $c_t$ are generated from the previous decoder state $s_{t-1}$, context vector $c_{t-1}$ and output $x_{t-1}$ in the same way as Bahdanau et al. (2015). The output distribution over a vocabulary is calculated via:

$$P_{vocab} = \mathrm{softmax}(V_1 [s_t; c_t] + b_1),$$

where $V_1$ and $b_1$ are model parameters, and the number of rows in $V_1$ is the size of the vocabulary.

Since many passage words also appear in the question, we adopt the copy mechanism (Gulcehre et al., 2016; Gu et al., 2016), which integrates the attention over input words into the final vocabulary distribution. The probability distribution is defined as the interpolation:

$$P_{final} = g_t P_{vocab} + (1 - g_t) P_{attn},$$

where $g_t$ is the switch that controls whether a word is generated from the vocabulary or copied directly from the passage. $P_{vocab}$ is the vocabulary probability distribution as defined above, and $P_{attn}$ is calculated from the current attention distribution by merging the probabilities of duplicated words. Finally, $g_t$ is defined as:

$$g_t = \sigma(w_c^\top c_t + w_s^\top s_t + w_x^\top x_t + b_2),$$

where the vectors $w_c$, $w_s$, $w_x$ and the scalar $b_2$ are model parameters.
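The interpolation performed by the copy mechanism can be illustrated with a small sketch. It assumes that the switch weights the vocabulary distribution and that its complement weights the copy distribution; the function name `final_distribution`, the dictionary representation, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def final_distribution(p_vocab, attention, passage_words, g_t):
    """Interpolate a vocabulary distribution with a copy distribution.

    p_vocab:       dict mapping word -> generation probability
    attention:     attention weights, one per passage position
    passage_words: the passage tokens, aligned with `attention`
    g_t:           switch in [0, 1]; assumed to weight generation
    """
    # Merge attention mass of duplicated passage words into one entry.
    p_attn = defaultdict(float)
    for word, weight in zip(passage_words, attention):
        p_attn[word] += weight

    words = set(p_vocab) | set(p_attn)
    return {
        w: g_t * p_vocab.get(w, 0.0) + (1.0 - g_t) * p_attn.get(w, 0.0)
        for w in words
    }


p_vocab = {"when": 0.5, "was": 0.3, "born": 0.2}
passage_words = ["tesla", "was", "tesla"]
attention = [0.5, 0.2, 0.3]
p_final = final_distribution(p_vocab, attention, passage_words, g_t=0.6)
```

In this toy case "tesla" is absent from the vocabulary distribution but still receives probability through copying, because its two attention weights are merged before interpolation.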

Method
Our model follows the baseline encoder-decoder framework. The encoder reads a passage $P = (p_1, \ldots, p_N)$ and an answer $A = (a_1, \ldots, a_M)$; the decoder generates a question $Q = (q_1, \ldots, q_L)$ word by word.

Multi-perspective encoder
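Different from the baseline, we first encode both the passage and the answer by using two separate bi-directional LSTMs:

$$\overleftarrow{h}^p_j = \mathrm{LSTM}(\overleftarrow{h}^p_{j+1}, e^p_j), \qquad \overrightarrow{h}^p_j = \mathrm{LSTM}(\overrightarrow{h}^p_{j-1}, e^p_j),$$
$$\overleftarrow{h}^a_i = \mathrm{LSTM}(\overleftarrow{h}^a_{i+1}, e^a_i), \qquad \overrightarrow{h}^a_i = \mathrm{LSTM}(\overrightarrow{h}^a_{i-1}, e^a_i),$$

where $e^p_j$ and $e^a_i$ are the embeddings of the $j$-th passage word and the $i$-th answer word, and each hidden state $h^p_j$ or $h^a_i$ concatenates the two directions as in the baseline.

We use the multi-perspective context matching algorithm (Wang et al., 2017b) on top of the BiLSTM outputs, matching each hidden state $h^p_j$ of the passage against all hidden states $h^a_1 \ldots h^a_M$ of the answer. The goal is to detect whether each passage word belongs to the relevant context of the answer. As shown in Figure 1, we adopt three strategies to match the passage with the answer, each investigating different sources of information.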
Full-matching considers the last hidden state of the answer, which encodes all words and the word order. Attentive-matching synthesizes a vector by computing a weighted sum of all answer states against the passage state, then compares the vector with the passage state. It also considers all words in the answer but without word order. Finally, max-attentive-matching only considers the most relevant answer state to the passage state.
Multi-perspective matching. These strategies require a function $f_m$ to match two vectors $v_1$ and $v_2$. It is defined as:

$$m = f_m(v_1, v_2; W),$$

where $W$ is a tunable weight matrix. Each row $W_k \in W$ represents the weights associated with one perspective, and the similarity according to that perspective is defined as:

$$m_k = \cos(W_k \odot v_1, W_k \odot v_2),$$

where $\odot$ is the element-wise multiplication operation. So $f_m(v_1, v_2; W)$ represents the matching results between $v_1$ and $v_2$ from all perspectives. Intuitively, each perspective calculates the cosine similarity between two reweighted input vectors, using a weight vector trained to highlight different dimensions of the input vectors. Each perspective can thus be regarded as considering a different part of the semantics captured in the vectors.
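The matching function and the three strategies can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the released implementation: the helper names other than `f_m` are hypothetical, attentive-matching weights are taken to be a softmax over cosine similarities (the text does not spell out the weighting), and full-matching simply uses the last answer state without treating the two LSTM directions separately.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def f_m(v1, v2, W):
    """One cosine similarity per perspective, i.e. per row W_k of W."""
    return [
        cosine([w * a for w, a in zip(W_k, v1)],
               [w * b for w, b in zip(W_k, v2)])
        for W_k in W
    ]


def full_matching(h_p, answer_states, W):
    # Compare the passage state with the last answer state only.
    return f_m(h_p, answer_states[-1], W)


def attentive_matching(h_p, answer_states, W):
    # Assumed weighting: softmax over cosine similarities to h_p.
    scores = [math.exp(cosine(h_p, h_a)) for h_a in answer_states]
    total = sum(scores)
    summary = [
        sum(s / total * h_a[d] for s, h_a in zip(scores, answer_states))
        for d in range(len(h_p))
    ]
    return f_m(h_p, summary, W)


def max_attentive_matching(h_p, answer_states, W):
    # Keep only the answer state most similar to the passage state.
    best = max(answer_states, key=lambda h_a: cosine(h_p, h_a))
    return f_m(h_p, best, W)


def matching_vector(h_p, answer_states, W):
    """Concatenate the results of all three strategies for one passage word."""
    return (full_matching(h_p, answer_states, W)
            + attentive_matching(h_p, answer_states, W)
            + max_attentive_matching(h_p, answer_states, W))


# Two perspectives over 2-dimensional states (toy values).
W = [[1.0, 1.0], [1.0, 0.0]]
h_p = [1.0, 0.0]
answer_states = [[0.0, 1.0], [1.0, 0.0]]
m_j = matching_vector(h_p, answer_states, W)
```

With two perspectives and three strategies, the resulting vector has six entries. In the toy example the second perspective masks out the second dimension, so it scores a perfect match wherever the first dimensions agree in sign.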
The final matching vector $m_j$ for the $j$-th word in the passage is the concatenation of the matching results of all three strategies. We employ another BiLSTM layer on top of the matching layer:

$$\overleftarrow{h}^m_j = \mathrm{LSTM}(\overleftarrow{h}^m_{j+1}, m_j), \qquad \overrightarrow{h}^m_j = \mathrm{LSTM}(\overrightarrow{h}^m_{j-1}, m_j).$$

Comparison with the baseline. The encoder states ($h_j$) of the baseline contain only the answer position information in addition to the passage content. The matching states ($h^m_j$) of our model include the matching information of all passage words, and potentially contain the answer position information. The rich matching information can guide the decoder to generate more accurate questions.

Decoder with the copy mechanism
The decoder is identical to the one described in Section 2.2, except that matching information is added to the attention memory:

$$H = [[h^p_1; h^m_1]; \ldots; [h^p_N; h^m_N]].$$

The attention memory contains not only the passage content, but also the matching information, which helps generate more accurate questions.

Experiments
Following existing work (Du et al., 2017), experiments are conducted on the publicly accessible part of SQuAD (Rajpurkar et al., 2016). The dataset contains 536 articles and over 100k questions, and around 10% is held out by the organizers for fair evaluation.

Settings
We evaluate our model both on the quality of its generated questions against gold questions and on the effectiveness of those questions in improving an extractive QA system. Since Du et al. (2017) and  conducted their experiments using different training/dev/test splits, we conduct experiments on both splits, and compare with their reported performance. For improving an extractive QA system, we use the data split of Du et al. (2017), and conduct experiments in low-resource settings, where only 10%, 20%, or 50% of the human-labeled questions in the training data are available. We choose  as the extractive QA system.
Both the baseline and our model are trained with cross-entropy loss. Greedy search is adopted for generating questions.
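Greedy search picks the most probable word at every decoding step and stops at an end-of-sequence symbol. The sketch below shows only this control flow; `step_fn`, the `<s>` and `</s>` symbols, the `max_len` cap, and the scripted toy distributions are hypothetical stand-ins for a trained decoder.

```python
def greedy_decode(step_fn, start_state, bos="<s>", eos="</s>", max_len=30):
    """Generate a word sequence by taking the argmax at every step.

    step_fn(prev_word, state) must return (distribution, next_state),
    where distribution is a dict mapping word -> probability.
    """
    state, prev, output = start_state, bos, []
    for _ in range(max_len):
        distribution, state = step_fn(prev, state)
        prev = max(distribution, key=distribution.get)
        if prev == eos:
            break
        output.append(prev)
    return output


# A scripted toy "decoder": the state is just the step index.
script = [
    {"when": 0.7, "was": 0.3},
    {"was": 0.6, "born": 0.4},
    {"</s>": 0.9, "born": 0.1},
]


def toy_step(prev_word, state):
    return script[state], state + 1


question = greedy_decode(toy_step, start_state=0)
```

Unlike beam search, this procedure keeps a single hypothesis, so an early low-quality choice cannot be revised later.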

Development experiments
Matching strategies. In Table 2, we analyze the importance of each matching strategy by performing an ablation experiment on the development set according to the data split of Du et al. (2017). Performance decreases when any of the three matching strategies is removed, which suggests that the three strategies are complementary. In addition, w/o max-attentive-matching shows the smallest performance decrease. One likely reason is that max-attentive-matching considers only the most similar hidden state of the answer, while the other two consider all hidden states. Finally, w/o full-matching shows a larger performance decrease than w/o attentive-matching. A reason may be that full-matching captures word order information, while attentive-matching does not.
Number of perspectives. Figure 2 shows how performance changes with different numbers of perspectives. There is a large performance improvement when increasing the number from 1 to 3, and the gain becomes small when further increasing the number from 3 to 5. This shows that our multi-perspective matching algorithm is effective, as we do not need a large number of perspectives to reach our reported performance.

Results
In Table 3, we compare our model with the previous state of the art: S2S-ans (Du et al., 2017) and S2S+cp+f. Both methods use the sequence-to-sequence model. S2S-ans encodes only the passage and does not use answer position information. S2S+cp+f uses both answer position and rich features (NE and POS tags) by concatenating their embeddings with the word embedding on the encoder side (Peng et al., 2016), and adopts the copy mechanism for its decoder. S2S+cp is our sequence-to-sequence baseline with the copy mechanism, and M2S+cp is our model, which further uses the multi-perspective encoder.
M2S+cp outperforms S2S+cp on both data splits, showing that modeling contextual information is helpful for generating better questions. In addition, only taking word embedding features, M2S+cp shows better performance than S2S+cp+f. Both multi-perspective matching and rich features play a similar role of leveraging more information than the answer position information. However, M2S+cp can be applied to low-resource languages and domains, where there is not sufficient labeled data for training the taggers for generating rich features. M2S+cp is also free from feature engineering, which is necessary for S2S+cp+f on new domains.
Finally, unlike S2S-ans, S2S+cp+f and S2S+cp, M2S+cp can be useful when the answer is not explicitly contained in the passage, as it matches the target answer against the passage rather than using the answer position information.

Example Output
Table 4 shows example outputs of M2S+cp and S2S+cp. For the first case, M2S+cp recognizes that "1856" is the year when "nikola tesla" was born, while S2S+cp fails to. The matching algorithm of M2S+cp gives high matching numbers for the phrase "10 july 1856 -7 january 1943" with "1856" having the highest matching number, while S2S+cp only highlights "1856". Simply highlighting "1856" can be ambiguous, while recognizing a pattern "day month year -day month year" with the first year being the answer is more definite. It is similar in the second case, where M2S+cp recognizes "zhèng meaning right", which fits into the pattern "A meaning B" with B being the answer.
The third example is a case where the answer is not explicitly contained in the passage. M2S+cp generates a precise question, even though the answer "in illinois" does not appear in the passage. On the other hand, S2S+cp fails in this case, as the answer position information cannot be obtained from the input.

Table 4
Passage: nikola tesla ( serbian cyrillic : Nikola Tesla ; 10 july 1856 -7 january 1943 ) was a serbian american inventor , electrical engineer , mechanical engineer , physicist , and futurist best known for his contributions to the design of the modern alternating current ( ac ) electricity supply system .
Reference: when was nikola tesla born ?
S2S+cp: when was nikola tesla 's inventor ?
M2S+cp: when was nikola tesla born ?
Passage: zhèng ( chinese : 正 ) meaning " right " , " just " , or " true " , would have received the mongolian adjectival modifiers , creating " jenggis " , which in medieval romanization would be written " genghis "

Question generation for extractive QA
... extractive QA model, while S2S+cp and M2S+cp use all training data, adopting the model-generated questions if the gold question is not available. For evaluation metrics, the F1 score treats the prediction and the ground-truth answer as bags of tokens and computes their F1 score; Exact Match measures the percentage of predictions that match the ground-truth answer exactly (Rajpurkar et al., 2016).
M2S+cp is consistently better than S2S+cp under both F1 score and Exact Match, showing that contextual information helps to generate more accurate questions. Moreover, with 10% gold data, adding the questions automatically generated by M2S+cp yields better performance than using only 20% gold data, and is 11 points better than using only 10% gold data.
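The two evaluation metrics described above can be sketched as follows. The sketch tokenizes on whitespace only; the official SQuAD evaluation script additionally normalizes answers (for example lowercasing and stripping punctuation and articles), which is omitted here for brevity. The function names are illustrative.

```python
from collections import Counter


def f1_score(prediction, ground_truth):
    """Token-level F1, treating both answers as bags of tokens."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most
    # as often as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, ground_truth):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == ground_truth.split())


partial = f1_score("10 july 1856", "1856")
exact = exact_match("1856", "1856")
```

Here the prediction "10 july 1856" contains the gold answer but adds two extra tokens, so precision is 1/3, recall is 1, and F1 is 0.5, while Exact Match for that pair would be 0.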

Conclusion
We demonstrated that natural question generation can benefit from contextual information. Leveraging a multi-perspective matching algorithm, our model outperforms the existing state of the art, and our automatically generated questions help to improve a strong extractive QA system.