Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Question Generation (QG) is a Natural Language Processing (NLP) task that aids advances in Question Answering (QA) and conversational assistants. Existing models focus on generating a question based on a text and possibly the answer to the generated question. They need to determine the type of interrogative word to be generated while having to pay attention to the grammar and vocabulary of the question. In this work, we propose Interrogative-Word-Aware Question Generation (IWAQG), a pipelined system composed of two modules: an interrogative word classifier and a QG model. The first module predicts the interrogative word that is provided to the second module to create the question. Owing to an increased recall of deciding the interrogative words to be used for the generated questions, the proposed model achieves new state-of-the-art results on the task of QG in SQuAD, improving from 46.58 to 47.69 in BLEU-1, 17.55 to 18.53 in BLEU-4, 21.24 to 22.33 in METEOR, and from 44.53 to 46.94 in ROUGE-L.


Introduction
Question Generation (QG) is the task of creating questions about a text in natural language. This is an important task for Question Answering (QA) since it can help create QA datasets. It is also useful for conversational systems like Amazon Alexa. Due to the surge of interest in these systems, QG is also drawing the attention of the research community. One of the reasons for the fast advances in QA capabilities is the creation of large datasets like SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017). Since the creation of such datasets is either costly if done manually or prone to error if done automatically, reliable and meaningful QG can play a key role in the advances of QA (Lewis et al., 2019).
QG is a difficult task because it requires understanding the text to ask about and generating a natural question that is adequate for the given text. We consider that this task has two aspects: what to ask and how to ask. The first one refers to the information about the entity that we want to ask about; this includes the interrogative word to use and the topic of the question. On the other hand, how to ask refers to the creation of a natural language question that is grammatically correct and semantically precise. Most of the current approaches utilize sequence-to-sequence models, composed of an encoder model that first transforms a passage into a vector and a decoder model that, given this vector, generates a question about the passage (Liu et al., 2019; Sun et al., 2018; Zhao et al., 2018; Pan et al., 2019).
There are different settings for QG. Subramanian et al. (2018) assume that only a passage is given and attempt to find candidate key phrases that represent the core of the questions to be created. Zhao et al. (2018) follow an answer-aware setting, where the input is a passage and the answer to the question to create. We assume this setting and consider that the answer is a span of the passage, as in SQuAD. Following this approach, the decoder of the sequence-to-sequence model has to learn to generate both the interrogative word (i.e., wh-word) and the rest of the question simultaneously.
The main claim of our work is that separating the two tasks (i.e., interrogative-word classification and question generation) can lead to better performance. We posit that the interrogative word must be predicted by a well-trained classifier. We consider that selecting the right interrogative word is the key to generating high-quality questions. For example, a question with a wrong interrogative word for the answer "the owner" is: "what produces a list of requirements for a project?". However, with the right interrogative word, who, the question would be: "who produces a list of requirements for a project?", which is clearly more adequate for the answer than the first one. According to our claim, the independent classification model can improve the recall of interrogative words of a QG model because 1) the interrogative-word classification task is easier to solve than generating the interrogative word along with the full question in the QG model and 2) the QG model would be able to generate the interrogative word easily by using the copy mechanism, which can copy parts of the input of the encoder. With these hypotheses, we propose Interrogative-Word-Aware Question Generation (IWAQG), a pipelined system composed of two modules: an interrogative-word classifier that predicts the interrogative word and a QG model that generates a question conditioned on the predicted interrogative word. Figure 1 shows a high-level overview of our approach.
The proposed model achieves new state-of-the-art results on the task of QG in SQuAD, improving from 46.58 to 47.69 in BLEU-1, 17.55 to 18.53 in BLEU-4, 21.24 to 22.33 in METEOR, and from 44.53 to 46.94 in ROUGE-L.

Related Work
The Question Generation (QG) problem has been approached in two ways. One is based on heuristics, templates, and syntactic rules (Heilman and Smith, 2010; Mazidi and Nielsen, 2014; Labutov et al., 2015). This type of approach requires heavy human effort, so it does not scale well. The other approach is based on neural networks and is becoming popular due to the recent progress of deep learning in NLP (Pan et al., 2019). Du et al. (2017) were the first to propose a sequence-to-sequence model to tackle the QG problem and outperformed the previous state-of-the-art model in human and automatic evaluations. Sun et al. (2018) proposed an approach similar to ours, an answer-aware sequence-to-sequence model with a special decoding mode in charge of only the interrogative word. However, we propose to predict the interrogative word before the encoding stage, so that the decoder can focus more on the rest of the question rather than on the interrogative word. Besides, they cannot train the interrogative-word classifier using golden labels because it is learned implicitly inside the decoder. Duan et al. (2017) proposed, in a similar way to us, a pipeline approach. First, the authors create a long list of question templates like "who is author of" and "who is wife of". Then, when generating the question, they first select the question template and then fill it in. To select the question template, they proposed two approaches: a retrieval-based question pattern prediction and a generation-based question pattern prediction. The first one is computationally expensive when the question pattern size is large. The second one yields better results, but it is a generative approach, and we argue that modeling the interrogative-word prediction as a classification task is easier and can lead to better results. As far as we know, we are the first to propose an explicit interrogative-word classifier that provides the interrogative word to the question generator.

Problem Statement
Given a passage P and an answer A, we want to find a question Q whose answer is A. We assume that P is a paragraph composed of a list of words and that the answer is a subspan of P.
We model this problem with a pipelined approach. First, given P and A, we predict the interrogative word I_w, and then we input P, A, and I_w into the QG module. The overall architecture of our model is shown in Figure 2.
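The two-stage formulation can be written compactly as follows. This is only one plausible formalization, given here as an assumption: the hat notation, the symbol I for a candidate interrogative word, and the use of conditional probabilities are illustrative choices, not notation taken from the text.

```latex
% Stage 1: pick the interrogative word from the passage and the answer.
\hat{I}_w = \arg\max_{I} \Pr(I \mid P, A)

% Stage 2: generate the question conditioned on the predicted word.
\hat{Q} = \arg\max_{Q} \Pr(Q \mid P, A, \hat{I}_w)
```

Under this reading, an error in the first stage propagates to the second, which is why the accuracy of the classifier matters for the whole pipeline.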

Interrogative-Word Classifier
As discussed in Section 5.2, any model can be used to predict interrogative words if its accuracy is high enough. Our interrogative-word classifier is based on BERT, a state-of-the-art model in many NLP tasks that can successfully utilize the context to grasp the semantics of the words inside a sentence (Devlin et al., 2018). We input a passage that contains the answer of the question we want to build and add the special token [ANS] to let BERT know that the answer span has a special meaning and must be treated differently from the rest of the passage. As required by BERT, the first token of the input is the special token [CLS], and the last is [SEP]. The [CLS] token embedding was originally designed for classification tasks. In our case, to classify interrogative words, it learns how to represent the context and the answer information.
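The input format described above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokens instead of BERT word pieces; the function name `build_classifier_input` and the half-open span convention are hypothetical, and the [ANS] marker is placed on both sides of the answer span, as stated in the ablation study.

```python
from typing import List


def build_classifier_input(passage_tokens: List[str], ans_start: int, ans_end: int) -> List[str]:
    """Mark the answer span [ans_start, ans_end) with [ANS] and add [CLS]/[SEP]."""
    if not 0 <= ans_start < ans_end <= len(passage_tokens):
        raise ValueError("the answer span must lie inside the passage")
    return (
        ["[CLS]"]
        + passage_tokens[:ans_start]
        + ["[ANS]"]
        + passage_tokens[ans_start:ans_end]
        + ["[ANS]"]
        + passage_tokens[ans_end:]
        + ["[SEP]"]
    )


# Toy passage (hypothetical); the answer is the span "the owner".
tokens = "the owner produces a list of requirements for a project".split()
marked = build_classifier_input(tokens, 0, 2)
print(" ".join(marked))
```

In a real system the markers would be added before sub-word tokenization, and [ANS] would have to be registered as a special token so that the tokenizer does not split it.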
On top of BERT, we build a feed-forward network that receives as input the [CLS] token embedding concatenated with a learnable embedding of the entity type of the answer, as shown on the left side of Figure 2. We propose to utilize the entity type of the answer because there is a clear correlation between the answer type of the question and the entity type of the answer. For example, if the interrogative word is who, the answer is very likely to have an entity type person. Since we are using the [CLS] token embedding as a representation of the context and the answer, we consider that using an explicit entity type embedding of the answer could help the system.
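The classification head can be illustrated with plain Python. The sketch below only shows the data flow: the sizes (768 for the [CLS] embedding, 5 for the entity-type embedding, 8 output classes, a single layer) are the ones reported in the Implementation section, whereas the entity-type inventory, the random weights, and the function name `classify` are assumptions made for the example.

```python
import random
from typing import List, Tuple

LABELS = ["what", "which", "where", "when", "who", "why", "how", "others"]
ENTITY_TYPES = ["None", "PERSON", "DATE", "GPE", "ORG"]  # illustrative subset only
BERT_DIM, ENT_DIM = 768, 5

rng = random.Random(0)
# Parameters that would be learned; randomly initialised here for illustration.
entity_embedding = {e: [rng.gauss(0.0, 0.02) for _ in range(ENT_DIM)] for e in ENTITY_TYPES}
weights = [[rng.gauss(0.0, 0.02) for _ in range(BERT_DIM + ENT_DIM)] for _ in LABELS]
bias = [0.0] * len(LABELS)


def classify(cls_embedding: List[float], entity_type: str) -> Tuple[str, List[float]]:
    """Concatenate [CLS] and entity-type embeddings, apply one linear layer, take the argmax."""
    if len(cls_embedding) != BERT_DIM:
        raise ValueError("expected a 768-dimensional [CLS] embedding")
    ent = entity_embedding.get(entity_type, entity_embedding["None"])
    features = list(cls_embedding) + ent  # 768 + 5 = 773 input features
    logits = [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]
    best = max(range(len(LABELS)), key=logits.__getitem__)
    return LABELS[best], logits


# A made-up [CLS] vector; a real one would come from BERT.
label, logits = classify([0.1] * BERT_DIM, "PERSON")
print(label, len(logits))
```

With untrained weights the predicted label is meaningless; the point is only that the entity type enters the decision through its own small embedding rather than through the passage text.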

Question Generator
For the QG module, we employ one of the current state-of-the-art QG models (Zhao et al., 2018). This model is a sequence-to-sequence neural network that uses gated self-attention in the encoder and an attention mechanism with a maxout pointer in the decoder.
One way to connect the interrogative-word classifier to the QG model is to use the predicted interrogative word as the first output token of the decoder by default. However, we cannot expect a perfect interrogative-word classifier, and the first word of a question is not necessarily an interrogative word. Therefore, in this work, we add the predicted interrogative word to the input of the QG model to let the model decide whether to use it or not. In this way, we can condition the generated question on the predicted interrogative word effectively.

Encoder
The encoder is composed of a Recurrent Neural Network (RNN), a self-attention network, and a feature fusion gate (Gong and Bowman, 2018). The goal of this fusion gate is to combine two intermediate learnable features into the final encoded passage-answer representation. The input of this model is the passage P. It includes the answer and the predicted interrogative word I_w, which is located just before the answer span. The RNN receives the word embedding of the tokens of this text concatenated with a learnable meta-embedding that tags whether the token is the interrogative word, part of the answer of the question to generate, or part of the context of the answer.
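The construction of the encoder input can be sketched as follows. The tag names, the function name `build_qg_input`, and the half-open span convention are hypothetical; only the placement of the interrogative word just before the answer span and the three-way tagging come from the description above.

```python
from typing import List, Tuple


def build_qg_input(
    passage_tokens: List[str], ans_start: int, ans_end: int, interrogative_word: str
) -> Tuple[List[str], List[str]]:
    """Insert the interrogative word before the answer span [ans_start, ans_end)
    and return the tokens together with one meta tag per token."""
    if not 0 <= ans_start < ans_end <= len(passage_tokens):
        raise ValueError("the answer span must lie inside the passage")
    tokens = passage_tokens[:ans_start] + [interrogative_word] + passage_tokens[ans_start:]
    tags = (
        ["context"] * ans_start
        + ["interrogative"]
        + ["answer"] * (ans_end - ans_start)
        + ["context"] * (len(passage_tokens) - ans_end)
    )
    return tokens, tags


# Toy example (hypothetical passage); the answer is "the owner".
qg_tokens, qg_tags = build_qg_input("the owner produces a list".split(), 0, 2, "who")
print(list(zip(qg_tokens, qg_tags)))
```

Each tag would then index a small learnable embedding table whose vector is concatenated with the word embedding of the token. Because the interrogative word is now a token of the input, the copy mechanism of the decoder can point at it directly.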

Decoder
The decoder is composed of an RNN with an attention layer and a copy mechanism (Gu et al., 2016). The RNN of the decoder at time step t receives its hidden state at the previous time step t-1 and the previously generated output y_{t-1}. At t = 0, it receives the last hidden state of the encoder. This model combines the probability of generating a word and the probability of copying that word from the input, as shown on the right side of Figure 2. To compute the generative scores, it uses the outputs of the decoder and the context of the encoder, which is based on the raw attention scores. To compute the copy scores, it uses the outputs of the RNN and the raw attention scores of the encoder. Zhao et al. (2018) observed that the repetition of words in the input sequence tends to create repetitions in the output sequence too. Thus, they proposed a maxout pointer mechanism instead of the regular pointer mechanism (Vinyals et al., 2015). This new pointer mechanism limits the magnitude of the scores of the repeated words to their maximum value. To do that, first, the attention scores are computed over the input sequence and then, the score of a word at time step t is calculated as the maximum of all scores pointing to the same word in the input sequence. The final probability distribution is calculated by applying the softmax function on the concatenation of copy scores and generative scores and summing up the probabilities pointing to the same words.
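The maxout pointer step can be illustrated for a single decoding step. This is a simplified sketch under stated assumptions: the raw attention scores and the generative scores are given as plain lists, each distinct input word receives exactly one copy score, and the function name `maxout_pointer_distribution` is hypothetical. It is not the implementation of Zhao et al. (2018).

```python
import math
from typing import Dict, List


def maxout_pointer_distribution(
    input_tokens: List[str],
    attention_scores: List[float],
    vocab: List[str],
    generative_scores: List[float],
) -> Dict[str, float]:
    """Combine copy and generative scores into one word distribution."""
    # Copy score of a word = maximum raw attention score over all its positions,
    # so a repeated input word is not favoured just because it is repeated.
    copy_scores: Dict[str, float] = {}
    for token, score in zip(input_tokens, attention_scores):
        copy_scores[token] = max(score, copy_scores.get(token, float("-inf")))

    # Softmax over the concatenation of generative and copy scores.
    words = list(vocab) + list(copy_scores.keys())
    scores = list(generative_scores) + list(copy_scores.values())
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)

    # Sum the probabilities that point to the same word.
    probs: Dict[str, float] = {}
    for word, e in zip(words, exps):
        probs[word] = probs.get(word, 0.0) + e / total
    return probs


# Toy step: the inserted word "who" receives a high attention score.
dist = maxout_pointer_distribution(
    ["who", "the", "owner", "the"], [2.0, 0.5, 1.0, 1.5],
    ["who", "what", "the"], [0.3, 0.1, 0.2],
)
print(max(dist, key=dist.get))
```

With a regular pointer, the two occurrences of "the" would both contribute probability mass; taking the maximum keeps only the strongest one.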

Experiments
In our experiments, we study our proposed system on the SQuAD v1.1 dataset (Rajpurkar et al., 2016), prove the validity of our hypothesis, and compare it with the current state of the art.

Dataset
In order to train our interrogative-word classifier, we use the training set of SQuAD v1.1 (Rajpurkar et al., 2016). This dataset is composed of 87599 instances; however, the number of interrogative words is not balanced, as seen in Table 1. For a fair comparison with previous models, we train the QG model on the training set of SQuAD and randomly split the dev set in half into dev and test sets, as in Zhou et al. (2017).
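The caption of Table 1 notes that the training set is downsampled to obtain a balanced dataset for the classifier. The exact procedure is not described, so the sketch below is only one possible approach and is an assumption: it caps every class at a fixed number of randomly chosen examples. The function name `downsample` and the `cap` parameter are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (question text, interrogative-word label)


def downsample(examples: List[Example], cap: int, seed: int = 0) -> List[Example]:
    """Keep at most `cap` randomly chosen examples per interrogative-word class."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    balanced: List[Example] = []
    for items in by_label.values():
        rng.shuffle(items)
        balanced.extend(items[:cap])
    rng.shuffle(balanced)
    return balanced


# Toy data (made up): ten "what" questions and three "who" questions.
data = [(f"q{i}", "what") for i in range(10)] + [(f"w{i}", "who") for i in range(3)]
balanced = downsample(data, cap=3)
print(len(balanced))
```

Classes with fewer examples than the cap are kept whole, so the result is balanced only up to the size of the rarest class.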

Implementation
The interrogative-word classifier is built on the PyTorch implementation of BERT-base-uncased provided by HuggingFace. It was trained for three epochs using cross-entropy loss as the objective function. The entity types are obtained using spaCy. If spaCy cannot return an entity for a given answer, we label it as None. The dimension of the entity type embedding is 5. The input dimension of the classifier is 773 (768 from the BERT-base hidden size and 5 from the entity type embedding size) and the output dimension is 8, since we predict the interrogative words: what, which, where, when, who, why, how, and others. The feed-forward network consists of a single layer. For optimization, we used the Adam optimizer with weight decay and a learning rate of 5e-5. The QG model is based on the model proposed by Zhao et al. (2018), with small modifications, and is implemented in PyTorch. The encoder uses a BiLSTM and the decoder uses an LSTM. During training, the QG model uses the golden interrogative words to force the decoder to always copy the interrogative word. On the other hand, during inference, it uses the interrogative-word predictions from the classifier.

Evaluation
We perform an automatic evaluation using the metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4 (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and ROUGE-L (Lin, 2004). In addition, we perform a qualitative analysis where we compare the generated questions of the baseline (Zhao et al., 2018), our proposed model, the upper bound performance of our model, and the golden question.
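To make concrete how an overlap-based metric reacts to a wrong interrogative word, the following sketch computes a sentence-level ROUGE-L score from the longest common subsequence (LCS). It is a simplified illustration, not the evaluation script used for the reported results; the value of `beta` and the whitespace tokenization are assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str], beta: float = 1.2) -> float:
    """LCS-based F-measure; beta weights recall against precision (assumed value)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


gold = "who produces a list of requirements for a project ?".split()
wrong = "what produces a list of requirements for a project ?".split()
print(rouge_l(gold, gold), rouge_l(wrong, gold))
```

In this example a single wrong interrogative word removes one of ten tokens from the common subsequence, so the score drops even though the rest of the question is identical.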

Comparison with Previous Models
Our interrogative-word classifier achieves an accuracy of 73.8% on the test set of SQuAD. Using this model for the pipelined system, we compare the performance of the QG model with respect to the previous state-of-the-art models. Table 2 shows the evaluation results of our model and the current state-of-the-art models, which are briefly described below.
• Zhou et al. (2017) were among the first to propose a sequence-to-sequence model with attention and a copy mechanism. They also proposed the use of POS and NER tags as lexical features for the encoder.
• Zhao et al. (2018) proposed the model on which we based our QG module.
• Kim et al. (2019) proposed a QG architecture that treats the passage and the target answer separately.
• Liu et al. (2019) proposed a sequence-to-sequence model with a clue word predictor using Graph Convolutional Networks to identify whether each word in the input passage is a potential clue that should be copied into the generated question.
Our model outperforms all other models in all the metrics. This improvement is consistent, around 2%. This is due to the improvement in the recall of the interrogative words. All these measures are based on the overlap between the golden question and the generated question, so by using the right interrogative word, we can improve these scores. In addition, generating the right interrogative word also helps to create better questions, since the output of the RNN of the decoder at time step t also depends on the previously generated word.

Upper Bound Performance of IWAQG
We analyze the upper bound improvement that our QG model can have according to different levels of accuracy of the interrogative-word classifier. In order to do that, instead of using our interrogative-word classifier, we use the golden labels of the test set and generate noise to simulate a classifier with different accuracy levels. Table 3 and Figure 3 show a linear relationship between the accuracy of the classifier and the performance of IWAQG. This demonstrates the effectiveness of our pipelined approach regardless of the interrogative-word classifier model. In addition, we analyze the recall of the interrogative words generated by our pipelined system. As shown in Table 4, the total recall of using only the QG module is 68.29%, while the recall of our proposed system, IWAQG, is 74.10%, an improvement of almost 6%. Furthermore, if we assume a perfect interrogative-word classifier, the recall would be 99.72%, a dramatic improvement which proves the validity of our hypothesis.
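A classifier with a chosen accuracy can be simulated from the golden labels by corrupting a fixed fraction of them. How the noise was generated is not specified, so the sketch below is an assumption: it flips exactly the required number of labels, each to a different class chosen uniformly at random. The function name `simulate_classifier` is hypothetical.

```python
import random
from typing import List

LABELS = ["what", "which", "where", "when", "who", "why", "how", "others"]


def simulate_classifier(gold_labels: List[str], accuracy: float, seed: int = 0) -> List[str]:
    """Return predictions that agree with the gold labels at the requested accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    rng = random.Random(seed)
    n = len(gold_labels)
    n_wrong = round(n * (1.0 - accuracy))
    wrong_positions = set(rng.sample(range(n), n_wrong))
    predictions = []
    for i, gold in enumerate(gold_labels):
        if i in wrong_positions:
            # Replace the gold label with a different, randomly chosen class.
            predictions.append(rng.choice([label for label in LABELS if label != gold]))
        else:
            predictions.append(gold)
    return predictions


gold = ["who", "what", "when", "how"] * 25
preds = simulate_classifier(gold, accuracy=0.8)
print(sum(p == g for p, g in zip(preds, gold)) / len(gold))
```

Feeding such predictions to the QG model at several accuracy levels gives the kind of curve summarized in Table 3 and Figure 3; uniform noise is a simplification, since a real classifier tends to confuse some classes more than others.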

Effectiveness of the input of interrogative words into the QG model
In this section, we show the effectiveness of explicitly inserting the predicted interrogative word into the passage. We argue that this simple way of connecting the two models exploits the characteristics of the copy mechanism successfully. As we can see in Figure 4, the attention score of the generated interrogative word, who, is relatively high for the predicted interrogative word and lower for the other words. This means that it is very likely that the interrogative word inserted into the passage is copied as intended.
Table 3: Performance of the QG model with respect to the accuracy of the interrogative-word classifier. "*" is our implementation of the QG module without our interrogative-word classifier (Zhao et al., 2018).
Figure 4: Attention matrix between the generated question (Y-axis) and the given passage (X-axis).

Qualitative Analysis
In this section, we present a sample of the generated questions of our model, the upper bound model (interrogative-word classifier accuracy is 100%), the baseline (Zhao et al., 2018), and the golden questions to show how our model improves the recall of the interrogative words with respect to the baseline. In general, our model has a better recall of interrogative words than the baseline, which leads to a better quality of questions. However, since we are still far from a perfect interrogative-word classifier, we also show that questions that our current model cannot generate correctly could be generated well if we had a better classifier.
As we can see in Table 5, in the first three examples the interrogative words generated by the baseline are wrong, while those generated by our model are right. In addition, due to the wrong selection of interrogative words, in the second example, the topic of the question generated by the baseline is also wrong. On the other hand, since our model selects the right interrogative word, it can create the right question. Each generated word depends on the previously generated word because of the generative LSTM model, so it is very important to correctly select the first word, i.e., the interrogative word. However, the performance of our proposed interrogative-word classifier is not perfect; if it had 100% accuracy, we could improve the quality of the generated questions, as in the last two examples.

Ablation Study
We tried to combine different features shown in Table 6 for the interrogative-word classifier. In this section, we analyze their impact on the performance of the model.
The first model uses only the [CLS] token embedding that represents the input passage. In this model, the input is the passage where the answer appears, but the model does not know where the answer is. The second model is the previous one with the entity type of the answer as an additional feature. The performance of this model is a bit better than the first one, but it is not enough to be utilized effectively for our pipeline. In the third model, the input is the passage. This model uses the average of the answer token embeddings generated by BERT along with the [CLS] token embedding. As we can see, the performance noticeably increased, which indicates that answer information is the key to predicting the interrogative word needed. In the fourth model, we added the special token [ANS] at the beginning and at the end of the answer span to let BERT know where the answer is in the passage. So the input to the feed-forward network is only the [CLS] token embedding. This model clearly outperforms the previous one, which shows that BERT can exploit the answer information better if it is tagged with the [ANS] token. The fifth model is the same as the previous one but with the addition of the entity-type embedding of the answer. The combination of the three features (answer, answer entity type, and passage) yields the best performance.
In addition, we provide the recall and precision per class for our final interrogative-word classifier (CLS + AT + NER in Table 7). As we can see, the overall recall is high, and it is also higher than just using the QG module (Table 4), which proves our hypothesis that modeling the interrogative-word prediction task as an independent classification problem yields a higher recall than generating the interrogative words with the full question. However, the recall of which is very low. This is due to the intrinsic difficulty of predicting this interrogative word. Questions like "what country" and "which country" can both be correct depending on the context, and their meaning is very similar. Our model also has problems with why due to the lack of training instances for this class. Lastly, the recall of when is also low because many questions of this type can be formulated with other interrogative words, e.g., instead of "When did WWII start?", we can ask "In which year did WWII start?".
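The per-class recall and precision reported for the classifier can be computed as sketched below. The function name `per_class_precision_recall` and the toy labels are hypothetical; classes with no predictions or no gold instances are given a score of 0.0 by convention in this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_precision_recall(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_positive = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_count = Counter(gold)
    pred_count = Counter(pred)
    report = {}
    for label in sorted(set(gold) | set(pred)):
        precision = true_positive[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_positive[label] / gold_count[label] if gold_count[label] else 0.0
        report[label] = (precision, recall)
    return report


# Toy labels (made up): "when" is never predicted, so its recall is 0.
gold_words = ["who", "who", "what", "when"]
pred_words = ["who", "what", "what", "what"]
report = per_class_precision_recall(gold_words, pred_words)
print(report)
```

A class such as which or when can have low recall while what keeps a high one, because ambiguous questions tend to be absorbed by the most frequent class.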

Conclusion and Future Work
In this work, we proposed Interrogative-Word-Aware Question Generation (IWAQG), a pipelined model composed of an interrogative-word classifier and a question generator to tackle the question generation task. First, we predict the interrogative word. Then, the Question Generation (QG) model generates the question using the predicted interrogative word. Thanks to this independent interrogative-word classifier and the copy mechanism of the question generation model, we are able to improve the recall of the interrogative words in the generated questions. This improvement also leads to a better quality of the generated questions. We prove our hypotheses through quantitative and qualitative experiments, showing that our pipelined system outperforms the previous state-of-the-art models. Lastly, we also prove that our methodology is remarkably effective, showing a theoretical upper bound of the potential improvement using a more accurate interrogative-word classifier.
Table 5: Qualitative Analysis. Comparison between the baseline, our proposed model, the upper bound of our model, the golden question, and the answer of the question. "*" is our implementation of the QG module without our interrogative-word classifier (Zhao et al., 2018).
In the future, we would like to improve the interrogative-word classifier, since it would clearly improve the performance of the whole system as we showed.We also expect that the use of the Transformer architecture (Vaswani et al., 2017) could improve the QG model.In addition, we plan to test our approach on other datasets to prove its generalization capability.Finally, an interesting application of this work could be to utilize QG to improve Question Answering systems.

Figure 1: High-level overview of the proposed model.

Figure 3: Performance of the QG model with respect to the accuracy of the interrogative-word classifier.

Table 1: SQuAD training set statistics. Full training set and downsampled training set. To train the interrogative-word classifier, we downsample the training set to have a balanced dataset.

Table 2: Comparison of our model with the baselines. "*" is our QG module.

Table 4: Recall of interrogative words of the QG model. "*" is our implementation of the QG module without our interrogative-word classifier.

Table 6: Ablation Study of our interrogative-word classifier.

Table 7: Recall and precision of interrogative words of our interrogative-word classifier.