Let’s Ask Again: Refine Network for Automatic Question Generation

In this work, we focus on the task of Automatic Question Generation (AQG): given a passage and an answer, generate the corresponding question. The generated question should be (i) grammatically correct, (ii) answerable from the passage, and (iii) specific to the given answer. An analysis of existing AQG models shows that they produce questions which do not adhere to one or more of these qualities. In particular, the generated questions often look like an incomplete draft of the desired question, with clear scope for refinement. To alleviate this shortcoming, we propose a method which mimics the human process of generating questions: first create an initial draft, then refine it. More specifically, we propose Refine Network (RefNet), which contains two decoders. The second decoder uses a dual attention network which attends to both (i) the original passage and (ii) the question (initial draft) generated by the first decoder. In effect, it refines the question generated by the first decoder, making it more correct and complete. We evaluate RefNet on three datasets, viz., SQuAD, HOTPOT-QA, and DROP, and show that it outperforms existing state-of-the-art methods by 7-16% on all of them. Lastly, we show that we can improve the second decoder on specific metrics, such as fluency and answerability, by explicitly rewarding revisions that improve on the corresponding metric during training. The code has been made publicly available.


Introduction
Over the past few years, there has been a growing interest in Automatic Question Generation (AQG) from text: the task of generating a question from a passage and, optionally, an answer. AQG is used for curating Question Answering datasets, enhancing user experience in conversational AI systems (Shum et al., 2018) and creating educational materials (Heilman and Smith, 2010). For these applications, it is essential that the questions are (i) grammatically correct, (ii) answerable from the passage and (iii) specific to the answer. Existing approaches encode the passage, the answer and the relationship between them using complex functions and then generate the question in a single pass. However, by carefully analysing the generated questions, we observe that these approaches tend to miss one or more of the important aspects of the question. For instance, in Table 1, the question generated by the single-pass baseline model for the first passage is grammatically correct but is not specific to the answer. In the second example, the generated question is both syntactically incorrect and incomplete.

* The first two authors have contributed equally to this work. Code: https://github.com/PrekshaNema25/RefNet-QG
These examples indicate that there is clear scope for improving the general quality of the questions, and for improving specific aspects such as fluency (Example 2) and answerability (Example 1). One way to approach this is to revisit the passage and answer, refine the initial draft by generating a better question in a second pass, and then improve it with respect to a particular aspect. This mirrors how humans tend to write a rough initial draft first and then refine it over multiple passes, with later revisions targeting aspects like fluency or completeness. With this motivation, we propose Refine Network (RefNet), which examines the initially generated question and performs a second pass to generate a revised question. Furthermore, we propose Reward-RefNet, which uses explicit reward signals to achieve refinement focused on specific properties of the question such as fluency and answerability.
RefNet is a seq2seq based model that comprises two decoders: a Preliminary Decoder and a Refinement Decoder. The Refinement Decoder takes the initial draft of the question generated by the Preliminary Decoder as input, along with the passage and answer, and generates the refined question by attending to both the passage and the initial draft using a Dual Attention Network. The proposed dual attention helps RefNet generate the final question by revisiting the appropriate parts of the input passage and the initial draft. From Table 1, we can see that RefNet is able to generate better questions in the second pass by fixing the errors in the initial draft. Our Reward-RefNet model uses the REINFORCE with a baseline algorithm to explicitly reward the Refinement Decoder for generating a question that is better than the Preliminary Decoder's on desired properties like fluency and answerability. This leads to more answerable (see the Reward-RefNet example for passage 1 in Table 1) and more fluent (see the Reward-RefNet example for passage 2 in Table 1) questions compared to the vanilla RefNet model.
Our experiments show that the proposed RefNet model outperforms existing state-of-the-art models on the SQuAD dataset by 12.3% and 3.7% (on BLEU) given the relevant sentence and the passage respectively. We also achieve state-of-the-art results on the HOTPOT-QA and DROP datasets, with improvements of 7.57% and 15.25% respectively over the single-decoder baseline (on BLEU). Our human evaluations further validate these results. We also analyze and explain the impact of including the Refinement Decoder by examining the interaction between the two decoders. Interestingly, we observe that the inclusion of the Refinement Decoder boosts the quality of the questions generated by the initial decoder as well. Lastly, our human evaluation of the questions generated by Reward-RefNet corroborates the empirical results, i.e., it improves the question w.r.t. fluency and answerability compared to RefNet.

Refine Networks (RefNet) Model
In this section, we discuss the various components of our proposed model, shown in Figure 1. For a given passage $P = \{w^p_1, \ldots, w^p_m\}$ of length $m$ and answer $A = \{w^a_1, \ldots, w^a_n\}$ of length $n$, we first obtain an answer-aware latent representation $U = \{\tilde{h}^p_1, \ldots, \tilde{h}^p_m\}$ for every word of the passage, and an answer representation $h^a$ (as described in Section 2.1). We then generate an initial draft $\tilde{Q} = \{\tilde{q}_1, \ldots, \tilde{q}_T\}$ by computing each $\tilde{q}_t$ as

$$\tilde{q}_t = \arg\max_{\tilde{q}_t} \tilde{p}(\tilde{q}_t \mid \tilde{q}_{t-1}, \ldots, \tilde{q}_1, U, h^a)$$

Here $\tilde{p}(\cdot)$ is a probability distribution modeled by the Preliminary Decoder. We then refine the initial draft $\tilde{Q}$ using the Refinement Decoder to obtain the refined draft $Q = \{q_1, \ldots, q_T\}$:

$$q_t = \arg\max_{q_t} p(q_t \mid q_{t-1}, \ldots, q_1, \tilde{Q}, U, h^a)$$

Finally, we use explicit rewards to enforce refinement on a desired metric, such as fluency or answerability, through our Reward-RefNet model. In the following sub-sections, we describe the passage encoder, the Preliminary and Refinement Decoders, and our reward mechanism.
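The two-pass generation described above can be sketched in a few lines of pure Python. This is only a toy illustration: `prelim_step` and `refine_step_factory` are hypothetical stand-ins for the per-step output distributions of the two decoders, not the actual attention-based LSTMs.

```python
# Toy sketch of RefNet's two-pass greedy decoding. A "step function" maps
# the prefix generated so far to a dict of next-word probabilities.
VOCAB = ["what", "is", "the", "capital", "of", "france", "?", "<eos>"]

def greedy_decode(step_fn, max_len=10):
    """Generate a sequence by repeatedly taking the argmax word."""
    seq = []
    for _ in range(max_len):
        probs = step_fn(seq)              # p(q_t | q_1 .. q_{t-1}, ...)
        word = max(probs, key=probs.get)  # q_t = argmax over the vocab
        if word == "<eos>":
            break
        seq.append(word)
    return seq

def refnet_generate(prelim_step, refine_step_factory, max_len=10):
    """First pass produces the initial draft; the second pass conditions
    on that draft (via the factory) and produces the refined question."""
    draft = greedy_decode(prelim_step, max_len)
    refine_step = refine_step_factory(draft)
    return draft, greedy_decode(refine_step, max_len)
```

In the full model the refinement step function would also attend to the passage representation; here it simply receives the draft.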

Passage and Answer Encoder
We use a 3-layer encoder consisting of (i) Embedding, (ii) Contextual and (iii) Passage-Answer Fusion layers, as described below. To capture the interaction between passage and answer, we ensure that the passage and answer representations are fused together at every layer.

Embedding Layer: In this layer, we compute a $d$-dimensional embedding for every word in the passage and the answer. This embedding is obtained by concatenating the word's Glove embedding (Pennington et al., 2014) with its character-based embedding, as discussed in Seo et al. (2016). Additionally, for passage words, we also compute a positional embedding based on the relative position of the word w.r.t. the answer span, as described in Zhao et al. (2018). For every passage word, this positional embedding is concatenated to the word and character-based embeddings. We discuss the impact of character embeddings and answer tagging in Appendix A. In the subsequent sections, we refer to the embedding of the $i$-th passage word $w^p_i$ as $e(w^p_i)$ and of the $j$-th answer word $w^a_j$ as $e(w^a_j)$.

Contextual Layer: In this layer, we compute a contextualized representation for every word in the passage by passing the word embeddings (as computed above) through a bidirectional LSTM (Hochreiter and Schmidhuber, 1997):

$$\overrightarrow{h}^p_t = \overrightarrow{\text{LSTM}}(\overrightarrow{h}^p_{t-1}, e(w^p_t))$$

where $\overrightarrow{h}^p_t$ is the hidden state of the forward LSTM at time $t$. We then concatenate the forward and backward hidden states as $h^p_t = [\overrightarrow{h}^p_t; \overleftarrow{h}^p_t]$. The answer could correspond to a span in the passage. Let $j+1$ and $j+n$ be the start and end indices of the answer span in the passage respectively, so that $\{h^p_{j+1}, \ldots, h^p_{j+n}\}$ is the representation of the answer words in the context of the passage. We then obtain contextualized representations for the $n$ answer words by passing them through a Bi-LSTM as follows:

$$\overrightarrow{h}^a_t = \overrightarrow{\text{LSTM}}(\overrightarrow{h}^a_{t-1}, [e(w^a_t); h^p_{j+t}])$$

The final state $h^a = [\overrightarrow{h}^a_n; \overleftarrow{h}^a_n]$ of this Bi-LSTM is used as the answer representation in the subsequent stages.
When the answer is not present in the passage, only $e(w^a_t)$ is passed to the LSTM.

Passage-Answer Fusion Layer: In this layer, we refine the representations of the passage words based on the answer representation as follows:

$$\tilde{h}^p_i = \tanh(W_u [h^p_i; h^a; h^p_i \odot h^a])$$

Here $W_u \in \mathbb{R}^{l \times 3l}$ and $l$ is the hidden size of the LSTM. This is similar to how Seo et al. (2016) capture interactions between passage and question for QA. We use $U = \{\tilde{h}^p_1, \ldots, \tilde{h}^p_m\}$ as the fused passage-answer representation, which is then used by our decoder(s) to generate the question $Q$.
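A minimal NumPy sketch of this fusion layer. The exact form of the $3l$-dimensional input to $W_u$ is an assumption on our part (the concatenation of the passage state, the answer state, and their elementwise product, which is consistent with $W_u \in \mathbb{R}^{l \times 3l}$ and with the BiDAF-style fusion the text cites):

```python
import numpy as np

def fuse_passage_answer(H_p, h_a, W_u):
    """Passage-answer fusion: for each passage word, concatenate its
    contextual vector, the answer vector, and their elementwise product
    (3l dims total), then project back to l dims with a tanh.
    H_p: (m, l) contextual passage states; h_a: (l,) answer vector;
    W_u: (l, 3l) fusion matrix."""
    m, l = H_p.shape
    feats = np.concatenate(
        [H_p, np.tile(h_a, (m, 1)), H_p * h_a], axis=1)  # (m, 3l)
    return np.tanh(feats @ W_u.T)                         # (m, l)
```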

Preliminary and Refinement Decoders
As discussed earlier, RefNet has two decoders, viz., the Preliminary Decoder and the Refinement Decoder, described below.

Preliminary Decoder: This decoder generates an initial draft of the question, one word at a time, using an LSTM:

$$\tilde{h}^d_t = \text{LSTM}(\tilde{h}^d_{t-1}, [e_w(\tilde{q}_{t-1}); \tilde{c}_{t-1}; h^a]), \qquad \tilde{c}_t = \sum_{i=1}^{m} \alpha^t_i \tilde{h}^p_i$$

Here $\tilde{h}^d_t$ is the hidden state at time $t$, $h^a$ is the answer representation computed above, $\tilde{c}_{t-1}$ is an attention-weighted sum of the contextualized passage word representations, and $\alpha^t_i$ are parameterized and normalized attention weights (Bahdanau et al., 2014). We refer to this attention network as $A_1$. $e_w(\tilde{q}_t)$ is the embedding of the word $\tilde{q}_t$. We obtain $\tilde{q}_t$ as:

$$\tilde{q}_t = \arg\max \text{softmax}(W_o \tanh(W_c [\tilde{h}^d_t; \tilde{c}_t])) \qquad (2)$$

where $W_c \in \mathbb{R}^{l \times 2l}$ and $W_o$ is the output matrix which projects the final representation to $\mathbb{R}^V$, $V$ being the vocabulary size.

Refinement Decoder: Once the Preliminary Decoder has generated the entire question, the Refinement Decoder uses it to generate an updated version of the question using a Dual Attention Network. It first computes an attention-weighted sum of the embeddings of the words generated by the first decoder:

$$b_t = \sum_{i=1}^{T} \beta^t_i e_w(\tilde{q}_i)$$

where $\beta^t_i$ are parameterized and normalized attention weights computed by attention network $A_2$. Since the initial draft could be erroneous or incomplete, we obtain additional information from the passage instead of relying only on the output of the first decoder. We do so by computing a context vector $c_t$ as

$$c_t = \sum_{i=1}^{m} \gamma^t_i \tilde{h}^p_i$$

where $\gamma^t_i$ are parameterized and normalized attention weights computed by attention network $A_3$. The hidden state of the Refinement Decoder at time $t$ is computed as follows:

$$h^d_t = \text{LSTM}(h^d_{t-1}, [e_w(q_{t-1}); c_{t-1}; b_{t-1}; h^a])$$

Finally, $q_t$ is predicted using

$$q_t = \arg\max \text{softmax}(W_o \tanh(W_c [h^d_t; c_t; b_t]))$$

where $W_c$ is a weight matrix and $W_o$ is the output matrix shared with the Preliminary Decoder (Equation 2). Note that RefNet generates two variants of the question: the initial draft $\tilde{Q}$ and the final draft $Q$. We compare these two versions of the generated questions in Section 4.
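One step of the dual attention can be sketched as below. This is a simplified version: the bilinear scoring matrices `W_p` and `W_d` are stand-ins for the parameterized (Bahdanau-style additive) attention networks over the passage and the initial draft used in the actual model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_attention_step(h_prev, H_p, E_draft, W_p, W_d):
    """One step of dual attention for the refinement decoder.
    h_prev: (l,) previous decoder state; H_p: (m, l) fused passage
    states; E_draft: (T, d) embeddings of the draft words;
    W_p: (l, l) and W_d: (d, l) bilinear scoring matrices.
    Returns the passage context c_t, the draft context b_t, and the
    two normalized weight vectors."""
    gamma = softmax(H_p @ W_p @ h_prev)     # attention over the passage
    beta = softmax(E_draft @ W_d @ h_prev)  # attention over the draft
    c_t = gamma @ H_p                       # passage context vector
    b_t = beta @ E_draft                    # draft context vector
    return c_t, b_t, gamma, beta
```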

Reward-RefNet
Next, we address the following question: can the Refinement Decoder be explicitly rewarded for generating a question which is better than that generated by the Preliminary Decoder on certain desired parameters? For example, Nema and Khapra (2018) define fluency and answerability as desired qualities of the generated question. They evaluate fluency using the BLEU score and answerability using a score which captures whether the question contains the required {named entities, important words, function words, question types} (and is thus answerable). We use these fluency and answerability scores as reward signals. We first compute the rewards $r(\tilde{Q})$ and $r(Q)$ for the questions generated by the Preliminary and Refinement Decoders respectively. We then use the "REINFORCE with a baseline" algorithm (Williams, 1992) to reward the Refinement Decoder, using the Preliminary Decoder's reward $r(\tilde{Q})$ as the baseline. More specifically, given the Preliminary Decoder's generated word sequence $\tilde{Q} = \{\tilde{q}_1, \tilde{q}_2, \ldots, \tilde{q}_T\}$ and the Refinement Decoder's generated word sequence $Q = \{q_1, q_2, \ldots, q_T\}$ obtained from the distribution $p(q_t \mid q_{t-1}, \ldots, q_1, \tilde{Q}, U, h^a)$, the training loss is defined as:

$$\mathcal{L} = -\,(r(Q) - r(\tilde{Q})) \sum_{t=1}^{T} \log p(q_t \mid q_{t-1}, \ldots, q_1, \tilde{Q}, U, h^a)$$

where $r(\tilde{Q})$ and $r(Q)$ are the rewards obtained by comparing with the reference question $Q^*$. As mentioned, the reward $r(\cdot)$ can be the fluency score or the answerability score as defined by Nema and Khapra (2018).
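Once the per-step log-probabilities of the refined question and the two scalar rewards are available, the loss itself is simple to compute. A minimal sketch (sequence-level advantage only; batching and gradient machinery are omitted):

```python
import math

def reward_refnet_loss(log_probs, r_refined, r_draft):
    """REINFORCE with the preliminary decoder's reward as baseline.
    log_probs: list of per-step log p(q_t | ...) for the refined question;
    r_refined: reward r(Q) of the refined draft;
    r_draft:   baseline reward r(Q~) of the initial draft.
    Minimizing this loss increases the refined question's probability
    only when it beats the initial draft on the chosen metric."""
    advantage = r_refined - r_draft
    return -advantage * sum(log_probs)
```

The baseline makes the reward signal relative: a refinement that merely matches the initial draft receives zero gradient, so only genuine improvements are reinforced.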

Copy Module
Along with the three modules described above, we adopt the pointer network and coverage mechanism from See et al. (2017). We use them to (i) handle out-of-vocabulary words and (ii) avoid repeating phrases in the generated questions.
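A rough sketch of how the copy (pointer) distribution is mixed with the vocabulary distribution, following See et al. (2017). The generation probability `p_gen` would be predicted by the decoder at each step, and the extended-vocabulary handling of OOV words is omitted here:

```python
import numpy as np

def pointer_generator_dist(p_vocab, attn, src_ids, vocab_size, p_gen):
    """Mix the decoder's vocabulary distribution with a copy distribution
    built from the attention weights over the source passage.
    p_vocab: (V,) softmax over the vocabulary; attn: (m,) normalized
    attention over passage positions; src_ids: (m,) vocab id of each
    passage word; p_gen: scalar in [0, 1] favoring generation vs. copy."""
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, attn)  # scatter attention mass onto vocab ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```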

Experimental Details
In this section, we discuss (i) the datasets on which we evaluate our proposed model, (ii) implementation details and (iii) the evaluation metrics used to compare our model with the baseline and existing works.

Datasets
SQuAD (Rajpurkar et al., 2016): It contains 100K (question, answer) pairs obtained from 536 Wikipedia articles, where each answer is a span in the passage. For SQuAD, AQG has been attempted from both sentences and passages. In the former case, only the sentence which contains the answer span is used as input, whereas in the latter case the entire passage is used. We use the same train-validation-test splits as Zhao et al. (2018).

HOTPOT-QA: HOTPOT-QA is a multi-document, multi-hop QA dataset. Along with the triplet (P, A, Q), the authors also provide supporting facts that potentially lead to the answer. The answers here are either yes/no or a span in P. We concatenate these supporting facts to form the passage. We use 10% of the training data for validation and the original dev set as the test set.

DROP (Dua et al., 2019): DROP is a reading comprehension benchmark which requires discrete reasoning over the passage. It contains 96K questions which require discrete operations such as addition, counting, or sorting to obtain the answer. We use 10% of the original training data for validation and the original dev set as the test set.

Implementation Details
We use 300-dimensional pre-trained Glove word embeddings, which are kept fixed during training. For character-level embeddings, we use a 20-dimensional embedding for each character, which is then projected to 100 dimensions. For answer tagging, we use an embedding size of 3. The hidden size of all LSTMs is fixed to 512. We use 2-layer, 1-layer and 2-layer stacked BiLSTMs for the passage encoder, the answer encoder and both decoders respectively. We take the 30,000 most frequent words as the vocabulary. We use the Adam optimizer with a learning rate of 0.0004 and train our models for 10 epochs using cross-entropy loss. For the Reward-RefNet model, we fine-tune the pre-trained model with the loss function mentioned in Section 2.3 for 3 epochs. The best model is chosen based on the BLEU (Papineni et al., 2002) score on the validation split. For all results we use beam search decoding with a beam size of 5.
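For reference, beam search over a step function can be sketched as below. This is a minimal toy version; the actual decoding operates on LSTM states and log-probabilities over the 30,000-word vocabulary:

```python
import math

def beam_search(step_fn, beam_size=5, max_len=10, eos="<eos>"):
    """Minimal beam search. step_fn maps a prefix (list of words) to a
    dict of next-word probabilities, standing in for the decoder."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in step_fn(prefix).items():
                if p <= 0.0:
                    continue
                candidates.append((prefix + [word], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Completed hypotheses leave the beam; others continue.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses as a fallback
    return max(finished, key=lambda c: c[1])[0]
```

Unlike greedy decoding, the beam keeps lower-probability prefixes alive in case they lead to a higher-probability full sequence.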

Results and Discussions
In this section, we present the results and analysis of our proposed RefNet model. Throughout this section, we refer to our models as follows: the Encode-Attend-Decode (EAD) model is our single-decoder model containing the encoder and the Preliminary Decoder described earlier. Note that the performance of this model is comparable to our implementation of the model proposed by Zhao et al. (2018).
The Refine Network (RefNet) model includes the encoder, the Preliminary Decoder and the Refinement Decoder. We will (i) compare RefNet's performance with EAD and existing models across all the mentioned datasets, (ii) report human evaluations comparing RefNet and EAD, (iii) analyze the Refinement and Preliminary Decoders, and (iv) present the performance of Reward-RefNet with two different reward signals (fluency and answerability).

RefNet's performance across datasets
In Table 2, we compare the performance of RefNet with existing single-decoder architectures across different datasets. On the BLEU-4 metric, RefNet beats the existing state-of-the-art model by 12.30%, 9.74%, 17.48%, and 3.71% on SQuAD (sentence), HOTPOT-QA, DROP and SQuAD (passage) respectively. It also outperforms EAD by 7.83%, 7.57%, 15.25% and 3.85% respectively on the same datasets. In general, RefNet is consistently better than existing models across all n-gram scores (BLEU, ROUGE-L and METEOR). We also observe improvements on Q-BLEU4, which, as described earlier, measures both answerability and fluency.

Human Evaluations
We conducted human evaluations to analyze the quality of the questions produced by EAD and RefNet. We randomly sampled 500 questions generated from the SQuAD (sentence level) dataset and asked annotators to compare their quality. The annotators were shown a pair of questions, one generated by EAD and one by RefNet from the same sentence, and were asked to decide which one was better in terms of Fluency, Completeness, and Answerability. They were allowed to skip question pairs where they could not make a clear choice.
Three annotators rated each question and the final label was decided by majority voting. We observed that RefNet outperforms the EAD model on all three metrics: 68.6%, 66.7% and 64.2% of the questions generated by RefNet were judged more fluent, complete and answerable respectively when compared to the EAD model. However, there are some cases where EAD does better than RefNet. For example, in Table 3, we show that while trying to generate a more elaborate question, RefNet introduces an additional phrase "in the united" which is not required. Due to such instances, annotators preferred the EAD model in around 30% of the instances.

Analysis of Refinement Decoder and Preliminary Decoder
The two decoders affect each other through two paths: (i) an indirect path, where they share the encoder and the output projection to the vocabulary $V$, and (ii) a direct path, via the dual attention network, where the Refinement Decoder attends to the initial draft of the question. When RefNet has only the indirect path, we can infer from row 1 of Table 4 that the performance of the Preliminary Decoder improves compared to the EAD model (16.84 vs. 17.59 BLEU). This suggests that generating two variants of the question improves the first decoding pass as well, perhaps due to the additional feedback that the shared encoder and output layer get from the Refinement Decoder. When we add the direct path (attention network) between the two decoders, the performance of the Refinement Decoder improves over the Preliminary Decoder, as shown in rows 3 and 4 of Table 4.

Sample sentence (Table 5): For instance, the language {xx | x is any binary string} can be solved in linear time on a multi-tape Turing machine, but necessarily requires quadratic time in the model of single-tape Turing machines.

Comparison on Answerability: We also evaluate both the initial and refined drafts using Q-BLEU4. As discussed earlier, the Q-Metric measures answerability using four components, viz., Named Entities, Important Words, Function Words, and Question Type. We observe that the increase in Q-Metric for the refined questions comes from the RefNet model correcting or adding the relevant Named Entities in the question. In particular, the Named Entity component score in the Q-Metric increases from 32.42 for the first draft to 37.81 for the refined draft.
Qualitative Analysis: Figure 2 shows that the RefNet model indeed generates more elaborate questions than the Preliminary Decoder. As shown in Table 5, the quality of the refined question is better than that of the initial draft. Here RefNet adds the phrase "multi-tape Turing machine" (row 2), which removes any ambiguity in the question.

Analysis of Reward-RefNet
In this section, we analyze the impact of employing different reward signals in Reward-RefNet. As discussed in Section 2.3, we use fluency and answerability scores as reward signals. We observe that current state-of-the-art models perform very well in terms of BLEU/Q-BLEU scores when the actual question has significant overlap with the passage. For example, consider a passage from the SQuAD dataset in Table 7, where, except for the question word who, the model sequentially copies everything from the passage and achieves a Q-BLEU score of 92.4. However, the model performs poorly when the true question is novel and does not contain a long sequence of words from the passage itself. To quantify this, we first sort the true questions in ascending order of their BLEU-2 overlap with the passage. We then select the first N true questions and compute the Q-BLEU score of the corresponding generated questions. The results are shown in red in Figure 3. Towards the left, where the true questions have low overlap with the passage, the performance is poor; it gradually improves as the overlap increases.
Generating questions with high originality (where the model phrases the question in its own words) is a challenging aspect of AQG, since it requires a complete understanding of the semantics and syntax of the language. To improve the originality of generated questions, we explicitly reward our model for having a lower n-gram overlap with the passage than the initial draft. We observe that with Reward-RefNet (Originality) the performance improves in the region where the overlap with the passage is low (shown in blue in Figure 3). As shown in Table 8, although both questions are answerable given the passage, the question generated by Reward-RefNet (Originality) is better.
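The overlap analysis above can be approximated with a simple bigram-precision proxy for BLEU-2 (a rough stand-in: real BLEU-2 adds unigram precision, smoothing and a brevity penalty):

```python
def bigram_precision(question, passage):
    """Fraction of the question's bigrams that also occur in the passage,
    a crude proxy for its BLEU-2 overlap with the passage."""
    q, p = question.split(), passage.split()
    q_bi = list(zip(q, q[1:]))
    p_bi = set(zip(p, p[1:]))
    if not q_bi:
        return 0.0
    return sum(b in p_bi for b in q_bi) / len(q_bi)

def sort_by_originality(questions, passages):
    """Order (question, passage) pairs from most to least original,
    i.e. in ascending order of overlap with the passage."""
    pairs = list(zip(questions, passages))
    return sorted(pairs, key=lambda qp: bigram_precision(*qp))
```

Selecting the first N pairs of this ordering gives the low-overlap bucket on which the originality reward has the largest effect.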
Passage: McLetchie was elected on the Lothian regional list and the Conservatives suffered a net loss of five seats, with leader Annabel Goldie claiming that their support had held firm; nevertheless, she too announced she would step down as leader of the party.
True: Who announced she would step down as leader of the Conservatives?
RefNet: who claiming that their support had held firm?
Reward-RefNet: who was the leader of the conservatives?
Table 8: An example where Reward-RefNet (Originality) is better than RefNet.

Related Work
Early works on Question Generation were essentially rule-based systems (Heilman and Smith, 2010; Mostow and Chen, 2009; Lindberg et al., 2013; Labutov et al., 2015). Current models for AQG are based on the encode-attend-decode paradigm and either generate questions from the passage alone (Yao et al., 2018) or from the passage and a given answer (in which case the generated question must result in the given answer). Over the past couple of years, several variants of the encode-attend-decode model have been proposed. For example, Zhou et al. (2018) proposed a sequential copying mechanism to explicitly select a sub-span from the passage. Similarly, Zhao et al. (2018) focus on efficiently incorporating paragraph-level content using Gated Self Attention and Maxout pointer networks. Some works (Yuan et al., 2017) even use Question Answering as a metric to evaluate the generated questions. There has also been work on generating questions from images (Jain et al., 2017; Li et al., 2017) and from knowledge bases (Serban et al., 2016; Reddy et al., 2017). The idea of multi-pass decoding, which is central to our work, has been used by Xia et al. (2017) for machine translation and text summarization, albeit with a different objective. Some works have also augmented seq2seq models (Rennie et al., 2017; Paulus et al., 2018; Song et al., 2017) with external reward signals using the REINFORCE with baseline algorithm (Williams, 1992). The typical rewards used in these works are BLEU and ROUGE scores. Our REINFORCE loss differs from previous ones in that it uses the first decoder's reward as the baseline instead of the reward of the greedy policy.

Conclusion and Future Work
In this work, we proposed Refine Networks (RefNet) for Question Generation, which focus on refining and improving the initial version of the generated question. Our RefNet model, consisting of a Preliminary Decoder and a Refinement Decoder with a Dual Attention Network, outperforms existing state-of-the-art models on the SQuAD, HOTPOT-QA and DROP datasets. Along with automated evaluations, we conducted human evaluations to validate our findings. We further showed that Reward-RefNet improves the initial draft on specific aspects like fluency, answerability and originality. As future work, we would like to extend RefNet with the ability to decide whether refinement of the generated initial draft is needed at all.