Learning to Ask Unanswerable Questions for Machine Reading Comprehension

Machine reading comprehension with unanswerable questions is a challenging task. In this work, we propose a data augmentation technique by automatically generating relevant unanswerable questions according to an answerable question paired with its corresponding paragraph that contains the answer. We introduce a pair-to-sequence model for unanswerable question generation, which effectively captures the interactions between the question and the paragraph. We also present a way to construct training data for our question generation models by leveraging the existing reading comprehension dataset. Experimental results show that the pair-to-sequence model performs consistently better compared with the sequence-to-sequence baseline. We further use the automatically generated unanswerable questions as a means of data augmentation on the SQuAD 2.0 dataset, yielding 1.9 absolute F1 improvement with BERT-base model and 1.7 absolute F1 improvement with BERT-large model.


Introduction
Extractive reading comprehension (Hermann et al., 2015; Rajpurkar et al., 2016) has attracted great attention from both research and industry in recent years. End-to-end neural models (Seo et al., 2017; Wang et al., 2017; Yu et al., 2018) have achieved remarkable performance on the task if answers are assumed to be in the given paragraph. Nonetheless, current systems are still not good at deciding whether no answer is present in the context (Rajpurkar et al., 2018). For unanswerable questions, the systems are supposed to abstain from answering rather than making unreliable guesses, which is an embodiment of language understanding ability.
Figure 1: An example taken from the SQuAD 2.0 dataset. The annotated (plausible) answer span in the paragraph is used as a pivot to align the pair of answerable and unanswerable questions.
We attack the problem by automatically generating unanswerable questions for data augmentation to improve question answering models. The generated unanswerable questions should not be too easy for the question answering model, so that data augmentation can better help the model. For example, a simple baseline method is to randomly choose a question asked for another paragraph and use it as an unanswerable question. However, it would be trivial to determine whether the retrieved question is answerable by using word-overlap heuristics, because the question is irrelevant to the context (Yih et al., 2013). In this work, we propose to generate unanswerable questions by editing an answerable question and conditioning on the corresponding paragraph that contains the answer. As a result, the generated unanswerable questions are more lexically similar and relevant to the context. Moreover, by using the answerable question as a prototype and its answer span as a plausible answer, the generated examples can provide a more discriminative training signal to the question answering model.
To create training data for unanswerable question generation, we use (plausible) answer spans in paragraphs as pivots to align pairs of answerable questions and unanswerable questions. As shown in Figure 1, the answerable and unanswerable questions of a paragraph are aligned through the text span "Victoria Department of Education", which is both the answer and the plausible answer. These two questions are lexically similar and both asked with the same answer type in mind. In this way, we obtain data with which the models can learn to ask unanswerable questions by editing answerable ones with word exchanges, negations, etc. Consequently, we can generate a large number of unanswerable questions with existing large-scale machine reading comprehension datasets.
Inspired by the neural reading comprehension models (Xiong et al., 2017; Huang et al., 2018), we introduce a pair-to-sequence model to better capture the interactions between questions and paragraphs. The proposed model first encodes the input question and paragraph separately, and then conducts attention-based matching to make them aware of each other. Finally, the context-aware representations are used to generate outputs. To facilitate the use of context words during the generation process, we also incorporate the copy mechanism (Gu et al., 2016; See et al., 2017).
Experimental results on the unanswerable question generation task show that the pair-to-sequence model generates consistently better results than the sequence-to-sequence baseline and performs better with long paragraphs than with short answer sentences. Further experimental results show that the generated unanswerable questions can improve multiple machine reading comprehension models. Even using BERT fine-tuning as a strong reading comprehension model, we can still obtain a 1.9 absolute F1 improvement with the BERT-base model and a 1.7 absolute F1 improvement with the BERT-large model.

Related Work
Machine Reading Comprehension (MRC) Various large-scale datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2018; Kocisky et al., 2018) have spurred rapid progress on machine reading comprehension in recent years. SQuAD (Rajpurkar et al., 2016) is an extractive benchmark whose questions and answer spans are annotated by humans. Neural reading comprehension systems (Wang and Jiang, 2017; Seo et al., 2017; Wang et al., 2017; Hu et al., 2018; Huang et al., 2018; Liu et al., 2018; Yu et al., 2018; Wang et al., 2018) have outperformed humans on this task in terms of automatic metrics. The SQuAD 2.0 dataset (Rajpurkar et al., 2018) extends SQuAD with more than 50,000 crowdsourced unanswerable questions. So far, neural reading comprehension models still fall behind humans on SQuAD 2.0. Abstaining from answering when no answer can be inferred from the given document requires more understanding than merely extracting an answer.
Question Generation for MRC In recent years, there has been increasing interest in generating questions for reading comprehension. Du et al. (2017) show that neural models based on the encoder-decoder framework can generate significantly better questions than rule-based systems (Heilman and Smith, 2010). To generate answer-focused questions, one can simply indicate the answer positions in the context with extra features (Yuan et al., 2017; Zhou et al., 2018; Du and Cardie, 2018; Sun et al., 2018; Dong et al., 2019). Song et al. (2018) and Kim et al. (2019) separate answer representations for further matching. Yao et al. (2018) introduce a latent variable for capturing variability and an observed variable for controlling question types. In summary, the above-mentioned systems aim to generate answerable questions given a certain context. In contrast, our goal is to generate unanswerable questions.
Adversarial Examples for MRC To evaluate the language understanding ability of pre-trained systems, Jia and Liang (2017) construct adversarial examples by adding to the paragraph distractor sentences that do not affect question answering for humans. Clark and Gardner (2018) and Tan et al. (2018) use questions to retrieve paragraphs that do not contain the answer as adversarial examples. Rajpurkar et al. (2018) create unanswerable questions through rigid rules, which swap entities, numbers and antonyms of answerable questions.
We categorize data augmentation work for MRC according to the type of the augmentation data: external data sources, paragraphs, or questions. Devlin et al. (2019) fine-tune BERT on the SQuAD dataset jointly with another dataset, TriviaQA (Joshi et al., 2017). Yu et al. (2018) paraphrase paragraphs with backtranslation. Another line of work focuses on generating answerable questions. Yang et al. (2017) propose to generate questions based on unlabeled text for semi-supervised question answering. Sun et al. (2019) propose a rule-based system to generate multiple-choice questions with candidate options from the paragraphs. We aim at generating unanswerable questions as a means of data augmentation.

Problem Formulation
Given an answerable question $q$ and its corresponding paragraph $p$ that contains the answer $a$, we aim to generate an unanswerable question $\tilde{q}$ that fulfills certain requirements. First, it cannot be answered by paragraph $p$. Second, it must be relevant to both the answerable question $q$ and paragraph $p$, which rules out irrelevant questions. Third, it should ask for something of the same type as answer $a$.
As shown in Figure 2, we investigate two simple neural models built upon the encoder-decoder architecture (Cho et al., 2014; Bahdanau et al., 2015) to generate unanswerable questions. A sequence-to-sequence model takes the concatenated paragraph and question as input, and encodes the input in a sequential manner. A pair-to-sequence model is further introduced to capture the interactions between inputs. The decoders of both models generate unanswerable questions sequentially. We factorize the probability of generating the unanswerable question $P(\tilde{q} \mid q, p, a)$ as:
$$P(\tilde{q} \mid q, p, a) = \prod_{t=1}^{|\tilde{q}|} P(\tilde{q}_t \mid \tilde{q}_{<t}, q, p, a)$$
where $\tilde{q}_{<t} = \tilde{q}_1 \ldots \tilde{q}_{t-1}$.

Sequence-to-Sequence Model
In the sequence-to-sequence model, paragraph and question pairs are packed into an ordered sequence $x$ with a special separator in between. To indicate answers in paragraphs, we introduce token type embeddings, which can also be used to distinguish questions from paragraphs in the sequence-to-sequence model. As we can see in Figure 2, the token type can be answer (A), paragraph (P), or question (Q). For a given token, we construct the input representation $e_i$ by summing the corresponding word embeddings, character embeddings and token type embeddings. Here characters are embedded by an embedding matrix followed by a max pooling layer.
We apply a single-layer bidirectional recurrent neural network with long short-term memory units (LSTM; Hochreiter and Schmidhuber, 1997) to produce encoder hidden states $h_i = f_{\mathrm{BiLSTM}}(h_{i-1}, e_i)$. On each decoding step $t$, the hidden states of the decoder (a single-layer unidirectional LSTM network) are computed by $s_t = f_{\mathrm{LSTM}}(s_{t-1}, [y_{t-1}; c_{t-1}])$, where $y_{t-1}$ is the word embedding of the previously predicted token and $c_{t-1}$ is the encoder context vector of the previous step. In addition, we use an attention mechanism to summarize the encoder-side information into $c_t$ for the current step. The attention distribution $\gamma_t$ over source words is computed as in Luong et al. (2015):
$$\gamma_{t,i} = \frac{\exp(\mathrm{score}(h_i, s_t))}{Z_t} \tag{3}$$
$$c_t = \sum_{i=1}^{|x|} \gamma_{t,i} h_i \tag{4}$$
where $Z_t = \sum_{k=1}^{|x|} \exp(\mathrm{score}(h_k, s_t))$ is the normalization term. Next, $s_t$ is concatenated with $c_t$ to produce the vocabulary distribution $P_v$:
$$P_v = \mathrm{softmax}(W_v [s_t; c_t] + b_v)$$
where $W_v$ and $b_v$ are learnable parameters. The copy mechanism (See et al., 2017) is incorporated to directly copy words from inputs, because words in paragraphs or source questions are of great value for unanswerable question generation. Specifically, we use $s_t$ and $c_t$ to produce a gating probability $g_t$:
$$g_t = \mathrm{sigmoid}(W_g [s_t; c_t] + b_g)$$
where $W_g$ and $b_g$ are learnable parameters. The gate $g_t$ determines whether to generate a word from the vocabulary or to copy a word from the inputs. Finally, we obtain the probability of generating $\tilde{q}_t$ by:
$$P(\tilde{q}_t \mid \tilde{q}_{<t}, q, p, a) = g_t P_v(\tilde{q}_t) + (1 - g_t) \sum_{i \in \zeta_{\tilde{q}_t}} \hat{\gamma}_{t,i}$$
where $\zeta_{\tilde{q}_t}$ denotes all the occurrences of $\tilde{q}_t$ in the inputs, and the copying score $\hat{\gamma}_t$ is computed in the same way as the attention scores $\gamma_t$ (see Equation (3)) while using different parameters.
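To make the gated mixture of generation and copying concrete, the following is a minimal pure-Python sketch, not the implementation used in the paper. The function names `softmax` and `output_distribution`, the toy vocabulary, and the uniform scores are all illustrative assumptions; in the model, the logits and copy scores would come from the decoder state and the encoder states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(vocab, vocab_logits, source_tokens, copy_scores, gate):
    """Mix a vocabulary distribution with a copy distribution.

    vocab_logits stand in for W_v [s_t; c_t] + b_v, copy_scores for the
    unnormalized copy scores over source positions, and gate for g_t.
    All inputs here are hypothetical toy values.
    """
    p_vocab = softmax(vocab_logits)
    p_copy = softmax(copy_scores)
    # Generation part: g_t * P_v(w)
    probs = {w: gate * p for w, p in zip(vocab, p_vocab)}
    # Copy part: (1 - g_t) * sum of copy weights over every occurrence of w
    for tok, p in zip(source_tokens, p_copy):
        probs[tok] = probs.get(tok, 0.0) + (1.0 - gate) * p
    return probs


vocab = ["what", "is", "<unk>"]
source = ["victoria", "is", "victoria"]
probs = output_distribution(vocab, [0.0, 0.0, 0.0], source, [0.0, 0.0, 0.0], gate=0.5)
```

In this toy case the out-of-vocabulary token "victoria" receives probability mass only through copying, which is the property the text relies on to relieve the out-of-vocabulary problem.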

Pair-to-Sequence Model
Paragraph and question interactions play a vitally important role in machine reading comprehension. The interactions make the paragraph and question aware of each other and help to predict the answer more precisely. Therefore, we propose a pair-to-sequence model, which conducts attention-based interactions in the encoder and subsequently decodes with two series of representations.
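In the pair-to-sequence model, the paragraph and question are embedded as in the sequence-to-sequence model, but encoded separately by weight-shared bidirectional LSTM networks, yielding $h^p_i = f_{\mathrm{BiLSTM}}(h^p_{i-1}, e^p_i)$ as paragraph encodings and $h^q_j = f_{\mathrm{BiLSTM}}(h^q_{j-1}, e^q_j)$ as question encodings. The same attention mechanism as in the sequence-to-sequence model is used in the following interaction layer to produce question-aware paragraph representations $\tilde{h}^p_i$:
$$\alpha_{i,j} = \frac{\exp(\mathrm{score}(h^p_i, h^q_j))}{Z_i}$$
$$\tilde{h}^p_i = \tanh\Big(W_p \Big[h^p_i; \sum_{j=1}^{|q|} \alpha_{i,j} h^q_j\Big] + b_p\Big)$$
where $Z_i = \sum_{k=1}^{|q|} \exp(\mathrm{score}(h^p_i, h^q_k))$, and $W_p$ and $b_p$ are learnable parameters. Similarly, the paragraph-aware question representations $\tilde{h}^q_j$ are produced by:
$$\beta_{i,j} = \frac{\exp(\mathrm{score}(h^p_i, h^q_j))}{Z_j}$$
$$\tilde{h}^q_j = \tanh\Big(W_q \Big[h^q_j; \sum_{i=1}^{|p|} \beta_{i,j} h^p_i\Big] + b_q\Big)$$
where $Z_j = \sum_{k=1}^{|p|} \exp(\mathrm{score}(h^p_k, h^q_j))$, and $W_q$ and $b_q$ are learnable parameters.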
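The two normalizations in the interaction layer differ only in the axis over which the scores are normalized: over question positions for the paragraph side, and over paragraph positions for the question side. The sketch below illustrates this with plain Python lists. It is a simplified illustration under stated assumptions: a dot product stands in for the score function, the learnable projection layers are omitted, and the helper names (`cross_attend`, `weighted_sum`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Sum of vectors weighted by an attention distribution."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def cross_attend(h_p, h_q):
    """Return attended summaries for both sides.

    p_side[i]: summary of the question for paragraph position i
               (scores normalized over question positions, i.e. Z_i).
    q_side[j]: summary of the paragraph for question position j
               (scores normalized over paragraph positions, i.e. Z_j).
    The dot-product score is an assumption made for this sketch.
    """
    scores = [[dot(hp, hq) for hq in h_q] for hp in h_p]
    p_side = [weighted_sum(softmax(row), h_q) for row in scores]
    q_side = [
        weighted_sum(softmax([scores[i][j] for i in range(len(h_p))]), h_p)
        for j in range(len(h_q))
    ]
    return p_side, q_side


h_p = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three paragraph positions
h_q = [[1.0, 0.0], [0.0, 1.0]]              # two question positions
p_side, q_side = cross_attend(h_p, h_q)
```

In the model, each summary would then be concatenated with the corresponding encoder state and passed through a learned projection to obtain the context-aware representation.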
Accordingly, the decoder now takes the paragraph context $c^p_{t-1}$ and the question context $c^q_{t-1}$ as encoder context, each computed in the same way as $c_t$ (see Equation (4)) in the sequence-to-sequence model, to update decoder hidden states $s_t = f_{\mathrm{LSTM}}(s_{t-1}, [y_{t-1}; c^p_{t-1}; c^q_{t-1}])$ and predict tokens. The copy mechanism is also adopted as described before, and words can be copied from both the paragraph and the question.

Training and Inference
The training objective is to minimize the negative log-likelihood of the aligned unanswerable question $\tilde{q}$ given the answerable question $q$ and its corresponding paragraph $p$ that contains the answer $a$:
$$\mathcal{L}(\theta) = -\sum_{(\tilde{q}, q, p, a) \in \mathcal{D}} \log P(\tilde{q} \mid q, p, a; \theta)$$
where $\mathcal{D}$ is the training corpus and $\theta$ denotes all the parameters. The sequence-to-sequence and pair-to-sequence models are trained with the same objective.
During inference, the unanswerable question for the question answering pair $(q, p, a)$ is obtained via $\arg\max_{q'} P(q' \mid q, p, a)$, where $q'$ represents candidate outputs. Beam search is used to avoid iterating over all possible outputs.
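As an illustration of how beam search approximates the argmax over candidate outputs, here is a minimal sketch over a toy next-token model. The function `next_log_probs`, the end-of-sequence symbol `</s>`, and the toy probabilities are assumptions for the example only; in practice the next-token scores would come from the trained decoder.

```python
import math


def beam_search(next_log_probs, beam_size=5, max_len=20, eos="</s>"):
    """Return hypotheses as (tokens, log_prob), best first.

    next_log_probs(prefix) is a hypothetical callable that maps a token
    prefix to a dict of {next_token: log_probability}.
    """
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Toy distribution where the greedy first choice is not globally best."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix[-1] == "b":
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {"</s>": 0.0}


best_beam2 = beam_search(toy_model, beam_size=2)[0][0]
best_beam1 = beam_search(toy_model, beam_size=1)[0][0]
```

With a beam of size 1 (greedy decoding) the search commits to "a" and ends with a sequence of probability 0.3, whereas a beam of size 2 recovers the sequence starting with "b", which has probability 0.36.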

Experiments
We conduct experiments on the SQuAD 2.0 dataset (Rajpurkar et al., 2018). The extractive machine reading benchmark contains about 100,000 answerable questions and over 50,000 crowdsourced unanswerable questions about Wikipedia paragraphs. Crowdworkers are requested to craft unanswerable questions that are relevant to the given paragraph. Moreover, for each unanswerable question, a plausible answer span is annotated, which indicates the incorrect answer obtained by only relying on type-matching heuristics. Both answers and plausible answers are text spans in the paragraphs.

Training Data Construction
We use (plausible) answer spans in paragraphs as pivots to align pairs of answerable questions and unanswerable questions. An aligned pair is shown in Figure 1. For spans that correspond to multiple answerable and unanswerable questions, we sort the pairs by Levenshtein distance (Levenshtein, 1966) and keep the pair with the minimum distance, making sure that each question is only paired once.
We obtain 20,240 aligned pairs from the SQuAD 2.0 dataset in total. The Levenshtein distance between the answerable and unanswerable questions in pairs is 3.5 on average. Specifically, the 17,475 pairs extracted from the SQuAD 2.0 training set are used to train generation models. Since the SQuAD 2.0 test set is hidden, we randomly sample 46 articles from the SQuAD 2.0 training set with 1,805 (∼10%) pairs as the holdout set and evaluate generation models with the 2,765 pairs extracted from the SQuAD 2.0 development set.
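The alignment procedure described in this subsection can be sketched as follows. This is one possible reading rather than the authors' code: it assumes a token-level Levenshtein distance, identifies a span by a hashable key such as (paragraph id, span text), and pairs questions greedily in order of increasing distance so that each question is used at most once. The function names and the toy questions are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (here, lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def align_pairs(answerable, unanswerable):
    """Align questions that share the same (plausible) answer span.

    Both arguments are lists of (question, span_key). Returns a list of
    (answerable_question, unanswerable_question, distance).
    """
    by_span = {}
    for question, span in unanswerable:
        by_span.setdefault(span, []).append(question)

    candidates = []
    for question, span in answerable:
        for other in by_span.get(span, []):
            dist = levenshtein(question.split(), other.split())
            candidates.append((dist, question, other))

    # Greedy matching: closest pairs first, each question paired once.
    candidates.sort(key=lambda c: c[0])
    used_a, used_u, pairs = set(), set(), []
    for dist, question, other in candidates:
        if question not in used_a and other not in used_u:
            pairs.append((question, other, dist))
            used_a.add(question)
            used_u.add(other)
    return pairs


span = ("p1", "Victoria Department of Education")
answerable = [("who runs public schools in victoria ?", span),
              ("who funds public schools ?", span)]
unanswerable = [("who runs private schools in victoria ?", span)]
pairs = align_pairs(answerable, unanswerable)
```

Here the first answerable question differs from the unanswerable one by a single token, so it is the one retained for the shared span.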

Settings
We implement generation models upon OpenNMT (Klein et al., 2017). We preprocess the corpus with the spaCy toolkit for tokenization and sentence segmentation. We lowercase tokens and build the vocabulary on the SQuAD 2.0 training set with a word frequency threshold of 9 to remove most noisy tokens introduced in data collection and tokenization. We set the dimension of word, character and token type embeddings to 300. We use the glove.840B.300d pre-trained embeddings (Pennington et al., 2014) to initialize word embeddings, and update them further during training. Both encoder and decoder share the same vocabulary and word embeddings. The hidden state size of the LSTM network is 150. Dropout probability is set to 0.2. The data are shuffled and split into mini-batches of size 32 for training. The model is optimized with Adagrad (Duchi et al., 2011) with an initial learning rate of 0.15. During inference, the beam size is 5. We prohibit producing unknown words by setting the score of the <unk> token to -inf. We filter out the beam outputs that are identical to the input question.

Evaluation Metrics
The generation quality is evaluated using three automatic evaluation metrics: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and GLEU (Napoles et al., 2015). BLEU is a commonly used metric in machine translation that computes n-gram precisions over references. The recall-oriented ROUGE metric is widely adopted in summarization, and ROUGE-L measures the longest common subsequence between system outputs and references. GLEU is a variant of BLEU with a modification that penalizes system output n-grams that are present in the input but absent from the reference. This makes GLEU a preferable metric for tasks with subtle but critical differences in a monolingual setting, as in our unanswerable question generation task.
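For readers unfamiliar with n-gram precision, the sketch below computes the clipped (modified) n-gram precision that underlies BLEU for a single reference. It is only the core building block: full BLEU additionally combines several n-gram orders and applies a brevity penalty, and GLEU adds its own penalty for n-grams shared with the input. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


p1 = modified_precision("the the the".split(), "the cat".split(), 1)
p2 = modified_precision("who runs schools ?".split(), "who runs schools ?".split(), 2)
```

Clipping prevents a degenerate output that repeats a single matching word from receiving a perfect score, as the first toy call shows.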
We also conduct human evaluation on 100 samples on three criteria: (1) unanswerability, which indicates whether the question is unanswerable or not; (2) relatedness, which measures semantic relatedness between the generated question and the input question answering pair; (3) readability, which indicates the grammaticality and fluency. We ask three raters to score the generated questions in terms of relatedness and readability on a 1-3 scale (3 for the best) and to judge unanswerability in binary (1 for unanswerable). The raters are not aware of the question generation methods in advance.

Results
Results of the automatic evaluation are shown in Table 1. We find that the proposed pair-to-sequence model, which captures interactions between paragraph and question, performs consistently better than the sequence-to-sequence model. Moreover, replacing the input paragraph with the answer sentence hurts model performance, which indicates that using the whole paragraph as context provides more helpful information for unanswerable question generation. We also try to generate unanswerable questions by relying only on answerable questions (see "-Paragraph") or only on the paragraph (see "-Question"). Unsurprisingly, both ablation models obtain worse performance compared with the full model. These two ablation results also demonstrate that the input answerable question helps more to improve performance than the input paragraph. We argue that the original answerable question provides more direct information, given that the average edit distance between the example pairs is 3.5. Finally, we remove the copy mechanism, which restricts predicted tokens to the vocabulary. The results indicate the necessity of copying tokens from answerable questions and paragraphs to outputs, which relieves the out-of-vocabulary problem.
Table 3 shows the human evaluation results of generated unanswerable questions. We compare with the baseline method TFIDF, which uses the input answerable question to retrieve similar questions about other articles as outputs. The retrieved questions are mostly unanswerable and readable, but they are not quite relevant to the question answering pair. Note that being relevant is shown to be important for data augmentation in further experiments on machine reading comprehension. Here the pair-to-sequence model still outperforms the sequence-to-sequence model in terms of all three metrics, but the differences in human evaluation are not as notable as in the automatic metrics.
As shown in Table 4, we further randomly sample 100 system outputs to analyze the types of generated unanswerable questions. We borrow the types defined in Rajpurkar et al. (2018) for SQuAD 2.0. In Table 4, "S2S" represents the sequence-to-sequence baseline and "P2S" is our proposed pair-to-sequence model. We categorize the outputs with grammatical errors that make them hard to understand into Other. Samples that fall into Impossible Condition are mainly produced by non-entity substitution. We can see that models tend to generate unanswerable questions by inserting negation and swapping entities. These two types are also most commonly used when crowdworkers pose unanswerable questions according to answerable ones. We also find that the current models still have difficulties in utilizing antonyms and exclusion conditions, which could be improved by incorporating external resources.
In Figure 3, we present a sample paragraph and its corresponding answerable questions and generated unanswerable questions. In the first example, the two models generate unanswerable questions by swapping the location entity "Victoria" with "texas" and inserting the negation word "never", respectively. In the second example, the sequence-to-sequence model omits the condition "in Victoria" and yields an answerable question. The pair-to-sequence model properly inserts the negation "no longer", which is not mentioned in the paragraph. In the third example, grammatical errors are found in the output of SEQ2SEQ. The last example shows that inserting negation words in different positions ("n't public" versus "not in victoria") can express different meanings. Such cases are critical for the answerability of generated questions and are hard to handle in a rule-based system.

Question Answering Models
We apply our automatically generated unanswerable questions as augmentation data to the following reading comprehension models:
BiDAF-No-Answer (BNA) BiDAF (Seo et al., 2017) is a benchmark model on extractive machine reading comprehension. Based on BiDAF, Levy et al. (2017) propose the BiDAF-No-Answer model to predict the distribution of answer candidates and the probability of a question being unanswerable at the same time.
DocQA Clark and Gardner (2018) propose the DocQA model to address document-level reading comprehension. The no-answer probability is also predicted jointly.
BERT Fine-Tuning BERT is the state-of-the-art model on machine reading comprehension with unanswerable questions. We adopt the uncased version of BERT (Devlin et al., 2019) for fine-tuning. The batch sizes of BERT-base and BERT-large are set to 12 and 24, respectively. The remaining hyperparameters are kept as in the official instructions for fine-tuning BERT-Large on SQuAD 2.0.

Data Augmentation Setup
We first generate unanswerable questions using the trained generation model. Specifically, we use the answerable questions in the SQuAD 2.0 training set, besides the ones aligned before, to generate unanswerable questions. Then we use the paragraphs and answers of the answerable questions along with the generated questions to construct training examples. In the end, we obtain augmentation data containing 69,090 unanswerable examples.
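A training example built this way could look as follows in the SQuAD 2.0 JSON layout, where unanswerable questions carry `is_impossible: true`, an empty `answers` list, and a `plausible_answers` list. This is a minimal sketch: the helper name, the example question, the `answer_start` offset, and the id scheme for augmented questions are hypothetical and not taken from the paper.

```python
def make_unanswerable_example(answerable_qa, generated_question, new_id):
    """Turn an answerable QA entry plus a generated question into an
    unanswerable entry; the original answer becomes the plausible answer."""
    return {
        "id": new_id,
        "question": generated_question,
        "is_impossible": True,
        "answers": [],
        # Copy the answer dicts so the source entry is left untouched.
        "plausible_answers": [dict(a) for a in answerable_qa["answers"]],
    }


answerable_qa = {
    "id": "q1",  # hypothetical id
    "question": "Since students do not pay tuition, what do they have to pay for schooling in Victoria?",
    "is_impossible": False,
    "answers": [{"text": "some extra costs", "answer_start": 150}],  # offset is illustrative
}
example = make_unanswerable_example(
    answerable_qa,
    "since students do n't pay tuition , what do they have to pay for schooling in victoria ?",
    "q1-aug0",
)
```

The new entry would be appended to the `qas` list of the same paragraph, so the question answering model sees the generated question together with its original context.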
We train question answering models with augmentation data in two separate phases. In the first phase, we train the models by combining the augmentation data and all 86,821 SQuAD 2.0 answerable examples. Subsequently, we use the original SQuAD 2.0 training data alone to further fine-tune model parameters.

Results
Exact Match (EM) and F1 are two metrics used to evaluate model performance. EM measures the percentage of predictions that match ground truth answers exactly. F1 measures the word overlap between the prediction and ground truth answers. We use the pair-to-sequence model with answerable questions and paragraphs for data augmentation by default.
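The two metrics can be sketched for a single prediction and a single gold answer as below. This is a simplified illustration: normalization here is only lowercasing and whitespace tokenization, whereas the official SQuAD evaluation script also strips punctuation and articles and takes the maximum over multiple gold answers. An empty string is used to represent abstaining, which is an assumption of this sketch.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a prediction and one gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # For unanswerable questions, credit is given only if both are empty.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


f1_partial = f1_score("Victoria Department of Education", "Department of Education")
em_abstain = exact_match("", "")
f1_wrong_guess = f1_score("some extra costs", "")
```

The last call shows why unanswerable questions matter for these metrics: answering a question whose gold answer is empty scores zero, so a model that cannot abstain is penalized directly.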
Table 2 shows the exact match and F1 scores of multiple reading comprehension models with and without data augmentation. We can see that the generated unanswerable questions can improve both specifically designed reading comprehension models and strong BERT fine-tuning models, yielding a 1.9 absolute F1 improvement with the BERT-base model and a 1.7 absolute F1 improvement with the BERT-large model. Our submitted model obtains an EM score of 80.75 and an F1 score of 83.85 on the hidden test set.
As shown in Table 5, the pair-to-sequence model proves to be a better option for generating augmentation data than the other three methods. Besides the sequence-to-sequence model, we use answerable questions to retrieve questions from other articles with TFIDF. The retrieved questions are of little help in improving the model, because they are less relevant to the paragraph, as shown in Table 3. We refer to the rule-based method (Jia and Liang, 2017) that swaps entities and replaces words with antonyms as RULE. In comparison to the above methods, the pair-to-sequence model yields the largest improvement.
Results in Table 6 show that enlarging the size of augmentation data can further improve model performance, especially with the BERT-base model. We conduct experiments using two and three times the size of the base augmentation data (i.e., 69,090 unanswerable questions). We generate multiple unanswerable questions for each answerable question by using beam search. Because we only generate unanswerable questions, the data imbalance problem could limit the improvement from incorporating more augmentation data.

Conclusions
In this paper, we propose to generate unanswerable questions as a means of data augmentation for machine reading comprehension. We produce relevant unanswerable questions by editing answerable questions and conditioning on the corresponding paragraph. A pair-to-sequence model is introduced in order to capture the interactions between question and paragraph. We also present a way to construct training data for unanswerable question generation models. Both automatic and human evaluations show that the proposed model consistently outperforms the sequence-to-sequence baseline. The results on the SQuAD 2.0 dataset show that our generated unanswerable questions can help to improve multiple reading comprehension models. As for future work, we would like to enhance the ability to utilize antonyms for unanswerable question generation by leveraging external resources.

Figure 2: Diagram of the proposed pair-to-sequence model and sequence-to-sequence model. The input embeddings are the sum of the word embeddings, the character embeddings and the token type embeddings. The input questions are all answerable.

Figure 3: Sample outputs generated by humans, the sequence-to-sequence model, and the pair-to-sequence model. The (plausible) answer spans of questions are marked in colors and the main differences between model outputs are underlined.

Table 1: Automatic evaluation results. Higher scores are better and the best performance for each evaluation metric is highlighted in boldface. "-Paragraph (+AS)" represents replacing paragraphs with answer sentences.

Table 2: Experimental results of applying data augmentation to reading comprehension models on the SQuAD 2.0 dataset. "Δ" indicates absolute improvement.

Table 4: Types of unanswerable questions generated by models and humans. We refer the reader to Rajpurkar et al. (2018) for detailed definitions of each type.

Example shown in Figure 3:
Title: Victoria (Australia)
Paragraph: Victorian schools are either publicly or privately funded. Public schools, also known as state or government schools, are funded and run directly by the Victoria Department of Education. Students do not pay tuition fees, but some extra costs are levied. Private fee-paying schools include parish schools run by the Roman Catholic Church and independent schools similar to British public schools. Independent schools are usually affiliated with Protestant churches. Victoria also has several private Jewish and Islamic primary and secondary schools. Private schools also receive some ...
Question: Since students do not pay tuition, what do they have to pay for schooling in Victoria?
Human: What is covered by the state in addition to tuition?
SEQ2SEQ: since students do not pay to pay schooling in victoria ?
PAIR2SEQ: since students do n't pay tuition , what do they have to pay for schooling in victoria ?
(Plausible) Answer: some extra costs

Table 5: Results using different generation methods for data augmentation. "Δ" indicates absolute improvement.

Table 6: Ablation over the size of data augmentation. "× N" means the augmentation data is enlarged to N times its original size. "Δ" indicates absolute improvement.