Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space

In this paper, we propose a novel data augmentation method, referred to as Controllable Rewriting based Question Data Augmentation (CRQDA), for machine reading comprehension (MRC), question generation, and question-answering natural language inference tasks. We treat the question data augmentation task as a constrained question rewriting problem to generate context-relevant, high-quality, and diverse question data samples. CRQDA utilizes a Transformer autoencoder to map the original discrete question into a continuous embedding space. It then uses a pre-trained MRC model to revise the question representation iteratively with gradient-based optimization. Finally, the revised question representations are mapped back into the discrete space and serve as additional question data. Comprehensive experiments on SQuAD 2.0, SQuAD 1.1 question generation, and QNLI tasks demonstrate the effectiveness of CRQDA.


Introduction
Data augmentation (DA) is commonly used to improve the generalization ability and robustness of models by generating more training examples. Compared with the DA used in the fields of computer vision (Krizhevsky et al., 2012; Szegedy et al., 2015; Cubuk et al., 2019) and speech processing (Ko et al., 2015), how to design effective DA tailored to natural language processing (NLP) tasks remains a challenging problem. Unlike general image DA techniques such as rotation and cropping, it is more difficult to synthesize new high-quality and diverse text.
Recently, some textual DA techniques have been proposed for NLP, which mainly focus on text classification and machine translation tasks. One way is to directly modify the text locally with word deletion, word order changing, and word replacement (Fadaee et al., 2017; Kobayashi, 2018; Wei and Zou, 2019). Another popular way is to utilize generative models to generate new text data, such as back-translation (Sennrich et al., 2016; Yu et al., 2018), data noising techniques (Xie et al., 2017), and pre-trained language generation models (Kumar et al., 2020; Anaby-Tavor et al., 2020).
Machine reading comprehension (MRC) (Rajpurkar et al., 2018), question generation (QG) (Du et al., 2017), and question-answering natural language inference (QNLI) (Demszky et al., 2018; Wang et al., 2018) are receiving attention in the NLP community. MRC requires the model to find the answer given a paragraph and a question, while QG aims to generate the question for a given paragraph with or without a given answer. Given a question and a sentence in the relevant paragraph, QNLI requires the model to infer whether the sentence contains the answer to the question. Because these tasks require the model to reason about the question-paragraph pair, existing textual DA methods that directly augment question or paragraph data alone may produce irrelevant question-paragraph pairs, which cannot improve downstream model performance.
Question data augmentation (QDA) aims to automatically generate context-relevant questions to further improve model performance on the above tasks. Existing QDA methods mainly employ round-trip consistency (Alberti et al., 2019) to synthesize answerable questions. However, the round-trip consistency method cannot generate context-relevant unanswerable questions, whereas MRC with unanswerable questions is a challenging task (Rajpurkar et al., 2018; Kwiatkowski et al., 2019). Zhu et al. (2019) were the first to study unanswerable question DA; their method relies on annotated plausible answers to construct a small pseudo parallel corpus of answerable-to-unanswerable questions for unanswerable question generation. Unfortunately, most question answering (QA) and MRC datasets do not provide such annotated plausible answers.
Inspired by the recent progress in controllable text revision and text attribute transfer (Liu et al., 2020), we propose a new QDA method called Controllable Rewriting based Question Data Augmentation (CRQDA), which can generate both new context-relevant answerable questions and unanswerable questions. The main idea of CRQDA is to treat the QDA task as a constrained question rewriting problem. Instead of revising the discrete question directly, CRQDA revises the original question in a continuous embedding space under the guidance of a pre-trained MRC model. CRQDA has two components: (i) a Transformer-based autoencoder, whose encoder maps the question into a latent representation and whose decoder reconstructs the question from that representation; and (ii) an MRC model pre-trained on the original dataset, which guides the revision of the question representation so that the reconstructed question is a context-relevant unanswerable or answerable question. The original question is first mapped into a continuous embedding space. Next, the pre-trained MRC model provides the guidance to revise the question representation iteratively with gradient-based optimization. Finally, the revised question representations are mapped back into the discrete space and act as additional question data for training.
In summary, our contributions are as follows: (1) We propose a novel controllable rewriting based QDA method, which can generate additional high-quality, context-relevant, and diverse answerable and unanswerable questions. (2) We compare the proposed CRQDA with state-of-the-art textual DA methods on the SQuAD 2.0 dataset, and CRQDA consistently outperforms all those strong baselines.
(3) In addition to MRC tasks, we further apply CRQDA to question generation and QNLI tasks, and comprehensive experiments demonstrate its effectiveness.

Related Works
Recently, textual data augmentation has attracted a lot of attention. One popular class of textual DA methods is confined to locally modifying text in the discrete space to synthesize new data. Wei and Zou (2019) propose a universal DA technique for NLP called easy data augmentation (EDA), which performs synonym replacement, random insertion, random swap, or random deletion to modify the original text. Jungiewicz and Smywinski-Pohl (2019) propose a word synonym replacement method based on WordNet. Kobayashi (2018) replaces words based on paradigmatic relations. More recently, CBERT retrofits BERT (Devlin et al., 2018) into a conditional BERT that predicts masked tokens for word replacement. These DA methods are mainly designed for text classification tasks.
Unlike modifying a few local words, another commonly used textual DA approach is to use a generative model to generate entirely new textual samples, including variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), generative adversarial networks (GANs) (Tanaka and Aranha, 2019), and pre-trained language generation models (Radford et al., 2019; Kumar et al., 2020; Anaby-Tavor et al., 2020). Back-translation (Sennrich et al., 2016; Yu et al., 2018) is also a major approach, which uses a machine translation model to translate English sentences into another language (e.g., French) and back into English. Besides, data noising techniques (Xie et al., 2017; Marivate and Sefara, 2019) and paraphrasing (Kumar et al., 2019) have been proposed to generate new textual samples. All the methods mentioned above usually generate individual sentences separately. For QDA on MRC, QG, and QNLI tasks, these DA approaches cannot guarantee that the generated questions are relevant to the given paragraph. To generate context-relevant answerable and unanswerable questions, our CRQDA method utilizes a pre-trained MRC model as guidance to revise the question in a continuous embedding space, which can be seen as a special constrained paraphrasing method for QDA.
Question generation (Heilman and Smith, 2010; Du et al., 2017; Zhang and Bansal, 2019) is attracting attention in the field of natural language generation (NLG). However, most previous works are not designed for QDA; that is, they do not aim to generate context-relevant questions for improving downstream model performance. Compared to QG, QDA is relatively unexplored. Recently, some works (Alberti et al., 2019) utilize the round-trip consistency technique to synthesize answerable questions: they first use a generative model to generate a question with the paragraph and answer as input, and then use a pre-trained MRC model to filter the synthetic question data. However, they are unable to generate context-relevant unanswerable questions. It should be noted that our method and round-trip consistency are orthogonal: CRQDA can also rewrite the synthetic question data produced by other methods to obtain new answerable and unanswerable question data. Unanswerable QDA was first explored by Zhu et al. (2019), who construct a small pseudo parallel corpus of paired answerable and unanswerable questions and then generate relevant unanswerable questions in a supervised manner. This method relies on annotated plausible answers for the unanswerable questions, which are not available in most QA and MRC datasets. Instead, our method rewrites the original answerable question into a relevant unanswerable question in an unsupervised paradigm, and can also rewrite the original answerable question into another new relevant answerable question.
Our method is inspired by the recent progress on controllable text revision and text attribute transfer (Liu et al., 2020). However, our approach differs in several ways. First, those methods transfer the attribute of a single sentence alone, while our method considers the given paragraph to rewrite a context-relevant question. Second, existing methods jointly train an attribute classifier to revise the sentence representation, while our method utilizes a pre-trained MRC model that shares the embedding space with the autoencoder as the guidance to revise the question representation. Finally, the questions generated by our method serve as augmented data that benefit the downstream tasks.

Problem Formulation
We consider an extractive MRC dataset D, such as SQuAD 2.0 (Rajpurkar et al., 2018), which consists of |D| 5-tuples (q, d, s, e, t), where |D| is the data size, q = {q_1, ..., q_n} is a tokenized question of length n, d = {d_1, ..., d_m} is a tokenized paragraph of length m, s, e ∈ {0, 1, ..., m − 1} are inclusive indices pointing to the start and end of the answer span, and t ∈ {0, 1} represents whether the question q is answerable or unanswerable given d. Given a data tuple (q, d, s, e, t), we aim to rewrite q into a new answerable or unanswerable question q′ and obtain a new data tuple (q′, d, s, e, t′) that fulfills certain requirements: (i) a generated answerable question can be answered by the answer span (s, e) in d, while a generated unanswerable question cannot be answered with d; (ii) the generated question should be relevant to the original question q and the paragraph d; (iii) the augmented dataset D′ should further improve the performance of MRC models.

Figure 1 shows the overall architecture of CRQDA. The proposed model consists of two components: a pre-trained language model based MRC model as described in § 3.3, and a Transformer-based autoencoder as introduced in § 3.4. Given a question q from the original dataset D, we first map q into a continuous embedding space. We then revise the question embeddings by gradient-based optimization under the guidance of the MRC model (§ 3.5). Finally, the revised question embeddings are fed into the Transformer-based autoencoder to generate new question data.
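The 5-tuple above can be captured in a small data structure. The sketch below is purely illustrative: the class and field names are our own, not taken from any actual SQuAD loader.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MRCExample:
    """One (q, d, s, e, t) tuple from an extractive MRC dataset.

    Field names are illustrative; real SQuAD readers use their own schema.
    """
    q: List[str]   # tokenized question, length n
    d: List[str]   # tokenized paragraph, length m
    s: int         # inclusive start index of the answer span in d
    e: int         # inclusive end index of the answer span in d
    t: int         # 1 = answerable, 0 = unanswerable

    def answer_tokens(self) -> List[str]:
        """Answer span d[s..e] for answerable questions, else empty."""
        return self.d[self.s:self.e + 1] if self.t == 1 else []

ex = MRCExample(
    q="when was the tower built".split(),
    d="the tower was built in 1889 by gustave eiffel".split(),
    s=5, e=5, t=1,
)
assert ex.answer_tokens() == ["1889"]
```

Rewriting then amounts to producing a new `MRCExample` with a revised `q` and, for the unanswerable case, `t` flipped to 0.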

Pre-trained Language Model based MRC Model
In this paper, we adopt pre-trained language model based MRC models (e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b)) as our baseline MRC models. Without loss of generality, we take the BERT-based MRC model as an example to introduce our method, as shown in the left part of Figure 1. Following Devlin et al. (2018), given a data tuple (q, d, s, e, t), we concatenate a "[CLS]" token, the tokenized question q of length n, a "[SEP]" token, the tokenized paragraph d of length m, and a final "[SEP]" token, and feed the resulting sequence into the BERT model. The question q and paragraph d are first mapped into two sequences of embeddings:

E_q = BertEmbedding(q), E_d = BertEmbedding(d),

where BertEmbedding(·) denotes the BERT embedding layer, which sums the corresponding token, segment, and position embeddings, and E_q ∈ R^{(n+2)×h} and E_d ∈ R^{m×h} are the question and paragraph embeddings. E_q and E_d are further fed into the BERT layers, which consist of multiple Transformer layers (Vaswani et al., 2017), to obtain the final hidden representations shown in Figure 1. The representation vector C ∈ R^h corresponding to the first input token ([CLS]) is fed into a binary classification layer to output the probability of whether the question is answerable:

P_a(t | q, d) = softmax(W_c C + b_c),

where W_c ∈ R^{2×h} and b_c ∈ R^2 are trainable parameters. The final hidden representations of the paragraph {T_d_1, ..., T_d_m} ∈ R^{m×h} are fed into two classification layers to output the probabilities of the start and end positions of the answer span:

P_s = softmax(W_s T_d + b_s), P_e = softmax(W_e T_d + b_e),

where W_s ∈ R^{1×h}, W_e ∈ R^{1×h}, b_s ∈ R^1, and b_e ∈ R^1 are trainable parameters. For the data tuple (q, d, s, e, t), the total loss of the MRC model can be written as

L_mrc = L_a + λ (L_s + L_e),

where L_a, L_s, and L_e are the cross-entropy losses of the answerability, start-position, and end-position predictions, and λ is a hyper-parameter.

Transformer-based Autoencoder
As shown in the right part of Figure 1, the original question q is first mapped into the question embedding E_q with the BERT embedding layer. Note that the Transformer encoder and the pre-trained MRC model share the parameters of the embedding layer, which places the question embeddings of the two models in the same continuous embedding space.
We obtain the encoder hidden states H_enc ∈ R^{(n+2)×h} from the Transformer encoder. The objective of the Transformer autoencoder is to reconstruct the input question itself, optimized with a cross-entropy loss (Dai and Le, 2015). A trivial solution would be for the decoder to simply copy tokens. To avoid this, we do not feed the whole H_enc to the decoder; instead, following prior work, we use an RNN-GRU (Cho et al., 2014) layer with sum pooling to obtain a latent vector z ∈ R^h, and feed z to the decoder to reconstruct the question.
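The GRU-with-sum-pooling bottleneck can be sketched as follows. This is a minimal numpy illustration with randomly initialized weights, not the trained component: the gate equations follow the standard GRU formulation (Cho et al., 2014), and the shapes mirror the (n+2) encoder positions described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (Cho et al., 2014) with random, untrained weights."""
    def __init__(self, h, rng):
        sc = 0.1
        self.Wz, self.Uz = rng.standard_normal((h, h)) * sc, rng.standard_normal((h, h)) * sc
        self.Wr, self.Ur = rng.standard_normal((h, h)) * sc, rng.standard_normal((h, h)) * sc
        self.Wh, self.Uh = rng.standard_normal((h, h)) * sc, rng.standard_normal((h, h)) * sc

    def step(self, x, hprev):
        z = sigmoid(self.Wz @ x + self.Uz @ hprev)      # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ hprev)      # reset gate
        htil = np.tanh(self.Wh @ x + self.Uh @ (r * hprev))
        return z * hprev + (1 - z) * htil               # new hidden state

def pool_to_latent(H_enc, cell):
    """Run the GRU over encoder states and sum-pool its outputs into z in R^h."""
    h = H_enc.shape[1]
    state, outputs = np.zeros(h), []
    for x in H_enc:                                     # one GRU step per position
        state = cell.step(x, state)
        outputs.append(state)
    return np.sum(outputs, axis=0)                      # sum pooling -> latent z

rng = np.random.default_rng(0)
n, h = 10, 16                               # toy question length and hidden size
H_enc = rng.standard_normal((n + 2, h))     # (n+2) positions: [CLS] q [SEP]
z = pool_to_latent(H_enc, GRUCell(h, rng))
assert z.shape == (h,)
```

Because the decoder only ever sees the pooled vector z rather than all of H_enc, it cannot trivially copy the input token by token, which is exactly the motivation given above.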
We can train the autoencoder on the question data of D or pre-train it on other large-scale corpora, such as BookCorpus (Zhu et al., 2015) and English Wikipedia.

Algorithm 1: Question Rewriting with Gradient-based Optimization.

Rewriting Question with Gradient-based Optimization
As mentioned above, the question embeddings of the Transformer encoder and the pre-trained MRC model lie in the same continuous embedding space, where we can revise the question embedding with gradient guidance from the MRC model. The revised question embedding E′_q is fed into the Transformer autoencoder to generate a new question q′. Figure 2 illustrates the question rewriting process. We take the rewriting of an answerable question into a relevant unanswerable question as an example. Given an answerable question q, the goals of the rewriting are: (I) the revised question embedding should make the pre-trained MRC model predict the question as unanswerable with the paragraph d; (II) the modification size of E′_q should be adaptive, to prevent the revision of E′_q from falling into a local optimum; (III) the revised question q′ should remain similar to the original q, which helps to improve the robustness of the model.
For goal-(I), we take the label t′ = 0, denoting that the question is unanswerable, to calculate the loss L_a(t′) and the gradient with respect to E′_q through the pre-trained MRC model (see the red line in Figure 2). We iteratively revise E′_q with gradients from the pre-trained MRC model until the MRC model predicts the question to be unanswerable given the revised E′_q as input, i.e., until P_a(t′ | E′_q) is larger than a threshold β_t. Note that the gradient is used only to modify E′_q; all model parameters are fixed during the rewriting process. Each iteration can be written as

E′_q ← E′_q − η ∇_{E′_q} L_a(t′),

where η is the step size. Similarly, we can revise the E′_q of a data tuple (q, d, s, e, t) to generate a new answerable question whose answer is still the original answer span (s, e):

E′_q ← E′_q − η ∇_{E′_q} (L_s(s) + L_e(e)).

Rewriting an answerable question into another answerable question can be seen as a special constrained paraphrasing, which requires that the paraphrased question be context-relevant, answerable, and have an unchanged answer. For goal-(II), we follow prior work and employ the dynamic-weight-initialization method, which allocates a set of step sizes S_η = {η_i} as initial step sizes. For each initial step size, we perform a pre-defined maximum number of revision steps with step-size decay (Algorithm 1, lines 2-11) to find the target question embedding. For goal-(III), we select the q′ whose unigram word overlap rate with the original question q is within a threshold range [β_a, β_b]. The unigram word overlap is computed as

overlap(q, q′) = |{w_q} ∩ {w_q′}| / |{w_q} ∪ {w_q′}|,

where w_q is a word in q and w_q′ is a word in q′. The whole question rewriting procedure is summarized in Algorithm 1.
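The rewriting loop can be sketched end to end on a toy problem. In the sketch below a tiny logistic scorer stands in for the frozen pre-trained MRC model, the gradient of its loss is computed analytically, and the thresholds, step sizes, decay factor, and the Jaccard-style overlap are all our illustrative choices rather than the paper's exact values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rewrite_embedding(E_q, w, b, beta_t=0.8, etas=(2.0, 4.0, 8.0),
                      max_step=5, decay=0.8):
    """Revise E_q by gradient descent on L_a(t' = unanswerable) until the toy
    scorer predicts 'unanswerable' with probability > beta_t.

    P(unans | E) = sigmoid(w . mean(E) + b) is a stand-in for the frozen
    pre-trained MRC model; only E_q is updated, never w or b."""
    for eta0 in etas:                       # dynamic initial step sizes
        E, eta = E_q.copy(), eta0
        for _ in range(max_step):
            p = sigmoid(w @ E.mean(axis=0) + b)
            if p > beta_t:                  # MRC now says 'unanswerable'
                return E, p
            # Gradient of L = -log p w.r.t. each row of E (chain rule).
            grad_rows = -(1.0 - p) * w / E.shape[0]
            E = E - eta * grad_rows         # E'_q <- E'_q - eta * grad
            eta *= decay                    # step-size decay within one run
    return None, None                       # no initial step size succeeded

def unigram_overlap(q, q_new):
    """Shared-unigram rate; one plausible reading of the paper's filter."""
    a, b = set(q), set(q_new)
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
E_q = rng.standard_normal((6, 4)) * 0.1     # toy question embedding, n=6, h=4
w, b = np.ones(4), -2.0                     # toy scorer: starts near p ~ 0.12
E_new, p = rewrite_embedding(E_q, w, b)
assert E_new is not None and p > 0.8
assert unigram_overlap("when was it built".split(),
                       "when was the tower built".split()) == 0.5
```

The structure mirrors Algorithm 1: an outer loop over initial step sizes, an inner bounded loop with decay, a stopping test against β_t, and a final unigram-overlap filter against the [β_a, β_b] range before a rewritten question is kept.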
Experiments

We first evaluate CRQDA on the SQuAD 2.0 MRC task, with results reported in § 4.1. The ablation study and further analysis are provided in § 4.2. We then evaluate our method on two additional tasks: question generation on the SQuAD 1.1 dataset (Rajpurkar et al., 2016) and QNLI.

SQuAD
The extractive MRC benchmark SQuAD 2.0 dataset contains about 100,000 answerable questions and over 50,000 unanswerable questions. Each question is paired with a Wikipedia paragraph.
Implementation Based on the RobertaForQuestionAnswering model of Huggingface (Wolf et al., 2019), we train a RoBERTa_large model on SQuAD 2.0 as the pre-trained MRC model for CRQDA. The hyper-parameters are the same as in the original paper (Liu et al., 2019b). To train the autoencoder, we copy the word embedding parameters of the pre-trained MRC model to the autoencoder and fix them during training. Its encoder and decoder each consist of 6 Transformer layers, where the inner dimension of the feed-forward networks (FFN), the hidden state size, and the number of attention heads are set to 4096, 1024, and 16, respectively. The autoencoder is trained on BookCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018). The sequence length, batch size, learning rate, and number of training steps are set to 64, 256, 5e-5, and 100,000. For each original answerable data sample, we use CRQDA to generate new unanswerable question data, resulting in about 220K data samples (including the original data samples). The hyper-parameters β_s, β_t, β_a, β_b, and max-step are set to 0.9, 0.5, 0.5, 0.99, and 5, respectively.
Baselines We compare CRQDA against the following baselines: (1) EDA (Wei and Zou, 2019): it augments question data by performing synonym replacement, random insertion, random swap, or random deletion; we use their source code to synthesize a new question for each question of SQuAD 2.0. (2) Back-Translation (Yu et al., 2018; Prabhumoye et al., 2018): it uses a machine translation model to translate questions into French and back into English; we generate a new question for each original question. (3) Text-VAE (Bowman et al., 2016; Liu et al., 2019a): it uses an RNN-based VAE to generate a new question for each question of SQuAD 2.0; the implementation is based on the released source code. (4) AE with Noise: it uses the same autoencoder as CRQDA for question rewriting; the only difference is that the autoencoder does not use the MRC gradient but instead revises the question embedding with noise sampled from a Gaussian distribution. This baseline is designed to show the necessity of the pre-trained MRC model. (5) 3M synth (Alberti et al., 2019): it employs the round-trip consistency technique to synthesize 3M questions on SQuAD 2.0. (6) UNANSQ (Zhu et al., 2019): it employs a pair-to-sequence model to generate 69,090 unanswerable questions. Following previous methods (Zhu et al., 2019; Alberti et al., 2019), we use each augmented dataset to fine-tune a BERT_large model, with the implementation also based on Huggingface.
Results For SQuAD 2.0, Exact Match (EM) and F1 score are used as evaluation metrics. The results on the SQuAD 2.0 development set are shown in Table 1. The popular textual DA methods (EDA, Back-Translation, Text-VAE, and AE with Noise) do not improve the performance of the MRC model. One possible reason is that they introduce detrimental noise into the training process, since they augment question data without considering the paragraphs and the associated answers. In sharp contrast, the QDA methods (3M synth, UNANSQ, and CRQDA) improve model performance. Moreover, our CRQDA outperforms all the strong baselines, bringing about 1.9 absolute EM score and 1.5 absolute F1 score improvement with BERT_large. We provide augmented data samples from each baseline in Appendix A.

Ablation and Analysis
Our ablation study and further analysis are designed to answer the following questions. Q1: How useful is the augmented data synthesized by our method when used to train other MRC models? Q2: How does the choice of corpora for autoencoder training influence performance? Q3: How do different CRQDA augmentation strategies influence model performance?

To answer the first question (Q1), we use the augmented SQuAD 2.0 dataset from § 4.1 to train different MRC models (BERT_base, BERT_large, RoBERTa_base, and RoBERTa_large). The hyper-parameters and implementation are based on Huggingface (Wolf et al., 2019). The results are presented in Table 2. CRQDA improves the performance of every MRC model, yielding 2.4 absolute F1 improvement with BERT_base and 1.5 absolute F1 improvement with RoBERTa_base. Moreover, although a RoBERTa_large model guides the rewriting of the question data, the augmented dataset can still further improve that model's performance.

Table 3: Results of training the autoencoder on different corpora. R-L is short for ROUGE-L, and B4 is short for BLEU-4.
For the second question (Q2), we first measure the reconstruction performance of autoencoders trained on different corpora, evaluated on the question data of the SQuAD 2.0 development set with BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004). We then use these autoencoders for CRQDA question rewriting with the same settings as in § 4.1, and use the resulting augmented SQuAD 2.0 datasets to fine-tune a BERT_base model; the fine-tuned models' performance is reported in Table 3. We observe that with more training data, the autoencoder reconstructs better, and the fine-tuned BERT_base model also performs better. When trained on Wiki and Wiki+Mask, the autoencoders can reconstruct almost all questions well, with the Wiki+Mask model reconstructing best. However, the BERT_base model fine-tuned with the autoencoder trained on Wiki outperforms the one trained on Wiki+Mask. The reason might be that an autoencoder trained with a denoising task becomes insensitive to the word embedding revisions of CRQDA; in other words, some revisions guided by the MRC gradients may be filtered out as noise by the denoising autoencoder.

For the last question (Q3), we use CRQDA for question data augmentation under different settings. For each answerable question in the SQuAD 2.0 training set, we use CRQDA to generate both answerable and unanswerable question examples. The augmented unanswerable question data (unans), the augmented answerable question data (ans), and their union (ans + unans) are then used to fine-tune the RoBERTa_large model. To further analyze the effect of β_a (a larger β_a value means that the generated questions are closer to the original questions in the discrete space), we use β_a = 0.3, 0.5, 0.7 for question rewriting. The results are reported in Table 4. The MRC model achieves the best performance when β_a = 0.5. Moreover, all of the unans, ans, and ans + unans augmented datasets improve performance. However, the RoBERTa_large model fine-tuned on ans + unans performs worse than the one fine-tuned on unans only; this mixed result shows that using more augmented data is not always beneficial.

Question Generation
The answer-aware question generation task (Zhou et al., 2017) aims to generate a question for a given answer span within a paragraph. We apply our CRQDA method to SQuAD 1.1 (Rajpurkar et al., 2016).

Table 6 (QNLI accuracy): BERT_large (Devlin et al., 2018) 92.3; BERT_large + CRQDA 93.0.

We use BLEU-4, METEOR (Banerjee and Lavie, 2005), and ROUGE-L metrics for evaluation, and we split the SQuAD 1.1 dataset into training, development, and test sets. We also report results under another data split setting, as in Yan et al. (2020), which swaps the development and test sets. The results are shown in Table 5. We can see that CRQDA improves ProphetNet on all three metrics and achieves a new state of the art on this task.

QNLI
Given a question and a context sentence, question-answering NLI asks the model to infer whether the context sentence contains the answer to the question. The QNLI dataset (Wang et al., 2018) contains 105K data samples. We apply CRQDA to QNLI to generate new entailment and non-entailment data samples. Note that this task does not involve an MRC model but rather a text entailment classification model; accordingly, we train a BERT_large model based on the BertForSequenceClassification code in Huggingface and use it in place of the pre-trained MRC model of CRQDA to guide the question rewriting. Following the settings in § 4.1, we use CRQDA to synthesize about 42K new data samples as augmented data. Note that we only rewrite the question and keep the paired context sentence unchanged. The augmented data and the original dataset are then used to fine-tune the BERT_large model. Table 6 shows the results: CRQDA increases the accuracy of the BERT_large model by 0.7%, which further demonstrates its effectiveness.

Conclusion
In this work, we present a novel question data augmentation method, called CRQDA, for context-relevant answerable and unanswerable question generation. CRQDA treats the question data augmentation task as a constrained question rewriting problem. Under the guidance of a pre-trained MRC model, the original question is revised in a continuous embedding space with gradient-based optimization and then decoded back into the discrete space as a new question data sample. The experimental results demonstrate that CRQDA outperforms other strong baselines on SQuAD 2.0, and the CRQDA-augmented datasets improve multiple reading comprehension models. Furthermore, CRQDA improves model performance on question generation and question-answering natural language inference tasks, achieving a new state of the art on the SQuAD 1.1 question generation task.

A Augmented Dataset

Figure 3 and Figure 4 provide some augmented data samples from each baseline on SQuAD 2.0. The baseline EDA tends to introduce noise that destroys the original sentence structure. Text-VAE, Back-Translation, and AE with Noise often change important words of the original question, which can cause the augmented question to lose key information and no longer support inference of the original answer. In contrast, the answerable questions generated by CRQDA maintain the key information needed to infer the original answer, and its generated unanswerable questions tend to introduce context-relevant words that convert an original answerable question into an unanswerable one.