Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs

One of the most crucial challenges in question answering (QA) is the scarcity of labeled data, since it is costly to obtain question-answer pairs for a target text domain via human annotation. An alternative approach to tackling this problem is to use QA pairs automatically generated either from the problem context or from large amounts of unstructured text (e.g., Wikipedia). In this work, we propose a hierarchical conditional variational autoencoder (HCVAE) that generates QA pairs given unstructured texts as contexts, while maximizing the mutual information between the generated QA pairs to ensure their consistency. We validate our Information Maximizing Hierarchical Conditional Variational AutoEncoder (Info-HCVAE) on several benchmark datasets against state-of-the-art baseline models, by evaluating the performance of a QA model (BERT-base) trained either on the generated QA pairs alone (QA-based evaluation) or on both the generated and human-labeled pairs (semi-supervised learning). The results show that our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.


Introduction
Extractive Question Answering (QA) is one of the most fundamental and important tasks for natural language understanding. Thanks to the increased complexity of deep neural networks and the use of knowledge transfer from language models pretrained on large-scale corpora (Peters et al., 2018; Devlin et al., 2019; Dong et al., 2019), state-of-the-art QA models have achieved human-level performance on several benchmark datasets (Rajpurkar et al., 2016, 2018). However, what is also crucial to the success of these recent data-driven models is the availability of large-scale QA datasets. To deploy state-of-the-art QA models to real-world applications, we need to construct high-quality datasets with large volumes of QA pairs to train them; however, this is costly, requiring a massive amount of human effort and time.

* Equal contribution. 1 The generated QA pairs and the code can be found at https://github.com/seanie12/Info-HCVAE

Table 1: QA pairs generated by our model from a short paragraph.
Paragraph (input): Philadelphia has more murals than any other U.S. city, thanks in part to the 1984 creation of the department of recreation's mural arts program, . . . The program has funded more than 2,800 murals.
Q1: which city has more murals than any other city? A1: philadelphia
Q2: why philadelphia has more murals? A2: the 1984 creation of the department of recreation's mural arts program
Q3: when did the department of recreation's mural arts program start? A3: 1984
Q4: how many murals funded the graffiti arts program by the department of recreation? A4: more than 2,800
Question generation (QG), or question-answer pair generation (QAG), is a popular approach to overcoming this data scarcity challenge. Some recent works resort to semi-supervised learning, leveraging large amounts of unlabeled text (e.g., Wikipedia) to generate synthetic QA pairs with the help of QG systems (Tang et al., 2017; Yang et al., 2017; Tang et al., 2018; Sachan and Xing, 2018). However, existing QG systems have overlooked an important point: generating QA pairs from a context of unstructured text is essentially a one-to-many problem. Sequence-to-sequence models are known to generate generic sequences without much variety (Zhao et al., 2017a), as they are trained with maximum likelihood estimation. This is highly suboptimal for QAG, since the contexts given to the model often contain rich information that could be exploited to generate multiple QA pairs. To tackle this issue, we propose a novel probabilistic deep generative model for QA pair generation. Specifically, our model is a hierarchical conditional variational autoencoder (HCVAE) with two separate latent spaces for question and answer conditioned on the context, where the answer latent space is additionally conditioned on the question latent space. During generation, this hierarchical conditional VAE first generates an answer given a context, and then generates a question given both the answer and the context, by sampling from both latent spaces. This probabilistic approach allows the model to generate diverse QA pairs focusing on different parts of a context each time.
Another crucial challenge of the QG task is to ensure the consistency between a question and its corresponding answer, since they should be semantically dependent on each other such that the question is answerable from the given answer and the context. In this paper, we tackle this consistency issue by maximizing the mutual information (Belghazi et al., 2018;Hjelm et al., 2019;Yeh and Chen, 2019) between the generated QA pairs. We empirically validate that the proposed mutual information maximization significantly improves the QA-pair consistency. Combining both the hierarchical CVAE and the InfoMax regularizer together, we propose a novel probabilistic generative QAG model which we refer to as Information Maximizing Hierarchical Conditional Variational AutoEncoder (Info-HCVAE). Our Info-HCVAE generates diverse and consistent QA pairs even from a very short context (see Table 1).
But how should we quantitatively measure the quality of the generated QA pairs? Popular evaluation metrics for text generation (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin and Hovy, 2002), METEOR (Banerjee and Lavie, 2005)) only tell us how similar the generated QA pairs are to the ground-truth (GT) QA pairs, and are not directly correlated with their actual quality (Nema and Khapra, 2018; Zhang and Bansal, 2019). Therefore, we use the QA-based Evaluation (QAE) metric proposed by Zhang and Bansal (2019), which measures how well the generated QA pairs match the distribution of GT QA pairs. Yet, in a semi-supervised learning setting where we already have GT labels, the additional QA pairs are truly effective only if they are novel and different from the GT QA pairs. Thus, we propose a novel metric, Reverse QAE (R-QAE), which is low if the generated QA pairs are novel and diverse.
We experimentally validate our QAG model on the SQuAD v1.1 (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), and TriviaQA (Joshi et al., 2017) datasets, with both QAE and R-QAE, using BERT-base (Devlin et al., 2019) as the QA model. Our QAG model obtains high QAE and low R-QAE, largely outperforming state-of-the-art baselines while using a significantly smaller number of contexts. Further experimental results for semi-supervised QA on the three datasets, using SQuAD as the labeled dataset, show that our model achieves significant improvements over the state-of-the-art baseline (+2.12 EM on SQuAD, +5.67 EM on NQ, and +1.18 EM on TriviaQA).
Our contribution is threefold:
• We propose a novel hierarchical variational framework for generating diverse QA pairs from a single context, which is, to our knowledge, the first probabilistic generative model for question-answer pair generation (QAG).
• We propose an InfoMax regularizer which effectively enforces the consistency between the generated QA pairs by maximizing their mutual information. This is a novel approach to resolving the consistency issue between QA pairs for QAG.
• We evaluate our framework on several benchmark datasets by either training a new model entirely on generated QA pairs (QA-based evaluation) or using both ground-truth and generated QA pairs (semi-supervised QA). Our model achieves impressive performance on both tasks, largely outperforming existing QAG baselines.

Related Work
Question and Question-Answer Pair Generation Early works on Question Generation (QG) mostly resort to rule-based approaches (Heilman and Smith, 2010; Lindberg et al., 2013; Labutov et al., 2015). Recently, however, encoder-decoder based neural architectures (Du et al., 2017; Zhou et al., 2017) have gained popularity, as they outperform rule-based methods. Some of them use paragraph-level information (Du and Cardie, 2018; Song et al., 2018; Liu et al., 2019; Zhao et al., 2018; Kim et al., 2019; Sun et al., 2018) as additional input. Reinforcement learning is a popular approach for training neural QG models, where the reward is defined as an evaluation metric (Song et al., 2017; Kumar et al., 2018) or as the QA accuracy/likelihood (Yuan et al., 2017; Hosking and Riedel, 2019; Zhang and Bansal, 2019). State-of-the-art QG models (Alberti et al., 2019; Dong et al., 2019; Chan and Fan, 2019) use pre-trained language models. Question-Answer Pair Generation (QAG) from contexts, which is our main target, is a relatively less explored topic, tackled by only a few recent works (Du and Cardie, 2018; Alberti et al., 2019; Dong et al., 2019). To the best of our knowledge, we are the first to propose a probabilistic generative model for end-to-end QAG; VAE-based approaches have previously been explored for question generation and other text generation tasks (Yao et al., 2018; Su et al., 2018; Deng et al., 2018). In this work, we propose a novel hierarchical conditional VAE framework with an InfoMax regularization for generating a pair of samples with high consistency.

Method
Our goal is to generate diverse and consistent QA pairs to tackle the data scarcity challenge in the extractive QA task. Formally, given a context $c$ containing $M$ tokens, $c = (c_1, \ldots, c_M)$, we want to generate QA pairs $(x, y)$, where $x = (x_1, \ldots, x_N)$ is a question containing $N$ tokens and $y = (y_1, \ldots, y_L)$ is its corresponding answer containing $L$ tokens. We aim to tackle the QAG task by learning the conditional joint distribution of the question and answer given the context, $p(x, y \mid c)$, from which we can sample QA pairs:

$$(x, y) \sim p(x, y \mid c)$$

We estimate $p(x, y \mid c)$ with a probabilistic deep generative model, which we describe next.

Hierarchical Conditional VAE
We propose to approximate the unknown conditional joint distribution $p(x, y \mid c)$ with a variational autoencoder (VAE) framework (Kingma and Welling, 2014). However, instead of directly learning a common latent space for both question and answer, we model $p(x, y \mid c)$ in a hierarchical conditional VAE framework with a separate latent space for question and answer as follows:

$$p(x, y \mid c) = \int_{z_x} \sum_{z_y} p_\theta(x \mid z_x, y, c)\, p_\theta(y \mid z_y, c)\, p_\psi(z_y \mid z_x, c)\, p_\psi(z_x \mid c)\, dz_x$$

where $z_x$ and $z_y$ are latent variables for question and answer respectively, and $p_\psi(z_x \mid c)$ and $p_\psi(z_y \mid z_x, c)$ are their conditional priors, following an isotropic Gaussian distribution and a categorical distribution, respectively (Figure 1-(a)). We decompose the latent spaces of question and answer since the answer is always a finite span of the context $c$, which can be modeled well by a categorical distribution, while a continuous latent space is a more appropriate choice for the question, since there could be an unlimited number of valid questions for a single context. Moreover, we design a bi-directional dependency flow in the joint distribution for QA. By leveraging the hierarchical structure, we enforce the answer latent variable to depend on the question latent variable in $p_\psi(z_y \mid z_x, c)$, and achieve the reverse dependency by sampling the question $x \sim p_\theta(x \mid z_x, y, c)$. We then use a variational posterior $q_\phi(\cdot)$ to maximize the Evidence Lower Bound (ELBO) as follows (the complete derivation is provided in Appendix A):

$$\mathcal{L}^{\text{ELBO}} = \mathbb{E}_{q_\phi(z_x \mid x, c)}\!\left[\log p_\theta(x \mid z_x, y, c)\right] + \mathbb{E}_{q_\phi(z_x \mid x, c)}\,\mathbb{E}_{q_\phi(z_y \mid z_x, y, c)}\!\left[\log p_\theta(y \mid z_y, c)\right] - \mathrm{KL}\!\left(q_\phi(z_x \mid x, c) \,\|\, p_\psi(z_x \mid c)\right) - \mathbb{E}_{q_\phi(z_x \mid x, c)}\!\left[\mathrm{KL}\!\left(q_\phi(z_y \mid z_x, y, c) \,\|\, p_\psi(z_y \mid z_x, c)\right)\right]$$

where $\theta$, $\phi$, and $\psi$ are the parameters of the generation, posterior, and prior networks, respectively. We refer to this model as the Hierarchical Conditional Variational Autoencoder (HCVAE) framework. Figure 2 shows the directed graphical model of our HCVAE. The generative process is as follows (a minimal code sketch of this process is given at the end of this subsection):

1. Sample a question latent variable: $z_x \sim p_\psi(z_x \mid c)$
2. Sample an answer latent variable: $z_y \sim p_\psi(z_y \mid z_x, c)$
3. Generate an answer: $y \sim p_\theta(y \mid z_y, c)$
4. Generate a question: $x \sim p_\theta(x \mid z_x, y, c)$

Embedding We use the pre-trained word embedding network from BERT (Devlin et al., 2019) for the posterior and prior networks, whereas the whole BERT is used as a contextualized word embedding model for the generative networks. For answer encoding, we use the binary token type IDs of BERT. Specifically, we encode all context tokens as 0s, except for the tokens that are part of the answer span (the highlighted words of the context in Figure 1-(a) or -(c)), which we encode as 1s. We then feed the sequence of word token IDs, token type IDs, and position IDs into the embedding layer to encode the answer-aware context. We fix all the embedding layers of HCVAE during training.
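To make the four-step generative process above concrete, the following is a minimal PyTorch-style sketch of QA-pair sampling. All module names (`prior_q`, `prior_a`, `answer_decoder`, `question_decoder`) are hypothetical stand-ins for the networks described in this section, not the released implementation.

```python
import torch

def generate_qa_pair(c, prior_q, prior_a, answer_decoder, question_decoder):
    """Hypothetical sketch of the HCVAE generative process for one context c."""
    # 1. Sample the question latent variable from the Gaussian prior p_psi(z_x | c).
    mu, sigma = prior_q(c)
    z_x = mu + sigma * torch.randn_like(sigma)
    # 2. Sample the answer latent variable from the categorical prior p_psi(z_y | z_x, c).
    pi = prior_a(z_x, c)  # categorical probabilities
    z_y = torch.distributions.Categorical(pi).sample()
    # 3. Generate an answer span y ~ p_theta(y | z_y, c).
    y = answer_decoder(z_y, c)
    # 4. Generate a question x ~ p_theta(x | z_x, y, c), conditioned on the answer.
    x = question_decoder(z_x, y, c)
    return x, y
```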

Prior Networks
We use two different conditional prior networks, $p_\psi(z_x \mid c)$ and $p_\psi(z_y \mid z_x, c)$, to model the context-dependent priors (the dashed lines in Figure 1-(a)). To obtain the parameters of the isotropic Gaussian $\mathcal{N}(\mu, \sigma^2 I)$ for $p_\psi(z_x \mid c)$, we use a bidirectional LSTM (Bi-LSTM) to encode the word embeddings of the context into hidden representations, and then feed them into a Multi-Layer Perceptron (MLP). We model $p_\psi(z_y \mid z_x, c)$ as a categorical distribution $\mathrm{Cat}(\pi)$, computing the parameter $\pi$ from $z_x$ and the hidden representation of the context with another MLP. A sketch of the Gaussian prior network follows below.
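The sketch below shows how such a Gaussian prior network could look in PyTorch. The layer sizes follow the hyperparameters reported in the experimental setup, but the MLP depth and the mean-pooling of the Bi-LSTM states are our own assumptions.

```python
import torch
import torch.nn as nn

class QuestionPrior(nn.Module):
    """Sketch of p_psi(z_x | c): Bi-LSTM over context embeddings, then an MLP
    producing the parameters of an isotropic Gaussian. Sizes are assumptions."""
    def __init__(self, emb_dim=768, hidden=300, z_dim=50):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, 2 * hidden), nn.Tanh(),
                                 nn.Linear(2 * hidden, 2 * z_dim))

    def forward(self, c_emb):                # c_emb: (batch, M, emb_dim)
        h, _ = self.bilstm(c_emb)
        pooled = h.mean(dim=1)               # pooling choice is an assumption
        mu, log_var = self.mlp(pooled).chunk(2, dim=-1)
        return mu, (0.5 * log_var).exp()     # mu, sigma of N(mu, sigma^2 I)
```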
Posterior Networks We use two conditional posterior networks, $q_\phi(z_x \mid x, c)$ and $q_\phi(z_y \mid z_x, y, c)$, to approximate the true posterior distributions of the latent variables for the question $x$ and the answer $y$. We use two Bi-LSTM encoders to output the hidden representations of the question and the context given their word embeddings. Then, we feed the two hidden representations into an MLP to obtain the parameters $\mu$ and $\sigma$ of the Gaussian distribution (upper right corner of Figure 1-(a)). We use the reparameterization trick (Kingma and Welling, 2014) to train the model with backpropagation, since the stochastic sampling process $z_x \sim q_\phi(z_x \mid x, c)$ is non-differentiable. We use another Bi-LSTM to encode the word embeddings of the answer-aware context into a hidden representation. Then, we feed this hidden representation and $z_x$ into an MLP to compute the parameters $\pi$ of the categorical distribution (lower right corner of Figure 1-(a)). We use the categorical reparameterization trick with Gumbel-Softmax (Maddison et al., 2017; Jang et al., 2017) to enable backpropagation through the sampled discrete latent variables (a minimal sketch of this sampling step is given at the end of this subsection).

Answer Generation Networks Since we consider extractive QA, we can factorize $p_\theta(y \mid z_y, c)$ into $p_\theta(y_s \mid z_y, c)$ and $p_\theta(y_e \mid z_y, c)$, where $y_s$ and $y_e$ are the start and end positions of the answer span (highlighted words in Figure 1-(b)), respectively. To obtain MLE estimators for both, we first encode the context $c$ into contextualized word embeddings $E^c = \{e^c_1, \ldots, e^c_M\}$ with the pre-trained BERT. We compute the final hidden representation from the context and the latent variable $z_y$ with a heuristic matching layer (Mou et al., 2016) and a Bi-LSTM:

$$m_i = [e^c_i ; z_y ; e^c_i \circ z_y ; e^c_i - z_y], \quad H = \mathrm{BiLSTM}(m_1, \ldots, m_M)$$

where $z_y$ is linearly transformed and $H \in \mathbb{R}^{d_y \times M}$ is the final hidden representation. Then, we feed $H$ into two separate linear layers to predict $y_s$ and $y_e$.

Question Generation Networks We design the encoder-decoder architecture of our QG network by mainly adopting it from our baselines (Zhao et al., 2018; Zhang and Bansal, 2019). For encoding, we use the pre-trained BERT to encode the answer-specific context into contextualized word embeddings, and then use a two-layer Bi-LSTM to encode them into a hidden representation (Figure 1-(c)). We apply a gated self-attention mechanism (Wang et al., 2017) to the hidden representation to better capture long-term dependencies within the context, obtaining a new hidden representation $\hat{H} \in \mathbb{R}^{d_x \times M}$. The decoder is a two-layer LSTM which receives the (linearly transformed) latent variable $z_x$ as its initial state. It uses an attention mechanism (Luong et al., 2015) to dynamically aggregate $\hat{H}$ at each decoding step into a context vector $s_j$, using the $j$-th decoder hidden representation $d_j \in \mathbb{R}^{d_x}$ (Figure 1-(c)). Then, we feed $d_j$ and $s_j$ into an MLP with maxout activation (Goodfellow et al., 2013) to compute the final hidden representation $\bar{d}_j$ as follows:

$$d_j = \mathrm{LSTM}(e^x_j, d_{j-1}), \quad \bar{d}_j = \mathrm{MLP}_{\mathrm{maxout}}([d_j ; s_j])$$

where $z_x$ is linearly transformed to form the initial decoder state, and $e^x_j$ is the $j$-th question word embedding. The probability vector over the vocabulary is computed as $p(x_j \mid x_{<j}, z_x, y, c) = \mathrm{softmax}(W_e \bar{d}_j)$. We initialize the weight matrix $W_e$ with the pre-trained word embedding matrix and fix it during training. Further, we use the copy mechanism (Zhao et al., 2018) so that the model can directly copy tokens from the context. We also decode questions greedily, to ensure that all stochasticity comes from the sampling of the latent variables.
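As a small illustration of the categorical reparameterization mentioned in the posterior networks above, the sketch below wraps PyTorch's built-in Gumbel-Softmax; the temperature value is an assumption, not a reported hyperparameter.

```python
import torch.nn.functional as F

def sample_z_y(logits, tau=1.0, hard=True):
    """Sketch of categorical reparameterization for z_y (Maddison et al., 2017;
    Jang et al., 2017). logits: (batch, num_vars, num_classes)."""
    # Gumbel-Softmax gives a differentiable relaxation of categorical sampling;
    # hard=True returns one-hot samples with a straight-through gradient.
    return F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
```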

Consistent QA Pair Generation with Mutual Information Maximization
One of the most important challenges of the QAG task is enforcing consistency between a generated question and its corresponding answer. They should be semantically consistent, such that it is possible to predict the answer given the question and the context. However, neural QG and QAG models often generate questions irrelevant to the context and the answer (Zhang and Bansal, 2019) due to the lack of a mechanism enforcing this consistency. We tackle this issue by maximizing the mutual information (MI) of a generated QA pair, assuming that an answerable QA pair will have high MI. Since exact computation of MI is intractable, we use a neural approximation. While there exist many different approximations (Belghazi et al., 2018; Hjelm et al., 2019), we use the estimator proposed by Yeh and Chen (2019) based on the Jensen-Shannon divergence:

$$\mathrm{MI}(x, y) \approx \mathbb{E}_P\!\left[-\mathrm{sp}(-g(x, y))\right] - \mathbb{E}_N\!\left[\mathrm{sp}(g(x, y))\right]$$

where $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function, and $\mathbb{E}_P$ and $\mathbb{E}_N$ denote expectations over positive and negative examples, respectively. We generate negative examples by shuffling the QA pairs in the minibatch, such that a question is randomly associated with an answer. Intuitively, the function $g(\cdot)$ acts like a binary classifier that discriminates whether a QA pair comes from the joint distribution or not. We empirically find that the following $g(\cdot)$ effectively achieves our goal of consistent QAG:

$$g(x, y) = \bar{x}^\top W \bar{y}, \quad \text{where } \bar{x} = \frac{1}{N}\sum_i \bar{d}_i \text{ and } \bar{y} = \frac{1}{L}\sum_j \hat{h}_j$$

are summarized representations of the question and answer, respectively, and $W$ is a trainable matrix. Combined with the ELBO, the final objective of our Info-HCVAE is as follows:

$$\max_\Theta \; \mathcal{L}^{\text{ELBO}} + \lambda \cdot \mathrm{MI}(x, y)$$

where $\Theta$ includes all the parameters of $\phi$, $\psi$, $\theta$, and $W$, and $\lambda$ controls the effect of MI maximization. In all experiments, we set $\lambda$ to 1.
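The following is a minimal PyTorch sketch of this regularizer under the JSD-based estimator above. One deliberate simplification: instead of shuffling the minibatch once, it treats every mismatched in-batch pair as a negative; tensor shapes and the function name are our assumptions.

```python
import torch
import torch.nn.functional as F

def infomax_loss(q_repr, a_repr, W):
    """Sketch of the JSD-based MI consistency regularizer.
    q_repr: (batch, dq), a_repr: (batch, da) mean-pooled question/answer
    representations (x_bar, y_bar in the text); W: (dq, da) trainable matrix."""
    scores = q_repr @ W @ a_repr.t()         # (batch, batch) bilinear scores
    pos = scores.diag()                      # aligned QA pairs (joint samples)
    # Negatives: all mismatched question-answer combinations in the minibatch.
    neg_mask = ~torch.eye(len(scores), dtype=torch.bool, device=scores.device)
    neg = scores[neg_mask]
    # E_P[-sp(-g)] - E_N[sp(g)]; return the negative bound so it can be minimized.
    mi_estimate = -F.softplus(-pos).mean() - F.softplus(neg).mean()
    return -mi_estimate
```

Minimizing this loss jointly with the negative ELBO pushes the summarized question and answer representations of a generated pair to be predictive of each other, which is exactly the consistency signal described above.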

Dataset
Stanford Question Answering Dataset v1.1 (SQuAD) (Rajpurkar et al., 2016). This is a reading comprehension dataset consisting of questions obtained from crowdsourcing on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. We use the same split as Zhang and Bansal (2019) for a fair comparison.

Natural Questions (NQ) (Kwiatkowski et al., 2019). This dataset contains realistic questions from actual user queries to a search engine, using Wikipedia articles as context. We adapt the version provided by the MRQA shared task (Fisch et al., 2019) and convert it into the extractive QA format. We split the original validation set in half, to use as validation and test sets for our experiments.
TriviaQA (Joshi et al., 2017). This is a reading comprehension dataset containing question-answer-evidence triples. The QA pairs and the evidence (context) documents are authored and uploaded by trivia enthusiasts. Again, we only select QA pairs whose answers are spans of the contexts.
HarvestingQA 2 This dataset contains the top-ranking 10K Wikipedia articles and 1M synthetic QA pairs generated from them by the answer span extraction and QG system proposed in Du and Cardie (2018). We use this dataset for semi-supervised learning.

Experimental Setups
Implementation Details In all experiments, we use BERT-base ($d = 768$) (Devlin et al., 2019) as the QA model, setting most of the hyperparameters as described in the original paper. For both HCVAE and Info-HCVAE, we set the hidden dimensionality of the Bi-LSTM to 300 for the posterior, prior, and answer generation networks, and use dimensionalities of 450 and 900 for the encoder and decoder of the question generation network, respectively. We set the dimensionality of $z_x$ to 50, and define $z_y$ to be a set of 10-way categorical variables $z_y = \{z_1, \ldots, z_{20}\}$. For training the QA model, we fine-tune the model for 2 epochs. We train both the QA model and Info-HCVAE with the Adam optimizer (Kingma and Ba, 2015) with a batch size of 32 and initial learning rates of $5 \cdot 10^{-5}$ and $10^{-3}$, respectively. For semi-supervised learning, we first pre-train BERT on the synthetic data for 2 epochs and then fine-tune it on the GT dataset for 2 epochs. To prevent posterior collapse, we multiply the KL divergence terms for the question and answer by 0.1 (Higgins et al., 2017); a sketch of the resulting weighted objective is given at the end of this section. For more details of the datasets and experimental setup, please see Appendix C.

Baselines We experiment with two variants of our model, HCVAE and Info-HCVAE, against several baselines, including Harvesting-QG (Du and Cardie, 2018), Maxout-QG (Zhao et al., 2018), and Semantic-QG (Zhang and Bansal, 2019). For the baselines, we use the same answer spans extracted by the answer extraction system of Du and Cardie (2018).
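As a small sketch of how the weighted objective fits together, assuming the individual loss terms are computed elsewhere (the function name is hypothetical):

```python
def info_hcvae_loss(recon_q, recon_a, kl_q, kl_a, info_loss, beta=0.1, lam=1.0):
    """Sketch of the total training loss: negative ELBO with both KL terms
    scaled by beta=0.1 to prevent posterior collapse (Higgins et al., 2017),
    plus the InfoMax regularizer weighted by lambda=1."""
    # recon_q / recon_a: question and answer reconstruction (cross-entropy) losses
    # kl_q / kl_a: KL divergences between the posteriors and conditional priors
    return recon_q + recon_a + beta * (kl_q + kl_a) + lam * info_loss
```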

Quantitative Analysis
QAE and R-QAE One of the crucial challenges with generative models is the lack of a good quantitative evaluation metric. We adopt the QA-based Evaluation (QAE) metric proposed by Zhang and Bansal (2019), which trains a QA model on the synthetic QA pairs and evaluates it on the human-annotated test set. To additionally measure the novelty and diversity of the synthetic QA pairs, we use our proposed Reverse QAE (R-QAE), which trains the QA model on the human-annotated data and evaluates it on the synthetic QA pairs; if the synthetic data covers a larger distribution than the human-annotated training data, R-QAE will be lower. However, note that having a low R-QAE is only meaningful when the QAE is high enough, since trivially invalid questions may also yield low R-QAE. A sketch of both protocols is given below.
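The two protocols can be summarized by the following sketch, where `train_qa_model` and `evaluate` are hypothetical helpers that return a trained model and an EM/F1 score, respectively:

```python
def qa_based_evaluation(train_qa_model, evaluate, synthetic, gt_train, gt_test):
    """Sketch of both protocols; helper functions are assumed, not real APIs.
    QAE (higher is better): train on synthetic pairs, test on human-labeled data.
    R-QAE (lower = more novel/diverse): train on human data, test on synthetic."""
    qae = evaluate(train_qa_model(synthetic), gt_test)
    r_qae = evaluate(train_qa_model(gt_train), synthetic)
    return qae, r_qae
```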
Results We compare HCVAE and Info-HCVAE against the baseline models on SQuAD, NQ, and TriviaQA. We use 10% of the Wikipedia paragraphs from HarvestingQA (Du and Cardie, 2018) for evaluation. Table 2 shows that both HCVAE and Info-HCVAE significantly outperform all baselines by large margins in QAE on all three datasets, while obtaining significantly lower R-QAE, which shows that our model generates both high-quality and diverse QA pairs from the given contexts. Moreover, Info-HCVAE largely outperforms HCVAE, which demonstrates the effectiveness of our InfoMax regularizer in enforcing QA-pair consistency. Figure 3 shows the accuracy as a function of the number of QA pairs. Our Info-HCVAE outperforms all baselines by large margins while using orders of magnitude fewer QA pairs. For example, Info-HCVAE achieves 61.38 points using 12K QA pairs, outperforming Semantic-QG, which uses 10 times more QA pairs. We also report the score $\bar{x}^\top W \bar{y}$ as an approximate estimate of the mutual information (MI) between the QA pairs generated by each method in Table 3; our Info-HCVAE yields the largest MI estimate.

Ablation Study We further perform an ablation study to examine the effect of each model component.
We start with a model without any latent variables, which is essentially a deterministic Seq2Seq model (denoted as Baseline in Table 4). Then, we add the question latent variable (+Q-latent) and then the answer latent variable (+A-latent), to see the effect of probabilistic latent variable modeling and hierarchical modeling, respectively. The results in Table 4 show that both are essential for improving the quality (QAE) and diversity (R-QAE) of the generated QA pairs. Finally, adding the InfoMax regularization (+InfoMax) further improves performance by enhancing the consistency of the generated QA pairs.

Qualitative Analysis
Human Evaluation As a qualitative analysis, we first conduct a pairwise human evaluation of the QA pairs generated by our Info-HCVAE and by Maxout-QG on 100 randomly selected paragraphs. Specifically, 20 human judges performed a blind quality assessment of two sets of QA pairs presented in random order, each of which contained two to five QA pairs. Each set of QA pairs is evaluated in terms of the overall quality, diversity, and consistency between the generated QA pairs and the context. The results in Table 5 show that the QA pairs generated by our Info-HCVAE are evaluated as more diverse and consistent than those generated by the baseline.
One-to-Many QG To show that our Info-HCVAE can effectively tackle the one-to-many mapping problem of question generation, we qualitatively analyze the questions generated for a given context and answer from the SQuAD validation set. Specifically, we sample the question latent variable multiple times from the question prior network $p_\psi(z_x \mid c)$, and then feed each sample to the question generation network $p_\theta(x \mid z_x, y, c)$ together with the answer (see the sketch below). The example in Table 6 shows that our Info-HCVAE generates diverse and semantically consistent questions for a given answer. We provide more qualitative examples in Appendix D.
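A minimal sketch of this sampling procedure, reusing the hypothetical module names from the earlier sketches:

```python
import torch

def sample_diverse_questions(c, y, prior_q, question_decoder, k=5):
    """Sketch: draw k question latents from p_psi(z_x | c) and decode each one
    greedily with the same answer y, so all variation comes from z_x alone."""
    mu, sigma = prior_q(c)
    questions = []
    for _ in range(k):
        z_x = mu + sigma * torch.randn_like(sigma)   # a fresh latent per question
        questions.append(question_decoder(z_x, y, c))
    return questions
```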
Latent Space Interpolation To examine whether Info-HCVAE learns a meaningful latent space of QA pairs, we qualitatively analyze the QA pairs generated by interpolating between two latent codes on the SQuAD training set. We first encode $z_x$ for two QA pairs using the posterior network $q_\phi(z_x \mid x, c)$, and then sample $z_y$ from the interpolated values of $z_x$ using the prior network $p_\psi(z_y \mid z_x, c)$ to generate the corresponding QA pairs (see the sketch below). Table 7 shows that the semantics of the generated QA pairs transition smoothly from one latent code to the other with high diversity and consistency. We provide more qualitative examples in Appendix D.
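A sketch of the interpolation procedure under the same hypothetical module names; linear interpolation between the two Gaussian latent codes is our reading of "interpolated values of z_x":

```python
import torch

def interpolate_qa_pairs(z_x_a, z_x_b, prior_a, answer_decoder,
                         question_decoder, c, steps=5):
    """Sketch of latent interpolation between two question codes that were
    encoded beforehand with the posterior network q_phi(z_x | x, c)."""
    pairs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z_x = (1 - alpha) * z_x_a + alpha * z_x_b       # linear interpolation
        z_y = torch.distributions.Categorical(prior_a(z_x, c)).sample()
        answer = answer_decoder(z_y, c)
        pairs.append((question_decoder(z_x, answer, c), answer))
    return pairs
```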

Semi-supervised QA
We now validate our model in a semi-supervised setting, where the model uses both the ground-truth labels and the generated labels for the QA task, to see whether the generated QA pairs improve the performance of a QA model in a conventional setting. Since synthetic datasets of generated QA pairs inevitably contain some noise (Zhang and Bansal, 2019; Dong et al., 2019; Alberti et al., 2019), we further refine the QA pairs using the heuristic suggested by Dong et al. (2019): we replace a generated answer whenever its F1 score against the prediction of a QA model trained on the human-annotated data is lower than a set threshold. We select a threshold of 40.0 for the QA-pair refinement via cross-validation on the SQuAD dataset and use it for all experiments (a code sketch of this refinement step follows at the end of this section). Please see Appendix C for more details.

SQuAD We first perform semi-supervised QA experiments on SQuAD using the synthetic QA pairs generated by our model. For the contexts, we use both the paragraphs in the original SQuAD dataset (S) and the new paragraphs in the HarvestingQA dataset (H). Using Info-HCVAE, we generate 10 different QA pairs per context by sampling from the latent spaces (denoted as S×10). For the baseline, we use Semantic-QG (Zhang and Bansal, 2019) with a beam search size of 10 to obtain the same number of QA pairs. We also generate new QA pairs from different portions of the paragraphs provided in HarvestingQA (denoted as H×10%-H×100%), sampling one latent variable per context. Table 8 shows that our framework improves the accuracy of the BERT-base model by 2.12 (EM) and 1.59 (F1) points, significantly outperforming Semantic-QG.

NQ and TriviaQA Our model is most useful when we do not have any labeled data for a target dataset. To show how well our QAG model performs in such a setting, we train the QA model using only the QA pairs generated by our model trained on SQuAD, and test it on the target datasets (NQ and TriviaQA). We generate multiple QA pairs from each context of the target dataset, sampling from the latent space one to ten times (denoted by N×1-10 and T×1-10 in Table 9). Then, we fine-tune the QA model pretrained on SQuAD with the generated QA pairs from the two datasets. Table 9 shows that as we augment the training data with a larger number of synthetic QA pairs, the performance of the QA model increases significantly, outperforming the QA model trained on SQuAD only. Yet, models trained with our QAG still largely underperform models trained with human labels, due to the distributional discrepancy between the source and target datasets.
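A minimal sketch of the refinement heuristic described above; `qa_model.predict` and `f1_score` are assumed helpers, and the F1 threshold of 40.0 is the value selected by cross-validation:

```python
def refine_qa_pairs(pairs, qa_model, f1_score, threshold=40.0):
    """Sketch of the answer-replacement heuristic (Dong et al., 2019): replace
    a generated answer when a trusted QA model strongly disagrees with it."""
    refined = []
    for context, question, answer in pairs:
        predicted = qa_model.predict(context, question)
        if f1_score(answer, predicted) < threshold:
            answer = predicted           # replace answers the QA model disputes
        refined.append((context, question, answer))
    return refined
```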

Conclusion
We proposed a novel probabilistic generative framework for generating diverse and consistent question-answer (QA) pairs from given texts. Specifically, our model learns the joint distribution of a question and answer given a context with a hierarchical conditional variational autoencoder, while enforcing consistency between the generated QA pairs by maximizing their mutual information with a novel InfoMax regularizer.

B Dataset
The statistics and the data sources are summarized in Table 10.

C Experimental Setup

For the baseline QG models, we use a beam size of 10 for decoding. We also evaluate the Maxout-QG model on our SQuAD validation set with BLEU-4 (Papineni et al., 2002), obtaining 15.68 points.
Selection of Threshold for Replacement As mentioned in the paper, we use a threshold of 40.0, selected via cross-validation of the QA model performance, using both the full SQuAD and HarvestingQA datasets for QAG. The detailed selection process is as follows: 1) train the QA model only on the human-annotated data, 2) compute the F1 score of each generated QA pair against the QA model's prediction, and 3) if the F1 score is lower than the threshold, replace the generated answer with the prediction of the QA model. We investigate the optimal threshold value among [20.0, 40.0, 60.0, 80.0] using our SQuAD validation set. Table 11 shows the results of the cross-validation on the validation set. The optimal value of 40.0 is used for the semi-supervised experiments on Natural Questions and TriviaQA. For the fully unlabeled semi-supervised experiments on Natural Questions and TriviaQA, the QA model is trained only on SQuAD and used to refine the synthetic QA pairs (denoted in the paper as N×1-10, T×1-10). A sketch of the threshold search is given below.
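The threshold search itself reduces to a simple loop, sketched here with hypothetical helpers (`refine_and_train` refines the synthetic data with a given threshold and trains the QA model; `validate` returns a validation EM or F1 score):

```python
def select_threshold(candidates, refine_and_train, validate):
    """Sketch of the cross-validation loop over candidate F1 thresholds."""
    scores = {t: validate(refine_and_train(t)) for t in candidates}
    return max(scores, key=scores.get)   # pick the best-scoring threshold

# Usage under the setting described above (candidate values from the paper):
# best = select_threshold([20.0, 40.0, 60.0, 80.0], refine_and_train, validate)
```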
Semi-supervised learning For the semi-supervised learning experiment on SQuAD, we follow Zhang and Bansal (2019)'s split for a fair comparison. Specifically, we received the unique IDs of the QA pairs from the authors and use exactly the same validation and test sets as theirs. For the Natural Questions and TriviaQA experiments, we use our own splits as described above. We generate QA pairs from the paragraphs of Wikipedia extracted by Du and Cardie (2018) and train the BERT-base QA model on the synthetic data for two epochs. Then, we further train the model on the human-annotated training data for two more epochs. The catastrophic forgetting reported in Zhang and Bansal (2019) does not occur in our case. We use the Adam optimizer (Kingma and Ba, 2015) with a batch size of 32 and follow the learning rate schedule described in Devlin et al. (2019), with initial learning rates of $2 \cdot 10^{-5}$ and $3 \cdot 10^{-5}$ for the synthetic and human-annotated data, respectively.

D Qualitative Examples
Qualitative examples are shown in Tables 12, 13, and 14 on the following pages.