End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems

We propose an end-to-end approach for synthetic QA data generation. Our model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both answers and questions. In a nutshell, we feed a passage to the encoder and ask the decoder to generate a question and an answer token-by-token. The likelihood produced in the generation process is used as a filtering score, which avoids the need for a separate filtering model. Our generator is trained by fine-tuning a pretrained LM using maximum likelihood estimation. The experimental results indicate significant improvements in the domain adaptation of QA models, outperforming current state-of-the-art methods.


Introduction
Improving question answering (QA) systems through automatically generated synthetic data is a long-standing research goal (Mitkov and Ha, 2003; Rus et al., 2010). Although many past works have proposed different strategies for question generation, they have had limited or no success in improving the downstream QA task (Du et al., 2017; Sun et al., 2018; Song et al., 2018; Klein and Nabi, 2019; Wang et al., 2020; Ma et al., 2020; Chen et al., 2020; Tuan et al., 2019). Some recent approaches for synthetic QA data generation based on large pretrained language models (LMs) have started to demonstrate success in improving the downstream Reading Comprehension (RC) task with automatically generated data (Alberti et al., 2019; Puri et al., 2020). However, these approaches typically consist of multi-stage systems that use three modules: a span/answer detector, a question generator and a question filter.
Given an input passage, the span detector is responsible for extracting spans that will serve as answers for which questions will be generated. This module normally combines a pretrained QA model with handcrafted heuristics. The question generator is a large LM fine-tuned for the task of conditional generation of questions given passage and answer. The question filtering comprises another RC model that is used to score and filter the generated QA pairs. Each module of this synthetic data generation pipeline is trained/tuned separately, and errors from one stage can propagate to the next stages. Additionally, each module is expensive to compute because all of them use large transformer networks (Vaswani et al., 2017).
In this work, we propose an end-to-end approach for synthetic QA data generation. Our model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both the answer and the question. In a nutshell, we feed a passage to the encoder and ask the decoder to generate the question and the answer token-by-token. The likelihood produced in the generation process is used as a filtering score, which avoids the need for a separate filtering model. Our generator is trained by fine-tuning a pretrained LM using maximum likelihood estimation (MLE). We use BART (Lewis et al., 2019) as the pretrained LM in our experiments.
We perform experiments with three different variations of our synthetic QA data generator: (1) AQGen, which generates first the answer then the question; (2) QAGen, which generates first the question then the answer; (3) QAGen Two-step (2S), which generates first the question, concatenates it to the passage, then generates the answer in a second pass through the same encoder-decoder.
We focus our empirical evaluation on the task of data augmentation for domain adaptation of reading comprehension (RC) models trained on the SQuAD 1.1 dataset. We assess the effectiveness of our QA data generators for domain adaptation on four different target domain datasets: Natural Questions (NQ), BioASQ, NewsQA and DuoRC. We compare our results with recent work on domain adaptation for QA as well as with a three-stage synthetic data generator. QAGen performs better than AQGen and the baselines for all datasets, while QAGen2S provides the best results overall because it allows bidirectional attention between passage and question. For the NQ dataset, QAGen2S improves the SQuAD baseline by more than 8 points in EM and more than 7 points in F1. For NewsQA and BioASQ, the gains in EM are also above 4 points. We also demonstrate that data synthetically generated by QAGen2S can improve the in-domain performance of both small and large RC models, leading to F1/EM improvements of 1/0.5 and 3.1/2.2 for RoBERTa-large and bert-base-uncased RC models, respectively, on the SQuAD dev set.
The main contributions of this work can be summarized as follows: (1) we propose the first effective end-to-end approach for synthetic QA data generation; (2) our approach solves an important issue in previous methods for QA data generation: the detection of good spans. We show that span detection can be effectively solved as a generation task, just like question generation; (3) as it uses a single end-to-end model, our data generation pipeline is simpler, faster and more efficient; (4) we perform comprehensive experiments that demonstrate the effectiveness of our proposed approach for domain adaptation of QA systems.

End-to-End Model for Question and Answer Generation and Filtering
We model the problem of synthetic QA data generation as a conditional language modeling task. More specifically, we use an encoder-decoder (enc-dec) conditional LM as described in what follows.

Enc-Dec Conditional Language Models
Language modeling consists of learning the probability distribution p(x) over variable-length token sequences x = (x_1, x_2, ..., x_{|x|}), where the tokens come from a fixed-size vocabulary V. The training of LMs typically involves solving the task of predicting the next token based on past tokens. The distribution p(x) can be represented by the conditional probability of the next token given the previous ones (Bengio et al., 2003):

p(x) = \prod_{i=1}^{|x|} p(x_i \mid x_{<i})

In the case of conditional LMs, the generation is conditioned on an additional context c:

p(x \mid c) = \prod_{i=1}^{|x|} p(x_i \mid x_{<i}, c)

Transformer-based encoder-decoder conditional LMs (Lewis et al., 2019; Raffel et al., 2019) use bidirectional self-attention in the encoding step to create vector representations of the tokens in the context c. The decoding step generates the tokens of the sequence x in an auto-regressive manner, performing self-attention on the previously generated tokens of x and on all the representations output by the encoder for c.
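As a concrete illustration of this factorization, the sketch below scores a sequence autoregressively; `next_token_probs` is a hypothetical stand-in for the decoder's next-token distribution, shown here with a toy uniform model.

```python
import math

def sequence_log_prob(tokens, context, next_token_probs):
    """log p(x | c) = sum_i log p(x_i | x_<i, c), accumulated autoregressively."""
    total = 0.0
    prefix = []
    for tok in tokens:
        probs = next_token_probs(context, prefix)  # distribution over vocabulary V
        total += math.log(probs[tok])
        prefix.append(tok)
    return total

# Toy "decoder": uniform over a 4-token vocabulary, ignoring context and prefix.
def uniform_model(context, prefix):
    return {t: 0.25 for t in ("a", "b", "c", "d")}

lp = sequence_log_prob(["a", "b"], "some passage", uniform_model)  # 2 * log(0.25)
```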

Question-Answer Generation
In the case of end-to-end synthetic data generation for QA, we need to model the joint conditional distribution p(a, q | c), where the input context c is a passage, q is a question and a is the correct answer, which is a span in c. Our approach to model p(a, q | c) involves fine-tuning a pretrained Enc-Dec conditional LM using a training set D = {(c_1, q_1, a_1), (c_2, q_2, a_2), ..., (c_{|D|}, q_{|D|}, a_{|D|})}. We train the Enc-Dec with parameters θ through maximum likelihood estimation (MLE) by minimizing the negative log-likelihood over D:

\mathcal{L}(\theta) = - \sum_{(c, q, a) \in D} \log p_\theta(a, q \mid c)

We can have different variations of the generator depending on how we place the items in the output sequence: answer-question or question-answer. This difference in ordering is crucial because it defines which part is conditioned on the other. Based on this observation, we propose three variations of our generative model:

AQGen: this model generates the answer and question jointly given the input context: (a, q) ∼ p(a, q | c). During sampling, the answer tokens are generated first, followed by the question tokens. This makes the generation of the question conditioned on both the input context (through attention on the encoder) and the answer (through self-attention in the decoder); see Fig. 1.

QAGen: this model generates the question and answer jointly given the input passage: (q, a) ∼ p(q, a | c). During sampling, the question tokens are generated first, followed by the answer tokens. This makes the generation of the answer conditioned on both the input context (through attention on the encoder) and the question (through self-attention in the decoder); see Fig. 2.

QAGen Two-Step (2S): this model performs question generation and answer generation in two separate passes over the Enc-Dec LM. First, the question is generated given the input context, q ∼ p(q | c) (Step 1). Next, the question is concatenated with the input context and the resulting sequence is given as input to the Enc-Dec, which finally generates the answer, a ∼ p(a | q, c) (Step 2). The QAGen2S sampling approach is illustrated in Fig. 3. This model uses a single Enc-Dec LM that is trained with samples of both p(q | c) and p(a | q, c). We use control codes <q> and <a> to inform the decoder whether to generate a question or an answer, respectively.
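The two-step sampling procedure can be sketched as follows. Here `generate` is a hypothetical stand-in for a decoding call to the fine-tuned Enc-Dec LM, and appending the control codes to the input string is a simplifying assumption about how they are fed to the model.

```python
def qagen_two_step(passage, generate):
    """QAGen2S sketch: one enc-dec LM, two decoding passes steered by control codes."""
    question = generate(passage + " <q>")                  # Step 1: q ~ p(q | c)
    answer = generate(passage + " " + question + " <a>")   # Step 2: a ~ p(a | q, c)
    return question, answer

# Stub generator, for illustration only.
def stub_generate(inp):
    return "Who wrote it?" if inp.endswith("<q>") else "the author"

q, a = qagen_two_step("Some passage.", stub_generate)
```

Because the question is part of the encoder input in Step 2, the answer tokens can attend bidirectionally to both passage and question.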

Decoding
[Figure 3: QAGen Two-Step: given an input passage, the model first generates a question (Step 1). Next, the question is concatenated with the passage and both are given to the encoder-decoder, which generates the answer (Step 2).]

A natural choice for decoding with conditional neural LMs is beam search. However, our preliminary experiments with beam search showed a lack of diversity and a high repetition of generated question-answer pairs. Generating diverse question-answer pairs is crucial to the performance of downstream RC models. In particular, diversity of answer spans ensures that various parts of the passage are used and that different question types are generated. We use a variant of nucleus sampling (Holtzman et al., 2019): we pick the top k tokens and, within the top k, keep the tokens that comprise the top 95% of the probability mass. We set k to 20 in our experiments and refer to this setting as Topk+Nucleus. This decoding was used in QAGen, AQGen, and the question sampling step of QAGen2S. The answer generation of QAGen2S was performed with greedy decoding. We discard generated (q, a) pairs whose answers do not occur in the input passage, as non-extractive QA is outside the scope of this work. We observed that between 10% and 15% of samples were dropped because of this constraint.
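The Topk+Nucleus truncation of a next-token distribution can be sketched as follows (a minimal illustration, not the batched tensor implementation used in practice):

```python
def topk_nucleus_filter(probs, k=20, p=0.95):
    """Keep the top-k tokens, then the smallest subset of those covering
    probability mass p; renormalize. `probs` maps token -> probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for tok, pr in top:
        kept.append((tok, pr))
        mass += pr
        if mass >= p:          # nucleus cut-off inside the top-k set
            break
    z = sum(pr for _, pr in kept)
    return {tok: pr / z for tok, pr in kept}

out = topk_nucleus_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, k=3, p=0.9)
```

Sampling then proceeds from the renormalized distribution, e.g. with `random.choices(list(out), weights=list(out.values()))`.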

Filtering
Recent work has used the round-trip filtering method (Alberti et al., 2019; Puri et al., 2020) to prune the synthetic QA set and improve data quality. This method consists of two steps: (1) using an RC model to provide answers to the generated questions; (2) dropping the QA pairs for which the answer given by the RC model does not match the span-detected answer. While round-trip filtering has been shown to be effective, it is not the most efficient approach because it involves applying an RC system over the whole set of generated data. Additionally, there may exist pairs that are difficult for the filtering model but are in fact of high quality.
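For contrast, round-trip filtering can be sketched as follows; `rc_predict` is a hypothetical stand-in for the trained RC model, and exact string matching is a simplification of the span matching used in practice.

```python
def round_trip_filter(triples, rc_predict):
    """Keep (passage, question, answer) triples whose answer is
    reproduced by the RC model on the generated question."""
    return [(c, q, a) for c, q, a in triples if rc_predict(c, q) == a]

# Stub RC model, for illustration only.
def stub_rc(passage, question):
    return "Paris"

kept = round_trip_filter(
    [("ctx", "Capital of France?", "Paris"), ("ctx", "Capital of the UK?", "London")],
    stub_rc,
)
```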
[Table 1 (examples): question-answer pairs generated from PubMed (BioASQ), CNN/Daily Mail (NewsQA), IMDB (DuoRC) and Natural Questions passages.]

We propose using the likelihood of the generated question-answers as a measure to perform filtering, which addresses the efficiency issue because it avoids the use of an RC model for filtering. We argue that such a likelihood score, albeit noisy, is an indicator of whether a generated question-answer pair is of high enough quality for training a downstream RC model. We refer to this approach as LM filtering. Essentially, given an input passage, we sample n different QA pairs, rank them in decreasing order of LM score and pick the top m samples. This is similar to the sample-and-rerank approach suggested by Holtzman et al. (2019) and Adiwardana et al. (2020). Formally, for QAGen and QAGen2S, we use the score:

\mathrm{score}(a) = \sum_{i=1}^{N_a} \log p_\theta(a_i \mid a_{<i}, q, c)

and for AQGen:

\mathrm{score}(a, q) = \sum_{i=1}^{N_a} \log p_\theta(a_i \mid a_{<i}, c) + \sum_{i=1}^{N_q} \log p_\theta(q_i \mid q_{<i}, a, c)

where N_q and N_a denote the lengths of the generated question and answer, respectively. We use answer-only scores for QAGen and QAGen2S because question quality would otherwise have a dominant effect on the LM score, since questions are usually longer than answers. Additionally, using answer-only scores conditioned on the generated question is more suitable for the RC task because it better mimics the score of a downstream RC model, which is answer centric. With AQGen, we use both the answer and the question LM scores, as answer generation is not conditioned on the question. We use likelihood summation instead of averaging because our experiments showed that the former works slightly better (further details in Appendix B.3). We speculate this is because average pooling encourages longer question-answers, which can be of lower quality than shorter question-answer pairs.
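A minimal sketch of this LM filtering step, assuming the per-token log-probabilities of the generated answer are available from the decoder:

```python
def answer_lm_score(answer_token_logprobs):
    # Summation (not average) of the answer-token log-probabilities,
    # i.e. log p(a | q, c) for QAGen / QAGen2S.
    return sum(answer_token_logprobs)

def lm_filter(samples, m=5):
    """Rank the n sampled pairs of a passage by decreasing LM score and keep
    the top m. `samples` holds (question, answer, answer_token_logprobs)."""
    ranked = sorted(samples, key=lambda s: answer_lm_score(s[2]), reverse=True)
    return [(q, a) for q, a, _ in ranked[:m]]
```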
Related Work
Question generation (QG) has been extensively studied, from the early heuristic-based methods (Mitkov and Ha, 2003; Rus et al., 2010) to recent neural-based approaches. However, most work (

Datasets
We used the SQuAD 1.1 dataset (Rajpurkar et al., 2016) to train the generative models and as the in-domain supervised data for the downstream RC task. We used the default train and dev splits, which contain 87,599 and 10,570 (q, a) pairs, respectively. Similar to Nishida et al. (2019), we selected the following four datasets as target domains:
Natural Questions (Kwiatkowski et al., 2019), which consists of Google search questions and answers annotated from Wikipedia. We used the MRQA Shared Task (Fisch et al., 2019) preprocessed training and dev sets, which consist of 104,071 and 12,836 (q, a) pairs, respectively. The training set passages were used as the unlabeled target domain corpus, while evaluations were performed on the dev set.
NewsQA (Trischler et al., 2017), which consists of question-answer pairs from CNN news articles. We used the dev set from the MRQA Shared Task, which removes unanswerable questions and those without annotator agreement; we prefer this version as we focus only on the generation of answerable questions. The dev set consists of 4,212 (q, a) pairs. Passages from the CNN/Daily Mail corpus of Hermann et al. (2015) are used as the unlabeled target domain corpus.
BioASQ (Tsatsaronis et al., 2015): we employed the MRQA Shared Task version of BioASQ, which consists of a dev set with 1,504 samples. We collected PubMed abstracts to use as unlabeled target domain passages.
DuoRC (Saha et al., 2018) contains question-answer pairs about movie plots extracted from both Wikipedia and IMDB. The ParaphraseRC task of the DuoRC dataset was used in our evaluations, consisting of 13,111 pairs. We crawled IMDB movie plots to use as the unlabeled target domain corpus.

Experimental Setup
We used Pytorch (Paszke et al., 2019) and Transformers (Wolf et al., 2019) to develop the models and perform experiments. Generative models are trained on SQuAD 1.1 for 5 epochs, and the best model is selected based on the cross entropy loss on the SQuAD dev set. AdamW (Loshchilov and Hutter, 2017) optimizer with learning rate of 3 × 10 −5 is employed.
For RC model training, we use the bert-base-uncased model (Devlin et al., 2018). The AdamW optimizer is used with a learning rate of 3 × 10^-5 and batch size 24 for 2 epochs, without linear warmup. We set the maximum sequence length to 384 with a document stride of 128. The SQuAD 1.1 dev set is used to select the best model during training. As a baseline for QA data generation, we implemented a three-stage pipeline similar to the state-of-the-art approach of Puri et al. (2020). We call this baseline QGen; it generates a question given a passage and an extracted span, q ∼ p(q | a, c). The span detection module consists of bert-base-uncased fine-tuned on SQuAD 1.1 passages and spans, where the start and end classification heads are trained to perform span detection. For QGen, we experimented with sampling the top 5 spans and generating two questions each, as suggested by Puri et al. (2020), as well as sampling the top 10 spans and generating one question each. Our results showed the latter outperforming the former; hence, we used this configuration in our evaluations.
We trained QGen models on both BART-Large and GPT2-Medium (Radford et al., 2019), which have an equivalent number of parameters, 406M (BART) vs 350M (GPT2), and evaluated BLEU score of the generated question w.r.t. the ground truth question on the SQuAD dev set. BART and GPT2 achieved 21.29 and 18.31 BLEU, respectively. We believe the bi-directional encoding in BART is superior to uni-directional encoding in GPT2. Hence, we used BART for the rest of the experiments.

Synthetic Data Generation
For each of the unlabeled target domain corpora, we randomly selected 100,000 passages to perform synthetic data generation. Passages shorter than 100 tokens were discarded, and the selected ones were truncated to a maximum length of 550 tokens. We removed any passages that also appear in the dev sets.
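The passage selection described above can be sketched as follows (whitespace tokenization and exact-match deduplication against the dev sets are simplifications):

```python
import random

def select_passages(corpus, dev_passages, n=100_000,
                    min_len=100, max_len=550, seed=0):
    """Drop passages shorter than min_len tokens or present in the dev sets,
    truncate the rest to max_len tokens, and sample n at random."""
    dev = set(dev_passages)
    pool = []
    for passage in corpus:
        tokens = passage.split()
        if len(tokens) < min_len or passage in dev:
            continue
        pool.append(" ".join(tokens[:max_len]))
    random.Random(seed).shuffle(pool)
    return pool[:n]
```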
Question-answer generation with AQGen, QAGen, and QAGen2S is performed using Topk+Nucleus, as discussed in Sec. 2.3. For each passage, 10 samples are generated. Unless otherwise mentioned, LM filtering is applied by sorting the 10 samples of each passage according to LM scores, as detailed in Sec. 2.4, and the top 5 samples are selected. The number of synthetically generated pairs is between 860k and 890k without filtering and between 480k and 500k after LM filtering. Tab. 1 shows generated question-answer pairs from the four target domains (see Appendix for more examples). We can observe that the generative model is able to generate question-answer pairs even from raw HTML input that corresponds to a table; the rendered table can be seen in Tab. 12 (Appendix C.3). Considering that the training data of the generative model does not include any HTML input, this further demonstrates the robustness and efficacy of our proposed approach.

Domain Adaptation Results
Tab. 2 shows the results of domain adaptation experiments. Each experiment was performed by training the RC model on the synthetic data generated on the target domain corpus. We refer to the dataset to which the downstream model is being adapted as the target domain. Source domain indicates the supervised training dataset (SQuAD).
We also performed experiments using both synthetic and SQuAD 1.1 data. Our QAGen and QAGen2S models outperform by wide margins the baseline models trained on SQuAD 1.1 only, as well as unsupervised domain adaptation approaches. Comparing LM and round-trip filtering when applied to the best performing model, QAGen2S, we observe that the LM filtering approach (Sec. 2.4) is more effective than round-trip filtering in the BioASQ and DuoRC target domains, and barely underperforms it (by ∼1 point in F1 and EM) in the other two domains. This demonstrates the efficacy of the suggested filtering approach, which also simplifies the question-answer generation pipeline.
The (EM/F1) domain adaptation gains seen with BioASQ (4/2.2) and DuoRC (1.2/1.1) are smaller than those with Natural Questions (8.5/7.5) and NewsQA (5.5/4.5). We postulate this is due to two reasons: first, the BioASQ and DuoRC domains are more dissimilar to the source domain, SQuAD, than NewsQA and Natural Questions are; second, BioASQ and DuoRC are more difficult datasets. Comparing our results with supervised target domain training on DuoRC, we observe that using only synthetic data outperforms training on the DuoRC training set, which consists of 39,144 pairs. While our domain adaptation methods show substantial gains on the NewsQA and Natural Questions domains, there is still room for improvement to match the performance of supervised target domain training (last row in Tab. 2).
While the results in Tab. 2 suggest that generating synthetic QA data from target domain text leads to significant gains on the target domain dev set, one can ask whether it is essential to generate synthetic data from a corpus matching the target dev set's domain to achieve good performance. Tab. 3 shows the EM/F1 scores on every target domain dev set of RC models fine-tuned on synthetic data from the different target domain corpora. We can see that the diagonal elements, for which the dev set and target corpus share the same domain, show either the best performance (underlined results) or are within a narrow margin of the top EM/F1 scores. Therefore, the most effective strategy is to draw the passages used in the generation of synthetic samples from the same domain as the target, as expected for a domain adaptation method. Additionally, we trained an RC model with the synthetic data from all four domains (last two rows in Tab. 3). This produced our best F1 results for all datasets, indicating that mixing synthetic data from different domains is beneficial for the QA task. Tab. 3 also shows the EM/F1 scores of the cross-domain RC models on the SQuAD 1.1 dev set. We can see that using synthetic data from any of the four domains significantly improves the performance on SQuAD. In particular, when training the RC model with data from all domains plus the SQuAD training data (last row), there is a large gain in both EM (3.8) and F1 (2.7).

Comparison of AQGen, QAGen and QAGen2S models
Comparing our proposed LM filtering-based models in Tab. 2, we propose the following explanations: (1) QAGen2S and QAGen outperform AQGen because generating answers conditioned on the question results in better spans, which is crucial for the training of the downstream RC task. Generated answer spans not conditioned on questions may include spurious tokens or be partial spans. (2) QAGen2S outperforms QAGen because including the generated question in the bidirectional encoder allows cross attention between the passage and the generated question, which results in even more accurate answer generation. Comparing the performance when only synthetic question-answer pairs are employed versus adding the SQuAD training pairs, we observe that the addition of labeled data results in marginal gains. This becomes even more evident for the best performing data generators. In fact, in some cases, adding SQuAD data degrades EM, e.g., QAGen2S + LM filtering on Natural Questions and NewsQA.

Ablation Studies Sampling Design Choices
Tab. 4 shows a comparison between beam search and Topk+Nucleus sampling with different numbers of samples (5, 10, and 20). The results indicate that beam search underperforms Topk+Nucleus. We attribute this to the lack of diversity in the samples generated with beam search. We observed that beam search tends to select fewer distinct spans than Topk+Nucleus and generates minor variations of the same question. Appendix C.1 examines this issue.
When training the RC model, we only used the top 5 samples per passage based on LM score. We can observe that sampling 10 pairs per passage leads to the best EM/F1 on the target domain. By sampling many QA pairs per passage, we increase the chance of generating good samples; however, if we sample too many QA pairs, the top-ranked ones may be too similar. We therefore used a sample size of 10 in this work, since a higher sample size incurs higher computation cost without showing improvements.

LM Filtering
We argue that using LM filtering, as discussed in Sec. 2.4, improves the target domain downstream RC models by enhancing the quality of the generated (q, a) pairs. The results in Tab. 5 indicate that in the majority of the experiments, using LM filtering leads to improved F1/EM metrics. AQGen benefits the most from LM filtering, as it generates data of lower quality than the other two models. Tables 10 and 12 in the Appendix show examples of QA pairs and their LM scores. Fig. 4 shows experimental results when varying the number of (q, a) pairs selected from the 10 pairs sampled per passage. We chose the value of 5 as this configuration outperforms the other values overall. A high value is more likely to admit undesired pairs, while a low value might discard plenty of high-quality samples.

Correlation between LM and F1 Scores
In this work, we proposed using the LM score of the generated samples as a surrogate to round-trip filtering. We postulate that the LM score correlates with the F1 score used in round-trip filtering. To more thoroughly examine this, we devised an experiment where we sorted the generated samples by their answer LM scores, divided them into contiguous buckets each with 200 samples, and calculated the average F1 score of the samples in each bucket. Fig. 5 shows the results of this experiment. As we can see, there exists a strong correlation between the two scores.
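The bucketing procedure can be sketched as follows, with `samples` given as (LM score, F1) pairs:

```python
def bucket_mean_f1(samples, bucket_size=200):
    """Sort (lm_score, f1) pairs by decreasing LM score, split into contiguous
    buckets of bucket_size, and return each bucket's mean F1."""
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    means = []
    for i in range(0, len(ordered), bucket_size):
        bucket = ordered[i:i + bucket_size]
        means.append(sum(f1 for _, f1 in bucket) / len(bucket))
    return means
```

A monotonically decreasing list of bucket means indicates the correlation reported above.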
While the correlation looks promising, a challenge with using the LM score is that it is relatively noisy. For example, to use the LM score to get only samples whose F1 scores are 1, a very high threshold needs to be set, forcing the vast majority of samples to be dropped. Future work can explore how to reduce this noise.

Impact of Synthetic Dataset Size
In Fig. 6, we present plots that correlate synthetic dataset size (in number of passages) with RC model performance (EM/F1). We can see that as the number of generated (q, a) pairs increases (5 pairs per passage), RC model performance improves. This correlation is more evident when not using the SQuAD training data, which is expected: with added supervised training samples, there is less need for a large number of synthetic samples.

Conclusions
We presented a novel end-to-end approach to generate question-answer pairs using a single transformer-based model. Our experiments showed that, with proper decoding, significant improvements in the domain adaptation of RC models can be achieved. We concluded that using LM filtering improves the quality of synthetic question-answer pairs; however, there is still a gap with respect to round-trip filtering on some of the target domains. Improving LM-score-based filtering is a future direction of our work. While we were able to generate diverse, high-quality and challenging synthetic samples on the target domains, the types of questions produced were still limited to those of SQuAD, since the generative models were trained on SQuAD. It would be interesting to explore how one can adapt the generative models to the question types of the target domain.

B.3 Summation versus Averaging of LM Scores
Averaging the LM scores favors longer question-answer pairs, which are more likely to consist of incorrect samples. By using summation, shorter question-answer pairs are more likely to be selected during LM filtering.

C Examples of Generated Samples
C.1 Illustration of Answer LM Score
Tab. 10 presents unfiltered question-answer pairs and the associated answer LM scores generated from a randomly selected Natural Questions passage using the QAGen2S model. As can be seen from the Topk+Nucleus decoded samples, the last two generated samples are incorrect and would be filtered out by the LM filtering approach used in this work. The last sample, whose answer is entirely irrelevant to its question, has a considerably lower answer LM score than the rest of the samples.
With beam search, due to the high number of repetitions, the scores are close. While beam search generates samples with high likelihood, the lack of diversity, as evident here, causes RC models trained on such synthetic samples to underperform those trained on Topk+Nucleus samples.

C.2 Comparison of Generated Samples by AQGen, QAGen and QAGen2S
Tab. 11 presents unfiltered question-answer pairs generated by each of our proposed models on a randomly selected passage from the CNN/Daily Mail corpus. We can observe that the samples generated by AQGen have lower quality than those of the other two models, and the selected spans are repetitive: only 3 out of the 6 properly generated samples are correct question-answer pairs. Comparing QAGen and QAGen2S samples, we can observe that QAGen2S generates more diverse and longer answer spans; in this example, more repeated spans are generated by QAGen than by QAGen2S. While the Topk+Nucleus sampling approach improves the diversity of generated question-answer pairs, we can still see repetitions and incorrect pairs. We believe that with LM score filtering the vast majority of incorrect pairs are discarded. However, this also means there is room for improving the generative models.

C.3 Question Answers from Table
The Natural Questions dataset includes HTML-formatted passages. We noticed that some of them are web tables; Tab. 12 illustrates one such example. The content under Passage is the input string, as seen by the generative models, and Rendered Passage indicates how the table appears in a browser. We experimented with using the QGen model on this passage and noticed that the span detection module was not capable of properly distinguishing between textual content and HTML tags, resulting in selected spans that included HTML tags. However, the samples generated by the joint span and question generation model, QAGen2S in this example, show surprisingly high-quality spans and questions. Only one sample is not correct (What team is Tampa Bay's home arena?). We believe this is because, when span generation is conditioned on the generated question, the likelihood of spans that include spurious tokens, HTML tags in this example, diminishes sharply. This opens the door to the possibility of using our proposed models on structured corpora without any extra effort.

[Table 9: Comparison of using average versus summation of LM scores when doing LM filtering. Bold values indicate the best performance on each target domain for each model (per rows separated by solid lines).]

Table 10 (Topk+Nucleus samples):
Passage: <P> The United States is estimated to have a population of 327,589,916 as of April 23, 2018, making it the third most populous country in the world. It is very urbanized, with 81% residing in cities and suburbs as of 2014 (the worldwide urban rate is 54%). California and Texas are the most populous states, as the mean center of U.S. population has consistently shifted westward and southward. New York City is the most populous city in the United States. </P>
Q: As of April 23, 2018, what is the estimated population of the US? A: 327,589,916 (LM score: -0.00577)
Q: How many people lived in the US in April of 2018? A: 327,589,916 (LM score: -0.00707)
Q: What is the population of the United States? A: 327,589,916 (LM score: -0.01358)
Q: What is the most populous city in the United States? A: New York City (LM score: -0.04131)
Q: Where do 81 percent of Americans live? A: cities and suburbs (LM score: -0.05360)
Q: Where does the United States rank among most populous countries on the planet? A: third (LM score: -0.07449)
Q: How much of the US's population is concentrated in the metropolitan areas of the country? A: 81% (LM score: -0.09509)
Q: How much of the US population is urbanized? A: 81% (LM score: -0.1375)
Q: What two cities have the highest populations in America? A: California and Texas (LM score: -0.18128)
Q: What country is considered the most populous? A: third (LM score: -)

D Training and Platform Details
All of the experiments in this work were performed on Amazon EC2 instances. We employed p3.8xlarge, p3.16xlarge, and p3dn.24xlarge GPU instances. In the training of the generative models, warmup was set to 10% of the total training steps and we used a batch size of 24. Each epoch took 2 to 3 hours on 3 GPUs. We observed that, usually, the best model is achieved within the first two epochs. The RC models with Synthetic+SQuAD samples were trained by combining the synthetic samples with the SQuAD training set and randomly shuffling them. Each epoch of training took 2 to 12 hours on 1 GPU, depending on the average length of the target domain passages.
All of the hyperparameters of both generative and RC downstream models were fixed. We only performed hyperparameter tuning on those mentioned in the paper.