Synthetic QA Corpora Generation with Roundtrip Consistency

We introduce a novel method of generating synthetic question answering corpora by combining models of question generation and answer extraction, and by filtering the results to ensure roundtrip consistency. By pretraining on the resulting corpora we obtain significant improvements on SQuAD2 and NQ, establishing a new state-of-the-art on the latter. Our synthetic data generation models, for both question generation and answer extraction, can be fully reproduced by finetuning a publicly available BERT model on the extractive subsets of SQuAD2 and NQ. We also describe a more powerful variant that does full sequence-to-sequence pretraining for question generation, obtaining exact match and F1 at less than 0.1% and 0.4% from human performance on SQuAD2.


Introduction
Significant advances in Question Answering (QA) have recently been achieved by pretraining deep transformer language models on large amounts of unlabeled text data, and finetuning the pretrained models on hand labeled QA datasets, e.g. with BERT (Devlin et al., 2018).
Language modeling is however just one example of how an auxiliary prediction task can be constructed from widely available natural text, namely by masking some words from each passage and training the model to predict them. It seems plausible that other auxiliary tasks might exist that are better suited for QA, but can still be constructed from widely available natural text. It also seems intuitive that such auxiliary tasks will be more helpful the closer they are to the particular QA task we are attempting to solve.
Based on this intuition we construct auxiliary tasks for QA, generating millions of synthetic question-answer-context triples from unlabeled passages of text, pretraining a model on these examples, and finally finetuning on a particular labeled dataset. Our auxiliary tasks are illustrated in Table 1. For a given passage C, we sample an extractive short answer A (Step (1) in Table 1). In Step (2), we generate a question Q conditioned on A and C, then (Step (3)) predict the extractive answer A′ conditioned on Q and C. If A and A′ match we finally emit (C, Q, A) as a new synthetic training example (Step (4)). We train a separate model on labeled QA data for each of the first three steps, and then apply the models in sequence on a large number of unlabeled text passages. We show that pretraining on synthetic data generated through this procedure provides us with significant improvements on two challenging datasets, SQuAD2 (Rajpurkar et al., 2018) and NQ (Kwiatkowski et al., 2019), achieving a new state of the art on the latter.
[...] question-answer pairs to improve a QA system, showing large improvements in low-resource settings with few gold labeled examples. Validating and improving the accuracy of these generated QA pairs, however, is relatively unexplored.
In machine translation, modeling consistency with dual learning (He et al., 2016) or backtranslation (Sennrich et al., 2016) across both translation directions improves the quality of translation models. Back-translation, which adds synthetically generated parallel data as training examples, was an inspiration for this work, and has led to state-of-the-art results in both the supervised (Edunov et al., 2018) and the unsupervised settings (Lample et al., 2018). Lewis and Fan (2019) model the joint distribution of questions and answers given a context and use this model directly, whereas our work uses generative models to generate synthetic data to be used for pretraining. Combining these two approaches could be an area of fruitful future work.
We use BERT (Devlin et al., 2018)* to model each of these distributions. Inputs to each of these models are fixed-length sequences of wordpieces, listing the tokenized question (if one is available) followed by the context c. The answer extraction model is detailed in §3.1 and the two variants of the question generation model in §3.2 and §3.3. The question answering model follows [...].

Question (Un)Conditional Extractive QA
We define a question-unconditional extractive answer model $p(a \mid c; \theta_A)$ and a question-conditional extractive answer model $p(a \mid q, c; \theta_{A'})$ as follows:
$$p(a \mid c; \theta_A) = \frac{e^{f_J(a, c; \theta_A)}}{\sum_{a''} e^{f_J(a'', c; \theta_A)}}$$
$$p(a \mid q, c; \theta_{A'}) = \frac{e^{f_I(a, c, q; \theta_{A'})}}{\sum_{a''} e^{f_I(a'', c, q; \theta_{A'})}}$$
where $a$ and $a''$ are defined to be token spans over $c$. For $p(a \mid c; \theta_A)$, $a$ and $a''$ are constrained to be of length up to $L_A$, set to 32 word piece tokens. The key difference between the two expressions is that $f_I$ scores the start and the end of each span independently, while $f_J$ scores them jointly.
* Some experiments use a variant of BERT that masks out whole words at training time, similar to Sun et al. (2019). See https://github.com/google-research/bert for both the original and whole word masked versions of BERT.
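Specifically, we define $f_J : \mathbb{R}^h \to \mathbb{R}$ and $f_I : \mathbb{R}^h \to \mathbb{R}$ to be transformations of the final token representations computed by a BERT model:
$$f_J(a, c; \theta_A) = \mathrm{MLP}_J(\mathrm{CONCAT}(\mathrm{BERT}(c)[s], \mathrm{BERT}(c)[e]))$$
$$f_I(a, c, q; \theta_{A'}) = \mathrm{AFF}_I(\mathrm{BERT}(q, c)[s]) + \mathrm{AFF}_I(\mathrm{BERT}(q, c)[e])$$
Here $h$ is the hidden representation dimension, $(s, e) = a$ is the answer span, and $\mathrm{BERT}(t)[i]$ is the BERT representation of the $i$'th token in token sequence $t$. $\mathrm{MLP}_J$ is a multi-layer perceptron with a single hidden layer, and $\mathrm{AFF}_I$ is an affine transformation.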
We found it was critical to model span start and end points jointly in $p(a \mid c; \theta_A)$ because, when the question is not given, there are usually multiple acceptable answers for a given context, so that the start point of an answer span cannot be determined separately from the end point.
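The point about joint versus independent span scoring can be illustrated with a small sketch. It is not the paper's model: the score tables below are made-up numbers, and `f_joint` is a lookup table standing in for any scorer that sees the (start, end) pair as a whole, where the paper uses an MLP over BERT representations. The toy passage has two acceptable answers, spans (0, 1) and (3, 4).

```python
import math


def enumerate_spans(n_tokens, max_len):
    """All (start, end) spans, end inclusive, with at most max_len tokens."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s, min(n_tokens, s + max_len))]


def span_distribution(score_fn, n_tokens, max_len):
    """Softmax over span scores: p(a) = exp(f(a)) / sum over a'' of exp(f(a''))."""
    spans = enumerate_spans(n_tokens, max_len)
    scores = [score_fn(s, e) for s, e in spans]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scores]
    total = sum(weights)
    return {span: w / total for span, w in zip(spans, weights)}


# Hypothetical per-token scores for a 5-token passage (illustrative values only).
START_SCORES = [5.0, 0.0, 0.0, 5.0, 0.0]
END_SCORES = [0.0, 5.0, 0.0, 0.0, 5.0]


def f_independent(s, e):
    # Start and end are scored separately and summed, in the spirit of f_I.
    return START_SCORES[s] + END_SCORES[e]


# Hypothetical pairwise scores: only the two intended spans score highly.
JOINT_SCORES = {(0, 1): 10.0, (3, 4): 10.0}


def f_joint(s, e):
    # Stand-in for f_J: the score depends on the (start, end) pair as a whole.
    return JOINT_SCORES.get((s, e), 0.0)


p_ind = span_distribution(f_independent, n_tokens=5, max_len=5)
p_joint = span_distribution(f_joint, n_tokens=5, max_len=5)

# The independent scorer cannot prefer (0, 1) and (3, 4) without also
# preferring the unintended span (0, 4); the joint scorer can.
print(p_ind[(0, 1)], p_ind[(0, 4)])
print(p_joint[(0, 1)], p_joint[(0, 4)])
```

With independent scores, any start that looks good pairs with any end that looks good, so the long span (0, 4) receives the same probability as the two intended answers. The joint scorer is free to give it a low score.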

Question Generation: Fine-tuning Only
Text generation allows for a variety of choices in model architecture and training data. In this section we opt for a simple adaptation of the public BERT model for text generation. This adaptation does not require any additional pretraining and no extra parameters need to be trained from scratch at finetuning time. This question generation system can be reproduced by simply finetuning a publicly available pretrained BERT model on the extractive subsets of datasets like SQuAD2 and NQ.
Fine-tuning. We define the $p(q \mid c, a; \theta_Q)$ model as a left-to-right language model
$$p(q \mid c, a; \theta_Q) = \prod_{i=1}^{L_Q} p(q_i \mid q_1, \ldots, q_{i-1}, c, a; \theta_Q) = \prod_{i=1}^{L_Q} \frac{e^{f_Q(q_1, \ldots, q_i, c, a; \theta_Q)}}{\sum_{q_i'} e^{f_Q(q_1, \ldots, q_i', c, a; \theta_Q)}}$$
where $q = (q_1, \ldots, q_{L_Q})$ is the sequence of question tokens and $L_Q$ is a predetermined maximum question length, but, unlike the more usual encoder-decoder approach, we compute $f_Q$ using the single encoder stack from the BERT model:
$$f_Q(q_1, \ldots, q_i, c, a; \theta_Q) = \mathrm{BERT}(q_1, \ldots, q_{i-1}, c, a)[i-1] \cdot W_{\mathrm{BERT}}^{\top}$$
where $W_{\mathrm{BERT}}$ is the word piece embedding matrix in BERT. All parameters of BERT including $W_{\mathrm{BERT}}$ are finetuned. In the context of question generation, the input answer is encoded by introducing a new token type id for the tokens in the extractive answer span, e.g. the question tokens being generated have type 0 and the context tokens have type 1, except for the ones in the answer span that have type 2. We always pad or truncate the question being input to BERT to a constant length $L_Q$ to avoid giving the model information about the length of the question we want it to generate.
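The input encoding described above (token type ids 0, 1 and 2, and a question padded or truncated to a constant length) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `[PAD]` token string and the inclusive answer-span convention are hypothetical, and special tokens such as `[CLS]` and `[SEP]` are left out for brevity.

```python
def build_question_generation_input(question, context, answer_span,
                                    max_question_len, pad_token="[PAD]"):
    """Return (tokens, type_ids) for a question followed by its context.

    answer_span is an inclusive (start, end) pair of indices into `context`.
    """
    start, end = answer_span
    # Pad or truncate so the question always occupies max_question_len slots,
    # which hides the true question length from the model.
    padded = (list(question) + [pad_token] * max_question_len)[:max_question_len]
    tokens = padded + list(context)
    type_ids = [0] * max_question_len + [
        2 if start <= i <= end else 1  # 2 marks the answer span, 1 the rest
        for i in range(len(context))
    ]
    return tokens, type_ids


tokens, type_ids = build_question_generation_input(
    question=["who", "painted", "it", "?"],
    context=["the", "mona", "lisa", "by", "leonardo"],
    answer_span=(4, 4),
    max_question_len=6,
)
print(tokens)
print(type_ids)
```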
This model can be trained efficiently by using an attention mask that forces to zero all the attention weights from $c$ to $q$ and from $q_i$ to $q_{i+1}, \ldots, q_{L_Q}$ for all $i$.
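One way to read this mask is sketched below, assuming the question occupies the first `question_len` positions and the context the rest, and taking `mask[i][j] == 1` to mean that position `i` may attend to position `j`. The function name and the row/column convention are assumptions made for illustration, not details given in the text.

```python
def build_attention_mask(question_len, context_len):
    """Return an n x n 0/1 matrix; mask[i][j] == 1 lets position i attend to j."""
    n = question_len + context_len
    mask = [[1] * n for _ in range(n)]
    for i in range(n):
        for j in range(question_len):
            if i >= question_len:
                # Context positions never attend to question positions.
                mask[i][j] = 0
            elif j > i:
                # Question token i never attends to later question tokens.
                mask[i][j] = 0
    return mask


mask = build_attention_mask(question_len=3, context_len=2)
for row in mask:
    print(row)
```

Under this reading, every question position sees the full context plus the question tokens up to and including itself, so all prefixes of a question can be scored in a single forward pass.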
Question Generation. At inference time we generate questions through iterative greedy decoding, by computing $\arg\max_{q_i} f_Q(q_1, \ldots, q_i, a, c)$ for $i = 1, \ldots, L_Q$. Question-answer pairs are kept only if they satisfy roundtrip consistency.
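The greedy decoding loop can be sketched in a few lines. The scorer here is a toy stand-in for $f_Q$ that simply rewards a fixed target sequence; `greedy_decode`, `toy_score`, `VOCAB` and `TARGET` are hypothetical names, and a real scorer would be the finetuned BERT model.

```python
def greedy_decode(score_fn, vocab, max_len, context, answer):
    """Pick the highest-scoring token at each of max_len positions."""
    question = []
    for _ in range(max_len):
        # Score every candidate continuation of the current prefix.
        best = max(vocab, key=lambda tok: score_fn(question + [tok], context, answer))
        question.append(best)
    return question


# Toy setup: the scorer prefers one fixed question, padded to the full length.
TARGET = ["who", "wrote", "hamlet", "?", "[PAD]"]
VOCAB = ["[PAD]", "?", "hamlet", "who", "wrote"]


def toy_score(prefix, context, answer):
    # Stand-in for f_Q(q_1, ..., q_i, a, c): 1.0 if the newest token is the
    # target token for its position, else 0.0. Context and answer are ignored.
    position = len(prefix) - 1
    return 1.0 if prefix[-1] == TARGET[position] else 0.0


decoded = greedy_decode(toy_score, VOCAB, max_len=5,
                        context=["shakespeare", "wrote", "hamlet"], answer=(0, 0))
print(decoded)
```

The loop always runs for the full `max_len` steps, matching the constant question length $L_Q$; shorter questions end in padding tokens.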

Question Generation: Full Pretraining
The prior section addressed a restricted setting in which a BERT model was fine-tuned without any further changes. In this section, we describe an alternative approach for question generation that fully pretrains and fine-tunes a sequence-to-sequence generation model.
Pretraining. Section 3.2 used only an encoder for question generation. In this section, we use a full sequence-to-sequence Transformer (both encoder and decoder). The encoder is trained identically (BERT pretraining, Wikipedia data), while the decoder is trained to output the next sentence.
Fine-tuning. Fine-tuning is done as in Section 3.2: the input is (C, A) and the output is Q, with tuples drawn from a supervised question-answering dataset (e.g., SQuAD).

Question Generation
To get examples of synthetic (C, Q, A) triples, we sample from the decoder with both beam search and Monte Carlo search. As before, we use roundtrip consistency to keep only the high precision triples.
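The roundtrip consistency filter itself is independent of how questions are generated. A minimal sketch follows, with the three models passed in as plain callables. All three toy stand-ins below are hypothetical and far simpler than the BERT-based models in the paper, and answers are represented as strings rather than token spans.

```python
import random


def roundtrip_filter(passages, sample_answer, generate_question, answer_question):
    """Keep (context, question, answer) only if the QA model recovers the answer."""
    kept = []
    for context in passages:
        answer = sample_answer(context)                  # C -> A
        if answer is None:
            continue
        question = generate_question(context, answer)    # C, A -> Q
        predicted = answer_question(context, question)   # C, Q -> A'
        if predicted == answer:                          # roundtrip check
            kept.append((context, question, answer))
    return kept


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_candidates(context):
    return [word for word in context.split() if word[0].isupper()]


def toy_sample_answer(context):
    candidates = toy_candidates(context)
    return rng.choice(candidates) if candidates else None


def toy_generate_question(context, answer):
    return "which name has " + str(len(answer)) + " letters?"


def toy_answer_question(context, question):
    # A deliberately weak QA model: always returns the first candidate.
    candidates = toy_candidates(context)
    return candidates[0] if candidates else None


passages = [
    "the city of Paris is large",
    "no names here",
    "both Rome and Milan appear",
]
kept = roundtrip_filter(passages, toy_sample_answer,
                        toy_generate_question, toy_answer_question)
print(kept)
```

In this toy setup the third passage survives only when the sampled answer happens to be the one the weak QA model returns, which is the sense in which the filter trades recall for precision.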

Why Does Roundtrip Consistency Work?
A key question for future work is to develop a more formal understanding of why the roundtrip method improves accuracy on question answering tasks (similar questions arise for the backtranslation methods of Edunov et al. (2018) and Sennrich et al. (2016); a similar theory may apply to these methods). In the supplementary material we sketch a possible approach, inspired by the method of Balcan and Blum (2005) for learning with labeled and unlabeled data. This section is intentionally rather speculative but is intended to develop intuition about the methods, and to propose possible directions for future work on developing a formal grounding.
In brief, the approach discussed in the supplementary material suggests optimizing the log-likelihood of the labeled training examples, under a constraint that some measure of roundtrip consistency $\beta(\theta_A)$ on unlabeled data is greater than some value $\gamma$. The value for $\gamma$ can be estimated using performance on development data. The auxiliary function $\beta(\theta_A)$ is chosen such that: (1) the constraint $\beta(\theta_A) \geq \gamma$ eliminates a substantial part of the parameter space, and hence reduces sample complexity; (2) the constraint $\beta(\theta_A) \geq \gamma$ nevertheless includes 'good' parameter values that fit the training data well. The final step in the argument is to make the case that the algorithms described in the current paper may effectively be optimizing a criterion of this kind. Specifically, the auxiliary function $\beta(\theta_A)$ is defined as the log-likelihood of noisy $(c, q, a)$ triples generated from unlabeled data using the $C \to A$ and $C, A \to Q$ models; the parameters $\theta_A$ are constrained to achieve a relatively high value of $\beta(\theta_A)$ by pre-training the model on these examples. Future work should consider this connection in more detail.
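Written out, the criterion described in this paragraph might take the following form. This is only a sketch of the verbal description: the set names $D_L$ (labeled examples) and $D_S$ (synthetic triples generated from unlabeled data) are introduced here for illustration, and the supplementary material may define $\beta$ differently, for example with a normalization.

```latex
\begin{aligned}
\hat{\theta}_A \;=\; \arg\max_{\theta_A} \;& \sum_{(c, q, a) \in D_L} \log p(a \mid q, c; \theta_A) \\
\text{subject to} \;\;& \beta(\theta_A) \geq \gamma, \\
\text{where} \;\;& \beta(\theta_A) \;=\; \sum_{(c, q, a) \in D_S} \log p(a \mid q, c; \theta_A).
\end{aligned}
```

Under this reading, pretraining on $D_S$ moves $\theta_A$ into the region where $\beta(\theta_A)$ is high, and finetuning on $D_L$ then optimizes the objective starting from inside that region.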

Experimental Setup
We considered two datasets in this work: SQuAD2 (Rajpurkar et al., 2018) and Natural Questions (NQ) (Kwiatkowski et al., 2019). SQuAD2 is a dataset of questions about Wikipedia passages, formulated and answered by human annotators. NQ is a dataset of Google queries with answers from Wikipedia pages provided by human annotators. We used the full text from the training set of NQ (1B words) as a source of unlabeled data. In our fine-tuning only experiments (Section 3.2) we trained two triples of models $(\theta_A, \theta_Q, \theta_{A'})$ on the extractive subsets of SQuAD2 and NQ. We extracted 8M unlabeled windows of 512 tokens from the NQ training set. For each unlabeled window we generated one example from the SQuAD2-trained models and one example from the NQ-trained models. For A we picked an answer uniformly from the top 10 extractive answers according to $p(a \mid c; \theta_A)$. For A′ we picked the best extractive answer according to $p(a \mid c, q; \theta_{A'})$. Filtering for roundtrip consistency gave us 2.4M and 3.2M synthetic positive instances from SQuAD2- and NQ-trained models respectively. We then added synthetic unanswerable instances by taking the question generated from a window and associating it with a non-overlapping window from the same Wikipedia page. We then sampled negatives to obtain a total of 3M and 4M synthetic training instances for SQuAD2 and NQ respectively. We trained models analogous to [...], initializing from the public BERT model, with a batch size of 128 examples for one epoch on each of the two sets of synthetic examples and on the union of the two, with a learning rate of $2 \cdot 10^{-5}$ and no learning rate decay. We then fine-tuned the resulting models on SQuAD2 and NQ.
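The construction of synthetic unanswerable instances can be sketched as below. The `Window` type, its token-offset fields and the function names are assumptions made for illustration; the text only specifies that the question is paired with a non-overlapping window from the same Wikipedia page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    page_id: str
    start: int  # token offset within the page, inclusive
    end: int    # token offset within the page, exclusive


def overlaps(a, b):
    """True if two windows on the same page share at least one token."""
    return a.page_id == b.page_id and a.start < b.end and b.start < a.end


def make_unanswerable(source, question, windows, rng):
    """Pair a question with a non-overlapping window from the same page.

    Returns (window, question, None), where None marks "no answer", or
    None if the page has no suitable window.
    """
    candidates = [w for w in windows
                  if w.page_id == source.page_id and not overlaps(w, source)]
    if not candidates:
        return None
    return (rng.choice(candidates), question, None)


windows = [
    Window("page-1", 0, 512),
    Window("page-1", 256, 768),
    Window("page-1", 512, 1024),
    Window("page-2", 0, 512),
]
rng = random.Random(0)
negative = make_unanswerable(windows[0], "who wrote hamlet ?", windows, rng)
no_negative = make_unanswerable(windows[3], "who wrote hamlet ?", windows, rng)
print(negative)
print(no_negative)
```

For the first window, only the window starting at token 512 qualifies, since the one starting at 256 shares tokens with the source. The single window on the second page has no partner, so no negative is produced for it.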
In our full pretraining experiments (Section 3.3) we only trained $(\theta_A, \theta_Q, \theta_{A'})$ on SQuAD2. However, we pretrained our question generation model on all of the BERT pretraining data, generating the next sentence left-to-right. We created a synthetic, roundtrip filtered corpus with 50M examples. We then fine-tuned the model on SQuAD2 as previously described. We experimented with both the single model setting and an ensemble of 6 models.
† https://github.com/google-research/bert
Figure 1: Learning curves for pretraining using synthetic question-answering data (fine-tuning only setting). "no-RT" refers to omitting the roundtrip consistency check. Best exact match is reported after finetuning on SQuAD2. Performance improves with the amount of synthetic data. For a fixed amount of synthetic data, having a more diverse source (NQ+SQuAD vs. just SQuAD) yields higher accuracies. Roundtrip filtering gives further improvements.

Results
The final results are shown in Tables 2 and 3. We found that pretraining on SQuAD2 and NQ synthetic data increases the performance of the finetuned model by a significant margin. On the NQ short answer task, the relative reduction in headroom is 50% to the single human performance and 10% to human ensemble performance. We additionally found that pretraining on the union of synthetic SQuAD2 and NQ data is very beneficial on the SQuAD2 task, but does not improve NQ results.
The full pretraining approach with ensembling obtains the highest EM and F1 listed in Table 2. This result is only 0.1-0.4% from human performance and is the third best model on the SQuAD2 leaderboard as of this writing (5/31/19).
Roundtrip Filtering. Roundtrip filtering appears to be consistently beneficial. As shown in Figure 1, models pretrained on roundtrip consistent data outperform their counterparts pretrained without filtering. From manual inspection, of 46 (C, Q, A) triples that were roundtrip consistent, 39% were correct, while of 44 triples that were discarded only 16% were correct.
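To give a sense of scale for the manual inspection, the reported percentages can be converted back into approximate counts. The counts are not stated in the text; the values computed below are only those implied by rounding, and the helper name is hypothetical.

```python
def implied_count(sample_size, reported_percent):
    """Nearest whole number of items matching a reported percentage."""
    return round(sample_size * reported_percent / 100)


kept_correct = implied_count(46, 39)        # roundtrip-consistent triples
discarded_correct = implied_count(44, 16)   # triples removed by the filter

print(kept_correct, "of 46 kept triples judged correct (implied)")
print(discarded_correct, "of 44 discarded triples judged correct (implied)")
```

The samples are small, so these figures are best read as a rough indication that filtering raises precision rather than as a precise estimate.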
Data Source. Generated question-answer pairs are illustrative of the differences in the style of questions between SQuAD2 and NQ. We show a few examples in Table 4, where the same passage is used to create a SQuAD2-style and an NQ-style question-answer pair. The SQuAD2 models seem better at creating questions that directly query a specific property of an entity expressed in the text. The NQ models seem instead to attempt to create questions around popular themes, like famous works of art or TV shows, and then extract the answer by combining information from the entire passage.

Conclusion
We presented a novel method to generate synthetic QA instances and demonstrated improvements from this data on SQuAD2 and on NQ. We additionally proposed a possible direction for formal grounding of this method, which we hope to develop more thoroughly in future work.