Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering

Question Answering (QA) is in increasing demand as the amount of information available online and the desire for quick access to this content grow. A common approach to QA has been to fine-tune a pretrained language model on a task-specific labeled dataset. This paradigm, however, relies on large-scale human-labeled data, which is scarce and costly to obtain. We propose an unsupervised approach to training QA models with generated pseudo-training data. We show that generating questions for QA training by applying a simple template to a related, retrieved sentence, rather than the original context sentence, improves downstream QA performance by allowing the model to learn more complex context-question relationships. Training a QA model on this data yields a relative improvement over a previous unsupervised model of about 14% in F1 score on the SQuAD dataset, and 20% when the answer is a named entity, achieving state-of-the-art performance on SQuAD for unsupervised QA.


Introduction
Question Answering aims to answer a question based on a given knowledge source. Recent advances have driven the performance of QA systems to above or near-human performance on QA datasets such as SQuAD (Rajpurkar et al., 2016) and Natural Questions (Kwiatkowski et al., 2019), thanks to pretrained language models such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019). Fine-tuning these language models, however, requires large-scale task-specific data. Creating a dataset for every new domain is extremely costly and practically infeasible. The ability to apply QA models to out-of-domain data in an efficient manner is thus very desirable. This problem may be approached with domain adaptation or transfer learning techniques (Chung et al., 2018) as well as data augmentation (Yang et al., 2017; Dhingra et al., 2018; Wang et al., 2018). However, here we expand upon the recently introduced task of unsupervised question answering (Lewis et al., 2019) to examine the extent to which synthetic training data alone can be used to train a QA model. In particular, we focus on the machine reading comprehension setting, in which the context is a given paragraph and the QA model can only access this paragraph to answer a question. Furthermore, we work on extractive QA, where the answer is assumed to be a contiguous sub-string of the context.

A training instance for supervised reading comprehension consists of three components: a question, a context, and an answer. For a given dataset domain, a collection of documents can usually be obtained easily, providing context in the form of paragraphs or sets of sentences. Answers can be gathered from keywords and phrases in the context. We focus mainly on factoid QA, in which the question concerns a concise fact. In particular, we emphasize questions whose answers are named entities, the majority type of factoid questions. Entities can be extracted from text using named entity recognition (NER) techniques and used as the training instance's answer. Thus, the main challenge, and the focus of this paper, is creating a relevant question from a (context, answer) pair in an unsupervised manner.

1 Equal contribution. 2 Work done during internship at the AWS AI Labs.

Figure 1: Question Generation Pipeline: the original context sentence containing a given answer is used as a query to retrieve a related sentence containing matching entities, which is input into our question-style converter to create QA training data.
Recent work by Lewis et al. (2019) uses style transfer for generating questions for (context, answer) pairs, but shows little improvement over a much simpler question generator that drops, permutes and masks words. We improve upon this work by proposing a simple, intuitive, retrieval- and template-based question generation approach, illustrated in Figure 1. The idea is to retrieve a sentence from the corpus similar to the current context, and then generate a question based on that sentence. Having created a question for all (context, answer) pairs, we then fine-tune a pretrained BERT model on this data and evaluate on the SQuAD v1.1 dataset (Rajpurkar et al., 2016).
Our contributions are as follows: we introduce a retrieval- and template-based framework which achieves state-of-the-art results on SQuAD for unsupervised models, particularly when the answer is a named entity. We perform ablation studies to determine the effect of the components in template question generation. We are releasing our synthetic training data and code.1

Unsupervised QA Approach
We focus on creating high-quality, non-trivial questions which will allow the model to learn to extract the proper answer from a context-question pair.
Sentence Retrieval: A standard cloze question can be obtained by taking the original sentence in which the answer appears in the context and masking the answer with a chosen token. However, a model trained on this data will only learn text matching and how to fill in the blank, with little generalizability. For this reason, we choose a retrieval-based approach: we obtain a sentence similar to the one containing the answer, and create the question from that retrieved sentence. For our experiments, we focus on answers which are named entities, which has proven to be a useful prior assumption for downstream QA performance (Lewis et al., 2019), as confirmed by our initial experiments. First, we index all of the sentences from a Wikipedia dump using the ElasticSearch search engine. We also extract named entities for each sentence in both the Wikipedia corpus and the sentences used as queries; we assume access to a named-entity recognition system, and in this work make use of the spaCy NER pipeline. Then, for a given context-answer pair, we query the index, using the original context sentence as a query, to return a sentence which (1) contains the answer, (2) does not come from the context, and (3) has an F1 score below 95% with the query sentence, to discard highly similar or plagiarized sentences. Besides ensuring that the retrieved sentence and the query sentence share the answer entity, we require that at least one additional matching entity appears in both the query sentence and the entire context, and we perform ablation studies on the effect of this matching below. These retrieved sentences are then fed into our question-generation module.

Figure 2: Example of synthetically generated questions using generic cloze-style questions as well as a template-based approach.
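The three retrieval constraints above can be sketched as a simple filter over candidate sentences. This is a minimal illustration, not our exact implementation: the ElasticSearch query itself is abstracted away, and `token_f1` is a plain token-overlap approximation of the F1 measure used for the near-duplicate check.

```python
def token_f1(a_tokens, b_tokens):
    """Token-level F1 overlap between two tokenized sentences."""
    b_counts = {}
    for t in b_tokens:
        b_counts[t] = b_counts.get(t, 0) + 1
    common = 0
    for t in a_tokens:
        if b_counts.get(t, 0) > 0:
            common += 1
            b_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(a_tokens)
    recall = common / len(b_tokens)
    return 2 * precision * recall / (precision + recall)

def accept_candidate(query_sent, candidate_sent, answer, context_sents,
                     f1_threshold=0.95):
    """Apply the three retrieval constraints to one candidate sentence."""
    # (1) the candidate must contain the answer
    if answer not in candidate_sent:
        return False
    # (2) the candidate must not come from the original context
    if candidate_sent in context_sents:
        return False
    # (3) discard near-duplicates / plagiarized copies of the query sentence
    return token_f1(query_sent.split(), candidate_sent.split()) < f1_threshold
```

In practice, candidates that pass this filter are then subject to the auxiliary entity-matching requirement before being handed to the question-generation module.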
Template-based Question Generation: We consider several question styles: (1) generic cloze-style questions, where the answer is replaced by the token "[MASK]"; (2) the templated question "Wh + B + A + ?", as well as variations on the ordering of this template, as shown in Figure 2. Given the retrieved sentence in the form [Fragment A] [Answer] [Fragment B], the templated question "Wh + B + A + ?" replaces the answer with a wh-component (e.g., what, who, where) chosen according to the entity type of the answer, and places the wh-component at the beginning of the question, followed by Fragment B and then Fragment A. For the choice of wh-component, we sample a bi-gram based on the prior probability of that bi-gram being associated with the named-entity type of the answer. This prior probability is calculated from named-entity types and question bi-gram starters in the SQuAD dataset. This information does not make use of full context-question-answer triples and can be viewed as prior information, not disturbing the integrity of our unsupervised approach. Additionally, the choice of wh-component does not significantly affect results. For template-based approaches, we also experimented with clause-based templates but did not find significant differences in performance.
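The "Wh + B + A + ?" template can be sketched as follows. This is a simplified illustration: the entity-type-to-wh-word map is a toy stand-in for the bi-gram prior described above, and the code assumes the answer occurs exactly once in the sentence.

```python
# Toy stand-in for the wh bi-gram prior (hypothetical values).
WH_BY_ENTITY_TYPE = {
    "PERSON": "who",
    "GPE": "where",
    "DATE": "when",
}

def make_template_question(sentence, answer, answer_type, order="wh_b_a"):
    """Split the sentence as [Fragment A][Answer][Fragment B] and rebuild it
    as a wh-question, per the "Wh + B + A + ?" template (or the A/B-swapped
    variant)."""
    start = sentence.index(answer)               # assumes a single occurrence
    fragment_a = sentence[:start].strip().rstrip(".")
    fragment_b = sentence[start + len(answer):].strip().rstrip(".")
    wh = WH_BY_ENTITY_TYPE.get(answer_type, "what")
    if order == "wh_b_a":                        # best-performing ordering
        parts = [wh, fragment_b, fragment_a]
    else:                                        # "wh_a_b" variant
        parts = [wh, fragment_a, fragment_b]
    return " ".join(p for p in parts if p) + "?"
```

For example, the sentence "Einstein was born in Ulm in 1879." with answer "Ulm" becomes "where in 1879 Einstein was born in?" — ungrammatical, as the paper notes, but signaling both the answer type and the surrounding evidence.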

Experiments
Settings: For all downstream question answering models, we fine-tune a pretrained BERT model using the Transformers repository (Wolf et al., 2019) and report ablation study numbers using the base-uncased version of BERT, consistent with Lewis et al. (2019). All models are trained and validated on generated pairs of questions and answers along with their contexts, and tested on the SQuAD development set. The training set differs for each ablation study and is described below, while the validation dataset is a random set of 1,000 template-based generated data points, which is consistent across all ablation studies. We train all QA models for 2 epochs, checkpointing the models every 500 steps and choosing the checkpoint with the highest F1 score on the validation set as the best model. All ablation studies are averaged over two training runs with different seeds. Unless otherwise stated, experiments are performed using 50,000 synthetic QA training examples, as initial models performed best with this amount. We will make this generated training data public.

Model Analysis
Effect of retrieved sentences: We test the effect of retrieved vs. original sentences as input to question generation when using generic cloze questions. As shown in Table 1, using retrieved sentences improves over using the original sentence, reinforcing our motivation that a retrieved sentence, which may not trivially match the current context, forces the QA model to learn more complex relationships than simple entity matching. The retrieval process may return sentences which do not match the original context; on a random sample, however, 15/18 retrieved sentences were judged as entirely relevant to the original sentence. This retrieval is already quite good, as we use high-quality ElasticSearch retrieval and use the original context sentence as the query, not just the answer word. While we do not explicitly ensure that the retrieved sentence has the same meaning, we find that searching with entity matching largely yields semantically matching sentences. Additionally, we believe that sentences with loosely related meaning may act as a regularization factor which prevents the downstream QA model from learning only string-matching patterns. Along these lines, Lewis et al. (2019) found that a simple noise function of dropping, masking and permuting words was a strong question generation baseline. We believe that loosely related context sentences can act as a more intuitive noise function, and investigating the role of the semantic match of the retrieved sentences is an important direction for future work. In the sections which follow, we only show results with retrieved sentences, as the trend of improved performance held across all experiments.
Effect of template components: We evaluate the effect of individual template components on downstream QA performance. Results are shown in Table 2. Wh-template methods improve largely over the simple cloze templates. "Wh + B + A + ?" performs best among the template-based methods, as having the wh-word at the beginning most resembles the target SQuAD domain, and switching the order of Fragment B and Fragment A may force the model to learn more complex relationships from the question. We additionally test the effect of the wh-component and of the question mark added at the end of the sentence. Using the same data as "Wh + B + A + ?" but removing the wh-component results in a large decrease in performance. We believe this is because the wh-component signals the type of possible answer entities, which helps narrow down the space of possible answers. Removing the question mark at the end of the template also results in decreased performance, though the drop is smaller than when removing the wh-component. This may be a result of BERT pretraining, which expects certain punctuation based on sentence structure. We note that these questions may not be grammatical, which may have an impact on performance. Improving the question quality makes a difference in performance, as seen from the jump from cloze-style questions to template questions. The ablation studies suggest that a combination of question relevance, through matching entities, and question formulation, as described above, determines downstream performance. Balancing these two components is an interesting problem, and we leave improving grammaticality and fluency through means such as language model generation for future experiments.
In the last two rows of Table 2, we show the effect of the wh bi-gram prior on downstream QA training. Using the most common wh-word after grouping named entities into 5 categories according to Lewis et al. (2019) performs very close to the best-performing wh bi-gram prior method, while using a single wh-word (what) results in a significant decrease in performance. These results suggest that the information about named-entity type signaled by the wh-word does provide important information to the model, but that further information beyond wh-simple does not improve results significantly.
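The variants compared above can be sketched as follows. The counts here are toy values for illustration; in our pipeline the prior is estimated from named-entity types and question bi-gram starters in SQuAD.

```python
import random

# Toy wh bi-gram prior (hypothetical counts, NOT the SQuAD-estimated values).
WH_BIGRAM_COUNTS = {
    "PERSON": {"who was": 60, "who is": 30, "what person": 10},
    "DATE": {"when did": 50, "what year": 40, "when was": 10},
}

def sample_wh_bigram(entity_type, rng):
    """Wh bi-gram prior: sample a question-starting bi-gram conditioned on
    the answer's entity type."""
    bigrams, weights = zip(*WH_BIGRAM_COUNTS[entity_type].items())
    return rng.choices(bigrams, weights=weights, k=1)[0]

def wh_simple(entity_type):
    """Wh-simple: deterministically take the most common starter for the
    entity type's category."""
    counts = WH_BIGRAM_COUNTS[entity_type]
    return max(counts, key=counts.get)

def wh_single(_entity_type):
    """Single wh-word baseline: always "what", ignoring the entity type."""
    return "what"
```

The ablation result is intuitive from this sketch: `wh_single` throws away the entity-type signal entirely, while `wh_simple` already captures most of what the sampled prior provides.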
Effect of filtering by entity matching: Besides ensuring that the retrieved sentence and the query sentence share the answer entity, we require that at least one additional matching entity appears in both the query sentence and the entire context. Results are shown in Table 3. Auxiliary matching leads to improvements over no matching when using template-based data, with the best results obtained by matching against both the query and the context. Matching may filter out sentences whose topics are too far from the original context. We leave further investigation of the effect of retrieved sentence relevance to future work.
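The matching conditions compared in this ablation can be sketched as a filter over entity sets. This is an illustrative helper with hypothetical names; in practice the entity sets come from the spaCy NER pipeline.

```python
def passes_entity_matching(candidate_entities, query_entities,
                           context_entities, answer,
                           mode="query_and_context"):
    """Auxiliary entity matching: require at least one shared non-answer
    entity, according to the matching mode under ablation."""
    aux = candidate_entities - {answer}       # exclude the answer entity itself
    if mode == "none":                        # no auxiliary matching
        return True
    if mode == "query":                       # match the query sentence only
        return bool(aux & query_entities)
    # best setting: entity must appear in BOTH the query sentence and context
    return bool(aux & query_entities & context_entities)
```

Note how the strictest mode can reject a candidate that the query-only mode accepts, which is exactly the extra filtering the table attributes the best results to.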
Effect of synthetic training dataset size: Notably, Lewis et al. (2019) make use of approximately 4 million synthetic data points to train their model. However, we are able to train a better-performing model with far fewer examples, and we show that such a large set is unnecessary for their released synthetic training data as well. Figure 3 shows the performance from training on random subsets of differing sizes and testing on the SQuAD development data. We sample a random question for each context from the data of Lewis et al. (2019). Even with as few as 10k data points, training on our synthetically generated template-based data with auxiliary matching outperforms the results from the ablation studies in Lewis et al. (2019), and our template-based data consistently outperforms theirs. Training on either dataset shows similar trends: performance decreases after increasing the number of synthetic examples past 100,000, likely due to a distributional mismatch with the SQuAD data. We chose to use 50,000 examples for our final experiments and other ablation studies, as this number gave good performance in initial experiments.
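The per-context subsampling used for this comparison can be sketched as follows (a minimal sketch; `examples` is assumed to be a list of dicts with "context" and "question" keys, which is our own illustrative schema).

```python
import random

def one_question_per_context(examples, rng):
    """Group synthetic (context, question, ...) examples by context and keep
    one randomly chosen question per context."""
    by_context = {}
    for ex in examples:
        by_context.setdefault(ex["context"], []).append(ex)
    return [rng.choice(group) for group in by_context.values()]
```

Random fixed-size subsets for the size ablation can then be drawn from the result with `rng.sample`.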

Comparison of Best-Performing Models:
We compare training on our best template-based data with the state of the art in Table 4. SQuAD F1 results reflect performance on the hidden SQuAD test set. We report single-model numbers; Lewis et al. (2019) report an ensemble method achieving 56.40 F1 and a best single model achieving 54.7 F1. We make use of the whole-word-masking version of BERT-large, although using the original BERT-large gives similar performance of 62.69 F1 on the SQuAD dev set. We also report numbers on the subset of SQuAD questions whose answers are named entities, which we refer to as SQuAD-NER. The subset corresponding to the SQuAD development dataset has 4,338 samples, and may differ slightly from that of Lewis et al. (2019) due to differences in NER preprocessing. We also trained a fully-supervised model on the SQuAD training dataset with varying amounts of data and found that our unsupervised performance equals that of a supervised model trained on about 3,000 labeled examples.

Conclusion
In this paper we introduce a retrieval-based approach to unsupervised extractive question answering. A simple template-based approach achieves state-of-the-art results for unsupervised methods on the SQuAD dataset of 64.04 F1, and 77.55 F1 when the answer is a named entity. We analyze the effect of several components of our template-based approaches through ablation studies. In future work, we aim to experiment with other datasets and domains, incorporate our synthetic data in a semi-supervised setting, and test the feasibility of our framework in a multi-lingual setting.