Self-Supervised Test-Time Learning for Reading Comprehension

Recent work on unsupervised question answering has shown that models can be trained with procedurally generated question-answer pairs and can achieve performance competitive with supervised methods. In this work, we consider the task of unsupervised reading comprehension and present a method that performs “test-time learning” (TTL) on a given context (text passage), without requiring training on large-scale human-authored datasets containing context-question-answer triplets. This method operates directly on a single test context, uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context. Our method achieves accuracies competitive with fully supervised methods and significantly outperforms current unsupervised methods. TTL methods with a smaller model are also competitive with the current state-of-the-art in unsupervised reading comprehension.


Introduction
Reading comprehension is the task in which systems attempt to answer questions about a passage of text. Answers are typically found in the passage as text-spans or can be inferred through various forms of reasoning (Rajpurkar et al., 2016). The answer to the following question: "Who is the President of the United States?" depends on the timeframe and context of the passage provided, and will be different for news articles written in 2001 vs. 2021. If the context is the script of the TV series "The West Wing", the answer is "Jed Bartlet", and even in this fictional setting, it will later change to "Matt Santos".
Knowledge sources such as Wikipedia get updated when new events occur (such as the outcome of elections), or new facts about the world are revealed (such as scientific discoveries), with contributors adding new information and removing information that is no longer valid (Almeida et al., 2007). With such context-dependent answers and continual changes in knowledge, it is hard to justify training models over fixed corpora for tasks such as question answering (QA). We would like models to answer questions based on the given context and not to learn biases from datasets or historical news articles.
Moreover, supervised learning has been shown to perform poorly in QA tasks with adversarial examples (Jia and Liang, 2017), domain shift (Jia and Liang, 2017; Yogatama et al., 2019; Kamath et al., 2020), and biased or imbalanced data (Agrawal et al., 2018; McCoy et al., 2019). For example, QA systems trained on Wikipedia fail to generalize to newer domains such as Natural Questions (Rennie et al., 2020) or biomedical data, and suffer a significant drop in accuracy. Even small semantics-preserving changes to input sentences, such as the substitution of words by synonyms, have been shown to degrade performance in NLP tasks (Alzantot et al., 2018; Jia et al., 2019). Continual changes in text corpora are inevitable, calling for the development of robust methods that can reliably perform inference without being subject to such biases.
Supervised question answering faces challenges such as the need for large-scale (usually human-authored) training corpora. Such corpora typically require significant post-processing and filtering to remove annotation artifacts (Sakaguchi et al., 2020). To address these challenges, some recent methods approach question answering as an unsupervised learning task. A significant advantage of this approach is that it can be extended to domains and languages for which collecting a large human-authored training corpus is challenging. Methods for unsupervised QA procedurally generate a large corpus of (context, question, answer) triples and train large neural language models, such as BERT, on them.
In this work, we focus on unsupervised reading comprehension (RC) under evolving contexts and present the "Test-Time Learning" paradigm for this task. RC, the task of answering questions about a passage of text, is a natural setting for robust question-answering systems that do not overfit to training data: while large-scale language models trained on large datasets may contain global information, the answer must be extracted from the given context. Our work thus seeks to learn unsupervised reading comprehension without access to human-authored training data, operating independently on each test context. This makes our method 'distribution-blind', where each new context is assumed to come from a novel distribution. The test-time learning (TTL) framework enables smaller models to achieve improved performance with small sets of procedurally generated question-answer pairs, and is summarized below:
• a single context (text passage) c_i is given, from which we procedurally generate QA pairs;
• these QA pairs are used to train models to answer questions about c_i;
• inference is performed on previously unseen questions for c_i.
This framework has a simple assumption that every context comes from a distinct distribution. Hence, parameters learned for the previous context might not be useful to generalize to other contexts. This assumption holds where the contexts evolve over time, and rote memorization of answers might lead to wrong predictions. As such, the above process is repeated for each new context c i .
For question-answer generation, we use simple methods such as cloze translation, template-based question-answer generation (Fabbri et al., 2020), and question-answer semantic role labeling (QA-SRL) (He et al., 2015). We use two neural transformer-based language models, BERT-Large and DistilBERT (Sanh et al., 2019), to study the efficacy of our framework with large and small transformer models. We evaluate our method on two reading comprehension datasets, SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017). We investigate test-time training under multiple learning settings: (1) single-context learning, the "standard" setting; (2) K-neighbor learning, which retrieves the top-K related contexts for each test context; (3) curriculum learning, which progressively trains on question types of increasing complexity; and (4) online learning, which sequentially finetunes models on each incoming test sample.
Our experimental findings are summarized below:
• Test-time learning methods are effective for the task of reading comprehension and surpass the current state-of-the-art on two benchmarks: SQuAD and NewsQA.
• Online TTL trained over K-neighboring contexts of the test context is the best version, with EM/F1 gains of 7.3%/7.8% on SQuAD 1.1 and 5.3%/6.9% on NewsQA.
• DistilBERT, which has less than 1/5th of the model parameters of BERT-Large, is competitive with current SOTA methods that use BERT-Large.

Test-Time Reading Comprehension
Consider a reading comprehension test dataset with context text passages c_i, human-authored questions q_i, and true answers a_i. The QA model g(·) is parameterized by θ = (θ_f, θ_h), where θ_f are the parameters of the feature extractor and θ_h those of the answering head. The answer is predicted as a text span, given by the start and stop positions [y_start, y_stop]. Contemporary unsupervised RC models (Lewis, 2019) are trained on a large dataset where the QA pairs are synthetically generated from the context. In our setting, we do not use such large training datasets, but instead operate directly on individual test contexts c_i ∈ D_test. Given c_i, M synthetic question-answer pairs {(q̂_i^j, â_i^j)}_{j=1}^{M} are procedurally generated as described in Section 3. The QA model parameters θ are trained over the synthetic data to predict the answer span [ŷ_start, ŷ_stop] by optimizing the loss:

ℓ_ans = CE(ŷ_start, y_start) + CE(ŷ_stop, y_stop),    (1)

where CE is the cross-entropy loss. Inference is performed on human-authored questions to predict the answer spans: [y_start, y_stop] = g(c, q).
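The answer-span loss, a sum of cross-entropies over the start and stop positions, can be written as a small numerical sketch (this is an illustration of the loss only, not the actual BERT answering head):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of position logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def answer_span_loss(start_logits, stop_logits, y_start, y_stop):
    """ell_ans = CE(start) + CE(stop): negative log-likelihood of the
    gold start and stop token positions within the context."""
    return -(log_softmax(start_logits)[y_start]
             + log_softmax(stop_logits)[y_stop])
```

For uniform logits over N positions the loss reduces to 2·log N, which is a handy sanity check when implementing the head.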
Next, we describe the variants of test-time reading comprehension.

Figure 1: Overview of our self-supervised test-time learning framework for reading comprehension. Our method does not require a human-authored training dataset; it operates directly on each single test context and synthetically generates question-answer pairs over which the model parameters θ are optimized. Inference is performed with the trained parameters θ* on unseen human-authored questions.
Single-Context Test-Time RC. This is the standard formulation of test-time learning in this paper, with Equation 1 optimized over θ: for each context c_i, the feature extractor θ_f is re-initialized with pre-trained BERT weights, and the answering head θ_h is randomly initialized.
K-neighbor Test-Time RC. In this version, K contexts similar to the test context c_i are grouped together, and Equation 1 is optimized over each set of similar contexts, as opposed to single contexts in the standard setting. We index contexts in a Lucene-based information retrieval system (Gormley and Tong, 2015) and retrieve the top-K similar contexts given c_i, which we call Context Expansion with IR, described in Section 3.
Curriculum Test-Time RC. In the curriculum learning version, questions are presented in increasing order of complexity. We generate different types of questions: semantic role labeling, cloze completion, template-based, and dependency-tree-based translation of cloze questions to natural questions. This provides an ordering of complexity, and we study the effect of test-time training under such increasing complexity.
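A curriculum can be sketched as a stable sort of the synthetic QA pairs by their generation method. The particular complexity ranking below (QA-SRL before template-based before dependency-parsing-based) is one of the orderings evaluated in our analysis; treating it as a fixed rank table is an assumption of this sketch.

```python
# Assumed complexity ranking for curriculum ordering (one of the
# orderings studied in the analysis section; lower = presented earlier).
CURRICULUM_RANK = {"qa-srl": 0, "template": 1, "dependency-parsing": 2}

def order_by_curriculum(qa_pairs):
    """qa_pairs: list of (question, answer, method) triples.
    Returns the pairs sorted by the assumed complexity of their
    generation method; sorted() is stable, so ties keep their order."""
    return sorted(qa_pairs, key=lambda t: CURRICULUM_RANK[t[2]])
```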
Online Test-Time RC. In online test-time learning (TTL-Online), test samples are encountered in sequence. As such, the answering-head parameters θ_h are updated sequentially without being randomly re-initialized as in the standard single-context setting. For each new test context c_i, θ_h is initialized with the optimal parameters from the previous test context c_{i−1} to optimize Equation 1.

Self-Supervised QA Generation
In this section, we detail our framework for procedurally generating QA pairs from a given context. We use named-entity recognition from spaCy (Honnibal and Montani, 2017), dependency parsing from the Berkeley Neural Parser (Stern et al., 2017), and semantic role labeling (He et al., 2015) as our core methods to extract plausible answers and generate natural questions. As described in our task formulation, we create a set of M question-answer pairs {(q̂_i^j, â_i^j)}_{j=1}^{M} for the given context c_i.
Cloze Generation. Statements in which the answer is replaced with a mask or blank token are called cloze questions. Following prior work, answers are replaced with a special token depending on the answer category. For example, in the sentence "They were descended from Norse raiders and pirates from Denmark", the answer Denmark is replaced by [LOCATION], resulting in the cloze question: "They were descended from Norse raiders and pirates from [LOCATION]".
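The masking step itself is simple once an (answer, category) pair is available; in the full pipeline these come from an NER system such as spaCy, which this sketch takes as given.

```python
# Toy sketch of cloze generation: replace a known answer span with its
# category token. A real implementation obtains (answer, category)
# pairs from a named-entity recognizer rather than taking them as input.

def make_cloze(sentence, answer, category):
    """Replace the first occurrence of `answer` with a [CATEGORY] token."""
    assert answer in sentence, "answer span must occur in the sentence"
    return sentence.replace(answer, f"[{category}]", 1)
```

Applied to the paper's example, `make_cloze(..., "Denmark", "LOCATION")` yields the [LOCATION] cloze shown above.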
Cloze Translation rephrases cloze questions into more natural questions using rule-based methods.
Template-based Question Generation uses simple template-based rules. Given a context sentence containing the answer, a template of the format "Wh + B + A + ?" replaces the answer with a Wh-word (e.g., who, what, where), as described in Fabbri et al. (2020).
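Under one simplified reading of the quoted format, the sentence is split around the answer into a left fragment A and a right fragment B, and a Wh-word chosen by answer category stands in for the answer. The category-to-Wh mapping below is an assumption of this sketch, not the exact rule set of Fabbri et al. (2020).

```python
# Simplified sketch of the "Wh + B + A + ?" template. The WH_FOR_CATEGORY
# mapping is a hypothetical stand-in for the real rule set.
WH_FOR_CATEGORY = {"PERSON": "who", "LOCATION": "where", "THING": "what"}

def template_question(sentence, answer, category):
    """Split the sentence around the answer and assemble Wh + B + A + ?"""
    a, _, b = sentence.partition(answer)   # A = left fragment, B = right
    wh = WH_FOR_CATEGORY.get(category, "what")
    parts = [wh, b.strip(" .,"), a.strip(" .,")]
    return " ".join(p for p in parts if p) + " ?"
```

As with all template methods, the output is grammatically rough ("where They were descended from ... ?") but cheap to produce at scale.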
Dependency Parsing-based Question Generation. In this method, we use dependency reconstruction to translate clozes into natural questions, following prior work, according to the following steps:
1. Right child nodes of the answer are retained and left children are pruned.
2. For each node of the parse tree, if a child node's subtree contains the answer, that child node is moved to the first position.
3. An in-order traversal is performed on the reconstructed tree, and a rule-based mapping replaces the special mask token of the cloze with an appropriate "Wh-word".
QA-Semantic Role Labeling (QA-SRL) was proposed by He et al. (2015) as a method to annotate NLP data, using QA pairs to specify textual arguments and their roles. As seen in Figure 1, for the context sentences "They were descended from Norse raiders and pirates from Denmark." and "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century and it continued to evolve.", the following QA pairs were generated: ("What was someone descended from?", "Norse") and ("What evolved?", "distinct cultural and ethnic identity"). We can observe that the questions are short and use generic descriptors and pronouns such as "something" and "someone" instead of specific references, calling for the model to have a greater semantic understanding of the given context.
Context Expansion using IR is used in the K-neighbor version of TTL. For context expansion, we index all paragraphs present in a Wikipedia dump in ElasticSearch. During test-time learning, we preprocess the context c_i by removing the most frequent stop-words and use it as a seed query to search for and retrieve the top-K similar contexts. This provides related paragraphs that describe similar topics, and consequently a more diverse and slightly larger set of QA pairs to train on compared to c_i alone. We then generate QA pairs using the methods described above. We study the effect of varying the number of most similar contexts (K) on downstream QA performance.
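The retrieval step can be illustrated with a small pure-Python TF-IDF scorer. This is a stand-in for the ElasticSearch/Lucene index used in the paper (whose BM25 scoring differs in detail); the stop-word list here is a toy assumption.

```python
# Stand-in for ElasticSearch-based context expansion: score indexed
# paragraphs against the (stop-word-filtered) test context with TF-IDF
# and return the top-K neighbors.
import math
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "and", "in"})  # toy list

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def top_k_contexts(query, paragraphs, k):
    n = len(paragraphs)
    docs = [Counter(tokenize(p)) for p in paragraphs]
    df = Counter(w for d in docs for w in d)      # document frequencies
    q = Counter(tokenize(query))

    def score(d):
        # TF-IDF dot product between query and document term vectors
        return sum(q[w] * d[w] * math.log(1 + n / df[w]) for w in q if w in d)

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in ranked[:k]]
```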
QA Model. We focus on training two transformer-encoder-based models: BERT-Large trained with whole-word masking, and DistilBERT (Sanh et al., 2019). BERT-Large is used by current state-of-the-art methods on unsupervised extractive QA tasks and has 345 million trainable parameters. DistilBERT, on the other hand, is a knowledge-distilled transformer-encoder model with only 66 million parameters (∼5× smaller than BERT-Large), allowing us to study the efficacy of TTL with respect to model size.
Metrics. We use the standard metrics for extractive QA: macro Exact Match (EM), where the predicted answer span is directly matched with the ground truth, and macro F1, which measures the token overlap between the predicted and ground-truth spans. For comparisons with existing unsupervised methods, since TTL operates directly on test instances, we report validation-set performance only for SQuAD 1.1, as its test set is hidden.
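The two metrics can be sketched as follows. Note that the official SQuAD evaluation script uses a fuller normalization (lowercasing, stripping articles and all punctuation); the simplified `normalize` here only approximates it.

```python
# Sketch of extractive-QA metrics: exact match after light normalization,
# and token-level F1 between the predicted and gold answer spans.
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase, drop commas/periods."""
    return " ".join(text.lower().replace(",", "").replace(".", "").split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

The macro scores reported in the tables are these per-question values averaged over the dataset.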
Training Setup. For all test-time learning variants, we limit the maximum number of questions generated per context to 4000 and the maximum number of training steps to 1500. Experiments were conducted on two Nvidia RTX-8000 GPUs. We use ten percent of the training data to perform three hyper-parameter trials for each variant. We train models with three random seeds and report the mean F1 and EM scores.
Baselines. As we generate our own data using QA-SRL, we use the following strong baselines. First, we train BERT-Large with data generated by the previous methods described in Section 3 and by our method (which contains additional QA-SRL samples). Second, we replicate the baselines with the lower parameter-count model DistilBERT (66 million parameters vs. 345 million for BERT-Large). Third, for a fair comparison to single-context and K-neighbor test-time learning, where we train models for each context independently, we propose a baseline trained on all test contexts together, referred to as "All test contexts". We also evaluate all TTL variants under two initializations of the feature-extractor parameters: (1) the "default" initialization of BERT-Large, i.e., θ_f pre-trained on masked language modeling and next-sentence prediction, with θ_h randomly initialized for each context and trained from scratch; or (2) θ_f and θ_h further pre-trained on 100K synthetic QA pairs generated procedurally using our methods from Section 3, with contexts taken from the Wikipedia corpus.

Unsupervised Question Answering
We compare our results with current state-of-the-art supervised methods (Table 1) and unsupervised methods (Table 2). Our best setup is online TTL over K-neighbor contexts, trained for 100 steps. With this setup, we improve the state-of-the-art for the SQuAD benchmark with BERT-Large by 7.8% exact-match accuracy and 7.3% F1 score. With DistilBERT, the best TTL method shows an improvement of 15.5% EM and 20.6% F1 over the DistilBERT-based baseline, as shown in Table 2. On NewsQA, TTL improves BERT-Large performance by 5.3% EM and 6.9% F1, and DistilBERT by 7.2% EM and 7.2% F1. Training BERT-Large and DistilBERT with "our data", i.e., a combined synthetic corpus created via all four QA-pair generation methods, marginally improves the F1 score. This shows that our QA generation methods improve over existing unsupervised QA generation methods, as shown in Table 2. However, the TTL framework leads to even larger gains (∼20% for SQuAD and ∼10% for NewsQA), indicating the benefits of test-time learning. This result also points to the limits of training over a large number of contexts compared to training on individual contexts, a limitation that is especially pronounced in lower-parameter models such as DistilBERT. In reading comprehension, since the answer comes from the context, "understanding" the given context is what matters most; learning each context individually provides a stronger inductive bias than learning to comprehend a significantly larger number of contexts during training.
For instance, there are multiple contexts about Normans in the SQuAD dataset, one of which is shown in Figure 1. But each context may have different historical persons referred to as the leaders or rulers of the Normans. Answers to questions such as "Who was the leader of the Normans" are better learned for each context separately than from all contexts. Pre-training on several contexts is indeed beneficial to obtain better parameter initializations, as observed in Table 2, which can be further independently finetuned for each context during TTL.

Few-Shot Question Answering
We evaluate our best method under the few-shot setting, i.e., when models are trained with a limited number of human-authored QA pairs from the training datasets. Figure 2 shows a comparison with an increasing number of labeled training samples for SQuAD. TTL-Online is consistently better than existing methods and achieves an 81.6% F1 score with just 100 labeled samples. This indicates that this learning framework can reduce the number of in-domain human-authored samples required for training. TTL-Online is also consistently better than the previous best unsupervised method for SQuAD. All methods (which use BERT-Large as the backbone) converge to similar performance with an increasing number of additional human-authored samples, indicating saturation of the inductive bias that can be incorporated into the architecture using current human-authored annotations.

Analysis
We study the different variants of test-time learning and effects of hyperparameters, such as the number of training steps and the number of contexts, on the validation split for both datasets.

Single-Context vs K-neighbor Test-Time RC.
In Table 3, we compare all TTL variants. We observe that training with additional contexts has a significant impact on the F1 score compared to training on only the given test context c_i. This may be explained simply: more synthetic training samples from similar contexts lead to better generalization to human-authored samples. Although similar work in image classification (Sun et al., 2020) and super-resolution (Shocher et al., 2018) shows substantial performance improvements with single-sample learning, we observe that context expansion is beneficial for reading comprehension.
In Figure 3, we vary the number of retrieved neighbor contexts, K, and observe that F1 scores continue to increase up to a limit (K ≈ 500). This is consistent for both BERT-Large and DistilBERT, and on both SQuAD and NewsQA. Our hypothesis is that there exists an optimal number of QA pairs from which the model benefits, and a maximum threshold on the number of similar contexts after which the model starts to overfit to the synthetic nature of the QA pairs.

Randomly Initialized vs. Pre-trained Parameters.
We study the effect of re-initializing the question-answering head, and of further pre-training on a set of procedurally generated QA pairs, on downstream test-time learning in Figure 4 and Table 3. While the F1 scores achieved without pre-training are comparable to prior methods, pre-training leads to improved performance and faster convergence, as shown in Figure 4. This can be attributed to better initial weights, which are further finetuned during the test-time learning phase. We studied pre-training with 50k, 100k, and 200k QA pairs and observed the best performance with 100k samples.
Curriculum Test-Time Learning. In Table 4, we study the effect of curriculum TTL compared to the baseline of randomly shuffled QA pairs. Interestingly, a random ordering rather than a defined curriculum yields the best performance. Among the three curriculum orderings we evaluated, [QA-SRL, TEMPLATE-BASED (T), DEPENDENCY-PARSING-BASED (DP)] was effective but slightly below the performance with random ordering. However, training with QA-SRL at the end has a distinctly negative effect. We hypothesize that the model starts to overfit to the shorter, vaguer questions from QA-SRL and "forgets" more natural questions, thus losing generalizability to human-authored questions.
Online Test-Time Learning. In online test-time learning, the model is continuously self-supervised and evaluated on a stream of contexts and QA pairs. From Table 3 and Figures 3, 4, and 5, we observe that TTL-Online consistently outperforms the single-context variant. One key observation is that the model achieves its best performance within 100 training steps (batch size of 48), whereas the base version needs around 300 to 500 steps. This fast adaptation enables faster inference compared to training θ_h from scratch. We studied the effect of different random orderings of the test samples and observed a deviation of ±1.6% in F1 scores, indicating that the ordering of test samples has a minor effect.
Effect of Batch Size and Learning Rate. Batch size and learning rate have strong effects on online test-time learning. We observe that resuming with the learning rate from the last epoch of pre-training on synthetic QA pairs achieves the best F1 scores. We do not use any weight decay. A persistent optimizer state between contexts is critical. Similarly, we hypothesize that the layer-normalization statistics in the transformer encoder layers get updated during further pre-training with QA pairs, leading to better estimates during TTL. For the base variant of TTL, a higher fixed learning rate of 3e-5 with a batch size of 32−48 achieves the best F1 scores.
Effect of the Number of Training Steps and QA Pairs is studied in Figures 4 and 5. To limit inference time per test context, we note that TTL variants initialized with pre-trained θ achieve their top performance within 150 training steps, whereas those trained with the default initialization need 200−300 steps. In Figure 5, the variants achieve their best F1 scores at around 3k QA pairs, which is consistent with 100 training steps at a batch size of 24−32. Surprisingly, DistilBERT with pre-trained θ performs as well as BERT-Large without pre-training on synthetic question-answer pairs.
Effect of TTL on Inference Time. TTL and its variants all increase inference time compared to traditional inference. For the best variant, TTL-Online with BERT-Large, we train for 100 steps with a batch size of 48, which leads to an inference time of ∼5 minutes per context. Each context contains, on average, 6−7 questions in SQuAD 1.1 and NewsQA. The best DistilBERT variant has a lower average inference time of 1.6 minutes per context, achieved by employing several engineering tricks, such as keeping model checkpoints in RAM instead of on disk using tmpfs (Snyder, 1990) and using mixed-precision training (Micikevicius et al., 2018). In comparison, non-TTL methods achieve inference throughputs of ∼10K samples/sec on an Nvidia V100 16GB GPU. TTL inference time is limited by current GPU compute, but with increases in CUDA cores and RAM sizes, we estimate it can be further improved. Moreover, with newer efficient transformer architectures such as Linformer and Big Bird (Zaheer et al., 2020), this inference time could be reduced further. Increasing TTL's efficiency while retaining its strength of generalizing to evolving distributions is an interesting direction for future work.
Error Analysis. We analyzed 100 wrongly answered samples from the SQuAD validation split and observed that the model is biased towards answering with named entities. This is not unexpected, as most of our QA-pair generation methods focus on named-entity answers. For example, for the question "Is it easier or harder to change EU law than stay the same?", the TTL DistilBERT model predicts "EU", whereas the ground-truth answer is "harder". Although QA-SRL generates more diverse answers, the corresponding questions are vague and much more synthetic, leaving scope for improving QA-pair generation to include a wider variety of question and answer types in the future. Another source of errors is the alternate plausible answers generated by our models, shown in Table 5.

Related Work
Extractive QA. The goal of extractive question answering (EQA) is to predict a span of text in a context document as the answer to a question. Various benchmarks have been established to evaluate the capabilities of EQA models on corpora from different domains: Wikipedia-based question answering in SQuAD (Rajpurkar et al., 2016) and the Natural Questions dataset (Kwiatkowski et al., 2019); questions requiring complex reasoning to extract answers in HotPotQA (Yang et al., 2018); questions about news articles in NewsQA (Trischler et al., 2017); and about trivia facts in TriviaQA (Joshi et al., 2017).
Unsupervised QA. For many of the aforementioned extractive QA benchmarks, "human-like" performance has been reached via supervised methods. Unfortunately, these methods do not transfer well to new domains, and collecting training data in new domains and languages may not always be feasible. To address this, unsupervised EQA has been proposed as a challenge, in which aligned (context, question, answer) triplets are not available. Self-supervised data-synthesis methods (Rennie et al., 2020; Fabbri et al., 2020) have been used for question answering by procedurally generating QA pairs and training models on these synthetic data.
Self-Supervised Learning. The key idea in self-supervision is to design auxiliary tasks that extract semantic features from unlabeled samples, for which input-output pairs can be created from unlabeled datasets. Self-supervision has been used to train large transformer-based language models such as BERT and T5 (Raffel et al., 2020) on the auxiliary task of masked token prediction, and XLNet (Yang et al., 2019) on token prediction given any combination of other tokens in the sequence. ELECTRA (Clark et al., 2019), instead of masking tokens, jointly trains a generator to substitute input tokens with plausible alternatives and a discriminator to predict the presence or absence of a substitution. MARGE (Lewis et al., 2020) is trained to retrieve a set of related multi-lingual texts for a target document and to reconstruct the target document from the retrieved documents. The goal of self-supervised pretext-task design is to devise tasks as close as possible to the main task, so as to learn better representations. In NLP, the QA format provides such an opportunity: we can leverage NER, SRL, and cloze completion as auxiliary tasks for complex QA.
Learning at Test-Time. Our work is inspired by image-processing methods such as single-image super-resolution (Glasner et al., 2009; Freedman and Fattal, 2011; Shocher et al., 2018) that do not require access to external training datasets but instead formulate a self-supervised task for upsampling natural image patches recurring at different scales in the image. Test-time training (TTT) (Sun et al., 2020) for image classification uses rotation prediction (Gidaris et al., 2018) as an auxiliary task to implicitly learn image classification at test time, and shows improved robustness. Our work is closely related to TTT, although we can directly synthesize main-task data (QA pairs) from the context and do not require an auxiliary task.
Domain Adaptation. Pre-training on tasks such as masked language modeling or other synthetic tasks over unlabeled corpora from a new domain has been evaluated for commonsense reasoning (Mitra et al., 2019) and classification tasks (Gururangan et al., 2020). In contrast, our work can be viewed as task-specific self-supervision, with each new context treated as a new domain.

Conclusion
In this work, we propose test-time learning (TTL) as a new framework for unsupervised extractive question answering (EQA). We present four variants of TTL with a simple but effective context-expansion method. We utilize four question-answer pair generation methods for EQA and propose QA-SRL as an additional source of QA pairs to supplement prior methods. We show that TTL enables "understanding" of contexts at test time, without human-authored annotations, and significantly improves EQA, including for low-parameter models. We envision TTL as a framework that can direct work in reading comprehension toward problems of ever-evolving datasets instead of static corpora. Natural language itself undergoes continuous evolution (Gentner and France, 1988; Traugott and Dasher, 2001; Hamilton et al., 2016) via changes in preference for syntactic structures, the creation of new words and phrases, and changing usage frequencies and semantics of existing words. TTL can potentially be applied to such scenarios with semantic drift or domain shift. Further improvements w.r.t. the selection of similar contexts for K-neighbor TTL could be explored by leveraging hard-sample selection, hard negative mining, bootstrapping, and contrastive learning, along with improved curriculum strategies.

Ethical Considerations
Our test-time learning method treats every new test instance as a new distribution, and does not rely on a human-authored training dataset. We believe that this is a possible way to avoid learning spurious correlations or linguistic priors, especially when it comes to socio-cultural and historical biases that have been shown to percolate into models for various NLP tasks (Hendricks et al., 2018;Kurita et al., 2019;Sheng et al., 2019). On the other hand, if the test context itself contains biased, false, or propaganda statements, our model will use those statements to extract answers. We would not want models trained on such data to be deployed in the real world. However, because model parameters are randomly initialized for each new context in the standard version of our framework, if contexts are fact-checked by "reliable" sources, then we believe our model will be relatively bias-free, as compared to pre-trained language models for which it is hard to trace why a certain prediction was made. Test-time learning allows us to disentangle biases learned from single contexts, from biases learned by language models from large corpora.