MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension

A large number of reading comprehension (RC) datasets has been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.


Introduction
Reading comprehension (RC) is concerned with reading a piece of text and answering questions about it (Richardson et al., 2013; Berant et al., 2014; Hermann et al., 2015; Rajpurkar et al., 2016). Its appeal stems both from its clear application and from the fact that it allows probing many aspects of language understanding simply by posing questions on a text document. Indeed, this has led to the creation of a large number of RC datasets in recent years.
While each RC dataset has a different focus, there is still substantial overlap in the abilities required to answer questions across these datasets. Nevertheless, there has been relatively little work (Min et al., 2017; Chung et al., 2018) that explores the relations between the different datasets, including whether a model trained on one dataset generalizes to another. This research gap is highlighted by the increasing interest in developing and evaluating the generalization of language understanding models to new setups (Yogatama et al., 2019; Liu et al., 2019).
In this work, we conduct a thorough empirical analysis of generalization and transfer across 10 RC benchmarks. We train models on one or more source RC datasets, and then evaluate their performance on a target test set, either without any additional target training examples (generalization) or with additional target examples (transfer). We experiment with DOCQA, a standard and popular RC model, as well as a model based on BERT (Devlin et al., 2019), which provides powerful contextual representations.
Our generalization analysis confirms findings that current models over-fit to the particular training set and generalize poorly even to similar datasets. Moreover, BERT representations substantially improve generalization. However, we find that the contribution of BERT is much more pronounced on Wikipedia (which BERT was trained on) and Newswire, but quite moderate when documents are taken from web snippets.
We also analyze the main causes for poor generalization: (a) differences in the language of the text document, (b) differences in the language of the question, and (c) the type of language phenomenon that the dataset explores. We show how generalization is related to these factors (Figure 1) and that performance drops as more of these factors accumulate.
Our transfer experiments show that pre-training on one or more source RC datasets substantially improves performance when fine-tuning on a target dataset. An interesting question is whether such pre-training improves performance even in the presence of powerful language representations from BERT. We find the answer is a conclusive yes, as we obtain consistent improvements in our BERT-based RC model.
We find that training on multiple source RC datasets is effective for both generalization and transfer. In fact, training on multiple datasets leads to the same performance as training on the target dataset alone, but with roughly three times fewer examples. Moreover, we find that when using the high-capacity BERT-large, one can train a single model on multiple RC datasets, and obtain close to or better than state-of-the-art performance on all of them, without fine-tuning to a particular dataset.
Armed with the above insights, we train a large RC model on multiple RC datasets, termed MULTIQA. Our model leads to new state-of-the-art results on five datasets, suggesting that in many language understanding tasks the size of the dataset is the main bottleneck, rather than the model itself.
Last, we have developed infrastructure (on top of AllenNLP) in which experimenting with multiple models on multiple RC datasets, mixing datasets, and performing fine-tuning are trivial. It is also simple to expand the infrastructure to new datasets and new setups (abstractive RC, multi-choice, etc.). We will open source our infrastructure, which will help researchers evaluate models on a large number of datasets, and gain insight on the strengths and shortcomings of their methods. We hope this will accelerate progress in language understanding.
To conclude, we perform a thorough investigation of generalization and transfer in reading comprehension over 10 RC datasets. Our findings are:
• An analysis of generalization on two RC models, illustrating the factors that influence generalization between datasets.
• Pre-training on an RC dataset and fine-tuning on a target dataset substantially improves performance even in the presence of contextualized word representations (BERT).

The uniform format datasets can be downloaded from www.tau-nlp.org/multiqa. The code for the AllenNLP models is available at github.com/alontalmor/multiqa.

Datasets
We describe the 10 datasets used for our investigation. Each dataset provides question-context-answer triples for training, and a model maps an unseen question-context pair (q, c) to an answer a. For simplicity, we focus on the single-turn extractive setting, where the answer a is a span in the context c. Thus, we do not evaluate abstractive (Nguyen et al., 2016) or conversational datasets (Choi et al., 2018; Reddy et al., 2018).
We broadly distinguish large datasets that include more than 75K examples from small datasets that contain fewer than 75K examples. In §4, we will fix the size of the large datasets to control for size effects, and always train on exactly 75K examples per dataset. We now briefly describe the datasets, and provide a summary of their characteristics in Table 1. The table shows the original size of each dataset, the source for the context, how questions were generated, and whether the dataset was specifically designed to probe multi-hop reasoning.
The large datasets used are:
1. SQUAD (Rajpurkar et al., 2016): Crowdsourcing workers were shown Wikipedia paragraphs and were asked to author questions about their content. Questions mostly require soft matching of the language in the question to a local context in the text.
2. NEWSQA (Trischler et al., 2017): Crowdsourcing workers were shown a CNN article (longer than SQUAD) and were asked to author questions about its content.
3. SEARCHQA (Dunn et al., 2017): Trivia questions were taken from the Jeopardy! TV show, and contexts are web snippets retrieved from the Google search engine for those questions, with an average of 50 snippets per question.
4. TRIVIAQA (Joshi et al., 2017): Trivia questions were crawled from the web. In one variant of TRIVIAQA (termed TQA-W), Wikipedia pages related to the questions are provided for each question. In another, web snippets and documents from the Bing search engine are given. For the latter variant, we use only the web snippets in this work (and term this TQA-U). In addition, we replace Bing web snippets with Google web snippets (and term this TQA-G).
5. HOTPOTQA: Crowdsourcing workers were shown pairs of related Wikipedia paragraphs and asked to author questions that require multi-hop reasoning over the paragraphs. There are two versions of HOTPOTQA: the first where the context includes the two gold paragraphs and eight "distractor" paragraphs, and a second, where 10 paragraphs retrieved by an information retrieval (IR) system are given. Here, we use the latter version.
The small datasets are:
1. CQ (Bao et al., 2016): Questions are real Google web queries crawled from Google Suggest, originally constructed for querying the KB Freebase (Bollacker et al., 2008). However, the dataset was also used as an RC task with retrieved web snippets (Talmor et al., 2017).
2. CWQ (Talmor and Berant, 2018c): Crowdsourcing workers were shown compositional formal queries against Freebase and were asked to re-phrase them in natural language. Thus, questions require multi-hop reasoning. The original work assumed models contain an IR component, but the authors also provided default web snippets, which we use here. The repartitioned version 1.1 was used (Talmor and Berant, 2018a).
3. WIKIHOP (Welbl et al., 2017): Questions are entity-relation pairs from Freebase, and are not phrased in natural language. Multiple Wikipedia paragraphs are given as context, and the dataset was constructed such that multi-hop reasoning is needed for answering the question.
4. COMQA (Abujabal et al., 2018)

Models
We carry out our empirical investigation using two models. The first is DOCQA, and the second is based on BERT (Devlin et al., 2019), which we term BERTQA. We now describe the pre-processing on the datasets, and provide a brief description of the models. We emphasize that in all our experiments we use exactly the same training procedure for all datasets, with minimal hyper-parameter tuning.
Pre-processing Examples in all datasets contain a question, text documents, and an answer. To generate an extractive example we:
(a) Split: We define a length L and split every paragraph whose length is > L into chunks using a few manual rules.
(b) Sort: We sort all chunks (paragraphs whose length is ≤ L or split paragraphs) by cosine similarity to the question in tf-idf space, as proposed by .
(c) Merge: We go over the sorted list of chunks and greedily merge them to the largest possible length that is at most L, so that the RC model will be exposed to as much context as possible. The final context is the merged list of chunks $c = (c_1, \ldots, c_{|c|})$.
(d) We take the gold answer and mark all spans that match the answer.
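The split, sort, and merge steps can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: inputs are already tokenized, the "manual rules" for splitting are replaced by a plain fixed-size window, tf-idf uses a simple smoothed idf, and all function names (`split_paragraph`, `tfidf_vectors`, `cosine`, `build_context`) are hypothetical.

```python
import math
from collections import Counter


def split_paragraph(tokens, max_len):
    """Split a token list into consecutive windows of at most max_len tokens.

    Assumption: a fixed-size window stands in for the paper's manual rules.
    """
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def tfidf_vectors(token_lists):
    """Return one sparse tf-idf vector (a dict) per token list."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_context(question_tokens, paragraphs, max_len):
    """Split, sort by similarity to the question, then greedily merge."""
    # (a) Split every paragraph into chunks of at most max_len tokens.
    chunks = [c for p in paragraphs for c in split_paragraph(p, max_len)]
    # (b) Sort chunks by tf-idf cosine similarity to the question.
    vectors = tfidf_vectors(chunks + [question_tokens])
    q_vec = vectors[-1]
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(vectors[i], q_vec),
                    reverse=True)
    # (c) Greedily merge consecutive ranked chunks up to max_len tokens.
    merged, current = [], []
    for i in ranked:
        if current and len(current) + len(chunks[i]) > max_len:
            merged.append(current)
            current = []
        current = current + chunks[i]
    if current:
        merged.append(current)
    return merged
```

In this sketch the most question-relevant chunk always ends up in the first merged chunk, and short chunks are packed together so that each model input is as full as the length limit allows.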
DOCQA: A widely used RC model, based on BIDAF (Seo et al., 2016), that encodes the question and document with bidirectional RNNs, performs attention between the question and document, and adds self-attention on the document side. We run DOCQA on each chunk $c_i$, where the input is a sequence of up to L (= 400) tokens represented as GloVe embeddings (Pennington et al., 2014). The output is a distribution over the start and end positions of the predicted span, and we output the span with highest probability across all chunks. At training time, DOCQA uses a shared-norm objective that normalizes the probability distribution over spans from all chunks. We define the gold span to be the first occurrence of the gold answer in the context c.
BERTQA (Devlin et al., 2019): For each chunk, we apply the standard implementation, where the input is a sequence of L = 512 wordpiece tokens composed of the question and chunk separated by special tokens [CLS] <question> [SEP] <chunk> [SEP]. A linear layer with softmax over the top-layer [CLS] outputs a distribution over start and end span positions.
We train over each chunk separately, backpropagating into BERT's parameters. We maximize the log-likelihood of the first occurrence of the gold answer in each chunk that contains the gold answer. At test time, we output the span with the maximal logit across all chunks.
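The test-time selection across chunks can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a span is scored by the sum of its start and end logits, spans are capped at a hypothetical `max_answer_len`, and logits are given as plain Python lists.

```python
def best_span(chunk_logits, max_answer_len=30):
    """Pick the highest-scoring span over all chunks.

    chunk_logits: list of (start_logits, end_logits) pairs, one per chunk,
    where both lists have one float per token position.
    Returns (chunk_index, start, end, score), or None if there are no chunks.
    Assumption: span score = start logit + end logit.
    """
    best = None
    for chunk_index, (start_logits, end_logits) in enumerate(chunk_logits):
        for start, start_logit in enumerate(start_logits):
            # Only consider spans that end at or after the start position.
            last = min(len(end_logits), start + max_answer_len)
            for end in range(start, last):
                score = start_logit + end_logits[end]
                if best is None or score > best[3]:
                    best = (chunk_index, start, end, score)
    return best
```

Because each chunk is scored independently, the comparison across chunks relies on raw logits being comparable between forward passes, which is a simplification of this setup.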

Controlled Experiments
We now present controlled experiments aiming to explore generalization and transfer of models trained on a set of RC datasets to a new target.

Do models generalize to unseen datasets?
We first examine generalization: whether models trained on one dataset generalize to examples from a new distribution. While different datasets differ substantially, there is overlap between them in terms of: (i) the language of the question, (ii) the language of the context, and (iii) the type of linguistic phenomena the dataset aims to probe. Our goal is to answer: (a) Do models over-fit to a particular dataset? How much does performance drop when generalizing to a new dataset? (b) Which datasets generalize better to which datasets? What properties determine generalization?
We train DOCQA and BERTQA (we use BERT-base) on six large datasets (for TRIVIAQA we use TQA-G and TQA-W), taking 75K examples from each dataset to control for size. We also create MULTI-75K, which contains 15K examples from each of the five large datasets (using TQA-G only for TRIVIAQA), resulting in another dataset of 75K examples. We evaluate performance on all datasets that the model was not trained on. Table 2 shows exact match (EM) performance (does the predicted span exactly match the gold span) on the development set. The row SELF corresponds to training and testing on the target itself, and is provided for reference (for DROP, we train on questions where the answer is a span in the context, but evaluate on the entire development set). The top part shows DOCQA, and the bottom part shows BERTQA.
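A minimal sketch of an exact match computation is given below. The answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) follows the general style of SQUAD-type evaluation scripts, but its exact details here are an assumption, and the function names are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction matches any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions, references):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

For example, `exact_match("The Eiffel Tower.", ["eiffel tower"])` holds under this normalization, since the article and the trailing period are removed before comparison.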
At a high level we observe three trends. First, models generalize poorly in this zero-shot setup: comparing SELF to the best zero-shot number shows a performance reduction of 31.5% on average. This confirms the finding that models overfit to the particular dataset. Second, BERTQA substantially improves generalization compared to DOCQA owing to the power of large-scale unsupervised learning: performance improves by 21.2% on average. Last, MULTI-75K performs almost as well as the best source dataset, reducing performance by only 3.7% on average. Hence, training on multiple datasets results in robust generalization. We further investigate training on multiple datasets in §4.2 and §5.
Taking a closer look, the pair SEARCHQA and TQA-G exhibits the smallest performance drop, since both use trivia questions and web snippets. SQUAD and NEWSQA also generalize well (especially with BERTQA), probably because they contain questions on a single document, focusing on predicate-argument structure. While HOTPOTQA and WIKIHOP both examine multi-hop reasoning over Wikipedia, performance dramatically drops from HOTPOTQA to WIKIHOP. This is due to the difference in the language of the questions (WIKIHOP questions are synthetic). The best generalization to DROP is from HOTPOTQA, since both require multi-hop reasoning. Performance on DROP is overall low, showing that our models struggle with quantitative reasoning.
For the small datasets, COMQA, CQ, and CWQ, generalization is best with TQA-G, as the contexts in these datasets are web snippets. For CQ, whose training set has 1,300 examples, zero-shot performance is even higher than SELF.
Interestingly, BERTQA improves performance substantially compared to DOCQA on NEWSQA, SQUAD, TQA-W and WIKIHOP, but only moderately on HOTPOTQA, SEARCHQA, and TQA-G. This hints that BERT is effective when the context is similar to (or even part of) its training corpus, but degrades over web snippets. This is most evident when comparing TQA-G to TQA-W, as the difference between them is the type of context.
Global structure To view the global structure of the datasets, we visualize them with the force-directed placement algorithm (Fruchterman and Reingold, 1991). The input is a set of nodes (datasets), and a set of undirected edges representing springs in a mechanical system pulling nodes towards one another. Edges specify the pulling force, and a physical simulation places the nodes in a final minimal energy state in 2D-space. Let $P_{ij}$ be the performance when training BERTQA on dataset $D_i$ and evaluating on $D_j$. Let $P_i$ be the performance when training and evaluating on $D_i$. The force between an unordered pair of datasets is $F(D_1, D_2) = \frac{P_{12}}{P_2} + \frac{P_{21}}{P_1}$ when we train and evaluate in both directions, and $F(D_1, D_2) = 2 \cdot \frac{P_{12}}{P_2}$ if we train on $D_1$ and evaluate on $D_2$ only.
Figure 1 shows this visualization, where we observe that datasets cluster naturally according to shape and color. Focusing on the context, datasets with web snippets are clustered (triangles), while datasets that use Wikipedia are also near one another (circles). Considering the question language, TQA-G, SEARCHQA, and TQA-U are very close (blue triangles), as all contain trivia questions over web snippets. DROP, HOTPOTQA, NEWSQA and SQUAD generate questions with crowd workers, and all are at the top of the figure. WIKIHOP uses synthetic questions that prevent generalization, and is far from other datasets; this gap, however, will be closed during transfer learning (§4.2). DROP is far from all datasets because it requires quantitative reasoning that is missing from other datasets. However, it is relatively close to HOTPOTQA and WIKIHOP, which target multi-hop reasoning. DROP is also close to SQUAD, as both have similar contexts and question language, but the linguistic phenomena they target differ.

Figure 1: A 2D-visualization of the similarity between different datasets using the force-directed placement algorithm. We mark datasets that use web snippets as context with triangles, Wikipedia with circles, and Newswire with squares. We color multi-hop reasoning datasets in red, trivia datasets in blue, and factoid RC datasets in green.
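The pairwise force used as input to the layout can be computed as in the sketch below, where each cross-dataset score is normalized by the target's own in-domain score. The dictionary-based data layout and the function name `force` are assumptions made for illustration; the layout step itself is not shown.

```python
def force(perf, self_perf, d1, d2):
    """Pulling force between datasets d1 and d2.

    perf[(src, tgt)]: score when training on src and evaluating on tgt.
    self_perf[d]: score when training and evaluating on d.
    If only one direction was measured, it is counted twice.
    """
    forward = perf.get((d1, d2))
    backward = perf.get((d2, d1))
    if forward is not None and backward is not None:
        return forward / self_perf[d2] + backward / self_perf[d1]
    if forward is not None:
        return 2.0 * forward / self_perf[d2]
    if backward is not None:
        return 2.0 * backward / self_perf[d1]
    raise KeyError(f"no measurement between {d1!r} and {d2!r}")
```

Normalizing by the in-domain score keeps an intrinsically hard target dataset from appearing distant from every source simply because all scores on it are low.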

Does pre-training improve results on small datasets?
We now consider transfer learning, assuming access to a small number of examples (≤15K) from a target dataset. We pre-train a model on a source dataset, and then fine-tune on the target. In all models, pre-training and fine-tuning are identical and performed until no improvement is seen on the development set (early stopping). Our goal is to analyze whether pre-training improves performance compared to training on the target alone. This is particularly interesting with BERTQA, as BERT already contains substantial knowledge that might render pre-training unnecessary.
How to choose the dataset to pre-train on? Table 4 shows exact match (EM) on the development set of all datasets (rows are the trained datasets and columns the evaluated datasets). First, pre-training on a source RC dataset and transferring to the target improves performance by 21% on average for DOCQA (improving on 8 out of 11 datasets), and by 7% on average for BERTQA (improving on 10 out of 11 datasets). Thus, pre-training on a related RC dataset helps even given representations from a model like BERTQA. Second, MULTI-75K obtains good performance in almost all setups. Performance of MULTI-75K is 3% lower than the best source RC dataset on average for DOCQA, and 0.3% lower for BERTQA. Hence, one can pre-train a single model on a mixed dataset, rather than choose the best source dataset for every target.
Third, in 4 datasets (COMQA, DROP, HOTPOTQA, WIKIHOP) the best source dataset uses web snippets in DOCQA, but Wikipedia in BERTQA. This strengthens our finding that BERTQA performs better given Wikipedia text.
Last, we see a dramatic improvement in performance compared to §4.1. This highlights that current models over-fit to the data they are trained on, and small amounts of data from the target distribution can overcome this generalization gap. This is clearest for WIKIHOP, where synthetic questions preclude generalization, but fine-tuning improves performance from 12.6 EM to 50.5 EM. Thus, low performance was not due to a modeling issue, but rather a mismatch in the question language.
An interesting question is whether performance in the generalization setup is predictive of performance in the transfer setup. Average performance across target datasets in Table 4, when choosing the best source dataset from Table 4, is 39.3 (DOCQA) and 43.8 (BERTQA). Average performance across datasets in Table 4, when choosing the best source dataset from Table 2, is 38.9 (DOCQA) and 43.5 (BERTQA). Thus, one can select a dataset to pre-train on based on generalization performance and suffer a minimal hit in accuracy, without fine-tuning on each dataset. However, training on MULTI-75K also yields good results without selecting a source dataset at all.
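The comparison between the two selection strategies can be sketched as follows. The nested-dictionary layout and the names `pick_source` and `average_transfer` are hypothetical, and the sketch does not reproduce the reported averages; it only shows the selection logic.

```python
def pick_source(scores_by_source):
    """Return the source dataset with the highest score."""
    return max(scores_by_source, key=scores_by_source.get)


def average_transfer(generalization, transfer, use_oracle):
    """Average transfer score over targets for a source-selection strategy.

    generalization[target][source]: zero-shot score of source on target.
    transfer[target][source]: score after pre-training on source and
    fine-tuning on target.
    use_oracle=True picks the source by transfer score itself;
    use_oracle=False picks it by zero-shot score, which requires no
    fine-tuning runs to make the choice.
    """
    total = 0.0
    for target, transfer_scores in transfer.items():
        table = transfer_scores if use_oracle else generalization[target]
        total += transfer_scores[pick_source(table)]
    return total / len(transfer)
```

A small gap between the two averages indicates that zero-shot generalization is a usable proxy for choosing the pre-training source.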
How much target data is needed? We saw that with 15K training examples from the target dataset, pre-training improves performance. We now ask whether this effect persists given a larger training set. To examine this, we measure (Figure 2) the performance on each of the large datasets when pre-training on its nearest dataset (according to F(·, ·)) for both DOCQA (top row) and BERTQA (bottom row). The orange curve corresponds to training on the target dataset only, while the blue curve describes pre-training on 75K examples from a source dataset, and then fine-tuning on an increasing number of examples from the target dataset.
In 5 out of 10 curves, pre-training improves performance even given access to all 75K examples from the target dataset. In the other 5, using only the target dataset is better after 30-50K examples. To estimate the savings in annotation costs through pre-training, we measure how many examples are needed, when doing pre-training, to reach 95% of the performance obtained when training on all examples from the target dataset. We find that with pre-training we only need 49% of the examples to reach 95% performance, compared to 86% without pre-training.
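The annotation-savings measurement can be sketched as a lookup on a learning curve. The curve representation (a sorted list of measured points, with no interpolation between them) and the name `fraction_needed` are assumptions for illustration.

```python
def fraction_needed(curve, total_examples, final_score, ratio=0.95):
    """Fraction of target examples needed to reach ratio * final_score.

    curve: list of (num_target_examples, score), sorted by num_target_examples.
    final_score: score when training on all target examples only
    (the reference for both the target-only and the pre-trained curve).
    Returns None if the threshold is never reached on the curve.
    """
    threshold = ratio * final_score
    for num_examples, score in curve:
        if score >= threshold:
            return num_examples / total_examples
    return None
```

Applying the same reference score to the target-only curve and to the pre-trained curve gives two fractions whose difference estimates the share of annotation that pre-training saves.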
To further explore pre-training on multiple datasets, we plot a curve (   tuning). We observe that more data from multiple datasets improves performance in almost all cases. In this case, we reach 95% of the final performance using 30% of the examples only. We will use this observation further in §5 to reach new state-of-the-art performance on several datasets.

Does context augmentation improve performance?
For TRIVIAQA we have, for all questions, contexts from three different sources: Wikipedia (TQA-W), Bing web snippets (TQA-U), and Google web snippets (TQA-G). Thus, we can explore whether combining the three datasets improves performance. Moreover, because questions are identical across the datasets, we can see the effect on generalization due to the context language only. Table 5 shows the results. In the first 3 rows we train on 75K examples from each dataset, and in the last we train on the combined 225K examples. First, we observe that context augmentation substantially improves performance (especially for TQA-G and TQA-W). Second, generalization is sensitive to the context type: performance substantially drops when training on one context type and evaluating on another (60.7 → 48.4 for TQA-G, 53.1 → 44.6 for TQA-U, and 50.1 → 43.3 for TQA-W).

We use the official evaluation script for any dataset that provides one, and the SQUAD evaluation script for all other datasets. Table 6 shows results for datasets where the evaluation metric is EM or token $F_1$ (harmonic mean of precision and recall over the tokens in the predicted vs. gold span). Table 7 shows results for datasets where the evaluation metric is average recall/precision/$F_1$ between the list of predicted answers and the list of gold answers.
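Token-level F1 between a predicted and a gold span can be sketched as below. For brevity the sketch tokenizes by lowercasing and splitting on whitespace, which is an assumption; official evaluation scripts typically apply further answer normalization before comparing tokens.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two spans."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this metric gives partial credit: a prediction that contains the gold span plus one extra token still receives a high score.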
We compare MULTIQA to BERT-large, a model that does not train on MULTI-375K, but only fine-tunes BERT-large on the target dataset. We also show the state-of-the-art (SOTA) result for all datasets for reference.¹ MULTIQA improves state-of-the-art performance on five datasets, although it does not even [...] (Trischler et al., 2017)), improving upon previous state-of-the-art by a large margin.

¹ State-of-the-art results were found in (Tay et al., 2018) for NEWSQA, in Lin et al. (2018) for SEARCHQA, in Das et al. (2019) for TQA-U, in (Talmor and Berant, 2018b) for CWQ, in Ding et al. (2019) for HOTPOTQA, in (Abujabal et al., 2018) for COMQA, and in Bao et al. (2016) [...]
To conclude, MULTIQA is able to improve state-of-the-art performance on multiple datasets. Our results suggest that in many NLU tasks the size of the dataset is the main bottleneck rather than the model itself.

Related Work
Prior work has shown that RC performance can be improved by training on a large dataset and transferring to a smaller one, but at a small scale (Min et al., 2017; Chung et al., 2018). This has recently been shown in a larger experiment for multi-choice questions, in which BERT was first fine-tuned on RACE (Lai et al., 2017) and then fine-tuned on several smaller datasets.
Interest in learning general-purpose representations for natural language through unsupervised, multi-task and transfer learning has been skyrocketing lately (Radford et al., 2018; McCann et al., 2018; Chronopoulou et al., 2019; Phang et al., 2018). In parallel to our work, studies that focus on generalization have appeared on publication servers, empirically studying generalization to multiple tasks (Yogatama et al., 2019; Liu et al., 2019). Our work is part of this research thread on generalization in natural language understanding, focusing on reading comprehension, which we view as an important and broad language understanding task.

Conclusions
In this work we performed a thorough empirical investigation of generalization and transfer over 10 RC datasets. We characterized the factors affecting generalization and obtained several state-of-the-art results by training on 375K examples from 5 RC datasets. We open source our infrastructure for easily performing experiments on multiple RC datasets, for the benefit of the community. We highlight several practical take-aways:
• Pre-training on multiple source RC datasets consistently improves performance on a target RC dataset, even in the presence of BERT representations. It also leads to substantial reduction in the number of necessary training examples for a fixed performance.
• Training the high-capacity BERT-large representations over multiple RC datasets leads to good performance on all of the trained datasets without having to fine-tune on each dataset separately.
• BERT representations improve generalization, but their effect is moderate when the source of the context is web snippets compared to Wikipedia and newswire.
• Performance over an RC dataset can be improved by retrieving web snippets for all questions and adding them as examples (context augmentation).