MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale

We study the zero-shot transfer capabilities of text matching models on a massive scale, by self-supervised training on 140 source domains from community question answering forums in English. We investigate the model performances on nine benchmarks of answer selection and question similarity tasks, and show that all 140 models transfer surprisingly well, with the large majority of models substantially outperforming common IR baselines. We also demonstrate that considering a broad selection of source domains is crucial for obtaining the best zero-shot transfer performances, in contrast to the standard procedure of merely relying on the largest and most similar domains. In addition, we extensively study how to best combine multiple source domains. We propose to combine self-supervised with supervised multi-task learning on all available source domains. Our best zero-shot transfer model considerably outperforms in-domain BERT and the previous state of the art on six benchmarks. Fine-tuning our model with in-domain data results in additional large gains and achieves the new state of the art on all nine benchmarks.


Introduction
Semantic matching of two text sequences is central to a wide range of NLP problems, such as question answering (Nakov et al., 2017; Wang and Jiang, 2017) or semantic textual similarity (Cer et al., 2017). Due to the ubiquity of applications, it is crucial to study how to obtain re-usable text matching models that transfer well to unseen domains or tasks.
Zero-shot transfer of text matching models is particularly challenging in setups of non-factoid answer selection (Cohen et al., 2018; Tay et al., 2017; Feng et al., 2015; Verberne et al., 2010) and question similarity (Nakov et al., 2017; Lei et al., 2016). These tasks compare questions and answers, or two potentially related questions, in community question answering (cQA) forums, FAQ pages, and general collections of text passages. In contrast to other text matching tasks in NLP, they compare texts of different lengths (e.g., answers can be long explanations or descriptions) and often deal with expert domains. This makes it difficult to transfer models across domains (Shah et al., 2018), and to apply common approaches such as universal sentence embeddings without further domain or task adaptations (Poerner and Schütze, 2019).
Non-factoid answer selection and question similarity are also particularly promising for studying zero-shot transfer, because (1) there exists a large number of domains, and (2) in-domain training data is often scarce. Previous work proposed domain adaptation techniques (Poerner and Schütze, 2019; Shah et al., 2018), training with unlabeled data (Rücklé et al., 2019b), and shallow architectures (Rücklé et al., 2019a). However, these approaches result in entirely separate models that are specialized to individual target domains. One model that is re-usable and targets zero-shot transfer in similar settings is the question-answer encoder of Yang et al. (2020), which has recently been evaluated in cross-domain settings for efficient answer sentence retrieval (Guo et al., 2020). However, they do not study zero-shot transfer with a large number of source domains, and they do not assess how to best combine them.
In this work, we address these limitations and are, to the best of our knowledge, the first to study the zero-shot transfer capabilities of re-usable text matching models with a large number of source domains in these challenging setups.
In the first part, we investigate the zero-shot transfer capabilities of 140 domain-specific text matching models to nine benchmark datasets. By leveraging self-supervised training signals of question title-body pairs, we analyze a large number of models specialized on diverse domains. We utilize the training method provided by Rücklé et al. (2019b) and train adapter modules (Rebuffi et al., 2017; Houlsby et al., 2019) within BERT (Devlin et al., 2019) for each of the 140 English StackExchange forums. Adapters considerably reduce storage requirements by training only a small number of additional parameters while keeping the pre-trained BERT weights fixed. In our extensive analysis, we show that our approach for zero-shot transfer is extremely effective: on six benchmarks, all 140 models outperform common IR baselines. Most importantly, we revisit and analyze the traditional strategy of leveraging large data sets from intuitively similar domains to train models for zero-shot transfer. We establish that neither training data size nor domain similarity are suitable for predicting the best models, stressing the need for more elaborate strategies to identify suitable training tasks. This also demonstrates that considering a broad selection of source domains is crucial, which contrasts with the standard practice of merely relying on the most similar or largest ones.
In the second part of this work, we study how to best combine multiple source domains with multi-task learning and AdapterFusion (Pfeiffer et al., 2020a). Our analysis reveals that both approaches are not affected by catastrophic interference across training sets. In particular, our combination of all available source domains (despite the large data imbalances, see Figure 1) is the most effective and outperforms the respective best of the 140 single-domain models on six out of nine benchmarks. Finally, we combine unlabeled with labeled data for training in a self-supervised and supervised fashion, which considerably improves the zero-shot transfer performances in 16 out of 18 cases. Our best model substantially outperforms the in-domain BERT and RoBERTa (Liu et al., 2019) models, as well as the previous state of the art on six benchmarks, which demonstrates its versatility across tasks and domains. We also show that our model is an effective initialization for in-domain fine-tuning, which results in large gains and achieves state-of-the-art results on all nine benchmarks.
Our source code and the weights of our best multi-task model are publicly available. Additionally, all 140 source domain adapters are available at AdapterHub.ml (Pfeiffer et al., 2020b).

Related Work
The predominant method for text matching tasks such as non-factoid answer selection and question similarity is to train a neural architecture on a large quantity of labeled in-domain data (Zhang et al., 2020; Ma et al., 2020; MacAvaney et al., 2019). A crucial limitation of these approaches is that they result in entirely separate models for each dataset and are thus not re-usable. In this work, we therefore explore the zero-shot transfer capabilities of models to understand how well they generalize to unseen settings.
Previous work of Yang et al. (2020) investigates this on a smaller scale. They propose USE-QA, a sentence encoder for comparing questions and answers, and achieve promising zero-shot results in retrieval tasks. However, it is unclear how this model compares to the zero-shot performances of models trained on several different source domains, and how best to combine the data from multiple domains. Other work addresses the generalization of models over several domains in different settings, e.g., for machine reading comprehension (Talmor and Berant, 2019; Fisch et al., 2019). More related to our work, Guo et al. (2020) propose a new evaluation suite with eight datasets for retrieval-based QA, in which they also study the effectiveness of USE-QA. In contrast to them, our work (1) deals with re-ranking setups and cross-encoders, which is different from their bi-encoder scenario for retrieval; (2) considers question and answer passages instead of answer sentences; and (3) studies a large number of 140 source domains, providing important insights on zero-shot transfer performances in relation to domain similarity and data size, and extensively analyzing the training of models on many source domains simultaneously.

Training Data for 140 Domains
StackExchange is a network that consists of 172 cQA forums, referred to as domains in the following, each devoted to a particular topic such as programming, traveling, or finance. Of those 172 forums, 140 are in English and contain more than 1000 unlabeled questions.
We use data from each of these 140 English forums and train domain-specific models for semantic text matching. This has recently become feasible with self-supervised training methods such as WS-TB (Rücklé et al., 2019b), in which the question title is considered as a query to retrieve the question body (the detailed description of the question). This requires no labeled training instances and thus allows us to scale our experiments to 140 source domains which we can transfer from.
Formally, we train models with positive instances x+ = (title(q_n), body(q_n)) and negative instances x− = (title(q_n), body(q_m)), in which q_n ≠ q_m; we randomly sample q_m from the entire corpus. See https://stackexchange.com/sites for the list of forums; the data from all forums is publicly available at https://archive.org/details/stackexchange.
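The WS-TB pair construction above can be sketched in a few lines. The data layout and function name are illustrative, not from the original implementation: each question contributes its own body as a positive and the body of a randomly sampled other question as a negative.

```python
import random

def build_wstb_pairs(questions, rng=None):
    """Build self-supervised WS-TB training pairs from a cQA forum dump.

    Each question is a dict with "title" and "body". The title acts as the
    query; its own body is the positive instance, and the body of a
    randomly sampled *other* question is the negative instance.
    """
    rng = rng or random.Random(0)
    pairs = []
    for i, q in enumerate(questions):
        pairs.append((q["title"], q["body"], 1))  # positive: own body
        # negative: body of a different question, sampled from the corpus
        j = rng.randrange(len(questions) - 1)
        if j >= i:
            j += 1  # skip the question itself, so q_n != q_m
        pairs.append((q["title"], questions[j]["body"], 0))
    return pairs
```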

Evaluation Benchmarks
We transfer all models to nine benchmark datasets from different domains. We categorize them into two broad tasks: non-factoid answer selection and question similarity. See Table 1 for the statistics.
Answer selection (AS). The goal is to re-rank a pool of candidate answers A in regard to a question q. The questions in all datasets are short and do not contain additional descriptions (question bodies). Answers to non-factoid questions are often long texts such as descriptions, explanations, and advice.
• InsuranceQA (Feng et al., 2015) is a benchmark crawled from an FAQ community in which licensed insurance practitioners answer user questions. The domain is narrow and only contains questions about insurance topics in the US. We use the recent version 2 of the dataset with |A| = 500 candidate answers (retrieved with BM25). Typically, one answer is correct.
• WikiPassageQA (Cohen et al., 2018) was crowd-sourced from Wikipedia articles and is not restricted to a particular domain (although many questions are about history topics). Candidate answers are passages from a single document, on the basis of which the question was formulated. |A| = 58, of which 1.6 passages represent correct answers (on average).
• Long Answer Selection (LAS) datasets (Rücklé et al., 2019a) were crawled from the apple, cooking, academia, travel, and aviation StackExchange forums. For a user question, its accepted answer is considered as correct, and negative candidates were collected by retrieving the accepted answers to similar questions (using a search engine with BM25). |A| = 100.

We measure mean average precision (MAP) on WikiPassageQA, and accuracy (P@1) otherwise.
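For reference, the two metrics can be computed as follows; this is a dependency-free sketch, where each ranked candidate list is given as a binary relevance vector sorted by model score.

```python
def average_precision(relevances):
    """AP for one ranked list; relevances[i] is 1 if the candidate at
    rank i (0-based, sorted by descending model score) is relevant."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant rank
    return score / max(hits, 1)

def mean_average_precision(ranked_lists):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

def precision_at_1(ranked_lists):
    """P@1: fraction of queries whose top-ranked candidate is relevant."""
    return sum(r[0] for r in ranked_lists) / len(ranked_lists)
```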
Question similarity (QS). The goal is to re-rank a pool of potentially related forum questions C in regard to a query question q. All questions contain titles and bodies (which we concatenate) and are thus long multi-sentence texts. On all question similarity benchmarks we measure MAP.
• AskUbuntu (Lei et al., 2016) was crawled from the AskUbuntu forum. The train split contains noisy community-labeled duplicate annotations, and the (smaller) dev/test splits were manually annotated for relevance. |C| = 20.

Models and Training
BERT models. We use a pointwise ranking architecture based on pre-trained language models. We concatenate the two input texts (separated with the [SEP] token) and learn a linear classifier on top of the final [CLS] representation for scoring. We optimize the binary cross-entropy loss. Similar techniques achieve state-of-the-art results on many related datasets (Garg et al., 2020; Mass et al., 2019; Rücklé et al., 2019b). For our zero-shot transfer experiments from single domains in §4, we use BERT base uncased (Devlin et al., 2019). Later in §5, we additionally investigate BERT large uncased and RoBERTa large (Liu et al., 2019). The hyperparameters for all setups are listed in Appendix A.1.
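The pointwise ranking scheme can be illustrated with a small sketch. The token-overlap scorer below is only a stand-in for the BERT cross-encoder (which scores the concatenated "[CLS] question [SEP] candidate [SEP]" input with a linear layer on the [CLS] vector); the pipeline shape, scoring each candidate independently and sorting, is the point.

```python
def score_pair(question, candidate):
    # Stand-in scorer: in the paper this is a BERT cross-encoder over the
    # concatenated pair; here a Jaccard token-overlap stub keeps the
    # sketch dependency-free.
    q = set(question.lower().split())
    c = set(candidate.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank(question, candidates):
    """Pointwise re-ranking: score each candidate independently, sort
    by descending score."""
    return sorted(candidates, key=lambda c: score_pair(question, c),
                  reverse=True)
```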
Training. We train our models with self-supervision, see §3.1. To obtain in-domain models, we fine-tune BERT with the respective training data of the benchmark datasets of §3.2. We train the models for 20 epochs, with early stopping for in-domain BERT and without early stopping for zero-shot transfer. We report the average result over five runs for the in-domain models on AskUbuntu and SemEval (due to small evaluation splits) and over two runs for the remaining benchmarks. Following Mass et al. (2019), we sample a maximum of 10 negative candidate answers for each question in WikiPassageQA (new samples in each epoch).
For the LAS datasets we randomly sample 10 negative candidates from the corpus. For InsuranceQA and AskUbuntu, we randomly sample one negative candidate due to their larger training sizes.
Adapters. To reduce the storage requirements, and to efficiently distribute our models to the community, we train adapters (Houlsby et al., 2019; Rebuffi et al., 2017) instead of fully fine-tuning our 140 single-domain BERT models. Adapters share the parameters of a large pre-trained model (in our case BERT) and introduce a small number of task-specific parameters. With that, adapters transform the intermediate representations in every BERT layer for the training task while keeping the pre-trained model itself unchanged. We use the recent architecture of Pfeiffer et al. (2020a), which makes it possible to investigate their adapter combination technique AdapterFusion in §5. In preliminary experiments, we find that using adapters instead of full model fine-tuning does not decrease the model performance while drastically reducing the number of trained parameters (one model is ∼5 MB).
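A bottleneck adapter layer is small enough to sketch directly. This dependency-free version (plain lists instead of tensors, ReLU as the non-linearity) illustrates the down-projection, up-projection, and residual connection; it is a generic sketch, not the exact architecture of Pfeiffer et al. (2020a).

```python
def adapter_forward(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, ReLU, up-project, add residual.

    h: hidden vector of size d (list of floats),
    W_down: d x m matrix, W_up: m x d matrix, with bottleneck m << d.
    """
    # down-projection into the bottleneck, followed by ReLU
    z = [max(0.0, sum(h[i] * W_down[i][j] for i in range(len(h))) + b_down[j])
         for j in range(len(b_down))]
    # up-projection back to the hidden size
    out = [sum(z[j] * W_up[j][k] for j in range(len(z))) + b_up[k]
           for k in range(len(b_up))]
    # residual connection keeps the pre-trained representation accessible
    return [h[k] + out[k] for k in range(len(h))]
```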

Zero-Shot Transfer from 140 Domains
In this section, we study the zero-shot transfer performances of all models ( §4.1) and investigate whether domain similarity and training data size are suitable for predicting the best models ( §4.2).

Results
In Figure 2, we show the zero-shot transfer to all nine benchmarks. Except for SemEval17, all results are for the dev split. Diamonds show the performance of IR baselines and in-domain BERT.

Zero-shot transfer vs. IR baselines. We observe that the wide range of domain-specific models transfers extremely well to all evaluation datasets. For instance, all models largely outperform the IR baselines on six benchmarks. This suggests that learning a general similarity function in BERT for our type of data (i.e., short questions and long answers, or pairs of long questions) is important and is indeed learned by the models. The low variances of the model performances, especially for more general domains such as Travel, Cooking, and SemEval17, indicate that the domain-specific factors either have a smaller impact or were already learned during BERT pre-training. Other work has shown that IR baselines are often hard to beat, e.g., most neural models trained in-domain on WikiPassageQA perform below BM25 (Cohen et al., 2018). In contrast, we show that a large number of BERT models from a variety of 140 domains outperform these baselines without requiring any in-domain supervision.
Zero-shot vs. in-domain models. BERT trained in-domain performs best in most cases. The difference is larger for expert topics with big training sets (InsuranceQA, AskUbuntu), which shows that our setup provides a challenging test-bed for measuring the generalization capabilities of models. However, for target domains with few training instances (see Table 1), the differences of in-domain BERT to the best zero-shot transfer models are much smaller. Importantly, these setups pose crucial and realistic challenges for text matching approaches (Rücklé et al., 2019a,b). For instance, on SemEval17, this results in low performances for in-domain BERT. In contrast, our best zero-shot transfer model achieves a performance of 51.13 MAP, which is 2.13 points better than the best challenge participant (Nakov et al., 2017). This clearly demonstrates that zero-shot transfer is a suitable alternative to in-domain models, which also contrasts with the large performance degradations often observed with traditional models such as LSTMs (Shah et al., 2018). Importantly, we find no substantial differences between question similarity and answer selection tasks, neither of which is explicitly learned during training. We thereby take an important step towards overcoming the boundaries between individual tasks and domains.

Analysis
Due to the large number of 140 domain-specific models, each trained on datasets of different sizes, we are able to perform unique analyses of the zero-shot transfer performances on the target tasks.
Ideally, we would like to identify a small number of models that transfer well to a given dataset, without requiring costly evaluations of all models. In the following, we probe the two most commonly used domain selection techniques, (1) domain similarity and (2) training size, with regard to the transfer performances. To simulate an optimal selection, we define an oracle that always identifies the best models. We present our findings in Figure 3.

Domain similarity. To measure the domain similarity, we embed the questions of all datasets with Sentence-RoBERTa (Reimers and Gurevych, 2019). For each dataset, we obtain the mean over all embeddings and calculate the domain similarity to other datasets with cosine similarity. Domain similarity is most effective when selecting models for benchmarks of technical domains, e.g., AskUbuntu, LAS-Apple, and LAS-Aviation in Figure 3. However, this does not hold true for benchmarks of non-technical domains such as LAS-Travel or WikiPassageQA. In those cases, only considering the most similar source domains does not improve the average model performance. One reason might be that there do not exist many similar non-technical domains within StackExchange from which models can transfer domain-specific idiosyncrasies. However, as we have shown in §4.1, such knowledge is not essential, i.e., a large number of models from more distant domains achieve good zero-shot transfer performances.
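The similarity measure above can be sketched as follows; the `embed` callable stands in for Sentence-RoBERTa, which we do not reimplement here.

```python
def mean_embedding(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def domain_similarity(questions_a, questions_b, embed):
    """Cosine similarity between the mean question embeddings of two
    domains; `embed` maps a question string to a vector (in the paper:
    Sentence-RoBERTa)."""
    return cosine(mean_embedding([embed(q) for q in questions_a]),
                  mean_embedding([embed(q) for q in questions_b]))
```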
We provide examples of the best models and the most similar domains for three benchmarks in Table 2 (more are given in Appendix A.4). Many of the best models are from distant domains, e.g., 'Ethereum' for WikiPassageQA or 'SciFi' for LAS-Travel. This shows the importance of considering a broad selection of source domains, including ones that are not intuitively close.
Training size. The average performance of our models after removing the smallest domains improves more consistently (see WikiPassageQA and InsuranceQA in Figure 3). This shows that the training size is more suitable for identifying models that achieve low performance scores, e.g., models that are trained on very narrow expert domains. However, the training size alone cannot identify the best models for zero-shot transfer. It is thus crucial not to limit the scope to the largest datasets at hand when exploring suitable training tasks. Importantly, this contrasts with the common procedure of only including the largest domains for transfer (Shah et al., 2018).
Summary. We have established that neither domain similarity nor training data size is suitable for predicting the best models. This shows that more elaborate strategies are necessary for automatically identifying the most suitable training sets. Most importantly, we also demonstrate the importance of considering a broad selection of source domains instead of following the standard practice of merely relying on the most similar or largest domains. These insights could also be beneficial for researchers in related areas, e.g., to consider a wider range of domains and source datasets prior to domain adaptation.

Zero-Shot Transfer from Combinations of Multiple Domains
We now investigate how to best combine multiple source domains for zero-shot transfer. We denote our models as MultiCQA.

Setup
Combination methods. We use (1) multi-task learning and share all model layers across the domains. In each minibatch, we sample instances from a single source domain, which we select with a round-robin schedule. Models trained in this manner are denoted as MT.
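The round-robin sampling can be sketched as follows; the data layout and function name are illustrative. Each minibatch comes from a single domain, and domains are visited in turn regardless of their highly imbalanced sizes.

```python
from itertools import cycle
import random

def round_robin_batches(domain_data, batch_size, num_batches, seed=0):
    """Yield (domain, minibatch) pairs for multi-task training: each
    batch is drawn from a single source domain, with domains visited in
    round-robin order independent of their sizes."""
    rng = random.Random(seed)
    names = cycle(sorted(domain_data))  # fixed, repeating domain order
    for _ in range(num_batches):
        domain = next(names)
        batch = [rng.choice(domain_data[domain]) for _ in range(batch_size)]
        yield domain, batch
```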
In addition, we (2) combine knowledge from our domain adapters (§4) with AdapterFusion (AF; Pfeiffer et al., 2020a). This learns a weighted combination of multiple (fixed) adapters in each BERT layer and is typically trained on the target task. We adapt this approach to our zero-shot setup and train it with multi-task learning as above; we use AdapterFusion without the value matrix to avoid additional regularization, as in Pfeiffer et al. (2020a).

Data. We use the training data of §3.1 and exclude the domains that are used in any of the evaluation datasets (AskUbuntu, aviation, travel, cooking, academia, apple). We use three sets of source domains: (1) the set of 18 topically balanced domains, consisting of the top-three domains (according to the number of questions asked) from each of the six broad categories as defined by StackExchange, i.e., technology, culture, life, science, professional, and business (see Appendix A.5 for the list of included domains); (2) the largest 18 domains according to the number of asked questions; (3) all 134 included domains.
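At its core, AF computes a softmax-normalized mixture of the fixed adapter outputs in each layer. The sketch below takes the per-adapter attention scores as given; in the real model they come from learned query/key projections over the layer input, which we do not reimplement here.

```python
import math

def fuse_adapter_outputs(adapter_outputs, attention_scores):
    """Softmax over the per-adapter scores, then mix the (fixed) adapter
    output vectors with the resulting weights. No value matrix is used,
    matching the variant we train."""
    exps = [math.exp(s) for s in attention_scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax normalization
    dim = len(adapter_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, adapter_outputs))
            for i in range(dim)]
```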
We additionally study the impact of extending our training data with community-labeled instances from the source domains. For a positive instance of question title and body, we add positive instances of (a) question title and accepted answer, and (b) question title and body of a duplicate question. We name this extended data.
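Constructing the extended data can be sketched as follows; the field names are illustrative, and only questions with an accepted answer or a marked duplicate contribute the extra community-labeled positives.

```python
def extend_training_data(questions):
    """'Extended data': keep the self-supervised title-body positive for
    each question and add community-labeled positives from (a) the
    accepted answer and (b) the body of a marked duplicate question."""
    pairs = []
    for q in questions:
        pairs.append((q["title"], q["body"], 1))
        if q.get("accepted_answer"):
            pairs.append((q["title"], q["accepted_answer"], 1))
        if q.get("duplicate_body"):
            pairs.append((q["title"], q["duplicate_body"], 1))
    return pairs
```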
Models. If not otherwise noted, we fine-tune BERT base. We also experiment with BERT large and RoBERTa large (all uncased). For MultiCQA models this corresponds to MultiCQA B, MultiCQA B-lg, and MultiCQA RBa-lg. The training procedure, number of runs, and hyperparameters are as in §3.3.
We additionally compare our models to the question/answer encoder USE-QA (Yang et al., 2020), which is a state-of-the-art model for retrieving answers in zero-shot transfer setups. The IR baselines are the same as in §4.1 (TF*IDF for LAS, BM25 for WikiPassageQA and InsuranceQA, and a search engine ranking for SemEval17-the official challenge baseline).

Results
Multiple source domains. In Table 3, we show the results of MultiCQA B with MT and AF for the different sets of source domains, and compare them to the respective best single-domain models of §4. We observe that the balanced set of source domains achieves better results than combining the domains with the largest training sets, which shows that diversity is more important than size. Most importantly, MT with data from all source domains outperforms the respective best single-domain model on six out of nine benchmarks. This demonstrates that common problems of MT, catastrophic interference between training sets in particular, do not occur in our setup. This also reveals that combining source domains on a massive scale is possible.
MT and AF are both effective combination methods, with minor differences on most datasets. However, MT performs considerably better on InsuranceQA, which is a very narrow expert domain. The reason for this is that AF combines fixed domain-specific adapters, which can lead to reduced performances if none of the adapters is related to the target domain. AF can also lead to better results, e.g., on LAS-Academia. We include an analysis of AF for these datasets in Appendix A.3, where we also visualize the learned fusion weights. Interestingly, we find that the fusion weights do not differ much between the two datasets. However, when we remove a single adapter, we also observe that AF automatically replaces it with another adapter from a similar source domain, indicating that this approach is robust.
Additional labeled data. In Table 3, we also see that extending the training data of MT models with additional labeled data from question-answer pairs and question duplicates considerably and consistently improves the performances in 16 of 18 cases. This improves the performance of MT all on all nine benchmarks, which shows that our approach is very effective when combining a large number of smaller domains. Due to these consistent improvements, we train all our large MultiCQA models with MT all and the extended data.
Comparison to in-domain models. In Table 4, we compare our large MultiCQA models to the in-domain state of the art. We find that the additional capacity of the models and the better initialization with RoBERTa considerably improves the zero-shot transfer performances (on average). Our best zero-shot MultiCQA RBa-lg model outperforms USE-QA on eight benchmarks, and performs better than the previous in-domain state of the art on all LAS datasets and on SemEval17.
Our MultiCQA models are thus highly effective and re-usable across different domains and tasks. This clearly demonstrates the effectiveness and feasibility of training suitable models for zero-shot transfer that are widely applicable to different realistic settings.

Table 5: A mistake of MultiCQA RBa-lg (zero-shot transfer) on AskUbuntu. The model likely does not understand the intention of the query, which is to change the behavior of the installer (and not merely passing parameters to something).
Further in-domain fine-tuning. Finally, we show that MultiCQA RBa-lg is an effective initialization for in-domain fine-tuning. This leads to large gains and achieves state-of-the-art results on all nine benchmarks.

Analysis
We manually inspect 50 instances each of InsuranceQA and AskUbuntu for which our zero-shot transfer model MultiCQA RBa-lg selects a wrong answer or an unrelated question. We find that the texts are always on-topic, i.e., many aspects of the question are included in the selected answers (InsuranceQA) or in the potentially similar questions (AskUbuntu). This includes keywords, phrases (often paraphrased), names, version numbers, etc. The most common source of error is that an important aspect of the question appears to be ignored or is (likely) not understood by the model. For instance, many aspects of the question might be mentioned in a potentially similar question on AskUbuntu, but in the wrong context. Table 5 shows an example of such a case, and we provide more examples and additional details in Appendix A.6. We find that this type of error affects 25 of 50 instances in AskUbuntu, and 10 of 50 instances in InsuranceQA (in 8/50 cases in AskUbuntu and 30/50 cases in InsuranceQA, our model actually selects relevant texts, e.g., correct answers or similar questions, which are not labeled as such). Future work could thus achieve further improvements by enhancing the overall understanding of question and answer texts. Current models seemingly match similar keywords or phrases of the questions and answers, often without truly understanding them in context.

Conclusion
We studied the zero-shot transfer of text matching models on a massive scale, with 140 different source domains and nine benchmark datasets of non-factoid answer selection and question similarity tasks. By investigating such a large number of models, we provided an extensive comparison and fair baselines to combination methods, and were able to extensively analyze a large sample size.
We have shown that (1) BERT models trained in a self-supervised manner on cQA forum data transfer well to all our benchmarks, even across distant domains; (2) training data size and domain similarity are not suitable for predicting the zero-shot transfer performances, revealing that a broad selection of source domains is crucial; (3) our MultiCQA approach that combines self-supervised and supervised training data across a large set of source domains outperforms many in-domain baselines and achieves state-of-the-art zero-shot performances on six benchmarks; (4) fine-tuning MultiCQA RBa-lg in-domain further improves the performances and achieves state-of-the-art results on all nine benchmarks.
We clearly demonstrated the effectiveness and the relevance of zero-shot transfer in many realistic scenarios and believe that our work lays foundations for a wide range of research questions. For instance, combining our approach with additional pre-training objectives such as the Inverse Cloze Task (Chang et al., 2020) could substantially increase the amount of training data for the large quantity of smaller forums. Researchers could also use our 140 domain-specific adapters and investigate further combination techniques to make them even more broadly applicable.

A.1 Hyperparameters
For computational and memory reasons, we limit the maximum sequence length to 300 tokens (instead of the maximum of 512 in BERT) for all our models. Similar sequence lengths are commonly used on the benchmarks that we study (e.g., Mass et al., 2019; Tan et al., 2016).
For all experiments, we use a batch size of 32 and a linear warmup schedule over one epoch. We train all models for 20 epochs with early stopping of in-domain models, and without early stopping for zero-shot transfer.
For full model fine-tuning on SemEval17, we use a learning rate of 5 × 10⁻⁵, due to the very small size of the dataset. In all other cases with full model fine-tuning, we use learning rates that we optimized on WikiPassageQA and InsuranceQA. For this, we explored the manual selection of learning rates of 0.001, 0.0001, and 5 × 10⁻⁵. The development scores on InsuranceQA are 43.25, 40.00, and 39.25 (accuracy), respectively. The development scores on WikiPassageQA are 72.69, 71.93, and 72.26 (MAP), respectively. We thus chose 0.001 as the learning rate when fine-tuning BERT (and RoBERTa) models.
For the training of adapters and AdapterFusion, we use the learning rates recommended by Pfeiffer et al. (2020a), which are 0.0001 and 5 × 10⁻⁵, respectively.

A.2 Computing Infrastructure
We used a heterogeneous cluster with different types of GPUs for our experiments. Our most demanding experiments with RoBERTa large were performed with one NVIDIA Tesla V100 GPU with 32 GB memory (per experiment). To train the models with a batch size of 32, we used accumulation of gradients over two smaller mini-batches of size 16. One epoch with all source domains trains for 97 minutes on average. The remaining experiments were split across NVIDIA Tesla V100/P100 GPUs (32 GB) and NVIDIA Titan RTX GPUs (24 GB).

A.3 AdapterFusion (AF) on LAS-Academia and InsuranceQA
AdapterFusion learns a weighted combination of adapter outputs in each BERT layer, which is dependent on the layer input. Similar to Pfeiffer et al.
(2020a), we can thus plot the activations of the individual adapters for different benchmarks in order to analyze which source domains are most impactful. Further, this allows us to observe how the activations differ across different benchmarks.
In Figure 4 and in Figure 5, we plot the activations of AF balanced on LAS-Academia and on InsuranceQA, which were the best and worst transfer datasets of this approach, respectively (compared to MT; see §5.2). We find that the activations are very similar across the two benchmarks, which indicates that our model learns to focus less on the model input. This shows that some adapters are better suited than others for individual BERT layers, e.g., the adapter for the 'English' domain dominates layers 9 and 10, and the 'OpenSource' and 'StackExchange' adapters dominate layer 11.
When transferring to the narrow expert domain InsuranceQA, interestingly, the same adapters are activated in the BERT layers, with slightly different strengths compared to LAS-Academia. This means that specific combinations of the same adapters are helpful for a variety of downstream tasks.
To investigate the impact of the single most important adapters and how they affect the performance of AF, we remove the adapter of the English domain, which has the strongest activations in AF balanced, and plot the result for LAS-Academia in Figure 6. We observe that AF now increases the activation of the 'Ell' (English Language Learners) adapter (see layer 9). This shows that AF has learned to utilize particular types of information encoded in adapters that exploit similar attributes, rather than combining a fixed selection of adapters. If, as in this scenario, an adapter is no longer available, AF extracts the information from other, similar adapters. This validates the effectiveness of AF and shows that different kinds of information are stored within the different layers of the adapters.

A.4 Individual Transfer: Best Models and Most Similar Domains
We show the best models and most similar domains for all benchmarks in Table 6 (an excerpt is provided in Table 2 of the paper). In particular, we see that the best source domains vary across the different benchmarks. Often, the best models come neither from intuitively close domains nor from domains with large training sets.

A.5 List of Domains in Combination Experiments
The list of all domains is available on the web: https://stackexchange.com/sites. We list the domains used for our two subsets in §5 below.

A.6 Examples of Wrong Predictions on InsuranceQA and AskUbuntu
We provide additional examples of mistakes made by MultiCQA RBa-lg (zero-shot transfer) on AskUbuntu and InsuranceQA to complement our brief analysis in Section 6. Table 7 shows an additional example for AskUbuntu. The query question asks for the maximum number of CPUs that can be handled by a kernel. The selected similar question, however, asks where the kernel gets its information about the available CPUs, not what the maximum possible number of CPUs is. Tables 8 and 9 show examples of similar problems in InsuranceQA.
Query question: How many maximum CPUs does Ubuntu support by default? I think this is kernel dependent and probably will change over time depending on the kernel a release uses, correct me if wrong I'd like to know [...]

Most similar (MultiCQA RBa-lg): Creation of /proc/stat. Which function of the kernel creates and writes the information for /proc/stat. In this, would like to know when kernel gets the CPU information (recognises number of CPUs) [...]

Ground truth: Ubuntu Linux 14.04 LTS server edition information need. I was wondering what's the maximum RAM, and maximum CPUs does the Ubuntu Linux 14.04 LTS server edition can handle [...]

Table 7: AskUbuntu example (shortened). This shows that our model mostly focuses on number of CPUs and kernel information instead of recognizing that the crucial information is the maximum number. We underline important aspects that differ.
Question: Can I buy a car without insurance?
Selected answer (MultiCQA RBa-lg): You most certainly can get auto insurance without a car. if you needed to borrow, test drive, rent, or lease a vehicle for whatever reason you would purchase what is called a drive other car policy. [...]

Ground truth: Depending in the state you live in and also if your are financing the car. if you have a loan on the car the financial institution will require insurance before you even leave the car lot. if you are buying from a private party they may not require this but in most states you can not even get your license plates with out insurance.

Table 8: InsuranceQA example 1 (shortened). This shows that the model does not interpret the individual keywords within context, i.e., it does not differentiate between car without insurance and insurance without car. We underline important aspects that differ in the most similar candidate.
Query question: Why is state farm life insurance so expensive?
Selected answer (MultiCQA RBa-lg): State farm offers life insurance, both term and permanent through their captive agents along with property and casualty insurance. However, unlike the latter types of coverage [...]

Ground truth: Every carrier has their own rates -these are based off a long calculation of actuarial values and mortality tables. Some carriers are more aggressive than others and are willing to take on more risk [...] more conservative carriers feature higher rates. So it's hard to say one carrier is just very expensive.

Table 9: InsuranceQA example 2 (shortened). The selected answer describes state farm life insurance, whereas the ground truth explains why it can be expensive. We underline important aspects that differ in the most similar candidate.