Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.


Introduction
Unsupervised pretraining, e.g., BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019b), has recently pushed the state of the art on many natural language understanding tasks. One method of further improving pretrained models that has been shown to be broadly helpful is to first fine-tune a pretrained model on an intermediate task, before fine-tuning again on the target task of interest (Phang et al., 2018; Wang et al., 2019a; Clark et al., 2019a; Sap et al., 2019), also referred to as STILTs. However, this approach does not always improve target task performance, and it is unclear under what conditions it does. This paper offers a large-scale empirical study aimed at addressing this open question. We perform a broad survey of intermediate and target task pairs, following an experimental pipeline similar to Phang et al. (2018) and Wang et al. (2019a). This differs from previous work in that we use a larger and more diverse set of intermediate and target tasks, introduce additional analysis-oriented probing tasks, and use a better-performing base model, RoBERTa (Liu et al., 2019b). We aim to answer the following specific questions:
• What kind of tasks tend to make good intermediate tasks across a wide variety of target tasks?
• Which linguistic skills does a model learn from intermediate-task training?
• Which skills learned from intermediate tasks help the model succeed on which target tasks?
The first question is the most straightforward: it can be answered by a sufficiently exhaustive search over possible intermediate-target task pairs. The second and third questions address the why rather than the when, and differ in a crucial detail: A model might learn skills by training on an intermediate task, but those skills might not help it to succeed on a target task. Our search for intermediate tasks focuses on natural language understanding tasks in English. In particular, we run our experiments on 11 intermediate tasks and 10 target tasks, which results in a total of 110 intermediate-target task pairs. We use 25 probing tasks, each of which targets a narrowly defined model behavior or linguistic phenomenon, to shed light on which skills are learned from each intermediate task.
Our findings include the following: (i) Natural language inference tasks as well as QA tasks which involve commonsense reasoning are generally useful as intermediate tasks. (ii) SocialIQA and QQP as intermediate tasks are not helpful as a means to teach the skills captured by our probing tasks, while fine-tuning first on MNLI and CosmosQA results in an increase in all skills. (iii) While a model's ability to learn skills relating to input-noising correlates with target task performance, low-level skills, such as preserving knowledge of a sentence's raw content and detecting attributes of input sentences like the tense of the main verb or sentence length, are less correlated with target task performance. This suggests that a model's ability to do well on the masked language modelling (MLM) task is important for downstream performance. Furthermore, we conjecture that a portion of our analysis is affected by catastrophic forgetting of knowledge learned during pretraining.

Experimental Pipeline
Our experimental pipeline (Figure 1) consists of two steps, starting with a pretrained model: intermediate-task training, and fine-tuning on a target or probing task.
Intermediate Task Training We fine-tune RoBERTa on each intermediate task. The training procedure follows the standard procedure of fine-tuning a pretrained model on a target task, as described in Devlin et al. (2019). We opt for single intermediate-task training as opposed to multi-task training to isolate the effect of skills learned from individual intermediate tasks.
Target and Probing Task Fine-Tuning After intermediate-task training, we fine-tune our models on each target and probing task individually. Target tasks are tasks of interest to the general community, spanning various facets of natural language, domains, and sources. Probing tasks, while potentially similar in data source to target tasks (as with CoLA), are designed to isolate the presence of particular linguistic capabilities or skills. For instance, solving the target task BoolQ (Clark et al., 2019a) may require various skills including coreference and commonsense reasoning, while probing tasks like the SentEval probing suite (Conneau et al., 2018) target specific syntactic and metadata-level phenomena such as subject-verb agreement and sentence length detection. Table 1 presents an overview of the intermediate and target tasks.

Intermediate Tasks
We curate a diverse set of tasks that either represent an especially large annotation effort or that have been shown to yield positive transfer in prior work. The resulting set of tasks covers question answering, commonsense reasoning, and natural language inference.
QAMR The Question-Answer Meaning Representations dataset (Michael et al., 2018) is a crowdsourced QA task consisting of question-answer pairs that correspond to predicate-argument relationships. It is derived from Wikinews and Wikipedia sentences. For example, if the sentence is "Ada Lovelace was a computer scientist.", a potential question is "What is Ada's last name?", with the answer being "Lovelace."
CommonsenseQA CommonsenseQA is a multiple-choice QA task derived from ConceptNet (Speer et al., 2017) with the help of crowdworkers, which is designed to test a range of commonsense knowledge.
SciTail SciTail (Khot et al., 2018) is a textual entailment task built from multiple-choice science questions from 4th grade and 8th grade exams, as well as crowdsourced questions (Welbl et al., 2017). The task is to determine whether a hypothesis, which is constructed from a science question and its corresponding answer, is entailed or not (neutral) by the premise.
Cosmos QA Cosmos QA (Huang et al., 2019) is a commonsense-based reading comprehension task formulated as multiple-choice questions. The questions concern the causes or effects of events and require reasoning based not only on the exact text spans in the context, but also on wide-ranging abstractive commonsense reasoning. It differs from CommonsenseQA in that it focuses on causal and deductive commonsense reasoning and that it requires reading comprehension over an auxiliary passage, rather than simply answering a freestanding question.
SocialIQA SocialIQA (Sap et al., 2019) is a multiple-choice QA task. It tests reasoning about emotional and social intelligence in everyday situations.
CCG CCGbank (Hockenmaier and Steedman, 2007) is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations. We use the CCG supertagging task, which is the task of assigning tags to individual word tokens that jointly determine the parse of the sentence.
HellaSwag HellaSwag (Zellers et al., 2019) is a commonsense reasoning task that tests a model's ability to choose the most plausible continuation of a story. It is built using adversarial filtering (Zellers et al., 2018) with BERT to create challenging negative examples.

QA-SRL The question-answer driven semantic role labeling dataset (QA-SRL; He et al., 2015) is a QA task that is derived from a semantic role labeling task. Each example, which consists of a set of questions and answers, corresponds to a predicate-argument relationship in the sentence it is derived from. Unlike QAMR, which focuses on all words in the sentence, QA-SRL is specifically focused on verbs.
SST-2 The Stanford sentiment treebank (Socher et al., 2013) is a sentiment classification task based on movie reviews. We use the binary sentence classification version of the task.
QQP The Quora Question Pairs dataset is constructed based on questions posted on the community question-answering website Quora. The task is to determine if two questions are semantically equivalent.
MNLI The Multi-Genre Natural Language Inference dataset (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations across a variety of genres.

Target Tasks
We use ten target tasks, eight of which are drawn from the SuperGLUE benchmark. The tasks in the SuperGLUE benchmark cover question answering, entailment, word sense disambiguation, and coreference resolution and have been shown to be easy for humans but difficult for models like BERT. Although we offer a brief description of the tasks below, we refer readers to the SuperGLUE paper for a more detailed description of the tasks.
CommitmentBank (CB; de Marneffe et al., 2019) is a three-class entailment task that consists of texts and an embedded clause that appears in each text, in which models must determine whether that embedded clause is entailed by the text.
Choice of Plausible Alternatives (COPA; Roemmele et al., 2011) is a classification task that consists of premises and a question that asks for the cause or effect of each premise, in which models must correctly pick between two possible choices.
Winograd Schema Challenge (WSC; Levesque et al., 2012) is a sentence-level commonsense reasoning task that consists of texts, a pronoun from each text, and a list of possible noun phrases from each text. The dataset has been designed such that world knowledge is required to determine which of the possible noun phrases is the correct referent of the pronoun. We use the SuperGLUE binary classification cast of the task, where each example consists of a text, a pronoun, and a noun phrase from the text, which models must classify as being coreferent with the pronoun or not.
Recognizing Textual Entailment (RTE; Dagan et al., 2005, et seq.) is a textual entailment task.
Multi-Sentence Reading Comprehension (MultiRC; Khashabi et al., 2018) is a multi-hop QA task that consists of paragraphs, a question on each paragraph, and a list of possible answers, in which models must distinguish which of the possible answers are true and which are false.
Word-in-Context (WiC; Pilehvar and Camacho-Collados, 2019) is a binary classification word sense disambiguation task. Examples consist of two text snippets, with a polysemous word that appears in both. Models must determine whether the same sense of the word is used in both contexts.
BoolQ (Clark et al., 2019a) is a QA task that consists of passages and a yes/no question associated with each passage.
Reading Comprehension with Commonsense Reasoning (ReCoRD) is a multiple-choice QA task that consists of news articles. For each article, models are given a question about the article with one entity masked out and a list of possible entities from the article, and the goal is to correctly identify the masked entity out of the list.
Additionally, we use CommonsenseQA and Cosmos QA from our set of intermediate tasks as target tasks, due to their unique combination of small dataset size and high level of difficulty for high-performing models like BERT.
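As a rough illustration of the experimental grid, the following minimal sketch enumerates every intermediate-target pair from the task lists given in the sections above. The string labels and variable names are illustrative shorthand introduced here, not identifiers from any released code.

```python
from itertools import product

# Task names as listed in the text; the string labels are illustrative shorthand.
INTERMEDIATE_TASKS = [
    "QAMR", "CommonsenseQA", "SciTail", "CosmosQA", "SocialIQA", "CCG",
    "HellaSwag", "QA-SRL", "SST-2", "QQP", "MNLI",
]
TARGET_TASKS = [
    "CB", "COPA", "WSC", "RTE", "MultiRC", "WiC", "BoolQ", "ReCoRD",
    "CommonsenseQA", "CosmosQA",
]

# Every intermediate task is paired with every target task.
task_pairs = list(product(INTERMEDIATE_TASKS, TARGET_TASKS))

# Two tasks play both roles, so two pairs have the same task on both sides.
self_pairs = [(inter, target) for inter, target in task_pairs if inter == target]

print(len(task_pairs))  # 110
print(self_pairs)
```

Because CommonsenseQA and Cosmos QA appear in both lists, two of the 110 pairs train and evaluate on the same task, which is worth keeping in mind when reading per-pair results.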

Probing Tasks
We use well-established datasets for our probing tasks, including the edge-probing suite from Tenney et al. (2019b).
Acceptability Judgment Tasks This set of binary classification tasks was designed to investigate if a model can judge the grammatical acceptability of a sentence. We use the following five datasets: AJ-CoLA is a task that tests for a model's understanding of general grammaticality using the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019b), which is drawn from 22 theoretical linguistics publications. The other tasks concern the behaviors of specific classes of function words, using the dataset by Kim et al. (2019): AJ-WH is a task that tests a model's ability to detect if a wh-word in a sentence has been swapped with another wh-word, which tests a model's ability to identify the antecedent associated with the wh-word. AJ-Def is a task that tests a model's ability to detect if the definite/indefinite articles in a given sentence have been swapped. AJ-Coord is a task that tests a model's ability to detect if a coordinating conjunction has been swapped, which tests a model's ability to understand how ideas in the various clauses relate to each other. AJ-EOS is a task that tests a model's ability to identify grammatical sentences without indicators such as punctuation marks and capitalization, and consists of grammatical text with punctuation removed.
Edge-Probing Tasks The edge probing (EP) tasks are a set of core NLP labeling tasks, collected by Tenney et al. (2019b) and cast into Boolean classification. These tasks focus on the syntactic and semantic relations between spans in a sentence. The first five tasks use the OntoNotes corpus (Hovy et al., 2006): Part-of-Speech tagging (EP-POS) is a task that tests a model's ability to predict the syntactic category (noun, verb, adjective, etc.) for each word in the sentence. Named entity recognition (EP-NER) is a task that tests a model's ability to predict the category of an entity in a given span. Semantic Role Labeling (EP-SRL) is a task that tests a model's ability to assign a label to a given span of words that indicates its semantic role (agent, goal, etc.) in the sentence. Coreference (EP-Coref) is a task that tests a model's ability to classify if two spans of tokens refer to the same entity/event.
The other datasets can be broken down into both syntactic and semantic probing tasks. Constituent labeling (EP-Const) is a task that tests a model's ability to classify a non-terminal label for a span of tokens (e.g., noun phrase, verb phrase, etc.). Dependency labeling (EP-UD) is a task that tests a model on the functional relationship of one token relative to another. We use the English Web Treebank portion of Universal Dependencies 2.2 release (Silveira et al., 2014) for this task. Semantic Proto-Role labeling is a task that tests a model's ability to predict the fine-grained non-exclusive semantic attributes of a given span. Edge probing uses two datasets for SPR: SPR1 (EP-SPR1) (Teichert et al., 2017), derived from the Penn Treebank, and SPR2 (EP-SPR2) (Rudinger et al., 2018), derived from the English Web Treebank. Relation classification (EP-Rel) is a task that tests a model's ability to predict the relation between two entities. We use the SemEval 2010 Task 8 dataset (Hendrickx et al., 2009) for this task. For example, the relation between "Yeri" and "Korea" in "Yeri is from Korea" is ENTITY-ORIGIN. The Definite Pronoun Resolution dataset (Rahman and Ng, 2012) (EP-DPR) is a task that tests a model's ability to handle coreference, and differs from OntoNotes in that it focuses on difficult cases of definite pronouns.
SentEval Tasks The SentEval probing tasks (SE) (Conneau et al., 2018) are cast in the form of single-sentence classification. Sentence Length (SE-SentLen) is a task that tests a model's ability to classify the length of a sentence. Word Content (SE-WC) is a task that tests a model's ability to identify which of a set of 1,000 potential words appear in a given sentence. Tree Depth (SE-TreeDepth) is a task that tests a model's ability to estimate the maximum depth of the constituency parse tree of the sentence. Top Constituents (SE-TopConst) is a task that tests a model's ability to identify the high-level syntactic structure of the sentence by choosing among 20 constituent sequences (the 19 most common, plus an other category). Bigram Shift (SE-BShift) is a task that tests a model's ability to classify if two consecutive tokens in the same sentence have been reordered. Coordination Inversion (SE-CoordInv) is a task that tests a model's ability to identify if two coordinating clausal conjoints are swapped (ex: "he knew it, and he deserved no answer."). Past-Present (SE-Tense) is a task that tests a model's ability to classify the tense of the main verb of the sentence. Subject Number (SE-SubjNum) and Object Number (SE-ObjNum) are tasks that test a model's ability to classify whether the subject or direct object of the main clause is singular or plural. Odd-Man-Out (SE-SOMO) is a task that tests the model's ability to predict whether a sentence has had one of its content words randomly replaced with another word of the same part of speech.

Experiments
Training and Optimization We use the large-scale pretrained model RoBERTa Large in all experiments. For each intermediate, target, and probing task, we perform a hyperparameter sweep, varying the peak learning rate $\in \{2 \times 10^{-5}, 1 \times 10^{-5}, 5 \times 10^{-6}, 3 \times 10^{-6}\}$ and the dropout rate $\in \{0.2, 0.1\}$. After choosing the best learning rate and dropout rate, we apply the best configuration for each task for all runs. For each task, we use the batch size that maximizes GPU usage, and use a maximum sequence length of 256. Aside from these details, we follow the RoBERTa paper for all other training hyperparameters. We use NVIDIA P40 GPUs for our experiments.
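The sweep described above can be sketched as a small grid search. This is a minimal illustration only: the learning rates and dropout rates come from the text, while `select_best`, `toy_score`, and the use of a validation score as the selection criterion are assumptions made here, since the text does not state how the best configuration is chosen.

```python
from itertools import product

LEARNING_RATES = [2e-5, 1e-5, 5e-6, 3e-6]  # peak learning rates from the text
DROPOUT_RATES = [0.2, 0.1]                 # dropout rates from the text


def sweep_configs():
    """Return every (learning rate, dropout) combination in the sweep."""
    return [
        {"learning_rate": lr, "dropout": dropout}
        for lr, dropout in product(LEARNING_RATES, DROPOUT_RATES)
    ]


def select_best(configs, validation_score):
    """Pick the configuration with the highest score.

    `validation_score` is a hypothetical callable standing in for a full
    fine-tuning run followed by evaluation on the validation set.
    """
    return max(configs, key=validation_score)


def toy_score(cfg):
    """Toy stand-in scorer (an assumption, not a real result): it prefers
    the configuration closest to lr=1e-5 and dropout=0.1."""
    return -abs(cfg["learning_rate"] - 1e-5) - abs(cfg["dropout"] - 0.1)


configs = sweep_configs()
print(len(configs))  # 8 configurations per task

best = select_best(configs, toy_score)
print(best)
```

With four learning rates and two dropout rates, each task requires eight sweep runs before the main experiments begin.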
A complete pipeline with one intermediate task works as follows: First, we fine-tune RoBERTa on the intermediate task. We then fine-tune copies of the resulting model separately on each of the 10 target tasks and 25 probing tasks and test on their respective validation sets. We run the same pipeline three times for the 11 intermediate tasks, plus a set of baseline runs without intermediate training. This gives us 35 × 12 × 3 = 1260 observations. We train our models using the Adam optimizer (Kingma and Ba, 2015) with linear decay and early stopping. We run training for a maximum of 10 epochs when more than 1,500 training examples are available, and 40 epochs otherwise to ensure models are sufficiently trained on small datasets.
Target Task Performance We define good intermediate tasks as ones that lead to positive transfer in target task performance. We observe that tasks that require complex reasoning and inference tend to make good intermediate tasks. These include MNLI and commonsense-oriented tasks such as CommonsenseQA, HellaSwag, and Cosmos QA (with our poor performance with the similar SocialIQA serving as a surprising exception). SocialIQA, CCG, and QQP as intermediate tasks lead to negative transfer on all target tasks and the majority of probing tasks.
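The run bookkeeping for the pipeline described above can be sketched as follows. The counts and the epoch rule are taken from the text; the function and constant names are introduced here for illustration.

```python
def max_epochs(num_train_examples: int) -> int:
    """Epoch budget as stated in the text: 10 epochs when more than 1,500
    training examples are available, 40 epochs otherwise."""
    return 10 if num_train_examples > 1500 else 40


NUM_TARGET_TASKS = 10
NUM_PROBING_TASKS = 25
NUM_INTERMEDIATE_TASKS = 11
NUM_RESTARTS = 3

# Each downstream (target or probing) task is evaluated once per starting
# point: the 11 intermediate-trained models plus the no-intermediate baseline.
num_downstream = NUM_TARGET_TASKS + NUM_PROBING_TASKS
num_starting_points = NUM_INTERMEDIATE_TASKS + 1
num_observations = num_downstream * num_starting_points * NUM_RESTARTS

print(num_observations)  # 1260
print(max_epochs(1500))  # 40, since exactly 1,500 is not "more than 1,500"
print(max_epochs(1501))  # 10
```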
We investigate the effect of intermediate-task dataset size on downstream task performance by additionally running a set of experiments with varying amounts of data for five intermediate tasks; the results are shown in the Appendix. We do not find differences in intermediate-task dataset size to have any substantial consistent impact on downstream target task performance.
In addition, we find that smaller target tasks such as RTE, BoolQ, MultiRC, WiC, and WSC benefit the most from intermediate-task training. There are no instances of positive transfer to CommitmentBank, since our baseline model achieves 100% accuracy.
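To make the notion of positive and negative transfer concrete, the sketch below computes a transfer delta as the difference in mean score between intermediate-trained runs and baseline runs. The exact aggregation is an assumption made here for illustration, and all scores are made-up numbers rather than results from the experiments.

```python
from statistics import mean


def transfer_delta(intermediate_scores, baseline_scores):
    """Mean score with intermediate training minus mean baseline score.

    Positive values indicate positive transfer, negative values negative
    transfer. Scores are assumed to be on a 0-100 scale.
    """
    return mean(intermediate_scores) - mean(baseline_scores)


# Hypothetical scores over three restarts (made-up numbers for illustration).
baseline_small_task = [70.0, 72.0, 71.0]
with_intermediate = [78.0, 80.0, 79.0]
print(transfer_delta(with_intermediate, baseline_small_task))  # 8.0

# Ceiling effect: when the baseline already scores 100 on every run, the
# delta can be at most zero, so positive transfer cannot be observed.
baseline_at_ceiling = [100.0, 100.0, 100.0]
best_possible = [100.0, 100.0, 100.0]
print(transfer_delta(best_possible, baseline_at_ceiling))  # 0.0
```

The second case mirrors the CommitmentBank situation: a baseline at 100% accuracy leaves no headroom, so the absence of positive transfer there says nothing about the intermediate tasks themselves.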
Probing Task Performance Looking at the probing task performance, we find that intermediate-task training affects performance on low-level syntactic probing tasks uniformly across intermediate tasks; we observe little to no improvement for the SentEval probing tasks and higher improvement for acceptability judgment probing tasks, except for AJ-CoLA. This is also consistent with Phang et al. (2018), who find negative transfer with CoLA in their experiments.
Variation across Intermediate Tasks There is variable performance across higher-level syntactic or semantic tasks such as the Edge-Probing and SentEval tasks. SocialIQA and QQP have negative transfer for most of the Edge-Probing tasks, while CosmosQA and QA-SRL see drops in performance only for EP-Rel. While we do see that intermediate-task trained models improve performance on EP-SRL and EP-DPR across the board, there is little to no gain in SentEval probing tasks from any intermediate tasks. Additionally, tasks that increase performance on the largest number of probing tasks perform well as intermediate tasks.

Degenerate Runs
We find that the model may not exceed chance performance in some training runs. This mostly affects the baseline (no intermediate training) runs on the acceptability judgment probing tasks, excluding AJ-CoLA, which all have very small training sets. We include these degenerate runs in our analysis to reflect this phenomenon. Consistent with Phang et al. (2018), we find that intermediate-task training reduces the likelihood of degenerate runs, leading to ostensibly positive transfer results on those four acceptability judgment tasks across most intermediate tasks. On the other hand, extremely negative transfer from intermediate-task training can also result in a higher frequency of degenerate runs in downstream tasks, as we observe in the cases of using QQP and SocialIQA as intermediate tasks. We also observe a number of degenerate runs on the EP-SRL and EP-Rel tasks. These degenerate runs decrease positive transfer in probing tasks, such as with SocialIQA and QQP probing performance, and also decrease the average amount of positive transfer we see in target task performance.
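One simple way to flag such runs is to compare each run's score against the chance level for its task. The sketch below is a hypothetical illustration: the text gives no explicit threshold, so the `tolerance` parameter and the example accuracies are assumptions introduced here.

```python
def is_degenerate(score: float, chance_score: float, tolerance: float = 0.0) -> bool:
    """Flag a run whose score does not exceed chance performance.

    `tolerance` is a hypothetical slack term; the text gives no threshold.
    """
    return score <= chance_score + tolerance


def degenerate_rate(scores, chance_score, tolerance=0.0):
    """Fraction of runs flagged as degenerate."""
    flags = [is_degenerate(s, chance_score, tolerance) for s in scores]
    return sum(flags) / len(flags)


# Made-up accuracies for a balanced binary task (chance = 50.0).
baseline_runs = [50.0, 49.5, 81.0]
intermediate_runs = [80.0, 82.5, 79.0]

print(degenerate_rate(baseline_runs, 50.0))      # two of three runs flagged
print(degenerate_rate(intermediate_runs, 50.0))  # 0.0
```

Averaging over restarts that include degenerate runs pulls the baseline mean down sharply, which is why a reduction in degenerate runs can show up as apparent positive transfer.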

Correlation Between Probing and Target Task Performance
Next, we investigate the relationship between target and probing tasks in an attempt to understand why certain intermediate-task models perform better on certain target tasks.
We use probing task performance as an indicator of the acquisition of particular language skills. We compute the Spearman correlation between probing-task and target-task performances across training on different intermediate tasks and multiple restarts, as shown in Figure 3. We test for statistical significance at p = 0.05 and apply Holm-Bonferroni correction for multiple testing. We omit correlations that are not statistically significant. We opt for Spearman and not Pearson correlation because of the wide variety of metrics used for the different tasks.
We find that acceptability judgment probing task performance is generally uncorrelated with the target task performance, except for AJ-CoLA. Similarly, many of the SentEval tasks do not correlate with the target tasks, except for Bigram Shift (SE-BShift), Odd-Man-Out (SE-SOMO) and Coordination Inversion (SE-CoordInv). These three tasks are input noising tasks, i.e., tasks where a model has to predict if a given input sentence has been randomly modified, which are, by far, the most similar tasks we study to the masked language modeling task that is used for training RoBERTa. This may explain the strong correlation with the performance of the target tasks.
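The two statistical ingredients named above, Spearman correlation and Holm-Bonferroni correction, can be sketched in plain Python as follows. This is an illustrative sketch rather than the analysis code used for the paper: all function names are introduced here, the input numbers are made up, and the p-values passed to `holm_bonferroni` are assumed to come from a separate significance test (for example, from a statistics library), which is not shown.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject decision per hypothesis (step-down Holm procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The p-value at 0-based sorted position `rank` is compared
        # against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept too
    return reject


# Made-up scores: the target metric grows monotonically but non-linearly
# with the probing metric, as can happen when tasks use different metrics.
probe_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
target_scores = [10.0, 20.0, 40.0, 80.0, 160.0]
print(spearman(probe_scores, target_scores))  # 1.0
print(pearson(probe_scores, target_scores))   # below 1.0

# Made-up p-values: only the smallest survives the correction at alpha=0.05.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))  # [True, False, False, False]
```

The toy example shows why a rank-based measure suits heterogeneous metrics: Spearman correlation depends only on the ordering of scores, so it is unaffected by monotone rescaling, whereas Pearson correlation is not. The Holm example also shows that p-values of 0.03 and 0.04, which would pass an uncorrected 0.05 threshold, are no longer significant once the correction is applied.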
We also find that some of these strong correlations, such as with SE-SOMO and SE-CoordInv, are almost entirely driven by variation in the degree of negative transfer, rather than any positive transfer. Intuitively, fine-tuning RoBERTa on an intermediate task can cause the model to forget some of its ability to perform the MLM task.
Figure 3: Correlations between probing and target task performances. Each cell contains the Spearman correlation between probing-task and target-task performances across training on different intermediate tasks and random restarts. We test for statistical significance at p = 0.05 with Holm-Bonferroni correction, and omit the correlations that are not statistically significant.
ing is critical to successful intermediate-task transfer.
The remaining SentEval probing tasks have similar delta values (Figure 2), which may indicate that there is insufficient variation among transfer performance to derive significant correlations. Among the edge-probing tasks, the more semantic tasks such as coreference (EP-Coref and EP-DPR), semantic proto-role labeling (EP-SPR1 and EP-SPR2), and relation classification (EP-Rel) show the highest correlations with our target tasks. As our set of target tasks is also oriented towards semantics and reasoning, this is to be expected.
On the other hand, among the target tasks, we find that ReCoRD, CommonsenseQA and Cosmos QA, all commonsense-oriented tasks, exhibit both high correlations with each other as well as a similar set of correlations with the probing tasks. Similarly, BoolQ, MultiRC, and RTE correlate strongly with each other and have similar patterns of probing-task performance.

Related Work
Within the paradigm of training large pretrained Transformer language representations via intermediate-stage training before fine-tuning on a target task, positive transfer has been shown in both sequential task-to-task (Phang et al., 2018) and multi-task-to-task (Raffel et al., 2019) formats. Wang et al. (2019a) perform an extensive study on transfer with BERT, finding language modeling and NLI tasks to be among the most beneficial tasks for improving target-task performance. Talmor and Berant (2019) perform a similar cross-task transfer study on reading comprehension datasets, finding similar positive transfer in most cases, with the biggest gains stemming from a combination of multiple QA datasets. Our work covers a larger, more diverse set of intermediate-target task pairs. We also use probing tasks to shed light on the skills learned from the intermediate tasks.
Among the prior work on predicting transfer performance, Bingel and Søgaard (2017) is the most similar to ours. They do a regression analysis that predicts target-task performance on the basis of various features of the source and target tasks and task pairs. They focus on a multi-task training setting without self-supervised pretraining, as opposed to our single-intermediate task, three-step procedure.
Similar work (Lin et al., 2019b) has been done on cross-lingual transfer, the analogous challenge of transferring learned knowledge from a high-resource to a low-resource language.
Many recent works have attempted to understand the knowledge and linguistic skills BERT learns, for instance by analyzing the language model surprisal for subject-verb agreements (Goldberg, 2018), identifying specific knowledge or phenomena encapsulated in the representations learned by BERT using probing tasks (Tenney et al., 2019b,a;Warstadt et al., 2019a;Lin et al., 2019a;Hewitt and Manning, 2019;Jawahar et al., 2019), analyzing the attention heads of BERT (Clark et al., 2019b;Coenen et al., 2019;Lin et al., 2019a;Htut et al., 2019), and testing the linguistic generalizations of BERT across runs (McCoy et al., 2019). However, relatively little work has been done to analyze fine-tuned BERT-style models (Wang et al., 2019a;Warstadt et al., 2019a).

Conclusion and Future Work
This paper presents a large-scale study on when and why intermediate-task training works with pretrained models. We perform experiments on RoBERTa with a total of 110 pairs of intermediate and target tasks, and perform an analysis using 25 probing tasks, covering different semantic and syntactic phenomena. Most directly, we observe that tasks like Cosmos QA and HellaSwag, which require complex reasoning and inference, tend to work best as intermediate tasks.
Looking to our probing analysis, intermediate tasks that help RoBERTa improve across the board show the most positive transfer in downstream tasks. However, it is difficult to draw definite conclusions about the specific skills that drive positive transfer. Intermediate-task training may help improve the handling of syntax, but there is little to no correlation between target-task and probing-task performance for these skills. Probes for higher-level semantic abilities tend to have a higher correlation with the target-task performance, but these results are too diffuse to yield more specific conclusions. Future work in this area would benefit greatly from improvements to both the breadth and depth of available probing tasks.
We also observe a worryingly high correlation between target-task performance and the two probing tasks which most closely resemble RoBERTa's masked language modeling pretraining objective. Thus, the results of our intermediate-task training analysis may be driven in part by forgetting of knowledge acquired during pretraining. Our results therefore suggest a need for further work on efficient transfer learning mechanisms.

A Correlation Between Probing and Target Task Performance
Figure 4 shows the correlation matrix using Spearman correlation and Figure 5 shows the matrix using Pearson correlation.

B Effect of Intermediate Task Size on Target Task Performance
Figure 6 shows the effect of intermediate-task dataset size on downstream target task performance for five intermediate tasks, which were picked to maximize the variety of original intermediate task sizes and effectiveness in transfer learning. For each subfigure, we fine-tune RoBERTa on a range of dataset sizes (sampled randomly from the dataset). We report the macro-average of each target task's performance metrics after fine-tuning on each dataset size split.
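The two operations mentioned here, random subsampling of the intermediate-task data and macro-averaging of target-task metrics, can be sketched as below. The interpretation of the macro-average (average each task's metrics first, then average across tasks) is an assumption, as are the function names, the fixed seed, and the made-up scores.

```python
import random
from statistics import mean


def subsample(examples, size, seed=0):
    """Draw `size` examples uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(examples, size)


def macro_average(task_metrics):
    """Average each task's metrics first, then average across tasks.

    `task_metrics` maps a task name to the list of its metric values, so a
    task reported with two metrics counts as much as a task with one.
    """
    per_task = [mean(values) for values in task_metrics.values()]
    return mean(per_task)


# Made-up scores for illustration only.
scores = {
    "TaskA": [80.0],        # single metric, e.g. accuracy
    "TaskB": [60.0, 70.0],  # two metrics, averaged to 65.0 first
}
print(macro_average(scores))  # 72.5

subset = subsample(list(range(1000)), 100)
print(len(subset))  # 100
```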