Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks

Pre-trained transformer models have shown enormous success in improving performance on several downstream tasks. However, fine-tuning on a new task still requires large amounts of task-specific labeled data to achieve good performance. We consider this problem of learning to generalize to new tasks, with a few examples, as a meta-learning problem. While meta-learning has shown tremendous progress in recent years, its application is still limited to simulated problems or problems with limited diversity across tasks. We develop a novel method, LEOPARD, which enables optimization-based meta-learning across tasks with a different number of classes, and evaluate different methods on generalization to diverse NLP classification tasks. LEOPARD is trained with the state-of-the-art transformer architecture and shows better generalization to tasks not seen at all during training, with as few as 4 examples per label. Across 17 NLP tasks, including diverse domains of entity typing, natural language inference, sentiment analysis, and several other text classification tasks, we show that LEOPARD learns better initial parameters for few-shot learning than self-supervised pre-training or multi-task training, outperforming many strong baselines, for example, yielding 14.6% average relative gain in accuracy on unseen tasks with only 4 examples per label.


Introduction
Learning to learn (Schmidhuber, 1987; Bengio et al., 1992; Thrun and Pratt, 2012) from limited supervision is an important problem with widespread application in areas where obtaining labeled data for training large models can be difficult or expensive. We consider this problem of k-shot learning for natural language processing (NLP) tasks: given k labeled examples of a new NLP task, learn to efficiently solve the new task. Recently, self-supervised pre-training of transformer models using language modeling objectives (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019) has achieved tremendous success in learning general-purpose parameters which are useful for a variety of downstream NLP tasks. While pre-training is beneficial, it is not optimized for fine-tuning with limited supervision, and such models still require large amounts of task-specific data in order to achieve good performance (Yogatama et al., 2019).
On the other hand, meta-learning methods have been proposed as effective solutions for few-shot learning. Existing applications of such methods have shown improved performance in few-shot learning for vision tasks, such as learning to classify new image classes within a similar dataset. However, these applications are often limited to simulated datasets where each classification label is considered a task, and their application in NLP has followed a similar trend (Han et al., 2018; Mi et al., 2019; Geng et al., 2019). Since the input space of natural language is shared across all NLP tasks, it is possible for a meta-learning approach to generalize to unseen tasks. We thus move beyond simulated tasks to investigate meta-learning performance on generalization outside the training tasks, and focus on a diverse task set with different numbers of labels across tasks.
Model-agnostic meta-learning (MAML) (Finn et al., 2017) is an optimization-based approach to meta-learning which is agnostic to the model architecture and task specification. Hence, it is an ideal candidate for learning to learn from diverse tasks. However, it requires sharing model parameters, including the softmax classification layer, across tasks, and it learns a single initialization point for all tasks. This poses a barrier for learning across diverse tasks, where different tasks can have potentially disjoint label spaces. Contrary to this, multi-task learning (Caruana, 1997) naturally handles disjoint label sets while still benefiting from sharing statistical strength across tasks. However, to solve a new task, multi-task learning requires training a new classification layer for that task. On the other hand, metric-based approaches such as prototypical networks (Vinyals et al., 2016; Snell et al., 2017), being non-parametric in nature, can handle a varying number of classes. However, as the number of labeled examples increases, these methods do not adapt to leverage the larger data, and their performance can lag behind optimization-based methods.
We address these concerns and make the following contributions: (1) we introduce a MAML-based meta-learning method, LEOPARD (Learning to generate softmax parameters for diverse classification), which is coupled with a parameter generator that learns to generate task-dependent initial softmax classification parameters for any given task, enabling meta-learning across tasks with disjoint label spaces; (2) we train LEOPARD with a transformer model, BERT (Devlin et al., 2018), as the underlying neural architecture, and show that it is possible to learn better initialization parameters for few-shot learning than those obtained from self-supervised pre-training alone or pre-training followed by multi-task learning; (3) we evaluate generalization, with a few examples, to NLP tasks not seen during training and to new domains of seen tasks, including entity typing, natural language inference, sentiment classification, and various other text classification tasks; (4) we study how meta-learning, multi-task learning, and fine-tuning perform for few-shot learning of completely new tasks, analyze the merits and demerits of parameter-efficient meta-training, and study how various training tasks affect performance on target tasks. To the best of our knowledge, this is the first application of meta-learning in NLP which evaluates on test tasks that are significantly different from the training tasks, going beyond simulated classification tasks or domain-adaptation tasks (where train and test tasks are similar but from different domains).

Background
In meta-learning, we consider a meta-goal of learning across multiple tasks and assume a distribution over tasks, T_i ∼ P(T). We follow the episodic learning framework of Vinyals et al. (2016), which minimizes train-test mismatch for few-shot learning. We are given a set of M training tasks {T_1, …, T_M}, where each task instance potentially has a large amount of training data. In order to simulate k-shot learning during training, in each episode (i.e., a training step) a task T_i is sampled with a training set D_i^tr ∼ T_i, consisting of only k examples (per label) of the task, and a validation set D_i^val ∼ T_i, containing several other examples of the same task. The model f is trained on D_i^tr using the task loss L_i, and then evaluated on D_i^val. The loss on D_i^val is then used to adjust the model parameters; the validation error of the tasks serves as the training error for the meta-learning process. At the end of training, the model is evaluated on a new task T_{M+1} ∼ P(T), where again the train set of T_{M+1} contains only k examples per label, and the model can use its learning procedure to adapt to T_{M+1} using this train set. We next discuss model-agnostic meta-learning, which is pertinent to our work.
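The episodic k-shot sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code; `sample_episode` and the toy task pool are hypothetical names:

```python
import random

def sample_episode(task_data, k, n_val, rng):
    """Sample a k-shot support set (k examples per label) and a disjoint
    validation set from one task's labeled pool.
    task_data: dict mapping label -> list of examples."""
    support, remainder = [], []
    for label, examples in task_data.items():
        pool = list(examples)
        rng.shuffle(pool)
        support += [(x, label) for x in pool[:k]]
        remainder += [(x, label) for x in pool[k:]]
    rng.shuffle(remainder)
    return support, remainder[:n_val]

rng = random.Random(0)
task = {"pos": [f"p{i}" for i in range(50)], "neg": [f"n{i}" for i in range(50)]}
support, val = sample_episode(task, k=4, n_val=16, rng=rng)
```

During meta-training this sampling is repeated each episode; at meta-test time the same routine yields the k-shot train set for the new task.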
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is an approach to optimization-based meta-learning where the goal is to find a good initial point for model parameters θ which, through a few steps of gradient descent, can be adapted to yield good performance on a new task. Learning in MAML consists of an inner loop, which applies gradient-based learning on the task-specific objective, and an outer loop which refines the initial point across tasks in order to enable fast learning. Given a task T_i with training set D_i^tr sampled during an episode, MAML's inner loop adapts the parameters θ as:

φ_i = θ − α ∇_θ L_i(θ, D_i^tr)    (1)

Typically, more than one step of gradient update is applied sequentially. The learning rate α can also be meta-learned in the outer loop (Li et al., 2017). The parameters θ are then trained by back-propagating through the inner-loop adaptation, with the meta-objective of minimizing the error across the respective task validation sets D_i^val:

θ ← θ − β ∇_θ Σ_{T_i ∼ P(T)} L_i(φ_i, D_i^val)    (2)

Note that even though MAML is trained to generate a good initialization point for few-shot adaptation, since the inner loop employs gradient-based learning, its performance can approach supervised learning in the limit of large data.
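The inner/outer structure of equations 1 and 2 can be illustrated on a toy family of linear-regression tasks, using the first-order approximation of Finn et al. (2017) (the validation gradient at the adapted point stands in for the full meta-gradient). This is a sketch only; `fomaml_step` and the step sizes are illustrative choices, not the paper's configuration:

```python
import numpy as np

def loss_grad(w, X, y):
    # squared-error loss 0.5*mean(r^2) and its gradient w.r.t. w
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def fomaml_step(theta, tasks, alpha=0.1, beta=0.05, inner_steps=3):
    """One outer-loop step of first-order MAML: adapt a copy of theta
    per task on its train split (inner loop, eq. 1), then move theta
    against the averaged validation gradient taken at the adapted
    points (first-order stand-in for eq. 2)."""
    meta_grad = np.zeros_like(theta)
    for X_tr, y_tr, X_val, y_val in tasks:
        phi = theta.copy()
        for _ in range(inner_steps):              # inner loop on D_tr
            _, g = loss_grad(phi, X_tr, y_tr)
            phi -= alpha * g
        _, g_val = loss_grad(phi, X_val, y_val)   # evaluated on D_val
        meta_grad += g_val                        # first-order approximation
    return theta - beta * meta_grad / len(tasks)
```

Iterating `fomaml_step` drives θ toward an initialization from which a few inner steps fit each task well, rather than toward any single task's optimum.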
Model

Figure 1: The proposed LEOPARD model. Input is first encoded using the Transformer. The first batch from the support set is passed through the parameter generator, which learns a per-class set representation that is used to generate the initial softmax parameters. Subsequently, the support batches are used for adaptation of the generated parameters as well as the encoder parameters. The pink dashed outline shows modules that are adapted in the inner loop, whereas blue boxes are optimized in the outer loop.
In this section, we describe our proposed method, LEOPARD, for learning new NLP classification tasks with k examples. Fig. 1 shows a high-level description of the model. Our approach builds on the MAML framework and addresses some of its limitations when applied to a diverse set of tasks with different numbers of classes. Our model consists of three main components: (1) a shared neural input encoder which generates feature representations useful across tasks; (2) a softmax parameter generator, conditioned on the training dataset for an N-way task, which generates the initial softmax parameters for the task; (3) a MAML-based adaptation method with a distinction between task-specific parameters, which are adapted per task, and task-agnostic parameters, which are shared across tasks, which can lead to parameter-efficient fine-tuning of large models. The full training algorithm is shown in Alg. 1.

Text Encoder
The input consists of natural language sentences, so our models take sequences of words as input. Note that some tasks require classifying pairs of sentences (such as natural language inference) or phrases in a sentence (such as entity typing); we discuss how these can also be encoded as a sequence in Section 4.1. We use a Transformer model (Vaswani et al., 2017) as our text encoder, which has shown success on many NLP tasks. Concretely, we follow Devlin et al. (2018) and use their BERT-base model architecture. We denote the Transformer model by f_θ, with parameters θ = {θ_1, …, θ_12}, where θ_v are the parameters of layer v. The Transformer takes a sequence of words x_j = [x_j1, …, x_jt] as input (t being the sequence length) and outputs d-dimensional contextualized representations at the final layer of multi-head self-attention. BERT adds a special CLS token (Devlin et al., 2018) to the start of every input, which can be used as a sentence representation. We thus use the final-layer output at the CLS position, f_θ(x_j) ∈ R^d, as the fixed-dimensional input feature representation of the sentence.
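The encoder interface above can be sketched with a toy stand-in for f_θ. This is purely illustrative of the CLS convention (prepend a special token, read off position 0), not the BERT architecture; `encode_with_cls` and the mean-mixing step are hypothetical:

```python
import numpy as np

def encode_with_cls(tokens, embed):
    """Toy stand-in for the encoder f_theta: prepend a [CLS] token and
    return the final-layer vector at position 0 as the sentence
    representation. `embed` maps token -> d-dim vector; a single
    mean-mixing step stands in for the self-attention layers."""
    seq = ["[CLS]"] + list(tokens)
    H = np.stack([embed[t] for t in seq])      # (t+1, d) token states
    H = H + H.mean(axis=0, keepdims=True)      # crude contextual mixing
    return H[0]                                # CLS position = sentence vector
```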

Generating Softmax Parameters for Task-specific Classification
Existing applications of MAML consider few-shot learning with a fixed N, i.e., the number of classes. This limits applicability to multiple types of tasks, each of which can require a different number of classes for classification. To remedy this, we introduce a method to generate task-dependent softmax parameters (both linear weights and bias). Given the training data D_i^tr = {(x_j, y_j)} for a task T_i in an episode, we pool the encoder representations of the k examples of each class n ∈ {1, …, N_i} and generate that class's softmax parameters from the pooled representation:

w_i^n, b_i^n = g_ψ( (1/k) Σ_{j : y_j = n} f_θ(x_j) )    (3)

W_i = [w_i^1; …; w_i^{N_i}],  b_i = [b_i^1, …, b_i^{N_i}]    (4)

where g_ψ(·) is an MLP with parameters ψ that outputs a weight vector w_i^n ∈ R^l and a scalar bias b_i^n. Thus, the softmax classification weights W_i ∈ R^{N_i×l} and bias b_i ∈ R^{N_i} for task T_i are obtained by row-wise concatenation of the per-class parameters from equation 3. Note that the generator g_ψ(·) is shared across tasks in different episodes. Now, given the softmax parameters, the prediction for a new data point x* is given as:

p(y | x*) = softmax( W_i h_φ(f_θ(x*)) + b_i )    (5)

where h_φ(·) is another MLP with parameters φ and output dimension l, and the softmax is over the set of classes N_i for the task. Note that if we use x* ∈ D_i^val, then the model is a form of prototypical network (Snell et al., 2017) with a learned distance function; however, this would limit the model to not adapt its parameters with increasing data. We next discuss how we learn to adapt using the generated softmax. It is important to note that we do not introduce any task-specific parameters, unlike multi-task learning (Caruana, 1997), which would require a new softmax layer for each task; instead, the existing parameters are used to generate a good starting point for the softmax parameters of any task, which can then be adapted using stochastic gradient descent (SGD) based learning.

Algorithm 1 (LEOPARD) summarizes training. Require: a set of M training tasks and losses {(T_1, L_1), …, (T_M, L_M)}, model parameters Θ = {θ, ψ, α}, and hyper-parameters ν, G, β. θ is initialized with pre-trained BERT-base, and training loops until convergence, applying the episodic inner- and outer-loop updates described in this section.
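The generation and prediction steps (equations 3–5) can be sketched with linear maps standing in for the MLPs g_ψ and h_φ. This is a minimal sketch under those simplifying assumptions; the array names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, l = 16, 8                               # encoder dim d, projection dim l
G_w = rng.normal(scale=0.1, size=(d, l))   # linear stand-ins for the MLP g_psi
G_b = rng.normal(scale=0.1, size=(d,))
H_w = rng.normal(scale=0.1, size=(d, l))   # linear stand-in for h_phi

def generate_softmax(features, labels):
    """Generate initial softmax parameters from the first support batch:
    mean-pool encoder features per class (eq. 3), map the pooled vector
    to that class's weight row and bias, and stack rows (eq. 4)."""
    y = np.array(labels)
    classes = sorted(set(labels))
    pooled = [features[y == c].mean(axis=0) for c in classes]
    W = np.stack([p @ G_w for p in pooled])        # (N, l)
    b = np.array([p @ G_b for p in pooled])        # (N,)
    return W, b

def predict(x_feat, W, b):
    """Prediction for a new point (eq. 5): project with h_phi, then
    apply the generated linear layer and a softmax over the N classes."""
    logits = W @ (x_feat @ H_w) + b
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Because W and b are produced from the support set rather than stored per task, any number of classes N_i can be handled with the same parameters ψ.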

Learning to Adapt Efficiently
Given the task-specific classification loss computed at an episode, MAML takes multiple steps of SGD on the same training set D_i^tr, as in equation 1. We apply MAML on the model parameters, including the generated softmax parameters. However, the number of parameters in BERT is substantially high (∼110 million) and it can be beneficial to adapt a smaller number of parameters (Houlsby et al., 2019; Zintgraf et al., 2019). We thus separate the set of parameters into task-specific and task-agnostic. For the per-layer transformer parameters {θ_v}, we consider a threshold ν over layers, take θ_{≤ν} = {θ_v | v ≤ ν} to be the parameters of the first ν layers (closest to the input), and denote the rest as θ_{>ν}. We then consider θ_{≤ν} and the parameters ψ of the softmax-generating function (equation 3) as the set of task-agnostic parameters:

Θ = θ_{≤ν} ∪ {ψ}

These task-agnostic parameters need to generalize to produce good feature representations and a good initial point for the classification layer across tasks. The remaining parameters, the higher layers of the transformer, the input projection function h_φ (equation 5), and the softmax weights and bias generated in equation 4, form the set of task-specific parameters:

Φ_i = θ_{>ν} ∪ {φ} ∪ {W_i, b_i}

The task-specific parameters are adapted for each task using SGD, as in equation 1. Note that MAML usually takes gradient descent steps on the same meta-train batch D_i^tr for a task in an episode. However, since we use D_i^tr to generate the softmax parameters in equation 3, using the same data to also take multiple gradient steps can lead to over-fitting. Thus, we instead sample G > 1 meta-train batches in each episode of training and use the subsequent batches (after the first batch) for adaptation. Task-specific adaptation in the inner loop does G steps of the following update, starting with Φ_i^(0) = Φ_i:

Φ_i^(s) = Φ_i^(s−1) − α ∇_{Φ} L_i({Θ, Φ_i^(s−1)}, D_i^tr(s)),  s = 1, …, G

Note that we only take the gradient with respect to the task-specific parameters Φ_i; however, the updated parameters are also a function of Θ.
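The parameter partition by the layer threshold ν can be sketched as follows; a minimal sketch with hypothetical names (`split_parameters` and the dictionary layout are not from the paper):

```python
def split_parameters(layer_params, psi, phi_proj, softmax_params, nu):
    """Partition parameters as in the text: layers 1..nu plus the softmax
    generator psi are task-agnostic (updated only in the outer loop);
    layers above nu, the projection h_phi, and the generated softmax
    parameters are task-specific (adapted per task in the inner loop)."""
    task_agnostic = {f"layer_{v}": p for v, p in layer_params.items() if v <= nu}
    task_agnostic["psi"] = psi
    task_specific = {f"layer_{v}": p for v, p in layer_params.items() if v > nu}
    task_specific["phi"] = phi_proj
    task_specific["softmax"] = softmax_params
    return task_agnostic, task_specific
```

Only the task-specific dictionary is touched by the G inner-loop steps; the task-agnostic one changes only through the outer-loop meta-update.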
After the G steps of adaptation, the final point (which consists of parameters Θ and Φ_i^(G)) is evaluated on the validation set for the task, D_i^val, and the task-agnostic parameters Θ are updated (as in equation 2) to adjust the initial point across tasks. Note that optimizing the task-agnostic parameters requires back-propagating through the inner-loop gradient steps, which involves computing higher-order gradients. Finn et al. (2017) proposed a first-order approximation for computational efficiency. We use this approximation in this work; however, we note that the distinction between task-specific and task-agnostic parameters can make higher-order gradients affordable when there are few task-specific parameters (for example, only the last layer).
Other Technical Details: For few-shot learning, the learning rate can be an important hyper-parameter, and the above approach can benefit from also learning the learning rate for adaptation (Li et al., 2017). Instead of a scalar inner-loop learning rate, it has been shown beneficial to learn per-parameter learning rates (Li et al., 2017; Antoniou et al., 2018); however, this doubles the number of parameters and can be inefficient. Instead, we learn a per-layer learning rate for the inner loop to allow different transformer layers to adapt at different rates. We apply layer normalization across the layers of the transformer (Vaswani et al., 2017; Ba et al., 2016) and also adapt its parameters in the inner loop. The number of layers to treat as task-specific, via the threshold ν, is a hyper-parameter. We initialize the meta-training of LEOPARD from the pre-trained BERT model, which stabilizes training.
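The per-layer learning-rate scheme can be sketched in a few lines; an illustrative sketch (`inner_update` is a hypothetical name, and real updates would act on tensors per layer):

```python
import numpy as np

def inner_update(params, grads, layer_lrs):
    """One inner-loop SGD step with learned per-layer learning rates:
    a single scalar rate per layer scales that layer's gradient, so
    layers can adapt at different speeds without keeping a learned
    rate for every individual weight."""
    return {name: p - layer_lrs[name] * grads[name] for name, p in params.items()}
```

The scalars in `layer_lrs` would themselves be meta-learned in the outer loop, in place of the fixed α of equation 1.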

Experiments
Our experiments evaluate how different methods generalize to new NLP tasks with limited supervision. We focus on sentence-level classification tasks, including natural language inference (NLI) tasks, which require classifying pairs of sentences, as well as tasks like entity typing, which require classifying a phrase in a sentence. We consider 17 target tasks. Main results are in Sec. 4.3.

Training Tasks
We use the GLUE benchmark tasks (Wang et al., 2018b) for training all the models. These tasks are considered important for general linguistic intelligence, have large amounts of supervised data for many tasks, and have been useful for transfer learning (Phang et al., 2018; Wang et al., 2018a). We consider the following tasks for training: MNLI (m/mm), SST-2, QNLI, QQP, MRPC, RTE, and the SNLI dataset (Bowman et al., 2015). We use the corresponding validation sets for hyper-parameter tuning and early stopping. For meta-learning methods, we classify between every pair of labels (for tasks with more than 2 labels), which increases the number of tasks and allows for more per-label examples in a batch during training. Moreover, to learn to do phrase-level classification, we modify SST (for all models), which is a phrase-level sentiment classification task, by providing the sentence in which the phrase occurs as part of the input: the input is the sentence, followed by a separator token (Devlin et al., 2018), followed by the phrase to classify. See Appendix A for more details.
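Both constructions above, splitting a multi-label task into binary label-pair tasks and encoding phrase-level inputs as a single sequence, can be sketched as follows (a minimal sketch; the function names are hypothetical):

```python
from itertools import combinations

def pairwise_tasks(examples):
    """Split an N-way training task into one binary task per label pair,
    as done for meta-training tasks with more than two labels."""
    labels = sorted({y for _, y in examples})
    return {
        (a, b): [(x, y) for x, y in examples if y in (a, b)]
        for a, b in combinations(labels, 2)
    }

def phrase_input(sentence, phrase, sep="[SEP]"):
    """Encode phrase-level classification as one sequence: the full
    sentence, a separator token, then the phrase to classify."""
    return f"{sentence} {sep} {phrase}"
```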

Evaluation and Baselines
Unlike existing methods which evaluate meta-learning models on tasks sampled from a single fixed dataset (Vinyals et al., 2016; Finn et al., 2017), we evaluate methods on real NLP datasets by using the entire test set of each target task after fine-tuning on a sampled k-shot training set. The model parameters are trained on the set of training tasks and then fine-tuned with k training examples per label for a target test task. The fine-tuned models are then evaluated on the entire test set for the task. We evaluate k ∈ {4, 8, 16}. For each task and every k, we sample 10 training datasets and report the mean and standard deviation, since model performance can be sensitive to the k examples chosen for training. In the few-shot setting it can be unreasonable to assume access to a large validation set (Kann et al., 2019), so for the fine-tuning step we tuned the hyper-parameters of all baselines on held-out validation tasks. We used SciTail, a scientific NLI task, and the electronics domain of the Amazon sentiment classification task as the validation tasks, and took the hyper-parameters that gave the best average performance on the validation data of these tasks for each value of k. For LEOPARD, we only tune the number of epochs for fine-tuning, use the learned per-layer learning rates, and reuse the remaining hyper-parameters (see Appendix C).
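The evaluation protocol, 10 sampled k-shot train sets per task with mean and standard deviation over runs, can be sketched as follows. A sketch only: `train_fn`, `eval_fn`, and `task_pool` are caller-supplied stand-ins, not the paper's interfaces:

```python
import statistics

def evaluate_few_shot(train_fn, eval_fn, task_pool, k, n_runs=10, seed=0):
    """Protocol sketch: for each of n_runs, sample a fresh k-shot
    training set, fine-tune from the (meta-)trained initialization,
    and score on the full test set; report mean and stdev over runs."""
    scores = []
    for run in range(n_runs):
        support = task_pool.sample_support(k, seed=seed + run)
        model = train_fn(support)                 # fine-tune on k-shot data
        scores.append(eval_fn(model, task_pool.test_set))  # full test set
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting variability over the sampled train sets matters here, since which k examples are drawn can swing the test accuracy substantially.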
We evaluate multiple transfer-learning baselines as well as a meta-learning baseline. Note that most existing applications of few-shot learning are tailored towards specific tasks and do not trivially apply to the diverse tasks considered here. We evaluate the following methods:

BERT-base: We use the cased BERT-base model (Devlin et al., 2018), a state-of-the-art transformer (Vaswani et al., 2017) model for NLP. BERT uses language-model pre-training followed by supervised fine-tuning on a downstream task. For fine-tuning, we tune all parameters, as this performed better on the validation task.

Multi-task BERT (MT-BERT): The BERT-base model trained in a multi-task learning setting on the set of training tasks. Our MT-BERT is comparable to the MT-DNN model of Liu et al. (2019) when trained on the tasks considered here and initialized with cased BERT-base; we did not use the specialized stochastic answer network for NLI used by MT-DNN. For this model, we tune all parameters during fine-tuning.

MT-BERT-softmax: The multi-task BERT model above, where we only tune the softmax layer during fine-tuning.

Prototypical BERT (Proto-BERT): The prototypical network method (Snell et al., 2017) with BERT-base as the underlying neural model. Following Snell et al. (2017), we use Euclidean distance as the distance metric.

All methods are initialized with pre-trained BERT. All parameters of MT-BERT and Proto-BERT are also tuned during training. We do not compare with MAML (Finn et al., 2017), as it does not trivially support a varying number of classes, and we show in ablations (Sec. 4.4) that workarounds such as a zero-initialized softmax perform worse.

Implementation Details: Dataset sizes can be imbalanced, which affects multi-task and meta-learning performance; Wang et al. (2018a) analyze this in detail for multi-task learning.
We explored sampling tasks with uniform probability, proportional to dataset size, and proportional to the square root of the size. For all models, we found square-root sampling to be beneficial. All methods are trained on 4 GPUs to benefit from large batches. The best hyper-parameters, search ranges, and data statistics are in the Appendix.

Results
We evaluate all the models on 17 target NLP tasks. None of the target task data is observed during the training of the models; the models are fine-tuned on a few examples of the target task and then evaluated on the entire test set for that task. For k-shot learning of tasks not seen at all during training, we observe, on average, relative gains in accuracy of 14.60%, 10.83%, and 11.16% for k = 4, 8, 16, respectively. We use the following datasets (more details in the Appendix): (1) entity typing: CoNLL-2003 (Sang and De Meulder, 2003) and MIT-Restaurant (Liu et al., 2013); (2) rating classification: the review ratings for each domain from the Amazon Reviews dataset (Blitzer et al., 2007), treated as 3-way classification based on the ratings; (3) text classification: social-media datasets from CrowdFlower. Table 1 shows the performance. On average, LEOPARD outperforms all the baselines, yielding significant improvements in accuracy. This shows LEOPARD's robustness to varying numbers of labels across tasks and across different text domains. Note that LEOPARD uses the same training tasks as MT-BERT but can adapt to new tasks with fewer examples, and improvements are highest with only 4 examples. Performance of prototypical networks is worse than most other fine-tuning methods on these new tasks. We hypothesize that this is because prototypical networks do not generate good class prototypes for new tasks, and adaptation of the class prototypes is important for improving performance. We also see that the improved feature learning in MT-BERT from additional training tasks serves as a better initialization point for held-out tasks than BERT, and that only tuning the softmax layer of this model is slightly better than tuning all parameters. Interestingly, on some tasks such as Disaster classification, we observe BERT to perform better than the other models, indicating negative transfer from the training tasks.
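The reported average relative gains are computed per task against a baseline's accuracy and then averaged; a sketch of that arithmetic (the function name and sample numbers are illustrative only, not results from the paper):

```python
def avg_relative_gain(model_acc, baseline_acc):
    """Average relative accuracy gain (in %) over tasks:
    mean over tasks of (model - baseline) / baseline."""
    gains = [(m - b) / b for m, b in zip(model_acc, baseline_acc)]
    return 100.0 * sum(gains) / len(gains)
```

For example, accuracies of 0.60 and 0.55 against baselines of 0.50 and 0.50 give per-task gains of 20% and 10%, i.e. a 15% average relative gain.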

Few-Shot Domain Transfer
We now evaluate performance on new domains of tasks seen at training time. For this, we consider two tasks: sentiment classification and NLI. For sentiment classification we use 4 domains of Amazon reviews (Blitzer et al., 2007), and for NLI we use a scientific entailment dataset (SciTail). We introduce another relevant baseline here, MT-BERT-reuse, which reuses the trained softmax parameters of a related training task. Results are summarized in Table 2; we show two domains of sentiment classification, and more results are in Appendix B. Note that the related training task, SST, only contains phrase-level sentiments and the models were not trained to predict sentence-level sentiment, while the target tasks require sentence-level sentiment. We observe that LEOPARD performs better than the baselines on all domains of sentiment classification, while on SciTail the MT-BERT models perform better, potentially because training consisted of many related NLI datasets. Note that prototypical networks are a competitive baseline here, and their performance is better on these tasks than on those in Table 1, as they have learned to generate prototypes for a similar task during training.

Ablation Study
For ablations we use the dev sets of 3 tasks: CoNLL-2003 entity typing, Amazon reviews DVD-domain sentiment classification, and SciTail NLI.

Importance of softmax parameters: Since softmax generation is an important component of LEOPARD, we study how it affects performance. We remove the softmax generator and instead add a softmax weight and bias with zero initialization for each task. The model is trained in the same way as LEOPARD. This method, termed LEOPARD-ZERO, is a naive application of MAML to this problem. Table 3 shows that it performs worse on new tasks, highlighting the importance of the softmax generator.

Table 3: Ablations: LEOPARD-ν does not adapt layers 0−ν (inclusive) in the inner loop (and fine-tuning), while LEOPARD adapts all parameters. Note that the outer loop still optimizes all parameters. For new tasks (like entity typing) adapting all parameters is better, while for tasks seen at training time (like NLI) adapting fewer parameters is better. LEOPARD-ZERO is a model trained without the softmax generator and with a zero-initialized softmax classifier, which shows the importance of the softmax generator in LEOPARD.

Figure 2: Analyzing target task performance as a function of training tasks (best viewed in color). Each column represents one held-out training task (name on x-axis) and each row corresponds to one target task (name on y-axis). Each cell is the relative change in performance on the target task when the corresponding training task is held out, compared to training on all the train tasks. Dark blue indicates a large drop, dark red a large increase, and grey close to no change in performance. In general, LEOPARD's performance is more consistent than MT-BERT's, indicating that meta-training learns more generalized initial parameters than multi-task training.
Parameter efficiency: We consider three variants of LEOPARD with the parameter-efficient training discussed in Sec. 3.3. Denote by LEOPARD-ν the model which does not adapt layers 0 to ν (including the word embeddings) in the inner loop of meta-training. Note that even for ν = 0, the parameters are still optimized in the outer loop. Table 3 shows the results. Interestingly, for all tasks except NLI we find that adapting all parameters is better, potentially because the per-layer learning rates in LEOPARD already adjust the adaptation rate of each layer. On SciTail (NLI) we observe the opposite behaviour, suggesting that adapting fewer parameters is better for small k, potentially because training consisted of multiple NLI datasets.

Importance of training tasks: We study how the target-task performance of MT-BERT and LEOPARD depends on the tasks used for training. For this experiment, we held out each training task one by one and trained both models. The trained models are then evaluated on the target tasks (using the development sets), following the same protocol as before. Fig. 2 visualizes the relative change in performance when each training task is held out. We see that LEOPARD's performance is more consistent with respect to variation in training tasks, owing to the meta-training procedure that finds an initial point that performs equally well across tasks. Removing a task often leads to a decrease in performance for LEOPARD, as it decreases the number of meta-training tasks and leads to over-fitting to the training task distribution. In contrast, MT-BERT's performance on target tasks varies greatly depending on the held-in training tasks.
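The quantity plotted in each heatmap cell of Fig. 2 is a simple relative change; a sketch of that computation (the function name and the sample scores are illustrative, not values from the paper):

```python
def heldout_relative_change(full_score, heldout_scores):
    """Relative change on one target task when each training task is
    held out (one heatmap cell per training task):
    (held-out score - full-training score) / full-training score."""
    return {task: (s - full_score) / full_score
            for task, s in heldout_scores.items()}
```

Negative values (blue cells) mean the held-out training task was helping the target task; positive values (red cells) suggest it was causing negative transfer.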

Related Work
Meta-learning approaches can be broadly classified as optimization-based (Finn et al., 2017; Al-Shedivat et al., 2018; Nichol and Schulman, 2018; Rusu et al., 2019), model-based (Santoro et al., 2016; Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017), and metric-based (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018). We refer to Finn (2018) for an exhaustive review. Recently, it has been shown that learning task-dependent model parameters improves few-shot learning (Rusu et al., 2019; Zintgraf et al., 2019). While existing methods train and evaluate on simulated datasets with limited diversity, there is recent interest in more realistic meta-learning applications (Triantafillou et al., 2019), and our work significantly advances this direction by training and evaluating on diverse, real NLP tasks. Meta-learning applications in NLP have yielded improvements on specific tasks. Gu et al. (2018) used MAML to simulate low-resource machine translation; another line of work learns a HyperLSTM (Ha et al., 2016) model in a multi-task setting across various sentiment classification domains; and other recent approaches (Han et al., 2018; Obamuyide and Vlachos, 2019; Geng et al., 2019; Mi et al., 2019; Bao et al., 2020) meta-train for a specific classification task, such as relation classification, and do not generalize beyond the training task. Dou et al. (2019) train on a subset of GLUE tasks to generalize to other GLUE tasks; their approach does not consider unseen tasks. Transfer learning is a closely related research area. Self-supervised pre-training has been shown to learn general-purpose model parameters that improve downstream performance with fine-tuning (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2019). Fine-tuning, however, typically requires large training data (Yogatama et al., 2019).
Multi-task learning with BERT has been shown to improve performance on many related tasks (Phang et al., 2018; Wang et al., 2018a). We refer the reader to Ruder (2019) for a more thorough discussion of transfer learning and multi-task learning.

Conclusions
Learning general linguistic intelligence has been a long-term goal of NLP. While humans, with all their prior knowledge, can quickly learn to solve new tasks with very few examples, machine-learned models still struggle to demonstrate such intelligence. To this end, we proposed LEOPARD, a meta-learning approach, and found that it learns more general-purpose parameters that better prime the model to solve completely new tasks with few examples. While we see improvements using meta-learning, performance with few examples still lags behind human-level performance. We consider bridging this gap a worthwhile goal for demonstrating general linguistic intelligence, and meta-learning a strong contender for achieving it.

4. Text Classification: We use multiple text classification datasets from CrowdFlower. These involve classifying sentiments of tweets towards an airline, classifying whether a tweet refers to a disaster event, classifying the emotional content of text, and classifying the audience/bias/message of social-media messages from politicians. These tasks are quite different from the training tasks, both in terms of the labels and the input domain.

Table 7: Dev-set accuracy on the set of train tasks for multi-task BERT.

Table 6 shows the accuracy on all four Amazon sentiment classification tasks. Table 7 shows the dev-set accuracy of our trained MT-BERT model on the various training tasks. Figure 3 shows the target-task performance as a function of training tasks for all k; note that the effect of the training tasks starts to decrease as k increases. Table 8 shows the hyper-parameter search ranges as well as the best hyper-parameters for MT-BERT, Proto-BERT, and LEOPARD. We use the same hyper-parameters for prototypical networks, except those not relevant to them. For fine-tuning, we separately tune the number of iterations and the batch size for each k for all the baselines. We also tuned warm-up (Devlin et al., 2018) in {0, 0.1} and used 0.1 for all methods.
For MT-BERT we found 10 epochs with batch size 8 to be best for 4-shot, 5 epochs with batch size 8 for 8-shot, and 5 epochs with batch size 16 for 16-shot. For MT-BERT-softmax we found 125 epochs with batch size 4 to be best for 4-shot, 8-shot, and 16-shot alike. For BERT-base, 10 epochs with batch size 8 for 4-shot, 5 epochs with batch size 16 for 8-shot, and 10 epochs with batch size 16 for 16-shot gave the best performance. For MT-BERT-reuse we found 10 epochs with batch size 8 to be best for 4-shot, 5 epochs with batch size 8 for 8-shot, and 5 epochs with batch size 16 for 16-shot. Note that for LEOPARD we use the learned per-layer learning rates with SGD, with 150 epochs for 4-shot and 100 epochs for both 8-shot and 16-shot.

Figure 3: Heatmaps on the left are for LEOPARD and on the right for MT-BERT. Each column represents one held-out training task (name on x-axis) and each row corresponds to one target task (name on y-axis).

C Hyperparameters