Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks

Self-supervised pre-training of transformer models has revolutionized NLP applications. Such pre-training with language modeling objectives provides a useful initial point for parameters that generalize well to new tasks with fine-tuning. However, fine-tuning is still data inefficient -- when there are few labeled examples, accuracy can be low. Data efficiency can be improved by optimizing pre-training directly for future fine-tuning with few examples; this can be treated as a meta-learning problem. However, standard meta-learning techniques require many training tasks in order to generalize; unfortunately, finding a diverse set of such supervised tasks is usually difficult. This paper proposes a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text. This is achieved using a cloze-style objective, but creating separate multi-class classification tasks by gathering tokens to be masked from among only a handful of vocabulary terms. This yields as many unique meta-training tasks as the number of subsets of vocabulary terms. We meta-train a transformer model on this distribution of tasks using a recent meta-learning framework. On 17 NLP tasks, we show that this meta-training leads to better few-shot generalization than language-model pre-training followed by fine-tuning. Furthermore, we show how the self-supervised tasks can be combined with supervised tasks for meta-learning, providing substantial accuracy gains over previous supervised meta-learning.


Introduction
Self-supervised learning has emerged as an important training paradigm for learning model parameters which are more generalizable and yield better representations for many downstream tasks. This typically involves learning through labels that come naturally with data, for example words in natural language. Self-supervised tasks typically pose a supervised learning problem that can benefit from large amounts of naturally available data and enable learning model parameters that act as a useful prior for supervised fine-tuning (Erhan et al., 2010). Masked language modeling (Devlin et al., 2018), along with other related approaches (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2019), is an example of such a self-supervised task that is behind the success of transformer models like BERT.
While self-supervised pre-training is beneficial, it has recently been noted that it is not data-efficient and typically requires large amounts of fine-tuning data for good performance on a target task (Yogatama et al., 2019; Bansal et al., 2019). This can be evaluated as a few-shot learning problem, where a model is given only a few examples of a new task and is expected to perform well on that task. This paper focuses on this problem of few-shot learning and develops models which demonstrate better few-shot generalization to new tasks.
Large-scale pre-training suffers from a train-test mismatch, as the model is not optimized to learn an initial point that yields good performance when fine-tuned with few examples. Moreover, fine-tuning of a pre-trained model typically introduces new random parameters, such as softmax layers, and important hyper-parameters such as the learning rate, which are hard to estimate robustly from a few examples. Thus, we propose to remove this train-test mismatch and treat the joint learning, from unlabelled data, of an initial point and hyper-parameters that allow data-efficient fine-tuning as a meta-learning problem.
Meta-learning, or learning to learn (Thrun and Pratt, 2012; Schmidhuber, 1987), treats the learning of a parameterized algorithm, such as a neural net optimized with SGD, that generalizes to new tasks as itself a learning problem. This typically assumes access to a distribution over tasks in order to enable learning. Creating tasks which enable meta-learning is one of the main challenges for meta-learning (Bengio et al., 1992; Santoro et al., 2016; Vinyals et al., 2016), and typical supervised meta-learning approaches create task distributions from a fixed task dataset with a large number of labels by sub-sampling from the set of labels (Vinyals et al., 2016; Ravi and Larochelle, 2017). While this enables generalization to new labels, it limits generalization to unseen tasks due to over-fitting to the training task distribution (Yin et al., 2020). Moreover, large supervised datasets with a large label set are not always available for meta-learning, as is often the case in many NLP applications.
To overcome these challenges of supervised meta-learning, we propose a self-supervised approach and create the task distribution from unlabelled sentences. Taking inspiration from the cloze task (Taylor, 1953), we create separate multi-class classification tasks by gathering tokens to be masked from a subset of vocabulary words, allowing for as many unique meta-training tasks as the number of subsets of words in the language. The proposed approach, which we call Subset Masked Language Modeling Tasks (SMLMT), enables training of meta-learning methods for NLP at a much larger scale than was previously feasible, while also ameliorating the risk of over-fitting to the training task distribution. This opens up new possibilities for applications of meta-learning in NLP, such as few-shot learning, continual learning, architecture search, and more.
This work focuses on few-shot learning and makes the following contributions: (1) we introduce a self-supervised approach to create tasks for meta-learning in NLP, Subset Masked Language Modeling Tasks (SMLMT), which enables application of meta-learning algorithms for goals like few-shot learning; (2) utilizing SMLMT as the training task distribution, we train a state-of-the-art transformer architecture, BERT (Devlin et al., 2018), using a recent optimization-based meta-learning method which was developed for diverse classification tasks (Bansal et al., 2019); (3) we show that the self-supervised SMLMT can also be combined with supervised task data to enable better feature learning, while still allowing for better generalization by avoiding meta-overfitting to the supervised tasks through the use of SMLMT; (4) we rigorously evaluate the proposed approach on few-shot generalization to unseen tasks as well as new domains of tasks seen during training, and show that the proposed approach demonstrates better generalization than self-supervised pre-training or self-supervised pre-training followed by multi-task training; (5) we also study the effect of the number of parameters on few-shot learning and find that while bigger pre-trained or meta-trained models generalize better than smaller models, meta-learning leads to substantial gains even for the smaller models.

Preliminaries
In supervised meta-learning, we typically assume access to a task distribution P(T ). Practically, this translates to a fixed set of training tasks {T_1, . . . , T_M}, which are referred to as meta-training tasks. For supervised classification, each task T_i is an N_i-way classification task. While many meta-learning algorithms assume fixed N-way classification, we follow the more practical approach of Bansal et al. (2019) and allow for a diverse set of classification tasks with potentially different numbers of classes.
The goal of a meta-learning algorithm is to utilize the meta-training tasks to learn a learning procedure that generalizes to held-out tasks T ∼ P(T ). Model-agnostic meta-learning (MAML) (Finn et al., 2017) is an example of such a meta-learning algorithm. MAML learns an initial point θ for a classifier f_θ : x → ŷ that can be optimized via gradient descent on the supervised loss L_i defined for the task T_i, using its support set D^tr ∼ T_i:

θ'_i = θ − α ∇_θ L_i(θ, D^tr)    (1)

where α is the learning rate. The optimized point θ'_i is then evaluated on another validation set for the task, D^val ∼ T_i, using the loss function L_i. This loss across meta-training tasks serves as the training error to optimize the initial point and parameters like the learning rate (Θ := {θ, α}):

Θ ← Θ − β ∇_Θ Σ_{T_i} L_i(θ'_i, D^val)    (2)

where β is the learning rate for the meta-training process. Training proceeds in an episodic framework (Vinyals et al., 2016), where in each episode a mini-batch of tasks is sampled along with their support and validation sets, and the model parameters are optimized using (1) and (2), which are also referred to as the inner and outer loop, respectively.

Meta-training Tasks: We summarize how supervised task datasets are typically leveraged to create meta-training tasks (Vinyals et al., 2016). Assuming access to a supervised task with L classes, an N-way k-shot task is created by first sampling N classes, assuming N << L. Then, for each of the N sampled classes, (k + q) examples are randomly sampled from the dataset and assigned a unique label in {1, . . . , N}. The k examples for each label serve as the support set, while the q examples constitute the validation set described above. Note that each task consists of a small subset of classes and that the class-to-label (1 to N) assignment is random. This is crucial to avoid learning the sample-to-label bindings in the parameters of the model, which would make the task-specific training (in (1)) irrelevant, so the model would not generalize to new tasks. An example of this approach is MiniImageNet (Ravi and Larochelle, 2017), which is a benchmark dataset for few-shot image classification.
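A minimal sketch of this episode construction, under the assumption that the labeled dataset is given as a mapping from class labels to lists of examples (all names here are illustrative, not from the original implementation):

```python
import random

def sample_episode(dataset, n_way, k_shot, q_query):
    """Sample one N-way k-shot episode from a labeled dataset.

    dataset: dict mapping each class label to a list of examples.
    Returns support and validation sets with labels remapped to {0, ..., N-1}.
    """
    classes = random.sample(list(dataset.keys()), n_way)
    random.shuffle(classes)  # random class-to-label assignment in every episode
    support, validation = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + q_query)
        support += [(x, label) for x in examples[:k_shot]]
        validation += [(x, label) for x in examples[k_shot:]]
    return support, validation
```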

Self-supervised Tasks for Meta-learning
The existing approach of using a supervised dataset to create tasks, as described above, is fraught with issues, especially for NLP applications. First, note that large classification datasets with large label spaces are not readily available for all NLP tasks, for example sentiment classification, which has only a few discrete labels. Second, limiting task creation to a fixed supervised dataset limits generalization: the meta-learned models might generalize to new labels for the task but fail to generalize to novel tasks (Metz et al., 2019). Lastly, such an approach is also not feasible in all problems where we would like to apply meta-learning (Yin et al., 2020). For example, consider meta-learning a natural language inference (NLI) model across multiple domains which can generalize to new domains. A powerful model can ignore the training data for each task and directly learn to predict the NLI tag for the examples in each training domain, which will lead to low training error, but the model will not generalize to new domains. We overcome these issues by utilizing unlabelled data to create meta-learning tasks. See Fig. 1 for an example of a generated task.
Subset Masked Language Modeling Tasks (SMLMT): We are given a text corpus split into sentences X_i, where each sentence is a sequence of words from a vocabulary of size V. In Subset Masked Language Modeling Tasks, each task is defined by a subset of vocabulary words. To create an N-way classification task, we randomly select N unique vocabulary words, {v_1, . . . , v_N}. Then we consider all sentences containing these N words and, for each word v_n, randomly sample r = k + q such sentences:

S_{v_n} ⊂ {X | v_n ∈ X},   |S_{v_n}| = r

Now, we mask the corresponding chosen word v_n in all sentences of S_{v_n}, so that these masked sentences serve as input examples for the N-way classification task. We forget the original word corresponding to the masked tokens in these sets and assign labels in {1, . . . , N} to the N sets. This gives an instance of an SMLMT classification task,

T = {(X̄, n) | X̄ ∈ mask(S_{v_n}), n ∈ {1, . . . , N}},

which can be split into support and validation sets for meta-training.
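A minimal sketch of this task-creation procedure, assuming pre-tokenized sentences (the helper names and the simple in-memory index are illustrative assumptions, not the authors' pipeline):

```python
import random
from collections import defaultdict

MASK = "[MASK]"

def build_word_index(sentences):
    """Map each vocabulary word to the list of sentences containing it."""
    index = defaultdict(list)
    for sentence in sentences:          # each sentence is a list of word tokens
        for word in set(sentence):
            index[word].append(sentence)
    return index

def sample_smlmt_task(word_index, n_way, r):
    """Create one N-way SMLMT task: mask one chosen word in r sentences per class."""
    candidates = [w for w, sents in word_index.items() if len(sents) >= r]
    words = random.sample(candidates, n_way)
    task = []
    for label, word in enumerate(words):        # label assignment is arbitrary
        for sentence in random.sample(word_index[word], r):
            masked = [MASK if tok == word else tok for tok in sentence]
            task.append((masked, label))
    return task
```

Splitting the k support and q validation examples per label then yields one meta-training episode.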
In an SMLMT instance, each input sentence contains exactly one word that is masked throughout the sentence, and its label corresponds to that word. This requires a reasoning ability similar to cloze tasks (Taylor, 1953). Moreover, and crucially, the SMLMT task creation ensures that a model cannot memorize the input-label mapping, as the target masked word is hidden and the label assignment is randomized, requiring the model to infer the labels from the support set. Note that SMLMT is also closely related to masked language modeling (MLM) (Devlin et al., 2018). While MLM is a word-level classification task, SMLMT is a sentence-level classification task. Each unique subset of words from the vocabulary defines a unique task in SMLMT. This allows for as many unique tasks as the number of subsets of words in the vocabulary, enabling large-scale meta-learning from unsupervised data.
Hybrid SMLMT: Tasks from SMLMT can also be combined with supervised tasks to encourage better feature learning (Caruana, 1997) and increased diversity in tasks. We use a sampling ratio λ ∈ (0, 1) and in each episode select an SMLMT task with probability λ or a supervised task with probability (1 − λ). The use of SMLMT jointly with supervised tasks ameliorates meta-overfitting, as tasks in SMLMT cannot be solved without using the task support data. λ is a hyper-parameter. In our experiments, we found λ = 0.5 to work well.
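A sketch of the per-episode task-source selection described above (the sampler callables are illustrative assumptions):

```python
import random

def sample_hybrid_episode(smlmt_sampler, supervised_sampler, lam=0.5):
    """With probability lam draw an SMLMT task, otherwise draw a supervised task."""
    if random.random() < lam:
        return smlmt_sampler()
    return supervised_sampler()
```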

Meta-learning Model
We now discuss the meta-learning model for learning new NLP tasks.
Text encoder: The input to the model is natural language sentences. These are encoded using a transformer (Vaswani et al., 2017) text encoder. We follow the BERT (Devlin et al., 2018) model and use the same underlying neural architecture for the transformer as their base model. Given an input sentence, the transformer model yields contextualized token representations for each token in the input after multiple layers of self-attention. Following BERT, we add a special CLS token to the start of the input that is used as the sentence representation for classification tasks. Given an input sentence X, let f_π(X) be the CLS representation of the final layer of the transformer with parameters π.
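A sketch of extracting this CLS representation, using the Hugging Face transformers library as an assumption (the paper's own implementation may differ):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")

def encode_cls(sentences):
    """Return the final-layer [CLS] vector f_pi(X) for each input sentence."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # position 0 holds the [CLS] token
```

During meta-training the encoder parameters are of course updated; the no-grad context above applies only to this illustrative inference call.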
Meta-learning across diverse classes: Our motivation is to meta-learn an initial point that can generalize to novel NLP tasks, so we consider methods that apply to a diverse number of classes. Note that many meta-learning models only apply to a fixed number of classes (Finn et al., 2017) and require training different models for different numbers of classes. We follow the approach of Bansal et al. (2019), which learns to generate softmax classification parameters conditioned on a task's support set, enabling meta-learning models that can adapt to tasks with diverse numbers of classes. This combines the benefits of metric-based methods (Vinyals et al., 2016; Snell et al., 2017) and optimization-based methods for meta-learning. The key idea is to train a deep set encoder g_ψ(·), with parameters ψ, which takes as input the set of examples of a class n and generates a (d + 1)-dimensional embedding that serves as the linear weight and bias for class n in the softmax classification layer. Let {X_{1n}, . . . , X_{kn}} be the k examples for class n in the support set of a task t:

w^t_n, b^t_n = g_ψ({f_π(X_{1n}), . . . , f_π(X_{kn})})    (3)

p(y | x) = softmax(W^t h_φ(f_π(x)) + b^t)    (4)

where W^t = [w^t_1; . . . ; w^t_N] ∈ R^{N×d} and b^t = [b^t_1; . . . ; b^t_N] ∈ R^N are the concatenation of the per-class vectors in (3), and h_φ is an MLP with parameters φ and output dimension d.
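A minimal PyTorch sketch of this parameter generation, where the averaging deep-set form of g_ψ and the exact MLP shape are assumptions based on the description in Bansal et al. (2019):

```python
import torch
import torch.nn as nn

class SoftmaxParamGenerator(nn.Module):
    """Generate a per-class softmax weight and bias from that class's support set."""
    def __init__(self, d):
        super().__init__()
        # g_psi: per-example MLP followed by mean pooling (permutation invariant).
        self.g = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d + 1))

    def forward(self, class_reps):
        # class_reps: (k, d) CLS representations of the k support examples of one class.
        out = self.g(class_reps).mean(dim=0)
        return out[:-1], out[-1]            # weight w_n of shape (d,), scalar bias b_n

def predict(cls_rep, h_phi, weights, biases):
    """Softmax prediction p(y|x) = softmax(W^t h_phi(f_pi(x)) + b^t), as in (4)."""
    logits = h_phi(cls_rep) @ torch.stack(weights).T + torch.stack(biases)
    return torch.log_softmax(logits, dim=-1)
```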
Using the above model to generate predictions, the parameters are meta-trained using the MAML algorithm (Finn et al., 2017). Concretely, we set θ := {π, φ, W^t, b^t} for the task-specific inner-loop gradient updates in (1) and set Θ := {π, ψ, α} for the outer-loop updates in (2). Note that we do multiple steps of gradient descent in the inner loop. Bansal et al. (2019) performed extensive ablations over parameter-efficient versions of the model and found that adapting all parameters, with learned per-layer learning rates, performs best for new tasks; we follow this approach. The full training algorithm can be found in the Supplementary.
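A condensed sketch of the resulting inner and outer loops, corresponding to (1) and (2) (a plain second-order MAML update with a single scalar learning rate; the per-layer learned learning rates and the exact loss functions are omitted as simplifying assumptions):

```python
import torch

def inner_loop(loss_fn, params, support, alpha, steps=5):
    """Task-specific adaptation: gradient steps on the support set, as in (1)."""
    adapted = [p.clone() for p in params]
    for _ in range(steps):
        loss = loss_fn(adapted, support)
        grads = torch.autograd.grad(loss, adapted, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(adapted, grads)]
    return adapted

def outer_step(loss_fn, params, alpha, task_batch, meta_optimizer):
    """Meta-update of the initial point and learning rate across tasks, as in (2)."""
    meta_optimizer.zero_grad()
    meta_loss = 0.0
    for support, validation in task_batch:
        adapted = inner_loop(loss_fn, params, support, alpha)
        meta_loss = meta_loss + loss_fn(adapted, validation)
    meta_loss.backward()   # gradients flow back to the initial params and to alpha
    meta_optimizer.step()
```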
Fast adaptation: Flennerhag et al. (2019) proposed an approach which mitigates the slow adaptation often observed in MAML by learning to warp the task loss surface to enable rapid descent to the loss minima. This is done by interleaving a neural network's layers with non-linear layers, called warp layers, which are not adapted for each task but are still optimized across tasks in the outer-loop updates in (2). Since introducing additional layers would make computation more expensive, we use existing transformer layers as warp layers. We designate the feed-forward layers between the self-attention layers of BERT, which project from dimension 768 to 3072 and back to 768, as warp layers. Note that these parameters also constitute a large fraction of the total parameters (∼ 51%). Thus, in addition to the benefit from warping, not adapting these layers per task means significantly faster training and fewer per-task parameters during fine-tuning. The warp layers are still updated in the outer loop during meta-training.
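A sketch of how such a parameter partition could be computed over a Hugging Face BERT model; the name-matching rules are an illustrative assumption, not the authors' exact code:

```python
def split_warp_parameters(model):
    """Split BERT parameters into warp (feed-forward) and per-task adapted groups.

    Warp layers are only updated in the outer loop; all other parameters are
    additionally adapted in the task-specific inner loop.
    """
    warp, adapted = [], []
    for name, param in model.named_parameters():
        # In BERT, the position-wise feed-forward sublayers are named
        # "intermediate.dense" (768 -> 3072) and "output.dense" (3072 -> 768);
        # the "attention" check excludes the attention output projection.
        if "intermediate.dense" in name or ("output.dense" in name and "attention" not in name):
            warp.append(param)
        else:
            adapted.append(param)
    return warp, adapted
```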

Related Work
Language model pre-training has recently emerged as a prominent approach to learning general-purpose representations (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2019). Refer to Weng (2019) for a review of self-supervised learning. Pre-training followed by fine-tuning is a two-step process, and fine-tuning introduces random parameters, making it inefficient when target tasks have few examples (Bansal et al., 2019). Multi-task learning of pre-trained models has shown improved results on many tasks (Phang et al., 2018; Liu et al., 2019a). More recently, and parallel to this work, Brown et al. (2020) show that extremely large language models can act as few-shot learners. They propose a query-based approach where few-shot task data is used as context for the language model. In contrast, we employ a fine-tuning-based meta-learning approach that enjoys nice properties like consistency, which are important for good out-of-distribution generalization (Finn, 2018). Moreover, we also show in this work that self-supervised meta-learning can improve few-shot performance for smaller models.
Meta-learning methods can be categorized as optimization-based (Finn et al., 2017; Nichol and Schulman, 2018; Rusu et al., 2018), model-based (Santoro et al., 2016; Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017), and metric-based (Vinyals et al., 2016; Snell et al., 2017). Refer to Finn (2018) for an exhaustive review. Unsupervised meta-learning has been explored in vision: Hsu et al. (2019) proposed clustering images using pre-trained embeddings to create tasks for meta-learning, and Metz et al. (2019) meta-learn a biologically motivated update rule from unsupervised data in a semi-supervised framework. Compared to these, we directly utilize text data to automatically create unsupervised tasks, without relying on pre-trained embeddings or access to target tasks.
In NLP, meta-learning approaches have followed the recipe of using supervised task data and learning models for specific tasks. Such approaches (Yu et al., 2018; Gu et al., 2018; Guo et al., 2018; Han et al., 2018; Mi et al., 2019) train to generalize to new labels of a specific task like relation classification and don't generalize to novel tasks. Bansal et al. (2019) proposed an approach that applies to diverse tasks to enable practical meta-learning models and evaluate on generalization to new tasks. However, they rely on supervised task data from multiple tasks and suffer from meta-overfitting, as we show in our empirical results. To the best of our knowledge, the method proposed here is the first self-supervised approach to meta-learning in NLP.

Experiments
We evaluate the models on few-shot generalization to new tasks and to new domains of training tasks. The evaluation consists of a diverse set of NLP classification tasks from multiple domains: entity typing, sentiment classification, natural language inference, and other text classification tasks. Our results show that self-supervised meta-learning using SMLMT improves performance over self-supervised pre-training. Moreover, combining SMLMT with supervised tasks achieves the best generalization, improving over multi-task learning by up to 21%.

Implementation Details
SMLMT: We use the English Wikipedia dump, as of March 2019, to create SMLMT. This is similar to the dataset used for pre-training BERT (Devlin et al., 2018), which ensures that gains are not due to using a larger or more diverse pre-training corpus (Liu et al., 2019b). The corpus is split into sentences and word-tokenized for task creation. We run task creation offline and create about 2 million SMLMT for meta-training, including a combination of 2-, 3-, and 4-way tasks. After task creation, the data is word-piece tokenized using the BERT-base cased model vocabulary for input to the models.
Supervised Tasks: Bansal et al. (2019) demonstrated that better feature learning from supervised tasks helps few-shot learning. Thus, we also evaluate multi-task learning and multi-task meta-learning for few-shot generalization. We use the GLUE tasks (Wang et al., 2018) and SNLI (Bowman et al., 2015) as the supervised tasks. Supervised tasks can be combined with SMLMT for meta-training (see the Hybrid SMLMT description above). Note that since these are only a few supervised tasks (8 in this case) with small label spaces, it is easy for meta-learning models to overfit to the supervised tasks (Yin et al., 2020), limiting generalization, as we show in the experiments.
Models: We evaluate the following models. (1) BERT: the transformer model trained with self-supervised learning, using MLM as the pre-training task on Wikipedia and BookCorpus; we use the cased base model (Devlin et al., 2018). (2) MT-BERT: a multi-task learning model trained on the supervised tasks, following Bansal et al. (2019); MT-BERT reuse is a variant that reuses the trained softmax layer from a closely related training task when transferring to a target task. (3) LEOPARD: the supervised meta-learning model of Bansal et al. (2019), meta-trained only on the supervised tasks. (4) SMLMT: our meta-learned model trained only on the self-supervised SMLMT. (5) Hybrid-SMLMT: our meta-learned model trained on the combination of SMLMT and the supervised tasks. Note that all models share the same transformer architecture, making the contribution from each component discernible. Moreover, the SMLMT and Hybrid-SMLMT models use a meta-learning algorithm similar to LEOPARD's, so any improvements are due to the self-supervised meta-training. All models are initialized with pre-trained BERT for training.
Evaluation Methodology: We evaluate few-shot generalization to multiple NLP tasks using the same set of tasks considered in Bansal et al. (2019). Each target task consists of k examples per class, for k ∈ {4, 8, 16, 32}, and different tasks can have different numbers of classes. Since few-shot performance is sensitive to the few examples used in fine-tuning, each model is fine-tuned on 10 such k-shot support sets per task, for each k, and the average performance with standard deviation is reported. Models are trained using their own training procedures, without access to the target tasks, and are then fine-tuned on each k-shot task. Results for MT-BERT and LEOPARD are taken from Bansal et al. (2019).
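A sketch of this evaluation protocol (the fine-tuning and accuracy helpers are illustrative assumptions; each model uses its own fine-tuning procedure):

```python
import statistics

def evaluate_few_shot(model_init, support_sets, test_set, fine_tune, accuracy):
    """Fine-tune on each k-shot support set and report mean/std test accuracy."""
    scores = []
    for support in support_sets:        # 10 different k-shot support sets per task
        model = fine_tune(model_init(), support)
        scores.append(accuracy(model, test_set))
    return statistics.mean(scores), statistics.stdev(scores)
```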

Hyper-parameters:
We follow the approach of Bansal et al. (2019) and use validation tasks for estimating hyper-parameters during fine-tuning for all baseline models. Note that the meta-learning approaches learn the learning rates during training and only require the number of fine-tuning epochs to be estimated from the validation tasks. Detailed hyper-parameters are in the Supplementary.
Results are presented in Table 1. Results on 2 domains of the Rating tasks are in the Supplementary due to space limitations. First, comparing models which do not use any supervised data, we see that, on average across the 12 tasks, the meta-trained SMLMT performs better than BERT, especially for small k ∈ {4, 8, 16}. Interestingly, the SMLMT model, which does not use any supervised data, even outperforms the MT-BERT models, which use supervised data for multi-task training. Next, comparing among all the models, we see that the Hybrid-SMLMT model performs best on average across tasks. For instance, in average 4-shot performance across tasks, Hybrid-SMLMT provides a relative gain in accuracy of 21.4% over the best-performing MT-BERT baseline. Compared to LEOPARD, Hybrid-SMLMT yields consistent improvements for all k ∈ {4, 8, 16, 32} and demonstrates steady improvement in performance with increasing data (k). We note that on some tasks, such as Disaster, SMLMT is better than Hybrid-SMLMT. We suspect negative transfer from multi-task training on these tasks, as also evidenced by the drop in performance of MT-BERT. These results show that SMLMT meta-training learns a better initial point that enables few-shot generalization.

Few-shot domain transfer
Each target task considered here has a similar task, from a different domain, among the GLUE training tasks. The datasets used are (1) 4 domains of Amazon review sentiment (Blitzer et al., 2007) and (2) Scitail, a scientific NLI dataset (Khot et al., 2018). Results on 2 domains of Amazon are in the Supplementary due to space limitations. A relevant baseline here is MT-BERT reuse, which reuses the softmax layer from the related training task; this is a prominent approach to transfer learning with pre-trained models. Comparing Hybrid-SMLMT with variants of MT-BERT, we see that Hybrid-SMLMT performs comparably or better. Comparing with LEOPARD, we see that Hybrid-SMLMT generalizes better to new domains. LEOPARD performs worse than Hybrid-SMLMT on Scitail even though the supervised tasks are biased towards NLI, with 5 of the 8 tasks being variants of NLI. This is due to meta-overfitting to the training domains in LEOPARD, which is prevented in Hybrid-SMLMT through the regularization from SMLMT.

Analysis
Meta-overfitting: We study the extent of meta-overfitting in LEOPARD and Hybrid-SMLMT. Since these models learn the adaptation learning rates, we can study the learning-rate trajectories during meta-training. Fig. 3 shows the results. We expect the learning rates to converge towards zero if task adaptation becomes irrelevant due to meta-overfitting. LEOPARD shows clear signs of meta-overfitting, with much smaller learning rates which converge towards zero for most of the layers. Note that, due to this, held-out validation during training is essential to enable any generalization (Bansal et al., 2019). Hybrid-SMLMT does not show this phenomenon for most layers, and its learning rates converge towards large non-zero values even when we continue training for much longer. This indicates that SMLMT helps ameliorate meta-overfitting.
Effect of the number of parameters: We study how the size of the model affects few-shot performance. Recently, there has been increasing evidence that larger pre-trained models tend to generalize better (Devlin et al., 2018; Radford et al., 2019; Raffel et al., 2019). We explore whether this is true even in the few-shot regime. For this analysis we use the development data for 3 tasks: Scitail, Amazon DVD sentiment classification, and CoNLL entity typing. We consider the BERT-base architecture with 110M parameters, and two smaller versions made available by Turc et al. (2019) consisting of about 29M and 42M parameters. We train versions of Hybrid-SMLMT as well as MT-BERT corresponding to the smaller models. Results are presented in Fig. 2. Interestingly, we see that bigger models perform much better than the smaller models even when the target task has only 4 examples per class. Moreover, we see consistent and large performance gains from the meta-learned Hybrid-SMLMT, even for its smaller model variants. These results indicate that meta-training helps in data-efficient learning even with smaller models, and that self-supervised learning enables larger models to learn more generalizable representations.
Representation analysis: To probe how the representations in the proposed models differ from those in the self-supervised BERT model and the multi-task BERT models, we performed CCA analysis on their representations (Raghu et al., 2017). We use the representations on the CoNLL and Scitail tasks for this analysis; results on the CoNLL task are in Fig. 4. First, we analyze the representations of the same model before and after fine-tuning on the target task. Interestingly, we see that the Hybrid-SMLMT model stays closer to its initial point after task-specific fine-tuning than the BERT and MT-BERT models do. Coupled with the better performance of Hybrid-SMLMT shown above, this indicates a better initialization point for Hybrid-SMLMT. Note that the representations in the lower layers are more similar before and after fine-tuning, and less similar in the top few layers. Next, we look at how representations differ across these models. We see that the models converge to different representations: the lower-layer representations are more similar, and they diverge as we move towards the upper layers. In particular, this indicates that multi-task learning learns different representations than self-supervised pre-training, and that the meta-learning model's representations differ from those of the other models.
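A sketch of the layer-wise similarity computation: a plain linear CCA via SVD is shown here as a simplifying assumption, whereas the paper's analysis follows the (SV)CCA tooling of Raghu et al. (2017):

```python
import numpy as np

def mean_cca_similarity(X, Y, eps=1e-8):
    """Mean canonical correlation between two activation matrices.

    X, Y: (n_examples, dim) activations of a given layer for the same inputs,
    taken from two different models (or from one model before/after fine-tuning).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases of the two activation subspaces.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, Sx > eps], Uy[:, Sy > eps]
    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(rho.mean())
```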
Conclusion

We introduced an approach to leverage unlabeled data to create meta-learning tasks for NLP. This enables better representation learning, learning of key hyper-parameters like learning rates, data-efficient fine-tuning, and, when combined with supervised tasks, amelioration of meta-overfitting. Through extensive experiments, we evaluated the proposed approach on few-shot generalization to novel tasks and domains and found that leveraging unlabelled data has significant benefits for enabling data-efficient generalization. This opens up the possibility of exploring large-scale meta-learning in NLP for various meta problems, including neural architecture search, continual learning, hyper-parameter learning, and more.

Supplementary Details

Table 4: Dataset statistics. Note that "-" indicates the corresponding split was not used.

Sampling for Hybrid-SMLMT: We restrict the word vocabulary for task creation to words with a term frequency of at least 50 in the corpus. This vocabulary is then used to create tasks in SMLMT as described. The word vocabulary is discarded at this point, and the data is word-piece tokenized using the BERT-base cased model vocabulary for input to the models. Note that after a supervised task is selected to be sampled (based on λ), it is sampled proportional to the square root of the number of samples in the supervised tasks, following Bansal et al. (2019).
Fine-tuning Hyper-parameters: We tune the number of fine-tuning epochs and the batch size using the development data of the Scitail and Amazon Electronics tasks, following Bansal et al. (2019). Note that the best values are determined separately for each k. The epochs search range is [5, 10, 50, 100, 150, 200, 300, 400] and the batch-size search range is [4, 8, 16]; a sketch of this search is given at the end of this section.

Training Hardware and Time: We train the SMLMT and Hybrid-SMLMT models on 4 V100 GPUs, each with 16GB memory. Owing to the warp layers, our training time per step and our GPU memory footprint are lower than those of LEOPARD (Bansal et al., 2019). However, our training typically runs much longer, as the model does not overfit, unlike LEOPARD (see the learning-rate trajectories in the main paper). Meta-training takes a total of 11 days and 14 hours.
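A sketch of the validation-based grid search over these ranges (the fine-tune-and-evaluate helper is an illustrative assumption):

```python
import itertools

EPOCHS = [5, 10, 50, 100, 150, 200, 300, 400]
BATCH_SIZES = [4, 8, 16]

def select_fine_tuning_hparams(fine_tune_and_eval, k):
    """Pick the epochs/batch size maximizing validation-task accuracy for a given k."""
    best = max(
        itertools.product(EPOCHS, BATCH_SIZES),
        key=lambda cfg: fine_tune_and_eval(k_shot=k, epochs=cfg[0], batch_size=cfg[1]),
    )
    return {"epochs": best[0], "batch_size": best[1]}
```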