DReCa: A General Task Augmentation Strategy for Few-Shot Natural Language Inference

Meta-learning promises few-shot learners that can adapt to new distributions by repurposing knowledge acquired from previous training. However, we believe meta-learning has not yet succeeded in NLP due to the lack of a well-defined task distribution, leading to attempts that treat datasets as tasks. Such an ad hoc task distribution causes problems of quantity and quality. Since there is only a handful of datasets for any NLP problem, meta-learners tend to overfit their adaptation mechanism; and since NLP datasets are highly heterogeneous, many learning episodes have poor transfer between their support and query sets, which discourages the meta-learner from adapting. To alleviate these issues, we propose DReCa (Decomposing datasets into Reasoning Categories), a simple method for discovering and using latent reasoning categories in a dataset to form additional high-quality tasks. DReCa works by splitting examples into label groups, embedding them with a finetuned BERT model, and then clustering each group into reasoning categories. Across four few-shot NLI problems, we demonstrate that using DReCa improves the accuracy of meta-learners by 1.5-4%.


Introduction
A key desideratum for human-like understanding is few-shot adaptation. Adaptation is central to many NLP applications, since new concepts and words appear often, leading to distribution shifts. People deal with these distribution shifts effortlessly by learning new concepts quickly, and we would like our models to have similar capabilities. While finetuning large pre-trained transformers is one way to facilitate this adaptation, the procedure requires thousands of samples where humans might require only a few.
Can these pre-trained transformers be made to achieve few-shot adaptation? One promising direction is meta-learning (Schmidhuber, 1987; Bengio et al., 1997). Meta-learning promises few-shot classifiers that can adapt to new tasks by repurposing skills acquired from training tasks. An important prerequisite for the successful application of meta-learning is a task distribution from which a large number of tasks can be sampled to train the meta-learner. While meta-learning is very appealing, applications in NLP have thus far proven challenging due to the absence of a well-defined set of tasks that correspond to re-usable skills. This has led to less effective ad hoc alternatives, like treating entire datasets as tasks.

Figure 1: Overview of our approach. We embed all examples with BERT, and then cluster within each label group separately (red and green correspond to entailment and not_entailment respectively). Then, we group clusters from distinct label groups to form tasks.
Treating entire datasets as tasks has two major issues. The first issue is learner overfitting (Rajendran et al., 2020), where a meta-learner overfits its adaptation mechanism to the small number of training tasks, since there is only a small number of supervised datasets available for any NLP problem. Second, the heterogeneity of NLP datasets can lead to learning episodes that encourage memorization overfitting (Yin et al., 2020; Rajendran et al., 2020), a phenomenon where a meta-learner ignores the support set and doesn't learn to adapt.
To improve the quality and quantity of tasks, we propose the simple approach of Decomposing datasets into Reasoning Categories, or DRECA. DRECA is a task augmentation strategy that takes as input a set of tasks (entire datasets) and decomposes them to approximately recover some of the latent reasoning categories underlying these datasets, such as various syntactic constructs within a dataset, or semantic categories such as quantifiers and negation. These reasoning categories are then used to construct additional few-shot classification tasks, augmenting the original task distribution. We illustrate these steps in Fig. 1. DRECA first embeds the examples using a BERT model finetuned over all the datasets. We then run k-means clustering over these representations to produce a refinement of the original tasks.
Experiments demonstrate the effectiveness of our simple approach. As a proof of concept, we adapt the classic sine-wave regression problem from Finn et al. (2017) to mimic the challenges of the NLP setting, and observe that standard meta-learning procedures fail to adapt. However, a model that meta-learns over the underlying reasoning types shows a substantial improvement. Then, we consider the problem of natural language inference (NLI). We show that meta-learners augmented with DRECA improve over baselines by 1.5-4 accuracy points across four separate NLI few-shot problems, without requiring domain-specific engineering or additional unlabeled data.

Related Work
Few-shot learning in NLP. The goal of learning from few examples has been studied for various NLP applications. Common settings include few-shot adaptation to new relations (Han et al., 2018), words (Holla et al., 2020), domains (Bao et al., 2020; Geng et al., 2019), and language pairs (Gu et al., 2018). Since these applications come with well-defined task distributions, they do not face the same overfitting challenges. On the other hand, many works deal with few-shot adaptation in settings with no clear task distribution (Dou et al., 2019; Bansal et al., 2020a) but do not address meta-overfitting, and are thus complementary to our work.
Overfitting and Task Augmentation. The memorization problem in meta-learning is studied by Yin et al. (2020), who propose a meta-regularizer to mitigate memorization overfitting but don't study learner overfitting. Task augmentation for mitigating overfitting in meta-learners is first studied by Rajendran et al. (2020) in the context of few-shot label adaptation. Hsu et al. (2019) propose CACTUs, a clustering-based approach for unsupervised meta-learning in the context of few-shot label adaptation for images. While also based on clustering, CACTUs creates meta-learning tasks where the goal is to predict cluster membership of images, whereas our work is focused on using clusters to subdivide pre-existing tasks to mitigate meta-overfitting in NLP. Most closely related to our work is the SMLMT method of Bansal et al. (2020b). SMLMT creates new self-supervised tasks that mitigate meta-overfitting, but this does not directly address the datasets-as-tasks problem we identify. In contrast, we focus on using clustering as a way to subdivide and fix tasks that already exist. This approach allows us to mitigate meta-overfitting without additional unlabeled data. In Section 6, we compare our model against SMLMT, and demonstrate comparable or better performance.

NLI
We consider the problem of Natural Language Inference, or NLI (MacCartney and Manning, 2008; Bowman et al., 2015), also known as Recognising Textual Entailment (RTE) (Dagan et al., 2005). Given a sentence pair x = (p, h), where p is referred to as the premise sentence and h as the hypothesis sentence, the goal is to output a binary label ŷ ∈ {0, 1} indicating whether the hypothesis h is entailed by the premise p or not. For instance, the sentence pair (The dog barked, The animal barked) is classified as entailed, whereas the sentence pair (The dog barked, The labrador barked) would be classified as not entailed. As shown in Table 1, NLI datasets typically encompass a broad range of linguistic phenomena. Apart from the reasoning types shown in Table 1, examples may also vary in terms of their genre, syntax, annotator writing style, etc., leading to extensive linguistic variability. Taken together, these factors of variation make NLI datasets highly heterogeneous.

Meta-Learning
The goal of meta-learning is to output a meta-learner f : (S_i, x_q^i) → ŷ that takes as input a support set S_i of labeled examples and a query point x_q^i, and returns a prediction ŷ. In the usual meta-learning setting, these support and query sets are defined as samples from a task T_i, which is a collection of labeled examples {(x^i, y^i)}. In N-way k-shot adaptation, each T_i is an N-way classification problem, and f is given k examples per label to adapt. A simple baseline for meta-learning is to train a supervised model on labeled data from training tasks, and then finetune it at test time on the support set. This can be powerful, but is ineffective for very small support sets. A better alternative is episodic meta-learning, which explicitly trains models to adapt using training tasks.

Example | Reasoning Category
A boy with the green jacket went back ⇒ A boy went back | Restrictive Modifiers
A white rabbit ran ⇒ A rabbit ran | Intersective Adjectives
Bill is taller than Jack ⇒ Jack is taller than Bill | Comparatives
The dog barked ⇒ The dog did not bark | Negation
The man went to the restaurant since he was hungry ⇒ The man was hungry | Coreference Resolution
Bill is taller than Jack ⇒ Jack is not taller than Bill | Negated Comparatives

Table 1: Some common reasoning types within NLI. These can also be composed to create new types.

Episodic Training. In the standard setup for training episodic meta-learners, we are given a collection of training tasks. We assume that both train and test tasks are i.i.d. draws from a task distribution p(T). For each training task T_i^tr ∼ p(T), we create learning episodes which are used to train the meta-learner. Each learning episode consists of a support set S_i = {(x_s^i, y_s^i)} and a query set Q_i = {(x_q^i, y_q^i)}. The goal of episodic meta-learning is to ensure that the meta-learning loss L(f(S_i, x_q^i), y_q^i) is small on training tasks T_i^tr. Since train tasks are drawn i.i.d. from the same distribution as the test tasks, this results in meta-learners that achieve low loss at test time.
MAML. In MAML, the meta-learner f takes the form of gradient descent on a model h_θ : x → y using the support set; we write θ_i for the task-specific parameters obtained after this gradient descent. The goal of MAML is to produce an initialization θ such that, after performing gradient descent on h_θ using S_i, the updated model h_{θ_i} can make accurate predictions on Q_i. MAML consists of an inner loop and an outer loop.
In the inner loop, the support set S_i is used to update the model parameters θ to obtain the task-specific parameters θ_i:

θ_i = θ − α ∇_θ Σ_{(x_s^i, y_s^i) ∈ S_i} L(h_θ(x_s^i), y_s^i).

These task-specific parameters are then used to make predictions on Q_i. The outer loop takes gradient steps over θ such that the task-specific parameters θ_i perform well on Q_i. Since θ_i is itself a differentiable function of θ, we can perform this outer optimization using gradient descent,

θ ← Opt(θ, ∇_θ L(h_{θ_i}(x_q^i), y_q^i)),

where Opt is an optimization algorithm, typically chosen to be Adam. The outer-loop gradient is typically computed in a mini-batch fashion by sampling a batch of episodes from distinct training tasks. The gradient ∇_θ L(h_{θ_i}(x_q^i), y_q^i) involves back-propagation through the adaptation step, which requires computing higher-order gradients. This can be computationally expensive, so a first-order approximation (FoMAML) is often used instead (Finn et al., 2017).
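The inner and outer loops can be sketched concretely. The following is a minimal first-order (FoMAML-style) update on a toy linear-regression model; it is our own illustration under simplified assumptions, not the paper's implementation:

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error for a linear model h_w(x) = X @ w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def fomaml_step(w, episodes, inner_lr=0.01, outer_lr=0.001, inner_steps=10):
    """One outer-loop update of first-order MAML (FoMAML).

    Each episode is (X_support, y_support, X_query, y_query). The inner loop
    adapts w on the support set; the outer gradient is the query-loss gradient
    evaluated at the adapted parameters (the first-order approximation drops
    back-propagation through the inner updates).
    """
    outer_grads = []
    for Xs, ys, Xq, yq in episodes:
        w_i = w.copy()
        for _ in range(inner_steps):           # inner loop: adapt on support
            w_i = w_i - inner_lr * grad(w_i, Xs, ys)
        outer_grads.append(grad(w_i, Xq, yq))  # first-order outer gradient
    return w - outer_lr * np.mean(outer_grads, axis=0)

# Tiny usage example with two synthetic episodes.
rng = np.random.default_rng(0)
w = np.zeros(3)
episodes = []
for _ in range(2):
    w_true = rng.normal(size=3)
    Xs, Xq = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    episodes.append((Xs, Xs @ w_true, Xq, Xq @ w_true))
w = fomaml_step(w, episodes)
```

Here the outer gradient is simply the query-set gradient evaluated at the adapted parameters, which is exactly the first-order approximation described above.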

Meta-Learning for NLI
As mentioned earlier, training tasks in NLP are often entire datasets, leading to a small number of heterogeneous training tasks. Thus, to train a meta-learner for NLI, our training tasks T tr i are NLI datasets. At test time, we are given new datasets that we must adapt to, given a support set of randomly drawn examples from the dataset.
Meta Overfitting. Consider learning episodes sampled from an NLI dataset (Table 2). NLI datasets cover a wide range of linguistic phenomena, and so we expect an episode to comprise a diverse set of reasoning categories. Such heterogeneous episodes can lead to scenarios where the support and query sets do not have any overlap in reasoning skills, causing the model to ignore the support set. This is known as memorization overfitting. Moreover, since we have a limited number of datasets, the meta-learner is exposed to a very small number of tasks at meta-training time, causing it to generalize poorly to test tasks. This is known as learner overfitting (Rajendran et al., 2020).

NLI Example | Reasoning Category
Support: Everyone has visited every person ⇒ Jeff didn't visit Byron | Negation, Quantifier
Support: Generally, LC mail is lighter than AO mail ⇒ AO mail is almost always heavier than LC mail | Comparative, Quantifier
Support: They've had their house that long ⇒ They don't own the house and have never lived there | Negation
Support: Then he strolled gently in the opposite direction ⇒ He wasn't walking in the same direction | Negation
Query: A white rabbit ran ⇒ A rabbit ran | Intersective Adjective

Table 2: Illustration of an episode sampled from a heterogeneous task. We can observe that there is no overlap between the support and query reasoning categories, leading to limited transfer.

An Illustration of Overfitting in Meta-Learning
We illustrate meta-overfitting challenges by modifying the classic sine-wave toy example for meta-learning from Finn et al. (2017).
Dataset. Consider the sine-wave regression problem from Finn et al. (2017) where each task corresponds to learning a sine wave mapping with a fixed amplitude and phase offset. As shown in Fig. 2(a), each support and query set consists of points drawn from the same sine wave mapping.
The key observation here is that since support and query examples are drawn from the same mapping, we might expect a meta-learner to use the support set for adaptation. In the NLP case, since tasks are heterogeneous, support and query examples may belong to different reasoning categories. We instantiate this by letting support and query points come from different sine waves (Fig. 2(b)). More formally, our construction consists of multiple datasets. Each dataset is defined as a unit square sampled from a 10 × 10 grid over x_1 ∈ [−5, 5] and x_2 ∈ [−5, 5]. Within each dataset, we construct multiple reasoning categories by defining each reasoning category to be a sine wave with a distinct phase offset. This is illustrated in Fig. 2(b), where each unit square represents a dataset, and sine waves along distinct rays correspond to reasoning categories. The target label y for the regression task is defined for each category by a randomly sampled phase φ ∈ [0.1, 2π] and y = sin(‖x − x_0‖_2 − φ), where x_0 is the origin of the category's ray. At meta-training time, we sample a subset of these 100 squares as our training datasets, and then evaluate few-shot adaptation to reasoning categories from held-out datasets at meta-test time.

Figure 2: Comparing the classic 1D sine wave regression with our setting. (a) 1D sine wave regression (Finn et al., 2017): each task is a sine wave with a fixed amplitude and phase offset. (b) Three datasets from our 2D sine wave regression: each dataset is a unit square with multiple reasoning categories; a reasoning category is a distinct sinusoid along a ray that maps x = (x_1, x_2) to the value of the sine wave y at that point. For a randomly sampled episode, red dots mark support examples and the green square marks a query example. Notice how in 2(a) the support and query come from the same sine wave, while in 2(b) they often come from different sine waves. This makes adaptation challenging, leading to memorization overfitting.
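As a sketch, this construction can be generated as follows. Note one assumption on our part: the text leaves the ray origin implicit, so we take each square's corner as the origin of its categories' rays.

```python
import numpy as np

def make_toy_datasets(n_categories=5, n_points=50, seed=0):
    """Sketch of the 2D sine-wave construction (our reading of the text):
    a 10 x 10 grid of unit squares over [-5, 5]^2, each square a 'dataset'
    containing several 'reasoning categories'. Each category has a randomly
    sampled phase phi in [0.1, 2*pi]; the target is the sine of the distance
    from the square's corner (assumed ray origin) minus phi.
    """
    rng = np.random.default_rng(seed)
    datasets = []
    for i in range(10):
        for j in range(10):
            corner = np.array([-5.0 + i, -5.0 + j])
            categories = []
            for _ in range(n_categories):
                phi = rng.uniform(0.1, 2 * np.pi)
                # Points sampled uniformly inside this dataset's unit square.
                x = corner + rng.uniform(0.0, 1.0, size=(n_points, 2))
                y = np.sin(np.linalg.norm(x - corner, axis=1) - phi)
                categories.append((x, y))
            datasets.append(categories)
    return datasets

datasets = make_toy_datasets()  # 100 datasets, each a list of categories
```

At meta-training time one would sample a subset of the 100 datasets; MAML-BASE samples episodes from whole datasets, while MAML-ORACLE samples both support and query from a single category.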
Experiments. We use similar hyperparameters to Finn et al. (2017). We start by considering MAML-BASE, a meta-learner that is trained directly over a dataset-based task distribution. Concretely, we define each training task as a dataset and randomly sample episodes to train the meta-learner. Note that since episodes are drawn uniformly at random from an entire dataset, we expect support and query sets to often contain points from disjoint reasoning categories (Fig. 2(b)), making adaptation infeasible. Thus, we expect pre- and post-adaptation losses to be similar, which is indeed reflected in the learning curves in Fig. 3(a). We observe that the orange and blue lines, corresponding to pre- and post-adaptation losses respectively, almost overlap. In other words, the meta-learner ignores the support set entirely. This is what we mean by memorization overfitting.
Next, we consider MAML-ORACLE, a meta-learner that is trained on tasks based on the underlying reasoning categories, i.e., distinct sine waves. Consequently, support and query sets are both drawn from the same sine wave, similar to Finn et al. (2017), making adaptation feasible. From Fig. 3(b), we observe large gaps between pre- and post-adaptation losses, which indicates that memorization overfitting has been mitigated. These experiments confirm our hypothesis about the challenges of meta-learning with heterogeneous task distributions. Since NLI datasets require a wide range of skills, we might expect similar challenges for few-shot NLI as well.

DRECA
In this section, we introduce our approach for extracting reasoning categories for NLI. The key observation is that high-quality sentence-pair representations, such as those obtained from a finetuned BERT model, can bring out the micro-structure of NLI datasets. Indeed, the fact that pre-trained transformers can be used to create meaningful clusters has been shown in other recent work (cf. Aharoni and Goldberg (2020); Joshi et al. (2020)).
At a high level, the goal of DRECA is to take a heterogeneous task (such as a dataset) and produce a decomposed set of tasks. In doing so, we hope to obtain a large number of relatively homogeneous tasks that can prevent meta overfitting.
Given a training task T_i^tr, we first group examples by their labels, and then embed the examples within each group with an embedding function EMBED(·). Concretely, for each N-way classification task T_i^tr we form groups g_l^i = {(EMBED(x_p^i), y_p^i) | y_p^i = l}. Then, we refine each label group into K clusters via k-means clustering, breaking T_i^tr down into cluster groups {C_j(g_l^i)}_{j=1}^K for l = 1, 2, ..., N. These cluster groups can be used to produce K^N potential DRECA tasks: each task is obtained by choosing one of the K clusters for each of the N label groups, and taking their union. At meta-training time, learning episodes are sampled uniformly at random from DRECA tasks with probability λ and from one of the original tasks with probability 1 − λ. Since our clustering procedure is based on finetuned BERT vectors, we expect the resulting clusters to roughly correspond to distinct reasoning categories. Indeed, when the true reasoning categories are known, we show in Section 7.2 that DRECA yields clusters that recover these reasoning categories almost exactly.
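The grouping, clustering, and recombination steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: a tiny k-means stands in for a library routine, and random vectors stand in for finetuned BERT embeddings.

```python
import itertools
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Minimal k-means (standing in for a library implementation).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def dreca_tasks(embeddings, labels, k):
    """Form the K^N DReCa tasks for one dataset: cluster each label group
    into k clusters, then take one cluster per label group and union them.
    Returns a list of index arrays into the original examples.
    """
    labels = np.asarray(labels)
    groups = {}
    for l in np.unique(labels):
        idx = np.where(labels == l)[0]
        assign = kmeans(embeddings[idx], k)
        groups[l] = [idx[assign == j] for j in range(k)]
    tasks = []
    for choice in itertools.product(*groups.values()):  # one cluster per label
        tasks.append(np.concatenate(choice))
    return tasks

# Usage: a 2-way problem with k = 3 clusters per label group.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))   # stand-in for finetuned BERT embeddings
lab = np.repeat([0, 1], 100)
tasks = dreca_tasks(emb, lab, k=3)
```

With N = 2 labels and K = 3 clusters per label, this yields K^N = 9 augmented tasks; episodes would then be sampled from these tasks with probability λ and from the original dataset-task with probability 1 − λ.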

Datasets
We evaluate DRECA on 4 NLI few-shot learning problems, which we describe below (more details in Appendix A.2.1). The first problem is based on synthetic data, while the other 3 are on real datasets and hence provide a good demonstration of the utility of our proposal.
HANS-FEWSHOT is a few-shot classification problem over HANS (McCoy et al., 2019), a synthetic diagnostic dataset for NLI. Each example in HANS comes from a hand-designed syntactic template which is associated with a fixed label (entailment or not_entailment). The entire dataset consists of 30 such templates which we use to define 15 reasoning categories. We then hold out 5 of these for evaluation, and train on the remaining 10. While this is a simple setting, it allows us to compare DRECA against an "oracle" with access to the underlying reasoning categories.
COMBINEDNLI consists of a combination of 3 NLI datasets for training: MultiNLI (Williams et al., 2018), the Diverse Natural Language Inference Collection (DNC; Poliak et al. (2018)), and Semantic Fragments (Richardson et al., 2020). These training datasets cover a broad range of NLI phenomena: MultiNLI consists of crowdsourced examples, DNC consists of various semantic annotations from NLP datasets recast into NLI, and Semantic Fragments is a synthetic NLI dataset covering logical and monotonicity reasoning. Our objective is to train a single meta-learner that can then be used to make predictions on diverse NLP problems recast as NLI. To this end, we evaluate models trained on COMBINEDNLI on 2 datasets. In COMBINEDNLI-RTE, we evaluate on the RTE datasets (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) as provided in GLUE (Wang et al., 2019); the RTE datasets consist of various information extraction and question answering datasets recast as NLI. Second, we consider the QANLI dataset (Demszky et al., 2018), which recasts question answering into NLI. In particular, we consider RACE (Lai et al., 2017) and use the gold annotations provided by Demszky et al. (2018) to transform it into an NLI dataset.
GLUE-SciTail, where we train on all NLI datasets from GLUE (Wang et al., 2019) and evaluate on SciTail (Khot et al., 2018). This setting is comparable to Bansal et al. (2020b), with the difference that we only meta-train on the NLI subset of GLUE, whereas they meta-train on all GLUE tasks. We follow the same evaluation protocol as Bansal et al. (2020b) and report 2-way 4-shot accuracy.

Models
Non-Episodic Baselines. All non-episodic baselines train h_θ on the union of all examples from each T_i^tr. In MULTITASK (FINETUNE), we additionally finetune the trained model on the support set of each test task. In MULTITASK (K-NN), each query example in the test task is labeled according to its nearest neighbor in the support set. Finally, in MULTITASK (FINETUNE + K-NN), we first finetune the trained model on the support set and then label each query example based on its nearest neighbor in the support set.
Episodic Meta-learners. MAML-BASE is a MAML model where every task corresponds to a dataset. In the HANS-FEWSHOT setting where underlying reasoning categories are known, we also compare with an oracle model MAML-ORACLE which is trained over a mixture of dataset-based tasks as well as oracle reasoning categories. Finally, MAML-DRECA is our model which trains MAML over a mixture of the original dataset-based tasks as well as the augmented tasks from DRECA.
Evaluation. To control for variations across different support sets, we sample 5-10 random support sets for each test task. We finetune each of our models on these support sets and report means and 95% confidence intervals assuming the accuracies follow a Gaussian.
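The reported intervals amount to a sample mean with a Gaussian 95% half-width (1.96 standard errors); a small sketch of this computation (the accuracy values below are made up for illustration):

```python
import math

def mean_and_ci95(accuracies):
    """Mean and 95% confidence-interval half-width over support-set resamples,
    assuming the per-support-set accuracies are Gaussian (1.96 standard errors).
    """
    n = len(accuracies)
    mean = sum(accuracies) / n
    # Unbiased sample variance, then the standard error of the mean.
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width

mean, hw = mean_and_ci95([0.71, 0.68, 0.74, 0.70, 0.72])
```

One would report the result as mean ± half-width, e.g. 71.0 ± 2.0 accuracy points.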
Training Details. We use first-order MAML (FoMAML) for computational efficiency. We use BERT-base as provided in the transformers library (Wolf et al., 2019) as the parameterization for h_θ and EMBED(·). The meta-training inner-loop optimization involves 10 gradient steps with Adam, with a support set of 2 examples (2-way 1-shot) for all settings except GLUE-SciTail, where the support set size is 8 (2-way 4-shot). We experiment with 4-shot adaptation on GLUE-SciTail to match the evaluation setup from Bansal et al. (2020b). The mixing weight λ is set to 0.5 for all our experiments. More details can be found in Appendix A.2.2.
Results. We report results on the synthetic HANS-FEWSHOT setting in Table 4, where we find that DRECA improves over all baselines. In particular, we observe an improvement of +6.94 points over MULTITASK (FINETUNE + K-NN) and +4.3 points over MAML-BASE. Moreover, we observe that MAML-DRECA obtains accuracy comparable to MAML-ORACLE.
Next, we report results on our 3 real NLI settings in Table 3. Again, we find that DRECA improves model performance across all 3 settings: MAML-DRECA improves over MAML-BASE by +2.5 points on COMBINEDNLI-QANLI, +2.7 points on COMBINEDNLI-RTE, and +1.6 points on GLUE-SciTail. On GLUE-SciTail, we compare against SMLMT (Bansal et al., 2020b) and find that MAML-DRECA improves over it by 1.5 accuracy points. However, we note that the confidence intervals of these approaches overlap, and also that Bansal et al. (2020b) consider the entire GLUE data to train the meta-learner, whereas we only consider NLI datasets within GLUE.
To examine the learned representations qualitatively, we use t-SNE (Maaten and Hinton, 2008) to project the finetuned BERT representations onto 2 dimensions. Each point in Fig. 4 is colored with its corresponding reasoning category, and we can observe a clear clustering of examples according to their reasoning category.

Evaluating DRECA Cluster Purity
To understand whether reasoning categories can be accurately recovered with our approach, we measure the purity of DRECA clusters for HANS-FEWSHOT, where the true reasoning categories are known. Purity is computed by counting, for each cluster, the number of examples belonging to its majority reasoning type, summing these counts over clusters, and dividing by the total number of examples. From Table 5, we observe high cluster purity, which provides evidence that DRECA is able to recover the true reasoning categories.
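The purity computation described above amounts to the following (the cluster assignments and category names in the usage example are made up):

```python
from collections import Counter

def cluster_purity(cluster_ids, true_categories):
    """Purity: for each cluster, count the examples of its majority reasoning
    category; sum over clusters and divide by the total number of examples."""
    by_cluster = {}
    for c, t in zip(cluster_ids, true_categories):
        by_cluster.setdefault(c, []).append(t)
    majority = sum(Counter(ts).most_common(1)[0][1] for ts in by_cluster.values())
    return majority / len(cluster_ids)

purity = cluster_purity(
    [0, 0, 0, 1, 1, 1],
    ["neg", "neg", "quant", "comp", "comp", "comp"],
)  # majority counts: 2 (cluster 0) + 3 (cluster 1), out of 6 examples
```

A purity of 1.0 means every cluster contains a single reasoning category.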

Distribution of linguistic phenomena across clusters
We seek to understand how the different linguistic phenomena present in the overall population are distributed among the various clusters. To perform this analysis, we focus on MultiNLI annotation tags from Williams et al. (2018). A subset of examples in MultiNLI are assigned tags based on the presence of certain keywords, e.g., time words like days of the week; quantifiers like every, each, some; negation words like no, not, never. Additionally, certain tags are assigned based on the PTB (Marcus et al., 1993) parses of examples, e.g., the presence or absence of adjectives/adverbs. For each annotation tag, we compute the fraction of examples labeled with that tag in each cluster. We visualize this for 10 annotation tags and indicate statistically significant deviations from the averages in Fig. 5. Statistical significance is measured with binomial testing, with a Bonferroni correction to account for multiple testing. For every annotation tag, we shade all clusters that contain a statistically significant deviation from the mean. For instance, there is a positive cluster with a 2.5-fold enrichment in Negation tags compared to the average, and a negative cluster that contains over 4 times the population average of Negation (Hyp only) tags. Similarly, among Conditionals, we have positive clusters that contain 1.4 times the population average and a negative cluster containing half the population average. Interestingly, we find most positive clusters to be significantly impoverished in Adverb (Hyp only) tags, while most negative clusters are enriched in these tags. This analysis presents evidence that the clusters used by DRECA localize linguistic phenomena to a small number of clusters.
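As an illustration of this testing procedure (the exact test statistic used in the analysis is not spelled out here, so the following upper-tail binomial test with a Bonferroni-adjusted threshold is our own sketch):

```python
from math import comb

def binom_tail(k, n, p):
    # Exact upper-tail probability P[X >= k] for X ~ Binomial(n, p).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def enriched(k, n, base_rate, n_tests, alpha=0.05):
    """Is a tag significantly enriched in a cluster? One-sided binomial test
    against the population base rate, with a Bonferroni correction dividing
    the significance level by the number of tests performed."""
    return binom_tail(k, n, base_rate) < alpha / n_tests

# Hypothetical cluster: 40 of 100 examples carry the tag, against a 20%
# population rate, correcting for 50 simultaneous tests.
sig = enriched(40, 100, 0.20, n_tests=50)
```

A symmetric lower-tail test would flag significantly impoverished clusters in the same way.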

Discussion
Comparing with CACTUs. Our work is most similar to CACTUs from Hsu et al. (2019). Apart from differences in the modality considered (text vs. images), we differ in the following ways. Conceptually, Hsu et al. (2019) consider a fully unsupervised meta-learning setting where no labels are provided and use cluster IDs to induce labels, while our goal is to produce additional tasks in a supervised meta-learning setting. Second, CACTUs tasks are constructed by directly applying k-means on the entire training dataset, while we apply k-means separately on each label group and construct tasks by choosing a cluster from each label group, leading to tasks with a uniform label distribution. Finally, while CACTUs uses constructed tasks directly, our work uses them to augment the original task distribution.
Number of examples in support set. All evaluation in this work considers small support sets where the number of examples per label ranges from 1 to 4. This setting is somewhat restrictive since, in practice, one might be able to get a few hundred examples for the target domain. These moderately sized support sets could themselves be heterogeneous, so that adapting a single learner might be hard. In such cases, we can use a similar clustering approach to separate the support set into homogeneous tasks and adapt a separate learner for each task. These learners could then be plugged into a mixture-of-experts framework for making predictions.
Using k-means to produce task refinements. While we are able to get sufficiently homogeneous clusters with k-means, we note one shortcoming of this approach. Any input has multiple attributes, or factors of variation, and it may be possible to create a clustering for each factor. The current k-means based approach doesn't model this, since we only produce a single clustering of the data. For instance, x_1 = The man was walking in the park ⇒ The man is not at home and x_2 = He went with his friends to the mall ⇒ He is not at work can belong to the same cluster if the underlying metric is based on reasoning types. At the same time, x_1 could also be clustered with x_3 = The man was walking in the park ⇒ The woman is in the park if the distance metric is based on lexical similarity. A promising direction for future work is to explore such multi-clusterings based on the various factors of variation present in the training data.
Non-meta-learning based few-shot adaptation. In this work, we use tools from meta-learning to directly optimize for few-shot behavior. While not directly comparable to us, there have been many recent approaches to few-shot adaptation in NLP that do not use meta-learning. Brown et al. (2020) show impressive few-shot adaptation in large language models through "in-context learning", which is presumably acquired only through the language modeling objective. Schick and Schütze (2020) train multiple models on lexical variations of a small support set and use these to label additional unlabeled examples from the target domain. These "self-labeled" examples are used to train a second model which can then make predictions on query examples. Finally, Gao et al. (2020) explore in-context learning of smaller language models for few-shot adaptation. In particular, they introduce a pipeline to identify useful prompts for the target domain, along with informative labeled examples to prepend as context for the LM.

Conclusion
Many papers point out fundamental challenges in creating systems that achieve human-like understanding of tasks like NLI. Here, we studied conditions under which systems can learn from extremely few samples. We believe that such systems would complement and enhance further study into more sophisticated challenges such as model extrapolation.
One of the main ingredients for successful application of meta-learning is a large number of high-quality training tasks from which to sample learning episodes for the meta-learner. We observe that such a task distribution is usually not available for important NLP problems, leading to less desirable ad hoc attempts that treat entire datasets as tasks. In response, we propose DRECA as a simple and general-purpose task augmentation strategy. Our approach creates a refinement of the original set of tasks (entire datasets) that roughly corresponds to the linguistic phenomena present in the data. We show that training on a task distribution augmented with DRECA leads to consistent improvements on 4 NLI few-shot classification problems, matching other approaches that require additional unlabeled data as well as oracles that have access to the true task distribution.

A.1 2D Sine Wave Regression: Training Details
We use a two-layer neural network with 40-dimensional hidden representations and ReLU nonlinearities as the parameterization of h_θ. Following Finn et al. (2017), we take a single gradient step on the support set at meta-training time, and take 10 gradient steps at meta-test time. The MAML weights are optimized with Adam, and the inner-loop adaptation is done with SGD with a learning rate of 1e-2. For each outer-loop update, we sample 5 tasks, and each episode consists of a support set of size 5, i.e., we consider 5-shot adaptation.

A.2.1 Dataset Generation Details
We describe details of how our datasets are generated below. Note that all our datasets are in English.
HANS-FEWSHOT. The reasoning categories we use are in Table 6. We randomly split these 15 reasoning categories in HANS into training and test tasks. For each task, we sample 500 examples split equally among entailment and not_entailment.
COMBINEDNLI. We first convert MultiNLI and Semantic Fragments into 2-way (entailment vs not_entailment) NLI problems by collapsing both contradiction and neutral labels into not_entailment, and resampling such that the dataset is balanced between the 2 label classes. To evaluate on QANLI, we use the RACE QA dataset and transform it into NLI as in Demszky et al. (2018). For RTE, we create a test set ourselves by randomly sampling examples from the RTE dataset provided by Wang et al. (2019). Dataset statistics can be found in Table 7.
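The collapsing-and-balancing step can be sketched as follows (the tuple format and function name are our own, not the paper's code):

```python
import random

def collapse_and_balance(examples, seed=0):
    """Collapse 3-way NLI labels into 2-way (contradiction/neutral become
    not_entailment) and downsample so both labels are equally frequent.
    `examples` is a list of (premise, hypothesis, label) tuples.
    """
    collapsed = [
        (p, h, "entailment" if l == "entailment" else "not_entailment")
        for p, h, l in examples
    ]
    rng = random.Random(seed)
    pos = [e for e in collapsed if e[2] == "entailment"]
    neg = [e for e in collapsed if e[2] == "not_entailment"]
    n = min(len(pos), len(neg))          # downsample the larger class
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Usage: 10 entailment, 5 neutral, 8 contradiction -> 10 + 10 examples.
data = [("p", "h", l)
        for l in ["entailment"] * 10 + ["neutral"] * 5 + ["contradiction"] * 8]
balanced = collapse_and_balance(data)
```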
GLUE-SciTail. We use MultiNLI, RTE, QNLI and SNLI as training data, following a similar procedure to convert 3-way NLI datasets into 2-way NLI. For evaluation, we use SciTail. Dataset statistics are in Table 8.

A.2.2 Training Details
Hyperparameters for all MAML models can be found in Table 9. We implement MAML in PyTorch using the higher library (Grefenstette et al., 2019). We take the BERT-base implementation from the huggingface library (Wolf et al., 2019) as the parameterization for h_θ and EMBED(·).

B Discovering Reasoning Categories in 2D Sine Wave Regression
To discover latent reasoning categories for the 2D sine wave regression dataset, we train a feedforward neural network (parameterized similarly to h_θ) on the union of all the datasets, and use the final-layer representations to cluster examples. We then use these clusters instead of the true reasoning categories to augment the original task distribution. We show learning curves on held-out test tasks in Fig. 6. As expected, MAML-BASE fails to adapt to new reasoning categories, indicating that it was unable to acquire the required skill from its training tasks. On the other hand, MAML-ORACLE is able to adapt very well, which confirms our hypothesis that a large number of high-quality tasks helps. Finally, we see that MAML trained on the augmented task distribution is able to match the performance of the oracle.

Figure 6: Learning curves on the 2D sine-wave regression task. We observe that the oracle meta-learner outperforms the baseline, and our proposed approach is able to bridge the gap.