Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning

Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category. This paper explores data augmentation—a technique particularly suitable for training with limited data—for this few-shot, highly-multiclass text classification setting. On four diverse text classification tasks, we find that common data augmentation techniques can improve the performance of triplet networks by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation trains faster, improves performance, and remains robust to high amounts of noising from augmentation.


Introduction
Traditional text classification tasks such as sentiment classification (Socher et al., 2013) typically have few output classes (e.g., in binary classification), each with many training examples. Many practical scenarios such as relation classification (Han et al., 2018), answer selection (Kumar et al., 2019), and sentence clustering (Mnasri et al., 2017), however, have a converse setup characterized by a large number of output classes (Gupta et al., 2014), often with few training examples per class. This scenario, which we henceforth refer to as few-shot, highly-multiclass text classification, is a common setting in NLP applications and can be challenging due to the scarcity of training data.
Data augmentation for NLP has seen increased interest in recent years (Wei and Zou, 2019; Qiu et al., 2020). In traditional text classification tasks, it has been shown that although performance improvements can be marginal when training data is sufficient, augmentation is especially beneficial in limited data scenarios (Xie et al., 2020). As such, we hypothesize that the few-shot, highly-multiclass text classification scenario is a suitable context for data augmentation.
Figure 1: Schematic showing the two types of curriculum augmentation that we propose. τ is a parameter that controls augmentation temperature (fraction of perturbed tokens).
Based on this motivation, our paper makes two main contributions.
• First, we apply popular data augmentation techniques to the common triplet loss (Schroff et al., 2015) approach for few-shot, highly multiclass classification, finding that out-of-the-box augmentation can improve performance noticeably.
• We then propose a simple curriculum learning strategy called curriculum data augmentation and experiment with two schedules, as shown in Figure 1. A two-stage curriculum, which first trains on original data and then introduces augmented data of fixed temperature (amount of noising), achieves slightly better performance than standard augmentation, while training faster and remaining more robust to high temperatures. A gradual curriculum, which also first trains on original data only but gradually increases augmentation temperature at each subsequent stage, takes longer to converge but improves by more than 1% over standard augmentation.

Curriculum Data Augmentation
Motivation. Inspired by human and animal learning, curriculum learning (Bengio et al., 2009) posits that neural networks train better when examples are not randomly presented but instead organized in a meaningful order that gradually shows more concepts and complexity. Traditionally, curriculum learning approaches first assume that a range of example difficulty exists in the data and then leverage various heuristics to sort examples by difficulty and train models on progressively harder examples (Bengio et al., 2009; Tsvetkov et al., 2016; Weinshall et al., 2018). A newer school of thought, however, has noted that instead of discovering a curriculum in existing data, data can be intentionally modified to dictate an artificial range of difficulty (Korbar et al., 2018; Ganesh and Corso, 2020); this is the approach we take here.
Our approach. Unlike data augmentation in computer vision where augmented data undoubtedly resembles original data, in text, data augmentation techniques might introduce linguistic adversity and therefore can be seen as a form of noising (Li et al., 2017), where noised data is harder to learn from than unmodified original data. As such, we can create an artificial curriculum in the data by leveraging controlled application of data augmentation, starting by training on only original data and then adding augmented data with higher levels of noising as training progresses. Specifically, we propose two simple schedules. (1) Two-stage curriculum data augmentation calls for one stage of training with only original data, followed by one stage of training with augmented data of fixed temperature.
(2) Gradual curriculum data augmentation involves one stage of training with only original data, followed by multiple stages of training with augmented data, where the temperature of augmented data (i.e., fraction of perturbed tokens) gradually increases at each stage.
Figure 2: [...] (Wei and Zou, 2019). Our proposed two-stage curriculum (second stage starts at four thousand updates) trains faster and achieves slightly higher performance compared with standard augmentation while using the same number of updates. Our proposed gradual curriculum (which here linearly increases augmentation temperature τ by 0.1 at {4, 8, 12, 16, 20} thousand updates) outperforms both standard augmentation and the two-stage curriculum, but takes longer to converge. Results shown are averaged over thirteen random seeds.
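The two schedules can be read as functions that map the current update count to an augmentation temperature. The sketch below is one possible way to express that mapping; the function names are hypothetical, and the stage boundaries are taken from the example in Figure 2 rather than from any fixed rule. The fixed temperature of the two-stage schedule is left as a required argument because it varies by experiment.

```python
def two_stage_temperature(step, switch_step, fixed_tau):
    """Two-stage schedule: original data only (tau = 0) before
    `switch_step`, then augmented data at a fixed temperature."""
    return 0.0 if step < switch_step else fixed_tau


def gradual_temperature(step, boundaries, increment=0.1):
    """Gradual schedule: tau starts at 0 and grows by `increment`
    each time `step` reaches one of the stage boundaries."""
    stages_passed = sum(1 for b in boundaries if step >= b)
    # round() removes floating-point noise such as 0.30000000000000004
    return round(stages_passed * increment, 10)


# Stage boundaries from the Figure 2 example (in updates).
FIGURE_2_BOUNDARIES = [4000, 8000, 12000, 16000, 20000]
```

For instance, `gradual_temperature(9000, FIGURE_2_BOUNDARIES)` gives 0.2, since two boundaries have been passed by update 9000.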

Experimental Setup
We conduct empirical experiments to evaluate curriculum data augmentation on a variety of text classification tasks using a triplet loss model.
2. FEWREL (c = 64). The FewRel dataset contains sentences categorized by a relationship between its specified head and tail tokens such as 'capital of,' 'member of,' and 'birth name' (Han et al., 2018). We use all 64 classes given in the posted training set, splitting 100 examples per class into a test set, with the remainder of the examples going into the training set.
3. COV-C (c = 87). The COVID-Q dataset classifies questions into 89 clusters where all questions in a cluster ask about the same thing (Wei et al., 2020). We use the train-test split with three training examples per class as given by the authors. We find that 2 of the 89 classes in the training set actually have only two examples per class instead of the reported three, and so we remove these classes from the training and test sets and use the 87 classes that remain.
4. AMZN (c = 318). The Amazon product review dataset aims to categorize a product into a certain class given a review (Yury, 2020). We only consider the 318 'level-3' classes given in this dataset with at least six examples per class.
To balance the class distribution during experiments, we randomly sample N_c examples per class to be used for training, with N_c varying based on the experiment and dataset. Our sampled training sets for COV-C and AMZN have N_c = 3 examples per class, and our training sets for HUFF and FEWREL have N_c = 10, a common low-resource scenario. For all experiments, we use top-1 accuracy (%) as the evaluation metric.
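Class-balanced sampling of this kind can be sketched in a few lines. The helper below is illustrative only: its name, the `(text, label)` input format, and the choice to drop classes with fewer than N_c examples (mirroring how the two undersized COV-C classes were removed) are assumptions, not a description of the released code.

```python
import random
from collections import defaultdict


def sample_per_class(examples, n_per_class, seed=0):
    """Return exactly `n_per_class` randomly chosen (text, label) pairs
    for every label; labels with too few examples are dropped."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    sampled = []
    for label in sorted(by_label):  # sorted for reproducibility
        pool = by_label[label]
        if len(pool) < n_per_class:
            continue  # assumption: undersized classes are removed
        sampled.extend(rng.sample(pool, n_per_class))
    return sampled
```

Fixing the seed per run makes it straightforward to repeat an experiment over several random seeds while keeping each sampled training set reproducible.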

Triplet Loss Model
For few-shot, highly-multiclass classification, a common approach is the triplet loss classifier (Schroff et al., 2015), first developed for facial recognition and now also used in NLP (dos Santos et al., 2016; Ein Dor et al., 2018; Lauriola and Moschitti, 2020). Specifically addressing few-shot classification, a triplet loss network minimizes distance between examples with the same label and maximizes distance between examples with different labels. During training, given a triplet (anchor a, positive example p, negative example n), a triplet loss network minimizes:
$$\mathcal{L}(a, p, n) = \max\big(d(a, p) - d(a, n) + \alpha,\ 0\big)$$
where α is a margin enforced between positive and negative pairs, and d(·, ·) computes the distance between the encodings of two examples. To sample triplets, we consider two strategies: random sampling, which selects triplets randomly, and hard negative mining (Schroff et al., 2015), where triplets are sampled such that d(a, p) + α > d(a, n). At evaluation time, a triplet loss classifier returns the class of the example in the training set with the smallest distance to a given test example.
Indeed, both triplet loss and data augmentation target training with limited data, and so combining them seems particularly promising for the few-shot classification scenario.
For our model, we use standard BERT-base with average-pooled encodings and then train a two-layer triplet loss network on top of these encodings. Our triplet loss network architecture contains a linear layer with 200 hidden units, tanh activation, a dropout layer with p = 0.4, and a final linear layer with 40 hidden units. We use cosine distance, a margin of α = 0.4, a batch size of 64 triplets, and a learning rate of 2 × 10⁻⁵.
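To make the pieces above concrete, the following dependency-free sketch computes the cosine-distance triplet loss with margin α = 0.4, checks the hard-negative condition d(a, p) + α > d(a, n), and performs nearest-neighbor prediction over training encodings. It operates on plain lists of floats standing in for the network's output encodings; all function names are hypothetical, and vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.4):
    """max(d(a, p) - d(a, n) + margin, 0) with cosine distance."""
    return max(
        cosine_distance(anchor, positive)
        - cosine_distance(anchor, negative)
        + margin,
        0.0,
    )


def is_hard_triplet(anchor, positive, negative, margin=0.4):
    """True when d(a, p) + margin > d(a, n), i.e. the loss is non-zero."""
    return (cosine_distance(anchor, positive) + margin
            > cosine_distance(anchor, negative))


def predict(test_vec, train_vecs, train_labels):
    """Label of the training example closest to `test_vec`."""
    distances = [cosine_distance(test_vec, v) for v in train_vecs]
    return train_labels[distances.index(min(distances))]
```

A triplet whose negative is already farther from the anchor than the positive by at least the margin contributes zero loss, which is why hard negative mining restricts sampling to triplets that still violate the margin.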

Augmentation Techniques
In §4.1-4.4, we use EDA (Wei and Zou, 2019), a popular combination of token-level augmentation techniques (synonym replacement, random insertion, random swap, random deletion) whose temperature parameter 0 ≤ τ ≤ 1 is defined as the fraction of perturbed tokens. We explore four other techniques in §4.5.
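As an illustration of how τ controls the amount of noise, the sketch below implements simplified versions of two of the four EDA operations, random swap and random deletion. It is not the released EDA code: the function names and the exact way τ is converted into a number of swaps or a deletion probability are assumptions, and the two WordNet-based operations are omitted because they need an external synonym resource.

```python
import random


def random_swap(tokens, tau, rng):
    """Swap about tau * len(tokens) randomly chosen pairs of positions."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    n_swaps = int(tau * len(tokens))
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, tau, rng):
    """Delete each token independently with probability tau,
    always keeping at least one token of a non-empty input."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= tau]
    return kept if kept else [rng.choice(tokens)]
```

With τ = 0 both functions return the input unchanged, which matches the reading of τ = 0.0 as no augmentation.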

Schedules
For the two-stage curriculum, we start by training on original data only, and when validation loss converges, we introduce augmented data of fixed temperature at an augmented to original data ratio of 4:1. For the gradual curriculum, we begin with a temperature of τ = 0.0 (equivalent to no augmentation) and then linearly increase the temperature by 0.1 every time validation loss plateaus, up to a final temperature of 0.5. Schedules for each dataset are shown in the Appendix. Figure 2 shows an example training plot with our proposed curriculum schedules.
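The schedule logic described here has three ingredients: detecting a plateau, stepping the temperature, and mixing augmented with original data at 4:1. The sketch below shows one way these could fit together. The plateau rule (`patience`, `min_delta`) is a hypothetical stand-in, since the schedules in this paper were set by inspecting training plots rather than by an automatic criterion, and `augment` is a placeholder for any function taking an example and a temperature.

```python
def has_plateaued(losses, patience=3, min_delta=1e-3):
    """Hypothetical rule: the best of the last `patience` losses is not
    at least `min_delta` better than the best loss seen before them."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent > best_before - min_delta


def next_temperature(tau, step=0.1, max_tau=0.5):
    """Raise tau by `step`, capped at the final temperature."""
    return min(round(tau + step, 10), max_tau)


def build_training_pool(originals, augment, tau, aug_per_original=4):
    """Original examples plus `aug_per_original` augmented copies of
    each, giving a 4:1 augmented-to-original ratio by default."""
    pool = list(originals)
    if tau == 0.0:
        return pool  # first stage: original data only
    for example in originals:
        pool.extend(augment(example, tau) for _ in range(aug_per_original))
    return pool
```

At a 4:1 ratio, one in five examples in the pool is original, so original data makes up 20% of what the model sees in every augmented stage.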

Curriculum Data Augmentation
This section compares no augmentation, standard augmentation, and curriculum augmentation for triplet loss networks using two different triplet sampling strategies. Table 1 summarizes these results for five random seeds. We also implement a cross-entropy loss classifier for reference.
For triplet loss using random sampling, a model with no augmentation achieved a mean accuracy across our four datasets of 30.2%, and standard augmentation improved performance noticeably by +1.9%. Two-stage curriculum augmentation, which trains for the same number of updates as standard augmentation, achieved a mean accuracy of 32.4%, outperforming standard augmentation by +0.5%. The gradual curriculum further improved +1.0% over the two-stage curriculum. For triplet loss with hard negative mining, standard augmentation substantially improved +3.0% over no augmentation, as adding in augmented data, which is more difficult to classify, likely helped generate a more diverse set of hard negatives. The two-stage curriculum still maintained a small improvement over standard augmentation here, and the gradual curriculum provided an even stronger boost of +4.4% over no augmentation, possibly because increasing the temperature of augmented data over time facilitated hard-negative mining more so than using a constant temperature.
Table 1: Accuracy (%) on four diverse highly multiclass classification tasks for no augmentation, standard augmentation, and curriculum augmentation. c: number of classes; ∆: improvement compared with no augmentation.
Method | HUFF | FEWREL | COV-C | AMZN | Avg. | ∆
Triplet loss with hard negative mining | 21.0 ± 1.2 | 44.6 ± 1.2 | 39.5 ± 1.0 | 16.2 ± 0.9 | 30.3 | -
+ standard data augmentation | 22.6 ± 1.8 | 45.0 ± 1.6 | 48.2 ± 0.9 | 17.4 ± 1.7 | 33.3 | +3.0
+ curriculum data augmentation: two-stage | 22.6 ± 1.8 | 45.7 ± 1.4 | 47.6 ± 1.3 | 17.9 ± 1.1 | 33.5 | +3.2
+ curriculum data augmentation: gradual | 23.8 ± 0.9 | 47.1 ± 1.4 | 48.9 ± 0.9 | 18.9 ± 0.9 | 34.7 | +4.4
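The averages and ∆ values in the hard-negative-mining rows of Table 1 can be checked with simple arithmetic. The short script below recomputes them from the per-dataset accuracies; the dictionary keys are shorthand labels chosen for this illustration. Each recomputed mean agrees with the reported average to within rounding at one decimal place, and the largest single-dataset gain (third column, gradual curriculum versus no augmentation) comes out to the +9.4 quoted for COV-C.

```python
# Per-dataset accuracies (%) from the hard-negative-mining rows of Table 1.
rows = {
    "no augmentation": [21.0, 44.6, 39.5, 16.2],
    "standard": [22.6, 45.0, 48.2, 17.4],
    "two-stage": [22.6, 45.7, 47.6, 17.9],
    "gradual": [23.8, 47.1, 48.9, 18.9],
}
reported_avg = {
    "no augmentation": 30.3,
    "standard": 33.3,
    "two-stage": 33.5,
    "gradual": 34.7,
}

# Mean accuracy across the four datasets for each method.
means = {name: sum(vals) / len(vals) for name, vals in rows.items()}

# Improvement of each reported average over the no-augmentation baseline.
baseline = reported_avg["no augmentation"]
deltas = {name: round(avg - baseline, 1) for name, avg in reported_avg.items()}

# Gain on the third dataset column from the gradual curriculum.
third_column_gain = round(rows["gradual"][2] - rows["no augmentation"][2], 1)

for name in rows:
    print(f"{name:16s} mean={means[name]:.3f} delta={deltas[name]:+.1f}")
```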
Notably, the largest gains for all augmentation types were on COV-C (up to +9.4%). We hypothesize that this occurred not necessarily because of COV-C's smaller data size; rather, there was likely more overfitting to be mitigated by data augmentation as a result of the greater semantic difference between COV-C and the corpus used to pre-train BERT, compared with the other three datasets.
Figure 4: Curriculum data augmentation outperforms standard data augmentation for a range of different augmentation temperatures τ. Whereas standard augmentation performs better at lower τ, curriculum data augmentation helps even for higher τ (e.g., τ ≥ 0.2).

Ablation: Dataset Size
This ablation investigates how data augmentation performs for different dataset sizes. Figure 3 shows these results for hard negative mining averaged over HUFF and FEWREL, our two datasets where sufficient data is available. The two-stage curriculum outperformed standard augmentation by a small margin, although both dropped in performance at N_c = 20, consistent with prior findings on the diminished effect of data augmentation for larger datasets (Xie et al., 2020; Andreas, 2020). The gradual curriculum, on the other hand, maintained relatively robust improvement for all dataset sizes explored.

Ablation: Augmentation Temperature
Effective curriculum learning necessitates a range of difficulty in training data. In our case, this range is controlled by augmentation temperature, a parameter that dictates how perturbed augmented examples are and therefore affects the distribution of difficulty in training examples. When the range of difficulty in the data is larger, we should expect a greater improvement from curriculum learning. Figure 4 compares standard and two-stage curriculum augmentation for various temperatures, with results averaged over all four datasets. At low temperatures, augmented examples remained similar to original examples, so the range of difficulty was small and curriculum learning showed little improvement. At higher temperatures, however, augmented examples became quite different from original examples, so the range of difficulty was much larger and curriculum data augmentation improved more over standard augmentation. Whereas Wei and Zou (2019) recommend τ ∈ {0.05, 0.1}, our curriculum framework allows us to use much larger τ and maintain relatively robust improvements even at τ ∈ {0.4, 0.5}, when standard augmentation is no longer useful.
Table 2: Gradual curriculum augmentation with three schedules. Curriculum: temperature τ increases. Control: τ is randomly selected every fifty updates. Anti: decreasing τ. Results are shown for ten seeds.

Ablation: Curriculum Schedules
The gradual curriculum linearly increases temperature τ from 0.0 to 0.5 in six stages, and so to isolate the effect of this curriculum, in this section we compare it with a control schedule (where the τ in each stage is decided randomly) and an anti-curriculum schedule (where τ linearly decreases from 0.5 to 0.0 in six stages). As expected, these results, shown in Table 2, indicate that the curriculum contributes substantively over the control schedule.
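The three schedules compared in this ablation differ only in the order in which temperatures are presented. The sketch below spells them out. It assumes that the control schedule draws τ uniformly from the same six values used by the curriculum, which the text does not state explicitly, and the function names are hypothetical.

```python
import random

# The six stage temperatures of the gradual curriculum.
TAUS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]


def stage_schedule(kind):
    """Per-stage temperatures for the curriculum or anti-curriculum."""
    if kind == "curriculum":
        return list(TAUS)            # 0.0 -> 0.5
    if kind == "anti":
        return list(reversed(TAUS))  # 0.5 -> 0.0
    raise ValueError(f"unknown schedule kind: {kind!r}")


def control_schedule(total_updates, rng, every=50):
    """Control: a fresh tau is drawn every `every` updates.
    Assumption: draws are uniform over the same six values."""
    return [rng.choice(TAUS) for _ in range(0, total_updates, every)]
```

Because all three schedules expose the model to the same set of temperatures, any difference between them can be attributed to ordering rather than to the amount of noise seen.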

For Various Augmentation Techniques
As our experiments so far have focused on EDA augmentation (Wei and Zou, 2019), this section explores other common techniques in the curriculum framework: (1) Token Substitution replaces words with WordNet synonyms (Zhang et al., 2015); (2) Pervasive Dropout applies word-level dropout with probability p = 0.1 (Sennrich et al., 2016a); (3) SwitchOut replaces a token with a random token uniformly sampled from the vocabulary; and (4) Round-Trip Translation translates text into another language and then back into the original language (Sennrich [...]). EDA improved performance the most, perhaps because it combines four token perturbation functions, creating more diverse noise compared with using a single operation.
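Of the techniques listed, uniform token replacement is the easiest to sketch without external resources. The function below replaces each token independently with probability τ by a token drawn uniformly from a vocabulary. This is a simplification for illustration, not the SwitchOut algorithm as published, and the function name and toy vocabulary are hypothetical.

```python
import random


def uniform_replacement(tokens, tau, vocab, rng):
    """Replace each token with probability tau by a token sampled
    uniformly from `vocab`; sequence length is preserved."""
    return [rng.choice(vocab) if rng.random() < tau else tok
            for tok in tokens]


# Toy usage with a hypothetical vocabulary.
rng = random.Random(0)
sentence = "how long does the virus survive on surfaces".split()
toy_vocab = ["mask", "test", "travel", "school", "vaccine"]
noised = uniform_replacement(sentence, 0.3, toy_vocab, rng)
```

Unlike deletion or swapping, replacement keeps the token count fixed, so τ maps directly onto the expected fraction of perturbed positions.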

Related Work and Conclusions
Our work combines curriculum learning, data augmentation, and triplet loss, and is inspired by prior work in these areas. In vision, several papers have proposed reinforcement learning policies for data augmentation (Cubuk et al., 2019; Ho et al., 2019), and hard negative mining (Schroff et al., 2015; Song et al., 2016) itself can be seen as a form of curriculum learning. In NLP, the work of Kumar et al. (2019) is perhaps most similar to ours; they show that sampling strategies are key for improving performance with triplet loss networks. We see our work as the first to explicitly analyze curriculum learning for data augmentation in text.
In closing, we have proposed a curriculum data augmentation framework that is simple yet provides empirical performance improvements, a compelling case for the combination of ideas explored. Our approach exemplifies how data augmentation can create an artificial range of example difficulty that is helpful for curriculum learning, a direction that potentially warrants future research.

A Appendix
Table 3 shows the training schedules for single-stage, two-stage curriculum, and gradual curriculum training.
All models in standard single-stage training (with and without augmentation) for the same dataset trained for the same number of updates; convergence typically took longer with augmentation compared to without augmentation.
Curriculum two-stage training employs a first stage of only original data and a second stage of augmented data, using the same number of updates as single-stage training in total. We determined the number of updates in the first stage based on when training loss plateaued in the training plot for training with no augmentation.
The gradual curriculum starts with one stage of training with original data only and then increases the augmentation temperature by 0.1 in each of the following five stages. To determine the number of updates in each stage, we examined training plots in preliminary experiments and increased the augmentation temperature (i.e., began the next stage) whenever training loss plateaued. Since our preliminary experiments already showed relatively strong performance improvements, we did not perform an extensive hyperparameter search or experiment with automatic scheduling, which could further improve performance. As the gradual curriculum trains on a more diverse set of augmented data, more updates are needed than in the single-stage and two-stage schedules.
We evaluate our models every 200 updates for COV-C and every 300 updates for HUFF, FEWREL, and AMZN, reporting the highest validation accuracy achieved during training.
In all models, we include 20% original data whenever augmented data is used, in order to prevent catastrophic forgetting.