Curriculum Learning for Natural Language Understanding

With the great success of pre-trained language models, the pretrain-finetune paradigm has become the dominant solution for natural language understanding (NLU) tasks. At the fine-tuning stage, target task data is usually introduced in a completely random order and all examples are treated equally. However, examples in NLU tasks can vary greatly in difficulty, and, similar to the human learning process, language models can benefit from an easy-to-difficult curriculum. Based on this idea, we propose our Curriculum Learning approach. By reviewing the training set in a crossed way, we are able to distinguish easy examples from difficult ones and arrange a curriculum for language models. Without any manual model architecture design or use of external data, our Curriculum Learning approach obtains significant and universal performance improvements on a wide range of NLU tasks.


Introduction
Natural Language Understanding (NLU), which requires machines to understand and reason with human language, is a crucial yet challenging problem. Recently, language model (LM) pre-training has achieved remarkable success in NLU. Pre-trained LMs learn universal language representations from large-scale unlabeled data and can be fine-tuned with only minor adjustments to adapt to various NLU tasks, showing consistent and significant improvements on these tasks (Radford et al., 2018; Devlin et al., 2018).
While much attention has been devoted to designing better pre-training strategies (Raffel et al., 2019), it is also valuable to explore how to solve downstream NLU tasks more effectively in the fine-tuning stage. Most current approaches perform fine-tuning in a straightforward manner, i.e., all training examples are treated equally and presented in a completely random order during training. However, even within the same NLU task, training examples can vary significantly in difficulty, with some easily solvable via simple lexical clues while others require sophisticated reasoning. Table 1 shows some examples from the SST-2 sentiment classification task (Socher et al., 2013), which identifies the sentiment polarity (positive or negative) of movie reviews. The easy cases can be solved directly by identifying sentiment words such as "comfortable" and "unimaginative", while the hard ones further require reasoning over negations or verb qualifiers like "supposedly" and "occasionally". Extensive research suggests that presenting training examples in a meaningful order, starting from easy ones and gradually moving on to hard ones, benefits the learning process, not only for humans but also for machines (Skinner, 1958; Elman, 1993; Peterson, 2004; Krueger and Dayan, 2009).
Such an organization of learning material in the human learning process is usually referred to as a curriculum. In this paper, we draw inspiration from similar ideas and propose our approach for arranging a curriculum when learning NLU tasks. Curriculum Learning (CL) was first proposed by (Bengio et al., 2009) in the machine learning area, where a definition of easy examples is established in advance and an easy-to-difficult curriculum is arranged accordingly for the learning procedure. Recent developments have successfully applied CL in computer vision (Jiang et al., 2017; Guo et al., 2018; Hacohen and Weinshall, 2019). These works observe that by excluding the negative impact of difficult or even noisy examples in the early training stage, an appropriate CL strategy can guide learning towards better local minima in parameter space, especially for highly non-convex deep models. We argue that language models like the Transformer, which are hard to train (Popel and Bojar, 2018), should also benefit from CL in the context of learning NLU tasks, an idea that still remains unexplored.
The key challenge in designing a successful CL strategy lies in how to define easy and difficult examples. One straightforward way is to predefine difficulty with hand-crafted rules derived from the particular target task's formulation or training data structure (Guo et al., 2018; Platanios et al., 2019; Tay et al., 2019). For example, (Bengio et al., 2009) trained on an easier version of a shape recognition training set, composed of less varied shapes, before training on the complex one started. More recently, (Tay et al., 2019) used the paragraph length of a question answering example as a proxy for its difficulty. However, such strategies are highly dependent on the target dataset itself and often fail to generalize to different tasks.
To address this challenge, we propose our Cross Review method for evaluating difficulty. Specifically, we define easy examples as those well solved by the exact model that we will employ on the task. For different tasks, we adopt their corresponding golden metrics to calculate a difficulty score for each example in the training set. Based on these difficulty scores, we then design a re-arranging algorithm that constructs the learning curriculum in an annealing style, providing the model a soft transition from easy to difficult. In general, our CL approach is not tied to any particular task and does not rely on human heuristics about the task or dataset.
Experimental results show that our CL approach greatly helps language models learn during the fine-tuning stage. Without any task-tailored model architecture design or use of external data, we obtain significant and universal improvements on a wide range of downstream NLU tasks. Our contributions can be summarized as follows: • We explore and demonstrate the effectiveness of CL in the context of fine-tuning LMs on NLU tasks. To the best of our knowledge, this is among the first works to show that a CL strategy holds broad promise for learning NLU tasks.
• We propose a novel CL framework consisting of a Difficulty Evaluation method and a Curriculum Arrangement algorithm, which requires no human pre-design and generalizes readily across tasks.
• We obtain universal performance gains on a wide range of NLU tasks including Machine Reading Comprehension (MRC) and Natural Language Inference. The improvements are especially significant on the more challenging tasks.

Preliminaries
We describe our CL approach using BERT (Devlin et al., 2018), the most influential pre-trained LM, which achieved state-of-the-art results on a wide range of NLP tasks. BERT is pretrained with the Masked Language Model and Next Sentence Prediction tasks on large-scale corpora. It consists of a stack of $l$ self-attention layers, which takes as input a sequence of no more than 512 tokens and outputs a contextual representation for the token at position $i$ as an $H$-dimensional vector, denoted $h_i^l \in \mathbb{R}^H$. In natural language understanding tasks, the input sequence usually starts with the special token [CLS] and ends with [SEP]; for sequences consisting of two segments, as in pairwise sentence tasks, another [SEP] is added in between as a separator.
For target benchmarks, we employ a wide range of NLU tasks, including machine reading comprehension, sequence classification, and pairwise text similarity. Following (Devlin et al., 2018), we adapt BERT to NLU tasks in the most straightforward way: simply add one necessary linear layer on top of the final hidden outputs, then finetune the entire model altogether. Specifically, we briefly describe the configurations and corresponding golden metrics of the tasks employed in our algorithms as follows:

Machine Reading Comprehension In this work we consider the extractive MRC task. Given a passage $P$ and a corresponding question $Q$, the goal is to extract a continuous span $(p_{start}, p_{end})$ from $P$ as the answer $A$, where $p_{start}$ and $p_{end}$ are its boundaries.
We pass the concatenation of the question and paragraph, [CLS] Q [SEP] P [SEP], to the pretrained LM and use a linear classifier on top of it to predict the answer span boundaries.
For the $i$-th input token, the probabilities that it is the start or the end of the answer span are calculated with two parameter vectors $W_{start}, W_{end} \in \mathbb{R}^H$ followed by a softmax over all positions:

$$p_i^{start} = \frac{e^{W_{start} \cdot h_i^l}}{\sum_k e^{W_{start} \cdot h_k^l}}, \qquad p_i^{end} = \frac{e^{W_{end} \cdot h_i^l}}{\sum_k e^{W_{end} \cdot h_k^l}}$$

The training objective is the negative log-likelihood of the true start and end positions $y_{start}$ and $y_{end}$:

$$loss = -\left(\log p_{y_{start}}^{start} + \log p_{y_{end}}^{end}\right)$$

For unanswerable questions, an unanswerable score is calculated from the [CLS] representation as $s_{un} = p_{cls}^{start} + p_{cls}^{end}$. We classify a question as unanswerable when $s_{un} > s_{i,j} = \max_{i \le j}(p_i^{start} + p_j^{end})$. F1 is used as the golden metric.
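The unanswerable decision above can be sketched in plain Python (a minimal illustration over already-computed probability lists; all function and variable names are ours, not the paper's):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]

def best_span_score(p_start, p_end):
    # s_{i,j} = max_{i <= j} (p_start[i] + p_end[j]),
    # skipping position 0, which holds the [CLS] token
    best = float("-inf")
    for i in range(1, len(p_start)):
        for j in range(i, len(p_end)):
            best = max(best, p_start[i] + p_end[j])
    return best

def is_unanswerable(p_start, p_end):
    # s_un = p_start[CLS] + p_end[CLS]; unanswerable if it beats the best span
    s_un = p_start[0] + p_end[0]
    return s_un > best_span_score(p_start, p_end)
```

For example, when most probability mass sits on the [CLS] position, `is_unanswerable` returns `True`; when a real span dominates, it returns `False`.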

Sequence Classification
We take the final contextual embedding of the [CLS] token, $h_0^l$, as the pooled representation of the whole input sequence $S$. The probability that the input sequence belongs to label $c$ is calculated by a linear output layer with parameter matrix $W_{SC} \in \mathbb{R}^{K \times H}$ followed by a softmax:

$$p(c \mid S) = \mathrm{softmax}(W_{SC} \, h_0^l)_c$$

where $K$ is the number of classes. The negative log-likelihood is also used as the training objective for this task. Accuracy is considered the golden metric.
Pairwise Text Similarity Similar to the sequence classification task, the final embedding of the [CLS] token, $h_0^l$, is used to represent the input text pair $(T_1, T_2)$. A parameter vector $W_{PTS} \in \mathbb{R}^H$ is introduced to compute the similarity score:

$$sim(T_1, T_2) = W_{PTS} \cdot h_0^l$$

For this task, we use Mean Squared Error (MSE) as both the training objective and the golden metric:

$$loss = \left( sim(T_1, T_2) - y \right)^2$$

where $y$ is the continuous similarity label.

Figure 1: Our Cross Review method: the target dataset is split into N meta-datasets; after the teachers are trained on them, each example is inferenced by all teachers other than the one trained on its own meta-dataset, and the scores are summed as the final evaluation result.
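Both task heads described above are thin linear layers over the [CLS] vector; a minimal sketch in plain Python (names are illustrative, not from the paper):

```python
import math

def classify(h_cls, W_sc):
    # p(c | S) = softmax(W_sc @ h_cls): one logit per class from [CLS]
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in W_sc]
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]

def similarity(h_cls, w_pts):
    # scalar similarity score: w_pts @ h_cls
    return sum(w * h for w, h in zip(w_pts, h_cls))

def mse_loss(pred, label):
    # MSE serves as both training objective and golden metric here
    return (pred - label) ** 2
```

In practice these heads are the only task-specific parameters added on top of BERT before fine-tuning.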

Our CL Approach
We decompose our CL framework into two stages: Difficulty Evaluation and Curriculum Arrangement. For any target task, let $D$ be the set of examples used for training, and $\Theta$ the language model expected to fit $D$. In the first stage, the goal is to assign each example $d_j$ in $D$ a score $c_j$ reflecting its difficulty with respect to the model; we denote by $C$ the set of difficulty scores corresponding to the training set $D$. In the second stage, based on these scores, $D$ is organized into a sequence of ordered learning stages $\{S_i : i = 1, 2, \ldots, N\}$ in an easy-to-difficult fashion, resulting in the final curriculum on which the model is trained. We elaborate on these two stages in sections 3.1 and 3.2 respectively.

Difficulty Evaluation
The difficulty of a textual example shows itself in many ways, e.g., the length of the context, the usage of rare words, or the scale of the learning target. Although such heuristics seem reasonable to humans, the model itself may not see things the same way. We therefore argue that the difficulty score, being an intrinsic property of an example, should be decided by the model itself, and that the best measure is the golden metric of the target task, which can be accuracy, F1 score, etc., as introduced in section 2.
To perform difficulty evaluation, we first split our training set $D$ uniformly into $N$ shares $\{D_i : i = 1, 2, \ldots, N\}$ and train $N$ corresponding models $\{\Theta_i : i = 1, 2, \ldots, N\}$ on them, all identical to $\Theta$ (note that each model $\Theta_i$ only sees $1/N$ of the entire training set). We refer to these $N$ models as teachers, and to $\{D_i\}$ as meta-datasets, since they serve only to collect information (i.e., the extent of difficulty) about the original training set $D$. This preparation of teachers can be formulated as:

$$\Theta_i = \arg\min_{\Theta} \sum_{d_j \in D_i} L(\Theta; d_j)$$

where $L$ denotes the loss function. After every teacher has been trained on its meta-dataset, the evaluation of the training set $D$ begins. Each example $d_j$ is included in one and only one meta-dataset, say $D_k$; we then run inference on $d_j$ with all teachers except teacher $k$, because teacher $k$ has already seen $d_j$ during training, so its judgment is not isolated from $D_k$. After all inferences are finished, we calculate the scores of $d_j$ under the target task's metric, resulting in $N - 1$ scores from $N - 1$ different teachers:

$$c_{ji} = M(\Theta_i(x_j), y_j), \quad i \neq k$$

where $\Theta_i(\cdot)$ is the inference function, $x_j$ and $y_j$ are the input and label of example $d_j$ respectively, $M$ is the metric (F1, Accuracy, or MSE depending on the task, as introduced in section 2), and $c_{ji}$ is the score of $d_j$ from teacher $\Theta_i$. Finally, we define the difficulty score of $d_j$ as the integration of all $N - 1$ scores:

$$c_j = \sum_{i \neq k} c_{ji}$$

With all scores calculated, we obtain the final difficulty score set $C$. We refer to our difficulty evaluation method as Cross Review (see Fig. 1). The teacher models perform their inferences in a crossed way, which prevents each meta-dataset from contaminating its inference set. Moreover, since each example receives its score from multiple teachers, the fluctuation of the evaluation results is greatly alleviated. In general, our Cross Review method addresses the difficulty evaluation problem with an elegant design.
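The Cross Review procedure can be sketched end to end as follows (a simplified illustration: the `train` and `metric` callables stand in for fine-tuning a teacher model and computing the golden metric, and all names are ours, not the paper's):

```python
import random

def cross_review(dataset, n, train, metric, seed=0):
    """Score each example with teachers trained on the other meta-datasets.

    dataset: list of (x, y) pairs; train: maps a list of examples to a
    predict function; metric: (prediction, y) -> score, e.g. 0/1 accuracy.
    """
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)
    # split uniformly into n meta-datasets
    shards = [order[i::n] for i in range(n)]
    teachers = [train([dataset[j] for j in shard]) for shard in shards]
    scores = [0.0] * len(dataset)
    for k, shard in enumerate(shards):
        for j in shard:
            x, y = dataset[j]
            # sum scores from every teacher except the one whose
            # meta-dataset contains this example
            scores[j] = sum(metric(teachers[i](x), y)
                            for i in range(n) if i != k)
    return scores
```

With a perfect toy predictor, every example receives the maximum score of N − 1, while in practice harder examples are solved by fewer teachers and thus score lower.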

Curriculum Arrangement
In this section we describe our method for arranging the training examples $D$ into a learning curriculum according to their difficulty scores $C$. We design our curriculum in a multi-stage setting. The sampling algorithm is built on the following principle: the proportion of difficult examples in each stage should start at 0 and gradually increase until it reaches its share of the original dataset distribution.
We first sort all examples by their difficulty scores $C$ and divide them into $N$ buckets $\{C_i : i = 1, 2, \ldots, N\}$, so that the examples are collected into $N$ different levels of difficulty, ranging from $C_1$ (the easiest) to $C_N$ (the hardest), with the proportion distribution:

$$\left\{ \frac{|C_i|}{|D|} : i = 1, 2, \ldots, N \right\}$$

For tasks with discrete metrics, this distribution is naturally formed by the difficulty score hierarchy and directly reflects the intrinsic difficulty distribution of the dataset. For other tasks, we manually divide $C$ uniformly. Based on these buckets, we construct the learning curriculum one stage after another. For each learning stage $S_i$, we sample examples from all antecedent buckets $\{C_j : j = 1, 2, \ldots, i\}$, each contributing

$$|S_i \cap C_j| = \frac{|C_j|}{N}, \quad j \le i$$

examples, and the final curriculum $\{S_i : i = 1, 2, \ldots, N\}$ is formed as such. We refer to this arrangement algorithm as the Annealing method, since it provides a soft transition through multiple learning stages. At each stage, the model is trained for one epoch. By the time training reaches $S_N$, the model should be ready for the original distribution of the training set $D$, so we finally add another stage $S_{N+1}$ which covers the entire training set, and the model is trained on it until convergence.
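Under one reading of the Annealing method (assuming each antecedent bucket contributes roughly |C_j|/N examples per stage, which matches the stated principle but is our assumption, not a quote), the arrangement can be sketched as:

```python
import random

def annealing_curriculum(buckets, seed=0):
    """Arrange difficulty buckets C_1..C_N (easiest first) into stages.

    Each stage S_i samples |C_j| // N examples from every antecedent
    bucket C_j (j <= i), so hard examples enter gradually; an extra
    final stage covers the entire training set.
    """
    rng = random.Random(seed)
    n = len(buckets)
    stages = []
    for i in range(1, n + 1):
        stage = []
        for j in range(i):
            quota = len(buckets[j]) // n
            stage.extend(rng.sample(buckets[j], quota))
        rng.shuffle(stage)  # mix difficulty levels within a stage
        stages.append(stage)
    # S_{N+1}: the whole training set, matching the original distribution
    stages.append([ex for b in buckets for ex in b])
    return stages
```

With two buckets of ten examples each, the first stage holds five easy examples, the second holds five easy plus five hard, and the final stage holds all twenty.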

Datasets
In this section we briefly describe three popular NLU benchmarks on which we evaluate our CL approach: SQuAD 2.0 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2016), and GLUE (Wang et al., 2018); their scales and metrics are detailed in Table 2.

SQuAD The Stanford Question Answering Dataset (SQuAD), constructed from Wikipedia articles, is a well-known extractive machine reading comprehension dataset with two versions: SQuAD 1.1 (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018). The latest 2.0 version also introduces unanswerable questions, making it a more challenging and practical task. In this paper, we take SQuAD 2.0 as our testbed.
NewsQA NewsQA (Trischler et al., 2016) is also an extractive MRC dataset but is much more challenging, with human performance at a 0.694 F1 score. NewsQA was collected from CNN news articles with two sets of crowdworkers: the "questioners" are provided with the article's headline only, while the "answerers" must find the answer in the full article. Following (Fisch et al., 2019), we discard examples flagged as lacking annotator agreement for better evaluation.
GLUE The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of nine diverse sentence or sentence-pair language understanding tasks, including sentiment analysis, textual entailment, and sentence similarity. It is considered a well-designed benchmark that evaluates the generalization and robustness of NLU algorithms. The labels for the GLUE test sets are hidden, and users must upload their predictions to obtain evaluation results; submissions are limited to protect the test sets from overfitting.

Experimental Setups
Table 3: Results on MRC tasks. * indicates our re-implementation.

We use BERT Large as the baseline to demonstrate the effectiveness of our CL approach. For MRC, we also test the BERT Base model for more comprehensive results. Besides reported results from the literature, we also provide our re-implementation on all datasets, which forms a more competitive baseline for comparison. The only difference between our re-implementation and our CL approach is the arrangement of the curriculum, i.e., the order of training examples. To obtain more comparable and stable difficulty scores, we binarize the review results before summing them where possible: for Accuracy, the score $c_{ji}$ is already binary at the instance level; for F1, we count any review result $c_{ji} > 0$ as correct; for other continuous metrics (MSE in this paper), we sum $c_{ji}$ directly. We empirically choose $N = 10$ meta-datasets for most tasks (which is also the number of difficulty levels and of learning stages); for three datasets of rather limited scale (RTE, MRPC, and STS-B), we set $N = 3$. The scale of all datasets employed in this work is provided in Table 2. Intuitively, better results could be obtained by searching for the best $N$; we leave this to future work due to limited computational resources.
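The binarization rule described above can be sketched as a small helper (illustrative only; the `metric_type` labels are ours):

```python
def binarize_reviews(raw_scores, metric_type):
    """Binarize per-teacher review results before summing.

    Accuracy results are already 0/1, and for F1 any positive overlap
    counts as correct; continuous metrics such as MSE are kept as-is.
    """
    if metric_type in ("accuracy", "f1"):
        return [1 if s > 0 else 0 for s in raw_scores]
    return list(raw_scores)
```

The summed binary reviews then yield an integer difficulty score between 0 and N − 1 for each example.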
We implement our approach based on the PyTorch implementation of BERT (Wolf et al., 2019), with the optimizer's epsilon set to 1e-8. The learning rate warms up over the first 5% of steps and then decays linearly to 0 in all experiments. To build our re-implementation, on both SQuAD 2.0 and NewsQA we perform a hyperparameter search with batch size in {16, 32} and learning rate in {1e-5, 2e-5, 3e-5, 4e-5} for the Base model, and batch size in {32, 48, 64} and learning rate in {5e-5, 6e-5, 7e-5} for the Large model. We reuse the best SQuAD 2.0 setting on NewsQA. We set the maximum input sequence length to 512 for NewsQA because its paragraphs are much longer. On GLUE, we run the experiments on the Large model with batch size in {16, 32} and learning rate in {1e-5, 2e-5, 3e-5}.

Table 4: Results on the GLUE benchmark. * indicates our re-implementation; baselines on dev sets are obtained from previous literature, and baselines on test sets are obtained from the leaderboard (https://gluebenchmark.com/leaderboard) as submitted by (Devlin et al., 2018), which may use different hyperparameters. All results are produced with a single task and a single model.

MRC Results
The results for the MRC tasks are presented in Table 3. In all experiments, our CL approach outperforms its baseline by a considerable margin. On SQuAD 2.0, we obtain +1.30 EM / +1.15 F1 improvements with the Base model and +0.31 EM / +0.57 F1 with the Large model, compared to our competitive re-implemented baseline. Note that the performance gain is more significant with the Base model. On NewsQA, we likewise obtain +0.02 EM / +0.47 F1 and +0.10 EM / +0.30 F1 improvements for the Base and Large models respectively.

GLUE Results
We summarize our GLUE results in Table 4. Results on the dev sets show that our CL method consistently outperforms the competitive baseline on all 8 tasks, which indicates that our CL approach is not only robustly effective but also generalizes across a wide range of NLU tasks. Because the model architecture and hyperparameter settings are identical, all performance gains can be attributed to our CL approach alone. Specifically, we observe that our CL approach does better on more challenging tasks. For CoLA and RTE, the margins reach +3.3 and +1.8 in their respective metrics, relatively larger than on less challenging tasks where model performance has already reached a plateau. Such results are understandable: when learning harder tasks, the model can be overwhelmed by very difficult examples in the early stages, so a well-arranged curriculum is more helpful. And on tasks where the baseline already approaches human performance, such as SST-2, our CL approach still provides a further +0.4 improvement, demonstrating its robustness. Overall, our CL approach obtains a +0.9 average score gain on the GLUE benchmark compared to our re-implemented baseline.
Results on the test sets further demonstrate the effectiveness of our approach: we obtain a +0.4 average score gain compared to both our re-implementation and the baseline on the leaderboard.

Ablation Study
In this section, we examine our approach through a series of questions: (i) what is the best CL design strategy for NLU tasks, (ii) can Cross Review really distinguish easy examples from difficult ones, and (iii) what is the best choice of $N$. We choose the SQuAD 2.0 task for most experiments for generality, and all experiments are performed with the BERT Base model.
Comparison with Heuristic CL Methods To demonstrate our advantage over manually designed CL methods, we compare our approach with several heuristic curriculum designs in Table 5. For the Difficulty Evaluation method, we adopt word rarity, answer length, question length, and paragraph length as difficulty metrics, similar to (Tay et al., 2019; Platanios et al., 2019). We calculate word rarity as the average word frequency of the question, where frequencies are counted over all questions in the training set. We define difficult examples as those with lower word frequencies, or with longer answers, questions, or paragraphs. We first sort all examples by these metrics and divide them evenly into 10 buckets with corresponding difficulty levels, while the Curriculum Arrangement strategy remains Annealing. For the Curriculum Arrangement method, we try a Naive order for comparison: we directly use the buckets $\{C_i\}$ as the curriculum (instead of $\{S_i\}$) without any sampling algorithm, retaining only $S_{N+1}$ for fair comparison; meanwhile the Difficulty Evaluation method remains Cross Review. The results show that these intuitive designs indeed work, with improvements ranging from +0.12 to +0.76 F1, but they are all outperformed by our Cross Review + Annealing approach.

Table 5: Comparisons with heuristic CL designs (written in italics). * indicates our re-implementation; ∆ indicates absolute improvement in F1.
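The word-rarity heuristic baseline can be sketched as follows (an illustrative reading: difficulty as the average frequency of a question's words, counted over all questions in the training set; names are ours):

```python
from collections import Counter

def word_rarity(questions):
    """Average word frequency per question (lower = rarer = harder)."""
    # count word frequencies across the whole training set of questions
    freq = Counter(w for q in questions for w in q.split())
    return [sum(freq[w] for w in q.split()) / max(len(q.split()), 1)
            for q in questions]
```

Sorting questions by this score ascending would place the rarest-worded (heuristically hardest) examples last in the curriculum.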
Case Study: Easy vs. Difficult In our Cross Review method, the dataset is divided into $N$ buckets $\{C_i\}$ of different difficulty levels.
Here we further explore what the easy and difficult examples in various tasks actually look like. Earlier, in the introduction (see Table 1), we provided a straightforward illustration of easy versus hard cases in the SST-2 dataset; among the ten difficulty levels, those cases were sampled from the easiest bucket ($C_1$) and the most difficult bucket ($C_{10}$) respectively, and the contrast is clear and intuitive. We further choose SQuAD 2.0, a more complex task, for in-depth analysis. Under the $N = 10$ setting, we show the statistical distinctions between the buckets $\{C_i\}$ in Fig 2. With three monotonically increasing curves, it is very clear that difficult examples tend to involve longer paragraphs, longer questions, and longer answers. This conforms to our intuition that longer text usually involves more complex reasoning patterns and context dependency, and these challenging examples are successfully excluded from the early stages by our CL approach. Another interesting result is that the percentage of unanswerable examples drops consistently from 40% to 20% along the difficulty axis; we surmise that simply classifying a question as unanswerable is easier than extracting exact answer boundaries.
On Different Settings of N One argument that must be specified in advance in our approach is $N$, which determines the number of meta-datasets and learning stages, as well as the granularity of our difficulty scores. Assuming the metric is bounded between 0 and 1, which covers almost all cases, the difficulty score $c_j$ ranges from 0 (when all teacher models fail) to $N - 1$ (when all teacher models succeed), so the examples can be distinguished into $N$ different levels; the larger $N$ is, the finer the granularity.
To examine the impact of different settings, we perform an ablation study on SQuAD 2.0 over a wide range of choices for $N$, from 2 to 20 (see Fig 3). Under all settings our approach outperforms the baseline by at least +0.5 F1 (even for $N = 2$, where the difficulty evaluation may suffer from the fluctuation of a single-teacher review). We also experiment with an extremely large value: for $N = 100$, the result is 74.10 F1 (2.68 below our baseline), which is expected because each meta-dataset becomes too small to train a decent teacher capable of reliable evaluation. In general, our approach is very robust to the setting of $N$.

Related Works
The idea of training a neural network in an easy-to-difficult fashion can be traced back to (Elman, 1993). (Krueger and Dayan, 2009) revisited the idea from a cognitive perspective with the shaping procedure, in which a teacher decomposes a complete task into sub-components. Building on these works, Curriculum Learning was first proposed by (Bengio et al., 2009), who designed several toy experiments to demonstrate the benefits of a curriculum strategy in both image classification and language modeling. They also proposed that a curriculum can be seen as a sequence of training criteria, at the end of which the reweighting of examples should be uniform with respect to the target distribution; this inspired the design of our Curriculum Arrangement algorithm.
Although CL has been successfully applied to many areas of computer vision (Supancic and Ramanan, 2013; Chen and Gupta, 2015; Jiang et al., 2017), it was not introduced to NLU tasks until (Sachan and Xing, 2016), who experimented with several heuristics to transfer the success of CL (Kumar et al., 2010) to machine reading comprehension. (Sachan and Xing, 2018) further extended this work to question generation. More recently, (Tay et al., 2019) employed a CL strategy for reading comprehension over long narratives. Beyond these, to the best of our knowledge, few works discuss CL in the context of NLU.
In the methodology of designing CL algorithms, our approach is closely related to (Guo et al., 2018; Platanios et al., 2019; Tay et al., 2019), where a curriculum is formed in two steps: first evaluating difficulty, then sampling examples into batches accordingly. The evaluation methods vary greatly across target tasks. (Guo et al., 2018) examined examples in their feature space and defined difficulty by distribution density, which successfully distinguished noisy images; later work incorporated category information into the difficulty metric to address imbalanced data classification. In language tasks, (Platanios et al., 2019) and (Tay et al., 2019) proposed to use the length of the context as a measure of difficulty. Another line of work treats curriculum construction as an optimization problem (Kumar et al., 2010; Graves et al., 2017; Fan et al., 2018), which usually involves sophisticated design and is quite different from our approach.

Conclusion
In this work we proposed a novel Curriculum Learning approach that does not rely on human heuristics and is simple to implement. With the help of such a curriculum, language models perform significantly and universally better on a wide range of downstream NLU tasks. In the future, we look forward to extending the CL strategy to the pre-training stage, guiding deep models like the Transformer from language beginners to language experts.