Do We Need to Create Big Datasets to Learn a Task?

Deep learning research has been greatly accelerated by the development of huge datasets such as ImageNet. The general trend has been to create ever-bigger datasets to make a deep neural network learn. A huge amount of resources is spent creating these big datasets, developing models, training them, and iterating this process to dominate leaderboards. We argue that the trend of creating bigger datasets needs to be revisited by better leveraging the power of pre-trained language models. Since language models have already been pre-trained on huge amounts of data and possess basic linguistic knowledge, there is no need to create big datasets to learn a task. Instead, we need to create a dataset that is just sufficient for the model to learn task-specific terminology, such as ‘Entailment’, ‘Neutral’, and ‘Contradiction’ for NLI. As evidence, we show that RoBERTa is able to achieve near-equal performance on 2% of the SNLI data. We also observe competitive zero-shot generalization on several OOD datasets. In this paper, we propose a baseline algorithm to find the optimal dataset for learning a task.


Introduction
Large-scale datasets such as ImageNet (Russakovsky et al., 2015) in vision, and SQuAD (Rajpurkar et al., 2016) and SNLI (Bowman et al., 2015) in NLP, have accelerated our progress in deep learning. The general trend has been to create large-scale datasets for various tasks such as Abductive NLI, DROP (Dua et al., 2019), and SWAG (Zellers et al., 2018). The process of creating big datasets involves heavy investment in resources, which further increases when models are developed in response to these datasets and trained to top leaderboards. This makes deep learning research and development inaccessible to communities where resources are scarce. Additionally, the heavy computation involved in training models adversely affects the environment on a broader scale (Schwartz et al., 2019). This leads us to the question: Do we always need to create big datasets?
We probe this question with motivation from the process by which we learn a certain topic or task. Even though we have access to hundreds of materials available online, we do not need to go through all of them in order to learn a specific topic. In fact, we intentionally avoid certain materials that are noisy, distracting, or irrelevant to the topic. Humans have deep background knowledge about the world that makes this possible. With the recent developments in language modelling, pre-training on huge datasets has imparted linguistic knowledge to models like BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). With this knowledge, models need not learn everything from scratch; instead, they should just learn task-specific terminology such as 'Entailment', 'Neutral', and 'Contradiction' for NLI, which might not necessitate the use of big datasets.
There are certain other factors that recommend against creating big datasets. A growing number of recent works (Poliak et al., 2018; Geva et al., 2019; Kaushik and Lipton, 2018; Schwartz et al., 2017; Mishra et al., 2020; Bras et al., 2020) have exposed the presence of spurious bias in many popular benchmarks. Spurious bias represents unintended correlations between input and output (e.g., the word 'not' is most often associated with the label 'contradiction' (Gururangan et al., 2018)). Spurious bias makes a task easy for models, allowing them to exploit shortcuts instead of learning generalizable features as humans do. Models finetuned on these benchmarks fail to generalize in Out-of-Distribution (OOD) and adversarial settings. Since the sources of these spurious biases, data collection and crowdsourcing, are hard to control, carefully selecting a smaller and optimal dataset may be a viable alternative.

Figure 1: Proposed baseline approach to select the optimal data necessary to learn a task.
We propose a baseline in this paper to find the optimal dataset for learning a task. Our approach is inspired by the human tendency to first make a rough estimate of the presence of relevant materials by glancing at various parts of the entire set of available materials. After selecting a slice of existing materials in this first phase, humans remove redundant, easy, already-known, or possibly distracting content from the slice. Finally, they use heuristics based on their background knowledge of the task to sort the remaining content by relevance, and select the most optimal content based on the priority of the task and the time available to learn it. We utilize several recently proposed modules in our baseline.
We prune SNLI (Bowman et al., 2015) to ∼2% of its original size using our baseline. Our results show that RoBERTa, on training with this pruned set, achieves near-equal performance on the SNLI dev set and competitive zero-shot generalization on three OOD datasets: (i) NLI Diagnostics (Wang et al., 2018), (ii) Stress Tests (Naik et al., 2018), and (iii) Adversarial NLI (Nie et al., 2019). Our analysis shows that big datasets not only prevent generalization, but also impact IID test-set performance. Interestingly, we find the annotations of the pruned samples to be correct and not noisy. This indicates that certain data samples might be distracting a model by acting against the inductive bias created by the rest of the dataset. Our finding opens up the possible existence of such distractors in real datasets, encouraging the NLP community to explore the optimal selection of samples in a dataset instead of trying to dominate a leaderboard with the entire dataset.

Proposed Algorithm
We mimic the relevant-material selection process in humans to propose an algorithm for selecting the optimal dataset necessary to learn a task, as illustrated in Figure 1. We use robotics terminology (Rauch et al., 2019) to describe the two stages of learning: (i) coarse action and (ii) fine action. Algorithm 1 details our approach. We briefly explain each stage below.
Formalization: Let D represent the entire dataset, s a sample, M the model, and S the pruned set; let E(s), C(s), and P(s) be the evaluation score, correct-evaluation score, and predictability score of each sample s, respectively. In this preliminary work, we explore only the first term of DQI c1. Expanding this to the other terms is immediate future work.
Coarse Action: We start with a random subset of a% of the dataset D, train the model M on it, and calculate accuracy on the IID test set. We then iteratively append a random subset of b% of the data from the rest of D, train M on the combined data, and calculate accuracy on the test set. We continue adding b% of the data until the test-set accuracy stops increasing. L1-L8 of Algorithm 1 describe the coarse action.
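The coarse-action loop can be sketched as follows. This is only a minimal sketch of the growth loop, not the paper's implementation: `train_and_eval` is a hypothetical callback standing in for training M on a subset and returning IID test-set accuracy.

```python
import random

def coarse_action(dataset, train_and_eval, a=0.01, b=0.01):
    """Grow a random subset of `dataset` in b% increments until
    IID test-set accuracy stops increasing.

    `train_and_eval(subset)` is a hypothetical callback that trains
    the model M on `subset` and returns test-set accuracy.
    """
    random.seed(0)  # fixed seed so the random slices are reproducible
    pool = list(dataset)
    random.shuffle(pool)

    step = max(1, int(len(pool) * b))   # size of each b% increment
    size = max(1, int(len(pool) * a))   # initial a% subset
    selected = pool[:size]
    best_acc = train_and_eval(selected)

    while size < len(pool):
        candidate = pool[:size + step]
        acc = train_and_eval(candidate)
        if acc <= best_acc:             # accuracy stopped increasing: stop
            break
        best_acc, selected, size = acc, candidate, size + step
    return selected
```

The returned subset is then handed to the fine-action stage.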
Fine Action: We use two key modules, (i) AFLite (Bras et al., 2020) and (ii) DQI (Mishra et al., 2020), for fine action on the data selected after coarse action. AFLite is a recent technique for adversarial filtering of dataset biases, whereas DQI provides a method to quantify the quality of samples with or without annotation.
AFLite: In our setup, AFLite randomly selects 10% of the data (selected after coarse action) for fine-tuning M, and then discards it. It randomly partitions the remaining data into train and test sets, repeating this several times in parallel. It trains linear models (logistic regression and SVM) on the train data and evaluates them on the test data. It combines the parallel sessions by calculating a predictability score P(s) for every sample as the number of times it has been correctly predicted (C(s)) divided by the number of times it has been evaluated (E(s)). It then shortlists samples whose predictability score is greater than a threshold (tau).
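A minimal sketch of this filtering loop, assuming a `fit_predict` callback in place of the actual linear-model training (function and parameter names here are ours, not AFLite's):

```python
import random
from collections import defaultdict

def aflite_shortlist(samples, fit_predict, n_partitions=10,
                     test_frac=0.2, tau=0.75):
    """Sketch of AFLite's scoring loop. `fit_predict(train, test)`
    stands in for training a linear model on `train` and returning,
    for each test sample, whether it was predicted correctly."""
    E = defaultdict(int)  # E(s): times sample s was evaluated
    C = defaultdict(int)  # C(s): times sample s was predicted correctly

    random.seed(0)
    for _ in range(n_partitions):
        shuffled = samples[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        for s, correct in zip(test, fit_predict(train, test)):
            E[s] += 1
            C[s] += int(correct)

    # P(s) = C(s) / E(s); shortlist the highly predictable samples
    # (AFLite's candidates for pruning)
    return [s for s in samples if E[s] > 0 and C[s] / E[s] > tau]
```

In the real AFLite pipeline the partition/train/evaluate cycle runs many times with actual linear classifiers; the sketch only shows how P(s) is accumulated and thresholded.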
DQI: DQI stands for Data Quality Index. It is a compilation of various linguistic parameters related to dataset biases. It has seven components: (i) Vocabulary, (ii) Inter-Sample N-gram Frequency and Relation, (iii) Inter-Sample STS (Semantic Textual Similarity), (iv) Intra-Sample Word Similarity, (v) Intra-Sample STS, (vi) N-Gram Frequency per Label, and (vii) Inter-Split STS. These cover various possible inter/intra-sample interactions (a subset of which leads to biases) in an NLP dataset. DQI has a total of 20 subcomponents and 133 terms. A higher DQI is meant to indicate a lower presence of spurious bias and more generalizable features.
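To give a flavor of what a vocabulary-style term might measure, the following is only an illustrative stand-in of our own devising, not the published DQI c1 formula (see Mishra et al., 2020 for the actual definitions):

```python
def vocabulary_score(sentences):
    """Illustrative stand-in for a vocabulary-style quality term
    (NOT the published DQI c1 formula). A simple type-token ratio:
    higher values mean more lexical variety, which loosely tracks
    the intuition behind the Vocabulary component."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / max(1, len(tokens))
```

The real c1 component combines several such terms over the whole dataset; this toy only conveys the kind of signal involved.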

Leveraging AFLite and DQI in Fine Action:
We use DQI in the pruning step of AFLite: instead of sorting samples based on the predictability score, we sort them based on their DQI values. L9-L34 and L34-36 of Algorithm 1 explain the usage of AFLite and DQI, respectively, in fine action.
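The swap can be sketched as follows, where `dqi_score` is a hypothetical callable returning a sample's DQI value:

```python
def prune_by_dqi(samples, dqi_score, keep):
    """Fine-action pruning step with DQI in place of the predictability
    score: rank samples by DQI (higher = less spurious bias, per the
    definition above) and retain the top `keep` samples."""
    ranked = sorted(samples, key=dqi_score, reverse=True)
    return ranked[:keep]
```

Choosing different values of `keep` yields the 5k-15k pruned sets compared in the analysis.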
Analysis: Table 1 shows that IID test-set accuracy decreases after 20k samples, so we stop there and proceed to fine action with the 20k set. With fine action, we prune the 20k set further to sizes of 5k-15k, as shown in Table 2. Our results in Table 2 show that the pruned datasets achieve near-equal performance on the IID test set and competitive performance on various sections of three OOD datasets. Since we have included just the first term of DQI c1, we perform an ablation study of that specific term. Our results in Table 3 show that the first term of DQI c1 helps improve performance in most cases. Interestingly, we observe that the 20k set has lower IID test-set accuracy than the 5k, 8k, 10k, 12k, and 15k sets, as shown in Table 2. We perform a preliminary analysis of the 15k samples retained using our algorithm and observe that the 15k retained data contains 4939, 5058 and 4983

Conclusion
We propose a baseline approach to find the optimal set of samples required to learn a task. Our approach mimics humans in identifying relevant materials for learning a task. In the first stage, our algorithm finds a rough estimate as part of the coarse action. The second stage leverages two recently proposed modules, AFLite (for adversarial filtering of dataset biases) and DQI (for quantifying the quality of data), to perform the fine action. We show the efficacy of our baseline by pruning SNLI to 2% of its original size. Our results show that RoBERTa, on training with this pruned set, achieves near-equal performance on the SNLI dev set and competitive zero-shot generalization on three OOD datasets. Our analysis shows that big datasets not only prevent generalization, but also impact IID performance. Our findings about distracting samples will encourage the community to look for the possible existence of such distractors in real datasets, and subsequently to explore the optimal selection of samples in a dataset instead of trying to dominate a leaderboard with the entire dataset. Studying the effect of our algorithm on model training time, memory footprint, and model interpretation research, and gaining a better understanding of how deep learning models work in general, are some potential future directions to explore.