Quantifying the Evaluation of Heuristic Methods for Textual Data Augmentation

Data augmentation has been shown to be effective in providing more training data for machine learning and resulting in more robust classifiers. However, for some problems, there may be multiple augmentation heuristics, and the choices of which one to use may significantly impact the success of the training. In this work, we propose a metric for evaluating augmentation heuristics; specifically, we quantify the extent to which an example is “hard to distinguish” by considering the difference between the distribution of the augmented samples of different classes. Experimenting with multiple heuristics in two prediction tasks (positive/negative sentiment and verbosity/conciseness) validates our claims by revealing the connection between the distribution difference of different classes and the classification accuracy.


Introduction
Machine learning approaches have been shown to be capable of making accurate predictions in many well-known problem domains with an abundance of training data. This heavy reliance on the availability of data, however, may hamper the application of machine learning approaches to resource-limited problem domains, where sizable training data are not always available.
There is a growing body of research on training under resource scarcity, and data augmentation is one such technique. It aims to satisfy the data requirements of machine learning approaches by applying a general heuristic (e.g., randomly remove a word) or a domain-inspired heuristic (e.g., replace an adjective with an antonym) to the (limited) existing data in order to generate more training samples.
One challenge for data augmentation is choosing the most appropriate heuristic for the application in question. There may be many domain-independent augmentation heuristics, and a domain expert may come up with many different domain-inspired heuristics; but the choice of which examples from these heuristics to use may have a significant impact on the success of the trained model.
A straightforward approach to choosing an augmentation heuristic is to actually perform the classification experiment on all possible augmented datasets and then choose the best performing one(s) based on the evaluation results. However, this approach may not be computationally practical when there are too many augmentation heuristic options.
In this paper, we propose an alternative heuristic evaluation approach based on the idea that a good heuristic should aim to generate "hard to distinguish" samples for different classes. We further argue that the generation quality of "hard to distinguish" examples could be quantified as the difference between the distribution of the augmented samples that a heuristic generates for different classes.
To calculate the distribution difference, we propose to use pre-trained off-the-shelf embeddings to convert sentences into class distributions, then calculate the KL-divergence between them and use it as a metric to evaluate the "hard to distinguish" examples that a heuristic produces.
We validate our proposed heuristic evaluation approach by experimenting with multiple heuristics and augmented datasets for two classification tasks: predicting whether a sentence expresses positive or negative sentiment and predicting whether a sentence is verbose or concise. Results suggest that quantifying the "hard to distinguish" example generation quality of the heuristics as the difference between the class distributions of the augmented examples could serve as an effective metric for choosing a suitable augmented dataset for a classification task.

Data Augmentation
Data augmentation is a technique for generating additional training data by applying a heuristic transformation to the existing training examples. For example, an existing image could be rescaled or flipped to get more images with the same label, expanding the size and diversity of the training dataset and thus training a more reliable and accurate model (Frénay and Verleysen, 2014; Hendrycks et al., 2018; Shorten and Khoshgoftaar, 2019).
In general, data augmentation can be formulated as Equation 1, where h is a heuristic function that transforms the datapoint and label pair (x, y) into a new augmented sample $(\hat{x}, \hat{y})$:

$$(\hat{x}, \hat{y}) = h(x, y) \qquad (1)$$

The majority of existing data augmentation approaches are label-preserving, which restricts Equation 1 to $(\hat{x}, \hat{y}) = h(x, y) = (h(x), y)$; that is, if x belongs to some class A, the augmented $\hat{x}$ also belongs to class A. For example, using a synonym replacement heuristic, a sentence with positive sentiment could be augmented into a new example while preserving the overall positive sentiment. Label-preserving data augmentation requires existing labeled samples for every class that needs to be augmented.
Data augmentation can also be non-label-preserving, where the label itself may be transformed by a function $h_y$, which expands Equation 1 as $(\hat{x}, \hat{y}) = h(x, y) = (h(x), h_y(y))$. This means that while x belongs to class A, $\hat{x}$ might not necessarily belong to class A. For example, by replacing the most positive word(s) of a sentence with positive sentiment with an antonym, the sentence's sentiment may become negative. Non-label-preserving data augmentation is not bound to the assumption of having labeled samples for all classes; samples from one class may be enough to generate instances of other classes.
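The distinction between the two kinds of heuristics can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the helper names (`make_heuristic`, `h_x`, `swap_synonym`, `swap_antonym`, `flip_label`) and the toy word pairs are hypothetical and are not taken from the paper.

```python
from typing import Callable, Tuple

Sample = Tuple[str, str]  # (text, label)


def make_heuristic(
    h_x: Callable[[str], str],
    h_y: Callable[[str], str] = lambda y: y,
) -> Callable[[Sample], Sample]:
    """Build h(x, y) = (h_x(x), h_y(y)); the default h_y keeps the label."""
    def h(sample: Sample) -> Sample:
        x, y = sample
        return h_x(x), h_y(y)
    return h


# Toy text transformations (hypothetical word pairs, for illustration only).
def swap_synonym(x: str) -> str:
    return x.replace("great", "excellent")


def swap_antonym(x: str) -> str:
    return x.replace("great", "terrible")


def flip_label(y: str) -> str:
    return "negative" if y == "positive" else "positive"


# Label-preserving: only the text changes.
label_preserving = make_heuristic(swap_synonym)
# Non-label-preserving: the text and the label both change.
non_label_preserving = make_heuristic(swap_antonym, flip_label)

sample = ("the food was great", "positive")
print(label_preserving(sample))
print(non_label_preserving(sample))
```

In this sketch, a non-label-preserving heuristic needs only positive samples to produce negative ones, which mirrors the point made above about resource-limited domains.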
Given a classification task, there may be multiple heuristics and data augmentation approaches that allow us to transform existing samples to new ones, but the choice of heuristic may significantly impact the success of the task. In this paper, we aim to answer the key question: "which heuristic and data augmentation approach is more appropriate for a classification task?" In Section 3, we propose a low-cost approach to quantify the evaluation of different heuristics and the resulting augmented datasets for classification tasks.
However, with all these textual augmentation options, trying all of them for a (classifier) training task might be impractical, and to the best of our knowledge, there is no guideline for how to choose between them for a given task.

Quantification of Heuristics Suitability
A straightforward approach to assess which heuristic and data augmentation approach is more appropriate for the task is to try every heuristic to generate an augmented dataset, then train a classifier on each and check the final classification performance (Qiu et al., 2020;Wei and Zou, 2019). The training process in this brute-force approach, however, may be time-consuming and resource-intensive, especially in complex training scenarios.
Alternatively, we may try to identify qualities that make a heuristic effective. Intuitively, a good heuristic ought to generate augmented samples that are the most similar to the original data distribution. However, this approach may overlook the additional generalization benefit that may come from diverse augmented training examples. Moreover, this approach may not be possible for problem domains with limited resources, where original labeled data is not available for all classes, and one may have to use non-label-preserving heuristics to augment examples for all classes.
On the other hand, from the classification task perspective, a good heuristic should aim to generate near-miss examples (samples of class B that are hard to distinguish from samples of class A). We believe that "hard to distinguish" samples can be quantified by computing the difference between the samples of different classes, which can then serve as a guideline for choosing between different heuristic approaches.
Let us assume samples of class A are drawn from distribution A, which should be different from distribution B, from which samples of class B are drawn. The difference between distributions A and B can be calculated as the KL-divergence (KLD) (Kullback and Leibler, 1951) from B to A, $D_{KL}(A \| B)$. KLD measures how the probability distribution A differs from the reference probability distribution B, as the amount of information lost when B is used to approximate A.
Thus, a lower $D_{KL}(A \| B)$ means distribution A is more similar to distribution B, so samples of class A are harder to distinguish from samples of class B. Therefore, the extent to which "hard to distinguish" samples can be generated by heuristic h can be quantified as $D_{KL}(A_h \| B_h)$, where $A_h$ and $B_h$ denote the samples of class A and B augmented using heuristic h. Equation 2 can then be used to identify which heuristic generates "harder to distinguish" samples and is therefore more suitable for the classification task.

$$\arg\min_{h} \; D_{KL}(A_h \,\|\, B_h) \qquad (2)$$
Finally, to transform sentences from their discrete word representation into a continuous distribution representation, we utilize a few of the numerous pre-trained embeddings that nowadays are the de facto approach for encoding sentences into vector space (Cho et al., 2014;Le and Mikolov, 2014;Cer et al., 2018;Devlin et al., 2019).
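As a rough sketch of how such a selection could be computed once sentences are encoded as vectors, the Python code below fits a diagonal Gaussian to each class and uses the closed-form KL-divergence between the two Gaussians. The Gaussian assumption is ours, since the text does not specify how the class distributions are estimated from the embeddings; the function names and the toy vectors are likewise hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import Dict, List, Sequence, Tuple

Vectors = Sequence[Sequence[float]]


def fit_diag_gaussian(vectors: Vectors, eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance (eps avoids division by zero)."""
    dims = list(zip(*vectors))
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) + eps for d in dims]
    return means, variances


def kl_diag_gaussians(a: Vectors, b: Vectors) -> float:
    """D_KL(A || B) for diagonal Gaussians fitted to the two vector sets."""
    mu_a, var_a = fit_diag_gaussian(a)
    mu_b, var_b = fit_diag_gaussian(b)
    return sum(
        0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )


def most_suitable_heuristic(augmented: Dict[str, Tuple[Vectors, Vectors]]) -> str:
    """Return the heuristic h that minimizes D_KL(A_h || B_h)."""
    return min(augmented, key=lambda h: kl_diag_gaussians(*augmented[h]))


# Toy 2-D "embeddings" per heuristic: (class A vectors, class B vectors).
toy = {
    "far_apart": ([[0.0, 0.0], [1.0, 1.0]], [[5.0, 5.0], [6.0, 6.0]]),
    "close": ([[0.0, 0.0], [1.0, 1.0]], [[0.2, 0.1], [1.1, 1.2]]),
}
print(most_suitable_heuristic(toy))
```

Other density estimates could be substituted for the Gaussian fit without changing the selection step, which only needs a divergence value per heuristic.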
We examine the applicability of our proposed approach by studying two classification tasks: sentiment analysis, a resource-rich problem domain that allows experimenting with both label-preserving and non-label-preserving heuristics, and verbosity analysis, a resource-limited problem domain in which the absence of sizable labeled data limits the options to non-label-preserving heuristics.

Augmented Datasets
In this section, we go over some heuristic options for augmenting training corpora for sentiment analysis and verbosity detection domains.

Augmented Sentiment Corpus
Our sentiment analysis task is to predict whether a sentence expresses positive or negative sentiment. For this task, we use the sentences from the Yelp Polarity Dataset (YPD) (Zhang et al., 2015) to create the augmented dataset.
As label-preserving heuristics, we use the following heuristics proposed by Wei and Zou (2019):
• Synonym Replacement (SR). Randomly pick a content word from the sentence and replace it with a synonym chosen at random.
• Random Insertion (RI). Randomly choose a content word from the sentence and insert one of its synonyms to a random place in the sentence.
• Random Swap (RS). Swap the position of two randomly chosen words in the sentence.
• Random Deletion (RD). Delete a randomly chosen word from the sentence.
We apply these heuristics to the positive sentences of YPD to generate more positive examples, and to the negative sentences to generate more negative examples. For each sentence, we repeat each heuristic operation until about 20% of its words are changed (α = .2).
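Two of the label-preserving operations above, Random Swap and Random Deletion, can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation of Wei and Zou (2019): we assume that "about 20% of the words" translates into round(α · sentence length) repeated operations, and the function names and the example sentence are hypothetical. Synonym-based operations are omitted because they require a lexical resource such as WordNet.

```python
import random
from typing import Callable, List

Op = Callable[[List[str], random.Random], List[str]]


def random_swap(words: List[str], rng: random.Random) -> List[str]:
    """RS: swap the positions of two randomly chosen words."""
    if len(words) < 2:
        return list(words)
    out = list(words)
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words: List[str], rng: random.Random) -> List[str]:
    """RD: delete one randomly chosen word (never empties the sentence)."""
    if len(words) < 2:
        return list(words)
    out = list(words)
    del out[rng.randrange(len(out))]
    return out


def augment(sentence: str, op: Op, alpha: float = 0.2, seed: int = 0) -> str:
    """Apply `op` round(alpha * n_words) times, and at least once."""
    rng = random.Random(seed)
    words = sentence.split()
    n_ops = max(1, round(alpha * len(words)))
    for _ in range(n_ops):
        words = op(words, rng)
    return " ".join(words)


sentence = "the staff was friendly and the food came out fast"
print(augment(sentence, random_swap))
print(augment(sentence, random_deletion))
```

Because neither operation touches the label, each augmented sentence simply inherits the label of the sentence it was generated from.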
Moreover, we propose the following non-label-preserving heuristics and apply them to the positive sentences to create the augmented negative examples, and the other way around to create the augmented positive examples.
• ALL. In this heuristic, we replace all the sentiment words of a sentence with their antonyms. To find the sentiment words, we first collect a vocabulary of positive and negative unigrams by combining the labeled words of the Stanford Sentiment Treebank (Socher et al., 2013) and the Opinion Lexicon (Hu and Liu, 2004). This results in a vocabulary of 3,453 positive and 6,000 negative unigrams.
Then, for a positive sentence in YPD, we replace every word that appears in the positive portion of the collected vocabulary with one of its randomly chosen antonyms from WordNet (Miller, 1995) to create the augmented negative sentence. We proceed similarly, but in the opposite direction, to create the augmented positive sentences.
• ONE. In this heuristic, instead of replacing all sentiment words with randomly chosen antonyms, we first filter for antonyms that match the POS and sense of the sentiment word; then we pick the antonym that yields the most fluent augmented sentence, as ranked by a language model (LM) trained on YPD.
Finally, for every sentence, we replace only one of its sentiment words with its POS-, sense-, and LM-filtered antonym.
Using this heuristic, for example, a sentence with overall positive polarity may still contain a word that expresses a negative opinion about an aspect, so intuitively, this creates "harder to distinguish" examples compared to the ALL heuristic.
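The contrast between the two heuristics can be made concrete with a simplified Python sketch. It assumes a tiny hand-written antonym table in place of the sentiment vocabulary and WordNet, and the ONE-style function skips the POS, sense, and LM filtering described above by simply replacing the first sentiment word it finds; all names and word pairs are hypothetical.

```python
from typing import Dict

# Hypothetical toy lexicon: sentiment word -> antonym.
ANTONYMS: Dict[str, str] = {
    "great": "terrible",
    "friendly": "rude",
    "fresh": "stale",
}


def flip_all(sentence: str, antonyms: Dict[str, str] = ANTONYMS) -> str:
    """ALL-style: replace every sentiment word with an antonym."""
    return " ".join(antonyms.get(w, w) for w in sentence.split())


def flip_one(sentence: str, antonyms: Dict[str, str] = ANTONYMS) -> str:
    """ONE-style (simplified): replace only the first sentiment word found."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w in antonyms:
            words[i] = antonyms[w]
            break
    return " ".join(words)


positive = "great food and friendly staff"
print(flip_all(positive))  # every sentiment word is flipped
print(flip_one(positive))  # one sentiment word is flipped, the rest survive
```

The ONE-style output keeps part of the original sentiment vocabulary, which is why it is expected to sit closer to the opposite class than the ALL-style output.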
In total, we generated 50K positive and 50K negative augmented samples using each heuristic. We removed all of the original YPD sentences so that these datasets contain only augmented samples. We refer to each dataset with the same name as the heuristic function it is augmented with. Table 1 shows examples of sentences augmented using the label-preserving and non-label-preserving sentiment heuristics.

Augmented Verbosity Corpus
The verbosity detection task is to predict whether a sentence is verbose or concise. Unlike the sentiment analysis domain, the set of existing resources for the verbosity detection problem is much more limited: NUCLE covers grammatical redundancy (Dahlmeier et al., 2013), and Kashefi et al. (2018) provide a small corpus called the Semantic Pleonasm Corpus (SPC) that contains semantic redundancy (i.e., verbosity) labels. Due to its small size, SPC is primarily suitable as a benchmark.
Since to the best of our knowledge, there is no sizable resource with explicit verbose and concise labels, to augment a dataset of concise and verbose sentences, we start by trying to identify an existing real-world data source that has verbosity or conciseness characteristics. One domain-specific feature of Yelp that we exploit is the data category called "tips." Since "tips" are very short sentences, they are likely to be concise; we sample for "tips" that contain adjectives because the evaluation corpus (i.e. SPC) mainly focuses on adjectival semantic redundancies.
Based on domain knowledge, we come up with the following non-label-preserving heuristics, which create verbose samples from the collected "concise" sentences by adding a superfluous adjective to them:
• Duplicate (DUP). This heuristic creates an obvious case of word redundancy by duplicating an adjective of the sentence right next to itself.
• Synonym (SYN). This heuristic inserts a synonym next to an adjective of the sentence. The conventional way to get synonyms of a word is to use WordNet; however, since these synonyms may express a different quality of the noun clause compared to the original adjective, the augmented construction might not be semantically redundant.
For this reason, we opt to use sense2vec (Trask et al., 2015), a contextual word embedding fine-tuned on Yelp "tips". Since the adjective synonyms from sense2vec match the context and follow the same intent and emotional state as the original adjective, the two adjacent synonyms are likely to form a pleonastic construction.
• Near-Miss Negative (NMN). In this heuristic, we try to create concise examples that are "hard to distinguish" from the verbose examples. We trained a language model on the Yelp "tips" and used it to predict the most likely words that can occur right after an adjective of the sentence. Let us assume that, for adjective $w_{adj}$ in sentence s, the LM retrieves $\{w_{aug_1}, w_{aug_2}, ..., w_{aug_5}\}$ as a sorted list of the most likely words that can appear next to $w_{adj}$ given its context s. We then filter for the $w_{aug}$ candidates that are themselves adjectives and synonyms of $w_{adj}$; let us assume the filtered list is $\{w_{aug_2}, w_{aug_5}\}$.
Since the LM is trained on Yelp tips, $w_{aug_2}$ has already been observed after $w_{adj}$ in some context in the Yelp tips. Taking into account that Yelp tips are considered concise, the sequence ... $w_{adj}$ $w_{aug_2}$ ... is also concise. Therefore, we can create concise examples that contain two adjacent synonyms but are not verbose.
For each heuristic, we generate only one augmented verbose sample from an original concise sentence. In total, we augmented 100K concise and 100K verbose samples using each heuristic. Since the verbose examples are generated from concise sentences that are included in the augmented corpus, we removed the concise sentences with odd indexes and the verbose sentences with even indexes to make sure that none of the concise and verbose sentences in the corpus correspond to each other. The final augmented corpus thus contains 50K nonparallel samples of each class. We refer to each dataset with the same name as the heuristic function that was used to augment it.
Table 2 shows examples of sentences augmented using the non-label-preserving verbosity heuristics. While duplicating the word "delicious" or adding "tasty" next to it makes the sentence verbose, adding "redolent" does not make it verbose because "redolent" and "delicious" describe different qualities of the "bread."
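The DUP heuristic and the odd/even filtering step can be sketched in Python as follows. The sketch assumes that adjectives are identified by membership in a small hand-written set rather than by a POS tagger, and it duplicates only the first adjective it finds; the function names and example sentences are hypothetical.

```python
from typing import List, Set, Tuple


def duplicate_adjective(sentence: str, adjectives: Set[str]) -> str:
    """DUP-style: repeat the first adjective right next to itself."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w in adjectives:
            return " ".join(words[: i + 1] + [w] + words[i + 1:])
    return sentence


def make_nonparallel(concise: List[str], verbose: List[str]) -> Tuple[List[str], List[str]]:
    """Drop odd-indexed concise and even-indexed verbose sentences,
    so that no kept verbose sentence was derived from a kept concise one."""
    return concise[0::2], verbose[1::2]


# Hypothetical adjective set and concise "tips".
ADJECTIVES = {"delicious", "nice", "good", "friendly"}
concise = ["delicious bread", "nice place", "good coffee", "friendly staff"]
verbose = [duplicate_adjective(s, ADJECTIVES) for s in concise]

kept_concise, kept_verbose = make_nonparallel(concise, verbose)
print(kept_concise)
print(kept_verbose)
```

Halving both lists in this way is what turns 100K paired samples per class into 50K nonparallel samples per class.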

Experiments
The key questions for validating our proposed approach for quantifying the evaluation of heuristic textual data augmentation methods are:
• Q1. Can generating "hard to distinguish" examples be an effective way to assess whether a heuristic is generating a suitable augmented training dataset?
• Q2. To what extent could the notion of "hard to distinguish" examples be quantified by our proposed metric, the difference between the class distributions of the augmented samples?
• Q3. Is calculating the difference of class distributions computationally efficient in practice?
To measure the accuracy of sentiment and verbosity classification in answering Q1, we trained an LSTM (Liu et al., 2016) and a CNN (Kim, 2014) classifier on each augmented dataset. The classification result for each task and augmented dataset is reported in Section 5.1.
The LSTM and CNN models are trained on augmented corpora separately for each task; the sentiment classifiers are evaluated on a held-out portion of the YPD, and the verbosity classifiers are evaluated on SPC. None of the sentences of the held-out YPD and SPC are used during the creation of the augmented datasets.
To answer Q2, we use two pre-trained encoder models: Universal Sentence Encoder (USE) (Cer et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), both of which are transformer-based encoders of greater-than-word-length text, to transform the sentences into a continuous space so that we can treat them as class distributions and measure their similarity.

Classification Accuracy
If a good heuristic is one that generates "hard to distinguish" examples, the dataset augmented using ONE should train a better classifier than ALL for the sentiment analysis task, and the verbosity classifier trained on NMN should outperform the classifiers trained on SYN and DUP. Table 3 and Table 4 show the classification accuracy of the neural models trained on different augmented datasets for the sentiment and verbosity prediction tasks. As we expected, the heuristics that intuitively generate "harder to distinguish" examples are more suitable for the prediction task and train a better classifier on both tasks. These observations suggest that an augmented dataset generated from a heuristic that produces "harder to distinguish" examples for different classes could train a better classifier (Q1).
Since label-preserving heuristics do not change the class label of the samples, the extent to which "hard to distinguish" examples can be generated relies heavily on their existence in the original data. Thus, we cannot intuitively predict which label-preserving heuristic might be a better choice; however, in Section 5.2, we further study whether our proposed heuristic evaluation approach is applicable to label-preserving heuristics.

Augmented Distribution Difference
To investigate the extent to which "hard to distinguish" examples might be quantified as a difference between the distributions of the augmented samples of different classes, we first encode the augmented sentences into a continuous high-dimensional vector space; then, we compute the difference between the distributions of the encoded samples of the two classes. The distribution difference for the verbosity analysis task is calculated between the augmented concise and verbose examples. It must be noted that since there is no correspondence between the augmented examples of different classes, we computed the difference as the average KL-divergence over mini-batches of 64 samples from the shuffled augmented dataset for 10 epochs (the same batch and epoch values used for training the LSTM and CNN models).
Table 3 shows the distribution difference between augmented positive and negative samples for the sentiment analysis task. As shown, although the average classification accuracies of models trained on label-preserving heuristics are only marginally different, the divergence between the distributions of augmented examples with positive and negative sentiment follows the reverse order of classification accuracy for both BERT and USE representations, with one exception for the USE representation of RI compared to RD.
Since the non-label-preserving heuristics apply significant semantic changes to the original samples to change their class labels, the choice of heuristic is expected to have a more noticeable impact on the classification accuracy compared to augmentation using label-preserving heuristics. We observe the same pattern for non-label-preserving heuristics: the augmented dataset with higher classification accuracy has lower divergence between the distributions of its positive and negative examples.
Table 4 shows the distribution difference between augmented concise and verbose samples for the verbosity prediction task. Here, similar to the sentiment analysis task, we observe that the divergence between the distributions of augmented concise and verbose examples follows the reverse order of classification accuracy for both BERT and USE representations.
These observations may indicate that the extent to which a heuristic might generate "hard to distinguish" examples could be quantified as the difference (divergence) between the distributions of augmented examples in different classes (Q2).
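The mini-batch averaging described in this section can be sketched as follows. The batch size of 64 and the 10 epochs come from the text; everything else is an assumption made for illustration, in particular the diagonal-Gaussian estimate of each batch distribution, the function names, and the synthetic vectors that stand in for sentence embeddings.

```python
import math
import random
from statistics import fmean, pvariance
from typing import List, Sequence

Vec = Sequence[float]


def kl_diag_gaussians(a: Sequence[Vec], b: Sequence[Vec], eps: float = 1e-6) -> float:
    """D_KL(A || B) for diagonal Gaussians fitted to two batches (our assumption)."""
    total = 0.0
    for dim_a, dim_b in zip(zip(*a), zip(*b)):
        ma, mb = fmean(dim_a), fmean(dim_b)
        va, vb = pvariance(dim_a) + eps, pvariance(dim_b) + eps
        total += 0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return total


def batched_kld(a: List[Vec], b: List[Vec], batch_size: int = 64,
                epochs: int = 10, seed: int = 0) -> float:
    """Average KLD over shuffled, unaligned mini-batches of the two classes."""
    n = min(len(a), len(b))
    if n < batch_size:
        raise ValueError("each class needs at least batch_size samples")
    rng = random.Random(seed)
    a, b = list(a), list(b)  # copy so the caller's lists are not reordered
    scores = []
    for _ in range(epochs):
        rng.shuffle(a)
        rng.shuffle(b)
        for start in range(0, n - batch_size + 1, batch_size):
            stop = start + batch_size
            scores.append(kl_diag_gaussians(a[start:stop], b[start:stop]))
    return fmean(scores)


# Synthetic 4-D vectors standing in for encoded sentences of two classes.
gen = random.Random(1)
pos = [[gen.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]
near = [[gen.gauss(0.2, 1.0) for _ in range(4)] for _ in range(256)]
far = [[gen.gauss(3.0, 1.0) for _ in range(4)] for _ in range(256)]

print(batched_kld(pos, near) < batched_kld(pos, far))
```

Shuffling each class independently reflects the fact that the two classes are nonparallel, so no batch pairs a sample with the sample it was derived from.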

Computational Efficiency
Now that we have investigated the role of "hard to distinguish" examples in the success of training a classifier (Q1) and how to quantify it (Q2), we evaluate the computational efficiency of our proposed approach to see how practical it is compared to training a separate classifier for each augmented dataset and picking the best performing one(s) (Q3).
To investigate this, we measured the time required for encoding the augmented examples into continuous space and the time required for computing the KLD, and compared them with the time required for training a classifier on an augmented dataset. Table 5 shows the average execution time of our proposed approach for evaluating the suitability of different data augmentation heuristics and of training neural classifiers on augmented datasets. Reported numbers are averaged over the sentiment and verbosity prediction tasks for all augmented datasets. Encoding is a one-time process for each augmented dataset, and the numbers reported under the KLD and Classification columns are the overall execution times after 10 epochs on an NVIDIA Tesla P100 GPU.
We observed that the encoding and divergence calculation times depend only on the number of samples; the classification task and the choice of heuristic do not affect the execution times. We also observed that the training time for both LSTM and CNN depends heavily on the number of training samples, and changing the task and augmented dataset only slightly changes the training time (standard deviation of 9.4s and 6.8s, respectively).
The execution times show that our proposed heuristic evaluation approach is about 25 times faster than training a classifier; this may suggest that our proposed approach could be a low-cost alternative for assessing the suitability of heuristic strategies for augmenting training datasets for different classification tasks, especially in complex training scenarios where training many classifiers on different augmented datasets might not be computationally practical (Q3).

Conclusion
This paper presents an approach for evaluating the suitability of augmentation heuristics for classification tasks. The approach measures a heuristic's capacity to generate "hard to distinguish" examples by analyzing the difference between the class distributions of the augmented examples.
Experimental results suggest that our proposed heuristic evaluation approach could be a low-cost yet effective way of measuring the suitability of an augmentation heuristic for a classification task.