Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation

Data augmentation is proven to be effective in many NLU tasks, especially for those suffering from data scarcity. In this paper, we present a powerful and easy-to-deploy text augmentation framework, Data Boost, which augments data through reinforcement learning guided conditional generation. We evaluate Data Boost on three diverse text classification tasks under five different classifier architectures. The results show that Data Boost can boost the performance of classifiers, especially in low-resource data scenarios. For instance, Data Boost improves F1 on the three tasks by 8.7% on average when given only 10% of the whole data for training. We also compare Data Boost with six prior text augmentation methods. Through human evaluations (N=178), we confirm that Data Boost augmentation has comparable quality to the original data with respect to readability and class consistency.


Introduction
Data augmentation is a widely-used technique in classification tasks. In the field of computer vision (CV), data is augmented by flipping, cropping, tilting, and altering the RGB channels of the original images (Krizhevsky et al., 2012; Chatfield et al., 2014; Szegedy et al., 2015); however, similarly intuitive and simple strategies have not achieved equal success in NLP tasks. Existing methods tend to produce augmentations with low readability or unsatisfying semantic consistency (Yang et al., 2020). Table 1 shows some output samples of popular text augmentation methods. Naive methods imitate pixel manipulation in CV, augmenting sentences by adding spelling errors (Xie et al., 2017) or randomly deleting and swapping tokens (Wei and Zou, 2019). The output of such augmentation methods is often illegible since the word order is disrupted (e.g., "is The baby very!"); even worse, crucial feature words (e.g., the word lovely, which is a signal-carrying word for sentiment detection) can be mistakenly removed through random deletion. A more advanced method is synonym insertion or replacement (Zhang et al., 2015; Wang and Yang, 2015), which uses Word2Vec (Mikolov et al., 2013) to replace words with their synonyms. Such a method respects the original sentence structure but fails to consider context: it sometimes replaces words with synonyms that are awkward in the full context of the sentence, for example, replacing lovely with fabulous to get the sentence "The baby is fabulous!". Recent work leans towards translation-based methods for augmentation (Fadaee et al., 2017; Silfverberg et al., 2017). In particular, Yu et al. (2018) proposed a back-translation method that first translates the text to French and then back to English, using the noisy output as the augmentation data.
Although back-translation is intuitive and valid, its generation skews towards high frequency words (e.g., cute, lovely are both back-translated to cute), which not only causes repetition but also leads to lexical shrinkage in the augmented data. In a nutshell, existing techniques are still far from perfect, partially due to the strong interdependency of syntactic and semantic features in text data.
In recent years, we have witnessed extensive progress in language models (LMs). Large-scale LMs such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and GPT-2 (Radford et al., 2019) are commonly trained on large amounts of text data (e.g., GPT-2 was trained on 8 million web pages that emphasized content diversity). One of the most interesting uses of these models is as text generators (Raffel et al., 2019; Lewis et al., 2019; Dong et al., 2019). In this paper, we explore whether we can leverage the generation ability of state-of-the-art LMs to generate augmented samples for a given target class.
Augmentation samples should exhibit features of the target class. Off-the-shelf LMs cannot be directly used to augment data; since they are not trained for specific contexts, their generation is undirected and random. Conditional LMs can generate text directed by a certain condition (e.g., a target class), but they require training an LM from scratch on data covering all the conditions. Keskar et al. (2019), for instance, trained a 1.6 billion-parameter LM conditioned on a variety of control codes. Not only is such training rather costly; collecting sufficient data for it is also tedious, especially in low-resource tasks (Waseem, 2016).
We thus present Data Boost: a reinforcement learning guided text data augmentation framework built on off-the-shelf LM (GPT-2). Data Boost requires neither collecting extra data nor training a task-specific LM from scratch. We convert GPT-2 into a conditional generator, and for a given task, we guide the generator towards specific class labels during its decoding stage through reinforcement learning. The generated samples can then serve as augmentation data which are similar to the original data in terms of semantics and readability.
The advantages of Data Boost are three-fold. First, Data Boost is powerful: we achieve significant advances in three tasks on five different classifiers compared with six related works. Second, Data Boost generates sentence-level augmentation. Unlike prior methods that do word-level or phrase-level replacement (Kobayashi, 2018; Wei and Zou, 2019), our augmented data has much greater variety in terms of vocabulary and sentence structure; human evaluations also verify the high readability and label consistency of our augmentation. Third, Data Boost is easy to deploy. It requires neither external datasets nor training a separate system (such as the machine translation model in the back-translation method). Instead, we take the off-the-shelf GPT-2 language model and modify its decoding stage without changing its architecture.
Data Boost

Conditional Generator
Given tokens $x_{<t} = \{x_0, x_1, \dots, x_{t-1}\}$ and accumulated hidden states $h^{\theta}_{<t}$ before time step $t$, a vanilla auto-regressive language model (LM) is trained to maximize the probability of the next-step token $x_t$. Normally the model picks the token with the highest probability as the step-$t$ decoding output:

$$x_t = \underset{x}{\arg\max}\; \mathrm{softmax}\big(h^{\theta}_{<t}\big) \tag{1}$$

The generation of such step-by-step decoding is unconditional, since the model is trained on unannotated data (Figure 1 (a)). Conditional generation, however, normally requires training a conditional LM. By modifying the LM architecture to allow for extra input (the target label), a conditional LM can model the language and its corresponding label at the same time (Figure 1 (b)). Its generation is thus conditioned on the label, but training such an LM is always costly. Different from the above conditional generation method, we keep the architecture of the existing LM unchanged but postpone the argmax function to a later stage. In this way, the output of the softmax is still differentiable (as it is a probability over the whole vocabulary rather than decoded tokens), which allows for gradient-based optimization. As shown in Figure 1 (c), we add a reinforcement learning (RL) stage in the gap between the softmax and argmax functions. The RL reward (defined in Section 2.2.1) is where we inject the controlling signal of the target label to guide the generation towards it. Specifically, in each decoding step, we update the hidden-state parameters $\theta$ to the conditional $\theta_c$ according to the back-propagated reward after several iterations of RL optimization. The final decoded output is then conditional on the target label (which is positive in Figure 1 (c)).
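The contrast between immediate and postponed argmax can be sketched numerically. The toy vocabulary, hidden-state values, and "reward gradient" vector below are all invented for illustration; in Data Boost the shift comes from the back-propagated RL reward rather than a hand-written vector:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy 5-word vocabulary and a toy hidden state h (logits over the vocab).
vocab = ["the", "baby", "is", "lovely", "awful"]
h = np.array([0.5, 0.2, 0.1, 0.9, 1.0])

# (a) Unconditional decoding: argmax right after the softmax.
p_uncond = softmax(h)
token_uncond = vocab[int(np.argmax(p_uncond))]     # -> "awful"

# (c) Postponed argmax: the softmax output stays a differentiable
# distribution, so a reward gradient can shift the hidden state toward
# the target class ("positive" here) before argmax is finally taken.
reward_gradient = np.array([0.0, 0.0, 0.0, 0.8, -0.8])   # invented values
p_cond = softmax(h + reward_gradient)
token_cond = vocab[int(np.argmax(p_cond))]         # -> "lovely"
```

The key point is that `p_cond` is still a probability distribution over the whole vocabulary, so it can be optimized with gradients before any discrete token is committed.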

Reward
In the reinforcement learning framework, we define the state at step $t$ as the generated sequence before $t$ (i.e., $s_t = x_{<t}$), and the action at step $t$ as the $t$-th output token (i.e., $a_t = x_t$). The policy $\pi_\theta$ is interpreted as the probability of choosing token $x_t$ (action $a_t$) given the state $s_t = x_{<t}$, which is the softmax output of the hidden states, i.e., $\pi_\theta(a_t \mid s_t) = \mathrm{softmax}(h^{\theta}_{<t})$ (and similarly for the conditional case).
We define the single-step reward of the conditionally generated token $x^c_t$ at step $t$ as:

$$R(x^c_t) = \frac{\pi_{\theta_c}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\, \mathcal{S}(x^c_t) \;-\; \beta\, \mathrm{KL}\big(\pi_{\theta_c} \,\|\, \pi_{\theta}\big) \tag{2}$$

where $\mathcal{S}(x^c_t)$ is the salience gain that measures how closely the generated token resembles the salient lexicon of the target label, and serves as a guide signal for the conditional generation. We also consider the Kullback-Leibler (KL) divergence between the conditional distribution $\theta_c$ and the unconditional distribution $\theta$ as an auxiliary constraint (with weight $\beta$). Such a reward composition follows the classic PPO (Proximal Policy Optimization) (Schulman et al., 2017) form. Note that we use an off-policy strategy that collects unconditional $(s_t, a_t)$ pairs as trajectories to estimate the conditional reward $R(x^c_t)$. In this way, we are able to perform several iterations of updates on $\theta$ to maximize the reward without changing the sampling policy frequently, which avoids potential instability (Munos et al., 2016). As a result, we use the probability ratio between the conditional policy $\pi_{\theta_c}$ and the unconditional policy $\pi_{\theta}$ to re-calibrate the reward in the first term of Equation 2.
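As a loose numerical illustration of this reward composition, a single step might be computed as below. The policies, sampled action, salience gain, and β are all made-up toy values, not taken from the paper:

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy conditional / unconditional policies over a 4-token vocabulary.
pi_uncond = np.array([0.40, 0.30, 0.20, 0.10])
pi_cond   = np.array([0.25, 0.25, 0.35, 0.15])

a_t = 2                 # index of the sampled action (token) at step t
salience_gain = 1.7     # invented salience gain for this token
beta = 0.1              # KL penalty weight

# PPO-style reward: ratio-recalibrated salience gain minus the KL penalty.
ratio = pi_cond[a_t] / pi_uncond[a_t]
reward = ratio * salience_gain - beta * kl(pi_cond, pi_uncond)
```

Because the action was sampled from the unconditional policy (off-policy), the ratio reweights the salience gain toward what the conditional policy would actually do, while the KL term discourages drifting too far from fluent GPT-2 behavior.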
Salience Gain. For a given task with $K$ classes, we define the salience score of word $x$ belonging to a certain class $c$ as:

$$\mathrm{salience}(x, c) = \mathrm{GM}\!\left(\frac{|x \in c|}{\sum_{x' \in V} |x' \in c|},\; \frac{|x \in c|}{\sum_{k=1}^{K} |x \in k|}\right) \tag{3}$$

where $|x \in c|$ refers to the count of word $x$ in samples with class label $c$, $V$ is the total vocabulary, and GM is the geometric mean of the two terms. The two fractions try to guarantee that both $P(c \mid x)$ and $P(x \mid c)$ are high for a word marked as salient. We calculate the salience score for each word and pick the top-$N$ highest-scoring words as the salient lexicon for class label $c$ (denoted as $w_c$). Compared with other methods such as training a discriminator (Dathathri et al., 2020) or deriving control codes (Keskar et al., 2019), we find our frequency-based method relatively simple but efficient, especially in data-hungry cases, where the performance of a discriminator could be limited given very little training data.

For the $t$-th step token $x^c_t$ conditioned on the target class $c$, we calculate the salience gain as the logarithmic sum of the cosine similarity with each word in the salient lexicon $w_c$:

$$\mathcal{S}(x^c_t) = \sum_{w_i \in w_c} \log\big(\mathrm{softmax}(h^{\theta_c}_t) \cdot \mathbf{e}(w_i)\big) \tag{4}$$

where $\mathbf{e}(w_i)$ is the embedded vector of $w_i$; we compute its dot product with the softmax output of the $t$-th step hidden states $h^{\theta_c}_t$ in the latent space. The salience gain measures how much the current step token resembles the salient lexicon of the target class.

KL Penalty. It is possible that the conditional policy $\pi_{\theta_c}$ drifts too far from the unconditional policy $\pi_{\theta}$, resulting in unreadable generation. We therefore incorporate a KL divergence penalty term measuring the distance between the two policies, to better ensure that we are optimizing within a trust region. The KL divergence between the policies is computed as:

$$\mathrm{KL}\big(\pi_{\theta_c} \,\|\, \pi_{\theta}\big) = \sum_{a} \pi_{\theta_c}(a \mid s_t)\, \log \frac{\pi_{\theta_c}(a \mid s_t)}{\pi_{\theta}(a \mid s_t)} \tag{5}$$

We deduct the KL divergence with weight $\beta$ as a penalty term in the reward function (Equation 2). One can either choose a constant $\beta$ or vary it dynamically.
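The frequency-based salience score fits in a few lines. Below is a minimal sketch over a toy two-class corpus; the corpus, the helper name `salience_lexicon`, and the top-N value are all invented for illustration:

```python
import math
from collections import Counter

def salience_lexicon(corpus, target, top_n=2):
    """corpus: {class_label: [tokens]}. Return top-N salient words for `target`."""
    counts = {c: Counter(toks) for c, toks in corpus.items()}
    total_in_class = sum(counts[target].values())          # sum_{x' in V} |x' in c|
    scores = {}
    for word, n_xc in counts[target].items():
        n_x_all = sum(counts[k][word] for k in counts)     # sum_k |x in k|
        p_x_given_c = n_xc / total_in_class                # estimates P(x|c)
        p_c_given_x = n_xc / n_x_all                       # estimates P(c|x)
        scores[word] = math.sqrt(p_x_given_c * p_c_given_x)  # geometric mean
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

corpus = {
    "positive": "the baby is lovely lovely cute cute".split(),
    "negative": "the food is awful awful bad".split(),
}
lexicon = salience_lexicon(corpus, "positive")   # -> ["lovely", "cute"]
```

Note how shared function words like "the" and "is" score low: they occur in both classes, so their $P(c \mid x)$ factor drags the geometric mean down.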

Policy Gradient
Given the reward and the definitions described above, we update our policy at the $t$-th step as:

$$\theta_c \leftarrow \theta_c + \eta \sum_{i=1}^{k} \frac{\nabla_{\theta_c} R(x^c_t)}{\big\|\nabla_{\theta_c} R(x^c_t)\big\|} \tag{6}$$

where $\eta$ is the learning rate and $\theta_c$ is the parameter set for the conditional hidden states. In general, we follow the classical SGD update rule, but make two main changes: (1) We use a temperature parameter $T$ to control the stochastic sampling during token decoding (Keskar et al., 2019). $T \to 0$ approximates a greedy decoding strategy that amplifies the peak of the vocabulary distribution, while $T \to \infty$ makes the distribution more uniform.
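The effect of the temperature parameter can be illustrated with a toy temperature-scaled softmax (the logits below are made up):

```python
import numpy as np

def softmax_T(logits, T):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]

sharp   = softmax_T(logits, T=0.1)   # T -> 0: near-greedy, amplifies the peak
neutral = softmax_T(logits, T=1.0)   # standard softmax
flat    = softmax_T(logits, T=10.0)  # large T: approaches uniform
```

Sampling from `sharp` almost always returns the top token, while sampling from `flat` behaves close to a uniform draw over the vocabulary.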
(2) We sum the normalized gradient of the reward over $k$ steps; $k$ can be treated as the strength of control over the conditional generation. Combining all the definitions above, the policy gradient procedure of Data Boost is summarized in Algorithm 1.
Algorithm 1: RL-guided conditional decoding (step t)
  for each RL iteration do
    Generate (a_t | s_t) by the unconditional policy π_θ as trajectories;
    Estimate the reward R(x^c_t) using Eq. 2;
    Compute the policy update using Eq. 6 by taking k steps of SGD;
  end
  Return the conditional policy π_θc.

We use a dynamic β to control the KL penalty within the reward function. The target divergence σ depends on the user's needs: a smaller σ means more resemblance to the unconditional generation, while a larger σ provides more room for RL guidance. After several iterations of RL optimization, the updated parameter set θ_c should be conditional on the target class label, whose feature lexicon contributes to the calculation of the reward R. We then use the conditional policy π_θc (which is based on the hidden states with θ_c) to decode the current step's token. The token should conform to the specified target class label c, since its corresponding hidden states have shifted towards c through RL optimization.
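One common way to vary β dynamically against a target divergence σ is a PPO-style adaptive KL coefficient. The sketch below uses the conventional 1.5× thresholds and doubling/halving factors, which are an assumption for illustration rather than the authors' documented values:

```python
def update_beta(beta, kl, sigma):
    """Grow beta when the policy drifts past the target divergence sigma,
    shrink it when there is room for stronger RL guidance."""
    if kl > 1.5 * sigma:
        beta *= 2.0      # drifted too far: penalize KL more heavily
    elif kl < sigma / 1.5:
        beta /= 2.0      # well inside the trust region: loosen the constraint
    return beta
```

Called once per RL iteration with the measured KL divergence, this keeps the observed divergence hovering around σ without hand-tuning a fixed β.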

Tasks & Datasets
We evaluated and compared Data Boost with several state-of-the-art text augmentation methods on the following three tasks: Offense Detection (the ICWSM '20 Data Challenge dataset, N = 99,603, for offensive language detection on tweets), Sentiment Analysis, and Irony Classification. Offense Detection and Irony Classification are popular NLU tasks that are low-resource. Sentiment Analysis, though seemingly well-resolved according to some literature (Baziotis et al., 2017; Cliche, 2017), is reported to have severe overfitting problems when given extremely limited training data (Elming et al., 2014; Severyn and Moschitti, 2015). We chose challenging datasets varying in total data size (N ≈ 80k, 17k, 3k) and number of classes (4, 3, and 2) for a realistic evaluation of our framework.
We removed all punctuation, stop words, hashtags, and URLs from the samples in all datasets. Samples longer than 30 tokens were filtered out (around 2% of the data on average), as 30 was also the maximum sequence length for Data Boost generation. We further split the data into training and test sets at an 80%/20% ratio, maintaining the original class distributions; we made sure the distributions remained the same in all of our experiments.
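The preprocessing and class-stratified 80/20 split can be sketched in pure Python. This is a simplified illustration: the regexes, function names, and toy inputs are assumptions, and stop-word removal is omitted for brevity:

```python
import random
import re
from collections import defaultdict

def preprocess(text, max_len=30):
    """Strip URLs, hashtags, and punctuation; drop samples over max_len tokens."""
    text = re.sub(r"https?://\S+|#\w+", "", text)   # URLs and hashtags
    text = re.sub(r"[^\w\s]", "", text)             # punctuation
    tokens = text.split()
    return tokens if len(tokens) <= max_len else None

def stratified_split(samples, test_ratio=0.2, seed=0):
    """samples: list of (tokens, label). Split per class so the train/test
    label distributions match the original distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s[1]].append(s)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * test_ratio)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Splitting within each label group (rather than over the whole pool) is what guarantees the class distribution stays identical across the training and test sets.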

Experiments
We conducted extensive experiments to answer the following three overarching questions about Data Boost:

Does Data Boost Improve Performance?
Several sets of data-starvation tests are prepared, each using a restricted fraction of the total data as training data. We keep the test data the same (20% of the whole dataset) but gradually decrease the size of the training data from 80% (the fully-loaded case) to 1% (the extremely low-resource case). We run both normal training and boosted training over the following training-set fractions of the total data: {1%, 5%, 20%, 40%, 60%, 80%} for both Offense Detection and Sentiment Analysis. Since the dataset for Irony Classification is small (N = 3,810), we use the fractions {10%, 20%, 30%, 40%, 60%, 80%}. Note that for boosted training we add augmentation samples to the training data until its size reaches 80% of the total (the fully-loaded size), to make sure that the size of the training set does not influence the results. (We ran all generation and classification training on 2 RTX 2080 GPUs; the average time for Data Boost to generate a 30-token sequence is under 1 second.)

Figure 2 shows the performance of the BERT (Devlin et al., 2019) (bert-base-cased) classifier fine-tuned on the three tasks, with and without Data Boost, over all training-set fractions. Data Boost yields greater improvements in extremely low-resource cases: we achieve absolute F1 increases of 12.4% (Offense), 9.1% (Sentiment), and 8.8% (Irony) when using only 1% (10% for the Irony task) of the original data for training. The results show that Data Boost can benefit a wide range of tasks with different characteristics. Also, since we used BERT as our classifier, which is already pre-trained on a large corpus, our results confirm that Data Boost can even improve the performance of large-scale LM-based classifiers.

Does Boosted Data Resemble the Original?
Table 3: Evaluation of the generation quality in terms of F1 deterioration and perplexity (PPL) increase. We keep the training data size the same but control the ratio of original to boosted data; the first results column corresponds to no boosting.

A common concern in text data augmentation is whether the augmented sentences preserve the quality of the original data. This is especially true for generation-based methods, since we create new sentences rather than simply replacing tokens to produce augmented data. We illustrate the quality of our data generation with two approaches: (1) visualizing the class distribution of the original and the augmented data; (2) using the boosting-ratio experiments described in Section 4.1 to see whether data augmentation causes performance deterioration and a perplexity increase. For visualization, we randomly pick 400 (100 for each class) original and generated sentences from the Offense Detection task (since it has the largest number of classes) and vectorize them with Sentence-BERT (Reimers and Gurevych, 2019). We apply t-SNE (Maaten and Hinton, 2008) to these vectors and plot their 2-D representations in Figure 3. From the figure, we can see that our RL-based algorithm manages to guide the generation towards the target labels, and for the most part, the distribution of generated sentences matches that of the original data.
Ratio-controlled experiments test the quality of boosted data by comparing training performance. If training on an augmented dataset has comparable performance (F1) to training on purely original data, one may infer that the quality of the augmentation data resembles that of the original data. We also use perplexity (PPL) as an auxiliary metric to evaluate augmentation quality: we trained three language models using kenLM (Heafield, 2011) on the original data of the three tasks, and use these models to calculate the perplexity of the ratio-controlled sets.
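As a reference point for how PPL flags off-distribution text, here is the standard perplexity computation for a toy add-alpha-smoothed unigram model. kenLM itself uses smoothed n-gram models; everything below (corpus, function name, smoothing) is illustrative:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """PPL = exp(-(1/N) * sum log p(w)); add-alpha smoothing over train vocab."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1                        # +1 slot for unseen tokens
    def p(w):
        return (counts[w] + alpha) / (total + alpha * vocab)
    log_prob = sum(math.log(p(w)) for w in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

train = "the baby is lovely the baby is cute".split()
in_domain  = unigram_perplexity(train, "the baby is lovely".split())
off_domain = unigram_perplexity(train, "quantum flux capacitor overload".split())
```

Text resembling the training distribution gets low perplexity, while out-of-distribution text gets high perplexity, which is why a rising PPL on the augmented data would signal a quality drop.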
In Table 3 we show the F1 deterioration and perplexity increase (higher perplexity means poorer fit to the LM) for different augmentation ratios. Even when we use 25% original data fused with 75% generated samples, the F1 score only undergoes a slight decrease (0.06, absolute) compared to when using 100% original data. We found that the perplexity also did not substantially increase even with higher boosting ratios.

Is Data Boost Classifier-Agnostic?
Table 4: Performance comparison with other text augmentation methods. 10%: 10% original data + 30% augmented data; 40%: 40% original data + 40% augmented data. We report the F1 score of the BERT classifier averaged over five repeated experiments. We also report the perplexity (PPL) of the augmented data (10,000 randomly sampled) from each method, scored by kenLM language models trained on the training data of each task.

We have shown Data Boost to be effective when used in conjunction with a BERT classifier, but can the performance be replicated with other classifiers? In other words, is Data Boost a classifier-agnostic augmentation method? To answer this question, we ran experiments on four other mainstream classifiers: the plain CNN classifier (Kim, 2014), the Bi-LSTM with attention mechanism (Zhou et al., 2016), the self-attention based Transformer network (Vaswani et al., 2017), and another LM-based classifier, XLNet (Yang et al., 2019). We trained all classifiers under three training-data settings, using {20%, 40%, 80%} of the total data as training data; in the first two settings, the training data is doubled in size using Data Boost augmentation. As shown in Table 2, Data Boost generally improves the performance of all the classifiers (from 1% to 13%, absolute), regardless of the classifier architecture. Moreover, we find Data Boost is not only effective for relatively simple classifiers (e.g., CNN) but also beneficial to complex LM-based classifiers (e.g., BERT and XLNet), which are already pre-trained on large corpora and generally serve as very strong baselines for text classification tasks. Table 5 shows sample generations by Data Boost.

Table 4 compares the performance of Data Boost with six prior text augmentation methods on all three tasks, using a BERT classifier. Naive methods (Coulombe, 2018; Xie et al., 2017) and translation-based methods (Fadaee et al., 2017; Sennrich et al., 2016) treat data noise, either from artificial typos or translation errors, as augmentation. Wei and Zou (2019) proposed EDA, a combination of token-level augmentations (random deletion, swapping, etc.); they reported modest improvements (0.8% on average) on several benchmark datasets. Zhang et al. (2015) performed character-level augmentation. These methods are usually compromised by low readability and flawed syntactic structure. Other methods utilize external resources to improve augmentation quality. For example, Wang and Yang (2015) leveraged Word2Vec to extract synonyms.
Kobayashi (2018) trained a Bi-RNN LM to propose context-aware replacements. Our tests find that these methods yield higher perplexity than others; the reason could be that Word2Vec does not take context into account, while LM replacement depends heavily on the quality of the self-trained LM. Data Boost, by contrast, is built on a state-of-the-art LM (GPT-2) and generates augmentations from scratch using RL rather than by replacement. Data Boost outperforms the other methods in the majority of the experiments (Table 4).

Comparison with Related Work
A few words about conditional generation: CTRL (Keskar et al., 2019) and BART (Lewis et al., 2019) are large-scale conditional LMs trained on self-collected data. PPLM (Dathathri et al., 2020) performs conditional generation by perturbing the vocabulary distribution during token decoding. These methods have not been explored for text augmentation applications. More importantly, they do not use reinforcement learning for fine-grained control over generation, which we found especially helpful when dealing with multiple labels within the same task.

Table 5: Sample generations by Data Boost (class: sample).

ironic: freezing cold winter air can be a real treat but if your room temperature stays below freezing for a long time then the best way to cool down is death.
non-ironic: FoxNewscom reporter Michelle Fields is being sued by a Republican donor for disrespecting him and his family the Republican National Committee announced Wednesday.
positive: Congratulations to our friends at Bored Panda Pizza for the wonderful promotion that they have done We are very happy and proud to be able to share.
neutral: Results of the study revealed that the amount of protein ingested was similar in each group but not significantly different in total fat total carbohydrate or total protein.
negative: disappointed by the news media reports In the United States media have been covering reports on the killing of a woman by two men on a train.
normal: Im not a doctor or any other medical profession Im just trying to make this post useful to others who are looking through this topic.
spam: Black Friday sales on Xbox One begin today Nov at am ET Heres everything you can find in Black.
abusive: sick of all the crap If youve been following the news you know that the Trump administration and Democrats have been attacking President Trump executive.
hateful: idiot how does she know that you are fucking with her I dont want to see a stupid person like you get raped by any fucking person.
Human Evaluation

Experimental Design
We conducted a human evaluation on Amazon Mechanical Turk (MTurk) in May 2020. Participants (N = 178) were randomly assigned to evaluate one of the three tasks: Irony Classification (n = 60), Sentiment Analysis (n = 58), or Offense Detection (n = 60). Participants were all from the United States and over 18 years old; their average age was 36.92 years. More than half (57.3%) of participants were male, 42.1% were female, and one participant self-reported gender as other. Each participant was paid 75 cents for their participation in this study.

Procedures
For each class, participants were asked to read three samples from each version: the original, the unconditionally generated (vanilla GPT-2), and the RL-conditionally generated (Data Boost). They were not informed of the actual labels and versions of the samples. After reading, participants were shown the actual label and version of the samples they had just read. They were then asked to answer a series of questions about label agreement (e.g., "How much do you agree with the assigned class label?") on a 7-point scale (1 = strongly disagree to 7 = strongly agree). Additionally, they were asked to rate the readability of the samples on a 7-point scale (lower scores correspond to lower readability and vice versa). The readability measure included five items adapted from previous studies (Graefe et al., 2018): well-written, concise, comprehensive, coherent, and clear.

Label Agreement
We conducted paired sample t-tests to examine how much participants agreed with the assigned labels.
For an ablation study, we included samples generated by both vanilla GPT-2 and Data Boost. Compared to vanilla GPT-2, Data Boost samples received higher label agreement scores in eight out of nine classes, and five of these differences were statistically significant (p < .05). No statistically significant differences were seen between the original and boosted data, except for the spam and normal classes in Offense Detection (p = .02 and p = .03). This result further confirms that Data Boost samples closely resemble the original samples and that Data Boost generates higher-quality samples than vanilla GPT-2.

Readability
We conducted several one-way analyses of variance (ANOVA) to test whether there were any statistically significant differences in the readability of the three versions (Table 6). There were no significant differences for six of the classes. Curiously, for the neutral (Sentiment), abusive (Offense), and hateful (Offense) labels, both the Data Boost and vanilla GPT-2 generated samples were rated as more readable than the original samples (p < .05). This could be explained by the fact that the original samples are generally noisy tweets. These results indicate that the Data Boost generation has similar readability to the vanilla GPT-2 or original samples.

Limitations
In this section we discuss the limitations of Data Boost. The performance gain achieved by Data Boost can be marginal on certain tasks, especially those whose classes cannot be modeled well by lexical features. For example, we experimented with Data Boost for metaphor detection using the LCC dataset (Mohler et al., 2016), sarcasm classification using the GHOSH dataset (Ghosh and Veale, 2017), and formality detection using the GYAFC formality style transfer dataset (Rao and Tetreault, 2018). We saw marginal improvements on these tasks, with absolute F1 increases of 1.3%, 0.9%, and 0.7% respectively (in the extreme data-scarcity case, where we expect Data Boost to help the most, i.e., when boosting 1% of the original data to 80%). We found it difficult for our model to extract explicit lexical features for the metaphor, sarcastic, and formal classes; this could be because syntactic features play a larger role in these classes. It is challenging for Data Boost to compose meaningful augmentation in such cases, given that our guidance on the generation is token-by-token.

Conclusion
We have proposed a powerful and easy-to-deploy approach to augment text data through conditional generation. By leveraging an off-the-shelf language model (GPT-2), we successfully guide the generation towards a specified direction (i.e., a target class) with the help of reinforcement learning. We find that Data Boost improves the performance of classification tasks, is classifier-agnostic, and surpasses several prior augmentation methods on three diverse classification tasks.
In the future, we plan to implement a more sophisticated guidance for the augmentation by adding syntactic and position features to the reward function, to enable augmentation of more diverse types of text data. The code will be made available upon request.