Automatically Learning Data Augmentation Policies for Dialogue Tasks

Automatic data augmentation (AutoAugment) (Cubuk et al., 2019) searches for optimal perturbation policies via a controller trained using performance rewards of a sampled policy on the target task, hence reducing data-level model bias. While powerful, this algorithm has so far focused on computer vision tasks, where it is comparatively easy to apply imperceptible perturbations without changing an image's semantic meaning. In our work, we adapt AutoAugment to automatically discover effective perturbation policies for natural language processing (NLP) tasks such as dialogue generation. We start with a pool of atomic operations that apply subtle, semantic-preserving perturbations to the source inputs of a dialogue task (e.g., different POS-tag types of stopword dropout, grammatical errors, and paraphrasing). Next, we allow the controller to learn more complex augmentation policies by searching over the space of combinations of these atomic operations. Moreover, we also explore conditioning the controller on the source inputs of the target task, since certain strategies may not apply to inputs that lack the strategy's required linguistic features. Empirically, we demonstrate that both our input-agnostic and input-aware controllers discover useful data augmentation policies and achieve significant improvements over the previous state of the art, including models trained on manually-designed policies.


Introduction
Data augmentation aims at teaching invariances to a model so that it generalizes better outside the training set distribution. Recently, there has been substantial progress in Automatic Data Augmentation (AutoAugment) for computer vision (Cubuk et al., 2019). This algorithm searches for optimal perturbation policies via a controller trained with reinforcement learning, where its reward comes from training the target model with data perturbed by the sampled augmentation policy. Each policy consists of 5 sub-policies sampled randomly during training, and each sub-policy consists of 2 operations applied in sequence. These operations are semantic-preserving image processing functions such as translation and rotation.
We adapt AutoAugment to NLP tasks, where the operations are subtle, semantic-preserving text perturbations. To collect a pool of such operations, the first challenge we face is that the discrete nature of text makes it less straightforward to come up with semantic-preserving perturbations. We thus employ as a starting point the Should-Not-Change strategies (equivalent to operations in this paper) proposed by Niu and Bansal (2018), which are shown to improve their dialogue task performance when the model is trained on data perturbed by these strategies. Importantly, we next divide their operations into several smaller ones (e.g., Grammar Errors divided into Singular/Plural Errors and Verb Inflection Errors) and also add a new operation, Stammer (word repetition). This modification provides a much larger space of operation combinations for the model to explore, so that it can potentially learn more complex and nuanced augmentation policies. Figure 1 shows a sub-policy containing two operations: it first paraphrases 2 tokens with probability 0.7 and then introduces 1 grammar error with probability 0.4. We choose the dialogue generation task based on the Ubuntu Dialogue Corpus (Lowe et al., 2015) because, as opposed to Natural Language Inference and Question Answering tasks, real-world dialogue datasets more naturally afford such perturbation-style human errors (i.e., contain more noise), and thus are inherently compatible with a variety of artificial perturbations. (Our code and sampled augmented data are publicly available at: https://github.com/WolfNiu/AutoAugDialogue. The learned policies are presented in Table 4.)
Empirically, we show that our controller can self-learn policies that achieve state-of-the-art performance on Ubuntu, even compared with very strong baselines such as the best manually-designed policy in Niu and Bansal (2018). We also verify this result through human evaluation to show that our model indeed learns to generate higher-quality responses. We next analyze the best-learned policies and observe that the controller prefers to sample operations which work well on their own as augmentation policies, and then combines them into stronger policy sequences. Lastly, observing that some operations require the source inputs to have certain linguistic features (e.g., we cannot apply Stopword Dropout to inputs that contain no stopwords), we also explore a controller that conditions on the source inputs of the target dataset, via a sequence-to-sequence (seq2seq) controller. We show that this input-aware model performs on par with the input-agnostic one (where the controller outputs do not depend on the source inputs), and may need more epochs to expose the model to the many diverse policies it generates. We also present selected best policies to demonstrate that the seq2seq controller can sometimes successfully attend to the source inputs.
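The sub-policy mechanism above (illustrated in Figure 1) can be sketched as two (operation, number-of-changes, probability) triples applied in sequence. This is a toy sketch, not the paper's implementation: `random_swap` and `stammer` below are simplified stand-ins, and the real Paraphrase and Grammar Errors operations require external linguistic resources.

```python
import random

def apply_operation(tokens, op, num_changes, prob):
    """Apply a perturbation op with the given probability; otherwise pass through."""
    if random.random() >= prob:
        return list(tokens)  # operation skipped this time
    return op(list(tokens), num_changes)

def apply_sub_policy(tokens, sub_policy):
    """A sub-policy is two (operation, num_changes, probability) triples in sequence."""
    for op, n, p in sub_policy:
        tokens = apply_operation(tokens, op, n, p)
    return tokens

# Toy stand-ins for two of the paper's operations.
def random_swap(tokens, n):
    """Swap n pairs of adjacent tokens."""
    for _ in range(n):
        if len(tokens) > 1:
            i = random.randrange(len(tokens) - 1)
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def stammer(tokens, n):
    """Repeat n randomly chosen words, imitating disfluent speech."""
    for _ in range(n):
        i = random.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    return tokens

sub_policy = [(random_swap, 2, 0.7), (stammer, 1, 0.4)]
print(apply_sub_policy("how do i mount a usb drive".split(), sub_policy))
```

Because each of the two operations either fires or not, a sub-policy yields at most 4 distinct outcomes for a given input, matching the Figure 1 caption.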

Related Work
There has been extensive work that employs data augmentation in both computer vision (Simard et al., 2003; Krizhevsky et al., 2012; Cireşan et al., 2012; Wan et al., 2013; Sato et al., 2015; DeVries and Taylor, 2017; Tran et al., 2017; Lemley et al., 2017) and NLP (Fürstenau and Lapata, 2009; Sennrich et al., 2016; Wang and Yang, 2015; Zhang et al., 2015; Jia and Liang, 2016; Kim and Rush, 2016; Hu et al., 2017; Xu et al., 2017; Xia et al., 2017; Silfverberg et al., 2017; Kafle et al., 2017; Hou et al., 2018; Wang et al., 2018). Automatic data augmentation is addressed via the AutoAugment algorithm proposed by Cubuk et al. (2019), which uses a hypernetwork (in our case, a controller) to train the target model, an approach inspired by neural architecture search (Zoph et al., 2017). Previous works have also adopted Generative Adversarial Networks (Goodfellow et al., 2014) to either directly generate augmented data (Tran et al., 2017; Sixt et al., 2018; Antoniou et al., 2017; Zhu et al., 2017; Mun et al., 2017; Wang and Perez, 2017) or generate augmentation strategies (Ratner et al., 2017). These approaches produce perturbations through continuous hidden representations. Motivated by the fact that our pool of candidate perturbations is discrete in nature, we identify AutoAugment as a more suitable base model and adapt it linguistically to the challenging setting of generative dialogue tasks. Our work closely follows Niu and Bansal (2018) to obtain a pool of candidate operations. Although their work also used combinations of operations for data augmentation, their best model is manually designed, training on each operation in sequence, while our model automatically discovers more nuanced and detailed policies that specify not only the operation types but also their intensity (the number of changes) and application probability.

Model
AutoAugment Architecture: As shown in Figure 2, our AutoAugment model consists of a controller and a target model (Cubuk et al., 2019). The controller first samples a policy that transforms the original data into augmented data, on which the target model trains. After training, the target model is evaluated on the validation set. This performance is then fed back to the controller as the reward signal. Figure 3 shows the controller: for each operation, it samples the Operation Type, the Number of Changes (a discrete equivalent of the continuous Magnitude in Cubuk et al. (2019)), and the Probability of applying that operation. The input-aware controller corresponds to the whole figure, i.e., we novelly add an encoder that takes as input the source of the training data, making the controller a seq2seq model. Since each source input may have a different set of perturbations that suit it best, our input-aware controller aims at providing customized operations for each training example. Search Space: Following Niu and Bansal (2018), our pool of operations contains Random Swap, Stopword Dropout, Paraphrase, Grammar Errors, and Stammer, which cover a substantial search space of real-world noise in text.
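The controller–target loop just described can be summarized in a few lines. The callables below are hypothetical placeholders (not the paper's code) for controller sampling, target-model training, validation, and the controller update:

```python
def autoaugment_search(sample_policy, train_target, evaluate, update_controller, steps):
    """Sketch of the AutoAugment loop in Figure 2: the controller proposes a policy,
    the target model trains on data perturbed by that policy, and the validation
    score is fed back to the controller as the reward."""
    best_policy, best_reward = None, float("-inf")
    for _ in range(steps):
        policy = sample_policy()            # controller samples a policy
        model = train_target(policy)        # train target model on augmented data
        reward = evaluate(model)            # e.g., weighted Activity/Entity F1
        update_controller(policy, reward)   # policy-gradient update on the controller
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy
```

At the end of the search, the best policy found is the one used to retrain the target model from scratch.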
To allow the controller to learn more nuanced combinations of operations, we further divide Stopword Dropout into 7 categories (Noun, Adposition, Pronoun, Adverb, Verb, Determiner, and Other) and divide Grammar Errors into Noun (plural/singular confusion) and Verb (inflected/base form confusion). For Stopword Dropout, we chose the first six categories because they are the major universal POS tags in the set of closed-class words. We perform this subdivision also because different categories of an operation can influence the model to different extents: if the original utterance is "What is the offer?", dropping the interrogative pronoun "what" is more semantic-changing than dropping the determiner "the." Our pool thus consists of 12 operations. For generality, we do not distinguish in advance which operation alone is effective as an augmentation policy on the target dataset, but rather let the controller figure it out. Moreover, it is possible that an operation alone is not effective, but applied in sequence with another operation, the two collectively teach the model a new pattern of invariance in the data. Each policy consists of 4 sub-policies chosen randomly during training, each sub-policy consists of 2 operations, and each operation has 3 hyperparameters. We discretize the Probability of applying an operation into 10 uniformly-spaced values starting from 0.1, and let the Number of Changes range from 1 to 4.
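Under these choices, the discretized search space can be enumerated directly. The operation names below paraphrase the 12 operations listed above, and the uniform sampler is a stand-in for the learned controller:

```python
import random

# The 12 operations: 7 Stopword Dropout categories, 2 Grammar Error categories,
# plus Random Swap, Paraphrase, and Stammer.
OPERATIONS = [
    "Noun Dropout", "Adposition Dropout", "Pronoun Dropout", "Adverb Dropout",
    "Verb Dropout", "Determiner Dropout", "Other Dropout",
    "Noun Grammar Error", "Verb Grammar Error",
    "Random Swap", "Paraphrase", "Stammer",
]
NUM_CHANGES = [1, 2, 3, 4]
PROBS = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0

def sample_policy(num_sub_policies=4, ops_per_sub_policy=2):
    """Uniformly sample a policy: 4 sub-policies of 2 (op, n, p) triples each."""
    return [
        [(random.choice(OPERATIONS), random.choice(NUM_CHANGES), random.choice(PROBS))
         for _ in range(ops_per_sub_policy)]
        for _ in range(num_sub_policies)
    ]

# Search space size: (12 ops x 4 intensities x 10 probabilities) per operation slot,
# with 4 sub-policies x 2 operations = 8 slots.
space = (len(OPERATIONS) * len(NUM_CHANGES) * len(PROBS)) ** (4 * 2)
print(f"{space:.2e}")  # → 2.82e+21
```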
Thus, the search space size is (12 × 4 × 10)^(2×4) = 2.82 × 10^21. (Using 2 operations per sub-policy also allows more perturbations than a single strategy exerts: in reality, a sentence may contain several types of noise.) Search Algorithm: We use REINFORCE, a policy gradient method (Williams, 1992; Sutton et al., 2000), to train the controller. At each step, the decoder samples a token and feeds it into the next step. Since each policy consists of 4 sub-policies, each sub-policy contains 2 operations, and each operation is defined by 3 tokens (Operation Type, Number of Changes, Probability), the total number of steps for each policy is 4 × 2 × 3 = 24. Sampling multiple sub-policies to form one policy provides a less biased estimate of the controller's performance. We sample these 4 sub-policies at once rather than sampling the controller 4 times to reduce repetition: the controller needs to keep track of which policies it has already sampled. To obtain the reward for the controller, we train the target model with the augmented data and measure its validation set performance. We calculate a weighted average of two F1s (Activity and Entity) as the reward R, since both are important complementary measures of an informative response, as discussed in Serban et al. (2017a). We use the reinforcement loss following Ranzato et al. (2016) and an exponential moving average baseline to reduce training variance. At the end of the search, we use the best policy to train the target model from scratch and evaluate on the test set.
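The reward and baseline can be sketched as follows. The F1 weighting and EMA decay below are our assumptions (not stated here in the paper), and the actual gradient step through the log-probabilities of the 24 sampled tokens is only indicated in a comment:

```python
def weighted_f1_reward(activity_f1, entity_f1, w=0.5):
    """Reward R: weighted average of Activity and Entity F1 (w=0.5 is an assumption)."""
    return w * activity_f1 + (1.0 - w) * entity_f1

class EMABaseline:
    """Exponential moving average baseline for REINFORCE variance reduction."""
    def __init__(self, decay=0.95):  # decay value is an assumption
        self.decay, self.value, self.initialized = decay, 0.0, False

    def advantage(self, reward):
        if not self.initialized:
            self.value, self.initialized = reward, True
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv

# One conceptual controller update per sampled policy:
#   loss = -(sum of log-probs of the 24 sampled tokens) * advantage
baseline = EMABaseline()
for reward in [0.10, 0.12, 0.11, 0.15]:  # illustrative validation rewards
    adv = baseline.advantage(reward)
    print(f"reward={reward:.2f}  advantage={adv:+.4f}")
```

Subtracting the moving-average baseline keeps the gradient signal centered, so policies are reinforced only when they beat the controller's recent average reward.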

Experimental Setup
Dataset and Model: We investigate the Variational Hierarchical Encoder-Decoder (VHRED) (Serban et al., 2017b) on the troubleshooting Ubuntu Dialogue task (Lowe et al., 2015). For this short paper, we focus on this dataset because it has a well-established F1 metric among dialogue tasks. In future work, we plan to apply the idea to other datasets and NLP tasks, since the proposed model is not specific to the Ubuntu dataset. We hypothesize that most web/online datasets that have regular human errors/noise (e.g., the Twitter Dialogue Dataset (Serban et al., 2017b) and multiple movie corpora (Serban et al., 2016)) would be suitable for our framework. Training Details: We employ code from Niu and Bansal (2018) to reproduce the VHRED baseline results and follow the methods described in Zoph et al. (2017) to train the controller. For details, please refer to their Appendix A.2 (Controller Architecture) and A.3 (Training of the Controller).
We adopt the following method to speed up controller training and hence facilitate scaling. We let the target model resume from the converged baseline checkpoint and train on the perturbed data for 1 epoch. During testing, we use the policy that achieves the highest weighted average F1 score to train the final model for 1 epoch. Human Evaluation: We conducted human studies on MTurk. We compared each of the input-agnostic/input-aware models with the VHRED baseline and the All-operations model from Niu and Bansal (2018), following the same setting. Each study contains 100 samples randomly chosen from the test set. The utterances were randomly shuffled, and we only allowed US-located human evaluators with an approval rate > 98% and at least 10,000 approved HITs. More details are in the appendix.

Results and Analysis
Automatic Results: Table 1 shows that all data-augmentation approaches (last 3 rows) improve statistically significantly (p < 0.01) over the strongest baseline, VHRED (w/ attention). Human Evaluation Results: Table 2 shows that both AutoAugment models obtained significantly more net wins (last column) than the VHRED-attn baseline. They both outperform even the strong manual-policy All-operations model.
Policies Learned by the Input-Agnostic Controller: We present the 3 best learned policies from the Ubuntu val set (Table 4). Although there is a 7/12 = 58.3% probability of sampling one of the Stopword Dropout operations from our pool, all 3 learned policies show much more diversity in the operations they choose. This is also the case for the other two hyperparameters: Number of Changes varies from 1 to 4, and Probability varies from 0.1 to 0.9, spanning nearly their entire search ranges. Moreover, all best policies include Random Swap, which agrees with the results in Niu and Bansal (2018).
Example Analysis of the Perturbation Procedure in Generated Responses: We also present a selected example of perturbed source inputs from the three augmentation models with their respective best policies in Table 3. First of all, the All-operations model is forced to use an operation (in this case Random Swap) with a fixed number of changes and a probability of 1.0, leading to less variation in the source inputs. On the other hand, our input-agnostic AutoAugment model samples 3 Verb Dropouts followed by Random Swap.
Note that although the number of changes for the dropout is 3, there are only 2 verb stopwords in the utterance; the operation thus resorts to modifying only 2 tokens. The input-aware model samples Stammer followed by 2 Verb Dropouts. Interestingly, it inserts an extra "bla" around the other "bla"s in the utterance. It also did not sample a policy that drops more than 2 verb stopwords (one of the sampled operations is not applied due to its Probability parameter). These two observations indicate that the model can sometimes successfully attend to the source inputs to provide customized policies.
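The fallback behavior noted above (an operation modifying fewer tokens than its Number of Changes when the input lacks enough candidates) can be sketched as follows. The stopword list is a small illustrative subset, not the paper's exact list:

```python
import random

# Illustrative subset of closed-class verb stopwords (not the paper's exact list).
STOPWORD_VERBS = {"is", "are", "was", "were", "be", "been", "do", "does", "did",
                  "have", "has", "had", "can", "could", "will", "would"}

def verb_dropout(tokens, num_changes):
    """Drop up to num_changes verb stopwords; if the utterance contains fewer
    candidates, modify only the ones available (the fallback described above)."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() in STOPWORD_VERBS]
    drop = set(random.sample(candidates, min(num_changes, len(candidates))))
    return [t for i, t in enumerate(tokens) if i not in drop]

# With only 2 verb stopwords ("do", "is") and num_changes=3, just 2 tokens drop.
print(verb_dropout("i do not know what is wrong".split(), 3))
# → ['i', 'not', 'know', 'what', 'wrong']
```

An input-aware controller can in principle avoid sampling such over-sized operations in the first place, which is the motivation for conditioning on the source inputs.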

Conclusion and Future Work
We adapt AutoAugment to dialogue and extend its controller to a conditioned model. We show via automatic and human evaluations that our AutoAugment models learn useful augmentation policies which lead to state-of-the-art results on the Ubuntu task. Motivated by the promising success of our model in this short paper, we will apply it to other diverse NLP tasks in future work.

Figure 1 :
Figure 1: Example of a sub-policy applied to a source input. The first operation (Paraphrase, 2, 0.7) paraphrases 2 tokens with probability 0.7; the second operation (Grammar Errors, 1, 0.4) inserts 1 grammar error with probability 0.4. Thus there are at most 4 possible outcomes for each sub-policy.

Figure 2 :
Figure 2: The controller samples a policy which is used to perturb the training data. After training on the augmented inputs, the model feeds the performance back to the controller as the reward.

Figure 3 :
Figure 3: AutoAugment controller. An input-agnostic controller corresponds to the lower part of the figure. It samples a list of operations in sequence. An input-aware controller additionally has an encoder (upper part) that takes in the source inputs of the data.

Table 2 :
Human evaluation results on comparisons among the baseline, All-operations, and the two AutoAugment models. W: Win, T: Tie, L: Loss. For the All-operations model (which corresponds to the All-Should-Not-Change model in Niu and Bansal (2018)), we train on each operation (without subdivisions for Stopword Dropout and Grammar Errors).