Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks

Mixup is a recently proposed data augmentation technique that linearly interpolates input examples and the corresponding labels. It has shown strong effectiveness in image classification by interpolating images at the pixel level. Inspired by this line of research, in this paper we explore (i) how to apply mixup to natural language processing tasks, since text data can hardly be mixed in its raw format, and (ii) whether mixup is still effective in transformer-based learning models, e.g., BERT. To this end, we incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks while keeping the whole system end-to-end trainable. We evaluate the proposed framework with extensive experiments on the GLUE benchmark. Furthermore, we examine the performance of mixup-transformer in low-resource scenarios by reducing the training data by a certain ratio. Our studies show that mixup is a domain-independent data augmentation technique for pre-trained language models, resulting in significant performance improvements for transformer-based models.


Introduction
Deep learning has shown outstanding performance in the field of natural language processing (NLP). Recently, transformer-based methods (Devlin et al., 2018; Yang et al., 2019) have achieved state-of-the-art performance across a wide variety of NLP tasks. However, these models rely heavily on the availability of large amounts of annotated data, which is expensive and labor-intensive to obtain. To mitigate the data scarcity problem, data augmentation is commonly used in NLP tasks. For example, Wei and Zou (2019) investigated language transformations such as insertion, deletion, and swap. Malandrakis et al. (2019) and Yoo et al. (2019) utilized variational autoencoders (VAEs) (Kingma and Welling, 2013) to generate additional raw inputs. Nevertheless, these methods often rely on extra knowledge to guarantee the quality of the new inputs, and they must run as a separate pipeline stage. Zhang et al. (2017) proposed mixup, a domain-independent data augmentation technique that linearly interpolates image inputs in the pixel-based feature space. Guo et al. (2019) tried mixup with CNNs (LeCun et al., 1998) and LSTMs (Hochreiter and Schmidhuber, 1997) for text applications. Despite its effectiveness, they applied mixup only at the fixed word-embedding level, analogous to what Zhang et al. (2017) did at the pixel level for image classification. Two questions therefore arise: (i) how can mixup be applied to NLP tasks, given that text data cannot be mixed in its raw format? Apart from the word-embedding space, what other representation spaces can be constructed and used? (ii) can mixup further boost the state of the art in transformer-based learning models, such as BERT (Devlin et al., 2018)?
To answer these questions, we stack a mixup layer over the final hidden layer of the pre-trained transformer-based model. The resulting system can be applied to a broad range of NLP tasks and, in particular, remains end-to-end trainable. We evaluate the proposed mixup-transformer on the GLUE benchmark, which shows that mixup can consistently improve the performance of each task. Our contributions are summarized as follows:
• We propose mixup-transformer, which applies mixup to transformer-based pre-trained models. To the best of our knowledge, this is the first work that explores the effectiveness of mixup in transformer-based models.
• In experiments, we demonstrate that mixup-transformer consistently improves performance across a wide range of NLP benchmarks, and that it is particularly helpful in low-resource scenarios where the training data is reduced to between 10% and 90% of its original size.

Mixup-Transformer
In this section, we first introduce the mixup technique used in previous works. Then, we show how to incorporate mixup into transformer-based methods and how to fine-tune the resulting model on different text classification tasks. Last, we discuss the differences between previous works and our approach.

Mixup
Mixup was first proposed for image classification (Zhang et al., 2017). It incorporates the prior knowledge that linear interpolations of feature representations should lead to the same interpolations of the associated targets. In mixup, virtual training examples are constructed from two examples (x_i, y_i) and (x_j, y_j) drawn at random from the training data:

x̃ = λ x_i + (1 − λ) x_j
ỹ = λ y_i + (1 − λ) y_j

where λ can be either a fixed value in [0, 1] or sampled as λ ∼ Beta(α, α), for α ∈ (0, ∞). In previous works, mixup is a static data augmentation approach that improves both the accuracy and the robustness of image classifiers.
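To make the interpolation concrete, below is a minimal sketch of this input-level mixup, assuming NumPy arrays for a pair of examples and one-hot label vectors; the function name and arguments are illustrative rather than taken from the original implementation.

```python
import numpy as np

def mixup_pair(x_i, x_j, y_i, y_j, alpha=0.2):
    """Interpolate one pair of training examples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)        # λ ~ Beta(α, α)
    x_mix = lam * x_i + (1.0 - lam) * x_j     # x̃ = λ x_i + (1 − λ) x_j
    y_mix = lam * y_i + (1.0 - lam) * y_j     # ỹ = λ y_i + (1 − λ) y_j
    return x_mix, y_mix
```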

Mixup for Text Classification
Text classification is one of the most fundamental problems in NLP. Unlike image data, text input consists of discrete units (words) on which algebraic operations such as interpolation are not directly defined; the input could be one sentence, two sentences, a paragraph, or a whole document.
The first step of text classification is to convert each word of the text into a vector representation via word embeddings. Traditional approaches use bag-of-words features or a fixed word-to-vector mapping fed into a CNN or LSTM encoder. In our approach, instead of these traditional encoders, we use transformer-based pre-trained language models to learn representations for the text data. For downstream tasks, we fine-tune the transformer-based models with the mixup data augmentation method. Formally, mixup-transformer constructs virtual hidden representations dynamically during the training process as follows:

x̃ = λ T(x_i) + (1 − λ) T(x_j)
ỹ = λ y_i + (1 − λ) y_j

where T(·) denotes the output of the transformer layers, as shown in Figure 1. Note that the mixup process is trained together with the fine-tuning process in an end-to-end fashion, so the mixed hidden representations change dynamically during training.
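As an illustration, the sketch below applies mixup to the pooled output of BERT, assuming a Hugging Face-style BertModel, a single linear classification head, and pairing examples within a batch via a random permutation; these implementation choices (the class and variable names, the pooled output, the soft-label cross-entropy) are our assumptions and not necessarily the exact original code.

```python
import torch
import torch.nn.functional as F
from transformers import BertModel

class MixupTransformer(torch.nn.Module):
    """Sketch: mixup applied to the final hidden representation T(x) of BERT."""
    def __init__(self, num_labels, lam=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)
        self.lam = lam  # fixed λ, as in the experiments described below

    def forward(self, input_ids, attention_mask, labels, use_mixup=True):
        # T(x): pooled representation from the transformer layers
        hidden = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        one_hot = F.one_hot(labels, self.classifier.out_features).float()
        if use_mixup:
            perm = torch.randperm(hidden.size(0))  # pair each example with a random partner
            hidden = self.lam * hidden + (1 - self.lam) * hidden[perm]
            one_hot = self.lam * one_hot + (1 - self.lam) * one_hot[perm]
        logits = self.classifier(hidden)
        # soft-label cross-entropy over the interpolated targets
        loss = -(one_hot * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        return loss, logits
```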

Discussion
In this section, we highlight two main differences between our approach and previous methods that use mixup (Zhang et al., 2017; Guo et al., 2019).
• Dynamic mixup representation. For each input pair, x_i and x_j, the previous approaches produce a fixed mixup representation given a fixed λ. However, the mixup hidden representations in our approach are dynamic, since they are trained together with the fine-tuning process.
• Dynamic mixup activation. Since a pre-trained network needs to be fine-tuned for a specific task, we can dynamically activate mixup during training. For example, if the number of training epochs is 3, we can choose to apply mixup in any subset of the epochs or in all of them. In our experiments, we fine-tune the model without mixup in the first half of the epochs to obtain good representations, and enable mixup in the second half.
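This activation schedule could be implemented roughly as follows, assuming the MixupTransformer module sketched earlier and a standard PyTorch training loop; model, train_loader, and optimizer are placeholders.

```python
num_epochs = 3
for epoch in range(num_epochs):
    # Train without mixup in the earlier epochs to learn good representations,
    # then switch mixup on for the remaining epochs.
    mixup_on = epoch >= num_epochs // 2
    for batch in train_loader:
        loss, _ = model(batch["input_ids"], batch["attention_mask"],
                        batch["labels"], use_mixup=mixup_on)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```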

Experiments
To show the effectiveness of our proposed mixup-transformer, we conduct extensive experiments by adding the mixup strategy to transformer-based models on eight NLP tasks contained in the GLUE benchmark. Furthermore, we reduce the training data with different ratios (from 10% to 90%) to see how the mixup strategy works with insufficient training data. We report the performance on the development sets for all tasks because access to the test sets is limited by the online GLUE benchmark.
Baselines. We use two baselines in the experiments, BERT-base and BERT-large (Devlin et al., 2018), and evaluate our method by adding the mixup strategy to each of them.
Implementation details. When fine-tuning BERT with or without the mixup strategy for these NLP tasks, we fix the hyper-parameters as follows: the batch size is 8, the learning rate is 2e-5, the maximum sequence length is 128, and the number of training epochs is 3. We test different values of λ (from 0.1 to 0.9) on a default dataset (CoLA) and find that mixup-transformer is insensitive to this hyper-parameter, so we fix λ = 0.5.
Experimental results for eight different NLP tasks are shown in Table 1. By adding the proposed mixup technique to BERT-base and BERT-large, mixup-transformer improves the performance consistently on most of these tasks, with an average improvement of around 1% across all settings. The highest gain comes from RTE with mixup on BERT-base, where the accuracy improves from 68.23% to 71.84%, an increase of 3.61%. The Matthews correlation for CoLA increases from 59.71% to 62.39% (an improvement of 2.68%) with mixup on BERT-large. A few settings show a small performance drop with mixup; for example, adding mixup to BERT-base on STS-B decreases the Spearman correlation from 89.41% to 88.66%. Overall, 14 out of the 16 settings improve with mixup-transformer, while 2 become slightly worse.
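For reference, the fixed configuration described above could be collected as in the sketch below; the key names follow common Hugging Face-style training scripts and are illustrative rather than the exact flags used in our runs.

```python
# Fixed hyper-parameters for all GLUE fine-tuning runs (illustrative key names).
hparams = {
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "max_seq_length": 128,
    "num_train_epochs": 3,
    "mixup_lambda": 0.5,  # insensitive over 0.1-0.9 on CoLA, so fixed at 0.5
}
```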

Results on limited data
As mixup is a technique for augmenting the feature space, it is interesting to see how it works when the training data is insufficient. Therefore, we reduce the training data by a certain ratio (keeping from 10% to 90% of it, with a step of 10%) and test the effectiveness of mixup-transformer in these low-resource scenarios. As shown in Table 2, BERT-large + mixup consistently outperforms BERT-large when we reduce the training data for MRPC; the highest improvement (4.90%) is achieved when only 40% of the training data is used, whereas using the full training data (100%) gives an increase of 2.46%. This indicates that mixup-transformer works even better with reduced annotations. We also report experiments with reduced training data on other tasks, including STS-B, RTE, and CoLA. As shown in Figure 2, mixup-transformer again consistently improves the performance in all of these experiments. The performance gains with less training data (e.g., 10% for STS-B, CoLA, and RTE) are higher than with the full training data, since data augmentation is more effective when annotations are scarce. Therefore, the mixup strategy is particularly helpful in low-resource scenarios.

Conclusion and Future Work
In this paper, we propose mixup-transformer, which incorporates the data augmentation technique mixup into transformer-based models for NLP tasks. Unlike the static mixup used in previous works, our approach dynamically constructs new inputs for text classification. Extensive experimental results show that mixup-transformer can be used dynamically with a pre-trained model to achieve better performance on the GLUE benchmark. Two future directions are worth exploring. First, how to apply mixup to other challenging NLP problems, such as zero-shot, few-shot, or meta-learning tasks. Second, how to perform mixup on document-level text such as paragraphs; instead of mixing whole inputs directly, we may need to extract the appropriate information from the data during training. Selecting the right information to mix for text classification remains an exciting and challenging open problem.