Augmenting NLP models using Latent Feature Interpolations

Models with a large number of parameters are prone to over-fitting and often fail to capture the underlying input distribution. We introduce EMix, a data augmentation method that constructs virtual examples by interpolating word embeddings and hidden-layer representations. EMix yields significant improvements over previously proposed interpolation-based regularizers and data augmentation techniques, and we demonstrate that it is more robust to sparsification. We highlight the merits of the proposed method through thorough quantitative and qualitative assessments.


Introduction
In this paper, we address the problem of sentence classification. Current state-of-the-art deep learning approaches for sentence classification have millions of parameters and therefore require a large number of training examples, which can be time-consuming and expensive to obtain. We present a new data augmentation technique, EMix, to improve the performance of current text classification models.
In general, it is difficult to devise rules for language transformations analogous to image transformations; hence, universal data augmentation techniques have not yet been thoroughly explored in NLP (Wei and Zou, 2019). One such technique is Mixup, which uses systematic transformations to ensure the model trains on samples from the vicinity distribution (Chapelle et al., 2001) along with the original distribution of the training data. Mixup is a data-agnostic augmentation method proven effective in Computer Vision tasks (Zhang et al., 2017; Verma et al., 2018). Guo et al. (2019) extend Mixup to text classification using interpolations in the embedding space. In our proposed method, we first eliminate the common mean vector from the embeddings and then interpolate hidden representations at multiple layers, including the embedding layer. This mixing method is simple and easy to implement, and leads to a further improvement in performance.
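As an illustration, the core Mixup operation can be sketched as follows (a minimal NumPy sketch; the function and variable names are our own, not from any reference implementation):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Create a virtual training example by linear interpolation (Zhang et al., 2017).

    x_i, x_j: feature vectors of two random samples;
    y_i, y_j: one-hot label vectors.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing ratio lambda ~ Beta(alpha, alpha)
    x_mix = lam * x_i + (1.0 - lam) * x_j   # interpolated input
    y_mix = lam * y_i + (1.0 - lam) * y_j   # interpolated (soft) label
    return x_mix, y_mix
```

The mixed label is a convex combination of the two one-hot vectors, so it remains a valid probability distribution over classes.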

Related Work
Data augmentation has been used extensively in Computer Vision (Krizhevsky et al., 2012), employing techniques such as flipping, scaling, and rotation. It has also been explored in NLP to some extent. Zhang et al. (2015) replaced words with their synonyms according to a geometric distribution. Wang and Yang (2015) used k-NN with cosine similarity to find similar words for replacement. Wei and Zou (2019) introduced EDA (Easy Data Augmentation) for NLP, which includes synonym replacement, random insertion, random deletion, and random swap. These methods risk completely changing the context of a sentence, since replacing a single word can alter its entire meaning. We instead create virtual examples by interpolating the latent representations of two sentences and mix the corresponding labels accordingly, ensuring that the interpolated sentence still conforms to the correct label.
Mixup (Zhang et al., 2017; Verma et al., 2018; Tokozume et al., 2018; Yun et al., 2019) has shown improvements in the accuracy of image classification models in Computer Vision. In NLP, Guo et al. (2019) trained a CNN-based classifier on various sentiment classification datasets with mixup and found an increase in the generalization capability of the model. However, that methodology only interpolates features within the input space. Through EMix, we capture a greater breadth of the feature space by interpolating hidden-layer representations.
Our Contribution: We make the following contributions: i) we create improved sentence classification models using EMix, a technique that uses latent interpolations of hidden-layer representations; ii) we demonstrate how EMix makes text classification models more robust to sparsification; iii) qualitatively, we highlight how EMix leads to more pronounced decision boundaries for text classification.

Methodology
The main idea of Mixup (Zhang et al., 2017) is as follows: given two labeled data points $(x_i, y_i)$ and $(x_j, y_j)$, where $x_i$ and $x_j$ are two random samples and $y_i$ and $y_j$ are one-hot representations of their labels, the algorithm creates virtual training samples by linear interpolation of the inputs as well as the labels:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad (1)$$
where $\lambda \sim \beta(\alpha, \alpha)$ is the mixing ratio and $\tilde{y}$ is the mixed label. Mixup was demonstrated to work well on continuous image data. However, extending it to text is challenging since it is infeasible to compute interpolations of discrete tokens. Here, we propose a new mixing method in which the neural network is trained on interpolations of the hidden states and which takes into account the properties of word representations. Word embeddings have a non-zero mean: all word vectors share a large common vector. Let $x_i(w) \in \mathbb{R}^d$ be the representation of a word $w$ in the vocabulary $V$. Since all words share the same common vector, we first eliminate it by subtracting the non-zero mean vector from every word vector, effectively reducing the energy (Mu et al., 2017):
$$\hat{x}_i(w) = x_i(w) - \frac{1}{|V|} \sum_{w' \in V} x_i(w'). \qquad (2)$$
Second, analogous to how Tokozume et al. (2017) account for the energy of speech samples, we take the difference in text energies (Mu et al., 2017) of the two samples into consideration so that the perceived ratio of the mixed sample is $x_i : x_j = \lambda : (1 - \lambda)$. We define the mixing coefficient $t$ using the per-sample standard deviations $\sigma_i$ and $\sigma_j$: solving $t\sigma_i : (1 - t)\sigma_j = \lambda : (1 - \lambda)$ yields the proposed mixing method
$$\tilde{h} = t\, h^i + (1 - t)\, h^j, \qquad \text{where } t = \frac{\lambda \sigma_j}{\lambda \sigma_j + (1 - \lambda)\sigma_i}. \qquad (3)$$
Let $g(\cdot, \theta)$ denote the classification model, where $\theta$ denotes the model parameters. Assuming the model has $M$ layers, we mix the hidden representations at the $m$-th layer, $m \in [0, M]$. The $m$-th layer is denoted $g_m(\cdot, \theta)$, so the hidden representation at the $m$-th layer is $h_m = g_m(h_{m-1}, \theta)$. The $0$-th layer is the embedding layer.
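The mean-removal step and the energy-aware mixing coefficient derived from $t\sigma_i : (1-t)\sigma_j = \lambda : (1-\lambda)$ can be sketched as follows (a minimal NumPy sketch; the function names are illustrative):

```python
import numpy as np

def remove_common_mean(embeddings):
    """Subtract the shared non-zero mean vector from all word embeddings
    (common-component removal, in the spirit of Mu et al., 2017).

    embeddings: (|V|, d) matrix, one row per word vector.
    """
    return embeddings - embeddings.mean(axis=0, keepdims=True)

def emix_coefficient(lam, sigma_i, sigma_j):
    """Solve t*sigma_i : (1 - t)*sigma_j = lam : (1 - lam) for the coefficient t."""
    return lam * sigma_j / (lam * sigma_j + (1.0 - lam) * sigma_i)
```

Note that when the two samples have equal text energy ($\sigma_i = \sigma_j$), the coefficient reduces to the plain Mixup ratio $t = \lambda$.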
Hence, for two text samples $x_i$ and $x_j$, the embeddings are $h_0^i = W_E x_i$ and $h_0^j = W_E x_j$, and the subsequent hidden representations are
$$h_m^i = g_m(h_{m-1}^i, \theta), \qquad h_m^j = g_m(h_{m-1}^j, \theta). \qquad (4)$$
The hidden representations at the $m$-th layer are mixed using equation 3; we denote this mixed representation $\tilde{h}_m$. Mixup at the $m$-th layer is thus defined as
$$\tilde{h}_m = t\, h_m^i + (1 - t)\, h_m^j, \qquad (5)$$
and the forward pass continues from the mixed hidden representation:
$$\tilde{h}_k = g_k(\tilde{h}_{k-1}, \theta), \qquad k = m+1, \ldots, M. \qquad (6)$$
The set of layers eligible for mixup is denoted $S \subseteq \{0, 1, \ldots, M\}$. The layer $m$ at which mixup occurs is drawn uniformly at random from $S$, sampled separately for each pair of examples being mixed.
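Putting the pieces together, mixing at a chosen layer during the forward pass can be sketched as follows (a toy sketch in which plain callables stand in for the layers $g_m$; names are illustrative):

```python
import numpy as np

def forward_with_emix(layers, x_i, x_j, t, mix_layer):
    """Forward both samples up to layer `mix_layer`, mix the hidden states with
    coefficient t, then continue a single forward pass on the mixture.

    layers: list of callables, one per layer (index 0 plays the embedding layer).
    """
    h_i, h_j = x_i, x_j
    for m, g in enumerate(layers):
        h_i, h_j = g(h_i), g(h_j)
        if m == mix_layer:
            h = t * h_i + (1.0 - t) * h_j   # mixed hidden state at layer m
            for g_later in layers[m + 1:]:  # continued forward pass
                h = g_later(h)
            return h
    raise ValueError("mix_layer must be in range(len(layers))")

# Per pair, the mixing layer would be sampled uniformly from the eligible set S:
# m = np.random.default_rng().choice(list(S))
```

Because every layer here is linear in this toy example, mixing at layer 0 or layer 1 yields the same output; with nonlinear layers the choice of $m$ matters, which is why it is resampled per pair.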

Experiments
We investigated the effectiveness of EMix on five benchmark sentence classification tasks, the same as used by Guo et al. (2019), for a fair comparison: TREC (Li and Roth, 2002), SST-1 (Socher et al., 2013), SST-2 (its binary variant), Subj (Pang and Lee, 2004), and UCI News. We evaluate our proposed mixup using two popular sentence classification models, CNN (Kim, 2014) and BERT (Devlin et al., 2018). We compare against three recent text augmentation methods: EDA (Wei and Zou, 2019), wordMixup (linear interpolation applied at the word-embedding level), and senMixup (linear interpolation applied to the layer before the softmax) (Guo et al., 2019). These augmentation strategies rely little on additional text resources or domain knowledge.
In our experiments, we follow the exact implementation and settings of Kim (2014), Guo et al. (2019), and Wei and Zou (2019). Specifically, we use filter sizes of 3, 4, and 5, each with 100 feature maps, a dropout rate of 0.5, and L2 regularization of 0.2 for the baseline CNN model. We use the HuggingFace (Wolf et al., 2019) implementation of the BERT base model with 12 transformer blocks, 12 attention heads, and 110 million parameters. We use the default learning rate of 2e-5, a dropout rate of 0.1, and a batch size of 8 for all experiments with BERT. All models were trained on an Nvidia Tesla K80 GPU.

Observations
Observation 1: Impact of EMix on Sentence Classification. Table 1 shows the accuracy for text classification. The results demonstrate that EMix outperforms wordMixup and senMixup (Guo et al., 2019). EMix leverages interpolations from intermediate layers, which capture richer features and provide an additional training signal.
Wei and Zou (2019) noted that EDA does not work well for regularized pretrained models like BERT. We observe the same across our experiments: absolute improvements with EDA are limited to a maximum of 0.5%. In contrast, we observe substantial absolute improvements with EMix for both CNN and BERT. Another point to note is that EDA does not show major improvements when the dataset is sufficiently large, as is the case with the UCI News dataset. On the same dataset, EMix shows an absolute improvement of 2.2% (CNN) and 3.2% (BERT).

Observation 2: Performance after Sparsification:
Over-parameterized networks have a considerably large memory and environmental footprint. Recent works (Chirkova et al., 2018) have shown that test accuracy similar to that of a large network can be achieved with only a fraction of its parameters. We check the impact of mixup on models with fewer parameters obtained through pruning. Various pruning techniques have been proposed to approximate the importance of weights (LeCun et al., 1990). We test our models with weight pruning, where we mask out weights whose magnitude falls below a calculated threshold.
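The magnitude-based weight pruning described above can be sketched as follows (a simple NumPy sketch; ties at the threshold may prune slightly more than the requested fraction):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # threshold = k-th smallest absolute value across all weights
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    # mask out every weight whose magnitude does not exceed the threshold
    return np.where(np.abs(weights) > threshold, weights, 0.0)
```

In practice the same mask would be applied per layer (or globally) to a trained model before re-evaluating test accuracy at each sparsity level.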
We observe that EMix is more robust to pruning at all levels of sparsity (Table 2). This indicates that EMix eliminates noisy weight estimates by providing more consistent gradient estimates, allowing weights to concentrate in localized regions of the network.
Table 1: Accuracy (%) obtained by the tested methods. We report mean scores over 10 runs with standard deviations (denoted ±). Best results are highlighted in bold.

Qualitative Analysis: Class Separability
We use the class-to-class (c2c) separability (Khan et al., 2017), which estimates the spread of the within-class (intraclass) samples compared to the between-class (interclass) ones. The separability $S$ between two classes is defined as
$$S = \frac{1}{N_{class1}} \sum_{i=1}^{N_{class1}} \frac{\min_{j \in class1,\, j \neq i} d(f_i, f_j)}{\min_{k \in class2} d(f_i, f_k)}, \qquad (7)$$
where distances $d(\cdot,\cdot)$ are calculated in the feature space, each point is a 768-dimensional feature vector $f_i$ from the final layer of BERT, and $N_{class1}$ is the number of samples belonging to class1. The lower the value of $S$, the higher the separability. Table 3 shows the $S$ values for EMix and vanilla BERT. The average $S$ value is lower for EMix than for vanilla BERT. Figure 1 shows that the features learned using EMix lead to higher class separability than vanilla BERT and form clusters with less inter-class overlap. In particular, the Health and Technology class embeddings, which are highly overlapped in the vanilla model, are well separated using EMix.
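A within-class versus between-class distance ratio of this kind can be sketched as follows (an illustrative nearest-neighbour variant; the names are our own, and the exact formulation in Khan et al. (2017) may differ in detail):

```python
import numpy as np

def c2c_separability(feats_c1, feats_c2):
    """Mean ratio of each class-1 point's nearest within-class distance to its
    nearest between-class distance; lower values mean better-separated classes.

    feats_c1, feats_c2: (N, d) matrices of feature vectors for the two classes.
    """
    ratios = []
    for i, f in enumerate(feats_c1):
        others = np.delete(feats_c1, i, axis=0)          # same-class points, excluding f
        d_intra = np.linalg.norm(others - f, axis=1).min()
        d_inter = np.linalg.norm(feats_c2 - f, axis=1).min()
        ratios.append(d_intra / d_inter)
    return float(np.mean(ratios))
```

Two tight, well-separated clusters give a value well below 1, while overlapping clusters push the value toward (or above) 1.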

Conclusion
We introduced EMix, a novel interpolation-based mixing strategy for text classification that uses linear interpolations of hidden states. We performed thorough quantitative assessments to demonstrate how EMix outperforms strong augmentation baselines, both in terms of accuracy and robustness to weight pruning. Qualitatively, we also elucidated how EMix allows for better separation between classes.