Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

We propose a novel data augmentation method for labeled sentences called contextual augmentation. We assume an invariance: sentences remain natural even if their words are replaced with other words that have paradigmatic relations to them. We stochastically replace words with other words predicted by a bi-directional language model at the corresponding word positions. Words predicted according to a context are numerous yet appropriate substitutes for the original words. Furthermore, we retrofit a language model with a label-conditional architecture, which allows the model to augment sentences without breaking label compatibility. Through experiments on six different text classification tasks, we demonstrate that the proposed method improves classifiers based on convolutional or recurrent neural networks.


Introduction
Neural network-based models for NLP have achieved state-of-the-art results in various tasks, e.g., dependency parsing (Dyer et al., 2015), text classification (Socher et al., 2013; Kim, 2014), and machine translation (Sutskever et al., 2014). However, machine learning models often overfit the training data and lose their ability to generalize. Generalization performance depends heavily on the size and quality of the training data and on regularization. Preparing a large annotated dataset is very time-consuming. Instead, automatic data augmentation is popular, particularly in the areas of vision (Simard et al., 1998; Szegedy et al., 2015) and speech (Jaitly and Hinton, 2015; Ko et al., 2015). Data augmentation is typically based on human knowledge of invariances, rules, or heuristics, e.g., "even if a picture is flipped, the class of an object should be unchanged". However, the use of data augmentation in NLP has been limited. In natural languages, it is very difficult to obtain universal transformation rules that assure the quality of the produced data and are easy to apply automatically in various domains. A common approach for such a transformation is to replace words with their synonyms, selected from a handcrafted ontology such as WordNet (Miller, 1995; Zhang et al., 2015) or by word similarity calculation (Wang and Yang, 2015). Because very few words have exactly or nearly the same meaning, synonym-based augmentation can be applied to only a small percentage of the vocabulary. Other augmentation methods are known but are often developed for specific domains with handcrafted rules or pipelines, at the cost of generality.
Figure 1: Contextual augmentation with a bidirectional RNN language model, when a sentence "the actors are fantastic" is augmented by replacing only actors with words predicted based on the context.
In this paper, we propose a novel data augmentation method called contextual augmentation. Our method offers a wider range of substitute words by using words predicted by a bidirectional language model (LM) according to the context, as shown in Figure 1. This contextual prediction suggests various words that have paradigmatic relations (Saussure and Riedlinger, 1916) with the original words. Such words can also be good substitutes for augmentation. Furthermore, to prevent word replacements that are incompatible with the annotated labels of the original sentences, we retrofit the LM with a label-conditional architecture. Through experiments, we demonstrate that the proposed conditional LM produces good words for augmentation, and that contextual augmentation improves classifiers using recurrent or convolutional neural networks (RNN or CNN) in various classification tasks.

Proposed Method
To perform data augmentation by replacing words in a text with other words, prior works (Zhang et al., 2015; Wang and Yang, 2015) used synonyms as substitutes for the original words. However, synonyms are very limited, and synonym-based augmentation cannot produce many patterns that differ from the original texts. We propose contextual augmentation, a novel method to augment words with more varied words. Instead of synonyms, we use words that are predicted by an LM given the context surrounding the original words to be augmented, as shown in Figure 1.

Motivation
First, we explain the motivation of our proposed method by referring to an example sentence from the Stanford Sentiment Treebank (SST) (Socher et al., 2013), which is a dataset of sentiment-labeled movie reviews. The sentence, "the actors are fantastic.", is annotated with a positive label. When augmentation is performed for the word (position) "actors", how widely can we augment it? According to the prior works, we can use words from a synset for the word actor obtained from WordNet (histrion, player, thespian, and role player). The synset contains words that have meanings similar to the word actor on average.¹ However, for data augmentation, the word actors can be further replaced with non-synonym words such as characters, movies, stories, and songs or various other nouns, while retaining the positive sentiment and naturalness. From the viewpoint of generalization, training on as many patterns as possible should improve model performance further. We propose using numerous words that have paradigmatic relations with the original words. An LM has the desirable property of assigning high probabilities to such words, even if the words themselves are not similar to the original word to be replaced.

Word Prediction based on Context
Our proposed method requires an LM that calculates the word probability at a position i based on its context. The context is the sequence of words surrounding an original word w_i in a sentence S, i.e., the cloze sentence S\{w_i}. The calculated probability is p(·|S\{w_i}). Specifically, we use a bi-directional LSTM-RNN (Hochreiter and Schmidhuber, 1997) LM. For prediction at position i, the model encodes the surrounding words individually rightward and leftward (see Figure 1). As in typical uni-directional RNN LMs, the outputs from adjacent positions are used for calculating the probability at target position i. The outputs from both directions are concatenated and fed into the following feed-forward neural network, which produces a probability distribution over the vocabulary.
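The cloze context S\{w_i} described above can be illustrated with a small sketch. The function name `cloze_context` and the list-of-tokens representation are assumptions made for illustration; they are not taken from the authors' code. The sketch only shows which words each direction of the LM would read, not the LSTM itself.

```python
from typing import List, Tuple


def cloze_context(tokens: List[str], i: int) -> Tuple[List[str], List[str]]:
    """Return the inputs for the two directions when predicting position i.

    The rightward encoder reads tokens[0..i-1] in sentence order.
    The leftward encoder reads tokens[i+1..] starting from the end of the
    sentence and moving toward position i. The target tokens[i] is hidden
    from both directions.
    """
    if not 0 <= i < len(tokens):
        raise IndexError("position out of range")
    left = tokens[:i]
    right = tokens[i + 1:][::-1]
    return left, right


sentence = ["the", "actors", "are", "fantastic"]
left, right = cloze_context(sentence, 1)
print(left, right)  # ['the'] ['fantastic', 'are']
```

In the model, the final outputs of the two encoders over these sequences would be concatenated before the feed-forward network.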
In contextual augmentation, new substitutes for word w_i can be smoothly sampled from the probability distribution p(·|S\{w_i}), whereas prior works deterministically selected the top-K words. In this study, we sample words for augmentation at each update during the training of a model. To control the strength of augmentation, we introduce a temperature parameter τ and use an annealed distribution p_τ(·|S\{w_i}) ∝ p(·|S\{w_i})^{1/τ}. As the temperature goes to infinity (τ → ∞), the words are sampled from a uniform distribution.² As it goes to zero (τ → 0), the augmentation words are always the words predicted with the highest probability. The sampled words can be obtained at once for every word position in the sentence. Following Wang and Yang (2015), we replace each word simultaneously with a probability, for efficiency.
¹ [...] based approach further requires word sense disambiguation or some rules for selecting ideal synsets.
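The temperature-annealed sampling and the per-word replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Chainer implementation: `anneal`, `augment`, `TOY_LM`, and `toy_predict` are hypothetical names, and the toy lookup table merely stands in for a trained bi-directional LM.

```python
import math
import random
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def anneal(probs: Distribution, tau: float) -> Distribution:
    """Return p_tau(w) proportional to p(w) ** (1 / tau) for tau > 0.

    Computed in log space so that a small tau does not underflow.
    Large tau flattens the distribution; small tau sharpens it toward
    the most probable word.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    logs = {w: math.log(p) / tau for w, p in probs.items() if p > 0}
    top = max(logs.values())
    weights = {w: math.exp(v - top) for w, v in logs.items()}
    total = sum(weights.values())
    return {w: weights.get(w, 0.0) / total for w in probs}


def augment(tokens: List[str],
            predict: Callable[[List[str], int], Distribution],
            tau: float,
            replace_prob: float,
            rng: random.Random) -> List[str]:
    """Replace each word independently with probability replace_prob.

    Every prediction is conditioned on the ORIGINAL sentence, so all
    positions can be handled simultaneously.
    """
    out = []
    for i, tok in enumerate(tokens):
        if rng.random() < replace_prob:
            dist = anneal(predict(tokens, i), tau)
            words = list(dist)
            out.append(rng.choices(words, weights=[dist[w] for w in words], k=1)[0])
        else:
            out.append(tok)
    return out


# Hypothetical stand-in for a trained LM: maps a cloze sentence to p(.|S\{w_i}).
TOY_LM = {
    ("the", "_", "are", "fantastic"): {
        "actors": 0.4, "characters": 0.3, "movies": 0.2, "songs": 0.1,
    },
}


def toy_predict(tokens: List[str], i: int) -> Distribution:
    cloze = tuple(tokens[:i]) + ("_",) + tuple(tokens[i + 1:])
    # Unknown contexts fall back to keeping the original word.
    return TOY_LM.get(cloze, {tokens[i]: 1.0})


rng = random.Random(0)
print(augment(["the", "actors", "are", "fantastic"], toy_predict,
              tau=1.0, replace_prob=0.5, rng=rng))
```

Calling `augment` afresh at every training update, rather than once before training, matches the description of sampling words at each update.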

Conditional Constraint
Finally, we introduce a novel approach to address the issue that context-aware augmentation is not always compatible with annotated labels. To understand the issue, consider again the example, "the actors are fantastic.", which is annotated with a positive label. If contextual augmentation, as described so far, is simply performed for the word (position of) fantastic, an LM often assigns high probabilities to words such as bad or terrible as well as good or entertaining, although these words contradict one another with respect to the positive or negative label. Thus, such a simple augmentation can generate sentences that are implausible with respect to their original labels and harmful for model training.
To address this issue, we introduce a conditional constraint that controls the replacement of words to prevent the generated words from reversing the information related to the labels of the sentences. We alter an LM to a label-conditional LM, i.e., for position i in sentence S with label y, we aim to calculate p_τ(·|y, S\{w_i}) instead of the default p_τ(·|S\{w_i}) within the model. Specifically, we concatenate each embedded label y with a hidden layer of the feed-forward network in the bidirectional LM, so that the output is calculated from a mixture of information from both the label and context.
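The label-conditional output layers can be sketched in plain Python as below. This is only an illustration of the concatenation step: the class name `LabelConditionalHead`, the layer sizes, the ReLU activation, and the random (untrained) weights are all assumptions, and the forward and backward LSTM states are passed in as given vectors rather than computed.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map: one output per row of weight."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(z: Vector) -> Vector:
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class LabelConditionalHead:
    """Feed-forward layers that turn context states plus a label into p(.|y, S\\{w_i})."""

    def __init__(self, ctx_dim: int, hidden_dim: int, label_dim: int,
                 n_labels: int, vocab_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = rand_matrix(hidden_dim, ctx_dim, rng)
        self.b1 = [0.0] * hidden_dim
        self.label_emb = rand_matrix(n_labels, label_dim, rng)
        # The output layer sees the hidden layer AND the label embedding.
        self.w2 = rand_matrix(vocab_size, hidden_dim + label_dim, rng)
        self.b2 = [0.0] * vocab_size

    def forward(self, fwd_state: Vector, bwd_state: Vector, label: int) -> Vector:
        context = fwd_state + bwd_state  # concatenate both directions
        hidden = [max(0.0, v) for v in linear(context, self.w1, self.b1)]  # ReLU (assumed)
        conditioned = hidden + self.label_emb[label]  # concatenate embedded label
        return softmax(linear(conditioned, self.w2, self.b2))


head = LabelConditionalHead(ctx_dim=4, hidden_dim=3, label_dim=2, n_labels=2, vocab_size=5)
fwd, bwd = [0.1, -0.2], [0.3, 0.4]
p_neg = head.forward(fwd, bwd, label=0)
p_pos = head.forward(fwd, bwd, label=1)
print(p_neg)
print(p_pos)
```

With the same context but different labels, the two distributions differ, which is the property the conditional constraint relies on once the weights are trained on labeled data.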

Settings
We tested combinations of three augmentation methods for two types of neural models through six text classification tasks. The corresponding code is implemented with Chainer (Tokui et al., 2015) and is available.³
The benchmark datasets used are as follows: (1, 2) SST is a dataset for sentiment classification on movie reviews, which were annotated with five or two labels (SST5, SST2) (Socher et al., 2013). (3) The Subjectivity dataset (Subj) was annotated with whether a sentence was subjective or objective (Pang and Lee, 2004). (4) MPQA is an opinion polarity detection dataset of short phrases rather than sentences (Wiebe et al., 2005). (5) RT is another movie review sentiment dataset (Pang and Lee, 2005). (6) TREC is a dataset for classification of the six question types (e.g., person, location) (Li and Roth, 2002). For a dataset without development data, we use 10% of its training set as the validation set, following Kim (2014).
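The 10% hold-out for datasets without development data can be sketched as follows. The function name `split_train_valid`, the random shuffling, and the fixed seed are assumptions for illustration; the text only states that 10% of the training set is used for validation.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_valid(examples: Sequence[T], valid_ratio: float = 0.1,
                      seed: int = 0) -> Tuple[List[T], List[T]]:
    """Hold out a fraction of the training examples as a validation set."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # shuffling is an assumption, see lead-in
    n_valid = round(len(items) * valid_ratio)
    return items[n_valid:], items[:n_valid]


train, valid = split_train_valid(list(range(100)))
print(len(train), len(valid))  # 90 10
```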
We tested classifiers using an LSTM-RNN or a CNN, both of which have exhibited good performance. We used typical architectures of classifiers based on the LSTM or CNN with dropout, using hyperparameters found in preliminary experiments.⁴ The reported accuracies of the models were averaged over eight models trained from different seeds.
The tested augmentation methods are: (1) synonym-based augmentation, and (2, 3) contextual augmentation with or without a label-conditional architecture. The hyperparameters of the augmentation (temperature τ and probability of word replacement) were also selected by a grid search using the validation set, while keeping the hyperparameters of the models fixed. For contextual augmentation, we first pretrained a bi-directional LSTM LM without the label-conditional architecture on the WikiText-103 corpus (Merity et al., 2017), a subset of English Wikipedia articles. After the pretraining, the models are further trained on each labeled dataset with the newly introduced label-conditional architectures.
⁴ An RNN-based classifier has a single-layer LSTM and word embeddings, whose output is fed into an output affine layer with the softmax function. A CNN-based classifier has convolutional filters of size {3, 4, 5} and word embeddings (Kim, 2014). The concatenated output of all the filters is max-pooled over time and fed into a two-layer feed-forward network with ReLU, followed by the softmax function. For both architectures, training was performed by Adam (Kingma and Ba, 2015) and finished by early stopping with validation at each epoch. The hyperparameters of the models and training were selected by a grid search using baseline models without data augmentation on each task's validation set individually. We used the best settings from the combinations obtained by changing the learning rate, unit or filter size, embedding dimension, and dropout ratio.

Results
Table 1 lists the accuracies of the models with or without augmentation. The results show that our contextual augmentation improves the model performances for various datasets from different domains more significantly than the prior synonym-based augmentation does. Furthermore, our label-conditional architecture boosted the performances on average and achieved the best accuracies. Our methods are effective even for datasets with more than two types of labels, SST5 and TREC.
Table 1: Accuracies of the models for various benchmarks. The accuracies are averaged over eight models trained from different seeds.
For investigating our label-conditional bidirectional LM, we show in Figure 2 the top-10 word predictions by the model for a sentence from the SST dataset. Each word in the sentence is frequently replaced with various words that are not always synonyms. We present two types of predictions depending on the label fed into the conditional LM. With a positive label, the word "fantastic" is frequently replaced with funny, honest, good, and entertaining, which are also positive expressions. In contrast, with a negative label, the word "fantastic" is frequently replaced with tired, forgettable, bad, and dull, which reflect a negative sentiment. At another position, the word "the" can be replaced with "no" (with the seventh highest probability), so that the whole sentence becomes "no actors are fantastic.", which seems negative as a whole. Aside from such inversions caused by labels, the parts unrelated to the labels (e.g., "actors") are not very different in the positive or negative predictions. These results also demonstrated that conditional architectures are effective.

Related Work
Some works have attempted text data augmentation using synonym lists (Zhang et al., 2015; Wang and Yang, 2015), grammar induction (Jia and Liang, 2016), task-specific heuristic rules (Fürstenau and Lapata, 2009; Kafle et al., 2017; Silfverberg et al., 2017), neural decoders of autoencoders (Bergmanis et al., 2017; Xu et al., 2017; Hu et al., 2017), or encoder-decoder models (Kim and Rush, 2016; Sennrich et al., 2016; Xia et al., 2017). The works most similar to our research are Kolomiyets et al. (2011) and Fadaee et al. (2017). In a task of time expression recognition, Kolomiyets et al. replaced only the headwords under a task-specific assumption that temporal trigger words usually occur as headwords. They selected substitute words with top-K scores given by the Latent Words LM (Deschacht and Moens, 2009), which is an LM based on fixed-length contexts. Fadaee et al. (2017), focusing on the rare word problem in machine translation, replaced words in a source sentence with only rare words that both rightward and leftward LSTM LMs independently predict with top-K confidence. A word in the translated sentence is also replaced using a word alignment method and a rightward LM. These two works share with our method the idea of using language models. We used a bi-directional LSTM LM, which captures variable-length contexts while considering both directions jointly. More importantly, we proposed a label-conditional architecture and demonstrated its effect both qualitatively and quantitatively. Our method is independent of any task-specific knowledge, and effective for classification tasks in various domains.
Figure 2: Words predicted with the ten highest probabilities by the conditional bi-directional LM applied to the sentence "the actors are fantastic". The squares above the sentence list the words predicted with a positive label. The squares below list the words predicted with a negative label.
We use a label-conditional fill-in-the-blank context for data augmentation. Neural models using the fill-in-the-blank context have been investigated in other applications. Kobayashi et al. (2016, 2017) proposed to extract and organize information about each entity in a discourse using the context. Fedus et al. (2018) proposed a GAN (Goodfellow et al., 2014) for text generation and demonstrated that mode collapse and training instability can be relieved by in-filling-task training. Melamud et al. (2016) and Peters et al. (2018) reported that encoding the context with a bidirectional LM was effective for a broad range of NLP tasks.

Conclusion
We proposed a novel data augmentation method using numerous words given by a bi-directional LM, and further introduced a label-conditional architecture into the LM. Experimentally, our method produced varied words compatible with the labels of the original texts and improved neural classifiers more than synonym-based augmentation did. Our method is independent of any task-specific knowledge or rules, and can be generally and easily used for classification tasks in various domains.
On the other hand, the improvement from our method is sometimes marginal. Future work will explore comparison and combination with other generalization methods that, like ours, exploit datasets deeply.