LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification

Recent work has shown that current text classification models are fragile and sensitive to simple perturbations. In this work, we propose a novel adversarial training approach, LexicalAT, to improve the robustness of current classification models. The proposed approach consists of a generator and a classifier. The generator learns to generate examples to attack the classifier while the classifier learns to defend against these attacks. Considering the diversity of attacks, the generator uses a large-scale lexical knowledge base, WordNet, to generate attacking examples by replacing some words in training examples with their synonyms (e.g., sad and unhappy), neighbor words (e.g., fox and wolf), or super-subordinate words (e.g., chair and armchair). Due to the discrete generation step in the generator, we use policy gradient, a reinforcement learning approach, to train the two modules. Experiments show that LexicalAT outperforms strong baselines and reduces test errors on various neural networks, including CNN, RNN, and BERT.


Introduction
Sentiment classification is a fundamental research area in natural language processing (Pang et al., 2002; Glorot et al., 2011; Lai et al., 2015; Kiritchenko and Mohammad, 2018; Liu et al., 2018; Chen et al., 2018). With the development of deep learning, neural networks have obtained state-of-the-art results on many sentiment classification datasets (Kim, 2014; Dong et al., 2014; Tang et al., 2015). However, despite the promising results, recent work has shown that these models easily fail on adversarial examples 2 that apply small perturbations to real examples. This phenomenon suggests that current sentiment classification models have poorly learned the true underlying patterns that determine the correct label. The over-fitting problem still needs to be further explored.
* Equal Contribution.
1 The code will be released at https://github.com/lancopku/LexicalAT
2 Adversarial examples are intentionally designed by attackers to cause the model to make a mistake.
Figure 1: An attacking example for sentiment classification generated by the proposed approach. A pretrained classifier correctly predicts the label of the original text but fails on the generated text.
Several approaches have been proposed in recent years to address this attack problem. These studies can be roughly classified into two categories: data augmentation based approaches and adversarial training based approaches. The key idea of the former is to assist the training of the classifier by augmenting the training data with pre-designed examples (Wang and Yang, 2015; Jia and Liang, 2017; Iyyer et al., 2018). Adversarial training based approaches (Miyato et al., 2017) aim to improve the generalization ability by adding random noise to word embeddings. Although these methods are good pioneering work, they either rely heavily on human knowledge or suffer from low diversity of attacks, which limits the robustness to diverse words and expressions.
In this work, we propose a lexical-based adversarial reinforcement training framework, LexicalAT, for robust sentiment classification. Compared to previous studies, our approach can generate diverse attacking examples with self-learned policies. Our framework contains a generator and a classifier. The generator provides examples to attack the classifier while the classifier learns to defend against these attacks. To generate diverse examples, we propose to use a large lexical knowledge base, WordNet, to perturb training examples by replacing some words with their synonyms (e.g., sad and unhappy), neighbor words (e.g., fox and wolf), or super-subordinate words (e.g., chair and armchair) 3 , as shown in Figure 1. Specifically, the output of the generator is a sequence of actions deciding which words should be replaced and what their replacements are. By including the attacking examples in training, the classifier becomes more robust and powerful.
Since the generator has discrete generation steps, gradient-based approaches cannot be directly used to back-propagate the errors. We therefore use policy gradient, a reinforcement learning approach, to train the generator. With the feedback from the classifier as the reward, the generator is encouraged to generate tougher examples for the classifier. In return, with the increasing number of hard examples for training, the classifier becomes more robust and powerful. We evaluate the proposed approach on four popular sentiment classification datasets. Experiments show that LexicalAT outperforms strong baselines on various models and various datasets. Our contributions are as follows:
• We propose a lexical-based adversarial reinforcement training approach, LexicalAT, to improve the robustness of sentiment classification models.
• To the best of our knowledge, this is the first work to combine a knowledge base with adversarial learning. The knowledge base contributes to diverse example generation and the adversarial learning develops the attacking policy.
• Experiments show that LexicalAT outperforms strong baselines and improves results on various models, including CNN, RNN, and BERT.

Related Work
In this work, we focus on single-label sentiment classification where the input is a word sequence and the output is a single label. On many sentiment classification datasets, neural networks have achieved promising results, comparable or even superior to human performance. However, several studies have noted that these models are vulnerable and that their performance is very sensitive to simple perturbations (Szegedy et al., 2014; Huang et al., 2017; Yuan et al., 2017). Based on these findings, several approaches have been proposed to improve the robustness of neural networks. These studies can be roughly classified into two categories: data augmentation based approaches and adversarial training based approaches. The main idea of the data augmentation based approaches is to augment the training data with pre-designed adversarial examples. Iyyer et al. (2018) propose a paraphrase method to generate syntactically adversarial examples for machine translation tasks. To explore semantically adversarial examples, Kobayashi (2018) replaces input words in real examples with the words predicted by a label-conditional language model. In these approaches, the attacking policy is task-specific and elaborately designed by humans. Unlike these approaches, Miyato et al. (2017) propose to use an adversarial training framework to attack the classifier by adding perturbations to the word embedding layer. However, because the input text is unchanged, it is hard to improve the robustness to diverse words and expressions.
In this work, we propose a new adversarial reinforcement training framework that aims to generate diverse attacks with self-learned policies. We build a generator that acts as a policy learner to automatically learn to attack the classifier. To generate diverse examples, we incorporate WordNet, a large-scale lexical knowledge base, into the generator for word replacement. Figure 2 shows the overall structure of LexicalAT. Given a sentence, the generator first generates a sequence of actions to replace some words with substitutes from WordNet and build a new example. Then, the new example is sent to the classifier to get the action reward. If the generated example successfully confuses the classifier and decreases the probability of the original label, we regard it as a good example and give it a high reward. In this way, the generator is encouraged to generate tough examples for the classifier, and the challenging generated examples are used in the training process of the classifier. By alternately training the generator and the classifier, the classifier is trained to be more robust and powerful.

Generator
The generator generates attacking examples by adding noise to real examples with WordNet. WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset represents a distinct concept. As shown in Table 1, WordNet has two basic relations:
• Synonymy. It is the most basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses.
• Hyponymy and hypernymy (super-subordinate). They are transitive relations between synsets. Because there is usually only one hypernym, this semantic relation organizes the meanings of nouns hierarchically.
The generation process is defined as a sequence labeling task for simplicity. It takes a text sequence as input and produces an action sequence as output. Specifically, we define five actions representing the replacement decision, that is, whether a word should be replaced and, if so, its replacement type, as shown in Table 2. For example, if a word is labeled with action "2", we choose the word with the highest frequency among its subordinate words as the replacement. Formally, assume that the input is a sequence of words x = {x_1, x_2, ..., x_n}, where n is the length of the input text. This module generates a sequence of replacement actions a = {a_1, a_2, ..., a_n}. Then, by querying WordNet based on x and a, we can get a new sentence.
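The replacement step described above can be illustrated with a minimal sketch. It assumes a toy in-memory lexicon in place of real WordNet queries, and the action ids other than "2" are hypothetical, since only action "2" (the most frequent subordinate word) is specified in the text. The names `TOY_LEXICON`, `ACTIONS`, and `apply_actions` are introduced here for illustration only.

```python
# Toy stand-in for WordNet lookups; entries are illustrative, not real WordNet output.
# Each candidate list is assumed to be sorted by descending word frequency.
TOY_LEXICON = {
    "sad": {"synonym": ["unhappy"]},
    "fox": {"neighbor": ["wolf"]},
    "chair": {"subordinate": ["armchair"], "superior": ["seat"]},
}

# Hypothetical action ids. Only action 2 (subordinate word) is stated in the
# text; the remaining ids are assumptions made for this sketch.
ACTIONS = {0: None, 1: "synonym", 2: "subordinate", 3: "superior", 4: "neighbor"}


def apply_actions(words, actions, lexicon=TOY_LEXICON):
    """Build a new sentence from the words x and the per-word actions a."""
    if len(words) != len(actions):
        raise ValueError("need exactly one action per word")
    new_words = []
    for word, action in zip(words, actions):
        relation = ACTIONS[action]
        candidates = lexicon.get(word, {}).get(relation, []) if relation else []
        # Take the most frequent candidate; keep the original word when the
        # action is "no replacement" or the lexicon has no candidate.
        new_words.append(candidates[0] if candidates else word)
    return new_words
```

For instance, `apply_actions(["the", "sad", "fox"], [0, 1, 4])` leaves "the" unchanged and replaces the other two words using the toy lexicon. Falling back to the original word when no candidate exists is a design choice of this sketch, not something the text specifies.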
Since the proposed framework is independent of the structure of the generator, for simplicity we use a traditional sequence labeling model, the Bi-Directional Long Short-Term Memory Network (Bi-LSTM) (Hochreiter and Schmidhuber, 1997), as the implementation.

Classifier
In this work, we focus on single-label sentiment classification. The input is a sequence of words and the output is a label from a pre-defined set Y = {y 1 , y 2 , · · · , y k }, where k is the number of labels. To evaluate the effectiveness of the proposed approach on various settings, we implement two widely-used sentiment classification networks, Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) (Kim, 2014), and one state-of-the-art pre-trained model, BERT (Devlin et al., 2018).
In the RNN-based classifier, the input word embeddings are fed into a single Long-Short Term Memory Network (LSTM). Then, a feed-forward layer transfers the last hidden vector of LSTM into the probability distribution of labels.
In the CNN-based classifier, we first feed the word embeddings into a convolutional layer with four convolutional filters. Then, max-pooling is applied to the concatenated filter outputs, and the result is fed into a two-layer feed-forward network with ReLU, followed by the softmax function.
In the BERT-based classifier, we use a model pre-trained on a large-scale language modeling dataset for initialization. Following Devlin et al. (2018), we take the final hidden state of the first input token as the sentence representation for classification. We use the released model and code, BERT-large, as the implementation, which is based on 24 transformer layers. 4

Adversarial Reinforcement Training
Due to the discrete choice in generation steps, we use policy gradient, a reinforcement learning approach, to adversarially train the two modules. The generator can be viewed as the agent which interacts with the classifier that acts as environment. The generator improves itself by maximizing the reward returned from the environment.
Given a real example (x, y), the generator first samples an action sequence â from the following probability distribution:

$$P_G(\hat{a} \mid x; \theta) = \prod_{i=1}^{n} P_G(\hat{a}_i \mid x; \theta)$$

where θ represents the parameters of the generator.
4 The pre-trained model is provided by goo.gl/language/bert
Algorithm 1: Adversarial reinforcement training.
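As an illustration of the sampling step, the following minimal sketch draws one action per word and accumulates the log-probability of the sampled sequence. It assumes that the distribution factorizes over word positions, which is natural for a sequence labeling generator but is an assumption here; `action_probs` stands in for the per-position output of the generator, and `sample_actions` is a hypothetical helper name.

```python
import math
import random


def sample_actions(action_probs, rng):
    """Sample one action per word and return (actions, log-probability).

    action_probs[i][k] is the probability of action k at word position i.
    Assumes the sequence probability is the product of per-position terms.
    """
    actions, log_prob = [], 0.0
    for probs in action_probs:
        # Draw a single action index according to this position's distribution.
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        actions.append(action)
        log_prob += math.log(probs[action])
    return actions, log_prob
```

The returned log-probability is the quantity whose gradient is needed by policy gradient training; passing an explicit `random.Random` instance keeps the sampling reproducible.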
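Then, we feed the real example (x, y) and the generated example (x', y) into the classifier to get the replacement reward r(â). Specifically, we define the reward as the absolute difference in the probability of y between the real example and the generated example:

$$r(\hat{a}) = \left| P_C(y \mid x; \phi) - P_C(y \mid x'; \phi) \right|$$

where P_C is the label distribution predicted by the classifier and φ represents the parameters of the classifier. If the generated example successfully confuses the classifier and decreases the probability of y, we regard it as a good example and give it a high reward. In practice, to train the generator stably, we use the following adjusted reward:

$$r'(\hat{a}) = r(\hat{a}) - \mathbb{E}_{P_G(a \mid x; \theta)}\left[ r(a) \right]$$

where $\mathbb{E}_{P_G(a \mid x; \theta)}[r(a)]$ is the expectation of r(a). Then, the reward is fed back to the generator. Formally, the expected gradient with respect to the parameter θ can be approximated as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{P_G(\hat{a} \mid x; \theta)}\left[ r'(\hat{a}) \, \nabla_\theta \log P_G(\hat{a} \mid x; \theta) \right] \approx r'(\hat{a}) \, \nabla_\theta \log P_G(\hat{a} \mid x; \theta)$$

where â is the generated action sequence. In practice, the expectation over P_G(â | x; θ) is approximated by sampling.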
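A small worked sketch may help make the reward and the policy gradient estimate concrete. It assumes a softmax policy over actions at a single word position, and it uses the mean of the sampled rewards as a stand-in for the expectation of r(a); both choices, and all function names, are assumptions for illustration rather than the exact training code.

```python
def reward(p_real, p_generated):
    """Absolute change in the classifier's probability of the gold label y."""
    return abs(p_real - p_generated)


def centered_rewards(rewards):
    """Subtract the sample mean, used here as an estimate of E[r(a)]."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def grad_log_softmax(probs, action):
    """Gradient of log probs[action] with respect to the softmax logits."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def reinforce_gradient(probs, actions, rewards):
    """Monte Carlo policy gradient estimate for a single word position."""
    adjusted = centered_rewards(rewards)
    grad = [0.0] * len(probs)
    for action, r in zip(actions, adjusted):
        for k, g in enumerate(grad_log_softmax(probs, action)):
            # Average the reward-weighted score function over the samples.
            grad[k] += r * g / len(actions)
    return grad
```

With two equally likely actions, sampled actions `[0, 1]`, and rewards `[0.5, 0.1]`, the estimate pushes the logit of action 0 up and the logit of action 1 down, which is the intended behavior: actions that confuse the classifier more become more likely.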
The generated example (x', y) is then used to train the classifier by minimizing the following cross-entropy loss:

$$L_C(\phi) = -\log P_C(y \mid x'; \phi)$$

where P_C(y | x'; φ) is the probability that the classifier assigns to y. Furthermore, to prevent the classifier from forgetting the knowledge in the training data, we also use teacher forcing (Goyal et al., 2016) to train the classifier. After each reinforcement training step, we run a teacher forcing step that trains the classifier directly on the real example (x, y):

$$L_T(\phi) = -\log P_C(y \mid x; \phi)$$

Because the reinforcement learning process requires both modules to have an initial learning ability, we pre-train the generator and the classifier before the reinforcement learning stage. The classifier is directly pre-trained on real training data until convergence. Due to the lack of supervisory signals, we build a warm-up stage to pre-train the generator. In the warm-up stage, the feedback of the pre-trained classifier is used to train the generator.
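The order of the training stages described above can be sketched as follows. This is only an outline of the schedule under stated assumptions: the `generator` and `classifier` objects and their methods `pretrain`, `policy_step`, and `train_step` are hypothetical, and alternating the two modules after every example is an assumption of this sketch.

```python
def train_lexicalat(batches, generator, classifier, warmup_epochs, rl_epochs):
    """Order of updates only; the model objects and their methods are hypothetical."""
    # Stage 1: pre-train the classifier on real training data.
    classifier.pretrain(batches)
    # Stage 2: warm-up, where only the generator is updated from classifier feedback.
    for _ in range(warmup_epochs):
        for x, y in batches:
            generator.policy_step(x, y, classifier)
    # Stage 3: adversarial reinforcement training of both modules.
    for _ in range(rl_epochs):
        for x, y in batches:
            # Generator update; returns the generated example x'.
            x_new = generator.policy_step(x, y, classifier)
            # Classifier learns from the generated example (x', y).
            classifier.train_step(x_new, y)
            # Teacher forcing step on the real example (x, y).
            classifier.train_step(x, y)
```

Keeping the teacher forcing step directly after each update on a generated example mirrors the description in the text, where it serves to keep the classifier from drifting away from the real training data.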

Experiments
We evaluate LexicalAT on four sentiment classification datasets. We first introduce datasets, implementation details, and baselines. Then, we show experiment results and provide detailed analysis.

Datasets
SST-2 & SST-5. The Stanford Sentiment Treebank (Socher et al., 2013) is a single-sentence classification dataset built on movie reviews. Based on the difference in sentiment granularity, the human annotators design two label sets. Following existing work (Kobayashi, 2018), we run experiments on both label sets. For clarity, we call them SST-2 and SST-5.
RT. The rating inference dataset (Pang and Lee, 2005) is another sentiment classification dataset. The data is from internet movie reviews and has two types of labels.
Yelp. The dataset is built upon reviews from the website Yelp. 5 Each review has a rating label varying from 1 to 5. We randomly select 100K reviews for training, 10K for validation, and 10K for testing. It is used to verify whether the proposed approach is applicable to tasks with large-scale data.
The dataset statistics are shown in Table 3. For SST-2 and SST-5, we use the standard split in their work. For RT, due to the lack of a standard split, we randomly divide all examples into 90% for the training set and 10% for the test set. To build the development set, we randomly take out 10% from the training set.
5 https://www.yelp.com/dataset/challenge
Table 3: Dataset statistics. "Class" is the number of pre-defined labels. "Avg. #w" is the average word number in the input text. "Train", "Dev", and "Test" represent the sizes of the training set, the development set, and the test set.
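The RT split described above can be reproduced with a short sketch. The fixed random seed and the function name `split_rt` are assumptions added for reproducibility; the text only specifies the proportions.

```python
import random


def split_rt(examples, seed=0):
    """90/10 train/test split, then 10% of the training part as the development set."""
    items = list(examples)
    # Shuffle with a fixed seed (an assumption) so the split is reproducible.
    random.Random(seed).shuffle(items)
    n_test = len(items) // 10
    test, train = items[:n_test], items[n_test:]
    n_dev = len(train) // 10
    dev, train = train[:n_dev], train[n_dev:]
    return train, dev, test
```

On 1,000 examples this yields 810 training, 90 development, and 100 test examples, since the development set is taken from the training portion rather than from the full data.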

Baselines
We compare our proposed approach with the following robustness-driven approaches.
SynDA. It is a synonymy based data augmentation approach (Zhang et al., 2015). It uses an English thesaurus, obtained from WordNet, to randomly replace some words in real examples with their synonyms to build new examples.
ConDA. It is a contextual data augmentation approach (Kobayashi, 2018). It builds adversarial examples by randomly replacing words in real examples with the words that are predicted by a bidirectional language model at the word positions.
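For contrast with a learned policy, random synonym replacement can be sketched as below. This is not the exact procedure of Zhang et al. (2015): the per-word replacement probability `replace_prob`, the uniform choice among synonyms, and the toy `thesaurus` dictionary are all assumptions made for illustration.

```python
import random


def random_synonym_replace(words, thesaurus, replace_prob, rng):
    """Replace each word that has synonyms with probability replace_prob."""
    out = []
    for word in words:
        synonyms = thesaurus.get(word, [])
        # The decision is random and independent of any classifier feedback.
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(word)
    return out
```

The key point of the sketch is that the replacement decision never looks at the classifier, so the augmented examples are not targeted at its weaknesses.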
VAT. It is an adversarial training based approach (Miyato et al., 2017) for robust text classification. It adds perturbations to recurrent neural networks to improve the robustness. We use the released code for implementation.

Experiment settings
Based on the performance on the development sets, we set the batch size to 64 for all datasets except Yelp, whose batch size is 256. We use the Adam optimizer to train the modules. The details of model-specific hyper-parameter settings are shown in Table 4. In the pre-training stage, we train the classifier until convergence. In the warm-up stage, we train the generator for 3 epochs. In the reinforcement training stage, we set the maximum number of epochs to 100 and adopt the early stopping mechanism.
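The early stopping mechanism with a maximum of 100 epochs can be sketched as follows. The stopping criterion is not specified in the text, so the `patience` parameter (number of epochs without improvement on the development set) and the callback `run_epoch` are assumptions of this sketch.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """run_epoch(epoch) trains for one epoch and returns development accuracy."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_acc, best_epoch
```

In practice the model parameters from `best_epoch` would be kept for testing; the sketch only tracks the accuracy and the epoch index.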

Results and Discussion
The results of the proposed approach and baselines are shown in Table 5. As expected, LexicalAT, the proposed approach, substantially outperforms the naive baselines (RNN, CNN, and BERT). Since RNN and CNN are directly trained on training sets, they tend to suffer from over-fitting. Therefore, it is reasonable that LexicalAT brings improvements to RNN and CNN models. BERT has better generalization ability because it has been pre-trained on a large-scale corpus. Even so, LexicalAT still improves accuracy over BERT by 0.43 on SST-2, 0.31 on SST-5, 0.11 on RT, and 0.74 on Yelp. These results show that the proposed approach is general and applies well to various models, even the state-of-the-art pre-trained model BERT.
By contrast, SynDA, the synonymy based data augmentation approach, does not bring significant performance improvements over the naive baselines. The main difference between LexicalAT and SynDA is that SynDA uses a random replacement policy to generate new examples, while LexicalAT uses a dynamic replacement policy learned by the proposed adversarial reinforcement training framework. The performance gap between LexicalAT and SynDA shows that the proposed adversarial reinforcement training framework is effective for learning an attacking policy targeted at the weaknesses of the classifier. Furthermore, the proposed approach beats VAT on various datasets. Since VAT only adds random noise to word embeddings without changing the input text, it does not augment the training data with new words and expressions, which limits the robustness improvement. In summary, with adversarial reinforcement training, LexicalAT is capable of learning an attacking policy targeted at the weaknesses of the classifier. With WordNet, LexicalAT can generate examples with diverse expressions, which improves the classifier robustness. These advantages make the proposed approach perform well and achieve the best performance on various datasets.
Table 5 (fragment; only the following rows survive):
(Kobayashi, 2018) 80.30 40.20 * *
+SynDA (Zhang et al., 2015) 80.20 40.50 * *
+ConDA (Kobayashi, 2018) 80.10 41.10 * *
+VAT (Miyato et al., 2017)
(Kobayashi, 2018) 79.50 41.30 * *
+SynDA (Zhang et al., 2015) 80.00 40.70 * *
+ConDA (Kobayashi, 2018)
To illustrate the training process, we plot the performance on the SST-5 development set in Figure 3. LexicalAT converges to a higher accuracy than the naive baseline RNN. This shows that the adversarial reinforcement training mechanism is stable and can converge to good results.
Furthermore, we compare the performance of LexicalAT and naive baselines on defending against attacks, taking RNN and CNN on SST-2, SST-5, and RT as examples. The results are shown in Table 6. "RNN-Attacking Set" and "CNN-Attacking Set" are two sets generated by two different gen-

Analysis
What is the effect of each action? To show the effect of different actions, we take RNN and CNN on SST-2, SST-5, and RT as examples and conduct experiments by dropping one action at a time. As Table 7 shows, only some actions are useful for the robustness improvement. The average performance becomes better after dropping subordinate  and "wolf". In WordNet, neighbor words usually have similar semantic meanings. By replacing a word with its similar word, the semantic diversity can be largely enhanced. Furthermore, the semantic similarity can reduce the risk of changing the label when modifying the input text. Therefore, it is a good choice to include this relation in models unless the replacement exactly impacts the original label in specific tasks.
Is the learned attacking policy universal or model-specific? To explore this question, we conduct the following experiments. We train two generators based on the feedback of classifiers with different structures (e.g., RNN and CNN) on SST-2. For clarity, we call the attacking policies learned by the two generators policy-RNN and policy-CNN, as shown in Table 8. Then, we add the examples generated by policy-RNN and policy-CNN to the training data to train the classifiers with different structures.
For the RNN-based classifier, both attacking policies outperform the naive RNN baseline, which shows the universality of the learned attacking policy. Furthermore, policy-RNN brings more improvement than policy-CNN does on the RNN-based classifier. This demonstrates that the learned attacking policy has model-specific features. A similar result holds for the CNN-based classifier.
In summary, different classification models share some weaknesses and also have their own unique vulnerabilities. Considering that unique weaknesses are difficult to summarize with human knowledge, our approach is an effective way to automatically learn an attacking policy targeted at the classifier weaknesses. Table 9 presents the adversarial examples generated by the generator on SST-5. Even simple perturbations can confuse the classifier, which shows the weak robustness of the classifier.

Error Analysis
Although the proposed approach improves the robustness of current classifiers, it is important to note that there are still several problems to be further explored.
First, some generated examples contain low-quality phrases, such as the bottom one in Table 9. This problem is due to inappropriate word replacement. In future work, we would

Original Text: The movie is without intent.
Generated Text: The film is without spirit.

Original Text: The script is smart not cloying.
Generated Text: The dialogue is smart not saccharine.

Original Text: This is a gorgeous film vivid with color music and life.
Generated Text: This is a gorgeous movie vivid with gloss sound and spirit.

Original Text: Hollywood ending is the most disappointing woody allen movie ever.
Generated Text: Hollywood ending is the most fail woody allen film ever.

Conclusion and Future Work
In this work, we propose a new adversarial training approach, LexicalAT, to improve the robustness of current sentiment classification models.
The key idea is to use WordNet and adversarial reinforcement training to automatically learn diverse attacking policies. We evaluate LexicalAT on four representative sentiment classification datasets. Experiments demonstrate that the proposed approach generalizes better and reduces test errors on various neural networks, including CNN, RNN, and BERT.
In future work, we would like to build an advanced version of LexicalAT from the following perspectives. First, to address the problem of low-quality phrases in generated text, we will explore how to preserve the syntactic correctness of the generated text. Second, we would like to figure out the detailed effects of the generated examples on the robustness of sentiment classification models. By removing useless examples, we can obtain higher training speed.