A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples

Generating adversarial examples for natural language is hard, as natural language consists of discrete symbols, and examples are often of variable lengths. In this paper, we propose a geometry-inspired attack for generating natural language adversarial examples. Our attack generates adversarial examples by iteratively approximating the decision boundary of Deep Neural Networks (DNNs). Experiments on two datasets with two different models show that our attack fools natural language models with high success rates, while only replacing a few words. Human evaluation shows that adversarial examples generated by our attack are hard for humans to recognize. Further experiments show that adversarial training can improve model robustness against our attack.


Introduction
Although Deep Neural Networks (DNNs) have been successful in many machine learning applications (Kim, 2014; Rajpurkar et al., 2016; He et al., 2016), researchers have demonstrated that DNNs are remarkably vulnerable to adversarial attacks, which generate adversarial examples by adding small perturbations to the original input (Szegedy et al., 2014; Goodfellow et al., 2015; Nguyen et al., 2015). Adversarial examples are essential as they showcase the limitations of DNN models. Like humans, good DNN models should be robust to small perturbations of their inputs. If a DNN model judges two almost identical inputs differently, one must profoundly question the quality of the DNN. As such, adversarial examples are more than just a gimmick: they are proof of the fundamental limitations of a DNN model.
Previous research on adversarial attacks has been largely focused on images, e.g., (Akhtar and Mian, 2018). In this paper, we study how to adversarially attack natural language models. Generating adversarial examples for natural language is fundamentally different from generating adversarial examples for images. Images live in a continuous universe, where one can simply change pixel values. Natural language sentences and words on the other hand are typically discrete. This discrete nature makes it difficult to apply existing attacks from the image domain directly to natural language, as an arbitrary point in the input space is unlikely to correspond to a valid natural language sentence or word. Moreover, inputs of natural language to DNNs are of variable lengths, which further complicates generating adversarial examples for natural language.
Despite these obstacles, researchers have proposed various attacks to generate adversarial examples for natural language. Jia and Liang (2017) manage to fool a DNN model for machine reading by adding sentences to the original texts. Zhao et al. (2018) generate adversarial examples for natural language by using an autoencoder. Ebrahimi et al. (2018) modify individual characters of the original example by leveraging gradient information. However, none of these methods addresses the "geometry" of DNNs, which has been shown to be a useful approach in the image domain (Moosavi-Dezfooli et al., 2016; Moosavi-Dezfooli et al., 2019; Modas et al., 2019). In this paper, we propose a geometry-inspired attack for generating natural language adversarial examples. Our attack generates adversarial examples by iteratively approximating the decision boundary of DNNs. We conduct experiments with Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) on two text classification tasks: the IMDB movie review dataset and AG's News dataset. Experimental results show that our attack fools the models with high success rates while keeping the word replacement rates low. We also conduct a human evaluation, showing that adversarial examples generated by our attack are hard for humans to recognize. Further experiments show that model robustness against our attack can be achieved by adversarial training.

Related Work
Despite the success of Deep Neural Networks (DNNs) in many machine learning applications (Kim, 2014; He et al., 2016), researchers have revealed that such models are vulnerable to adversarial attacks, which fool DNN models by adding small perturbations to the original input (Goodfellow et al., 2015). The vulnerability of DNN models poses threats to many applications requiring high-level security. For example, in the image domain, a small error in a self-driving car could lead to a life-threatening disaster. For natural language, a machine might misunderstand the meaning of a text and come to a wrong conclusion. Researchers have also shown that a universal trigger could lead a system to generate highly offensive language (Wallace et al., 2019).
Previously, researchers have developed various adversarial attacks for fooling DNN models for images. Goodfellow et al. (2015) propose the Fast Gradient Sign Method (FGSM), which aims to maximize the loss of the model with respect to the correct label. Projected Gradient Descent (PGD) (Madry et al., 2018) can be viewed as a multi-step version of FGSM. In each step, PGD generates a perturbation using FGSM, and then projects the perturbed input onto an l∞ ball. While these gradient-based methods are effective, researchers have also shown that leveraging geometric information of DNNs can be helpful. Moosavi-Dezfooli et al. (2016) propose DeepFool, which crafts minimal perturbations by iteratively approximating the decision boundary of the classifier. However, while pixel values of images are continuous, natural language consists of sequences of discrete symbols. Moreover, natural language sentences and words are often of variable lengths. Hence, existing adversarial attacks designed for images cannot be directly applied to natural language.
Despite these obstacles, researchers have proposed various methods for generating adversarial examples for natural language. Based on the granularity of adversarial perturbations, adversarial attacks for natural language models can be divided into three categories: character level, word level and sentence level.

Character Level
Character-level adversarial attacks for natural language models generate adversarial examples by modifying individual characters of the original example. Ebrahimi et al. (2018) propose HotFlip, which uses gradient information to swap, insert, or delete a character in an original example. Other work generates adversarial examples by first selecting important words, and then modifying characters of the selected words.
Although character-level adversarial attacks for natural language are effective, such methods suffer from the problem of perceptibility. Humans are likely to recognize adversarial examples generated by these methods, as changing individual characters of texts often results in invalid words. Furthermore, character-level adversarial attacks are easy to defend against: using a simple spell-checking tool to preprocess inputs can protect a DNN model against such attacks.

Word Level
Word-level adversarial attacks generate adversarial examples for natural language by changing words of the original example. Alzantot et al. (2018) propose a genetic attack, in which they replace original words with their synonyms by iteratively applying a genetic algorithm. Zhang et al. (2019) generate adversarial examples for natural language by leveraging Metropolis-Hastings Sampling. Ren et al. (2019) leverage several heuristics to generate word-level adversarial examples. Wallace et al. (2019) propose a universal attack, in which a fixed, input-agnostic sequence of words triggering the model to make a specific prediction is prepended to any example from the dataset. They search such universal triggers by leveraging gradient information.

Sentence Level
While most researchers focus on character/word-level attacks, some researchers propose to fool DNN models for natural language with sentence-level attacks. Jia and Liang (2017) propose to fool a machine reading model by adding an additional sentence to the original texts. However, their method requires heavy human engineering. Other researchers generate adversarial examples by rewriting the entire sentence with an encoder-decoder model for syntactically controlled paraphrase generation.
All these methods, however, do not address the geometry of DNNs, although such information has proven useful for generating adversarial examples for images. In this paper, we propose a geometry-inspired word-level adversarial attack for generating natural language adversarial examples. The rest of this paper is organized as follows. Section 3 describes our attack. Section 4 details the experimental settings as well as the results. Section 5 gives conclusions and insights for future work.

Methodology
Our attack is a white-box attack in that the attacker has access to the architecture and parameters of the victim model. The attack crafts natural language adversarial examples by replacing original words with their synonyms. Specifically, our attack can be divided into two steps: word selection and synonym replacement. In each iteration, the attack first selects a word from the original text, and then replaces the selected word with one of its synonyms to craft an adversarial example. The remainder of this section gives the details of our attack.

Word Selection Strategy
A crucial step in generating text adversarial examples is to find which word of the original example to replace. We follow previous work by ranking words with their saliency scores (Li et al., 2016a;Li et al., 2016b;Ren et al., 2019). The saliency score of word w j is obtained by computing the decrease of true class probability after replacing w j with an out-of-vocabulary word u, embeddings of which are initialized to all zeros during training.
Specifically, let X'_j be the example obtained by replacing word w_j of the original example X with the out-of-vocabulary word u, and let y be the ground truth label of X. The saliency score S_j for word w_j is given by

S_j = P(y | X) − P(y | X'_j).

A higher saliency score indicates that the corresponding word is more important for predicting the true class. Hence, the word with the highest saliency score in the candidate set C is selected for replacement. We build the candidate set C from the words of the original example X, excluding all out-of-vocabulary words and punctuation.
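A minimal sketch of this scoring step, assuming a hypothetical model_proba function that maps a token list to a list of class probabilities (the function name and signature are illustrative, not part of our implementation):

```python
def saliency_scores(model_proba, tokens, y, oov_token="<unk>", skip=()):
    """Word saliency: the drop in true-class probability P(y | X)
    when word w_j is replaced by an out-of-vocabulary token u."""
    base = model_proba(tokens)[y]                  # P(y | X)
    scores = {}
    for j, w in enumerate(tokens):
        if w in skip:                              # exclude OOV words / punctuation
            continue
        x_j = tokens[:j] + [oov_token] + tokens[j + 1:]
        scores[j] = base - model_proba(x_j)[y]     # S_j = P(y|X) - P(y|X'_j)
    return scores
```

The word with the highest score would then be chosen for replacement in the current iteration.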

Synonym Replacement Strategy
Before going into the details of our synonym replacement strategy, we first clarify our assumptions on model architectures. For text classification tasks, a model can be divided into a text encoder Encoder and a feed forward layer FFNN. Specifically, a text encoder encodes an input X into a fixed-size vector v. Choices of such encoders include RNNs, CNNs (Kim, 2014), etc. A feed forward layer then takes the fixed-size vector v as input for classification. A fully connected network followed by a softmax activation layer is common for feed forward layers.
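This assumed architecture can be illustrated with a toy sketch, using mean pooling as a stand-in encoder (a CNN or RNN encoder would play the same role of mapping variable-length input to a fixed-size vector):

```python
import numpy as np

def encoder(embeddings):
    """Toy text encoder: mean-pool word embeddings into a fixed-size
    vector v.  In our setting this role is played by a CNN or RNN."""
    return np.mean(embeddings, axis=0)

def ffnn(v, W, b):
    """Feed forward layer: fully connected layer followed by softmax."""
    logits = W @ v + b
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()
```

Regardless of sequence length, the encoder output v has a fixed dimension, which is what allows the geometric analysis of the next subsection to operate in a single vector space.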
Our attack iterates over the candidate set C to generate adversarial examples. In each iteration, we first compute the word saliency score S_k for each candidate word w_k ∈ C. We then derive the synonym set Q_k* = {w_k*^0, w_k*^1, ...} from WordNet for the candidate word w_k*, which has the largest saliency score S_k* in the current iteration.
We then use geometric information to select the best synonym of w_k* for replacement. Given a DNN classifier consisting of text encoder Encoder and feed forward layer FFNN, we first use Encoder to compute the text vector v_i of X_i, the example before replacement at iteration i. We then find the nearest point b_i on the decision boundary of FFNN by leveraging the DeepFool algorithm (Moosavi-Dezfooli et al., 2016), and compute r_i = b_i − v_i, the vector originating from text vector v_i to the decision boundary. For each synonym w_k*^m ∈ Q_k*, example X_i^m is obtained by replacing w_k* with w_k*^m. We compute the text vector v_i^m by feeding X_i^m into the Encoder. We then obtain p_i^m by projecting d_i^m, the vector originating from v_i to v_i^m, onto r_i. A new example X_i+1 is crafted by replacing the original word w_k* with the synonym w_k*^m* that corresponds to the largest projection z_i^max = p_i^m* · u_i, where u_i is the unit direction vector of r_i. Our intuition is that a text vector with a larger projection onto r_i is closer to the decision boundary. If z_i^max is negative (which indicates that p_i^m* points in the opposite direction of u_i and r_i), we assign X_i to X_i+1 directly and continue to the next iteration. Figure 1 illustrates our synonym replacement strategy. The algorithm stops when the model is fooled or the candidate set C is exhausted. We give the details of our attack in Algorithm 1.
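Assuming the boundary point b_i has already been obtained (e.g., with DeepFool) and the candidate text vectors v_i^m have been computed by the encoder, the projection step can be sketched as follows; all names are illustrative:

```python
import numpy as np

def best_synonym(v_i, b_i, candidate_vectors):
    """Pick the synonym whose text vector has the largest projection
    onto r_i = b_i - v_i, the direction toward the decision boundary.
    Returns (index, z_max); index is None when z_max is negative,
    i.e., when no synonym moves the text vector toward the boundary."""
    r = b_i - v_i
    u = r / np.linalg.norm(r)                     # unit direction vector u_i
    z = [(v_m - v_i) @ u for v_m in candidate_vectors]  # projections p_i^m . u_i
    m_star = int(np.argmax(z))
    z_max = z[m_star]
    if z_max < 0:
        return None, z_max
    return m_star, z_max
```

A candidate with a larger projection lands closer to the boundary, so repeated replacements steadily push the example toward misclassification.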

Experimental Results
We elaborate on our experiments in this section. Section 4.1 details the experimental settings. Section 4.2 describes the results of adversarial attacks. We conduct a human evaluation in Section 4.3 to understand the perceptibility of our adversarial perturbations. Section 4.4 gives the results of adversarial training, which we found can improve the robustness of DNN models against our attack.

Setup
We describe our experimental setup, including datasets and models, in this subsection. We test our attack on two datasets with two different models.
Datasets We conduct our experiments on two English datasets for text classification. Specifically, we have
• IMDB (Maas et al., 2011): The IMDB dataset is a large dataset for binary sentiment classification. Each example in the dataset is a movie review. The classification label is positive/negative. Both labels are equally distributed in the dataset.
• AG's News: The AG's News dataset consists of news articles for topic classification. The dataset has four equally distributed labels: World, Sports, Business and Sci/Tech.

Table 1 shows example predictions for adversarial examples generated by our attack, listing for each example the number of replacements, the distance to the decision boundary, and the true class probability; replaced words are shown in parentheses, e.g., boredom (ennui) and great (smashing).
We list the details of the datasets in Table 2. Note that in preprocessing, we limit the maximum number of words to 600 for each example in the IMDB dataset. We do not limit the maximum number of words in the AG's News dataset. Additionally, examples in both datasets are tokenized using NLTK. The average/maximum number of words is computed after preprocessing.
Models We consider two different DNN models to test the effectiveness of our attack. Specifically, we use a word-based convolutional neural network (CNN) and a recurrent neural network (RNN). We use LSTM as the recurrent unit in the RNN. The CNN or RNN serves as a text encoder, which takes text X as input and outputs a fixed-size vector v. A fully connected layer with softmax activation follows for classification. For both models, we use 100-dimensional GloVe embeddings (Pennington et al., 2014) in our experiments. All hidden layers are 128-dimensional. Table 3 gives the performance of our models on clean examples.
These results are comparable to the model performance reported in other studies (Alzantot et al., 2018; Ren et al., 2019), which indicates that our implementations are sound and that we can proceed to investigate the performance of our adversarial attacks on these models.

Adversarial Attacks
We limit the maximum number of word replacements to 50 and 25 for the IMDB dataset and AG's News dataset, respectively. In other words, the algorithm gives up if it still cannot find an adversarial example after the number of words replaced in the original example has exceeded the limit. We report the success rate of our attack on all correctly classified examples from the test set to prevent the model performance on clean examples from confounding the attack results. We also report the average word replacement rate for our adversarial examples. A lower word replacement rate makes it harder for humans to distinguish adversarial examples from the original ones. We compare our attack with Probability Weighted Word Saliency (PWWS) (Ren et al., 2019), which uses a greedy algorithm based on heuristics like word saliency and true class probability. For a fair comparison, the maximum length of examples is set to 600 for the IMDB dataset. We do not limit the maximum number of words for the AG's News dataset. To facilitate comparison, we obtain the results of PWWS for each dataset by evaluating on 1,000 randomly selected original examples from the test set, while we evaluate our attack on the entire test set. Table 4 shows the results of our adversarial attacks. As we can see, our attack outperforms PWWS on most of the metrics. For the IMDB dataset, our attack fools both the CNN and RNN models with high success rates. Compared to the attack results on the IMDB dataset, we obtain lower success rates and higher word replacement rates for the AG's News dataset. Possible explanations are: (1) fooling a multi-class classifier is harder than fooling a binary classifier, and (2) examples of the IMDB dataset are longer than examples of the AG's News dataset, and it is easier to generate adversarial examples for longer sequences. Table 1 gives some adversarial examples generated by our attack.
As the attack replaces words from the original example, the true class probability decreases and the resulting text vector moves closer to the decision boundary. A distance smaller than 0 indicates that the text vector has crossed the decision boundary.
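The two evaluation metrics above can be sketched as a small helper, assuming per-example attack records; the field names are illustrative:

```python
def attack_metrics(results):
    """results: one dict per attacked (correctly classified) test example,
    with keys 'fooled' (bool), 'n_replaced' and 'n_words' (ints)."""
    success = [r for r in results if r["fooled"]]
    success_rate = len(success) / len(results)
    # word replacement rate is averaged over successful adversarial examples
    avg_replacement = (
        sum(r["n_replaced"] / r["n_words"] for r in success) / len(success)
        if success else 0.0
    )
    return success_rate, avg_replacement
```

Restricting the denominator of the success rate to correctly classified examples keeps clean accuracy from confounding the attack results, as described above.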

Human Evaluation
We conducted a human evaluation to understand the perceptibility of our adversarial perturbations. For each dataset, we randomly select 100 adversarial examples and the corresponding original examples. We hired workers from Amazon Mechanical Turk to conduct the evaluation. We asked the workers to perform three tasks: (1) Accuracy: Predict the label of an example.
(2) Similarity: Judge the similarity of the given example to the original example.
(3) Modified: Judge the possibility that some words in the text have been replaced by a machine.
For the last two tasks, the workers are required to give a score between 1 and 5. A higher score indicates higher similarity/possibility. For each task, each assignment is shown to five workers. All assignments are randomly shuffled before being shown to workers. For task (1), we take the majority of the five predictions as the final label. Note that for the AG's News dataset, we count an example as incorrectly classified if no majority label exists. For tasks (2) and (3), we average scores across workers. The accuracy of the workers on the AG's News dataset is higher than that on the IMDB dataset. We also notice that although some examples are not grammatically correct after perturbation, the adversarial examples are still hard for humans to recognize, as the word replacement rates are very low.
To better understand human performance in judging the similarity of texts, the workers were also asked in task (2) to give the similarity score between two identical original examples (refer to column 4 of Table 5). We see from Table 5 that although the workers were expected to give a score of 5 for identical examples, they gave scores of 4.13 and 4.96 for the IMDB and AG's News datasets, respectively. The score for the IMDB dataset (4.13) is lower than that for the AG's News dataset (4.96). We believe that the examples of the IMDB dataset being longer than those of the AG's News dataset makes it harder for workers to judge whether or not two examples are identical.

Adversarial Training
We conduct further experiments to validate whether robustness against our attack can be achieved by adversarial training. To save time, we perform adversarial training by fine-tuning the pretrained models. During adversarial training, the training set is augmented with adversarial examples, which successfully fool the model and are generated in each epoch by perturbing the correctly classified examples. Figure 2 (a) shows that adversarial training helps, as the success rate of the attack gradually drops. Figure 2 (b) demonstrates that the average word replacement rates increase as adversarial training proceeds. Lower success rates and higher word replacement rates show that the models gain robustness against our attack through adversarial training. We also notice from Figure 2 (b) that the average word replacement rates of adversarial examples start to decrease after training for some epochs. We believe that the model becomes more robust by first identifying adversarial examples with higher word replacement rates. Hence, the adversarial examples left after some epochs of adversarial training have relatively lower word replacement rates. For the IMDB dataset, we also observe a drop in accuracy on clean examples during adversarial training, consistent with the reported trade-off between robustness and accuracy (Tsipras et al., 2019). However, we do not observe this phenomenon for the AG's News dataset. This indicates that although adversarial training for texts and images is similar, the two settings still differ in certain aspects.
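A minimal sketch of this fine-tuning loop; the model, attack, prediction, and training routines are assumed placeholders rather than our actual implementation:

```python
def adversarial_finetune(model, train_set, attack, fit, predict, epochs=3):
    """Fine-tune a pretrained model on the training set augmented with
    adversarial examples generated anew in each epoch.  `attack` returns
    an adversarial example or None; `fit` performs one training pass."""
    for _ in range(epochs):
        # only correctly classified examples are perturbed
        correct = [(x, y) for x, y in train_set if predict(model, x) == y]
        adv = []
        for x, y in correct:
            x_adv = attack(model, x, y)
            if x_adv is not None:            # keep only successful attacks
                adv.append((x_adv, y))
        model = fit(model, train_set + adv)  # augment and train
    return model
```

Regenerating the adversarial examples every epoch lets the augmentation track the changing decision boundary of the model being hardened.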

Conclusion and Future Work
In this paper, we propose a geometry-inspired attack for generating natural language adversarial examples. Our attack generates adversarial examples by iteratively approximating the decision boundary of Deep Neural Networks. Experiments on two text classification tasks with two models show that our attack reaches high success rates while keeping word replacement rates low. Human evaluation shows that adversarial examples generated by our attack are hard for humans to recognize. Experiments also show that adversarial training increases model robustness against our attack. Our current attack works for models with context-independent word embeddings. In the future, we would like to extend our attack to models using contextualized word embeddings, including ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), etc.