Universal Adversarial Attacks with Natural Triggers for Text Classification

Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to text processed by classifiers. Despite being successful, the word sequences produced in such attacks are often ungrammatical and can be easily distinguished from natural text. We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. We leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search that aims to maximize the downstream classifier’s prediction loss. Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models as per automatic detection metrics and human-subject studies. Our aim is to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses.


Introduction
Adversarial attacks have recently been quite successful in foiling neural text classifiers (Jia and Liang, 2017; Ebrahimi et al., 2018). Universal adversarial attacks (Wallace et al., 2019; Behjati et al., 2019) are a sub-class of these methods in which the same attack perturbation can be applied to any input to the target classifier. Being input-agnostic, these attacks point to more serious shortcomings in trained models since they do not require re-generation for each input. However, the attack sequences generated by these methods are often meaningless and irregular text (e.g., "zoning tapping fiennes" from Wallace et al. (2019)). (Our code is available at https://github.com/Hsuan-Tung/universal_attack_natural_trigger.) While
human readers can easily identify them as unnatural, one can also use simple heuristics to spot such attacks. For instance, the words in the above attack trigger have an average frequency of 14 compared to 6700 for words in benign inputs in the Stanford Sentiment Treebank (SST) (Socher et al., 2013).
In this paper, we design natural attack triggers by using an adversarially regularized autoencoder (ARAE) (Zhao et al., 2018a), which consists of an autoencoder and a generative adversarial network (GAN). We develop a gradient-based search over the noise vector space to identify triggers with good attack performance. Our method, Natural Universal Trigger Search (NUTS), uses projected gradient descent with l2 norm regularization to avoid out-of-distribution noise vectors and maintain the naturalness of the generated text. Our attacks perform quite well on two different classification tasks: sentiment analysis and natural language inference (NLI). For instance, the phrase "combined energy efficiency", generated by our approach, results in a classification accuracy of 19.96% on negative examples of the Stanford Sentiment Treebank (Socher et al., 2013). Furthermore, we show that our attack text does better than prior approaches on three different measures: average word frequency, loss under the GPT-2 language model (Radford et al., 2019), and errors identified by two online grammar checking tools (scr; che). A human judgement study shows that up to 77% of raters find our attacks more natural than the baseline, and almost 44% of humans find our attack triggers concatenated with benign inputs to be natural. This demonstrates that, using techniques similar to ours, adversarial attacks can be made much harder to detect than previously thought, and that appropriate long-term defenses are needed to secure our NLP models.

Related Work
Input-dependent attacks These attacks generate specific triggers for each different input to a classifier. Jia and Liang (2017) fool reading comprehension systems by adding a single distractor sentence to the input paragraph. Ebrahimi et al. (2018) replace words of benign texts with similar tokens using word embeddings. Similarly, Alzantot et al. (2018) leverage genetic algorithms to design word-replacing attacks. Zhao et al. (2018b) adversarially perturb latent embeddings and use a text generation model to perform attacks. Song et al. (2020) develop natural attacks that cause semantic collisions, i.e., make semantically unrelated texts be judged as similar by NLP models.
Universal attacks Universal attacks are input-agnostic and hence, word-replacing and embedding-perturbing approaches are not applicable. Wallace et al. (2019) and Behjati et al. (2019) concurrently proposed to perform gradient-guided searches over the space of word embeddings to choose attack triggers. In both cases, the attack triggers are meaningless and can be easily detected by a semantic checking process. In contrast, we generate attack triggers that appear more natural and retain semantic meaning. In computer vision, GANs have been used to create universal attacks (Xiao et al., 2018;Poursaeed et al., 2018). Concurrent to our work, Atanasova et al. (2020) design label-consistent natural triggers to attack fact checking models. They first predict unigram triggers and then use a language model conditioned on the unigram to generate natural text as the final attack, while we generate the trigger directly.

Universal Adversarial Attacks with Natural Triggers
We build upon the universal adversarial attacks proposed by Wallace et al. (2019). To enable natural attack triggers, we use a generative model which produces text using a continuous vector input, and perform a gradient-guided search over this input space. The resulting trigger, which is added to benign text inputs, is optimized so as to maximally increase the loss under the target classification model.
Problem formulation Consider a pre-trained text classifier F to be attacked. Given a set of benign input sequences {x} with the same ground truth label y, the classifier has been trained to predict F(x) = y. Our goal is to find a single input-agnostic trigger t that, when concatenated with any benign input, causes F to perform an incorrect classification, i.e., F([t; x]) ≠ y, where ; represents concatenation. In addition, we also need to ensure the trigger t is natural, fluent text.

Figure 1: Overview of our attack. Based on the gradient of the target model's loss function, we iteratively update the noise vector n with small perturbations to obtain successful and natural attack triggers.
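As a minimal illustration of this objective, the sketch below measures how often a single trigger flips predictions away from the true label; `classifier` and plain string concatenation are hypothetical stand-ins for the actual model F and its tokenization.

```python
# Hypothetical sketch of the universal-attack objective: a single trigger t is
# prepended to every benign input x, and the attack succeeds on x when the
# prediction flips away from the true label y.  `classifier` stands in for F.
def attack_success_rate(classifier, trigger, inputs, label):
    """Fraction of inputs misclassified once the trigger is prepended."""
    flipped = sum(classifier(trigger + " " + x) != label for x in inputs)
    return flipped / len(inputs)
```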
Attack trigger generation To ensure the trigger is natural, fluent and carries semantic meaning, we use a pre-trained adversarially regularized autoencoder (ARAE) (Zhao et al., 2018a) (details in Section 4). The ARAE consists of an encoder-decoder structure and a GAN (Goodfellow et al., 2014). The input is a standard Gaussian noise vector n, which is first mapped to a latent vector z by the generator g. The decoder then uses z to generate a sequence of words, in our case the trigger t. This trigger is concatenated with a set of benign texts {x} to get full attack texts {x'}. The overall process can be formulated as z = g(n), t = dec(z), x' = [t; x]. We then pass each x' into the target classifier and compute the gradient of the classifier's loss with respect to the noise vector, ∇_n L(F(x'), y). Backpropagating through the decoder is not straightforward since it produces discrete symbols. Hence, we use a reparameterization trick similar to the Gumbel softmax (Jang et al., 2017) to sample words from the output vocabulary of the ARAE model as a one-hot encoding of triggers while allowing gradient backpropagation. Figure 1 provides an overview of our attack algorithm, which we call Natural Universal Trigger Search (NUTS).
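The reparameterized sampling step can be sketched as follows. This is a simplified NumPy version of the Gumbel-softmax relaxation: the hard one-hot sample is what the decoder emits, while the soft distribution is the differentiable surrogate that would carry gradients in a straight-through estimator (plain NumPy cannot show the backward pass itself).

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Sample a one-hot vector via the Gumbel-max trick, alongside the softmax
    relaxation that serves as the differentiable surrogate."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise added to the vocabulary logits
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    soft = np.exp(y - y.max())          # numerically stable softmax
    soft = soft / soft.sum()
    hard = np.zeros_like(soft)          # discrete one-hot word sample
    hard[np.argmax(soft)] = 1.0
    return hard, soft
```

A smaller temperature `tau` makes the soft distribution closer to the hard one-hot sample, at the cost of higher gradient variance.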
Ensuring natural triggers In the ARAE model, the original noise vector n_0 is sampled from a standard multivariate Gaussian distribution. While we can change this noise vector to produce different outputs, a simple gradient search may veer significantly off-course and lead to bad generations. To prevent this, following Carlini and Wagner (2017), we use projected gradient descent with an l2 norm constraint to ensure the noise n always stays within a limited ball around n_0. We iteratively update n as

n ← Π_{B_ε(n_0)} (n + η ∇_n L(F(x'), y)),    (1)

where Π_{B_ε(n_0)} represents the projection operator onto the l2 ball B_ε(n_0) = {n : ‖n − n_0‖_2 ≤ ε}. We try different settings of the number of attack steps, ε, and the step size η, selecting values based on the quality of the output triggers. In our experiments, we use 1000 attack steps with ε = 10 and η = 1000.
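The projected update above can be sketched in NumPy as follows; function names are illustrative, and in practice the gradient comes from backpropagating through the decoder and classifier.

```python
import numpy as np

def project_l2_ball(n, n0, eps):
    """Project n onto the l2 ball of radius eps centered at n0."""
    delta = n - n0
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return n0 + delta

def pgd_step(n, n0, grad, eta, eps):
    # Ascend the classifier's loss (the attack *maximizes* L), then project
    # the noise back into the epsilon-ball around the initial vector n0.
    return project_l2_ball(n + eta * grad, n0, eps)
```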
Final trigger selection Since our process is not deterministic, we initialize multiple independent noise vectors (256 in our experiments) and perform our update (1) to obtain many candidate triggers. We then re-rank the triggers to balance target classifier accuracy m1 (lower is better) and naturalness, measured as the average per-token cross-entropy under GPT-2, m2 (lower is better), using the score m1 + λ·m2. We select λ = 0.05 to balance the difference in scales of m1 and m2.
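The re-ranking step reduces to picking the candidate with the lowest combined score; a minimal sketch:

```python
def rerank_triggers(candidates, lam=0.05):
    """candidates: (trigger, m1, m2) tuples, where m1 is target-classifier
    accuracy under attack and m2 is average per-token GPT-2 cross-entropy.
    Returns the candidate minimizing m1 + lam * m2 (lower is better for both)."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```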

Experiments
We demonstrate our attack on two tasks: sentiment analysis and natural language inference. We use the method of Wallace et al. (2019) as a baseline and use the same datasets and target classifiers for comparison. For the text generator, we use an ARAE model pre-trained on the 1 Billion Word dataset (Chelba et al., 2014). For both our attack (NUTS) and the baseline, we limit the vocabulary of attack trigger words to the overlap of the classifier and ARAE vocabularies. We generate triggers using the development set of each task and report results on the test set (results on both sets are in the Appendix).
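The vocabulary restriction described above amounts to a set intersection; a minimal sketch with hypothetical argument names:

```python
def restrict_trigger_vocab(classifier_vocab, arae_vocab):
    # A trigger word must be generable by the ARAE *and* known to the
    # target classifier, so only the overlap of the two vocabularies is used.
    return sorted(set(classifier_vocab) & set(arae_vocab))
```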
Defense metrics We employ three simple defense metrics to measure the naturalness of attacks:
1. Word frequency: the average frequency of the trigger's words, computed using empirical estimates from the training set of the target classifier.
2. Language model loss: the average per-token cross-entropy of the text under the GPT-2 language model (Radford et al., 2019).
3. Grammar errors: the number of errors per word identified by two online grammar checking tools (scr; che).

Figure 2: Difference in (a) average word frequency (normalized) and (b) average GPT-2 loss between benign text (x) and different attack triggers (t) (length 8) for SST and SNLI (computed as stat(x) − stat(t)). For SNLI, our attacks have lower GPT-2 loss values than even the original text, leading to a positive delta.

Table 3: Human judgement results: all numbers in %; columns represent the choices provided to human raters. (Left) Our attacks are judged more natural than baseline attacks (both on their own and when concatenated with benign input text). Significance tests return p < 1.7 × 10^−130 and p < 4.9 × 10^−45 for the two rows, respectively. (Right) Individual assessments show that our attack is judged more natural than the baseline but less natural than benign text on its own (as expected). Significance between natural ratings for our model and the baseline has p < 1.4 × 10^−18.

As shown in Table 1, two grammar checkers (scr; che) report 12.50% and 21.88% errors per word on our attack triggers, compared to 15.63% and 28.13% for the baseline. For SNLI, we attack the classifier by adding a trigger to the front of the hypothesis.
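The word-frequency metric can be sketched as follows, treating frequency as a raw count over a whitespace-tokenized training corpus (a simplification of whatever tokenization the target classifier actually uses):

```python
from collections import Counter

def avg_word_frequency(trigger, train_corpus):
    """Average empirical training-set count of the trigger's words."""
    counts = Counter(w for sent in train_corpus for w in sent.split())
    words = trigger.split()
    return sum(counts[w] for w in words) / len(words)
```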
Results From Table 2, we see that both our attack and the baseline decrease the accuracy to almost 0% on entailment and neutral examples. On contradiction examples, our attack brings the accuracy down to 26.78% while the baseline decreases it to 23.02%. Although less successful, our attacks are much more natural than the baseline. In Figure 2, our attacks are closer to the word frequency of benign inputs and even achieve a lower GPT-2 loss than the benign text. In Table 1, two grammar checkers (scr; che) also report lower errors on our attacks compared to the baseline.

Human-Subject Study
To further validate that our attacks are more natural than the baseline, we perform a human-subject study on Amazon Mechanical Turk. We collect ratings by: (1) providing a pair of our trigger vs. the baseline trigger (with and without benign text) and asking the worker to select the more natural one; (2) providing a piece of text (our attack text / baseline attack text / benign input) and asking the worker to determine whether it is naturally generated or not. Both conditions allow the worker to choose a "Not sure" option. We generated attack triggers with lengths of 3, 5, and 8 (see Appendix for details) and created 450 comparison pairs for (1). As shown in Table 3 (left), up to 77% of workers find our attack trigger to be more natural than the baseline, while 61.16% judge our attack to be more natural even when concatenated with benign text. Table 3 (right) shows that 44.27% of human subjects think our attack inputs are naturally generated. Although this is lower than the 83.11% for real natural inputs, it is still significantly higher than the 22.84% for baseline attack inputs, which shows that our attacks are more natural and harder for humans to detect than the baseline.

Attack Transferability
Similar to Wallace et al. (2019), we also evaluate the attack transferability of our universal adversarial attacks to different models and datasets. A transferable attack further decreases the assumptions being made: for instance, the adversary may not need white-box access to a target model and instead generate attack triggers using its own model to attack the target model. We first evaluate transferability of our attack across different model architectures. Besides the LSTM classifier in Section 4.1, we also train a BERT-based classifier on the SST dataset with 92.86% and 91.15% test accuracy on positive and negative data. From Table 4, we can see that the transferred attacks, generated for the LSTM model, lead to 14% ∼ 51% accuracy drop on the target BERT model.
We also evaluate attack transferability across different datasets. In addition to the SST dataset in Section 4.1, we train a different LSTM classifier with the same model architecture on the IMDB sentiment analysis dataset, which gets 89.75% and 89.85% test accuracy on positive and negative data. Our attacks transfer in this case also, leading to accuracy drops of 18% ∼ 34% on the target model (Table 4).

Conclusion
We developed universal adversarial attacks with natural triggers for text classification and experimentally demonstrated that our model can generate attack triggers that are both successful and appear natural to humans. Our main goals are to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses. Future work can explore better ways to optimally balance attack success and trigger quality, while also investigating ways to detect and defend against them.

Ethical considerations
The techniques developed in this paper have potential for misuse in terms of attacking existing NLP systems with triggers that are hard to identify and/or remove even for humans. However, our intention is not to harm but instead to publicly release such attacks so that better defenses can be developed in the future. This is similar to how hackers expose bugs/vulnerabilities in software publicly. Particularly, we have demonstrated that adversarial attacks can be harder to detect than previously thought (Wallace et al., 2019) and therefore can present a serious threat to current NLP systems. This indicates our work has a long-term benefit to the community.
Further, while conducting our research, we used the ACM Ethical Code as a guide to minimize harm. Our attacks are not against real-world machine learning systems.

A Experimental Details
Hyperparameter search For our gradient-based attack approach (Equation (1) in the main paper), there are three hyperparameters: the l2 norm budget ε of the adversarial perturbation, the number of attack steps T, and the step size η in each attack step. Among them, ε is the most critical for our attacks. Too small an ε limits the search space over the ARAE (Zhao et al., 2018a) noise input and thus leads to low attack success; too large an ε changes the noise input significantly and thus leads to unnatural trigger generations. In our experiments, we use grid search to manually try different settings of these hyperparameter values: ε is selected from {2, 5, 10, 20, 50}; T is selected from {500, 1000, 2000, 5000}; and η is selected from {10, 100, 1000, 10000}. Based on the attack success and the naturalness of the generated triggers, we finally set ε = 10, T = 1000, and η = 1000.
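The grid search above can be sketched as follows. The paper selects values manually based on attack success and trigger naturalness, so the scalar score used here (the same m1 + 0.05·m2 balance as the trigger re-ranking) is an assumption for illustration, and `run_attack` is a hypothetical callback.

```python
import itertools

def hyperparameter_grid_search(run_attack,
                               eps_grid=(2, 5, 10, 20, 50),
                               steps_grid=(500, 1000, 2000, 5000),
                               eta_grid=(10, 100, 1000, 10000)):
    """run_attack(eps, T, eta) -> (classifier_accuracy, gpt2_loss), both
    lower-is-better.  Returns the best (eps, T, eta) triple over the grid."""
    best, best_score = None, float("inf")
    for eps, T, eta in itertools.product(eps_grid, steps_grid, eta_grid):
        acc, loss = run_attack(eps, T, eta)
        score = acc + 0.05 * loss  # assumed scalarization, mirrors re-ranking
        if score < best_score:
            best, best_score = (eps, T, eta), score
    return best
```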

B.1 Attack Results with Different Trigger Lengths

Table 5 provides examples of attacks with varying lengths, along with their corresponding classifier accuracies (lower numbers indicate more successful attacks).

B.2 Attack Results on the Development Set and the Test Set
In our experiments, the attack trigger is first generated by increasing the target classifier's loss on the development set, and then applied on the test set to measure its success. Here, we present both the development accuracy and the test accuracy under the same attack triggers in Table 6. We can see that although generated by only attacking the development set, each trigger also works well on the test set: it causes a similar accuracy drop on both the development set and the test set.

B.3 Naïve Attacks with Random Triggers
In this section, we check how difficult it is to attack a given task by implementing two naïve attacks that use no gradient information. In the first attack ("Random ARAE"), we randomly collect candidate triggers generated by the ARAE model (Zhao et al., 2018a), compute the classifier accuracy for each trigger, and select the one with the lowest classifier accuracy as the attack trigger. This attack can be considered a simplified version of our attack (NUTS) with the gradient information removed. The second attack ("Random outputs") is similar to the first, except that we do not enforce the naturalness of the triggers: we select the attack trigger with the lowest classifier accuracy from many random word sequences. This attack can be considered a much simplified version of the baseline attack (Wallace et al., 2019). For both naïve attacks, following our gradient-based attack, we select the final trigger from 256 candidate triggers for a fair comparison. Table 7 shows all the attack results. First, we observe that the two naïve attacks are quite successful at attacking entailment and neutral examples in the SNLI task: they decrease the classifier accuracy to 0% and 3.45%, respectively. This indicates that those examples are quite easy to attack. Second, for both positive and negative examples in the SST task and the contradiction examples in the SNLI task, the success of the two naïve attacks is quite limited, and we observe a significant improvement in attack success with the corresponding gradient-based attacks.

Table 2: Compared to the baseline, our attacks are slightly less successful at reducing test accuracy but generate more natural triggers. For SST, "+" = positive and "-" = negative sentiment. For SNLI, "+" = entailment, "0" = neutral, and "-" = contradiction. Lower numbers are better. 'No trigger' = classifier accuracy without any attack.
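Both naïve attacks reduce to the same gradient-free selection rule, sketched below; `eval_accuracy` is a hypothetical callback that runs the target classifier with the given trigger prepended to every evaluation input.

```python
def best_random_trigger(candidate_triggers, eval_accuracy):
    # "Random ARAE" / "Random outputs": no gradient search, just pick the
    # candidate (256 in the paper) yielding the lowest classifier accuracy.
    # The two attacks differ only in how candidate_triggers is produced
    # (natural ARAE samples vs. random word sequences).
    return min(candidate_triggers, key=eval_accuracy)
```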

B.4 Attack Results without GPT-2 Based Reranking
The GPT-2 based reranking is used to balance attack success and trigger naturalness. Without it, the selected trigger has a slightly higher attack success but a significantly larger GPT-2 loss. For SST, without reranking, our attack triggers decrease accuracy to 26.84% and 7.68% on positive and negative data, but the GPT-2 losses increase from 6.85 (or 6.65) to 8.80 (or 8.88) for positive (or negative) data.

B.5 Variance over Candidate Triggers
For our attacks against negative SST data with a trigger length of 8, among all 256 candidate triggers, the average classifier accuracy after attack is 0.23 with a standard deviation of 0.10, and the average GPT-2 loss is 7.93 with a standard deviation of 0.85. There is no inherent tradeoff between naturalness and attack success: some triggers have both low classifier accuracy and low GPT-2 loss, and the Pearson correlation between the two is -0.08.
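The reported correlation is the standard Pearson coefficient between the per-trigger classifier accuracies and GPT-2 losses; for reference:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```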

C Human-Subject Study Details
We perform the human-subject study on Amazon Mechanical Turk. Crowdworkers were required to have a 98% HIT acceptance rate and a minimum of 5000 HITs. Workers were asked to spend a maximum of 5 minutes on each assignment (i.e., comparing the naturalness of a pair of our trigger vs. the baseline trigger, or evaluating the naturalness of a piece of text), and were paid $0.01 per assignment.

Table 6: Universal attack results on both the development (dev) set and the test set for the Stanford Sentiment Treebank (SST) classifier and the Stanford Natural Language Inference (SNLI) classifier. For SST, "+" = positive and "-" = negative sentiment. For SNLI, "+" = entailment, "0" = neutral, and "-" = contradiction. We first generate the attack trigger by increasing the classifier's loss on the dev set, and then apply the same trigger on the test set. 'No trigger' refers to classifier accuracy without any attack. The same triggers achieve similar attack success on both the development set and the test set.

Table 7: Universal attack results on both the Stanford Sentiment Treebank (SST) classifier and the Stanford Natural Language Inference (SNLI) classifier. Besides gradient-based attacks, including our attack (NUTS) and the baseline attack (Wallace et al., 2019), we implement two naïve attacks without gradient-guided search: "Random ARAE" selects the best attack trigger from random natural ARAE outputs; "Random outputs" selects the best attack trigger from random unnatural word sequences.