Second-Order NLP Adversarial Examples

Adversarial example generation methods in NLP rely on models like language models or sentence encoders to determine whether potential adversarial examples are valid. In these methods, a valid adversarial example fools the model being attacked and is determined to be semantically or syntactically valid by a second model. Research to date has counted all such examples as errors by the attacked model. We contend that these adversarial examples may not be flaws in the attacked model, but flaws in the model that determines validity. We term such invalid inputs second-order adversarial examples. We propose the constraint robustness curve, and an associated metric, ACCS, as tools for evaluating the robustness of a constraint to second-order adversarial examples. To generate this curve, we design an adversarial attack that runs directly on the semantic similarity models. We test two constraints, the Universal Sentence Encoder (USE) and BERTScore. Our findings indicate that such second-order examples exist, but are typically less common than first-order adversarial examples in state-of-the-art models. They also indicate that USE is effective as a constraint on NLP adversarial examples, while BERTScore is nearly ineffectual. Code for running the experiments in this paper is available at https://github.com/jxmorris12/second-order-adversarial-examples.


Introduction
If an imperceptible change to an input causes a model to make a misclassification, the perturbed input is known as an adversarial example (Goodfellow et al., 2014). In domains with continuous inputs, like audio and vision, whether such a change is considered "imperceptible" can be easily measured: a change to an image may be considered imperceptible (and thus a valid adversarial example) if the resulting image is no more than some fixed distance away in pixel space (Chakraborty et al., 2018). We refer to the function that determines imperceptibility as the constraint, C. For input x and perturbation x_adv, if C(x, x_adv) is true, x_adv is a valid perturbation of x. Different domains call for different constraints. In vision, a common constraint is ℓ∞(x, x_adv), the maximum pixel-wise distance between image x and its perturbation x_adv (Goodfellow et al., 2014). In audio, a common constraint is |dB(x) − dB(x_adv)|, the distortion in decibels between audio input x and perturbation x_adv (Carlini and Wagner, 2018). Both constraints are easily computed, well-understood, and correlate with human perceptual distance.

[Figure 1: A second-order adversarial example. Although the perturbation has a different meaning than the original (and the entailment model correctly predicts a contradiction), the sentence-encoding similarity does not reflect this change; current NLP adversarial example generation methods would incorrectly consider this a flaw in the entailment model.]
Choosing the correct constraint is not always so straightforward. In discrete domains like language, there is no obvious choice. In fact, the field lacks consensus on even the meaning of "imperceptibility". Different adversarial attacks have used different definitions of imperceptibility (Zhang et al., 2020a). One common definition (Alzantot et al., 2018; Jin et al., 2019; Ren et al., 2019; Garg and Ramakrishnan, 2020) is imperceptibility with respect to meaning: C(x, x_adv) is true if x_adv retains the semantics of x.
With this definition, a perturbation x_adv is determined to be a valid adversarial example if it simultaneously fools the model and retains the semantics of x. This formulation is problematic because measuring semantic similarity is an open problem in NLP. As a consequence, many adversarial attacks use a second NLP model as a constraint, to determine whether or not x_adv preserves the semantics of x.
Just like the model under attack, the semantic similarity model is vulnerable to adversarial examples. So when this type of attack finds a valid adversarial example, it is unclear which model has made a mistake: was it the model being attacked, or the model used to enforce the constraint?
In other words, it is possible that the semantic similarity model improperly classified x_adv as preserving the semantics of x. We refer to these flaws in constraints as second-order adversarial examples. Figure 1 shows a sample second-order adversarial example. Second-order adversarial examples have been largely ignored in the literature on NLP adversarial examples to date. Now that we are aware of the existence of second-order adversarial examples, we seek to minimize their impact. How can we measure a given constraint's susceptibility to second-order adversarial examples? We suggest one such measurement tool: the constraint robustness curve and its associated metric ACCS.
We then develop an adversarial example generation technique for finding examples that fool these semantic similarity models. Our findings indicate that adversarial examples for these types of models exist, but are less likely than adversarial examples that fool other NLP models.
Along the way, we compare the Universal Sentence Encoder (USE) (Cer et al., 2018), a sentence encoder commonly used as a constraint for NLP adversarial examples, with BERTScore (Zhang et al., 2019), a metric that outperforms sentence encoders for evaluating text generation systems.
The main contributions of this work can be summarized as follows: 1. We formally define second-order adversarial examples, a previously unaddressed issue with the problem statement for semantics-preserving adversarial example generation in NLP.
2. We propose Adjusted Constraint C-Statistic (ACCS), the normalized area under the constraint robustness curve, as a measurement of the efficacy of a given model as a constraint on adversarial examples.
3. We run NLP adversarial attacks not on models fine-tuned for downstream tasks, but on the semantic similarity models used to regulate the adversarial attack process, and show that their robustness varies widely. Across the board, USE achieves a much higher ACCS, indicating that USE is a more robust choice than BERTScore for constraining NLP adversarial perturbations.

Second-order adversarial examples
To create natural language adversarial examples that preserve semantics, past work has implemented the constraint using a model that measures semantic similarity (Garg and Ramakrishnan, 2020; Alzantot et al., 2018; Li et al., 2018; Jin et al., 2019). For semantic similarity model S, original input x, and adversarial perturbation x_adv, constraint C can be defined as:

C(x, x_adv) = true if S(x, x_adv) ≥ ε, false otherwise

where ε is a threshold that determines semantic similarity. If the semantic similarity of the original input and the perturbation is at least ε, the perturbation is considered a valid adversarial example.
Using such a constraint in an untargeted attack on classification model F, the attack goal function G can be written as:

G(x_adv) = (F(x_adv) ≠ F(x)) ∧ C(x, x_adv)

Here, x_adv is a valid adversarial example when both criteria of the goal are fulfilled: F produces a different class output for x_adv than for x, and C(x, x_adv) is true. This type of joint goal function is common in NLP adversarial attacks (Zhang et al., 2020a).
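The constraint and the joint goal function described above can be sketched in a few lines of Python. Here S is a stand-in word-overlap similarity and F a stand-in keyword classifier, purely for illustration; the paper's experiments use neural models (USE or BERTScore) for S and fine-tuned classifiers for F.

```python
def toy_similarity(x: str, x_adv: str) -> float:
    """Stand-in for a semantic similarity model S (e.g. USE cosine similarity)."""
    a, b = set(x.lower().split()), set(x_adv.lower().split())
    return len(a & b) / len(a | b)  # Jaccard word overlap, in [0, 1]

def constraint(x: str, x_adv: str, S, eps: float) -> bool:
    """C(x, x_adv): true when S scores the pair at or above the threshold eps."""
    return S(x, x_adv) >= eps

def goal(F, x: str, x_adv: str, S, eps: float) -> bool:
    """Untargeted joint goal G: F changes its prediction AND C(x, x_adv) holds."""
    return F(x_adv) != F(x) and constraint(x, x_adv, S, eps)

# Stand-in victim classifier: predicts "positive" iff the word "good" appears.
F = lambda text: int("good" in text.split())
```

With eps = 0.5, the pair ("the movie was good", "the movie was great") satisfies the goal: the stand-in classifier changes its prediction, and the word overlap (0.6) clears the threshold; at eps = 0.8 the same perturbation is rejected by the constraint.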
It is possible that these constraints evaluate the semantic similarity of the original and perturbed text incorrectly. If the semantic similarity score is erroneously too low, x_adv will be rejected by the attack; if it is erroneously too high, the attack will consider x_adv a valid adversarial example.
If S(x, x_adv) is too high, x_adv is incorrectly considered a valid adversarial example, and counted as a flaw in model F. However, since semantics is not preserved from x to x_adv, there is no reason to expect F(x) to be consistent with F(x_adv). The flaw is actually in S, the semantic similarity model that erroneously considered x_adv to be a valid adversarial example.
For adversarial attacks on model F using a constraint determined by model S, we suggest the following terminology:
• First-order adversarial examples are perturbations that are correctly classified as imperceptible by S, and fool F.
• Second-order adversarial examples fool S, the model used as a constraint. Regardless of the output of F , these are adversarial examples for S.
In the next section, we suggest a method for determining the vulnerability of S to second-order adversarial examples.

Constraint robustness curves and ACCS
In this section, we propose the constraint robustness curve, a method for analyzing the robustness of a given constraint, i.e. its susceptibility to second-order adversarial examples. Each semantic similarity model may produce scores on a different scale, so the threshold ε that best preserves semantics varies from model to model. As such, we cannot fairly compare two models at the same value of ε.
However, the problem of comparing two binary classifiers that may have different threshold scales is common in machine learning (Hajian-Tilaki, 2013). Inspired by the receiver operating characteristic (ROC) curve for binary classifiers, we propose the constraint robustness curve, a plot of first-order vs. second-order adversarial examples as constraint sensitivity varies. To create the constraint robustness curve for semantic similarity model S and threshold ε, we plot the number of true positives (first-order adversarial examples, found using S as a constraint) vs. false positives (second-order adversarial examples, found by attacking S directly). The constraint robustness curve can be interpreted similarly to an ROC curve. An effective constraint will allow many true positives (first-order adversarial examples) before many false positives (second-order adversarial examples). The model that produces a curve with a higher AUC (area under the constraint robustness curve) is better at distinguishing valid from invalid adversarial examples, and less susceptible to second-order adversarial examples.
When ε = 0, C(x, x_adv) is always true. But even when the constraint accepts all possible x_adv, some attacks may still fail. So unlike a typical ROC curve, which is bounded between 0 and 1 on both axes, the constraint robustness curve is bounded on each axis between 0 and the maximum attack success rate (at ε = 0). We suggest normalizing to bound the score between 0 and 1.
We call the resulting metric the Adjusted Constraint C-Statistic (ACCS). ACCS is defined as the area under the constraint robustness curve, normalized by the maximum first- and second-order attack success rates. Figure 2 shows an example of a constraint robustness curve for a toy problem. (The area under the green dashed curve is 0.105; after normalizing by the maximum first- and second-order attack success rates of 0.7 and 0.3, we find ACCS = 0.105 / (0.7 × 0.3) = 0.5.) There is one crucial difference between interpreting an ROC curve and a constraint robustness curve: a naive binary classifier guesses randomly, achieving as many false positives as true positives and an AUC of 0.5, whereas a naive constraint yields all of its second-order adversarial examples at the same threshold, and garners an ACCS of 0.0.
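Given the curve as a list of (second-order rate, first-order rate) points, one per threshold value, ACCS reduces to a trapezoidal area normalized by the two maximum success rates. A minimal sketch (the point-list representation is our assumption, not an interface from the paper):

```python
def accs(points):
    """Adjusted Constraint C-Statistic: area under the constraint robustness
    curve, normalized by the maximum first- and second-order success rates.

    `points` holds (second_order_rate, first_order_rate) pairs, i.e.
    (false-positive, true-positive) rates, one per threshold value.
    """
    pts = sorted(points)
    if pts[0] != (0.0, 0.0):
        pts = [(0.0, 0.0)] + pts  # the curve starts at the origin
    # Trapezoidal rule for the area under the curve.
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    max_second = max(x for x, _ in pts)  # max second-order success rate
    max_first = max(y for _, y in pts)   # max first-order success rate
    if max_second == 0.0 or max_first == 0.0:
        return 0.0
    return area / (max_second * max_first)
```

A straight line from (0, 0) to (0.3, 0.7) has area 0.105 and ACCS 0.105 / (0.7 × 0.3) = 0.5, matching the toy example in Figure 2; an ideal constraint that yields every first-order example before any second-order one scores 1.0, and a naive constraint scores 0.0.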
To create such a curve, we must devise methods for generating both first-order and second-order adversarial examples. In the following section, we propose an attack for each purpose.

Generating first and second-order adversarial examples
To calculate ACCS for each semantic similarity model S, we design two attacks: one to count first-order adversarial examples at each threshold ε, and one to count second-order adversarial examples. In Section 5, we run the attacks across a variety of models and datasets and examine their constraint robustness curves.

Generating first-order adversarial examples
To measure the number of first-order adversarial examples admitted by a semantic similarity model for a given value of ε, we can run any standard adversarial attack that uses the semantic similarity model as a constraint.
We devise a simple attack to generate adversarial examples for some classifier F . We choose untargeted classification, the goal of changing the classifier's output to any but the ground-truth output class, as the goal function. To generate perturbations, we swap words in x with their synonyms from WordNet (Miller, 1995).
Simply swapping words with synonyms from a thesaurus would frequently create ungrammatical perturbations (even if they are semantically similar to the originals). To better preserve grammaticality, we enforce an additional constraint, requiring that the log-probability of any replaced word not decrease by more than some fixed amount, according to the GPT-2 language model (Radford et al., 2019). (This is similar to the language model perplexity constraints used in the NLP attacks of Alzantot et al. (2018) and Kuleshov et al. (2018).) As an additional constraint, the attack filters potential perturbations using the semantic similarity model to ensure that S(x, x_adv) ≥ ε.
Finally, we choose greedy search with word importance ranking as our search method (Gao et al., 2018). We can use these four components (goal function, transformation, constraints, and search method) to construct an adversarial attack that generates adversarial examples for any NLP classifier (Morris et al., 2020b).
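To make the four components concrete, here is a heavily simplified, self-contained sketch of the first-order attack loop. The victim classifier, synonym table, and similarity function are all toy stand-ins; the real attack uses a trained model, WordNet synonyms, a GPT-2 fluency constraint, and USE or BERTScore.

```python
# Toy stand-ins for the attack's components.
SYNONYMS = {"good": ["fine"], "great": ["fine"], "bad": ["poor"]}

def toy_score(text: str) -> float:
    """Stand-in victim model: positive-sentiment score."""
    words = text.lower().split()
    return 0.5 + 0.25 * (sum(w in {"good", "great"} for w in words)
                         - sum(w in {"bad", "awful"} for w in words))

def predict(text: str) -> int:
    return int(toy_score(text) > 0.5)

def jaccard(a: str, b: str) -> float:
    """Stand-in semantic similarity constraint."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def greedy_wir_attack(x: str, eps: float):
    """Greedy search with word importance ranking (cf. Gao et al., 2018):
    try swaps at the most important positions first; accept the first swap
    that flips the prediction while passing the similarity constraint."""
    words = x.split()
    orig_label = predict(x)

    def importance(i):
        # Importance of word i = score change when word i is deleted.
        reduced = " ".join(words[:i] + words[i + 1:])
        return abs(toy_score(x) - toy_score(reduced))

    for i in sorted(range(len(words)), key=importance, reverse=True):
        for syn in SYNONYMS.get(words[i].lower(), []):
            x_adv = " ".join(words[:i] + [syn] + words[i + 1:])
            if jaccard(x, x_adv) >= eps and predict(x_adv) != orig_label:
                return x_adv  # success: a (toy) first-order example
    return None  # attack failed
```

On the toy model, greedy_wir_attack("the movie was good", eps=0.5) swaps the single most important word and returns "the movie was fine", which flips the stand-in prediction while keeping 0.6 word overlap; tightening eps to 0.9 makes the attack fail. The real attack applies swaps cumulatively rather than one at a time.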

Generating second-order adversarial examples
Generating adversarial examples for classification model F is a well-studied problem. But how do we generate perturbations that fool S, a semantic similarity model?
We first note what these adversarial examples might look like. Our goal is to find 'false positives', where a semantic similarity model incorrectly indicates that semantics is preserved. Specifically, we want to find pairs (x, x_adv) where S(x, x_adv) ≥ ε, even though we know x_adv does not preserve the semantics of x.
To generate such perturbations, we design a transformation with the goal of changing the meaning of an input x as much as possible (instead of preserving its meaning). At each step of the adversarial attack, instead of replacing words with their synonyms, we replace words with their antonyms, also sourced from WordNet (Miller, 1995).
Next, we need to establish a goal function that perturbations must meet to be considered adversarial examples for a given semantic similarity model. We establish the following goal function:

G(x_adv) = (S(x, x_adv) ≥ ε) ∧ (Σ_i 1[x[i] ≠ x_adv[i]] ≥ γ)

Here, x[i] represents the i-th word in sequence x, and γ represents the minimum number of words that must be changed for the attack to succeed.
With our goal function, perturbation x_adv is a valid adversarial example if it differs from x in at least γ words, but its semantic similarity to x is still at least ε. When γ words are substituted with antonyms, semantics is almost certainly not preserved, and our certainty only grows with γ. In this case, the semantic similarity model should produce a value smaller than ε.
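Since the attack only substitutes words in place, this goal is straightforward to check. A sketch, with a word-overlap stand-in for the similarity model S:

```python
def words_changed(x: str, x_adv: str) -> int:
    """Count positions where the two word sequences differ (the attack only
    substitutes words in place, so the sequences have equal length)."""
    xs, ys = x.split(), x_adv.split()
    assert len(xs) == len(ys)
    return sum(a != b for a, b in zip(xs, ys))

def is_second_order_example(x, x_adv, S, eps, gamma):
    """The goal above: at least gamma substitutions, yet S(x, x_adv) >= eps."""
    return words_changed(x, x_adv) >= gamma and S(x, x_adv) >= eps

def overlap(x, x_adv):
    """Stand-in similarity model, for illustration only."""
    a, b = set(x.split()), set(x_adv.split())
    return len(a & b) / len(a | b)
```

For example, "the food was bad and expensive" differs from "the food was good and cheap" in two antonym positions yet keeps 0.5 word overlap, so the stand-in flags it as a second-order example at eps = 0.4, γ = 2.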
As in Section 4.1, we apply a second constraint, using GPT-2 to ensure that substituted antonyms are likely in their context. For the search method, we use beam search, as it does a better job of finding adversarial examples when the set of valid perturbations is sparse (Ebrahimi et al., 2017).
A sample output of this attack (where γ = 2) is shown in Figure 1.

Attack Prototypes
We implemented our adversarial attacks using the TextAttack adversarial attack framework (Morris et al., 2020b). Figure 4 shows the attack prototype for each attack, as constructed in TextAttack.
As noted in the previous section, each attack used the GPT-2 language model to preserve grammaticality during word replacements; we disallowed replacements whose log-probability decreased from the original word's by 2.0 or more. The other constraints in the attack prototype disallow multiple modifications of the same word, stopword substitutions, and, in the case of entailment datasets, edits to the premise.[2]

Semantic similarity models
We tested two semantic similarity models as S:
• The Universal Sentence Encoder (USE) (Cer et al., 2018), a model trained to encode sentences into fixed-length vectors. Semantic similarity between x and x_adv is measured as the cosine similarity of their encodings. This is consistent with the NLP attack literature (Li et al., 2018; Jin et al., 2019; Garg and Ramakrishnan, 2020).
• BERTScore (Zhang et al., 2019), an automatic evaluation metric for text generation. BERTScore computes a similarity score for each token in the candidate sentence against each token in the reference sentence, using the contextual embedding of each token. In human studies, BERTScore correlates with human judgments better than other metrics (including sentence encodings) for evaluating machine translations. It also outperforms sentence encodings on PAWS (Yang et al., 2019), an adversarial paraphrase dataset whose inputs have a similar format to NLP adversarial examples.
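BERTScore's token-matching idea can be illustrated with hand-made two-dimensional "embeddings" in place of BERT's contextual vectors; both the vectors and the recall-only form below are our simplifications, not the metric's actual implementation:

```python
import math

# Hand-made 2-d "embeddings"; real BERTScore uses BERT's contextual embeddings.
TOY_EMB = {
    "movie": (1.0, 0.1), "film": (0.95, 0.15),
    "good": (0.1, 1.0), "great": (0.15, 0.95), "bad": (0.1, -1.0),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def toy_bertscore(reference: str, candidate: str) -> float:
    """Greedy matching: each reference token takes its best-matching
    candidate token; the per-token similarities are then averaged."""
    ref = [TOY_EMB[w] for w in reference.split()]
    cand = [TOY_EMB[w] for w in candidate.split()]
    return sum(max(cos(r, c) for c in cand) for r in ref) / len(ref)
```

Here toy_bertscore("movie good", "film great") is high, since every token finds a close match, while toy_bertscore("movie good", "film bad") is pulled down by the unmatched sentiment word.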

Victim Classifiers
To create constraint robustness curves, we ran each attack (first- and second-order) while varying ε from 0.75 to 1.0 in increments of 0.01. For the SST-2 dataset, which has some very short examples, we varied ε from 0.5 to 1.0 in increments of 0.02.

[2] It is standard for NLP attacks on entailment models to only edit the hypothesis (Alzantot et al., 2018; Zhao et al., 2017; Jin et al., 2019).


Results
We sampled 100 examples from the test set of each dataset for each attack. We repeated each attack twice: once using BERTScore and once using the Universal Sentence Encoder as the constraint. In total, we ran 300 attacks. Table 1 shows results for each model and dataset. Figure 5 shows the constraint robustness curve for each scenario.
Surprisingly, the Universal Sentence Encoder achieved a higher ACCS than BERTScore across all nine scenarios. This appears to contradict the claim of Zhang et al. (2019) that "BERTScore is more robust to challenging examples when compared to existing metrics".
Additionally, at any given point, first-order adversarial examples are found over twice as often as second-order adversarial examples. This indicates that most adversarial examples found in NLP attacks may be first-order. This corroborates human studies from Morris et al. (2020a), which showed that humans rate the adversarial examples from the attacks of Alzantot et al. (2018) and Jin et al. (2019) as preserving semantics only around 65% of the time.[3]

[3] The Rotten Tomatoes dataset is sometimes called the Movie Review (MR) dataset.

[Figure: Example attack outputs. Original: "however it may please those who love movies that blare with pop songs , young science fiction fans will stomp away in disgust ." First-order adversarial example (USE cosine similarity: 0.96): "still it may please those who love movies that blare with pop songs , young science fiction fans will stomp away in horror ." Second-order adversarial example: "however it may displease those who hate movies that blare with pop songs , old science fiction fans will stomp away in disgust ."]

Discussion
Sentence length, S, and ε. As input x grows in length, a single word swap has an increasingly small impact on S(x, x_adv). Some NLP attacks that use sentence encoders as a constraint have combated this problem by measuring the sentence encodings within a fixed-length window of words around each substitution. For example, Jin et al. (2019) consider a window of 15 words around each substitution. We chose instead to encode the entire input, as both the Universal Sentence Encoder and BERTScore were trained on full inputs.
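For reference, the windowed alternative can be sketched as follows; a window of 15 words corresponds to 7 words on either side of the substituted position (the function names are ours):

```python
def window(words, idx, half_width=7):
    """Slice of `words` within `half_width` tokens of position idx,
    i.e. a window of up to 2 * half_width + 1 words."""
    return words[max(0, idx - half_width): idx + half_width + 1]

def windowed_similarity(S, x, x_adv, idx, half_width=7):
    """Score only the text near the substituted word, as in Jin et al. (2019),
    rather than encoding the full input."""
    return S(" ".join(window(x.split(), idx, half_width)),
             " ".join(window(x_adv.split(), idx, half_width)))
```

Note the trade-off this illustrates: a substitution far outside the window has no effect on the windowed score at all, while the full-input encoding we use always sees it (if only faintly).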
Applications beyond NLP.

Related Work
We can categorize adversarial attacks in NLP based on their chosen definition of imperceptibility: generally, adversarial attacks in NLP aim for either visual or semantic imperceptibility.
Visual imperceptibility. These adversarial example generation techniques focus on character-level modifications that a fast-reading human may not notice. HotFlip (Ebrahimi et al., 2017) crafts adversarial examples using word-level substitutions, but uniquely chooses between character-level perturbations (exploiting imperceptibility in appearance) and word-level synonym swaps (exploiting imperceptibility in meaning). Although there have been many adversarial attacks proposed on NLP models (Zhang et al., 2020a), surprisingly few constraints have been explored. Alzantot et al. (2018)

Conclusion
Work in generating adversarial examples in NLP has relied on outside models to evaluate imperceptibility. While useful, this inadvertently increases the size of the attack space. We propose methods for analyzing constraints' susceptibility to second-order adversarial examples, including the constraint robustness curve and its associated metric, ACCS. This requires us to design an attack specific to semantic similarity models. We demonstrate these methods with a comparison of two models used in constraints, the Universal Sentence Encoder and BERTScore. We would especially like to see future research examine constraint robustness curves across more constraints and different attack designs. We hope that future researchers can use our method when choosing constraints for NLP adversarial example generation.

A.1 Experimental Details
Setup. All experiments were run using the TextAttack framework in Jupyter notebooks running on Google Colab with Tesla K80 GPUs.[4]

A.2 Constraints tested on paraphrase datasets
Before running adversarial attacks on USE and BERTScore, we compared their effectiveness on common paraphrase identification tasks. USE and BERTScore each assign a semantic similarity score to each (original text, perturbed text) pair. A hard threshold determines whether a given score indicates a valid adversarial example: above this threshold, the perturbed text is assumed to have preserved the semantics of the original input; below it, semantics is not preserved, and the perturbation is invalid. Li et al. (2018) define validity as a cosine similarity of 0.8 or higher, as measured by USE. Jin et al. (2019) and Garg and Ramakrishnan (2020) choose a lower USE threshold of 0.5.

[4] Google Colab is a great resource, providing free, easy access to high-powered GPUs, but its timeout constraints can be frustrating and unpredictable. By the end of the project, this author shelled out $9.99 for the high-octane Google Colab Pro.
Current state-of-the-art attacks in NLP generate perturbations one word at a time: generally by swapping out a word with its neighbors in an embedding space (Alzantot et al., 2018) or with synonyms provided by a thesaurus (Ren et al., 2019). Consequently, their adversarial perturbations share the lexical structure of the original inputs, with some words swapped out for synonyms. This implies that BERTScore, which scores inputs via token-by-token matching, would be a better fit for ensuring semantic preservation during these adversarial attacks, and less susceptible to second-order adversarial examples.
Our initial question was how USE and BERTScore compare on common datasets for paraphrase identification. When used as constraints on adversarial attacks, constraints that can more correctly distinguish paraphrases from non-paraphrases should be less vulnerable to second-order adversarial examples.
In the following subsections, we compare USE and BERTScore on two paraphrase datasets, QQP and PAWS, and then on Adversarial SNLI, a custom dataset designed to resemble the format of NLP adversarial examples on the SNLI entailment dataset.

A.2.1 Performance on paraphrase identification
We evaluate USE and BERTScore on two common paraphrase datasets:
• The QQP (Quora Question Pairs) dataset, which contains 400k real-world pairs of paraphrases and non-paraphrases collected during Quora question disambiguation.
• PAWS (Yang et al., 2019), an adversarial paraphrase dataset whose inputs have a similar format to NLP adversarial examples.

Table 3: AUC scores for BERTScore and the Universal Sentence Encoder on QQP, PAWS, and our Adversarial SNLI dataset. BERTScore shows an advantage on PAWS and Adversarial SNLI, indicating that it is a more robust choice for constraining semantics during NLP adversarial example generation.

The TextAttack library (Morris et al., 2020b) is used to load pretrained USE and BERTScore models and to run augmentation and adversarial attack experiments. Figure 6 shows the distributions of scores from each model (USE, BERTScore) on each dataset (QQP, PAWS). Both models exhibit some ability to distinguish paraphrases from non-paraphrases on QQP, but produce very similar scores for paraphrases and non-paraphrases on PAWS (with the non-paraphrases having slightly lower scores).
We then used these scores to plot ROC curves for each dataset, shown in Figure 7; Table 3 reports the AUC for each model and dataset. Surprisingly, USE (AUC 0.827) slightly outperforms BERTScore (AUC 0.764) on QQP; however, BERTScore (AUC 0.662) outperforms USE (AUC 0.608) on PAWS. This corroborates findings from Zhang et al. (2019) that BERTScore is superior to sentence-encoding methods on datasets with high lexical overlap.
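The AUC values above can be computed without plotting: AUC equals the probability that a randomly chosen paraphrase pair receives a higher score than a randomly chosen non-paraphrase pair, with ties counting half. A sketch of this pairwise formulation:

```python
def roc_auc(paraphrase_scores, non_paraphrase_scores):
    """AUC via its probabilistic interpretation (the Mann-Whitney U
    statistic): P(score of a random positive > score of a random negative),
    with ties counted as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in paraphrase_scores
               for n in non_paraphrase_scores)
    return wins / (len(paraphrase_scores) * len(non_paraphrase_scores))
```

A scorer that ranks every paraphrase above every non-paraphrase gets 1.0; one that cannot separate them at all gets 0.5. (This quadratic-time formulation is for illustration; sorting-based implementations are standard.)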

A.2.2 Performance on Adversarial SNLI
BERTScore exhibited higher performance than USE on PAWS, a dataset of adversarially crafted paraphrases. However, USE outperformed BERTScore on QQP, a more traditional paraphrase task.
We set out to compare the two constraints in a scenario more similar to a typical NLP adversarial attack. To do this, we crafted a dataset of perturbations that might appear during the course of an adversarial attack.
We crafted our dataset of adversarial perturbations starting with examples from the SNLI dataset. We chose SNLI because it is commonly used for testing NLP adversarial attack systems (Zhang et al., 2020b), and because second-order adversarial examples are particularly dangerous in the case of entailment, where a slight change in meaning can cause a shift in ground-truth output. However, this process could be emulated to test out constraint options before running an adversarial attack on any NLP dataset.
We sampled 1,000 (premise, hypothesis) pairs from the SNLI dataset and discarded each premise. For each hypothesis, we created ten perturbations: for each of five proportions (10%, 20%, 30%, 40%, 50%) of the original words, one perturbation substituting synonyms and one substituting antonyms. This produced a dataset of 10,000 examples. We sourced synonyms and antonyms from WordNet (Miller, 1995).
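The construction can be sketched as follows; the tiny synonym and antonym tables are stand-ins for WordNet lookups, and the exact sampling details are our assumptions:

```python
import math
import random

# Stand-in synonym/antonym tables; the paper sources these from WordNet.
SYN = {"big": "large", "happy": "glad", "fast": "quick", "cold": "chilly"}
ANT = {"big": "small", "happy": "sad", "fast": "slow", "cold": "hot"}

def substitute(hypothesis, table, proportion, rng):
    """Replace about `proportion` of the words using `table`, where possible."""
    words = hypothesis.split()
    replaceable = [i for i, w in enumerate(words) if w in table]
    k = min(len(replaceable), max(1, math.ceil(proportion * len(words))))
    for i in rng.sample(replaceable, k):
        words[i] = table[words[i]]
    return " ".join(words)

def perturbation_set(hypothesis, seed=0):
    """Ten perturbations per hypothesis: a synonym and an antonym version at
    each of 10%-50% of the original words, mirroring Adversarial SNLI."""
    rng = random.Random(seed)
    return [substitute(hypothesis, table, p, rng)
            for p in (0.1, 0.2, 0.3, 0.4, 0.5)
            for table in (SYN, ANT)]
```

Each hypothesis yields exactly ten perturbations, every one differing from the original in at least one word (when any replaceable word is present).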
BERTScore achieved a higher AUC on the two adversarial datasets, PAWS and Adversarial SNLI. This is a surprising result since BERTScore turned out to be so much less effective than USE as a constraint on adversarial examples (see Section 5). We hypothesize that BERTScore is better at measuring semantic changes of 1-2 words, while USE is superior as the perturbation size grows beyond 2 words.
We can also see that, across datasets, BERTScore assigns scores that are generally lower; a threshold of ε = 0.8 on USE cosine similarity may correspond to a lower BERTScore threshold, for example, ε = 0.5.

[Figure 7: ROC curves for BERTScore and the Universal Sentence Encoder (USE) on the QQP and PAWS datasets. USE outperforms BERTScore on QQP, but BERTScore is better on PAWS.]