Contextualized Perturbation for Textual Adversarial Attack

Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, that allow for generating outputs of varied lengths. CLARE can flexibly combine these perturbations and apply them at any position in the inputs, and is thus able to attack the victim model more effectively with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality.


Introduction
Adversarial example generation for natural language processing (NLP) tasks aims to perturb input text to trigger errors in machine learning models, while keeping the output close to the original. Besides exposing system vulnerabilities and helping improve their robustness and security (Zhao et al., 2018; Wallace et al., 2019; Cheng et al., 2019; Jia et al., 2019, inter alia), adversarial examples are also used to analyze and interpret the models' decisions (Jia and Liang, 2017; Ribeiro et al., 2018).
Generating adversarial examples for NLP tasks can be challenging, in part due to the discrete nature of natural language text. Most recent efforts have explored heuristic rules, such as replacing tokens with their synonyms (Samanta and Mehta, 2017; Liang et al., 2019; Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020, inter alia). Despite some empirical success, rule-based methods are agnostic to context, limiting their ability to produce natural, fluent, and grammatical outputs (Wang et al., 2019b; Kurita et al., 2020, inter alia).

[Figure 1: Illustration of CLARE. Through a mask-then-infill procedure, the model generates the adversarial text with three contextualized perturbations: Replace, Insert and Merge. A mask is indicated by "[MASK]". The degree of fade corresponds to the (decreasing) priority of the infill tokens.]
This work presents CLARE, a ContextuaLized AdversaRial Example generation model for text. CLARE perturbs the input with a mask-then-infill procedure: it first detects the vulnerabilities of a model and deploys masks in the inputs to indicate missing text, then plugs in an alternative using a pretrained masked language model (e.g., RoBERTa; Liu et al., 2019). CLARE features three contextualized perturbations: Replace, Insert and Merge, which respectively replace a token, insert a new one, and merge a bigram (Figure 1). As a result, it can generate outputs of varied lengths, in contrast to token-replacement-based methods that are limited to outputs of the same length as the inputs (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). Further, CLARE searches over a wider range of attack strategies, and is thus able to attack the victim model more effectively with fewer edits. Building on a masked language model, CLARE maximally preserves the textual similarity, fluency, and grammaticality of the outputs.
We evaluate CLARE on text classification, natural language inference, and sentence paraphrase tasks, by attacking finetuned BERT models (Devlin et al., 2019). Extensive experiments and human evaluation results show that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency, and grammaticality, and strikes a better balance between attack success rate and preserving input-output similarity. Our analysis further suggests that CLARE can be used to improve the robustness of downstream models, and to improve their accuracy when the available training data is limited. We release our code and models at https://github.com/cookielee77/CLARE.

CLARE
At a high level, CLARE applies a sequence of contextualized perturbation actions to the input. Each can be seen as a local mask-then-infill procedure: it first applies a mask to the input around a given position, and then fills it in using a pretrained masked language model (§2.1). To produce the output, CLARE scores the actions, ranks them in descending order, and iteratively applies them to the input (§2.2). We begin with a brief review of the background and the necessary notation.
Background. Adversarial example generation centers around a victim model f, which we assume is a text classifier. We focus on the black-box setting, allowing access to f's outputs but not its configurations such as parameters. Given an input sequence x = x_1 x_2 ... x_n and its label y (assume f(x) = y), an adversarial example x' is supposed to modify x to trigger an error in the victim model: f(x') ≠ f(x). At the same time, textual modifications should be minimal, such that x' is close to x and the human predictions on x' stay the same.¹ This is achieved by requiring the similarity between x' and x to be larger than a threshold ℓ: sim(x', x) > ℓ. A common choice of sim(·, ·) is to encode sentences using neural networks, and calculate their cosine similarity in the embedding space (Jin et al., 2020).

¹ In computer vision applications, minor perturbations to continuous pixels can be barely perceptible to humans, thus it can be hard for one to distinguish x and x' (Goodfellow et al., 2015). It is not the case for text, however, since changes to the discrete tokens are more likely to be noticed by humans.
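To make the similarity constraint concrete, here is a minimal sketch that scores a candidate adversary against the original input with sentence embeddings; the sentence-transformers encoder and the threshold value are illustrative stand-ins (our experiments use the Universal Sentence Encoder; see Appendix A.1).

```python
# Minimal sketch of the similarity constraint sim(x', x) > l.
# The encoder and threshold below are illustrative choices, not the
# paper's exact setup (which uses the Universal Sentence Encoder).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sim(a: str, b: str) -> float:
    """Cosine similarity between the two sentence embeddings."""
    ea, eb = encoder.encode([a, b])
    return float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))

def within_threshold(x: str, x_adv: str, l: float = 0.7) -> bool:
    # x_adv is only admissible if it stays close to the original x.
    return sim(x_adv, x) > l
```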

Masking and Contextualized Infilling
At a given position of the input sequence, CLARE can execute three perturbation actions: Replace, Insert, and Merge, which we introduce in this section. These apply masks at the given position with different strategies, and then fill in the missing text based on the unmasked context.
Replace: A Replace action substitutes the token at a given position i with an alternative (e.g., changing "fantastic" to "amazing" in "The movie is fantastic."). It first replaces x_i with a mask, yielding the masked sequence x̃ = x_1 ... x_{i-1} [MASK] x_{i+1} ... x_n, and then selects a token z from a candidate set Z to fill in:

replace(x, i) = x_1 ... x_{i-1} z x_{i+1} ... x_n.

For clarity, we denote replace(x, i) by x_z. To produce an adversarial example,
• z should fit into the unmasked context;
• x_z should be similar to x;
• x_z should trigger an error in f.
These can be achieved by selecting a z such that
• z receives a high probability from a masked language model: p_MLM(z | x̃) > k;
• x_z is similar to x: sim(x, x_z) > ℓ;
• f predicts a low probability for the gold label given x_z, i.e., p_f(y | x_z) is small.
p_MLM denotes a pretrained masked language model (e.g., RoBERTa; Liu et al., 2019). Using higher k, ℓ thresholds produces outputs that are more fluent and closer to the original; however, this can undermine the success rate of the attack. We choose k, ℓ to trade off between these two aspects.² The first two requirements can be met by the construction of the candidate set

Z = { z ∈ V : p_MLM(z | x̃) > k, sim(x, x_z) > ℓ },

where V is the vocabulary of the masked language model. To meet the third, we select from Z the token that, if filled in, will cause the most "confusion" to f:

z = argmin_{z' ∈ Z} p_f(y | x_{z'}).    (1)
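As a concrete illustration of this mask-then-infill step, the sketch below queries a pretrained RoBERTa via Hugging Face transformers for candidate fill-ins; the threshold k, the top-candidate cutoff, and the whitespace tokenization are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the Replace action's mask-then-infill step with RoBERTa.
# The probability threshold k, the top-candidate cutoff, and whitespace
# tokenization are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def replace_candidates(tokens, i, k=0.005, top=50):
    """Mask out tokens[i] and return fill-ins z with p_MLM(z | x~) > k."""
    masked = tokens[:i] + [tok.mask_token] + tokens[i + 1:]
    enc = tok(" ".join(masked), return_tensors="pt")
    pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        probs = mlm(**enc).logits[0, pos].softmax(-1)
    vals, idxs = probs.topk(top)
    # Candidates must clear the masked-LM threshold; they are further
    # filtered by sim(x, x_z) > l before querying the victim model f.
    return [(tok.decode([j]).strip(), v)
            for v, j in zip(vals.tolist(), idxs.tolist()) if v > k]
```

The surviving candidates would then be checked against the similarity constraint, and the final z chosen to minimize p_f(y | x_z), per Eq. (1).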

The Insert and Merge actions differ from Replace in terms of masking strategies. The alternative token z is selected analogously to that in a Replace action.

Insert: This aims to add extra information to the input (e.g., changing "I recommend ..." to "I highly recommend ..."). It inserts a mask after x_i and then fills it in. Slightly overloading the notation,

insert(x, i) = x_1 ... x_i z x_{i+1} ... x_n.

This increases the sequence length by 1.

Merge: This masks out a bigram x_i x_{i+1} with a single mask and then fills it in, reducing the sequence length by 1:

merge(x, i) = x_1 ... x_{i-1} z x_{i+2} ... x_n.

z can be the same as one of the masked tokens (e.g., masking out "New York" and then filling in "York"). In this case, the action can be seen as deleting a token from the input.
For Insert and Merge, z is chosen in the same manner as in a Replace action.³ In sum, at each position i of an input sequence, CLARE first (i) replaces x_i with a mask, (ii) inserts a mask after x_i, or (iii) merges x_i x_{i+1} into a mask. Then a set of candidate tokens is constructed with a masked language model and a textual similarity function; the token minimizing the gold label's probability is chosen as the alternative token. The combination of these three actions enables conversion between any two sequences.
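The three masking strategies differ only in how they slice the token sequence; the short sketch below, on a whitespace-tokenized input with a placeholder mask token, makes the index arithmetic explicit.

```python
# The three masking strategies on a whitespace-tokenized input; MASK
# stands in for the masked LM's mask token. Each masked sequence then
# goes through the same contextualized infilling step.
MASK = "<mask>"

def mask_replace(tokens, i):   # x1 .. x_{i-1} [MASK] x_{i+1} .. xn
    return tokens[:i] + [MASK] + tokens[i + 1:]

def mask_insert(tokens, i):    # x1 .. x_i [MASK] x_{i+1} .. xn
    return tokens[:i + 1] + [MASK] + tokens[i + 1:]

def mask_merge(tokens, i):     # x1 .. x_{i-1} [MASK] x_{i+2} .. xn
    return tokens[:i] + [MASK] + tokens[i + 2:]

tokens = "the movie is fantastic".split()
assert mask_replace(tokens, 3) == ["the", "movie", "is", MASK]
assert mask_insert(tokens, 2) == ["the", "movie", "is", MASK, "fantastic"]
assert mask_merge(tokens, 2) == ["the", "movie", MASK]
```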
CLARE first constructs the local actions for all positions in parallel, i.e., the actions at position i do not affect those at other positions. Then, to produce the adversarial example, CLARE gathers the local actions and selects an order to execute them.

Sequentially Applying the Perturbations
Given an input pair (x, y), let n denote the length of x. CLARE chooses from 3n actions to produce the output: 3 actions for each position, assuming the candidate token sets are not empty.³ We aim to generate an adversarial example with minimum modifications to the input. To achieve this, we iteratively apply the actions, first selecting those minimizing the probability of outputting the gold label y from f. Each action is associated with a score, measuring how likely it can "confuse" f: denote by a(x) the output of applying action a to x. The score is then the negative probability of predicting the gold label from f, using a(x) as the input:

score(a) = −p_f(y | a(x)).

Only one of the three actions can be applied at each position, and we select the one with the highest score. This constraint aims to avoid multiple modifications around the same position, e.g., merging "New York" into "Seattle" and then replacing it with "Boston".

³ A perturbation will not be considered if its candidate token set is empty.
Actions are iteratively applied to the input until an adversarial example is found or a limit of T actions is reached. Each step selects the highest-scoring action among the remaining ones. Algorithm 1 summarizes the above procedure.⁴
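For concreteness, here is a condensed sketch of this greedy loop in the spirit of Algorithm 1; victim_prob, apply_action, and predict are assumed interfaces to p_f, the mask-then-infill step, and the victim's label prediction, and the position bookkeeping required after Insert and Merge is elided.

```python
# Condensed sketch of the greedy attack loop (cf. Algorithm 1).
# `actions` holds the single best local action per position; the
# interfaces below are assumptions, and index shifts caused by
# Insert/Merge are elided for brevity.
def clare_attack(x, y, actions, victim_prob, apply_action, predict, T=10):
    # score(a) = -p_f(y | a(x)): lower gold-label probability ranks first.
    ranked = sorted(actions, key=lambda a: victim_prob(y, apply_action(a, x)))
    for a in ranked[:T]:
        x = apply_action(a, x)   # apply the next highest-scoring action
        if predict(x) != y:      # the victim's prediction flipped
            return x             # adversarial example found
    return None                  # give up after T actions
```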

Discussion.
A key technique of CLARE is the local mask-then-infill perturbation. Compared with existing context-agnostic replacement approaches (Alzantot et al., 2018; Jin et al., 2020; Ren et al., 2019, inter alia), contextualized infilling produces more fluent and grammatical outputs.
Generating adversarial examples with masked language models is also explored by the concurrent works BERTAttack (Li et al., 2020) and BAE (Garg and Ramakrishnan, 2020).⁵
• BERTAttack only replaces tokens and thus can only produce outputs of the same lengths as the inputs. This is analogous to a CLARE model with the Replace action only. BAE entangles replacing and inserting tokens: it inserts only at positions neighboring a replaced token, limiting its attacking capability. Departing from both, CLARE uses three different perturbations (Replace, Insert and Merge), each allowing efficient attacks against any position of the input, and can produce outputs of varied lengths. As we will show in the experiments (§3.3), CLARE outperforms both these methods.
• When selecting the attack positions, neither BERTAttack nor BAE takes into account the tokens to be infilled, whereas CLARE does. This results in better adversarial attack performance according to our ablation study (§4.1).
• CLARE demonstrates the advantage of using RoBERTa over BERT, which was used in the concurrent works (§4.1).

Experiments
We evaluate CLARE on text classification, natural language inference, and sentence paraphrase tasks. We begin by describing the implementation details of CLARE and the baselines (§3.1). §3.2 introduces the experimental datasets and the evaluation metrics; the results are summarized in §3.3.

Setup
Baselines. We compare CLARE against strong recent attack baselines, including TextFooler (Jin et al., 2020), BERTAttack (Li et al., 2020), and BAE (Garg and Ramakrishnan, 2020). We use the open-source implementations of the above baselines provided by the authors. More details are included in Appendix §A.1.

Datasets and Evaluation
Datasets. We evaluate CLARE with the following datasets:
• Yelp Reviews (Zhang et al., 2015): a binary sentiment classification dataset based on restaurant reviews.
• AG News (Zhang et al., 2015): a news topic classification dataset covering four categories.
• MNLI (Williams et al., 2018): a natural language inference dataset of premise-hypothesis pairs.
• QNLI (Rajpurkar et al., 2016): a sentence-pair classification dataset derived from SQuAD. The task is to determine whether the context contains the answer to a question. It is mainly based on English Wikipedia articles.
Table 1 summarizes some statistics of the datasets. In addition to the above four datasets, we experiment with the DBpedia ontology dataset (Zhang et al., 2015), the Stanford Sentiment Treebank (Socher et al., 2013), the Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005), and Quora Question Pairs from the GLUE benchmark. The results on these datasets are summarized in Appendix A.2.
Following previous practice (Alzantot et al., 2018), we fine-tune CLARE on the training data, and evaluate on 1,000 randomly sampled test instances of length ≤ 100. In the sentence-pair tasks (e.g., MNLI, QNLI), we attack the longer sentence, excluding the tokens that appear in both.
Evaluation metrics. We follow previous works (Jin et al., 2020; Morris et al., 2020a) and evaluate the models with automatic metrics including attack success rate, textual similarity (Sim), perplexity (fluency), grammaticality, and the number of modifications.

Figure 2 compares the trade-off curves between attack success rate and textual similarity. We tune the thresholds for constructing the candidate token sets, and plot textual similarity against the attack success rate. CLARE strikes the best balance, showing a clear advantage in success rate with the least drop in similarity. We observe similar trends for the trade-off between attack success rate and perplexity.

Results
Human evaluation. We further conduct a human evaluation on the AG News dataset. We randomly sample 300 instances that both CLARE and TextFooler successfully attack. For each input, we pair the adversarial examples from the two models, and present them to crowd-sourced judges along with the original input and the gold label. We ask them which one they prefer, with a neutral option, in terms of (1) having a meaning that is closer to the original input (similarity), and (2) being more fluent and grammatical (fluency and grammaticality). Additionally, we ask the judges to annotate the adversarial examples, and compare their annotations against the gold labels (label consistency). We collect 5 responses for each pair on every evaluated aspect. Further details are in Appendix A.3. As shown in Table 3, CLARE has a significant advantage over TextFooler: in terms of similarity, 56% of responses prefer CLARE, while 16% prefer TextFooler. The trend is similar for fluency and grammaticality (42% vs. 9%). This observation is consistent with the results from automatic metrics. On label consistency, CLARE slightly underperforms TextFooler, at 68% with a 95% confidence interval (CI) of (66%, 70%), versus 70% with a 95% CI of (68%, 73%). We attribute this to an inherent overlap of some categories in the AG News dataset, e.g., Science & Technology and Business, as evidenced by a 71% label consistency for the original inputs.
Closing this section, we refer readers to Appendix A.4 for qualitative samples of adversarial examples generated by different models.

Analysis
This section first conducts an ablation study (§4.1). We then show in §4.2 that CLARE tends to attack nouns and noun phrases, and explore in §4.3 its potential to improve downstream models' robustness and accuracy.

Ablation Study
We ablate each component of CLARE to study its effectiveness. We evaluate on the 1,000 randomly selected AG News instances (§3.2). The results are summarized in Table 5.
We first investigate the performance of the three perturbations when applied individually. Among the three editing strategies, INSERTONLY achieves the best performance, with REPLACEONLY coming a close second. MERGEONLY underperforms the other two, partly because its attacks are restricted to bigram noun phrases (§3.1). Combining all three perturbations, CLARE achieves the best performance with the fewest modifications.

AG (Sci&Tech)
Sprint Corp. is in talks with Qualcomm Inc. about using a network the chipmaker is building to deliver live television to Sprint mobile phone customers.

TextFooler (Business)
Sprint Corps. is in talks with Qualcomm Inc. about operated a network the chipmaker is consolidation to doing viva television to Sprint mobile phone customers.

CLARE (Business)
Sprint Corp. is in talks with Qualcomm Inc. about using a network Qualcomm is building to deliver cable television to Sprint mobile phone customers.

MNLI (Neutral)
Premise: Let me try it. She began snapping her fingers and saying the word eagerly, but nothing happened. Hypothesis: She became frustrated when the spell didn't work.

TextFooler (Contradiction)
Premise: Authorisation me attempting it. She triggered flapping her pinkies and said the word eagerly, but nothing arisen. Hypothesis: She became frustrated when the spell didn't work.

CLARE (Contradiction)
Premise: Let me try it. She began snapping her fingers and saying the word eagerly, but nothing unexpected happened. Hypothesis: She became frustrated when the spell didn't work.

To examine the efficiency of the attack order, we compare REPLACEONLY against BERTAttack. Notably, REPLACEONLY outperforms BERTAttack across the board. This is presumably because BERTAttack does not take into account the tokens to be infilled when selecting the attack positions.
We now turn to the two constraints imposed when constructing the candidate token set. Perhaps not surprisingly, ablating the textual similarity constraint (w/o sim > ℓ) decreases textual similarity, but improves the other aspects. Ablating the masked language model constraint (w/o p_MLM > k) yields a better success rate, but much worse perplexity, grammaticality, and textual similarity.
Finally, we compare CLARE implemented with different masked language models. Table 6 summarizes the results. Overall, distilled RoBERTa achieves the fastest speed without losing performance. Since the victim model is based on BERT, we conjecture that it is less efficient to attack a model using its own information.

Perturbations by Part-of-speech Tags
In this section, we break down the adversarial attacks by part-of-speech (POS) tags on the AG News dataset. We find that most adversarial attacks are applied to nouns or noun phrases. Presumably, in many topic classification datasets, the prediction relies heavily on characteristic nouns and noun phrases. As shown in Table 7, 64% of the Replace actions are applied to nouns. Insert actions tend to insert tokens into noun phrase bigrams: two of the most frequent POS bigrams are noun phrases. In fact, around 48% of the Insert actions are applied to noun phrases. This also justifies our choice of only applying Merge to noun phrases.
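This breakdown can be reproduced with a standard POS tagger over the attacked positions; the sketch below assumes NLTK's perceptron tagger and a simple (tokens, position) edit log, both illustrative choices.

```python
# Sketch of the POS breakdown of attacks. The (tokens, position) edit
# log format is an assumption; any POS tagger would do (NLTK's requires
# nltk.download("averaged_perceptron_tagger") once).
from collections import Counter
import nltk

def pos_breakdown(replace_edits):
    """replace_edits: list of (tokens, i) pairs, one per Replace action."""
    counts = Counter()
    for tokens, i in replace_edits:
        tags = nltk.pos_tag(tokens)   # [(word, tag), ...]
        counts[tags[i][1]] += 1       # tag of the attacked token
    return counts
```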

Adversarial Training
This section explores CLARE's potential in improving downstream models' accuracy and robustness.

[Caption fragment from the POS analysis (Table 7): (a, b) denotes inserting a token between a and b; a-b denotes merging a and b into a token. Bottom: an AG News sample, where CLARE perturbs the token "cybersecurity"; TextFooler is unable to attack this token since it is out of its vocabulary.]
As shown in Table 8, when the full training data is available, adversarial training slightly decreases the test accuracy, by 0.2% and 0.5% respectively. This aligns with previous observations (Jia et al., 2019). Interestingly, in the low-data scenario with adversarial training, the BERT-based classifier shows no accuracy drop, and TextCNN achieves a 2.0% absolute improvement. This suggests that a model with less capacity can benefit more from silver data.
Does adversarial training help the models defend against adversarial attacks? To evaluate this, we use CLARE to attack classifiers trained with and without adversarial examples.⁹ A higher success rate and fewer modifications indicate that a victim classifier is more vulnerable to adversarial attacks. As shown in Table 8, in 3 out of the 4 cases, adversarial training helps to decrease the attack success rate by more than 10.3%, and to increase the number of modifications needed by more than 0.8. The only exception is the TextCNN model trained with 10% of the data. A possible reason is that it is trained with little data and thus generalizes less well.

⁹ In preliminary experiments, we found that it is more difficult to use other models to attack a victim model trained with the adversarial examples generated by CLARE than to use CLARE itself.
These results suggest that CLARE can be used to improve downstream models' robustness, with a negligible accuracy drop.
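Concretely, the adversarial training used here amounts to data augmentation; the sketch below assumes an attack(x, y) callable (such as the loop sketched in §2.2) that returns None on failure, with the downstream trainer left abstract.

```python
# Sketch of adversarial training by data augmentation: attack the
# training set and retrain on gold plus adversarial ("silver") pairs.
# `attack` and the downstream trainer are assumed interfaces.
def augment_with_adversaries(train_set, attack):
    silver = []
    for x, y in train_set:
        x_adv = attack(x, y)           # None when the attack fails
        if x_adv is not None:
            silver.append((x_adv, y))  # adversaries keep the gold label
    return train_set + silver

# classifier = train(augment_with_adversaries(train_data, clare_attack_fn))
```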

Related Work
Textual adversarial attack. An increasing amount of effort is being devoted to generating better textual adversarial examples with various approaches (Zhang et al., 2020a). Recent word-level models explore synonym substitution rules to enhance semantic meaning preservation (Alzantot et al., 2018; Jin et al., 2020; Ren et al., 2019; Zhang et al., 2019; Zang et al., 2020, inter alia). Our work differs in that CLARE uses three contextualized perturbations that produce more fluent and grammatical outputs.
Text generation with BERT. Generation with masked language models has been widely studied in various natural language tasks, ranging from lexical substitution (

Conclusion
We have presented CLARE, a contextualized adversarial example generation model for text. It uses contextualized knowledge from pretrained masked language models, and can generate adversarial examples that are natural, fluent and grammatical. With the three contextualized perturbations Replace, Insert and Merge in our arsenal, CLARE can produce outputs of varied lengths, and achieves a higher attack success rate than the baselines with fewer edits. Human evaluation shows significant advantages of CLARE in terms of textual similarity, fluency and grammaticality. We release our code and models at https://github.com/cookielee77/CLARE.

A Appendix
A.1 Implementation Details

For the implementation of w/o p_MLM > k in the ablation study (§4.1), we randomly sample 200 tokens and then apply the similarity constraint to construct the candidate set, as exhausting the vocabulary is computationally expensive.
Evaluation Metric. The similarity function sim builds on the Universal Sentence Encoder (USE; Cer et al., 2018) to measure a local similarity, over a window of size 15 around the perturbation position, between the original input and its adversary. All baselines are equipped with this sim when constructing the candidate vocabulary. The evaluation metric Sim uses USE to calculate a global similarity between the two texts. These procedures largely follow Jin et al. (2020). We mainly rely on human evaluation (§3.3) to establish CLARE's advantage over TextFooler in preserving textual similarity.
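As a sketch of this local similarity, with encode as an assumed sentence-encoder interface (USE in our setup) and a symmetric window as one illustrative reading of "window size 15":

```python
# Sketch of the local similarity around a perturbation position; the
# symmetric-window interpretation and `encode` interface are assumptions
# (the paper computes it with the Universal Sentence Encoder).
import numpy as np

def local_sim(orig_tokens, adv_tokens, pos, encode, window=15):
    half = window // 2
    lo, hi = max(0, pos - half), pos + half + 1
    a = encode(" ".join(orig_tokens[lo:hi]))   # original window
    b = encode(" ".join(adv_tokens[lo:hi]))    # perturbed window
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```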
Data Processing. When processing the data, we keep all punctuation in the texts for both victim model training and attacking. This differs from the preprocessing setting in TextFooler (Jin et al., 2020), as we empirically found that removing punctuation makes the victim model more vulnerable.

A.2 Additional Results
We include the results on the DBpedia ontology dataset (DBpedia; Zhang et al., 2015), the Stanford Sentiment Treebank (SST-2; Socher et al., 2013), the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), and Quora Question Pairs (QQP) from the GLUE benchmark in this section. Table 9 summarizes some statistics of these datasets. The results of different models on these datasets are summarized in Table 10. Compared with all baselines, CLARE achieves the best performance on attack success rate, perplexity, grammaticality, and similarity. This is consistent with our observations in §3.3.

A.3 Human Evaluation Details
For the human evaluation on the AG News dataset, we randomly sampled 300 sentences from the test set, together with the corresponding adversarial examples from CLARE and TextFooler (we only consider sentences that both models successfully attack). To make the task less abstract, we pair the adversarial examples from the two models, and present them to the participants along with the original input and its gold label. We ask them which one they prefer in terms of (1) having a meaning more similar to the original input (similarity), and (2) being more fluent and grammatical (fluency and grammaticality). We also provide a neutral option for when the participants consider the two indistinguishable. Additionally, we ask the participants to annotate the adversarial examples, and compare their annotations against the gold labels (label consistency). Higher label consistency indicates that the model is better at causing the victim model to make errors while preserving human predictions. Each pair of system outputs was randomly presented to 5 crowd-sourced judges, who indicated their preference for similarity, fluency, and grammaticality using the form shown in Figure 3. The labeling task is illustrated in Figure 4. To minimize the impact of spamming, we employed the top-ranked 30% of U.S. workers provided by the crowdsourcing service. Detailed task descriptions and examples were also provided to guide the judges. We calculate p-values based on 95% confidence intervals using 10K paired bootstrap replications, implemented with the R boot statistical package.

A.4 Qualitative Samples
We include adversarial examples generated by CLARE and TextFooler on the AG News, DBpedia, Yelp, MNLI, and QNLI datasets in Table 11 and Table 12.

AG (Sport)
Padres Blank Dodgers 3-0. LOS ANGELES - Adam Eaton allowed five hits over seven innings for his career-high 10th victory, Brian Giles homered for the second straight game, and the San Diego Padres beat the Los Angeles Dodgers 3-0 Thursday night. The NL West-leading Dodgers' lead was cut to 2 1/2 games over San Francisco - their smallest since July 31 ...

TextFooler (World)
Dodger Blank Yanks 3 -0. Loos ANGELES -Adams Parades enabling five hits over seven slugging for his careerhigh 10th victoria, Brian Giles homered for the second straight matching, and the Tome José Dodger beat the Los Angeles Dodger 3 -0 Thursday blackness. The NL Westernereminent Dodger' lead was cut to 2 1 / 2 games over San San -their tiny as janvier 31 ...

CLARE (World)
Padres Blank Dodgers 3-0. Milwaukee NEXT - Adam Eaton allowed five hits over seven innings for his career-high 10th victory, Brian Giles homered for the second straight game, and the San Diego Padres beat the Los Angeles Dodgers 3-0 Thursday night. The NL West-leading Dodgers' lead was cut to 2 1/2 games over San Francisco - their smallest since July 31 ...

Yelp (Positive)
The food at this chain has always been consistently good. Our server in downtown ( where we spent New Year's ) was new, but that did not impact our service at all. She was prompt and attentive to our needs.

TextFooler (Negative)
The food at this chain has always been necessarily ok. Our server in downtown ( where we spent New Year's ) was new, but that did not impact our service at all. She was early and attentive to our needs.

CLARE (Negative)
The food at this chain has always been looking consistently good. Our server in downtown ( where we spent New Year's ) was new, but that did not enhance our service at all. She was prompt and attentive to our needs.

Yelp (Positive)
The pho broth is actually flavorful and doesn't just taste like hot water with beef and noodles. I usually do take out and the order comes out fast during dinner which should be expected with pho, it's not hard to soak noodles, slice beef and pour broth.

TextFooler (Negative)
The pho broth is actually flavorful and doesn't just tasty like torrid waters with slaughter and salads. I repeatedly do take out and the order poses out fast during dinner which should be expected with pho , it's not strenuous to soak noodles, severing beef and pour broth.

CLARE (Negative)
The pho broth is actually flavorful and doesn't just taste bland like hot water with beef and noodles. I usually do take out and the order comes out awfully fast during dinner which should be expected with pho, it's not hard to soak noodles, slice beef and pour broth.

MNLI (Neutral)
Premise: Thebes held onto power until the 12th Dynasty, when its first king, Amenemhet I, who reigned between 1980-1951 b.c., established a capital near Memphis. Hypothesis: The capital near Memphis lasted only half a century before its inhabitants abandoned it for the next capital.

TextFooler (Contradiction)
Premise: Thebes apprehended pour powers until the 12th Familial, when its earliest king, Amenemhet I, who reigned between 1980-1951 c.c., established a capital near Memphis. Hypothesis: The capital near Memphis lasted only half a century before its inhabitants abandoned it for the next capital.

CLARE (Contradiction)
Premise: Thebes held onto power until the 12th Dynasty, when its first king, Amenemhet I, who reigned between 1980-1951 b.c., thereafter established a capital near Memphis. Hypothesis: The capital near Memphis lasted only half a century before its inhabitants abandoned it for the next capital.

MNLI (Entailment)
Premise: Hopefully, Wall Street will take voluntary steps to address these issues before it is forced to act. Hypothesis: Wall Street is facing issues, that need to be addressed.

TextFooler (Neutral)
Premise: Hopefully, Wall Street will take voluntary steps to treatment these issues before it is forced to act. Hypothesis: Wall Street is facing issues, that need to be addressed.

CLARE (Neutral)
Premise: Hopefully, Wall Street will take voluntary steps to eliminate these issues before it is forced to act. Hypothesis: Wall Street is facing issues, that need to be addressed.