Reevaluating Adversarial Examples in Natural Language

State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.


Introduction
One way to evaluate the robustness of a machine learning model is to search for inputs that produce incorrect outputs. Inputs intentionally designed to fool deep learning models are referred to as adversarial examples (Goodfellow et al., 2017). Adversarial examples have successfully tricked deep neural networks for image classification: two images that look exactly the same to a human receive completely different predictions from the classifier (Goodfellow et al., 2014). It is less obvious what the analogous notion of an imperceptible change is for natural language. The literature contains many potential answers to this question, proposing varying definitions for successful adversarial examples (Zhang et al., 2019). Even attacks with similar definitions of success often measure it in different ways. The lack of a consistent definition and standardized evaluation has hindered the use of adversarial examples to understand and improve NLP models. Therefore, we propose a unified definition for successful adversarial examples in natural language: perturbations that both fool the model and fulfill a set of linguistic constraints. In Section 2, we present four categories of constraints NLP adversarial examples may follow, depending on the context: semantics, grammaticality, overlap, and non-suspicion to human readers.
By explicitly laying out categories of constraints adversarial examples may follow, we introduce a shared vocabulary for discussing constraints on adversarial attacks. In Section 4, we suggest options for human and automatic evaluation methods for each category. We use these methods to evaluate two SOTA synonym substitution attacks: GENETICATTACK by Alzantot et al. (2018) and TEXTFOOLER by Jin et al. (2019). Human surveys show that the perturbed examples often fail to fulfill semantics and non-suspicion constraints. Additionally, a grammar checker detects 39% more errors in the perturbed examples than in the original inputs, including many types of errors humans almost never make.
In Section 5, we produce TFADJUSTED, an attack with the same search process as TEXTFOOLER, but with constraint enforcement tuned to generate higher quality adversarial examples. To enforce semantic preservation, we tighten the thresholds on the cosine similarity between embeddings of swapped words and between the sentence encodings of original and perturbed sentences. To enforce grammaticality, we validate perturbations with a grammar checker. As in TEXTFOOLER, these constraints are applied at each step of the search. Human evaluation shows that TFADJUSTED generates perturbations that better preserve semantics and are less noticeable to human judges. However, with stricter constraints, the attack success rate decreases from over 80% to under 20%. When used for adversarial training, TEXTFOOLER's examples decreased model accuracy, but TFADJUSTED's examples did not.
Without a shared vocabulary for discussing constraints, past work has compared the success rate of search methods with differing constraint application techniques. Jin et al. (2019) reported a higher attack success rate for TEXTFOOLER than Alzantot et al. (2018) did for GENETICATTACK, but it was not clear whether the improvement was due to a better search method or more lenient constraint application. In Section 6 we compare the search methods with constraint application held constant. We find that GENETICATTACK's search method is more successful than TEXTFOOLER's, contrary to the implications of Jin et al. (2019).
The five main contributions of this paper are:
• A definition of constraints on adversarial perturbations in natural language, with suggested evaluation methods for each constraint.
• Constraint evaluations of two SOTA synonym-substitution attacks, revealing that their perturbations often do not preserve semantics, grammaticality, or non-suspicion.
• Evidence that by aligning automatic constraint application with human judgment, it is possible for attacks to produce successful, valid adversarial examples.
• Demonstration that reported differences in attack success between TEXTFOOLER and GENETICATTACK are the result of more lenient constraint enforcement.
• A framework that enables fair comparison between attacks by separating the effects of search methods from the effects of loosened constraints.

Constraints on Adversarial Examples in Natural Language
We define F : X → Y as a predictive model, for example, a deep neural network classifier. X is the input space and Y is the output space. We focus on adversarial perturbations which perturb a correctly predicted input, x ∈ X , into an input x adv . The boolean goal function G(F, x adv ) represents whether the goal of the attack has been met. We define C 1 ...C n as a set of boolean functions indicating whether the perturbation satisfies a certain constraint.
Adversarial attacks search for a perturbation from x to x adv which fools F by both achieving some goal, as represented by G(F, x adv ), and fulfilling each constraint C i (x, x adv ).
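Concretely, this decomposition can be sketched in a few lines of Python. The function names and the toy model below are our own illustration, not an API from the paper or any attack library; they only show how a goal function G and constraints C_i combine:

```python
from typing import Callable, List

Model = Callable[[str], int]             # F: maps input text to a label
Goal = Callable[[Model, str], bool]      # G: has the attack goal been met?
Constraint = Callable[[str, str], bool]  # C_i: does (x, x_adv) satisfy it?

def is_successful(model: Model, goal: Goal,
                  constraints: List[Constraint],
                  x: str, x_adv: str) -> bool:
    """x_adv succeeds iff it meets the goal G(F, x_adv) and
    every constraint C_i(x, x_adv)."""
    return goal(model, x_adv) and all(c(x, x_adv) for c in constraints)

def untargeted_goal_for(x: str) -> Goal:
    """Untargeted classification goal: any prediction that differs
    from the model's prediction on the original input x."""
    def goal(model: Model, x_adv: str) -> bool:
        return model(x_adv) != model(x)
    return goal
```

A targeted goal would instead compare `model(x_adv)` against a fixed target label, leaving the rest of the framework unchanged.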
The definition of the goal function G depends on the purpose of the attack. Attacks on classification frequently aim to either induce any incorrect classification (untargeted) or induce a particular classification (targeted). Attacks on other types of models may have more sophisticated goals. For example, attacks on translation may attempt to change every word of a translation, or introduce targeted keywords into the translation (Cheng et al., 2018).
In addition to defining the goal of the attack, the attacker must decide which constraints perturbations must meet. Different use cases require different constraints. In the following, we define four categories of constraints on adversarial perturbations in natural language: semantics, grammaticality, overlap, and non-suspicion. Table 1 provides examples of adversarial perturbations that violate each constraint.

Table 1: Adversarial perturbations that violate each constraint.
Input, x: "Shall I compare thee to a summer's day?" - William Shakespeare, Sonnet XVIII
• Semantics. Perturbation, x adv: "Shall I compare thee to a winter's day?" (x adv has a different meaning than x.)
• Grammaticality. Perturbation, x adv: "Shall I compares thee to a summer's day?" (x adv is less grammatically correct than x.)
• Edit Distance. Perturbation, x adv: "Sha1l i conpp$haaare thee to a 5umm3r's day?" (x and x adv have a large edit distance.)
• Non-suspicion. Perturbation, x adv: "Am I gonna compare thee to a summer's day?" (A human reader may suspect this sentence to have been modified: Shakespeare never used the word "gonna". Its first recorded usage was not until 1806, and it did not become popular until the 20th century.)

Semantics
Semantics constraints require the semantics of the input to be preserved between x and x adv . Many attacks include constraints on semantics as a way to ensure the correct output is preserved (Zhang et al., 2019). As long as the semantics of an input do not change, the correct output will stay the same. There are exceptions: one could imagine tasks for which preserving semantics does not necessarily preserve the correct output. For example, consider the task of classifying passages as written in either Modern or Early Modern English. Perturbing "why" to "wherefore" may retain the semantics of the passage, but change the correct label from Modern to Early Modern English.

Grammaticality
Grammaticality constraints place restrictions on the grammaticality of x adv . For example, an adversary attempting to generate a plagiarised paper which fools a plagiarism checker would need to ensure that the paper remains grammatically correct. Grammatical errors don't necessarily change semantics, as illustrated in Table 1.

Overlap
Overlap constraints restrict the similarity between x and x adv at the character level. This includes constraints like Levenshtein distance as well as n-gram based measures such as BLEU, METEOR, and chrF (Papineni et al., 2002; Denkowski and Lavie, 2014; Popović, 2015).
Setting a maximum edit distance is useful when the attacker is willing to introduce misspellings. Additionally, the edit distance constraint is sometimes used when improving the robustness of models. For example, Huang et al. (2019) use Interval Bound Propagation to ensure model robustness to perturbations within some edit distance of the input.
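As a sketch, an overlap constraint of this kind can be implemented with a standard Levenshtein distance. The threshold of 30 characters below is an arbitrary placeholder, not a value taken from any of the cited attacks:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def within_edit_distance(x: str, x_adv: str, max_edits: int = 30) -> bool:
    """Overlap constraint: accept x_adv only if it stays close to x."""
    return levenshtein(x, x_adv) <= max_edits
```

The Table 1 example "a 5umm3r's day" sits only two character substitutions from "a summer's day", so it passes even a tight edit-distance constraint despite being visibly corrupted; this is why overlap constraints alone cannot enforce non-suspicion.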

Non-suspicion
Non-suspicion constraints specify that x adv must appear to be unmodified. Consider the example in Table 1. While the perturbation preserves semantics and grammar, it switches between Modern and Early Modern English and thus may seem suspicious to readers.
Note that the definition of the non-suspicious constraint is context-dependent. A sentence that is non-suspicious in the context of a kindergartner's homework assignment might be suspicious in the context of an academic paper. An attack scenario where non-suspicion constraints do not apply is illegal PDF distribution, similar to a case discussed by Gilmer et al. (2018). Consumers of an illegal PDF may tacitly collude with the person uploading it. They know the document has been altered, but do not care as long as semantics are preserved.
Review and Categorization of SOTA Attacks
Attacks by Paraphrase: Some studies have generated adversarial examples through paraphrase. Iyyer et al. (2018) used neural machine translation systems to generate paraphrases. Ribeiro et al. (2018) proposed semantically-equivalent adversarial rules. By definition, paraphrases preserve semantics. Since these systems aim to generate perfect paraphrases, they implicitly follow constraints of grammaticality and non-suspicion.
Attacks by Synonym Substitution: Some works focus on an easier way to generate a subset of paraphrases: replacing words from the input with synonyms (Alzantot et al., 2018; Jin et al., 2019; Kuleshov et al., 2018; Papernot et al., 2016; Ren et al., 2019). Each attack applies a search algorithm to determine which words to replace with which synonyms. Like the general paraphrase case, they aim to create examples that preserve semantics, grammaticality, and non-suspicion. While not all have an explicit edit distance constraint, some limit the number of words perturbed.
Attacks by Character Substitution: Some studies have proposed to attack natural language classification models by deliberately misspelling words (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018). These attacks use character replacements to change a word into one that the model does not recognize. The replacements are designed to create character sequences that a human reader would easily correct into the original words. If there are not many misspellings, non-suspicion may be preserved. Semantics are preserved as long as human readers can correct the misspellings.
Attacks by Word Insertion or Removal: Liang et al. (2017) and Samanta and Mehta (2017) devised a way to determine the most important words in the input and then used heuristics to generate perturbed inputs by adding or removing important words. In some cases, these strategies are combined with synonym substitution. These attacks aim to follow all constraints.
Using constraints defined in Section 2 we categorize a sample of current attacks in Table 2.

Constraint Evaluation Methods and Case Study
For each category of constraints introduced in Section 2, we discuss best practices for both human and automatic evaluation. We omit overlap constraints, as they are straightforward to evaluate automatically.
Additionally, we perform a case study, evaluating how well the black-box synonym substitution attacks GENETICATTACK and TEXTFOOLER fulfill constraints. Both attacks find adversarial examples by swapping out words for their synonyms until the classifier is fooled. GENETICATTACK uses a genetic algorithm to attack an LSTM trained on the IMDB document-level sentiment classification dataset. TEXTFOOLER uses a greedy approach to attack an LSTM, CNN, and BERT trained on five classification datasets. We chose these attacks because:
• They claim to create perturbations that preserve semantics, maintain grammaticality, and are not suspicious to readers. However, our inspection of the perturbations revealed that many violated these constraints.
• They report high attack success rates.
• They successfully attack two of the most effective models for text classification: LSTM and BERT.
To generate examples for evaluation, we attacked BERT using TEXTFOOLER and attacked an LSTM using GENETICATTACK. We evaluate both methods on the IMDB dataset. In addition, we evaluate TEXTFOOLER on the Yelp polarity document-level sentiment classification dataset and the Movie Review (MR) sentence-level sentiment classification dataset (Pang and Lee, 2005; Zhang et al., 2015). We use 1,000 examples from each dataset. Table 3 shows example violations of each constraint.

Human Evaluation
A few past studies of attacks have included human evaluation of semantic preservation (Ribeiro et al., 2018; Iyyer et al., 2018; Alzantot et al., 2018; Jin et al., 2019). However, studies often simply ask users to rate the "similarity" of x and x adv . We believe this phrasing does not generate an accurate measure of semantic preservation, as users may consider two sentences with different semantics "similar" if they only differ by a few words. Instead, users should be explicitly asked whether the changes between x and x adv preserve the meaning of the original passage.
We propose to ask human judges to rate if meaning is preserved on a Likert scale of 1-5, where 1 is "Strongly Disagree" and 5 is "Strongly Agree" (Likert, 1932). A perturbation is semantics-preserving if the average score is at least ε_sem. We propose ε_sem = 4 as a general rule: on average, humans should at least "Agree" that x and x adv have the same meaning.

Table 2: Summary of Constraints and Attacks. This table shows a selection of prior work, categorized by which constraints each attack is supposed to meet: Character Substitution (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018), Word Insertion or Removal (Liang et al., 2017; Samanta and Mehta, 2017), and General Paraphrase (Zhao et al., 2017; Ribeiro et al., 2018; Iyyer et al., 2018).

Table 3: Example constraint violations.
• Semantics. x: "Jagger, Stoppard and director Michael Apted deliver a riveting and surprisingly romantic ride." x adv: "Jagger, Stoppard and director Michael Apted deliver a baffling and surprisingly sappy motorbike."
• Grammaticality. x: "A grating, emaciated flick." x adv: "A grates, lanky flick."
• Non-suspicion. x: "Great character interaction." x adv: "Gargantuan character interaction."

Automatic Evaluation
Automatic evaluation of semantic similarity is a well-studied NLP task. The STS Benchmark is used as a common measurement (Cer et al., 2017). Michel et al. (2019) explored the use of common evaluation metrics for machine translation as a proxy for semantic similarity in the attack setting. While n-gram overlap based approaches are computationally cheap and work well in the machine translation setting, they do not correlate with human judgment as well as sentence encoders do.
Some attacks have used sentence encoders to encode two sentences into a pair of fixed-length vectors, then used the cosine distance between the vectors as a proxy for semantic similarity. TEXTFOOLER uses the Universal Sentence Encoder (USE), which achieved a Pearson correlation score of 0.782 on the STS benchmark (Cer et al., 2018). Another option is BERT fine-tuned for semantic similarity, which achieved a score of 0.865 (Devlin et al., 2018).
Additionally, synonym substitution methods, including TEXTFOOLER and GENETICATTACK, often require that words be substituted only with neighbors in the counter-fitted embedding space, which is designed to push synonyms together and antonyms apart (Mrksic et al., 2016). These automatic metrics of similarity produce a score that represents the similarity between x and x adv . Attacks depend on a minimum threshold value for each metric to determine whether the changes between x and x adv preserve semantics. Human evaluation is needed to find threshold values such that people generally "agree" that semantics is preserved.
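As a minimal sketch, the sentence-encoder proxy works as follows. Here `encode` stands in for any fixed-length sentence encoder (such as USE), and `threshold` is the tunable value that must be calibrated against human judgment, not a recommended setting:

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def preserves_semantics(encode, x: str, x_adv: str,
                        threshold: float) -> bool:
    """Proxy for semantic preservation: encode both sentences into
    fixed-length vectors and require their cosine similarity to meet
    a minimum threshold."""
    return cosine_similarity(encode(x), encode(x_adv)) >= threshold
```

The attack accepts or rejects each candidate perturbation by this boolean test; the human studies described above are what justify any particular choice of `threshold`.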

Case Study
To quantify semantic similarity of x and x adv , we asked users whether they agreed that the changes between the two passages preserved meaning on a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). We averaged scores for each attack method to determine if the method generally preserves semantics.
Perturbations generated by TEXTFOOLER were rated an average of 3.28, while perturbations generated by GENETICATTACK were rated an average of 2.70. The average rating for both methods was significantly less than our proposed ε_sem of 4. Using a clear survey question illustrates that humans, on average, do not assess these perturbations as semantics-preserving.

Human Evaluation
Both Jin et al. (2019) and Iyyer et al. (2018) reported a human evaluation of grammaticality, but neither study clearly asked if any errors were introduced by a perturbation. For human evaluation of the grammaticality constraint, we propose presenting x and x adv together and asking judges if grammatical errors were introduced by the changes made. However, due to the rule-based nature of grammar, automatic evaluation is preferred.

Automatic Evaluation
The simplest way to automatically evaluate grammatical correctness is with a rule-based grammar checker. Free grammar checkers are available online in many languages. One popular checker is LanguageTool, an open-source proofreading tool (Naber, 2003). LanguageTool ships with thousands of human-curated rules for the English language and provides an interface for identifying grammatical errors in sentences. LanguageTool uses rules to detect grammatical errors, statistics to detect uncommon sequences of words, and language model perplexity to detect commonly confused words.

Case Study
We ran each of the generated (x, x adv ) pairs through LanguageTool to count grammatical errors. LanguageTool detected more grammatical errors in x adv than x for 50% of perturbations generated by TEXTFOOLER, and 32% of perturbations generated by GENETICATTACK.
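This comparison reduces to counting detected errors on both sides of each pair. In the sketch below, `check` is a stand-in for any rule-based checker that returns a list of detected errors (as LanguageTool does); the toy checker in the test is purely hypothetical:

```python
def introduces_errors(check, x: str, x_adv: str) -> bool:
    """True if the checker detects more errors in x_adv than in x."""
    return len(check(x_adv)) > len(check(x))

def violation_rate(check, pairs) -> float:
    """Fraction of (x, x_adv) pairs in which the perturbation
    introduced at least one new grammatical error."""
    flagged = sum(introduces_errors(check, x, x_adv) for x, x_adv in pairs)
    return flagged / len(pairs)
```

Comparing counts rather than flagging any error at all matters here: some original inputs already contain errors, and only newly introduced ones should be attributed to the attack.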
Additionally, perturbations often contain errors that humans rarely make. LanguageTool detected 6 categories for which errors in the perturbed samples appear at least 10 times more frequently than in the original content. Details regarding these error categories and examples of violations are shown in Table 4.

Human Evaluation
We propose evaluation of non-suspicion by having judges view a shuffled mix of real and adversarial inputs and guess whether each is real or computer-altered. This is similar to the human evaluation done by Ren et al. (2019), but we formulate it as a binary classification task rather than on a 1-5 scale. A perturbed example x adv is not suspicious if the percentage of judges who identify x adv as computer-altered is at most ε_ns, where 0 ≤ ε_ns ≤ 1.
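This criterion is simple to state precisely. The sketch below uses our own notation, treating each judge's verdict as a boolean vote:

```python
def is_non_suspicious(votes, eps_ns: float = 0.5) -> bool:
    """votes: one boolean per judge, True if that judge flagged
    x_adv as computer-altered. x_adv is non-suspicious if at most
    a fraction eps_ns of the judges flagged it."""
    return sum(votes) / len(votes) <= eps_ns

def judge_accuracy(is_altered, guesses) -> float:
    """Accuracy of binary real-vs-altered guesses over a shuffled
    mix; 50% accuracy means judges cannot distinguish perturbed
    inputs from real ones."""
    return sum(a == g for a, g in zip(is_altered, guesses)) / len(is_altered)
```

The second function is the aggregate statistic reported in the case study below: accuracy well above 50% indicates that the perturbed examples are, in aggregate, suspicious.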

Automatic Evaluation
Automatic evaluation may be used to guess whether or not an adversarial example is suspicious. Models can be trained to classify passages as real or perturbed, just as human judges do. For example, Warstadt et al. (2018) trained sentence encoders on a real/fake task as a proxy for evaluation of linguistic acceptability. Recently, Zellers et al. (2019) demonstrated that GROVER, a transformer-based text generation model, could classify its own generated news articles as human or machine-written with high accuracy.

Case Study
We presented a shuffled mix of real and perturbed examples to human judges and asked if they were real or computer-altered. As this is a time-consuming task for long documents, we only evaluated adversarial examples generated by TEXTFOOLER on the sentence-level MR dataset.
If all generated examples were non-suspicious, judges would average 50% accuracy, as they would not be able to distinguish between real and perturbed examples. In this case, judges achieved 69.2% accuracy.

Producing Higher Quality Adversarial Examples
In Section 4, we evaluated how well generated examples met constraints. We found that although attacks in NLP aspire to meet linguistic constraints, in practice, they frequently violate them. Now, we adjust automatic constraints applied during the course of the attack to produce better quality adversarial examples.
We set out to find if a set of constraint application methods with appropriate thresholds could produce adversarial examples that are semantics-preserving, grammatical, and non-suspicious. We modified TEXTFOOLER to produce TFADJUSTED, a new attack with stricter constraint application. To enforce grammaticality, we added LanguageTool. To enforce semantic preservation, we tuned two thresholds which filter out invalid word substitutions: (a) minimum cosine similarity between counter-fitted word embeddings and (b) minimum cosine similarity between sentence embeddings. Through human studies (detailed in Appendix A.2.2), we found threshold values of 0.9 for (a) and 0.98 for (b). We implemented TFADJUSTED using TextAttack, a Python framework for implementing adversarial attacks in NLP (Morris et al., 2020).
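Under our reading, the per-swap filter can be sketched as below. The function names, the word-embedding lookup, and the error-counting interface are our own stand-ins rather than TextAttack's API; only the 0.9 and 0.98 thresholds come from the study:

```python
import numpy as np

def cos_sim(u, v) -> float:
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def swap_is_valid(word_emb, sent_encode, count_errors,
                  x: str, x_adv: str, old_word: str, new_word: str,
                  min_word_sim: float = 0.9,
                  min_sent_sim: float = 0.98) -> bool:
    """Accept one candidate substitution only if it passes all three
    TFADJUSTED-style filters: (a) swapped-word embedding similarity,
    (b) sentence-encoding similarity, (c) no new grammatical errors."""
    return (cos_sim(word_emb[old_word], word_emb[new_word]) >= min_word_sim
            and cos_sim(sent_encode(x), sent_encode(x_adv)) >= min_sent_sim
            and count_errors(x_adv) <= count_errors(x))
```

Because the filters are applied at every step of the search, a single failing check discards the candidate swap immediately rather than after the full perturbation is built.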

With Adjusted Constraint Application
We tested TFADJUSTED to determine the effect of tightening constraint application. We used the IMDB, Yelp, and MR classification datasets as in Section 4. We added the SNLI and MNLI entailment datasets (Bowman et al., 2015; Williams et al., 2018) for the portions not requiring human evaluation. Table 5 shows the results.
Semantics. TEXTFOOLER generates perturbations for which human judges are on average "Not sure" if semantics are preserved. With perturbations generated by TFADJUSTED, human judges on average "Agree" that semantics are preserved.

Grammaticality. Since all examples produced by TFADJUSTED are checked with LanguageTool, no perturbation can introduce grammatical errors. (One caveat: since the MR dataset is already lowercased and tokenized, it is difficult for a rule-based grammar checker like LanguageTool to parse some inputs.)

Non-suspicion. We repeated the non-suspicion study from Section 4.3 with the examples generated by TFADJUSTED. Participants were able to guess with 58.8% accuracy whether inputs were computer-altered, over 10 percentage points lower than their accuracy on the examples generated by TEXTFOOLER.
Attack success. For each of the three datasets, the attack success rate decreased by at least 71 percentage points (see last row of Table 5).

Adversarial Training With Higher Quality Examples
Using the 9,595 samples in the MR training set as seed inputs, TEXTFOOLER generated 7,382 adversarial examples, while TFADJUSTED generated just 825. We appended each set of adversarial examples to a copy of the original MR training set and fine-tuned a pre-trained BERT model for 10 epochs. Figure 2 plots the test accuracy over 10 training epochs, averaged over 5 random seeds per dataset. While neither training method strongly impacts accuracy, augmentation with TFADJUSTED's examples has a better impact than augmentation with TEXTFOOLER's.
We then re-ran the two attacks using 1,000 examples from the MR test set as seeds. Again averaging over 5 random seeds, we found no significant change in robustness. That is, models trained on the original MR dataset were approximately as robust as those trained on the datasets augmented with TEXTFOOLER and TFADJUSTED examples. This corroborates the findings of Alzantot et al. (2018) and contradicts those of Jin et al. (2019). We include further analysis, along with some hypotheses for the discrepancies in adversarial training results, in Appendix A.4.

Ablation of TFADJUSTED Constraints
TFADJUSTED generated better quality adversarial examples by constraining its search to exclude examples that fail to meet three constraints: word embedding distance, sentence encoder similarity, and grammaticality. We performed an ablation study to understand the relative impact of each on attack success rate.
We reran three TFADJUSTED attacks (one for each constraint removed) on each dataset. Table 6 shows attack success rate after individually removing each constraint. The word embedding distance constraint was the greatest inhibitor of attack success rate, followed by the sentence encoder.

Comparing Search Methods
When an attack's success rate improves, it may be the result of either (a) improvement of the search method for finding adversarial perturbations or (b) more lenient constraint definitions or constraint application. TEXTFOOLER achieves a higher success rate than GENETICATTACK, but Jin et al. (2019) did not identify whether the improvement was due to (a) or (b). Since TEXTFOOLER uses both a different search method and different constraint application methods than GENETICATTACK, the source of the difference in attack success rates is unclear.
To determine which search method is more effective, we used TextAttack to compose attacks from the search method of GENETICATTACK and the constraint application methods of each of TEXTFOOLER and TFADJUSTED (Morris et al., 2020). With constraint application held constant, we can identify the source of the difference in attack success rate. Table 7 reveals that the genetic algorithm of GENETICATTACK is more successful than the greedy search of TEXTFOOLER at both constraint application levels. This reveals that the source of improvement in attack success rate between GENETICATTACK and TEXTFOOLER is more lenient constraint application. However, GENETICATTACK's genetic algorithm is far more computationally expensive, requiring over 40x more model queries.

Discussion
Tradeoff between attack success and example quality. TFADJUSTED made semantic constraints more selective, which helped attacks generate examples that scored above 4 on the Likert scale for preservation of semantics. However, this led to a steep drop in attack success rate. This indicates that, when only allowing adversarial perturbations that preserve semantics and grammaticality, NLP models are relatively robust to current synonym substitution attacks. Note that our set of constraints is not necessarily optimal for every attack scenario. Some contexts may require fewer constraints or less strict constraint application.

Table 7: Comparison of the search methods from GENETICATTACK and TEXTFOOLER with two sets of constraints (TEXTFOOLER and TFADJUSTED). Attacks were run on 1,000 samples against BERT fine-tuned on the MR dataset. GENETICATTACK's genetic algorithm is more successful than TEXTFOOLER's greedy strategy, albeit much less efficient.
Decoupling search methods and constraints. It is critical that researchers decouple new search methods from new constraint evaluation and constraint application methods. Demonstrating the performance of a new attack that simultaneously introduces a new search method and new constraints makes it unclear whether empirical gains indicate a more effective attack or a more relaxed set of constraints. This mirrors a broader trend in machine learning where researchers report differences that come from changing multiple independent variables, making the sources of empirical gains unclear (Lipton and Steinhardt, 2018). This is especially relevant in adversarial NLP, where each experiment depends on many parameters.
Towards improved methods for generating textual adversarial examples. As models improve at paraphrasing inputs, we will be able to explore the space of adversarial examples beyond synonym substitutions. As models improve at measuring semantic similarity, we will be able to more rigorously ensure that adversarial perturbations preserve semantics. It remains to be seen how robust BERT is when subject to paraphrase attacks that rigorously preserve semantics and grammaticality.

Related Work
The goal of creating adversarial examples that preserve semantics and grammaticality is common in the NLP attack literature (Zhang et al., 2019). However, previous works use different definitions of adversarial examples, making it difficult to compare methods. We provide a unified definition of an adversarial example based on a goal function and a set of linguistic constraints. Gilmer et al. (2018) laid out a set of potential constraints for the attack space when generating adversarial examples, which are each useful in different real-world scenarios. However, they did not discuss NLP attacks in particular. Michel et al. (2019) defined a framework for evaluating attacks on machine translation models, focusing on meaning preservation constraints, but restricted their definitions to sequence-to-sequence models. Other research on NLP attacks has suggested various constraints but has not introduced a shared vocabulary and categorization that allows for effective comparisons between attacks.

Conclusion
We showed that two state-of-the-art synonym substitution attacks, TEXTFOOLER and GENETICATTACK, frequently violate the constraints they claim to follow. We created TFADJUSTED, which applies constraints that produce adversarial examples judged to preserve semantics and grammaticality.
Due to the lack of a shared vocabulary for discussing NLP attacks, the source of improvement in attack success rate between TEXTFOOLER and GENETICATTACK was unclear. Holding constraint application constant revealed that the source of TEXTFOOLER's improvement was lenient constraint application (rather than a better search method). With a shared framework for defining and applying constraints, future research can focus on developing better search methods and better constraint application techniques for preserving semantics and grammaticality.

A.2 Details about Human Studies.
Our experiments relied on labor crowd-sourced from Amazon Mechanical Turk. We used five datasets: the IMDB and Yelp datasets from Alzantot et al. (2018) and the IMDB, Yelp, and Movie Review datasets from Jin et al. (2019). We limited our worker pool to workers in the United States, Canada, and Australia who had completed over 5,000 HITs with over a 99% success rate. We had an additional Qualification that prevented workers who had submitted too many labels in previous tasks from fulfilling too many of our HITs. In the future, we will also use a small qualifier task to select workers who are good at the task.
For the human portions, we randomly select 100 successful examples for each combination of attack method and dataset, then use Amazon's Mechanical Turk to gather 10 answers for each example. For the automatic portions of the case study in Section 4, we use all successfully perturbed examples.

A.2.1 Evaluating Adversarial Examples
Rating Semantic Similarity. We present results from two Mechanical Turk questionnaires used to judge semantic similarity. For each task, we showed x and x adv side by side, in a random order. We added custom Javascript to highlight character differences between the two sequences. We provided the following description: "Compare two short pieces of English text and determine if they mean different things or the same." We then prompted labelers: "The changes between these two passages preserve the original meaning." We paid $0.06 per label for this task.
Inter-Annotator Agreement. For each semantic similarity prompt, we gathered annotations from 10 different judges. Recall that each selection was one of 5 different options ranging from "Strongly Agree" to "Strongly Disagree." For each pair of original and perturbed sequences, we calculated the number of judges who chose the most frequent option. For example, if 7 chose "Strongly Agree" and 3 chose "Agree," the number of judges who chose the most frequent option is 7. We found that for the examples studied in Section 4 the average of this metric was 5.09. For the examples in Section 5 at the threshold of 0.98 which we chose, the average was 5.6.
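The agreement metric described here, the count of judges choosing the modal option, can be computed as follows (a minimal sketch; the function name is ours):

```python
from collections import Counter

def modal_agreement(labels):
    """Number of judges who chose the most frequent option."""
    return Counter(labels).most_common(1)[0][1]

# The example from the text: 7 chose "Strongly Agree", 3 chose "Agree".
labels = ["Strongly Agree"] * 7 + ["Agree"] * 3
assert modal_agreement(labels) == 7
```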

Guessing Real vs. Computer-altered. We present results from our Mechanical Turk survey where we asked users "Is this text real or computer-altered?". We restricted this task to a single dataset, Movie Review. We chose Movie Review because it has an average sample length of 20 words, much shorter than Yelp or IMDB; classifying long samples as real or fake is time-consuming. We paid $0.05 per label for this task.
Rating word similarity. In a third study, we showed users a pair of words and asked: "In general, replacing the first word with the second preserves the meaning of a sentence." We paid $0.02 per label for this task.
Phrasing matters. Mechanical Turk comes with a set of pre-designed questionnaire interfaces. These include one titled "Semantic Similarity" which asks users to rate a pair of sentences on a scale from "Not Similar At All" to "Highly Similar." Examples generated by synonym attacks benefit from this question formulation because humans tend to rate two sentences that share many words as "Similar" due to their small morphological distance, even if they have different meanings.
Notes for future surveys. In the future, we would also try to filter out bad labels by mixing some number of ground-truth "easy" data points into our dataset and rejecting the work of labelers who performed poorly on this set.

A.2.2 Finding The Right Thresholds
Comparing two words. We showed study participants a pair of words and asked them whether swapping one word for the other would change the meaning of a sentence. The results are shown in Figure 3. Using this information, we chose 0.9 as the word-level cosine similarity threshold.

Jin et al. (2019) reported that adversarial training improved model robustness, while Alzantot et al. (2018) reported no effect. We trained 5 models on each dataset and saw significant variance in the robustness of adversarially trained models, both between random initializations and between epochs. The results are shown in Figure 5. It is possible that Jin et al. (2019) trained a single model for each training set (original and augmented) and happened to see an increase in robustness. It remains to be seen whether examples generated by GENETICATTACK, TEXTFOOLER, and TFADJUSTED help or hurt the robustness and accuracy of adversarially trained models across other model architectures and datasets.

A.5 Word Embeddings
It is common to perform synonym substitution by replacing a word with a neighbor in the counter-fitted embedding space. The distance between word embeddings is frequently measured using Euclidean distance, but it is also common to compare word embeddings based on their cosine similarity (the cosine of the angle between them). (Some work also measures distance based on the mean squared error between embeddings, which is just the square of Euclidean distance.) For this reason, past work has sometimes constrained nearest neighbors based on the Euclidean distance between two word vectors, and other times based on their cosine similarity. Alzantot et al. (2018) considered both distance metrics, and reported that they "did not see a noticeable improvement using cosine." We would like to point out that, when using normalized word vectors (as is typical for counter-fitted embeddings), filtering nearest neighbors based on their minimum cosine similarity is equivalent to filtering by maximum Euclidean distance (or MSE, for that matter).
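A minimal sketch of filtering candidate synonym swaps by a minimum embedding cosine similarity (here 0.9, the word-level threshold chosen from our human study); the toy embedding dictionary is an assumption for illustration, not the actual counter-fitted vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_neighbors(word, candidates, embeddings, min_cos=0.9):
    """Keep only candidate swaps whose embedding cosine similarity
    with the original word meets the threshold."""
    u = embeddings[word]
    return [c for c in candidates if cosine(u, embeddings[c]) >= min_cos]

# Toy 2-d embeddings (stand-ins, not real counter-fitted vectors).
emb = {"good": (1.0, 0.0), "great": (0.95, 0.31), "dog": (0.0, 1.0)}
assert filter_neighbors("good", ["great", "dog"], emb) == ["great"]
```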
For normalized vectors u and v, we have ‖u − v‖² = ‖u‖² + ‖v‖² − 2u·v = 2 − 2 cos(u, v). Therefore, the Euclidean distance between u and v is a monotonically decreasing function of the cosine between them. For any minimum cosine similarity ε, we can use a maximum Euclidean distance of √(2 − 2ε) and achieve the same result.
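The equivalence can be checked numerically for unit vectors:

```python
import math

# Two unit vectors at a 30-degree angle.
u = (1.0, 0.0)
v = (math.cos(math.pi / 6), math.sin(math.pi / 6))

cos_uv = sum(a * b for a, b in zip(u, v))  # dot product = cosine for unit vectors
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))  # Euclidean distance

# For normalized vectors: ||u - v|| = sqrt(2 - 2 * cos(u, v)),
# so a minimum cosine similarity eps corresponds exactly to a
# maximum Euclidean distance of sqrt(2 - 2 * eps).
assert abs(dist - math.sqrt(2 - 2 * cos_uv)) < 1e-12
```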

A.6 Examples In The Wild
We randomly select 10 attempted attacks from the MR dataset and show the original inputs, perturbations before constraint change, and perturbations after constraint change. See Table 9.