Adversarial Semantic Collisions

We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks that rely on analyzing the meaning and similarity of texts, including paraphrase identification, document retrieval, response suggestion, and extractive summarization, are vulnerable to semantic collisions. For example, given a target query, inserting a crafted collision into an irrelevant document can shift its retrieval rank from 1000 to the top 3. We show how to generate semantic collisions that evade perplexity-based filtering and discuss other potential mitigations. Our code is available at https://github.com/csong27/collision-bert.


Introduction
Deep neural networks are vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), i.e., imperceptibly perturbed inputs that cause models to make wrong predictions. Adversarial examples based on inserting or modifying characters and words have been demonstrated for text classification (Liang et al., 2018; Ebrahimi et al., 2018; Pal and Tople, 2020), question answering (Jia and Liang, 2017; Wallace et al., 2019), and machine translation (Belinkov and Bisk, 2018; Wallace et al., 2020). These attacks aim to minimally perturb the input so as to preserve its semantics while changing the output of the model.
In this work, we introduce and study a different class of vulnerabilities in NLP models for analyzing the meaning and similarity of texts. Given an input (query), we demonstrate how to generate a semantic collision: an unrelated text that is judged semantically equivalent by the target model. Semantic collisions are the "inverse" of adversarial examples. Whereas adversarial examples are similar inputs that produce dissimilar model outputs, semantic collisions are dissimilar inputs that produce similar model outputs.
We develop gradient-based approaches for generating collisions given white-box access to a model and deploy them against several NLP tasks. For paraphrase identification, the adversary crafts collisions that are judged as a valid paraphrase of the input query; downstream applications such as removing duplicates or merging similar content will thus erroneously merge the adversary's inputs with the victim's inputs. For document retrieval, the adversary inserts collisions into one of the documents that cause it to be ranked very high even though it is irrelevant to the query. For response suggestion, the adversary's irrelevant text is ranked as the top suggestion and can also carry spam or advertising. For extractive summarization, the adversary inserts a collision into the input text, causing it to be picked as the most relevant content.
Our first technique generates collisions aggressively, without regard to potential defenses. We then develop two techniques, "regularized aggressive" and "natural," that constrain generated collisions using a language model so as to evade perplexity-based filtering. We evaluate all techniques against state-of-the-art models and benchmark datasets on all four tasks. For paraphrase identification on Quora question pairs, our collisions are (mis)identified as paraphrases of inputs with 97% confidence on average. For document retrieval, our collisions shift the median rank of irrelevant documents from 1000 to around 10. For response suggestion in dialogue (sentence retrieval), our collisions are ranked as the top response 99% and 86% of the time with the aggressive and natural techniques, respectively. For extractive summarization, our collisions are chosen by the model as the summary 100% of the time. We conclude by discussing potential defenses against these attacks.

Table 1: Four tasks in our study. Given an input x and white-box access to a victim model, the adversary produces a collision c resulting in a deceptive output. Collisions can be nonsensical or natural-looking and also carry spam messages (shown in red). Excerpted examples:

Aggressive (c): chesapeake oval mayo knuckles crowded double transmitter gig after nixon, tipped incumbent physician kai joshi astonished northwestern documents | obliged dumont determines philadelphia consultative oracle keyboards dominates tel node
Regularized aggressive (c): and acc near floors : panicked ; its employment became impossible, the -of cn magazine usa, in which " "'panic over unexpected noise, noise of and a of the of the of the of a of of the of the of of of of the of of of of the of of the of.
Natural (c): the ansb and other buildings to carry people : three at the mall, an infirmary, an auditorium, and a library, as well as a clinic, pharmacy, and restaurant

Response Suggestion
Context (x): ...i went to school to be a vet, but i didn't like it.
Aggressive (c, rank 1): buy v1agra in canadian pharmacy to breath as four ranger color
Regularized aggressive (c): kill veterans and oxygen snarled clearly you were a a to to and a a to to to to to to to to to to
Natural (c): then not have been an animal, or a human or a soldier but should

Extractive Summarization
Truth: on average, britons manage just six and a half hours' sleep a night, which is far less than the recommended eight hours.
Aggressive (c, rank 1): iec cu franks believe carbon chat fix pay carbon targets co2 8 iec cu mb
Regularized aggressive (c): the second mercury project carbon b mercury is a will produce 38 million 202 carbon a a to to to to to to to to to to to to to
Natural (c): 1 million men died during world war ii; over 40 percent were women

Related Work
Adversarial examples in NLP. Most previously studied adversarial attacks in NLP aim to minimally modify or perturb inputs while changing the model's output. Hosseini et al. (2017) showed that perturbations such as inserting dots or spaces between characters can deceive a toxic comment classifier. HotFlip used gradients to find such perturbations given white-box access to the target model (Ebrahimi et al., 2018). Wallace et al. (2019) extended HotFlip by inserting a short crafted "trigger" text into any input as a perturbation; the trigger words are often highly associated with the target class label. Other approaches are based on rules, heuristics, or generative models (Mahler et al., 2017; Ribeiro et al., 2018; Iyyer et al., 2018; Zhao et al., 2018). As explained in Section 1, our goal is the inverse of adversarial examples: we aim to generate inputs with drastically different semantics that are perceived as similar by the model.
Several works studied attacks that change the semantics of inputs. Jia and Liang (2017) showed that inserting a heuristically crafted sentence into a paragraph can trick a question answering (QA) system into picking the answer from the inserted sentence. Aggressively perturbed texts based on HotFlip are nonsensical and can be translated into meaningful and malicious outputs by black-box translation systems (Wallace et al., 2020). Our semantic collisions extend the idea of changing input semantics to a different class of NLP models; we design new gradient-based approaches that are not perturbation-based and are more effective than HotFlip attacks; and, in addition to nonsensical adversarial texts, we show how to generate "natural" collisions that evade perplexity-based defenses.
Feature collisions in computer vision. Feature collisions have been studied in image analysis models. Jacobsen et al. (2019a) showed that images from different classes can end up with identical representations due to excessive invariance of deep models. An adversary can modify the input to change its class while leaving the model's prediction unaffected (Jacobsen et al., 2019b). Intrinsic properties of the rectifier activation function can also cause images with different labels to have the same feature vectors.

Threat Model
We describe the targets of our attack, the threat model, and the adversary's objectives.
Semantic similarity. Evaluating semantic similarity of a pair of texts is at the core of many NLP applications. Paraphrase identification decides whether sentences are paraphrases of each other and can be used to merge similar content and remove duplicates. Document retrieval computes semantic similarity scores between the user's query and each of the candidate documents and uses these scores to rank the documents. Response suggestion, aka Smart Reply (Kannan et al., 2016) or sentence retrieval, selects a response from a pool of candidates based on their similarity scores to the user's input in dialogue. Extractive summarization ranks sentences in a document based on their semantic similarity to the document's content and outputs the top-ranked sentences.
For each of these tasks, let f denote the model and x a , x b a pair of text inputs. There are two common modeling approaches for these applications. In the first approach, the model takes the concatenation ⊕ of x a and x b as input and directly produces a similarity score f (x a ⊕ x b ). In the second approach, the model computes a sentence-level embedding f (x) ∈ R h , i.e., a dense vector representation of input x. The similarity score is then computed as s(f (x a ), f (x b )), where s is a vector similarity metric such as cosine similarity. Models based on either approach are trained with similar losses, such as the binary classification loss where each pair of inputs is labeled as 1 if semantically related, 0 otherwise. For generality, let S(·, ·) be a similarity function that captures semantic relevance under either approach. We also assume that f can take x in the form of a sequence of discrete words (denoted as w) or word embedding vectors (denoted as e), depending on the scenario.
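The two modeling approaches above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for the sentence encoder f (a real f would be a fine-tuned BERT or RoBERTa), and cosine similarity plays the role of the vector metric s(·, ·).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Vector similarity metric s(., .) used in the embedding-based approach."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_similarity(f, x_a: str, x_b: str) -> float:
    """Second approach: S(x_a, x_b) = s(f(x_a), f(x_b)) for a sentence encoder f.
    (The first approach would instead directly score the concatenation x_a + x_b.)"""
    return cosine_similarity(f(x_a), f(x_b))

def toy_encoder(text: str) -> np.ndarray:
    """Hypothetical stand-in for f; deterministic within one run."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

# Identical texts collide trivially; an adversarial collision instead finds an
# *unrelated* c whose score S(x, c) approaches this maximum.
s_same = embedding_similarity(toy_encoder, "a vet", "a vet")
```

The bi-encoder form is what makes retrieval attacks cheap for the adversary: candidate embeddings can be scored against a query without re-running the encoder on the pair.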
Assumptions. We assume that the adversary has full knowledge of the target model, including its architecture and parameters. It may be possible to transfer white-box attacks to the black-box scenario using model extraction (Krishna et al., 2020;Wallace et al., 2020); we leave this to future work. The adversary controls some inputs that will be used by the target model, e.g., he can insert or modify candidate documents for a retrieval system.
Adversary's objectives. Given a target model f and target sentence x, the adversary wants to generate a collision x_b = c such that f perceives x and c as semantically similar or relevant. Adversarial uses of this attack depend on the application. If an application is using paraphrase identification to merge similar content, e.g., in Quora (Scharff, 2015), the adversary can use collisions to deliver spam or advertising to users. In a retrieval system, the adversary can use collisions to boost the rank of irrelevant candidates for certain queries. For extractive summarization, the adversary can cause collisions to be returned as the summary of the target document.

Figure 1: Overview of generating semantic collision c for a query input x. The continuous variables z_t relax the words in c and are optimized with gradients. We search in the simplex produced by z_t for the actual colliding words in c.

Adversarial Semantic Collisions
Given an input (query) sentence x, we aim to generate a collision c for the victim model with the whitebox similarity function S. This can be formulated as an optimization problem: arg max c∈X S(x, c) such that x and c are semantically unrelated. A brute-force enumeration of X is computationally infeasible. Instead, we design gradient-based approaches outlined in Algorithm 1. We consider two variants: (a) aggressively generating unconstrained, nonsensical collisions, and (b) constrained collisions, i.e., sequences of tokens that appear fluent under a language model and cannot be automatically filtered out based on their perplexity.
We assume that models can accept inputs as both hard one-hot words and soft words, where a soft word is a probability vector w̃ ∈ ∆^{|V|−1} over the vocabulary V.

Aggressive Collisions
We use gradient-based search to generate a fixed-length collision given a target input. The search is done in two steps: 1) we find a continuous representation of a collision using gradient optimization with relaxation, and 2) we apply beam search to produce a hard collision. We repeat these two steps iteratively until the similarity score S converges.

Algorithm 1 Generating adversarial semantic collisions
Input: input text x, similarity function S, embeddings E, language model g, vocabulary V, length T. Hyperparams: beam size B, top-k size K, iterations N, step size η, temperature τ, score coefficient β, label smoothing ε.

Optimizing for soft collision. We first relax the optimization to a continuous representation with temperature annealing. Given the model's vocabulary V and a fixed length T, we model word selection at each position t as a continuous logit vector z_t ∈ R^{|V|}. To convert each z_t to an input word, we model a softly selected word at t as:

č_t = softmax(z_t / τ),    (1)

where τ is a temperature scalar. Intuitively, softmax on z_t gives the probability of each word in V. The temperature controls the sharpness of the word selection probability; when τ → 0, the soft word č_t is the same as the hard word arg max z_t. We optimize for the continuous values z. At each step, the soft word collisions č = [č_1, . . . , č_T] are forwarded to f to calculate S(x, č). Since all operations are continuous, the error can be backpropagated all the way to each z_t to calculate its gradients. We can thus apply gradient ascent to improve the objective.
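The temperature-annealed relaxation above can be sketched directly; this toy numpy version (not the paper's code) shows that the softmax over z_t / τ sharpens toward the one-hot arg max as τ → 0:

```python
import numpy as np

def soft_word(z: np.ndarray, tau: float) -> np.ndarray:
    """Relaxed word selection: softmax(z / tau) over a vocabulary of size len(z).
    Small tau sharpens the distribution toward the one-hot arg max z."""
    logits = z / tau
    logits = logits - logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

z_t = np.array([1.0, 3.0, 2.0])           # logits over a toy 3-word vocabulary
smooth = soft_word(z_t, tau=1.0)          # spread over all words; differentiable
sharp = soft_word(z_t, tau=0.01)          # nearly one-hot at index 1 (arg max)
```

In the actual attack the soft word would be embedded as č_t · E and forwarded through f, so gradients flow back to every z_t.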
Searching for hard collision. After the relaxed optimization, we apply a projection step to find a hard collision using discrete search. Specifically, we apply left-to-right beam search on each z_t. At every search step t, we first get the top K words w based on z_t and rank them by the target similarity S(x, c_{1:t−1} ⊕ w ⊕ č_{t+1:T}), where č_{t+1:T} is the partial soft collision starting at t+1. This procedure allows us to find a hard-word replacement for the soft word at each position t based on the previously found hard words and relaxed estimates of future words.
Repeating optimization with hard collision. If the similarity score still has room for improvement after the beam search, we use the current c to initialize the soft solution z t for the next iteration of optimization by transferring the hard solution back to continuous space.
In order to initialize the continuous relaxation from a hard sentence, we apply label smoothing (LS) to its one-hot representation. For each word c_t in the current c, we soften its one-hot vector to be inside ∆^{|V|−1} with

LS(c_t) = (1 − ε) · onehot(c_t) + ε / |V|,    (2)

where ε is the label-smoothing parameter. Since LS(c_t) is constrained to the probability simplex ∆^{|V|−1}, we set each z_t to log LS(c_t) ∈ R^{|V|} as the initialization for optimizing the soft solution in the next iteration.
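The label-smoothing warm start can be sketched as follows. This is a toy numpy illustration; in particular, spreading the ε mass uniformly over the whole vocabulary is an assumption about the exact LS form, and the key property used by the method is only that the result lies strictly inside the simplex so its log is finite:

```python
import numpy as np

def label_smooth(word_id: int, vocab_size: int, eps: float = 0.1) -> np.ndarray:
    """Soften a one-hot vector into the interior of the probability simplex.
    Uniform allocation of the eps mass is an assumption for illustration."""
    p = np.full(vocab_size, eps / vocab_size)
    p[word_id] += 1.0 - eps
    return p

ls = label_smooth(word_id=2, vocab_size=5, eps=0.1)
z_init = np.log(ls)   # z_t = log LS(c_t): warm start for the next soft iteration
```

Because every entry of LS(c_t) is strictly positive, log LS(c_t) is a valid logit vector that softmax maps back (approximately) to the hard word, letting gradient ascent resume from the beam-search solution.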

Constrained Collisions
The Aggressive approach is very effective at finding collisions, but it can output nonsensical sentences. Since these sentences have high perplexity under a language model (LM), simple filtering can eliminate them from consideration. To evade perplexity-based filtering, we impose a soft constraint on collision generation and jointly maximize target similarity and LM likelihood:

arg max_{c ∈ X} (1 − β) · S(x, c) + β · log P(c; g),    (3)

where P(c; g) is the LM likelihood for collision c under a pre-trained LM g and β ∈ [0, 1] is an interpolation coefficient. We investigate two different approaches for solving the optimization in equation 3: (a) adding a regularization term on the soft č to approximate the LM likelihood, and (b) steering a pre-trained LM to generate natural-looking c.

Regularized Aggressive Collisions
Given a language model g, we can incorporate a soft version of the LM likelihood as a regularization term Ω on the soft aggressive č computed from the variables [z_1, . . . , z_T]:

Ω(č) = Σ_{t=1}^{T} H(č_t, P(w_t | č_{1:t−1}; g)),    (4)

where H(·, ·) is cross entropy and P(w_t | č_{1:t−1}; g) are the next-token prediction probabilities at t given the partial soft collision č_{1:t−1}. Equation 4 relaxes the LM likelihood on hard collisions by using soft collisions as input, and can be added to the objective function for gradient optimization. The variables z_t after optimization will favor words that maximize the LM likelihood.
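Treating the regularizer as a sum of per-position cross-entropies between the soft word and the LM's next-token distribution, a minimal numpy sketch looks like this. The LM probabilities are supplied as a precomputed array here; a real g would be an autoregressive transformer run on the soft prefix:

```python
import numpy as np

def soft_lm_regularizer(soft_words: np.ndarray, lm_probs: np.ndarray) -> float:
    """Sum over positions t of H(soft word at t, LM next-token distribution at t).
    soft_words[t] is the relaxed word; lm_probs[t] is P(w_t | prefix; g)."""
    eps = 1e-12                                    # avoid log(0)
    return float(-(soft_words * np.log(lm_probs + eps)).sum())

# The penalty is smaller when the soft words agree with the LM's predictions,
# so minimizing it steers z_t toward fluent word choices.
agree = soft_lm_regularizer(np.array([[0.9, 0.1]]), np.array([[0.9, 0.1]]))
disagree = soft_lm_regularizer(np.array([[0.9, 0.1]]), np.array([[0.1, 0.9]]))
```

Since every operation is differentiable, this term can be added to the similarity objective and optimized jointly by gradient ascent on the z_t.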
To further reduce the perplexity of c, we exploit the degeneration property of LMs, i.e., the observation that an LM assigns low perplexity to repeating common tokens (Holtzman et al., 2020), and constrain a span of consecutive tokens in c (e.g., the second half of c) to be selected from the most frequent English words instead of the entire V. This modification produces even more disfluent collisions, but they evade LM-based filtering.

Natural Collisions
Our final approach aims to produce fluent, low-perplexity outputs. Instead of relaxing and then searching, we search and then relax at each step of equation 3. This lets us integrate a hard language model while selecting next words in continuous space. At each step t, we maximize:

max_{w ∈ V} (1 − β) · S(x, c_{1:t−1} ⊕ w) + β · log P(c_{1:t−1} ⊕ w; g),    (5)

where c_{1:t−1} is the beam solution found before t. This sequential optimization is essentially LM decoding with a joint search on the LM likelihood and the target similarity S of the collision prefix.
Optimizing equation 5 exactly requires ranking each w ∈ V based on the LM likelihood log P(c_{1:t−1} ⊕ w; g) and the similarity S(x, c_{1:t−1} ⊕ w). Evaluating the LM likelihood for every word at each step is efficient because we can cache log P(c_{1:t−1}; g) and compute the next-word probability in the standard manner. However, evaluating an arbitrary similarity function S(x, c_{1:t−1} ⊕ w) for all w ∈ V requires |V| forward passes through f, which can be computationally expensive.
Perturbing LM logits. Inspired by Plug and Play LM (Dathathri et al., 2020), we modify the LM logits to take similarity into account. We first let ℓ_t = g(c_{1:t−1}) be the next-token logits produced by LM g at step t. We then optimize from this initialization to find an update that favors words maximizing similarity. Specifically, we let z_t = ℓ_t + δ_t, where δ_t ∈ R^{|V|} is a perturbation vector. We then take a small number of gradient steps on the relaxed similarity objective max_{δ_t} S(x, c_{1:t−1} ⊕ č_t), where č_t is the relaxed soft word as in equation 1. This encourages the next-word prediction distribution from the perturbed logits, č_t, to favor words that are likely to collide with the input x.
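The logit-perturbation step can be sketched in numpy. This toy version (not the paper's implementation) replaces backpropagation through f with a supplied gradient function `sim_grad_fn`, a hypothetical stand-in for ∇_{z_t} S on the relaxed soft word:

```python
import numpy as np

def soft_word(z: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Relaxed word distribution softmax(z / tau)."""
    p = np.exp((z - z.max()) / tau)
    return p / p.sum()

def perturb_logits(lm_logits: np.ndarray, sim_grad_fn, steps: int = 5,
                   lr: float = 0.5) -> np.ndarray:
    """Start from the LM's next-token logits and take a few ascent steps on a
    perturbation delta so the resulting distribution favors high-similarity words."""
    delta = np.zeros_like(lm_logits)
    for _ in range(steps):
        delta += lr * sim_grad_fn(lm_logits + delta)   # gradient ascent on delta
    return soft_word(lm_logits + delta)

# Toy stand-in for the similarity gradient: it pushes mass toward word 0,
# as if word 0 were the one that collides with the input x.
target = np.array([1.0, 0.0, 0.0])
sim_grad = lambda z: target - soft_word(z)
lm_logits = np.array([0.0, 2.0, 1.0])                  # the LM itself favors word 1
p_perturbed = perturb_logits(lm_logits, sim_grad)
```

After a few steps the distribution shifts toward the similarity-preferred word while starting from, and staying anchored to, the LM's own prediction.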
Joint beam search. After perturbation at each step t, we find the top K most likely words in č_t. This allows us to evaluate S(x, c_{1:t−1} ⊕ w) only for the subset of words w that are likely under the LM given the current beam context. We rank these top K words based on the interpolation of target similarity and LM log likelihood. We assign a score to each beam b and each of the top K words as in equation 5, and update the beams with the top-scored words.
This process leads to a natural-looking decoded sequence because each step utilizes the true words as input. As we build up a sequence, the search at each step is guided by the joint score of two objectives, semantic similarity and fluency.

Experiments
Baseline. We use a simple greedy baseline based on HotFlip (Ebrahimi et al., 2018). We initialize the collision text with a sequence of repeating words, e.g., "the", and iteratively replace all words. In each iteration, we look at every position t and flip the current w_t to the v that maximizes the first-order Taylor approximation of the target similarity S:

arg max_{v ∈ V} (e_v − e_t)^T ∇_{e_t} S,    (6)

where e_t, e_v are the word vectors for w_t and v. Following prior HotFlip-based attacks (Michel et al., 2019; Wallace et al., 2019, 2020), we evaluate S on the top K words from equation 6 and flip to the word with the lowest loss to counter the local approximation.

LM for natural collisions. For generating natural collisions, we need an LM g that shares its vocabulary with the target model f. When targeting models that do not share a vocabulary with an available LM, we fine-tune another BERT with an autoregressive LM task on the Wikitext-103 dataset (Merity et al., 2017). When targeting models based on RoBERTa, we use pretrained GPT-2 (Radford et al., 2019) as the LM since the vocabulary is shared.
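The HotFlip candidate selection above reduces to a single matrix-vector product: score every vocabulary word v by the first-order gain (e_v − e_t)^T ∇_{e_t} S and keep the top K. A minimal numpy sketch with toy embeddings (not the baseline's actual code):

```python
import numpy as np

def hotflip_candidates(grad_e_t: np.ndarray, e_t: np.ndarray,
                       embedding_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the top-k word ids maximizing the first-order Taylor gain
    (e_v - e_t)^T grad_{e_t} S over the whole vocabulary."""
    gains = (embedding_matrix - e_t) @ grad_e_t    # one gain per vocabulary word
    return np.argsort(-gains)[:k]

# Toy 3-word vocabulary with 2-d embeddings. The gradient points along
# dimension 0, so the best flip is the word moving furthest in that direction.
E = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [-1.0, 0.0]])
cands = hotflip_candidates(grad_e_t=np.array([1.0, 0.0]), e_t=E[0],
                           embedding_matrix=E, k=2)
```

In the full baseline, each of the K candidates is then re-scored with the true similarity S, since the linear approximation is only locally accurate.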
Unrelatedness. To ensure that collisions c are not semantically similar to inputs x, we filter out words that are relevant to x from V when generating c. First, we discard non-stop words in x; then, we discard 500 to 2,000 words in V with the highest similarity score S(x, w).
Hyperparameters. We use Adam (Kingma and Ba, 2015) for gradient ascent. Detailed hyperparameter setup can be found in table 6 in Appendix A.
Notation. In the following sections, we abbreviate HotFlip baseline as HF; aggressive collisions as Aggr.; regularized aggressive collisions as Aggr. Ω where Ω is the regularization term in equation 4; and natural collisions as Nat.

Tasks and Models
We evaluate our attacks on paraphrase identification, document retrieval, response suggestion, and extractive summarization. Our models for these applications are pretrained transformers, including BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), fine-tuned on the corresponding task datasets and matching state-of-the-art performance.

Paraphrase detection.
We use the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) and Quora Question Pairs (QQP) (Iyer et al., 2017), and attack the first 1,000 paraphrase pairs from the validation set.
We target the BERT and RoBERTa base models for MRPC and QQP, respectively. The models take in concatenated inputs x_a, x_b and output the similarity score as S(x_a, x_b) = sigmoid(f(x_a ⊕ x_b)). We fine-tune them with the suggested hyperparameters. BERT achieves an 87.51% F1 score on MRPC and RoBERTa achieves 91.6% accuracy on QQP, consistent with prior work.

Document retrieval. We use the Common Core Tracks from 2017 and 2018 (Core17/18). They have 50 topics as queries and use articles from the New York Times Annotated Corpus and TREC Washington Post Corpus, respectively.
Our target model is Birch (Yilmaz et al., 2019a,b). Birch retrieves 1,000 candidate documents using the BM25 and RM3 baseline (Abduljaleel et al., 2004) and re-ranks them using the similarity scores from a fine-tuned BERT model. Given a query x_q and a document x_d, the BERT model assigns a similarity score S(x_q, x_i) for each sentence x_i in x_d. The final score used by Birch for re-ranking is:

S_f(x_q, x_d) = γ · S_BM25(x_q, x_d) + (1 − γ) · Σ_i κ_i · S(x_q, x_i),    (7)

where S_BM25 is the baseline BM25 score, γ, κ_i are weight coefficients, and the sum is taken over the top-scoring sentences in x_d. We use the published models and coefficient values for evaluation.
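Birch's interpolated score can be sketched in a few lines, which also makes the attack surface obvious: raising the score of a single inserted sentence raises the whole document's score. This is an illustrative sketch of the interpolation, with hypothetical coefficient values, not Birch's released code:

```python
def birch_score(bm25: float, sent_scores, gamma: float, kappas) -> float:
    """Interpolate the document-level BM25 score with the top BERT sentence
    scores, weighted by the kappa coefficients (sketch of Birch's re-ranking)."""
    top = sorted(sent_scores, reverse=True)[:len(kappas)]
    return gamma * bm25 + (1.0 - gamma) * sum(k * s for k, s in zip(kappas, top))

# Inserting a colliding sentence (score 0.99) into an irrelevant document
# raises its final score even though BM25 is unchanged (collisions contain
# no query terms, so term frequencies are unaffected).
before = birch_score(10.0, [0.2, 0.1], gamma=0.5, kappas=[1.0])
after = birch_score(10.0, [0.2, 0.1, 0.99], gamma=0.5, kappas=[1.0])
```

Because only the maximum sentence scores enter the sum, a single high-scoring collision dominates the BERT term of the interpolation.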
We attack similarity scores S(x q , x i ) by inserting sentences that collide with x q into irrelevant x d . We filter out query words when generating collisions c so that term frequencies of query words in c are 0, thus inserting collisions does not affect the original S BM25 . For each of the 50 query topics, we select irrelevant articles that are ranked from 900 to 1000 by Birch and insert our collisions into these articles to boost their ranks.
Response suggestion. We use the Persona-chat (Chat) dataset of dialogues (Zhang et al., 2018). The task is to pick the correct utterance in each dialogue context from 20 choices. We attack the first 1,000 contexts from the validation set.
We use transformer-based Bi- and Poly-encoders that achieved state-of-the-art results on this dataset (Humeau et al., 2020). Bi-encoders compute a similarity score s(f(x_a), f(x_b)) between sentence-level embeddings of the dialogue context x_a and each possible next utterance x_b. Poly-encoders additionally attend over the context tokens, aggregating the context representation as Σ_i α_i · f(x_a)_i, where α_i is the weight from attention and f(x_a)_i is the ith token's contextualized representation. We use the published models for evaluation.

Extractive summarization.
We use the CNN / DailyMail (CNNDM) dataset (Hermann et al., 2015), which consists of news articles and labeled overview highlights. We attack the first 1,000 articles from the validation set.
Our target model is PreSumm (Liu and Lapata, 2019). Given a text x_d, PreSumm first obtains a vector representation φ_i ∈ R^h for each sentence x_i using BERT, and scores each sentence as S(x_d, x_i) = sigmoid(u^T f(φ_1, . . . , φ_n)_i), where u is a weight vector, f is a sentence-level transformer, and f(·)_i is the ith sentence's contextualized representation. Our objective is to insert a collision c into x_d such that S(x_d, c) ranks high among all sentences. We use the published models for evaluation.

Attack Results
For all attacks, we report the similarity score S between x and c; the "gold" baseline is the similarity between x and the ground truth. For MRPC, QQP, Chat, and CNNDM, the ground truth is the annotated label sentences (e.g., paraphrases or summaries); for Core17/18, we use the sentences with the highest similarity S to the query. For MRPC and QQP, we also report the percentage of successful collisions with S > 0.5. For Core17/18, we report the percentage of irrelevant articles ranking in the top-10 and top-100 after inserting collisions. For Chat, we report the percentage of collisions achieving top-1 rank. For CNNDM, we report the percentage of collisions with the top-1 and top-3 ranks (likely to be selected as summary).

Table 2 shows the results. On MRPC, aggressive and natural collisions achieve around 98% success; aggressive ones have higher similarity S. With regularization Ω, the success rate drops to 81%. On QQP, aggressive collisions achieve 97% vs. 90% for constrained collisions.
On Core17/18, aggressive collisions shift the rank of almost half of the irrelevant articles into the top 10. Regularized and natural collisions are less effective, but more than 60% are still ranked in the top 100. Note that query topics are compact phrases with narrow semantics, thus it might be harder to find constrained collisions for them.
On Chat, aggressive collisions achieve a rank of 1 more than 99% of the time for both Bi- and Poly-encoders. With regularization Ω, success drops slightly to above 90%. Natural collisions are less successful, with 86% ranked as 1.
On CNNDM, aggressive collisions are almost always ranked as the top summarizing sentence. HotFlip and regularized collisions are in the top 3 more than 96% of the time. Natural collisions perform worse, with 77% ranked in the top 3.
Aggressive collisions always beat HotFlip on all tasks; constrained collisions are often better, too. The similarity scores S for aggressive collisions are always higher than for the ground truth.

Evaluating Unrelatedness
We use BERTSCORE (Zhang et al., 2020) to demonstrate that our collisions are unrelated to the target inputs. Instead of exact matches in raw texts, BERTSCORE computes a semantic similarity score, ranging from -1 to 1, between a candidate and a reference by using contextualized representation for each token in the candidate and reference.
The baseline for comparisons is BERTSCORE between the target input and the ground truth. For MRPC and QQP, we use x as the reference; the ground truth is the given paraphrase. For Core17/18, we use x concatenated with the top sentences except the one with the highest S as the reference; the ground truth is the sentence in the corpus with the highest S. For Chat, we use the dialogue contexts as the reference and the labeled response as the ground truth. For CNNDM, we use the labeled summarizing sentences in articles as the reference and the given abstractive summarization as the ground truth.

For MRPC, QQP and CNNDM, we report the F_BERT (F1) score. For Core17/18 and Chat, we report P_BERT (content from the reference found in the candidate) because the references are longer and not token-wise equivalent to collisions or the ground truth.

Table 3 shows the results. The scores for collisions are all negative while the scores for target inputs are positive, indicating that our collisions are unrelated to the target inputs. Since aggressive and regularized collisions are nonsensical, their contextualized representations are less similar to the reference texts than those of natural collisions.

Transferability of Collisions
To evaluate whether collisions generated for one target model f are effective against a different model f′, we use the MRPC and Chat datasets. For MRPC, we set f′ to a BERT base model trained with a different random seed and to a RoBERTa model. For Chat, we use the Poly-encoder as f′ for the Bi-encoder f, and vice versa. Both Poly-encoder and Bi-encoder are fine-tuned from the same pretrained transformer model. We report the percentage of successfully transferred attacks, i.e., S(x, c) > 0.5 for MRPC and rank r = 1 for Chat.

Table 5 summarizes the results. All collisions achieve some transferability (40% to 70%) if the model architecture is the same and f, f′ are fine-tuned from the same pretrained model. Furthermore, our attacks produce more transferable collisions than the HotFlip baseline. No attacks transfer if f, f′ are fine-tuned from different pretrained models (BERT and RoBERTa). We leave a study of transferability of collisions across different types of pretrained models to future work.

Mitigation
Perplexity-based filtering. Because our collisions are synthetic rather than human-generated texts, it is possible that their perplexity under a language model (LM) is higher than that of real text. Therefore, one plausible mitigation is to filter out collisions by setting a threshold on LM perplexity. Figure 2 shows perplexity measured using GPT-2 (Radford et al., 2019) for real data and collisions for each of our attacks. We observe a gap between the distributions of real data and aggressive collisions, showing that it might be possible to find a threshold that discards aggressive collisions while retaining the bulk of the real data. On the other hand, constrained collisions (regularized or natural) overlap with the real data.
We quantitatively measure the effectiveness of perplexity-based filtering using thresholds that would discard 80% and 90% of collisions, respectively. Table 4 shows the false positive rate, i.e., the fraction of the real data that would be mistakenly filtered out. Both HotFlip and aggressive collisions can be filtered out with little to no false positives since both are nonsensical. For regularized or natural collisions, a substantial fraction of the real data would be lost, while 10% or 20% of collisions evade filtering. On MRPC and Chat, perplexity-based filtering is least effective, discarding around 85% to 90% of the real data.
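This threshold-and-false-positive measurement can be sketched directly: fix the fraction of collisions to discard, derive the perplexity threshold from the collision distribution, and measure how much real data the same threshold would remove. The perplexity values below are toy numbers for illustration, not the paper's measurements:

```python
import numpy as np

def fpr_at_discard_rate(real_ppl, collision_ppl, discard_frac=0.8):
    """Choose the threshold that discards `discard_frac` of collisions
    (those with perplexity >= threshold) and report the fraction of
    real data that the same threshold would mistakenly discard."""
    threshold = np.quantile(np.asarray(collision_ppl), 1.0 - discard_frac)
    return float(np.mean(np.asarray(real_ppl) >= threshold))

# Toy perplexities: aggressive collisions sit far above real text, so filtering
# them is nearly free; natural collisions overlap with real text, so filtering
# them destroys much of the real data.
real = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
aggressive = np.array([900.0, 1000.0, 1100.0, 1200.0, 1300.0])
natural = np.array([25.0, 35.0, 45.0, 55.0, 65.0])
fpr_aggressive = fpr_at_discard_rate(real, aggressive)
fpr_natural = fpr_at_discard_rate(real, natural)
```

The gap between the two false positive rates mirrors the qualitative finding above: the defense works only against collisions whose perplexity distribution is separable from real text.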
Learning-based filtering.
Recent works explored automatic detection of generated texts using a binary classifier trained on human-written and machine-generated data (Zellers et al., 2019; Ippolito et al., 2020). These classifiers might be able to filter out our collisions, assuming that the adversary is not aware of the defense.
As a general evaluation principle (Carlini et al., 2019), any defense mechanism should assume that the adversary has complete knowledge of how the defense works. In our case, a stronger adversary may use the detection model to craft collisions to evade the filtering. We leave a thorough evaluation of these defenses to future work.
Adversarial training. Including adversarial examples during training can be effective against inference-time attacks (Madry et al., 2018). Similarly, training with collisions might increase models' robustness against collisions. Generating collisions for each training example in each epoch can be very inefficient, however, because it requires additional search on top of gradient optimization. We leave adversarial training to future work.

Conclusion
We demonstrated a new class of vulnerabilities in NLP applications: semantic collisions, i.e., input pairs that are unrelated to each other but perceived by the application as semantically similar. We developed gradient-based search algorithms for generating collisions and showed how to incorporate constraints that help generate more "natural" collisions. We evaluated the effectiveness of our attacks on state-of-the-art models for paraphrase identification, document and sentence retrieval, and extractive summarization. We also demonstrated that simple perplexity-based filtering is not sufficient to mitigate our attacks, motivating future research on more effective defenses.

Hyper-parameters.
We report the hyperparameter values for our experiments in Table 6. The label-smoothing parameter ε for aggressive collisions is set to 0.1. The hyperparameters for the baseline are the same as for aggressive collisions.
Runtime. On a single GeForce RTX 2080 GPU, our attacks generate collisions in 10 to 60 seconds depending on the length of target inputs.

B Additional Collision Examples