BERT-ATTACK: Adversarial Attack against BERT Using BERT

Adversarial attacks on discrete data (such as text) have proved significantly more challenging than those on continuous data (such as images), since it is difficult to generate adversarial samples with gradient-based methods. Currently, successful attack methods for text usually adopt heuristic replacement strategies at the character or word level, and it remains challenging to find the optimal solution in the massive space of possible replacement combinations while preserving semantic consistency and language fluency. In this paper, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using pre-trained masked language models exemplified by BERT. We turn BERT against its fine-tuned models and other deep neural models for downstream tasks. Our method successfully misleads the target models into predicting incorrectly, outperforming state-of-the-art attack strategies in both success rate and perturbation percentage, while the generated adversarial samples are fluent and semantically preserved. In addition, the computational cost is low, which makes large-scale generation feasible.


Introduction
Despite the success of deep learning, recent works have found that neural networks are vulnerable to adversarial samples, which are crafted by applying small perturbations to the original inputs (Goodfellow et al., 2014; Kurakin et al., 2016; Chakraborty et al., 2018). That is, these adversarial samples are imperceptible to human judges, yet they can mislead neural networks into making incorrect predictions. Therefore, it is essential to explore these adversarial attack methods, since the ultimate goal is to make sure that neural networks are highly reliable and robust. While both attack strategies and their defense countermeasures are well explored in computer vision (Chakraborty et al., 2018), adversarial attacks on text remain challenging due to the discrete nature of language. Adversarial samples for text need to possess the following qualities: (1) they are imperceptible to human judges yet misleading to neural models; (2) they are fluent in grammar and semantically consistent with the original inputs.
Previous methods craft adversarial samples mainly based on specific rules (Li et al., 2018; Gao et al., 2018; Jin et al., 2019). Therefore, these methods struggle to guarantee fluency and semantic preservation in the generated adversarial samples. These manually crafted methods are also rather complicated, since they are designed with multiple linguistic constraints such as NER tagging or POS tagging. Introducing contextualized language models as an automatic perturbation generator could make the design of such rules much easier.
The recent rise of pre-trained language models, such as BERT (Devlin et al., 2018), has pushed the performance of NLP tasks to a new level. On the one hand, the powerful ability of a fine-tuned BERT on downstream tasks makes it more challenging to attack adversarially (Jin et al., 2019). On the other hand, BERT is a masked language model pre-trained on extremely large-scale unsupervised data and has learned general-purpose language knowledge. Therefore, BERT has the potential to generate more fluent and semantically consistent substitutions for an input text. Naturally, both properties of BERT motivate us to explore the possibility of attacking a fine-tuned BERT with another BERT as the attacker.
In this paper, we propose an effective and high-quality adversarial sample generation method: BERT-Attack, which uses BERT as a language model to generate adversarial samples. The core algorithm of BERT-Attack is straightforward and consists of two stages: finding the vulnerable words in a given input sequence for the target model, then applying BERT to generate substitutes for the vulnerable words. With the powerful ability of BERT, the perturbations are generated with the surrounding context taken into account. Therefore, the perturbations are fluent and reasonable. We use the masked language model as a perturbation generator and find perturbations that maximize the risk of making wrong predictions (Goodfellow et al., 2014). Unlike previous attack strategies that require traditional unidirectional language models as a constraint, we only need to run inference on the language model once as a perturbation generator, rather than repeatedly using language models to score the generated adversarial samples in a trial-and-error process.
Experimental results show that the proposed BERT-Attack method successfully fools its fine-tuned downstream model with the highest attack success rate compared with previous methods. Meanwhile, the perturbation percentage and the query number are considerably low, while the semantic preservation is high.
To summarize our main contributions:
• We propose a simple and effective method, BERT-Attack, to generate fluent and semantically preserved adversarial samples that can successfully mislead state-of-the-art models in NLP, such as fine-tuned BERT for various downstream tasks.
• BERT-Attack achieves a higher attack success rate and a lower perturbation percentage with fewer queries to the target model than previous attack algorithms, and it requires no extra scoring models, which makes it highly efficient.
• We can generate adversarial samples with BERT-Attack as a parallel dataset for further research on the robustness of NLP models.

Related Work
To explore the robustness of neural networks, adversarial attacks have been extensively studied for continuous data (such as images) (Goodfellow et al., 2014; Nguyen et al., 2015; Chakraborty et al., 2018). The key idea is to find a minimal perturbation that maximizes the risk of making wrong predictions. This minimax problem can be readily solved by applying gradient-based optimization over the continuous space of images. However, adversarial attacks on discrete data such as text remain challenging.
Adversarial Attack for Text. Current successful attacks for text usually adopt heuristic rules to modify the characters of a word (Jin et al., 2019) or substitute words with synonyms (Ren et al., 2019). Li et al. (2018) and Gao et al. (2018) apply perturbations based on word embeddings such as GloVe (Pennington et al., 2014), which are not strictly semantically and grammatically coordinated. Alzantot et al. (2018) adopt language models to score the perturbations generated by searching for words of close meaning in the GloVe (Pennington et al., 2014) embeddings, using trial and error to find possible perturbations, yet the perturbations generated are still not context-aware and rely heavily on the cosine similarity of word embeddings. Closeness under cosine similarity in GloVe embeddings does not guarantee semantic similarity, so the perturbations are less semantically consistent. Jin et al. (2019) apply a semantically enhanced embedding (Mrkšić et al., 2016), which is context-unaware and thus less consistent with the unperturbed inputs. Other work uses phrase-level insertion and deletion, which produces unnatural sentences that are inconsistent with the original inputs and lack fluency control. To preserve semantic information, Glockner et al. (2018) replace words manually to break language inference systems (Bowman et al., 2015). Jia and Liang (2017) propose manually crafted methods to attack machine reading comprehension systems. Lei et al. (2019) introduce replacement strategies using embedding transition.
Although the above approaches have achieved good results, there is still much room for improvement regarding the perturbation percentage, attack success rate, grammatical correctness, and semantic consistency. Moreover, the substitution strategies of these approaches are usually non-trivial, which limits them to specific tasks.
Adversarial Attack against BERT. Pre-trained language models have become the mainstream for many NLP tasks. Works such as Wallace et al. (2019), Jin et al. (2019), and Pruthi et al. (2019) have explored these pre-trained language models from many different angles. Wallace et al. (2019) explored the possible ethical problems of learned knowledge in pre-trained models.
From our perspective, we take the idea of turning such language models against themselves. Therefore, we introduce a novel BERT-Attack algorithm to attack the fine-tuned models.
Motivated by the interesting idea of turning BERT against BERT, we propose BERT-Attack, using the original BERT model to craft adversarial samples to fool the fine-tuned BERT model.
Our method consists of two steps: (1) finding the vulnerable words for the target model and then (2) replacing them with semantically similar and grammatically correct words until the attack succeeds.
The most vulnerable words are the key words that help the target model make judgments. Perturbations over these words can be most beneficial in crafting adversarial samples. After finding the words that we aim to perturb, we use a masked language model to generate perturbations based on its top-K predictions.

Finding Vulnerable Words
Under the black-box scenario, the logit output by the target model (fine-tuned BERT or other neural models) is the only supervision we can get. We first select the words in the sequence that have a highly significant influence on the final output logit.
Let $S = [w_0, \cdots, w_i, \cdots]$ denote the input sentence and $o_y(S)$ denote the logit output by the target model for the correct label $y$. The importance score $I_{w_i}$ is defined as
$$I_{w_i} = o_y(S) - o_y(S_{\backslash w_i}),$$
where $S_{\backslash w_i} = [w_0, \cdots, w_{i-1}, [\mathrm{MASK}], w_{i+1}, \cdots]$ is the sentence after replacing $w_i$ with [MASK]. Then we rank all the words according to the ranking score $I_{w_i}$ in descending order to create the word list $L$. We only take the top $\epsilon$ percent of the most important words, since we tend to keep perturbations minimal.
This process maximizes the risk of making wrong predictions, which in the image domain is done by calculating gradients. The problem is then formulated as replacing these most vulnerable words with semantically consistent perturbations.
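The ranking step can be illustrated with a minimal sketch. It assumes that the importance of a word is the drop in the gold-label logit when that word is replaced by [MASK]; the function `rank_words_by_importance`, the callable `logit_for_gold`, and the toy scorer `toy_logit` are hypothetical names introduced here for illustration, and a real attack would query the fine-tuned target model instead.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"


def rank_words_by_importance(
    words: List[str],
    logit_for_gold: Callable[[List[str]], float],
) -> List[Tuple[int, float]]:
    """Return (position, score) pairs sorted by descending importance.

    score_i = o_y(S) - o_y(S with word i replaced by [MASK])
    """
    base = logit_for_gold(words)
    scores = []
    for i in range(len(words)):
        masked = words[:i] + [MASK] + words[i + 1:]
        scores.append((i, base - logit_for_gold(masked)))
    # Stable sort: ties keep their original left-to-right order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for the target model: the gold-label logit is a
# sum of hand-picked per-word weights (unknown words contribute 0).
TOY_WEIGHTS = {"great": 2.0, "movie": 0.5}


def toy_logit(words: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(w, 0.0) for w in words)


sentence = ["a", "great", "movie"]
ranking = rank_words_by_importance(sentence, toy_logit)
print(ranking)
```

Each call to `logit_for_gold` corresponds to one query of the target model, so ranking a sentence of n words costs n + 1 queries in this sketch.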

Word Replacement via BERT
After finding the vulnerable words, we iteratively replace the words in list L one by one to find perturbations that can mislead the target model. Previous approaches usually use multiple human-crafted rules to ensure that the generated example is semantically consistent with the original one and grammatically correct. These strategies for finding substitutes are unaware of the context around the perturbed positions, and thus are insufficient in fluency control and semantic consistency. More importantly, using language models or POS checkers to score the perturbed samples is costly, since this trial-and-error process requires massive inference time.
To overcome the lack of fluency control and semantic preservation that comes with using synonyms or similar words in the embedding space, we leverage BERT for word replacement. The nature of the masked language model ensures that the generated sentences are relatively fluent and grammatically correct, and that they preserve most of the semantic information. Further, compared with previous approaches using rule-based perturbation strategies, the masked language model prediction is context-aware, and thus dynamically searches for perturbations rather than simply replacing synonyms. Unlike previous methods that use complicated strategies to score and constrain the perturbations, the contextualized perturbation generator generates minimal perturbations with only one forward pass. The only time-consuming part is querying the target model, with no additional models run to score the sentence, which makes the method highly efficient.
Thus, using the masked language model as a contextualized perturbation generator can be one possible solution to craft high-quality adversarial samples efficiently.
Algorithm 1: BERT-Attack
[...]
// sort $S$ using $I_{w_i}$ in descending order and collect the top-K words
procedure REPLACEMENT USING BERT
    $H = [h_0, \cdots, h_n]$ // sub-word tokenized sentence
    generate top-K candidates for all sub-words using BERT and get $P \in \mathbb{R}^{n \times K}$
    if $w_j$ is a whole word then
        [...]
    $S^{adv} = [w_0, \cdots, w_{j-1}, c, \cdots]$ // do one perturbation
    [...]
    return None

Word Replacement Strategy
As seen in Figure 1, given a chosen word w to be replaced, we apply BERT to predict possible words that are similar to w yet can mislead the target model. Instead of following the masked language model setting, we do not mask the chosen word w and use the original sequence as input, which generates more semantically consistent substitutes. For instance, given the sequence "I like the cat.", if we mask the word cat, it would be very hard for a masked language model to predict the original word cat, since the sequence "I like the dog." would be just as fluent. Further, if we masked out the given word w, we would have to rerun the masked language model prediction in every iteration, which is costly.
Let M denote the BERT model. We feed the tokenized sequence H into M to get the output prediction P = M(H). Instead of using the argmax prediction, we take the K most probable predictions at each position, where K is a hyperparameter.
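Taking the K most probable predictions at each position can be sketched as follows. This is only an illustration in plain Python: the `probs` matrix stands in for the per-position output distribution of the masked language model, and `top_k_candidates`, `vocab`, and the numbers are hypothetical.

```python
from typing import List


def top_k_candidates(
    probs: List[List[float]], vocab: List[str], k: int
) -> List[List[str]]:
    """For each position, return the k vocabulary items with highest probability."""
    result = []
    for row in probs:
        # Indices of the row sorted by descending probability.
        order = sorted(range(len(row)), key=lambda idx: row[idx], reverse=True)
        result.append([vocab[idx] for idx in order[:k]])
    return result


# Toy vocabulary and a toy output distribution for a 2-token input.
vocab = ["cat", "dog", "film", "movie"]
probs = [
    [0.6, 0.3, 0.05, 0.05],  # position 0
    [0.1, 0.1, 0.3, 0.5],    # position 1
]
candidates = top_k_candidates(probs, vocab, k=2)
print(candidates)
```

Because the candidates for every position come from a single forward pass, the matrix only needs to be computed once per input sequence.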
We iterate over the words sorted by the word importance ranking process to find perturbations. The BERT model uses BPE encoding to construct its vocabulary. While most words remain single tokens, rare words are tokenized into sub-words. We treat single words and sub-words separately when generating substitutes.
Single words. For a single word $w_j$, we make attempts using the corresponding top-K prediction candidates $P^j$, the $K$ candidates predicted at position $j$. We first filter out stop words collected from NLTK; for sentiment classification tasks we also filter out antonyms using synonym dictionaries (Mrkšić et al., 2016), since the BERT masked language model does not distinguish synonyms from antonyms. Then, for a given candidate $c_k$, we construct a perturbed sequence $H' = [h_0, \cdots, h_{j-1}, c_k, h_{j+1}, \cdots]$. If the target model is already fooled into predicting incorrectly, we break the loop and obtain the final adversarial sample $H^{adv}$; otherwise, we pick the best perturbation among the filtered candidates and move on to the next word in word list L.
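The single-word step can be sketched as a small greedy loop. The sketch assumes that "best perturbation" means the candidate that lowers the gold-label score the most; `attack_single_word`, the `query` callable, and the toy model below are hypothetical, and antonym filtering is left out for brevity.

```python
from typing import Callable, List, Set, Tuple

# A query returns (predicted label, score of the gold label).
Query = Callable[[List[str]], Tuple[int, float]]


def attack_single_word(
    tokens: List[str],
    j: int,
    candidates: List[str],
    query: Query,
    gold_label: int,
    stop_words: Set[str],
) -> Tuple[List[str], bool]:
    """Try candidates at position j; return (sequence, attack_succeeded)."""
    best_tokens = list(tokens)
    _, best_score = query(best_tokens)
    for cand in candidates:
        # Skip stop words and the original word itself.
        if cand in stop_words or cand == tokens[j]:
            continue
        trial = tokens[:j] + [cand] + tokens[j + 1:]
        predicted, score = query(trial)
        if predicted != gold_label:
            return trial, True  # target model fooled: stop early
        if score < best_score:
            best_tokens, best_score = trial, score  # keep the best so far
    return best_tokens, False


# Hypothetical, deliberately brittle target model: label 1 iff score > 0.
TOY_WEIGHTS = {"great": 2.0, "nice": 0.4, "decent": -1.0}


def toy_query(tokens: List[str]) -> Tuple[int, float]:
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return (1 if score > 0 else 0), score


tokens = ["a", "great", "movie"]
adv, fooled = attack_single_word(
    tokens, 1, ["the", "nice", "decent"], toy_query, 1, {"a", "the"}
)
print(adv, fooled)
```

When no candidate flips the prediction, the returned sequence carries the strongest perturbation found, and the caller would continue with the next word in the ranked list.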
Sub-words. For a word that is tokenized into sub-words in BERT, we cannot obtain its substitutes directly. Thus we use the perplexity of sub-word combinations to craft word substitutes from predictions at the sub-word level. Given the sub-words $[h_0, h_1, \cdots, h_t]$ of word $w$, we list all possible combinations from the prediction $P \in \mathbb{R}^{t \times K}$ from M, which gives $K^t$ sub-word combinations, and we convert them back to normal words by reversing the BERT tokenization process. Then we use the perplexity of all combinations to select the top-K combinations; in this way, combinations that are unlikely to be a natural word are filtered out.
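The sub-word step can be sketched with `itertools.product`. The sketch assumes the `##` continuation prefix used by BERT tokenizers when merging pieces back into words, and it replaces the perplexity computation with a hypothetical lookup table (`TOY_PERPLEXITY`); lower scores are treated as more natural.

```python
from itertools import product
from typing import Callable, List


def merge_wordpieces(pieces) -> str:
    """Reverse tokenization: drop the '##' continuation prefix and concatenate."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)


def subword_candidates(
    per_piece_topk: List[List[str]],
    perplexity: Callable[[str], float],
    k: int,
) -> List[str]:
    """Enumerate all K^t combinations and keep the k with lowest perplexity."""
    combos = [merge_wordpieces(c) for c in product(*per_piece_topk)]
    unique = list(dict.fromkeys(combos))  # drop duplicates, keep order
    return sorted(unique, key=perplexity)[:k]


# Hypothetical scores; unseen strings get a large default perplexity.
TOY_PERPLEXITY = {"playing": 12.0, "worked": 20.0, "played": 35.0}


def toy_perplexity(word: str) -> float:
    return TOY_PERPLEXITY.get(word, 1000.0)


# Two sub-word positions (t = 2) with K = 2 candidates each: 4 combinations.
per_piece = [["play", "work"], ["##ing", "##ed"]]
all_words = [merge_wordpieces(c) for c in product(*per_piece)]
top_words = subword_candidates(per_piece, toy_perplexity, k=2)
print(all_words, top_words)
```

Since the number of combinations grows as $K^t$, a practical implementation would likely cap K or t for words with many sub-words; that cap is not specified in the text above.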
Then we replace the original word with the most likely perturbation and repeat this process, iterating over the word importance ranking list to find the final adversarial sample. In this way, we acquire the adversarial samples $S^{adv}$ efficiently, since we run the masked language model only once and generate perturbations from it without other checking strategies.
We summarize the two-step BERT-Attack process in Algorithm 1.

Datasets
We apply our method to attack different types of NLP tasks in the form of text classification and natural language inference. Following Jin et al. (2019), we evaluate our method on 1k test samples randomly selected from the test set of each task, using the same splits as Alzantot et al. (2018) and Jin et al. (2019).

Text Classification
We use different types of text classification tasks to study the effectiveness of our method.
• Yelp Review classification dataset. Following Zhang et al. (2015), we process the dataset to construct a polarity classification task.
• IMDB Document-level movie review dataset, where the average sequence length is longer than in the Yelp dataset. We process the dataset into a polarity classification task¹.
• AG's News Sentence-level news-type classification dataset, containing 4 types of news: World, Sports, Business, and Science.
• FAKE Fake news classification dataset from the Kaggle Fake News Challenge², where the task is to detect whether a news document is fake.

Natural Language Inference
• SNLI Stanford language inference task (Bowman et al., 2015). Given one premise and one hypothesis, the goal is to predict whether the hypothesis is an entailment, neutral, or a contradiction of the premise.
• MNLI Language inference dataset on multi-genre texts, covering transcribed speech, popular fiction, and government reports (Williams et al., 2018). Compared with the SNLI dataset, it is more complicated, with diversified written and spoken style texts, and it includes evaluation data matched with the training domains and evaluation data mismatched with the training domains.

Automatic Evaluation Metrics
To measure the quality of the generated samples, we set up various automatic evaluation metrics. The success rate, which is the counterpart of after-attack accuracy, is the core metric measuring the success of the attack method. Meanwhile, the perturbation percentage is also crucial since, generally, less perturbation results in more semantic consistency. Further, under the black-box setting, queries to the target model are the only accessible information, and issuing a large number of queries for a single sample is impractical. Thus the query number per sample is also a key metric. As in TextFooler (Jin et al., 2019), we also use the Universal Sentence Encoder (Cer et al., 2018) to measure the semantic consistency between the adversarial sample and the original sequence. To balance semantic preservation and attack success rate, we set a threshold on the semantic similarity score to filter out the less similar examples.
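The three count-based metrics can be made concrete with a short sketch. It assumes word-for-word substitutions (so original and adversarial sequences have equal length) and computes the success rate over the attacked samples; the `AttackRecord` type and the helper names are hypothetical, and the semantic similarity score is omitted because it requires a sentence encoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttackRecord:
    original: List[str]
    adversarial: List[str]  # same length as original (substitutions only)
    fooled: bool            # did the target model change its prediction?
    queries: int            # number of target-model queries used


def success_rate(records: List[AttackRecord]) -> float:
    """Fraction of attacked samples on which the attack succeeded."""
    return sum(r.fooled for r in records) / len(records)


def perturb_percentage(record: AttackRecord) -> float:
    """Percentage of positions whose word differs from the original."""
    changed = sum(a != b for a, b in zip(record.original, record.adversarial))
    return 100.0 * changed / len(record.original)


def average_queries(records: List[AttackRecord]) -> float:
    return sum(r.queries for r in records) / len(records)


records = [
    AttackRecord(["a", "great", "movie", "overall"],
                 ["a", "decent", "movie", "overall"], True, 10),
    AttackRecord(["not", "bad"], ["not", "bad"], False, 30),
]
print(success_rate(records), perturb_percentage(records[0]), average_queries(records))
```

Under this definition, after-attack accuracy on the attacked samples is simply one minus the success rate.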

Attacking Results
As shown in Table 1, the BERT-Attack method successfully fools its fine-tuned downstream models. In both text classification and natural language inference tasks, the fine-tuned BERT models fail to classify the generated adversarial samples correctly.
The average after-attack accuracy is lower than 10%, indicating that most samples are successfully perturbed to fool the state-of-the-art classification models. Meanwhile, the perturbation percentage is less than 10%, which is significantly lower than in previous works.
Further, BERT-Attack successfully attacks all tasks listed, which cover diverse domains such as news classification, review classification, and language inference in different domains. The results indicate that the attack method is robust across different tasks. Compared with the strong baseline introduced by Jin et al. (2019), the BERT-Attack method is more efficient and more imperceptible: both the query number and the perturbation percentage of our method are much lower.
We can observe that it is generally easier to attack the review classification tasks, since the perturbation percentage is remarkably low. BERT-Attack can mislead the target model by replacing only a handful of words. Since the average sequence length is relatively long, the target model tends to make judgments based on only a few words in a sequence, which is not the natural way humans make predictions. Thus, perturbing these keywords results in incorrect predictions from the target model, revealing its vulnerability.

Human Evaluations
For further evaluation of the generated adversarial samples, we set up human evaluations to measure the quality of the generated samples in fluency and grammar as well as semantic preservation.
We ask human judges to score the grammatical correctness of a mixture of generated adversarial samples and original sequences on a scale of 1 to 5, following Jin et al. (2019). Then we ask human judges to make predictions for the generated adversarial samples mixed with original samples. We use the IMDB and MNLI datasets, and for each task we select 100 original samples and 100 adversarial samples for the human judges.
As seen in Table 2, the semantic score and label prediction of the adversarial samples are close to those of the original ones. MNLI is a sentence-pair prediction task in which hypotheses are manually crafted based on premises, so the original pairs share a considerable number of words. Perturbations on these words make it difficult for human judges to predict correctly, so the accuracy is lower than on the simple sentence classification task.

BERT-Attack against Other Models
The BERT-Attack method is also applicable to attacking other target models, not only its own fine-tuned model. As seen in Table 3, the attack is successful against LSTM-based models, indicating that BERT-Attack is feasible for a wide range of models. Under BERT-Attack, the ESIM model is more robust on the MNLI dataset. We hypothesize that encoding the two sentences separately yields higher robustness.
The attack also performs well against BERT-large models, indicating that BERT-Attack is successful in attacking different pre-trained models and not only its own fine-tuned downstream models.

Importance of Candidate Numbers
The candidate pool range is the major hyperparameter used in the BERT-Attack algorithm. As seen in Figure 2, the attack rate rises as the candidate size increases. Intuitively, a larger K would result in less semantic similarity. However, the semantic measure via the Universal Sentence Encoder stays in a stable range (experiments show that semantic similarities drop by less than 2%), indicating that the candidates are all reasonable and semantically consistent with the original sentence.

Table 4: Transferability analysis. The column is the target model used in the attack, and the row is the tested model. All attacks use BERT-base as the masked language model.

Importance of Sequence Length
The BERT-Attack method is based on the contextualized masked language model. Thus the sequence length plays an important role in producing high-quality perturbations. Instead of attacking the hypothesis of the NLI task as previous methods do, we target the premises, whose average length is longer. This is because we believe that contextual replacement is less reasonable when dealing with extremely short sequences. To avoid this problem, we believe that many word-level synonym replacement strategies can be combined with BERT-Attack, making the BERT-Attack method more widely applicable.

Transferability and Adversarial Training
To test the transferability of the generated adversarial samples, we take samples aimed at different target models to attack other target models. Here, we use BERT-base as the masked language model for all target models. As seen in Table 4, samples are transferable in the NLI task but less transferable in text classification. Meanwhile, we further fine-tune the target model using adversarial samples generated from the training set and then test it on the test set used before. As seen in Table 5, the generated samples used in fine-tuning help the target model become more robust, while its accuracy stays close to that of the model trained on clean datasets. The attack becomes more difficult, indicating that the model is harder to attack. Therefore, the generated dataset can be used as additional data for further exploration of making neural models more robust.

Table 6: Some generated adversarial samples. Origin label is the correct prediction while label is adverse prediction. Only red color parts are perturbed. We only attack premises in MNLI task. Text in FAKE dataset and IMDB dataset is cut to fit in the table. Original text contains more than 200 words.
Dataset: IMDB
Ori (label: Negative): it is hard for a lover of the novel northanger abbey to sit through this bbc adaptation and to keep from throwing objects at the tv screen... why are so many facts concerning the tilney family and mrs . tilney ' s death altered unnecessarily ? to make the story more ' horrible ? '
Adv (label: Positive): it is hard for a lover of the novel northanger abbey to sit through this bbc adaptation and to keep from throwing objects at the tv screen... why are so many facts concerning the tilney family and mrs . tilney ' s death altered unnecessarily ? to make the plot more ' horrible ? '
Dataset: FAKE
Ori (label: Unreliable): the us may soon face an apocalyptic seismic event starkman today , ... earthquakes ..., as geologists say . via usualroutine the university of washington has already presented seismological ... charts showing a gigantic geological rift that ... when scientists found a strange underground rupture ...
Adv (label: Reliable): the us may soon face an apocalyptic seismic event starkman today , ... earthquakes ..., as geologists say . en usualroutine the university of washington , already presented seismological ... charts showing a gigantic geological rift that ... when scientists found a strange underground rupture ...

Effects on Sub-Word Level Attack
The BPE method, as used in BERT, is currently the most efficient way to deal with a large vocabulary. We set up a comparative experiment in which we do not use the sub-word level attack; that is, we skip words that are tokenized into multiple sub-words.
As seen in Table 7, using the sub-word level attack achieves better performance, with both a higher attack success rate and a lower perturbation percentage.

Effects on Word Importance Ranking
The word importance ranking strategy is meant to find the key words that are essential to neural network models, which closely resembles calculating the maximum risk of wrong predictions in the FGSM algorithm (Goodfellow et al., 2014). Without word importance ranking, the attack algorithm is less successful.

Examples of Generated Adversarial Sentences
As seen in Table 6, the generated adversarial samples are semantically consistent with their original inputs, while the target model makes incorrect predictions. In both the review classification samples and the language inference samples, the perturbations do not mislead human judges.

Conclusion
In this work, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using the BERT masked language model. Experimental results show that the proposed method achieves a high success rate while keeping perturbations minimal. Nevertheless, candidates generated from the masked language model can sometimes be antonyms of, or irrelevant to, the original words, causing a semantic loss. Thus, enhancing language models to generate more semantically related perturbations is one possible way to improve BERT-Attack in the future.