BAE: BERT-based Adversarial Examples for Text Classification

Modern text classification models are susceptible to adversarial examples, perturbed versions of the original text indiscernible by humans which get misclassified by the model. Recent works in NLP use rule-based synonym replacement strategies to generate adversarial examples. These strategies can lead to out-of-context and unnaturally complex token replacements, which are easily identifiable by humans. We present BAE, a black-box attack for generating adversarial examples using contextual perturbations from a BERT masked language model. BAE replaces and inserts tokens in the original text by masking a portion of the text and leveraging the BERT-MLM to generate alternatives for the masked tokens. Through automatic and human evaluations, we show that BAE performs a stronger attack, in addition to generating adversarial examples with improved grammaticality and semantic coherence as compared to prior work.


Introduction
Recent studies have exposed the vulnerability of ML models to adversarial attacks: small input perturbations that lead to misclassification by the model. Adversarial example generation in NLP (Zhang et al., 2019) is more challenging than in commonly studied computer vision tasks (Szegedy et al., 2014; Kurakin et al., 2017; Papernot et al., 2017) because of (i) the discrete nature of the input space and (ii) the need to ensure semantic coherence with the original text. A major bottleneck in applying gradient-based (Goodfellow et al., 2015) or generator-based (Zhao et al., 2018) approaches to generate adversarial examples in NLP is the backward propagation of the perturbations from the continuous embedding space to the discrete token space. Initial works for attacking text models relied on introducing errors at the character level (Ebrahimi et al., 2018; Gao et al., 2018) or adding and deleting words (Li et al., 2016; Liang et al., 2017; Feng et al., 2018) to create adversarial examples. These techniques often produce unnatural-looking adversarial examples that lack grammatical correctness and are therefore easily identifiable by humans.
Rule-based synonym replacement strategies (Alzantot et al., 2018; Ren et al., 2019) have recently led to more natural-looking adversarial examples. Jin et al. (2019) combine both these works by proposing TextFooler, a strong black-box attack baseline for text classification models. However, the adversarial examples generated by TextFooler account only for token-level similarity via word embeddings, and not the overall sentence semantics. This can lead to out-of-context and unnaturally complex replacements (see Table 3), which are easily human-identifiable. Consider a simple example: "The restaurant service was poor". Token-level synonym replacement of 'poor' may lead to an inappropriate choice such as 'broke', while a context-aware choice such as 'terrible' better retains semantics and grammaticality.
Therefore, a token replacement strategy that retains sentence semantics by using a powerful language model (Devlin et al., 2018; Radford et al., 2019) can alleviate the errors made by existing techniques on homonyms (tokens having multiple meanings). In this paper, we present BAE (BERT-based Adversarial Examples), a novel technique that uses the BERT masked language model (MLM) for word replacements that better fit the overall context. In addition to replacing words, we also propose inserting new tokens in the sentence to improve the attack strength of BAE. These perturbations of the input sentence are achieved by masking a part of the input and using an LM to fill in the mask (see Figure 1).
Our BAE attack beats the previous baselines by a large margin on empirical evaluation over multiple datasets and models. We show that, surprisingly, just a few replace/insert operations can reduce the accuracy of even a powerful BERT classifier by over 80% on some datasets. Moreover, our human evaluation reveals the improved grammaticality of the adversarial examples generated by BAE over the baseline TextFooler, which can be attributed to the BERT-MLM. To the best of our knowledge, we are the first to use an LM for generating adversarial examples. We summarize our contributions as:
• We propose BAE, an adversarial example generation technique using the BERT-MLM.
• We introduce 4 BAE attack modes by replacing and inserting tokens, all of which are almost always stronger than previous baselines on 7 text classification datasets.
• Through human evaluation, we show that BAE yields adversarial examples with improved grammaticality and semantic coherence.

Methodology
Problem Definition. We are given a dataset (S, Y) = {(S_1, y_1), ..., (S_m, y_m)} and a trained classification model C : S → Y. We assume the soft-label black-box setting, where the attacker can only query the classifier for output probabilities on a given input, and does not have access to the model parameters, gradients or training data. For an input pair (S = [t_1, ..., t_n], y), we want to generate an adversarial example S_adv such that C(S_adv) ≠ y. Additionally, we would like S_adv to be grammatically correct and semantically similar to S.
BAE. To generate an adversarial example S_adv, we introduce 2 types of token-level perturbations: (i) Replace a token t ∈ S with another and (ii) Insert a new token t' in S. Some tokens in the input contribute more towards the final prediction by C than others; replacing these tokens, or inserting a new token adjacent to them, can thus have a stronger effect on altering the classifier prediction. This intuition stems from the fact that the replaced/inserted tokens change the local context around the original token. We estimate the importance I_i of each token t_i ∈ S by deleting t_i from S and computing the decrease in the probability of predicting the correct label y. The Replace (R) and Insert (I) operations are performed on a token t by masking it, or by inserting a mask token adjacent to it, respectively. The pre-trained BERT-MLM is used to predict the mask tokens (see Figure 1), in line with recent work (Shi and Huang, 2020) which uses this approach to analyse the robustness of paraphrase identification models to modifying shared words. BERT-MLM is a powerful LM trained on a large training corpus (~2 billion words), and hence the predicted mask tokens fit well into the grammar and context of the text.
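The token-importance estimate above can be sketched as a leave-one-out probability drop. Here `predict_proba` is a toy stand-in for querying the black-box classifier C (a trivial cue-word sentiment scorer, only so the example runs end to end); the real attack would query the target model instead:

```python
def predict_proba(tokens, label):
    """Toy stand-in for P(C(S) = label): fraction of positive cue words."""
    positive = {"great", "good", "perfect"}
    return sum(t in positive for t in tokens) / (len(tokens) or 1)

def token_importance(tokens, label):
    """I_i = P(y|S) - P(y|S with t_i deleted), one score per token."""
    base = predict_proba(tokens, label)
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]  # delete t_i
        scores.append(base - predict_proba(reduced, label))
    return scores

sentence = "the service was great".split()
scores = token_importance(sentence, label="positive")
# Perturb tokens in decreasing order of importance:
order = sorted(range(len(sentence)), key=lambda i: -scores[i])
```

With this toy scorer, deleting 'great' causes the largest probability drop, so it is attacked first; the real attack applies the same ordering with the target classifier's output probabilities.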
The BERT-MLM, however, does not guarantee semantic coherence with the original text, as demonstrated by the following simple example. Consider the sentence: 'the food was good'. For replacing the token 'good', BERT-MLM may predict the token 'bad', which fits well into the grammar and context of the sentence, but changes the original sentiment. To achieve high semantic similarity with the original text when introducing perturbations, we filter the set of top K tokens (K is a pre-defined constant) predicted by BERT-MLM for the masked token, using a Universal Sentence Encoder (USE) based sentence similarity scorer (Cer et al., 2018). For the R operation, we additionally filter out predicted tokens that do not form the same part of speech (POS) as the original token. If multiple tokens cause C to misclassify S when they replace the mask, we choose the token which makes S_adv most similar to the original S based on the USE score. If no token causes misclassification, we choose the one that decreases the prediction probability P(C(S_adv) = y) the most. We apply these token perturbations iteratively in decreasing order of token importance, until either C(S_adv) ≠ y (successful attack) or all the tokens of S have been perturbed (failed attack).
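A minimal sketch of this filter-and-select step for one R operation. All helpers here are toy stand-ins, not the paper's code: `use_similarity` uses Jaccard overlap in place of USE cosine similarity (so the threshold is lowered from the paper's 0.8 to suit the crude proxy), `same_pos` is a hand-written adjective check in place of a POS tagger, and `predict_proba` is a trivial cue-word classifier standing in for the black-box model:

```python
SIM_THRESHOLD = 0.5  # toy value; the paper uses 0.8 with real USE embeddings

def use_similarity(a, b):
    """Crude Jaccard proxy for USE-based sentence similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def same_pos(orig, cand):
    """Toy POS check: both tokens adjectives, or both not."""
    adjectives = {"good", "bad", "terrible", "tasty"}
    return (orig in adjectives) == (cand in adjectives)

def predict_proba(tokens, label="positive"):
    """Toy stand-in for P(C(S) = positive), smoothed cue-word ratio."""
    pos = {"good", "tasty", "great"}
    neg = {"bad", "terrible", "poor"}
    p = sum(t in pos for t in tokens)
    n = sum(t in neg for t in tokens)
    return (1 + p) / (2 + p + n)

def choose_replacement(tokens, idx, candidates, label="positive"):
    """Filter MLM candidates, then select per the BAE criteria."""
    filtered = []
    for cand in candidates:
        new_tokens = tokens[:idx] + [cand] + tokens[idx + 1:]
        sim = use_similarity(tokens, new_tokens)
        if sim < SIM_THRESHOLD or not same_pos(tokens[idx], cand):
            continue
        filtered.append((cand, sim, predict_proba(new_tokens, label)))
    if not filtered:
        return None
    # Prefer candidates that flip the (binary, toy) prediction; among
    # those, keep the one most similar to the original sentence.
    flips = [f for f in filtered if f[2] < 0.5]
    if flips:
        return max(flips, key=lambda f: f[1])[0]
    # Otherwise pick the candidate lowering P(y) the most.
    return min(filtered, key=lambda f: f[2])[0]

best = choose_replacement("the food was good".split(), 3,
                          ["bad", "tasty", "service"])
```

In this toy setup 'service' is rejected by the POS filter, 'tasty' passes but keeps the prediction positive, and 'bad' both passes the filters and flips the toy classifier, so it is selected.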
We present 4 attack modes for BAE based on the R and I operations, where for each token t in S:
• BAE-R: Replace token t (see Algorithm 1)
• BAE-I: Insert a token to the left or right of t
• BAE-R/I: Either replace token t or insert a token to the left or right of t
• BAE-R+I: First replace token t, then insert a token to the left or right of t
Generating adversarial examples through masked language models has also been explored by Li et al. (2020) since our original submission.
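The masked inputs that the R and I operations feed to the BERT-MLM can be sketched as simple token-list manipulations (`MASK` mirrors BERT's `[MASK]` token; the actual MLM call is omitted here):

```python
MASK = "[MASK]"

def mask_replace(tokens, i):
    """R operation: mask token i so the MLM proposes replacements."""
    return tokens[:i] + [MASK] + tokens[i + 1:]

def mask_insert(tokens, i, left=True):
    """I operation: insert a mask to the left or right of token i."""
    pos = i if left else i + 1
    return tokens[:pos] + [MASK] + tokens[pos:]

s = "the food was good".split()
r_text = mask_replace(s, 3)             # mask 'good' for replacement
i_text = mask_insert(s, 3, left=False)  # propose a new token after 'good'
```

BAE-R/I simply tries whichever of the two masked texts yields the stronger perturbation at a position, while BAE-R+I applies `mask_replace` first and then `mask_insert` at the same position.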

Experiments
Datasets and Models. We evaluate BAE on different text classification tasks. Amazon, Yelp and IMDB are sentiment classification datasets used in recent work (Sarma et al., 2018), and MR (Pang and Lee, 2005) contains movie reviews labeled by sentiment polarity. MPQA (Wiebe and Wilson, 2005) is a dataset for opinion polarity detection, Subj (Pang and Lee, 2004) is for classifying a sentence as subjective or objective, and TREC (Li and Roth, 2002) is for question type classification.
We use 3 popular text classification models: word-LSTM (Hochreiter and Schmidhuber, 1997), word-CNN (Kim, 2014) and a fine-tuned BERT (Devlin et al., 2018) base-uncased classifier. We train models on the training data and perform the adversarial attack on the test data. For complete model details, refer to Appendix A.
As a baseline, we consider TextFooler (Jin et al., 2019) which performs synonym replacement using a fixed word embedding space (Mrkšić et al., 2016). We only consider the top K=50 synonyms from the BERT-MLM predictions and set a threshold of 0.8 for the cosine similarity between USE based embeddings of the adversarial and input text.

Automatic Evaluation Results.
We perform the 4 BAE attacks and summarize the results in Tables 1 and 2. Across datasets and models, our BAE attacks are almost always more effective than the baseline attack, achieving significant drops of 40-80% in test accuracies, with higher average semantic similarities as shown in parentheses. With just one exception, BAE-R+I is the strongest attack since it allows both replacement and insertion at the same token position. We observe a general trend that the BAE-R and BAE-I attacks often perform comparably, while the BAE-R/I and BAE-R+I attacks are much stronger. We observe that the BERT classifier is more robust to BAE and TextFooler attacks than the word-LSTM and word-CNN, possibly due to its large size and pre-training on a large corpus.
The TextFooler attack is sometimes stronger than the BAE-R attack for the BERT classifier. We attribute this to the shared parameter space between the BERT-MLM and the BERT classifier before fine-tuning. The predicted tokens from BERT-MLM may not be able to drastically change the internal representations learned by the BERT classifier, hindering their ability to adversarially affect the classifier prediction.
Additionally, we make some interesting observations pertaining to the average semantic similarity of the adversarial examples with the original sentences (computed using USE). From Tables 1 and 2 we observe that, across different models and datasets, all BAE attacks have higher average semantic similarity than TextFooler. Notably, the BAE-I attack achieves the highest semantic similarity among all the 4 modes. This can be explained by the fact that all tokens of the original sentence are retained, in the original order, in the adversarial example generated by BAE-I. Interestingly, we observe that the average semantic similarity of the BAE-R+I attack is always higher than the BAE-R attack. This lends support to the importance of the 'Insert' operation in ameliorating the effect of the 'Replace' operation. We further investigate this through an ablation study discussed later.
Effectiveness. We study the effectiveness of BAE on limiting the number of R/I operations permitted on the original text. We plot the attack performance as a function of maximum % perturbation (ratio of the number of word replacements and insertions to the length of the original text) for the TREC dataset. From Figure 2, we clearly observe that the BAE attacks are consistently stronger than TextFooler. The classifier models are relatively robust to perturbations up to 20%, while the effectiveness saturates at 40-50%. Surprisingly, a 50% perturbation for the TREC dataset translates to replacing or inserting just 3-4 words, due to the short text lengths.
Qualitative Examples. We present adversarial examples generated by the attacks on sentences from the IMDB and Yelp datasets in Table 3. All attack strategies successfully changed the classification to negative, however the BAE attacks produce more natural-looking examples than TextFooler.
The tokens predicted by the BERT-MLM fit well in the sentence context, while TextFooler tends to replace words with complex synonyms, which can be easily detected. Moreover, BAE's additional degree of freedom to insert tokens allows for a successful attack with fewer perturbations.

Table 3 (excerpt):
Original [Positive Sentiment]: This film offers many delights and surprises.
TextFooler: This flick citations disparate revel and surprises.
BAE-R+I: This beautiful movie offers many pleasant delights and surprises.

Original [Positive Sentiment]: Our server was great and we had perfect service.
TextFooler: Our server was tremendous and we assumed faultless services.
BAE-R: Our server was decent and we had outstanding service.
BAE-I: Our server was great enough and we had perfect service but.
BAE-R/I: Our server was great enough and we needed perfect service but.
BAE-R+I: Our server was decent company and we had adequate service.
Human Evaluation. We perform human evaluation of our BAE attacks on the BERT classifier.
For 3 datasets, we consider 100 samples from each test set shuffled randomly with their successful adversarial examples from BAE-R, BAE-R+I and TextFooler. We calculate the sentiment accuracy by asking 3 annotators to predict the sentiment for each sentence in this shuffled set. To evaluate the naturalness of the adversarial examples, we first present the annotators with 50 other original data samples to get a sense of the data distribution. We then ask them to score each sentence (on a Likert scale of 1-5) in the shuffled set on its grammar and likelihood of being from the original data. We average the 3 scores and present them in Table 4.
Both BAE-R and BAE-R+I attacks almost always outperform TextFooler on both metrics. BAE-R outperforms BAE-R+I, since the latter inserts tokens to strengthen the attack at the expense of naturalness and sentiment accuracy. Interestingly, the BAE-R+I attacks achieve higher average semantic similarity scores than BAE-R, as discussed in Section 3. This exposes the shortcomings of using USE for evaluating the retention of semantics of adversarial examples, and reiterates the need for human evaluation.
Replace vs. Insert. Our BAE attacks allow insertion operations in addition to replacement. We analyze the benefits of this flexibility of R/I operations in Table 5. In Table 5, splits A and B are the % of test points which compulsorily need I and R operations respectively for a successful attack. We observe that split A is larger than B, indicating the importance of the I operation over R. Test points in split C require both R and I operations for a successful attack. Interestingly, split C is largest for Subj, which is the most robust to attack (Table 2) and hence needs both operations. Thus, this study gives positive insights into the importance of having the flexibility to both replace and insert words.
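The split bookkeeping for this ablation can be sketched as simple counting. The per-test-point success flags and the exact split semantics used here (A: insertion required, i.e. a replace-only attack fails; B: replacement required; C: both required) are our reading of the ablation, not code from the paper:

```python
def ablation_splits(outcomes):
    """outcomes: one (r_ok, i_ok, ri_ok) triple per test point, where the
    flags mark success of a replace-only, insert-only, and combined R/I
    attack respectively. Returns split percentages over all points."""
    n = len(outcomes)
    a = sum(ri and not r for r, i, ri in outcomes)            # needs I
    b = sum(ri and not i for r, i, ri in outcomes)            # needs R
    c = sum(ri and not r and not i for r, i, ri in outcomes)  # needs both
    return {"A": 100.0 * a / n, "B": 100.0 * b / n, "C": 100.0 * c / n}

# Hypothetical flags for four test points, for illustration only:
outcomes = [(False, True, True), (True, False, True),
            (False, False, True), (True, True, True)]
splits = ablation_splits(outcomes)
```

Under these definitions split C is the intersection of A and B: points where neither operation alone suffices.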
We present complete effectiveness graphs and details of human evaluation in Appendix B and C. BAE is implemented in TextAttack (Morris et al., 2020), a popular suite of NLP adversarial attacks.

Conclusion
In this paper, we have presented a new technique for generating adversarial examples (BAE) through contextual perturbations based on the BERT masked language model. We propose inserting and/or replacing tokens of a sentence, in their order of importance for the text classification task, using a BERT-MLM. Automatic and human evaluation on several datasets demonstrates the strength and effectiveness of our attack.

A Experimental Details

Datasets.
• MR: A movie reviews dataset based on subjective rating and sentiment polarity.
• MPQA: An unbalanced dataset for polarity detection of opinions.
• TREC: A dataset for classifying types of questions with 6 classes.
• SUBJ: A dataset for classifying a sentence as objective or subjective.

Training Details. On the sentence classification task, we target three models: a word-based convolutional neural network (word-CNN), a word-based LSTM, and the state-of-the-art BERT. We use 100 filters of sizes 3, 4 and 5 for the word-CNN model with a dropout of 0.3. Similar to Jin et al. (2019), we use a 1-layer bi-directional LSTM with 150 hidden units and a dropout of 0.3. For both models, we use the 300-dimensional pre-trained counter-fitted word embeddings (Mrkšić et al., 2017). For the BERT classifier, we use the BERT base-uncased model, which has 12 layers, 12 attention heads and a hidden dimension of 768. Across all models and datasets, we use the standard BERT uncased vocabulary of size 30522. We first train all three models on the training split and use early stopping on the test set. For BERT fine-tuning, we use the standard setting of the Adam optimizer with a learning rate of 2×10^-5 and 2 fine-tuning epochs.
For our BAE attacks, we use a pre-trained BERT Base-uncased MLM to predict the masked tokens. We only consider the top K=50 synonyms from the BERT-MLM predictions and set a threshold of 0.8 for the cosine similarity between USE based embeddings of the adversarial and input text.
For R operations, we filter out predicted tokens that form a different POS than the original token in the sentence. For both R and I operations, we filter out stop words (using NLTK) from the set of predicted tokens. Additionally, for sentiment analysis tasks, we filter out antonyms using counter-fitted synonym embeddings (Mrkšić et al., 2016).
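This post-prediction filtering can be sketched as follows. To keep the example self-contained, a tiny hand-written stop-word set stands in for NLTK's list, and the embedding-based antonym check is reduced to a toy lookup table; both are illustrative assumptions, not the paper's resources:

```python
# Toy stand-ins for NLTK stop words and the counter-fitted antonym check:
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "is", "was"}
TOY_ANTONYMS = {"good": {"bad", "poor"}, "hot": {"cold"}}

def filter_candidates(original, candidates, sentiment_task=False):
    """Drop stop words always; drop antonyms only on sentiment tasks."""
    kept = []
    for cand in candidates:
        if cand in STOP_WORDS:
            continue
        if sentiment_task and cand in TOY_ANTONYMS.get(original, set()):
            continue
        kept.append(cand)
    return kept

filter_candidates("good", ["bad", "great", "the"], sentiment_task=True)
```

On a sentiment task this keeps only 'great'; on other tasks 'bad' would survive, since an antonym there need not flip the label.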

B Results
Figures 3-8 are the complete set of graphs showing the attack effectiveness for all seven datasets.

C Human Evaluation
We ask the human evaluators to judge the naturalness of the texts presented to them, i.e. whether they think they are adversarial examples or not. They were instructed to do so on the basis of grammar and how likely they think each text is to come from the original dataset, rating each example on the following Likert scale of 1-5: 1) Surely an adversarial example; 2) Likely an adversarial example; 3) Neutral; 4) Likely an original example; 5) Surely an original example. From the results in Table 4, it is clear that BAE-R always beats the sentiment accuracy and naturalness score of TextFooler. The latter is due to TextFooler's unnaturally long and complex synonym replacements.