Generalization to Mitigate Synonym Substitution Attacks

Studies have shown that deep neural networks (DNNs) are vulnerable to adversarial examples: perturbed inputs that cause DNN-based models to produce incorrect results. One robust adversarial attack in the NLP domain is synonym substitution. In attacks of this variety, the adversary substitutes words with synonyms. Since synonym substitution perturbations aim to satisfy all lexical, grammatical, and semantic constraints, they are difficult to detect with automatic syntax checking as well as by humans. In this paper, we propose a structure-free defensive method that is capable of improving the performance of DNN-based models with both clean and adversarial data. Our findings show that replacing the embeddings of the important words in the input samples with the average of their synonyms' embeddings can significantly improve the generalization of DNN-based classifiers. By doing so, we reduce model sensitivity to particular words in the input samples. Our results indicate that the proposed defense is not only capable of defending against adversarial attacks, but is also capable of improving the performance of DNN-based models when tested on benign data. On average, the proposed defense improved the classification accuracy of the CNN and Bi-LSTM models by 41.30% and 55.66%, respectively, when tested under adversarial attacks. Extended investigation shows that our defensive method can improve the robustness of nonneural models, achieving an average of 17.62% and 22.93% classification accuracy increase on the SVM and XGBoost models, respectively. The proposed defensive method has also shown an average of 26.60% classification accuracy improvement when tested with the widely used BERT model. Our algorithm is generic enough to be applied in any NLP domain and to any model trained on any natural language.

Based on the adversary's level of perturbation, three categories of adversarial attacks in NLP systems have been proposed: character-level, token-level, and sentence-level adversarial attacks (Alshemali and Kalita, 2020; Zhang et al., 2020). One robust existing token-level adversarial attack in NLP is black-box synonym substitution (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). In attacks of this variety, the adversary substitutes tokens with synonyms. Since synonym substitution perturbations aim to satisfy all lexical, grammatical, and semantic constraints, they are difficult to detect with automatic syntax checking as well as by humans.
In this work, we propose a defensive method to mitigate synonym substitution perturbations. We propose to improve the generalization of DNN-based models by replacing the embeddings of the important tokens in the input samples with the average of their synonyms' embeddings. By doing so, we reduce model sensitivity to particular tokens in the input samples. Experiments on two popular datasets, covering two types of text classification tasks, demonstrate that the proposed defense is not only capable of defending against these adversarial attacks, but is also capable of improving the performance of DNN-based models when tested on benign data. To our knowledge, our defense is the first proposed method that can effectively (1) improve the robustness of DNN-based models against synonym substitution adversarial attacks and (2) improve the generalization of DNN-based models with both clean and adversarial data.

Alzantot et al. (2018) developed a black-box synonym substitution attack to generate adversarial samples for sentiment analysis. They first computed the nearest neighbors of a token based on the Euclidean distance in the embedding space. Then, they picked the token that maximizes the target label prediction when replacing the original token. Their adversarial examples fooled their LSTM model's output with a 100% success rate, using the IMDB dataset (Maas et al., 2011). Ren et al. (2019) proposed a black-box synonym substitution attack for text classification tasks. They employed word saliency to select the token to be replaced. For each token, they selected the synonym that causes the most significant change in the classification probability after replacement. They experimented with three datasets, IMDB, AG's News (Zhang et al., 2015), and Yahoo! Answers,¹ using the word-level CNN of Kim (2014), the character-level CNN of Zhang et al. (2015), a Bi-directional LSTM, and an LSTM. Their results showed that, under their attack, the classification accuracies on IMDB, AG's News, and Yahoo! Answers were reduced by an average of 81.05%, 33.62%, and 38.65%, respectively. Another line of work adopted the Metropolis-Hastings (M-H) sampling approach (Metropolis et al., 1953; Hastings, 1970) to generate black-box synonym substitution perturbations against text classification and textual entailment tasks, replacing targeted words with synonyms and then applying a language model to enforce the fluency of the sentence after replacement. This attack changed the output of a Bi-LSTM model and the Bi-DAF model (Seo et al., 2017) with 98.7% and 86.6% success rates, respectively, using the IMDB dataset and the SNLI dataset (Bowman et al., 2015). Jin et al. (2020) also proposed a black-box synonym substitution attack to evaluate text classification systems. They first identified important tokens for the target model, then gathered the top tokens whose cosine similarity with the selected tokens is greater than a threshold. They kept the candidates that altered the prediction of the target model. Using their attack, they evaluated the word-level CNN and a word-level LSTM on the AG's News and IMDB datasets. Their results suggested that their attack reduced the accuracy of all target models by at least 64.2%.

¹ https://webscope.sandbox.yahoo.com/catalog.php?

Methodology
This paper proposes improving the generalization of DNN-based models by reducing a model's sensitivity to particular tokens in the input samples. This effectively mitigates black-box synonym substitution perturbations. We propose a method that combines word importance ranking, synonym extraction, word embedding averaging, and majority voting techniques to mitigate adversarial perturbations. Figure 1 illustrates the overall schema of the proposed approach. The proposed approach for mitigating adversarial text consists of four main steps:

• Step 1: Determine the N important tokens in the input sequence.
• Step 2: Build a synonym set for each important token.
• Step 3: Replace the embedding of each important token by the average of its synonyms' embeddings.
• Step 4: Perform a majority voting for the N replacements based on their predictions.

Scoring Function
Given a sequence of tokens, only some key tokens act as influential signals for the model's prediction. Therefore, we use a selection mechanism to choose the tokens that most significantly influence the final prediction results. We use the Replace-1 scoring function R1S() of Gao et al. (2018) to score the importance of tokens in an input sequence according to the observed results from the targeted model. Given the input sequence x = x_1 x_2 \dots x_n, where x_i is the token at the i-th position, we measure the effect of the token x_i on the output of the targeted model F by replacing x_i with an out-of-vocabulary (OOV) token \hat{x}_i. More formally,

R1S(x_i) = F(x_1 x_2 \dots x_{i-1} x_i x_{i+1} \dots x_n) - F(x_1 x_2 \dots x_{i-1} \hat{x}_i x_{i+1} \dots x_n),

where \hat{x}_i is chosen to be out-of-vocabulary and is obtained by inserting, deleting, or substituting a letter in x_i with a random letter. R1S() thus measures the importance of a token as the change in the model's prediction before and after replacing it with an OOV token. By calculating this effect for every token, the importance of all tokens in the input sample can be measured and ranked. This step reports the N most important tokens in an input sample. In our experiments, setting N to 5 produces the best results.

Figure 1: Schema of the proposed defensive method. Step 1: Extract the important tokens in the input sample (here, the three most important tokens). Step 2: Build a synonym set for each important token. Step 3: Replace the embedding of each important token by the average of its synonyms' embeddings. Step 4: Perform a majority voting for the replacements based on their predictions.
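The Replace-1 scoring step can be implemented in a few lines. The following is a minimal Python sketch, assuming a predict_proba(tokens) hook that returns the target model's confidence in its original prediction; the hook name and the particular character edit used to build the OOV token are illustrative assumptions.

```python
import random
import string

def make_oov(token):
    """Create an out-of-vocabulary variant of a token by substituting one
    character with a random letter (one of the edits R1S() allows)."""
    if len(token) < 2:
        return token + random.choice(string.ascii_lowercase)
    i = random.randrange(len(token))
    return token[:i] + random.choice(string.ascii_lowercase) + token[i + 1:]

def replace_1_scores(tokens, predict_proba):
    """Score each token by the drop in the model's confidence when that
    token is replaced with an OOV variant (Replace-1 scoring)."""
    base = predict_proba(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        perturbed = tokens[:i] + [make_oov(tok)] + tokens[i + 1:]
        scores.append(base - predict_proba(perturbed))
    return scores

def top_n_important(tokens, predict_proba, n=5):
    """Return the indices of the N highest-scoring (most important) tokens."""
    scores = replace_1_scores(tokens, predict_proba)
    return sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n]
```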

Synonym Extraction
For a given token with a high importance score obtained in Step 1, we build a synonym set (Synset) for the selected token. Synonyms can be found in WordNet (Miller, 1995), a large lexical resource for the English language. For each token, we use WordNet to build a synonym set that contains all possible synonyms of the token. More formally,

Synset(x_i) = \{s_1, s_2, \dots, s_m\},

where m is the number of the token's synonyms that exist in the lexical resource (WordNet). If a token does not have any synonyms in the lexical resource, the processing moves to the next important token. In this step, we use WordNet as the lexical resource, but the proposed defense can use any other lexical resource (e.g., Wiktionary).
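A minimal sketch of Step 2 using NLTK's WordNet interface follows. Any filtering the defense may apply beyond collecting all lemmas (e.g., by part of speech) is not shown, as it is not specified here.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def synonym_set(token):
    """Collect all WordNet synonyms of `token` as the Synset for Step 2.
    An empty result means the token is skipped and processing moves on
    to the next important token."""
    synonyms = set()
    for synset in wn.synsets(token):
        for lemma in synset.lemma_names():
            lemma = lemma.replace('_', ' ').lower()
            if lemma != token.lower():
                synonyms.add(lemma)
    return sorted(synonyms)
```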

Embedding Averaging
In the previous steps, we determine the N important tokens in an input sample (Step 1) and then extract a synonym set for each of the important tokens (Step 2). In the third step, for each important token, we replace its embedding by the average of its synonyms' embeddings. More formally,

E(x_i) \leftarrow \frac{1}{m} \sum_{j=1}^{m} E(s_j),

where E() represents the word embeddings resource and m is the count of synonyms in the synonym set of the token.
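Step 3 reduces to a simple average over the synonym vectors. Below is a sketch assuming the embeddings are held in a word-to-vector dictionary (as with GloVe); synonyms absent from the embedding vocabulary are simply skipped, an assumption not stated in the paper.

```python
import numpy as np

def average_synonym_embedding(synonyms, embeddings):
    """Average the embeddings of a token's synonyms (Step 3).

    `embeddings` is assumed to map words to 300-d GloVe vectors; synonyms
    missing from the vocabulary are skipped. Returns None if no synonym
    has a vector, in which case the token is left unchanged.
    """
    vectors = [embeddings[s] for s in synonyms if s in embeddings]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)
```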

Majority Voting
In the previous step, for each important token, we replace the embedding of the token by the average of its synonyms' embeddings. In this step, the model makes a prediction after each replacement and assigns each replacement a vote based on its prediction. The model's final prediction is the prediction with the majority of the votes. An example of this step is illustrated in Figure 2: the model made three predictions and the final classification is positive, based on the votes. The proposed approach with all steps is shown in Algorithm 1.

Figure 2: Example of the majority voting step (Step 4). The model makes a prediction after each replacement and assigns each replacement a vote based on its prediction; the model's final prediction is the prediction with the majority of the votes.
Algorithm 1: The overall procedure of the proposed defensive method.
Input: input sample X; classifier F(); Replace-1 scoring function R1S() to extract important tokens in an input sample; lexical resource Synset() to extract synonyms; word embeddings resource E() to represent tokens; prediction set P; majority voting method V().
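Since the body of Algorithm 1 does not survive in this text, the following is a hedged reconstruction of the full procedure from Steps 1-4, reusing the helper functions sketched above. The embed() and predict_label() hooks into the target model's input pipeline are assumptions, not functions defined in the paper.

```python
from collections import Counter

def defended_predict(tokens, predict_proba, predict_label, embed, embeddings, n=5):
    """Sketch of the overall defense: for each of the N most important tokens,
    swap its embedding for the average of its synonyms' embeddings,
    re-classify, and return the majority label over the resulting votes."""
    votes = []
    for i in top_n_important(tokens, predict_proba, n=n):      # Step 1
        synonyms = synonym_set(tokens[i])                       # Step 2
        if not synonyms:
            continue
        avg = average_synonym_embedding(synonyms, embeddings)   # Step 3
        if avg is None:
            continue
        embedded = embed(tokens)        # embedding matrix of the sample (assumed hook)
        embedded[i] = avg               # replace only this token's embedding
        votes.append(predict_label(embedded))                   # Step 4: one vote
    if not votes:                       # no usable replacement: fall back to the model
        return predict_label(embed(tokens))
    return Counter(votes).most_common(1)[0][0]                  # majority vote
```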
In this paper, we proposed a simple, structure-free defensive strategy that can harden DNNs against synonym substitution based adversarial attacks. As shown in Section 5, the proposed defense performs well. The advantage of our approach is that it can use any embeddings and lexical resources. It requires neither additional training data nor modifications to the models' architecture. Our implementation is generic enough to be applied in any domain and to models trained on any natural language.

Experiments
We implemented the proposed defensive method in Python, using the NumPy, TensorFlow, scikit-learn, and pandas libraries.

Corpus
To study the efficiency of our defense, we used the Internet Movie Database (IMDB) dataset (Maas et al., 2011). IMDB is a sentiment classification dataset with binary labels annotating the sentiment of movie reviews. It consists of 25,000 training samples and 25,000 test samples, labeled as positive or negative. The average length of samples in IMDB is 262 words.

Targeted Classification Models
To evaluate our proposed approach, we conducted several experiments on the word-level CNN model of Kim (2014) and the Bi-directional LSTM model of Ren et al. (2019). We replicated Kim's CNN architecture, which contains three convolutional layers, a max-pooling layer, and a fully-connected layer. The Bi-directional LSTM model involves a Bi-directional LSTM layer and a fully-connected layer.
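For concreteness, a Keras-style sketch of the two target architectures is given below. The filter sizes, filter counts, LSTM width, and dropout rate are assumptions, since the text only specifies the layer types.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_kim_cnn(vocab_size, embedding_matrix, max_len=400, num_classes=2):
    """Kim (2014)-style word-level CNN: three parallel convolutions over
    GloVe-initialised embeddings, max pooling, and a fully-connected output."""
    inputs = layers.Input(shape=(max_len,), dtype='int32')
    x = layers.Embedding(vocab_size, 300,
                         weights=[embedding_matrix], trainable=False)(inputs)
    branches = []
    for size in (3, 4, 5):                       # three convolutional branches
        c = layers.Conv1D(100, size, activation='relu')(x)
        branches.append(layers.GlobalMaxPooling1D()(c))
    merged = layers.Dropout(0.5)(layers.Concatenate()(branches))
    outputs = layers.Dense(num_classes, activation='softmax')(merged)
    return tf.keras.Model(inputs, outputs)

def build_bilstm(vocab_size, embedding_matrix, max_len=400, num_classes=2):
    """Bi-directional LSTM followed by a fully-connected layer, matching the
    description of the Ren et al. (2019) model."""
    inputs = layers.Input(shape=(max_len,), dtype='int32')
    x = layers.Embedding(vocab_size, 300,
                         weights=[embedding_matrix], trainable=False)(inputs)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```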

Adversarial Attacks
We evaluated our defensive method with two black-box synonym substitution attacks: the attack of Alzantot et al. (2018) and the attack of Ren et al. (2019).

Word Embeddings
We used the Global Vectors for Word Representation (GloVe) embedding space (Pennington et al., 2014) to generate word vectors of 300 dimensions.
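Loading the pretrained vectors into the word-to-vector dictionary assumed by the earlier sketches might look as follows; the file name is illustrative and any 300-dimensional GloVe release works.

```python
import numpy as np

def load_glove(path='glove.840B.300d.txt'):
    """Load 300-dimensional GloVe vectors into a word -> vector dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings
```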

Performance Evaluation
Classification accuracy is used as the metric to evaluate the performance of the proposed defensive model. Higher accuracy denotes a more effective approach.

Results
The CNN and Bi-LSTM models were trained on the IMDB training set and achieved training accuracy scores similar to the original implementations. We first present how the defensive method behaves on benign data with no adversarial attacks. In Table 1, we report the accuracy of the targeted models on the original test samples, with and without the defense applied. Table 1 shows that the defense is capable of improving the performance of the models even when they are not under attack: the classification accuracy of the CNN increases by 3.50%, and that of the Bi-LSTM increases by 5.46%. This indicates that the defense is beneficial not only in adversarial situations, but also in secure situations with no adversarial attacks.

Effectiveness of the Defense
To evaluate the efficiency of our defense in adversarial situations, we used the adversarial attacks of Alzantot et al. (2018) and Ren et al. (2019) to perturb the 1,280 benign samples and convert them to adversarial examples. A more effective defensive method should cause a smaller drop in model classification accuracy when the model is under attack. Table 2 shows the efficacy of the adversarial attacks and of the defensive method.
Under the adversarial attacks of Alzantot et al. and Ren et al., the classification accuracy of the models dropped significantly. For the CNN, the accuracy degraded by more than 41.50% and 51.90% under the Alzantot et al. and Ren et al. attacks, respectively. Similarly, the accuracy of the Bi-LSTM model dropped by more than 49.94% and 68.37% under the same attacks. Our results suggest that (1) DNN-based models with higher original accuracy (on clean data) are more difficult to attack. For instance, as shown in Tables 1 and 2, the under-attack accuracy is higher for the CNN model than for the Bi-LSTM model under all attacks. This agrees with the observation from previous research that, in general, models with higher original accuracy have higher under-attack accuracy (Jin et al., 2020). (2) With the defense applied, the robustness of the models significantly improved under all attacks. The effectiveness of the proposed defense was evaluated under the two attacks, and the results are presented in Table 2.

Table 3: The accuracy of the nonneural classification models under adversarial attacks, with and without the defense applied. Percent Increase is the percent increase in classification accuracy with the defense applied.

Nonneural Models
In this section, we evaluated the defense using two nonneural machine learning classification algorithms, selected for their strong performance on a variety of text classification tasks: (1) Support Vector Machine (SVM) (Cortes and Vapnik, 1995) and (2) Extreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016). We examined the performance of our defense with the SVM and XGBoost models, trained on the IMDB dataset and using the GloVe embedding space.
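The text does not spell out how documents are featurised for the nonneural models; a common choice, assumed in the sketch below, is to represent each review as the mean of its tokens' GloVe vectors and train the classifiers on those fixed-length features.

```python
import numpy as np
from sklearn.svm import SVC
from xgboost import XGBClassifier

def document_vector(tokens, embeddings, dim=300):
    """Represent a document as the mean of its tokens' GloVe vectors
    (an assumed featurisation, not specified in the paper)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Illustrative usage (X_train: list of token lists, y_train: sentiment labels):
# features = np.stack([document_vector(doc, glove) for doc in X_train])
# svm = SVC(kernel='rbf').fit(features, y_train)
# xgb = XGBClassifier().fit(features, y_train)
```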
To evaluate the defense with the SVM and XGBoost models, we used the adversarial attacks of Alzantot et al. (2018) and Ren et al. (2019) to perturb the same 1,280 benign samples of IMDB reviews (used in Section 5.1) and convert them to adversarial examples. Table 3 shows how the defense behaves with nonneural models on benign and adversarial data. Table 3 shows that the SVM model suffers more than 28.00% and 33.00% accuracy degradation under the Alzantot et al. and Ren et al. attacks, respectively. Table 3 also shows that the defense improved the performance of the models with benign data: the classification accuracy of the SVM model increases by 4.06%, and that of the XGBoost increases by 4.77%.

News Categorization Task
In Sections 5.1 and 5.2, we evaluated the effectiveness of the proposed defense on the sentiment analysis task. Here, we evaluated it on the news categorization task, using the Bidirectional Encoder Representations from Transformers (BERT) embedding space and the BERT model (Devlin et al., 2019). This model was trained on the AG's News categorization dataset (Zhang et al., 2015). We used the 12-layer BERT model, also called the base-uncased version.
AG's News is a news categorization dataset which contains news articles categorized into four classes: World, Sports, Business and Sci/Tech.
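A minimal sketch of loading the base-uncased BERT model for four-way AG's News classification with the Hugging Face transformers library is shown below; the example sentence is illustrative, and fine-tuning details (epochs, learning rate) are omitted because they are not reported here.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=4)  # four AG's News classes

inputs = tokenizer("Wall St. rebounds as tech shares rally.",
                   return_tensors='pt', truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()  # index of the predicted category
```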

Statistical Analysis
While the defended classifiers had higher accuracy scores than the undefended classifiers across all tasks, adversarial attacks, and datasets, it is important to determine whether the difference in performance of the defended models is statistically significant. Many researchers recommend McNemar's test (McNemar, 1947) for comparing the performance of two classifiers (Salzberg, 1997; Dietterich, 1998; Japkowicz and Shah, 2011; Costa et al., 2018) as it has a lower probability of Type I error. McNemar's is a non-parametric pairwise test designed for comparing two populations, or in this case, the predictions from two different classifiers on the same test dataset. In this paper, McNemar's test was applied to compare the performance of the defended models with their undefended counterparts (studied in Sections 5.1, 5.2, and 5.3). Here, we wish to compare the performance of the defended CNN with the undefended CNN, the defended SVM with the undefended SVM, and so on.
We performed McNemar's test to determine whether there was a significant difference between the accuracy of the defended models and that of the undefended ones. We tested the null hypothesis, which states that there is no significant difference in the accuracy of the models studied, against the alternative hypothesis, which states that there is a difference. Several pairwise comparisons were performed, with the significance threshold for each individual test set at 0.05. In all cases, the p-value for the difference between the defended and undefended models was below 0.05. Thus, we reject the null hypothesis, which assumed there was no difference between the classifiers, in favor of the alternative. The results show a statistically significant difference in the accuracy of all models, indicating that the defended models performed significantly better.
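McNemar's test on paired predictions can be run with statsmodels; the following is a sketch with illustrative variable names.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_undefended, pred_defended, alpha=0.05):
    """Compare a defended and an undefended classifier on the same test set
    with McNemar's test; returns the p-value and whether it is below alpha."""
    correct_u = np.asarray(pred_undefended) == np.asarray(y_true)
    correct_d = np.asarray(pred_defended) == np.asarray(y_true)
    # 2x2 contingency table of agreements/disagreements between the two models
    table = [[np.sum(correct_u & correct_d),  np.sum(correct_u & ~correct_d)],
             [np.sum(~correct_u & correct_d), np.sum(~correct_u & ~correct_d)]]
    result = mcnemar(table, exact=False, correction=True)
    return result.pvalue, result.pvalue < alpha
```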

Conclusion
In this paper, we proposed a structure-free defensive method that is capable of improving the performance of DNN-based models with both clean and adversarial data. Our findings show that replacing the embeddings of the important words in the input samples with the average of their synonyms' embeddings can significantly improve the generalization of DNN-based models. Our results indicate that the proposed defense is not only capable of defending against adversarial attacks, but is also capable of improving the performance of DNN-based models when tested on benign data. On average, the proposed defense improved the classification accuracy of the CNN and Bi-LSTM models by 41.30% and 55.66%, respectively, when tested under adversarial attacks. Extended investigation shows that our defensive method can improve the robustness of nonneural models, achieving an average of 17.62% and 22.93% classification accuracy increase on the SVM and XGBoost models, respectively. The proposed defensive method has also shown an average of 26.60% classification accuracy improvement when tested with the widely used BERT model. In further work, we plan to generalize our approach to achieve robustness against other types of adversarial attacks in NLP. We also hope to evaluate the defense with a variety of NLP systems, such as textual entailment systems.