Certified Robustness to Word Substitution Attack with Differential Privacy

The robustness and security of natural language processing (NLP) models are significantly important in real-world applications. In the context of text classification tasks, adversarial examples can be designed by substituting words with synonyms under certain semantic and syntactic constraints, such that a well-trained model will give a wrong prediction. Therefore, it is crucial to develop techniques to provide a rigorous and provable robustness guarantee against such attacks. In this paper, we propose WordDP to achieve certified robustness against word substitution attacks in text classification via differential privacy (DP). We establish the connection between DP and adversarial robustness for the first time in the text domain and propose a conceptual exponential mechanism-based algorithm to formally achieve the robustness. We further present a practical simulated exponential mechanism that has efficient inference with certified robustness. We not only provide a rigorous analytic derivation of the certified condition but also experimentally compare the utility of WordDP with existing defense algorithms. The results show that WordDP achieves higher accuracy and more than 30× efficiency improvement over the state-of-the-art certified robustness mechanism in typical text classification tasks.


Introduction
Deep neural networks (DNNs) have achieved state-of-the-art performance in many natural language processing (NLP) tasks, such as text classification (Zhang et al., 2015), sentiment analysis (Bakshi et al., 2016), and machine translation (Bahdanau et al., 2014), making the robustness and security of NLP models significantly important. Recent studies have shown that DNNs can be easily fooled by adversarial examples, which are carefully crafted by adding imperceptible perturbations to input examples during inference time (Szegedy et al., 2013). In the context of text classification tasks, adversarial examples can be designed by manipulating words or characters under certain semantic and syntactic constraints (Ren et al., 2019; Zang et al., 2020; Gao et al., 2018). Among all the attack strategies, word substitution attacks, in which attackers attempt to alter the model output by replacing input words with their synonyms, can maximally maintain the naturalness and semantic similarity of the input. Therefore, in this paper, we focus on defending against such word substitution attacks. Figure 1 shows an example of a word substitution attack where the clean input text is changed into adversarial text by substituting input words from a synonym list.

Various mechanisms have been developed to defend against adversarial examples in text classification models. Miyato et al. (2016) applied adversarial training to the text domain, which involves adversarial examples in the training stage. Data augmentation in the training phase is another defense approach to improve model robustness. For example, the Synonyms Encoding Method (SEM) proposed by Wang et al. (2019), Dirichlet Neighborhood Ensemble (DNE) proposed by Zhou et al. (2020), and Robust Encodings (RobEn) proposed by Jones et al. (2020) are different data augmentation methods operating on either the embedding space or the word space. However, all the above-mentioned works are only evaluated empirically and have no theoretical analysis or guarantee on the robustness of the methods, so they may be broken by other adaptive attacks. Therefore, it is important to provide a rigorous and provable certified defense.
There have been several attempts to achieve certified robustness against word substitution attacks. Jia et al. (2019) and Huang et al. (2019) utilize Interval Bound Propagation (IBP) to compute an upper bound on the model's loss in the forward pass and minimize this bound via backpropagation. Although IBP gives a theoretical bound, it does not provide any certification condition. Another limitation is that it is not applicable to character-level DNNs, because IBP is limited to continuous space, so the model input has to be a word-level embedding. SAFER (Ye et al., 2020) achieves certified robustness with a new randomized smoothing technique. However, its computation of synonym set intersections greatly slows down the inference stage. Besides, SAFER only provides a theoretical certified accuracy, and its empirical effectiveness on adversarial examples has not been evaluated.
In this paper, we propose a novel approach, WordDP, to certified robustness against word substitution attacks in text classification via differential privacy (DP) (Dwork, 2008). Figure 1 is a high-level illustration. In the inference phase, the input goes through a randomized mechanism, WordDP. If a clean input satisfies the certification condition of WordDP, its adversarial counterpart is guaranteed to receive the same output label. DP is a privacy framework that protects the information of an individual record in a database via randomized computations, such that the change of the computation output is bounded when a small perturbation is applied to the database. This stable output guarantee parallels the definition of robustness: ensuring that small changes in the input will not result in a dramatic shift of the output. The idea of providing robustness certification via DP was originally introduced in PixelDP (Lecuyer et al., 2019), which is specifically designed for norm-bounded adversarial examples in the continuous domain for applications like image classification. However, it is challenging to directly apply this idea against word substitution attacks, due to the discrete nature of the text input space. Therefore, in this work, we develop WordDP to achieve the DP and robustness connection in the discrete text space by exploring a novel application of the exponential mechanism (McSherry and Talwar, 2007), conventionally utilized to realize DP for answering discrete queries. To achieve this, we present a conceptual certified robustness algorithm that randomly samples word-substituted sentences according to the probability distribution designated by the exponential mechanism and aggregates their inference results as the final classification for the input.
A fundamental barrier limiting the conceptual algorithm from being applied in practice is that the sampling distribution of the exponential mechanism requires an exhaustive enumeration-based sub-step, which needs to repeat the model inference for every neighboring sentence with word substitutions from the input sentence. To overcome this computational difficulty, we develop a practical simulated exponential mechanism via uniform sampling and re-weighted averaging, which not only lowers the computational overhead but also ensures an uncompromised level of certified robustness. Our contributions can be summarized as follows: 1) We propose WordDP to establish the connection between DP and certified robustness for the first time in the text classification domain (Sec. 4.1). 2) We leverage the conceptual exponential mechanism to achieve WordDP and formally prove an L-word bounded certified condition for robustness against word substitution attacks (Sec. 4.2). 3) We develop a simulated exponential mechanism via uniform sampling and weighted averaging to overcome the computation bottleneck of the conceptual exponential mechanism without compromising the certified robustness guarantee (Sec. 4.3). 4) Extensive experiments validate that WordDP outperforms existing defense methods and achieves over 30× efficiency improvement in the inference stage over the state-of-the-art certified robustness mechanism (Sec. 5).
Related Work

Word Substitution Attacks. In word substitution attacks, attackers replace words in a sentence with their synonyms according to a synonym table; representative attacks include PWWS (Ren et al., 2019) and TEXTFOOLER, among others (Zang et al., 2020). In particular, PWWS is the most widely used attack algorithm for evaluating defense mechanisms (Zhou et al., 2020; Jia et al., 2019; Ye et al., 2020). PWWS uses WordNet to build the synonym set and replaces named entities (NEs) with similar NEs in order to flip the prediction. It incorporates word saliency to determine the replacement order and selects the synonym that causes the greatest change in prediction probability.
Empirical Defenses to Word Substitution Attacks. Several existing empirical defenses are effective against adversarial word substitution. Miyato et al. (2016) applied adversarial training to the text domain. Wang et al. (2019) proposed the Synonyms Encoding Method (SEM), which finds a mapping between words and their synonyms before the input layer. Jones et al. (2020) proposed robust encodings (RobEn), which involves an encoding function that maps sentences to a smaller, discrete space. Dirichlet Neighborhood Ensemble (DNE) (Zhou et al., 2020) creates virtual sentences by mixing the embedding of the original word with its synonyms' embeddings via Dirichlet sampling, a randomized smoothing based data augmentation.

Certified Defenses to Word Substitution Attacks. Jia et al. (2019) and Huang et al. (2019) certify robustness via Interval Bound Propagation (IBP). The intuition is to compute an upper bound on the model's loss through the network in a standard forward pass and minimize this upper bound via backpropagation. One major limitation of IBP certification is that it is not applicable to character-level DNNs, because IBP is limited to continuous space (word-level embeddings).
SAFER (Ye et al., 2020) is a certified robust method based on randomized smoothing. The certification is based on the intersection of synonym sets between perturbed examples and clean examples. However, its computation of synonym set intersections greatly reduces inference efficiency. Besides, it lacks a thorough evaluation of its empirical effectiveness on adversarial examples.

Adversarial Word Substitution and Certified Robustness
Adversarial Word Substitution. Consider a sentence X = ⟨x_1, …, x_N⟩ of N words, where each word x_i has a synonym set S(x_i). Following common practice (Ye et al., 2020), we also assume the synonymous relation is symmetric, such that x_i is in the synonym set of all its synonyms. The synonym set S(x_i) can be built by following GloVe (Pennington et al., 2014b).

Definition 3.1. (L-Adversarial Word Substitution Attack) For an input sentence X, an L-adversarial word substitution attack perturbs the sentence by selecting at most L words (L ≤ ω, where ω is the number of substitutable words in X) and replacing each selected word with one of its synonyms. We denote an attacked sentence by X′ and the set of all possible attacked sentences by S(L).
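To make the attack surface concrete, here is a minimal sketch of drawing one attacked sentence X′ from S(L); the `synonyms` table and helper names are hypothetical stand-ins for a GloVe- or WordNet-built synonym set:

```python
import random

# Hypothetical synonym table; by symmetry, S(x_i) contains x_i itself.
synonyms = {
    "good": ["good", "fine", "great"],
    "movie": ["movie", "film", "picture"],
}

def synonym_set(word):
    return synonyms.get(word, [word])

def sample_substitution(sentence, L, rng=random):
    """Draw one X' from S(L): substitute at most L words with synonyms."""
    words = sentence.split()
    # Positions that admit a non-trivial substitution.
    candidates = [i for i, w in enumerate(words) if len(synonym_set(w)) > 1]
    for i in rng.sample(candidates, min(L, len(candidates))):
        words[i] = rng.choice(synonym_set(words[i]))
    return " ".join(words)

print(sample_substitution("a good movie overall", L=2))
```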
Certified Robustness. In general, we say a model is robust to adversarial examples when its prediction remains stable under small perturbations of the input.

Definition 3.2. (L-Certified Robustness) A classification model f is certified robust to L-adversarial word substitution attacks on an input X if its prediction is unchanged for every attacked sentence, i.e., f(X′) = f(X) for all X′ ∈ S(L).

In the following, we refer to the above robustness as L-certified robustness for short.

Differential Privacy and Exponential Mechanism
Differential Privacy. The concept of DP is to prevent the information leakage of an individual record in a database by introducing randomness into the computation. More specifically, DP guarantees that the outputs of a function over two neighbouring databases are indistinguishable.
Definition 3.3. (Differential Privacy (Dwork et al., 2006)) A randomized mechanism A is ε-differentially private if, for all neighboring datasets D ∼ D′ that differ in one record or are bounded by a certain distance, and for all events O in the output space,
Pr[A(D) ∈ O] ≤ e^ε · Pr[A(D′) ∈ O].

Exponential Mechanism. The exponential mechanism is a commonly utilized DP mechanism in the discrete domain, whose key ingredients are a utility score function, its sensitivity, and the resulting sampling probability distribution.
Definition 3.4. (Exponential Mechanism (McSherry and Talwar, 2007)) Given a utility score function u(D, r) with sensitivity Δu = max_{r, D∼D′} |u(D, r) − u(D′, r)|, the exponential mechanism M_E(D, u, R) selects and outputs an element r ∈ R with probability proportional to exp(ε·u(D, r) / (2Δu)). The exponential mechanism is ε-differentially private.
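As a reference point, here is a minimal sketch of the generic exponential mechanism (the function names and the toy utility are ours, not from the paper):

```python
import math
import random

def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Sample one candidate r with probability proportional to
    exp(epsilon * u(r) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * utility(r) / (2.0 * sensitivity))
               for r in candidates]
    total = sum(weights)  # normalization factor
    return rng.choices(candidates, weights=[w / total for w in weights], k=1)[0]

# Toy usage: prefer candidates close to a target value of 5.
pick = exponential_mechanism(
    candidates=list(range(10)),
    utility=lambda r: -abs(r - 5),  # assumed utility with sensitivity 1
    sensitivity=1.0,
    epsilon=1.0,
)
```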
Proposed Method

WordDP for Certified Robustness

WordDP. We expand on the intuition that DP can be applied to provide certified robustness against textual adversarial examples such as word substitution attacks by regarding the sentence as a database and each word as a record. If the randomized predictive model satisfies ε-DP during inference, then the output of a potentially adversarial input X′ ∈ S(L) and the output of the original input X should be indistinguishable. Thus, our proposed approach is to transform a multiclass classification model's prediction score into a randomized ε-WordDP score, which is formally defined below.
Definition 4.1. (Word Differential Privacy) Consider any input sentence X and its L-word substitution sentence set S(L). For a randomized function f_A(X), let its prediction score vector be y ∈ Y. f_A(X) satisfies ε-word differential privacy (WordDP) if it satisfies ε-differential privacy for any pair of neighboring sentences X_1, X_2 ∈ S(L) and all outputs y ∈ Y.
Remark 1. We stress that WordDP does not seek DP protection for the training dataset as in the conventional privacy area. Instead, it leverages the DP randomness for certified robustness during inference with respect to a testing input.
In practice, for a base model f, a DP mechanism A is introduced to randomize it into f_A. For an ε-WordDP model f_A, its expected prediction satisfies the certified robustness condition in eq. (1), based on Lemma 4.1, which shows that each expected prediction score E[f_A^{y_i}(X)] is stable.

Lemma 4.1. For an ε-WordDP model f_A, its prediction score satisfies, for all i ∈ [C] and all X′ ∈ S(L),
E[f_A^{y_i}(X)] ≤ e^ε · E[f_A^{y_i}(X′)].

From the above property, we can derive the certified robustness condition against adversarial examples.

Lemma 4.2. For an ε-WordDP model f_A and an input sentence X, if there exists a label c such that
E[f_A^{y_c}(X)] > e^{2ε} · max_{i≠c} E[f_A^{y_i}(X)],   (1)
then the multiclass classification model f_A based on the expected label prediction score vector E[f_A^y(·)] is certified robust to L-adversarial word substitution attacks on X.
The proofs of the above two lemmas can be adapted from the PixelDP context to the WordDP context based on Lemma 1 and Proposition 1 in Lecuyer et al. (2019); we relegate them to Appendix A. Our focus is how to design the DP mechanism A to achieve WordDP (Subsection 4.2) and how to implement it for efficient inference while still ensuring certified robustness (Subsection 4.3).
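To make the certified condition concrete, the following sketch checks condition (1) on (idealized, exact) expected prediction scores; the function name is ours:

```python
import math

def is_certified(expected_scores, epsilon):
    """Check Lemma 4.2: E[f^{y_c}(X)] > e^{2*eps} * max_{i!=c} E[f^{y_i}(X)],
    where c is the predicted (top-scoring) label."""
    c = max(range(len(expected_scores)), key=expected_scores.__getitem__)
    runner_up = max(s for i, s in enumerate(expected_scores) if i != c)
    return expected_scores[c] > math.exp(2.0 * epsilon) * runner_up

# e.g. with epsilon = 0.1: 0.9 > e^{0.2} * 0.1, so X is certified.
print(is_certified([0.9, 0.1], epsilon=0.1))  # True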

WordDP with Exponential Mechanism
In this subsection, we present the conceptual exponential mechanism-based algorithm to achieve WordDP and the certification procedure.

Exponential Mechanism for WordDP. To obtain the DP classifier f_A given the base model f, we introduce the exponential mechanism M_E as the randomization mechanism A and define f_A := f(M_E). Given an input example, the mechanism selects and outputs L-substitution sentences with probabilities given by the exponential mechanism. It then aggregates the inferences of these samples by an average as the estimated prediction of the input. Figure 2 illustrates the algorithm.

Definition 4.2. (Exponential Mechanism for WordDP and L-Certified Robustness) Given the base model f, for any input sentence X and potential L-substitution sentence set S(L), we define the utility score function as
u(S(L), X′) = e^{−‖f^y(X′) − f^y(X)‖_∞},
which associates a utility score to a candidate output X′ ∈ S(L). The sensitivity of the utility score is Δu = 1 − e^{−1}. Then, the exponential mechanism selects and outputs X′ with probability
P_{X′} = exp(ε·u(S(L), X′) / (2Δu)) / ρ,
where ρ = Σ_{X″∈S(L)} exp(ε·u(S(L), X″) / (2Δu)) is the normalization factor.

Theorem 4.1. M_E is ε-DP, and the resulting classifier f_A := f(M_E) satisfies ε-WordDP; hence, if an input X satisfies condition (1), f_A is L-certified robust on X.

Proof. To show M_E is ε-DP, we prove that the sensitivity of the utility score (the maximum difference between the utility scores given any two neighboring inputs) Δu is indeed 1 − e^{−1}; the rest follows the definition of the exponential mechanism (c.f. Definition 3.4). Since ‖f^y(X′_i) − f^y(X)‖_∞ is the prediction probability change, which is in [0, 1], we have u(S(L), X′_i) ∈ [e^{−1}, 1], which leads to Δu = 1 − e^{−1}. Next, since M_E(X) is ε-DP, by the post-processing property (i.e., any computation on the output of a DP mechanism remains DP; Proposition 2.1 in (Dwork et al., 2014)), f_A = f(M_E) is ε-WordDP.
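A minimal sketch of the conceptual mechanism follows; it assumes our reconstruction of the utility score above (max-norm of the prediction score change) and uses hypothetical function names:

```python
import math
import random

DELTA_U = 1.0 - math.exp(-1.0)  # sensitivity of the utility score

def conceptual_sample(f, X, S_L, epsilon, rng=random):
    """Sample one X' from the full candidate set S(L)."""
    y_clean = f(X)
    weights = []
    for Xp in S_L:  # the bottleneck: one inference per candidate in S(L)
        diff = max(abs(a - b) for a, b in zip(f(Xp), y_clean))
        u = math.exp(-diff)  # utility score, in [e^{-1}, 1]
        weights.append(math.exp(epsilon * u / (2.0 * DELTA_U)))
    rho = sum(weights)  # normalization factor over all of S(L)
    return rng.choices(S_L, weights=[w / rho for w in weights], k=1)[0]
```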
Remark 2. 1) The design of the utility function follows the intuition that we wish to assign higher probability to sentences that have minimal impact on the prediction score. 2) The privacy budget ε influences whether the sampling probability distribution is flat (lower ε) or peaky (greater ε). Too small an ε value will clearly affect the prediction accuracy, while, according to the certified condition in Lemma 4.2, too large an ε value will result in no inputs being certified, so ε can only be searched within a limited range.
Certified Robustness with Monte Carlo Estimation. In practice, the expected prediction score E[f^y_{M_E}(X)] cannot be computed exactly, so we estimate it by Monte Carlo sampling. That is, we repeat the exponential mechanism-based inference to draw n samples of f^y(X′_τ) and take their empirical average Ê[f^y_{M_E}(X)]. The estimation error can be bounded based on Hoeffding's inequality: with probability η, the true expectation lies in an interval [Ê^lb, Ê^ub] around the estimate, which guarantees a high-confidence certification check. The next proposition shows that inference based on the estimated Ê[f^y_{M_E}(X)] can still ensure certified robustness.
Proposition 4.1. For an ε-WordDP model f(M_E) and an input sentence X, if
Ê^lb[f^{y_c}_{M_E}(X)] > e^{2ε} · max_{i≠c} Ê^ub[f^{y_i}_{M_E}(X)],
then the prediction score vector Ê[f^y_{M_E}(X)]-based classification is certified robust with probability η to L-adversarial word substitution attacks on X.
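A sketch of this sampling-based check, assuming the reconstructed condition above and a two-sided Hoeffding bound (for simplicity it ignores the union-bound correction across classes; all names are ours):

```python
import math

def certify_with_sampling(score_samples, epsilon, eta):
    """score_samples: list of n score vectors f^y(X'_tau), one per draw.
    Returns True if the high-confidence version of condition (1) holds."""
    n = len(score_samples)
    C = len(score_samples[0])
    means = [sum(row[i] for row in score_samples) / n for i in range(C)]
    # Hoeffding: P(|mean - E| >= t) <= 2 exp(-2 n t^2); choose t so the
    # failure probability is 1 - eta.
    t = math.sqrt(math.log(2.0 / (1.0 - eta)) / (2.0 * n))
    c = max(range(C), key=means.__getitem__)
    e_lb = means[c] - t                                   # E-hat^lb for label c
    e_ub = max(means[i] + t for i in range(C) if i != c)  # E-hat^ub, runner-up
    return e_lb > math.exp(2.0 * epsilon) * e_ub
```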

Simulated Exponential Mechanism
Simulated Exponential Mechanism. The conceptual exponential mechanism in Definition 4.2 is computationally impractical. The bottleneck is the need to enumerate the entire S(L) in order to calculate the probability P_{X′} for each X′ ∈ S(L) and the normalization factor ρ, which essentially requires performing model inference over all of S(L) (in addition to the n sampled inferences) to certify a single input sentence X.
In the following, we show that we can significantly reduce the computation cost by sampling via a simulated exponential mechanism, for which it suffices to sample n candidate L-substitution sentences and perform only n inferences, i.e., the same number of repetitions as the Monte Carlo estimation. The key insight is the different purpose of applying the exponential mechanism in the conventional DP scenario versus our certified robustness scenario. In the former, in order to ensure DP of the final output f(X′_τ), the intermediate X′_τ is forced to satisfy DP, i.e., to be drawn from the exact probability distribution designated by the exponential mechanism. In the latter, while the derivation of certified robustness relies on the randomness of DP and the exponential mechanism, we do not actually require DP of the intermediate X′_τ. As a result, we can sample X′_τ from other, simpler distributions without calculating the probability distribution of the exponential mechanism, as long as the alternative approach obtains an equivalent Ê[f^y_A(X)] for robustness certification.
We develop a simulated exponential mechanism via uniform sampling and re-weighted average prediction score calculation. Figure 2 shows the simulated mechanism in contrast to the conceptual mechanism. In detail, we sample from S(L) with uniform probability, which can be efficiently implemented without generating S(L). Denoting a sample by X′_τ, we calculate its scaled exponential mechanism probability by
P̃_{X′_τ} = exp(ε·u(S(L), X′_τ) / (2Δu)),
which can be obtained via a single inference on X′_τ plus the inference on X, thanks to the omission of the normalization factor ρ that would require the entire S(L). The inference on X only needs to be computed once and is shared by all n Monte Carlo repetitions. Such uniform sampling and scaled probability calculation is repeated n times, requiring only n + 1 inferences in total. Finally, we use the following re-weighted average prediction score (weighted by the scaled exponential mechanism probability) for certified robust prediction:
Ê[f^y_{M_E}(X)] = (Σ_{τ=1}^n P̃_{X′_τ} · f^y(X′_τ)) / (Σ_{τ=1}^n P̃_{X′_τ}).

For completeness, we can also show that the certified robustness guarantee of Proposition 4.1 carries over to the re-weighted estimate Ê[f^y_{M_E}(X)] produced by the simulated mechanism.
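The whole simulated mechanism fits in a few lines; the sketch below assumes a `sample_neighbor` routine that draws uniformly from S(L) (e.g., a closure over the substitution sampler earlier) and reuses our reconstructed utility score:

```python
import math
import random

DELTA_U = 1.0 - math.exp(-1.0)  # sensitivity of the utility score

def simulated_worddp_scores(f, X, sample_neighbor, n, epsilon, rng=random):
    """Uniform sampling + re-weighted averaging: n + 1 inferences total."""
    y_clean = f(X)  # computed once, shared by all n repetitions
    num = [0.0] * len(y_clean)
    den = 0.0
    for _ in range(n):
        Xp = sample_neighbor(X, rng)
        y = f(Xp)
        diff = max(abs(a - b) for a, b in zip(y, y_clean))
        u = math.exp(-diff)  # utility score, in [e^{-1}, 1]
        w = math.exp(epsilon * u / (2.0 * DELTA_U))  # scaled prob., rho omitted
        num = [ni + w * yi for ni, yi in zip(num, y)]
        den += w
    return [ni / den for ni in num]  # re-weighted average E-hat[f^y(X)]
```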
Training procedure. To achieve a better certification result, we introduce randomness into the training stage, as is done by almost all certified robustness approaches. Specifically, we use a data augmentation strategy that trains on perturbed sentences, i.e., X′ ∈ S(L) \ {X} given the original training sample X. In practice, we first train the model without data augmentation for several epochs to reach a reasonable performance, followed by training with perturbed X′. For each training data point, we randomly draw one neighboring sentence during training (as opposed to the multiple draws used during certified inference).
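A sketch of this two-phase schedule (all function names are ours; `train_step` stands for one optimizer update on the underlying classifier):

```python
def train_with_augmentation(model, data, epochs_clean, epochs_aug,
                            sample_neighbor, train_step):
    """Phase 1: standard training on clean sentences.
    Phase 2: replace each example with one randomly drawn neighbor
    X' in S(L) \\ {X} -- a single draw per example, unlike the
    multiple draws used at certified-inference time."""
    for _ in range(epochs_clean):
        for X, label in data:
            train_step(model, X, label)
    for _ in range(epochs_aug):
        for X, label in data:
            train_step(model, sample_neighbor(X), label)
```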

Experiments
We evaluate WordDP on two classification datasets: the Internet Movie Database (IMDB) (Maas et al., 2011) and the AG News corpus (AGNews) (Zhang et al., 2015). IMDB is a binary sentiment classification dataset containing 50,000 movie reviews. AGNews includes 30,000 news articles categorized into four classes. The target model architecture is a single-layer LSTM with hidden size 128. We use Global Vectors for Word Representation (GloVe) (Pennington et al., 2014a) for the word embeddings. The LSTM model achieves 88.4% and 91.8% clean accuracy on IMDB and AGNews, respectively. We use PWWS (Ren et al., 2019) to generate adversarial examples on the test set. PWWS is a state-of-the-art attack method that uses WordNet to build the synonym set and incorporates word saliency to replace selected named entities (NEs) with their synonyms in order to flip the prediction. Details about the datasets, model training, and the attack algorithm are in Appendix C.

Evaluation Metrics and Baselines
We use four metrics to evaluate the effectiveness of WordDP: certified ratio, certified accuracy, conditional accuracy, and conventional accuracy. Certified Ratio represents the fraction of the testing set for which the prediction satisfies the certification criteria:
CertRatio = (Σ_{t=1}^T certifiedCheck(X_t, L, ε)) / T,
where certifiedCheck returns 1 if Theorem 4.1 is satisfied and T is the size of the test dataset. Certified accuracy (CertAcc) denotes the fraction of the clean testing set on which the predictions are both correct and satisfy the certification criteria; this is a standard metric for evaluating certified robust models (Lecuyer et al., 2019). Formally, it is defined as:
CertAcc = (Σ_{t=1}^T certifiedCheck(X_t, L, ε) · corrClass(X_t, L, ε)) / T,
where corrClass returns 1 if the classification output is correct. When the accuracy of a model is close to 100%, certified accuracy largely reflects the certified ratio. Conventional accuracy (ConvAcc) is defined as the fraction of the testing set that is correctly classified,
ConvAcc = (Σ_{t=1}^T corrClass(X_t, L, ε)) / T,
which is a standard metric for evaluating any deep learning system. Note that the inputs X_t can be either adversarial or clean. We use this metric to evaluate how WordDP empirically performs on adversarial examples.
Besides the above standard metrics, we introduce a new accuracy metric called conditional accuracy (CondAcc) to evaluate the following: when a clean input X_t is certified within bound L, whether its corresponding L-word substitution adversarial example X^adv_t is indeed correctly classified. CondAcc can be formulated as:
CondAcc = (Σ_{t=1}^T certifiedCheck(X_t, L, ε) · corrClass(X^adv_t, L, ε)) / (Σ_{t=1}^T certifiedCheck(X_t, L, ε)).
We compare WordDP with the state-of-the-art certified robustness method SAFER (Ye et al., 2020) in CertAcc and CondAcc. Besides SAFER, we also compare ConvAcc on adversarial examples with two state-of-the-art defense methods, IBP (Jia et al., 2019) and DNE (Zhou et al., 2020), which do not provide a certified robustness guarantee; thus, their defense may be broken by more powerful word substitution attacks in the future.
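For concreteness, a small sketch computing these metrics from per-example 0/1 indicators (the list names are ours):

```python
def evaluate_metrics(certified, correct_clean, correct_adv):
    """certified[t], correct_clean[t], correct_adv[t] are 0/1 indicators of
    certifiedCheck(X_t), corrClass(X_t), and corrClass(X_t^adv)."""
    T = len(certified)
    cert_ratio = sum(certified) / T
    cert_acc = sum(c * r for c, r in zip(certified, correct_clean)) / T
    conv_acc = sum(correct_clean) / T  # use correct_adv for adversarial inputs
    n_cert = sum(certified)
    cond_acc = (sum(c * a for c, a in zip(certified, correct_adv)) / n_cert
                if n_cert else 0.0)
    return cert_ratio, cert_acc, conv_acc, cond_acc
```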

Certified Results
Certified Accuracy. Figure 3 presents the CertAcc, CondAcc, and ConvAcc under different ε and L. Each line in the figures represents a certified bound L, which allows up to L words to be substituted. The first row shows the results on IMDB and the second row on AGNews. Figures 3(a) and 3(d) show the certified accuracy on the two datasets. Since the conventional accuracy of our mechanism on clean examples is close to 100% (as shown in Figures 3(c) and 3(f)), the certified accuracy mainly reflects the certified ratio (which we thus skip in the results). As shown, higher ε results in lower CertAcc. This is intuitive, as the condition in Theorem 4.1 is more difficult to satisfy given a higher ε, i.e., a weaker indistinguishability requirement on the output, and hence results in a lower certified ratio. As illustrated in Figure 3(a), when ε is around 1.5, the mechanism approaches a certified ratio of 0. This indicates that ε can only be searched within a limited range.
Comparing the lines in Figures 3(a) and 3(d), we note that a greater L results in higher CertAcc in most cases for the AGNews dataset, which can be explained by the greater randomness introduced by a larger substitution set S(L).

Trade-off between Certified Ratio and CondAcc. Since ε has opposite impacts on certified accuracy (certified ratio) and CondAcc, we present the trade-off between the certified ratio and CondAcc of WordDP in Figure 4, in comparison with the baseline method SAFER. Ideally, we want both a high certified ratio and a high CondAcc to contribute to overall high accuracy. The black dot represents the baseline SAFER, since the neighbouring sentence generation method of SAFER does not depend on L or ε. As illustrated on these two datasets, with L = 20 and L = 40, WordDP dominates SAFER and achieves much better performance in both certified ratio and CondAcc.
Relation between certified bound L and adversarial attack power L_adv. Figure 5 presents the three accuracy metrics under different attack and defense powers. In Figure 5(a), we fix the attack power L_adv to 40, i.e., allowing at most 40 word substitutions, and adjust the WordDP defense power by using different certified bounds L. As discussed in Section 4, the certified bound L determines the size of the neighbouring set. A greater L leads to higher randomness and thus can benefit the CondAcc and ConvAcc on adversarial examples. On the other hand, a greater L also makes the certified condition more difficult to satisfy, which results in lower CertAcc.
In Figure 5(b), we fix the certified bound L to 40, i.e., using the same WordDP defense power against adversarial examples generated with varying attack power L_adv. As shown in the figure, the performance increases with higher attack power. This is because adversarial examples with more word changes (higher L_adv) are more difficult to generate but easier to defend against (due to the nature of the PWWS attack algorithm).

Comparison with Empirical Defenses. Besides the certified robust method SAFER, we also compare the CondAcc of WordDP with the baseline empirical defense methods IBP (Jia et al., 2019) and DNE (Zhou et al., 2020). Table 1 compares the highest CondAcc achieved by WordDP with the conventional accuracy reported by the baselines (ADV corresponds to no defense). WordDP achieves a much higher accuracy on the IMDB dataset than IBP, DNE, and SAFER. For AGNews, WordDP outperforms SAFER but is lower than the two empirical defenses. We stress, however, that the empirical defense methods do not provide any rigorous certified robustness guarantee, and their performance can depend significantly on the datasets and specific attacks.

Efficiency Comparison. We also compare the efficiency of WordDP with SAFER by computing the average time cost for certifying one input and producing the Monte Carlo sampling-based output. WordDP takes 6.25s and 3.21s on IMDB and AGNews, respectively, while SAFER takes 230.35s and 96.68s. Thus, WordDP achieves more than a 30× efficiency improvement.

Conclusion
We proposed WordDP, a certified robustness method against adversarial word substitution attacks based on an exponential mechanism-based algorithm. Compared with previous work, WordDP achieves notable accuracy improvements and a 30× efficiency improvement. In the future, it would be interesting to extend WordDP to other kinds of textual adversarial examples, such as character-level attacks. It is also worthwhile to study other certification approaches such as randomized smoothing.