SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions

State-of-the-art NLP models can often be fooled by transformations that are imperceptible to humans, such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction cannot be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions on the input sentences, and leverages the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires black-box queries of the model outputs, and hence can be applied to any pre-trained model (such as BERT) and any type of model (word-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both IMDB and Amazon text classification tasks. To the best of our knowledge, we are the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy.


Introduction
Deep neural networks have achieved state-of-the-art results in many NLP tasks, but have also been shown to be brittle to carefully crafted adversarial perturbations, such as replacing words with similar words (Alzantot et al., 2018), adding extra text (Wallace et al., 2019), and replacing sentences with semantically similar sentences (Ribeiro et al., 2018). These adversarial perturbations are imperceptible to humans, but can fool deep neural networks and break their performance. Efficient methods for defending against these attacks are of critical importance for deploying modern deep NLP models to practical automatic AI systems.
In this paper, we focus on defending against the synonymous word substitution attack (Alzantot et al., 2018), in which an attacker attempts to alter the output of the model by replacing words in the input sentence with their synonyms according to a synonym table, while keeping the meaning of the sentence unchanged. A model is said to be certified robust if such an attack is guaranteed to fail, no matter how the attacker manipulates the input sentences. Achieving and verifying certified robustness is highly challenging even if the synonym table used by the attacker is known during training (see Jia et al., 2019), because it requires checking every possible synonymous word substitution, and the number of such substitutions is exponentially large.
Various defense methods against synonymous word substitution attacks have been developed (e.g., Wallace et al., 2019; Ebrahimi et al., 2018), most of which, however, are not certified robust in that they may eventually be broken by stronger attackers. In the recent IBP-based certified methods (Jia et al., 2019; Huang et al., 2019), the inputs are taken to be the word embedding vectors of the input sentences, instead of the discrete sentences. This makes them inapplicable to character-level (Zhang et al., 2015) and subword-level (Bojanowski et al., 2017) models, which are more widely used in practice (Wu et al., 2016).
In this paper, we propose a structure-free certified defense method that applies to arbitrary models that can be queried in a black-box fashion, without any requirement on the model structures. Our method is based on the idea of randomized smoothing, which smooths the model with random word substitutions built on the synonym network, and leverages the statistical properties of the randomized ensembles to construct provable certification bounds. Similar ideas of provable certification using randomized smoothing have been developed recently in deep learning (e.g., Cohen et al., 2019; Salman et al., 2019; Zhang et al., 2020; Lee et al., 2019), but mainly for computer vision tasks whose inputs (images) are in a continuous space (Cohen et al., 2019). Our method is a substantial extension of the randomized smoothing technique to discrete and structured input spaces for NLP.
We test our method on various types of NLP models, including text CNN (Kim, 2014), Char-CNN (Zhang et al., 2015), and BERT (Devlin et al., 2019). Our method significantly outperforms the recent IBP-based methods (Jia et al., 2019; Huang et al., 2019) on both IMDB and Amazon text classification. In particular, we achieve an 87.35% certified accuracy on IMDB by applying our method on the state-of-the-art BERT, on which previous certified robust methods are not applicable.

Adversarial Word Substitution
In a text classification task, a model $f(X)$ maps an input sentence $X \in \mathcal{X}$ to a label $c$ in a set $\mathcal{Y}$ of discrete categories, where $X = x_1, \ldots, x_L$ is a sentence consisting of $L$ words. In this paper, we focus on adversarial word substitution, in which an attacker arbitrarily replaces the words in the sentence with their synonyms according to a synonym table in order to alter the prediction of the model. Specifically, for any word $x$, we consider a pre-defined synonym set $S_x$ that contains the synonyms of $x$ (including $x$ itself). We assume the synonymous relation is symmetric, that is, $x$ is in the synonym set of all its synonyms. The synonym set $S_x$ can be built based on GLOVE (Pennington et al., 2014).
With a given input sentence $X = x_1, \ldots, x_L$, the attacker may construct an adversarial sentence $X' = x'_1, \ldots, x'_L$ by perturbing at most $R \le L$ words $x_i$ in $X$ to any of their synonyms $x'_i \in S_{x_i}$,
$$S_X := \big\{ X' : \|X' - X\|_0 \le R,\ x'_i \in S_{x_i},\ \forall i \in [L] \big\},$$
where $S_X$ denotes the candidate set of adversarial sentences available to the attacker. Here $\|X' - X\|_0 := \sum_{i=1}^{L} \mathbb{I}\{x'_i \ne x_i\}$ is the Hamming distance, with $\mathbb{I}\{\cdot\}$ the indicator function. It is expected that all $X' \in S_X$ have the same semantic meaning as $X$ for human readers, but they may have different outputs from the model. The goal of the attacker is to find $X' \in S_X$ such that $f(X) \ne f(X')$.
Certified Robustness. Formally, a model $f$ is said to be certified robust against word substitution attacks on an input $X$ if it is able to give consistently correct predictions for all the possible word substitution perturbations, i.e.,
$$y = f(X'), \quad \forall X' \in S_X, \tag{1}$$
where $y$ denotes the true label of sentence $X$. Deciding if $f$ is certified robust can be highly challenging, because, unless additional structural information is available, it requires examining all the candidate sentences in $S_X$, whose size grows exponentially with $R$. In this work, we mainly consider the case when $R = L$, which is the most challenging case.
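To make the exponential growth concrete, the following minimal sketch counts the candidate sentences in the case $R = L$, where every word may be replaced independently. The function name `num_candidates` and the dictionary layout of the synonym table are assumptions made for illustration; they are not part of the paper.

```python
from math import prod
from typing import Dict, List, Set


def num_candidates(X: List[str], synonyms: Dict[str, Set[str]]) -> int:
    """Size of the candidate set S_X when R = L.

    Each word x_i can be replaced independently by any element of S_{x_i}
    (which contains x_i itself), so the count is the product of the set sizes.
    Words missing from the (hypothetical) table are treated as having only
    themselves as a synonym.
    """
    return prod(len(synonyms.get(x, {x})) for x in X)


# Toy table: every word has 4 synonyms including itself.
toy = {w: {w, w + "_1", w + "_2", w + "_3"} for w in ["an", "old", "story"]}
print(num_candidates(["an", "old", "story"], toy))  # 4 ** 3 candidates
```

With only a handful of synonyms per word, a sentence of a few dozen words already yields far more candidates than can be enumerated, which is why exhaustive verification is impractical.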

Certifying Smoothed Classifiers
Our idea is to replace $f$ with a smoother model that is easier to verify, obtained by averaging the outputs of $f$ over a set of randomly perturbed inputs based on random word substitutions. The smoothed classifier $f^{RS}$ is constructed by introducing random perturbations on the input space,
$$f^{RS}(X) = \arg\max_{c \in \mathcal{Y}} \mathbb{P}_{Z \sim \Pi_X}\big(f(Z) = c\big),$$
where $\Pi_X$ is a probability distribution on the input space that prescribes a random perturbation around $X$. For notation, we define
$$g^{RS}(X, c) := \mathbb{P}_{Z \sim \Pi_X}\big(f(Z) = c\big),$$
which is the "soft score" of class $c$ under $f^{RS}$.
The perturbation distribution $\Pi_X$ should be chosen properly so that $f^{RS}$ forms a close approximation to the original model $f$ (i.e., $f^{RS}(X) \approx f(X)$), and is also sufficiently random to ensure that $f^{RS}$ is smooth enough to allow certified robustness (in the sense of Theorem 1 below).
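The smoothed classifier only needs black-box calls to the base model. The sketch below estimates the soft scores and the smoothed prediction by sampling; `f` (the base classifier) and `sample` (a function that draws one perturbed sentence from the perturbation distribution) are placeholders supplied by the caller, and all names here are assumptions for illustration rather than the authors' code.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List

Sentence = List[str]


def smoothed_scores(
    f: Callable[[Sentence], Hashable],
    sample: Callable[[Sentence], Sentence],
    X: Sentence,
    n: int = 1000,
) -> Dict[Hashable, float]:
    """Monte Carlo estimate of the soft scores g^RS(X, c).

    Classes never predicted among the n draws are omitted (estimate 0).
    """
    counts = Counter(f(sample(X)) for _ in range(n))
    return {c: k / n for c, k in counts.items()}


def smoothed_predict(
    f: Callable[[Sentence], Hashable],
    sample: Callable[[Sentence], Sentence],
    X: Sentence,
    n: int = 1000,
) -> Hashable:
    """Estimate f^RS(X) as the class with the largest estimated soft score."""
    scores = smoothed_scores(f, sample, X, n)
    return max(scores, key=scores.get)
```

Because the estimate is a plain vote over sampled inputs, nothing about the architecture of `f` is used.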
In our work, we define $\Pi_X$ to be the uniform distribution on a set of random word substitutions. Specifically, let $P_x$ be a perturbation set for word $x$ in the vocabulary, which is different from the synonym set $S_x$. In this work, we construct $P_x$ based on the top $K$ nearest neighbors under the cosine similarity of GLOVE vectors, where $K$ is a hyperparameter that controls the size of the perturbation set; see Section 4 for more discussion on $P_x$.
For a sentence $X = x_1, \ldots, x_L$, the sentence-level perturbation distribution $\Pi_X$ is defined by randomly and independently perturbing each word $x_i$ to a word in its perturbation set $P_{x_i}$ with equal probability, that is,
$$\Pi_X(Z) = \prod_{i=1}^{L} \frac{\mathbb{I}\{z_i \in P_{x_i}\}}{|P_{x_i}|},$$
where $Z = z_1, \ldots, z_L$ is the perturbed sentence and $|P_{x_i}|$ denotes the size of $P_{x_i}$. Note that the random perturbation $Z$ and the adversarial candidate $X' \in S_X$ are different.
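Drawing from the sentence-level perturbation distribution amounts to one independent uniform choice per word. A minimal sketch, assuming the perturbation sets are stored as a dictionary of lists (the name `sample_perturbation` and the fallback for unknown words are illustrative assumptions):

```python
import random
from typing import Dict, List, Optional


def sample_perturbation(
    X: List[str],
    perturbation_sets: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw Z ~ Pi_X: replace each word x_i independently by a word chosen
    uniformly at random from its perturbation set P_{x_i}.

    Assumption for this sketch: a word without an entry in the table is
    left unchanged, i.e. its perturbation set is taken to be {x_i}.
    """
    rng = rng if rng is not None else random.Random()
    return [rng.choice(perturbation_sets.get(x, [x])) for x in X]


P = {"old": ["old", "aged", "ancient"], "story": ["story", "tale"]}
Z = sample_perturbation(["an", "old", "story"], P, random.Random(0))
print(Z)  # one perturbed sentence of the same length as the input
```

Each specific sentence $Z$ in the support is drawn with probability $\prod_i 1/|P_{x_i}|$, matching the uniform product form of the distribution.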

Certified Robustness
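We now discuss how to certify the robustness of the smoothed model $f^{RS}$. A sufficient condition for $f^{RS}$ to be certified robust at $X$ is
$$\min_{X' \in S_X} g^{RS}(X', y) > \max_{X' \in S_X} g^{RS}(X', c), \quad \forall c \ne y,$$
where the lower bound of $g^{RS}(X', y)$ on $X' \in S_X$ is larger than the upper bound of $g^{RS}(X', c)$ on $X' \in S_X$ for every $c \ne y$. The key step is hence to calculate the upper and lower bounds of $g^{RS}(X', c)$ for all $c \in \mathcal{Y}$ and $X' \in S_X$, which we address in Theorem 1 below. All proofs are in Appendix A.2.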
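Theorem 1. (Certified Lower/Upper Bounds) Assume the perturbation set $P_x$ is constructed such that $|P_x| = |P_{x'}|$ for every word $x$ and its synonym $x' \in S_x$. Define
$$q_x = \min_{x' \in S_x} \frac{|P_x \cap P_{x'}|}{|P_x|},$$
where $q_x$ indicates the overlap between the two different perturbation sets. For a given sentence $X = x_1, \ldots, x_L$, we sort the words according to $q_x$, such that $q_{x_{i_1}} \le q_{x_{i_2}} \le \cdots \le q_{x_{i_L}}$. Then
$$\max\big(g^{RS}(X, c) - q_X,\ 0\big) \le \min_{X' \in S_X} g^{RS}(X', c) \le \max_{X' \in S_X} g^{RS}(X', c) \le \min\big(g^{RS}(X, c) + q_X,\ 1\big),$$
where $q_X := 1 - \prod_{j=1}^{R} q_{x_{i_j}}$. Equivalently, this says
$$\big|g^{RS}(X', c) - g^{RS}(X, c)\big| \le q_X \quad \text{for any label } c \in \mathcal{Y} \text{ and } X' \in S_X.$$
The idea is that, with the randomized smoothing, the difference between $g^{RS}(X', c)$ and $g^{RS}(X, c)$ is at most $q_X$ for any adversarial candidate $X' \in S_X$. Therefore, we can give adversarial upper and lower bounds of $g^{RS}(X', c)$ by $g^{RS}(X, c) \pm q_X$, which, importantly, avoids the difficult adversarial optimization of $g^{RS}(X', c)$ on $X' \in S_X$, and instead just needs to evaluate $g^{RS}(X, c)$ at the original input $X$.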
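The quantity $q_X$ depends only on the synonym sets and perturbation sets, not on the model. A minimal sketch of its computation, assuming $q_x$ is the smallest overlap ratio $|P_x \cap P_{x'}| / |P_x|$ over synonyms $x'$ and that $q_X$ is one minus the product of the $R$ smallest $q_x$ in the sentence; the function names and the set-valued dictionaries are illustrative assumptions:

```python
from typing import Dict, List, Optional, Set


def word_overlap(x: str, synonyms: Dict[str, Set[str]],
                 perturb: Dict[str, Set[str]]) -> float:
    """q_x: worst-case overlap between P_x and P_{x'} over synonyms x' of x."""
    P_x = perturb[x]
    return min(len(P_x & perturb[s]) / len(P_x) for s in synonyms[x])


def sentence_q(X: List[str], synonyms: Dict[str, Set[str]],
               perturb: Dict[str, Set[str]], R: Optional[int] = None) -> float:
    """q_X = 1 - product of the R smallest q_{x_i} (R defaults to L)."""
    qs = sorted(word_overlap(x, synonyms, perturb) for x in X)
    R = len(qs) if R is None else R
    product = 1.0
    for q in qs[:R]:
        product *= q
    return 1.0 - product


# Toy example: "a" and "b" are synonyms, "c" has no other synonym.
synonyms = {"a": {"a", "b"}, "b": {"a", "b"}, "c": {"c"}}
perturb = {"a": {"a", "b", "u"}, "b": {"a", "b", "v"}, "c": {"c", "w", "z"}}
print(sentence_q(["a", "c"], synonyms, perturb))  # 1 - (2/3) * 1
```

A word whose synonyms all share its perturbation set contributes $q_x = 1$ and does not increase $q_X$, so larger overlaps give tighter bounds.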
We are ready to describe a practical criterion for checking the certified robustness.
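Proposition 1. For a sentence $X$ and its label $y$, we define
$$y_B = \arg\max_{c \in \mathcal{Y},\, c \ne y} g^{RS}(X, c).$$
Then under the condition of Theorem 1, we can certify that $f^{RS}(X') = f^{RS}(X) = y$ for any $X' \in S_X$ if
$$\Delta_X := \max\big(g^{RS}(X, y) - q_X,\ 0\big) - \min\big(g^{RS}(X, y_B) + q_X,\ 1\big) > 0.$$
Therefore, certifying whether the model gives consistently correct predictions reduces to checking if $\Delta_X$ is positive, which can be easily achieved with Monte Carlo estimation as we show in the sequel.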
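As a worked illustration of this criterion, the sketch below computes the margin from given soft scores, taking it to be the lower bound on the true class score minus the upper bound on the runner-up score, both shifted by $q_X$ and clipped to $[0, 1]$. The function name `certified_margin` and the dictionary of scores are assumptions for illustration.

```python
from typing import Dict, Hashable


def certified_margin(g: Dict[Hashable, float], y: Hashable, q_X: float) -> float:
    """Delta_X = max(g(X, y) - q_X, 0) - min(g(X, y_B) + q_X, 1),
    where y_B is the highest-scoring class other than y."""
    lower = max(g.get(y, 0.0) - q_X, 0.0)
    runner_up = max((p for c, p in g.items() if c != y), default=0.0)
    upper = min(runner_up + q_X, 1.0)
    return lower - upper


scores = {"pos": 0.9, "neg": 0.1}
print(certified_margin(scores, "pos", q_X=0.2))  # positive: certified
print(certified_margin(scores, "pos", q_X=0.5))  # negative: not certified
```

In the first call the bounds are $0.9 - 0.2$ against $0.1 + 0.2$, leaving a positive margin; in the second the perturbation sets overlap too little and the margin turns negative.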
Given Monte Carlo estimates of $g^{RS}(X, c)$ obtained from random draws $Z^{(i)} \sim \Pi_X$, $\Delta_X$ can be approximated accordingly. Using concentration inequalities, we can quantify the non-asymptotic approximation error. This allows us to construct rigorous statistical procedures to reject the null hypothesis that $f^{RS}$ is not certified robust at $X$ (i.e., $\Delta_X \le 0$) with a given significance level (e.g., 1%). See Appendix A.1 for the algorithmic details of the testing procedure.
We can see that our procedure is structure-free in that it only requires black-box access to the outputs $f(Z^{(i)})$ on the random inputs, and does not require any other structural information of $f$ and $f^{RS}$, which makes our method widely applicable to various types of complex models.
Tightness. A key question is whether our bounds are sufficiently tight. The next theorem shows that the lower/upper bounds in Theorem 1 are tight and cannot be further improved unless further information about the model $f$ or $f^{RS}$ is acquired.
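Theorem 2. (Tightness) Assume the conditions of Theorem 1 hold. For a model $f$ that satisfies $f^{RS}(X) = y$ and $y_B$ as defined in Proposition 1, there exists a model $f_*$ such that its related smoothed classifier $g^{RS}_*$ satisfies $g^{RS}_*(X, c) = g^{RS}(X, c)$ for $c = y$ and $c = y_B$, and
$$\min_{X' \in S_X} g^{RS}_*(X', y) = \max\big(g^{RS}(X, y) - q_X,\ 0\big), \qquad \max_{X' \in S_X} g^{RS}_*(X', y_B) = \min\big(g^{RS}(X, y_B) + q_X,\ 1\big),$$
where $q_X$ is defined in Theorem 1.

[Figure 1: Pipeline of the proposed approach. Panel labels: Synonym Network; Input Sentence (example: "An old story for young girls ...").]
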
In other words, if we access $g^{RS}$ only through the evaluation of $g^{RS}(X, y)$ and $g^{RS}(X, y_B)$, then the bounds in Theorem 1 are the tightest possible that we can achieve, because we cannot distinguish between $g^{RS}$ and the $g^{RS}_*$ in Theorem 2 with the information available. Figure 1 visualizes the pipeline of the proposed approach. Given the synonym sets $S_X$, we generate the perturbation sets $P_X$ from them. When an input sentence $X$ arrives, we draw perturbed sentences $\{Z^{(i)}\}$ from $\Pi_X$ and average their outputs to estimate $\Delta_X$, which is used to decide if the model is certified robust for $X$.

Practical Algorithm
Training the Base Classifier $f$. Our method needs to start with a base classifier $f$. Although it is possible to train $f$ using standard learning techniques, the result can be improved by taking into account that the method uses the smoothed $f^{RS}$ instead of $f$. To improve the accuracy of $f^{RS}$, we introduce a data augmentation induced by the perturbation set. Specifically, at each training iteration, we first sample a mini-batch of data points (sentences) and randomly perturb the sentences using the perturbation distribution $\Pi_X$. We then apply gradient descent on the model based on the perturbed mini-batch. Similar training procedures were also used for Gaussian-based randomized smoothing on continuous inputs (see e.g., Cohen et al., 2019).
Our method can easily leverage powerful pretrained models such as BERT. In this case, BERT is used to construct feature maps and only the top layer weights are finetuned using the data augmentation method.

Experiments
We test our method on both IMDB and Amazon text classification tasks.
Perturbation Sets. We say that two words $x$ and $x'$ are connected synonymously if there exists a path of words $x = x_1, x_2, \ldots, x_\ell = x'$ such that all the successive pairs are synonymous. Let $B_x$ be the set of words connected to $x$ synonymously. Then we define the perturbation set $P_x$ to consist of the top $K$ words in $B_x$ with the largest GLOVE cosine similarity if $|B_x| \ge K$, and set $P_x = B_x$ if $|B_x| < K$. Here $K$ is a hyper-parameter that controls the size of $P_x$ and hence trades off the smoothness and accuracy of $f^{RS}$. We use $K = 100$ by default and investigate its effect in Section 4.2.
Evaluation Metric. We evaluate the certified robustness of a model $f^{RS}$ on a dataset with the certified accuracy (Cohen et al., 2019), which equals the percentage of data points on which $f^{RS}$ is certified robust. For our method, this holds when $\Delta_X > 0$ can be verified.
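The construction of the perturbation sets can be sketched as a graph search followed by a similarity ranking. The code below is an illustrative sketch under stated assumptions: the synonym table is a dictionary of sets, `vectors` is a hypothetical dictionary of word embeddings standing in for GLOVE vectors, and similarity is measured against the word $x$ itself.

```python
import math
from collections import deque
from typing import Dict, List, Set


def connected_words(x: str, synonyms: Dict[str, Set[str]]) -> Set[str]:
    """B_x: all words reachable from x through chains of synonym pairs (BFS)."""
    seen, queue = {x}, deque([x])
    while queue:
        w = queue.popleft()
        for s in synonyms.get(w, ()):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def perturbation_set(x: str, synonyms: Dict[str, Set[str]],
                     vectors: Dict[str, List[float]], K: int = 100) -> Set[str]:
    """P_x: the K words of B_x most similar to x, or all of B_x if it is small."""
    B_x = connected_words(x, synonyms)
    if len(B_x) <= K:
        return B_x
    ranked = sorted(B_x, key=lambda w: cosine(vectors[x], vectors[w]), reverse=True)
    return set(ranked[:K])


syn = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
vec = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
print(perturbation_set("a", syn, vec, K=2))  # keeps "a" and its closest neighbor
```

Since synonyms always lie in the same connected component, they share the same $B_x$, so their perturbation sets have the same size $\min(K, |B_x|)$, as the equal-size assumption of Theorem 1 requires.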

Main Results
We first demonstrate that adversarial word substitution is able to give strong attacks in our experimental setting. We will show later that our method is able to achieve 87.35% certified accuracy, and thus the corresponding adversarial accuracy must be higher than or equal to 87.35%. We compare our method with IBP (Jia et al., 2019; Huang et al., 2019) in Table 1. We can see that our method clearly outperforms the baselines. In particular, our approach significantly outperforms IBP on Amazon by improving the 14.00% baseline to 24.92%.
Thanks to its structure-free property, our algorithm can be easily applied to any pre-trained models and character-level models, which is not easily achievable with the method of Jia et al. (2019). Table 2 shows that our method can further improve the result using Char-CNN (a character-level model) and BERT (Devlin et al., 2019), achieving an 87.35% certified accuracy on IMDB. In comparison, the IBP baseline only achieves a 79.74% certified accuracy under the same setting.

Trade-Off between Clean Accuracy and Certified Accuracy
We investigate the trade-off between smoothness and accuracy while tuning $K$ in Table 3.
Table 3: Results of the smoothed model $f^{RS}$ with different $K$ on IMDB using text CNN. "Clean" represents the accuracy on the clean data without adversarial attacks and "Certified" the certified accuracy.

Conclusion
We proposed a robustness certification method, which provably guarantees that no possible perturbation can break the system. Compared with previous work such as Jia et al. (2019) and Huang et al. (2019), our method is structure-free and thus can be easily applied to any pre-trained models (such as BERT) and character-level models (such as Char-CNN). The construction of the perturbation set is of critical importance to our method. In this paper, we used a heuristic based on the synonym network to construct the perturbation set, which may not be optimal. In future work, we will explore more efficient ways of constructing the perturbation set. We also plan to generalize our approach to achieve certified robustness against other types of adversarial attacks in NLP, such as the out-of-list attack. A naïve way is to add the "OOV" token into the synonym set of every word, but potentially better procedures can be further explored.

A.1 Bounding the Error of Monte Carlo Estimation
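As shown in Proposition 1, the smoothed model $f^{RS}$ is certified robust at an input $X$ in the sense of (1) if
$$\Delta_X = \max\big(g^{RS}(X, y) - q_X,\ 0\big) - \max_{c \ne y} \min\big(g^{RS}(X, c) + q_X,\ 1\big) > 0,$$
where $y$ is the true label of $X$, and $\{Z^{(i)}\}_{i=1}^{n}$ is an i.i.d. sample from $\Pi_X$. By Monte Carlo approximation, we can estimate $g^{RS}(X, c)$ for all $c \in \mathcal{Y}$ jointly, via
$$\hat{g}^{RS}(X, c) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{f(Z^{(i)}) = c\},$$
and estimate $\Delta_X$ via
$$\hat{\Delta}_X = \max\big(\hat{g}^{RS}(X, y) - q_X,\ 0\big) - \max_{c \ne y} \min\big(\hat{g}^{RS}(X, c) + q_X,\ 1\big).$$
To develop a rigorous procedure for testing $\Delta_X > 0$, we need to bound the non-asymptotic error of the Monte Carlo estimation, which can be done with a simple application of Hoeffding's concentration inequality and the union bound.
Proposition 2. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\Delta_X \ge \hat{\Delta}_X - 2\sqrt{\frac{\log(1/\delta) + \log|\mathcal{Y}|}{2n}}.$$
We can now frame the robustness certification problem as a hypothesis testing problem. Consider the null hypothesis $H_0$ and the alternative hypothesis $H_a$:
$H_0: \Delta_X \le 0$ ($f^{RS}$ is not certified robust at $X$)
$H_a: \Delta_X > 0$ ($f^{RS}$ is certified robust at $X$).
Then according to Proposition 2, we can reject the null hypothesis $H_0$ with a significance level $\delta$ if
$$\hat{\Delta}_X - 2\sqrt{\frac{\log(1/\delta) + \log|\mathcal{Y}|}{2n}} > 0.$$
In all the experiments, we set $\delta = 0.01$ and $n = 5000$.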
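The testing procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the slack term $2\sqrt{(\log(1/\delta) + \log|\mathcal{Y}|)/(2n)}$, which is what Hoeffding's inequality combined with a union bound over the $|\mathcal{Y}|$ classes yields, and the function names are hypothetical.

```python
import math
from typing import Dict, Hashable


def hoeffding_slack(n: int, num_classes: int, delta: float = 0.01) -> float:
    """Assumed error allowance: 2 * sqrt((log(1/delta) + log|Y|) / (2n))."""
    return 2.0 * math.sqrt((math.log(1.0 / delta) + math.log(num_classes)) / (2.0 * n))


def certify(g_hat: Dict[Hashable, float], y: Hashable, q_X: float,
            n: int, num_classes: int, delta: float = 0.01) -> bool:
    """Reject H0 (not certified) when the estimated margin exceeds the slack.

    g_hat holds Monte Carlo estimates of g^RS(X, c) from n samples; classes
    missing from the dictionary are treated as having estimate 0.
    """
    lower = max(g_hat.get(y, 0.0) - q_X, 0.0)
    runner_up = max((p for c, p in g_hat.items() if c != y), default=0.0)
    upper = min(runner_up + q_X, 1.0)
    delta_hat = lower - upper
    return delta_hat - hoeffding_slack(n, num_classes, delta) > 0


# With the paper's settings (delta = 0.01, n = 5000) and two classes,
# the estimated margin must exceed roughly 0.046.
print(hoeffding_slack(5000, 2))
print(certify({"pos": 0.95, "neg": 0.05}, "pos", q_X=0.2, n=5000, num_classes=2))
```

The slack shrinks at rate $1/\sqrt{n}$, so drawing more perturbed sentences allows inputs with smaller true margins to be certified at the same significance level.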

A.2 Proof of the Main Theorems
In this section, we give the proofs of the theorems in the main text.

A.2.1 Proof of Proposition 1
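According to the definition of $f^{RS}$, it is certified robust at $X$, that is, $y = f^{RS}(X')$ for all $X' \in S_X$, if
$$\min_{X' \in S_X} g^{RS}(X', y) > \max_{X' \in S_X} g^{RS}(X', c), \quad \forall c \ne y.$$
Obviously, by Theorem 1,
$$\min_{X' \in S_X} g^{RS}(X', y) \ge \max\big(g^{RS}(X, y) - q_X,\ 0\big),$$
and, for every $c \ne y$,
$$\max_{X' \in S_X} g^{RS}(X', c) \le \min\big(g^{RS}(X, c) + q_X,\ 1\big) \le \min\big(g^{RS}(X, y_B) + q_X,\ 1\big),$$
where the last step uses the definition of $y_B$. Therefore, $\Delta_X > 0$ is sufficient for the condition above.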
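
A.2.2 Proof of Theorem 1
Our goal is to calculate the upper and lower bounds $\max_{X' \in S_X} g^{RS}(X', c)$ and $\min_{X' \in S_X} g^{RS}(X', c)$. Our key idea is to frame the computation of the upper and lower bounds as a variational optimization.
Lemma 1. Define $\mathcal{H}_{[0,1]}$ to be the set of all bounded functions mapping from $\mathcal{X}$ to $[0, 1]$. For any $h \in \mathcal{H}_{[0,1]}$, define $\Pi_X[h] = \mathbb{E}_{Z \sim \Pi_X}[h(Z)]$. Then we have for any $X$ and $c \in \mathcal{Y}$,
$$\min_{X' \in S_X} g^{RS}(X', c) \ge \min_{h \in \mathcal{H}_{[0,1]}} \min_{X' \in S_X} \big\{ \Pi_{X'}[h] \ \ \text{s.t.}\ \ \Pi_X[h] = g^{RS}(X, c) \big\},$$
$$\max_{X' \in S_X} g^{RS}(X', c) \le \max_{h \in \mathcal{H}_{[0,1]}} \max_{X' \in S_X} \big\{ \Pi_{X'}[h] \ \ \text{s.t.}\ \ \Pi_X[h] = g^{RS}(X, c) \big\}.$$
Proof of Lemma 1. The proof is straightforward. Define $h_0(X) = \mathbb{I}\{f(X) = c\}$. Recall that
$$g^{RS}(X, c) = \mathbb{P}_{Z \sim \Pi_X}\big(f(Z) = c\big) = \Pi_X[h_0].$$
Therefore, $h_0$ satisfies the constraints in the optimization, which makes it obvious that
$$g^{RS}(X', c) = \Pi_{X'}[h_0] \ge \min_{h \in \mathcal{H}_{[0,1]}} \big\{ \Pi_{X'}[h] \ \ \text{s.t.}\ \ \Pi_X[h] = g^{RS}(X, c) \big\}.$$
Taking $\min_{X' \in S_X}$ on both sides yields the lower bound. The upper bound follows the same derivation.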
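Therefore, the problem reduces to deriving bounds for the optimization problems.
Theorem 3. Under the assumptions of Theorem 1, for the optimization problems in Lemma 1 (with $\Pi_X[h] := \mathbb{E}_{Z \sim \Pi_X}[h(Z)]$), we have
$$\min_{h \in \mathcal{H}_{[0,1]}} \min_{X' \in S_X} \big\{ \Pi_{X'}[h] \ \ \text{s.t.}\ \ \Pi_X[h] = g^{RS}(X, c) \big\} \ge \max\big(g^{RS}(X, c) - q_X,\ 0\big),$$
$$\max_{h \in \mathcal{H}_{[0,1]}} \max_{X' \in S_X} \big\{ \Pi_{X'}[h] \ \ \text{s.t.}\ \ \Pi_X[h] = g^{RS}(X, c) \big\} \le \min\big(g^{RS}(X, c) + q_X,\ 1\big),$$
where $q_X$ is the quantity defined in Theorem 1 in the main text.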
Now we proceed to prove Theorem 3.
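Proof of Theorem 3. We only consider the minimization problem because the maximization follows the same proof. For notation, we denote $p = g^{RS}(X, c)$. Applying the Lagrange multiplier to the constrained optimization problem and exchanging the min and max, we have
$$\begin{aligned}
\min_{X' \in S_X} \min_{h \in \mathcal{H}_{[0,1]}} \Big\{ \int h(Z)\, d\Pi_{X'}(Z) \ \ \text{s.t.}\ \int h(Z)\, d\Pi_X(Z) = p \Big\}
&= \min_{X' \in S_X} \min_{h \in \mathcal{H}_{[0,1]}} \max_{\lambda \in \mathbb{R}} \Big\{ \int h(Z)\, d\Pi_{X'}(Z) - \lambda \int h(Z)\, d\Pi_X(Z) + \lambda p \Big\} \\
&\ge \max_{\lambda \ge 0} \min_{X' \in S_X} \min_{h \in \mathcal{H}_{[0,1]}} \Big\{ \int h(Z) \big(d\Pi_{X'}(Z) - \lambda\, d\Pi_X(Z)\big) + \lambda p \Big\} \\
&= \max_{\lambda \ge 0} \Big\{ \lambda p - \max_{X' \in S_X} \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ \Big\}.
\end{aligned}$$
Here $d\Pi_X(Z)$ and $d\Pi_{X'}(Z)$ are densities with respect to the counting measure, and $(s)_+ = \max(s, 0)$. The last equality holds because the inner minimum over $h$ is attained pointwise by setting $h(Z) = 1$ wherever $\lambda\, d\Pi_X(Z) > d\Pi_{X'}(Z)$ and $h(Z) = 0$ elsewhere. Now we calculate $\int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+$.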
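Lemma 2. Given $x, x'$, define $n_x = |P_x|$, $n_{x'} = |P_{x'}|$ and $n_{x,x'} = |P_x \cap P_{x'}|$. We have the following identity:
$$\int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ = \Big( \frac{\lambda}{\prod_{j=1}^{L} n_{x_j}} - \frac{1}{\prod_{j=1}^{L} n_{x'_j}} \Big)_+ \prod_{j=1}^{L} n_{x_j, x'_j} + \lambda \Big( 1 - \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} \Big).$$
As a result, under the assumption that $n_x = |P_x| = |P_{x'}| = n_{x'}$ for every word $x$ and its synonym $x' \in S_x$, we have
$$\int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ = (\lambda - 1)_+ \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} + \lambda \Big( 1 - \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} \Big).$$
We now need to solve the optimization of $\max_{X' \in S_X} \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+$.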
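Lemma 3. For any word $x$, define $\tilde{x}^* = \arg\min_{x' \in S_x} n_{x,x'}/n_x$. For a given sentence $X = x_1, \ldots, x_L$, we define an ordering of the words $x_{i_1}, \ldots, x_{i_L}$ such that $n_{x_{i_j}, \tilde{x}^*_{i_j}} / n_{x_{i_j}} \le n_{x_{i_k}, \tilde{x}^*_{i_k}} / n_{x_{i_k}}$ for any $j \le k$. For a given $X$ and $R$, we define an adversarial perturbed sentence $X^* = x^*_1, \ldots, x^*_L$, where
$$x^*_i = \begin{cases} \tilde{x}^*_i & \text{if } i \in \{i_1, \ldots, i_R\}, \\ x_i & \text{otherwise.} \end{cases}$$
Then for any $\lambda \ge 0$, we have that $X^*$ is the optimal solution of $\max_{X' \in S_X} \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+$, that is,
$$\max_{X' \in S_X} \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ = \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X^*}(Z)\big)_+.$$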

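Now by Lemma 3, the lower bound becomes
$$\max_{\lambda \ge 0} \Big\{ \lambda p - \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X^*}(Z)\big)_+ \Big\} = \max_{\lambda \ge 0} \big\{ \lambda (p - q_X) - (\lambda - 1)_+ (1 - q_X) \big\} \tag{4}$$
$$= \max(p - q_X,\ 0),$$
where $q_X$ is consistent with the definition in Theorem 1:
$$q_X = 1 - \prod_{j=1}^{L} \frac{n_{x_j, x^*_j}}{n_{x_j}} = 1 - \prod_{j=1}^{R} q_{x_{i_j}}.$$
Here equation (4) is by calculation using the assumption of Theorem 1. The optimization of $\max_{\lambda \ge 0}$ in (4) is an elementary step: if $p \le q_X$, we have $\lambda^* = 0$ with solution $0$; if $p \ge q_X$, we have $\lambda^* = 1$ with solution $(p - q_X)$. This finishes the proof of the lower bound. The proof of the upper bound follows similarly.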
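Proof of Lemma 2. In this proof, with a slight abuse of notation, $S_X$ denotes the support of $\Pi_X$, that is, the set of sentences $Z$ with $z_j \in P_{x_j}$ for all $j$. Notice that we have
$$\int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ = \sum_{Z \in S_X \cap S_{X'}} \Big( \frac{\lambda}{|S_X|} - \frac{1}{|S_{X'}|} \Big)_+ + \sum_{Z \in S_X - S_{X'}} \frac{\lambda}{|S_X|}.$$
Also notice that $|S_X| = \prod_{j=1}^{L} n_{x_j}$; $|S_{X'}| = \prod_{j=1}^{L} n_{x'_j}$; $|S_X \cap S_{X'}| = \prod_{j=1}^{L} n_{x_j, x'_j}$ and $|S_X - S_{X'}| = \prod_{j=1}^{L} n_{x_j} - \prod_{j=1}^{L} n_{x_j, x'_j}$. Plugging in the above values, we have
$$\sum_{Z \in S_X \cap S_{X'}} \Big( \frac{\lambda}{|S_X|} - \frac{1}{|S_{X'}|} \Big)_+ = \Big( \frac{\lambda}{\prod_{j=1}^{L} n_{x_j}} - \frac{1}{\prod_{j=1}^{L} n_{x'_j}} \Big)_+ \prod_{j=1}^{L} n_{x_j, x'_j}.$$
And also, plugging in the above values, we have
$$\sum_{Z \in S_X - S_{X'}} \frac{\lambda}{|S_X|} = \lambda \Big( 1 - \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} \Big).$$
Combining all the calculations, we get the identity stated in Lemma 2.
Proof of Lemma 3. It is sufficient to prove that, for any $X' \in S_X$, we have
$$\int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X^*}(Z)\big)_+ \ge \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+.$$
For any $\lambda \ge 0$, define
$$Q(X, X') = \int \big(\lambda\, d\Pi_X(Z) - d\Pi_{X'}(Z)\big)_+ = (\lambda - 1)_+ \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} + \lambda \Big( 1 - \prod_{j=1}^{L} \frac{n_{x_j, x'_j}}{n_{x_j}} \Big).$$
Given any $X$, we can view $Q(X, X')$ as a function of $n_{x_i, x'_i}/n_{x_i}$, $i \in [L]$. Moreover, $Q(X, X')$ is a decreasing function of $n_{x_i, x'_i}/n_{x_i}$ for any $i \in [L]$ when fixing $n_{x_j, x'_j}/n_{x_j}$ for all other $j \ne i$, because $(\lambda - 1)_+ \le \lambda$. Suppose $\tilde{r}_k$ is the $k$-th smallest of the quantities $n_{x_i, x^*_i}/n_{x_i}$, $i \in [L]$, and $r_k$ is the $k$-th smallest of the quantities $n_{x_i, x'_i}/n_{x_i}$, $i \in [L]$. By the construction of $X^*$, we have $\tilde{r}_k \le r_k$ for any $k \in [L]$. This implies that $Q(X, X^*) \ge Q(X, X')$.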

A.2.3 Proof of Theorem 2
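We denote $g^{RS}(X, y) = p_A$, $g^{RS}(X, y_B) = p_B$ and $q = q_X$ in this proof for simplicity. The $X^*$ below is the one defined in Lemma 3. As in the proof of Lemma 2, $S_X$ here denotes the support of $\Pi_X$. Our proof is based on constructing a randomized smoothing classifier that satisfies the desired property we want to prove.
Case 1: $p_A \ge q$ and $p_B + q \le 1$. Note that in this case $|S_X \cap S_{X^*}| / |S_X| = 1 - q \ge (p_A - q) + p_B$, where the inequality is due to $p_A + p_B \le 1$. Therefore, we can choose sets $U_1$ and $U_2$ such that $U_1 \subseteq S_X \cap S_{X^*}$; $U_2 \subseteq S_X \cap S_{X^*}$; $U_1 \cap U_2 = \emptyset$; $|U_1| / |S_X| = p_A - q$ and $|U_2| / |S_X| = p_B$. We define the classifier:
$$f_*(Z) = \begin{cases} y & \text{if } Z \in U_1 \cup (S_X - S_{X^*}) \\ y_B & \text{if } Z \in U_2 \cup (S_{X^*} - S_X) \\ \text{other class } (c \ne y \text{ and } c \ne y_B) & \text{if } Z \in (S_X \cap S_{X^*}) - (U_1 \cup U_2) \\ \text{any class } (c \in \mathcal{Y}) & \text{otherwise.} \end{cases}$$
This classifier is well defined for binary classification because $(S_X \cap S_{X^*}) - (U_1 \cup U_2) = \emptyset$.
Case 2: $p_A < q$ and $p_B + q \le 1$. In this case, we can choose sets $U_1$ and $U_2$ such that $U_1 \subseteq S_X - S_{X^*}$; $U_2 \subseteq S_X \cap S_{X^*}$; $|U_1| / |S_X| = p_A$ and $|U_2| / |S_X| = p_B$. We define the classifier:
$$f_*(Z) = \begin{cases} y & \text{if } Z \in U_1 \\ y_B & \text{if } Z \in U_2 \cup (S_{X^*} - S_X) \\ \text{other class } (c \ne y \text{ and } c \ne y_B) & \text{if } Z \in S_X - (U_1 \cup U_2) \\ \text{any class } (c \in \mathcal{Y}) & \text{otherwise.} \end{cases}$$
This classifier is well defined for binary classification because $S_X - (U_1 \cup U_2) = \emptyset$.
Case 3: $p_A \ge q$ and $p_B + q > 1$. This case does not exist since we would have $p_A + p_B > 1$.
Case 4: $p_A < q$ and $p_B + q > 1$. We choose sets $U_1$ and $U_2$ such that $U_1 \subseteq S_X - S_{X^*}$; $U_2 \subseteq S_X - S_{X^*}$; $U_1 \cap U_2 = \emptyset$; $|U_1| / |S_X| = p_A$ and $|U_2| / |S_X| = p_B - (1 - q)$. Notice that $U_1$ and $U_2$ can be chosen to be disjoint because $|U_1| / |S_X| + |U_2| / |S_X| = p_A + p_B - (1 - q) \le 1 - (1 - q) = q = |S_X - S_{X^*}| / |S_X|$. We define the classifier:
$$f_*(Z) = \begin{cases} y & \text{if } Z \in U_1 \\ y_B & \text{if } Z \in U_2 \cup S_{X^*} \\ \text{other class } (c \ne y \text{ and } c \ne y_B) & \text{if } Z \in (S_X - S_{X^*}) - (U_1 \cup U_2) \\ \text{any class } (c \in \mathcal{Y}) & \text{otherwise.} \end{cases}$$
This classifier is well defined for binary classification because $(S_X - S_{X^*}) - (U_1 \cup U_2) = \emptyset$. It can be easily verified that for each case, the defined classifier satisfies all the conditions in Theorem 2.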