Certified Robustness to Adversarial Word Substitutions

State-of-the-art NLP models can often be fooled by adversaries that apply seemingly innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of possible transformations scales exponentially with text length, so data augmentation cannot cover all transformations of an input. This paper considers one exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word. We train the first models that are provably robust to all word substitutions in this family. Our training procedure uses Interval Bound Propagation (IBP) to minimize an upper bound on the worst-case loss that any combination of word substitutions can induce. To evaluate models’ robustness to these transformations, we measure accuracy on adversarially chosen word substitutions applied to test examples. Our IBP-trained models attain 75% adversarial accuracy on both sentiment analysis on IMDB and natural language inference on SNLI; in comparison, on IMDB, models trained normally and ones trained with data augmentation achieve adversarial accuracy of only 12% and 41%, respectively.


[Figure 1: Word substitution-based perturbations in sentiment analysis. For an input x, we consider perturbations x̃ in which every word x_i can be replaced with any similar word from the set S(x, i), without changing the original sentiment. Models can be easily fooled by adversarially chosen perturbations (e.g., changing "best" to "better", "made" to "delivered", "films" to "movies"), but the ideal model would be robust to all combinations of word substitutions.]

State-of-the-art models are vulnerable to adversarial examples: seemingly innocuous, label-preserving perturbations of the input that reliably cause misclassification (Goodfellow et al., 2015). Since humans are not fooled by the same perturbations, the widespread existence of adversarial examples exposes troubling gaps in models' understanding.
In this paper, we focus on the word substitution perturbations of Alzantot et al. (2018). In this setting, an attacker may replace every word in the input with a similar word (that ought not to change the label), leading to an exponentially large number of possible perturbations. Figure 1 shows an example of these word substitutions. As demonstrated by a long line of work in computer vision, it is challenging to make models that are robust to very large perturbation spaces, even when the set of perturbations is known at training time (Goodfellow et al., 2015; Athalye et al., 2018; Raghunathan et al., 2018; Wong and Kolter, 2018).
Our paper addresses two key questions. First, is it possible to guarantee that a model is robust against all adversarial perturbations of a given input? Existing methods that use heuristic search to attack models (Ebrahimi et al., 2017; Alzantot et al., 2018) are slow and cannot provide guarantees of robustness, since the space of possible perturbations is too large to search exhaustively. We obtain guarantees by leveraging Interval Bound Propagation (IBP), a technique that was previously applied to feedforward networks and CNNs in computer vision. IBP efficiently computes a tractable upper bound on the loss of the worst-case perturbation. When this upper bound on the worst-case loss is small, the model is guaranteed to be robust to all perturbations, providing a certificate of robustness. To apply IBP to NLP settings, we derive new interval bound formulas for multiplication and softmax layers, which enable us to compute IBP bounds for LSTMs (Hochreiter and Schmidhuber, 1997) and attention layers (Bahdanau et al., 2015). We also extend IBP to handle discrete perturbation sets, rather than the continuous ones used in vision.
Second, can we train models that are robust in this way? Data augmentation can sometimes mitigate the effect of adversarial examples (Jia and Liang, 2017; Belinkov and Bisk, 2017; Ribeiro et al., 2018; Liu et al., 2019), but it is insufficient when considering very large perturbation spaces (Alzantot et al., 2018). Adversarial training strategies from computer vision (Madry et al., 2018) rely on gradient information, and therefore do not extend to the discrete perturbations seen in NLP. We instead use certifiably robust training, in which we train models to optimize the IBP upper bound.
We evaluate certifiably robust training on two tasks: sentiment analysis on the IMDB dataset (Maas et al., 2011) and natural language inference on the SNLI dataset (Bowman et al., 2015). Across various model architectures (bag-of-words, CNN, LSTM, and attention-based), certifiably robust training consistently yields models which are provably robust to all perturbations on a large fraction of test examples. A normally-trained model has only 8% and 41% accuracy on IMDB and SNLI, respectively, when evaluated on adversarially perturbed test examples. With certifiably robust training, we achieve 75% adversarial accuracy for both IMDB and SNLI. Data augmentation fares much worse than certifiably robust training, achieving adversarial accuracies of only 35% and 71%, respectively.

Setup
We consider tasks where a model must predict a label y ∈ Y given textual input x ∈ X . For example, for sentiment analysis, the input x is a sequence of words x 1 , x 2 , . . . , x L , and the goal is to assign a label y ∈ {−1, 1} denoting negative or positive sentiment, respectively. We use z = (x, y) to denote an example with input x and label y, and use θ to denote parameters of a model. Let f (z, θ) ∈ R denote some loss of a model with parameters θ on example z. We evaluate models on f 0-1 (z, θ), the zero-one loss under model θ.

Perturbations by word substitutions
Our goal is to build models that are robust to label-preserving perturbations. In this work, we focus on perturbations where words of the input are substituted with similar words. Formally, for every word x_i, we consider a set of allowed substitution words S(x, i), including x_i itself. We use x̃ to denote a perturbed version of x, where each word x̃_i is in S(x, i). For an example z = (x, y), let B_perturb(z) denote the set of all allowed perturbations of z:

B_perturb(z) = {(x̃, y) : x̃_i ∈ S(x, i) for all i}.  (1)

Figure 1 provides an illustration of word substitution perturbations. We choose S(x, i) so that x̃ is likely to be grammatical and have the same label as x (see Section 5.1).
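To make the definition of B_perturb(z) concrete, the following minimal sketch enumerates and counts perturbations for a toy input; the words and substitution sets here are illustrative, not drawn from the actual datasets:

```python
from itertools import product

def perturbation_set(S):
    """Enumerate all perturbed word sequences, where position i may use
    any word in S[i] (each S[i] includes the original word itself)."""
    return [list(words) for words in product(*S)]

def perturbation_count(S):
    """|B_perturb(z)| is the product of |S(x, i)| over positions i,
    which grows exponentially with text length."""
    n = 1
    for s in S:
        n *= len(s)
    return n

# Toy example with two positions.
S = [["best", "finest"], ["films", "movies", "cinema"]]
assert perturbation_count(S) == 6
assert ["best", "films"] in perturbation_set(S)  # the original is always included
```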

Robustness to all perturbations
Let F(z, θ) denote the set of losses of the network on the set of perturbed examples defined in (1):

F(z, θ) = {f(z̃, θ) : z̃ ∈ B_perturb(z)}.  (2)

We define the robust loss as max F(z, θ), the loss due to the worst-case perturbation. A model is robust at z if it classifies all inputs in the perturbation set correctly, i.e., the robust zero-one loss max F_0-1(z, θ) = 0. Unfortunately, the robust loss is often intractable to compute, as each word can be perturbed independently. For example, reviews in the IMDB dataset (Maas et al., 2011) have a median of 10^31 possible perturbations and a maximum of 10^271, far too many to enumerate. We instead propose a tractable upper bound by constructing a set O(z, θ) ⊇ F(z, θ). Note that

max O(z, θ) ≥ max F(z, θ).  (3)

Therefore, whenever max O_0-1(z, θ) = 0, this fact is sufficient to certify robustness to all perturbed examples in B_perturb(z). However, since O_0-1(z, θ) ⊇ F_0-1(z, θ), the model could be robust even if max O_0-1(z, θ) ≠ 0.
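For intuition, the robust zero-one loss can be computed exactly by brute force when the perturbation set is tiny. The classifier and data below are hypothetical toys; real perturbation sets are far too large for this approach, which is exactly why the paper constructs the superset O(z, θ) instead:

```python
from itertools import product

def robust_zero_one_loss(predict, S, y):
    """Exact robust zero-one loss by enumerating B_perturb: it is 0 iff
    the model classifies EVERY perturbation correctly. Tractable only
    for tiny substitution sets."""
    return max(0 if predict(xt) == y else 1 for xt in product(*S))

# Hypothetical sentiment classifier: positive iff any known positive word appears.
predict = lambda words: 1 if any(w in {"best", "finest"} for w in words) else -1

S = [["best", "finest"], ["film", "movie"]]
assert robust_zero_one_loss(predict, S, y=1) == 0   # robust: all 4 perturbations correct

S_bad = [["best", "better"], ["film", "movie"]]      # "better" is not a known positive word
assert robust_zero_one_loss(predict, S_bad, y=1) == 1
```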

Certification via Interval Bound Propagation
We now show how to use Interval Bound Propagation (IBP) to obtain a superset O(z, θ) of the losses of perturbed inputs F(z, θ), given z, θ, and B_perturb(z). For notational convenience, we drop z and θ. The key idea is to compute upper and lower bounds on the activations in each layer of the network, in terms of bounds computed for previous layers. These bounds propagate through the network, as in a standard forward pass, until we obtain bounds on the final output, i.e., the loss f. While IBP bounds may be loose in general, Section 5.2 shows that training networks to minimize the upper bound on f makes these bounds much tighter (Raghunathan et al., 2018). Formally, let g_i denote a scalar-valued function of z and θ (e.g., a single activation in one layer of the network) computed at node i of the computation graph for a given network. Let dep(i) be the set of nodes used to compute g_i in the computation graph (e.g., activations of the previous layer). Let G_i denote the set of possible values of g_i across all examples in B_perturb(z). We construct an interval O_i = [ℓ_i, u_i] that contains all these possible values of g_i, i.e., O_i ⊇ G_i. O_i is computed from the intervals O_dep(i) = {O_j : j ∈ dep(i)} of the dependencies of g_i. Once computed, O_i can then be used to compute intervals on nodes that depend on i. In this way, bounds propagate through the entire computation graph in an efficient forward pass.
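The forward propagation of intervals can be sketched with a tiny interval-arithmetic class; this is a minimal illustration of the mechanism, not the paper's actual implementation:

```python
class Interval:
    """A closed interval [lo, hi] that propagates through simple
    operations, mirroring how IBP pushes bounds layer by layer."""
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of intervals: endpoints add.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, a):
        # Multiplication by a KNOWN constant a (a negative a flips endpoints).
        lo, hi = a * self.lo, a * self.hi
        return Interval(min(lo, hi), max(lo, hi))

    def relu(self):
        # Monotonic elementwise nonlinearity: apply to both endpoints.
        return Interval(max(0.0, self.lo), max(0.0, self.hi))

# Bounds propagate forward exactly like activations do in a normal pass:
h = (Interval(-1.0, 2.0).scale(-3.0) + Interval(0.5, 1.0)).relu()
assert (h.lo, h.hi) == (0.0, 4.0)
```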
We now discuss how to compute interval bounds for NLP models and word substitution perturbations. We obtain interval bounds for model inputs given B perturb (z) (Section 3.1), then show how to compute O i from O dep(i) for elementary operations used in standard NLP models (Section 3.2). Finally, we use these bounds to certify robustness and train robust models.

Bounds for the input layer
Previous work applied IBP to continuous image perturbations, which are naturally represented with interval bounds (Dvijotham et al., 2018). We instead work with discrete word substitutions, which we must convert into interval bounds O_input in order to use IBP. Given input words x = x_1, . . . , x_L, we assume that the model embeds each word x_i as a vector φ(x_i). We obtain interval bounds by computing the smallest axis-aligned box that contains the vectors of all allowed substitution words:

ℓ_j = min over w ∈ S(x, i) of φ(w)_j,   u_j = max over w ∈ S(x, i) of φ(w)_j,  for each coordinate j.  (4)

Figure 2 illustrates these bounds. We can view this as relaxing a set of discrete points to a convex set that contains all of the points. Section 4.2 discusses modeling choices to make this box tighter.
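The input-layer bound (4) is just a coordinate-wise min and max over the embeddings of the allowed substitutions at a position. A minimal sketch with made-up 2-dimensional embeddings:

```python
import numpy as np

def input_interval_bounds(substitution_vectors):
    """Smallest axis-aligned box containing the embeddings of all allowed
    substitutions at one position: coordinate-wise min and max, as in (4)."""
    V = np.stack(substitution_vectors)   # shape (|S(x, i)|, d)
    return V.min(axis=0), V.max(axis=0)

# Toy 2-d embeddings for a word and two hypothetical neighbors.
vecs = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([1.5, -0.5])]
lower, upper = input_interval_bounds(vecs)
assert lower.tolist() == [0.5, -0.5] and upper.tolist() == [1.5, 0.5]
```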

Interval bounds for elementary functions
Next, we describe how to compute the interval of a node i from intervals of its dependencies. Prior work on IBP shows how to efficiently compute interval bounds for affine transformations (i.e., linear layers) and monotonic elementwise nonlinearities (see Appendix A.1). This suffices to compute interval bounds for feedforward networks and CNNs. However, common NLP model components like LSTMs and attention also rely on softmax (for attention), element-wise multiplication (for LSTM gates), and dot products (for computing attention scores). We show how to compute interval bounds for these new operations. These building blocks can be used to compute interval bounds not only for LSTMs and attention, but also for any model that uses these elementary functions.
For ease of notation, we drop the superscript i on g_i and write that a node computes a result z_res = g(z_dep), where z_res ∈ R and z_dep ∈ R^m for m = |dep(i)|. We are given intervals O_dep such that ℓ_j ≤ z_dep_j ≤ u_j for each coordinate j.

Softmax layer. The softmax function is often used to convert activations into a probability distribution, e.g., for attention. Prior IBP work uses unnormalized logits and does not handle softmax operations. Formally, let z_res represent the normalized score of the word at position c:

z_res = exp(z_dep_c) / Σ_j exp(z_dep_j).

The value of z_res is largest when z_dep_c takes its largest value and all other words take their smallest values:

u_res = exp(u_c) / (exp(u_c) + Σ_{j ≠ c} exp(ℓ_j)).

We obtain a similar expression for ℓ_res. Note that ℓ_res and u_res can each be computed in a forward pass, with some care taken to avoid numerical instability (see Appendix A.2).
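A naive direct implementation of the softmax bounds makes the corner argument explicit (the paper computes these quantities in log space for numerical stability; see Appendix A.2):

```python
import numpy as np

def softmax_interval_bounds(lower, upper, c):
    """Interval bounds on the softmax probability of index c, given
    elementwise bounds lower <= z <= upper on the logits. The probability
    is largest when z_c is at its upper bound and all other logits are at
    their lower bounds, and smallest in the opposite configuration."""
    others = np.arange(len(lower)) != c
    u_res = np.exp(upper[c]) / (np.exp(upper[c]) + np.exp(lower[others]).sum())
    l_res = np.exp(lower[c]) / (np.exp(lower[c]) + np.exp(upper[others]).sum())
    return l_res, u_res

lower = np.array([0.0, 0.0])
upper = np.array([1.0, 1.0])
l_res, u_res = softmax_interval_bounds(lower, upper, c=0)
# Any point softmax over logits inside the box lies within [l_res, u_res].
assert l_res < 0.5 < u_res
```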
Element-wise multiplication and dot product. Models like LSTMs incorporate gates which perform element-wise multiplication of two activations. Let z_res = z_dep_1 · z_dep_2 for scalars z_dep_1 and z_dep_2. The extreme values of the product occur at one of the four points corresponding to the products of the extreme values of the inputs. In other words,

ℓ_res = min(ℓ_1 ℓ_2, ℓ_1 u_2, u_1 ℓ_2, u_1 u_2),
u_res = max(ℓ_1 ℓ_2, ℓ_1 u_2, u_1 ℓ_2, u_1 u_2).

Propagating intervals through multiplication nodes therefore requires four multiplications. Dot products between activations are often used to compute attention scores. The dot product is just the sum of the element-wise product z_dep_1 ⊙ z_dep_2. Therefore, we can bound the dot product by summing the bounds on each element of z_dep_1 ⊙ z_dep_2, using the formula for element-wise multiplication.
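The four-corner rule for multiplication is short enough to sketch directly; note how a sign change in one input flips which corner is extreme:

```python
def multiply_interval_bounds(l1, u1, l2, u2):
    """Interval bounds on z1 * z2 given l1 <= z1 <= u1 and l2 <= z2 <= u2.
    The extreme products occur at one of the four corner combinations."""
    corners = (l1 * l2, l1 * u2, u1 * l2, u1 * u2)
    return min(corners), max(corners)

# Mixed-sign inputs: the minimum comes from u1 * l2, the maximum from l1 * l2.
lo, hi = multiply_interval_bounds(-1.0, 2.0, -3.0, 0.5)
assert (lo, hi) == (-6.0, 3.0)
```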

Final layer
Classification models typically output a single logit for binary classification, or k logits for k-way classification. The final loss f(z, θ) is a function of the logits s(x). For standard loss functions, we can represent this function in terms of elementwise monotonic functions (Appendix A.1) and the elementary functions described in Section 3.2. For example, the zero-one loss involves a max operation followed by a step function, which is monotonic. Thus, we can compute bounds on the loss, O(z, θ) = [ℓ_final, u_final], from bounds on the logits.

Certifiably Robust Training with IBP
Finally, we describe certifiably robust training, in which we encourage robustness by minimizing the upper bound on the worst-case loss. Recall that for an example z and parameters θ, u_final(z, θ) is the upper bound on the loss f(z, θ). Given a dataset D, we optimize a weighted combination of the normal loss and the upper bound u_final:

min over θ of Σ_{z ∈ D} (1 − κ) f(z, θ) + κ u_final(z, θ),  (7)

where 0 ≤ κ ≤ 1 is a scalar hyperparameter. As described above, we compute u_final in a modular fashion: each layer has an accompanying function that computes bounds on its outputs given bounds on its inputs. Therefore, we can easily apply IBP to new architectures. Bounds propagate through layers via forward passes, so the entire objective (7) can be optimized via backpropagation. Prior work found that this objective was easier to optimize by starting with a smaller space of allowed perturbations and making it larger during training. We accomplish this by artificially shrinking the input intervals toward their center by a scaling factor ε ∈ [0, 1]; standard training corresponds to ε = 0. We train for T_init epochs while linearly increasing ε from 0 to 1, and also increasing κ from 0 up to a maximum value κ*. We then train for an additional T_final epochs at κ = κ* and ε = 1.
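The training objective and the warm-up schedule can be sketched in a few lines. This assumes the common convex-combination form of the objective, (1 − κ) · normal loss + κ · bound, and a simple linear ramp for both ε and κ; the exact schedule shape is an assumption of this sketch:

```python
def certifiably_robust_loss(normal_loss, u_final, kappa):
    """Weighted combination of the normal loss and the IBP upper bound
    on the worst-case loss, with 0 <= kappa <= 1."""
    return (1.0 - kappa) * normal_loss + kappa * u_final

def schedule(epoch, T_init, kappa_star):
    """Linear warm-up: over the first T_init epochs, grow the perturbation
    fraction eps from 0 to 1 and kappa from 0 to kappa_star; both are then
    held fixed for the remaining epochs."""
    frac = min(1.0, epoch / T_init)
    return frac, frac * kappa_star  # (eps, kappa)

eps, kappa = schedule(epoch=5, T_init=10, kappa_star=0.8)
assert (eps, kappa) == (0.5, 0.4)
assert certifiably_robust_loss(1.0, 3.0, kappa=0.5) == 2.0
```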
To summarize, we use IBP to compute an upper bound on the model's loss when given an adversarially perturbed input. This bound is computed in a modular fashion. We efficiently train models to minimize this bound via backpropagation.

Tasks and models
Now we describe the tasks and model architectures on which we run experiments. These models are all built from the primitives in Section 3.

Tasks
Following Alzantot et al. (2018), we evaluate on two standard NLP datasets: the IMDB sentiment analysis dataset (Maas et al., 2011) and the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). For IMDB, the model is given a movie review and must classify it as positive or negative. For SNLI, the model is given two sentences, a premise and a hypothesis, and is asked whether the premise entails, contradicts, or is neutral with respect to the hypothesis. For SNLI, the adversary is only allowed to change the hypothesis, as in Alzantot et al. (2018), though it is possible to also allow changing the premise.

Models
IMDB. We implemented three models for IMDB. The bag-of-words model (BOW) averages the word vectors for each word in the input, then passes this through a two-layer feedforward network with 100-dimensional hidden state to obtain a final logit. The other models are similar, except they run either a CNN or bidirectional LSTM on the word vectors, then average their hidden states. All models are trained on cross entropy loss.
SNLI. We implemented two models for SNLI. The bag-of-words model (BOW) encodes the premise and hypothesis separately by summing their word vectors, then feeds the concatenation of these encodings to a 3-layer feedforward network. We also reimplement the Decomposable Attention model (Parikh et al., 2016), which uses attention between the premise and hypothesis to compute richer representations of each word in both sentences. These context-aware vectors are used in the same way BOW uses the original word vectors to generate the final prediction. Both models are trained on cross entropy loss. Implementation details are provided in Appendix A.4.
Word vector layer. The choice of word vectors affects the tightness of our interval bounds. We choose to define the word vector φ(w) for word w as the output of a feedforward layer, parameterized by a learned linear transformation g_word, applied to a fixed pre-trained word vector φ_pre(w). Learning g_word with certifiably robust training encourages it to orient the word vectors so that the convex hull of the word vectors is close to an axis-aligned box. Note that g_word is applied before bounds are computed via (4). Applying g_word after the bound calculation would result in looser interval bounds, since the original word vectors φ_pre(w) might be poorly approximated by interval bounds (e.g., Figure 2a), compared to φ(w) (e.g., Figure 2b). Section 5.7 confirms the importance of adding g_word. We use 300-dimensional GloVe vectors (Pennington et al., 2014) as our φ_pre(w).

Setup
Word substitution perturbations. We base our sets of allowed word substitutions S(x, i) on the substitutions allowed by Alzantot et al. (2018). They demonstrated that their substitutions lead to adversarial examples that are qualitatively similar to the original input and retain the original label, as judged by humans. Alzantot et al. (2018) define the neighbors N(w) of a word w as the n = 8 nearest neighbors of w in a "counter-fitted" word vector space where antonyms are far apart (Mrkšić et al., 2016). The neighbors must also lie within some Euclidean distance threshold. They also use a language model constraint to avoid nonsensical perturbations: they allow substituting x_i with x̃_i ∈ N(x_i) if and only if it does not decrease the log-likelihood of the text under a pre-trained language model by more than some threshold. We make three modifications to this approach. First, in Alzantot et al. (2018), the adversary applies substitutions one at a time, and the neighborhoods and language model scores are computed relative to the current altered version of the input. This results in a hard-to-define attack surface, as changing one word can allow or disallow changes to other words. It also requires recomputing language model scores at each iteration of the genetic attack, which is inefficient. Moreover, the same word can be substituted multiple times, leading to semantic drift. We define allowed substitutions relative to the original sentence x, and disallow repeated substitutions. Second, we use a faster language model that allows us to query longer contexts; Alzantot et al. (2018) use a slower language model and could only query it with short contexts. Finally, we use the language model constraint only at test time; the model is trained against all perturbations in N(w). This encourages the model to be robust to a larger space of perturbations, instead of specializing for the particular choice of language model. See Appendix A.3 for further details.
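Precomputing substitutions relative to the original sentence can be sketched as follows. The `lm_score` function is a hypothetical stand-in for the pre-trained language model, and the neighbor sets are toy data:

```python
def allowed_substitutions(x, neighbors, lm_score, delta):
    """Precompute S(x, i) relative to the ORIGINAL sentence x: a neighbor
    of x[i] is allowed iff substituting it decreases the sentence's
    log-likelihood (hypothetical lm_score) by at most delta."""
    base = lm_score(x)
    S = []
    for i, w in enumerate(x):
        allowed = {w}  # S(x, i) always includes the original word
        for n in neighbors.get(w, ()):
            if lm_score(x[:i] + [n] + x[i + 1:]) >= base - delta:
                allowed.add(n)
        S.append(allowed)
    return S

# Toy language model that dislikes "cinema" in this context.
lm = lambda sent: -5.0 if "cinema" in sent else 0.0
S = allowed_substitutions(["great", "films"],
                          {"films": ["movies", "cinema"]}, lm, delta=2.0)
assert S[0] == {"great"}
assert S[1] == {"films", "movies"}   # "cinema" fails the LM constraint
```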
Analysis of word neighbors. One natural question is whether we could guarantee robustness by having the model treat all neighboring words the same. We could construct equivalence classes of words from the transitive closure of N (w), and represent each equivalence class with one embedding. We found that this would lose a significant amount of information. Out of the 50,000 word vocabulary, 19,122 words would be in the same equivalence class, including the words "good", "bad", "excellent", and "terrible." Of the remaining words, 24,389 (79%) have no neighbors.
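The transitive-closure analysis above amounts to finding connected components of the neighbor graph, e.g., with union-find. A minimal sketch on toy data (the neighbor chains are invented for illustration):

```python
def equivalence_classes(neighbors):
    """Group words into equivalence classes under the transitive closure
    of the neighbor relation, using union-find with path halving."""
    parent = {w: w for w in neighbors}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for w, ns in neighbors.items():
        for n in ns:
            parent[find(n)] = find(w)      # union the two components

    classes = {}
    for w in neighbors:
        classes.setdefault(find(w), set()).add(w)
    return list(classes.values())

# "good" and "bad" collapse together via the chain good - decent - okay - bad.
nbrs = {"good": {"decent"}, "decent": {"okay"}, "okay": {"bad"},
        "bad": set(), "film": {"movie"}, "movie": set()}
classes = equivalence_classes(nbrs)
assert {"good", "decent", "okay", "bad"} in classes
assert {"film", "movie"} in classes
```

This illustrates why the idea loses information: antonyms connected through chains of similar words end up in the same class.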
Baseline training methods. We compare certifiably robust training (Section 3) with both standard training and data augmentation, which has been used in NLP to encourage robustness to various types of perturbations (Jia and Liang, 2017; Belinkov and Bisk, 2017; Iyyer et al., 2018; Ribeiro et al., 2018). In data augmentation, for each training example z, we augment the dataset with K new examples z̃ by sampling z̃ uniformly from B_perturb(z), then train on the normal cross entropy loss. For our main experiments, we use K = 4. We do not use adversarial training (Goodfellow et al., 2015) because it would require running an adversarial search procedure at each training step, which would be prohibitively slow.
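Sampling z̃ uniformly from B_perturb(z) reduces to sampling each position independently from its substitution set. A minimal sketch of the augmentation step, with toy substitution sets:

```python
import random

def augment(S, K, seed=0):
    """Data augmentation baseline: draw K perturbations uniformly from
    B_perturb by sampling each position independently from S(x, i)."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return [[rng.choice(sorted(S_i)) for S_i in S] for _ in range(K)]

S = [{"best", "finest"}, {"films", "movies"}]
samples = augment(S, K=4)
assert len(samples) == 4
assert all(w in S_i for s in samples for w, S_i in zip(s, S))
```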
Evaluation of robustness. We wish to evaluate robustness of models to all word substitution perturbations. Ideally, we would directly measure robust accuracy, the fraction of test examples z for which the model is correct on all z̃ ∈ B_perturb(z). However, evaluating this exactly involves enumerating the exponentially large set of perturbations, which is intractable. Instead, we compute tractable upper and lower bounds:

1. Genetic attack accuracy: Alzantot et al. (2018) demonstrate the effectiveness of a genetic algorithm that searches for perturbations z̃ that cause model misclassification. The algorithm maintains a "population" of candidate z̃'s and repeatedly perturbs and combines them. We used a population size of 60 and ran 40 search iterations on each example. Since the algorithm does not exhaustively search over B_perturb(z), accuracy on the perturbations it finds is an upper bound on the true robust accuracy.

2. Certified accuracy: using IBP, we count the fraction of test examples for which the upper bound on the zero-one loss certifies robustness; this is a lower bound on the true robust accuracy.

[Figure 3: Trade-off between clean accuracy and genetic attack accuracy for CNN models on IMDB. Data augmentation cannot achieve high robustness. Certifiably robust training yields much more robust models, though at the cost of some clean accuracy. Lines connect Pareto optimal points for each training strategy.]

Main results
In addition, we show on synthetic data that robustly trained LSTMs can learn long-range dependencies.

Clean versus robust accuracy
Robust training does cause a moderate drop in clean accuracy (accuracy on unperturbed test examples) compared with normal training. On IMDB, our normally trained CNN model gets 89% clean accuracy, compared to 81% for the robustly trained model. We also see a drop on SNLI: the normally trained BOW model gets 83% clean accuracy, compared to 79% for the robustly trained model. Similar drops in clean accuracy are also seen for robust models in vision (Madry et al., 2017). For example, the state-of-the-art robust model on CIFAR-10 (Zhang et al., 2019) only has 85% clean accuracy, but comparable normally-trained models get > 96% accuracy.
We found that the robustly trained models tend to underfit the training data-on IMDB, the CNN model gets only 86% clean training accuracy, lower than the test accuracy of the normally trained model. The model continued to underfit when we increased either the depth or width of the network. One possible explanation is that the attack surface adds a lot of noise, though a large enough model should still be able to overfit the training set. Better optimization or a tighter way to compute bounds could also improve training accuracy. We leave further exploration to future work.
Next, we analyzed the trade-off between clean and robust accuracy by varying the weight placed on robustness during training. We use accuracy against the genetic attack as our proxy for robust accuracy, rather than IBP-certified accuracy, as IBP bounds may be loose for models that were not trained with IBP. For data augmentation, we vary K, the number of augmented examples per real example, from 1 to 64. For certifiably robust training, we vary κ*, the weight of the certified robustness training objective, between 0.01 and 1.0. Figure 3 shows trade-off curves for the CNN model on 1000 random IMDB development set examples. Data augmentation can increase robustness somewhat, but cannot reach very high adversarial accuracy. With certifiably robust training, we can trade off some clean accuracy for much higher robust accuracy.

Runtime considerations
IBP enables efficient computation of u final (z, θ), but it still incurs some overhead. Across model architectures, we found that one epoch of certifiably robust training takes between 2× and 4× longer than one epoch of standard training. On the other hand, IBP certificates are much faster to compute at test time than genetic attack accuracy. For the robustly trained CNN IMDB model, computing certificates on 1000 test examples took 5 seconds, while running the genetic attack on those same examples took over 3 hours.

Error analysis
We examined development set examples on which models were correct on the original input but incorrect on the perturbation found by the genetic attack. We refer to such cases as robustness errors. We focused on the CNN IMDB models trained normally, robustly, and with data augmentation. We found that robustness errors of the robustly trained model mostly occurred when it was not confident in its original prediction. The model had > 70% confidence in the correct class for the original input in only 14% of robustness errors. In contrast, the normally trained and data augmentation models were more confident on their robustness errors; they had > 70% confidence on the original example in 92% and 87% of cases, respectively.
We next investigated how many words the genetic attack needed to change to cause misclassification, as shown in Figure 4. For the normally trained model, some robustness errors involved only a couple changed words (e.g., "I've finally found a movie worse than . . . " was classified negative, but the same review with "I've finally discovered a movie worse than. . . " was classified positive), but more changes were also common (e.g., part of a review was changed from "The creature looked very cheesy" to "The creature seemed supremely dorky", with 15 words changed in total). Surprisingly, certifiably robust training nearly eliminated robustness errors in which the genetic attack had to change many words: the genetic attack either caused an error by changing a couple words, or was unable to trigger an error at all. In contrast, data augmentation is unable to cover the exponentially large space of perturbations that involve many words, so it does not prevent errors caused by changing many words.

Training schedule
We investigated the importance of slowly increasing ε during training, as suggested in prior work. Fixing ε = 1 throughout training led to a 5-point reduction in certified accuracy for the CNN. On the other hand, we found that holding κ fixed did not hurt accuracy, and in fact may be preferable. More details are shown in Appendix A.5.

Word vector analysis
We determined the importance of the extra feedforward layer g_word that we apply to pre-trained word vectors, as described in Section 4.2. We compared with directly using pre-trained word vectors, i.e., φ(w) = φ_pre(w). We also tried using g_word but applying interval bounds on φ_pre(w), then computing bounds on φ(w) with the IBP formula for affine layers. In both cases, we could not train a CNN to achieve more than 52.2% certified accuracy on the development set. Thus, transforming pre-trained word vectors and applying interval bounds afterwards is crucial for robust training. In Appendix A.6, we show that robust training makes the intervals around transformed word vectors smaller, compared to the pre-trained vectors.

Related Work and Discussion
Recent work on adversarial examples in NLP has proposed various classes of perturbations, such as insertion of extraneous text (Jia and Liang, 2017), word substitutions (Alzantot et al., 2018), paraphrasing (Iyyer et al., 2018; Ribeiro et al., 2018), and character-level noise (Belinkov and Bisk, 2017; Ebrahimi et al., 2017). These works focus mainly on demonstrating models' lack of robustness, and mostly do not explore ways to increase robustness beyond data augmentation. Data augmentation is effective for narrow perturbation spaces (Jia and Liang, 2017; Ribeiro et al., 2018), but only confers partial robustness in other cases (Iyyer et al., 2018; Alzantot et al., 2018). Ebrahimi et al. (2017) tried adversarial training (Goodfellow et al., 2015) for character-level perturbations, but could only use a fast heuristic attack at training time, due to runtime considerations. As a result, their models could still be fooled by running a more expensive search procedure at test time.
Provable defenses have been studied for simpler NLP models and attacks, particularly for tasks like spam detection where real-life adversaries try to evade detection. Globerson and Roweis (2006) train linear classifiers that are robust to adversarial feature deletion. Dalvi et al. (2004) analyzed optimal strategies for a Naive Bayes classifier and attacker, but their classifier only defends against a fixed attacker that does not adapt to the model.
Recent work in computer vision (Szegedy et al., 2014; Goodfellow et al., 2015) has sparked renewed interest in adversarial examples. Most work in this area focuses on L∞-bounded perturbations, in which each input pixel can be changed by a small amount. The word substitution attack model we consider is similar to L∞ perturbations, as the adversary can change each input word by a small amount. Our work is inspired by work based on convex optimization (Raghunathan et al., 2018; Wong and Kolter, 2018) and builds directly on interval bound propagation, which has certified robustness of computer vision models to L∞ attacks. Adversarial training via projected gradient descent (Madry et al., 2018) has also been shown to improve robustness, but assumes that inputs are continuous. It could be applied in NLP by relaxing sets of word vectors to continuous regions.
This work provides certificates against word substitution perturbations for particular models. Since IBP is modular, it can be extended to other model architectures on other tasks. It is an open question whether IBP can give nontrivial bounds for sequence-to-sequence tasks like machine translation (Belinkov and Bisk, 2017; Michel et al., 2019). In principle, IBP can handle character-level typos (Ebrahimi et al., 2017; Pruthi et al., 2019), though typos yield more perturbations per word than we consider in this work. We are also interested in handling word insertions and deletions, rather than just substitutions. Finally, we would like to train models that get state-of-the-art clean accuracy while also being provably robust; achieving this remains an open problem.
In conclusion, state-of-the-art NLP models are accurate on average, but they still have significant blind spots. Certifiably robust training provides a general, principled mechanism to avoid such blind spots by encouraging models to make correct predictions on all inputs within some known perturbation neighborhood. This type of robustness is a necessary (but not sufficient) property of models that truly understand language. We hope that our work is a stepping stone towards models that are robust against an even wider, harder-to-characterize space of possible attacks.

A.1 Interval bounds for affine transformations and monotonic functions

Affine transformations. Affine transformations (e.g., linear layers) compute z_res = a · z_dep + b for a weight vector a and bias b. Given ℓ_dep ≤ z_dep ≤ u_dep elementwise, the upper bound is u_res = µ + r, where µ = 0.5 a · (ℓ_dep + u_dep) + b and r = 0.5 |a| · (u_dep − ℓ_dep). A similar computation yields ℓ_res = µ − r. Therefore, the interval O_res can be computed using two inner product evaluations: one with a and one with |a|.
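The midpoint/radius computation for affine layers takes only two inner products, as a minimal sketch shows:

```python
import numpy as np

def affine_interval_bounds(a, b, lower, upper):
    """Interval bounds on an affine node z_res = a . z_dep + b, given
    lower <= z_dep <= upper elementwise. Uses the midpoint/radius form:
    one inner product with a and one with |a|."""
    mu = 0.5 * a @ (lower + upper) + b   # midpoint of the output interval
    r = 0.5 * np.abs(a) @ (upper - lower)  # radius of the output interval
    return mu - r, mu + r

a = np.array([1.0, -2.0])
lo, hi = affine_interval_bounds(a, 0.5, np.array([0.0, 0.0]), np.array([1.0, 1.0]))
# Check against direct reasoning: z1 - 2*z2 + 0.5 over [0,1]^2 spans [-1.5, 1.5].
assert (lo, hi) == (-1.5, 1.5)
```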
Monotonic scalar functions. Activation functions such as ReLU, sigmoid, and tanh are monotonic. Suppose z_res = σ(z_dep), where z_res, z_dep ∈ R, i.e., the node applies an element-wise function to its input. The intervals can be computed trivially, since z_res is minimized at ℓ_dep and maximized at u_dep.
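For monotonically increasing functions the bound computation is a one-liner, sketched here with a sigmoid:

```python
import math

def monotonic_interval_bounds(sigma, lower, upper):
    """For a monotonically increasing elementwise function (ReLU, sigmoid,
    tanh), apply it to the two endpoints of the input interval."""
    return sigma(lower), sigma(upper)

sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
lo, hi = monotonic_interval_bounds(sigmoid, -1.0, 1.0)
assert 0.0 < lo < 0.5 < hi < 1.0
```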
A.2 Numerical stability of softmax

In this section, we show how to compute interval bounds for softmax layers in a numerically stable way. We will do this by showing how to handle log-softmax layers. Note that since softmax is just exponentiated log-softmax, and exponentiation is monotonic, bounds on log-softmax directly yield bounds on softmax. Let z_dep denote a vector of length m, let c be an integer in {1, . . . , m}, and let z_res represent the log-softmax score of index c, i.e.,

z_res = z_dep_c − log Σ_j exp(z_dep_j).
Given interval bounds ℓ_j ≤ z_dep_j ≤ u_j for each j, we show how to compute upper and lower bounds on z_res. For any vector v, we assume access to a subroutine that computes

logsumexp(v) = log Σ_i exp(v_i).

The standard way to compute this is to normalize v by subtracting max_i(v_i) before taking exponentials, then add it back at the end. logsumexp is a standard function in libraries like PyTorch. We will also rely on the fact that if v is the concatenation of vectors u and w, then logsumexp(v) = logsumexp([logsumexp(u), logsumexp(w)]).
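A minimal stable logsumexp, together with a check of the splitting property used below:

```python
import math

def logsumexp(v):
    """Numerically stable log(sum(exp(v))): shift by the max before
    exponentiating, then add it back at the end."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

# Splitting property: logsumexp of a concatenation equals the logsumexp
# of the two partial logsumexps.
u, w = [1.0, 2.0], [3.0, 4.0]
assert abs(logsumexp(u + w) - logsumexp([logsumexp(u), logsumexp(w)])) < 1e-12

# Stability: large inputs do not overflow thanks to the max shift.
assert abs(logsumexp([1000.0, 1000.0]) - (1000.0 + math.log(2.0))) < 1e-9
```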
Upper bound. The upper bound u res is achieved by having the maximum value of z dep c , and minimum value of all others. This can be written as: While we could directly compute this expression, it is difficult to vectorize. Instead, with some rearranging, we get The second term is the logsumexp of and logsumexp( dep ).
Since we know how to compute logsumexp, this reduces to computing (15). Note that (15) can be rewritten as

u^dep_c + log(1 − exp(ℓ^dep_c − u^dep_c))

by adding and subtracting u^dep_c. To compute this quantity, we consider two cases:

1. exp(ℓ^dep_c − u^dep_c) is close to 0 (i.e. u^dep_c − ℓ^dep_c is large). Here we use the fact that stable methods exist to compute log1p(x) = log(1 + x) for x close to 0. We compute the desired value as u^dep_c + log1p(−exp(ℓ^dep_c − u^dep_c)).

2. ℓ^dep_c − u^dep_c is close to 0. Here we use the fact that stable methods exist to compute expm1(x) = exp(x) − 1 for x close to 0. We compute the desired value as u^dep_c + log(−expm1(ℓ^dep_c − u^dep_c)).

Lower bound. The lower bound ℓ_res is achieved by taking the minimum value of z^dep_c and the maximum value of all other coordinates:

ℓ_res = ℓ^dep_c − log( exp(ℓ^dep_c) + Σ_{j≠c} exp(u^dep_j) ).

The second term is just a normal logsumexp, which is easy to compute. To vectorize the implementation, it helps to first compute the logsumexp of everything except u^dep_c, and then logsumexp that with ℓ^dep_c.
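The computations above can be combined into a short pure-Python sketch (function names and the log 2 case threshold are our choices, not from the paper; it assumes ℓ_c < u_c strictly and m ≥ 2):

```python
import math

def logsumexp(v):
    """Numerically stable log(sum(exp(v))): subtract max(v) before exponentiating."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def log1mexp(x):
    """Stable log(1 - exp(x)) for x < 0, via the two cases described above."""
    if x < -math.log(2.0):
        return math.log1p(-math.exp(x))   # exp(x) close to 0: log1p is accurate
    return math.log(-math.expm1(x))       # exp(x) close to 1: expm1 is accurate

def log_softmax_bounds(l, u, c):
    """Interval bounds on the log-softmax score of index c,
    given l[j] <= z[j] <= u[j] elementwise."""
    # Upper bound: z_c at its max, all other coordinates at their min.
    term = u[c] + log1mexp(l[c] - u[c])   # stable log(exp(u[c]) - exp(l[c]))
    u_res = u[c] - logsumexp([term, logsumexp(l)])
    # Lower bound: z_c at its min, all other coordinates at their max.
    rest = [u[j] for j in range(len(u)) if j != c]
    l_res = l[c] - logsumexp([logsumexp(rest), l[c]])
    return l_res, u_res
```

For m = 2 with l = [0, 0], u = [1, 1], and c = 0, the bounds reduce to ±versions of log-softmax at the box corners: u_res = 1 − log(e + 1) and l_res = −log(e + 1).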

A.3 Attack surface differences
In Alzantot et al. (2018), the adversary applies replacements one at a time, and the neighborhoods and language model scores are computed relative to the current altered version of the input. This results in a hard-to-define attack surface, as the same word can be replaced many times, leading to semantic drift. We instead pre-compute the allowed substitutions S(x, i) at index i based on the original x. We define S(x, i) as the set of x̃_i ∈ N(x_i) such that

log P(x_{i−W}, . . . , x_{i−1}, x̃_i, x_{i+1}, . . . , x_{i+W}) ≥ log P(x_{i−W}, . . . , x_{i+W}) − δ,

where probabilities are assigned by a pre-trained language model, and the window radius W and threshold δ are hyperparameters. We use W = 6 and δ = 5. We also use a different language model from Alzantot et al. (2018) that achieves perplexity of 50.79 on the One Billion Word dataset (Chelba et al., 2013). Alzantot et al. (2018) use a different, slower language model, which compels them to use a smaller window radius of W = 1.
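The pre-computation above can be sketched as follows (a minimal sketch; `neighbors` and the `log_prob` window scorer are hypothetical interfaces, not the paper's actual API):

```python
def allowed_substitutions(x, i, neighbors, log_prob, W=6, delta=5.0):
    """Sketch of precomputing S(x, i) from the ORIGINAL sentence x.

    neighbors: maps a word to its substitution set N(w) (assumption).
    log_prob:  hypothetical language-model log-probability of a word window.
    """
    lo, hi = max(0, i - W), min(len(x), i + W + 1)
    window = x[lo:hi]                     # words within radius W of index i
    orig_score = log_prob(window)
    allowed = []
    for cand in neighbors.get(x[i], []):
        perturbed = list(window)
        perturbed[i - lo] = cand
        # Keep the substitution only if the LM score drops by at most delta
        if log_prob(perturbed) >= orig_score - delta:
            allowed.append(cand)
    return allowed
```

Because every S(x, i) depends only on the original x, the attack surface is fixed up front and the same word cannot drift through repeated replacement.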

A.4 Experimental details
We do not run training for a set number of epochs; instead, we do early stopping on the development set. For normal training, we early stop on normal development set accuracy. For training with data augmentation, we early stop on accuracy on the augmented development set. For certifiably robust training, we early stop on certified robust accuracy on the development set. We use the Adam optimizer (Kingma and Ba, 2014) to train all models.

On IMDB, we restrict the model to only use the 50,000 words that are in the vocabulary of the counter-fitted word vector space of Mrkšić et al. (2016). This is because perturbations are not allowed for words outside this vocabulary, i.e. N(w) = {w} for w ∉ V; therefore, the model would be strongly incentivized to predict based on words outside of this set. While this is a valid way to achieve high certified accuracy, it is not a valid robustness strategy in general. We therefore simply delete all words that are not in the vocabulary before feeding the input to the model.

For SNLI, we use a 100-dimensional hidden state for the BOW model and a 3-layer feedforward network; these values were chosen by a hyperparameter search on the dev set. For DECOMPATTN, we use a 300-dimensional hidden state and a 2-layer feedforward network on top of the context-aware vectors; these values were chosen to match Parikh et al. (2016). Our implementation of the Decomposable Attention follows the original described in Parikh et al. (2016), except for a few differences listed below:
• We do not normalize GloVe vectors to have norm 1.
• We do not hash out-of-vocabulary words to randomly generated vectors that we train, instead we omit them.
• We randomly generate a null token vector that we then train. (Whether the null vector is trained is unspecified in the original paper.)
• We use the Adam optimizer (with a learning rate of 1 × 10^−4) instead of AdaGrad.
• We use a batch size of 256 instead of 4.
• We use a dropout probability of 0.1 instead of 0.2.
• We do not use the intra-sentence attention module.
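The IMDB out-of-vocabulary preprocessing described earlier in this section (deleting words outside the counter-fitted vocabulary before the model sees the input) amounts to a one-line filter; a minimal sketch with names of our choosing:

```python
def delete_oov(words, vocab):
    """Drop words outside the counter-fitted vocabulary V, so the model
    cannot rely on unperturbable words (N(w) = {w} for w not in V)."""
    return [w for w in words if w in vocab]
```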

A.5 Training schedule
In Table 4, we show the effect of holding ε or κ fixed during training, as described in Section 5.6. All numbers are on 1000 randomly chosen examples from the IMDB development set. Slowly increasing ε is important for good performance. Slowly increasing κ is actually slightly worse than holding κ = κ* fixed during training, despite earlier experiments we ran suggesting the opposite.
Here we only report certified accuracy, as all models are trained with certifiably robust training, and certified accuracy is much faster to compute for development purposes.
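A schedule of this shape can be sketched as a linear ramp (the function name, warmup/ramp lengths, and κ* value below are illustrative choices, not the paper's exact hyperparameters):

```python
def ibp_schedule(epoch, warmup=10, ramp=20, kappa_star=0.75):
    """Illustrative IBP training schedule: slowly ramp epsilon (the
    perturbation-space scale) from 0 to 1 while holding kappa (the mixing
    weight between normal and worst-case loss) fixed at kappa_star."""
    if epoch < warmup:
        frac = 0.0                       # start with normal training
    else:
        frac = min(1.0, (epoch - warmup) / ramp)
    eps = frac                           # slowly increasing eps: important
    kappa = kappa_star                   # fixed kappa worked slightly better
    return eps, kappa
```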

A.6 Word vector bound sizes
To better understand the effect of g^word, we checked whether g^word made interval bound boxes around neighborhoods N(w) smaller. For each word w with |N(w)| > 1, and for both the pre-trained vectors φ^pre(·) and the transformed vectors φ(·), we compute

(1/d) Σ_{i=1}^d (u^word_{w,i} − ℓ^word_{w,i}) / σ_i,

where d is the embedding dimension, ℓ^word_w and u^word_w are the interval bounds around either {φ^pre(w̃) : w̃ ∈ N(w)} or {φ(w̃) : w̃ ∈ N(w)}, and σ_i is the standard deviation across the vocabulary of the i-th coordinate of the embeddings. This quantity measures the average width of the IBP bounds for the word vectors of w and its neighbors, normalized by the standard deviation in each coordinate. On 78.2% of words with |N(w)| > 1, this value was smaller for the transformed vectors learned by the CNN on IMDB with robust training, compared to the GloVe vectors. For the same model with normal training, the value was smaller only 54.5% of the time, implying that robust training makes the transformation produce tighter bounds. We observed the same pattern for other model architectures as well.
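The normalized width above can be computed directly from the neighborhood's vectors (a minimal sketch; the function name is ours):

```python
def normalized_bound_width(neighbor_vecs, sigma):
    """Average width of the IBP box around a set of word vectors
    {phi(w~) : w~ in N(w)}, normalized per coordinate by sigma, the
    standard deviation of that coordinate across the vocabulary."""
    d = len(sigma)
    # Tightest axis-aligned box containing all neighbor vectors
    lower = [min(v[i] for v in neighbor_vecs) for i in range(d)]
    upper = [max(v[i] for v in neighbor_vecs) for i in range(d)]
    return sum((upper[i] - lower[i]) / sigma[i] for i in range(d)) / d
```

Comparing this quantity between φ^pre(·) and φ(·) for each word reproduces the per-word comparison reported above.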