Assessing Robustness of Text Classification through Maximal Safe Radius Computation

Neural network NLP models are vulnerable to small modifications of the input that maintain the original meaning but result in a different prediction. In this paper, we focus on the robustness of text classification against word substitutions, aiming to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym. As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, defined as the minimum distance in the embedding space to the decision boundary. Since computing the exact maximal safe radius is not feasible in practice, we instead approximate it by computing a lower and an upper bound. For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions. The lower bound computation is achieved through an adaptation of the linear bounding techniques implemented in the tools CNN-Cert and POPQORN, respectively for convolutional and recurrent network models. We evaluate the methods on sentiment analysis and news classification models for four datasets (IMDB, SST, AG News and NEWS) and a range of embeddings, and provide an analysis of robustness trends. We also apply our framework to interpretability analysis and compare it with LIME.


Introduction
Deep neural networks (DNNs) have shown great promise in Natural Language Processing (NLP), outperforming other machine learning techniques in sentiment analysis (Devlin et al., 2018), language translation (Chorowski et al., 2015), speech recognition (Jia et al., 2018) and many other tasks (see https://paperswithcode.com/area/natural-language-processing). Despite these successes, concerns have been raised about the robustness and interpretability of NLP models (Arras et al., 2016). It is known that DNNs are vulnerable to adversarial examples, that is, imperceptible perturbations of a test point that cause a prediction error (Goodfellow et al., 2014). In NLP this issue manifests itself as a sensitivity of the prediction to small modifications of the input text (e.g., replacing a word with a synonym). In this paper we work with DNNs for text analysis and, given a text and a word embedding, consider the problem of quantifying the robustness of the DNN with respect to word substitutions. In particular, we define the maximal safe radius (MSR) of a text as the minimum distance (in the embedding space) of the text from the decision boundary, i.e., from the nearest perturbed text that is classified differently from the original. Unfortunately, computation of the MSR for a neural network is an NP-hard problem and becomes impractical for real-world networks (Katz et al., 2017). As a consequence, we adapt constraint relaxation techniques (Weng et al., 2018a; Zhang et al., 2018; Wong and Kolter, 2018) developed to compute a guaranteed lower bound of the MSR for both convolutional (CNNs) and recurrent neural networks (RNNs).
Furthermore, in order to compute an upper bound for the MSR we adapt the Monte Carlo Tree Search (MCTS) algorithm (Coulom, 2007) to word embeddings to search for (syntactically and semantically) plausible word substitutions that result in a classification different from the original; the distance to any such perturbed text is an upper bound, albeit possibly loose. We employ our framework to perform an empirical analysis of the robustness trends of sentiment analysis and news classification tasks for a range of embeddings on vanilla CNN and LSTM models. In particular, we consider the IMDB dataset (Maas et al., 2011), the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013), the AG News Corpus dataset (Zhang et al., 2015) and the NEWS dataset (Vitale et al., 2012). We empirically observe that, although NLP models are generally vulnerable to minor perturbations and their robustness degrades with the dimensionality of the embedding, in some cases we are able to certify the text's classification against any word substitution. Furthermore, we show that our framework can be employed for interpretability analysis by computing a saliency measure for each word, which has the advantage of being able to take into account non-linearities of the decision boundary that local approaches such as LIME (Ribeiro et al., 2016) cannot handle.
In summary this paper makes the following main contributions: • We develop a framework for quantifying the robustness of NLP models against (single and multiple) word substitutions based on MSR computation.
• We adapt existing techniques for approximating the MSR (notably CNN-Cert, POPQORN and MCTS) to word embeddings and semantically and syntactically plausible word substitutions.
• We evaluate vanilla CNN and LSTM sentiment and news classification models on a range of embeddings and datasets, and provide a systematic analysis of the robustness trends and comparison with LIME on interpretability analysis.
Related Work. Deep neural networks are known to be vulnerable to adversarial attacks (small perturbations of the network input that result in a misclassification) (Szegedy et al., 2014; Biggio et al., 2013; Biggio and Roli, 2018). The NLP domain has also been shown to suffer from this issue (Belinkov and Bisk, 2018; Ettinger et al., 2017; Gao et al., 2018; Jia and Liang, 2017; Liang et al., 2017; Zhang et al., 2020). The vulnerabilities of NLP models have been exposed via, for example, small character perturbations (Ebrahimi et al., 2018), syntactically controlled paraphrasing (Iyyer et al., 2018), targeted keyword attacks (Alzantot et al., 2018; Cheng et al., 2018), and exploitation of back-translation systems (Ribeiro et al., 2018). Formal verification can guarantee that the classification of an input of a neural network is invariant to perturbations of a certain magnitude, which can be established through the concept of the maximal safe radius (Wu et al., 2020) or, dually, minimum adversarial distortion (Weng et al., 2018b). While verification methods based on constraint solving (Katz et al., 2017, 2019) and mixed integer programming (Dutta et al., 2018; Cheng et al., 2017) can provide complete robustness guarantees, in the sense of computing exact bounds, they are expensive and do not scale to real-world networks because the problem itself is NP-hard (Katz et al., 2017). To work around this, incomplete approaches, such as search-based methods (Huang et al., 2017; Wu and Kwiatkowska, 2020), have been developed, trading completeness for scalability.

Robustness Quantification of Text Classification against Word Substitutions
In text classification an algorithm processes a text and associates it with a category. Raw text, i.e., a sequence of words (or similarly sentences or phrases), is converted into a sequence of real-valued vectors through an embedding E : W → X ⊆ R^d, which maps each element of a finite set W (e.g., a vocabulary) to a vector of real numbers. There are many different ways to build embeddings (Goldberg and Levy, 2014; Pennington et al., 2014; Wallach, 2006); nonetheless, their common objective is to capture relations among words. Furthermore, it is also possible to enforce syntactic/semantic constraints on the embedding, a technique commonly known as counter-fitting (Mrkšić et al., 2016), which we assess from a robustness perspective in Section 3. Each text is represented uniquely by a sequence of vectors x = (x_1, ..., x_m), where m ∈ N and x_i ∈ X, padding if necessary. In this work we consider text classification with neural networks: a text embedding x is classified into a category c ∈ C through a trained network N : [0,1]^{d·m} → R^{|C|}, i.e., c = arg max_{i∈C} N_i(x), where without any loss of generality we assume that each dimension of the input space of N is normalised between 0 and 1. We note that pre-trained embeddings are scaled before training, thus resulting in an L_∞ diameter whose maximum value is 1. Thus, the lower and upper bound measurements are affected by normalisation only when one compares embeddings of different dimensions under norms other than L_∞.
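As an illustration, the pipeline above (tokens, an embedding E, a padded sequence x, and a prediction arg max_i N_i(x)) can be sketched in a few lines of Python. The vocabulary, embedding values and linear "network" below are toy stand-ins for illustration only, not the models used in the paper.

```python
import numpy as np

def classify(text, embedding, network, d, m):
    """Embed a tokenised text into a fixed-length sequence of
    d-dimensional vectors (padding with zeros up to m words) and
    return the predicted class arg max_i N_i(x)."""
    x = np.zeros((m, d))
    for i, word in enumerate(text[:m]):
        x[i] = embedding[word]           # E : W -> X subset of R^d
    logits = network(x.reshape(-1))      # N : [0,1]^{d*m} -> R^{|C|}
    return int(np.argmax(logits))

# Toy setup: 2-dimensional embedding, texts padded to 5 words,
# and a random linear map standing in for a trained network.
rng = np.random.default_rng(0)
E = {"good": np.array([0.9, 0.8]),
     "bad": np.array([0.1, 0.2]),
     "movie": np.array([0.5, 0.5])}
W = rng.normal(size=(2, 10))             # |C| = 2 classes, d*m = 10 inputs
net = lambda v: W @ v
c = classify(["good", "movie"], E, net, d=2, m=5)
```
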
In this paper robustness is measured for both convolutional and recurrent neural networks, with the distance between words in the embedding space calculated under either the L_2 or the L_∞ norm: the former is a proxy for semantic similarity between words in polarised embeddings (this is discussed in more detail in the experimental section), while the latter, which takes into account the maximum variation along all the embedding dimensions, is used to compare different robustness profiles.
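For concreteness, the two norms compare as follows on a pair of hypothetical 3-dimensional word vectors (the values are made up for illustration); note that L_∞ ≤ L_2 ≤ √d · L_∞ always holds.

```python
import numpy as np

u = np.array([0.2, 0.9, 0.4])   # hypothetical embedding of "strange"
v = np.array([0.3, 0.7, 0.4])   # hypothetical embedding of "odd"

l2 = np.linalg.norm(u - v, ord=2)         # proxy for semantic similarity
linf = np.linalg.norm(u - v, ord=np.inf)  # max variation over dimensions
```
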

Robustness Measure against Word Substitutions
Given a text embedding x, a metric L_p, a subset of word indices I ⊆ {1, . . . , m}, and a distance ε ≥ 0, we define Ball(x, ε) = {x' : ||x'_I − x_I||_p ≤ ε and x'_j = x_j for all j ∉ I}, where x_I is the sub-vector of x that contains only the embedding vectors corresponding to words in I. That is, Ball(x, ε) is the set of embedded texts obtained by replacing words in I within x and whose distance to x is no greater than ε. We elide the index set I to simplify the notation. Below we define the notion of the maximal safe radius (MSR), which is the minimum distance of an embedded text from the decision boundary of the network.
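A direct translation of this definition into code, assuming texts are (m, d) arrays of word vectors and I is the set of perturbable indices:

```python
import numpy as np

def in_ball(x, x_pert, I, eps, p=2):
    """Check whether x_pert lies in Ball(x, eps): only words whose
    index is in I may differ, and the L_p distance between the
    sub-vectors x_I and x'_I must not exceed eps.
    x, x_pert: (m, d) arrays of word embeddings."""
    fixed = [i for i in range(len(x)) if i not in I]
    if not np.allclose(x[fixed], x_pert[fixed]):
        return False                     # words outside I must be unchanged
    diff = (x[list(I)] - x_pert[list(I)]).ravel()
    return np.linalg.norm(diff, ord=p) <= eps

x = np.zeros((3, 2))                     # 3-word text, d = 2
y = x.copy()
y[1] = [0.3, 0.4]                        # perturb word 1 only
```
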
Definition 1 (Maximal Safe Radius). Given a neural network N, a subset of word indices I ⊆ {1, . . . , m}, and a text embedding x, the maximal safe radius MSR(N, x) is the minimum distance from input x to the decision boundary, i.e.,

MSR(N, x) = inf { ||x'_I − x_I||_p : x' such that arg max_{i∈C} N_i(x') ≠ arg max_{i∈C} N_i(x) }.

Figure 1: Illustration of the Maximal Safe Radius (MSR) and its upper and lower bounds. An upper bound of the MSR is obtained by computing the distance to the input text of any perturbation resulting in a class change (blue ellipse). A lower bound certifies that perturbations of the words contained within that radius are guaranteed not to change the classification decision (green ellipse). Both upper and lower bounds approximate the MSR (black ellipse). In this example the word strange can be safely substituted with odd. The word timeless lies between the upper and lower bounds of the MSR, so our approach cannot guarantee that it would not change the neural network prediction.

In particular, if the normalised MSR is greater than 1, then x is robust to any perturbation of the words in I. Conversely, low values of the normalised MSR indicate that the network's decision is vulnerable at x because of the ease with which the classification outcome can be manipulated. Further, averaging the MSR over a set of inputs yields a robustness measure of the network, as opposed to one specific to a given text. Under standard assumptions of bounded variation of the underlying learning function, the MSR is also generally employed to quantify the robustness of the NN to adversarial examples (Wu et al., 2020; Weng et al., 2018a), that is, small perturbations that yield a prediction that differs from the ground truth. Since computing the MSR is NP-hard (Katz et al., 2017), we instead approximate it by computing a lower and an upper bound for this quantity (see Figure 1). The strategy for obtaining an upper bound is detailed in Section 2.2, whereas for the lower bound (Section 2.3) we adapt constraint relaxation techniques developed for the verification of deep neural networks.
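The decision logic the two bounds support can be summarised in a short sketch: given a certified lower bound, the distance to the closest class-changing text found by search as an upper bound, and the embedding diameter used for normalisation (all values below are illustrative):

```python
def robustness_verdict(msr_lower, msr_upper, diameter):
    """Interpret lower/upper bounds on MSR(N, x), normalised by the
    diameter of the embedding space, so that a normalised lower
    bound > 1 certifies robustness to *any* substitution of the
    words in I."""
    assert 0 <= msr_lower <= msr_upper
    lower, upper = msr_lower / diameter, msr_upper / diameter
    if lower > 1.0:
        return "certified"    # no substitution can change the class
    if upper <= 1.0:
        return "vulnerable"   # a concrete class-changing text exists
    return "unknown"          # the true MSR lies between the bounds
```
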

Upper Bound: Monte Carlo Tree Search
An upper bound for the MSR is given by any perturbation of the text that is classified by the NN differently from the original text. In order to consider only perturbations that are syntactically coherent with the input text, we use filtering in conjunction with an adaptation of the Monte Carlo Tree Search (MCTS) algorithm (Coulom, 2007) to the NLP scenario (Figure 2). The algorithm takes as input a text, embeds it as a sequence of vectors x, and builds a tree where at each iteration a set of indices I identifies the words that have been modified so far: at the first level of the tree a single word is changed to manipulate the classification outcome, at the second level two words are perturbed, with the former being the same word as for the parent vertex, and so on (i.e., for each vertex, I contains the indices of the words that have been perturbed plus that of the current vertex). We allow only word-for-word substitutions. At each stage the procedure outputs all the successful attacks (i.e., perturbed texts that are classified by the neural network differently from the original text) that have been found until the terminating condition is satisfied (e.g., a fixed fraction of the total number of vertices has been explored). Successful perturbations can be used as diagnostic information in cases where ground-truth information is available. The algorithm explores the tree according to the UCT heuristic (Browne et al., 2012), where the most urgent vertices are those whose perturbations induce the largest drop in the neural network's confidence. A detailed description of the resulting algorithm, which follows the classical algorithm (Coulom, 2007) while working directly with word embeddings, can be found in Appendix A.1.
Perturbations are sampled by considering the n closest replacements in the word's neighbourhood: the distance between words is measured in the L_2 norm, while the number of substitutions per word is limited to a fixed constant (in our experiments this is either 1000 or 10000). In order to enforce the syntactic consistency of the replacements we consider part-of-speech tagging of each word based on its context. We then filter all the replacements found by MCTS to exclude those that are not of the same type, unless the replacement type still maintains the syntactic consistency of the perturbed text (e.g., a noun can sometimes be replaced by an adjective). To accomplish this task we use the Natural Language Toolkit (Bird et al., 2009). More details are provided in Appendix A.1.

Figure 2: Structure of the tree after two iterations of the MCTS algorithm. Simulations of 1-word substitutions are executed at each vertex on the first level to update the UCT statistics. The most urgent vertex is then expanded (e.g., the word the) and several 2-word substitutions are executed, combining the word identified by the current vertex (e.g., the word movie at the second level of the tree) with that of its parent, i.e., the. Redundant substitutions may be avoided (greyed-out branch).
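A sketch of the part-of-speech filter, with a toy context-free tagger standing in for the NLTK tagger we actually use; the `compatible` argument encodes exceptions such as a noun sometimes being replaceable by an adjective (the tagger, tags and names below are illustrative, not our implementation):

```python
def filter_substitutions(sentence, idx, candidates, pos_tag,
                         compatible=frozenset()):
    """Keep only candidate replacements for sentence[idx] whose
    part-of-speech tag (in context) matches the original word's tag,
    or belongs to an explicitly allowed pair, e.g. ("NN", "JJ") to
    let a noun be replaced by an adjective."""
    original_tag = pos_tag(sentence)[idx][1]
    kept = []
    for cand in candidates:
        perturbed = sentence[:idx] + [cand] + sentence[idx + 1:]
        tag = pos_tag(perturbed)[idx][1]
        if tag == original_tag or (original_tag, tag) in compatible:
            kept.append(cand)
    return kept

# Toy context-free tagger standing in for nltk.pos_tag.
TAGS = {"movie": "NN", "film": "NN", "strange": "JJ", "ran": "VBD"}
toy_tag = lambda words: [(w, TAGS.get(w, "NN")) for w in words]
```
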

Lower Bound: Constraint Relaxation
A lower bound l for MSR(N, x) is a radius such that all the texts in Ball(x, l) are classified in the same class by N. Note that, as MSR(N, x) is defined in the embedding space, which is continuous, the perturbation space Ball(x, ε) contains meaningful texts as well as texts that are not syntactically or semantically meaningful. In order to compute l we leverage constraint relaxation techniques developed for CNNs (Boopathy et al., 2019) and LSTMs (Ko et al., 2019), namely CNN-Cert and POPQORN. For an input text x and a hyperbox around Ball(x, ε), these techniques find linear lower and upper bounds for the activation functions of each layer of the neural network and use these to propagate an over-approximation of the hyperbox through the network. The bound l is then computed as the largest real such that all the texts in Ball(x, l) are in the same class, i.e., for all x' ∈ Ball(x, l), arg max_{i∈C} N_i(x') = arg max_{i∈C} N_i(x). Note that, as Ball(x, l) contains only texts obtained by perturbing a subset of the words (those whose index is in I), to adapt CNN-Cert and POPQORN to our setting we have to fix the dimensions of x corresponding to words not in I and only propagate through the network the intervals corresponding to words in I.
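The core idea, propagating a box whose widened dimensions are exactly those of the words in I and bisecting for the largest certified radius, can be illustrated on a one-layer toy network. This is plain interval arithmetic, far cruder than the linear relaxations of CNN-Cert/POPQORN, but the adaptation that fixes the dimensions outside I is the same; all networks and values below are illustrative.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate a box [lo, hi] exactly through x -> Wx + b by
    splitting W into its positive and negative parts."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def certified_at(x, I_mask, eps, W, b, true_class):
    """Over-approximate the network over Ball(x, eps) in L_inf:
    only dimensions of perturbed words (I_mask == 1) are widened,
    the rest stay fixed. Returns True if the true class provably
    keeps the largest output."""
    lo = np.clip(x - eps * I_mask, 0.0, 1.0)   # inputs live in [0,1]
    hi = np.clip(x + eps * I_mask, 0.0, 1.0)
    lo, hi = interval_affine(lo, hi, W, b)     # one affine layer
    lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    # (a final linear layer would be propagated the same way)
    return all(lo[true_class] > hi[j]
               for j in range(len(lo)) if j != true_class)

def msr_lower_bound(x, I_mask, W, b, true_class, hi_eps=1.0, iters=30):
    """Bisect for the largest certified radius: a lower bound on MSR."""
    lo_eps = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo_eps + hi_eps)
        if certified_at(x, I_mask, mid, W, b, true_class):
            lo_eps = mid
        else:
            hi_eps = mid
    return lo_eps

x = np.array([0.8, 0.2])        # toy 1-word, 2-dimensional "text"
W, b = np.eye(2), np.zeros(2)   # identity "network": class 0 wins
lb_all = msr_lower_bound(x, np.ones(2), W, b, true_class=0)
lb_one = msr_lower_bound(x, np.array([1.0, 0.0]), W, b, true_class=0)
```

Note how fixing the second dimension (simulating a word outside I) yields a larger certified radius than perturbing both.
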

Experimental Results
We use our framework to empirically evaluate the robustness of neural networks for sentiment analysis and news classification on typical CNN and LSTM architectures.

Table 1: Datasets used for the experimental evaluation. We report the number of samples (training/test ratio as provided in the original works) and output classes, the average and maximum length of each input text before pre-processing, and the maximum length considered in our experiments.

While we quantify lower bounds of the MSR for CNNs and LSTMs with the CNN-Cert and POPQORN tools, respectively, we implement the MCTS algorithm introduced in Section 2.2 to search for meaningful perturbations (i.e., upper bounds), regardless of the NN architecture employed. In particular, in Section 3.1 we consider robustness against single and multiple word substitutions and investigate implicit biases of LSTM architectures. In Section 3.2 we study the effect of the embedding on robustness, while in Section 3.3 we employ our framework to perform saliency analysis of the most relevant words in a text.

Experimental Setup and Implementation
We have trained several vanilla CNN and LSTM models on datasets that differ in the length of each input, the number of target classes and the difficulty of the learning task. All our experiments were conducted on a server equipped with two 24-core Intel Xeon 6252 processors and 256GB of RAM; we emphasise that, although the experiments reported here have been performed on a cluster, all the algorithms are reproducible on a mid-end laptop (we used a machine with 16GB of RAM and an 8th-generation Intel Core i5 processor). Code for reproducing the MCTS experiments is available at: https://github.com/EmanueleLM/MCTS. The datasets we consider are summarised in Table 1. In our experiments we consider different embeddings: complex, probabilistically-constrained representations (GloVe and GloVeTwitter) trained on global word-word co-occurrence statistics from a corpus, as well as the simplified embedding provided by the Keras Python Deep Learning Library (referred to as Keras Custom) (Chollet et al., 2015), which allows one to fine-tune the exact dimension of the vector space and only aims at minimising the loss on the classification task. The resulting learned Keras Custom embedding does not capture complete word semantics, just their emotional polarity. More details are reported in Appendix A.3 and Table 4. For our experiments, we consider a 3-layer CNN, whose first layer consists of a bidimensional convolution with 150 filters, each of size 3×3, and an LSTM model with 256 hidden neurons on each gate. We have trained more than 20 architectures on the embeddings and datasets mentioned above. We note that, though other architectures might offer higher accuracy for sentence classification (Kim, 2014), this vanilla setup has been chosen intentionally not to be optimised for a specific task, thus allowing us to measure the robustness of baseline models. Both the CNNs and the LSTMs predict the output with a softmax output layer, and the categorical cross-entropy loss function is used during the optimisation phase, which is performed with the Adam algorithm (Kingma and Ba, 2014) (without early stopping); further details are reported in Appendix A.3.
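For reference, the softmax output and categorical cross-entropy used during training amount to the following; this is a minimal numpy sketch, not the Keras implementation.

```python
import numpy as np

def softmax(z):
    """Softmax over logits, shifted for numerical stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a
    one-hot encoded target class."""
    return -float(np.sum(one_hot * np.log(probs + 1e-12)))

p = softmax(np.array([2.0, 1.0, 0.1]))
loss = categorical_cross_entropy(p, np.array([1.0, 0.0, 0.0]))
```
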

Robustness to Word Substitutions
For each combination of a neural network and an embedding, we quantify the MSR against single and multiple word substitutions, meaning that the set of word indices I (see Definition 1) consists of 1 or more indices. Interestingly, our framework is able to prove that certain input texts and architectures are robust to any single-word substitution, that is, replacing any single word of the text with any other word, not necessarily a synonym or a grammatically correct choice, will not affect the classification outcome. Figure 3 shows that for CNN models equipped with the Keras Custom embedding the (lower bound of the) MSR on some texts from the IMDB dataset is greater than the diameter of the embedding space. To consider only perturbations that are semantically close and syntactically coherent with the input text, we employ the MCTS algorithm with filtering described in Section 2.2. An example of a successful perturbation is shown in Figure 4, where we illustrate the effectiveness of single-word substitutions on inputs that differ in the confidence of the neural network prediction. We note that even with simple tagging it is possible to identify perturbations where replacements are meaningful. For the first example in Figure 4 (top), the network changes the output class to World when the word China is substituted with U.S.. Although this substitution may be relevant to that particular class, we note that the perturbed text is coherent and the main topic remains sci-tech. Furthermore, the classification also changes when the word exists is replaced with a plausible alternative, misses, a perturbation that is neutral, i.e., not informative for any of the possible output classes. In the third sentence in Figure 4 (bottom), we note that replacing championship with wrestling makes the model output class World, where originally it was Sport, indicating that the model relies on a small number of key words to make its decision.
We report a few additional examples of word replacements for a CNN model equipped with the GloVe-50d embedding. Given as input the review 'this is art paying homage to art' (from the SST dataset), when art is replaced by graffiti the network misclassifies the review (from positive to negative). Further, as mentioned earlier, the MCTS framework is capable of finding multiple-word perturbations: in the same setting as the previous example, when in the review 'it's not horrible just horribly mediocre' the words horrible and horribly are replaced, respectively, with gratifying and decently, the review is classified as positive, while the original sentence was classified as negative. Robustness results for high-dimensional embeddings are included in Table 3, where we report the trends of the average lower and upper bounds of the MSR and the percentage of successful perturbations computed over 100 texts (per dataset) for different architectures and embeddings. Further results are in Appendix A.3, including statistics on lower bounds (Tables 5, 6) and single and multiple word substitutions (Tables 7, 8).
CNNs vs. LSTMs By comparing the average robustness assigned to each word, respectively, by CNN-Cert and POPQORN over all the experiments on a fixed dataset, it clearly emerges that recurrent models are less robust to perturbations that occur in the very first words of a sentence; interestingly, CNNs do not suffer from this problem. A visual comparison is shown in Figure 6. The key difference is the structure of LSTMs compared to CNNs: while in LSTMs the first input word influences the successive layers, thus amplifying the manipulations, the output of a convolutional region is independent from any other of the same layer. On the other hand, both CNNs and LSTMs have in common an increased resilience to perturbations on texts that contain multiple polarised words, a trend that suggests that, independently of the architecture employed, robustness relies on a distributed representation of the content in a text (Figure 5).

Table 3: Statistics on single-word substitutions averaged over 100 input texts of each dataset. We report: the average lower bound of the MSR as measured with either CNN-Cert or POPQORN; the approximate ratio with which, given a word from a text, we find a single-word substitution, together with the average number of words that, substituted for a given word, change the classification; and the average upper bound, computed as the distance between the original word and the closest substitution found by MCTS (when no successful perturbation is found we over-approximate the upper bound for that word with the diameter of the embedding). Values reported for lower bounds have been normalised by each embedding diameter (measurements in the L_2 norm).

Influence of the Embedding on Robustness
As illustrated in Table 2 and in Figure 3, models that employ small embeddings are more robust to perturbations. On the contrary, robustness decreases, by one to two orders of magnitude, when words are mapped to high-dimensional spaces, a trend that is confirmed also by MCTS (see Appendix Table 8). This may be explained by the fact that adversarial perturbations are inherently related to the dimensionality of the input space (Carbone et al., 2020; Goodfellow et al., 2014). We also discover that models trained on longer inputs (e.g., IMDB) are more robust than those trained on shorter ones (e.g., SST): in long texts the decision made by the algorithm depends on multiple words that are evenly distributed across the input, while for shorter sequences the decision may depend on very few, polarised terms.

Figure 6: Robustness lower bound trends for successive input words for LSTMs (red dots) and CNNs (blue dots) on the NEWS and AG News datasets.

From Table 3 we note that polarity-constrained embeddings (Keras) are more robust than probabilistically-constrained ones (GloVe) on relatively large datasets (IMDB), whereas the opposite is true on smaller input dimensions. The experiments suggest that models with embeddings that group together words closely related to a specific output class (e.g., positive words) are more robust, as opposed to models whose embeddings gather words together on a different principle (e.g., words that appear in the same context): intuitively, in the former case, words like good will be close to synonyms like better and nice, while in the latter words like good and bad, which often appear in the same context (think of the phrase 'the movie was good/bad'), will be closer in the embedding space. In the spirit of the analysis in (Baroni et al., 2014), we empirically measured whether robustness is affected by the nature of the embedding employed, that is, either prediction-based (i.e., embeddings that are trained alongside the classification task) or hybrid/count-based (e.g., GloVe, GloVeTwitter). By comparing the robustness of different embeddings and the distance between words that share the same polarity profile (e.g., positive vs. negative), we note that the MSR is a particularly well-suited robustness metric for prediction-based embeddings, with the distance between words serving as a reasonable estimator of word-to-word semantic similarity w.r.t. the classification task. On the other hand, for hybrid and count-based embeddings (e.g., GloVe), especially when words are represented as high-dimensional vectors, the distance between two words in the embedding space, when compressed into a single scalar, does not retain enough information to estimate the relevance of input variations.
Therefore, in this scenario, an approach based solely on the MSR is limited by the choice of the distance function between words, and may lose its effectiveness unless additional factors such as context are considered. Further details of our evaluation are provided in Appendix A.3, Table 5 and Figure 11.
Counter-fitting To mitigate the robustness issue in multi-class datasets characterised by short sequences, we have repeated the robustness measurements with counter-fitted embeddings (Mrkšić et al., 2016), i.e., vector-space representations into which additional constraints for antonyms and synonyms are injected in order to improve the vectors' capability to encode semantic similarity. We observe that the estimated lower bound of the MSR generally increases for low-dimensional embeddings, up to twice the lower bound obtained for non-counter-fitted embeddings. This phenomenon is particularly relevant when Keras Custom 5d and 10d are employed; see Appendix A.3, Table 6. On the other hand, the benefits of counter-fitting are less pronounced for high-dimensional embeddings. The same pattern can be observed in Figure 7, where multiple word substitutions per text are allowed. Further details can be found in Appendix A.3, Tables 6 and 8.
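The idea behind counter-fitting can be sketched as a simple iterative update: pull synonym pairs together, push antonym pairs apart up to a margin, and regularise towards the original vectors to preserve the embedding's geometry. The vectors, pairs, margin and learning rate below are toy choices for illustration, not those of Mrkšić et al. (2016).

```python
import numpy as np

def counter_fit(vectors, synonyms, antonyms, gamma=0.3,
                lr=0.05, steps=200, reg=0.1):
    """Toy counter-fitting loop: attract synonym pairs, repel
    antonym pairs while they are closer than the margin gamma,
    and pull every vector back towards its original position."""
    V = {w: v.astype(float).copy() for w, v in vectors.items()}
    for _ in range(steps):
        for a, b in synonyms:            # attract: shrink d(a, b)
            diff = V[a] - V[b]
            V[a] -= lr * diff
            V[b] += lr * diff
        for a, b in antonyms:            # repel while d(a, b) < gamma
            diff = V[a] - V[b]
            if np.linalg.norm(diff) < gamma:
                V[a] += lr * diff
                V[b] -= lr * diff
        for w in V:                      # stay near the original vector
            V[w] -= lr * reg * (V[w] - vectors[w])
    return V

vecs = {"good": np.array([0.5, 0.5]),
        "nice": np.array([0.9, 0.1]),
        "bad": np.array([0.55, 0.45])}
cf = counter_fit(vecs, synonyms=[("good", "nice")],
                 antonyms=[("good", "bad")])
```

After the loop, synonyms end up closer and antonyms further apart than in the original vectors.
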

Interpretability of Sentiment Analysis via Saliency Maps
We employ our framework to perform interpretability analysis on a given text. For each word of a given text we compute the (lower bound of the) MSR and use this as a measure of its saliency, where small values of the MSR indicate that minor perturbations of that word can have a significant influence on the classification outcome. We use the above measure to compute saliency maps for both CNNs and LSTMs, and compare our results with those obtained by LIME (Ribeiro et al., 2016), which assigns saliency to input features according to the best linear model that locally explains the decision boundary. Our method has the advantage of being able to account for non-linearities in the decision boundary that a local approach such as LIME cannot handle, albeit at the cost of higher computational complexity (a similar point was made in (Blaas et al., 2020) for Gaussian processes). As a result, we are able to discover words that our framework views as important, but LIME does not, and vice versa. In Figure 8 we report two examples, one for an IMDB positive review (Figure 8 (a)) and another from the NEWS dataset classified using an LSTM (Figure 8 (b)). In Figure 8 (a) our approach finds that the word many is salient, and perturbing it slightly can make the NN change the class of the review to negative. In contrast, LIME does not identify many as significant. In order to verify this result empirically, we run our MCTS algorithm (Section 2.2) and find that simply substituting many with worst changes the classification to negative. Similarly, for Figure 8 (b), where the input is assigned to class 5 (health), perturbing the punctuation mark (:) may alter the classification, whereas LIME does not recognise its saliency.
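The saliency computation reduces to ranking words by their per-word MSR lower bound; the sketch below uses a toy table of bounds in place of an actual CNN-Cert/POPQORN call, and all names and values are illustrative.

```python
import numpy as np

def saliency_map(words, msr_lower_bound):
    """Rank words by the lower bound of the MSR obtained when only
    that word (I = {i}) may be perturbed: the smaller the bound,
    the more a small change to the word can sway the prediction.
    `msr_lower_bound(i)` stands in for a verifier call."""
    bounds = np.array([msr_lower_bound(i) for i in range(len(words))])
    saliency = 1.0 / (bounds + 1e-12)   # small MSR -> high saliency
    order = np.argsort(-saliency)
    return [(words[i], float(bounds[i])) for i in order]

# Toy stand-in: pretend the verifier returned these per-word bounds.
toy_bounds = {0: 0.9, 1: 0.05, 2: 0.4}
ranked = saliency_map(["a", "many", "scenes"], lambda i: toy_bounds[i])
```
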

Conclusions
We introduced a framework for evaluating robustness of NLP models against word substitutions.
Through an extensive experimental evaluation we demonstrated that our framework allows one to certify certain architectures against single-word perturbations, and illustrated how it can be employed for interpretability analysis. While we focus on perturbations that are syntactically coherent, we acknowledge that semantic similarity between phrases is a crucial aspect that requires an approach which takes into account the context in which substitutions happen: we will tackle this limitation in future work. Furthermore, we will address the robustness of more complex architectures, e.g., networks that exploit attention-based mechanisms (Vaswani et al., 2017).

A.1 Monte Carlo Tree Search (MCTS)
We adapt the MCTS algorithm (Browne et al., 2012) to the NLP classification setting with word embeddings; we report it here for completeness as Algorithm 1. The algorithm explores modifications of the original text by substituting one word at a time with nearest-neighbour alternatives. It takes as input: text, expressed as a list of T words; N, the neural network as introduced in Section 2; E, an embedding; sims, an integer specifying the number of Monte Carlo samplings at each step; and α, a real-valued meta-parameter specifying the exploration/exploitation trade-off for vertices that can be further expanded. The salient steps of the MCTS procedure are: • Select: the most promising vertex to explore is chosen to be expanded (Line 14) according to the standard UCT heuristic, UCT(v) = Q(v) + α · sqrt(2 ln N(v') / N(v)), where v and v' are respectively the selected vertex and its parent; α is a meta-parameter that balances the exploration-exploitation trade-off; N(·) represents the number of times a vertex has been visited; and Q(·) measures the neural network confidence drop, averaged over the Monte Carlo simulations for that specific vertex.
• Expand: the tree is expanded with T new vertices, one for each word in the input text (avoiding repetitions). A vertex at index t ∈ {1, ...T } and depth n > 0 represents the strategy of perturbing the t-th input word, plus all the words whose indices have been stored in the parents of the vertex itself, up to the root.
• Simulate: simulations are run from the current position in the tree to estimate how the neural network behaves against the perturbations sampled at that stage (Line 23). If one of the word substitutions induced by the simulation makes the network change the classification, a successful substitution has been found and is added to the results, while the value Q of the current vertex is updated. Many heuristics can be considered at this stage, for example the average drop in the confidence of the network over all the simulations. We have found that the average drop is not a good measure of how much the robustness of the network degrades when specific words are replaced, since for a high number of simulations an effective perturbation might pass unnoticed. We thus work with the maximum drop over all the simulations, which works slightly better in this scenario (Line 27).
• Backpropagate: the reward received is backpropagated to the vertices visited during selection and expansion to update their UCT statistics. It is known that, when UCT is employed (Browne et al., 2012; Kocsis and Szepesvári, 2006), MCTS guarantees that the probability of selecting a sub-optimal perturbation tends to zero at a polynomial rate as the number of games grows to infinity (i.e., it is guaranteed to find a discrete perturbation, if one exists).
For our implementation we adopted sims = 1000 and α = 0.5. Tables 7 and 8 give details of the MCTS experiments with single and multiple word substitutions.
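The tree bookkeeping behind the four steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Vertex class and the uct, select, expand and backpropagate helpers are hypothetical stand-ins for the corresponding parts of Algorithm 1, and Q stores the maximum confidence drop, per the Simulate step.

```python
import math

class Vertex:
    def __init__(self, word_index=None, parent=None):
        self.word_index = word_index  # index t of the word perturbed at this vertex
        self.parent = parent
        self.children = []
        self.visits = 0               # N(v): number of times this vertex was visited
        self.q = 0.0                  # Q(v): maximum confidence drop observed

def uct(v, alpha):
    # Standard UCT score: exploitation term plus alpha-weighted exploration.
    if v.visits == 0:
        return float("inf")           # unvisited vertices are explored first
    return v.q + alpha * math.sqrt(math.log(v.parent.visits) / v.visits)

def select(root, alpha):
    # Descend the tree, always following the child with the highest UCT score.
    v = root
    while v.children:
        v = max(v.children, key=lambda c: uct(c, alpha))
    return v

def expand(v, num_words):
    # One child per word index not already perturbed on the path to the root.
    used, p = set(), v
    while p is not None:
        if p.word_index is not None:
            used.add(p.word_index)
        p = p.parent
    v.children = [Vertex(t, parent=v) for t in range(num_words) if t not in used]

def backpropagate(v, reward):
    # Update visit counts and keep the maximum drop over all simulations.
    while v is not None:
        v.visits += 1
        v.q = max(v.q, reward)
        v = v.parent
```

A simulation step would evaluate the network on sampled perturbations, compute the confidence drop, and pass it to backpropagate as the reward.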

MCTS Word Substitution Strategies
We consider two refinements of MCTS: weighting the replacement words by importance, and filtering to ensure the syntactic/semantic coherence of the perturbed text. The importance score of a word substitution is inversely proportional to its distance from the original word, e.g., pickup(w ← w') = (1 / (|U| − 1)) · (1 − d(w, w') / Σ_{u∈U} d(w, u)), where w and w' are respectively the original and perturbed words, d(·) is an L p norm of choice, and U is a neighbourhood of w whose cardinality |U| must be greater than 1 (as shown in Figure 9). We can further filter the words in the neighbourhood so that only synonyms/antonyms are selected, thus guaranteeing that a word is replaced by a meaningful substitution; more details are provided in Section 2.2. While in this work we use a relatively simple method to find replacements that are syntactically coherent with the input text, more complex methods exist that also try to enforce semantic consistency (Navigli, 2009; Ling et al., 2015; Trask et al., 2015); since this problem is known to be much harder, we reserve it for future work.
Figure 9: Substitutions are selected either randomly or according to a score calculated as a function of the distance from the original word. The sampling region (red circle) is a finite fraction of the embedding space (blue circle). Selected candidates can be filtered to enforce semantic and syntactic constraints. The word "the" has been filtered out because it is not grammatically consistent with the original word "strange", while the words "good", "better" and "a" are filtered out as they lie outside the neighbourhood of the original word.
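The importance-weighted sampling described above can be sketched as follows. This is a minimal illustration with a hypothetical toy embedding; the normalisation makes the scores form a probability distribution over the neighbourhood U, with closer neighbours receiving more mass.

```python
def pickup_distribution(original, neighbourhood, distance):
    """Importance score for each candidate substitution: inversely
    proportional to its distance from the original word and normalised
    so that the scores sum to 1. Requires |U| > 1."""
    assert len(neighbourhood) > 1, "the neighbourhood U must contain more than one word"
    total = sum(distance(original, u) for u in neighbourhood)
    return {u: (1 - distance(original, u) / total) / (len(neighbourhood) - 1)
            for u in neighbourhood}

# Toy 1-dimensional "embedding" with hypothetical coordinates, for illustration.
emb = {"strange": 0.0, "odd": 0.2, "weird": 0.3, "unusual": 0.5}
d = lambda a, b: abs(emb[a] - emb[b])  # an L_p norm of choice (here L1)

probs = pickup_distribution("strange", ["odd", "weird", "unusual"], d)
# closer neighbours receive a higher pickup probability
```

In the full pipeline, candidates would additionally be filtered against the synonym/antonym constraints before sampling.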

A.2 Experimental Setup
The network architectures employed in this work are shown in Figure 10, while the embeddings are summarised in Table 4. Further details of both the embeddings and the architectures are provided in the main paper, Section 3.

A.3 Additional Robustness Results
In the remainder of this section we present additional experimental results of our robustness evaluation. More specifically, we show the trends of upper and lower bounds for different datasets (Tables 5, 6, 7 and 8); include robustness results against multiple substitutions; and compare robustness with counter-fitted models (Figure 11).
Table 6: Lower bound results for single (top) and multiple-word (middle and bottom) substitutions, comparing vanilla and counter-fitted models. The robustness of counter-fitted models is superior to their vanilla counterparts, except for high-dimensional embeddings such as GloVe 100d, where it was not possible to obtain a bound for the counter-fitted embedding due to computational constraints (nonetheless, the corresponding vanilla lower bound is close to zero). Values reported refer to measurements in the L ∞ -norm.
Table 7: Upper bound results for single-word substitutions as found by MCTS. We report: the average execution time of each experiment; the percentage of texts for which we found at least one successful single-word substitution (i.e., one that results in a class change); the approximate probability that a randomly selected word of a text admits a successful replacement; and the distance to the closest meaningful perturbation of the original word found, namely an upper bound on the MSR.
Table 8: Upper bound results for multiple-word substitutions as found by MCTS. We report the percentage of texts for which we found at least one successful substitution and the approximate probability that randomly selecting k words from a text (where k is the number of substitutions allowed) yields a successful replacement. We do not report average execution times as they are roughly the same as in Table 7. Values reported refer to measurements in the L 2 -norm.
For more than one substitution, the values reported are estimates over several random replacements, as it quickly becomes prohibitive to cover all possible multiple-word combinations. Figure 11: Comparison of the robustness of vanilla vs. counter-fitted embeddings for an increasing number of dimensions and word substitutions on the AG News dataset. (a) Simple Keras Custom embeddings optimised for emotional polarity. (b) GloVeTwitter embeddings that encode more complex representations. Counter-fitted embeddings exhibit greater robustness on low-dimensional or simple embeddings; a reversed trend is observed on high-dimensional embeddings or more complex word representations. Values reported refer to measurements in the L ∞ -norm.
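The approximate success ratio reported in Tables 7 and 8 can be estimated by Monte Carlo sampling instead of exhaustive enumeration. A minimal sketch follows; the neighbours and changes_class callbacks are hypothetical stand-ins for the embedding neighbourhood lookup and the classifier comparison.

```python
import random

def estimate_success_ratio(text, k, neighbours, changes_class, trials=1000):
    # Fraction of random k-word substitutions that flip the prediction.
    hits = 0
    for _ in range(trials):
        words = list(text)
        for t in random.sample(range(len(words)), k):
            words[t] = random.choice(neighbours(words[t]))
        hits += bool(changes_class(words))
    return hits / trials
```

Sampling is necessary because exhaustively covering all k-word combinations (and all candidate replacements for each) grows combinatorially with the text length.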