Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension

Reading comprehension models often overfit to nuances of their training datasets and fail under adversarial evaluation. Training with an adversarially augmented dataset improves robustness against those adversarial attacks but hurts the generalization of the models. In this work, we present several effective adversaries and automated data augmentation policy search methods with the goal of making reading comprehension models more robust to adversarial evaluation, while also improving generalization on the source domain as well as to new domains and languages. We first propose three new methods for generating QA adversaries that introduce multiple points of confusion within the context, show dependence on the insertion location of the distractor, and reveal the compounding effect of mixing adversarial strategies with syntactic and semantic paraphrasing methods. Next, we find that augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset. We address this issue via RL and more efficient Bayesian policy search methods that automatically learn the best augmentation policy, i.e., the combination of transformation probabilities for each adversary, in a large search space. Using these learned policies, we show that adversarial training can lead to significant improvements in in-domain, out-of-domain, and cross-lingual (German, Russian, Turkish) generalization.


Introduction
There has been growing interest in understanding NLP systems and exposing their vulnerabilities through maliciously designed inputs (Iyyer et al., 2018; Belinkov and Bisk, 2018; Nie et al., 2019; Gurevych and Miyao, 2018). Adversarial examples are generated using search-based (Alzantot et al., 2018), heuristic (Jia and Liang, 2017), or gradient-based (Ebrahimi et al., 2018) techniques to fool the model into giving wrong outputs. Often, the model is further trained on those adversarial examples to make it robust to similar attacks. In the domain of reading comprehension (RC), adversaries are QA samples with distractor sentences that have significant overlap with the question and are randomly inserted into the context. When the distractors follow a fixed template, training on them lets the model identify learnable biases and overfit to the template instead of becoming robust to the attack itself (Jia and Liang, 2017). Hence, we first build on Wang and Bansal (2018)'s work of adding randomness to the template and significantly expand the pool of distractor candidates by introducing multiple points of confusion within the context, adding dependence on the insertion location of the distractor, and further combining distractors with syntactic and semantic paraphrases to create combinatorially adversarial examples that stress-test the model's language understanding capabilities. These adversaries inflict up to a 45% drop in the performance of RC models built on top of large pretrained models like RoBERTa (Liu et al., 2019).
Next, to improve robustness to the aforementioned adversaries, we finetune the RC model with a combined augmented dataset containing an equal number of samples from each adversarial transformation. While this improves robustness by a significant margin, it leads to a decline in performance on the original unaugmented dataset. Hence, instead of uniformly sampling from the various adversarial transformations, we propose to search for the best adversarial policy combinations that improve robustness against the adversarial attacks while also preserving or improving accuracy on the original dataset via data augmentation. However, manually tuning the transformation probability for each adversary and repeating the process for each target dataset is slow, expensive, and injects inductive bias, so we present RL and Bayesian search methods to learn this policy combination automatically.
For this, we create a large augmentation search space of up to 10^6 policies, with four adversarial methods, two paraphrasing methods, and a discrete binning of the probability space for each method (see Figure 1). Cubuk et al. (2019) showed via AutoAugment that an RNN controller can be trained using reinforcement learning to find the best policy in a large search space. However, AutoAugment is computationally expensive and relies on the assumption that a policy searched using rewards from a smaller model and reduced dataset will generalize to bigger models. Alternatively, the augmentation methods can be modeled with a surrogate function, such as Gaussian processes (Rasmussen, 2003), and subjected to Bayesian optimization (Snoek et al., 2012), drastically reducing the number of training iterations required for achieving similar results (available as a software package for computer vision: https://pypi.org/project/deepaugment/). Hence, we extend these ideas to NLP and perform a systematic comparison between AutoAugment and our more efficient BayesAugment.
Finally, there has been limited prior work exploring the role of adversarial data augmentation in improving generalization of RC models to out-of-domain and cross-lingual data. Hence, we also perform automated policy search over adversarial transformation combinations for enhancing generalization from English Wikipedia to datasets in other domains (news, web) and languages (Russian, German, Turkish). Policy search methods like BayesAugment can be readily adapted for low-resource scenarios where one only has access to a small development set that the model can use as a black-box evaluation function (for rewards; full training on or gradient access to that data is unavailable). We show that augmentation policies for the source domain, learned using target-domain performance as the reward, improve the model's generalization to the target domain using only a small development set from that domain. Similarly, we use adversarial examples in a pivot language (in our case, English) to improve performance on other languages' RC datasets using rewards from a small development set in each language.
Our contributions can be summarized as follows:
• We first propose novel adversaries for reading comprehension that cause up to a 45% drop in large pretrained models' performance. Augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset.
• We next demonstrate that optimal adversarial policy combinations of transformation probabilities (for augmentation and generalization) can be automatically learned using policy search methods. Our experiments show that efficient Bayesian optimization achieves results similar to AutoAugment with a fraction of the resources.
• By training on the augmented data generated via the learned policies, we not only improve the adversarial robustness of the models but also show significant gains, i.e., up to 2.07%, 5.0%, and 2.21% improvement for in-domain, out-of-domain, and cross-lingual evaluation, respectively.
Overall, the goal of our paper is to make reading comprehension models robust to adversarial attacks as well as out-of-distribution data in cross-domain and cross-lingual scenarios.

Adversary Policy Design
As shown by Jia and Liang (2017), QA models are susceptible to random, semantically meaningless, and minor changes in the data distribution. We extend this work and propose adversaries that exploit the model's sensitivity to the insertion location of the distractor, the number of distractors, combinations of adversaries, etc. After exposing the model's weaknesses, we strengthen the model by training on these adversaries and show that its robustness to adversarial attacks significantly increases as a result. Finally, in Sec. 4, we automatically learn the right combination of transformation probabilities for each adversary with respect to a target improvement using policy search methods.

Adversary Transformations
We present two types of adversaries, namely positive perturbations and negative perturbations (or attacks) (Figure 1). Positive perturbations are adversaries generated using methods that have been traditionally used for data augmentation in NLP i.e., semantic and syntactic transformations. Negative perturbations are distractor sentences based on the classic AddSent model (Jia and Liang, 2017) that exploits the RC model's shallow language understanding to mislead it to incorrect answers. We use the method outlined by Wang and Bansal (2018) for AddSentDiverse to generate a distractor sentence (see Table 1) and insert it randomly within the context of a QA sample.
We introduce more variance into the adversaries with AddKSentDiverse, wherein multiple distractor sentences are generated using AddSentDiverse and inserted at independently sampled random positions within the context. For AddAnswerPosition, the original answer span is retained within the distractor sentence and the model is penalized for an incorrect answer-span location. For InvalidateAnswer, we remove the sentence containing the answer span from the context and introduce a distractor sentence, creating adversarial samples that are no longer answerable. PerturbAnswer adversaries are created by following the Perturb subroutine (Alzantot et al., 2018) and generating semantic paraphrases of the sentence containing the answer span. We use the syntactic paraphrase network of Iyyer et al. (2018) to create PerturbQuestion adversarial samples by replacing the original question with its paraphrase.
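As a concrete illustration, the AddKSentDiverse insertion step can be sketched as follows (a minimal sketch under our own naming; generating the distractor sentences themselves with AddSentDiverse is not shown):

```python
import random

def add_k_sent_diverse(context_sentences, distractors, seed=None):
    """Insert each pre-generated distractor sentence at an independently
    sampled random position within the context.

    `context_sentences` is the passage split into sentences; `distractors`
    are AddSentDiverse-style sentences produced upstream.
    """
    rng = random.Random(seed)
    augmented = list(context_sentences)
    for d in distractors:
        pos = rng.randint(0, len(augmented))  # any gap, including the ends
        augmented.insert(pos, d)
    return augmented
```

Because each insertion position is sampled independently, the relative order of the original context sentences is preserved while the distractors scatter throughout the passage.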
Finally, we combine negative and positive perturbations to create adversaries that double down on the model's language understanding. This consistently leads to a larger drop in performance when tested on RC models trained on the original unaugmented datasets. See Appendix for more details.

Adversarial Policy & Search Space
Reading comprehension models are often trained with adversarial samples in order to improve robustness to the corresponding adversarial attack. We seek to find the best combination of adversaries for data augmentation that also preserves/improves accuracy on source domain and improves generalization to a different domain or language.
AutoAugment: Following previous work on AutoAugment policy search (Cubuk et al., 2019; Niu and Bansal, 2019), we define a sub-policy to be a set of adversarial transformations applied to a QA sample to generate an adversarial sample. We show that adversaries are most effective when positive and negative perturbations are applied together (Table 2). Hence, to prepare one sub-policy, we select one of the four negative perturbations (or none), combine it with one of the two positive perturbations (or none), and assign the combination a transformation probability (see Figure 1). The probability space [0, 1] is discretized into 6 equally spaced bins. This leads to a search space of 5 × 3 × 6 = 90 for a single sub-policy. Next, we define a complete adversarial policy as a set of n sub-policies, with a search space of 90^n. For each input QA sample, one of the sub-policies is randomly sampled and applied (with a probability equal to the transformation probability) to generate the adversarial sample. Thus, each original QA sample ends up with one corresponding adversarial sample or none.
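The discrete search space can be enumerated directly. A sketch (identifiers are ours; we assume the 6 probability bins are evenly spaced over [0, 1]):

```python
import itertools
import random

NEG = ["AddSentDiverse", "AddKSentDiverse", "AddAnswerPosition",
       "InvalidateAnswer", None]                  # 5 choices (incl. none)
POS = ["PerturbAnswer", "PerturbQuestion", None]  # 3 choices (incl. none)
PROBS = [i / 5 for i in range(6)]                 # 6 bins: 0.0, 0.2, ..., 1.0

# One sub-policy = (negative, positive, transformation probability)
SUBPOLICY_SPACE = list(itertools.product(NEG, POS, PROBS))  # 5*3*6 = 90

def sample_policy(n=3, seed=None):
    """A complete policy is n sub-policies, giving a 90**n search space."""
    rng = random.Random(seed)
    return [rng.choice(SUBPOLICY_SPACE) for _ in range(n)]
```

With n = 3 (the setting used later in the paper), the full space already contains 90^3 = 729,000 candidate policies, which motivates learned rather than exhaustive search.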
BayesAugment: We adopt a simplified formulation of the policy for our BayesAugment method, following Ho et al. (2019) and RandAugment (Cubuk et al., 2020). Sampling of positive and negative adversaries is eliminated, and the transformation probabilities of all possible combinations of adversaries are optimized over a continuous range [0, 1]. Consequently, one of these combinations is randomly sampled for each input QA sample to generate adversaries. Empirically, the dominant adversary in a policy is the attack with the highest transformation probability (see policies in Table 8 in the Appendix). Due to the probabilistic nature of the policy, it is possible for the model to not add any adversarial sample at all, but the probability of this happening is relatively low.
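A minimal sketch of how such a continuous policy could be applied per sample (the function name and policy representation are our own illustration, not the paper's code):

```python
import random

def sample_adversary(policy, rng=None):
    """`policy` maps each adversary combination to a transformation
    probability in [0, 1] (the quantities BayesAugment optimizes over).
    One combination is sampled uniformly for an input QA sample and
    applied with its probability; None means no adversary is added.
    """
    rng = rng or random.Random()
    combo, p = rng.choice(sorted(policy.items()))
    return combo if rng.random() < p else None
```

When all probabilities in the policy are low, most samples receive no adversarial counterpart, which matches the remark above that adding no adversarial sample at all is possible but unlikely for learned policies.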

Automatic Policy Search
Next, we need to search over the large space of augmentation policies in order to find the best policy for a desired outcome. Naive search (random or grid) or manual tuning of the transformation probabilities is slow, expensive, and largely impractical under resource constraints. Hence, we compare two approaches for learning the best augmentation policy in fewer searches: AutoAugment and BayesAugment. We follow the optimization procedure illustrated in Figure 1. For t = 1, 2, ..., do:
• Sample the next policy p_t (sample)
• Transform the training data with p_t to generate augmented data (apply, transform)
• Train the downstream task model on the augmented data (train)
• Obtain the score on the validation dataset as reward r_t
• Update the Gaussian Process or RNN controller with r_t (update)
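The loop above can be written generically; a minimal sketch in which the four callables stand in for the components of Figure 1 (names are ours):

```python
def policy_search(sampler, apply_policy, train_and_eval, update, steps=100):
    """Generic policy-search loop: sample -> apply -> train -> update.

    `sampler` proposes the next policy p_t, `apply_policy` builds the
    augmented data, `train_and_eval` returns the dev-set reward r_t, and
    `update` feeds r_t back to the optimizer (GP or RNN controller).
    """
    best_policy, best_reward = None, float("-inf")
    for t in range(steps):
        policy = sampler()
        augmented = apply_policy(policy)
        reward = train_and_eval(augmented)
        update(policy, reward)
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy, best_reward
```

AutoAugment and BayesAugment differ only in what `sampler` and `update` do: an RNN controller trained with REINFORCE versus a Gaussian-process surrogate with an acquisition function.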

AutoAugment
Our AutoAugment model (see Figure 1) consists of a recurrent neural network-based controller and a downstream task model. The controller has n output blocks for n sub-policies; each output block generates distributions over the three components of a sub-policy, i.e., neg, pos, and probability. The adversarial policy is generated by sampling from these distributions and applied to the input dataset to create adversarial samples, which are added to the original dataset to create an augmented dataset. The downstream model is trained on the augmented dataset until convergence and evaluated on a given metric, which is then fed back to the controller as a reward (see the update flow in the figure). We use REINFORCE (Sutton et al., 1999; Williams, 1992) to train the controller.
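As a rough illustration of the controller update (not the paper's implementation), a single REINFORCE step on one categorical output of the controller can be sketched as:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, action, reward, baseline, lr=0.1):
    """One REINFORCE update for a categorical distribution.

    The gradient of log p(action) w.r.t. logit i is (1[i == action] - p_i);
    it is scaled by the advantage (reward - baseline) and applied with
    learning rate `lr`. Returns the updated logits.
    """
    probs = softmax(logits)
    adv = reward - baseline
    return [l + lr * adv * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]
```

Repeated positive rewards for a sampled sub-policy component shift probability mass toward it, which is how the controller concentrates on high-reward policies over training.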

BayesAugment
Typically, it takes thousands of steps to train an AutoAugment controller using reinforcement learning, which prohibits the use of large pretrained models as the task model in the training loop. For example, the controllers in Cubuk et al. (2019) were trained for 15,000 samples or more. To circumvent this computational issue, we frame our adversarial policy search as a hyperparameter optimization problem and use Bayesian methods to perform the search. Bayesian optimization techniques use a surrogate model to approximate the objective function f and an acquisition function to sample points from areas where improvement over the current result is most likely. The prior belief about f is updated with samples drawn from f in order to get a better estimate of the posterior that approximates f. Bayesian methods attempt to find the global maximum in the minimum number of steps.

Rewards
The F1 score of the downstream task model on the development set is used as the reward during policy search. To discover augmentation policies geared towards improving generalization of the RC model, we calculate the F1 score of the task model (trained on the source domain) on out-of-domain or cross-lingual development datasets and feed it as the reward to the optimizer.

Datasets
We

Reading Comprehension Models
We use RoBERTa BASE as the primary RC model for all our experiments. For fair baseline evaluation on out-of-domain and cross-lingual datasets, we also use the development set of the target task to select the best checkpoint.

Evaluation Metrics
We use the official SQuAD evaluation script to evaluate robustness to adversarial attacks and performance on in-domain and out-of-domain datasets. For cross-lingual evaluation, we use the modified Translate-Test method as outlined in Lewis et al. (2020) and Asai et al. (2018). QA samples in languages other than English are first translated to English and sent as input to RoBERTa BASE finetuned on SQuAD v1.1. The predicted answer spans within the English context are then mapped back to the context in the original language using alignment scores from the translation model. We use the top-ranked German→English and Russian→English models from the WMT19 shared news translation task, and train a Turkish→English model using a similar architecture, to generate translations and alignment scores (Ng et al., 2019).
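A minimal sketch of the span-mapping step, assuming the translation model exposes token-level alignment pairs `(en_idx, src_idx)` (this representation and the function name are our simplification):

```python
def map_span_back(en_answer_start, en_answer_end, alignment):
    """Map a predicted answer span in the English context back to the
    original language.

    `alignment` is a list of (en_token_idx, src_token_idx) pairs; the
    result is the smallest source-token span covering all source tokens
    aligned to the predicted English tokens, or None if nothing aligns.
    """
    src = [s for e, s in alignment if en_answer_start <= e <= en_answer_end]
    if not src:
        return None
    return min(src), max(src)
```

Taking the covering span is a design choice: alignments can reorder tokens, so the mapped-back span may include a few extra source tokens, trading precision for recall of the answer phrase.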

Results
First, in Sec. 5.1, we perform adversarial evaluation of baseline RC models for various categories of adversaries. Next, in Sec. 5.2, we train the RC models with an augmented dataset that contains equal ratios of adversarial samples and show that it improves robustness to adversarial attacks but hurts performance of the model on the original unaugmented dataset. Finally, in Sec. 5.3, we present results from AutoAugment and BayesAugment policy search and the in-domain, out-of-domain, and cross-lingual performance of RC models trained using augmentation data generated from the learned policies with corresponding target rewards. (Statistical significance is computed with 100K samples using bootstrap (Noreen, 1989; Tibshirani and Efron, 1993).)

Table 3: Adversarial evaluation after training RoBERTa BASE with the original dataset augmented with equally sampled adversarial data. Compare to corresponding rows in Table 2 to observe the difference in performance after adversarial training. Results (F1 score) are shown on dev set.

Adversarial Evaluation
Table 2 shows results from adversarial evaluation of RoBERTa BASE finetuned with SQuAD v2.0 and NewsQA respectively. All adversarial methods lead to a significant drop in performance for the finetuned models, i.e., between 4-45% for both datasets. The decrease in performance is largest when there are multiple distractors in the context (Add3SentDiverse) or perturbations are combined with one another (AddSentDiverse + PerturbAnswer). These results show that, despite being equipped with a broader understanding of language from pretraining, the finetuned RC models are shallow and over-stabilized to textual patterns like n-gram overlap. Further, the models are not robust to semantic and syntactic variations in text.
Additionally, we performed manual evaluation of 96 randomly selected adversarial samples (16 each from attacks listed in Table 1) and found that a human annotator picked the right answer for 85.6% of the questions.

Manual Adversarial Training
Next, in order to remediate the drop in performance observed in Table 2 and improve robustness to adversaries, the RC models are further finetuned for 2 epochs with an adversarially augmented training set. The augmented training set contains each QA sample from the original training set and a corresponding adversarial QA sample generated by randomly sampling one of the adversary methods. Table 3 shows results from adversarial evaluation after adversarial training. Adding perturbed data during training considerably improves the robustness of the models to adversarial attacks. For instance, RoBERTa BASE achieves a 79.44 F1 score on SQuAD AddKSentDiverse samples (second row, Table 3), compared to a 45.31 F1 score without adversarial training (third row, Table 2). Similarly, RoBERTa BASE achieves a 44.99 F1 score on NewsQA PerturbQuestion samples (fifth row, Table 3), compared to a baseline score of 36.76 F1 (sixth row, Table 2). However, this manner of adversarial training also leads to a drop in performance on the original unaugmented development set: e.g., RoBERTa BASE achieves 78.83 and 58.08 F1 on the SQuAD and NewsQA development sets respectively, which is 2.34 and 0.32 points lower than the baseline (first row, Table 2).
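The uniform augmentation described above can be sketched as follows (a hypothetical helper of ours; the adversary generators themselves are not shown):

```python
import random

def uniform_augment(dataset, adversary_fns, seed=0):
    """Pair each original QA sample with one adversarial counterpart
    produced by a uniformly chosen adversary method.

    `adversary_fns` is a list of callables, one per adversary method,
    each mapping an original sample to its adversarial version.
    """
    rng = random.Random(seed)
    augmented = []
    for sample in dataset:
        augmented.append(sample)
        augmented.append(rng.choice(adversary_fns)(sample))
    return augmented
```

This uniform choice over methods is exactly the sampling scheme that the policy-search methods later replace with learned, non-uniform transformation probabilities.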

Augmentation Policy Search for Domain and Language Generalization
Following the conclusion from Sec. 5.2 that uniform sampling of adversaries is not the optimal approach for model performance on the original unaugmented dataset, we perform automated policy search over a large search space using BayesAugment and AutoAugment for in-domain as well as cross-domain/lingual improvements (as discussed in Sec. 4). For AutoAugment, we choose the number of sub-policies in a policy to be n = 3 as a trade-off between search-space dimension and optimal results. We search for the best transformation policies for the source domain that improve the model in three areas: (1) in-domain performance, (2) generalization to other domains, and (3) generalization to other languages. These results are presented in Tables 4 and 5, adversarial evaluation of the best BayesAugment models is presented in Table 6, and the learned policies are shown in the Appendix.
In-domain evaluation: The best AutoAugment augmentation policies for improving the in-domain performance of RoBERTa BASE on the development sets result in 0.46% and 3.77% improvements in F1 score over the baseline for SQuAD v2.0 and NewsQA respectively (see Table 4). Similarly, we observe 0.54% (p=0.021) and 0.22% (p=0.013) absolute improvements in F1 score for SQuAD and NewsQA respectively using BayesAugment policies. This trend is reflected in results on the test set as well: AutoAugment policies result in the most improvement, i.e., 0.42% (p=0.014) and 2.07% (p=0.007) for SQuAD and NewsQA respectively. Additionally, both policy search methods outperform finetuning with a dataset of uniformly sampled adversaries (see row 2 in Table 4).
Out-of-domain evaluation: To evaluate generalization of the RC model from Wikipedia to news articles and the web, we train RoBERTa BASE on SQuAD and evaluate on NewsQA and TriviaQA respectively. The baseline row in Table 4 presents results of RoBERTa BASE trained on original unaugmented SQuAD and evaluated on NewsQA and TriviaQA. Next, we reiterate results from Table 3 and show that finetuning with a uniformly sampled dataset of adversaries (see UniS in Table 4) results in a drop in performance on the validation sets of SQuAD and NewsQA. By training on adversarially augmented SQuAD with the AutoAugment policy, we see 2.21% and 0.81% improvements on the development sets of NewsQA (SQuAD→NewsQA) and TriviaQA (SQuAD→TriviaQA) respectively. Similarly, BayesAugment provides 1.37% and 2.36% improvements over the baseline for the development sets of TriviaQA and NewsQA, proving to be a competitive and less computationally intensive substitute for AutoAugment. BayesAugment outperforms AutoAugment at out-of-domain generalization, providing 4.0% (p<0.001) and 4.98% jumps on the test sets of NewsQA and TriviaQA respectively, compared to 1.87% improvements with AutoAugment.
Our experiments suggest that AutoAugment finds better policies than BayesAugment for in-domain evaluation. We hypothesize that this might be attributed to the difference in search space between the two policy search methods. AutoAugment is restricted to sampling at most 3 sub-policies, while BayesAugment has to simultaneously optimize the transformation probabilities of ten or more different augmentation methods. A diverse mix of adversaries from the latter is shown to be more beneficial for out-of-domain generalization but results in minor improvements for in-domain performance. Moving ahead, due to its better performance on out-of-domain evaluation and its more efficient trade-off with computation, we only use BayesAugment for our cross-lingual experiments.
Cross-lingual evaluation: Table 5 shows results of RoBERTa BASE finetuned with adversarially augmented SQuAD v1.1 and evaluated on RC datasets in non-English languages. The baseline row presents results from RoBERTa BASE trained on original unaugmented SQuAD and evaluated on the German MLQA(de), Russian XQuAD(ru), and Turkish XQuAD(tr) datasets; F1 scores on the development sets are 58.58, 67.89, and 42.95 respectively. These scores depend on the quality of the translation model as well as the RC model. We observe significant improvements on the development as well as test sets by finetuning the baseline RC model with adversarial data from English SQuAD. A uniformly sampled adversarial dataset results in 0.71% (p=0.063), 1.06% (p=0.037), and 0.55% (p=0.18) improvements on the test sets of MLQA(de), XQuAD(ru), and XQuAD(tr), respectively. BayesAugment policies outperform uniform sampling and result in 1.47% (p=0.004), 2.21% (p=0.007), and 1.46% (p=0.021) improvements on the test sets of MLQA(de), XQuAD(ru), and XQuAD(tr), respectively.
Adversarial evaluation: Table 6 shows results from the adversarial evaluation of RoBERTa BASE models finetuned with adversarially augmented SQuAD using policies learned from BayesAugment. We use the best models for out-of-domain and cross-lingual generalization as shown in Tables 4 and 5, and evaluate their performance on the adversaries discussed in Section 5.1. Results show that the policies learned from BayesAugment significantly improve resilience to the proposed adversarial attacks in addition to improving performance on the target datasets. The performance on adversaries varies with the transformation probability of the respective adversaries in the learned policies. For example, the transformation probability of PerturbQuestion adversaries is 0.83 and 0.0 for the SQuAD→TriviaQA and SQuAD→NewsQA models respectively (see Table 8). Consequently, the former performs better on PerturbQuestion adversaries.

Analysis and Discussion
Having established the efficacy of automated policy search for adversarial training, we further probe the robustness of adversarially trained models to unseen adversaries. We also analyze the convergence of BayesAugment for augmentation policy search and contrast its requirement of computational resources with that of AutoAugment. See Appendix for more analysis on domain independence of adversarial robustness and augmentation data size.
Robustness to Unseen Adversaries: We train RoBERTa BASE on SQuAD v2.0 augmented with the AddSentDiverse counterpart of each QA sample and evaluate it on the other adversarial attacks, to analyze the robustness of the model to unseen adversaries. As seen from the results in Table 7, the model remains vulnerable, e.g., it is easily distracted by adversaries when the original answer is removed from the context.

Table 6: Adversarial evaluation after finetuning the baseline with adversarial policies derived from BayesAugment for generalization from SQuAD2.0 to TriviaQA, NewsQA, and SQuAD1.1 to German (de), Russian (ru), and Turkish (tr) RC datasets. Results (F1 / Exact Match) are shown on validation sets. Compare to corresponding rows in Table 3 to observe the difference in performance between models finetuned with the uniformly sampled dataset vs. datasets derived from learned policies.
Bayesian Convergence: In comparison to the thousands of training loops (or more) for AutoAugment, we run BayesAugment for only 100 training loops with 20 restarts. To show that BayesAugment converges within the given period, we plot the distance between transformation probabilities chosen by the Bayesian optimizer for the AddSentDiverse-PerturbQuestion augmentation method. As shown in Figure 2, the distance between the samples decreases as training progresses, showing that the optimizer becomes more confident about the narrow range of probability that should be sampled for maximum performance on the validation set.
Analysis of Resources for AutoAugment vs. BayesAugment: With far fewer training loops, BayesAugment uses only 10% of the GPU resources required for AutoAugment. Our AutoAugment experiments took more than 1000 iterations and up to 5-6 days to converge, requiring many additional days for hyperparameter tuning. In contrast, our BayesAugment experiments ran for 36-48 hours on two 1080Ti GPUs and achieved comparable performance with 100 iterations or less. If large pretrained models are replaced with smaller distilled models in future work, BayesAugment will provide even larger gains in time/computation.

Conclusion
We show that adversarial training can be leveraged to improve the robustness of reading comprehension models to adversarial attacks and also to improve performance on the source domain and generalization to out-of-domain and cross-lingual data. We present BayesAugment for policy search, which achieves results similar to the computationally intensive AutoAugment method with a fraction of the computational resources. By combining policy search with rewards from performance on the corresponding target development sets, we show that models trained on SQuAD can be generalized to NewsQA and to German, Russian, and Turkish cross-lingual datasets without any training data from the target domain or language.

A Adversary Details

AddSentDiverse: We use the method outlined by Wang and Bansal (2018) to generate a distractor sentence and insert it randomly within the context of a QA sample. In addition to WordNet, we use ConceptNet (Speer et al., 2017) for a wider choice of antonyms during generation of the adversary. QA pairs that do not have an answer within the given context are also augmented with AddSentDiverse adversaries.

AddKSentDiverse: The AddSentDiverse method is used to generate multiple distractor sentences for a given context. Each of the distractor sentences is then inserted at independently sampled random positions within the context. The distractors may or may not be similar to each other. Introducing multiple points of confusion is a more effective technique for misleading the model and reduces the scope of learnable biases during adversarial training by adding variance.
AddAnswerPosition: The original answer span is retained and placed within a distractor sentence generated using a combination of AddSentDiverse and random perturbations to maximize semantic mismatch. We modify the evaluation script to compare exact answer-span locations in addition to the answer phrase and fully penalize incorrect locations. For practical purposes, if the model predicts the answer span within the adversarial sentence as output, it does not make a difference. However, it brings into question the interpretability of such models. This distractor is most effective when placed right before the original answer sentence, showing dependence on the insertion location of the distractor.
InvalidateAnswer: The sentence containing the original answer is removed from the context. Instead, a distractor sentence generated using AddSentDiverse is introduced to the context. This method is used to augment the adversarial NoAnswer-style samples in SQuAD v2.0.
PerturbAnswer (Semantic Paraphrasing): Following Alzantot et al. (2018), we perform semantic paraphrasing of the sentence containing the answer span. Instead of using a genetic algorithm, we adapt their Perturb subroutine to generate paraphrases in the following steps:
1. Select word locations for perturbation, i.e., locations within any content phrase that does not appear within the answer span. Here, content phrases are verbs, adverbs, and adjectives.
2. For location k_i in the set of word locations {k}, compute the 20 nearest neighbors of the word at that location using GloVe embeddings, create a candidate sentence by perturbing the word location with each of the substitute words, and rank the perturbed sentences using a language model.
3. Select the highest-ranked perturbed sentence and perform Step 2 for the next location k_{i+1} using the perturbed sentence.
We use the OpenAI-GPT model (Radford et al., 2018) to evaluate paraphrases.
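The greedy subroutine above can be sketched as follows (a minimal sketch; `neighbors` and `lm_score` stand in for GloVe nearest-neighbor lookup and OpenAI-GPT scoring, which are not reproduced here):

```python
def perturb_sentence(tokens, locations, neighbors, lm_score, k=20):
    """Greedy Perturb subroutine: at each selected location, try up to k
    substitute words and keep the candidate sentence the language model
    scores highest, then move to the next location.

    `neighbors(word)` returns candidate substitutes (e.g., embedding
    nearest neighbors); `lm_score(tokens)` returns a fluency score.
    """
    best = list(tokens)
    for i in locations:
        candidates = [best]  # keeping the word unchanged is also allowed
        for sub in neighbors(best[i])[:k]:
            cand = list(best)
            cand[i] = sub
            candidates.append(cand)
        best = max(candidates, key=lm_score)
    return best
```

Because each location is resolved greedily before moving on, the search is linear in the number of perturbed locations rather than exponential in the candidate combinations.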
PerturbQuestion (Syntactic Paraphrasing): We use the syntactic paraphrase network introduced by Iyyer et al. (2018) to generate syntactic adversaries. Sentences from the context of QA samples tend to be long and have complicated syntax, and the syntactic paraphrases generated by the paraphrasing network usually omit large parts of the source sentence. Therefore, we choose to paraphrase the questions instead. We generate 10 paraphrases for each question and rank them by cosine similarity, computed between the mean of the word embeddings (Pennington et al., 2014) of the source sentence and the generated paraphrases (Niu and Bansal, 2018; Liu et al., 2016).
Finally, we combine negative perturbations with positive perturbations to create adversaries that double down on the model's language understanding capabilities. This consistently leads to a larger drop in performance when tested on reading comprehension models trained on the original unaugmented datasets.
Semantic Difference Check: To make sure that the distractor sentences are sufficiently different from the original sentence, we perform a semantic difference check in two steps: 1. Extract content phrases from the original sentence. Here, a content phrase is any common NER phrase or a noun, verb, adverb, or adjective. 2. Require that at least 2 content phrases in the original sentence do not appear in the distractor. We examined 100 randomly sampled original-distractor sentence pairs and found that our semantic difference check works for 96% of the cases.
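The two-step check can be sketched as follows. In practice, content phrases would come from an NER/POS tagger; to keep the sketch self-contained, it accepts a pre-tagged (token, tag) list, which is an assumption of this illustration.

```python
# Sketch of the two-step semantic difference check. Content phrases are
# named entities plus nouns, verbs, adverbs, and adjectives; here the
# input is assumed to be pre-tagged so no tagger dependency is needed.

CONTENT_TAGS = {"NOUN", "VERB", "ADV", "ADJ", "ENT"}

def content_phrases(tagged_tokens):
    """Step 1: keep tokens whose tag marks them as content phrases."""
    return {tok.lower() for tok, tag in tagged_tokens if tag in CONTENT_TAGS}

def is_sufficiently_different(original_tagged, distractor_text, min_missing=2):
    """Step 2: require >= 2 original content phrases absent from the distractor."""
    distractor_words = set(distractor_text.lower().split())
    missing = content_phrases(original_tagged) - distractor_words
    return len(missing) >= min_missing
```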

B BayesAugment
We use a Gaussian Process (GP) (Rasmussen, 2003) as the surrogate function and the Upper Confidence Bound (UCB) (Srinivas et al., 2010) as the acquisition function. A GP is a non-parametric model that is fully characterized by a mean function $\mu_0 : \mathcal{X} \to \mathbb{R}$ and a positive-definite kernel or covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $x_1, x_2, \dots, x_n$ denote any finite collection of $n$ points, where each $x_i$ represents a choice of sampling probabilities for the augmentation methods and $f_i = f(x_i)$ is the (unknown) function value evaluated at $x_i$. Let $y_1, y_2, \dots, y_n$ be the corresponding noisy observations (the validation performance at the end of training). In the context of GP Regression (GPR), $f = (f_1, \dots, f_n)$ is assumed to be jointly Gaussian. Then, the noisy observations $y = (y_1, \dots, y_n)$ are normally distributed around $f$ as $y \mid f \sim \mathcal{N}(f, \sigma^2 I)$.
The Gaussian Process upper confidence bound (GP-UCB) algorithm measures the optimistic performance upper bound of the sampling probabilities.
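Concretely, at iteration $t$ the GP-UCB rule of Srinivas et al. (2010) selects the next candidate sampling probabilities by maximizing the posterior mean plus a scaled posterior standard deviation:

```latex
x_t = \arg\max_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \sqrt{\beta_t}\, \sigma_{t-1}(x)
```

where $\mu_{t-1}$ and $\sigma_{t-1}$ are the GP posterior mean and standard deviation after $t-1$ observations, and $\beta_t$ controls the trade-off between exploitation and exploration.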

C Datasets
SQuAD v2.0 (Rajpurkar et al., 2018) is a crowd-sourced dataset consisting of 100,000 questions from SQuAD v1.1 (Rajpurkar et al., 2016) and an additional 50,000 questions that do not have answers within the given context. We split the official development set into randomly sampled validation and test sets for our experiments.
NewsQA is also a crowd-sourced extractive RC dataset based on 10,000 news articles from CNN, containing both answerable and unanswerable questions.
TriviaQA (Joshi et al., 2017) questions were crawled from the web and have two variants. One variant includes Wikipedia articles as contexts; we use the other variant, which uses web snippets and documents from the Bing search engine as contexts. The development and test sets are large.

D Training Details
We trained RoBERTa BASE for 5 epochs on SQuAD and NewsQA and selected the best-performing checkpoint as the baseline. We perform a hyperparameter search for both datasets using Bayesian optimization (Snoek et al., 2012). The RNN controller in the AutoAugment training loop consists of a single LSTM cell with one hidden layer of dimension 100. The generated policy consists of 3 sub-policies; each sub-policy is structured as discussed in the main text. BayesAugment is trained for 100 iterations with 20 restarts. During the AutoAugment and BayesAugment training loops, RoBERTa BASE or distilRoBERTa BASE (already trained on unaugmented SQuAD) is further finetuned on the adversarially augmented dataset for 2 epochs with a warmup ratio of 0.2 and learning rate decay (lr=1e-5) thereafter. After the policy search, further hyperparameter optimization is performed for the best results from fine-tuning. We do not perform this last step of hyperparameter tuning on cross-lingual data to avoid the risk of overfitting to the small datasets. For generalization from SQuAD v1.1 to cross-lingual datasets, we do not use the adversary InvalidateAnswer because NoAnswer samples do not exist in these datasets.
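A learned policy is essentially a vector of per-adversary transformation probabilities; applying one to a dataset can be sketched as below. This is a hypothetical illustration: the adversary functions and probability values are made up, not the learned ones.

```python
import random

# Hypothetical illustration of applying a learned augmentation policy:
# each adversary fires independently on a sample with its learned
# transformation probability. All names and values here are illustrative.

def apply_policy(samples, policy, adversaries, seed=0):
    """Augment `samples` by sampling each adversary with its policy probability."""
    rng = random.Random(seed)
    augmented = []
    for sample in samples:
        for name, prob in policy.items():
            if rng.random() < prob:
                augmented.append(adversaries[name](sample))
    return samples + augmented

# Toy adversaries that just tag the sample with the transformation applied.
adversaries = {
    "AddSentDiverse": lambda s: s + "+ASD",
    "PerturbAnswer": lambda s: s + "+PA",
    "PerturbQuestion": lambda s: s + "+PQ",
}
policy = {"AddSentDiverse": 0.7, "PerturbAnswer": 0.4, "PerturbQuestion": 0.2}
```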

E Analysis
In this section, we show the impact of the adversarial augmentation ratio and of the training dataset size on the generalization of the RC model to out-of-domain data. Next, we present more experiments on robustness to unseen adversaries. Finally, we analyze the domain-independence of adversarial robustness by training on adversarially augmented SQuAD and testing on adversarial NewsQA samples.
Effect of Augmentation Ratio: To assess the importance of adversarial augmentation in the dataset, we experimented with different ratios, i.e., 1x, 2x, and 3x, of augmented samples to the original dataset, for generalization from SQuAD to NewsQA using the augmentation policy learned by BayesAugment. The performance of SQuAD→NewsQA models on the NewsQA validation set was 49.73, 49.84, and 49.62 for 1x, 2x, and 3x augmentation respectively, showing a slight improvement at twice the number of augmentations. However, performance starts decreasing at 3x augmentation, showing that too many adversaries in the training data start hurting generalization.
Effect of Augmented Dataset Size: We experimented with 20%, 40%, 60%, 80%, and 100% of the original dataset to generate the augmented dataset using the BayesAugment policy, for generalization of RoBERTa BASE trained on SQuAD to NewsQA, and observed little variance in performance with increasing data, as seen in Figure 3. The augmentation ratio in these datasets is 1:1. We hypothesize that the model saturates early in training, within the first tens of thousands of adversarially augmented samples; exposing the model to more SQuAD samples gives little further boost to performance on NewsQA.
Robustness to Unseen Adversaries: We train RoBERTa BASE on SQuAD augmented with an adversarial dataset of the same size as SQuAD, containing equal numbers of samples from AddSentDiverse, PerturbQuestion, and PerturbAnswer. In Table 13, we see that the model is significantly more robust to combinatorial adversaries like AddSentDiverse+PerturbAnswer when trained on the adversaries AddSentDiverse and PerturbAnswer individually. We also see a decline in performance on InvalidateAnswer.
Domain-Independence of Robustness to Adversarial Attacks: We have shown that a reading comprehension model trained on SQuAD can be generalized to NewsQA by finetuning the model with adversarially transformed samples from the SQuAD dataset. This model is expected to be robust to similar attacks on SQuAD. To assess whether this robustness generalizes to NewsQA as well, we evaluate our best SQuAD→NewsQA model on adversarially transformed NewsQA samples from the development set. The SQuAD column in Table 11 shows results from evaluating RoBERTa BASE, finetuned on the original unaugmented SQuAD, on adversarially transformed NewsQA samples. Interestingly, the generalized model (rightmost column) is 5-8% more robust to adversarial NewsQA without being trained on any NewsQA samples, showing that robustness to adversarial attacks in the source domain generalizes readily to a different domain.