SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness

Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.


Introduction
Training distributions often do not cover all of the test distributions we would like a supervised classifier or model to perform well on. Often, this is caused by biased dataset collection (Torralba and Efros, 2011) or test distribution drift over time (Quionero-Candela et al., 2009). Therefore, a key challenge in training machine learning models in these settings is ensuring they are robust to unseen examples. Since it is impossible to generalize to the entire distribution, methods often focus on the adjacent goal of out-of-domain robustness.
Data augmentation is a common technique used to improve out-of-domain (OOD) robustness by synthetically generating new training examples 1 Code is availble at https://github.com/ nng555/ssmba M x x x Figure 1: SSMBA moves along the data manifold M by using a corruption function to perturb an example x off the data manifold, then using a reconstruction function to project it back on. (Simard et al., 1998), often by perturbing existing examples in the input space (Perez and Wang, 2017). If data concentrates on a low-dimensional manifold (Chapelle et al., 2006), then these synthetic examples should lie in a manifold neighborhood of the original examples (Olivier Chapelle and Vapnik, 2000). Training models to be robust to such local perturbations has been shown to be effective in improving performance and generalization in semi-supervised and self-supervised settings (Bachman et al., 2014;Szegedy et al., 2014;Sajjadi et al., 2016). When the underlying data manifold exhibits easy-to-characterize properties, as in natural images, simple transformations such as translation and rotation can quickly generate local training examples. However, in domains such as natural language, it is much more difficult to find a set of invariances that preserves meaning or semantics.
In this paper we propose Self-Supervised Manifold Based Data Augmentation (SSMBA): a data augmentation method for generating synthetic examples in domains where the data manifold is difficult to heuristically characterize. Motivated by the use of denoising auto-encoders as generative models (Bengio et al., 2013), we use a corruption function to stochastically perturb examples off the data manifold, then use a reconstruction function to project them back on (Figure 1). This ensures new examples lie within the manifold neighborhood of the original example. SSMBA is applicable to any supervised task, requires no task-specific knowledge, and does not rely on class-or dataset-specific fine-tuning.
We investigate the use of SSMBA in the natural language domain on 3 diverse tasks spanning both classification and sequence modelling: sentiment analysis, natural language inference, and machine translation. In experiments across 9 datasets and 4 model types, we show SSMBA consistently outperforms baseline models and other data augmentation methods on both in-domain and OOD data.
2 Background and Related Work

Data Augmentation in NLP
The problem of domain adaptation and OOD robustness is well established in NLP (Blitzer et al., 2007;Daumé III, 2007;Hendrycks et al., 2020). Existing work on improving generalization has focused on data augmentation, where synthetically generated training examples are used to augment an existing dataset. It is hypothesized that these examples induce robustness to local perturbations, which has been shown to be effective in semi-supervised and self-supervised settings (Bachman et al., 2014;Szegedy et al., 2014;Sajjadi et al., 2016).
Existing task-specific methods (Kafle et al., 2017) and word-level methods (Zhang et al., 2015;Xie et al., 2017;Wei and Zou, 2019) are based on human-designed heuristics. Back-translation from or through another language has been applied in the context of machine translation , question answering (Yu et al., 2018), and consistency training (Xie et al., 2019). More recent work has used word embeddings (Wang and Yang, 2015) and LSTM language models (Fadaee et al., 2017) to perform word replacement. Other methods focus on fine-tuning contextual language models (Kobayashi, 2018;Wu et al., 2019b;Kumar et al., 2020) or large generative models (Anaby-Tavor et al., 2020;Yang et al., 2020;Kumar et al., 2020) to generate synthetic examples.

Vicinal
Risk Minimization (VRM) (Olivier Chapelle and Vapnik, 2000) formalizes data augmentation as enlarging the training set support by drawing samples from a vicinity of existing training examples. Typically the Figure 2: To sample from an MLM DAE, we apply the MLM corruption q to the original sentence then reconstruct the corrupted sentence using our DAE r. vicinity of a training example is defined using dataset-dependent heuristics. For example, in computer vision, examples are generated using scale augmentation (Simonyan and Zisserman, 2015), color augmentation (Krizhevsky et al., 2012), and translation and rotation (Simard et al., 1998).
The manifold assumption states that high dimensional data concentrates around a low-dimensional manifold (Chapelle et al., 2006). This assumption allows us to define the vicinity of a training example as its manifold neighborhood, the portion of the neighborhood that lies on the data manifold. Recent methods have used the manifold assumption to improve robustness by moving examples towards a decision boundary (Kanbak et al., 2018), generating adversarial examples (Szegedy et al., 2014;Miyato et al., 2017), interpolating between pairs of examples (Zhang et al., 2018), or finding affine transforms (Paschali et al., 2019).

Sampling from Denoising Autoencoders
A denoising autoencoder (DAE) is an autoencoder trained to reconstruct a clean input x from a stochastically corrupted one x ∼ q(x |x) by learning a conditional distribution P θ (x|x ) (Vincent et al., 2008). We can sample from a DAE by successively corrupting and reconstructing an input using the following pseudo-Gibbs Markov chain: As the number of training examples increases, the asymptotic distribution π n (x) of the generated samples approximate the true data-generating distribution P (x) (Bengio et al., 2013). This corruptionreconstruction process allows for sampling directly along the manifold that P (x) concentrates on.

Masked Language Models
Recent advances in unsupervised representation learning for natural language have relied on pretraining models on a masked language modeling (MLM) objective (Devlin et al., 2018;Liu et al.,  for (x i , y i ) ∈ D do 6:  We now describe Self-Supervised Manifold Based Data Augmentation. Let our original dataset D consist of pairs of input and output vectors D = {(x 1 , y 1 ) . . . (x n , y n )}. We assume the in-put points concentrate around an underlying lower dimensional data manifold M. Let q be a corruption function from which we can draw a sample x ∼ q(x |x) such that x no longer lies on M. Let r be a reconstruction function from which we can draw a samplex ∼ r(x|x ) such thatx lies on M.
To generate an augmented dataset, we take each pair (x i , y i ) ∈ D and sample a perturbed x i ∼ q(x |x i ). We then sample a reconstructed x ij ∼ r(x|x i ). A corresponding vectorŷ ij can be generated by preserving y i , or, since examples in the manifold neighborhood may cross decision boundaries on more sensitive tasks, by using a teacher model trained on the original data. This operation can be repeated to generate multiple augmented examples for each input example. These new examples form a dataset that we can augment the original training set with. We can then train an augmented model on the new augmented dataset.
In this paper we investigate SSMBA's use on natural language tasks, using the MLM training corruption function as our corruption function q and a pre-trained BERT model as our reconstruction model r. Different from other data augmentation methods, SSMBA does not rely on task-specific knowledge, requires no dataset-specific fine-tuning, and is applicable to any supervised natural language task. SSMBA requires only a pair of functions q and r used to generate data.

Datasets
To empirically evaluate our proposed algorithm, we select 9 datasets -4 sentiment analysis datasets, 2 natural language inference (NLI) datasets, and 3 machine translation (MT) datasets. Table 1 and Appendix A provide dataset summary statistics. All datasets either contain metadata that can be used to split the samples into separate domains or similar datasets that are treated as separate domains.

Sentiment Analysis
The Amazon Review Dataset (Jianmo Ni, 2019) contains product reviews from Amazon. Following Hendrycks et al. 2020, we form two datasets: AR-Full contains reviews from the 10 largest categories, and AR-Clothing contains reviews in the clothing category separated into subcategories by metadata. Since the reviews in AR-Clothing come from the same top-level category, the amount of domain shift is much less than that of AR-Full. Models predict a review's 1 to 5 star rating.  SST2 (Socher et al., 2013) contains movie review excerpts. Following Hendrycks et al. 2020 we pair this dataset with the IMDb dataset (Maas et al., 2011), which contains full length movie reviews. We call this pair the Movies dataset. Models predict a movie review's binary sentiment.
The Yelp Review Dataset contains restaurant reviews with associated business metadata which we preprocess following Hendrycks et al. 2020. Models predict a review's 1 to 5 star rating.

Natural Language Inference
MNLI (Williams et al., 2018) is a corpus of NLI data from 10 distinct genres of written and spoken English. We train on the 5 genres with training data and test on all 10 genres. Since the dataset does not include labeled test data, we use the validation set as our test set and sample 2000 examples from each training set for validation.
ANLI (Nie et al., 2019) is a corpus of NLI data designed adversarially by humans such that stateof-the-art models fail to classify examples correctly. The dataset consists of three different levels of difficulty which we treat as separate textual domains.

Machine Translation
Following Mller et al. 2019, we consider two translation directions, German→English (de→en) and German→Romansh (de→rm). Romansh is a lowresource language with an estimated 40,000 native speakers where OOD robustness is of practical relevance (Mller et al., 2019).
In the de→en direction, we use IWSLT14 de→en (Cettolo et al., 2014) as a widely-used benchmark to test in-domain performance. We also use the OPUS (Tiedemann, 2012) dataset to test OOD generalization. We train on highly specific in-domain data (medical texts) and disparate out-of-domain data (Koran text, Ubuntu localization files, movie subtitles, and legal text). Since domains share very little similarities in language, generalization to out-of-domain text is extremely difficult. In the de→rm direction, we use a training set consisting of the Allegra corpus (Scherrer and Cartoni, 2012) and Swiss press releases. We use blog posts from Convivenza as a test domain.

Model Types
For sentiment analysis tasks, we investigate LSTMs (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (CNNs). For NLI tasks, we investigate fine-tuned RoBERTa BASE models (Liu et al., 2019), which are pretrained bidirectional transformers (Vaswani et al., 2017). On both tasks, representations from the encoder are fed into an feed-forward neural network for classification. For MT tasks, we train transformers (Vaswani et al., 2017). For all models, word embeddings are initialized randomly and trained end-to-end with the model. We do not initialize with pre-trained word embeddings to maintain consistency across all models and tasks. Model hyperparameters are tuned to maximize performance on in-domain validation data. Training details and hyperparameters for all models are provided in Appendix C.

SSMBA Settings
For all experiments we use the MLM corruption function as our corruption function q. We tune tune the total percentage of tokens corrupted, leaving the percentages of specific corruption operations (80% masked, 10% random, 10% unmasked) the same. For sentiment analysis and NLI experiments we use a pre-trained RoBERTa BASE model as our reconstruction function r, and for translation experiments we use a pre-trained German BERT model (Chan et al., 2020). For each input example, we generate 5 augmented examples using unrestricted sampling. For translation experiments, target side translations are generated with beam search with width 5. SSMBA hyperparameters, including augmented example labelling method and corruption percentage, are chosen based on in-domain validation performance. Hyperparameters for each dataset are provided in Appendix D.

Baselines
On sentiment analysis and NLI tasks, we compare against 3 data augmentation methods. Easy Data Augmentation (EDA) (Wei and Zou, 2019) is a heuristic method that randomly replaces synonyms and inserts, swaps, and deletes words. Conditional Bert Contextual Augmentation (CBERT) (Wu et al., 2019b) finetunes a class-conditional BERT model and uses it to generate sentences in a process similar to our own. Unsupervised Data Augmentation (UDA) (Xie et al., 2020) translates data to and from a pivot language to generate paraphrases. We adapt UDA for supervised classification tasks by training directly on the backtranslated data.
On translation tasks, we compare only against methods which do not require additional target side monolingual data. Word dropout  randomly chooses words in the source sentence to set to zero embeddings. Reward Augmented Maximum Likelihood (RAML) (Norouzi et al., 2016) samples noisy target sentences based on an exponential of their Hamming distance from the original sentence. SwitchOut (Wang et al., 2018) applies a noise function similar to RAML to both the source and target side. We use publicly available implementations for all methods.

Evaluation Method
We train LSTM and CNN models with 10 random seeds, RoBERTa models with 5 random seeds, and transformer models with 3 random seeds. Models are trained separately on each domain then evaluated on all domains, and performance is averaged across seeds and test domains. We report the average in-domain (ID) and OOD performance across all train domains. On sentiment analysis and NLI tasks we report accuracy, and on translation we report uncased tokenized BLEU (Papineni et al., 2002) for IWSLT and cased, detokenized BLEU with SacreBLEU 2 (Post, 2018) for all others. Statistical testing details are in Appendix E. Table 2 present results on sentiment analysis. Across all datasets, models trained with SSMBA outperform baseline models and all other data augmentation methods on OOD data. On ID data, SSMBA outperforms baseline models and other data augmentation methods on all datasets for CNN models, and 3/4 datasets for RNN models. On average, SSMBA improves OOD performance by 1.1% for RNN models and 0.7% for CNN models, and ID performance by 0.8% for RNN models and 0.4% for CNN model. Other methods achieve much smaller OOD generalization gains and perform worse than baseline models on multiple datasets.

Sentiment Analysis
On the AR-Full dataset, RNNs trained with SSMBA demonstrate improvements in OOD accuracy of 1.1% over baseline models. On the AR-Clothing dataset, which exhibits less domain shift than AR-Full, RNNs trained with SSMBA exhibit slightly lower OOD improvement. CNN models exhibit about the same boost in OOD accuracy across both Amazon review datasets.
On the Movies dataset where we observe a large difference in average sentence length between the two domains, SSMBA still manages to present considerable gains in OOD performance. Although RNNs trained with SSMBA fail to improve ID performance, their OOD performance in this setting still beats other data augmentation methods.
On the Yelp dataset, we observe large performance gains on both ID and OOD data for RNN models. The improvements on CNN models are more modest, but notably our method is the only one that improves OOD generalization. Table 3 presents results on NLI tasks. Models trained with SSMBA outperform or match baseline models and data augmentation methods on both ID and OOD data. Even with a more difficult task and stronger baseline model, SSMBA still confers large accuracy gains. On MNLI, SSMBA improves OOD accuracy by 1.8%, while the best performing baseline achieves only 0.3% improvement. Our method also improves ID accuracy by 1.4%. All other baseline methods hurt both ID and OOD accuracy, or confer negligible improvements.

Natural Language Inference
On the intentionally difficult ANLI, SSMBA maintains baseline OOD accuracy while conferring a large 6% improvement on ID data. Other     Scores marked with a * and † are statistically significantly higher than baseline transformers and the next best model, both with p < 0.01. Table 4 presents results on IWSLT14 de→en. We compare our results with convolutional models (Edunov et al., 2018) and strong baseline transformer and dynamic convolution models (Wu et al., 2019a). SSMBA improves BLEU by almost 1.5 points, outperforming all other baseline and comparison models. Compared to SSMBA, other augmentation methods offer much smaller improvements or even degrade performance. Table 5 presents results on OPUS and de→rm. On OPUS, where the training domain contains highly specialized language and differs significantly both from other domains and the learned MLM manifold, SSMBA offers a small boost in OOD BLEU but degrades ID performance. All other augmentation methods degrade both ID and OOD performance. On de→rm, SSMBA improves OOD BLEU by a large margin of 2.4 points, and ID BLEU by 0.4 points. Other augmentation meth-ods offer much smaller OOD improvements while degrading ID performance.

Analysis and Discussion
In this section, we analyze the factors that influence SSMBA's performance. Due to its relatively small size (25k sentences), number of OOD domains (3), and amount of domain shift, we focus our analysis on the Baby domain within the AR-Clothing dataset. Ablations are performed on a single domain rather than all domains, so error bars correspond to variance in models trained with different seeds and results are not comparable with those in Table 2. Unless otherwise stated, we train CNN models and augment with SSMBA, corrupting 45% of tokens, performing unrestricted sampling when reconstructing, and using self-supervised soft labelling, generating 5 synthetic examples for each training example.

Training Set Size
We first investigate how the size of the initial dataset affects SSMBA's effectiveness. Since a smaller dataset covers less of the training distribution, we might expect the data generated by SSMBA to explore less of the data manifold and reduce its effectiveness. We subsample 25% of the original dataset to form a new training set, then repeat this process successively to form exponentially smaller and smaller datasets. The smallest dataset contains only 24 examples. For each dataset fraction, we train 10 models and average performance, tuning a set of SSMBA hyperparameters on the same ID validation data. Figure 4 shows that SSMBA offers OOD performance gains across almost all dataset sizes, even in low resource settings with less than 100 training examples.

Reconstruction Model Capacity
Since SSMBA relies on a reconstruction function that approximates the underlying data manifold, we might expect a larger and more expressive model to generate higher quality examples. We investigate three models of varying size: Distil-RoBERTa (Sanh et al., 2019) with 82M parameters, RoBERTa BASE with 125M parameters, and RoBERTa LARGE with 355M parameters. For each reconstruction model, we generate a set of 10 augmented datasets and train a set of 10 models on each augmented dataset. We average performance across models and datasests. Table 6 shows that  SSMBA displays robustness to the choice of reconstruction model, with all models conferring similar improvements to OOD accuracy. Using the smaller DistilRoBERTa model only degrades performance by a small margin.

Corruption Amount
How sensitive is SSMBA to the particular amount of corruption applied? Empirically, tasks that were more sensitive to input noise, like sentiment analysis, required less corruption than those that were more robust, like NLI. To analyze the effect of tuning the corruption amount, we generate 10 sets of augmented data with varying percentages of corruption, then train 10 models on each dataset, averaging performance across all 100 models. Figure 5 shows that for corruption percentages below 50%, our algorithm is relatively robust to the specific amount of corruption applied. OOD performance peaks at 45% corruption, decreasing thereafter as corruption increases. Very large amounts of corruption tend to degrade performance, although surprisingly all augmented models still outperform unaugmented models, even when 95% of tokens are corrupted. In experiments on the more input sensitive NLI task, large amounts of noise degraded performance below baselines.

Sample Generation Methods
Next we investigate methods for generating the reconstructed examplesx ∼ r(x|x ). Top-k sampling draws samples from the MLM distribution on the top-k most probable tokens, leading to augmented data that explores higher probability regions of the manifold. We investigate top1, top5, top10, top20, and top50 sampling. Unrestricted sampling draws samples from the full probability distribution of tokens. This method explores a larger area of the underlying data distribution but can often lead to augmented data in low probability regions.
For each sample generation method, we generate 5 sets of augmented data and train 10 models on each dataset. OOD accuracy is averaged across all models for a given sampling method. Figure 6 shows that unrestricted sampling provides the greatest increase in OOD accuracy, with top-k sampling methods all performing similarly. This suggests that SSMBA works best when it is able to explore the manifold without any restrictions.

Amount of Augmentation
How does OOD accuracy change as we generate more sentences and explore more of the manifold neighborhood? To investigate we select various augmentation amounts and generate 5 datasets for each amount, training 10 models on each dataset and averaging OOD accuracy across all 50 models. Figure 7 shows that increasing the amount of augmentation increases the amount by which SSMBA improves OOD accuracy, as well as decreasing the variance in the OOD accuracy of trained models.

Label Generation
We investigate 3 methods to generate a labelŷ ij for a synthetic examplex ij . Label preservation preserves the original label y i . Since the manifold neighborhood of an example may cross a decision boundary, we also investigate using a supervised model f trained on the original set of unaugmented data for hard labelling of a one-hot class labelŷ ij and soft labelling of a class distributionŷ ij .
We train a CNN model to varying levels of convergence and validation accuracy, then label a set of 5 augmented datasets with each labelling method. When training with soft labels, we optimize the KL-divergence between the output distribution and soft label distribution. For each dataset we train 10 models and average performance across all models and datasets. Results are shown in Figure 8.
Unsurprisingly, soft and hard labelling with a low accuracy model degrades performance. As our supervision classifier improves, so does the performance of models trained with soft and hard labelled data. Once we pass a certain accuracy threshold, models trained with soft labels begin outperforming all other models. This threshold varies depending on the difficulty of the dataset and task.
In ANLI experiments, labelling augmented examples even with a poor performing model still improved downstream accuracy.

Conclusion
In this paper, we introduce SSMBA, a method for generating synthetic data in settings where the underlying data manifold is difficult to characterize. In contrast to other data augmentation methods, SSMBA is applicable to any supervised task, requires no task-specific knowledge, and does not rely on dataset-specific fine-tuning. We demonstrate SSMBA's effectiveness on three NLP tasks spanning classification and sequence modeling: sentiment analysis, natural language inference, and machine translation. We achieve gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on indomain IWSLT14 de→en. Our analysis shows that SSMBA is robust to the initial dataset size, reconstruction model choice, and corruption amount, offering OOD robustness improvements in most settings. Future work will explore applying SSMBA to the target side manifold in structured prediction tasks, as well as other natural language tasks and settings where data augmentation is difficult.

B Data Preprocessing
We use the same preprocessing steps across all sentiment analysis and NLI experiments. All data is first tokenized using a GPT-2 style tokenizer and BPE vocabulary provided by fairseq (Ott et al., 2019). This BPE vocabulary consists of 50263 types. Corresponding labels are encoded using a label dictionary consisting of as many types as there are classes. Input text and labels are then binarized for model training. Although all models share the same vocabulary, we randomly initialize each model's embeddings and train the entire model endto-end. For machine translation experiments, we follow Mller et al. 2019 and learn a 16k BPE on OPUS and a 32k BPE on de→rm. On IWSLT14 we learn a 10k BPE. We use a separate vocabulary for the source and target side.

C Model Architecture and Training Hyperparameters
All models are written and trained within the fairseq framework (Ott et al., 2019) with T4 GPUs. LSTM and CNN models were trained on a single GPU, RoBERTa models were trained with 4 GPUs, and tranfsormer models were trained with 2 GPUs. On average, when trained on augmented data, LSTM and CNN models took an hour to train to convergence, RoBERTa models took 12 hours to train to convergence, and transformer models took 24 hours to train to convergence. Models trained on unaugmented data took roughly 20% of the time of models trained on augmented data to reach convergence. For each model we investigate, we present first the model architecture and then the training hyperparameters.

C.1 LSTM
Our LSTM models are a single layer of 512 nodes. Input embeddings are 512 dimensions. The output embedding from the last time step is fed into a MLP classifier with a single hidden layer of 512 dimensions. Models contain 28M parameters. Dropout of 0.3 is applied to the input and output of our encoder, and dropout of 0.1 is applied to the MLP classifier.
We train with Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.98) and = 1e−6. Our learning rate is set to 1e−4 and is first warmed up for 2 epochs before it is decayed using an inverse square root scheduler.

C.2 CNN
Our CNN models are based on the architecture in (Kim, 2014). As in our LSTM models, our input embeddings are 512 dimensional, which we treat as our channel dimension. We apply three convolutions of kernel size 3, 4, and 5, with 256 output channels. Models contain 27M parameters. Convolutional outputs are max-pooled over time then concatenated to a 768-dimensional encoded representation. Again, we feed this representation into a MLP classifier with a single hidden layer of 512 dimensions. We apply dropout of 0.2 to our inputs and MLP classifier.
We train with Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.98) and = 1e−6. Our learning rate is set to 1e−3 and is first warmed up for 2 epochs before it is decayed using an inverse square root scheduler.

C.3 RoBERTa
Our RoBERTa models use a pre-trained RoBERTa BASE model provided by fairseq. As in other models, classification token embeddings are fed into an MLP classifier with a single hidden layer of 512 dimensions. Models contain 125M parameters. We follow the MNLI fine-tuning procedures in fairseq, training with learning rate 1e−5 with Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.98) and = 1e−6. We warmup the learning rate for 2 epochs before decaying with an inverse square root scheduler.

C.4 Transformer
Transformer models are trained with labelsmoothed cross-entropy and label smoothing 0.1. Due to the dataset sizes, we use a slightly smaller transformer architecture with embedding dimension 512, feed forward embedding dimension 1024, 4 encoder heads, and 6 encoder and decoder layers. Models contain 52M parameters. We also apply dropout of 0.3 and weight decay of 0.0001. All other hyperparameters follow the base architecture in Vaswani et al. 2017.
As in other models, we train with Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.98) and = 1e−6. Our learning rate is set to 5e−4 and is first warmed up for 4000 updates before it is decayed using an inverse square root scheduler.

D SSMBA Hyperparameters
SSMBA hyperparameters for each dataset and domain are provided in table 8. Hyperparameters are chosen based on in-domain validation performance. A detailed analysis of hyperparameter tuning is provided in section 7.

E Statistical Testing
For the statistical tests on sentiment analysis and NLI tasks, we use a Wilcoxon ranked-sum test. Specifically, we compare averages of model performances on pairs of training and test domains. For example, in a dataset with 3 domains, D1, D2, and D3, we have 3 in-domain train-test pairs (D1-D1, D2-D2, D3-D3), and 6 out-of-domain traintest pairs (D1-D2, D1-D3, D2-D1, D2-D3, D3-D1, D3-D2). We calculate the average performance for each model on each pair, then compare the matched in-domain and out-of-domain pairs. Since the number of samples we can compare depends on the total number of domains in the dataset, a larger number of datasets gives us a better sense of our statistical significance.
For the statistical tests on machine translation tasks, we use a paired bootstrap resampling approach (Koehn, 2004). Since the test works only on a single system's output, we run the test on every pairing of seeds and test domains for the two comparison models. We report the significance level only if all tests result in a small enough probability.