Pretrained Transformers Improve Out-of-Distribution Robustness

Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained Transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.


Introduction
The train and test distributions are often not identical. Such train-test mismatches occur because evaluation datasets rarely characterize the entire distribution (Torralba and Efros, 2011), and the test distribution typically drifts over time (Quiñonero-Candela et al., 2009). Chasing an evolving data distribution is costly, and even if the training data does not become stale, models will still encounter unexpected situations at test time. Accordingly, models must generalize to OOD examples whenever possible, and when OOD examples do not belong to any known class, models must detect them in order to abstain or trigger a conservative fallback policy (Emmott et al., 2015).
Most evaluation in natural language processing (NLP) assumes the train and test examples are independent and identically distributed (IID). In the IID setting, large pretrained Transformer models can attain near human-level performance on numerous tasks. However, high IID accuracy does not necessarily translate to OOD robustness for image classifiers (Hendrycks and Dietterich, 2019), and pretrained Transformers may embody this same fragility. Moreover, pretrained Transformers can rely heavily on spurious cues and annotation artifacts (Cai et al., 2017; Gururangan et al., 2018) which out-of-distribution examples are less likely to include, so their OOD robustness remains uncertain.
In this work, we systematically study the OOD robustness of various NLP models, such as word embedding averages, LSTMs, pretrained Transformers, and more. We decompose OOD robustness into a model's ability to (1) generalize and to (2) detect OOD examples (Card et al., 2018).
To measure OOD generalization, we create a new evaluation benchmark that tests robustness to shifts in writing style, topic, and vocabulary, and spans the tasks of sentiment analysis, textual entailment, question answering, and semantic similarity. We create OOD test sets by splitting datasets with their metadata or by pairing similar datasets together (Section 2). Using our OOD generalization benchmark, we show that pretrained Transformers are considerably more robust to OOD examples than traditional NLP models (Section 3). We show that the performance of an LSTM semantic similarity model declines by over 35% on OOD examples, while a RoBERTa model's performance slightly increases. Moreover, we demonstrate that while pretraining larger models does not seem to improve OOD generalization, pretraining models on diverse data does improve OOD generalization.
To measure OOD detection performance, we turn classifiers into anomaly detectors by using their prediction confidences as anomaly scores (Hendrycks and Gimpel, 2017). We show that many non-pretrained NLP models are often near or worse than random chance at OOD detection. In contrast, pretrained Transformers are far more capable at OOD detection. Overall, our results highlight that while there is room for future robustness improvements, pretrained Transformers are already moderately robust.
How We Test Robustness

Train and Test Datasets
We evaluate OOD generalization with seven carefully selected datasets. Each dataset either (1) contains metadata which allows us to naturally split the samples or (2) can be paired with a similar dataset from a distinct data generating process. By splitting or grouping our chosen datasets, we can induce a distribution shift and measure OOD generalization.
We utilize four sentiment analysis datasets:
• We use SST-2, which contains pithy expert movie reviews (Socher et al., 2013), and IMDb (Maas et al., 2011), which contains full-length lay movie reviews. We train on one dataset and evaluate on the other dataset, and vice versa. Models predict a movie review's binary sentiment, and we report accuracy.
• The Yelp Review Dataset contains restaurant reviews with detailed metadata (e.g., user ID, restaurant name). We carve out four groups from the dataset based on food type: American, Chinese, Italian, and Japanese. Models predict a restaurant review's binary sentiment, and we report accuracy.
• The Amazon Review Dataset contains product reviews from Amazon (McAuley et al., 2015; He and McAuley, 2016). We split the data into five categories of clothing (Clothes, Women Clothing, Men Clothing, Baby Clothing, Shoes) and two categories of entertainment products (Music, Movies). We sample 50,000 reviews for each category. Models predict a review's 1 to 5 star rating, and we report accuracy.
We also utilize these datasets for semantic similarity, reading comprehension, and textual entailment:
• STS-B requires predicting the semantic similarity between pairs of sentences (Cer et al., 2017). The dataset contains text of different genres and sources; we use four sources from two genres: MSRpar (news), Headlines (news); MSRvid (captions), Images (captions). The evaluation metric is Pearson's correlation coefficient.
• ReCoRD is a reading comprehension dataset using paragraphs from CNN and Daily Mail news articles and automatically generated questions. We bifurcate the dataset into CNN and Daily Mail splits and evaluate using exact match.
• MNLI is a textual entailment dataset using sentence pairs drawn from different genres of text (Williams et al., 2018). We select examples from two genres of transcribed text (Telephone and Face-to-Face) and one genre of written text (Letters), and we report classification accuracy.
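The metadata-based splitting described above can be illustrated with a minimal sketch. The record layout and the field name `food_type` are hypothetical choices made for this example, not the benchmark's actual schema; the idea is only that grouping by a metadata field yields train/test pairs in which matching groups are IID and mismatched groups are OOD.

```python
from collections import defaultdict


def split_by_metadata(examples, key):
    """Group examples by one metadata field (e.g. a review's food type)."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[key]].append(example)
    return dict(groups)


def train_test_pairs(groups):
    """List every (train_group, test_group) pair.

    A pair with identical names is an IID evaluation; any other pair
    is an OOD evaluation.
    """
    return [(train, test) for train in groups for test in groups]


# Toy records; the field names are assumptions made for illustration.
reviews = [
    {"text": "Great burgers", "label": 1, "food_type": "American"},
    {"text": "Soggy noodles", "label": 0, "food_type": "Chinese"},
    {"text": "Perfect pasta", "label": 1, "food_type": "Italian"},
    {"text": "Bland fries", "label": 0, "food_type": "American"},
]

groups = split_by_metadata(reviews, "food_type")
pairs = train_test_pairs(groups)
ood_pairs = [(train, test) for train, test in pairs if train != test]
```

Here a model would be trained once per group and scored on every group, so that IID and OOD scores for the same model can be compared directly.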

Embedding and Model Types
We evaluate NLP models with different input representations and encoders. We investigate three model categories with a total of thirteen models.
Bag-of-words (BoW) Model. We use a bag-of-words model (Harris, 1954), which is high-bias but low-variance, so it may exhibit performance stability. The BoW model is only used for sentiment analysis and STS-B due to its low performance on the other tasks. For STS-B, we use the cosine similarity of the BoW representations from the two input sentences.
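As a rough illustration of the STS-B scoring rule just described, the sketch below computes the cosine similarity of two bag-of-words count vectors. The lowercasing and whitespace tokenization are assumptions made for this example; the tokenization actually used is not specified here.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Count word occurrences (lowercased whitespace tokens, an assumption)."""
    return Counter(sentence.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an empty sentence as zero
    return dot / (norm_a * norm_b)


def bow_similarity(sentence_a, sentence_b):
    """Predicted semantic similarity of a sentence pair."""
    return cosine_similarity(bag_of_words(sentence_a), bag_of_words(sentence_b))
```

Because the score depends only on word overlap, two paraphrases with disjoint vocabularies receive a similarity of zero, which is one way to see why this model is high-bias.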

Word Embedding Models. We use word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) word embeddings. These embeddings are encoded with one of three models: word averages (Wieting et al., 2016), LSTMs (Hochreiter and Schmidhuber, 1997), and Convolutional Neural Networks (ConvNets). For classification tasks, the representation from the encoder is fed into an MLP. For STS-B and MNLI, we use the cosine similarity of the encoded representations from the two input sentences. For reading comprehension, we use the DocQA model (Clark and Gardner, 2018) with GloVe embeddings. We implement our models in AllenNLP and tune the hyperparameters to maximize validation performance on the IID task.
Pretrained Transformers. We investigate BERT-based models (Devlin et al., 2019) which are pretrained bidirectional Transformers (Vaswani et al., 2017) with GELU (Hendrycks and Gimpel, 2016) activations. In addition to using BERT Base and BERT Large, we also use the large version of RoBERTa (Liu et al., 2019b), which is pretrained on a larger dataset than BERT. We use ALBERT (Lan et al., 2020) and also a distilled version of BERT, DistilBERT. We follow the standard BERT fine-tuning procedure (Devlin et al., 2019) and lightly tune the hyperparameters for our tasks. We perform our experiments using the HuggingFace Transformers library.

Out-of-Distribution Generalization
In this section, we evaluate the OOD generalization of numerous NLP models on seven datasets and highlight the main takeaways. A subset of the results is in Figures 1 and 2. Full results are in the Appendix.
Pretrained Transformers are More Robust.
In our experiments, pretrained Transformers often have smaller generalization gaps from IID data to OOD data than traditional NLP models. For instance, Figure 1 shows that the LSTM model declined by over 35%, while RoBERTa's performance slightly increased.
Bigger Models Are Not Always Better.
While larger models reduce the IID/OOD generalization gap in computer vision (Hendrycks and Dietterich, 2019; Xie and Yuille, 2020; Hendrycks et al., 2019d), we find the same does not hold in NLP. Figure 3 shows that larger BERT and ALBERT models do not reduce the generalization gap. However, in keeping with results from vision (Hendrycks and Dietterich, 2019), we find that model distillation can reduce robustness, as evident in our DistilBERT results in Figure 2. This highlights that testing model compression methods for BERT (Ganesh et al., 2020; Li et al., 2020) [...]
([...] 2020; Hendrycks et al., 2019a), pretraining on larger and more diverse datasets can improve robustness. RoBERTa exhibits greater robustness than BERT Large; one of the largest differences between these two models is that RoBERTa pretrains on more data. See Figure 2's results.
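The IID/OOD generalization gap discussed in this section can be made concrete with a small sketch. The function name is hypothetical, and because the text does not state whether reported declines are absolute or relative, the sketch returns both; the scores in the usage lines are made-up numbers, not results from the benchmark.

```python
def generalization_gap(iid_score, ood_score):
    """Return (absolute_gap, relative_gap) between IID and OOD scores.

    A positive gap means the model does worse on OOD data; a negative
    gap means OOD performance is higher than IID performance.
    """
    absolute = iid_score - ood_score
    relative = absolute / iid_score if iid_score != 0 else float("nan")
    return absolute, relative


# Illustrative values only (not taken from the paper's figures).
drop_abs, drop_rel = generalization_gap(80.0, 60.0)   # large decline
gain_abs, gain_rel = generalization_gap(90.0, 91.0)   # slight increase
```

Reporting the gap rather than raw OOD accuracy separates robustness from overall accuracy, since a more accurate model can still lose more when the distribution shifts.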

Out-of-Distribution Detection
Since OOD robustness requires evaluating both OOD generalization and OOD detection, we now turn to the latter. Without access to an outlier dataset (Hendrycks et al., 2019b), the state-of-the-art OOD detection technique is to use the model's prediction confidence to separate in- and out-of-distribution examples (Hendrycks and Gimpel, 2017). Specifically, we assign an example $x$ the anomaly score $-\max_y p(y \mid x)$, the negative prediction confidence, to perform OOD detection. [...] 20 Newsgroups (Lang, 1995), the English source side of English-German WMT16 and English-German Multi30K (Elliott et al., 2016), and concatenations of the premise and hypothesis for RTE (Dagan et al., 2005) and SNLI (Bowman et al., 2015). These examples are used only during OOD evaluation, not training.
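The confidence-based anomaly score defined above, the negative maximum softmax probability, can be sketched as follows. The sketch assumes the classifier exposes raw logits for each example; the function names are illustrative.

```python
import math


def softmax(logits):
    """Convert logits to class probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits):
    """Negative prediction confidence: -max_y p(y | x).

    Higher (closer to zero) scores mean the model is less confident,
    so the example is treated as more anomalous.
    """
    return -max(softmax(logits))


confident = anomaly_score([5.0, -5.0])   # near -1: looks in-distribution
uncertain = anomaly_score([0.1, -0.1])   # near -0.5: looks anomalous
```

For a binary classifier the score lies in the interval [-1, -0.5], so a threshold on it separates confident predictions from near-uniform ones.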
For evaluation, we follow past work (Hendrycks et al., 2019b) and report the False Alarm Rate at 95% Recall (FAR95). The FAR95 is the probability that an in-distribution example raises a false alarm, assuming that 95% of all out-of-distribution examples are detected. Hence a lower FAR95 is better. Partial results are in Figure 4, and full results are in the Appendix.
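One way to compute the FAR95 as defined above is sketched below. It treats OOD examples as the positive class, picks the highest threshold that still flags at least 95% of them, and reports the fraction of in-distribution examples flagged at that threshold. The function name and the handling of ties (scores equal to the threshold are flagged) are assumptions of this sketch.

```python
import math


def far_at_recall(id_scores, ood_scores, recall=0.95):
    """False alarm rate on in-distribution data at a given OOD recall.

    Scores are anomaly scores: higher means more anomalous.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(ood_scores, reverse=True)
    # Number of OOD examples that must be flagged; the small epsilon
    # guards against floating-point error in recall * len(ranked).
    needed = max(1, math.ceil(recall * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]
    false_alarms = sum(1 for score in id_scores if score >= threshold)
    return false_alarms / len(id_scores)


# Toy scores: 20 OOD examples and 4 in-distribution examples.
toy_ood = list(range(1, 21))
toy_id = [0, 0, 1, 3]
toy_far = far_at_recall(toy_id, toy_ood)
```

In the toy example, 19 of the 20 OOD scores must be flagged, which sets the threshold at 2; only one of the four in-distribution scores reaches it.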
Previous Models Struggle at OOD Detection. Models without pretraining (e.g., BoW, LSTM word2vec) are often unable to reliably detect OOD examples. In particular, these models' FAR95 scores are sometimes worse than chance because the models often assign a higher probability to out-of-distribution examples than in-distribution examples. The models particularly struggle on 20 Newsgroups (which contains text on diverse topics including computer hardware, motorcycles, space), as their false alarm rates are approximately 100%.

Pretrained Transformers Are Better Detectors.
In contrast, pretrained Transformer models are better OOD detectors. Their FAR95 scores are always better than chance. Their superior detection performance is not solely because the underlying model is a language model, as prior work (Hendrycks et al., 2019b) shows that language models are not necessarily adept at OOD detection. Also note that in OOD detection for computer vision, higher accuracy does not reliably improve OOD detection (Lee et al., 2018), so pretrained Transformers' strong OOD detection performance could not have been taken for granted. Despite their relatively low FAR95 scores, pretrained Transformers still do not cleanly separate in- and out-of-distribution examples (Figure 5). OOD detection using pretrained Transformers is still far from perfect, and future work can aim towards creating better methods for OOD detection.

Discussion and Related Work
Why Are Pretrained Models More Robust? An interesting area for future work is to analyze why pretrained Transformers are more robust. A flawed explanation is that pretrained models are simply more accurate. However, this work and past work show that increases in accuracy do not directly translate to reduced IID/OOD generalization gaps (Hendrycks and Dietterich, 2019; Fried et al., 2019). One partial explanation is that Transformer models are pretrained on diverse data, and in computer vision, dataset diversity can improve OOD generalization (Hendrycks et al., 2020) and OOD detection (Hendrycks et al., 2019b). Similarly, Transformer models are pretrained with large amounts of data, which may also aid robustness (Orhan, 2019; Hendrycks et al., 2019a). However, this is not a complete explanation as BERT is pretrained on roughly 3 billion tokens, while GloVe is trained on roughly 840 billion tokens. Another partial explanation may lie in self-supervised training itself. Hendrycks et al. (2019c) show that computer vision models trained with self-supervised objectives exhibit better OOD generalization and far better OOD detection performance. Future work could propose new self-supervised objectives that enhance model robustness.
Domain Adaptation. Other research on robustness considers the separate problem of domain adaptation (Blitzer et al., 2007; Daumé III, 2007), where models must learn representations of a source and target distribution. We focus on testing generalization without adaptation in order to benchmark robustness to unforeseen distribution shifts. Unlike Fisch et al. (2019) and Yogatama et al. (2019), we measure OOD generalization by considering simple and natural distribution shifts, and we also evaluate more than question answering.
Adversarial Examples. Adversarial examples can be created for NLP models by inserting phrases (Jia and Liang, 2017), paraphrasing questions (Ribeiro et al., 2018), and reducing inputs (Feng et al., 2018). However, adversarial examples are often disconnected from real-world performance concerns (Gilmer et al., 2018). Thus, we focus on an experimental setting that is more realistic. While previous works show that, for all NLP models, there exist adversarial examples, we show that not all models are equally fragile. Rather, pretrained Transformers are overall far more robust than previous models.
Counteracting Annotation Artifacts. Annotators can accidentally leave unintended shortcuts in datasets that allow models to achieve high accuracy by effectively "cheating" (Cai et al., 2017; Gururangan et al., 2018; Min et al., 2019). These annotation artifacts are one reason for OOD brittleness: OOD examples are unlikely to contain the same spurious patterns as in-distribution examples. OOD robustness benchmarks like ours can stress test a model's dependence on artifacts (Liu et al., 2019a; Naik et al., 2018).

Conclusion
We created an expansive benchmark across several NLP tasks to evaluate out-of-distribution robustness. To accomplish this, we carefully restructured and matched previous datasets to induce numerous realistic distribution shifts. We first showed that pretrained Transformers generalize to OOD examples far better than previous models, so that the IID/OOD generalization gap is often markedly reduced. We then showed that pretrained Transformers detect OOD examples surprisingly well. Overall, our extensive evaluation shows that while pretrained Transformers are moderately robust, there remains room for future research on robustness.