Neural Unsupervised Domain Adaptation in NLP—A Survey

Deep neural networks excel at learning from labeled data and achieve state-of-the-art results on a wide array of Natural Language Processing tasks. In contrast, learning from unlabeled data, especially under domain shift, remains a challenge. Motivated by the latest advances, in this survey we review neural unsupervised domain adaptation techniques which do not require labeled target domain data. This is a more challenging yet more widely applicable setup. We outline methods, from early traditional non-neural methods to pre-trained model transfer. We also revisit the notion of domain, and we uncover a bias in the type of Natural Language Processing tasks that have received the most attention. Lastly, we outline future directions, particularly the broader need for out-of-distribution generalization of future NLP.


Introduction
Deep learning has undoubtedly pushed the frontier in Natural Language Processing (NLP). In particular, large pre-trained language models have improved results for a wide range of NLP applications. However, the lack of portability of NLP models to new conditions remains a central issue in NLP. For many target applications, labeled data is lacking (Y scarcity), and even for pre-training general models data might be scarce (X scarcity). This makes it even more pressing to revisit a particular type of transfer learning, namely domain adaptation (DA). A default assumption in many machine learning algorithms is that the training and test sets follow the same underlying distribution. When these distributions do not match, we face a dataset shift (Gretton et al., 2007), in NLP typically referred to as a domain shift. In this setup, the target domain and the source training data differ: they are not sampled from the same underlying distribution. Consequently, performance drops on the target, which undermines the ability of models to truly generalize into the wild. Domain adaptation is closely tied to a fundamental, bigger open issue in machine learning: generalization beyond the training distribution. Ultimately, intelligent systems should be able to adapt and robustly handle any test distribution, without having seen any data from it. This is the broader need for out-of-distribution generalization (Bengio, 2019), and a more challenging setup targeted at handling unknown domains (Volpi et al., 2018; Krueger et al., 2020).
Work on domain adaptation has focused largely on supervised domain adaptation (Daumé III, 2007; Plank, 2011). In such a classic supervised DA setup, a small amount of labeled target domain data is available, along with some larger amount of labeled source domain data. The task is to adapt from the source to the specific target domain in light of limited target domain data. However, annotation is a substantial, time-consuming and costly manual effort. While annotation directly mitigates the lack of labeled data, it does not easily scale to new application targets. In contrast, DA methods aim to shift the ability of models from the traditional interpolation of similar examples to models that extrapolate to examples outside the original training distribution (Ruder, 2019). Unsupervised domain adaptation (UDA) mitigates the domain shift issue without labeled target data, learning instead from unlabeled data, which is typically available for both source and target domain(s). UDA fits the classical real-world scenario better, in which labeled data in the target domain is absent, but unlabeled data might be abundant. UDA thus provides an elegant and scalable solution. We believe these advances in UDA will help for out-of-distribution generalization.
Figure 1: Taxonomy of DA as a special case of transductive transfer learning (left). Related problems (e.g., domain and out-of-distribution generalization) and DA setups (1:1 and multi-source adaptation) (right).
A categorization of domain adaptation in NLP We categorize research into model-centric, data-centric and hybrid approaches, as shown in Figure 1. Model-centric methods augment the feature space or alter the loss function, the architecture or the model parameters (Blitzer et al., 2006; Pan et al., 2010; Ganin et al., 2016). Data-centric methods focus on the data aspect and involve pseudo-labeling (or bootstrapping) to bridge the domain gap (Abney, 2007; Zhu and Goldberg, 2009; Ruder and Plank, 2018; Cui and Bollegala, 2019), data selection (Axelrod et al., 2011; Plank and van Noord, 2011; Ruder and Plank, 2017), or pre-training methods (Han and Eisenstein, 2019; Guo et al., 2020). As some approaches take elements of both, we include a hybrid category. A comprehensive overview of UDA methods and the tasks each method is applied to is provided in Table 1.
Contributions In this survey, we (i) comprehensively review neural approaches to unsupervised domain adaptation in NLP, (ii) analyze and compare the strengths and weaknesses of the described approaches, and (iii) outline potential challenges and future directions in this field.

Background
First we introduce the classic learning paradigm with its core assumption, then we outline DA setups. Given the training instances $X = \{x_1, \ldots, x_n\}$ and the corresponding class labels $Y = \{y_1, \ldots, y_n\}$, the goal of machine learning is to learn a function $f$ that generalizes well to unseen instances. In supervised learning, training data consists of tuples $\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of instances, while in unsupervised learning we only have $\{x_i\}_{i=1}^{n}$. A general assumption in supervised machine learning is that the test data follows the same distribution as the training data. Formally, training and test data are assumed to be independently and identically distributed (i.i.d.), i.e., sampled from the same underlying distribution. In practice, this assumption often does not hold, which translates into a drop in performance when the model $f$ trained on a source domain $S$ is tested on a different but related target domain $T$.

Domain adaptation and transfer learning: notation
Formally, a domain is defined as $\mathcal{D} = \{\mathcal{X}, P(X)\}$, where $\mathcal{X}$ is the feature space (e.g., the text representations) and $P(X)$ is the marginal probability distribution over that feature space. A task (e.g., sentiment classification) is defined as $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$, where $\mathcal{Y}$ is the label space. Estimates for the prior distribution $P(Y)$ and the likelihood $P(Y|X)$ are learned from the training data. Domain adaptation aims to learn a function $f$ from a source domain $\mathcal{D}_S$ that generalizes well to a target domain $\mathcal{D}_T$, where $P_S(X) \neq P_T(X)$. DA is a particular case of transfer learning, namely transductive transfer learning (Pan and Yang, 2009; Ruder, 2019). In inductive transfer learning, the source and target tasks differ (Pan and Yang, 2009). In transductive DA, the source and target tasks $\mathcal{T}_S$ and $\mathcal{T}_T$ remain the same, but the source and target domains $\mathcal{D}_S$ and $\mathcal{D}_T$ differ in their underlying probability distributions. Given two distributions $P_S(X, Y)$ and $P_T(X, Y)$, DA typically addresses the shift in marginal distribution $P_S(X) \neq P_T(X)$, also known as covariate shift. A related problem is label shift, $P_S(Y) \neq P_T(Y)$. Since we do not assume any labeled target data, we focus on the former.

What is a domain? From the notion of domain to variety space and related problems
Despite the formal definition of domain above, the term is used quite loosely in NLP and there is no common ground on what constitutes a domain (Plank, 2016). Typically in NLP, domain is meant to refer to some coherent type of corpus, i.e., predetermined by the given dataset (Plank, 2011). This may relate to topic, style, genre, or linguistic register. The notion of domain and what plays into it has, though, changed significantly over recent years, leading to relevant research lines.
First, the Penn Treebank WSJ corpus (Marcus et al., 1993) and the Brown corpus (Francis and Kucera, 1979) are prototypical examples, with the WSJ being widely considered the canonical newswire domain. In the recent decade, there has been considerable work on what is considered non-canonical data. The dichotomy between canonical (typically considered well-edited English newswire) and non-canonical data arose with the increasing interest in working with social media, with all its challenges related to the 'noisiness' of the domain (Eisenstein, 2013; Baldwin et al., 2013). Models trained on canonical data failed in light of the challenges on, e.g., Twitter (Gimpel et al., 2011; Foster et al., 2011).
The general quest to understand the implications of variations of language on model performance led to lines of work on how human factors impact data in a covert or overt way, e.g., on how latent socio-demographic factors impact NLP performance (Hovy, 2015; Nguyen et al., 2016), or how direct data collection strategies like crowdsourcing impact corpus composition (Geva et al., 2019) or frequency effects impact NLP performance. However, what is a domain? Is, say, Twitter its own domain? Or is it a set of subdomains? Similarly, do language samples of social groups (e.g., sociolects) form a domain or a set of subdomains?
Variety space We believe it is time to reconsider the notion of domain, the use of the term itself, and raise even more awareness of the underlying variation in the data samples NLP works with. NLP is pervasively facing heterogeneity in data along many underlying (often unknown) dimensions. A theoretical notion put forward by Plank (2016) is the variety space. In the variety space a corpus is seen as a subspace (subregion), a sample of the variety space. A corpus is a set of instances drawn from the underlying unknown high-dimensional variety space, whose dimensions (or latent factors) are fuzzy language and annotation aspects. These latent factors can be related to the notions discussed above, such as genre (e.g., scientific, newswire, informal), sub-domain (e.g., finance, immunology, politics, environmental law, molecular biology) and socio-demographic aspects (e.g., gender), among other unknown factors, as well as stylistic or data sampling impacts (e.g., sentence length, annotator bias).
In the spirit of the variety space (Plank, 2016), we suggest using the more general term variety, rather than domain, which points more directly to the underlying linguistic differences and their implications rather than the technical assumptions. Each corpus is inevitably biased towards a specialized language and some latent aspects. Understanding bias sources and effects, besides effects only (Shah et al., 2020), and documenting the known are the first important steps (Bender and Friedman, 2018), as is building broader, more varied corpora (Ide and Suderman, 2004). What we need more work on is linking the known to the unknown, and studying its impact. Doing so will ultimately help not only to overcome overfitting to overrepresented domains (e.g., the newswire bias (Plank, 2016)), but also to work on robustness and ultimately out-of-distribution generalization, as described later on.
Treating data as 'just another input' to machine learning is very problematic. For example, it is less known that the well-known Penn Treebank consists of multiple genres (Webber, 2009; Plank and van Noord, 2011), including reviews and some prose. It has almost universally been treated as the prototypical news domain. Similarly, social media is typically considered only non-canonical data, but an analysis revealed the data to lie on a "continuum of similarity" (Baldwin et al., 2013). This has implications for NLP performance. As we have seen, there is a multitude of dimensions to consider in corpus composition and annotations, which are tied to the theoretical notion of a variety space. They challenge the true generalization capabilities of current models. What remains is to study what variety comprises and how covert and overt factors impact results, and to take them into consideration in modeling and evaluation.
Related problems Following the idea of the variety space, we discuss three related notions: cross-lingual learning, domain generalization/robustness, and out-of-distribution generalization.
First, in cross-lingual learning the feature space changes drastically, as alphabets, vocabularies and word order can be different. It can be seen as an extreme adaptation scenario, for which parallel data may exist and can be used to build multilingual representations (Artetxe et al., 2020). Second, instead of adapting to a particular target, there is some work on domain generalization aimed at building a single system which is robust on several known target domains. One example is the SANCL shared task (Petrov and McDonald, 2012), where participants were asked to build a single system that can robustly parse reviews, weblogs, answers, emails and newsgroups. In this setup, the DA problem boils down to finding a more robust system for given targets. It can be seen as optimizing for both in-domain and out-of-domain(s) accuracy.
If domains are unknown a priori, robustness can be taken a step further towards out-of-domain generalization, to unknown targets, the most challenging setup. A recent solution is distributionally robust optimization (Oren et al., 2019), i.e., optimizing for worst-case performance without the knowledge of the test distribution. To do so, it assumes a subpopulation shift, where the test population is a subpopulation mix of the training distribution. A model is then trained to do well over a wide range of potential test distributions. Some early work in dialogue (Bod, 1999) and parsing (Plank and Sima'an, 2008) adopted a similar idea of subdomains, however, with manually identified subpopulations. This bears some similarity to early work on leveraging general background knowledge (embeddings trained on general data) for domain adaptation (Plank and Moschitti, 2013;Nguyen and Grishman, 2015;Li et al., 2018a), and also relates to recent work on pre-training (Section 5.3). An alternative and complementary interesting line of research is to predict test set performance for new data varieties (Ravi et al., 2008;Van Asch and Daelemans, 2010;Elsahar and Gallé, 2019;Xia et al., 2020).
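The worst-case objective behind distributionally robust optimization can be illustrated with a minimal sketch. It is not the exact formulation of Oren et al. (2019): it assumes the training subpopulations are known and labeled, and the group names and per-example losses below are hypothetical. The robust loss is simply the largest average loss over the subpopulations, in contrast to the standard average over all examples.

```python
from collections import defaultdict

def group_losses(losses, groups):
    """Average the per-example losses within each subpopulation (group)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, group in zip(losses, groups):
        sums[group] += loss
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

def worst_case_loss(losses, groups):
    """Robust objective: the loss of the worst-performing subpopulation."""
    return max(group_losses(losses, groups).values())

def average_loss(losses):
    """Standard empirical risk, shown for comparison."""
    return sum(losses) / len(losses)

# Hypothetical per-example losses of a model trained mostly on newswire.
losses = [0.2, 0.3, 0.1, 0.2, 1.5, 1.3]
groups = ["news", "news", "news", "news", "reviews", "reviews"]

print(average_loss(losses))             # dominated by the majority group
print(worst_case_loss(losses, groups))  # exposes the weak minority group
```

Minimizing the second quantity instead of the first forces the model to attend to the subpopulation on which it currently performs worst, which is the intuition behind training for a range of potential test mixtures.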

Model-centric approaches
Model-centric approaches redesign parts of the model: the feature space, the loss function or regularization and the structure of the model. We categorize them into feature-centric and loss-centric methods.

Feature-centric methods
Two lines of work can be found within feature-centric methods: feature augmentation and feature generalization methods. The former use pivots (common shared features) to construct an aligned feature space. The latter use autoencoders to find latent representations that transfer better across domains.
In particular, Ziser and Reichart (2017) propose to combine the strengths of pivot-based methods with autoencoder neural networks in an autoencoder structural correspondence learning (AE-SCL) model. Autoencoders are used to learn latent representations that map non-pivots to pivots, and these encodings are then used to augment the training data. The main drawback of this approach is that the output vector representations of the text are unique and not context-dependent. To solve this problem, a pivot-based language modeling (PBLM) method has been proposed (Ziser and Reichart, 2018a; Ziser and Reichart, 2018b). PBLM effectively combines SCL with a neural language model based on long short-term memory (LSTM) networks which predicts the presence of pivots and non-pivots, thus making representations structure-aware. A weakness of the PBLM approach lies in the large number of pivots needed. To remedy this issue, Ziser and Reichart (2019) adopted a task refinement learning approach using PBLM (called TRL-PBLM), showing gains in both accuracy and stability across hyperparameter choices. The approach is an iterative training process where the network is trained using an increasingly larger number of pivots. Recent hybrid UDA work extends pivots with contextual embeddings (Ben-David et al., 2020), as we discuss in Section 6.
A common issue with the aforementioned methods is that they involve two independent steps: one for representation learning and one for task learning. To tackle this issue, recent studies propose training the two tasks jointly (i.e., pivot prediction and sentiment) (Miller, 2019) and learning pivots automatically via attention (Li et al., 2017), similar to work on automatic non-pivot identification (Li et al., 2018b).
To the best of our knowledge, neural pivot-based UDA approaches have been solely applied to sentiment classification, cf. Table 1. Notably, Ziser and Reichart (2018a) went a step further, and applied neural SCL cross-lingually; the NLP task is still sentiment classification. The effectiveness of pivot-based methods in neural models remains to be tested. Early non-neural work applied SCL to structure prediction problems with mixed results, i.e., POS (Blitzer et al., 2006) and parsing (Plank, 2011).
Autoencoder-based DA Early neural approaches for UDA have been based on autoencoders. Autoencoders are neural networks that are employed to learn latent representations from raw data in an unsupervised fashion by learning with an input reconstruction loss. Motivated by denoising autoencoders (Vincent et al., 2008), the first work in this line is by Glorot et al. (2011), who introduced the stacked denoising autoencoder (SDA) for domain adaptation. In essence, an SDA automatically learns a robust and unified feature representation for all domains by stacking multiple layers, and artificially corrupts the inputs with Gaussian noise that the decoder needs to reconstruct. However, SDAs showed issues in speed and scalability to high-dimensional data. To mitigate these limitations, a more efficient marginalized stacked denoising autoencoder (MSDA) that marginalizes the noise was proposed (Chen et al., 2012). MSDAs have been further extended by Yang and Eisenstein (2014) with marginalized structured dropout, and by Clinchant et al. (2016), who improved the regularization of MSDAs following the insights from the domain adversarial training of neural networks (Ganin and Lempitsky, 2015; Ganin et al., 2016) (described in Section 4.2). The main drawback of autoencoder approaches is that the induced representations do not make use of any linguistic information.

Loss-centric methods
Loss-centric approaches can be divided into methods which employ domain adversaries, and instance-level reweighting methods. We outline these two strands of work in the following.
Domain adversaries The most widespread methods for neural UDA are based on the use of domain adversaries (Ganin and Lempitsky, 2015; Ganin et al., 2016). Inspired by the way generative adversarial networks (GANs) (Goodfellow et al., 2014) minimize the discrepancy between training and synthetic data distributions, domain adversarial training aims at learning latent feature representations that reduce the discrepancy between the source and target distributions. The intuition behind these methods is grounded in the theory of domain adaptation (Ben-David et al., 2010), which argues that cross-domain generalization can be achieved by means of feature representations for which the origin (domain) of the input example cannot be identified.
The seminal approach in this category is the domain-adversarial neural network (DANN) (Ganin and Lempitsky, 2015; Ganin et al., 2016). The aim is to estimate an accurate predictor for the task while maximizing the confusion of an auxiliary domain classifier in distinguishing features from the source or the target domain. To learn domain-invariant feature representations, DANNs backpropagate the domain classifier's loss through a gradient reversal layer, which pushes the feature distributions of the source and target domains to become similar. The strength of this approach is in its scalability and generality; however, DANNs only model feature representations that are shared across both domains, and suffer from a vanishing gradient problem when the domain classifier accurately discriminates source and target representations (Shen et al., 2018). Wasserstein methods (Martin ) train more stably than gradient reversal layers. Instead of learning a classifier to distinguish domains, they attempt to reduce the approximated Wasserstein distance (also known as Earth Mover's Distance). A recent study on question pair classification shows that the two adversarial methods reach similar performance, but Wasserstein enables more stable training (Shah et al., 2018).
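The effect of a gradient reversal layer can be made concrete with a minimal scalar sketch. The one-parameter feature extractor `w`, the one-parameter domain classifier `v`, and the scaling factor `lam` are illustrative assumptions, not the architecture of Ganin et al. (2016); gradients are derived by hand so that the sign flip is visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_adversarial_grads(w, v, x, d, lam):
    """Gradients of the domain loss for one example with gradient reversal.

    Feature extractor: f = w * x. Domain classifier: p = sigmoid(v * f).
    d is the domain label (0 = source, 1 = target); lam scales the reversal.
    """
    f = w * x
    p = sigmoid(v * f)
    # Gradient of binary cross-entropy w.r.t. the domain logit is (p - d).
    grad_logit = p - d
    grad_v = grad_logit * f   # domain classifier receives the ordinary gradient
    grad_f = grad_logit * v   # gradient flowing back into the feature
    # Gradient reversal: identity in the forward pass, multiplication of the
    # incoming gradient by -lam in the backward pass.
    grad_w = (-lam * grad_f) * x
    return grad_w, grad_v

# One target-domain example (d = 1) with hypothetical parameter values.
grad_w, grad_v = domain_adversarial_grads(w=0.5, v=1.0, x=2.0, d=1, lam=1.0)
print(grad_w, grad_v)
```

With gradient descent, the update on `v` lowers the domain loss while the update on `w` raises it, so the feature extractor is pushed towards features that the domain classifier cannot separate.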
To model features that also belong to either the source or target domain, domain separation networks (DSNs) (Bousmalis et al., 2016) have been proposed. DSNs separate latent representations into i) separate private encoders (i.e., one for each domain) and ii) a shared encoder (in charge of reconstructing the input instance using these representations). This bears similarities to a traditional supervised method (Daumé III, 2007). The main drawback of DSNs is that domain-specific representations are solely used in the decoder, leaving the classifier to be trained on the domain-invariant representations only.
DSNs have seen notable success in Computer Vision (CV) (Bousmalis et al., 2016). In NLP, Shi et al. (2018) propose genre separation networks (GSNs) as a variant of DSNs, introducing a novel reconstruction component that leverages both shared and private feature representations in the learning process. As noted also by Han and Eisenstein (2019), a downside of adversarial methods is that they require careful balancing between objectives (Kim et al., 2017; Alam et al., 2018a) to avoid instability during learning.
Reweighting This family of methods performs instance-level adaptation. The core idea of instance weighting (also known as importance weighting) is to assign a weight to each training instance proportional to its similarity to the target domain (Jiang and Zhai, 2007). We can see instance weighting as an alternative to domain adversaries. While domain adversaries distinguish the domains to learn domain-invariant representations in a joint model, instance weighting decouples domain detection from task learning and uses it for a priori weight estimation of each instance.
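One common way to obtain such weights, sketched here under stated assumptions, is to convert the output of a domain classifier into a density ratio via Bayes' rule. The classifier probabilities and losses below are hypothetical, the probabilities are assumed to be strictly below 1, and normalizing by the sum of weights is a design choice rather than a requirement.

```python
def importance_weights(p_target, n_source, n_target):
    """Turn domain-classifier outputs P(target | x) into instance weights.

    By Bayes' rule, P_T(x) / P_S(x) = [P(T|x) / P(S|x)] * [n_source / n_target]
    when the classifier was trained on the pooled source and target examples.
    """
    prior_ratio = n_source / n_target
    return [prior_ratio * p / (1.0 - p) for p in p_target]

def weighted_loss(losses, weights):
    """Self-normalized weighted average of per-example source losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Hypothetical classifier outputs for three source instances.
p_target = [0.8, 0.5, 0.2]
weights = importance_weights(p_target, n_source=100, n_target=100)
source_losses = [1.0, 2.0, 3.0]
print(weights)
print(weighted_loss(source_losses, weights))
```

Source instances that look like target data (high `P(target | x)`) contribute more to the training loss, while instances typical only of the source are down-weighted.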
Methods that explicitly reweight the loss based on domain discrepancy information include maximum mean discrepancy (MMD) (Gretton et al., 2007) and its more efficient version called kernel mean matching (KMM) (Gretton et al., 2009). KMM reweights the training instances such that the means of the training and test points in a reproducing kernel Hilbert space are close to each other. Jiang and Zhai (2007) introduced instance weighting in NLP and proposed to learn weights by first training domain classifiers. The effectiveness of the method in neural setups remains to be seen. An early study reports non-significant improvements for POS tagging (Plank et al., 2014b).
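As an illustration of the discrepancy that such methods build on, the following minimal sketch computes a biased estimate of the squared MMD between two samples. The RBF kernel, its bandwidth `gamma`, and the toy two-dimensional vectors are assumptions made for the example.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy feature vectors: one target sample close to the source, one far away.
source = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
near_target = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.0]]
far_target = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.2]]

print(mmd_squared(source, near_target))
print(mmd_squared(source, far_target))
```

The estimate is close to zero when the two samples are similar and grows as their kernel mean embeddings move apart, which is what makes it usable as a domain discrepancy term.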

Data-centric methods
Recently, data-centric approaches are on the rise, due to the rapid growth of data and the gain in popularity of pre-training methods. We summarize data-centric strands next, which differ in whether they use pseudo-labeling, select relevant data, or use large unlabeled data or auxiliary tasks for model pre-training.

Pseudo-labeling
The main idea of pseudo-labeling is to apply a trained classifier to predict labels on unlabeled instances, which are then treated as 'pseudo' gold labels. Pseudo-labeling applies semi-supervised methods (Abney, 2007; Zhu and Goldberg, 2009) such as bootstrapping methods like self-training, co-training and tri-training or methods such as temporal ensembling (Charniak, 1997; McClosky et al., 2006; Blum and Mitchell, 1998; Steedman et al., 2003; Zhou and Li, 2005; Søgaard and Rishøj, 2010; Saito et al., 2017; Laine and Aila, 2016) by using either the same model, a teacher model, or multiple bootstrap models which may include slower but more accurate hand-crafted models (Petrov et al., 2010) to guide pseudo-labeling. Most pseudo-labeling work dates back to traditional non-neural learning methods. Bootstrapping methods for domain adaptation are well-studied in parsing (McClosky et al., 2006; Reichart and Rappoport, 2007; Yu et al., 2015). They include models trained on other grammar formalisms to improve dependency parsing on Twitter (Foster et al., 2011). Recently, this line of classics has been revisited (Ruder and Plank, 2018; Rotman and Reichart, 2019; Lim et al., 2020). For example, classic methods such as tri-training constitute a strong baseline for domain shift in neural times (Ruder and Plank, 2018). Pseudo-labeling has recently been studied for parsing with contextualized word representations (Rotman and Reichart, 2019; Lim et al., 2020), and recent work proposes adaptive ensembling (Desai et al., 2019) as an extension of temporal ensembling (see hybrid methods in Section 6).
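The simplest of these bootstrapping schemes, self-training, can be sketched as follows. The nearest-centroid classifier over one-dimensional inputs, the margin-based confidence score, and the threshold are all illustrative assumptions chosen to keep the example self-contained; they are not taken from any of the cited systems.

```python
def fit_centroids(xs, ys):
    """Toy classifier: one centroid per class for 1-D inputs."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict_with_confidence(centroids, x):
    """Return (label, confidence) using a relative distance margin.

    Assumes at least two classes.
    """
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    d_best = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    confidence = 1.0 - d_best / (d_best + d_next) if d_best + d_next > 0 else 0.5
    return ranked[0], confidence

def self_train(xs, ys, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly add confidently pseudo-labeled target instances and retrain."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break  # nothing passes the threshold, so stop bootstrapping
        xs += [x for x, _ in confident]
        ys += [label for _, label in confident]
        pool = [x for x, _ in remaining]
    return fit_centroids(xs, ys), pool

# Labeled source data and unlabeled target data (hypothetical values).
source_x, source_y = [0.0, 1.0, 9.0, 10.0], ["neg", "neg", "pos", "pos"]
target_x = [2.0, 3.0, 7.0, 8.0, 5.2]
model, leftover = self_train(source_x, source_y, target_x)
print(model, leftover)
```

In this toy run the class centroids drift towards the target data after one round, while low-confidence target instances remain unlabeled; co-training and tri-training replace the single model with several models that label data for each other.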

Data selection
A relatively unexplored area is data selection for adaptation, which is gaining traction again in light of large pre-trained models (which data should they be trained on?) and the related problem of cross-lingual learning (what is/are the best source language(s) to transfer from?). Data selection aims to select the best matching data for a new domain, typically by using perplexity (Moore and Lewis, 2010) or using domain similarity measures such as Jensen-Shannon divergence over term or topic distributions (Plank and van Noord, 2011). This has mostly been studied for MT (Moore and Lewis, 2010; Axelrod et al., 2011; van der Wees et al., 2017; Aharoni and Goldberg, 2020), but also for parsing (Plank and van Noord, 2011; Ruder and Plank, 2017) and sentiment analysis (Remus, 2012), though for supervised domain adaptation setups only. For parsing and sentiment analysis, the simple Jensen-Shannon divergence on term distribution constitutes a strong baseline (Plank, 2011; Ruder and Plank, 2017). Within MT, van der Wees et al. (2017) propose a dynamic data selection approach which changes the subset of data in each epoch. Data selection is gaining attention in light of the abundance of data. Recent work investigates data representation and cosine similarity for MT data selection (Aharoni and Goldberg, 2020). Similarly, distance metrics have recently been used for multi-source domain adaptation of sentiment classification models using a bandit-based approach (Guo et al., 2020). For morphosyntactic cross-lingual work, simple overlap metrics are indicative (Üstün et al., 2019; Lin et al., 2019). Another line of work explores whether tailoring large pre-trained models to the domain of a target task is still beneficial, and uses data selection to avoid costly expert selection. Gururangan et al. (2020) propose two multi-phase pre-training methods (discussed further below) with promising results on text classification tasks.
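The Jensen-Shannon baseline over term distributions can be sketched in a few lines. The whitespace tokenization, the toy corpora, and the function names are assumptions made for illustration; real setups would use larger vocabularies and possibly topic rather than term distributions.

```python
import math
from collections import Counter

def term_distribution(tokens, vocab):
    """Relative term frequencies of a corpus over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    return [counts[t] / total for t in vocab]

def kl(p, q):
    """Kullback-Leibler divergence in bits, skipping zero-probability terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_sources(target_tokens, candidate_corpora):
    """Rank candidate source corpora by similarity to the target (low JS first)."""
    vocab = sorted(set(target_tokens).union(*[set(c) for c in candidate_corpora.values()]))
    target = term_distribution(target_tokens, vocab)
    scores = {name: js_divergence(target, term_distribution(tokens, vocab))
              for name, tokens in candidate_corpora.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical target text and two candidate source corpora.
target_tokens = "the patient received a dose of the drug".split()
candidates = {
    "clinical": "the drug dose was given to the patient".split(),
    "news": "the market fell as the stocks dropped".split(),
}
ranking = rank_sources(target_tokens, candidates)
print(ranking)
```

The corpus with the lowest divergence is selected as the best-matching source; the same ranking can be applied at the level of individual articles or sentences.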

Pre-training: Is bigger better? Are domains (or varieties) still relevant?
Large pre-trained models have become ubiquitous in NLP (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019). Fine-tuning a transformer-based model with a small amount of labeled data often reaches high performance across NLP tasks and has become a de facto standard. Fine-tuning means starting from the pre-trained model weights and training a new task-specific layer on supervised data. A natural question which arises is how universal such large models are. Is bigger better? And are domains (or varieties) still relevant? We return to these questions after depicting pre-training strategies. We delineate:
1. Pre-training: pre-training alone (e.g., multilingual BERT; language-specific BERTs from scratch);
2. Adaptive pre-training: this encompasses pre-training, followed by secondary stages of pre-training on unlabeled data or on labeled data from intermediate higher-resource auxiliary tasks:
(a) Multi-phase pre-training: two or more phases of secondary pre-training, from broad-coverage to domain-/task-adaptive pre-training (e.g., BioBERT, AdaptaBERT, DAPT, TAPT). They differ by the source of unlabeled data: broad-domain, domain-specific, or task-specific;
(b) Auxiliary-task pre-training: pre-training, followed by (possibly multiple stages of) auxiliary-task pre-training (e.g., supplementary training on intermediate labeled-data tasks, STILTs).
Pre-training (option 1) can be seen as straightforward adaptation, analogous to zero-shot in crosslingual learning. The key idea is to train encoders with self-supervised objectives like (masked) language model and related unsupervised objectives (Peters et al., 2018;Devlin et al., 2019;Beltagy et al., 2019).
In light of a domain shift, adaptive pre-training is beneficial. In one instantiation, contextualized embeddings are adapted to text from the target domain by masked language modeling, as introduced by Han and Eisenstein (2019). More broadly, we distinguish two variants of adaptive pre-training. They differ in whether unlabeled data or some form of auxiliary labeled data (or intermediate task data) is used. These variants can also be combined, and fine-tuning applies to all setups, if data is available. The key idea of multi-phase pre-training (option 2a) is to use secondary-stage unsupervised pre-training, such as broad-coverage domain-specific BERT variants (e.g., BioBERT). Gururangan et al. (2020) propose domain-adaptive pre-training (DAPT) from a broader corpus, compared to Han and Eisenstein (2019), and task-specific pre-training (TAPT), which uses unlabeled data closer and closer to the task distribution. As these studies show, domain-relevant data is important for pre-training (Han and Eisenstein, 2019; Gururangan et al., 2020) in both high- and low-resource setups. Similar adaptive pre-training work has been shown to be effective for dependency parsing. This suggests that there exists a spectrum of domains of varying granularity, confirming ideas around domain similarity (Plank, 2011; Baldwin et al., 2013). Domains (varieties) do still matter in today's models.
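The data side of adapting embeddings by masked language modeling can be illustrated with a small sketch: target-domain tokens are hidden at random and the encoder is trained to recover them. The masking probability, the `[MASK]` symbol, and the example sentence are assumptions, and the procedure is simplified (BERT's original recipe additionally replaces some selected tokens with random tokens or leaves them unchanged).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; the model must predict the originals.

    Returns (inputs, labels), where labels is None at unmasked positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

# Hypothetical unlabeled sentence from the target domain.
sentence = "the patient received a dose of the drug".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Because the objective needs no annotation, the same routine can be applied first to a broad domain corpus and then to the unlabeled text of the task itself, mirroring the multi-phase setup described above.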
An alternative line of work (option 2b) is auxiliary-task pre-training, which uses labeled auxiliary tasks either via multi-task learning (MTL) (Peng and Dredze, 2017) or intermediate-task transfer (Phang et al., 2018; Phang et al., 2020). The latter proposed supplementary training on intermediate labeled-data tasks for transfer (STILT) (Phang et al., 2018), and recently adapted this idea to cross-lingual learning, where English is used as the intermediate task for zero-shot transfer (Phang et al., 2020).
The choice of data used for pre-training (or the auxiliary tasks) does matter. Current transformer models are trained on either large general data like BookCorpus and Wikipedia in BERT (Devlin et al., 2019) or target-specific samples, like papers from Semantic Scholar in SciBERT (Beltagy et al., 2019), and PubMed abstracts and PMC full-text articles in BioBERT. What denotes relevant data is an open question. Today, it is either general background knowledge, domain-specific target data, or a combination thereof, possibly via auxiliary tasks or intermediate training stages. Most of these have been carefully selected manually, raising interesting connections to data selection (Section 5.2) and finding better curricula (Tsvetkov et al., 2016) to learn under domain shift (Ruder and Plank, 2017).
While large pre-trained models have been shown to work well, many questions and challenges remain. Recent work has shown that these models degrade on out-of-domain data, that maximum likelihood training makes them overconfident (Oren et al., 2019), and that calibration in particular is important for out-of-domain generalization (Hendrycks et al., 2020). An acknowledged issue with fine-tuning is the brittleness of the process (Phang et al., 2018; Dodge et al., 2020). Even with the same hyperparameters, distinct runs can lead to drastically different results, and training data order and seed choice have a considerable impact (Dodge et al., 2020). Deeper investigations into what such models capture and how they can be robustly trained in light of known test distributions or out-of-domain conditions are interesting issues.

Hybrid approaches
Work at the intersection of data-centric and model-centric methods can be plentiful. It currently includes combining semi-supervised objectives with an adversarial loss (Lim et al., 2020; Alam et al., 2018b), combining pivot-based approaches with pseudo-labeling (Cui and Bollegala, 2019) and very recently with contextualized word embeddings (Ben-David et al., 2020), and combining multi-task approaches with domain shift (Jia et al., 2019), multi-task learning with pseudo-labeling (multi-task tri-training) (Ruder and Plank, 2018), and adaptive ensembling (Desai et al., 2019), which uses a student-teacher network with a consistency-based self-ensembling loss and a temporal curriculum. Desai et al. (2019) apply adaptive ensembling to study temporal and topic drift in political data classification.

Challenges and future directions
While recent work has made important progress in neural UDA, our survey reveals i) an overrepresentation and bias of work on sentiment analysis (cf. column bias in Table 1) and ii) a general lack of testing across tasks (row sparsity in Table 1) and multiple adaptation methods.
Comprehensive UDA benchmarks Concretely, we recommend a) to create new benchmarks for UDA with multiple tasks and of increasing complexity, setups beyond 1:1 adaptation, and datasets which document known variety facets of the data (Bender and Friedman, 2018). This will help to learn about the known and unknown (Section 3) as 'variety' (domain) matters; b) to release unlabeled data from the broader distribution from which annotated data was sampled, in line with Gururangan et al. (2020); this allows studying diachronic effects, as labeled evaluation data lacks diversity in terms of topics and time (Desai et al., 2019;Derczynski et al., 2016); and c) to release unaggregated, multiple annotations to study divergences in annotations (Plank et al., 2014a).
Back to the roots and how knowledge transfers Revisiting classics in neural times is beneficial, as shown for example in recent work which brings back SCL and pseudo-labeling methods (see Table 1), but much is left to see how these methods generalize. This can be linked to the question on what representations capture (Belinkov and Glass, 2019) and how knowledge transfers (Rethmeier et al., 2020).
X scarcity Even unlabeled data can be scarce (X scarcity), particularly in highly-specialized language varieties (e.g., clinical data) (Rethmeier and Plank, 2019). This is often due to data sharing restrictions. In some cases, only a trained source model could be available instead of raw or labeled texts (Laparra et al., 2020). Together with the quest for more efficient learning methods, the general question of how to adapt in light of X scarcity or absence becomes important.

Conclusion
In this survey, we review strands of unsupervised domain adaptation, summarized into model-centric, data-centric, and hybrid methods, including trends in pre-training. We also revisit the notion of domain and suggest using the term variety instead, to better capture the multitude of dimensions of variation. Our survey identifies a limited focus on sentiment benchmarks and single-task evaluation for UDA. Lastly, we outline future directions, linking to the broader challenges related to learning beyond 1:1 scenarios and out-of-distribution generalization. This also calls for new directions on benchmarks and learning under scarce data.