UDALM: Unsupervised Domain Adaptation through Language Modeling

In this work we explore Unsupervised Domain Adaptation (UDA) of pretrained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure that uses a mixed classification and Masked Language Model loss and can adapt to the target domain distribution in a robust and sample-efficient manner. Our experiments show that the performance of models trained with the mixed loss scales with the amount of available target data, and that the mixed loss can be effectively used as a stopping criterion during UDA training. Furthermore, we discuss the relationship between A-distance and the target error and explore some limitations of the Domain Adversarial Training approach. Our method is evaluated on twelve domain pairs of the Amazon Reviews Sentiment dataset, yielding 91.74% accuracy, which is a 1.11% absolute improvement over the state-of-the-art.


Introduction
Deep architectures have achieved state-of-the-art results in a variety of machine learning tasks. However, real-world deployments of machine learning systems often operate under domain shift, which leads to performance degradation. This introduces the need for adaptation techniques, where a model is trained with data from a specific domain and can then be optimized for use in new settings. Efficient techniques for model re-usability can lead to faster and cheaper development of machine learning applications and facilitate their wider adoption. Techniques for Unsupervised Domain Adaptation (UDA) in particular can have high real-world impact, because they do not rely on expensive and time-consuming annotation processes to collect labeled data for domain-specific supervised training.
UDA approaches in the literature can be grouped in three major categories, namely pseudo-labeling techniques (e.g. Yarowsky, 1995; Zhou and Li, 2005), domain adversarial training (e.g. Ganin et al., 2016) and pivot-based approaches (e.g. Blitzer et al., 2006; Pan et al., 2010). Pseudo-labeling approaches use a model trained on the source labeled data to produce pseudo-labels for unlabeled target data and then train a model for the target domain in a supervised manner. Domain adversarial training aims to learn a domain-independent mapping for input samples by adding an adversarial cost during model training that minimizes the distance between the source and target domain distributions. Pivot-based approaches aim to select domain-invariant features (pivots) and use them as a basis for cross-domain mapping. This work does not fall under any of these categories; rather, we aim to optimize the fine-tuning procedure of pretrained language models (LMs) for learning under domain shift.
Transfer learning from language models pretrained on massive corpora (Howard and Ruder, 2018; Devlin et al., 2019; Brown et al., 2020) has yielded significant improvements across a wide variety of NLP tasks, even when small amounts of data are used for fine-tuning. Fine-tuning a pretrained model is a straightforward framework for adaptation to target tasks and new domains when labeled data are available. However, optimizing the fine-tuning process in UDA scenarios, where only labeled out-of-domain and unlabeled in-domain data are available, is challenging.
In this work, we propose UDALM, a fine-tuning method for BERT (Devlin et al., 2019) in order to address the UDA problem. Our method is based on simultaneously learning the task from labeled data in the source distribution, while adapting to the language in the target distribution using multitask learning. The key idea of our method is that by simultaneously minimizing a task-specific loss on the source data and a language modeling loss on the target data during fine-tuning, the model will be able to adapt to the language of the target domain, while learning the supervised task from the available labeled data.
Our key contributions are: (a) we introduce UDALM, a novel, simple and robust unsupervised domain adaptation procedure for downstream BERT models based on multitask learning, (b) we achieve state-of-the-art results for the Amazon reviews benchmark dataset, surpassing more complicated approaches and (c) we explore how A-distance and the target error are related and conclude with some remarks on domain adversarial training, based on theoretical concepts and our empirical observations. Our code and models are publicly available 1.

Related Work
Traditionally, UDA has been performed using pseudo-labeling approaches. Pseudo-labeling techniques are semi-supervised algorithms that either use the same model (self-training) (Yarowsky, 1995; McClosky et al., 2006; Abney, 2007) or multiple ensembles of models (tri-training) (Zhou and Li, 2005; Søgaard, 2010) in order to produce pseudo-labels for the target unlabeled data. Saito et al. (2017) proposed an asymmetric tri-training approach. Ruder and Plank (2018) introduced a multi-task tri-training method. Rotman and Reichart (2019) and Lim et al. (2020) study pseudo-labeling with contextualized word representations. Ye et al. (2020) combine self-training with XLM-R (Conneau et al., 2020) to reduce the produced label noise and propose CFd, class-aware feature self-distillation.
Another line of UDA research includes pivot-based methods, focusing on extracting cross-domain features. Structural Correspondence Learning (SCL) (Blitzer et al., 2006) and Spectral Feature Alignment (Pan et al., 2010) aim to find domain-invariant features (pivots) to learn a mapping between two domain distributions. Ziser and Reichart (2017, 2019) combine SCL with neural network architectures and language modeling. Miller (2019) proposes to jointly learn the task and pivots. Li et al. (2018b) learn pivots with hierarchical attention networks. Pivot-based methods have also been used in conjunction with BERT (Ben-David et al., 2020).
Domain adversarial training is a dominant approach for UDA (Ramponi and Plank, 2020), inspired by the theory for learning from different domains introduced in Ben-David et al. (2007, 2010). Ganin et al. (2016) and Ganin and Lempitsky (2015) propose to learn a task while not being able to distinguish whether samples come from the source or the target distribution, through use of an adversarial cost. This approach has been adopted for a diverse set of problems, e.g. sentiment analysis, tweet classification and universal dependency parsing (Li et al., 2018a; Alam et al., 2018; Sato et al., 2017). Du et al. (2020) pose domain adversarial training in the context of BERT models. Zhao et al. (2018) propose multi-source domain adversarial networks. Guo et al. (2018) propose a mixture-of-experts approach for multi-source UDA. Guo et al. (2020) explore distance measures as additional losses and use them to construct a dynamic multi-armed bandit controller for the source domains. Shen et al. (2018) learn domain-invariant features via Wasserstein distance. Bousmalis et al. (2016) introduce domain separation networks with private and shared encoders.
1 https://github.com/ckarouzos/slp_daptmlm
Unsupervised pretraining on domain-specific corpora can be an effective adaptation process. For example, BioBERT and SciBERT (Beltagy et al., 2019) are specialized BERT variants, where pretraining is extended on large amounts of biomedical and scientific corpora respectively. Sun et al. (2019) propose continuing the pretraining of BERT with target domain data and multitask learning using relevant tasks for BERT fine-tuning. Other work introduces a review reading comprehension task and a post-training approach for BERT with an auxiliary loss on a question-answering task. Continuing pretraining in multiple phases, from general to domain-specific (DAPT) and task-specific data (TAPT), further improves performance of pretrained language models, as shown by Gururangan et al. (2020). Han and Eisenstein (2019) propose AdaptaBERT, which includes a second phase of unsupervised pretraining, in order to use BERT in an unsupervised domain adaptation context.
Recent works have highlighted the merits of using Language Modeling as an auxiliary task during fine-tuning. Chronopoulou et al. (2019) use an auxiliary LM loss to avoid catastrophic forgetting in transfer learning and Jia et al. (2019) adopt this approach for cross-domain named-entity recognition. We draw inspiration from these approaches and utilize auxiliary Language Modeling for UDA.

Problem Definition
Let X be the input space and Y the set of labels. For binary classification tasks Y = {0, 1}. In domain adaptation there are two different distributions over X × Y, called the source domain $D_S$ and the target domain $D_T$. In the unsupervised setting labels are provided for samples drawn from $D_S$, while samples drawn from $D_T$ are unlabeled. The goal is to train a model that performs well on samples drawn from the target distribution $D_T$. This is summarized in Eq. 1:

$$S = \{(x_i, y_i)\}_{i=1}^{n} \sim (D_S)^n, \qquad T = \{x_i\}_{i=n+1}^{n+m} \sim (D_T^X)^m \tag{1}$$

where $D_T^X$ is the marginal distribution of $D_T$ over X, n is the number of samples from the source domain and m is the number of samples from the target domain.

Figure 1: Starting from a model that is pretrained on general corpora (Fig. 1a), we keep pretraining it on target domain data using the masked language modeling task (Fig. 1b). In the final fine-tuning step (Fig. 1c) we update the model weights using both a classification loss on the labeled source data and a Masked Language Modeling loss on the unlabeled target data.

Proposed Method
In Fig. 1a we see the BERT general pretraining phase. BERT (Devlin et al., 2019) is based on the Transformer architecture (Vaswani et al., 2017). During BERT pretraining, input tokens are randomly selected to be masked. BERT is trained using the Masked Language Modeling (MLM) objective, which consists of predicting the most probable tokens for the masked positions. Additionally it uses a Next Sentence Prediction (NSP) loss, which classifies whether the pair of input sentences are continuous or not. If a labeled dataset is available, a pretrained BERT model can be fine-tuned for the downstream task in a supervised manner with the addition of an output layer.
In Fig. 1b we initialize a model using the weights of a generally pretrained BERT and continue pretraining on an unsupervised set of in-domain data, in order to adapt to the target domain. This step does not require use of supervised data, since we use the MLM objective.
For the final fine-tuning step, shown in Fig. 1c we perform supervised fine-tuning on the source data, while we keep the MLM objective on the target data as an auxiliary task. Following standard practice, we use the [CLS] token representation for classification. The classifier consists of a single feed-forward layer.
During this procedure the model learns the task through the classification objective using the labeled source domain samples, and simultaneously it adapts to the target domain data through the MLM objective. The model is trained on the source domain labeled data for the classification task and target domain unlabeled data for the masked language modeling task. We mask only the target domain data. During training we interleave source and target data and feed them to the BERT encoder. Features extracted from the source data are then used for classification, while target features are used for Masked Language Modeling.
The mixed loss used for the fine-tuning step is the weighted sum of the classification loss $L_{CLF}$ and the auxiliary MLM loss $L_{MLM}$. $L_{CLF}$ is a cross-entropy loss, calculated on labeled examples from the source domain, while $L_{MLM}$ is used to predict masked tokens for unlabeled examples from the target domain. We train the model over mixed batches that include both source and target data, used for the respective tasks. The mixed loss is presented in Eq. 2:

$$L(s, t) = \lambda L_{CLF}(s) + (1 - \lambda) L_{MLM}(t) \tag{2}$$

We process n labeled source samples $s \sim D_S$ and m unlabeled target samples $t \sim D_T$ in a batch. The weighting factor λ is selected as the ratio of labeled source data over the sum of labeled source and unlabeled target data, as stated in Eq. 3:

$$\lambda = \frac{n}{n + m} \tag{3}$$
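As a rough illustration of the weighting described above, the following sketch computes λ from the source and target counts and combines two scalar losses. It assumes the mixed loss is the convex combination λ·L_CLF + (1 − λ)·L_MLM; the function names and the loss values are hypothetical and are not taken from the released code.

```python
def mixing_weight(n_source: int, m_target: int) -> float:
    """Weighting factor: share of labeled source data in the batch."""
    if n_source < 0 or m_target < 0 or n_source + m_target == 0:
        raise ValueError("need non-negative counts and at least one sample")
    return n_source / (n_source + m_target)


def mixed_loss(clf_loss: float, mlm_loss: float, n_source: int, m_target: int) -> float:
    """Assumed form: lambda * L_CLF + (1 - lambda) * L_MLM."""
    lam = mixing_weight(n_source, m_target)
    return lam * clf_loss + (1.0 - lam) * mlm_loss


# n = 1 source sub-batch and m = 8 target sub-batches, as in the
# implementation details; the two loss values are made up.
lam = mixing_weight(1, 8)
total = mixed_loss(clf_loss=0.30, mlm_loss=2.10, n_source=1, m_target=8)
```

Because all sub-batches have the same size, counting sub-batches or counting samples gives the same ratio. With these counts most of the weight falls on the MLM term.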

Dataset
We evaluate UDALM on the Amazon reviews multi-domain sentiment dataset, a standard benchmark dataset for domain adaptation. Reviews with one or two stars are labeled as negative, while reviews with four or five stars are labeled as positive. The dataset contains reviews on four product domains: Books (B), DVDs (D), Electronics (E) and Kitchen appliances (K), yielding 12 adaptation scenarios of source-target domain pairs. Balanced sets of 2000 labeled reviews are available for each domain. We use 20000 randomly selected unlabeled reviews for each of (B), (D) and (E). For (K), 17805 unlabeled reviews are available. For each of the 12 adaptation scenarios we use 20% of both labeled source and unlabeled target data for validation, while labeled target data are used exclusively for testing and are not seen during training or validation.
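The 12 scenarios are the ordered source-target pairs over the four domains. A minimal sketch of that enumeration and of the star-to-label rule follows; the names `SCENARIOS` and `star_to_label` are hypothetical, and returning `None` for three-star reviews is an assumption, since the text does not say how they are handled.

```python
from itertools import permutations
from typing import Optional

DOMAINS = ["B", "D", "E", "K"]  # Books, DVDs, Electronics, Kitchen appliances

# Ordered (source, target) pairs with source != target.
SCENARIOS = [f"{src}->{tgt}" for src, tgt in permutations(DOMAINS, 2)]


def star_to_label(stars: int) -> Optional[int]:
    """Map a star rating to a binary sentiment label."""
    if stars in (1, 2):
        return 0  # negative
    if stars in (4, 5):
        return 1  # positive
    return None  # assumption: ratings outside the stated rule get no label
```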

Implementation Details
We use BERT BASE (uncased) as the Language Model on which we apply domain pretraining.
The BERT BASE original English model is a 12-layer, 768-hidden, 12-heads, 110M parameter transformer architecture, trained on the BookCorpus with 800M words and a version of the English Wikipedia with 2500M words. We convert source and target sentences to WordPieces (Wu et al., 2016). For target sentences we randomly mask 15% of WordPiece tokens, as in Devlin et al. (2019). If a token in a specific position is selected to be masked, 80% of the time it is replaced with a [MASK] token, 10% of the time with a random token and 10% of the time it remains unchanged. The maximum sequence length is set to 512 by truncation of inputs. During domain pretraining we train with a batch size of 8 for 3 epochs (2 hours on two GTX-1080Ti cards). During the final fine-tuning step of UDALM we train with batch size 36, consisting of n = 1 source sub-batch of 4 samples and m = 8 target sub-batches of 4 samples each. We update parameters after every 5 accumulated sub-batches. We train for 10 epochs with early stopping on the mixed loss in Eq. 2. For all experiments we use the AdamW optimizer (Loshchilov and Hutter, 2018) with learning rate $10^{-5}$. Each adaptation scenario requires one hour on one GTX-1080Ti. For the domain adversarial experiments we set $\lambda_d = 0.01$ in Eq. 4 and train for 10 epochs. Models are developed with PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2019).
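The 80/10/10 masking rule can be sketched in plain Python as below. This is an illustrative simplification, not the authors' implementation: it selects each token independently with probability 0.15 rather than masking exactly 15% of positions, it ignores special tokens, and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)  # position not selected for prediction
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)  # 10%: keep the token unchanged
    return corrupted, labels


# Toy usage; in the described setup only target domain inputs are masked.
toy_vocab = ["the", "battery", "lasts", "long", "screen"]
toy_tokens = ["the", "battery", "lasts", "long"] * 50
masked, targets = mask_tokens(toy_tokens, toy_vocab, rng=random.Random(0))
```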

Baselines - Compared methods
We select three state-of-the-art methods for comparison. Each of the selected methods represents a different line of UDA research, namely domain adversarial training BERT-DAAT (Du et al., 2020), self-training XLM-R based p+CFd (Ye et al., 2020) and pivot-based R-PERL (Ben-David et al., 2020). We report results for the following settings with BERT models:
Source only (SO): We fine-tune BERT on source domain labeled data, without using target data.
Domain Pretraining (DPT): We use the target domain unlabeled data in order to continue pretraining of BERT with MLM loss (as in Fig. 1b) and then fine-tune the resulting model on the labeled source data.
Domain Adversarial Training (DAT): Starting from the domain-pretrained BERT (Fig. 1b), we then fine-tune the model with domain adversarial training as in Ganin et al. (2016). For a BERT model with parameters θ, with $L_{CLF}$ being a cross-entropy loss for supervised task prediction, $L_{ADV}$ being a cross-entropy loss for domain prediction and $\lambda_d$ being a weighting factor, domain adversarial training consists of the minimization criterion described in Eq. 4:

$$\min_{\theta} \; L_{CLF}(\theta; D_S) - \lambda_d L_{ADV}(\theta; D_S, D_T) \tag{4}$$

Experimental Results

Comparison to state-of-the-art
We present results for all 12 domain adaptation settings in Table 1. We reproduce the domain adversarial training procedure and present results in the DAT BERT column of Table 1. Adversarial training proved to be unstable in our experiments, even after careful tuning of the adversarial loss weighting factor $\lambda_d$. This is evidenced by the high standard deviations in the DAT BERT experiments. We observe that adversarial training does not manage to outperform the source-only baseline. Domain pretraining increases the average accuracy by 0.85% absolute over the source-only baseline. Continuing MLM pretraining on the target domain data leads to better model adaptation, and therefore improved performance, on the target domain. This is consistent with previous work on supervised (Gururangan et al., 2020; Sun et al., 2019) and unsupervised settings (Han and Eisenstein, 2019; Du et al., 2020).
UDALM yields an additional 0.96% absolute improvement in average accuracy over domain pretraining. Keeping the MLM loss during fine-tuning therefore leads to better adaptation and acts as a regularizer that prevents the model from overfitting on the source domain. We also observe smaller standard deviations when using UDALM, which indicates that including the MLM loss during fine-tuning can result in more robust training.
UDALM surpasses all other approaches for unsupervised domain adaptation on the Amazon reviews multi-domain sentiment dataset in terms of macro-average accuracy. Specifically, our method improves on the state-of-the-art by 1.11% absolute.

Sample efficiency
We further investigate the impact of using different amounts of target domain unlabeled data on model performance, to study the sample efficiency of UDALM. We experiment with settings of 500, 2000, 6000, 10000 and 14000 samples, by randomly limiting the number of unlabeled target domain samples. For each setting we conduct three experiments with BERT models: (1) DPT, (2) DAT and (3) UDALM. When no target data are available, all methods are equivalent to a source-only fine-tuned BERT. Again, we do not tune the hyper-parameters for DPT or UDALM. Fig. 2 shows the average accuracy on the twelve adaptation scenarios of the studied dataset. We see that UDALM produces robust performance improvements when we limit the amount of target data, indicating that it can be used in low-resource settings. However, training BERT in a domain adversarial manner shows instabilities. This is further discussed in Section 7.

On the stopping criteria for UDA training
A common problem when performing UDA is the lack of target labeled data that can be used for hyperparameter validation. For example, Ruder and Plank (2018) use a small set of labeled target data for validation, putting the problem in a semi-supervised setting. When training under a domain shift, optimizing model performance on the source data may not result in optimal performance on the target data.
To illustrate this, we examine whether the minimization of the mixed loss can be used as a stopping criterion for UDA training. We compare five stopping criteria: (1) fixed training for 1 epoch, (2) fixed training for 3 epochs, (3) fixed training for 10 epochs, (4) stop when the minimum classification loss is reached for the source data and (5) stop when the minimum mixed loss (Eq. 2) is reached. For (4) and (5) we train for 10 epochs with patience 3. We report the average accuracy of the five stopping criteria over the twelve adaptation scenarios of the Amazon Reviews dataset in Table 2. Training for a fixed number of 10 epochs and stopping at the minimum mixed loss perform best, yielding comparable accuracies of 91.75% and 91.73% respectively. Note that stopping at the minimum source loss ends the fine-tuning process too soon and does not allow the model to learn the target domain effectively. Overall, we observe that the mixed loss can be effectively used for early stopping, regularizing the model and alleviating the need for an extensive search for the optimal number of training steps. This is an indication that the mixed loss could be used for model validation.
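The patience-based criteria (4) and (5) can be sketched as follows. This is a generic early-stopping helper under the stated settings (at most 10 epochs, patience 3), not the authors' code; the function name and both loss curves are made up purely to show how a source-only criterion can stop earlier than the mixed-loss criterion.

```python
def early_stopping_epoch(losses, patience=3):
    """Return (best_epoch, stop_epoch), 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(losses)


# Hypothetical validation curves over 10 epochs, for illustration only.
source_clf_loss = [0.40, 0.31, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40]
mixed_val_loss = [2.10, 1.80, 1.62, 1.50, 1.43, 1.39, 1.36, 1.35, 1.34, 1.335]

stop_on_source = early_stopping_epoch(source_clf_loss)
stop_on_mixed = early_stopping_epoch(mixed_val_loss)
```

In this toy setting the source classification loss bottoms out at epoch 2 and training halts at epoch 5, while the mixed loss keeps decreasing and training runs for all 10 epochs.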

Table 2: Stopping Criterion, Epochs, Av. (average accuracy over the twelve adaptation scenarios).

$$\epsilon_T(h) \le \epsilon_S(h) + \frac{1}{2} d_{H \Delta H}(D_S, D_T) + C \tag{5}$$

where $d_{H \Delta H}(D_S, D_T)$ is the H∆H-divergence (Kifer et al., 2004) between two domains, a measure of distance between domains that can be estimated from finite samples. Eq. 5 defines an upper bound for the expected error $\epsilon_T(h)$ of a hypothesis h on the target domain as the sum of three terms, namely the expected error on the source domain $\epsilon_S(h)$, the divergence between the source and target domain distributions $\frac{1}{2} d_{H \Delta H}(D_S, D_T)$ and the error of the ideal joint hypothesis C. When such a hypothesis exists, this term is considered relatively small and is in practice ignored. The first term bounds the expected error on the target domain by the expected error on the source domain and is expected to be small, due to supervised learning on the source domain. The second term gives a notion of distance between the features extracted from the source and target domains. Intuitively this equation states: "if there exists a hypothesis h that has small error on the source data and the source feature space is close to the target feature space, then this hypothesis will have low error on the target data". Domain Adversarial Training aims to learn features that simultaneously result in low source error and low distance between target and source feature spaces, based on the combined loss in Eq. 4.

A-distance only provides an upper bound for target error
According to Ben-David et al. (2007) the H∆H-divergence can be approximated by the proxy A-distance, which is defined by Eq. 6 given the domain classification error $\epsilon_D$:

$$d_A = 2 (1 - 2 \epsilon_D) \tag{6}$$
We calculate an approximation of the distance between domains. Following prior work (Ganin et al., 2016; Saito et al., 2017) we create an SVM domain classifier. We feed the SVM with BERT's [CLS] token representations, measure the domain classification error, and compute the A-distance as in Eq. 6. We train the domain classifier on 2000 samples from each of the source and target domains. Fig. 3 shows the A-distance along with the source and the target error, averaged over the twelve available domain pairs, using representations obtained from four methods, namely BERT SO, DAT BERT, DPT BERT and UDALM. DAT BERT minimizes the distance between domains. DPT BERT also reduces the A-distance to levels similar to DAT, without using an explicit loss to minimize it. To our surprise we found that, although it achieves the lowest error rate, UDALM does not significantly reduce the proxy A-distance compared to the source-only baseline. Additionally, we observe that the source error is correlated with model performance on the target task, i.e. models with lower source error also have lower target error. UDALM specifically achieves high accuracy on the source task and is able to transfer the task knowledge across domains, while DAT is able to bring domain representations closer, but at the cost of weaker performance on the task at hand.
Overall, we do not observe a correlation between the resulting A-distance and model performance on the target domain. Therefore, lower distance between domains, achieved intentionally or not, is not a necessary condition for good performance on the target domain, and our efforts could be better spent on synergistic learning of the supervised source task and the target domain distribution.
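The proxy A-distance computation described above reduces to a one-line formula once the domain classifier's error is known. The sketch below uses the standard definition 2(1 − 2ε) and leaves out the SVM itself; the function names and the toy predictions are hypothetical.

```python
def error_rate(predicted, actual) -> float:
    """Fraction of samples whose predicted domain differs from the true domain."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def proxy_a_distance(domain_error: float) -> float:
    """Proxy A-distance 2 * (1 - 2 * err) from a domain classifier's error rate."""
    if not 0.0 <= domain_error <= 1.0:
        raise ValueError("domain_error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_error)


# Toy domain predictions (0 = source, 1 = target); a real run would take
# them from a classifier trained on the [CLS] representations.
toy_error = error_rate([0, 1, 1, 0], [0, 1, 0, 0])
toy_distance = proxy_a_distance(toy_error)
```

A chance-level classifier (error 0.5) gives distance 0, meaning the domains are indistinguishable, while a perfect classifier (error 0) gives the maximum distance of 2.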

Limitations of Domain Adversarial Training
Domain adversarial training (Ganin et al., 2016) faces some critical limitations that make the method difficult to reproduce, due to high hyperparameter sensitivity and instability during training. Such limitations have been highlighted by other authors in the UDA literature. For example, according to Shen et al. (2018), when a domain classifier can perfectly distinguish target from source representations, there will be a gradient vanishing problem. Other work states that domain adversarial training is unstable and needs careful hyperparameter tuning. Results over three multi-domain NLP datasets have also been reported, where domain adversarial training in conjunction with BERT under-performs. Ruder and Plank (2018) found that the domain adversarial loss did not help in their experiments on the Amazon reviews dataset.
In our experiments we note that domain-adversarial training results in worse performance than naive source-only training. Furthermore, we experienced the need for extensive tuning of the $\lambda_d$ parameter from Eq. 4 every time the experimental setting changed (e.g. when testing for different amounts of available target data as in Section 6.2). This motivated us to further investigate the behavior of BERT fine-tuned with the adversarial cost. For visual inspection, we perform t-SNE (Maaten and Hinton, 2008) on representations extracted from BERT under four UDA settings in Fig. 4. In Fig. 4a we observe features extracted using BERT with Domain Adversarial Training and we compare them with features from SO BERT (Fig. 4b), DPT BERT (Fig. 4c) and UDALM (Fig. 4d). We observe that domain adversarial training manages to group target and source samples tightly, especially in the case of positive samples. Nevertheless, in the process, DAT introduces significant distortion in the semantic space, which is reflected in model performance.
We can attribute this behavior to two factors. First, the formulation of the adversarial loss in Eq. (4) can lead to trivial solutions. In order to maximize the $L_{ADV}$ term of Eq. (4), the model can just flip all domain labels, namely predict that source samples belong to the target domain and vice versa. In this case the model can still discriminate between domains, and domain-independent representations are not encouraged. We empirically observed this behavior in our experiments with DAT, and only extensive hyperparameter tuning could alleviate this issue. Second, Eq. (4) aims to minimize the upper bound of the target error $\epsilon_T(h)$ in Eq. (5). While this is desirable, reducing the upper bound does not necessarily reduce the bounded term in all scenarios. Furthermore, optimizing the $L_{ADV}(\theta; D_S, D_T)$ term can lead to increasing $L_{CLF}(\theta; D_S)$, and therefore one must find a balance between the two adversarial terms, again through careful hyper-parameter tuning. These issues could potentially be alleviated by including regularization terms that discourage trivial solutions and improve robustness. Therefore, given the lack of guarantees for good performance and the practical considerations, further investigation should be conducted regarding the robustness and reproducibility of DAT for UDA.
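The label-flipping failure mode can be made concrete with a small numeric sketch. It compares the domain cross-entropy of a chance-level domain predictor, which is what domain-invariant features would give, against a predictor that confidently outputs the wrong domain. All names and probabilities are made up for illustration and are not outputs of any trained model.

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean cross-entropy; probs[i] is the predicted probability of label 1."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


domain_labels = [0, 0, 0, 1, 1, 1]  # 0 = source, 1 = target
confused = [0.5] * 6  # chance-level output
flipped = [0.95, 0.95, 0.95, 0.05, 0.05, 0.05]  # confidently wrong domain

loss_confused = binary_cross_entropy(confused, domain_labels)
loss_flipped = binary_cross_entropy(flipped, domain_labels)

# Inverting the flipped predictions recovers the true domain of every sample,
# so the underlying representations are still perfectly separable by domain.
recovered = [1 - round(p) for p in flipped]
```

The flipped predictor attains a larger domain loss than the chance-level one, so maximizing the domain loss alone rewards it, even though its representations carry full domain information.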

Conclusions and Future Work
Unsupervised domain adaptation of pretrained language models is a challenging problem with direct real-world applications. In this work we propose UDALM, a robust, plug-and-play training strategy, which is able to improve performance in the target domain, achieving state-of-the-art results across 12 adaptation settings in the multi-domain Amazon reviews dataset. Our method produces robust results with little hyper-parameter tuning and the proposed mixed loss can be used for model validation, allowing for fast model development. Furthermore, UDALM scales with the amount of available unsupervised data from the target domain, allowing for adaptation in low-resource settings. In our analysis, we discuss the relationship between the A-distance and the target error. We observe that low A-distance may not suggest low target error for high-capacity models. Additionally, we examine limitations of Domain Adversarial Training and highlight that the adversarial cost may lead to distortion of the feature space and negatively impact performance.
In the future we plan to apply UDALM to other tasks under domain-shift, such as sequence classification, question answering and part-of-speech tagging. Furthermore, we plan to extend our method for temporal and style adaptation, by adding more relevant auxiliary tasks that model language shift over time and over different platforms. Finally, we want to investigate the effectiveness of the proposed fine-tuning approach in supervised scenarios.