Domain Adversarial Fine-Tuning as an Effective Regularizer

In Natural Language Processing (NLP), pretrained language models (LMs) transferred to downstream tasks have recently been shown to achieve state-of-the-art results. However, standard fine-tuning can degrade the general-domain representations captured during pretraining. To address this issue, we introduce a new regularization technique, AFTER: domain Adversarial Fine-Tuning as an Effective Regularizer. Specifically, we complement the task-specific loss used during fine-tuning with an adversarial objective. This additional loss term is tied to an adversarial classifier that aims to discriminate between in-domain and out-of-domain text representations. In-domain refers to the labeled dataset of the task at hand, while out-of-domain refers to unlabeled data from a different domain. Intuitively, the adversarial classifier acts as a regularizer which prevents the model from overfitting to the task-specific domain. Empirical results on various natural language understanding tasks show that AFTER leads to improved performance compared to standard fine-tuning.


Introduction
Current research in NLP focuses on transferring knowledge from a language model (LM), pretrained on large general-domain data, to a target task. The LM representations are transferred to the target task either as additional features of a task-specific model (Peters et al., 2018), or by fine-tuning (Howard and Ruder, 2018; Devlin et al., 2019; Yang et al., 2019). Standard fine-tuning involves initializing the target model with the pretrained LM and training it with the target data.
Fine-tuning, however, can lead to catastrophic forgetting (Goodfellow et al., 2013), if the pretrained LM representations are adjusted to the target task to such an extent that most of the generic knowledge captured during pretraining is in effect forgotten (Howard and Ruder, 2018). A related problem of fine-tuning is overfitting to the target task, which often occurs when only a small number of training examples is available (Dai and Le, 2015).
Adversarial training is a method to increase robustness and regularize deep neural networks (Goodfellow et al., 2015; Miyato et al., 2017). It has been used for domain adaptation (Ganin et al., 2016) to train a model from scratch to produce representations that are invariant to different domains. Inspired by this approach, we propose a regularization technique for the fine-tuning process of a pretrained LM, which aims to optimize knowledge transfer to the target task and avoid overfitting.
Our method, domain Adversarial Fine-Tuning as an Effective Regularizer (AFTER), extends standard fine-tuning by adding an adversarial objective to the task-specific loss. We leverage out-of-domain unlabeled data (i.e. from a different domain than the target task domain). The transferred LM is fine-tuned so that an adversarial classifier cannot discriminate between text representations from in-domain and out-of-domain data. This loss aims to regularize the extent to which the model representations are allowed to adapt to the target task domain. Thus, AFTER is able to preserve the general-domain knowledge acquired during the pretraining of the LM, while adapting to the target task.
Our contributions are: (1) We propose AFTER, an LM fine-tuning method that aims to avoid catastrophic forgetting of general-domain knowledge, acting as a new kind of regularizer. (2) We show that AFTER improves the performance of standard fine-tuning in four natural language understanding tasks from the GLUE benchmark (Wang et al., 2019a), with two different pretrained LMs: BERT (Devlin et al., 2019) and XLNET (Yang et al., 2019). (3) We further conduct an ablation study to provide useful insights regarding the key factors of the proposed approach.

Related Work
Several approaches have been proposed for the adaptation of a model trained on a domain D_S to a different domain D_T, where no labeled data is available (Grauman, 2012; Tzeng et al., 2014; Sun et al., 2016). Ganin et al. (2016) were the first to propose adversarial training for domain adaptation. They introduced a gradient reversal layer to adversarially train a classifier that should not be able to discriminate between D_S and D_T, in image classification and sentiment analysis tasks.
Various adversarial losses have been used for domain adaptation in several NLP tasks, such as question answering (Lee et al., 2019), machine reading comprehension (Wang et al., 2019b) and cross-lingual named entity recognition (Keung et al., 2019). Adversarial approaches have also been used to learn latent representations that are agnostic to different attributes of the input text, such as language (Lample et al., 2018a,b) and style (Yang et al., 2018). Contrary to previous domain adaptation work, we explore the addition of an adversarial loss term that serves as a regularizer for fine-tuning.
Other variants of LM fine-tuning include a supplementary supervised training stage in data-rich tasks (Phang et al., 2018) or multi-task learning with additional supervised tasks (Liu et al., 2019). However, such methods require additional labeled data. A common way to leverage unlabeled data during fine-tuning is through an additional stage of language modeling. For this stage, the unlabeled data can either come from the task-specific dataset (i.e. the labels are dropped and language modeling is performed on the input data) (Howard and Ruder, 2018), or from additional unlabeled in-domain corpora (Sun et al., 2019; Gururangan et al., 2020). This approach adds a computationally expensive step that requires unlabeled data from a specific source. By contrast, our method leverages out-of-domain data with only a small computational overhead and minimal changes to the fine-tuning process.
Our work is compatible with the semi-supervised learning paradigm (Chapelle et al., 2010) that combines learning from both labeled and unlabeled data. In this setting, unlabeled data from the task domain is leveraged using a consistency loss which enforces invariance of the output given small perturbations of the input (Miyato et al., 2017;Clark et al., 2018). The adversarial loss term of AFTER can be interpreted as a consistency loss that ensures invariance of representations across domains.
Recently, adversarial or trust region based approaches (Zhu et al., 2020;Jiang et al., 2020;Aghajanyan et al., 2020) have been proposed as an extension to the LM fine-tuning process. These methods introduce constraints that prevent aggressive updating of the pretrained parameters or enforce smoothness during fine-tuning. However, these approaches require additional forward and backward computations while our method is more computationally efficient and can be implemented with minimal changes to the fine-tuning procedure.
Proposed Approach

Fig. 1 provides a high-level overview of AFTER.

Problem Definition. We tackle a Main task, with a labeled dataset from domain D_M. We further exploit an existing unlabeled corpus, Auxiliary, that comes from a different domain D_AUX. We label each sample with a corresponding domain label y_D: y_D = 0 for samples from Main, and y_D = 1 for samples from Auxiliary. We note that we do not use any real labels from Auxiliary (if there are any). The domain labels are used to train a classifier that discriminates between D_M and D_AUX.

Model. We initialize our model with pretrained weights from a top-performing language model, such as BERT (Devlin et al., 2019) or XLNET (Yang et al., 2019). Both BERT and XLNET encode the representation of the input sequence in the [CLS] token output embedding. We add a linear layer on top of this sequence representation ([CLS] output embedding) for the Main task, resulting in a task-specific loss L_Main. We also add another linear layer with the same input for the binary domain classifier (Figure 1), with a corresponding loss L_Domain.

Adversarial Fine-tuning. The domain discriminator outputs a domain label for each sample of the training set. We seek representations that are both discriminative for the Main task and indiscriminative for the domain classifier. Hence, we minimize L_Main and at the same time maximize L_Domain (the maximization is implemented with a gradient reversal layer, described below), by fine-tuning the pretrained LM with the joint loss:

L = L_Main + λ L_Domain    (1)

where λ (λ > 0) controls the importance of the domain loss. The parameters of the domain classifier are trained to predict the (true) domain label, while the rest of the network is trained to mislead the domain classifier, thereby developing domain-independent internal representations.
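The setup above can be sketched as follows. This is a minimal, framework-free sketch (the function and variable names are ours): in practice L_Main and L_Domain are cross-entropy losses produced by the two classifier heads, and the sign flip for the shared encoder is handled by the gradient reversal layer.

```python
def make_domain_batch(main_batch, aux_batch):
    """Mix Main and Auxiliary examples and attach the created domain labels:
    y_D = 0 for Main (labeled task data), y_D = 1 for Auxiliary (unlabeled)."""
    texts = list(main_batch) + list(aux_batch)
    y_domain = [0] * len(main_batch) + [1] * len(aux_batch)
    return texts, y_domain


def encoder_objective(loss_main, loss_domain, lam=0.1):
    """The shared encoder's effective objective: minimize the task loss
    while maximizing the domain loss, weighted by lambda > 0."""
    return loss_main - lam * loss_domain
```

Note that only the encoder sees the reversed sign; the domain classifier itself is trained normally on L_Domain, so it keeps improving as a discriminator while the encoder learns to fool it.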

Figure 1: Illustration of the proposed approach, AFTER. The task-specific classifier leverages the labeled data from the downstream task (Main), while the domain classifier uses unlabeled data from both the Main and Auxiliary datasets, together with the created domain labels. The forward pass is standard; during the backward pass, the gradients flowing into the domain classifier branch are reversed.
Gradient Reversal Layer. We use a Gradient Reversal Layer (GRL) (Ganin et al., 2016) between the [CLS] output embedding and the domain discriminator layer, as shown in Figure 1, to maximize L_Domain. During the forward pass, the GRL acts as an identity transform, but during backpropagation it reverses the gradients. In effect, the pretrained LM parameters are updated in the direction opposite to the gradient of L_Main and, adversarially, in the direction of the gradient of L_Domain.
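The GRL's behavior can be sketched framework-free as below (in PyTorch this would be a custom autograd Function; the class name is ours, and `lam` corresponds to λ in Eq. 1):

```python
class GradientReversal:
    """Gradient Reversal Layer (Ganin et al., 2016):
    identity in the forward pass, gradients scaled by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # identity transform: the domain classifier sees the representation unchanged
        return x

    def backward(self, upstream_grad):
        # reverse (and scale) the gradient flowing back into the shared encoder
        return -self.lam * upstream_grad
```

Because the reversal happens only in the backward pass, the domain classifier's own parameters still receive the unreversed gradient and learn to predict the true domain label.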

Experiments
Datasets. We experiment with four Main datasets from the GLUE benchmark (Giampiccolo et al., 2007; Bentivogli et al., 2009). The datasets used represent both high-resource (SST-2) and low-resource (RTE, COLA, MRPC) tasks, as well as single-sentence (COLA, SST-2) and sentence-pair (MRPC, RTE) tasks. For Auxiliary data we select corpora from various domains: for the NEWS domain we use the AG NEWS dataset (Zhang et al., 2015); the remaining Auxiliary corpora are described in Appendix A.1.

Implementation. We make our code publicly available. We tune the λ hyperparameter of Eq. 1 on the validation set for each experiment, finding that most values of λ improve over the baseline. We fine-tune each model for 4 epochs and evaluate the model 5 times per epoch, as suggested by Dodge et al. (2020). We select the best model based on the validation loss. For more implementation details see Appendix A.2.

BERT. We observe that the proposed approach (AFTER) outperforms the first baseline (BERT SFT) in all four tasks. For most of these tasks, AFTER improves performance with every Auxiliary dataset, demonstrating the robustness of our approach across domains. In COLA, fine-tuning with the adversarial loss substantially outperforms standard fine-tuning; using an Auxiliary dataset from the NEWS domain improves the baseline by 1.8 points. In SST-2, although standard fine-tuning already achieves high accuracy, AFTER still yields slight performance gains (∼0.4%). As in COLA, these improvements are consistent across Auxiliary datasets and often come with reduced variance compared to SFT. In MRPC, we observe average gains over SFT of 1.5 points in accuracy and 1.0 in F1; using NEWS data as Auxiliary, AFTER outperforms the baseline by 2.1 points in accuracy and 1.5 in F1. In RTE, the proposed approach improves upon the baseline from 64.3% to 64.8% in accuracy, using data from the LEGAL domain.
However, we also observe deteriorated performance with the use of some Auxiliary datasets (e.g. MEDICAL, MATH). We attribute this result to the similarity between the domain of RTE (Wikipedia) and the domain of the pretraining corpus of BERT (Wikipedia and Books); we test this hypothesis in Section 6.

XLNET. We observe in Table 2 that AFTER consistently outperforms standard fine-tuning for an even higher-performing LM (XLNET SFT).

Specifically, in SST-2 AFTER improves the accuracy of standard fine-tuning (SFT) by 0.6% on average and reduces variance as well. For instance, with the use of Auxiliary data from the NEWS or MATH domains, AFTER yields a 0.9% improvement in accuracy. In MRPC, the performance boost is also consistent across Auxiliary data; in particular, the use of LEGAL data leads to an absolute improvement of 1.1% in accuracy and 0.8% in F1. In RTE, adversarial fine-tuning outperforms the baseline by 1.4% in accuracy. However, similar to BERT, we observe lower performance when using AFTER with some Auxiliary data (e.g. NEWS, MEDICAL). We attribute this performance degradation to the same reason as with BERT, the similarity between the pretraining corpus domain and the target task domain (both LMs have similar pretraining corpora).
Summary. The experiments of this section reveal that AFTER can boost target task performance and reduce variance compared to standard fine-tuning across different pretrained LMs. We can therefore attribute the effectiveness of AFTER to regularization itself and not to the model architecture. We also observe in Table 2 that the target task performance of our approach (BERT AFTER) is on par with (RTE) or higher than (MRPC) standard fine-tuning with a higher-performing pretrained LM (XLNET SFT). This finding demonstrates the effectiveness of the proposed approach and motivates the need for more effective fine-tuning schemes as a way to improve the performance of pretrained LMs on downstream tasks.

Ablation Study
We investigate the effect of some key factors of AFTER: the relation between the target task domain and the domain of the pretraining corpus of the LM, the selection of Auxiliary data, and the emergence of domain-invariant characteristics. For the experiments of this section we use BERT, unless otherwise stated.

LM Pretraining and Task Domains. To explore why AFTER fails to improve upon the baseline on RTE, we examine whether the pretrained representations are already well suited to the task (i.e. no regularization is needed). We calculate the average masked LM (MLM) loss of BERT for each Main dataset. We observe in Table 3 that SST-2 produces the largest loss, which can be partially attributed to the dataset format (it contains short sentences that make the MLM task very challenging). RTE produces the lowest loss, confirming our intuition regarding the similarity of the pretraining corpus of BERT and RTE. In this case, general-domain and domain-specific representations are close, rendering domain-adversarial regularization undesirable. This is also confirmed by the vocabulary overlap between RTE and a Wikipedia corpus (Table 3). The more distant the pretraining domain of BERT is from the task domain (measured by vocabulary overlap and MLM loss), the more benefits AFTER demonstrates, confirming our intuition regarding domain-adversarial regularization.

Domain Distance. We measure the domain distance for all Main-Auxiliary pairs to evaluate how the choice of the latter affects the performance of AFTER. We represent the word distribution of each dataset using term distributions t ∈ R^|V|, where t_i is the probability of the i-th word in the joint vocabulary V (see Appendix A.4), and calculate the Jensen-Shannon (JS) divergence (Plank and van Noord, 2011). Combining the results of Table 2 and Fig. 2, no clear pattern emerges, demonstrating, perhaps, our method's robustness to domain distance.
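The JS divergence between two term distributions over the joint vocabulary can be computed as below (a minimal sketch; we use base-2 logarithms so the divergence lies in [0, 1]):

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions p and q,
    given as probability lists over a shared vocabulary. Base-2 logs
    bound the result in [0, 1]."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute nothing
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions yield 0, and distributions with disjoint support yield the maximum value of 1, so the measure gives an interpretable scale for comparing Main-Auxiliary pairs.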
We leave a further investigation of selection criteria for the Auxiliary data for future work.

Domain-invariant vs. Domain-specific Features.
To investigate whether the benefits of AFTER can be attributed solely to data augmentation, we compare adversarial (λ > 0 in Eq. 1) and multi-task (λ < 0) fine-tuning. We experiment with MRPC and COLA in both settings (tuning each λ separately). We observe that during multi-task fine-tuning (Fig. 3), L_Domain is close to zero (even in the first epoch). This implies that domain classification is an easy auxiliary task, confirming our intuition that a non-adversarial fine-tuning setting favors domain-specific features. Although the multi-task approach leverages the same unlabeled data, its performance is worse than AFTER's (Table 4), which highlights the need for an adversarial domain discriminator.

Conclusions and Future Work
We propose AFTER, a domain adversarial method to regularize the fine-tuning process of a pretrained LM. Empirical results demonstrate that our method can lead to improved performance over standard fine-tuning. AFTER can be widely applied to any transfer learning setting and model architecture, with minimal changes to the fine-tuning process, without requiring any additional labeled data. We aim to further explore the effect of Auxiliary data on the final performance and the use of multiple Auxiliary datasets. We also aim to extend the proposed approach as a way to fine-tune a pretrained LM to a different language, in order to produce language-invariant representations.

A Appendices
In this supplementary material, we provide additional information for producing the results in the paper, and results that could not fit into the main body of the paper.

A.1 Dataset Details
Main datasets. We use only four datasets of the GLUE benchmark as Main for our experiments, due to resource constraints. All Main datasets are open source and can be found at https://gluebenchmark.com/tasks.
Auxiliary datasets. We choose Auxiliary datasets that are larger than the Main ones, which we consider the most realistic scenario, given the availability of unlabeled compared to labeled data. We under-sample the Auxiliary dataset to ensure that the two domains are equally represented, motivated by the observation of Bingel and Søgaard (2017) that balanced datasets tend to work better in auxiliary tasks. For each mini-batch, we sample equally from the Main and Auxiliary datasets. The Auxiliary datasets are a mix of labeled and unlabeled datasets from different domains. The labeled Auxiliary datasets (e.g. AG NEWS) are handled as unlabeled corpora, by dropping the task-specific labels and using only the domain labels. Although some domains might seem similar to those of the Main datasets (e.g. electronics reviews vs. movie reviews, or agricultural news vs. news), this is not the case, as can be seen in Figure 6.
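The under-sampling and balanced mini-batch construction described above can be sketched as follows (a simplified sketch with names of our own choosing; the actual implementation may interleave domains differently):

```python
import random


def balanced_batches(main_data, aux_data, batch_size, seed=0):
    """Yield mini-batches with equal representation of both domains:
    half of each batch comes from Main and half from the Auxiliary corpus,
    which is first under-sampled to the size of Main."""
    rng = random.Random(seed)
    half = batch_size // 2
    main = list(main_data)
    rng.shuffle(main)
    # under-sample Auxiliary so the two domains are equally represented
    aux = rng.sample(list(aux_data), len(main))
    for i in range(0, len(main) - half + 1, half):
        batch = main[i:i + half] + aux[i:i + half]
        # created domain labels: y_D = 0 for Main, y_D = 1 for Auxiliary
        domain_labels = [0] * half + [1] * half
        yield batch, domain_labels
```

Balancing at the batch level keeps the domain classification problem symmetric, so the domain loss cannot be trivially minimized by always predicting the majority domain.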
The maximum sequence length for all datasets was 128, so all samples were truncated to 128 tokens and lower-cased. For EUROPARL, which contains parallel corpora in multiple languages, only the English part is used; we sample 120K sentences from the English corpus. For PUBMED we use 120K abstracts from medical papers, from the dataset of Cohan et al. (2018). For MATH we use 120K questions of medium difficulty from the dataset of Saxton et al. (2019). We note that all corpora used are in English.

A.2 Hyperparameters and Model details
For BERT we use the bert-base-uncased pretrained model and we fine-tune it with the following hyperparameters: dropout 0.1, batch size 28 and a maximum length of 128 tokens. For optimization we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-5, Adam epsilon 1e-6 and weight decay 0.01. We use a linear warmup schedule with a 0.1 warmup proportion.
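The linear warmup schedule can be sketched as below (a framework-free sketch of the schedule we assume: linear increase for the first 10% of steps, then linear decay to zero; library schedulers may differ in edge cases):

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_proportion=0.1):
    """Learning rate at a given step: ramp up linearly to base_lr during
    the warmup phase, then decay linearly to zero by the last step."""
    warmup_steps = max(1, int(warmup_proportion * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```
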
For XLNET we use the xlnet-base-cased pretrained model. We use the last hidden state output embedding as the input sequence representation. We fine-tune XLNET with the following hyperparameters: batch size 26, and the same learning rate (2e-5) and sequence length (128) as BERT. We do not use weight decay or warmup. To replicate the results of Yang et al. (2019) on COLA, the authors suggest using a considerably larger batch size (×4), which was not possible in our case due to resource constraints.
When we combine AFTER with either BERT or XLNET we use the same hyperparameters as above. We note that both models have approximately 110M parameters and this is (almost) the same using AFTER, as well. Our approach only introduces a binary domain discriminator in the form of a linear layer.
For all experiments we used a GeForce GTX 1080 GPU with 6GB of memory. The duration of the experiments depended on the dataset. For SST-2, the largest dataset, the baseline experiments (BERT, XLNET) had a runtime of approximately 100 minutes (for all 4 epochs) and 200 minutes for AFTER, due to the implicit dataset augmentation. Smaller datasets such as MRPC and COLA had an approximate runtime of 30 minutes with standard fine-tuning and 60 minutes with AFTER.

A.4 More Domain Distance Results
To create a common vocabulary for all data for Figure 2, we find the 5K most frequent words in each dataset and then take the union of these sub-vocabularies, which results in 23K words. We also calculate the vocabulary overlap, by creating each domain (or task) vocabulary from the 10K most frequent words in each dataset (if a dataset contains fewer words, we use all the words in the dataset). We then calculate the vocabulary overlap between domains (Figure 5) and between each task and all domains (Figure 6). For the latter, we also include the WIKI domain to account for the pretraining domain of BERT and XLNET. For the vocabulary of WIKI we use the WikiText-2 corpus of Merity et al. (2017). We observe in Figure 5 that most domains are dissimilar, with the exception of the NEWS and LEGAL domains, which have 36.6% vocabulary overlap. In Figure 6, we observe that RTE has the highest vocabulary overlap with WIKI, which is a possible cause of the deteriorated performance of AFTER: the model has already been pretrained on this domain and does not require further regularization, as described in Section 6.
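The vocabulary overlap computation can be sketched as follows. This is a sketch under an assumption of ours: the text does not spell out the normalization, so we use the Jaccard index (intersection over union) of the two top-k vocabularies.

```python
from collections import Counter


def vocab_overlap(tokens_a, tokens_b, top_k=10_000):
    """Overlap between the top_k most frequent words of two tokenized corpora,
    computed as the Jaccard index |A ∩ B| / |A ∪ B| (our assumed normalization).
    If a corpus has fewer than top_k distinct words, all of them are used."""
    vocab_a = {w for w, _ in Counter(tokens_a).most_common(top_k)}
    vocab_b = {w for w, _ in Counter(tokens_b).most_common(top_k)}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```
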