Feature Adaptation of Pre-Trained Language Models across Languages and Domains with Robust Self-Training

Adapting pre-trained language models (PrLMs) (e.g., BERT) to new domains has gained much attention recently. Instead of fine-tuning PrLMs as done in most previous work, we investigate how to adapt the features of PrLMs to new domains without fine-tuning. We explore unsupervised domain adaptation (UDA) in this paper. With the features from PrLMs, we adapt the models trained with labeled data from the source domain to the unlabeled target domain. Self-training is widely used for UDA, and it predicts pseudo labels on the target domain data for training. However, the predicted pseudo labels inevitably include noise, which will negatively affect training a robust model. To improve the robustness of self-training, in this paper we present class-aware feature self-distillation (CFd) to learn discriminative features from PrLMs, in which PrLM features are self-distilled into a feature adaptation module and the features from the same class are more tightly clustered. We further extend CFd to a cross-language setting, in which language discrepancy is studied. Experiments on two monolingual and multilingual Amazon review datasets show that CFd can consistently improve the performance of self-training in cross-domain and cross-language settings.


Introduction
Pre-trained language models (PrLMs) such as BERT (Devlin et al., 2019) and its variants (Liu et al., 2019c; Yang et al., 2019) have shown significant success on various downstream NLP tasks. However, these deep neural networks are sensitive to cross-domain distribution shift (Quionero-Candela et al., 2009), and their effectiveness is much weakened in such scenarios, so adapting PrLMs to new domains is important. Unlike most recent work that fine-tunes PrLMs on the unlabeled data from the new domains (Han and Eisenstein, 2019; Gururangan et al., 2020), we are interested in how to adapt the PrLM features without fine-tuning. To investigate this, we specifically study unsupervised domain adaptation (UDA) of PrLMs, in which we adapt models trained with source labeled data to the unlabeled target domain based on the features from PrLMs.
Self-training has been proven effective in UDA (Saito et al., 2017): it uses the model trained with source labeled data to predict pseudo labels on the unlabeled target set for model training. Unlike adversarial learning (Ganin et al., 2016; Chen et al., 2018) and Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), which learn domain-invariant features for domain alignment, self-training aims to learn discriminative features on the target domain, since simply matching domain distributions cannot guarantee accurate predictions on the target after adaptation (Lee et al., 2019; Saito et al., 2017). To learn discriminative features for the target, self-training needs to retain a model's high-confidence predictions on the target domain, which are considered correct for training. Methods like ensemble learning (Zou et al., 2019; Ge et al., 2020; Saito et al., 2017), which adopt multiple models to jointly make decisions on pseudo-label selection, have been introduced to achieve this goal. Though these methods can substantially reduce wrong predictions on the target, there will still be noisy labels in the pseudo-label set, with negative effects on training a robust model, since deep neural networks with their high capacity can easily fit corrupted labels (Arpit et al., 2017).
In our work, to improve the robustness of self-training, we propose to jointly learn discriminative features from the PrLM on the target domain to alleviate the negative effects caused by noisy labels. We introduce class-aware feature self-distillation (CFd) to achieve this goal (§4.2). The features from PrLMs have been proven highly discriminative for downstream tasks, so we propose to distill these features into a feature adaptation module (FAM) to make the FAM capable of extracting discriminative features (§4.2.1). Inspired by recent work on representation learning (van den Oord et al., 2018; Hjelm et al., 2019), we introduce mutual information (MI) maximization for feature self-distillation (Fd): we maximize the MI between the features from the PrLM and the FAM to make the two kinds of features more dependent. Since Fd can only distill features from the PrLM, it ignores the cluster information of data points, which can also improve feature discriminativeness (Chapelle and Zien, 2005; Lee et al., 2019). Hence, for the features output by the FAM, if the corresponding data points belong to the same class, we further minimize their feature distance to make each cluster more cohesive, so that different classes become more separable. To retain high-confidence predictions, we re-rank the predicted candidates and balance the numbers of samples in different classes (§4.1).
We use XLM-R (Conneau et al., 2019), which is trained on over 100 languages, as the PrLM. Since XLM-R has already mapped different languages into a common feature space, we also extend our method to cross-language, as well as cross-language and cross-domain, settings. We experiment with two monolingual and multilingual Amazon review datasets for sentiment classification: MonoAmazon for cross-domain and MultiAmazon for cross-language experiments. We demonstrate that self-training can be consistently improved by CFd in all settings (§5.3). Further empirical results indicate that the improvements come from learning a lower error of the ideal joint hypothesis (§4.3, §5.4).

Related Work
Adaptation of PrLMs. Recently, significant improvements on multiple NLP tasks have been enabled by pre-trained language models (PrLMs) (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019c; Howard and Ruder, 2018; Peters et al., 2018). To enhance their performance on new domains, much work has been done to adapt PrLMs, and two main adaptation settings have been studied. The first is the same as what we study in this work: the PrLM provides the features based on which domain adaptation is conducted (Han and Eisenstein, 2019; Cao et al., 2019; Logeswaran et al., 2019; Ma et al., 2019; Li et al., 2020). In the second setting, the corpus used to pre-train the language model has a large domain discrepancy with the target domain, so the target unlabeled data are needed to fine-tune the PrLM, after which a task-specific model is trained (Gururangan et al., 2020). For example, Lee et al. (2020) and Alsentzer et al. (2019) transfer PrLMs into the biomedical and clinical domains. Instead of fine-tuning PrLMs with unlabeled data from the new domain as in most previous work (Rietzler et al., 2019; Han and Eisenstein, 2019; Gururangan et al., 2020), we are interested in the feature-based approach (Devlin et al., 2019; Peters et al., 2019) to adapting PrLMs, which does not fine-tune them. The feature-based approach is much faster, easier, and more memory-efficient to train than the fine-tuning-based method, since it does not have to update the parameters of the PrLMs, which are usually massive, as with the newly released GPT-3 (Brown et al., 2020).
Domain Adaptation. To perform domain adaptation, previous work mainly focuses on how to minimize the domain discrepancy and how to learn discriminative features on the target domain (Ben-David et al., 2010). Kernelized methods, e.g., MMD (Gretton et al., 2012; Long et al., 2015), and adversarial learning (Ganin et al., 2016; Chen et al., 2018) are commonly used to learn domain-invariant features. To learn discriminative features for DA, self-training is widely explored (Saito et al., 2017; Ge et al., 2020; Zou et al., 2018, 2019; He et al., 2018). To retain high-confidence predictions for self-training, ensemble methods like tri-training (Saito et al., 2017), mutual learning (Ge et al., 2020), and dual information maximization (Ye et al., 2019) have been introduced. However, the pseudo-label set will still contain noisy labels, which negatively affect model training (Arpit et al., 2017; Zhang et al., 2017). Other methods for learning discriminative features include feature reconstruction (Ghifary et al., 2016), semi-supervised learning (Laine and Aila, 2017), and virtual adversarial training (Lee et al., 2019). Based on the cluster assumption (Chapelle and Zien, 2005) and the relationship between the decision boundary and feature representations, Lee et al. (2019) explore class information to learn discriminative features. Class information is also studied in distant supervision for relation extraction (Ye et al., 2017). In NLP, early work explores domain-invariant and domain-specific words to reduce domain discrepancy (Blitzer et al., 2007; Pan et al., 2010; He et al., 2011).

Preliminary
In this section, we introduce the problem definition and the model architecture based on which we build our domain adaptation algorithm presented in the next section.

Unsupervised Domain Adaptation
In order to improve the feature adaptability of pre-trained transformers across domains, we study unsupervised domain adaptation of pre-trained language models, where we train models with labeled data from the source domain and unlabeled data from the target domain. We use the features from PrLMs to perform domain adaptation. Labeled data from the source domain are defined as S = {X_s, Y_s}, in which every sample x_s ∈ X_s has a label y_s ∈ Y_s. The unlabeled data from the target domain are T = {X_t}. In this work, we comprehensively study domain adaptation in cross-domain and cross-language settings, based on the features from a multilingual PrLM, for which we adopt XLM-R (Conneau et al., 2019). By using XLM-R, different languages can be mapped into a common feature space. We evaluate our method on the task of sentiment classification using two datasets.

Model Architecture
As presented in Figure 1, our model consists of a pre-trained language model (PrLM), a feature adaptation module (FAM), and a classifier.

Pre-trained Language Model
Following BERT (Devlin et al., 2019), most PrLMs consist of an embedding layer and several transformer layers. Suppose a PrLM has L + 1 layers, where layer 0 is the embedding layer and layer L is the last layer. Given an input sentence x = [w_1, w_2, ..., w_|x|], the embedding layer of the PrLM encodes x as:

h_0 = Embedding(x),

where h_0 = [h_0^1, h_0^2, ..., h_0^|x|]. After obtaining the embeddings of the input sentence, we compute the features of the sentence from the transformer blocks of the PrLM. In layer l, we compute the transformer feature as:

h_l = Transformer_l(h_{l-1}),

where h_l = [h_l^1, h_l^2, ..., h_l^|x|] and l ∈ {1, 2, ..., L}. Keeping all |x| token features would incur too much memory, so based on our experiments we take the average of h_l:

h̄_l = (1/|x|) Σ_{i=1}^{|x|} h_l^i,

and h̄_l will be fed into the FAM.
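The layer-wise mean pooling above can be sketched in a few lines (a minimal illustration with plain Python lists; in practice h_l would be the hidden states returned by the transformer, and `mean_pool` is a hypothetical helper name):

```python
def mean_pool(layer_features):
    """Average the |x| token features of one layer into a single vector.

    layer_features: list of |x| token vectors (lists of floats).
    Returns the mean-pooled sentence feature h̄_l for that layer.
    """
    num_tokens = len(layer_features)
    dim = len(layer_features[0])
    return [sum(tok[d] for tok in layer_features) / num_tokens
            for d in range(dim)]

# Toy example: 3 tokens with 2-dimensional features
h_l = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(h_l))  # [3.0, 4.0]
```

Pooling each layer this way yields one fixed-size vector per layer, independent of sentence length.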

Feature Adaptation Module
To transfer knowledge from the source to the target domain, the features from PrLMs should be transferable. Previous work points out that PrLM features from the intermediate layers are more transferable than the upper-layer features, while the upper-layer features are more discriminative for classification (Hao et al., 2019; Peters et al., 2018; Liu et al., 2019b). As a trade-off between speed and model performance, we combine the last N-layer features from the PrLM for domain adaptation, which we call the multi-layer representation of the PrLM.
Our FAM consists of a feed-forward neural network (followed by a tanh activation function) and an attention mechanism. We map h̄_l from layer l into z_l with the feed-forward neural network:

z_l = tanh(W_f h̄_l + b_f).

Multi-layer Representation. Since feature effectiveness differs from layer to layer, we use an attention mechanism (Luong et al., 2015) to learn to weight the features from the last N layers. We obtain the multi-layer representation z of the PrLM as:

z = Σ_{l=L-N+1}^{L} a_l z_l,

in which the attention score s_l for layer l is computed with a matrix of trainable parameters W_att. Inspired by Berthelot et al. (2019), we want the model to focus more on the higher-weighted layers, so we further sharpen the attention weights with a temperature τ:

a_l = exp(s_l / τ) / Σ_{j=L-N+1}^{L} exp(s_j / τ),   (6)

where a smaller τ concentrates the weights on the layers with higher scores. We write the whole mapping as z = E(x; θ), where θ is the set of learnable parameters that includes the parameters of the feed-forward neural network and the attention mechanism.
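The temperature-sharpened attention combination can be sketched as follows (the score computation with W_att is abstracted into a plain list of scores; the function names are illustrative, not the authors' code):

```python
import math

def sharpened_attention(scores, tau=0.3):
    """Normalize per-layer scores with a temperature-scaled softmax.

    A smaller tau concentrates weight on the higher-scoring layers;
    a very large tau approaches uniform averaging (the AVE baseline).
    """
    exps = [math.exp(s / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_feats, scores, tau=0.3):
    """Weighted sum of the last-N layer features z_l into z."""
    weights = sharpened_attention(scores, tau)
    dim = len(layer_feats[0])
    return [sum(w * z[d] for w, z in zip(weights, layer_feats))
            for d in range(dim)]
```

With equal scores the combination reduces to a plain average, which is exactly the AVE baseline compared against in the experiments.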

Classifier
After obtaining the multi-layer representation z, we train a classifier with the source domain labeled set S. We define the loss function for the task-specific classifier as:

L^S_pred = E_{(x_s, y_s) ∈ S} ℓ(g(E(x_s; θ); φ), y_s),   (7)

where g is a classifier parameterized by φ that takes in the features output by E, and ℓ is the loss function, which is the cross-entropy loss in our work.
Class-aware Feature Self-distillation for Domain Adaptation

In this section, we introduce our method for domain adaptation. Our domain adaptation loss function takes the form:

L = L^S_pred + L^T_pred + L_CFd,

in which L^S_pred learns a task-specific classifier with the source labeled set S (Eq. 7), L^T_pred is the self-training loss trained with the pseudo-label set T′ (§4.1), and L_CFd enhances the robustness of self-training by learning discriminative features from the PrLM (§4.2), which is the main algorithm for domain adaptation in this work.

Self-training for Adaptation
We build our adaptation model based on self-training, which predicts pseudo labels on the unlabeled target data; the predicted pseudo labels are then used for model training. In the training process, we predict pseudo labels for all the target samples in T. To retain high-confidence predictions from T, we introduce a simple but effective method called rank-diversify to build the pseudo-label set T′, which is a subset of T.

Rank. We calculate the entropy loss for every sample in T:

H(x_t) = - Σ_c g_c(z_t) log g_c(z_t),

in which z_t is the multi-layer feature of x_t and g is the classifier in Eq. 7. A lower entropy loss indicates a higher confidence of the model in the pseudo label, so we use the entropy loss to re-rank T. However, after re-ranking, some classes may have too many samples among the top K candidates, which would bias model training, so we also need to diversify the pseudo labels in the top-K list.

Diversify. We group the samples into classes by pseudo label and re-rank them by entropy loss in ascending order within every class. Samples are then selected from every class in turn, following that order, until K samples are selected.
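The rank-diversify selection can be sketched as follows (a simplified sketch; `rank_diversify` and the tuple format are illustrative, not the authors' actual code):

```python
import math

def rank_diversify(samples, k):
    """Select up to k high-confidence pseudo-labeled samples.

    samples: list of (sample_id, pseudo_label, probs), where probs is
    the classifier's predicted distribution g(z) for that sample.
    """
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    # Rank: group by pseudo label, sort each class by entropy
    # (ascending, i.e. most confident first).
    by_class = {}
    for sid, label, probs in samples:
        by_class.setdefault(label, []).append((entropy(probs), sid))
    for label in by_class:
        by_class[label].sort()

    # Diversify: round-robin over classes so no class dominates the
    # top-k list.
    selected = []
    while len(selected) < k and any(by_class.values()):
        for label in sorted(by_class):
            if by_class[label] and len(selected) < k:
                _, sid = by_class[label].pop(0)
                selected.append((sid, label))
    return selected
```

The round-robin pass is what balances the class counts; without it, a confident majority class could fill the entire top-K list.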
With the retained pseudo-label set T′, we have the loss function for training:

L^T_pred = α E_{(x_t, ŷ_t) ∈ T′} ℓ(g(E(x_t; θ); φ), ŷ_t),

in which ŷ_t is the pseudo label of x_t and α is a hyper-parameter that increases gradually during training.

Robust Self-training by Discriminative Feature Learning
To alleviate the negative effects caused by the noisy labels in the pseudo-label set T′, we propose to learn discriminative features from the PrLM.

Feature Self-distillation
To maintain the discriminative power of PrLM features, we propose to self-distill the PrLM features into the newly added feature adaptation module (FAM). Similar to traditional knowledge distillation (Hinton et al., 2015), feature distillation in our work aims to make the FAM (the student) also capable of generating discriminative features for adaptation, as the PrLM (the teacher) does. Since the source domain already has labeled data, there is no need for self-distillation there, and we apply feature self-distillation (Fd) to the target domain. Inspired by recent work on representation learning (van den Oord et al., 2018; Hjelm et al., 2019; Tian et al., 2020), we propose to use mutual information (MI) maximization for Fd.

MI for Feature Self-distillation. MI measures the mutual dependence between two random variables, and maximizing the MI between them makes them more dependent. By maximizing the MI between the features from the PrLM and the FAM, we make the two kinds of features more similar. We are interested in distilling the PrLM features into the multi-layer representation z. We could distill the feature h̄_l from any single layer l into z, but distilling only a one-layer feature of the PrLM may neglect the information from the other layers, so we use the sum of the last N-layer features for distillation 1:

x̄ = Σ_{l=L-N+1}^{L} h̄_l.   (14)

The distillation process is illustrated in Figure 2. We then maximize the MI I(z, x̄). Since MI is hard to estimate directly, we need a lower bound to maximize. Following van den Oord et al. (2018), we use Noise Contrastive Estimation (NCE) to obtain the lower bound. To estimate the NCE loss, we need a negative sample set X̄_neg in which PrLM features are randomly sampled for the current z. The NCE objective is:

L_Fd = - E [ log ( exp(inf(z)ᵀ x̄) / Σ_{x̄* ∈ {x̄} ∪ X̄_neg} exp(inf(z)ᵀ x̄*) ) ],

where inf(·) is a trainable feed-forward neural network followed by the tanh activation, which resizes the dimension of z to be equal to that of x̄*.
To obtain the negative sample set, we select one negative x̄* by randomly shuffling the batch of features that x̄ is in, and this process is repeated |X̄_neg| times.

1 Based on Eq. 14, taking the sum or the average of the last N-layer features has the same effect.
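The NCE objective and the shuffling-based negative sampling can be sketched as follows (a minimal single-sample version in plain Python; in practice inf(·) is a trainable network and everything is computed over batched tensors):

```python
import math
import random

def info_nce_loss(z_proj, x_bar, negatives):
    """NCE objective for one sample (averaged over the batch in practice).

    z_proj:    inf(z), the FAM feature projected to the PrLM feature size.
    x_bar:     the summed last-N-layer PrLM feature (the positive).
    negatives: the sampled PrLM features X_neg for this z.
    Minimizing the returned value maximizes the MI lower bound.
    """
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    pos = math.exp(dot(z_proj, x_bar))
    denom = pos + sum(math.exp(dot(z_proj, n)) for n in negatives)
    return -math.log(pos / denom)

def shuffled_negatives(batch_x_bar, num_neg, seed=0):
    """Build negatives by shuffling the batch of PrLM features num_neg
    times; negs[j][i] is the j-th negative for the i-th batch sample."""
    rng = random.Random(seed)
    negs = []
    for _ in range(num_neg):
        shuffled = batch_x_bar[:]
        rng.shuffle(shuffled)
        negs.append(shuffled)
    return negs
```

Shuffling within the batch means negatives are PrLM features of other sentences, so the loss pushes inf(z) to score its own sentence's x̄ above all mismatched pairs.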

Class Information
Feature distillation can only maintain the discriminative power of PrLM features but ignores the class information present in class labels. To exploit the class information, when performing feature self-distillation, we further introduce an intra-class loss to minimize the feature distance within the same class. Given the pseudo-label set T′ and the source labeled set S, we group the multi-layer features output by the FAM into different classes, and for every class c we calculate a center feature z_c. We define the intra-class loss as follows:

L^{S,T}_C = (1/|C|) Σ_{c ∈ C} E_{x in class c} || z - z_c ||²,

where C is the set of classes and z is the multi-layer feature of a sample x labeled (or pseudo-labeled) with class c. The center feature z_c for class c ∈ C is calculated as the mean of the multi-layer features of all samples in class c:

z_c = (1/|S_c ∪ T′_c|) Σ_{x ∈ S_c ∪ T′_c} z,

where S_c and T′_c denote the samples of class c in S and T′ respectively. Before training for an epoch, the center features are calculated and kept fixed during that epoch; after the epoch, the center features are updated. Our final CFd loss then becomes:

L_CFd = L_Fd + λ L^{S,T}_C,

where λ is a hyper-parameter that controls the contribution of L^{S,T}_C.
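The class centers and the intra-class loss can be sketched as follows (a plain-Python sketch assuming a squared Euclidean distance; the exact distance used by the authors may differ):

```python
def class_centers(features, labels):
    """Average the FAM features of each class, pooling source samples
    and pseudo-labeled target samples together."""
    sums, counts = {}, {}
    for z, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(z)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], z)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def intra_class_loss(features, labels, centers):
    """Mean squared distance of each feature to its (fixed) class center."""
    total = 0.0
    for z, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(z, centers[y]))
    return total / len(features)
```

As in the paper, the centers would be computed once before each epoch and held fixed while the loss is minimized, then recomputed for the next epoch.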

Analysis
We provide a theoretical understanding of why CFd can improve self-training. Following Ben-David et al. (2010), the expected error of a hypothesis h on the target domain is bounded as:

ε_T(h) ≤ ε_S(h) + (1/2) d_{H∆H}(D_S, D_T) + ε*,   (18)

in which d_{H∆H} measures the domain discrepancy and is defined as:

d_{H∆H}(D_S, D_T) = 2 sup_{h, h′ ∈ H} | Pr_{x∼D_S}[h(x) ≠ h′(x)] - Pr_{x∼D_T}[h(x) ≠ h′(x)] |,

and ε* is the error of the ideal joint hypothesis, which is defined as:

ε* = ε_S(h*) + ε_T(h*),  h* = argmin_{h ∈ H} [ε_S(h) + ε_T(h)].   (20)

From Ineq. 18, the performance of domain adaptation is bounded by the generalization error on the source domain, the domain discrepancy, and the error of the ideal joint hypothesis (the joint error). Self-training aims to achieve a low joint error by learning discriminative features on the target domain, so that the adaptation performance can be improved (Saito et al., 2017). Our proposed CFd enhances the robustness of self-training by self-distilling the PrLM features and exploring the class information; in this way, the joint error can be further reduced compared to self-training alone (Fig. 3). Besides, by optimizing the intra-class loss, d_{H∆H} in Ineq. 18 can be reduced, since within the same class the feature distance between samples from the source and target domains is minimized (Fig. 4).

Datasets
We use two Amazon review datasets for evaluation. One is monolingual and the other is multilingual.
MonoAmazon. This dataset consists of English reviews from He et al. (2018) and has four domains: Book (BK), Electronics (E), Beauty (BT), and Music (M). Each domain has 2,000 positive, 2,000 negative, and 2,000 neutral reviews.

MultiAmazon. This is a multilingual review dataset (Prettenhofer and Stein, 2010) in English, German, French, and Japanese. For every language, there are three domains: Book, Dvd, and Music. Each domain has 2,000 reviews for training and 2,000 for test, with 1,000 positive and 1,000 negative reviews in each set. An additional 6,000 reviews form the unlabeled set for each domain. The source domains are selected only from the English corpus. Table 2 shows the data split. To construct the unlabeled set for the target domain, in MonoAmazon we use reviews from the test set as the unlabeled data, following He et al. (2018); in MultiAmazon, we combine the target domain's training set with its original unlabeled set.
We also evaluate our model on the benchmark dataset of Blitzer et al. (2007). The results are presented in Appendix B.

Experimental Setup
Model Settings. To enable cross-language transfer, we use XLM-R 2 (Conneau et al., 2019), which has 25 layers, as the pre-trained language model. The dimension of its token embeddings is 1024, which is mapped to 256 by the FAM. Based on a preliminary transfer result, the last 10-layer features are used in the FAM. λ for the intra-class loss is set to 1 and 2 for MonoAmazon and MultiAmazon respectively. We set the size of the negative sample set to 10, and we perform Fd training only on the target domain. τ for the attention mechanism in Eq. 6 is set to 0.3. In the training process, we gradually increase the number of retained pseudo labels for self-training, by 100 per epoch for MonoAmazon and 300 per epoch for MultiAmazon. α for L^T_pred is a linear function of the epoch for MonoAmazon and a quadratic one for MultiAmazon. More details of the experimental settings are in Appendix A. More detailed baseline settings can be found in Appendix A.3.

Main Results
We conduct experiments in cross-domain (CD), cross-language (CL), and both cross-language and cross-domain (CLCD) settings. Results of CD are evaluated on MonoAmazon (Table 1) and results of CL and CLCD on MultiAmazon (Table 3). For CL, English is set as the source language, and the domains in the source and target languages are the same, i.e., when German&book is the target, the source is English&book. For CLCD, the sources are also only from English. For example, when the target is German&book, the source language is English and the source domain is dvd or music; two sources are thus set up, English&dvd and English&music, and the two adaptation results are averaged for German&book.

We have the following findings from Tables 1 and 3, based on the overall average scores. xlmr-10 vs. xlmr-tuning: xlmr-10 is slightly better than xlmr-tuning, which demonstrates the effectiveness of the feature-based approach. xlmr-1 vs. xlmr-10: xlmr-10 is much better than xlmr-1, which means our multi-layer representation of XLM-R is much more transferable than the last-layer feature. xlmr-10 vs. p: p is consistently better than xlmr-10, which shows our self-training method is effective. p vs. p+CFd: after using CFd, p is consistently improved, and p+CFd achieves the best performance among all the methods, which shows the effectiveness of CFd.

Further Analysis
Ablation Study. We conduct ablation experiments to see the contributions of feature self-distillation (Fd) and class information (C), evaluated on MonoAmazon based on the last 10-layer features. By ablating p+CFd, we have four baselines: p+C (w/o Fd), p+Fd (w/o C), CFd (w/o p), and Fd (w/o p+C). From the results in Table 4, p+Fd and p+C perform worse than p+CFd but still better than p, so feature self-distillation and class information both contribute to the improvements over p. Also, after removing the effects of p, CFd and Fd still substantially outperform xlmr-10, which means CFd and Fd are both effective for domain adaptation independently of the self-training method.

Joint Errors. Here we study why CFd can enhance self-training and provide empirical results to support the theoretical understanding in §4.3. Testing on MonoAmazon based on the last 10-layer features, Figure 3 presents the joint error results. For example, to find h* in Eq. 20 for baseline p, following Liu et al. (2019a), we train a classifier using the combined source and target labeled data based on the fixed FAM trained by p. We note that p+Fd and p+C achieve lower joint errors than p, and p+CFd performs best, which is consistent with our analysis in §4.3.

Effects of Feature Self-distillation. We conduct an in-domain test to verify that Fd learns discriminative features from the PrLM. We build a sentiment classification model with in-domain data based on the last 10-layer features. From the same domain in MonoAmazon, we select 4,000 labeled pairs for training, 1,000 for validation, and 1,000 for test. We first pre-train the FAM by Fd using the entire 6,000 raw texts, then we freeze the FAM and train a classifier on the training data with the features output by the FAM. We compare the results with the baseline that directly trains the FAM and classifier on the training set (Super).
From the results in Table 5, the performance of Fd is very close to Super, showing that the features output by the FAM after Fd training are discriminative.

Effects of Class Information. Table 6 presents the average intra-class loss in the training process. By exploring class information, the intra-class loss is dramatically minimized and accordingly the transfer performances are improved.

A-distance. To measure the domain discrepancy, following Saito et al. (2017), we calculate the A-distance based on the last 10-layer features output by the FAM trained by p or the other methods, and train a classifier to classify the source and target domain data. d_A is equal to 2(1 - δ), where δ is the domain classification error. From Figure 4, p+C and p+CFd have much smaller A-distances, which means that the intra-class loss reduces the domain discrepancy. p+Fd has a larger A-distance, probably because Fd learns domain-specific information from the target, so the domain distance becomes larger.
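The A-distance proxy described here is simple enough to sketch directly, using the paper's convention d_A = 2(1 - δ):

```python
def a_distance(domain_clf_error):
    """Proxy A-distance from the error rate δ of a classifier trained
    to separate source-domain from target-domain features, following
    the formulation used in the paper: d_A = 2(1 - δ).
    A lower d_A indicates a smaller domain discrepancy."""
    return 2.0 * (1.0 - domain_clf_error)
```

A domain classifier that errs often (δ close to 0.5) yields a small d_A, meaning the two feature distributions are hard to tell apart.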
Effects of Attention Mechanism. We further show whether combining the intermediate-layer features can enhance adaptation. In Table 7, "one layer" means only using one layer's features for transfer, with the results obtained using the feature from the most transferable layer. We introduce the attention mechanism to combine the last N-layer features and demonstrate that using the last 10-layer features with attention achieves better performance. AVE, which averages the last N-layer features, cannot improve the performance, since it lacks the ability to focus more on effective features.
We also study how the size of the negative sample set affects feature distillation, as well as the effect of sharpening on the attention mechanism. The analysis is included in Appendix C.

Conclusion
In this paper, we study how to adapt the features from pre-trained language models without fine-tuning. We specifically study unsupervised domain adaptation of PrLMs, where we transfer models trained on the labeled source domain to the unlabeled target domain based on PrLM features. We build our adaptation method on self-training and, to enhance its robustness, present class-aware feature self-distillation to learn discriminative features. Experiments on sentiment analysis in cross-language and cross-domain settings demonstrate the effectiveness of our method.

A.3 Settings for Baselines
KL. The KL-divergence loss (Zhuang et al., 2015) is defined as:

L_KL = KL(p̄_s ∥ p̄_t) + KL(p̄_t ∥ p̄_s), with p̄ = (1/n) Σ_{i=1}^{n} g(z_i),

in which n is the batch size and p̄_s, p̄_t are the batch-mean predicted label distributions of the source and target domains. We set the weight of the KL loss as 500, tuned from {100, 500, 1000, 5000}.

MMD. We use the Gaussian kernel to implement the MMD loss (Gretton et al., 2012). The kernel number is 5. The weight for the MMD loss is set to 1, tuned from {1, 0.1, 0.5}.

Adv. We follow Ganin et al. (2016) to reverse the gradients from the domain classifier. We set the learning rate for Adv to be the same as for the other baselines, but set the weight for the domain classifier as 0.01, tuned from {1, 0.1, 0.01, 0.001}.

xlmr-tuning. The fine-tuning baseline uses the first [CLS] token as the document representation. The learning rate is 1e-5 and the batch size for gradient update is 32. The fine-tuning models generally overfit the training data within 5 epochs.
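The KL baseline above can be sketched as follows, under the assumption (our reading of Zhuang et al. (2015); the exact form may differ) that the loss compares batch-mean predicted label distributions symmetrically:

```python
import math

def mean_distribution(batch_probs):
    """Average the predicted class distributions over a batch of size n."""
    n, k = len(batch_probs), len(batch_probs[0])
    return [sum(p[c] for p in batch_probs) / n for c in range(k)]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl_loss(src_probs, tgt_probs):
    """Symmetric KL between batch-mean source and target predictions."""
    ps = mean_distribution(src_probs)
    pt = mean_distribution(tgt_probs)
    return kl(ps, pt) + kl(pt, ps)
```

The loss is zero when the two domains' mean predicted label distributions match, and grows as they diverge.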

B Results on Benchmark
Benchmark. This is a benchmark dataset for domain adaptation (Blitzer et al., 2007), whose reviews are also in English. Four domains are included: Book (B), DVDs (D), Electronics (E), and Kitchen (K). Each domain has 1,000 positive and 1,000 negative reviews. Following He et al. (2018), there are 4,000 unlabeled reviews for each domain. Table 9 summarizes the data split when training on Benchmark. The unlabeled set is the combination of the training set and the original unlabeled set. Table 10 shows the results on Benchmark.

C Further Analysis
Size of Negative Sample Set. We study how the size of negative sample set will affect Fd training.
The results are shown in Fig. 5. The method used is xlmr-10+Fd. We find that a size that is too small or too big is not a good strategy for Fd learning; a size of 10 is a good option.
Effects of Sharpening on the Attention Mechanism. In Fig. 6, we show the effects of the sharpening mechanism in our attention method. When sharpening is not used (τ is ∞), the performance drops, and τ set to 0.3 is a good option for our attention method.

Table 11: Full classification accuracy (%) results on MultiAmazon. Models are evaluated over 5 random runs, except xlmr-tuning, which is run 3 times to save time. We report the mean and standard deviation. The best task performance is boldfaced.