Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining

Pre-trained neural language models bring significant improvements to various NLP tasks, by fine-tuning the models on task-specific training sets. During fine-tuning, the parameters are initialized from pre-trained models directly, which ignores how the learning processes of similar NLP tasks in different domains are correlated and mutually reinforced. In this paper, we propose an effective learning procedure named Meta Fine-Tuning (MFT), serving as a meta-learner to solve a group of similar NLP tasks for neural language models. Instead of simply multi-task training over all the datasets, MFT only learns from typical instances of various domains to acquire highly transferable knowledge. It further encourages the language model to encode domain-invariant representations by optimizing a series of novel domain corruption loss functions. After MFT, the model can be fine-tuned for each domain with better parameter initializations and higher generalization ability. We implement MFT upon BERT to solve several multi-domain text mining tasks. Experimental results confirm the effectiveness of MFT and its usefulness for few-shot learning.


Introduction
Recent years have witnessed a boom in pre-trained neural language models. Notable works include ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), Transformer-XL, ALBERT (Lan et al., 2019), StructBERT (Wang et al., 2019b) and many others. These models revolutionize the learning paradigms of various NLP tasks. After pre-training, only a few fine-tuning epochs are required to train models for these tasks.
The "secrets" behind this phenomenon owe to the models' strong representation learning power to encode the semantics and linguistic knowledge from massive text corpora (Jawahar et al., 2019;Kovaleva et al., 2019;Liu et al., 2019a;Tenney et al., 2019). By simple fine-tuning, models can transfer universal Natural Language Understanding (NLU) abilities to specific tasks (Wang et al., 2019a). However, state-of-the-art language models mostly utilize self-supervised tasks during pre-training (for instance, masked language modeling and next sentence prediction in BERT (Devlin et al., 2019)). This unavoidably creates a learning gap between pre-training and fine-tuning. Besides, for a group of similar tasks, conventional practice requires the parameters of all task-specific models to be initialized from the same pre-trained language model, ignoring how the learning processes in different domains are correlated and mutually reinforced.
A basic solution is fine-tuning models by multi-task learning. Unfortunately, multi-task fine-tuning of BERT does not necessarily yield better performance across all the tasks (Sun et al., 2019a). A probable cause is that learning too much from other tasks may force the model to acquire non-transferable knowledge, which harms the overall performance. A similar finding is presented in Bingel and Søgaard (2017); McCann et al. (2018) on multi-task training of neural networks. Additionally, language models such as BERT do not have the "shared-private" architecture (Liu et al., 2017) to enable effective learning of domain-specific and domain-invariant features. Other approaches modify the structures of language models to accommodate multi-task learning and mostly focus on specific applications, without providing a unified solution for all the tasks (Stickland and Murray, 2019;Zhou et al., 2019b;Gulyaev et al., 2020).
A recent study (Finn et al., 2017) reveals that meta-learning achieves better parameter initialization for a group of tasks, which improves the models' generalization abilities in different domains and makes them easier to fine-tune. As pre-trained language models have general NLU abilities, they should also be able to learn to solve a group of similar NLP tasks. In this work, we propose a separate learning procedure, inserted between pre-training and fine-tuning, named Meta Fine-Tuning (MFT). This work is one of the early attempts at improving the fine-tuning of neural language models by meta-learning. Take the review analysis task as an example. MFT only aims to learn the polarity of reviews (positive or negative) in general, ignoring features of specific aspects or domains.
After that, the learned model can be adapted to any domain by fine-tuning. The comparison between fine-tuning and MFT is shown in Figure 1. Specifically, MFT first learns the embeddings of class prototypes from multi-domain training sets, and assigns typicality scores to individual instances, indicating the transferability of each instance. Apart from minimizing the multi-task classification loss over typical instances, MFT further encourages the language model to learn domain-invariant representations by jointly optimizing a series of novel domain corruption loss functions. For evaluation, we implement the MFT strategy upon BERT (Devlin et al., 2019) for three multi-domain text mining tasks: i) natural language inference (Williams et al., 2018) (sentence-pair classification), ii) review analysis (Blitzer et al., 2007) (sentence classification) and iii) domain taxonomy construction (Luu et al., 2016) (word-pair classification). Experimental results demonstrate the effectiveness and superiority of MFT. We also show that MFT is highly useful for multi-domain text mining in the few-shot learning setting.

Related Work
We overview recent advances on pre-trained language models, transfer learning and meta-learning.
Pre-trained language models have gained much attention from the NLP community recently. Among these models, ELMo (Peters et al., 2018) learns context-sensitive embeddings for each token from both left-to-right and right-to-left directions. BERT (Devlin et al., 2019) is usually regarded as the most representative work, employing transformer encoders to learn language representations. The pre-training technique of BERT has also been improved in subsequent work. Follow-up works employ transformer-based architectures, including Transformer-XL, XLNet, ALBERT (Lan et al., 2019), StructBERT (Wang et al., 2019b) and many more. They change the unsupervised learning objectives of BERT in pre-training. After language models are pre-trained, they can be fine-tuned for a variety of NLP tasks. The techniques for fine-tuning BERT are summarized in Sun et al. (2019a). Cui et al. (2019) improve BERT's fine-tuning by sparse self-attention. Arase and Tsujii (2019) introduce the concept of "transfer fine-tuning", which injects phrasal paraphrase relations into BERT. Compared to these methods, fine-tuning for multi-domain learning has not been sufficiently studied.
Transfer learning aims to transfer resources or models from one domain (the source domain) to another (the target domain), in order to improve the model performance in the target domain. Due to space limitation, we refer readers to the surveys (Pan and Yang, 2010;Lu et al., 2015;Zhuang et al., 2019) for an overview. For NLP applications, the "shared-private" architecture (Liu et al., 2017) is highly popular, which includes private sub-networks for learning domain-specific representations and a shared sub-network for knowledge transfer and domain-invariant representation learning. Recently, adversarial training has been frequently applied (Shen et al., 2018;Hu et al., 2019;Li et al., 2019b;Zhou et al., 2019a), where domain adversarial classifiers are trained to help the models learn domain-invariant features. Multi-domain learning is a special case of transfer learning whose goal is to transfer knowledge across multiple domains for mutual training reinforcement (Pei et al., 2018;Li et al., 2019a;Cai and Wan, 2019). Our work also addresses multi-domain learning, but solves the problem from a meta-learning perspective.
Compared to transfer learning, meta-learning is a slightly different learning paradigm. Its goal is to train meta-learners that can adapt to a variety of different tasks with little training data (Vanschoren, 2018), and it is mostly applied to few-shot learning. In NLP, existing meta-learning models mostly focus on training meta-learners for single applications, such as link prediction, dialog systems (Madotto et al., 2019), lexical relation classification (Wang et al., 2020) and semantic parsing. Dou et al. (2019) leverage meta-learning for low-resource NLU. Our work is more general and improves the generalization abilities and prediction performance of neural language models in various domains.

Figure 2: The neural architecture of MFT for BERT (Devlin et al., 2019). Due to space limitation, we only show two corrupted domain classifiers and three layers of transformer encoders, with others omitted.

MFT: The Proposed Framework
In this section, we start with some basic notations and an overview of MFT. After that, we describe the algorithmic techniques of MFT in detail.

Overview
Let $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{N_k}$ denote the training set of the kth domain, where $x_i^k$ and $y_i^k$ are the input text and the class label of the ith sample, respectively. $N_k$ is the total number of samples in $\mathcal{D}_k$. The goal of MFT is to train a meta-learner initialized from a pre-trained language model, based on the training sets of all K domains: $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$. The meta-learner provides better parameter initializations, such that it can be easily adapted to each of the K domains by fine-tuning the meta-learner over the training set of the kth domain separately.
Due to the large parameter space of neural language models, it is computationally challenging to search for the optimal values of the meta-learner's parameters. As discussed earlier, building a trivial multi-task learner over $\mathcal{D}$ does not guarantee satisfactory results either (Sun et al., 2019a). Here, we set up two design principles for MFT: Learning from Typicality and Learning Domain-invariant Representations, introduced as follows. (Note that $x_i^k$ can be either a single sentence, a sentence pair, or any other possible input text of the target NLP task.)

Learning from Typicality. To make the meta-learner easier to fine-tune to any domain, the encoded knowledge should be highly general and transferable, not biased towards specific domains. Hence, only typical instances w.r.t. all the domains should be the priority learning targets. We first generate multi-domain class prototypes from $\mathcal{D}$ to summarize the semantics of the training data. Based on the prototypes, we compute typicality scores for all training instances, which are treated as instance weights during MFT.
Learning Domain-invariant Representations. A good meta-learner should adapt to any domain quickly. Since BERT has strong representation power, this naturally motivates us to learn domain-invariant representations, which are vital for fast domain adaptation (Shen et al., 2018). In MFT, besides minimizing the classification loss, we jointly minimize new learning objectives to force the language model to learn domain-invariant encoders.
Based on the two general principles, we design the neural architecture of MFT, with the example for BERT (Devlin et al., 2019) shown in Figure 2. The technical details are introduced subsequently.

Learning from Typicality
Denote $\mathcal{M}$ as the class label set of all K domains, and $\mathcal{D}_m^k$ as the collection of input texts in $\mathcal{D}_k$ with class label $m \in \mathcal{M}$. As class prototypes can summarize the key characteristics of the corresponding data (Yao et al., 2020), we treat the class prototype $c_m^k$ as the averaged embedding of all the input texts in $\mathcal{D}_m^k$. Formally, we have

$c_m^k = \frac{1}{|\mathcal{D}_m^k|} \sum_{x_i^k \in \mathcal{D}_m^k} E(x_i^k)$

where $E(\cdot)$ maps $x_i^k$ to its d-dimensional embedding. As for BERT (Devlin et al., 2019), we utilize the mean pooling of the representations of $x_i^k$ from the last transformer encoder as $E(x_i^k)$. Ideally, we regard a training instance $(x_i^k, y_i^k)$ as typical if it is semantically close to its class prototype $c_m^k$, and also not too far away from the class prototypes generated from other domains, for high transferability. Therefore, the typicality score $t_i^k$ of $(x_i^k, y_i^k)$ can be defined as:

$t_i^k = \alpha \cos(E(x_i^k), c_m^k) + \frac{1-\alpha}{K-1} \sum_{\tilde{k}=1}^{K} \mathbb{1}(\tilde{k} \neq k) \cos(E(x_i^k), c_m^{\tilde{k}})$

where $\alpha$ is a pre-defined balancing factor ($0 < \alpha < 1$), $\cos(\cdot,\cdot)$ is the cosine similarity function and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the input boolean function is true and 0 otherwise. As one prototype may not be sufficient to represent the complicated semantics of a class (Cao et al., 2017), we can also generate multiple prototypes per class by clustering, with the jth prototype denoted as $c_{m_j}^k$. Here, $t_i^k$ is extended by replacing each $\cos(E(x_i^k), c_m^k)$ term with $\max_j \cos(E(x_i^k), c_{m_j}^k)$, i.e., the similarity to the closest prototype of the class in each domain.

After typicality scores are computed, we discuss how to set up the learning objectives for MFT. The first loss is the multi-task typicality-sensitive label classification loss $\mathcal{L}_{TLC}$. It penalizes the text classifier for predicting the labels of typical instances of all K domains incorrectly, defined as:

$\mathcal{L}_{TLC} = -\sum_{k=1}^{K} \sum_{i=1}^{N_k} t_i^k \log \tau_m(f(x_i^k))$

where $t_i^k$ serves as the weight of each training instance, and $\tau_m(f(x_i^k))$ is the predicted probability of $x_i^k$ having the class label $m \in \mathcal{M}$, with the d-dimensional "[CLS]" token embedding of the last layer of BERT (denoted as $f(x_i^k)$) as features. Here, we assume that the training instance $(x_i^k, y_i^k)$ has the class label $m \in \mathcal{M}$. Because each instance is associated with only one typicality score, for simplicity, we denote it as $t_i^k$ instead of $t_{i,m}^k$. For clarity, we omit all the regularization terms in the objective functions throughout this paper.
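To make the prototype and typicality computations described above concrete, the following is a minimal numpy sketch. It assumes the sentence embeddings $E(x_i^k)$ have already been extracted from BERT; all function and variable names are illustrative, not taken from the paper's released code.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def class_prototypes(embeddings, labels, domains):
    """protos[(m, k)]: mean embedding of instances with label m in domain k."""
    protos = {}
    for m in set(labels):
        for k in set(domains):
            idx = [i for i, (y, d) in enumerate(zip(labels, domains))
                   if y == m and d == k]
            if idx:
                protos[(m, k)] = embeddings[idx].mean(axis=0)
    return protos

def typicality(e, m, k, protos, num_domains, alpha=0.5):
    """alpha * similarity to the own-domain prototype, plus
    (1 - alpha) / (K - 1) * summed similarity to other domains' prototypes."""
    own = cosine(e, protos[(m, k)])
    others = [cosine(e, protos[(mm, kk)])
              for (mm, kk) in protos if mm == m and kk != k]
    return alpha * own + (1 - alpha) / (num_domains - 1) * sum(others)
```

In practice one prototype per class per domain (or several, via clustering) would be cached once before MFT training starts, since the scores only need to be computed in a single pass over the data.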

Learning Domain-invariant Representations
Based on previous research on domain-invariant learning (Shen et al., 2018;Hu et al., 2019), we could add an additional domain adversarial classifier for MFT to optimize. However, we observe that adding such classifiers to models such as BERT may be sub-optimal. For ease of understanding, we only consider two domains $k_1$ and $k_2$. The loss of the adversarial domain classifier $\mathcal{L}_{AD}$ is:

$\mathcal{L}_{AD} = -\sum_{i} \left[ y_i^k \log \sigma(x_i^k) + (1 - y_i^k) \log(1 - \sigma(x_i^k)) \right]$

where $y_i^k = 1$ if the domain is $k_1$ and 0 otherwise, and $\sigma(x_i^k)$ is the predicted probability of the adversarial domain classifier. In the min-max game of adversarial learning (Shen et al., 2018), we need to maximize the loss $\mathcal{L}_{AD}$ such that the domain classifier fails to predict the true domain label. The min-max game between encoders and adversarial classifiers is computationally expensive, which is less appealing for MFT over large language models. Additionally, models such as BERT do not have the "shared-private" architecture (Liu et al., 2017) frequently used for transfer learning. One can also replace $\mathcal{L}_{AD}$ by asking the classifier to predict the flipped domain labels directly (Shu et al., 2018;Hu et al., 2019). Hence, we can instead minimize the flipped domain loss $\mathcal{L}_{FD}$:

$\mathcal{L}_{FD} = -\sum_{i} \left[ (1 - y_i^k) \log \sigma(x_i^k) + y_i^k \log(1 - \sigma(x_i^k)) \right]$

We claim that applying $\mathcal{L}_{FD}$ to BERT as an auxiliary loss does not necessarily generate domain-invariant features. When $\mathcal{L}_{FD}$ is minimized, $\sigma(x_i^k)$ always tends to predict the wrong domain label (predicting $k_1$ for $k_2$ and $k_2$ for $k_1$). The optimization of $\mathcal{L}_{FD}$ still makes the learned features domain-dependent, since the domain information is encoded implicitly in $\sigma(x_i^k)$, only with the domain labels interchanged. A similar case holds for multiple domains, where we only force the classifier to predict the domain of the input text $x_i^{k_j}$ as any one of the remaining K − 1 domains (excluding $k_j$). Therefore, it is necessary to modify $\mathcal{L}_{FD}$ so that it truly guarantees domain invariance while avoiding the expensive (and sometimes unstable) computation of adversarial training.
In this work, we propose the domain corruption strategy to address this issue. Given a training instance $(x_i^k, y_i^k)$ of the kth domain, we generate a corrupted domain label $z_i$ from a corrupted domain distribution $\Pr(z_i)$. $z_i$ is unrelated to the true domain label k, and may or may not be the same as k. The goal of the domain classifier is to approximate $\Pr(z_i)$, instead of always predicting the incorrect domains as in Hu et al. (2019). In practice, $\Pr(z_i)$ can be defined with each domain uniformly distributed, if the K domain datasets are relatively balanced in size. To incorporate prior knowledge of the domain distributions into the model, $\Pr(z_i)$ can also be non-uniform, with domain probabilities estimated from $\mathcal{D}$ by maximum likelihood estimation.
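As a sketch, the corrupted domain distribution $\Pr(z)$ can be instantiated either uniformly or via maximum likelihood estimates of the domain frequencies; the helper names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def domain_distribution(domain_labels, uniform=True):
    """Pr(z): uniform over the K observed domains, or the MLE
    (relative frequencies) estimated from the training data."""
    counts = np.bincount(np.asarray(domain_labels))
    if uniform:
        return np.ones(len(counts)) / len(counts)
    return counts / counts.sum()

def sample_corrupted_domains(batch_size, domain_probs, seed=None):
    """Draw corrupted labels z_i independently of the true domain k."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(domain_probs), size=batch_size, p=domain_probs)
```

The key property is that the sampled $z_i$ carry no information about the instance's true domain, which is what the corruption classifier is asked to reproduce.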
Consider the neural architecture of transformer encoders in BERT (Devlin et al., 2019). Let $h_l(x_i^k)$ be the d-dimensional mean pooling of the token embeddings of $x_i^k$ in the lth layer (excluding the "[CLS]" embedding), i.e.,

$h_l(x_i^k) = \frac{1}{Max} \sum_{j=1}^{Max} h_{l,j}(x_i^k)$

where $h_{l,j}(x_i^k)$ represents the lth-layer embedding of the jth token in $x_i^k$, and $Max$ is the maximum sequence length. Additionally, we learn a d-dimensional domain embedding of the true domain label of $(x_i^k, y_i^k)$, denoted as $E_D(k)$. The input features are constructed by adding the two embeddings: $h_l(x_i^k) + E_D(k)$. The typicality-sensitive domain corruption loss of the classifier built on layer l, denoted $\mathcal{L}_{TDC}^{(l)}$, is:

$\mathcal{L}_{TDC}^{(l)} = -\sum_{k=1}^{K} \sum_{i=1}^{N_k} t_i^k \log \delta_{z_i}(h_l(x_i^k) + E_D(k))$

where $\delta_{z_i}(\cdot)$ is the predicted probability of the input features having the corrupted domain label $z_i$. We deliberately feed the true domain embedding $E_D(k)$ into the classifier to ensure that even if the classifier knows the true domain information from $E_D(k)$, it can only generate corrupted outputs. In this way, we force BERT's representations $h_l(x_i^k)$ to hide any domain information from being revealed, making the representations of $x_i^k$ domain-invariant. We further notice that neural language models may have deep layers. To improve the level of domain invariance across all layers and to strike a balance between effectiveness and efficiency, we follow Sun et al. (2019b) to train a series of skip-layer classifiers.
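The corruption loss for a single layer can be sketched in plain numpy as follows; `W` and `b` stand in for the corrupted domain classifier's softmax parameters, and all names are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def l_tdc(h_l, domain_emb, true_domains, corrupted, t, W, b):
    """Typicality-sensitive domain corruption loss for one layer.
    h_l: [B, d] mean-pooled layer-l features; domain_emb: [K, d].
    The classifier is fed h_l + E_D(true domain) but is trained to
    predict the corrupted labels z, weighted by the typicality t."""
    feats = h_l + domain_emb[true_domains]      # inject the true domain
    probs = softmax(feats @ W + b)              # [B, K] domain posteriors
    nll = -np.log(probs[np.arange(len(t)), corrupted] + 1e-12)
    return float(np.sum(t * nll))
```

Because the true domain is handed to the classifier explicitly via `domain_emb[true_domains]`, minimizing this loss pressures the pooled features `h_l` themselves to carry no residual domain signal.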
Denote $L_s$ as the collection of selected layer indices (for example, one can set $L_s = \{4, 8, 12\}$ for BERT-base). The skip-layer domain corruption loss $\mathcal{L}_{SDC}$ is defined as the average of the losses of all $|L_s|$ classifiers:

$\mathcal{L}_{SDC} = \frac{1}{|L_s|} \sum_{l \in L_s} \mathcal{L}_{TDC}^{(l)}$

In summary, the overall loss $\mathcal{L}$ for MFT to minimize is a linear combination of $\mathcal{L}_{TLC}$ and $\mathcal{L}_{SDC}$, i.e., $\mathcal{L} = \mathcal{L}_{TLC} + \lambda \mathcal{L}_{SDC}$, where $\lambda > 0$ is a tuned hyper-parameter.

Algorithm 1 Learning Algorithm for MFT
1: Restore BERT's parameters from the pre-trained model, with the others randomly initialized;
2: for each domain k and each class m do
3:   Compute the prototype embeddings $c_m^k$;
4: end for
5: for each training instance $(x_i^k, y_i^k) \in \mathcal{D}$ do
6:   Compute the typicality score $t_i^k$;
7: end for
8: while the number of training steps does not reach the limit do
9:   Sample a batch $\{(x_i^k, y_i^k)\}$ from $\mathcal{D}$;
10:  Shuffle the domain labels of $\{(x_i^k, y_i^k)\}$ to generate $\{z_i\}$;
11:  Estimate the model predictions of inputs $\{(x_i^k, k)\}$ and compare them against $\{(y_i^k, z_i)\}$;
12:  Update all model parameters by back propagation;
13: end while
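Combining the pieces, the loss of one MFT training step can be sketched as below; this is a simplified illustration in which the classification loss and the per-layer corruption losses are assumed to be computed elsewhere.

```python
def mft_loss(l_tlc, per_layer_tdc_losses, lam=0.1):
    """Overall MFT objective: L = L_TLC + lambda * L_SDC, where L_SDC
    averages the corruption losses of the selected layers L_s."""
    l_sdc = sum(per_layer_tdc_losses) / len(per_layer_tdc_losses)
    return l_tlc + lam * l_sdc
```

With the default λ = 0.1, the corruption term acts as a mild regularizer on top of the typicality-weighted classification loss rather than dominating it.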

Joint Optimization
We now describe how to optimize $\mathcal{L}$ for MFT. Based on the formulation of $\mathcal{L}$, it is trivial to derive that:

$\frac{\partial \mathcal{L}}{\partial \Theta} = \frac{\partial \mathcal{L}_{TLC}}{\partial \Theta} + \lambda \frac{\partial \mathcal{L}_{SDC}}{\partial \Theta}$

where $\Theta$ denotes all the model parameters. As seen, MFT can be efficiently optimized via gradient-based algorithms with slight modifications. The procedure is shown in Algorithm 1. It linearly scans the multi-domain training set $\mathcal{D}$ to compute the prototypes $c_m^k$ and the typicality scores $t_i^k$. Next, it updates the model parameters by batch-wise training. For each batch $\{(x_i^k, y_i^k)\}$, as an efficient implementation, we shuffle the domain labels to generate the corrupted labels $\{z_i\}$, as an approximation of sampling from the original corrupted domain distribution $\Pr(z_i)$. This trick avoids computation over the whole dataset, and can adapt to changes in the domain distributions if new training data is continuously added to $\mathcal{D}$ over time.
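The in-batch shuffle of Algorithm 1 can be sketched as follows: permuting the batch's true domain labels yields corrupted labels whose empirical distribution matches the batch's own domain distribution, approximating a draw from $\Pr(z_i)$ without a pass over the full dataset. The helper name is an illustrative assumption.

```python
import numpy as np

def shuffle_corrupt(domain_labels, seed=None):
    """Generate corrupted labels z by permuting the batch's true labels."""
    rng = np.random.default_rng(seed)
    z = np.asarray(domain_labels).copy()
    rng.shuffle(z)
    return z
```

Since shuffling preserves the multiset of labels, the empirical distribution of the corrupted labels automatically tracks the current data, even as new domains or examples are added over time.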
When the iterations stop, we remove all the additional layers that we have added for MFT, and fine-tune BERT for the K domains over their respective training sets, separately.

Experiments
We conduct extensive experiments to evaluate MFT on multiple multi-domain text mining tasks.

Datasets and Experimental Settings
We employ Google's pre-trained BERT model as the language model, with dimension d = 768. Three multi-domain NLP tasks are used for evaluation, with the statistics of the datasets reported in Table 1:
• Natural language inference: predicting the relation between two sentences as "entailment", "neutral" or "contradiction", using the MNLI dataset (Williams et al., 2018). MNLI is a large-scale benchmark corpus for evaluating natural language inference models, with multi-domain data divided into five genres.
• Review analysis: classifying the product review sentiment of the famous Amazon Review Dataset (Blitzer et al., 2007) (containing product reviews of four domains crawled from the Amazon website) as positive or negative.
• Domain taxonomy construction: predicting whether there exists a hypernymy ("is-a") relation between two terms (words/noun phrases) for taxonomy derivation. Labeled term pairs sampled from three domain taxonomies are used for evaluation. The domain taxonomies are constructed by Velardi et al. (2013), with labeled datasets created and released by Luu et al. (2016).
Because MNLI does not contain public labeled test sets that can be used for single-genre evaluation, we randomly sample 10% of the training data for parameter tuning and report performance on the original development sets. We hold out 2,000 labeled reviews from the Amazon dataset (Blitzer et al., 2007) and split them into development and test sets. As for the taxonomy construction task, because BERT does not naturally support word-pair classification, we combine each term pair into a sequence of tokens separated by the special token "[SEP]" as input. The ratios of the training, development and test sets of the three domain taxonomy datasets are set as 80%:10%:10%. The default hyper-parameter settings of MFT are as follows: α = 0.5, L_s = {4, 8, 12} and λ = 0.1. During model training, we run 1∼2 epochs of MFT and further fine-tune the model for 2∼4 epochs for each domain, separately. The initial learning rate is set as 2 × 10^-5 in all experiments. The regularization hyper-parameters, the optimizer and the remaining settings are the same as in Devlin et al. (2019). In MFT, only 7K∼11.5K additional parameters need to be added (depending on the number of domains), compared to the original BERT model. All the algorithms are implemented with TensorFlow and trained on an NVIDIA Tesla P100 GPU. The training takes less than one hour. For evaluation, we use Accuracy as the evaluation metric for all models trained via MFT and fine-tuning.
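For quick reference, the default settings above can be collected into a single configuration sketch; the dictionary keys are illustrative names, not from the paper's code.

```python
# Default MFT hyper-parameters reported in this section.
MFT_CONFIG = {
    "alpha": 0.5,                # typicality balancing factor
    "skip_layers": (4, 8, 12),   # L_s for BERT-base
    "lambda": 0.1,               # weight of L_SDC in the overall loss
    "learning_rate": 2e-5,       # initial learning rate
    "mft_epochs": (1, 2),        # 1-2 epochs of MFT
    "finetune_epochs": (2, 4),   # 2-4 fine-tuning epochs per domain
    "hidden_dim": 768,           # BERT-base hidden size d
}
```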

General Experimental Results
We report the general testing results of MFT. For fair comparison, we implement the following fine-tuning methods as strong baselines:
• BERT (S): It fine-tunes K BERT models, each with single-domain data.
We also evaluate the performance of MFT and its variants under the following three settings:
• MFT (TW): MFT with the typicality-weighted classification loss only;
• MFT (DC): MFT with the domain corruption loss only;
• MFT (Full): the full MFT approach with both components.
The results of the three multi-domain NLP tasks are reported in Table 2, Table 3 and Table 4, respectively. Generally, the performance trends of all three tasks are quite consistent. With MFT, the accuracy of fine-tuned BERT improves by 2.4% for natural language inference, 2.6% for review analysis and 3% for domain taxonomy construction. The simple multi-task fine-tuning methods do not yield large improvements, which is consistent with the conclusion of Sun et al. (2019a). Our method achieves the highest performance in 10 out of the 12 domains (genres) of the three tasks, outperforming previous fine-tuning approaches. For the ablation study, we compare MFT (DC), MFT (TW) and MFT (Full). The results show that domain corruption is more effective than typicality weighting on MNLI, but less effective on Amazon and Taxonomy.

Detailed Model Analysis
In this section, we present more experiments for a detailed analysis of MFT. We first study how many training steps of MFT should be performed before fine-tuning. As the datasets of different tasks vary in size, we tune the number of MFT epochs instead. In this set of experiments, we fix all parameters to their defaults, vary the number of MFT training epochs, and then run fine-tuning for 2 epochs for all domains. The results of two NLP tasks are shown in Figure 3. It can be seen that too many epochs of MFT can hurt the performance, because BERT may learn too much from other domains before fine-tuning on the target domain. We suggest that a small number of MFT epochs is sufficient for most cases. Next, we tune the hyper-parameter λ from 0 to 0.5, with the results shown in Figure 4. The inverted-V trends clearly reflect the balance between the two types of losses in MFT, with very few exceptions due to the fluctuation of the stochastic learning process. We also vary the number of corrupted domain classifiers by changing L_s. Due to space limitation, we only report the averaged accuracy across all domains, shown in Table 5. In a majority of scenarios, adding more corrupted domain classifiers slightly improves the performance, as it poses strong domain-invariance constraints on the deeper layers of the transformer encoders in BERT. For a more intuitive understanding of MFT, we present some cases from the Amazon Review Dataset with relatively extreme (low and high) typicality scores, shown in Table 6. As seen, review texts with low scores are usually related to certain aspects of specific products (for example, "crooked spoon handle" and "fragile glass"), whose knowledge is non-transferable for review analysis in general. In contrast, reviews with high typicality scores may contain expressions of review polarity that can be frequently found in various domains (for example, "huge deal" and "a waste of money").
From the cases, we can see how MFT can create a meta-learner that is capable of learning to solve NLP tasks in different domains.

Experiments for Few-shot Learning
Acquiring a sufficient amount of training data for emerging domains often poses challenges for NLP researchers. In this part, we study how MFT can benefit few-shot learning when the size of the training data in a specific domain is small. Because the original MNLI dataset (Williams et al., 2018) is large in size, we randomly sample 5%, 10% and 20% of the original training set for each genre, and do MFT and fine-tuning over BERT. For model evaluation, we use the entire development set without sampling. In this set of experiments, we do not tune any hyper-parameters and keep their default values.
Due to the small sizes of our few-shot training sets, we run MFT for only one epoch, and then fine-tune BERT for 2 epochs per genre.
In Table 7, we report the few-shot learning results, and compare them against fine-tuned BERT without MFT. From the experimental results, we can safely come to the following conclusions. i) MFT improves the performance for text mining of all genres in MNLI, regardless of the percentages of the original training sets we use. ii) MFT has a larger impact on smaller training sets (a 3.9% improvement in accuracy for 5% few-shot learning, compared to a 2.6% improvement for 20%).
iii) The improvement of applying MFT before fine-tuning is almost the same as doubling the training data size. Therefore, MFT is highly useful for few-shot learning when training data from other domains is available.