Do you have the right scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods

Pre-training a language model on a large corpus and then fine-tuning it on task-specific data has become a common practice. In practice, we observe that fine-tuning a pre-trained model on a small dataset can lead to over- and/or under-estimation problems. In this paper, we propose MC-Tailor, a novel method that alleviates this issue in text generation tasks by truncating and transferring probability mass from over-estimated regions to under-estimated ones. Experiments on a variety of text generation datasets show that MC-Tailor consistently and significantly outperforms the standard fine-tuning approach.


Introduction
Recently, pre-trained language models (PLMs), e.g., GPT-2 (Radford et al., 2019), have shown great promise in many applications of natural language generation, such as stylized text generation (Syed et al., 2019) and dialog systems (Wolf et al., 2019). A PLM is obtained by first pre-training on large-scale raw text (usually a general-domain corpus) and is then used in downstream tasks by fine-tuning on task-specific datasets (usually from specific domains). For example, to generate sentences in the email domain with a pre-trained GPT-2 model, we typically need to fine-tune GPT-2 on a small email-domain corpus.
However, we argue that fine-tuning a PLM on a specific-domain dataset is not necessarily the best way to obtain the desired outputs, especially when the fine-tuning dataset is small. Typically, fine-tuning is conducted through Maximum Likelihood Estimation (MLE), under which the resulting model distribution is asymptotically consistent with the true distribution when the fine-tuning dataset contains infinitely many samples. But this is not the case when fine-tuning on small datasets, which often leads to a mismatch between the real and model distributions.

Figure 1: The over- and under-estimation problems of the model distribution. For example, sample b represents the simple sentence "Yes .", whose probability is over-estimated: its model NLL (negative log-likelihood) of 4.01 is significantly lower than the 95% confidence interval of its real NLL, [4.89, 5.37], which is estimated on the training set.
Specifically, MLE minimizes the Kullback-Leibler (KL) divergence between the model and true distributions. Theis et al. (2016) point out that minimizing KL avoids assigning extremely small probabilities to any data point but assigns considerable probability mass to non-data regions, which leads to a gap between $P_{\text{Real}}$ and $P_{\text{Model}}$. Additionally, simple data patterns in the fine-tuning dataset can easily be memorized and over-estimated, while complex ones may be under-estimated. This problem is not severe when data samples are adequate, but it becomes non-trivial when the fine-tuning dataset is not large enough (see Figure 1).
To address the over- and under-estimation problem, in this paper we propose MC-Tailor, which tailors the density of the model distribution by cutting probability mass from over-estimated zones and reassigning it to under-estimated zones, leading to a more realistic model distribution after fine-tuning. Concretely, MC-Tailor consists of two components: a ratio estimator, which distinguishes over- and under-estimated regions of the model distribution; and an Early Rejection Sampling (ERS) component, which tailors (reassigns) probability mass and efficiently obtains sentence samples from the tailored distribution. The proposed ERS is inspired by Sequential Monte Carlo (SMC; Doucet et al., 2000) but avoids SMC's degeneracy, as it directly kills samples rather than performing resampling.
We conduct experiments on various datasets to verify the effectiveness of MC-Tailor. Empirical results show that MC-Tailor generates significantly better samples than fine-tuning, and that the resulting model distributions are closer to the real data distributions.

Pre-Trained Language Model
Language models generally estimate the density of sentences in real text in an autoregressive style:

$$P(x) = \prod_{i=1}^{N} P(x_i \mid x_{1:i-1}), \quad (1)$$

where $x$ is a sentence of length $N$. Recently, with extremely large numbers of parameters, pre-trained language models such as GPT-2 (Radford et al., 2019) and Transformer-XL (Dai et al., 2019) have shown great promise in text generation. PLMs are first trained on a huge general-domain dataset and then fine-tuned on specific-domain datasets for different downstream tasks. For example, given a pre-trained GPT-2 model, to generate sentences in the email domain we typically need to fine-tune GPT-2 on a small email-domain corpus. PLMs also have other important applications: Miao et al. (2019) use fine-tuned language models for constrained text generation, and Wolf et al. (2019) fine-tune GPT-2 on a dialog dataset to boost the performance of a dialog system. However, as stated in the Introduction, directly fine-tuning a PLM on a small dataset may lead to the mismatch problem, namely over- and under-estimation of the true distribution by the model distribution. In the next section, we propose a new method to alleviate this problem.
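For concreteness, the following is a minimal sketch (not from the paper) of how the autoregressive density in Eq. (1) can be evaluated for a fine-tuned GPT-2. It assumes the HuggingFace transformers library; the "gpt2" checkpoint name and the helper function are illustrative choices.

```python
# Minimal sketch: score a sentence under a (fine-tuned) GPT-2 language model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # 117M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()   # in practice, fine-tuned on domain data

def sentence_nll(sentence: str) -> float:
    """Negative log-likelihood of a sentence under the language model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        loss = model(ids, labels=ids).loss
    return loss.item() * (ids.size(1) - 1)  # approximate sum over predicted tokens

print(sentence_nll("Yes ."))
```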

Proposed MC-Tailor
To mitigate the above shortcomings of fine-tuning, we propose MC-Tailor, which generates samples from a modified distribution. MC-Tailor is composed of a ratio estimator, which detects over- and under-estimated regions of the model distribution, and the Early Rejection Sampling (ERS) algorithm, which accelerates sampling while ensuring sample quality.

Ratio Estimator
A ratio estimator is a common technique for measuring the gap between two related distributions (Yuxuan et al., 2020). In this work, we apply a ratio estimator $\gamma(x)$ to estimate $\frac{P_{\text{Model}}(x)}{P_{\text{True}}(x)}$, the ratio between the probability of sentence $x$ under the fine-tuned model distribution $P_{\text{Model}}(x)$ and under the true distribution $P_{\text{True}}(x)$. To tailor the probability of a fine-tuned PLM, we cut the probabilities of over-estimated samples. Specifically, when $\gamma(x) > 1$, i.e., the model over-estimates the probability of sample $x$, we remove $x$ with probability $1 - \frac{1}{\gamma(x)}$ to approximate $P_{\text{True}}(x)$. After normalization, the probabilities of under-estimated areas increase correspondingly. The resulting distribution is

$$P_{\text{Tailor}}(x) \propto \frac{P_{\text{Model}}(x)}{\max(\gamma(x), 1)}.$$

In this work, we try several different structures of ratio estimators.

Convolutional Ratio Estimator. Since ratio estimation shares similar properties with classification, and convolutional neural networks (CNNs) are powerful classifiers, our first choice is a CNN-based ratio estimator. Concretely, we use a two-layer CNN $D(x)$ to predict whether $x$ comes from the true distribution or the learned distribution. Trained with the cross-entropy loss, the classifier converges to

$$D(x) = \frac{P_{\text{True}}(x)}{P_{\text{True}}(x) + P_{\text{Model}}(x)}. \quad (2)$$

Naturally, we define

$$\gamma(x) = \frac{1 - D(x)}{D(x)}. \quad (3)$$

Dual Ratio Estimator. Though the basic convolutional ratio estimator is easy to apply, it makes sampling inefficient. For most sentences $x$, we can roughly predict whether $x$ belongs to the target domain or suffers from over-fitting from its first few words. However, $\gamma(x)$ can only be computed after a full sentence has been generated, so massive computing resources are wasted on generating unpromising samples.
To determine whether a prefix $x_{1:i}$ is promising, we would like to estimate the minimum of $\gamma(x)$ over all sentences $x$ with prefix $x_{1:i}$: if even this minimum is too large, no completion of the prefix can be accepted. However, if we directly train a prefix-level ratio estimator on $x_{1:i}$, we end up estimating the average value of $\gamma(x)$ over all sentences with prefix $x_{1:i}$ rather than the minimum, so some sentences with low $\gamma(x)$ will be erroneously rejected. Luckily, the properties of the min-max dual shed some light on this problem. We first define $\tilde{\gamma}(x) = \max_i \tilde{\gamma}(x_{1:i})$ as the dual form of $\gamma(x)$, where $\tilde{\gamma}(x_{1:i})$ is a learned prefix score. Under some weak conditions, we can prove that if $\tilde{\gamma}(x)$ approximates $\frac{P_{\text{Model}}(x)}{P_{\text{True}}(x)}$, then $\tilde{\gamma}(x_{1:i})$ approximates the minimum of $\gamma(x)$ over sentences $x$ with prefix $x_{1:i}$. Similar to training $\gamma(x)$, we train $\tilde{\gamma}(x)$ by distinguishing $P_{\text{True}}(x)$ from $P_{\text{Model}}(x)$. Since $\tilde{\gamma}(x)$ is a function of the prefix scores $\tilde{\gamma}(x_{1:i})$, this also yields a proper set of parameters for $\tilde{\gamma}(x_{1:i})$.

Hierarchical Ratio Estimator. Since a single ratio estimator may not be powerful enough to accurately estimate $\frac{P_{\text{Model}}(x)}{P_{\text{Real}}(x)}$, we distribute the workload over several estimators $\gamma_i(x)$ in the spirit of boosting. We first train $\gamma_0(x)$ to estimate $\frac{P_{\text{Model}}(x)}{P_{\text{Real}}(x)}$ and obtain $P^{0}_{\text{Tailor}}(x)$. We then use $\gamma_1(x)$ to estimate the gap between $P_{\text{Real}}$ and $P^{0}_{\text{Tailor}}(x)$, and so on. With the collaboration of the $\gamma_i(x)$, we obtain a more accurate $P^{n}_{\text{Tailor}}(x)$. Using hierarchical ratio estimators also avoids relying on a single but complicated ratio estimator, which is prone to over-fitting. Similarly, we can add hierarchy to the dual ratio estimator to obtain a hierarchical dual ratio estimator.
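As an illustration of how such a ratio estimator might be implemented, the PyTorch sketch below is our own assumption rather than the authors' code: a two-layer CNN classifier $D(x)$, converted to a ratio estimate via Eq. (3), together with the acceptance probability used for tailoring. The filter and kernel sizes follow the configuration described later in the Experiments section; the embedding dimension is an arbitrary placeholder.

```python
# Hypothetical sketch of the convolutional ratio estimator: a 2-layer CNN
# classifier D(x), trained with binary cross-entropy to separate real sentences
# (label 1) from model samples (label 0), then converted to a density ratio.
import torch
import torch.nn as nn

class ConvRatioEstimator(nn.Module):
    def __init__(self, vocab_size, emb_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv1 = nn.Conv1d(emb_dim, 10, kernel_size=5, padding=2)  # (filters, kernel) = (10, 5)
        self.conv2 = nn.Conv1d(10, 5, kernel_size=5, padding=2)        # (filters, kernel) = (5, 5)
        self.out = nn.Linear(5, 1)

    def forward(self, token_ids):                       # token_ids: (batch, length)
        h = self.emb(token_ids).transpose(1, 2)         # (batch, emb_dim, length)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h)).mean(dim=2)       # pool over positions
        return torch.sigmoid(self.out(h)).squeeze(-1)   # D(x) = P(x is real)

def gamma(d: torch.Tensor) -> torch.Tensor:
    """Ratio estimate P_Model(x) / P_True(x) from the classifier output D(x), Eq. (3)."""
    return (1.0 - d) / d.clamp(min=1e-6)

def accept_prob(d: torch.Tensor) -> torch.Tensor:
    """Acceptance probability 1 / max(gamma(x), 1) used to tailor P_Model."""
    return 1.0 / torch.clamp(gamma(d), min=1.0)
```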

Efficient Sampling
In this part, we introduce the Early Rejection Sampling (ERS) algorithm specially designed for MC-Tailor. Building on Sequential Monte Carlo, ERS can efficiently generate samples with high diversity.
Rejection Sampling. With RS, we first generate a batch of samples from $P_{\text{Model}}$ and then reject each sample $x$ with probability $1 - \frac{1}{\max(\gamma(x), 1)}$. However, RS is very inefficient in practice, since it only rejects samples at the end of sampling. As shown in Figure 2a, a lot of computation is wasted on ultimately rejected samples.
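A minimal sketch of this rejection step is given below, assuming two hypothetical helpers: sample_sentence(), which draws a full sentence from $P_{\text{Model}}$, and gamma(x), the trained ratio estimator from the previous section.

```python
# Minimal sketch of plain rejection sampling over full sentences.
import random

def rejection_sample(sample_sentence, gamma, n_samples):
    """Draw n_samples sentences from P_Tailor by plain rejection sampling."""
    accepted = []
    while len(accepted) < n_samples:
        x = sample_sentence()                           # full sentence from the fine-tuned PLM
        if random.random() < 1.0 / max(gamma(x), 1.0):  # keep with prob 1 / max(gamma(x), 1)
            accepted.append(x)
        # otherwise the entire generation was wasted, which motivates ERS below
    return accepted
```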
Sequential Monte Carlo. Instead of rejecting samples at the end of sampling, SMC performs resampling at each step. The unnormalized resampling weight at step $i$ is

$$w_i = \frac{\max(\tilde{\gamma}(x_{1:i-1}), 1)}{\max(\tilde{\gamma}(x_{1:i}), 1)},$$

leading to an asymptotically unbiased estimator. However, SMC suffers from a serious degeneracy problem: samples from SMC tend to share a very small number of ancestors, because most ancestors are killed during resampling. As a result, the sample diversity of SMC is critically low.
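For intuition, the sketch below shows a generic multinomial resampling step (our illustration, not the paper's implementation); the weights are assumed to come from the prefix ratio estimator as described above. Low-weight prefixes are dropped and high-weight prefixes are duplicated, which is exactly the mechanism that collapses ancestor diversity.

```python
# Sketch of one multinomial resampling step over partial sentences (particles).
import numpy as np

def resample(prefixes, weights):
    """Resample particles in proportion to their (unnormalized) weights."""
    probs = np.asarray(weights, dtype=float)
    probs /= probs.sum()
    idx = np.random.choice(len(prefixes), size=len(prefixes), p=probs)
    # Duplicated indices mean several surviving particles now share one ancestor,
    # which is the source of SMC's degeneracy and low sample diversity.
    return [prefixes[i] for i in idx]
```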
Early Rejection Sampling. To overcome the degeneracy problem of SMC and increase sample diversity, we propose the Early Rejection Sampling (ERS) algorithm. ERS first uniformly samples a real number $r$ in $(0, 1)$. After step $i$, if $\tilde{\gamma}(x_{1:i}) > \frac{1}{r}$, the particle is killed immediately and its computation resources are released to parallel threads. The main difference between ERS and RS is that ERS kills unpromising particles before they are fully generated. Unlike SMC, there is no correlation between ERS samples, resulting in higher sample diversity.
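The following sketch shows the core ERS loop under the assumptions above; next_token and gamma_prefix are hypothetical helpers standing in for the fine-tuned PLM and the dual (prefix) ratio estimator, and the EOS marker is illustrative.

```python
# Minimal sketch of Early Rejection Sampling (ERS).
import random

EOS = "<eos>"  # hypothetical end-of-sentence marker

def ers_sample(next_token, gamma_prefix, max_len=50):
    """Generate one sentence from the tailored distribution via ERS."""
    while True:                                   # retry until a particle survives
        r = random.random()                       # rejection threshold, drawn once per particle
        prefix = []
        for _ in range(max_len):
            prefix.append(next_token(prefix))
            if gamma_prefix(prefix) * r > 1.0:    # equivalent to gamma_prefix > 1/r: early kill
                break                             # abandon this particle immediately
            if prefix[-1] == EOS:
                return prefix                     # full sentence accepted (final check passed)
        # killed, or hit max_len without EOS: start a new particle
```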

Experiments
In this section, we empirically compare the sample quality of our model with that of baseline models. We first describe the experimental setup and then present the results.

Experimental Setup
We conduct experiments on 9 datasets with different styles and sizes, and use five different metrics, including human evaluation, to measure the generation performance of each method.

Datasets. We use the following datasets for our experiments.
• Ontonotes (Pradhan et al., 2013) is a multi-genre dataset for sequence annotation. We use sentences from six genres (bn, bc, mz, nw, tc, wb) for the experiments.
• Switchboard (Jurafsky et al., 1997) and DailyDialog (Li et al., 2017) are large- and medium-scale dialog datasets, of which only the responses are used for the experiments.
• IWSLT-16 (Cettolo et al., 2016) is a dataset of paired conference speeches for machine translation. We use the English sentences from the De-En pairs to test model performance on the specialized conference-speech domain.

Evaluation Metrics. To evaluate generation quality and diversity, we use the following metrics.
• PPL reflects the average density of test-set samples under a generative model. Models with lower PPL have model distributions closer to the real data. Unlike the baseline models, MC-Tailor only provides an unnormalized log-probability; we estimate its normalization constant by importance sampling and compute PPL from the normalized log-probability.
• Rev-PPL is a good indicator of both sample quality and diversity. It is obtained by training a language model on generated samples and then computing the PPL of the test set under that language model.
• EMD-l is the earth mover distance between the sentence-length distributions of real and generated data (see the sketch after this subsection).
• EMD-f is the earth mover distance between the word-frequency distributions of real and generated data.
• Human Evaluation Score is included to reflect overall sample quality. We ask 4 volunteers to assign each sample a score from {0, 0.5, 1} according to its fluency and coherence with the target style. In 85% of cases, at least three volunteers give the same score, showing the reliability of the human evaluation.

Model Details. In all experiments, we use the released GPT-2 with 117M parameters as the pre-trained language model. We first fine-tune GPT-2 on each dataset and then build our tailor on top of it; early stopping is applied to avoid over-fitting. For the ratio estimators, we use simple CNNs with two convolution layers, whose (filter number, kernel size) are set to (10, 5) and (5, 5), respectively.
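As an example of how a metric such as EMD-l could be computed, the sketch below uses SciPy's one-dimensional Wasserstein distance on sentence lengths. This is an assumption about the implementation, not code from the paper, and the toy sentences are illustrative.

```python
# Hypothetical sketch of EMD-l: the earth mover (Wasserstein-1) distance between
# the sentence-length distributions of real and generated data.
from scipy.stats import wasserstein_distance

def emd_l(real_sentences, generated_sentences):
    real_lens = [len(s.split()) for s in real_sentences]
    gen_lens = [len(s.split()) for s in generated_sentences]
    return wasserstein_distance(real_lens, gen_lens)

print(emd_l(["the cat sat on the mat .", "yes ."],
            ["a dog ran .", "no thanks ."]))
```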

Experimental Results
Rev-PPLs of different models are shown in Table 1. MC-Tailor significantly reduces Rev-PPL compared with the fine-tuning baseline on datasets of different sizes, from Ontonotes-mz with only 7k training samples to the relatively large Switchboard dataset with more than 200k samples. We also notice that multi-layer MC-Tailor$_{\text{ERS}}$ performs better than single-layer MC-Tailor$_{\text{RS}}$, which confirms the point in Section 3.2 that the gap between $P_{\text{Model}}$ and $P_{\text{Data}}$ is too complex for a single-layer ratio estimator to capture. The sample NLLs of each method (Table 2) further confirm that MC-Tailor succeeds in decreasing the probabilities of over-estimated simple patterns and reallocating them to under-estimated samples.
We further compare MC-Tailor with the baseline model under the other metrics. From Table 4, we find that MC-Tailor greatly reduces PPL, which indicates a higher probability of generating samples similar to the test samples. Together with the lower EMD-l and EMD-f, we can conclude that the sample distributions of MC-Tailor are closer to the real sample distributions. Moreover, the human evaluation scores of MC-Tailor are about 10% higher than those of fine-tuning, which indicates better sample quality to human eyes. The cases shown in Table 3 further demonstrate the advantage of MC-Tailor in fluency and informativeness. We also experimented with Seq-GAN; however, the Rev-PPLs of GANs are even higher than those of directly fine-tuned GPT-2, and they are especially difficult to train, so we exclude Seq-GAN from the baseline models.
The acceleration effect of ERS is also verified in the experiments. For MC-Tailor with 1, 2, and 3 layers of ratio estimators, ERS saves 30%, 79%, and 90% of the computation wasted on unpromising samples, achieving 1.5x, 2.8x, and 5x speedups, respectively.

Conclusion
In this paper, we propose MC-Tailor to alleviate the over- and under-estimation problem between the true and model distributions. MC-Tailor is composed of a ratio estimator, which adjusts the probabilities of an MLE fine-tuned PLM to approximate the true distribution, and the ERS algorithm, which accelerates sampling while ensuring sample quality. Experiments on various datasets show the effectiveness and efficiency of MC-Tailor.

Table 4: PPL, EMD-l, EMD-f, and human evaluation scores of MC-Tailor$_{\text{ERS}}$ with 3 layers versus fine-tuning. The MCT column indicates whether our proposed MC-Tailor is used or the model is directly fine-tuned. SB and DD denote the Switchboard and DailyDialog datasets, respectively. By one-tail t-tests, we find that the improvements in human evaluation scores are significant, with p-values smaller than 0.05.