SCRIPT: Self-Critic PreTraining of Transformers

We introduce Self-CRItic Pretraining Transformers (SCRIPT) for representation learning of text. Popular masked language modeling (MLM) pretraining methods such as BERT replace some tokens with [MASK] and train an encoder to recover them, while ELECTRA trains a discriminator to detect tokens replaced by a generator. In contrast, we train a language model as in MLM and further derive a discriminator, or critic, on top of the encoder without using any additional parameters. That is, the model itself is a critic. SCRIPT combines MLM training and discriminative training to learn rich representations while being compute- and sample-efficient. We demonstrate improved sample efficiency in pretraining and enhanced representations, evidenced by improved downstream task performance on GLUE and SQuAD over strong baselines. In addition, the self-critic scores can be used directly as a pseudo-log-likelihood for efficient scoring.


Introduction
In natural language processing, the landscape of unsupervised learning methods is dominated by masked language modeling (MLM) for bidirectional encoders such as BERT (Devlin et al., 2018; Yang et al., 2019; Joshi et al., 2020; Lan et al., 2019; Lewis et al., 2020; Jiao et al., 2019), and causal masking for uni-directional autoregressive decoders such as GPT (Radford et al., 2018, 2019; Brown et al., 2020; Raffel et al., 2020). In MLM, an encoder is pretrained on a generic corpus of text with the hope of learning universal contextual embeddings, which are then fine-tuned on a specific down-stream task. Recent developments in causal masking, in contrast, aim to learn a large-scale model once and address down-stream tasks in an auto-regressive manner in the form of few-shot evaluation (Brown et al., 2020). In practice, while a universal autoregressive neural backbone model without the need for fine-tuning, such as GPT-3, is desirable, the computational complexity at inference time remains an open problem. While the two-stage approach of MLM with smaller models is computationally convenient, the pretraining still incurs a substantial computational cost. Hence, in this work, we focus on learning contextual bi-directional representations with the goal of improving sample efficiency.
In MLM, the input sequence of tokens is perturbed by randomly masking out a small subset of the tokens' identities (Devlin et al., 2018) or the attention scores to those tokens (Yang et al., 2019). Then, a generative model is learned as a denoising auto-encoder (Vincent et al., 2008) which recovers the masked-out tokens. While the learned contextual representations achieve remarkable performance on down-stream tasks, the pretraining requires substantial compute. This is mainly because the model receives a learning signal only from the restricted subset of masked tokens (Clark et al., 2020).
In ELECTRA (Clark et al., 2020), the input sequence is perturbed by replacing a subset of tokens with tokens sampled from an auxiliary generator model in the form of a bi-directional encoder, which is itself learned by MLM. Then, a discriminative model is learned with a binary classification task which detects whether a token is unperturbed or has been replaced. This approach enjoys remarkable sample efficiency, which, we believe, stems primarily from reducing the complexity of the classification task from masked token prediction over a large set of classes (i.e., a typical vocabulary size of 30,522 classes) to replaced token detection (i.e., 2 classes).
Figure 1: An overview of SCRIPT. We combine MLM and discriminative training in a single transformer encoder, exploiting the rich representations extracted through MLM training and the compute- and sample-efficiency of discriminative training, resulting in a simple yet effective pretraining approach for representation learning. Pretraining starts by replacing a small portion of tokens (e.g., 15%) in a text sequence $\mathbf{x}$ with [MASK], yielding $\hat{\mathbf{x}}$. The architecture of SCRIPT is a transformer encoder with a softmax output layer, producing a distribution over tokens, the same as any MLM model such as BERT. In the MLM forward pass, SCRIPT takes $\hat{\mathbf{x}}$ as input and outputs a distribution for each token. This distribution is first used to compute the MLM loss, $L_{MLM}$, the negative log-likelihood of recovering the masked tokens. It is then used to construct a Gumbel-Softmax distribution, from which $\tilde{\mathbf{x}}$ is sampled (indicated by the broken arrows in the figure). The critic forward pass takes $\tilde{\mathbf{x}}$ as input and goes through the same model. The output softmax distribution is used to construct a binary classifier that discriminates an original versus a replaced token. The discriminative training loss, $L_{Disc}$, is simply the cross-entropy of the derived binary classifier. Finally, a single backward pass is guided by the combination of $L_{MLM}$ and $L_{Disc}$.

Despite being less efficient, MLM training guides the model to learn rich representations. ELECTRA uses MLM only to learn the auxiliary generator, which is discarded after pretraining. We propose to combine MLM and discriminative
training. The resulting model thus has the rich representations from both MLM and discriminative learning and enjoys the compute and sample efficiency of discriminative training. Furthermore, instead of learning an auxiliary model in addition to the main encoder, our approach learns a single model which is leveraged to recover masked tokens, propose token replacements, and detect replaced tokens. Hence the encoder itself is also a critic, giving our model its name, Self-CRItic Pretraining Transformers (SCRIPT). Our experiments show that SCRIPT has improved compute and sample efficiency in pretraining and enhanced representations, hence outperforming strong baselines when fine-tuned on downstream tasks.
Contributions. (1) We propose a novel pretraining approach in which the model acts as a self-critic. (2) We demonstrate improved downstream task performance over the state of the art under computational constraints. (3) We show that the self-critic scores may serve as a computationally efficient pseudo-log-likelihood for scoring tasks.

Method
We propose a pretraining approach which combines masked token recovery and replaced token detection and does not introduce any additional parameters compared to a regular BERT. In the following sections, we first introduce MLM training, which is the same as in BERT, and then present self-critic training.
Suppose $\mathbf{x} = [x_1, \ldots, x_t, \ldots, x_T]$ is a text sequence where $x_t$ is the $t$th token. In MLM training, a portion of the tokens (e.g., 15%) is replaced with a special token [MASK]. Let $\hat{\mathbf{x}}$ be the sequence after the mask replacement and $e(\hat{\mathbf{x}}) = [e_1, \ldots, e_T]$ be the contextual representations computed by the transformer. Let $W \in \mathbb{R}^{V \times d}$ be the weight matrix of the softmax layer, where $V$ is the vocabulary size and $d$ is the hidden dimension. The logit or score for token $t$ is $s_t = W e_t \in \mathbb{R}^{V}$, and $p_\theta(x_t \mid \hat{\mathbf{x}}) = \mathrm{softmax}(s_t)_{x_t}$. Then the log-likelihood of the sequence $\mathbf{x}$ is

$$\log p_\theta(\mathbf{x} \mid \hat{\mathbf{x}}) = \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}}),$$

where $m_t \in \{0, 1\}$ indicates whether $x_t$ is a masked token, [MASK]. The loss function for MLM is the negative log-likelihood

$$L_{MLM}(\theta) = \mathbb{E}_{\mathbf{x} \sim p_{data}}\Big[-\sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})\Big],$$

where $p_{data}$ is the empirical data distribution. Besides defining the log-likelihood for MLM training, $p_\theta(x_t \mid \hat{\mathbf{x}})$ naturally provides a conditional distribution of $x_t$ with which we can construct a sampled sequence $\tilde{\mathbf{x}} = [\tilde{x}_1, \ldots, \tilde{x}_T]$ by replacing $x_t$ with $\tilde{x}_t$, a token sampled from $p_\theta(x_t \mid \hat{\mathbf{x}})$. $x_t$ is replaced only if it is masked in $\hat{\mathbf{x}}$ (i.e., $m_t = 1$). In particular, the replacement token is sampled from a Gumbel-Softmax distribution (Jang et al., 2016).
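To make the masking and the MLM objective concrete, below is a minimal PyTorch-style sketch. The `encoder` call, the shared weight matrix `W`, and names such as `mask_id` and `mask_prob` are illustrative assumptions for this sketch, not the paper's released implementation.

```python
# A minimal sketch of the MLM side of SCRIPT (assumed PyTorch-style pseudocode).
# `encoder` maps token ids to contextual representations e_t; `W` is the softmax
# weight matrix of shape [V, d]. All names and hyper-parameters are illustrative.
import torch
import torch.nn.functional as F

def mlm_loss(x, encoder, W, mask_id, mask_prob=0.15):
    """x: LongTensor [B, T] of token ids. Returns (loss, x_hat, mask, logits)."""
    # Sample the positions to mask (m_t = 1) and build the corrupted sequence x_hat.
    m = torch.rand(x.shape, device=x.device) < mask_prob          # [B, T] bool
    x_hat = torch.where(m, torch.full_like(x, mask_id), x)        # replace with [MASK]

    e = encoder(x_hat)                                            # [B, T, d]
    s = e @ W.t()                                                 # logits s_t, [B, T, V]

    # Negative log-likelihood of the original tokens, restricted to masked positions.
    nll = F.cross_entropy(s.view(-1, s.size(-1)), x.view(-1), reduction="none")
    loss = (nll * m.view(-1).float()).sum() / m.float().sum().clamp(min=1.0)
    return loss, x_hat, m, s
```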
We write $s_{tv}$ for the $v$th entry of the logit vector $s_t$ for notational clarity. Then the probability of sampling the $v$th token in the vocabulary as the replacement $\tilde{x}_t$ is

$$\pi_{tv} = \frac{\exp\big((s_{tv} + g_v)/\tau\big)}{\sum_{v'=1}^{V} \exp\big((s_{tv'} + g_{v'})/\tau\big)},$$

where $\{g_v\}_{v=1}^{V}$ are i.i.d. samples drawn from Gumbel(0, 1) (which can be sampled using inverse transform sampling by drawing $u \sim \mathrm{Uniform}(0, 1)$ and computing $g = -\log(-\log u)$) and $\tau$ is the temperature for sampling. The Gumbel-Softmax distribution $\pi_t$ approaches one-hot when $\tau$ is small (e.g., $\tau = 0.1$) and uniform when $\tau$ is large (e.g., $\tau = 10.0$).
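Continuing the sketch above, the replacements can be drawn with the Gumbel construction just described. The temperature `tau` and the use of a hard (argmax) sample are assumptions made for illustration.

```python
# A sketch of sampling replacement tokens x_tilde from the Gumbel-Softmax
# distribution over the MLM logits; only masked positions are replaced.
import torch

def sample_replacements(x, mask, logits, tau=1.0):
    """logits: [B, T, V] scores s_t from the MLM forward pass on x_hat."""
    # g_v ~ Gumbel(0, 1) via inverse transform sampling: g = -log(-log u).
    u = torch.rand_like(logits).clamp(1e-9, 1.0 - 1e-9)
    g = -torch.log(-torch.log(u))

    # pi_tv is proportional to exp((s_tv + g_v) / tau); a hard sample is its argmax.
    x_sampled = ((logits + g) / tau).argmax(dim=-1)               # [B, T]

    # Keep the original tokens at unmasked positions (m_t = 0).
    x_tilde = torch.where(mask, x_sampled, x)
    return x_tilde
```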
To apply discriminative training to the model, we derive a discriminator from the existing model and parameters. $\tilde{x}_t$ is considered a positive token if $\tilde{x}_t = x_t$, and a negative token if $\tilde{x}_t \neq x_t$. In MLM training, the last layer defines a $V$-class classifier with the parameters $W$. We can augment $W$ with an extra row for computing the score or logit of the negative token class, making it classify $V + 1$ classes. Denote the augmented weight matrix as $W^+$. The classification logits are then $s^+_t = W^+ \tilde{e}_t \in \mathbb{R}^{V+1}$, where $\tilde{e}_t$ is the contextual representation of the $t$th token of $\tilde{\mathbf{x}}$. However, it is unnecessary to bring in new parameters and over-parameterization, since subtracting an arbitrary function $f(\tilde{e}_t) \in \mathbb{R}$ from all the logits, $s^+_{tv} - f(\tilde{e}_t)\ \forall v = 1, \ldots, V+1$, does not change the softmax output. Thus we fix the last row of $W^+$ to all zeros, $\mathbf{0} \in \mathbb{R}^{1 \times d}$. We then have the logit for the $t$th token,

$$s^+_t = \big[\, W \tilde{e}_t \,;\, 0 \,\big] = \big[\, \tilde{s}_t \,;\, 0 \,\big] \in \mathbb{R}^{V+1}.$$

The probability of the $t$th token in $\tilde{\mathbf{x}}$ being a negative token is then

$$p_\theta(t^- \mid \tilde{\mathbf{x}}) = \frac{\exp(0)}{\exp(\tilde{s}_{t,\tilde{x}_t}) + \exp(0)} = \frac{1}{1 + \exp(\tilde{s}_{t,\tilde{x}_t})},$$

while the probability of being a positive token is

$$p_\theta(t^+ \mid \tilde{\mathbf{x}}) = \frac{\exp(\tilde{s}_{t,\tilde{x}_t})}{\exp(\tilde{s}_{t,\tilde{x}_t}) + \exp(0)} = \sigma(\tilde{s}_{t,\tilde{x}_t}),$$

where $t^-$ and $t^+$ indicate that $\tilde{x}_t$ is a negative and a positive token, respectively, and $\tilde{s}_{t,\tilde{x}_t}$ is the logit the critic forward pass assigns to the observed token $\tilde{x}_t$. The generator per se is thus also a critic or discriminator for replaced token detection, giving our model its name, self-critic. The loss of discriminative training is simply the cross-entropy loss,

$$L_{Disc}(\theta) = \mathbb{E}_{\mathbf{x} \sim p_{data}}\Big[-\sum_{t=1}^{T} \big(\mathbb{1}(\tilde{x}_t = x_t) \log p_\theta(t^+ \mid \tilde{\mathbf{x}}) + \mathbb{1}(\tilde{x}_t \neq x_t) \log p_\theta(t^- \mid \tilde{\mathbf{x}})\big)\Big].$$

The overall loss function of SCRIPT combines MLM and discriminative training,

$$L(\theta) = L_{MLM}(\theta) + \alpha L_{Disc}(\theta),$$

where $\alpha$ is a coefficient determining the strength of discriminative training. The learning of SCRIPT involves two forward passes through a single model, one for MLM with $\hat{\mathbf{x}}$ as input and one for discriminative training with $\tilde{\mathbf{x}}$ as input, followed by a single backward pass. Figure 1 gives an overview of our model.
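Putting the two passes together, a sketch of the critic forward pass and the combined objective might look as follows. It assumes the sigmoid form of $p_\theta(t^+ \mid \tilde{\mathbf{x}})$ reconstructed above, reuses the helper functions from the earlier sketches, and treats `alpha` as the loss coefficient; all of these are illustrative assumptions.

```python
# A sketch of the self-critic pass and the combined SCRIPT objective, under the
# assumption that the logit of the observed token competes with a fixed zero
# logit for the "replaced" class, i.e., p(t+ | x_tilde) = sigmoid(s_{t, x_tilde_t}).
import torch
import torch.nn.functional as F

def script_loss(x, encoder, W, mask_id, alpha=1.0):
    # MLM forward pass on the masked sequence x_hat (see the earlier sketch).
    l_mlm, x_hat, m, logits = mlm_loss(x, encoder, W, mask_id)

    # Sample replacements and build x_tilde (see the earlier sketch); no gradient
    # is propagated through the discrete samples, as in ELECTRA.
    x_tilde = sample_replacements(x, m, logits.detach())

    # Critic forward pass on x_tilde through the same encoder and softmax weights.
    e_tilde = encoder(x_tilde)                                    # [B, T, d]
    s_tilde = e_tilde @ W.t()                                     # [B, T, V]
    s_obs = s_tilde.gather(-1, x_tilde.unsqueeze(-1)).squeeze(-1) # s_{t, x_tilde_t}

    # Binary targets: 1 if the token is original (positive), 0 if replaced (negative).
    is_positive = (x_tilde == x).float()
    l_disc = F.binary_cross_entropy_with_logits(s_obs, is_positive)

    # Single backward pass is taken on the combined objective.
    return l_mlm + alpha * l_disc
```

A training step would then simply call `script_loss(batch, encoder, W, mask_id).backward()`, so that one backward pass covers both the MLM and the discriminative terms.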

Experiments
In the subsequent empirical evaluations, we shall address the following questions: (1) Does the learning as self-critic lead to competitive down-stream task performance? (2) Can we treat the self-critic scores as pseudo-log-likelihoods? (3) Is the sample efficiency improved over state-of-the-art baselines?
Hence, we train and evaluate two SCRIPT models, "small" and "base", with encoders of 14M and 110M parameters, respectively. For a direct comparison, the models are trained on the OpenWebText corpus (Gokaslan and Cohen, 2019) with identical pre-processing and optimization procedures as in (Devlin et al., 2018) and (Clark et al., 2020). We refer to the Appendix for details.

Transfer to Downstream Tasks
We evaluate the efficacy of our method on the GLUE natural language understanding benchmark (Wang et al., 2018) and the SQuAD 1.1 and 2.0 question answering datasets (Rajpurkar et al., 2016a). We report mean scores of the GLUE tasks over 8 fine-tuning runs with varying random seeds. For the evaluation on SQuAD, we re-trained the "small" models with a sequence length of 512 tokens. Table 1 shows improved scores across the benchmarks. The task-specific GLUE scores are shown in Table 2.

Efficient Pseudo-Log-Likelihood Scoring
In contrast to MLM and ELECTRA pretraining, SCRIPT allows for efficient computation of a pseudo-log-likelihood (PLL) for a given sequence $\mathbf{x}$,

$$\mathrm{PLL}(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{\setminus t}),$$

where $\mathbf{x}_{\setminus t}$ denotes the sequence with the $t$th token masked out. The PLL allows for the re-ranking of a set of sequences produced by an NMT or ASR system. While language models seem a natural fit for such a ranking problem, Salazar et al. (2019) show improved performance when ranking is based on the PLL. However, for a sequence with $T$ tokens, computing the PLL with an MLM requires $T$ forward passes, as each token has to be masked out in turn. Instead, we propose to use the self-critic scores, $\sum_{t=1}^{T} \log p_\theta(t^+ \mid \mathbf{x})$, as a measure of PLL, which requires only a single forward pass (a short sketch is given below).
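As a rough sketch, and again assuming the sigmoid form of $p_\theta(t^+ \mid \mathbf{x})$ from the Method section, scoring a batch of hypotheses with the self-critic PLL might look as follows; `encoder` and `W` are the pretrained SCRIPT encoder and softmax weights from the earlier sketches.

```python
# A sketch of using the self-critic scores as a PLL for re-ranking: a single
# forward pass per hypothesis, instead of T masked passes for an MLM-based PLL.
import torch
import torch.nn.functional as F

@torch.no_grad()
def self_critic_pll(x, encoder, W):
    """x: LongTensor [B, T] of (unmasked) candidate sequences. Returns [B] scores."""
    s = encoder(x) @ W.t()                                        # [B, T, V]
    s_obs = s.gather(-1, x.unsqueeze(-1)).squeeze(-1)             # logit of each observed token
    # log p(t+ | x) = log sigmoid(s_{t, x_t}); summing over tokens scores the sequence.
    return F.logsigmoid(s_obs).sum(dim=-1)

# Hypothetical usage: rank ASR/NMT hypotheses by their self-critic PLL.
# best = hypotheses[self_critic_pll(hypotheses, encoder, W).argmax()]
```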
Table 2: Comparison of small and base models on the GLUE dev set. The models were trained on the OpenWebText corpus (Gokaslan and Cohen, 2019) for 1,000,000 and 766,000 steps, respectively. The GLUE task scores are means of 8 runs over a set of random seeds. SCRIPT outperforms ELECTRA while enjoying a simple architecture and learning algorithm.

GLUE. Figure 2 depicts the improvement in the mean GLUE scores of ELECTRA-small and SCRIPT-small over the number of training steps. While the wall-clock time per training step of SCRIPT is higher than that of ELECTRA, the sample efficiency of SCRIPT in terms of the mean GLUE score per training step is also higher. Hence, the overall efficiency of the two methods may be comparable; however, SCRIPT achieves improved overall performance on GLUE.

Conclusion
This work presents SCRIPT for representation learning. SCRIPT is a transformer encoder like BERT. In pretraining, it recovers masked tokens, proposes negative samples, and acts as a self-critic, discriminating between sampled and original tokens. The joint MLM and discriminative learning improves sample efficiency in pretraining and enhances representation learning, leading to improved performance over strong baselines on various downstream tasks. It also provides an efficient way of computing a pseudo-log-likelihood for scoring tasks, achieving competitive performance.