Pre-Training Transformers as Energy-Based Cloze Models

We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method. Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text: it re-ranks speech recognition n-best lists better than language models and much faster than masked language models. Furthermore, it offers a clearer and more principled view of what ELECTRA learns during pre-training.


Introduction
The cloze task (Taylor, 1953) of predicting the identity of a token given its surrounding context has proven highly effective for representation learning over text. BERT (Devlin et al., 2019) implements the cloze task by replacing input tokens with [MASK], but this approach incurs drawbacks in efficiency (only 15% of tokens are masked out at a time) and introduces a pre-train/fine-tune mismatch, as BERT sees [MASK] tokens in training but not in fine-tuning. ELECTRA (Clark et al., 2020) uses a different pre-training task that alleviates these disadvantages. Instead of masking tokens, ELECTRA replaces some input tokens with fakes sampled from a small generator network; the pre-training task is then to distinguish the original vs. replaced tokens.

While on the surface ELECTRA appears quite different from BERT, in this paper we elucidate a close connection between ELECTRA and cloze modeling. In particular, we develop a new way of implementing the cloze task using an energy-based model (EBM). We then show that the resulting model, which we call Electric, is closely related to ELECTRA, as well as being useful in its own right for some applications.

EBMs learn an energy function that assigns low energy values to inputs in the data distribution and high energy values to other inputs. They are flexible because they do not have to compute normalized probabilities. For example, Electric does not use masking or an output softmax, instead producing a scalar energy score for each token, where a low energy indicates the token is likely given its context. Unlike with BERT, these likelihood scores can be computed simultaneously for all input tokens rather than only for a small masked-out subset. We propose a training algorithm for Electric that efficiently approximates a loss based on noise-contrastive estimation (Gutmann and Hyvärinen, 2010). We then show that this training algorithm is closely related to ELECTRA; in fact, ELECTRA can be viewed as a variant of Electric that uses negative sampling instead of noise-contrastive estimation.
We evaluate Electric on GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016), where Electric substantially outperforms BERT but slightly under-performs ELECTRA. However, Electric is particularly useful in its ability to efficiently produce pseudo-likelihood scores (Salazar et al., 2020) for text: Electric is better at re-ranking the outputs of a speech recognition system than GPT-2 (Radford et al., 2019) and is much faster at re-ranking than BERT because it scores all input tokens simultaneously rather than having to be run multiple times with different tokens masked out. In total, investigating Electric leads to a more principled understanding of ELECTRA, and our results suggest that EBMs are a promising alternative to the standard generative models currently used for language representation learning.

Figure 1: Comparison of BERT and Electric. Both model the probability of a token given its surrounding context, but BERT produces a full output distribution over tokens only for masked positions, while Electric produces unnormalized probabilities (but no full distribution) for all input tokens.

Method
BERT and related pre-training methods (Baevski et al., 2019; Liu et al., 2019; Lan et al., 2020) train a large neural network to perform the cloze task. These models learn the probability $p_{\text{data}}(x_t \mid \mathbf{x}_{\setminus t})$ of a token $x_t$ occurring in the surrounding context $\mathbf{x}_{\setminus t} = [x_1, ..., x_{t-1}, x_{t+1}, ..., x_n]$. Typically the context is represented as the input sequence with $x_t$ replaced by a special [MASK] placeholder token. This masked sequence is encoded into vector representations by a transformer network (Vaswani et al., 2017). Then the representation at position $t$ is passed into a softmax layer to produce a distribution over tokens $p_\theta(x_t \mid \mathbf{x}_{\setminus t})$ for the position.
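As a concrete illustration, the following PyTorch sketch computes this masked softmax distribution. The names `transformer`, `output_emb`, and `MASK_ID` are illustrative assumptions, not from any particular library:

import torch

MASK_ID = 103  # hypothetical [MASK] token id

def cloze_distribution(transformer, output_emb, x, t):
    # Replace x_t with [MASK], encode, and softmax over the vocabulary.
    masked = x.clone()
    masked[t] = MASK_ID
    h = transformer(masked.unsqueeze(0))[0]  # [n, hidden] states
    logits = h[t] @ output_emb.T             # [vocab] scores for position t
    return torch.softmax(logits, dim=-1)     # p_theta(. | x_\t)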

The Electric Model
Electric also models $p_{\text{data}}(x_t \mid \mathbf{x}_{\setminus t})$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\mathbf{x} = [x_1, ..., x_n]$ into contextualized vector representations $\mathbf{h}(\mathbf{x}) = [\mathbf{h}_1, ..., \mathbf{h}_n]$ using a transformer network. The model assigns a given position $t$ an energy score

$E(\mathbf{x})_t = \mathbf{w}^T \mathbf{h}(\mathbf{x})_t$

using a learned weight vector $\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as

$p_\theta(x_t \mid \mathbf{x}_{\setminus t}) = \exp(-E(\mathbf{x})_t) / Z_\theta(\mathbf{x}_{\setminus t}) = \exp(-E(\mathbf{x})_t) / \sum_{x' \in \mathcal{V}} \exp(-E(\text{REPLACE}(\mathbf{x}, t, x'))_t)$

where $\text{REPLACE}(\mathbf{x}, t, x')$ denotes replacing the token at position $t$ with $x'$ and $\mathcal{V}$ is the vocabulary, in practice usually word pieces (Sennrich et al., 2016). Unlike BERT, which produces the probabilities for all possible tokens $x'$ using a softmax layer, Electric takes a candidate $x'$ as input to the transformer. As a result, computing $p_\theta$ is prohibitively expensive, because evaluating the partition function $Z_\theta(\mathbf{x}_{\setminus t})$ requires running the transformer $|\mathcal{V}|$ times; unlike with most EBMs, the intractability of $Z_\theta(\mathbf{x}_{\setminus t})$ is due more to the expensive scoring function than to a large sample space.
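A minimal sketch of this scoring head, reusing the illustrative `transformer` from above; `w` is the learned weight vector, and the helper deliberately never computes the partition function:

import torch

def energies(transformer, w, x):
    # E(x)_t = w^T h(x)_t for every position, computed in one forward
    # pass over the UNMASKED input.
    h = transformer(x.unsqueeze(0))[0]  # [n, hidden]
    return h @ w                        # [n] scalar energies

def unnormalized_prob(transformer, w, x):
    # p_hat(x_t | x_\t) = exp(-E(x)_t). The partition function Z(x_\t)
    # is left uncomputed: evaluating it would take |V| transformer passes.
    return torch.exp(-energies(transformer, w, x))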

NCE Loss
As computing the exact likelihood is intractable, training energy-based models such as Electric with standard maximum-likelihood estimation is not possible. Instead, we use (conditional) Noise-Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010; Ma and Collins, 2018), which provides a way of efficiently training an unnormalized model without computing $Z_\theta(\mathbf{x}_{\setminus t})$. NCE learns the parameters of a model by defining a binary classification task where samples from the data distribution have to be distinguished from samples generated by a noise distribution $q(x_t \mid \mathbf{x}_{\setminus t})$. First, we define the unnormalized output

$\hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) = \exp(-E(\mathbf{x})_t)$

Operationally, NCE can be viewed as follows:

• A positive data point is a text sequence $\mathbf{x}$ from the data and a position $t$ in the sequence.

• A negative data point is the same except $x_t$, the token at position $t$, is replaced with a noise token $\hat{x}_t$ sampled from $q$.

• Define a binary classifier $D$ that estimates the probability of a data point being positive as

$D = \dfrac{n \, \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t})}{n \, \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) + k \, q(x_t \mid \mathbf{x}_{\setminus t})}$

• The binary classifier is trained to distinguish positive vs. negative data points, with $k$ negatives sampled for every $n$ positive data points.

Formally, the NCE loss $\mathcal{L}(\theta)$ is

$\mathcal{L}(\theta) = n \, \mathbb{E}_{\mathbf{x}, t} \left[ -\log \frac{n \, \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t})}{n \, \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) + k \, q(x_t \mid \mathbf{x}_{\setminus t})} \right] + k \, \mathbb{E}_{\mathbf{x}, t, \hat{x}_t \sim q} \left[ -\log \frac{k \, q(\hat{x}_t \mid \mathbf{x}_{\setminus t})}{n \, \hat{p}_\theta(\hat{x}_t \mid \mathbf{x}_{\setminus t}) + k \, q(\hat{x}_t \mid \mathbf{x}_{\setminus t})} \right]$

This loss is minimized when $\hat{p}_\theta$ matches the data distribution $p_{\text{data}}$ (Gutmann and Hyvärinen, 2010). A consequence of this property is that the model learns to be self-normalized such that $Z_\theta(\mathbf{x}_{\setminus t}) = 1$.
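The per-data-point terms of this loss are straightforward once $\hat{p}_\theta$ and $q$ have been evaluated. The sketch below, with illustrative names, shows the classifier $D$ and the two loss terms:

import torch

def d_positive(p_hat, q, n, k):
    # D: probability that a (sequence, position) pair is a positive data
    # point rather than a noise-corrupted one.
    return n * p_hat / (n * p_hat + k * q)

def nce_term(p_hat, q, n, k, positive):
    # Positives contribute -log D, negatives -log(1 - D); summing n
    # positive terms and k negative terms gives one sampled estimate
    # of L(theta).
    d = d_positive(p_hat, q, n, k)
    return -torch.log(d) if positive else -torch.log(1.0 - d)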

Training Algorithm
To minimize the loss, the expectations can be approximated by sampling, as shown in Algorithm 1; taking the gradient of this estimated loss produces an unbiased estimate of $\nabla_\theta \mathcal{L}(\theta)$.

Algorithm 1 Naive NCE loss estimation
Given: Input sequence $\mathbf{x}$, number of negative samples $k$, noise distribution $q$, model $\hat{p}_\theta$.
1. Initialize the loss as $\sum_{t=1}^{n} -\log D$, evaluated at each position of $\mathbf{x}$ (the positive data points).
2. For each of the $k$ negative samples: pick a random position $t$, sample $\hat{x}_t \sim q(x_t \mid \mathbf{x}_{\setminus t})$, and add to the loss $-\log(1 - D)$ evaluated at position $t$ of $\text{REPLACE}(\mathbf{x}, t, \hat{x}_t)$.

However, this algorithm is computationally expensive to run, since it requires $k + 1$ forward passes through the transformer to compute the $\hat{p}_\theta$ values (once for the positive samples and once for each negative sample). We propose a much more efficient approach, shown in Algorithm 2, that replaces $k$ input tokens with noise samples simultaneously.

Algorithm 2 Efficient NCE loss estimation
Given: Input sequence $\mathbf{x}$, number of negative samples $k$, noise distribution $q$, model $\hat{p}_\theta$.
1. Pick $k$ unique random positions in the sequence.
2. Replace the $k$ random positions with negative samples $\hat{x}_t \sim q(x_t \mid \mathbf{x}_{\setminus t})$, producing $\mathbf{x}^{\text{noised}}$.
3. For each position $t = 1$ to $n$: add to the loss $-\log(1 - D)$ if $t$ is a noised position and $-\log D$ otherwise, with $D$ computed from $\mathbf{x}^{\text{noised}}$.

This procedure requires just one pass through the transformer for $k$ noise samples and $n - k$ data samples. However, it only truly minimizes $\mathcal{L}$ if $\hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) = \hat{p}_\theta(x_t \mid \mathbf{x}^{\text{noised}}_{\setminus t})$. To apply this efficiency trick, we assume the two are approximately equal, which we argue is reasonable because (1) we choose a small $k$ of $0.15n$ and (2) $q$ is trained to be close to the data distribution (see below). This efficiency trick is analogous to BERT masking out multiple tokens per input sequence.
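A sketch of Algorithm 2 in the same illustrative PyTorch style, reusing the `energies` helper from earlier; `q_logp` is assumed to hold the noise model's log-distribution over the vocabulary for every position:

import torch

def efficient_nce_loss(transformer, w, x, q_logp, k):
    # One transformer pass scores all n positions of a single noised
    # copy of x; q_logp has shape [n, vocab].
    n = x.shape[0]
    pos = torch.randperm(n)[:k]                             # k unique positions
    x_noised = x.clone()
    x_noised[pos] = torch.multinomial(q_logp[pos].exp(), 1).squeeze(1)
    p_hat = torch.exp(-energies(transformer, w, x_noised))  # [n], one pass
    q = q_logp[torch.arange(n), x_noised].exp()             # q of current tokens
    d = n * p_hat / (n * p_hat + k * q)                     # P(t holds a data token)
    is_noise = torch.zeros(n, dtype=torch.bool)
    is_noise[pos] = True
    # Noised positions are trained as negatives, all others as positives.
    return torch.where(is_noise, -(1 - d).log(), -d.log()).sum()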

Noise Distribution
The noise distribution $q$ comes from a neural network trained to match $p_{\text{data}}$. NCE commonly employs this idea to ensure the classification task is sufficiently challenging for the model (Gutmann and Hyvärinen, 2010; Wang and Ou, 2018). In particular, we use a two-tower cloze model as proposed by Baevski et al. (2019), which is more accurate than a language model because it uses context on both sides of each token. The model runs two transformers $T_{\text{LTR}}$ and $T_{\text{RTL}}$ over the input sequence. These transformers apply causal masking, so one processes the sequence left-to-right and the other right-to-left. The model's predictions come from a softmax layer applied to the concatenated states of the two transformers:

$q(x_t \mid \mathbf{x}_{\setminus t}) = \text{softmax}(\mathbf{W} [\overrightarrow{\mathbf{h}}_{t-1}; \overleftarrow{\mathbf{h}}_{t+1}])_{x_t}$

where $\overrightarrow{\mathbf{h}} = T_{\text{LTR}}(\mathbf{x})$ and $\overleftarrow{\mathbf{h}} = T_{\text{RTL}}(\mathbf{x})$. The noise distribution is trained simultaneously with Electric using standard maximum-likelihood estimation over the data. The model producing the noise distribution is much smaller than Electric to reduce the computational overhead.
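A sketch of how the two towers could be combined, under the assumption that `t_ltr` and `t_rtl` are causal transformers (the right-to-left one consuming the reversed sequence) and `W` projects the concatenated states to vocabulary logits:

import torch

def two_tower_q(t_ltr, t_rtl, W, x):
    # Left-to-right states see x_1..x_t; right-to-left states see x_t..x_n.
    # Position t's prediction uses the state just left of t and just right
    # of t, so the token x_t itself never leaks into its own prediction.
    n, hidden = x.shape[0], W.shape[0] // 2
    fwd = t_ltr(x.unsqueeze(0))[0]                  # [n, hidden]
    bwd = t_rtl(x.flip(0).unsqueeze(0))[0].flip(0)  # [n, hidden]
    zeros = torch.zeros(hidden)
    logits = torch.stack([
        torch.cat([fwd[t - 1] if t > 0 else zeros,
                   bwd[t + 1] if t < n - 1 else zeros]) @ W
        for t in range(n)
    ])
    return torch.log_softmax(logits, dim=-1)        # [n, vocab] log q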

Connection to ELECTRA
Electric is closely related to the ELECTRA pre-training method. ELECTRA also trains a binary classifier (the "discriminator") to distinguish data tokens from noise tokens produced by a "generator" network. However, ELECTRA's classifier is simply a sigmoid layer on top of the transformer: it models the probability of a token being negative (i.e., replaced by a noise sample) as $\sigma(E(\mathbf{x})_t)$, where $\sigma$ denotes the sigmoid function. Electric, on the other hand, models this probability as

$\dfrac{k \, q(x_t \mid \mathbf{x}_{\setminus t})}{n \, \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) + k \, q(x_t \mid \mathbf{x}_{\setminus t})}$

While ELECTRA learns whether a token is more likely to come from the data distribution $p_{\text{data}}$ or the noise distribution $q$, Electric only learns $p_{\text{data}}$ because $q$ is passed into the model directly. This difference is analogous to using negative sampling (Mikolov et al., 2013) vs. noise-contrastive estimation (Mnih and Kavukcuoglu, 2013) for learning word embeddings.

A disadvantage of Electric compared to ELECTRA is that it is less flexible in the choice of noise distribution. Since ELECTRA's binary classifier does not need to access $q$, its $q$ only needs to be defined for negative-sample positions in the input sequence. Therefore ELECTRA can use a masked language model rather than a two-tower cloze model for $q$. An advantage of Electric is that it directly provides (unnormalized) probabilities $\hat{p}_\theta$ for tokens, making it useful for applications such as re-ranking the outputs of text generation systems. The differences between ELECTRA and Electric are summarized below:

                        ELECTRA                  Electric
Noise distribution q    masked language model    two-tower cloze model
Binary classifier       sigmoid layer            computed from $\hat{p}_\theta$ and $q$
Learning method         negative sampling        noise-contrastive estimation
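The contrast is easy to state in code. The sketch below (illustrative names again) shows the two ways of turning the same energy score into a replaced-token probability:

import torch

def electra_replaced_prob(energy):
    # ELECTRA: a bare sigmoid head; the noise distribution q never enters
    # the model's output.
    return torch.sigmoid(energy)

def electric_replaced_prob(energy, q_prob, n, k):
    # Electric: q is folded in analytically, so the transformer itself
    # must model p_data rather than the data-vs-noise ratio.
    p_hat = torch.exp(-energy)
    return k * q_prob / (n * p_hat + k * q_prob)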

Experiments
We train two Electric models the same size as BERT-Base (110M parameters): one on Wikipedia and BooksCorpus (Zhu et al., 2015) for comparison with BERT and one on OpenWebTextCorpus (Gokaslan and Cohen, 2019) for comparison with GPT-2. The noise-distribution transformers $T_{\text{LTR}}$ and $T_{\text{RTL}}$ are 1/4 the hidden size of Electric. We do no hyperparameter tuning, using the same hyperparameter values as ELECTRA. Further details on training are in the appendix.

Transfer to Downstream Tasks
We evaluate fine-tuning the Electric model on the GLUE natural language understanding benchmark (Wang et al., 2019) and the SQuAD 2.0 question answering dataset (Rajpurkar et al., 2018). We report exact-match for SQuAD, average score over the GLUE tasks, and accuracy on the multi-genre natural language inference GLUE task. Reported scores are medians over 10 fine-tuning runs with different random seeds. We use the same fine-tuning setup and hyperparameters as ELECTRA. Results are shown in Table 1. Electric scores better than BERT, showing the energy-based formulation improves cloze model pre-training. However, Electric scores slightly lower than ELECTRA. One possible explanation is that Electric's noise distribution is worse because a two-tower cloze model is less expressive than a masked LM. We tested this hypothesis by training ELECTRA with the same two-tower noise model as Electric. Performance did indeed go down, but it only explained about half the gap. The remaining drop in performance suggests that learning the difference between the data and generations from a low-capacity model leads to better representations than only learning the data distribution, but we believe further research is needed to fully understand the discrepancy.

Fast Pseudo-Log-Likelihood Scoring
An advantage of Electric over BERT is that it can efficiently produce pseudo-log-likelihood (PLL) scores for text (Wang and Cho, 2019). For Electric, the PLL of a sequence $\mathbf{x}$ is

$\text{PLL}(\mathbf{x}) = \sum_{t=1}^{n} \log \hat{p}_\theta(x_t \mid \mathbf{x}_{\setminus t}) = \sum_{t=1}^{n} -E(\mathbf{x})_t$

PLLs can be used to re-rank the outputs of an NMT or ASR system. While historically log-likelihoods from language models have been used for such re-ranking, recent work has demonstrated that PLLs from masked language models perform better (Shin et al., 2019). However, computing PLLs from a masked language model requires $n$ passes of the transformer: once with each token masked out. Salazar et al. (2020) suggest distilling BERT into a model that uses no masking to avoid this cost, but this model considerably under-performed regular LMs in their experiments.
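Given the `energies` helper sketched earlier, PLL scoring is a one-liner:

def pll(transformer, w, x):
    # PLL(x) = sum_t log p_hat(x_t | x_\t) = sum_t -E(x)_t: one pass,
    # since every position's energy comes from the same unmasked encoding.
    return -energies(transformer, w, x).sum()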
Electric can produce PLLs for all input tokens in a single pass, like a LM, while being bidirectional, like a masked LM. We use the PLLs from Electric to re-rank the 100-best hypotheses of a 5-layer speech recognition model, selecting the candidate

$\mathbf{x}^* = \text{argmax}_{\mathbf{x} \in \text{n-best}(f, s)} \; f(\mathbf{x} \mid s) + \lambda \, \text{PLL}(\mathbf{x})$

where $\text{n-best}(f, s)$ consists of the top $n$ (we use $n = 100$) predictions for speech input $s$ from the speech recognition model $f$ found with beam search, and $f(\mathbf{x} \mid s)$ is the score the speech model assigns the candidate output sequence $\mathbf{x}$. We select the best $\lambda$ on the dev set out of $[0.05, 0.1, ..., 0.95, 1.0]$, with different $\lambda$s selected for the "clean" and "other" portions of the data.
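A minimal sketch of this re-ranking rule, assuming `hypotheses` is a list of (token_ids, asr_score) pairs from the speech model's beam search and reusing the `pll` helper above:

def rerank(hypotheses, transformer, w, lam):
    # Pick the hypothesis maximizing f(x|s) + lambda * PLL(x).
    return max(hypotheses,
               key=lambda h: h[1] + lam * pll(transformer, w, h[0]))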
We compare Electric against GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and two baseline systems that are bidirectional while only requiring a single transformer pass like Electric. TwoTower is a two-tower cloze model similar to Electric's noise distribution, but the same size as Electric. ELECTRA-TT is identical to ELECTRA except that it uses a two-tower noise distribution rather than a masked language model; the noise distribution probabilities and binary classifier scores of ELECTRA-TT can be combined to assign probabilities to tokens, as shown in Appendix G of the ELECTRA paper. (With ELECTRA's original masked LM generator, it would be impossible to score all tokens in a single pass.)
Results are shown in Table 2. Electric scores better than GPT-2 when trained on comparable data. While scoring worse than BERT, Electric is much faster to run. It also slightly outperforms ELECTRA-TT, which is consistent with the finding from Labeau and Allauzen (2018) that NCE outperforms negative sampling for training language models. Furthermore, Electric is simpler and faster than ELECTRA-TT in that it does not require running the generator to produce PLL scores. TwoTower scores lower than Electric, presumably because it is not a "deeply" bidirectional model and instead just concatenates forward and backward hidden states.

Related Work
Language modeling (Dai and Le, 2015; Radford et al., 2018; Peters et al., 2018) and cloze modeling (Devlin et al., 2019; Baevski et al., 2019; Liu et al., 2019) have proven to be effective pre-training tasks for NLP. Unlike Electric, these methods follow the standard recipe of estimating token probabilities with an output softmax and using maximum-likelihood training.
Energy-based models have been widely explored in machine learning (Dayan et al., 1995; LeCun et al., 2006).

Conclusion
We have developed an energy-based cloze model we call Electric and designed an efficient training algorithm for Electric based on noise-contrastive estimation. Although Electric can be derived solely from the cloze task, the resulting pre-training method is closely related to ELECTRA's GAN-like pre-training algorithm. While slightly under-performing ELECTRA on downstream tasks, Electric is useful for its ability to quickly produce pseudo-log-likelihood scores for text. Furthermore, it offers a clearer and more principled view of the ELECTRA objective as a "negative sampling" version of cloze pre-training.

Table 2: ... (2020). Runtime is measured in passes through the transformer and data indicates the pre-training dataset. "Clean" and "other" are easier and harder splits of the data. *We use a public re-implementation of OpenWebText.
... getting a median score of multiple models. While using dev-set model selection to choose the test-set submission may alleviate the high variance of fine-tuning to some extent, such model selection is still not sufficient for reliable comparisons between methods (Reimers and Gurevych, 2018).