Masked Language Model Scoring

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model’s WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL’s unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring.

1 Introduction

BERT (Devlin et al., 2019) and its improvements to natural language understanding have spurred a rapid succession of contextual language representations (Yang et al., 2019; Liu et al., 2019; inter alia) which use larger datasets and more involved training schemes. Their success is attributed to their use of bidirectional context, often via their masked language model (MLM) objectives. Here, a token w_t is replaced with [MASK] and predicted using all past and future tokens W_\t := (w_1, ..., w_{t-1}, w_{t+1}, ..., w_{|W|}).

* Work done during an internship at Amazon AWS AI.

Figure 1: To score a sentence, one creates copies with each token masked out. The log probability for each missing token is summed over copies to give the pseudo-log-likelihood score (PLL). One can adapt to the target domain to improve performance, or finetune to score without masks to improve memory usage.
In contrast, conventional language models (LMs) predict w_t using only past tokens W_{<t} := (w_1, ..., w_{t-1}). In exchange, this allows LMs to estimate log probabilities for a sentence W via the chain rule

log P_LM(W) = Σ_{t=1}^{|W|} log P_LM(w_t | W_{<t}),

which can be used out of the box to rescore hypotheses in end-to-end speech recognition and machine translation (Chan et al., 2016; Gulcehre et al., 2015), and to evaluate sentences for linguistic acceptability (Lau et al., 2017).
Our work studies the corresponding pseudo-log-likelihood scores (PLLs) from MLMs (Wang and Cho, 2019), given by summing the conditional log probabilities log P_MLM(w_t | W_\t) of each sentence token (Shin et al., 2019). These are induced in BERT by replacing w_t with [MASK] (Figure 1).
PLLs and their corresponding pseudo-perplexities (PPPLs) (Section 2.3) are intrinsic values one can assign to sentences and corpora, allowing us to use MLMs in applications previously restricted to conventional LM scores. Furthermore, we show that one can finetune BERT to compute PLLs in a single, non-recurrent inference pass (Section 2.2).
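The masking procedure behind PLLs can be sketched in a few lines. Here `score_token` is a hypothetical stand-in for one masked inference pass of an MLM such as BERT; only a toy uniform scorer is implemented, so the numbers are illustrative rather than real model scores:

```python
import math

MASK = "[MASK]"

def pseudo_log_likelihood(tokens, score_token):
    """PLL(W): sum over positions t of log P(w_t | all other tokens).

    `score_token(masked, t, w)` must return the log probability of the
    true token w at position t, given the sentence with position t
    replaced by [MASK].
    """
    total = 0.0
    for t, w in enumerate(tokens):
        masked = tokens[:t] + [MASK] + tokens[t + 1:]
        total += score_token(masked, t, w)
    return total

# Toy stand-in for an MLM: uniform over a 10-word vocabulary.
def uniform_scorer(masked, t, w):
    return math.log(1 / 10)

pll = pseudo_log_likelihood(["the", "cat", "sat"], uniform_scorer)
```

With a real MLM, each call to `score_token` is one inference pass over a masked copy of the sentence, so a |W|-token sentence costs |W| passes (parallelizable across copies).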
Existing uses of pretrained MLMs in sequence-to-sequence models for automatic speech recognition (ASR) or neural machine translation (NMT) involve integrating their weights (Clinchant et al., 2019) or representations (Zhu et al., 2020) into the encoder and/or decoder during training. In contrast, we train a sequence model independently, then rescore its n-best outputs with an existing MLM. For acceptability judgements, one finetunes MLMs for classification using a training set (Warstadt et al., 2019; Devlin et al., 2019); instead, PLLs give unsupervised, relative judgements directly.
In Section 3, we show that scores from BERT compete with or even outperform GPT-2 (Radford et al., 2019), a conventional language model of similar size but trained on more data. Gains scale with dataset and model size: RoBERTa large (Liu et al., 2019) improves an end-to-end ASR model with relative WER reductions of 30% and 18% on LibriSpeech test-clean and test-other respectively (with further gains from domain adaptation), and improves state-of-the-art NMT baselines by up to +1.7 BLEU on low-resource pairs from standard TED Talks corpora. In the multilingual case, we find that the pretrained 15-language XLM (Conneau and Lample, 2019) can concurrently improve NMT systems in different target languages.
In Section 4, we analyze PLLs and propose them as a basis for other ranking/scoring schemes. Unlike log probabilities, PLL's summands are more uniform across an utterance's length (no left-toright bias), helping differentiate fluency from likeliness. We use PLLs to perform unsupervised acceptability judgments on the BLiMP minimal pairs set (Warstadt et al., 2020); BERT and RoBERTa models improve the state of the art (GPT-2 probabilities) by up to 3.9% absolute, with +10% on island effects and NPI licensing phenomena. Hence, PLLs can be used to assess the linguistic competence of MLMs in a supervision-free manner.

2.1 Pseudolikelihood estimation
Bidirectional contextual representations like BERT come at the expense of being "true" language models P_LM(W), as there may appear no way to generate text (sampling) or produce sentence probabilities (density estimation) from these models. This handicapped their use in generative tasks, where they at best served to bootstrap encoder-decoder models (Clinchant et al., 2019; Zhu et al., 2020) or unidirectional LMs.
However, BERT's MLM objective can be viewed as stochastic maximum pseudolikelihood estimation (MPLE) (Wang and Cho, 2019; Besag, 1975) on a training set W, where {w_t}_{t=1}^{|W|} are random variables in a fully-connected graph. This approximates conventional MLE, with MLM training asymptotically maximizing the objective

J_PL(Θ; W) = (1/|W|) Σ_{W ∈ W} PLL(W; Θ),

where PLL(W; Θ) := Σ_{t=1}^{|W|} log P_MLM(w_t | W_\t; Θ). In this way, MLMs learn an underlying joint distribution whose conditional distributions w_t | W_\t are modeled by masking at position t. We include a further discussion in Appendix B.
This enabled text generation with BERT via Gibbs sampling, leading to the proposal (but not evaluation) of a related quantity, the sum of logits, for sentence ranking (Wang and Cho, 2019). More recent work (Shin et al., 2019) extended past research on future-conditional LMs in ASR (Section 5) with deeply-bidirectional self-attentive language models (bi-SANLMs). They trained shallow models from scratch with the [MASK] scoring method, but did not relate their work to pseudolikelihood and fluency, which provide a framework to explain their success and observed behaviors.
Experimentally, we extend both works by evaluating pretrained models, domain adaptation, and usage in NMT and multilingual settings (Section 3), along with acceptability judgements and PLL's intrinsic numerical properties (Section 4).

2.2 [MASK]less scoring
A practical point unaddressed in both works is that computing PLLs from an MLM requires a sentence copy for each position, making the number of inference passes dependent on length (though these can be parallelized). The cost of a softmax is also incurred, which is dependent on vocabulary size V; together this gives O(|W| · V). We propose reducing this to O(1) by training a network q with parameters Θ_S to match BERT's PLLs without [MASK] tokens:

q(W; Θ_S) ≈ PLL(W; Θ).

We propose finetuning q from the pretrained MLM directly (i.e., initializing Θ_S with Θ), via sentence-level regression. More generally, one could use any student model q, as in knowledge distillation (Hinton et al., 2014). Here, the teacher gives individual token probabilities (|W| inference passes) while the student approximates their sum (one inference pass). This is reminiscent of distilling an autoregressive teacher to a parallel student, as in the case of WaveNet (Oord et al., 2018). Other [MASK]less bidirectional models like XLNet (Yang et al., 2019) can also give PLLs; we leave this to future work.
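As a minimal sketch of the student objective (the exact regression loss is our assumption here; squared error is used purely for illustration), the maskless student q is trained so that its single-pass scalar output matches the teacher's PLLs:

```python
def maskless_regression_loss(teacher_plls, student_scores):
    """Mean squared error between the teacher MLM's sentence PLLs
    (|W| masked passes each) and the maskless student q's single-pass
    scalar outputs. Squared error is an illustrative choice; the text
    does not pin down the regression loss."""
    pairs = list(zip(teacher_plls, student_scores))
    return sum((p - s) ** 2 for p, s in pairs) / len(pairs)

# Hypothetical teacher PLLs and student outputs for two sentences:
loss = maskless_regression_loss([-12.0, -8.0], [-11.0, -8.5])
```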

2.3 Pseudo-perplexity
Analogous to conventional LMs, we propose the pseudo-perplexity (PPPL) of an MLM as an intrinsic measure of how well it models a corpus of sentences W. Let N denote the number of tokens in the corpus. Then a model's PPPL on W is

PPPL(W) := exp( -(1/N) Σ_{W ∈ W} PLL(W) ).

Past work (Chen et al., 2017) also computed this quantity with bi-RNNLMs for ASR, although such models are not deeply bidirectional like self-attentive MLMs (see Section 5). These PPPLs can be used in lieu of perplexities. For example, during domain adaptation, one can perform early stopping with respect to development PPPL. This is in contrast to MLM accuracy, which is not a continuous loss and is often stochastic (e.g., when performing dynamic masking as in RoBERTa). In Section 4.1, we see that PPPLs naturally separate out sets of acceptable and unacceptable sentences.
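The PPPL definition above follows from per-sentence PLLs with one exponential; the values below are illustrative only:

```python
import math

def pseudo_perplexity(sentence_plls, num_tokens):
    """PPPL = exp(-(1/N) * sum of sentence-level PLLs), for a corpus
    totalling N tokens. Passing the corpus word count as `num_tokens`
    instead gives a word-normalized PPPL, which is comparable across
    different subword vocabularies."""
    return math.exp(-sum(sentence_plls) / num_tokens)

# Illustrative values: two sentences, 10 subword tokens in total.
pppl = pseudo_perplexity([-12.0, -8.0], num_tokens=10)
```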
Unlike previous works (Chen et al., 2017; Shin et al., 2019), we use pretrained BERTs, which are open-vocabulary (subword) bidirectional LMs. However, PPPLs are only comparable under the same subword vocabulary, which differs between, e.g., BERT and RoBERTa. Normalizing with N as the number of words instead mitigates this. In Appendix C, we show that word-normalized PPPLs correlate with domain adaptation, and with downstream metrics like ASR and BLEU after rescoring.
3 Sequence-to-sequence rescoring

Let X denote audio features or source text tokens, and let W = (w_1, ..., w_{|W|}) denote target text tokens. For non-end-to-end ASR and MT systems, having separate acoustic/translation models P_{AM/TM}(X | W) and language models P_LM(W) is motivated by the Bayes rule decomposition used to select the best hypothesis Ŵ (Jelinek et al., 1975; Brown et al., 1993):

Ŵ = arg max_W P(W | X) = arg max_W P_{AM/TM}(X | W) P_LM(W).

3.1 The log-linear model
End-to-end ASR and NMT use encoder-decoder architectures that are trained discriminatively. Though less principled, many still adopt a log-linear model with learned functions f, g and a hyperparameter λ, to good effect (Sutskever et al., 2014; Chan et al., 2016):

Ŵ = arg max_W [ log f(W, X) + λ log g(W) ].

One often takes f = P_S2S(W | X) as the sequence-to-sequence model and g = P_LM(W) as the language model. Since the sequence-level arg max is intractable, one can do fusion, which decomposes f and g into per-step factors f_t and g_t over time (Gulcehre et al., 2015), restricting to the top N intermediate candidates at each step (beam search). Instead, our work considers N-best rescoring, which computes f(W, X) first, still using beam search to maintain the top N candidates and scores. Then, g(W) is computed for the resulting hypotheses and interpolated with these scores, giving a new top-1 hypothesis. The sequence model is now solely responsible for "capturing" the best hypothesis Ŵ in its beam. However, there are two advantages to N-best rescoring, which motivate PLLs as well as our maskless finetuning approach, respectively:

Decoupling of scale. Fusion requires correspondence between f_t and g_t at every t. This requires the sequence model and LM to be autoregressive and share tokenizations. In rescoring, f = P_S2S does not require g to decompose over time or to be a true probability at all, though g should scale with f so that λ remains valid for all lengths |W|; e.g., taking g(W) to be a "relevance score" between 0 and 1 would not satisfy this property. The choice of log-linear is relevant here (Appendix B).
Length-independent inference. If g is non-recurrent, then g(W) may be computed in a single inference pass. This difference manifests with self-attentive LMs like SANLMs and Transformer-XL.
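N-best rescoring under the log-linear model reduces to a single arg max over the list. The hypothesis list and LM scores below are toy values; `lm_score` stands in for either a log probability (GPT-2) or a PLL (an MLM):

```python
def rescore_nbest(hypotheses, lm_score, lam):
    """Return the hypothesis maximizing log f(W, X) + lam * log g(W).

    `hypotheses` holds (text, seq2seq_log_score) pairs from beam search;
    `lm_score` maps a hypothesis to its (M)LM score."""
    return max(hypotheses, key=lambda h: h[1] + lam * lm_score(h[0]))[0]

# Illustrative 2-best list and toy LM scores: the LM overturns a
# near-tie in the sequence-to-sequence scores.
nbest = [("the cat sat", -4.0), ("the cats at", -3.8)]
toy_lm = {"the cat sat": -10.0, "the cats at": -25.0}
best = rescore_nbest(nbest, toy_lm.get, lam=0.2)
```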

3.2 Experimental setup
Further implementation and experimental details can be found in Appendix A and our code release.

LMs. We rescore sequence-to-sequence hypotheses as in Section 3.1. Each hypothesis is assigned its log probability (uni-SANLM, GPT-2) or pseudo-log-likelihood score (bi-SANLM, BERT, M-BERT, RoBERTa, XLM). We tune the LM weight λ on the development set to minimize word error rate (WER) for ASR or maximize tokenized BLEU for NMT. We then evaluate on the test set.
ASR. Our 100-best hypotheses are from an end-to-end, 5-layer BLSTMP model (Shin et al., 2019) from ESPnet (Watanabe et al., 2018) on the 960-hour LibriSpeech corpus (Panayotov et al., 2015). Though this baseline is not state-of-the-art, we use their lists to enable direct comparison in Table 5.

Table 1: Sizes of translation datasets used in this paper (e.g., English (en) → German (de): 4.5M sentence pairs).

3.3 Out-of-the-box (monolingual)
We consider BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and RoBERTa (Liu et al., 2019), which are trained on 17GB, 40GB, and 160GB of written text respectively. Each model comes in similarly-sized 12-layer (117M / base) and 24-layer (345M / large) versions. GPT-2 is autoregressive, while BERT and RoBERTa are MLMs. We begin by rescoring ASR outputs in Table 2. As GPT-2 is trained on cased, punctuated data while the ASR model is not, we use cased MLMs and append "." to hypotheses to compare out-of-the-box performance. BERT outperforms its corresponding GPT-2 models despite being trained on less data. RoBERTa reduces WERs by 30% relative on LibriSpeech test-clean and 18% on test-other.
We repeat the same on English-target NMT in Table 3. As 100-best can be worse than 4-best due to the beam search curse (Yang et al., 2018;Murray and Chiang, 2018), we first decode both beam sizes to ensure no systematic degradation in our models. Hypothesis rescoring with BERT (base) gives up to +1.1 BLEU over our strong 100-best baselines, remaining competitive with GPT-2. Using RoBERTa (large) gives up to +1.7 BLEU over the baseline. Incidentally, we have demonstrated conclusive improvements on Transformers via LM rescoring for the first time, despite only using N -best lists; the most recent fusion work (Stahlberg et al., 2018) only used LSTM-based models.

We also consider a non-English, higher-resource target by rescoring a pre-existing WMT 2014 English-German system (trained on 4.5M sentence pairs) with German BERT (base) models (https://github.com/dbmdz/german-bert) trained on 16GB of text, similar to English BERT. From 27.3 BLEU we get +0.5, +0.3 from uncased, cased; a diminished but present effect that can be improved as in Table 3 with more pretraining, a larger model, or domain adaptation (Section 3.5).

3.4 Out-of-the-box (multilingual)
To assess the limits of our modular approach, we ask whether a shared multilingual MLM can improve translation into different target languages. We use the 100+ language M-BERT models, and the 15-language XLM models (Conneau and Lample, 2019) optionally trained with a cross-lingual translation LM objective (TLM). Monolingual training was done on Wikipedia, which gives e.g., 6GB of German text; see Table 4.
The 100-language M-BERT models gave no consistent improvement. The 15-language XLMs fared better, giving +0.2-0.4 BLEU, perhaps from their use of language tokens and fewer languages. Our German BERT results suggest an out-of-the-box upper bound of +0.8 BLEU, as we found with English BERT on similar resources. We expect that increasing training data and model size will boost XLM performance, as in Section 3.3.

3.5 Domain adaptation
Out-of-the-box rescoring may be hindered by how closely our models match the downstream text. For example, our uncased multilingual models strip accents, exacerbating their domain mismatch with the cased, accented gold translation. We examine this effect in the setting of LibriSpeech, which has its own 4GB text corpus and is fully uncased and unpunctuated, unlike the cased MLMs in Section 3.3. We rescore using in-domain models in Table 5. Using a BERT model trained only on the text corpus outperforms RoBERTa (Table 2), which is trained on far more data, underscoring the trade-off between in-domain modeling and out-of-the-box integration. Even minor differences like casing give +0.3-0.4 WER at test time. In Section 4.3 we see that these domain shifts can be visibly observed from the positionwise scores log P_MLM(w_t | W_\t).
The best results ("adaptation") still come from adapting a pretrained model to the target corpus. We proceed as in BERT, i.e., performing MLM on sequences of concatenated sentences (more details in Appendix A). In contrast, the 3-layer SANLMs (Shin et al., 2019) do per-utterance training, which is slower but may reduce mismatch even further.
Finally, we show in Appendix C that even before evaluating WER or BLEU, one can anticipate improvements in the downstream metric by looking at improvements in word-normalized PPPL on the target corpus. The domain-adapted MLM has lower PPPLs than the pretrained models, and RoBERTa has lower PPPLs than BERT.

3.6 Finetuning without masking
We finetune BERT to produce scores without [MASK] tokens. For LibriSpeech we take the normalized text corpus and keep sentences with length |W| ≤ 384, score them with our adapted BERT (base), then do sentence-level regression (Section 2.2). We train using Adam with a learning rate of 10^-5 for 10 epochs. Sentence-level finetuning degrades performance by +0.2-0.4 WER, leaving room for future improvement. This still outperforms GPT-2 (117M, cased), though this gap may be closed by adaptation. For now, maskless finetuning could be reserved for cases where only a masked language model is available, or when latency is essential.
Remarkably, we found that out-of-the-box scoring without [MASK] still significantly improves the baseline. This is likely from the 20% of the time BERT does not train on [MASK], but instead inputs a random word or the same word (Devlin et al., 2019). Future work could explore finetuning to positionwise distributions, as in word-level knowledge distillation (Kim and Rush, 2016), for which our results are a naïve performance bound.

4 Analysis
We recall the log-linear model from Section 3.1:

Ŵ = arg max_W [ log f(W, X) + λ log g(W) ].

Although end-to-end models f = P_S2S(W | X) predict W directly from X, interpolation with the unconditional g = P_LM(W) remains helpful (Toshniwal et al., 2018). One explanation comes from cold and simple fusion (Sriram et al., 2018; Stahlberg et al., 2018), which further improve on shallow fusion (Section 3.1) by learning g(W) first. They argue g expresses fluency; fixing g early allows f(W, X) to focus its capacity on adequacy in encoding the source, and thus specializing the two models. With this perspective in mind, we compare log P_LM and PLL as candidates for log g.

4.1 Relative linguistic acceptability
In this work we interpret fluency as linguistic acceptability (Chomsky, 1957); informally, the syntactic and semantic validity of a sentence according to human judgments (Schütze, 1996). Its graded form is well-proxied by neural language model scores (log P_LM) once length and lexical frequency are accounted for (Lau et al., 2017). This can be seen in a controlled setting using minimal pairs and GPT-2 (345M) scores:

Raymond is selling this sketch. (-40.0)
Raymond is selling this sketches. (-45.2)
This example is from the Benchmark of Linguistic Minimal Pairs (BLiMP) (Warstadt et al., 2020), a challenge set of 67k pairs which isolate contrasts in syntax, morphology, and semantics (in this example, determiner-noun agreement). While its predecessor, the Corpus of Linguistic Acceptability (CoLA), has a training set and asks to label sentences as "acceptable" or not in isolation (Warstadt et al., 2019), BLiMP provides an unsupervised setting: language models are evaluated on how often they give the acceptable sentence a higher (i.e., less negative) score. This is equivalent to 2-best rescoring without sequence model scores (log f = 0). Since most minimal pairs only differ by a single word, the effect of length on log probabilities and PLLs (discussed in Section 4.3) is mitigated. We compute PLLs on the sentences of each pair using cased BERT and RoBERTa, then choose the sentence with the highest score. Our results are in Table 7.

Table 7: Unsupervised performance (forced-choice accuracy) on BLiMP using log probabilities (GPT-2) or PLLs. Human scores from Warstadt et al. (2020). Values with * denote improvements over GPT-2 of ≥1% absolute.

Despite using less than half the data and a third of the capacity, BERT (base) already outperforms the previous state of the art (GPT-2) by 1.6% absolute, increasing to 3.9% with RoBERTa (large). There are 4 of 12 categories where all four PLLs outperform log probabilities by ≥1% absolute (values marked by *), and 7 where three or more PLLs outperform by this margin. Interestingly, PLLs do consistently worse on quantifiers, though all are relatively bad against the human baseline. The ratio of token-level PPPLs between unacceptable and acceptable sentences overall increases with performance, separating the two sentence sets. RoBERTa improves by around 10% on filler-gap dependencies, island effects, and negative polarity items (NPIs), largely closing the human gap. This suggests that the difficulty of these BLiMP categories was due to P_LM decomposing autoregressively, and not intrinsic to unsupervised language model training, as the original results may suggest (Warstadt et al., 2020). For some intuition, we include examples in Table 8. In the subject-verb agreement example, BERT sees The pamphlets and resembled those photographs when scoring have vs. has, whereas GPT-2 only sees The pamphlets, which may not be enough to counter the misleading adjacent entity Winston Churchill at scoring time.
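The BLiMP protocol (2-best rescoring with log f = 0) can be sketched directly. The scores below are the GPT-2 values quoted earlier for one pair; `toy_scores.get` stands in for a real scoring model:

```python
def forced_choice_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) minimal pairs where the
    acceptable sentence receives the higher (less negative) score."""
    return sum(score(good) > score(bad) for good, bad in pairs) / len(pairs)

# GPT-2 (345M) scores quoted in the text for one minimal pair:
toy_scores = {
    "Raymond is selling this sketch.": -40.0,
    "Raymond is selling this sketches.": -45.2,
}
acc = forced_choice_accuracy(
    [("Raymond is selling this sketch.",
      "Raymond is selling this sketches.")],
    toy_scores.get,
)
```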

4.2 Interpolation with direct models
We observed that log g = PLL(W) is not unduly affected by unconditional token frequencies; this mitigates degradation in adequacy upon interpolation with P_S2S. Consider a two-word proper noun, e.g., W = "San Francisco":

log P_LM(W) = log P_LM(San) + log P_LM(Francisco | San),
PLL(W) = log P_MLM(San | Francisco) + log P_MLM(Francisco | San).

It is a highly-fluent but low-probability bigram and thus gets penalized by log P_LM(W). Informally, PLL(W) expresses how likely each token is given other tokens (self-consistency), while log P_LM(W) expresses the unconditional probability of a sentence, beginning with the costly unconditional term P_LM(San). We see this in practice when we take LM to be GPT-2 (345M) and MLM to be RoBERTa (large): both give similar probabilities P(Francisco | San) ≈ e^-1.0 ≈ 37%, but differ in the first summand.
We examine the interplay of this bias with our sequence models, in cases where the baseline, GPT-2, and BERT gave different top-1 hypotheses (Table 8). In our examples, GPT-2 restores fluency using common and repeated words, at the cost of adequacy: clasping truth and → class in truth and, and Union by the Union Sivities → Union by the Union by the Union Civities. One can view these as exacerbations of the rare word problem due to overconfident logits (Nguyen and Chiang, 2018), and of over-translation (Tu et al., 2016). Meanwhile, BERT rewards self-consistency, which lets rarer but still-fluent words with better acoustic or translation scores persist: clasping truth and → clasping truth in, and Union by the Union Sivities → Union by the Union of LiberCivities, which preserves the p sound in the ground truth (clapping) for ASR, and promotes the more globally-fluent Union by the Union of LiberCivities.

Table 8: Examples of different top-1 hypotheses after ranking the minimal pairs or rescoring hypotheses from 4-best models, with differences highlighted. GPT-2 and BERT both promote fluency, but GPT-2's left-to-right biased scores appear to cause it to overweigh common word sequences at the expense of adequacy.

LibriSpeech (dev-other)
Baseline: clasping truth and jail ya in the mouth of the student is that building up or tearing down
GPT-2: class in truth and jail ya in the mouth of the student is that building up or tearing down
BERT (adapted): clasping truth in jail gagging the mouth of the student is that building up or tearing down
Target: clapping truth into jail gagging the mouth of the student is that building up or tearing down

gl→en (test)
Source (gl): Traballaba de asesora científica na ACLU, a Unión polas Liberdades Civís.
Baseline: I worked on a scientific status on the ACL, the Union by the Union Sivities.
GPT-2: I worked on a scientific status on the ACL, the Union by the Union by the Union Civities.
BERT: I worked on a scientific status on the ACL, the Union by the Union of LiberCivities.
Target (en): I was working at the ACLU as the organization's science advisor.
We also see the under-translation (i.e., omission) of Liber being corrected, without being discouraged by the rare sequence LiberCivities. Given the differences between PLLs and log probabilities, we explore whether ensembling both improves performance in Appendix D. Similar to the largely-dominant results of MLMs on BLiMP over GPT-2 (Section 4.1), we find that as the MLM gets stronger, adding GPT-2 scores has negligible effect, suggesting that their roles overlap.

4.3 Numerical properties of PLL
PLL's numerical properties make it an ideal foundation for future ranking or scoring schemes. For example, given fixed |W| one expects -log P_MLM(w_t | W_\t) to be in the same range for all t. Meanwhile -log P_LM(w_t | W_{<t}) decreases as t → |W|, the rate of which was studied in recurrent language models (Takahashi and Tanaka-Ishii, 2018). We validate this with GPT-2 (Figure 3) and BERT (Figure 4). In particular, we see the outsized cost of the unconditional first unigram in Figure 3. This also explains why bi-SANLM was more robust than uni-SANLM at shorter and earlier positions (Shin et al., 2019); the difference is intrinsic to log probabilities versus PLLs, and is not due to model or data size. Figure 4 also shows that domain adaptation (Section 3.5) affects PLL's positionwise cross-entropies. Cased BERT spikes at position 1, as it observes a lowercase word where a capitalized word is expected. All MLMs spike at the final token of an utterance, before our appended period ".". Terminal words are difficult to predict in general, but here more so as the BERT+LibriSpeech text corpora and the LibriSpeech test set are mismatched; the latter's ground-truth utterances were segmented by voice activity and not punctuation (Panayotov et al., 2015). Otherwise, the averaged cross-entropies are flat. This, plus our success on BLiMP, suggests positionwise scores as a way of detecting "disfluencies" (at least, those in the form of domain mismatches) by observing spikes in cross-entropy; with log P_LM, spikes are confounded by the curve in Figure 3.
In Appendix C, we plot sentence-level PLLs versus |W| and observe linearity as |W| → ∞, with spikes from the last word and lowercase first word smoothing out. This behavior motivates our choice of α = 1.0 when applying the Google NMT-style length penalty (Wu et al., 2016) to PLLs, which corresponds to the asymptotically-linear

LP_MLM(W) = (5 + |W|) / (5 + 1).

In contrast, autoregressive scores like P_LM(W) integrate over the inverse power-law curve in Figure 3. We speculate that this explains the effectiveness of their hyperparameter α = 0.6, widely used in NMT baselines like ours, as there exists C such that

LP_S2S(W) = (5 + |W|)^0.6 / (5 + 1)^0.6 ≈ C ∫_0^{|W|} (5 + x)^{-0.4} dx.
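The two length penalties differ only in the exponent α; a sketch of the GNMT formula:

```python
def gnmt_length_penalty(length, alpha):
    """Google NMT length penalty: ((5 + |W|) / (5 + 1)) ** alpha."""
    return ((5 + length) / 6.0) ** alpha

# alpha = 1.0 matches PLL's roughly linear growth in |W|;
# alpha = 0.6 is the common setting for autoregressive log probabilities.
lp_mlm = gnmt_length_penalty(13, alpha=1.0)
lp_s2s = gnmt_length_penalty(13, alpha=0.6)
```

A length-normalized score is then score / LP(W), so the larger exponent for PLLs discounts long sentences more aggressively, matching their linear growth.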

5 Related work
Our work extends the closest previous works (Wang and Cho, 2019;Shin et al., 2019) with regards to experiments and tasks, as outlined in Section 2.1. Furthermore, neither work considers the inference cost of masked rescoring, which we address with our maskless scoring approach, or analyze PLL's numerical properties.
Future context. Log probabilities conditioned on past and future context have been used in MT (Finch and Sumita, 2009; Xiong et al., 2011) and perennially in ASR (Shi et al., 2013; Arisoy et al., 2015; Chen et al., 2017) to positive effect. However, these are not deeply bidirectional, as they model interactions between W_{<t} and W_{>t} only via the forward and backward context vectors, while MLMs model all pairwise interactions between tokens w_s and w_t via dot-product attention (compare ELMo versus BERT). Their PLLs would have different properties from ours (e.g., their cross-entropies in Figure 4 may be convex instead of flat).
Discriminative language modeling. Previous works (Roark et al., 2004;Huang et al., 2018) have explored training language models that directly optimize for a downstream metric (WER, BLEU). While we also eschew using log probabilities from conventional LMs, our approach remains generative. Log probabilities model the joint distribution; PLL does so as well, albeit implicitly (Appendix B). PLL's summands (conditional probabilities) remain accessible for Gibbs sampling and are not tailored to any metric. The two approaches are complementary; for example, one could use PLL as a "prior" or regularizer for scores given by discriminatively-finetuned BERT models in tasks like passage re-ranking (Nogueira and Cho, 2019).
Language model integration. Beyond finetuning pretrained LMs and MLMs, monolingual pretraining has also improved NMT performance (Ramachandran et al., 2017;Conneau and Lample, 2019). However, modular integration of language representation models remains prevalent for various pragmatic reasons, similar to fusion in ASR. Contemporary examples are the use of finetuned BERT scores in a question-answering pipeline (Nogueira and Cho, 2019), or "as-is" cosine similarity scores from BERT to evaluate generated text (Zhang et al., 2020). For example, one might have no pretrained multilingual LMs for decoder initialization or fusion, as such models are difficult to train (Ragni et al., 2016). However, one may have an M-BERT or XLM for the target language/domain. Finally, N -best rescoring and pretraining are not mutually exclusive, though pretraining may already go partway to improve fluency.

6 Conclusion
We studied scoring with MLM pseudo-log-likelihood scores in a variety of settings. We showed the effectiveness of N-best rescoring with PLLs from pretrained MLMs in modern sequence-to-sequence models, for both ASR and low- to medium-resource NMT. We found rescoring with PLLs can match or outperform comparable scores from large unidirectional language models (GPT-2). We attributed this to PLL's promotion of fluency via self-consistency, as demonstrated by improvement on unsupervised acceptability judgements and by qualitative analysis. We examined the numerical properties of PLLs, proposed maskless scoring for speed, and proposed pseudo-perplexities for intrinsic evaluation of MLMs, releasing a codebase implementing our work. Future work could find additional modular uses of MLMs, simplify maskless PLL computations, and use PLLs to devise better sentence- or document-level scoring metrics.

A Experiment details
A.1 Language models

Implementation. English BERT, M-BERT, GPT-2, and RoBERTa models were served, adapted, and finetuned via the GluonNLP toolkit (Guo et al., 2020). German BERT and XLM models were served via HuggingFace's Transformers toolkit (Wolf et al., 2019). We release a reference implementation (a language model scoring package) for our work at https://github.com/awslabs/mlm-scoring.
Training. When adapting to a corpus we continue the training scheme for BERT, i.e., MLM + next-sentence prediction (Devlin et al., 2019), on the new dataset only, until the training loss converges. We still perform warmup at adaptation time (ratio of 0.01), but continue to use batches of 256 sequences of contiguous sentences, each with length up to 512.
Scoring. For BERT, M-BERT, and RoBERTa we prepend and append [CLS], [SEP] tokens. For GPT-2 we prepend and append <|endoftext|>, the default tokens for unconditional generation, as we found this outperformed other initial conditions (e.g., a preceding "."). For XLM we prepend and append </s> (prepending <s> is more proper, but this is due to a bug in HuggingFace Transformers' XLMTokenizer that we will fix; changes in results should be negligible). When computing (pseudo-)perplexity (Section 2.3), these special tokens' conditional probabilities are not included, nor are they counted for token or word counts during length normalization.
N-best rescoring. We follow the log-linear model in Section 3.1 with its hyperparameter λ, i.e., weighted addition of (M)LM scores with sequence-to-sequence scores. When interpolating MLMs with GPT-2 there is also a hyperparameter γ (Appendix D). We do grid search on (λ, γ) with increments (0.05, 0.1) for the best weights on the development set for downstream WER or BLEU, then evaluate on the corresponding test set. In the case of ties, we choose the largest λ, γ.
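A minimal sketch of this grid search, with hypothetical helper names: `quality` stands in for the dev-set metric (WER negated or BLEU, so that higher is better), and each hypothesis carries its sequence-to-sequence, unidirectional-LM, and PLL scores.

```python
import itertools

def combined(h, lam, gam):
    # f(W) + lambda * g(W), where g interpolates LM and MLM scores (Appendix D)
    return h["s2s"] + lam * ((1.0 - gam) * h["lm"] + gam * h["pll"])

def grid_search(nbest, quality, lams, gams):
    """Pick (lambda, gamma) maximizing dev-set quality of the 1-best
    hypothesis after rescoring; ties go to the largest (lambda, gamma),
    as stated above."""
    best = None
    for lam, gam in itertools.product(lams, gams):
        pick = max(nbest, key=lambda h: combined(h, lam, gam))
        key = (quality(pick), lam, gam)  # tuple order implements the tie-break
        if best is None or key > best:
            best = key
    return best[1], best[2]
```

In the paper's setting the grids would be `lams = [0.0, 0.05, ..., 1.0]` and `gams = [0.0, 0.1, ..., 1.0]`, and `quality` would decode a full development set per candidate pair.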

A.2 Automatic speech recognition
We use the LibriSpeech corpus (Panayotov et al., 2015) for our experiments. To adapt BERT we use the provided 800M-word text-only data, processed using Kaldi to match the normalized, downloadable corpus but with sentences in their original order (instead of alphabetically as in Kaldi's recipe), to match the long-context training regime of our language models. Our LibriSpeech-only BERT (base) model was trained on this corpus using GluonNLP's recipe, for 1.5M steps.
We take pre-existing 100-best lists shared via e-mail communication (Shin et al., 2019), which were produced by ESPnet (Watanabe et al., 2018) on LibriSpeech's dev and test sets. The ESPnet model was the sequence-to-sequence BLSTMP model in the librispeech/asr1 recipe, except with 5 layers and a beam size of 100.
For speech corpora, to alleviate some of the domain shift from BERT's original written corpora, we appended "." at the end of utterances during adaptation, and appended "." to all hypotheses before subword tokenization, masking, and token/word counting.
A.3 Neural machine translation
We consider 5 low-resource directions from the TED Talks dataset (Qi et al., 2018): Arabic (ar), Galician (gl), and Slovak (sk) to English; and English to Arabic and German (de), language pairs which were considered in Aharoni et al. (2019). We also include a more popular benchmark, English to Vietnamese (vi) from the IWSLT '15 evaluation campaign (Cettolo et al., 2015). These give a breadth of English-source and English-target pairs and include a right-to-left language; more importantly, the three non-English targets are covered by the 15-language XLMs (Conneau and Lample, 2019).
Our models are also described as baselines in a dedicated work (Nguyen and Salazar, 2019). They are base Transformers with 6 layers, 8 heads, an 8k BPE vocabulary, and dropout of 0.3, except for gl→en where we use 4 layers, 4 heads, 3k BPE, and a dropout of 0.4 due to its significantly smaller dataset. We use a warmup of 8k steps and the default hyperparameters (Vaswani et al., 2017). We apply GNMT length normalization (Wu et al., 2016) with α = 0.6 to the sequence-to-sequence log probabilities, and α = 1.0 to the PLLs (motivation is given in Section 4.3), with respect to their chosen tokenization's lengths. We compute tokenized BLEU via multi-bleu.perl from Moses (https://statmt.org) to compare with past works on these datasets.
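The GNMT length normalization divides a sequence's log score by a length-dependent penalty; a short sketch (function names are ours):

```python
def gnmt_length_penalty(length, alpha):
    # lp(W) = (5 + |W|)^alpha / (5 + 1)^alpha  (Wu et al., 2016)
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_score, length, alpha):
    # applied with alpha = 0.6 to seq2seq log probabilities
    # and alpha = 1.0 to PLLs, per the setup above
    return log_score / gnmt_length_penalty(length, alpha)
```

Note that alpha = 0 leaves scores untouched, while alpha = 1.0 divides by (|W| + 5)/6 rather than by raw length, damping the penalty for very short sequences.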

B BERT as a generative model
In their published version (Wang and Cho, 2019), the authors claimed that BERT is a Markov random field language model (MRF-LM) where {w_t}_{t=1}^{|W|} are categorical random variables (over the vocabulary) in a fully-connected graph G. They define a potential over cliques of G such that all partial-graph potentials are exp(0) = 1 and the full-graph potential is exp Σ_{t=1}^{|W|} f_θ(W)_t, where f_θ(W)_t is the logit corresponding to log P_MLM(w_t | W_\t) (although in their formulation, one could include the softmax into the feature function f_θ so that the full-graph log-potential equals PLL(W) exactly).
Abusing notation, we write W interchangeably with its graph G. An MRF defined this way would give the joint distribution P(W) = (1/Z) exp Σ_{t=1}^{|W|} f_θ(W)_t, where Z is the partition function making this a valid distribution by normalizing over all sequences of the same length |W|, the set denoted by S. One then hopes to say that log P_MLM(w_t | W_\t) is the conditional distribution of this MRF. However, their erratum ("BERT has a Mouth and must Speak, but it is not an MRF", https://sites.google.com/site/deepernn/home/blog) notes this is not the case, as w_t would be affected by other log-potentials as well.
In practice, one could instead a priori make the modeling assumption g(W) = P_MLM(W) := (1/Z) exp PLL(W), as done in the work on bi-RNNLMs (Chen et al., 2017). They choose to model the distribution of sentences as a product of experts over the conditionals P(w_t | W_\t), whose parameters are shared via the underlying bi-RNN.
For fixed λ and Z (which is intrinsic to the MLM), we see that λ log Z does not affect rank-ordering when taking arg max to get the best hypothesis Ŵ. Hence, the heuristic interpolation enacted by λ is "the same" for normalized log P_LM, unnormalized PLL, and our hypothetical log P_MLM. The remaining issue is whether λ has the same effect for all lengths |W|, which one mitigates by applying the correct length penalties to f and g (Section 4.3).
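Spelled out as a short derivation (with f, g, and λ as in Section 3.1, under the fixed-Z assumption above):

```latex
\hat{W} = \arg\max_{W}\, \bigl[ f(W) + \lambda \log P_{\mathrm{MLM}}(W) \bigr]
        = \arg\max_{W}\, \bigl[ f(W) + \lambda\,\mathrm{PLL}(W) - \lambda \log Z \bigr]
        = \arg\max_{W}\, \bigl[ f(W) + \lambda\,\mathrm{PLL}(W) \bigr],
```

since the constant λ log Z shifts every hypothesis's score equally and so drops out of the arg max.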

C Pseudo-perplexity and rescoring
We briefly examine the relationship between PPPL (Section 2.3) and metrics post-rescoring. We plot negative PLLs versus |W| and observe linearity, which helps justify our simple average over length. Note that in this section, we consider PPPLs normalized by the number of words (PPPL_w) to improve comparability between different subword vocabularies. We see a good correspondence between PPPL_w improvements and post-rescoring WER in Table 9, and post-rescoring BLEU in Table 10.
Thus, one could compute a new pretrained model's word-normalized PPPL on a small target-domain sample to quickly assess whether rescoring with it could improve on the previous model.

D Combining MLMs and GPT-2
We ask whether scores from a unidirectional LM are complementary to those from a masked LM for rescoring. When interpolating, we introduce γ such that log g(W) = (1 − γ) log P_LM(W) + γ PLL(W).
Our results are in the accompanying table. As the MLM gets stronger, the improvement from adding scores from GPT-2 goes to zero, suggesting that their roles overlap in the limit. However, unlike recent work (Shin et al., 2019) but like previous work (Chen et al., 2017), we found that interpolating with a unidirectional LM remained optimal, though our models are trained on different datasets and may have an ensembling effect.