Limitations of Autoregressive Models and Their Alternatives

Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. (These problems are easy because the whole string is given rather than just a prefix; this is the difference between checking a given assignment against a formula and asking whether any satisfying assignment exists.) These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.


Introduction
Sequence modeling is a core NLP problem. Many sequence models p̃ are efficient at scoring strings: given a string x, its score p̃(x) can be computed in O(poly(|x|)). For example, an RNN (Mikolov et al., 2011) scores x in time O(|x|) while a Transformer (Vaswani et al., 2017) does so in time O(|x|^2). The score may be an unnormalized probability, and can be used to rank candidate strings.

Many sequence models also make it easy to compute marginal properties of p̃. They support efficient sampling of strings x (which allows unbiased approximation of marginal expectations). And they support efficient computation of the normalizing constant Z ≜ Σ_x p̃(x) (or simply guarantee Z = 1) for any value of the model parameters.
How about training? Briefly: If a sequence model can efficiently compute p̃(x) (and its derivatives with respect to model parameters), then it is efficient to compute parameter updates for noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Gutmann and Hyvärinen, 2012) or score matching (Hyvärinen, 2005). If sampling x or computing Z (and its derivatives) is also efficient, then it is efficient to compute parameter updates for ordinary MLE training.

[Figure 1: Valid answers to hard natural language inference problems can be hard to find (Munroe, 2009), but in many cases can be checked efficiently (e.g., the knapsack problem here). Given a large enough parametric autoregressive model with correct parameters, we can efficiently solve all problem instances of input length ≤ n and efficiently verify the solutions, but the required model size can grow superpolynomially in n. (This allows the model to store precomputed results that we can compare against at test time, in O(n).) A main observation of this paper is that, assuming NP ⊄ P/poly, without such superpolynomial growth in model size, autoregressive models cannot even be used to verify answers to some problems where polynomial-time verification algorithms exist.]

(* Part of this work was done at Facebook AI.)
Finally, popular sequence models are compact. Usually a fixed-size model is used to score strings x of all lengths. More generally, it might be reasonable to use an O(poly(n))-sized parameter vector when x has length n, at least if parameter vectors can be obtained (perhaps from an oracle) for all needed lengths. In this paper, we investigate what can and cannot be achieved with models that are compact in this sense. This setup allows us to discuss the asymptotic behavior of model families.
Standard autoregressive models have the form p̂(x) = Π_t p(x_t | x_{<t}), where each factor is efficient to compute from a fixed parameter vector. (In this paper we use the shorthand x_{<t} ≜ x_1 ⋯ x_{t−1}.) These models satisfy all three of the desiderata above. Through the use of flexible neural network architectures, standard autoregressive models have achieved stellar empirical results in many applications (Oord et al., 2016; Child et al., 2019; Zellers et al., 2019; Brown et al., 2020). However, there are still tasks that they have not mastered: for example, it is reported that they struggle with deep logical structure, even with the help of huge pretrained models (Wang et al., 2019a).
We point out that, unfortunately, there are certain sequence distributions whose unnormalized string probabilities p̃(x) are easy to compute, yet whose autoregressive factors p(x_t | x_{<t}) are hard (NP-hard) to compute or even approximate. Thus, standard autoregressive models are misspecified for these distributions (cannot fit them). It does not help much to focus on strings of bounded length, or to enlarge the model: under the common complexity-theoretic assumption NP ⊄ P/poly, the parameter size |θ_n| must grow superpolynomially in n to model the probabilities of all strings of length up to n.

Indeed, one of our main findings is that there exist unweighted languages L ∈ P for which no standard autoregressive model has L as its support, i.e., assigns weight > 0 to just the strings x ∈ L. This is downright depressing, considering the costs invested in training huge parametric autoregressive models (Bender et al., 2021): since L ∈ P, it is trivial to build an efficient scoring function p̃(x) with fixed parameters that has L as its support, just not an autoregressive one. The problem holds for all standard autoregressive models, regardless of how much computation and training data are used to learn the model parameters.

That is, for an NP-hard problem, scoring strings under a standard autoregressive model p̂(x) cannot be used to verify witnesses. Nor can finding witnesses be solved by prompting such a model with a description of a problem instance and sampling a continuation of that string. Such problems are abundant in NLP: for example, surface realization under Optimality Theory (Idsardi, 2006), decoding text from an AMR parse (Cai and Knight, 2013), phrase alignment between two sentences (DeNero and Klein, 2008), and in general inference of propositional logic (Cook, 1971), which underlies the NP-hardness of general natural language inference, as in Figure 1. In other words, our results imply that standard autoregressive models do not have the right structure to capture important linguistic regularities: e.g., that observed sequences were in fact constructed to be phonologically optimal, expressive of a semantic form, or logically coherent! Our work is also relevant to autoregressive models of fixed-dimensional vectors, such as NADE (Uria et al., 2016). These models can be extended to arbitrary n-dimensional vectors by providing separate parameters θ_n for each n. However, our constructions imply that for some distributions, |θ_n| must grow superpolynomially in n, even though this would not be necessary if the models were not autoregressive.
In the remainder of this paper, we formalize our three desiderata for sequence models. We formalize compact autoregressive models and describe some limitations on their expressiveness. We then show that it can help to choose an alternative model family that relaxes any one of the three desiderata (Table 1).

Weighted languages
An unweighted language L ⊆ V* is a set of strings x over a finite alphabet V. A weighted language p̃ is a function p̃ : V* → R_{≥0}. It may be regarded as specifying an unweighted language L = support(p̃) ≜ {x : p̃(x) ≠ 0} along with positive weights for the strings in L. We say that a weighted language p̃ is normalizable if its global normalizing constant Z ≜ Σ_{x∈V*} p̃(x) is finite and strictly positive. When p̃ is normalizable, p(x) ≜ p̃(x)/Z is a probability distribution over L. A distribution is any weighted language whose global normalizing constant is 1. Let x̂ ⪯ x mean that x̂ is a prefix of x ∈ V* (not necessarily a strict prefix). If p̃ is normalizable, then Z(x̂) ≜ Σ_{x∈V*: x̂⪯x} p̃(x) is ≤ Z for any x̂ ∈ V*, yielding a marginal prefix probability Z(x̂)/Z. If the prefix x̂ has positive prefix probability, then it admits a local conditional probability p(a | x̂) ≜ Z(x̂a)/Z(x̂) for each symbol a ∈ V, where the denominator is interpreted as a local normalizing constant. This is the conditional probability that if a random string starts with the prefix x̂, the next symbol is a. There is also a probability p($ | x̂) ≜ 1 − Σ_{a∈V} p(a | x̂) = p̃(x̂)/Z(x̂) ≥ 0 that the string ends immediately after x̂; the special symbol $ ∉ V represents "end of string."
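To make these definitions concrete, here is a small brute-force Python sketch (our own illustration; the toy weighted language and all names in it are made up). It computes Z, prefix weights Z(x̂), and local conditional probabilities p(a | x̂) for a weighted language with finite support, and checks that multiplying the local conditionals (including the end-of-string factor) recovers p(x) = p̃(x)/Z.

# Brute-force illustration of Z, Z(x̂), p(a | x̂), and p($ | x̂) for a toy
# weighted language with finite support over the alphabet V = {"a", "b"}.
p_tilde = {"ab": 2.0, "aab": 1.0, "ba": 1.0}     # all other strings get weight 0
V = ["a", "b"]

Z = sum(p_tilde.values())                        # global normalizing constant

def Z_prefix(prefix):
    # Total weight of strings having `prefix` as a (not necessarily strict) prefix.
    return sum(w for x, w in p_tilde.items() if x.startswith(prefix))

def local_conditional(a, prefix):
    # p(a | prefix) for a in V, or p($ | prefix) when a == "$" (end of string).
    denom = Z_prefix(prefix)
    assert denom > 0, "prefix must have positive prefix probability"
    if a == "$":
        return p_tilde.get(prefix, 0.0) / denom
    return Z_prefix(prefix + a) / denom

print({a: local_conditional(a, "a") for a in V + ["$"]})   # next-symbol distribution after "a"

x = "aab"                                        # product of local conditionals recovers p(x)
prod = 1.0
for t, sym in enumerate(x):
    prod *= local_conditional(sym, x[:t])
prod *= local_conditional("$", x)
assert abs(prod - p_tilde[x] / Z) < 1e-12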

Computation for weighted languages
We define a weighted language p̃ to be computable if it is defined by a Turing machine (also called p̃) that maps any x ∈ V* to p̃(x) ∈ Q_{≥0} in finite time. The Turing machine does not have to compute Z.

While the computable weighted languages allow any computable function as p̃, most architectures for defining weighted languages (e.g., RNNs or Transformers) do only a bounded or linear amount of work per input symbol. As a result, they compute p̃(x) in time O(poly(|x|)). That is, p̃ ∈ FP; we refer to such weighted languages as efficiently computable (EC). This does not imply that the normalized version p is efficiently computable, since finding the denominator Z requires summing over all of V*. If we tried to construct the same normalized distribution p as in the previous paragraph using a standard autoregressive model, we would model it as a product of local conditional probabilities, p(x) = (Π_{t=1}^{|x|} p(x_t | x_{<t})) · p($ | x). Most such architectures again do only a bounded or linear amount of work per input symbol. Yet one suspects that this may not always be enough work to do the job: the local conditional probabilities of the original p̃ are expensive to compute (unless p̃ has some special structure making Z(x̂) tractable).

Indeed, the observation of this paper is that for some efficiently computable weighted languages p̃, the local conditional probabilities are expensive to compute or even to approximate well. More precisely, autoregressive models cannot fit the local conditional probabilities unless they are superpolynomial in either their runtime or their number of parameters (where the parameters may be precomputed at training time). We now explain how to formalize these notions.

Non-uniform computation
In the machine learning approach to sequence modeling, we usually do not manually design the Turing machine behind p̃. Rather, we design a model f with parameters θ. f is a Turing machine that reads θ and outputs a specialized Turing machine M̃ = f(θ) that can score strings x and hence defines a weighted language. Without loss of generality, we will express θ as a string in B* (where B ≜ {0, 1}). For each θ, we obtain a potentially different weighted language.

Strings vary in length, and accurate modeling of longer strings may sometimes require more complex computations with more parameters. For example, when V is a natural language alphabet, a recurrent neural network may require more hidden units to model sentences of the language rather than individual words, and even more units to model whole documents. To accommodate this, we allow an infinite sequence of parameter vectors, θ = {θ_n ∈ B* | n ∈ N}, which yields an infinite sequence of Turing machines {M̃_n | n ∈ N} via M̃_n = f(θ_n). We then define p̃(x) ≜ M̃_{|x|}(x), so a string of length n is scored by the M̃_n machine. This is known as non-uniform computation. Of course, it is legal (and common) for all of the θ_n to be equal, or empty, but if desired, we can obtain more power by allowing the number of parameters to grow with n.
We can now consider how rapidly the parametric and runtime complexity may grow.
• If |θ_n| is permitted to grow exponentially, then one can fit any weighted language p̃ (even an uncomputable one). Simply use θ_n to encode a trie with O(|V|^{n+1}) nodes that maps x ↦ p̃(x) for any x of length n, and design f such that the Turing machine M̃_n = f(θ_n) has a (large) state transition table that mirrors the structure of this trie. The resulting collection of Turing machines {M̃_n | n ∈ N} can then compute p̃(x) exactly for any x, with only linear runtime O(|x|) (which is used to traverse the trie). (A small code sketch of this trie idea appears after this list.)
• Separately, if unbounded runtime is permitted for f, then one can exactly fit any computable weighted language p̃. Simply have f, when run on θ_n, compute and return the large trie-structured M̃_n that was mentioned above. In this case, f need not even use the parameters θ_n, except to determine n.
• Finally, if unbounded runtime is permitted for M̃_n, then again one can exactly fit any computable weighted language p̃. In this case, f trivially returns M̃_n = p̃ for all n.
• However, if the parameters are "compact" in the sense that |θ_n| grows only as O(poly(n)), and also M̃_n = f(θ_n) is constructed by f in time O(poly(n)), and M̃_n scores any x of length n in time O(poly(n)), then we say that the resulting weighted language p̃ is efficiently computable with compact parameters (ECCP). We refer to f paired with a parameter space of possible compact values of θ as an ECCP model. (See our remark on computability in Appendix A. Also, since we require f to run in polytime, it can only look at a polynomial-sized portion of θ_n; hence it is not really crucial for the parameters to be compact, but we nonetheless include this intuitive condition, without loss of generality.)

Neural models of weighted languages are typically ECCP models. The construction and execution of the neural network M̃_n may perform a polynomial amount of total computation to score the string x. This computation may involve parameters that were precomputed using any amount of effort (e.g., training on data) or even obtained from an oracle (they need not be computable). However, the exponentially many strings of length n must share a polynomial-size parameter vector θ_n, which prevents the solution given in the first bullet point above.
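To illustrate the first bullet above, here is a toy Python sketch (ours; the weight function is arbitrary and made up). The "parameters" θ_n are an exponential-size trie that stores p̃(x) for every string of length n, and the scoring machine just walks the trie in O(|x|) steps.

# Toy sketch of the exponential-parameter trie: θ_n stores p̃(x) for every
# string of length n, so scoring is a single O(|x|) lookup.
import itertools

V = ["a", "b"]

def p_tilde(x):                       # any weight function, even an expensive one
    return 1.0 / (1 + x.count("ab"))

def build_trie(n):
    # "Parameters" θ_n: a trie over all |V|**n strings of length n (exponential size).
    trie = {}
    for symbols in itertools.product(V, repeat=n):
        node = trie
        for s in symbols[:-1]:
            node = node.setdefault(s, {})
        node[symbols[-1]] = p_tilde("".join(symbols))   # leaf stores the precomputed weight
    return trie

def score_with_trie(trie, x):
    # M̃_n(x): look up p̃(x) in time O(|x|), regardless of how expensive p̃ itself is.
    node = trie
    for s in x:
        node = node[s]
    return node

theta_3 = build_trie(3)               # only practical for tiny n, which is the point
assert score_with_trie(theta_3, "aba") == p_tilde("aba")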
In practice one takes θ_n = θ for all n and obtains θ ∈ R^d by training. However, we do not consider whether such parameters are easy to estimate or even computable. We simply ask, for a given target language p̃, whether there exists a polynomially growing sequence of "good" parameter vectors for any parametric model f. When not, there can be no scheme for estimating arbitrarily long finite prefixes of such a sequence. So for any polynomial s, any training scheme that purports to return a trained model of size O(s(n)) that works "well" for strings of length ≤ n must fail for large enough n, even if unlimited data, computation, and oracles are allowed at training time.

P, P/poly, and NP/poly
The phrase "efficiently computable with compact parameters" means that without access to those parameters, the ECCP weighted language may no longer be efficiently computable. Indeed, it need not be computable at all, if the parameter vectors store the outputs of some uncomputable function.
Our definitions above of EC and ECCP weighted languages are weighted generalizations of the complexity classes P and P/poly, respectively (namely, the nonnegative functions in FP and FP/poly), and their supports are always unweighted languages in P and P/poly, respectively. An unweighted language L is in P iff there is a deterministic Turing machine that decides in O(poly(|x|)) time whether x ∈ L. And an unweighted language L is in P/poly iff there exists a family of Turing machines {M_n : n ∈ N} such that M_n decides in O(poly(n)) time whether x of length n is in L, where each M_n can be constructed in O(poly(n)) time as f(θ_n), for some Turing machine f and some sequence of polynomially-sized advice strings θ = {θ_n | n ∈ N} with |θ_n| ∈ O(poly(n)). We define the language class NP/poly similarly to P/poly: the only difference is that the family {M_n : n ∈ N} consists of nondeterministic Turing machines.

(Our presentation of P/poly is a variant of Arora and Barak (2009, §6), in which inputs x of length n are evaluated by a polytime function that is given an advice string as an auxiliary argument. This corresponds to a neural architecture that can consult trained parameters at runtime. We have replaced the standard call f(θ_n, x) with the "curried" expression f(θ_n)(x), which we still require to execute in polynomial total time. Here the intermediate result M_n = f(θ_n) corresponds to a trained runtime model for inputs of length n. Our Turing machines M_n have size polynomial in n, because they are constructed by f in polynomial time. They correspond to the polynomial-sized boolean circuits that are used to evaluate inputs of length n under the classical definition of P/poly (Ladner, 1975). We exposed these intermediate results M_n only to observe in §2.3 and §4.3 that if we had allowed the θ_n to grow exponentially, they would have been able to encode the answers in tries.)

Naturally, P ⊆ P/poly. But P/poly is larger than P: it contains all sparse languages, regardless of their hardness (even sparse undecidable languages), as well as many dense languages. The extra power of P/poly comes from its access to compact advice strings that do not have to be recursively enumerable, let alone efficient to find. This corresponds to statistical modeling, where the trained model has a computationally efficient architecture plus access to parameters that might have taken a long time to find.

NP-completeness and SAT

NP-complete decision problems have solutions that are efficient to validate but inefficient to find (assuming P ≠ NP). One of the most well-known NP-complete problems is the boolean satisfiability problem (SAT) (Cook, 1971). Given a boolean formula φ, SAT accepts φ iff φ can be satisfied by some value assignment. For example, the formula (x_1 ∨ ¬x_2 ∨ x_3) ∧ (x_1 ∨ ¬x_4) is in SAT, since there is a satisfying assignment x_{1...4} = 1101. We denote the number of satisfying assignments to φ as #(φ).
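To ground this notation, here is a small Python sketch (ours; the clause-list representation is an illustrative choice, not the paper's encoding). Checking a given assignment against a formula takes polynomial time, while counting #(φ) by brute force enumerates all 2^k assignments; the gap between these two operations is what the constructions below exploit.

# Verifying an assignment against a CNF formula is cheap; counting (or finding)
# satisfying assignments by brute force is exponential in the number of variables.
from itertools import product

# CNF formula: a list of clauses; literal +i means x_i is true, -i means x_i is false.
phi = [[1, -2, 3], [1, -4]]          # (x1 ∨ ¬x2 ∨ x3) ∧ (x1 ∨ ¬x4)
num_vars = 4

def satisfies(assignment, clauses):
    # Polytime check: does the 0/1 tuple `assignment` (indexed from x1) satisfy every clause?
    return all(
        any(assignment[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
        for clause in clauses
    )

assert satisfies((1, 1, 0, 1), phi)   # the assignment 1101 from the text

def count_satisfying(clauses, n):
    # #(φ): brute force over all 2**n assignments.
    return sum(satisfies(a, clauses) for a in product((0, 1), repeat=n))

print(count_satisfying(phi, num_vars))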
It is widely believed that no NP-complete languages are in P/poly. Otherwise we would have all of NP ⊆ P/poly and the polynomial hierarchy would collapse at the second level (Karp and Lipton, 1980). A capacity limitation of EC/ECCP weighted languages naturally follows from this belief:

Lemma 1. For any L ∈ P, there exists an EC weighted language with support L. For any L ∈ P/poly, there exists an ECCP weighted language with support L. But for any NP-complete L, there exists no ECCP weighted language with support L (assuming NP ⊄ P/poly).
(All omitted proofs are in Appendix A.)

In addition to not capturing the support of NP-complete languages, ECCP languages cannot help solve other NP-hard problems, either. For example, many structured prediction problems in NLP can be formulated as argmax_{x: x̂⪯x} p̃(x): we are given a prefix x̂ as input and look for its optimal continuation under p̃. But if this problem is NP-hard for a particular p̃, then it is not in P/poly (assuming NP ⊄ P/poly), so it cannot be accomplished by any polytime algorithm that queries an ECCP model.

Autoregressive ECCP models (ELNCP models) have reduced capacity
In this section we formally define autoregressive ECCP languages, and prove that they have strictly less capacity than general ECCP languages or even just EC languages. Our proofs rely on the construction of an EC language p̃ where computing the local conditional probabilities p(a | x̂) is NP-hard, so they cannot be computed with compact parameters if NP ⊄ P/poly.

ELN and ELNCP models
Many parameter estimation techniques and inference methods specifically work with local conditional probabilities p(a | x̂). Thus, it is common to use parametric models where such quantities can be computed in time O(poly(|x̂|)) (given the parameters). These are the "standard autoregressive models" we discussed in §1. We say that the resulting distributions are efficiently locally normalizable, or ELN. We may again generalize to allow the use of compact parameters. For any weighted language p̃, the Turing machine f_q efficiently locally normalizes p̃ with compact parameters θ_q = {θ_q^n | n ∈ N} if
• the parameter size |θ_q^n| grows only as O(poly(n))
• f_q(θ_q^n) returns a Turing machine q_n (similar to M̃_n in §2.3) in time O(poly(n))
• p̃ is normalizable (so p exists)
• q_n maps x̂a ↦ p(a | x̂) for all a ∈ V ∪ {$} and all prefixes x̂ ∈ V* with |x̂| ≤ n and Z(x̂) > 0
• q_n runs on those inputs x̂a in time O(poly(n))

(An autoregressive model architecture generally defines p̂(x) as an efficiently computable (§2.2) product of local conditional probabilities. However, the parametrization usually ensures only that Σ_{a∈V∪{$}} p(a | x̂) = 1 for all prefixes x̂. Some parameter settings may give rise to inconsistent distributions where Σ_{x∈V*} p̂(x) < 1 because the generative process terminates with probability < 1 (Chen et al., 2018). In this case, the factors p(a | x̂) defined by the autoregressive model are not actually the conditional probabilities of the weighted language p̂ (as defined in §2.1). It is true that training with a likelihood objective does encourage finding a weighted language whose generative process always terminates (hence Z = 1), since this is the behavior observed in the training corpus (Chi and Geman, 1998; Chen et al., 2018; Welleck et al., 2020). In this paper, we are only concerned with such languages, since we require the actual conditional probabilities to be efficiently computable. Autoregressive models that do not sum to 1, whose normalized probabilities can be uncomputable, are not ruled out by our theorems in this section.)

If there is f_q that efficiently locally normalizes a weighted language p̃ with compact parameters θ_q, we say p̃ is efficiently locally normalizable with compact parameters, or ELNCP. Note that this is a property of the weighted language itself. In this case, it is not hard to see that p = p̃/Z is ECCP:

Lemma 2. An ELNCP model p̃ is also ECCP. Likewise, an ELN model is also EC.
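The observation behind Lemma 2 can be sketched in a few lines of Python (ours; the toy parameterization is made up). Given a machine that returns local conditional probabilities for any prefix, a string's probability is just the product of its local conditionals, so the same machine also serves as a global scorer.

# From local conditional probabilities to string probabilities:
# p(x) = Π_t p(x_t | x_<t) · p($ | x), computed with |x| + 1 queries.

def make_local_model(theta):
    # Stand-in for q_n = f_q(θ_q^n): maps a prefix to a next-symbol distribution over V ∪ {$}.
    def q(prefix):
        stop = min(0.9, theta * (len(prefix) + 1))      # toy parameterization
        return {"a": (1 - stop) * 0.5, "b": (1 - stop) * 0.5, "$": stop}
    return q

def score(q, x):
    prob = 1.0
    for t, sym in enumerate(x):
        prob *= q(x[:t])[sym]
    return prob * q(x)["$"]

q = make_local_model(theta=0.2)
print(score(q, "ab"))                 # 0.4 * 0.3 * 0.6 = 0.072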
If we define ELNCP models analogously to ECCP models, Lemma 2 means that locally normalized models do not provide any extra power. Their distributions can always be captured by globally normalized models (of the appropriate architecture used in the proof). But we will see in Theorem 1 that the converse is likely not true: provided that NP ⊄ P/poly, there are efficiently computable weighted languages that cannot be efficiently locally normalized, even with the help of compact parameters. That is, they are EC (hence ECCP), yet they are not ELNCP (hence not ELN).

ELNCP models cannot exactly capture all EC (or ECCP) distributions
We prove our claim by defining a certain weighted language p̃ and reducing SAT to computing certain local conditional probabilities of p̃ (as defined in §2.1). Each decision SAT(φ) (where φ ranges over formulas) corresponds to a particular local conditional probability, implying that there is no polytime scheme for computing all of these probabilities, even with polynomially sized advice strings (i.e., parameters). Without loss of generality, we consider only formulas φ such that the set of variables mentioned at least once in φ is {x_1, ..., x_k} for some k ∈ N; we use |φ| to denote the number of variables in φ. We say that a satisfies φ if a ∈ B^{|φ|} and (x_1 = a_1, ..., x_{|φ|} = a_{|φ|}) is a satisfying assignment. Finally, abusing notation slightly, we also write φ for the string enc(φ) ∈ B*, where enc is a prefix-free encoding function. We can now define the unweighted language L = {φ a | φ is a formula, a ∈ B^{|φ|}, and a satisfies φ} over the alphabet B, which contains each possible SAT problem concatenated to each of its solutions.

We now convert L to a weighted language p̃, defined by p̃(x) = p̃(φ a) = (1/3)^{|x|+1} for x = φ a ∈ L, and p̃(x) = 0 otherwise. Under the resulting normalized distribution p, the local conditional distribution p(· | φ) is uniform over the satisfying assignments a of φ, as they all have the same length |φ|.

p̃ is efficiently computable, and so is p = p̃/Z. (Almost: Z could be irrational, but at least it is computable to any desired precision. For any rational Ẑ ≈ Z, we can say p̂ = p̃/Ẑ ≈ p is EC, via a Turing machine p̂ that stores Ẑ. Further remarks on irrationality appear in Appendix A.) Yet deciding whether the local conditional probabilities of p are greater than 0 is NP-hard. In particular, we show that SAT can be reduced to deciding whether certain local probabilities are greater than 0, namely the ones that condition on prefixes x̂ that consist only of a formula: x̂ = φ for some φ. This implies, assuming NP ⊄ P/poly, that no (f_q, θ_q) can efficiently locally normalize p̃ with compact parameters. Granted, the restriction of p̃ to the finite set {x ∈ B* : |x| ≤ n} can be locally normalized by some polytime Turing machine q_n, using the same trie trick sketched in §2.3. But such tries have sizes growing exponentially in n, and it is not possible to produce a sequence of such machines, {q_n : n ∈ N}, via a single master Turing machine f_q that runs in O(poly(n)) on θ_q^n. That is:

Theorem 1. Assuming NP ⊄ P/poly, there exists an efficiently computable normalizable weighted language p̃ that is not ELNCP.
Proof sketch. Take p̃ to be the weighted language we defined earlier in this section. p̃ is clearly efficiently computable. We will show that if it is ELNCP via (f_q, θ_q), then the NP-complete problem SAT is in P/poly, contradicting the assumption. We must give a method for using (f_q, θ_q) to decide SAT in polytime and with compact parameters. Given φ, our method constructs a simple related formula φ' such that
• φ' has at least one satisfying assignment (so Z(φ') > 0 and thus p(1 | φ') is defined)
• φ' has satisfying assignments with x_1 = 1 (i.e., p(1 | φ') > 0) if and only if φ is satisfiable
Our construction also provides a polynomial function s such that |φ'| is guaranteed to be ≤ s(|φ|). We now define α_n ≜ θ_q^{s(n)} (∀n). When our SAT algorithm with compact parameters α is given φ of length n, it can use the polynomial-size advice string α_n to ask (f_q, θ_q) in polynomial time for p(1 | φ'). SAT(φ) returns true iff that probability is > 0. (See also the remark on implications for seq2seq models following the proof in Appendix A.)
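For intuition, the following brute-force Python sketch (ours) mirrors the quantity queried in the proof. It skips the prefix-free encoding and works directly with a clause list; the local probability of the next bit being 1 after a formula prefix is nonzero exactly when the formula has a satisfying assignment whose first bit is 1, which is the kind of question the reduction turns into a SAT decision.

# Brute-force local conditional probabilities of p̃ after a formula prefix.
# All strings φ·a in the support share the same weight, so the weights cancel
# in the ratio and we only need to count satisfying completions.
from itertools import product

def satisfies(assignment, clauses):
    return all(
        any(assignment[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
        for clause in clauses
    )

def count_completions(clauses, n, bits):
    # Number of satisfying assignments of the formula that extend the partial assignment `bits`.
    return sum(
        satisfies(tuple(bits) + rest, clauses)
        for rest in product((0, 1), repeat=n - len(bits))
    )

def local_prob_of_1(clauses, n, bits=()):
    # p(1 | φ·bits): fraction of the remaining weight on continuations whose next bit is 1.
    denom = count_completions(clauses, n, bits)
    assert denom > 0
    return count_completions(clauses, n, tuple(bits) + (1,)) / denom

phi = [[1, -2, 3], [1, -4]]           # the running example formula with 4 variables
print(local_prob_of_1(phi, 4))        # > 0 iff φ has a satisfying assignment with x1 = 1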

ELNCP models cannot even capture all EC (or ECCP) supports or rankings
We can strengthen Theorem 1 as follows:

Theorem 2. Assuming NP ⊄ P/poly, there exists an efficiently computable normalizable weighted language p̃ such that there is no ELNCP p̃' with support(p̃') = support(p̃).
Proof. Observe that for any two weighted languages p̃ and p̃' with the same support, ∀x̂ ∈ V*, Z_{p̃}(x̂) > 0 ⇔ Z_{p̃'}(x̂) > 0 (where Z_{p̃} and Z_{p̃'} return the prefix probabilities of p̃ and p̃' respectively). Thus, for any x̂ with Z_{p̃}(x̂) > 0, p(1 | x̂) ≜ Z_{p̃}(x̂1)/Z_{p̃}(x̂) and p'(1 | x̂) ≜ Z_{p̃'}(x̂1)/Z_{p̃'}(x̂) are well-defined, and p(1 | x̂) > 0 ⇔ p'(1 | x̂) > 0. If p̃' is ELNCP, then all such probabilities p'(1 | x̂) can be computed in polytime with compact parameters, so it is likewise efficient to determine whether p(1 | x̂) > 0. But this cannot be the case when p̃ is the weighted language used in the proof of Theorem 1, since that would suffice to establish that SAT ∈ P/poly, following the proof of that theorem.

To put this another way, there exists an unweighted language in P (namely support(p̃)) that is not the support of any ELNCP distribution.

Normalizable languages with different supports also differ in their ranking of strings (Lemma 3). Therefore, no ELNCP p̃' captures the string ranking of p̃ from Theorem 2. And for some p̃, any ELNCP p̃' misranks even string pairs of "similar" lengths:

Theorem 3. Assuming NP ⊄ P/poly, there exists an efficiently computable normalizable weighted language p̃ such that no ELNCP p̃' with support(p̃') ⊇ support(p̃) has p̃(x_1) < p̃(x_2) ⇒ p̃'(x_1) < p̃'(x_2) for all x_1, x_2 ∈ V*. Indeed, any such p̃' has a counterexample where p̃(x_1) = 0. Moreover, there is a polynomial s_{p̃'} : N → N such that a counterexample exists for every x_1 such that p̃(x_1) = 0 and p̃'(x_1) > 0, where the x_2 in this counterexample always satisfies |x_2| ≤ s_{p̃'}(|x_1|).

Theorem 3 is relevant if one wishes to train a model to rerank strings that are proposed by another method (e.g., beam search on p̃', or exact k-best decoding from a more tractable distribution). If the desired rankings are given by Theorem 3's p̃, any smoothed ELNCP model p̃' will misrank some sets of candidate strings, even sets all of whose strings are "close" in length, by failing to rank an impossible string (x_1 with p̃(x_1) = 0) below a possible one (x_2 with p̃(x_2) > 0).

ELNCP models cannot even approximate EC (or ECCP) distributions
Local probabilities cannot even be approximated well with compact parameters (if NP ⊄ P/poly). Theorem 2 implies that there exists p̃ whose local probabilities p(a | x̂) are not approximated by any ELNCP to within any constant factor λ, since such an approximation would perfectly distinguish zeroes from non-zeroes and the resulting support sets would be equal.

Dropping the normalization requirement on the approximate local probabilities does not help. We say that f_q efficiently approximately locally normalizes p̃ with compact parameters θ_q if, for some λ ≥ 1, it meets a relaxed version of the conditions from §3.1, where q_n(x̂a) only needs to match p(a | x̂) within a factor of λ. (So possibly Σ_{a∈V∪{$}} q_n(x̂a) ≠ 1.) This is not possible for the language p̃ from §3.2, since that would still allow a SAT algorithm to determine whether p(1 | φ') > 0 in the proof of Theorem 1.

However, these demonstrations hinge on the difficulty of multiplicative approximation of zeroes, whereas real-world distributions may lack zeroes. Below we further show that it is hard even to approximate the non-zero local conditional probabilities (even with the additional help of randomness).
Theorem 4. Assuming NP ⊄ P/poly, there exists an efficiently computable weighted language p̃ such that there is no (f_q, θ_q = {θ_q^n | n ∈ N}) that satisfies all of the following properties (similar to §3.1):
• the parameter size |θ_q^n| grows only as O(poly(n))
• f_q(θ_q^n) returns a probabilistic Turing machine q_n in time O(poly(n))
• there exists λ ≥ 1 such that for each a ∈ V ∪ {$} and x̂ ∈ V* with |x̂| ≤ n and p(a | x̂) > 0, the probabilistic computation q_n(x̂a) has probability > 2/3 of approximating p(a | x̂) to within a factor of λ (that is, q_n(x̂a)/p(a | x̂) ∈ [1/λ, λ])
• q_n runs on those inputs x̂a in time O(poly(n))
Moreover, the statement above still remains true with either of the following modifications:
(a) the approximation guarantee is only required to hold for prefixes x̂ such that {x ∈ support(p̃) : x̂ ⪯ x} is finite (so that p(a | x̂) is computable by brute force)
(b) support(p̃) = V*

Alternative model families
We now discuss alternative families of sequence distributions that trade away efficiency or compactness in exchange for greater capacity, as shown in Table 1.

Energy-based models (EBMs)
Energy-based models (LeCun et al., 2006) of discrete sequences (Rosenfeld et al., 2001; Sandbank, 2008; Huang et al., 2018) traditionally refer to the EC models of §2.2. Only the unnormalized probabilities p̃(x) are required to be efficiently computable. Lemmas 1 and 2 showed that this model family contains all ELN languages and can achieve any support in P. Theorem 1 shows that it also contains languages that are not ELN or even ELNCP: intuitively, the reason is that the sums Z(x̂) needed to compute the local normalizing constants (see §2.1) can be intractable.

If we generalize energy-based sequence models to include all ECCP models (that is, we allow non-uniform computation with compact parameters), then Lemmas 1 and 2 guarantee that they can capture all ELNCP languages and furthermore all languages in P/poly (though still not NP-complete languages).

Experiments on different parameterizations.
Maximum-likelihood parameter estimation (MLE) can be expensive in an EBM because the likelihood formula involves the expensive summation Z = Σ_{x∈V*} p̃(x). This forces us in practice to use alternative estimators that do not require computing normalized probabilities, such as noise-contrastive estimation (NCE) or score matching (§1), which are less statistically efficient. In pilot experiments we found that both RNN- and Transformer-based EBMs trained with NCE achieved worse held-out perplexity than comparable locally normalized models trained with MLE. This might be due to a capacity limitation of the specific globally normalized architectures (i.e., no parameters work well), or excess capacity (i.e., too many parameters work well on the finite sample), or statistical inefficiency of the estimator (the NCE objective on the finite sample, with the noise distribution we chose, does not distinguish among parameters as well as MLE does), or an optimization difficulty caused by local optima in the NCE optimization landscape.
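For concreteness, here is a minimal sketch (ours) of the NCE objective we refer to; the score function, noise distribution, and parameter values are all made-up stand-ins, not our experimental setup. NCE trains the EBM to discriminate data strings from noise strings using only unnormalized scores.

# Noise-contrastive estimation for an unnormalized sequence model: binary
# cross-entropy of "is this string data or noise?", using log p̃_θ(x) - log(k·q(x)).
import math, random

def log_score(x, theta):
    # Toy unnormalized log-score log p̃_θ(x); a real EBM would use an RNN or Transformer here.
    return theta["per_char"] * len(x) + theta["bonus"] * x.count("ab")

def log_q(x, stop=0.3):
    # Noise distribution q: geometric length, uniform symbols over {a, b}.
    return len(x) * (math.log(1 - stop) + math.log(0.5)) + math.log(stop)

def sample_q(stop=0.3):
    x = ""
    while random.random() > stop:
        x += random.choice("ab")
    return x

def nce_loss(data, theta, k=5):
    loss = 0.0
    for x in data:
        s = log_score(x, theta) - (math.log(k) + log_q(x))        # label 1 for data
        loss -= math.log(1 / (1 + math.exp(-s)))
        for _ in range(k):                                        # label 0 for k noise samples
            xn = sample_q()
            sn = log_score(xn, theta) - (math.log(k) + log_q(xn))
            loss -= math.log(1 - 1 / (1 + math.exp(-sn)))
    return loss / len(data)

theta = {"per_char": -0.7, "bonus": 0.5}
print(nce_loss(["ab", "abab", "ba"], theta))
# In practice θ would be neural-network weights and this loss would be minimized
# with a gradient-based optimizer via automatic differentiation.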
Fortunately, it is possible to infuse a globally normalized architecture with the inductive bias of a locally normalized one, which empirically yields good results. Residual energy-based models (REBMs) (Bakhtin et al., 2021) are a simple hybrid architecture: p̃_θ(x) ≜ p_0(x) · exp(g_θ(x)). This simply multiplies our previous weight by a new factor p_0(x). The base model p_0 : V* → (0, 1] is a locally normalized neural sequence model (ELN model) that was pretrained on the same distribution. g_θ : V* → R is a learnable function (with parameters θ) that is used to adjust p_0, yielding a weighted language p̃_θ with the same support. We implemented REBMs, again with NCE training, and evaluated them on two different neural architectures (GRU- and Transformer-based) and 3 datasets (WikiText (Merity et al., 2017), Yelp (Yelp), and RealNews (Zellers et al., 2019)). In each setting we tried, the REBM slightly but significantly improved the perplexity of the base model p_0 (p < 0.05). (We independently conceived of and implemented the REBM idea proposed in Bakhtin et al. (2021). Details of neural architecture choice, model parameter sizes, training regimen, and evaluation (Appendices B-D) differ between our work and theirs, which also reported positive empirical results (on different datasets). We regard the two independent positive findings as a strong indication that the REBM design is effective.)
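The residual parameterization itself is only a one-line change to the scoring function. Below is a minimal sketch (ours; the base model and residual function are trivial stand-ins for the pretrained ELN model and the learned network g_θ).

# Residual EBM scoring: log p̃_θ(x) = log p_0(x) + g_θ(x).  The base model
# supplies the inductive bias; g_θ only has to learn a correction.
import math

def log_p0(x):
    # Stand-in for a pretrained ELN base model: unigram over {a, b} with stop probability 0.2.
    return len(x) * math.log(0.8 * 0.5) + math.log(0.2)

def g(x, theta):
    # Stand-in for the learnable residual energy g_θ (a neural network in a real REBM).
    return theta * x.count("ab")

def log_rebm_weight(x, theta):
    return log_p0(x) + g(x, theta)    # unnormalized: Z_θ is intractable in general

# Reranking candidates only needs unnormalized weights:
candidates = ["abab", "aaaa", "ab"]
print(max(candidates, key=lambda x: log_rebm_weight(x, theta=1.5)))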

Latent-variable models
Autoregressive models have Z = 1 for any setting of the parameters (or at least any setting that guarantees consistency: see the remark on consistency in §3.1). Clearly Z = 1 ensures that Z is both finite and tractable. Can we find a model family that retains this convenience (unlike EBMs), while still being expressive enough to have any non-empty language in P as support?

Autoregressive latent-variable models form such a family. As in directed graphical models, the use of latent variables provides a natural way to model partial observations of an underlying stochastic sequence of events. We will model an observed sequence x of length n as a function of a latent string z of length O(poly(n)). As in EBMs, the probability p̃(x) can be computationally intractable, allowing these models to break the expressivity bottleneck of ordinary autoregressive models. However, the intractability no longer comes from exponentially many summands in the denominator Z, but rather from exponentially many summands in the numerator, namely the summation over all latent strings z that could have produced x. Notice that as a result, even unnormalized string weights are now hard to compute, although once computed they are already normalized.
Formally, we define marginalized weighted languages. We say that p̃ is a marginalization of the weighted language p̃_z if it can be expressed as p̃(x) = Σ_{z: μ(z)=x} p̃_z(z), where μ is some function from latent strings to observed strings (the marginalization operator). We say it is a light marginalization if |z| ∈ O(poly(|μ(z)|)) and μ runs in time O(poly(|z|)). (WLOG, μ can be required to run in linear time O(|z|), as it does in our constructions below.) Typically μ(z) extracts a subsequence of z; it can be regarded as keeping the observed symbols while throwing away a polynomially bounded number of latent symbols.
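The following toy Python sketch (ours; the latent model and marginalization operator are made up) shows the shape of a light marginalization: latent strings z = y#x pair a latent prefix y with the observed suffix x, μ throws the prefix away, and the marginal weight of x sums over exponentially many latent prefixes even though each summand is cheap.

# Light marginalization by brute force: p̃(x) = Σ_{z: μ(z)=x} p̃_z(z),
# with latent strings of the form y#x and |y| = |x|.
from itertools import product

V = "ab"

def p_z(z):
    # Toy joint weight over latent strings y#x (stands in for an ELNCP model).
    y, x = z.split("#")
    match = sum(a == b for a, b in zip(y, x))
    return (0.25 ** len(z)) * (2.0 ** match)     # rewards agreement between y and x

def mu(z):
    # Marginalization operator: keep only the observed suffix.
    return z.split("#")[1]

def p_marginal(x):
    # Exponentially many summands in |x|, even though p_z and mu are cheap.
    return sum(p_z(y + "#" + x) for y in ("".join(t) for t in product(V, repeat=len(x))))

print(p_marginal("ab"))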
Light marginalizations of ELN distributions are a reasonable formalization of latent-variable autoregressive models. They are more powerful than ELN distributions, and even include some distributions that (by Lemma 1) are not even ELNCP or ECCP:

Theorem 5. There exists a light marginalization p̃ of an ELN distribution such that support(p̃) is an NP-complete language.

Our proof of Theorem 5 relies on special structure of a certain NP-complete language (SAT) and does not evidently generalize to all languages in NP.
However, light marginalizations of ELNCP distributions are more powerful still, and can have any language L ∈ NP or even NP/poly (§2.4) as support:

Theorem 6. The following statements are equivalent for any nonempty L ⊆ V*:
(a) L ∈ NP/poly.
(b) L is the support of a light marginalization of an ELNCP distribution.
(c) L is the support of a light marginalization of an ECCP weighted language.

Theorems 5 and 6 make use of unrestricted latent-variable autoregressive models. There exist more practical restricted families of such models that admit tractable computation of p̃(x) (Lafferty et al., 2001; Rastogi et al., 2016; Wu et al., 2018; Buys and Blunsom, 2018). Such models are EC (and indeed, typically ELN), but this limits their expressivity, by Theorem 1. Both Lin et al. (2019) and Buys and Blunsom (2018) observed that such models yield worse empirical results than models that do not have tractable exact inference methods. The tractability requirement is dropped in "self-talk" (blixt, 2020; Gontier et al., 2020; Shwartz et al., 2020), where a neural autoregressive language model generates an analysis of the prefix x̂ via latent intermediate symbols before predicting the next output symbol.
We remark that for autoregressive models, the position of the latent variables is significant. Marginalizing out latent variables at the end of the string adds no power. More precisely, if an ELNCP distribution is over strings z of the form x#y, then its marginalization via μ(x#y) = x can be expressed more simply as an ELNCP language. Thus, by Theorem 2, marginalizations of such distributions cannot have arbitrary NP languages as support. Our proofs of Theorems 5 and 6 instead use latent strings of the form y#x, where all latent variables precede all observed ones (as in Kingma and Welling, 2014). (This simple design can always be used without loss of generality. Here the marginal distribution of the next observed symbol can require superpolynomial time to compute (if #P ≠ FP, which follows from NP ⊄ P/poly). Theorem 1 could likewise be evaded by other autoregressive approaches that invest superpolynomial computation in predicting the next symbol (Graves, 2016): each autoregressive step might explicitly invoke lookahead or reasoning algorithms, just as feed-forward network layers can invoke optimizers or solvers (Amos and Kolter, 2017; Wang et al., 2019b).) Trying to reorder those latent strings as x#y while preserving their weights would have yielded a non-ELNCP distribution p̃(x#y) (because if it were ELNCP, then its marginalization p̃(x) would be ELNCP also, and we know from Lemma 1 that it cannot be for any distribution whose support is an NP-complete language).

Incidentally, the capacity established by Theorem 6 does not need the full power of marginalization. We could similarly define light maximizations of ELNCP distributions, p̃(x) = max_{z: μ(z)=x} p̃_z(z); replacing sum by max does not change the support.

How about lightly marginalizing ECCP languages instead of ELNCP ones? This cannot model any additional unweighted languages, by Theorem 6. But it may be able to model more probability distributions. One can easily construct a light marginalization p̃ of an ECCP distribution such that #(φ) = c · p̃(φ), where #(φ) is the number of satisfying assignments of φ and the constant c depends only on k = |φ|. We conjecture that this is not possible with lightly marginalized ELNCP distributions.

Lookup models

§2.3 noted that with exponential growth in stored parameters, it is possible to fit any weighted language up to length n, with local probabilities computed in only O(n) time by lookup. Of course this rapidly becomes impractical as n increases, even if the amount of training data increases accordingly. However, there has been some recent movement toward storage-heavy models. Such models are typically semiparametric: they use a parametric neural model, such as an autoregressive model, together with an external knowledge base of text strings or factoids that are not memorized in the layer weights. The neural model generates queries against the knowledge base and combines their results. Examples include kNN-LMs (Khandelwal et al., 2020) and semiparametric LMs (Yogatama et al., 2021). The knowledge base grows linearly with the training data rather than compressing the data into a smaller parameter vector. It is in fact a copy of the training data, indexed to allow fast lookup (Indyk and Motwani, 1998). (Preparing the index is much cheaper than neural network training.) Access to the large knowledge base may reduce the amount of computation needed to find the local conditional probabilities, much as in the trie construction of §2.3.

(Chen et al. (2018) show that it is hard to map RNN parameters to properties of the resulting autoregressive weighted language. Our point is that any given autoregressive weighted language, if consistent, comes with an efficient algorithm to map a string to properties of that string (its prefix probability and next-symbol probabilities). Thus, given consistent RNN parameters, the resulting language is always one of those for which these problems are tractable; this tractability property of the language is not hard to determine.)
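As an illustration of the kNN-LM idea mentioned above (a simplified sketch of ours, not the actual method or API of Khandelwal et al. (2020), which retrieves neighbors by neural context vectors), the next-symbol distribution of a parametric LM can be interpolated with a distribution read off the nearest neighbors of the current context in an indexed copy of the training data.

# Toy semiparametric LM: interpolate a parametric next-word distribution with a
# distribution derived from the k nearest neighbors of the context in a datastore.
from collections import Counter

datastore = [("the cat sat on the", "mat"),
             ("the dog sat on the", "rug"),
             ("a cat slept on the", "mat")]          # (context, next word) pairs

def toy_lm(context, vocab=("mat", "rug", "sofa")):
    # Stand-in for the parametric LM's next-word distribution (here: uniform).
    return {w: 1 / len(vocab) for w in vocab}

def similarity(c1, c2):
    # Toy context similarity: word overlap (a real kNN-LM compares hidden vectors).
    return len(set(c1.split()) & set(c2.split()))

def knn_distribution(context, k=2):
    neighbors = sorted(datastore, key=lambda e: -similarity(context, e[0]))[:k]
    counts = Counter(w for _, w in neighbors)
    return {w: c / k for w, c in counts.items()}

def interpolated(context, lam=0.5):
    p_lm, p_knn = toy_lm(context), knn_distribution(context)
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

print(interpolated("the cat sat on the"))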

Related work
In a Bayes network (which is really just an autoregressive model of fixed-length strings), approximate marginal inference is NP-hard (Roth, 1996). Assuming NP ⊄ P/poly and the grid-minor hypothesis, Chandrasekaran et al. (2008, Theorem 5.6) further showed that for any infinite sequence of graphs G_1, G_2, ... where G_n has treewidth n, there is no sequence of algorithms A_1, A_2, ... such that A_n performs approximate marginal inference in time O(poly(n)) on graphical models of structure G_n. This remarkable negative result says that in any graph sequence of unbounded treewidth, approximating the normalizing constant for given arbitrary parameters is hard (not O(poly(n))), even with advice strings. Our negative result (Theorem 4) focuses on one particular infinite weighted language, showing that approximating local conditional probabilities given an arbitrary length-n prefix is hard in the same way. (So this language cannot be captured by an RNN, even with advice strings.)

Conclusion and future work
While autoregressive models have several properties that appeal to practitioners, we have shown under common complexity-theoretic assumptions that they cannot match the expressivity of energy-based models, unless their runtime or parameter size grows superpolynomially in input length, or they are allowed to use latent variables, which add considerable power.

All model families we have discussed in this paper can be seen as making compromises between different desiderata (Table 1). Natural follow-up questions include "Are there model families that win on all fronts?" and "What are other modeling desiderata?" While some languages L ∈ P cannot be supports of ELNCPs, we do not know if the same can be said for most languages L ∈ P. This problem seems to be closely related to the average-case complexity of NP-complete languages, where most questions remain open (Levin, 1986; Bogdanov and Trevisan, 2006).

[Figure 2: The space of unweighted languages. We assume in this diagram that NP ⊄ P/poly. Each rectangular outline corresponds to a complexity class (named in its lower right corner) and encloses the languages whose decision problems fall into that class. Each bold-italic label (colored to match its shape outline) corresponds to a model family (ELN, EC, ELNCP, lightly marginalized ELNCP, and lookup models) and encloses the languages that can be expressed as the support of some weighted language in that family. All induced partitions in the figure are non-empty sets: shape A properly encloses shape B if and only if language class A is a strict superset of language class B. As mentioned in Table 1, standard autoregressive models (ELN models) have support languages that fall in P (Theorem 2). ELNCP models (§3.1) extend ELN models by allowing the parameter size to grow polynomially in string length, allowing them to capture both more languages inside P (Lemma A.4) and languages outside P (including undecidable but sparse languages) that can be characterized autoregressively with the help of these compact parameters. All of those languages belong in the class P/poly. Energy-based (EC) and ECCP models go strictly further than ELN and ELNCP models, respectively (Theorem 2): they correspond to the entire classes P and P/poly (Lemma 1). However, even ECCP does not capture any NP-complete languages under our assumption NP ⊄ P/poly. Allowing a polynomial number of latent symbols extends the power further still: lightly marginalized ELNCP or ECCP distributions cover exactly the languages L ∈ NP/poly (Theorem 6). Finally, if we were to drop the requirement that the parameters must be compact, we could store lookup tries to model any weighted language (§4.3).]

A Proofs
Lemma 1. For any L ∈ P, there exists an EC weighted language with support L. For any L ∈ P/poly, there exists an ECCP weighted language with support L. But for any NP-complete L, there exists no ECCP weighted language with support L (assuming NP ⊄ P/poly).

This simple lemma relates our classes EC and ECCP of weighted languages to the complexity classes P and P/poly of their supports, which are unweighted formal languages (§2). It holds because computing a string's weight can be made as easy as determining whether that weight is nonzero (if we set the weights in a simple way), but is certainly no easier. We spell out the trivial proof to help the reader gain familiarity with the formalism.

Proof. Given L, define a weighted language p̃ with support L by p̃(x) ≜ 1 if x ∈ L and p̃(x) ≜ 0 otherwise.

If L ∈ P, then clearly p̃ is EC, since the return value of 1 or 0 can be determined in polytime.

If L ∈ P/poly, L can be described as a tuple (f, θ) following our characterization in §2.4. It is easy to show that p̃ is ECCP, using the same polynomially-sized advice strings θ_n. We simply construct f_p such that f_p(θ_n) returns 1 or 0 on input x according to whether f(θ_n) accepts or rejects x. Both f_p(θ_n) and f_p(θ_n)(x) are computed in time O(poly(n)) if |x| = n. (The technical construction is that f_p simulates the operation of f on the input θ_n to obtain the description of the Turing machine M_n = f(θ_n), and then outputs a slightly modified version of this description that will write 1 or 0 on an output tape.)

For the second half of the lemma, we use the reverse construction. Suppose p̃ is an ECCP weighted language with support L. p̃ can be characterized by a tuple (f_p, θ). It is easy to show that L ∈ P/poly, using the same polynomially-sized advice strings θ_n. We simply construct f such that f(θ_n) accepts x iff f_p(θ_n)(x) > 0. Since L ∈ P/poly and we assume NP ⊄ P/poly, L cannot be NP-complete.

Lemma 2. An ELNCP model p̃ is also ECCP. Likewise, an ELN model is also EC.

Proof. Let p̃ be an ELNCP language. Let f_q efficiently locally normalize p̃ with compact parameters θ_q = {θ_q^n | n ∈ N}. It is simple to define a Turing machine f_r that maps each parameter string θ_q^n to a Turing machine M_n, where M_n(x) simply computes Π_{t=1}^{n} q_n(x_{<t} x_t) · q_n(x $). Then for all x of length n, M_n(x) = Π_{t=1}^{n} p(x_t | x_{<t}) · p($ | x), by the definition of local normalization, and thus M_n(x) = p(x). f_r can be constructed by incorporating the definition of f_q, so that M_n = f_r(θ_q^n) can include q_n = f_q(θ_q^n) as a subroutine. This allows M_n to query q_n for local conditional probabilities and multiply them together.
• Since f_q runs in polytime, it is straightforward for this construction to ensure that f_r runs in polytime as well.
• We were given that |θ_q^n| ∈ O(poly(n)) (compact parameters). Since p is the weighted language defined by (f_r, θ_q), and f_r and θ_q have the properties just discussed, we see that p is efficiently computable with compact parameters.

In the case where p̃ is more strongly known to be ELN (the parameters θ_q are not needed), a simplification of this argument shows that it is EC.

Theorem 1. Assuming NP ⊄ P/poly, there exists an efficiently computable normalizable weighted language p̃ that is not ELNCP.
Proof. The proof was sketched in §3.2. Here we fill in the details.
The weighted language p̃ defined in that section is efficiently computable via the following simple algorithm that outputs p̃(x) given x ∈ B*. If x has a prefix that encodes a formula φ, and the remainder of x is a satisfying assignment a to the variables of φ, then return (1/3)^{|x|+1}. Otherwise return 0. This algorithm can be made to run in polynomial time because whether an assignment satisfies a formula can be determined in polynomial time (a fact that is standardly used to establish that SAT ∈ NP).

Given a formula φ with variables x_1, ..., x_k, we define φ' = (¬x_1 ∧ ¬x_2 ∧ ... ∧ ¬x_k ∧ ¬x_{k+1}) ∨ (x_1 ∧ Shift(φ)), where Shift(φ) is a version of φ in which x_i has been renamed to x_{i+1} for all 1 ≤ i ≤ k. It is obvious that φ and φ' have the properties stated in the proof sketch. The strings in L that begin with φ' are precisely the strings of the form φ'a where a is a satisfying assignment of φ', which happens just when a = 0^{k+1} or a = 1a' where a' is a satisfying assignment of φ. At least one string in L begins with φ', namely φ'0^{k+1}, so Z(φ') > 0. Moreover, Z(φ'1) > 0 iff φ has any satisfying assignments. Therefore the local probability p(1 | φ') = Z(φ'1)/Z(φ') is defined (see §2.1), and is > 0 iff SAT(φ).
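The construction can be checked mechanically; the following Python sketch (ours, reusing the clause-list representation from the earlier sketches and evaluating φ' directly rather than building its encoding) verifies the two properties claimed in the proof sketch.

# φ' = (¬x_1 ∧ ... ∧ ¬x_{k+1}) ∨ (x_1 ∧ Shift(φ)): always satisfiable, and
# satisfiable with x_1 = 1 exactly when φ is satisfiable.
from itertools import product

def satisfies(assignment, clauses):
    return all(
        any(assignment[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
        for clause in clauses
    )

def shift(clauses):
    # Rename x_i to x_{i+1} in every literal.
    return [[lit + 1 if lit > 0 else lit - 1 for lit in clause] for clause in clauses]

def satisfies_phi_prime(assignment, clauses):
    # Evaluate φ' directly on an assignment to x_1 ... x_{k+1}.
    all_zeros = all(v == 0 for v in assignment)
    second_disjunct = assignment[0] == 1 and satisfies(assignment, shift(clauses))
    return all_zeros or second_disjunct

phi = [[1, -2, 3], [1, -4]]
k = 4
assert satisfies_phi_prime((0,) * (k + 1), phi)               # 0^{k+1} always satisfies φ'
for a in product((0, 1), repeat=k):                           # assignments 1·a correspond to φ
    assert satisfies_phi_prime((1,) + a, phi) == satisfies(a, phi)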
Notice that the formal problem used in the proof is a version of SAT whose inputs are encoded using the same prefix-free encoding function enc that was used by our definition of L in §3.2. We must choose this encoding function to be concise in the sense that enc(φ) can be converted to and from the conventional encoding of φ in polynomial time. This ensures that our version of SAT is polynomial-time interreducible with the conventional version and hence NP-complete. It also ensures that there is a polynomial function s such that |φ'| ≤ s(|φ|), as required by the proof sketch, since there is a polynomial-time function that maps φ → φ' → enc(φ') and the output length of this function is bounded by its runtime. This is needed to show that our version of SAT is in P/poly.

Specifically, to show that the existence of (f_q, θ_q) implies SAT ∈ P/poly, we use it to construct an appropriate pair (f, α) such that (f(α_n))(φ) = SAT(φ) if |φ| = n. As mentioned in the proof sketch, we define α_n by α_n = θ_q^{s(n)}, and observe that |α_n| ∈ O(poly(n)) (thanks to the compactness of the parameters θ_q and the fact that s is polynomially bounded).

Finally, define f(α_n) to be a Turing machine that maps its input φ of length n to φ' of length ≤ s(n), then calls q_{s(n)} = f_q(θ_q^{s(n)}) on φ'1 to obtain p(1 | φ'), and returns true or false according to whether p(1 | φ') > 0. Computing φ' takes time polynomial in n (thanks to the properties of enc). Constructing f_q(θ_q^{s(n)}) and calling it on φ'1 each take time polynomial in n (thanks to the properties of s and f_q).
Remark on conditional models. While we focus on modeling joint sequence probabilities in this work, we note that in many applications it often suffices to model conditional probabilities (Sutskever et al., 2014). Unfortunately, our proof of Theorem 1 above implies that ELNCPs do not make good conditional models either: specifically, there exists p̃ such that deciding whether p(1 | x̂) > 0 is NP-hard, and thus beyond an ELNCP's capability.

Remark on irrationality. In our definitions of ECCP and ELNCP languages, we implicitly assumed that the Turing machines that return weights or probabilities would write them in full on the output tape, presumably as the ratio of two integers. Such a Turing machine can only return rational numbers. But then our formulation of Theorem 1 allows another proof. We could construct p̃ such that the local conditional probabilities p(a | x̂) ≜ Z(x̂a)/Z(x̂) are sometimes irrational. In this case, they cannot be output exactly by a Turing machine, implying that p̃ is not ELNCP. However, this proof exposes only a trivial weakness of ELNCPs, namely the fact that they can only define distributions whose local marginal probabilities are rational.

We can correct this weakness by formulating ELNCP languages slightly differently. A real number is said to be computable if it can be output by a Turing machine to any desired precision. That Turing machine takes an extra input b which specifies the number of bits of precision of the output. Similarly, our definitions of ECCP and ELNCP can be modified so that their respective Turing machines M̃_n and q_n take this form, are allowed to run in time O(poly(n + b)), and have access to the respective parameter vectors θ_{n+b} and θ_q^{n+b}. Since some of our results concern the ability to distinguish zero from small values (arbitrarily small in the case of Lemma A.4), our modified definitions also require M̃_n and q_n to output a bit indicating whether the output is exactly zero. For simplicity, we suppressed these technical details from our exposition.

Relatedly, in §4.3, we claimed that lookup models can fit any weighted language up to length n. This is not strictly true if the weights can be irrational. A more precise statement is that for any weighted language p̃, there is a lookup model that maps (x, b) to the first b bits of p̃(x). Indeed, this holds even when p̃(x) is uncomputable.

Remark on computability. In §2.1 we claimed that any weighted language p̃ that has a finite and strictly positive Z can be normalized as p(x) = p̃(x)/Z. However, Z may be uncomputable: that is, there may be no algorithm that takes a number of bits of precision b as input, and outputs an approximation of Z to within b bits of precision. Therefore, even if p̃ is computable, p may have weights that are not merely irrational but even uncomputable. An example appears in the proof of Lemma A.4 below. Weighted language classes (e.g., ELNCP) that only model normalized languages will not be able to model such languages, simply because the partition function Z is uncomputable.

However, our proof of Theorem 1 does not rely on this issue, because the p̃ that it exhibits happens to have a computable Z. For any b, Z may be computed to b bits of precision as the explicit sum Σ_{x: |x| ≤ N} p̃(x) for a certain large N that depends on b.

Remark on RNNs.
Our proof of Theorem 1 showed that our problematic language p̃ is efficiently computable (though not by any locally normalized architecture with compact parameters). Because this paper is in part a response to popular neural architectures, we now show that p̃ can in fact be computed efficiently by a recurrent neural network (RNN) with compact parameters. Thus, this is an example where a simple globally normalized RNN parameterization is fundamentally more efficient (in runtime or parameters) than any locally normalized parameterization of any architecture (RNN, Transformer, etc.).

Since we showed that p̃ is efficiently computable, the existence of an RNN implementation is established in some sense by the ability of finite rational-weighted RNNs to simulate Turing machines (Siegelmann and Sontag, 1992), as well as an extension of Chen et al. (2018, Thm. 11) to a family of RNNs, where each RNN instance also takes some formula encoding as input. However, it is straightforward to give a concrete construction: for each n ∈ N, a simple RNN_n that maps each string x ∈ B^n to p̃(x). Here p̃(x) will be either (1/3)^{n+1} or 0, according to whether x has the form φ a where φ encodes a 3-CNF-SAT formula that is satisfied by a. (The restriction to 3-CNF-SAT formulas is convenient, but makes this a slightly different definition of L and p̃ than we used in the proofs above. Those proofs can be adjusted to show that this p̃, too, cannot be efficiently locally normalized with compact parameters. The only change is that in the construction of Theorem 1, φ' must be converted to 3-CNF. The proof then obtains its contradiction by showing that 3-CNF-SAT ∈ P/poly, which suffices since 3-CNF-SAT is also NP-complete.) The basic idea is that φ has ≤ n variables, so there are only O(n^3) possible 3-CNF clauses. The RNN allocates one hidden unit to each of these. When reading φ a, each clause encountered in φ causes the corresponding hidden unit to turn on, and then each literal encountered in a turns off the hidden units for all clauses that would be satisfied by that literal. If any hidden units remain on after x has been fully read, then φ was not satisfied by a, and the RNN's final output unit should return 0. Otherwise it should return (1/3)^{n+1}, which is constant for this RNN. To obtain digital behaviors such as turning hidden units on and off, it is most convenient to use ramp activation functions for the hidden units and the final output unit, rather than sigmoid activation functions. Note that our use of a separate RNN_n for each input length n is an example of using more hidden units for larger problems, a key idea that we introduced in §2.3 in order to look at asymptotic behavior. The RNN's parameter sequence θ_RNN = {θ_RNN^n | n ∈ N} is obviously compact, as θ_RNN^n only has to store the input length n. With our alphabet B for p̃, |θ_RNN^n| ∈ O(log n).
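The hidden-state logic of this construction can be simulated directly. The following sketch (ours) uses one indicator per clause actually present in φ rather than per possible clause, and no real ramp units; it only mirrors the on/off bookkeeping described above.

# Simulate the RNN's bookkeeping: reading φ turns on an indicator per clause,
# reading the assignment a turns off every clause satisfied by some literal of a.
# The real RNN_n would output the constant (1/3)^{n+1} if nothing stays on, else 0.

def formula_satisfied(clauses, assignment):
    active = set(map(frozenset, clauses))          # "hidden units" switched on while reading φ
    for i, bit in enumerate(assignment, start=1):  # reading a, one literal per time step
        literal = i if bit == 1 else -i
        active = {c for c in active if literal not in c}
    return not active                              # True iff every clause got switched off

phi = [[1, -2, 3], [1, 1, -4]]                     # 3-CNF version of the running example
print(formula_satisfied(phi, (1, 1, 0, 1)))        # True  -> RNN outputs (1/3)^{n+1}
print(formula_satisfied(phi, (0, 1, 0, 1)))        # False -> RNN outputs 0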
Lemma 3. Two normalizable weighted languages with different supports cannot have the same ranking of strings.

Proof. Suppose that the claim is false, i.e., p̃ and p̃' have different supports but the same ranking of strings. Then the minimum-weight strings under p̃ must also be minimum-weight under p̃'. WLOG, there exists x ∈ V* with p̃(x) = 0 and p̃'(x) = c > 0. Then c > 0 is the minimum weight of strings under p̃'. But this is not possible for a normalizable language p̃', since it means that Z_{p̃'} ≜ Σ_{x'∈V*} p̃'(x') ≥ Σ_{x'∈V*} c diverges.

Theorem 3. Assuming NP ⊄ P/poly, there exists an efficiently computable normalizable weighted language p̃ such that no ELNCP p̃' with support(p̃') ⊇ support(p̃) has p̃(x_1) < p̃(x_2) ⇒ p̃'(x_1) < p̃'(x_2) for all x_1, x_2 ∈ V*. Indeed, any such p̃' has a counterexample where p̃(x_1) = 0. Moreover, there is a polynomial s_{p̃'} : N → N such that a counterexample exists for every x_1 such that p̃(x_1) = 0 and p̃'(x_1) > 0, where the x_2 in this counterexample always satisfies |x_2| ≤ s_{p̃'}(|x_1|).
Our proof of Theorem 3 below is based on the observation that a positive rational number r < 1 takes Ω(log(1/r)) bits to store, which implies that any polytime algorithm that assigns a positive probability q̃(x) to a string x must have |log q̃(x)| bounded above by some polynomial function of |x|. In other words, the negative quantity log q̃(x) is bounded below by minus a polynomial function of |x|. We then show that there exists a family X_2 of strings such that for every x_1 with q̃(x_1) > 0 there exists x_2 ∈ X_2 with q̃(x_1) > q̃(x_2) (because x_2 is sufficiently longer), but 0 = p̃(x_1) < p̃(x_2), following Theorem 2 and Lemma 3.
We will refer to the condition above as 'the x_1 condition' in the rest of the proof.
Since X_2 is a family of strings of strictly increasing lengths (third claim regarding X_2) and longer strings have lower upper bounds on their weights under q̃ (second claim regarding X_2), together with the x_1 condition, we can identify a subset of X_2 whose strings have lower weights under q̃ than x_1. Specifically, there is a polynomial function f such that for all x_2 ∈ X_2, q̃(x_1) > q̃(x_2) whenever |x_2| > f(|x_1|); let these strings form a subset X_2^{(f,|x_1|)} ⊆ X_2. Now we show that there exists a polynomial function of |x_1| such that a non-empty subset of any such X_2^{(f,|x_1|)} also has string lengths bounded above by that function. We know such a function exists, since log q̃(x_1) is bounded below by minus a polynomial in |x_1|, and for every x_2^{(i)} ∈ X_2, log q̃(x_2^{(i)}) is bounded above by a quantity that decreases polynomially in i (the second and fourth claims regarding X_2). Therefore, combining the x_1 condition with our second claim regarding X_2, there exists a polynomial function f_q̃ : N ∪ {0} → N such that for every x_1 ∈ Σ* with q̃(x_1) > 0, some x_2 ∈ X_2^{(f,|x_1|)} satisfies both q̃(x_1) > q̃(x_2) and, by our fourth claim regarding X_2, |x_2| ≤ f_q̃(|x_1|), which is what our theorem claims. Also, from Theorem 2 and Lemma 3 we know that there exists x_1 ∈ Σ* such that p̃(x_1) = 0 and q̃(x_1) > 0. And from our first claim regarding X_2 we know that every x_2 ∈ X_2^{(f,|x_1|)} ⊆ X_2 has p̃(x_2) > 0. Therefore, given any such x_1, we can find a counterexample pair (x_1, x_2).

Lemma A.1. The first part of Theorem 4 (without the modifications (a) and (b)).
We first prove the first part of Theorem 4 (which is restated in full below). In this case we will use a distribution p̃ that does not have support Σ* (so it does not prove modification (b)).
Proof. We take p̃ to be the weighted language that was defined in §3.2, which was already shown to be efficiently computable. Suppose (q, Θ_q, c) is a counterexample to Lemma A.1. Choose an integer k ≥ 1 in a manner (dependent only on c) to be described at the end of the proof. From the input formula φ, which has m variables, construct in polynomial time a formula φ' over m + k variables whose satisfying assignments are characterized below.
The strings in L = support(p̃) that begin with φ' (identifying φ' with its encoding enc(φ')) are precisely the strings of the form φ'a' where a' is a satisfying assignment of φ'. By the construction of φ', this happens precisely when a' = 0^{m+k}, or a' = 1av⃗ where a is a satisfying assignment of φ and v⃗ ∈ B^{k−1}.
By our definition of p̃, all strings in L that begin with φ' have equal weight under p̃. Call this weight w, and let Z(x̂) denote the total weight under p̃ of strings with prefix x̂. Clearly Z(φ'0) = w, and Z(φ'1) = w · 2^{k−1} · #SAT(φ), where #SAT(φ) denotes the number of satisfying assignments of φ.
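Spelling out the quantity that the next step manipulates (a restatement of the definitions just given; #SAT(φ) again denotes the number of satisfying assignments of φ):

\[
p(0 \mid \phi') \;=\; \frac{Z(\phi'0)}{Z(\phi'0) + Z(\phi'1)}
\;=\; \frac{w}{w + w \cdot 2^{k-1} \cdot \#\mathrm{SAT}(\phi)}
\;=\; \frac{1}{1 + 2^{k-1}\,\#\mathrm{SAT}(\phi)},
\]

so this quantity equals 1 when φ is unsatisfiable and is at most 1/(1 + 2^{k−1}) when φ is satisfiable.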
Recall that p(0 | φ') = Z(φ'0)/(Z(φ'0) + Z(φ'1)). Let us abbreviate this quantity by ρ. It follows from the previous paragraph that if φ is unsatisfiable, then ρ = 1, but if φ is satisfiable, then ρ ≤ 1/(1 + 2^{k−1}). By hypothesis, ρ is approximated (with error probability < 1/3) by the possibly random quantity M_{|φ'0|}(φ'0), which we abbreviate by ρ̂, to within a factor of c; that is, ρ̂ ∈ [ρ/c, ρc]. By choosing k large enough that the interval [ρ̂/c, ρ̂c] cannot contain both 1 and 1/(1 + 2^{k−1}) (it suffices that 2^{k−1} > c^2 − 1), we can use ρ̂ to determine whether ρ = 1 or ρ ≤ 1/(1 + 2^{k−1}). This allows us to decide whether φ is satisfiable in polynomial time with error probability < 1/3, since by hypothesis ρ̂ is computable in polynomial time with compact parameters. This shows that SAT ∈ BPP/poly = P/poly, implying NP ⊆ P/poly, contrary to our assumption. (BPP/poly is similar to P/poly but allows the constructed machines to be bounded-error probabilistic Turing machines.)

Theorem 4. Assuming NP ⊈ P/poly, there exists an efficiently computable weighted language p̃ such that there is no pair (q, Θ_q) with Θ_q = {θ_n^q : n ∈ N} that satisfies all of the following properties (similar to §3.1):
• the parameter size |θ_n^q| grows only as O(poly(n));
• q(θ_n^q) returns a probabilistic Turing machine M_n in time O(poly(n));
• there exists c ≥ 1 such that for each σ ∈ Σ ∪ {$} and x ∈ Σ* with |x| ≤ n and p(σ | x) > 0, the probabilistic computation M_n(xσ) has probability > 2/3 of approximating p(σ | x) to within a factor of c (that is, M_n(xσ)/p(σ | x) ∈ [1/c, c]);
• M_n runs on those inputs xσ in time O(poly(n)).
Moreover, the statement above remains true with either of the following modifications: (a) the approximation guarantee is only required to hold for prefixes x such that {x̂ ∈ support(p̃) : x ⪯ x̂} is finite (so that p(σ | x) is computable by brute force); (b) support(p̃) = Σ*.

Proof. It remains to show that the statement remains true with modification (a) and with modification (b). For (a), the proof of Lemma A.1 suffices, since it reduces SAT to approximate local probability queries of the stated form. That is, the true local probabilities p(σ | x) that it queries can be computed with finite summations, thanks to the structure of our example language p̃, which guarantees that the prefix x can only continue with suffixes of a fixed length that is easily determined from x.

For modification (b), again let p̃_1 denote the weighted language defined in §3.2 and used above. Choose some λ > 0 (any choice will do), let p̃_2(x) = (1/9)^{|x|+1} for all x ∈ B*, and define p̃(x) = p̃_1(x) + λ · p̃_2(x). We use Z_1, Z_2, and Z respectively to denote the normalizing constants of these three weighted languages. Note that p̃_1 is the weighted language that was previously used in the proofs of Theorem 1 and Lemma A.1. Our new p̃ is intended to be very similar while satisfying the additional condition (b). It is easy to show that p̃ is efficiently computable, much as we showed for p̃_1 in Theorem 1. Also, p̃ is normalizable, since Z = Z_1 + λ · Z_2, where Z_1 ≤ (1/3)/(1 − 2/3) = 1 and Z_2 = (1/9)/(1 − 2/9) = 1/7 are both finite. The proof proceeds as in Lemma A.1, with φ' constructed from φ as before. Recall that φ has m variables, φ' has m + k variables, and |enc(φ')| = n. We may assume WLOG that the encoding function enc is such that an encoded formula always has at least as many bits as the number of variables in the formula, so n ≥ m + k.
Notice that Z_1(φ') sums over the satisfying assignments of φ', and there may be as few as one of these (if φ is unsatisfiable). By contrast, Z_2(φ') sums over an infinite number of continuations with positive weight. The faster decay rate of 1/9 in p̃_2 was chosen to keep Z_2(φ') small relative to Z_1(φ') despite this. Note also that Z_2(x̂) depends only on |x̂|, so Z_2(φ'0) = Z_2(φ'1). As in the proof of Lemma A.1, we will show that p(0 | φ') is much larger when φ is unsatisfiable. Recall that p̃(x) = p̃_1(x) + λ · p̃_2(x). When φ has zero satisfying assignments, p(0 | φ') = (w + λZ_2(φ'0)) / (w + 2λZ_2(φ'0)), whereas if φ has at least one satisfying assignment, then p(0 | φ') ≤ (w + λZ_2(φ'0)) / (w(1 + 2^{k−1}) + 2λZ_2(φ'0)). This rewrites both probabilities in terms of the quantity λ · Z_2(φ'0), which does not depend on the number of satisfying assignments. So now we can see that the first probability is at least (1 + 2^{k−1}) / (1 + 2/7) times as large as the second probability. Choose k large enough that the interval [ρ̂/c, ρ̂c] cannot contain both probabilities, and complete the proof as in Lemma A.1.

Theorem 5. There exists a light marginalization p of an ELN distribution, such that support(p) is an NP-complete language.
Proof. We will construct p such that support(p) is the NP-complete language SAT of all satisfiable boolean formulas. The idea is to construct an ELN distribution that can autoregressively generate any assignment a followed by any formula φ that is satisfied by a. Thus, if we delete the a prefixes, the support consists of exactly the satisfiable formulas (or more precisely, their encodings enc(φ)).
To be more precise, we will have the support of the underlying ELN distribution be the language L' = {1^n 0 a enc(φ) | n ≥ 0, a ∈ B^n, and φ is a formula over n variables that is satisfied by a}. This is defined similarly to the support language of §3.2, but with the order of φ and a crucially swapped: we now generate the "solution" a before the "problem" φ. We have prepended to a a unary encoding 1^n 0 of n, the number of variables, so that it is clear where a ends and enc(φ) begins. The marginalization operator maps 1^n 0 a enc(φ) to enc(φ); that is, it deletes the first 2n + 1 symbols.
We additionally require |enc(φ)| ≥ n. This ensures that marginalizing over the 2n + 1 latent symbols is only light marginalization, since 2n + 1 + |enc(φ)| ∈ O(poly(|enc(φ)|)). For convenience, we will also require φ to be a CNF formula. These requirements shrink support(p) but do not affect its NP-completeness. The remaining challenge is to construct an autoregressive distribution whose support is L'. We can think of this distribution as describing an efficient procedure for randomly generating a string from left to right, such that the procedure terminates with probability 1, runs in time that is polynomial in the length of the resulting string (so that it is ELN), has positive probability of producing any string in L', and has zero probability of producing any string not in L'. Below we give such a procedure.
1. First, the procedure generates a string of the form 1^n 0: at each step, it chooses uniformly from {0, 1} until it generates a 0. Let n be the number of 1 symbols generated so far. For example, if 1110 was generated, then n = 3.
2. Next, the procedure generates a as a sequence of n random symbols from {0, 1}, again making a uniform draw at each step. In our example, it might generate 010.
3. Finally, the procedure must generate the encoding of a random CNF formula φ that is satisfied by a, such as (x_2 ∨ ¬x_3 ∨ ¬x_2 ∨ x_2) ∧ (¬x_1) in our example. This involves generating a random sequence of 0 or more satisfied clauses connected by ∧. At each step, the procedure decides whether to generate a new clause or end the formula; the probability of generating a new clause is ordinarily 1/2, but it is 1 if the previous clauses do not yet mention all the variables x_1, . . . , x_n. How does it generate each satisfied disjunctive clause? This involves generating a sequence of literals connected by ∨, at least one of which must be true. At each step of this subroutine, it uniformly chooses an integer i ∈ [1, n], and then flips a fair coin to decide whether to add the literal x_i or ¬x_i to the current clause. If the clause is now satisfied by a (i.e., at least one of its literals is true under a), it flips another fair coin to decide whether to end the clause.

It is clear that the procedure above terminates in finite time almost surely (i.e., with probability 1), so that it defines a consistent probability distribution over the finite strings Σ* (see footnote 7). Phase 1 almost surely terminates after a finite number of 1's. Phase 2 always terminates. Phase 3 almost surely terminates after a finite number of clauses, and each clause almost surely terminates after a finite number of literals.

Our presentation here makes use of an infinite alphabet that includes symbols such as x_i and ¬x_i for all i ∈ N_{>0}, as well as symbols such as 0, 1, ∧, ∨. We implicitly invoke some prefix-free encoding scheme to translate each symbol into a fixed string over a finite alphabet Σ; such translation can be done in O(poly(n)).
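As a concrete illustration of this generator, here is a sketch only: symbol encodings, helper names, and the handling of the degenerate n = 0 case are ours, and the actual ELN distribution is over the encoded symbol sequence rather than the structured tuple returned here.

```python
import random

def sample_latent_string():
    """Follow the three phases described above and return (n, a, enc_phi)."""
    # Phase 1: emit 1's until the first 0; n is the number of 1's.
    n = 0
    while random.random() < 0.5:
        n += 1
    # Phase 2: the assignment a is a uniform draw from {0, 1}^n.
    a = [random.randint(0, 1) for _ in range(n)]
    # Phase 3: generate a CNF formula satisfied by a, clause by clause.
    clauses, mentioned, all_vars = [], set(), set(range(1, n + 1))
    # Add another clause with prob. 1/2, but with prob. 1 while some
    # variable is still unmentioned (so that phi determines n).
    while n > 0 and (mentioned != all_vars or random.random() < 0.5):
        clause, satisfied = [], False
        while True:
            i = random.randint(1, n)                 # uniform variable choice
            negated = random.random() < 0.5          # fair coin for polarity
            mentioned.add(i)
            clause.append(("¬x%d" if negated else "x%d") % i)
            satisfied |= (a[i - 1] == (0 if negated else 1))
            if satisfied and random.random() < 0.5:  # fair coin to end clause
                break
        clauses.append("(" + " ∨ ".join(clause) + ")")
    return n, a, " ∧ ".join(clauses)
```

Any CNF formula that is satisfied by a and mentions all n variables can be produced with positive probability, matching the support L' above.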
The resulting distribution is ELN because there exists a Turing machine which computes from input x̂, in time O(poly(|x̂|)), the probability that the next symbol generated after the prefix x̂ would be σ under the above procedure.

Lemma A.2. Any light marginalization of an ECCP weighted language has support in NP/poly.

Notice that this lemma concerns the class NP/poly, not P/poly (see §2.4). The proof is straightforward.
Proof. Suppose p̃ is ECCP via (r, Θ_r), and μ is the marginalization operator, so that the marginalized language is q(x) = Σ_{z: μ(z)=x} p̃(z). By the light marginalization assumption, there is a polynomial f such that |z| ≤ f(|μ(z)|).
To prove support(q) ∈ NP/poly, we must show that there exists a pair (M, Θ) such that for all n ≥ 0, a nondeterministic Turing machine M_n can be constructed as M(θ_n) in time O(poly(n)), which can in turn decide in time O(poly(n)) whether q(x) > 0 for any x with |x| = n.
M_n operates by nondeterministically guessing a string z with |z| ≤ f(n) and checking it. How can M_n check such a z? It can decide the first condition, μ(z) = x, in time O(poly(n)), since the marginalization operator μ is a polytime function. To decide the second condition, p̃(z) > 0, it must construct the (deterministic) Turing machine r(θ_r^{(|z|)}) and then apply it to z to obtain p̃(z): since p̃ is ECCP, both steps take time O(poly(|z|)) = O(poly(f(n))) ⊆ O(poly(n)) as required. However, this means that M_n = M(θ_n) must have access to the parameter vectors θ_r^{(m)} for all m ≤ f(n). We therefore make θ_n include this collection of parameter vectors. Each |θ_r^{(m)}| ∈ O(poly(m)) ⊆ O(poly(n)) since p̃ is ECCP. So |θ_n| ∈ O(poly(n)) as required.
Lemma A.3. For any L ∈ NP/poly, there exists a light marginalization q of an ELNCP distribution, such that support(q) = L.
In this proof we will demonstrate how an ELNCP distribution can correspond to a left-to-right stochastic string generation process such that when generation terminates, we have a string whose suffix is guaranteed to be in L.
Our generative story is inspired by rejection sampling, a widely used Monte Carlo method for sampling from intractable distributions. In standard rejection sampling schemes, we first sample a string from a tractable distribution, then toss a coin to decide whether to keep the sample. If the sample is not in the support of the target distribution, we reject unconditionally; otherwise, there is a nonzero probability that we accept the sample and halt. If we reject a sample, we try again, until we accept (and halt). Rejection sampling offers a general framework for exploiting randomness to (approximately) solve computational problems, but it is not guaranteed to halt in finite time, which poses a problem if we require light marginalization: we would like a guarantee that every accepted string is relatively short compared to the answer it contains. We therefore make use of the polysize parameter vectors of ELNCP languages to store certain 'example strings' that are guaranteed to be in L. Wherever ordinary rejection sampling would reject a string and try generating another, we instead accept a previously stored example string of an appropriate length. This places a lot of probability mass on a small number of example strings, whereas rejection sampling effectively throws away this mass and renormalizes. Unlike rejection sampling, our scheme distorts the local probabilities, but it does preserve the support.
At a high level, the ELNCP distribution p is a distribution over strings that record traces of the generative story described above. Formally, these traces take the form z = a b c ℓ e, where a, b, c, and e are themselves strings and ℓ is a single bit. The substring a encodes a target string length: we will accept a string in L that is at least as long as |a| − 1. We then sample a string b from a proposal distribution over B^{|a|−1}. The substring c is an encoding of a polysize prefix of a computation path of the nondeterministic Turing machine M_{|b|} that decides L on strings of length |b|: since L ∈ NP/poly, M_{|b|} (following the notation established in §2.4) must halt in ≤ f(|b|) steps on an input string of length |b|, where f : N → N is a polynomial function. The bit ℓ is a convenience variable that indicates the outcome of rejection sampling (ℓ = 0 for accepting b, and ℓ = 1 for rejecting b); we always reject b if it is not in L. Finally, e is a copy of b iff b ∈ L was accepted (indicated by ℓ = 0). If b was not accepted (indicated by ℓ = 1), we let e be some example string memorized in the ELNCP weighted language's parameter vectors that is also known to be in L and of length ≥ |b|. After the generation of e, we stop. The marginalization operator returns the final substring, μ(a b c ℓ e) = e, which is guaranteed to be in L.
Proof. WLOG we assume L uses the Boolean alphabet B. We discuss two cases of L.

If L is finite, then there exists a polytime Turing machine that accepts x̂ ∈ B* iff ∃x ∈ L with x̂ ⪯ x: the machine encodes a finite trie, and for any x̂ ∈ B* we can decide whether x̂ is a valid prefix of some string in L in time O(|x̂|) by following a path in the encoded trie. Given such a machine, we can define an ELN distribution q that computes any local probability q(σ | x̂) with prefix probability q(x̂) > 0 in O(poly(|x̂|)). Taking the marginalization operator to be the identity, μ(x) = x, this q lightly marginalizes itself and has support L.

Now we discuss the case where L is infinite. First, we describe the set of 'example strings' formally. We define L_0 = {x_0^{(i)} : i ∈ N} to be a subset of L where, speaking at a high level, x_0^{(i)} is a 'specimen' i-th shortest string from L, with |x_0^{(i)}| ≥ i − 1. More formally, to identify a unique L_0, we arbitrarily define that each x_0^{(i)} ∈ L_0 has the lowest lexicographic order among all x ∈ L with |x| = |x_0^{(i)}|.

We define a weighted language p(z) = Π_{t=1}^{|z|+1} p(z_t | z_{<t}), where z_{|z|+1} ≝ $, whose support Z contains strings of the form z = a b c ℓ e. Below we (re)use t as an index variable to scan through the substrings a, b, c, ℓ, e. The substring a has the form 0*1: we let a = 0...01 ∈ A and set p(z_t = 0 | a_{<t}) = 1/2 at each step. The substring b lies in B^{|a|−1}, with p(z_t = 0 | a b_{<t}) = p(z_t = 1 | a b_{<t}) = 1/2 as long as |b_{<t}| < |a| − 1. The substring c is an encoding of the choices made during M_{|b|}'s execution on b: specifically, c is a list of ⌈log w(|b|)⌉-bit integers (words) c_1 c_2 . . ., where w : N → R_{>1} is a polynomial function, and c_t = j ∈ N means that at timestep t, the nondeterministic Turing machine M_{|b|} takes the j-th entry of its lexicographically sorted transition relation table, which has at most O(poly(|b|)) entries, on input string b. We define p(c_t | a b c_{<t}) ∝ 1 iff c_{<t} c_t is a prefix of some valid transition sequence of M_{|b|} on input b, and p(c_t | a b c_{<t}) = 0 otherwise.
First we check the claim that q(x) > 0 ⇐⇒ x ∈ L, where q(x) = Σ_{z: μ(z)=x} p(z). We check both strings in L and strings not in L. If x ∈ L, there must exist a transition string of M_{|x|} under which M_{|x|} accepts x in at most f(|x|) steps, by the definition of NP/poly. Therefore there exist a ∈ A with |a| = |x| + 1 and c ∈ B* with |c| ∈ O(poly(|x|)) such that z = a x c 0 x ∈ Z and q(x) ≥ p(z) > 0. On the other hand, suppose x ∉ L and suppose there were some z ∈ Z with μ(z) = x and q(x) ≥ p(z) > 0. If z had the form a x c 0 x, then c would encode an accepting computation of M_{|x|} on x; but given x ∉ L, all computations of M_{|x|} must reject x in at most f(|x|) steps by definition, so we have a contradiction. If instead z had the form a b c 1 x, then because of the 1 preceding the suffix x of z, x would have to be in L_0; since L_0 ⊆ L, this contradicts our assumption x ∉ L. Therefore {z ∈ Z : μ(z) = x} is an empty set, and q(x) = Σ_{z: μ(z)=x} p(z) = 0.
Finally, we check whether p is ELNCP. First we check whether p defines a distribution over Z. We observe that, given any a ∈ A, the conditional probability under p that a is followed by a finitely long suffix is 1: specifically, for every a ∈ A, p(z) = 0 whenever a ⪯ z and |z| exceeds a fixed bound that is polynomial in |a| (determined by |x_0^{(|a|−1)}| and the runtime bound f(|a| − 1)). Therefore p defines a distribution over Z.
We then check whether there exists a family of parameter vectors such that on a string z, computing p(z_t | z_{<t}) for t ∈ [1 . . . |z| + 1] uses O(poly(|z|)) time and a parameter vector of size O(poly(|z|)). We define a family of parameter vectors {θ_n : n ∈ N}, where θ_n encodes both the example strings that a string of length n may use, L_0^{(n)} ≝ {x_0^{(i)} ∈ L_0 : |x_0^{(i)}| ≤ n}, and the advice strings of the NP/poly language L for input lengths ≤ n. Since every x_0^{(i)} ∈ L_0^{(n)} has |x_0^{(i)}| ≤ n, we have |L_0^{(n)}| ∈ O(poly(n)), so L_0^{(n)} can be stored in size O(poly(n)). As for the advice strings, each has length O(poly(n)), so they too can be stored in size O(poly(n)). Therefore |θ_{|z|}| ∈ O(poly(|z|)). For every prefix ẑ with prefix probability p(ẑ) > 0, we can compute p(· | ẑ) as follows:
1. If ẑ is a proper prefix of some a ∈ A (decidable in O(|ẑ|)), then p(· | ẑ) can be computed in O(|ẑ|). Otherwise, extract a from ẑ and proceed to the subsequent checks.
2. If we fall through the previous check, check whether ẑ is a proper prefix of some ab with b ∈ B^{|a|−1}. If so, p(· | ẑ) = 1/2 for each next bit. Otherwise, extract b from ẑ and proceed.
3. Let ĉ = ẑ_{>|a|+|b|}. If we fall through the previous checks, check whether ẑ is a proper prefix of some abc, where c is a transition string of M_{|b|} on input b; this holds iff ĉ is a prefix of valid transitions of M_{|b|} on input b that do not yet end in an ACCEPT or REJECT state. Such a check can be done in O(poly(|b|)) ⊆ O(poly(|ẑ|)): first we build M_{|b|} from the advice string for length |b| as described in §2.4, then we check the words of ĉ. If ĉ is a proper prefix of some accepting/rejecting c, we compute p(σ | ẑ) ∝ I(ĉσ is a prefix of some accepting/rejecting c) for σ ∈ B, which can be done in O(poly(|ẑ|)). Otherwise (if ẑ is not a proper prefix of some abc, meaning that ẑ already contains a complete accepting/rejecting c substring), extract c from ẑ and proceed.
4. If we fall through the previous checks, check whether ẑ has prefix abc but is a proper prefix of abcℓ; the check follows easily from the previous step. If so, we let p(ℓ = 0 | abc) ∝ I(c ends in an ACCEPT state) and p(ℓ = 1 | abc) ∝ 1, which can be computed in O(poly(|ẑ|)). Otherwise, we extract ℓ from ẑ.
5. If we fall through the previous checks, check whether ẑ has prefix abcℓ but is a proper prefix of abcℓe. If ℓ = 1, the appropriate example string e can be looked up in θ_{|ẑ|} in polytime; if ℓ = 0, p(σ | ẑ) = 1 ⇐⇒ ẑσ ⪯ abc0b, which can also be computed in polytime.
6. Finally, if we fall through the previous checks and ẑ = abcℓe, then p($ | ẑ) = 1; otherwise p($ | ẑ) = 0.
For every valid prefix ẑ, we can thus compute p(· | ẑ) in O(poly(|ẑ|)) ⊆ O(poly(|z|)). Since |θ_{|z|}| ∈ O(poly(|z|)) as described above, p is ELNCP.
We include this lemma to justify why this region is drawn as non-empty in Figure 2. To prove it, we first construct a weighted language p̃ that is in both EC and ELNCP. We then show that if this p̃ were ELN, we would have an algorithm that solves the halting problem for deterministic Turing machines, which is impossible.
For our purposes, we take L to consist of all strings of the form x^{(1)} x^{(2)} such that x^{(1)} = enc(M) is a prefix-free encoding of some deterministic Turing machine M that takes an empty input tape, and x^{(2)} is a halting execution trace of M on the empty tape, represented as a sequence of states of M that begins with an initial state and ends at a HALT state. Note that any deterministic TM x^{(1)} can be paired with at most one halting execution trace x^{(2)}, and cannot be paired with any x^{(2)} if it does not halt.

Clearly L ∈ P: given x ∈ B*, we can decide whether x ∈ L by first checking whether x consists of well-formed strings x^{(1)} and x^{(2)}, then building M from x^{(1)}, and then checking the state transitions in x^{(2)} against M step by step. All these operations take O(poly(|x|)) time. We conclude that the p̃ derived from L is EC.
However, p̃ is not necessarily ELNCP as desired. To fix this, we modify the construction to use a weighted language p̃' with sparse support L'. We will again be able to show that p̃' is EC. To show that p̃' is also ELNCP, we rely on the sparsity of L', meaning that prefixes(L') ≝ {x̂ : (∃x' ∈ L') x̂ ⪯ x'} contains at most O(poly(n)) strings x̂ of length ≤ n + 1. Thus, we can use θ_n to store all of those strings x̂ in polynomial space, along with their Z̃'(x̂) values. (More precisely, the first several bits of each Z̃'(x̂) ≤ 1 may be stored, when ELNCP is defined as explained in our "Remark on irrationality" above.) Notice that all strings x̂ ∉ prefixes(L') have Z̃'(x̂) = 0, so they need not be stored. Now for any x̂ of length ≤ n, a Turing machine that consults θ_n can compute p'(σ | x̂) = Z̃'(x̂σ)/Z̃'(x̂) in time O(poly(n)) as desired.
We may define p̃' as follows. Let sparsify(x) be a version of x with many extra 0 symbols inserted: specifically, it inserts 2^i copies of 0 immediately before the i-th bit of x, for all 1 ≤ i ≤ |x|. We construct p̃' so that p̃'(sparsify(x)) = p̃(x). Specifically, let L' ≝ sparsify(L). The inverse function sparsify^{-1}(x') is defined exactly on the strings x' in the image of sparsify, and is unique when defined. For all x' ∈ B*, let p̃'(x') ≝ p̃(sparsify^{-1}(x')) if sparsify^{-1}(x') is defined, and p̃'(x') ≝ 0 otherwise. This can be computed in polytime, so p̃' is EC. Also, its support is sparse as claimed, so p̃' is ELNCP.
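For concreteness, here is a small sketch of the sparsify map and its partial inverse (string-based; the helper names are ours):

```python
from typing import Optional

def sparsify(x: str) -> str:
    """Insert 2**i copies of '0' immediately before the i-th bit of x."""
    return "".join("0" * (2 ** i) + bit for i, bit in enumerate(x, start=1))

def unsparsify(xp: str) -> Optional[str]:
    """Partial inverse of sparsify; returns None if xp is not in its image."""
    out, i, pos = [], 1, 0
    while pos < len(xp):
        pad = 2 ** i
        if xp[pos:pos + pad] != "0" * pad or pos + pad >= len(xp):
            return None          # wrong padding, or padding with no bit after it
        out.append(xp[pos + pad])
        pos += pad + 1
        i += 1
    return "".join(out)

assert unsparsify(sparsify("10110")) == "10110"
```

Since a sparsified string of length N can encode only O(log N) original bits, only polynomially many strings of length ≤ N + 1 can be prefixes of strings in L', which is exactly the sparsity property used above.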
Finally, we claim that p̃' is not ELN. Given any Turing machine M that takes an empty input tape, M halts iff enc(M) ∈ prefixes(L) iff sparsify(enc(M)) ∈ prefixes(L') iff p'(σ | x̂) > 0 for all σ and x̂ such that x̂σ ⪯ sparsify(enc(M)). But this would be decidable if p̃' were ELN as defined in §3.1, since then we would have a Turing machine to compute the local conditional probabilities p'(σ | x̂).

B.1 Modeling finite subsets of infinite languages
The experiments of this paper are conducted on datasets where we only observe strings that are finitely long. We use the notation L_{≤T} = {x | x ∈ L, |x| ≤ T} for the subset of an infinite language L that contains all strings that are at most T symbols long. Specific values of T for the datasets used in the experiments are listed in Appendix D.1.
B.2 Design of base models p_0
p_0 can be any distribution over L_{≤T}, provided that we can sample from it and evaluate p_0(x) for any x ∈ L_{≤T}, both in O(poly(|x|)). In this work, we experiment with two designs of p_0: GRU- and Transformer-based locally normalized language models. GRU-based models are used in the WikiText and Yelp experiments; these p_0's are parametrized as 2-layer GRUs with 500 hidden units and word embeddings of dimension 500.
As for Transformer-based p_0's, we make use of Grover models (Zellers et al., 2019), which are effectively GPT-2 models trained on the aforementioned RealNews dataset. In this work, we experiment with the 'base' variant of the publicly available weights, which are 12-layer Transformers with 12 heads and 768 hidden units.

B.3 Design of discriminators
We formulate g_θ(x) as a summation of scores at positions 1 . . . |x|, passed through an activation function f. To verify whether lower-bounding g_θ would help with learning, as we discuss in §4.1, we experiment with two variants of f: a tanh-based one, which is bounded within (−2, 2), and a softplus-based one, which has range (−∞, 0). The offset term in the softplus activation function determines the initial values of g_θ; in this paper we set it to 20. The design of g_θ follows the base-model counterparts: we use Bi-GRU discriminators for GRU base models, and bidirectional Transformer discriminators for Transformer base models. In both cases the position-t score is a linear function of the hidden state h_t at time step t, and the discriminator has access to information about the whole sequence x at every timestep: the Bi-GRU discriminators achieve this through the bidirectional RNNs, and the Transformers through the attention mechanism without directional masking.
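A minimal PyTorch-style sketch of the GRU-based discriminator described above, assuming the bounded variant is 2·tanh(·) and the unbounded variant is a negated, offset softplus; the exact wiring and hyperparameters are our guesses, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDiscriminator(nn.Module):
    """Sum per-position scores from a bidirectional encoder, then squash
    with a bounded (tanh) or negative-range (softplus) activation."""

    def __init__(self, vocab_size, hidden=500, bounded=True, offset=20.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)    # per-position score s_t
        self.bounded = bounded
        self.offset = offset

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        h, _ = self.encoder(self.embed(tokens))  # (batch, seq_len, 2*hidden)
        s = self.score(h).squeeze(-1).sum(dim=1) # sum of per-position scores
        if self.bounded:
            return 2.0 * torch.tanh(s)           # range (-2, 2)
        return -F.softplus(s - self.offset)      # range (-inf, 0)
```

The bounded head caps |g_θ(x)| at 2, which corresponds to the smaller hypothesis space discussed in §4.1 and Appendix D.5.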

B.4 Training procedure
As we note in §4.1, MLE-based training methods are generally not feasible for globally normalized models. We therefore opt to train our model using the ranking variant of noise contrastive estimation (NCE) (Ma and Collins, 2018), which does not require samples from the model p̃_θ (only noise samples from p_0) and has a simple form for residual LMs. Using p_0 as the noise distribution, NCE training minimizes a single-sequence loss in expectation over the true distribution p; for residual LMs, the NCE minimization objective (2) reduces to the simple form L(θ, x, p_0, K) in equation (3), in which the base-model probabilities cancel and only the discriminator scores remain (see the sketch at the end of this subsection). Notice that minimizing the expected loss of equation (3) with stochastic gradient descent methods requires only evaluating sequence scores under g_θ and tuning its parameters, but not the base model p_0; we only need to generate the noise samples {x^{(k)} ∼ p_0 | k ∈ [K]} from p_0. This way we do not need to backpropagate through parameters of the base model p_0, which can speed up training considerably when p_0 is backed by a huge network. In fact, the training of g_θ can be completely agnostic to the design of p_0, allowing it to be applied on top of any locally normalized p_0, including fine-tuned ones. Given the same discriminator g_θ, the difference between the KL divergence of the true distribution from the residual language model p̃_θ(x) = p_0(x) · exp g_θ(x) and the KL divergence of the true distribution from p̃'_θ(x) = p_0'(x) · exp g_θ(x), defined with base models p_0 and p_0' respectively, can be written as the difference in cross entropies between the true distribution and the two base models, plus the difference of their log partition functions.
Proof. This follows by plugging the definitions of p̃_θ and p̃'_θ into the two KL divergences. Of the two resulting terms, the former can be reduced by minimizing empirical cross entropy, which is computationally efficient, while the latter involves an intractable partition function Σ_{x ∈ L_{≤T}} p̃_θ(x). Pseudocode for fine-tuning is listed in Algorithm 1.
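A sketch of the reduced ranking-NCE objective for one true sequence and K noise sequences drawn from p_0: under our reading of the reduction, the p_0 log-probabilities cancel, leaving a softmax over discriminator scores (function and argument names are ours).

```python
import torch

def nce_ranking_loss(g_theta, true_seq, noise_seqs):
    """Ranking-variant NCE loss for a residual LM p(x) ∝ p0(x)·exp g_theta(x),
    with the K noise sequences drawn from the base model p0 itself.  Because
    the noise comes from p0, the p0 factors cancel and only discriminator
    scores remain -- the 'simple form' referred to above."""
    scores = torch.stack([g_theta(true_seq)] +
                         [g_theta(x_k) for x_k in noise_seqs])  # shape (K+1,)
    # negative log probability that the true sequence is ranked first
    return -torch.log_softmax(scores, dim=0)[0]
```

Only g_θ receives gradients here, which is the property noted above.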

C Comparison between REBMs and autoregressive models
We evaluate the effectiveness of REBMs on two different neural architectures (GRU- and Transformer-based) and 3 datasets: WikiText (Merity et al., 2017), Yelp (Yelp), and RealNews (Zellers et al., 2019), on the task of modeling sequence probabilities. An REBM p̃_θ has two components, g_θ and p_0, and we would like to see how p̃_θ competes against p_0 itself. We do not further tune p_0 while training g_θ. As a fair comparison, we also train p_0', a version of p_0 that has been fine-tuned for as many additional epochs as were used to train g_θ, and compare p̃_θ against it. The p_0 models are pretrained on moderately large corpora (in the GRU cases) or a very large corpus (in the Transformer case). We compare residual energy-based models p̃_θ to the further-fine-tuned base models p_0' on conservatively estimated (at the low end of a 95% confidence interval) token perplexity and on bootstrap-sampled log-likelihood improvements. The results are in Table 2. Residual energy-based models show consistent perplexity improvements compared to p_0' models that are trained on the same data using the same maximum number of iterations. Although the improvement in log likelihood of p̃_θ over p_0' is modest (especially for the RealNews experiments, where p_0 is a very strong baseline), we verify that these improvements are all statistically significant (p < 0.05) using bootstrapped test datasets.
We experiment with different designs of the discriminator g_θ, evaluating the effectiveness of bounding g_θ and of varying its number of parameters. We find that in the Transformer-based experiments, bounding g_θ considerably helps performance, but the opposite happens for GRU-based models. We speculate that this is due to the base models' performance: the Transformer base models have a high parameter count and were trained on a lot of data, so the true distribution is likely relatively similar to p_0 and benefits from a small hypothesis space, even though we do not know whether the bounded-error assumption in §4.1 holds. On the other hand, our GRU-based p_0 has neither the capacity nor the huge amount of training data. As a result, the unbounded variant of g_θ (with its larger hypothesis space) may end up learning a better approximation of p.

D Experimental details

D.1 Datasets
Residual language model experiments are conducted on these datasets:
• Segmented WikiText: we take the standard WikiText-2 corpus (Merity et al., 2017) and segment it into sequences at line breaks. We discard all empty lines and any line that starts with the '=' token. In effect, we obtain sequences that are mostly entire paragraphs. We also only keep lines that are shorter than 800 tokens after BPE tokenization. Because of our preprocessing, Segmented WikiText loses much inter-paragraph context information, does not have the 'simple' header sequences that were in the original WikiText corpus, and is much harder to language-model.
• Yelp: the Yelp dataset (Yelp) contains business reviews. As in Segmented WikiText, we keep only reviews shorter than 800 tokens.
• RealNews: we make use of the standard RealNews corpus from (Zellers et al., 2019), which contains news articles that are up to 1,024 tokens long. In the Transformer case we simply take p_0 to be the Grover model.
In all experiments we tokenize with BPE tokenizers derived from the GPT-2 language models: the GRU models use Huggingface's implementation and the Transformers use Grover's. The number of sequences in the preprocessed datasets is listed in Table 3.
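A sketch of the Segmented WikiText preprocessing just described, assuming a plain-text WikiText-2 file and a GPT-2-style BPE tokenizer that exposes an encode() method (names are ours):

```python
def segment_wikitext(lines, tokenizer, max_tokens=800):
    """Split at line breaks, drop empty lines and '=' header lines, and keep
    only sequences shorter than max_tokens after BPE tokenization."""
    sequences = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("="):
            continue
        if len(tokenizer.encode(line)) < max_tokens:
            sequences.append(line)
    return sequences
```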

D.2 Pretraining base models p_0
We use a pretrained Grover model as the base model in the RealNews experiments. For GRU-based experiments, we train base models on the WikiText and Yelp datasets, using training and validation splits separate from those of the discriminator (Table 4). The base models are periodically (every 1,000 iterations) evaluated on the validation split for early stopping: we stop if there is no improvement in validation perplexity for 10 consecutive evaluations. The base models achieve test-set perplexities of 113.98 on Segmented WikiText and 110.89 on Yelp, respectively. Note that these base models are further fine-tuned on additional data in our comparison against residual language models.

D.3 Metrics
We evaluate the relative performance of residual language models against autoregressive models (i.e., fine-tuned base models) on two metrics, log-likelihood improvement and perplexity improvement, which are approximated as follows:
• Log-likelihood improvement: since p̃_θ, p, and p_0' are all distributions over L_{≤T}, we can quantitatively evaluate their difference in log likelihood. We measure the difference between KL[p ‖ p̃_θ] and KL[p ‖ p_0'] (equation (7)), where the partition-function estimate Ẑ is obtained using equation (6). A negative value of the log-likelihood difference indicates that p̃_θ approximates p better than p_0' does, in terms of KL divergence.
• Perplexity improvement: perplexity is a common language-modeling metric. We compute the per-token perplexity improvement via equation (8), where c(D) is the total token count of dataset D and |D| is the number of sequences in D; Ẑ is computed as described in Appendix B.5.
Note that p_0 here is the base-model component of p̃_θ. When comparing residual language models against autoregressive models, we also fine-tune p_0 on additional data to obtain a new model p_0', which has different parameters from p_0.
Both evaluation metrics involve estimating the partition function with Ẑ. For the perplexity-improvement metric, we obtain 32 estimates of Ẑ, which are approximately normally distributed, and compute equation (8) using the Ẑ at the conservative end of a 95% confidence interval. To account for variance in our test datasets, we further make use of bootstrap estimation for the log-likelihood improvement: we bootstrap-sample 1,000 subsamples of each test dataset, and compute equation (7) for each point in the Cartesian product (1,000 × 32 in total). We then report results at the 2.5% and 97.5% percentiles.
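A schematic sketch of how the bootstrapped log-likelihood improvement could be computed from per-sequence scores and the 32 partition-function estimates; the array names, shapes, and the exact combination are our assumptions about the procedure described above.

```python
import numpy as np

def bootstrap_ll_improvement(ll_residual_unnorm, ll_base, logZ_estimates,
                             n_boot=1000, seed=0):
    """ll_residual_unnorm[i] = log p0(x_i) + g_theta(x_i) for test sequence x_i,
    ll_base[i]               = log p0'(x_i),
    logZ_estimates           = independent estimates of log Z.
    Returns the 2.5% and 97.5% percentiles of the mean per-sequence
    log-likelihood improvement over bootstrap samples x Z-estimates."""
    ll_residual_unnorm = np.asarray(ll_residual_unnorm)
    ll_base = np.asarray(ll_base)
    rng = np.random.default_rng(seed)
    n = len(ll_base)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        gap = np.mean(ll_residual_unnorm[idx] - ll_base[idx])
        for logZ in logZ_estimates:               # Cartesian product
            diffs.append(gap - logZ)              # normalize the residual LM
    return np.percentile(diffs, [2.5, 97.5])
```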

D.4 Hyperparameters
Transformer experiments. We train our models on 64 GPUs across 8 nodes, with a total batch size of 64 × 8 × 2 = 1,024, and with 1 noise sequence per batch (K = 1 in Appendix B.4). We use an initial learning rate of 5e-5. The rest of the hyperparameters largely follow the settings in (Zellers et al., 2019). Optimization is done with the Grover implementation of AdaFactor.
GRU experiments. We train our models on 8 GPUs on a single node, with a total batch size of 8 × 2 = 16, and with 25 noise sequences per batch (K = 25 in Appendix B.4). We use an initial learning rate of 1e-4. Upon no improvement on validation data, we halve the learning rate, with patience 1. The model parameters are L2-regularized with a coefficient of 1e-5. We also apply dropout regularization with rate 0.5. Optimization is done with PyTorch-supplied Adam.

D.5 Configurations
We study the effects of these configurations:
• Bounding g_θ: we note in §4.1 that under the strong hypothesis that the base model p_0 has bounded error, g_θ will have a bounded range, which leads to a much smaller hypothesis space. In this work we experiment with both bounded and unbounded g_θ's, with ranges (−2, 2) and (−∞, 0) respectively. More details can be found in Appendix B.3.
• Model capability of g_θ: we hypothesize that g_θ does not need to be as expressive as the parametrization of p_0, since g_θ essentially only has to tell whether the sequence x comes from p or from p_0. For the GRU + WikiText experiments, we experiment with {1, 2}-layer GRU models for g_θ; for 1-layer models, we additionally experiment with a setup that has only 250 hidden units. For the Transformer/RealNews experiments, we experiment with {12, 4}-layer Transformer models.
We set = 512 in this paper.

D.6 Log likelihood improvements under different configurations
We also see in Table 5 that using tanh as the activation function does better than softplus for Transformers, but performs very poorly for GRUs, where we also observe degeneracy problems. We speculate that our Transformer-based base models have already learned a good approximation of the true distribution, so limiting the model capacity of g_θ in exchange for smaller variance results in a favorable trade-off, and vice versa for GRUs. Regarding discriminator capability, we see that performance is not very sensitive to model size: our best Transformer run actually comes from the smaller-model configuration, and the 1-layer 500-unit GRU models achieve the best performance. Overall, the results in Table 5 suggest that performance is sensitive to the choice of model configuration.