Global Autoregressive Models for Data-Efficient Sequence Learning

Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global a priori features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an unnormalized GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show a strong perplexity reduction of using the second autoregressive model over the standard one.


Introduction
Neural sequential text generation models have become the standard in NLP applications such as language modelling, NLG, and machine translation. When enough data is available, these models can be trained end-to-end with impressive results. Generally, inference and training proceed in an autoregressive manner: the next decoded symbol is predicted by a locally normalized conditional distribution (the "softmax"). This has several advantages: (i) the probability of the sequence is already normalized, by the chain rule over local decisions; (ii) max-likelihood (ML) training is easy, because the log-likelihood of the full sequence is simply the sum of local CE (cross-entropy) losses; (iii) exact sampling of full sequences from the model distribution is directly obtained through a sequence of local sampling decisions.
However, these autoregressive models (AMs) tend to suffer from a form of myopia. They have difficulty accounting for global properties of the predicted sequences, from overlooking certain aspects of the semantic input in NLG to duplicating linguistic material or producing "hallucinations" in MT, and are generally unable to account for long-distance consistency requirements that would be obvious to a human reader.
The main contributions of this paper are as follows.
First, we propose a hybrid seq2seq formalization, the Global Autoregressive Model (GAM), that combines a local autoregressive component with a global log-linear component, allowing the use of a priori features to compensate for the lack of training data. GAMs are related both to the class of Energy-Based Models (EBM) and to that of Exponential Families (EF), and inherit some important properties from those: an intimate relationship between training and sampling (EBM); the identity of empirical and model expectations at maximum likelihood, and convexity of the log-likelihood (EF).
Second, we propose a training procedure in two steps. In the first step, we train a GAM through max-likelihood; this GAM, however, is unnormalized and improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation.
Third, we demonstrate the ability of GAMs to be data-efficient, namely, to exploit the original data better than a standard autoregressive model. In order to clarify the core techniques and issues, we design a simple class of synthetic data, consisting of random binary strings containing "motifs" (specific substrings) that we can manipulate in different ways. We show that, in limited data conditions, GAMs are able to exploit the features to obtain final autoregressive models that perform better than the original ones.
The remainder of the paper is structured as follows. In Section 2, we provide some background about autoregressive models, energy-based models, and log-linear models. In Section 3, we introduce GAMs. In Section 4, we describe our focus on synthetic data. In Section 5, we explain our training procedure. In Section 6, we comment on related work. In Section 7, we describe our experiments. In Section 8, we provide an analysis of our results. We conclude with a discussion in Section 9. Note that some additional explanations and experiments are provided in the Supplementary Material, indicated by [SM].

Autoregressive models (AM)
These are currently the standard for neural seq2seq processing, with such representatives as RNN/LSTMs (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014), ConvS2S (Gehring et al., 2017), and the Transformer (Vaswani et al., 2017). Formally, they are defined through a distribution $r_\eta(x|C)$, where $C$ is an input (aka Context, e.g. a source sentence in Machine Translation (MT)), and $x$ is a target sequence (e.g. a target sentence in MT). We have:
$$r_\eta(x|C) = \prod_i s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C),$$
where each $s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C)$ is a normalized conditional probability over the next symbol of the sequence, computed by a neural network (NN) with parameters $\eta$. The local normalization of the incremental probabilities implies the overall normalization of the distribution $r_\eta(x|C)$, and consequently, the possibility of directly sampling from it and evaluating the likelihood of training sequences.
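The factorization above can be illustrated with a minimal sketch. It assumes a hand-written toy rule in place of the neural network $s_\eta$, fixed-length strings with no end-of-sequence symbol, and no context $C$; the names `next_symbol_probs`, `log_prob` and `sample` are hypothetical and only serve the illustration.

```python
import math
import random

def next_symbol_probs(prefix):
    """Toy stand-in for s_eta(. | x_1..x_{i-1}); a hand-written rule, not a neural network."""
    if not prefix:
        return {"0": 0.5, "1": 0.5}
    last = prefix[-1]
    other = "1" if last == "0" else "0"
    # Hypothetical rule: repeat the previous bit with probability 0.7.
    return {last: 0.7, other: 0.3}

def log_prob(x):
    """log r(x), computed as the sum of local log-probabilities (chain rule)."""
    return sum(math.log(next_symbol_probs(x[:i])[x[i]]) for i in range(len(x)))

def sample(length, rng):
    """Exact ancestral sampling: one local sampling decision per position."""
    x = ""
    for _ in range(length):
        probs = next_symbol_probs(x)
        x += "0" if rng.random() < probs["0"] else "1"
    return x
```

Because every local distribution sums to 1, the probabilities of all strings of a given length also sum to 1, which is the normalization property the paragraph relies on.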

Energy-Based Models (EBM)
EBMs are a generic class of models, characterized by an energy function $U_\eta(x|C)$ computed by a NN parametrized by $\eta$ (LeCun et al., 2006). Equivalently, they can be seen as directly defining a potential (an unnormalized probability distribution) $P_\eta(x|C) = e^{-U_\eta(x|C)}$, and indirectly the normalized distribution $p_\eta(x|C) = \frac{1}{Z_\eta(C)} P_\eta(x|C)$, with $Z_\eta(C) = \sum_x P_\eta(x|C)$. A fundamental property of these models is that, for max-likelihood training, the SGD updates can be computed through the formula:
$$\nabla_\eta \log p_\eta(x|C) = \nabla_\eta \log P_\eta(x|C) - E_{x \sim p_\eta(\cdot|C)}\, \nabla_\eta \log P_\eta(x|C), \qquad (1)$$
which, in principle, reduces the problem of training with unnormalized potentials to the problem of sampling from them.

Log-Linear Models / Exponential Families
Log-linear models (Jebara, 2013) are the conditional version of Exponential Families (Jordan, 2010). The general form of a log-linear model (for the discrete case) is as follows:
$$p_\lambda(x|C) = \frac{1}{Z_\lambda(C)}\, \mu(x;C)\, e^{\langle \lambda(C), \phi(x;C) \rangle}.$$
Here $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by scalar product with a real vector of weights $\lambda(C)$ of the same dimension; $\mu(x;C)$ is an arbitrary "base measure", which is fixed. These models, which make it possible to introduce prior knowledge through features and have nice formal properties (see below), were mainstream in NLP before the revival of neural approaches.

Proposal: GAMs
We now define Global Autoregressive Models (GAMs). These are hybrid seq2seq models that exploit both local autoregressive properties and global properties of the full target sequence.
A GAM is an unnormalized distribution $P_\eta(x|C)$ over sequences $x$, parametrized by a vector $\eta = \eta_1 \oplus \eta_2$:
$$P_\eta(x|C) = r_{\eta_1}(x|C) \cdot e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}.$$
Here $r_{\eta_1}(x|C)$ is an autoregressive seq2seq model for generating $x$ from input $C$, parametrized by $\eta_1$; $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by a scalar product with a real vector $\lambda_{\eta_2}(C)$ of the same dimension, computed over the input $C$ by a network parametrized by $\eta_2$. The normalized distribution associated with the GAM is $p_\eta(x|C) = P_\eta(x|C) / Z_\eta(C)$, where $Z_\eta(C) = \sum_x P_\eta(x|C)$.
GAMs appear promising for the following reasons:
• Features $\phi(x;C)$ provide a simple way to draw the attention of the model to potentially useful aspects that may be difficult for the AM component to discover on its own from limited data.
• GAMs are an instance of EBMs, where the potential $P_\eta(x|C)$ is the product of an AM potential $r_{\eta_1}(x|C)$ with a "log-linear" potential $e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}$. Here the gradient relative to the log-linear part takes the especially simple form:
$$\nabla_{\eta_2} \log p_\eta(x|C) = \left( \phi(x;C) - E_{x \sim p_\eta(\cdot|C)}\, \phi(x;C) \right) \cdot \nabla_{\eta_2} \lambda_{\eta_2}(C).$$
• Log-linear models, on their own, while great at expressing prior knowledge, are not as good as AM models at discovering unforeseen regularities in the data. Also, they are typically problematic to train from a log-likelihood perspective, because sampling from them is often infeasible. GAMs address the first issue through the $r$ component, and alleviate the second issue by permitting the use of $r$ as a powerful "proposal" (aka "surrogate") distribution in importance sampling and related approaches, as we will see.

Experimental focus
While the motivation for GAMs ultimately lies in practical NLP applications such as those evoked earlier, in this paper we aim to understand some of their capabilities and training techniques in simple and controllable conditions. We focus on the unconditional (i.e. language modelling) case, and on synthetic data. Our setup is as follows:
• We consider an underlying process $p_{true}$ that generates binary sequences according to a well-defined and flexible process. In this paper we use PFSAs (Probabilistic Finite State Automata) to impose the presence or absence of sub-strings ("motifs") anywhere in the generated data, exploiting the intersection properties of automata.
• Due to the dynamic programming properties of PFSAs, it is possible to compute the true entropy $H(p_{true}) = -\sum_x p_{true}(x) \log p_{true}(x)$ of the process (see [SM]), as well as other quantities (partition functions, mean sequence length); it is also possible to generate training ($D$), validation ($V$), and test ($T$) data in arbitrary quantities.
• We employ an unconditional GAM of the simple form:
$$p_\lambda(x) \doteq \frac{1}{Z_\lambda} P_\lambda(x), \qquad P_\lambda(x) \doteq r(x)\, e^{\langle \lambda, \phi(x) \rangle},$$
with $Z_\lambda \doteq \sum_x P_\lambda(x)$, where $r$ is trained on $D$ and then kept fixed, and where $\lambda$ is then trained on top of $r$, also on $D$.
It should be noted that with $r$ fixed in this way, this formulation exactly corresponds to the definition of an exponential family (Jordan, 2010), with $r$ as base measure. In such models, we have two important properties: (i) the log-likelihood of the data is concave relative to the parameters $\lambda$ (equivalently, the negative log-likelihood is convex), and thus a local maximum is also global; (ii) the max-likelihood value $\lambda^*$ has the property that the model expectation $E_{x \sim p_{\lambda^*}(\cdot)}\, \phi(x)$ is equal to the empirical expectation $|D|^{-1} \sum_{x \in D} \phi(x)$ (the "Moment Matching" property of exponential families).
• We are especially interested in the relative data-efficiency of the GAM compared to the AM $r$: namely, the ability of the GAM to recover a lower-perplexity approximation of $p_{true}$ than $r$, especially in small training-set conditions.
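The moment-matching property can be checked on a toy case small enough to enumerate. The sketch below is an illustration under stated assumptions, not the experimental setup: strings have length 8 rather than 30, $r$ is taken to be white noise instead of a trained AM, there is a single motif feature, and the motif `1011`, the dataset and the step size are all hypothetical. Model moments are computed exactly by enumeration, so no sampling is involved.

```python
import itertools
import math

N = 8            # toy string length (assumption; the experiments use n = 30)
MOTIF = "1011"   # hypothetical motif

def phi(x):
    """Single binary feature, with the text's convention: 0 iff the motif appears in x."""
    return 0.0 if MOTIF in x else 1.0

STRINGS = ["".join(bits) for bits in itertools.product("01", repeat=N)]
R = {x: 1.0 / len(STRINGS) for x in STRINGS}   # stand-in for a weak r: white noise

def model_moment(lam):
    """Exact E_{x ~ p_lambda} phi(x), obtained by enumerating all strings."""
    weights = {x: R[x] * math.exp(lam * phi(x)) for x in STRINGS}
    z = sum(weights.values())
    return sum(w * phi(x) for x, w in weights.items()) / z

def fit_lambda(data, steps=500, lr=1.0):
    """Gradient ascent on the average log-likelihood: empirical moment minus model moment."""
    empirical = sum(phi(x) for x in data) / len(data)
    lam = 0.0
    for _ in range(steps):
        lam += lr * (empirical - model_moment(lam))
    return lam, empirical

# Hypothetical dataset: mostly strings that contain the motif.
with_motif = [x for x in STRINGS if MOTIF in x][:50]
without_motif = [x for x in STRINGS if MOTIF not in x][:5]
lam, empirical = fit_lambda(with_motif + without_motif)
```

At convergence the model moment agrees with the empirical one, and `lam` is negative, which penalizes strings lacking the motif.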
Training procedure

Two-stage training
We consider a two-stage training procedure (see Fig. 1).

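Training-1 This stage trains $\lambda$ on $D$ by SGD on the log-likelihood, with $r$ kept fixed; for a training sequence $x$, the gradient is
$$\nabla_\lambda \log p_\lambda(x) = \phi(x) - E_{x \sim p_\lambda(\cdot)}\, \phi(x). \qquad (5)$$
The main difficulty then consists in computing an estimate of the model moments $E_{x \sim p_\lambda(\cdot)}\, \phi(x)$. In our experiments, we compare two Monte-Carlo approaches (Robert and Casella, 2005) for addressing this problem: (i) Rejection Sampling (rs), using $r$ as the proposal distribution, and (ii) Self-Normalized Importance Sampling (snis) (Owen, 2017; Bengio and Senecal, 2008), also using $r$ as the proposal.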
Rejection sampling is performed as follows. We use $r(x)$ as the proposal, and $P_\lambda(x) = r(x)\, e^{\lambda \cdot \phi(x)}$ as the unnormalized target distribution; for any specific $\lambda$, because our features are bounded between 0 and 1, we can easily upper-bound the ratio $\frac{P_\lambda(x)}{r(x)} = e^{\lambda \cdot \phi(x)}$ by a number $\beta$; we then sample $x$ from $r$, compute the ratio $\rho(x) = \frac{P_\lambda(x)}{\beta\, r(x)} \leq 1$, and accept $x$ with probability $\rho(x)$. The accepted samples are unbiased samples from $p_\lambda(x)$ and can be used to estimate model moments.
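The procedure just described can be sketched as follows. This is a minimal illustration, assuming features in $[0, 1]$ as stated above; the bound $\beta = \exp(\sum_k \max(\lambda_k, 0))$ is one valid choice under that assumption, and the toy proposal (white noise over 6-bit strings), the motif `11` and the weight $-50$ are hypothetical.

```python
import math
import random

def rejection_sample(sample_r, phi, lam, rng, max_tries=100_000):
    """Draw one exact sample from p_lambda(x), proportional to r(x) * exp(<lam, phi(x)>).

    Assumes every feature value lies in [0, 1], so that
    beta = exp(sum of the positive weights) upper-bounds exp(<lam, phi(x)>).
    """
    beta = math.exp(sum(max(l, 0.0) for l in lam))
    for _ in range(max_tries):
        x = sample_r(rng)                                # propose from r
        score = sum(l * f for l, f in zip(lam, phi(x)))
        if rng.random() < math.exp(score) / beta:        # accept with probability rho(x)
            return x
    raise RuntimeError("no proposal accepted within max_tries")

# Hypothetical toy setup: r is white noise over 6-bit strings, with one motif feature.
def sample_r(rng):
    return "".join(rng.choice("01") for _ in range(6))

def phi(x):
    return [0.0 if "11" in x else 1.0]   # 0 iff the motif "11" appears

rng = random.Random(0)
samples = [rejection_sample(sample_r, phi, [-50.0], rng) for _ in range(20)]
```

With a strongly negative weight, proposals lacking the motif are accepted with probability about $e^{-50}$, so the accepted samples all contain it; this also shows why the acceptance rate drops when the motif is rare under $r$.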
Snis also uses the proposal distribution $r$, but does not require an upper bound, and is directly oriented towards the computation of expectations. In this case, we sample a number of points $x_1, \ldots, x_N$ from $r$, compute "importance ratios" $w(x_i) = \frac{P_\lambda(x_i)}{r(x_i)}$, and estimate
$$E_{x \sim p_\lambda(\cdot)}\, \phi(x) \approx \frac{\sum_{i=1}^{N} w(x_i)\, \phi(x_i)}{\sum_{i=1}^{N} w(x_i)}.$$
The estimate is biased for a given $N$, but consistent (that is, it converges to the true expectation for $N \to \infty$).
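A minimal sketch of the self-normalized estimate follows. It assumes the samples were drawn from $r$, so that the importance ratio reduces to $e^{\lambda \cdot \phi(x)}$ and $r(x)$ itself never needs to be evaluated; the function name `snis_moments` and the toy feature are hypothetical.

```python
import math

def snis_moments(samples, phi, lam):
    """Self-normalized importance sampling estimate of E_{x ~ p_lambda} phi(x).

    `samples` are assumed to come from the proposal r, so w(x) = P_lambda(x) / r(x)
    equals exp(<lam, phi(x)>).
    """
    feats = [phi(x) for x in samples]
    weights = [math.exp(sum(l * f for l, f in zip(lam, fx))) for fx in feats]
    total = sum(weights)
    return [sum(w * fx[k] for w, fx in zip(weights, feats)) / total
            for k in range(len(lam))]

# Hypothetical feature: 0 iff the motif "11" appears in the string.
def phi(x):
    return [0.0 if "11" in x else 1.0]
```

As a sanity check, if the "sample" is every 4-bit string taken exactly once (white-noise $r$), the estimate coincides with the exact expectation: 8 of the 16 strings lack `11`, giving 0.5 for $\lambda = 0$ and 0.75 for $\lambda = \log 3$.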
Training-2 While Training-1 results in a well-defined model $P_\lambda(x)$, which may fit the data closely in principle, we should not conclude that $P_\lambda(x)$ is convenient to use for inference, namely, in language modelling, for efficiently sampling from its normalized version $p_\lambda(x)$; just as seriously, because of the partition function $Z_\lambda$, it is also not obvious how to evaluate the perplexity of $P_\lambda(x)$ on test data. In order to do both, one approach consists in using a distillation technique (Hinton et al., 2015), where, during training, one spends generous time producing a set of samples from $P_\lambda$, for instance by Monte-Carlo (e.g. Rejection Sampling) techniques, and where this set (which may be arbitrarily larger than the original $D$) is in turn used to train a new autoregressive model $\pi_\theta(x)$, which can then be used directly for sampling or for computing data likelihood. This is the approach that we use in our current experiments, again using the original $r(x)$ as a proposal distribution.

Cyclical training
In the case of small $|D|$, the proposal distribution $r$ is weak and, as a result, the distillation process, based on rejection sampling, can be slow. To address this issue, we also consider a cyclical training regime that updates the proposal distribution after distilling each batch of samples, with the intention of reducing the rejection rate. Once the process of distillation is finished, we use the aggregated samples to train the final $\pi_\theta$. The two-stage training procedure is a variant of the cyclical one, with a fixed proposal (see Algorithm 1 for more details).

Related Work
Hoang et al. (2018), working in an NMT context, have a similar motivation to ours. They first train an autoregressive seq2seq model (Transformer in their case) on bilingual data, then attempt to control global properties of the generated sequences through the introduction of a priori features. They interpolate the training of the autoregressive model with the training of a Moment Matching component which tries to equate the feature expectations of the model with those of the data. Contrary to our approach, they do not directly try to maximize likelihood in an integrated model.
Andor et al. (2016) consider transition-based neural networks, and contrast local to global normalization of decision sequences, showing how the global approach avoids the label bias problem in such tasks as tagging or parsing. They focus on inference as maximization, e.g. finding the best sequence of tags for a sequence of words, and, consistent with that objective, their training procedure exploits a beam-search approximation. By contrast, our focus is on inference as sampling in a language modelling perspective, on the complementarity between autoregressive models and log-linear models, and on the relations between training and sampling in energy-based models.
[Algorithm 1: Training. Listing not recoverable; surviving fragments: "initialize and then train RNN", "return P λ".]

Experiments
We conduct a series of experiments on synthetic data to illustrate our approach.

Synthetic data
To assess the impact of GAMs, we focus on distributions $p_{true}(x)$ that are likely to be well approximated by the AM $r(x)$ in the presence of large data. The first class of distributions is obtained through a PFSA that filters binary strings of fixed length $n = 30$, 0's and 1's being equally probable (white-noise strings), through the condition that they contain a specific substring ("motif") anywhere; here the relative frequency of sequences containing the motif among all sequences varies from ∼0.01 (shorter motifs, $|m| = 10$) to ∼0.001 (longer motifs, $|m| = 14$).
We also consider mixtures of two PFSAs (motif/anti-motif): the first (with mixture probability 0.9) produces white-noise strings containing the motif and the second (with mixture probability 0.1) produces strings excluding the motif.
From these processes we produce a training set $D$, of size $|D|$ varying between $5 \cdot 10^2$ and $2 \cdot 10^4$, a validation set $V$ of size $0.25 \cdot |D|$ (but never smaller than $5 \cdot 10^2$ or bigger than $2 \cdot 10^3$) and a test set $T$ of fixed size $5 \cdot 10^3$.
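The relative frequency of motif-containing strings among white-noise strings can be computed exactly by a small dynamic program over "how much of the motif is currently matched". The sketch below is an illustration of that idea, not the PFSA machinery used in the experiments; the function name `motif_fraction` is hypothetical.

```python
def motif_fraction(motif, n):
    """Fraction of the 2**n binary strings of length n that contain `motif`."""
    k = len(motif)

    def step(state, bit):
        # Length of the longest suffix of (matched prefix + bit) that is a prefix of the motif.
        s = motif[:state] + bit
        while s and not motif.startswith(s):
            s = s[1:]
        return len(s)

    # counts[j]: number of motif-free strings so far whose suffix matches j motif symbols.
    counts = [0] * k
    counts[0] = 1
    for _ in range(n):
        new = [0] * k
        for state, c in enumerate(counts):
            if c:
                for bit in "01":
                    nxt = step(state, bit)
                    if nxt < k:          # still motif-free; strings reaching k are dropped
                        new[nxt] += c
        counts = new
    return 1 - sum(counts) / 2 ** n
```

Calling `motif_fraction(m, 30)` with a motif of length 10 or 14 gives the kind of frequency quoted above for the corresponding motif.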

Features
In a real world scenario, prior knowledge about the true process will involve, along with predictive features, a number of noisy and useless features.By training the λ parameters to match the empirical moments, the GAM will learn to distinguish between these types.In order to simulate this situation we consider feature vectors over our artificial data that involve both types.
With $x$ the full string and $m$ the fixed motif used in constructing the training data, we consider variations among the 7 binary features in the set $F$:
$$F = \{m, m_{+0}, m_{/2}, d_0, d_1, d_2, d_3\},$$
where $m = 0$ iff the motif $m$ appears in $x$, $m_{+0} = 0$ iff the motif followed by a zero ("super-motif") appears in $x$, and $m_{/2} = 0$ iff an initial section of the motif ("sub-motif", roughly half the size of $m$) appears in $x$. These three features are chosen because they have some correlation with the process for generating the training data. By contrast, the four remaining features $d_0, d_1, d_2, d_3$ are "distractors".
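The three motif-related features can be written down directly from their definitions. In this sketch the sub-motif is assumed to be the first `len(motif) // 2` symbols (the text only says "roughly half"), the distractor features are omitted because their definitions are not given here, and the function name `motif_features` is hypothetical.

```python
def motif_features(x, motif):
    """Return [m, m_{+0}, m_{/2}] for string x; each feature is 0 iff its pattern appears."""
    sub_motif = motif[: len(motif) // 2]   # assumption: "roughly half" = first len // 2 symbols
    return [
        0 if motif in x else 1,            # m: the motif itself
        0 if motif + "0" in x else 1,      # m_{+0}: motif followed by a zero (super-motif)
        0 if sub_motif in x else 1,        # m_{/2}: initial section of the motif (sub-motif)
    ]
```

For instance, with motif `1011`, the string `0010111` contains the motif and the sub-motif `10` but not the super-motif `10110`, so its feature vector is `[0, 1, 0]`.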

Training: Two-Stage and Cyclical
The implementation is described in Algorithm 1. Here we provide some additional details.
that we did recently but do not report here; by matching the data expectations of these two additional features, the model is able to represent the mean and variance of length in the data.Here the prior knowledge provided to the model just tells it to be attentive to the distribution of length, a much weaker form of prior knowledge than telling it to be attentive to a specific motif.
Training-1 For training $P_\lambda(x)$ we test two regimes in Eq. 5, namely rs and snis; in both cases, we first train $r(x)$ on whatever $D$ is available, and use it as the proposal distribution.
During rs, we compute the model's expectation over 10 accepted samples, update the $\lambda$'s according to (5), and iterate. During snis, we keep a buffer of the last $5 \cdot 10^4$ samples from $r(x)$ to compute the weighted average of the feature moments. For the training of the $\lambda$'s, we use basic SGD optimization with an epoch-dependent learning rate $\alpha(\#epoch)$. To assess the quality of $P_\lambda(x)$ for early stopping during training, we use the distance between the empirical and model moments.

Cross-entropy comparison
We conduct experiments to compare the cross-entropy (measured in nats) of the initial AM $r(x)$ relative to the test set $T$ with that of the final AM $\pi_\theta(x)$, also relative to $T$; we vary the size $|D| \in \{0.5, 1, 5, 10, 20\} \cdot 10^3$, the regime (tReg) used for Training-1 (rs or snis), the features employed, and the rarity of the motifs. Figure 2 depicts the resulting curves at the end of the two-stage training (solid lines).
Here we show only a few experiments (a more extensive set is provided in the [SM]).
We observe that, for a small dataset size $|D|$, there is a big gap between the CE of $r(x)$ and the CE of $\pi_\theta(x)$. As $|D|$ increases, these cross-entropies become closer to one another, but a large gap persists for $|D| = 5000$.
We note that the presence of the "fully predictive" feature $m$ results in a $\pi_\theta(x)$ whose CE is very close to the theoretical entropy, even in low-$|D|$ regimes, where $r$ on its own is very weak. 4 Thus, not only is the distilled AM much better than the initial AM, but this is an indication that $P_\lambda$ itself (for which the cross-entropy is more difficult to compute exactly) is a good approximation of the true process.
4 The CE of a model relative to the true underlying process (approximated by the test set $T$) can never be below the entropy of this process, due to the KL-divergence being nonnegative.
By contrast, if the $m$ feature is absent, then, while $\pi_\theta$ is still better than $r$ in low-$|D|$ regimes, it cannot reach the theoretical entropy in such regimes, because features such as $m_{+0}$ and $m_{/2}$ can only partially model the data. With large $|D|$, on the other hand, $r$ on its own does a good job at predicting the data, and $P_\lambda$ adds little on top of its $r$ component.
Finally, we note that the two regimes for training $P_\lambda(x)$, rs and snis, result in $\pi_\theta$'s with similar accuracies.
We also observe that when $\pi_\theta(x)$ performs well, the moments of the motif feature on the distilled dataset are close to the true ones (see [SM], Figures 4, 5 and 7).
These trends are consistent across the experiments with different motifs, as can be checked in Table 3 and with the additional plots in the [SM].

Motif frequencies
In order to assess the predictive properties of the obtained AMs, we also compare the frequency of motifs in strings sampled from $r$ and from $\pi_\theta$ ($2 \cdot 10^3$ samples in total). From Figure 2 we see that, as $|D|$ varies, the frequency of motifs (dashed lines) is aligned with the CE performance. Namely, $\pi_\theta$ produces a higher fraction of strings with the motif than $r$ when $|D|$ is small ($|D| \in \{0.5, 1, 5\} \cdot 10^3$).

Detailed illustration
To provide more intuition, we provide an illustration from one experiment in Table 1.
In this illustration, with a training set of size 5000, $r$ is only able to generate the motif a fraction of the time (0.045, see line 10), but is better able to generate some sub-motifs (underlined); $\pi_\theta$ generates the motif frequently (0.959), as illustrated on line 3. With the features from $ft$ (line 4), Training-1 produces a $P_\lambda$ with first feature weight $\lambda_m$ strongly negative (line 5), meaning that $P_\lambda$ strongly penalizes the absence of the motif; the "distractor" features $d_0, d_1, d_2, d_3$ get a weight close to 0, meaning that they have little predictive power in combination with feature $m$. It is visible from lines 6, 7 and 8 that $\pi_\theta$ is much better able to approximate the true feature expectations than $r$ (the feature expectations, aka moments, under $r$ and $\pi_\theta$ are $E_{x \sim r(\cdot)}\, \phi(x)$ and $E_{x \sim \pi_\theta(\cdot)}\, \phi(x)$ respectively). Finally (line 9), the CE of $\pi_\theta$ relative to the test set is close to the true entropy of the process, while that of $r$ is much further away.

Mixture $D_{mam}$ vs pure $D_m$
In our experiments, the strings in $D_{mam}$ (motif/anti-motif) contain the motif with probability $p = 0.9$. Since not all of the samples in $D_{mam}$ contain the motif, the motif feature itself is not fully predictive. Nevertheless, it can be seen in panel (d) of Figure 2 that the $\pi_\theta$ obtained with $P_\lambda$ trained on the mixture $D_{mam}$ behaves consistently with the results obtained on the pure $D_m$ of panels (a, b, c).

Regimes in Training-1
For training the GAM we consider two methods, snis and rs. As described in the previous sections, their impact on $P_\lambda$ leads to $\pi_\theta$'s that have similar CEs and motif frequencies. Despite this resemblance in terms of accuracy, the two methods differ in terms of speed (see Table 2). Namely, when $r$ is close to white noise due to small $|D|$, strings containing the motif are rare events, and rs rejects most samples not containing the motif, because of the log-linear term and the negative value of the component $\lambda_m$ corresponding to the $m$ feature, while snis is able to exploit all samples. Although it is faster than rs, snis remains competitive in terms of CE.

Cyclical vs two-stage training
We conducted a small experiment to compare the performance of cyclical training with two-stage training in terms of speed and accuracy for a fixed motif $m$ and features $ft$ (see [SM], Table 4).

Discussion
The basic idea behind GAMs is very simple. First, we extend the representational power of the autoregressive model $r$ by multiplying it by a log-linear potential, obtaining an unnormalized model $P_\lambda$ (Training-1). Then we try to "project" this extended representation again to an autoregressive model $\pi_\theta$ (Training-2). Our results showed that, under favorable prior knowledge conditions, the final $\pi_\theta$ was able to perform as well, when trained on small data, as the standard $r$ trained on large data. During our experiments, we noticed that training $P_\lambda$ was actually easier than training $\pi_\theta$ from it. Intuitively, the small number of parameters to be fitted in the log-linear model requires less work and fewer data than the training of an autoregressive component. 5
5 At a deeper level, there are extreme situations where the $P_\lambda$ obtained at the end of Training-1 can perfectly represent the true process, but where no autoregressive model can actually fit $P_\lambda$: one way to obtain such situations consists in generating binary strings that satisfy a certain cryptographic predicate, associated with a specific feature; the importance of this feature can be easily detected through Training-1, but an autoregressive model has no chance of generalizing from distilled or true data, even in large quantities.
It is interesting to relate our study to certain aspects of Reinforcement Learning (RL).
First, consider Training-2. There, we have a "score" $P_\lambda$ that we are trying to approximate through an autoregressive model $\pi_\theta$, which is basically a sequential "policy". The main difference with RL is that we are not trying to find a policy that maximizes the score (which would be a bad idea for language modelling, as it would tend to concentrate the mass on a few sequences), but one that approximates $P_\lambda$ in a distributional sense; our current distillation technique is only one way to approach this problem, but other techniques more in the spirit of RL are possible, a direction that we leave for future work.
Second, consider Training-1. Our approach, which consists in suggesting a number of prior features to the model, might look too easy and suspicious. But notice that in RL, one would typically directly provide to the model an externally defined reward, a very strong form of prior knowledge. Here, instead, we "only" indicate to the model which features it might attend to, and Training-1 then determines the "reward" $P_\lambda$ through max-likelihood, a milder form of prior knowledge, more respectful of what the data has to say.

Supplementary Material

A.1 SGD in Energy-based models
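Formula (1) is fundamental for studying the SGD behavior of energy-based models, and for convenience, we provide a derivation here.
If we define $Z_\eta(C) \doteq \sum_x P_\eta(x|C)$, we find that:
$$\nabla_\eta \log Z_\eta(C) = \frac{1}{Z_\eta(C)} \sum_x \nabla_\eta P_\eta(x|C) = \sum_x \frac{P_\eta(x|C)}{Z_\eta(C)}\, \nabla_\eta \log P_\eta(x|C) = E_{x \sim p_\eta(\cdot|C)}\, \nabla_\eta \log P_\eta(x|C).$$
We then have:
$$\nabla_\eta \log p_\eta(x|C) = \nabla_\eta \log P_\eta(x|C) - \nabla_\eta \log Z_\eta(C) = \nabla_\eta \log P_\eta(x|C) - E_{x \sim p_\eta(\cdot|C)}\, \nabla_\eta \log P_\eta(x|C).$$

A.2 The relevance of finite state automata; connections and differences with Reinforcement Learning
The way our synthetic data is produced through FSAs may look contrived, but there are good motivations for using automata in a study such as ours.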
Consider the following problem: you are given some RNN $r$ that produces sequences $x$ over a vocabulary $V$, with probabilities $r(x)$, but you would like to filter out sequences that do not contain a specific symbol $a$, while preserving the relative probabilities of sequences provided by the RNN:
$$p_{filtered}(x) \propto r(x)\, [a \in x],$$
where $[a \in x]$ is 1 if $a$ occurs in $x$ and 0 otherwise. There appears to be no obvious way to realize $p_{filtered}$ through an RNN, apart from techniques similar to what we have been describing in our discussion of Training-2.
The situation is completely different with FSAs. If you have a PFSA (Probabilistic FSA) $r_{pfsa}$ generating sequences $x$, then you can intersect $r_{pfsa}$ with an automaton that accepts all sequences containing at least one $a$, and renormalize the intersection through dynamic programming, obtaining a new PFSA that generates the filtered distribution. Such dynamic programming, with the capacity to anticipate properties that need to be satisfied on the global sequence, is unavailable in the RNN world.
With RNNs, the situation is reminiscent of RL, with a reward associated with having observed an $a$ during the production of the sequence. But a standard RL approach would mean that we would try to maximize $P_{filtered}(x)$, without taking into consideration the original $r(x)$ that we are filtering from. To be correct, we need to find a policy $\pi_\theta(x)$ (similar to an RNN) that tries to approximate $p_{filtered}(x)$ in a distributional sense, not in a maximization sense (see (Bellemare et al., 2017) for related considerations). This is what we try to do in Training-2, using motifs as our main case study, instead of a single symbol $a$ (which would not make sense for binary strings).
The advantages of using PFSAs in our study are multiple. They provide a well-understood comparison point to the more complex techniques that need to be deployed for autoregressive models. From an operational viewpoint, they also make it possible, through dynamic programming, to perform various calculations of interest for our study, such as sampling datasets of arbitrary size and computing exact entropy and partition functions that can serve as comparison points for the results obtained with GAMs. In the present paper, we only exploited PFSAs in the context of motifs, but they provide a much larger class of models that could serve to expand our understanding of sequence-based energy-based models.

A.2.1 Computing the Entropy of a PFSA
As mentioned earlier, one advantage of using weighted finite-state automata for generating synthetic data is that some important quantities, such as entropy, mean sequence length, or partition function, can be computed by dynamic programming.
Here we only derive a simple iterative method for computing the entropy of a PFSA; the other computations are very similar.
We consider a PFSA with transitions of the form $(q, l, q', w)$, where $q, q'$ are states, $l$ is the label of the transition from $q$ to $q'$ (in our case $l \in \{0, 1\}$), and $w$ is the probability of the transition. The fact that the automaton is probabilistic, instead of simply weighted, means that the sum of the $w$'s associated with transitions starting at $q$ is equal to 1. We further assume that the automaton is deterministic, namely that $q$ and $l$ uniquely determine the next state $q'$. 9 The entropy $H(q)$ of a state $q$ is defined as $H(q) \doteq -\sum_x p(x|q) \log p(x|q)$, where $x$ ranges over the sequences of labels that, starting from $q$, end in a final state of the automaton, and for which $p(x|q)$ is computed in the obvious way. The entropy of the automaton as a whole is then defined as $H(q_s)$, where $q_s$ is the initial state of the automaton.
Lemma The entropies of states satisfy the fixpoint equation:
$$H(q) = \sum_{(q, l, q', w)} \left( -w \log w + w\, H(q') \right). \qquad (7)$$
Proof. Let us denote by $q_l$ the state obtained from $q$ by following $l$, and by $w_l$ the probability of that transition. Writing each sequence as $x = l\, x'$, so that $p(x|q) = w_l\, p(x'|q_l)$, and using $\sum_{x'} p(x'|q_l) = 1$, we have:
$$H(q) = -\sum_l \sum_{x'} w_l\, p(x'|q_l) \log \left( w_l\, p(x'|q_l) \right) = \sum_l \left( -w_l \log w_l - w_l \sum_{x'} p(x'|q_l) \log p(x'|q_l) \right) = \sum_l \left( -w_l \log w_l + w_l\, H(q_l) \right).$$
It is possible to show that the state entropies actually correspond to the least fixpoint of equation (7), and this allows a simple iterative algorithm for computing the state entropies: at time $t = 0$, for all states $q$, we define $H_{t=0}(q) \doteq 0$, and then we iterate until convergence:
$$H_{t+1}(q) = \sum_{(q, l, q', w)} \left( -w \log w + w\, H_t(q') \right).$$
9 The case of non-deterministic probabilistic automata appears much more difficult (Cortes et al., 2008).
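The iteration above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: transitions are given as a list of `(q, label, q_next, w)` tuples with $w > 0$, final states are those without outgoing transitions, and the function name `pfsa_entropy`, the stopping tolerance and the iteration cap are hypothetical choices.

```python
import math

def pfsa_entropy(transitions, start, max_iterations=10_000, tol=1e-12):
    """Entropy (in nats) of a deterministic PFSA, by iterating the fixpoint equation.

    transitions: list of (q, label, q_next, w) with w > 0; the weights of the
    transitions leaving each non-final state are assumed to sum to 1.
    """
    states = {q for q, _, _, _ in transitions} | {q2 for _, _, q2, _ in transitions}
    h = {q: 0.0 for q in states}                 # H_0(q) = 0 for all states
    for _ in range(max_iterations):
        new_h = {q: 0.0 for q in states}
        for q, _, q_next, w in transitions:
            new_h[q] += -w * math.log(w) + w * h[q_next]
        delta = max(abs(new_h[q] - h[q]) for q in states)
        h = new_h
        if delta < tol:                          # converged
            break
    return h[start]

# Toy automaton: three fair coin flips, states 0 -> 1 -> 2 -> 3 (final).
toy = [(i, bit, i + 1, 0.5) for i in range(3) for bit in "01"]
toy_entropy = pfsa_entropy(toy, start=0)
```

For the toy automaton, which emits three independent fair bits, the iteration stabilizes at $3 \log 2$ nats.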

Figure 1: Two-stage training. At the end of the process, we compare the perplexities of $r$ and $\pi_\theta$ on test data: CE($T$, $r$) vs. CE($T$, $\pi_\theta$).

Table 1: Illustration. Setting is from

Table 2: Comparison of the time for Training-1 in rs and snis, for motif 10001011111000: $ft$ = 1011111 and $H(p_{true}) = 0.449$ with pure $D_m$; $ft$ = 1001111 and $H(p_{true}) = 0.482$ with the motif/anti-motif mixture $D_{mam}$.

In the comparison of cyclical and two-stage training (Table 4), we observed that the CEs of the obtained $\pi_\theta$'s were about the same for different values of $|D|$ and Training-1 regimes. On the other hand, there was no systematic improvement in the training speed of one method over the other.

Table 4: Cyclical training vs two-stage training for motif 10001011111000, $D_m$, $ft$ = 1001111; CE is short for CE($T$, $\pi_\theta$).