Neural Particle Smoothing for Sampling from Conditional Sequence Models

We introduce neural particle smoothing, a sequential Monte Carlo method for sampling annotations of an input string from a given probability model. In contrast to conventional particle filtering algorithms, we train a proposal distribution that looks ahead to the end of the input string by means of a right-to-left LSTM. We demonstrate that this innovation can improve the quality of the sample. To motivate our formal choices, we explain how neural transduction models and our sampler can be viewed as low-dimensional but nonlinear approximations to working with HMMs over very large state spaces.


Introduction
Many structured prediction problems in NLP can be reduced to labeling a length-T input string x with a length-T sequence y of tags. In some cases, these tags are annotations such as syntactic parts of speech. In other cases, they represent actions that incrementally build an output structure: IOB tags build a chunking of the input (Ramshaw and Marcus, 1999), shift-reduce actions build a tree (Yamada and Matsumoto, 2003), and finite-state transducer arcs build an output string (Pereira and Riley, 1997).
One may wish to score the possible taggings using a recurrent neural network, which can learn to be sensitive to complex patterns in the training data. A globally normalized conditional probability model is particularly valuable because it quantifies uncertainty and does not suffer from label bias (Lafferty et al., 2001); also, such models often arise as the predictive conditional distribution p(y | x) corresponding to some well-designed generative model p(x, y) for the domain. In the neural case, however, inference in such models becomes intractable. It is hard to know what the model actually predicts and hard to compute gradients to improve its predictions.
In such intractable settings, one generally falls back on approximate inference or sampling. In the NLP community, beam search and importance sampling are common. Unfortunately, beam search considers only the approximate-top-k taggings from an exponential set (Wiseman and Rush, 2016), and importance sampling requires the construction of a good proposal distribution (Dyer et al., 2016).
In this paper we exploit the sequential structure of the tagging problem to do sequential importance sampling, which resembles beam search in that it constructs its proposed samples incrementally, one tag at a time, taking the actual model into account at every step. This method is known as particle filtering (Doucet and Johansen, 2009). We extend it here to take advantage of the fact that the sampler has access to the entire input string as it constructs its tagging, which allows it to look ahead, or, as we will show, to use a neural network to approximate the effect of lookahead. Our resulting method is called neural particle smoothing.

What this paper provides
For x = x_1 ··· x_T, let x_{:t} and x_{t:} respectively denote the prefix x_1 ··· x_t and the suffix x_{t+1} ··· x_T.
We develop neural particle smoothing, a sequential importance sampling method which, given a string x, draws a sample of taggings y from p_θ(y | x). Our method works for any conditional probability model of the quite general form 1

    p_θ(y | x) = exp G_T(x, y) / Σ_{y'} exp G_T(x, y')   (1)

where G is an incremental stateful global scoring model that recursively defines scores G_t of prefixes of (x, y) at all times 0 ≤ t ≤ T:

    G_t = G_{t-1} + g_θ(s_{t-1}, x_t, y_t),   with G_0 = 0   (2)
    s_t = f_θ(s_{t-1}, x_t, y_t),   with s_0 a fixed initial state   (3)

These quantities implicitly depend on x, y and θ. Here s_t is the model's state after observing the pair of length-t prefixes (x_{:t}, y_{:t}). G_t is the score-so-far of this prefix pair, while G_T - G_t is the score-to-go. The state s_t summarizes the prefix pair in the sense that the score-to-go depends only on s_t and the length-(T - t) suffixes (x_{t:}, y_{t:}). The local scoring function g_θ and state update function f_θ may be any functions parameterized by θ, perhaps neural networks. We assume θ is fixed and given.
This model family is expressive enough to capture any desired p(y | x). Why? Take any distribution p(x, y) with this desired conditionalization (e.g., the true joint distribution) and factor it as

    log p(x, y) = Σ_{t=1}^T log p(x_t, y_t | x_{:t-1}, y_{:t-1})
                = Σ_{t=1}^T log p(x_t, y_t | s_{t-1})   [each term used as g_θ(s_{t-1}, x_t, y_t)]
                = G_T   (4)

by making s_t include as much information about (x_{:t}, y_{:t}) as needed for (4) to hold (possibly s_t = (x_{:t}, y_{:t})). 2 Then by defining g_θ as shown in (4), we get p(x, y) = exp G_T and thus (1) holds for each x.
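To make the recursion concrete, here is a minimal Python sketch (assuming numpy; the callables f_theta and g_theta, the initial state s0, and the brute-force candidate set Y_all are hypothetical stand-ins, since the paper does not prescribe a particular parameterization):

    import numpy as np

    def score_sequence(x, y, f_theta, g_theta, s0):
        """Compute G_T for a full pair (x, y) via equations (2)-(3)."""
        s, G = s0, 0.0
        for x_t, y_t in zip(x, y):
            G += g_theta(s, x_t, y_t)   # equation (2): G_t = G_{t-1} + g
            s = f_theta(s, x_t, y_t)    # equation (3): state update
        return G

    def conditional_prob(x, y, Y_all, f_theta, g_theta, s0):
        """Equation (1) by brute force over a small candidate set Y_all of taggings."""
        scores = np.array([score_sequence(x, y2, f_theta, g_theta, s0) for y2 in Y_all])
        log_Z = np.logaddexp.reduce(scores)
        return np.exp(score_sequence(x, y, f_theta, g_theta, s0) - log_Z)

The brute-force normalizer is of course exactly what the sampler must avoid; it is shown only to pin down the semantics of equation (1).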

Relationship to particle filtering
Our method is spelled out in §4 (one may look now). It is a variant of the popular particle filtering method that tracks the state of a physical system in discrete time (Ristic et al., 2004). Our particular proposal distribution for y_t can be found in equations (5), (6), (25) and (26). It considers not only past observations x_{:t} as reflected in s_{t-1}, but also future observations x_{t:}, as summarized by the state s̄_t of a right-to-left recurrent neural network f̄ that we will train:

    Ĥ_t = Ĥ_{t+1} + h(s̄_{t+1}, x_{t+1}),   with Ĥ_T = 0   (5)
    s̄_t = f̄(s̄_{t+1}, x_{t+1}),   with s̄_T a fixed initial state   (6)

Conditioning the distribution of y_t on future observations x_{t:} means that we are doing "smoothing" rather than "filtering" (in signal processing terminology). Doing so can reduce the bias and variance of our sampler. It is possible so long as x is provided in its entirety before the sampler runs, which is often the case in NLP.
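Since x is fully observed before sampling begins, the states s̄_t can be precomputed in a single right-to-left pass. A minimal sketch (fbar and the initial state sbar_T are hypothetical stand-ins for the trained network f̄ and its initial state):

    def backward_states(x, fbar, sbar_T):
        """Precompute the right-to-left RNN states of equation (6):
        sbar[t] summarizes the future observations x_{t:}."""
        T = len(x)
        sbar = [None] * (T + 1)
        sbar[T] = sbar_T
        for t in range(T - 1, -1, -1):
            sbar[t] = fbar(sbar[t + 1], x[t])  # x[t] is x_{t+1} in 1-indexed notation
        return sbar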

Applications
Why sample from p_θ at all? Many NLP systems instead simply search for the Viterbi sequence y that maximizes G_T and thus maximizes p_θ(y | x). If the space of states s is small, this can be done efficiently by dynamic programming (Viterbi, 1967); if not, then A* may be an option (see §2). More common is to use an approximate method: beam search, or perhaps a sequential prediction policy trained with reinforcement learning. Past work has already shown how to improve these approximate search algorithms by conditioning on the future (Bahdanau et al., 2017; Wiseman and Rush, 2016). Sampling is essentially a generalization of maximization: sampling from a distribution proportional to exp(G_T / temperature) approaches maximization as the temperature approaches 0. It is a fundamental building block for other algorithms, as it can be used to take expectations over the whole space of possible y values. For unfamiliar readers, Appendix E reviews how sampling is crucially used in minimum-risk decoding, supervised training, unsupervised training, imputation of missing data, pipeline decoding, and inference in graphical models.

Exact Sequential Sampling
To develop our method, it is useful to first consider exact samplers. Exact sampling is tractable for only some of the models allowed by §1.1. However, the form and notation of the exact algorithms in §2 will guide our development of approximations in §3.
An exact sequential sampler draws y_t from p_θ(y_t | x, y_{:t-1}) for each t = 1, …, T in sequence. Then y is exactly distributed as p_θ(y | x).
For each given x and y_{:t-1}, observe that

    p_θ(y_t | x, y_{:t-1})   (7)
      = p_θ(y_{:t} | x) / p_θ(y_{:t-1} | x)   (8)
      ∝ p_θ(y_{:t} | x) = Σ_{y_{t:}} p_θ(y | x)   (9)
      ∝ Σ_{y_{t:}} exp G_T = exp (G_t + H_t),  where H_t = log Σ_{y_{t:}} exp (G_T - G_t)   (10)
      = exp G_{t-1} · exp (g_θ(s_{t-1}, x_t, y_t) + H_t)   (11)
      ∝ exp (g_θ(s_{t-1}, x_t, y_t) + H_t)   (12)

Thus, we can easily construct the needed distribution (7) by normalizing (12) over all possible values of y_t. The challenging part of (12) is to compute H_t: as defined in (10), H_t involves a sum over exponentially many futures y_{t:}. (See Figure 1.)

We chose the symbols G and H in homage to the A* search algorithm (Hart et al., 1968). In that algorithm (which could be used to find the Viterbi sequence), g denotes the score-so-far of a partial solution y_{:t}, and h denotes the optimal score-to-go. Thus, g + h would be the score of the best sequence with prefix y_{:t}. Analogously, our G_t + H_t is the log of the total exponentiated scores of all sequences with prefix y_{:t}. G_t and H_t might be called the logprob-so-far and logprob-to-go of y_{:t}. Just as A* approximates h with a "heuristic" ĥ, the next section will approximate H_t using a neural estimate Ĥ_t (equations (5)-(6)). However, the specific form of our approximation is inspired by cases where H_t can be computed exactly. We consider those in the remainder of this section.

Figure 1: Sampling a single particle from a tagging model. y_1, …, y_{t-1} (orange) have already been chosen, with a total model score of G_{t-1}, and now the sampler is constructing a proposal distribution q (purple) from which the next tag y_t will be sampled. Each y_t is evaluated according to its contribution to G_t (namely g_θ) and its future score H_t (blue). The figure illustrates quantities used throughout the paper, beginning with exact sampling in equations (7)-(12). Our main idea (§3) is to approximate the H_t computation (a log-sum-exp over exponentially many sequences) when exact computation by dynamic programming is not an option. The form of our approximation uses a right-to-left recurrent neural network but is inspired by the exact dynamic programming algorithm.
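If an oracle for H_t were available, exact sequential sampling would look as follows (a sketch, assuming numpy; the interface H(t, s_t) is hypothetical, standing in for the exact logprob-to-go, which by the summarization property of §1.1 depends on the prefix only through s_t):

    import numpy as np

    def exact_sequential_sample(x, tagset, f_theta, g_theta, H, s0, rng):
        """Draw y ~ p(y | x) exactly: per equation (12), each y_t is drawn
        in proportion to exp(g_theta(s_{t-1}, x_t, y_t) + H_t)."""
        y, s = [], s0
        for t, x_t in enumerate(x, start=1):
            next_states = [f_theta(s, x_t, tag) for tag in tagset]
            logits = np.array([g_theta(s, x_t, tag) + H(t, s_next)
                               for tag, s_next in zip(tagset, next_states)])
            probs = np.exp(logits - np.logaddexp.reduce(logits))  # normalize, as in (7)
            i = rng.choice(len(tagset), p=probs)
            y.append(tagset[i])
            s = next_states[i]
        return y

The rest of this section is about when such an oracle is actually computable.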

Exact sampling from HMMs
A hidden Markov model (HMM) specifies a normalized joint distribution p_θ(x, y) = exp G_T over a state sequence y and an observation sequence x. 3 Thus the posterior p_θ(y | x) is proportional to exp G_T, as required by equation (1).

3 The HMM actually specifies a distribution over a pair of infinite sequences, but here we consider the marginal distribution over just the length-T prefixes.
The HMM specifically defines G_T by equations (2)-(3) with s_t = y_t and g_θ(s_{t-1}, x_t, y_t) = log p_θ(y_t | y_{t-1}) + log p_θ(x_t | y_t). 4 In this setting, H_t can be computed exactly by the backward algorithm (Rabiner, 1989). (Details are given in Appendix A for completeness.)

4 It takes s_0 = BOS, a beginning-of-sequence symbol, so p_θ(y_1 | BOS) specifies the initial state distribution.

Exact sampling from OOHMMs
For sequence tagging, a weakness of (first-order) HMMs is that the model state s_t = y_t may contain little information: only the most recent tag y_t is remembered, so the number of possible model states s_t is limited by the vocabulary of output tags.
We may generalize so that the data-generating process is in a latent state u_t ∈ {1, …, k} at each time t, and the observed y_t, along with x_t, is generated from u_t. Now k may be arbitrarily large. The model has the form

    p_θ(x, y) = Σ_u Π_{t=1}^T p_θ(u_t | u_{t-1}) · p_θ(x_t, y_t | u_t)   (13)

This is essentially a pair HMM (Knudsen and Miyamoto, 2003) without insertions or deletions, also known as an "ε-free" or "same-length" probabilistic finite-state transducer. We refer to it here as an output-output HMM (OOHMM). 5

Is this still an example of the general model architecture from §1.1? Yes. Since u_t is latent and evolves stochastically, it cannot be used as the state s_t in equations (2)-(3) or (4). However, we can define s_t to be the model's belief state after observing (x_{:t}, y_{:t}). The belief state is the posterior probability distribution over the underlying state u_t of the system. That is, s_t deterministically keeps track of all possible states that the OOHMM might be in, just as the state of a determinized FSA keeps track of all possible states that the original nondeterministic FSA might be in.
We may compute the belief state in terms of a vector α_t ∈ R^k of forward probabilities, which starts at α_0 and is updated deterministically for each 0 < t ≤ T by the forward algorithm (Rabiner, 1989):

    (α_t)_u = Σ_{u'} (α_{t-1})_{u'} · p_θ(u | u') · p_θ(x_t, y_t | u)   (15)

(α_t)_u can be interpreted as the probability-so-far if the system is in state u after observing (x_{:t}, y_{:t}). We may express the update rule (15) by α_t = α_{t-1} P, where the matrix P depends on (x_t, y_t), namely

    P_{u'u} = p_θ(u | u') · p_θ(x_t, y_t | u)   (16)

The belief state s_t def= ᾱ_t ∈ R^k simply normalizes α_t into a probability vector, where ū def= u / (u · 1) denotes the normalization operator. The state update (15) now takes the form (3) as desired, with f_θ a normalized vector-matrix product:

    f_θ(s_{t-1}, x_t, y_t) = \overline{s_{t-1} P}   (i.e., s_{t-1} P, renormalized)

As in the HMM case, we define G_t as the log of the generative prefix probability,

    G_t def= log (α_t · 1)   (17)

which has the form (2) as desired if we put

    g_θ(s_{t-1}, x_t, y_t) def= log ((s_{t-1} P) · 1)   (18)

Again, exact sampling is possible. It suffices to compute (9). For the OOHMM, this is given by

    p_θ(y_{:t} | x) ∝ α_t · β_t   (19)

where β_T def= 1 and the backward algorithm

    (β_t)_u def= Σ_{u_{t:}, y_{t:}} Π_{t'=t+1}^T p_θ(u_{t'} | u_{t'-1}) · p_θ(x_{t'}, y_{t'} | u_{t'})   (with u_t = u), for 0 ≤ t < T   (20)

uses dynamic programming to find the total probability of all ways to generate the future observations x_{t:}. Note that α_t is defined for a specific prefix y_{:t} (though it sums over all u_{:t}), whereas β_t sums over all suffixes y_{t:} (and over all u_{t:}), to achieve the asymmetric summation in (19). (20) can clearly be expressed in the form β_t = P β_{t+1}, much like (16).
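One step of this exact computation can be written compactly (a sketch, assuming numpy; P is the k × k matrix of equation (16) built from the current (x_t, y_t)):

    import numpy as np

    def oohmm_step(s_prev, G_prev, P):
        """One OOHMM update: the normalized vector-matrix product for f_theta
        and the log-sum increment for g_theta (equations (17)-(18))."""
        v = s_prev @ P              # unnormalized forward update, as in (15)
        total = v.sum()
        s = v / total               # belief state: normalized forward vector
        G = G_prev + np.log(total)  # g_theta = log((s_{t-1} P) . 1)
        return s, G

Iterating this from the normalized α_0 accumulates exactly G_T, the log of the generative probability of (x, y).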

The logprob-to-go for OOHMMs
Let us now work out the definition of H_t for OOHMMs (cf. equation (35) in Appendix A for HMMs). We will write it in terms of Ĥ_t from §1.2. Let us define Ĥ_t symmetrically to G_t (see (17)):

    Ĥ_t def= log (1 · β_t)   (21)

which has the form (5) as desired if we put s̄_t def= β̄_t (the normalized backward vector) and

    h(s̄_{t+1}, x_{t+1}) def= log ((P s̄_{t+1}) · 1)   (22)

From equations (10), (17), (19) and (21), we see

    H_t = Ĥ_t + C_t,  where C_t def= log (s_t · s̄_t)   (23)

and C_t ∈ R can be regarded as evaluating the compatibility of the state distributions s_t and s̄_t. In short, the generic strategy (12) for exact sampling says that for an OOHMM, y_t is distributed as

    p_θ(y_t | x, y_{:t-1}) ∝ exp (g_θ(s_{t-1}, x_t, y_t) + C_t)   (24)

This is equivalent to choosing y_t in proportion to (19). But we now turn to settings where it is infeasible to compute (19) exactly. There we will use the formulation (24) but approximate C_t. For completeness, we will also consider how to approximate Ĥ_t, which dropped out of the above distribution (because it was the same for all choices of y_t) but may be useful for other algorithms (see §4).

Models with large state spaces
The expressivity of an OOHMM is limited by the number of states k. The state u_t ∈ {1, …, k} is a bottleneck between the past (x_{:t}, y_{:t}) and the future (x_{t:}, y_{t:}), in that past and future are conditionally independent given u_t. Thus, the mutual information between past and future is at most log_2 k bits. In many NLP domains, however, the past seems to carry substantial information about the future. The first half of a sentence greatly reduces the uncertainty about the second half, by providing information about topics, referents, syntax, semantics, and discourse. This suggests that an accurate HMM language model p(x) would require very large k, as would a generative OOHMM model p(x, y) of annotated language. The situation is perhaps better for discriminative models p(y | x), since much of the information for predicting y_{t:} might be available in x_{t:}. Still, it is important to let (x_{:t}, y_{:t}) contribute enough additional information about y_{t:}: even for short strings, making k too small (giving ≤ log_2 k bits) may harm prediction (Dreyer et al., 2008).
Of course, (4) says that an OOHMM can express any joint distribution for which the mutual information is finite, 6 by taking k large enough for s_{t-1} to capture the relevant info from (x_{:t-1}, y_{:t-1}).
So why not just take k to be large, say k = 2^30 to allow 30 bits of information? Unfortunately, evaluating G_T then becomes very expensive, both computationally and statistically. As we have seen, if we define s_t to be the belief state ᾱ_t ∈ R^k, updating it at each observation (x_t, y_t) (equation (3)) requires multiplication by a k × k matrix P. This takes time O(k^2), and requires enough data to learn O(k^2) transition probabilities.

Neural approximation of the model
As a solution, we might hope that for the inputs x observed in practice, the very high-dimensional belief states ᾱ_t ∈ R^k might tend to lie near a d-dimensional manifold where d ≪ k. Then we could take s_t to be a vector in R^d that compactly encodes the approximate coordinates of ᾱ_t relative to the manifold: s_t = ν(ᾱ_t), where ν is the encoder.
In this new, nonlinearly warped coordinate system, the functions of s_{t-1} in (2)-(3) are no longer the simple, essentially linear functions given by (16) and (18). They become nonlinear functions operating on the manifold coordinates. (f_θ in (16) should now ensure that s_t ≈ ν(\overline{ν^{-1}(s_{t-1}) P}), and g_θ in (18) should now estimate log ((ν^{-1}(s_{t-1}) P) · 1).) In a sense, this is the reverse of the "kernel trick" (Boser et al., 1992) that converts a low-dimensional nonlinear function to a high-dimensional linear one.
Our hope is that s_t has enough dimensions d ≪ k to capture the useful information from the true ᾱ_t, and that θ has enough dimensions ≪ k^2 to capture most of the dynamics of equations (16) and (18). We thus proceed to fit the neural networks f_θ, g_θ directly to the data, without ever knowing the true k or the structure of the original operators P ∈ R^{k×k}.
We regard this as the implicit justification for various published probabilistic sequence models p_θ(y | x) that incorporate neural networks. These models usually have the form of §1.1. Most simply, (f_θ, g_θ) can be instantiated as one time step in an RNN (Aharoni and Goldberg, 2017), but it is common to use enriched versions such as deep LSTMs. It is also common to have the state s_t contain not only a vector of manifold coordinates in R^d but also some unboundedly large representation of (x, y_{:t}) (cf. equation (4)), so the f_θ neural network can refer to this material with an attentional (Bahdanau et al., 2015) or stack mechanism (Dyer et al., 2015).
A few such papers have used globally normalized conditional models that can be viewed as approximating some OOHMM, e.g., the parsers of Dyer et al. (2016) and Andor et al. (2016). That is the case (§1.1) that particle smoothing aims to support. Most papers use locally normalized conditional models (e.g., Kann and Schütze, 2016; Aharoni and Goldberg, 2017); these simplify supervised training and can be viewed as approximating IOHMMs (footnote 5). For locally normalized models, H_t = 0 by construction, in which case particle filtering (which in effect estimates H_t as 0) is just as good as particle smoothing. Particle filtering is still useful for these models, but lookahead's inability to help them is an expressive limitation (known as label bias) of locally normalized models. We hope the existence of particle smoothing (which learns an estimate of H_t) will make it easier to adopt, train, and decode globally normalized models, as discussed in §1.3.

Neural approximation of logprob-to-go
We can adopt the same neuralization trick to approximate the OOHMM's logprob-to-go H_t = C_t + Ĥ_t. We take s̄_t ∈ R^d on the same theory that it is a low-dimensional reparameterization of β_t, and define (f̄_φ, h_φ) in equations (5)-(6) to be neural networks. Finally, we must replace the definition of C_t in (23) with another neural network c_φ that works on the low-dimensional approximations: 7

    C_t def= c_φ(s_t, s̄_t)   (25)

The resulting approximation to (24) (which does not actually require h_φ) will be denoted q_{θ,φ}:

    q_{θ,φ}(y_t | x, y_{:t-1}) ∝ exp (g_θ(s_{t-1}, x_t, y_t) + c_φ(s_t, s̄_t))   (26)

The neural networks in the present section are all parameterized by φ, and are intended to produce an estimate of the logprob-to-go H_t, a function of x_{t:} that sums over all possible y_{t:}.
By contrast, the OOHMM-inspired neural networks suggested in §3.2 were used to specify an actual model of the logprob-so-far G_t (a function of x_{:t} and y_{:t}), using separate parameters θ.
Arguably φ has a harder modeling job than θ because it must implicitly sum over possible futures y_{t:}. We now consider how to get corrected samples from q_{θ,φ} even if φ gives poor estimates of H_t, and then how to train φ to improve those estimates.
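One proposal step can be sketched as follows (assuming numpy; f_theta, g_theta, c_phi are the networks above, and sbar_t is the precomputed right-to-left state, all hypothetical interfaces):

    import numpy as np

    def proposal_step(s_prev, sbar_t, x_t, tagset, f_theta, g_theta, c_phi, rng):
        """Sample y_t from q of equation (26), proportional to
        exp(g_theta(s_{t-1}, x_t, y_t) + c_phi(s_t, sbar_t))."""
        next_states = [f_theta(s_prev, x_t, tag) for tag in tagset]
        logits = np.array([g_theta(s_prev, x_t, tag) + c_phi(s_next, sbar_t)
                           for tag, s_next in zip(tagset, next_states)])
        log_q = logits - np.logaddexp.reduce(logits)   # normalize over tags
        i = rng.choice(len(tagset), p=np.exp(log_q))
        return tagset[i], next_states[i], log_q[i]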

Particle smoothing
In this paper, we assume nothing about the given model G_T except that it is given in the form of equations (1)-(3) (including the parameter vector θ).
Suppose we run the exact sampling strategy but approximate p_θ in (7) with a proposal distribution q_{θ,φ} of the form in (25)-(26). Suppressing the subscripts on p and q for brevity, this means we are effectively drawing y not from p(y | x) but from

    q(y | x) def= Π_{t=1}^T q(y_t | x, y_{:t-1})   (27)

If C_t ≈ H_t + const within each y_t draw, then q ≈ p. Normalized importance sampling corrects (mostly) for the approximation by drawing many sequences y^(1), …, y^(M) IID from (27) and assigning each a relative weight:

    w^(m) def= p(y^(m) | x) / q(y^(m) | x),   p̂(y) def= Σ_m w^(m) I(y^(m) = y) / Σ_m w^(m)   (28)

where I(·) is the indicator function. This ensemble of weighted particles yields a distribution p̂ that can be used as discussed in §1.3. To compute w^(m) in practice, we replace the numerator p(y^(m) | x) by the unnormalized version exp G_T, which gives the same p̂. Recall that each G_T is a sum Σ_{t=1}^T g_θ(···).

Sequential importance sampling is an equivalent implementation that makes t the outer loop and m the inner loop. It computes a prefix ensemble

    Y_0 def= {(y^(1)_{:0}, w^(1)_0), …, (y^(M)_{:0}, w^(M)_0)},  with each y^(m)_{:0} = ε and w^(m)_0 = exp C_0   (29)

Then for 0 < t ≤ T, we extend these particles in parallel:

    y^(m)_{:t} def= y^(m)_{:t-1} y^(m)_t   (30)
    w^(m)_t def= w^(m)_{t-1} · exp (g_θ(s^(m)_{t-1}, x_t, y^(m)_t) + C^(m)_t - C^(m)_{t-1}) / q(y^(m)_t | x, y^(m)_{:t-1})   (31)

where each y^(m)_t is drawn from (26). Each Y_t yields a distribution p̂_t over prefixes y_{:t}, which estimates the distribution p_t(y_{:t}) ∝ exp (G_t + C_t). We return p̂ def= p̂_T ≈ p_T = p. This gives the same p̂ as in (28): the final y^(m)_{:T} are the same, with the same final weights w^(m)_T = exp G_T / q(y^(m) | x), where G_T was now summed up as C_0 + Σ_{t=1}^T (g_θ(···) + C_t - C_{t-1}).

That is our basic particle smoothing strategy. If we use the naive approximation C_t = 0 everywhere, it reduces to particle filtering. In either case, various well-studied improvements become available, such as various resampling schemes (Douc and Cappé, 2005) and the particle cascade (Paige et al., 2014). 8 An easy improvement is multinomial resampling. After computing each p̂_t, this replaces Y_t with a set of M new draws from p̂_t (≈ p_t), each of weight 1, which tends to drop low-weight particles and duplicate high-weight ones. 9 For this to usefully focus the ensemble on good prefixes y_{:t}, p̂_t should be a good approximation to the true marginal p(y_{:t} | x) ∝ exp (G_t + H_t) from (10). That is why we arranged for p_t(y_{:t}) ∝ exp (G_t + C_t). Without C_t, we would have only p_t(y_{:t}) ∝ exp G_t, which is fine for the traditional particle filtering setting, but in our setting it ignores future information in x_{t:} (which we have assumed is available) and also favors sequences y that happen to accumulate most of their global score G_T early rather than late (which is possible when the globally normalized model (1)-(2) is not factored in the generative form (4)).

8 The particle cascade would benefit from an estimate of Ĥ_t, as it (like A* search) compares particles of different lengths.

9 While resampling mitigates the degeneracy problem, it could also reduce the diversity of particles. In our experiments in this paper, we only do multinomial resampling when the effective sample size of p̂_t is lower than M/2. Doucet and Johansen (2009) give a more thorough discussion of when to resample.
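Putting the pieces together, here is a minimal sketch of the basic strategy without resampling (assuming numpy; sbar[t] is the precomputed right-to-left state for x_{t:}, and all callables are hypothetical stand-ins for the trained networks; substituting a zero function for c_phi recovers plain particle filtering):

    import numpy as np

    def particle_smooth(x, tagset, M, s0, sbar, f_theta, g_theta, c_phi, rng):
        """Sequential importance sampling with lookahead, equations (27)-(31).
        Each particle is (tag prefix, state s_t, log weight, previous C value)."""
        parts = [([], s0, 0.0, c_phi(s0, sbar[0])) for _ in range(M)]
        for t, x_t in enumerate(x, start=1):
            nxt = []
            for y, s, logw, C_prev in parts:
                states = [f_theta(s, x_t, tag) for tag in tagset]
                Cs = [c_phi(st, sbar[t]) for st in states]
                logits = np.array([g_theta(s, x_t, tag) + C
                                   for tag, C in zip(tagset, Cs)])
                log_q = logits - np.logaddexp.reduce(logits)    # proposal (26)
                i = rng.choice(len(tagset), p=np.exp(log_q))
                # weight update (31): multiply by exp(g + C_t - C_{t-1}) / q
                logw += logits[i] - C_prev - log_q[i]
                nxt.append((y + [tagset[i]], states[i], logw, Cs[i]))
            parts = nxt
        logws = np.array([p[2] for p in parts])
        w = np.exp(logws - np.logaddexp.reduce(logws))          # normalize as in (28)
        return [p[0] for p in parts], w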

Training the Sampler Heuristic
We now consider training the parameters φ of our sampler. These parameters determine the updates f̄_φ in (6) and the compatibility function c_φ in (25). As a result, they determine the proposal distribution q used in equations (27) and (31), and thus determine the stochastic choice of p̂ that is returned by the sampler on a given input x.
In this paper, we simply try to tune φ to yield good proposals. Specifically, we try to ensure that q_φ(y | x) in equation (27) is close to p(y | x) from equation (1). While this may not be necessary for the sampler to perform well downstream, 10 it does guarantee it (assuming that the model p is correct). Specifically, we seek to minimize

    (1 - λ) KL(p || q_φ) + λ KL(q_φ || p)   (with λ ∈ [0, 1])   (32)

averaged over examples x drawn from a training set. 11 (The training set need not provide true y's.)

The inclusive KL divergence KL(p || q_φ) is an expectation under p. We estimate it by replacing p with a sample p̂, which in practice we can obtain with our sampler under the current φ. (The danger, then, is that p̂ will be biased when φ is not yet well-trained; this can be mitigated by increasing the sample size M when drawing p̂ for training purposes.) Intuitively, this term tries to encourage q_φ in future to re-propose those y values that turned out to be "good" and survived into p̂ with high weights.

10 In principle, one could attempt to train φ "end-to-end" on some downstream objective by using reinforcement learning or the Gumbel-softmax trick (Jang et al., 2017; Maddison et al., 2017). For example, we might try to ensure that p̂ closely matches the model's distribution p (equation (28)), the "natural" goal of sampling. This objective can tolerate inaccurate local proposal distributions in cases where the algorithm could recover from them through resampling. Looking even farther downstream, we might merely want p̂ (which is typically used to compute expectations) to provide accurate guidance to some decision or training process (see Appendix E). This might not require fully matching the model, and might even make it desirable to deviate from an inaccurate model.

11 Training a single approximation q_φ for all x is known as amortized inference.
The exclusive KL divergence KL(q_φ || p) is an expectation under q_φ. Since we can sample from q_φ exactly, we can get an unbiased estimate of ∇_φ KL(q_φ || p) with the likelihood ratio trick (Glynn, 1990). 12 (The danger is that such "REINFORCE" methods tend to suffer from very high variance.) This term is a popular objective for variational approximation. Here, it tries to discourage q_φ from re-proposing "bad" y values that turned out to have low exp G_T relative to their proposal probability.

12 The normalizing constant of p from (1) can be ignored because the gradient of a constant is 0.
Our experiments balance "recall" (inclusive) and "precision" (exclusive) by taking λ = 1/2 (which Appendix F compares to λ ∈ {0, 1}). Alas, because of our approximation to the inclusive term, neither term's gradient will "find" and directly encourage good y values that have never been proposed. Appendix B gives further discussion and formulas.
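For intuition, here is one way the two terms could be combined into a single minibatch surrogate loss (a sketch in PyTorch; the tensor interface is hypothetical, and the exclusive term is the plain REINFORCE surrogate without the baseline correction of Appendix B):

    import torch

    def mixed_kl_loss(logq_particles, logw, logq_samples, logG_samples, lam=0.5):
        """Surrogate for (1 - lam) * KL(p || q) + lam * KL(q || p), eq. (32).
        Inclusive term: cross-entropy of q on weighted particles from p-hat (28);
        logq_particles must be differentiable in phi, logw is detached.
        Exclusive term: score-function surrogate on fresh draws y ~ q_phi, where
        logG_samples holds the corresponding unnormalized scores G_T."""
        w = torch.softmax(logw.detach(), dim=0)
        inclusive = -(w * logq_particles).sum()
        d = (logq_samples - logG_samples).detach()   # signed log-domain error
        exclusive = (d * logq_samples).mean()        # gradient = E[d * grad log q]
        return (1 - lam) * inclusive + lam * exclusive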

Models for the Experiments
To evaluate our methods, we needed pre-trained models p θ . We experimented on several models. In each case, we trained a generative model p θ (x, y), so that we could try sampling from its posterior distribution p θ (y | x). This is a very common setting where particle smoothing should be able to help. Details for replication are given in Appendix C.
tural" goal of sampling. This objective can tolerate inaccurate local proposal distributions in cases where the algorithm could recover from them through resampling. Looking even farther downstream, we might merely wantp-which is typically used to compute expectations-to provide accurate guidance to some decision or training process (see Appendix E). This might not require fully matching the model, and might even make it desirable to deviate from an inaccurate model. 11 Training a single approximation q φ for all x is known as amortized inference. 12 The normalizing constant of p from (1) can be ignored because the gradient of a constant is 0.

Tagging models
We can regard a tagged sentence (x, y) as a string over the "pair alphabet" X × Y. We train an RNN language model over this pair alphabet; this is a neuralized OOHMM as suggested in §3.2:

    p_θ(x, y) = Π_{t=1}^T p_θ(x_t, y_t | s_{t-1})   (33)

This model is locally normalized, so that log p_θ(x, y) (as well as its gradient) is straightforward to compute for a given training pair (x, y). Joint sampling from it would also be easy (§3.2).
However, p(y | x) is globally renormalized (by an unknown partition function that depends on x, namely exp H_0). Conditional sampling of y is therefore potentially hard. Choosing y_t optimally requires knowledge of H_t, which depends on the future x_{t:}.
As we noted in §1, many NLP tasks can be seen as tagging problems. In this paper we experiment with two such tasks: English stressed syllable tagging, where the stress of a syllable often depends on the number of remaining syllables, 13 providing good reason to use the lookahead provided by particle smoothing; and Chinese NER, which is a familiar textbook application and reminds the reader that our formal setup (tagging) provides enough machinery to treat other tasks (chunking).
English stressed syllable tagging This task tags a sequence of phonemes x, which form a word, with their stress markings y. Our training examples are the stressed words in the CMU pronunciation dictionary (Weide, 1998). We test the sampler on held-out unstressed words.
Chinese social media NER This task does named entity recognition in Chinese, by tagging the characters of a Chinese sentence in a way that marks the named entities. We use the dataset from Peng and Dredze (2015), whose tagging scheme is a variant of the IOB scheme mentioned in §1. We test the sampler on held-out sentences.

String source separation
This is an artificial task that provides a discrete analogue of speech source separation (Zibulevsky and Pearlmutter, 2001). The generative model is that J strings (possibly of different lengths) are generated IID from an RNN language model, and are then combined into a single string x according to a random interleaving string y. 14 The posterior p(y | x) predicts the interleaving string, which suffices to reconstruct the original strings. The interleaving string is selected from the uniform distribution over all possible interleavings (given the J strings' lengths). For example, with J = 2, a possible generative story is that we first sample two strings Foo and Bar from an RNN language model. We then draw an interleaving string 112122 from the aforementioned uniform distribution, and interleave the J strings deterministically to get FoBoar.
p(x, y) is proportional to the product of the probabilities of the J strings. The only parameters of p θ , then, are the parameters of the RNN language model, which we train on clean (non-interleaved) samples from a corpus. We test the sampler on random interleavings of held-out samples.
The state s (which is provided as an input to c_φ in (25)) is the concatenation of the J states of the language model as it independently generates the J strings, and g_θ(s_{t-1}, x_t, y_t) is the log-probability of generating x_t as the next character of the y_t-th string, given that string's language model state within s_{t-1}. As a special case, x_T = EOS (see footnote 1), and g_θ(s_{T-1}, EOS, EOS) is the total log-probability of termination in all J language model states.
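Under these definitions, a state update is just one language-model step on the selected source (a sketch; lm_step is a hypothetical wrapper around the pretrained RNN language model):

    def interleave_step(s_prev, x_t, y_t, lm_step):
        """Source-separation f_theta and g_theta: s is a tuple of J language-model
        states, and tag y_t in {0, ..., J-1} says which source emits x_t.
        lm_step(state, char) -> (new_state, logprob of char)."""
        states = list(s_prev)
        states[y_t], logp = lm_step(states[y_t], x_t)  # g_theta: logprob of x_t under source y_t
        return tuple(states), logp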
String source separation has good reason for lookahead: appending character "o" to a reconstructed string "gh" is only advisable if "s" and "t" are coming up soon to make "ghost." It also illustrates a powerful application setting: posterior inference under a generative model. This task conveniently allowed us to construct the generative model from a pre-trained language model. Our constructed generative model illustrates that the state s and transition function f can reflect interesting problem-specific structure.

CMU pronunciation dictionary The CMU pronunciation dictionary (already used above) provides sequences of phonemes. Here we use words no longer than 5 phonemes. We interleave the (unstressed) phonemes of J = 5 words.
Penn Treebank The PTB corpus (Marcus et al., 1993) provides English sentences, from which we use only the sentences of length ≤ 8. We interleave the words of J = 2 sentences.

Experiments
In our experiments, we are given a pre-trained scoring model p_θ, and we train the parameters φ of a particle smoothing algorithm. 15 We now show that our proposed neural particle smoothing sampler does better than the particle filtering sampler. To define "better," we evaluate samplers on the offset KL divergence from the true posterior.

Evaluation metrics
Given x, the "natural" goal of conditional sampling is for the sample distributionp(y) to approximate the true distribution p θ (y | x) = exp G T / exp H 0 from (1). We will therefore report-averaged over all held-out test examples x-the KL divergence wherep(y | x) denotes the unnormalized distribution given by exp G T in (2), and Z(x) denotes its normalizing constant, exp H 0 = yp (y | x).
As we are unable to compute log Z(x) in practice, we replace it with an estimate z(x) to obtain an offset KL divergence. This change of constant does not change the measured difference between two samplers, KL(p̂_1 || p) - KL(p̂_2 || p). Nonetheless, we try to use a reasonable estimate so that the reported KL divergence is interpretable in an absolute sense. Specifically, we take z(x) = log Σ_{y∈Y} p̃(y | x) ≤ log Z(x), where Y is the full set of distinct particles y that we ever drew for input x, including samples from the beam search models, while constructing the experimental results graph. 16 Thus, the offset KL divergence is a "best effort" lower bound on the true exclusive KL divergence KL(p̂ || p).
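Concretely, the metric could be computed as follows (a sketch, assuming numpy; the dictionary interface is hypothetical, with taggings encoded as hashable tuples):

    import numpy as np

    def offset_kl(all_particles, logp_tilde, logp_hat):
        """Offset KL divergence of section 7.1. all_particles: every distinct y
        ever proposed for this x (defines z(x)); logp_tilde[y] = G_T(x, y);
        logp_hat[y] = log of the sampler's estimated probability of y."""
        z = np.logaddexp.reduce(np.array([logp_tilde[y] for y in all_particles]))
        kl = 0.0
        for y, lp in logp_hat.items():   # terms with p-hat(y) = 0 contribute 0
            kl += np.exp(lp) * (lp - (logp_tilde[y] - z))
        return kl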

Results
In all experiments we compute the offset KL divergence for both the particle filtering samplers and the particle smoothing samplers, for varying ensemble sizes M. We also compare against a beam search baseline that keeps the highest-scoring M particles at each step (scored by exp G_t with no lookahead). The results are in Figures 2a-2d.

Figure 2: The y-axis is the offset KL divergence described in §7.1 (in bits per sequence). The smoothing samplers offer considerable speedup: for example, in Figure 2a, the non-resampled smoothing sampler achieves comparable offset KL divergences with only 1/4 as many particles as its filtering counterparts. Abbreviations in the legend: PF = particle filtering; PS = particle smoothing; BEAM = beam search; ':R' suffixes indicate resampled variants. For readability, beam search results are omitted from Figure 2d, but appear in Figure 3 of the appendices.
Given a fixed ensemble size, we see the smoothing sampler consistently performs better than the filtering counterpart. It often achieves comparable performance at a fraction of the ensemble size.
Beam search on the other hand falls behind on three tasks: stress prediction and the two source separation tasks. It does perform better than the stochastic methods on the Chinese NER task, but only at small beam sizes. Varying the beam size barely affects performance at all, across all tasks. This suggests that beam search is unable to explore the hypothesis space well.
We experiment with resampling for both the particle filtering sampler and our smoothing sampler. In source separation and stressed syllable prediction, where the right context contains critical information about how viable a particle is, resampling helps particle filtering almost catch up to particle smoothing. Particle smoothing itself is not further improved by resampling, presumably because its effective sample size is high. The goal of resampling is to kill off low-weight particles (which were overproposed) and reallocate their resources to higher-weight ones. But with particle smoothing, there are fewer low-weight particles, so the benefit of resampling may be outweighed by its cost (namely, increased variance).
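The resampling rule from footnote 9 is simple to state (a sketch, assuming numpy; particles may be any per-particle records):

    import numpy as np

    def maybe_resample(particles, logw, M, rng):
        """Multinomial resampling, triggered only when the effective sample size
        of the weighted ensemble drops below M/2."""
        w = np.exp(logw - np.logaddexp.reduce(logw))
        ess = 1.0 / np.sum(w ** 2)            # effective sample size
        if ess >= M / 2:
            return particles, logw
        idx = rng.choice(M, size=M, p=w)      # M fresh draws from p-hat_t
        return [particles[i] for i in idx], np.zeros(M)  # uniform weights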

Related Work
Much previous work has employed sequential importance sampling for approximate inference of intractable distributions (e.g., Thrun, 2000; Andrews et al., 2017). Some of this work learns adaptive proposal distributions in this setting (e.g., Gu et al., 2015; Paige and Wood, 2016). The key difference in our work is that we consider future inputs, which is impossible in online decision settings such as robotics. Klaas et al. (2006) did do particle smoothing, like us, but they did not learn adaptive proposal distributions.
Just as we use a right-to-left RNN to guide posterior sampling of a left-to-right generative model, Krishnan et al. (2017) employed a right-to-left RNN to guide posterior marginal inference in the same sort of model. Serdyuk et al. (2018) used a right-to-left RNN to regularize training of such a model.

Conclusion
We have described neural particle smoothing, a sequential Monte Carlo method for approximate sampling from the posterior of incremental neural scoring models. Sequential importance sampling has arguably been underused in the natural language processing community. It is quite a plausible strategy for dealing with rich, globally normalized probability models such as neural models, particularly if a good sequential proposal distribution can be found. Our contribution is a neural proposal distribution, which goes beyond particle filtering in that it uses a right-to-left recurrent neural network to "look ahead" to future symbols of x when proposing each symbol y_t. The form of our distribution is well-motivated.
There are many possible extensions to the work in this paper. For example, we can learn the generative model and proposal distribution jointly; we can also infuse them with hand-crafted structure, or use more deeply stacked architectures; and we can try training the proposal distribution end-to-end (footnote 10). Another possible extension would be to allow each step of q to propose a sequence of actions, effectively making the tagset size ∞. This extension relaxes our |y| = |x| restriction from §1 and would allow us to do general sequence-to-sequence transduction.
A The logprob-to-go for HMMs

As noted in §2.1, the logprob-to-go H_t can be computed by the backward algorithm. By the definition of H_t in equation (10),

    H_t = log (β_t)_{y_t}   (35)

where the vector β_t is defined by the base case (β_T)_y = 1 and for 0 ≤ t < T by the recurrence

    (β_t)_y = Σ_{y'} p_θ(y' | y) · p_θ(x_{t+1} | y') · (β_{t+1})_{y'}   (36)

The backward algorithm (20) for OOHMMs in §2.2 is a variant of this.
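A direct transcription (a sketch, assuming numpy with integer-encoded tags and observations; trans[y, y'] = p(y' | y) and emit[y, x] = p(x | y)):

    import numpy as np

    def hmm_backward(x_future, trans, emit):
        """Backward algorithm, equations (35)-(36): returns beta_t, where
        (beta_t)_y is the total probability of generating the future
        observations x_{t:} from current tag y."""
        beta = np.ones(trans.shape[0])       # base case: beta_T = 1
        for x_next in reversed(x_future):    # recurrence (36), right to left
            beta = trans @ (emit[:, x_next] * beta)
        return beta                          # then H_t = log beta[y_t]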

B Gradients for Training the Proposal Distribution
For a given x, both forms of KL divergence achieve their minimum of 0 when (∀y) q_φ(y | x) = p(y | x). However, we are unlikely to be able to find such a φ; the two metrics penalize q_φ differently for mismatches. We simplify the notation below by writing q_φ(y) and p(y), suppressing the conditioning on x.
Inclusive KL Divergence The inclusive KL divergence has that name because it is finite only when support(q_φ) ⊇ support(p), i.e., when q_φ is capable of proposing any string y that has positive probability under p. This is required for q_φ to be a valid proposal distribution for importance sampling.
The inclusive KL divergence can be written as

    KL(p || q_φ) = E_{y∼p}[log p(y)] - E_{y∼p}[log q_φ(y)]

The first term E_{y∼p}[log p(y)] is a constant with regard to φ. As a result, the gradient of the above is just the gradient of the second term, the cross-entropy H(p, q_φ):

    ∇_φ KL(p || q_φ) = -E_{y∼p}[∇_φ log q_φ(y)]

We cannot directly sample from p. However, our weighted mixture p̂ from equation (28) (obtained by sequential importance sampling) could be a good approximation:

    ∇_φ KL(p || q_φ) ≈ -E_{y∼p̂}[∇_φ log q_φ(y)] = -Σ_m w^(m) ∇_φ log q_φ(y^(m)) / Σ_{m'} w^(m')

Following this approximate gradient downhill has an intuitive interpretation: if a particular y_t value ends up with high relative weight in the final ensemble p̂, then we will try to adjust q_φ so that it would have had a high probability of proposing that y_t value at step t in the first place.
Exclusive KL Divergence The exclusive divergence has that name because it is finite only when support(q_φ) ⊆ support(p). It is defined by

    KL(q_φ || p) = E_{y∼q_φ}[log q_φ(y) - log p(y)]

where p(y) = (1/Z) p̃(y) for p̃(y) = exp G_T and Z = Σ_y p̃(y). With some rearrangement, we can write its gradient as an expectation that can be estimated by sampling from q_φ. 17 Observing that Z is constant with respect to φ, first write

    ∇_φ KL(q_φ || p) = Σ_y ∇_φ [q_φ(y) · d_φ(y)] = Σ_y (∇_φ q_φ(y)) · d_φ(y)

where d_φ(y) def= log q_φ(y) - log p̃(y), and the last step uses the fact that ∇_φ q_φ(y) = q_φ(y) ∇_φ log q_φ(y) (the "likelihood ratio trick"), which implies Σ_y q_φ(y) ∇_φ d_φ(y) = Σ_y ∇_φ q_φ(y) = ∇_φ 1 = 0. Applying the trick once more,

    ∇_φ KL(q_φ || p) = E_{y∼q_φ}[d_φ(y) ∇_φ log q_φ(y)]   (43)

which can, if desired, be further rewritten as

    ∇_φ KL(q_φ || p) = E_{y∼q_φ}[d_φ(y) ∇_φ d_φ(y)]   (44)

If we regard d_φ(y) as a signed error (in the log domain) in trying to fit q_φ to p̃, then the above gradient of KL can be interpreted as the gradient of the mean squared error (divided by 2). 18 We would get the same gradient for any rescaled version of the unnormalized distribution p̃, but the formula for obtaining that gradient would be different. In particular, if we rewrite the above derivation but add a constant b to both log p̃(y) and log Z throughout (equivalent to adding b to G_T), we will get the slightly generalized expectation formulas

    E_{y∼q_φ}[(d_φ(y) - b) ∇_φ log q_φ(y)]  and  E_{y∼q_φ}[(d_φ(y) - b) ∇_φ d_φ(y)]

in place of equations (43) and (44). By choosing an appropriate "baseline" b, we can reduce the variance of the sampling-based estimate of these expectations. This is similar to the use of a baseline in the REINFORCE algorithm (Williams, 1992). In this work we choose b using an exponential moving average of past E[d_φ(y)] values: at the end of each training minibatch, we update b ← 0.1 · b + 0.9 · d̄, where d̄ is the mean of the estimated E_{y∼q_φ(·|x)}[d_φ(y)] values for all examples x in the minibatch.

18 We thank Hongyuan Mei, Tim Vieira, and Sanjeev Khudanpur for insightful discussions on this derivation.
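The baseline-corrected estimator can be sketched as follows (hypothetical interface; grad_logq(y) returns the gradient vector of log q_phi(y), e.g., as a numpy array, and d_phi(y) = log q_phi(y) - log p-tilde(y)):

    def exclusive_kl_grad(samples, grad_logq, d_phi, b):
        """Monte Carlo estimate of the baseline-corrected form of (43):
        E_{y~q}[(d_phi(y) - b) * grad log q_phi(y)], for samples y ~ q_phi."""
        grads = [(d_phi(y) - b) * grad_logq(y) for y in samples]
        return sum(grads) / len(grads)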

C Implementation Details
We implement all RNNs in this paper as GRU networks (Cho et al., 2014) with d = 32 hidden units (state space R^32). Each of our models (§6) always specifies the logprob-so-far in equations (2) and (3) using a 1-layer left-to-right GRU, 19 while the corresponding proposal distribution (§3.3) always specifies the state s̄_t in (6) using a 2-layer right-to-left GRU, and specifies the compatibility function C_t in (23) using a 4-layer feedforward ReLU network. 20 For the Chinese social media NER task (§6.1), we use the Chinese character embeddings provided by Peng and Dredze (2015), while for the source separation tasks (§6.2), we use the 50-dimensional GloVe word embeddings (Pennington et al., 2014). In other cases, we train embeddings along with the rest of the network. We optimize with the Adam optimizer using the default parameters (Kingma and Ba, 2015) and an L2 regularization coefficient of 10^-5.

19 For the tagging task described in §6.1, g_θ(s_{t-1}, x_t, y_t) def= log p_θ(x_t, y_t | s_{t-1}), where the GRU state s_{t-1} is used to define a softmax distribution over possible (x_t, y_t) pairs in the same manner as an RNN language model (Mikolov et al., 2010). Likewise, for the source separation task (§6.2), the source language models described in Appendix G are GRU-based RNN language models.

D Training Procedures
In all our experiments, we train the incremental scoring models (the tagging and source separation models described in §6.1 and §6.2, respectively) on the training dataset T. We do early stopping, using perplexity on a held-out development set D_1 to choose the number of epochs to train (maximum of 3). Having obtained these model parameters θ, we train our proposal distributions q_{θ,φ} on T, keeping θ fixed and only tuning φ. Again we use early stopping, using the KL divergence from §7.1 on a separate development set D_2 to choose the number of epochs to train (maximum of 20 for the two tagging tasks and source separation on the PTB dataset, and maximum of 50 for source separation on the phoneme sequence dataset). We then evaluate q_{θ*,φ*} on the test dataset E.
[Appendices E-G appear in the supplementary material file.]