Scalable Bayesian Learning of Recurrent Neural Networks for Language Modeling

Recurrent neural networks (RNNs) have shown promising performance for language modeling. However, traditional training of RNNs using back-propagation through time often suffers from overfitting. One reason for this is that stochastic optimization (used for large training sets) does not provide good estimates of model uncertainty. This paper leverages recent advances in stochastic gradient Markov Chain Monte Carlo (also appropriate for large training sets) to learn weight uncertainty in RNNs. It yields a principled Bayesian learning algorithm, adding gradient noise during training (enhancing exploration of the model-parameter space) and model averaging when testing. Extensive experiments on various RNN models and across a broad range of applications demonstrate the superiority of the proposed approach relative to stochastic optimization.


Introduction
Language modeling is a fundamental task, used for example to predict the next word or character in a text sequence given the context. Recently, recurrent neural networks (RNNs) have shown promising performance on this task (Mikolov et al., 2010;Sutskever et al., 2011). RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have emerged as a popular architecture, due to their representational power and effectiveness at capturing long-term dependencies.
RNNs are usually trained via back-propagation through time (Werbos, 1990), using stochastic optimization methods such as stochastic gradient descent (SGD) (Robbins and Monro, 1951); stochastic methods of this type are particularly important for training with large data sets. However, this approach often provides a maximum a posteriori (MAP) estimate of model parameters. The MAP solution is a single point estimate, ignoring weight uncertainty (Blundell et al., 2015; Hernández-Lobato and Adams, 2015). Natural language often exhibits significant variability, and hence such a point estimate may make over-confident predictions on test data.
To alleviate this overfitting problem in RNNs, good regularization is known to be a key factor in successful applications. In the neural network literature, Bayesian learning has been proposed as a principled method to impose regularization and incorporate model uncertainty (MacKay, 1992; Neal, 1995), by imposing prior distributions on model parameters. Due to the intractability of posterior distributions in neural networks, Hamiltonian Monte Carlo (HMC) (Neal, 1995) has been used to provide sample-based approximations to the true posterior. Despite the elegant theoretical property of asymptotic convergence to the true posterior, HMC and other conventional Markov Chain Monte Carlo methods are not scalable to large training sets.
This paper seeks to scale up Bayesian learning of RNNs to meet the challenge of the increasing amount of "big" sequential data in natural language processing, by leveraging recent advances in stochastic gradient Markov Chain Monte Carlo (SG-MCMC) algorithms (Welling and Teh, 2011; Chen et al., 2014; Ding et al., 2014; Li et al., 2016a). Specifically, instead of training a single network, SG-MCMC is employed to train an ensemble of networks, where each network has its parameters drawn from a shared posterior distribution. This is implemented by adding additional gradient noise during training and utilizing model averaging when testing.
Figure 1: Illustration of different weight learning strategies in a single-hidden-layer RNN. Stochastic optimization used for MAP estimation puts fixed values on all weights. Naive dropout is allowed to put weight uncertainty only on encoding and decoding weights, and fixed values on recurrent weights. The proposed SG-MCMC scheme imposes distributions on all weights.
This simple procedure has the following favorable properties for training neural networks: (i) The injected noise encourages model-parameter trajectories when training to better explore the parameter space. This procedure was also empirically found effective in (Neelakantan et al., 2016). (ii) Model averaging when testing alleviates overfitting and hence improves generalization, transferring uncertainty in the learned model parameters to subsequent prediction. (iii) In theory, both asymptotic and non-asymptotic consistency properties of SG-MCMC methods in posterior estimation have been recently established to guarantee convergence (Chen et al., 2015a;Teh et al., 2016). (iv) SG-MCMC is scalable; it shares the same level of computational cost as SGD in training, by only requiring the evaluation of gradients on a small mini-batch. To the authors' knowledge, RNN training using SG-MCMC has not been investigated previously, and is a contribution of this paper. We also perform extensive experiments on several natural language processing tasks, demonstrating the effectiveness of SG-MCMC for RNNs, including character/word-level language modeling, image captioning and sentence classification.

Related Work
Several scalable Bayesian learning methods have been proposed recently for neural networks. These come in two broad categories: stochastic variational inference (Graves, 2011;Blundell et al., 2015;Hernández-Lobato and Adams, 2015) and SG-MCMC methods (Korattikara et al., 2015;Li et al., 2016a). While prior work focuses on feed-forward neural networks, there has been little if any research reported for RNNs using SG-MCMC.
Dropout (Hinton et al., 2012; Srivastava et al., 2014) is a commonly used regularization method for training neural networks. Recently, there have been several works studying how to apply dropout to RNNs (Pachitariu and Sahani, 2013; Bayer et al., 2013; Pham et al., 2014; Zaremba et al., 2014; Bluche et al., 2015; Moon et al., 2015; Semeniuta et al., 2016; Gal and Ghahramani, 2016b). Among them, naive dropout (Zaremba et al., 2014) can impose weight uncertainty only on encoding weights (those that connect input to hidden units) and decoding weights (those that connect hidden units to output), but not the recurrent weights (those that connect consecutive hidden states). It has been concluded that noise added in the recurrent connections leads to model instabilities, hence disrupting the RNN's ability to model sequences.
Dropout has been recently shown to be a variational approximation technique in Bayesian learning (Gal and Ghahramani, 2016a;Kingma et al., 2015). Based on this, (Gal and Ghahramani, 2016b) proposed a new variant of dropout that can be successfully applied to recurrent layers, where the same dropout masks are shared along time for encoding, decoding and recurrent weights, respectively. Alternatively, we focus on SG-MCMC, which can be viewed as the Bayesian interpretation of dropout from the perspective of posterior sampling (Li et al., 2016b); this also allows imposition of model uncertainty on recurrent layers, boosting performance. A comparison of naive dropout and SG-MCMC is illustrated in Fig. 1.

RNN as Bayesian Predictive Models
Consider data $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_N\}$, where $\mathcal{D}_n \triangleq (X_n, Y_n)$, with input $X_n$ and output $Y_n$. Our goal is to learn model parameters $\theta$ to best characterize the relationship from $X_n$ to $Y_n$, with corresponding data likelihood $p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(\mathcal{D}_n|\theta)$. In Bayesian statistics, one sets a prior on $\theta$ via distribution $p(\theta)$. The posterior $p(\theta|\mathcal{D}) \propto p(\theta)\,p(\mathcal{D}|\theta)$ reflects the belief concerning the model parameter distribution after observing the data. Given a test input $\tilde{X}$ (with missing output $\tilde{Y}$), the uncertainty learned in training is transferred to prediction, yielding the posterior predictive distribution:
$$p(\tilde{Y}|\tilde{X}, \mathcal{D}) = \mathbb{E}_{p(\theta|\mathcal{D})}\left[p(\tilde{Y}|\tilde{X}, \theta)\right] = \int p(\tilde{Y}|\tilde{X}, \theta)\, p(\theta|\mathcal{D})\, d\theta. \tag{1}$$
When the input is a sequence, RNNs may be used to parameterize the input-output relationship. Specifically, consider input sequence $X = \{x_1, \dots, x_T\}$, where $x_t$ is the input data vector at time $t$. There is a corresponding hidden state vector $h_t$ at each time $t$, obtained by recursively applying the transition function $h_t = \mathcal{H}(h_{t-1}, x_t)$ (specified in Section 3.2; see Fig. 1). The output $Y$ differs depending on the application: a sequence $\{y_1, \dots, y_T\}$ in language modeling or a discrete label in sentence classification. In RNNs the corresponding decoding function is $p(y|h)$, described in Section 3.3.

RNN Architectures
The transition function H(·) can be implemented with a gated activation function, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or a Gated Recurrent Unit (GRU) (Cho et al., 2014). Both LSTM and GRU have been proposed to address the issue of learning long-term sequential dependencies.
Long Short-Term Memory. The LSTM architecture addresses the problem of learning long-term dependencies by introducing a memory cell that is able to preserve the state over long periods of time. Specifically, each LSTM unit has a cell containing a state $c_t$ at time $t$. This cell can be viewed as a memory unit. Reading or writing the cell is controlled through sigmoid gates: input gate $i_t$, forget gate $f_t$, and output gate $o_t$. The hidden units $h_t$ are updated as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh(c_t),$$
where $\sigma(\cdot)$ denotes the logistic sigmoid function, and $\odot$ represents the element-wise matrix multiplication operator. $W_{\{i,f,o,c\}}$ are encoding weights, and $U_{\{i,f,o,c\}}$ are recurrent weights, as shown in Fig. 1. $b_{\{i,f,o,c\}}$ are bias terms.
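As an illustration of the transition function, the following is a minimal NumPy sketch of one LSTM step, assuming the standard gate equations with encoding weights W, recurrent weights U and biases b. The dictionary layout, the dimensions and the function name `lstm_step` are hypothetical choices made for this example only.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM transition. W, U, b are dicts keyed by 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # write to the memory cell
    h_t = o_t * np.tanh(c_t)                                    # read from the memory cell
    return h_t, c_t

# Toy usage with illustrative sizes: 4-dimensional inputs, 3 hidden units.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = {k: 0.1 * rng.standard_normal((d_h, d_in)) for k in "ifoc"}
U = {k: 0.1 * rng.standard_normal((d_h, d_h)) for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):  # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the hidden state is an output gate in (0, 1) multiplied by a tanh, every entry of `h` stays strictly inside (-1, 1).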
Variants. Similar to the LSTM unit, the GRU also has gating units that modulate the flow of information inside the hidden unit. It has been shown that a GRU can achieve similar performance to an LSTM in sequence modeling (Chung et al., 2014). We specify the GRU in the Supplementary Material.
The LSTM can be extended to the bidirectional LSTM and multilayer LSTM. A bidirectional LSTM consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the bidirectional LSTM is the concatenation of the forward and backward hidden states. In multilayer LSTMs, the hidden state of an LSTM unit in layer $\ell$ is used as input to the LSTM unit in layer $\ell + 1$ at the same time step.

Applications
The proposed Bayesian framework can be applied to any RNN model; we focus on the following basic tasks to demonstrate the ideas.

Language Modeling
In word-level language modeling, the input to the network is a sequence of words, and the network is trained to predict the next word in the sequence with a softmax classifier. Specifically, for a length-$T$ sequence, denote $y_t = x_{t+1}$ for $t = 1, \dots, T-1$. The tokens $x_1$ and $y_T$ are always set to a special START and END token, respectively. At each time $t$, there is a decoding function $p(y_t|h_t) = \mathrm{softmax}(V h_t)$ to compute the distribution over words, where $V$ are the decoding weights (the number of rows of $V$ corresponds to the number of words/characters). We also extend this basic language model to consider other applications: (i) a character-level language model can be specified in a similar manner by replacing words with characters (Karpathy et al., 2016). (ii) Image captioning can be considered as a conditional language modeling problem, in which we learn a generative language model of the caption conditioned on an image (Vinyals et al., 2015).

Sentence Classification
Sentence classification aims to assign a semantic category label $y$ to a whole sentence $X$. This is usually implemented through applying the decoding function once at the end of the sequence: $p(y|h_T) = \mathrm{softmax}(V h_T)$, where the final hidden state $h_T$ of an RNN is often considered as the summary of the sentence (here the number of rows of $V$ corresponds to the number of classes).

The Pitfall of Stochastic Optimization
Typically there is no closed-form solution for (1), and traditional MCMC methods scale poorly for large $N$. To ease the computational burden, stochastic optimization is often employed to find the MAP solution. This is equivalent to minimizing a regularized loss function $U(\theta)$ that corresponds to a (non-convex) model of interest: $\theta_{\mathrm{MAP}} = \arg\min_{\theta} U(\theta)$, with $U(\theta) = -\log p(\theta|\mathcal{D})$. The expectation in (1) is approximated as:
$$p(\tilde{Y}|\tilde{X}, \mathcal{D}) \approx p(\tilde{Y}|\tilde{X}, \theta_{\mathrm{MAP}}). \tag{2}$$
Though simple and effective, this procedure largely loses the benefit of the Bayesian approach, because the uncertainty on weights is ignored. To more accurately approximate (1), we employ SG-MCMC.

Large-scale Bayesian Learning
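In a Bayesian model, the above regularized loss function corresponds to the potential energy defined as the negative log-posterior:
$$U(\theta) \triangleq -\log p(\theta) - \sum_{n=1}^{N} \log p(\mathcal{D}_n|\theta). \tag{3}$$
In optimization, $E = -\sum_{n=1}^{N} \log p(\mathcal{D}_n|\theta)$ is typically referred to as the loss function, and $R \propto -\log p(\theta)$ as a regularizer.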
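For large $N$, stochastic approximations are often employed:
$$\tilde{U}_t(\theta) \triangleq -\log p(\theta) - \frac{N}{M} \sum_{m=1}^{M} \log p(\mathcal{D}_{i_m}|\theta), \tag{4}$$
where $\{i_1, \dots, i_M\}$ is a random subset (mini-batch) of $\{1, \dots, N\}$ with $M \ll N$. The gradient on this mini-batch is denoted as $\tilde{f}_t = \nabla \tilde{U}_t(\theta)$, which is an unbiased estimate of the true gradient. The evaluation of (4) is cheap even when $N$ is large, allowing one to efficiently collect a sufficient number of samples in large-scale Bayesian learning, $\{\theta_s\}_{s=1}^{S}$, where $S$ is the number of samples (this will be specified later). These samples are used to construct a sample-based estimation to the expectation in (1):
$$p(\tilde{Y}|\tilde{X}, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\tilde{Y}|\tilde{X}, \theta_s). \tag{5}$$
The finite-time estimation errors of SG-MCMC methods are bounded (Chen et al., 2015a), which guarantees (5) is an unbiased estimate of (1) asymptotically under appropriate decreasing stepsizes.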
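The sample-based estimate of the predictive distribution can be illustrated with a small NumPy sketch. It assumes a softmax decoder whose decoding matrix V is the only sampled parameter; the random matrices below are stand-ins for posterior samples, and the sizes and the function name `predictive_average` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_average(h, V_samples):
    """Average p(y | h, V_s) over S sampled decoding matrices V_s."""
    probs = np.stack([softmax(V @ h) for V in V_samples])  # shape (S, n_classes)
    return probs.mean(axis=0)

rng = np.random.default_rng(0)
h = rng.standard_normal(8)                                    # hidden state (illustrative size)
V_samples = [rng.standard_normal((5, 8)) for _ in range(10)]  # stand-ins for S = 10 samples
p_avg = predictive_average(h, V_samples)                      # averaged distribution over 5 classes
```

Averaging probabilities, rather than averaging the weights themselves, is what carries the spread of the samples over into the prediction; a MAP estimate corresponds to using a single sample.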

SG-MCMC Algorithms
SG-MCMC and stochastic optimization are two parallel lines of work, designed for different purposes; their relationship has recently been revealed in the context of deep learning. The most basic SG-MCMC algorithm has been applied to Langevin dynamics, and is termed SGLD (Welling and Teh, 2011). To help convergence, a momentum term has been introduced in SGHMC (Chen et al., 2014), a "thermostat" has been devised in SGNHT (Ding et al., 2014; Gan et al., 2015) and preconditioners have been employed in pSGLD (Li et al., 2016a). These SG-MCMC algorithms often share similar characteristics with their counterpart approaches from the optimization literature such as the momentum SGD, Santa (Chen et al., 2016) and RMSprop/Adagrad (Tieleman and Hinton, 2012; Duchi et al., 2011). The interrelationships between SG-MCMC and optimization-based approaches are summarized in Table 1.
pSGLD. Preconditioned SGLD (pSGLD) (Li et al., 2016a) was proposed recently to improve the mixing of SGLD. It utilizes magnitudes of recent gradients to construct a diagonal preconditioner to approximate the Fisher information matrix, and thus adjusts to the local geometry of parameter space by equalizing the gradients so that a constant stepsize is adequate for all dimensions. This is important for RNNs, whose parameter space often exhibits pathological curvature and saddle points (Pascanu et al., 2013), resulting in slow mixing. There are multiple choices of preconditioners; similar ideas in optimization include Adagrad (Duchi et al., 2011), Adam (Kingma and Ba, 2015) and RMSprop (Tieleman and Hinton, 2012). An efficient version of pSGLD, adopting RMSprop as the preconditioner $G$, is summarized in Algorithm 1, where $\oslash$ denotes element-wise matrix division. When the preconditioner is fixed as the identity matrix, the method reduces to SGLD.
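A minimal sketch of a pSGLD-style update with an RMSprop preconditioner is given below. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: the names `psgld_step`, `alpha` and `lam`, their default values, and the toy standard-normal target are all hypothetical, and the correction term involving the derivative of the preconditioner is omitted.

```python
import numpy as np

def psgld_step(theta, grad, V, step_size, rng, alpha=0.99, lam=1e-5):
    """One preconditioned SGLD update (sketch).

    grad is the mini-batch gradient of the negative log-posterior.
    V is a running average of squared gradients (RMSprop statistic).
    """
    V = alpha * V + (1.0 - alpha) * grad * grad   # update second-moment estimate
    G = 1.0 / (lam + np.sqrt(V))                  # diagonal preconditioner
    noise = rng.standard_normal(theta.shape) * np.sqrt(step_size * G)
    theta = theta - 0.5 * step_size * G * grad + noise  # gradient step plus injected noise
    return theta, V

# Toy usage: sample from a standard normal, for which U(theta) = 0.5 * ||theta||^2.
rng = np.random.default_rng(0)
theta, V = np.ones(2), np.zeros(2)
samples = []
for t in range(20000):
    grad = theta  # exact gradient of the toy potential
    theta, V = psgld_step(theta, grad, V, step_size=0.1, rng=rng)
    if t >= 1000:  # discard burn-in
        samples.append(theta.copy())
samples = np.array(samples)
```

Setting `G` to a vector of ones recovers plain SGLD, and dropping the `noise` term recovers an RMSprop-like optimizer, which mirrors the correspondence between the two families described above.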

Understanding SG-MCMC
To further understand SG-MCMC, we show its close connection to dropout/dropConnect (Srivastava et al., 2014;Wan et al., 2013). These methods improve the generalization ability of deep models, by randomly adding binary/Gaussian noise to the local units or global weights. For neural networks with the nonlinear function q(·) and consecutive layers h 1 and h 2 , dropout and dropConnect are denoted as:

Dropout: $h_2 = \xi_0 \odot q(\theta h_1)$,
DropConnect: $h_2 = q((\xi_0 \odot \theta) h_1)$,
where the injected noise $\xi_0$ can be binary-valued with dropping rate $p$ or its equivalent Gaussian form (Wang and Manning, 2013):
Binary noise: $\xi_0 \sim \mathrm{Ber}(p)$,
Gaussian noise: $\xi_0 \sim \mathcal{N}\left(1, \frac{p}{1-p}\right)$.
Note that $\xi_0$ is defined as a vector for dropout, and a matrix for dropConnect. By combining dropConnect and Gaussian noise from the above, we have the update rule (Li et al., 2016b):
$$\theta_{t+1} = \xi_0 \odot \theta_t - \frac{\eta}{2} \tilde{f}_t = \theta_t - \frac{\eta}{2} \tilde{f}_t + \xi_0', \tag{8}$$
where $\xi_0' \sim \mathcal{N}\left(0, \frac{p}{1-p}\,\mathrm{diag}(\theta_t^2)\right)$. Equation (8) shows that dropout/dropConnect and SGLD in (6) share the same form of update rule, with the distinction being that the level of injected noise is different. In practice, the noise injected by SGLD may not be enough. A better way that we find to improve the performance is to jointly apply SGLD and dropout. This method can be interpreted as using SGLD to sample the posterior distribution of a mixture of RNNs, with mixture probability controlled by the dropout rate.
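The link between multiplicative noise and additive noise can be checked numerically. The sketch below assumes the Gaussian form of dropConnect noise with mean 1 and variance p/(1-p) for dropping rate p (an assumption of this example), and verifies that multiplying a weight by such noise is the same as adding zero-mean noise whose standard deviation scales with the weight magnitude. The weight values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # dropping rate
theta = np.array([0.5, -2.0, 1.0])    # arbitrary illustrative weights
sigma = np.sqrt(p / (1.0 - p))        # std of the assumed multiplicative Gaussian noise

# Multiplicative noise: xi ~ N(1, p / (1 - p)), applied element-wise to theta.
xi = rng.normal(1.0, sigma, size=(100_000, theta.size))
perturbed = xi * theta

# Equivalent additive view: perturbed = theta + noise, noise ~ N(0, sigma^2 * theta^2).
additive = perturbed - theta
emp_std = additive.std(axis=0)        # should be close to sigma * |theta|
expected_std = sigma * np.abs(theta)
```

Unlike SGLD, whose injected noise is set by the stepsize, the additive noise here grows with the size of each weight, which is one way to read the remark that the two schemes differ in the level of injected noise.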

Experiments
We present results on several tasks, including character/ word-level language modeling, image captioning, and sentence classification. The hyperparameter setting, the initialization of model parameters and model specifications on each dataset are all provided in the Supplementary Material.
We do not perform any dataset-specific tuning other than early stopping on validation sets. When dropout is utilized, the dropout rate is empirically set to 0.5. All experiments are implemented in Theano (Theano Development Team, 2016), using an NVIDIA GeForce GTX TITAN X GPU with 12GB memory.

Language Modeling
We first test character-level and word-level language modeling. The setup for each task is as follows.
Results are reported in Table 2; we observe that pSGLD consistently outperforms RMSprop. Table 3 summarizes the test set performance on PTB. It is clear that our sampling-based method consistently outperforms the optimization counterpart, where the performance gain mainly comes from adding gradient noise and model averaging. When compared with dropout, SGLD performs better on the small LSTM model, but slightly worse on the medium and large LSTM models. This may imply that dropout is better suited to regularizing large networks, while SGLD exhibits better regularization ability on small networks, partially because dropout may inject a higher level of noise during training than SGLD. In order to inject a higher level of noise into SGLD, we empirically apply SGLD and dropout jointly, and find that this provides the best performance on the medium and large LSTM models.
We study three strategies for model averaging: forward collection, backward collection and thinned collection. Given samples $(\theta_1, \dots, \theta_K)$ and the number of samples $S$ used for averaging, forward collection refers to using $(\theta_1, \dots, \theta_S)$ for the evaluation of a test function, backward collection refers to using $(\theta_{K-S+1}, \dots, \theta_K)$, while thinned collection chooses samples from $\theta_1$ to $\theta_K$ with interval $K/S$. Fig. 2 plots the effects of these strategies, where Fig. 2(a) plots the perplexity of every single sample and Fig. 2(b) plots the perplexities using the three schemes. Thinned collection reaches a converged perplexity after only 20 samples, while forward collection requires 30 samples and backward collection 60 samples. This is unsurprising, because thinned collection provides a better way to select samples. Nevertheless, averaging of samples provides significantly lower perplexity than using single samples. Note that the overfitting problem in Fig. 2(a) is also alleviated by model averaging.
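The three collection strategies amount to different ways of choosing S of the K stored samples. The following sketch shows one possible reading; the function name `collect` is hypothetical, and the thinned variant assumes the selection starts at the first sample and uses an integer interval of K // S.

```python
def collect(samples, S, strategy):
    """Select S of the K stored samples for model averaging (sketch)."""
    K = len(samples)
    if not 1 <= S <= K:
        raise ValueError("S must satisfy 1 <= S <= K")
    if strategy == "forward":    # theta_1, ..., theta_S
        return samples[:S]
    if strategy == "backward":   # theta_{K-S+1}, ..., theta_K
        return samples[K - S:]
    if strategy == "thinned":    # every (K // S)-th sample, starting from theta_1
        step = K // S
        return samples[::step][:S]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy usage: K = 10 samples labelled 1..10, S = 5.
thetas = list(range(1, 11))
forward = collect(thetas, 5, "forward")
backward = collect(thetas, 5, "backward")
thinned = collect(thetas, 5, "thinned")
```

Thinned collection spreads the selected samples across the whole chain, so consecutive selections are less correlated than in the forward or backward schemes.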
To better illustrate the benefit of model averaging, we visualize in Fig. 3 the probabilities of each word in a randomly chosen test sentence. The first 3 rows are the results predicted by 3 distinct model samples, respectively; the bottom row is the result after averaging. Their corresponding perplexities for the test sentence are also shown on the right of each row. The 3 individual samples provide reasonable probabilities. For example, the consecutive words "New York", "stock exchange" and "did not" are assigned a higher probability. After averaging, we can see a much lower perplexity, as the samples can complement each other. For example, though the second sample yields the lowest single-model perplexity, its prediction on the word "York" still benefits from the other two via averaging.

Image Caption Generation
We next consider the problem of image caption generation, which is a conditional RNN model, where image features are extracted by a residual network (He et al., 2016), and then fed into the RNN to generate the caption. We present results on two benchmark datasets, Flickr8k (Hodosh et al., 2013) and Flickr30k (Young et al., 2014). These datasets contain 8,000 and 31,000 images, respectively. Each image is annotated with 5 sentences. A single-layer LSTM is employed with the number of hidden units set to 512.
The widely used BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr-D (Vedantam et al., 2015) metrics are used to evaluate the performance. All the metrics are computed by using the code released by the COCO evaluation server (Chen et al., 2015b). Table 4 presents results for pSGLD/RMSprop with or without dropout. Consistent with the results in the basic language modeling, pSGLD yields improved performance compared to RMSprop. For example, pSGLD provides a 2.77 BLEU-4 score improvement over RMSprop on the Flickr8k dataset. By comparing pSGLD with RMSprop with dropout, we conclude that pSGLD exhibits better regularization ability than dropout on these two datasets.
Figure 4: Captions generated for two images, in groups of three, in order of appearance:
"a tan dog is playing in the grass"; "a tan dog is playing with a red ball in the grass"; "a tan dog with a red collar is running in the grass"
"a yellow dog runs through the grass"; "a yellow dog is running through the grass"; "a brown dog is running through the grass"
"a group of people stand in front of a building"; "a group of people stand in front of a white building"; "a group of people stand in front of a large building"
"a man and a woman walking on a sidewalk"; "a man and a woman stand on a balcony"; "a man and a woman standing on the ground"
Apart from modeling weight uncertainty, different samples from our algorithm may capture different aspects of the input image. An example with two images is shown in Fig. 4, where 2 randomly chosen model samples are considered for each image. For each model sample, the top 3 generated captions are presented. We use the beam search approach (Vinyals et al., 2015) to generate captions, with a beam of size 5. In Fig. 4, the two samples for the first image mainly differ in the color and activity of the dog, e.g., "tan" or "yellow", "playing" or "running", whereas for the second image, the two samples reflect different understanding of the image content.
Figure 5: Learning curves on TREC dataset.

Sentence Classification
We study the task of sentence classification on 5 datasets. Table 5 shows the testing classification errors. 10-fold cross-validation is used for evaluation on the first 4 datasets, while TREC has a pre-defined training/test split, and we run each algorithm 10 times on TREC. In addition to (naive) dropout, we further compare pSGLD with Gal's dropout, recently proposed in (Gal and Ghahramani, 2016b), which is shown to be applicable to recurrent layers. The combination of pSGLD and dropout consistently provides the lowest errors.
In the following, we focus on the analysis of TREC. Each sentence of TREC is a question, and the goal is to decide which topic type the question is most related to: location, human, numeric, abbreviation, entity or description. Fig. 5 plots the learning curves of different algorithms on the training, validation and testing sets of the TREC dataset. pSGLD and dropout have similar behavior: they explore the parameter space during learning, and thus converge slower than RMSprop on the training dataset. However, the learned uncertainty alleviates overfitting and results in lower errors on the validation and testing datasets.
Figure 6: Visualization. Top two rows show selected ambiguous sentences, which correspond to the points with black circles in tSNE visualization of the testing dataset. Testing questions shown: "What does cc in engines mean?" and "What does a defibrillator do?"; type labels shown: Description, Description, Entity, Abbreviation.
To further study the Bayesian nature of the proposed approach, in Fig. 6 we choose two testing sentences with high uncertainty (i.e., standard deviation in prediction) from the TREC dataset. Interestingly, after embedding to 2d-space with tSNE (Van der Maaten and Hinton, 2008), the two sentences correspond to points lying on the boundary of different classes. We use 20 model samples to estimate the prediction mean and standard deviation on the true type and predicted type. The classifier yields higher probability on the wrong types, associated with higher standard deviations. One can leverage the uncertainty information to make decisions: either manually make a human judgement when uncertainty is high, or automatically choose the type with the lower standard deviation when both types exhibit similar prediction means. A more rigorous usage of the uncertainty information is left as future work.
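The decision rule described above can be sketched as follows. The per-sample class probabilities are made-up numbers rather than model outputs, and the function names and the threshold of 0.15 are hypothetical choices for illustration only.

```python
import numpy as np

def predictive_mean_std(prob_samples):
    """Mean and standard deviation of class probabilities across model samples."""
    prob_samples = np.asarray(prob_samples, dtype=float)  # shape (S, n_classes)
    return prob_samples.mean(axis=0), prob_samples.std(axis=0)

def predict_with_flag(prob_samples, std_threshold=0.15):
    """Return the predicted class and whether it should be deferred to a human."""
    mean, std = predictive_mean_std(prob_samples)
    predicted = int(mean.argmax())
    return predicted, bool(std[predicted] > std_threshold)

# Model samples that agree: low uncertainty, no review needed.
confident = [[0.9, 0.05, 0.05]] * 4
# Model samples that disagree between the first two classes: high uncertainty.
ambiguous = [[0.8, 0.2, 0.0], [0.2, 0.8, 0.0]] * 2

pred_c, flag_c = predict_with_flag(confident)
pred_a, flag_a = predict_with_flag(ambiguous)
```

In the ambiguous case the two leading classes have equal mean probability, so the mean alone cannot separate them, while the large standard deviation signals that the prediction is unreliable.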

Conclusion
We propose a scalable Bayesian learning framework using SG-MCMC, to model weight uncertainty in recurrent neural networks. The learning framework is tested on several tasks, including language models, image caption generation and sentence classification. Our algorithm outperforms conventional stochastic optimization algorithms, indicating the importance of learning weight uncertainty in recurrent neural networks. Our algorithm requires little additional computational overhead in training, and multiple forward passes for model averaging in testing. Future work includes improving the testing efficiency for large-scale RNNs, via learning a single neural network that approximates the model averaging result (Korattikara et al., 2015).