Simple Recurrent Units for Highly Parallelizable Recurrence

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence and enable highly parallelized implementation, and it comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves a 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.


Introduction
Recurrent neural networks (RNN) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018). Key to many of these advancements are architectures of increased capacity and computation. For instance, the top-performing models for semantic role labeling and translation use eight recurrent layers, requiring days to train (He et al., 2017; Wu et al., 2016b). The scalability of these models has become an important problem that impedes NLP research.
The difficulty of scaling recurrent networks arises from the time dependence of state computation. In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step. This sequential dependency makes recurrent networks significantly slower than other operations, and limits their applicability. For example, recent translation models consist of non-recurrent components only, such as attention and convolution, to scale model training (Gehring et al., 2017; Vaswani et al., 2017).
In this work, we introduce the Simple Recurrent Unit (SRU), a unit with light recurrence that offers both high parallelization and sequence modeling capacity. The design of SRU is inspired by previous efforts, such as Quasi-RNN (QRNN; Bradbury et al., 2017) and Kernel NN (KNN; Lei et al., 2017), but enjoys additional benefits:
• SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. Figure 1 compares our architecture's runtimes to common architectures.
• SRU replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity, while using less computation (and fewer hyper-parameters).
• SRU improves the training of deep recurrent models by employing highway connections (Srivastava et al., 2015) and a parameter initialization scheme tailored for gradient propagation in deep architectures.
We evaluate SRU on a broad set of problems, including text classification, question answering, translation and character-level language modeling. Our experiments demonstrate that light recurrence is sufficient for various natural language tasks, offering a good trade-off between scalability and representational power. On classification and question answering datasets, SRU outperforms common recurrent and non-recurrent architectures, while achieving a 5-9x speed-up compared to cuDNN LSTM. Stacking additional layers further improves performance, while incurring relatively small costs owing to the cheap computation of a single layer. We also obtain an average improvement of 0.7 BLEU score on the English to German translation task by incorporating SRU into Transformer (Vaswani et al., 2017).

Related Work
Improving on common architectures for sequence processing has recently received significant attention (Greff et al., 2017; Balduzzi and Ghifary, 2016; Miao et al., 2016; Zoph and Le, 2016; Lee et al., 2017). One area of research involves incorporating word-level convolutions (i.e., n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function and achieves significant speed-up over LSTM. While Bradbury et al. (2017) focus on the speed advantages of the network, Lei et al. (2017) study the theoretical characteristics of such computation and possible extensions. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. This finding motivates the design of SRU for both high parallelization and representational power. SRU also relates to IRNN (Le et al., 2015), which uses an identity diagonal matrix to initialize hidden-to-hidden connections. SRU uses point-wise multiplication for hidden connections, which is equivalent to using a diagonal weight matrix. This can be seen as a constrained version of diagonal initialization.
Various strategies have been proposed to scale network training (Goyal et al., 2017) and to speed up recurrent networks (Diamos et al., 2016; Shazeer et al., 2017; Kuchaiev and Ginsburg, 2017). For instance, Diamos et al. (2016) utilize hardware infrastructures by stashing RNN parameters on cache (or fast memory). Shazeer et al. (2017) and Kuchaiev and Ginsburg (2017) improve the computation via conditional computing and matrix factorization respectively. Our implementation for SRU is inspired by the cuDNN-optimized LSTM (Appleyard et al., 2016), but enables more parallelism: while cuDNN LSTM requires six optimization steps, SRU achieves a more significant speed-up via two optimizations.
The design of recurrent networks, such as SRU and related architectures, raises questions about representational power and interpretability (Chen et al., 2018; Peng et al., 2018). Balduzzi and Ghifary (2016) apply type-preserving transformations to discuss the capacity of various simplified RNN architectures. Recent work (Anselmi et al., 2015; Daniely et al., 2016; Zhang et al., 2016; Lei et al., 2017) relates the capacity of neural networks to deep kernels. We empirically demonstrate that SRU can achieve compelling results by stacking multiple layers.

Simple Recurrent Unit
We present and explain the design of Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:
$$f_t = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \tag{1}$$
$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \tag{2}$$
$$r_t = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \tag{3}$$
$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t \tag{4}$$
where $W$, $W_f$ and $W_r$ are parameter matrices and $v_f$, $v_r$, $b_f$ and $b_r$ are parameter vectors to be learnt during training; $\sigma$ is the sigmoid function and $\odot$ denotes point-wise multiplication. The complete architecture decomposes to two sub-components: a light recurrence (Equations 1 and 2) and a highway network (Equations 3 and 4).
The light recurrence component successively reads the input vectors $x_t$ and computes the sequence of states $c_t$ capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN (Lee et al., 2017). Specifically, a forget gate $f_t$ controls the information flow (Equation 1) and the state vector $c_t$ is determined by adaptively averaging the previous state $c_{t-1}$ and the current observation $W x_t$ according to $f_t$ (Equation 2).
One key design decision that differs from previous gated recurrent architectures is the way $c_{t-1}$ is used in the sigmoid gate. Typically, $c_{t-1}$ is multiplied with a parameter matrix to compute $f_t$, e.g., $f_t = \sigma(W_f x_t + V_f c_{t-1} + b_f)$. However, the inclusion of $V_f c_{t-1}$ makes it difficult to parallelize the state computation: each dimension of $c_t$ and $f_t$ depends on all entries of $c_{t-1}$, and the computation has to wait until $c_{t-1}$ is fully computed. To facilitate parallelization, our light recurrence component uses a point-wise multiplication $v_f \odot c_{t-1}$ instead. With this simplification, each dimension of the state vectors becomes independent and hence parallelizable.
The highway network component (Srivastava et al., 2015) facilitates gradient-based training of deep networks. It uses the reset gate $r_t$ (Equation 3) to adaptively combine the input $x_t$ and the state $c_t$ produced from the light recurrence (Equation 4), where $(1 - r_t) \odot x_t$ is a skip connection that allows the gradient to directly propagate to the previous layer. Such connections have been shown to improve scalability (Wu et al., 2016a; Kim et al., 2016; He et al., 2016; Zilly et al., 2017).
The combination of the two components makes the overall architecture simple yet expressive, and easy to scale due to enhanced parallelization and gradient propagation.
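To make the two components concrete, the following is a minimal NumPy sketch of a single SRU layer processed one step at a time. It assumes the gates take the form described in the text (a forget gate and a reset gate that use point-wise products with the previous state, a state that averages the previous state and the projected input, and a highway output). The function name `sru_layer`, the argument order, and the requirement that input and hidden sizes are equal are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r, c0=None):
    """Naive step-by-step SRU layer (illustrative sketch).

    x: (L, d) input sequence; W, W_f, W_r: (d, d); v_*, b_*: (d,).
    Returns the outputs h and the internal states c, both of shape (L, d).
    """
    L, d = x.shape
    c_prev = np.zeros(d) if c0 is None else c0
    h = np.empty((L, d))
    c = np.empty((L, d))
    for t in range(L):
        # Light recurrence: forget gate, then adaptive average of state and input.
        f_t = sigmoid(W_f @ x[t] + v_f * c_prev + b_f)
        c_t = f_t * c_prev + (1.0 - f_t) * (W @ x[t])
        # Highway network: reset gate (uses the previous state), then skip connection.
        r_t = sigmoid(W_r @ x[t] + v_r * c_prev + b_r)
        h[t] = r_t * c_t + (1.0 - r_t) * x[t]
        c[t] = c_t
        c_prev = c_t
    return h, c
```

Because `v_f * c_prev` is an element-wise product, entry `j` of the state only ever reads entry `j` of the previous state, which is the independence property that the parallel implementation relies on.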

Parallelized Implementation
Despite the parallelization-friendly design of SRU, a naive implementation which computes Equations (1)-(4) for each step t sequentially would not achieve SRU's full potential. We employ two optimizations to enhance parallelism. The optimizations are performed in the context of GPU / CUDA programming, but the general idea can be applied to other parallel programming models.
We re-organize the computation of Equations (1)-(4) into two major steps. First, given the input sequence $\{x_1 \cdots x_L\}$, we batch the matrix multiplications across all time steps. This significantly improves the computation intensity (e.g., GPU utilization). The batched multiplication is:
$$U^\top = \begin{pmatrix} W \\ W_f \\ W_r \end{pmatrix} [x_1, x_2, \cdots, x_L],$$
where L is the sequence length, $U \in \mathbb{R}^{L \times 3d}$ is the computed matrix and d is the hidden state size. When the input is a mini-batch of B sequences, U would be a tensor of size (L, B, 3d).
The second step computes the remaining point-wise operations. Specifically, we compile all point-wise operations into a single fused CUDA kernel and parallelize the computation across each dimension of the hidden state. Algorithm 1 shows the pseudo code of the forward function. The complexity of this step is $O(L \cdot B \cdot d)$ per layer, where L is the sequence length and B is the batch size. In contrast, the complexity of LSTM is $O(L \cdot B \cdot d^2)$ because of the hidden-to-hidden multiplications (e.g., $V h_{t-1}$), and each dimension cannot be independently parallelized. The fused kernel also reduces overhead. Without it, operations such as sigmoid activation would each invoke a separate function call, adding kernel launching latency and more data moving costs.
The implementation of a bidirectional SRU is similar: the matrix multiplications of both directions are batched, and the fused kernel handles and parallelizes both directions at the same time.
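The two-step reorganization can be sketched in NumPy as follows. This is only an illustration of the data flow under stated assumptions: the function name `sru_forward_batched`, the stacking order of the three projections inside `U`, and the zero-based slicing are choices made here, and a Python loop stands in for the fused CUDA kernel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward_batched(x, W, W_f, W_r, v_f, v_r, b_f, b_r, c0):
    """Mini-batch SRU forward pass (illustrative sketch).

    x: (L, B, d); W, W_f, W_r: (d, d); v_*, b_*: (d,); c0: (B, d).
    Returns h of shape (L, B, d) and the final state of shape (B, d).
    """
    L, B, d = x.shape
    # Step 1: a single matrix multiplication covers all time steps
    # and all three projections, giving U of shape (L, B, 3d).
    W_all = np.concatenate([W, W_f, W_r], axis=0)      # (3d, d)
    U = (x.reshape(L * B, d) @ W_all.T).reshape(L, B, 3 * d)
    # Step 2: only point-wise work remains. It is sequential over l,
    # but every (batch, dimension) entry is independent of the others.
    h = np.empty_like(x)
    c = c0
    for l in range(L):
        u, u_f, u_r = U[l, :, :d], U[l, :, d:2 * d], U[l, :, 2 * d:]
        f = sigmoid(u_f + v_f * c + b_f)
        r = sigmoid(u_r + v_r * c + b_r)   # uses the previous state
        c = f * c + (1.0 - f) * u
        h[l] = r * c + (1.0 - r) * x[l]
    return h, c
```

In this sketch the loop body touches O(B · d) numbers per step, so the second stage costs O(L · B · d) in total, while all O(d²) work is confined to the one batched multiplication.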

Algorithm 1 Mini-batch version of the forward pass defined in Equations (1)-(4).
Indices: Sequence length L, mini-batch size B, hidden state dimension d.
Input: Input sequences batch x[l, i, j]; grouped matrix multiplication U[l, i, j′].

Proper parameter initialization can reduce gradient propagation difficulties and hence have a positive impact on the final performance. We now describe an initialization strategy tailored for SRU.
We start by adopting common initializations derived for feed-forward networks (Glorot and Bengio, 2010; He et al., 2015). The weights of parameter matrices are drawn with zero mean and 1/d variance, for instance, via the uniform distribution $[-\sqrt{3/d}, +\sqrt{3/d}]$. This ensures the output variance remains approximately the same as the input variance after the matrix multiplication.
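A quick numerical check of this claim, as a sketch: a uniform distribution on [−a, +a] has variance a²/3, so a = √(3/d) gives variance 1/d. The helper names `init_uniform` and `check_init_variance`, the Gaussian test input, and the sizes are assumptions chosen only for illustration.

```python
import numpy as np

def init_uniform(d_out, d_in, rng):
    """Draw a (d_out, d_in) matrix uniformly from [-sqrt(3/d_in), +sqrt(3/d_in)]."""
    bound = np.sqrt(3.0 / d_in)
    return rng.uniform(-bound, bound, size=(d_out, d_in))

def check_init_variance(d=512, n=2000, seed=0):
    """Return (d * Var[W], Var[Wx] / Var[x]); both should be close to 1."""
    rng = np.random.default_rng(seed)
    W = init_uniform(d, d, rng)
    x = rng.normal(size=(d, n))   # hypothetical zero-mean inputs
    y = W @ x
    return d * W.var(), y.var() / x.var()

scaled_weight_var, output_ratio = check_init_variance()
print(scaled_weight_var, output_ratio)
```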
However, the light recurrence and highway computation would still reduce the variance of hidden representations by a factor of 1/3 to 1/2:
$$\frac{1}{3} \le \frac{\mathrm{Var}[h_t]}{\mathrm{Var}[x_t]} < \frac{1}{2},$$
and the factor converges to 1/2 in deeper layers (see Appendix A). This implies the output $h_t$ and the gradient would vanish in deep models. To offset the problem, we introduce a scaling correction constant α in the highway connection
$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t \cdot \alpha,$$
where α is set to $\sqrt{3}$ such that $\mathrm{Var}[h_t] \approx \mathrm{Var}[x_t]$ at initialization. When the highway network is initialized with a non-zero bias $b_r = b$, the scaling constant α can be accordingly set as:
$$\alpha = \sqrt{1 + 2\exp(b)}.$$

For the SST dataset, we report average results of 5 runs. For other datasets, we perform 3 independent trials of 10-fold cross validation (3×10 runs). The last column compares the wall clock time (in seconds) to finish 100 epochs on the SST dataset.
… other architectures. We stack multiple layers of SRU to directly substitute other recurrent, convolutional or feed-forward modules. We minimize hyper-parameter tuning and architecture engineering for a fair comparison. Such efforts have a non-trivial impact on the results, but are beyond the scope of our experiments. Unless noted otherwise, the hyper-parameters are set identical to prior work.
Setup
We stack multiple SRU layers and use the last output state to predict the class label for a given sentence. We train for 100 epochs and use the validation (i.e., development) set to select the best training epoch. We perform 10-fold cross validation for datasets that do not have a standard train-evaluation split. The result on SST is averaged over five independent trials; we use the binary version of the SST dataset. We use Adam (Kingma and Ba, 2014) with the default learning rate 0.001, a weight decay of 0 and a hidden dimension of 128.
We compare SRU with a wide range of methods on these datasets, including various convolutional models (Kalchbrenner et al., 2014; Kim, 2014; Zhang and Wallace, 2017) and a hierarchical sentence model (Zhao et al., 2015) reported as the state of the art on these datasets (Conneau et al., 2017). Their setups are not exactly the same as ours, and may involve more tuning on word embeddings and other regularizations. We use the setup of Kim (2014) but, for simplicity, do not fine-tune the word embeddings or the learning method. In addition, we directly compare against three baselines trained using our code base: a re-implementation of the CNN model of Kim (2014), a two-layer LSTM model and Quasi-RNN (Bradbury et al., 2017). We use the official implementation of Quasi-RNN and also implement a version with highway connection for a fair comparison. These baselines are trained using the same hyper-parameter configuration as SRU.
We use the open source implementation of Document Reader in our experiments. We train models for up to 100 epochs, with a batch size of 32 and a hidden dimension of 128. Following the authors' suggestions, we use the Adamax optimizer (Kingma and Ba, 2014) and variational dropout (Gal and Ghahramani, 2016) during training. We compare with two alternative recurrent components: the bidirectional LSTM adopted in the original implementation of Chen et al. (2017) and Quasi-RNN with highway connections for improved performance.

Results
Table 2 summarizes the results on SQuAD. SRU achieves 71.4% exact match and 80.2% F1 score, outperforming the bidirectional LSTM model by 1.9% (EM) and 1.4% (F1) respectively. SRU also exhibits over 5x speed-up over LSTM and 53-63% reduction in total training time. In comparison with QRNN, SRU obtains 0.8% improvement on exact match and 0.6% on F1 score, and runs 60% faster. This speed improvement highlights the impact of the fused kernel (Algorithm 1). While the QRNN baseline involves a similar amount of computation, assembling all element-wise operations of both directions in SRU achieves better GPU utilization.
… as character-level embeddings, which are not directly comparable to the setup of Chen et al. (2017). However, these models can potentially benefit from SRU since RNNs are incorporated in the model architecture.

Machine Translation
Dataset
We train translation models on the WMT English→German dataset, a standard benchmark for translation systems (Peitz et al., 2014; Li et al., 2014; Jean et al., 2015). The dataset consists of 4.5 million sentence pairs. We obtain the pre-tokenized dataset from the OpenNMT project (Klein et al., 2017). The sentences were tokenized using the word-piece model (Wu et al., 2016b), which generates a shared vocabulary of about 32,000 tokens. Newstest-2014 and newstest-2017 are provided and used as the validation and test sets.
Setup
We use the state-of-the-art Transformer model of Vaswani et al. (2017) as our base architecture. In the base model, a single Transformer consists of a multi-head attention layer and a bottleneck feed-forward layer. We substitute the feed-forward network using our SRU implementation:
$$\text{base:} \quad W \cdot \mathrm{ReLU\_layer}(x) + b$$
$$\text{ours:} \quad W \cdot \mathrm{SRU\_layer}(x) + b$$
The intuition is that SRU can better capture sequential information as a recurrent network, and potentially achieve better performance while requiring fewer layers. We keep the model configuration the same as Vaswani et al. (2017): the model dimension is $d_{model} = 512$, the feed-forward and SRU layers have inner dimensionality $d_{ff} = d_{sru} = 2048$, and positional encoding (Gehring et al., 2017) is applied on the input word embeddings. The base model without SRU has 6 layers, while we set the number of layers to 4 and 5 when SRU is added. Following the original setup, we use a dropout probability of 0.1 for all components, except the SRU in the 5-layer model, for which we use a dropout of 0.2 as we observe stronger over-fitting in training.
We use a single NVIDIA Tesla V100 GPU for each model. The published results were obtained using 8 GPUs in parallel, which provide a large effective batch size during training. To approximate the setup, we update the model parameters every 5×5120 tokens and use 16,000 warm-up steps following OpenNMT suggestions. We train each model for 40 epochs (250,000 steps), and perform 3 independent trials for each model configuration. A single run takes about 3.5 days with a Tesla V100 GPU.

Results
Table 3 shows the translation results. When SRU is incorporated into the architecture, both the 4-layer and 5-layer models outperform the Transformer base model. For instance, our 5-layer model obtains an average improvement of 0.7 test BLEU score and an improvement of 0.5 BLEU score by comparing the best results of each model achieved across three runs. SRU also exhibits more stable performance, with smaller variance over 3 runs. Figure 4 further compares the validation accuracy of different models. These results confirm that SRU is better at sequence modeling compared to the original feed-forward network (FFN), requiring fewer layers to achieve similar accuracy.
We perform 3 independent runs for each configuration. We select the best epoch based on the valid BLEU score for each run, and report the average results and the standard deviation over 3 runs. In addition, we experiment with averaging model checkpoints and use the averaged version for evaluation, following Vaswani et al. (2017). We show the best BLEU results achieved in brackets.
We compare various recurrent models and use a parameter budget similar to previous methods. In addition, we experiment with the factorization trick (Kuchaiev and Ginsburg, 2017) to reduce the total number of parameters without decreasing the performance. See details in Appendix B.
Results
Table 4 presents the results of SRU and other recurrent models. The 8-layer SRU model achieves validation and test bits per character (BPC) of 1.21, outperforming previous best reported results of LSTM, QRNN and recurrent highway networks (RHN). Increasing the number of SRU layers to 12 and using a longer context of 256 characters in training further improves the BPC to 1.19.

Ablation Analysis
We perform ablation analyses on SRU by successively disabling different components: (1) Remove the point-wise multiplication term $v \odot c_{t-1}$ in the forget and reset gates. The resulting variant involves less recurrence and has less representational capacity.
We train model variants on the classification and question answering datasets. Table 5 and Figure 5 confirm the impact of our design decisions: removing these components results in worse classification accuracies and exact match scores.

Discussion
This work presents Simple Recurrent Unit (SRU), a scalable recurrent architecture that operates as fast as feed-forward and convolutional units. We confirm the effectiveness of SRU on multiple natural language tasks ranging from classification to translation. We open source our implementation to facilitate future NLP and deep learning research.
Trading capacity with layers
SRU achieves high parallelization by simplifying the hidden-to-hidden dependency. This simplification is likely to reduce the representational power of a single layer and hence should be balanced to avoid performance loss. However, unlike previous work that suggests additional computation (e.g., n-gram filters) within the layer (Balduzzi and Ghifary, 2016; Bradbury et al., 2017), we argue that increasing the depth of the model suffices to retain modeling capacity. Our empirical results on various tasks confirm this hypothesis.

In practice, multiple SRU layers are stacked to construct a deep network. The internal states $c_t$ and $h_t$ would be a weighted combination of inputs $\{x_1 \cdots x_t\}$, which will increase the correlation of the state vectors at different steps. These state vectors are again fed into the next layer, and keep increasing the correlation. As a result, we expect the actual ratio between the variance of $c_t$ and that of the input of the current layer $x_t$ lies between the two derived values, and would finally converge to the upper bound value of 1.

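A.2 Computing Var[h_t]
Given the result in Equation (5), we proceed to compute $\mathrm{Var}[h_t]$. The i-th entry of $h_t$ is similarly computed as
$$h_{t,i} = r_{t,i} \cdot c_{t,i} + (1 - r_{t,i}) \cdot x_{t,i}.$$
The highway reset gate is not necessarily initialized with a zero bias. Let the initial bias be $b$ and $u = w_{r,i}^\top x_t + v_{r,i} c_{t-1,i}$ denote the rest of terms in the sigmoid function. We have $\mathrm{E}[u] = 0$ and $\mathrm{Var}[u] \ll 1$ because $x_t$ and $c_{t-1}$ have small variance.
We approximate the value of $r_{t,i}$ using its Taylor expansion at $u = 0$:
$$r_{t,i} = \frac{1}{1 + e^{-(u+b)}} = \frac{e^b}{e^b + 1} + \frac{e^b}{(e^b + 1)^2}\, u + O(u^2).$$
We can ignore the term with $u^2$ since $\mathrm{Var}[u] \ll 1$, which gives us
$$\mathrm{E}[r_{t,i}] \approx \frac{e^b}{e^b + 1}, \qquad \mathrm{E}[r_{t,i}^2] \approx \frac{e^{2b}}{(e^b + 1)^2}, \qquad \mathrm{E}[(1 - r_{t,i})^2] \approx \frac{1}{(e^b + 1)^2}.$$
Substituting this result in $\mathrm{Var}[h_{t,i}]$ gives
$$\mathrm{Var}[h_{t,i}] \approx \frac{e^{2b}\, \mathrm{Var}[c_{t,i}] + \mathrm{Var}[x_{t,i}]}{(e^b + 1)^2}. \tag{6}$$
Since from (5) we have $\mathrm{Var}[x_t]/3 \le \mathrm{Var}[c_t] < \mathrm{Var}[x_t]$,
$$\frac{e^{2b} + 3}{3\,(e^b + 1)^2} \le \frac{\mathrm{Var}[h_t]}{\mathrm{Var}[x_t]} < \frac{e^{2b} + 1}{(e^b + 1)^2},$$
which is equivalent to
$$\frac{1}{3} \le \frac{\mathrm{Var}[h_t]}{\mathrm{Var}[x_t]} < \frac{1}{2}$$
when $b = 0$.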

A.3 Computing the Scaling Constant α
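Finally, we compute the scaling constant α (Section 3.2). Using the result in Equation (6), when α is introduced we get:
$$\mathrm{Var}[h_t] \approx \frac{e^{2b}\, \mathrm{Var}[c_t] + \alpha^2\, \mathrm{Var}[x_t]}{(e^b + 1)^2} \approx \frac{e^{2b} + \alpha^2}{(e^b + 1)^2}\, \mathrm{Var}[x_t],$$
as $\mathrm{Var}[c] \to \mathrm{Var}[x]$ according to Equation (5) and the empirical evaluation (Figure 6). This implies $e^{2b} + \alpha^2 = (e^b + 1)^2$ and hence $\alpha = \sqrt{1 + 2e^b}$.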
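As a sanity check on the scaling constant, the sketch below evaluates it numerically. It assumes the closed form α = √(1 + 2·exp(b)), which reduces to the value √3 quoted in Section 3.2 when b = 0, and it assumes the approximations used in this appendix (a reset gate close to sigmoid(b) and Var[c] = Var[x]). The function names are hypothetical.

```python
import math

def highway_scaling(b=0.0):
    """Scaling constant alpha for a highway bias b (assumed closed form)."""
    return math.sqrt(1.0 + 2.0 * math.exp(b))

def predicted_variance_ratio(b, alpha):
    """Approximate Var[h] / Var[x], assuming r = sigmoid(b) and Var[c] = Var[x]."""
    r = 1.0 / (1.0 + math.exp(-b))
    return r ** 2 + ((1.0 - r) * alpha) ** 2

for b in (-2.0, 0.0, 1.5):
    alpha = highway_scaling(b)
    print(b, alpha, predicted_variance_ratio(b, alpha))
```

Under these assumptions the predicted ratio equals 1 for any bias when α is chosen this way, whereas α = 1 with b = 0 gives the factor 1/2 discussed above.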

B Experimental Details
We include additional experimental setup and results in this section.

B.1 Classification
The data and pre-processing code are obtained from the code repository of Harvard NLP. We use a batch size of 32 and a dropout probability of 0.5 for all models. In addition, we increase the dropout to 0.55 or 0.6 for the 8-layer SRU model. Following the implementation of Kim (2014), out-of-vocabulary words that are not in the pre-trained embeddings are initialized with random vectors with values from [−0.25, 0.25].

B.2 Question Answering
We use a word embedding dropout of 0.5 and a recurrent dropout of 0.2. In the setup of Chen et al. (2017), the bi-LSTM model concatenates the output of each layer and feeds it to subsequent layers. This helps the gradient propagation and improves the final performance. With highway connection, this is no longer necessary. In SRU and QRNN (with highway), only the output of the last layer is given to subsequent layers.

B.3 Machine Translation
We use the OpenNMT PyTorch implementation for the translation experiments. Table 6 shows the list of configuration options used for training. For evaluation, we use beam size 5 and length penalty 0.6.
Table 7 shows the averaged BLEU score of each model from the 20th to the 40th epoch. The improvement over the Transformer base model is consistent across different epochs.
Figure 7 plots the training and validation perplexity of three models. With a higher dropout (0.2) used for the SRU, the 5-layer model gets consistently lower validation perplexity than the base model and the 4-layer model. We also see that models with SRU exhibit much faster training progress with much lower training perplexity, suggesting the models could be tuned better with further training regularization.

B.4 Character-level Language Modeling
We train all models using a weight decay of $10^{-7}$ and a gradient clipping of 0.3. We set the learning rate factor of Noam scheduling to 3 and the warmup steps to 32,000. We tune the dropout probability from {0.2, 0.3}.
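For readers unfamiliar with Noam scheduling, the sketch below shows the schedule in the form given by Vaswani et al. (2017), scaled by a constant factor. The factor and warmup values are the ones stated above; the function name `noam_lr` and the model dimension passed in the example are assumptions, since the model dimension used for this experiment is not stated here.

```python
def noam_lr(step, d_model, factor=3.0, warmup=32000):
    """Learning rate at a given update step under Noam scheduling (sketch).

    The rate grows linearly for `warmup` steps, then decays as 1/sqrt(step).
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Hypothetical model dimension, for illustration only.
for s in (1000, 32000, 128000):
    print(s, noam_lr(s, d_model=512))
```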
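The projection (bottleneck) trick is implemented as follows. Recall that the batched multiplication of SRU is computed as
$$U^\top = \begin{pmatrix} W \\ W_f \\ W_r \end{pmatrix} [x_1, x_2, \cdots, x_L].$$
The stacked parameter matrices on the left are re-parameterized by a low-rank factorization,
$$\begin{pmatrix} W \\ W_f \\ W_r \end{pmatrix} = P\, Q^\top,$$
where $Q \in \mathbb{R}^{d_{in} \times d'}$ and $P \in \mathbb{R}^{3 d_{out} \times d'}$ are two new parameter matrices to be learned, and $d'$ is the projection dimension that is much smaller than the input and output dimension of the SRU.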
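The effect of the factorization on parameter count and on the batched multiplication can be sketched as follows. The shapes follow the description above (Q maps the input to the projection dimension, P maps the projection to the three stacked outputs); the function names and the example sizes are hypothetical.

```python
import numpy as np

def projected_batched_matmul(x, P, Q):
    """Compute U = x Q P^T without forming the full stacked matrix.

    x: (L, d_in); Q: (d_in, d_proj); P: (3 * d_out, d_proj).
    Returns U of shape (L, 3 * d_out).
    """
    return (x @ Q) @ P.T

def parameter_counts(d_in, d_out, d_proj):
    """Return (full, factorized) parameter counts of the stacked projection."""
    full = 3 * d_out * d_in
    factorized = d_in * d_proj + 3 * d_out * d_proj
    return full, factorized

# Hypothetical sizes, for illustration only.
print(parameter_counts(d_in=2048, d_out=2048, d_proj=512))
```

Multiplying by Q first keeps the intermediate result at the projection dimension, so both the parameter count and the multiplication cost shrink when the projection dimension is small relative to the input and output sizes.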
Figure 1: Average processing time in milliseconds of a batch of 32 samples using cuDNN LSTM, word-level convolution conv2d (with filter width k = 2 and k = 3), and the proposed SRU. We vary the number of tokens per sequence (l) and feature dimension (d).

Figure 2
Figure 2 compares the training progress with and without the scaling correction.See Appendix A for the derivation and more discussion.

Figure 4 :
Figure 4: Mean validation accuracy (y-axis) of different translation models after each training epoch (x-axis).

Figure 5 :
Figure 5: Ablation analysis on the classification datasets. Average validation results are presented. We compare the full SRU implementation (left blue), the variant without $v \odot c_{t-1}$ multiplication (middle green) and the variant without highway connection (right yellow).

Figure 6 :
Figure 6: Empirical estimation of the variance ratio $\mathrm{Var}[c_t]/\mathrm{Var}[x_t]$ at each layer in a randomly initialized SRU model. We use the pre-trained word2vec embeddings as input, resulting in an initial ratio slightly higher than 1/3. As expected, the ratio increases to 1 in deep layers.
Figure 6 confirms our expectation by computing the empirical value of Var[c]/Var[x] in deep SRU networks.
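A small simulation in the spirit of this empirical check can be sketched as follows. It is not the authors' experiment: it assumes i.i.d. Gaussian inputs with small variance instead of word2vec embeddings, zero biases, no scaling correction, uniform initialization for the vectors as well as the matrices, and the gate equations as described in the main text. The function name and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def variance_ratios(num_layers=6, L=64, B=16, d=128, input_std=0.1, seed=0):
    """Estimate Var[c] / Var[x] for each layer of a randomly initialized SRU stack."""
    rng = np.random.default_rng(seed)
    bound = np.sqrt(3.0 / d)
    x = rng.normal(scale=input_std, size=(L, B, d))
    ratios = []
    for _ in range(num_layers):
        W, W_f, W_r = (rng.uniform(-bound, bound, size=(d, d)) for _ in range(3))
        v_f, v_r = (rng.uniform(-bound, bound, size=d) for _ in range(2))
        U, U_f, U_r = x @ W.T, x @ W_f.T, x @ W_r.T   # each (L, B, d)
        c = np.zeros((B, d))
        cs = np.empty_like(x)
        hs = np.empty_like(x)
        for t in range(L):
            f = sigmoid(U_f[t] + v_f * c)
            r = sigmoid(U_r[t] + v_r * c)
            c = f * c + (1.0 - f) * U[t]
            cs[t] = c
            hs[t] = r * c + (1.0 - r) * x[t]
        ratios.append(cs.var() / x.var())
        x = hs   # the output of this layer feeds the next one
    return ratios

print(variance_ratios())
```

With uncorrelated inputs the analysis predicts a first-layer ratio near 1/3; the printed values for deeper layers show how the ratio evolves once the inputs to a layer are correlated across time steps.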

Figure 7 :
Figure 7: Training and validation perplexity curves of the base model and two SRU models.

Table 1 :
Test accuracies on classification benchmarks (Section 4.1).The first block presents best reported results of various methods.The second block compares SRU and other baselines given the same setup.
Table 1 compares the test results on the six benchmarks. We select the best number reported in previous methods when multiple model variants were explored in their experiments. Despite our simple setup, SRU outperforms most previous methods and achieves comparable results compared to the state-of-the-art but more sophisticated model of Zhao et al. (2015).
Figure 3: Mean validation accuracies (y-axis) and standard deviations of the CNN, 2-layer LSTM and 2-layer SRU models. We plot the curves of the first 100 epochs. X-axis is the training time used (in seconds). Timings are performed on NVIDIA GeForce GTX 1070 GPU, Intel Core i7-7700K Processor and cuDNN 7003.

Table 2 :
Exact match (EM) and F1 scores of various models on SQuAD (Section 4.2).We also report the total processing time per epoch and the time spent in RNN computations.SRU outperforms other models, and is more than five times faster than cuDNN LSTM.

Table 3 :
English→German translation results (Section 4.3).
Finally, adding SRU does not affect the parallelization or speed of Transformer: the 4-layer model exhibits 10% speed improvement.

Table 4 :
Validation and test BPCs of different recurrent models on Enwik8 dataset.The last column presents the training time per epoch.For SRU with projection, we set the projection dimension to 512.

Table 5 :
Ablation analysis on SQuAD.Components are successively removed and the EM scores are averaged over 4 runs.

Table 7 :
Average BLEU scores after each epoch.