A Lightweight Recurrent Network for Sequence Modeling

Recurrent networks have achieved great success on various sequential tasks with the assistance of complex recurrent units, but suffer from severe computational inefficiency due to weak parallelization. One direction to alleviate this issue is to shift heavy computations outside the recurrence. In this paper, we propose a lightweight recurrent network, or LRN. LRN uses input and forget gates to handle long-range dependencies as well as gradient vanishing and explosion, with all parameter related calculations factored outside the recurrence. The recurrence in LRN only manipulates the weight assigned to each token, tightly connecting LRN with self-attention networks. We apply LRN as a drop-in replacement of existing recurrent units in several neural sequential models. Extensive experiments on six NLP tasks show that LRN yields the best running efficiency with little or no loss in model performance.


Introduction
Various natural language processing (NLP) tasks can be categorized as sequence modeling tasks, where recurrent networks (RNNs) are widely applied and contribute greatly to state-of-the-art neural systems (Yang et al., 2018; Peters et al., 2018; Zhang et al., 2018; Chen et al., 2018; Kim et al., 2019). To avoid the optimization bottleneck caused by gradient vanishing and/or explosion (Bengio et al., 1994), Hochreiter and Schmidhuber (1997) and Cho et al. (2014) develop gate structures to ease information propagation from distant words to the current position. Nevertheless, integrating these traditional gates inevitably increases computational overhead which is accumulated along token positions due to the sequential nature of RNNs. As a result, the weak parallelization of RNNs makes the benefits from improved model capacity expensive in terms of computational efficiency.
Recent studies introduce different solutions to this issue. Zhang et al. (2018) introduce the addition-subtraction twin-gated recurrent unit (ATR), which reduces the amount of matrix operations by developing a parameter-shared twin-gate mechanism. Lei et al. (2018) introduce the simple recurrent unit (SRU), which improves model parallelization by moving matrix computations outside the recurrence. Nevertheless, both ATR and SRU perform affine transformations of the previous hidden state for gates, though SRU employs a vector parameter rather than a matrix parameter. In addition, SRU heavily relies on its highway component, without which the recurrent component itself suffers from weak capacity and generalization (Lei et al., 2018).
In this paper, we propose a lightweight recurrent network (LRN), which combines the strengths of ATR and SRU. The structure of LRN is simple: an input gate and a forget gate are applied to weight the current input and previous hidden state, respectively. LRN has fewer parameters than SRU and, compared to ATR, moves heavy calculations outside the recurrence, generating gates based on the previous hidden state without any affine transformation. In this way, computation inside each recurrent step is minimized, allowing better parallelization and higher speed.
The gate structure endows LRN with the capability of memorizing distant tokens as well as handling the gradient vanishing and explosion issue. This ensures LRN's expressiveness and performance on downstream tasks. In addition, decomposing its recurrent structure uncovers the correlation of the input/forget gate with key/query in self-attention networks (Vaswani et al., 2017), where these two gates together manipulate the weight assigned to each token. We also reveal how LRN manages long-term and short-term memories with the decomposition.
arXiv:1905.13324v1 [cs.CL] 30 May 2019
We carry out extensive experiments on six NLP tasks, ranging from natural language inference, document classification, machine translation, question answering and named entity recognition to language modeling. We use LRN as a drop-in replacement of existing recurrent units in different neural models without any other modification of model structure. Experimental results show that LRN outperforms SRU by 10%∼20% in terms of running speed, and is competitive with respect to performance and generalization compared against all existing recurrent units.

Related Work
Past decades have witnessed the rapid development of RNNs since the Elman structure was proposed (Elman, 1990). Bengio et al. (1994) point out that the gradient vanishing and explosion issue impedes the optimization and performance of RNNs. To handle this problem, Hochreiter and Schmidhuber (1997) develop LSTM where information and gradient from distant tokens can successfully pass through the current token via a gate structure and a memory cell. Unfortunately, the enhanced expressivity via complex gates comes at the cost of sacrificing computational efficiency, which becomes more severe when datasets are scaled up. Simplifying computation but keeping model capacity in RNNs raises a new challenge.
One direction is to remove redundant structures in LSTM. Cho et al. (2014) remove the memory cell and introduce the gated recurrent unit (GRU) with only two gates. Lee et al. (2017) introduce an additive structure to generate hidden representations with linearly transformed inputs directly, though we empirically observe that non-linear activations can stabilize model training. Zhang et al. (2018) propose a twin-gate mechanism where input and forget gates are simultaneously produced from the same variables. We extend this mechanism by removing the affine transformation of previous hidden states.
Another direction is to shift recurrent matrix multiplications outside the recurrence so as to improve the parallelization of RNNs. Bradbury et al. (2016) propose the quasi-recurrent network (QRNN). QRNN factors all matrix multiplications out of the recurrence and employs a convolutional network to capture local input patterns. A minimal recurrent pooling function is used in parallel across different channels to handle global input patterns. Lei et al. (2017) apply the kernel method to simplify recurrence and show improved model capacity with deep stacked RNNs. This idea is extended to SRU (Lei et al., 2018) where a minimal recurrent component is strengthened via an external highway layer. The proposed LRN falls into this category with the advantage over SRU of the non-dependence on the highway component.
Orthogonal to the above work, recent studies also show the potential of accelerating matrix computation with low-level optimization. Diamos et al. (2016) emphasize persistent computational kernels to exploit the GPU's inverted memory hierarchy for reusing/caching purposes. Appleyard et al. (2016) upgrade NVIDIA's cuDNN implementation through exposing parallelism between operations within the recurrence. Kuchaiev and Ginsburg (2017) reduce the number of model parameters by factorizing or partitioning LSTM matrices. In general, all these techniques can be applied to any recurrent unit to reduce computational overhead.
Our work is closely related to ATR and SRU. Although recent work shows that novel recurrent units derived from weighted finite state automata are effective without the hidden-to-hidden connection (Balduzzi and Ghifary, 2016; Peng et al., 2018), we empirically observe that including previous hidden states for gates is crucial for model capacity, which also resonates with the evolution of SRU. Unlike ATR and SRU, however, we demonstrate that the affine transformation on the previous hidden state for gates is unnecessary. In addition, our model has a strong connection with self-attention networks.

Lightweight Recurrent Network
Given a sequence of input $X = [x_1; x_2; \ldots; x_n] \in \mathbb{R}^{n \times d}$ with length $n$, LRN operates as follows:
$$Q, K, V = X W_q,\; X W_k,\; X W_v \tag{1}$$
$$i_t = \sigma(k_t + h_{t-1}) \tag{2}$$
$$f_t = \sigma(q_t - h_{t-1}) \tag{3}$$
$$h_t = g(i_t \odot v_t + f_t \odot h_{t-1}) \tag{4}$$
where $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are model parameters and $g(\cdot)$ is an activation function, such as identity or tanh. $\odot$ and $\sigma(\cdot)$ indicate element-wise multiplication and the sigmoid activation function, respectively. $q_t$, $k_t$ and $v_t$ correspond to the t-th row of the projected sequence representations $Q$, $K$ and $V$. We use the terms q, k and v to denote the implicit correspondence to query, key and value in self-attention networks, which is elaborated in the next section.
As shown in Eq. (1), all matrix-related operations are shifted outside the recurrence and can be pre-calculated, thereby reducing the complexity of the recurrent computation from $O(d^2)$ to $O(d)$ and easing model parallelization. The design of the input gate $i_t$ and forget gate $f_t$ is inspired by the twin-gate mechanism in ATR (Zhang et al., 2018). Unlike ATR, however, we eschew the affine transformation on the previous hidden state. By doing so, the previous hidden state directly offers a positive contribution to the input gate but a negative one to the forget gate, ensuring an inverse correlation between these two gates.
The current hidden state $h_t$ is a weighted average of the current input and the previous hidden state followed by an element-wise activation. When the identity function is employed, our model shows analogous properties to ATR. However, we empirically observe that this leads to gradually increased hidden representation values, resulting in optimization instability. Unlike SRU, which controls stability through a particularly designed scaling term, we replace the identity function with the tanh function, which is simple but effective.
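The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the gates take the form $i_t = \sigma(k_t + h_{t-1})$ and $f_t = \sigma(q_t - h_{t-1})$, a zero initial state, and random weights; the helper names (`project`, `lrn_forward`, `random_matrix`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def project(X, W):
    """Multiply every row of X (n x d) by W (d x d), i.e. compute X @ W."""
    return [[sum(x[a] * W[a][b] for a in range(len(x))) for b in range(len(W[0]))]
            for x in X]


def lrn_forward(X, Wq, Wk, Wv, g=math.tanh):
    """Run one LRN layer over X and return the hidden state at every position."""
    # All matrix products are computed once, outside the recurrence.
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    d = len(Wq[0])
    h = [0.0] * d  # assumption: zero initial hidden state
    states = []
    for q, k, v in zip(Q, K, V):
        # Inside the loop everything is element-wise, so each step costs O(d).
        i = [sigmoid(k[j] + h[j]) for j in range(d)]  # input gate
        f = [sigmoid(q[j] - h[j]) for j in range(d)]  # forget gate
        h = [g(i[j] * v[j] + f[j] * h[j]) for j in range(d)]
        states.append(h)
    return states


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d = 5, 4
X = random_matrix(n, d, rng, scale=1.0)
Wq, Wk, Wv = (random_matrix(d, d, rng) for _ in range(3))
H = lrn_forward(X, Wq, Wk, Wv)
```

In a real implementation the three projections would be a single batched matrix multiplication over the whole sequence, which is where the parallelization benefit comes from; only the cheap element-wise loop remains sequential.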

Structure Decomposition
In this section, we show an in-depth analysis of LRN by decomposing the recurrent structure. With an identity activation, the t-th hidden state can be expanded as follows:
$$h_t = \sum_{k=1}^{t} i_k \odot \Big(\prod_{l=1}^{t-k} f_{k+l}\Big) \odot v_k \tag{5}$$
where the representation of the current token is composed of all previous tokens with their contribution distinguished by both input and forget gates.
Relation with self-attention network. After grouping these gates, we observe that:
$$h_t = \sum_{k=1}^{t} \underbrace{i_k \odot f_{k+1} \odot \cdots \odot f_t}_{\text{weight}} \odot v_k \tag{6}$$
Each weight can be regarded as a query from the current token $f_t$ to the k-th input token $i_k$. This query chain can be decomposed into two parts: a key represented by $i_k$ and a query represented by $\prod_{l=1}^{t-k} f_{k+l}$. The former is modulated through the weight matrix $W_k$, and tightly associated with the corresponding input token. Information carried by the key remains intact during the evolution of time step $t$. In contrast, the latter, induced by the weight matrix $W_q$, highly depends on the position and length of this chain, which dynamically changes between different token pairs.
The weights generated by keys and queries are assigned to values represented by $v_k$ and manipulated by the weight matrix $W_v$. Compared with self-attention networks, LRN shows analogous weight parameters and model structure. The difference is that weights in self-attention networks are normalized across all input tokens. Instead, weights in LRN are unidirectional, unnormalized and spanned over all channels.
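The decomposition can be checked numerically on a single channel. The sketch below assumes an identity activation, a zero initial state, and gates of the form $i_t = \sigma(k_t + h_{t-1})$, $f_t = \sigma(q_t - h_{t-1})$; the projected values `q`, `k`, `v` are made-up numbers, and the function names are hypothetical. It records the gates while running the recurrence, rebuilds the last hidden state as a weighted sum of the values, and confirms that both routes agree.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lrn_identity_channel(q, k, v):
    """Scalar LRN channel with identity activation; returns states and gates."""
    h, states, i_gates, f_gates = 0.0, [], [], []
    for qt, kt, vt in zip(q, k, v):
        i = sigmoid(kt + h)
        f = sigmoid(qt - h)
        h = i * vt + f * h
        states.append(h)
        i_gates.append(i)
        f_gates.append(f)
    return states, i_gates, f_gates


def token_weights(i_gates, f_gates, t):
    """Weight of every token k <= t in h_t: i_k times all later forget gates."""
    weights = []
    for k in range(t + 1):
        w = i_gates[k]
        for later in range(k + 1, t + 1):
            w *= f_gates[later]
        weights.append(w)
    return weights


# Made-up projected inputs for one channel (0-based positions).
q = [0.3, -1.2, 0.8, 0.1]
k = [0.5, 0.0, -0.7, 1.1]
v = [1.0, -0.5, 0.25, 2.0]

states, i_gates, f_gates = lrn_identity_channel(q, k, v)
weights = token_weights(i_gates, f_gates, t=len(v) - 1)
reconstructed = sum(w * vk for w, vk in zip(weights, v))
```

Each weight lies strictly between 0 and 1, but the weights are not forced to sum to one, which is the unnormalized property noted above.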
Memory in LRN. Alternatively, we can view the gating mechanism in LRN as a memory that gradually forgets information.
Given the value representation at the k-th time step $v_k$, the information delivered to a later time step $t$ ($k < t$) in LRN is as follows:
$$\underbrace{i_k \odot v_k}_{\text{short-term memory}} \odot \underbrace{\prod_{l=1}^{t-k} f_{k+l}}_{\text{forget chain}} \tag{7}$$
The input gate $i_k$ indicates the moment that LRN first accesses the input token $x_k$, whose value reflects the amount of information or knowledge allowed from this token. A larger input gate corresponds to a stronger input signal, and thereby a larger chance of activating short-term memory. This information is then delivered through a forget chain where memory is gradually decayed by a forget gate at each time step. The degree of memory decay is dynamically controlled by the input sequence itself. When a new incoming token is more informative, the degree of forgetting would increase so that previous knowledge is erased to make way for new knowledge in the memory. By contrast, meaningless tokens would simply be ignored.
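As a small worked example of this decay, consider one channel with hypothetical gate values (chosen only for illustration, not taken from any trained model): an input gate of 0.8 and a constant forget gate over the following ten steps.

```latex
% Hypothetical gate values, for illustration only (requires amsmath).
\[
i_k \prod_{l=1}^{t-k} f_{k+l}
\quad\text{with } i_k = 0.8,\; t - k = 10:
\qquad
\begin{cases}
0.8 \times 0.9^{10} \approx 0.279, & f_{k+l} = 0.9 \text{ (slow decay)}\\[2pt]
0.8 \times 0.5^{10} \approx 7.8 \times 10^{-4}, & f_{k+l} = 0.5 \text{ (fast decay)}
\end{cases}
\]
```

Under these assumed values, a token whose later forget gates stay near 0.9 still carries about 28% of its value ten steps on, whereas forget gates of 0.5 erase it almost completely.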

Gradient Analysis
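For a vanilla RNN with hidden state $h_t = g(W x_t + U h_{t-1})$ (8), the gradient back-propagated from the t-th step heavily depends on the following one-step derivation:
$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\big(g'(W x_t + U h_{t-1})\big)\, U \tag{9}$$
Due to the chain rule, the recurrent weight matrix $U$ will be repeatedly multiplied along the sequence length. Gradient vanishing/explosion results from a weight matrix with small/large norm (Pascanu et al., 2013).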
In LRN, however, the recurrent weight matrix is removed. The current hidden state is generated by directly weighting the current input and the previous hidden state. The one-step derivation of Eqs. (2)-(4) is as follows:
$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\big(g' \odot A\big), \qquad A = \sigma'_i \odot v_t - \sigma'_f \odot h_{t-1} + f_t \tag{10}$$
where $\sigma'_i$ and $\sigma'_f$ denote the derivatives of the sigmoid in Eq. (2) and Eq. (3), respectively. The difference between Eq. (9) and Eq. (10) is that the recurrent weight matrix is substituted by a more expressive component denoted as $A$ in Eq. (10). Unlike the weight matrix $U$, the norm of $A$ is input-dependent and varies dynamically along different positions. The dependence on inputs provides LRN with the capability of avoiding gradient vanishing/explosion.
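The one-step derivative can be sanity-checked against finite differences on a single channel. The sketch assumes tanh as the activation, gates of the form $i_t = \sigma(k_t + h_{t-1})$ and $f_t = \sigma(q_t - h_{t-1})$, and arbitrary test values; `lrn_step` and `lrn_step_grad` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lrn_step(h_prev, q, k, v):
    """One scalar LRN step with tanh activation."""
    i = sigmoid(k + h_prev)
    f = sigmoid(q - h_prev)
    return math.tanh(i * v + f * h_prev)


def lrn_step_grad(h_prev, q, k, v):
    """Analytic d h_t / d h_{t-1} = g' * A with A = s_i' * v - s_f' * h_prev + f."""
    i = sigmoid(k + h_prev)
    f = sigmoid(q - h_prev)
    d_i = i * (1.0 - i)  # sigmoid derivative at (k + h_prev)
    d_f = f * (1.0 - f)  # sigmoid derivative at (q - h_prev)
    a = d_i * v - d_f * h_prev + f
    g_prime = 1.0 - math.tanh(i * v + f * h_prev) ** 2
    return g_prime * a


# Arbitrary test point.
h_prev, q, k, v = 0.3, -0.4, 0.9, 1.5
eps = 1e-6
numeric = (lrn_step(h_prev + eps, q, k, v) - lrn_step(h_prev - eps, q, k, v)) / (2 * eps)
analytic = lrn_step_grad(h_prev, q, k, v)
```

Because every operation inside the recurrence is element-wise, the full Jacobian is diagonal, so checking one channel covers the general case.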

Experiments
We verify the effectiveness of LRN on six diverse NLP tasks. For each task, we adopt (near) state-of-the-art neural models with RNNs handling sequence representation. We compare LRN with several cutting-edge recurrent units, including LSTM, GRU, ATR and SRU. For all comparisons, we keep the neural architecture intact and only alter the recurrent unit. All RNNs are implemented without specialized cuDNN kernels. Unless otherwise stated, different models on the same task share the same set of hyperparameters.

Natural Language Inference
Settings. Natural language inference reasons about the entailment relationship between a premise sentence and a hypothesis sentence. We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and treat the task as a three-way classification task. This dataset contains 549,367 premise-hypothesis pairs for training, 9,842 pairs for development and 9,824 pairs for testing. We employ accuracy for evaluation.
We implement a variant of the word-by-word attention model (Rocktäschel et al., 2016) using TensorFlow for this task, where we stack two additional bidirectional RNNs upon the final sequence representation and incorporate character embedding for word-level representation. The pretrained GloVe (Pennington et al., 2014) word vectors are used to initialize word embedding. We also integrate the base BERT (Devlin et al., 2018). We set the character embedding size and the RNN hidden size to 64 and 300 respectively. Dropout is applied between consecutive layers with a rate of 0.3. We train models within 10 epochs using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 128 and gradient norm limit of 5.0. We set the learning rate to 1e-3, and apply an exponential moving average to all model parameters with a decay rate of 0.9999. These hyperparameters are tuned according to development performance.
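Two of the training details above, the gradient norm limit and the exponential moving average of parameters, can be illustrated with a short sketch. It is not the authors' TensorFlow code: it assumes the limit is applied to the global norm of all gradients (the text does not say whether clipping is global or per tensor), and the function names and toy numbers are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their joint L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def ema_update(shadow, params, decay):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy gradients with global norm 13, clipped to the limit of 5.0.
clipped = clip_by_global_norm([3.0, 4.0, 12.0], max_norm=5.0)

# Toy parameters: the shadow copy moves only slightly toward the new values.
shadow = ema_update([1.0, 1.0], [2.0, 0.0], decay=0.9999)
```

With a decay of 0.9999 the averaged copy changes by only 0.01% of the gap per step, so it acts as a smoothed version of the parameters seen over many thousands of updates.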
Results. Table 1 shows the test accuracy and training time of different models. Our implementation outperforms the original model where Rocktäschel et al. (2016) report an accuracy of 83.50. Overall results show that LRN achieves competitive performance but consumes the least training time. Although LSTM and GRU outperform LRN by 0.3∼0.9 in terms of accuracy, these recurrent units sacrifice running efficiency (about 7%∼48%) depending on whether LN and BERT are applied. No significant performance difference is observed between SRU and LRN, but LRN has fewer model parameters and shows a speedup over SRU of 8%∼21%.
Models with layer normalization (LN) (Ba et al., 2016) tend to be more stable and effective. However, for LSTM, GRU and ATR, LN results in significant computational overhead (about 27%∼71%). In contrast, quasi-recurrent models like SRU and LRN only suffer a marginal speed decrease. This is reasonable because layer normalization is moved together with matrix multiplication out of the recurrence.
Results with BERT show that contextual information is valuable for performance improvement. LRN obtains an additional gain of 4 percentage points with BERT and reaches an accuracy of around 89.9. This shows the compatibility of LRN with existing pretrained models. In addition, although the introduction of BERT brings in heavy matrix computation, the benefits from LRN do not disappear. LRN is still the fastest model, outperforming other recurrent units by 8%∼27%.

Document Classification
Settings. Document classification poses challenges in the form of long-range dependencies where information from distant tokens that contribute to the correct category should be captured. We use Amazon Review Polarity (AmaPolar, 2 labels, 3.6M/0.4M for training/testing), Amazon Review Full (AmaFull, 5 labels, 3M/0.65M for training/testing), Yahoo! Answers (Yahoo, 10 labels, 1.4M/60K for training/testing) and Yelp Review Polarity (YelpPolar, 2 labels, 0.56M/38K for training/testing) from Zhang et al. (2015) for experiments. We randomly select 10% of training data for validation. Models are evaluated by test error.
We treat a document as a sequence of words. Our model is a bidirectional RNN followed by an attentive pooling layer. The word-level representation is composed of a pretrained GloVe word vector and a convolutional character vector. We use TensorFlow for implementation and do not use layer normalization. We set character embedding size to 32, RNN hidden size to 64 and dropout rate to 0.1. Model parameters are tuned by Adam optimizer with initial learning rate of 1e-3. Gradients are clipped when their norm exceeds 5. We limit the maximum document length to 400 and maximum training epochs to 6. Parameters are smoothed by an exponential moving average with a decay rate of 0.9999. These hyperparameters are tuned according to development performance.
Results. According to Table 2, LRN underperforms LSTM and GRU (-0.45∼-1.22). This indicates that LRN is capable of handling long-range dependencies though not as strong as complex recurrent units. Instead, the simplification endows LRN with less computational overhead than these units. Particularly, LRN accelerates the training over LSTM and SRU by about 20%, or several days of training time on GeForce GTX 1080 Ti.

Machine Translation
Settings. Machine translation is the task of transforming meaning from a source language to a target language. We experiment with the WMT14 English-German translation task (Bojar et al., 2014) which consists of 4.5M training sentence pairs. We use newstest2013 as our development set and newstest2014 as our test set. Case-sensitive tokenized BLEU score is used for evaluation.
We implement a variant of the GNMT system (Wu et al., 2016) using TensorFlow, enhanced with residual connections, layer normalization, label smoothing, a context-aware component (Zhang et al., 2017) and multi-head attention (Vaswani et al., 2017). Byte-pair encoding (Sennrich et al., 2016) is used to limit the vocabulary size to 32K. We set the hidden size and embedding size to 1024. Models are trained using Adam optimizer with adaptive learning rate schedule (Chen et al., 2018). We cut gradient norm to 1.0 and set the token size to 32K. Label smoothing rate is set to 0.1.
Model Variant. Apart from LRN, we develop an improved variant for machine translation that includes an additional output gate. Formally, we change Eq. (4) to incorporate an output gate $o_t$. We denote this variant oLRN. Like LRN, the added matrix transformation in oLRN can be shifted out of the recurrence, bringing in little computational overhead. The design of this output gate $o_t$ is inspired by the LSTM structure, which acts as a controller to adjust information flow. In addition, this gate helps stabilize the hidden activation to avoid value explosion, and also improves model fitting capacity.
Results. The results in Table 3 show that translation quality of LRN is slightly worse than that of GRU (-0.02 BLEU). After incorporating the output gate, however, oLRN yields the best BLEU score of 26.73, outperforming GRU (+0.45 BLEU). In addition, the training time results in Table 3 confirm the computational advantage of LRN over all other recurrent units, where LRN speeds up over ATR and SRU by approximately 25%. For decoding, nevertheless, the autoregressive schema of GNMT disables position-wise parallelization. In this case, the recurrent unit with the least computation operations, i.e. ATR, becomes the fastest. Still, both LRN and oLRN translate sentences faster than SRU (+15%/+6%).

Reading Comprehension
Settings. Reading comprehension requires answering a question about a given document, which involves complex sentence matching, reasoning and knowledge association. We use the SQuAD corpus (Rajpurkar et al., 2016) for this task and adopt a span-based extraction method. This corpus contains over 100K document-question-answer triples. We report exact match (EM) and F1-score (F1) on the development set for evaluation. We employ the publicly available rnet model (Wang et al., 2017), available at https://github.com/HKUST-KnowComp/R-Net, in TensorFlow. We use the default model settings: character embedding size 8, hidden size 75, batch size 64, and Adadelta optimizer (Zeiler, 2012) with initial learning rate of 0.5. Gradient norm is cut to 5.0. We also experiment with ELMo (Peters et al., 2018), and feed the ELMo representation in before the encoding layer and after the matching layer with a dropout of 0.5.
Results. Table 4 lists the EM/F1 score of different models. In this task, LRN outperforms ATR and SRU in terms of both EM and F1 score. After integrating ELMo for contextual modeling, the performance of LRN reaches the best (76.14 EM and 83.83 F1), beating both GRU and LSTM (+0.33 EM, +0.71 F1). As recent studies show that cases in SQuAD are dominated by local pattern matching (Jia and Liang, 2017), we argue that LRN is good at handling local dependencies.
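For readers unfamiliar with the two metrics, the following simplified sketch shows how exact match and token-level F1 compare a predicted span with a reference answer. It is an assumption-laden illustration, not the official SQuAD evaluation script: it only lowercases and splits on whitespace, whereas the official script also strips punctuation and articles and takes the maximum over several reference answers.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the two answers are identical after trimming and lowercasing."""
    return float(prediction.strip().lower() == reference.strip().lower())


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall between the two answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one extra token: no exact match, but partial F1 credit.
em = exact_match("the Eiffel Tower", "Eiffel Tower")
f1 = f1_score("the Eiffel Tower", "Eiffel Tower")
```

Here the prediction shares two of its three tokens with the reference (precision 2/3, recall 1), so it scores 0 on EM but 0.8 on F1, which is why the two numbers are reported together.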

Named Entity Recognition
Settings. Named entity recognition (NER) classifies named entity mentions into predefined categories. We use the CoNLL-2003 English NER dataset (Tjong Kim Sang and De Meulder, 2003) and treat NER as a sequence labeling task. We use the standard train, dev and test split. F1 score is used for evaluation.
We adopt the bidirectional RNN with CRF inference architecture (Lample et al., 2016). We implement different models based on the public codebase in TensorFlow. We use the default hyperparameter settings. Word embedding is initialized by GloVe vectors.
Results. As shown in Table 6, the performance of LRN matches that of ATR and SRU, though LSTM and GRU perform better (+1.05 and +0.79). As in the SQuAD task, the goal of NER is to detect local entity patterns and figure out the entity boundaries. However, the performance gap between LSTM/GRU and LRN in NER is significantly larger than that in SQuAD. We ascribe this to the weak model architecture and the small-scale NER dataset where entity patterns are not fully captured by LRN.

Language Modeling
Settings. Language modeling aims to estimate the probability of a given sequence, which requires models to memorize long-term structure of language. We use two widely used datasets, Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) for this task. Models are evaluated by perplexity.
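As a reminder of how the metric is defined, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses a made-up sequence in which every token receives probability 1/50; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability assigned to each token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that spreads probability uniformly over 50 choices at every step.
uniform_log_probs = [math.log(1.0 / 50.0)] * 8
ppl = perplexity(uniform_log_probs)
```

A perplexity of 50 therefore means the model is, on average, as uncertain as a uniform choice among 50 words, which is why lower values indicate better language models.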
We modify the mixture of softmax model (MoS) (Yang et al., 2018; https://github.com/zihangdai/mos) in PyTorch to include different recurrent units. We apply weight dropout to all recurrent-related parameters instead of only hidden-to-hidden connection. We follow the experimental settings of MoS, and manually tune the initial learning rate based on whether training diverges.
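Results. Table 5 shows the test perplexity of different models. Our re-implementation of the LSTM model is worse than the original model (Yang et al., 2018) because the system is sensitive to hyperparameters, and we apply weight dropout to all LSTM parameters, which makes the original best choices not optimal. In this task, LRN significantly outperforms GRU, ATR and SRU, and achieves nearly the same perplexity as LSTM. This shows that in spite of its simplicity, LRN can memorize long-term language structures and capture a certain degree of language variation. In summary, LRN generalizes well to different tasks and can be used as a drop-in replacement of existing recurrent units.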

Ablation Study
Part of LRN can be replaced with some alternatives. In this section, we conduct ablation analysis to examine two possible designs:
gLRN. The twin-style gates in Eqs. (2)-(3) can be replaced with a general one:
$$f_t = \sigma(q_t - h_{t-1}), \qquad i_t = 1 - f_t$$
In this way, input and forget gate are inferable from each other with the key weight parameter removed.
eLRN. The above design can be further simplified into an extreme case where the forget gate is only generated from the previous hidden state without the query vector:
$$f_t = \sigma(-h_{t-1}), \qquad i_t = 1 - f_t$$
We experiment with SNLI and PTB tasks. Results in Table 7 show that although the accuracy on SNLI is acceptable, gLRN and eLRN perform significantly worse on the PTB task. This suggests that these alternative structures suffer from weak generalization.
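The two ablated designs can be sketched for a single channel as follows. The exact gate equations are an assumption inferred from the description above (one gate with the input gate taken as its complement, with the query dropped in the extreme case), and `glrn_step` and `elrn_step` are hypothetical names; tanh is assumed as the activation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glrn_step(h_prev, q, v):
    """Assumed gLRN step: a single gate from the query; input gate = 1 - forget gate."""
    f = sigmoid(q - h_prev)
    return math.tanh((1.0 - f) * v + f * h_prev)


def elrn_step(h_prev, v):
    """Assumed eLRN step: the forget gate sees only the previous hidden state."""
    f = sigmoid(-h_prev)
    return math.tanh((1.0 - f) * v + f * h_prev)


# From a zero state with q = 0, both variants start with a gate of exactly 0.5.
h_g = glrn_step(0.0, 0.0, 1.0)
h_e = elrn_step(0.0, 1.0)
```

In the eLRN sketch the gate no longer depends on the current input at all, which gives one intuition for why this variant has less capacity to decide what to keep.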

Structure Analysis
In this section, we provide a visualization to check how the gates work in LRN.
We experiment with a unidirectional LRN on the AmaPolar dataset, where the last hidden state is used for document classification. Figure 1 shows the decay curve of each token along the token position. The memory curve of each token decays over time. However, important clues that contribute significantly to the final decision, as the token "great" does, decrease slowly, as shown by the red curve. Different tokens show different decay rates, suggesting that the input and forget gates are capable of learning to propagate relevant signals. All these demonstrate the effectiveness of our LRN model.

Conclusion and Future Work
This paper presents LRN, a lightweight recurrent network that factors matrix operations outside the recurrence and enables higher parallelization. Theoretical and empirical analysis shows that the input and forget gates in LRN can learn long-range dependencies and avoid gradient vanishing and explosion. LRN has a strong correlation with self-attention networks. Experiments on six different NLP tasks show that LRN achieves competitive performance against existing recurrent units. It is simple, effective and reaches a better trade-off among parameter number, running speed, model performance and generalization.
In the future, we are interested in testing low-level optimizations of LRN, which are orthogonal to this work, such as dedicated cuDNN kernels.

Figure 1: The decay curve of each token modulated by input and forget gates along the token position. Notice how the memory of term "great" flows to the final state shown in red, and contributes to a Positive decision. Weight denotes the averaged activation of $i_k \odot \prod_{l=1}^{t-k} f_{k+l}$ as shown in Eq. (5).

Table 1: Test accuracy (ACC) on SNLI task. "#Params": the parameter number of Base. Base and LN denote the baseline model and layer normalization respectively. Time: time in seconds per training batch measured from 1k training steps on GeForce GTX 1080 Ti. Best results are highlighted in bold.

Table 2: Test error (ERR) on document classification task. "#Params": the parameter number in AmaPolar task. Time: time in seconds per training batch measured from 1k training steps on GeForce GTX 1080 Ti.

Table 4: Exact match/F1-score on SQuAD dataset. "#Params": the parameter number of Base. rnet*: results published by Wang et al. (2017).

Table 5: Test perplexity on PTB and WT2 language modeling task. "#Params": the parameter number in PTB task. Finetune: fine-tuning the model after convergence. Dynamic: dynamic evaluation. Lower perplexity indicates better performance.

Table 7: Test accuracy on SNLI task with Base+LN setting and test perplexity on PTB task with Base setting.
Model   SNLI    PTB
LRN     85.06   61.26
gLRN    84.72   92.49
eLRN    83.56   169.81