Recurrent Residual Learning for Sequence Classification

In this paper, we explore the possibility of leveraging Residual Networks (ResNet), a powerful structure for constructing extremely deep neural networks for image understanding, to improve recurrent neural networks (RNN) for modeling sequential data. We show that for sequence classification tasks, incorporating residual connections into recurrent structures yields accuracy similar to that of Long Short Term Memory (LSTM) RNN with far fewer model parameters. In addition, we propose two novel models that combine the best of both residual learning and LSTM. Experiments show that the new models significantly outperform LSTM.


Introduction
Recurrent Neural Networks (RNNs) are powerful tools for modeling sequential data. Among the various RNN models, Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is one of the most effective structures. In LSTM, a gating mechanism controls the information flow so that the gradient vanishing problem of vanilla RNNs is better handled and long range dependencies are better captured. However, as empirically verified by previous works and our own experiments, training an LSTM RNN to fairly good results requires a carefully designed optimization procedure (Hochreiter et al., 2001; Pascanu et al., 2013; Dai and Le, 2015; Laurent et al., 2015; He et al., 2016; Arjovsky et al., 2015), especially when the unfolded architecture is very deep, as it is for fairly long sequences (Dai and Le, 2015).
* This work was done when the author was visiting Microsoft Research Asia.
From another perspective, Residual Networks (ResNet) (He et al., 2015) have recently shown their effectiveness for constructing very deep neural networks in quite a few computer vision tasks. By learning a residual mapping between layers with identity skip connections (Jaeger et al., 2007), ResNet ensures a smooth information flow, leading to efficient optimization of very deep structures (e.g., with hundreds of layers). In this paper, we explore the possibility of leveraging residual learning to improve the performance of recurrent structures, in particular LSTM RNN, in modeling fairly long sequences (i.e., sequences whose lengths exceed 100). To summarize, our main contributions include:
1. We introduce the residual connecting mechanism into the recurrent structure and propose recurrent residual networks for sequence learning. Our model achieves performance similar to LSTM in text classification tasks, while the number of model parameters is greatly reduced.
2. We present an in-depth analysis of the strengths and limitations of LSTM and ResNet with respect to sequence learning.
3. Based on this analysis, we further propose two novel models that incorporate the strengths of the mechanisms behind LSTM and ResNet. We demonstrate that our models outperform LSTM in many sequence classification tasks.

Background
An RNN models a sequence by taking the sequential input $x = \{x_1, \cdots, x_T\}$ and generating $T$ hidden states $h = \{h_1, \cdots, h_T\}$. At each time step $t$, the RNN takes the input vector $x_t \in \mathbb{R}^n$ and the previous hidden state vector $h_{t-1} \in \mathbb{R}^m$ to produce the next hidden state $h_t$. Building on this basic structure, LSTM avoids gradient vanishing in RNN training, and thus copes better with long range dependencies, by augmenting the vanilla RNN with a memory cell vector $c_t \in \mathbb{R}^m$ and multiplicative gate units that regulate the information flow. More specifically, at each time step $t$, an LSTM unit takes $x_t$, $c_{t-1}$ and $h_{t-1}$ as input, generates the input, output and forget gate signals (denoted as $i_t$, $o_t$ and $f_t$ respectively), and produces the next cell state $c_t$ and hidden state $h_t$ (equation (1)). In these computations, $\otimes$ refers to the element-wise product, $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function, and $W_j$ ($j \in \{i, o, f, c\}$) are the LSTM parameters. In the following, the functions generating $h_t$ and $c_t$ are denoted as $h_t, c_t = LSTM(x_t, h_{t-1}, c_{t-1})$.
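For reference, a standard form of the LSTM update is sketched below. It is the textbook formulation written with the symbols used in this section, not necessarily the authors' exact equation (1): the concatenation $[x_t; h_{t-1}]$, the candidate cell $\tilde{c}_t$ and the omission of bias terms are assumptions made here for brevity.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i [x_t; h_{t-1}]\right) \\
f_t &= \sigma\left(W_f [x_t; h_{t-1}]\right) \\
o_t &= \sigma\left(W_o [x_t; h_{t-1}]\right) \\
\tilde{c}_t &= \tanh\left(W_c [x_t; h_{t-1}]\right) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \\
h_t &= o_t \otimes \tanh(c_t)
\end{aligned}
```

Under this form, the forget gate $f_t$ scales how much of $c_{t-1}$ survives into $c_t$, which is the gating behavior the later sections contrast with identity connections.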
Residual Networks (ResNet) are among the pioneering works (Szegedy et al., 2015; Srivastava et al., 2015) that use extra identity connections to enhance information flow so that very deep neural networks can be effectively optimized. ResNet (He et al., 2015) is composed of several stacked residual units, in which the $l$-th unit performs the following transformation:
$$h_{l+1} = f(g(h_l) + \mathcal{F}(h_l; W_l)) \tag{2}$$
where $h_l$ and $h_{l+1}$ are the input and output of the $l$-th unit respectively, $\mathcal{F}$ is the residual function with weight parameters $W_l$, $f$ is typically the ReLU function (Nair and Hinton, 2010), and $g$ is set as the identity function, i.e., $g(h_l) = h_l$. Such an identity connection guarantees the direct propagation of signals among different layers and thereby avoids gradient vanishing. A recent paper (Liao and Poggio, 2016) discusses the possibility of using shared weights in ResNet, similar to what an RNN does.

Recurrent Residual Learning
The basic idea of recurrent residual learning is to force a direct information flow across different time steps of RNNs by identity (skip) connections. In this section, we introduce how to leverage residual learning to 1) directly construct a recurrent neural network in subsection 3.1; 2) improve LSTM in subsection 3.2.

Recurrent Residual Networks (RRN)
The basic architecture of the Recurrent Residual Network (RRN for short) is illustrated in Figure 1, in which orange arrows indicate the identity connections from each $h_{t-1}$ to $h_t$, and blue arrows represent the recurrent transformations taking both $h_{t-1}$ and $x_t$ as input. Similar to equation (2), the recurrent transformation in RRN takes the following form (denoted as $h_t = RRN(x_t, h_{t-1})$ in the following sections):
$$h_t = f(g(h_{t-1}) + \mathcal{F}(h_{t-1}, x_t; W)) \tag{3}$$
where $g$ is still the identity function, i.e., $g(h_{t-1}) = h_{t-1}$, corresponding to the orange arrows in Figure 1, and $f$ is typically set as tanh. For the function $\mathcal{F}$ with weight parameters $W$ (corresponding to the blue arrows in Figure 1), inspired by the observation that higher recurrent depth tends to lead to better performance, we impose $K$ deep transformations in $\mathcal{F}$ (equation (4)), where $x_t$ is taken at every layer such that the input information is better captured, which works similarly to the mechanism of highway networks (Srivastava et al., 2015). $K$ is the recurrent depth defined in . The weights $W_m$ ($m \in \{1, \cdots, K\}$) are shared across different time steps $t$. RRN forces the direct propagation of hidden state signals between every two consecutive time steps with the identity connections $g$. In addition, the multiple non-linear transformations in $\mathcal{F}$ guarantee its capability of modelling complicated recurrent relationships. In practice, we found that $K = 2$ yields fairly good performance while requiring only half of the LSTM parameter size when the model dimensions are the same.
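The RRN recurrence and the parameter-size comparison can be illustrated with a minimal pure-Python sketch. The layer layout is an assumption: each of the K layers of F is taken to be a hypothetical `(W, U, b)` triple acting on `x_t` and on the previous layer's output, with tanh as the inner non-linearity. The LSTM count assumes four weight blocks without peephole connections. None of these details are confirmed by the text.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_layers(n, m, K=2, seed=0):
    """Create K hypothetical (W, U, b) layers with small random weights.

    W is m x n (acts on x_t), U is m x m (acts on the previous layer's
    output, or on h_{t-1} for the first layer), b is a bias of size m.
    """
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return [(mat(m, n), mat(m, m), [0.0] * m) for _ in range(K)]


def rrn_step(x_t, h_prev, layers):
    """One step h_t = f(g(h_{t-1}) + F(h_{t-1}, x_t)), f = tanh, g = identity."""
    y = h_prev
    for W, U, b in layers:  # x_t is fed into every layer of F
        pre = [a + c + d for a, c, d in zip(matvec(W, x_t), matvec(U, y), b)]
        y = [math.tanh(p) for p in pre]  # assumed inner non-linearity
    # identity connection adds h_{t-1} before the outer tanh
    return [math.tanh(hp + yk) for hp, yk in zip(h_prev, y)]


def rrn_param_count(n, m, K=2):
    """Parameters of K layers, each with W (m x n), U (m x m) and bias (m)."""
    return K * (n * m + m * m + m)


def lstm_param_count(n, m):
    """Parameters of an LSTM with four blocks (three gates plus cell candidate)."""
    return 4 * (n * m + m * m + m)
```

Under these assumptions, `rrn_param_count(n, m, K=2)` is exactly half of `lstm_param_count(n, m)` for any input size `n` and hidden size `m`, which matches the ratio stated for K = 2.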

Gated Residual RNN
Identity connections in ResNet are important for propagating the single input image information to higher layers of a CNN. However, when it comes to sequence classification, the scenario is quite different in that there is a new input at every time step. Therefore, a forgetting mechanism to "forget" less critical historical information, as employed in LSTM (controlled by the forget gate $f_t$), becomes necessary. On the other hand, while LSTM benefits from the flexible gating mechanism, its parametric nature makes optimization difficult for fairly long sequences, whose long range information dependencies could be better captured by identity connections.
Inspired by the success of the gating mechanism of LSTM and the residual connecting mechanism with enhanced information flow of ResNet, we further propose two Gated Residual Recurrent models leveraging the strengths of the two mechanisms.

Model 1: Skip-Connected LSTM (SC-LSTM)
Skip-Connected LSTM (SC-LSTM for short) introduces identity connections into standard LSTM to enhance the information flow. Note that in Figure 1, a straightforward approach is to replace F with an LSTM unit. However, this approach did not achieve satisfactory results in our preliminary experiments. Our conjecture is that identity connections between consecutive memory cells, which are already sufficient to maintain short-range memory, make the unregulated information flow overly strong and thus compromise the merit of the gating mechanism.
To reasonably enhance the information flow for the LSTM network while keeping the advantage of the gating mechanism, starting from equation (1), we propose to add skip connections between two LSTM hidden states separated by a wide distance $L$ (e.g., $L = 20$), such that $\forall t \in \{1, 1+L, 1+2L, \cdots, 1+\lfloor\frac{T-L-1}{L}\rfloor L\}$:
$$h_{t+L} = \tanh(c_{t+L}) \otimes o_{t+L} + \alpha h_t \tag{5}$$
Here $\alpha$ is a scalar that can either be fixed as 1 (i.e., identity mapping) or be optimized during the training process as a model parameter (i.e., parametric skip connection). We refer to these two variants as SC-LSTM-I and SC-LSTM-P respectively. Note that in SC-LSTM, the skip connections only exist at time steps $1, 1+L, 1+2L, \cdots, 1+\lfloor\frac{T-L-1}{L}\rfloor L$. The basic structure is shown in Figure 2.
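The skip-connection schedule can be made concrete with a small sketch. It assumes that the index set is 1, 1 + L, ..., 1 + floor((T - L - 1) / L) * L, and that the skip adds alpha times the source hidden state to the hidden state L steps later, which is then passed on to the next step. The function names and the generic `step` callable (standing in for an LSTM cell) are hypothetical.

```python
def skip_sources(T, L):
    """1-indexed time steps t whose hidden state is forwarded to step t + L."""
    last = 1 + ((T - L - 1) // L) * L
    return list(range(1, last + 1, L))


def run_with_skips(xs, step, h0, c0, L, alpha=1.0):
    """Run a recurrent cell over xs, adding alpha * h_t to h_{t+L}.

    `step(x_t, h_prev, c_prev)` must return `(h_t, c_t)`. With alpha fixed
    to 1.0 this mirrors the identity variant; a learned alpha would mirror
    the parametric variant (training is not shown here).
    """
    T = len(xs)
    sources = set(skip_sources(T, L))
    pending = {}  # target step -> hidden state carried by the skip
    h, c = h0, c0
    for t in range(1, T + 1):
        h, c = step(xs[t - 1], h, c)
        if t in pending:  # this step receives a skip connection
            carried = pending.pop(t)
            h = [a + alpha * b for a, b in zip(h, carried)]
        if t in sources:  # this step feeds the step L positions later
            pending[t + L] = h
    return h
```

For T = 100 and L = 20, `skip_sources` yields the source steps 1, 21, 41 and 61, so the last skip lands on step 81 and never reaches past the end of the sequence.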

Model 2: Hybrid Residual LSTM (HRL)
Since LSTM generates sequence representations out of flexible gating mechanism, and RRN generates representations with enhanced residual historical information, it is a natural extension to combine the two representations to form a signal that benefits from both mechanisms. We denote this model as Hybrid Residual LSTM (HRL for short).
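In HRL, two independent signals, $h^{LSTM}_t$ generated by LSTM (equation (1)) and $h^{RRN}_t$ generated by RRN (equation (3)), are propagated through LSTM and RRN respectively:
$$h^{LSTM}_t, c_t = LSTM(x_t, h^{LSTM}_{t-1}, c_{t-1}) \tag{6}$$
$$h^{RRN}_t = RRN(x_t, h^{RRN}_{t-1}) \tag{7}$$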

The final representation $h^{HRL}_T$ is obtained by mean pooling of the two "sub" hidden states:
$$h^{HRL}_T = \frac{1}{2}\left(h^{LSTM}_T + h^{RRN}_T\right) \tag{8}$$
$h^{HRL}_T$ is then used for higher level tasks such as predicting the sequence label. In this way, $h^{HRL}_T$ contains both the statically forced and the dynamically adjusted historical signals, which are respectively conveyed by $h^{RRN}_T$ and $h^{LSTM}_T$.
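The combination step can be sketched as follows, assuming the two recurrences are supplied as hypothetical callables `lstm_step(x_t, h, c) -> (h, c)` and `rrn_step(x_t, h) -> h`, and that mean pooling is an element-wise average of the two final hidden states.

```python
def hrl_representation(xs, lstm_step, rrn_step, h_lstm0, c0, h_rrn0):
    """Run both recurrences over xs and mean-pool their final hidden states."""
    h_lstm, c, h_rrn = h_lstm0, c0, h_rrn0
    for x_t in xs:
        h_lstm, c = lstm_step(x_t, h_lstm, c)  # gated (dynamically adjusted) path
        h_rrn = rrn_step(x_t, h_rrn)           # residual (statically forced) path
    return [(a + b) / 2.0 for a, b in zip(h_lstm, h_rrn)]
```

The two paths share the inputs but not their states, so each keeps its own dynamics until the final averaging.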

Experiments
We conduct a comprehensive empirical analysis on sequence classification tasks. Listed in ascending order of average sequence length, the public datasets we use are:
1. AG's news corpus¹, a news article corpus with categorized articles from more than 2,000 news sources. We use the dataset with the 4 largest classes constructed in .
2. IMDB movie review dataset², a binary sentiment classification dataset consisting of movie review comments with positive/negative sentiment labels (Maas et al., 2011).
3. 20 Newsgroups (20NG for short), an email collection dataset categorized into 20 news groups. Similar to (Dai and Le, 2015), we use the post-processed version³, in which attachments, PGP keys and some duplicates are removed.
4. Permuted-MNIST (P-MNIST for short). Following (Le et al., 2015; Arjovsky et al., 2015), we shuffle the pixels of each MNIST image (LeCun et al., 1998) with a fixed random permutation, and feed all pixels sequentially into the recurrent network to predict the image label. Permuted-MNIST is assumed to be a good testbed for measuring the ability to model very long range dependencies (Arjovsky et al., 2015).
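The Permuted-MNIST construction can be sketched in a few lines. The helper names and the seed are hypothetical. The only facts relied on are that an MNIST image has 28 × 28 = 784 pixels and that a single permutation is drawn once and reused for every image.

```python
import random


def make_permutation(n_pixels=784, seed=0):
    """Draw one fixed random permutation of pixel positions."""
    rng = random.Random(seed)
    perm = list(range(n_pixels))
    rng.shuffle(perm)
    return perm


def to_permuted_sequence(image, perm):
    """Flatten a 2-D image (list of rows) and reorder its pixels by perm."""
    flat = [pixel for row in image for pixel in row]
    return [flat[i] for i in perm]
```

Because the same `perm` is applied to every image, the task stays learnable, while local spatial structure is scattered across the 784-step sequence.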
Detailed statistics of each dataset are listed in Table 1. For all the text datasets, we take every word as input and feed word embedding vectors pre-trained by Word2Vec on Wikipedia into the recurrent neural network. The most frequent words, which together account for 95% of the total word frequency, are kept, while the others are replaced by the token "UNK". We use the standard training/test split provided with each dataset and randomly pick 15% of the training set as the dev set, based on which we perform early stopping and, for all models, tune hyper-parameters such as the dropout ratio (on non-recurrent layers) (Zaremba et al., 2014), the gradient clipping value (Pascanu et al., 2013) and the skip connection length $L$ for SC-LSTM (cf. equation (5)). The last hidden states of the recurrent networks are fed into logistic regression classifiers for label prediction. We use Adadelta (Zeiler, 2012) for parameter optimization. All our implementations are based on Theano (Theano Development Team, 2016) and run on one K40 GPU. All source code and datasets can be downloaded at https://publish.illinois.edu/yirenwang/emnlp16source/.
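The vocabulary truncation described above can be sketched as follows. This is one plausible reading of "95% total frequency coverage": words are added in decreasing order of frequency until the kept words cover at least that share of all tokens. The function names and the tie-handling are assumptions, not the released preprocessing code.

```python
from collections import Counter


def build_vocab(token_lists, coverage=0.95):
    """Keep the most frequent words until they cover `coverage` of all tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    vocab, covered = set(), 0
    for tok, count in counts.most_common():
        if covered >= coverage * total:
            break
        vocab.add(tok)
        covered += count
    return vocab


def replace_rare(tokens, vocab, unk="UNK"):
    """Map every out-of-vocabulary token to the UNK symbol."""
    return [tok if tok in vocab else unk for tok in tokens]
```

Building the vocabulary on the training split only, and reusing it for the dev and test splits, would keep held-out statistics out of preprocessing.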
We compare our proposed models mainly with the state-of-the-art standard LSTM RNN. In addition, to fully demonstrate the effect of residual learning in our HRL model, we employ another hybrid model as a baseline, which combines LSTM and GRU (Cho et al., 2014), another state-of-the-art RNN variant, in a similar way to HRL. We use LSTM+GRU to denote this baseline. The model size configurations (word embedding size × hidden state size) used for each dataset are listed in Table 2. In Table 2, "Non-Hybrid" refers to the LSTM, RRN and SC-LSTM models, while "Hybrid" refers to the two methods that combine two basic models: HRL and LSTM+GRU. The model sizes of all hybrid models are smaller than that of the standard LSTM. All models have only one recurrent layer.

Experimental Results
All the classification accuracy numbers are listed in Table 3. From these results we make the following observations:
1. RRN achieves performance similar to LSTM with only half of the model parameters, indicating that the residual network structure, with its connecting mechanism to enhance the information flow, is also an effective approach for sequence learning. However, the fact that it fails to significantly outperform other models (as it does in image classification) implies that a forgetting mechanism is desired in recurrent structures to handle multiple inputs.
2. Skip-Connected LSTM performs much better than standard LSTM. For tasks with shorter sequences such as AG's News, the improvement is limited. However, the improvements become more significant as the sequence lengths grow across datasets⁴, and the performance is particularly good on P-MNIST with its very long sequences. This reveals the importance of skip connections in carrying historical information through a long range of time steps, and demonstrates the effectiveness of our approach of adopting the residual connecting mechanism to improve LSTM's capability of handling long-term dependencies. Furthermore, SC-LSTM is robust to different hyperparameter values: we test $L = 10, 20, 50, 75$ on P-MNIST and find that the performance is not sensitive to these values of $L$.
⁴ t-test on SC-LSTM-P and SC-LSTM-I with p value < 0.001.
3. HRL also outperforms standard LSTM with fewer model parameters⁵. In comparison, the hybrid LSTM+GRU model cannot achieve the same accuracy as HRL. As we expected, the additional long range historical information propagated by RRN proves to be a good complement to standard LSTM.

Conclusion
In this paper, we explore the possibility of leveraging residual networks to improve the performance of LSTM RNN. We show that a direct adaptation of ResNet performs well in sequence classification. In addition, when combined with the gating mechanism of LSTM, residual learning significantly improves LSTM's performance. As future work, we plan to apply residual learning to other sequence tasks such as language modeling and RNN based neural machine translation (Sutskever et al., 2014; Cho et al., 2014).