Low-rank passthrough neural networks

Various common deep learning architectures, such as LSTMs, GRUs, Resnets and Highway Networks, employ state passthrough connections that support training with high feed-forward depth or recurrence over many time steps. These “Passthrough Network” architectures also enable the decoupling of the network state size from the number of parameters of the network, a possibility that has been studied by Sak et al. (2014) with their low-rank parametrization of the LSTM. In this work we extend this line of research, proposing effective low-rank and low-rank plus diagonal matrix parametrizations for Passthrough Networks which exploit this decoupling property, reducing the data complexity and memory requirements of the network while preserving its memory capacity. This is particularly beneficial in low-resource settings as it supports expressive models with a compact parametrization less susceptible to overfitting. We present competitive experimental results on several tasks, including language modeling and a near state of the art result on sequential randomly-permuted MNIST classification, a hard task on natural data.


OVERVIEW
Deep neural networks can perform non-trivial computation by the repeated application of parametric non-linear transformation layers to vectorial (or, more generally, tensorial) data. This staging of many computation steps can be done over a time dimension for tasks involving sequential inputs or outputs of varying length, yielding a recurrent neural network, or over an intrinsic circuit depth dimension, yielding a deep feed-forward neural network, or both. Training these deep models is complicated by the exploding and vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994).
Starting from the original LSTM of Hochreiter & Schmidhuber (1997), various network architectures have been proposed to ameliorate the vanishing gradient problem in the recurrent neural network setting, such as the modern LSTM (Graves & Schmidhuber, 2005), the GRU (Cho et al., 2014b) and other variants (Greff et al., 2015; Józefowicz et al., 2015). These architectures led to a number of breakthroughs in different tasks such as speech recognition (Graves et al., 2013), machine translation (Cho et al., 2014a; Bahdanau et al., 2014), natural language parsing (Vinyals et al., 2014), question answering (Iyyer et al., 2014) and many others. More recently, similar methods have been applied in the feed-forward neural network setting, yielding state of the art results with architectures such as Highway Networks (Srivastava et al., 2015), Deep Residual Networks (He et al., 2015) and Grid LSTM (Kalchbrenner et al., 2015). All these architectures are based on a single structural principle which, in this work, we will refer to as the state passthrough. We will thus refer to these architectures as Passthrough Networks.
In typical "fully connected" neural architectures, a layer acting on an n-dimensional state vector has $O(n^2)$ parameters stored in one or more matrices. Since a sufficiently complex function requires a large number of bits to be represented regardless of architectural details, we can't hope to find low-dimensional representations for really hard learning tasks, but there can be many functions of practical interest that are simple enough to be represented by a relatively small number of bits while still requiring some sizable amount of memory to be computed. Therefore, representing these functions on a fully connected neural network can be wasteful in terms of number of parameters. For some tasks, this quadratic dependency between state size and parameter number can cause a model to go from underfitting the training set to overfitting it just by the addition of a single state component. For this reason, a number of low-dimensional neural layer parametrizations have been proposed, such as convolutional layers (LeCun et al., 2004; Krizhevsky et al., 2012), which impose a sparse, local, periodic structure on the parameter matrices, or multiplicative matrix decompositions, notably the Unitary Evolution RNNs (Arjovsky et al., 2015) (which also address the vanishing gradient problem) and others (Le et al., 2013; Moczulski et al., 2015).
In this work we observe that the state passthrough allows for a systematic decoupling of the network state size from the number of parameters: since by default the state vector passes mostly unaltered through the layers, each layer can be made simple enough to be described only by a small number of parameters without affecting the overall memory capacity of the network. This effectively spreads the computation over the depth or time dimension of the network, but without making the network "thin" (as proposed, for instance, by Srivastava et al. (2015)).
To the best of our knowledge, this decoupling has not been described in a systematic way, although it has been exploited by some convolutional passthrough architectures for image recognition (Srivastava et al., 2015; He et al., 2015) or algorithmic tasks (Kaiser & Sutskever, 2015), or architectures with addressable read-write memory (Graves et al., 2014; Gregor et al., 2015; Neelakantan et al., 2015; Kurach et al., 2015; Danihelka et al., 2016).
In this work we introduce a unified view of passthrough architectures, describe their state size-parameter size decoupling property, and propose simple but effective low-dimensional parametrizations that exploit this decoupling based on low-rank or low-rank plus diagonal matrix decompositions. Our approach extends the LSTM architecture with a single projection layer by Sak et al. (2014), which has been applied to speech recognition, natural language modeling (Józefowicz et al., 2016), video analysis (Sun et al., 2015) et cetera. We provide experimental evaluation of our approach on GRU and Highway Network architectures on various machine learning tasks, including a near state of the art result for the hard task of sequential randomly-permuted MNIST image recognition (Le et al., 2015).

MODEL
In this section we will introduce a notation to describe various neural network architectures, then we will formally describe passthrough architectures and finally we will introduce our low-dimensional parametrizations for these architectures.
A neural network can be described as a dynamical system that transforms an input u into an output y over multiple time steps T. At each step t the network has an n-dimensional state vector $x(t) \in \mathbb{R}^n$ defined as

$$x(t) = \begin{cases} \mathrm{in}(u, \theta) & \text{if } t = 0 \\ f(x(t-1), t, u, \theta) & \text{if } t \geq 1 \end{cases} \tag{1}$$

where in is a state initialization function, f is a state transition function and $\theta \in \mathbb{R}^k$ is a vector of trainable parameters. The output $y = \mathrm{out}(x(0:T), \theta)$ is generated by an output function out, where $x(0:T)$ denotes the whole sequence of states visited during the execution.
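In a feed-forward neural network with constant hidden layer width n, the input $u \in \mathbb{R}^m$ and the output $y \in \mathbb{R}^l$ are vectors of fixed dimension m and l respectively, T is a model hyperparameter and the functions above can be simplified as

$$\mathrm{in}(u, \theta) = \mathrm{in}(u, \theta_{in}), \qquad f(x, t, u, \theta) = f(x, \theta_t), \qquad \mathrm{out}(x(0:T), \theta) = \mathrm{out}(x(T), \theta_{out}) \tag{2}$$

highlighting the dependence of the different layers on different subsets of parameters.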
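In a recurrent neural network the input u is typically a list of T m-dimensional vectors $u(t) \in \mathbb{R}^m$ for $t \in 1, \ldots, T$, where T is variable, and the output y is either a single l-dimensional vector or a list of T such vectors. The model functions can be written as

$$\mathrm{in}(u, \theta) = \mathrm{in}(\theta_{in}), \qquad f(x, t, u, \theta) = f(x, u(t), \theta_f), \qquad y(t) = \mathrm{out}(x(t), \theta_{out}) \tag{3}$$

where for a fixed-dimensional output we assume that only y(T) is meaningful.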
Other neural architectures, such as "seq2seq" transducers without attention (Cho et al., 2014a), can also be described with this framework.

PASSTHROUGH NETWORKS
Passthrough networks can be defined as networks where the state transition function f has a special form such that, at each step t, the state vector x(t) (or a sub-vector $\hat{x}(t)$) is propagated to the next step modified only by some (nearly) linear, element-wise transformations.
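Let the state vector $x(t) \equiv (\hat{x}(t), \tilde{x}(t))$ be the concatenation of $\hat{x}(t) \in \mathbb{R}^{\hat{n}}$ and $\tilde{x}(t) \in \mathbb{R}^{\tilde{n}}$ with $\hat{n} + \tilde{n} = n$ (where $\tilde{n}$ can be equal to zero). We define a network to have a state passthrough on $\hat{x}$ if $\hat{x}$ evolves as

$$\hat{x}(t) = f_\pi(x(t-1), t, u, \theta) \odot f_\tau(x(t-1), t, u, \theta) + \hat{x}(t-1) \odot f_\gamma(x(t-1), t, u, \theta) \tag{4}$$

where $f_\pi$ is the next state proposal function, $f_\tau$ is the transform function, $f_\gamma$ is the carry function and $\odot$ denotes element-wise vector multiplication.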
The rest of the state vector $\tilde{x}(t)$, if present, evolves according to some other function $\tilde{f}$. In practice $\tilde{x}(t)$ is only used in LSTM variants, while in other passthrough architectures $\hat{x}(t) = x(t)$.
We denote the state passthrough as coupled if

$$f_\gamma(x, t, u, \theta) = 1 - f_\tau(x, t, u, \theta) \tag{5}$$

This choice is used in GRUs (Cho et al., 2014b) and Highway Networks (Srivastava et al., 2015). Modern LSTM variants (Greff et al., 2015) typically use a transform function ("input gate") $f_\tau$ and a carry function ("forget gate") $f_\gamma$ independent of each other.
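As a concrete example, we can describe a fully connected Highway Network as

$$\begin{aligned} f_\pi(x, t, u, \theta) &= g\left(\theta^{(W\pi)}_t x + \theta^{(b\pi)}_t\right) \\ f_\tau(x, t, u, \theta) &= \sigma\left(\theta^{(W\tau)}_t x + \theta^{(b\tau)}_t\right) \\ f_\gamma(x, t, u, \theta) &= 1 - f_\tau(x, t, u, \theta) \end{aligned} \tag{6}$$

where g is an element-wise activation function, usually the ReLU (Glorot et al., 2011) or the hyperbolic tangent, σ is the element-wise logistic sigmoid, and, $\forall t \in 1, \ldots, T$, the parameters $\theta^{(W\pi)}_t$ and $\theta^{(W\tau)}_t$ are matrices in $\mathbb{R}^{n \times n}$ and $\theta^{(b\pi)}_t$ and $\theta^{(b\tau)}_t$ are vectors in $\mathbb{R}^n$. Dependence on the input u occurs only through the initialization function, which is model-specific and is omitted here, as is the output function.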
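As an illustration of a coupled state passthrough in a fully connected Highway layer, the following is a minimal NumPy sketch, not the authors' implementation. The function and argument names (`highway_layer`, `W_pi`, `b_tau`, and so on) are assumptions chosen for readability, and the layer is taken to compute proposal times transform gate plus state times (one minus transform gate).

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, W_pi, b_pi, W_tau, b_tau, g=np.tanh):
    """One fully connected Highway layer with a coupled state passthrough.

    All names are illustrative. x has shape (n,), the W matrices (n, n),
    the biases (n,).
    """
    proposal = g(W_pi @ x + b_pi)            # next-state proposal
    transform = sigmoid(W_tau @ x + b_tau)   # transform gate in (0, 1)
    carry = 1.0 - transform                  # coupled carry gate
    return proposal * transform + x * carry


rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
W_pi = rng.standard_normal((n, n)) / np.sqrt(n)
b_pi = np.zeros(n)

# With a strongly negative transform bias the gate is almost closed,
# so the layer passes its state through nearly unchanged.
W_tau = np.zeros((n, n))
b_tau = np.full(n, -20.0)
y = highway_layer(x, W_pi, b_pi, W_tau, b_tau)
```

With the transform gate nearly closed, `y` is numerically indistinguishable from `x`, which is the behaviour the passthrough is meant to provide by default.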

LOW-RANK PASSTHROUGH NETWORKS
In fully connected architectures there are n × n matrices that act on the state vector, such as the $\theta^{(W\pi)}_t$ and $\theta^{(W\tau)}_t$ matrices of the Highway Network of eq. 6. Each of these matrices has $n^2$ entries, thus for large n, the entries of these matrices can make up the majority of independently trainable parameters of the model.
As discussed in the previous section, this parametrization can be wasteful. Specifically, this parametrization implies that, at each step, all the information in each state component can affect all the information in any state component at the next step. That is, the computation performed at each step is essentially fully global. Classical physical systems, however, consist of spatially separated parts with primarily local interactions; long-distance interactions are possible but they tend to be limited by propagation delays, bandwidth and noise. Therefore it may be beneficial to bias our model class towards models that tend to adhere to these physical constraints by using a parametrization which reduces the number of parameters required to represent them.
We can accomplish this low-dimensional parametrization by imposing some constraints on the n × n matrices that parametrize the state transitions. One way of doing this is to impose a convolutional structure on these matrices, which corresponds to strict locality and periodicity constraints as in a cellular automaton. These constraints may work well in certain domains such as vision, but may be overly restrictive in other domains.
We propose instead to impose a low-rank constraint on these matrices. This is easily accomplished by rewriting each of these matrices as the product of two matrices where the inner dimension d is a model hyperparameter. For instance, in the case of the Highway Network of eq. 6 we can redefine, $\forall t \in 1, \ldots, T$,

$$\theta^{(W\pi)}_t \equiv \theta^{(L\pi)}_t \theta^{(R\pi)}_t, \qquad \theta^{(W\tau)}_t \equiv \theta^{(L\tau)}_t \theta^{(R\tau)}_t \tag{7}$$

where $\theta^{(L\pi)}_t, \theta^{(L\tau)}_t \in \mathbb{R}^{n \times d}$ and $\theta^{(R\pi)}_t, \theta^{(R\tau)}_t \in \mathbb{R}^{d \times n}$. When d < n/2 this results in a reduction of the number of independent parameters of the model. This low-rank constraint can be thought of as a bandwidth constraint on the computation performed at each step: the R matrices first project the state into a smaller subspace, extracting the information needed for that specific step, then the L matrices project it back to the original state space, spreading the selected information to all the state components that need to be updated.
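The parameter saving and the "bandwidth" interpretation of the low-rank constraint can be checked numerically. The sketch below is only an illustration under assumed names (`L`, `R`, `low_rank_param_count`); it uses a small n and d for the numerical part and the n = 1024, d = 256 setting of the MNIST experiment for the parameter count.

```python
import numpy as np


def dense_param_count(n):
    """Entries of an unconstrained n x n matrix."""
    return n * n


def low_rank_param_count(n, d):
    """Entries of the factors L (n x d) and R (d x n)."""
    return 2 * n * d


rng = np.random.default_rng(0)
n, d = 64, 8
L = rng.standard_normal((n, d))
R = rng.standard_normal((d, n))
x = rng.standard_normal(n)

# R projects the state into a d-dimensional subspace (the bottleneck),
# then L spreads the selected information back over all n components.
bottleneck = R @ x
y = L @ bottleneck

# The same result is obtained from the explicit n x n product,
# whose rank cannot exceed d.
W = L @ R
rank_W = np.linalg.matrix_rank(W)
```

Applying `R` and then `L` costs O(nd) operations per step and never materializes the n × n product. The break-even point d = n/2 follows directly from comparing 2nd with n².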
Note that if we were to apply this constraint to a non-passthrough architecture, such as a Multi-Layer Perceptron or an Elman Recurrent Neural Network, it would create an information bottleneck within each layer, effectively reducing the memory capacity of the model. But in a passthrough architecture the memory capacity is unaffected, since the state passthrough takes care of propagating all the information that does not need to be updated during one step to the next step. Therefore we exploit the decoupling property of the state passthrough. A similar approach has been proposed for the LSTM architecture by Sak et al. (2014), although they force the R matrices to be the same for all the functions of the state transition, while we allow each parameter matrix to be parametrized independently by a pair of R and L matrices.
Low-rank passthrough architectures are universal in that they retain the same representation classes of their parent architectures. This is trivially true if the inner dimension d is allowed to be O(n) in the worst case, and for some architectures even if d is held constant. For instance, it is easily shown that for any Highway Network with state size n and T hidden layers and for any ε > 0, there exists a Low-rank Highway Network with d = 1, state size at most 2n and at most nT layers that computes the same function within an ε margin of error.

LOW-RANK PLUS DIAGONAL PASSTHROUGH NETWORKS
As we show in the experimental section, on some tasks the low-rank constraint may prove to be excessively restrictive if the goal is to train a model with fewer parameters than one with arbitrary matrices. A simple extension is to add to each low-rank parameter matrix a diagonal parameter matrix, yielding a matrix that is full-rank but still parametrized in a low-dimensional space. For instance, for the Highway Network architecture we modify eq. 7 to

$$\theta^{(W\pi)}_t \equiv \theta^{(L\pi)}_t \theta^{(R\pi)}_t + \theta^{(D\pi)}_t, \qquad \theta^{(W\tau)}_t \equiv \theta^{(L\tau)}_t \theta^{(R\tau)}_t + \theta^{(D\tau)}_t \tag{8}$$

where $\theta^{(D\pi)}_t$ and $\theta^{(D\tau)}_t$ are diagonal n × n matrices.

Low-rank plus diagonal decompositions have been used for over a century in factor analysis in statistics (Spearman, 1904), system identification (Kalman, 1982) and other applications. They arise naturally in the estimation of linear relationships between variables from noisy measurements, under certain independence assumptions on the measurement noise. Refer to Saunderson et al. (2012) and Ning et al. (2015) for a review.
At first, it may seem that adding diagonal parameter matrices is redundant in passthrough networks. After all, the state passthrough itself can be considered as a diagonal matrix applied to the state vector, which is then additively combined with the new proposed state computed by the $f_\pi$ function. However, since the state passthrough completely skips over all non-linear activation functions (except in the Residual Network architecture, where it only skips over some of them), these formulations are not equivalent. In particular, the low-rank plus diagonal parametrization may help in recurrent neural networks which receive input at each time step, since it allows each component of the state vector to directly control how much input signal is inserted into it at each step. We demonstrate the effectiveness of this model in the sequence copy tasks described in the experiments section.
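To make the claim "full-rank but still parametrized in a low-dimensional space" concrete, here is a small NumPy sketch under assumed names (`L`, `R`, `diag`, `apply_low_rank_plus_diagonal`). It is an illustration only: the random factors stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 4

L = rng.standard_normal((n, d))       # n x d factor
R = rng.standard_normal((d, n))       # d x n factor
diag = rng.uniform(0.5, 1.5, size=n)  # entries of the diagonal matrix

W_low_rank = L @ R
W_low_rank_plus_diag = W_low_rank + np.diag(diag)


def apply_low_rank_plus_diagonal(x):
    """Apply (L R + D) to x without forming the n x n matrix."""
    return L @ (R @ x) + diag * x


x = rng.standard_normal(n)
y = apply_low_rank_plus_diagonal(x)

rank_low_rank = np.linalg.matrix_rank(W_low_rank)
rank_with_diag = np.linalg.matrix_rank(W_low_rank_plus_diag)
param_count = L.size + R.size + diag.size   # 2nd + n
```

The diagonal adds only n parameters per matrix (here 576 in total against 4096 for a dense matrix), yet it lifts the rank from d to n for generic parameter values.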

EXPERIMENTS
In this section we report a preliminary experiment on Low-rank Highway Networks on the MNIST dataset and several experiments on Low-rank GRUs.

LOW-RANK HIGHWAY NETWORKS
We applied the low-rank and low-rank plus diagonal Highway Network architecture to the classic benchmark task of handwritten digit classification on the MNIST dataset. We used the low-rank architecture described by equations 6 and 7, with T = 5 hidden layers, ReLU activation function, state dimension n = 1024 and maximum rank (internal dimension) d = 256. The input-to-state layer is a dense 784 × 1024 matrix followed by a (biased) ReLU activation and the state-to-output layer is a dense 1024 × 10 matrix followed by a (biased) identity activation. We did not use any convolution layer, pooling layer or data augmentation technique.
We used dropout (Srivastava et al., 2014) in order to achieve regularization. We applied standard dropout layers with dropout probability p = 0.2 just before the input-to-state layer and p = 0.5 just before the state-to-output layer. We also applied dropout inside each hidden layer in the following way: we inserted dropout layers with p = 0.3 inside both the proposal function and the transform function, immediately before both the R matrices and the L matrices, totaling four dropout layers per hidden layer, although the random dropout matrices are shared between proposal and transform functions. Dropout applied this way does not disrupt the state passthrough, thus it does not cause a reduction of memory capacity during training. We further applied L2-regularization with coefficient $\lambda = 1 \times 10^{-3}$ per example on the hidden-to-output parameter matrix.
We also used batch normalization (Ioffe & Szegedy, 2015) after the input-to-state matrix and after each parameter matrix in the hidden layers.
Parameter matrices are randomly initialized using a uniform distribution with scale equal to 6/a where a is the input dimension. Initial bias vectors are all initialized at zero except for those of the transform functions in the hidden layers, which are initialized at −1.0.
We trained to minimize the sum of the per-class L2-hinge loss plus the L2-regularization cost (Tang, 2013). Optimization was performed using Adam (Kingma & Ba, 2014) with standard hyperparameters, learning rate starting at $3 \times 10^{-3}$ and halving every three epochs without validation improvements. Mini-batch size was equal to 100. Code is available online. We ran our experiments on a machine with a 24 core Intel(R) Xeon(R) CPU X5670 2.93GHz, 24 GB of RAM. We did not use a GPU. Training took approximately 4 hours.
We obtained perfect training accuracy and 98.83% test accuracy. While this result does not reach the state of the art for this task (99.13% test accuracy with unsupervised dimensionality reduction reported by Tang (2013)), it is still relatively close.
We also tested the low-rank plus diagonal Highway Network architecture of eq. 8 with the same settings as above, obtaining a test accuracy of 98.64%. The inclusion of diagonal parameter matrices does not seem to help in this particular task.

LOW-RANK GRUS
We applied the Low-rank and Low-rank plus diagonal GRU architectures to a subset of sequential benchmarks described in the Unitary Evolution Recurrent Neural Networks article by Arjovsky et al. (2015), specifically the memory task, the addition task and the sequential randomly permuted MNIST task. For the memory tasks, we also considered two different variants proposed by Danihelka et al. (2016) and Henaff et al. (2016) which are hard for the uRNN architecture.
We chose to compare against the uRNN architecture because it set state of the art results in terms of both data complexity and accuracy and because it is an architecture with similar design objectives as low-rank passthrough architectures, namely a low-dimensional parametrization and the mitigation of the vanishing gradient problem, but it is based on quite different principles (it does not use a state passthrough as defined in this work, instead it relies on the reversibility and norm-preservation properties of unitary matrices in order to preserve state information between time steps, and uses a multiplicative unitary decomposition in order to achieve low-dimensional parametrization).
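The GRU architecture (Cho et al., 2014b) is a passthrough recurrent neural network defined as

$$\begin{aligned} \mathrm{in}(\theta) &= \theta_{in} \\ f_\gamma(x, t, u, \theta) &= \sigma\left(\theta^{(U\gamma)} u(t) + \theta^{(W\gamma)} x + \theta^{(b\gamma)}\right) \\ f_\omega(x, t, u, \theta) &= \sigma\left(\theta^{(U\omega)} u(t) + \theta^{(W\omega)} x + \theta^{(b\omega)}\right) \\ f_\tau(x, t, u, \theta) &= 1 - f_\gamma(x, t, u, \theta) \\ f_\pi(x, t, u, \theta) &= \tanh\left(\theta^{(U\pi)} u(t) + \theta^{(W\pi)} \left(x \odot f_\omega(x, t, u, \theta)\right) + \theta^{(b\pi)}\right) \end{aligned} \tag{9}$$

Note that with respect to the definition of the Highway Network architecture of eq. 6, the initial state $\theta_{in}$ is a model parameter, there is an additional function $f_\omega$ (the "reset" gate), parameters don't depend on time t and input u(t) is included in the computation at each step through the $\theta^{(U)}$ matrices.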
We have also defined the transform function f τ in terms of the carry function f γ rather than vice versa for consistency with the literature, although the two formulations are isomorphic.
We turn this architecture into the Low-rank GRU architecture by redefining each of the $\theta^{(W)}$ matrices as the product of two matrices with inner dimension d. For the memory tasks, which turned out to be difficult for the low-rank parametrization, we also consider the low-rank plus diagonal parametrization. We also applied the low-rank plus diagonal parametrization for the sequential permuted MNIST task.
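A single step of a Low-rank GRU can be sketched as follows. This is a hedged NumPy illustration, not the published code: the parameter container, the function names and the initialization scales are all assumptions, and the reset gate is assumed to multiply the state before the proposal's recurrent matrix is applied. The sizes n = 128, d = 24 and the two-dimensional input match the addition task setting.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def init_low_rank_gru(n, m, d, rng, carry_bias=4.0):
    """Hypothetical parameter container for a Low-rank GRU.

    For each of the carry ("update"), reset and proposal functions:
    input matrix U (n x m), factors L (n x d) and R (d x n), bias b (n).
    """
    params = {}
    for name in ("carry", "reset", "proposal"):
        params[name] = {
            "U": rng.standard_normal((n, m)) / np.sqrt(m),
            "L": rng.standard_normal((n, d)) / np.sqrt(d),
            "R": rng.standard_normal((d, n)) / np.sqrt(n),
            "b": np.zeros(n),
        }
    # A positive carry bias keeps the state mostly unchanged at first.
    params["carry"]["b"][:] = carry_bias
    return params


def affine(p, u, x):
    """U u + L (R x) + b, without forming the n x n product L R."""
    return p["U"] @ u + p["L"] @ (p["R"] @ x) + p["b"]


def low_rank_gru_step(x, u, params):
    """One state transition with a coupled passthrough (transform = 1 - carry)."""
    carry = sigmoid(affine(params["carry"], u, x))
    reset = sigmoid(affine(params["reset"], u, x))
    proposal = np.tanh(affine(params["proposal"], u, x * reset))
    return proposal * (1.0 - carry) + x * carry


rng = np.random.default_rng(0)
n, m, d = 128, 2, 24
params = init_low_rank_gru(n, m, d, rng)

x = np.zeros(n)
for u in rng.uniform(size=(5, m)):   # five dummy input vectors
    x = low_rank_gru_step(x, u, params)

factor_entries = sum(
    params[k]["L"].size + params[k]["R"].size
    for k in ("carry", "reset", "proposal")
)
```

A low-rank plus diagonal variant would add a term `diag * x` (with `diag` a length-n vector) inside `affine`; that extension is omitted here to keep the sketch short.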
In our experiments we optimized using RMSProp (Tieleman & Hinton, 2012) with gradient component clipping at 1. Code is available online. Our code is based on the published uRNN code (specifically, on the LSTM implementation) by the original authors for the sake of a fair comparison. In order to achieve convergence on the memory task, however, we had to slightly modify the optimization procedure: specifically, we replaced gradient component clipping with gradient norm clipping (with NaN detection and recovery), and we added a small $\epsilon = 1 \times 10^{-8}$ term in the parameter update formula. No modifications of the original optimizer implementation were required for the other tasks.
We ran our experiments on the same machine as the experiments described in the previous section, with the exception of the largest sequential permuted MNIST experiment (low-rank plus diagonal GRU with n = 256, d = 24), which was run on a machine with a Geforce GTX TITAN X GPU.
We will now present a short description of each task, the experimental details and results.

MEMORY TASK
The input of an instance of this task is a sequence of T = N + 20 discrete symbols in a ten symbol alphabet $\{a_i : i \in 0, \ldots, 9\}$, encoded as one-hot vectors. The first 10 symbols in the sequence are "data" symbols i.i.d. sampled from $a_0, \ldots, a_7$, followed by N − 1 "blank" $a_8$ symbols, then a distinguished "run" symbol $a_9$, followed by 10 more "blank" $a_8$ symbols. The desired output sequence consists of N + 10 "blank" $a_8$ symbols followed by the 10 "data" symbols as they appeared in the input sequence. Therefore the model has to remember the 10 "data" symbol string over the temporal gap of size N, which is challenging for a recurrent neural network when N is large. In our experiment we set N = 500, which is the hardest setting explored in the uRNN work. The training set consists of 100,000 training examples and 10,000 validation/test examples.
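The task layout can be made concrete with a small generator. The sketch below follows the description above; the function names (`make_memory_example`, `one_hot`) are illustrative and are not taken from the published code.

```python
import numpy as np

BLANK, RUN = 8, 9   # indices of the "blank" and "run" symbols


def make_memory_example(N, rng):
    """Return (inputs, targets) as integer symbol sequences of length N + 20."""
    data = rng.integers(0, 8, size=10)           # 10 data symbols from a_0..a_7
    inputs = np.concatenate([
        data,
        np.full(N - 1, BLANK),                   # N - 1 blanks
        np.array([RUN]),                         # the "run" marker
        np.full(10, BLANK),                      # 10 more blanks
    ])
    targets = np.concatenate([
        np.full(N + 10, BLANK),                  # N + 10 blanks
        data,                                    # then the data symbols
    ])
    return inputs, targets


def one_hot(symbols, alphabet_size=10):
    """Encode an integer sequence as one-hot row vectors."""
    return np.eye(alphabet_size)[symbols]


rng = np.random.default_rng(0)
N = 500
inputs, targets = make_memory_example(N, rng)
encoded = one_hot(inputs)
```

For N = 500 both sequences have length 520, the "run" symbol sits at position 509 (zero-based), and the last 10 targets equal the first 10 inputs.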
The architecture is described by eq. (9), with an additional output layer with a dense n × 10 matrix followed by a (biased) softmax. We train to minimize the cross-entropy loss.
We were able to solve this task using a GRU with full recurrent matrices with state size n = 128, learning rate $1 \times 10^{-3}$, mini-batch size 20, initial bias of the carry functions (the "update" gates) 4.0. However, this model has many more parameters (nearly 50,000 in the recurrent layer alone) than the uRNN model, which has about 6,500, and it converges much more slowly than the uRNN.
We were not able to achieve convergence with a pure low-rank model without exceeding the number of parameters of the fully connected model, but we achieved fast convergence with a low-rank plus diagonal model with d = 50, with other hyperparameters set as above. This model still has more parameters (39,168 in the recurrent layer, 41,738 total) than the uRNN model and converges more slowly but still reasonably fast, reaching test cross-entropy < $1 \times 10^{-3}$ nats and almost perfect classification accuracy in less than 35,000 updates.
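The recurrent-layer parameter counts quoted for this task can be reproduced with simple arithmetic. The accounting below is an assumption: it counts the three recurrent matrices (carry, reset, proposal) and their three bias vectors, and excludes input matrices, the initial state and the output layer. Under that assumption it matches the reported 39,168 and the "nearly 50,000" of the full-rank model.

```python
def gru_recurrent_param_count(n, d=None, diagonal=False):
    """Parameters of three recurrent matrices plus three bias vectors.

    d=None means full n x n matrices; otherwise each matrix is stored as
    an n x d and a d x n factor, plus n diagonal entries if requested.
    """
    if d is None:
        per_matrix = n * n
    else:
        per_matrix = 2 * n * d + (n if diagonal else 0)
    return 3 * per_matrix + 3 * n


full_rank = gru_recurrent_param_count(128)
low_rank_plus_diag = gru_recurrent_param_count(128, d=50, diagonal=True)
print(full_rank, low_rank_plus_diag)   # 49536 39168
```

For n = 128 the pure low-rank model exceeds the full-rank count once d > 64, which is consistent with the remark that it could not be made to converge within the parameter budget of the fully connected model.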
We also consider two variants of this task which are difficult for the uRNN model.For both these tasks we used the same settings as above except that the task size parameter is set at N = 100 for consistency with the works that introduced these variants.
In the variant of Danihelka et al. (2016), the length of the sequence to be remembered is randomly sampled between 1 and 10 for each sequence. They manage to achieve fast convergence with their Associative LSTM architecture with 65,505 parameters, and slower convergence with standard LSTM models. Our low-rank plus diagonal GRU architecture, which has fewer parameters than their Associative LSTM, performs comparably or better, reaching test cross-entropy < $1 \times 10^{-3}$ nats and almost perfect classification accuracy in less than 30,000 updates.
In the variant of Henaff et al. (2016), the length of the sequence to be remembered is fixed at 10 but the model is expected to copy it after a variable number of time steps randomly chosen, for each sequence, between 1 and N = 100. The authors achieve slow convergence with a standard LSTM model, while our low-rank plus diagonal GRU architecture achieves fast convergence, reaching test cross-entropy < $1 \times 10^{-3}$ nats and almost perfect classification accuracy in less than 38,000 updates, and perfect test accuracy in 87,000 updates.

ADDITION TASK
For each instance of this task, the input sequence has length T and consists of two real-valued components: at each step the first component is independently sampled from the interval [0, 1] with uniform probability, while the second component is equal to zero everywhere except at two randomly chosen time steps, one in each half of the sequence, where it is equal to one. The result is a single real value computed from the final state, which we want to be equal to the sum of the two elements of the first component of the sequence at the positions where the second component was set to one. In our experiment we set T = 750. The training set consists of 100,000 training examples and 10,000 validation/test examples. We use a Low-rank GRU with 2 × n input matrix, n × 1 output matrix and (biased) identity output activation. We train to minimize the mean squared error loss. We use the following hyperparameter configuration: state size n = 128, maximum rank d = 24. This results in approximately 6… parameters in the recurrent hidden layer. Learning rate was set at $1 \times 10^{-3}$, mini-batch size 20, initial bias of the carry functions (the "update" gates) was set to 4.
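An instance of the addition task can be generated as in the following sketch. The function name and the return layout (a T × 2 array plus a scalar target) are assumptions made for illustration; T = 750 is the sequence length used in the experiment.

```python
import numpy as np


def make_addition_example(T, rng):
    """Return (inputs, target): inputs has shape (T, 2), target is a float."""
    values = rng.uniform(0.0, 1.0, size=T)   # first component
    markers = np.zeros(T)                    # second component
    first = rng.integers(0, T // 2)          # one marker in the first half
    second = rng.integers(T // 2, T)         # one marker in the second half
    markers[first] = 1.0
    markers[second] = 1.0
    inputs = np.stack([values, markers], axis=1)
    target = values[first] + values[second]
    return inputs, target


rng = np.random.default_rng(0)
T = 750
inputs, target = make_addition_example(T, rng)
```

The target equals the dot product of the two input columns, which gives a quick consistency check on generated data.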
We trained on 14,500 mini-batches, obtaining a mean squared error on the test set of 0.003, which is a better result than the one reported in the uRNN article, in terms of training time and final accuracy.

SEQUENTIAL MNIST TASK
This task consists of handwritten digit classification on the MNIST dataset with the caveat that the input is presented to the model one pixel value at a time, over T = 784 time steps. To further increase the difficulty of the task, the inputs are reordered according to a random permutation (fixed for all the task instances).
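The input preparation for this task amounts to flattening each image and applying one fixed permutation. The sketch below uses a random array as a stand-in for an MNIST digit; the function name and the seed are illustrative assumptions.

```python
import numpy as np


def to_permuted_sequence(image, permutation):
    """Flatten a 28 x 28 image and reorder its pixels.

    Returns an array of shape (784, 1): one scalar input per time step.
    """
    return image.reshape(-1)[permutation][:, None]


rng = np.random.default_rng(0)
permutation = rng.permutation(28 * 28)   # drawn once, reused for every image
image = rng.uniform(size=(28, 28))       # stand-in for an MNIST digit
sequence = to_permuted_sequence(image, permutation)

# The permutation only reorders pixels, so it can be inverted exactly.
inverse = np.argsort(permutation)
recovered = sequence[:, 0][inverse].reshape(28, 28)
```

Because the same permutation is shared by all instances, no information is lost; what the permutation removes is the spatial locality that a sequential model could otherwise exploit.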
We use a Low-rank GRU with 1 × n input matrix, n × 10 output matrix and (biased) softmax output activation.
Learning rate was set at $5 \times 10^{-4}$, mini-batch size 20, initial bias of the carry functions (the "update" gates) was set to 5.
Configuration 1 reaches a validation accuracy of 93.4% in 320,400 iterations. Final test accuracy is 91.8%. The reported uRNN accuracy is 91.4%. Our model however takes 100,500 iterations to reach a validation accuracy comparable to the final accuracy of the uRNN model, which is instead reached in about 20,000 iterations. Configuration 2 reaches a validation accuracy of 92.5% in 464,700 iterations, with test accuracy of 91.3%. Note that even with the rather extreme bottleneck of d = 4, this model performs well.
For this task, we also consider three low-rank plus diagonal parametrizations.We report the best validation accuracy and test accuracy results, in addition to the results for a full-rank baseline GRU: Note that the low-rank plus diagonal GRU is more accurate than the full rank GRU with the same state size, while the low-rank GRU is slightly less accurate, indicating the utility of the diagonal component of the parametrization for this task.
These results surpass the uRNN and are on par with more complex architectures with time-skip connections (Zhang et al., 2016) (reported test set accuracy 94.0%). To our knowledge, at the time of this writing, the best result on this task is the LSTM with recurrent batch normalization by Cooijmans et al. (2016) (reported test set accuracy 95.2%). The architectural innovations of these works are orthogonal to our own and in principle they can be combined with it.

CONCLUSIONS AND FUTURE WORK
We presented a framework that unifies the description of various types of recurrent and feed-forward neural networks as passthrough neural networks.
We proposed low-dimensional parametrizations for passthrough neural networks based on low-rank or low-rank plus diagonal decompositions of the n × n matrices that occur in the hidden layers.
We experimentally compared our models with state of the art models, obtaining competitive results including a near state of the art result for the randomly-permuted sequential MNIST task.
Our parametrizations are alternative to convolutional parametrizations explored by Srivastava et al. (2015); He et al. (2015); Kaiser & Sutskever (2015). We note that the two approaches can be combined in at least two ways:
• A low-rank (plus diagonal) decomposition (with a suitable axis reshaping) can be applied to convolutional filter banks when the number of channels is large.
• The "local" state acted on by the convolutional passthrough filters can be paired with a "global" state acted on by low-rank (plus diagonal) passthrough matrices. The global state is replicated on additional channels to update the local state and the local state is pooled to update the global state. This arrangement may be useful in particular in the Neural GPU (Kaiser & Sutskever, 2015) in order to augment the cellular automaton with "global variables", which would otherwise need to be replicated on the cell states and threaded over the computation.
Low-rank and low-rank plus diagonal parametrizations are linear. Alternative parametrizations could include non-linear activation functions, effectively replacing each hidden parameter matrix with an MLP, similar to the network-in-network approach of Lin et al. (2013).
We leave the exploration of these extensions to future work.

Figure 3: Low-rank plus diagonal GRU on the fixed-length sequence copy task with fixed lag. Cross-entropy on validation set.

Figure 4: Low-rank plus diagonal GRU on the variable-length sequence copy task with fixed lag. Cross-entropy on validation set.
our experiment we set T = 750.The training set consists of 100, 000 training examples and 10, 000 validation/test examples.We use a Low-rank GRU with 2 × n input matrix, n × 1 output matrix and (biased) identity output activation.We train to minimize the mean squared error loss.We use the following hyperparameter configuration: State size n = 128, maximum rank d = 24.This results in approximately 6

Figure 5: Low-rank plus diagonal GRU on the fixed-length sequence copy task with variable lag. Cross-entropy on validation set. (Note that the axis scale is different from Figures 3 and 4.)

Figure 7: Low-rank GRU on the permuted sequential MNIST task. Accuracy on validation set. Horizontal line indicates 90% accuracy.

Figure 8: Low-rank plus diagonal GRU and baseline GRU on the permuted sequential MNIST task. Accuracy on validation set.