Implicitly-Defined Neural Networks for Sequence Labeling

In this work, we propose a novel, implicitly-defined neural network architecture and describe a method for computing its components. The proposed architecture forgoes the causality assumption used to formulate recurrent neural networks and instead couples the hidden states of the network, allowing improved performance on problems with complex, long-distance dependencies. Initial experiments demonstrate that the new architecture outperforms both the Stanford Tagger and baseline bidirectional networks on the Penn Treebank Part-of-Speech tagging task, and outperforms a baseline bidirectional network on an artificial, biased random-walk task.


Introduction
Feedforward neural networks were designed to approximate and interpolate functions. Recurrent neural networks (RNNs) were developed to predict sequences. RNNs can be 'unwrapped' and thought of as very deep feedforward networks, with weights shared between layers. Computation proceeds one step at a time, like the trajectory of an ordinary differential equation solving an initial value problem: the path depends only on the current state and the current value of the forcing function. In an RNN, the analogues are the current hidden state and the current input element.

However, in certain applications in natural language processing, especially those with long-distance dependencies or where grammar matters, sequence prediction may be better thought of as a boundary value problem. Changing the value of the forcing function (analogously, of an input sequence element) at any point in the sequence will affect the values everywhere else. The bidirectional recurrent network (Schuster and Paliwal, 1997) attempts to address this problem by creating a network with two recurrent hidden states, one that progresses in the forward direction and one that progresses in reverse. This allows information to flow in both directions, but each state can only consider information from one direction. In practice, many algorithms require more than two passes through the data to determine an answer. We provide a novel mechanism that is able to process information in both directions; the motivation is a program that iterates over itself until convergence.

Related Work
Bidirectional, long-distance dependencies in sequences have been an issue as long as there have been NLP tasks, and there are many approaches to dealing with them.
Hidden Markov models (HMMs) (Rabiner, 1989) have been used extensively for sequence-based tasks, but they rely on the Markov assumption: a hidden variable changes its state based only on its current state and observables. In finding maximum-likelihood state sequences, the Forward-Backward algorithm can take the entire set of observables into account, but the underlying model is still local.
In recent years, the popularity of the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and variants such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) has soared, as they enable RNNs to process long sequences without the problem of vanishing or exploding gradients (Pascanu et al., 2013). However, these models allow information, and gradients, to flow only in the forward direction.
The Bidirectional LSTM (b-LSTM) (Graves and Schmidhuber, 2005), a natural extension of Schuster and Paliwal (1997), incorporates past and future hidden states via two separate recurrent networks, allowing information and gradients to flow in both directions of a sequence. This is a very loose coupling, however.
In contrast to these methods, our work goes a step further, fully coupling the entire sequence of hidden states of an RNN. Our work is similar to that of Finkel et al. (2005), which augments a CRF with long-distance constraints; however, we extend an RNN and use a Newton-Krylov solver (Knoll and Keyes, 2004) instead of Gibbs sampling.

Traditional Recurrent Neural Networks
A typical recurrent neural network has a (possibly transformed) input sequence $[\xi_1, \xi_2, \ldots, \xi_n]$ and an initial state $h_s$, and iteratively produces future states:

$$h_t = f(\xi_t, h_{t-1}), \qquad t = 1, \ldots, n, \quad h_0 = h_s.$$

The LSTM, GRU, and related variants follow this formula, with different choices for the state transition function $f$. Computation proceeds linearly, with each state depending only on the inputs and previously computed hidden states.

Proposed Architecture
In this work, we relax this assumption by allowing $h_t = f(\xi_t, h_{t-1}, h_{t+1})$. (A wider stencil can also be used, e.g. $f(h_{t-2}, h_{t-1}, \ldots)$.) This leads to an implicit set of equations for the entire sequence of hidden states, which can be thought of as a single tensor $H = [h_1, h_2, \ldots, h_n]$, and hence to a system of nonlinear equations. This setup has the potential to arrive at nonlocal, whole-sequence-dependent results. We also hope such a system is more 'stable', in the sense that the predicted sequence may drift less from the true meaning, since errors do not compound with each time step in the same way.
There are many potential ways to architect a neural network (in fact, this flexibility is one of deep learning's best features), but we restrict our discussion to the structure depicted in Figure 2. In this setup, we have the variables: data $X$, labels $Y$, and parameters $\theta$; and the functions: an input transformation producing the sequence $\xi$ from $X$, the implicit hidden-state definition $F$, and an output/loss layer $L$ mapping hidden states to predicted labels. Our implicit definition function, $F$, is made up of local state transitions and forms a system of nonlinear equations that requires solving. Denoting the length of the input sequence by $n$ and the boundary states by $h_s, h_e$:

$$F(H) = \big[\, f(\xi_1, h_s, h_2),\; f(\xi_2, h_1, h_3),\; \ldots,\; f(\xi_{n-1}, h_{n-2}, h_n),\; f(\xi_n, h_{n-1}, h_e)\,\big].$$
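For concreteness, the sketch below builds the map $H \mapsto F(H)$ for a toy transition function. The function `transition`, the weight matrices, the dimensions, and the zero boundary states are illustrative assumptions, not the paper's implementation; the GRU-based transition actually used is given in the Transition Functions section.

```python
import numpy as np

k, n = 4, 6                      # hidden size, sequence length (toy values)
rng = np.random.default_rng(0)
W_x, W_p, W_f = (rng.normal(scale=0.1, size=(k, k)) for _ in range(3))
xi = rng.normal(size=(n, k))     # (possibly transformed) input sequence
h_s = h_e = np.zeros(k)          # fixed boundary states

def transition(x_t, h_prev, h_next):
    """Toy stand-in for f(xi_t, h_{t-1}, h_{t+1})."""
    return np.tanh(W_x @ x_t + W_p @ h_prev + W_f @ h_next)

def F(H):
    """Apply the local transition at every position, coupling the whole sequence."""
    out = np.empty_like(H)
    for t in range(n):
        h_prev = H[t - 1] if t > 0 else h_s
        h_next = H[t + 1] if t < n - 1 else h_e
        out[t] = transition(xi[t], h_prev, h_next)
    return out

H = np.zeros((n, k))
residual = H - F(H)              # the forward pass must drive this residual to zero
```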

Computing the forward pass
To evaluate the network, we must solve the equation $H = F(H)$. We compute this via an approximate Newton solve, in which we successively refine approximations $H^{(i)}$ of $H$:

$$H^{(i+1)} = H^{(i)} - \big(I - \nabla_H F\big)^{-1}\,\big(H^{(i)} - F(H^{(i)})\big),$$

with $\nabla_H F$ evaluated at $H^{(i)}$. Let $k$ be the dimension of a single hidden state. $(I - \nabla_H F)$ is a sparse matrix, since $\nabla_H F$ is zero except for $n$ pairs of $k \times k$ blocks, corresponding to the influence of the left and right neighbors of each state.
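A minimal sketch of this Newton iteration, continuing the toy `F` above; the dense finite-difference Jacobian and `numpy.linalg.solve` are used purely for exposition (the paper's solver is matrix-free), and the tolerances are illustrative assumptions.

```python
def jacobian_F(H, eps=1e-6):
    """Finite-difference Jacobian of F at H, shape (n*k, n*k); for exposition only."""
    base = F(H).ravel()
    J = np.zeros((H.size, H.size))
    for j in range(H.size):
        Hp = H.copy().ravel()
        Hp[j] += eps
        J[:, j] = (F(Hp.reshape(n, k)).ravel() - base) / eps
    return J

def newton_solve(H0, max_iter=40, tol=1e-8):
    H = H0.copy()
    for _ in range(max_iter):
        G = (H - F(H)).ravel()               # residual of H = F(H)
        if np.linalg.norm(G) < tol:
            break
        A = np.eye(H.size) - jacobian_F(H)   # I - grad_H F
        H = H - np.linalg.solve(A, G).reshape(n, k)
    return H

H_star = newton_solve(np.zeros((n, k)))
```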
Because of this sparsity, we can apply Krylov subspace methods (Knoll and Keyes, 2004), specifically the BiCG-STAB method (Van der Vorst, 1992), since the system is non-symmetric. This has the added advantage of relying only on matrix-vector products with $\nabla_H F$.
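In a matrix-free setting, the inner linear solve only needs products $(I - \nabla_H F)v$. The sketch below illustrates this with SciPy's `LinearOperator` and `bicgstab`, continuing the toy setup above and substituting a finite-difference directional derivative for an exact Jacobian-vector product; the operator construction and iteration cap are illustrative assumptions.

```python
from scipy.sparse.linalg import LinearOperator, bicgstab

def jvp_F(H, v, eps=1e-6):
    """Approximate (grad_H F) @ v via a forward finite difference."""
    return (F(H + eps * v.reshape(n, k)).ravel() - F(H).ravel()) / eps

def newton_krylov_step(H):
    """One Newton step, with the linear system solved matrix-free by BiCG-STAB."""
    G = (H - F(H)).ravel()
    A = LinearOperator((H.size, H.size),
                       matvec=lambda v: v - jvp_F(H, v))   # (I - grad_H F) v
    delta, info = bicgstab(A, G, maxiter=40)               # inner Krylov solve
    return H - delta.reshape(n, k)
```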

Gradients
In order to train the model, we perform stochastic gradient descent. We take the gradient of the loss function,

$$\nabla_\theta L = \nabla_H L \cdot \nabla_\theta H .$$

The gradient of the hidden units with respect to the parameters can be found via the implicit definition:

$$\nabla_\theta H = (I - \nabla_H F)^{-1} \nabla_\theta F,$$

where the factorization follows from noting that differentiating $H = F(H; \theta)$ gives $\nabla_\theta H = \nabla_H F \, \nabla_\theta H + \nabla_\theta F$. The entire gradient is thus

$$\nabla_\theta L = \nabla_H L \,(I - \nabla_H F)^{-1}\, \nabla_\theta F. \tag{1}$$

Once again, the inverse of $I - \nabla_H F$ appears, and we can compute its action via Krylov subspace methods. It is worth mentioning that the technique of computing parameter updates by implicit differentiation and conjugate gradients has been applied before, in the context of energy-minimization models for image labeling and denoising (Domke, 2012).
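As an illustration of Equation (1), the sketch below checks the implicit-differentiation formula against finite differences on a tiny scalar fixed-point problem $h = \tanh(\theta h + c)$; the problem, loss, and numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

theta, c = 0.5, 1.0

def solve_h(theta, iters=100):
    """Fixed-point solve of h = tanh(theta * h + c)."""
    h = 0.0
    for _ in range(iters):
        h = np.tanh(theta * h + c)
    return h

h = solve_h(theta)
loss = lambda h: 0.5 * h ** 2            # toy loss L(h)

# Implicit differentiation: dL/dtheta = dL/dh * (1 - dF/dh)^{-1} * dF/dtheta
u = 1.0 - np.tanh(theta * h + c) ** 2    # derivative of tanh at the fixed point
dF_dh, dF_dtheta = u * theta, u * h
grad_implicit = h * (1.0 / (1.0 - dF_dh)) * dF_dtheta

# Finite-difference check
eps = 1e-6
grad_fd = (loss(solve_h(theta + eps)) - loss(solve_h(theta - eps))) / (2 * eps)
print(grad_implicit, grad_fd)            # should agree to several digits
```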

Transition Functions
Recall the original GRU equations (Cho et al., 2014), with slight notational modifications:

$$
\begin{aligned}
z_t &= \sigma\big(W_z \xi_t + U_z \hat{h}_t\big) \\
r_t &= \sigma\big(W_r \xi_t + U_r \hat{h}_t\big) \\
\tilde{h}_t &= \tanh\!\big(W \xi_t + U (r_t \circ \hat{h}_t)\big) \\
h_t &= z_t \circ \hat{h}_t + (1 - z_t) \circ \tilde{h}_t .
\end{aligned}
$$

We make the following substitution for $\hat{h}_t$ (which was set to $h_{t-1}$ in the original GRU definition):

$$\hat{h}_t = s \circ h_{t-1} + (1 - s) \circ h_{t+1}. \tag{2}$$

This modification makes the architecture both implicit and bidirectional, since $\hat{h}_t$ is a linear combination of the previous and future hidden states. The switch variable $s$ is determined by a competition between two sigmoidal units $s_p$ and $s_n$, representing the contributions of the previous and next hidden states, respectively.
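A sketch of this modified transition is given below, assuming a concrete form for the switch ($s = s_p / (s_p + s_n)$ with sigmoidal $s_p, s_n$); the exact parameterization of the switch, the absence of bias terms, and the weight shapes are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def implicit_gru_step(params, xi_t, h_prev, h_next):
    """One application of f(xi_t, h_{t-1}, h_{t+1}) for the implicit GRU."""
    p = params
    # Competition between previous and next states (assumed parameterization).
    s_p = sigmoid(p['W_p'] @ xi_t + p['U_p'] @ h_prev)
    s_n = sigmoid(p['W_n'] @ xi_t + p['U_n'] @ h_next)
    s = s_p / (s_p + s_n)
    h_hat = s * h_prev + (1.0 - s) * h_next            # Equation (2)
    # Standard GRU gates, with h_hat in place of h_{t-1}.
    z = sigmoid(p['W_z'] @ xi_t + p['U_z'] @ h_hat)
    r = sigmoid(p['W_r'] @ xi_t + p['U_r'] @ h_hat)
    h_tilde = np.tanh(p['W'] @ xi_t + p['U'] @ (r * h_hat))
    return z * h_hat + (1.0 - z) * h_tilde

k = 4
rng = np.random.default_rng(1)
params = {name: rng.normal(scale=0.1, size=(k, k))
          for name in ['W_p', 'U_p', 'W_n', 'U_n', 'W_z', 'U_z', 'W_r', 'U_r', 'W', 'U']}
h_t = implicit_gru_step(params, rng.normal(size=k), np.zeros(k), np.zeros(k))
```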

Implementation Details
We implemented the implicit GRU structure using Theano (Bergstra et al., 2011). The products $\nabla_H F \cdot v$ for various $v$, required by the BiCG-STAB method, were computed via the Rop operator. In computing $\nabla_\theta L$ (Equation 1), we noted that it is more efficient to compute the row vector $\nabla_H L \,(I - \nabla_H F)^{-1}$ first, and thus used the Lop operator.
All experiments used a batch size of 20. To batch the linear solves, we simply solved a single, very large block-diagonal system of equations: each sequence in the batch contributed one block, and we passed the encompassing matrix to our Theano BiCG solver. (In practice the block-diagonal system is represented as a 3-tensor, but it is equivalent.) In this setup, each sequence receives its own update direction, but one global step length. $h_s$ and $h_e$ were fixed at zero, but could be trained as parameters.
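A sketch of this batching idea, using `scipy.sparse.block_diag` to stack per-sequence systems into a single solve; the sizes and the dense per-sequence matrices are illustrative assumptions (the paper represents the batch as a 3-tensor inside Theano).

```python
import numpy as np
from scipy.sparse import block_diag
from scipy.sparse.linalg import bicgstab

batch, n, k = 3, 5, 4
rng = np.random.default_rng(2)

# One (I - grad_H F) matrix and one residual vector per sequence in the batch.
A_blocks = [np.eye(n * k) - 0.1 * rng.normal(size=(n * k, n * k)) for _ in range(batch)]
residuals = [rng.normal(size=n * k) for _ in range(batch)]

A_big = block_diag(A_blocks, format='csr')   # one block-diagonal system
g_big = np.concatenate(residuals)
delta, info = bicgstab(A_big, g_big, maxiter=40)
deltas = delta.reshape(batch, n, k)          # per-sequence Newton directions
```

Because all sequences share one Krylov solve, the update directions are computed per block but the step length is effectively global, as noted above.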
In solving multiple simultaneous systems of equations, we noted that some elements converged significantly faster than others. For this reason, we found it helpful to run Newton's method from two separate initializations for each element in our batch, one selected randomly and the other set to a "one-step" approximation: hidden states of a traditional GRU were computed in both the forward ($h^f_i$) and reverse ($h^b_i$) directions, and $h_i$ was initialized to $f(\xi_i, h^f_{i-1}, h^b_{i+1})$. If either of the two candidates converged, we took its value and stopped computing the other. We also limited both the number of Newton iterations and the number of BiCG-STAB iterations per Newton iteration to 40.
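A sketch of this dual-initialization strategy, continuing the toy setup above; the stand-in forward/backward recurrences, the convergence test, and the alternation between candidates are illustrative assumptions.

```python
def one_step_init():
    """'One-step' initialization from ordinary forward and backward passes."""
    h_fwd = [h_s]                                               # h_fwd[i] ~ forward state at i-1
    for t in range(n):
        h_fwd.append(np.tanh(W_x @ xi[t] + W_p @ h_fwd[-1]))    # stand-in forward GRU
    h_bwd = [h_e]
    for t in reversed(range(n)):
        h_bwd.append(np.tanh(W_x @ xi[t] + W_f @ h_bwd[-1]))    # stand-in backward GRU
    h_bwd = h_bwd[::-1]                                         # h_bwd[i+1] ~ backward state at i+1
    return np.stack([transition(xi[i], h_fwd[i], h_bwd[i + 1]) for i in range(n)])

def solve_with_restarts(tol=1e-6, max_newton=40):
    candidates = [rng.normal(size=(n, k)), one_step_init()]
    for _ in range(max_newton):
        for i, H in enumerate(candidates):
            candidates[i] = H = newton_krylov_step(H)           # one Newton step per candidate
            if np.linalg.norm(H - F(H)) < tol:
                return H                                        # first candidate to converge wins
    return candidates[1]
```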

Biased random walks
We developed an artificial task with bidirectional sequence-level dependencies to explore the performance of our model. The task was to find the point at which a random walk, in the spirit of the Wiener process (Durrett, 2010), changes from zero mean to nonzero mean; we trained a network to predict when the walk is no longer unbiased. We generated the data algorithmically, as follows. First, we chose an integer sequence length $N$ uniformly in the range 1 to 40. Then, we chose a (continuous) change time $t' \in [0, N)$ and a direction $v \in \mathbb{R}^d$. We produced the input sequence $x_i \in \mathbb{R}^d$ by setting $x_0 = 0$ and iteratively computing $x_{i+1} = x_i + \mathcal{N}(0, 1)$. After time $t'$, a bias term $b \cdot v$ was added at each time step (with $b \cdot v \cdot (t - t')$ added at the first time step $t$ greater than $t'$), where $b$ is a global scalar parameter. The network was fed these elements and asked to predict $y = 0$ for times $t \le t'$ and $y = 1$ for times $t > t'$.
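A sketch of how such data might be generated; the dimension, the unit-norm direction, and treating the Gaussian increments as independent per component are assumptions for illustration.

```python
import numpy as np

def make_walk(rng, d=2, b=0.5):
    """Generate one biased-random-walk example: inputs x (N, d) and labels y (N,)."""
    N = int(rng.integers(1, 41))              # integer sequence length in [1, 40]
    t_change = rng.uniform(0, N)              # continuous change point t'
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)                    # direction (assumed unit norm)
    x = np.zeros((N, d))
    for i in range(1, N):
        x[i] = x[i - 1] + rng.normal(size=d)  # unbiased Gaussian increment
        if i > t_change:
            # partial bias at the first step past t', full bias b*v afterwards
            x[i] += b * v * min(i - t_change, 1.0)
    y = (np.arange(N) > t_change).astype(int) # y=0 for t <= t', y=1 for t > t'
    return x, y

rng = np.random.default_rng(3)
x, y = make_walk(rng)
```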
For each architecture, ξ was simply the unmodified input vectors, zero-padded to the embedding dimension size. The output layer was a simple binary logistic regression. We produced 50,000 random training examples, 2,500 random validation examples, and 5,000 random test examples. The implicit algorithm used a hidden dimension of 200, and the b-LSTM had a hidden dimension ranging from 100 to 1000; a b-LSTM dimension of 300 was the point at which the total numbers of parameters were roughly equal.
The results are shown in Table 1. The b-LSTM scores reported are the maximum over a sweep of hidden dimension sizes from 100 to 1500. The INN outperforms the best b-LSTM in the more challenging cases, where the bias size $b$ is small.

Part-of-speech tagging
We next applied our model to a real-world problem. Part-of-speech tagging fits naturally into the sequence labeling framework, and has the advantage of a standard dataset on which we can compare our network with other techniques. To train a part-of-speech tagger, we simply let $L$ be a softmax layer transforming each hidden unit output into a part-of-speech tag. Our input encoding, ξ, is a concatenation of three sets of features, adapted from (Huang et al., 2015): first, word vectors for 39,000 case-insensitive vocabulary words; second, six additional 'word vector' components indicating the presence of the top-2000 most common prefixes and suffixes of words, for affix lengths 2 to 4; and finally, eight other binary features indicating the presence of numbers, symbols, punctuation, and richer case information.
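A sketch of how the affix portion of this encoding could be assembled; the vocabulary handling, the reserved unknown id, and the example affix tables are assumptions, with only the overall shape of the features following the description above.

```python
def affix_features(word, prefix_vocab, suffix_vocab):
    """Six affix components: prefix/suffix ids for affix lengths 2-4.

    prefix_vocab / suffix_vocab map an affix string to an integer id among the
    top-2000 most common affixes (0 reserved for 'unknown').
    """
    feats = []
    for length in (2, 3, 4):
        feats.append(prefix_vocab.get(word[:length].lower(), 0))
        feats.append(suffix_vocab.get(word[-length:].lower(), 0))
    return feats   # 6 integer ids, each later mapped to a small embedding

# toy usage with hypothetical vocabularies
prefix_vocab = {'un': 1, 'unt': 2}
suffix_vocab = {'ed': 1, 'ing': 2}
print(affix_features('Untagged', prefix_vocab, suffix_vocab))
```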
We trained the part-of-speech (POS) tagger on the Penn Treebank Wall Street Journal corpus (Marcus et al., 1993), training on sections 0-18, validating on 19-21, and testing on 22-24, per convention. Training was done using stochastic gradient descent with an initial learning rate of 0.5; the learning rate was halved whenever validation perplexity increased. Word vectors were of dimension 320, and prefix and suffix vectors were of dimension 20. The hidden unit size was equal to the feature input size, in this case 448. As shown in Table 2, the INN outperformed baseline GRU, bidirectional GRU, LSTM, and b-LSTM networks, all with 628-dimensional hidden layers (1256 for the bidirectional architectures). The INN also outperformed the Stanford Part-of-Speech tagger (Toutanova et al., 2003) (model wsj-0-18-bidirectional-distsim.tagger from 10-31-2016). Note that performance gains past approximately 97% are difficult to obtain, due to errors and inconsistencies in the dataset, ambiguity, and complex linguistic constructions, including dependencies across sentence boundaries (Manning, 2011).

Time Complexity
The implicit experiments in this paper took approximately 3-5 days to run on a single Tesla K40, while the explicit experiments took approximately 1-3 hours. The running time of the solver is approximately $n_n \times n_b \times t_b$, where $n_n$ is the number of Newton iterations, $n_b$ is the number of BiCG-STAB iterations per Newton iteration, and $t_b$ is the time for a single BiCG-STAB iteration. $t_b$ is proportional to the number of non-zero entries in the matrix (Van der Vorst, 1992), in our case $n(2k^2 + 1)$. Newton's method has second-order convergence (Isaacson and Keller, 1994), and while the specific bound depends on the norm of $(I - \nabla_H F)^{-1}$ and the norms of its derivatives, convergence is well-behaved. For $n_b$, however, we are not aware of a bound. For symmetric matrices, the Conjugate Gradient method is known to take $O(\sqrt{\kappa})$ iterations (Shewchuk, 1994), where $\kappa$ is the condition number of the matrix. However, our matrix is non-symmetric, and we expect $\kappa$ to vary from problem to problem.

Because of this, we empirically estimated the correlation between sequence length and the total time to compute a batch of 20 hidden-layer states. For the random walk experiment with $b = 0.5$, we found the average run time for a given sequence length to be approximately $0.17\,n^{1.8}$, with $r^2 = 0.994$. Note that the exponent would have been larger had we not truncated the number of BiCG-STAB iterations to 40, as the inner iteration frequently hit this limit for larger $n$. However, the average number of Newton iterations did not exceed 10, indicating that exiting the BiCG-STAB loop early did not prevent the Newton solver from converging. Run times for the other random walk experiments were very similar, indicating that run time does not depend on $b$. For the POS task, however, the run time was approximately $0.29\,n^{1.3}$, with $r^2 = 0.910$.
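The empirical exponents above can be estimated with a log-log least-squares fit; the sketch below shows the procedure on synthetic timing data (the data points are invented purely to illustrate the fit, not measured results).

```python
import numpy as np

# synthetic (sequence length, runtime) pairs standing in for measured batch times
lengths = np.array([5, 10, 20, 40, 80])
times = 0.17 * lengths ** 1.8 * np.exp(np.random.default_rng(4).normal(0, 0.02, 5))

# fit time ~ c * n^p  <=>  log(time) = log(c) + p * log(n)
p, log_c = np.polyfit(np.log(lengths), np.log(times), 1)
pred = log_c + p * np.log(lengths)
ss_res = np.sum((np.log(times) - pred) ** 2)
ss_tot = np.sum((np.log(times) - np.log(times).mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"exponent p = {p:.2f}, coefficient c = {np.exp(log_c):.2f}, r^2 = {r2:.3f}")
```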

Conclusion and Future Work
We have introduced a novel, implicitly-defined neural network architecture based on the GRU and shown that it outperforms a b-LSTM on an artificial random-walk task and slightly outperforms both the Stanford Tagger and baseline bidirectional networks on the Penn Treebank Part-of-Speech tagging task.
In future work, we intend to consider implicit variations of other architectures, such as the LSTM, as well as additional, more challenging, and/or data-rich applications. We also plan to explore ways to speed up the computation of $(I - \nabla_H F)^{-1}$. Potential speedups include approximating the hidden state values by reducing the number of Newton and/or BiCG-STAB iterations, using cached previous solutions as initial values, and modifying the gradient update strategy to keep the batch full at every Newton iteration.