Rational Recurrences

Despite the tremendous empirical success of neural models in natural language processing, many of them lack the strong intuitions that accompany classical machine learning approaches. Recently, connections have been shown between convolutional neural networks (CNNs) and weighted finite state automata (WFSAs), leading to new interpretations and insights. In this work, we show that some recurrent neural networks also share this connection to WFSAs. We characterize this connection formally, defining rational recurrences to be recurrent hidden state update functions that can be written as the Forward calculation of a finite set of WFSAs. We show that several recent neural models use rational recurrences. Our analysis provides a fresh view of these models and facilitates devising new neural architectures that draw inspiration from WFSAs. We present one such model, which performs better than two recent baselines on language modeling and text classification. Our results demonstrate that transferring intuitions from classical models like WFSAs can be an effective approach to designing and understanding neural models.


Introduction
Neural models, and in particular gated variants of recurrent neural networks (RNNs, e.g., Hochreiter and Schmidhuber, 1997; Cho et al., 2014), have become a core building block for state-of-the-art approaches in NLP (Goldberg, 2016). While these models empirically outperform classical NLP methods on many tasks (Zaremba et al., 2014; Bahdanau et al., 2015; Dyer et al., 2016; Peng et al., 2017, inter alia), they typically lack the intuition offered by classical models, making it hard to understand the roles played by each of their components. In this work we show that many neural models are more interpretable than previously thought, by drawing connections to weighted finite state automata (WFSAs). We study several recently proposed RNN architectures and show that one can use WFSAs to characterize their recurrent updates. We call such models rational recurrences (§3). Analyzing recurrences in terms of WFSAs provides a new view of existing models and facilitates the development of new ones.

[Figure 1: A two-state WFSA B described in §2. It is closely related to several models studied in this paper (§4.1). Bold circles indicate initial states, and double circles final states, which are associated with final weights. Arrows represent transitions, labeled by the symbols α they consume and weighted as a function of α. Arcs not drawn are assumed to have weight 0. For brevity, ∀α means ∀α ∈ Σ, with Σ being the alphabet.]

In recent work, Schwartz et al. (2018) introduced SoPa, an RNN constructed from WFSAs, and thus rational by our definition. They also showed that a single-layer max-pooled CNN (LeCun, 1998) can be simulated by a set of simple WFSAs (one per output dimension), and so is also rational. In this paper we broaden such efforts, and show that rational recurrences are in frequent use (Mikolov et al., 2014; Balduzzi and Ghifary, 2016; Lei et al., 2016, 2017a,b; Bradbury et al., 2017; Foerster et al., 2017). For instance, we will show in §4 that the WFSA diagrammed in Figure 1 underlies the recurrent updates of several of these models.


Weighted Finite-State Automata

Definition 1 (semiring). A semiring is a tuple ⟨K, ⊕, ⊗, 0, 1⟩, where K is a set, ⊕ is a commutative and associative addition operation with identity 0, ⊗ is an associative multiplication operation with identity 1 that distributes over ⊕, and 0 is absorbing for ⊗ (i.e., 0 ⊗ a = a ⊗ 0 = 0 for any a ∈ K).

One common semiring is the real (or plus-times) semiring ⟨R, +, ×, 0, 1⟩. The other one used in this work is the max-plus semiring ⟨R ∪ {−∞}, max, +, −∞, 0⟩. We refer the reader to Kuich and Salomaa (1986) for others.

Definition 2. A weighted finite-state automaton (WFSA) over a semiring K is a 5-tuple A = ⟨Σ, Q, τ, λ, ρ⟩, with:
• a finite input alphabet Σ;
• a finite state set Q;
• transition weights τ : Q × Q × (Σ ∪ {ε}) → K;
• initial weights λ : Q → K;
• and final weights ρ : Q → K.

ε ∉ Σ marks special ε-transitions that may be taken without consuming any input. A assigns a score A(x) to a string x = x_1 ... x_n ∈ Σ* by summing over the scores of all possible paths deriving x. The score of each individual path is the product of the weights of the transitions it consists of. Formally:

Definition 3 (path score). Let π = π_1 ... π_n be a sequence of adjacent transitions in A, with each transition π_i = (q_i, q_{i+1}, z_i) ∈ Q × Q × (Σ ∪ {ε}). The path π derives the string x ∈ Σ*, which is the subsequence of z = z_1 z_2 ... z_n that excludes ε symbols (for example, if z = aεbcεεεd, then x = abcd). π's score in A is given by

  score(π) = λ(q_1) ⊗ ( ⊗_{i=1}^{n} τ(π_i) ) ⊗ ρ(q_{n+1}).

Definition 4 (string score). Let Π(x) denote the set of all paths in A that derive x. Then the score assigned by A to x is defined to be

  A(x) = ⊕_{π ∈ Π(x)} score(π).

Because K is a semiring, A(x) can be computed in time linear in |x| by the Forward algorithm (Baum and Petrie, 1966). Here, for simplicity, we describe the Forward algorithm without ε-transitions. Its dynamic program is given by

  Ω_0(q) = λ(q),
  Ω_i(q) = ⊕_{q′ ∈ Q} Ω_{i−1}(q′) ⊗ τ(q′, q, x_i),   for 1 ≤ i ≤ n,
  A(x) = ⊕_{q ∈ Q} Ω_n(q) ⊗ ρ(q),

where Ω_i(q) gives the total score of all paths that derive x_1 ... x_i and end in state q.
Example 5. Figure 1 diagrams a WFSA B, consisting of two states. A path starts from the initial state q_0 (with λ(q_0) = 1); it then takes any number of "self-loop" transitions, each consuming an input symbol without changing the path score (since it is weighted by 1); it then consumes an input symbol α and takes a transition weighted by µ(α), reaching the final state q_1 (with ρ(q_1) = 1); it may further consume more input by taking self-loops at q_1, updating the path score by multiplying it by φ(α) for each symbol α. Then from Definition 4, we can calculate that B gives the empty string the score 0, and gives any nonempty string x = x_1 ... x_n the score

  B(x) = Σ_{j=1}^{n} µ(x_j) · Π_{k=j+1}^{n} φ(x_k).

B can be seen as capturing soft unigram patterns (Davidov et al., 2010), in the sense that it consumes one input symbol to reach the final state from the initial state. It is straightforward to design WFSAs capturing longer patterns by including more states (Schwartz et al., 2018), as we will discuss later in §4 and §5.
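To make this concrete, the following sketch (our own; not code from the paper) implements the Forward algorithm of §2 for an arbitrary ε-free WFSA supplied with its semiring operations, and scores a short string with B under made-up weight functions µ and φ:

```python
def forward_score(x, states, init, final, trans, plus, times, zero):
    """Score string x with an epsilon-free WFSA, generic over the semiring.

    init[q] and final[q] are the initial/final weights (lambda, rho);
    trans(q, r, a) is the transition weight tau; plus/times/zero are the
    semiring operations and additive identity (Definitions 2-4).
    """
    omega = dict(init)                         # Omega_0(q) = lambda(q)
    for a in x:
        new = {r: zero for r in states}
        for q in states:                       # Omega_i(r) = (+)_q Omega_{i-1}(q) (x) tau(q, r, a)
            for r in states:
                new[r] = plus(new[r], times(omega[q], trans(q, r, a)))
        omega = new
    score = zero                               # (+)_q Omega_n(q) (x) rho(q)
    for q in states:
        score = plus(score, times(omega[q], final[q]))
    return score

# The two-state WFSA B of Figure 1 over the real (plus-times) semiring,
# with made-up weight functions mu and phi.
mu, phi = {"a": 0.5, "b": 0.2}, {"a": 0.9, "b": 0.7}
tau = lambda q, r, a: {(0, 0): 1.0, (0, 1): mu[a], (1, 1): phi[a]}.get((q, r), 0.0)

s = forward_score("abb", states=[0, 1], init={0: 1.0, 1: 0.0}, final={0: 0.0, 1: 1.0},
                  trans=tau, plus=lambda u, v: u + v, times=lambda u, v: u * v, zero=0.0)
print(s)  # 0.5*0.7*0.7 + 0.2*0.7 + 0.2 = 0.585, matching the closed form above
```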

Rational Recurrences
Before formally defining rational recurrences in §3.2, we highlight the connection between WFSAs and RNNs using a motivating example ( §3.1).

A Motivating Example
We describe a simplified RNN which strips away details of some recent RNNs, in order to highlight the behaviors of the forget gate and the input.
Example 6. For an input sequence x = x_1 ... x_n, let the word embedding vector for x_t be v_t. As in many gated RNN variants (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), we use a forget gate f_t, which is computed with an affine transformation followed by an elementwise sigmoid function σ. The current input representation u_t is similarly computed, but with an optional nonlinearity (e.g., tanh) g, and is downweighted by the forget gate. The hidden state c_t can be seen as a weighted sum of the previous state and the new input, controlled by the forget gate:

  f_t = σ(W_f v_t + b_f),   (5a)
  u_t = (1 − f_t) ⊙ g(W_u v_t + b_u),   (5b)
  c_t = f_t ⊙ c_{t−1} + u_t,   (5c)

where ⊙ denotes elementwise multiplication and c_0 = 0.
The hidden state c_t can then be used in downstream computation, e.g., to calculate the output state h_t = tanh(c_t), which is then fed to an MLP classifier. We focus only on the recurrent computation.
In Example 6, both f_t and u_t depend only on the current input token x_t (through v_t), and not on the previous state. Importantly, the interaction with the previous state c_{t−1} is not via affine transformations followed by nonlinearities, as in, e.g., an Elman network (Elman, 1990), where c_t = tanh(W_c c_{t−1} + W_v v_t + b_c). As we will discuss later, this is important in relating this recurrent update function to WFSAs.
Since the recurrent update in Equation 5c is elementwise, for simplicity we focus on just the ith dimension. Unrolling it over time steps, we get

  [c_t]_i = Σ_{j=1}^{t} [u_j]_i · Π_{k=j+1}^{t} [f_k]_i,   (6)

where [·]_i denotes the ith dimension of a vector. As noted by Lee et al. (2017), the hidden state at time step t can be seen as a sum of previous input representations, weighted by the forget gate; longer histories typically get a smaller weight, since the forget gate values are between 0 and 1 due to the sigmoid function.
Let us recall the WFSA B (Figure 1 and Example 5) using the real semiring ⟨R, +, ×, 0, 1⟩. Equation 6 is recovered by parameterizing B's weight functions µ and φ with

  µ(x_t) = [u_t]_i,   φ(x_t) = [f_t]_i.   (7)

Denote the resulting WFSA by B_i, and we have:

Proposition 7. Running the single-layer RNN in Example 6 over any nonempty input string x ∈ Σ+, the ith dimension of its hidden state at time step t equals the score assigned by B_i to x_{:t}:

  [c_t]_i = B_i(x_{:t}).   (8)

In other words, the ith dimension of the RNN in Example 6 can be seen as a WFSA structurally equivalent to B. Its weight functions are implemented as the ith dimension of Equations 5, and the learned parameters are the ith rows of the Ws and bs. It is then straightforward to recover the full d-dimensional RNN by collecting d such WFSAs, each of which is parameterized by a row of the Ws and bs. Based on this observation, we are now ready to formally define rational recurrences.
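As a sanity check, the following sketch (ours, using NumPy and randomly initialized parameters; not code from the paper) verifies Proposition 7 numerically: the gated recurrence of Equations 5a-5c and the Forward score of the WFSA B_i, parameterized per Equation 7, coincide dimension by dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, T = 3, 5, 6                # hidden size, alphabet size, sequence length
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Randomly initialized parameters (stand-ins for learned W_f, b_f, W_u, b_u).
W_f, b_f = rng.normal(size=(d, vocab)), rng.normal(size=d)
W_u, b_u = rng.normal(size=(d, vocab)), rng.normal(size=d)

x = rng.integers(vocab, size=T)      # input token ids
V = np.eye(vocab)                    # one-hot "embeddings" for simplicity

# RNN side: the recurrence of Example 6 (Equations 5a-5c).
c = np.zeros(d)
for t in range(T):
    v = V[x[t]]
    f = sigmoid(W_f @ v + b_f)
    u = (1.0 - f) * np.tanh(W_u @ v + b_u)
    c = f * c + u

# WFSA side: for each dimension i, run the Forward algorithm on the two-state
# WFSA B with mu(a) = [u]_i and phi(a) = [f]_i (Equation 7).
scores = np.zeros(d)
for i in range(d):
    omega = np.array([1.0, 0.0])     # lambda(q0) = 1, lambda(q1) = 0
    for t in range(T):
        v = V[x[t]]
        f = sigmoid(W_f[i] @ v + b_f[i])
        u = (1.0 - f) * np.tanh(W_u[i] @ v + b_u[i])
        omega = np.array([omega[0] * 1.0,                  # stay in q0 (weight 1)
                          omega[0] * u + omega[1] * f])    # enter q1 or stay in q1
    scores[i] = omega[1]             # rho(q1) = 1, rho(q0) = 0

assert np.allclose(c, scores)
```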

Recurrences and Rationality
For a function c : Σ* → K^d, its recurrence is said to be the dependence of c(x_{:t}) on c(x_{:t−1}), for any input sequence x ∈ Σ+. We discuss a class of recurrences that can be characterized by WFSAs. The mathematical counterpart of WFSAs is rational power series (Berstel and Reutenauer, 1988), justifying naming such recurrences rational:

Definition 8 (rational recurrence). The recurrence of c : Σ* → K^d is said to be rational if there exists a set of weighted finite-state automata {A_i}_{i=1}^{d} over alphabet Σ and semiring ⟨K, ⊕, ⊗, 0, 1⟩, with both ⊕ and ⊗ taking constant time and space, such that for any x ∈ Σ+ and any time step t,

  [c(x_{:t})]_i = A_i(x_{:t}),   for i = 1, ..., d.

It directly follows from Proposition 7 that:

Corollary 9. The recurrence in Example 6 is rational.

Relationship to Existing Neural Models
This section studies several recently proposed neural architectures, and relates them to rational recurrences.§4.1 begins by relating some of them to the RNN defined in Example 6, and then to the WFSA B (Example 5).We then describe a WFSA similar to B, but with one additional state, and discuss how it provides a new view of RNN models motivated by n-gram features ( §4.2).In §4.3 we study rational recurrences that are not elementwise, using an existing model.
In the following discussion, we shall assume the real semiring, unless otherwise noted.

Neural Architectures Related to B
Despite its simplicity, Example 6 corresponds to several existing neural architectures. For instance, the quasi-RNN (QRNN; Bradbury et al., 2017) and the simple recurrent unit (SRU; Lei et al., 2017b) aim to speed up the recurrent computation. To do so, they drop the matrix multiplication dependence on the previous hidden state, resulting in recurrences similar to that in Example 6. Other works start from different motivations but land on similar recurrences, e.g., the strongly-typed RNN (T-RNN; Balduzzi and Ghifary, 2016) and its gated variants (T-LSTM and T-GRU), and the structurally constrained RNN (SCRN; Mikolov et al., 2014).
The analysis in §3.1 directly applies to the SRU, T-RNN, and SCRN. In fact, Example 6 presents a slightly more complicated version of them: in these models, input representations are computed without the bias term or any nonlinearity,

  u_t = (1 − f_t) ⊙ W_u v_t.

By Proposition 7 and Corollary 9:

Corollary 10. The recurrences of single-layer SRU, T-RNN, and SCRN architectures are rational.
It is slightly more complicated to analyze the recurrences of the QRNN, T-LSTM, and T-GRU. Although their hidden states c_t are updated in the same way as Equation 5c, the input representations and gates may depend on previous inputs. For example, in the T-LSTM and T-GRU, the forget gate is a function of two consecutive inputs:

  f_t = σ(W_f^{(1)} v_{t−1} + W_f^{(2)} v_t + b_f).

QRNNs are similar, but may depend on up to K tokens, due to the K-window convolutions. Eisner (2002) discusses finite-state machines for second (or higher) order probabilistic sequence models. Following the same intuition, we sketch the construction of WFSAs corresponding to QRNNs with 2-window convolutions in Appendix A, and summarize the key results here:

Proposition 11. The recurrences of single-layer T-GRU, T-LSTM, and QRNN are rational. In particular, a single-layer d-dimensional QRNN using K-window convolutions can be recovered by a set of d WFSAs, each with O(2|Σ|^{K−1}) states.
The size of the WFSAs needed to recover a QRNN grows exponentially in the window size. Therefore, at least for QRNNs, Proposition 11 is of more conceptual than practical value.
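To illustrate the dependence on the previous token, here is a minimal sketch (ours; parameter names and shapes are illustrative, and this is not the construction of Appendix A) of a 2-window QRNN-style recurrence: because the gate and input representation at step t are functions of the bigram (x_{t−1}, x_t), a WFSA recovering a single dimension must remember the previous symbol in its state.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, T = 4, 6, 8
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 2-window "convolution" parameters: one matrix per position in the window.
W_f = rng.normal(size=(2, d, vocab)); b_f = rng.normal(size=d)
W_u = rng.normal(size=(2, d, vocab)); b_u = rng.normal(size=d)

x = rng.integers(vocab, size=T)
V = np.eye(vocab)

c = np.zeros(d)
v_prev = np.zeros(vocab)                              # padding for t = 1
for t in range(T):
    v = V[x[t]]
    f = sigmoid(W_f[0] @ v_prev + W_f[1] @ v + b_f)   # gate sees the bigram
    u = (1.0 - f) * np.tanh(W_u[0] @ v_prev + W_u[1] @ v + b_u)
    c = f * c + u                                     # same elementwise update as Equation 5c
    v_prev = v
```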

More than Two States
So far our discussion has centered on B, a two-state WFSA capturing unigram patterns (Example 5). In the same spirit as going from unigram to n-gram features, one can use WFSAs with more states to capture longer patterns (Schwartz et al., 2018). In this section we augment B by introducing more states, and explore its relationship to some neural architectures motivated by n-gram features. We start with a three-state WFSA as an example, and then discuss more general cases.
Figure 2 diagrams a WFSA C, augmenting B with another state. To reach the final state q_2, at least two transitions must be taken, in contrast to one in B. History information is decayed by the self-loop at the final state q_2, assuming φ_2 is between 0 and 1. C has another self-loop over q_1, weighted by φ_1 ∈ (0, 1). The motivation is to allow (but down-weight) nonconsecutive bigrams, as we will soon show.
The scores assigned by C can be inductively computed by applying the Forward algorithm (§2). Given an input sequence x longer than one, let

  C(x_{:t}) = C(x_{:t−1}) · φ_2(x_t) + β_{t−1} · µ_2(x_t),   (11)

where

  β_t = β_{t−1} · φ_1(x_t) + µ_1(x_t),   (12)

with β_0 = 0 and C(x_{:1}) = 0. Unrolling β_t in time, we get

  β_t = Σ_{j=1}^{t} µ_1(x_j) · Π_{k=j+1}^{t} φ_1(x_k).   (13)

Due to the self-loop over state q_1, β_t can be seen as a weighted sum of the µ_1 terms up to x_t (Equation 13). The second product term in Equation 11 then provides multiplicative interactions between µ_2 and the weighted sum of µ_1s. In this sense, it captures nonconsecutive bigram features.
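As a sketch (ours, with made-up per-symbol weights standing in for the parameterized µ and φ functions), the two-variable dynamic program of Equations 11 and 12 is all that is needed to score a prefix with C:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, T = 5, 7
x = rng.integers(vocab, size=T)

# Made-up per-symbol weight functions for C (in a rational RNN these would be
# computed from the input representations and gates, cf. Equation 15).
mu1, mu2 = rng.uniform(size=vocab), rng.uniform(size=vocab)
phi1, phi2 = rng.uniform(size=vocab), rng.uniform(size=vocab)

beta, score = 0.0, 0.0          # beta_0 = 0; paths need >= 2 transitions to reach q2
for t in range(T):
    a = x[t]
    score = score * phi2[a] + beta * mu2[a]   # Equation 11: land in / stay at q2
    beta = beta * phi1[a] + mu1[a]            # Equation 12: land in / stay at q1
print(score)                    # = C(x_{:T}), the sum over all accepting paths
```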
At first glance, Equations 11 and 12 resemble the recurrent convolutional neural network (RCNN; Lei et al., 2016). The RCNN is inspired by nonconsecutive n-gram features and low-rank tensor factorization, and was later studied from a string kernel perspective (Lei et al., 2017a). Here we review its nonlinear bigram version:

  c_t^{(1)} = λ_t ⊙ c_{t−1}^{(1)} + u_t^{(1)},   (14a)
  c_t^{(2)} = λ_t ⊙ c_{t−1}^{(2)} + c_{t−1}^{(1)} ⊙ u_t^{(2)},   (14b)

where the u_t^{(j)}s are computed similarly to Equation 5b, and c_t^{(2)} is used as the output for onward computation. Different strategies for computing λ_t have been explored (Lei et al., 2015, 2016). When λ_t is a constant, or depends only on x_t, e.g., λ_t = σ(W_λ v_t + b_λ), the ith dimension of Equations 14 can be recovered from Equation 11 by letting

  µ_1(x_t) = [u_t^{(1)}]_i,   µ_2(x_t) = [u_t^{(2)}]_i,   φ_1(x_t) = φ_2(x_t) = [λ_t]_i.   (15)

It is straightforward to generalize the above discussion to higher-order cases: an n-gram RCNN corresponds to WFSAs with n + 1 states, constructed similarly to how we build C from B (Appendix B).

[Figure 3: WFSA D_1 discussed in §4.3. Two initial states q_1 and q_4 are used here.]
Proposition 12. For a single-layer RCNN with λ_t being a constant or depending only on x_t, the recurrence is rational.

As noted later in §4.3, its recurrence may not be rational when λ_t depends on the previous hidden states.

Beyond Elementwise Operations
So far we have discussed rational recurrences for models using elementwise recurrent updates (e.g., Equation 5c). This section uses an existing model as an example to study a rational recurrence that is not elementwise. We focus on the input-switched affine network (ISAN; Foerster et al., 2017). Aiming for efficiency and interpretability, it does not use any explicit nonlinearity; its affine transformation parameters depend only on the input:

  c_t = W(x_t) c_{t−1} + b(x_t),   (16)

where W(x_t) ∈ R^{d×d} and b(x_t) ∈ R^d are functions of the input token x_t alone. Due to the matrix multiplication, the recurrence of a single-layer ISAN is not elementwise. Yet, we argue that it is rational. We sketch the proof for a 2-dimensional case; it is straightforward to generalize to higher dimensions (Appendix C). We define two WFSAs, each recovering one dimension of ISAN's recurrent updates. Figure 3 diagrams one of them, D_1. The other one, D_2, is identical (including shared weights), except that it uses q_3 instead of q_2 as the final state. For any nonempty input sequence x ∈ Σ+, the scores assigned by D_1 and D_2 can be inductively computed by applying the Forward algorithm. Letting D_1(x_{:0}) = D_2(x_{:0}) = 0, for t ≥ 1,

  D_1(x_{:t}) = D_1(x_{:t−1}) · ν_{11}(x_t) + D_2(x_{:t−1}) · ν_{21}(x_t) + κ_1(x_t),
  D_2(x_{:t}) = D_1(x_{:t−1}) · ν_{12}(x_t) + D_2(x_{:t−1}) · ν_{22}(x_t) + κ_2(x_t),

where ν_{jk} and κ_k denote the transition weights in Figure 3: ν_{jk}(x_t) weights a transition into the state tracking dimension k from the state tracking dimension j, and κ_k(x_t) weights a transition into that state from an initial state. Then Equation 16, in the case of hidden size 2, is recovered by letting

  ν_{jk}(x_t) = [W(x_t)]_{kj},   κ_k(x_t) = [b(x_t)]_k,

so that [c_t]_k = D_k(x_{:t}).

Proposition 13. The recurrence of a single-layer ISAN is rational.
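To illustrate the non-elementwise case, the sketch below (ours; parameter names are illustrative) runs a tiny 2-dimensional ISAN-style recurrence and, in parallel, the coupled Forward recursions above, checking that each dimension of the hidden state equals the corresponding WFSA score.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, T, d = 4, 6, 2
x = rng.integers(vocab, size=T)

# Input-switched affine parameters: one d-by-d matrix and one bias per symbol.
W = rng.normal(size=(vocab, d, d)) * 0.5
b = rng.normal(size=(vocab, d))

# ISAN-style recurrence (Equation 16): c_t = W(x_t) c_{t-1} + b(x_t).
c = np.zeros(d)
for t in range(T):
    c = W[x[t]] @ c + b[x[t]]

# WFSA view: two coupled Forward scores, one per dimension, whose transition
# weights are read off W and b (the 2-dimensional case sketched in §4.3).
D = np.zeros(d)
for t in range(T):
    a = x[t]
    D = np.array([D[0] * W[a][0, 0] + D[1] * W[a][0, 1] + b[a][0],
                  D[0] * W[a][1, 0] + D[1] * W[a][1, 1] + b[a][1]])

assert np.allclose(c, D)
```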
Corollary 14. For a single-layer Elman network, in the absence of any nonlinearity, the recurrence is rational.

Discussion. It is known that an Elman network can approximate any recursively computable partial function (Siegelmann and Sontag, 1995). On the other hand, in their single-layer cases, WFSAs (and thus models with rational recurrences) are restricted to rational series (Schützenberger, 1961). Therefore, we hypothesize that models like Elman networks, LSTMs, and GRUs, where the recurrences depend on previous states through affine transformations followed by nonlinearities, are not rational. This work does not intend to propose rational recurrences as a concept general enough to include most existing RNNs. Rather, we wish to study a more constrained class of methods to better understand the connections between WFSAs and RNNs. Therefore, in Definition 8, we restrict the semirings to be "simple," in the sense that both operations take constant time and space. Such a restriction aims to exclude the possibility of hiding arbitrarily complex computations inside the semiring, which might allow RNNs to satisfy the definition in a trivial and unilluminating way.

Such theoretical limitations might be less severe than they appear, since it is not yet entirely clear what they correspond to in practice, especially when multiple vertical layers of these models are used (Leshno and Schocken, 1993). We defer to future work the further study of the connections between WFSAs and Elman-style RNNs.
Closing this section, Table 1 summarizes the discussed recurrent neural architectures and their corresponding WFSAs.

Deriving Neural Models from WFSAs
Rational recurrences provide a new view of several recently proposed neural models. Based on such connections, this section derives new neural architectures from WFSAs. We start by aggregating patterns of different lengths, combining the WFSAs of Figures 1 and 2 (§5.1). We then explore alternative semirings (§5.2), an approach orthogonal to what we've discussed so far. We note that our goal is not to devise new state-of-the-art architectures. Rather, we illustrate a new design process for neural architectures that draws inspiration from WFSAs. That said, in our experiments (§6), one of our new architectures performs as well as or better than strong baselines.

Aggregating Different Length Patterns
We start by presenting a straightforward extension of the 2-state and 3-state rational models: one that combines both. It is inspired by many classical NLP models, where unigram features and higher-order ones are interpolated.
Figure 4 diagrams a 4-state WFSA F.

[Figure 4: A WFSA F that combines both unigram and bigram features (§5.1). Two final states q_1 and q_2 are used, with weights ρ_1 and ρ_2, respectively.]

Compared to C (Figure 2), F uses q_1 as a second final state, aiming to capture both unigram and bigram patterns, since a path is allowed to stop at q_1 after consuming one input. The final states are weighted by ρ_1 and ρ_2, respectively. Another notable modification is the additional state q_3, which is used to create a "shortcut" to q_2 together with an ε-transition. Specifically, starting from q_0, a path can now take the ε-transition to reach q_3, and then take a transition with weight µ_2 to reach q_2. Recall from §2 that ε-transitions do not consume any input, yet they can still be weighted by a (parameterized) function γ that does not depend on the input. The ε-transition thus allows skipping the first word in a bigram; it can be discouraged by using γ ∈ (0, 1), just as we do in our experiments.
Deriving the neural architecture. As in §3, we relate the hidden states of an RNN to the scores assigned by WFSAs to input strings. We then derive the neural architecture with a dynamic program.
Here we keep the discussion self-contained by explicitly overviewing the procedure. It is a direct application of the Forward algorithm (§2), though now in a form that deals with the ε-transition. Such an approach applies, of course, to more general cases, as noted by Schwartz et al. (2018). Given an input string x ∈ Σ+, let z_t^{(j)} denote the total score of all paths landing in state q_j just after consuming x_t, and let z_0^{(1)} = z_0^{(2)} = 0. Since q_0 carries a constant score of 1 and q_3 is reached from q_0 through the ε-transition (with weight γ), the Forward updates are

  z_t^{(1)} = z_{t−1}^{(1)} · φ_1(x_t) + µ_1(x_t),
  z_t^{(2)} = z_{t−1}^{(2)} · φ_2(x_t) + (z_{t−1}^{(1)} + γ) · µ_2(x_t),

and the score F assigns to x_{:t} is ρ_1 · z_t^{(1)} + ρ_2 · z_t^{(2)}. We now collect d of these WFSAs to construct an RNN, and we parameterize their weight functions with the technique we've been using: µ_1, µ_2, φ_1, and φ_2 are implemented as in Equations 5 and 15, while the final-state weights ρ_1 and ρ_2 are implemented as learned vectors p_1 and p_2, and the ε-transition weights γ as a learned vector r computed from a bias b_r. Despite the similarities, the p vectors are different from output gates (Bradbury et al., 2017), since the former do not depend on the input, and are parameterized by bias terms alone. The same applies to r and b_r, which correspond to the weights γ of the ε-transitions.
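Putting the pieces together, here is a minimal sketch of the resulting recurrence, assuming (as our own illustrative choice, not necessarily the exact parameterization used in the paper's appendices) that the µ and φ functions are computed as in Equations 5 and 15 and that p_1, p_2, and r are learned vectors squashed through a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab, T = 8, 10, 12
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Learned parameters (illustrative shapes).
W_f,  b_f  = rng.normal(size=(d, vocab)), rng.normal(size=d)
W_u1, b_u1 = rng.normal(size=(d, vocab)), rng.normal(size=d)
W_u2, b_u2 = rng.normal(size=(d, vocab)), rng.normal(size=d)
p1, p2, b_r = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

x = rng.integers(vocab, size=T)
V = np.eye(vocab)

z1 = np.zeros(d)                      # paths ending in q1 (unigram patterns)
z2 = np.zeros(d)                      # paths ending in q2 (bigram patterns)
r = sigmoid(b_r)                      # epsilon-transition weights gamma in (0, 1)
for t in range(T):
    v = V[x[t]]
    f  = sigmoid(W_f @ v + b_f)                    # phi_1 = phi_2 = f (our choice)
    u1 = (1.0 - f) * np.tanh(W_u1 @ v + b_u1)      # mu_1
    u2 = (1.0 - f) * np.tanh(W_u2 @ v + b_u2)      # mu_2
    z2 = f * z2 + (z1 + r) * u2                    # update q2 before q1 (uses old z1)
    z1 = f * z1 + u1
h = sigmoid(p1) * z1 + sigmoid(p2) * z2            # weighted final states
```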

Alternative Semirings
Our new understanding of rational recurrences allows us to consider a different kind of extension: replacing the semiring. We introduce an example, which modifies Example 6 by replacing its real (plus-times) semiring with the max-plus semiring ⟨R ∪ {−∞}, max, +, −∞, 0⟩:

Example 15.

  f_t = σ(W_f v_t + b_f),   (21a)
  u_t = g(W_u v_t + b_u),   (21b)
  c_t = max(f_t + c_{t−1}, u_t),   (21c)

where the max is taken elementwise.
Example 15 does not use the forget gate when computing u_t (Equation 21b), which is different from its plus-times counterpart, where u_t = (1 − f_t) ⊙ g(W_u v_t + b_u) (Equation 5b). The reason is that, unlike the real semiring, the max-plus semiring lacks a well-defined negation. Possible alternatives include taking the log of a separate input gate, or using log(1 − f_t), which we leave for future work.
Example 15 can be seen as replacing sum-pooling with max-pooling. Both max- and sum-pooling have been used successfully in vision and NLP models. Intuitively, max-pooling "detects" the occurrence of a pattern, while sum-pooling "counts" its occurrences. One advantage of the max operator is that the model's decisions can be back-traced and interpreted, as argued by Schwartz et al. (2018). Such a technique is applicable to all the models with rational recurrences.
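A minimal sketch of the max-plus variant (our own rendering, assuming Equations 21a-21c as written above): the recurrence keeps, for each dimension, the score of the single best path rather than the sum over all paths, which is what makes back-tracing possible.

```python
import numpy as np

rng = np.random.default_rng(5)
d, vocab, T = 3, 5, 6
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W_f, b_f = rng.normal(size=(d, vocab)), rng.normal(size=d)
W_u, b_u = rng.normal(size=(d, vocab)), rng.normal(size=d)

x = rng.integers(vocab, size=T)
V = np.eye(vocab)

c = np.full(d, -np.inf)                    # the semiring zero is -inf
for t in range(T):
    v = V[x[t]]
    f = sigmoid(W_f @ v + b_f)             # phi(x_t): weight for extending a match
    u = np.tanh(W_u @ v + b_u)             # mu(x_t): weight for starting a new match
    c = np.maximum(f + c, u)               # max-plus analogue of Equation 5c
# c[i] now holds the score of the single best path of B_i on x, which can be
# back-traced to recover which token started the matched pattern.
```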

Experiments
This section evaluates four rational RNNs on language modeling (§6.2) and text categorization (§6.3). Our goal is to compare the behaviors of models derived from different WFSAs, showing that our understanding of WFSAs allows us to improve existing rational models.
We also compare to an LSTM baseline. Aiming to control for confounding factors, we do not use highway connections in any of the models. In the interest of space, the full architectures and hyperparameters are detailed in Appendices D and E.

Language Modeling
Dataset and implementation. We experiment with the Penn Treebank corpus (PTB; Marcus et al., 1993). We use the preprocessing and splits from Mikolov et al. (2010), resulting in a vocabulary size of 10K and 1M tokens.
Following standard practice, we treat the training data as one long sequence, split into mini-batches, and train using BPTT truncated to 35 time steps (Williams and Peng, 1990). The input embeddings and output softmax weights are tied (Press and Wolf, 2017).
Results. Following Collins et al. (2017) and Melis et al. (2018), we compare models controlling for the parameter budget. Table 3 summarizes language modeling perplexities on the PTB test set. The middle block compares all models with two layers and 10M trainable parameters. RRNN(B) and RRNN(C) achieve roughly the same performance; interpolating both unigram and bigram features, RRNN(F) outperforms the others by more than 2.9 test perplexity. For the three-layer, 24M setting (the bottom block), we observe similar trends, except that RRNN(C) slightly underperforms RRNN(B). Here RRNN(F) outperforms the others by more than 2.1 perplexity. RRNN(B)m+ underperforms RRNN(B), which we attribute, at least in part, to the lack of a forget gate for computing input representations in the former (§5.2). Finally, most compared models outperform the LSTM baselines, whose numbers are taken from Lei et al. (2017b).

Text Classification
Implementation. We use unidirectional 2-layer architectures for all compared models. To build the classifiers, we feed the final RNN hidden states into a 2-layer tanh-MLP. Further implementation details are described in Appendix E.

Datasets. We experiment with four binary text classification datasets, described below.
• Amazon (electronic product review corpus; McAuley and Leskovec, 2013). We focus on the positive and negative reviews.
• subj (subjectivity dataset; Pang and Lee, 2004). As subj doesn't come with official splits, we randomly split it into train (80%), development (10%), and test (10%) sets.
• CR (customer reviews dataset; Hu and Liu, 2004). As with subj, we randomly split this dataset using the same ratio.

Table 4 summarizes the sizes of the datasets.
Results. Table 5 summarizes text classification test accuracy. We report the average performance over five trials differing only in random seeds. RRNN(F) outperforms all other models on 3 out of the 4 datasets. For Amazon, the largest one, we do not observe significant differences between RRNN(F) and RRNN(C), while both outperform the others. This may suggest that the interpolation of unigram and bigram features by RRNN(F) is especially useful in small-data setups. As in the language modeling experiments, RRNN(B)m+ underperforms all other models in most cases, and in particular RRNN(B). These results provide evidence that replacing the real semiring in rational models might be challenging. We leave further exploration to future work.

Conclusion
We presented rational recurrences, a new construction for studying the recurrent updates of RNNs, drawing inspiration from WFSAs. We showed that rational recurrences are in frequent use in several recently proposed recurrent neural architectures, providing new understanding of those models. Based on such connections, we discussed approaches to deriving novel neural architectures from WFSAs. Our empirical results demonstrate the potential of doing so. We publicly release our implementation at https://github.com/Noahs-ARK/rational-recurrences.

E.2 Language modeling

We compare models based on the trainable parameter budget, and adjust the dropout probabilities accordingly so that the expected number of remaining hidden units is roughly the same. In addition, we observe that RRNN(C) and RRNN(F) fail to converge when optimized with SGD using an initial learning rate of 1.0, and thus we use 0.5 for both models. Other hyperparameters are kept the same as in Lei et al. (2017b).

E.3 Text classification
We train our models using Adam (Kingma and Ba, 2015) with a batch size of 16 (for Amazon) or 64 (for the smaller datasets). The initial learning rate and ℓ2 regularization strength are tuned as hyperparameters. We use 300-dimensional GloVe 840B embeddings (Pennington et al., 2014), normalized to unit length and kept fixed, replacing unknown words with a special UNK token. Two-layer RNNs are used in all cases. For regularization, we use three types of dropout: recurrent (variational) dropout, vertical dropout, and dropout on the embedding layer.
We tune the hyperparameters of our model on the development set by running 20 epochs of random search. We then take the best development configuration and train five models with it using different random seeds. We report the average test results. The hyperparameter values explored are summarized in Table 6. We train all models for 500 epochs, stopping early if development accuracy does not improve for 30 epochs. During training, we halve the learning rate if development accuracy does not improve for 10 epochs.

Table 1: Summary of the recurrent neural architectures discussed in §4 and their corresponding WFSAs.

Table 3: Language modeling perplexity on the PTB test set (lower is better). LSTM numbers are taken from Lei et al. (2017b). ℓ denotes the number of layers. Bold font indicates best performance.

Table 4: Number of instances in the text classification datasets (§6.3).

Table 6: The hyperparameters explored using the random search algorithm in the text classification experiments.