Sequential Neural Networks as Automata

This work attempts to explain the types of computation that neural networks can perform by relating them to automata. We first define what it means for a real-time network with bounded precision to accept a language. A measure of network memory follows from this definition. We then characterize the classes of languages acceptable by various recurrent networks, attention, and convolutional networks. We find that LSTMs function like counter machines and relate convolutional networks to the subregular hierarchy. Overall, this work attempts to increase our understanding and ability to interpret neural networks through the lens of theory. These theoretical insights help explain neural computation, as well as the relationship between neural networks and natural language grammar.


Introduction
In recent years, neural networks have achieved tremendous success on a variety of natural language processing (NLP) tasks. Neural networks employ continuous distributed representations of linguistic data, which contrast with classical discrete methods. While neural methods work well, one of the downsides of the distributed representations that they utilize is interpretability. It is hard to tell what kinds of computation a model is capable of, and when a model is working, it is hard to tell what it is doing.
This work aims to address such issues of interpretability by relating sequential neural networks to forms of computation that are more well understood. In theoretical computer science, the computational capacities of many different kinds of automata formalisms are clearly established. Moreover, the Chomsky hierarchy links natural language to such automata-theoretic languages (Chomsky, 1956). Thus, relating neural networks to automata yields insight both into what general forms of computation such models can perform and into how such computation relates to natural language grammar.
Recent work has begun to investigate what kinds of automata-theoretic computations various types of neural networks can simulate. Weiss et al. (2018) propose a connection between long short-term memory networks (LSTMs) and counter automata. They provide a construction by which the LSTM can simulate a simplified variant of a counter automaton. They also demonstrate that LSTMs can learn to increment and decrement their cell state as counters in practice. Peng et al. (2018), on the other hand, describe a connection between the gating mechanisms of several recurrent neural network (RNN) architectures and weighted finite-state acceptors.
This paper follows Weiss et al. (2018) by analyzing the expressiveness of neural network acceptors under asymptotic conditions. We formalize asymptotic language acceptance, as well as an associated notion of network memory. We use this theory to derive computation upper bounds and automata-theoretic characterizations for several different kinds of recurrent neural networks (Section 3), as well as other architectural variants like attention (Section 4) and convolutional networks (CNNs) (Section 5). This leads to a fairly complete automata-theoretic characterization of sequential neural networks.
In Section 6, we report empirical results investigating how well these asymptotic predictions describe networks with continuous activations learned by gradient descent. In some cases, networks behave according to the theoretical predictions, but we also find cases where there is a gap between the asymptotic characterization and actual network behavior.
Still, discretizing neural networks using an asymptotic analysis builds intuition about how the network computes. Thus, this work provides insight about the types of computations that sequential neural networks can perform through the lens of formal language theory. In so doing, we can also compare the notions of grammar expressible by neural networks to formal models that have been proposed for natural language grammar.
arXiv:1906.01615v2 [cs.CL] 5 Jun 2019

Introducing the Asymptotic Analysis
To investigate the capacities of different neural network architectures, we need to first define what it means for a neural network to accept a language. There are a variety of ways to formalize language acceptance, and changes to this definition lead to dramatically different characterizations.
In their analysis of RNN expressiveness, Siegelmann and Sontag (1992) allow RNNs to perform an unbounded number of recurrent steps even after the input has been consumed. Furthermore, they assume that the hidden units of the network can have arbitrarily fine-grained precision. Under this very general definition of language acceptance, Siegelmann and Sontag (1992) found that even a simple recurrent network (SRN) can simulate a Turing machine.
We want to impose the following constraints on neural network computation, which are more realistic to how networks are trained in practice (Weiss et al., 2018):
1. Real-time: The network performs one iteration of computation per input symbol.
2. Bounded precision: The value of each cell in the network is representable by O(log n) bits on sequences of length n.
Informally, a neural sequence acceptor is a network which reads a variable-length sequence of characters and returns the probability that the input sequence is a valid sentence in some formal language. More precisely, we can write:
Definition 2.1 (Neural sequence acceptor). Let X be a matrix representation of a sentence where each row is a one-hot vector over an alphabet Σ. A neural sequence acceptor $\hat{\mathbb{1}}$ is a family of functions parameterized by weights θ. For each θ and X, the function $\hat{\mathbb{1}}^\theta$ takes the form
$$\hat{\mathbb{1}}^\theta : X \mapsto p \in (0, 1).$$
Figure 1: With sigmoid activations, the network on the left accepts a sequence of bits if and only if $x_t = 1$ for some t. On the right is the discrete computation graph that the network approaches asymptotically.
In this definition, $\hat{\mathbb{1}}$ corresponds to a general architecture like an LSTM, whereas $\hat{\mathbb{1}}^\theta$ represents a specific network, such as an LSTM with weights that have been learned from data.
In order to get an acceptance decision from this kind of network, we will consider what happens as the magnitude of its parameters gets very large. Under these asymptotic conditions, the internal connections of the network approach a discrete computation graph, and the probabilistic output approaches the indicator function of some language (Figure 1).
Definition 2.2 (Asymptotic acceptance). Let L be a language with indicator function $\mathbb{1}_L$. A neural sequence acceptor $\hat{\mathbb{1}}$ with weights θ asymptotically accepts L if
$$\lim_{N \to \infty} \hat{\mathbb{1}}^{N\theta} = \mathbb{1}_L.$$
Note that the limit of $\hat{\mathbb{1}}^{N\theta}$ represents the function that $\hat{\mathbb{1}}^{N\theta}$ converges to pointwise.
Discretizing the network in this way lets us analyze it as an automaton. We can also view this discretization as a way of bounding the precision that each unit in the network can encode, since it is forced to act as a discrete unit instead of a continuous value. This prevents complex fractal representations that rely on infinite precision. We will see later that, for every architecture considered, this definition ensures that the value of every unit in the network is representable in O(log n) bits on sequences of length n.
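As a minimal sketch of Definition 2.2, consider a hypothetical one-unit acceptor for the language of Figure 1 (bit strings containing at least one 1). The function names `acceptor` and `indicator` and the weights θ = (1, −1/2) are assumptions chosen for illustration; they are not the network drawn in the figure. Scaling θ by N pushes the sigmoid output toward the indicator function of the language.

```python
import math


def acceptor(bits, theta=(1.0, -0.5), scale=1.0):
    """Hypothetical one-unit acceptor: sigmoid(N * (w * sum(bits) + b))."""
    w, b = theta
    z = scale * (w * sum(bits) + b)
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def indicator(bits):
    """Indicator function of the language: some x_t equals 1."""
    return 1.0 if any(bits) else 0.0


# As the scale N grows, the outputs approach 0 (reject) and 1 (accept).
for scale in (1, 10, 100):
    outputs = [acceptor(x, scale=scale) for x in ([0, 0, 0], [0, 1, 0])]
    print(scale, outputs)
```

At small N the outputs are soft probabilities; at large N they are numerically indistinguishable from the indicator values, which is the sense in which the scaled network decides the language.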
It is important to note that real neural networks can learn strategies not allowed by the asymptotic definition. Thus, this way of analyzing neural networks is not completely faithful to their practical usage. In Section 6, we discuss empirical studies investigating how trained networks compare to the asymptotic predictions. While we find evidence of networks learning behavior that is not asymptotically stable, adding noise to the network during training seems to make it more difficult for the network to learn non-asymptotic strategies.
Consider a neural network that asymptotically accepts some language. For any given length, we can pick weights for the network such that it will correctly decide strings shorter than that length (Theorem A.1).
Analyzing a network's asymptotic behavior also gives us a notion of the network's memory. Weiss et al. (2018) illustrate how the LSTM's additive cell update gives it more effective memory than the squashed state of an SRN or GRU for solving counting tasks. We generalize this concept of memory capacity as state complexity. Informally, the state complexity of a node within a network represents the number of values that the node can achieve asymptotically as a function of the sequence length n. For example, the LSTM cell state will have $O(n^k)$ state complexity (Theorem 3.3), whereas the state of other recurrent networks has O(1) (Theorem 3.1).
State complexity applies to a hidden state sequence, which we can define as follows:
Definition 2.3 (Hidden state). For any sentence X, let n be the length of X. For 1 ≤ t ≤ n, the k-length hidden state $h_t$ with respect to parameters θ is a sequence of functions given by
$$h_t^\theta : X \mapsto v_t \in \mathbb{R}^k.$$
Often, a sequence acceptor can be written as a function of an intermediate hidden state. For example, the output of the recurrent layer acts as a hidden state in an LSTM language acceptor. In recurrent architectures, the value of the hidden state is a function of the preceding prefix of characters, but with convolution or attention, it can depend on characters occurring after index t.
The state complexity is defined as the cardinality of the configuration set of such a hidden state:
Definition 2.4 (Configuration set). For all n, the configuration set of hidden state $h_n$ with respect to parameters θ is given by
$$M(h_n^\theta) = \left\{ \lim_{N \to \infty} h_n^{N\theta}(X) \;\middle|\; |X| = n \right\},$$
where |X| is the length, or height, of the sentence matrix X.
Definition 2.5 (Fixed state complexity). For all n, the fixed state complexity of hidden state $h_n$ with respect to parameters θ is given by
$$m(h_n^\theta) = \left| M(h_n^\theta) \right|.$$
Definition 2.6 (General state complexity). For all n, the general state complexity of hidden state $h_n$ is given by
$$m(h_n) = \max_\theta m(h_n^\theta).$$
To illustrate these definitions, consider a simplified recurrent mechanism based on the LSTM cell. The architecture is parameterized by a vector $\theta \in \mathbb{R}^2$. At each time step, the network reads a bit $x_t$ and computes the update given in (3). When we set $\theta^+ = \langle 1, 1 \rangle$, $h_t$ asymptotically computes the sum of the preceding inputs. Because this sum can evaluate to any integer between 0 and n, $h_n^{\theta^+}$ has a fixed state complexity of
$$m(h_n^{\theta^+}) = O(n).$$
However, when we use parameters $\theta^{\mathrm{Id}} = \langle -1, 1 \rangle$, we get a reduced network where $h_t = x_t$ asymptotically. Thus,
$$m(h_n^{\theta^{\mathrm{Id}}}) = O(1).$$
Finally, the general state complexity is the maximum fixed complexity, which is O(n).
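The two fixed state complexities in this example can be checked by brute force. The sketch below does not simulate the network itself; it takes as given the two asymptotic behaviors stated above (the state sums the inputs, or the state equals the last input) and counts the resulting configuration sets. The function names are hypothetical.

```python
from itertools import product


def configuration_set(limit_state, n):
    """Collect the asymptotic values of a hidden state over all bit strings of length n."""
    return {limit_state(bits) for bits in product((0, 1), repeat=n)}


def counter_limit(bits):
    # Idealized limit for the summing parameters: the state is the sum of the inputs.
    return sum(bits)


def identity_limit(bits):
    # Idealized limit for the reduced network: the state equals the last input bit.
    return bits[-1]


# The first count grows with n; the second stays constant.
for n in (1, 2, 4, 8):
    summing = len(configuration_set(counter_limit, n))
    reduced = len(configuration_set(identity_limit, n))
    print(n, summing, reduced)
```

The general state complexity of Definition 2.6 corresponds to taking the larger of such counts over all parameter choices.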
For any neural network hidden state, the state complexity is at most $2^{O(n)}$ (Theorem A.2). This means that the value of the hidden unit can be encoded in O(n) bits. Moreover, for every specific architecture considered, we observe that each fixed-length state vector has at most $O(n^k)$ state complexity, or, equivalently, can be represented in O(log n) bits.
Architectures that have exponential state complexity, such as the transformer, do so by using a variable-length hidden state. State complexity generalizes naturally to a variable-length hidden state, with the only difference being that $h_t$ (Definition 2.3) becomes a sequence of variably sized objects rather than a sequence of fixed-length vectors.
Now, we consider what classes of languages different neural networks can accept asymptotically. We also analyze different architectures in terms of state complexity. The theory that emerges from these tools enables better understanding of the computational processes underlying neural sequence models.

Recurrent Neural Networks
As previously mentioned, RNNs are Turing-complete under an unconstrained definition of acceptance (Siegelmann and Sontag, 1992). The classical reduction of a Turing machine to an RNN relies on two unrealistic assumptions about RNN computation (Weiss et al., 2018). First, the number of recurrent computations must be unbounded in the length of the input, whereas, in practice, RNNs are almost always trained in a real-time fashion. Second, it relies heavily on infinite precision of the network's logits. We will see that the asymptotic analysis, which restricts computation to be real-time and have bounded precision, severely narrows the class of formal languages that an RNN can accept.
A well-known problem with SRNs is that they struggle with long-distance dependencies. One explanation of this is the vanishing gradient problem, which motivated the development of more sophisticated architectures like the LSTM (Hochreiter and Schmidhuber, 1997). Another shortcoming of the SRN is that, in some sense, it has less memory than the LSTM. This is because, while both architectures have a fixed number of hidden units, the SRN units remain between −1 and 1, whereas the value of each LSTM cell can grow unboundedly (Weiss et al., 2018). We can formalize this intuition by showing that the SRN has finite state complexity:
Theorem 3.1 (SRN state complexity). For any length n, the SRN cell state $h_n \in \mathbb{R}^k$ has state complexity
$$m(h_n) \le 2^k = O(1).$$
Proof. For every n, each unit of $h_n$ will be the output of a tanh. In the limit, it can achieve either −1 or 1. Thus, for the full vector, the number of configurations is bounded by $2^k$.
It also follows from Theorem 3.1 that the languages asymptotically acceptable by an SRN are a subset of the finite-state (i.e. regular) languages. Lemma B.1 provides the other direction of this containment. Thus, SRNs are equivalent to finite-state automata.
Theorem 3.2 (SRN characterization). Let L(SRN) denote the languages acceptable by an SRN, and RL the regular languages. Then,
$$L(\mathrm{SRN}) = RL.$$
This characterization is quite diminished compared to Turing completeness. It is also more descriptive of what SRNs can express in practice. We will see that LSTMs, on the other hand, are strictly more powerful than the regular languages.
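The bound in Theorem 3.1 can be observed numerically. The following sketch assumes the standard Elman update $h_t = \tanh(W x_t + U h_{t-1} + b)$ with randomly drawn weights; the function name, the weight shapes, and the scale value are assumptions for illustration. Multiplying the weights by a large scale saturates each unit, so the number of distinct sign patterns over all inputs should not exceed $2^k$.

```python
from itertools import product

import numpy as np


def srn_configurations(n, k=2, scale=1e3, seed=0):
    """Enumerate saturated SRN states over all bit strings of length n."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, 2))  # input weights over a one-hot binary alphabet
    U = rng.normal(size=(k, k))  # recurrent weights
    b = rng.normal(size=k)       # bias
    one_hot = np.eye(2)
    configs = set()
    for bits in product((0, 1), repeat=n):
        h = np.zeros(k)
        for bit in bits:
            # Scaling the parameters by a large factor saturates the tanh.
            h = np.tanh(scale * (W @ one_hot[bit] + U @ h + b))
        configs.add(tuple(int(s) for s in np.sign(h)))
    return configs


# The count stays bounded by 2**k regardless of the sequence length.
for n in (2, 4, 6):
    print(n, len(srn_configurations(n)))
```

This only samples one random parameter setting, so it illustrates the fixed state complexity for that setting rather than the maximum over all parameters.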

Long Short-Term Memory Networks
An LSTM is a recurrent network with a complex gating mechanism that determines how information from one time step is passed to the next. Originally, this gating mechanism was designed to remedy the vanishing gradient problem in SRNs, or, equivalently, to make it easier for the network to remember long-term dependencies (Hochreiter and Schmidhuber, 1997). Due to strong empirical performance on many language tasks, LSTMs have become a canonical model for NLP. Weiss et al. (2018) suggest that another advantage of the LSTM architecture is that it can use its cell state as counter memory. They point out that this constitutes a real difference between the LSTM and the GRU, whose update equations do not allow it to increment or decrement its memory units. We will further investigate this connection between LSTMs and counter machines.
Definition 3.2 (LSTM layer).
$$f_t = \sigma(W^f x_t + U^f h_{t-1} + b^f)$$
$$i_t = \sigma(W^i x_t + U^i h_{t-1} + b^i)$$
$$o_t = \sigma(W^o x_t + U^o h_{t-1} + b^o)$$
$$\tilde{c}_t = \tanh(W^c x_t + U^c h_{t-1} + b^c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot f(c_t) \qquad (12)$$
In (12), we set f to either the identity or tanh (Weiss et al., 2018), although tanh is more standard in practice. The vector $h_t$ is the output that is received by the next layer, and $c_t$ is an unexposed memory vector called the cell state.
Theorem 3.3 (LSTM state complexity). The LSTM cell state $c_n \in \mathbb{R}^k$ has state complexity
$$m(c_n) = O(n^k).$$
Proof. At each time step t, we know that the configuration sets of $f_t$, $i_t$, and $o_t$ are each subsets of $\{0, 1\}^k$. Similarly, the configuration set of $\tilde{c}_t$ is a subset of $\{-1, 1\}^k$. This allows us to rewrite the elementwise recurrent update as
$$[c_t]_i = a [c_{t-1}]_i + b,$$
where a ∈ {0, 1} and b ∈ {−1, 0, 1}. Let $S_t$ be the configuration set of $[c_t]_i$. At each time step, we have exactly two ways to produce a new value in $S_t$ that was not in $S_{t-1}$: either we decrement the minimum value in $S_{t-1}$ or increment the maximum value. It follows that
$$|S_t| \le 2 + |S_{t-1}| \implies |S_n| = O(n).$$
For all k units of the cell state, we get
$$m(c_n) \le |S_n|^k = O(n^k).$$
The construction in Theorem 3.3 produces a counter machine whose counter and state update functions are linearly separable. Thus, we have an upper bound on the expressive power of the LSTM:
Theorem 3.4 (LSTM upper bound). Let CL be the real-time counter languages (Fischer, 1966; Fischer et al., 1968). Then,
$$L(\mathrm{LSTM}) \subseteq CL.$$
Theorem 3.4 constitutes a very tight upper bound on the expressiveness of LSTM computation. Asymptotically, LSTMs are not powerful enough to model even the deterministic context-free language $w \# w^R$.
Weiss et al. (2018) show how the LSTM can simulate a simplified variant of the counter machine. Combining these results, we see that the asymptotic expressiveness of the LSTM falls somewhere between the general and simplified counter languages. This suggests counting is a good way to understand the behavior of LSTMs.
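The counting argument in the proof can be replayed directly. The sketch below assumes a single cell unit that starts at 0 and is updated by c ← a·c + b with a ∈ {0, 1} and b ∈ {−1, 0, 1}, as in the proof; the function name is hypothetical. It enumerates every value reachable after n updates.

```python
def reachable_values(n):
    """Values one asymptotic cell unit can reach after n updates c <- a*c + b."""
    values = {0}  # assumption: the cell state starts at 0
    for _ in range(n):
        values = {a * c + b for c in values for a in (0, 1) for b in (-1, 0, 1)}
    return values


# Each step adds at most two new values (one below the minimum, one above the maximum).
for n in (1, 2, 5, 10):
    print(n, len(reachable_values(n)))
```

Because each unit contributes linearly many values, a vector of k such units has at most polynomially many joint configurations, which is the $O(n^k)$ bound.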

Gated Recurrent Units
The GRU is a popular gated recurrent architecture that is in many ways similar to the LSTM (Cho et al., 2014). Rather than having separate forget and input gates, the GRU utilizes a single gate that controls both functions.
Weiss et al. (2018) observe that GRUs do not exhibit the same counter behavior as LSTMs on languages like $a^n b^n$. As with the SRN, the GRU state is squashed between −1 and 1 (20). Taken together, Lemmas C.1 and C.2 show that GRUs, like SRNs, are finite-state.

RNN Complexity Hierarchy
Synthesizing all of these results, we get the following complexity hierarchy:
$$RL = L(\mathrm{SRN}) = L(\mathrm{GRU}) \subset L(\mathrm{LSTM}) \subseteq CL.$$
Basic recurrent architectures have finite state, whereas the LSTM is strictly more powerful than a finite-state machine.

Attention
Attention is a popular enhancement to sequence-to-sequence (seq2seq) neural networks (Bahdanau et al., 2014; Chorowski et al., 2015; Luong et al., 2015). Attention allows a network to recall specific encoder states while trying to produce output.
In the context of machine translation, this mechanism models the alignment between words in the source and target languages. More recent work has found that "attention is all you need" (Vaswani et al., 2017; Radford et al., 2018). In other words, networks with only attention and no recurrent connections perform at the state of the art on many tasks.
An attention function maps a query vector and a sequence of paired key-value vectors to a weighted combination of the values. This lookup function is meant to retrieve the values whose keys resemble the query.
Definition 4.1 (Dot-product attention). For any n, define a query vector $q \in \mathbb{R}^l$, matrix of key vectors $K \in \mathbb{R}^{n \times l}$, and matrix of value vectors $V \in \mathbb{R}^{n \times k}$. Dot-product attention is given by
$$\mathrm{attn}(q, K, V) = \mathrm{softmax}(qK^T)V.$$
In Definition 4.1, softmax creates a vector of similarity scores between the query q and the key vectors in K. The output of attention is thus a weighted sum of the value vectors where the weight for each value represents its relevance.
In practice, the dot product $qK^T$ is often scaled by the square root of the length of the query vector (Vaswani et al., 2017). However, this is only done to improve optimization and has no effect on expressiveness. Therefore, we consider the unscaled version.
In the asymptotic case, attention reduces to a weighted average of the values whose keys maximally resemble the query. This can be viewed as an arg max operation.
Theorem 4.1 (Asymptotic attention). Let $t_1, .., t_m$ be the subsequence of time steps that maximize $qk_t$. Asymptotically, attention computes
$$\lim_{N \to \infty} \mathrm{attn}(Nq, K, V) = \frac{1}{m} \sum_{i=1}^{m} v_{t_i}.$$
Corollary 4.1.1 (Asymptotic attention with unique maximum). If $qk_t$ has a unique maximum over 1 ≤ t ≤ n, then attention asymptotically computes
$$\lim_{N \to \infty} \mathrm{attn}(Nq, K, V) = v_{t^*}, \quad \text{where } t^* = \arg\max_t qk_t.$$
Now, we analyze the effect of adding attention to an acceptor network. Because we are concerned with language acceptance instead of transduction, we consider a simplified seq2seq attention model where the output sequence has length 1:
Definition 4.2 (Attention layer). Let the hidden state $v_1, .., v_n$ be the output of an encoder network where the union of the asymptotic configuration sets over all $v_t$ is finite. We attend over $V_t$, the matrix stacking $v_1, .., v_t$, by computing
$$h_t = \mathrm{attn}(W^q v_t, V_t, V_t).$$
In this model, $h_t$ represents a summary of the relevant information in the prefix $v_1, .., v_t$. The query that is used to attend at time t is a simple linear transformation of $v_t$.
2 To be precise, we can define a maximum over the similarity scores according to the order given by
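The limiting behavior of dot-product attention can be seen numerically. This is a small sketch with hypothetical toy inputs in which two keys tie for the maximum similarity with the query; the function name `attn` mirrors Definition 4.1, while the specific vectors are assumptions for illustration.

```python
import numpy as np


def attn(q, K, V):
    """Dot-product attention softmax(q K^T) V, using a numerically stable softmax."""
    scores = K @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


# Hypothetical toy inputs: the keys at time steps 0 and 2 tie for the maximum score.
q = np.array([1.0, 0.0])
K = np.array([[2.0, 0.0], [1.0, 5.0], [2.0, -3.0], [0.5, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [9.0, 9.0]])

# Scaling the query by N sharpens the softmax toward the maximizing time steps.
for scale in (1, 10, 100):
    print(scale, attn(scale * q, K, V))

# Uniform average of the values at the maximizing time steps.
print(V[[0, 2]].mean(axis=0))
```

With a unique maximizing key, the same computation approaches a single value vector, which is why the asymptotic mechanism can be read as an arg max lookup.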
In addition to modeling alignment, attention improves a bounded-state model by providing additional memory. By converting the state of the network to a growing sequence $V_t$ instead of a fixed-length vector $v_t$, attention enables $2^{\Theta(n)}$ state complexity.
Theorem 4.2 (Encoder state complexity). The full state of the attention layer has state complexity
$$m(V_n) = 2^{\Theta(n)}.$$
The $O(n^k)$ complexity of the LSTM architecture means that it is impossible for LSTMs to copy or reverse long strings. The exponential state complexity provided by attention enables copying, which we can view as a simplified version of machine translation. Thus, it makes sense that attention is almost universal in machine translation architectures. The additional memory introduced by attention might also allow more complex hierarchical representations.
A natural follow-up question to Theorem 4.2 is whether this additional complexity is preserved in the attention summary vector $h_n$. Attending over $V_n$ does not preserve exponential state complexity. Instead, we get an $O(n^2)$ summary of $V_n$.
Theorem 4.3 (Summary state complexity). The attention summary vector has state complexity
With minimal additional assumptions, we can show a more restrictive bound: namely, that the complexity of the summary vector is finite. Appendix D discusses this in more detail.
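The gap between the full attention state and its summary can be illustrated with a toy count. The sketch assumes a binary alphabet, an encoder with $v_t = x_t$, and a query that ties across all positions so that the asymptotic summary is the uniform average of the inputs. These are simplifying assumptions for illustration, not the general construction.

```python
from fractions import Fraction
from itertools import product


def full_state_count(n):
    """Distinct encoder sequences V_n when v_t = x_t over a binary alphabet."""
    return len(set(product((0, 1), repeat=n)))


def summary_count(n):
    """Distinct summaries when attention averages all positions uniformly (assumed tie)."""
    return len({Fraction(sum(bits), n) for bits in product((0, 1), repeat=n)})


# The full state count doubles with each added symbol; the summary count does not.
for n in (2, 4, 8):
    print(n, full_state_count(n), summary_count(n))
```

In this toy setting the summary only records how many 1s were read, so most of the information stored in $V_n$ is lost when it is collapsed to $h_n$.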

Convolutional Networks
While CNNs were originally developed for image processing (Krizhevsky et al., 2012), they are also used to encode sequences. One popular application of this is to build character-level representations of words (Kim et al., 2016). Another example is the capsule network architecture of Zhao et al. (2018), which uses a convolutional layer as an initial feature extractor over a sentence.
Definition 5.1 (CNN acceptor).
In this network, the k-convolutional layer (25) produces a vector-valued sequence of outputs. This sequence is then collapsed to a fixed length by taking the maximum value of each filter over all the time steps (26).
The CNN acceptor is much weaker than the LSTM. Since the vector $h_t$ has finite state, we see that L(CNN) ⊆ RL. Moreover, simple regular languages like $a^*ba^*$ are beyond the CNN (Lemma E.1). Thus, the subset relation is strict.

L(CNN) ⊂ RL.
So, to arrive at a characterization of CNNs, we should move to subregular languages. In particular, we consider the strictly local languages (Rogers and Pullum, 2011).
Theorem 5.2 (CNN lower bound). Let SL be the strictly local languages. Then,

SL ⊆ L(CNN).
Notably, strictly local formalisms have been proposed as a computational model for phonological grammar (Heinz et al., 2011). We might take this to explain why CNNs have been successful at modeling character-level information.
However, Heinz et al. (2011) suggest that a generalization to the tier-based strictly local languages is necessary to account for the full range of phonological phenomena. Tier-based strictly local grammars can target characters in a specific tier of the vocabulary (e.g. vowels) instead of applying to the full string. While a single convolutional layer cannot utilize tiers, it is conceivable that a more complex architecture with recurrent connections could.

Empirical Results
In this section, we compare our theoretical characterizations for asymptotic networks to the empirical performance of trained neural networks with continuous logits.

Counting
The goal of this experiment is to evaluate which architectures have memory beyond finite state. We train a language model on $a^n b^n c$ with 5 ≤ n ≤ 1000 and test it on longer strings (2000 ≤ n ≤ 2200). Predicting the c character correctly while maintaining good overall accuracy requires O(n) states. The results reported in Table 1 demonstrate that all recurrent models, with only two hidden units, find a solution to this task that generalizes at least over this range of string lengths.
Weiss et al. (2018) report failures in attempts to train SRNs and GRUs to accept counter languages, unlike what we have found. We conjecture that this stems not from the requisite memory, but instead from the different objective function we used. Our language modeling training objective is a robust and transferable learning target (Radford et al., 2019), whereas sparse acceptance classification might be challenging to learn directly for long strings.
Weiss et al. (2018) also observe that LSTMs use their memory as counters in a straightforwardly interpretable manner, whereas SRNs and GRUs do not do so in any obvious way. Despite this, our results show that SRNs and GRUs are nonetheless able to implement generalizable counter memory while processing strings of significant length. Because the strategies learned by these architectures are not asymptotically stable, however, their schemes for encoding counting are less interpretable.

Counting with Noise
In order to abstract away from asymptotically unstable representations, our next experiment investigates how adding noise to an RNN's activations impacts its ability to count. For the SRN and GRU, noise is added to $h_{t-1}$ before computing $h_t$, and for the LSTM, noise is added to $c_{t-1}$. In either case, the noise is sampled from the distribution $\mathcal{N}(0, 0.1^2)$.
The results reported in the right column of Table 1 show that the noisy SRN and GRU now fail to count, whereas the noisy LSTM remains successful. Thus, the asymptotic characterization of each architecture matches the capacity of a trained network when a small amount of noise is introduced.
From a practical perspective, training neural networks with Gaussian noise is one way of improving generalization by preventing overfitting (Bishop, 1995; Noh et al., 2017). From this point of view, asymptotic characterizations might be more descriptive of the generalization capacities of regularized neural networks of the sort necessary to learn the patterns in natural language data as opposed to the unregularized networks that are typically used to learn the patterns in carefully curated formal languages.

Reversing
Another important formal language task for assessing network memory is string reversal. Reversing requires remembering a Θ(n) prefix of characters, which implies $2^{\Theta(n)}$ state complexity.
We frame reversing as a seq2seq transduction task, and compare the performance of an LSTM encoder-decoder architecture to the same architecture augmented with attention. We also report the results of Hao et al. (2018) for a stack neural network (StackNN), another architecture with $2^{\Theta(n)}$ state complexity (Lemma F.1).
Following Hao et al. (2018), the models were trained on 800 random binary strings with length $\sim \mathcal{N}(10, 2)$ and evaluated on strings with length $\sim \mathcal{N}(50, 5)$. As can be seen in Table 2, the LSTM with attention achieves 100.0% validation accuracy, but fails to generalize to longer strings. In contrast, Hao et al. (2018) report that a stack neural network can learn and generalize string reversal flawlessly. In both cases, it seems that having $2^{\Theta(n)}$ state complexity enables better performance on this memory-demanding task. However, our seq2seq LSTMs appear to be biased against finding a strategy that generalizes to longer strings.

Conclusion
We have introduced asymptotic acceptance as a new way to characterize neural networks as automata of different sorts. It provides a useful and generalizable tool for building intuition about how a network works, as well as for comparing the formal properties of different architectures. Further, by combining asymptotic characterizations with existing results in mathematical linguistics, we can better assess the suitability of different architectures for the representation of natural language grammar.
We observe empirically, however, that this discrete analysis fails to fully characterize the range of behaviors expressible by neural networks. In particular, RNNs predicted to be finite-state solve a task that requires more than finite memory. On the other hand, introducing a small amount of noise into a network's activations seems to prevent it from implementing non-asymptotic strategies. Thus, asymptotic characterizations might be a good model for the types of generalizable strategies that noise-regularized neural networks trained on natural language data can learn.

A Asymptotic Acceptance and State Complexity
Theorem A.1 (Arbitrary approximation). Let $\hat{\mathbb{1}}$ be a neural sequence acceptor for L. For all m, there exist parameters $\theta_m$ such that, for any string X = $x_1, .., x_n$ with n < m,
$$\left[ \hat{\mathbb{1}}^{\theta_m}(X) \right] = \mathbb{1}_L(X),$$
where [·] rounds to the nearest integer.
Proof. Consider a string X. By the definition of asymptotic acceptance, there exists some number $M_X$ which is the smallest number such that, for all $N \ge M_X$,
$$\left[ \hat{\mathbb{1}}^{N\theta}(X) \right] = \mathbb{1}_L(X).$$
Now, let $X_m$ be the set of sentences X with length less than m. Since $X_m$ is finite, we pick $\theta_m$ just by taking
$$\theta_m = \left( \max_{X \in X_m} M_X \right) \theta.$$
Theorem A.2 (General bound on state complexity). Let $h_t$ be a neural network hidden state. For any length n, it holds that
$$m(h_n) = 2^{O(n)}.$$
Proof. The number of configurations of $h_n$ cannot be more than the number of distinct inputs to the network. By construction, each $x_t$ is a one-hot vector over the alphabet Σ. Thus, the state complexity is bounded according to
$$m(h_n) \le |\Sigma|^n = 2^{O(n)}.$$

B SRN Lemmas
Lemma B.1 (SRN lower bound).
RL ⊆ L(SRN).
Proof. We must show that any language acceptable by a finite-state machine is SRN-acceptable. We need to asymptotically compute a representation of the machine's state in $h_t$. We do this by storing all values of the following finite predicate at each time step:
$$ð_t(i, \alpha) \iff q_{t-1}(i) \wedge x_t = \alpha, \qquad (31)$$
where $q_t(i)$ is true if the machine is in state i at time t. Let F be the set of accepting states for the machine, and let $\delta^{-1}$ be the inverse transition relation. Assuming $h_t$ asymptotically computes $ð_t$, we can decide to accept or reject in the final layer according to the linearly separable disjunction
$$\bigvee_{\langle i, \alpha \rangle \in \delta^{-1}(F)} ð_n(i, \alpha). \qquad (32)$$
We now show how to recurrently compute $ð_t$ at each time step. By rewriting $q_{t-1}$ in terms of the previous $ð_{t-1}$ values, we get the following recurrence:
$$ð_t(i, \alpha) \iff x_t = \alpha \wedge \bigvee_{\langle j, \beta \rangle \in \delta^{-1}(i)} ð_{t-1}(j, \beta). \qquad (33)$$
Since this formula is linearly separable, we can compute it in a single neural network layer from $x_t$ and $h_{t-1}$.
Finally, we consider the base case. We need to ensure that transitions out of the initial state work out correctly at the first time step. We do this by adding a new memory unit $f_t$ to $h_t$ which is always rewritten to have value 1. Thus, if $f_{t-1} = 0$, we can be sure we are in the initial time step. For each transition out of the initial state, which we index as state 0, we add $f_{t-1} = 0$ as an additional term to get
$$ð_t(0, \alpha) \iff x_t = \alpha \wedge \Big( f_{t-1} = 0 \vee \bigvee_{\langle j, \beta \rangle \in \delta^{-1}(0)} ð_{t-1}(j, \beta) \Big). \qquad (34)$$
This equation is still linearly separable and guarantees that the initial step will be computed correctly.
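The construction can be sketched with explicit threshold units. The code below is an illustration under stated assumptions rather than the exact network of the proof: it keeps one binary unit per (state, symbol) pair, where a unit is active when the machine was in that state before reading that symbol, and it updates each unit with a single linear threshold. The function names and the parity automaton are hypothetical, and the empty string is not handled.

```python
def step(z):
    """Asymptotic threshold unit: 1 if the pre-activation is positive, else 0."""
    return 1 if z > 0 else 0


def srn_dfa_accepts(string, delta, start, accepting):
    """Simulate a DFA with threshold units indexed by (state, symbol) pairs."""
    units = list(delta)            # one unit per (state, symbol) pair
    prev = {u: 0 for u in units}   # predicate values from the previous time step
    flag = 0                       # memory unit that is 0 only before the first step
    for symbol in string:
        new = {}
        for (i, alpha) in units:
            # Disjunction over predecessor units whose transition leads into state i.
            incoming = sum(prev[u] for u in units if delta[u] == i)
            initial = (1 - flag) if i == start else 0
            match = 1 if symbol == alpha else 0
            # Conjunction of the symbol match with the disjunction, as one threshold.
            new[(i, alpha)] = step(match + incoming + initial - 1.5)
        prev, flag = new, 1
    # Final layer: accept if the active pair transitions into an accepting state.
    return step(sum(prev[u] for u in units if delta[u] in accepting) - 0.5) == 1


# Hypothetical example automaton: strings over {a, b} with an even number of a's.
parity = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}
print(srn_dfa_accepts("aab", parity, "even", {"even"}))
print(srn_dfa_accepts("ab", parity, "even", {"even"}))
```

Each update is a single weighted sum followed by a threshold, which is what linear separability buys: no hidden layer is needed between consecutive time steps.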

C GRU Lemmas
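These results follow similar arguments to those in Subsection 3.1 and Appendix B.
Proof of Lemma C.1. The configuration set of $z_t$ is a subset of $\{0, 1\}^k$. Thus, we have two possibilities for each value of $[h_t]_i$: it either keeps the previous value $[h_{t-1}]_i$ or takes a new candidate value, which is the output of a tanh and is therefore asymptotically −1 or 1. This implies that, at most, there are only three possible values for each logit: −1, 0, or 1. Thus, the state complexity of $h_n$ is
$$m(h_n) \le 3^k = O(1).$$
Lemma C.2 (GRU lower bound).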

RL ⊆ L(GRU).
Proof. We can simulate a finite-state machine using the ð construction from Theorem 3.2. We compute values for the following predicate at each time step: (38) Since (38) is linearly separable, we can store $ð_t$ in our hidden state $h_t$ and recurrently compute its update. The base case can be handled similarly to (34). A final feedforward layer accepts or rejects according to (32).

D Attention Lemmas
Theorem D.1 (Theorem 4.1 restated). Let $t_1, .., t_m$ be the subsequence of time steps that maximize $qk_t$. Asymptotically, attention computes
$$\lim_{N \to \infty} \mathrm{attn}(Nq, K, V) = \frac{1}{m} \sum_{i=1}^{m} v_{t_i}.$$
Proof. Observe that, asymptotically, softmax(u) approaches a function
$$\lim_{N \to \infty} \mathrm{softmax}(Nu)_t = \begin{cases} 1/m & \text{if } u_t = \max(u) \\ 0 & \text{otherwise.} \end{cases} \qquad (39)$$
Thus, the output of the attention mechanism reduces to the sum
$$\sum_{i=1}^{m} \frac{1}{m} v_{t_i}.$$
Lemma D.1 (Theorem 4.2 restated). The full state of the attention layer has state complexity
$$m(V_n) = 2^{\Theta(n)}.$$
Proof. By the general upper bound on state complexity (Theorem A.2), we know that $m(V_n) = 2^{O(n)}$. We now show the lower bound. We pick weights θ in the encoder such that $v_t = x_t$. Thus, $m(v_t^\theta) = |\Sigma|$ for all t. Since the values at each time step are independent, we know that
$$m(V_n^\theta) = |\Sigma|^n = 2^{\Omega(n)}.$$
where l is the number of t such that $x_t = 1$. We can vary the input to produce l from 1 to n. Thus, we have
This sum evaluates to a vector in $\{0, \infty\}^k$, which means that
Lemma D.5 applies if the sequence $v_1, .., v_n$ is computed as the output of ReLU. A similar result holds if it is computed as the output of an unsquashed linear transformation.

E CNN Lemmas
Proof of Lemma E.1. By contradiction. Assume we can write a network with window size k that accepts any string with exactly one b and rejects any other string. Consider a string with two bs at indices i and j where |i − j| > 2k + 1. Then, no column in the network receives both $x_i$ and $x_j$ as input. When we replace one b with an a, the value of $h_+$ remains the same. Since the value of $h_+$ (26) fully determines acceptance, the network does not accept this new string. However, the string now contains exactly one b, so we reach a contradiction.
Proof of Lemma E.2. We construct a k-CNN to simulate a strictly (2k+1)-local grammar. In the convolutional layer (25), each filter identifies whether a particular invalid (2k+1)-gram is matched. This condition is a conjunction of one-hot terms, so we use tanh to construct a linear transformation that comes out to 1 if a particular invalid sequence is matched, and −1 otherwise. Next, the pooling layer (26) collapses the filter values at each time step. A pooled filter will be 1 if the invalid sequence it detects was matched somewhere and −1 otherwise.
Finally, we decide acceptance (27) by verifying that no invalid pattern was detected. To do this, we assign each filter a weight of −1 and use a threshold of −K + 1/2, where K is the number of invalid patterns. If any filter has value 1, then this sum will be negative. Otherwise, it will be 1/2. Thus, asymptotic sigmoid will give us a correct acceptance decision.
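The three stages of this construction (filter matching, max pooling, and a thresholded sum) can be written out in their discrete limit. The sketch below is an illustration under assumptions: the grammar is given as a list of forbidden windows rather than as network weights, the window width is passed directly instead of being derived as 2k + 1, and the function and parameter names are hypothetical.

```python
def cnn_accepts(string, forbidden, width, pad="#"):
    """Discrete limit of the construction: one filter per forbidden window."""
    padded = pad * (width - 1) + string + pad * (width - 1)
    windows = [padded[i:i + width] for i in range(len(padded) - width + 1)]
    # Convolution: each filter outputs 1 on a match and -1 otherwise.
    filters = [[1 if w == f else -1 for w in windows] for f in forbidden]
    # Max pooling over time steps (default covers the case of no windows).
    pooled = [max(outputs, default=-1) for outputs in filters]
    # Each filter gets weight -1; the bias term is -K + 1/2 for K forbidden patterns.
    score = sum(-p for p in pooled) - len(forbidden) + 0.5
    return score > 0


# Hypothetical strictly 2-local language: strings over {a, b} with no "bb".
print(cnn_accepts("abab", ["bb"], width=2))
print(cnn_accepts("abba", ["bb"], width=2))
```

The score is 1/2 when no forbidden window occurs and negative as soon as one filter fires, which matches the two cases distinguished in the proof.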

F Neural Stack Lemmas
Refer to Hao et al. (2018) for a definition of the StackNN architecture. The architecture utilizes a differentiable data structure called a neural stack. We show that this data structure has $2^{\Theta(n)}$ state complexity.
Lemma F.1 (Neural stack state complexity). Let $S_n \in \mathbb{R}^{n \times k}$ be a neural stack with a feedforward controller. Then, $m(S_n) = 2^{\Theta(n)}$.
Proof. By the general state complexity bound (Theorem A.2), we know that $m(S_n) = 2^{O(n)}$. We now show the lower bound.
The stack at time step n is a matrix $S_n \in \mathbb{R}^{n \times k}$ where the rows correspond to vectors that have been pushed during the previous time steps. We set the weights of the controller θ such that, at each step, we pop with strength 0 and push $x_t$ with strength 1. Then, we have
$$m(S_n^\theta) = |\Sigma|^n = 2^{\Omega(n)}.$$
Lemma C.1 (GRU state complexity). The GRU hidden state has state complexity $m(h_n) = O(1)$.

Definition E.1 (Strictly k-local grammar). A strictly k-local grammar over an alphabet Σ is a set of allowable k-grams S. Each s ∈ S takes the form $s \in (\Sigma \cup \{\#\})^k$, where # is a padding symbol for the start and end of sentences.
Definition E.2 (Strictly local acceptance). A strictly k-local grammar S accepts a string σ if, at each index i, $\sigma_i \sigma_{i+1} .. \sigma_{i+k-1} \in S$.
Lemma E.2 (Implies Theorem 5.2). A k-CNN can asymptotically accept any strictly (2k+1)-local language.

Table 1: Generalization performance of language models trained on $a^n b^n c$. Each model has 2 hidden units.

Table 2: Max validation and generalization accuracies on string reversal over 10 trials. The top section shows our seq2seq LSTM with and without attention. The bottom reports the LSTM and StackNN results of Hao et al. (2018).
... $m(h_n) = \Omega(n)$.
Proof. Consider the case where keys and values have dimension 1. Further, let the input strings come from a binary alphabet Σ = {0, 1}. We pick parameters θ in the encoder such that, for all t,
If $\lim_{N \to \infty} v_t \in \{0, \infty\}^k$ for 1 ≤ t ≤ n, then $m(h_n) = O(1)$.
Lemma D.4 (Attention state complexity with unique maximum). If, for all X, there exists a unique $t^*$ such that $t^* = \arg\max_t q_n k_t$, then $m(h_n) = O(1)$.