Recurrent Neural Networks as Weighted Language Recognizers

We investigate the computational complexity of various problems for simple recurrent neural networks (RNNs) as formal models for recognizing weighted languages. We focus on single-layer, ReLU-activation, rational-weight RNNs with softmax, which are commonly used in natural language processing applications. We show that most problems for such RNNs are undecidable, including consistency, equivalence, minimization, and the determination of the highest-weighted string. However, for consistent RNNs the last problem becomes decidable, although the solution length can surpass all computable bounds. If additionally the string is limited to polynomial length, the problem becomes NP-complete and APX-hard. In summary, this shows that approximations and heuristic algorithms are necessary in practical applications of those RNNs. We also consider RNNs as unweighted language recognizers and situate RNNs between Turing machines and random-access machines regarding their real-time recognition powers.


Introduction
Recurrent neural networks (RNNs) are an attractive apparatus for probabilistic language modeling (Mikolov and Zweig, 2012). Recent experiments show that RNNs significantly outperform other methods in assigning high probability to held-out English text (Jozefowicz et al., 2016).
Roughly speaking, an RNN works as follows. At each time step, it consumes one input token, updates its hidden state vector, and predicts the next token by generating a probability distribution over all permissible tokens. The probability of an input string is simply obtained as the product of the predictions of the tokens constituting the string followed by a terminating token. In this manner, each RNN defines a weighted language; i.e. a total function from strings to weights. Siegelmann and Sontag (1995) showed that single-layer rational-weight RNNs with saturated linear activation can compute any computable function. To this end, a specific architecture with 886 hidden units can simulate any Turing machine in real-time (i.e., each Turing machine step is simulated in a single time step). However, their RNN encodes the whole input in its internal state, performs the actual computation of the Turing machine when reading the terminating token, and then encodes the output (provided an output is produced) in a particular hidden unit. In this way, their RNN allows "thinking" time (equivalent to the computation time of the Turing machine) after the input has been encoded.
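The chain-rule weighting just described can be sketched in a few lines of Python. Here `next_prob` is a toy stand-in for the RNN's next-token prediction: a fixed distribution rather than a state-dependent one, purely for illustration.

```python
def string_weight(s, next_prob):
    """Weight of a string: the product of next-token probabilities for
    each token of s, followed by the terminating token '$'."""
    w = 1.0
    for tok in list(s) + ["$"]:
        w *= next_prob(tok)  # a real RNN would also update its hidden state here
    return w

# Toy stand-in for the RNN's softmax prediction: a fixed distribution
# over {'a', '$'} (hypothetical, for illustration only)
probs = {"a": 0.5, "$": 0.5}
w = string_weight("aaa", lambda tok: probs[tok])
# four factors of 0.5: 0.5 ** 4 == 0.0625
```

With a genuine RNN the distribution changes at each step as a function of the hidden state, but the weight of a string is still exactly this kind of product.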
We consider a different variant of RNNs that is commonly used in natural language processing applications. It uses ReLU activations, consumes an input token at each time step, and produces softmax predictions for the next token. It thus immediately halts after reading the last input token and the weight assigned to the input is simply the product of the input token predictions in each step.
Other formal models that are currently used to implement probabilistic language models, such as finite-state automata and context-free grammars, are by now well understood. A fair share of their utility derives directly from their nice algorithmic properties. For example, the weighted languages computed by weighted finite-state automata are closed under intersection (pointwise product) and union (pointwise sum), and the corresponding unweighted languages are closed under intersection, union, difference, and complementation (Droste et al., 2013). Moreover, toolkits like OpenFST (Allauzen et al., 2007) and Carmel implement efficient algorithms on automata like minimization, intersection, and finding the highest-weighted path and the highest-weighted string.
RNN practitioners naturally face many of these same problems. For example, an RNN-based machine translation system should extract the highest-weighted output string (i.e., the most likely translation) generated by an RNN (Sutskever et al., 2014; Bahdanau et al., 2014). Currently this task is solved by approximation techniques like heuristic greedy and beam searches. To facilitate the deployment of large RNNs onto limited-memory devices (like mobile phones), minimization techniques would be beneficial. Again, currently only heuristic approaches like knowledge distillation (Kim and Rush, 2016) are available. Meanwhile, it is unclear whether we can determine if the computed weighted language is consistent; i.e., if it is a probability distribution on the set of all strings. Without a determination of the overall probability mass assigned to all finite strings, a fair comparison of language models with regard to perplexity is simply impossible.
The goal of this paper is to study the above problems for the mentioned ReLU-variant of RNNs. More specifically, we ask and answer the following questions:
• Consistency: Do RNNs compute consistent weighted languages? Is the consistency of the computed weighted language decidable?
• Highest-weighted string: Can we (efficiently) determine the highest-weighted string in a computed weighted language?
• Equivalence: Can we decide whether two given RNNs compute the same weighted language?
• Minimization: Can we minimize the number of neurons for a given RNN?

Definitions and notations
Before we introduce our RNN model formally, we recall some basic notions and notation. An alphabet Σ is a finite set of symbols, and we write |Σ| for the number of symbols in Σ. A string s over the alphabet Σ is a finite sequence of zero or more symbols drawn from Σ, and we write Σ* for the set of all strings over Σ, of which ε is the empty string. The length of a string s ∈ Σ* is denoted |s| and coincides with the number of symbols constituting the string. As usual, we write A^B for the set of functions {f | f : B → A}. A weighted language L is a total function L : Σ* → R from strings to real-valued weights. For example, L(a^n) = e^{−n} for all n ≥ 0 is such a weighted language.
We restrict the weights in our RNNs to the rational numbers Q. In addition, we reserve the use of a special symbol $ to mark the start and end of an input string. To this end, we assume that $ ∉ Σ for all considered alphabets, and we let Σ_$ = Σ ∪ {$}.
Next, let us define how such an RNN works. We first prepare our input encoding and the effect of our activation function. For an input string s = s_1 s_2 ⋯ s_n ∈ Σ* with s_1, …, s_n ∈ Σ, we encode this input as $s$ and thus assume that s_0 = $ and s_{n+1} = $. Our RNNs use ReLUs (rectified linear units), so for every v ∈ Q^N we let σ(v) (the ReLU activation) be the vector given by σ(v)(n) = max(0, v(n)) for every n ∈ N. In other words, the ReLUs act like identities on nonnegative inputs, but clip negative inputs to 0. We use softmax predictions, so for every vector p ∈ Q^{Σ_$} and a ∈ Σ_$ we let

softmax_p(a) = e^{p(a)} / Σ_{a′ ∈ Σ_$} e^{p(a′)}.
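Both ingredients are straightforward to state in code. The following Python sketch mirrors the definitions above: a component-wise ReLU over a rational vector, and a softmax over scores indexed by the symbols of Σ_$.

```python
import math

def relu(v):
    # sigma: identity on nonnegative entries, negative entries clipped to 0
    return [max(0.0, x) for x in v]

def softmax(p):
    # p maps each symbol of Sigma_$ to a score; the result is a probability
    # distribution, and no symbol ever receives probability exactly 0
    z = sum(math.exp(x) for x in p.values())
    return {a: math.exp(x) / z for a, x in p.items()}

d = softmax({"a": 1.0, "$": 1.0})  # equal scores give equal probabilities
```

Note that softmax output is always strictly positive, which is exactly why an RNN of this kind can never assign weight 0 to any string.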
RNNs act in discrete time steps, reading a single letter at each step. Formally, an RNN is a tuple R = ⟨Σ, N, h_{−1}, W, W′, E, E′⟩ consisting of an input alphabet Σ, a finite set N of neurons, an initial activation h_{−1} ∈ Q^N, a transition matrix W ∈ Q^{N×N} with input-dependent summands W′_a ∈ Q^N for a ∈ Σ_$, and a prediction matrix E ∈ Q^{Σ_$ × N} with input-dependent summands E′_a ∈ Q^{Σ_$}. We now define the semantics of our RNNs.
Definition 2. Let R = ⟨Σ, N, h_{−1}, W, W′, E, E′⟩ be an RNN, s an input string of length n, and 0 ≤ t ≤ n a time step. We define
• the hidden state vector h_{s,t} ∈ Q^N given by h_{s,t} = σ(W · h_{s,t−1} + W′_{s_t}), where h_{s,−1} = h_{−1} and we use the standard matrix product and point-wise vector addition, and
• the prediction vector E_{s,t} ∈ Q^{Σ_$} given by E_{s,t} = E · h_{s,t} + E′_{s_t}.
Finally, the RNN R computes the weighted language R : Σ* → R, which is given for every input s = s_1 ⋯ s_n as above by

R(s) = ∏_{t=0}^{n} softmax_{E_{s,t}}(s_{t+1}).

In other words, each component h_{s,t}(n) of the hidden state vector is the ReLU activation applied to a linear combination of all the components of the previous hidden state vector h_{s,t−1} together with a summand W′_{s_t} that depends on the t-th input letter s_t. Thus, we often specify h_{s,t}(n) as a linear combination instead of specifying the matrix W and the vectors W′_a. The semantics is then obtained by predicting the letters s_1, …, s_n of the input s and the final terminator $ and multiplying the probabilities of the individual predictions.
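The semantics can be sketched as a short forward pass. This is a minimal Python rendering under the recurrence and prediction equations above; the all-zero parameterization at the bottom is an assumption chosen so that every prediction is uniform over {a, $}, reproducing the behavior R(a^n) = 2^{−(n+1)} discussed later, and is not necessarily the specification used in Figure 1.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def rnn_weight(s, h_init, W, Wp, E, Ep, alphabet):
    """Weight of s: read $s$, and multiply the softmax predictions of
    s_1 .. s_n and the final terminator $ (one factor per time step)."""
    seq = ["$"] + list(s) + ["$"]
    h = list(h_init)
    weight = 1.0
    for t in range(len(seq) - 1):
        # hidden state update: h = relu(W h + W'_{s_t})
        h = relu([sum(W[i][j] * h[j] for j in range(len(h))) + Wp[seq[t]][i]
                  for i in range(len(h))])
        # prediction scores: E_{s,t}(a) = (E h)(a) + E'_{s_t}(a)
        p = {a: sum(E[a][j] * h[j] for j in range(len(h))) + Ep[seq[t]][a]
             for a in alphabet}
        z = sum(math.exp(x) for x in p.values())
        weight *= math.exp(p[seq[t + 1]]) / z
    return weight

# all-zero parameters: every prediction is uniform over {a, $}
zeros = {"$": [0.0], "a": [0.0]}
bias = {"$": {"a": 0.0, "$": 0.0}, "a": {"a": 0.0, "$": 0.0}}
w = rnn_weight("aa", [0.0], [[0.0]], zeros,
               {"a": [0.0], "$": [0.0]}, bias, ["a", "$"])
# three uniform predictions: (1/2) ** 3 == 0.125
```

Each of the n + 1 steps contributes one softmax factor, so the returned weight always lies strictly between 0 and 1.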
Let us illustrate these notions on an example. We consider the RNN ⟨Σ, N, h_{−1}, W, W′, E, E′⟩ of Figure 1 (second column) with γ ∈ Q, whose hidden state components are specified there. Given the initial activation, we obtain h_{s,t} = σ(⟨t, t − 1⟩). Using this information, we assign weight e^{−M}/(1 + e^{−M}) to the input ε, weight 1/(1 + e^{−M}) · e^1/(e^1 + e^1) = 1/(1 + e^{−M}) · 1/2 to a, and, more generally, weight 1/(1 + e^{−M}) · 1/2^n to a^n. Clearly, the weight assigned by an RNN is always in the interval (0, 1), which enables a probabilistic view. Similar to weighted finite-state automata or weighted context-free grammars, each RNN is a compact, finite representation of a weighted language. The softmax operation enforces that probability 0 is impossible as an assigned weight, so every input string remains possible in principle. In practical language modeling, smoothing methods are used to change distributions such that impossibility (probability 0) is removed. Our RNNs avoid impossibility outright, so this can be considered a feature rather than a disadvantage.
The hidden state h_{s,t} of an RNN can be used as scratch space for computation. For example, with a single neuron n we can count the symbols in s via h_{s,t}(n) = σ(h_{s,t−1}(n) + 1). Here the letter-dependent summand W′_a is universally 1.
Similarly, for an alphabet Σ = {a_1, …, a_m} we can use the method of Siegelmann and Sontag (1995) to encode the complete input string s in base m + 1 using h_{s,t}(n) = σ(h_{s,t−1}(n)/(m + 1) + c(s_t)/(m + 1)), where c : Σ_$ → {0, …, m} is a bijection. In principle, we can thus store the entire input string (of unbounded length) in the hidden state value h_{s,t}(n), but our RNN model outputs weights at each step and terminates immediately once the final delimiter $ is read. It must therefore assign a probability to a string incrementally using the chain rule decomposition p(s_1 ⋯ s_n) = p(s_1) · p(s_2 | s_1) · … · p(s_n | s_1 ⋯ s_{n−1}).
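Both scratch-space tricks can be sketched directly. The base-(m + 1) update below is a simplified rendering of the Siegelmann–Sontag-style fractal encoding (their exact construction differs in detail); each step packs one more digit into a single hidden value.

```python
def relu1(x):
    return max(0.0, x)

def count_symbols(s):
    # single counting neuron: h_t = relu(h_{t-1} + 1), with the
    # letter-independent summand W'_a = 1 for every symbol
    h = 0.0
    for _ in s:
        h = relu1(h + 1.0)
    return h

def encode(s, code, m):
    # simplified fractal encoding in the style of Siegelmann and Sontag
    # (1995): h_t = h_{t-1}/(m+1) + c(s_t)/(m+1) packs the prefix read so
    # far into one value (the most recent symbol is the leading digit)
    h = 0.0
    for a in s:
        h = relu1(h / (m + 1) + code[a] / (m + 1))
    return h

h = encode("ab", {"$": 0, "a": 1, "b": 2}, 2)  # 1/9 + 6/9 = 7/9
```

Since the encoding is injective, the whole unbounded-length prefix survives in a single rational value; the point of the surrounding paragraph is that our RNNs nonetheless cannot defer computation, because they must emit a prediction at every step.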
Let us illustrate our notion of RNNs on some additional examples. They all use the alphabet Σ = {a} and are illustrated and formally specified in Figure 1. The first column shows an RNN R_1 that assigns R_1(a^n) = 2^{−(n+1)}. The next-token prediction matrix ensures equal values for a and $ at every time step. The second column shows the RNN R_2, which we already discussed. In the beginning, it heavily biases the next-symbol prediction towards a, but counters this bias starting at t = 1. The third RNN R_3 uses another counting mechanism with h_{s,t} = σ(⟨t − 100, t − 101, t⟩). The first two components are ReLU-thresholded to zero until t > 100 and t > 101, respectively, at which point they overwhelm the bias towards a, turning all future predictions to $.


Consistency

An RNN R is consistent if Σ_{s ∈ Σ*} R(s) = 1. We first show that there is an inconsistent RNN, which together with our examples shows that consistency is a nontrivial property of RNNs. We immediately use a slightly more complex example, which we will later reuse.
Example 3. Let us consider an RNN with the single-letter alphabet Σ = {a}, the neurons {1, 2, 3, n, n′} ⊆ N, initial activation h_{−1}(i) = 0 for all i ∈ {1, 2, 3, n, n′}, and linear combinations chosen so that the hidden states behave as described in the two cases below.

For comparison, all probabilistic finite-state automata are consistent, provided no transitions exit final states. Not all probabilistic context-free grammars are consistent; necessary and sufficient conditions for consistency are given by Booth and Thompson (1973). However, probabilistic context-free grammars obtained by training on a finite corpus using popular methods (such as expectation-maximization) are guaranteed to be consistent (Nederhof and Satta, 2006).
Now we distinguish two cases.

Case 1: Suppose that h_{s,t}(1) = 0 for all t. Then E_{s,t}(a) = t + 1. In this case the termination probability 1/(1 + e^{2(t+1)}) (i.e., the likelihood of predicting $) shrinks rapidly towards 0, so the RNN assigns less than 15% of the probability mass to the terminating sequences (i.e., the finite strings), and hence the RNN is inconsistent (see Lemma 15 in the appendix).
Case 2: Suppose that there exists a (first) time T at which the designated neuron fires. Then h_{s,t}(1) = 0 for all t ≤ T and h_{s,t}(1) = 1 otherwise. In addition, we have h_{s,t}(2) = t + 1, and the probability of predicting $ increases over time and eventually (for t ≫ 3T) far outweighs the probability of predicting a. Consequently, in this case the RNN is consistent (see Lemma 16 in the appendix).
We have seen in the previous example that consistency is not trivial for RNNs, which takes us to the consistency problem for RNNs: Consistency: Given an RNN R, return "yes" if R is consistent and "no" otherwise.
We recall the following theorem, which, combined with our example, will prove that consistency is unfortunately undecidable for RNNs.
Theorem 4 (Theorem 2 of Siegelmann and Sontag (1995)). Let M be an arbitrary deterministic Turing machine. There exists an RNN with saturated linear activation, input alphabet Σ = {a}, and a designated neuron n ∈ N such that for all s ∈ Σ* and 0 ≤ t ≤ |s|
• h_{s,t}(n) = 0 if M does not halt on ε, and
• if M does halt on empty input after T steps, then h_{s,t}(n) = 0 for all t < T and h_{s,t}(n) = 1 for all t ≥ T.

In other words, such RNNs with saturated linear activation can semi-decide halting of an arbitrary Turing machine in the sense that a particular neuron achieves value 1 at some point during the evolution if and only if the Turing machine halts on empty input. An RNN with saturated linear activation is an RNN following our definition with the only difference that instead of our ReLU activation σ the following saturated linear activation σ′ : Q^N → Q^N is used. For every vector v ∈ Q^N and n ∈ N, let

σ′(v)(n) = 0 if v(n) < 0, v(n) if 0 ≤ v(n) ≤ 1, and 1 if v(n) > 1.

Since σ′(v) = σ(v) − σ(v − 1) for all v ∈ Q^N, and the arguments on the right-hand side are linear transformations, we can easily simulate saturated linear activation in our RNNs. To this end, each neuron n ∈ N of the original RNN R = ⟨Σ, N, h_{−1}, U, U′, E, E′⟩ is replaced by two neurons n_1 and n_2 in the new RNN R′ = ⟨Σ, N′, h′_{−1}, V, V′, F, F′⟩ such that h_{s,t}(n) = h′_{s,t}(n_1) − h′_{s,t}(n_2) for all s ∈ Σ* and 0 ≤ t ≤ |s|, where the evaluation of h′_{s,t} is performed in the RNN R′. More precisely, we use the transition matrix V and summands V′ given by

V(n_1, n′_1) = V(n_2, n′_1) = U(n, n′), V(n_1, n′_2) = V(n_2, n′_2) = −U(n, n′), V′_a(n_1) = U′_a(n), and V′_a(n_2) = U′_a(n) − 1

for all n, n′ ∈ N and a ∈ Σ ∪ {$}, where n_1 and n_2 are the two neurons corresponding to n and n′_1 and n′_2 are the two neurons corresponding to n′ (see Lemma 17 in the appendix). We can now use this corollary together with the RNN R of Example 3 to show that the consistency problem is undecidable. To this end, we simulate a given Turing machine M and identify the two designated neurons of Corollary 5 with n and n′ in Example 3. It follows that M halts if and only if R is consistent.
Hence we reduced the undecidable halting problem to the consistency problem, which shows the undecidability of the consistency problem.
Theorem 6. The consistency problem for RNNs is undecidable.
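The identity σ′(v) = σ(v) − σ(v − 1) behind the two-neuron simulation is easy to check numerically:

```python
def relu1(x):
    return max(0.0, x)

def saturated(x):
    # saturated linear activation: 0 below 0, identity on [0, 1], 1 above 1
    return min(1.0, max(0.0, x))

def saturated_via_relu(x):
    # the identity sigma'(x) = sigma(x) - sigma(x - 1): this is why each
    # saturated-linear neuron can be replaced by a pair of ReLU neurons
    return relu1(x) - relu1(x - 1.0)
```

For x below 0 both ReLU terms vanish; on [0, 1] only the first is active and acts as the identity; above 1 the difference is pinned to 1.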
As mentioned in Footnote 2, probabilistic context-free grammars obtained after training on a finite corpus using the most popular methods are guaranteed to be consistent. At least for 2-layer RNNs this does not hold.
Theorem 7. A two-layer RNN trained to a local optimum using backpropagation through time (BPTT) on a finite corpus is not necessarily consistent.
Proof. The first layer of the RNN R, over the single-letter alphabet Σ = {a}, uses one neuron n; the second layer uses a neuron n′ that takes h_{s,t}(n) as input at time t. Let the training data be {a}. Then the objective we wish to maximize is simply R(a). The parameters can be chosen such that R is inconsistent while every ReLU operates in its flat (zero) region, so the derivative of this objective with respect to each parameter is 0. Applying gradient-descent updates therefore does not change any of the parameters, and we have converged to an inconsistent RNN.
It remains an open question whether there is a single-layer RNN that also exhibits this behavior.

Highest-weighted string
Given a function f : Σ* → R we are often interested in the highest-weighted string. This corresponds to the most likely sentence in a language model or the most likely translation for a decoder RNN in machine translation. For deterministic probabilistic finite-state automata or context-free grammars only one path or derivation exists for any given string, so identifying the highest-weighted string is the same task as identifying the most probable path or derivation. However, for nondeterministic devices, the highest-weighted string is often harder to identify, since the weight of a string is the sum of the probabilities of all possible paths or derivations for that string. A comparison of the difficulty of identifying the most probable derivation and the highest-weighted string for various models is summarized in Table 1, in which we mark our results in bold face.
We present various results concerning the difficulty of identifying the highest-weighted string in a weighted language computed by an RNN. We also summarize some available algorithms. We start with the formal presentation of the three studied problems.
1. Best string: Given an RNN R and c ∈ (0, 1), does there exist s ∈ Σ* with R(s) > c?
2. Consistent best string: Given a consistent RNN R and c ∈ (0, 1), does there exist s ∈ Σ* with R(s) > c?
3. Consistent best string of polynomial length: Given a consistent RNN R, a polynomial P with P(x) ≥ x for all x ∈ N_+, and c ∈ (0, 1), does there exist s ∈ Σ* with |s| ≤ P(|R|) and R(s) > c?

As usual, the corresponding optimization problems are not significantly simpler than these decision problems. Unfortunately, the general problem is also undecidable, which can easily be shown using our example.
Theorem 8. The best string problem for RNNs is undecidable.
Proof. Let M be an arbitrary Turing machine and again consider the RNN R of Example 3 with the neurons n and n′ identified with the designated neurons of Corollary 5. We note that R(ε) = 1/(1 + e^2) < 0.12 in both cases. If M does not halt, then R(a^n) ≤ 1/(1 + e^{2(n+1)}) ≤ 1/(1 + e^2) < 0.12 for all n ∈ N. On the other hand, if M halts after T steps, then R(a^n) > 0.12 for all sufficiently large n, using Lemma 14 in the appendix. Consequently, a string with weight above 0.12 exists if and only if M halts, so the best string problem is also undecidable.
If we restrict the RNNs to be consistent, then we can easily decide the best string problem by simple enumeration.
Theorem 9. The consistent best string problem for RNNs is decidable.
Proof. Let R be the RNN over alphabet Σ and c ∈ (0, 1) be the bound. Since Σ* is countable, we can enumerate it via f : N → Σ*. In the algorithm we compute S_n = Σ_{i=0}^{n} R(f(i)) for increasing values of n. If we encounter a weight R(f(n)) > c, then we stop with answer "yes." Otherwise we continue until S_n > 1 − c, at which point we stop with answer "no." Since R is consistent, lim_{i→∞} S_i = 1, so this algorithm is guaranteed to terminate, and it obviously decides the problem.
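The enumeration algorithm of the proof can be sketched as follows. Here `weight` is an arbitrary Python function standing in for the RNN R; termination relies on consistency, i.e., on the accumulated mass converging to 1.

```python
from itertools import count, product

def enumerate_strings(alphabet):
    # length-ordered enumeration f : N -> Sigma*
    for n in count(0):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

def best_string_exceeds(weight, alphabet, c):
    """Decide whether some string has weight > c, assuming `weight` is a
    consistent weighted language (total mass 1)."""
    mass = 0.0
    for s in enumerate_strings(alphabet):
        w = weight(s)
        if w > c:
            return True      # found a witness
        mass += w
        if mass > 1.0 - c:
            return False     # remaining mass cannot contain a witness

# consistent toy language: weight(a^n) = 2^{-(n+1)}
toy = lambda s: 0.5 ** (len(s) + 1)
```

For the toy language, `best_string_exceeds(toy, ["a"], 0.4)` succeeds immediately on ε (weight 0.5), while with threshold 0.6 the accumulated mass exceeds 0.4 after two strings and the procedure correctly answers "no."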
Next, we investigate the length |w^max_R| of the shortest string w^max_R of maximal weight in the weighted language R generated by a consistent RNN R in terms of its (binary storage) size |R|. As already mentioned by Siegelmann and Sontag (1995) and evidenced here, only small-precision rational numbers are needed in our constructions, so we assume that |R| ≤ c · |N|^2 for a (reasonably small) constant c, where N is the set of neurons of R. We show that no computable bound on the length of the best string can exist, so its length can surpass all reasonable bounds.
Theorem 10. Let f : N_+ → N be the function with f(n) = max { |w^max_R| : R a consistent RNN with |R| ≤ n } for all n ∈ N_+. There exists no computable function g : N → N with g(n) ≥ f(n) for all n ∈ N.
Proof. In the previous section (before Theorem 6) we presented an RNN R_M that simulates an arbitrary (single-track) Turing machine M with n states. By Siegelmann and Sontag (1995) we have |R_M| ≤ c · (4n + 16). Moreover, we observed that this RNN R_M is consistent if and only if the Turing machine M halts on empty input. In the proof of Theorem 8 we additionally saw that the length |w^max_{R_M}| of its best string exceeds the number T_M of steps required to halt.
For every n ∈ N, let BB(n) be the n-th "Busy Beaver" number (Radó, 1962), which is

BB(n) = max { T_M : M a normalized n-state Turing machine with 2 tape symbols that halts on empty input }.

It is well known that BB : N_+ → N cannot be bounded by any computable function. However,

BB(n) ≤ max { |w^max_{R_M}| : M a normalized n-state Turing machine with 2 tape symbols that halts on empty input } ≤ f(c · (4n + 16)),

so f clearly cannot be computable and no computable function g can provide bounds for f.
Finally, we investigate the difficulty of the best string problem for consistent RNNs restricted to solutions of polynomial length.
Theorem 11. Identifying the best string of polynomial length in a consistent RNN is NP-complete.
Proof. To show NP-hardness, we reduce from the 3-SAT problem. Let x_1, …, x_m be m Boolean variables and let F = c_1 ∧ ⋯ ∧ c_k with clauses c_i = ℓ_{i1} ∨ ℓ_{i2} ∨ ℓ_{i3} be a formula in conjunctive normal form, where ℓ_{ij} ∈ {x_1, …, x_m, ¬x_1, …, ¬x_m}. 3-SAT asks whether there is an assignment to the x_i that makes F true.
We initialize h_{−1}(n) = 0 for all n ∈ N = {x_1, …, x_m, c_1, …, c_k, c′_1, …, c′_k, F, n_1, n_2, n_3} together with a goal neuron. Let s ∈ {0, 1}* be the input string, and denote by F(s) the value of F when x_j = s_j for all j ∈ [m]. Let t ∈ N with t ≤ |s|. We set h_{s,t}(x_m) = σ(I(s_t)), where I(0) = I($) = 0 and I(1) = 1. This stores the current input symbol in neuron x_m, so h_{s,t}(x_m) = I(s_t); earlier symbols are passed along the neurons x_{m−1}, …, x_1, so that after reading s_1 ⋯ s_m neuron x_j holds I(s_j). Next, we evaluate the clauses. For each i ∈ [k], we use two neurons c_i and c′_i whose difference computes the truth value of clause c_i, so that h_{s,t}(F) = F(s) contains the evaluation of the formula F using the values in neurons x_1, …, x_m.
Our goal neuron collects this result. Let m′ = m + 4. The output is set such that E_{s,t}(0) = E_{s,t}(1) = −E_{s,t}($) = −m′ if t = m + 2 and F(s) = 1, and E_{s,t}(0) = E_{s,t}(1) = −E_{s,t}($) = m′ otherwise. Finally, we set the threshold ξ = 3^{−m′}. When |s| ≠ m + 2, the final $ is predicted at a step where E_{s,t}($) = −m′, so the weight of s contains the factor e^{−m′}/(2e^{m′} + e^{−m′}) = 1/(2e^{2m′} + 1) and thus is upper-bounded by 1/(2e^{2m′} + 1) < ξ. Hence no input of length different from m + 2 achieves a weight that exceeds ξ. A string s of length m + 2 with F(s) = 0 likewise contains this factor, so if F is unsatisfiable, no input string achieves a weight above the threshold ξ. When F(s) = 1 and |s| = m + 2, each of the m + 3 prediction factors is at least 1/3 (and the final $ is predicted with probability close to 1), so the weight w_s exceeds 3^{−(m+3)} > ξ. An input string with weight above ξ exists if and only if F is satisfiable. Obviously, the reduction can be computed in polynomial time since all constants can be computed in logarithmic space. The constructed RNN is consistent, since the output prediction is constant after m + 3 steps.
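The clause evaluation can be illustrated with plain ReLU arithmetic. The sketch below is one plausible realization of the two-neuron clause gadget: the pairing c = σ(Σ literals) and c′ = σ(c − 1), so that c − c′ = min(1, Σ literals) is the clause's truth value, is our assumption, not necessarily the exact gadget of the proof; the literal encoding is likewise hypothetical.

```python
def relu1(x):
    return max(0, x)

def literal(assign, lit):
    # lit encodes x_j as +j and its negation as -j (hypothetical encoding);
    # assign is a 0/1 list giving values for x_1 .. x_m
    j = abs(lit)
    v = assign[j - 1]
    return v if lit > 0 else 1 - v

def cnf_value(assign, clauses):
    # per clause: c = relu(sum of literal values), c' = relu(c - 1), and
    # c - c' = min(1, sum) is the clause's truth value; the formula neuron
    # F = relu(sum of clause values - (k - 1)) is 1 iff all k clauses hold
    vals = []
    for cl in clauses:
        c = relu1(sum(literal(assign, l) for l in cl))
        cp = relu1(c - 1)
        vals.append(c - cp)
    return relu1(sum(vals) - (len(clauses) - 1))

# (x1 v x2 v -x3) ^ (-x1 v x3 v x2)
clauses = [[1, 2, -3], [-1, 3, 2]]
```

For these clauses, the assignment x_1 = x_2 = 1, x_3 = 0 satisfies both clauses, while x_1 = 1, x_2 = x_3 = 0 falsifies the second one.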

Equivalence
We prove that equivalence of two RNNs is undecidable. For comparison, equivalence of two deterministic WFSAs can be tested in time O(|Σ|(|Q_A| + |Q_B|)^3), where |Q_A| and |Q_B| are the numbers of states of the two WFSAs and |Σ| is the size of the alphabet (Cortes et al., 2007); equivalence of nondeterministic WFSAs is undecidable (Griffiths, 1968). The decidability of language equivalence for deterministic probabilistic pushdown automata (PPDA) is still open (Forejt et al., 2014), although equivalence for deterministic unweighted pushdown automata (PDA) is decidable (Sénizergues, 1997).
The equivalence problem is formulated as follows: Equivalence: Given two RNNs R and R′, return "yes" if R(s) = R′(s) for all s ∈ Σ*, and "no" otherwise.
Theorem 12. The equivalence problem for RNNs is undecidable.
Proof. We prove the statement by contradiction. Suppose a Turing machine M decides the equivalence problem. Given an arbitrary deterministic Turing machine M′, construct the RNN R that simulates M′ on empty input ε as described in Corollary 5, and let E_{s,t}(a) = 0 and E_{s,t}($) = h_{s,t}(n_1) − h_{s,t}(n_2). If M′ does not halt on ε, then for all t ∈ N the symbols a and $ are each predicted with probability 1/2; if M′ halts after T steps, then from time T on the symbol a is predicted with probability 1/(e + 1) and $ with probability e/(e + 1). Let R′ be the trivial RNN that computes R′(a^n) = 2^{−(n+1)} for all n ≥ 0. We run M on input ⟨R, R′⟩. If M returns "no," then M′ halts on ε; otherwise it does not halt. Therefore the halting problem would be decidable if equivalence were decidable, so equivalence is undecidable.

Minimization
We look next at minimization of RNNs. For comparison, state-minimization of a deterministic PFSA is O(|E| log |Q|) where |E| is the number of transitions and |Q| is the number of states (Aho et al., 1974). Minimization of a non-deterministic PFSA is PSPACE-complete (Jiang and Ravikumar, 1993).
We focus on minimizing the number |N| of hidden neurons in RNNs: Minimization: Given an RNN R and a non-negative integer n, return "yes" if there exists an RNN R′ with at most n hidden units such that R(s) = R′(s) for all s ∈ Σ*, and "no" otherwise.
Theorem 13. The minimization problem for RNNs is undecidable.

Proof. We reduce from the halting problem. Suppose a Turing machine M decides the minimization problem. For an arbitrary Turing machine M′, construct the same RNN R as in Theorem 12, and run M on input ⟨R, 0⟩. Note that an RNN with no hidden units can only output a constant E_{s,t} for all t. Therefore the number of hidden units in R can be minimized to 0 if and only if it always predicts a and $ with probability 1/2 each; i.e., if and only if M′ does not halt on ε. If M returns "yes," then M′ does not halt on ε; otherwise it halts.

Conclusion
We proved the following hardness results for RNNs as recognizers of weighted languages: consistency, equivalence, minimization, and finding the highest-weighted string are all undecidable, while finding the highest-weighted string of polynomial length in a consistent RNN is NP-complete. While the undecidability results follow from the Turing-completeness of the RNN model (Siegelmann and Sontag, 1995), our NP-completeness result is original, and surprising, since the analogous hardness result for PFSA relies on the fact that there are multiple derivations for a single string (Casacuberta and de la Higuera, 2000). The fact that these results hold for the relatively simple RNNs we used in this paper suggests that the case would be the same for more complicated models used in NLP, such as long short-term memory networks (LSTMs; Hochreiter and Schmidhuber 1997).
Our results show the non-existence of (efficient) algorithms for interesting problems that researchers using RNNs in natural language processing tasks may have hoped to find. On the other hand, the non-existence of such efficient or exact algorithms gives evidence for the necessity of approximation, greedy, or heuristic algorithms to solve those problems in practice. In particular, since finding the highest-weighted string in an RNN is the same as finding the most likely translation in a sequence-to-sequence RNN decoder, our NP-completeness result provides some justification for employing greedy and beam search algorithms in practice.


Appendix

Proof (of Lemma 17). We set h′_{s,−1}(n_1) = h_{−1}(n) and h′_{s,−1}(n_2) = 0 for all n ∈ N. Then trivially h′_{s,−1}(n_1) − h′_{s,−1}(n_2) = h_{−1}(n) − 0 = h_{s,−1}(n). Moreover,

h′_{s,t}(n_1) = σ(V · h′_{s,t−1} + V′_{s[t]})(n_1)
= σ( Σ_{n′∈N} [ V(n_1, n′_1) · h′_{s,t−1}(n′_1) + V(n_1, n′_2) · h′_{s,t−1}(n′_2) ] + V′_{s[t]}(n_1) )
= σ( Σ_{n′∈N} U(n, n′) · (h′_{s,t−1}(n′_1) − h′_{s,t−1}(n′_2)) + U′_{s[t]}(n) )
= σ( Σ_{n′∈N} U(n, n′) · h_{s,t−1}(n′) + U′_{s[t]}(n) ).

Similarly, we can show that

h′_{s,t}(n_2) = σ( Σ_{n′∈N} U(n, n′) · h_{s,t−1}(n′) + U′_{s[t]}(n) − 1 ).

Hence h′_{s,t}(n_1) − h′_{s,t}(n_2) = σ′( Σ_{n′∈N} U(n, n′) · h_{s,t−1}(n′) + U′_{s[t]}(n) ) = h_{s,t}(n), as required.