On the Practical Ability of Recurrent Neural Networks to Recognize Hierarchical Languages

While recurrent models have been effective in NLP tasks, their performance on context-free languages (CFLs) has been found to be quite weak. Given that CFLs are believed to capture important phenomena such as hierarchical structure in natural languages, this discrepancy in performance calls for an explanation. We study the performance of recurrent models on Dyck-n languages, a particularly important and well-studied class of CFLs. We find that while recurrent models generalize nearly perfectly if the lengths of the training and test strings are from the same range, they perform poorly if the test strings are longer. At the same time, we observe that RNNs are expressive enough to recognize Dyck words of arbitrary lengths in finite precision if their depths are bounded. Hence, we evaluate our models on samples generated from Dyck languages with bounded depth and find that they are indeed able to generalize to much higher lengths. Since natural language datasets have nested dependencies of bounded depth, this may help explain why recurrent models perform well in modeling hierarchical dependencies in natural language data despite prior works indicating poor generalization performance on Dyck languages. We perform probing studies to support our results and provide comparisons with Transformers.


Introduction
Recurrent models (RNNs and more specifically LSTMs) have been used extensively across several NLP tasks such as machine translation (Sutskever et al., 2014), language modeling (Melis et al., 2017) and question answering (Seo et al., 2016). Natural languages involve phenomena such as hierarchical and long-distance dependencies. Although RNNs are known to be Turing-complete (Siegelmann and Sontag, 1992) given unbounded precision, their practical ability to model such phenomena remains unclear.
Recently, several works (Weiss et al., 2018; Sennhauser and Berwick, 2018; Skachkova et al., 2018) have attempted to understand the capabilities of LSTMs by empirically analyzing them on different types of formal languages. Natural languages, for the most part, can be modeled by context-free languages (Jäger and Rogers, 2012), and their hierarchical structure has been emphasized by Chomsky (2002). Thus, studying the capabilities of RNNs in recognizing context-free languages (CFLs) can shed light on how well they can model hierarchical structures. An important family of context-free languages is the Dyck-n languages.
Previous works (Suzgun et al., 2019a; Suzgun et al., 2019c; Yu et al., 2019) showed that LSTMs achieve limited generalization performance on recognizing Dyck-2. This prompted the development of memory-augmented variants (Joulin and Mikolov, 2015; Suzgun et al., 2019c) of LSTMs, which generalize well on Dyck languages but are notoriously hard to train and fail to perform well on practical NLP tasks (Shen et al., 2019). On the other hand, despite the limited performance of LSTMs on Dyck languages, several studies (Gulordava et al., 2018; Tran et al., 2018) have found that LSTMs perform well in modeling hierarchical structure in natural language data. In this work, we take a step towards bridging this gap.
Figure 1: An example sentence with nested dependencies: "They say all the prayers those priests preach, and the boy hears were written centuries ago."
Our Contributions. We investigate the ability of recurrent models to learn and generalize on Dyck languages. We first evaluate the ability of LSTMs to recognize randomly sampled Dyck-n sequences and find, in contrast to prior works (Suzgun et al., 2019a; Suzgun et al., 2019c), that they can generalize nearly perfectly when the test samples are within the same range of lengths as seen during training. Similar to prior works, when evaluating on randomly generated samples of higher lengths, we observe limited performance. Dyck languages and (deterministic) CFLs can be recognized by (deterministic) pushdown automata (PDA). We construct an RNN that directly simulates a PDA given unbounded precision. A key observation is that the greater the depth of the stack, the higher the required precision. This implies that fixed precision RNNs are expressive enough to recognize strings of arbitrary lengths if the required depth of the stack is bounded. Based on this observation, we test whether LSTMs can generalize to higher lengths if the depths of the inputs in the training and test sets are bounded by the same value. In this bounded depth setting, LSTMs are able to generalize to much higher lengths than those used during training. Given that natural languages in practical settings also contain nested dependencies of bounded depths (Gibson, 1991; McElree, 2001), this may help explain why LSTMs perform well in modeling natural language corpora containing nested dependencies. We then assess the generalization performance of the model across higher depths and find that although LSTMs can generalize to a certain extent, their performance degrades gradually with increasing depth. Since the Transformer (Vaswani et al., 2017) is also a dominant model in NLP (Devlin et al., 2019), we include it in our experiments. To our knowledge, prior works have not empirically analyzed Transformers on formal languages, particularly context-free languages. We further conduct robustness experiments and probing studies to support our results.

Preliminaries and Motivation
The language Dyck-n, parameterized by n ≥ 1, consists of well-formed words from n types of parentheses. Its derivation rules are: S → (_i S )_i, S → SS and S → ε, where i ∈ {1, . . ., n}. Dyck-n is context-free for every n, Dyck-2 being crucial among them, since every Dyck-n for n > 2 can be reduced to Dyck-2 (Suzgun et al., 2019a). For a word in Dyck-n, the required depth of the stack of its underlying PDA is the maximum number of unbalanced parentheses in a prefix. For instance, the word ( [ ( ) ] ) [ ] in Dyck-2 has a maximum depth of 3, corresponding to the prefix ( [ (.
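The depth computation just described can be sketched directly with a stack; the following snippet (an illustrative sketch, not code from the paper) checks Dyck-2 membership and returns the maximum depth:

```python
# An illustrative membership-and-depth check for Dyck-2: the maximum
# stack depth over a run equals the maximum number of unbalanced
# parentheses in any prefix of the word.
MATCH = {")": "(", "]": "["}

def dyck_depth(word):
    """Return the maximum depth if `word` is in Dyck-2, else None."""
    stack, max_depth = [], 0
    for ch in word:
        if ch in "([":
            stack.append(ch)
            max_depth = max(max_depth, len(stack))
        elif ch in MATCH:
            if not stack or stack.pop() != MATCH[ch]:
                return None      # unbalanced or mismatched closer
        else:
            return None          # symbol outside the alphabet
    return max_depth if not stack else None

# The example from the text: "([()])[]" has maximum depth 3.
```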
RNNs. RNNs are defined by the update rule h_t = f(h_{t−1}, x_t), where the function f could be a feedforward network, h_t is the model's memory vector, usually referred to as the hidden state vector, and x_t denotes the input vector at the t-th step. In practice, f is usually a single layer feedforward network with tanh or ReLU activation. To mitigate the vanishing gradients problem, LSTMs (Hochreiter and Schmidhuber, 1997), a variant of RNNs with additional gates, are most commonly used in practice. In our experiments, we primarily work with LSTMs.
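As a concrete toy illustration of the update rule h_t = f(h_{t−1}, x_t), the sketch below implements a single-layer f with tanh activation; all dimensions and weights are arbitrary choices for illustration:

```python
import numpy as np

# A toy sketch of the recurrent update h_t = f(h_{t-1}, x_t) with f a
# single-layer feedforward network and tanh activation. The sizes and
# random weights here are illustrative assumptions.
rng = np.random.default_rng(0)
d_h, d_x = 8, 4
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(d_h)
for x in np.eye(d_x):            # feed a short sequence of one-hot inputs
    h = rnn_step(h, x)
```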

Expressiveness Results
Proposition 2.1. Any Deterministic Pushdown Automaton can be simulated by an RNN with ReLU activation.
We provide a proof by construction for the above result in Appendix B, using the Cantor-set like encodings of Siegelmann and Sontag (1992) to represent the stack.

Experimental Setup

In our first setting, the lengths of the strings in the training and validation sets lie in the intervals [2, 50] and [52, 100] respectively, without any restriction on the depth of the Dyck words. Our second setting resembles the previous dataset in terms of the lengths of the strings in the training and validation sets. However, in this case, we restrict all the strings to have depths in the range [1, 10]. This is to test the generalization ability across lengths when the depth is bounded. In the third case, we test the generalization ability across depths, when the lengths in the train and validation sets are in the same interval [2, 100]. Along with LSTMs, we also report the performance of Transformers (as used in GPT (Radford et al., 2018)) on each task, since they are also dominant models in NLP tasks. We train and evaluate our models on the Next Character Prediction (NCP) task introduced in Gers and Schmidhuber (2001) and used in Suzgun et al. (2019a) and Suzgun et al. (2019c). Similar to a language modeling setup, the model is only presented with positive samples from a given language. In NCP, for a sequence of symbols s_1, s_2, . . ., s_n, the model is presented with the sequence s_1, s_2, . . ., s_i at the i-th step, and its goal is to predict the set of valid characters for the (i+1)-th step, which can be represented as a k-hot vector. The model assigns a probability to each character in the vocabulary corresponding to its validity at the next step, obtained by applying a sigmoid activation over the unnormalized scores predicted through its output layer. Following Suzgun et al. (2019b) and Suzgun et al. (2019a), we use the mean squared error between the predicted probabilities and the k-hot labels as the loss function. During inference, we use a threshold of 0.5 to obtain the final prediction. The model's prediction is considered to be correct if and only if its output at every step is correct. The accuracy of the model over test samples is the fraction of total samples predicted correctly. Note that this is a relatively stringent metric, as a prediction counts as correct only when the model's output is correct at every step, as opposed to standard classification tasks. Refer to Bhattamishra et al. (2020) for a discussion on the choice of the character prediction task and its relation to other tasks such as standard classification and language modeling. Details of the dataset and parameters relevant for reproducibility can be found in section C in the Appendix. We have made our source code available at https://github.com/satwik77/RNNs-Context-Free.
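For concreteness, the NCP supervision for Dyck-2 and the strict sequence-level accuracy can be sketched as follows (an illustration under our reading of the task; the handling of any end-of-sequence symbol is an assumption left out here):

```python
# Illustrative sketch of the NCP targets for Dyck-2: after each prefix,
# the target is the set (a k-hot vector in the paper) of symbols that
# may legally come next.
OPEN = "(["
CLOSE = {"(": ")", "[": "]"}

def ncp_targets(word):
    """Valid next symbols after each prefix of a well-formed Dyck-2 prefix."""
    targets, stack = [], []
    for ch in word:
        if ch in OPEN:
            stack.append(ch)
        else:
            stack.pop()                    # assumes the input is well-formed
        valid = set(OPEN)                  # an opening bracket is always legal
        if stack:
            valid.add(CLOSE[stack[-1]])    # only the matching closer is legal
        targets.append(valid)
    return targets

def strict_accuracy(preds, golds):
    """A sample counts as correct only if every step's prediction matches."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)
```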

Results
Table 1 shows the performance of LSTMs and Transformers on the considered languages. When inputs are randomly sampled in a given range of lengths, LSTMs generalize well on the validation set containing inputs of the same lengths as seen during training (Bin-1A) for all the considered Dyck languages. However, for higher lengths (Bin-2A), they struggle to generalize on these languages. When the depth is bounded, LSTMs generalize very well to much higher lengths (Bins 1B, 2B, and 3B). Transformers, on the other hand, fail to generalize to higher lengths in either case, which might be attributed to the fact that at test time they receive positional encodings that they were not trained on. To investigate the generalization ability of the models across increasing depths, we trained the models up to depth 15 and evaluated them on 5 validation sets with an incremental increase in depth in each set (refer to Figure 2). We found that the models were able to generalize up to a certain extent, but their performance degraded gradually as the depth increased. However, Transformers performed relatively better than LSTMs.

Probing. We also conducted probing experiments on LSTMs to better understand how they perform these particular tasks. We first attempt to extract the depth of the underlying stack from the cell state of an LSTM model trained to recognize Dyck-2. We found that a single layer feedforward network was easily able to extract the depth with perfect (100%) accuracy and generalize to unseen data. Figure 3a shows a visualization of the t-SNE projections of the hidden states labeled by their corresponding depths. Additionally, we also try to extract the stack elements from the hidden states of the network. For Dyck-2 samples with lengths in [2, 50] and depths in [1, 10], along with training the LSTM for the NCP task, we co-train auxiliary classifiers to predict each element of the stack up to depth 10, i.e., the hidden state of the LSTM that is used to predict the next set of valid characters is also used in parallel to predict the elements of the stack (via 10 separate linear layers, one for each element). We find that not only was the model able to predict the elements on a validation set from the same distribution, it was also able to extract the elements for sequences of higher lengths ([52, 150]) on which it was not trained (see Figure 3b). This provides further evidence that LSTMs are able to robustly generalize to inputs of higher lengths when their depths are bounded. We also conducted a few additional robustness experiments to ensure that the model does not overfit the training distribution. Details of the probing tasks as well as the additional robustness experiments can be found in the Appendix.

Discussion
LSTMs and Transformers have been effective in language modeling tasks. In practice, during language modeling on a natural language corpus, the entire input is fed sequentially to the LSTM. Hence, the length of the input processed by the LSTM is bound to be large, requiring it to model a number of nested dependencies. Prior works in psycholinguistics (Gibson, 1991; McElree, 2001) have pointed out that, given the limited working memory of humans, natural language as used in practice should have nested dependencies of bounded depth. Some works (Jin et al., 2018; Noji et al., 2016) have even sought to build parsers for natural language with depth-bounded PCFGs. Our generalization results for LSTMs on depth-bounded CFGs could help explain why they perform well in modeling hierarchical dependencies in natural language datasets. Although the Transformer did not generalize to higher lengths, in practice Transformers (as used in BERT and GPT) process inputs in a fixed-length context window. Our results indicate that they have no trouble generalizing when the train and validation sets contain inputs of the same lengths.
Our experiments also demonstrate that, unlike its memory-augmented variants, the limiting factor for the LSTM lies in generalizing to higher depths. The exact mechanism by which trained LSTMs perform the task is not entirely clear. The limited performance of LSTMs could be due to precision issues or an unstable stack encoding mechanism (Stogin et al., 2020). However, given that natural language datasets are likely to have nested dependencies of bounded depth, this limitation may not play a significant role, and it may help explain why prior works (Gulordava et al., 2018; Tran et al., 2018) have found LSTMs to perform well in modeling hierarchical structure on natural language datasets.
A Preliminaries

Definition A.1 (Deterministic Pushdown Automata (Hopcroft et al., 2006)). A DPDA is a 7-tuple (Σ, Q, Γ, q_0, Z_0, δ, F) with
1. A finite input alphabet Σ
2. A finite set of states Q
3. A finite stack alphabet Γ
4. An initial state q_0
5. An initial stack symbol Z_0
6. A set of accepting states F ⊆ Q
7. A state transition function δ : Q × (Σ ∪ {ε}) × Γ → Q × Γ*
The output of δ is a pair (q, γ), where q is the new state and γ is the string of stack symbols that replaces the top of the stack. If X is at the top of the stack, then γ = ε indicates that the stack is popped, γ = X denotes that the stack is unchanged, and if γ = Y Z, then X is replaced by Z and Y is pushed onto the stack.
The machine processes an input string x ∈ Σ* one token at a time. For each input token, the machine looks at the current input, the state and the top of the stack to make a transition into a new state and update its stack. The machine can also take the empty string as input and make a transition based only on the stack and its current state. The machine halts after reading the input, and a string is accepted if the state after reading the complete input is a final state.
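The behaviour just described can be sketched for Dyck-2, where a single control state suffices and the stack does all the work (an illustrative recognizer, not the construction used in the paper):

```python
# A minimal recognizer sketch in the spirit of the DPDA above, for
# Dyck-2: one control state, with the stack carrying all the structure.
PAIRS = {")": "(", "]": "["}

def dpda_accepts(word):
    stack = ["Z0"]                    # Z0 is the initial stack symbol
    for ch in word:
        if ch in "([":                # opening symbol: push it
            stack.append(ch)
        elif ch in PAIRS and stack[-1] == PAIRS[ch]:
            stack.pop()               # matching closer: pop
        else:
            return False              # mismatch or unknown symbol: reject
    return stack == ["Z0"]            # accept iff only Z0 remains
```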

A.1 Cantor-set Encodings
In our construction, we will make use of Cantor-set like encodings as introduced in Siegelmann and Sontag (1992). The Cantor-set like encoding in base 4 provides a means to encode a stack of 0s and 1s and to easily apply stack operations such as push, pop and top. Let υ_s denote the encoding of a stack. The contents of the stack can be viewed as a rational number p/4^q, where 0 < p < 4^q. The i-th element from the top of the stack corresponds to the i-th digit to the right of the point in the base-4 expansion. A 0 stored in the stack is associated with the digit 1 in the expansion, while a 1 stored in the stack is associated with the digit 3. Hence, only numbers of the special form Σ_{i=1}^{t} a_i/4^i, where a_i ∈ {1, 3}, will appear in the encoding. For inputs of the form I ∈ {0, 1}, the standard stack operations can be applied by simple affine operations. For instance, the push(I) operation can be obtained by υ_s → (1/4)υ_s + (1/2)I + 1/4 and pop(I) by υ_s → 4υ_s − 2I − 1. The top of the stack can be obtained as σ(4υ_s − 2), which will be 1 if the top of the stack is 1 and 0 otherwise. The emptiness of the stack can be checked via σ(4υ_s), which will be 1 if the stack is nonempty and 0 otherwise.
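These affine stack operations can be checked exactly with rational arithmetic; the snippet below (an illustrative check, ours) follows the push, pop and top maps given above:

```python
from fractions import Fraction

# Illustrative check of the base-4 stack encoding described above, using
# exact rational arithmetic. A stored 0 appears as digit 1 and a stored 1
# as digit 3 in the base-4 expansion of the encoding v_s.
def push(v, bit):
    return Fraction(1, 4) * v + Fraction(1, 2) * bit + Fraction(1, 4)

def top(v):
    # sigma(4v - 2) is 1 iff the leading digit is 3 (a stored 1)
    return 1 if 4 * v - 2 > 0 else 0

def pop(v):
    return 4 * v - 2 * top(v) - 1

v = Fraction(0)                  # empty stack
for b in (1, 0, 1):              # push 1, then 0, then 1
    v = push(v, b)
assert top(v) == 1               # the last pushed bit is on top
v = pop(v)
assert top(v) == 0
v = pop(v)
v = pop(v)
assert v == 0                    # stack is empty again
```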

B Construction
We will make use of some intermediate notions, used multiple times, to describe our construction. In particular, Lemma B.1 will be used to combine the information of the state vector, the input and the symbol at the top of the stack. Lemma B.2 and Lemma B.3 will be used to implement the state transition and the decisions related to stack operations.
For the feedforward networks we use the same activation as in Siegelmann and Sontag (1992), namely the saturated linear sigmoid: σ(x) = 0 if x < 0, σ(x) = x if 0 ≤ x ≤ 1, and σ(x) = 1 if x > 1. (1) Note that we can easily work with the standard ReLU activation via σ(x) = ReLU(x) − ReLU(x − 1).
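The identity relating σ to ReLU can be checked numerically at a few points:

```python
# A numeric check (ours) of the identity sigma(x) = ReLU(x) - ReLU(x - 1)
# for the saturated linear sigmoid of Eq. (1).
def sat_sigma(x):
    return min(max(x, 0.0), 1.0)

def relu(x):
    return max(x, 0.0)

for x in (-2.0, -0.5, 0.0, 0.3, 1.0, 2.5):
    assert sat_sigma(x) == relu(x) - relu(x - 1.0)
```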
Lemma B.1. Given one-hot vectors φ ∈ Q^|Φ| and ψ ∈ Q^|Ψ|, there exists a single layer feedforward network that produces the unique one-hot pair encoding [(φ, ψ)] ∈ Q^(|Φ||Ψ|).

Proof. For j ∈ {1, . . ., |Ψ|}, let the j-th block of size |Φ| of a vector t_(φ,ψ) ∈ Q^(|Φ||Ψ|) be ψ_j·1 + φ, where 1 is the all-ones vector; this is a linear transformation of φ and ψ. Note that the vector t_(φ,ψ) has the value 2 exactly at the position (π_Ψ(ψ) − 1)|Φ| + π_Φ(φ) and is either 0 or 1 at the rest of the positions. Hence, by making use of bias vectors (subtracting 1) and applying σ, it is easy to obtain [(φ, ψ)], which is what we wanted to show.
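A numeric sketch of one construction with the property stated in Lemma B.1 (the exact matrices used in the paper may differ):

```python
import numpy as np

# Numeric sketch of a construction with the property of Lemma B.1: from
# one-hot phi and psi, one affine layer plus the saturated sigmoid yields
# the one-hot pair encoding. The matrices here are our illustration.
sigma = lambda x: np.clip(x, 0.0, 1.0)

def pair_one_hot(phi, psi):
    # Block j of t holds psi_j + phi_i at offset i: the entry is 2 only
    # where both are 1, and 0 or 1 everywhere else.
    t = np.kron(psi, np.ones_like(phi)) + np.kron(np.ones_like(psi), phi)
    return sigma(t - 1.0)        # the bias shifts 2 -> 1 and zeroes the rest

phi = np.eye(3)[1]               # pi_Phi(phi) = 2 (1-indexed)
psi = np.eye(2)[0]               # pi_Psi(psi) = 1
out = pair_one_hot(phi, psi)
# out is one-hot at position (pi_Psi - 1)*|Phi| + pi_Phi = 2 (1-indexed)
```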
We will describe another technical lemma that we will make use of to implement our transition functions and other such mappings. Consider two sets Φ and Ψ (such as the set of states Q and the set of inputs Σ or the set of stack symbols Γ), and consider the one-hot representations of their elements φ ∈ Φ and ψ ∈ Ψ as φ ∈ Q^|Φ| and ψ ∈ Q^|Ψ| respectively. Let δ : Φ × Ψ → Φ be any transition function that takes elements of the two sets as input and produces an element of one of the sets. Let [(φ, ψ)] ∈ Q^(|Φ||Ψ|) denote the unique one-hot encoding for each pair of φ and ψ as defined earlier.

Lemma B.2. Given [(φ_1, ψ)] as input, there exists a single layer feedforward network that produces the one-hot vector corresponding to δ(φ_1, ψ).

Proof. This can easily be implemented using a linear transformation. Consider a matrix A ∈ Q^(|Φ||Ψ|×|Φ|). Given inputs φ_i, φ_k ∈ Φ and ψ_j ∈ Ψ such that δ(φ_i, ψ_j) = φ_k, the row (π_Ψ(ψ_j) − 1)|Φ| + π_Φ(φ_i) of the matrix A will be the one-hot vector corresponding to φ_k. Similarly, a mapping θ : Φ × Ψ → {0, 1}^n can also be implemented using a linear transformation with a matrix A ∈ Q^(|Φ||Ψ|×n), where each row of A is the image in {0, 1}^n of the pair (φ, ψ) corresponding to that row.

Lemma B.3. There exists a function
The proof is similar to the proof of Lemma B.2.

Proposition B.4. For any Deterministic Pushdown Automaton, there exists an RNN that can simulate it.
Proof. The construction is straightforward and follows by induction. We will show that at the t-th timestep, given that the model has information about the state and a representation of the stack, the model can compute the next state and update the stack representation based on the input. More formally, given a sequence x_1, x_2, . . ., x_n ∈ Σ*, consider that the hidden state vector at the t-th timestep is h_t = [q_t, ω_t], where q_t ∈ Q^|Q| is a one-hot encoding of the state and ω_t ∈ Q^|Γ| is a representation of the stack based on the Cantor-set like encoding. Then, given an input x_{t+1} ∈ Σ, we will show how the network can compute h_{t+1} = [q_{t+1}, ω_{t+1}]. After reading the whole input, a sequence is accepted if q_n is in the set of final states F, or else it is rejected.
Our construction will use a 5-layer feedforward network that takes as input the vectors h_{t−1} and x_t at each timestep and produces the vector h_t. The vector h_t ∈ Q^(|Q|+|Γ|) has two subvectors of sizes |Q| and |Γ| containing a one-hot representation of the state of the underlying automaton and a representation of the stack in the Cantor-set encoding, respectively. For each input symbol x ∈ Σ, its corresponding input vector x ∈ Q^|Σ| is a one-hot vector. If the underlying automaton takes the empty string as input at a particular step, the RNN takes a special symbol as input, which also has a unique one-hot representation similar to the other input symbols.
As opposed to the construction of Siegelmann and Sontag (1992), which only takes 0s and 1s as input and uses a scalar to encode a stack of 0s and 1s, we encode one-hot representations of stack symbols in vectors of size |Γ|. The push and pop operations will always be applied in the form of one-hot vectors, which ensures that retrieving the top element yields a one-hot encoding of a stack symbol.
The first layer of the feedforward network, σ(W_h h_{t−1} + W_x x_t + b), will produce the vector h^(1)_{t−1} = [(q_{t−1}, x_t), τ^top_{t−1}, ω_{t−1}], where τ^top_{t−1} ∈ Q^|Γ| denotes a one-hot representation of the symbol at the top of our stack representation and the subvector (q_{t−1}, x_t) ∈ Q^(|Q||Σ|) is a unique one-hot vector for each pair of state q and input x. Thus, the vector h^(1)_{t−1} is of dimension |Q||Σ| + 2|Γ|. The vector (q_{t−1}, x_t) can be obtained using Lemma B.1 with Φ = Q and Ψ = Σ. The vector corresponding to the symbol at the top of the stack can be easily obtained using the top operation (σ(4ω_{t−1} − 2)) defined in section A.1.
In the second layer, we will use Lemma B.1 again to obtain a unique one-hot vector for each combination of the state, input and stack symbol. The output of the second layer of the feedforward network will be of the form h^(2)_{t−1} = [(q_{t−1}, x_t, τ^top_{t−1}), τ^top_{t−1}, ω_{t−1}], where the subvector (q_{t−1}, x_t, τ^top_{t−1}) is a unique one-hot encoding for each combination of a state q ∈ Q, an input x ∈ Σ and a stack symbol τ ∈ Γ. Hence, the vector h^(2)_{t−1} will be of dimension |Q||Σ||Γ| + 2|Γ|. Since we already have the vector (q_{t−1}, x_t) and the vector τ^top_{t−1}, the vector (q_{t−1}, x_t, τ^top_{t−1}) can be obtained using Lemma B.1 by considering Φ = Q × Σ and Ψ = Γ. The primary idea is that the vector (q_{t−1}, x_t, τ^top_{t−1}) provides all the information required to implement the further steps and produce q_t and ω_t.

Table 4: The performance of neural models on the considered Dyck languages for data generated from three different distributions. Validation Set 1 (p = 0.5, q = 0.25) was constructed from the same distribution used to generate the training data, while Validation Set 2 (p = 0.4, q = 0.35) and Validation Set 3 (p = 0.6, q = 0.15) were generated from different distributions. All the validation sets had strings with lengths in the interval [2, 50], and no restriction was kept on the depth of these strings.

process at that point. We run all of our experiments on 4 Nvidia Tesla P100 GPUs, each with 16 GB of memory.
Probing Details. For designing a probe to extract the depth of the underlying stack for Dyck-2 substrings from the cell states of pretrained LSTMs, we used a feedforward network with a single hidden layer. The hidden size of the network was set to 32, and it was trained using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 200. The accuracy on the validation set was computed by considering the prediction for a sequence to be correct only if the network predicted the correct depth at every step of that sequence. In the second set of experiments, which aimed to predict the elements of the stack, we trained the LSTM model on the NCP task along with an auxiliary loss for predicting the top 10 elements of the stack. For the computation of the auxiliary loss, we added 10 parallel linear layers on top of the LSTM's output, with the i-th linear layer tasked to predict whether the i-th element of the stack was (i) a round opening bracket, (ii) a square opening bracket, or (iii) absent. For each of these 10 linear layers we compute the cross entropy loss, and the losses are averaged to obtain the auxiliary loss. The final loss is computed as L = L_NCP + λ L_aux, where L_NCP is the loss obtained from the next character prediction task, L_aux is the auxiliary loss just described, and we use λ = 1/20 in our experiments. The stack extraction auxiliary task is evaluated by computing accuracy and recall metrics for each stack element. The accuracy for the i-th element is computed by checking whether the model predicts the i-th element correctly at each step of a sequence. Since there will be fewer cases for smaller lengths containing elements at higher depths, we also report recall for each i-th stack element, where we only consider the sequences containing at least one occurrence of depth i.

D Robustness Experiments
To ensure that our models did not overfit the training distribution, we performed robustness experiments to check the efficacy of the considered neural models. As a reminder, the PCFG for Dyck-n languages is given by the following derivation rules:
S → (_i S )_i, with probability p (6)
S → SS, with probability q (7)
S → ε, with probability 1 − (p + q) (8)
For the experiments described in the main paper, we used p = 0.5 and q = 0.25. To check the generalization ability of our models, we checked whether a model trained on strings generated using these values can generalize to Dyck words obtained from a different distribution. Table 4 shows the accuracies obtained by LSTMs and Transformers on data generated from different distributions. It can be observed from the results that the performance of the models on all languages remains more or less the same across different distributions.
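The derivation rules above translate directly into a sampler; the snippet below is an illustrative sketch (the recursion-depth cap is our addition for safety, not part of the grammar):

```python
import random

# An illustrative sampler for the Dyck-n PCFG given by rules (6)-(8),
# here with n = 2. The recursion-depth cap is an assumption of this
# sketch, not part of the grammar.
BRACKETS = [("(", ")"), ("[", "]")]

def sample_dyck(p, q, rng, depth=0):
    if depth > 40:                         # safety cap on nesting
        return ""
    r = rng.random()
    if r < p:                              # S -> (_i S )_i
        left, right = rng.choice(BRACKETS)
        return left + sample_dyck(p, q, rng, depth + 1) + right
    if r < p + q:                          # S -> S S
        return (sample_dyck(p, q, rng, depth + 1)
                + sample_dyck(p, q, rng, depth + 1))
    return ""                              # S -> empty string
```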
A model trained on the Next Character Prediction task can also be used to generate the strings of the language it was trained on, by starting from an empty string and then exhaustively iterating over all possible valid characters predicted by the model. We used this idea to check whether a pretrained LSTM model can indeed generate all possible Dyck-2 strings up to a certain length (since the number of possible strings grows exponentially with increasing length). For a maximum length of 10, there exists a total of 1619 valid Dyck-2 strings. When we used a pretrained model to exhaustively generate the valid strings, we observed that it produced exactly those 1619 strings, no more and no less.
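The count of 1619 can be cross-checked analytically (our arithmetic, not the paper's code): a Dyck-2 word of length 2k consists of k matched pairs, each independently of one of two bracket types, giving Catalan(k)·2^k words of that length:

```python
from math import comb

# Cross-check of the 1619 figure: the number of Dyck-2 words of length
# 2k is Catalan(k) * 2^k, so summing over lengths 0, 2, ..., 10 gives
# the total number of words of length at most 10.
def catalan(k):
    return comb(2 * k, k) // (k + 1)

total = sum(catalan(k) * 2 ** k for k in range(6))
# 1 + 2 + 8 + 40 + 224 + 1344 = 1619 (the empty word included)
```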

E Additional Results
The depth generalization results for the neural models on Dyck-3 and Dyck-4 are given in Figure 4. Similar to Dyck-2, we see a gradual drop in performance as we move to higher depths, but Transformers perform relatively better than LSTMs.

Figure 2: Generalization of LSTMs and Transformers on higher depths. The lengths of strings in the training set and all validation sets were fixed to lie between 2 and 100.
Figure 3: (a) Visualization of t-SNE projections of the hidden states obtained from a pretrained LSTM for different Dyck-2 substrings, colored by their depths. (b) Accuracies and recalls obtained on extracting the top 10 elements of the stack (averaged over them) for different Dyck-2 strings using the hidden state vector of the LSTM.

Figure 4: Generalization of LSTMs and Transformers on higher depths for (a) Dyck-3 and (b) Dyck-4. The lengths of strings in the training set and all validation sets were fixed to lie between 2 and 100.