LSTM Networks Can Perform Dynamic Counting

In this paper, we systematically assess the ability of standard recurrent networks to perform dynamic counting and to encode hierarchical representations. All the neural models in our experiments are designed to be small-sized networks both to prevent them from memorizing the training sets and to visualize and interpret their behaviour at test time. Our results demonstrate that the Long Short-Term Memory (LSTM) networks can learn to recognize the well-balanced parenthesis language (Dyck-$1$) and the shuffles of multiple Dyck-$1$ languages, each defined over different parenthesis-pairs, by emulating simple real-time $k$-counter machines. To the best of our knowledge, this work is the first study to introduce the shuffle languages to analyze the computational power of neural networks. We also show that a single-layer LSTM with only one hidden unit is practically sufficient for recognizing the Dyck-$1$ language. However, none of our recurrent networks was able to yield a good performance on the Dyck-$2$ language learning task, which requires a model to have a stack-like mechanism for recognition.


Introduction
Recurrent Neural Networks (RNNs) are known to capture long-distance and complex dependencies within sequential data. In recent years, RNN-based architectures have emerged as a powerful and effective architecture choice for language modeling (Mikolov et al., 2010). When equipped with infinite precision and rational state weights, RNN models are known to be theoretically Turing-complete (Siegelmann and Sontag, 1995). However, there still remain some fundamental questions regarding the practical computational expressivity of RNNs with finite precision. Weiss et al. (2018) have recently demonstrated that Long Short-Term Memory (LSTM) models (Hochreiter and Schmidhuber, 1997), a popular variant of RNNs, can, theoretically, emulate a simple real-time $k$-counter machine, which can be described as a finite state controller with $k$ separate counters, each containing integer values and capable of manipulating their content by adding ±1 or 0 at each time step (Fischer et al., 1968). The authors further tested their theoretical result by training the LSTM networks to learn $a^n b^n$ and $a^n b^n c^n$. Their examination of the cell state dynamics of the models exhibited the existence of simple counting mechanisms in the cell states. Nonetheless, these two formal languages can be captured by a particularly simple form of automaton, a deterministic one-turn two-counter automaton (Ginsburg and Spanier, 1966). Hence, there is still an open question of whether the LSTMs can empirically learn to emulate more general finite-state automata equipped with multiple counters capable of performing an arbitrary number of turns.
In the present paper, we answer this question in the affirmative. We assess the empirical performance of three types of recurrent networks, namely Elman-RNNs (or RNNs, in short), LSTMs, and Gated Recurrent Units (GRUs), to perform dynamic counting by training them to learn the Dyck-1 language. Our results demonstrate that the LSTMs with only a single hidden unit perform with perfect accuracy on the Dyck-1 learning task, and successfully generalize far beyond the training set. Furthermore, we show that the LSTMs can learn the shuffles of multiple Dyck-1 languages, defined over disjoint parenthesis-pairs, which require the emulation of multiple-counter arbitrary-turn machines. Our results corroborate the theoretical findings of Weiss et al. (2018), while extending their empirical observations. On the other hand, when trained to learn the Dyck-2 language, which is a strictly context-free language, all our recurrent models failed to learn the language.

arXiv:1906.03648v1 [cs.CL] 9 Jun 2019

Preliminaries
We start by defining several subclasses of deterministic pushdown automata (DPA). Following Valiant and Paterson (1975), we define a deterministic one-counter automaton (DCA$_1$) to be a DPA with a stack alphabet consisting of only one symbol. Traditionally, this construction allows ε-moves (that is, executing actions on the stack without the observance of any inputs), but we restrict our attention to simple DCA$_1$s without ε-moves in the rest of this paper. Similarly, we call a DPA that contains $k$ separate stacks, with each stack using only one stack symbol, a deterministic $k$-counter automaton (DCA$_k$).
One can impose a further restriction on the direction of stack movement of a DPA. This notion leads to the definition of a deterministic $n$-turn pushdown automaton (or $n$-turn DPA, in short) and is well-studied by Ginsburg and Spanier (1966) and Valiant (1974): A DPA is said to be an $n$-turn DPA if the total number of direction changes in the stack movement of the DPA is at most $n$ for each stack. Note that a one-turn DCA$_1$ can recognize $a^n b^n$ (Valiant, 1973), whereas a one-turn DCA$_2$ can recognize $a^n b^n c^n$. We say that a DCA$_k$ with no limit on the number of turns can perform dynamic counting.

Related Work
Formal languages have long been used to demonstrate the computational power of neural networks. Early studies (Steijvers, 1996; Tonkes and Wiles, 1997; Rodriguez and Wiles, 1998; Bodén et al., 1999; Bodén and Wiles, 2000; Rodriguez, 2001) employed Elman-style RNNs (Elman, 1990) to recognize simple context-free and context-sensitive languages, such as $a^n b^n$, $a^n b^n c^n$, and $a^n b^n c b^m a^m$. Most of these architectures, however, suffered from the vanishing gradient problem (Hochreiter, 1998) and could not generalize far beyond their training sets.
Using LSTMs (Hochreiter and Schmidhuber, 1997), Gers and Schmidhuber (2001) showed that their models could learn two strictly context-free languages and one strictly context-sensitive language by effectively using their gating mechanisms. In contrast, Das et al. (1992) proposed an RNN model with an external stack memory, named Recurrent Neural Network Pushdown Automaton (NNPDA), to learn basic context-free grammars.
More recently, Joulin and Mikolov (2015) introduced simple RNN models equipped with differentiable stack modules, called Stack-RNN, to infer algorithmic patterns, and showed that their model could successfully learn various formal languages, in particular $a^n b^n$, $a^n b^n c^n$, $a^n b^n c^n d^n$, and $a^n b^{2n}$. Inspired by the early model design of NNPDAs, Grefenstette et al. (2015) also proposed memory-augmented recurrent networks (Neural Stacks, Queues, and DeQues), which are RNNs equipped with unbounded differentiable memory modules, to perform sequence-to-sequence transduction tasks that require specific data structures. Deleu and Dureau (2016) investigated the ability of Neural Turing Machines (NTMs; Graves et al. (2014)) to capture long-distance dependencies in the Dyck-1 language. Their empirical findings demonstrated that an NTM can recognize this language by emulating a DPA. Similarly, Sennhauser and Berwick (2018), Bernardy (2018), and Hao et al. (2018) conducted experiments on the Dyck languages to explore whether recurrent networks can learn nested structures. These studies assessed the performance of their recurrent models to predict the next possible parenthesis, assuming that it is a closing parenthesis. In fact, Bernardy (2018) used a purpose-designed architecture, called RUSS, which contains recurrent units with stack-like states, to perform the closing-parenthesis-completion task. Though the RUSS model had no trouble generalizing to longer and deeper sequences, as the author mentions, the specificity of the architecture disqualifies it as a practical model choice for natural language modeling tasks. Additionally, Skachkova et al. (2018) trained recurrent networks to predict the last appropriate closing parenthesis, given a Dyck-2 sequence without its last symbol. They showed that their GRU and LSTM models performed with almost full accuracy on this parenthesis-completion task, but their task does not illustrate that these RNN models can recognize the Dyck language.
Most recently, Weiss et al. (2018) and Suzgun et al. (2019) showed that the LSTM networks can develop natural counting mechanisms to recognize simple context-free and context-sensitive languages, particularly $a^n b^n$, $a^n b^n c^n$, and $a^n b^n c^n d^n$. Their examination of the cell states of the LSTMs revealed that the models learned to emulate simple one- and two-turn counters to recognize these formal languages, but the authors did not conduct any experiments on tasks that require counters to perform an arbitrary number of turns.

Models
All the models in this paper are recurrent neural architectures and known to capture long-distance relationships in sequential data. We would like to compare and contrast their ability to perform dynamic counting to recognize simple counting languages. We further investigate whether they can learn the Dyck-2 language by emulating a DPA.
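A simple RNN architecture (Elman, 1990) is a recurrent model that takes an input $x_t$ and a previous hidden state representation $h_{t-1}$ to produce the next hidden state representation $h_t$, that is:

$$h_t = f(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$$
$$y_t = \sigma(W_y h_t)$$

where $x_t \in \mathbb{R}^D$ is the input, $h_t \in \mathbb{R}^H$ the hidden state, $y_t \in \mathbb{R}^D$ the output at time $t$, $W_{ih}$, $W_{hh}$, $b_{ih}$, and $b_{hh}$ the input and recurrent weights and biases, $W_y \in \mathbb{R}^{D \times H}$ the linear output layer, $f$ an activation function, and $\sigma$ an elementwise logistic sigmoid function.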
In theory, it is known that RNNs with infinite precision and rational state weights are computationally universal models (Siegelmann and Sontag, 1995). However, in practice, the exact computational power of RNNs with finite precision is still unknown. Empirically, RNNs suffer from the vanishing or exploding gradient problem as the length of the input sequences grows (Hochreiter, 1998). To address this issue, different neural architectures have been proposed over the years. Here, we focus on two popular RNN variants with similar gating mechanisms, namely LSTMs and GRUs.
The LSTM model was introduced by Hochreiter and Schmidhuber (1997) to capture long-distance dependencies more accurately than simple RNNs. It contains additional gating components to facilitate the flow of gradients during back-propagation.
The GRU model was proposed by Cho et al. (2014) as an alternative to LSTM. GRUs are similar to LSTMs in their design, but do not contain an additional memory unit.

Experimental Setup
To evaluate the capability of RNN-based architectures to perform dynamic counting and to encode hierarchical representations, we conducted experiments on four different synthetic sequence prediction tasks. Each task was designed to highlight some particular feature of recurrent networks. All the tasks were formulated as supervised learning problems with discrete $k$-hot targets and mean-squared-error loss under the sequence prediction framework, defined next. We repeated each experiment ten times but used the same random seed across each run for each of the tasks to ensure comparability of RNN, GRU, and LSTM models.

The Sequence Prediction Task
Following Gers and Schmidhuber (2001), we trained the models as follows: Given a sequence in the language, we presented one character at each time step to the network and trained the network to predict the set of next possible characters in the language, based on the current input character and the prior hidden state. We used a one-hot representation to encode the inputs and a $k$-hot representation to encode the outputs. In all the experiments, the objective was to minimize the mean-squared error of the sequence predictions. We used an output threshold criterion of 0.5 for the sigmoid layer to indicate which characters were predicted by the model. Finally, we turned this sequence prediction task into a sequence classification task by accepting a sequence if the model predicted all of its output values correctly and rejecting it otherwise.
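The acceptance criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `binarize` and `accepts` are hypothetical, the model outputs are taken to be plain lists of sigmoid activations, and an activation strictly greater than 0.5 is assumed to count as a prediction (the text does not say how a value of exactly 0.5 is treated).

```python
from typing import List, Sequence

THRESHOLD = 0.5  # output threshold stated in the text


def binarize(outputs: Sequence[float], threshold: float = THRESHOLD) -> List[int]:
    """Turn sigmoid activations for one time step into a k-hot vector.

    Assumption: strictly greater than the threshold means "predicted".
    """
    return [1 if value > threshold else 0 for value in outputs]


def accepts(predictions: Sequence[Sequence[float]],
            targets: Sequence[Sequence[int]],
            threshold: float = THRESHOLD) -> bool:
    """Accept a sequence only if every thresholded output equals its target."""
    if len(predictions) != len(targets):
        return False
    return all(binarize(step, threshold) == list(target)
               for step, target in zip(predictions, targets))


# Dyck-1 example for the input "()" with output order ["(", ")"]:
# after "(" both symbols are allowed, after ")" only "(" is allowed.
good = [[0.9, 0.8], [0.7, 0.2]]
bad = [[0.9, 0.8], [0.7, 0.6]]
k_hot_targets = [[1, 1], [1, 0]]
```

Under this criterion a single wrong output value at any time step is enough to reject the whole sequence, which is why the reported accuracies are per-sequence rather than per-symbol.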

Training Details
Given the nature of the four languages that we will describe shortly, if recurrent models can learn them, then they should be able to do so with reasonably few hidden units and without the employment of any embedding layer or the dropout operation. To that end, all the recurrent models used in our experiments were single-layer networks containing fewer than 10 hidden units. The numbers of hidden units that the networks contained for the Dyck-1, Dyck-2, Shuffle-2, and Shuffle-6 experiments were 3, 4, 4, and 8, respectively. (In Section 7, we describe further experiments with as few as a single hidden unit.) In all our experiments, we used the Adam optimizer (Kingma and Ba, 2014) with learning rate α = 0.001.

Output: 1 3 3 3 2 0 2 3 3 3 1 0 2 6 7 6 2 0 8 24 8 0
Table 2: Example input-output pairs for the Shuffle-2 (left) and Shuffle-6 (right) languages.

Experiments
The first language, Dyck-1 (or $D_1$), consists of well-balanced sequences of opening and closing parentheses. Recall that a neural network need not be equipped with a stack-like mechanism to recognize the Dyck-1 language under the sequence prediction paradigm; a single counter DCA$_1$ is sufficient. However, dynamic counting is required to capture the language.
The next two languages are the shuffles of two and six Dyck-1 languages, each defined over disjoint parentheses; we refer to these two languages as Shuffle-2 and Shuffle-6, respectively. These two tasks are formulated to investigate whether recurrent networks can emulate deterministic $k$-counter automata by performing dynamic counting, separately counting the number of opening and closing parentheses for each of the distinct parenthesis-pairs and predicting the closing parentheses for the pairs for which the counters are non-zero, in addition to the opening parentheses. In contrast, the final language, Dyck-2, is a context-free language which cannot be captured by a simple counting mechanism; a model capable of recognizing the Dyck-2 language must contain a stack-like component (Sennhauser and Berwick, 2018).
Tables 1 and 2 provide example input-output pairs for the four languages under the sequence-prediction task. For purposes of presentation only, we use a simple binary encoding of the output sets to concisely represent the output. In all of the languages we investigate in this paper, the open parentheses are always allowable as the next symbol; we assign the set of open parentheses the number 0. Each closing parenthesis is assigned a different power of 2: ) is assigned 1, ] is assigned 2, } is assigned 4, and the three remaining closing parentheses are assigned 8, 16, and 32. (These latter closing parentheses are needed for the Shuffle-6 language below.) The set of predicted symbols is then represented by the sum of the associated numbers. For instance, an output 3 represents the prediction of any of the open parentheses as well as ) and ].
We note that even though an input sequence might appear in two different languages, it might have different target representations. This observation is important especially when making comparisons between the Dyck-2 and the Shuffle-2 languages. For instance, the output sequence for ( [ ] ) in the Dyck-2 language is 1 2 1 0, whereas the output sequence for ( [ ] ) in the Shuffle-2 language is 1 3 1 0.
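To make the difference between the two target representations concrete, the following Python sketch computes the encoded outputs for a well-formed input under each interpretation. The function names are hypothetical, the code assumes the input is a valid sequence in the respective language, and only the two pairs ( ) and [ ] with the codes 1 and 2 from the text are covered.

```python
from typing import List

CLOSER_OF = {"(": ")", "[": "]"}   # opening symbol -> matching closing symbol
CODE = {")": 1, "]": 2}            # binary encoding of closing symbols (from the text)


def dyck2_targets(seq: str) -> List[int]:
    """Encoded targets under Dyck-2: only the innermost open pair may close."""
    stack: List[str] = []
    targets = []
    for ch in seq:
        if ch in CLOSER_OF:
            stack.append(CLOSER_OF[ch])   # remember which closer is now expected
        else:
            stack.pop()                   # assumes ch matches the top of the stack
        targets.append(CODE[stack[-1]] if stack else 0)
    return targets


def shuffle2_targets(seq: str) -> List[int]:
    """Encoded targets under Shuffle-2: any pair with a non-zero counter may close."""
    counters = {")": 0, "]": 0}
    targets = []
    for ch in seq:
        if ch in CLOSER_OF:
            counters[CLOSER_OF[ch]] += 1
        else:
            counters[ch] -= 1             # assumes the counter was positive
        targets.append(sum(CODE[c] for c, n in counters.items() if n > 0))
    return targets
```

The Dyck-2 version needs a stack to know which pair was opened last, while the Shuffle-2 version only keeps one integer per pair, which mirrors the distinction between a DPA and a two-counter machine drawn in the text.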

The Dyck-1 Language
The Dyck-1 language, or the well-balanced parenthesis language, arises naturally in enumerative combinatorics, statistics, and formal language theory. A sequence in the Dyck-1 language needs to contain an equal number of opening and closing parentheses, with the constraint that at each time step the total number of opening parentheses must be greater than or equal to the total number of closing parentheses so far. In other words, for a given sequence $s = s_1 \cdots s_{2n}$ of length $2n$ in the Dyck-1 language over the alphabet $\Sigma = \{(, )\}$, if we have a function $f$ that assigns value $+1$ to '(' and value $-1$ to ')', then we know that it is always true that $\sum_{i=1}^{j} f(s_i) \geq 0$ for all $j \in \{1, \ldots, 2n\}$, with strict equality when $j = 2n$. Therefore, a model with a single unbounded counter can recognize this language.
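The prefix-sum condition above translates directly into a one-counter recognizer. The sketch below is a plain Python illustration of that condition (the function name `is_dyck1` is hypothetical); it treats the empty string, which corresponds to $n = 0$, as a member of the language.

```python
def is_dyck1(seq: str) -> bool:
    """Recognize Dyck-1 with a single unbounded counter.

    The counter holds the running sum of f(s_i), with f('(') = +1 and
    f(')') = -1. The sequence is accepted if the sum never drops below
    zero and is exactly zero at the end.
    """
    counter = 0
    for ch in seq:
        if ch == "(":
            counter += 1
        elif ch == ")":
            counter -= 1
        else:
            return False      # symbol outside the alphabet {(, )}
        if counter < 0:
            return False      # more closing than opening parentheses so far
    return counter == 0
```

Because the counter may go up and down any number of times, this recognizer performs an unbounded number of turns, which is what separates Dyck-1 from one-turn languages such as $a^n b^n$.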
A probabilistic context-free grammar for the Dyck-1 language can be written as follows:

$$S \rightarrow \begin{cases} (\,S\,) & \text{with probability } p \\ S\,S & \text{with probability } q \\ \varepsilon & \text{with probability } 1 - (p + q) \end{cases}$$

where $0 < p, q < 1$ and $(p + q) < 1$.

Table 3: The performances of the RNN, GRU, and LSTM models on four language modeling tasks. Shuffle-2 denotes the shuffle of two Dyck-1 languages defined over different alphabets, and similarly Shuffle-6 denotes the shuffle of six Dyck-1 languages defined over different alphabets. Min/Max/Median results were obtained from 10 different runs of each model with the same random seed across each run. We note that the LSTM models not only outperformed the RNN and GRU models but also often achieved full accuracy on the short test set in all the "counting" tasks. Nevertheless, even the LSTMs were not able to yield a good performance on the Dyck-2 language modeling task, which requires a stack-like mechanism.

Setting $p = \frac{1}{2}$ and $q = \frac{1}{4}$, we generated 10,000 distinct Dyck sequences, whose lengths were bounded to [2, 50], for the training set. We used two test sets: The "short" test set contained 5,000 distinct Dyck words defined in the same length interval as the training set but distinct from it. The "long" test set contained 5,000 distinct Dyck words defined in the interval [52, 100]. Hence, there was no overlap between any of the training and test sets.

Results: Table 3 lists the performances of the RNN, GRU, and LSTM models on the Dyck-1 language. First, we highlight that all the LSTM models obtained full accuracy on the training set and short test set, whose sequences were bounded to [2, 50], in all the ten trials. They were also able to easily generalize to longer and deeper sequences in the long test set: They obtained perfect accuracy in nine out of ten trials and 99.98% accuracy (only 1 incorrect prediction) in the remaining trial. These results indicate that the LSTMs can indeed perform unbounded dynamic counting.
The GRUs yielded qualitatively similar performance on the training and short test sets; however, they could not generalize well to longer and deeper sequences. On the other hand, the RNN models performed significantly worse than the first two recurrent models, in terms of their median accuracy rate. We note that similar empirical observations about the performance-level differences between the RNNs, GRUs, and LSTMs for other simple formal languages were also reported by Weiss et al. (2018) and Bernardy (2018).
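The probabilistic grammar for Dyck-1 given earlier in this section can be sampled with a short routine such as the one below. This is a sketch under stated assumptions, not a description of how the corpora in the paper were produced: the function name `sample_dyck1` is hypothetical, and the length bounds are enforced by simple rejection (a sample that would exceed `max_len`, or that is shorter than `min_len`, is discarded and redrawn), which is one plausible reading of "lengths were bounded to [2, 50]".

```python
import random
from typing import List, Optional


def sample_dyck1(p: float = 0.5, q: float = 0.25,
                 min_len: int = 2, max_len: int = 50,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one Dyck-1 word from S -> (S) [p] | SS [q] | epsilon [1 - p - q].

    Assumption: out-of-range samples are rejected and redrawn.
    """
    rng = rng or random.Random()
    while True:
        out: List[str] = []
        pending = ["S"]                  # symbols still to process, leftmost on top
        while pending:
            symbol = pending.pop()
            if symbol != "S":
                out.append(symbol)       # a closing parenthesis scheduled earlier
                continue
            r = rng.random()
            if r < p:                    # S -> ( S )
                out.append("(")
                pending.append(")")
                pending.append("S")
            elif r < p + q:              # S -> S S
                pending.append("S")
                pending.append("S")
            # otherwise S -> epsilon: nothing to emit
            if len(out) + pending.count(")") > max_len:
                break                    # final word would be too long: reject
        else:
            if len(out) >= min_len:
                return "".join(out)
```

The expansion is done with an explicit work list rather than recursion so that deep derivations cannot exhaust the call stack. A corpus of distinct words could then be collected by inserting samples into a set until it reaches the desired size.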

The Shuffle-k Language
Next, we consider two shuffle languages, which are both generated by the Dyck-1 language. Before describing each task in detail, let us first define the notion of shuffling formally. The shuffling operation $\| : \Sigma^* \times \Sigma^* \to \mathcal{P}(\Sigma^*)$ can be inductively defined as follows:

$$u \,\|\, \varepsilon = \varepsilon \,\|\, u = \{u\}$$
$$\alpha u \,\|\, \beta v = \alpha (u \,\|\, \beta v) \cup \beta (\alpha u \,\|\, v)$$

for any $\alpha, \beta \in \Sigma$ and $u, v \in \Sigma^*$. For instance, the shuffling of $ab$ and $cd$ would be:

$$ab \,\|\, cd = \{abcd, acbd, acdb, cabd, cadb, cdab\}$$

There is a natural extension of the shuffling operation $\|$ to languages. The shuffle of two languages $L_1$ and $L_2$, denoted $L_1 \| L_2$, is defined as the set of all the possible interleavings of the elements of $L_1$ and $L_2$, respectively, that is:

$$L_1 \,\|\, L_2 = \bigcup_{u \in L_1,\ v \in L_2} u \,\|\, v$$

Given a language $L$, we define its self-shuffling $L\|^2$ to be $L \| \sigma(L)$, where $\sigma$ is an isomorphism on the vocabulary of $L$ to a disjoint vocabulary. More generally, we define the $k$-self-shuffling $L\|^k$ analogously, as the shuffle of $k$ copies of $L$ defined over pairwise disjoint vocabularies.

The Shuffle-2 Language: The first language is $D_1\|^2$, the shuffle of $D_1$ and $D_1$, where the first $D_1$ is over the alphabet $\{(, )\}$ and the second over the alphabet $\{[, ]\}$. Note that every Dyck-2 sequence is also a Shuffle-2 sequence, but not the other way around: for instance, the sequence ( [ ) ] is in Shuffle-2 but not in Dyck-2. To generate the training and test corpora, we used a probabilistic context-free grammar for the Dyck-2 language, which we will describe shortly, but considered correct target values for the sequences interpreted as per the Shuffle-2 language.
The training set contained 10,000 distinct sequences of lengths in [2, 50]. As before, the short test set had 5,000 distinct samples defined over the same length interval but disjoint from the training set, and the long test set had 5,000 distinct samples, whose lengths were bounded to [52, 100].
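The inductive definition of the shuffling operation on two strings can be written as a short recursive function. The sketch below is only an illustration (the name `shuffle` is chosen here and shadows nothing from the standard library when used as a plain function); it enumerates every interleaving, so it is meant for short strings.

```python
from typing import Set


def shuffle(u: str, v: str) -> Set[str]:
    """Return the set of all interleavings of u and v.

    Base case: shuffling with the empty string leaves the other string as is.
    Inductive case: the result starts either with the first symbol of u or
    with the first symbol of v, followed by a shuffle of what remains.
    """
    if not u:
        return {v}
    if not v:
        return {u}
    starts_with_u = {u[0] + w for w in shuffle(u[1:], v)}
    starts_with_v = {v[0] + w for w in shuffle(u, v[1:])}
    return starts_with_u | starts_with_v
```

Applied to the Dyck-1 words "()" and "[]", the function yields sequences such as "([)]" in which the two parenthesis types cross, which is allowed in Shuffle-2 but not in Dyck-2.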
The Shuffle-6 Language: The second shuffle language is $D_1\|^6$, the shuffle of six Dyck-1 languages, each defined over different parenthesis-pairs. Concretely, we used six distinct pairs, including ( ), [ ], and { }. In theory, an automaton with six separate unbounded-turn counters (DCA$_6$) can recognize this language. Hence, we wanted to explore whether our recurrent networks can learn to emulate a dynamic-counting 6-counter machine. The training and two test corpora were generated in the same style as the previous sequence prediction task; however, we included 30,000 samples in the training set for this language, due to the complexity of the language. Figure 1 shows the length and maximum depth distributions of the training and test sets for one of the Shuffle-6 experiments.
Results: As shown in Table 3, the LSTM models achieved a median accuracy of 100% on the training and short test sets in both of the shuffle language variants. Furthermore, they were able to generalize well to longer and deeper sequences in both shuffle languages, achieving an almost perfect median accuracy score on the long test set. The GRU models performed only slightly worse than the LSTM models on the training and short test sets, but they fell well short of the LSTMs on the long test set, obtaining median scores of 93.12% and 85.14% in the Shuffle-2 and Shuffle-6 languages, respectively. Additionally, the simple RNN models always performed much worse than the GRU and LSTM models and could not even learn the training sets in either of the shuffle languages. These empirical findings show that the LSTM models can successfully emulate a DCA$_k$, a deterministic (real-time) automaton with $k$ counters, each capable of performing an arbitrary number of turns.
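For reference, the kind of real-time $k$-counter machine that the LSTMs are said to emulate can be sketched as follows. This is a hand-written recognizer, not the learned model; the function name and the default pair list are assumptions made for the example, and any number of disjoint pairs can be passed in.

```python
from typing import Dict, Sequence


def in_shuffle_of_dyck1(seq: str, pairs: Sequence[str] = ("()", "[]")) -> bool:
    """Recognize the shuffle of k Dyck-1 languages with k independent counters.

    Each entry of `pairs` is a two-character string: opening symbol followed
    by its closing symbol. One symbol is read per step (real time), and each
    counter changes by +1, -1 or 0 at every step.
    """
    closer_of: Dict[str, str] = {pair[0]: pair[1] for pair in pairs}
    counters: Dict[str, int] = {pair[1]: 0 for pair in pairs}
    for ch in seq:
        if ch in closer_of:
            counters[closer_of[ch]] += 1
        elif ch in counters:
            counters[ch] -= 1
            if counters[ch] < 0:
                return False          # a closer appeared before its opener
        else:
            return False              # symbol outside the alphabet
    return all(count == 0 for count in counters.values())
```

Passing three pairs such as `("()", "[]", "{}")` gives a three-counter machine; the Shuffle-6 setting corresponds to six pairs.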

The Dyck-2 Language
The generalized Dyck language, $D_n$, represents the core of the notion of context-freeness by virtue of the Characterization Theorem of Chomsky and Schützenberger (1963), which provides a natural way to characterize the CFL class:

Theorem 6.1. Any language in CFL can be represented as a homomorphic image of the intersection of a Dyck language $D_n$ and a regular language $R$.

Furthermore, $D_n$ can be reduced to $D_2$ at the expense of increasing the depth and length of the original sequences in the former language. Given an open parenthesis $p_i$, we first determine the $m$-digit binary representation of the number $i - 1$ and then use ( to encode 0's and [ to encode 1's in this representation. Given a closing parenthesis $\bar{p}_i$, we determine the $m$-digit binary representation of the number $i - 1$, write the binary number in the reverse order, and then use ) to encode 0's and ] to encode 1's.

Proposition 6.2 simply shows that we can map an expression in $D_n$ to an expression in $D_2$ at the expense of creating a deeper structure in the latter language by a factor of $m = \lceil \log_2 n \rceil$. For instance, if an expression $s$ in $D_n$ has a maximum depth of $k$, then the expression generated by the mapping above would have a maximum depth of $k \times m$ in $D_2$.
Motivated by context-free-language universality (Sennhauser and Berwick, 2018), we therefore experimented with the Dyck-2 language defined over two types of parenthesis-pairs, namely $\{(, )\}$ and $\{[, ]\}$, as well. The recognition of the Dyck-2 language requires a model to possess a stack-like component; real-time primitive counting does not enable us to capture the Dyck-2 language. Hence, if an RNN-based architecture learns to recognize this language, we can conclude that RNNs with finite precision can actually learn complex deeply nested representations.
A probabilistic context-free grammar for the Dyck-2 language can be written as follows:

$$S \rightarrow \begin{cases} (\,S\,) & \text{with probability } \frac{p}{2} \\ [\,S\,] & \text{with probability } \frac{p}{2} \\ S\,S & \text{with probability } q \\ \varepsilon & \text{with probability } 1 - (p + q) \end{cases}$$

where $0 < p, q < 1$ and $(p + q) < 1$.
Setting $p = \frac{1}{2}$ and $q = \frac{1}{4}$, we generated 10,000 distinct sequences, whose lengths were bounded to [2, 50], for the training set. Again, we generated 5,000 other distinct Dyck-2 sequences of lengths defined in the interval [2, 50] for the first test set and 5,000 distinct sequences of lengths defined in the interval [52, 100] for the second test set. As in the previous case, there was no overlap between the training and test sets.

Results: As shown in Table 3, we found that none of our RNNs was able to emulate a DPA to recognize the Dyck-2 language, a context-free language that requires a model to contain a stack-like mechanism for recognition. Overall, the LSTM models had the best performances among all the networks, but they still failed to employ a stack-based strategy to learn the Dyck-2 language. Even the best LSTM model could achieve only 48.24% and 1.46% accuracy scores on the short and long test sets, respectively.
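The binary encoding that maps $D_n$ into $D_2$ can be illustrated with a small Python sketch. The token representation is an assumption made for the example: a $D_n$ sequence is given as a list of `(kind, i)` tuples, where `kind` is `"open"` or `"close"` and `i` ranges over 1..n. The function names are hypothetical, and $m$ is computed as $\lceil \log_2 n \rceil$ with a minimum of one digit.

```python
from typing import List, Tuple

Token = Tuple[str, int]   # ("open" | "close", index i with 1 <= i <= n)


def encode_token(kind: str, i: int, n: int) -> str:
    """Encode one D_n parenthesis as m symbols of D_2."""
    if not 1 <= i <= n:
        raise ValueError("parenthesis index out of range")
    m = max(1, (n - 1).bit_length())          # equals ceil(log2(n)) for n >= 2
    bits = format(i - 1, "0{}b".format(m))    # m-digit binary form of i - 1
    if kind == "open":
        return "".join("(" if b == "0" else "[" for b in bits)
    # closing parentheses use the reversed digit order
    return "".join(")" if b == "0" else "]" for b in reversed(bits))


def encode_dn_to_d2(tokens: List[Token], n: int) -> str:
    """Map a whole D_n sequence to D_2; the empty sequence maps to ''."""
    return "".join(encode_token(kind, i, n) for kind, i in tokens)
```

For example, with $n = 4$ (so $m = 2$) the sequence $p_1\, p_3\, \bar{p}_3\, \bar{p}_1$ of depth 2 becomes a $D_2$ sequence of depth 4. Reversing the digits for closing parentheses is what makes each encoded opener nest properly with its encoded closer.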
7 Discussion and Analysis

Visualization of Hidden+Cell States
Our empirical results on the Dyck-1 and Shuffle languages suggest that our LSTM models were performing dynamic counting to recognize these languages. In order to validate our hypothesis, we visualized the hidden and cell states of some of our LSTM models that achieved full accuracy on the test sets.
Figure 2 illustrates that our LSTM is able to recognize the samples in $D_1\|^6$ by emulating a DCA$_6$. In fact, the discrete, even transitions in the cell state dynamics of the model reveal that six out of eight hidden units in the model are acting like separate counters. In some cases, we further discovered that certain units learned to count the length of the input sequences. Such length counting behaviours are also observed in machine translation (Shi et al., 2016; Bau et al., 2019; Dalvi et al., 2019) when the LSTMs are trained on a fixed-length training corpus.
On the other hand, Figure 3 provides visualizations of the hidden and cell state dynamics of one of our single-layer LSTM models with four hidden units when the model was presented two sequences in the Dyck-2 language. Both sequences have some noticeable patterns and were chosen to explore whether the model behaves differently in repeated (or similar) subsequences. It seems that the LSTM model is trying to employ a complex counting strategy to learn the Dyck-2 language but failing to accomplish this task.

LSTM with a Single Hidden Unit
In theory, a DCA$_1$ should be able to easily recognize Dyck-1, the well-balanced parenthesis language. Can an LSTM with one hidden unit learn Dyck-1? Our empirical results (Figure 4) confirmed that LSTMs can indeed learn this language by effectively using the single hidden unit to count up the total number of left and right parentheses in the sequence. Similarly, we found that an LSTM with only two hidden units can recognize $D_1\|^2$.
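How a single LSTM unit can act as a counter may be seen from a hand-constructed example. The sketch below is not the trained model from the experiments: the weights are chosen by hand (the constant `BIG` and the dictionary `W_CANDIDATE` are hypothetical), recurrent weights are set to zero, and the standard cell update $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$ is assumed. With saturated gates ($i_t \approx 1$, $f_t \approx 1$) and a candidate value $g_t \approx \pm 1$, the cell state tracks the current depth.

```python
import math
from typing import List

BIG = 20.0   # hypothetical pre-activation large enough to saturate the gates

# Hand-picked candidate weights on the one-hot input: "(" pushes the
# candidate value towards +1 and ")" towards -1.
W_CANDIDATE = {"(": BIG, ")": -BIG}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_counter_states(seq: str) -> List[float]:
    """Cell states of a one-unit LSTM with hand-set, saturated gates."""
    c = 0.0
    states = []
    for ch in seq:
        i_gate = sigmoid(BIG)              # input gate, approximately 1
        f_gate = sigmoid(BIG)              # forget gate, approximately 1
        g = math.tanh(W_CANDIDATE[ch])     # candidate, approximately +1 or -1
        c = f_gate * c + i_gate * g        # standard LSTM cell update
        states.append(c)
    return states


def closing_allowed(seq: str) -> List[bool]:
    """A closing parenthesis is a valid next symbol whenever the count is positive."""
    return [c > 0.5 for c in lstm_counter_states(seq)]
```

Since the gates are only approximately equal to 1, the cell state drifts very slightly from the exact integer depth on long inputs; a trained network faces the same precision limits, which is one reason generalization to much longer sequences is a meaningful test.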

Predicting the Last Closing Parenthesis
Following Skachkova et al. (2018), we also trained an LSTM model with four hidden units to learn to predict the last closing parenthesis in the Dyck-2 language. The model learned the task in a couple of epochs and achieved perfect accuracy on the training and test sets. However, our simple analysis of the cell state dynamics of the LSTM in Figure 5 suggests that the model is doing some complex form of counting to perform the desired task, rather than learning the Dyck-2 language.

Conclusion
We investigated the ability of standard recurrent networks to perform dynamic counting and to encode hierarchical representations, by considering three simple counting languages and the Dyck-2 language. Our empirical results highlight the overall high-caliber performance of the LSTM models over the simple RNNs and GRUs, and further refine our understanding of the limitations and strengths of these models.

Acknowledgement
The first author gratefully acknowledges the support of the Harvard College Research Program (HCRP). The third author was supported by the Harvard Mind, Brain, and Behavior Initiative. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.
Figure 4: A single-layer LSTM with one hidden unit learns the Dyck-1 language by counting up upon the observance of ( and down upon the observance of ).

Figure 1: Length and maximum depth distributions of training/test sets for an example Shuffle-6 experiment.
Proposition 6.2. $D_n$ is reducible to $D_2$.

Proof. Let $D_n$ be the Dyck language with $n$ distinct pairs of parentheses. Let us further suppose that $p = \{p_1, p_2, p_3, \ldots, p_n\}$ are the opening parentheses and that $\bar{p} = \{\bar{p}_1, \bar{p}_2, \bar{p}_3, \ldots, \bar{p}_n\}$ are their corresponding closing parentheses. We set $m = \lceil \log_2 n \rceil$ and encode each opening and closing parenthesis in $D_n$ with $m$ bits using either ( and [ or ) and ]. Furthermore, we map the empty string to the empty string.

Figure 2: Visualization of the cell state dynamics of one of the LSTM models trained to learn $D_1\|^6$, the Shuffle-6 language. The solid lines show the values of the cell states of six of the eight units in the model, whereas the dashed lines depict the current depth of each distinct parenthesis-pair in $D_1\|^6$. We highlight the striking parallelism between the solid lines and the dashed lines. Our visualizations confirm that the LSTM models employ a simple counting mechanism to recognize the Shuffle languages.

Figure 3: Visualization of the hidden and cell state dynamics of one of the LSTMs trained to learn the Dyck-2 language. The time steps at which the model made incorrect predictions are marked with an apostrophe in the horizontal axis. The plots on the left provide a demonstration of the periodic behaviour of the hidden and cell states of the model for a long sequence. Similarly, the plots on the right illustrate the complex counting behaviour of the model as it observes a nested sequence. We witnessed similar behaviours in our other models as well.