On the Ability and Limitations of Transformers to Recognize Formal Languages

Transformers have supplanted recurrent models in a large number of NLP tasks. However, the differences in their abilities to model different syntactic properties remain largely unknown. Past works suggest that LSTMs generalize very well on regular languages and have close connections with counter languages. In this work, we systematically study the ability of Transformers to model such languages as well as the role of its individual components in doing so. We first provide a construction of Transformers for a subclass of counter languages, including well-studied languages such as n-ary Boolean Expressions, Dyck-1, and its generalizations. In experiments, we find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction. Perhaps surprisingly, in contrast to LSTMs, Transformers do well only on a subset of regular languages with degrading performance as we make languages more complex according to a well-known measure of complexity. Our analysis also provides insights on the role of self-attention mechanism in modeling certain behaviors and the influence of positional encoding schemes on the learning and generalization abilities of the model.


Introduction
The Transformer (Vaswani et al., 2017) is a self-attention based architecture that has led to state-of-the-art results across various NLP tasks (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2018). Much effort has been devoted to understanding the inner workings and intermediate representations of pre-trained models; Rogers et al. (2020) is a recent survey. However, our understanding of their practical ability to model different behaviors relevant to sequence modeling is still nascent. On the other hand, a long line of research has sought to understand the capabilities of recurrent neural models such as LSTMs (Hochreiter and Schmidhuber, 1997). Recently, Weiss et al. (2018) and Suzgun et al. (2019a) showed that LSTMs are capable of recognizing counter languages such as Dyck-1 and $a^n b^n$ by learning to perform counting-like behavior. Suzgun et al. (2019a) also showed that LSTMs can recognize shuffles of multiple Dyck-1 languages, also known as Shuffle-Dyck. Since Transformer-based models (e.g., GPT-2 and BERT) are not equipped with recurrence and start computation from scratch at each step, they are incapable of directly maintaining a counter. Moreover, it is known that theoretically RNNs can recognize any regular language in finite precision, and LSTMs work well for this task in practical settings. However, the Transformer's ability to model such properties in practical settings remains an open question.
Prior to the current dominance of Transformers for NLP tasks, recurrent models such as LSTMs were the most common choice, and their computational capabilities have been studied for decades, e.g., (Kolen and Kremer, 2001). In this work, we investigate the ability of Transformers to express, learn, and generalize on certain counter and regular languages. Formal languages provide us a controlled setting to study a network's ability to model different syntactic properties in isolation and the role of its individual components in doing so.
Recent work has demonstrated close connections between LSTMs and counter automata. Hence, we seek to understand the capabilities of Transformers to model languages for which the abilities of LSTMs are well understood. We first show that Transformers are expressive enough to recognize certain counter languages like Shuffle-Dyck and n-ary Boolean Expressions by using the self-attention mechanism to implement the relevant counter operations in an indirect manner. We then extensively evaluate the model's learning and generalization abilities on such counter languages and find that models generalize well on such languages. Visualizing the intermediate representations of these models shows strong correlations with our proposed construction. Although Transformers can generalize well on some popularly used counter languages, we observe that they are limited in their ability to recognize others. We find a clear contrast between the performance of Transformers and LSTMs on regular languages (a subclass of counter languages). Our results indicate that, in contrast to LSTMs, Transformers achieve limited performance on languages that involve modeling periodicity, modular counting, and even simpler star-free variants of Dyck-1, even though they recognize Dyck-1 itself effortlessly. Our analysis provides insights about the significance of different components, namely self-attention, positional encoding, and the number of layers. Our results also show that positional masking and positional encoding can both aid in generalization and training, but in different ways. We conduct extensive experiments on over 25 carefully chosen formal languages. Our results are perhaps the first indication of the limitations of Transformers for practical-sized problems that are, in a precise sense, very simple, and in particular, easy for recurrent models.

Related Work
Numerous works, e.g., Suzgun et al. (2019b); Sennhauser and Berwick (2018); Skachkova et al. (2018), have attempted to understand the capabilities and inner workings of recurrent models by empirically analyzing them on formal languages. Weiss et al. (2018) showed that LSTMs are capable of simulating counter operations and explored their practical ability to recognize languages like $a^n b^n$ and $a^n b^n c^n$. Suzgun et al. (2019a) further showed that LSTMs can learn to recognize Dyck-1 and Shuffle-Dyck and can simulate the behavior of k-counter machines. Theoretical connections of recurrent models have been established with counter languages (Merrill, 2019; Merrill et al., 2020; Merrill, 2020). It has also been shown that RNN-based models can recognize regular languages (Kolen and Kremer, 2001; Korsky and Berwick, 2019) and efforts have been made to extract DFAs from RNNs trained to recognize regular languages (Weiss et al., 2019; Wang et al., 2018b; Michalenko et al., 2019). We are not aware of such studies for Transformers.
Recently, researchers have sought to empirically analyze different aspects of the Transformer trained on practical NLP tasks such as the information contained in intermediate layers (Rogers et al., 2020; Reif et al., 2019; Warstadt et al., 2019). Voita et al. Complementary to these, our work is focused on analyzing the Transformer's ability to model particular behaviors that could be relevant to modeling linguistic structure. Recently, it has been shown that Transformers are Turing-complete (Pérez et al., 2019; Bhattamishra et al., 2020) and are universal approximators of sequence-to-sequence functions given arbitrary precision (Yun et al., 2020). Hahn (2020) shows that Transformers cannot recognize the languages Parity and Dyck-2. However, these results only apply to very long words, and their applicability to practical-sized inputs is not clear (indeed, we will see different behavior for practical-sized inputs). Moreover, these results concern the expressive power of Transformers and do not apply to learning and generalization abilities. Thus, Transformers' ability to model formal languages requires further investigation.

Definitions
We consider the Transformer as used in popular pretrained language models such as BERT and GPT, which is the encoder-only model of the original seq-to-seq architecture (Vaswani et al., 2017). The encoder consists of multiple layers with two blocks each: (1) a self-attention block, (2) a feed-forward network (FFN). For $1 \le i \le n$, at the $i$-th step, the model takes as input the sequence $s_1, s_2, \ldots, s_i$ where $s_i \in \Sigma$ and generates the output vector $y_i$. Each input $s_i$ is first converted into an embedding vector using the function $f_e : \Sigma \to \mathbb{R}^{d_{\text{model}}}$, and usually some form of positional encoding is added to yield the final input vector $x_i$. The embedding dimension $d_{\text{model}}$ is also the dimension of the intermediate vectors of the network. Let $X_i := (x_1, \ldots, x_i)$ for $i \ge 1$.
In the self-attention block, the input vectors undergo linear transformations $Q(\cdot)$, $K(\cdot)$, and $V(\cdot)$, yielding the corresponding query, key, and value vectors, respectively. The self-attention mechanism takes as input a query vector $Q(x_i)$, key vectors $K(X_i)$, and value vectors $V(X_i)$. An attention head produces a vector $a_i = \sum_{t=1}^{i} \alpha_t V(x_t)$, where the weights $(\alpha_1, \ldots, \alpha_i)$ are obtained by applying a softmax to the scores of the query $Q(x_i)$ against the keys $K(x_1), \ldots, K(x_i)$. The output of a layer, denoted by $z_i$, is computed by $z_i = O(a_i)$ where $1 \le i \le n$ and $O(\cdot)$ typically denotes an FFN with ReLU activation. The complete $L$-layer model is a repeated application of the single-layer model described above, which produces a vector $z_i^L$ at its final layer, where $L$ denotes the last layer. The final output is obtained by applying a projection layer with some normalization or an FFN over the vectors $z_i^L$ and is denoted by $y_i = F(z_i^L)$. Residual connections and layer normalization are also applied to aid the learning process of the network.
In an LM setting, when the Transformer processes the input sequentially, each input symbol can only attend over itself and the previous inputs; masking is applied over the inputs following it. Note that providing positional information in this form via masked self-attention is also referred to as positional masking (Vaswani et al., 2017; Shen et al., 2018). A Transformer model without positional encoding and positional masking is order-insensitive.

Formal Languages
Formal languages are abstract models of the syntax of programming and natural languages; they also relate to cognitive linguistics, e.g., Jäger and Rogers (2012); Hahn (2020) and references therein.
Counter Languages. These are languages recognized by a deterministic counter automaton (DCA), that is, a DFA with a finite number of unbounded counters (Fischer et al., 1968). The counters can be incremented/decremented by constant values and can be reset to 0 (details in App. B.1). The counter languages commonly used to study sequence models are Dyck-1, $a^n b^n$, and $a^n b^n c^n$. Several works have explored the ability of recurrent models to recognize these languages as well as their underlying mechanism to do so. We include them in our analysis as well as some more general forms of counter languages such as Shuffle-Dyck (as used in Suzgun et al. (2019a)) and n-ary Boolean Expressions. The language Dyck-1 over alphabet $\Sigma = \{[, ]\}$ consists of balanced parentheses defined by the derivation rules $S \to [\,S\,] \mid SS \mid \epsilon$. Shuffle-Dyck is a family of languages containing shuffles of Dyck-1 languages. Shuffle-k denotes the shuffle of k Dyck-1 languages: it contains k different types of brackets, where each type of bracket is required to be well-balanced, but their relative order is unconstrained. For instance, a Shuffle-2 language over alphabet $\Sigma = \{[, ], (, )\}$ contains the words ([)] and [((])) but not ])[(. We also consider n-ary Boolean Expressions (henceforth BoolExp-n), which are a family of languages of valid Boolean expressions (in prefix notation) parameterized by the number of operators and their individual arities. For instance, the language with unary operator ∼ and binary operator ∧ contains the word '∧ ∼ 01' but not '∼ 10' (formal definitions in App. B).
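The membership conditions for Shuffle-k and for prefix-notation Boolean expressions can be sketched with explicit counters. The following is a minimal illustration, not the formal definitions of App. B: the function names are hypothetical, the ASCII symbols `~` and `&` stand in for ∼ and ∧, and the Boolean-expression check assumes the usual one-counter view in which an operator changes the counter by (arity − 1) and an operand decreases it by 1.

```python
def in_shuffle_dyck(word, pairs=("[]", "()")):
    """Return True if `word` is in Shuffle-k for the given bracket pairs."""
    opener = {p[0]: j for j, p in enumerate(pairs)}
    closer = {p[1]: j for j, p in enumerate(pairs)}
    counters = [0] * len(pairs)          # one counter per bracket type
    for ch in word:
        if ch in opener:
            counters[opener[ch]] += 1
        elif ch in closer:
            j = closer[ch]
            counters[j] -= 1
            if counters[j] < 0:          # closing bracket with no open partner
                return False
        else:
            return False                 # symbol outside the alphabet
    return all(c == 0 for c in counters)


def in_boolexp(word, arity):
    """Return True if `word` is a valid prefix-notation Boolean expression.

    `arity` maps each operator symbol to its arity; '0' and '1' are operands.
    """
    pending = 1                          # sub-expressions still expected
    for ch in word:
        if pending == 0:                 # expression already complete
            return False
        if ch in arity:
            pending += arity[ch] - 1     # operator: counter changes by arity - 1
        elif ch in ("0", "1"):
            pending -= 1                 # operand fills one pending slot
        else:
            return False
    return pending == 0
```

With `arity = {"~": 1, "&": 2}`, the word `&~01` is accepted and `~10` is rejected, matching the examples in the text.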
Note that although languages such as Dyck-1 and $a^n b^n$ are context-free, a DCA with a single counter is sufficient to recognize Dyck-1 and $a^n b^n$. Similarly, a DCA with two single-turn counters can recognize $a^n b^n c^n$. On the other hand, recognizing Shuffle-Dyck requires multiple multi-turn counters, where for a given type of bracket, its corresponding counter is incremented or decremented by 1. Hence, it represents a more general form of counter languages. Similarly, recognizing BoolExp-n requires a 1-counter DCA with the counter updates depending on the operator: a ternary operator will increment the counter by 2 (= arity − 1) whereas a unary operator will increment it by 0. Figure 1 shows the relationship between counter languages and other classes of formal languages.
Regular Languages. Regular languages, perhaps the best studied class of formal languages, form a subclass of counter languages. They neatly divide into two subclasses: star-free and non-star-free. Star-free languages can be described by regular expressions formed by union, intersection, complementation, and concatenation operators but not the Kleene star ($*$). Like regular languages, star-free languages are surprisingly rich with algebraic, logical, and multiple other characterizations and continue to be actively researched, e.g., (McNaughton and Papert, 1971; Jäger and Rogers, 2012). They form a simpler subclass of regular languages where the notion of simplicity can be made precise in various ways, e.g., they are first-order logic definable and cannot represent languages that require modular counting.
We first consider Tomita grammars containing 7 regular languages representable by DFAs of small sizes, a popular benchmark for evaluating recurrent models and extracting DFAs from trained recurrent models (see, e.g., Wang et al. (2018a)). Tomita grammars contain both star-free and non-star-free languages. We further investigate some non-star-free languages such as $(aa)^*$, Parity, and $(abab)^*$. Parity contains words over $\{0, 1\}$ with an even number of 1's. Similarly, $(aa)^*$ and $(abab)^*$ require modeling periodicity.
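To make concrete how small these non-star-free languages are as automata, here is a minimal sketch of two-state DFAs for Parity and $(aa)^*$. The helper `run_dfa` and the state names are illustrative choices, not taken from the paper or its code release.

```python
def run_dfa(word, transitions, start, accepting):
    """Run a DFA given as a dict {(state, symbol): next_state}."""
    state = start
    for ch in word:
        key = (state, ch)
        if key not in transitions:       # symbol not allowed in this state
            return False
        state = transitions[key]
    return state in accepting


# Parity: words over {0, 1} with an even number of 1's.
PARITY = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

# (aa)*: words over {a} of even length.
AA_STAR = {("even", "a"): "odd", ("odd", "a"): "even"}
```

Both automata track a single bit (a count modulo 2), which is the modular counting that star-free languages cannot express.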
On the other hand, the seemingly similar language $(ab)^*$ is star-free: $(ab)^* = (b\emptyset^c + \emptyset^c a + \emptyset^c aa \emptyset^c + \emptyset^c bb \emptyset^c)^c$, where $\cdot^c$ denotes set complementation, and thus $\emptyset^c = \Sigma^*$. The dot-depth of a star-free language is a measure of nested concatenation or sequentiality required in a star-free regular expression (formal definition in App. B.2). We define a family $D_0, D_1, \ldots$ of star-free languages. For $n \ge 1$, the language $D_n$ over $\Sigma = \{a, b\}$ is defined inductively as follows: $D_n = (a D_{n-1} b)^*$ where $D_0 = \epsilon$, the empty word. Thus $D_1 = (ab)^*$ and $D_2 = (a(ab)^*b)^*$. Language $D_n$ is known to have dot-depth $n$.
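The inductive definition of $D_n$ can be checked against the depth-bounded view used later in the paper (recognizing $D_n$ as Dyck-1 with maximum depth $n$, with a and b playing the role of open and closing brackets). The sketch below uses hypothetical helper names; `d_regex` unrolls the recursion $D_n = (a D_{n-1} b)^*$ into a Python regular expression, and `in_D` is the bounded-counter check.

```python
import re


def in_D(word, n):
    """Bounded-depth check: Dyck-1 over {a, b} with depth never exceeding n."""
    depth = 0
    for ch in word:
        if ch == "a":
            depth += 1
            if depth > n:                # would exceed the maximum depth
                return False
        elif ch == "b":
            depth -= 1
            if depth < 0:                # closing symbol without a match
                return False
        else:
            return False
    return depth == 0


def d_regex(n):
    """Unroll D_n = (a D_{n-1} b)* with D_0 = empty word into a regex string."""
    pattern = ""                         # D_0 matches only the empty word
    for _ in range(n):
        pattern = f"(?:a{pattern}b)*"
    return pattern
```

For example, `d_regex(2)` yields `(?:a(?:ab)*b)*`, and `re.fullmatch(d_regex(2), "aabb")` succeeds while `in_D("aabb", 1)` is False.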
The list of all considered languages and their definitions is provided in App. B.

Expressiveness Results
Proposition 4.1. There exists a Transformer as defined in Section 3 that can recognize the family of languages Shuffle-Dyck.
For each type of open bracket $[_j$ where $0 \le j < k$, the vector $f_e([_j)$ has the values $+1$ and $-1$ at the indices $2j$ and $2j+1$, respectively. It has the value 0 at the rest of the indices. Similarly, for each closing bracket, the vector $f_e(]_j)$ has the values $-1$ and $+1$ at the indices $2j$ and $2j+1$, and it has the value 0 at the rest of the indices. For Dyck-1, this would lead to $f_e([) = [+1, -1]^T$ and $f_e(]) = [-1, +1]^T$ (with $d_{\text{model}} = 2$). We use a single-layer Transformer where we set the matrix corresponding to the linear transformation for key vectors to be the null matrix, that is, $K(x) = 0$ for all $x$. This will lead to equal attention weights for all inputs. The matrices corresponding to $Q(\cdot)$ and $V(\cdot)$ are set to the identity.
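With equal attention weights, the output of the self-attention block at step $i$ is the average of the first $i$ value vectors. For Dyck-1, this gives $a_i = \left[\frac{\sigma([) - \sigma(])}{i}, \frac{\sigma(]) - \sigma([)}{i}\right]^T$, where $\sigma(s)$ denotes the number of occurrences of the symbol $s$ among the first $i$ inputs. In $a_i$, the value $\sigma([) - \sigma(])$ represents the depth (difference between the number of open and closing brackets) of the Dyck-1 word at index $i$. Hence, the first coordinate is the ratio of the depth of the Dyck-1 word and its length at that index, while the other coordinate is its negative.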
We then apply a simple FFN with ReLU activation over the vector $a_i$, so that $z_i = \mathrm{ReLU}(I a_i)$. The even indices of the vector $z_i$ will be nonzero if the number of open brackets of the corresponding type is greater than the number of closing brackets. Likewise, the odd indices will be nonzero if the number of closing brackets of the corresponding type is greater than the number of open brackets. Thus, for a given word to be in Shuffle-k, the values at the odd indices of the vector $z_i$ must be zero at every step, and the values of all coordinates must be zero at the last step to ensure that the numbers of open and closing brackets are the same.
For an input sequence $s_1, s_2, \ldots, s_n$, the model will produce $z_1, \ldots, z_n$ based on the construction specified above. A word $w$ belongs to the language Shuffle-k if $z_{i,2j+1} = 0$ for all $1 \le i \le n$, $0 \le j < k$ and $z_n = 0$, and does not belong to the language otherwise. This can be easily implemented by an additional layer of self-attention and feed-forward network to classify a given sequence.
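The construction above can be traced step by step in a few lines. The following is a minimal sketch under stated assumptions: it simulates only the arithmetic of the single layer (uniform attention from $K(x) = 0$, identity value map, ReLU), uses exact rationals in place of finite-precision floats, and replaces the final classification layer with a direct check of the acceptance condition. The function names are hypothetical.

```python
from fractions import Fraction


def embed(symbol, pairs):
    """f_e: +1/-1 at indices (2j, 2j+1) for an open bracket of type j,
    -1/+1 for the matching closing bracket, 0 elsewhere."""
    vec = [0] * (2 * len(pairs))
    for j, (open_b, close_b) in enumerate(pairs):
        if symbol == open_b:
            vec[2 * j], vec[2 * j + 1] = 1, -1
        elif symbol == close_b:
            vec[2 * j], vec[2 * j + 1] = -1, 1
    return vec


def construction_accepts(word, pairs=("[]", "()")):
    """Apply the acceptance rule on z_1..z_n for a word over the bracket alphabet."""
    if not word:
        return True                      # the empty word is balanced
    running_sum = [0] * (2 * len(pairs))
    z = running_sum
    for i, symbol in enumerate(word, start=1):
        x = embed(symbol, pairs)
        running_sum = [s + v for s, v in zip(running_sum, x)]
        # Uniform attention weights 1/i over values V(x_t) = x_t give the average.
        a = [Fraction(s, i) for s in running_sum]
        # z_i = ReLU(I a_i)
        z = [max(Fraction(0), c) for c in a]
        # Odd coordinates must stay zero: no bracket type is over-closed.
        if any(z[2 * j + 1] != 0 for j in range(len(pairs))):
            return False
    return all(c == 0 for c in z)        # z_n = 0: every type is balanced
```

Each coordinate of `a` is a depth-to-length ratio, so its sign carries the same information as the corresponding counter even though no counter is stored explicitly.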
The bottleneck for precision in the construction is the computation of the depth-to-length ratios in the vector $a_i$. Since in a finite precision setting with $r$ bits, this can be computed up to a value exponential in $r$, our proof entails that Transformers can recognize languages in Shuffle-Dyck for lengths exponential in the number of bits.
Using similar logic, one can also show that Transformers can recognize the family of languages BoolExp-n (refer to Lemma C.2). By setting the value vectors according to the arities of the operators, the model can obtain the ratio of the counter value of the underlying automaton and the length of the input at each step via self-attention. Although the above construction is specific to these language families, we provide a proof for a more general but restricted subclass of counter languages in the appendix (refer to Lemma C.1). The above construction serves to illustrate how Transformers can recognize such languages by indirectly doing the relevant computations. As we will later see, this will also help us interpret how trained models recognize such languages.

Experimental Setup
In our experiments, we consider 27 formal languages belonging to different parts in the hierarchy of counter and regular languages. For each language, we generate samples within a fixed-length window for our training set and generate multiple validation sets with different windows of length to evaluate the model's generalization ability.
For most of the languages, we generate 10k samples for our training sets within lengths 1 to 50 and create different validation sets containing samples with distinct but contiguous windows of length. The number of samples in each validation set is 2k, and the width of each window is about 50. For languages that have very few positive examples in a given window of length, such as $(ab)^*$ and $a^n b^n c^n$, we train on all positive examples within the training window. Similarly, each validation set contains all possible strings of the language for a particular range. Table 6 in the appendix lists the dataset statistics of all 27 formal languages we consider. We have made our source code available at https://github.com/satwik77/Transformer-Formal-Languages.

Training details
We train the model on the character prediction task as introduced in Gers and Schmidhuber (2001) and as used in Suzgun et al. (2019b,a). Similar to an LM setup, the model is only presented with positive samples from the given language. For an input sequence $s_1, s_2, \ldots, s_n$, the model receives the sequence $s_1, \ldots, s_i$ for $1 \le i \le n$ at each step $i$, and the goal of the model is to predict the set of legal/valid characters for the $(i+1)$-th step. From here onwards, we say a model can recognize a language if it can perform the character prediction task perfectly.
The model assigns a probability to each character in the vocabulary of the language corresponding to its validity in the next time-step. The output can be represented by a k-hot vector where each coordinate corresponds to a character in the vocabulary of the language. The output is computed by applying a sigmoid activation over the scores assigned by the model for each character. Following Suzgun et al. (2019b,a), the learning objective of the model is to minimize the mean-squared error between the predicted probabilities and the k-hot labels. During inference, we use a threshold of 0.5 to obtain the predictions of the model. For a test sample, the model's prediction is considered to be correct if and only if its output at every step is correct. Note that this is a relatively stringent metric, as a correct prediction is obtained only when the output is correct at every step. The accuracy of the model over test samples is the fraction of total samples predicted correctly. Similar to Suzgun et al. (2019a), we consider models of small sizes to prevent them from memorizing the training set and to make it feasible to visualize the model. In our experiments, we consider Transformers with up to 4 layers, 4 heads, and intermediate vectors of dimension 2 to 32. We extensively tune the model across various hyperparameter settings. We also examine the influence of providing positional information in different ways such as absolute encodings, relative encodings (Dai et al., 2019), and using only positional masking without any explicit encodings.
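The evaluation rule described above (sigmoid, threshold of 0.5, and a sample counted as correct only if every step is correct) can be sketched as follows. This is an illustrative reimplementation rather than the released code: the function names are hypothetical, and the Dyck-1 target builder assumes a two-symbol vocabulary (`[`, `]`) with no end-of-sequence symbol.

```python
import math


def dyck1_targets(word):
    """k-hot targets over ('[', ']') for the character following each prefix.

    Assumes `word` is a valid Dyck-1 prefix; '[' is always legal, ']' only
    when the current depth is positive.
    """
    targets, depth = [], 0
    for ch in word:
        depth += 1 if ch == "[" else -1
        targets.append((1, 1 if depth > 0 else 0))
    return targets


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(scores, threshold=0.5):
    """Threshold the sigmoid of each raw score to obtain k-hot predictions."""
    return [tuple(int(sigmoid(s) > threshold) for s in step) for step in scores]


def sequence_correct(scores, targets):
    """A sample is correct only if the prediction matches at every step."""
    return predict(scores) == list(targets)


def accuracy(examples):
    """Fraction of (scores, targets) pairs that are correct at every step."""
    correct = sum(sequence_correct(s, t) for s, t in examples)
    return correct / len(examples)
```

Because a single wrong step makes the whole sample incorrect, accuracy under this metric can be far lower than per-step accuracy on long sequences.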

Results on Counter Languages
We evaluated the performance of the model on 9 counter languages. Table 1 shows the performance of different models described above on some representative languages. We also include the performance of LSTMs as a baseline. We found that Transformers of small size (single head and single layer) can generalize well on some general form of counter languages such as Shuffle-Dyck and BoolExp-n. Surprisingly, we observed this behavior when the network was not provided any form of explicit positional encodings, and positional information was only available in the form of masking. For models with positional encoding, the lack of the ability to generalize to higher lengths could be attributed to the fact that the model has never been trained on some of the positional encodings that it receives at test time. On the other hand, the model without any explicit form of positional encoding is less susceptible to such issues if it is capable of performing the task and was found to generalize well across various hyperparameter settings.

[Figure 2 plot omitted; axis label: Output of Self-Attention Block.]
Figure 2: Values of different coordinates of the output of the self-attention block of the models trained on Shuffle-2 and BoolExp-3. The dotted lines are the scaled depth-to-length (DL) ratios for Shuffle-2 and the scaled counter-value-to-length (CL) ratios for BoolExp-3. We observe a near-perfect Pearson correlation coefficient of 0.99 between the outputs of the self-attention block and the DL and CL ratios.

Role of Self-Attention
In order to check our hypothesis in Sec. 4, we visualize certain attributes of trained models that generalize well on Shuffle-2 and BoolExp-3. Our construction in Sec. 4 recognizes sequences in Shuffle-Dyck by computing the depth-to-length ratio of the input at each step via the self-attention mechanism. For BoolExp-n, the model can achieve the task similarly by computing the corresponding counter value divided by length (refer to Lemma C.2). Interestingly, upon visualizing the outputs of the self-attention block for a model trained on Shuffle-2, we found a strong correlation of its elements with the depth-to-length ratio. As shown in Fig. 2a, different coordinates of the output vector of the self-attention block contain computations corresponding to different counters of the Shuffle-2 language. We observe the same behavior for models trained on the Shuffle-4 language (refer to Figure 5 in the appendix). Similarly, upon visualizing a model trained on Boolean Expressions with 3 operators, we found a strong correlation between its elements and the ratio of the counter value and the length of the input (refer to Figure 2b). This indicates that the model learns to recognize inputs by carrying out the required computation in an indirect manner, as described in our construction. Additionally, for both models, we found that the attention weights of the self-attention block were uniformly distributed (refer to Figure 4 in the appendix). Further, on inspecting the embedding and value vectors of the open and closing brackets, we found that their respective coordinates were opposite in sign and similar in magnitude. As opposed to Shuffle-Dyck, for BoolExp-n, the magnitudes of the elements in the value vectors were according to their corresponding arity. For instance, the magnitude for a ternary operator was (almost) thrice the magnitude for a unary operator (refer to Figure 3 in the appendix).
These observations are consistent with our construction, indicating that the model uses its value vectors to determine the counter updates and then, at each step, aggregates all the values to obtain a form of the final counter value in an indirect manner. This is complementary to LSTMs, which can simulate the behavior of k-counters more directly by making respective updates to their cell states upon receiving each input (Suzgun et al., 2019a).

Limitations of the Single-Layer Transformer
Although we observed that single-layer Transformers are easily able to recognize some of the popularly studied counter languages, the same is not necessarily true for counter languages that require reset operations. We define a variant of the Dyck-1 language. Let Reset-Dyck-1 be the language defined over the alphabet $\Sigma = \{[, ], \#\}$, where $\#$ denotes a symbol that resets the counter. Words in Reset-Dyck-1 have the form $\Sigma^* \# v$, where the string $v$ belongs to Dyck-1. When the machine encounters the reset symbol $\#$, it must ignore all the previous input, reset the counter to 0, and go to the start state. It is easy to show that this cannot be directly implemented with a single-layer self-attention network with positional masking (Lemma C.3 in the appendix). The key limitation, both with and without encodings, is the fact that for a single-layer network the scoring function $\langle Q(x_n), K(x_\#) \rangle$ and the value vector corresponding to the reset symbol are independent of the preceding inputs, which the reset symbol is supposed to negate (reset). The same limitation does not hold for multi-layer networks, where the value vector as well as the scoring function for the reset symbol are dependent on its preceding inputs. On evaluating the model on data generated from such a language, we found that single-layer networks are unable to perform well, in contrast to networks with two layers.
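The reset behavior that a counter machine needs for Reset-Dyck-1 can be sketched as below. This is a minimal illustration that reads the stated form $\Sigma^* \# v$ literally, so a word must contain at least one reset symbol; the function name is hypothetical and the formal definition in the appendix may differ in such details.

```python
def in_reset_dyck1(word):
    """Streaming check for words of the form (anything) # v with v in Dyck-1."""
    depth = 0              # the single counter
    dead = False           # True once the current segment is over-closed
    seen_reset = False     # the stated form requires at least one '#'
    for ch in word:
        if ch == "#":
            # Reset: forget everything read so far and return to the start state.
            depth, dead, seen_reset = 0, False, True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                dead = True
        else:
            return False   # symbol outside the alphabet
    return seen_reset and not dead and depth == 0
```

Note that the update triggered by `#` depends on the entire state accumulated before it, which is the dependence a single attention layer cannot express according to the argument above.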

Results on Regular Languages
We first examine the popular benchmark of Tomita grammars. While the LSTMs generalize perfectly on all 7 languages, Transformers are unable to generalize on 3 languages, all of which are non-star-free. Note that all star-free languages in Tomita grammars have dot-depth 1. Recognizing non-star-free languages requires modeling properties such as periodicity and modular counting. Consequently, we evaluate the model on some of the simplest non-star-free languages such as $(aa)^*$ and Parity. We find that Transformers consistently fail to learn or generalize on such languages, whereas LSTMs of very small sizes perform flawlessly. Table 4 lists the performance on some non-star-free languages. Note that LSTMs can easily recognize the simple non-star-free languages considered here by using their internal memory and recurrence. However, doing the same task via the self-attention mechanism without using any internal memory could be highly non-trivial and potentially impossible. Languages such as $(aa)^*$ and Parity are among the simplest non-star-free languages, and hence limitations in recognizing such languages carry over to a larger class of languages. The results above may suggest that the star-free languages are precisely the regular languages recognizable by Transformers. As we will see in the next section, this is not so.
Table 4: Results on non-star-free languages (non-SF) and the language $D_n$. The values in parentheses correspond to the scores obtained for a model without residual connections. Residual connections are removed to prevent the model from solving the task by memorizing the positional encodings and to study the ability of the self-attention mechanism to solve the task.

Necessity of Positional Encodings
The architecture of Transformer imposes limitations for recognizing certain types of languages.
Although Transformers seem to generalize well when they are capable of performing a task with only positional masking, they are incapable of recognizing certain types of languages without explicit positional encodings. We consider the family of star-free languages $D_n$ defined in Sec. 3.1. Note that the task of recognizing $D_n$ is equivalent to recognizing Dyck-1 with maximum depth $n$, where the symbols a and b in $D_n$ are analogous to open and closing brackets in Dyck-1, respectively. The primary difference between recognizing $D_n$ and Dyck-1 is that in the case of $D_n$, when the input reaches the maximum depth $n$, the model must predict a (the open bracket) as invalid for the next character, whereas in Dyck-1, open brackets are always allowed. We show that although Transformers with only positional masking can generalize well on Dyck-1, they are incapable of recognizing the language $D_n$ for $n > 1$. The limitation arises from the fact that when the model receives a sequence of only a's, then due to the softmax-based aggregation, the output of the self-attention block $a_i$ will be a constant vector, implying that the output of the feed-forward network will also be a constant vector, that is, $z_1 = z_2 = \ldots = z_n$. In the case of languages such as $D_n$, if the input begins with $n$ consecutive a's, then, since the model cannot distinguish between the $n$-th a and the preceding a's, the model cannot recognize the language $D_n$. This limitation does not exist if the model is provided explicit positional encoding. Upon evaluating Transformers with positional encodings on instances of the language $D_n$, we found that the models are able to generalize to a certain extent on strings within the same lengths as seen during training but fail to generalize on higher lengths (Table 4). It is perhaps surprising that small and simple self-attention networks can generalize very well on languages such as Dyck-1 but achieve limited performance on a language that belongs to a much simpler class such as star-free. Similarly, since $(aa)^*$ is a unary language (alphabet size is 1), the model will always receive the same character at each step. Hence, for a model with only positional masking, the output vector will be the same at every step, making it incapable of recognizing the language $(aa)^*$. For the language Parity, when the input word contains only 1's, the task reduces to recognizing $(11)^*$, and hence a model without positional encodings is incapable of recognizing Parity even for very small lengths regardless of the size of the network (refer to Lemma C.4). We find it surprising that for Parity, which is permutation invariant, positional encodings are necessary for Transformers to recognize it even for very small lengths.
Table 5: Performance of Transformer-based models on $(aa)^*$ and $(aaaa)^*$ for different types of position encoding schemes. To separately study the effect of different position encodings on the self-attention mechanism, we do not include residual connections in the models studied here.
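The constant-output argument for inputs made of a single repeated symbol can be checked numerically. The sketch below is a plain-Python single-head masked self-attention with dot-product scores and no positional encoding; the weight matrices passed in are arbitrary illustrative values, and the helper names are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def masked_attention(xs, wq, wk, wv):
    """Single-head self-attention where step i attends only to steps 1..i."""
    outputs = []
    for i in range(len(xs)):
        query = matvec(wq, xs[i])
        keys = [matvec(wk, x) for x in xs[: i + 1]]
        values = [matvec(wv, x) for x in xs[: i + 1]]
        scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
        top = max(scores)                          # stabilise the softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

When every `xs[t]` is the same embedding, every value vector is identical and the softmax weights sum to 1, so each output equals that one value vector regardless of the step, which is why the $n$-th a cannot be told apart from the earlier ones.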

Influence of Custom Positional Encodings
The capability and complexity of the network could significantly depend on the positional encoding scheme. For instance, for the language $(aa)^*$, the ability of a self-attention network to recognize it depends solely on the positional encoding. Upon evaluating with standard absolute and relative encoding schemes, we observe that the model is unable to learn or generalize well. At the same time, it is easy to show that if $\cos(n\pi)$, which has a period of two, is used as the positional encoding, the self-attention mechanism can easily achieve the task, which we also observed when we empirically evaluated the model with such an encoding. However, the same encoding would not work for a language such as $(aaaa)^*$, which has a periodicity of four. Table 5 shows the performance of the model with different types of encodings. When we used fixed-length trainable positional embeddings, the learned embeddings were very similar to the $\cos(n\pi)$ form; however, such embeddings cannot be used for sequences of higher lengths. This also raises the need for better learnable encoding schemes that can extrapolate to variable lengths of inputs not seen during training, such as that of Liu et al. (2020).
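The following sketch illustrates only why a $\cos(n\pi)$ encoding carries exactly the information needed for $(aa)^*$ and not for $(aaaa)^*$; it does not train or run a self-attention model. The function names are hypothetical, and positions are assumed to be counted from 1 so that the encoding at position $n$ reflects a prefix of length $n$.

```python
import math


def cos_encoding(position):
    """cos(n * pi), rounded to remove floating-point noise: +1 or -1."""
    return round(math.cos(position * math.pi))


def aa_star_from_encoding(length):
    """a^length is in (aa)* iff length is even, i.e. the encoding is +1."""
    return cos_encoding(length) == 1


def aaaa_star(length):
    """Ground truth for (aaaa)*: length must be a multiple of four."""
    return length % 4 == 0
```

Positions 2 and 4 receive the same encoding (+1), yet only the prefix of length 4 is in $(aaaa)^*$, so no function of this encoding alone can separate them.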
Our experiments on over 15 regular languages seem to indicate that Transformers are able to generalize on star-free languages within dot-depth 1 but have difficulty with higher dot-depths or more complex classes like non-star-free languages. Table 9 in Appendix lists results on all considered regular languages.

Discussion
We showed that Transformers can easily generalize on certain counter languages such as Shuffle-Dyck and Boolean Expressions in a manner similar to our proposed construction. Our visualizations imply that Transformers do so with a generalizable mechanism instead of overfitting on some statistical regularities. Similar to natural languages, Boolean Expressions consist of recursively nested hierarchical constituents. Recently, Papadimitriou and Jurafsky (2020) showed that pretraining LSTMs on formal languages like Shuffle-Dyck transfers to LM performance on natural languages. At the same time, our results show clear limitations of Transformers compared to LSTMs on a large class of regular languages. Evidently, the performance and capabilities of Transformers heavily depend on architectural constituents, e.g., the positional encoding schemes and the number of layers. Recurrent models have a more automata-like structure well-suited for counter and regular languages, whereas self-attention networks' structure is very different, which seems to limit their abilities for the considered tasks.
Our work poses a number of open questions. Our results are consistent with the hypothesis that Transformers generalize well for star-free languages with dot-depth 1, but not for higher depths. Clarifying this hypothesis theoretically and empirically is an attractive challenge. What does the disparity between the performance of Transformers on natural and formal languages indicate about the complexity of natural languages and their relation to linguistic analysis? (See also Hahn (2020).) Another interesting direction would be to understand whether certain modifications or recently proposed variants of Transformers improve their performance on formal languages. Regular and counter languages model some aspects of natural language while context-free languages model other aspects such as hierarchical dependencies. Although our results have some implications on them, we leave a detailed study on context-free languages for future work.

A Roadmap
The appendix is organized as follows. In section B we first provide formal definitions of the key languages used in our investigation in the main paper. In sections B.1 and B.2, we also provide the formal definitions of automata, star-free languages and the dot-depth hierarchy. In section C, we provide the details of all our expressiveness results. Section D contains the details of our experimental setup which could be relevant for reproducibility of the results and includes a thorough discussion of the choice of character prediction task. The list of all the formal languages we have considered, their dataset statistics as well as the results are provided in section D.

B Definitions
In this section, we provide formal definitions of some of the languages used in our analysis. In counter languages, we first define the family of shuffled Dyck-1 languages. The language Dyck-1 is a simple context-free language that can also be recognized by a counter automaton with a single counter. We generate the data for Dyck-1 based on the following PCFG,

$$S \rightarrow \begin{cases} (S) & \text{with probability } p \\ SS & \text{with probability } q \\ \varepsilon & \text{with probability } 1 - (p + q) \end{cases}$$

where 0 < p, q < 1 and (p + q) < 1. We use 0.5 as the value of p and 0.25 as the value for q.

Shuffle-Dyck. We now define the Shuffle-Dyck language introduced and described in (Suzgun et al., 2019a). We first define the shuffling operation formally. The shuffling operation || : Σ * × Σ * → P(Σ * ) can be inductively defined as follows (see footnote 9):

$$u \,\|\, \varepsilon = \varepsilon \,\|\, u = \{u\}, \qquad \alpha u \,\|\, \beta v = \alpha (u \,\|\, \beta v) \cup \beta (\alpha u \,\|\, v)$$

for any α, β ∈ Σ and u, v ∈ Σ * . For instance, the shuffle of ab and cd is ab||cd = {abcd, acbd, acdb, cabd, cadb, cdab}.
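The inductive definition of the shuffling operation translates directly into a short recursive function. The sketch below is illustrative only (the name `shuffle` is ours) and enumerates every interleaving, so it is meant for short strings.

```python
from typing import Set

def shuffle(u: str, v: str) -> Set[str]:
    """All interleavings of u and v that preserve the order within each string."""
    # Base cases: shuffling with the empty string returns the other string.
    if not u:
        return {v}
    if not v:
        return {u}
    # Inductive case: the result starts with the first symbol of u or of v.
    starts_with_u = {u[0] + w for w in shuffle(u[1:], v)}
    starts_with_v = {v[0] + w for w in shuffle(u, v[1:])}
    return starts_with_u | starts_with_v

example = shuffle("ab", "cd")
```

Applied to ab and cd, this yields the six strings listed in the example above.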
There is a natural extension of the shuffling operation || to languages. The shuffle of two languages L 1 and L 2 , denoted L 1 ||L 2 , is the set of all possible interleavings of the elements of L 1 and L 2 , respectively, that is:

$$L_1 \,\|\, L_2 = \bigcup_{u \in L_1,\; v \in L_2} u \,\|\, v$$

Footnote 9: We abuse notation by allowing a string to stand for the singleton containing that string. ε is the empty string.

Given a language L, we define its self-shuffling L|| 2 to be L||σ(L), where σ is an isomorphism on the vocabulary of L to a disjoint vocabulary. More generally, we define the k-self-shuffle L|| k analogously, as the shuffle of k copies of L over pairwise disjoint vocabularies. We use Shuffle-k to denote the shuffle of k Dyck-1 languages (Dyck-1|| k ) each with its own brackets. Shuffle-1 is the same as Dyck-1. Following (Suzgun et al., 2019a), we generate the training data by generating sequences for Dyck-n but providing the correct target values for the character prediction task.

n-ary Boolean Expressions. We now define the family of languages n-ary Boolean Expressions parameterized by the number and arities of its operators. An instance of the language contains operators of different arities and, as shown in (Fischer et al., 1968), these languages can be recognized by counter-machines with a single counter. However, as opposed to Dyck-1, the values by which the counter is incremented or decremented depend on the arity of the operator. A language with n operators can be defined by the following derivation rules:

    <exp> -> <VALUE>
    <exp> -> <UNARY> <exp>
    <exp> -> <BINARY> <exp> <exp>
    ..
    <exp> -> <n-ARY> <exp> .. <exp>

Tomita Grammars. Tomita Grammars are 7 regular languages defined on the alphabet Σ = {0, 1}. Tomita-1 has the regular expression 1 * , i.e. only strings containing 1s and no 0s are allowed. Tomita-2 is defined by the regular expression (10) * . Tomita-3 accepts the strings where an odd number of consecutive 1s is always followed by an even number of 0s. Tomita-4 accepts the strings that do not contain 3 consecutive 0s. In Tomita-5 only the strings containing an even number of 0s and an even number of 1s are allowed. In Tomita-6 the difference in the number of 1s and 0s should be divisible by 3 and finally, Tomita-7 has the regular expression 0 * 1 * 0 * 1 * .
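Several of the Tomita languages can be written as one-line membership tests, which may help when sanity-checking generated data. The sketch below is illustrative and not the code used in the experiments; the dictionary name `TOMITA` is ours, and Tomita-3 is left out because its prose description would need a more careful automaton than a one-liner.

```python
import re

# Membership predicates for Tomita-1, 2, 4, 5, 6 and 7 over the alphabet {0, 1}.
TOMITA = {
    1: lambda w: re.fullmatch(r"1*", w) is not None,
    2: lambda w: re.fullmatch(r"(10)*", w) is not None,
    4: lambda w: "000" not in w,                      # no three consecutive 0s
    5: lambda w: w.count("0") % 2 == 0 and w.count("1") % 2 == 0,
    6: lambda w: (w.count("1") - w.count("0")) % 3 == 0,
    7: lambda w: re.fullmatch(r"0*1*0*1*", w) is not None,
}
```

For example, `TOMITA[7]("0101")` holds, while `TOMITA[7]("01010")` does not, since the final 0 would require a third block of 0s.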

B.1 Counter Automata
We define the general counter machine following (Fischer et al., 1968). We are concerned with realtime counter machines here in which the number of computation steps is bounded by the number of inputs similar to how we use sequence models in practice. The machine has a finite number of unbounded counters and it modifies it by adding or subtracting values or resetting the counter value to 0. For m ∈ Z, let +m denote the function x → x + m. Let ×0 denote the constant zero function x → 0.

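A k-counter machine is a tuple ⟨Σ, Q, q 0 , u, δ, F⟩ with
1. A finite alphabet Σ
2. A finite set of states Q
3. An initial state q 0
4. A counter update function u : Σ × Q × {0, 1} k → ({+m : m ∈ Z} ∪ {×0}) k
5. A state transition function δ : Σ × Q × {0, 1} k → Q
6. An acceptance mask F ⊆ Q × {0, 1} k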
A machine processes an input string x one token at a time. For each token, we use u to update the counters and δ to update the state according to the current input token, the current state, and a finite mask of the current counter values.
For a vector v, let z(v) denote the broadcasted "zero-check" function, i.e. z(v) i is 0 if v i = 0 or 1 otherwise. Let ⟨q, c⟩ ∈ Q × Z k be a configuration of machine M . Upon reading input x t ∈ Σ, we define the transition

$$\langle q, c \rangle \rightarrow_{x_t} \langle \delta(x_t, q, z(c)),\; u(x_t, q, z(c))(c) \rangle.$$
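For any string x ∈ Σ * with length n, a counter machine accepts x if there exist states q 1 , .., q n and counter configurations c 1 , .., c n such that

$$\langle q_0, \mathbf{0} \rangle \rightarrow_{x_1} \langle q_1, c_1 \rangle \rightarrow_{x_2} \cdots \rightarrow_{x_n} \langle q_n, c_n \rangle \quad \text{and} \quad \langle q_n, z(c_n) \rangle \in F.$$

A counter machine accepts a language L if, for each x ∈ Σ * , it accepts x iff x ∈ L. Refer to (Merrill, 2020) for more details on counter machines, variants and their properties.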
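To make the definition concrete, the following sketch simulates a real-time counter machine and instantiates it for Dyck-1 with a single counter. All names (`run`, `u_dyck`, `delta_dyck`, the state labels "ok" and "dead") are hypothetical choices for this illustration, and the counters are assumed to start at zero.

```python
def plus(m):
    """The update function +m, i.e. x -> x + m."""
    return lambda x: x + m

def zero_check(c):
    """Broadcasted zero-check z(c): 0 where a counter is zero, 1 otherwise."""
    return tuple(0 if ci == 0 else 1 for ci in c)

def run(word, q0, k, u, delta, accept):
    """Process one token per step and test the final configuration against accept."""
    q, c = q0, (0,) * k
    for x in word:
        mask = zero_check(c)
        updates, q_next = u(x, q, mask), delta(x, q, mask)
        c = tuple(f(ci) for f, ci in zip(updates, c))
        q = q_next
    return (q, zero_check(c)) in accept

# Dyck-1 over "(" and ")": one counter tracks the current depth.
def u_dyck(x, q, mask):
    return (plus(1),) if x == "(" else (plus(-1),)

def delta_dyck(x, q, mask):
    # Closing a bracket while the counter is zero leads to a dead state.
    if q == "dead" or (x == ")" and mask[0] == 0):
        return "dead"
    return "ok"

ACCEPT = {("ok", (0,))}

def accepts_dyck1(word):
    return run(word, "ok", 1, u_dyck, delta_dyck, ACCEPT)
```

Note that the Dyck-1 machine needs the zero-check mask in its state transition, which is exactly what the simplified and stateless machines of Definition C.1 do not have access to.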

B.2 Star-free regular languages and the dot-depth hierarchy
Star-free regular languages (defined in the main paper) are a simpler subclass of regular languages; they have regular expressions without Kleene star (but use set complementation). The set of star-free languages is further stratified by the dot-depth hierarchy, which is a hierarchy of families of languages whose union is the family of star-free languages. Informally, the position of a language in this hierarchy is a measure of the number of nested concatenations or sequentiality required to express the language in a star-free regular expression. Both the star-free regular languages as well as the dot-depth hierarchy are well-studied with rich connections and multiple (equivalent) definitions. For more information, see e.g. (McNaughton and Papert, 1971;Cohen and Brzozowski, 1971;Straubing, 1994;Diekert and Gastin, 2008;Jäger and Rogers, 2012;Pin, 2017).
To define the dot-depth hierarchy, we first define Boolean and concatenation closures of language families. For a language family L over a finite alphabet Σ = {a 1 , . . . , a k }, its Boolean closure BL is the set of languages obtained by applying Boolean operators (union, intersection and set complementation w.r.t. Σ * ) to the languages in L. In other words, BL is the smallest family of languages containing L and closed under Boolean operations: if L 1 , L 2 ∈ L then L 1 ∩ L 2 ∈ BL and L 1 ∪ L 2 ∈ BL and the complements L 1 ^c , L 2 ^c ∈ BL. Similarly, define the concatenation closure ML of L as the smallest family of languages containing L and closed under concatenation: if L 1 , L 2 ∈ L then L 1 L 2 ∈ ML.
We begin with the class E of basic languages consisting of {a 1 }, . . . , {a k }, {ε}, ∅, where ε is the empty string. By alternately applying the operators B and M to E we can define the hierarchy E ⊆ ME ⊆ BME ⊆ MBME ⊆ . . . .
Let B 0 = BME. The dot-depth hierarchy is the sequence of families of languages B 0 ⊆ B 1 ⊆ . . . defined inductively by B n+1 = BMB n for n ≥ 0. It is known that all the inclusions in B 0 ⊆ B 1 ⊆ . . . are strict, as exemplified by the languages D n (see Pin (2017)). Minor variations in the definition exist in the literature; in particular, we could have applied the operator B first, but these have only minor effects on the overall concept and results.

C Expressiveness Results
We define a weaker version of counter automata which are restricted in a certain sense. Then, we show that Transformers are at least as powerful as such automata.
Definition C.1 (Simplified and Stateless counter machine). We define a counter machine to be simplified and stateless if u and δ have the following form: u : Σ → {+m : m ∈ Z} k and δ : Σ → Q. That is, the machine has k counters, which can be incremented or decremented by any value, but the update depends only on the input symbol. Similarly, the state transition depends only on the current input. A string x ∈ Σ * is accepted if ⟨q n , z(c n )⟩ ∈ F . We use L RCL to denote the class of languages recognized by such a counter machine. The above machine is similar to the Σ-restricted counter machine defined in (Merrill et al., 2020).
Proof. Let s 1 , s 2 , . . . , s n denote a sequence w ∈ Σ * . If the counter machine has k counters, then let the dimension of intermediate vectors d model = 2k + |Σ|. The first 2k dimensions will be reserved for counter related operations and then |Q| dimensions will be reserved to obtain the state vector. The embedding vector x i of each symbol will have 0s in the first 2k dimensions and the last |Σ| dimensions will have the one-hot encoding representation of the symbol. For a k counter machine the value vectors would have a subvector of dimension 2 reserved for computations pertaining to each of the counters. That is, x 2j:2j+1 will be reserved for the jth counter where 0 ≤ j < k. For any given input symbol s, if u(s) has counter operation of +m at the jth counter, then the value will be such that v will contain +m at index 2j and −m at index 2j + 1 up to index 2k. The last |Σ| dimensions will have the value 0 in the value vectors. This can be easily obtained by a linear transformation V (.) over one-hot encodings. The linear transformation K(.) to obtain the key vectors will lead to zero vectors and hence all inputs will have equal attention weights. The linear transformation V (.) to obtain the value vectors v i will be identity function. Hence the output of the self-attention block along with residual connection will be of the form

$$a_i = \frac{1}{i} \sum_{t=1}^{i} v_t + x_i.$$

The last |Σ| dimensions of the vector a i will have one-hot encoding of the input vector at i-th step. The one-hot encoding of the input can be easily mapped to the one-hot encoding for the corresponding state using a simple FFN. Additionally, this will ensure that, at the i-th step, the output of the self-attention block a i will have the value c j / i at index 2j, where c j denotes the value of the j-th counter of the counter automaton representing the language. Similarly, the odd indices 2j + 1 will have the value −c j / i. After applying a simple feed-forward network with ReLU activation, we obtain the output vector z i .
It is easy to implement the zero check function with a simple linear layer over the output vector. The network accepts an input sequence w when the values in the output vector corresponding to each counter and state at the n-th step correspond to that required for the final state.
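The counter part of this construction can be checked numerically. The sketch below is a simplification under stated assumptions: it keeps only the 2k counter dimensions (no one-hot or state dimensions and no residual term), uses uniform causal attention, and takes Shuffle-2 with two bracket types as the example. The table `UPDATES` and all function names are ours.

```python
from fractions import Fraction

# Hypothetical update table for Shuffle-2: one counter per bracket type.
UPDATES = {"(": (1, 0), ")": (-1, 0), "[": (0, 1), "]": (0, -1)}
NUM_COUNTERS = 2

def value_vector(symbol):
    """Value vector with +m at index 2j and -m at index 2j + 1 for counter j."""
    vec = []
    for m in UPDATES[symbol]:
        vec.extend([Fraction(m), Fraction(-m)])
    return vec

def attention_outputs(word):
    """Uniform causal attention: a_i is the mean of the first i value vectors."""
    running = [Fraction(0)] * (2 * NUM_COUNTERS)
    outputs = []
    for i, symbol in enumerate(word, start=1):
        running = [r + v for r, v in zip(running, value_vector(symbol))]
        outputs.append([r / i for r in running])
    return outputs

def relu(vec):
    return [max(x, Fraction(0)) for x in vec]

outs = attention_outputs("([)]")
# Multiplying index 2j of a_i by i recovers the j-th counter value.
counters = [(i * a[0], i * a[2]) for i, a in enumerate(outs, start=1)]
```

After the ReLU, an odd index 2j + 1 is positive exactly when counter j has become negative, and the whole vector is zero exactly when all counters are zero, which is the information the zero check needs.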
We next show that n-ary Boolean Expressions can be recognized by Transformers with a similar construction.
Proof. Let L m denote a language of type n-ary Boolean Expressions with m operators defined over the alphabet Σ. Consider a single layer Transformer network with d model = 2. Let s 0 , s 1 , . . . , s n be sequence w where w ∈ Σ * . Let s 0 be a special start symbol with embedding f e = [+1, −1]. The embeddings of each input symbol s ∈ Σ are defined as follows, f e (s) = [+(r − 1), −(r − 1)] where r denotes the arity of the symbol. The arity of values such as 0 and 1 is taken as 0. Similar to the previous construction, the key values are null and hence attention weights are uniform, leading to

$$a_i = \frac{1}{i} \sum_{t=1}^{i} v_t.$$

Hence the output of the self-attention block will be

$$a_i = \left[ \frac{c_i}{i},\; -\frac{c_i}{i} \right],$$

where c i denotes the counter value, after the i-th step, of the automaton representing the language. Essentially, for each operator, the value added to the counter is equal to its arity minus 1. For each value such as 0 and 1, the counter value is decremented by 1. We then apply a simple FFN with ReLU activation to obtain the output vector z i = ReLU(Ia i ).
An input sequence w belongs to the language L m if the second coordinate of the output is zero at every step, that is, z i,2 = 0 for 0 ≤ i ≤ n and z n = 0.
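The counter that this construction tracks can be written out directly. The sketch below is illustrative only: the arity table is an assumption based on the Boolean-3 operators described in Appendix E, with ASCII `~` standing in for the unary operator. The acceptance test (counter strictly positive on every proper prefix and zero at the end) is the usual check for prefix-notation expressions and is our reading of how the counter is used, not a restatement of the ReLU-based condition above.

```python
# Assumed arities: values have arity 0; "~" is unary, "+" binary, ">" ternary.
ARITY = {"0": 0, "1": 0, "~": 1, "+": 2, ">": 3}

def counter_trace(word):
    """Counter value after each symbol; the start symbol contributes +1."""
    c = 1
    trace = []
    for s in word:
        c += ARITY[s] - 1  # operators add (arity - 1), values subtract 1
        trace.append(c)
    return trace

def is_expression(word):
    """True iff the counter first reaches zero exactly at the last symbol."""
    trace = counter_trace(word)
    return bool(trace) and trace[-1] == 0 and all(c > 0 for c in trace[:-1])
```

For instance, `counter_trace("+01")` is `[2, 1, 0]`, so `+01` is a complete expression, whereas `+0` leaves the counter at 1 and is only a proper prefix.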
Let Reset-Dyck-1 be a language defined over alphabet Σ = {[, ], 1}, where 1 denotes a symbol that requires a reset operation. Words in Reset-Dyck-1 have the form Σ * 1v, where the string v belongs to Dyck-1. So essentially, when the machine encounters the reset symbol 1, it has to ignore all the previous inputs, reset the counter to 0 and return to the start state.
Lemma C.3. A single-layer Transformer with only positional masking cannot recognize the language Reset-Dyck-1.
Proof. The proof is straightforward. Let s 1 , s 2 , . . . , s n be an input sequence w. Let s r denote the r-th symbol where the reset symbol occurs. It is easy to see that the scoring function ⟨q n , K(v r )⟩ is independent of the position as well as the inputs before the reset symbol which are relevant for the reset operation. Consider the case where the first half of the input contains a sequence of open and closing brackets such that it does not belong to Dyck-1 and the second half contains a sequence that belongs to Dyck-1. If the reset symbol occurs after the first half of the sequence, then the word belongs to Reset-Dyck-1, and if it occurs at the beginning then the word does not belong to Reset-Dyck-1. However, by construction, the output of the model z n will remain the same regardless of the position of the reset symbol and hence, by contradiction, it cannot recognize such a language.
The above limitation does not exist in a two-layer network. There, the scoring function as well as the value vector of the reset symbol can depend on the inputs that precede it, so the argument above does not rule out a two-layer network recognizing such a language. Indeed, as shown in the main paper, the 2-layer Transformer performs well on Reset-Dyck-1.
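The pair of inputs used in the proof can be made concrete with a reference recognizer. This is a sketch with hypothetical function names; it follows the stated form Σ * 1v literally, so a word without any reset symbol is rejected.

```python
def in_dyck1(v):
    """Membership in Dyck-1 over the brackets "[" and "]"."""
    depth = 0
    for ch in v:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        else:
            return False
        if depth < 0:
            return False
    return depth == 0

def in_reset_dyck1(w):
    """Membership in Reset-Dyck-1: only the part after the last reset symbol matters."""
    if "1" not in w:
        return False
    suffix = w.rsplit("1", 1)[1]
    return in_dyck1(suffix)

# Same symbols, reset symbol in two different positions.
bad_half, good_half = "]][", "[]"
reset_in_middle = bad_half + "1" + good_half   # "]][1[]"
reset_at_start = "1" + bad_half + good_half    # "1]][[]"
```

The two words contain the same symbols and differ only in where the reset symbol sits, yet only the first is in the language; this is the distinction that the single-layer model with positional masking cannot make according to Lemma C.3.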
Lemma C.4. Transformers with only positional masking cannot recognize the language (aa) * .
Proof. Let s 1 , s 2 , . . . , s n be an input sequence w where w ∈ a * . Since it is a unary language, the input at each step will be the same symbol and hence the embedding as well as query, key and value vectors will be the same. Since all the value vectors are the same, regardless of the attention weights, the output of the self-attention vector a i will be a constant vector at each timestep. This implies that the output vectors z 1 = z 2 = . . . = z n . Inductively, it is easy to see that regardless of the number of layers this phenomenon will carry forward and hence the output vector at each timestep will be the same. Thus, the network cannot distinguish output at even steps and odd steps which is necessary to recognize the language (aa) * .
For parity, in the case where the input consists of only 1s, the problem reduces to recognizing (11) * . Hence it follows from the above result that a network without positional encoding cannot recognize parity even for minimal lengths.
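The core step of Lemma C.4 is that a weighted average of identical vectors is that same vector, whatever the weights. A minimal numerical sketch with scalar values follows; the function name and the arbitrary score matrix are ours and only serve the illustration.

```python
import math

def causal_attention(values, scores):
    """Softmax attention with positional masking: step i attends to steps 0..i."""
    outputs = []
    for i in range(len(values)):
        weights = [math.exp(scores[i][t]) for t in range(i + 1)]
        total = sum(weights)
        outputs.append(sum(w / total * values[t] for t, w in enumerate(weights)))
    return outputs

n = 6
values = [0.7] * n  # a unary input gives the same value vector at every step
# Arbitrary, deliberately non-uniform attention scores.
scores = [[math.sin(3 * i + t) for t in range(n)] for i in range(n)]
outputs = causal_attention(values, scores)
```

Every entry of `outputs` equals the shared value up to floating-point error, so without positional information the outputs at even and odd steps cannot differ.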

D.1 Discussion on Character Prediction Task
As described in section 5.1, we use character prediction task in our experiments to evaluate the model's ability to recognize a language. In character prediction task the model is only presented with positive samples from a given language and its goal is to predict the next set of valid characters. During inference, the model predicts the next set of legal characters at each step and a prediction is considered to be correct if and only if the model's output at every step is correct. The character prediction task is similar to predicting which of the input characters are allowed to make a transition in a given automaton such that it leads to a non-dead state. If an input character is not among the legal characters, that implies the underlying automaton will transition to a dead state and regardless of the following characters, the input word will never be accepted. When the end-of-sequence symbol is allowed as one of the next set of legal characters, it implies that the underlying automaton is in the final state and the input can be accepted.
Character prediction and classification. If a model can perform the character prediction task perfectly, then it can also perform classification in the following way. For an input sequence s 1 , s 2 , . . . , s n , the model receives the sequence s 1 , . . . , s i for 1 ≤ i ≤ n at each step i and predicts the set of valid characters in the (i + 1)-th position. If the next character is among the model's predicted set of valid characters at each step i and the end-of-sequence symbol is allowed at the n-th step, then the word is accepted; if any character is not within the model's predicted set of valid characters, then the word is rejected. One of the primary reasons for the choice of the character prediction task is that it is arguably more robust than the standard classification task. The metric for the character prediction task is relatively stringent and the model is required to model the underlying mechanism as opposed to just one label in standard classification. Note that the null accuracy (accuracy when all the predictions are replaced by a single label) of classification is 50% if the distribution of labels is balanced (higher otherwise), whereas the null accuracy of the character prediction task is close to 0. Additionally, in the case of classification, depending on how the positive or negative data are generated, the model may also be biased to predict based on some statistical regularities instead of modeling the actual mechanism. Weiss et al. (2019) find that LSTMs trained to recognize Dyck-1 via classification on randomly sampled data do not learn the correct mechanism and fail on adversarially generated samples. On the other hand, Suzgun et al. (2019a) show that LSTMs trained to recognize Dyck-1 via the character prediction task learn to perform the correct mechanism required to do the task.
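The reduction from character prediction to classification can be sketched for Dyck-1 as follows. Here an exact oracle stands in for a perfectly trained model; the function names and the end-of-sequence marker `$` are hypothetical choices for this illustration.

```python
def legal_next(prefix):
    """Characters that may follow a valid Dyck-1 prefix; "$" marks end of sequence."""
    depth = prefix.count("(") - prefix.count(")")
    legal = {"("}
    if depth > 0:
        legal.add(")")   # something is open, so it may be closed
    else:
        legal.add("$")   # balanced so far, so the word may end here
    return legal

def classify(word, predictor=legal_next):
    """Accept iff every character is predicted legal and the end symbol is allowed last."""
    for i, ch in enumerate(word):
        if ch not in predictor(word[:i]):
            return False
    return "$" in predictor(word)
```

Because the loop stops at the first illegal character, the oracle is only ever queried on valid prefixes, which mirrors the observation that an illegal character sends the underlying automaton to a dead state.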
Character prediction and language modelling. The character prediction task has clear connections with language modelling. If a model can perform language modelling perfectly, then it can perform the character prediction task in the following way. For an input sequence s 1 , s 2 , . . . , s n , the model receives the sequence s 1 , . . . , s i , for 1 ≤ i ≤ n at each step i and predicts a distribution over the vocabulary. Mapping all the characters to which the model assigns a nonzero probability to 1 and all characters that are assigned zero probability to 0 reduces it to the character prediction task. However, there are a few issues with using language modelling in our formal language setting. Firstly, as mentioned in (Suzgun et al., 2019a), the task of recognizing a language is not inherently probabilistic. Our goal here is to understand whether a network can or cannot model a particular language. Using language modelling would require us to impose a distribution arbitrarily for the given setting. More importantly, in the character prediction task, some signals are explicitly provided. In the case of language modelling, we may just have to rely on the model to pick up those nuanced signals. For instance, in the language D n , when the input reaches the maximum depth n, the target in the character prediction task explicitly indicates that a is not allowed anymore, whereas in language modelling the model is expected to assign zero probability to a at the maximum depth based on the fact that it never sees a word of depth more than n in the training data. This has major issues. For instance, when we consider Dyck-1 in a practical setting, we can only provide the model with limited data, which implies that the data will have some finite maximum depth. In this scenario, a language model trained on such data may learn either the Dyck-1 language or the language D n with that particular maximum depth.
This limitation does not exist in the character prediction task where the signal is explicitly provided during training.
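The mapping from a language-model distribution to character-prediction targets is a simple thresholding step. In the sketch below the function name and the probability values are made up for illustration; they are not outputs of any trained model.

```python
def to_character_prediction(distribution):
    """Map each character to 1 if it has nonzero probability and to 0 otherwise."""
    return {ch: int(p > 0.0) for ch, p in distribution.items()}

# Hypothetical next-character distributions at the maximum depth of D_n.
exact_model = {"a": 0.0, "b": 1.0, "$": 0.0}
smoothed_model = {"a": 0.01, "b": 0.98, "$": 0.01}

exact_targets = to_character_prediction(exact_model)
smoothed_targets = to_character_prediction(smoothed_model)
```

Only a model that assigns exactly zero probability to `a` yields the intended targets; any residual probability mass marks `a` as legal, which illustrates why the explicit signal of the character prediction task is easier to evaluate.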

D.2 Experimental Details
We use 4 NVIDIA Tesla P100 GPUs each with 16 GB memory to run our experiments, and train and evaluate our models on about 9 counter languages and 18 regular languages. The important details of all of these languages, such as the training and test sizes and the lengths of the strings considered, are summarized in Table 6. In all of our experiments, the first bin always has the same length range as the training set, i.e. if the training set contains strings with lengths in range [2, 50], then the strings in the first test bin will also lie in the same range. The width of a bin is the difference between the upper and lower limits of the string lengths that lie in that bin. For each of these languages, we extensively tune a range of architectural and optimization related hyperparameters. Table 7 lists the hyperparameters considered in our experiments and the bounds for each of them. This corresponds to about 162 different configurations for tuning Transformers (for a hidden size of 3, 4 heads are not allowed) and 40 configurations for LSTMs. Over all the languages and hyperparameters there were a minimum of 117 parameters and a maximum of 17,888 parameters for the models that we considered. We use a grid search procedure to tune the hyperparameters. While reporting the accuracy scores for a given language, we compute the mean of the top 5 accuracies over all hyperparameter configurations. For some experiments we had to consider hyperparameters lying outside of the values specified in Table 7. For instance, we considered 4-layer Transformers in the cases where the training accuracies obtained were low for single and two layered networks and reported the results accordingly.
For training our models we used the RMSProp optimizer with the smoothing constant α = 0.99. In our initial few experiments we also tried Stochastic Gradient Descent with learning rate decay and the Adam optimizer, but decided to go ahead with RMSProp as it outperformed SGD in the majority of experiments and gave similar performance to Adam but needed fewer hyperparameters. For each language, we train models for 100 epochs with a batch size of 32. In case of convergence, i.e. perfect accuracies for all the bins, before completion of all epochs, we stop the training process early. The results of our experiments on counter and regular languages are provided in Tables 8 and 9 respectively.

Table 7: Different hyperparameters and the values considered for each of them. Note that certain parameters like Heads and Position Encoding Scheme are only relevant for Transformer based models and not for LSTMs. We considered up to 4-layer Transformers in the cases where the training accuracies obtained were low for single and two layered networks and reported the results accordingly.

E Plots
We visualize different aspects of the trained models to understand how they achieve a particular task and if the learned behaviour resembles our constructions. Figure 3 shows the value vectors corresponding to the trained models on Shuffle-2 and Boolean-3 Language. We also visualize the attention weights corresponding to these two models in Figure 4. Similar to the self-attention output visualizations for Shuffle-2 and Boolean-3 in the main paper, we visualize these values for a model trained on Shuffle-4 in Figure 5 and again, find close correlations with the depth to length ratios of different types of brackets in the language. Finally, in Figure 6, we visualize a component of the learned position embeddings vectors and found a similar behaviour to cos(nπ) agreeing with our hypothesis.

Figure 3 (caption): The Shuffle-2 model had a hidden size of 8 and the Boolean-3 model had a hidden size of 3. The x-axis corresponds to different components of the value vectors for both models. The Shuffle-2 language consisted of square and round brackets, while for Boolean-3 we considered 3 operators, namely ∼, a unary operator, +, a binary operator, and finally >, which is a ternary operator.

[Figure heatmap residue: cell values (approximately 0.11 to 0.14) omitted.]