Closing Brackets with Recurrent Neural Networks

Many natural and formal languages contain words or symbols that require a matching counterpart to make an expression well-formed. The combination of opening and closing brackets is a typical example of such a construction. Because such constructions are common, the ability to follow these rules is important for language modeling. Currently, recurrent neural networks (RNNs) are extensively used for this task. We investigate whether they are capable of learning the rules of opening and closing brackets by applying them to synthetic Dyck languages that consist of different types of brackets. We provide an analysis of the statistical properties of these languages as a baseline and show the strengths and limits of Elman-RNNs, GRUs and LSTMs in experiments on random samples of these languages. In terms of perplexity and prediction accuracy, the RNNs get close to the theoretical baseline in most cases.


Introduction
Brackets are a challenge for language models. They regularly appear in texts, they typically produce long-range dependencies, and a failure to properly close them is readily recognized by a human evaluator as a severe error (Shen et al., 2017). Beyond the syntactic level, many natural languages exhibit bracket-like structures. For example, the German language is infamous for its convoluted sentences with verb-particle constructions, in which words from the beginning have to be properly closed at the end (Dewell, 2011; Müller et al., 2015).
In this paper we present a dedicated study of the capability of Elman-RNNs, GRUs and LSTMs to model expressions with brackets and properly close them.* Towards this end, we conduct experiments on Dyck languages, which consist of balanced bracket expressions.

* Both authors contributed equally.

Related Work
Synthetic datasets and formal languages have long been used for checking the ability of RNNs to capture a particular feature; see, for example, the early investigations by Elman (1990), Das et al. (1992), and Gers and Schmidhuber (2001). Recent work in this direction was done by Weiss et al. (2017, 2018). More specifically, the interplay of RNNs with certain grammatical constructs, brackets and Dyck languages has been the subject of several studies. Karpathy et al. (2016) show that RNNs are capable of capturing bracket structures on real-world datasets. Linzen et al. (2016) study the application of LSTMs to certain grammatical phenomena. RNNs and their variants have been used for recognizing Dyck words (Kalinke and Lehmann, 1998; Deleu and Dureau, 2016). Li et al. (2017) evaluate their nonlinear weighted finite automata model on a Dyck language. Most recently, Bernardy (2018) conducted a study very similar to ours on Dyck languages, with a slightly different focus.

Contributions
In this work, we sample Dyck words in such a way that we can give theoretical lower bounds for the perplexity of a respective language model. This way, we can compare the performance of RNNs with a theoretical baseline and not just with the performance of other architectures.

Dyck Languages
We use artificially generated data in order to have a completely controlled environment for the experiments. In particular, the training and test datasets consist of balanced sequences of n different types of brackets, $(_1, )_1, (_2, )_2, \ldots$, where n depends on the specific experiment. The set of such sequences forms the so-called Dyck language $D_n$ (Duchon, 2000). Elements of $D_n$ are called Dyck words. Each $D_n$ is a context-free but not regular formal language (Chomsky, 1956) and can be described by the grammar

$$S \to (_i \; S \; )_i \; S \mid \varepsilon \qquad \text{for } i = 1, \ldots, n.$$

Some examples of such Dyck words are $(_1 )_1$, $(_1 (_1 )_1 )_1 (_1 )_1$, and $(_1 (_2 )_2 )_1$. It is well-known that there are

$$C_m := \frac{1}{m+1} \binom{2m}{m}$$

words of length 2m in $D_1$, where $C_m$ is the m-th Catalan number (Chung and Feller, 1949). As the type of each pair of brackets can be chosen independently, it follows that there are $n^m C_m$ words of length 2m in $D_n$. There are obviously no Dyck words with odd length.
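As a quick illustration of these counts, the following Python sketch (function names are ours, purely for illustration) enumerates all Dyck words of a given length over n bracket types by brute force and compares the result with the closed form $n^m C_m$:

```python
from math import comb

def count_dyck(n_types, length):
    """Count the Dyck words of the given length over n_types bracket
    pairs by exhaustive enumeration."""
    def extend(remaining, stack_depth):
        # stack_depth = number of currently unmatched open brackets
        if remaining == 0:
            return 1 if stack_depth == 0 else 0
        total = 0
        if remaining > stack_depth:          # room to open another bracket
            total += n_types * extend(remaining - 1, stack_depth + 1)
        if stack_depth > 0:                  # close the innermost bracket
            total += extend(remaining - 1, stack_depth - 1)
        return total
    return extend(length, 0)

def closed_form(n_types, length):
    """n^m * C_m words of length 2m; none of odd length."""
    if length % 2:
        return 0
    m = length // 2
    return n_types ** m * comb(2 * m, m) // (m + 1)
```

For instance, `count_dyck(2, 4)` and `closed_form(2, 4)` both give 8, i.e. the $n^2 C_2 = 4 \cdot 2$ words of length four over two bracket types.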

Generation of Dyck Words
Each "sentence" in the datasets is a randomly generated non-empty Dyck word. The first symbol of a word is always an open bracket. From there, the generation proceeds in a sequential manner: With probability p an open bracket is emitted. Otherwise, and thus with probability 1 − p, a matching closed bracket is emitted, or the generation terminates if all open brackets already have a matching partner. If not stated otherwise, we assume 0 < p < 1 in all calculations because the edge cases usually have to be treated differently but do not add significant value to our discussion. The type of bracket is chosen randomly from a uniform distribution; in future studies it might follow some other distribution.
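The procedure above can be sketched in a few lines of Python (a minimal illustration with our own naming; an opening bracket of type t is written as "(t" and its partner as ")t"):

```python
import random

def generate_dyck_word(p, n_types, rng=random):
    """Sample one non-empty Dyck word: the first symbol is always an
    open bracket; afterwards, open a new bracket with probability p,
    otherwise close the innermost open bracket, or terminate once all
    brackets are matched."""
    assert 0 < p < 1
    stack = [rng.randrange(n_types)]          # forced first open bracket
    word = [f"({stack[0]}"]
    while True:
        if rng.random() < p:                  # open a new bracket
            t = rng.randrange(n_types)
            stack.append(t)
            word.append(f"({t}")
        elif stack:                           # close the innermost bracket
            word.append(f"){stack.pop()}")
        else:                                 # everything balanced: stop
            return word
```

For example, `" ".join(generate_dyck_word(3/16, 2))` might yield a short word such as `(0 )0` or `(1 (0 )0 )1`.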

Statistical Properties of Dyck Words
We quickly review some statistical properties of such sequences in order to explain choices in the setups and to get a baseline for the experiments. It can readily be seen that the sequences of length 2m all have the same probability:

$$P_m = p^{m-1} (1-p)^{m+1} n^{-m}. \qquad (2)$$

The asymmetry in the exponents is due to leaving out empty sequences: the first opening bracket is emitted with probability one, while the terminating step contributes an extra factor of 1 − p. The factor $n^{-m}$ accounts for the equally probable choices of bracket types. We can check consistency by considering the normalization condition:

$$\sum_{m=1}^{\infty} n^m C_m P_m = \begin{cases} 1 & \text{for } p \le \tfrac{1}{2}, \\ \left(\tfrac{1-p}{p}\right)^2 & \text{for } p > \tfrac{1}{2}. \end{cases} \qquad (3)$$

While the result for p ≤ 1/2 is expected, the case p > 1/2 might appear curious at first. The reason for this behavior is that the sum only takes finite sequences into account, while there is a nonzero probability of getting infinite sequences for p > 1/2. This is easily seen for the case p = 1, where brackets are never closed, so that the overall probability of obtaining a finite sequence is indeed zero.
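The normalization condition can be checked numerically. In the partial sums below, the $n^m$ type factors cancel against the $n^{-m}$ in the word probability, and successive terms are related through the Catalan ratio $C_{m+1}/C_m = 2(2m+1)/(m+2)$. This is our own sketch; the value $((1-p)/p)^2$ asserted for the p > 1/2 branch is an assumption consistent with the Catalan generating function:

```python
from math import isclose

def total_probability(p, max_m=5000):
    """Partial sum of Sum_{m>=1} (n^m C_m) * p^(m-1) (1-p)^(m+1) n^(-m):
    the n factors cancel, leaving Sum_{m>=1} C_m p^(m-1) (1-p)^(m+1)."""
    total = 0.0
    term = (1.0 - p) ** 2                  # m = 1 term, since C_1 = 1
    for m in range(1, max_m):
        total += term
        # advance via the Catalan ratio C_{m+1} / C_m = 2 (2m+1) / (m+2)
        term *= 2.0 * (2 * m + 1) / (m + 2) * p * (1.0 - p)
    return total

# probabilities of all finite words sum to one below the critical point ...
assert isclose(total_probability(3 / 16), 1.0, rel_tol=1e-9)
# ... but probability mass is lost to infinite sequences above it
assert isclose(total_probability(3 / 4), 1.0 / 9.0, rel_tol=1e-9)
```

The series converges quickly because its terms decay like $(4p(1-p))^m$, which is strictly below one for every $p \ne 1/2$.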

Average Length
This naturally leads to the question of what the average length $\bar{L}$ of the sequences is, depending on p. For p < 1/2, we find

$$\bar{L}(p) = \frac{2}{1-2p}. \qquad (4)$$

The graph of this function can be seen in Fig. 1. In line with our previous findings, problems with infinite sequences arise for p ≥ 1/2, as the expression (4) diverges for p → 1/2. For these reasons, we only consider the case p < 1/2 in the experiments.
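The average length can be cross-checked by simulation. The sketch below (our own code) samples word lengths directly from the generation process and compares the empirical mean with the closed form $2/(1-2p)$, which we take to be the expression in (4); it reproduces the values used later in the experiments, e.g. an average length of 16 for p = 7/16 and 3.2 for p = 3/16:

```python
import random

def sample_length(p, rng):
    """Length of one generated word: a forced opening bracket, then open
    with probability p, otherwise close the innermost bracket, or stop
    once everything is balanced."""
    length, height = 1, 1
    while True:
        if rng.random() < p:
            length, height = length + 1, height + 1
        elif height > 0:
            length, height = length + 1, height - 1
        else:
            return length

def mean_length(p, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the average word length."""
    rng = random.Random(seed)
    return sum(sample_length(p, rng) for _ in range(n_samples)) / n_samples
```

In our runs, `mean_length(7 / 16)` comes out near 16 and `mean_length(3 / 16)` near 3.2, matching the closed form.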

Baseline for Perplexity
A prediction system for the next symbol emitted by the generator will not be able to perform arbitrarily well due to the random nature of the process. In order to get a baseline for the performance, we consider the perplexity per symbol PP of the probability distribution of the generated Dyck languages. For a sequence of symbols w with length |w| it is defined as

$$\mathrm{PP}(w) = \exp\left(-\frac{1}{|w|} \sum_{i=1}^{|w|} \log P(w_i \mid w_1, \ldots, w_{i-1})\right), \qquad (5)$$

where $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability of the i-th symbol under the model, given the previous i − 1 symbols. Eq. (5) corresponds to the way in which perplexity is calculated by the software that we use for our experiments (cf. Sec. 3). We estimate the baseline for the perplexity $\mathrm{PP}_n$ for the language $D_n$ by considering (5) under the true probability model in the limit of an infinite number of samples from the corresponding probability distribution. Under these conditions and for our case, (5) can be transformed into:

$$\mathrm{PP}_n = \lim_{L \to \infty} \exp\left(-\frac{\sum_{\ell=1}^{L} N_\ell \log P_\ell}{\sum_{\ell=1}^{L} N_\ell \, (2\ell+1)}\right). \qquad (6)$$

The numerator contains the sum of the log-probabilities, where $P_\ell$ is the probability of a Dyck word with $2\ell$ brackets. The denominator represents the total number of symbols. Adding one to the length in the term $(2\ell+1)$ accounts for the end-of-sentence symbols that are counted by the software. The limit L → ∞ for the maximum length of a word is taken at the end because the normalization by the number of symbols has to be carried out for a finite value. Finally, N represents the number of samples and $N_\ell$ stands for the number of words with $2\ell$ brackets in the dataset, so that $N_\ell / N$ converges to the probability of generating a word of this length.

[Figure 2: The perplexity baseline (7). As a reference, the graph of 1 + 2p is given.]
All these quantities are known, so that we can obtain the following result:

$$\mathrm{PP}_n(p) = \left( n \, p^{-2p} \, (1-p)^{2p-2} \right)^{\frac{1}{3-2p}}. \qquad (7)$$

While the expression (7) with its singularity at zero does not readily reveal the characteristics of the perplexity, its graph (Fig. 2) shows that it is close to a simple affine function. In the edge cases it behaves just as expected: p = 0 means that there is no randomness at all, so there is just one possible next symbol. For p = 1/2, however, opening and closing brackets are equally likely, so that there are always two symbols to choose from without any way to tell which one to prefer. The dependence on n must be sublinear because for closing brackets the type is uniquely predictable.
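The baseline can be validated by scoring generated words under the true conditional distribution: every non-initial opening bracket has probability p/n, every closing bracket 1 − p, the forced first bracket 1/n, and the end-of-sentence symbol 1 − p. The sketch below is our own code; it assumes the closed form $\mathrm{PP}_n(p) = (n \, p^{-2p} (1-p)^{2p-2})^{1/(3-2p)}$ for (7), which indeed gives $\mathrm{PP}_1 = 1$ for p → 0 and $\mathrm{PP}_1 = 2$ for p = 1/2:

```python
import math
import random

def baseline_perplexity(p, n):
    """Assumed closed form of the per-symbol perplexity baseline (7)."""
    return (n * p ** (-2 * p) * (1 - p) ** (2 * p - 2)) ** (1 / (3 - 2 * p))

def empirical_perplexity(p, n, n_words, rng):
    """Monte Carlo estimate: generate words and accumulate the true
    per-symbol log-probabilities (end-of-sentence symbol included)."""
    log_prob, n_symbols = 0.0, 0
    for _ in range(n_words):
        log_prob += math.log(1 / n)          # forced first open bracket
        n_symbols += 1
        height = 1
        while True:
            if rng.random() < p:             # open: probability p / n
                log_prob += math.log(p / n)
                height += 1
            elif height > 0:                 # close: probability 1 - p
                log_prob += math.log(1 - p)
                height -= 1
            else:                            # end of sentence: 1 - p
                log_prob += math.log(1 - p)
                n_symbols += 1
                break
            n_symbols += 1
    return math.exp(-log_prob / n_symbols)
```

With a few ten thousand sampled words, the two values typically agree to within about one percent.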

Neural Network Architecture
We use three different RNN architectures for our experiments: Elman-RNN (abbreviated as SRNN for simple RNN), GRU (gated recurrent unit), and LSTM (long short-term memory).
For the experiments with SRNNs we use the RNNLM toolkit (version 0.3e) developed by Mikolov et al. (2011b). The SRNN has one hidden layer of arbitrary size $N_\mathrm{hidden}$ with a sigmoid activation function. At each time step the input vector is built by concatenating the vector of the current word and the output produced by the hidden layer during the previous time step. The next word is predicted by applying the softmax function to the last layer. The RNNLM toolkit offers the possibility to group words into classes (Mikolov et al., 2011a), but this feature is mainly interesting for boosting efficiency in the case of large vocabularies with a natural frequency distribution. After initializing the weights with random Gaussian noise, the SRNNs are trained with standard stochastic gradient descent using an initial learning rate α = 0.1, and the recurrent weights are trained by the truncated backpropagation through time algorithm (Rumelhart et al., 1985). We refer to the hyperparameter that specifies the number of time steps taken into account as $T_\mathrm{BPTT}$.
For the other, more elaborate architectures, we make use of TF-NNLM-TK 1 by Oualil et al. (2016), which provides implementations of LSTM and GRU networks. LSTMs and GRUs work similarly to SRNNs but contain specific units that store previous activations and track long-term dependencies in a more flexible and efficient way. In the case of LSTMs, the memory state is handled via input, forget and output gates. These gates determine how much of a cell state should be preserved or forgotten and how much should be passed on to the cells in the next layer of the network. Similarly, the GRU regulates its memory state using an update and a reset gate, which allow it to discard the previous cell state and to decide how much of the current activation should be added to the cell state. For the experiments with LSTMs and GRUs the hyperparameter settings are chosen to be similar to the ones used with the RNNLM toolkit. Weights are again initialized using random Gaussian noise and standard stochastic gradient descent is utilized. The initial learning rate is set to α = 0.1 and in later training epochs decayed using a factor of γ = 0.9. The models are trained for 100 epochs using a batch size of 128 and $T_\mathrm{BPTT} = 16$ (if not stated otherwise), with learning rate decay starting at epoch 80.
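To make the gating mechanism concrete, here is a minimal scalar GRU step (an illustrative Python sketch, not the TF-NNLM-TK implementation; vectors, biases and embeddings are omitted, the weight names are ours, and we use the common convention h' = (1 − z)·h + z·h̃):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU time step for scalar input x and scalar state h.
    w is a dict of six scalar weights (a hypothetical toy parametrization).
    The reset gate r masks the old state when building the candidate;
    the update gate z blends old state and candidate."""
    z = sigmoid(w["zx"] * x + w["zh"] * h)               # update gate
    r = sigmoid(w["rx"] * x + w["rh"] * h)               # reset gate
    h_cand = math.tanh(w["cx"] * x + w["ch"] * (r * h))  # candidate state
    return (1.0 - z) * h + z * h_cand
```

With all weights at zero, both gates output 1/2 and the candidate state is zero, so the state simply decays towards zero: `gru_step(1.0, 0.4, zeros)` gives 0.2.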

Setup and Perplexities
We conduct a number of experiments to investigate the overall performance and the influence of the hyperparameters on the perplexity. For all experiments we use datasets that were artificially generated in the previously described way (cf. Sec. 2.1). All training sets contain 131,072 Dyck words, while the test sets contain 10,000 Dyck words that were sampled from the same distribution. In all experiments, the value of p is varied between 1/16 and 7/16 in steps of 1/16. The rationale behind this choice is that 7/16 yields an average sequence length of 16, which is roughly a typical sentence length for natural languages (Sichel, 1974; Sigurd et al., 2004). The smaller values of p are considered for comparison.

[Table 1: Average test perplexities for the different architectures (see Fig. 4 for a graphical representation of the SRNN values). The standard deviation is roughly 0.001 and slightly larger for the SRNN.]
In a first set of experiments, we consider $D_1$ and vary the number of hidden units between 1 and 512, doubling the hidden layer size in each iteration. Having more than 512 units does not bring much perplexity improvement but slows down the training process considerably. Typical results for the Elman-RNN can be found in Fig. 3. For all values of p, the test-time perplexity reaches the baseline. The convergence is slower for larger values of p, which is the expected behavior. For larger values of $N_\mathrm{hidden}$ there are some increases of $\mathrm{PP}_1$ that are most probably connected to the specific software implementation. Despite that, the models are surprisingly good at recovering the baseline. All in all, we conclude that $N_\mathrm{hidden} = 64$ is a good compromise between optimization for perplexity and speed.
In a second set of experiments, we change $T_\mathrm{BPTT}$ from 0 to 16, increasing it by one in each iteration. We limit $T_\mathrm{BPTT}$ to 16 as this is the maximum expected length of a sentence in our setting. This time, we do not only vary p, but also the number of bracket types n. Typical results can be seen in Fig. 4. It is striking that $T_\mathrm{BPTT}$ has hardly any influence on the performance as long as it is larger than zero. This can be exploited for making the comparison easier: The average value of the test perplexities for the different architectures is given in Tab. 1.

[Figure 4: Test perplexities for different values of n; the baselines (7) are plotted as solid lines.]

Increasing n makes the task more challenging. While all curves resp. values are close to the baseline in Fig. 4a and Tab. 1a, the gap increases with n in Fig. 4b and Tab. 1b. Given that the average length of Dyck words for p = 3/16 is only 3.2, compared to a length of 8 for p = 6/16, the differences in performance are not surprising. While the Elman-RNN performs similarly to or even slightly better than the other architectures on the easier tasks, the more elaborate methods increasingly outperform it as the task difficulty grows.

Accuracy
Based on the results of the previous section, the RNNs appear to perform quite well in terms of perplexity. In order to get a better feeling for the capability of the networks, we consider a second task: Given a Dyck word without the last closing bracket, the RNN has to predict the most likely candidate for this missing symbol. The success is measured in terms of accuracy, which is the number of successfully finished tasks divided by the total number of tasks. The respective values, based on a dataset of 10,000 Dyck words, are given in Tab. 2. Except for one case, the RNNs reach an accuracy close to one. Only one experiment is done per configuration and even harder tasks appear to be solvable, so the lower value is probably just an outlier. GRU and in particular LSTM perform almost perfectly in this task. Some additional statistics are given in the table. The average word length $\bar{L}$ indeed follows (4). Besides that, a new quantity is introduced here: The average task length $\bar{\ell}$ measures how far the algorithm has to look back on average in order to find the open bracket it has to close. The difference between length and task length is best illustrated with an example. For the Dyck word ( ) ( ) ( ( ( ) ) ), the length is ten but the task length is six, because the first four brackets are irrelevant for determining the last one. The task length is the relevant measure for the hardness of the task, because small values of $\bar{\ell}$ would mean that there are hardly any long-range dependencies. For p = 6/16, $\bar{\ell}$ lies around five, so we would expect to need at least a five-gram model or something equivalent to achieve good results in this task.
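For single-type brackets the task length can be read off with a stack (a small sketch with our own naming; it assumes the input is a complete Dyck word written with '(' and ')'):

```python
def task_length(word):
    """Distance from the opening bracket matched by the final closing
    bracket to the end of the word (both endpoints included)."""
    stack, opener = [], None
    for i, sym in enumerate(word):
        if sym == "(":
            stack.append(i)
        else:
            # after the loop, `opener` is the match of the last ')'
            opener = stack.pop()
    return len(word) - opener
```

For the length-ten example discussed in the text, `task_length("()()((()))")` returns 6: the final bracket closes the opening bracket at position five, so only the last six symbols matter.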
The full frequency distribution of length and task length can be seen in Fig. 6. By far the largest part of the probability mass is concentrated on small values, so the really long words do not play a big role in the statistics. This naturally raises the question of how the RNNs perform on those. Fig. 5 reveals that the performance indeed depends on the length of the sequence and on the task length, respectively, and that there are huge differences between the architectures. Only the bigger picture can be compared because the test sets differ between the architectures. While the Elman-RNN reaches perfect accuracy for lengths of up to eight symbols, the GRU copes very well with lengths of up to 20 symbols. After that, the performance of these networks breaks down. Due to the low number of samples, the curve is very noisy for intermediate values, so it is hard to draw conclusions for this region. There is not a single correct guess by the Elman-RNN for task lengths beyond 90. The LSTM once again performs best in this task and exhibits an almost perfect accuracy over the whole spectrum of lengths.
Finally, the kinds of errors that are made are of interest. A good representation of them is the confusion matrix given in Fig. 7. For our task, the true bracket is always a closing one. Interestingly, the SRNN appears to "understand" that and hardly ever chooses an opening one. Apart from that, Fig. 7 reveals that the SRNN does not consider the different types of brackets as equally likely; otherwise the probability mass would be distributed more evenly.

Conclusion and Outlook
We evaluated the capability of different RNNs to model an artificial language that consists of convoluted bracket expressions. In terms of perplexity, the models easily get close to the theoretical baseline in most cases. For the task of predicting the last bracket of a sequence, the Elman-RNN mostly reaches accuracies between 0.96 and 1 and hardly ever chooses an opening bracket, while GRU and LSTM score almost perfectly. Based on such good results, our plans for future work are to make the task harder by extending the artificial language. This would help to better carve out the weaknesses of particular architectures. In this context, an important point would be some kind of control over the long-range dependencies.