How much complexity does an RNN architecture need to learn syntax-sensitive dependencies?

Long short-term memory (LSTM) networks and their variants are capable of encapsulating long-range dependencies, which is evident from their performance on a variety of linguistic tasks. On the other hand, simple recurrent networks (SRNs), which appear more biologically grounded in terms of synaptic connections, have generally been less successful at capturing long-range dependencies as well as the loci of grammatical errors in an unsupervised setting. In this paper, we seek to develop models that bridge the gap between biological plausibility and linguistic competence. We propose a new architecture, the Decay RNN, which incorporates the decaying nature of neuronal activations and models the excitatory and inhibitory connections in a population of neurons. Besides its biological inspiration, our model also shows competitive performance relative to LSTMs on subject-verb agreement, sentence grammaticality, and language modeling tasks. These results provide some pointers towards probing the nature of the inductive biases required for RNN architectures to model linguistic phenomena successfully.


Introduction
For the last couple of decades, neural networks have been approached primarily from an engineering perspective, with efficiency as the key motivation, consequently moving further away from biological plausibility. Recent developments (Song et al., 2016; Gao and Ganguli, 2015; Sussillo and Barak, 2013) have, however, incorporated explicit constraints in neural networks to model specific parts of the brain and have found a correlation between the learned activation maps and actual neural activity recordings. Thus, these trained networks can perhaps act as a proxy for a theoretical investigation into biological circuits.
Recurrent Neural Networks (RNNs) have been used to analyze the principles and dynamics of neural population responses by performing the same tasks as animals (Mante et al., 2013). However, these networks violate Dale's law (Dale, 1935; Strata and Harvey, 1999), which states that a neuron in the mammalian brain has either a purely excitatory or a purely inhibitory effect on other neurons.
The decaying nature of the potential in the neuron membrane after receiving signals (excitatory or inhibitory) from surrounding neurons is also well studied (Gluss, 1967). The goal of our work is to incorporate these biological features into the RNN structure, giving rise to a neuro-inspired and computationally inexpensive recurrent network for language modeling, which we call a Decay RNN (Section 4). We perform learning using the backpropagation algorithm; despite its differences from the way learning is believed to happen in the brain, it has been argued that the brain can implement its core principles (Hinton, 2007; Lillicrap et al., 2020). We assess our model's ability to capture syntax-sensitive dependencies via multiple linguistic tasks (Section 6): number prediction, grammaticality judgment (Linzen et al., 2016), which entails subject-verb agreement, and a more complex language modeling task (Marvin and Linzen, 2018).
Subject-verb agreement, where the main noun and the associated verb must agree in number, is considered as evidence of hierarchical structure in English. This is exemplified using a sentence taken from the dataset made available by Linzen et al. (2016):
1. *All trips on the expressway requires a toll.
2. All trips on the expressway require a toll.
The effect of agreement attractors (nouns having number opposite to the main noun; 'expressway' in example 1 above, where the main noun and verb are highlighted in bold, intervening nouns are underlined, and asterisks mark unacceptable sentences) between the main noun and main verb of a sentence has been well studied (Linzen et al., 2016; Kuncoro et al., 2018). Our work also highlights the influence of non-attractor intervening nouns. For example:
• A chair created by a hobbyist as a gift to someone is not a commodity. (Sentence taken from the dataset made available by Linzen et al. (2016).)
In the number prediction task, if a model correctly predicts the grammatical number of the verb (singular in the case of 'is'), it might be due to the (helpful) interference of non-attractor intervening nouns ('hobbyist', 'gift', 'someone') rather than necessarily capturing its dependence on the main noun ('chair').
From our investigation in Section 6.2, we find that linear recurrent models take cues present in the vicinity of the main verb to predict its number, apart from agreement with the main noun.
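For illustration, the attractor/non-attractor split can be computed from the grammatical numbers alone. This is a sketch with our own naming and representation, not taken from the paper's code:

```python
def split_interveners(main_number, intervener_numbers):
    """Partition intervening nouns into attractors (number opposite to the
    main noun) and non-attractors (same number as the main noun)."""
    attractors = [n for n in intervener_numbers if n != main_number]
    non_attractors = [n for n in intervener_numbers if n == main_number]
    return attractors, non_attractors
```

For the 'chair' sentence above, all three intervening nouns are singular like the subject, so there are no attractors; in the 'trips ... expressway' example, the single intervener is an attractor.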
In the subsequent sections, we investigate the performance of the Decay RNN and other recurrent networks, showing that no single sequential model generalizes well across all (grammatical) phenomena, which include subject-verb agreement, reflexive anaphora, and negative polarity items as described in Marvin and Linzen (2018). Our major outcomes are:
1. Designing a relatively simple and bio-inspired recurrent model, the Decay RNN, which performs on par with LSTMs on linguistic tasks such as subject-verb agreement and grammaticality judgment.
2. Pointing to some limitations of analyzing the intervening attractor nouns alone for the subject-verb agreement task and attempting joint analysis of non-attractor intervening nouns and attractor nouns in the sentence.
3. Showing that there is no linear recurrent scheme that generalizes well on a variety of sentence types, and motivating research into better understanding the nature of the biases induced by varied RNN structures.

Related Work
There has been prior work on using LSTMs (Hochreiter and Schmidhuber, 1997) for language modeling tasks. Gers and Schmidhuber (2001) showed that LSTMs can learn simple context-free and context-sensitive languages. However, the investigations carried out by Kuncoro et al. (2018) observed that if model capacity is insufficient, LSTMs may not generalize long-range dependencies. Recently, many architectures have explicitly incorporated knowledge of phrase-structure trees (Kuncoro et al., 2018; Alvarez-Melis and Jaakkola, 2017; Tai et al., 2015), which has improved generalization over long-range dependencies. At the same time, Shen et al. (2019) proposed ON-LSTMs, a modification of LSTMs that provides an inductive tree bias to the structure. However, Dyer et al. (2019) have shown that the success of ON-LSTMs was due to their proposed evaluation metric, not necessarily their architecture.
From the biological point of view, Capano et al. (2015) used a hard reset of the membrane potential, in contrast to the soft decay observed in a neuronal membrane. At the same time, their learning paradigm is similar to the Hebbian learning scheme (Hebb, 1949), which does not involve error backpropagation (Rumelhart et al., 1986). Our work is closely related to the idea of modeling a population of neurons as a dynamical system (EIRNN), proposed by Song et al. (2016). However, their time constant parameter was based on the concepts described in Wang (2002), while the sampling rate was arbitrarily chosen. Given that the chosen values only considered a certain class of neurons (Yang et al., 2019), we believe that it is not necessary to have the same parameter values for each cognitive task. Thus, we build on their formulation by making the sampling rate and time constant learnable, as manifested by our decay parameter, described in the next section.

Capano et al. (2015) also showed that a balance between the inhibition and excitability (synaptic strength) of a network maximizes the overall learning. This balance is governed by the ratio of inhibitory to excitatory neurons. They further showed that this balance also maximizes overall performance in multitask learning. Catsigeras (2013) mathematically proves that Dale's principle is necessary for an optimal neuronal network's dynamics.
In the postsynaptic neuron, the integration of synaptic potentials is realized by the addition of excitatory (+ve) and inhibitory (-ve) postsynaptic potentials (PSPs). PSPs are electrotonic voltages that decay as a function of time due to the spontaneous reclosure of synaptic channels. The decay of a PSP is controlled by the membrane time constant τ, i.e., the time required by the PSP to decay to 37% of its peak value (Wallisch et al., 2009).
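The stated decay can be written as a single exponential; a small illustration (the 37% figure is just 1/e at t = τ):

```python
import math

def psp(t, v0, tau):
    """Postsynaptic potential decaying from peak v0 with membrane time
    constant tau: V(t) = v0 * exp(-t / tau)."""
    return v0 * math.exp(-t / tau)
```

At t = τ the potential has fallen to exp(-1) ≈ 37% of its peak, matching the definition above.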

Decay RNN
Here we present our proposed architecture, which we call the Decay RNN (DRNN). Our architecture aims to model the decaying nature of the voltage in a neuron membrane after receiving impulses from the surrounding neurons. At the same time, we incorporate Dale's principle in our architecture. Thus, our model captures both the microscopic and macroscopic properties of a group of neurons. Adhering to the stated phenomena, we define our model with the following update equations for a given input x^(t) at time t:

c^(t) = f(|W| W_dale h^(t-1) + U x^(t) + b)
h^(t) = α h^(t-1) + (1 - α) c^(t)

Here f is a nonlinear activation function, W and U are weight matrices, b is the bias, and h^(t) represents the hidden state (analogous to voltage). We define α ∈ (0,1) as a learnable parameter to incorporate a decay effect in the hidden state (analogous to the decay in the membrane potential); α acts as a balancing factor between the hidden state h^(t-1) and c^(t). W_dale is a diagonal matrix; based on empirical results on the mammalian brain (Hendry and Jones, 1981), we set the last 20% of its entries to -1, representing the inhibitory connections, and the rest to 1 (see Appendix A.3). Unlike Song et al. (2016), we keep self-connections in the network. Besides its biological inspiration, our model also has the following salient features.
First, the presence of α acts as a coupled gating mechanism on the flow of information (Figure 1), at the same time maintaining an exponential moving average of the hidden state. Thus, α values close to 1 correspond to memories of the distant past. It is worth mentioning that Oliva et al. (2017) have considered the exponential moving average in the context of RNNs. However, their approach manually selected a set of scaling parameters, whereas we have a systematic way of arriving at those values by making them learnable for the task at hand.
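Unrolling the update h^(t) = α h^(t-1) + (1 - α) c^(t) shows that the candidate activation from k steps back is weighted by (1 - α) α^k; a tiny illustration of this moving-average view (our own sketch, not the paper's code):

```python
def ema_weight(alpha, k):
    """Weight the unrolled hidden state places on the candidate activation
    c^(t-k) from k steps in the past: (1 - alpha) * alpha**k."""
    return (1.0 - alpha) * alpha ** k
```

The weights form a geometric series summing to 1 (ignoring the initial state), and larger α flattens the profile, preserving older inputs.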
Second, our model also has an intrinsic skip connection arising from its formulation. Yue et al. (2018) have shown that architectures with skip connections provide an alternate path for the flow of gradients during error backpropagation. At the same time, the presence of coupled gates slows down the vanishing of gradients (Bengio et al., 2013). Thus, despite its simple ungated structure, the features discussed above provide safeguards against vanishing gradients.
To examine the importance of Dale's principle in the learning process, we made a variant of our Decay RNN without Dale's principle, which we call the Slacked Decay RNN (SDRNN), with updates to c^(t) made as follows:

c^(t) = f(W h^(t-1) + U x^(t) + b)

To understand the role of the correlation between hidden states in the Decay RNN formulation, we also devised an ablated version of our architecture, which we refer to as the Ab-DRNN. With the following update equation, we remove the factor (W h^(t-1)) that gives rise to correlation between hidden states:

c^(t) = f(U x^(t) + b)
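The three update rules above can be sketched in one step function. This is a minimal NumPy illustration with our own naming; the paper's actual implementation and choice of f may differ (tanh is used here for concreteness):

```python
import numpy as np

def decay_rnn_step(x_t, h_prev, W, U, b, alpha, dale_signs=None, ablate=False):
    """One Decay RNN update:
      c^(t) = f(W_rec h^(t-1) + U x^(t) + b)
      h^(t) = alpha * h^(t-1) + (1 - alpha) * c^(t)
    dale_signs: +/-1 per presynaptic unit; with it, W_rec = |W| diag(signs)
    (DRNN); without it, W_rec = W (SDRNN). ablate=True drops the recurrent
    term entirely (Ab-DRNN)."""
    if ablate:
        rec = np.zeros_like(h_prev)               # Ab-DRNN: no W h^(t-1) term
    elif dale_signs is not None:
        rec = (np.abs(W) * dale_signs) @ h_prev   # DRNN: column signs fixed
    else:
        rec = W @ h_prev                          # SDRNN: unconstrained W
    c_t = np.tanh(rec + U @ x_t + b)
    h_t = alpha * h_prev + (1.0 - alpha) * c_t
    return h_t
```

With dale_signs set, the effective recurrent matrix |W| diag(signs) has columns of a single sign, so each unit is purely excitatory or purely inhibitory, as Dale's law requires.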

Datasets
For the number prediction (Section 6.1) and grammaticality judgment (Section 6.3) tasks, we used a corpus of 1.57 million sentences from Wikipedia (Linzen et al., 2016), of which 10% were used for training, 0.4% for validation, and the remainder for testing. For the language modeling task (Section 6.4), the model was trained on a 90-million-word subset of Wikipedia comprising 3 million training and 0.3 million validation sentences (Gulordava et al., 2018).
Despite having a large number of training points, these datasets have certain drawbacks, including the lack of a sufficient number of syntactically challenging examples, leading to poor generalization on sentences outside the training distribution. Therefore, we construct a generalization set as described in Marvin and Linzen (2018), generating sentences from templates that can be described by a non-recursive context-free grammar. The generalization set allows us to test on a much broader range of linguistic phenomena; we use it for the targeted syntactic evaluation of our trained models.

Experiments
Here we describe our experiments to assess the models' ability to capture syntax-sensitive dependencies. Details regarding the training settings are available in Appendix A.4.

Number Prediction Task
The number prediction task was proposed by Linzen et al. (2016). In this task, the model is required to predict the grammatical number of the verb when provided with a sentence up to, but not including, the verb.
1. The path to success is not straightforward.

2. The path to success
The model takes the second sentence as input and has to predict the number of the verb (here, singular). Table 1 shows the results on the number prediction task. All the models, including SRNs, performed well on this task, indicating that even vanilla RNNs can identify singular and plural words and can associate the main subject with the upcoming verb.
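Training pairs for this task can be formed by truncating each sentence just before its verb; a sketch (tokenization and the verb's position are assumed given, and the naming is ours):

```python
def make_number_prediction_example(tokens, verb_index, verb_number):
    """Return (input prefix up to, but excluding, the verb; number label)."""
    return tokens[:verb_index], verb_number
```

For the example above, the prefix 'The path to success' is paired with the label 'sg'.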

Joint Analysis of Intervening Nouns
So far in the literature, when looking at intervening material in agreement tasks, research has tended to focus on agreement attractors, the intervening nouns with the opposite number to the main noun (Kuncoro et al., 2018). However, we posit that the role of non-attractor intervening nouns may also be important in understanding a model's decisions. For long-range dependencies in agreement tasks, a model may be influenced by the presence of non-attractor intervening nouns instead of purely capturing the verb's relationship with the main subject. Hence, an analysis based solely on the number of agreement attractors may be misleading.
Table 2 shows an improvement in verb number prediction accuracy with an increasing number of non-attractor intervening nouns (n), even as the subject-verb distance (held constant at 7) and the attractor count (held at 1) are kept fixed. This indicates that the models are also using cues present in the vicinity of the main verb to predict its number, apart from agreement with the main noun.

Grammaticality Judgement
The previous objective was to predict the grammatical number of the verb given an input sentence only up to the verb. However, this way of training may give the model a cue to the syntactic clause boundaries. In this section, we describe the grammaticality judgment task: given an input sentence, the model has to predict whether or not it is grammatical. To perform well on this task, the model would presumably need to allocate more resources to determining the locus of ungrammaticality. For example, consider the following pair of sentences (taken from the dataset made available by Linzen et al. (2016)):
1. The roses in the vase by the door are red.
2. *The roses in the vase by the door is red.
The model has to decide, for input sentences such as these, whether each one is grammatically correct. Table 1 shows the performance of different recurrent architectures on this task. SRNs, which were comparable to LSTMs and GRUs on the prediction experiment described in Section 6.1, are no better than random on the grammaticality judgment task. On the other hand, the Ab-DRNN performs better than the SRN. This highlights the importance of a balance between the uncorrelated hidden states (h^(t)) and the connected hidden states (W h^(t)), which is modeled by the Decay RNN. Due to its architectural similarity to the Independent RNN (Li et al., 2018), which has independent connections among neurons in a layer, the Ab-DRNN does not suffer from the vanishing gradient problem.
Importance of the generalization set Capano et al. (2015) argued that the inclusion of Dale's principle improves generalization in multitask learning. For our models trained on a single task, we use the generalization set to determine the number prediction confidence profile over sentences. Figure 2 shows the average number prediction confidence at each part of speech for all prepositional phrases with inanimate subjects. We note the anomalously low confidence of the SDRNN at plural inanimate subjects (like 'movies', 'books'), unlike the DRNN. In Table 3, we present the results of the models trained for the grammaticality judgment task and tested on the synthetic generalization set. (Table 3 presents three of the tests from the targeted syntactic evaluation framework; other test results can be found in Appendix A.2.) From the results, we can see that despite having nearly the same accuracy on the original test data (Table 1), there is a substantial difference in the generalization accuracies of the DRNN and SDRNN. The DRNN shows better generalization than the SDRNN in the experiments reported in Table 3 and Figure 2. This might be due to regularizing effects induced by Dale's constraint, an observation that merits further investigation.

Language Modeling
Word-level language modeling is a task that helps evaluate a model's capacity to capture general properties of language beyond what is tested in specialized tasks focused on, e.g., subject-verb agreement. We use perplexity to compare our model's performance against standard sequential recurrent architectures. Table 4 shows the validation perplexity of different language models along with the number of learnable parameters for the task. From Table 4, we observe that incorporating the components of the Ab-DRNN and the SRN in a coupled way might have led to the improved performance of the Decay RNN.
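Perplexity is the exponentiated mean negative log-likelihood per token; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigned every token probability 1/100 would score a perplexity of exactly 100.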

Targeted Syntactic Evaluation
Targeted syntactic evaluation (Marvin and Linzen, 2018) is a way to evaluate a language model across different classes of structure-sensitive phenomena, including subject-verb agreement, reflexive anaphora, and negative polarity items (NPIs). Table 4 shows that even with a simple architecture, the Decay RNN class of models performs fairly similarly to LSTMs and much better than SRNs on many tests (results for the ON-LSTM are quoted directly from Shen et al. (2019)). In the case of long-range dependencies and NPIs involving relative object clauses, our models perform substantially better than LSTMs. The high variability in model performance on NPIs might be due to non-syntactic cues, as pointed out by Marvin and Linzen (2018). Based on the mean ranks observed in Table 4, we conjecture that no current sequential recurrent structure outperforms the others across the board; however, SRNs alone are not sufficient for most purposes.
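The evaluation protocol reduces to a pairwise comparison: the language model is credited when it assigns higher probability to the grammatical member of each minimal pair. A sketch (the scoring function is assumed to return a sentence log-probability; naming is ours):

```python
def targeted_eval_accuracy(pairs, score):
    """pairs: (grammatical, ungrammatical) sentence pairs.
    score: callable mapping a sentence to its log-probability under the LM.
    Returns the fraction of pairs where the grammatical sentence wins."""
    correct = sum(1 for gram, ungram in pairs if score(gram) > score(ungram))
    return correct / len(pairs)
```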

Conclusion
In this paper, we proposed the Decay RNN, a bio-inspired recurrent network that emulates the decaying nature of neuronal activations after receiving excitatory and inhibitory impulses from upstream neurons. We found that the balance between the free term (h^(t)) and the coupled term (W h^(t)) enables the model to capture syntax-level dependencies. As shown by McCoy et al. (2020) and Kuncoro et al. (2018), explicitly modeling hierarchical structure helps to discover non-local structural dependencies. The contrast in the performance of the language models encourages us to look at the inductive biases which might have led to better syntactic generalization in certain cases. Recently, Maheswaranathan and Sussillo (2020) showed the existence of a line attractor in the hidden-state dynamics for sentiment classification. A similar dynamical-systems analysis could be extended to our setting to further understand the workings of the Decay RNN.
From the cognitive neuroscience perspective, it would be interesting to investigate whether the proposed Decay RNN can capture some aspects of actual neuronal behaviour and language cognition. Our results here do at least indicate that the complex gating mechanisms of LSTMs (whose cognitive plausibility has not been established) may not be essential to their performance on many linguistic tasks, and that simpler and perhaps more cognitively plausible RNN architectures are worth exploring further as psycholinguistic models.

A.1 Effect of agreement attractors
In this section, we present trends in the testing performance of the LSTM and the Decay RNN (DRNN) on the grammaticality judgment task. Figure 3 shows the performance of the models when we fix the number of intervening nouns and vary the count of attractors between the main subject and the corresponding verb. The decreasing performance of the models as more attractors are introduced indicates that attractors make the models more confused about the upcoming verb number.

A.2 Comparison between DRNN and SDRNN
In Section 6.3, we saw that in terms of testing accuracy for grammaticality judgment, the Slacked Decay RNN (SDRNN) outperformed the Decay RNN (DRNN). For a more robust investigation of this behaviour, we tested our models on the generalization set and reported a subset of the grammaticality judgment results in Table 3. Here we present a bar graph (Figure 4) depicting model performance on the generalization set for the grammaticality judgment task. The substantial difference in the performance of the SDRNN and the DRNN reinforces the possibility of regularizing effects of Dale's principle.
A.3 Implementation of Dale's constraint
Dale's constraint is imposed by fixing the signs of the recurrent connections: the last 20% of units are designated inhibitory (-1 entries on the diagonal of W_dale) and the rest excitatory (+1), following Hendry and Jones (1981).

A.4 Training details
For the number prediction and grammaticality judgment tasks, the network is trained as a binary classifier. The network is single-layered, with ReLU activation, embedding and hidden layer dimensions of 50, and a batch size of 1. We report average accuracies over 3 separate runs in Table 1. For targeted syntactic evaluation, we trained a language model to predict the grammaticality of a sentence. Our language model is a 2-layered network with tanh activation, a dropout rate of 0.2, an embedding dimension of 200, a hidden dimension of 650, and a batch size of 128. All models are trained with a learning rate of 0.001 using the Adam optimizer (Kingma and Ba, 2015).
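The diagonal sign matrix described in Appendix A.3 can be realized as follows; this is a minimal NumPy sketch of our own, not the authors' code:

```python
import numpy as np

def dale_mask(n_units, inhibitory_fraction=0.2):
    """Diagonal matrix with +1 for excitatory units and -1 for the last
    `inhibitory_fraction` of units, which are designated inhibitory."""
    signs = np.ones(n_units)
    n_inhib = int(round(inhibitory_fraction * n_units))
    if n_inhib > 0:
        signs[-n_inhib:] = -1.0
    return np.diag(signs)
```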

A.5 Decay parameter (α) learning
In the main text, we describe the balancing effect of α in the Decay RNN model. We present the trend in the learned value of α over the course of training on the grammaticality task, for various initializations, in Figure 5. We observe that for all α initializations in the range (0,1), the learned value converges to around 0.8. Hence, we initialize α to 0.8 at the start of the training process.
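The paper does not spell out here how α is constrained to (0,1) during learning; one common reparameterization, shown purely as an illustrative assumption, is to learn an unconstrained logit and squash it with a sigmoid:

```python
import math

def alpha_from_logit(a):
    """Map an unconstrained learnable scalar to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))
```

Under this assumption, starting training at α = 0.8 corresponds to initializing the logit to log(0.8/0.2) = log 4.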

Figure 1 :
Figure 1: Decay RNN cell, comprising a skip connection and coupled scalar gates.

Figure 2 :
Figure 2: Number prediction confidence (for the correct verb number) averaged over the generalization set (540 sentences) for prepositional phrases with plural inanimate subjects (IS). An example word for each position is indicated in parentheses. Values at ES indicate the confidence for the following verb/auxiliary. For the example sentence, confidence < 0.5 implies a singular verb number prediction, and confidence > 0.5 a plural one.

Figure 3 :
Figure 3: Trends in the performance of the LSTM (blue) and DRNN (orange) models with increasing numbers of intervening nouns. For each subplot corresponding to a fixed number of intervening nouns, the number of agreement attractors increases as we move from left to right on the x-axis.

Figure 4 :
Figure 4: Performance of the LSTM (blue), DRNN (orange), and SDRNN (green) models for the different types of sentences in the generalization set, when trained for the grammaticality judgment task. There were at least 200 test sentences for each of these types.

Figure 5 :
Figure 5: Moving average of α over the course of training for different initializations. 1 unit of training length is 1 forward pass.

Table 2 :
Number prediction % accuracy with an increasing number of non-attractor intervening nouns (n).

Table 3 :
Accuracy comparison of the DRNN and SDRNN when tested on the generalization set for the grammaticality judgment task; 'anim' refers to an animate noun.

Table 4 :
Accuracy of models on targeted syntactic evaluation. RC: Relative Clause, PP: Prepositional Phrase, VP: Verb Phrase. Closeness in the mean arithmetic rank of models (other than SRNs) across tasks suggests that within the current space of sequential recurrent models, none dominates the others.