Explicitly Modeling Syntax in Language Models with Incremental Parsing and a Dynamic Oracle

Syntax is fundamental to our thinking about language. Failing to capture the structure of input language could lead to generalization problems and over-parametrization. In the present work, we propose a new syntax-aware language model: Syntactic Ordered Memory (SOM). The model explicitly models the structure with an incremental parser and maintains the conditional probability setting of a standard language model (left-to-right). To train the incremental parser and avoid exposure bias, we also propose a novel dynamic oracle, so that SOM is more robust to wrong parsing decisions. Experiments show that SOM can achieve strong results in language modeling, incremental parsing, and syntactic generalization tests while using fewer parameters than other models.


Introduction
Several recent works have systematically studied the linguistic abilities of modern language models, particularly their syntax (Linzen et al., 2016; Marvin and Linzen, 2018; Gulordava et al., 2018). They find that most language models are good at capturing frequent syntactic structures but do not generalize well to those in the long tail. Moreover, although some excel at achieving low perplexity, this is less due to their syntactic ability and more due to capturing collocations (frequently co-occurring words). Recently, Hu et al. (2020) showed that RNNs underperform on a syntactic generalization (SG) test set, whereas models that have an explicit notion of syntax, such as RNNG, fare well on SG but at the cost of generally poorer language modeling (higher perplexity). Transformer-based models achieve strong performance when trained with large datasets, but are worse than random when trained on a small dataset.
These works showed that building language models with an explicit internal model of syntax helps in achieving better performance on SG tasks and is also thought to help learn more efficiently in low-data settings. However, building syntax-aware models that also obtain strong language modeling performance, when compared with recent transformer-based models, has until now seemed elusive. In this work, we propose a new syntax-aware language model dubbed Syntactic Ordered Memory (SOM; Fig. 1), which jointly acts as a language model and an incremental parser. SOM inherits the syntax representation used in Ordered Memory (OM; Shen et al. 2019), in which syntax trees are embedded in a grid-like memory representation. Whereas OM was trained as an unsupervised parser, SOM is explicitly trained to predict both ground-truth syntax trees incrementally and, using the predicted partial syntactic structure, the next token. Fig. 1 shows the mechanism of SOM.

Figure 1: The mechanism of SOM. "Context" is a distributed representation of the previous sentences; it can also represent the source sentence in a sequence-to-sequence task. SOM incrementally builds subtrees given the input sentence. An RNN takes the context representation and the representations of subtrees in the current sentence to predict the next token.
SOM factorizes the next-token prediction process into two steps: first, we predict the attachment position of the next token with a zero-step look-ahead parser, trained in a supervised fashion; then, we predict the next-token distribution conditioned on the partially predicted structure. One way of training the incremental parser is to use teacher forcing. However, this can lead to exposure bias, because the model is never exposed to its own predictions during training. To avoid this, we introduce a dynamic oracle (Goldberg and Nivre, 2012) for our model, so that it can learn to recover from previous parsing mistakes during inference. We found this to be crucial to obtaining good performance.
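The two-step factorization can be illustrated with a minimal sketch (made-up scores and a toy read-out, not the actual SOM architecture): first choose an attachment slot greedily, as is done at evaluation time, then produce a next-token distribution conditioned on that choice.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def next_token_dist(slot_scores, per_slot_token_logits):
    # Step 1: zero-step look-ahead parser -- pick the attachment
    # slot for the upcoming token (greedy choice, as at evaluation).
    i = max(range(len(slot_scores)), key=slot_scores.__getitem__)
    # Step 2: predict the next-token distribution conditioned on the
    # partially predicted structure (here: the chosen slot's logits).
    return softmax(per_slot_token_logits[i])
```

In the full model the per-slot conditioning is computed by the prediction network rather than stored as a table; the sketch only shows the order of the two decisions.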
We compare SOM with existing methods that integrate syntax into language models. RNNGs and Ordered Neurons (Shen et al., 2018) are particularly related. RNNGs are generative models of language which define a joint distribution over syntactic structures and sequences of words. Ordered Neurons attempts to model the hierarchical structure of language by imposing an ordering on the hidden states and the gates that impose that structure. We show that our proposed SOM model achieves strong language modeling, parsing, and SG performance even when trained on small amounts of data.
In summary, our contributions are threefold:
• We introduce SOM, a new syntax-augmented language model that learns an incremental parser and uses its predictions to improve language modeling.
• We propose a novel dynamic oracle that reduces exposure bias and is instrumental to achieving good downstream performance.
• We report strong SG scores, language modeling, and incremental parsing performance across various dataset sizes. We also find that jointly learning language modelling and parsing improves both capabilities of the model.

Related Work
Syntax-aware models There has been much work on integrating syntax into current models of language. Socher et al. (2013) used parse trees to compose sentence representations for predicting the sentiment of movie reviews. However, the reliance on an external parser and the restriction on batched computation made that early method unwieldy. Bowman et al. (2016) introduced the SPINN model, which alleviated those issues by turning sentences into a sequence of actions to be executed by a shift-reduce parser. Our SOM model is also based on shift-reduce parsing, because of the incremental nature of the parsing we want to achieve. RNNG was an early example of integrating syntactic information into language modelling.
There is also work that attempts to learn these syntactic structures without supervision. Kim et al. (2019) devised an unsupervised version of the RNNG, which produced good parsing performance. DIORA (Drozdov et al., 2019, 2020) leveraged the Inside-Outside algorithm to construct sentence embeddings for downstream tasks, with the benefit that parse trees can be read off during the encoding process. Swayamdipta et al. (2019) find no improvements over ELMo (Peters et al., 2018) embeddings when shallow syntactic information is included, concluding that ELMo-style pretraining already captures this syntactic information. However, later work investigating the importance of the syntactic knowledge learned by RNNG in a large pre-trained model like BERT found that syntactic information does help with downstream tasks. In our experiments, we find that explicitly training OM with syntax (with our dynamic oracle scheme) improves performance on syntactic generalization tasks.

Incremental Parsing & Language Modelling
In SOM, we specifically focus on incremental parsing. Ghezzi and Mandrioli (1979) discuss incremental parsing in the context of programming languages, with shift-reduce parsers being one specific type of incremental parser. OM, RNNG, and SPINN were all designed with shift-reduce parsing in mind.
Incremental parsing lends itself well to autoregressive language modelling: since the parser only sees a prefix of the sentence, the model can use the partial parse to make predictions about upcoming words. Demberg et al. (2013) summarise several empirical results that provide evidence for incremental and predictive parsing in humans, and make several connections between incrementality (comprehenders do not wait until the end of the sentence before building a representation) and prediction of words still to come in the sentence.
Given that an incremental parser processes a sentence from left to right, there are naturally some limitations. Hassan et al. (2009) show why either a beam or a delay is necessary when performing incremental parsing with monotonic extensions: experimenting with a parser based on Combinatory Categorial Grammar (Steedman, 2000), they find that without look-ahead there is a 30-percentage-point reduction in parsing results. One of our contributions in this paper is to use a one-step look-ahead while parsing, but a zero-step look-ahead when performing next-word prediction, allowing the model to be trained jointly as an incremental parser and language model.
Despite the left-to-right nature of incremental parsing, this setting may aid language modelling too. Shieber (1983) suggests that such biases may correspond to the way humans parse English, and uses a modified shift-reduce parser to disambiguate between different parses of a sentence. There has been work showing that incremental parsing can improve language modelling. Köhn and Baumann (2016) demonstrate that combining an incremental dependency parser with a language model yields improvements in perplexity. Roark (2001) presents a top-down phrase structure parser that performs beam search to generate connected intermediate structures for every sentence prefix; this model can be used for language modeling and beats trigram models on the Penn Treebank (Marcus et al., 1994).

Dynamic Oracles Since incremental parsing breaks the problem of structure prediction into sequential decisions, it is prone to exposure bias. There are techniques that address this by allowing the model to make mistakes and supervising future actions based on the state it arrives at (Daumé et al., 2009). Goldberg and Nivre (2012) introduce the concept of dynamic oracles for dependency parsing. Coavoux and Crabbé (2016) use this technique for incremental constituency parsing, but rely on morphological features and do not perform language modelling. Fried and Klein (2018) cover the related work on dynamic oracles and parsing in further detail. We find that training with a dynamic oracle is crucial to seeing benefits in both language modelling and incremental parsing.
Evaluating Syntactic Generalization Recent tests have been developed to probe the linguistic abilities of language models. Gulordava et al. (2018) explore the extent to which RNNs are able to model grammar, independent of the semantics of the sentence. Marvin and Linzen (2018) evaluate language models on their ability to score sentences with and without proper subject-verb agreement across a variety of settings. Hu et al. (2020) expand on these ideas and propose a suite of syntactic generalization tests for language models over a series of different-sized datasets. They find that while GPT-2 performs well, its performance is highly dependent on the scale of the language modeling training dataset, while other models remain more robust. In this paper, we use this test suite for evaluation.

Ordered Memory
We first provide useful background on Ordered Memory. Ordered Memory (OM; Shen et al. 2019) is a recurrent neural network that explicitly models recursive structure through memory writing and erasing operations. OM maps the latent syntax into a T × N memory grid M̂, where T is the length of the input sequence and N is the maximum number of memory slots. Figure 2 gives an intuition of what the grid contains; empty blocks in the figure represent memory slots that can be discarded during inference. Ideally, the memory network should generate the t-th column of the grid, M̂_t, at time step t. But generating M̂_t requires the model to have access to the tree structure, which is usually latent. For this reason, OM induces the latent structure through inductive biases in its reading and writing operations.
As a recurrent model, OM performs one-step look-ahead incremental parsing. In the recurrent transition, a one-step look-ahead parser predicts the syntax once the current word x_t is observed, and can be seen as a posterior over the syntax given the current word; the prediction network uses a zero-step look-ahead parser to predict the location of the next phrase and acts as a prior over the syntactic structure. To do so, OM maintains three states:
• Memory M_t: each occupied slot is a distributed representation for a node spanning a subsequence of x_1, ..., x_{t−1}, conditioned on x_t; i.e., M_t represents a one-step look-ahead parser stack. It is represented by gray blocks in Figure 3.
• Candidate memory M̂_t: an N × D matrix containing representations for all possible new nodes at time step t. At the next time step t+1, the model decides whether or not to write these candidates into the memory M_{t+1}, conditioned on x_{t+1}. Candidates are represented by orange blocks in Figure 3. If the model is making correct parsing decisions, then M_t = M̂_{t−1}.
• Memory mask π_t ∈ {0, 1}^N: each entry indicates whether the respective slot in M̂_t is occupied by a candidate; e.g., if π_t = (0, 1, 1), then the occupied slots are M̂_t^{≥2}. At the next time step, the model can only choose a candidate from masked slots to write into the memory M_{t+1}.
At each time step, the model takes x_t as input. To generate the new memory M_t, we combine M_{t−1} and M̂_{t−1}. The model uses x_t as its query to attend over the previous candidates M̂_{t−1}. The attention distribution is p_t, which models the split point between the gray blocks and the orange blocks in Figure 2. Suppose p_t is a one-hot distribution with p_t^i = 1: the candidates M̂_{t−1}^{≤i} are written into the respective memory slots M_t^{≤i}, while M_{t−1}^{>i} are copied to M_t^{>i}. We refer to the process of generating M_t as a one-step look-ahead parser, since the model uses the current input x_t as extra information to build the partial parse for time step t−1. To generate the new candidates M̂_t, the input embedding x_t is written into M̂_t^{i−1}, and M̂_t^{≥i} are computed recurrently with Eq. 3, where cell() is the composition function that takes its children's representations as input and outputs the parent's representation. The non-empty slots in the candidate memory are then M̂_t^{≥i−1}, and they can be masked accordingly: π_t^i = Σ_{j≤i+1} p_t^j, so π_t^i is monotonically increasing in i. More details of OM can be found in Shen et al. (2019).
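For the idealized one-hot case, the write/copy rule above can be sketched as follows (a toy illustration with opaque slot values; the real model uses a soft attention distribution p_t over split points):

```python
def update_memory(memory, candidates, split):
    """One step of the OM write rule for a one-hot p_t with
    p_t[split] = 1: candidate slots at indices <= split are written
    into the new memory, and higher slots are copied from the old
    memory unchanged."""
    assert len(memory) == len(candidates)
    return [candidates[i] if i <= split else memory[i]
            for i in range(len(memory))]
```

With a soft p_t, the model instead interpolates between the written and copied slot contents according to the cumulative attention mass.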

Syntactic Ordered Memory
We propose two augmentations to OM to better perform language modelling and incremental parsing: a prediction network and a dynamic oracle. a) Previous language models mostly focus on predicting the next token or a missing token. In our case, we explicitly model the latent structure. By predicting the structure for the next token, we exploit this latent structure for word prediction. This helps the model better organize information for predicting the next word, allowing shortcuts to be created for long-term dependencies, as shown in Fig. 1. b) If the model only observes states resulting from correct past decisions at training time, it will not be prepared to recover from its own mistakes during prediction, and will suffer from exposure bias (Schmidt, 2019; Fried and Klein, 2018). In the experiment section, we demonstrate that this phenomenon significantly hurts language modeling performance and, to a lesser extent, also hurts parsing performance.

Prediction Network
At time step t, the prediction network takes [M_t, M̂_t, π_t] as input, and produces a probability distribution over the next token, p(w_{t+1} | w_{≤t}).
To do this, we need a temporary estimate of the local structure. We therefore approximate p_{t+1} with a zero-step look-ahead prediction p̄_t, where W_1^Att is an N × N weight matrix, w_2^Att is an N-dimensional weight vector, and α_t^i is a scalar score. We then sample a slot index i from the distribution p̄_t. The index i is the zero-step look-ahead parsing decision, meaning that the next phrase will be a sibling of node M̂_t^i. We therefore need to predict the next token conditioned on M̂_t^i and its previous context, so we feed the memory slots into a recurrent neural network whose final hidden state is h_t. As shown in Figure 3b, the input sequence consists of representations of non-overlapping subtrees spanning x_1 to x_t; h_t can therefore be seen as a distributed representation of the sequence w_{≤t}. In the RNN, we use the same architecture as the cell function in OM to model the recurrent transition, where σ is the sigmoid function, LN is the layer normalization function, f^j and i^j are controlling gates, c^j is the cell state, and h_t^{N+1} is a zero vector. After obtaining h_t, we can compute the distribution over the next token and the language modelling loss.

Algorithm 1: The structure label generation algorithm, where Γ is the ground-truth tree and θ_i denotes the structural decisions made by our model. This algorithm produces a parse close to the original given the errors already made, and that new gold parse is converted into grid decisions. Given Γ, the function first_sibling_Γ(i) returns the index of the first token in the smallest clause that contains w_i and in which w_i is not the first token. Ideally, w_i should be written into slot (ξ_j − 1). For example, in Figure 2, c is written into slot 2, then d and e should be written into slot 1. However, the model could make a wrong decision between w_j and w_i. If the model has merged information from w_j into a higher slot µ_i, then x_i should be written into slot µ_i as well.
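The masked slot selection behind p̄_t can be sketched as follows (a toy version: the per-slot scores α_t^i are given directly rather than computed from the W_1^Att / w_2^Att parameters of the text, and the mask plays the role of π_t):

```python
import math

def zero_step_lookahead(scores, mask):
    """Softmax over candidate-slot scores, restricted to occupied
    slots (mask entry 1). Returns the distribution p_bar_t and the
    greedy slot choice used at evaluation time."""
    masked = [s if m == 1 else float('-inf') for s, m in zip(scores, mask)]
    mx = max(masked)
    es = [math.exp(s - mx) for s in masked]
    z = sum(es)
    p = [e / z for e in es]
    return p, max(range(len(p)), key=p.__getitem__)
```

During training the slot is sampled from p̄_t instead of taken greedily, which is what exposes the model to its own mistakes.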
One way to provide a supervision signal for p_t and p̄_t is to train the parser with a static oracle: feed the gold tree to the model, and have the model predict future decisions. However, the static oracle makes the language model overfit on the gold tree, resulting in bad perplexity scores. Inspired by the dynamic oracles proposed in Goldberg and Nivre (2012) and Coavoux and Crabbé (2016), we propose a dynamic oracle for Ordered Memory, which dynamically changes the reference structure based on mistakes made by the model at previous steps. To do this, we build the structure label for each time step based on the gold tree and the previous decisions made by the model. During training, we sample the model's decision from p_t, and we make greedy decisions during evaluation; the same operations are applied to p̄_t as well. We use Algorithm 1 to convert the gold tree Γ into labels ξ_t for p_t. Since the zero-step look-ahead distribution p̄_t should match the one-step look-ahead distribution p_{t+1} at the next time step t+1, we use ξ_{t+1} as the label for p̄_t. The structure loss is the negative log-likelihood:

L_S = − Σ_t [ log p_t(ξ_t | w_{≤t}) + log p̄_t(ξ_{t+1} | w_{≤t}) ]

For our model, the depth of Γ is linearly related to the computational complexity and GPU memory consumption. To maximize the model's efficiency, the gold tree Γ is constructed from universal dependency trees. There are two reasons we chose universal dependency trees over constituency trees: 1) as Table 1 shows, dependency trees are on average shallower than constituency trees, which means faster computation and less memory consumption for our model; 2) universal dependency trees can be applied to many more languages than Penn Treebank-style constituency grammar.

Figure 4: The universal dependency tree is converted into a constituency tree Γ by merging the head and its children into one single constituent. Since the grid view only works with binary trees, we binarize n-ary nodes with a left-branching bias.
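A minimal numeric sketch of the structure loss L_S (toy probabilities standing in for the model's slot distributions; the last zero-step term is dropped when ξ_{t+1} does not exist):

```python
import math

def structure_loss(p_seq, p_bar_seq, labels):
    """L_S = -sum_t [ log p_t(xi_t) + log p_bar_t(xi_{t+1}) ],
    where p_t is the one-step and p_bar_t the zero-step look-ahead
    distribution, and labels[t] = xi_t comes from the label
    generation algorithm."""
    loss = 0.0
    for t in range(len(labels)):
        loss -= math.log(p_seq[t][labels[t]])
        if t + 1 < len(labels):
            loss -= math.log(p_bar_seq[t][labels[t + 1]])
    return loss
```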
Additionally, Penn Treebank-style trees can easily be converted to universal dependency trees. As shown in Figure 4, we convert the universal dependency tree into Γ by merging the head and its children into one single constituent.
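The left-branching binarization mentioned in the Figure 4 conversion can be sketched as follows (a toy representation: trees are nested tuples and leaves are strings, which is an assumption for illustration, not the paper's data structure):

```python
def binarize(tree):
    """Binarize an n-ary constituency tree with a left-branching
    bias: an n-ary node (a, b, c, d) becomes (((a, b), c), d)."""
    if isinstance(tree, str):
        return tree  # leaf token
    kids = [binarize(k) for k in tree]
    left = kids[0]
    for k in kids[1:]:
        left = (left, k)  # fold children to the left
    return left
```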

Experiments
We present the results of SOM on language modeling, syntactic generalization, and incremental parsing. Details of hyperparameters and experiment settings can be found in Appendix B.

Language Modeling
The Penn Treebank contains one million words of the 1989 Wall Street Journal corpus, annotated with constituency trees. Since SOM primarily focuses on sentence-level structure and language modeling, we use the same preprocessing scheme as RNNG. Sentences are modeled separately, punctuation is retained, and singleton words are replaced according to the Berkeley parser's mapping rules, resulting in 23,815 word types. Orthographic case distinctions are preserved, and numbers (beyond singletons) are not normalized.
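The singleton replacement can be sketched as follows. Note that the actual Berkeley mapping rules are more elaborate; this hypothetical `unk_class` only illustrates the idea of UNK classes that preserve case and a few morphological suffixes (e.g., the UNK-ed / UNK-ly classes mentioned for the BLLIP setup):

```python
def unk_class(word):
    """Map a singleton word to a fine-grained UNK class that keeps
    orthographic case and a suffix hint (illustrative rules only)."""
    if word.endswith('ed'):
        suffix = '-ed'
    elif word.endswith('ly'):
        suffix = '-ly'
    elif word.endswith('s'):
        suffix = '-s'
    else:
        suffix = ''
    case = '-CAP' if word[:1].isupper() else ''
    return 'UNK' + case + suffix
```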
BLLIP is a large Penn Treebank-style parsed corpus of approximately 24 million sentences. We train and evaluate SOM on three splits of BLLIP: BLLIP-XS (40K sentences, 1M tokens), BLLIP-SM (200K sentences, 5M tokens), and BLLIP-MD (600K sentences, 14M tokens), obtained by randomly sampling sections from the BLLIP 1987-89 Corpus Release 1. All models are tested on a shared held-out test set.
Following the settings of Hu et al. (2020), the datasets are preprocessed into two different versions. The first setting is similar to the PTB dataset: singleton words are mapped to UNK classes that preserve fine-grained information, such as orthographic case distinctions and morphological suffixes (e.g., UNK-ed, UNK-ly). The second setting uses a subword-level vocabulary.

Table 2: Ablation tests on the PTB dataset. "p acc" and "p̄ acc" are the prediction accuracies of the one-step and zero-step look-ahead parsers respectively. "UF1" is the parsing performance with respect to the converted constituency tree Γ. "− Prediction network": this model uses the last candidate memory slot M̂_t^N to predict the next token, instead of the h_t from the prediction network. "− Predicted tree + Gold tree": the model's parsing decisions are replaced with ground-truth decisions; these results can be considered a performance upper bound for SOM.
Results of language modeling are given in Table 3 and Table 4. SOM consistently outperforms both the annotated and non-annotated baseline models. While GPT-2 seems to fail to learn on the smaller datasets, SOM still outperforms GPT-2 on the BLLIP-MD dataset with far fewer parameters (34.8M vs. 124.4M), and achieves results comparable to a GPT-2 trained on BLLIP-LG, a dataset three times larger (Hu et al., 2020). The ablation test results are shown in Table 2.
The biggest performance drop comes from replacing the dynamic oracle with the static oracle. We believe this is due to the model overfitting on the gold tree and, as a result, suffering from exposure bias. Another large performance drop occurs after removing the prediction network, suggesting that predicting the attachment node of the next phrase with the zero-step look-ahead parser helps to predict the next token. Replacing the gold tree labels with trivial left-branching tree labels also hurts perplexity, suggesting that learning syntactic structure helps language modeling.

Syntactic Generalization
Syntactic Generalization (SG) test suites evaluate the syntactic knowledge of neural language models. Hu et al. (2020) proposed a set of 34 test suites to evaluate six different aspects of syntax: 1) agreement, 2) licensing, 3) garden-path effects, 4) gross syntactic expectation, 5) center embedding, and 6) long-distance dependencies. Following their settings, we evaluate our language models trained on the BLLIP datasets. Language models are presented with a group of sentences with minor differences; to pass each test, the model needs to assign higher conditional probabilities to the designated phrases that are more grammatical. Figure 6 shows the average accuracy over all models on the complete set of SG test suites. SOM achieves the best average accuracy, outperforming models with a hierarchical structure bias (RNNG, ON-LSTM) and a transformer-based model (GPT-2). However, according to Figure 8a in Appendix C.1, GPT-2 trained on BLLIP-LG and BLLIP-MD still outperforms SOM. This could be because SOM has far fewer parameters than GPT-2. Figure 5 provides fine-grained results on the six SG classes. SOM achieves strong performance on licensing, gross syntactic state, center embedding, and long-distance dependencies. These classes require the model to keep track of syntactic features across large syntactic chunks (e.g., relative or subordinate clauses). SOM can effectively keep this long-term information in higher-level memory slots, and revisit it after the intervening clause has ended. More detailed results can be found in Appendix C.1.


Incremental Parsing

To evaluate SOM's performance on incremental parsing, we trained and evaluated our models on the standard PTB constituency trees. Baseline models include: a) a standard incremental shift-reduce parser with one-step look-ahead; b) an incremental shift-reduce parser equipped with our prediction network and trained with the same dynamic oracle and language model loss as our model; c) the recently proposed ONLSTM-SYD model (Du et al., 2020), which is also trained with both language model and parsing losses; d) unsupervised ONLSTM; e) unsupervised PRPN. As shown in Table 5, SOMs outperform all baseline models, including the shift-reduce parser that has the same extra components as SOMs. For language modelling performance, the constituency-tree-based models achieve similar perplexity to their dependency-tree-based counterparts, but require 2× the GPU time and memory to train and evaluate.

For the ablation test, we also compare the parsing results of SOM against the binary constituency trees Γ converted from universal dependency trees. These results are shown in Table 2. We observe that using the static oracle instead of the dynamic oracle results in the worst parsing performance, suggesting that our dynamic oracle helps the model learn a better parser. After removing the language model loss, UF1 drops by 1.7 points. This suggests that the language model loss helps the model learn better representations for syntax.

Conclusion
In this work, we propose a new language model with an integrated incremental parser, built by augmenting the Ordered Memory model with a prediction network and training it with a dynamic oracle to perform incremental parsing. The resulting model captures the joint distribution of syntactic structures and sequences of words. We find that, by using the dynamic oracle and explicitly modeling syntax, we can achieve strong performance on language modelling and syntactic generalization, and that both techniques are crucial to the model's performance.

A Disentangling Semantic and Syntactic Representations
Given the architecture of our model, we can easily disentangle the language modeling information flow from the parsing information flow. Figure 7 illustrates the disentangled information and gradient flows in our model. The language model depends on both the prior context and the structural inputs, and derivatives are computed with respect to both inputs and backpropagated. However, while the structure also depends on both inputs, we limit backpropagation so that it can only update with respect to the syntactic input. This is because we want the parsing component to function independently of the language modelling component, while still leveraging semantic information to deal with syntactic ambiguity. It is possible that existing model architectures could implicitly learn to split these representations, even without the explicit disentanglement we propose here. Yet, Table 2 shows that the entangled model can actually achieve stronger in-domain performance, thanks to the freedom to allocate capacity between the two functionalities based on the training set.
To do so, we propose splitting the word embeddings, memory slots, and intermediate hidden states into two segments: a semantic segment and a syntactic segment. We then replace the linear layers in our cell functions with a split function in which y_sem and x_sem are the semantic segments, optimized to minimize the language modeling loss, and y_syn and x_syn are the syntactic segments, optimized to minimize both the parsing and language modeling losses. This architecture results in the solid lines in Figure 7. Additionally, each layer normalization function is replaced with two separate functions for the two segments.
Meanwhile, p_t still depends on both the semantic and syntactic segments, but the structural loss does not backpropagate into the semantic segment:

p_t = f(x_{t,sem}, x_{t,syn}, M̂_{t,sem}, M̂_{t,syn})    (14)
∂p_t/∂x_{t,sem} = 0,  ∂p_t/∂M̂_{t,sem} = 0    (15)

and the same holds for p̄_t. This gradient detachment is represented by the dashed lines in Figure 7. In the experiment section, the disentangled models are denoted dSOM, and the entangled models are denoted SOM. For dSOM, the dimensions of the semantic and syntactic segments of the memory slots are denoted D_sem and D_syn respectively. Among the proposed models, the entangled SOM has the best performance on the in-domain test sets; Appendix C.2 shows that dSOM slightly outperforms it in perplexity on out-of-domain test sets.

Dropout is applied before all linear layers in our model, all sharing the same dropout rate, except that the dropout before the language model output layer has a different rate. We also apply embedding dropout, which randomly sets some embedding vectors to 0. Hyperparameters are chosen based on perplexity on the validation set.

C.2 Out of Domain Evaluation
The out-of-domain test set contains test sets from other English Universal Dependencies treebanks, covering corpora of different genres, including academic, email, blog, fiction, legal, and news text. We use these datasets to test the generalization ability of models trained on PTB.