Memory-bounded Neural Incremental Parsing for Psycholinguistic Prediction

Syntactic surprisal has been shown to have an effect on human sentence processing, and can be predicted from prefix probabilities of generative incremental parsers. Recent state-of-the-art incremental generative neural parsers are able to produce accurate parses and surprisal values but have unbounded stack memory, which may be used by the neural parser to maintain explicit in-order representations of all previously parsed words, inconsistent with results of human memory experiments. In contrast, humans seem to have a bounded working memory, demonstrated by inhibited performance on word recall in multi-clause sentences (Bransford and Franks, 1971) and on center-embedded sentences (Miller and Isard, 1964). Bounded statistical parsers exist, but are less accurate than neural parsers in predicting reading times. This paper describes a neural incremental generative parser that is able to provide accurate surprisal estimates and can be constrained to use a bounded stack. Results show that the accuracy gains of neural parsers can be reliably extended to psycholinguistic modeling without risk of distortion due to unbounded working memory.


Introduction
Syntactic surprisal has been shown to have an effect on human sentence processing, and can be calculated from prefix probabilities of generative incremental parsers (Hale, 2001; Levy, 2008), making it a useful baseline predictor when looking for effects of other factors, like limits of memory or attention. Recent work in generative neural network parsing (Dyer et al., 2016; Hale et al., 2018) has shown that generative parsers based on neural networks are more accurate than earlier statistical generative parsers, and can be used for surprisal calculation. Although a typical shift-reduce neural network parser like that used by Hale et al. (2018) and Crabbé et al. (2019) may be successful in predicting brain imaging data, the depth of its stack memory, the model component where past predicted items are faithfully stored, can be as long as the whole derivational history of the parse (Kuncoro et al., 2018). This potentially sentence-length stack may be used by the neural parser to maintain explicit in-order representations of all previously parsed words. In contrast, humans seem to have a bounded working memory, demonstrated by inhibited performance on word recall in multi-clause sentences (Bransford and Franks, 1971), and on center-embedded sentences (Miller and Isard, 1964). Explicit storage of this long parsing history may improve parsing accuracy, but it also risks distorting the predictions of the model when used as a statistical control in psycholinguistic experiments.
Left-corner parsers (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983) have been argued to provide human-like limits on working memory, because the stack memory requirements of this kind of parser do not grow unboundedly in linguistically common cases of left- or right-recursion, only in linguistically rare cases of center recursion. For example, a left-corner parser would require only one memory element to process the right-recursive sentence, 'The dog chased the cat that ate the rat that nibbled the malt,' but would require three elements to process the center-recursive sentence, 'The rat that the cat that the dog chased ate nibbled the malt,' consistent with findings that humans have more difficulty understanding the latter sentence. Left-corner parsers also define a fixed set of probabilistic decisions at each word, which naturally paces the surprisal measures produced by the model. Unfortunately, existing left-corner parsers (van Schijndel et al., 2013) are statistical rather than neural, and are therefore substantially less accurate than state-of-the-art neural network parsers.
This paper therefore defines a neural-network left-corner parser with bounded stack memory for parsing and psycholinguistic prediction. Experiments described in this paper show that this generative left-corner neural network parser is competitive with incremental generative parsers that use unbounded stack memory in a parsing task, and outperforms statistical memory-bounded generative left-corner parsers both in parsing accuracy and in fitting human behavioral data on two different datasets, consistent with the conclusion that accuracy gains of neural parsers can be reliably extended to psycholinguistic modeling without risk of distortion due to unbounded working memory.

Related work
Incremental generative constituent parsers are able to process sentences in time order and provide psycholinguistically predictive measures like syntactic surprisal and entropy reduction (Levy, 2008; Hale, 2001, 2006), which in turn are used in psycholinguistic experiments for probing effects of syntax on behavioral data (Demberg and Keller, 2008; Demberg et al., 2012; van Schijndel and Schuler, 2015). Statistical incremental parsers like those proposed by Roark (2001) and van Schijndel et al. (2013) are based on context-free grammars. The Roark (2001) parser builds syntactic structures top-down incrementally and has been used in studies for calculating surprisal (Demberg and Keller, 2008; Roark et al., 2009; Frank, 2009). Left-corner parsers (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983) are often used to model limits on center embedding (Abney and Johnson, 1991; Gibson, 1991; Resnik, 1992; Stabler, 1994; Lewis and Vasishth, 2005). van Schijndel et al. (2013) proposed an incremental parser that takes working memory constraints into account, and is able to produce probabilistic measures as well as predictions about working memory operations (van Schijndel and Schuler, 2015). Demberg et al. (2013) propose a parser which is also able to produce prefix probabilities for tree-adjoining grammars. All of these statistical parsers lag behind state-of-the-art parsers in parsing accuracy, because of psycholinguistic constraints like incrementality and because they use less expressive statistical models.
State-of-the-art constituency parsers are generally neural network models (Choe and Charniak, 2016; Dyer et al., 2016; Kitaev and Klein, 2018). Dyer et al. (2016) propose a generative neural model for top-down incremental parsing but use it only as a reranker for a discriminative parser. Extensions to the Dyer et al. (2016) model allow the parser to do in-order tree traversal (Liu and Zhang, 2017; Kuncoro et al., 2018). However, the in-order transition system has a bias towards left children of constituents, which is not desirable when the model is used to calculate prefix probabilities. This issue was addressed by using word-synchronous beam search (Stern et al., 2017; Hale et al., 2018) or variable-sized beam search (Crabbé et al., 2019), which successfully predicted brain imaging data. However, none of these parsers limits the number of stack elements the parser has direct access to at any timestep, which in some cases can be equal to the number of derivational decisions made up to the current timestep. This unbounded stack does not match what we know about human working memory and is undesirable for calculating predictors like probabilistically-weighted embedding depth (Wu et al., 2010). The model described in this paper avoids these problems by using a left-corner transition system, which uses a bounded pushdown store and a fixed set of probabilistic decisions per word. The bounded stack memory not only more closely implements human working memory limits in a model designed to calculate cognitive predictors; other work (Jin et al., 2018) shows that it also helps limit the search space for unsupervised grammar acquisition.

Incremental left-corner transition system with bounded working memory

This paper introduces a neural left-corner transition system for incremental constituency parsing with minimal working memory requirements. This system defines a fixed set of parser decisions at each time step. Following these parsing decisions, the parser incrementally generates each word in a sentence and the syntactic structures associated with that word, in time order. Because the parser needs space on the stack only when there is center-embedding in the sentence, this transition system uses much less stack memory than other shift-reduce transition systems (Kuncoro et al., 2018), modeling the psycholinguistic observation that center-embedding is rare and hard for humans to process.

Types of nodes in left-corner parsing
A left-corner parser maintains a pushdown store of one or more derivation fragments A/B, each of which consists of a top node of category A lacking a bottom node of category B yet to come. The parser generates one word of the sentence at each time step, and then makes predictions about the top nonterminal category of the current derivation fragment and the bottom rightmost unfinished nonterminal category. This process uses stored states only within center-embedded structures, reflecting the difficulty of center-embedding for humans. For example, in processing the sentence 'The cart the horse pulled broke' (see Figure 1), at timestep t2, immediately after the word 'cart' is generated, the derivation fragment is NP/RC, shown in the figure with an orange-yellow striped plate. The top nonterminal category, or the A category, of this derivation fragment is NP, and the bottom rightmost unfinished nonterminal category, or the B category, is RC. At t3, when a center-embedded structure appears, a new derivation fragment is created and stored in the stack memory, making t3 a timestep with two derivation fragments: NP/RC and NP/NP. Figure 2 defines the set of parser decisions that the parser must make at each time step. They consist of the following:

Parser decisions
generate: First a word must be generated given the current state of the parser and pushed onto the stack. There are two rules associated with generate decisions, and they have different stack configurations when the push operation happens. If the stack has a derivation fragment X/Y at its head, then the word is pushed onto the top of the stack without further operation (generate-w). If the top of the stack has a fragment followed by a ⊥ sign, then the word is first merged with the bottom node Y, and then the merged Y node is merged with X. In the end only the top node X remains (generate-w). The parser deterministically decides which rule to use based on the state of the stack.
pja: Next a nonterminal top node must be projected onto the head of the stack. The set of possible top nodes includes all the nonterminal categories in the training data X as well as a special category null. The pja-x decision projects a nonterminal top node X, together with a placeholder bottom node of the same category, onto the stack, and appends a sign to the stack. The pja-null decision merges the final node Y on the stack, which is often a terminal, with the closest bottom node, and appends the sign to the stack.
pjb: Finally a nonterminal bottom node must be projected onto the head of the stack. The set of possible bottom nodes includes all the nonterminal categories in the training data X as well as a special category null and the discourse-level category T. The pjb-x decision merges the last bottom node Y into a bottom node with the predicted category X. The pjb-null decision changes the sign at the head of the stack to the ⊥ sign. Table 1 shows how the sentence in Figure 1 is parsed with this left-corner transition system. The state of the stack in the parser at the beginning is implicitly [T/T], which represents the top-level discourse structure with a top node of T and a bottom node of T. We omit this initial fragment in the table for brevity. After parsing the whole sentence, the state of the parser will be [ ], again omitting the top-level discourse nodes.
The relationship between a parse tree and a sequence of decisions generated by the transition system is bijective. Trees produced with this system are all binary-branching.
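To make the derivation-fragment bookkeeping concrete, the following minimal sketch (the Fragment type and variable names are illustrative, not the authors' implementation) represents the two stack states quoted from the worked example above and shows that stack depth is simply the number of stored fragments.

```python
from collections import namedtuple

# A derivation fragment A/B: a top node of category A lacking a bottom
# node of category B yet to come. Names here are illustrative only.
Fragment = namedtuple("Fragment", ["top", "bottom"])

# Two of the stack states from the worked example in the text
# ("The cart the horse pulled broke"), omitting the implicit T/T fragment:
stack_t2 = [Fragment("NP", "RC")]                        # after "cart"
stack_t3 = [Fragment("NP", "RC"), Fragment("NP", "NP")]  # "the" opens a center embedding

def depth(stack):
    """Stack-memory depth is simply the number of stored derivation fragments."""
    return len(stack)

print(depth(stack_t2), depth(stack_t3))  # 1, then 2 once the embedding is opened
```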

Use of stack memory
[Figure: The generative incremental left-corner transition system. Adding a buffer to the system yields a discriminative transition system.]
Stack memory depth increases only when a left nonterminal child of a right child is generated (Schuler et al., 2010), i.e. when a center-embedded structure is generated. In the current transition system, the pjb decision at the previous time step (pjb_{t−1}) and the pja decision at the current time step (pja_t) together decide how the depth of a parse will change (see the sketch after this list):
• if pjb_{t−1} = null and pja_t = null, then the depth of the parse will decrease by 1.
• if pjb_{t−1} ≠ null and pja_t ≠ null, then the depth of the parse will increase by 1.
• in all other cases, the depth remains the same as the previous time step.
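The depth-update rule above can be written directly as a small function; this is a sketch assuming the null category is represented by the string "null", and the function name is hypothetical.

```python
def next_depth(depth, pjb_prev, pja_curr):
    """Update stack depth from the pjb decision at t-1 and the pja decision at t,
    following the three rules above ("null" marks the null category)."""
    if pjb_prev == "null" and pja_curr == "null":
        return depth - 1   # both null: the current fragment is closed off
    if pjb_prev != "null" and pja_curr != "null":
        return depth + 1   # both non-null: a new (center-embedded) fragment is opened
    return depth           # otherwise the depth is unchanged

# Example: a center embedding opens, then closes again.
d = 1
d = next_depth(d, pjb_prev="NP", pja_curr="NP")      # 2
d = next_depth(d, pjb_prev="null", pja_curr="null")  # back to 1
print(d)
```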
The depth of the stack memory for a parse is closely related to the well-formedness of the parse. As Figure 1 shows, a valid parse starts at depth 0, stays at larger depths while the sentence is being parsed, and returns to depth 0 at the end of the sentence. The figure also shows that the depth of the stack memory only increases when a center embedding is being parsed, at t3. The average stack memory depth for the transition system in these parsing experiments is 2, which means that on average there are 4 tree nodes in stack memory, much smaller than 12, the average number of items for the top-down system used in Hale et al. (2018). This shows that the left-corner transition system makes much more parsimonious use of stack memory than a typical shift-reduce system. The left-corner model evaluated in the experiments also applies bounding to the stack memory and uses a relatively liberal maximum depth of 5 derivation fragments (10 tree nodes), reflecting the fact that faithfully remembering more than 10 items at once is highly unlikely in human sentence processing due to working memory limits.
There are two sets of constraints, for different use cases, that the parser uses to prune parses on the beam. The basic set only drops a parse when the parse reaches depth 0 before the end of the sentence. This set is used by the parser when psycholinguistic measures are needed. The extended set provides information to the parser about the length of the sentence currently being parsed, guiding the parser to drop parses whose stack memory is too deep or too shallow while parsing. Because this set provides some forward context, it is only used when the parser is used to find the best parses in the linguistic evaluation.

Parsing model
This section defines a memory-bounded neural incremental generative parser as a generative probability model for surprisal calculation using the proposed left-corner transition system. In the description below, all LSTMs are stack-LSTMs (Dyer et al., 2016) with coupled input and forget gates (Greff et al., 2017) and all FFs are feed-forward neural networks.
Surprisal at a word w_t is defined as the negative log of the probability of that word given its preceding words w_{1..t−1} under some model θ. This can be computed by marginalizing over the final hidden state of a sequence:

P_θ(w_t | w_{1..t−1}) = Σ_{q_t} P_θ(w_{1..t} q_t) / Σ_{q_{t−1}} P_θ(w_{1..t−1} q_{t−1})   (1)

then decomposing the marginalized term into a recurrence of marginalized transition-observation probabilities:

P_θ(w_{1..t} q_t) = Σ_{q_{t−1}} P_θ(w_t q_t | q_{t−1}) · P_θ(w_{1..t−1} q_{t−1})   (2)

using P_θ(q_0 w_0) = 1 for some start-of-sentence word w_0 and initial state q_0.
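In practice these sums can be approximated by marginalizing over the items of a word-synchronous beam; the sketch below (function names hypothetical, not the authors' code) computes surprisal from the log prefix probabilities of beam items before and after a word.

```python
import math

def surprisal(prev_beam_logprobs, curr_beam_logprobs):
    """Approximate surprisal of the current word by marginalizing prefix
    probabilities over beam items (hypothesized parser states), i.e.
    -log [ sum_q P(w_1..t, q_t) / sum_q P(w_1..t-1, q_t-1) ]."""
    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    return -(logsumexp(curr_beam_logprobs) - logsumexp(prev_beam_logprobs))

# Toy example with log prefix probabilities of beam items before/after a word:
print(surprisal([-2.0, -2.3, -3.1], [-5.2, -5.9, -6.4]))
```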
The hidden states q_t of the model described in this paper consist of:
• cell and hidden vectors c_t, h_t ∈ R^n for a word LSTM,
• a preterminal decision p_t ∈ C over category labels C,
• a top decision a_t ∈ C ∪ {⊥} over labels and null results ⊥,
• a bottom decision b_t ∈ C ∪ {⊥} over labels and null results,
• top and bottom store vectors a^{1..D}_t, b^{1..D}_t ∈ R^n for store depths d ∈ {1..D}, and
• cell and hidden vectors c_t, h_t ∈ R^n for a decision LSTM.
Probabilities for observing a word and transitioning to a new hidden state at each time step t, given a hidden state at the previous time step t−1, can then be decomposed into terms for each individual decision and resulting vector (Equation 3). The probability of observing a word (Equation 4) depends on bounded representations of the store, the decision sequence, and the word sequence, where δ_i is a Kronecker delta vector, consisting of a one at element i and zeros elsewhere, and q̄_t is a summary of the current stack (Equation 5). This probability term defines a distribution over generate decisions.
The probability of a cell and hidden vector of the word LSTM is deterministic given the preceding operations, and is modeled as an indicator equal to one when the vectors are as defined by the corresponding LSTM model (Equation 6), and zero otherwise. The input to this LSTM (Equation 7) combines e_t = E δ_{w_t}, a trained word embedding, with e′_t = E′ δ_{w_t}, a pre-trained word embedding.
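A minimal sketch of this embedding setup is given below, assuming the trained and pre-trained embeddings are concatenated (the combination method is an assumption) and using a standard LSTM in place of the stack-LSTM with coupled gates used in the paper; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    """Sketch: look up a trained embedding and a frozen pre-trained embedding
    for the same word index and combine them as word-LSTM input."""
    def __init__(self, pretrained: torch.Tensor, trained_dim: int = 100):
        super().__init__()
        vocab, pre_dim = pretrained.shape
        self.trained = nn.Embedding(vocab, trained_dim)
        self.pretrained = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.input_dim = trained_dim + pre_dim

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # Concatenation of the two embeddings is an assumption of this sketch.
        return torch.cat([self.trained(word_ids), self.pretrained(word_ids)], dim=-1)

# Usage with a toy pre-trained table and an LSTM over the combined embeddings:
emb = WordEmbedder(torch.randn(1000, 50))
lstm = nn.LSTM(emb.input_dim, 128, batch_first=True)
h, _ = lstm(emb(torch.randint(0, 1000, (1, 6))))  # one sentence of 6 words
```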
Similarly for the cell and hidden states of the decision LSTM (Equation 8), where m_{t−1} is a trainable embedding for the decision made at timestep t−1. Note that there are three timesteps for m, corresponding to the three decisions made per word, compared to one timestep for w. Figure 3 shows an illustration of how the model predicts the generate-the decision at timestep 3 in Table 1. In the illustration, the decision LSTM takes all previous decisions m_1, m_2, ..., and generates a hidden state h_2 which represents the decision history (Equation 8). The word LSTM takes the words which have already been generated, and produces a hidden state h_2 which represents the word history (Equation 6). The stack composer composes all top and bottom categories on the stack, represented by their vectors, and produces the representation of the stack (Equation 5). Finally, all three representations are processed by the generate feed-forward network FF_θ^W, which makes a prediction about which word is next (Equation 4). Other decisions are made in a similar fashion, as shown below.
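The sketch below illustrates this prediction step: a stack summary, a decision-history state, and a word-history state are combined by a feed-forward scorer over the vocabulary. The concatenation, layer sizes, and names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GenerateScorer(nn.Module):
    """Sketch of the generate feed-forward network: combine a stack summary,
    the decision-history hidden state and the word-history hidden state, and
    output a distribution over the next word (the generate decision)."""
    def __init__(self, stack_dim, dec_dim, word_dim, hidden_dim, vocab_size):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(stack_dim + dec_dim + word_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, stack_summary, dec_hidden, word_hidden):
        x = torch.cat([stack_summary, dec_hidden, word_hidden], dim=-1)
        return torch.log_softmax(self.ff(x), dim=-1)  # log P(w_t | previous state)

# Toy usage: score the next word given the three current representations.
scorer = GenerateScorer(64, 128, 128, 256, vocab_size=10000)
logp = scorer(torch.randn(1, 64), torch.randn(1, 128), torch.randn(1, 128))
```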
The probability of a preterminal category decision depends on a bounded representation of the word sequence at the current time step, a bounded representation of the decision sequence and the store at the previous time step, and the trained and pre-trained embeddings of the current word (Equation 9). This term defines a distribution over preterminal (part of speech) decisions.
The probability of a top category decision depends on a bounded representation of the word sequence at the current time step and a bounded representation of the decision sequence and the store at the previous time step. This term defines a distribution over pja decisions.
The probability of a bottom category decision depends on a bounded representation of the word sequence at the current time step, and on the decision sequence, including the current top decision, and the store at the previous time step, where the decision-sequence representation is the result of adding the most recent decision to the decision LSTM. This term defines a distribution over pjb decisions.
The probability of a top vector is deterministic given the preceding operations, and is modeled as an indicator function equal to one when the vectors are as defined by a set of LSTMs over dependent top and bottom store vectors, depending on the previous bottom category and current top category decisions, where d is the previous store depth (the deepest level with a nonzero store vector at t−1), and φ_d = {a^{1..d−1}_t = a^{1..d−1}_{t−1}, a^{d+1..D}_t = 0} is a maintenance constraint on stores: store vectors at depths shallower than d are copied forward unchanged, and those at depths deeper than d are zeroed. These stack operations related to the top category are illustrated in Figure 5 in the appendix.
The probability of a bottom vector is also deterministic and modeled as an indicator function equal to one when the vectors are as defined by a set of LSTMs over dependent top and bottom store vectors, depending on the previous bottom category and current top category decisions, where d is again the previous store depth and an analogous maintenance constraint applies. Finally, the probability of a cell and hidden vector of the decision LSTM is also deterministic and modeled as an indicator equal to one when the vectors are as defined by the corresponding LSTM model, and zero otherwise.
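The maintenance constraint on stores can be pictured with a small sketch: vectors at depths shallower than the working depth are copied forward, the vector at the working depth is recomputed, and deeper slots are zeroed. The function below is an illustrative sketch, not the authors' implementation.

```python
import torch

def maintain_store(prev_store: torch.Tensor, new_vec: torch.Tensor, d: int) -> torch.Tensor:
    """Apply the maintenance constraint: copy store vectors for depths 1..d-1
    from the previous timestep, write the freshly computed vector at working
    depth d, and zero out all deeper slots d+1..D. Depths are 1-indexed."""
    store = torch.zeros_like(prev_store)      # shape: (D, n)
    store[: d - 1] = prev_store[: d - 1]      # depths above d are carried over
    store[d - 1] = new_vec                    # depth d is recomputed
    return store                              # depths below d stay zero

# Toy usage with a maximum depth of D = 5 and n = 8 dimensional store vectors:
prev = torch.randn(5, 8)
updated = maintain_store(prev, torch.randn(8), d=2)
```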

Training and Parsing
The model proposed here is a generative model for sequence prediction with no forward context, so ideally it would be trained with a structured training scheme (Weiss et al., 2015). However, since it is expensive to search a wide beam during training with a neural network, this model uses a two-stage training scheme. The model is first trained to minimize a cross-entropy loss objective with an l2 regularization term, defined by:

L(θ) = −log P_θ(w_{1..T} q_{1..T}) + λ ‖θ‖²

where P_θ(w_{1..T} q_{1..T}) = ∏_t P_θ(w_t q_t | q_{t−1}), and λ is an l2 regularization strength hyper-parameter.
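A minimal sketch of this first-stage objective, assuming the per-step log-probabilities of the gold sequence have already been computed (names are illustrative):

```python
import torch

def xent_loss(step_logprobs, params, lam=1e-6):
    """First-stage objective: negative log-likelihood of the gold sequence,
    -log P(w_1..T, q_1..T) = -sum_t log P(w_t, q_t | q_{t-1}),
    plus an l2 regularization penalty with strength lam."""
    nll = -torch.stack(step_logprobs).sum()
    l2 = sum((p ** 2).sum() for p in params)
    return nll + lam * l2

# Toy usage with dummy per-step log-probabilities and parameters:
loss = xent_loss([torch.tensor(-2.3), torch.tensor(-1.7)], [torch.randn(4, 4)])
```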
Training with the local cross-entropy objective quickly leads to overfitting, because the left-corner parsing decisions can be ambiguous at early parts of the sentence, and the objective drives the model to make such decisions perfectly by memorizing the training data. This model therefore stops the cross-entropy training when parsing performance starts to decrease on a development set, and switches to the REINFORCE algorithm (Williams, 1992; Le and Fokkens, 2017) to fine-tune the model with sequence-level supervision. The loss becomes a REINFORCE objective in which sampled decision sequences are rewarded by their parsing F-score relative to a baseline, where P_θ(q_{1..T} | w_{1..T}) = P_θ(w_{1..T} q_{1..T}) / Σ_{q′_{1..T}} P_θ(w_{1..T} q′_{1..T}), F is a function from gold and hypothesized decision sequences to parsing F-scores, and b̄ is a global running average of F-scores of all sampled trees.
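The second stage can be sketched as REINFORCE with a running-average baseline, as below; the sampled sequence log-probability and F-score are taken as given, and the class and function names are hypothetical stand-ins rather than the paper's implementation.

```python
import torch

class ReinforceBaseline:
    """Running-average baseline over sampled-tree F scores (b-bar in the text)."""
    def __init__(self, momentum=0.99):
        self.value, self.momentum = 0.0, momentum

    def update(self, f1):
        self.value = self.momentum * self.value + (1 - self.momentum) * f1
        return self.value

def reinforce_loss(seq_logprob, f1, baseline):
    """REINFORCE with baseline: scale the sampled sequence's log-probability by
    how much its parsing F-score exceeds the running average of sampled F-scores."""
    return -(f1 - baseline.value) * seq_logprob

# Toy usage: a sampled derivation with log P(q|w) = -12.4 and F1 = 0.86.
baseline = ReinforceBaseline()
loss = reinforce_loss(torch.tensor(-12.4, requires_grad=True), 0.86, baseline)
loss.backward()
baseline.update(0.86)
```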
After the model is trained, the parser uses beam search to find the approximate best parse. A large beam width is desirable because it provides more accurate parses and straightforward ways to calculate psycholinguistic measures like surprisal, which requires marginalization.

Experiments

Training first uses the cross-entropy objective, with the learning rate set to 0.001 and λ = 1 × 10⁻⁶. Training then switches to the REINFORCE objective until the parser reaches the maximum F1 score (about 3 epochs), with SGD using learning rate = 5 × 10⁻³ and λ = 1 × 10⁻⁵. Using REINFORCE adds 0.3 F1 points on the development set. The pre-trained English word embeddings are from Liu and Zhang (2017). Dropout is applied to the input of all layers.
Experiments first evaluate model performance on Section 22 of WSJ as the development set and Section 23 as the test set for linguistic accuracy evaluation with a beam width of 2000. These experiments use the extended set of constraints for parsing WSJ for efficiency. This evaluation reports EVALB F scores on both datasets. Trees in the training set are binarized with left-branching constituents, and unary nodes are removed from gold trees, following van Schijndel et al. (2013).
The trained model then is used to calculate surprisal for sentences in the Natural Stories Corpus and the Dundee Corpus for psycholinguistic accuracy evaluation. These experiments only use the exploratory set of both corpora. Corpus cleaning follows van Schijndel and Schuler (2013). The parser uses the basic set of constraints to parse the Natural Stories and Dundee corpora, only rejecting parses that would lead to premature termination of the parsing process while doing beam search, with width 2000.

Linguistic accuracy evaluation
A linguistic accuracy evaluation compares the performance of the bounded neural parser with the published results of generative incremental parsers that are able to calculate psycholinguistic predictors. These experiments first compare parsing scores of the current parser on the development and test sets.

Psycholinguistic accuracy evaluation
The psycholinguistic accuracy of the parser is evaluated by comparing surprisal predictors calculated by the neural left-corner model against surprisal predictors from the statistical left-corner parser of van Schijndel et al. (2013), which is the memory-bounded generative incremental parser with current state-of-the-art linguistic accuracy. This evaluation uses linear mixed effects models in lme4 to regress to reading time (how long a word is read) data in the Natural Stories Corpus and to first-pass (how long a word is first fixated) and go-past (how long before a subsequent word is fixated) fixation durations in the Dundee Corpus, with all four combinations of the neural parser's (referred to below as Neural) and van Schijndel et al. (2013) (referred to below as vS13) surprisal predictors in a diamond ANOVA. All the models also have random intercepts for the subject-sentence interaction and for word, and random by-subject slopes for all fixed effects. Since this evaluation uses ablative testing to determine whether a fixed effect significantly improves the fit of a model compared to that model without that fixed effect, all models include random slopes for all fixed effects, even if that particular fixed effect is not used in that model.
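The mixed-effects models themselves would be fit with lme4 in R; the ablative comparison between a model with and without a given surprisal predictor then reduces to a likelihood-ratio test on the two fits' log-likelihoods, as in the sketch below (toy numbers, illustrative function name).

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff=1):
    """Compare nested regression fits (e.g. Baseline+Neural vs. Baseline+Both):
    the test statistic is twice the log-likelihood improvement, referred to a
    chi-squared distribution with df equal to the number of added fixed effects."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)

# Toy example: adding one surprisal predictor improves log-likelihood by 4.1.
print(likelihood_ratio_test(-10234.5, -10230.4))
```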
Psycholinguistic evaluation results are shown in Figure 4 in terms of model fit to human behavioral data. Results show that surprisal values derived from the bounded neural parser explain behavioral data better than those from the bounded statistical parser. First, the results show that the neural parser produces more human-like surprisal values than the vS13 parser in all three experiments. This is shown by the fact that adding vS13 to a model which already has Neural (Baseline+Both vs. Baseline+Neural) yields no significant improvement in model fit. The reverse comparison, in which Neural surprisal values are added on top of vS13 surprisal values (Baseline+Both vs. Baseline+vS13), also shows this effect, because a significant improvement in model fit is observed. Second, both surprisals derived from memory-bounded generative incremental parsers significantly increase model fit, showing that surprisal is a reliable predictor of both reading times and fixation durations; but in all three experiments, the results show that Baseline+Neural achieves much better model fit to the data than Baseline+vS13, with larger log-likelihood improvements compared to Baseline.

Conclusion
This paper proposes a new incremental left-corner transition system that can calculate surprisal and other psycholinguistic predictors, and a new neural generative incremental parser that uses this transition system to do memory-bounded incremental generative parsing. Experiments described in this paper show that this generative left-corner neural network parser is competitive with incremental generative parsers that use unbounded stack memory in a parsing task, and outperforms statistical memory-bounded generative left-corner parsers both in parsing accuracy and in fitting human behavioral data on two different datasets, showing that accuracy gains of neural parsers can be reliably extended to psycholinguistic modeling without risk of distortion due to unbounded working memory.

A Stack operations

Figure 5 illustrates the stack operations related to the top category. For all operations, the current working depth is 2. When both the bottom decision at the previous timestep b_{t−1} and the top decision at the current timestep a_t are null, the parser returns to the stack depth above, copying the top category a^1_{t−1} and its vector from the stack at the previous timestep to the new stack, as shown in Figure 5.1. If the bottom category is null but the top category at the current timestep is predicted to be a real category, the parser copies the categories from the stack depths above and generates a new vector a^2_t for the newly predicted top category a_t, using the top category from the depth above, a^2_{t−1}, the current word w_t, and the embedding of the predicted top category a_t as input to LSTM_θ^Q, as shown in Figure 5.2.

B Constraint sets for parsing
There are two sets of constraints for different use cases for the parser to prune parses on the beam. The basic set is the set of constraints used when psycholinguistic measures are needed. It includes two constraints: the first pja must not be null, and all parses that reach d = 0 before the end of the sentence are removed from the beam.
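A sketch of the basic constraint check, assuming the null category is represented by the string "null"; the function name is hypothetical.

```python
def violates_basic_constraints(pja_history, depth, is_sentence_end):
    """Basic constraint set (used when psycholinguistic measures are needed):
    the first pja decision must not be null, and a parse that returns to
    depth 0 before the end of the sentence is dropped from the beam."""
    if pja_history and pja_history[0] == "null":
        return True
    if depth == 0 and not is_sentence_end:
        return True
    return False

# Example: a parse whose stack empties mid-sentence is pruned.
print(violates_basic_constraints(["NP", "null"], depth=0, is_sentence_end=False))  # True
```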
The extended set is the set of constraints used when searching for a best parse. It guarantees that all parses on the beam are valid parses. Let n be the length of the sentence, d the depth at the current time step, and o the offset of the current word (at position i) from the end of the sentence; the following constraints ensure well-formed parses:
1. if d = o − 1, then both pja_t and pjb_t must be null.
2. if d = 1 and o > 1, then pja_t and pjb_t cannot be null at the same time.
3. if d = o − 2 and constraint 2 also applies, then pja_t and pjb_t can be neither both null nor both x; otherwise pja_t and pjb_t cannot both be x.
The parser with the extended set can be seen as the parser with the basic set and a wider beam when it is used for finding the best parse: if a parse is at the top of the beam with the extended set, then it will also be at the top of the beam with the basic set, provided that it is not lost in beam search and a better parse is not found due to using a wider beam.

C Hyperparameters

Table 3 shows the hyperparameters the model uses for all experiments. These values are tuned on the development set.

D lmer formulae for psycholinguistic experiments
The following sections record the lmer formulae for all psycholinguistic experiments mentioned in the paper. The independent variables included in all models are: word length (wlen), unigram probability (unigram), and 5-gram forward probability of the current word given the preceding context (fwprob5surp). All independent variables are centered and scaled before being added to each model. The 5-gram probabilities are interpolated 5-grams computed over the Gigaword corpus using KenLM (Heafield et al., 2013). Regressions to eye-tracking data also include word position (wdelta) as well as whether the previous word was fixated (prevwasfix). For regressions to go-past durations, one-position spillover measures for unigram probability (unigramS1) and 5-gram forward probability (fwprob5surpS1) are also added.

D.1 Natural stories
The lmer formula for regression to reading times in the Natural Stories Corpus is: log(reading times) ∼ z. The difference between the four evaluated models is whether each surprisal variable is included as a fixed effect or not; this holds for all the experiments.

D.2 Dundee: first pass
The lmer formula for regression to first pass fixation durations in the Dundee Corpus is: log(first pass fixation duration) ∼ z.