Contributions of Propositional Content and Syntactic Category Information in Sentence Processing

Expectation-based theories of sentence processing posit that processing difficulty is determined by predictability in context. While predictability quantified via surprisal has gained empirical support, this representation-agnostic measure leaves open the question of how to best approximate the human comprehender’s latent probability model. This work presents an incremental left-corner parser that incorporates information about both propositional content and syntactic categories into a single probability model. This parser can be trained to make parsing decisions conditioning on only one source of information, thus allowing a clean ablation of the relative contribution of propositional content and syntactic category information. Regression analyses show that surprisal estimates calculated from the full parser make a significant contribution to predicting self-paced reading times over those from the parser without syntactic category information, as well as a significant contribution to predicting eye-gaze durations over those from the parser without propositional content information. Taken together, these results suggest a role for propositional content and syntactic category information in incremental sentence processing.


Introduction
Much work in sentence processing has been dedicated to studying differential patterns of processing difficulty in order to shed light on the latent mechanism behind online processing. As it is now well-established that processing difficulty can be observed in behavioral responses (e.g. reading times, eye movements, and event-related potentials), recent psycholinguistic work has tried to account for these variables by regressing various predictors of interest. Most notably, in support of expectation-based theories of sentence processing (Hale, 2001; Levy, 2008), predictability in context has been quantified through the information-theoretic measure of surprisal (Shannon, 1948). Although there has been empirical support for n-gram, PCFG, and LSTM surprisal in the literature (Goodkind and Bicknell, 2018; Hale, 2001; Levy, 2008; Shain, 2019; Smith and Levy, 2013), surprisal makes minimal assumptions about the linguistic representations that are built during processing, which leaves open the question of how to best estimate the human language comprehender's latent probability model.
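The surprisal measure itself is straightforward to compute once a probability model is fixed. As a minimal sketch, with made-up probabilities and a toy bigram model standing in for the comprehender's latent model:

```python
import math

# Toy conditional probability model: P(next word | previous word).
# Probabilities are illustrative, not estimated from any corpus.
BIGRAM = {
    ("the", "dog"): 0.20,
    ("the", "idea"): 0.01,
}

def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(model[(prev, word)])

# A less predictable continuation yields higher surprisal.
print(surprisal("the", "dog", BIGRAM))   # ~2.32 bits
print(surprisal("the", "idea", BIGRAM))  # ~6.64 bits
```

The same definition applies unchanged whether the model is an n-gram, a PCFG, an LSTM, or the incremental parser described below; only the estimate of the conditional probability differs.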
One factor related to memory usage that has received less attention in psycholinguistic modeling is the influence of propositional content, or meaning that is conveyed by the sentence. Early psycholinguistic experiments have demonstrated that the propositional content of utterances tends to be retained in memory, whereas the exact surface form and syntactic structure are forgotten (Bransford and Franks, 1971;Jarvella, 1971). This suggests that memory costs related to incrementally constructing a representation of propositional content might manifest themselves in behavioral responses during online sentence processing. In addition, there is evidence suggesting that parsing decisions are informed by the ongoing interpretation of the sentence (Brown-Schmidt et al., 2002;Tanenhaus et al., 1995).
Based on this insight, prior cognitive modeling research has sought to incorporate propositional content information into various complexity metrics. A prominent approach in this line of research has been to quantify complexity based on the compatibility between a predicate and its arguments (i.e. thematic fit, Baroni and Lenci 2010, Chersoni et al. 2016, Padó et al. 2009). However, these complexity metrics can only be evaluated at a coarse per-sentence level or at critical regions of constructed stimuli where predicates and arguments are revealed, making them less suitable for studying online processing. A more distributional approach has also been explored that relies on word co-occurrence to calculate the semantic coherence between each word and its preceding context (Mitchell et al., 2010; Sayeed et al., 2015). Although these models allow more fine-grained per-word metrics to be calculated, their dependence on an aggregate context vector makes it difficult to distinguish 'gist' or topic information from propositional content.

[Figure 1: Left-corner parser operations: a) lexical match (m_t = 1) and no-match (m_t = 0) operations, creating new apex a_t, and b) grammatical match (m^g_t = 1) and no-match (m^g_t = 0) operations, creating new apex a^g_t and base b^g_t.]
Unlike these models, our approach seeks to incorporate propositional content by augmenting a generative and incremental parser to build an ongoing representation of predicate context vectors, based on a categorial grammar formalism that captures both local and non-local predicate-argument structure. This processing model can be used to estimate per-word surprisal predictors that capture the influence of propositional content differentially from that of syntactic categories, which are devoid of propositional content. 1 Our experiments demonstrate that the incorporation of both propositional content and syntactic category information into the processing model significantly improves fit to self-paced reading times and eye-gaze durations over corresponding ablated models, suggesting their role in online sentence processing. In addition, we present exploratory work showing how our processing model can be utilized to examine differential effects of propositional content in memory-intensive filler-gap constructions.

1 Note that this distinction of propositional content as retained information about the meaning of a sentence and syntactic categories as unretained information about the form of a sentence may differ somewhat from notions of semantics and syntax that are familiar to computational linguists; in particular, predicates corresponding to lemmatized words fall on the content side of this division here because they are retained after processing, even though it may be common in NLP applications to use them in syntactic parsing.

Background
The experiments presented in this paper use surprisal predictors calculated by an incremental processing model based on a probabilistic left-corner parser (Johnson-Laird, 1983; van Schijndel et al., 2013). This incremental processing model provides a probabilistic account of sentence processing by making a single lexical attachment decision and a single grammatical attachment decision for each input word. 2 Surprisal can be defined as the negative log of the conditional probability of a word w_t given the sequence of preceding words w_{1..t-1}, marginalized over possible states q_t at each time step t:

S(w_t) = -log P(w_t | w_{1..t-1}) = -log [ Σ_{q_t} P(w_{1..t} q_t) / Σ_{q_{t-1}} P(w_{1..t-1} q_{t-1}) ]

These conditional probabilities can in turn be defined recursively using a transition model:

P(w_{1..t} q_t) = Σ_{q_{t-1}} P(w_t q_t | q_{t-1}) · P(w_{1..t-1} q_{t-1})

A probabilistic left-corner parser defines its transition model over possible working memory store states q_t = a^1_t/b^1_t, ..., a^D_t/b^D_t, each consisting of derivation fragments with an apex a^d_t and a base b^d_t. Each transition decomposes into a lexical attachment decision ℓ_t, the word w_t, a grammatical attachment decision g_t, and a resulting store state q_t:

P(w_t q_t | q_{t-1}) = P(ℓ_t | q_{t-1}) · P(w_t | q_{t-1} ℓ_t) · P(g_t | q_{t-1} ℓ_t w_t) · P(q_t | q_{t-1} ℓ_t w_t g_t)

As shown in Figure 1, the lexical attachment decision ℓ_t generates a new complete node a_t based on (m_t) whether the word matches the base of the most recent derivation fragment; and the grammatical attachment decision g_t generates a new derivation fragment a^g_t/b^g_t based on (m^g_t) whether the parent of a grammar rule with this new complete node as a left child matches the base of the most recent remaining derivation fragment.

[Figure 2: Lambda calculus expression for the propositional content of the sentence Many people eat pasta, using generalized quantifiers over discourse entities and eventualities: many (λx_1 some (λe_1 person e_1 x_1) (λe_1 true)) (λx_1 some (λx_3 some (λe_3 pasta e_3 x_3) (λe_3 true)) (λx_3 some (λe_2 eat e_2 x_1 x_3) (λe_2 true))).]
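The marginalization over store states can be sketched as a forward-style computation in which each step applies the transition model to every reachable state and per-word surprisal falls out of the ratio of successive prefix probabilities. The toy transition function below is a hypothetical, unnormalized stand-in, not the paper's model:

```python
import math

# beam maps a store state q to the prefix probability P(w_{1..t}, q).
def step(beam, word, transition):
    """One time step: P(w_{1..t+1}, q') = sum_q P(w, q' | q) P(w_{1..t}, q)."""
    new_beam = {}
    for q, prefix_prob in beam.items():
        for q_next, p in transition(q, word).items():
            new_beam[q_next] = new_beam.get(q_next, 0.0) + p * prefix_prob
    return new_beam

def toy_transition(q, word):
    # Hypothetical: each word either extends the state or leaves it
    # unchanged; the remaining mass is implicitly other continuations.
    mass = {"a": 0.3, "b": 0.1}.get(word, 0.0)
    return {q + "/" + word: mass, q: mass / 2}

beam = {"ROOT": 1.0}
prev_total = 1.0
surprisals = []
for w in ["a", "b"]:
    beam = step(beam, w, toy_transition)
    total = sum(beam.values())
    # Surprisal: negative log of the conditional word probability,
    # i.e. the ratio of successive marginalized prefix probabilities.
    surprisals.append(-math.log2(total / prev_total))
    prev_total = total
```

Exact marginalization over all store states is intractable for a real grammar, which is why the experiments below use beam search with a large beam as an approximation.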
The semantic processing model described in this paper extends the above left-corner parser to incorporate propositional content by conditioning lexical and grammatical decisions on sparse vectors of predicate contexts h_{a^d_t} and h_{b^d_t}, in addition to category labels c_{a^d_t} and c_{b^d_t}, in apex and base nodes a^d_t and b^d_t. These predicate context vectors for nodes in a derivation tree of a sentence can be defined in terms of argument positions of variables signified by these nodes in predicates of a logical form translation of that sentence. For example, in Figure 2, the variable e_2 (signified by the word eat) would have the predicate context EAT_0 because it is the zeroth (initial) participant of the predication (eat e_2 x_1 x_3). 3 Similarly, the variable x_3 would have both the predicate context PASTA_1, because it is the first participant (counting from zero) of the predication (pasta e_3 x_3), and the predicate context EAT_2, because it is the second participant (counting from zero) of the predication (eat e_2 x_1 x_3). These predicate contexts are obtained by reannotating the training corpus using a generalized categorial grammar of English (Nguyen et al., 2012), which is sensitive to syntactic valence and non-local dependencies.

[Figure 3: Derivation fragments resulting from example lexical decisions made at the word eat in the sentence People eat pasta. Note that the predicate contexts instead of predicate context vectors are displayed here for clarity. The predicate context PERSON_{1,-1} represents an eventuality that takes the first argument of a PERSON predicate as its first argument.]
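The mapping from a logical form to predicate contexts described above can be sketched directly: each variable receives one context per predication it participates in, consisting of the predicate name and the variable's zero-based argument position. A minimal illustration using the predications from Figure 2:

```python
# Sketch of deriving predicate contexts from a logical form: a
# variable's context is the predicate name plus the (zero-based)
# argument position the variable occupies in that predication.
def predicate_contexts(predications):
    contexts = {}
    for pred, args in predications:
        for i, var in enumerate(args):
            contexts.setdefault(var, set()).add((pred.upper(), i))
    return contexts

# Predications (eat e2 x1 x3), (person e1 x1), (pasta e3 x3)
# from the sentence "Many people eat pasta" in Figure 2.
lf = [("eat", ["e2", "x1", "x3"]),
      ("person", ["e1", "x1"]),
      ("pasta", ["e3", "x3"])]
ctx = predicate_contexts(lf)
# ctx["e2"] == {("EAT", 0)}; ctx["x3"] == {("PASTA", 1), ("EAT", 2)}
```

This reproduces the examples in the text: e_2 receives EAT_0, while x_3 receives both PASTA_1 and EAT_2.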
Lexical attachment probabilities. The probability of each lexical decision ℓ_t in this parser is therefore decomposed into one term for generating a match decision m_t and a predicate context vector h_{a_t}, and another term for generating a syntactic category label c_{a_t} for the new complete node a_t:

P(ℓ_t | q_{t-1}) = P(m_t h_{a_t} | q_{t-1}) · P(c_{a_t} | q_{t-1} m_t h_{a_t})

The probability of generating the match decision and the predicate context vector depends on the base node b^d_{t-1} of the previous derivation fragment, and is modeled by a feedforward neural network over embedded representations of that node's category label and predicate context vector:

P(m_t h_{a_t} | q_{t-1}) ≈ [FF(E_L (δ_{c_{b^d_{t-1}}} + h_{b^d_{t-1}}))]_{m_t, h_{a_t}}

where FF is a feedforward neural network, δ_i is a Kronecker delta vector consisting of a one at element i and zeros elsewhere, depth d = argmax_d {a^d_{t-1} ≠ ⊥} is the number of non-null derivation fragments at the previous time step, and E_L is a matrix of jointly trained dense embeddings for each syntactic category and predicate context. The probabilities of category labels are calculated using relative frequency estimation on training data, conditioned on the base node of the previous derivation fragment. The new complete node a_t then depends on the match decision m_t (see Figure 3).

Word probabilities. Probabilities for generating words are estimated as the probability of generating their character sequence using a recurrent neural network implementation of a character model.
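The shape of such a lexical decision distribution can be sketched as follows; the single-layer network, dimensions, and embedding matrix here are illustrative stand-ins for the trained FF network and E_L, not the paper's implementation:

```python
import math
import random

random.seed(0)

# Hypothetical sizes: 4 category/context types, embedding size 5, and
# 2 match outcomes x 3 predicate-context outcomes = 6 joint outcomes.
N_TYPES, EMB, OUT = 4, 5, 6
E_L = [[random.gauss(0, 1) for _ in range(N_TYPES)] for _ in range(EMB)]
W = [[random.gauss(0, 1) for _ in range(EMB)] for _ in range(OUT)]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def delta(i, n=N_TYPES):
    """Kronecker delta vector: a one at element i, zeros elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def lexical_decision_probs(cat_index, h_context):
    """Distribution over joint (match, predicate-context) outcomes,
    conditioned on the base node's category label and context vector."""
    # Embed the sparse category delta plus the sparse context vector,
    # then apply a one-layer "FF" and a softmax.
    x = matvec(E_L, [a + b for a, b in zip(delta(cat_index), h_context)])
    logits = matvec(W, x)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = lexical_decision_probs(2, [0.0, 1.0, 0.0, 0.5])
```

The key design point visible even in this sketch is that the category label and the predicate context vector enter the network through the same embedding matrix, which is what makes each channel separately ablatable later.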
Grammatical attachment probabilities. The probability of each grammatical decision g_t in this parser is similarly decomposed into a term for generating a match decision m^g_t and a composition operator o^g_t for a grammar rule, 4 and terms for category labels c_{a^g_t} and c_{b^g_t} at the apex and base nodes of the new derivation fragment. The probability of generating the match decision and the composition operator depends on the base node of the previous derivation fragment and the new complete node a_t, and is modeled analogously to the lexical decision, where E_G is a matrix of jointly trained dense embeddings for each syntactic category and predicate context. The probabilities of category labels c_{a^g_t} and c_{b^g_t} in Equation 7 are calculated using relative frequency estimation on training data, conditioned on the base node of the previous derivation fragment. The composition operator o^g_t in Equations 7 and 8 is associated with sparse composition matrices A_{o^g_t}, which can be used to compose predicate context vectors associated with the apex node a^g_t of the new derivation fragment, and sparse composition matrices B_{o^g_t}, which can be used to compose predicate context vectors associated with the base node b^g_t of the new derivation fragment (see Figure 4). These composition matrices allow predicate contexts to propagate appropriately through the tree, allowing parsing decisions to depend on predicates that may be several words away.

4 Examples of composition operators include using the predicate context of the left child as a modifier or an argument, as well as introducing or discharging filler-gap dependencies.
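How per-operator composition matrices propagate predicate contexts can be sketched with toy matrices; the matrices below are illustrative, not drawn from the grammar:

```python
# Sketch of composing predicate context vectors with per-operator
# matrices: A composes the context for the new apex node, B for the
# new base node. The matrices are illustrative, e.g. an "argument
# attachment" operator that reroutes/zeroes context dimensions.
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

h_child = [1.0, 0.0, 0.5]   # context vector of the new complete node

A_op = [[1.0, 0.0, 0.0],    # apex: keeps dim 0, moves dim 2 to dim 1
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0]]
B_op = [[1.0, 0.0, 0.0],    # base: passes the context through unchanged
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

h_apex = matvec(A_op, h_child)   # -> [1.0, 0.5, 0.0]
h_base = matvec(B_op, h_child)   # -> [1.0, 0.0, 0.5]
```

Because each composition operator owns its matrices, applying them at every grammatical decision lets a predicate introduced several words earlier remain visible in the context vector that conditions later decisions.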
Resulting store state probabilities. In order to update the store state based on the lexical and grammatical decisions, derivation fragments above the most recent nonterminal node are carried forward, and derivation fragments below it are set to null (⊥). Here the indicator function ⟦ϕ⟧ evaluates to 1 if ϕ is true and 0 otherwise, and the new depth is d = argmax_d {a^d_{t-1} ≠ ⊥} + 1 − m_t − m^g_t. Together, these probabilistic decisions generate the n unary branches and n − 1 binary branches of a parse tree in Chomsky normal form for an n-word sentence.
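The depth bookkeeping implied by d = (number of non-null fragments) + 1 − m_t − m^g_t can be sketched as a simple store update; the fragment representation here is a hypothetical simplification:

```python
# Sketch of the store-state update: fragments above the new nonterminal
# are carried forward, fragments below are set to None (the paper's ⊥).
# The new depth follows d' = d + 1 - m_t - m_g.
def update_store(store, new_fragment, m_lex, m_gram):
    depth = sum(frag is not None for frag in store)
    new_depth = depth + 1 - m_lex - m_gram
    new_store = store[: new_depth - 1] + [new_fragment]
    # Pad back to the fixed store size with null fragments.
    new_store += [None] * (len(store) - len(new_store))
    return new_store

# Two active derivation fragments (apex, base) and two empty slots.
store = [("S", "VP"), ("NP", "N"), None, None]

# Double match (m_t = 1, m_g = 1): depth shrinks from 2 to 1.
carried = update_store(store, ("NP", "NP"), 1, 1)
```

A double match pops a fragment, a double no-match pushes one, and one match of each kind leaves the depth unchanged, mirroring the four operation types in Figure 1.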

Isolating Content and Category Contributions
In order to examine the contribution of propositional content to the content-sensitive processing model, the model is modified so that it can be trained to make lexical and grammatical decisions without conditioning on the predicate context vectors, which are replaced by 0, a vector of zeros. Likewise, to examine the contribution of syntactic category information to the content-sensitive processing model, the model is modified so that it can be trained to make decisions without conditioning on the syntactic category labels. These two ablated models will respectively be referred to as the content- and category-ablated models in the following experiments.
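The ablation scheme amounts to zeroing one input channel while leaving the rest of the model unchanged. A deliberately tiny sketch, with an additive toy scoring function standing in for the network:

```python
# Sketch of the two ablations: the same scoring function is trained
# with one input channel replaced by a vector of zeros.
def score(cat_onehot, h_context):
    # Toy additive combination of the category and content channels.
    return [c + h for c, h in zip(cat_onehot, h_context)]

cat = [0.0, 1.0, 0.0]      # syntactic category label (one-hot)
h = [0.5, 0.0, 1.0]        # predicate context vector
zeros = [0.0, 0.0, 0.0]

full = score(cat, h)
content_ablated = score(cat, zeros)    # h replaced by zeros
category_ablated = score(zeros, h)     # category labels zeroed out
```

Because the architecture is untouched, any difference between the full and ablated models' surprisal estimates is attributable to the withheld information source rather than to a change in model capacity.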

In-domain Linguistic Accuracy
In order to assess the parsing performance of the content-sensitive processing model outlined in Section 2, a linguistic accuracy evaluation was conducted on the development set and test set (i.e. sections 22 and 23 respectively) of the Wall Street Journal (WSJ) corpus of the English Penn Treebank (Marcus et al., 1993). The performance of the content-sensitive processing model is compared to the incremental left-corner parser of van Schijndel et al. (2013), which is based on a PCFG with subcategorized syntactic categories from the Berkeley latent variable inducer (Petrov et al., 2006).
The content-sensitive processing model was trained on a generalized categorial grammar (Nguyen et al., 2012) reannotation of sections 02 to 21 of the WSJ corpus. To account for sensitivity to initial parameters, the average performance of the content-sensitive processing model trained using three different random seeds is reported. Likewise, the left-corner parser of van Schijndel et al. (2013) was trained on the same generalized categorial grammar reannotation of sections 02 to 21 of the WSJ corpus, using four iterations of the split-merge-smooth algorithm (Petrov et al., 2006). Both parsers used beam search decoding with a beam width of 5,000 to return the most likely sequence of parsing decisions. The unlabeled WSJ bracketing F1 scores from both parsers are presented in the WSJ22 and WSJ23 columns of the vS et al. and Full model rows of Table 1. 5 The results show that the two parsers achieve comparable performance on WSJ22 and WSJ23, indicating that the current processing model is a reasonable model of syntactic parsing.

Cross-Domain Linguistic Accuracy
The two parsers were also evaluated on the Natural Stories Corpus (Futrell et al., 2018). This corpus consists of 10 naturalistic stories (10,245 tokens) adapted from existing texts such as fairy tales and short stories. As can be seen in the NS column of the vS et al. and Full model rows of Table 1, parsing accuracy on this corpus is substantially lower. This is likely due to the "deceptively naturalistic" nature of the Natural Stories Corpus; this corpus was designed to over-represent rare words and syntactic constructions, therefore representing a different "syntactic domain" from the WSJ corpus. Interestingly, the content-sensitive processing model seems to generalize better to the Natural Stories domain than the model based on the Berkeley latent variable inducer. This could be the result of the latent-variable subcategorized syntactic categories overfitting to the WSJ domain.

Linguistic Accuracy of Ablated Models
To determine the differential effect of propositional content and syntactic categories, models with each of the propositional content and syntactic category components ablated (i.e. the content- and category-ablated models) were evaluated against the full processing model. 6 As with the full model, the ablated models were trained using three different random seeds to account for sensitivity to initial parameters. The results in the Con-ablated and Cat-ablated rows of Table 1 show substantial contributions of both components to parsing accuracy in all domains. On Natural Stories, bootstrap significance tests revealed that seven out of nine (3 × 3) pairwise comparisons between the full model and the content-ablated model, and all nine pairwise comparisons between the full model and the category-ablated model, were statistically significant at the p < 0.05 level; both results are highly significant overall by a binomial test.
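The binomial logic behind this overall significance claim is easy to verify: under the null hypothesis, each of the nine pairwise tests comes out significant with probability 0.05, so seven or more significant results is vanishingly unlikely. A small sketch:

```python
from math import comb

def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Under the null, each of the 9 pairwise tests is "significant" with
# probability 0.05; observing 7 (or all 9) significant results is then
# extremely unlikely, so the overall pattern is itself significant.
p_seven = binomial_sf(7, 9, 0.05)
p_nine = binomial_sf(9, 9, 0.05)
```

Both tail probabilities come out far below conventional thresholds, which is what licenses the "highly significant overall" wording above.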

Experiment 2: Self-paced Reading
In order to evaluate the contribution of propositional content and syntactic categories to predicting behavioral responses, surprisal predictors were calculated from the content-sensitive processing model and its two ablated versions, which are outlined in Section 3. Subsequently, linear mixed-effects models containing common baseline predictors and one or more surprisal predictors were fitted to self-paced reading times. Finally, a series of likelihood ratio tests (LRTs) was conducted in order to evaluate the contribution of the surprisal predictor from the full processing model to regression model fit.

Response Data
Experiments described in this paper used the Natural Stories Corpus (Futrell et al., 2018), which contains self-paced reading times from 181 subjects that read 10 naturalistic stories consisting of 10,245 tokens. The data were filtered to exclude observations corresponding to sentence-initial and sentence-final words, observations from subjects who answered fewer than four comprehension questions correctly, and observations with durations shorter than 100 ms or longer than 3000 ms. This resulted in a total of 768,584 observations, which were subsequently partitioned into an exploratory set of 383,906 observations and a held-out set of 384,678 observations. The partitioning allows model selection to be conducted on the exploratory set and a single hypothesis test to be conducted on the held-out set, thus eliminating the need for multiple trials correction. All observations were log-transformed prior to model fitting.

Predictors
The baseline predictors commonly included in all regression models are word length measured in characters, index of word position within each sentence, and 5-gram surprisal. The 5-gram surprisal predictor is calculated from a 5-gram language model estimated using the KenLM toolkit (Heafield et al., 2013) trained on the Gigaword 4 corpus (Parker et al., 2009). 7 In addition to the baseline predictors, surprisal predictors were calculated from the full content-sensitive processing model, the content-ablated model, and the category-ablated model trained as part of Experiment 1 (FullSurp, NoConSurp, and NoCatSurp). To account for the time the brain takes to process and respond to linguistic input, it is standard practice in psycholinguistic modeling to include 'spillover' variants of predictors from preceding words (Rayner et al., 1983; Vasishth, 2006). However, as including multiple spillover variants of predictors leads to identifiability issues in mixed-effects modeling (Shain and Schuler, 2019), the FullSurp, NoConSurp, and NoCatSurp predictors were all spilled over by one position. Moreover, preliminary analysis showed that the surprisal predictors are highly collinear, which may result in identifiability issues for the regression model if included together as predictors. In order to mitigate this problem, the differences between the surprisal predictors from each ablated model and those from the full model (∆ConSurp, ∆CatSurp) were also calculated as predictors that represent the contribution of the full model over an ablated model. All predictors were centered and scaled prior to model fitting.
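The construction of the difference and spillover predictors can be sketched as follows, with toy surprisal values standing in for the trained parsers' outputs:

```python
# Toy per-word surprisal values; real values come from the parsers.
full_surp  = [5.0, 7.0, 3.0, 6.0]   # FullSurp
nocon_surp = [4.0, 8.0, 3.5, 5.0]   # NoConSurp

# Difference predictor: the contribution of the full model over the
# ablated model at each word (here, DeltaConSurp).
delta_con = [f - a for f, a in zip(full_surp, nocon_surp)]

def spillover(xs, positions=1, pad=0.0):
    """Shift a predictor so that word t carries the value from word
    t - positions, modeling delayed behavioral response."""
    return [pad] * positions + xs[:-positions]

delta_con_s1 = spillover(delta_con)   # spilled over by one position
```

Using the difference rather than the raw FullSurp predictor sidesteps the collinearity between the full and ablated surprisal series while testing the same hypothesis.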

Likelihood Ratio Testing
Two sets of nested linear mixed-effects models were fitted to reading times in the held-out set using lme4 (Bates et al., 2015). The first set manipulated the contribution of propositional content by including ∆ConSurp in the full regression model over the base model that contains the baseline predictors and NoConSurp. Similarly, the second set manipulated the contribution of syntactic categories by including ∆CatSurp in the full regression model over a base model that contains the baseline predictors and NoCatSurp. All regression models included by-subject random slopes for all fixed effects and random intercepts for each word and subject-sentence interaction. Subsequently, a series of LRTs was conducted between nested regression models in order to assess the contribution of surprisal predictors from the full processing model to regression model fit. As there were three variants of each surprisal predictor, a total of nine (3 × 3) LRTs were performed for each ablated surprisal predictor. 8
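A likelihood ratio test between two nested models compares twice the difference in log-likelihood to a chi-square distribution whose degrees of freedom equal the number of added parameters. A sketch for the single-added-predictor case, with hypothetical log-likelihood values:

```python
from math import erfc, sqrt

def lrt_pvalue(loglik_base, loglik_full, df=1):
    """Likelihood ratio test between nested models. For df = 1 the
    chi-square survival function is erfc(sqrt(stat / 2))."""
    stat = 2.0 * (loglik_full - loglik_base)
    assert df == 1, "sketch handles the single-parameter case only"
    return stat, erfc(sqrt(stat / 2.0))

# Hypothetical log-likelihoods: a base model and a model that adds
# one surprisal predictor as a fixed effect.
stat, p = lrt_pvalue(-1002.3, -998.1)
# stat == 8.4, p ~ 0.0038: the added predictor improves fit
```

In practice the regressions above were fitted and compared with lme4's model-comparison machinery rather than by hand, and random slopes for the added predictor complicate the degrees-of-freedom accounting; the sketch shows only the core statistic.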

Results
The results show that the ∆CatSurp predictor made a statistically significant contribution to model fit over NoCatSurp in eight out of nine LRTs, 9 which is highly significant according to a binomial test (p < 0.001). In contrast, no significant contribution of ∆ConSurp over NoConSurp was observed, with none of the nine LRTs indicating significantly improved model fit. 10 This demonstrates that the full processing model captures the influence of propositional content and syntactic category information differentially, the latter of which contributed to predicting self-paced reading times.

8 Despite the risk of convergence issues, the LRTs were also replicated with full regression models that include raw FullSurp in addition to the baseline predictors and either NoConSurp or NoCatSurp.

9 Any LRT in which either the base or full regression model failed to converge was considered a null result. Regression models in one LRT failed to converge. In the replication using raw FullSurp, regression models in five LRTs failed to converge. However, the remaining four LRTs were statistically significant, which is highly significant according to a binomial test (p < 0.001).
10 Regression models in one LRT failed to converge. In the replication using raw FullSurp, regression models in five LRTs failed to converge, with the remaining four LRTs indicating non-significance. Additionally, removing 5-gram surprisal from the baseline did not change the pattern of significance.

Experiment 3: Eye-tracking Data
In order to examine whether the results observed in Experiment 2 generalize to other latency-based measures, linear mixed-effects models were fitted to data from the Dundee eye-tracking corpus (Kennedy et al., 2003). Following procedures similar to Experiment 2, a series of LRTs was conducted to test the contribution of propositional content and syntactic category information.

Procedures
The set of go-past durations from the Dundee Corpus (Kennedy et al., 2003) provided the response variable for the regression models. The Dundee Corpus contains gaze durations from 10 subjects that read 20 newspaper editorials consisting of 51,502 tokens. The data were filtered to exclude unfixated words, words following saccades longer than four words, and words at starts and ends of sentences, screens, documents, and lines. This resulted in the full set with a total of 195,296 observations, which were subsequently partitioned into an exploratory set of 97,391 observations and a held-out set of 97,905 observations. In the base regression models, word length in characters, index of word position in each sentence, and saccade length were included. Additionally, either NoConSurp or NoCatSurp spilled over by one position was included as a baseline predictor. Similarly to Experiment 2, the first set of LRTs examined the contribution of propositional content by including ∆ConSurp, and the second set of LRTs examined the contribution of syntactic category information by including ∆CatSurp in the full regression models.

Results
The results show that the ∆ConSurp predictor made a statistically significant contribution to model fit over NoConSurp in all nine LRTs. 11 A significant contribution of ∆CatSurp over NoCatSurp was observed as well, with three of the nine LRTs indicating significantly improved model fit (p = .008 according to a binomial test). 12 Interestingly, in contrast to Experiment 2, which showed a robust contribution of only syntactic category information to predicting self-paced reading times, a strong influence of propositional content on predicting eye-gaze durations is observed here. This corroborates the finding that the full processing model captures the distinct influence of propositional content and syntactic category information, the ablation of which results in qualitatively different predictions. In addition, this differential contribution of ∆ConSurp across self-paced reading and eye-tracking data suggests that self-paced reading times and eye-gaze durations may capture different aspects of online processing difficulty.

Experiment 4: Filler-gap Constructions
Observing that surprisal from the full processing model did not contribute significantly to predicting broad-coverage self-paced reading times on top of its content-ablated counterpart in Experiment 2, we focus on filler-gap constructions, 13 in which information about the extracted object is thought to strongly influence the processing of the verb. In order to explore the extent to which integration costs associated with filler-gap constructions could be explained by the influence of propositional content, a series of LRTs were conducted to assess the contribution of surprisal from the full processing model to predicting reading times of object-extracted verbs.

Procedures
The subset of self-paced reading times from the Natural Stories Corpus corresponding to object-extracted verbs provided the response variable for the regression models. The object-extracted verbs were identified using a version of the Natural Stories Corpus that had been reannotated using a deep syntactic annotation scheme (Shain et al., 2018). Applying the same data exclusion criteria as Experiment 2 resulted in an exploratory set of 1,537 observations and a held-out set of 1,523 observations. As the number of data points for regression model fitting was substantially smaller in comparison to the full set used in Experiment 2, the regression models had to be simplified for reliable convergence. First, the 5-gram surprisal predictor was excluded, as its effect estimate was not stable on the exploratory set. In addition, the random effects structure was simplified to include only the by-subject random intercept. In the base regression models, word length in characters, index of word position within each sentence, and NoConSurp were fitted to the log-transformed reading times in the held-out set. The contribution of propositional content was incorporated by including FullSurp in the full regression models. NoConSurp and FullSurp were spilled over by one position, and all predictors were centered and scaled. The same three variants of each surprisal predictor were used, which resulted in a total of nine LRTs testing the contribution of FullSurp.

Results
The results showed that the FullSurp predictor made a statistically significant contribution to model fit over NoConSurp in all nine LRTs. The inclusion of FullSurp consistently improved model fit, indicating that integration costs associated with object-extracted filler-gap constructions can be partially explained by the influence of propositional content.

Conclusion
This paper presents a generative and incremental content-sensitive processing model which factors the contributions of propositional content and syntactic category information. This model can be cleanly ablated to calculate surprisal predictors that differentially isolate the influence of the two components. Subsequent experiments demonstrate the utility of both components in predicting human behavioral responses: the inclusion of propositional content resulted in significantly better fits to broad-coverage eye-gaze durations and to self-paced reading times of object-extracted verbs. Additionally, the inclusion of syntactic category information significantly improved fits to both broad-coverage self-paced reading times and eye-gaze durations. Taken together, these results suggest a role for propositional content and syntactic category information in incremental sentence processing.


Acknowledgments

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.