Finding syntax in human encephalography with beam search

Recurrent neural network grammars (RNNGs) are generative models of (tree, string) pairs that rely on neural networks to evaluate derivational choices. Parsing with them using beam search yields a variety of incremental complexity metrics such as word surprisal and parser action count. When used as regressors against human electrophysiological responses to naturalistic text, they derive two amplitude effects: an early peak and a P600-like later peak. By contrast, a non-syntactic neural language model yields no reliable effects. Model comparisons attribute the early peak to syntactic composition within the RNNG. This pattern of results recommends the RNNG+beam search combination as a mechanistic model of the syntactic processing that occurs during normal human language comprehension.


Introduction
Computational psycholinguistics has "always been...the thing that computational linguistics stood the greatest chance of providing to humanity" (Kay, 2005). Within this broad area, cognitively-plausible parsing models are of particular interest. They are mechanistic computational models that, at some level, do the same task people do in the course of ordinary language comprehension. As such, they offer a way to gain insight into the operation of the human sentence processing mechanism (for a review see Hale, 2017).
As Keller (2010) suggests, a promising place to look for such insights is at the intersection of (a) incremental processing, (b) broad coverage, and (c) neural signals from the human brain.
The contribution of the present paper is situated precisely at this intersection. It combines a probabilistic generative grammar (RNNG; Dyer et al., 2016) with a parsing procedure that uses this grammar to manage a collection of syntactic derivations as it advances from one word to the next (cf. Roark, 2004). Via well-known complexity metrics, the intermediate states of this procedure yield quantitative predictions about language comprehension difficulty. Juxtaposing these predictions against data from human encephalography (EEG), we find that they reliably derive several amplitude effects including the P600, which is known to be associated with syntactic processing (e.g. Osterhout and Holcomb, 1992).
Comparison with language models based on long short term memory networks (LSTM, e.g. Hochreiter and Schmidhuber, 1997; Mikolov, 2012; Graves, 2012) shows that these effects are specific to the RNNG. A further analysis pinpoints one of these effects to RNNGs' syntactic composition mechanism. These positive findings reframe earlier null results regarding the syntax-sensitivity of human processing (Frank et al., 2015). They extend work with eyetracking (e.g. Roark et al., 2009; Demberg et al., 2013) and neuroimaging (Brennan et al., 2016; Bachrach, 2008) to higher temporal resolution.¹ Perhaps most significantly, they establish a general correspondence between a computational model and electrophysiological responses to naturalistic language.
Following this Introduction, section 2 presents recurrent neural network grammars, emphasizing their suitability for incremental parsing. Section 3 then reviews a previously-proposed beam search procedure for them. Section 4 goes on to introduce the novel application of this procedure to human data via incremental complexity metrics. Section 5 explains how these theoretical predictions are specifically brought to bear on EEG data using regression. Sections 6 and 7 elaborate on the model comparison mentioned above and report the results in a way that isolates the operative element. Section 8 discusses these results in relation to established computational models. The conclusion, to anticipate section 9, is that syntactic processing can be found in naturalistic speech stimuli if ambiguity resolution is modeled as beam search.
¹ Magnetoencephalography also offers high temporal resolution and as such this work fits into a tradition that includes Wehbe et al. (2014), van Schijndel et al. (2015), Wingfield et al. (2017) and Brennan and Pylkkänen (2017).

Recurrent neural network grammars
Recurrent neural network grammars (RNNGs; Dyer et al., 2016) are probabilistic models that generate trees. The probability of a tree is decomposed via the chain rule in terms of derivational action probabilities that are conditioned upon previous actions, i.e. they are history-based grammars (Black et al., 1993). In the vanilla version of RNNG, these steps follow a depth-first traversal of the developing phrase structure tree. This entails that daughters are announced bottom-up one by one as they are completed, rather than being predicted at the same time as the mother.
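The chain-rule decomposition described above can be written out as follows. The notation is an assumption introduced here for illustration: x is the word string, y the tree, and a_1, ..., a_n the sequence of derivational actions that jointly generates them.

```latex
p(x, y) \;=\; \prod_{t=1}^{n} p\left(a_t \mid a_1, \ldots, a_{t-1}\right)
```

Each factor conditions on the full action history, which is what makes the model history-based rather than context-free.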
Each step of this generative story depends on the state of a stack, depicted inside the gray box in Figure 1. This stack is "neuralized" such that each stack entry corresponds to a numerical vector. At each stage of derivation, a single vector summarizing the entire stack is available in the form of the final state of a neural sequence model. This is implemented using the stack LSTMs of Dyer et al. (2015). These stack-summary vectors (central rectangle in Figure 1) allow RNNGs to be sensitive to aspects of the left context that would be masked by independence assumptions in a probabilistic context-free grammar. In the present paper, these stack-summaries serve as input to a multi-layer perceptron whose output is converted via softmax into a categorical distribution over three possible parser actions: open a new constituent, close off the latest constituent, or generate a word. A hard decision is made, and if the first or last option is selected, then the same vector-valued stack-summary is again used, via multilayer perceptrons, to decide which specific nonterminal to open, or which specific word to generate.
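The action-type decision described above can be illustrated with a minimal sketch. The action names and the example scores are hypothetical; in the model the scores would come from the multi-layer perceptron applied to the stack-summary vector, which is not reproduced here.

```python
import math

# Hypothetical labels for the three action types named in the text.
ACTIONS = ("OPEN", "CLOSE", "GENERATE")

def softmax(scores):
    """Convert real-valued scores into a categorical distribution."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(scores):
    """Return the highest-probability action together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best], dict(zip(ACTIONS, probs))

# Made-up scores standing in for the perceptron output.
action, dist = choose_action([0.2, -1.0, 1.5])
```

If the chosen action is OPEN or GENERATE, a second classifier over nonterminals or words would be consulted in the same way.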
Phrase-closing actions trigger a syntactic composition function (depicted in Figure 2) which squeezes a sequence of subtree vectors into one single vector. This happens by applying a bidirectional LSTM to the list of daughter vectors, prepended with the vector for the mother category following §4.1 of Dyer et al. (2016).
The parameters of all these components are adaptively adjusted using backpropagation at training time, minimizing the cross entropy relative to a corpus of trees. At testing time, we parse incrementally using beam search as described below in section 3.

Word-synchronous beam search
Beam search is one way of addressing the search problem that arises with generative grammars, constructive accounts of language that are sometimes said to "strongly generate" sentences. Strong generation in this sense simply means that they derive both an observable word-string as well as a hidden tree structure. Probabilistic grammars are joint models of these two aspects. By contrast, parsers are programs intended to infer a good tree from a given word-string. In incremental parsing with history-based models this inference task is particularly challenging, because a decision that looks wise at one point may end up looking foolish in light of future words. Beam search addresses this challenge by retaining a collection called the "beam" of parser states at each word. These states are rated by a score that is related to the probability of a partial derivation, allowing an incremental parser to hedge its bets against temporary ambiguity. If the score of one analysis suddenly plummets after seeing some word, there may still be others within the beam that are not so drastically affected. This idea of ranked parallelism has become central in psycholinguistic modeling (see e.g. Gibson, 1991; Narayanan and Jurafsky, 1998; Boston et al., 2011).
As Stern and colleagues observe, the most straightforward application of beam search to generative models like RNNG does not perform well. This is because lexical actions, which advance the analysis onwards to successive words, are assigned much lower probabilities than structural actions, which do not advance to the next word. This imbalance is inevitable in a probability model that strongly generates sentences, and it causes naive beam-searchers to get bogged down, proposing more and more phrase structure rather than moving on through the sentence. To address it, Stern and colleagues propose a word-synchronous variant of beam search. This variant keeps searching through structural actions until "enough" high-scoring parser states finally take a lexical action, arriving in synchrony at the next word of the sentence. Their procedure is written out as Algorithm 1.
Algorithm 1: Word-synchronous beam search with fast-tracking. After Stern and colleagues.
1: thisword ← input beam
2: nextword ← ∅
3: while |nextword| < k do

In Algorithm 1 the beam is held in a set-valued variable called nextword. Beam search continues until this set's cardinality reaches the designated action beam size, k. If the beam still isn't large enough (line 3) then the search process explores one more action by going around the while-loop again. Each time through the loop, lexical actions compete against structural actions for a place among the top k (line 5). The imbalance mentioned above makes this competition fierce, and on many loop iterations nextword may not grow by much. Once there are enough parser states, another threshold called the word beam, k_word, kicks in (line 15). This other threshold sets the number of analyses that are handed off to the next invocation of the algorithm. In the study reported here the word beam remains at the default setting suggested by Stern and colleagues, k/10.

Stern and colleagues go on to offer a modification of the basic procedure called "fast tracking" which improves performance, particularly when the action beam k is small. Under fast tracking, an additional step is added between lines 4 and 5 of Algorithm 1 such that some small number k_ft of parser states are promoted directly into nextword. These states are required to come via a lexical action, but in the absence of fast tracking they quite possibly would have failed the thresholding step in line 5.

[Table 1: Penn Treebank accuracies by action beam size; columns k=100, k=200, k=400, k=600, k=800, k=1000, k=2000; rows include Fried et al. (2017).]

Table 1 shows Penn Treebank accuracies for this word-synchronous beam search procedure, as applied to RNNG. As expected, accuracy goes up as the parser considers more and more analyses. Above k = 200, the RNNG+beam search combination outperforms a conditional model based on greedy decoding (88.9).
This demonstration reemphasizes the point, made by Brants and Crocker (2000) among others, that cognitively-plausible incremental processing can be achieved without loss of parsing performance.
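The word-synchronous loop described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the study: the State record, the successors callback, and the toy scoring model are all hypothetical, fast tracking is omitted, and the guard against an empty thisword set is a safeguard added here.

```python
import heapq
import math
from typing import Callable, List, NamedTuple, Tuple

class State(NamedTuple):
    score: float               # log probability of the partial derivation
    actions: Tuple[str, ...]   # derivation history
    lexical: bool              # True if the most recent action generated a word

def word_sync_step(beam: List[State],
                   successors: Callable[[State], List[State]],
                   k: int, k_word: int):
    """Advance a beam by one word; return (new beam, passes through the loop)."""
    thisword = list(beam)
    nextword: List[State] = []
    passes = 0
    # Keep exploring until k states have taken a lexical action.
    while len(nextword) < k and thisword:
        passes += 1
        fringe = [t for s in thisword for t in successors(s)]
        fringe = heapq.nlargest(k, fringe, key=lambda s: s.score)  # action beam
        thisword = []
        for s in fringe:
            if s.lexical:
                nextword.append(s)   # arrived at the next word
            else:
                thisword.append(s)   # still building structure
    # Word beam: hand only the best k_word analyses to the next invocation.
    return heapq.nlargest(k_word, nextword, key=lambda s: s.score), passes

def toy_successors(s: State) -> List[State]:
    """Toy model: generate a word (p=0.3) or open a phrase (p=0.7, at most 3 in a row)."""
    out = [State(s.score + math.log(0.3), s.actions + ("GEN",), True)]
    opens = 0
    for a in reversed(s.actions):
        if a != "OPEN":
            break
        opens += 1
    if opens < 3:
        out.append(State(s.score + math.log(0.7), s.actions + ("OPEN",), False))
    return out

new_beam, passes = word_sync_step([State(0.0, (), False)], toy_successors, k=3, k_word=2)
```

In this toy run the structural action is always more probable than the lexical one, so several passes through the loop are needed before enough states reach the next word.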

Complexity metrics
In order to relate computational models to measured human responses, some sort of auxiliary hypothesis or linking rule is required. In the domain of language, these are traditionally referred to as complexity metrics because of the way they quantify the "processing complexity" of particular sentences. When a metric offers a prediction on each successive word, it is an incremental complexity metric. Table 2 characterizes four incremental complexity metrics that are all obtained from intermediate states of Algorithm 1. The metric denoted DISTANCE is the most classic; it is inspired by the count of "transitions made or attempted" in Kaplan (1972). It quantifies syntactic work by counting the number of parser actions explored by Algorithm 1 between each word i.e. the number of times around the while-loop on line 3. The information theoretical quantities SURPRISAL and ENTROPY came into more widespread use later.
They quantify unexpectedness and uncertainty, respectively, about alternative syntactic analyses at a given point within a sentence. Hale (2016) reviews their applicability across many different languages, psycholinguistic measurement techniques and grammatical models. Recent work proposes possible relationships between these two metrics, at the empirical as well as theoretical level (van Schijndel and Schuler, 2017; Cho et al., 2018).
The SURPRISAL metric was computed over the word beam, i.e. the k_word highest-scoring partial syntactic analyses at each successive word. In an attempt to obtain a more faithful estimate, ENTROPY and its first-difference are computed over nextword itself, whose size varies but is typically much larger than k_word.
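One way such beam-based quantities might be computed is sketched below. The definitions are assumptions made for illustration: surprisal is taken as the log-ratio of summed beam probabilities at successive words, entropy as the Shannon entropy of the renormalized beam, and the sign convention of the first difference (current minus previous) is a guess. The example log probabilities are made up.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def surprisal(prev_beam_logps, curr_beam_logps):
    """Assumed definition: -log of the ratio of summed beam probabilities (nats)."""
    return logsumexp(prev_beam_logps) - logsumexp(curr_beam_logps)

def entropy(beam_logps):
    """Shannon entropy (nats) of the beam after renormalizing it to sum to one."""
    z = logsumexp(beam_logps)
    return -sum(math.exp(lp - z) * (lp - z) for lp in beam_logps)

def entropy_delta(prev_beam_logps, curr_beam_logps):
    """Assumed sign convention: current entropy minus previous entropy."""
    return entropy(curr_beam_logps) - entropy(prev_beam_logps)

# Made-up beams at two successive words.
prev = [math.log(0.5), math.log(0.25)]      # total probability mass 0.75
curr = [math.log(0.125), math.log(0.125)]   # total probability mass 0.25
s = surprisal(prev, curr)                   # log(0.75 / 0.25) = log 3
h = entropy(curr)                           # two equiprobable analyses: log 2
```

Because the beam retains only part of the probability mass, both values are estimates whose quality depends on beam size.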

Regression models of naturalistic EEG
Electroencephalography (EEG) is an experimental technique that measures very small voltage fluctuations on the scalp. For a review emphasizing its implications vis-à-vis computational models, see Murphy et al. (2018).
We analyzed EEG recordings from 33 participants as they passively listened to a spoken recitation of the first chapter of Alice's Adventures in Wonderland.² This auditory stimulus was delivered via earphones in an isolated booth. All participants scored significantly better than chance on a post-session 8-question comprehension quiz. An additional ten datasets were excluded for not meeting this behavioral criterion, six due to excessive noise, and three due to experimenter error. All participants provided written informed consent under the oversight of the University of Michigan HSBS Institutional Review Board (#HUM00081060) and were compensated $15/h.³ Data were recorded at 500 Hz from 61 active electrodes (impedances < 25 kΩ) and divided into 2129 epochs, spanning −0.3 to 1 s around the onset of each word in the story. Ocular artifacts were removed using ICA, and remaining epochs with excessive noise were excluded. The data were filtered from 0.5-40 Hz, baseline corrected against a 100 ms pre-word interval, and separated into epochs for content words and epochs for function words because of interactions between parsing variables of interest and word-class (Roark et al., 2009).
Linear regression was used per-participant, at each time-point and electrode, to identify content-word EEG amplitudes that correlate with predictions derived from the RNNG+beam search combination via the complexity metrics in Table 2. We refer to these time series as Target predictors.
Each Target predictor was included in its own model, along with several Control predictors that are known to influence sentence processing: sentence order, word-order in sentence, log word frequency (Lund and Burgess, 1996), frequency of the previous and subsequent word, and acoustic sound power averaged over the first 50 ms of the epoch.
All predictors were mean-centered. We also constructed null regression models in which the rows of the design matrix were randomly permuted.⁴ β coefficients for each effect were tested against these null models at the group level across all electrodes from 0-1 seconds post-onset, using a non-parametric cluster-based permutation test to correct for multiple comparisons across electrodes and time-points (Maris and Oostenveld, 2007).
² https://tinyurl.com/alicedata
³ A separate analysis of these data appears in Brennan and Hale (2018); datasets are available from JRB.
⁴ Temporal auto-correlation across epochs could impact model fits. Content-words are spaced 1 s apart on average and a spot-check of the residuals from these linear models indicates that they do not show temporal auto-correlation: AR(1) < 0.1 across subjects, time-points, and electrodes.
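The idea of comparing a fitted coefficient against row-permuted null models can be illustrated with a deliberately simplified sketch. It assumes a single mean-centered Target predictor at one time-point and electrode, with no Control predictors, and it does not perform the group-level cluster-based test; the function names and the data are hypothetical.

```python
import random
import statistics

def center(xs):
    """Mean-center a list of values."""
    mu = statistics.fmean(xs)
    return [x - mu for x in xs]

def slope(x, y):
    """Least-squares coefficient for y regressed on x, both mean-centered."""
    xc, yc = center(x), center(y)
    return sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)

def null_slopes(x, y, n_perm, seed=0):
    """Coefficients obtained after randomly permuting the predictor's rows."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_perm):
        xp = list(x)
        rng.shuffle(xp)          # break the pairing between predictor and amplitude
        out.append(slope(xp, y))
    return out

# Made-up data: per-epoch predictor values and amplitudes that depend on them linearly.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta = slope(x, y)
nulls = null_slopes(x, y, n_perm=50)
```

The permuted coefficients give a reference distribution against which the observed coefficient can be judged.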

Language models for literary stimuli
We compare the fit against EEG data for models that are trained on the same amount of textual data but differ in the explicitness of their syntactic representations.
At the low end of this scale is the LSTM language model. Models of this type treat sentences as a sequence of words, leaving it up to backpropagation to decide whether or not to encode syntactic properties in a learned history vector (Linzen et al., 2016). We use SURPRISAL from the LSTM as a baseline.
RNNGs are higher on this scale because they explicitly build a phrase structure tree using a symbolic stack. We consider as well a degraded version, RNNG−comp, which lacks the composition mechanism shown in Figure 2. This degraded version replaces the stack with initial substrings of bracket expressions, following Choe and Charniak (2016) and Vinyals et al. (2015). An example would be the length 7 string shown below:
(S | (NP | the | hungry | cat | )NP | (VP
Here, vertical lines separate symbols whose vector encoding would be considered separately by RNNG−comp. In this degraded representation, the noun phrase is not composed explicitly. It takes up five symbols rather than one. The balanced parentheses (NP and )NP are rather like instructions for some subsequent agent who might later perform the kind of syntactic composition that occurs online in RNNGs, albeit in an implicit manner.
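The symbol counts in this example can be checked with a small sketch that linearizes a tree into bracket symbols. The tuple-based tree encoding and the linearize function are hypothetical conveniences, not part of either model.

```python
def linearize(tree):
    """Flatten a (label, children) tree into labeled bracket symbols and words."""
    if isinstance(tree, str):          # a bare word
        return [tree]
    label, children = tree
    symbols = ["(" + label]            # labeled opening bracket
    for child in children:
        symbols.extend(linearize(child))
    symbols.append(")" + label)        # labeled closing bracket
    return symbols

# The completed noun phrase from the example.
noun_phrase = ("NP", ["the", "hungry", "cat"])
np_symbols = linearize(noun_phrase)

# Prefix of the sentence: S is open, the NP is complete, and VP has just been opened.
prefix = ["(S"] + np_symbols + ["(VP"]
```

With explicit composition the five noun-phrase symbols would instead be replaced by a single vector on the stack.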
In all cases, these language models were trained on chapters 2-12 of Alice's Adventures in Wonderland. This comprises 24941 words. The stimulus that participants heard during EEG data collection, for which the metrics in Table 2 are calculated, was chapter 1 of the same book, comprising 2169 words.
RNNGs were trained to match the output trees provided by the Stanford parser (Klein and Manning, 2003). These trees conform to the Penn Treebank annotation standard but do not explicitly mark long-distance dependency or include any empty categories. They seem to adequately represent basic syntactic properties such as clausal embedding and direct objecthood; nevertheless we did not undertake any manual correction.
During RNNG training, the first chapter was used as a development set, proceeding until the per-word perplexity over all parser actions on this set reached a minimum, 180. This performance was obtained with an RNNG whose state vector was 170 units wide. The corresponding LSTM language model state vector had 256 units; it reached a per-word perplexity of 90.2. Of course the RNNG estimates the joint probability of both trees and words, so these two perplexity levels are not directly comparable. Hyperparameter settings were determined by grid search in a region near the one which yielded good performance on the Penn Treebank benchmark reported in Table 1.

Results
To explore the suitability of the RNNG + beam search combination as a cognitive model of language processing difficulty, we fitted regression models as described above in section 5 for each of the metrics in Table 2. We considered six beam sizes k ∈ {100, 200, 400, 600, 800, 1000}. Table 3 summarizes statistical significance levels reached by these Target predictors; no other combinations reached statistical significance.

Table 3: Statistical significance of fitted Target predictors in Whole-Head analysis. p_cluster values are minima for each Target with respect to a Monte Carlo cluster-based permutation test (Maris and Oostenveld, 2007).

Whole-Head analysis
Surprisal from the LSTM sequence model did not reliably predict EEG amplitude at any timepoint or electrode. The DISTANCE predictor did derive a central positivity around 600 ms post-word onset as shown in Figure 3a. SURPRISAL predicted an early frontal positivity around 250 ms, shown in Figure 3b. ENTROPY and ENTROPY ∆ seemed to drive effects that were similarly early and frontal, although negative-going (not depicted); the effect for ENTROPY ∆ localized to just the left side.

Region of Interest analysis
We compared RNNG to its degraded cousin, RNNG−comp, in three regions of interest shown in Figure 4. These regions are defined by a selection of electrodes and a time window whose zero-point corresponds to the onset of the spoken word in the naturalistic speech stimulus. Regions "N400" and "P600" are well-known in EEG research, while "ANT" is motivated by findings with a PCFG baseline reported by Brennan and Hale (2018).
Single-trial data were averaged across electrodes and time-points within each region and fit with a linear mixed-effects model with fixed effects as described below and random intercepts by-subjects (Alday et al., 2017). We used a stepwise likelihood-ratio test to evaluate whether individual Target predictors from the RNNG significantly improved over RNNG−comp, and whether an RNNG−comp model significantly improved over a baseline regression model. The baseline regression model, denoted ∅, contains the Control predictors described in section 5 and SURPRISAL from the LSTM sequence model. Targets represent each of the eight reliable whole-head effects detailed in Table 3. These 24 tests (eight effects by three regions) motivate a Bonferroni correction of α = 0.05/24 ≈ 0.002. Statistically significant results were obtained for DISTANCE from RNNG−comp in the P600 region and for SURPRISAL from RNNG in the ANT region. No significant results were observed in the N400 region. These results are detailed in Table 4.
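The arithmetic behind the corrected threshold, and the general shape of a likelihood-ratio comparison between nested models, can be sketched as follows. The sketch assumes that the larger model adds exactly one parameter, so that the test statistic is referred to a chi-square distribution with one degree of freedom; the log-likelihood values are made up and the function names are hypothetical.

```python
import math

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def lrt_pvalue_df1(loglik_small, loglik_big):
    """p-value of a likelihood-ratio test for one added parameter (assumed df = 1)."""
    stat = 2.0 * (loglik_big - loglik_small)
    if stat <= 0.0:
        return 1.0
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

threshold = bonferroni_alpha(0.05, 24)      # eight effects by three regions
p = lrt_pvalue_df1(-1000.0, -994.0)         # made-up log-likelihoods
significant = p < threshold
```

A comparison counts as significant only if its p-value falls below the corrected threshold rather than the conventional 0.05.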

Discussion
Since beam search explores analyses in descending order of probability, DISTANCE and SURPRISAL ought to be yoked, and indeed they are correlated at r = 0.33 or greater across all of the beam sizes k that we considered in this study. However they are reliably associated with different EEG effects. SURPRISAL manifests at anterior electrodes relatively early. This seems to be a different effect from that observed by Frank et al. (2015). Frank and colleagues relate N400 amplitude to word surprisals from an Elman-net, analogous to the LSTM sequence model evaluated in this work. Their study found no effects of syntax-based predictors over and above sequential ones. In particular, no effects emerged in the 500-700 ms window, where one might have expected a P600. The present results, by contrast, show that an explicitly syntactic model can derive the P600 quite generally via DISTANCE. The absence of an N400 effect in this analysis could be attributable to the choice of electrodes, or perhaps the modality of the stimulus narrative, i.e. spoken versus read.
The model comparisons in Table 4 indicate that the early peak, but not the later one, is attributable to the RNNG's composition function. Choe and Charniak's (2016) "parsing as language modeling" scheme potentially could explain the P600-like wave, but it would not account for the earlier peak. This earlier peak is the one derived by the RNNG under SURPRISAL, but only when the RNNG includes the composition mechanism depicted in Figure 2.
This pattern of results suggests an approach to the overall modeling task. In this approach, both grammar and processing strategy remain the same, and alternative complexity metrics, such as SURPRISAL and DISTANCE, serve to interpret the unified model at different times or places within the brain. This inverts the approach of Brouwer et al. (2017) and Wehbe et al. (2014) who interpret different layers of the same neural net using the same complexity metric.

Conclusion
Recurrent neural net grammars indeed learn something about natural language syntax, and what they learn corresponds to indices of human language processing difficulty that are manifested in electroencephalography. This correspondence, between computational model and human electrophysiological response, follows from a system that lacks an initial stage of purely string-based processing. Previous work was "two-stage" in the sense that the generative model served to rerank proposals from a conditional model (Dyer et al., 2016). If this one-stage model is cognitively plausible, then its simplicity undercuts arguments for string-based perceptual strategies such as the Noun-Verb-Noun heuristic (for a textbook presentation see Townsend and Bever, 2001). Perhaps, as Phillips (2013) suggests, these are unnecessary in an adequate cognitive model. Certainly, the road is now open for more fine-grained investigations of the order and timing of individual parsing operations within the human sentence processing mechanism.

Table 4: Likelihood-ratio tests indicate that regression models with predictors derived from RNNGs with syntactic composition (see Figure 2) do a better job than their degraded counterparts in accounting for the early peak in region "ANT" (right-hand columns). Similar comparisons in the "P600" region show that the model improves, but the improvement does not reach the α = 0.002 significance threshold imposed by our Bonferroni correction (bold-faced text). RNNGs lacking syntactic composition do improve over a baseline model (∅) containing lexical predictors and an LSTM baseline (left-hand columns).