Dynamic encoding of structural uncertainty in gradient symbols

An important achievement in modeling online language comprehension is the discovery of the relationship between processing difficulty and surprisal (Hale, 2001; Levy, 2008). However, it is not clear how structural uncertainty can be represented and updated in a continuous-time, continuous-state dynamical system model, a reasonable abstraction of neural computation. In this study, we investigate the Gradient Symbolic Computation (GSC) model (Smolensky et al., 2014) and show how it can dynamically encode and update structural uncertainty via the gradient activation of symbolic constituents. We claim that surprisal is closely related to the amount of change in the optimal activation state driven by a new word input. In a simulation study, we demonstrate that the GSC model implementing a simple probabilistic symbolic grammar can simulate the effect of surprisal on processing time. Our model provides a mechanistic account of the effect of surprisal, bridging between probabilistic symbolic models and subsymbolic connectionist models.


Introduction
A core computational problem in online language comprehension is to deal with local ambiguity, the one-to-many mapping from a unit symbol $w_k$ (e.g., a word) to symbol strings containing $w_k$ at the $k$-th position, $W^{*}_k = \cdots w_k \cdots$, and their interpretations $S$ (e.g., sentences and their parses). Rational models of sentence comprehension solve this problem by computing $P(S \mid W_k)$, a conditional probability of interpretations given a partial string of symbols (henceforth, prefix) $W_k = w_1 \cdots w_k$, and updating it discretely for every new symbol input (Jurafsky, 1996; Hale, 2001; Levy, 2008). We will refer to this class of incremental processing models simply as (structural) probabilistic models.
The probabilistic model has drawn a lot of attention because it predicts processing difficulty in different regions of a sentence based on information-theoretic complexity metrics. The surprisal hypothesis (e.g., Hale, 2001; Levy, 2008) claims that the reading time of $w_k$ (as a measure of processing difficulty) is proportional to its surprisal, $-\log P(w_k \mid W_{k-1})$, or equivalently, the Kullback-Leibler (KL) divergence of $P(S \mid W_k)$ from $P(S \mid W_{k-1})$ (Levy, 2008). This hypothesis has been supported in many psycholinguistic experiments (e.g., Boston et al., 2008; Demberg and Keller, 2008; Smith and Levy, 2013).
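The equivalence between surprisal and the KL divergence can be checked numerically. The sketch below is only an illustration: the four sentence labels and their probabilities are hypothetical values chosen for the example, and it assumes that each complete parse determines its word string, so that conditioning on a longer prefix simply renormalizes over the surviving parses.

```python
import math

# Hypothetical prior over complete two-word sentences (illustrative values only).
prior = {"AB": 0.4, "AC": 0.1, "DB": 0.25, "DC": 0.25}

def condition(dist, prefix):
    """Return P(S | prefix): keep sentences starting with prefix, renormalize."""
    kept = {s: p for s, p in dist.items() if s.startswith(prefix)}
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}

def surprisal(dist, prefix, word):
    """-log P(word | prefix), computed from the sentence distribution."""
    p_prefix = sum(p for s, p in dist.items() if s.startswith(prefix))
    p_next = sum(p for s, p in dist.items() if s.startswith(prefix + word))
    return -math.log(p_next / p_prefix)

def kl(p, q):
    """KL divergence D(p || q); terms with p(s) = 0 contribute nothing."""
    return sum(ps * math.log(ps / q[s]) for s, ps in p.items() if ps > 0)

before = condition(prior, "A")    # P(S | W_1)
after = condition(prior, "AB")    # P(S | W_2)
print(surprisal(prior, "A", "B"), kl(after, before))
```

The two printed numbers agree up to floating-point error, because every parse compatible with the longer prefix is also compatible with the shorter one.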
In this study, our goal is to provide a neurally plausible, mechanistic account of the relationship between surprisal and processing time. For our purpose, we need a model from which both kinds of information, $P(S \mid W_k)$ and the processing time of $w_k$, can be collected directly without relying on stipulated linking hypotheses. Since the model is a dynamical system, processing time is directly modeled. To model the probability $P(S \mid W_k)$ relevant for rational analysis, we treat the model, primarily developed to study interpretation, as a generator: it is run to equilibrium with no input, producing a sentence parse as output. Because the dynamical system is stochastic, this is done repeatedly, which gives a probability distribution over generated parses that we call ${}^{*}P(S)$; we take this to be the knowledge of sentence probabilities that is embodied in the model's dynamics. Then, for any $W_k$, we compute ${}^{*}P(S \mid W_k)$ for rational analysis by conditioning ${}^{*}P(S)$ on $W_k$; i.e., ${}^{*}P(S \mid W_k)$ is the proportion of parse $S$ among all generated parses whose prefix equals $W_k$. We can then examine the extent to which the model, when serving as an incremental parser, behaves in accord with rational inference given its knowledge.
The Gradient Symbolic Computation (GSC) framework (Smolensky et al., 2014) serves our goal. The GSC model is a continuous-time, continuous-state stochastic dynamical system model that gradually computes the representation of a discrete structure. This framework grew out of the Integrated Connectionist/Symbolic cognitive architecture (Smolensky and Legendre, 2006). GSC aims to provide an integrated account of the contribution of the continuous dynamics of cognitive processing and the discrete competence that characterizes our knowledge of language. Cho et al. (2017) applied the framework to incremental processing problems, focusing on transient dynamics during incremental processing, and argued that the model can achieve two core computational goals in incremental processing: maintaining multiple context-appropriate and globally coherent interpretations while rejecting interpretations that are context-inappropriate. The GSC parser meets these challenges by moving, during the processing of a word, to an intermediate activation state (a blend state) in which multiple symbolic constituents are simultaneously activated to varying partial degrees. From this state, the parser can reach all activation states representing context-appropriate and globally coherent structures but does not move to activation states representing context-inappropriate structures (either grammatical or ungrammatical). The relation between intermediate activation states and probability distributions over discrete parses was briefly discussed but was not investigated systematically.
In this study, we propose a version of the GSC parser and show how it can be related to other probabilistic sentence-processing models. We argue that the parser's internal state (the activation values of multiple symbolic constituents along with the control parameters of the parser) encodes a probability distribution over complete parses (Section 3). After encountering new input, the parser incrementally changes its internal state to encode a new probability distribution. The work the parser needs to do to shift this internal state is closely related to the KL divergence between the probability distributions, providing a link between processing time and surprisal (Section 4). In a simulation study (Section 5), we demonstrate that the GSC parser can approximate rational inference and report the correlation between processing time and surprisal in our model. In Section 6, we summarize our results and discuss some implications of our work.

Representation
Consider a tree structure S[1](A,B). [fn. 1] Let us assign a unique label to every position (called a role) in the tree structure. For example, we assign the labels r, 0, and 1 to the mother (root) node and the left and right daughter nodes, respectively. Then, we can describe the tree as an unordered set of symbol/position (or filler/role) bindings: {S[1]/r, A/0, B/1}. Let $\mathbf{f}$ and $\mathbf{r}$ be subsymbolic vector encodings of filler f and role r. The encoding of binding f/r is defined as the tensor product of the two vectors, $\mathbf{f/r} \equiv \mathbf{f} \otimes \mathbf{r}$, whose $(i,j)$-th component is the product of the $i$-th component of $\mathbf{f}$ and the $j$-th component of $\mathbf{r}$. The encoding of a set of filler/role bindings is defined as the superposition (vector sum) of the encodings of the component bindings; for the example tree, this is $\mathbf{S[1]} \otimes \mathbf{r} + \mathbf{A} \otimes \mathbf{0} + \mathbf{B} \otimes \mathbf{1}$.
In this study, we used local representations (one-hot encodings) of fillers and roles to facilitate computation. However, many equivalent models with distributed representations can be easily constructed by a change of basis (Smolensky, 1986). The results will not change if the distributed representations of bindings remain orthonormal (Smolensky, 1990).
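The tensor-product encoding can be illustrated with a few lines of NumPy. This is a minimal sketch under the local (one-hot) representation described above; the variable names and the restriction to three fillers and three roles are choices made for the example only.

```python
import numpy as np

fillers = ["S[1]", "A", "B"]   # symbols
roles = ["r", "0", "1"]        # root, left daughter, right daughter

# Local (one-hot) encodings: row i of the identity matrix encodes item i.
F = np.eye(len(fillers))
R = np.eye(len(roles))

def binding(filler, role):
    """Tensor-product encoding f (x) r of one filler/role binding."""
    f = F[fillers.index(filler)]
    r = R[roles.index(role)]
    return np.outer(f, r)   # (i, j) entry is f[i] * r[j]

# Superposition of the three bindings that make up S[1](A,B).
tree = binding("S[1]", "r") + binding("A", "0") + binding("B", "1")
print(tree)
```

With one-hot vectors, each binding occupies exactly one cell of the filler-by-role matrix, so the superposition can be read off directly as a table of binding activations.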

Constraints
The GSC model uses Harmonic Grammar (HG) (Hale and Smolensky, 2006) to specify grammars via soft constraints, each of which imposes a reward (a 'positive constraint') or a penalty (a 'negative constraint') on the well-formedness, or Grammatical Harmony, of a gradient symbolic structure. The grammatical structures are those with maximal Grammatical Harmony: these structures best satisfy the constraints of the grammar.
As an example, consider a rewrite rule: S[1] → A B. This rule defines a treelet S[1](A,B) as grammatical. HG assigns a positive Harmony reward to a structure for every grammatical pair of bindings, e.g., (S[1]/r, A/0), that it contains. In a network implementation of this HG, these binary rules are implemented as positive weights on between-binding connections, so that whenever one binding is active, it sends positive activation to its grammatical parent and child binding(s). In addition to these positive contributions from grammatical mother/daughter pairs, the Harmonic Grammar assigns a negative penalty $-b$ to every filler, where $b$ is the number of edges that the filler must have in a grammatical structure. If all those edges are grammatically legal, they will produce positive binary rewards which by design exactly cancel the unary penalties, so that an ill-formed tree has negative Harmony but a well-formed tree has zero Harmony, the maximum value. The unary HG rules are implemented as negative weights on self-connections of binding units.
The Grammatical Harmony of a set of active filler/role bindings is simply the sum of the Harmony values assigned by all binary and unary HG rules. In the GSC implementation, Grammatical Harmony is defined as in Eq. 1, where $a$ is an activation state vector, $W$ is a weight matrix implementing the grammatical constraints, and $ex$ is an external input vector that stimulates the target terminal binding corresponding to the present input word. For example, suppose the model is given a second word 'B'. Because it is the second word of a sentence, it must occupy the second terminal role (in our case, 1). [fn. 2] Thus, the component of $ex$ corresponding to binding B/1 has a positive value (a model parameter) and all the other components have a value of 0.
The goal of the GSC parser is to produce an output that represents a discrete tree (at least to a good approximation). This turns out to require further constraints that penalize representations that are not approximately discrete. The Harmony term in Eq. 2, in which f and r are filler and role indices, penalizes representations with multiple symbols filling the same role: it introduces competition among bindings in each role. It is called the Competition Constraint. The Harmony term in Eq. 3 penalizes every binding whose activation value is not close to either 0 or 1; this is the crucial Discreteness Constraint, and $H_Q$ is Discreteness Harmony. Note that the Competition and Discreteness Constraints in collaboration force the model to choose one filler, with activation 1, in each role. The representations of discrete trees satisfy both of these constraints [fn. 3] and fall on what we call the grid of states: in these states, for each role, the bindings of that role to all symbols have activation 0 except one, which has activation 1. The representation of the tree S[1](A,B) is on the grid, while an example non-grid state is the one encod-
Finally, to ensure that the network state does not blow up, we also impose the Baseline Constraint (Eq. 4), which penalizes activation states distant from a baseline activation state $z$.
The Total Harmony $H$ is the weighted sum of the four Harmony values in Eqs. 1-4, where $\beta$, $c$, and $q$ are the coefficients of the non-grammatical constraints. While $\beta$ and $c$ are fixed, $q$ changes in time, controlled by an external mechanism we do not model here.
The coefficient $q$ governs the strength of the constraint to have discrete activation values (0 or 1), that is, the strength of the requirement that the model commit to symbols being predicted to be present or absent. The Competition Constraint prohibits more than one symbol from having activation 1 in any given role, so large $q$ values force the model to choose among competitors. Hence we refer to $q$ as the commitment level.
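As a sketch, one set of functional forms consistent with the verbal descriptions of Eqs. 1-4 and of the Total Harmony is given below. These forms are assumptions made for illustration: the factor 1/2, the particular penalty shapes, the symbols $H_C$ and $H_B$, and the pairing of $\beta$ with the Baseline Constraint and $c$ with the Competition Constraint are not stated in the text. Only the names $H_Q$ and the role of $q$ as the coefficient of the Discreteness Constraint come from the description above.

```latex
\begin{align}
H_G(a) &= \tfrac{1}{2}\, a^{\top} W a + \mathit{ex}^{\top} a
  && \text{(Grammatical Harmony)} \\
H_C(a) &= -\sum_{r} \sum_{f \neq f'} a_{fr}\, a_{f'r}
  && \text{(Competition: co-active fillers in one role)} \\
H_Q(a) &= -\sum_{f,r} a_{fr}^{2}\,(1 - a_{fr})^{2}
  && \text{(Discreteness: zero only at } a_{fr} \in \{0, 1\}) \\
H_B(a) &= -\tfrac{1}{2}\, \lVert a - z \rVert^{2}
  && \text{(Baseline: distance from } z) \\
H(a; q, \mathit{ex}) &= H_G(a) + \beta\, H_B(a) + c\, H_C(a) + q\, H_Q(a)
  && \text{(Total Harmony)}
\end{align}
```

Under these assumed forms, the external input enters the Total Harmony only through the linear term $\mathit{ex}^{\top} a$, and increasing $q$ deepens the wells around activation values 0 and 1 without changing the other terms.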

Processing dynamics
The model updates its activation state $a$ as in Eq. 5, where $W$ is the standard multidimensional Wiener process, $T$ is the level of noise, and $\nabla_a H(a)$ is the gradient of the Total Harmony evaluated at $a$. The model optimizes the constraints by stochastically following the gradient: its trajectory is a Brownian motion with drift given by the gradient of Harmony, and hence, on average, Harmony increases over time. $q(t)$ is the commitment level at time $t$. For convenience, we assume that $q(0) = 0$ and that $q$ increases in time, because the goal of computation (in either production or comprehension) is to build a discrete symbolic structure. We will refer to how $q$ changes in time as the commitment policy and discuss it in more detail in Section 3.
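A stochastic gradient-ascent dynamics of this kind can be simulated with the Euler-Maruyama method. The sketch below is not the GSC implementation: it assumes the noise term is scaled as sqrt(2T) dW (the scaling under which the stationary density is Boltzmann with temperature T), it keeps q fixed, and it uses a toy quadratic Harmony function with a single maximum in place of the Total Harmony. The function name `simulate` and all parameter values are hypothetical.

```python
import numpy as np

def simulate(grad_h, a0, temperature, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama integration of da = grad H(a) dt + sqrt(2T) dW."""
    rng = np.random.default_rng(seed)
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        a = a + grad_h(a) * dt + np.sqrt(2.0 * temperature * dt) * noise
    return a

# Toy Harmony with a single maximum at `target`: H(a) = -0.5 * ||a - target||^2.
target = np.array([1.0, 0.0])
grad_h = lambda a: -(a - target)

print(simulate(grad_h, a0=[0.0, 0.0], temperature=0.0))    # noiseless gradient ascent
print(simulate(grad_h, a0=[0.0, 0.0], temperature=0.01))   # noisy trajectory near the maximum
```

With T = 0 the state converges to the Harmony maximum; with T > 0 it fluctuates around the maximum, which is why the optimum reached after each word is only approximate in the stochastic case.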

GSC parser as a probabilistic model

GSC parser
The GSC parser is an application of the GSC framework to incremental parsing. It processes a sentence word-by-word incrementally and passes through intermediate activation states (or blend states) to reach a grid point, the encoding of the parse of the sentence.
Let $ex_k$, $q_k$, and $a_k$ be the external input vector corresponding to $w_k$, the commitment level, and the activation state vector after processing the $k$-th word, respectively. $a_k$ is a local optimum if $T = 0$. For $T > 0$, we take $a_k$ to be an approximation of the local optimum. Let $ex_0$ (= 0), $q_0$ (= 0), and $a_0$ be the initial values of the variables before processing the first word of a sentence. As the parser processes a length-$N$ sentence, its activation state changes from $a_0$ through $a_k$ to $a_N$. Taking $q_N$ to be large, $a_N$ is close to a grid point and is classified into the nearby grid point by choosing the filler most strongly activated in each role (the snap-to-the-grid method). Word processing time for $w_k$ is the time the parser takes to move from $a_{k-1}$ to $a_k$.
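The snap-to-the-grid classification amounts to an argmax over fillers within each role. A minimal sketch follows, assuming the activation state is arranged as a fillers-by-roles matrix; the function name and the example activation values are hypothetical.

```python
import numpy as np

def snap_to_grid(a):
    """Map a (fillers x roles) activation matrix to a grid point.

    In each role (column), the most strongly activated filler is set to 1
    and all other fillers are set to 0.
    """
    a = np.asarray(a, dtype=float)
    grid = np.zeros_like(a)
    winners = a.argmax(axis=0)                  # winning filler index per role
    grid[winners, np.arange(a.shape[1])] = 1.0
    return grid

# Hypothetical near-grid state: rows are fillers, columns are roles.
a_final = np.array([[0.97, 0.02, 0.01],
                    [0.01, 0.95, 0.08],
                    [0.03, 0.04, 0.91]])
print(snap_to_grid(a_final))
```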
More specifically, the parser processes each word $w_k$ in three phases. Let $a^j_k$ be the activation state after phase $j$ given word $w_k$; $a_k = a^3_k$.
• Phase 1: Update $ex$ from $ex_{k-1}$ to $ex_k$; the state $a$ then settles from $a^3_{k-1}$ to $a^1_k$ under $H(a, q_{k-1})$ (see Section 4).
• Phase 2: Update $a$ from $a^1_k$ to $a^2_k$ by using $H(a, q_{k-1}) \rightarrow H(a, q_k)$, i.e., increasing $q$ from $q_{k-1}$ to $q_k$ at a constant rate $dq/dt = 1$.
• Phase 3: Update $a$ from $a^2_k$ to $a^3_k$ ($= a_k$), using $H(a, q_k)$, allowing settling to convergence. [fn. 4]
The processing time of $w_k$ is defined as the sum of the settling times in phases 1 and 3 and the duration of phase 2.
The parser, in phase 1, integrates a new word input with its internal language model (or structural prediction) and, in phase 2, updates the internal language model via the control of commitment level to make a new structural prediction. In the proposed model, the effect of the instantaneous surprisal of $w_k$ (phase 1) is conceptually distinguished from the effect of model update (phase 2) (cf. O'Reilly et al., 2013). [fn. 5] The role of phase 2 is to reduce the number of grid points reachable from the present activation state. [fn. 6] As $q$ increases, the system passes through a series of bifurcations, qualitative changes in the organization of the representation space. When $q$ passes some critical values $q_c$, more local optima emerge. Each local optimum forms a local hump (basin of attraction) on the Harmony surface. Those local optima are separated by Harmony valleys that block transitions from one hump to another, because the state seeks higher Harmony. Metaphorically, the paths to some futures (corresponding to different parses) are separated from the present state by these valleys. That is, some structural hypotheses are rejected (Cho and Smolensky, 2016).
Given a length-$N$ sentence, we define a commitment policy $\pi_N$ as a sequence of $q$ values $(q_0, \cdots, q_k, \cdots, q_N)$ where $q_k$ is the commitment level after processing the $k$-th word in a sentence. $q_0 = 0$ and $q_N$ is set to $q_{max}$; in this setting, the model is guaranteed to reach a grid point after processing the whole sentence (to a close approximation; the higher $q_{max}$, the better the approximation).
Footnote 4: During phase 1 and phase 3, the model monitors convergence as follows. Let $H_{max}(t)$ be the maximum Total Harmony in a phase up through time $t$. If $H_{max}$ has not been updated for a certain amount of time (= 0.5 in our simulation study; Section 5), the phase ends and the following phase begins. During phase 2, $q$ increases at a constant rate $dq/dt = 1$, so the duration of phase 2 is simply $q_k - q_{k-1}$.
Footnote 5: Alternatively, we can consider a GSC parser with a discrete commitment policy. Given a new word input $w_k$, the model updates both $q$ and $ex$ discretely from $q_{k-1}$ and $ex_{k-1}$ to $q_k$ and $ex_k$. Note that the surprisal of $w_k$ is computed given the updated internal model in this alternative model. Although this alternative parsed every sentence of a minimal grammar $G$ (see Section 5) equally well, we prefer the proposed model to the alternative for the following reason. While $ex_k$ is given by the environment, an optimal value of $q_k$ given $ex_k$ must be computed by the parser, and the computation must take time.
Footnote 6: In terms of the number of reachable grid points, entropy is reduced during phase 2. Because the phase-2 duration is a monotonically increasing function of the amount of increase in $q$ and $q$ is associated with entropy (roughly speaking, the higher $q$, the smaller the entropy), it is likely that a longer phase-2 duration is associated with a larger entropy reduction, which is consistent with the entropy reduction hypothesis (Hale, 2006), although the exact relation between $q$ and entropy needs further investigation.
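The stopping rule described in footnote 4 and the phase-2 duration can be sketched as follows. This is an illustration only: it assumes the Total Harmony is sampled at a fixed time step within a phase, and the function names, the sampling step, and the example trace are hypothetical. Elapsed time is counted in whole steps to avoid floating-point drift when comparing against the patience window.

```python
def settling_time(harmony_trace, dt, patience=0.5):
    """Time at which a phase ends under the H_max stopping rule.

    harmony_trace: Total Harmony sampled every dt time units within one phase.
    The phase ends once the running maximum H_max has not been updated for
    `patience` time units; if that never happens, the full trace length is used.
    """
    patience_steps = round(patience / dt)
    h_max = float("-inf")
    last_update = 0
    for step, h in enumerate(harmony_trace):
        if h > h_max:
            h_max, last_update = h, step
        elif step - last_update >= patience_steps:
            return step * dt
    return len(harmony_trace) * dt

def phase2_duration(q_prev, q_next, dq_dt=1.0):
    """Duration of phase 2 when q rises from q_prev to q_next at rate dq_dt."""
    return (q_next - q_prev) / dq_dt

trace = [-1.0, -0.5, -0.2, -0.1] + [-0.1] * 10   # Harmony rises, then plateaus
print(settling_time(trace, dt=0.1), phase2_duration(5.0, 15.0))
```

Under this rule the measured settling time always includes the patience window itself, so it is an upper bound on the time at which the maximum was actually reached.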

GSC parser as a probabilistic model
The GSC parser can be related to a structural probabilistic model in the following way. Consider a prefix $W_k = w_1 \cdots w_k$ where $w_k$ is not the final word of a sentence. The GSC parser processes the prefix under a policy $\pi_k = (q_0, \cdots, q_k)$. While processing $w_k$, the activation state changes from $a_{k-1}$ to $a_k$. If we set $q_k$ to $q_{max}$, the parser will be forced to choose a grid point. If $T > 0$ and the same process is run multiple times, the parser will choose different grid points (encodings of $S$) with different frequencies. In this way, we can estimate the conditional probability that the parser reaches $S$ if it starts from a tuple $(a_{k-1}, q_{k-1})$ under $ex_k$. Because $a_{k-1}$ is reachable after the parser has processed $W_{k-1}$ under the policy $\pi_k$, $P(S \mid a_{k-1}, q_{k-1}, ex_k) = P(S \mid W_k, \pi_k)$. In this way, we can map a tuple of the activation state and the control state $(a, q)$ to a probability distribution over $S$ under the constraint $ex$. An important special case of this, with $k = 0$, allows us to estimate the unconditional distribution $P(S)$ by increasing $q$ from 0 to $q_{max}$ with $ex_0 = 0$: this amounts to using the model as a generator, as previewed in Section 1. This estimated distribution is ${}^{*}P(S)$.

Rational inference
Rational inference with $w_k$ is defined as the update from ${}^{*}P(S \mid W_{k-1})$ to ${}^{*}P(S \mid W_k)$ given ${}^{*}P(S)$, where $*$ indicates conditional probabilities computed by marginalizing ${}^{*}P(S)$ over the cases where $W_k$ was generated for the first $k$ terminal roles.
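Estimating the generator distribution from repeated runs and conditioning it on a prefix can be sketched as below. The list of generated parses is a hypothetical stand-in for the output of repeated generator runs, and each parse is abbreviated by its terminal string; the function names are illustrative.

```python
from collections import Counter

def estimate_prior(generated):
    """Relative frequency of each generated parse: an estimate of *P(S)."""
    counts = Counter(generated)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

def condition_on_prefix(prior, prefix):
    """*P(S | W_k): restrict *P(S) to parses whose terminals start with prefix."""
    kept = {s: p for s, p in prior.items() if s.startswith(prefix)}
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}

# Hypothetical generator output: each parse is abbreviated by its terminal string.
generated = ["AB"] * 5 + ["AC"] * 3 + ["DB"] * 1 + ["DC"] * 1
prior = estimate_prior(generated)
print(condition_on_prefix(prior, "A"))
```

Rational inference at each word is then the step from the distribution conditioned on the shorter prefix to the one conditioned on the longer prefix.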

Optimal commitment policy
We define a commitment policy $\pi$ to be optimal if, for every $W_k$, it minimizes the KL divergence $D_k = D({}^{*}P(S \mid W_k) \,\|\, P(S \mid W_k, \pi_k))$. If the $D_k$ are small, the parser approximates rational inference.
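Selecting a policy by this criterion can be sketched for a single prefix as follows. The distributions are made-up numbers for illustration, not simulation results, and the function names are hypothetical; a full treatment would average the divergence over all prefixes.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) over a shared set of parses; eps guards against log(0)."""
    return sum(pi * math.log(pi / max(q.get(s, 0.0), eps))
               for s, pi in p.items() if pi > 0)

def best_policy(rational, parser_by_policy):
    """Return the q_1 value whose parser distribution is closest to the rational one."""
    return min(parser_by_policy,
               key=lambda q1: kl_divergence(rational, parser_by_policy[q1]))

# Hypothetical distributions given the prefix 'AB' (illustrative numbers only).
rational = {"AB": 1.0}
parser_by_policy = {
    1: {"AB": 0.70, "DB": 0.30},    # commitment raised too slowly
    5: {"AB": 1.00},
    11: {"AB": 0.80, "AC": 0.20},   # commitment raised too quickly
}
print(best_policy(rational, parser_by_policy))
```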

Surprisal as Harmony difference
The GSC parser processes a sentence word by word and processes every word in three phases. In this section, we argue that surprisal can be computed directly from the intermediate activation states and that its value will be approximately proportional to the settling time in phase 1.
As the parser processes the $k$-th word in phase 1, the activation state changes from $a^3_{k-1}$ to $a^1_k$ under the influence of $ex_k$. During this phase, $q$ is fixed at $q_{k-1}$. When $q$ and $ex$ are fixed (all the other parameters are constant), the equilibrium probability density follows the Boltzmann distribution (Eq. 6), and the logarithm of the probability ratio of $P(a^1_k)$ to $P(a^3_{k-1})$ can be computed as in Eq. 7, where $H$ is parameterized such that $q = q_{k-1}$ and $ex = ex_k$. Note that the left-hand side of Eq. 7 corresponds to the KL divergence $D(P_k \,\|\, P_{k-1}) = E(\ln P_k - \ln P_{k-1})$, where $E(\cdot)$ is the expected value. Thus the surprisal at $w_k$ is $E(\Delta H)/T$, with $\Delta H$ being the Harmony difference between the local optima before and after the input update. [fn. 7]
We can estimate the expected settling time $t_c$ from the old to the new optimum by recalling that, on average, $da/dt = \nabla_a H$; the resulting estimate is approximate because it ignores the stochastic term in Eq. 5. We approximate the average gradient with the average of the gradients at the initial and the final activation states $a^3_{k-1}$ and $a^1_k$. The gradient at $a^1_k$ is 0 because $a^1_k$ is the new optimum. The gradient at $a^3_{k-1}$ can be calculated as follows: $\nabla_a H(a^3_{k-1}; q_{k-1}, ex_k) = (ex_k - ex_{k-1}) + \nabla_a H(a^3_{k-1}; q_{k-1}, ex_{k-1})$. Note that the last term is 0 because $a^3_{k-1}$ was the optimum under $ex_{k-1}$ (i.e., before the input word was updated), so the initial gradient is simply $(ex_k - ex_{k-1})$. It follows that the magnitude of the average of the initial and final Harmony gradients in phase 1 is constant for every $w_k$. [fn. 8] Thus, $\Delta H$ is approximately proportional to the settling time $t_c$. In sum, the surprisal of $w_k$, under an optimal commitment policy, is related to $\Delta H_k = H(a^1_k; q_{k-1}, ex_k) - H(a^3_{k-1}; q_{k-1}, ex_k)$, which in turn is proportional to settling time. In our model, surprisal has a geometrical meaning: it is the amount of hill climbing required to reach a new optimum due to the update of the word input.
Footnote 7: As the parser processes $w_k$, its state changes from $(a^3_{k-1}, q_{k-1})$ through $(a^1_k, q_{k-1})$ to $(a^3_k, q_k)$, all of which have the same future under the influence of $ex_k$. Thus, under an optimal commitment policy, $P_k = {}^{*}P(S \mid W_k) \approx P(S \mid a^3_{k-1}, q_{k-1}, ex_k) = P(S \mid a^1_k, q_{k-1}, ex_k)$. $P_{k-1} = {}^{*}P(S \mid W_{k-1}) \approx P(S \mid a^3_{k-2}, q_{k-2}, ex_{k-1}) = P(S \mid a^3_{k-1}, q_{k-1}, ex_{k-1})$.
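The argument of this section can be laid out in equations. What follows is a sketch under stated assumptions rather than a reproduction of Eqs. 6 and 7: it assumes the noise term in Eq. 5 is scaled as $\sqrt{2T}\,dW$, so that the stationary density is Boltzmann with temperature $T$, and it introduces $Z$ for the normalizing constant and $\bar{g}$ for the average gradient along the phase-1 path (both symbols are added here for illustration). Throughout, $H$ is evaluated with $q = q_{k-1}$ and $ex = ex_k$.

```latex
\begin{align}
P(a) &= \frac{1}{Z} \exp\!\left( \frac{H(a)}{T} \right) \\
\ln \frac{P(a^{1}_{k})}{P(a^{3}_{k-1})}
  &= \frac{H(a^{1}_{k}) - H(a^{3}_{k-1})}{T} = \frac{\Delta H_k}{T} \\
\Delta H_k
  &= \int_{0}^{t_c} \nabla_a H \cdot \frac{da}{dt}\, dt
  \approx \int_{0}^{t_c} \lVert \nabla_a H \rVert^{2}\, dt
  \approx \lVert \bar{g} \rVert^{2}\, t_c \\
\bar{g} &\approx \tfrac{1}{2} \left[ (\mathit{ex}_k - \mathit{ex}_{k-1}) + 0 \right]
\end{align}
```

The first approximation in the third line drops the stochastic term, and the second replaces the gradient along the path by its average. Because $\lVert \mathit{ex}_k - \mathit{ex}_{k-1} \rVert$ takes the same value for every word after the first, $\lVert \bar{g} \rVert$ is constant across those words, and $\Delta H_k$ is then approximately proportional to $t_c$.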

Case study
We investigated a GSC model implementing a minimal probabilistic context-free grammar $G$ that generates four two-word sentences with the parses S[1](A,B), S[2](A,C), S[3](D,B), and S[4](D,C), where $p_k$ is the probability of the $k$-th sentence and $\sum_k p_k = 1$. Cho et al. (2017) used this minimal grammar (with $p_1 = p_2 = p_3 = p_4 = 0.25$) to investigate whether and how the GSC model can deal with computational challenges arising from local ambiguity. They argued that this language creates the core computational problems of incremental processing in the purest form. For example, after processing 'A' as a first word, an ideal incremental processing system must reject S[3](D,B) and S[4](D,C). At the same time, it must consider both S[1](A,B) and S[2](A,C) as candidate interpretations without choosing one over the other too early. They showed that the GSC model can achieve both computational goals by regulating the commitment level $q$ appropriately. When $q$ increased too quickly or too slowly, the model respectively made "garden-path" errors (e.g., S[2](A,C) for an input sentence 'AB'; Bever, 1970; Frazier, 1987) or "local-coherence" errors (e.g., S[3](D,B) for an input sentence 'AB'; Konieczny, 2005).
We investigated the same grammar $G$ but considered the cases where $p_1 \geq p_2$ because our interest is in the relationship between surprisal and processing times. To introduce a structural preference for S[1](A,B), a small value $\Delta h \in \{0, 0.1, 0.2, 0.3\}$ was added to the Grammatical Harmony of S[1]-bindings (see Table 1 in Supplementary Material). (The model parameter $\Delta h$ must be distinguished from $\Delta H$ discussed above.) $p_k$ was empirically estimated by running the model as a generator (i.e., with no external input) 800 times. Figure 1 presents the GSC model implementing the grammar. Note that for a different choice of $\Delta h$, the parser implements a different PCFG. In addition to $\Delta h$, we manipulated $T$ (see Eq. 5) at two levels (0.01 or 0.1) to see how the effect of $\Delta h$ depends on $T$. The GSC parser needs a commitment policy. Because every sentence of $G$ is two words long, we considered a commitment policy $\pi = (q_0, q_1, q_2)$ where $q_0 = 0$, $q_2 = q_{max} = 15$, and $q_1$ was a free parameter.
Footnote 8: Because $w_{k-1}$ and $w_k$ are presented at two different positions in a sentence, $ex_{k-1} \neq ex_k$. In every $ex_k$ (for $k > 0$), only one component has a non-zero value (+2 in the present study) and all the other components have a value of 0. Thus, $\lVert ex_k - ex_{k-1} \rVert$ is $2\sqrt{2}$ for every $k > 1$; it is 2 for $k = 1$.
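The gradient magnitudes quoted in footnote 8 follow directly from the one-hot structure of the external input and can be verified numerically. In the sketch below, the number of binding units and the indices of the stimulated bindings are hypothetical; only the input strength of 2 is taken from the text.

```python
import numpy as np

n_bindings = 6        # hypothetical number of binding units
strength = 2.0        # non-zero input value used in the present study

def external_input(index):
    """External input with a single non-zero component at the given binding index."""
    ex = np.zeros(n_bindings)
    ex[index] = strength
    return ex

ex0 = np.zeros(n_bindings)      # no input before the first word
ex1 = external_input(0)         # first word stimulates one terminal binding
ex2 = external_input(3)         # second word stimulates a different binding

print(np.linalg.norm(ex1 - ex0))   # magnitude of the input change for k = 1
print(np.linalg.norm(ex2 - ex1))   # magnitude of the input change for k > 1
```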

Investigation of commitment policy
First, we investigated whether the GSC parser can approximate rational inference as introduced in Section 3. We considered 6 policies in which $q_1$ was set to one of the values (1, 3, 5, 7, 9, 11).
Figure 2: Plot of the KL divergence of ${}^{*}P(S \mid W_2)$ from $P(S \mid W_2, \pi_2)$ against $q_1$ in $\pi_2 = (0, q_1, 15)$. Columns correspond to different $T$ conditions.
Every model with a unique combination of $\Delta h$, $T$, and $q_1$ processed each of the four sentences (S1=AB, S2=AC, S3=DB, S4=DC) word by word 200 times. By applying the algorithm introduced in Section 3, we estimated $P(S)$, $P(S \mid W_1, \pi_1)$, and $P(S \mid W_2, \pi_2)$. Because processing time was not of interest here, we excluded phases 1 and 3 as the parser processed each word. If $dq/dt$ in phase 2 is small ($dq/dt = 1$ in the simulation), the omission of phases 1 and 3 does not change the result much. An optimal policy was defined as the policy $(0, q_1, 15)$ that minimizes the divergence $D({}^{*}P(S \mid W_k) \,\|\, P(S \mid W_k, \pi_k))$ averaged over $W_k$.
For $w_2$, we estimated $P(S \mid W_2, \pi_2)$ under each of the 6 policies. Figure 2 presents the average KL divergences of ${}^{*}P(S \mid W_2)$ from $P(S \mid W_2, \pi_2)$ as a function of $\Delta h$ and $T$. When $T = 0.01$, the divergence was 0 when $q_1$ was either 5 or 7 in every $\Delta h$ condition, suggesting that the model parsed each of the four sentences accurately. When $T = 0.1$, the divergence was minimal (< 0.017) when $q_1 = 7$ in every $\Delta h$ condition. [fn. 9]

Investigation of processing times
To investigate the relationship among harmony difference, surprisal (assuming rational inference), and word processing time, we chose the best of the commitment policies, $\pi = (0, 5, 15)$, for the condition $T = 0.01$. Each of four GSC parsers, implementing different PCFGs (due to the different $\Delta h$ values), processed each of the four sentences 200 times under the best policy. Because the goal now was to measure word processing time, all three phases were included in this simulation.
Footnote 9: See Figures 1 and 2 in Supplementary Material for estimated probability distributions.
In Section 4, we argued that word processing time, more specifically, phase-1 settling time, must be proportional to HarmonyDifference $\Delta H_k = H(a^1_k) - H(a^3_{k-1})$. Figure 3 presents $w_2$ phase-1 duration against $\Delta H_2$, suggesting a linear trend. [fn. 10] In a regression analysis (Model 1A), we modeled $w_2$ phase-1 duration as a function of SentType (S1=AB, S2=AC, S3=DB, S4=DC, to model processing of $w_2$ in the context of $w_1$), NetID (a unique ID for each GSC parser with a unique $\Delta h$ value), and HarmonyDifference. SentType and NetID were included to factor out manipulation-irrelevant variance, so we do not report the estimates of their coefficients. [fn. 11] The coefficient of HarmonyDifference was significant: b = 1.529, SE = 0.024, t = 64.919, p < .001, supporting our claim. The adjusted $R^2$ statistic was 0.787 and AIC = 3037. We also tested whether $\ln(\Delta H)$ explains the phase-1 settling time well (Model 1B). The coefficient of log harmony difference was significant as well: b = 0.445, SE = 0.008, t = 57.014, p < .001. The adjusted $R^2$ statistic was 0.755 and AIC was 3458, suggesting that Model 1A explains the processing time data slightly better.
In Section 3, we presented a method to derive a probability distribution over parses $S$ from a tuple of an activation state and a control state $q$ under $ex$ and a commitment policy $\pi$. Based on this, we argued that the harmony difference (scaled by $T$) can be interpreted as the parser-specific surprisal $D(P(S \mid W_k, \pi_k) \,\|\, P(S \mid W_{k-1}, \pi_{k-1}))$, which will be similar to surprisal under rational inference, $D({}^{*}P(S \mid W_k) \,\|\, {}^{*}P(S \mid W_{k-1}))$, under an optimal commitment policy. Thus, we predict that harmony difference is a function of surprisal under rational inference under an optimal commitment policy. Figure 4 presents harmony difference when the input word was updated from $w_1$ to $w_2$ against the surprisal of $w_2$ under rational inference, suggesting a non-linear relationship between harmony difference and surprisal. In a regression analysis (Model 2A), we modeled harmony difference as a linear function of surprisal, controlling for the effects of SentType and NetID. The coefficient of surprisal was significant: b = 0.342, SE = 0.006, t = 53.933, p < .001. The adjusted $R^2$ statistic was 0.786 and AIC = −860.4. In another regression analysis (Model 2B), we modeled harmony difference as a linear function of ln(surprisal). The coefficient of ln(surprisal) was significant: b = 0.286, SE = 0.005, t = 60.984, p < .001. The $R^2$ statistic was 0.811 and AIC = −1259, suggesting that Model 2B better explains variance in $\Delta H$.
Footnote 10: The result was the same when total word processing time was used instead of phase-1 duration. This is because phase 2 has the same length for every sentence under the same policy and phase 3 settling time was not systematic in the current $T$ setting. We present phase-1 duration data because it is theoretically related to harmony difference (Section 4).
Footnote 11: We did not include the interaction term between SentType and NetID because it covaried with harmony difference and surprisal. Recall that different levels of NetID are associated with different $\Delta h$ values, which in turn were used to create different surprisal values for different sentence types.
We summarize the result in the following conceptual model: surprisal under rational inference → harmony difference (under an optimal commitment policy) → word processing time. In other words, harmony difference is the parser's actual surprisal under a commitment policy. The logarithmic trend observed between surprisal and harmony difference needs further investigation, but we consider two possibilities. First, the average magnitude of the actual gradient may differ systematically depending on surprisal, so that our approximation introduces a bias. Second, although we chose the best commitment policy of 6 candidates, the chosen policy may not be optimal. Note that we used the same commitment policy for all four sentences. However, the optimal $q_1$ value may differ for the first word A and the first word D.

General Discussion
An important research question concerning online sentence processing is to understand the source of processing difficulty. The surprisal hypothesis (Hale, 2001; Levy, 2008) provides a simple, intuitive, and general explanation at a computational level: processing difficulty is proportional to surprisal. The underlying mechanism is still beyond our understanding, but researchers have started developing mechanistic accounts of surprisal (e.g., Rasmussen and Schuler, 2017). In this study, we tried to contribute to this line of research by providing a mechanism that relates surprisal to processing time via a stochastic, well-formedness-optimizing mechanism.
Our effort can be summarized as follows. First, the GSC model encodes structural uncertainty in the gradient activation of constituent symbols. An activation state at a given commitment level is analogous to the state of a symbolic parser but contains uncertainty information. It corresponds to a probability distribution over parses in the following sense: if the system starts from the given activation state and the given commitment level and is forced to choose a parse, it will choose different parses (grid points) with different frequencies (see Section 3). Second, the model updates uncertainty in two ways: in response to the update of external information and via the control of commitment level. On the one hand, an external input update makes the previously optimal activation state suboptimal and thus drives the system to a new optimum. In Section 4, we claimed that the amount of change required to travel from the old to the new optimum, the harmony difference, can be interpreted as surprisal. There we showed why the settling time is proportional to the harmony difference. On the other hand, the internal control of commitment level is critical in holding the amount of structural ambiguity at an optimal level; this is implied in Figure 2 in Supplementary Material but was not the focus of this study. See Cho and Smolensky (2016) for the role of commitment policy.
Third, as we demonstrated in a simulation study (Section 5), the model can approximate rational inference under a good commitment policy and simulate the correlation between surprisal and processing time via harmony difference, which is the parser's surprisal under the policy. There we reported that surprisal under rational inference explains variance in harmony difference, which in turn explains variance in processing time. In other words, surprisal under rational inference → harmony difference (the parser's surprisal) under a commitment policy → processing time.
An implication of our work is that surprisal is not a function of linguistic environment only, which we assume the parser learned well. From the GSC point of view, both the linguistic environment and the parser's commitment policy determine surprisal of each word input. For optimal sentence processing, the model needs both types of knowledge.
A limitation of our work is the simplicity of the grammar we investigated. We are actively investigating (with promising preliminary results) the model's ability to process more complex cases. But we point out that finding a good parameter setting and a good commitment policy, which can be challenging, is a separate issue from understanding the relation between surprisal and processing time. The present study focuses on the latter and the claim we made is generalizable.
Probabilistic models (e.g., Hale, 2001; Levy, 2008) provide a computational account of why and what problems must be solved in online language comprehension. Dynamical connectionist models (e.g., Tabor and Hutchins, 2004; Vosse and Kempen, 2009) provide a mechanistic account of why some sentences (e.g., garden-path sentences) take longer to process than others. By proposing how structural uncertainty can be encoded and updated in a symbolically interpretable dynamical system model, our work bridges between these two general approaches to modeling human sentence processing.