Predictive power of word surprisal for reading times is a linear function of language model quality

Within human sentence processing, it is known that a word’s probability in context has large effects on how long it takes to read. This relationship has been quantified using information-theoretic surprisal, or the amount of new information conveyed by a word. Here, we compare surprisals derived from a collection of language models based on n-grams, neural networks, and a combination of both. We show that the models’ psychological predictive power improves as a tight linear function of language model linguistic quality. We also show that the size of the effect of surprisal is estimated consistently across all types of language models. These findings point toward surprising robustness of surprisal estimates and suggest that surprisals estimated by low-quality language models are not biased.


Introduction
Decades of work studying human sentence processing have demonstrated that a word's probability in context is strongly related to the amount of time it takes to read it. This relationship has been quantified by surprisal theory (Hale, 2001; Levy, 2008), which states that the processing difficulty of a word w in context c is proportional to its information-theoretic surprisal, defined as −log p(w|c). The more likely a word is to occur in its context, the less information it communicates (Shannon, 1948) and the more quickly it is read.
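As a minimal illustration of the surprisal definition, the following Python sketch computes −log p(w|c) for a given probability. The function name and the example probabilities are hypothetical, and the use of the natural logarithm (surprisal in nats) is an assumption, since the base of the logarithm is not specified here.

```python
import math

def surprisal(p_word_given_context: float) -> float:
    """Return surprisal in nats: -log p(w|c). The log base is an assumption."""
    if not 0.0 < p_word_given_context <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p_word_given_context)

# Illustrative (made-up) probabilities: a predictable and an unpredictable word.
predictable = surprisal(0.5)
unpredictable = surprisal(0.001)
```

Under surprisal theory, the word with the larger value is the one predicted to take longer to read.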
One difficulty in testing such effects is the need to construct estimates of a word's probability in context. One way of estimating such probabilities is to give human subjects a context, have them guess the next word, and estimate p(w|c) as the proportion of participants who guess word w in context c. This method, called a Cloze task (Taylor, 1953), may yield reliable estimates for words that have relatively high probabilities in their context, and it has been used in a number of studies of the effects of probabilities in context on reading. However, it is an open question whether these human guess-derived proportions deviate systematically from objective probabilities (Smith & Levy, 2011). More problematically for studying surprisal specifically, the Cloze task cannot in principle yield reliable estimates of word probabilities in context that are relatively low (say, less than 1 in 100, as many word probabilities are) without requiring an extremely large number of participants (Levy, 2008). Additionally, it is not practical to use the Cloze task to estimate probabilities for the large datasets on which surprisal is often studied, which can easily contain tens of thousands of contexts that would require estimation.
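The Cloze estimate and its weakness for low-probability words can be sketched in Python as follows. The responses are invented for illustration, the function names are hypothetical, and the second function assumes that participants respond independently.

```python
from collections import Counter

def cloze_estimates(responses):
    """Estimate p(w|c) as the proportion of participants who produced w."""
    counts = Counter(r.strip().lower() for r in responses)
    total = len(responses)
    return {word: n / total for word, n in counts.items()}

def prob_never_produced(p, n_participants):
    """Chance that a word with true probability p is produced by nobody,
    assuming independent participants."""
    return (1.0 - p) ** n_participants

# Hypothetical responses from 10 participants for a single context.
responses = ["dog"] * 6 + ["cat"] * 3 + ["ferret"]
estimates = cloze_estimates(responses)
```

With n participants, the smallest non-zero estimate is 1/n, and any word that nobody produces receives an estimate of zero, which corresponds to an infinite surprisal. Even with 40 participants, a word with a true probability of 0.01 would more likely than not go unproduced.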
The alternative is to estimate the probabilities of words in context using computational language models trained on large language corpora. Many studies of surprisal have used such language models (e.g. Hale, 2001; Levy, 2008; Demberg & Keller, 2008; Mitchell et al., 2010; Monsalve et al., 2012).
Unfortunately, however, computational language models are still substantially worse than humans at predicting upcoming words, meaning there is some mismatch between the probabilities p(w|c) estimated computationally and the implicit probabilities that human readers are using. This situation raises the question of to what extent we can trust results about the effects of surprisal as estimated by such language models. To get some information about possible biases in our results that stem from language models' poor linguistic quality (that is, from their being worse than humans at predicting upcoming words), we can compare a range of computational language models of varying linguistic quality and see how the estimated effects of surprisal change. If there is a trend in results as the linguistic quality of the language models improves, that would provide evidence that such a trend may be even more present in language models with human-level linguistic quality.
Additionally, recent years have seen rapid progress in computational language modeling, enabled by recent advances in neural networks. As a result, the linguistic quality of contemporary language models is far beyond what has been used in previous work studying surprisal. In this paper, we address both these concerns by analyzing how the predictive power of these surprisal estimates (their psychological quality) varies as a function of language model linguistic quality and type.
There has also been substantial interest in the shape of the effects of surprisal on reading times, because of theories that predict it to be linear (Levy, 2008; Smith & Levy, 2013; Bicknell & Levy, 2010). A secondary goal of this work is to investigate whether the shape of this effect depends on language model quality or type.
In particular, we compare surprisal estimates using a range of language models of varying linguistic qualities and types, from the n-gram models that have been used in most previous work on surprisal to state-of-the-art LSTM and interpolated-LSTM models. We assess the predictive ability and the size and shape of surprisals derived from each language model using generalized additive mixed-effects models (Wood, 2017) fit to a corpus of eye movements in reading.
The plan for the remainder of this paper is as follows. Section 2 introduces the set of language models we compare and establishes the linguistic quality of each. Then, in Section 3 we quantify the ability of surprisals derived from each language model to predict reading times and see the extent to which this changes with language model type and quality, assuming that effects of surprisal on reading times are linear. In Section 4 we do the same but allow surprisal to have non-linear effects, and we additionally use the non-linear models to assess whether there is evidence that the shape of the surprisal effect changes with language model type or quality. Finally, Section 5 concludes.

Corpus
The corpus used for language model estimation was the Google One Billion Word Benchmark (Chelba et al., 2013), hereafter referred to as the "1b corpus". The text data was obtained from news periodicals (similar to the Dundee corpus used for eye-tracking data below). The final corpus contained approximately 0.8 billion words with a vocabulary size of about 800,000.
Although the Dundee Corpus (Kennedy et al., 2003) tokenized entire words with punctuation, our models were trained using separated punctuation as well as separated possessives (e.g. Bill's → [Bill , 's]). Contractions were tokenized into their constituent full-form words, although contractions were counted as a single word when utilizing word count in e.g. perplexity calculations. These calculations can be seen in Table 1.

Model types
We compare seven language models of three types: four n-gram models, one LSTM, and two interpolations.

n-gram
The n-gram, count-based models were calculated using kenlm (Heafield et al., 2013). kenlm uses Modified Kneser-Ney Smoothing, and is similar in functionality to, but significantly faster than, SRILM (Stolcke et al., 2011). We calculated 5-gram, 4-gram, trigram, bigram, and unigram models. Unigram results were not included in the study as a language model; instead, they were used as a measure of word frequency for controlling other models.

LSTM
Neural network-based language models were generated from a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units. Each word was encoded as a 50-dimensional one-hot vector. This vector was then fed into a sequence model with an LSTM of 50 hidden units. The model did not evaluate character-level sequences, but rather only word-level sequences. The probability of the next word in the sequence was taken from the output layer of the sequence model.

Interpolation
In addition to the LSTM and n-gram models, two interpolated models were also built from the two models with the lowest perplexity on the Dundee Corpus used in this study (see Table 1). This follows the interpolation method used in Jozefowicz et al. (2016): as in that work, the present study found optimal weightings for combining an LSTM model with a smoothed n-gram model. Optimal weighting was operationalized as the blend weights that resulted in the lowest perplexity. Perplexity of the interpolated LSTM+5-gram model was optimal (lowest) when the LSTM probabilities were weighted by 0.71 and the 5-gram probabilities by 0.29. In addition to this optimal model, a balanced interpolated model was also constructed using equal weighting of the LSTM and 5-gram probabilities.
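The interpolation described above can be sketched in Python as a weighted average of the two models' probabilities, with the weight chosen to minimize perplexity. The grid search, the function names, and the toy probabilities are all assumptions made for illustration; the procedure actually used to find the 0.71/0.29 weights is not described here.

```python
import math

def interpolate(p_lstm, p_ngram, w_lstm):
    """Linear interpolation of two probability estimates for the same word."""
    return w_lstm * p_lstm + (1.0 - w_lstm) * p_ngram

def perplexity(probs):
    """Exponential of the mean negative log probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def best_weight(p_lstm_seq, p_ngram_seq, steps=100):
    """Grid-search the LSTM weight in [0, 1] that gives the lowest perplexity."""
    grid = [i / steps for i in range(steps + 1)]

    def ppl(w):
        mixed = [interpolate(a, b, w) for a, b in zip(p_lstm_seq, p_ngram_seq)]
        return perplexity(mixed)

    return min(grid, key=ppl)

# Made-up probabilities: each model is confident on a different word.
p_lstm = [0.5, 0.01]
p_ngram = [0.01, 0.5]
w = best_weight(p_lstm, p_ngram)
```

In this symmetric toy case neither model dominates, so the search settles on an equal weighting; with real model outputs the optimum would generally be unequal, as with the weights reported above.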

Dundee corpus surprisals
The Dundee Corpus (see Section 3 for corpus details) was tokenized at the word (rather than token) level with leading, trailing and internal punctuation included, e.g. Bill's, couldn't or exist!. Because the 1b Corpus was tokenized, we were required to break Dundee words made up of multiple tokens into their constituent parts. The log probability for each token was then taken from the language models estimated on the 1b Corpus. In order to realign the tokens with the Dundee Corpus's words, the log probabilities of each constituent token were added together to form a sum total log probability of the word.
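The realignment step can be sketched as follows. By the chain rule, the log probability of a multi-token word is the sum of its tokens' conditional log probabilities. The function name, the input layout (a flat list of token log probabilities plus a token count per word), and the numbers are hypothetical.

```python
def word_log_probs(token_log_probs, tokens_per_word):
    """Sum token-level log probabilities into word-level log probabilities."""
    if sum(tokens_per_word) != len(token_log_probs):
        raise ValueError("token counts do not match the number of log probabilities")
    totals = []
    start = 0
    for n in tokens_per_word:
        totals.append(sum(token_log_probs[start:start + n]))
        start += n
    return totals

# "Bill's" -> ["Bill", "'s"] and "house" -> ["house"]; log probabilities are made up.
token_lps = [-2.0, -1.0, -3.0]
words = word_log_probs(token_lps, [2, 1])
```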
Of the approximately 61,000 tokens in the Dundee Corpus, 175 were OOV in the 1b Corpus. These OOV words were removed from the final analysis. In addition, although the 1b Corpus used the sentence-final delimiter </s>, the Dundee Corpus did not. Therefore, while sentence-final delimiters were used in constructing the probabilities of the respective language models, they were also removed from the final analysis.

Perplexity
For each language model, the words' log probabilities were summed and normalized by the word count. Perplexity is the exponential of the negative of this normalized sum. A lower perplexity is indicative of a more accurate language model. For example, a perplexity of 50 means that, on average, the model is as uncertain about each word as if it were choosing among 50 equally likely options. Therefore a lower perplexity means that there are fewer equally likely options. The perplexity of the seven language models is laid out in Table 1. [...] (73) and the LSTM model (113) are worse than the respective models reported in Jozefowicz et al. (2016) and Chelba et al. (2013). Whereas our best 5-gram model achieves a perplexity of 169 on the Dundee corpus, Jozefowicz et al. (2016) achieve a perplexity of 67 on the lm 1b benchmark using a similar model. However, an important distinction is that the perplexities in Table 1 were calculated after all unknown words were excluded, whereas Chelba et al. (2013) used an <UNK> token for words that were OOV on the test portion of the 1b Corpus. This suggests a substantial mismatch between the test benchmark corpus and the Dundee corpus, even though both corpora are sourced from news media. Nonetheless, both perplexity figures could be considered strong, low perplexities.
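The perplexity computation described at the start of this section can be sketched as below. The function name is hypothetical and natural logarithms are assumed. The word count is passed separately because, as noted earlier, log probabilities are computed per token while normalization is by word count.

```python
import math

def perplexity(token_log_probs, n_words):
    """Exponential of the negative summed log probability, normalized by word count."""
    return math.exp(-sum(token_log_probs) / n_words)

# A model that always spreads its probability uniformly over 50 options.
uniform = [math.log(1 / 50)] * 1000
ppl = perplexity(uniform, n_words=1000)
```

For the uniform model the result is 50 (up to floating-point error), matching the interpretation of perplexity as an effective number of equally likely options.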

Linear effects of surprisal
In this section we investigate the ability of surprisals derived from each of the seven language models described above to predict reading times in a large corpus of eye movements in reading.

Eye movement in reading data
The eye tracking data for our study came from the English portion of the Dundee Corpus (Kennedy et al., 2003), which recorded the eye-movement data from 10 English-speaking participants reading newspaper editorials in The Independent. For this paper specifically, we predict gaze durations for each word, defined to be the sum of all fixations made on a word between the time the word is initially fixated and when the eyes first move off of the word. This measure is only calculated if the word is fixated by that reader prior to any fixation on a later word (i.e., during 'first pass' reading). If the word was not fixated during first pass reading, its gaze duration is treated as missing data. We used a total of about 436,000 valid gaze durations in the English portion of the Dundee corpus. After performing the exclusions listed below, we were left with a total of 289,726 gaze durations and a vocabulary size of 37,420 word types.
In line with previous studies of gaze durations in the Dundee corpus (e.g. Smith & Levy, 2013), we excluded:
• Words preceding punctuation
• Words with non-alphabetical characters
• Words that were presented to participants at the beginning or end of a line of text
• Words that were outside the vocabulary of the 1b corpus (and thus the language models)
Because our statistical model of the gaze duration of each word also included effects of the surprisal of the preceding word, we also excluded:
• Words following punctuation
• Words that followed words with non-alphabetic characters
• Words that followed words that were outside the vocabulary of the 1b corpus (and thus the language models)

Statistical models
Similar to Smith & Levy (2013), we used generalized additive mixed-effects models (GAMMs) to predict reading times with the mgcv (Wood, 2004) package in R (R Core Team, 2013). We estimated seven GAMMs, one for each language model. Each GAMM modeled gaze duration on a word as a function of two linear surprisal terms: one for the surprisal of the current word and one for the surprisal of the previous word. Each GAMM also included random intercepts for each of the 10 readers and a range of linear and non-linear covariates not of direct interest for the present work, identical to those included by Smith & Levy (2013). These covariates were:
• a tensor product interaction between orthographic word length and log-frequency (unigram log probability estimated from the 1b corpus) of the current word
• a tensor product interaction between orthographic word length and log-frequency of the previous word
• a spline effect of word number within the text
• a binary variable of whether or not the previous word had received a fixation
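As a rough sketch of the model structure just described, the predictor for the gaze duration GD of word i read by reader r can be written as below. The notation is introduced here for illustration only: s denotes surprisal, len and freq denote word length and log-frequency, te a tensor product smooth, f a spline over word number n, fix the previous-fixation indicator, and b_r the reader's random intercept. The additive error term is an assumption, since the response distribution is not stated.

```latex
\mathrm{GD}_{i,r} = \beta_0 + \beta_1 s_i + \beta_2 s_{i-1}
  + te(\mathrm{len}_i, \mathrm{freq}_i)
  + te(\mathrm{len}_{i-1}, \mathrm{freq}_{i-1})
  + f(n_i) + \gamma\,\mathrm{fix}_{i-1} + b_r + \varepsilon_{i,r}
```

The coefficients β₁ and β₂ correspond to the two linear surprisal terms; all remaining terms are the covariates listed above.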

Analysis
We compare the predictive power of different language models for reading times by comparing the log likelihoods across GAMMs that include surprisals derived from different language models. To enable comparison of log likelihoods across models, we change two aspects of mgcv's default GAMM fitting procedure: we use maximum likelihood fitting instead of REML and we use splines with fixed degrees of freedom instead of penalized splines. We set the fixed degrees of freedom for each covariate to be slightly above the estimated degrees of freedom from a GAMM estimated in the default way (which was relatively constant across models).
To measure the added predictive power of the two linear surprisal terms in each model, we subtract the log likelihood of a model that includes only the covariates from each model's log likelihood, yielding a measure we denote ∆LogLik. (Note that because these models are in a subset relationship, 2 times ∆LogLik is a chi-square distributed deviance, as in a likelihood ratio test.) To assess the extent to which this measure of predictive power is related to the language model's linguistic quality, we correlate this ∆LogLik metric with perplexity. Additionally, since these models with linear effects of surprisal also estimate the coefficient of surprisal for predicting reading times, both for the current word's surprisal and for the prior word's, we also assess the correlation between these coefficients and the model's perplexity. To the extent that there are systematic relationships between these coefficients and the language model's linguistic quality, it may suggest that poor-quality language models cannot be trusted to accurately estimate the size of the effect of surprisal on reading times.
Figure 1: Improvements in log likelihood for linear models, charted against decreases in perplexity. Distance from the central trend line is indicative of larger departures in log likelihood as a function of perplexity. The blue line represents a linear best fit, with a coefficient of −1.66 and R² = 0.94.
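The ∆LogLik measure and its likelihood-ratio interpretation can be sketched in Python as follows, taking ∆LogLik as the surprisal model's log likelihood minus the covariate-only model's, so that larger values mean better prediction (the sign convention implied by Table 2). The function names and log likelihood values are hypothetical. The p-value formula assumes exactly two added parameters (the two linear surprisal terms), for which the chi-square survival function has the closed form exp(−x/2).

```python
import math

def delta_loglik(ll_with_surprisal, ll_covariates_only):
    """Improvement in log likelihood from adding the surprisal terms."""
    return ll_with_surprisal - ll_covariates_only

def lrt_p_value_2df(delta):
    """Likelihood ratio test p-value for 2 added parameters.

    The test statistic is 2 * delta; for a chi-square distribution with
    2 degrees of freedom the survival function is exp(-x / 2).
    """
    statistic = max(2.0 * delta, 0.0)
    return math.exp(-statistic / 2.0)

# Made-up log likelihoods for illustration only.
delta = delta_loglik(-1000.0, -1010.0)
p_value = lrt_p_value_2df(delta)
```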

Log Likelihood
As shown in Figure 1 and Table 2, there is a monotonic effect of language model quality on predictive power. Better language models (lower perplexity) yield surprisal values that better predict reading times, as seen by increased ∆LogLik. Indeed, Figure 1 shows a strikingly strong relationship between a language model's linguistic quality (measured by perplexity) and the ability of surprisal values derived from that model to predict reading times (measured by ∆LogLik). These two values have an R² of 0.94.
However, there is one relatively clear departure from this tight linear relationship. Namely, the large decrease in the perplexity going from the 5-gram model to the LSTM is not reflected in a large jump in ∆LogLik. Put another way, although there is a clear systematic relationship between language model linguistic quality and ∆LogLik, there is also some evidence for effects of language model type, such that the LSTM is less useful for predicting reading times than would be expected given its perplexity.
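The kind of linear fit summarized by the trend line and R² can be sketched with ordinary least squares, as below. The function name is hypothetical and the data points are invented for illustration; they are not the values from Table 2.

```python
def linear_fit_r2(x, y):
    """Least-squares slope, intercept, and R^2 for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Made-up, perfectly linear data: predictive power falls as perplexity rises.
perplexities = [1.0, 2.0, 3.0, 4.0]
delta_logliks = [-2.0, -4.0, -6.0, -8.0]
slope, intercept, r_squared = linear_fit_r2(perplexities, delta_logliks)
```

A negative slope corresponds to the pattern reported above: lower perplexity goes with larger ∆LogLik. An R² close to 1 indicates that the points lie near the fitted line.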

Current Word
The effects of two words' surprisal were incorporated into the GAMMs: the surprisal of the current word and the surprisal of the previous word. Despite the different models' very different perplexities, the size of the effect of surprisal was estimated very stably across language models. As seen in Figure 2, all models had surprisal coefficients around 3 (although the LSTM model is again somewhat of a low outlier). There is no clear relationship between the coefficients for the surprisal of the current word and language model quality, with both the best model (optimal interpolation) and the worst model (bigrams) having a value of 3.04.

Previous Word
As with the current word, the coefficient for the previous word's surprisal (see Table 2) bore no clear relationship with relative improvements in language model perplexity.

Non-linear effects of surprisal
In addition to the previous set of analyses analyzing the predictive power of linear effects of surprisal on reading times, we conducted another set of analyses allowing for non-linear effects of surprisal. These models also let us ask whether the shape of the estimated effect of surprisal on reading times varies with language model quality.
Table 2: As the perplexity of a language model increases, its improvement over baseline log likelihood (∆LogLik) decreases. The coefficients for both the current and previous words do not bear a consistent relationship with model perplexity.

Methodology
The primary methodology was identical to that from the previous analysis, except that instead of including linear effects of current and previous word surprisal in the GAMMs, we included cubic splines (40 d.f.) of current and previous word surprisal. For this non-linear model, since there are no coefficients of current and previous word surprisal, we also investigate the F statistic associated with the strength of each surprisal term predictor. Additionally, to analyze whether the shape of the surprisal effect differs across conditions, we fit additional GAMMs that had the same structure but were estimated in mgcv's usual way (i.e., with splines penalized and REML). These additional models were only used for visualization.
Table 3: Correlation results for metrics of predictors of linear and non-linear GAMMs

Results and discussion
When allowing for non-linear effects of surprisal, the relationship between linguistic quality and predictive power for reading times becomes even clearer. The relationship between ∆LogLik and perplexity becomes even stronger (Figure 4), with an R² of 0.98. Further, as seen in Table 4, while the F statistic for the current word surprisal is inconsistent as model perplexity improves (similar to the coefficients of surprisal in the linear models), the F statistic of the previous word is tightly related to perplexity. As perplexity of a model improves, the F statistic of the previous word improves in lockstep. This suggests that, at least in the non-linear models, many of the improvements in predictive ability may come specifically from effects of prior word surprisal. As can be seen in the GAM plots in Figures 5 and 6, there are no large differences in the shape of surprisal as language model quality improves: all look roughly linear. If a trend in shape does exist, the highest quality models (interpolation) appear to have the most linear slopes. Additionally, the slope for surprisal of the prior word appears to flatten out for LSTMs for high surprisals.
Figure 4: Improvements in log likelihood for nonlinear models, charted against decreases in perplexity. The blue line is a linear best fit line with a coefficient of −1.66, R² = 0.98.
Figure 5: GAM plots on current word using normal estimation

General Discussion
Taking all of the results together, we have shown evidence here for a strong effect of language model linguistic quality on the predictive power of surprisals estimated from that language model for reading times. This effect holds regardless of whether surprisal is modeled as a linear or nonlinear effect. Despite this clear relationship with linguistic quality in terms of predictive power, we also saw remarkable consistency. Across language models that varied by more than a factor of 4 in perplexity, the size of the effect of surprisal was estimated to be similar and the shape of the effect of surprisal was estimated to be roughly linear. These results suggest that we can put a reasonable amount of trust in results about surprisal estimated with computational language models, despite the state-of-the-art still being far from human quality.
Figure 6: GAM plots on previous word using normal estimation
In addition, the way that the language models were composed seems to play a role in their fit to the data. The LSTM-based model does seem to be somewhat of a low-performing outlier. However, when the LSTM model is used with the 5-gram model in interpolation, the combination yields superior results. Therefore, although a purely LSTM-based model does not predict reading time as well as other models, it provides a good fit for the data. When used in conjunction with a count-based model, this combination provides more accurate predictions of the reading time data.
A number of studies have used the Dundee eyetracking corpus in conjunction with a probabilistic language model. Demberg & Keller (2008), using less sophisticated linear models, found that surprisal is an accurate measure of processing complexity as measured by eye gaze duration. According to Demberg & Keller (2008), greater word surprisal invokes higher "integration costs," which accounts for prolonged gaze duration.
In a neural network language model, word dependencies can span an arbitrary word distance, i.e. not all dependencies are contingent upon adjacent words or even a neighboring word. For example, ellipsis can span multiple clause boundaries to resolve an anaphoric relationship. For this [...] Fossum & Levy (2012), on the other hand, make various modifications to the models used in Frank & Bod (2011), adding additional lexical information to the unlexicalized hierarchical models. Fossum & Levy (2012) conclude that hierarchical information, when properly lexicalized, can improve sequence-only lexical models. Similarly, Mitchell et al. (2010) created a model that interpolates syntactic and distributional semantic information, and found that this improved the prediction of eye tracking durations.
As this bears on the present study, the LSTM model is able to detect word relationships that span arbitrary distances. While the LSTM model is not explicitly representing hierarchical information, the model does capture long distance information. Our results show that the LSTM model outperforms the purely n-gram models in terms of predictive capabilities. Thus, while we do not need to build hierarchical information explicitly into our model, the long-distance information does improve both linguistic and psychological accuracy. This could point to the conclusion that eye gaze duration is also sensitive to, if not hierarchical information, then information provided at a long distance from the current word.
In a similar vein to our results, Monsalve et al. (2012) show that the perplexity of a language model (linguistic accuracy) bears a strong relationship to the log likelihood of a reading time model (psychological accuracy). The key differences between that study and ours are that Monsalve et al. (2012) analyze self-paced reading data rather than eyetracking, and that we use higher-performing state-of-the-art language models.
Finally, the present study can, in many respects, be viewed as a follow-up to Smith & Levy (2013). Smith & Levy (2013) measured the shape of the surprisal curve, similar to our experiment in Section 4; however, the present study demonstrates that the effect of surprisal is still linear even with much more (linguistically and psychologically) accurate language models. As many studies have noted (Monsalve et al., 2012; Frank et al., 2013), a corpus such as the Dundee corpus, collected from newspapers, often requires a great deal of global, extra-sentential context. Therefore, when processing a given sentence, the reader must also take into account information provided many sentences prior, or even not provided in the document at all. This limitation could impact the results reported herein.
Despite possible limitations, the results above provide consistent evidence that improving the linguistic accuracy of language models will improve the models' ability to make psychological predictions. This underscores the importance of understanding language structure in order to better understand cognitive processes such as eye gaze duration.