Accounting for Agreement Phenomena in Sentence Comprehension with Transformer Language Models: Effects of Similarity-based Interference on Surprisal and Attention

We advance a novel explanation of similarity-based interference effects in subject-verb and reflexive pronoun agreement processing, grounded in surprisal values computed from a pretrained large-scale Transformer model, GPT-2. Specifically, we show that surprisal of the verb or reflexive pronoun predicts facilitatory interference effects in ungrammatical sentences, where a distractor noun that matches the verb or pronoun in number leads to faster reading times, despite the distractor not participating in the agreement relation. We review the human empirical evidence for such effects, including recent meta-analyses and large-scale studies. We also show that attention in the Transformer (indexed by entropy and other measures) becomes more diffuse in the presence of similar distractors, consistent with cue-based retrieval models of parsing. But in contrast to these models, the attentional cues and memory representations are learned entirely from the simple self-supervised task of predicting the next word.


Introduction
Deep Neural Network (DNN) language models (LeCun et al., 2015; Sundermeyer et al., 2012; Vaswani et al., 2017) have recently attracted the attention of researchers interested in assessing their linguistic competence (Da Costa and Chaves, 2020; Ettinger, 2020; Wilcox et al., 2019) and their potential to provide accounts of psycholinguistic phenomena in sentence processing (Linzen and Baroni, 2021; Van Schijndel and Linzen, 2018; Wilcox et al., 2020). In this paper we show how attention-based transformer models (we use a pre-trained version of GPT-2) provide the basis for a new theoretical account of facilitatory interference effects in subject-verb and reflexive agreement processing. These effects, which we review in detail below, have played an important role in psycholinguistic theory because they show that properties of noun phrases that are not the grammatical targets of agreement relations may nonetheless exert an influence on processing time at points where those agreement relations are computed.
The explanation we propose here is a novel one grounded in surprisal (Hale, 2001; Levy, 2008), but with origins in graded attention and similarity-based interference (Van Dyke and Lewis, 2003; Lewis et al., 2006; Jäger et al., 2017). We use surprisal as the key predictor of reading time (Levy, 2013), and through targeted analyses of patterns of attention in the transformer, show that the model behaves in ways consistent with cue-based retrieval theories of sentence processing. The account thus provides a new integration of surprisal and similarity-based interference theories of sentence processing, adding to a growing literature of work integrating noisy memory and surprisal (Futrell et al., 2020). In this case, the noisy representations arise from training the transformer, and interference must exert its influence on reading times through a surprisal bottleneck (Levy, 2008).
The remainder of this paper is organized as follows. We first provide an overview of some of the key empirical work in human sentence processing concerning subject-verb and reflexive pronoun agreement. We then provide a brief overview of the GPT-2 architecture, its interesting psycholinguistic properties, and the method and metrics that we will use to examine the agreement effects. We then apply GPT-2 to the materials used in several different human reading time studies. We conclude with some theoretical reflections, identification of weaknesses, and suggestions for future work.

Agreement Interference Effects in Human Sentence Processing
One long-standing focus of work in sentence comprehension is understanding how the structure of human short-term memory might support and constrain the incremental formation of linguistic dependencies among phrases and words (Gibson, 1998; Lewis, 1996; Lewis et al., 2006; Miller and Chomsky, 1963; Nicenboim et al., 2015). A key property of human memory thought to shape sentence processing is similarity-based interference (Miller and Chomsky, 1963; Lewis, 1993, 1996). Figure 1 shows a simple example of how such interference arises in cue-based retrieval models of sentence processing, as a function of the compatibility of retrieval targets and distractors with retrieval cues (Lewis and Vasishth, 2005; Lewis et al., 2006; Van Dyke and Lewis, 2003). Inhibitory interference effects occur when the features of the target perfectly match the retrieval cue and the features of a distractor partially match, while facilitatory interference effects occur when the features of both target and distractor partially match the features of the retrieval cue.
In this study, we focus on interference effects in subject-verb number agreement and reflexive pronoun-antecedent agreement, specifically in languages where the agreement features include syntactic number which is morphologically marked on the verb or pronoun. In such cases, number is plausibly a useful retrieval cue, and it is easy to manipulate the number of distractor noun phrases to allow for carefully controlled empirical contrasts.
Interference in subject-verb agreement. Previous studies (Pearlmutter et al., 1999; Wagers et al., 2009; Dillon et al., 2013; Lago et al., 2015; Jäger et al., 2020) attest to both inhibitory interference (slower processing in the presence of an interfering distractor) and facilitatory interference (faster processing in the presence of an interfering distractor), but the existing empirical support for inhibitory interference is weak, and many studies fail to find any evidence for it (Dillon et al., 2013; Lago et al., 2015; Wagers et al., 2009). There is stronger evidence for facilitatory effects, which arise in ungrammatical structures where the verb or pronoun fails to agree in number with the structurally correct target noun phrase, but where either an intervening or preceding distractor noun phrase does match in number. Example A below, taken from Wagers et al. (2009), illustrates this; the subject is "slogan", the verb is "were", and the distractor noun is "posters":

A. The slogan on the posters were designed to get attention.

Figure 1: How facilitatory and inhibitory interference effects arise in subject-verb dependency creation in cue-based retrieval parsing. The critical manipulation concerns the overlap of number feature between the distractor, target, and retrieval cue.
A Bayesian meta-analysis of agreement phenomena was recently conducted with an extensive set of studies (Jäger et al., 2017; Vasishth and Engelmann, 2021). Their analysis of first-pass reading times from eye-tracking experiments on subject-verb number agreement is shown in Figure 1. The evidence from the meta-analysis is consistent with a very small or nonexistent inhibitory interference effect in the grammatical conditions, and with a small but robust facilitatory interference effect in the ungrammatical conditions. Concerned that the existing experiments did not have sufficient power to detect the inhibitory effects, Nicenboim et al. (2018) ran a large-scale eye-tracking study (185 participants) with materials designed to increase the inhibition effect, and did detect a 9 ms effect (95% credible posterior interval 0-18 ms). This represents the strongest evidence to date for inhibitory effects in grammatical agreement structures, but even this evidence indicates the effect may be near zero.
Interference in reflexive pronoun agreement. Example B below shows a pair of sentences from Dillon et al. (2013) used to probe facilitatory effects in reflexive pronoun agreement; the target antecedent is "coach", the pronoun is "themselves", and the distractor is "players" or "player":

B. (1) interfering: The basketball coach who trained the star players usually blamed themselves for the ...
   (2) non-interfering: The basketball coach who trained the star player usually blamed themselves for the ...
The empirical record concerning facilitatory effects in reflexive agreement is mixed. Some have claimed that such effects do not arise (Sturt, 2003; Xiang et al., 2009; Dillon et al., 2013), and that this is expected under a model in which the structural constraints from binding theory (Chomsky et al., 1982) serve to effectively filter candidates for retrieval; in short, the parser does not consider or make contact with the ungrammatical distractor noun phrases (Sturt, 2003; Dillon et al., 2013).
However, a recent Bayesian meta-analysis of key experiments by Dillon et al. (2013) indicates substantially overlapping posterior estimates of facilitatory effects for subject-verb agreement and reflexive agreement (Vasishth and Engelmann, 2021). Concerned again about under-powered studies, Jäger et al. (2020) undertook a large-scale (181 participants) eye-tracking replication and did find evidence for nearly equivalent facilitatory speedups for reflexive and subject-verb agreement (Figure 3). This result is not inconsistent with the meta-analysis, but provides stronger evidence that the facilitation effects in reflexives are real.
We take advantage of the very broad coverage of GPT-2 by having GPT-2 process the same set of sentence materials as human subjects in four different agreement experiments. To anticipate our key results, we find that GPT-2 yields lower surprisal, i.e., facilitatory effects, in both subject-verb and reflexive pronoun conditions. Furthermore, we show that attention at the verb or pronoun is distributed to both target and distractor in just those conditions where the distractor matches the hypothesized number retrieval cue (Lin et al., 2019). Finally, we show that the surprisal contrasts between matching and non-matching distractors in the grammatical (inhibitory) interference conditions are essentially zero.

GPT-2 for Psycholinguistic Analysis
The psycholinguistic relevance of GPT-2 and its training method. GPT-2 (Generative Pre-trained Transformer-2), introduced by OpenAI in Radford et al. (2019), is a language model with a decoder-only Transformer architecture (Vaswani et al., 2017), and has achieved state-of-the-art performance in diverse downstream tasks. GPT-2 and other large-scale language models based on transformer architectures were trained on billions of words of text, and engineered with performance in mind, not with concern for psycholinguistic plausibility. Why then should we take them seriously as the basis of psycholinguistic models?
We believe that the new transformer-based models have three important properties that make them of psycholinguistic interest. (a) The models are among the first to serve as the basis of systems that achieve human-level performance on a range of linguistic tasks, and they directly generate a key quantity, surprisal of the next word, that we know is an important predictor of reading times in humans (Hale, 2001; Levy, 2008). (b) Although the data requirements are currently much greater than those for human language acquisition, the models are trained on a simple task (predict the next word) that may plausibly serve as the basis of a self-supervised learning signal in human language acquisition. The representations that arise from such learning are thus psycholinguistically interesting. (c) The learned soft-attention and parallel content-based retrieval of representations of prior input are architectural properties of the GPT models that align very closely with retrieval-based models of sentence comprehension (Lewis et al., 2006). And the structure of these psycholinguistic models was proposed as a response to the challenges of computing long-distance dependencies, the same challenge that motivated the transformer as a departure from standard recurrent architectures (Vaswani et al., 2017; Galassi et al., 2020).
Identifying specialized heads in GPT-2. Here we use the medium-sized GPT-2, which is constructed with 12 layers, each of which includes 12 attention heads. Previous studies have revealed that individual attention heads in Transformer models are at least partially specialized in function (Clark et al., 2019; Vig, 2019; Vig and Belinkov, 2019; Voita et al., 2019). Specifically, Voita et al. (2019) found that certain attention heads are specialized for different dependency relations.
Following the method of Voita et al. (2019), we identified heads that are specialized for subject-verb relations and reflexive anaphora resolution. The method works as follows. First, sentences are parsed using the CoreNLP dependency parser (Manning et al., 2014). Then, the relative string positions (e.g., one token back, two tokens back) of all instances of each syntactic dependency are counted. Taking the proportion of the most frequent relative position as the baseline, an attention head is selected as specialized for a particular dependency relation if it pays attention to the corresponding dependent at least 10% more often than the baseline. In other words, there must be some evidence that the attention head is sensitive to the dependency and not merely to string position.
To find attention heads responsible for the relation between subjects and verbs, we used the CoreNLP parser on 148,376 sentences from the Brown corpus and Gutenberg corpus provided via the Natural Language Toolkit (NLTK) (Bird et al., 2009), extracting 49,145 nsubj relations, which associate nominal subjects with their governors, which are mostly verbs. The most frequent relative position for the nsubj dependency relation is -1, which means that nominal subjects usually come right before their governor; this position accounts for 42% of the cases.
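The positional baseline described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `relative_position_baseline` and the toy token indices are hypothetical, and each pair is assumed to hold the token index of the governor and of its dependent.

```python
from collections import Counter


def relative_position_baseline(pairs):
    """Return (most frequent offset, its proportion) for dependency pairs.

    `pairs` is an iterable of (governor_index, dependent_index) token
    positions; the offset is dependent minus governor, so -1 means the
    dependent sits immediately before its governor.
    """
    offsets = Counter(dep - gov for gov, dep in pairs)
    offset, count = offsets.most_common(1)[0]
    return offset, count / sum(offsets.values())


# Hypothetical toy data: five (governor, dependent) index pairs.
toy_pairs = [(1, 0), (3, 2), (5, 2), (4, 3), (6, 1)]
offset, share = relative_position_baseline(toy_pairs)
print(offset, round(share, 2))
```

On real parser output, the returned proportion plays the role of the 42% baseline reported for nsubj.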
After analyzing the attention distribution pattern using GPT-2, we obtained four syntactic heads that were found to be partly specialized for nsubj dependency relations: head4_3 (59%); head3_6 (51%); head6_0 (49%); head2_9 (49%). Although we expect that the four syntactic heads responsible for the nsubj dependency relation may play distinct roles, in our analyses here we simply use the best performing head (head4_3).
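The selection rule can be illustrated with the reported numbers. The sketch below assumes the 10% margin is relative to the baseline, which is consistent with the 22% to 24.2% threshold reported for reflexives. The function names are hypothetical, and `head0_0` with 30% accuracy is an invented non-selected head added only for contrast.

```python
def specialization_threshold(baseline, margin=0.10):
    """Accuracy a head must reach: the positional baseline plus a relative margin."""
    return baseline * (1.0 + margin)


def select_heads(head_accuracy, baseline, margin=0.10):
    """Return heads at or above the threshold, best performing first."""
    threshold = specialization_threshold(baseline, margin)
    selected = {h: a for h, a in head_accuracy.items() if a >= threshold}
    return sorted(selected, key=selected.get, reverse=True)


# Accuracies reported in the text for nsubj; head0_0 is a made-up example.
nsubj_accuracy = {
    "head4_3": 0.59,
    "head3_6": 0.51,
    "head6_0": 0.49,
    "head2_9": 0.49,
    "head0_0": 0.30,
}
selected = select_heads(nsubj_accuracy, baseline=0.42)
print(selected)
```

With a 42% baseline the threshold is 46.2%, so the four reported heads pass and the invented one does not.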
The same method was used to find attention heads responsible for reflexive anaphora resolution. The only difference was that we used NeuralCoref (Wolf et al., 2018) to count the relative positions of antecedents to reflexive anaphora, since the dependency parser does not associate antecedents and anaphora. Out of 2,660 sentences that include reflexive anaphora, we extracted 510 sentences where NeuralCoref identified a single unique antecedent for the reflexive pronoun. The most frequent relative position for reflexive anaphora and their antecedents was -2, meaning that the antecedent appears before the reflexive with one word in between. The proportion of this most frequent relative position was 22%, so an attention head needed an accuracy of at least 24.2% to be considered responsible for reflexive anaphora resolution. We found four heads whose accuracies are higher than the threshold: head1_5 (44%); head3_5 (39%); head4_3 (27%); head6_0 (25%), and we again take the best performing head (head1_5) for further analysis.

Table 1 (excerpt): Exp 1 agrmt materials.
                  ... apparently was dishonest ...
int ungram        *The executive who oversaw the middle managers apparently were dishonest ...
non-int ungram    *The executive who oversaw the middle manager apparently were dishonest ...
Metrics. We define here three metrics for our analyses: surprisal, attention entropy from syntactic heads, and attention to target. We use surprisal for making reading time predictions, but use the attention metrics to provide insight into the processing at the critical region and therefore the representations computed in the prefix before the critical region. Surprisal is thus based on the final prediction of the entire model, but the attention metrics are associated with the attention heads most specialized for our dependencies of interest.
Surprisal (Hale, 2001; Levy, 2008) is defined as the negative log probability of the word given the left context:

$$\mathrm{Surprisal}(w) = -\log_2 P(w \mid \mathrm{context}) \qquad (1)$$

Any use of surprisal requires adoption of some kind of language model; e.g., some past work has used probabilistic CFGs (Levy, 2008). Here we use GPT-2, which computes after each word a probability distribution over its large lexicon that is conditioned on its internal representation of the left context.
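As a concrete illustration of Equation 1, the sketch below computes surprisal in bits either from a probability or from a vector of unnormalized next-word scores (logits) via a numerically stable log-softmax. The function names and the toy logits are assumptions for illustration; obtaining the actual logits from GPT-2 is not shown.

```python
import math


def surprisal_bits(prob):
    """Surprisal in bits of an outcome with probability `prob`."""
    return -math.log2(prob)


def surprisal_from_logits(logits, target_index):
    """Surprisal in bits of the target item given unnormalized scores.

    Uses the log-sum-exp trick so that large logits do not overflow.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = logits[target_index] - log_z  # natural-log probability
    return -log_p / math.log(2)           # convert nats to bits


# Toy example: four equally likely candidates, so each carries 2 bits.
print(surprisal_bits(0.25))
print(surprisal_from_logits([0.0, 0.0, 0.0, 0.0], target_index=2))
```

A word that is half as probable as another has a surprisal exactly one bit higher, which is the unit used in the corpus comparison later in the paper.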
Attention to target is simply the value of the soft attention vector element that corresponds to the target word position, which we denote Attn(w_cue, w_target); it indicates how much attention is allocated to the target by one of the specialized attention heads (head4_3 for subject-verb and head1_5 for reflexives). Attention entropy is a variant of Shannon (1948)'s information entropy that we use as a measure of how sharply focused (low entropy) or diffuse (high entropy) the attention pattern is. (It may be thought of as a measure of the uncertainty about the attentional target, but because the attention values are not probabilities from which targets are sampled, this interpretation is not strictly warranted.)
In the attention entropy computation, i refers to the location of the critical word, j ranges over the locations of prior words, and Attn(w_i, w_j) is the attention allocated to w_j from w_i.
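The two attention metrics can be sketched as below. This is an assumption-laden illustration: it takes attention entropy to have the standard Shannon form, minus the sum over prior positions j of Attn(w_i, w_j) log2 Attn(w_i, w_j), which may differ in detail from the variant used in the paper. The function names and the two toy attention rows are hypothetical.

```python
import math


def attention_to_target(attn_row, target_index):
    """Attention weight that the critical word places on the target position."""
    return attn_row[target_index]


def attention_entropy(attn_row):
    """Shannon-style entropy (bits) of one head's attention from the critical word.

    `attn_row` holds Attn(w_i, w_j) for each position j; zero weights
    contribute nothing, following the usual 0 * log 0 = 0 convention.
    """
    return -sum(a * math.log2(a) for a in attn_row if a > 0.0)


# Hypothetical rows: position 1 is the target, position 3 the distractor.
focused = [0.05, 0.85, 0.05, 0.05]  # attention concentrated on the target
diffuse = [0.05, 0.45, 0.05, 0.45]  # attention split between target and distractor

for name, row in (("focused", focused), ("diffuse", diffuse)):
    print(name, round(attention_entropy(row), 3), attention_to_target(row, 1))
```

In this toy case the split row has higher entropy and lower attention to target, which is the qualitative signature the analyses look for in interfering conditions.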

Subject-verb Agreement Experiments
To investigate whether GPT-2 may predict facilitatory interference effects in subject-verb agreement, we ran GPT-2 on materials from three studies (Dillon et al., 2013; Wagers et al., 2009; see Table 1).
These three sets of sentences have in common a 2 × 2 structure with the factors grammaticality (grammatical/ungrammatical) and interference (interfering/non-interfering), as described above. Exp 3 of Wagers et al. (2009) also includes an additional condition, subject (singular/plural), for investigating a possible singular-plural asymmetry, i.e., asking whether interference effects are equivalent for plural (for plural verbs) and singular (for singular verbs) distractors.
Note that sentences from Experiments 2-3 in Wagers et al. (2009) involve structures in which the distractor appears before the target, and so test effects of proactive interference. Thus the distractors are also more distant from verbs than in the other experimental materials.
Results of surprisal analyses. Figure 4 shows the surprisal computed at the critical verbs in each of the experiments and in each of the four conditions separately (red dots and intervals represent means and conventional 95% confidence intervals). Surprisal matches the important qualitative pattern found in the meta-analysis of first-pass reading times: lower surprisal (i.e., facilitatory effects) is found in the ungrammatical conditions when the distractor matches the verb's number, and no inhibitory effects are found in the grammatical conditions. Furthermore, the effects are largest for the case of retroactive interference, where the distractor follows the target and immediately precedes the verb (Figure 4a), compared to proactive interference, where the distractor precedes the target (Figure 4c). The exception is that no facilitatory effects were found when the verb is singular and the target subject is plural (see Figure 4d). But the facilitatory effect in this condition was not reliably different from zero in the meta-analysis, and it mirrors a plural-singular asymmetry (or markedness effect) found in agreement attraction in production.

Results of attention analyses. Our conjecture is that in the interfering conditions where the distractor matches the verb in number, the attention of the nsubj-specialized attention head head4_3 will be distributed to both the target and the distractor. It is possible to visualize exactly this pattern using a tool developed by Vig (2019). Figure 5 shows an example visualization.
Analyses of the attention entropy and attention to target metrics provide quantitative evidence for this conjecture: Figure 6 shows the two metrics across the four datasets. The interfering conditions always show the highest value of attention entropy and the lowest value of attention to target, which means that the head most specialized for subject-verb relations distributes attention more diffusely and away from the target subject. There is evidence for the expected attention effects even in the grammatical conditions, but in these conditions there is no effect on surprisal. Thus, under a theory in which similarity-based interference exerts its effects on reading time through a surprisal bottleneck (Levy, 2008), no reading time differences are expected here, even though the underlying representations and attention patterns may reflect the interference.
Preliminary corpus analysis of ungrammatical subject-verb agreement sentences. One possible explanation for the observed facilitatory interference effects is that GPT-2 was exposed to ungrammatical sentences in the training data that have precisely the interference patterns of the ungrammatical sentences in our experiments. To examine this possibility, we analyzed 241 sentences randomly extracted from a Reddit corpus (Chang et al., 2020) whose subjects and verbs do not agree in number and that have either interfering or non-interfering distractors in between. The results shown in Table 2 suggest that interfering distractors occur about twice as often as non-interfering distractors in the case of singular subjects with an ungrammatical plural verb, consistent with our expectations that agreement-attraction errors in production may be evident in unedited corpora.
                   singular subj    plural subj
interfering        80               71
non-interfering    39               51

Table 2: Results from a preliminary corpus analysis of patterns of ungrammatical subject-verb agreement. In the key case of a singular subject and a plural verb, the number of an intervening distractor is about twice as likely to be plural (interfering) rather than singular (non-interfering). See text for a discussion.

But it seems unlikely that this 2:1 ratio, which corresponds to about a 1 bit difference in surprisal, is sufficient alone to explain the observed surprisal differences. For example, in the Wagers et al. Experiment 4-6, we observed about a 3 bit difference in surprisal, a 2 bit or 4x difference in probability relative to what would be expected on the basis of the corpus counts. More extensive corpus analysis is necessary to confidently rule out this explanation.
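The arithmetic behind this comparison can be checked directly from the Table 2 counts for singular subjects. The sketch below is illustrative only: it treats the count ratio as a probability ratio and takes the observed difference to be the approximate 3 bits quoted in the text; the variable names are hypothetical.

```python
import math

# Table 2 counts for singular subjects with an ungrammatical plural verb.
interfering = 80
non_interfering = 39

# A probability ratio r corresponds to a surprisal difference of log2(r) bits.
corpus_ratio = interfering / non_interfering
corpus_bits = math.log2(corpus_ratio)

# Approximate surprisal difference reported in the text.
observed_bits = 3.0

# What remains unexplained by corpus frequency, in bits and as a probability factor.
residual_bits = observed_bits - corpus_bits
residual_factor = 2 ** residual_bits

print(f"corpus ratio     {corpus_ratio:.2f}")
print(f"corpus bits      {corpus_bits:.2f}")
print(f"residual bits    {residual_bits:.2f}")
print(f"residual factor  {residual_factor:.2f}")
```

The corpus ratio accounts for roughly 1 bit, leaving close to 2 bits, or about a fourfold probability difference, beyond what the counts alone would predict.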

Reflexive Agreement Experiments
To examine whether the predictions of GPT-2 are consistent with the null interference effects argued for by Dillon et al. (2013), or show facilitatory interference effects as in the large-scale Jäger et al. (2020) replication, we conducted an experiment using the same methodology as described above for the subject-verb experiments, but using the reflexive materials in Dillon et al. (2013), and focusing the attention analyses on the head most specialized for reflexive anaphor resolution. Examples of the materials are shown in Table 3.

Results of the attention analyses. We found little or no difference between interfering and non-interfering cases in the two attention metrics, attention entropy and attention to target. It is possible that this is because the attention head head1_5, which we found to be partly specialized for reflexive anaphora resolution, is actually not as specialized for reflexive anaphora resolution as head4_3 is for nsubj dependency resolution. We cannot conclude yet whether there exist heads that serve this function better (that are not detected by the method of Voita et al. (2019)), whether GPT-2 is not reliably resolving the reflexive anaphora, or whether GPT-2 is doing so in a way that is distributed across many attention heads.

Dillon 2013 Exp 1 reflexive
Interference    Grammaticality    Example sentences
int             gram              The basketball coach who trained the star player usually blamed himself for the ...
non-int         gram              The basketball coach who trained the star players usually blamed himself for the ...
int             ungram            *The basketball coach who trained the star players usually blamed themselves for the ...
non-int         ungram            *The basketball coach who trained the star player usually blamed themselves for the ...

Table 3: Examples from Dillon et al. (2013), used in the GPT-2 experiment on reflexive pronoun agreement.

Figure 7: Results of the GPT-2 reflexive agreement experiment using materials from Dillon et al. (2013).

Discussion and Future Directions
Effects of similarity-based interference have been the province of models of noisy memory rather than models of probabilistic expectations, because in standard probabilistic grammars the expectation for the agreement features of a licensor such as a verb or pronoun should not be conditioned upon the agreement features of constituents other than the target licensee. But we show here that a large-scale Transformer language model, GPT-2, trained only to predict the next word, nevertheless yields surprisal values that are consistent with facilitatory interference effects due to distractor noun phrases that do not participate in the agreement relations. We also confirmed that two metrics that are easily computed from the Transformer's attention mechanism, attention entropy and attention to target, show patterns in the subject-verb experiments that are consistent with cue-based retrieval models.
Our results are suggestive of a possible interesting link between surprisal and noisy memory representations. The attention patterns that we have discovered must reflect similarity between the representations of the target and distractor noun phrases. This representational similarity is the source of great generalization power, but this generalization can lead to linguistic expectations that are not derived by conventional grammatical analyses.
One limitation of our analyses of attention is that they depend on methods for identifying specialized heads for specific dependency types. It is not clear that we understand enough about Transformer models to do this reliably. But our results suggest that for at least some dependencies, these simple attention metrics and head selection methods can yield interesting insights.
The approach outlined here may provide an important way to combine surprisal and noisy memory accounts while maintaining a surprisal bottleneck. Using trained Transformers has the significant theoretical advantage that the memory representations, the attention/retrieval cues, and thus the predicted similarity effects are learned via a self-supervised prediction task. And so such models naturally yield experience-driven sources of noisy representations that are independent of the process noise assumed in existing memory-based models. Combining process- and experience-based noise in a single model is an important goal for psycholinguistic theory.