Predicting Reference: What Do Language Models Learn about Discourse Models?

Whereas there is a growing literature that probes neural language models to assess the degree to which they have latently acquired grammatical knowledge, little if any research has investigated their acquisition of discourse modeling ability. We address this question by drawing on a rich psycholinguistic literature that has established how different contexts affect referential biases concerning who is likely to be referred to next. The results reveal that, for the most part, the prediction behavior of neural language models does not resemble that of human language users.


Introduction
The impressive power of deep learning based language models has inspired a new line of computational psycholinguistics research that examines the extent to which linguistic knowledge lies latent within their distributed networks. This work has primarily focused on linguistic phenomena that syntactic theory tells us require syntactic knowledge to capture, with mixed results (Linzen et al. 2016; Lau et al. 2017; Goldberg 2019; Warstadt et al. 2019; inter alia). This paper asks a new question: to what extent do these language models capture the linguistic knowledge required to perform discourse modeling?
We are unaware of any work that has addressed this question directly. Perhaps the closest research has centered on the Winograd Schema Challenge (WSC) (Levesque et al., 2012), which evaluates the ability of systems to employ world knowledge to interpret ambiguous pronouns in minimal pairs that resemble Winograd's famous example (1).
(1) The city councilmen refused the demonstrators a permit because
a. they feared violence. [they = city council]
b. they advocated violence. [they = demonstrators]
However, WSC is essentially a 'fill in the blank' problem-solving task, and doesn't evaluate the extent to which systems display humanlike ability to model discourse in an online, incremental fashion. We instead take our inspiration from psycholinguistic work that has focused on this question. For instance, the Bayesian Model of pronoun interpretation (Kehler et al., 2008; Kehler and Rohde, 2013) posits that comprehenders resolve the meaning of a pronoun via Bayesian principles by combining their estimates of the speaker's production biases (the LIKELIHOOD term) with their top-down expectations about which entities are likely to be mentioned next (the PRIOR term, which we refer to as the NEXT-MENTION BIAS). Kehler and Rohde (2013) demonstrate that an array of semantic biases (e.g., verb semantics) and pragmatic biases (e.g., coherence relations) that have been claimed to influence pronoun interpretation directly actually do so only indirectly, by conditioning the prior. The role of the prior in the Bayesian Model is directly analogous to its role in Bayesian approaches to tasks such as speech recognition and machine translation, where a language model provides the prior probabilities. We argue that the ability to capture the influence of context on next-mention biases is thus a particularly appropriate task for evaluating the extent to which language models capture discourse modeling knowledge.
Our focus will be on effects of verb semantics that the psycholinguistic literature has shown to influence next-mention biases. These studies have used a PASSAGE COMPLETION paradigm, in which experimental participants are presented with context clauses followed by either a full stop (2a) or a conjunction (2b-c), and asked to complete the passage with the first follow-on sentence that comes to mind.
(2) a. John impresses Mary.
b. John impresses Mary because
c. John impresses Mary, and as a result
Analysis of the completions yields estimates of next-mention biases and of referential form production. In the task described in §3, we will probe the next-mention biases produced by two language models in different contexts that we describe now.
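For reference, the Bayesian Model mentioned above can be written as a direct application of Bayes' rule. This is a sketch in the spirit of Kehler et al. (2008); the symbol names are illustrative choices rather than notation taken from this paper.

```latex
P(\text{referent} \mid \text{pronoun}) =
  \frac{P(\text{pronoun} \mid \text{referent}) \, P(\text{referent})}
       {\sum_{r \in \text{referents}} P(\text{pronoun} \mid r) \, P(r)}
```

Here P(referent) is the PRIOR (the next-mention bias) and P(pronoun | referent) is the LIKELIHOOD (the production bias). The experiments below probe only the prior term.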

Comparisons and Predictions
If neural language models latently acquire discourse modeling knowledge, they should be able to distinguish between contexts that are superficially similar but which are known from psychological research to yield significant effects on next-mention biases. We focus on three such contrasts.

Implicit Causality Verbs
The first comparison is between two kinds of so-called IMPLICIT CAUSALITY (IC) verb, exemplified in (3a-b).
(3) a. John aggravated Mary. [IC1]
b. John praised Mary. [IC2]
Sentences with IC verbs generate an expectation that the follow-on sentence will participate in an Explanation coherence relation, in which the second sentence provides a cause or reason for the eventuality described by the first (Kehler et al., 2008). However, the two types differ in which event participant causality is attributed to. IC1 verbs (3a) have been experimentally shown to generate a strong expectation that the preceding subject will be mentioned next in the follow-on sentence: we heard that John is aggravating, and we now expect to hear why (Garvey and Caramazza 1974; Caramazza et al. 1977; Brown and Fish 1983; Au 1986; McKoon et al. 1993; Koornneef and van Berkum 2006; Kehler et al. 2008; inter alia). IC2 verbs (3b), on the other hand, have been shown to generate a strong expectation that the preceding object will be mentioned next in the follow-on sentence: we heard that Mary received praise, and we now expect to hear why. We can then ask: do IC1 and IC2 verbs generate different expectations in language models for next mention in otherwise identical contexts?
There are also subsidiary predictions regarding the use of connective prompts as in (2b-c). For both types of IC verbs, because prompts strengthen their biases, since virtually 100% of the continuations will now be Explanations rather than the 60% found in full stop prompt conditions (Kehler et al., 2008). So we expect to see a higher probability of next-mention of the subject with because prompts for IC1 verbs, and likewise for objects for IC2 verbs. Both types of IC verb, however, are known to have a strong bias to the object in Result coherence relations (Stewart et al., 1998; Kehler et al., 2008), in which the follow-on describes an effect rather than a cause, and which are enforced by the and as a result prompt. For IC1 verbs, therefore, we should see a strong shift toward the object with and as a result prompts compared to full stop prompts.
To summarize the predictions:
1a. IC1 contexts with full stop prompts should display a stronger next-mention bias to the subject compared to IC2 contexts.
1b. Contexts with because prompts should strengthen the next-mention bias associated with each type of verb compared to full stops.
1c. And as a result prompts in IC1 contexts should result in a greater next-mention bias toward the object compared to full stops.

Motion vs. Transfer of Possession Verbs
The second comparison is between Motion (4a) and Transfer-of-Possession (ToP) verbs (4b).
(4) a. The man jogged to the woman. [Motion]
b. The man handed a gift to the woman. [ToP]
These sentence types are superficially similar: they each have a grammatical subject that functions as a thematic Agent/Source, and a grammatical object-of-preposition that functions as a thematic Goal. However, they are known to yield very different next-mention biases. Specifically, previous studies have revealed that whereas motion verbs have a strong next-mention bias toward the previous subject (e.g., 84.4% in a study run by Stevenson et al. (1994)), ToP contexts give rise to a distribution that's closer to 50/50 (51.0%). The reason is that the Goal in ToP sentences functions not only as a location but a recipient as well, leading to an expectation that we'll next hear about what the recipient did with the object of transfer, which counteracts the typical subject bias. We thus expect to see a much stronger next-mention bias toward the subject for Motion contexts as compared to ToP contexts, despite their superficially similar properties. Further, we expect a large effect of the connective conditions: previous work (Stevenson et al., 1994; Kehler et al., 2008) has shown Explanations to be strongly biased to the Source, and Result continuations to be strongly biased to the Goal for ToP contexts.
To summarize the predictions:
2a. Motion contexts with full stop prompts should display a stronger next-mention bias to the subject compared to ToP contexts.
2b. ToP contexts with because prompts should yield a stronger bias toward the subject compared to full stop prompts.
2c. ToP contexts with and as a result prompts should yield a stronger bias to the object compared to full stop prompts.

Aspectual Marking with Transfer of Possession Verbs
The final comparison varies aspectual marking rather than the semantic class of the verb. Kehler et al. (2008) compared ToP contexts in the perfective such as (4b) with otherwise identical sentences in the imperfective (5):
(5) The man was handing a gift to the woman.
This comparison gives rise to the following prediction:
3. Imperfective ToP contexts should display a stronger next-mention bias to the subject compared to perfective ToP contexts.

Experimental Setup
We evaluated two state-of-the-art, pre-trained autoregressive language models (LMs): GPT-2 large (Radford et al., 2018) and Transformer-XL (Dai et al., 2019). The experiments were conducted in a zero-shot setting, and the task of generating continuations was reformulated as a next-word prediction task. Prior to tokenization, the input stimulus was prepended with a token indicating the beginning of the sentence. Additionally, the inputs for Transformer-XL were prepended with padding text to account for the shorter stimulus length.
To capture the diversity of ways in which event participants can be mentioned in the context sentence, the twelve frames shown in Table 1 were used. In order to balance for the effects of gender (Zhao et al., 2018; Bordia and Bowman, 2019), each frame was used again with the order of the event participants reversed, for a total of 24 frames. In total, 20 IC1 verbs, 20 IC2 verbs, 18 Motion verbs, and 18 ToP verbal complexes (in both perfective and imperfective variants) were each run in the full stop prompt, because prompt, and and as a result prompt conditions, in each of the 24 frames.
After presenting a pairing of a context sentence and prompt, we compute the (normalized) conditional probabilities of He and She in the full stop prompt condition and their lowercase equivalents for the connective prompt conditions. The average biases to the subject are computed for each verb over the sentence frames, which are in turn averaged to compute the overall subject bias for each context type. The latter averages are reported with 95% confidence intervals in the tables below.
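The scoring and aggregation steps just described can be sketched as follows. This is a minimal illustration and not the code used for the experiments: the function names are hypothetical, the log-probabilities are assumed to come from whichever LM is being probed, and the 95% interval uses a normal approximation over verbs, which is an assumption because the text does not specify how the intervals were computed.

```python
import math
from statistics import mean, stdev


def target_pronouns(prompt):
    """Pronoun forms scored for a prompt: capitalized after a full stop,
    lowercase after a connective such as 'because'."""
    return ("He", "She") if prompt == "full stop" else ("he", "she")


def subject_bias(logp_he, logp_she, subject_is_male):
    """Renormalize the two pronoun probabilities so they sum to 1 and
    return the share assigned to the pronoun that matches the subject."""
    p_he, p_she = math.exp(logp_he), math.exp(logp_she)
    p_subject = p_he if subject_is_male else p_she
    return p_subject / (p_he + p_she)


def overall_subject_bias(biases_by_verb):
    """Average over frames within each verb, then over verbs.

    Returns (mean, half_width). The half-width is 1.96 standard errors
    over the per-verb means (assumed method, not taken from the paper).
    """
    verb_means = [mean(frame_biases) for frame_biases in biases_by_verb.values()]
    half_width = 1.96 * stdev(verb_means) / math.sqrt(len(verb_means))
    return mean(verb_means), half_width


if __name__ == "__main__":
    # Hypothetical next-word log-probabilities after "John impresses Mary because".
    bias = subject_bias(math.log(0.3), math.log(0.1), subject_is_male=True)
    print(target_pronouns("because"), round(bias, 2))
```

Renormalizing over the two pronouns discards the probability mass the model assigns to every other continuation, so the resulting bias is comparable across prompts even when the raw pronoun probabilities differ in scale.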

Results
Implicit Causality Comparison
The next-mention biases toward the subject produced by each system in the IC verb conditions are shown in Tables 2 and 3.
Our first question (Prediction 1a) is whether the LMs would display a greater next-mention bias toward the preceding subject in IC1 contexts than IC2 contexts. The answer is no: As can be seen in the first rows of Tables 2 and 3, the biases across conditions for Transformer-XL are identical (.51) and the difference witnessed for GPT-2 goes in the wrong direction (.59 vs. .66). These results therefore do not align with the more polar biases for IC contexts that the psycholinguistic literature has revealed in human studies.
The second question (Prediction 1b) is whether the occurrence of because at the end of the prompt, which for human language users shifts discourse coherence expectations toward Explanation continuations, strengthens the respective IC biases. This prediction receives only limited support: The results in Table 2 reveal increased biases toward the subject compared to the full stop condition for IC1 verbs, and those in Table 3 reveal similar decreases for IC2 verbs. However, only GPT-2 in the IC2 condition yielded an effect of the magnitude that human language studies might lead us to expect.
The final question (Prediction 1c) is whether the occurrence of and as a result at the end of the prompt, which for human language users shifts discourse coherence expectations toward Result continuations, generates a stronger bias toward the preceding object compared to the full stop prompt baseline in IC1 contexts. This prediction was confirmed for GPT-2, where the connective prompt reduced the bias to the subject by .28. Whereas Transformer-XL witnessed a lower bias in this condition as well, the effect was smaller (.08).
To summarize, both models failed to yield the hypothesized effect of verb type in the full stop condition. However, there was some degree of sensitivity to the occurrence of a connective, with GPT-2 in particular displaying a strong numerical difference compared to the full stop prompt baseline in all but the IC1/because condition.

Motion vs. ToP Verb Comparison
The next-mention biases toward the subject produced by each system in the Motion and ToP context conditions are shown in Tables 4 and 5.

Prompt            Transformer-XL   GPT-2
full stop         .57 ± .01        .63 ± .01
because           .61 ± .02        .65 ± .01
and as a result   .54 ± .02        .47 ± .02

Our first question (Prediction 2a) asked whether the LMs would display a greater next-mention bias toward the preceding subject in Motion contexts than ToP contexts in the full stop condition. The answer is mostly no: Whereas there is a small numerical difference for each system in the right direction, it is far from what the results of experimental studies would predict. In particular, whereas the bias found for ToP verbs is aligned with established experimental results, the expected strong subject bias for Motion verbs did not materialize.
The second and third questions (Predictions 2b and 2c) asked about the effect of connectives in the ToP condition, whereby because and and as a result prompts should pull expectations toward the subject and the object, respectively, compared to the full stop prompt baseline. As with IC verbs, no strong effect was witnessed for Transformer-XL, whereas GPT-2 did show a strong shift in the predicted direction for and as a result prompts. However, no appreciable effect was seen for GPT-2 in the because prompt condition.
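The connective effects discussed above are differences from the full stop baseline, and they can be read off the subject-bias values tabulated earlier in this subsection. The short sketch below does that arithmetic; the dictionary layout and the function name are illustrative assumptions, and only the numbers come from the table.

```python
# Subject-bias means copied from the table above (confidence intervals omitted).
SUBJECT_BIAS = {
    "Transformer-XL": {"full stop": 0.57, "because": 0.61, "and as a result": 0.54},
    "GPT-2": {"full stop": 0.63, "because": 0.65, "and as a result": 0.47},
}


def shifts_from_full_stop(biases):
    """Change in subject bias for each connective prompt relative to the
    full stop baseline. Positive values mean a shift toward the subject."""
    baseline = biases["full stop"]
    return {
        prompt: round(value - baseline, 2)
        for prompt, value in biases.items()
        if prompt != "full stop"
    }


if __name__ == "__main__":
    for model, biases in SUBJECT_BIAS.items():
        print(model, shifts_from_full_stop(biases))
```

On these values the largest change is the GPT-2 shift under and as a result prompts (.63 to .47), while the because shifts are small for both systems.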

Aspectual Marking in ToP Verbs Comparison
Our final question (Prediction 3) probes the potential effects of aspectual marking on next-mention biases, in particular whether imperfective ToP contexts will yield a stronger next-mention bias to the subject compared to perfective ToP contexts. The results for perfective and imperfective ToP contexts are shown in Tables 5 and 6 respectively.

Prompt            Transformer-XL   GPT-2
full stop         .57 ± .01        .62 ± .02
because           .56 ± .03        .57 ± .02
and as a result   .63 ± .03        .45 ± .03

Prediction 3 was mostly disconfirmed: There is only a modest difference between ToP contexts using the perfective and imperfective aspect in the full stop prompt condition. Interestingly, however, the predicted effect did exist for both systems in the and as a result condition. It is not clear to us why the effect would be limited to only this condition.

Conclusions
We set out to evaluate the extent to which neural LMs latently acquire the discourse modeling capability necessary to perform a particular type of incremental processing that human language users do: The ability to predict what entities are most likely to be mentioned next. We examined three context pairs with superficially similar linguistic properties that the experimental literature has shown to result in divergent next-mention biases, both with and without connectives.
The results were mostly, but not entirely, negative. On the one hand, we found no compelling evidence that the LMs are sensitive to any of the three manipulations within the verbal complex in the context sentence. On the other hand, one could argue for preliminary support for the claim that one of the LMs, GPT-2, is sensitive to the occurrence of the two connectives examined here. Future work will be required to assess the extent to which these effects do in fact reflect the acquisition of a latent form of discourse modeling ability.
Our conclusions, of course, remain preliminary in a number of respects. First, we have analyzed the behavior of only two systems. Since each system can be said to stand proxy for a single experimental participant, these results could be argued to be less robust than human language studies, which typically utilize several dozen participants.
Whereas this limitation is shared with previous work that probes LMs for inherently acquired syntactic knowledge, the robustness of the findings would be enhanced by examining a broader range of systems and/or system configurations so as to better capture the kinds of variation found among groups of human participants.
Second, we have focused here on broad contrasts between context types that have been studied in the psycholinguistic literature. Although the stimuli employed were modeled after those used in experimental studies, to improve the robustness of the findings we felt it necessary to compute means over a variety of sentence frames (Table 1), so that any idiosyncrasies of particular frames that are independent of the manipulation under scrutiny wouldn't unduly (and undetectably) drive the results. Although this improves the robustness of our results in terms of items (whereas participants in psycholinguistic studies typically see only one example sentence for each verb, the LMs here saw 24), it also means that no lab data exist for the exact stimuli used here. Since an experiment that collects data on this scale would require a substantial annotation effort, a more careful comparison of this sort must be left for future work.
Third, there are many variations of the studies presented here that could be attempted. Examples would include variants that employ longer and more realistic contexts. In this initial investigation we focused on single-sentence contexts so as to hew as closely as possible to previous experimental work. We hope that this short paper will inspire further research that takes next steps in this and a variety of other directions.
Finally, we want to be clear that we do not claim that the two LMs examined have in any sense 'failed' at this task-they were obviously not trained for this purpose. Our goal instead was to pose the novel question of to what extent discourse knowledge of the sort examined here may exist latently in the models. That having been said, we consider the identification of alternative language model architectures that are capable of capturing the requisite discourse modeling capability for this task to be an interesting challenge problem for future work.