Why discourse affects speakers’ choice of referring expressions

We propose a language production model that uses dynamic discourse information to account for speakers’ choices of referring expressions. Our model extends previous rational speech act models (Frank and Goodman, 2012) to more naturally distributed linguistic data, instead of assuming a controlled experimental setting. Simulations show a close match between speakers’ utterances and model predictions, indicating that speakers’ behavior can be modeled in a principled way by considering the probabilities of referents in the discourse and the information conveyed by each word.


Introduction
Discourse information plays an important role in various aspects of linguistic processing, such as predictions about upcoming words (Nieuwland and Van Berkum, 2006) and scalar implicature processing (Breheny et al., 2006). The relationship between discourse information and speakers' choices of referring expression is one of the most studied problems. Speakers' choices of referring expressions have long been thought to depend on the salience of entities in the discourse (Givón, 1983). For example, speakers normally do not choose a pronoun to refer to a new entity in the discourse, but are more likely to use pronouns for referents that have been referred to earlier in the discourse. A number of grammatical, semantic, and distributional factors related to salience have been found to influence choices of referring expressions (Arnold, 2008). While the relationship between discourse salience and speakers' choices of referring expressions is well known, there is not yet a formal account of why this relationship exists.
In recent years, a number of formal models have been proposed to capture inferences between speakers and listeners in the context of Gricean pragmatics (Grice, 1975; Frank and Goodman, 2012). These models take a game-theoretic approach in which speakers optimize productions to convey information for listeners, and listeners infer meaning based on speakers' likely productions. These models have been argued to account for human communication (Jager, 2007; Frank and Goodman, 2012; Bergen et al., 2012a; Smith et al., 2013), and studies report that they robustly predict various linguistic phenomena in experimental settings (Goodman and Stuhlmüller, 2013; Degen et al., 2013; Kao et al., 2014; Nordmeyer and Frank, 2014). However, these models have not yet been applied to language produced outside of the laboratory, nor have they incorporated measures of discourse salience that can be computed over corpora.
In this paper, we propose a probabilistic model to explain speakers' choices of referring expressions based on discourse salience. Our model extends the rational speech act model from Frank and Goodman (2012) to incorporate updates to listeners' beliefs as discourse proceeds. The model predicts that a speaker's choice of referring expressions should depend directly on the amount of information that each word carries in the discourse. Simulations probe the contribution of each model component and show that the model can predict speakers' pronominalization in a corpus. These results suggest that this model formalizes underlying principles that account for speakers' choices of referring expressions.
The paper is organized as follows. Section 2 reviews relevant studies on choices of referring expressions. Section 3 describes the details of our model. Section 4 describes the data, preprocessing and annotation procedure. Section 5 presents simulation results. Section 6 summarizes this study and discusses implications and future directions.
Relevant Work

Discourse salience
Speakers' choices of referring expressions have long been an object of study. Pronominalization has been examined particularly often in both theoretical and experimental studies. Discourse theories predict that speakers use pronouns when they think that a referent is salient in the discourse (Givón, 1983; Ariel, 1990; Gundel et al., 1993; Grosz et al., 1995), where salience of the referent is influenced by various factors such as grammatical position (Brennan, 1995), recency (Chafe, 1994), topicality (Arnold, 1998), competitors (Fukumura et al., 2011), and visual salience (Vogels et al., 2013b).
Discourse theories have characterized the link between referring expressions and discourse salience by stipulating constructs such as a scale of topicality (Givón, 1983), accessibility hierarchy (Ariel, 1990), or implicational hierarchy (Gundel et al., 1993). All of these assume fixed form-salience correspondences in that a certain referring expression encodes a certain degree of salience. However, it is not clear how this form-salience mapping holds nor why it should be.
There is also a rich body of research that points to the importance of production cost (Rohde et al., 2012; Bergen et al., 2012b; Degen et al., 2013) and listener models (Bard et al., 2004; Van der Wege, 2009; Galati and Brennan, 2010; Fukumura and van Gompel, 2012) in language production. These studies suggest that considering only the discourse salience of the referent may not precisely capture speakers' choices of referring expressions, and that it is necessary to examine discourse salience in relation to these other factors.

Formal models
Computational models relevant to speakers' choices of referring expressions have been proposed, but there is a gap between questions that previous models have addressed and the questions that we have raised above. Grüning and Kibrik (2005) and Khudyakova et al. (2011) examine the significance of various factors that might influence choices of referring expressions by using machine learning models such as neural networks, logistic regression and decision trees. Although these models qualitatively show some significant factors, they are data-driven rather than being explanatory, and have not focused on why and how these factors result in the observed referring choices.
Formal models that go beyond identifying superficial factors focus on only pronouns rather than accounting for speakers' word choices per se. For example, Kehler et al. (2008) formalize a relationship between pronoun comprehension and production using Bayes' rule to account for comprehender's semantic bias in experimental data. Rij et al. (2013) use ACT-R (Anderson, 2007) to examine the effects of working memory load in pronoun interpretation. These models show how certain factors influence pronoun production/interpretation, but it is not clear how these models would predict speakers' choices of referring expressions.
Relevant formal models in computational linguistics include Centering theory (Grosz et al., 1995; Poesio et al., 2004) and Referring Expression Generation (Krahmer and Van Deemter, 2012). These models propose deterministic constraints governing when pronouns are preferred in local discourse, but it is not clear how these would account for speakers' choices of referring expressions, nor is it clear why there should be such deterministic constraints.

Uniform Information Density
One potential formal explanation for the relation between discourse salience and speakers' choices of referring expressions is the Uniform Information Density hypothesis (UID) (Levy and Jaeger, 2007; Tily and Piantadosi, 2009; Jaeger, 2010). UID states that speakers prefer to smooth the information density distribution of their utterances over time to achieve optimal communication. This theory predicts that speakers should use pronouns instead of longer forms (e.g., the president) when a referent is predictable in the context, whereas they should use longer forms for unpredictable referents that carry more information (Jaeger, 2010). Tily and Piantadosi (2009) empirically examined the relationship between predictability of a referent and choice of referring expressions. They found that predictability is a significant predictor of writers' choices of referring expressions, in that pronouns are used when a referent is predictable.
While these results appear to support UID, there are several inconsistencies with previous UID accounts. Information content of words has been estimated using an n-gram language model (Levy and Jaeger, 2007), a verb's subcategorization frequency (Jaeger, 2010), and so on, whereas here the information content is that of referents with respect to discourse salience. In addition, selecting between a pronoun and a more specified referring expression involves deciding how much information to convey, whereas previous applications of UID (Levy and Jaeger, 2007) have been concerned with deciding between different ways of expressing the same information content. We show in the next section that we can derive predictions about referring expressions directly from a model of language production.

Summary
Previous linguistic studies have focused on identifying factors that might influence choices of referring expressions. However, it is not clear from this previous work how and why these factors result in the observed patterns of referring expressions. Where formal models relevant to this topic do exist, they have not been built to explain why there is a relation between discourse salience and speakers' choices of referring expressions. Even UID, which relates predictability to word length, is not set up to account for the choice between words that vary in their information content.
In the next section, we propose a speaker model that formalizes the relation between discourse salience and speakers' choices of referring expressions, considering production cost and speakers' inference about listeners in a principled and explanatory way.

Speaker model

Rational speaker-listener model
We adopt the rational speaker-listener model from Frank and Goodman (2012) and extend this model to predict speakers' choices of referring expressions using discourse information.
The main idea of Frank and Goodman's model is that a rational pragmatic listener uses Bayesian inference to infer the speaker's intended referent $r_s$ given the word $w$, their vocabulary (e.g., 'blue', 'circle'), and shared context that consists of a set of objects $O$ (e.g., visual access to object referents) as in (1), assuming that a speaker has chosen the word informatively.
$$P_L(r_s \mid w, O) = \frac{P_S(w \mid r_s, O)\, P(r_s)}{\sum_{r'_s \in O} P_S(w \mid r'_s, O)\, P(r'_s)} \qquad (1)$$
While our work does not make use of this pragmatic listener, it does build on the speaker model assumed by the pragmatic listener. This speaker model (the likelihood term in the listener model) is defined using an exponentiated utility function as in (2).
$$P_S(w \mid r_s, O) \propto e^{U(w;\, r_s, O)}, \qquad U(w;\, r_s, O) = I(w;\, r_s, O) - D(w) \qquad (2)$$
Here $I(w;\, r_s, O)$ represents informativeness of word $w$ (quantified as surprisal) and $D(w)$ represents its speech cost. If a listener interprets word $w$ literally and cost $D(w)$ is constant, the exponentiated utility function can be reduced to (3), where $|w|$ denotes the number of referents that the word $w$ can be used to refer to and the sum ranges over the words $w'$ that can refer to $r_s$.
$$P_S(w \mid r_s, O) = \frac{|w|^{-1}}{\sum_{w'} |w'|^{-1}} \qquad (3)$$
Thus, the speaker model chooses a word based on its specificity. We show in the next section that this corresponds to a speaker who is optimizing informativeness for a listener with uniform beliefs about what will be referred to in the discourse. The assumption of uniform discourse salience works well in a simple language game where there are a limited number of referents that have roughly equal salience, but we show that a model that lacks a sophisticated notion of discourse falls short in more realistic settings.
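The specificity-based speaker in (3) can be illustrated with a minimal sketch. The toy context, the object names, and the function name `specificity_speaker` are hypothetical and serve only as an illustration. The sketch assumes a literal listener and constant cost, as stated above.

```python
def specificity_speaker(extensions, referent):
    """Speaker probabilities under the reduced model in (3).

    extensions: dict mapping each word to the set of referents it can
                literally refer to (hypothetical toy lexicon).
    referent:   the speaker's intended referent.
    """
    # Only words whose literal meaning covers the intended referent compete.
    scores = {
        word: 1.0 / len(referents)          # |w|^{-1}: more specific = higher
        for word, referents in extensions.items()
        if referent in referents
    }
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}


# Hypothetical context with three objects.
extensions = {
    "blue": {"blue_square", "blue_circle"},
    "green": {"green_square"},
    "square": {"blue_square", "green_square"},
    "circle": {"blue_circle"},
}
probs = specificity_speaker(extensions, "blue_circle")
```

In this toy context, 'circle' picks out one object and 'blue' picks out two, so the speaker prefers 'circle' for the blue circle. All referents are treated as equally salient, which is the assumption relaxed in the next section.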

Incorporating discourse salience
To extend Frank and Goodman's model to a natural linguistic situation, we assume that the speaker estimates the listener's interpretation of a word (or referring expression) $w$ based on discourse information. We extend the speaker model from (3) by assuming that a speaker $S$ chooses $w$ to optimize a listener's belief in the speaker's intended referent $r$ relative to the speaker's own speech cost $C_w$, as in (4).
$$P_S(w \mid r) \propto P_L(r \mid w) \cdot \frac{1}{C_w} \qquad (4)$$
This cost is another factor in the speaker model, roughly corresponding to utterance complexity such as word length. The term $P_L(r \mid w)$ in (4) represents informativeness of word $w$: the speaker chooses the $w$ that most helps a listener $L$ to infer referent $r$. The term $C_w$ in (4) is a cost function: the speaker prefers the $w$ that is least costly to speak. The speaker's listener model, $P_L(r \mid w)$, infers the referent $r$ that is referred to by word $w$ according to Bayes' rule, as in (5).
$$P_L(r \mid w) = \frac{P(w \mid r)\, P(r)}{\sum_{r'} P(w \mid r')\, P(r')} \qquad (5)$$
The first term in the numerator, $P(w \mid r)$, is a word probability: the listener in the speaker's mind guesses how likely the speaker would be to use $w$ to refer to $r$. The second term in the numerator, $P(r)$, is the discourse salience (or predictability) of referent $r$. The denominator $\sum_{r'} P(w \mid r')\, P(r')$ is a sum over potential referents $r'$ that could be referred to by word $w$. The terms in this sum are non-zero only for referents that are compatible with the meaning of the word. If there are many potential referents that could be referred to by word $w$, that word would be more ambiguous and thus less informative. The whole of the right side in Equation (5) represents the speaker's assumption about the listener: given word $w$, the listener would infer a referent $r$ that is salient in the discourse and less ambiguously referred to by word $w$.
If $P(r)$ is uniform over referents and $P(w \mid r)$ is constant across words and referents, this listener model reduces to $\frac{1}{|w|}$. Thus, Frank and Goodman's (2012) speaker model in (3) is a special case of our speaker model in (4) that assumes uniform discourse salience and constant cost.
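The listener model in (5) can be sketched as follows, assuming a constant word probability for every compatible word and referent pair. The salience values, the lexicon, and the function name `listener_posterior` are hypothetical.

```python
def listener_posterior(word, salience, extensions, v=2):
    """P_L(r | w) by Bayes' rule, as described for (5).

    salience:   dict referent -> P(r), the discourse salience (toy values).
    extensions: dict word -> set of referents compatible with the word.
    v:          assumed number of words per referent, so P(w | r) = 1 / v
                for compatible pairs and 0 otherwise.
    """
    likelihood = 1.0 / v
    # Referents incompatible with the word contribute nothing to the sum.
    joint = {
        r: likelihood * p
        for r, p in salience.items()
        if r in extensions[word]
    }
    normalizer = sum(joint.values())
    return {r: j / normalizer for r, j in joint.items()}


extensions = {
    "he": {"obama", "biden"},
    "she": {"clinton"},
    "Barack Obama": {"obama"},
}

# Non-uniform salience: 'he' most likely picks out the salient referent.
skewed = listener_posterior("he", {"obama": 0.6, "biden": 0.3, "clinton": 0.1}, extensions)

# Uniform salience: the posterior reduces to 1 / |w|.
uniform = listener_posterior("he", {"obama": 1 / 3, "biden": 1 / 3, "clinton": 1 / 3}, extensions)
```

Because the word probability is constant, it cancels in the ratio, so the value chosen for `v` does not change the result.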
Our model predicts that the speaker's probability of choosing a word for a given referent should depend on its cost relative to its information content. To see this, we combine (4) and (5), yielding
$$P_S(w \mid r) \propto \frac{P(w \mid r)\, P(r)}{\sum_{r'} P(w \mid r')\, P(r')} \cdot \frac{1}{C_w} \qquad (6)$$
Because the speaker is deciding what word to use for an intended referent, and the term $P(r)$ denotes the probability of this referent, $P(r)$ is constant in the speaker model and does not affect the relative probability of a speaker producing different words. We further assume for simplicity that $P(w \mid r)$ is constant across words and referents. This means that all referents have about the same number of words that can be used to refer to them, and that all words for a given referent are equally probable for a naive listener. In this scenario, the speaker's probability of choosing a word is
$$P_S(w \mid r) \propto \frac{1}{\sum_{r'} P(r')} \cdot \frac{1}{C_w} \qquad (7)$$
where the sum denotes the total discourse probability of the referents referred to by that word. The information content of an event is defined as the negative log probability of that event. In this scenario, the information conveyed by a word is the logarithm of the first term in (7), $-\log \sum_{r'} P(r')$. This means that in deciding which word to use, the highest cost a speaker should be willing to pay for a word should depend directly on that word's information content.
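The trade-off between cost and information content can be made concrete with a small sketch of the simplified speaker in (7). The salience values are invented, and the costs are character counts, which is only one possible reading of word length and is an assumption here.

```python
import math


def information_content(word, salience, extensions):
    """-log of the total discourse probability of referents the word covers."""
    return -math.log(sum(salience[r] for r in extensions[word]))


def speaker_probs(referent, salience, extensions, cost):
    """Normalized speaker probabilities under the simplified model in (7)."""
    scores = {}
    for word, referents in extensions.items():
        if referent not in referents:
            continue  # the word cannot refer to the intended referent
        mass = sum(salience[r] for r in referents)
        scores[word] = (1.0 / mass) / cost[word]
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}


extensions = {"he": {"obama", "biden"}, "Barack Obama": {"obama"}}
cost = {"he": 2, "Barack Obama": 12}  # assumed cost: characters incl. space

# Predictable referent: both forms carry little information, cost decides.
predictable = speaker_probs("obama", {"obama": 0.9, "biden": 0.1}, extensions, cost)

# Unpredictable referent: the name carries far more information than 'he'.
unpredictable = speaker_probs("obama", {"obama": 0.01, "biden": 0.99}, extensions, cost)
name_info = information_content("Barack Obama", {"obama": 0.01, "biden": 0.99}, extensions)
```

With these toy numbers the pronoun wins when the referent is predictable and the proper name wins when it is not, which is the qualitative pattern described in the text.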
This relationship between cost and information content allows us to derive the prediction tested by Tily and Piantadosi (2009) that the use of referring expressions should depend on the predictability of a referent. For referents that are highly predictable from the discourse, different referring expressions (e.g., pronouns and proper names) will have roughly equal information content, and speakers should choose the referring expression that has the lowest cost. In contrast, for less predictable referents, proper names will carry substantially more information than pronouns, leading speakers to pay a higher cost for the proper names. These are the same predictions that have been discussed in the context of UID, but here the predictions are derived from a principled model of speakers who are trying to provide information to listeners. The extent to which our model can also capture other cases that have been put forward as evidence for the UID hypothesis remains a question for future research.

Predicting behavior from corpora
The model described in Section 3.2 is fully general, applying to arbitrary word choices, discourse probabilities, and cost functions. As an initial step, our simulations focus on the choice between pronouns and proper names. Our work tests the speaker model from (4) directly, asking whether it can predict the referring expressions from corpora of written and spoken language. Implementing the model requires computing word probabilities $P(w \mid r)$, discourse salience $P(r)$, and word costs $C_w$.
We simplify the word probability $P(w \mid r)$ in the speaker's listener model as in (8):
$$P(w \mid r) = \frac{1}{V} \qquad (8)$$
where the count $V$ is the number of words that can refer to referent $r$. We assume that $V$ is constant across all referents. Our reasoning is as follows. There could be many ways to refer to a single entity. For example, to refer to entity Barack Obama, we could say 'he', 'The U.S. president', 'Barack', and so on. We assume that there are the same number of referring expressions for each entity and that each referring expression is equally probable under the listener's likelihood model. In our simulations, we assume that a speaker is choosing between a proper name and a pronoun. For example, we assume that an entity Barack Obama has one and only one proper name 'Barack Obama', and this entity is unambiguously associated with male and singular. Although we use an example with two possible referring expressions, as long as $P(w \mid r)$ is constant across all referents and words, it does not make a difference to the computation in (5) how many competing words we assume for each referent.
To estimate the salience of a referent, P (r), our framework employs factors such as referent frequency or recency. Although there are other important factors such as topicality of the referent (Orita et al., 2014) that are not incorporated in our simulations, this model sets up a framework to test the role and interaction of various potential factors suggested in the discourse literature.
Salience of the referent is computed differently depending on its information status: old or new. The following illustrates the speaker's assumptions about the listener's discourse model. For each referent $r \in [1, R_d]$:
1. If $r$ = old, choose $r$ in proportion to $N_r$ (the number of times referent $r$ has been referred to in the preceding discourse).
2. Otherwise, $r$ = new with probability proportional to $\alpha$ (a hyperparameter that controls how likely the speaker is to refer to a new referent).
3. If $r$ = new, sample that new referent $r$ from the base distribution over entities with probability $\frac{1}{U_{\cdot}}$ (the count $U_{\cdot}$ denotes the total number of unseen entities, estimated from a named entity list (Bergsma and Lin, 2006)).
The above discourse model is frequency-based. We can replace the term $N_r$ for the old referent with $f(d_{i,j}) = e^{-d_{i,j}/a}$ to capture recency, where the recency function $f(d_{i,j})$ decays exponentially with the distance between the current referent $r_i$ and the same referent $r_j$ that has previously been referred to. This framework for frequency and recency of new and old referents exactly corresponds to priors in the Chinese Restaurant Process (Teh et al., 2006) and the distance-dependent Chinese Restaurant Process (Blei and Frazier, 2011).
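The frequency-based and recency-based salience estimates can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the simulations: the function names are hypothetical, the unit of distance is left abstract, and the recency weight of a referent is assumed to be the sum of the decay function over all of its previous mentions.

```python
import math


def frequency_salience(counts, alpha):
    """Chinese Restaurant Process style prior over old and new referents.

    counts: dict referent -> N_r, number of previous mentions.
    alpha:  hyperparameter controlling the probability of a new referent.
    Returns (dict of P(r) for old referents, total probability of 'new').
    """
    total = sum(counts.values()) + alpha
    old = {r: n / total for r, n in counts.items()}
    return old, alpha / total


def recency_salience(distances, alpha, a):
    """Recency variant: N_r is replaced by exponentially decayed weights.

    distances: dict referent -> list of distances d to its earlier mentions
               (assumption: weights of all earlier mentions are summed).
    a:         decay parameter of f(d) = exp(-d / a).
    """
    weights = {
        r: sum(math.exp(-d / a) for d in ds) for r, ds in distances.items()
    }
    total = sum(weights.values()) + alpha
    old = {r: w / total for r, w in weights.items()}
    return old, alpha / total


def unseen_referent_prob(p_new, u_total):
    """Probability of one particular unseen entity: P(new) * 1 / U."""
    return p_new / u_total


freq_old, freq_new = frequency_salience({"obama": 3, "biden": 1}, alpha=0.1)
rec_old, rec_new = recency_salience({"obama": [1, 4, 9], "biden": [2]}, alpha=0.1, a=3.0)
```

The parameter values 0.1 and 3.0 mirror the fixed settings reported for the experiments; the mention counts and distances are invented.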
The denominator in (5) represents the sum of potential referents that could be referred to by word w. We assume that a pronoun can refer to a large number of unseen referents if gender and number match, but a proper name cannot. For example, 'he' could refer to all singular and male referents, but 'Barack Obama' can only refer to Barack Obama. This assumption is reflected as a probability of unseen referents for the pronoun as illustrated in (10) below.
In our simulations, the speaker's cost function $C_w$ is estimated based on word length as in (9). We assume that longer words are costly to produce.
Suppose that the speaker is considering using 'he' to refer to Barack Obama, which has been referred to $N_O$ times in the preceding discourse, and there is another singular and male entity, Joe Biden, in the preceding discourse that has been referred to $N_B$ times. In this situation, the model computes the probability that the speaker uses 'he' to refer to Barack Obama as follows:
$$P_S(\text{he} \mid \text{Obama}) \propto P_L(\text{Obama} \mid \text{he}) \cdot \frac{1}{C_{\text{he}}}$$
$$= \frac{P(\text{he} \mid \text{Obama})\, P(\text{Obama})}{\sum_{r'} P(\text{he} \mid r')\, P(r')} \cdot \frac{1}{C_{\text{he}}}$$
$$= \frac{\frac{1}{V} \cdot N_O}{\frac{1}{V} \cdot N_O + \frac{1}{V} \cdot N_B + \frac{1}{V} \cdot \alpha \cdot \frac{U_{\text{sing\&masc}}}{U_{\cdot}}} \cdot \frac{1}{C_{\text{he}}} \qquad (10)$$
where the count $U_{\text{sing\&masc}}$ in the denominator of the last line denotes the number of unseen singular and male entities that could be referred to by 'he'. We estimate this number for each type of pronoun we evaluate (singular-female, singular-male, singular-neuter, and plural) based on the named entity list in Bergsma and Lin (2006). The term $\left(\frac{1}{V} \cdot \alpha \cdot \frac{U_{\text{sing\&masc}}}{U_{\cdot}}\right)$ is the sum of probabilities of unseen referents that could be referred to by the pronoun 'he'. The unseen referents can be interpreted as a penalty for the inexplicitness of pronouns. In the case of proper names, the denominator is always the same as the numerator, under the assumption that each entity has one unique proper name.
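The Obama and Biden example can be worked through numerically. The sketch below is illustrative only: the mention counts, the unseen-entity counts, and the costs (character counts, one possible reading of word length) are all invented, and the function name `pronoun_vs_name` is hypothetical.

```python
def pronoun_vs_name(n_target, n_competitors, alpha, u_match, u_total,
                    cost_pronoun, cost_name, v=2):
    """Speaker probabilities for a pronoun versus a proper name.

    n_target:      mentions of the intended referent so far (e.g. N_O).
    n_competitors: mention counts of other referents matching the pronoun's
                   gender and number (e.g. [N_B]).
    alpha:         new-referent hyperparameter.
    u_match:       assumed number of unseen entities matching the pronoun.
    u_total:       assumed total number of unseen entities.
    v:             assumed number of words per referent, P(w | r) = 1 / v.
    """
    likelihood = 1.0 / v
    target = likelihood * n_target
    competitors = likelihood * sum(n_competitors)
    unseen = likelihood * alpha * u_match / u_total  # penalty for inexplicitness

    # Pronoun: the target competes with seen and unseen matching referents.
    listener_pronoun = target / (target + competitors + unseen)
    # Proper name: numerator equals denominator (one unique name per entity).
    listener_name = 1.0

    score_pronoun = listener_pronoun / cost_pronoun
    score_name = listener_name / cost_name
    total = score_pronoun + score_name
    return {"pronoun": score_pronoun / total, "name": score_name / total}


# Salient referent with a weak competitor: the pronoun is preferred.
salient = pronoun_vs_name(5, [1], alpha=0.1, u_match=400, u_total=1000,
                          cost_pronoun=2, cost_name=12)

# Rarely mentioned referent with a strong competitor: the name is preferred.
obscure = pronoun_vs_name(1, [20], alpha=0.1, u_match=400, u_total=1000,
                          cost_pronoun=2, cost_name=12)
```

The normalizing constant of the discourse prior appears in every term of the listener model's numerator and denominator, so raw counts can be used directly, as in the last line of the worked example.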

Corpora
Our model was run on both adult-directed speech and child-directed speech. We chose to use the SemEval-2010 Task 1 subset of OntoNotes (Recasens et al., 2011), a corpus of news text, as our corpus of adult-directed speech. The Gleason et al. (1984) subset of CHILDES (MacWhinney, 2000) was chosen as our corpus of child-directed speech.
The model requires coreference chains, agreement information, grammatical position, and part of speech. These were extracted from each corpus, either manually or automatically. The coreference chains let us easily count how many times and how recently each referent is mentioned in the discourse, which is necessary for computing discourse salience. The agreement information (gender and number of each referent) is required so that the model can identify all possible competing referents for pronouns. For instance, Barack Obama will be ruled out as a possible competitor for the pronoun she. The grammatical position that each proper name occupies determines the form of the alternative pronoun that could be used there. For example, the difference between he and him is the grammatical position that each can appear in. The part of speech is used to identify the form of the referring expression (pronouns and proper names), which is what our model aims to predict. OntoNotes includes information about coreference chains, part of speech, and grammatical dependencies. Gleason CHILDES has parsed part of speech and grammatical dependencies (Sagae et al., 2010), but it does not have coreference chains.
Neither corpus has agreement information. The following section describes manual annotations that we have done for this study. Due to time constraints, we annotated only a part of the CHILDES Gleason corpus, 9 out of 70 scripts.

Mention annotation
We considered only maximally spanning noun phrases as mentions, ignoring nested NPs and nested coreference chains. For the sentence "Both Al Gore and George W. Bush have different ideas on how to spend that extra money" from OntoNotes, the extracted NPs are Both Al Gore and George W. Bush and different ideas about how to spend that extra money.
These maximally spanning NPs were automatically extracted from the OntoNotes data, but were manually annotated for the CHILDES data by two annotators using brat (Stenetorp et al., 2012).

Agreement annotation
Many mentions (46,246 out of 56,575 mentions in OntoNotes and 10,141 out of 10,530 mentions in CHILDES Gleason) were automatically annotated using agreement information from the named entity list in Bergsma and Lin (2006), leaving 10,329 to be manually annotated from OntoNotes (about 18%) and 389 from CHILDES (about 4%). The guidelines we followed for this manual agreement annotation were largely based on pronoun replacement tests. NPs that referred to a single man and could be replaced with he or him were labeled 'male singular', NPs that could be replaced by it, such as the comment, were labeled 'neuter singular', and so on. NPs that could not be replaced with a pronoun, such as about 30 years earnings for the average peasant, who makes $145 a year, were excluded from the analysis.

Coreference annotation
We used the provided coreference chains for the OntoNotes data, but for the CHILDES data, it was necessary to do this manually using brat. The guidelines we followed for determining whether mentions coreferred came from the OntoNotes coreference guidelines (BBN Technologies, 2007).

Experiments
Our experiments are designed to quantify the contributions of the various components of the complete model described in Section 3.2 that incorporates discourse salience, cost, and unseen referents. We contrast the complete model with three impoverished models, each of which lacks exactly one of these components. The model without discourse uses a uniform discourse salience distribution. The model without cost uses constant speech cost. The model without good estimates of unseen referents always assigns probability $\frac{1}{V} \cdot \alpha \cdot \frac{1}{U_{\cdot}}$ to unseen referents in the denominator of (5), regardless of whether the word is a proper name or a pronoun. In other words, unlike the complete model, this model does not distinguish how many unseen referents each type of word could refer to.
We use the adult- and child-directed corpora to examine to what extent each model captures speakers' referring expressions. We selected pronouns and proper names in each corpus according to several criteria. First, the referring expression had to be in a coreference chain that had at least one proper name, in order to facilitate computing the cost of the proper name alternative. Second, pronouns were only included if they were third person pronouns in subject or object position, and indexicals and reflexives were excluded. Finally, for the CHILDES corpus, children's utterances were excluded.
After filtering pronouns and proper names according to these criteria, 553 pronouns and 1,332 proper names (total 1,885 items) in the OntoNotes corpus, and 165 pronouns and 149 proper names (total 314 items) in the CHILDES Gleason corpus remained for use in the analysis.
Each model chooses referring expressions given information extracted from each corpus as described in Section 4.1. For evaluation, we computed accuracies (total, pronoun, and proper name) and model log likelihood (summing $\log P_S(w \mid r)$ for the words in the corpus) for each model. Table 1 summarizes the results of each model with the OntoNotes and CHILDES datasets. The new referent hyperparameter $\alpha$ and the decay parameter for discourse recency salience were fixed at 0.1 and 3.0, respectively.
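The two evaluation measures can be sketched as follows. The sketch assumes that a model's prediction for an item is the form with the highest speaker probability, which is one natural reading of "chooses" rather than a documented detail; the item data and the function name `evaluate` are hypothetical.

```python
import math


def evaluate(predictions, gold):
    """Per-form and total accuracy plus model log likelihood.

    predictions: list of dicts mapping each form ('pronoun', 'name') to the
                 model's speaker probability for that item.
    gold:        list of the forms actually observed in the corpus.
    """
    correct = {"pronoun": 0, "name": 0}
    counts = {"pronoun": 0, "name": 0}
    log_likelihood = 0.0
    for probs, observed in zip(predictions, gold):
        counts[observed] += 1
        predicted = max(probs, key=probs.get)  # assumed argmax decision rule
        if predicted == observed:
            correct[observed] += 1
        # Sum of log P_S(w | r) over the forms that actually occurred.
        log_likelihood += math.log(probs[observed])

    result = {
        form: correct[form] / counts[form] if counts[form] else float("nan")
        for form in counts
    }
    result["total"] = sum(correct.values()) / len(gold)
    result["log_likelihood"] = log_likelihood
    return result


predictions = [
    {"pronoun": 0.8, "name": 0.2},
    {"pronoun": 0.3, "name": 0.7},
    {"pronoun": 0.4, "name": 0.6},
]
gold = ["pronoun", "name", "pronoun"]
scores = evaluate(predictions, gold)
```

Accuracy only registers which form receives the higher probability, whereas log likelihood also penalizes confident errors, so the two measures can rank models differently.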

News
Overall, the recency salience measure provides a better fit than the frequency salience measure with respect to accuracies, suggesting that recency better captures speakers' representations of discourse salience that influence choices of referring expressions. On the other hand, the models with frequency discourse salience have higher model log likelihood than the models with recency do. This is because of the peakiness of the recency models. Model log likelihoods computed over pronouns and proper names (complete model) were -1022.33 and -222.76, respectively, with recency, and -491.81 and -467.06 with frequency. The recency model tends to return a higher probability for a proper name than the frequency model does. Some pronouns receive a very low probability for this reason, and this lowers the model log likelihood.
The model without discourse and the model without cost consistently failed to predict pronouns (these models predicted all proper names). This happens because in the model without discourse, the information content of pronouns is extremely low due to the large number of consistent unseen referents. In the model without cost, pronouns are disfavored because they always convey less information than proper names. The log likelihoods of these models were also below that of the complete model. These results show that pronominalization depends on subtle interaction between discourse salience and speech cost. Neither of them is sufficient to explain the distribution of pronouns and nouns on its own.
The total accuracy of the model without good estimates of unseen referents was the worst among the four models, but this model did predict pronouns to some extent. Because the number of proper names is larger than the number of pronouns in this dataset, the difference in total accuracies between the model without good estimates of unseen referents and the models without discourse or cost reflects this asymmetry. Comparison between the complete model and the model without good estimates of unseen referents also suggests that having knowledge of unseen referents helps correctly pre-

Child-directed speech
Unlike the adult-directed news text, neither recency nor frequency discourse salience provides a good fit to the data. The low accuracies of pronouns and the high accuracies of proper names in all models indicate that the models are more likely to predict proper names than pronouns. There are several possible reasons for this. First, the CHILDES transcripts involve long conversations in natural settings. Compared to the news, interlocutors are not focusing on a specific topic, but rather they often switch topics (e.g., a child interrupts her parents' conversation about her father's coworker to talk about her eggs). This topic switching makes it difficult for the model to estimate discourse salience using simple frequency or recency measures. Second, interlocutors are a family and they share a good deal of common knowledge and background (e.g., a mother said she as the first mention of her child's friend's mother). The current model is not able to incorporate this kind of background knowledge. Third, many referents are visually available. The current model is not able to use visual salience. In general, these problems arise due to our impoverished estimates of salience, and we would expect a more sophisticated discourse model that accurately measured salience to show better performance.

Summary
Experiments with the adult-directed news corpus show a close match between speakers' utterances and model predictions. On the other hand, experiments with child-directed speech show that the models were more likely to predict proper names where pronouns were used, suggesting that the estimates of discourse salience using simple measures were not sufficient to capture a conversation.

Discussion
This paper proposes a language production model that extends the rational speech act model from Frank and Goodman (2012) to incorporate updates to listeners' beliefs as discourse proceeds. We show that the predictions suggested from UID in this domain can be derived from our speaker model, providing an explanation from first principles for the relation between discourse salience and speakers' choices of referring expressions. Experiments with an adult-directed news corpus show a close match between speakers' utterances and model predictions, and experiments with child-directed speech show a qualitatively similar pattern. This suggests that speakers' behavior can be modeled in a principled way by considering the probabilities of referents in the discourse and the information conveyed by each word. A controversial issue in language production is to what extent speakers consider a listener's discourse model (Fukumura and van Gompel, 2012). By incorporating an explicit model of listeners, our model provides a way to explore this question. For example, the speaker's listener model $P_L(r \mid w)$ in (4) might differ between contexts and could also be extended to sum over possible listener identities $q$ in mixed contexts, as in (11).
$$P_L(r \mid w) = \sum_{q} P(r \mid w, q)\, P(q) \qquad (11)$$
This provides a way to probe speakers' sensitivity to differences in listener characteristics across situations.
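The mixture over listener identities in (11) amounts to a weighted average of listener-specific posteriors. The following sketch uses invented listener types and probabilities, and the function name `mixed_listener` is hypothetical.

```python
def mixed_listener(posteriors_by_listener, listener_probs):
    """P_L(r | w) = sum over q of P(r | w, q) * P(q), as in (11).

    posteriors_by_listener: dict q -> dict referent -> P(r | w, q).
    listener_probs:         dict q -> P(q), assumed to sum to one.
    """
    referents = set()
    for posterior in posteriors_by_listener.values():
        referents.update(posterior)
    return {
        r: sum(
            listener_probs[q] * posteriors_by_listener[q].get(r, 0.0)
            for q in listener_probs
        )
        for r in referents
    }


# Hypothetical example: how two kinds of listeners would interpret 'he'.
posteriors = {
    "adult": {"obama": 0.9, "biden": 0.1},
    "child": {"obama": 0.5, "biden": 0.5},
}
mixture = mixed_listener(posteriors, {"adult": 0.7, "child": 0.3})
```

In this toy case the speaker's effective listener model lies between the two listener-specific posteriors, weighted by how likely each kind of listener is in the context.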
Although the simulations in this paper employed simple measures for discourse salience (referent frequency and recency), the discourse models used by speakers are likely to be more complex. Studies show that semantic information that cannot be captured with these simple measures, such as topicality (Orita et al., 2014) and animacy (Vogels et al., 2013a), affects speakers' choices of referring expressions. Future work will test to what extent this latent discourse information could affect the model predictions.
Our model predicts a tight coupling between the probability of a referent being mentioned, p(r), and the choice of referring expression. However, these two quantities appear to be dissociated in some cases. For example, Fukumura and Van Gompel (2010) show that semantic bias (as a measure of predictability) affects what to refer to (i.e., the referent), but not how to refer (i.e., the referring expression), while grammatical position does affect how you refer. One way of dissociating the probability of mention from the choice of referring expression in our model would be through the likelihood term, the word probability p(w|r). While we have assumed this word probability to be constant across words and referents, Kehler et al. (2008) use grammatical position to define this probability and show that their model captures experimental data. Syntactic constraints (such as Binding principles) also influence form choices, and this kind of knowledge may also be reflected in the word probability. Examining the role of the word probability p(w|r) more closely would allow us to further explore these issues.
Despite these limitations, our model provides a principled explanation for speakers' choices of referring expressions. In future work we hope to look at a broader range of referring expressions, such as null pronouns and definite descriptions, and to explore the extent to which our model can be applied to other linguistic phenomena that rely on discourse information.