Modelling Pro-drop with the Rational Speech Acts Model

We extend the classic Referring Expression Generation task by considering zero pronouns in "pro-drop" languages such as Chinese, modelling their use by means of the Bayesian Rational Speech Acts model (Frank and Goodman, 2012). By assuming that highly salient referents are most likely to be referred to by zero pronouns (i.e., pro-drop is more likely for salient referents than for less salient ones), the model offers an attractive explanation of a phenomenon not previously addressed probabilistically.


Introduction
Languages such as Chinese and Japanese make liberal use of zero pronouns (ZPs) (Huang, 1984). One analysis of a large Chinese-English parallel dialogue corpus shows that 26% of the English pronouns are dropped in Chinese. Such abundant use of zero pronouns has been a key factor in linguists' view (Huang, 1984, 1989) that Chinese is a "cool" language, or a discourse-oriented language (Cao, 1979), i.e., one that relies heavily on context.
To exemplify zero pronouns in Chinese, consider the question "你今天看见比尔了吗?" (Did you see Bill today?). A Chinese speaker can respond with a variety of shorter expressions equivalent to "我看见他了" (Yes, I saw him), for example, "∅看见他了" (Yes, ∅ saw him), "我看见∅了" (Yes, I saw ∅), or even "∅看见∅了" (Yes, ∅ saw ∅). Here the ∅ symbol indicates the position from which a pronoun appears to have been "dropped" from a full sentence.
Generating zero pronouns (only) where they are appropriate is a difficult challenge for Referring Expression Generation (REG) (Van Deemter, 2016), and more specifically for the task of choosing referential form, a key step in the classic Natural Language Generation (NLG) architecture (Reiter and Dale, 2000). Traditionally, choosing referential form is framed as modelling speakers' decisions about whether entities are referred to using a pronoun, a proper name, or a description. For "cool" languages, however, an extra option, namely the zero pronoun, needs to be added (Yeh and Mellish, 1997) to fully simulate speakers' behaviour.
In this paper, we model the use of zero pronouns in Chinese with the Rational Speech Acts (RSA) model (Frank and Goodman, 2012), assuming that a speaker tends to choose a ZP if its referent is salient enough for successful communication (see §2). For computing discourse salience, we focus on ZPs that are recoverable, meaning that they either refer anaphorically to an entity mentioned earlier in the text (i.e., anaphoric ZPs, or AZPs for short), or to the speaker or hearer (i.e., deictic non-anaphoric ZPs, or DNZPs for short) (Zhao and Ng, 2007); a ZP is unrecoverable if it cannot be linked to any referent.

Related Work
Pro-drop raises challenges for a number of NLP tasks, including machine translation (MT), co-reference resolution, and REG. When translating from a pro-drop language, recovering the dropped pronouns of the source language can improve the overall performance of MT (Wang et al., 2016). Co-reference resolution of ZPs has been widely explored with a variety of techniques, including centring theory (Rao et al., 2015), statistical machine learning (Zhao and Ng, 2007; Ng, 2014, 2015), deep learning (Chen and Ng, 2016; Yin et al., 2016, 2017), and reinforcement learning (Yin et al., 2018). REG of ZPs for "cool" languages has been addressed through rule-based methods (Yeh and Mellish, 1997), including centring theory (Yamura-Takei et al., 2001) (for Japanese), but we are not aware of any testable computational account.¹ We offer such an account, along probabilistic lines.
Some discourse theories suggest that speakers choose referring expressions (REs) by considering discourse salience (Givón, 1983), i.e., speakers tend to choose pronouns if they believe the referent is highly salient. The intuition behind this is that a highly salient referent tends to be highly prominent in the mind of the speaker and/or hearer. Orita et al. (2015) shared a similar view and argued that highly salient referents are highly predictable, so they are referred to with pronouns (as opposed to full NPs) more often than less salient ones.
A theory that is sometimes used to explain the relation between discourse salience and human choice of referential form is Uniform Information Density (UID) (Jaeger and Levy, 2007). UID asserts that speakers tend to optimise the information density (quantity of information) of their utterances to achieve optimal communication. In other words, speakers tend to drop an RE when the referent of the RE is predictable (or recoverable), and vice versa.
Apart from salience, production cost (Rohde et al., 2012) and listener models (Bard et al., 2004), i.e., speakers' models of how listeners interpret their utterances, also have an impact on language production. This suggests that the salience of the referent alone may not be enough for modelling speakers' choice. The RSA model (see §3) used in this paper can take all these factors into consideration.

The Rational Speech Acts Model
The Rational Speech Acts (RSA) model (Frank and Goodman, 2012) has been used for a variety of tasks, including modelling speakers' referential choice between pronouns and proper names (Orita et al., 2015), the selection of attributes for referring expressions (Monroe and Potts, 2015), and the generation of colour references (Monroe et al., 2017, 2018). The key idea of RSA is to model human communication by assuming that a rational listener P_L uses Bayesian inference to recover a speaker's intended referent r_s for word w under context C. In this way, RSA claims to offer not only accurate models, but highly explanatory ones as well. Formally, P_L is defined as

P_L(r_s | w, C) = P_S(w | r_s, C) · P(r_s) / Σ_{r ∈ C} P_S(w | r, C) · P(r)    (1)

where r denotes a referent in context C, P(r_s) represents the discourse salience of r_s, and P_S is the speaker model, defined by an exponential utility function:

P_S(w | r_s, C) ∝ e^{I(w; r_s, C) − C(w)}    (2)

Here I(w; r_s, C) is the informativeness of word w, and C(w) represents the speech cost. Orita et al. (2015) extended the RSA model by assuming that speakers estimate the listener's interpretation of the (form of) RE w based on discourse information. The speaker chooses w by maximising the listener's belief in the speaker's intended referent r_s relative to the speaker's speech cost C(w), where the cost is estimated from the complexity of the utterance, such as the length of w:

P(w | r_s, C) = (P_L(r_s | w) / C(w)) / Σ_{w'} (P_L(r_s | w') / C(w'))    (3)

Here P_L(r_s | w) estimates the informativeness of w, and P(w | r_s, C) estimates the likelihood (according to the speaker) that the listener guesses that the speaker used w to refer to r_s.
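As a concrete illustration, the three distributions can be sketched in a few lines of Python. All referents, words, informativeness scores, and costs below are toy values invented for illustration, not taken from the paper's data; only the functional forms follow the definitions in this section.

```python
import math

# Toy context: all names and numbers below are illustrative only.
referents = ["speaker", "other"]
words = ["zp", "pronoun"]

salience = {"speaker": 0.7, "other": 0.3}            # P(r), e.g. from prior mentions
informativeness = {("zp", "speaker"): 1.0, ("zp", "other"): 0.4,
                   ("pronoun", "speaker"): 1.0, ("pronoun", "other"): 1.0}
cost = {"zp": 0.0, "pronoun": 1.0}                    # C(w), e.g. utterance length

def speaker(w, r):
    """P_S(w | r, C): exponential utility of informativeness minus cost."""
    scores = {v: math.exp(informativeness[(v, r)] - cost[v]) for v in words}
    return scores[w] / sum(scores.values())

def listener(r, w):
    """P_L(r | w, C): Bayesian inversion of the speaker model with salience prior."""
    scores = {s: speaker(w, s) * salience[s] for s in referents}
    return scores[r] / sum(scores.values())

def orita_speaker(w, r):
    """Orita et al. (2015)-style speaker: listener belief divided by a length cost."""
    length_cost = {"zp": 1.0, "pronoun": 2.0}         # hypothetical 1 + length values
    scores = {v: listener(r, v) / length_cost[v] for v in words}
    return scores[w] / sum(scores.values())
```

With these toy settings, the cost-sensitive speaker prefers the cheap ZP more strongly for the highly salient referent than for the less salient one, which is the qualitative pattern the model is meant to capture.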

Modelling Pro-drop with the RSA Model
We model the decision of whether to use a ZP based on the formulation expressed in Eq. 3. The speaker model is P_S(z|r_s), the probability that the speaker uses a ZP (i.e., drops the RE). We assume that the speaker makes a binary choice z ∈ {0, 1}, with z = 1 indicating a ZP and z = 0 indicating a non-zero form of RE (NZRE). Note that whether the speaker uses a pronoun or a proper name is outside the scope of this model. To simulate the speaker's choice, we need to estimate the dropping probability P(z|r_s), the discourse salience of the referent P(r_s), and the cost C(z).
According to the UID theory (see §2), if an RE is recoverable, the speaker prefers a ZP over an NZRE to maximise information density, since a ZP is shorter than any other referential form. Accordingly, we follow Orita et al. (2015) in estimating the cost function C(z) from the length of the RE, i.e., the total number of words the RE contains. However, the length of the NZRE is not known in advance, so we use the average length of a set of REs W instead:

C(z = 0) = 1 + (1/|W|) Σ_{w ∈ W} length(w)    (4)

We experimented with two ways of calculating the average length: (i) global average length, where W is the set of all referring expressions in the corpus, and (ii) local average length, where W is the set of expressions that can refer to referent r_s. For instance, if r_s is "Barack Obama", then, given a corpus for computing local average length in which he is referred to, W might be the set {Barack Obama, Obama, he, former president}. The cost of a zero pronoun is always C(z = 1) = 1, which means no discount on P(z = 1|w); the plus 1 in Eq. 4 makes the cost of choosing an NZRE different from that of choosing a ZP even if W contains only pronouns (i.e., if every length equals 1).
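A minimal sketch of this length-based cost, assuming REs are whitespace-tokenised strings; the global and local sets of expressions below are hypothetical examples of ours:

```python
def cost(z, expressions):
    """Length-based cost: C(z=1) = 1 for a ZP; C(z=0) = 1 plus the average
    length (in words) of the candidate REs in W (global or local)."""
    if z == 1:
        return 1.0
    avg_len = sum(len(w.split()) for w in expressions) / len(expressions)
    return 1.0 + avg_len

# Hypothetical RE sets; "local" holds only expressions for one referent.
global_W = ["Barack Obama", "Obama", "he", "former president", "the reporter"]
local_W = ["Barack Obama", "Obama", "he", "former president"]
```

With these toy sets, the local variant yields C(z = 0) = 1 + 6/4 = 2.5 and the global variant C(z = 0) = 1 + 8/5 = 2.6, while C(z = 1) stays fixed at 1.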
We assume that the dropping probability P(z|r_s) depends on whether the referent r_s is one of the participants in the dialogue (i.e., the speaker or the listener). If r_s is one of the participants, we call it a maximally salient entity (denoted s); otherwise, r_s is called a non-maximally salient entity (denoted ns). For example, in the OntoNotes 5.0 corpus, 30% of maximally salient entities are dropped, which is much higher than the 10% dropping rate of non-maximally salient entities. This assumption causes AZPs and DNZPs to have different proportions in the predicted results. Suppose P(z = 1|r_s = ns) = a and P(z = 1|r_s = s) = b; then we have a < b, which implies that the speaker thinks the listener expects a maximally salient entity (i.e., the speaker or the listener).
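Estimating a and b themselves amounts to counting over annotated instances. The sketch below assumes each instance is a (z, is_participant) pair; this input format is our assumption, not the paper's:

```python
def dropping_rates(instances):
    """Estimate a = P(z=1 | ns) and b = P(z=1 | s) by counting.
    Each instance is a (z, is_participant) pair, with z = 1 for a dropped RE."""
    drops = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for z, is_participant in instances:
        totals[is_participant] += 1
        drops[is_participant] += z
    a = drops[False] / totals[False]   # non-maximally salient entities
    b = drops[True] / totals[True]     # maximally salient entities
    return a, b
```

On a toy sample where 3 of 10 participant mentions and 1 of 10 non-participant mentions are dropped, this returns a = 0.1 and b = 0.3, mirroring the corpus-level rates cited above.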
Let α = a/b be the dropping ratio; then the probability that a dropped noun phrase refers to the speaker is estimated as

P(Speaker | z = 1) = N_Speaker / (N_S + α · N_NS)    (5)

where N_Speaker reflects P(Speaker), the salience of the speaker.² In general, we take the salience of a referent x to be proportional to N_x, the number of times that x has been referred to in the preceding discourse, hence the use of N_Speaker, N_S, and N_NS in the equation. Note that N_S + N_NS is the total number of REs in the preceding discourse.
Equation 5 shows that modelling the dropping probability differently for maximally salient and non-maximally salient entities acts as a discount on the number of referents that the ZP can refer to when predicting DNZPs. Similarly, using the dropping ratio α, the dropping probability for AZPs is estimated as

P(r_s | z = 1) = N_{r_s} / (N_S / α + N_NS)    (6)

which can be seen as adding a penalty.

The frequencies above are counted over the whole preceding discourse of a referent, which might not be reasonable for predicting ZPs. We hypothesise that the informativeness of a ZP depends on only a part of the preceding context, and we tested two possible set-ups. One sets a discourse window limiting the number of sentences that the simulator can look back over. The other uses recency (Chafe, 1994): following Orita et al. (2015), we replace each count with Count(r_i, r_j) = e^{−d(r_i, r_j)/a}, where r_j is an earlier mention of the same referent as r_i, and d is the number of sentences between the two REs. Instead of contributing a raw count of 1, Count(r_i, r_j) decays exponentially with its distance from the RE being predicted: an RE that is further away contributes less to the overall count of its referent.
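The recency-weighted count and the α-adjusted listener beliefs can be sketched as follows. The two formulas in zp_belief are our reconstruction of Eqs. 5 and 6 from the surrounding description (a discount on non-salient counts for DNZPs, a penalty on salient counts for AZPs), and all counts in the usage note are toy values:

```python
import math

def recency_count(d, a=0.8):
    """Count(r_i, r_j) = exp(-d / a): a mention d sentences away contributes
    exp(-d/a) to its referent's count instead of a raw count of 1."""
    return math.exp(-d / a)

def zp_belief(n_target, n_s, n_ns, target_is_max_salient, alpha=0.1):
    """Listener belief that a ZP refers to the target referent, using the
    dropping ratio alpha = a/b (our reconstruction of Eqs. 5 and 6)."""
    if target_is_max_salient:
        return n_target / (n_s + alpha * n_ns)   # discount on ns counts (Eq. 5)
    return n_target / (n_s / alpha + n_ns)       # penalty on s counts (Eq. 6)
```

For example, with N_S = 5, N_NS = 10, and α = 0.1, a maximally salient target with count 5 receives belief 5/6 ≈ 0.83, well above the raw proportion 5/15 ≈ 0.33, while a non-maximally salient target with count 3 drops from 3/15 to 3/60.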
For NZREs (z = 0), we assume that the number of times that the referent has been referred to is equal to the total number of referents referred to by that NZRE. Thus, the speaker believes that the listener can always resolve the reference when given an NZRE; in other words, the informativeness of NZREs equals 1.

²Our use of the term salience is similar to Hovy et al. (2006)'s use of "recoverability".

The Dataset
We tested our model on the Chinese portion of the OntoNotes Release 5.0 data³ (Hovy et al., 2006), which has been widely used in (ZP) co-reference resolution tasks. The corpus contains 1,729 documents, including 143,620 referring expressions. Table 1 gives basic statistics on the recoverable zero pronouns in the OntoNotes corpus.

Baseline. In this work, we used the modified rule 1 of Yeh and Mellish (1997) as the baseline: the RE in subject position will be a ZP if it was referred to in the immediately preceding sentence. The modification is motivated by the fact that 99.2% of ZPs in the OntoNotes corpus are in subject position.

Table 2 shows the results (reported as accuracy) of the various models on the OntoNotes dataset. The dropping ratio α was empirically set to 0.1, and the decay parameter a of recency was set to 0.8. The window size was 1, so the simulator only looks at the current sentence and the preceding sentence.
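The baseline rule is trivial to implement. The function below is our sketch; the boolean features are assumed to come from the corpus annotation:

```python
def baseline_predict_zp(is_subject, referred_in_prev_sentence):
    """Modified rule 1 of Yeh and Mellish (1997): predict a ZP iff the RE is
    in subject position and its referent was mentioned in the immediately
    preceding sentence; everything else is predicted as an NZRE."""
    return is_subject and referred_in_prev_sentence
```

Because the rule predicts an NZRE for every RE in object position, it automatically benefits from the fact that nearly all object-position REs in the corpus are non-zero.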

Experiment Results
As expected, the models that look back over the whole preceding discourse perform badly at predicting ZPs (8.35% accuracy), especially DNZPs. They tend to predict all REs as NZREs, performing even worse than the model using the simple rule (i.e., the baseline). In contrast, limiting the discourse history by applying a discourse window, or replacing frequency with recency, has a negative impact on predicting NZREs, more specifically pronouns. This impact stems from the assumption that every NZRE can always be resolved by the listener, which is not correct for pronouns. However, so far we cannot calculate the informativeness of pronouns properly, since we do not know which referent (speaker or listener) a deictic pronoun in the corpus refers to. For example, both the speaker and the listener will use "I" to refer to themselves, so we do not know whether a given occurrence of "I" refers to the speaker or the listener. This setting leads to over-estimation of the informativeness of pronouns. On the other hand, computing cost by average length (as we do) over-estimates the costs of pronouns, whose lengths are generally shorter than those of proper names.

³https://catalog.ldc.upenn.edu/ldc2013t19
The baseline model's performance is not bad, especially for predicting AZPs. This is partly because the rule predicts that all REs in object position are NZREs, which is nearly always correct (recall that 99.84% of REs in object position are NZREs). At the same time, if the referent was referred to in the immediately preceding sentence (as the baseline model requires), then it is clearly more salient than if it wasn't. The baseline model is therefore quite similar to the model with a discourse window, but its decisions are made in a simpler way (i.e., based on a simple "if-then" rule).
With respect to overall accuracy in predicting ZPs and NZREs, models with recency perform similarly to those using a discourse window; however, recency offers better predictions of AZPs. Adding a dropping ratio significantly improves performance on predicting DNZPs (accuracy increases from 62.02% to 95.35%) without much decreasing the accuracies for AZPs and NZREs. For the choice of cost function, we found that using the global average length works best.

Conclusion and Future Work
This paper has explored the possibility of using the RSA model for the probabilistic simulation of speakers' use of ZPs (i.e., pro-drop), and has investigated factors that influence speakers' choice.
Our model performs respectably; yet, as mentioned in Section 4, it under-estimates the probability of choosing a pronoun. Solving this problem will require a more fine-grained annotation of the corpus, indicating which person each occurrence of the deictic pronouns "I" and "you" refers to. Once this has been done, we also hope to let the generator distinguish between ZPs, pronouns, proper names, and full noun phrases.
When speakers choose between pronouns and full NPs, sentence position is known to be relevant. For example, pronouns are less common in object than in subject position (Brennan, 1995), presumably because REs in subject position are more salient than those in object position. In the OntoNotes corpus, 99.2% of ZPs appear in subject position; in Chinese, empty categories are acceptable in both subject and object position (including topic position (Huang, 1984)), but even there they are most frequent in subject position. The performance of the baseline model introduced in this paper suggests that taking position into account helps in modelling pro-drop; in future work, we will explore combining that factor with the RSA model for pro-drop introduced here. We will also investigate alternative ways to estimate informativeness and costs. For example, it would be natural to use a co-reference resolver to calculate informativeness. Furthermore, one could follow (Yamura-Takei et al., 2001; Roh and Lee, 2003) in using elements of centring theory (Grosz et al., 1995) in the definition of cost (e.g., giving Rough Shifts a high cost). Alternatively, one could improve the model by adopting a trainable function for estimating both informativeness and costs.

Table 2: Accuracies of each model; recall that AZP and DNZP are two sub-categories of ZP.