Modelling the Interpretation of Discourse Connectives by Bayesian Pragmatics

We propose a framework to model human comprehension of discourse connectives. Following the Bayesian pragmatic paradigm, we argue that discourse connectives are interpreted through a simulation of the speaker's production process, while the speaker, in turn, considers the ease of interpretation for the listener when choosing connectives. Evaluation against the sense annotations of the Penn Discourse Treebank confirms the superiority of the model over literal comprehension. A further experiment demonstrates that the proposed model also improves automatic discourse parsing.


Introduction
A growing body of evidence shows that human interpretation and production of natural language are inter-related (Clark, 1996; Pickering and Garrod, 2007; Zeevat, 2011; Zeevat, 2015). In particular, evidence shows that during interpretation, listeners simulate how the utterance was produced, and during language production, speakers simulate how the utterance will be perceived. One explanation is that the human brain reasons by Bayesian inference (Doya, 2007; Kilner et al., 2007), which is, at the same time, a popular formulation in language technology.
In this work, we model how humans interpret the sense of a discourse relation based on the Bayesian pragmatic framework. Discourse relations are relations between units of text that make a document coherent. These relations are either marked by discourse connectives (DCs), such as 'but' or 'as a result', or conveyed implicitly, as in the following examples:

1. He came late. In fact, he came at noon.
2. It is late. I will go to bed.

The explicit DC 'in fact' in Example (1) marks a Specification relation. On the other hand, a Result relation can be inferred between the two sentences in Example (2), although there is no explicit marker. We say the two sentences (called arguments) are connected by an implicit DC.
Discourse relations have a mixture of semantic and pragmatic properties (Van Dijk, 1980; Lewis, 2006). For example, the sense of a discourse relation is encoded in the semantics of a DC (Example (1)), yet the interpretation of polysemous DCs (such as 'since' and 'as') and implicit DCs relies on the pragmatic context (Example (2)).
This work seeks to find out whether Bayesian pragmatic approaches are applicable to human comprehension of discourse relations. Our contributions include: (i) an adaptation of the Bayesian Rational Speech Acts model to DC interpretation using a discourse-annotated corpus, the Penn Discourse Treebank; and (ii) the integration of the proposed model with a state-of-the-art automatic discourse parser to improve discourse sense classification.

Related work
There is a growing literature arguing that the human motor control and sensory systems make estimations from a Bayesian perspective (Doya, 2007; Oaksford and Chater, 2009). For example, it has been proposed that the brain's mirror neuron system recognizes a perceptual input by Bayesian inference (Kilner et al., 2007). Similarly, behavioural, physiological and neurocognitive evidence supports the view that the human brain reasons about uncertainty in natural language comprehension by emulating the language production process (Galantucci et al., 2006; Pickering and Garrod, 2013).
Analogous to this principle of Bayesian language perception, a series of studies has developed Grice's Maxims (Grice, 1975) into game-theoretic approaches (Jäger, 2012; Frank and Goodman, 2012; Goodman and Stuhlmüller, 2013; Goodman and Lassiter, 2014; Benz et al., 2016). These proposals argue that the speaker and the listener cooperate in a conversation by recursively inferring each other's reasoning in a Bayesian manner. This framework successfully explains existing psycholinguistic theories and predicts experimental results at various linguistic levels, such as the perception of scalar implicatures (e.g. 'some' meaning 'not all' in pragmatic usage) and the production of referring expressions (Lassiter and Goodman, 2013; Kao et al., 2014; Lassiter and Goodman, 2015). Recent efforts also acquire and evaluate such models using corpus data (Orita et al., 2015; Monroe and Potts, 2015).
The production and interpretation of discourse relations are also a form of cooperative communication between speakers and listeners (or authors and readers). We hypothesize that the game-theoretic account of Bayesian pragmatics also applies to human comprehension of the meaning of a DC, which can be ambiguous or even dropped.

Method
This section explains how we model the interpretation of discourse relations by Bayesian pragmatics. The model is based on the formal framework known as the Rational Speech Acts (RSA) model (Frank and Goodman, 2012; Lassiter and Goodman, 2015). Section 3.1 explains the key elements of the RSA model, and Section 3.2 illustrates how it is adapted for discourse interpretation.

The Rational Speech Acts model
The Rational Speech Acts (RSA) model describes the speaker and listener as rational agents who cooperate towards efficient communication. It is composed of a speaker model and a listener model. In the speaker model, the utility function U defines the effectiveness of using utterance d to express meaning s in context C:

U(d; s, C) = \ln P_L(s \mid d, C) - cost(d)   (1)

P_L(s \mid d, C) is the probability that the listener interprets the speaker's intended meaning s. The speaker selects an utterance which, s/he thinks, is informative to the listener. The utility of d is thus defined by its informativeness towards the intended interpretation, quantified by negative surprisal (\ln P_L(s \mid d, C)) according to Information Theory (Shannon, 1948), and is modified by the production cost (cost(d)), which is related to articulation and retrieval difficulties, etc. P_S(d \mid s, C), the probability that the speaker uses utterance d for meaning s, is proportional to the soft-max of the utility of d:

P_S(d \mid s, C) \propto e^{\alpha U(d; s, C)}   (2)
where α, the decision noise parameter, is set to 1.
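As an illustration, Equations (1) and (2) can be sketched in a few lines of Python. The probability table and cost values below are hypothetical toy numbers for two connectives and two senses, not figures from the paper or the PDTB:

```python
import math

# Hypothetical literal-listener probabilities P_L0(s | d, C) for two senses
# and two connectives (toy values; real ones come from corpus counts).
P_L0 = {
    ("but", "Contrast"): 0.9, ("but", "Result"): 0.1,
    ("so", "Contrast"): 0.05, ("so", "Result"): 0.95,
}
COST = {"but": 0.1, "so": 0.1, "null": 0.0}  # an implicit DC costs nothing
ALPHA = 1.0  # decision noise parameter

def utility(d, s):
    """Eq. (1): informativeness (log-probability) minus production cost."""
    return math.log(P_L0[(d, s)]) - COST[d]

def speaker(s, utterances):
    """Eq. (2): P_S(d | s) as a soft-max over utilities."""
    scores = {d: math.exp(ALPHA * utility(d, s)) for d in utterances}
    z = sum(scores.values())
    return {d: v / z for d, v in scores.items()}

print(speaker("Contrast", ["but", "so"]))  # 'but' dominates
```

With these toy numbers the speaker strongly prefers 'but' for a Contrast sense, since it is far more informative to the literal listener at near-identical cost.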
On the other hand, the probability that the listener infers meaning s from utterance d is defined by Bayes' rule:

P_L(s \mid d, C) = \frac{P_S(d \mid s, C) \, P_L(s)}{\sum_{s' \in S} P_S(d \mid s', C) \, P_L(s')}   (3)

The listener infers the speaker's intended meaning by considering how likely, s/he thinks, the speaker is to use that utterance (P_S(d \mid s, C)). The inference is also related to the salience of the meaning (P_L(s)), a private preference of the listener.
To summarize, the speaker and listener emulate each other's language processing. However, instead of unlimited iterations (i.e. the speaker thinks the listener thinks the speaker thinks ...), the inference is grounded in the literal interpretation of the utterance. Figure 1 illustrates the direction of pragmatic inference between the speaker and listener in their minds. Our experiment compares the predictions of the literal listener (L_0), the pragmatic listener who reasons for one level (L_1), and the pragmatic listener who reasons for two levels (L_2). Previous work demonstrates that one level of reasoning is robust in modelling humans' interpretation of scalar implicatures (Lassiter and Goodman, 2013; Goodman and Stuhlmüller, 2013).
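The grounded recursion (L_0 → S_1 → L_1 → S_2 → L_2) can be sketched as a pair of mutually recursive functions; the literal-listener table, prior, and costs below are again hypothetical toy values:

```python
import math

SENSES = ["Contrast", "Result"]
DCS = ["but", "so"]

# Hypothetical literal-listener table P_L0(s | d) and sense salience P_L(s);
# real values would be estimated from corpus counts.
P_L0 = {("but", "Contrast"): 0.9, ("but", "Result"): 0.1,
        ("so", "Contrast"): 0.05, ("so", "Result"): 0.95}
PRIOR = {"Contrast": 0.4, "Result": 0.6}
ALPHA = 1.0
COST = {"but": 0.1, "so": 0.1}

def listener(s, d, n):
    """P_Ln(s | d): literal lookup at n = 0, else Bayes' rule over speaker S_n."""
    if n == 0:
        return P_L0[(d, s)]
    num = speaker(d, s, n) * PRIOR[s]
    den = sum(speaker(d, s2, n) * PRIOR[s2] for s2 in SENSES)
    return num / den

def speaker(d, s, n):
    """P_Sn(d | s): soft-max of utility w.r.t. the level n-1 listener."""
    util = lambda d2: math.log(listener(s, d2, n - 1)) - COST[d2]
    num = math.exp(ALPHA * util(d))
    den = sum(math.exp(ALPHA * util(d2)) for d2 in DCS)
    return num / den

for n in (0, 1, 2):
    print(n, {d: round(listener("Contrast", d, n), 3) for d in DCS})
```

Each listener level re-weights the level below it, and because the recursion bottoms out at the literal listener, the computation always terminates.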

Applying the RSA model on discourse relation interpretation
We use the listener model of RSA to model how listeners interpret the sense of a DC. Given the DC d and context C in a text, the listener's interpreted relation sense s_i is the sense that maximizes P_L(s \mid d, C):

s_i = \arg\max_{s \in S} P_L(s \mid d, C)   (4)

where S is the set of defined relation senses. The literal listener, L_0, interprets a DC directly by its most likely sense in the context:

P_{L_0}(s \mid d, C) = \frac{count(s, d, C)}{\sum_{s' \in S} count(s', d, C)}   (5)

The probability is estimated by counting co-occurrences in corpus data, the Penn Discourse Treebank, in which explicit and implicit DCs are labelled with discourse relation senses.
More details about the annotation of the PDTB are given in Section 4.1. As shown in Figure 1, the pragmatic speaker S_1 estimates the utility of a DC by emulating the comprehension of the literal listener L_0 (Eq. 1, 2). The probability that the pragmatic speaker S_n uses DC d to express meaning s is estimated as:

P_{S_n}(d \mid s, C) = \frac{e^{\alpha (\ln P_{L_{n-1}}(s \mid d, C) - cost(d))}}{\sum_{d' \in D} e^{\alpha (\ln P_{L_{n-1}}(s \mid d', C) - cost(d'))}}   (6)

where n ≥ 1 and D is the set of annotated DCs, including 'null', which stands for an implicit DC. The cost function in Equation 6, cost(d), measures the production effort of the DC. As DCs are mostly short words, we simply define the cost of producing any explicit DC as a constant positive value, which is tuned manually in the experiments. On the other hand, the production cost of an implicit DC is 0, since no word is produced.
In turn, the pragmatic listener L_1 emulates the DC production of the pragmatic speaker S_1 (Eq. 3). The probability that the pragmatic listener L_n assigns meaning s to DC d is estimated as:

P_{L_n}(s \mid d, C) = \frac{P_{S_n}(d \mid s, C) \, P_L(s)}{\sum_{s' \in S} P_{S_n}(d \mid s', C) \, P_L(s')}   (7)

where n ≥ 1 and S is the set of defined senses. The salience of a relation sense in Equation 7, P_L(s), is defined by the frequency of the sense in the corpus:

P_L(s) = \frac{count(s)}{\sum_{s' \in S} count(s')}   (8)
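A minimal sketch of Equations (5)-(8), assuming a toy table of sense-connective counts in place of the real PDTB annotations (the counts, sense labels, and cost constant below are all hypothetical):

```python
import math
from collections import Counter

# Toy (relation-sense, connective) annotation counts standing in for PDTB
# data; "null" marks an implicit DC. Real counts would come from the treebank.
counts = Counter({
    ("Contrast", "but"): 90, ("Result", "but"): 10,
    ("Result", "so"): 40, ("Contrast", "so"): 5,
    ("Contrast", "null"): 30, ("Result", "null"): 60,
})
SENSES = {s for s, _ in counts}
DCS = {d for _, d in counts}
ALPHA = 1.0
COST = lambda d: 0.0 if d == "null" else 0.5  # constant cost, tuned manually

def p_l0(s, d):
    """Eq. (5): literal interpretation from co-occurrence counts."""
    return counts[(s, d)] / sum(counts[(s2, d)] for s2 in SENSES)

def salience(s):
    """Eq. (8): relative frequency of the sense in the corpus."""
    return sum(counts[(s, d)] for d in DCS) / sum(counts.values())

def p_s1(d, s):
    """Eq. (6): pragmatic speaker soft-max over all DCs, incl. 'null'."""
    score = lambda d2: math.exp(ALPHA * (math.log(p_l0(s, d2)) - COST(d2)))
    return score(d) / sum(score(d2) for d2 in DCS)

def p_l1(s, d):
    """Eq. (7): pragmatic listener combines speaker model and salience."""
    num = p_s1(d, s) * salience(s)
    return num / sum(p_s1(d, s2) * salience(s2) for s2 in SENSES)

best = max(SENSES, key=lambda s: p_l1(s, "null"))
print(best)  # most plausible sense of an implicit DC under the toy counts
```

Because 'null' carries zero cost but little literal information, the pragmatic listener falls back largely on how often each sense is left implicit, as in the corpus-based model described above.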
Lastly, we propose to define the context variable C as the immediately preceding discourse relation, so as to resemble incremental processing. We hypothesize that certain patterns of relation transitions are more expected and predictable than others. Discourse contexts in terms of relation sense, relation form (explicit DC or not), and the sense-form pair are compared in the experiments.

Experiment
This section describes experiments that evaluate the model against a discourse-annotated corpus. We seek to answer the following questions: (1) Can the proposed model explain the sense interpretation (annotation) of the DCs in the corpus? (2) Is the DC interpretation refined by the context in terms of the previous discourse structure? (3) Does the proposed model help automatic discourse parsing? We first briefly introduce the corpus resource we use, the Penn Discourse Treebank.

Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) (Prasad et al., 2008) is the largest available discourse-annotated resource in English. The raw texts are collected from news articles of the Wall Street Journal. In the PDTB, all explicit DCs are annotated with discourse senses, while implicit discourse senses are annotated between adjacent sentences. Other forms of discourse relations, such as 'entity relations', are also labelled. In total, there are 5 form labels and 42 distinct sense labels, some of which occur only very sparsely.
We thus use a simplified version of the annotation, which has 2 form labels (Explicit and Non-Explicit DC) and 15 sense labels (first column of Table 3), following the mapping convention of the CoNLL shallow discourse parsing shared task (Xue et al., 2015). Sections 2-22 are used as the training set, and the rest of the corpus, Sections 0, 1, 23 and 24, is combined as the test set. Sizes of the data sets are summarized in Table 1.

             Train    Test    Total
  Explicit   15,402   3,057   18,459
  Non-Exp    18,569   3,318   21,887
  Total      33,971   6,375   40,346

Table 1: Sizes of the data sets.

The RSA model argues that a rational listener does not just stick to the literal meaning of an utterance. S/he should reason about how likely the speaker is to use that utterance in the current context, based on the informativeness and production effort of the utterance. If the RSA model explains DC interpretation as well, discourse sense predictions made by the pragmatic listeners should outperform predictions by the literal listener.
In this experiment, we compare the DC interpretations of the literal listener L_0 and the pragmatic listeners L_1 and L_2. Given a DC d and the discourse context C of each test instance, the relation sense is deduced by maximizing the probability estimate P_L(s \mid d, C). P_{L_0}(s \mid d, C) is based simply on co-occurrences in the training data (Eq. 5). P_{L_1}(s \mid d, C) and P_{L_2}(s \mid d, C) are calculated by Eq. 6 and 7, in which the salience of each sense is also extracted from the training data (Eq. 8).
Table 2: Accuracy of prediction by L_0, L_1 and L_2. Improvements above the baseline are bolded. * means significant at p < 0.02 by McNemar Test.

Table 2 shows the accuracy of discourse sense prediction by listeners L_0, L_1 and L_2 when provided with various discourse contexts. Predictions by L_1, when they differ from the predictions by L_0 under the 'constant' context, are more accurate than expected by chance. This supports the claim that the RSA framework models DC interpretation. Overall, predictions of Non-Explicit senses hardly differ among the models, since an implicit DC is much less informative than an explicit DC. Moreover, previous relation senses or forms do not improve the accuracy, suggesting that a more generalized formulation of contextual information is required to refine discourse understanding. It is also observed that the predictions of L_2 are mostly the same as those of L_1. This implies that the listener is unlikely to emulate the speaker's production iteratively at deeper levels.

Insights on automatic discourse parsing
Next, we investigate whether the proposed method helps automatic discourse sense classification. A full discourse parser typically consists of a pipeline of classifiers: explicit and implicit DCs are first distinguished and then processed separately by two classifiers (Xue et al., 2015). In contrast, the pragmatic listener of the RSA model considers whether the speaker would prefer a particular DC, explicit or implicit, when expressing the intended sense.
In this experiment, we integrate the output of an automatic discourse parser with the probability predictions of the pragmatic listener L_1. We employ the winning parser of the CoNLL shared task (Wang and Lan, 2015). The parser is also trained on Sections 2-22 of the PDTB, so its training data does not overlap with our test set. The sense classification of the parser is based on a pool of lexico-syntactic features drawn from gold-standard arguments, DCs, and automatically parsed trees produced by CoreNLP (Manning et al., 2014).
For each test sample, the parser outputs a probability estimate for each sense. We use these estimates to replace the salience measure P_L(s) (Eq. 8) and deduce P_{L_1}(s \mid d, C), where C is the previous relation form. Significant improvement in classification accuracy is achieved, and the F1 scores for most senses are improved. This confirms the potential of applying our model to automatic discourse parsing.
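This integration step can be sketched as a simple reranking: the parser's per-sense estimates stand in for the salience prior in the pragmatic listener (Eq. 7). The speaker probabilities and parser outputs below are hypothetical placeholders, not values from the experiment:

```python
# Hypothetical speaker probabilities P_S1(d | s, C) for one observed DC d,
# and per-sense probability estimates output by a discourse parser.
p_s1_given_sense = {"Contrast": 0.7, "Result": 0.2, "Conjunction": 0.1}
parser_probs = {"Contrast": 0.3, "Result": 0.5, "Conjunction": 0.2}

def rerank(speaker_probs, prior):
    """Eq. (7) with the parser's estimates substituted for P_L(s)."""
    scores = {s: speaker_probs[s] * prior[s] for s in speaker_probs}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

reranked = rerank(p_s1_given_sense, parser_probs)
print(max(reranked, key=reranked.get))
```

Here the pragmatic speaker model can overturn the parser's top choice when the observed DC would be an unlikely way to express that sense.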

Conclusion
We propose a new framework to model the interpretation of discourse relations based on Bayesian pragmatics. Experimental results support the applicability of the model to human DC comprehension and automatic discourse parsing. As future work, we plan to deduce a more general abstraction of the context governing DC interpretation. A larger goal is to design a full, incremental discourse parsing algorithm motivated by the psycholinguistic reality of human discourse processing.