Modelling the Usage of Discourse Connectives as Rational Speech Acts

Discourse relations can either be implicit or explicitly expressed by markers, such as 'therefore' and 'but'. How a speaker makes this choice is a question that is not well understood. We propose a psycholinguistic model that predicts whether a speaker will produce an explicit marker given the discourse relation s/he wishes to express. Based on the framework of the Rational Speech Acts model, we quantify the utility of producing a marker based on the information-theoretic measure of surprisal, the cost of production, and a bias to maintain uniform information density throughout the utterance. Experiments based on the Penn Discourse Treebank show that our approach outperforms state-of-the-art approaches, while giving an explanatory account of the speaker's choice.


Introduction
Speakers or authors produce informative utterances so that listeners or readers can understand their message. Grice's Maxim of Quantity states that human speakers communicate by being as informative as required, but no more (Grice, 1975). If a speaker always tried to provide as much information as possible, the resulting utterance could become excessively long and tedious. Such an utterance is not only effortful for the speaker to produce, but also contains redundant information that the listener does not need.
In this work, we model how speakers plan the presentation of discourse structure optimally in terms of informativeness. Specifically, we propose a model that predicts whether the speaker will use or omit a discourse connective, given the sense of discourse relation s/he wants to convey.
Discourse relations are relations between units of text (known as arguments) that make a document coherent. These relations can be marked in the surface text or inferred by the readers, as shown in the examples below.
1. It was a great movie, but I did not like it.
2. It was a great movie, therefore I liked it.
3. It was a great movie. I liked it.
The word 'but' indicates a Concession relation in Example (1), and 'therefore' indicates a Result relation in Example (2). We call 'but' and 'therefore' explicit discourse connectives (DCs). In Example (3), DCs are absent but a Result relation can be inferred. We say the DC is implicit in this case.
Explicit DCs are highly informative cues to identify discourse relations (Pitler et al., 2008) while implicit DCs are more ambiguous. For example, 'I liked it' can also be read as a Justification for the first sentence in Example (3).
Marking a discourse relation or not is subject to ambiguity and redundancy. On one hand, using an explicit DC avoids ambiguity. For example, if the DC 'but' is omitted in Example (1), readers may have problems in inferring the Concession sense.
On the other hand, if the intended discourse sense is highly predictable, it is verbose or redundant to insert an explicit DC in the utterance, such as the DC 'therefore' in Example (2).
A model that predicts the markedness of discourse relations not only contributes to a better understanding of the human language production mechanism, but is also important for generating natural, humanlike texts and dialogues. In particular, the degree of markedness in discourse relations differs cross-lingually. Yung et al. (2015) analyze the manual alignments of explicit and implicit DCs in a Chinese-English translation corpus and find that 30% of implicit DCs in Chinese are translated to explicit DCs in English. It remains a challenge for machine translation systems to explicitate or implicitate discourse relations in the source texts as human translators do (Becher, 2011; Meyer and Webber, 2013; Zufferey and Cartoni, 2014; Hoek and Zufferey, 2015), since the markedness of the translation is subject to the discourse planning of the target text.
In order to explain how human speakers choose the optimal level of markedness in their utterances, we model how speakers rationally balance ambiguity against redundancy. In particular, we use the Rational Speech Acts (RSA) model (Frank and Goodman, 2012) to predict how speakers reason about the ambiguity of an utterance. In addition, we model how speakers adjust the redundancy of the utterance following the Uniform Information Density (UID) principle (Levy and Jaeger, 2006).
We apply the framework to predict whether an explicit or implicit DC is used in corpus data, given the two arguments of the discourse relation and the discourse sense to be conveyed. Our model not only achieves higher accuracy than previous work (Patterson and Kehler, 2013), but also provides an interpretable account of the various cognitive factors behind the predicted decision.
We start with a review of related work in Section 2, followed by a description of our model in Section 3 and of the experiments in Section 4.

Related work
We first provide background information on RSA and UID, which are used in our proposed method. We then introduce previous work on predicting DC markedness in corpus data.

Rational Speech Acts model
The RSA model (Frank and Goodman, 2012) is a variation of the game-theoretic approach in pragmatics (Jäger, 2012). It explains the communicative reasoning of a speaker and a listener in terms of Bayesian probabilities.
A rational listener assumes the utterance s/he hears contains the optimal amount of information. S/he predicts the intended message of a speaker by Bayesian inference (Equation 1).
$$P_{listener}(s \mid w, C) \propto P_{speaker}(w \mid s, C) \cdot P(s) \qquad (1)$$
where w is the utterance produced by the speaker, s is the message of the utterance, and C is the context. P_speaker(w|s, C) represents the listener's model of the speaker, and P(s) represents the salience of the message, which is knowledge shared between the speaker and the listener.
A rational speaker chooses an utterance by softmax-optimizing the expected utility U(w; s, C) of the utterance (Equation 2).
$$P_{speaker}(w \mid s, C) \propto e^{\alpha \cdot U(w; s, C)} \qquad (2)$$
α is the decision noise parameter, which is set to 1 to represent a rational speaker. The speaker emulates the listener's interpretation and chooses an utterance s/he believes to be informative. An utterance that is easy to produce is also preferred. Utility is thus defined as the informativeness I(s; w, C) of the utterance minus the cost D(w) of producing it (Equation 3).
$$U(w; s, C) = I(s; w, C) - D(w) \qquad (3)$$
Since utterances that are unconventional and surprising are less useful, informativeness is quantified as the negative surprisal of the utterance with respect to the message to be conveyed (Equation 4).
$$I(s; w, C) = \ln P(s \mid w, C) \qquad (4)$$
The RSA model has successfully simulated the results of psycholinguistic experiments concerning different aspects of human communication, such as scalar implicature, referential expressions and language acquisition (Frank and Goodman, 2012; Goodman and Stuhlmüller, 2013; Smith et al., 2013; Kao et al., 2014). Besides experimental data, Orita et al. (2015) apply the RSA model to predict the choice of referring expressions in corpus data, and Monroe and  optimize a classifier based on RSA by inducing the semantic lexicon from a training corpus. These works focus on the pragmatic use of language, where the informativeness and lexicon of an utterance largely depend on the context (e.g. 'red' is not a valid way to refer to a blue ball).
In this work, we apply RSA to predict the usage of DCs, which is more universal across different contexts (i.e. a DC can be used or dropped given various discourse senses and contexts). Our model is built upon the speaker's model of RSA to predict the speaker's choice between explicit and implicit DCs.
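The speaker side of Equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names and the probabilities and costs below are hypothetical values chosen only to show the mechanics.

```python
import math


def utility(p_s_given_w, cost):
    """Equations (3)-(4): negative surprisal ln P(s|w, C) minus production cost D(w)."""
    return math.log(p_s_given_w) - cost


def speaker_probs(utilities, alpha=1.0):
    """Equation (2): softmax over utterance utilities, normalised to sum to 1."""
    top = max(utilities.values())  # subtract the maximum for numerical stability
    weights = {w: math.exp(alpha * (u - top)) for w, u in utilities.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


# Hypothetical values: the explicit form conveys the sense more reliably
# but costs more effort to produce.
utilities = {
    "explicit": utility(p_s_given_w=0.8, cost=0.5),
    "implicit": utility(p_s_given_w=0.4, cost=0.0),
}
probs = speaker_probs(utilities)
```

With these illustrative numbers the explicit form keeps the higher utility despite its cost, so the softmax assigns it the larger probability.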

Uniform Information Density
The UID principle views language communication as a form of information transmission through a noisy channel, for which a constant rate of information flow is optimal according to Shannon's Information Theory (Levy and Jaeger, 2006; Genzel and Charniak, 2002; Shannon, 1948). It states that speakers structure utterances so as to optimize information density, which is the quantity of information (measured by surprisal) transmitted per unit of utterance, such as a word.
Information density rises when the utterance is 'surprising' and drops when an utterance is highly predictable. To smooth the peaks and troughs, speakers adjust the ambiguity of an utterance by including or reducing linguistic markers.
Following the UID principle, linguistic choices made by speakers are predicted more accurately by incorporating an information density predictor on top of other constraints. The predictor measures how easily a candidate utterance can be predicted and the speaker adjusts information density based on the expected predictability.
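As a rough illustration of information density, the sketch below computes per-unit surprisal for two hypothetical utterances and compares how evenly the information is spread. Using the variance of surprisal as a uniformity measure is an assumption made here for illustration only; it is not a predictor defined in this work, and the probabilities are invented.

```python
import math
from statistics import pvariance


def surprisal(p):
    """Surprisal in bits of a unit that occurs with probability p."""
    return -math.log2(p)


def density_profile(unit_probs):
    """Per-unit surprisal values for a sequence of unit probabilities."""
    return [surprisal(p) for p in unit_probs]


# Hypothetical per-unit probabilities: one flat utterance and one with a
# single highly surprising unit (a peak in information density).
flat_utterance = [0.5, 0.5, 0.5, 0.5]
peaky_utterance = [0.9, 0.9, 0.05, 0.9]

flat_variance = pvariance(density_profile(flat_utterance))
peaky_variance = pvariance(density_profile(peaky_utterance))
```

Under this assumed measure, the utterance with the surprising unit has the less uniform profile, which is the kind of peak that UID expects speakers to smooth out.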

Explicit vs. Implicit DCs
The choice of discourse marking strategies has been studied in earlier works as a subtask for natural language generation (Scott and de Souza, 1990; Moser and Moore, 1995; Grote and Stede, 1998; Soria and Ferrari, 1998; Allbritton and Moore, 1999). In the absence of large-scale resources, investigations are based on manually derived rules and lexicons or psycholinguistic experiments.
More recently, Asr and Demberg (2012) present an analysis of the PDTB, showing that 'causal' and 'continuous' senses are more often implicit, or marked by less specific DCs. Indeed, these senses are presupposed by listeners according to linguistic theories (Segal et al., 1991; Murray, 1997; Levinson, 2000; Sanders, 2005; Kuperberg et al., 2011). On the other hand, Asr and Demberg (2015) find that DCs are more often dropped for the discourse relation Chosen Alternative (the relation typically signalled by the DC 'instead') if the context contains negation words, which are identified cues for this relation. Similarly, contextual differences between explicit and implicit discourse relations are reported in attempts to train implicit DC classifiers on explicit DC instances (Sporleder and Lascarides, 2008; Webber, 2009).
Asr and Demberg (2012) attribute the corpus statistics to the UID hypothesis, which explains that expected, predictable relations are more likely to be conveyed implicitly, and thus more ambiguously, to maintain a steady information flow. However, there are explicit 'causal' and 'continuous' relations, and some Chosen Alternative relations are marked even when argument 1 is negated. Although markedness measures have been proposed to rate the implicitness of a relation sense (Asr and Demberg, 2013; Jin and de Marneffe, 2015), these measures only quantify the general markedness of the sense in the data, not the speaker's choice for each particular instance. In contrast, this work specifically measures the predictability of a given relation; generalizes the approach to all discourse senses instead of particular senses or cues; and combines the markedness preference with other language production factors, in order to model each instance of a relation.
Patterson and Kehler (2013) is the only study we are aware of that predicts the choice of explicit or implicit DCs for each instance of a relation. They argue that while the decision is related to the ease of inferring the relation, it may also depend on other stylistic or textual factors. A classifier is trained to predict whether a candidate DC (i.e. the DC that actually occurs in the text as an explicit DC, or is annotated as an implicit DC) is actually present, given the sense of the discourse relation and the arguments. Relatively shallow linguistic features are used, such as whether the relations are embedded or shared, the previous discourse relation, argument lengths, and content word ratios. The classifier is trained and tested on a subset of relations from the PDTB, after screening out infrequent senses and DCs. A high overall classification accuracy is achieved. Relation-level and discourse-level features are found to be more useful than argument-level features.
However, this work does not aim to explain why an utterance is preferred by the speaker. Its focus is a data-driven approach that replicates the occurrence of DCs in the corpus data. Our work differs in that we model the option of markedness from the viewpoint of human language production, explaining the factors behind the speaker's choice. For example, we do not make use of the candidate DC as a feature, since it is the result of the speaker's choice if an explicit DC is preferred. Nonetheless, our model achieves higher accuracy when evaluated on the same test set.

The markedness model
Our model is based on the speaker's model of RSA. We first explain how we adapt the RSA model to discourse presentation, followed by the details of each component.

RSA for discourse relation presentation
According to Equation (2), with α = 1, the probability that a speaker uses utterance w to convey the intended message s in context C is:
$$P(w \mid s, C) = \frac{e^{U(w; s, C)}}{\sum_{w' \in W} e^{U(w'; s, C)}} \qquad (5)$$
In the case of discourse connectives, the utterance w comes from the set W = {(exp)licit, (imp)licit}, provided that both explicit and implicit DCs are grammatically valid for conveying s, the sense of the discourse relation. Our model thus predicts the speaker's choice of DC based on the following two probabilities:
$$P(\mathrm{exp} \mid s, C) \propto e^{U(\mathrm{exp}; s, C)}, \qquad P(\mathrm{imp} \mid s, C) \propto e^{U(\mathrm{imp}; s, C)} \qquad (6)$$
According to Equation (3), the utility U of an explicit DC equals its informativeness I minus its production cost D.
$$U(\mathrm{exp}; s, C) = I(s; \mathrm{exp}, C) - D(\mathrm{exp}) \qquad (7)$$
I(s; exp, C) is the informativeness of using an explicit DC to present the sense s in discourse-level context C. Each discourse sense has its own salience within the discourse context, which means that C is itself informative; however, we want to quantify the informativeness of the DC only. We therefore define I(s; exp, C) as the difference between the informativeness of 'the explicit DC in context C' and the informativeness of 'context C', both quantified by negative surprisal.
$$I(s; \mathrm{exp}, C) = \ln P(s \mid \mathrm{exp}, C) - \ln P(s \mid C) \qquad (8)$$
High I(s; exp, C) means it is informative and not surprising to use an explicit DC for this sense. P (s|exp, C) and P (s|C) are extracted from corpus data. Details are explained in Subsection 3.2.
The principle of UID is incorporated into the RSA model as a bias on the utility of the DCs. A discourse relation is presented not only by the DC but also by the arguments, and the amount of discourse information in the whole utterance (DC + arguments) is fixed. According to UID, information should be transmitted uniformly across the utterance. If the arguments carry much information about the sense, the sense is predictable from the arguments and thus the surprisal is small. The information density drops and has to be smoothed by using a more ambiguous, less predictable utterance, which can be achieved by omitting the DC (Asr and Demberg, 2015).
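Therefore, according to UID, an implicit DC is preferred if the arguments are informative. We thus raise the utility of an implicit DC by defining the probability that a speaker chooses an implicit DC to be proportional to the sum of the exponentiated utilities of a null DC and of the arguments (args):
$$P(\mathrm{imp} \mid s, C) \propto e^{U(\mathrm{null}; s, C)} + e^{U(\mathrm{args}; s, C)} \qquad (9)$$
where, following Equation (3),
$$U(\mathrm{null}; s, C) = I(s; \mathrm{null}, C) - D(\mathrm{null}) \qquad (10)$$
$$U(\mathrm{args}; s, C) = I(s; \mathrm{arg}, C) - D(\mathrm{args}) \qquad (11)$$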
The amount of information that the null DC provides for the discourse relation is defined similarly to Equation (8):
$$I(s; \mathrm{null}, C) = \ln P(s \mid \mathrm{null}, C) - \ln P(s \mid C) \qquad (12)$$
On the other hand, the informativeness of the arguments, I(s; arg, C), is quantified by negative surprisal in RSA. However, arguments are clauses and sentences, so it is not feasible to extract P(s|args, C) directly from the corpus. We thus approximate I(s; arg, C) by the confidence of a discourse parser in predicting discourse senses from the arguments. Details are explained in Section 3.3.
Lastly, various psycholinguistically motivated measures are explored to approximate the production cost D(exp) in Subsection 3.4. In contrast, no effort is required to produce a null DC. Also, we assume that the arguments have been produced to convey other information irrespective of their discourse informativeness, so no extra effort is needed. Therefore, D(null) and D(args) both equal 0.
To summarize, the model predicts that the speaker will use an explicit DC if
$$e^{U(\mathrm{exp}; s, C)} > e^{U(\mathrm{null}; s, C)} + e^{U(\mathrm{args}; s, C)} \qquad (13)$$
and that s/he will use an implicit DC otherwise.
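The decision rule in Equation (13) can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, the probabilities are invented, and the informativeness terms are taken to be the difference of log-probabilities described in the text, with D(null) = D(args) = 0.

```python
import math


def informativeness(p_s_given_form_ctx, p_s_given_ctx):
    """Informativeness of a DC form: ln P(s|form, C) - ln P(s|C)."""
    return math.log(p_s_given_form_ctx) - math.log(p_s_given_ctx)


def predict_form(p_s_exp, p_s_null, p_s_ctx, info_args, cost_exp):
    """Return 'explicit' if Equation (13) holds, otherwise 'implicit'."""
    u_exp = informativeness(p_s_exp, p_s_ctx) - cost_exp  # explicit DC pays a cost
    u_null = informativeness(p_s_null, p_s_ctx)           # D(null) = 0
    u_args = info_args                                    # D(args) = 0
    if math.exp(u_exp) > math.exp(u_null) + math.exp(u_args):
        return "explicit"
    return "implicit"


# Hypothetical inputs: a sense that is much more likely given an explicit DC,
# and a sense that is much more likely given a null DC.
marked = predict_form(p_s_exp=0.30, p_s_null=0.05, p_s_ctx=0.10,
                      info_args=-2.0, cost_exp=0.2)
unmarked = predict_form(p_s_exp=0.05, p_s_null=0.30, p_s_ctx=0.10,
                        info_args=-2.0, cost_exp=0.2)
```

Raising `info_args` (more informative arguments) or `cost_exp` pushes the prediction toward the implicit form, which mirrors the UID bias and the cost term in the model.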

Informativeness of DCs
This section explains how we estimate the informativeness in Equations (8) and (12). In discourse production, the utterance lexicon, W = {exp, imp} in Equation (5), and the set of the speaker's intended messages (all possible discourse relation senses) are always valid [5]. Thus P(s|C), P(s|exp, C), and P(s|null, C) are universal distributions and can be extracted from corpus data based on the co-occurrences of senses, DCs, and contexts. We extract these empirical distributions from the training portion of the corpus.
We define context C as the surrounding discourse relations. Specifically, the discourse contexts (and their abbreviations in Table 2) are: the full discourse sense annotated in the PDTB (S), the 4-way top-level sense (TS), the form of discourse presentation (F), such as 'explicit' or 'implicit' [6], and the pair of sense and form (SF or TSF). The contexts are taken from window sizes of 1 to 2: previous one (10), next one (01), previous two (20), next two (02), and previous one paired with next one (11). We hypothesize that the speaker also thinks ahead to the upcoming discourse structures when planning the current ones. Various discourse contexts are compared in the experiment.
[5] In the case of referring expressions, for example, the lists of referents and grammatically correct pronouns differ case by case, e.g. 'she' is not a valid pronoun for a male.
[6] We use the 5 forms of discourse presentation defined in the PDTB: explicit DC, implicit DC, alternative lexicalization, entity relation and 'no relation'.
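A minimal sketch of how such empirical distributions could be estimated by relative frequency is given below. The data layout (triples of sense, form, and context), the function names, and the toy samples are assumptions for illustration; no smoothing is applied, so unseen combinations get probability 0 and would need separate handling before taking logarithms.

```python
from collections import Counter


def estimate(samples):
    """Build relative-frequency estimators from (sense, form, context) triples."""
    joint = Counter()      # counts of (sense, form, context)
    form_ctx = Counter()   # counts of (form, context)
    sense_ctx = Counter()  # counts of (sense, context)
    ctx = Counter()        # counts of context
    for sense, form, context in samples:
        joint[(sense, form, context)] += 1
        form_ctx[(form, context)] += 1
        sense_ctx[(sense, context)] += 1
        ctx[context] += 1

    def p_s_given_form_ctx(sense, form, context):
        denom = form_ctx[(form, context)]
        return joint[(sense, form, context)] / denom if denom else 0.0

    def p_s_given_ctx(sense, context):
        denom = ctx[context]
        return sense_ctx[(sense, context)] / denom if denom else 0.0

    return p_s_given_form_ctx, p_s_given_ctx


# Toy samples; the context here is the form of the previous relation
# (context type F with window 10 in the notation of this section).
samples = [
    ("Result", "null", "explicit"),
    ("Result", "null", "explicit"),
    ("Concession", "exp", "explicit"),
    ("Result", "exp", "explicit"),
]
p_form, p_ctx = estimate(samples)
```

For example, `p_form("Result", "exp", "explicit")` plays the role of P(s|exp, C) and `p_ctx("Result", "explicit")` the role of P(s|C).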

Informativeness of arguments
I(s; arg, C) in Equation (11) refers to the amount of information in the arguments that contributes to the interpretation of the discourse sense. According to UID, information density drops when the discourse sense is predictable from the arguments alone, and an implicit DC is preferred.
The presence of features in the arguments that signal a particular sense makes the sense more predictable, and thus promotes the omission of a DC. For example, the DC 'instead' is used less often to present the Chosen Alternative sense if the first argument is negated (Asr and Demberg, 2015).
Generalizing this idea to capture various cues in the arguments for various senses, we approximate I(s; arg, C) by the confidence of an automatic discourse parser in predicting the discourse sense. An implicit relation parser uses various features in the arguments to identify the implicit relation sense (Pitler et al., 2009; Lin et al., 2009; Park and Cardie, 2012; Rutherford and Xue, 2014). If the arguments contain many informative features, the parser will predict the sense more confidently.
We propose two methods, for comparison, to measure the confidence of the parser prediction. A confident prediction means the parser will assign a high probability to the one output sense. Therefore, we use the negative surprisal of the estimated probability P_p of the parser output sense s_output (Equation 14) to approximate I(s; arg, C).
$$I(s; \mathrm{arg}, C) \approx w_a \cdot \ln P_p(s_{output}) \qquad (14)$$
At the same time, the probability distribution of all senses is less uniform if one sense is assigned a high probability. We thus alternatively approximate I(s; arg, C) by the negative entropy of the probability distribution estimated by the parser (Equation 15).
$$I(s; \mathrm{arg}, C) \approx w_a \sum_{s_p \in O} P_p(s_p) \log P_p(s_p) \qquad (15)$$
where O is the set of senses defined in the parser and w_a is a positive weight tuned on the dev set.
We measure the general informativeness of the arguments to imply any discourse senses, so s output does not necessarily equal s.
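The two confidence measures in Equations (14) and (15) can be sketched as below. The sense distributions are hypothetical parser outputs, the function names are illustrative, and the natural logarithm is assumed for both measures.

```python
import math


def neg_surprisal_confidence(sense_probs, w_a=1.0):
    """Equation (14): w_a * ln P_p(s_output), where s_output is the top sense."""
    return w_a * math.log(max(sense_probs.values()))


def neg_entropy_confidence(sense_probs, w_a=1.0):
    """Equation (15): w_a * sum of P_p(s) ln P_p(s) over the parser's senses."""
    return w_a * sum(p * math.log(p) for p in sense_probs.values() if p > 0)


# Hypothetical parser outputs for two relation instances.
confident = {"Cause": 0.90, "Conjunction": 0.05, "Contrast": 0.05}
uncertain = {"Cause": 0.40, "Conjunction": 0.30, "Contrast": 0.30}

scores = {
    "confident": (neg_surprisal_confidence(confident), neg_entropy_confidence(confident)),
    "uncertain": (neg_surprisal_confidence(uncertain), neg_entropy_confidence(uncertain)),
}
```

Both measures are at most 0 and move closer to 0 as the parser becomes more confident, so larger values correspond to more informative arguments.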
We employ the implicit sense classifier from the winning parser of the CoNLL 2015 shared task (Wang and Lan, 2015), which is designed to identify a subset of 14 implicit senses plus the entity relation. The two arguments of a relation instance, which can actually be explicit or implicit, are passed to the implicit DC classifier, and I(s; arg, C) is approximated based on the output probabilities [8]. Although the performance of this state-of-the-art implicit DC classifier is still unsatisfactory (34.45% on PDTB Section 23 [9]), our method only makes use of the probability estimates of the prediction [10].
Our motivation for using the implicit DC classifier is based on the hypothesis that the classifier can better predict the sense of relations that are actually implicit than of those that are actually explicit, since more features in the arguments are identifiable. This is indeed the case: the classification accuracy on the originally explicit relations is significantly lower. This supports our motivation to use the parser estimate as an information density predictor.

Cost function
The cost function D(exp) models speaker's effort required to produce an explicit DC for the intended discourse sense. We propose 5 versions of the cost function that are inspired by existing psycholinguistic findings.
Mean DC length: Production cost intuitively increases with word length. We define the mean DC length of a discourse relation as the mean word length of all valid DCs for that sense, normalized by the average word length of all DCs. A lexicon of possible DCs for each discourse sense is derived from the whole corpus. For multi-word DCs, a white space is simply counted as one character. We do not use the length of the candidate DC (refer to Section 2.3), because we take the view that speakers first decide whether to use an explicit DC, and then decide which DC best expresses the relation.
[8] The implicit DC classifier is trained by Naïve Bayes based on features including syntactic features, polarity, immediately preceding DC, and Brown cluster pairs. Syntactic features are based on automatic parsing using Stanford CoreNLP (Manning et al., 2014). The parser is trained on the same sections of the PDTB as the training set used in our experiment.
[9] http://www.cs.brandeis.edu/~clp/conll15st/results.html
[10] We use the parser's probability estimates as is; conceivably they may be improved by an additional probabilistic calibration step (Nguyen and O'Connor, 2015).
DC/arg2 ratio: Similarly, we use the mean word count normalized by the word count of argument 2 as another version of cost function.
Prime frequency: Structural priming refers to the tendency for humans to process a linguistic construction (the target) more easily if the construction has been used before. In terms of language production, a speaker tends to repeat a previous construction (the prime), since doing so consumes less effort than generating an alternative construction. We use the reciprocal of the count of primes (any explicit DC occurring before the current position) as the production cost, since the strength of the priming effect is known to increase with the frequency of the primes (Levelt and Kelter, 1982; Bock, 1986; Smith and Wheeldon, 2001).
Prime distance: We also use the prime-target distance, normalized by the length of the article, as another version of the production cost. Psycholinguistic findings suggest that the priming effect is more subtly affected by the prime-target distance (Gries, 2005; Bock et al., 2007; Jaeger and Snider, 2008).
Distance from start: We use the relative position of the relation within the article as the production cost. We hypothesize that more effort is needed as the production proceeds.
The range of values of the cost function depends on the cost definition. We thus adjust the values with a constant weight w_c that is tuned on the dev set in the experiments.
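The cost definitions above can be sketched as simple functions. Several details are assumptions made for illustration and are not specified in the text: the function names, the unit in which positions and article length are measured, the fallback cost of 1.0 when no prime has occurred yet, and the use of plain multiplication to apply the weight w_c.

```python
def mean_dc_length(sense_dcs, all_dcs):
    """Mean character length of the DCs valid for a sense (a space counts as
    one character), normalised by the mean length over all DCs."""
    sense_mean = sum(len(dc) for dc in sense_dcs) / len(sense_dcs)
    overall_mean = sum(len(dc) for dc in all_dcs) / len(all_dcs)
    return sense_mean / overall_mean


def prime_frequency(num_primes):
    """Reciprocal of the number of explicit DCs seen so far.
    Assumption: with no prime yet, fall back to a cost of 1.0."""
    return 1.0 / num_primes if num_primes > 0 else 1.0


def prime_distance(prime_pos, target_pos, article_len):
    """Distance between prime and target, normalised by article length."""
    return (target_pos - prime_pos) / article_len


def distance_from_start(position, article_len):
    """Relative position of the relation within the article."""
    return position / article_len


def weighted_cost(raw_cost, w_c):
    """Assumption: the tuned weight w_c scales the raw cost multiplicatively."""
    return w_c * raw_cost


# Hypothetical lexicon fragment for illustration.
all_dcs = ["but", "however", "therefore", "as a result"]
contrast_cost = mean_dc_length(["but", "however"], all_dcs)
```

Each function returns a raw cost on its own scale, which is why a single tuned weight is applied before the cost enters the utility.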

Experiment
We apply the model to simulate the speaker's choice of explicit or implicit DC for discourse relations in the PDTB corpus. The aim of the experiment is to answer two questions: (1) Does the model explain the factors affecting the speaker's choice of DC markedness? If the hypotheses of the model are appropriate, each component in the model should contribute to the prediction accuracy.
(2) How does the prediction performance compare with the state-of-the-art, i.e. Patterson and Kehler (2013)?
We first describe the details of the data we use in the experiments.

Data: The Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) is the largest available discourse-annotated corpus of English (Prasad et al., 2008). The texts are news articles collected from the Wall Street Journal. Below are 3 examples of the annotation. Explicit DCs are labelled with relation senses (Example 1). If an explicit DC is absent between two sentences within the same paragraph and an implicit relation can be inferred, a candidate DC and the relation sense are annotated (Example 2).
Our model is based on the assumption that W = {explicit, implicit} for all relations, yet it is notable that intra-sentential implicit DCs are not annotated in the PDTB (Prasad et al., 2014). We thus exclude intra-sentential samples, such that W = {explicit, implicit} is always true and free of grammatical constraints. Also, as a result of the annotation procedure, implicit DCs always occur in between 2 arguments in their original order, i.e. Arg1-DC-Arg2. To preserve the original order of the discourse arguments, which is also part of the communicative structure intended by the speaker but out of the scope of this model, we only use samples in the Arg1-DC-Arg2 order. For example, Example (3) is excluded from our training data. Finally, annotations of other forms of discourse relations, such as entity relations and attributions, are also excluded.
The screened data set contains 5,201 explicit and 16,049 implicit relations. Sections 2-22 are used as the training set, from which probability distributions are extracted. For easier comparison with previous work, we select the dev set (sections 0-1) and test set (sections 23-24) in the same way as in Patterson and Kehler (2013), where only relations of infrequent DCs and senses are removed.
The resulting dev and test sets contain 1720 and 1878 relations respectively. Samples not included in our screened dataset are classified as explicit by default.
Table 2: Accuracies and F1 scores of predicted DC markedness. The best values are bolded. +/++: significant improvement over baseline (BL) accuracy at p < 0.05 and p < 0.001 respectively; *: significant improvement over state-of-the-art (SOA) accuracy at p < 0.03 (by Pearson's χ² test). Refer to Section 3.2 for the abbreviations of discourse context C.
[...] samples, each labelled with one of the senses. However, it is notable that the individual senses of a multi-sense relation are not disjoint [12] and having multiple senses is part of the sense (Asr and Demberg, 2013; Prasad et al., 2014). Multi-sense is an important factor in our DC production model: a speaker could have chosen an explicit DC for each sense, but if s/he has to express two senses at the same time, an implicit DC could be more usable. Therefore, we treat all combinations of senses as individual senses, each containing 1 to 3 joint sense labels [13]. This results in a total of 122 senses. Table 1 is a summary of the distribution in descending order of frequency. In fact, joint multi-senses are not rare: the most frequent multi-sense is the 17th most frequent sense.
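Treating each combination of senses as its own sense can be sketched as a small normalisation step. The separator, the alphabetical ordering of labels, and the example relations below are assumptions for illustration; the sense names follow the PDTB hierarchy.

```python
from collections import Counter


def joint_sense(labels):
    """Collapse the sense labels of one relation into a single joint label.
    Sorting makes the joint label independent of annotation order."""
    return "+".join(sorted(set(labels)))


# Hypothetical relations, each annotated with one or two senses.
relations = [
    ["Contingency.Cause.Result"],
    ["Expansion.Conjunction", "Contingency.Cause.Result"],
    ["Contingency.Cause.Result", "Expansion.Conjunction"],
]
sense_counts = Counter(joint_sense(labels) for labels in relations)
```

The resulting counter gives the frequency of each single or joint sense, which is the kind of distribution summarised in Table 1.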

Results
We apply the markedness model to predict the speaker's choice of DC markedness on the dev and test sets. Table 2 shows the results under various settings, evaluated by accuracy and by the harmonic mean of precision and recall (F1) for explicit and implicit relations respectively.
[12] Similarly, certain level 2 senses, as in Example (2), are backed off from level 3 senses due to annotator disagreement. This is also a kind of multi-sense.
[13] There is only 1 sample of 3 joint labels in our screened dataset.
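The evaluation measures named above can be computed as in the following sketch. The gold and predicted labels are toy values chosen for illustration, not results from the experiments.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted form matches the gold form."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1(gold, pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for five relations.
gold = ["exp", "imp", "imp", "exp", "imp"]
pred = ["exp", "imp", "exp", "exp", "imp"]

acc = accuracy(gold, pred)
f1_explicit = f1(gold, pred, "exp")
f1_implicit = f1(gold, pred, "imp")
```

Reporting F1 separately for the explicit and implicit classes matters here because the two classes are imbalanced in the corpus.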
Row BL shows the results of the markedness model without the cost function and argument informativeness component, and with constant context C. We consider this setting as the baseline, in which the prediction is solely based on the distributions of P (s|exp) and P (s|imp). Considerably high accuracy is achieved, suggesting that the speaker's choice of markedness is strongly related to the intended discourse sense.
Row (a) shows the prediction results based on the distributions of P (s|exp, C) and P (s|imp, C), where C is the discourse context. The 5 best combinations of contexts and window sizes are shown. Refining the utility of DCs by these contextual constraints, in particular previous contexts, improves the classification accuracy, but the improvement is not significant. This suggests that speaker's choice of markedness not only depends on surrounding discourse relations but also other contextual factors.
Row (b) shows the contribution of the argument informativeness component, under constant discourse context and production cost. Classification accuracy increases (significantly for the dev set) when the usability of an explicit DC is reduced by the estimated informativeness of the arguments, supporting the UID principle. Predictions based on the surprisal of the parser output sense and on the entropy of the parser output distribution are similar. We also experiment with adjusting by the estimated argument informativeness only if the parser output sense is correct (matching at the top-level sense). A similar improvement is observed.
Row (c) shows the contribution of the cost function, when discourse context is set as constant and argument informativeness is not considered. Adjusting the utility of explicit DCs by their production cost increases the classification accuracy most significantly. Among the various features to model production cost, 'DC length' and 'distance from start' features give the best results.
Row (d) shows the performance of predictions based on the 3 best combinations of components. The highest accuracies and F1 scores are achieved for both explicit and implicit relations.
These results answer the first question of the experiment: the proposed model explains the speaker's choice of DC markedness in terms of DC and argument informativeness and production cost, while contextual discourse structure is a moderate constraint on the choice.
The answer to the second question is also positive. Significant improvement above the state-of-the-art (Row SOA) is achieved by the 2 best combinations (89.0%, 88.9% vs. 86.6%).
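A Pearson's χ² test on two accuracies can be sketched as follows. The counts of correct predictions are reconstructed here from the rounded accuracies and the test set size of 1878, so they are approximations and not the exact counts behind Table 2; the 2x2 layout (correct vs. incorrect for each system) is likewise an assumption about how the test was set up.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson's chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]], with its p-value for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


TEST_SIZE = 1878  # number of test relations reported in the text

# Assumption: correct counts reconstructed from the rounded accuracies.
model_correct = round(0.890 * TEST_SIZE)
soa_correct = round(0.866 * TEST_SIZE)

chi2, p_value = pearson_chi2_2x2(
    model_correct, TEST_SIZE - model_correct,
    soa_correct, TEST_SIZE - soa_correct,
)
```

Because the counts are rounded reconstructions, the resulting statistic should be read as indicative only.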
Lastly, we compare the results with a linear classifier trained on the features specified in the model, i.e. the discrete values of the intended sense and various discourse context definitions, and the real values of various cost functions and argument informativeness estimates. Note that in the proposed model, the training data is used only to derive the P(s|exp, C) and P(s|null, C) distributions, while the linear classifier learns from the features and DC markedness of the training set [14]. The classifier achieves an accuracy of 88.3% on the test set, which does not significantly outperform previous work. This suggests the advantage of the information-theoretic configuration of our model.
[14] When extracting the argument informativeness features from the training set, using the automatic discourse parser, we penalize the parser estimates of the implicit samples by a constant ratio, since the discourse parser is also trained on these samples. We use LIBLINEAR (Fan et al., 2008) to build the classifiers.

Conclusion
We present a language production model that predicts a speaker's choice of whether to use an explicit DC, given the discourse relation s/he wants to express. Our model gives a cognitive account of the speaker's choice and also outperforms previous work on the same task.
Our study shows that a speaker organizes the discourse structure by balancing the pro (informativeness) and con (production cost and redundancy) of using an explicit marker, although the option is a subtle preference in the absence of other grammatical constraints. Using an information-theoretic approach, our model tackles the option as a rational preference by the speaker, who wants to contribute to an informative speech act. Furthermore, we take a logical step forward to formalize the idea of the UID theory, that redundant explicit markers are avoided if the discourse relation is clear enough from the context.
As future work, we plan to improve the markedness model by making fuller use of the training data, such as learning a more expressive formulation of the context governing the choice of explicit or implicit DCs. We also plan to evaluate the effectiveness of the model in applications, such as natural language generation or machine translation tasks. On the other hand, as discourse presentation differs across genres (Webber, 2009) and mediums (Tonelli et al., 2010), the model can be applied to predict the explicitation of discourse relations from, for example, news articles to spoken dialogues. Another direction is to apply the RSA framework in the opposite direction, i.e. to build a listener's model that simulates a listener's recognition of a discourse sense given an utterance, as proposed in Yung et al. (2016).