Better Document-level Sentiment Analysis from RST Discourse Parsing

Discourse structure is the hidden link between surface features and document-level properties, such as sentiment polarity. We show that the discourse analyses produced by Rhetorical Structure Theory (RST) parsers can improve document-level sentiment analysis, via composition of local information up the discourse tree. First, we show that reweighting discourse units according to their position in a dependency representation of the rhetorical structure can yield substantial improvements on lexicon-based sentiment analysis. Next, we present a recursive neural network over the RST structure, which offers significant improvements over classification-based methods.


Introduction
Sentiment analysis and opinion mining are among the most widely used applications of language technology, impacting both industry and a variety of other academic disciplines (Feldman, 2013; Liu, 2012; Pang and Lee, 2008). Yet sentiment analysis is still dominated by bag-of-words approaches, and attempts to include additional linguistic context typically stop at the sentence level (Socher et al., 2013). Since document-level opinion mining inherently involves multi-sentence texts, it seems that analysis of document-level structure should have a role to play.
A classic example of the potential relevance of discourse to sentiment analysis is shown in Figure 1. In this review of the film The Last Samurai, the positive sentiment words far outnumber the negative sentiment words. But the discourse structure, indicated here with Rhetorical Structure Theory (RST; Mann and Thompson, 1988), clearly favors the final sentence, whose polarity is negative. This example is illustrative in more than one way: it was originally identified by Voll and Taboada (2007), who found that manually-annotated RST parse trees improved lexicon-based sentiment analysis, but that automatically-generated parses from the SPADE parser (Soricut and Marcu, 2003), which was then state-of-the-art, did not.
Since this time, RST discourse parsing has improved considerably, with the best systems now yielding 5-10% greater raw accuracy than SPADE, depending on the metric. The time is therefore right to reconsider the effectiveness of RST for document-level sentiment analysis. In this paper, we present two different ways of combining RST discourse parses with sentiment analysis. The methods are both relatively simple, and can be used in combination with an "off the shelf" discourse parser. We consider the following two architectures:
• Reweighting the contribution of each discourse unit, based on its position in a dependency-like representation of the discourse structure. Such weights can be defined using a simple function, or learned from a small amount of data.
• Recursively propagating sentiment up the RST tree, using a recursive neural network defined over the discourse structure.
Both architectures can be used in combination with either a lexicon-based sentiment analyzer, or a trained classifier. Indeed, for users whose starting point is a lexicon-based approach, a simple RST-based reweighting function can offer significant improvements. For those who are willing to train a sentiment classifier, the recursive model yields further gains.

Rhetorical Structure Theory
RST is a compositional model of discourse structure, in which elementary discourse units (EDUs) are combined into progressively larger discourse units, ultimately covering the entire document. Discourse relations may involve a nucleus and a satellite, or they may be multinuclear. In the example in Figure 1, the unit 1C is the satellite of a relationship with its nucleus 1B; together they form a larger discourse unit, which is involved in a multinuclear CONJUNCTION relation.
The nuclearity structure of RST trees suggests a natural approach to evaluating the importance of segments of text: satellites tend to be less important, and nuclei tend to be more important (Marcu, 1999). This idea has been leveraged extensively in document summarization (Gerani et al., 2014; Uzêda et al., 2010; Yoshida et al., 2014), and was the inspiration for Voll and Taboada (2007), who examined intra-sentential relations, eliminating all words except those in the top-most nucleus within each sentence. More recent work focuses on reweighting each discourse unit depending on the relations in which it participates (Heerschop et al., 2011; Hogenboom et al., 2015). We consider such an approach, and compare it with a compositional method, in which sentiment polarity is propagated up the discourse tree.
Marcu (1997) provides the seminal work on automatic RST parsing, but there has been a recent spike of interest in this task, with contemporary approaches employing discriminative learning (Hernault et al., 2010), rich features (Feng and Hirst, 2012), structured prediction (Joty et al., 2015), and representation learning (Ji and Eisenstein, 2014; Li et al., 2014). With many strong systems to choose from, we employ the publicly-available DPLP parser (Ji and Eisenstein, 2014). To our knowledge, this system currently gives the best F-measure on relation identification, the most difficult subtask of RST parsing. DPLP is a shift-reduce parser (Sagae, 2009), and its time complexity is linear in the length of the document.

Sentiment analysis
There is a huge literature on sentiment analysis (Pang and Lee, 2008; Liu, 2012), with particular interest in determining the overall sentiment polarity (positive or negative) of a document. Bag-of-words models are widely used for this task, as they offer accuracy that is often very competitive with more complex approaches. Given labeled data, supervised learning can be applied to obtain sentiment weights for each word. However, the effectiveness of supervised sentiment analysis depends on having training data in the same domain as the target, and this is not always possible. Moreover, in social science applications, the desired labels may not correspond directly to positive or negative sentiment, but may focus on other categories, such as politeness (Danescu-Niculescu-Mizil et al., 2013), narrative frames (Jurafsky et al., 2014), or a multidimensional spectrum of emotions (Kim et al., 2012). In these cases, labeled documents may not be available, so users often employ a simpler method: counting matches against lists of words associated with each category. Such lists may be built manually from introspection, as in LIWC (Tausczik and Pennebaker, 2010) and the General Inquirer (Stone, 1966). Alternatively, they may be induced by bootstrapping from a seed set of words (Hatzivassiloglou and McKeown, 1997; Taboada et al., 2011). While lexicon-based methods may be less accurate than supervised classifiers, they are easier to apply to new domains and problem settings. Our proposed approach can be used in combination with either method for sentiment analysis, and in principle, could be directly applied to other document-level categories, such as politeness.
Figure 2: Dependency-based discourse tree representation of the discourse in Figure 1.

Datasets
We evaluate on two review datasets. In both cases, the goal is to correctly classify the opinion polarity as positive or negative. The first dataset comprises 2000 movie reviews, gathered by Pang and Lee (2004). We perform ten-fold cross-validation on this data. The second dataset is larger, consisting of 50,000 movie reviews, gathered by Socher et al. (2013), with a predefined 50/50 split into training and test sets. Documents are scored on a 1-10 scale, and we treat scores ≤ 4 as negative, ≥ 7 as positive, and ignore scores of 5-6 as neutral, although in principle nothing prevents extension of our approaches to more than two sentiment classes.
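The score thresholds described above can be sketched as a small labeling function. This is an illustrative sketch only: the function name `polarity_from_score` is hypothetical, and the choice of +1 / -1 as label values is an assumption made to match the hinge-loss convention used later in the paper.

```python
from typing import List, Optional


def polarity_from_score(score: int) -> Optional[int]:
    """Map a 1-10 review score to +1 (positive), -1 (negative), or None (neutral, ignored)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be in 1..10, got {score}")
    if score <= 4:
        return -1
    if score >= 7:
        return 1
    return None  # scores of 5-6 are treated as neutral and dropped


scores: List[int] = [1, 4, 5, 6, 7, 10]
labels = [polarity_from_score(s) for s in scores]
# Keep only the documents that received a polarity label.
kept = [(s, y) for s, y in zip(scores, labels) if y is not None]
```

Extending this to more than two classes would only require returning additional label values from the same function.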

Discourse depth reweighting
Our first approach to incorporating discourse information into sentiment analysis is based on quantifying the importance of each unit of text in terms of its discourse depth. To do this, we employ the dependency-based discourse tree (DEP-DT) formulation from prior work on summarization (Hirao et al., 2013). The DEP-DT formalism converts the constituent-like RST tree into a directed graph over elementary discourse units (EDUs), in a process that is a close analogue of the transformation of a headed syntactic constituent parse to a syntactic dependency graph (Kübler et al., 2009). The DEP-DT representation of the discourse in Figure 1 is shown in Figure 2. The graph is constructed by propagating "head" information up the RST tree; if the elementary discourse unit $e_i$ is the satellite in a discourse relation headed by $e_j$, then there is an edge from $e_j$ to $e_i$. Thus, the "depth" of each EDU is the number of times in which it is embedded in the satellite of a discourse relation. The exact algorithm for constructing DEP-DTs is given by Hirao et al. (2013).
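As a rough illustration of the depth definition above, the sketch below counts how many satellite spans enclose each EDU in a toy RST tree. It is not the DEP-DT construction of Hirao et al. (2013); the `RSTNode` class, the "N"/"S" role labels, and the three-EDU tree are all hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional


class RSTNode:
    """Toy RST constituent: a leaf carries an EDU id, an internal node carries children."""

    def __init__(self, role: str, edu: Optional[str] = None,
                 children: Optional[List["RSTNode"]] = None) -> None:
        self.role = role              # "N" for nucleus, "S" for satellite (root treated as "N")
        self.edu = edu                # EDU identifier for leaves, None for internal nodes
        self.children = children or []


def edu_depths(node: RSTNode, depth: int = 0,
               out: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Depth of an EDU = number of satellite spans on the path from the root to that EDU."""
    if out is None:
        out = {}
    if node.role == "S":
        depth += 1                    # entering a satellite embeds everything below one level deeper
    if node.edu is not None:
        out[node.edu] = depth
    for child in node.children:
        edu_depths(child, depth, out)
    return out


# e1 is the top-level nucleus; e2 and e3 sit inside a satellite span, and e3 is itself a satellite.
tree = RSTNode("N", children=[
    RSTNode("N", edu="e1"),
    RSTNode("S", children=[
        RSTNode("N", edu="e2"),
        RSTNode("S", edu="e3"),
    ]),
])
depths = edu_depths(tree)
```

In this toy tree, `e1` has depth 0, `e2` has depth 1, and `e3` has depth 2.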
Given this representation, we construct a simple linear function for weighting the contribution of the EDU at depth $d_i$:
$$\lambda_i = \max\left(0.5,\; 1 - d_i/6\right) \tag{1}$$
Thus, at $d_i = 0$, we have $\lambda_i = 1$, and at $d_i \geq 3$, we have $\lambda_i = 0.5$. Now assume each elementary discourse unit contributes a prediction $\psi_i = \theta^\top w_i$, where $w_i$ is the bag-of-words vector, and $\theta$ is a vector of weights, which may be either learned or specified by a sentiment lexicon. Then the overall prediction for a document is given by
$$\Psi = \sum_i \lambda_i \left(\theta^\top w_i\right) \tag{2}$$
Evaluation. We apply this approach in combination with both lexicon-based and classification-based sentiment analysis. We use the lexicon of Wilson et al. (2005), and set $\theta_j = 1$ for words marked "positive", and $\theta_j = -1$ for words marked "negative". For classification-based analysis, we set $\theta$ equal to the weights obtained by training a logistic regression classifier, tuning the regularization coefficient on held-out data. Results are shown in Table 1. As seen in the comparison between lines B1 and D1, discourse depth weighting offers substantial improvements over the bag-of-words approach for lexicon-based sentiment analysis, with raw improvements of 4-5%. Given the simplicity of this approach, which requires only a sentiment lexicon and a discourse parser, we strongly recommend the application of discourse depth weighting for lexicon-based sentiment analysis at the document level. However, the improvements for the classification-based models are considerably smaller, less than 1% in both datasets.
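The reweighting scheme can be sketched in a few lines of Python. The sketch assumes the linear weight $\lambda_i = \max(0.5, 1 - d_i/6)$, which is the linear function consistent with the two values stated in the text ($\lambda_i = 1$ at depth 0 and $\lambda_i = 0.5$ at depth 3 and beyond). The three-word lexicon, the tokenized EDUs, and their depths are invented for illustration and are not taken from the evaluation data.

```python
from typing import Dict, List


def depth_weight(depth: int) -> float:
    """Assumed linear weight: 1.0 at depth 0, floored at 0.5 from depth 3 onward."""
    return max(0.5, 1.0 - depth / 6.0)


def document_score(edus: List[List[str]], depths: List[int],
                   theta: Dict[str, float]) -> float:
    """Sum of per-EDU lexicon scores, each scaled by its discourse-depth weight."""
    total = 0.0
    for tokens, depth in zip(edus, depths):
        psi = sum(theta.get(tok, 0.0) for tok in tokens)  # bag-of-words score of one EDU
        total += depth_weight(depth) * psi
    return total


# Hypothetical lexicon: +1 for positive words, -1 for negative words.
theta = {"beautiful": 1.0, "great": 1.0, "stunning": 1.0, "boring": -1.0, "dull": -1.0}
edus = [
    ["the", "film", "is", "boring", "and", "dull"],                      # main claim
    ["despite", "beautiful", "scenery", "great", "cast", "stunning", "score"],  # concession
]
unweighted = document_score(edus, [0, 0], theta)  # plain bag-of-words: -2 + 3 = 1.0
weighted = document_score(edus, [0, 3], theta)    # concession down-weighted: -2 + 1.5 = -0.5
```

In this toy case the plain lexicon count is positive, while the depth-weighted score is negative because the deeply embedded concession contributes only half of its raw score.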

Rhetorical Recursive Neural Networks
Discourse-depth reweighting offers significant improvements for lexicon-based sentiment analysis, but the improvements over the more accurate classification-based method are meager. We therefore turn to a data-driven approach for combining sentiment analysis with rhetorical structure theory, based on recursive neural networks (Socher et al.), in which sentiment scores are propagated up the RST tree. For a nucleus-satellite discourse relation, we have
$$\Psi_i = \tanh\left(K_n^{(r_i)} \Psi_{n(i)} + K_s^{(r_i)} \Psi_{s(i)}\right) \tag{3}$$
where $i$ indexes a discourse unit composed from relation $r_i$, $n(i)$ indicates its nucleus, and $s(i)$ indicates its satellite. Returning to the example in Figure 1, the sentiment score for the discourse unit obtained by combining 1B and 1C is obtained from $\tanh\left(K_n^{(\text{elaboration})} \Psi_{1B} + K_s^{(\text{elaboration})} \Psi_{1C}\right)$. Similarly, for multinuclear relations, we have
$$\Psi_i = \tanh\left(\sum_{j \in n(i)} K_n^{(r_i)} \Psi_j\right) \tag{4}$$
In the base case, each elementary discourse unit's sentiment is constructed from its bag-of-words, $\Psi_i = \theta^\top w_i$. Because the structure of each document is different, the network architecture varies in each example; nonetheless, the parameters can be reused across all instances. This approach, which we call a Rhetorical Recursive Neural Network (R2N2), is reminiscent of the compositional model proposed by Socher et al. (2013), where composition is over the constituents of the syntactic parse of a sentence, rather than the units of a discourse. However, a crucial difference is that in R2N2s, the elements $\Psi$ and $K$ are scalars: we do not attempt to learn a latent distributed representation of the sub-document units. This is because discourse units typically comprise multiple words, so that accurate analysis of the sentiment for elementary discourse units is not so difficult as accurate analysis of individual words.
The scores for individual discourse units can be computed from a bag-of-words classifier, or, in future work, from a more complex model such as a recursive or recurrent neural network.
While this neural network structure captures the idea of compositionality over the RST tree, the most deeply embedded discourse units can be heavily down-weighted by the recursive composition (assuming $K_s < K_n$): in the most extreme case of a right-branching or left-branching structure, the recursive operator may be applied $N$ times to the most deeply embedded EDU. In contrast, discourse depth reweighting applies a uniform weight of 0.5 to all discourse units with depth ≥ 3. In the spirit of this approach, we add an additional component to the network architecture, capturing the bag-of-words for the entire document. Thus, at the root node we have:
$$\Psi_{\text{doc}} = \gamma\, \theta^\top \left(\sum_i w_i\right) + \Psi_{\text{rst-root}} \tag{5}$$
with $\Psi_{\text{rst-root}}$ defined recursively from Equations 3 and 4, $\theta$ indicating the vector of per-word weights, and the scalar $\gamma$ controlling the tradeoff between these two components.
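The forward pass of this architecture can be sketched as a short recursive function over scalar scores. The sketch makes several assumptions: a unit's score is the tanh of the $K_n$-weighted nucleus scores plus the $K_s$-weighted satellite score, the document score adds $\gamma$ times the whole-document bag-of-words score to the root score, and all names (`Unit`, `compose`, `document_score`) and numeric values are hypothetical rather than taken from the trained model.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


class Unit:
    """Discourse unit: either an EDU with a base score psi, or a composition of sub-units."""

    def __init__(self, relation: Optional[str] = None,
                 nuclei: Sequence["Unit"] = (),
                 satellite: Optional["Unit"] = None,
                 psi: Optional[float] = None) -> None:
        self.relation = relation
        self.nuclei = list(nuclei)    # one nucleus, or several for a multinuclear relation
        self.satellite = satellite    # None for multinuclear relations
        self.psi = psi                # set only for EDUs (e.g. a bag-of-words score)


def compose(unit: Unit, K: Dict[str, Tuple[float, float]]) -> float:
    """Recursively compute the scalar sentiment score of a discourse unit."""
    if unit.psi is not None:          # base case: elementary discourse unit
        return unit.psi
    k_n, k_s = K[unit.relation]       # relation-specific nucleus / satellite weights
    total = sum(k_n * compose(n, K) for n in unit.nuclei)
    if unit.satellite is not None:
        total += k_s * compose(unit.satellite, K)
    return math.tanh(total)


def document_score(root: Unit, K: Dict[str, Tuple[float, float]],
                   gamma: float, bow_score: float) -> float:
    """Root score plus a gamma-weighted bag-of-words score for the whole document."""
    return gamma * bow_score + compose(root, K)


# Hypothetical weights with K_s < K_n, and hypothetical EDU scores.
K = {"elaboration": (1.0, 0.5), "conjunction": (1.0, 0.5)}
a, b, c = Unit(psi=0.5), Unit(psi=1.0), Unit(psi=-1.0)
inner = Unit(relation="elaboration", nuclei=[b], satellite=c)  # b is nucleus, c is satellite
root = Unit(relation="conjunction", nuclei=[a, inner])         # multinuclear relation
inner_score = compose(inner, K)
doc_score = document_score(root, K, gamma=0.1, bow_score=0.5)
```

Because every score is a scalar, the only trainable quantities in this sketch are the per-relation pairs in `K`, the scalar `gamma`, and whatever parameters produce the EDU scores.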
Learning. R2N2 is trained by backpropagating from a hinge loss objective; assuming $y_t \in \{-1, 1\}$ for each document $t$, we have the loss $L_t = (1 - y_t \Psi_{\text{doc},t})_+$. From this loss, we use backpropagation through structure to obtain gradients on the parameters (Goller and Kuchler, 1996). Training is performed using stochastic gradient descent.
For simplicity, we follow Zirn et al. (2011) and focus on the distinction between contrastive and non-contrastive relations. The set of contrastive relations includes CONTRAST, COMPARISON, ANTITHESIS, ANTITHESIS-E, CONSEQUENCE-S, CONCESSION, and PROBLEM-SOLUTION.
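The two local derivatives that backpropagation through structure chains together here can be written out directly. This is a minimal sketch, not the training code: the function names are hypothetical, and the tanh step assumes each composed unit's score is a tanh of a weighted sum of its children's scores.

```python
def hinge_loss(y: int, psi_doc: float) -> float:
    """L = (1 - y * psi_doc)_+ for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * psi_doc)


def hinge_subgradient(y: int, psi_doc: float) -> float:
    """dL/dpsi_doc: -y while the margin is violated, 0 once y * psi_doc >= 1."""
    return -float(y) if 1.0 - y * psi_doc > 0.0 else 0.0


def tanh_backward(upstream: float, activation: float) -> float:
    """Push a gradient through psi = tanh(z): dL/dz = dL/dpsi * (1 - psi^2)."""
    return upstream * (1.0 - activation ** 2)


# A positive document scored below the margin yields a nonzero gradient.
loss_violated = hinge_loss(1, 0.3)
grad_violated = hinge_subgradient(1, 0.3)
# A negative document scored confidently negative contributes no gradient.
loss_satisfied = hinge_loss(-1, -2.0)
grad_satisfied = hinge_subgradient(-1, -2.0)
# Gradient reaching the pre-activation of a unit whose score is 0.5.
grad_pre_activation = tanh_backward(grad_violated, 0.5)
```

Each stochastic gradient step would then multiply these local terms down the tree to reach the relation weights and the per-word weights.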
Evaluation. Results for this approach are shown in lines R1 and R2 of Table 1. Even without distinguishing between discourse relations, we get an improvement of more than 3% accuracy on the Stanford data, and 0.5% on the smaller Pang & Lee dataset. Adding sensitivity to discourse relations (distinguishing $K^{(r)}$ for contrastive and non-contrastive relations) offers further improvements on the Pang & Lee data, outperforming the baseline classifier (D2) by 1.3%. The accuracy of discourse relation detection is only 60% for even the best systems (Ji and Eisenstein, 2014), which may help to explain why relations do not offer a more substantial boost. An anonymous reviewer recommended evaluating on gold RST parse trees to determine the extent to which improvements in RST parsing might transfer to downstream document analysis. Such an evaluation would seem to require a large corpus of texts with both gold RST parse trees and sentiment polarity labels; the SFU Review Corpus (Taboada, 2008) of 30 review texts offers a starting point, but is probably too small to train a competitive sentiment analysis system.

Related Work
Section 2 mentions some especially relevant prior work. Other efforts to incorporate RST into sentiment analysis have often focused on intra-sentential discourse relations (Heerschop et al., 2011; Zhou et al., 2011; Chenlo et al., 2014), rather than relations over the entire document. Wang et al. (2012) address sentiment analysis in Chinese. Lacking a discourse parser, they focus on explicit connectives, using a strategy that is related to our discourse depth reweighting. Wang and Wu (2013) use manually-annotated discourse parses in combination with a sentiment lexicon, which is automatically updated based on the discourse structure. Zirn et al. (2011) use an RST parser in a Markov Logic Network, with the goal of making polarity predictions at the sub-sentence level, rather than improving document-level prediction. None of the prior work considers the sort of recursive compositional model presented here.
An alternative to RST is to incorporate "shallow" discourse structure, such as the relations from the Penn Discourse Treebank (PDTB). PDTB relations were shown to improve sentence-level sentiment analysis by Somasundaran et al. (2009), and were incorporated in a model of sentiment flow by Wachsmuth et al. (2014). PDTB relations are often signaled with explicit discourse connectives, and these may be used as a feature (Trivedi and Eisenstein, 2013; Lazaridou et al., 2013) or as posterior constraints (Yang and Cardie, 2014). This prior work on discourse relations within sentences and between adjacent sentences can be viewed as complementary to our focus on higher-level discourse relations across the entire document.
There are unfortunately few possibilities for direct comparison of our approach against prior work. Heerschop et al. (2011) and Wachsmuth et al. (2014) also employ the Pang and Lee (2004) dataset, but neither of their results is directly comparable: Heerschop et al. (2011) exclude documents that SPADE fails to parse, and Wachsmuth et al. (2014) evaluate only on individual sentences rather than entire documents. The only possible direct comparison is with very recent work from Hogenboom et al. (2015), who employ a weighting scheme that is similar to the approach described in Section 3. They evaluate on the Pang and Lee data, and consider only lexicon-based sentiment analysis, obtaining document-level accuracies between 65% (for the baseline) and 72% (for their best discourse-augmented system). Table 1 shows that fully supervised methods give much stronger performance on this dataset, with accuracies more than 10% higher.

Conclusion
Sentiment polarity analysis has typically relied on a "preponderance of evidence" strategy, hoping that the words or sentences representing the overall polarity will outweigh those representing counterpoints or rhetorical concessions. However, with the availability of off-the-shelf RST discourse parsers, it is now easy to include document-level structure in sentiment analysis. We show that a simple reweighting approach offers robust advantages in lexicon-based sentiment analysis, and that a recursive neural network can substantially outperform a bag-of-words classifier. Future work will focus on combining models of discourse structure with richer models at the sentence level.

Acknowledgments
Thanks to the anonymous reviewers for their helpful suggestions on how to improve the paper.